Purpose and Scope: A regression model will be built with PySpark, using the Spark ML library, to predict house prices on the dataset described below.
Dataset: California Housing Prices https://www.kaggle.com/datasets/camnugent/california-housing-prices
Environment: The project was carried out in a Kaggle notebook environment using Python, built on the pyspark library.
%pip install pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import pandas as pd
import numpy as np
# Start the Spark session
spark = SparkSession.builder \
    .appName("California Housing Prices Regression") \
    .getOrCreate()
# Load the dataset
file_path = "/kaggle/input/california-housing-prices/housing.csv"
housing_data = spark.read.csv(file_path, header=True, inferSchema=True)
SparkSession - in-memory
SparkContext
Version: v3.5.1
Master: local[*]
AppName: California Housing Prices Regression
housing_data.show(5)
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
|  -122.25|   37.85|              52.0|     1274.0|         235.0|     558.0|     219.0|       5.6431|          341300.0|       NEAR BAY|
|  -122.25|   37.85|              52.0|     1627.0|         280.0|     565.0|     259.0|       3.8462|          342200.0|       NEAR BAY|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
only showing top 5 rows