Multiple Linear Regression Case Study: Predicting House Prices in California
Multiple Linear Regression Case Study: Predicting House Prices in California
Case Study Question
Objective: Build a multiple linear regression model to predict the median house value in California based on various factors such as median income, average number of rooms, average number of occupants, latitude, and longitude.
Dataset: We will use the California Housing Dataset, which includes the following features:
MedInc: Median income in block group
AveRooms: Average number of rooms per household
AveOccup: Average number of household members
Latitude: Latitude of the block
Longitude: Longitude of the block
Steps:
Load and explore the dataset.
Define the features and the target variable.
Split the dataset into training and testing sets.
Create and train the multiple linear regression model.
Make predictions using the test set.
Evaluate the model's performance.
Visualize the results.
Step-by-Step Solution
Step 1: Import Libraries
First, we need to import the necessary libraries for data manipulation, modeling, and visualization.
Step 2: Load and Explore the Dataset
Load the California Housing Dataset and take a quick look at the first few rows.
Explanation:
We use
fetch_california_housing()
to load the dataset.Convert the dataset to a Pandas DataFrame for easier manipulation.
Add a new column
MedHouseVal
to represent the target variable (median house value).Display the first few rows of the dataset to get an overview of the data.
Step 3: Define Features and Target
Select the relevant features (independent variables) and define the target (dependent variable).
Explanation:
X
contains the features used to predict the house prices:MedInc
(median income),AveRooms
(average number of rooms),AveOccup
(average number of occupants per household),Latitude
, andLongitude
.Y
is the target variable, which is the median house value (MedHouseVal
).
Step 4: Split the Dataset
Split the dataset into training and testing sets to evaluate the model's performance on unseen data.
Explanation:
train_test_split
splits the data into training and testing sets.test_size=0.2
indicates that 20% of the data will be used for testing, and 80% for training.random_state=42
ensures reproducibility by fixing the random seed.
Step 5: Create and Train the Model
Create a linear regression model and train it using the training set.
Explanation:
LinearRegression()
creates an instance of the linear regression model.model.fit(X_train, Y_train)
trains the model using the training data.
Step 6: Make Predictions
Use the trained model to make predictions on the test set.
Explanation:
model.predict(X_test)
uses the trained model to predict house prices for the test set.
Step 7: Evaluate the Model
Evaluate the model's performance using Mean Squared Error (MSE) and R-squared (( R^2 )).
Explanation:
mean_squared_error(Y_test, Y_pred)
calculates the MSE, which measures the average of the squared differences between actual and predicted values.r2_score(Y_test, Y_pred)
calculates the ( R^2 ) score, which indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.print(f'MSE: {mse}')
andprint(f'R2 Score: {r2}')
display the MSE and ( R^2 ) score.
Step 8: Plot the Results
Visualize the actual vs. predicted values for one of the features to assess the model's performance.
Explanation:
plt.scatter(X_test['MedInc'], Y_test, color='blue', label='Actual')
plots the actual house prices against the median income.plt.scatter(X_test['MedInc'], Y_pred, color='red', label='Predicted')
plots the predicted house prices against the median income.plt.xlabel('Median Income')
andplt.ylabel('Median House Value')
label the axes.plt.title('Multiple Linear Regression')
adds a title to the plot.plt.legend()
adds a legend to differentiate between actual and predicted values.plt.show()
displays the plot.
Complete Case Study Code
Here is the complete code for the multiple linear regression case study using the California Housing Dataset:
Summary
In this detailed case study, we walked through the process of implementing multiple linear regression using the California Housing Dataset. Here are the key steps we covered:
Importing Libraries: We imported necessary libraries for data manipulation, modeling, and visualization.
Loading and Exploring the Dataset: We loaded the California Housing Dataset and took a quick look at the data.
Defining Features and Target: We selected relevant features and defined the target variable.
Splitting the Dataset: We split the data into training and testing sets.
Creating and Training the Model: We created a linear regression model and trained it using the training data.
Making Predictions: We used the trained model to make predictions on the test set.
Evaluating the Model: We evaluated the model's performance using MSE and ( R^2 ) score.
Plotting the Results: We visualized the actual vs. predicted values for one of the features.
By following these steps, we built a multiple linear regression model that can predict house prices in California based on various features. This comprehensive approach helps in understanding the relationship between different factors and house prices, providing valuable insights for various applications such as real estate and economics.
For more detailed explanations and examples, visit codeswithpankaj.com.
Last updated