# Multiple Linear Regression Case Study: Predicting House Prices in California

#### Multiple Linear Regression Case Study: Predicting House Prices in California

**Case Study Question**

**Objective**: Build a multiple linear regression model to predict the median house value in California based on various factors such as median income, average number of rooms, average number of occupants, latitude, and longitude.

**Dataset**: We will use the California Housing Dataset, which includes the following features:

**MedInc**: Median income in block group**AveRooms**: Average number of rooms per household**AveOccup**: Average number of household members**Latitude**: Latitude of the block**Longitude**: Longitude of the block

**Steps**:

Load and explore the dataset.

Define the features and the target variable.

Split the dataset into training and testing sets.

Create and train the multiple linear regression model.

Make predictions using the test set.

Evaluate the model's performance.

Visualize the results.

#### Step-by-Step Solution

**Step 1: Import Libraries**

First, we need to import the necessary libraries for data manipulation, modeling, and visualization.

**Step 2: Load and Explore the Dataset**

Load the California Housing Dataset and take a quick look at the first few rows.

**Explanation**:

We use

`fetch_california_housing()`

to load the dataset.Convert the dataset to a Pandas DataFrame for easier manipulation.

Add a new column

`MedHouseVal`

to represent the target variable (median house value).Display the first few rows of the dataset to get an overview of the data.

**Step 3: Define Features and Target**

Select the relevant features (independent variables) and define the target (dependent variable).

**Explanation**:

`X`

contains the features used to predict the house prices:`MedInc`

(median income),`AveRooms`

(average number of rooms),`AveOccup`

(average number of occupants per household),`Latitude`

, and`Longitude`

.`Y`

is the target variable, which is the median house value (`MedHouseVal`

).

**Step 4: Split the Dataset**

Split the dataset into training and testing sets to evaluate the model's performance on unseen data.

**Explanation**:

`train_test_split`

splits the data into training and testing sets.`test_size=0.2`

indicates that 20% of the data will be used for testing, and 80% for training.`random_state=42`

ensures reproducibility by fixing the random seed.

**Step 5: Create and Train the Model**

Create a linear regression model and train it using the training set.

**Explanation**:

`LinearRegression()`

creates an instance of the linear regression model.`model.fit(X_train, Y_train)`

trains the model using the training data.

**Step 6: Make Predictions**

Use the trained model to make predictions on the test set.

**Explanation**:

`model.predict(X_test)`

uses the trained model to predict house prices for the test set.

**Step 7: Evaluate the Model**

Evaluate the model's performance using Mean Squared Error (MSE) and R-squared (( R^2 )).

**Explanation**:

`mean_squared_error(Y_test, Y_pred)`

calculates the MSE, which measures the average of the squared differences between actual and predicted values.`r2_score(Y_test, Y_pred)`

calculates the ( R^2 ) score, which indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.`print(f'MSE: {mse}')`

and`print(f'R2 Score: {r2}')`

display the MSE and ( R^2 ) score.

**Step 8: Plot the Results**

Visualize the actual vs. predicted values for one of the features to assess the model's performance.

**Explanation**:

`plt.scatter(X_test['MedInc'], Y_test, color='blue', label='Actual')`

plots the actual house prices against the median income.`plt.scatter(X_test['MedInc'], Y_pred, color='red', label='Predicted')`

plots the predicted house prices against the median income.`plt.xlabel('Median Income')`

and`plt.ylabel('Median House Value')`

label the axes.`plt.title('Multiple Linear Regression')`

adds a title to the plot.`plt.legend()`

adds a legend to differentiate between actual and predicted values.`plt.show()`

displays the plot.

#### Complete Case Study Code

Here is the complete code for the multiple linear regression case study using the California Housing Dataset:

#### Summary

In this detailed case study, we walked through the process of implementing multiple linear regression using the California Housing Dataset. Here are the key steps we covered:

**Importing Libraries**: We imported necessary libraries for data manipulation, modeling, and visualization.**Loading and Exploring the Dataset**: We loaded the California Housing Dataset and took a quick look at the data.**Defining Features and Target**: We selected relevant features and defined the target variable.**Splitting the Dataset**: We split the data into training and testing sets.**Creating and Training the Model**: We created a linear regression model and trained it using the training data.**Making Predictions**: We used the trained model to make predictions on the test set.**Evaluating the Model**: We evaluated the model's performance using MSE and ( R^2 ) score.**Plotting the Results**: We visualized the actual vs. predicted values for one of the features.

By following these steps, we built a multiple linear regression model that can predict house prices in California based on various features. This comprehensive approach helps in understanding the relationship between different factors and house prices, providing valuable insights for various applications such as real estate and economics.

For more detailed explanations and examples, visit codeswithpankaj.com.

Last updated