# Guide with the Boston Housing Dataset

#### Linear Regression: A Step-by-Step Guide with the Boston Housing Dataset

Linear regression is a powerful tool used to model the relationship between one or more independent variables and a dependent variable. In this guide, we will use the Boston Housing Dataset to illustrate how to perform linear regression in Python.

**Step 1: Load and Explore the Dataset**

We begin by loading the Boston Housing Dataset, which contains information about various factors affecting house prices in Boston.

**Explanation**:

We use

`load_boston()`

to load the dataset.The dataset is converted to a Pandas DataFrame for easier manipulation.

The column names are set to the feature names provided by the dataset.

We add a new column called

`PRICE`

which contains the target variable (house prices).`print(data.head())`

displays the first few rows of the dataset.

**Step 2: Define the Features and Target**

Next, we define which columns we will use as features (independent variables) and which column is the target (dependent variable).

**Explanation**:

`X`

contains the features we will use to predict the house prices. Here, we use`RM`

(average number of rooms per dwelling),`LSTAT`

(percentage of lower status of the population), and`PTRATIO`

(pupil-teacher ratio by town).`Y`

is the target variable, which is the house prices (`PRICE`

).

**Step 3: Split the Dataset**

We split the dataset into training and testing sets to evaluate our model's performance on unseen data.

**Explanation**:

`train_test_split`

splits the data into training and testing sets.`test_size=0.2`

indicates that 20% of the data will be used for testing, and 80% for training.`random_state=42`

ensures reproducibility by fixing the random seed.

**Step 4: Create and Train the Model**

We create a linear regression model and train it using the training set.

**Explanation**:

`LinearRegression()`

creates an instance of the linear regression model.`model.fit(X_train, Y_train)`

trains the model using the training data.

**Step 5: Make Predictions**

We use the trained model to make predictions on the test set.

**Explanation**:

`model.predict(X_test)`

uses the trained model to predict house prices for the test set.

**Step 6: Evaluate the Model**

We evaluate the model's performance using Mean Squared Error (MSE) and R-squared (( R^2 )).

**Explanation**:

`mean_squared_error(Y_test, Y_pred)`

calculates the MSE, which measures the average of the squared differences between actual and predicted values.`r2_score(Y_test, Y_pred)`

calculates the ( R^2 ) score, which indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.`print(f'MSE: {mse}')`

and`print(f'R2 Score: {r2}')`

display the MSE and ( R^2 ) score.

**Step 7: Plot the Results**

We plot the actual vs. predicted values for one of the features to visualize the model's performance.

**Explanation**:

`plt.scatter(X_test['RM'], Y_test, color='blue', label='Actual')`

plots the actual house prices against the average number of rooms per dwelling.`plt.scatter(X_test['RM'], Y_pred, color='red', label='Predicted')`

plots the predicted house prices against the average number of rooms per dwelling.`plt.xlabel('Average number of rooms per dwelling (RM)')`

and`plt.ylabel('House Price')`

label the axes.`plt.title('Linear Regression')`

adds a title to the plot.`plt.legend()`

adds a legend to differentiate between actual and predicted values.`plt.show()`

displays the plot.

#### Complete Code

Here is the complete code for the linear regression example using the Boston Housing Dataset:

#### Key Concepts

**Linear Regression**: A method to model the relationship between dependent and independent variables using a linear equation.**Features**: Independent variables used to predict the target variable.**Target**: Dependent variable we are trying to predict.**Training Set**: Subset of the data used to train the model.**Testing Set**: Subset of the data used to evaluate the model.**Mean Squared Error (MSE)**: Measures the average squared difference between actual and predicted values.**R-squared (( R^2 ))**: Indicates the proportion of the variance in the dependent variable predictable from the independent variables.

This step-by-step guide helps you understand how to perform linear regression using a real-world dataset. For more detailed explanations and examples, visit codeswithpankaj.com.

Last updated