Simple Linear Regression in Python

Problem: Predicting sales based on advertising spend on TV, Radio and Newspaper.

Importing Required Libraries

# Importing required libraries

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

# Model-building libraries

import statsmodels

import statsmodels.api as sm

import sklearn

from sklearn.model_selection import train_test_split

Read Dataset

adv=pd.read_csv("https://raw.githubusercontent.com/rkmishracs/dataset/main/advertising.csv")

Check Dataset

adv.head()

The first three columns (TV, Radio and Newspaper) are predictor variables, and the fourth column (Sales) is the target variable.

Checking Shape of Dataset

adv.shape

Output

(200, 4)

Dataset Info

adv.info()

info() shows the data type of each column and tells us whether any null values are present in the dataset.
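If you want the null check spelled out explicitly, a per-column count does the job:

# Count missing values per column; all zeros means no nulls
adv.isnull().sum()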

Describe Data

adv.describe()

It gives a summary of descriptive statistics (count, mean, standard deviation, quartiles) for each column.

Data Visualisation: TV Vs Sales

sns.regplot(x='TV', y='Sales', data=adv)


The x-axis is TV and the y-axis is Sales; the plot shows a clear linear trend, so the relationship is well suited to linear regression.

Data Visualisation: Radio Vs Sales

sns.regplot(x='Radio', y='Sales', data=adv)

The relationship between Radio and Sales is not as strong as TV's; the data points are much more scattered.

Data Visualisation: Newspaper vs Sales

#Newspaper Vs Sales

sns.regplot(x='Newspaper', y='Sales', data=adv)

The Newspaper-Sales relationship is the weakest of the three, as the data points are scattered even more widely along the y-axis.

Visualisation: Pairplot -> X-Axis (all predictors) and Y-Axis (target variable)

sns.pairplot(data=adv, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales')

The pairplot gives a comparative view of each predictor against the target variable:

The first subplot shows a strong positive correlation between TV and Sales; the relationship between Radio and Sales is noticeably weaker, and Newspaper shows the weakest correlation of the three.

Correlation

adv.corr()

  • The correlation between TV and Sales is 0.90, which is very high.

  • The correlation between Radio and Sales is 0.34, lower than that of TV.

  • The correlation between Newspaper and Sales is 0.15, the lowest of the three.

Visualization: Heatmap

sns.heatmap(adv.corr())

This heatmap shows only colors, not the correlation values themselves, so you can compare variables only by shade.

Visualization: Heatmap(with Values)

# heatmap with values

sns.heatmap(adv.corr(), annot=True)

Lighter cells indicate a more positive correlation between variables; darker cells indicate a weaker or more negative one.

Model Building

Steps:

  1. Create X and y

  2. Create Train and Test sets(70-30, 80-20)

  3. Training the model on training set (i.e. learn the coefficient)

  4. Evaluate the model ( Training set, test set)

1. Create X and y

X=adv['TV']

y=adv['Sales']

2. Create Train and Test sets(70-30, 80-20)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, random_state=100)
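As a quick sanity check on the 70-30 split, print the shapes of the resulting sets:

# 200 rows split 70/30 -> 140 training rows and 60 test rows
print(X_train.shape, X_test.shape)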

3. Training the model on the training set (i.e. learning the coefficients) using statsmodels

X_train_sm=sm.add_constant(X_train)

X_train_sm.head()

# Training the model

#fitting the model

lr=sm.OLS(y_train, X_train_sm)

lr_model=lr.fit()

lr_model.params

const 6.948683

TV 0.054546

dtype: float64


y = mx + c

Sales = 0.054 × TV + 6.948
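You can sanity-check the equation by plugging in a value manually (a small sketch; the TV spend of 100 is just an illustrative number):

# Predicted sales for a hypothetical TV spend of 100
c, m = lr_model.params['const'], lr_model.params['TV']
print(m * 100 + c)   # roughly 12.4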

Model Summary

lr_model.summary()

OLS or Ordinary Least Squares is a useful method for evaluating a linear regression model.

By default, statsmodels fits a line that passes through the origin. In order to have an intercept, you need to manually add a constant column using the add_constant function of statsmodels. Once you've added the constant to your X_train dataset, you can go ahead and fit a regression line using the OLS (Ordinary Least Squares) class of statsmodels.
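To see what the constant buys you, here is a sketch of the same fit without add_constant; the line is forced through the origin, so the slope has to absorb the intercept:

# Fit without a constant: OLS models Sales = m*TV through the origin
lr_no_const = sm.OLS(y_train, X_train).fit()
print(lr_no_const.params)   # a single TV coefficient, larger than 0.054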

Looking at some key statistics from the summary

The values we are concerned with are -

  1. The coefficients and significance (p-values)

  2. R-squared

  3. F statistic and its significance

1. The coefficient for TV is 0.054, with a very low p-value

The coefficient is statistically significant. So the association is not purely by chance.

2. R-squared is 0.816

Meaning that 81.6% of the variance in Sales is explained by TV

This is a decent R-squared value.

3. The F-statistic has a very low p-value (practically zero)

Meaning that the model fit is statistically significant, and the explained variance isn't purely by chance.
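These numbers can also be read programmatically from the fitted results object, instead of off the rendered summary:

# Key statistics straight from the results object
print(lr_model.rsquared)                   # ~0.816
print(lr_model.pvalues)                    # p-values for const and TV
print(lr_model.fvalue, lr_model.f_pvalue)  # F-statistic and its p-value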


The fit is significant. Let's visualize how well the model fit the data.

From the parameters that we get, our linear regression equation becomes:

Sales = 6.948 + 0.054 × TV



Scatter Plot on X_train and y_train

plt.scatter(X_train, y_train)

Plotting Model Prediction

plt.scatter(X_train, y_train)

plt.plot(X_train, 6.948+0.054*X_train,'r')

plt.show()

Residual Analysis

To validate the assumptions of the model, and hence its reliability for inference.

Distribution of the error terms

We need to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression). Let us plot a histogram of the error terms and see what it looks like.

# error=f(y_train, y_train_pred)

y_train_pred=lr_model.predict(X_train_sm)


y_train_pred

Calculating Residual

residual=y_train-y_train_pred

residual

Plot the Residuals Histogram

plt.figure()

# Note: the figure-level sns.displot would open its own figure and ignore plt.title;
# the axes-level histplot works with the current pyplot figure
sns.histplot(residual)

plt.title("Residual Plot")

plt.show()

The residuals are approximately normally distributed with mean 0. All good!
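A histogram is a fairly coarse check; for a sharper one, a normal Q-Q plot of the same residuals works well (points hugging the line indicate normality):

# Q-Q plot of the residuals against a normal distribution ('s' fits a standardized line)
sm.qqplot(residual, line='s')
plt.show()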

Plot Residual scatter plot

#Plotting residuals

plt.scatter(X_train, residual)

plt.show()


We can see that the residuals are spread evenly around zero with no obvious pattern, which is a good sign for the model.

Prediction and Evaluation of Model on Test Dataset

Now that we have fitted a regression line on the train dataset, it's time to make some predictions on the test data. For this, you first need to add a constant to the X_test data, as you did for X_train, and then you can simply predict the y values corresponding to X_test using the predict method of the fitted model.

# Add a constant to X_test

X_test_sm = sm.add_constant(X_test)


# Predict the y values corresponding to X_test_sm

# Note: predict is called on the fitted results object (lr_model), not the unfitted OLS object
y_pred = lr_model.predict(X_test_sm)

from sklearn.metrics import mean_squared_error

from sklearn.metrics import r2_score

#RMSE

np.sqrt(mean_squared_error(y_test, y_pred))

Output

2.019296008966232
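RMSE penalizes large errors more heavily; mean absolute error is a complementary metric that is often easier to interpret (a quick sketch):

from sklearn.metrics import mean_absolute_error

# Average absolute prediction error, in the same units as Sales
print(mean_absolute_error(y_test, y_pred))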

R-squared on the test set

r_squared = r2_score(y_test, y_pred)

r_squared

Output

0.792103160124566
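It is worth putting this next to the training R-squared (0.816 from the summary); the small train-test gap suggests the model generalizes reasonably well:

# Compare train and test R-squared; a large gap would hint at overfitting
print(r2_score(y_train, y_train_pred))   # ~0.816
print(r2_score(y_test, y_pred))          # ~0.792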

Visualizing the fit on the test set

plt.scatter(X_test, y_test)

plt.plot(X_test, 6.948 + 0.054 * X_test, 'r')

plt.show()

Linear Regression using linear_model in sklearn

Apart from statsmodels, there is another package, sklearn, that can be used to perform linear regression. We will use the linear_model module from sklearn to build the model. Since we have already performed a train-test split, we don't need to do it again.

There's one small step that we need to add, though. When there's only a single feature, sklearn expects a 2D array, so we have to reshape the feature into a single column for the linear regression fit to be performed successfully.


from sklearn.model_selection import train_test_split

X_train_lm, X_test_lm, y_train_lm, y_test_lm = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)

X_train_lm.shape

Output

(140,)

X_train_lm = X_train_lm.values.reshape(-1,1)

X_test_lm = X_test_lm.values.reshape(-1,1)

print(X_train_lm.shape)

print(y_train_lm.shape)

print(X_test_lm.shape)

print(y_test_lm.shape)

Output

(140, 1)

(140,)

(60, 1)

(60,)

from sklearn.linear_model import LinearRegression


# Creating a LinearRegression object, lm

lm = LinearRegression()


# Fit the model using lm.fit()

lm.fit(X_train_lm, y_train_lm)

print(lm.intercept_)

print(lm.coef_)

Output

6.948683200001357

[0.05454575]

The equation we get is the same as what we got before!

Sales = 6.948 + 0.054 × TV

The sklearn linear model is useful as it is compatible with a lot of sklearn utilities (cross-validation, grid search, etc.).
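For instance, 5-fold cross-validation of the same model is a one-liner with cross_val_score (a quick sketch; the individual fold scores will vary with the split):

from sklearn.model_selection import cross_val_score

# R-squared on each of 5 folds of the training data
scores = cross_val_score(LinearRegression(), X_train_lm, y_train_lm, cv=5, scoring='r2')
print(scores.mean())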


Scaling Methods

  1. Min-Max Scaling

  2. Standard Scaling
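Standard Scaling is demonstrated in full below; Min-Max Scaling works the same way through sklearn's MinMaxScaler, which rescales each feature into the [0, 1] range. A minimal sketch on the TV column:

from sklearn.preprocessing import MinMaxScaler

# Squeeze the TV column into [0, 1]
X_minmax = MinMaxScaler().fit_transform(adv[['TV']])
print(X_minmax.min(), X_minmax.max())   # 0.0 1.0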

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)

scikit-learn provides these scaling utilities:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# One aspect that you need to take care of is that 'fit_transform' can be performed on 2D arrays only,

# so you need to reshape your 'X_train' and 'y_train' data in order to perform the standardisation.

X_train_scaled = X_train.values.reshape(-1,1)

y_train_scaled = y_train.values.reshape(-1,1)

X_train_scaled.shape

Output

(140, 1)


# Create scaler objects using StandardScaler()
# (one scaler per variable: refitting a single scaler on y would overwrite the statistics
# learned from X, which are needed to transform the test set later)
scaler_X = StandardScaler()
scaler_y = StandardScaler()

# 'Fit' and transform the train set; transform the test set later using the same fit
X_train_scaled = scaler_X.fit_transform(X_train_scaled)
y_train_scaled = scaler_y.fit_transform(y_train_scaled)


print("mean and sd for X_train_scaled:", np.mean(X_train_scaled), np.std(X_train_scaled))

print("mean and sd for y_train_scaled:", np.mean(y_train_scaled), np.std(y_train_scaled))

Output

mean and sd for X_train_scaled: 2.5376526277146434e-17 0.9999999999999999

mean and sd for y_train_scaled: -2.5376526277146434e-16 1.0


# Let's fit the regression line following exactly the same steps as done before

X_train_scaled = sm.add_constant(X_train_scaled)


lr_scaled = sm.OLS(y_train_scaled, X_train_scaled).fit()

# Check the parameters

lr_scaled.params

Output

array([-2.44596010e-16, 9.03212773e-01])


As you might notice, the values of the parameters have changed, since we have changed the scale.
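The standardized slope can be mapped back to the original scale by multiplying by sd(y)/sd(x). A quick check (this assumes the separate scaler_X and scaler_y objects fitted above, whose scale_ attributes store the training standard deviations):

# beta_original = beta_standardized * sd(y) / sd(x)
beta_std = lr_scaled.params[1]
print(beta_std * scaler_y.scale_[0] / scaler_X.scale_[0])   # ~0.0545, matching the unscaled fit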

Let's look at the statistics of the model, to see if any other aspect of the model has changed.