Simple Linear Regression in Python
Problem: predicting Sales based on advertising spend on TV, Radio and Newspaper.
Importing Required Libraries
Read Dataset :
The first three columns (TV, Radio and Newspaper) are the predictor variables, and the fourth column (Sales) is the target variable.
Checking Shape of Dataset
Dataset detail info
info() reports each column's data type and the number of non-null values, so it tells us whether any nulls are present in the dataset. describe() gives a summary of the basic statistics (count, mean, standard deviation, quartiles) for the numeric columns.
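The loading and inspection steps can be sketched as follows. In the notebook the data would come from a CSV (the filename is an assumption); here a synthetic stand-in frame with the same four columns keeps the example self-contained:

```python
import numpy as np
import pandas as pd

# In the notebook: df = pd.read_csv("advertising.csv")  # filename is an assumption
# Synthetic stand-in with the same columns, for illustration only:
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "Radio": rng.uniform(0, 50, n),
    "Newspaper": rng.uniform(0, 100, n),
})
df["Sales"] = 7 + 0.05 * df["TV"] + 0.1 * df["Radio"] + rng.normal(0, 1.5, n)

print(df.shape)       # (rows, columns)
df.info()             # dtypes and non-null counts per column
print(df.describe())  # count, mean, std, min, quartiles, max
```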
Data Visualisation: TV Vs Sales
The X axis is TV and the Y axis is Sales. The plot shows a clear, roughly linear relationship, so the data is well suited to linear regression.
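A scatter plot like the one described can be sketched as below, using synthetic stand-in data in place of the advertising CSV:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Synthetic stand-in for the advertising data (illustration only)
rng = np.random.default_rng(42)
df = pd.DataFrame({"TV": rng.uniform(0, 300, 200)})
df["Sales"] = 7 + 0.05 * df["TV"] + rng.normal(0, 1.5, 200)

fig, ax = plt.subplots()
ax.scatter(df["TV"], df["Sales"], alpha=0.6)
ax.set_xlabel("TV")
ax.set_ylabel("Sales")
ax.set_title("TV vs Sales")
fig.savefig("tv_vs_sales.png")
```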
Data Visualisation: Radio Vs Sales
The relationship between Radio and Sales is weaker than that between TV and Sales: the data points are much more scattered.
Data Visualisation: Newspaper vs Sales
The relationship between Newspaper and Sales is weaker still, with the data points scattered widely along the y-axis.
Visualisation: Pairplot -> X-Axis (all predictors) and Y-Axis (target variable)
The pairplot gives a side-by-side view of each predictor against the target variable. The first subplot shows a strong positive correlation between TV and Sales; the relationship between Radio and Sales is noticeably weaker, and Newspaper shows the least evidence of any correlation.
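One common way to produce such a pairplot is seaborn's `pairplot` with `x_vars`/`y_vars`, sketched here on a synthetic stand-in frame:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import seaborn as sns

# Synthetic stand-in for the advertising data (illustration only)
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "Radio": rng.uniform(0, 50, n),
    "Newspaper": rng.uniform(0, 100, n),
})
df["Sales"] = 7 + 0.05 * df["TV"] + 0.1 * df["Radio"] + rng.normal(0, 1.5, n)

# One row of subplots: each predictor on the x-axis, Sales on the y-axis
grid = sns.pairplot(df, x_vars=["TV", "Radio", "Newspaper"], y_vars="Sales",
                    height=4, kind="scatter")
grid.savefig("pairplot.png")
```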
The correlation between TV and Sales is 0.9, which is very high.
The correlation between Radio and Sales is 0.34, lower than that between TV and Sales.
The correlation between Newspaper and Sales is 0.15, the lowest of the three.
In this first heatmap the correlation values are not annotated, so you can only judge the relationships from the colours between variables.
Visualization: Heatmap(with Values)
Lighter cells indicate a more positive correlation between the variables; darker cells indicate a more negative correlation.
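A correlation heatmap along these lines can be drawn with seaborn; `annot=True` adds the numeric values to each cell (omit it to get the colours-only version). Synthetic stand-in data is used here for a self-contained example:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the advertising data (illustration only)
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "Radio": rng.uniform(0, 50, n),
    "Newspaper": rng.uniform(0, 100, n),
})
df["Sales"] = 7 + 0.05 * df["TV"] + 0.1 * df["Radio"] + rng.normal(0, 1.5, n)

corr = df.corr()
fig, ax = plt.subplots()
# annot=True writes the correlation value into each cell
sns.heatmap(corr, annot=True, fmt=".2f", cmap="viridis", ax=ax)
fig.savefig("heatmap.png")
```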
1. Create X and y
2. Create train and test sets (e.g. a 70-30 or 80-20 split)
3. Train the model on the training set (i.e. learn the coefficients) using statsmodels
4. Evaluate the model (training set, test set)
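Steps 1 and 2 can be sketched as follows, using a 70-30 split with a fixed `random_state` for reproducibility (the synthetic frame again stands in for the advertising data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the advertising data (illustration only)
rng = np.random.default_rng(42)
df = pd.DataFrame({"TV": rng.uniform(0, 300, 200)})
df["Sales"] = 7 + 0.05 * df["TV"] + rng.normal(0, 1.5, 200)

# Simple linear regression: a single predictor (TV) and the target (Sales)
X = df["TV"]
y = df["Sales"]

# 70-30 train-test split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=100)
print(X_train.shape, X_test.shape)
```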
Training the model

Sales = 0.054 * TV + 6.94
OLS, or Ordinary Least Squares, is the standard method for fitting a linear regression model.
By default, the statsmodels library fits a line that passes through the origin. To have an intercept, you need to manually add a constant column using statsmodels' add_constant function. Once you have added the constant to your X_train dataset, you can fit a regression line using statsmodels' OLS (Ordinary Least Squares) class.
Looking at some key statistics from the summary
The values we are concerned with are -
The coefficients and significance (p-values)
F statistic and its significance
1. The coefficient for TV is 0.054, with a very low p-value.
The coefficient is statistically significant, so the association is not purely by chance.
2. R-squared is 0.816.
This means that 81.6% of the variance in Sales is explained by TV, which is a decent R-squared value.
3. The F-statistic has a very low p-value (practically zero).
This means the overall model fit is statistically significant, and the explained variance is not purely by chance.
The fit is significant. Let's visualize how well the model fit the data.
From the parameters that we get, our linear regression equation becomes: Sales = 0.054 * TV + 6.94
Scatter Plot on X_train and y_train
Plotting Model Prediction
To validate the assumptions of the model, and hence its reliability for inference
Distribution of the error terms
We need to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression). Let us plot a histogram of the error terms and see what it looks like.
Plot the Residuals Histogram
The residuals follow a normal distribution centred at 0. All good!
Plot Residual scatter plot
We can see that the residuals are evenly spread around zero with no visible pattern, which is a good sign for the model.
Prediction and Evaluation of Model on Test Dataset
Now that we have fitted a regression line on the train dataset, it's time to make some predictions on the test data. For this, you first need to add a constant to X_test, just as you did for X_train, and then predict the y values corresponding to X_test using the fitted model's predict method.
R-squared on the test set
Visualizing the fit on the test set
Linear Regression using linear_model in sklearn
Apart from statsmodels, another package, sklearn, can be used to perform linear regression. We will use the linear_model module from sklearn to build the model. Since we have already performed a train-test split, we don't need to do it again.
There's one small step that we need to add, though. sklearn expects X to be a two-dimensional array, so when there's only a single feature it must be reshaped into a column of shape (n_samples, 1) for the linear regression fit to be performed successfully.
The equation we get is the same as what we got before!
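The sklearn version can be sketched like this (synthetic stand-in data; note that, unlike statsmodels, LinearRegression fits the intercept by default, so no constant column is needed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the training split (illustration only)
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 300, 140)
y_train = 7 + 0.05 * X_train + rng.normal(0, 1.5, 140)

# sklearn expects X as a 2-D array of shape (n_samples, n_features),
# so a single feature must be reshaped into a column
X_train_2d = X_train.reshape(-1, 1)

lm = LinearRegression()  # fit_intercept=True by default
lm.fit(X_train_2d, y_train)

print("intercept:", lm.intercept_)
print("slope:", lm.coef_[0])
```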
After standardizing the training data, the scaled variables have mean ≈ 0 and standard deviation ≈ 1:

mean and sd for X_train_scaled: 2.5376526277146434e-17 0.9999999999999999
mean and sd for y_train_scaled: -2.5376526277146434e-16 1.0
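The printed means and standard deviations above are consistent with standardized data. A sketch of how such scaling is typically done, assuming sklearn's StandardScaler was used (the notebook does not say how it scaled):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (illustration only)
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 300, 140).reshape(-1, 1)
y_train = (7 + 0.05 * X_train[:, 0] + rng.normal(0, 1.5, 140)).reshape(-1, 1)

# StandardScaler subtracts the mean and divides by the standard deviation,
# giving mean ~0 and sd ~1 (up to floating-point error)
X_train_scaled = StandardScaler().fit_transform(X_train)
y_train_scaled = StandardScaler().fit_transform(y_train)

print("mean and sd for X_train_scaled:",
      X_train_scaled.mean(), X_train_scaled.std())
print("mean and sd for y_train_scaled:",
      y_train_scaled.mean(), y_train_scaled.std())
```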