Simple linear regression assumes that the relationship between a dependent variable y and an explanatory variable x can be modelled as a straight line of the form y = a + b * x, where a and b are the regression coefficients: a is the y-intercept (the value of y where the line crosses the y-axis) and b is the slope (the ratio of vertical change to horizontal change between two points on the line).
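To make the roles of a and b concrete, here is a tiny sketch; the coefficient values a = 1 and b = 2 are arbitrary examples, not derived from any dataset:

```python
# A straight line y = a + b * x:
# a is the y-intercept, b is the slope.
def line(x, a=1.0, b=2.0):
    return a + b * x

print(line(0))  # >> 1.0 -- at x = 0, y equals the intercept a
print(line(3))  # >> 7.0 -- each unit increase in x adds b = 2.0 to y
```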
Regression is an example of supervised machine learning because it aims to model the relationship between one or more features and a continuous target variable. In the case of simple linear regression, where we model a relationship according to y = a + b * x, x is our (single) feature and y our target variable.
Let's consider the following dataset:
import numpy as np

x = np.arange(0, 11)  # 0, 1, ..., 10
y = [1, 6, 17, 34, 57, 86, 122, 160, 207, 260, 323]
We would like to predict the value of y. Let's plot the dataset to see if we can find some relation between y and x.
import matplotlib.pyplot as plt

plt.plot(x, y, 'o')
When looking at the dots, in our mind's eye we are likely already visualizing a line that follows a linear path through the dots. Let's plot again, but now we will include a line connecting the dots.
plt.plot(x, y, '-ok')
Not perfectly linear, but we could probably estimate that y is around 330 at x = 12.
When implementing a linear regression model, we are modelling towards y = a + b * x, in which we aim to estimate the regression coefficients a and b for a "best fit" of our dataset. To estimate these values we will use sklearn.linear_model.LinearRegression, which applies Ordinary Least Squares (OLS) regression. This algorithm applies a criterion that minimizes the Residual Sum of Squares (RSS): the sum of the squared differences between the observed y values and the predicted y values.
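For simple linear regression the OLS estimates also have a well-known closed form, which we can sketch directly in numpy to see what minimizing the RSS amounts to. This is a hand-rolled illustration of the formulas, not how sklearn computes the fit internally:

```python
import numpy as np

x = np.arange(0, 11)
y = np.array([1, 6, 17, 34, 57, 86, 122, 160, 207, 260, 323])

# Closed-form OLS estimates for y = a + b * x:
# b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
# a = mean(y) - b * mean(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# RSS is the quantity these estimates minimize.
rss = np.sum((y - (a + b * x)) ** 2)
print(a, b)  # >> approx. -43.955 and 31.936
```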
Now that we have an idea of what goes on under the hood in sklearn, let's go ahead and implement our linear regression model.
from sklearn.linear_model import LinearRegression

# Data preparation. Convert to numpy arrays; x must be 2-dimensional.
x = np.array(x).reshape((-1, 1))
y = np.array(y)

# Create a model and fit it.
model = LinearRegression()
model.fit(x, y)

print(model.coef_)       # >> 31.936
print(model.intercept_)  # >> -43.955
At this point we have a model available which we can use to make predictions. Note the print statements: print(model.coef_) outputs b, and print(model.intercept_) outputs a of y = a + b * x. Let's see what we get when we try to predict y for x = 12.
print(model.predict(np.array([[12]])))  # >> array([339.28181818])
We have a predicted value of roughly y = 339 for x = 12. This looks reasonable; let's plot the predicted and observed data in one graph.
y_predict = model.predict(x)

plt.plot(x, y_predict, '-b', label='y_predict')
plt.plot(x, y, '-r', label='y')
plt.legend(loc='upper right')
We can see our prediction is off at times. Let's assess the quality of our model by computing the Mean Squared Error (MSE). The MSE measures the average squared difference between the predicted and observed values. It can be computed as mse = 1/n * sum(squared_errors), where squared_errors is a list of (y_observed - y_predict)**2 entries. MSE is non-negative, and values close to zero are better (i.e. they indicate less error between the predicted and observed values). We will use sklearn.metrics.mean_squared_error, which makes it very easy to compute the MSE.
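As a quick sanity check of the formula, here is the MSE computed by hand on a tiny made-up pair of vectors; the numbers are illustrative only:

```python
import numpy as np

y_observed = np.array([3.0, 5.0, 7.5])
y_predicted = np.array([2.5, 5.5, 7.0])

# mse = 1/n * sum((y_observed - y_predicted)**2)
squared_errors = (y_observed - y_predicted) ** 2
mse = squared_errors.mean()
print(mse)  # >> 0.25 (each error is 0.5, each squared error 0.25)
```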
from sklearn.metrics import mean_squared_error

print(mean_squared_error(y, y_predict))  # >> 715.976
715.976 is nowhere close to zero, and given that our observed y values are in the 0..323 range, this MSE value indicates that this is not a particularly good fit for our data. Of course this is not completely surprising: even though we can see some linearity in the graphs, we only had 11 data points. The model is able to average through these points, but to build a well-fitting model we would need a larger dataset that shows a good amount of linearity in order to apply linear regression successfully.