In this post we demonstrate how to fit a machine learning (ML) linear model with the scikit-learn library. We start with a conceptual definition of the model, then walk through a straightforward implementation using scikit-learn, and finally draw some conclusions from the exercise.
The model of choice for this exercise is ElasticNet, applied to a regression problem (predicting a numerical value). It belongs to the category of supervised learning: we provide training examples as pairs of input values and ground-truth targets. As for the data, scikit-learn offers several built-in data sets that are a good starting point for experimenting with the supported models; we chose the diabetes data set. Note that real-world applications require extensive Exploratory Data Analysis (EDA), and in most cases the results depend on how well we pre-process the data. Here we focus mostly on the model implementation.
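As a minimal sketch of such an inspection (not a substitute for a full EDA), we can take a quick look at the diabetes data set before modeling:

from sklearn.datasets import load_diabetes

data = load_diabetes()
print(data.feature_names)   # ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
print(data.data.shape)      # (442, 10): 442 observations, 10 features
print(data.target[:5])      # first few target values (a measure of disease progression)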
As previously stated, the model of interest is ElasticNet. It combines the ideas behind the Ridge and Lasso models: each adds a regularization penalty to the least-squares objective, an L2 penalty in the case of Ridge regression and an L1 penalty in the case of Lasso regression. ElasticNet aims to leverage the benefits of both.
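Concretely, the scikit-learn documentation states the objective that ElasticNet minimizes as follows, where n is the number of training samples, w the coefficient vector, alpha the overall penalty strength, and \rho the l1_ratio mixing parameter:

\min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2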
First we import the necessary libraries:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error as MSE
import numpy as np
Then we load the data set. It is common to use the variable X for the data (training, validation or test) and y for the ground truth or target. We use the train_test_split function from scikit-learn to split the data set into training and test subsets; here we take 80% of the data for training and 20% for testing. The format expected by scikit-learn models is a two-dimensional array in which each row represents an observation and each column a feature such as age, sex, body mass index, etc.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
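As a quick sanity check on the split, we can inspect the shapes of the resulting arrays:

# The diabetes set has 442 rows, so the 80/20 split above should yield:
print(X_train.shape, X_test.shape)   # (353, 10) (89, 10)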
The next step is to train the model, usually referred to as fitting the model to the data. It is performed as follows:
net = ElasticNet(alpha=0.001)
net.fit(X_train, y_train)
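After fitting, the learned parameters can be read off the estimator; for instance:

print(net.coef_)        # one coefficient per feature (10 for the diabetes set)
print(net.intercept_)   # the fitted intercept term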
One of the most important parameters is alpha, which controls the overall strength of the regularization penalty: a value of alpha = 0 reduces ElasticNet to ordinary least squares, while larger values shrink the coefficients more strongly. The mix between the Lasso and Ridge penalties is controlled by a separate parameter, l1_ratio: a value of l1_ratio = 1 makes the model equivalent to Lasso and l1_ratio = 0 to Ridge. This means that with the small alpha specified in our example (close to zero) the fit is very similar to an unregularized linear regression. There are other parameters for fine-tuning, each with its own contribution to the model's results. The ElasticNet default parameters are defined as follows (a short sketch of the l1_ratio mixing follows the parameter list):
__init__(
    self,
    alpha=1.0,
    l1_ratio=0.5,
    fit_intercept=True,
    normalize=False,
    precompute=False,
    max_iter=1000,
    copy_X=True,
    tol=0.0001,
    warm_start=False,
    positive=False,
    random_state=None,
    selection='cyclic'
)
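To make the mixing concrete, here is a sketch of how l1_ratio steers the penalty; the alpha value is simply the one from our example, and the variable names are illustrative:

lasso_like = ElasticNet(alpha=0.001, l1_ratio=1.0)   # pure L1 penalty, equivalent to Lasso
ridge_like = ElasticNet(alpha=0.001, l1_ratio=0.0)   # pure L2 penalty; scikit-learn recommends
                                                     # the dedicated Ridge estimator for this case
balanced = ElasticNet(alpha=0.001, l1_ratio=0.5)     # the default: an equal mix of both penalties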
Finally, we generate predictions with the model and evaluate its performance.
y_pred = net.predict(X_test)
rmse = np.sqrt(MSE(y_test, y_pred))
Here, we make predictions for the test data set and use the Root Mean Squared Error (RMSE) as the evaluation metric. This metric tells us how far the points are spread, in the vertical direction, around the fitted line. The lower the RMSE, the better the model. In our example we get an approximate score of 55.585.
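For reference, over n test points with true values y_i and predictions \hat{y}_i, the metric is defined as:

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}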
In this post we have walked through fitting a model with scikit-learn. The same steps could be used to fit other models such as LinearRegression (OLS), Lasso, LassoLars, LassoLarsIC, BayesianRidge or SGDRegressor, among others. More elaborate strategies are also possible, such as pipelines, model selection, and parameter tuning via cross-validation. This exercise gives an idea of the straightforward interface that scikit-learn provides for the most common models in the literature.
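As one example of such a strategy, scikit-learn provides ElasticNetCV, which selects alpha and l1_ratio by cross-validation; a minimal sketch, where the candidate grids below are illustrative choices rather than recommendations:

from sklearn.linear_model import ElasticNetCV

# Search over a small grid of penalty mixes and strengths using 5-fold CV:
cv_net = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], alphas=[0.0001, 0.001, 0.01], cv=5)
cv_net.fit(X_train, y_train)
print(cv_net.alpha_, cv_net.l1_ratio_)   # hyperparameters selected by cross-validation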