In this post we demonstrate how to fit a machine learning (ML) linear model with the scikit-learn library. We start with a conceptual definition of the model, then walk through a straightforward implementation using scikit-learn, and finally draw some conclusions from the exercise.
The model of choice for this exercise is ElasticNet, applied to a regression problem (predicting a numerical value). It belongs to the category of supervised learning: we provide training examples as pairs of input values and ground-truth targets. As for the data, scikit-learn offers several built-in data sets that are a good starting point for experimenting with the supported models; we chose the diabetes data set. Note that real-world applications require extensive Exploratory Data Analysis (EDA), and in most cases the results depend on how well we pre-process the data. Here we focus mostly on the model implementation.
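As a minimal sketch of such an inspection (not a substitute for a full EDA), we can take a quick look at the diabetes data set before modeling:

from sklearn.datasets import load_diabetes

data = load_diabetes()
print(data.feature_names)   # ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
print(data.data.shape)      # (442, 10): 442 observations, 10 features
print(data.target[:5])      # first few target values (a measure of disease progression)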
As previously stated, the model of interest is ElasticNet. It combines the ideas behind the Ridge and Lasso models: each adds a regularization penalty to the least-squares objective, an L2 penalty in the case of Ridge regression and an L1 penalty in the case of Lasso regression. ElasticNet aims to leverage the benefits of both.
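Concretely, the scikit-learn documentation states the objective that ElasticNet minimizes as follows, where n is the number of training samples, w the coefficient vector, alpha the overall penalty strength, and \rho the l1_ratio mixing parameter:

\min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2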
First we import the necessary libraries:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error as MSE
import numpy as np
Then we load the data set. It is common to use the variable X for the data (training, validation or test) and y for the ground truth or target. We use the train_test_split function from scikit-learn to split the data set into training and test subsets; here we take 80% of the data for training and 20% for testing. The format expected by scikit-learn models is a two-dimensional array in which each row represents an observation and each column a feature such as age, sex, body mass index, etc.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
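As a quick sanity check on the split, we can inspect the shapes of the resulting arrays:

# The diabetes set has 442 rows, so the 80/20 split above should yield:
print(X_train.shape, X_test.shape)   # (353, 10) (89, 10)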
The next step is to train the model, usually referred to as fitting the model to the data. It is performed as follows:
net = ElasticNet(alpha=0.001)
net.fit(X_train, y_train)
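After fitting, the learned parameters can be read off the estimator; for instance:

print(net.coef_)        # one coefficient per feature (10 for the diabetes set)
print(net.intercept_)   # the fitted intercept term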
One of the most important parameters is alpha, which controls the overall strength of the regularization penalty: a value of alpha = 0 reduces ElasticNet to ordinary least squares, while larger values shrink the coefficients more strongly. The mix between the Lasso and Ridge penalties is controlled by a separate parameter, l1_ratio: a value of l1_ratio = 1 makes the model equivalent to Lasso and l1_ratio = 0 to Ridge. This means that with the small alpha specified in our example (close to zero) the fit is very similar to an unregularized linear regression. There are other parameters for fine-tuning, each with its own contribution to the model's results. The ElasticNet default parameters are defined as follows (a short sketch of the l1_ratio mixing follows the parameter list):
__init__(
    self,
    alpha=1.0,
    l1_ratio=0.5,
    fit_intercept=True,
    normalize=False,
    precompute=False,
    max_iter=1000,
    copy_X=True,
    tol=0.0001,
    warm_start=False,
    positive=False,
    random_state=None,
    selection='cyclic'
)
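To make the mixing concrete, here is a sketch of how l1_ratio steers the penalty; the alpha value is simply the one from our example, and the variable names are illustrative:

lasso_like = ElasticNet(alpha=0.001, l1_ratio=1.0)   # pure L1 penalty, equivalent to Lasso
ridge_like = ElasticNet(alpha=0.001, l1_ratio=0.0)   # pure L2 penalty; scikit-learn recommends
                                                     # the dedicated Ridge estimator for this case
balanced = ElasticNet(alpha=0.001, l1_ratio=0.5)     # the default: an equal mix of both penalties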
Finally, we generate predictions with the model and evaluate its performance.
y_pred = net.predict(X_test)
rmse = np.sqrt(MSE(y_test, y_pred))
Here, we make predictions for the test data set and use the Root Mean Squared Error (RMSE) as the evaluation metric. This metric tells us how far the points are spread, in the vertical direction, around the fitted line. The lower the RMSE, the better the model. In our example we get an approximate score of 55.585.
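For reference, over n test points with true values y_i and predictions \hat{y}_i, the metric is defined as:

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}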
In this post we have walked through fitting a model with scikit-learn. The same steps could be used to fit other models such as LinearRegression (OLS), Lasso, LassoLars, LassoLarsIC, BayesianRidge or SGDRegressor, among others. More elaborate strategies are also possible, such as pipelines, model selection, and parameter tuning via cross-validation. This exercise gives an idea of the straightforward interface that scikit-learn provides for the most common models in the literature.
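As one example of such a strategy, scikit-learn provides ElasticNetCV, which selects alpha and l1_ratio by cross-validation; a minimal sketch, where the candidate grids below are illustrative choices rather than recommendations:

from sklearn.linear_model import ElasticNetCV

# Search over a small grid of penalty mixes and strengths using 5-fold CV:
cv_net = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], alphas=[0.0001, 0.001, 0.01], cv=5)
cv_net.fit(X_train, y_train)
print(cv_net.alpha_, cv_net.l1_ratio_)   # hyperparameters selected by cross-validation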