Machine Learning
Table of Contents
1. Course Information
1.1. Myself
- Contact: mathieu.fauvel@inrae.fr
- Ph.D. degree in signal and image processing from the National Polytechnic Institute, Grenoble, France, and the University of Iceland, in 2007
- Post-doc position at INRIA - MISTIS Team, 2008-2010
- Assistant Professor (Grenoble, 2007-2008 & Toulouse, 2010-2011)
- Associate Professor at DYNAFOR, National Polytechnic Institute, Toulouse, between 2011-2018.
- Since 2018, Research (CR1) at CESBIO, INRA.
- Research interests are: machine learning for environmental/ecological monitoring
1.2. Prerequisites
- Basics in descriptive statistic
- Basics in Python
1.3. Student Learning Objectives
- Learn how to define a machine learning pipeline
- Evaluate the quality of a predictive model
- Compare models
- Implement it using python
1.4. Course Material
For this course, we use Scikit-Learn (Pedregosa et al. 2011) python library to perform most of the different steps of ML, from pre-processing, learning, prediction and model assessment. For deep learning, we use Pytorch (Paszke et al. 2019). Both are open-source and widely used in the community. The pandas library is used to process and manipulate the data (McKinney 2010) and geopandas is used to process geospatial data (Jordahl et al. 2020).
The materials of this document are available here: https://framagit.org/mfauvel/machine-learning.
To install all the python dependencies, use conda and the file environment.yml:
# replace my-env by the name you want conda env create --name my-env --file environment.yml
Don’t forget to activate your environment !
2. Introduction
2.1. What is machine learning ?
Machine learning (ML) is a part of Artificial Intelligence (IA) science. It focuses on performing human tasks using computers and algorithms. For simple tasks, the algorithms can be completely handcrafted, however for advanced tasks, it might be impossible for a human to define the algorithm. In that situation, observed data related to the considered task can be used to learn the desire algorithm behavior, with few (or no) human interactions.
More specifically, ML is about the implementation of algorithms into computer programs that can learn from the observation of some realizations of a specific task. In (Mitchell 2006), learning is defined as the “central question” of the ML scientific field:
“How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?”
But how to define or assess the process of learning ? This is done using a performance measure (similar to grades for final exams …): together with the task and the observations of some realizations of this task, this performance measure provides a quality value of the computer program outputs for the given observed realizations. Thus, learning is defined as the capacity of a computer program to improve its performance measure with observations. Learning can also be formulated as the capacity of an algorithm to extract knowledge from data.
It is common to say that ML is centered on efficient algorithms and applications, with strong connection with applied mathematics and computer science. Its old brother Statistical Learning (see this must read (Hastie, Tibshirani, and Friedman 2009)), is more oriented towards modeling the data, and arises from the statistics and signal/image processing. Well, saying that, boundaries between ML and Statistical Learning are soft in practice and both field share a lot in common.
What about the so cool (as of end 2021) Deep Learning (DL)? DL can be seen as a subfield of ML, with particularly large models and the ability to perform well on large scale data set (aka, Big Data), as illustrated in figure 1. These models are addressed in this course, as well as standard machine learning tools.
Figure 1: Deep Learning is a subfield of Machine Learning.
Maybe the oldest and most famous ML application is the recognition of handwritten digits from US postal envelopes. The following python code load the digital number, a 8 by 8 grayscale images, with its label and plot the 12 first samples, see figure 2. Here the task is to recognise which number is written, the data are the digital images of handwritten numbers and the measure of performance is the correct recognition rate.
import matplotlib.pyplot as plt from sklearn.datasets import load_digits X, y = load_digits(return_X_y=True) fig, axs = plt.subplots(nrows=3, ncols=4, figsize=(8, 6)) plt.gray() for i in range(3): for j in range(4): axs[i, j].matshow(X[i*4+j, :].reshape(8, 8), cmap='gray') axs[i, j].tick_params(axis=u'both', which=u'both',length=0) axs[i, j].set_xticks([]) axs[i, j].set_yticks([]) axs[i, j].set_title(f"Label: {y[i*4+j]}")
Figure 2: Handwritten digits, from https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits and https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html.
We will see later that the way the data are represented are very important for the final results. First attempts simply flattened the image into a vector of size \(8\times 8=64\) features, as illustrated in figure 3 for the first three digits. As a final remark, best results are now obtained by taking into account the image structure of the digits, with convolutional layer for instance.
fig, axs = plt.subplots(nrows=3, ncols=1, figsize=(8, 8), gridspec_kw={'hspace': 0.75}) fig.tight_layout() for i in range(3): axs[i].plot(X[i, :]) axs[i].set_title(f"Label: {y[i]}")
Figure 3: Flattened digits.
2.2. Main Notation
In the remaining of this course, we will use the following notation.
2.2.1. Variable terminology
- Observed data \(\mathbf{x}\), called input variables or predictors.
- Data to be predicted \(y\), called output variables or responses.
Usually, the input variables is composed of \(d\) features represented as a \(d\)-dimensional vector of vector space \(\mathcal{X}\): \(\mathbf{x}=[x_1, \ldots,x_d]^\top\in\mathcal{X}\subset\mathbb{R}^d\). For simplicity, in the following, we will refer to \(\mathbb{R}^d\), but be aware that for some situations, we might work with a subspace of it. For instance, we can have strictly positive measurements, i.e., \(x_i \geq 0, \forall i\in{1,\ldots,d}\), and this kind of constraint could be included in the learning problem.
2.2.2. Prediction problem
- If \(y\) are quantitative data, the learning problem is usually called regression and \(y\in\mathcal{Y}\subset\mathbb{R}\).
- If \(y\) are categorical data, the learning problem is usually called classification and \(y\in\{C_1, \ldots, C_C\}\) with \(C_i\) refers to the \(i^{th}\) category.
These are very close problem: the algorithm might differ only by the performance measure used to learn the predictive model.
2.2.3. Prediction rule
The link between the input and output variables is modeled by the prediction rule, denoted by the function \(f\). This function takes as input the predictors and outputs response … Mathematically, it writes:
\begin{eqnarray*} f: \mathbb{R}^d & \to & \mathbb{R},\\ \mathbf{x}& \mapsto & y.\\ \end{eqnarray*}The (learnable) parameters of the function are denoted \(\boldsymbol{\theta}\) and are optimized (or fit) with the learning set, i.e., they are automatically tuned to increase the performance measure. Once the optimal set of parameters are obtained, the function is usually denoted \(\hat{f}\), indicating that it was fitted on the data. We might also define hyperparameters of \(f\) as the set parametes of the function that is fixed before the learning set and are not optimized.
2.2.4. Supervised or unsupervised learning
We call training set \(\mathcal{S}\) the set of sample available to learn the prediction rule \(f\). For a training set of size \(n\), three different cases exist:
- Supervised learning: \(\mathcal{S} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}\), both input and output variables are available.
- Unsupervised learning: \(\mathcal{S} = \{(\mathbf{x}_1), \ldots, (\mathbf{x}_n)\}\), only input variables are available.
- Semi-supervised learning: \(\mathcal{S} = \mathcal{S}_s \cup \mathcal{S}_u = \{(\mathbf{x}^s_1, y^s_1), \ldots, (\mathbf{x}^s_{n_s}, y^s_{n_s})\} \cup \{\mathbf{x}^u_1,\ldots, \mathbf{x}^u_{n_u}\}\), this is a mixed scenario often encountered in practice. Some samples are provided with the output variables, but most of the sample are available only with input variable.
2.3. Toy Example
In this example, we are going to illustrate the ML machinery on a simple case. The task considered is the regression, i.e., from the sample \(\mathbf{x}\) we want to predict the associated value \(y\). The data are distributed as shown in figure 4.
For the prediction rule, we use a simple linear model: \(f(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b\), which in our 1-dimensional case reduces to \(f(x) = wx +b\) and \(\theta = (w,b)\). For the performance measure, the mean square error (MSE) is used. We will refer to the loss \(\mathcal{L}\) is the following rather than the performance measure: the former is simply the negative of the later. It is more convenient to write the optimization problem as a minimization problem.
\[MSE(\theta) = \frac{1}{n}\sum_{i=1}^n \big(f(x_i) - y_i\big)^2=\frac{1}{n}\sum_{i=1}^n \big(wx_i + b - y_i\big)^2.\]
import numpy as np import matplotlib.pyplot as plt def function_to_be_estimated(x): return x*0.5 + 0.5*np.sin(x*2*np.pi/5) + 0.1*np.random.randn(x.shape[0], x.shape[1]) # Set random seed np.random.seed(0) # Generate some data n_samples = 20 x = 10*np.random.rand(n_samples,1)-5 y = function_to_be_estimated(x) # Plot the function plt.figure(figsize=(3,3)) plt.scatter(x, y) plt.grid("on") plt.xlabel("x") plt.ylabel("y")
Figure 4: Synthetic regression problem.
The learning step consists in finding the best parameters according to the loss function computed on the training set. For this particular model, the solution is explicit, i.e., we can derive a closed-form solution from the gradient of the MSE.
\begin{eqnarray} \frac{\partial MSE}{\partial w} & = & \frac{2}{n}\sum_{i=1}^n x_i(wx_i +b-y_i)\label{eq:lin:grad:w}\\ \frac{\partial MSE}{\partial b} & = & \frac{2}{n}\sum_{i=1}^n (wx_i +b-y_i)\label{eq:lin:grad:b} \end{eqnarray}Setting (\ref{eq:lin:grad:b}) to zero, we get \(b=\mu_y - w\mu_x\). Then injecting it into (\ref{eq:lin:grad:w}), setting the derivative to zero and arranging the different terms, we get \(w =\frac{\mathrm{Cov}(x, y)}{\mathrm{Cov}(x, x)}\) (see https://en.wikipedia.org/wiki/Covariance). The following code compute the estimation of the optimal parameters from the observed data.
mu_x = np.mean(x) mu_y = np.mean(y) cov = np.cov(x, y, rowvar=False) w = cov[0, 1] / cov[0, 0] b = mu_y - w* mu_x t = np.linspace(-5, 5).reshape(-1, 1) plt.figure(figsize=(3,3)) plt.plot(t, w*t +b, 'r') plt.scatter(x, y) plt.grid("on") plt.xlabel("x") plt.ylabel("y") plt.title(f"(w,b)=({w:.2f}, {b:.2f})")
For these parameters, the mean square error computed on the training set is
y_prediction = w*x+b mse = np.mean((y_prediction-y)**2) print(f"{mse:.2f}")
0.16
Can we do better ? Would be it safe in terms of accuracy ? Before answering these questions, we need additional tool that will ease the data manipulation. They are based on scikit-learn
library, and allow us to skip the maths part. So, let’s do it again with scikit-learn:
from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_squared_error as MSE np.set_printoptions(precision=2) # Initialize and fit the model model = LinearRegression() model.fit(x, y) # Print model parameters print(f"w={model.coef_} and b={model.intercept_}") # Do prediction y_prediction = model.predict(x) mse = MSE(y, y_prediction) print(f"MSE = {mse:.2f}")
w=[[0.45]] and b=[0.02] MSE = 0.16
The code is highly simplified: for all the ML models provided in scikit-learn
, the fit
and predict
functions are provided to learn the model and do the prediction !
Now we can try more advanced model. For instance, rather than considering the original variable only, we could include higher order information and add some polynomial features. Formally, it consists in transforming the data before using the linear model:
\begin{align} \phi: \mathbb{R} & \to & \mathbb{R}^p\\ x & \mapsto & \phi(x) = [x, x^2, x^3, ..., x^p] \end{align}
Again with scikit-learn
everything is available, and a common strategy is to define a pipeline (https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to sequentially modify the data:
from sklearn.preprocessing import PolynomialFeatures from sklearn.pipeline import Pipeline # Define the pipeline as a polynomial transformation + linear model degree = 5 model = Pipeline( [("PolyTransform", PolynomialFeatures(degree)), ("LinearReg", LinearRegression())] ) # We can use "fit" again, it will sequentially appply the list of transformation provided at the creation of the pipeline model.fit(x, y) yt = model.predict(t) # Plot the results to check if it is fine plt.figure(figsize=(3, 3)) plt.scatter(x, y) plt.plot(t, yt, "r") plt.grid("on") plt.xlabel("x") plt.ylabel("y") # Compute MSE y_prediction = model.predict(x) mse = MSE(y, y_prediction) print(f"MSE = {mse:.2f}")
MSE = 0.01
Whoua ! It seems to work better: it fits better the data, and the MSE on the training set if lower. But remind that the objective is to predict unseen data. Using more complex model (as with polynomial feature), we have increased the learning capacity: the model can learn more information from the data, but also more noise :( This is known as overfitting (https://en.wikipedia.org/wiki/Overfitting). We will see later how to check if this phenomenon occurs, and more broadly, how to assess model performance and to model selection. For now, since the data are generated from ourselves, we can check this manually:
Based on the python code provide in this example, perform the following analysis in a notebook (files in the notebooks)
- Load the data, available in the note
- For a polynomial degree ranging in {0, 20},
- Fit the model
- Compute the MSE on the training data
- Compute the MSE on the test data, located here in the data repertory.
- Plot the prediction in the interval [-5, 5], then [-10, 10].
- Summarizes your results visually (using matplotlib plotting facility)
2.3.1. Homework - Braking distance
The table tab:braking:distance contains the braking distance for various cars and driver. The file can be download here. Load the data and try a linear regression (with different polynomial degree) to related the speed with the braking distance.
Tips: you can load the file using the following snippet:
import pandas as pd df = pd.read_csv("/home/mfauvel/Documents/Enseignement/INRA/MasterSigma/data/braking_distance.csv", index_col=0) # Path need to be adapted ! print(df.head())
Speed: 0 Speed: 40 Speed: 50 Speed: 70 Speed: 90 Speed: 100 Speed: 110 Speed: 130
Car #
0 4.488990 12.60200 7.93177 32.4445 60.5546 56.0518 76.0007 101.365
1 -0.561011 11.11080 11.79650 35.3948 57.9588 63.6805 78.7270 107.571
2 -3.026890 1.03175 25.18210 28.4790 48.5566 61.8898 76.6653 104.722
3 -6.701320 11.58230 8.69358 25.5270 49.7054 59.2811 70.4830 107.165
4 -5.491230 11.01720 8.43203 26.6295 52.0120 66.0341 81.1067 115.733
Car # | Speed: 0 | Speed: 40 | Speed: 50 | Speed: 70 | Speed: 90 | Speed: 100 | Speed: 110 | Speed: 130 |
---|---|---|---|---|---|---|---|---|
0 | 4.48899 | 12.602 | 7.93177 | 32.4445 | 60.5546 | 56.0518 | 76.0007 | 101.365 |
1 | -0.561011 | 11.1108 | 11.7965 | 35.3948 | 57.9588 | 63.6805 | 78.727 | 107.571 |
2 | -3.02689 | 1.03175 | 25.1821 | 28.479 | 48.5566 | 61.8898 | 76.6653 | 104.722 |
3 | -6.70132 | 11.5823 | 8.69358 | 25.527 | 49.7054 | 59.2811 | 70.483 | 107.165 |
4 | -5.49123 | 11.0172 | 8.43203 | 26.6295 | 52.012 | 66.0341 | 81.1067 | 115.733 |
5 | 1.96217 | 5.10659 | 5.89579 | 36.9719 | 48.4752 | 71.6416 | 83.6499 | 111.316 |
6 | 5.3934 | 7.88628 | 15.3244 | 35.0379 | 53.5495 | 71.2402 | 81.3549 | 113.507 |
7 | 1.63565 | 11.1314 | 19.6684 | 27.9187 | 49.1378 | 67.668 | 84.3406 | 105.267 |
8 | -0.261377 | 3.85402 | 14.7941 | 31.4941 | 52.5009 | 70.9246 | 76.8773 | 106.695 |
9 | 2.64496 | 6.32143 | 17.0739 | 28.6738 | 46.0595 | 60.9116 | 76.2826 | 109.836 |
3. Machine Learning Algorithms
This sections presents various conventional ML methods. The linear models, mainly coming the statistics science are one of the most interpretable models. Yet, their learning capacity are limited and non-linear models, i.e., that can learn non linear relationship between the inputs and outputs variables, provides usually better results at the cost of a lack of interpretability.
3.1. Linear models
Linear models assume that the prediction rule \(f\) is linear in the inputs variable, i.e., \(f(\mathbf{x}) = \sum_{i=1}^d w_ix_i = \mathbf{w}^\top\mathbf{x} + b\). In the following, linear models for regression and classification are presented.
3.1.1. Regression
As discussed in 2.3, if we fit a linear model with the mean square error, we have an explicit solution for 1-dimensional data. The result is still valid for any dimension. To see this, denote \(\mathbf{X} = \big[\mathbf{x}_1, \ldots, \mathbf{x}_n\big]^\top\) a \(n\times d+1\) matrix that collects all the input samples augmented with one variable set to 1 (for the intercept), and let \(\mathbf{y}=[y_1, \ldots, y_n]^\top\) the vector containing all the outputs samples. Then the MSE can be written as
\begin{align} MSE(\mathbf{w}) = \frac{1}{n}\Big(\mathbf{y} - \mathbf{X}\mathbf{w}\Big)^\top\Big(\mathbf{y} - \mathbf{X}\mathbf{w}\Big)\label{mse:matrix} \end{align}where \(\mathbf{w}= [w_1, \ldots, w_d, b]^\top\). We can derive w.r.t \(\mathbf{w}\) and by setting the derivative to zero, we obtain the general expression for fitting a linear model with the MSE:
\begin{align} \hat{\mathbf{w}} = \Big(\mathbf{X}^T\mathbf{X}\Big)^{-1}\mathbf{X}^T\mathbf{y}. \end{align}
This is exactly what the fit
method does in scikit-learn. The “hat” on \(\mathbf{w}\) indicates that it is an estimation, w.r.t. collected data.
Interpretability of a linear model comes from the possibility to relate the values of \(\hat{\mathbf{w}}\) to the output: a very low value of \(w_i\) indicates a very low relationship1 of the i\(^{th}\) input variable with the output variable, providing input variables have been standardized (see section 5).
In the following, we will apply the linear regression to the California Housing dataset from the scikit-learn
datasets (https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset). From the data set description:
the output variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000). The input variables are
- MedInc median income in block group.
- HouseAge median house age in block group.
- AveRooms average number of rooms per household.
- AveBedrms average number of bedrooms per household.
- Population block group population.
- AveOccup average number of household members.
- Latitude block group latitude.
- Longitude block group longitude.
where a block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
import pandas as pd import geopandas as gpd from sklearn.datasets import fetch_california_housing from sklearn.model_selection import train_test_split import matplotlib.pyplot as plt # Open data and create the geo-data-frame cal_housing = fetch_california_housing(as_frame=True) df = pd.concat([cal_housing.data, cal_housing.target], axis=1) gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude)) # Inspect the first 5 rows of the dataframe print(gdf.head())
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal geometry 0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526 POINT (-122.23000 37.88000) 1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585 POINT (-122.22000 37.86000) 2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521 POINT (-122.24000 37.85000) 3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413 POINT (-122.25000 37.85000) 4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422 POINT (-122.25000 37.85000)
We can use pandas info()
function to get some information about the data:
# Print some information about the data set_model_parameters gdf.info()
<class 'geopandas.geodataframe.GeoDataFrame'> RangeIndex: 20640 entries, 0 to 20639 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MedInc 20640 non-null float64 1 HouseAge 20640 non-null float64 2 AveRooms 20640 non-null float64 3 AveBedrms 20640 non-null float64 4 Population 20640 non-null float64 5 AveOccup 20640 non-null float64 6 Latitude 20640 non-null float64 7 Longitude 20640 non-null float64 8 MedHouseVal 20640 non-null float64 9 geometry 20640 non-null geometry dtypes: float64(9), geometry(1) memory usage: 1.6 MB
The very first steps of any machine learning analysis is to split the data into two sets: One used to train/learn the model, and the second one used to compute the accuracy of the model. Scikit-learn propose several tools to do such operations (see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection). In this course, we will make use a lot of the train_test_split
function:
states = gpd.read_file("data/usa-states-census-2014.shp") # Split the data into train and test data set gdf_train, gdf_test = train_test_split(gdf, train_size=0.5, random_state=10) # Plot the target value fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 5), edgecolor='w') for i, (gdf, data) in enumerate(zip([gdf_train, gdf_test], ["train", "test"])): states[states.NAME == "California"].plot(color="gray", alpha=0.5, ax=axs[i]) gdf.plot( kind="scatter", x="Longitude", y="Latitude", s=gdf["Population"] / 100, c="MedHouseVal", colorbar=True, ax=axs[i], ) axs[i].set_title(f"{data}")
Figure 5: Median
In a notebook, using the above python code to open the data, perform a linear regression on this data set to predict the target value and try to obtain the best MSE score. You are free to use polynomial transformation to fit your model, but each groups need to validate the following requirements before reporting the results:
- Take care to use the input variables: [“MedInc”, “HouseAge”, “AveRooms”, “AveBedrms”, “Population”, “AveOccup”, “Latitude”, “Longitude”]
- Compute both the MSE on the train and test set, several times for a same model, but with a different
random_state
for the splitter, and report the mean and standard deviation error, alternatively you can draw boxplots. - Plot the results and display where are the most important errors. For that you can plot the map, or plot the scatter plot.
3.1.1.1. Extended linear models
Several extensions of linear model exists that add a so-called regularization terms \(R\) to the MSE: \[\mathcal{L}(\theta) = MSE(\theta) + \lambda R(\theta)\] where \(\lambda\) is a weight parameter to control the trade-off between the data fit and regularization. The regularizer is used to enforce some properties in \(f\): smoothness, sparsity, positivity etc …
The objectives of regularization are numerous: avoid over-fitting, improve model generalization, improve robustness to outliers, prevents numerical issues … They are two mains regularization usually applied with linear models:
- Ridge regression (Hastie, Tibshirani, and Friedman 2009): \(R(\mathbf{w}) = \|\mathbf{w}\|_2^2 = \sum_{i=1}^d w_i^2\),
- Lasso (Hastie, Tibshirani, and Friedman 2009) : \(R(\mathbf{w}) = \|\mathbf{w}\|_1 = \sum_{i=1}^d |w_i|\).
They both penalized high value of \(\mathbf{w}\), they behavior are illustrated in figure 6. With no surprise such models are available in scikit-learn
under the name Ridge and Lasso. A linear model with Lasso regularization is known to be sparse: depending on the value of the weight parameter, some of the variables \(\mathbf{w}\) are strictly equal to zero. Thus the model learns with a subset in the input variables and this property can be exploited to select the most relevant input variables.
Figure 6: Ridge and Lasso 1D regularization.
If we continue the toy example of section 2.3, the loss function is modified to (intercept is usually not regularized): \[\frac{1}{n}\sum_{i=1}^n \big(wx_i + b - y_i\big)^2 + \lambda w^2.\] Using the same minimization strategy, we get now a slightly modified solution \(w =\frac{\mathrm{Cov}(x, y)}{\mathrm{Cov}(x, x) + \lambda}\). We can directly see the effect of regularization: for inputs variable with very low covariance, adding a small terms avoid division per zeros. Also, if we set \(\lambda\) to a large value, the estimation will tends to zero while when \(\lambda\) tends to zero, the regularization has no effect.
Figure 7: Ridge regularization for linear regression with one input variable.
The following code below illustrate the effects of ridge regularization of the toy example with a over-parametrized linear model2.
import numpy as np import matplotlib.pyplot as plt from sklearn.preprocessing import PolynomialFeatures from sklearn.pipeline import Pipeline from sklearn.metrics import mean_squared_error as MSE from sklearn.linear_model import Ridge from sklearn.preprocessing import StandardScaler def function_to_be_estimated(x): return ( x * 0.5 + 0.5 * np.sin(x * 2 * np.pi / 5) + 0.1 * np.random.randn(x.shape[0], x.shape[1]) ) # Set random seed np.random.seed(0) # Generate some data n_samples = 20 x = 10 * np.random.rand(n_samples, 1) - 5 y = function_to_be_estimated(x) t = np.linspace(-5, 5).reshape(-1, 1) # Define the model for various regularization values fig, axs = plt.subplots(nrows=3, ncols=5, sharex=True, sharey=True, figsize=(15, 5)) for alpha, ax in zip(np.logspace(-9, 5, 15), fig.axes): model = Pipeline( [ ("poly_feat", PolynomialFeatures(20)), ("scaler", StandardScaler()), ("lin_reg", Ridge(alpha=alpha)), ] ) model.fit(x, y) yp = model.predict(t) ax.scatter(x, y) ax.plot(t, yp, "r") ax.set_xlim(-5, 5) ax.set_ylim(-2.5, 2.5) ax.grid("on") ax.set_title(f"$\lambda$= {alpha}")
Figure 8: Ridge linear model with a polynomial transformation of degree 20, with various regularization parameters.
Continue the previous notebook and now change the linear model to either Ridge or Lasso model (choose one, the other one will be done as homework):
- Select the best weight parameter using try and test method (we will see later in section 4 how to select it) for a linear model with polynomial feature of size 10. Plot the MSE as a function of \(\lambda\) and select the best one.
- Look at the model parameters (
model.coef_
) for some values of \(\lambda\) and discuss.
3.1.2. Classification
For classification problems, we have a categorical variable to estimate instead of a having quantitative variable, it means the algorithm should predict categories (“Dogs”, “Cats”, “Monkey” …). For such output variable, the MSE is not a good measure of performance: how do you define the square error between category “Dogs” and “Cats” ?
One solution usually adopted for linear models is to consider the estimation of the posterior probability \(p(y_i=c | \mathbf{x}_i)\) to each class \(c\in\big\{1, \ldots, C\big\}\) and decide the class membership using the maximum a posteriori (MAP) rule: \[\hat{c}_i = \arg\max_{c} p(y_i=c | \mathbf{x}_i).\]
The logistic regression (despite its name it is a classifier algorithm) models the logarithm of the posterior probability as a linear function of the inputs variables (Hastie, Tibshirani, and Friedman 2009): \[\log\big(p(y=c | \mathbf{x})\big) = \mathbf{w}_c^\top\mathbf{x} + b_c + K\] where \(K\) is a constant independent of the class. You should recognise the expression of a linear model of regression: this is exactly what it is. The logistic regression performs a regression on the logarithm of the posterior probability of each class: we have to learn \(C\) regression problems. However, these problems are not independent since the sum of the posterior probabilities should be exactly 1: this is the role of the constant \(K\):
\begin{align*} \sum_{c=1}^Cp(y_i=c | \mathbf{x}_i) & = & 1 \\ \sum_{c=1}^C \exp\Big\{\mathbf{w}_c^\top\mathbf{x}_i + b_c + K_i\Big\} & = & 1 \\ \exp\{K_i\}\sum_{c=1}^C \exp\Big\{\mathbf{w}_c^\top\mathbf{x}_i + b_c\Big\} & = & 1 \\ \exp\{K_i\} &=& \frac{1}{\sum_{c=1}^C \exp\Big\{\mathbf{w}_c^\top\mathbf{x}_i + b_c\Big\}} \end{align*}thus
\[p(y_i=c | \mathbf{x}_i) = \frac{\exp\Big\{\mathbf{w}_{c}^\top\mathbf{x}_i + b_{c}\Big\}}{\sum_{c'=1}^C \exp\Big\{\mathbf{w}_{c'}^\top\mathbf{x}_i + b_{c'}\Big\}}.\]
This ratio of exponential is known as the softmax function and it is used (widely in neural network) to transform a set of unconstrained variables to a set of variables between \([0, 1]\) that sums to one (as probability distribution).
To train our logistic regression model, we need to estimate \((\mathbf{w}_c, b_c)\ \forall c\in\{1, \ldots, C\}\). From our training set, we have the following information:
\begin{eqnarray*} p(y_i=c|\mathbf{x}_i) = \left\{ \begin{array}{lr} 1 & : \mathbf{x}_i\in c \\ 0 & : \mathbf{x}_i\not\in c. \end{array} \right. \end{eqnarray*}The output variable \(y_i\) is vector of size the number of class, and contains the information above, usually on-hot-encoded (https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/): \[\mathbf{y}_i=\begin{bmatrix}0\\ \vdots\\1\\ \vdots\\0\end{bmatrix} \text{ with }\mathbf{y}_{ij}=1 \text{ if } \mathbf{x}_i\in c_j \text{ and } 0 \text{ otherwise}.\]
For such observations (membership probabilities), the loss conventionally used is the negative of the log likelihood, which can be written as (the regularization is already included):
\begin{eqnarray*} L(\theta) & = & -\sum_{i=1}^n \log(p(y_i=c_i|\mathbf{x}_i)) + \lambda \sum_{c=1}^C \|\mathbf{w}_c\|_2^2 \\ & = & -\sum_{i=1}^n (\mathbf{w}_{ci}^\top\mathbf{x}_i+b_{ci}) + \sum_{i=1}^n\log\Big\{\sum_{c=1}^C\exp\big(\mathbf{w}_{c}^\top\mathbf{x}_i + b_{c}\big)\Big\}+ \lambda \sum_{c=1}^C \|\mathbf{w}_c\|_2^2 \\ \end{eqnarray*}
There is no explicit solution and the optimization problem is solved by gradient descent. For linear model, there are many existing algorithms, with different regularization and optimization strategies. The scikit-learn
library has implemented a lot of them, you can read a description here: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression.
We will illustrate how the logistic regression works on a synthetic example.
import numpy as np import matplotlib.pyplot as plt from sklearn.datasets import make_blobs from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression as LR from sklearn.metrics import accuracy_score as OA # Create toy 2D data X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=0) # Split X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.5, random_state=0) # Fit model model = LR(random_state=0, C = 1) model.fit(X_train, y_train) y_pred_test = model.predict(X_test)
That’s all, as usual with scikit learn. So far, we have generated some toy data set with 2 input variables and 3 classes. The generated data have been split into a training and testing set and we learn the model with default parameters.
Now since we are in two dimensional space, it is easy to plot the prediction of the model. We can output two information: the posterior probability (the class membership) and the classification. Sometimes they are referred to as soft and hard assignment values. Scikit learn offers two function: predict_proba
and predict
for the former and the latter respectively. The following code generates a grid of points in \(\mathbb{R}^2\) and compute their class membership, then it displays it using contours plot.
# Generate points for the bounding box x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5 y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5 h = 0.02 # step size in the mesh xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h)) # Predict posterior probability and class Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()]) y_pred = model.predict(np.c_[xx.ravel(), yy.ravel()]) # Plot the results fig, axs = plt.subplots(nrows=1, ncols=4, figsize=(15, 5)) for c in range(3): cs = axs[c].contour(xx, yy, Z[:,c].reshape(xx.shape), [0.25, 0.5, 0.75]) axs[c].scatter(X_test[:, 0], X_test[:, 1], c=y_test) axs[c].clabel(cs, inline=1, fontsize=10) axs[-1].contourf(xx, yy, y_pred.reshape(xx.shape)) axs[-1].scatter(X_test[:, 0], X_test[:, 1], c=y_test)
Figure 9: Classification of toy data set: each color represents a class (\(y\in\{0, 1, 2\}\)), and the input variables are in \(\mathbb{R}^2\). The lines of the three plots from the left corresponds to the isoline of posterior probability for class 0, 1, and 2 respectively. The last plot corresponds to the partition of the space.
The figure 9 shows the classification result: the decision boundaries are linear w.r.t. the input variables (on the right plot, boundaries are straight lines). It is off course possible to perform polynomial transformation before the linear model to obtain non-linear decision function.
Perform the classification of optical recognition of handwritten digits dataset (see section 2.1 to load the data). The recommendation for the labworks are
- Split the data : 50% for training and 50% for testing, repeat simulation with different
random_seed
, - Use polynomial transformation (up to 2 or 3, if you have a lot of RAM),
- Try various regularization weight values (parameter
C
in the model), for instance[0.01, 0.1, 1, 10, 100]
, - Perform individual classification reports (for one
random_seed
) and also sensibility analysis for several runs with differentrandom_seed
, - Plot few wrong prediction to visually asses what’s going on for misclassified samples? Tip you can have access to wrong classification using
t = np.where(y_test != y_pred_test)[0]
.
3.2. K-Nearest Neighbors
K-Nearest Neighbors (KNN) methods, either for classification or regression, are memory-based algorithms, i.e., no training is done and the prediction is done solely using the training samples (Hastie, Tibshirani, and Friedman 2009). For a given test sample \(\mathbf{x}_t\), the prediction is done by finding the k training samples closest to \(\mathbf{x}_t\) w.r.t. a given distance (for instance the Euclidean distance). Then,
- for the classification, the predicted class is found using a majority vote among the k neighbors, i.e., the most common class,
- for the regression, the mean or median operator is used to estimate the predicted value.
The algorithm has two parameters: the distance function between samples and the number of neighbors to consider. Usually, for qualitative variables, the euclidean distance is used. It is defined as \[d(\mathbf{x}_i, \mathbf{x}_j) = \|\mathbf{x}_i- \mathbf{x}_j\|_2^2=\sum_{k=1}^d(x_{ik}-x_{jk})^2.\] The number of neighbors is usually set to relatively a small value (5 to 20). The decision function tends to be smoother as the number of neighbors used increased, because of the averaging effect. Also, for classification, is possible to get the probabilities membership by computing the proportion of each class in the neighbors.
KNN behavior is illustrated in Figure 10 for the classification case. We use the same grid of points to draw the contour lines than for logistic regression.
from sklearn.neighbors import KNeighborsClassifier # for k in range(2,11,2): model = KNeighborsClassifier(n_neighbors=2) model.fit(X_train, y_train) y_pred_test = model.predict(X_test) Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()]) y_pred = model.predict(np.c_[xx.ravel(), yy.ravel()]) # Plot the results fig, axs = plt.subplots(nrows=1, ncols=4, figsize=(15, 5)) for c in range(3): cs = axs[c].contour(xx, yy, Z[:,c].reshape(xx.shape), [0.25, 0.5, 0.75]) axs[c].scatter(X_test[:, 0], X_test[:, 1], c=y_test) axs[c].clabel(cs, inline=1, fontsize=10) axs[-1].contourf(xx, yy, y_pred.reshape(xx.shape)) axs[-1].scatter(X_test[:, 0], X_test[:, 1], c=y_test)
Figure 10: Classification of toy data set: each color represents a class (\(y\in\{0, 1, 2\}\)), and the input variables are in \(\mathbb{R}^2\). The lines of the three plots from the left corresponds to the isoline of posterior probability for class 0, 1, and 2 respectively. The last plot corresponds to the partition of the space.
- Reproduce the toy example for the classification with various number of neighbors.
- Do the same work with the toy example for regression.
3.3. Random Forest
3.3.1. Decision Trees
A decision tree is a (simple & naive) model that partition the input variables space into zones where the prediction (regression or classification) is constant. The learning step consists in finding these zones. Basically, the algorithm recursively search for each input variable a threshold that satisfy a purity index, i.e., that splits the data into into pure subset (in terms of output variables), and construct a set of if ... then ... else
binary rules that defines these zones. Since each zone contains “pure” output variables, i.e. very similar values (same class number for instance) a constant prediction make sense.
Let’s us continue with the toy regression example to better understand how a decision tree behave. The figure 11 shows a decision tree fits with two size of tree, here controlled by the max_depth
parameter.
import numpy as np from sklearn.tree import DecisionTreeRegressor import matplotlib.pyplot as plt from sklearn import tree # Load data data = np.load("./data/toy_data.npz") x, y, xt, yt = data["x"], data["y"], data["xt"], data["yt"] # Fit model & plot model plt.figure(figsize=(5,5)) plt.scatter(x, y, c='b') plt.scatter(xt[::10,0], yt[::10,0], c='k') for max_depth, col in zip([10, 2], ['r', 'g']): model = DecisionTreeRegressor(max_depth=max_depth) model.fit(x, y) # Plot decision function t = np.linspace(-5, 5, 1000).reshape(-1, 1) plt.plot(t, model.predict(t), col) plt.legend(["Train points","Test points","Depth=10", "Depth=3"])
Figure 11: Decition tree regression for the toy data set
From the figure, it can be seen that a too deep tree might overfit the data. We will see in section 4 how to select it. Yet, in the next section, we will show that pragmatically, it is not of high importance.
Sklearn
provides function to plot decision tree: this allow to interpret how the decision is done and which variables are the most important. This this the plot_tree
function (https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html#sklearn.tree.plot_tree). The figure 12 shows a decition tree with a max_depth
of size 2.
Figure 12: Decision tree.
A decision tree can be either used for classification or regression.
- Reproduce the toy example for the regression with various max_depth parameters.
- Do the same work with the toy example for classification.
3.3.2. Random Forest
Decision tree can overfit and its learning capacity is limited (e.g., because of constant prediction). However, by combining many trees, build on a subset of the data and a subset of the input variables it is possible to achieve much better results. This is the idea of Random Forest (RF), a powerful ensemble method (Hastie, Tibshirani, and Friedman 2009).
The algorithm of a random forest can be summarized as:
- Select the number of trees and the parameters of each tree (max_depth etc ..)
- For each tree
- Select a subset of the sample
- Fit the decision tree recursively by selecting a subset of input variables for each node
- For the prediction of one sample:
- Compute the decision for each tree
- Average the results for regression or do a majority vote for classification.
RF is usually fast and little sensitive to hyperparameter setting: if the number of trees is high enough, good results will be achieved without the need to investigate several values.
The combination of several decision trees does not maintain the interpretability of decision tree. Yet, RF comes with the ability to rank input features according to how they improved the purity criterion at each node of the decision trees. In sklearn, this is accessible using the feature_importance_
property of the model.
As usual, the behavior of RF is illustrated on the toy example. Here for classification.
from sklearn.ensemble import RandomForestClassifier as RF # Learn model = RF() model.fit(X_train, y_train) y_pred_test = model.predict(X_test) Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()]) y_pred = model.predict(np.c_[xx.ravel(), yy.ravel()]) # Plot the results fig, axs = plt.subplots(nrows=1, ncols=4, figsize=(15, 5)) for c in range(3): cs = axs[c].contour(xx, yy, Z[:,c].reshape(xx.shape), [0.25, 0.5, 0.75]) axs[c].scatter(X_test[:, 0], X_test[:, 1], c=y_test) axs[c].clabel(cs, inline=1, fontsize=10) axs[-1].contourf(xx, yy, y_pred.reshape(xx.shape)) axs[-1].scatter(X_test[:, 0], X_test[:, 1], c=y_test)
- Reproduce the toy example for the classification with various number of trees.
- Do the same work with the toy example for regression.
4. Model selection & Model Validation
4.1. What is “overfitting”
Overfitting is the situation where the model learns the learning samples, without modeling the statistical relationship between the input and output variables. This includes the observed noise in the training set. Usually, bad or worse prediction accuracy are obtained on the training set. The table 2 shows an example of overfitting.
Classifier | Train set score | Test set score |
---|---|---|
1 | 0.875 | 0.825 |
2 | 0.925 | 0.820 |
3 | 0.995 | 0.740 |
- Find in the previously discussed algorithms a behavior of overfitting per se.
4.2. Model assessment
As discussed above, a machine learning framework usually involves some splitting of the data in order to train the model, and to validate it on unseen data. As we have seen previously, the “random” effect in the split as has an influence on the test accuracy measure, and thus we have to repeat the process several times to obtain several realizations of the measure. This process is known as model assessment. By taking the average value of the different realizations we obtain a good estimation of the expected test accuracy measure (which will be the accuracy measure obtained with an infinite number of training and testing samples). The figure 13 shows graphically the workflow to assess model performance.
Figure 13: Global framework for model assessment.
Model assessment is very important especially for complex learning model: complex model can easily learn the noise in the data. The figure 14 shows this behavior for the K-NN.
Figure 14: K-NN regression accuracy (as measured by the mean square error) as a function of the complexity. Here the complexity is measure as the number of neighbors used for the prediction (one neighbors leading to the most complex model).
Up to now, we consider the problem of validating the model using the testing samples. However, the framework proposed should be used to validate the model only. Not for selecting the hyperparameters of the model nor the model itself for a pool of models. This task, named “Model selection” in the figure 13 and is discussed in the following section
4.3. Model selection
Model selection is the process to select the set of best hyperparameters (if any) for the considered model or/and to choose between different models using only sample from the training set. As we know now, if we look at the training error, we will face overfitting and thus it is not a good strategy.
Some models as a measure of fit that can be used to select their parameters. But practical cases show that the good way is to estimate the expected error using some random-splitting strategies.
The most used one is called “cross-validation”. Its principle is quite simple, yet effective: the training data is split into K disjoint groups. One group, or fold, is left out and the model is learned with the (K-1) remaining folds. Then the left out fold is used to compute the test error. The expected test error is estimated as the average of the test error computed for every fold:
\[CV(\hat{f},\theta) = \frac{1}{K}\sum_{k=1}^K\text{Err}_k(\hat{f},\theta).\]
Hence, this procedure involves the training of K models on K different subsets of the training set. The table 3 shows a graphical illustration of the procedure.
Fold\(_1\) | Fold\(_2\) | Fold\(_3\) | Fold\(_4\) | Fold\(_5\) | |
---|---|---|---|---|---|
Model\(_1\) | Train | Train | Train | Train | Validation |
Model\(_2\) | Train | Train | Train | Validation | Train |
Model\(_3\) | Train | Train | Validation | Train | Train |
Model\(_4\) | Train | Validation | Train | Train | Train |
Model\(_5\) | Validation | Train | Train | Train | Train |
An important rule is that the different steps of the learning process should be done for each model (e.g. feature transformation, model fit etc …). The following code shows how to select hyperparameters for random forest classifiers on a synthetic data set.
import matplotlib.pyplot as plt from sklearn.ensemble import RandomForestClassifier as RF from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split from sklearn.metrics import accuracy_score from sklearn.datasets import make_classification X, y = make_classification(n_samples=1000, n_features=50, n_informative=40, n_redundant=10, n_classes=5, n_clusters_per_class=4, random_state=0) # Split data X_train, X_test, y_train, y_test = train_test_split( X, y, train_size=0.5, random_state=0) # Hyperparameters params = {"n_estimators": [1, 10, 50, 100, 200, 400, 600], "max_features": [1, 5, 10, 15, 20, 25]} # Do Cross Validation grid = GridSearchCV(RF(), # Set up the classifier param_grid=params, cv= 5, refit=True, n_jobs=-1) # Do the grid search in parallel grid.fit(X_train, y_train) # Run the grid search # Compute the test score score = grid.best_estimator_.score(X_test, y_test) # Plot the results res = [[param["max_features"], param["n_estimators"], err] for param, err in zip(grid.cv_results_["params"], grid.cv_results_["mean_test_score"])] fig, ax = plt.subplots(nrows=1, ncols=1) im = ax.imshow( grid.cv_results_["mean_test_score"].reshape(6, 7) ) ax.set_xticks(range(7)) ax.set_xticklabels(params["n_estimators"]) ax.set_yticks(range(6)) ax.set_yticklabels(params["max_features"]) cb = fig.colorbar(im, ax=ax) ax.set_title(f"Best parameters {grid.best_params_} \n Train score {grid.best_score_:.2f} \n Test score: {score:.2f}")
Figure 15: Cross-validation estimation of the espected error for RF.
Finish your machine learning framework, for either classification or regression, with the two data sets from sklearn:
- Diabetes: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes
- Wine recognition: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine
The code should do:
- Split the data into training and testing set,
- For the linear, RF and K-NN models, do hyperparameters selection then model selection,
- Perform a correct estimation of the espected error.
5. Data Transformation
In this section, we will review some techniques to pre-process the data for a machine learning pipeline. You can find many information here:
We will only review the main techniques.
5.1. Feature scaling
Many algorithm in machine learning are not invariant to shifting and scaling of the data. A shift is a translation of the data by some constant and a scaling is a multiplication by another constant, e.g. : \[x_s = \frac{x + m}{s}.\]
Hence, different unit of the data may lead to different solution: this is something not acceptable in practice. Let’s imagine you have some measurements in meters and you decide to put them in centimeters. The link function should not change because of the change of unit.
Another influence of feature unit, or dynamic or range, is for algorithm that compute the Euclidean distance. If one feature is much larger than the others, it will drives completely the algorithm whatever the values of the others features. Let’s exemplify this with a (very) artificial example. Suppose we have a data set with the height and weight of some people and we want to classify them between tall and small. Also related, feature dynamic can play a role in feature extraction, see PCA example later for instance.
import pandas as pd from sklearn.neighbors import KNeighborsClassifier as KNN # Define some arbitrary data di = {"Heigth": [170,160,110,120], "Weight":[80000,75000,50000,75000], "Size":["Tall", "Tall", "Small", "Small"]} df = pd.DataFrame.from_dict(di) print(df)
Heigth Weight Size 0 170 80000 Tall 1 160 75000 Tall 2 110 50000 Small 3 120 75000 Small
The height is in centimeter and the weight is in gram. If we classify this toy data set with KNN we get the following results. We took the first and third element as training samples.
model = KNN(n_neighbors=1) model.fit(df.iloc[[0,2],:2], df.iloc[[0,2],2]) print(model.predict(df.iloc[:,:2]))
['Tall' 'Tall' 'Small' 'Tall']
We have an error for the last prediction: Why ? We can compute the distance between each samples:
from sklearn.metrics import pairwise_distances as pd print(pd(df.iloc[:,:2]))
[[ 0. 5000.00999999 30000.05999994 5000.24999375] [ 5000.00999999 0. 25000.04999995 40. ] [30000.05999994 25000.04999995 0. 25000.002 ] [ 5000.24999375 40. 25000.002 0. ]]
The distance is 5000 between the first and the last samples, which is closer than the distance with the third element. If now we change the unit of the weight to kilogram, what do we get ?
df["Weight"] /= 1000 print(pd(df.iloc[:,:2]))
[[ 0. 11.18033989 67.08203932 50.24937811] [11.18033989 0. 55.90169944 40. ] [67.08203932 55.90169944 0. 26.92582404] [50.24937811 40. 26.92582404 0. ]]
We have now a smaller distance between the third and fourth samples, which is correct.
To prevent this kind of problem, a good behavior to scale the data. There are many techniques to do that, in this course, we recommend the most standard one at a first try: the standardization. It consists in removing the mean of each feature and scale it to unit variance. Hence, each feature has a zero mean and unit standard deviation, and therefore a similar scale. The process is simply: \[x_s = \frac{x - \mu}{\sigma}.\] where \(\mu\) is the mean of the feature and \(\sigma\) is the standard deviation, both estimated on the training set!
You can do it in Scikit-learn using the StandardScaler
tool, or any robust version of it.
5.2. Feature reduction
In machine learning, feature extraction or feature selection is the process of selecting or constructing the most relevant features for the considered task from the input variable. It aims to remove redundant or non-informative variables.
If we select a few subset of the original input variables, the process is called feature selection. While if we construct a new subset of variables (e.g., by a linear combination of the original variables), the process is called feature extraction. Furthermore, if we can use some data for the feature selection/extraction, the process supervised, otherwise it is unsupervised.
In this course, we are going to review some very conventional algorithms: one for feature selection, and two for feature extraction.
5.2.1. Feature selection
One of the most simple algorithm, yet effective but time consuming, is the sequential forward feature selection. It can be applied to any supervised regression/classification algorithm. It works as follows:
- Starts with an empty pool \(F\) of selected features,
- Select the feature \(f_1\) that provides the best value for the selected criterion and add it to \(F\).
- Select the feature \(f_2\) such that the couple of features \((f_1,f_2)\) provides the best value for the selected criterion and add it to \(F\).
- Select the feature \(f_3\) such that the triplet of features \((f_1,f_2,f_3)\) …
- …
- The algorithm stops either if the increase of the criterion is too low or if the maximum number of features is reached.
A variant of the algorithm is obtained by sequentially remove one feature from the set of input feature, no surprise it is called sequential backward feature selection. In practice this version if less effective than the forward one.
Let’s see how it works for the California housing data set.
import numpy as np from sklearn.datasets import fetch_california_housing from sklearn.model_selection import train_test_split import matplotlib.pyplot as plt # Open data california_housing_data = fetch_california_housing() X, y = california_housing_data.data, california_housing_data.target feature_names = np.asarray(california_housing_data.feature_names) # Split data X_train, X_test, y_train, y_test = train_test_split(X, y)
We now use the class SequentialFeatureSelector
from Scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html#sklearn.feature_selection.SequentialFeatureSelector). We need to also set a regression model. For this example a Knn is used, but any models from scikit is working.
from sklearn.feature_selection import SequentialFeatureSelector as SFS from sklearn.neighbors import KNeighborsRegressor as KNR knn = KNR(n_neighbors=3) sfs = SFS(knn, n_features_to_select=3, n_jobs=-1) sfs.fit(X_train, y_train) print(f"Selected feature: {feature_names[sfs.get_support()]}")
Selected feature: ['MedInc' 'Latitude' 'Longitude']
We can compute the score on the test set and compare it when no feature selection is done.
knn.fit(sfs.transform(X_train), y_train) print(f"Regression score on the test set with feature selection: {knn.score(sfs.transform(X_test), y_test)}") knn.fit(X_train, y_train) print(f"Regression score on the test set without feature selection: {knn.score(X_test, y_test)}")
Regression score on the test set with feature selection: 0.7539353810565258 Regression score on the test set without feature selection: 0.11211148340676591
5.2.2. Feature extraction
In this part, we will cover the famous Principal Component Analysis (PCA), an unsupervised feature extraction algorithm.
PCA is a Linear transformation used to reduce the dimensionality of the data (Jolliffe 2002). The new feature \(z_i\) is found by projecting the original data (i.e., a linear combination of), such as the projected feature account for most of the variability of the data:
- \(z_1,~z_2,~z_3,\ldots\) are mutually uncorrelated (i.e., \(\langle z_i, z_j\rangle=0, \forall j\neq i\)),
- \(\text{var}(z_i)\) is as large as possible,
- \(\text{var}(z_1)>\text{var}(z_2)>\text{var}(z_3)>\ldots\)
Figure 16: PCA on synthetic data. The red axis represents the new feature with the maximum of variance.
The PCA algorithm is as follow:
It starts by searching \(\mathbf{v}_1\) such as \(\max_{\mathbf{v}_1}\text{var}(z_1)\) with
\begin{eqnarray}\label{eq:lagrangian} \text{var}(z_1) & = & \text{var}(\langle\mathbf{v}_1,\mathbf{x}\rangle)\\ &=& \mathbf{v}_1^\top\boldsymbol{\Sigma}\mathbf{v}_1 \end{eqnarray}This problem is indetermined: if \(\hat{\mathbf{v}}_1\) maximizes the variance, \(\alpha\hat{\mathbf{v}}_1\) too! Therefore, we add a constraint on \(\mathbf{v}_1\): \(\langle\mathbf{v}_1,\mathbf{v}_1\rangle=1\). Thus, we have a constraint optimization problem, whose Lagrangian can be written as (Boyd and Vandenberghe 2004):
\begin{eqnarray} \mathcal{L}(\mathbf{v}_1,\lambda_1) = \mathbf{v}_1^\top\boldsymbol{\Sigma}\mathbf{v}_1 + \lambda_1(1- \mathbf{v}_1^\top\mathbf{v}_1) \end{eqnarray}By computing the derivative w.r.t \(\mathbf{v}_1\) and set it to zeros, we have:
\begin{eqnarray} \frac{\partial\mathcal{L}}{\partial\mathbf{v}_1} = 2\boldsymbol{\Sigma}\mathbf{v}_1-2\lambda_1\mathbf{v}_1 \end{eqnarray}Hence \(\mathbf{v}_1\) is an eigenvector of the covariance matrix of \(\mathbf{x}\):
\begin{eqnarray} \boldsymbol{\Sigma}\mathbf{v}_1 =\lambda_1\mathbf{v}_1 \end{eqnarray}To get a maximal variance, \(\mathbf{v}_1\) should be the eigenvector corresponding to the largest eigenvalues (to see this, we insert the above equation into \ref{eq:lagrangian})!
\begin{eqnarray} \text{var}(z_1) = \mathbf{v}_1^\top\boldsymbol{\Sigma}\mathbf{v}_1 = \lambda_1 \mathbf{v}_1^\top\mathbf{v}_1 = \lambda_1 \end{eqnarray}Any linear algebra library has a function to find the eigenvectors/eigenvalues of a square positive definite matrix (see for instance https://numpy.org/doc/stable/reference/generated/numpy.linalg.eigh.html)
Once \(\mathbf{v}_1\) if found, the algorithm no searches for \(\mathbf{v}_2\) such as \(\max\text{var}(z_2)\) and \(\langle\mathbf{v}_2,\mathbf{v}_2\rangle=1\) and \(\langle\mathbf{v}_1,\mathbf{v}_2\rangle=0\). Again we can write the Lagragian:
\begin{eqnarray} \mathcal{L}(\mathbf{v}_2,\lambda_2,\beta_1) = \mathbf{v}_2^\top\boldsymbol{\Sigma}\mathbf{v}_2 + \lambda_2(1- \mathbf{v}_2^\top\mathbf{v}_2) + \beta_1(0 - \mathbf{v}_2^\top\mathbf{v}_1) \end{eqnarray}and compute the derivative w.r.t \(\mathbf{v}_2\), and try to vanish it:
\begin{eqnarray} \frac{\partial\mathcal{L}}{\partial\mathbf{v}_2} &=& 2\boldsymbol{\Sigma}\mathbf{v}_2-2\lambda_2\mathbf{v}_2-\beta_1\mathbf{v}_1\\ \boldsymbol{\Sigma}\mathbf{v}_2 &=& \lambda_2\mathbf{v}_2+2\beta_1\mathbf{v}_1 \end{eqnarray}At optimality, \(\langle\mathbf{v}_1,\mathbf{v}_2\rangle=0\). Left-multiplying by \(\mathbf{v}_1^\top\) the above equation:
\begin{eqnarray} \mathbf{v}_1^\top\boldsymbol{\Sigma}\mathbf{v}_2 &=& 2\beta_1 \\ \lambda_1\mathbf{v}_1^\top\mathbf{v}_2 &=& 2\beta_1 \\ 0 &=& 2\beta_1 \end{eqnarray}Hence, we have
\begin{eqnarray} \boldsymbol{\Sigma}\mathbf{v}_2 =\lambda_2\mathbf{v}_2. \end{eqnarray}\(\mathbf{v}_2\) is the eigenvector corresponding the second largest eigenvalues. For axis \(k\ge 3\), we can repeat the same procedure and get that \(\mathbf{v}_k\) is the eigenvector corresponding the \(k^{\text{th}}\) largest eigenvalues.
In practice, the eigensolver returns all the eigenvector/eigenvalues in one call. Basically, the algorithm reduces to
Empirical estimation the mean value:
\begin{eqnarray} \boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^n\mathbf{x}_i \end{eqnarray}Empirical estimation the covariance matrix:
\begin{eqnarray} \boldsymbol{\Sigma} = \frac{1}{n-1}\sum_{i=1}^n(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})^\top \end{eqnarray}Compute the eigenvalues/eigenvectors and keep the \(p\) first… How to choose \(p\) ? Take a look to the explained variance, i.e., the variance of the data explained by the \(p\) first eigenvectors, it can be computed as
\[\frac{\sum_{i=1}^p\lambda_i}{\sum_{i=1}^d\lambda_i}\]
In practice, we choose \(p\) such as a given amount of the variance is kept (e.g., 90% or 95%).
- Tips for high dimensional data set: if the number of samples is lower than the number of feature, see (Manolakis, Lockwood, and Cooley 2016) page 420.
We now compare how PCA works on the California housing data (see https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html for a list of options):
from sklearn.decomposition import PCA from sklearn.pipeline import Pipeline from sklearn.preprocessing import StandardScaler # Two cases are investigated: same number of feature than with sequential feature selector # or by thresholing the cummulative variance for param in [3, 0.90, 0.95, 0.99, 0.999]: model = Pipeline([ ("Scaler", StandardScaler()), # You can try to remove this part and see the difference ("Feature Extraction", PCA(n_components = param)), ("KNN", KNR(n_neighbors=3)) ]) model.fit(X_train, y_train) print(f"Regression score with the number of component {model.named_steps['Feature Extraction'].n_components_}: {model.score(X_test, y_test)}")
Regression score with the number of component 3: 0.28924002334403465 Regression score with the number of component 5: 0.47881899995519794 Regression score with the number of component 6: 0.523310120515489 Regression score with the number of component 7: 0.6129552696601799 Regression score with the number of component 8: 0.6368428619732054
- Apply the sequential forward feature selection and the PCA on the digit data.
- Compare the results in terms of
- Classification accuracy,
- Number of extracted feature,
- Interpretability of the extracted feature.
5.3. TODO Imputation of missing values
6. Bibliography
Footnotes:
This is true if the data are standardized (see 5 for a definition of standardization). An exact derivation is given in (Hastie, Tibshirani, and Friedman 2009), chapter 3.
A careful reader might have seen that a StandardScaler
pre-processing has been included in the pipeline, with regularization, this is something require and critical since we generally defined only one weight for all parameters. Hence, we must be sure that each parameter has the same dynamic (range of values) in order to penalized its “high” values in a similar way than the others parameters.