Machine Learning

1. Course Information
2. Introduction
3. Machine Learning Algorithms
4. Model selection & Model Validation
5. Data Transformation
6. Bibliography

1. Course Information

1.1. Myself

Contact: mathieu.fauvel@inrae.fr
Ph.D. degree in signal and image processing from the National Polytechnic Institute, Grenoble, France, and the University of Iceland, in 2007
Post-doc position at INRIA - MISTIS Team, 2008-2010
Assistant Professor (Grenoble, 2007-2008 & Toulouse, 2010-2011)
Associate Professor at DYNAFOR, National Polytechnic Institute, Toulouse, between 2011-2018.
Since 2018, Research (CR1) at CESBIO, INRA.
Research interests are: machine learning for environmental/ecological monitoring

1.2. Prerequisites

Basics in descriptive statistic
Basics in Python

1.3. Student Learning Objectives

Learn how to define a machine learning pipeline
Evaluate the quality of a predictive model
Compare models
Implement it using python

1.4. Course Material

For this course, we use Scikit-Learn (Pedregosa et al. 2011) python library to perform most of the different steps of ML, from pre-processing, learning, prediction and model assessment. For deep learning, we use Pytorch (Paszke et al. 2019). Both are open-source and widely used in the community. The pandas library is used to process and manipulate the data (McKinney 2010) and geopandas is used to process geospatial data (Jordahl et al. 2020).

The materials of this document are available here: https://framagit.org/mfauvel/machine-learning.

To install all the python dependencies, use conda and the file environment.yml:

# replace my-env by the name you want
conda env create --name my-env --file environment.yml

Don’t forget to activate your environment !

2. Introduction

2.1. What is machine learning ?

Machine learning (ML) is a part of Artificial Intelligence (IA) science. It focuses on performing human tasks using computers and algorithms. For simple tasks, the algorithms can be completely handcrafted, however for advanced tasks, it might be impossible for a human to define the algorithm. In that situation, observed data related to the considered task can be used to learn the desire algorithm behavior, with few (or no) human interactions.

More specifically, ML is about the implementation of algorithms into computer programs that can learn from the observation of some realizations of a specific task. In (Mitchell 2006), learning is defined as the “central question” of the ML scientific field:

“How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?”

But how to define or assess the process of learning ? This is done using a performance measure (similar to grades for final exams …): together with the task and the observations of some realizations of this task, this performance measure provides a quality value of the computer program outputs for the given observed realizations. Thus, learning is defined as the capacity of a computer program to improve its performance measure with observations. Learning can also be formulated as the capacity of an algorithm to extract knowledge from data.

It is common to say that ML is centered on efficient algorithms and applications, with strong connection with applied mathematics and computer science. Its old brother Statistical Learning (see this must read (Hastie, Tibshirani, and Friedman 2009)), is more oriented towards modeling the data, and arises from the statistics and signal/image processing. Well, saying that, boundaries between ML and Statistical Learning are soft in practice and both field share a lot in common.

What about the so cool (as of end 2021) Deep Learning (DL)? DL can be seen as a subfield of ML, with particularly large models and the ability to perform well on large scale data set (aka, Big Data), as illustrated in figure 1. These models are addressed in this course, as well as standard machine learning tools.

Figure 1: Deep Learning is a subfield of Machine Learning.

Maybe the oldest and most famous ML application is the recognition of handwritten digits from US postal envelopes. The following python code load the digital number, a 8 by 8 grayscale images, with its label and plot the 12 first samples, see figure 2. Here the task is to recognise which number is written, the data are the digital images of handwritten numbers and the measure of performance is the correct recognition rate.

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y=True)
fig, axs = plt.subplots(nrows=3, ncols=4, figsize=(8, 6))
plt.gray()
for i in range(3):
    for j in range(4):
        axs[i, j].matshow(X[i*4+j, :].reshape(8, 8),  cmap='gray')
        axs[i, j].tick_params(axis=u'both', which=u'both',length=0)
        axs[i, j].set_xticks([])
        axs[i, j].set_yticks([])
        axs[i, j].set_title(f"Label: {y[i*4+j]}")

Figure 2: Handwritten digits, from https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits and https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html.

We will see later that the way the data are represented are very important for the final results. First attempts simply flattened the image into a vector of size $8\times 8=64$ features, as illustrated in figure 3 for the first three digits. As a final remark, best results are now obtained by taking into account the image structure of the digits, with convolutional layer for instance.

fig, axs = plt.subplots(nrows=3, ncols=1, figsize=(8, 8), gridspec_kw={'hspace': 0.75})
fig.tight_layout()
for i in range(3):
    axs[i].plot(X[i, :])
    axs[i].set_title(f"Label: {y[i]}")

Figure 3: Flattened digits.

2.2. Main Notation

In the remaining of this course, we will use the following notation.

2.2.1. Variable terminology

Observed data $\mathbf{x}$, called input variables or predictors.
Data to be predicted $y$, called output variables or responses.

Usually, the input variables is composed of $d$ features represented as a $d$-dimensional vector of vector space $\mathcal{X}$: $\mathbf{x}=[x_1, \ldots,x_d]^\top\in\mathcal{X}\subset\mathbb{R}^d$. For simplicity, in the following, we will refer to $\mathbb{R}^d$, but be aware that for some situations, we might work with a subspace of it. For instance, we can have strictly positive measurements, i.e., $x_i \geq 0, \forall i\in{1,\ldots,d}$, and this kind of constraint could be included in the learning problem.

2.2.2. Prediction problem

If $y$ are quantitative data, the learning problem is usually called regression and $y\in\mathcal{Y}\subset\mathbb{R}$.
If $y$ are categorical data, the learning problem is usually called classification and $y\in\{C_1, \ldots, C_C\}$ with $C_i$ refers to the $i^{th}$ category.

These are very close problem: the algorithm might differ only by the performance measure used to learn the predictive model.

2.2.3. Prediction rule

The link between the input and output variables is modeled by the prediction rule, denoted by the function $f$. This function takes as input the predictors and outputs response … Mathematically, it writes:

\begin{eqnarray*} f: \mathbb{R}^d & \to & \mathbb{R},\\ \mathbf{x}& \mapsto & y.\\ \end{eqnarray*}

The (learnable) parameters of the function are denoted $\boldsymbol{\theta}$ and are optimized (or fit) with the learning set, i.e., they are automatically tuned to increase the performance measure. Once the optimal set of parameters are obtained, the function is usually denoted $\hat{f}$, indicating that it was fitted on the data. We might also define hyperparameters of $f$ as the set parametes of the function that is fixed before the learning set and are not optimized.

2.2.4. Supervised or unsupervised learning

We call training set $\mathcal{S}$ the set of sample available to learn the prediction rule $f$. For a training set of size $n$, three different cases exist:

Supervised learning: $\mathcal{S} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, both input and output variables are available.
Unsupervised learning: $\mathcal{S} = \{(\mathbf{x}_1), \ldots, (\mathbf{x}_n)\}$, only input variables are available.
Semi-supervised learning: $\mathcal{S} = \mathcal{S}_s \cup \mathcal{S}_u = \{(\mathbf{x}^s_1, y^s_1), \ldots, (\mathbf{x}^s_{n_s}, y^s_{n_s})\} \cup \{\mathbf{x}^u_1,\ldots, \mathbf{x}^u_{n_u}\}$, this is a mixed scenario often encountered in practice. Some samples are provided with the output variables, but most of the sample are available only with input variable.

2.3. Toy Example

In this example, we are going to illustrate the ML machinery on a simple case. The task considered is the regression, i.e., from the sample $\mathbf{x}$ we want to predict the associated value $y$. The data are distributed as shown in figure 4.

For the prediction rule, we use a simple linear model: $f(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b$, which in our 1-dimensional case reduces to $f(x) = wx +b$ and $\theta = (w,b)$. For the performance measure, the mean square error (MSE) is used. We will refer to the loss $\mathcal{L}$ is the following rather than the performance measure: the former is simply the negative of the later. It is more convenient to write the optimization problem as a minimization problem.

\[MSE(\theta) = \frac{1}{n}\sum_{i=1}^n \big(f(x_i) - y_i\big)^2=\frac{1}{n}\sum_{i=1}^n \big(wx_i + b - y_i\big)^2.\]

import numpy as np
import matplotlib.pyplot as plt

def function_to_be_estimated(x):
    return x*0.5 + 0.5*np.sin(x*2*np.pi/5) + 0.1*np.random.randn(x.shape[0], x.shape[1])

# Set random seed
np.random.seed(0)

# Generate some data
n_samples = 20
x = 10*np.random.rand(n_samples,1)-5
y = function_to_be_estimated(x)

# Plot the function
plt.figure(figsize=(3,3))
plt.scatter(x, y)
plt.grid("on")
plt.xlabel("x")
plt.ylabel("y")

Figure 4: Synthetic regression problem.

The learning step consists in finding the best parameters according to the loss function computed on the training set. For this particular model, the solution is explicit, i.e., we can derive a closed-form solution from the gradient of the MSE.

\begin{eqnarray} \frac{\partial MSE}{\partial w} & = & \frac{2}{n}\sum_{i=1}^n x_i(wx_i +b-y_i)\label{eq:lin:grad:w}\\ \frac{\partial MSE}{\partial b} & = & \frac{2}{n}\sum_{i=1}^n (wx_i +b-y_i)\label{eq:lin:grad:b} \end{eqnarray}

Setting (\ref{eq:lin:grad:b}) to zero, we get $b=\mu_y - w\mu_x$. Then injecting it into (\ref{eq:lin:grad:w}), setting the derivative to zero and arranging the different terms, we get $w =\frac{\mathrm{Cov}(x, y)}{\mathrm{Cov}(x, x)}$ (see https://en.wikipedia.org/wiki/Covariance). The following code compute the estimation of the optimal parameters from the observed data.

mu_x = np.mean(x)
mu_y = np.mean(y)

cov = np.cov(x, y,  rowvar=False)
w =  cov[0, 1] / cov[0, 0]
b = mu_y - w* mu_x

t = np.linspace(-5, 5).reshape(-1, 1)
plt.figure(figsize=(3,3))
plt.plot(t, w*t +b, 'r')
plt.scatter(x, y)
plt.grid("on")
plt.xlabel("x")
plt.ylabel("y")
plt.title(f"(w,b)=({w:.2f}, {b:.2f})")

For these parameters, the mean square error computed on the training set is

y_prediction = w*x+b
mse = np.mean((y_prediction-y)**2)
print(f"{mse:.2f}")

0.16

Can we do better ? Would be it safe in terms of accuracy ? Before answering these questions, we need additional tool that will ease the data manipulation. They are based on scikit-learn library, and allow us to skip the maths part. So, let’s do it again with scikit-learn:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as MSE
np.set_printoptions(precision=2)

# Initialize and fit the model
model = LinearRegression()
model.fit(x, y)

# Print model parameters
print(f"w={model.coef_} and b={model.intercept_}")

# Do prediction
y_prediction = model.predict(x)
mse = MSE(y, y_prediction)
print(f"MSE = {mse:.2f}")

w=[[0.45]] and b=[0.02]
MSE = 0.16

The code is highly simplified: for all the ML models provided in scikit-learn, the fit and predict functions are provided to learn the model and do the prediction !

Now we can try more advanced model. For instance, rather than considering the original variable only, we could include higher order information and add some polynomial features. Formally, it consists in transforming the data before using the linear model:

\begin{align} \phi: \mathbb{R} & \to & \mathbb{R}^p\\ x & \mapsto & \phi(x) = [x, x^2, x^3, ..., x^p] \end{align}

Again with scikit-learn everything is available, and a common strategy is to define a pipeline (https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to sequentially modify the data:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Define the pipeline as a polynomial transformation + linear model
degree = 5
model = Pipeline(
    [("PolyTransform", PolynomialFeatures(degree)), ("LinearReg", LinearRegression())]
)

# We can use "fit" again, it will sequentially appply the list of transformation provided at the creation of the pipeline
model.fit(x, y)
yt = model.predict(t)

# Plot the results to check if it is fine
plt.figure(figsize=(3, 3))
plt.scatter(x, y)
plt.plot(t, yt, "r")
plt.grid("on")
plt.xlabel("x")
plt.ylabel("y")

# Compute MSE
y_prediction = model.predict(x)
mse = MSE(y, y_prediction)
print(f"MSE = {mse:.2f}")

MSE = 0.01

Whoua ! It seems to work better: it fits better the data, and the MSE on the training set if lower. But remind that the objective is to predict unseen data. Using more complex model (as with polynomial feature), we have increased the learning capacity: the model can learn more information from the data, but also more noise :( This is known as overfitting (https://en.wikipedia.org/wiki/Overfitting). We will see later how to check if this phenomenon occurs, and more broadly, how to assess model performance and to model selection. For now, since the data are generated from ourselves, we can check this manually:

Based on the python code provide in this example, perform the following analysis in a notebook (files in the notebooks)

Load the data, available in the note
For a polynomial degree ranging in {0, 20},
Fit the model
Compute the MSE on the training data
Compute the MSE on the test data, located here in the data repertory.
Plot the prediction in the interval [-5, 5], then [-10, 10].
Summarizes your results visually (using matplotlib plotting facility)

2.3.1. Homework - Braking distance

The table tab:braking:distance contains the braking distance for various cars and driver. The file can be download here. Load the data and try a linear regression (with different polynomial degree) to related the speed with the braking distance.

Tips: you can load the file using the following snippet:

import pandas as pd
df = pd.read_csv("/home/mfauvel/Documents/Enseignement/INRA/MasterSigma/data/braking_distance.csv", index_col=0) # Path need to be adapted !
print(df.head())

       Speed: 0  Speed: 40  Speed: 50  Speed: 70  Speed: 90  Speed: 100  Speed: 110  Speed: 130
Car #                                                                                          
0      4.488990   12.60200    7.93177    32.4445    60.5546     56.0518     76.0007     101.365
1     -0.561011   11.11080   11.79650    35.3948    57.9588     63.6805     78.7270     107.571
2     -3.026890    1.03175   25.18210    28.4790    48.5566     61.8898     76.6653     104.722
3     -6.701320   11.58230    8.69358    25.5270    49.7054     59.2811     70.4830     107.165
4     -5.491230   11.01720    8.43203    26.6295    52.0120     66.0341     81.1067     115.733

Table 1: Braking distance as a function of the speed (km/h) for different cars
Car #	Speed: 0	Speed: 40	Speed: 50	Speed: 70	Speed: 90	Speed: 100	Speed: 110	Speed: 130
0	4.48899	12.602	7.93177	32.4445	60.5546	56.0518	76.0007	101.365
1	-0.561011	11.1108	11.7965	35.3948	57.9588	63.6805	78.727	107.571
2	-3.02689	1.03175	25.1821	28.479	48.5566	61.8898	76.6653	104.722
3	-6.70132	11.5823	8.69358	25.527	49.7054	59.2811	70.483	107.165
4	-5.49123	11.0172	8.43203	26.6295	52.012	66.0341	81.1067	115.733
5	1.96217	5.10659	5.89579	36.9719	48.4752	71.6416	83.6499	111.316
6	5.3934	7.88628	15.3244	35.0379	53.5495	71.2402	81.3549	113.507
7	1.63565	11.1314	19.6684	27.9187	49.1378	67.668	84.3406	105.267
8	-0.261377	3.85402	14.7941	31.4941	52.5009	70.9246	76.8773	106.695
9	2.64496	6.32143	17.0739	28.6738	46.0595	60.9116	76.2826	109.836

3. Machine Learning Algorithms

This sections presents various conventional ML methods. The linear models, mainly coming the statistics science are one of the most interpretable models. Yet, their learning capacity are limited and non-linear models, i.e., that can learn non linear relationship between the inputs and outputs variables, provides usually better results at the cost of a lack of interpretability.

3.1. Linear models

Linear models assume that the prediction rule $f$ is linear in the inputs variable, i.e., $f(\mathbf{x}) = \sum_{i=1}^d w_ix_i = \mathbf{w}^\top\mathbf{x} + b$. In the following, linear models for regression and classification are presented.

3.1.1. Regression

As discussed in 2.3, if we fit a linear model with the mean square error, we have an explicit solution for 1-dimensional data. The result is still valid for any dimension. To see this, denote $\mathbf{X} = \big[\mathbf{x}_1, \ldots, \mathbf{x}_n\big]^\top$ a $n\times d+1$ matrix that collects all the input samples augmented with one variable set to 1 (for the intercept), and let $\mathbf{y}=[y_1, \ldots, y_n]^\top$ the vector containing all the outputs samples. Then the MSE can be written as

\begin{align} MSE(\mathbf{w}) = \frac{1}{n}\Big(\mathbf{y} - \mathbf{X}\mathbf{w}\Big)^\top\Big(\mathbf{y} - \mathbf{X}\mathbf{w}\Big)\label{mse:matrix} \end{align}

where $\mathbf{w}= [w_1, \ldots, w_d, b]^\top$. We can derive w.r.t $\mathbf{w}$ and by setting the derivative to zero, we obtain the general expression for fitting a linear model with the MSE:

\begin{align} \hat{\mathbf{w}} = \Big(\mathbf{X}^T\mathbf{X}\Big)^{-1}\mathbf{X}^T\mathbf{y}. \end{align}

This is exactly what the fit method does in scikit-learn. The “hat” on $\mathbf{w}$ indicates that it is an estimation, w.r.t. collected data.

Interpretability of a linear model comes from the possibility to relate the values of $\hat{\mathbf{w}}$ to the output: a very low value of $w_i$ indicates a very low relationship¹ of the i$^{th}$ input variable with the output variable, providing input variables have been standardized (see section 5).

In the following, we will apply the linear regression to the California Housing dataset from the scikit-learn datasets (https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset). From the data set description:

the output variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000). The input variables are

MedInc median income in block group.

HouseAge median house age in block group.

AveRooms average number of rooms per household.

AveBedrms average number of bedrooms per household.

Population block group population.

AveOccup average number of household members.

Latitude block group latitude.

Longitude block group longitude.

where a block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

import pandas as pd
import geopandas as gpd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Open data and create the geo-data-frame
cal_housing = fetch_california_housing(as_frame=True)
df = pd.concat([cal_housing.data, cal_housing.target], axis=1)
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))


# Inspect the first 5 rows of the dataframe
print(gdf.head())

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal                     geometry
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526  POINT (-122.23000 37.88000)
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585  POINT (-122.22000 37.86000)
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521  POINT (-122.24000 37.85000)
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413  POINT (-122.25000 37.85000)
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422  POINT (-122.25000 37.85000)

We can use pandas info() function to get some information about the data:

# Print some information about the data set_model_parameters
gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   MedInc       20640 non-null  float64 
 1   HouseAge     20640 non-null  float64 
 2   AveRooms     20640 non-null  float64 
 3   AveBedrms    20640 non-null  float64 
 4   Population   20640 non-null  float64 
 5   AveOccup     20640 non-null  float64 
 6   Latitude     20640 non-null  float64 
 7   Longitude    20640 non-null  float64 
 8   MedHouseVal  20640 non-null  float64 
 9   geometry     20640 non-null  geometry
dtypes: float64(9), geometry(1)
memory usage: 1.6 MB

The very first steps of any machine learning analysis is to split the data into two sets: One used to train/learn the model, and the second one used to compute the accuracy of the model. Scikit-learn propose several tools to do such operations (see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection). In this course, we will make use a lot of the train_test_split function:

states = gpd.read_file("data/usa-states-census-2014.shp")
# Split the data into train and test data set
gdf_train, gdf_test = train_test_split(gdf, train_size=0.5, random_state=10)

# Plot the target value
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 5), edgecolor='w')
for i, (gdf, data) in enumerate(zip([gdf_train, gdf_test], ["train", "test"])):
    states[states.NAME == "California"].plot(color="gray", alpha=0.5, ax=axs[i])
    gdf.plot(
        kind="scatter",
        x="Longitude",
        y="Latitude",
        s=gdf["Population"] / 100,
        c="MedHouseVal",
        colorbar=True,
        ax=axs[i],
    )
    axs[i].set_title(f"{data}")

Figure 5: Median

In a notebook, using the above python code to open the data, perform a linear regression on this data set to predict the target value and try to obtain the best MSE score. You are free to use polynomial transformation to fit your model, but each groups need to validate the following requirements before reporting the results:

Take care to use the input variables: [“MedInc”, “HouseAge”, “AveRooms”, “AveBedrms”, “Population”, “AveOccup”, “Latitude”, “Longitude”]
Compute both the MSE on the train and test set, several times for a same model, but with a different random_state for the splitter, and report the mean and standard deviation error, alternatively you can draw boxplots.
Plot the results and display where are the most important errors. For that you can plot the map, or plot the scatter plot.

3.1.1.1. Extended linear models

Several extensions of linear model exists that add a so-called regularization terms $R$ to the MSE: \[\mathcal{L}(\theta) = MSE(\theta) + \lambda R(\theta)\] where $\lambda$ is a weight parameter to control the trade-off between the data fit and regularization. The regularizer is used to enforce some properties in $f$: smoothness, sparsity, positivity etc …

The objectives of regularization are numerous: avoid over-fitting, improve model generalization, improve robustness to outliers, prevents numerical issues … They are two mains regularization usually applied with linear models:

Ridge regression (Hastie, Tibshirani, and Friedman 2009): $R(\mathbf{w}) = \|\mathbf{w}\|_2^2 = \sum_{i=1}^d w_i^2$,
Lasso (Hastie, Tibshirani, and Friedman 2009) : $R(\mathbf{w}) = \|\mathbf{w}\|_1 = \sum_{i=1}^d |w_i|$.

They both penalized high value of $\mathbf{w}$, they behavior are illustrated in figure 6. With no surprise such models are available in scikit-learn under the name Ridge and Lasso. A linear model with Lasso regularization is known to be sparse: depending on the value of the weight parameter, some of the variables $\mathbf{w}$ are strictly equal to zero. Thus the model learns with a subset in the input variables and this property can be exploited to select the most relevant input variables.

Figure 6: Ridge and Lasso 1D regularization.

If we continue the toy example of section 2.3, the loss function is modified to (intercept is usually not regularized): \[\frac{1}{n}\sum_{i=1}^n \big(wx_i + b - y_i\big)^2 + \lambda w^2.\] Using the same minimization strategy, we get now a slightly modified solution $w =\frac{\mathrm{Cov}(x, y)}{\mathrm{Cov}(x, x) + \lambda}$. We can directly see the effect of regularization: for inputs variable with very low covariance, adding a small terms avoid division per zeros. Also, if we set $\lambda$ to a large value, the estimation will tends to zero while when $\lambda$ tends to zero, the regularization has no effect.

Figure 7: Ridge regularization for linear regression with one input variable.

The following code below illustrate the effects of ridge regularization of the toy example with a over-parametrized linear model².

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error as MSE
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler


def function_to_be_estimated(x):
    return (
        x * 0.5
        + 0.5 * np.sin(x * 2 * np.pi / 5)
        + 0.1 * np.random.randn(x.shape[0], x.shape[1])
    )


# Set random seed
np.random.seed(0)

# Generate some data
n_samples = 20
x = 10 * np.random.rand(n_samples, 1) - 5
y = function_to_be_estimated(x)
t = np.linspace(-5, 5).reshape(-1, 1)

# Define the model for various regularization values
fig, axs = plt.subplots(nrows=3, ncols=5, sharex=True, sharey=True, figsize=(15, 5))
for alpha, ax in zip(np.logspace(-9, 5, 15), fig.axes):
    model = Pipeline(
        [
            ("poly_feat", PolynomialFeatures(20)),
            ("scaler", StandardScaler()),
            ("lin_reg", Ridge(alpha=alpha)),
        ]
    )
    model.fit(x, y)
    yp = model.predict(t)
    ax.scatter(x, y)
    ax.plot(t, yp, "r")
    ax.set_xlim(-5, 5)
    ax.set_ylim(-2.5, 2.5)
    ax.grid("on")
    ax.set_title(f"$\lambda$= {alpha}")

Figure 8: Ridge linear model with a polynomial transformation of degree 20, with various regularization parameters.

Continue the previous notebook and now change the linear model to either Ridge or Lasso model (choose one, the other one will be done as homework):

Select the best weight parameter using try and test method (we will see later in section 4 how to select it) for a linear model with polynomial feature of size 10. Plot the MSE as a function of $\lambda$ and select the best one.
Look at the model parameters (model.coef_) for some values of $\lambda$ and discuss.

3.1.2. Classification

For classification problems, we have a categorical variable to estimate instead of a having quantitative variable, it means the algorithm should predict categories (“Dogs”, “Cats”, “Monkey” …). For such output variable, the MSE is not a good measure of performance: how do you define the square error between category “Dogs” and “Cats” ?

One solution usually adopted for linear models is to consider the estimation of the posterior probability $p(y_i=c | \mathbf{x}_i)$ to each class $c\in\big\{1, \ldots, C\big\}$ and decide the class membership using the maximum a posteriori (MAP) rule: \[\hat{c}_i = \arg\max_{c} p(y_i=c | \mathbf{x}_i).\]

The logistic regression (despite its name it is a classifier algorithm) models the logarithm of the posterior probability as a linear function of the inputs variables (Hastie, Tibshirani, and Friedman 2009): \[\log\big(p(y=c | \mathbf{x})\big) = \mathbf{w}_c^\top\mathbf{x} + b_c + K\] where $K$ is a constant independent of the class. You should recognise the expression of a linear model of regression: this is exactly what it is. The logistic regression performs a regression on the logarithm of the posterior probability of each class: we have to learn $C$ regression problems. However, these problems are not independent since the sum of the posterior probabilities should be exactly 1: this is the role of the constant $K$:

\begin{align*} \sum_{c=1}^Cp(y_i=c | \mathbf{x}_i) & = & 1 \\ \sum_{c=1}^C \exp\Big\{\mathbf{w}_c^\top\mathbf{x}_i + b_c + K_i\Big\} & = & 1 \\ \exp\{K_i\}\sum_{c=1}^C \exp\Big\{\mathbf{w}_c^\top\mathbf{x}_i + b_c\Big\} & = & 1 \\ \exp\{K_i\} &=& \frac{1}{\sum_{c=1}^C \exp\Big\{\mathbf{w}_c^\top\mathbf{x}_i + b_c\Big\}} \end{align*}

thus

\[p(y_i=c | \mathbf{x}_i) = \frac{\exp\Big\{\mathbf{w}_{c}^\top\mathbf{x}_i + b_{c}\Big\}}{\sum_{c'=1}^C \exp\Big\{\mathbf{w}_{c'}^\top\mathbf{x}_i + b_{c'}\Big\}}.\]

This ratio of exponential is known as the softmax function and it is used (widely in neural network) to transform a set of unconstrained variables to a set of variables between $[0, 1]$ that sums to one (as probability distribution).

To train our logistic regression model, we need to estimate $(\mathbf{w}_c, b_c)\ \forall c\in\{1, \ldots, C\}$. From our training set, we have the following information:

\begin{eqnarray*} p(y_i=c|\mathbf{x}_i) = \left\{ \begin{array}{lr} 1 & : \mathbf{x}_i\in c \\ 0 & : \mathbf{x}_i\not\in c. \end{array} \right. \end{eqnarray*}

The output variable $y_i$ is vector of size the number of class, and contains the information above, usually on-hot-encoded (https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/): \[\mathbf{y}_i=\begin{bmatrix}0\\ \vdots\\1\\ \vdots\\0\end{bmatrix} \text{ with }\mathbf{y}_{ij}=1 \text{ if } \mathbf{x}_i\in c_j \text{ and } 0 \text{ otherwise}.\]

For such observations (membership probabilities), the loss conventionally used is the negative of the log likelihood, which can be written as (the regularization is already included):

\begin{eqnarray*} L(\theta) & = & -\sum_{i=1}^n \log(p(y_i=c_i|\mathbf{x}_i)) + \lambda \sum_{c=1}^C \|\mathbf{w}_c\|_2^2 \\ & = & -\sum_{i=1}^n (\mathbf{w}_{ci}^\top\mathbf{x}_i+b_{ci}) + \sum_{i=1}^n\log\Big\{\sum_{c=1}^C\exp\big(\mathbf{w}_{c}^\top\mathbf{x}_i + b_{c}\big)\Big\}+ \lambda \sum_{c=1}^C \|\mathbf{w}_c\|_2^2 \\ \end{eqnarray*}

There is no explicit solution and the optimization problem is solved by gradient descent. For linear model, there are many existing algorithms, with different regularization and optimization strategies. The scikit-learn library has implemented a lot of them, you can read a description here: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression.

We will illustrate how the logistic regression works on a synthetic example.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression as LR
from sklearn.metrics import accuracy_score as  OA

# Create toy 2D data 
X, y = make_blobs(n_samples=100,
                  centers=3,
                  n_features=2,
                  random_state=0)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Fit model
model = LR(random_state=0, C = 1)
model.fit(X_train, y_train)
y_pred_test = model.predict(X_test)

That’s all, as usual with scikit learn. So far, we have generated some toy data set with 2 input variables and 3 classes. The generated data have been split into a training and testing set and we learn the model with default parameters.

Now since we are in two dimensional space, it is easy to plot the prediction of the model. We can output two information: the posterior probability (the class membership) and the classification. Sometimes they are referred to as soft and hard assignment values. Scikit learn offers two function: predict_proba and predict for the former and the latter respectively. The following code generates a grid of points in $\mathbb{R}^2$ and compute their class membership, then it displays it using contours plot.

# Generate points for the bounding box
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
h = 0.02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict posterior probability and class
Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])
y_pred = model.predict(np.c_[xx.ravel(), yy.ravel()])

# Plot the results
fig, axs = plt.subplots(nrows=1, ncols=4, figsize=(15, 5))
for c in range(3):
    cs = axs[c].contour(xx, yy, Z[:,c].reshape(xx.shape), [0.25, 0.5, 0.75])
    axs[c].scatter(X_test[:, 0],
                        X_test[:, 1],
                        c=y_test)
    axs[c].clabel(cs, inline=1, fontsize=10)

axs[-1].contourf(xx, yy, y_pred.reshape(xx.shape))
axs[-1].scatter(X_test[:, 0],
                X_test[:, 1],
                c=y_test)

Figure 9: Classification of toy data set: each color represents a class ($y\in\{0, 1, 2\}$), and the input variables are in $\mathbb{R}^2$. The lines of the three plots from the left corresponds to the isoline of posterior probability for class 0, 1, and 2 respectively. The last plot corresponds to the partition of the space.

The figure 9 shows the classification result: the decision boundaries are linear w.r.t. the input variables (on the right plot, boundaries are straight lines). It is off course possible to perform polynomial transformation before the linear model to obtain non-linear decision function.

Perform the classification of optical recognition of handwritten digits dataset (see section 2.1 to load the data). The recommendation for the labworks are

Split the data : 50% for training and 50% for testing, repeat simulation with different random_seed,
Use polynomial transformation (up to 2 or 3, if you have a lot of RAM),
Try various regularization weight values (parameter C in the model), for instance [0.01, 0.1, 1, 10, 100],
Perform individual classification reports (for one random_seed) and also sensibility analysis for several runs with different random_seed,
Plot few wrong prediction to visually asses what’s going on for misclassified samples? Tip you can have access to wrong classification using t = np.where(y_test != y_pred_test)[0].

3.2. K-Nearest Neighbors

K-Nearest Neighbors (KNN) methods, either for classification or regression, are memory-based algorithms, i.e., no training is done and the prediction is done solely using the training samples (Hastie, Tibshirani, and Friedman 2009). For a given test sample $\mathbf{x}_t$, the prediction is done by finding the k training samples closest to $\mathbf{x}_t$ w.r.t. a given distance (for instance the Euclidean distance). Then,

for the classification, the predicted class is found using a majority vote among the k neighbors, i.e., the most common class,
for the regression, the mean or median operator is used to estimate the predicted value.

The algorithm has two parameters: the distance function between samples and the number of neighbors to consider. Usually, for qualitative variables, the euclidean distance is used. It is defined as \[d(\mathbf{x}_i, \mathbf{x}_j) = \|\mathbf{x}_i- \mathbf{x}_j\|_2^2=\sum_{k=1}^d(x_{ik}-x_{jk})^2.\] The number of neighbors is usually set to relatively a small value (5 to 20). The decision function tends to be smoother as the number of neighbors used increased, because of the averaging effect. Also, for classification, is possible to get the probabilities membership by computing the proportion of each class in the neighbors.

KNN behavior is illustrated in Figure 10 for the classification case. We use the same grid of points to draw the contour lines than for logistic regression.

from sklearn.neighbors import KNeighborsClassifier

# for k in range(2,11,2):
model = KNeighborsClassifier(n_neighbors=2)
model.fit(X_train, y_train)
y_pred_test = model.predict(X_test)
Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])
y_pred = model.predict(np.c_[xx.ravel(), yy.ravel()])

# Plot the results
fig, axs = plt.subplots(nrows=1, ncols=4, figsize=(15, 5))
for c in range(3):
    cs = axs[c].contour(xx, yy, Z[:,c].reshape(xx.shape), [0.25, 0.5, 0.75])
    axs[c].scatter(X_test[:, 0],
                        X_test[:, 1],
                        c=y_test)
    axs[c].clabel(cs, inline=1, fontsize=10)

axs[-1].contourf(xx, yy, y_pred.reshape(xx.shape))
axs[-1].scatter(X_test[:, 0],
                X_test[:, 1],
                c=y_test)

Figure 10: Classification of toy data set: each color represents a class ($y\in\{0, 1, 2\}$), and the input variables are in $\mathbb{R}^2$. The lines of the three plots from the left corresponds to the isoline of posterior probability for class 0, 1, and 2 respectively. The last plot corresponds to the partition of the space.

Reproduce the toy example for the classification with various number of neighbors.
Do the same work with the toy example for regression.

3.3. Random Forest

3.3.1. Decision Trees

A decision tree is a (simple & naive) model that partition the input variables space into zones where the prediction (regression or classification) is constant. The learning step consists in finding these zones. Basically, the algorithm recursively search for each input variable a threshold that satisfy a purity index, i.e., that splits the data into into pure subset (in terms of output variables), and construct a set of if ... then ... else binary rules that defines these zones. Since each zone contains “pure” output variables, i.e. very similar values (same class number for instance) a constant prediction make sense.

Let’s us continue with the toy regression example to better understand how a decision tree behave. The figure 11 shows a decision tree fits with two size of tree, here controlled by the max_depth parameter.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
from sklearn import tree

# Load data
data = np.load("./data/toy_data.npz")
x, y, xt, yt = data["x"], data["y"], data["xt"], data["yt"]

# Fit model & plot model
plt.figure(figsize=(5,5))
plt.scatter(x, y, c='b')
plt.scatter(xt[::10,0], yt[::10,0], c='k')
for max_depth, col in zip([10, 2],
                          ['r', 'g']):
    model = DecisionTreeRegressor(max_depth=max_depth)
    model.fit(x, y)

    # Plot decision function
    t = np.linspace(-5, 5, 1000).reshape(-1, 1)
    plt.plot(t, model.predict(t), col)
plt.legend(["Train points","Test points","Depth=10", "Depth=3"])

Figure 11: Decition tree regression for the toy data set

From the figure, it can be seen that a too deep tree might overfit the data. We will see in section 4 how to select it. Yet, in the next section, we will show that pragmatically, it is not of high importance.

Sklearn provides function to plot decision tree: this allow to interpret how the decision is done and which variables are the most important. This this the plot_tree function (https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html#sklearn.tree.plot_tree). The figure 12 shows a decition tree with a max_depth of size 2.

Figure 12: Decision tree.

A decision tree can be either used for classification or regression.

Reproduce the toy example for the regression with various max_depth parameters.
Do the same work with the toy example for classification.

3.3.2. Random Forest

Decision tree can overfit and its learning capacity is limited (e.g., because of constant prediction). However, by combining many trees, build on a subset of the data and a subset of the input variables it is possible to achieve much better results. This is the idea of Random Forest (RF), a powerful ensemble method (Hastie, Tibshirani, and Friedman 2009).

The algorithm of a random forest can be summarized as:

Select the number of trees and the parameters of each tree (max_depth etc ..)
For each tree
- Select a subset of the sample
- Fit the decision tree recursively by selecting a subset of input variables for each node
For the prediction of one sample:
- Compute the decision for each tree
- Average the results for regression or do a majority vote for classification.

RF is usually fast and little sensitive to hyperparameter setting: if the number of trees is high enough, good results will be achieved without the need to investigate several values.

The combination of several decision trees does not maintain the interpretability of decision tree. Yet, RF comes with the ability to rank input features according to how they improved the purity criterion at each node of the decision trees. In sklearn, this is accessible using the feature_importance_ property of the model.

As usual, the behavior of RF is illustrated on the toy example. Here for classification.

from sklearn.ensemble import RandomForestClassifier as RF

# Learn
model = RF()
model.fit(X_train, y_train)
y_pred_test = model.predict(X_test)
Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])
y_pred = model.predict(np.c_[xx.ravel(), yy.ravel()])

# Plot the results
fig, axs = plt.subplots(nrows=1, ncols=4, figsize=(15, 5))
for c in range(3):
    cs = axs[c].contour(xx, yy, Z[:,c].reshape(xx.shape), [0.25, 0.5, 0.75])
    axs[c].scatter(X_test[:, 0],
                        X_test[:, 1],
                        c=y_test)
    axs[c].clabel(cs, inline=1, fontsize=10)

axs[-1].contourf(xx, yy, y_pred.reshape(xx.shape))
axs[-1].scatter(X_test[:, 0],
                X_test[:, 1],
                c=y_test)

Reproduce the toy example for the classification with various number of trees.
Do the same work with the toy example for regression.

4. Model selection & Model Validation

4.1. What is “overfitting”

Overfitting is the situation where the model learns the learning samples, without modeling the statistical relationship between the input and output variables. This includes the observed noise in the training set. Usually, bad or worse prediction accuracy are obtained on the training set. The table 2 shows an example of overfitting.

Table 2: Overfitting example: some classifiers performs extremely well on the training set, yet they do not generalize well on the test set.
Classifier	Train set score	Test set score
1	0.875	0.825
2	0.925	0.820
3	0.995	0.740

Find in the previously discussed algorithms a behavior of overfitting per se.

4.2. Model assessment

As discussed above, a machine learning framework usually involves some splitting of the data in order to train the model, and to validate it on unseen data. As we have seen previously, the “random” effect in the split as has an influence on the test accuracy measure, and thus we have to repeat the process several times to obtain several realizations of the measure. This process is known as model assessment. By taking the average value of the different realizations we obtain a good estimation of the expected test accuracy measure (which will be the accuracy measure obtained with an infinite number of training and testing samples). The figure 13 shows graphically the workflow to assess model performance.

Figure 13: Global framework for model assessment.

Model assessment is very important especially for complex learning model: complex model can easily learn the noise in the data. The figure 14 shows this behavior for the K-NN.

Figure 14: K-NN regression accuracy (as measured by the mean square error) as a function of the complexity. Here the complexity is measure as the number of neighbors used for the prediction (one neighbors leading to the most complex model).

Up to now, we consider the problem of validating the model using the testing samples. However, the framework proposed should be used to validate the model only. Not for selecting the hyperparameters of the model nor the model itself for a pool of models. This task, named “Model selection” in the figure 13 and is discussed in the following section

4.3. Model selection

Model selection is the process to select the set of best hyperparameters (if any) for the considered model or/and to choose between different models using only sample from the training set. As we know now, if we look at the training error, we will face overfitting and thus it is not a good strategy.

Some models as a measure of fit that can be used to select their parameters. But practical cases show that the good way is to estimate the expected error using some random-splitting strategies.

The most used one is called “cross-validation”. Its principle is quite simple, yet effective: the training data is split into K disjoint groups. One group, or fold, is left out and the model is learned with the (K-1) remaining folds. Then the left out fold is used to compute the test error. The expected test error is estimated as the average of the test error computed for every fold:

\[CV(\hat{f},\theta) = \frac{1}{K}\sum_{k=1}^K\text{Err}_k(\hat{f},\theta).\]

Hence, this procedure involves the training of K models on K different subsets of the training set. The table 3 shows a graphical illustration of the procedure.

Table 3: 5-fold cross validation.
	Fold$_1$	Fold$_2$	Fold$_3$	Fold$_4$	Fold$_5$
Model$_1$	Train	Train	Train	Train	Validation
Model$_2$	Train	Train	Train	Validation	Train
Model$_3$	Train	Train	Validation	Train	Train
Model$_4$	Train	Validation	Train	Train	Train
Model$_5$	Validation	Train	Train	Train	Train

An important rule is that the different steps of the learning process should be done for each model (e.g. feature transformation, model fit etc …). The following code shows how to select hyperparameters for random forest classifiers on a synthetic data set.

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=50,
                           n_informative=40, n_redundant=10,
                           n_classes=5, n_clusters_per_class=4,
                           random_state=0)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=0)

# Hyperparameters
params = {"n_estimators": [1, 10, 50, 100, 200, 400, 600],
          "max_features": [1, 5, 10, 15, 20, 25]}

# Do Cross Validation
grid = GridSearchCV(RF(),  # Set up the classifier
                    param_grid=params, cv= 5,
                    refit=True, n_jobs=-1) # Do the grid search in parallel
grid.fit(X_train, y_train) # Run the grid search

# Compute the test score
score = grid.best_estimator_.score(X_test, y_test)


# Plot the results
res = [[param["max_features"], param["n_estimators"], err]
       for param, err in zip(grid.cv_results_["params"],
                             grid.cv_results_["mean_test_score"])]
fig, ax = plt.subplots(nrows=1, ncols=1)
im = ax.imshow(
    grid.cv_results_["mean_test_score"].reshape(6, 7)
)
ax.set_xticks(range(7))
ax.set_xticklabels(params["n_estimators"])
ax.set_yticks(range(6))
ax.set_yticklabels(params["max_features"])
cb = fig.colorbar(im, ax=ax)
ax.set_title(f"Best parameters {grid.best_params_} \n Train score {grid.best_score_:.2f} \n Test score: {score:.2f}")

Figure 15: Cross-validation estimation of the espected error for RF.

Finish your machine learning framework, for either classification or regression, with the two data sets from sklearn:

The code should do:

Split the data into training and testing set,
For the linear, RF and K-NN models, do hyperparameters selection then model selection,
Perform a correct estimation of the espected error.

5. Data Transformation

In this section, we will review some techniques to pre-process the data for a machine learning pipeline. You can find many information here:

https://scikit-learn.org/stable/data_transforms.html.

We will only review the main techniques.

5.1. Feature scaling

Many algorithm in machine learning are not invariant to shifting and scaling of the data. A shift is a translation of the data by some constant and a scaling is a multiplication by another constant, e.g. : \[x_s = \frac{x + m}{s}.\]

Hence, different unit of the data may lead to different solution: this is something not acceptable in practice. Let’s imagine you have some measurements in meters and you decide to put them in centimeters. The link function should not change because of the change of unit.

Another influence of feature unit, or dynamic or range, is for algorithm that compute the Euclidean distance. If one feature is much larger than the others, it will drives completely the algorithm whatever the values of the others features. Let’s exemplify this with a (very) artificial example. Suppose we have a data set with the height and weight of some people and we want to classify them between tall and small. Also related, feature dynamic can play a role in feature extraction, see PCA example later for instance.

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier as KNN

# Define some arbitrary data
di = {"Heigth": [170,160,110,120], "Weight":[80000,75000,50000,75000], "Size":["Tall", "Tall", "Small", "Small"]}
df = pd.DataFrame.from_dict(di)
print(df)

   Heigth  Weight   Size
0     170   80000   Tall
1     160   75000   Tall
2     110   50000  Small
3     120   75000  Small

The height is in centimeter and the weight is in gram. If we classify this toy data set with KNN we get the following results. We took the first and third element as training samples.

model = KNN(n_neighbors=1)
model.fit(df.iloc[[0,2],:2], df.iloc[[0,2],2])
print(model.predict(df.iloc[:,:2]))

['Tall' 'Tall' 'Small' 'Tall']

We have an error for the last prediction: Why ? We can compute the distance between each samples:

from sklearn.metrics import pairwise_distances as pd
print(pd(df.iloc[:,:2]))

[[    0.          5000.00999999 30000.05999994  5000.24999375]
 [ 5000.00999999     0.         25000.04999995    40.        ]
 [30000.05999994 25000.04999995     0.         25000.002     ]
 [ 5000.24999375    40.         25000.002          0.        ]]

The distance is 5000 between the first and the last samples, which is closer than the distance with the third element. If now we change the unit of the weight to kilogram, what do we get ?

df["Weight"] /= 1000
print(pd(df.iloc[:,:2]))

[[ 0.         11.18033989 67.08203932 50.24937811]
 [11.18033989  0.         55.90169944 40.        ]
 [67.08203932 55.90169944  0.         26.92582404]
 [50.24937811 40.         26.92582404  0.        ]]

We have now a smaller distance between the third and fourth samples, which is correct.

To prevent this kind of problem, a good behavior to scale the data. There are many techniques to do that, in this course, we recommend the most standard one at a first try: the standardization. It consists in removing the mean of each feature and scale it to unit variance. Hence, each feature has a zero mean and unit standard deviation, and therefore a similar scale. The process is simply: \[x_s = \frac{x - \mu}{\sigma}.\] where $\mu$ is the mean of the feature and $\sigma$ is the standard deviation, both estimated on the training set!

You can do it in Scikit-learn using the StandardScaler tool, or any robust version of it.

5.2. Feature reduction

In machine learning, feature extraction or feature selection is the process of selecting or constructing the most relevant features for the considered task from the input variable. It aims to remove redundant or non-informative variables.

If we select a few subset of the original input variables, the process is called feature selection. While if we construct a new subset of variables (e.g., by a linear combination of the original variables), the process is called feature extraction. Furthermore, if we can use some data for the feature selection/extraction, the process supervised, otherwise it is unsupervised.

In this course, we are going to review some very conventional algorithms: one for feature selection, and two for feature extraction.

5.2.1. Feature selection

One of the most simple algorithm, yet effective but time consuming, is the sequential forward feature selection. It can be applied to any supervised regression/classification algorithm. It works as follows:

Starts with an empty pool $F$ of selected features,
Select the feature $f_1$ that provides the best value for the selected criterion and add it to $F$.
Select the feature $f_2$ such that the couple of features $(f_1,f_2)$ provides the best value for the selected criterion and add it to $F$.
Select the feature $f_3$ such that the triplet of features $(f_1,f_2,f_3)$ …
…
The algorithm stops either if the increase of the criterion is too low or if the maximum number of features is reached.

A variant of the algorithm is obtained by sequentially remove one feature from the set of input feature, no surprise it is called sequential backward feature selection. In practice this version if less effective than the forward one.

Let’s see how it works for the California housing data set.

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Open data
california_housing_data = fetch_california_housing()
X, y = california_housing_data.data, california_housing_data.target
feature_names = np.asarray(california_housing_data.feature_names)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y)

We now use the class SequentialFeatureSelector from Scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html#sklearn.feature_selection.SequentialFeatureSelector). We need to also set a regression model. For this example a Knn is used, but any models from scikit is working.

from sklearn.feature_selection import SequentialFeatureSelector as  SFS
from sklearn.neighbors import KNeighborsRegressor as KNR

knn = KNR(n_neighbors=3)
sfs = SFS(knn, n_features_to_select=3, n_jobs=-1)

sfs.fit(X_train, y_train)
print(f"Selected feature: {feature_names[sfs.get_support()]}")

Selected feature: ['MedInc' 'Latitude' 'Longitude']

We can compute the score on the test set and compare it when no feature selection is done.

knn.fit(sfs.transform(X_train), y_train)
print(f"Regression score on the test set with feature selection: {knn.score(sfs.transform(X_test), y_test)}")
knn.fit(X_train, y_train)
print(f"Regression score on the test set without feature selection: {knn.score(X_test, y_test)}")

Regression score on the test set with feature selection: 0.7539353810565258
Regression score on the test set without feature selection: 0.11211148340676591

5.2.2. Feature extraction

In this part, we will cover the famous Principal Component Analysis (PCA), an unsupervised feature extraction algorithm.

PCA is a Linear transformation used to reduce the dimensionality of the data (Jolliffe 2002). The new feature $z_i$ is found by projecting the original data (i.e., a linear combination of), such as the projected feature account for most of the variability of the data:

$z_1,~z_2,~z_3,\ldots$ are mutually uncorrelated (i.e., $\langle z_i, z_j\rangle=0, \forall j\neq i$),
$\text{var}(z_i)$ is as large as possible,
$\text{var}(z_1)>\text{var}(z_2)>\text{var}(z_3)>\ldots$

Figure 16: PCA on synthetic data. The red axis represents the new feature with the maximum of variance.

The PCA algorithm is as follow:

It starts by searching $\mathbf{v}_1$ such as $\max_{\mathbf{v}_1}\text{var}(z_1)$ with
\begin{eqnarray}\label{eq:lagrangian} \text{var}(z_1) & = & \text{var}(\langle\mathbf{v}_1,\mathbf{x}\rangle)\\ &=& \mathbf{v}_1^\top\boldsymbol{\Sigma}\mathbf{v}_1 \end{eqnarray}
This problem is indetermined: if $\hat{\mathbf{v}}_1$ maximizes the variance, $\alpha\hat{\mathbf{v}}_1$ too! Therefore, we add a constraint on $\mathbf{v}_1$: $\langle\mathbf{v}_1,\mathbf{v}_1\rangle=1$. Thus, we have a constraint optimization problem, whose Lagrangian can be written as (Boyd and Vandenberghe 2004):
\begin{eqnarray} \mathcal{L}(\mathbf{v}_1,\lambda_1) = \mathbf{v}_1^\top\boldsymbol{\Sigma}\mathbf{v}_1 + \lambda_1(1- \mathbf{v}_1^\top\mathbf{v}_1) \end{eqnarray}
By computing the derivative w.r.t $\mathbf{v}_1$ and set it to zeros, we have:
\begin{eqnarray} \frac{\partial\mathcal{L}}{\partial\mathbf{v}_1} = 2\boldsymbol{\Sigma}\mathbf{v}_1-2\lambda_1\mathbf{v}_1 \end{eqnarray}
Hence $\mathbf{v}_1$ is an eigenvector of the covariance matrix of $\mathbf{x}$:
\begin{eqnarray} \boldsymbol{\Sigma}\mathbf{v}_1 =\lambda_1\mathbf{v}_1 \end{eqnarray}
To get a maximal variance, $\mathbf{v}_1$ should be the eigenvector corresponding to the largest eigenvalues (to see this, we insert the above equation into \ref{eq:lagrangian})!
\begin{eqnarray} \text{var}(z_1) = \mathbf{v}_1^\top\boldsymbol{\Sigma}\mathbf{v}_1 = \lambda_1 \mathbf{v}_1^\top\mathbf{v}_1 = \lambda_1 \end{eqnarray}
Any linear algebra library has a function to find the eigenvectors/eigenvalues of a square positive definite matrix (see for instance https://numpy.org/doc/stable/reference/generated/numpy.linalg.eigh.html)
Once $\mathbf{v}_1$ if found, the algorithm no searches for $\mathbf{v}_2$ such as $\max\text{var}(z_2)$ and $\langle\mathbf{v}_2,\mathbf{v}_2\rangle=1$ and $\langle\mathbf{v}_1,\mathbf{v}_2\rangle=0$. Again we can write the Lagragian:
\begin{eqnarray} \mathcal{L}(\mathbf{v}_2,\lambda_2,\beta_1) = \mathbf{v}_2^\top\boldsymbol{\Sigma}\mathbf{v}_2 + \lambda_2(1- \mathbf{v}_2^\top\mathbf{v}_2) + \beta_1(0 - \mathbf{v}_2^\top\mathbf{v}_1) \end{eqnarray}
and compute the derivative w.r.t $\mathbf{v}_2$, and try to vanish it:
\begin{eqnarray} \frac{\partial\mathcal{L}}{\partial\mathbf{v}_2} &=& 2\boldsymbol{\Sigma}\mathbf{v}_2-2\lambda_2\mathbf{v}_2-\beta_1\mathbf{v}_1\\ \boldsymbol{\Sigma}\mathbf{v}_2 &=& \lambda_2\mathbf{v}_2+2\beta_1\mathbf{v}_1 \end{eqnarray}
At optimality, $\langle\mathbf{v}_1,\mathbf{v}_2\rangle=0$. Left-multiplying by $\mathbf{v}_1^\top$ the above equation:
\begin{eqnarray} \mathbf{v}_1^\top\boldsymbol{\Sigma}\mathbf{v}_2 &=& 2\beta_1 \\ \lambda_1\mathbf{v}_1^\top\mathbf{v}_2 &=& 2\beta_1 \\ 0 &=& 2\beta_1 \end{eqnarray}
Hence, we have
\begin{eqnarray} \boldsymbol{\Sigma}\mathbf{v}_2 =\lambda_2\mathbf{v}_2. \end{eqnarray}
$\mathbf{v}_2$ is the eigenvector corresponding the second largest eigenvalues. For axis $k\ge 3$, we can repeat the same procedure and get that $\mathbf{v}_k$ is the eigenvector corresponding the $k^{\text{th}}$ largest eigenvalues.

In practice, the eigensolver returns all the eigenvector/eigenvalues in one call. Basically, the algorithm reduces to

Empirical estimation the mean value:
\begin{eqnarray} \boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^n\mathbf{x}_i \end{eqnarray}
Empirical estimation the covariance matrix:
\begin{eqnarray} \boldsymbol{\Sigma} = \frac{1}{n-1}\sum_{i=1}^n(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})^\top \end{eqnarray}
Compute the eigenvalues/eigenvectors and keep the $p$ first… How to choose $p$ ? Take a look to the explained variance, i.e., the variance of the data explained by the $p$ first eigenvectors, it can be computed as

\[\frac{\sum_{i=1}^p\lambda_i}{\sum_{i=1}^d\lambda_i}\]

In practice, we choose $p$ such as a given amount of the variance is kept (e.g., 90% or 95%).
Tips for high dimensional data set: if the number of samples is lower than the number of feature, see (Manolakis, Lockwood, and Cooley 2016) page 420.

We now compare how PCA works on the California housing data (see https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html for a list of options):

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two cases are investigated: same number of feature than with sequential feature selector
# or by thresholing the cummulative variance
for param in [3, 0.90, 0.95, 0.99, 0.999]:
    model = Pipeline([
        ("Scaler", StandardScaler()), # You can try to remove this part and see the difference
        ("Feature Extraction", PCA(n_components = param)),
        ("KNN", KNR(n_neighbors=3))
    ])
    model.fit(X_train, y_train)
    print(f"Regression score with the number of component {model.named_steps['Feature Extraction'].n_components_}: {model.score(X_test, y_test)}")

Regression score with the number of component 3: 0.28924002334403465
Regression score with the number of component 5: 0.47881899995519794
Regression score with the number of component 6: 0.523310120515489
Regression score with the number of component 7: 0.6129552696601799
Regression score with the number of component 8: 0.6368428619732054

Apply the sequential forward feature selection and the PCA on the digit data.
Compare the results in terms of
- Classification accuracy,
- Number of extracted feature,
- Interpretability of the extracted feature.

5.3. TODO Imputation of missing values

6. Bibliography

Boyd, Stephen, and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge university press.

Hastie, T., R. Tibshirani, and J.H. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer. https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12.pdf.

Jolliffe, I.T. 2002. Principal Component Analysis. Springer Series in Statistics. Springer. https://books.google.fr/books?id=_olByCrhjwIC.

Jordahl, Kelsey, Joris Van den Bossche, Martin Fleischmann, Jacob Wasserman, James McBride, Jeffrey Gerard, Jeff Tratner, et al. 2020. “Geopandas/Geopandas: V0.8.1.” Zenodo. doi:10.5281/zenodo.3946761.

Manolakis, D., R. Lockwood, and T. Cooley. 2016. Hyperspectral Imaging Remote Sensing: Physics, Sensors, and Algorithms. Cambridge University Press. https://books.google.fr/books?id=Zeg8DQAAQBAJ.

McKinney, Wes. 2010. “Data Structures for Statistical Computing in Python.” In Proceedings of the 9th Python in Science Conference, edited by Stéfan van der Walt and Jarrod Millman, 56–61. doi:10.25080/Majora-92bf1922-00a.

Mitchell, T. 2006. “The Discipline of Machine Learning.” CMU ML-06 108.

Paszke, A., S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, et al. 2019. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” In Advances in Neural Information Processing Systems 32, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, 8024–35. Curran Associates, Inc. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30.

Footnotes:

This is true if the data are standardized (see 5 for a definition of standardization). An exact derivation is given in (Hastie, Tibshirani, and Friedman 2009), chapter 3.

A careful reader might have seen that a StandardScaler pre-processing has been included in the pipeline, with regularization, this is something require and critical since we generally defined only one weight for all parameters. Hence, we must be sure that each parameter has the same dynamic (range of values) in order to penalized its “high” values in a similar way than the others parameters.

	Fold\(_1\)	Fold\(_2\)	Fold\(_3\)	Fold\(_4\)	Fold\(_5\)
Model\(_1\)	Train	Train	Train	Train	Validation
Model\(_2\)	Train	Train	Train	Validation	Train
Model\(_3\)	Train	Train	Validation	Train	Train
Model\(_4\)	Train	Validation	Train	Train	Train
Model\(_5\)	Validation	Train	Train	Train	Train