# Tutorial 08 – Advanced Preprocessing and Validation

In this tutorial we will look at how to select a subset of features without hurting model performance, how to use PCA for feature engineering, and how to validate your model.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set()  # make plots nicer

## Feature Selection

The goal of feature selection is to identify important features and select only those for model training. Feature selection can serve two purposes:
1. Reducing the data and model size and speeding up computation. This is not a concern for the simple models we are dealing with here, but it can matter in a production environment where results must be delivered on time.
2. Improving model performance. Not all models benefit from this, but some perform better with fewer features. This is often the case when correlated features carry duplicated information. Models that rely on distance computations might also benefit from fewer features.
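
The effect on distance-based models can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration (the generated clusters, the 50 noise columns, and the helper `knn_accuracy`); it is not the Titanic data used in the rest of the tutorial.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two well-separated clusters described by 2 informative features.
X_informative, y = make_blobs(
    n_samples=400, centers=[[-2, -2], [2, 2]], cluster_std=1.0, random_state=0
)

# 50 extra columns of pure noise that carry no information about y.
X_noise = rng.normal(scale=5.0, size=(400, 50))
X_noisy = np.hstack([X_informative, X_noise])


def knn_accuracy(X, y):
    """Hypothetical helper: hold-out accuracy of a default KNN classifier."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    return KNeighborsClassifier().fit(X_train, y_train).score(X_test, y_test)


acc_informative = knn_accuracy(X_informative, y)
acc_noisy = knn_accuracy(X_noisy, y)
print(f"informative features only: {acc_informative:.2f}")
print(f"with 50 noise features:    {acc_noisy:.2f}")
```

The noise columns dominate the Euclidean distances, so the nearest neighbours are no longer determined by the two features that actually separate the classes.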

Let's load the Titanic data and see how feature selection affects a KNN classifier.

In [None]:
titanic = sns.load_dataset("titanic")
# drop redundant columns
titanic = titanic.drop(columns=["embarked", "who", "class", "alive"])
titanic.head()

In [None]:
titanic.info()

Some features have missing values. Unfortunately, there is no simple way of imputing categorical features here: the encoders require all values to be present, and the KNN imputer cannot impute categorical data. Therefore, let's just create a new category for missing values.

In [None]:
titanic["deck"] = titanic.deck.cat.add_categories("missing")
titanic["deck"] = titanic.deck.fillna("missing")
titanic["embark_town"] = titanic.embark_town.fillna("missing")
titanic["embark_town"] = titanic.embark_town.astype("category")

In [None]:
titanic.info()

We still have some missing ages in the data that will need to be imputed. Let's create the train and test sets and prepare a pipeline.

In [None]:
from sklearn.model_selection import train_test_split

titanic_X, titanic_y = titanic.drop(columns="survived"), titanic.survived

titanic_train_X, titanic_test_X, titanic_train_y, titanic_test_y = train_test_split(
    titanic_X, titanic_y, test_size=0.2, random_state=42
)

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)
from sklearn.impute import KNNImputer
from sklearn.compose import make_column_transformer, make_column_selector

knn_pipeline = make_pipeline(
    make_column_transformer(
        (OrdinalEncoder(), ["sex"]),
        (OneHotEncoder(), ["deck", "embark_town"]),
        remainder="passthrough",
    ),
    KNNImputer(),
    StandardScaler(),
    KNeighborsClassifier(),
)

knn_pipeline.fit(titanic_train_X, titanic_train_y)
print(round(knn_pipeline.score(titanic_test_X, titanic_test_y), 2))

<div class="alert alert-block alert-warning"><h5><b>Exercise 1</b></h5></div>

Modify the `knn_pipeline` by adding feature selection that keeps only a few 'most useful' features. This can be achieved using the [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) class from `scikit-learn`. Add it as another step in the pipeline after the imputer. For this exercise, let's set $k = 8$, i.e., select the 8 'most useful' features, and compute the score on the test set.

You should obtain a score (accuracy) of $0.82$.

In [None]:
# TODO: your code goes here...

Of course, setting $k=8$ is a bit arbitrary. Ideally, we would like to have some supporting evidence for choosing a specific value of $k$. One way is to test all possible values of $k$ and see which one maximizes the score. It is not that straightforward, though. Optimizing the score of a classifier on the test data could lead to over-fitting to the test data. We want the model to perform well on the test data, but the test data is itself just a small sample of reality, and tailoring the model to that specific sample will not give us a good estimate of how the model will perform on completely new, unseen data.

We need an intermediate step: setting aside part of the training data for so-called validation. The validation dataset is not accessible to the model during training, but we can use it for tuning the model's parameters, in this case changing $k$ to maximize the model's performance on unseen validation data. The test data is then used to estimate how the model with already tuned parameters performs on new data that it might encounter in the wild.
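
The train/validation/test workflow can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the data comes from `make_classification`, and the tuned parameter is the number of neighbours of a KNN classifier rather than the $k$ of the feature selector.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the Titanic data).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# First split off the test set, then split the rest into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=0
)

# Tune a parameter (here the number of neighbours) on the validation set only.
candidates = [1, 3, 5, 7, 9, 11]
val_scores = {
    n: KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train).score(X_val, y_val)
    for n in candidates
}
best_n = max(val_scores, key=val_scores.get)

# The test set is used once, after tuning is finished.
final_model = KNeighborsClassifier(n_neighbors=best_n).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(f"best n_neighbors: {best_n}, validation: {val_scores[best_n]:.2f}, test: {test_score:.2f}")
```

The validation score of the chosen setting tends to be slightly optimistic because the setting was picked to maximize it, which is why the final number is reported on the untouched test set.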

<div class="alert alert-block alert-warning"><h5><b>Exercise 2</b></h5></div>

Use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to set aside 20 % of the training data as validation data. Use `random_state=42` for reproducibility.

In [None]:
# TODO: titanic_train_X, titanic_validation_X, titanic_train_y, titanic_validation_y = your code goes here...

<div class="alert alert-block alert-warning"><h5><b>Exercise 3</b></h5></div>

Compute the score of the modified `knn_pipeline` on the validation dataset for all possible values of $k$ selected features, and find the best $k$ and `score`. Note that there are **20** features after encoding that could be selected.

The best score on the validation dataset should be $0.86$, achieved by $k=8$.

In [None]:
# TODO: your code goes here...

Yellowbrick has a visualization prepared exactly for this use case. It even uses cross-validation instead of a fixed validation set, so you do not need to create one yourself. You just need to specify which parameter should be tuned and which values should be tested. Notice that parameter names of steps in a pipeline are prefixed by the name of the step.
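
The naming rule for pipeline parameters can be checked with a small sketch. The pipeline `demo_pipeline` below is hypothetical and exists only to show the `<step name>__<parameter>` convention that `make_pipeline` produces.

```python
from sklearn.feature_selection import SelectKBest
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline names each step after its lowercased class name.
demo_pipeline = make_pipeline(StandardScaler(), SelectKBest(k=3), KNeighborsClassifier())
step_names = [name for name, _ in demo_pipeline.steps]
print(step_names)

# A parameter of a step is addressed as "<step name>__<parameter name>".
demo_pipeline.set_params(selectkbest__k=5)
selected_k = demo_pipeline.get_params()["selectkbest__k"]
print(selected_k)
```

Calling `demo_pipeline.get_params().keys()` lists every name that can be tuned this way.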

**Note:** Google Colab users might need to install a newer version of Yellowbrick with `!pip install yellowbrick==1.3`.

In [None]:
from yellowbrick.model_selection import ValidationCurve

# `knn_select_pipeline` is the pipeline with the SelectKBest step
# created in the exercises above.
viz = ValidationCurve(
    knn_select_pipeline,
    param_name="selectkbest__k",  # parameter `k` of the `selectkbest` step
    param_range=np.arange(1, 21),  # try k = 1, ..., 20
    cv=10,
    scoring="f1",
)

# Fit and show the visualizer
viz.fit(titanic_train_X, titanic_train_y)
viz.show()

Now that we know that eight features are enough and that using only them results in a better-performing model, it is natural to ask: "What are those eight features?"

<div class="alert alert-block alert-danger"><h5><b>(Optional) Exercise 4</b></h5></div>

Find out which features were selected by `SelectKBest` with $k=8$. Unfortunately, there is no easy way of doing this. You can access a boolean mask of the selected features using the `get_support` method of `SelectKBest`. However, one-hot encoding adds a bunch of new columns, so you need to inspect the input `titanic_train_X` values and the transformed values to figure out the relationships. **Write the feature names as a comment in the cell below**.

In [None]:
# TODO: your code goes here...

## Feature Engineering

Sometimes it is useful to define new features if the model is not capable enough (it has high bias). There is no single recipe for creating new features, but you can try various combinations of existing ones. This can include math operations such as multiplying features together, other linear and non-linear combinations, or domain knowledge, as in the case of the `adult_male` feature in the Titanic dataset, which is engineered from the features `sex` and `age`.
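
A minimal sketch of such hand-made features in pandas follows. The column names mirror the Titanic columns `sibsp`, `parch` and `fare`, but the three rows and the new features `family_size` and `fare_per_person` are hypothetical examples, not features used elsewhere in this tutorial.

```python
import pandas as pd

# A tiny hand-written table standing in for the real data.
toy = pd.DataFrame(
    {
        "sibsp": [1, 0, 3],
        "parch": [0, 0, 2],
        "fare": [7.25, 71.28, 21.08],
    }
)

# Linear combination of existing features: the passenger plus relatives aboard.
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

# Non-linear combination: ratio of two features.
toy["fare_per_person"] = toy["fare"] / toy["family_size"]

print(toy)
```

Whether such a feature helps depends on the model and the data, so it should be checked on validation data like any other modelling choice.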

One common way is to create an embedding of existing features. A commonly used embedding consists of the principal components obtained using PCA. Apart from combining features, PCA can also reduce the dimensionality of the data if we limit the number of components. Let's load a new dataset and test PCA.

In [None]:
sonar = pd.read_csv("https://www.fi.muni.cz/ib031/datasets/sonar.csv")
sonar.head()

The dataset contains the strengths of 60 sound frequencies after a sonar pulse bounces off either a rock or a sea mine. The goal is to classify which object the sonar pulse bounced off.

In [None]:
sonar_X, sonar_y = sonar.drop(columns="label"), sonar.label

sonar_train_X, sonar_test_X, sonar_train_y, sonar_test_y = train_test_split(
    sonar_X, sonar_y, test_size=0.2, random_state=42
)

Using a decision tree classifier with maximum depth limited to 3 results in the following accuracy.

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt_pipeline = make_pipeline(DecisionTreeClassifier(max_depth=3))
dt_pipeline.fit(sonar_train_X, sonar_train_y)
round(dt_pipeline.score(sonar_test_X, sonar_test_y), 2)

The data has many features, but they do not contain that much information. We can improve the model by applying PCA to the features.
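
To see what PCA does before applying it to the sonar data, here is a minimal sketch on synthetic data. The data is an assumption made for illustration: five observed features generated from only two hidden factors plus a little noise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 5 observed features that are noisy linear mixtures of 2 hidden factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(scale=0.05, size=(200, 5))

# Keep only the first 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance retained by the kept components.
explained = pca.explained_variance_ratio_.sum()
print(X.shape, "->", X_reduced.shape)
print(f"explained variance: {explained:.3f}")
```

Each principal component is a linear combination of all original features, so the new columns are engineered features rather than a subset of the old ones.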

<div class="alert alert-block alert-warning"><h5><b>Exercise 5</b></h5></div>

Compute the first 10 principal components using [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), select the 4 best of them using [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html), and use them to train a new decision tree.

You should now get a score of $0.79$ on the test set.

In [None]:
# TODO: your code goes here...

However, the selection of 10 components and 4 best features is again arbitrary and smells like over-fitting on the test set. Let's use K-fold cross-validation to get a more reliable estimate of how capable the classifier would be on new data.
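
What K-fold cross-validation does under the hood can be sketched with an explicit loop over folds. This is an illustration on synthetic data with a plain `KFold` splitter, both of which are assumptions; helper functions in scikit-learn automate this loop and, for classifiers, typically use stratified folds instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the sonar data).
X, y = make_classification(n_samples=100, n_features=8, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
test_counts = np.zeros(len(X), dtype=int)  # how often each sample is tested

for train_idx, test_idx in kfold.split(X):
    # Train on 4 folds, evaluate on the remaining fold.
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
    test_counts[test_idx] += 1

fold_scores = np.array(fold_scores)
print(f"mean accuracy: {fold_scores.mean():.2f} (+/- {2 * fold_scores.std():.2f})")
```

Every sample is used for testing exactly once, so the mean score depends much less on one lucky or unlucky split than a single hold-out score does.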

<div class="alert alert-block alert-warning"><h5><b>Exercise 6</b></h5></div>

Use the [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function to compute the **mean accuracy (score) across 10-fold cross-validation** on the sonar dataset. You can also get an approximate 95 % confidence interval as two standard deviations on either side of the mean.

If you use the pipeline from the previous exercise, you should get a mean accuracy of 0.66 (+/- 0.24), hinting that the result from the previous exercise was too optimistic.

In [None]:
# TODO: your code goes here...

If you would like to find the best model parameters correctly, you should use grid search with cross-validation. We will look at it in a future tutorial.