# Tutorial 03 – Machine Learning Pipeline

In this tutorial, we will be solving two classic machine learning tasks:
* classification – predicting a label from a predefined set of class labels
* regression – predicting a value from a continuous range

and we will build our first machine learning pipeline using the scikit-learn package.

But before we get to training the model, we first need to preprocess the data. Real-world datasets are not "clean"; they contain features that have to be converted to other data types, missing values (NaN), useless or misleading features, etc. To deal with these imperfections in the dataset, we have to write a bit of code that does the preprocessing. Pipelines allow us to write the code once and reuse it on multiple datasets, or easily swap parts of the preprocessing or even the machine learning models.

We will be working with a real dataset that contains reports about car accidents in the South Moravian Region from 2018. The dataset is provided by Policie ČR and published as open data [here](https://kod.brno.cz/cs_CZ/dataset/dopravni-nehody).

In [None]:
import pandas as pd


def load_dataset():
    accidents = pd.read_csv(
        "https://www.fi.muni.cz/ib031/datasets/accidents-brno.csv",
        parse_dates=["Date"],
    )
    accidents["WeekDay"] = accidents.Date.dt.weekday
    accidents["Month"] = accidents.Date.dt.month
    accidents["Day"] = accidents.Date.dt.day
    accidents["TotalHarmed"] = (
        accidents["DeathToll"] + accidents["HeavyInjuries"] + accidents["LightInjuries"]
    )
    return accidents


accidents = load_dataset()
accidents.head()

In [None]:
accidents.info()

In [None]:
accidents.iloc[0]

There is a total of 22 columns in the dataset, but 10 of them are of type `object` (strings) and 1 is a date. Most machine learning algorithms can work only with numerical or categorical data and do not accept `object` data. We will solve this problem by changing these columns to the `category` type, making them categorical. A categorical column takes on a fixed number of possible values (categories), similarly to an enum in pure Python.
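
To get a feel for what the `category` type does, here is a minimal sketch on a made-up Series (the column name and values are illustrative and not taken from the accidents dataset). Each distinct string becomes one category, and pandas stores a small integer code per row.

```python
import pandas as pd

# Toy column of strings, explicitly stored with dtype `object` (illustrative values).
surface = pd.Series(["rain", "dry", "dry", "fog", "rain"], dtype="object")

# Convert the column to the categorical type.
surface_cat = surface.astype("category")

print(surface_cat.dtype)                  # dtype of the converted column
print(list(surface_cat.cat.categories))   # the fixed set of possible values
print(list(surface_cat.cat.codes))        # integer code assigned to each row
```

The same `astype` call works on a whole selection of DataFrame columns, which is one possible route for the exercise that follows.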

<div class="alert alert-block alert-warning"><h5><b>Exercise 1</b></h5></div>

Convert all columns having `object` type to `category` type.

In [None]:
# TODO: your code goes here...

Let's check that all `object` columns now have type `category`. You should see `category` in the Dtype column of the `info` output if you converted the columns in Exercise 1 correctly.

In [None]:
accidents.info()

---

## Train/Test Split

When training a model (decision trees in this tutorial, but the idea applies to other models as well), we want it to generalize well to unseen data. In other words, we want it to correctly predict labels or values for examples outside of the training data. To assess how well the model generalizes, we need unseen data. One way is to obtain more data (e.g., make new measurements), but that can be too expensive or otherwise impossible. The alternative is to put aside a portion of the data (e.g., 20 %) and train the model on the remaining data. Then, we evaluate the model using some metric on the unseen data (test set) and check whether the results on unseen data are as good as on the training data (train set).

We compare results between the train and test sets to detect two major issues in machine learning – ***overfitting*** and ***underfitting***. Overfitting happens when the model performs well on the train set but has very poor performance on the test set. Underfitting happens when the model has poor performance on both the train and the test set. If you want to read more about overfitting and underfitting, you can look at this [article](https://towardsdatascience.com/overfitting-vs-underfitting-a-complete-example-d05dd7e19765) with nice explanations and figures.
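
The difference between the two failure modes can be sketched on synthetic data. The snippet below is only an illustration under assumed settings (a noisy sine wave, a hand-made 160/40 split, and two arbitrary tree depths); it is not part of the accidents analysis.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine wave (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# The samples are already in random order, so a simple slice gives a fair split.
X_train, y_train = X[:160], y[:160]
X_test, y_test = X[160:], y[160:]

scores = {}
for name, depth in [("underfit", 1), ("overfit", None)]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # score() returns the R^2 coefficient; 1.0 means a perfect fit.
    scores[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"{name:9s} train R2 = {scores[name][0]:.2f}, test R2 = {scores[name][1]:.2f}")
```

The unrestricted tree memorizes the training samples, so its train score is perfect while its test score is lower. The depth-1 tree is too simple to follow the sine wave, so it scores poorly on both sets.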

Note that it is best to split the data early in the process, even before exploratory data analysis, to truly simulate new unseen data. Otherwise, we risk leaking some information from the test set into the training process.

In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(accidents, test_size=0.2, random_state=42)
print(f"Number of examples in train set: {len(train_set)}")
print(f"Number of examples in test set:  {len(test_set)}")

Now we need to separate model inputs and target values we want to predict. For the next exercises, we will try to predict damage caused by an accident.

<div class="alert alert-block alert-warning"><h5><b>Exercise 2</b></h5></div>

Create two new variables `train_inputs` and `train_targets` with data from `train_set`. Variable `train_inputs` will contain all columns except for ***Damage***, and `train_targets` will contain only the single column ***Damage***. Do the same for `test_set` and create another two variables, `test_inputs` and `test_targets`.

In [None]:
# TODO: your code goes here...

# train_inputs, train_targets =
# test_inputs, test_targets =

---

We are now ready to do exploratory analysis without leaking information from test set. Here are basic statistics of numerical data.

In [None]:
train_inputs.describe()

Here is the prevalence of individual categories in some of the categorical columns.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()  # make plots prettier

for column in [
    "DayNight",
    "Type",
    "CausedBy",
    "Alcohol",
    "MainCause",
]:
    sns.countplot(
        data=train_set, y=column, order=train_set[column].value_counts().index
    )
    plt.show()

Here is the distribution of the targets.

In [None]:
sns.displot(data=train_set, x="Damage")

Finally, a "map" of accidents.

In [None]:
train_set.plot(
    kind="scatter",
    x="X",
    y="Y",
    alpha=0.3,
    s=train_targets / 5000,
    label="Damage",
    figsize=(10, 7),
    c="TotalHarmed",
    cmap=plt.get_cmap("jet"),
    colorbar=True,
)

In [None]:
# TODO: here you can do your own analysis on the data

We are ready to work with the dataset. But what if we want to do some preprocessing on the data, e.g., normalization? We would have to apply the same preprocessing to both the train and the test set, do everything twice, and take care not to leak any information from the test set into training. Luckily, [pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) can do just that. They allow you to write transformations once and then easily apply them to the test set. Moreover, pipelines have methods for evaluating machine learning models and can be used with any model from `scikit-learn`.

#### Some useful pipeline methods
* `fit` – calculates parameters needed for transformations and trains the model
* `transform` – transforms the input data if all steps in the pipeline are transformations
* `predict` – returns predictions for the given input data
* `score` – calculates the prediction score
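
The methods above can be tried out on a tiny example. This is a minimal sketch on made-up data (the arrays and step names are assumptions for illustration); it uses a `StandardScaler` and a `DecisionTreeRegressor`, both of which are introduced properly later in this tutorial.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Toy data following y = 2x + 1 (illustrative only).
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 2 * X_train[:, 0] + 1
X_new = np.array([[0.1], [2.9]])
y_new = 2 * X_new[:, 0] + 1

# A pipeline whose steps are all transformations supports `transform`.
scaler_only = Pipeline([("scale", StandardScaler())])
scaler_only.fit(X_train)                    # computes mean and std of X_train
scaled_train = scaler_only.transform(X_train)

# A pipeline that ends with a model supports `predict` and `score`.
model = Pipeline(
    [("scale", StandardScaler()), ("reg", DecisionTreeRegressor(random_state=0))]
)
model.fit(X_train, y_train)                 # fits the scaler, then trains the tree
predictions = model.predict(X_new)          # X_new is scaled with the training statistics
r2 = model.score(X_new, y_new)              # R^2 for regressors, accuracy for classifiers

print(scaled_train.ravel())
print(predictions, round(r2, 3))
```

Note that `fit` is called only on the training data; `transform`, `predict` and `score` then reuse the parameters learned there.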

We will create a pipeline for numerical features. Common transformations include [imputation](https://scikit-learn.org/stable/modules/impute.html#impute) of missing values (NaN) and [standardization](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling), which rescales every feature to zero mean $\mu = 0$ and unit variance $\sigma^2 = 1$.
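
Concretely, standardization replaces each value $x$ of a feature by $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of that feature. The following sketch, on made-up numbers, checks a manual computation against scikit-learn after a median imputation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy feature column with one missing value (illustrative only).
x = np.array([[1.0], [2.0], [np.nan], [4.0], [9.0]])

# Median imputation: the NaN is replaced by the median of the observed values.
x_imputed = SimpleImputer(strategy="median").fit_transform(x)

# Standardization by hand, using the population standard deviation (ddof=0).
z_manual = (x_imputed - x_imputed.mean(axis=0)) / x_imputed.std(axis=0)

# The same computation done by scikit-learn.
z_sklearn = StandardScaler().fit_transform(x_imputed)

print(x_imputed.ravel())
print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(), z_sklearn.std())
```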

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline(
    [("imputer", SimpleImputer(strategy="median")), ("std_scaler", StandardScaler())]
)

To apply the `num_pipeline` on numerical features only, we have to use the [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) class.

In [None]:
from sklearn.compose import ColumnTransformer

num_features = train_inputs.select_dtypes("number").columns
pipeline = ColumnTransformer(
    [
        ("num", num_pipeline, num_features),
    ]
)

In [None]:
pipeline.fit(train_inputs)
num_features_transformed = pipeline.transform(train_inputs)

print(f"Mean: {num_features_transformed.mean():.2f}")
print(f"Standard deviation: {num_features_transformed.std():.2f}")

We can also transform the categorical features using a pipeline. The standard operation for categorical variables is [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), i.e., we create a new binary column for every category with ones in places where categorical variable had the value of that category.

<div class="alert alert-block alert-warning"><h5><b>Exercise 3</b></h5></div>

Create a new pipeline for categorical variables, composed of a [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) with the `most_frequent` strategy and a [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). Then create a new pipeline that transforms both the numerical and the categorical variables using [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=column%20transformer).

In [None]:
# TODO: your code goes here...

# cat_pipeline = ...
# pipeline = ...

In [None]:
train_set_transformed = pipeline.fit_transform(train_inputs)
train_set_transformed.shape  # should be (6127, 80) if the exercise was done correctly

---

## Decision trees

Now we can use a machine learning model to learn how to predict the damage caused by an accident. We will use [decision trees](https://scikit-learn.org/stable/modules/tree.html), which are often associated with classification, although some tree algorithms can do both regression and classification. The algorithm implemented in `scikit-learn` is called `CART`, and it is very similar to the `C4.5` decision tree algorithm described in the lecture.
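
As a rough illustration of how such a tree makes predictions, the sketch below fits a tree with a single split to made-up data. With the default squared-error criterion, each leaf predicts the mean target of the training samples that fall into it. The data and the feature name `x` are assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Two well-separated groups of points (illustrative only).
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([10.0, 12.0, 14.0, 100.0, 110.0, 120.0])

# A tree limited to one split (a "stump").
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

# Text rendering of the learned split and the value predicted in each leaf.
print(export_text(stump, feature_names=["x"]))

# Each prediction is the mean target of the matching leaf.
preds = stump.predict(np.array([[2.5], [11.5]]))
print(preds)
```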

### Regression tree

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error


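def rmse(y_true, y_pred):
    # take the square root of MSE explicitly; the `squared` argument
    # is not available in recent scikit-learn releases
    return mean_squared_error(y_true, y_pred) ** 0.5


def print_score(y_true, y_pred):
    print("Predictions:", list(y_pred[:10]), "...")
    print("Labels:     ", list(y_true[:10]), "...")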
    print(
        f"RMSE: {rmse(y_true, y_pred):.3f}"
    )  # calculate root mean square error (RMSE) of the regressor


reg_pipeline = Pipeline(
    [
        ("transform", pipeline),
        ("reg", DecisionTreeRegressor(random_state=42)),
    ]
)

reg_pipeline.fit(train_inputs, train_targets)  # fit the training set
train_preds = reg_pipeline.predict(
    train_inputs
)  # predict the labels based on train data
print_score(train_targets, train_preds)
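
The RMSE is the square root of the mean of the squared differences between predictions and true values, so it is expressed in the same units as the target. A minimal sketch on made-up amounts (not taken from the dataset) shows the computation by hand and compares it with the mean absolute error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true values and predictions (illustrative only).
y_true = np.array([10_000, 20_000, 30_000, 40_000])
y_pred = np.array([12_000, 18_000, 30_000, 50_000])

errors = y_pred - y_true                      # 2000, -2000, 0, 10000
rmse_manual = np.sqrt(np.mean(errors ** 2))   # root of the mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
mae = np.mean(np.abs(errors))                 # mean absolute error, for comparison

print(f"RMSE (manual):  {rmse_manual:.1f}")
print(f"RMSE (sklearn): {rmse_sklearn:.1f}")
print(f"MAE:            {mae:.1f}")
```

Because the errors are squared before averaging, the single large error dominates the RMSE, which is why it ends up well above the mean absolute error.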

We trained a regressor that has a Root Mean Squared Error of ~2000 CZK on the training set for the damage caused. That should be pretty good, right? Well, let's try on the test set.

In [None]:
test_preds = reg_pipeline.predict(test_inputs)
print_score(test_targets, test_preds)

The results are not so good anymore. The tree ***overfit*** the training data. We can use multiple methods to reduce overfitting, e.g., pruning or feature selection.
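
To show what limiting a tree looks like in code, here is a sketch on synthetic data. The parameter values (`max_depth=3`, `ccp_alpha=0.01`) are arbitrary assumptions for this toy problem and will not transfer directly to the accidents dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine wave (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
X_train, y_train = X[:240], y[:240]
X_test, y_test = X[240:], y[240:]


def rmse_of(model, X, y):
    """Root mean squared error of `model` on the given data."""
    return float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))


candidates = {
    "unrestricted": DecisionTreeRegressor(random_state=0),
    "max_depth=3": DecisionTreeRegressor(max_depth=3, random_state=0),
    # cost-complexity pruning: larger alpha removes more of the tree
    "ccp_alpha=0.01": DecisionTreeRegressor(ccp_alpha=0.01, random_state=0),
}

results = {}
for name, tree in candidates.items():
    tree.fit(X_train, y_train)
    results[name] = (
        tree.get_n_leaves(),
        rmse_of(tree, X_train, y_train),
        rmse_of(tree, X_test, y_test),
    )
    leaves, train_rmse, test_rmse = results[name]
    print(f"{name:15s} leaves={leaves:3d} train RMSE={train_rmse:.3f} test RMSE={test_rmse:.3f}")
```

The unrestricted tree reaches zero training error by giving almost every sample its own leaf. The restricted trees have far fewer leaves, and their train and test errors are typically much closer to each other.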

<div class="alert alert-block alert-warning"><h5><b>Exercise 4</b></h5></div>

Try to train a [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) that has similar RMSE on both train and test set and does not overfit the training data. Experiment with parameters of [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) and try to get a model with RMSE on test set lower than 63000.

In [None]:
# TODO: your code goes here...

---

### Classification tree

Classification trees are similar to regression trees, except that they predict a value from a fixed set of classes instead of a continuous value. We will use the accidents dataset once more to predict the causes of accidents from the data.

In [None]:
def prepare_dataset(dataset, label_column="CausedBy"):
    labels = dataset[label_column]
    data = dataset.drop(
        columns=["ID", "Type", "MainCause", label_column]
    )  # drop columns which are very related to label
    return data, labels


train_inputs_clf, train_labels = prepare_dataset(train_set)
test_inputs_clf, test_labels = prepare_dataset(test_set)

<div class="alert alert-block alert-danger"><h5><b>(Optional) Exercise 5</b></h5></div>

Now try to create a classification pipeline on your own. Try to use both numerical and categorical columns, use standard preprocessing and [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). You can use only some features or do more preprocessing (you can try PCA).

You are predicting the cause of the accident from other features. Do not overfit the training set and try to get an accuracy of 0.9. You can use [pipeline.score](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline.score) to get the score of the model.

In [None]:
# TODO: your code goes here...

# clf_pipeline = ...

In [None]:
clf_pipeline.fit(train_inputs_clf, train_labels)

print(f"Train accuracy: {clf_pipeline.score(train_inputs_clf, train_labels):.4f}")
print(f"Test accuracy : {clf_pipeline.score(test_inputs_clf, test_labels):.4f}")

---

For classification tasks, accuracy is one of the available metrics for quantifying the quality of predictions, but it is not the only one. To see how our model performs class by class, we can plot a confusion matrix.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

sns.reset_orig()
# plot the confusion matrix of the fitted pipeline on the test set
ConfusionMatrixDisplay.from_estimator(clf_pipeline, test_inputs_clf, test_labels)

We can see the class imbalance in the target variable here: most accidents are caused by the car driver (1200 samples). We can compensate for the class imbalance in the confusion matrix by normalizing it.

In [None]:
# normalize="true" divides each row by the number of samples of that true class
ConfusionMatrixDisplay.from_estimator(
    clf_pipeline, test_inputs_clf, test_labels, normalize="true"
)

For explaining the concept of the following metrics, imagine that for each label there are two classes - *positive* and *negative*. Positive means the sample is from that class, negative means it is from some other. It is quite similar to one-hot encoding.

* **True Positives (TP)** - sample is from positive class and predicted as positive
* **True Negatives (TN)** - sample is from negative class and predicted as negative
* **False Positives (FP)** - sample is from negative class and predicted as positive
* **False Negatives (FN)** - sample is from positive class and predicted as negative
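
For a binary problem, the four counts can be read directly from a confusion matrix. The sketch below uses made-up labels, where 1 stands for the positive class and 0 for the negative class.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# With labels ordered as [0, 1], the flattened matrix is (TN, FP, FN, TP).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```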

When dealing with some machine learning problems, we want some of these metrics to be high and some do not matter that much - for example, when predicting a disease, the number of False Negatives should be as low as possible, even if there are some False Positives.

**Sensitivity** (true positive rate, recall) is the proportion of all positives that are correctly identified as positive, TP / (TP + FN).  
**Specificity** (true negative rate) is the proportion of all negatives that are correctly identified as negative, TN / (TN + FP).  
**$F_1$-score** is the harmonic mean of precision (the proportion of predicted positives that are truly positive) and recall, and is very often used when evaluating a model.
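
These definitions can be worked through with plain arithmetic. The counts below are invented for illustration: imagine a test that finds 8 of 10 sick patients and raises 4 false alarms among 90 healthy ones.

```python
# Hypothetical counts for an imbalanced binary problem.
tp, fn = 8, 2     # 10 actual positives
fp, tn = 4, 86    # 90 actual negatives

sensitivity = tp / (tp + fn)   # recall, true positive rate
specificity = tn / (tn + fp)   # true negative rate
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"precision   = {precision:.3f}")
print(f"F1          = {f1:.3f}")
print(f"accuracy    = {accuracy:.3f}")
```

Accuracy looks high here mostly because negatives dominate, while precision and $F_1$ reveal that a third of the positive predictions are wrong.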

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = test_labels
y_pred = clf_pipeline.predict(test_inputs_clf)

print("Precision = TP / (TP + FP)")
print(precision_score(y_true, y_pred, average="micro"))  # calculate total TP, FN, FP
print(
    precision_score(y_true, y_pred, average="macro", zero_division=0)
)  # calculate metrics for each class individually with same weight

print("\nRecall = TP / (TP + FN)")
print(recall_score(y_true, y_pred, average="micro"))
print(recall_score(y_true, y_pred, average="macro"))

print("\nF1 = 2 * (precision * recall) / (precision + recall)")
print(f1_score(y_true, y_pred, average="micro"))
print(f1_score(y_true, y_pred, average="macro"))
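
The difference between the `micro` and `macro` averages is easiest to see on an imbalanced toy example. In this sketch (made-up labels, with class 0 as the majority), the model always predicts the majority class.

```python
from sklearn.metrics import recall_score

# 8 samples of class 0 and 2 samples of class 1; the model always predicts 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

# micro: pool TP and FN over all classes, then compute one recall
micro = recall_score(y_true, y_pred, average="micro")
# macro: compute recall per class, then take the unweighted mean
macro = recall_score(y_true, y_pred, average="macro")

print(f"micro recall = {micro:.2f}")
print(f"macro recall = {macro:.2f}")
```

The micro average is dominated by the majority class, whereas the macro average gives the completely missed minority class the same weight as the majority class.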