# Tutorial 02 – Exploratory Data Analysis

Exploratory data analysis (or initial data analysis) is the task of assessing and understanding the data. One can describe the data using statistics (e.g., mean, standard deviation, and quantiles), distributions (e.g., normal, exponential, and log-normal), and plots (e.g., histograms, box plots, and scatter plots). All of these techniques aim to provide a simple yet comprehensive picture of what is actually in the data, how noisy the data is, and which model assumptions the data might violate. We do not have any model yet, so we will focus on data understanding and quality assessment.

Let's use `Pandas` to load some data and explore it a bit.

In [None]:
import pandas as pd

In [None]:
datasaurus = pd.read_csv(
    "https://www.fi.muni.cz/ib031/datasets/datasaurus.csv",
    header=0,  # row 0 of the file is treated as a header and replaced by `names`
    names=["x", "y"],  # set custom column names
)
datasaurus

The dataset is fairly small, only 141 observations and 2 features per observation. We can now calculate basic statistics of each feature.

In [None]:
datasaurus.describe()

Both features live roughly between 0 and 100. Feature `x` seems to be shifted slightly towards 100. We can also calculate the correlation of the two features.

In [None]:
datasaurus.corr()

So far, the features seem fairly random and only weakly correlated. But you should never trust summary statistics alone; always plot your data. Here is why...

In [None]:
datasaurus.plot(kind="scatter", x="x", y="y")
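The same point can be made with a small synthetic example that is separate from the datasaurus data. In the sketch below, `y` is completely determined by `x`, yet the Pearson correlation is essentially zero, because correlation captures only linear relations. The variable names and the quadratic relation are chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# x is symmetric around zero and y depends on x deterministically
x = np.linspace(-1, 1, 201)
toy = pd.DataFrame({"x": x, "y": x**2})

# Pearson correlation measures only linear dependence
correlation = toy["x"].corr(toy["y"])
print(f"correlation between x and y: {correlation:.3f}")
```

A scatter plot of `toy` would reveal the parabola immediately, while the correlation alone suggests there is nothing to see.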

Images are a very natural way of representing things for humans, and we are well adapted to seeing patterns (even when there are no patterns in the data). However, choosing the right kind of plot (and its parameters) is a bit tricky. Here are examples of plots you might use.

#### Scatter plot
The plot you have just created above is called a scatter plot. Each observation is represented as a dot in a 2D plane. The coordinates of the dot are the values of two chosen features of the observation. This plot is useful for assessing the relation between two features.

#### Line plot
Each feature is represented by a curve that follows changes in its value. This plot is most useful for time series data or data with some other natural progression. Sorting values might help in some cases, but not in this one.

In [None]:
datasaurus.plot(kind="line")
# sort data by value of `y` and reindex the data
datasaurus.sort_values(by=["y"]).reset_index(drop=True).plot(kind="line")

#### Histograms
A histogram partitions the value range into smaller intervals (bins) and counts how many values fall into each interval. Each bin is then visualized as a single column whose height is the number of values in the bin. Histograms are especially useful for getting an idea of the feature distribution. It can take years of experience to recognize distributions from the mean, standard deviation, and quantiles, but it takes seconds to do that with histograms. Histograms have one important parameter: the number of bins. Using only a few bins "smooths out" the distribution and might hide useful details, while using too many bins might reduce clarity and produce unwanted artifacts due to random noise.

In [None]:
datasaurus.hist()
datasaurus.hist(bins=3)
datasaurus.hist(bins=50)
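To see the effect of the bin count in numbers rather than pictures, here is a minimal sketch on synthetic data (not the datasaurus features). It assumes a made-up bimodal sample and uses `numpy.histogram`, which returns the bin counts without drawing anything.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
# synthetic bimodal sample: two groups centred at 20 and at 60
values = np.concatenate([rng.normal(20, 3, 500), rng.normal(60, 3, 500)])

# same data, different numbers of bins
coarse_counts, coarse_edges = np.histogram(values, bins=2)
fine_counts, fine_edges = np.histogram(values, bins=20)

print("2 bins: ", coarse_counts)
print("20 bins:", fine_counts)
```

With 2 bins the data looks like two equally tall columns and the gap between the groups cannot be seen. With 20 bins the gap typically shows up as nearly empty bins in the middle of the range.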

#### Box plot
Box plots are a classical way of representing distributions. The bottom and top of the 'box' are the 1st quartile (Q1) and the 3rd quartile (Q3). The line inside the box is the 2nd quartile (Q2, which is also the median). The 'whiskers' (lines above and below the box) typically extend to the most extreme data points that still lie within 1.5 times the interquartile range (IQR = Q3 - Q1) of the box edges. The dots below and above the whiskers are often labeled as outliers, i.e., unlikely observations given the distribution.

In [None]:
datasaurus.plot(kind="box")
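The quantities behind a box plot can also be computed by hand. The following is a minimal sketch on a made-up sample, assuming the common 1.5 × IQR convention for whiskers described above; plotting libraries may compute quartiles with slightly different interpolation rules.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # made-up sample

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# fences lie 1.5 * IQR beyond the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# whiskers end at the most extreme points that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {lower_whisker} to {upper_whisker}, outliers: {outliers}")
```

Here the value 50 falls outside the upper fence, so it would be drawn as a separate dot above the upper whisker.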

## Robomission dataset
For a basic description of data structure see the [GitHub page](https://github.com/adaptive-learning/adaptive-learning-research/tree/master/data/robomission-2019-12).

In [None]:
attempts = pd.read_csv(
    "https://www.fi.muni.cz/ib031/datasets/attempts.csv", parse_dates=[3]
)
attempts.head()

Let's see how well the students are doing, i.e., what their solving times are, on the first problem in the system (problem id 51).

In [None]:
attempts[attempts.problem == 51].time.hist()

That is weird, right? It seems like it takes everybody the same amount of time to solve the problem, because there is only a single column. Well, not really.

<div class="alert alert-block alert-warning"><b>Exercise 1</b></div>

Filter out some rows and draw a more representative histogram of solving **times** for **problem 51** that looks similar to [this](https://www.fi.muni.cz/ib031/tutorial02/img/problem51.png). Think of reasons why there is only a single column in the histogram above.

In [None]:
attempts[(attempts.problem == 51) & (attempts.time < 60 * 2)].time.hist()
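One likely reason for a single-column histogram is a few extreme values that stretch the value range. The sketch below reproduces the effect on made-up solving times (not the attempts dataset); the threshold of 120 mirrors the two-minute cut-off used above.

```python
import numpy as np

# made-up solving times: 100 ordinary values and a single extreme one
times = np.concatenate([np.arange(5, 105), [100_000]])

counts, edges = np.histogram(times, bins=10)
print(counts)  # nearly everything lands in the first bin

# restricting the range reveals the shape of the bulk of the data
filtered_counts, filtered_edges = np.histogram(times[times < 120], bins=10)
print(filtered_counts)
```

The bins are equally wide and must cover the whole range up to the extreme value, so all ordinary values end up in one bin.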

---

From the histogram and from the literature, it seems that the solving time follows a log-normal distribution, i.e., the logarithm of the solving time is normally distributed. Let's test whether that is true.

<div class="alert alert-block alert-warning"><b>Exercise 2</b></div>

1. Create a new column with the natural logarithm of the values in column `time`. Do not forget to handle zero values, whose logarithm is not defined.
2. Test the hypothesis that the natural logarithm of the solving time for **problem 51**, for students that have actually **solved** the problem, is normally distributed. Use the function `normaltest` from `scipy` to do the test. The expected test statistic is either 1774.3614734433863 or 1960.9834551674717, depending on how you handle zero values. The expected p-value is 0.
3. Based on the p-value of the test and the example in the `scipy` documentation for the `normaltest` function, decide whether the log solving times are normally distributed or not.

In [None]:
import numpy as np
from scipy.stats import normaltest

In [None]:
# log1p computes log(1 + time), which stays defined for time == 0
attempts["log_time"] = np.log1p(attempts["time"])
normaltest(attempts[(attempts.problem == 51) & (attempts.solved)].log_time)
# The p-value is very small => we reject the null hypothesis of normality
# => the log times are not normally distributed.
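To get a feeling for how `normaltest` behaves, here is a hedged sketch on synthetic data rather than on the attempts dataset. It assumes a sample drawn from a log-normal distribution with arbitrarily chosen parameters and compares the test on the raw values with the test on their logarithms.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)  # arbitrary fixed seed
# synthetic sample drawn from a log-normal distribution
sample = rng.lognormal(mean=3.0, sigma=0.5, size=2000)

raw_stat, raw_p = normaltest(sample)  # raw values are right-skewed
log_stat, log_p = normaltest(np.log(sample))  # logs are normal by construction

print(f"raw: statistic={raw_stat:.1f}, p-value={raw_p:.3g}")
print(f"log: statistic={log_stat:.1f}, p-value={log_p:.3g}")
```

Keep in mind that with thousands of observations the test can reject even mild deviations from normality, so a histogram of the log times is a useful complement to the p-value.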

---

To get statistically reliable results from these tests, or from any kind of analysis, we need enough data. Let's check that we have enough data collected for each of the problems in the dataset.

<div class="alert alert-block alert-warning"><b>Exercise 3</b></div>

Compute the number of occurrences of each problem in the data and plot the distribution using a bar plot. Make sure to order the problems from the most common to the least common for better readability of the plot. The resulting plot should look like [this](https://www.fi.muni.cz/ib031/tutorial02/img/problem_counts.png).

In [None]:
attempts.problem.value_counts().plot(kind="bar")
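The ordering comes for free because `value_counts` sorts its result in descending order by default. A minimal sketch on made-up problem ids shows this:

```python
import pandas as pd

# made-up problem ids, one entry per attempt
problems = pd.Series([51, 17, 51, 8, 51, 17])

counts = problems.value_counts()  # sorted from most to least common
print(counts)
```

The index of `counts` holds the problem ids and the values hold the number of attempts, which is the layout a bar plot needs.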

---

This plot is neither visually pleasing nor informative. There is no title (What does the plot show?), the axes are missing labels (What kind of numbers are on the y-axis?), and the tick labels overlap (What is the third number from the left on the x-axis?).

All of these parameters can be changed using `matplotlib`'s `Axes` object, which the `plot` method returns. All of the plots so far were actually drawn by `matplotlib`; luckily, pandas provides a high-level API, so you don't have to draw individual lines. Let's improve the plot.

In [None]:
# random data similar to previous exercise just to not spoil the solution
attempt_counts_data = pd.Series(np.random.randint(0, 8000, 84)).sort_values(
    ascending=False
)
ax = attempt_counts_data.plot(kind="bar", figsize=(15, 5))
ax.set_title("Number of attempts per problem")  # plot title
ax.set_ylabel("Number of attempts")  # y-axis (vertical) label
ax.set_xlabel("Problem id")  # x-axis (horizontal) label
ax.tick_params(
    axis="x", rotation=0, labelsize=8
)  # make tick labels smaller and upright

<div class="alert alert-block alert-warning"><b>Exercise 4</b></div>

Use the axis modification example above to improve the plot from Exercise 3.

In [None]:
ax = attempts.problem.value_counts().plot(kind="bar", figsize=(15, 5))
ax.set_title("Number of attempts per problem")
ax.set_ylabel("Number of attempts")
ax.set_xlabel("Problem id")
ax.tick_params(axis="x", rotation=0, labelsize=8)

---

Much better! Now this plot might be somewhat useful. You can also save the plot for later. To do so, import `pyplot` and call its `savefig` function. It saves the current figure, which in a notebook is everything that has been plotted in the cell.

In [None]:
import matplotlib.pyplot as plt

# just plot it again
ax = attempt_counts_data.plot(kind="bar", figsize=(15, 5))
ax.set_title("Number of attempts per problem")  # plot title
ax.set_ylabel("Number of attempts")  # y-axis (vertical) label
ax.set_xlabel("Problem id")  # x-axis (horizontal) label
ax.tick_params(
    axis="x", rotation=0, labelsize=8
)  # make tick labels smaller and upright

# the file format is automatically recognised from file extension
plt.savefig(
    "my_plot.png",  # file name of the saved image
    dpi=300,  # resolution of the produced image
    bbox_inches="tight",  # crop the whitespace around the plot
)
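If you prefer not to rely on the "current figure", a possible alternative is to create the figure explicitly and call `savefig` on the figure object itself. The sketch below uses made-up data and writes into an in-memory buffer instead of a file, purely so that it leaves nothing on disk; passing a file name works the same way.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

counts = pd.Series([5, 3, 1], index=["a", "b", "c"])  # made-up data

fig, ax = plt.subplots(figsize=(4, 3))  # create the figure explicitly
counts.plot(kind="bar", ax=ax)  # draw the pandas plot onto our axes
ax.set_title("Toy counts")

buffer = io.BytesIO()  # in-memory stand-in for a file on disk
fig.savefig(buffer, format="png", dpi=100, bbox_inches="tight")

print(f"saved {buffer.getbuffer().nbytes} bytes")
```

Because there is no file extension to infer the format from, the `format` argument is given explicitly here.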

Matplotlib is a huge library and while it can do a lot of things, it's far from intuitive and easy. Luckily, there are higher-level plotting libraries like `seaborn` or `plotly`. We will focus on `seaborn`.

In [None]:
import seaborn as sns  # import seaborn

sns.set()  # make plots magically prettier :)

sns.scatterplot(
    data=attempts,  # dataframe to take values from
    x="n_edits",  # column name to use as x-axis
    y="n_executions",  # column name to use as y-axis
    hue="problemset",  # column name whose value will decide point color
    palette="viridis",  # color palette to use
    alpha=0.5,  # make points a bit translucent to better see the density of points
)

Apart from making plots nicer, `seaborn` also comes with handy plotting functions, both for the plots introduced earlier and for other plots not supported by the `plot` method of `Pandas`.

The first such function is `pairplot`. It plots all columns of a data frame against each other using scatter plots and shows each column's distribution using kernel density estimation (a smoothed version of a histogram), all in one grid. Let's load the toy weather dataset from tutorial 01 and try it out.

In [None]:
weather = pd.read_csv("https://www.fi.muni.cz/ib031/datasets/weather.csv")

In [None]:
sns.pairplot(weather, vars=["temperature", "humidity", "windy"], hue="play")
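The smooth curves on the diagonal of the pair plot are kernel density estimates. As a rough illustration of the idea, the sketch below uses `gaussian_kde` from `scipy` on synthetic data; how `seaborn` computes its estimate internally may differ in details such as the bandwidth, so treat this only as a conceptual example.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)  # arbitrary fixed seed
sample = rng.normal(loc=0.0, scale=1.0, size=500)  # synthetic data

kde = gaussian_kde(sample)  # one Gaussian bump per observation, averaged
grid = np.linspace(-6, 6, 601)
density = kde(grid)  # smooth density estimate evaluated on the grid

# a density should integrate to (approximately) one
area = density.sum() * (grid[1] - grid[0])
print(f"area under the estimated density: {area:.3f}")
```

Unlike a histogram, the estimate has no bin edges, but it has a bandwidth parameter that plays a similar smoothing role.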

<div class="alert alert-block alert-warning"><b>Exercise 5</b></div>

Use `pairplot` to plot columns **time**, **n_edits**, and **n_executions** from the attempts dataset and change the hue based on whether students **solved** **problem 17**. The result should look like [this](https://www.fi.muni.cz/ib031/tutorial02/img/attempts_pairplot.png). Are there any relations between time, number of edits, and number of executions?

In [None]:
sns.pairplot(
    attempts[attempts.problem == 17],  # plot data about problem 17
    vars=["time", "n_edits", "n_executions"],  # what columns to plot against each other
    hue="solved",  # change color based on student ability to solve problem
)
# the number of edits and the number of executions look correlated

---

Let's look at heatmaps, another plot not available in `Pandas`.

#### Heatmap
Heatmaps are used to visualize function values either in a plane or on a discrete grid. The color is based on the function value. Humans can easily see "hot spots" in the data, where the function value differs from its surroundings.

As an example, let's plot number of transported passengers through months and years. The plot illustrates the peak in summer months.

In [None]:
flights = sns.load_dataset("flights")  # load seaborn's built-in example dataset
# create pivot table (keyword arguments are required in recent pandas versions)
flights = flights.pivot(index="month", columns="year", values="passengers")
ax = sns.heatmap(flights)  # visualize
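The pivot step is what turns a long table into the grid that a heatmap needs. Here is a minimal sketch on a tiny made-up table (the numbers are invented and are not taken from the flights dataset):

```python
import pandas as pd

# made-up long-format data: one row per (year, month) pair
long_format = pd.DataFrame(
    {
        "year": [1950, 1950, 1951, 1951],
        "month": ["Jan", "Jul", "Jan", "Jul"],
        "passengers": [10, 30, 12, 35],
    }
)

# rows become months, columns become years, cells hold passenger counts
wide_format = long_format.pivot(index="month", columns="year", values="passengers")
print(wide_format)
```

Each cell of `wide_format` then corresponds to one colored square in the heatmap.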

When collecting data, we are often interested in patterns in time. There are two typical goals of exploratory data analysis for time-series data. 
1. We are interested in seeing patterns in progression of some variable in time that will help us in forecasting future values (e.g., stock markets, weather, and machine failures).
2. We are interested in differences in data collected at different points in time and whether they share the same characteristics (e.g., Is the traffic data similar in the morning and in the evening?).

<div class="alert alert-block alert-danger"><b>(Optional) Exercise 6</b></div>

Let's explore the stock markets! We want to visualize stock prices throughout year 2017 and find out if we could see any trends. Then, we want to compare how Google stock prices evolved in each month. Finally, we want to explore if there are any patterns in stock price daily changes on different days of the week.

1. Load dataset with [sample of real stocks data](https://www.fi.muni.cz/ib031/datasets/stocks_sample.csv).
2. Make sure that column `date` has been converted to type `datetime` and is not interpreted as a string!
3. Create three new columns called `weekday`, `month`, and `day` with the number of the week day, the number of the month, and the day number respectively. See the `Pandas` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetimelike-properties) on how to extract this information from the date column.
4. Use `seaborn`'s `lineplot` to plot stock prices throughout 2017 at market closing. The plot should look like [this](https://www.fi.muni.cz/ib031/tutorial02/img/stocks_year.png).
5. Use the same function to plot the progression of Google's stock (stocks with name 'GOOGL') in each of the months. The plot should look like [this](https://www.fi.muni.cz/ib031/tutorial02/img/google_stocks.png).
6. Create a new column with the differences between closing and opening stock prices.
7. Visualize the differences for each weekday using `Seaborn`'s `boxplot` function. The plot should look like [this](https://www.fi.muni.cz/ib031/tutorial02/img/stocks_weekday.png).
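Steps 2 and 3 can be tried out on a tiny made-up table first. The sketch below assumes columns named `date` and `close`; the rows are invented for illustration and are not taken from the stocks sample.

```python
import pandas as pd

# made-up rows; the column names are assumptions for this example
toy = pd.DataFrame(
    {"date": ["2017-01-02", "2017-01-06", "2017-02-15"], "close": [10.0, 10.5, 11.2]}
)

toy["date"] = pd.to_datetime(toy["date"])  # strings -> datetime64
toy["weekday"] = toy["date"].dt.weekday  # Monday is 0, Sunday is 6
toy["month"] = toy["date"].dt.month
toy["day"] = toy["date"].dt.day
print(toy)
```

The `.dt` accessor only works on datetime columns, which is why the conversion in step 2 has to come first.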

In [None]:
# 1, 2 (parse_dates converts column 0 to datetime)
stocks = pd.read_csv(
    "https://www.fi.muni.cz/ib031/datasets/stocks_sample.csv", parse_dates=[0]
)
# 3
stocks["weekday"] = stocks.date.dt.weekday
stocks["month"] = stocks.date.dt.month
stocks["day"] = stocks.date.dt.day
# 4
sns.lineplot(data=stocks, x="date", y="close", hue="name")
plt.show()
# 5
sns.lineplot(data=stocks[stocks.name == "GOOGL"], x="day", y="close", hue="month")
plt.show()
# 6
stocks["difference"] = stocks["close"] - stocks["open"]
# 7
sns.boxplot(data=stocks, x="weekday", y="difference", hue="name")