Data preparation
E0420
Week 2

Let me analyze already!
•Different types of variables
•Basic diagnostics of variables in dataset are necessary
•Without it, findings can be meaningless/spurious/null!
•

Distribution
•How are values distributed within a sample
•The shape of the distribution determines how we can analyze the data
•Fortunately, majority of values in a sample conform to a well-known distribution
•

Normal/Gaussian distribution
•Also called bell curve
•
https://commons.wikimedia.org/wiki/File:Wechsler.svg

Galton board and the laws of nature

https://www.edumedia-sciences.com/en/media/905-galton-board

Central limit theorem
•The distribution of sums of random variables will resemble normal distribution

https://seeing-theory.brown.edu/probability-distributions/index.html#section3

Specific types of distributions
•Binomial
•Beta distribution
•Gamma distribution
•and other…
https://commons.wikimedia.org/wiki/File:Gamma_distribution_pdf.svg

Basic descriptive terms
•Sum – adding values together
•Mean (M) – sum of values divided by their count
•Mode – most frequently occurring value
•Median – value at the 50% (“in the middle”)
•Standard deviation (SD) – distance of a value from a sample mean
•Variance – squared SD
•Quantile – cut point dividing the range of the distribution into intervals with equal
probabilities
•Minimum – the smallest value
•Maximum – the largest value
•

Please review information from your previous stats course.

Plotting data
•One variable
•Histogram – bars represent meaningful groups of data
•Box plot – box-and-whisker-plot
•Represents minimum, maximum, median, and interquartile range (IQR)
•Box is IQR (25%-75%), whiskers are min/max or 1.5 IQR
•
•Two variables
•Scatterplot
•Represents data points as related to two variables
•

https://commons.wikimedia.org/wiki/File:Example_histogram.png
https://commons.wikimedia.org/wiki/File:Box-Plot_mit_Min-Max_Abstand.png
https://commons.wikimedia.org/wiki/File:Scatter_diagram_for_quality_characteristic_XXX.svg
Box plot
Scatterplot

Scatterplot
•Graphical representation of association between two variables (correlation)
•can add a trendline (a line of best fit) – linear regression
•
•


Non-normal distributions
•Skewness
•
•
•
•Kurtosis
https://commons.wikimedia.org/wiki/File:Relationship_between_mean_and_median_under_different_skewne
ss.png
https://medical-dictionary.thefreedictionary.com/kurtosis

•
https://commons.wikimedia.org/wiki/File:Distribution_of_Annual_Household_Income_in_the_United_State
s_2010.png

Outliers
•Atypical data points (with regards to the sample values)
•Could be due to:
•Contamination (for bio samples)
•Error in data entry
•Just a really atypical case
•

Outliers – why do we care?
•Outliers can have a huge impact on the characteristics of the sample
•
•Example
•Erasmus students in class – 10 students
•
•With outlier:
•M = 25.8
•SD = 15.9
•Median = 21
•
•Without outlier:
•M = 20.8
•SD = 0.83
•Median = 21
#
age
1
20
2
21
3
20
4
22
5
21
6
20
7
22
8
20
9
71
10
21

Identifying outliers
1.“Eyeballing it” - scatterplot
2.Using box plot or histogram
3.Using some cut-off
•+- 2 SDs  or 1.5*IQR -Q1
•4. Using indices for multivariate outliers
https://commons.wikimedia.org/wiki/File:Outlier_statistics.svg

Identifying outliers – graphs
https://en.wikipedia.org/wiki/Box_plot#/media/File:Box-Plot_mit_Interquartilsabstand.png
https://www.itl.nist.gov/div898/handbook/eda/section3/eda33e8.htm
Box plot with 1.5 IQR = everything beyond that is outlier
symmetric with outlier histogram

Mahalanobis distance
•Identifying multivariate outliers – outliers that are distant from a combination of scores
•A point can be a multivariate outlier even if it is not a univariate outlier
•
https://commons.wikimedia.org/wiki/File:Outlier_statistics.svg

Outliers – what should we do?
•Errors in data entry – need to fix
•Extreme values
•Remove?
•Keep in?
•Substitute?
•Depends on the type of data

Outliers?
Yang, S., Puggioni, G., Harlow, L. L., & Redding, C. A. (2017). A Comparison of Different Methods
of Zero-Inflated Data Analysis and an Application in Health Surveys. JMASM Editors, 16(1), 518-543.
Remove only unlikely values
Variables can be transformed
Other analytic techniques can be used (Poisson regression)