Basics of quantitative methodology E2040 Week 4 2024 Study objectives •At the end of this lesson, student will be able to: 1.Understand the basics of quantitative methodology 2.List basic types of variables 3.Know basic types of data visualization 4.Know basic types of summary statistics Types of variables Categorical variables •Nominal – qualitative values, no ordering •sex, ethnicity, birth month •Binary/dichotomous – two categories •No/yes, dead/alive, case/control • •Ordinal – have several ordered categories •Level of education (elementary, high school, college), Likert scale ([1]strongly disagree – strongly agree[5]) • Continuous variables •Can take any value within a range •Theoretically infinite values • •Examples: blood pressure, height, temperature, liquid volume KvIS 1 Distribution A graph of a price Description automatically generated with medium confidence •How data values are spread across different values •Normal distribution – Gaussian, symmetrical •Skewed distribution •Positively skewed (right-skewed), negatively skewed (left-skewed) •Bimodal distribution • Normal/Gaussian distribution Galton board and the laws of nature • https://www.edumedia-sciences.com/en/media/905-galton-board Measures of central tendency 1.Mean – the average value (sum of values / number of values) 2.Median – the value in the middle of distribution (50th percentile) 3.Mode – the most frequent value • Measures of central tendency • A diagram of a function Description automatically generated •Skewness • • • •Kurtosis Non-normal distributions • KvIS 2 Measures of spread •How spread our data are around the central tendency •The lower the spread, the more representative the measures of central tendency are of the data •High spread = large variability Standard deviation (SD) •Amount of variation of the values from the mean •High SD = high variability • •(population SD): square root of the mean of squared differences of individual values from the mean = variance •Sample SD = using n-1 instead of N Basic descriptive terms •Sum – adding values together •Mean (M) – sum of values divided by their count •Mode – most frequently occurring value •Median – value at the 50% (“in the middle”) •Standard deviation (SD) – distance of a value from a sample mean •Variance – squared SD •Quantile – cut point dividing the range of the distribution into intervals with equal probabilities •Minimum – the smallest value •Maximum – the largest value • Visualizing the distribution A graph of a number of colored bars Description automatically generated with medium confidence A graph of different colored bars Description automatically generated A graph with different colored boxes Description automatically generated A pie chart with different colored bars Description automatically generated Pie chart •Categorical variables •Visualizing the proportions of values •Typically as a percentage A pie chart with different colored bars Description automatically generated Bar chart •Categorical variables •Frequency of values for each category • A graph of different colored bars Description automatically generated Boxplot •Continuous variables •Illustrates the spread of values •Median, percentiles, min/max •Outliers • • A graph with different colored boxes Description automatically generated Histogram •Continuous variables •Values are divided into bins = range of values •Visualizing the density • • •Bars in histogram vs bars in bar chart •Different range of values vs different categories A graph of a number of colored bars Description automatically generated with medium confidence Scatterplot •Two continuous variables •Bivariate distribution • • A diagram of a happiness score Description automatically generated • KvIS 3 Data cleaning •Data often contain errors, missing values, outliers • •This might be due to •Contamination (biological samples) •Error in data entry •Just a really atypical case (with regards to outliers) Outliers •atypical data point with regards to sample values •Example •Erasmus students in class – 10 students • •With outlier: •M = 25.8 •SD = 15.9 •Median = 21 • •Without outlier: •M = 20.8 •SD = 0.83 •Median = 21 • • • • # age 1 20 2 21 3 20 4 22 5 21 6 20 7 22 8 20 9 71 10 21 Identifying outliers – graphs Box plot with 1.5 IQR = everything beyond that is outlier symmetric with outlier histogram Outliers – what should we do? •Errors in data entry – need to fix •Extreme values •Remove? •Keep in? •Substitute? •Transform? •Depends on the type of data • Outliers? Yang, S., Puggioni, G., Harlow, L. L., & Redding, C. A. (2017). A Comparison of Different Methods of Zero-Inflated Data Analysis and an Application in Health Surveys. JMASM Editors, 16(1), 518-543. Missing data A white puzzle with a missing piece Description automatically generated •Missing data – no response for some or all of variables for an individual •Missing data can lead to biased results • •Handling missing data •Poor handling: •Listwise deletion •Pairwise deletion •Mean/median imputation •Good handling: •Multiple imputation •Full information maximum likelihood • Multiple imputation Nissen, J., Donatello, R., & Van Dusen, B. (2019). Missing data and bias in physics education research: A case for using multiple imputation. Physical Review Physics Education Research, 15(2), 020106.