Lineární a logistická regrese Mojmír Dočekal 2023-05-02 2 Contents 1 Dvě témata 5 2 Linearní regrese 7 2.1 Historie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Nelingvistický příklad . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Multiple regression analysis (vícenásobná lineární regrese) . . . . 11 2.4 Lingvistický příklad . . . . . . . . . . . . . . . . . . . . . . . . . 18 3 Logistická regrese 41 3.1 Klasická lineární regrese . . . . . . . . . . . . . . . . . . . . . . . 41 3.2 Logistická regrese . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.3 Jiný nelingvistický příklad . . . . . . . . . . . . . . . . . . . . . . 47 3.4 Lingvistický příklad . . . . . . . . . . . . . . . . . . . . . . . . . 49 4 Shrnutí a domácí úkol (do prezentace) 51 4.1 Lingvistické použití . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.2 Domácí úkol do prezentace . . . . . . . . . . . . . . . . . . . . . 55 4.3 Dodatky . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3 4 CONTENTS Chapter 1 Dvě témata • lineární regrese – mimo-lingvistiku – lingvistické užití • logistická regrese (mimolingvistický příklad) 5 6 CHAPTER 1. DVĚ TÉMATA Chapter 2 Linearní regrese 2.1 Historie • Laplace, Gauss, … • metoda nejmenších čtverců • Gaus a astronomie: Ceres 2.2 Nelingvistický příkladrequire(graphics) plot(women, xlab = "Height (in)", ylab = "Weight (lb)", main = "women data: American women aged 30-39") 7 8 CHAPTER 2. LINEARNÍ REGRESE 58 60 62 64 66 68 70 72 120130140150160 women data: American women aged 30−39 Height (in) Weight(lb) data("women") head(women) ## height weight ## 1 58 115 ## 2 59 117 ## 3 60 120 ## 4 61 123 ## 5 62 126 ## 6 63 129 help("women") Average Heights and Weights for American Women Description This data set gives the average heights and weights for American women aged 30–39. Usage women Format A data frame with 15 observations on 2 variables. [,1] height numeric Height (in) [,2] weight numeric Weight (lbs) Details The data set appears to have been taken from the American Society of Actuaries Build and Blood Pressure Study for some (unknown to us) earlier year. The World Almanac notes: “The figures represent weights in ordinary indoor clothing and shoes, and heights with shoes”. Source The World Almanac and Book of Facts, 1975. References McNeil, D. R. (1977) Interactive Data Analysis. Wiley. 2.2. NELINGVISTICKÝ PŘÍKLAD 9 Examples require(graphics) plot(women, xlab = “Height (in)”, ylab = “Weight (lb)”, main = “women data: American women aged 30-39”) [Package datasets version 4.0.5 Index] • a nyní model interpretující intuitivně jasný vztah mezi výškou a váhou • výška je explanatory variable (vysvětlující proměnná) • váha je dependent variable (závislá proměnná) – jiná terminologie: explanatory – independent variables • model: lm <- lm(women$weight ~ women$height) lm ## ## Call: ## lm(formula = women$weight ~ women$height) ## ## Coefficients: ## (Intercept) women$height ## -87.52 3.45 • y-intercept (-87.52) – žena s nulovou výškou by vážila -87.52 liber • regression-coefficient (3.45) – numericky vyjádřený vztah mezi explanatory a dependent variable – nárustek závislé proměnné, vzroste-li explanatory proměnná a 1 jed- notku summary(lm) ## ## Call: ## lm(formula = women$weight ~ women$height) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.7333 -1.1333 -0.3833 0.7417 3.1167 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 *** ## women$height 3.45000 0.09114 37.85 1.09e-14 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 10 CHAPTER 2. LINEARNÍ REGRESE ## ## Residual standard error: 1.525 on 13 degrees of freedom ## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903 ## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14 • každý koeficient má: 1. sign (znaménko): pozitivní nebo negativní 2. velikost (síla efektu) 3. signifikance: pravděpodobnost nulové hypotéze vůči danému vzorku • při dostatečném množství pozorování se díky central limit theoeremu neoctneme příliš daleko (standard error) od skutečného mean celé populace • R2 měří úspěšnost modelu – v daném, velmi omezeném vzorku, vysvětluje model 99% – nerealistické v normálním vzorku ∗ samozřejmě mnohem víc faktorů: věk, pohlaví, … • koeficienty lineární regrese zároveň určují proložení přímky • lineární regrese • a také residuals pro každé pozorování plot(women, xlab = "Height (in)", ylab = "Weight (lb)", main = "women data: American women aged 30-39") abline(lm(women$weight ~ women$height)) 2.3. MULTIPLE REGRESSION ANALYSIS (VÍCENÁSOBNÁ LINEÁRNÍ REGRESE)11 58 60 62 64 66 68 70 72 120130140150160 women data: American women aged 30−39 Height (in) Weight(lb) 2.3 Multiple regression analysis (vícenásobná lineární regrese) • malinko realističtější příklad • data z link setwd("heart_biking_smoking") heart.data <- read.csv2("heart.data.csv", header = TRUE, sep = ",") heart.data$X <- NULL heart.data$biking <- as.numeric(heart.data$biking) heart.data$smoking <- as.numeric(heart.data$smoking) heart.data$heart.disease <- as.numeric(heart.data$heart.disease) head(heart.data) ## biking smoking heart.disease ## 1 30.801246 10.896608 11.769423 ## 2 65.129215 2.219563 2.854081 ## 3 1.959665 17.588331 17.177803 ## 4 44.800196 2.802559 6.816647 ## 5 69.428454 15.974505 4.062224 ## 6 54.403626 29.333176 9.550046 12 CHAPTER 2. LINEARNÍ REGRESE • statistika měst a procentuálního poměru cyklistů, kuřáká a osob se srdeční chorobou • dvě explanatory proměnné • jedna závislá proměnná • vícenásobná lineární regrese • korelační grafy: plot(heart.data$heart.disease ~ heart.data$biking, xlab = "Biking (percents)", ylab = " main = "Linear regression") 0 20 40 60 05101520 Linear regression Biking (percents) Heartdisease(percents) plot(heart.data$heart.disease ~ heart.data$smoking, xlab = "Smoking (percents)", ylab = main = "Linear regression") 2.3. MULTIPLE REGRESSION ANALYSIS (VÍCENÁSOBNÁ LINEÁRNÍ REGRESE)13 0 5 10 15 20 25 30 05101520 Linear regression Smoking (percents) Heartdisease(percents) • model podobně konstruovaný, ale s dvěma explanatory proměnnými heart.disease.lm<-lm(heart.disease ~ biking + smoking, data = heart.data) heart.disease.lm ## ## Call: ## lm(formula = heart.disease ~ biking + smoking, data = heart.data) ## ## Coefficients: ## (Intercept) biking smoking ## 14.9847 -0.2001 0.1783 • podrobnější info: summary(heart.disease.lm) ## ## Call: ## lm(formula = heart.disease ~ biking + smoking, data = heart.data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.1789 -0.4463 0.0362 0.4422 1.9331 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) 14 CHAPTER 2. LINEARNÍ REGRESE ## (Intercept) 14.984658 0.080137 186.99 <2e-16 *** ## biking -0.200133 0.001366 -146.53 <2e-16 *** ## smoking 0.178334 0.003539 50.39 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.654 on 495 degrees of freedom ## Multiple R-squared: 0.9796, Adjusted R-squared: 0.9795 ## F-statistic: 1.19e+04 on 2 and 495 DF, p-value: < 2.2e-16 # library(ggplot2) #abline(lm(heart.disease ~ biking, data = heart.data)) • jednodušší model: heart.disease.lm2<-lm(heart.disease ~ biking, data = heart.data) heart.disease.lm2 ## ## Call: ## lm(formula = heart.disease ~ biking, data = heart.data) ## ## Coefficients: ## (Intercept) biking ## 17.6979 -0.1991 summary(heart.disease.lm2) ## ## Call: ## lm(formula = heart.disease ~ biking, data = heart.data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -4.028 -1.206 -0.004 1.151 3.643 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 17.697884 0.146780 120.57 <2e-16 *** ## biking -0.199091 0.003378 -58.94 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.618 on 496 degrees of freedom ## Multiple R-squared: 0.8751, Adjusted R-squared: 0.8748 ## F-statistic: 3474 on 1 and 496 DF, p-value: < 2.2e-16 2.3. MULTIPLE REGRESSION ANALYSIS (VÍCENÁSOBNÁ LINEÁRNÍ REGRESE)15 • horší R2 • graf jednodušího modelu 3: plot(heart.data$heart.disease ~ heart.data$biking, xlab = "Biking (percents)", ylab = "Heart dise main = "Linear regression") abline(lm(heart.data$heart.disease ~ heart.data$biking)) 0 20 40 60 05101520 Linear regression Biking (percents) Heartdisease(percents) - graf jednoduššího modelu 2: plot(heart.data$heart.disease ~ heart.data$smoking, xlab = "Biking (percents)", ylab = "Heart dis main = "Linear regression") abline(lm(heart.data$heart.disease ~ heart.data$smoking)) 16 CHAPTER 2. LINEARNÍ REGRESE 0 5 10 15 20 25 30 05101520 Linear regression Biking (percents) Heartdisease(percents) heart.disease.lm3<-lm(heart.disease ~ smoking, data = heart.data) heart.disease.lm3 ## ## Call: ## lm(formula = heart.disease ~ smoking, data = heart.data) ## ## Coefficients: ## (Intercept) smoking ## 7.5431 0.1705 summary(heart.disease.lm3) ## ## Call: ## lm(formula = heart.disease ~ smoking, data = heart.data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -8.7065 -3.7069 0.5007 3.6597 8.5434 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 7.54311 0.41251 18.286 < 2e-16 *** ## smoking 0.17048 0.02355 7.239 1.73e-12 *** ## --- 2.3. MULTIPLE REGRESSION ANALYSIS (VÍCENÁSOBNÁ LINEÁRNÍ REGRESE)17 ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 4.352 on 496 degrees of freedom ## Multiple R-squared: 0.09556, Adjusted R-squared: 0.09374 ## F-statistic: 52.41 on 1 and 496 DF, p-value: 1.729e-12 • srovnání modelů pomocí ANOVA anova(heart.disease.lm2, heart.disease.lm) ## Analysis of Variance Table ## ## Model 1: heart.disease ~ biking ## Model 2: heart.disease ~ biking + smoking ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 496 1297.74 ## 2 495 211.74 1 1086 2538.8 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 • komplexnější (Df) model je mnohem lepší • srovnání dvou jednodušších modelů anova(heart.disease.lm3, heart.disease.lm2) ## Analysis of Variance Table ## ## Model 1: heart.disease ~ smoking ## Model 2: heart.disease ~ biking ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 496 9395.6 ## 2 496 1297.7 0 8097.8 • model užívající biking jako explanatory proměnnou je mnohem lepší • zpět ke komplexnímu modelu summary(heart.disease.lm) ## ## Call: ## lm(formula = heart.disease ~ biking + smoking, data = heart.data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.1789 -0.4463 0.0362 0.4422 1.9331 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 14.984658 0.080137 186.99 <2e-16 *** 18 CHAPTER 2. LINEARNÍ REGRESE ## biking -0.200133 0.001366 -146.53 <2e-16 *** ## smoking 0.178334 0.003539 50.39 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.654 on 495 degrees of freedom ## Multiple R-squared: 0.9796, Adjusted R-squared: 0.9795 ## F-statistic: 1.19e+04 on 2 and 495 DF, p-value: < 2.2e-16 # library(ggplot2) #abline(lm(heart.disease ~ biking, data = heart.data)) • jednotlivé koeficienty jsou propočítány vzhledem k tomu, kdy jsou ostatní faktory fixovány • např. jak moc zlepšuje cyklistika srdeční choroby, pokud je kouření fixováno na stejné hodnotě 2.4 Lingvistický příklad • lineární model z Hančiny diplomky 2.4.1 Comparatively and superlatively modified numerals (theories, contrasts) 2.4.1.1 Background • the accepted contrasts between comparative modifiers and superlative modifiers (see ?, ?,?, ?, ? a.o.): 1. comparative modifiers don’t but superlative modifiers do give raise to obligatory ignorance implicatures (?I have at most three bikes vs. I have less than four bikes); 2. comparative modifiers can scope over or under existential modals (EM) but superlative modifiers have to outscope them (The cup of Darjeeling tea can contain less than 50mg of caffeine: ♢ > 50 ok vs. The cup of Darjeeling tea can contain at most 50mg of caffeine: *♢ > 50) • A no more than Num construction like in English (1a) from ? is then claimed to: 1. to allow both scopes w.r.t. existential modals (?): (1a)) – ♢ > 20/20 > ♢; 2. to have scalar bounding inference (50 for (1b)); • no more than Num construction is then claimed to be: 1. subtype of differential quantifiers in the class of comparative modifiers (as slightly less in (1c)); 2.4. LINGVISTICKÝ PŘÍKLAD 19 2. supports come from the comparative morphology of no more than Num; 2.4.1.2 Our claims • we bring new experimental evidence (from Czech) against such claims, showing that (unlike in English): 1. no more than Num prefers to be interpreted with wider scope than existential modals; 2. no more than Num can behave unlike other differential comparative mod- ifiers; 3. our experiment shows: • there are two kinds of differential quantifiers: comparative ((1c) and English (1a)) and superlative (Czech no more than Num as in (2a)/(2c)); 2.4.2 Experiment • two experiments to target two research questions: 1. does Czech no more than behave more like a comparative modifier or a superlative modifier (in the modal environment)? 2. does Czech no more behave like other differential quantifiers? 2.4.2.1 Design of both experiments • Czech native speakers • Likert scale 1-5 • the appropriateness of one of the conditions in a context • further, only exp 2: it included all the conditions of exp 1 • truth-value judgment task where a context described a situation strongly preferring the wide scope of the existential modal over the degree quanti- fiers • 16 items and 16 fillers • 98 subjects (all of them passed fillers uncontroversial TVJT) • implemented on L-Rex experiment followed an observation (?,? a.o.): • comparative modifiers allow both wide and narrow scope w.r.t. an existential modal reading but superlative modifiers have to out-scope the existential modals (split-scope); • four conditions: 1. comparative modifiers: fewer, (2a); 2. superlative modifiers: at-most, (2b); 3. no more modifier: no-more, (2c); 20 CHAPTER 2. LINEARNÍ REGRESE 4. differential slightly less modifier, (2d) – slightly-less; 2.4.2.2 Assumptions • the conditions fewer and at-most tested the acceptability of modified numerals without differential; • the conditions slightly-less, no-more tested the presence of a differential (zero degree differential in the case of no-more); • all conditions tested possible narrow scope w.r.t. the existential modal; • ? predicts no-more to be the comparative modifiers as slightly-less; • the design was 2x2 factorial: comparative modifiers or superlative modifiers (classA,classB) x absence/presence of a differential (DiffYes,DiffNo); • an example item from the experiment is in (2). 2.4.2.3 Predictions • in the existential modal environment manipulated in a way strongly prefering wide scope of the modal • N(o)M(ore than)C(onstruction) • C(omparative)M(odifier) • S(uperlative)M(odifier) fewer atmost no-more slightly- less Predictions NMC as CM ✓ � ✓ ✓ NMC as SM ✓ � � ✓ 2.4.2.4 Results • the data was analyzed in a mixed-effects linear model with subject and item intercept+slope random effects (R package lmerTest); • the dependent variable: the subject’s response; • several models, the best: conditions as independent variables, plus their interaction; • the less fitting models included models with main effects only and models where no more was treated as a comparative modifiers 1. we found a negative main effect of classB (superlative modifiers) (t-value: -11.004, 𝑝 < 0.001) and a positive effect of the absence of a differential (tvalue: 3.946 𝑝 < 0.001); 2. the model also reports a negative interaction of classB (superlative modifiers) by DiffNo (t-value: -3.129, 𝑝 = 0.002); 2.4. LINGVISTICKÝ PŘÍKLAD 21 3. Tukey’s pairwise comparison of the conditions reveals that only at-most and no-more were statistically non-significantly different (t-value: -0.478, 𝑝 = 0.964); • all other pairs of conditions differed significantly; • the boxplot representing means and SEs in Fig. 2.1 • the experiment thus confirms: 1. the scope behavior of Czech no more construction follows the pattern of superlative modifiers, not the comparative modifiers, since subjects accepted to the same extent no-more as at-least; 2. the significant difference between no-more and slighty-less which can be explained by classifying no more as an superlative modifiers differential quantifier and slightly less as a comparative modifiers; The surprising result of this exp is the overall low acceptability of all conditions: • even the most default comparative modifiers without a differential (cond fewer) had 𝜇=2.51 (SD: 1.61, SE: 0.04); – we hypothesize that this results from the priming effect of the most frequent everyday contexts, which strongly prefer the 𝑚𝑎𝑥 𝑑 > ♢ reading, just the opposite against the contexts described in our exp. knitr::include_graphics("plot_zoom.png") 2.4.2.5 Analysis • the scope behaviour of Czech NMC is of a superlative modifiers profile • we follow original (?) suggestion to analyze German/Dutch nicht mehr/niet meer as a negative differential expressing that there is no positive difference in degree between the arguments of the comparative more: [[nicht mehr ]]𝛼 = 𝜆𝑃.¬∃𝑑′ [𝑚𝑎𝑥 𝑑(𝑃(𝑑)) = 𝛼 + 𝑑′ ] • since the negative differential analysis is equivalent to the superlative modifiers at-issue semantics of at most: 𝜆𝑃.𝑚𝑎𝑥 𝑑(𝑃(𝑑)) ≤ 𝛼 (after (?)) • applied to Czech experimental data correctly derives the similar scope behaviour of NMC and superlative modifiers; • the wide scope of the NMC/superlative modifiers and then is: 𝑚𝑎𝑥 𝑑(♢contain(ChocBar, 𝑑)) ≤ 65𝑔 • incompatible with Alex’s continuation and predicts low acceptability of no-more and at-most in the experiment; 22 CHAPTER 2. LINEARNÍ REGRESE Figure 2.1: Boxplot of responses 2.4. LINGVISTICKÝ PŘÍKLAD 23 • weak surface scope (♢[𝑚𝑎𝑥 𝑑(contain(ChocBar, 𝑑)) ≤ 65𝑔]) which allows ‘more than’ reading is allowed only for comparative modifiers and explains the higher acceptability of fewer and slightly-less (whatever the reasons for obligatory wide scope of superlative modifiers over existential modals are, see (?)); • the scope behaviour of Czech NMC then shows that semantically NMC behaves as superlative modifiers, despite its comparative morphology; • Secondly, the experiment brings support for the comparative modifiers vs. superlative modifiers theory presented by (?) where the distinction boils down to the type of ordering relation (strict vs. non-strict) • NMC can, at least in languages like Czech, be interpreted as ¬ (strict) resulting in ordering entailments of non-strict ordering. Regular differential quantifiers (slightly-less) remain strictly ordered, thus comparative modifiers; • finally, cross-linguistically we found three types of NMC-languages: 1. NMC as comparative modifiers, English type of NMC (bounding inferences and both scopes w.r.t. existential modals); 2. NMC as superlative modifiers, Czech type of NMC (only 𝑚𝑎𝑥 𝑑 > ♢, lack of bounding inferences: (?)); 3. languages where NMC depending on its realization behaves as comparative modifiers or as superlative modifiers (Hungarian according to Balázs Surányi (p.c.)); • the variation is related to the morpho-syntactic status, a constituent negation in NMC (Czech) behaves as superlative modifiers; a negative quantifier (English) in NMC leads to comparative modifiers; • our experiment clearly shows that treating uniformly all NMC as comparative modifiers is cross-linguistically untenable and the distinction between comparative modifiers and superlative modifiers isn’t purely morphologi- cal: • Czech NMC contains both comparative marker and comparative standard marker but unlike regular comparative modifiers differentials, Czech NMC acts as a superlative numeral modifier. 2.4.3 Reports library(plyr) library(dplyr) ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:plyr': ## 24 CHAPTER 2. LINEARNÍ REGRESE ## arrange, count, desc, failwith, id, mutate, rename, summarise, ## summarize ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union #setwd("breakout_rooms_report") items <- read.csv2("clean_items.csv", encoding = 'UTF-8', header = TRUE) items <- items %>% mutate(condition=replace(condition, condition == "item-méně_než", "fewer")) %>% mutate(condition=replace(condition, condition == "item-nanejvýš", "at-most")) %>% mutate(condition=replace(condition, condition == "item-ne_víc_než", "no-more")) %> mutate(condition=replace(condition, condition == "item-trochu_méně", "slightly-les as.data.frame() ddply(items, .(condition), summarise, Means = mean(rating1, na.rm=TRUE)) ## condition Means ## 1 at-most 1.262755 ## 2 fewer 2.512755 ## 3 no-more 1.311224 ## 4 slightly-less 2.211735 ddply(items, .(condition), summarise, Medians = median(rating1,na.rm=TRUE)) ## condition Medians ## 1 at-most 1 ## 2 fewer 2 ## 3 no-more 1 ## 4 slightly-less 2 • graphs library(ggplot2) ggplot(items, aes(x = rating1, fill = condition)) + geom_histogram(position = "dodge", alpha = 1, binwidth = 0.5) 2.4. LINGVISTICKÝ PŘÍKLAD 25 0 100 200 300 1 2 3 4 5 rating1 count condition at−most fewer no−more slightly−less ggplot(items, aes(x = rating1)) + geom_histogram(fill = "white", colour = "black") + facet_grid(condition ~ .) ## `stat_bin()` using `bins = 30`. Pick ## better value with `binwidth`. 26 CHAPTER 2. LINEARNÍ REGRESE at−mostfewerno−moreslightly−less 1 2 3 4 5 0 100 200 300 0 100 200 300 0 100 200 300 0 100 200 300 rating1 count data.to.plot <- items graph_to_plot <- ggplot(data.to.plot, aes(condition,rating1)) graph_to_plot + geom_boxplot() + stat_summary(fun.y=mean, geom="point", size=3) ## Warning: The `fun.y` argument of ## `stat_summary()` is deprecated as of ## ggplot2 3.3.0. ## i Please use the `fun` argument ## instead. ## This warning is displayed once every ## 8 hours. ## Call ## `lifecycle::last_lifecycle_warnings()` ## to see where this warning was ## generated. 2.4. LINGVISTICKÝ PŘÍKLAD 27 1 2 3 4 5 at−most fewer no−more slightly−less condition rating1 items$classAB <- "A" items$classAB[items$condition == "at-most"] <- "B" items$classAB[items$condition == "no-more"] <- "B" p <- ggplot(items, aes(condition, rating1, fill = classAB)) + stat_summary(geom = "bar", fun.y = mean, position = "dodge") + stat_summary(geom = "errorbar", fun.data = mean_se, size=.3, width=.2, position=position_dodge(.9)) ## Warning: Using `size` aesthetic for lines was ## deprecated in ggplot2 3.4.0. ## i Please use `linewidth` instead. ## This warning is displayed once every ## 8 hours. ## Call ## `lifecycle::last_lifecycle_warnings()` ## to see where this warning was ## generated. p + labs(y = "acceptability") + theme(axis.text=element_text(size=15), axis.title=element_text(size=17,face="bold")) 28 CHAPTER 2. LINEARNÍ REGRESE 0 1 2 at−most fewer no−more slightly−less condition acceptability classAB A B 2.4.4 Modely • modely library(ggplot2) data.to.plot <- items graph_to_plot <- ggplot(data.to.plot, aes(condition,rating1)) graph_to_plot + geom_boxplot() + stat_summary(fun.y=mean, geom="point", size=3) 2.4. LINGVISTICKÝ PŘÍKLAD 29 1 2 3 4 5 at−most fewer no−more slightly−less condition rating1 ggplot(items, aes(condition, rating1, fill = condition)) + stat_summary(geom = "bar", fun.y = mean, position = "dodge") + stat_summary(geom = "errorbar", fun.data = mean_se, size=.3, width=.2, position=position_dodge(.9)) 0 1 2 at−most fewer no−more slightly−less condition rating1 condition at−most fewer no−more slightly−less 30 CHAPTER 2. LINEARNÍ REGRESE library(dplyr) N = length(items$rating1) N ## [1] 1568 items.summary <- items %>% group_by(condition) %>% summarise( sd = sd(rating1), se = sd / sqrt(N), acceptability = mean(rating1) ) items.summary ## # A tibble: 4 x 4 ## condition sd se acceptability ## ## 1 at-most 0.767 0.0194 1.26 ## 2 fewer 1.61 0.0407 2.51 ## 3 no-more 0.905 0.0228 1.31 ## 4 slightly-less 1.43 0.0361 2.21 items.summary[1,1] <- "less than" items.summary[2,1] <- "at most" items.summary[3,1] <- "no more than" items.summary[4,1] <- "a bit less than" items.summary$class <- NA items.summary$class[1] <- "A" items.summary$class[2] <- "B" items.summary$class[3] <- "B" items.summary$class[4] <- "A" dodge <- position_dodge(width=0.3) qplot(condition, acceptability, colour=class, pch=class, lty=class, data=items.summary, ## Warning: `qplot()` was deprecated in ggplot2 ## 3.4.0. ## This warning is displayed once every ## 8 hours. ## Call ## `lifecycle::last_lifecycle_warnings()` ## to see where this warning was 2.4. LINGVISTICKÝ PŘÍKLAD 31 ## generated. 1.2 1.6 2.0 2.4 a bit less than at most less than no more than condition acceptability class A B • models # linear model library(lmerTest) ## Loading required package: lme4 ## Loading required package: Matrix ## ## Attaching package: 'lmerTest' ## The following object is masked from 'package:lme4': ## ## lmer ## The following object is masked from 'package:stats': ## ## step items$condition <- as.factor(items$condition) items$condition <- relevel(items$condition, ref="at-most") m1 <- lmer(as.numeric(rating1) ~ condition + (1|participant) + (1|item), data=items) 32 CHAPTER 2. LINEARNÍ REGRESE summary(m1) ## Linear mixed model fit by REML. t-tests use Satterthwaite's method [ ## lmerModLmerTest] ## Formula: as.numeric(rating1) ~ condition + (1 | participant) + (1 | item) ## Data: items ## ## REML criterion at convergence: 4936.9 ## ## Scaled residuals: ## Min 1Q Median 3Q Max ## -2.1683 -0.6481 -0.2050 0.4471 3.6096 ## ## Random effects: ## Groups Name Variance Std.Dev. ## participant (Intercept) 0.1414 0.3761 ## item (Intercept) 0.1375 0.3708 ## Residual 1.2402 1.1137 ## Number of obs: 1568, groups: participant, 98; item, 16 ## ## Fixed effects: ## Estimate Std. Error df t value Pr(>|t|) ## (Intercept) 1.271e+00 1.149e-01 2.942e+01 11.063 5.3e-12 *** ## conditionfewer 1.242e+00 7.959e-02 1.452e+03 15.601 < 2e-16 *** ## conditionno-more 3.996e-02 7.959e-02 1.452e+03 0.502 0.616 ## conditionslightly-less 9.322e-01 7.961e-02 1.452e+03 11.710 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Correlation of Fixed Effects: ## (Intr) cndtnf cndtnn## conditinfwr -0.346 ## conditnn-mr -0.346 0.500 ## cndtnslght- -0.346 0.500 0.500 library(emmeans) emmeans(m1, list(pairwise ~ condition), adjust = "tukey") ## Cannot use mode = "kenward-roger" because *pbkrtest* package is not installed ## $`emmeans of condition` ## condition emmean SE df lower.CL upper.CL ## at-most 1.27 0.115 29.4 1.04 1.51 ## fewer 2.51 0.115 29.4 2.28 2.75 ## no-more 1.31 0.115 29.4 1.08 1.55 2.4. LINGVISTICKÝ PŘÍKLAD 33 ## slightly-less 2.20 0.115 29.4 1.97 2.44 ## ## Degrees-of-freedom method: satterthwaite ## Results are given on the as.numeric (not the response) scale. ## Confidence level used: 0.95 ## ## $`pairwise differences of condition` ## 1 estimate SE df t.ratio p.value ## (at-most) - fewer -1.242 0.0796 1452 -15.601 <.0001 ## (at-most) - (no-more) -0.040 0.0796 1452 -0.502 0.9586 ## (at-most) - (slightly-less) -0.932 0.0796 1452 -11.710 <.0001 ## fewer - (no-more) 1.202 0.0796 1452 15.095 <.0001 ## fewer - (slightly-less) 0.309 0.0796 1452 3.888 0.0006 ## (no-more) - (slightly-less) -0.892 0.0796 1452 -11.211 <.0001 ## ## Note: contrasts are still on the as.numeric scale ## Degrees-of-freedom method: satterthwaite ## P value adjustment: tukey method for comparing a family of 4 estimates # m2: class A vs. class B traditionally # going from morphology: adding column with class A/class B status # condition "at-most" the only class B items$classAB <- "A" items$classAB[items$condition == "at-most"] <- "B" items$classAB <- as.factor(items$classAB) m2 <- lmer(as.numeric(rating1) ~ classAB + (1|participant) + (1|item), data=items) summary(m2) ## Linear mixed model fit by REML. t-tests use Satterthwaite's method [ ## lmerModLmerTest] ## Formula: as.numeric(rating1) ~ classAB + (1 | participant) + (1 | item) ## Data: items ## ## REML criterion at convergence: 5157.5 ## ## Scaled residuals: ## Min 1Q Median 3Q Max ## -1.8526 -0.6835 -0.3082 0.4417 3.3690 ## ## Random effects: ## Groups Name Variance Std.Dev. ## participant (Intercept) 0.1285 0.3584 ## item (Intercept) 0.1366 0.3696 34 CHAPTER 2. LINEARNÍ REGRESE ## Residual 1.4480 1.2033 ## Number of obs: 1568, groups: participant, 98; item, 16 ## ## Fixed effects: ## Estimate Std. Error df t value Pr(>|t|) ## (Intercept) 2.00910 0.10526 20.39488 19.09 1.76e-14 *** ## classABB -0.73794 0.07022 1454.35268 -10.51 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Correlation of Fixed Effects: ## (Intr) ## classABB -0.167 anova(m1,m2) ## refitting model(s) with ML (instead of REML) ## Data: items ## Models: ## m2: as.numeric(rating1) ~ classAB + (1 | participant) + (1 | item) ## m1: as.numeric(rating1) ~ condition + (1 | participant) + (1 | item) ## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq) ## m2 5 5161.3 5188.1 -2575.7 5151.3 ## m1 7 4937.8 4975.3 -2461.9 4923.8 227.5 2 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # m1 fares better # adding m3 with "no more" reclassified as class B items$classAB[items$condition == "no-more"] <- "B" m3 <- lmer(as.numeric(rating1) ~ classAB + (1|participant) + (1|item), data=items) summary(m3) ## Linear mixed model fit by REML. t-tests use Satterthwaite's method [ ## lmerModLmerTest] ## Formula: as.numeric(rating1) ~ classAB + (1 | participant) + (1 | item) ## Data: items ## ## REML criterion at convergence: 4945.8 ## ## Scaled residuals: ## Min 1Q Median 3Q Max ## -2.0147 -0.6657 -0.2035 0.4376 3.5794 2.4. LINGVISTICKÝ PŘÍKLAD 35 ## ## Random effects: ## Groups Name Variance Std.Dev. ## participant (Intercept) 0.1407 0.3751 ## item (Intercept) 0.1367 0.3698 ## Residual 1.2517 1.1188 ## Number of obs: 1568, groups: participant, 98; item, 16 ## ## Fixed effects: ## Estimate Std. Error df t value Pr(>|t|) ## (Intercept) 2.35808 0.10761 22.83181 21.91 <2e-16 *** ## classABB -1.06693 0.05655 1454.41148 -18.87 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Correlation of Fixed Effects: ## (Intr) ## classABB -0.263 anova(m1,m2,m3) ## refitting model(s) with ML (instead of REML) ## Data: items ## Models: ## m2: as.numeric(rating1) ~ classAB + (1 | participant) + (1 | item) ## m3: as.numeric(rating1) ~ classAB + (1 | participant) + (1 | item) ## m1: as.numeric(rating1) ~ condition + (1 | participant) + (1 | item) ## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq) ## m2 5 5161.3 5188.1 -2575.7 5151.3 ## m3 5 4949.1 4975.9 -2469.6 4939.1 212.180 0 ## m1 7 4937.8 4975.3 -2461.9 4923.8 15.316 2 0.0004723 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # m1 still the best # m4: interactions? # class A vs. class B as the main effect # presence of the differential phrase ("ne" or "trochu") as an interaction effect items$Diff <- "NoDiff" items$Diff[items$condition == "no-more"|items$condition == "slightly-less"] <- "Diff" items$Diff <- as.factor(items$Diff) m4 <- lmer(as.numeric(rating1) ~ classAB * Diff + (1|participant) + (1|item), data=items) 36 CHAPTER 2. LINEARNÍ REGRESE summary(m4) ## Linear mixed model fit by REML. t-tests use Satterthwaite's method [ ## lmerModLmerTest] ## Formula: as.numeric(rating1) ~ classAB * Diff + (1 | participant) + (1 | ## item) ## Data: items ## ## REML criterion at convergence: 4936.9 ## ## Scaled residuals: ## Min 1Q Median 3Q Max ## -2.1683 -0.6481 -0.2050 0.4471 3.6096 ## ## Random effects: ## Groups Name Variance Std.Dev. ## participant (Intercept) 0.1414 0.3761 ## item (Intercept) 0.1375 0.3708 ## Residual 1.2402 1.1137 ## Number of obs: 1568, groups: participant, 98; item, 16 ## ## Fixed effects: ## Estimate Std. Error df t value Pr(>|t|) ## (Intercept) 2.20337 0.11491 29.42257 19.175 < 2e-16 *** ## classABB -0.89223 0.07959 1452.25414 -11.211 < 2e-16 *** ## DiffNoDiff 0.30942 0.07959 1452.25414 3.888 0.000106 *** ## classABB:DiffNoDiff -0.34938 0.11252 1452.10172 -3.105 0.001939 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Correlation of Fixed Effects: ## (Intr) clsABB DffNDf ## classABB -0.346 ## DiffNoDiff -0.346 0.500 ## clssABB:DND 0.245 -0.707 -0.707 # m5: "no more" as classA anova(m1,m2,m3,m4) ## refitting model(s) with ML (instead of REML) ## Data: items ## Models: ## m2: as.numeric(rating1) ~ classAB + (1 | participant) + (1 | item) ## m3: as.numeric(rating1) ~ classAB + (1 | participant) + (1 | item) ## m1: as.numeric(rating1) ~ condition + (1 | participant) + (1 | item) 2.4. LINGVISTICKÝ PŘÍKLAD 37 ## m4: as.numeric(rating1) ~ classAB * Diff + (1 | participant) + (1 | item) ## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq) ## m2 5 5161.3 5188.1 -2575.7 5151.3 ## m3 5 4949.1 4975.9 -2469.6 4939.1 212.180 0 ## m1 7 4937.8 4975.3 -2461.9 4923.8 15.316 2 0.0004723 *** ## m4 7 4937.8 4975.3 -2461.9 4923.8 0.000 0 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 items$classAB[items$condition == "no-more"] <- "A" m5 <- lmer(as.numeric(rating1) ~ classAB * Diff + (1|participant) + (1|item), data=items) ## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient summary(m5) ## Linear mixed model fit by REML. t-tests use Satterthwaite's method [ ## lmerModLmerTest] ## Formula: as.numeric(rating1) ~ classAB * Diff + (1 | participant) + (1 | ## item) ## Data: items ## ## REML criterion at convergence: 5054.3 ## ## Scaled residuals: ## Min 1Q Median 3Q Max ## -2.0655 -0.6653 -0.2473 0.4431 3.4743 ## ## Random effects: ## Groups Name Variance Std.Dev. ## participant (Intercept) 0.1348 0.3672 ## item (Intercept) 0.1387 0.3724 ## Residual 1.3465 1.1604 ## Number of obs: 1568, groups: participant, 98; item, 16 ## ## Fixed effects: ## Estimate Std. Error df t value Pr(>|t|) ## (Intercept) 1.75728 0.10845 22.67242 16.20 5.88e-14 *** ## classABB -1.24116 0.08292 1453.27185 -14.97 < 2e-16 *** ## DiffNoDiff 0.75525 0.07183 1453.38009 10.52 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Correlation of Fixed Effects: ## (Intr) clsABB ## classABB 0.000 38 CHAPTER 2. LINEARNÍ REGRESE ## DiffNoDiff -0.221 -0.577 ## fit warnings: ## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient summary(m4) ## Linear mixed model fit by REML. t-tests use Satterthwaite's method [ ## lmerModLmerTest] ## Formula: as.numeric(rating1) ~ classAB * Diff + (1 | participant) + (1 | ## item) ## Data: items ## ## REML criterion at convergence: 4936.9 ## ## Scaled residuals: ## Min 1Q Median 3Q Max ## -2.1683 -0.6481 -0.2050 0.4471 3.6096 ## ## Random effects: ## Groups Name Variance Std.Dev. ## participant (Intercept) 0.1414 0.3761 ## item (Intercept) 0.1375 0.3708 ## Residual 1.2402 1.1137 ## Number of obs: 1568, groups: participant, 98; item, 16 ## ## Fixed effects: ## Estimate Std. Error df t value Pr(>|t|) ## (Intercept) 2.20337 0.11491 29.42257 19.175 < 2e-16 *** ## classABB -0.89223 0.07959 1452.25414 -11.211 < 2e-16 *** ## DiffNoDiff 0.30942 0.07959 1452.25414 3.888 0.000106 *** ## classABB:DiffNoDiff -0.34938 0.11252 1452.10172 -3.105 0.001939 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Correlation of Fixed Effects: ## (Intr) clsABB DffNDf ## classABB -0.346 ## DiffNoDiff -0.346 0.500 ## clssABB:DND 0.245 -0.707 -0.707 anova(m4,m5) ## refitting model(s) with ML (instead of REML) ## Data: items ## Models: ## m5: as.numeric(rating1) ~ classAB * Diff + (1 | participant) + (1 | item) ## m4: as.numeric(rating1) ~ classAB * Diff + (1 | participant) + (1 | item) 2.4. LINGVISTICKÝ PŘÍKLAD 39 ## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq) ## m5 6 5056.6 5088.8 -2522.3 5044.6 ## m4 7 4937.8 4975.3 -2461.9 4923.8 120.81 1 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # m1 wins but m4 reports intersting interaction 40 CHAPTER 2. LINEARNÍ REGRESE Chapter 3 Logistická regrese • podle The Art of Statistics GitHub repo 3.1 Klasická lineární regrese Data from 1991-1995 are contained in 05-1-galton-x.csv, Although the book says the data is from HistData: Data Sets from the History of Statistics and Data Visualization, 2018, I have actually used this version of Galton’s Height Data galton<-read.csv("05-1-galton-x.csv",header=TRUE) # read csv file into dataframe galton attach(galton) #uncomment if/while necessary summary(galton) ## Family Father Mother Gender ## Length:898 Min. :62.00 Min. :58.00 Length:898 ## Class :character 1st Qu.:68.00 1st Qu.:63.00 Class :character ## Mode :character Median :69.00 Median :64.00 Mode :character ## Mean :69.23 Mean :64.08 ## 3rd Qu.:71.00 3rd Qu.:65.50 ## Max. :78.50 Max. :70.50 ## Height Kids ## Min. :56.00 Min. : 1.000 ## 1st Qu.:64.00 1st Qu.: 4.000 ## Median :66.50 Median : 6.000 ## Mean :66.76 Mean : 6.136 ## 3rd Qu.:69.70 3rd Qu.: 8.000 ## Max. :79.00 Max. :15.000 41 42 CHAPTER 3. LOGISTICKÁ REGRESE # summary statistics # need means for unique fathers and mothers - identify first mention of each family Unique.Fathers=numeric() Unique.Mothers=numeric() nunique=1 # number of unique families Unique.Fathers[1] = Father[1] Unique.Mothers[1] = Mother[1] for(i in 2:length(Family)) { if(Family[i] != Family[i-1]){ nunique=nunique+1 Unique.Fathers[nunique]=Father[i] Unique.Mothers[nunique]=Mother[i] } } length(Unique.Fathers) ## [1] 197 summary(Unique.Fathers) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 62.00 68.00 69.50 69.35 71.00 78.50 sd(Unique.Fathers) ## [1] 2.622034 length(Unique.Mothers) ## [1] 197 summary(Unique.Mothers) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 58.00 62.70 64.00 63.98 65.50 70.50 sd(Unique.Mothers) ## [1] 2.355607 Son = Height[Gender=="M"] length(Son) ## [1] 465 summary(Son) ## Min. 1st Qu. Median Mean 3rd Qu. Max. 3.1. KLASICKÁ LINEÁRNÍ REGRESE 43 ## 60.00 67.50 69.20 69.23 71.00 79.00 sd(Son) ## [1] 2.631594 Daughter = Height[Gender=="F"] length(Daughter) ## [1] 433 summary(Daughter) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 56.00 62.50 64.00 64.11 65.50 70.50 sd(Daughter) ## [1] 2.37032 3.1.1 Figure 5.1 (page 124) Linear regression of sons’ on fathers’ heights # Heights of fathers of sons FatherS = Father[Gender=="M"] fit <- lm(Son ~ FatherS) # linear regression data in fit Predicted <- predict(fit) # Get the predicted values summary(fit) ## ## Call: ## lm(formula = Son ~ FatherS) ## ## Residuals: ## Min 1Q Median 3Q Max ## -9.3774 -1.4968 0.0181 1.6375 9.3987 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 38.25891 3.38663 11.30 <2e-16 *** ## FatherS 0.44775 0.04894 9.15 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.424 on 463 degrees of freedom ## Multiple R-squared: 0.1531, Adjusted R-squared: 0.1513 ## F-statistic: 83.72 on 1 and 463 DF, p-value: < 2.2e-16 44 CHAPTER 3. LOGISTICKÁ REGRESE FatherS.j <- jitter(FatherS, factor=5) Son.j <- jitter(Son, factor=5) xlims=ylims=c(55,80) par(mfrow=c(1,1), mar=c(4,4,2,0), pty="s") # square plot plot(FatherS.j, Son.j, xlim=xlims,ylim=ylims,cex=0.7, xlab="father's height (inches)",ylab="son's height (inches)" , col="gray68") lines(c(xlims[1],xlims[2]),c(xlims[1],xlims[2]),lty=2 ) lines(Predicted~FatherS,lwd=2) 55 60 65 70 75 80 556065707580 father’s height (inches) son’sheight(inches) 3.1.2 Now in ggplot library(ggplot2) # create new data frame with exact and jittered, and predcted values Males = cbind.data.frame(FatherS,FatherS.j,Son,Son.j,Predicted) p <- ggplot(Males, aes(x=FatherS, y=Son)) # initial plot object 3.1. KLASICKÁ LINEÁRNÍ REGRESE 45 p <- p + geom_point(x=FatherS.j,y=Son.j,shape= 1) # defines scatter type plot p <- p + labs(x="Father's height (inches)", y= "Son's height (inches)") # adds x and y axis label p <- p + theme(legend.position="none")#, legend.box = "horizontal") # removes the legend p <- p + expand_limits(x = c(55,80),y = c(55,80)) # expand the axis limits p <- p + geom_line(aes(FatherS,Predicted),size=1.5) # add previously fitted linear regression lin p <- p + geom_abline(slope=1, linetype="dashed") # line to represent equality between son and fat # select single data points by CSV datarow numbers pointA=c(137) pointB=c(28) # plot residual line and end points for selectedpointA p <- p + geom_point(aes(x=FatherS.j[pointA], y = Predicted[pointA]), shape = 1) p <- p + geom_point(aes(x=FatherS.j[pointA], y = Son.j[pointA]), shape = 1) p <- p + geom_segment(linetype="dashed", size=1, colour="purple",aes(x=FatherS.j[pointA],y=Son.j[ # plot residual line and end points for pointB p <- p + geom_point(aes(x=FatherS.j[pointB], y = Predicted[pointB]), shape = 1) p <- p + geom_point(aes(x=FatherS.j[pointB], y = Son.j[pointB]), shape = 1) p <- p + geom_segment(linetype="dashed", size=1, colour="purple",aes(x=FatherS.j[pointB],y=Son.j[ p #displays the result 55 60 65 70 75 80 55 60 65 70 75 80 Father’s height (inches) Son’sheight(inches) 46 CHAPTER 3. LOGISTICKÁ REGRESE Figure 5.1 Scatter of heights of 465 fathers and sons from Galton’s data (many fathers are repeated since they have multiple sons). A jitter has been added to separate the points, and the diagonal dashed line represents exact equality between son and father’s heights. The solid line is the standard ‘best-fit’ line. Each point gives rise to a ‘residual’ (dashed line), which is the size of the error were we to use the line to predict a son’s height from his father’s. 3.2 Logistická regrese Data from 1991-1995 are contained in 02-5-child-heart-surgery-1991-x.csv, and taken from D. J. Spiegelhalter et al., Commissioned Analysis of Surgical Performance Using Routine Data: Lessons from the Bristol Inquiry. library(ggplot2) child.1991 <- read.csv("02-5-child-heart-surgery-1991-x.csv") # read data into datafram attach(child.1991) # leave first row (Bristol) out of the fit fit=glm(Survivors/Operations ~ Operations, weight=Operations, family="binomial",data=ch summary(fit) ## ## Call: ## glm(formula = Survivors/Operations ~ Operations, family = "binomial", ## data = child.1991[-1, ], weights = Operations) ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.7348545 0.1410843 12.297 <2e-16 *** ## Operations 0.0009615 0.0003807 2.526 0.0115 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 18.623 on 10 degrees of freedom ## Residual deviance: 12.169 on 9 degrees of freedom ## AIC: 72.622 ## ## Number of Fisher Scoring iterations: 4 predictions=100*predict(fit, data.frame(Operations=0:700), type="response") # predicti pred.frame = data.frame(Extremes=0:700,predictions ) # data frame for predictions 3.3. JINÝ NELINGVISTICKÝ PŘÍKLAD 47 p <- ggplot() p <- p + geom_point(child.1991, mapping=aes(x=Operations, y=100*Survivors/Operations, col=Hospita p <- p + expand_limits(x = c(0,700),y=c(70,100)) p <- p + labs(x="Number of operations", y = "% 30-day survival", title="(a) Survival in under-1s p <- p + geom_line(dat=pred.frame, aes(x=Extremes,y=predictions), size=1) # add previously fitted p 70 80 90 100 0 200 400 600 Number of operations %30−daysurvival Hospital Birmingham Bristol Brompton Great Ormond St Guys Harefield Leeds Leicester Liverpool Newcastle Oxford Southampton (a) Survival in under−1s, 1991−1995 Figure 5.2 Fitted logistic regression model for child heart surgery data for under- 1s in UK hospitals between 1991 and 1995. Hospitals treating more patients have better survival. The line is part of a curve that will never reach 100%, and is fitted ignoring the outlying data-point representing Bristol. 3.3 Jiný nelingvistický příklad • z Wikipedie hours <- c(0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00, 3.25, 3.50, 4. pass <- c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1) plot(hours, pass) 48 CHAPTER 3. LOGISTICKÁ REGRESE 1 2 3 4 5 0.00.20.40.60.81.0 hours pass df <- data.frame(hours, pass) logitm <- glm(pass ~ hours ,data = df, family = "binomial") logitm ## ## Call: glm(formula = pass ~ hours, family = "binomial", data = df) ## ## Coefficients: ## (Intercept) hours ## -4.078 1.505 ## ## Degrees of Freedom: 19 Total (i.e. Null); 18 Residual ## Null Deviance: 27.73 ## Residual Deviance: 16.06 AIC: 20.06 summary(logitm) ## ## Call: ## glm(formula = pass ~ hours, family = "binomial", data = df) ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -4.0777 1.7610 -2.316 0.0206 * ## hours 1.5046 0.6287 2.393 0.0167 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 3.4. LINGVISTICKÝ PŘÍKLAD 49 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 27.726 on 19 degrees of freedom ## Residual deviance: 16.060 on 18 degrees of freedom ## AIC: 20.06 ## ## Number of Fisher Scoring iterations: 5 exp(1.5046) ## [1] 4.502352 3.4 Lingvistický příklad • Neg-raising experiment 50 CHAPTER 3. LOGISTICKÁ REGRESE Chapter 4 Shrnutí a domácí úkol (do prezentace) • lineární regrese – sonický šroubovák humanitních věd – jednoduché použítí je naprosto intuitivní a jednoduché: require(graphics) plot(women, xlab = "Height (in)", ylab = "Weight (lb)", main = "women data: American women aged 30-39") 51 52 CHAPTER 4. SHRNUTÍ A DOMÁCÍ ÚKOL (DO PREZENTACE) 58 60 62 64 66 68 70 72 120130140150160 women data: American women aged 30−39 Height (in) Weight(lb) data("women") head(women) ## height weight ## 1 58 115 ## 2 59 117 ## 3 60 120 ## 4 61 123 ## 5 62 126 ## 6 63 129 help("women") lm <- lm(women$weight ~ women$height) lm ## ## Call: ## lm(formula = women$weight ~ women$height) ## ## Coefficients: ## (Intercept) women$height ## -87.52 3.45 • y-intercept (-87.52) – žena s nulovou výškou by vážila -87.52 liber 4.1. LINGVISTICKÉ POUŽITÍ 53 • regression-coefficient (3.45) – numericky vyjádřený vztah mezi explanatory a dependent variable – nárustek závislé proměnné, vzroste-li explanatory proměnná a 1 jed- notku summary(lm) ## ## Call: ## lm(formula = women$weight ~ women$height) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.7333 -1.1333 -0.3833 0.7417 3.1167 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 *** ## women$height 3.45000 0.09114 37.85 1.09e-14 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.525 on 13 degrees of freedom ## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903 ## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14 • každý koeficient má: 1. sign (znaménko): pozitivní nebo negativní 2. velikost (síla efektu) 3. signifikance: pravděpodobnost nulové hypotéze vůči danému vzorku 4.1 Lingvistické použití • v experimentu je závislá proměnná odpověď • vysvětlující proměnná (manipulovaná) podmínka • vždy stanovíme jednu podmínku jako referenční • od ní pak model poměřuje “zlepšování”/“zhoršování” jinými podmínkami • základ formule: as.numeric(rating1) ~ condition + (1|participant) + (1|item), data=items • závislá ~ vysvětlující_podmínka + (random effect1) + … library(plyr) library(dplyr) items <- read.csv2("clean_items.csv", encoding = 'UTF-8', header = TRUE) 54 CHAPTER 4. SHRNUTÍ A DOMÁCÍ ÚKOL (DO PREZENTACE) items <- items %>% mutate(condition=replace(condition, condition == "item-méně_než", "fewer")) %>% mutate(condition=replace(condition, condition == "item-nanejvýš", "at-most")) %>% mutate(condition=replace(condition, condition == "item-ne_víc_než", "no-more")) %> mutate(condition=replace(condition, condition == "item-trochu_méně", "slightly-les as.data.frame() ddply(items, .(condition), summarise, Means = mean(rating1, na.rm=TRUE)) ## condition Means ## 1 at-most 1.262755 ## 2 fewer 2.512755 ## 3 no-more 1.311224 ## 4 slightly-less 2.211735 ddply(items, .(condition), summarise, Medians = median(rating1,na.rm=TRUE)) ## condition Medians ## 1 at-most 1 ## 2 fewer 2 ## 3 no-more 1 ## 4 slightly-less 2 library(lmerTest) items$condition <- as.factor(items$condition) items$condition <- relevel(items$condition, ref="at-most") m1 <- lmer(as.numeric(rating1) ~ condition + (1|participant) + (1|item), data=items) summary(m1) ## Linear mixed model fit by REML. t-tests use Satterthwaite's method [ ## lmerModLmerTest] ## Formula: as.numeric(rating1) ~ condition + (1 | participant) + (1 | item) ## Data: items ## ## REML criterion at convergence: 4936.9 ## ## Scaled residuals: ## Min 1Q Median 3Q Max ## -2.1683 -0.6481 -0.2050 0.4471 3.6096 ## ## Random effects: ## Groups Name Variance Std.Dev. 4.2. DOMÁCÍ ÚKOL DO PREZENTACE 55 ## participant (Intercept) 0.1414 0.3761 ## item (Intercept) 0.1375 0.3708 ## Residual 1.2402 1.1137 ## Number of obs: 1568, groups: participant, 98; item, 16 ## ## Fixed effects: ## Estimate Std. Error df t value Pr(>|t|) ## (Intercept) 1.271e+00 1.149e-01 2.942e+01 11.063 5.3e-12 *** ## conditionfewer 1.242e+00 7.959e-02 1.452e+03 15.601 < 2e-16 *** ## conditionno-more 3.996e-02 7.959e-02 1.452e+03 0.502 0.616 ## conditionslightly-less 9.322e-01 7.961e-02 1.452e+03 11.710 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Correlation of Fixed Effects: ## (Intr) cndtnf cndtnn## conditinfwr -0.346 ## conditnn-mr -0.346 0.500 ## cndtnslght- -0.346 0.500 0.500 4.2 Domácí úkol do prezentace • lineární model dvou podmínek • měl by vyjít analogicky vůči zdrojovému textu • random effects: není nutné, ale lze za ně získat body navíc 4.3 Dodatky Data from 1991-1995 are contained in 05-1-galton-x.csv, Although the book says the data is from HistData: Data Sets from the History of Statistics and Data Visualization, 2018, I have actually used this version of Galton’s Height Data galton<-read.csv("05-1-galton-x.csv",header=TRUE) # read csv file into dataframe galton attach(galton) #uncomment if/while necessary ## The following objects are masked from galton (pos = 4): ## ## Family, Father, Gender, Height, Kids, Mother summary(galton) ## Family Father Mother Gender ## Length:898 Min. :62.00 Min. :58.00 Length:898 ## Class :character 1st Qu.:68.00 1st Qu.:63.00 Class :character ## Mode :character Median :69.00 Median :64.00 Mode :character 56 CHAPTER 4. SHRNUTÍ A DOMÁCÍ ÚKOL (DO PREZENTACE) ## Mean :69.23 Mean :64.08 ## 3rd Qu.:71.00 3rd Qu.:65.50 ## Max. :78.50 Max. :70.50 ## Height Kids ## Min. :56.00 Min. : 1.000 ## 1st Qu.:64.00 1st Qu.: 4.000 ## Median :66.50 Median : 6.000 ## Mean :66.76 Mean : 6.136 ## 3rd Qu.:69.70 3rd Qu.: 8.000 ## Max. :79.00 Max. :15.000 # summary statistics # need means for unique fathers and mothers - identify first mention of each family Unique.Fathers=numeric() Unique.Mothers=numeric() nunique=1 # number of unique families Unique.Fathers[1] = Father[1] Unique.Mothers[1] = Mother[1] for(i in 2:length(Family)) { if(Family[i] != Family[i-1]){ nunique=nunique+1 Unique.Fathers[nunique]=Father[i] Unique.Mothers[nunique]=Mother[i] } } length(Unique.Fathers) ## [1] 197 summary(Unique.Fathers) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 62.00 68.00 69.50 69.35 71.00 78.50 sd(Unique.Fathers) ## [1] 2.622034 length(Unique.Mothers) ## [1] 197 summary(Unique.Mothers) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 58.00 62.70 64.00 63.98 65.50 70.50 4.3. DODATKY 57 sd(Unique.Mothers) ## [1] 2.355607 Son = Height[Gender=="M"] length(Son) ## [1] 465 summary(Son) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 60.00 67.50 69.20 69.23 71.00 79.00 sd(Son) ## [1] 2.631594 Daughter = Height[Gender=="F"] length(Daughter) ## [1] 433 summary(Daughter) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 56.00 62.50 64.00 64.11 65.50 70.50 sd(Daughter) ## [1] 2.37032 4.3.1 Figure 5.1 (page 124) Linear regression of sons’ on fathers’ heights # Heights of fathers of sons FatherS = Father[Gender=="M"] fit <- lm(Son ~ FatherS) # linear regression data in fit Predicted <- predict(fit) # Get the predicted values summary(fit) ## ## Call: ## lm(formula = Son ~ FatherS) ## ## Residuals: ## Min 1Q Median 3Q Max ## -9.3774 -1.4968 0.0181 1.6375 9.3987 ## 58 CHAPTER 4. SHRNUTÍ A DOMÁCÍ ÚKOL (DO PREZENTACE) ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 38.25891 3.38663 11.30 <2e-16 *** ## FatherS 0.44775 0.04894 9.15 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.424 on 463 degrees of freedom ## Multiple R-squared: 0.1531, Adjusted R-squared: 0.1513 ## F-statistic: 83.72 on 1 and 463 DF, p-value: < 2.2e-16 FatherS.j <- jitter(FatherS, factor=5) Son.j <- jitter(Son, factor=5) xlims=ylims=c(55,80) par(mfrow=c(1,1), mar=c(4,4,2,0), pty="s") # square plot plot(FatherS.j, Son.j, xlim=xlims,ylim=ylims,cex=0.7, xlab="father's height (inches)",ylab="son's height (inches)" , col="gray68") lines(c(xlims[1],xlims[2]),c(xlims[1],xlims[2]),lty=2 ) lines(Predicted~FatherS,lwd=2) 4.3. DODATKY 59 55 60 65 70 75 80 556065707580 father’s height (inches) son’sheight(inches) 4.3.2 Now in ggplot library(ggplot2) # create new data frame with exact and jittered, and predcted values Males = cbind.data.frame(FatherS,FatherS.j,Son,Son.j,Predicted) p <- ggplot(Males, aes(x=FatherS, y=Son)) # initial plot object p <- p + geom_point(x=FatherS.j,y=Son.j,shape= 1) # defines scatter type plot p <- p + labs(x="Father's height (inches)", y= "Son's height (inches)") # adds x and y axis label p <- p + theme(legend.position="none")#, legend.box = "horizontal") # removes the legend p <- p + expand_limits(x = c(55,80),y = c(55,80)) # expand the axis limits p <- p + geom_line(aes(FatherS,Predicted),size=1.5) # add previously fitted linear regression lin p <- p + geom_abline(slope=1, linetype="dashed") # line to represent equality between son and fat # select single data points by CSV datarow numbers pointA=c(137) 60 CHAPTER 4. SHRNUTÍ A DOMÁCÍ ÚKOL (DO PREZENTACE) pointB=c(28) # plot residual line and end points for selectedpointA p <- p + geom_point(aes(x=FatherS.j[pointA], y = Predicted[pointA]), shape = 1) p <- p + geom_point(aes(x=FatherS.j[pointA], y = Son.j[pointA]), shape = 1) p <- p + geom_segment(linetype="dashed", size=1, colour="purple",aes(x=FatherS.j[pointA # plot residual line and end points for pointB p <- p + geom_point(aes(x=FatherS.j[pointB], y = Predicted[pointB]), shape = 1) p <- p + geom_point(aes(x=FatherS.j[pointB], y = Son.j[pointB]), shape = 1) p <- p + geom_segment(linetype="dashed", size=1, colour="purple",aes(x=FatherS.j[pointB p #displays the result 55 60 65 70 75 80 55 60 65 70 75 80 Father’s height (inches) Son’sheight(inches) Figure 5.1 Scatter of heights of 465 fathers and sons from Galton’s data (many fathers are repeated since they have multiple sons). A jitter has been added to separate the points, and the diagonal dashed line represents exact equality between son and father’s heights. The solid line is the standard ‘best-fit’ line. Each point gives rise to a ‘residual’ (dashed line), which is the size of the error were we to use the line to predict a son’s height from his father’s. Data from 1991-1995 are contained in 02-5-child-heart-surgery-1991-x.csv, and taken from D. J. Spiegelhalter et al., Commissioned Analysis of Surgical Performance Using Routine Data: Lessons from the Bristol Inquiry. 4.3. DODATKY 61 library(ggplot2) child.1991 <- read.csv("02-5-child-heart-surgery-1991-x.csv") # read data into dataframe attach(child.1991) ## The following objects are masked from child.1991 (pos = 4): ## ## Deaths, Hospital, Operations, PercentageDying, Survivors, ## ThirtyDaySurvival # leave first row (Bristol) out of the fit fit=glm(Survivors/Operations ~ Operations, weight=Operations, family="binomial",data=child.1991[- summary(fit) ## ## Call: ## glm(formula = Survivors/Operations ~ Operations, family = "binomial", ## data = child.1991[-1, ], weights = Operations) ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.7348545 0.1410843 12.297 <2e-16 *** ## Operations 0.0009615 0.0003807 2.526 0.0115 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 18.623 on 10 degrees of freedom ## Residual deviance: 12.169 on 9 degrees of freedom ## AIC: 72.622 ## ## Number of Fisher Scoring iterations: 4 predictions=100*predict(fit, data.frame(Operations=0:700), type="response") # predictions for ex pred.frame = data.frame(Extremes=0:700,predictions ) # data frame for predictions p <- ggplot() p <- p + geom_point(child.1991, mapping=aes(x=Operations, y=100*Survivors/Operations, col=Hospita p <- p + expand_limits(x = c(0,700),y=c(70,100)) p <- p + labs(x="Number of operations", y = "% 30-day survival", title="(a) Survival in under-1s p <- p + geom_line(dat=pred.frame, aes(x=Extremes,y=predictions), size=1) # add previously fitted p 62 CHAPTER 4. SHRNUTÍ A DOMÁCÍ ÚKOL (DO PREZENTACE) 70 80 90 100 0 200 400 600 Number of operations %30−daysurvival Hospital Birmingham Bristol Brompton Great Ormond St Guys Harefield Leeds Leicester Liverpool Newcastle Oxford Southampton (a) Survival in under−1s, 1991−1995 Figure 5.2 Fitted logistic regression model for child heart surgery data for under- 1s in UK hospitals between 1991 and 1995. Hospitals treating more patients have better survival. The line is part of a curve that will never reach 100%, and is fitted ignoring the outlying data-point representing Bristol. Bibliography 63