POL036 (first part)
Research Methods in Political Science I: Advanced Topics in Applied Regression

Class day/time: Fall semester 2017, 29 September – 1 October, room 41
Number of credits: 2[1]
Class type: lectures and seminars
Type of completion: credit
Instructor: Constantin Manuel Bosancianu[2]
Contact person: Miroslav Nemčok (miroslav.nemcok@gmail.com)
Assistance: Michal Pink

1 Workshop plan

This intensive 3-day workshop will expand the standard OLS toolbox that participants are accustomed to using. The expansion ventures into territory where OLS estimation produces biased or inefficient results due to the violation of one or more of the standard regression assumptions. Given the relative simplicity, speed, elegance and robustness of OLS over more complex estimation procedures, this course therefore tries to “save” the least squares framework (as much as possible) by introducing adaptations of it.

We start with a brief coverage of the most important OLS assumptions, emphasizing in particular those that refer to the regression residuals: their Gaussian distribution, constant variance (homoskedasticity), and linear relationship to the predictors. We discuss, in turn, how OLS estimates of effect and uncertainty are impacted by violations of these assumptions, and what tools we have available in R to diagnose these problems. I make the point that these assumptions are frequently not met in the course of many analyses, leading to biased estimates and, therefore, shaky conclusions.

The rest of the first session is dedicated to the issue of heteroskedasticity[3]: what its implications are for estimates, how it can be detected in the course of a standard analysis, and how commonly it appears as a problem. To address this issue, I present two potential solutions. The first is heteroskedasticity-consistent standard errors, which address the problem in cases in which we have no clear idea of the shape of the non-constant variance. Heteroskedasticity-consistent SEs continue to be a very popular approach in a variety of disciplines, which is why they are covered in depth here. The second, more general, solution is the use of Weighted Least Squares (WLS). Both subtopics are discussed from a theoretical perspective, as well as in a practical setting, in the laboratory.

On the second day of the workshop I take up the issue of effect heterogeneity across different subpopulations in the sample. In practice, this will involve an in-depth discussion of interactions in linear models. We will cover two-way and three-way interactions, both for continuous and dichotomous predictors, as well as how to present marginal effects graphically. As we will see, interactions are frequently a source of confusion in published work, and continue to be misinterpreted. In the final part of the day, I bring up the issue of fixed effects, as a solution to omitted variable bias in regression models. Such a strategy is frequently invoked in the search for accurate causal estimates of effects. As on the previous day, the theoretical coverage is followed by applied lab work, using R and empirical data.

I conclude the workshop with a presentation of semi-parametric models, which can be used to model both linear and non-linear relationships between variables simultaneously. Such models allow us to test specifications in which not all relationships between predictors and outcome are linear, in the search for a more faithful regression-based description of the data. We start from very simple bivariate specifications, based on smoothing splines, and end our discussion with full semi-parametric models.

2 Grading

The students are required to attend all classes to successfully complete the course.

3 Readings

Day 1 (room 41, 15:15-18:30)
· [mandatory] Wooldridge, J. M. (2013). Introductory Econometrics: A Modern Approach, 5th edition. Mason, OH: Cengage Learning. Chapter 8: “Heteroskedasticity” (pp. 268–302).
· [optional] Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. New York: Sage. Chapter 12: “Diagnosing non-normality, nonconstant error variance, and nonlinearity” (pp. 267–306).

Day 2 (room 41, 09:45-13:00)
· [mandatory] Kam, C. D., & Franzese Jr., R. J. (2007). Modeling and Interpreting Interactive Hypotheses in Regression Analysis. Ann Arbor, MI: The University of Michigan Press. Chapter 3 (“Theory to practice”) and Chapter 4 (“The meaning, use, and abuse of some common general-practice rules”), pp. 13–102.
· [optional] Brüderl, J., & Ludwig, V. (2015). “Fixed-effects panel regression”. In Best, H., & Wolf, C. The SAGE Handbook of Regression Analysis and Causal Inference. London: SAGE Reference. Chapter 15, pp. 327–357 (please read only until roughly page 338).

Day 3 (room 41, 09:45-13:00)
· [mandatory] Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. New York: Sage. Chapter 17: “Nonlinear regression” (pp. 451–475).
· [optional] Keele, L. (2008). Semiparametric Regression for the Social Sciences. New York: Wiley. Chapters 1–3 and 5 (pp. 1–84 and 109–136).

4 Software requirements

Participants should bring their own laptops to the sessions. Please make sure that R version 3.4.1 or newer is installed on your computer. Additionally, you will want to have a GUI for R installed. My recommendation is RStudio, which is freely available from https://www.rstudio.com/products/rstudio/download/

5 Ancillary materials

I will circulate all slides, data sets and R script files on which the lectures and labs are based one day before the sessions are scheduled to commence.
(Second part continues below)

POL036 (second part)
Research Methods in Political Science II: Introduction to Computer Assisted Text Analysis

Course Information

Class day/time: Fall semester 2017, 18-20 December, room 41
Number of credits: 2[4]
Class type: lectures and seminars
Type of completion: credit
Instructor: Juraj Medzihorsky[5]
Contact person: Miroslav Nemčok (miroslav.nemcok@gmail.com)
Assistance: Michal Pink

Course outline

This course provides a concise hands-on introduction to computer assisted text analysis for social scientists. The participants will learn how to automate document collection and processing, scale text using dictionaries and dimensionality reduction techniques, and use machine learning techniques to automate text annotation. The course relies on the R language. The course will meet on three days. Each day will consist of two 90-minute sessions, and will contain both a theoretical exposition of the material and computer exercises.

Course requirements

Basic familiarity with content analysis and with the R language. While it is not necessary, the participants are strongly encouraged to install the R language on their computers.

Grading

At the end of each day of the course the participants will be given a take-home exercise. Each of the three exercises will contribute 20% to the final grade. The remaining 40% of the final grade consists of in-class activity, which includes participating in in-class exercises.

Components
Participation   40%
Exercise 1      20%
Exercise 2      20%
Exercise 3      20%

Scale
credit   minimum   maximum
pass     60%       100%
fail     0%        59%

Reading list

Day 1: Introduction to computer-assisted text analysis (room 41, December 18, 9:45-16:45)
* Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267-297.
Recommended:
* Munzert, S., Rubba, C., Meißner, P., & Nyhuis, D. (2014). Automated data collection with R: A practical guide to web scraping and text mining. John Wiley & Sons.
* Roberts, C. W. (2000). A conceptual framework for quantitative text analysis. Quality & Quantity, 34(3), 259-274.

Day 2: Text scaling (room 41, December 19, 9:45-16:45)
* Lowe, W. (2016). Scaling things we can count. Available online.
Recommended:
* Slapin, J. B., & Proksch, S. O. (2008). A scaling model for estimating time-series party positions from texts. American Journal of Political Science, 52(3), 705-722.
* Lowe, W., Benoit, K., Mikhaylov, S., & Laver, M. (2011). Scaling policy preferences from coded political texts. Legislative Studies Quarterly, 36(1), 123-155.
* Lowe, W. (2008). Understanding wordscores. Political Analysis, 16(4), 356-371.

Day 3: Text categorization and annotation (room 41, December 20, 9:45-16:45)
* Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
* Benoit, K., Conway, D., Lauderdale, B. E., Laver, M., & Mikhaylov, S. (2016). Crowd-sourced text analysis: reproducible and agile production of political data. American Political Science Review, 110(2), 278-295.
Recommended:
* Roberts, M. E., et al. (2014). Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064-1082.
* Carlson, D., & Montgomery, J. M. (2017). A Pairwise Comparison Framework for Fast, Flexible, and Reliable Human Coding of Political Texts. American Political Science Review, 1-9.

________________________________
[1] If a student attends only this class, s/he will be awarded 2 ECTS. In case of attendance of both classes, the resulting number of ECTS for a student is 4.
[2] Research Fellow in the Institutions and Political Inequality research unit, Wissenschaftszentrum Berlin für Sozialforschung (WZB), Berlin, Germany. Email: manuel.bosancianu@outlook.com
[3] Essentially, this is the violation of the assumption of constant variance (homoskedasticity).
[4] If a student attends only this class, s/he will be awarded 2 ECTS. In case of attendance of both classes, the resulting number of ECTS for a student is 4.
[5] Postdoctoral Fellow at the V-Dem Institute, The University of Gothenburg, Sweden. Email: juraj.medzihorsky@gmail.com