Standards of programming in R
R style guide

Stanislav Katina
Institute of Mathematics and Statistics, Masaryk University
Honorary Research Fellow, The University of Glasgow

February 20, 2018

• R is open source software. It has many advantages over commercial statistical platforms such as MATLAB, SAS and SPSS.
• R has its roots in the statistics community, being created by statisticians for statisticians. This is reflected in the design of the programming language: many of its core language elements are geared toward statistical analysis.
• The amount of code that we need to write in R is very small compared to other programming languages. There are many high-level data types and functions available in R that hide the low-level implementation details from the programmer. Although there exist systems used in production with significant complexity, for most data analysis tasks we need to write only a few lines of code.

Statistics
Why R?

• R's history is inextricably tied to its domain-specific predecessors and cousins, as it is 100 percent focused on and built for statistical data analysis and visualization.
• R can access and manipulate various file types and databases (and was also designed for flexibility and extensibility).
• R focuses on foundational analytics-oriented data types. R makes it remarkably simple to run extensive statistical analyses on your data and then generate informative and appealing visualizations with just a few lines of code.
• More modern R libraries/packages extend and enhance these base capabilities and are the foundations of many of the mind- and eye-catching examples of cutting-edge data analysis and visualization.
• R has a vast package library, the Comprehensive R Archive Network, more commonly known as CRAN.
• R also provides an interactive execution shell that has enough basic functionality for general needs.
• The desire for even more interactivity sparked the development of […], which is a combination of integrated development environment (IDE), data exploration tool, and iterative experimentation environment that exponentially enhances […]
• R is more than a programming language. It is an interactive environment for doing statistics. Think of […]
• […] lists active variables but does not list files that begin with a dot
• "..." is used to indicate a variable number of function arguments
• R uses "$" in a manner analogous to the way other languages use dot (identifying the parts of an object); see e.g. data.frame() and list()
• R has several one-letter reserved words: c, q, s, t, C, D, F, I, and T (actually, these are not reserved, but it's best to think of them as reserved)

R style guide

• the preferred form for variable names is all lower case letters and words separated with dots (variable.name), but variableName is also accepted
• function names have initial capital letters and no dots (FunctionName)
• constants are named like functions but with an initial k (kConstantName)
• line length: the maximum line length is 80 characters
• indentation: when indenting your code, use two spaces; never use tabs or mix tabs and spaces (exception: when a line break occurs inside parentheses, align the wrapped line with the first character inside the parenthesis)
• spacing
  - place spaces around all binary operators (=, +, -, <-, etc.)
    exception: spaces around = are optional when passing parameters in a function call
  - do not place a space before a comma, but always place one after a comma
  - place a space before a left parenthesis, except in a function call
  - extra spacing (i.e., more than one space in a row) is okay if it improves alignment of equals signs or arrows (<-)
  - do not place spaces around code in parentheses or square brackets
    exception: always place a space after a comma
• semicolons: do not terminate your lines with semicolons or use semicolons to put more than one command on the same line
• attach(): avoid using it; the possibilities for creating errors when using attach() are numerous
• commenting: comment your code
  - entire commented lines should begin with "#" and one space
  - short comments can be placed after code, preceded by two spaces, "#", and then one space
• function definitions and calls
  - function definitions should first list arguments without default values, followed by those with default values
  - in both function definitions and function calls, multiple arguments per line are allowed; line breaks are only allowed between assignments
• function documentation
  - functions should contain a comments section immediately below the function definition line
  - these comments should consist of a one-sentence description of the function, a list of the function's arguments, denoted by Args:, with a description of each (including the data type), and a description of the return values, denoted by Returns:
  - the comments should be descriptive enough that a caller can use the function without reading any of the function's code
• general layout and ordering
  - copyright statement comment
  - author comment
  - file description comment, including purpose of program, inputs, and outputs
  - source() and library() statements
  - function definitions
  - executed statements, if applicable (e.g., print, plot)

For more details see: Google's R Style Guide and R Coding Conventions

R - vectors

• the built-in function for creating vectors is c()
• "container vector": an ordered collection of numbers with no other structure
  - the length of a vector is the number of elements in the container
  - operations are applied componentwise
• "mathematical vector": an element of a vector space
  - the length of a vector is the geometrical length determined by an inner product
  - the number of components is called the dimension
  - operations are not applied componentwise

A vector in R is a container vector, a statistician's collection of data, not a mathematical vector. The R language is designed around the assumption that a vector is an ordered set of measurements rather than a geometrical position or a physical state.

• five types of indices/subscripts in R
  - positive integers: subscripts that reference particular elements
  - negative integers: an instruction to remove an element from a vector (it makes sense in a statistical context)
  - zero: does nothing (it doesn't even produce an error)
  - Booleans: a Boolean expression with a vector evaluates to a vector of Boolean values, the results of evaluating the expression componentwise (e.g.
    x[x > 3]: the expression x > 3 evaluates to a vector of TRUE and FALSE values)
    when a vector with a Boolean subscript appears in an assignment, the assignment applies to the elements that would have been extracted if there had been no assignment (x[x > 3] <- 7)
  - nothing: a subscript can be left out entirely (so x[] would simply return x)
• sequences
  - the expression seq(a, b, n) creates a sequence over the closed interval from a to b in steps of size n
  - the notation a:b is an abbreviation for seq(a, b, 1)
  - the notation seq(a, b, length=n) is a variation that will set the step size to (b-a)/(n-1) so that the sequence has n points

    seq(1, 10, by = 2)      # odd numbers
    seq(1, 10, length = 4)
    seq(1, 10, by = 0.05)   # sufficiently dense sequence

• replications: the function rep(x) replicates the values in x; important arguments are times, each and length.out (which can be abbreviated, e.g. len)

    rep(1:4, 2)
    rep(1:4, each = 2)            # not the same as above
    rep(1:4, c(2, 2, 2, 2))       # the same as rep(1:4, each = 2)
    rep(1:4, c(2, 1, 2, 1))
    rep(1:4, each = 2, len = 4)   # only first four

• the type of a vector is the type of the elements it contains and must be one of the following: logical, integer, numeric, character, factor, complex, double (creates a double-precision vector), or raw; all elements of a vector must have the same underlying type (this restriction does not apply to lists)

    x1 <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)   # logical vector
    x2 <- c(1, 2, 5.3, 6, -2, 4)                    # numeric vector
    x3 <- c("one", "two", "three")                  # character vector
    gender <- c(rep("male", 20), rep("female", 30))
    gender <- factor(gender)                        # factor vector

• type conversion functions follow the naming convention as.xxxx(): the function converts its argument to type xxxx, e.g., as.integer(4.2) returns the integer 4, and as.character(4.2) returns the string "4.2" (see also is.xxxx())

R - Boolean operators

• Boolean operators
  - true values are T or TRUE and false values are F or FALSE
  - the shorter-form operators "&" (and) and "|" (or) apply element-wise on vectors (they are vectorized)

    ((-2:2) >= 0) & ((-2:2) <= 0)
    # [1] FALSE FALSE  TRUE FALSE FALSE

  - the longer-form operators "&&" (and) and "||" (or) are often used in conditional statements; they evaluate left to right and expect single values (older versions of R examined only the first element of each vector; since R 4.3.0 an operand of length greater than one is an error)

    ((-2:2)[1] >= 0) && ((-2:2)[1] <= 0)   # compare the first elements only
    # [1] FALSE

  - these operators will not evaluate their second argument if the return value is determined by the first argument

R - lists, matrices, arrays

• lists are like vectors, except elements need not all have the same type, e.g. the first element of a list could be an integer and the second element a string or a vector of Boolean values
  - lists are created using the list() function
  - elements can be accessed by position using "[[ ]]"
  - named elements of lists can be accessed by the dollar sign "$"

    A <- list(name = "John", age = 24)
    A[[1]]
    A$name

  - if you attempt to access a non-existent element of a list, say A[[3]] above, you will get an error
  - you can assign to a non-existent element of a list, thus extending the list; if the index you assign to is more than one past the end of the list, intermediate elements are created and assigned NULL values

R - matrices, arrays, data frames

• matrix and array
• data frame: more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.)
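The difference from a matrix can be sketched as follows. This is a minimal illustration, not taken from the slides: the vectors id and colour are made-up data, and stringsAsFactors = FALSE is set explicitly because its default differs between R versions.

```r
# Hypothetical illustration data (not from the slides).
id     <- c(1, 2, 3)
colour <- c("red", "white", "blue")

# A matrix holds a single mode, so the numbers are coerced to character.
m <- cbind(id, colour)
mode(m)

# A data frame keeps the mode of each column.
d <- data.frame(id, colour, stringsAsFactors = FALSE)
sapply(d, mode)
```

Here mode(m) reports a single mode for the whole matrix, while sapply(d, mode) reports one mode per column of the data frame.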
    x1 <- c(1, 2, 3, 4)
    x2 <- c("red", "white", "red", NA)
    x3 <- c(TRUE, TRUE, TRUE, FALSE)
    mydata <- data.frame(x1, x2, x3)
    names(mydata) <- c("ID", "Color", "Passed")   # variable names

R - missing values and NaNs

• missing values and NaNs: the result of an operation on numbers may return different types of non-number
  - "not a number": NaN
  - "not available": NA (used to indicate missing data, which is unfortunately fairly common in data sets)
  - the author of an R function has no control over the data the function will receive, because NA is a legal value inside an