Standards of programming in R
R style guide

Stanislav Katina
Institute of Mathematics and Statistics, Masaryk University
Honorary Research Fellow, The University of Glasgow
November 29, 2016

Why R?

• R is open source software. It has many advantages over commercial statistical platforms such as MATLAB, SAS and SPSS.
• R has its roots in the statistics community, being created by statisticians for statisticians. This is reflected in the design of the programming language: many of its core language elements are geared toward statistical analysis.
• The amount of code that we need to write in [...]

[...]

• Console
• R help viewer
• Graphics panel viewer
• File system explorer
• Package manager
• Integration with version control systems

The primary difference is that one runs as a standalone, single-user application (RStudio Desktop) and the other (RStudio Server) is installed on a server, accessed via browser, and enables multiple users to take advantage of the compute infrastructure.

R - reading in data

[...]

A vector in R is a container vector, a statistician's collection of data, not a mathematical vector. The R language is designed around the assumption that a vector is an ordered set of measurements rather than a geometrical position or a physical state.
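The idea of a vector as a collection of measurements can be sketched as follows. This is only an illustration: the variable name heights and its values are made up and do not come from the slides. Arithmetic is applied to each measurement separately, and summary functions treat the vector as a sample.

```r
# Hypothetical measurements in centimetres (illustrative values only)
heights <- c(172, 181, 165, 190)

# Arithmetic is applied element by element
heights_m <- heights / 100

# Multiplying two vectors is also element-wise, not a dot product
squares <- heights_m * heights_m

# Summary functions treat the vector as a sample of data
avg <- mean(heights)

# Number of measurements stored in the vector
n <- length(heights)
```

Here squares has the same length as heights, which is what one would expect from a container of data rather than from a geometric vector.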
[...]
  - Boolean subscripts, e.g. x[x > 3]: the expression x > 3 evaluates to a vector of TRUE or FALSE values; when a vector with a Boolean subscript appears in an assignment, the assignment applies to the elements that would have been extracted had there been no assignment (x[x > 3] <- 7)
  - nothing: a subscript can be left out entirely (so x[] simply returns x)

R - sequences, replications

• sequences
  - the expression seq(a, b, n) creates a sequence from a to b in steps of size n
  - the notation a:b is an abbreviation for seq(a, b, 1)
  - the notation seq(a, b, length = n) is a variation that sets the step size to (b - a)/(n - 1), so that the sequence has n points

    seq(1, 10, by = 2)       # odd numbers
    seq(1, 10, length = 4)
    seq(1, 10, by = 0.05)    # sufficiently dense sequence (?)

• replications: the function rep(x) replicates the values in x; important arguments are times, each and length

    rep(1:4, 2)
    rep(1:4, each = 2)             # not the same as above
    rep(1:4, c(2, 2, 2, 2))        # the same as above
    rep(1:4, c(2, 1, 2, 1))
    rep(1:4, each = 2, len = 4)    # only first four elements

R - types

• the type of a vector is the type of the elements it contains and must be one of the following: logical, integer, numeric, character, factor, complex, double (creates a double-precision vector), or raw
• all elements of a vector must have the same underlying type (this restriction does not apply to lists)

    x1 <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)    # logical vector
    x2 <- c(1, 2, 5.3, 6, -2, 4)                     # numeric vector
    x3 <- c("one", "two", "three")                   # character vector
    gender <- c(rep("male", 20), rep("female", 30))
    gender <- factor(gender)                         # factor vector

• type conversion functions follow the naming convention as.xxxx(): the function converts its argument to type xxxx, e.g., as.integer(4.2) returns the integer 4 (the fractional part is dropped), and as.character(4.2) returns the string "4.2" (see also is.xxxx())

R - Boolean operators

• true values are written T or TRUE, false values F or FALSE
• the shorter form operators and "&" and or "|" apply element-wise on vectors (they are vectorized)

    ((-2:2) >= 0) & ((-2:2) <= 0)
    # [1] FALSE FALSE  TRUE FALSE FALSE

• the longer form operators and "&&" and or "||" are often used in conditional statements; they evaluate left to right and work on single values (older versions of R examined only the first element of each vector; since R 4.3.0 an operand longer than one element is an error)

    x <- -2:2
    (x[1] >= 0) && (x[1] <= 0)
    # [1] FALSE

• the longer form operators will not evaluate their second argument if the return value is determined by the first argument

R - lists, matrices, arrays

• lists are like vectors, except that elements need not all have the same type, e.g. the first element of a list could be an integer and the second element a string or a vector of Boolean values
  - lists are created using the list() function
  - elements can be accessed by position using "[[ ]]"
  - named elements of lists can be accessed by the dollar sign "$"

    A <- list(name = "John", age = 24)
    A[[1]]
    A$name

  - if you attempt to access a non-existent element of a list, say A[[3]] above, you will get an error
  - you can assign to a non-existent element of a list, thus extending the list; if the index you assign to is more than one past the end of the list, intermediate elements are created and assigned NULL values

R - matrices, arrays, data frames

• matrix and array: R has no separate matrix or array type, only vectors, but you can change the dimension of a vector, essentially making it a matrix (see also rbind(), cbind())
  - R fills matrices by column
  - to fill a matrix by row, add the argument byrow = TRUE to the call to the matrix() function

    A1 <- array(c(1, 2, 3, 4, 5, 6), dim = c(2, 3))
    A2 <- matrix(c(1, 2, 3, 4, 5, 6), 2, 3)
    A3 <- matrix(c(1, 2, 3, 4, 5, 6), 2, 3, byrow = TRUE)

• data frame: more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.)

    x1 <- c(1, 2, 3, 4)
    x2 <- c("red", "white", "red", NA)
    x3 <- c(TRUE, TRUE, TRUE, FALSE)
    mydata <- data.frame(x1, x2, x3)
    names(mydata) <- c("ID", "Color", "Passed")    # variable names

R - missing values and NaNs

• an operation on numbers may return one of two kinds of non-number
  - "not a number": NaN
  - "not available": NA (indicates missing data, and is unfortunately fairly common in data sets)
• the author of an R function has no control over the data the function will receive, because NA is a legal value inside an R vector; there is no way to specify that a function takes only vectors with non-missing components, so you must handle NA values, even if you handle them by returning an error
• the function is.nan() returns TRUE for those components of its argument that are NaN (see also !is.nan())
• the function is.na() returns TRUE for those components that are NA or NaN (see also !is.na())

R - miscellaneous

• sessionInfo(): prints the R version, OS, packages loaded, etc.
• help(fctn): displays help on any function fctn
• the function quit() or its alias q() terminates the current R session
• save.image() is just a short-cut for "save my current workspace"
• ls(): shows which objects are defined
• rm(list = ls()): clears all defined objects
• the prefixes d, p, q, r stand for density (probability density function, PDF), probability (cumulative distribution function, CDF), quantile (inverse CDF), and random sample; e.g., dnorm() is the density function of a normal random variable and rnorm() generates a sample from a normal random variable

    distribution      d (mass or density)   p (distribution function)   q (quantile)   r (pseudo-random numbers)
    binomial          dbinom()              pbinom()                    qbinom()       rbinom()
    Poisson           dpois()               ppois()                     qpois()        rpois()
    multinomial       dmultinom()           -                           -              rmultinom()
    gamma             dgamma()              pgamma()                    qgamma()       rgamma()
    normal            dnorm()               pnorm()                     qnorm()        rnorm()
    Student t         dt()                  pt()                        qt()           rt()
    chi-squared (χ²)  dchisq()              pchisq()                    qchisq()       rchisq()
    Fisher F          df()                  pf()                        qf()           rf()

  For the binomial, Poisson and multinomial distributions the d function is a probability mass function; for the others it is a density function. Base R provides only dmultinom() and rmultinom() for the multinomial distribution.

• multivariate normal distribution, pseudo-random numbers: rmvnorm() (library mvtnorm) and mvrnorm() (library MASS)

For more details see e.g. R language for programmers.
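The d, p, q, r prefix convention can be tried out with a short sketch like the one below. The parameter values and the seed are arbitrary choices made for illustration and are not taken from the slides.

```r
# Standard normal distribution, N(0, 1)
d0 <- dnorm(0)        # density at 0, equal to 1 / sqrt(2 * pi)
p1 <- pnorm(1.96)     # P(X <= 1.96), close to 0.975
q1 <- qnorm(p1)       # the quantile function inverts the CDF, giving back 1.96

# Discrete case: binomial with 10 trials and success probability 0.5
b5 <- dbinom(5, size = 10, prob = 0.5)    # P(X = 5)

set.seed(1)           # arbitrary seed, used only for reproducibility
smp <- rnorm(5)       # five pseudo-random draws from N(0, 1)
```

The same pattern should carry over to the other distributions in the table by swapping the suffix, for example dpois(), ppois(), qpois() and rpois() for the Poisson distribution.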