Statistics
Why R?
Standards of programming in R
R style guide
Stanislav Katina¹
¹Institute of Mathematics and Statistics, Masaryk University; Honorary Research Fellow, The University of Glasgow
November 29, 2016
• R is open source software. It has many advantages over commercial statistical platforms such as MATLAB, SAS and SPSS.
• It has its roots in the statistics community, being created by statisticians for statisticians. This is reflected in the design of the programming language: many of its core language elements are geared toward statistical analysis.
• The amount of code that we need to write in R is very small compared to other programming languages. There are many high-level data types and functions available in R that hide the low-level implementation details from the programmer. Although there are R systems of significant complexity used in production, for most data analysis tasks we need to write only a few lines of code.
• R's history is inextricably tied to its domain-specific predecessors and cousins, as it is 100 percent focused on and built for statistical data analysis and visualization.
• R can access and manipulate various file types and databases (and was also designed for flexibility and extensibility).
• R focuses on foundational analytics-oriented data types.
• R makes it remarkably simple to run extensive statistical analyses on your data and then generate informative and appealing visualizations with just a few lines of code.
• More modern R libraries/packages extend and enhance these base capabilities and are the foundations of many of the mind- and eye-catching examples of cutting-edge data analysis and visualization. They are distributed through a vast package library called the Comprehensive R Archive Network, more commonly known as CRAN.
• R also provides an interactive execution shell that has enough basic functionality for general needs.
• The desire for even more interactivity sparked the development of RStudio, which is a combination of integrated development environment (IDE), data exploration tool, and iterative experimentation environment that greatly enhances R's default capabilities.
For more, see: The Comprehensive R Archive Network (CRAN) and RStudio - Open source and enterprise-ready professional software for R.
Both sites provide full installation details for Linux, Windows, and macOS systems. RStudio comes in two flavors: Desktop and Server.
RStudio core features:
• Built-in IDE
• Data structure and workspace exploration tools
• Quick access to the R console
• R help viewer
• Graphics panel viewer
• File system explorer
• Package manager
• Integration with version control systems
The primary difference between the two flavors is that one runs as a standalone, single-user application (RStudio Desktop), while the other (RStudio Server) is installed on a server, accessed via a browser, and enables multiple users to take advantage of the compute infrastructure.
R - reading in data
R abstracts quite a bit of complexity when it comes to reading and parsing data into structures for processing. See functions:
• read.table() - reads a *.txt file in table format and creates a data frame from it
• read.csv() - reads a *.csv file in table format and creates a data frame from it (check also the argument fileEncoding, e.g. "Windows-1250", "UTF-8" or other)
• read.delim() - reads a delimited file whose default separator is a tab
See help(read.table) for the arguments header and sep.
• download.file(url, destfile) - downloads a single file from the url and stores it in destfile; the url must start with a scheme such as http://, https://, ftp:// or file://
• getURL(url) - downloads the content of the url directly into R as a character string, which can then be read with read.table() - in library(RCurl)
First set the working directory to dir using the function setwd(dir). You can check the absolute file path of the current working directory using the function getwd().
## reading *.txt file
DATA <- read.table("DATA.txt", header = TRUE)
## reading *.csv file (fileEncoding re-encodes the input)
DATA <- read.csv("DATA.csv", fileEncoding = "Windows-1250", header = TRUE)
## reading from the web via a downloaded file
URL <- "http://www.math.muni.cz/.../DATA.txt"
download.file(URL, destfile = "DATA.txt", method = "libcurl")
DATA <- read.table("DATA.txt", header = TRUE)
## reading from the web directly into R
install.packages("RCurl")
library(RCurl)
TXT <- getURL(URL)  # file content as one character string
DATA <- read.table(textConnection(TXT), header = TRUE)
head(DATA)
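The reading functions can be tried without any external file. The following is a minimal sketch that writes a small, made-up table to temporary files and reads it back in; the object names and the columns id and score are illustrative assumptions, not the course data.

```r
# Hypothetical example data; not the DATA file used in the slides
example.data <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Write the table to temporary files so nothing is left in the working directory
txt.file <- tempfile(fileext = ".txt")
csv.file <- tempfile(fileext = ".csv")
write.table(example.data, txt.file, row.names = FALSE)
write.csv(example.data, csv.file, row.names = FALSE)

# Read both files back; header = TRUE takes column names from the first line
from.txt <- read.table(txt.file, header = TRUE)
from.csv <- read.csv(csv.file, header = TRUE)

head(from.txt)
```

Using tempfile() keeps the sketch independent of the current working directory set with setwd().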
R functions for reading data from other statistical software:
• readMat() - package R.matlab
• read.spss() - reads a file stored by the SPSS save or export commands - library foreign
• read.ssd() - generates a SAS program to convert the content of an ssd data file to SAS transport format and then uses read.xport() to obtain a data frame - library foreign
• read.xport() - reads a file as a SAS XPORT format library and returns a list of data frames - library foreign
R also provides extensive support for accessing data stored in various SQL and NoSQL databases. For SQL databases, use e.g. library(RPostgreSQL).
R - the statistician - data cleaning
The consistency in the record format makes the consumption of the data equally straightforward in each language. In each language/environment, we follow a typical pattern of:
• Reading in data
• Assigning meaningful column names (if necessary)
• Using built-in functions to get an overview of the data structure
• Taking a look at the first few rows of data, typically with the head() or tail() function
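The four steps of this pattern can be sketched in a few lines. The records below are made up for illustration, and the names raw.text and records are assumptions of this sketch.

```r
# Made-up records without a header line (illustrative only)
raw.text <- "1 red TRUE
2 white TRUE
3 red FALSE
4 blue TRUE"

# 1. read in the data
records <- read.table(text = raw.text, header = FALSE, stringsAsFactors = FALSE)

# 2. assign meaningful column names
names(records) <- c("id", "color", "passed")

# 3. get an overview of the data structure
str(records)

# 4. look at the first and last rows
head(records, 2)
tail(records, 2)
```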
Given some of the "rookie mistakes" seen in many scientific reports (bio-medical, geographical or other) or industry reports (pharmaceutical, security or other) and the prevalence of raw counts in science/industry dashboards, there is a high probability that statistics is the weakest area for science/industry professionals.
You do not need a Ph.D. in statistics to be an effective data scientist. However, it's important to have an understanding of the fundamentals of statistical analysis, even when you are part of a multidisciplinary team.
Understanding and applying statistics correctly is more complex than you might imagine, and individuals in disciplines with a rich history of using statistics to solve complex problems oftentimes fall into common traps.
A hallmark of a good data scientist is adaptability and you should be continually scouring the digital landscape for emerging tools that will help you solve problems.
The data science workflow
[Figure: workflow diagram with the steps "Decide on a question", "Acquire data", "Clean/transform data", "Perform data analysis", "Examine output", "Update question" and "Update analyses"]
Data science
The methodology of extracting insights from data is called data science. Historically, data science has been known by different names: in the early days, it was known simply as statistics, after which it became known as data analytics. There is, however, an important difference between data science and statistics or data analytics.
Data science is a multi-disciplinary subject: it is a combination of statistical analysis, programming, and domain expertise.
Over the last few years, data science has emerged as a discipline in its own right.
Three aspects and their importance:
• Statistical skills are essential in applying the right kind of statistical methodology and in interpreting the results.
• Programming skills are essential to implement the analysis methodology, combine data from multiple sources and, especially, work with large-scale datasets.
• Domain expertise is essential in identifying the problems that need to be solved, forming hypotheses about the solutions, and most importantly understanding how the insights of the analysis should be applied.
However, there is no standardized set of tools that are used in the analysis. Data scientists use a variety of programming languages and tools in their work, sometimes even combining heterogeneous tools to perform a single analysis. This steepens the learning curve for new data scientists.
The R programming environment presents a great homogeneous set of tools for most data science tasks.
R is more than a programming language. It is an interactive environment for doing statistics. Think of R as having a programming language rather than being a programming language. The R language is the scripting language for the R environment. Variables cannot be declared; they come into existence on first assignment. R uses lexical scoping, and it is not always easy to determine the scope of a variable.
R style guide
• the assignment operator in R is "<-" (the arrow) with the receiving variable on the left; it is also possible, though uncommon, to reverse the arrow and put the receiving variable on the right; it is sometimes possible to use "=" for assignment
• when supplying default function arguments or calling functions with named arguments, you must use the "=" operator and cannot use the arrow
• at some time in the past R used underscore as assignment - this meant that the C convention of using underscores as separators in multi-word variable names was not only disallowed but produced strange side effects; however, R now allows underscore as a variable character and not as an assignment operator
• because the underscore was not allowed as a variable character, the convention arose to use dot as a name separator
• don't use hyphens "-" in names
• unlike its use in many object-oriented languages, the dot character in R has no special significance, with two exceptions
  - the ls() function in R lists active variables but does not list objects whose names begin with a dot
  - ... is used to indicate a variable number of function arguments
• R uses "$" in a manner analogous to the way other languages use dot (identifying the parts of an object) - see e.g. data.frame() and list()
• R has several one-letter reserved words: c, q, s, t, C, D, F, I, and T - actually, these are not reserved, but it's best to think of them as reserved
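The assignment and naming rules can be illustrated with a short sketch. All object names here (sample.size, person, and so on) are made up for the illustration.

```r
# The usual assignment: arrow pointing at the receiving variable on the left
sample.size <- 10

# The reversed arrow is legal but uncommon
20 -> other.size

# "=" also assigns at top level, and it is required for named arguments
third.size = 30
values <- seq(from = 1, to = sample.size, by = 3)

# The dot is an ordinary name character; "$" selects a part of an object
person <- list(first.name = "John", age = 24)
person$first.name

# Names starting with a dot are not listed by ls()
.hidden.value <- 1
visible.value <- 2
ls()
```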
• the preferred form for variable names is all lower case letters and words separated with dots (variable.name), but variableName is also accepted
• function names have initial capital letters and no dots (FunctionName)
• constants are named like functions but with an initial k (kConstantName)
• line length - the maximum line length is 80 characters
• indentation - when indenting your code, use two spaces - never use tabs or mix tabs and spaces (exception: when a line break occurs inside parentheses, align the wrapped line with the first character inside the parenthesis)
• spacing
  - place spaces around all binary operators (=, +, -, <-, etc.); exception: spaces around ='s are optional when passing parameters in a function call
  - do not place a space before a comma, but always place one after a comma
  - place a space before a left parenthesis, except in a function call
  - extra spacing (i.e., more than one space in a row) is okay if it improves alignment of equals signs or arrows (<-)
  - do not place spaces around code in parentheses or square brackets; exception: always place a space after a comma
• semicolons - do not terminate your lines with semicolons or use semicolons to put more than one command on the same line
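A short sketch of code that follows these naming, spacing and indentation rules; the constant, variable and function names are hypothetical and exist only to show the conventions.

```r
# Constant: named like a function, with an initial k
kMaxRetries <- 3

# Variable: lower case words separated by dots
retry.count <- 0

# Function: initial capital letters, no dots; spaces around binary operators,
# a space after each comma, and two-space indentation
ScaleValues <- function(values, scale.factor = 2) {
  scaled.values <- values * scale.factor
  return(scaled.values)
}

while (retry.count < kMaxRetries) {  # space before "(" except in function calls
  retry.count <- retry.count + 1
}

scaled <- ScaleValues(c(1, 2, 3), scale.factor = 10)
```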
• attach() - avoid using it - the possibilities for creating errors when using attach() are numerous
• commenting - comment your code
  - entire commented lines should begin with "#" and one space
  - short comments can be placed after code preceded by two spaces, "#", and then one space
• function definitions and calls - function definitions should first list arguments without default values, followed by those with default values - in both function definitions and function calls, multiple arguments per line are allowed; line breaks are only allowed between assignments
• function documentation
  - functions should contain a comments section immediately below the function definition line - these comments should consist of a one-sentence description of the function,
  - a list of the function's arguments, denoted by Args:, with a description of each (including the data type),
  - and a description of the return value, denoted by Returns:
  - the comments should be descriptive enough that a caller can use the function without reading any of the function's code
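As an illustration of the documentation layout, here is a minimal sketch of a documented function. ComputeStandardError is a hypothetical helper written for this example; it is not part of R or of the course material.

```r
# ComputeStandardError is a hypothetical helper written for this illustration.
ComputeStandardError <- function(x, na.rm = FALSE) {
  # Computes the standard error of the mean of a numeric vector.
  #
  # Args:
  #   x: Numeric vector of observations.
  #   na.rm: Logical; if TRUE, missing values are removed before computing.
  #
  # Returns:
  #   A single numeric value, the standard error of the mean of x.
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  return(sd(x) / sqrt(length(x)))
}

standard.error <- ComputeStandardError(c(2, 4, 4, 6))  # short trailing comment
```

The argument without a default (x) comes first and the one with a default (na.rm) follows, as the function definition rule asks.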
• general layout and ordering
  - copyright statement comment
  - author comment
  - file description comment, including purpose of program, inputs, and outputs
  - source() and library() statements
  - function definitions
  - executed statements, if applicable (e.g., print, plot)
For more details see: Google's R Style Guide and R Coding Conventions
R - vectors
• the built-in function for creating vectors is c()
• "container vector" - an ordered collection of numbers with no other structure
  - the length of a vector is the number of elements in the container
  - operations are applied componentwise
• "mathematical vector" - an element of a vector space
  - the length of a vector is the geometrical length determined by an inner product
  - the number of components is called the dimension
  - operations are not applied componentwise
A vector in R is a container vector, a statistician's collection of data, not a mathematical vector. The R language is designed around the assumption that a vector is an ordered set of measurements rather than a geometrical position or a physical state. R supports mathematical vector operations, but they are secondary in the design of the language.
The R language has no provision for scalars. The only way to represent a single number in a variable is to use a vector of length one. It is usually clearer and more efficient in R to operate on vectors as a whole.
• vectors in R are indexed starting with 1, and matrices are stored in column-major order
• elements of a vector can be accessed using "[]"
• vectors automatically expand when assigning to an index past the end of the vector
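The difference between the container view and the mathematical view can be seen in a small sketch; the variable names are illustrative.

```r
x <- c(3, 4)          # a container vector with two elements
length(x)             # number of elements, not the geometrical length

x * x                 # componentwise product
euclidean.length <- sqrt(sum(x * x))   # geometrical length via the inner product
inner.product <- as.numeric(x %*% x)   # %*% gives the mathematical inner product

single.number <- 5    # a "scalar" is just a vector of length one
length(single.number)

x[1]                  # indexing starts at 1
x[5] <- 10            # assigning past the end expands the vector
x                     # positions 3 and 4 are filled with NA
```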
• five types of indices/subscripts in R
  - positive integers - subscripts that reference particular elements
  - negative integers - an instruction to remove an element from a vector (it makes sense in a statistical context)
  - zero - does nothing (it doesn't even produce an error)
  - Booleans
    - a Boolean expression with a vector evaluates to a vector of Boolean values, the results of evaluating the expression componentwise (e.g. in x[x > 3] the expression x > 3 evaluates to a vector of TRUE and FALSE values)
    - when a vector with a Boolean subscript appears in an assignment, the assignment applies to the elements that would have been extracted if there had been no assignment (x[x > 3] <- 7)
  - nothing - a subscript can be left out entirely (so x[] would simply return x)
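Each of the five subscript types can be tried on one small vector. The following is a sketch with made-up values and illustrative names.

```r
x <- c(10, 2, 7, 4, 1)

first.and.third <- x[c(1, 3)]  # positive integers select elements
without.second <- x[-2]        # a negative integer removes an element
empty.result <- x[0]           # zero selects nothing and gives no error
is.large <- x > 3              # a Boolean vector, computed componentwise
large.values <- x[is.large]    # a Boolean subscript keeps the TRUE positions
everything <- x[]              # an empty subscript returns the whole vector

x[x > 3] <- 7                  # assignment only touches the selected elements
```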
R - sequences, replications
• sequences
  - the expression seq(a, b, n) creates a sequence on the closed interval from a to b in steps of size n
  - the notation a:b is an abbreviation for seq(a, b, 1)
  - the notation seq(a, b, length=n) is a variation that will set the step size to (b-a)/(n-1) so that the sequence has n points
seq(1, 10, by=2)  # odd numbers
seq(1, 10, length=4)
seq(1, 10, by=0.05)  # sufficiently dense sequence
• replications - function rep(x) replicates the values in x - important arguments are times, each and length
rep(1:4, 2)
rep(1:4, each=2)  # not the same as above
rep(1:4, c(2,2,2,2))  # the same as above
rep(1:4, c(2,1,2,1))
rep(1:4, each=2, len=4)  # only first four elements
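The relation between the step-size form and the length form of seq() can be checked directly. This sketch uses length.out, the full name of the argument abbreviated as length above; the values of a, b and n are arbitrary.

```r
a <- 0
b <- 1
n <- 5

by.step <- seq(a, b, by = 0.25)          # fixed step size
by.length <- seq(a, b, length.out = n)   # n points; step is (b - a) / (n - 1)
step.size <- (b - a) / (n - 1)

short.form <- 1:4                        # same values as seq(1, 4, 1)

repeated.whole <- rep(1:2, times = 3)    # 1 2 1 2 1 2
repeated.each <- rep(1:2, each = 3)      # 1 1 1 2 2 2
```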
R - types
• the type of a vector is the type of the elements it contains and must be one of the following: logical, integer, numeric, character, factor, complex, double (creates a double-precision vector), or raw - all elements of a vector must have the same underlying type (this restriction does not apply to lists)
x1 <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)  # logical vector
x2 <- c(1, 2, 5.3, 6, -2, 4)  # numeric vector
x3 <- c("one", "two", "three")  # character vector
gender <- c(rep("male", 20), rep("female", 30))
gender <- factor(gender)  # factor vector
• type conversion functions follow the naming convention as.xxxx() for the function that converts its argument to type xxxx, e.g., as.integer(4.2) returns the integer 4, and as.character(4.2) returns the string "4.2" (see also is.xxxx())
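The as.xxxx() and is.xxxx() families can be explored with a few calls. This is a minimal sketch with illustrative names.

```r
whole.part <- as.integer(4.2)    # truncates toward zero
as.text <- as.character(4.2)     # number to string
back.to.number <- as.numeric("4.2")

is.integer(whole.part)           # TRUE
is.character(as.text)            # TRUE

# Mixing types in c() coerces everything to one common type
mixed <- c(1, "two", TRUE)
class(mixed)                     # "character"

# A factor stores integer codes plus a set of levels
gender <- factor(c("male", "female", "male"))
levels(gender)
```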
Boolean operators
• true values - T or TRUE, and false values - F or FALSE
• the shorter form operators and "&" and or "|" apply element-wise on vectors (are vectorized)
((-2:2) >= 0) & ((-2:2) <= 0)
# [1] FALSE FALSE  TRUE FALSE FALSE
• the longer form operators and "&&" and or "||" are often used in conditional statements; they evaluate left to right and expect operands of length one (older versions of R examined only the first element of each vector; since R 4.3.0 a longer operand is an error)
((-2:2)[1] >= 0) && ((-2:2)[1] <= 0)
# [1] FALSE
• the operators "&&" and "||" will not evaluate their second argument if the return value is determined by the first argument
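The contrast between the vectorized and the short-circuit operators is easiest to see side by side. The following sketch uses only length-one operands with "&&", so it also runs on recent versions of R; the variable names are illustrative.

```r
x <- -2:2

elementwise <- (x >= 0) & (x <= 0)   # vectorized: one result per element

# "&&" and "||" are meant for single TRUE/FALSE values, e.g. in if ()
first.is.zero <- (x[1] >= 0) && (x[1] <= 0)

# Short-circuit: the right-hand side is never evaluated here,
# so the stop() call does not raise an error
short.circuit <- FALSE && stop("never reached")

if (length(x) > 0 && x[1] < 0) {
  sign.note <- "starts negative"
}
```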
R - lists, matrices, arrays
• lists are like vectors, except elements need not all have the same type, e.g. the first element of a list could be an integer and the second element a string or a vector of Boolean values
  - lists are created using the list() function
  - elements can be accessed by position using "[[]]"
  - named elements of lists can be accessed by the dollar sign "$"
A <- list(name="John", age=24)
A[[1]]
A$name
  - if you attempt to access a non-existent element of a list, say A[[3]] above, you will get an error
  - you can assign to a non-existent element of a list, thus extending the list; if the index you assign to is more than one past the end of the list, intermediate elements are created and assigned NULL values
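List access and extension can be sketched as follows. The list contents are made up, and tryCatch() is used here only so that the out-of-range access does not stop the script.

```r
person <- list(name = "John", age = 24, passed = c(TRUE, FALSE))

person[[1]]        # access by position
person$age         # access by name

# Accessing a non-existent position is an error; tryCatch() captures it here
lookup <- tryCatch(person[[4]], error = function(e) "error")

# Assigning past the end extends the list; the gap is filled with NULL
short.list <- list("a")
short.list[[3]] <- "c"
length(short.list)
is.null(short.list[[2]])
```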
R - matrices, arrays, data frames
• matrix and array - R has no separate matrix or array type, only vectors, but you can give a vector a dimension, essentially making it a matrix (see also rbind(), cbind())
  - R fills matrices by column
  - to fill a matrix by row, add the argument byrow = TRUE to the call to the matrix() function
A1 <- array(c(1,2,3,4,5,6), dim=c(2,3))
A2 <- matrix(c(1,2,3,4,5,6), 2, 3)
A3 <- matrix(c(1,2,3,4,5,6), 2, 3, byrow=TRUE)
• data frame - is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.)
x1 <- c(1,2,3,4)
x2 <- c("red","white","red",NA)
x3 <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(x1,x2,x3)
names(mydata) <- c("ID","Color","Passed")  # variable names
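The column-wise and row-wise fill orders, and the idea that a matrix is a vector with a dimension, can be compared in a short sketch. The object names and the small survey data frame are illustrative.

```r
values <- 1:6

by.column <- matrix(values, nrow = 2, ncol = 3)                # default fill
by.row <- matrix(values, nrow = 2, ncol = 3, byrow = TRUE)

# A matrix is a vector with a dim attribute
reshaped <- values
dim(reshaped) <- c(2, 3)

# rbind()/cbind() build matrices from rows or columns
stacked <- rbind(1:3, 4:6)

# Data frame columns may have different modes
survey <- data.frame(id = 1:3, color = c("red", "white", NA),
                     passed = c(TRUE, TRUE, FALSE))
str(survey)
```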
R - missing values and NaNs
• missing values and NaNs - an operation on numbers may return one of several kinds of non-number
  - "not a number" - NaN
  - "not available" - NA (indicates missing data, and is unfortunately fairly common in data sets)
  - the author of an R function has no control over the data the function will receive, because NA is a legal value inside an R vector - there is no way to specify that a function takes only vectors with non-missing components - you must handle NA values, even if you handle them by returning an error
  - the function is.nan() will return TRUE for those components of its argument that are NaN (see also !is.nan())
  - the function is.na() will return TRUE for those components that are NA or NaN (see also !is.na())
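The behaviour of is.nan() and is.na(), and one way of handling NA by returning an error, can be sketched as follows. SafeMean is a hypothetical function written for this illustration.

```r
not.a.number <- 0 / 0            # NaN
measurements <- c(1.5, NA, not.a.number, 4)

is.nan(measurements)             # TRUE only for NaN
is.na(measurements)              # TRUE for both NA and NaN

mean(measurements)               # missing values propagate to the result
clean.mean <- mean(measurements, na.rm = TRUE)

# SafeMean is a hypothetical function that handles NA by returning an error
SafeMean <- function(x) {
  if (any(is.na(x))) {
    stop("x contains missing values")
  }
  return(mean(x))
}
```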
R - miscellaneous
• sessionInfo() - prints the R version, OS, packages loaded, etc.
• help(fctn) - displays help on any function fctn
• the function quit() or its alias q() terminates the current R session
• save.image() is just a short-cut for "save my current workspace"
• ls() - shows which objects are defined
• rm(list=ls()) - clears all defined objects
• prefixes d, p, q, r stand for density (probability density function, PDF), probability (cumulative distribution function, CDF), quantile (inverse CDF), and random sample - e.g., dnorm() is the density function of a normal random variable and rnorm() generates a sample from a normal random variable etc.
distribution | density / probability mass function | distribution function | quantile | pseudo-random numbers
binomial | dbinom() | pbinom() | qbinom() | rbinom()
Poisson | dpois() | ppois() | qpois() | rpois()
multinomial | dmultinom() | | | rmultinom()
gamma | dgamma() | pgamma() | qgamma() | rgamma()
normal | dnorm() | pnorm() | qnorm() | rnorm()
Student t | dt() | pt() | qt() | rt()
χ² | dchisq() | pchisq() | qchisq() | rchisq()
Fisher F | df() | pf() | qf() | rf()
multivariate normal (library mvtnorm) | | | | rmvnorm()
multivariate normal (library MASS) | | | | mvrnorm()
For the multinomial distribution, base R provides only dmultinom() and rmultinom().
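The four prefixes can be tried on the normal and binomial distributions. This is a small sketch; the seed and the parameter values are arbitrary choices for the illustration.

```r
density.at.zero <- dnorm(0)            # PDF of N(0, 1) at 0
prob.below <- pnorm(1.96)              # CDF: P(X <= 1.96)
quantile.975 <- qnorm(0.975)           # inverse CDF
round.trip <- pnorm(qnorm(0.3))        # q and p functions invert each other

set.seed(1)                            # seed chosen arbitrarily
random.sample <- rnorm(5, mean = 10, sd = 2)

# The same prefixes work for discrete distributions
prob.two.heads <- dbinom(2, size = 3, prob = 0.5)   # P(X = 2) = 3/8
```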
For more details see e.g. R language for programmers.