Getting Started

R is an open-source, high-level programming language designed specifically for statistical computing and data analysis. You can download it for free at https://www.r-project.org.

Installation

Linux

Exact instructions for installing R will vary depending on your distribution. In Ubuntu the command is

sudo apt-get install r-base r-base-dev

If you want the latest version you’ll have to follow these instructions. If you’re a cool guy who runs Arch the command is

sudo pacman -S r

OS X

If you’re not using Homebrew then you need to start. It’s a package manager for OS X, modeled on the package managers found in Linux distributions, and you use it to install software. You can install it with the following command

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

It’s worth noting that blindly executing a script that you download from a URL is a terrible security practice and it’s a shame that it’s the recommended install method for the most popular OS X package manager.

Anyway, once you have homebrew you can install R with

brew install r

Windows

You can download the Windows installer for R from this website. It comes with a GUI that includes a console where you can run R commands.

Running R

On Linux and OS X you can open a terminal and type R to start an R session. You can use this to type commands into the console. On Windows you can open the R GUI to get mostly the same functionality.

It’s worth noting that RStudio is the premier IDE for R. I don’t use it personally, but most R users do all of their work in RStudio.

General Idea

When you get down to it the core idea behind working with R is very simple. You type commands into the console, hit enter, and R gives you something in return. Generally you store that result for future use, either as an input for your next command or as your final answer to whatever problem you are trying to solve.

It’s really that simple. The hard part is figuring out what commands you need to type into the R machine to get it to do what you want.

Basic Commands/Syntax

We’ll start off with some basic calculator operations in R.

> 10^2 + 36
[1] 136

You can store values in a variable using the <- operator allowing you to access the values later. You can also use the = operator to do the same thing, but best practices dictate using <- for assigning variables.

> a <- 4
> a
[1] 4
> a * 4
[1] 16
> a <- a + 10
> a
[1] 14

R organizes numerical data into scalars (a single number, which R actually stores as a vector of length 1), vectors (1-dimensional arrays of numbers), and matrices (2-dimensional arrays of numbers, like a table). The c() function is used to create a vector (it stands for combine).

> b <- c(3,4,5)

If we want to compute the mean of the values stored in the above vector, we could do it manually like this:

> (3+4+5)/3
[1] 4

We will rarely deal with vectors this small, so doing the arithmetic by hand quickly becomes impractical. Instead, R provides functions that take in arguments and return an output. We can use the mean() function to calculate the mean of a vector.

> mean(b)
[1] 4

The rnorm() function is another example of a standard R function. It outputs a given number of random samples from a normal distribution. It defaults to the standard normal distribution with mean 0 and standard deviation 1, but you can specify your own parameters if you want.

> rnorm(10)
 [1]  0.9012507  0.4269103 -0.2995751 -1.1878327 -0.4559061  0.9675524
 [7] -0.5239229 -1.2686967  1.6898687  1.2104018
> rnorm(10, mean = 5, sd = 3)
 [1]  2.360569  3.648287 10.179476  4.626027  1.766484  6.538299  9.780183
 [8]  6.147346 11.161036  1.993700

R can also create plots for you. Here is a simple example. I’m going to store 100 random samples from a standard normal distribution into x and then plot them as a scatterplot.

> x <- rnorm(100)
> plot(x)

The documentation for R is incredibly rich and will be your best resource when learning how to use R. You can access the documentation for any function using the help() command or the ? operator.
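For example, any of the following looks up the documentation for the mean() function (a quick sketch; the help text appears in a pager or help pane):

```r
# Look up documentation for a function by name
help("mean")
?mean                               # shorthand for help(mean)
help.search("standard deviation")   # search when you don't know the exact name
```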

Data Structures

Vectors

We’ve looked at vectors already, but let’s go deeper into what we can do with them. Vectors are the building blocks of most other structures in R (e.g. a matrix is just a vector with dimensions attached). They can contain numerical, character, or logical data. First, let’s create a simple vector.

> vec1 <- c(1,4,6,8,10)
> vec1
[1]  1  4  6  8 10
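As noted above, vectors can also hold character or logical data. A quick sketch (the vector names here are hypothetical):

```r
chars <- c("apple", "banana", "cherry")   # character vector
flags <- c(TRUE, FALSE, TRUE)             # logical vector
chars[2]     # "banana"
sum(flags)   # logicals act as 1 (TRUE) and 0 (FALSE) in arithmetic, so this counts the TRUEs
```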

What if we want to access the 5th element in the vector?

> vec1[5]
[1] 10

It’s important to note that R begins indexing at 1 rather than 0. This can cause confusion for those coming from Python, for example.
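A quick illustration of 1-based indexing (negative indexing is standard R behavior, though it isn’t covered above):

```r
v <- c(10, 20, 30)
v[1]    # 10: the first element lives at index 1, not 0
v[0]    # indexing at 0 silently returns an empty vector, not an error
v[-1]   # a negative index drops that element, leaving 20 30
```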

Let’s assign a new value to the 3rd element in the vector.

> vec1[3] <- 12
> vec1
[1]  1  4 12  8 10

You can use the seq() function to create a vector of numbers without having to type them all out manually; seq(0, 1, 0.25) counts from 0 to 1 in steps of 0.25.

> vec2 <- seq(0,1,0.25)
> vec2
[1] 0.00 0.25 0.50 0.75 1.00

Many functions take vectors as inputs. Here are a few examples.

> sum(vec1)
[1] 35
> max(vec1)
[1] 12
> min(vec1)
[1] 1
> median(vec1)
[1] 8
> sd(vec1)
[1] 4.472136

You can also add, subtract, multiply, and divide two vectors.

> vec1 + vec2
[1]  1.00  4.25 12.50  8.75 11.00
> vec1 - vec2
[1]  1.00  3.75 11.50  7.25  9.00
> vec1 * vec2
[1]  0  1  6  6 10
> vec1 / vec2
[1]      Inf 16.00000 24.00000 10.66667 10.00000
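These operations are element-wise: the i-th element of the result pairs the i-th elements of the inputs, which is why the first element of vec1 / vec2 above is 1/0 = Inf. A minimal sketch:

```r
a <- c(1, 4, 12, 8, 10)
b <- c(0, 0.25, 0.50, 0.75, 1)
a / b   # division by zero yields Inf (and 0/0 would yield NaN) rather than an error
```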

Matrices

Matrices are nothing more than 2-dimensional vectors. You can define a matrix using the matrix() function. I won’t spend much time on this since in practice you won’t use the matrix data structure very much, but it’s useful to know the basics.

There are many ways of defining a matrix (see ?matrix). For now I’ll create a simple matrix using the matrix() function and specifying the number of columns I want.

> mat <- matrix(data = c(1,2,3,4,5,6), ncol = 3)
> mat
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Accessing and manipulating matrices works much like it does for vectors. The main difference is that you access an element by its row and column instead of a single index. Row always comes before column.

> mat[1,2]
[1] 3
> mat[2,]
[1] 2 4 6
> mean(mat)
[1] 3.5

Data Frames

Data frames are the bread and butter of R. In practice you will spend the vast majority of your time dealing with data frame objects. A data frame looks like a matrix with named columns, but unlike a matrix, each column can hold a different type of data, which is more useful than it sounds. You can create a data frame using the data.frame() function.

> data <- data.frame(x = c(1,2,3), y = c(4,5,6), z = c(7,8,9))
> data
  x y z
1 1 4 7
2 2 5 8
3 3 6 9

The $ operator allows you to access columns of a data frame by name. When you access a column you get a vector in return, and thus can perform any of the vector operations we have discussed so far.

> data$x
[1] 1 2 3
> mean(data$x)
[1] 2
> data$x[1]
[1] 1

You can access several columns at once, either by name or by column number.

> data[,c(1,2)]
  x y
1 1 4
2 2 5
3 3 6
> data[,c("x","y")]
  x y
1 1 4
2 2 5
3 3 6

You can subset the rows of a data frame based on whatever criteria you want.

> data[data$x == 2,]
  x y z
2 2 5 8
> data[data$z >= 8,]
  x y z
2 2 5 8
3 3 6 9
> data[data$x == 1 | data$y == 6,]
  x y z
1 1 4 7
3 3 6 9
> data[data$x == 1 & data$y == 6,]
[1] x y z
<0 rows> (or 0-length row.names)

The %in% operator checks whether each value on its left-hand side appears anywhere in the vector on its right-hand side.

> data[data$x %in% c(1,2),]
  x y z
1 1 4 7
2 2 5 8

A very common mistake is forgetting the comma in the above expressions. When you subset the data frame you still have to select which columns you want. Leaving the column space empty means you want all of the columns, but you can select a few columns and subset the rows all in the same expression if you want.
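For example, using the same small data frame defined above, you can filter rows and select columns in one expression:

```r
data <- data.frame(x = c(1,2,3), y = c(4,5,6), z = c(7,8,9))
# Rows where x is at least 2, keeping only the x and z columns
data[data$x >= 2, c("x", "z")]
# Forgetting the comma -- data[data$x >= 2] -- is interpreted as column
# selection instead of row filtering, and likely not what you want
```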

Here are some miscellaneous useful functions for data frames (most can be applied to other data types as well).

> class(data)
[1] "data.frame"
> nrow(data)  ### Use length() for vectors
[1] 3
> names(data)
[1] "x" "y" "z"
> str(data)
'data.frame':	3 obs. of  3 variables:
 $ x: num  1 2 3
 $ y: num  4 5 6
 $ z: num  7 8 9
> summary(data)
       x             y             z
 Min.   :1.0   Min.   :4.0   Min.   :7.0
 1st Qu.:1.5   1st Qu.:4.5   1st Qu.:7.5
 Median :2.0   Median :5.0   Median :8.0
 Mean   :2.0   Mean   :5.0   Mean   :8.0
 3rd Qu.:2.5   3rd Qu.:5.5   3rd Qu.:8.5
 Max.   :3.0   Max.   :6.0   Max.   :9.0

Plotting

Plotting is an important part of analytics, and R comes with several plotting functions built in. I cannot provide a detailed overview of plotting in R here, but I can give a few examples to get you started. The documentation is very helpful for this. First, a simple line chart.

> plot(rnorm(100), type = "l", col = "gold")

Next, we’ll create a data frame from randomly generated values and show how information can be added to a plot in layers. The lines() and points() functions add information to the existing plot from the plot() function.

> x1 <- rnorm(100)
> x2 <- rnorm(100)
> x3 <- rnorm(100)
> data <- data.frame(a = x1, b = x1 + x2, c = x1 + x2 + x3)
> plot(data$a, type = "l", ylim = range(data), lwd = 3, col = rgb(1,0,0,0.3))
> lines(data$b, type = "s", lwd = 2, col = rgb(0.3,0.4,0.3,0.9))
> points(data$c, pch = 20, cex = 4, col = rgb(0,0,1,0.3))

Packages

R comes with many built-in functions, but the true power of R comes from the vast library of packages available. Use the install.packages() function to install packages and the library() function to load them into R. I’ve compiled a list of some packages I’ve found useful, but Google is your friend here. Chances are if you want to do it in R, somebody has already made a package for it.
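A minimal sketch of the workflow (the install line is commented out because it downloads from CRAN; dplyr is just an example package name):

```r
## One-time install from CRAN (uncomment to run; requires a network connection):
# install.packages("dplyr")

## Load a package at the start of each session:
library(stats)   # stats ships with R, so this example always works

## See which packages are already installed:
head(rownames(installed.packages()))
```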

Load Data

  • RODBC, RMySQL, RPostgreSQL, RSQLite – used to connect R with a database.
  • XLConnect, xlsx – read and write Excel files, though I recommend just saving them as .csv files.
  • foreign – read data files in various proprietary formats like SAS and SPSS.
  • readr – replaces the default read functions with new ones that have more sensible defaults and some extra features.

Manipulate Data

  • dplyr – subsetting, summarizing, rearranging, and joining data sets; I use this package almost every day.
  • reshape2 – convert between different layouts for your data (e.g. wide to long format)
  • stringr – regular expression and general string manipulation utilities.
  • lubridate – dates and times, much easier to use than the base R functions.

Data Visualization

  • ggplot2 – the undisputed king of data visualization in R.
  • rgl – interactive 3D visualizations.
  • googleVis – create interactive Google Charts in R.
  • shiny – create interactive data visualizations for the web in R.

Spatial Data

  • sp, maptools, rgdal, rgeos – tools for loading and using spatial data including shapefiles.
  • maps – provides a set of common map polygons for plotting.
  • ggmap – download street maps from Google or OpenStreetMap and plot with them.

Time Series or Financial Data

  • zoo – the standard package for all things time series, it is the basis for almost every other time series package in R.
  • quantmod – download financial data and conduct technical analyses.

High Performance

  • Rcpp – call C++ code from within R to speed up certain operations.
  • data.table – an alternative to data frames focused on high performance.
  • parallel, foreach – parallel processing in R.

Colors and Scales

  • colorspace – provides an extensive framework for defining colors and color palettes.
  • RColorBrewer – gives access to all palettes defined by the very popular Color Brewer website.
  • scales – useful functions for scale breaks, labels, and number formatting.

Example with Real Data

Greene and Shaffer (1992) analyzed decisions by the Canadian Federal Court of Appeals on cases filed by refugee applicants who had been turned down by the Immigration and Refugee Board. The resulting data is provided in judges_and_immigration.txt. We are interested in the fact that there are large differences among judges.

Of course it is possible that the judges get to hear very different cases. We will control for an expert assessment of whether or not the case had merit, the city where the original application was filed (Toronto, Montreal, or Other) and the language in which it was filed (English or French). An additional predictor is the logit of the success rate for all cases from the applicant’s country. The country itself is also available. For now, we’re just going to explore the data.

First you need to set your working directory using setwd() so R knows where to look for files. This is what mine looks like.

> setwd("/home/kjohnson/ownCloud/Work/Introduction_to_R/")

For this data we need to use the read.table() function, but there is also a handy function called read.csv() for csv files. If your data is in Excel format, I recommend re-saving it as a csv file, but there are packages available that will read Excel files directly. The sep argument tells R how columns are separated in your data. In this case the columns are separated by spaces; generally that’s a bad idea, and you’ll see commas or tabs most of the time. I always set stringsAsFactors to FALSE, since the automatic conversion to factors can cause odd behavior. The header argument tells R that variable names are included in the first line of the data file.

> data <- read.table("data/judges_and_immigration.txt", sep = " ",
+     stringsAsFactors = FALSE, header = TRUE)

Usually, the first thing you want to do with any dataset is use the head() function to look at the first few rows and the summary() function to look at some standard statistics. The str() function is also useful for looking at the structure of the data frame.

> head(data)
  Index  JudgeName        Country GrantedAppeal Merit Language     City
1    13      Heald        Lebanon            no    no  English  Toronto
2    15      Heald      Sri_Lanka            no    no  English  Toronto
3    19      Heald    El_Salvador            no   yes  English  Toronto
4    30  MacGuigan Czechoslovakia            no   yes   French Montreal
5    36 Desjardins        Lebanon           yes   yes   French Montreal
6    42      Stone        Lebanon           yes   yes  English  Toronto
  SuccessRate
1    -1.09861
2    -0.75377
3    -1.04597
4     0.40547
5    -1.09861
6    -1.09861
> summary(data)
     Index       JudgeName           Country          GrantedAppeal     
 Min.   :  13   Length:384         Length:384         Length:384        
 1st Qu.: 539   Class :character   Class :character   Class :character  
 Median :1247   Mode  :character   Mode  :character   Mode  :character  
 Mean   :1210                                                           
 3rd Qu.:1831                                                           
 Max.   :2461                                                           
    Merit             Language             City            SuccessRate     
 Length:384         Length:384         Length:384         Min.   :-2.0907  
 Class :character   Class :character   Class :character   1st Qu.:-1.0986  
 Mode  :character   Mode  :character   Mode  :character   Median :-0.9946  
                                                          Mean   :-1.0204  
                                                          3rd Qu.:-0.7538  
                                                          Max.   : 0.4055
> str(data)
'data.frame':   384 obs. of  8 variables:
 $ Index        : int  13 15 19 30 36 42 45 46 51 52 ...
 $ JudgeName    : chr  "Heald" "Heald" "Heald" "MacGuigan" ...
 $ Country      : chr  "Lebanon" "Sri_Lanka" "El_Salvador" "Czechoslovakia" ...
 $ GrantedAppeal: chr  "no" "no" "no" "no" ...
 $ Merit        : chr  "no" "no" "yes" "yes" ...
 $ Language     : chr  "English" "English" "English" "French" ...
 $ City         : chr  "Toronto" "Toronto" "Toronto" "Montreal" ...
 $ SuccessRate  : num  -1.099 -0.754 -1.046 0.405 -1.099 ...

Now, let’s examine the relationships between the number of granted appeals and several possible explanatory variables. The table() function can give us counts for categorical variables, allowing us to see how many applications were accepted and rejected for each language, city, judge, etc. We can plot the output of table() using the barplot() function.

> counts <- table(data[,c("GrantedAppeal", "Language")])
> counts
             Language
GrantedAppeal English French
          no      167     87
          yes      86     44
> barplot(counts, beside = TRUE, xlab = "Language", ylab = "Count",
+     legend = rownames(counts), col = c("red", "blue"))

> counts <- table(data[,c("GrantedAppeal", "City")])
> counts
             City
GrantedAppeal Montreal other Toronto
          no        90    39     125
          yes       48    16      66
> barplot(counts, beside = TRUE, xlab = "City", ylab = "Count",
+     legend = rownames(counts), col = c("red", "blue"))

> counts <- table(data[,c("GrantedAppeal", "Merit")])
> counts
             Merit
GrantedAppeal  no yes
          no  201  53
          yes  69  61
> barplot(counts, beside = TRUE, xlab = "Merit", ylab = "Count",
+     legend = rownames(counts), col = c("red", "blue"))

The last plot shows that there is significant disagreement between the expert opinion on whether or not a case has merit and the ruling of the judge. In total, 122 out of 384 cases showed disagreement between the two variables. Let’s see if we can find out which judges disagree with the experts most often.
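The 122 figure can be read straight off the GrantedAppeal-by-Merit table above: disagreement is just the sum of the two off-diagonal cells.

```r
# Rebuild the 2x2 counts table printed above
counts <- matrix(c(201, 69, 53, 61), nrow = 2,
                 dimnames = list(GrantedAppeal = c("no", "yes"),
                                 Merit = c("no", "yes")))
# Appeals granted without merit, plus appeals denied despite merit
counts["yes", "no"] + counts["no", "yes"]   # 69 + 53 = 122
```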

> counts <- table(data[,c("GrantedAppeal", "Merit", "JudgeName")])
> counts
, , JudgeName = Desjardins

             Merit
GrantedAppeal no yes
          no  15  12
          yes  7  12

, , JudgeName = Heald

             Merit
GrantedAppeal no yes
          no  20   4
          yes  5   7

, , JudgeName = Hugessen

             Merit
GrantedAppeal no yes
          no  34   5
          yes 16   7

, , JudgeName = Iacobucci

             Merit
GrantedAppeal no yes
          no  21   0
          yes  5   3

, , JudgeName = MacGuigan

             Merit
GrantedAppeal no yes
          no  40   8
          yes 13   9

, , JudgeName = Mahoney

             Merit
GrantedAppeal no yes
          no  12   4
          yes  5   9

, , JudgeName = Marceau

             Merit
GrantedAppeal no yes
          no   9  11
          yes  1   4

, , JudgeName = Pratte

             Merit
GrantedAppeal no yes
          no  25   3
          yes 11   3

, , JudgeName = Stone

             Merit
GrantedAppeal no yes
          no  19   3
          yes  6   5

, , JudgeName = Urie

             Merit
GrantedAppeal no yes
          no   6   3
          yes  0   2

This is where a package like dplyr comes in handy. It provides an extensive framework for grouping and rearranging data. I’ll give a short example here, but the capabilities of this package go far beyond what I’m about to show. The following code calculates the percentage of cases where each judge agrees with the expert assessment.

> library(dplyr)
> # 1 if judge agrees with expert, 0 otherwise
> data$agreed <- ifelse(data$Merit == data$GrantedAppeal, 1, 0)
> judgeAccuracy <- data %>%
+     group_by(JudgeName) %>%
+     summarize(correct = sum(agreed),
+         total = length(agreed),
+         accuracy = correct/total) %>%
+     arrange(accuracy)
> judgeAccuracy
Source: local data frame [10 x 4]

    JudgeName correct total  accuracy
        (chr)   (dbl) (int)     (dbl)
1     Marceau      13    25 0.5200000
2  Desjardins      27    46 0.5869565
3    Hugessen      41    62 0.6612903
4      Pratte      28    42 0.6666667
5   MacGuigan      49    70 0.7000000
6     Mahoney      21    30 0.7000000
7       Stone      24    33 0.7272727
8        Urie       8    11 0.7272727
9       Heald      27    36 0.7500000
10  Iacobucci      24    29 0.8275862
> par(las = 3)
> barplot(height = judgeAccuracy$accuracy, names.arg = judgeAccuracy$JudgeName)

That’s much better. We can now clearly see that Judge Marceau has the lowest agreement with the experts and Judge Iacobucci has the highest. We can test the significance of the difference using the prop.test() function. The test rejects the null hypothesis that the two proportions are equal (p ≈ 0.033), so we conclude that Judge Marceau agrees with the experts significantly less often than Judge Iacobucci.

> prop.test(x = judgeAccuracy$correct[c(1,10)], n = judgeAccuracy$total[c(1,10)])

        2-sample test for equality of proportions with continuity correction

data:  judgeAccuracy$correct[c(1, 10)] out of judgeAccuracy$total[c(1, 10)]
X-squared = 4.549, df = 1, p-value = 0.03294
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.58410581 -0.03106661
sample estimates:
   prop 1    prop 2
0.5200000 0.8275862