R is an open-source, high-level programming language designed specifically for statistical computing and data analysis. You can download it for free at https://www.r-project.org.
Exact instructions for installing R will vary depending on your distribution. On Ubuntu the command is
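something like the following (package names can change between releases, so check your distribution's repositories):

```shell
# Install base R from the Ubuntu repositories
sudo apt-get install r-base
```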
If you want the latest version you’ll have to follow the instructions on CRAN for adding their repository. If you’re a cool guy who runs Arch the command is
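as simple as this (R lives in the official Arch repositories):

```shell
# Install R from the official Arch repositories
sudo pacman -S r
```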
If you’re not using Homebrew then you need to start. It’s a package manager for OS X designed to bring the convenience of Linux package managers to the Mac; you use it to install software. You can install it with the following command
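which, at the time of writing, is the one-liner published on brew.sh (check that site for the current version before running anything):

```shell
# Homebrew's install script, fetched and executed in one step
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
```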
It’s worth noting that blindly executing a script that you download from a URL is a terrible security practice and it’s a shame that it’s the recommended install method for the most popular OS X package manager.
Anyway, once you have Homebrew installed you can install R with
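a single command (the formula name is simply `r` in current Homebrew; older setups needed the homebrew/science tap first):

```shell
# Install R via Homebrew
brew install r
```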
You can download the R installer for Windows from CRAN (https://cran.r-project.org). It comes with a GUI that includes a console where you can run R commands.
On Linux and OS X you can open a terminal and type `R` to start an interactive R session, where you can type commands into the console. On Windows you can open the R GUI to get mostly the same functionality.
It’s worth noting that RStudio is the premier IDE for R. I don’t use it personally, but most R users do all of their work in RStudio.
When you get down to it the core idea behind working with R is very simple. You type commands into the console, hit enter, and R gives you something in return. Generally you store that result for future use, either as an input for your next command or as your final answer to whatever problem you are trying to solve.
It’s really that simple. The hard part is figuring out what commands you need to type into the R machine to get it to do what you want.
We’ll start off with some basic calculator operations in R.
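Typing an arithmetic expression at the prompt evaluates it immediately and prints the result:

```r
1 + 1      # 2
7 * 6      # 42
2^10       # 1024
10 %% 3    # 1  (modulo, i.e. the remainder)
sqrt(16)   # 4
```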
You can store values in a variable using the `<-` operator, allowing you to access them later. You can also use the `=` operator to do the same thing, but best practice dictates using `<-` for assignment.
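For example:

```r
x <- 5   # assign 5 to x using the preferred operator
y = 10   # this works too, but <- is the convention
x + y    # 15
```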
R organizes numerical data into scalars (a single number), vectors (a 1-dimensional array of numbers), and matrices (a 2-dimensional array of numbers, like a table). The `c()` function is used to create a vector (it stands for combine).
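For example:

```r
v <- c(2, 4, 6, 8, 10)   # combine five numbers into a vector
v                        # 2 4 6 8 10
```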
If we want to compute the mean of the values stored in the above vector, we could do it manually like this:
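Re-creating a small example vector, the manual version looks something like this: add the elements up and divide by how many there are.

```r
v <- c(2, 4, 6, 8, 10)
(v[1] + v[2] + v[3] + v[4] + v[5]) / 5   # 6
```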
We will rarely deal with vectors this small, so R provides functions that take in arguments and return an output. We can use the `mean()` function to calculate the mean of a vector.
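```r
v <- c(2, 4, 6, 8, 10)
mean(v)   # 6
```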
The `rnorm()` function is another example of a standard R function. It outputs a given number of random samples from a normal distribution. It defaults to a standard normal distribution with mean 0 and standard deviation 1, but you can specify your own parameters if you want.
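For example:

```r
rnorm(5)                       # 5 draws from a standard normal
rnorm(5, mean = 100, sd = 15)  # 5 draws with your own mean and sd
```

Your numbers will differ from mine, since the samples are random.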
R can also create plots for you. Here is a simple example: I’m going to store 100 random samples from a standard normal distribution in `x` and then plot them as a scatterplot.
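```r
x <- rnorm(100)   # 100 standard-normal samples
plot(x)           # scatterplot of each value against its index
```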
The documentation for R is incredibly rich and will be your best resource when learning how to use R. You can access the documentation for any function using the `help()` function or the `?` operator.
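Both forms do the same thing:

```r
help(mean)   # open the documentation for mean()
?mean        # shorthand for the same thing
```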
We’ve looked at vectors already, but let’s go deeper into what we can do with them. Vectors are the building blocks of most other structures in R (a matrix, for example, is essentially a vector with dimensions attached). They can contain numerical, character, or logical data. First, let’s create a simple vector.
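```r
v <- c(10, 20, 30, 40, 50)
v   # 10 20 30 40 50
```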
What if we want to access the 5th element in the vector?
It’s important to note that R begins indexing at 1 rather than 0. This can cause confusion for those coming from Python, for example.
Let’s assign a new value to the 3rd element in the vector.
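Both operations use square-bracket indexing (re-creating the example vector so the snippet stands on its own):

```r
v <- c(10, 20, 30, 40, 50)
v[5]         # 50 -- the fifth element, since R indexes from 1
v[3] <- 99   # overwrite the third element
v            # 10 20 99 40 50
```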
You can use the `seq()` function to create a vector of numbers without having to type them all out manually.
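```r
seq(1, 10)                    # 1 2 3 ... 10 (same as 1:10)
seq(0, 1, by = 0.25)          # 0.00 0.25 0.50 0.75 1.00
seq(0, 100, length.out = 5)   # 0 25 50 75 100
```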
Many functions take vectors as inputs. Here are a few examples.
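```r
v <- c(3, 1, 4, 1, 5)
sum(v)      # 14
length(v)   # 5
max(v)      # 5
sort(v)     # 1 1 3 4 5
```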
You can also add, subtract, multiply, and divide two vectors, element by element.
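```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)
a + b   # 11 22 33
a * b   # 10 40 90
b / a   # 10 10 10
```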
Matrices are nothing more than 2-dimensional vectors. You can define a matrix using the `matrix()` function. I won’t spend much time on this since in practice you won’t use the matrix data structure very much, but it’s useful to know the basics.
There are many ways of defining a matrix (see `?matrix`). For now I’ll create a simple matrix using the `matrix()` function, specifying the number of columns I want.
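Note that R fills the matrix column by column by default:

```r
m <- matrix(1:6, ncol = 3)   # six values poured into three columns
m
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
```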
Accessing and manipulating matrices is very similar to vectors. The main difference is that you access an element by its row and column instead of just its index. Row always comes before column.
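```r
m <- matrix(1:6, ncol = 3)
m[2, 3]   # 6 -- row 2, column 3
m[1, ]    # 1 3 5 -- the whole first row
m[, 2]    # 3 4 -- the whole second column
```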
Data frames are the bread and butter of R. In practice you will spend the vast majority of your time dealing with data frame objects. Data frames are really nothing more than a matrix with column names, which is more useful than it sounds. You can create a data frame using the `data.frame()` function.
The `$` operator allows you to access columns of a data frame by name. When you access a column you get a vector in return, so you can perform any of the vector operations we have discussed so far.
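A small sketch (the column names here are made up for illustration):

```r
df <- data.frame(name  = c("a", "b", "c"),
                 score = c(90, 85, 88),
                 stringsAsFactors = FALSE)
df$score         # 90 85 88 -- a plain numeric vector
mean(df$score)   # 87.66667
```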
You can access several columns at once, either by name or by column number.
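```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
df[, c("a", "c")]   # by name
df[, c(1, 3)]       # by position -- same result
```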
You can subset the rows of a data frame based on whatever criteria you want.
The `%in%` operator checks whether a variable is equal to any of the values in the provided vector.
A very common mistake is forgetting the comma in the above expressions. When you subset the data frame you still have to select which columns you want. Leaving the column space empty means you want all of the columns, but you can select a few columns and subset the rows all in the same expression if you want.
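A sketch of all three ideas, using made-up columns:

```r
df <- data.frame(city  = c("Toronto", "Montreal", "Toronto", "Other"),
                 count = c(5, 3, 8, 2),
                 stringsAsFactors = FALSE)
df[df$count > 4, ]                            # rows where count exceeds 4
df[df$city %in% c("Toronto", "Montreal"), ]   # rows matching either city
df[df$count > 4, "city"]                      # subset rows and pick a column at once
```

Note the comma in every expression: the part before it selects rows, the part after it selects columns.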
Here are some miscellaneous useful functions for data frames (most can be applied to other data types as well).
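```r
df <- data.frame(a = 1:4, b = c(2.5, 3.5, 1.5, 4.5))
nrow(df)      # 4 -- number of rows
ncol(df)      # 2 -- number of columns
dim(df)       # 4 2
names(df)     # "a" "b"
head(df, 2)   # first two rows
```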
Plotting is an important part of analytics, and R comes with several plotting functions built in. I cannot provide a detailed overview of plotting in R here, but I can give a few examples to get you started. The documentation is very helpful for this. First, a simple line chart.
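The `type = "l"` argument tells `plot()` to connect the points with lines:

```r
x <- 1:20
y <- x^2
plot(x, y, type = "l", main = "A simple line chart",
     xlab = "x", ylab = "x squared")
```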
Next, we’ll create a data frame from randomly generated values and show how information can be added to a plot in layers. The `points()` function adds information from the data frame on top of the existing plot.
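A sketch of the idea (the columns here are made up):

```r
df <- data.frame(x = 1:50, y = rnorm(50), z = rnorm(50, mean = 2))
plot(df$x, df$y, type = "l", ylim = range(c(df$y, df$z)))
points(df$x, df$z, col = "red")   # layer a second series on top
```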
R comes with many built-in functions, but the true power of R comes from its vast library of packages. Use the `install.packages()` function to install packages and the `library()` function to load them into your R session. I’ve compiled a list of some packages I’ve found useful, but Google is your friend here. Chances are if you want to do something in R, somebody has already made a package for it.
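Using `dplyr` as an example:

```r
install.packages("dplyr")   # downloads from CRAN; only needed once
library(dplyr)              # load the package for the current session
```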
- RODBC, RMySQL, RPostgreSQL, RSQLite – used to connect R with a database.
- XLConnect, xlsx – read and write Excel files, though I recommend just saving them as .csv files.
- foreign – read data files in various proprietary formats like SAS and SPSS.
- readr – replaces the default read functions with new ones that have more sensible defaults and some extra features.
- dplyr – subsetting, summarizing, rearranging, and joining data sets; I use this package almost every day.
- reshape2 – convert between different layouts for your data (e.g. wide to long format).
- stringr – regular expression and general string manipulation utilities.
- lubridate – dates and times, much easier to use than the base R functions.
- ggplot2 – the undisputed king of data visualization in R.
- rgl – interactive 3D visualizations.
- googleVis – create interactive Google Charts in R.
- shiny – create interactive data visualizations for the web in R.
- sp, maptools, rgdal, rgeos – tools for loading and using spatial data including shapefiles.
- maps – provides a set of common map polygons for plotting.
- ggmap – download street maps from Google or OpenStreetMap and plot with them.
Time Series or Financial Data
- zoo – the standard package for all things time series, it is the basis for almost every other time series package in R.
- quantmod – download financial data and conduct technical analyses.
- Rcpp – call C++ code from within R to speed up certain operations.
- data.table – an alternative to data frames focused on high performance.
- parallel, foreach – parallel processing in R.
Colors and Scales
- colorspace – provides an extensive framework for defining colors and color palettes.
- RColorBrewer – gives access to all palettes defined by the very popular Color Brewer website.
- scales – lots of useful scale functions.
Example with Real Data
Greene and Shaffer (1992) analyzed decisions by the Canadian Federal Court of Appeals on cases filed by refugee applicants who had been turned down by the Immigration and Refugee Board. The resulting data is provided in judges_and_immigration.txt. We are interested in the fact that there are large differences among judges.
Of course it is possible that the judges get to hear very different cases. We will control for an expert assessment of whether or not the case had merit, the city where the original application was filed (Toronto, Montreal, or Other) and the language in which it was filed (English or French). An additional predictor is the logit of the success rate for all cases from the applicant’s country. The country itself is also available. For now, we’re just going to explore the data.
First you need to set your working directory using `setwd()` so R knows where to look for files. This is what mine looks like.
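The path below is just a placeholder; substitute wherever you saved the data file.

```r
# Hypothetical path -- replace with your own directory
setwd("~/projects/r-tutorial")
getwd()   # confirm the change
```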
For this data we need to use the `read.table()` function, but there is also a handy `read.csv()` function for csv files. If your data is in Excel format, I recommend re-saving it as a csv file, though there are packages available that will read Excel files directly. The `sep` argument tells R how columns are separated in your data; in this case the columns are separated by spaces, which is generally a bad idea, since you’ll see commas or tabs most of the time. I always set `stringsAsFactors` to `FALSE` since factors can cause some odd behavior. The `header` argument tells R that variable names are included in the first line of the data file.
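Putting those arguments together, the call looks like this (the file name comes from the dataset above; I don't know its exact column names, so none are assumed here):

```r
judges <- read.table("judges_and_immigration.txt",
                     sep = " ",               # columns separated by spaces
                     header = TRUE,           # first line holds variable names
                     stringsAsFactors = FALSE)
```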
Usually, the first thing you want to do with any dataset is use the `head()` function to look at the first few rows and the `summary()` function to look at some standard statistics. The `str()` function is also useful for looking at the structure of the data frame.
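On a small made-up data frame:

```r
df <- data.frame(x = 1:10, y = letters[1:10], stringsAsFactors = FALSE)
head(df)      # first six rows by default
summary(df)   # quartiles and mean for numerics, class info for characters
str(df)       # compact view of the structure and column types
```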
Now, let’s examine the relationships between the number of granted appeals and several possible explanatory variables. The `table()` function can give us counts for categorical variables, allowing us to see how many applications were accepted and rejected for each language, city, judge, etc. We can then plot the output of `table()` with a function such as `barplot()`.
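A sketch with made-up columns (I'm assuming names like `decision` and `language` here, not the dataset's actual ones):

```r
df <- data.frame(decision = c("yes", "no", "no", "yes", "no"),
                 language = c("English", "French", "English", "French", "English"))
table(df$decision)                         # counts per outcome
counts <- table(df$decision, df$language)  # cross-tabulation
barplot(counts, legend = TRUE)             # one way to plot a table
```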
The last plot shows that there is significant disagreement between the expert opinion on whether or not a case has merit and the ruling of the judge. In total, 122 out of 384 cases showed disagreement between the two variables. Let’s see if we can find out which judges disagree with the experts most often.
This is where a package like `dplyr` comes in handy. It provides an extensive framework for grouping and rearranging data. I’ll give a short example here, but the capabilities of this package go far beyond what I’m about to show. The following code calculates the percentage of cases where each judge agrees with the expert assessment.
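A sketch of the pattern with toy data (the column names `judge`, `decision`, and `merit` are assumptions, not the dataset's actual names):

```r
library(dplyr)

cases <- data.frame(
  judge    = c("A", "A", "A", "B", "B", "B"),
  decision = c("grant", "deny", "grant", "deny", "deny", "grant"),  # judge's ruling
  merit    = c("grant", "grant", "grant", "deny", "grant", "deny"), # expert view
  stringsAsFactors = FALSE
)

agreement <- cases %>%
  group_by(judge) %>%
  summarize(pct_agree = mean(decision == merit) * 100) %>%
  arrange(desc(pct_agree))
agreement   # judge A agrees on 2 of 3 cases, judge B on 1 of 3
```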
That’s much better. We can now clearly see that Judge Marceau has the lowest agreement with the experts and Judge Iacobucci has the highest. We can test the significance of the difference using the `prop.test()` function. The test rejects the null hypothesis that the two proportions are equal, so we conclude that Judge Marceau agrees with the experts significantly less often than Judge Iacobucci.
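The counts below are made up for illustration; `prop.test()` takes a vector of successes and a vector of totals.

```r
# Hypothetical counts: agreements out of total cases for the two judges
agree <- c(10, 28)   # Marceau, Iacobucci
total <- c(40, 35)
prop.test(agree, total)   # two-sample test for equality of proportions
```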