Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information. - NYTimes (2014)

Today we’re going to learn about dplyr, a package by Hadley Wickham, and how it will help you with simple data exploration. We’ll also see how you can use it in combination with the %>% operator for more complex wrangling (including a lot of the things you would otherwise use for loops for).

And we’re going to do this in Rmarkdown in the my-project repository we created this morning.

Here are the steps:

  1. Open RStudio
  2. Make sure you’re in your my-project repo (and if not, get there)
  3. File > New File > R Markdown… (defaults are fine)
  4. Save as gapminder-dplyr.rmd
  5. Our workflow together will be to write some description of our analysis in Markdown for humans to read, and we will write all of our R code in the ‘chunks’. Get ready for the awesomeness, here we go…

Today’s materials are again borrowing from some excellent sources, including

1 Install our first package: dplyr

Packages are bundles of functions, along with help pages and other goodies (e.g., vignettes) that make them easier for others to use.
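To make "bundle" a bit more concrete, here is a small sketch that pokes around inside two packages that ship with R. It assumes a standard R installation where stats is attached by default and grid includes its vignettes; the variable name stats_functions is just made up for this illustration.

```r
# 'stats' comes with R and is attached by default, so nothing needs installing.
stats_functions <- ls("package:stats")
length(stats_functions)   # how many objects the package exports
head(stats_functions)     # a few of their names

# Index of the package's help pages
help(package = "stats")

# List the vignettes (long-form guides) that ship with the 'grid' package
vignette(package = "grid")
```

Any package you install later can be explored the same way once it is attached.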

So far we’ve been using functions that come with ‘base R’; they work ‘out-of-the-box’. You can also install packages from online sources. The most traditional is CRAN, the Comprehensive R Archive Network. This is where you went to download R originally, and where you will go again to look for updates.

You don’t need to go to CRAN’s website to install packages; we can do it from within R with the command install.packages("package-name-in-quotes").

## from CRAN:
#install.packages("dplyr") ## do this once only to install the package on your computer.

library(dplyr) ## do this every time you restart R and need it 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##     filter, lag
## The following objects are masked from 'package:base':
##     intersect, setdiff, setequal, union

What’s the difference between install.packages() and library()? Here’s my analogy:

  • install.packages() is setting up electricity for your house. You only need to do this once (let’s ignore monthly bills).
  • library() is turning on the lights. You only turn them on when you need them, otherwise it wouldn’t be efficient. And when you quit R, and come back, you’ll have to turn them on again with library(), but you already have your electricity set up.
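If you want to see the analogy in action, here is a minimal sketch. It uses the tools package as a stand-in, on the assumption that you have a standard R setup where tools is installed with R but not attached at startup; the variable names are made up for this illustration.

```r
# 'tools' ships with R but is not attached by default in a standard session,
# so it stands in here for a package you have already installed.
pkg <- "tools"

# Is the electricity set up? (Is the package installed on this computer?)
is_installed <- requireNamespace(pkg, quietly = TRUE)

# Are the lights on? (Is the package attached in this session?)
attached_before <- paste0("package:", pkg) %in% search()

library(tools)  # turn on the lights for this session

attached_after <- paste0("package:", pkg) %in% search()

print(c(installed       = is_installed,
        attached_before = attached_before,
        attached_after  = attached_after))
```

After you restart R, the package is still installed, but you would need to call library() again before it shows up in search().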

2 Use dplyr::filter() to subset data row-wise

First let’s read in the gapminder data.

# install.packages('gapminder') # instead of reading in the csv
library(gapminder) # this is the package name
str(gapminder) # and it's also the data.frame name, just like yesterday
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

filter() takes logical expressions and returns the rows for which all are TRUE. Visually, we are doing this (thanks RStudio for your cheatsheet):
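In code, the "all are TRUE" rule can be sketched on a tiny data frame. The toy data below is hypothetical, made up purely for this illustration (it is not taken from gapminder), and the example assumes dplyr is already installed.

```r
library(dplyr)

# Hypothetical toy data, made up for this illustration (not real gapminder values)
toy <- data.frame(
  country   = c("A", "B", "C", "D"),
  continent = c("Asia", "Asia", "Europe", "Asia"),
  lifeExp   = c(45, 62, 71, 58),
  stringsAsFactors = FALSE
)

# Two logical expressions: a row is kept only if both are TRUE for it
asia_long <- dplyr::filter(toy, continent == "Asia", lifeExp > 50)
asia_long  # rows for countries B and D

# The same request written as one expression joined with &
same_rows <- dplyr::filter(toy, continent == "Asia" & lifeExp > 50)
same_rows
```

Writing dplyr::filter() with the package prefix makes it explicit that we mean dplyr's function and not the filter() from the stats package that was masked when we loaded dplyr.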