Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information. - NYTimes (2014)
Today we’re going to learn about dplyr, a package by Hadley Wickham, and how it will help you with simple data exploration. We’ll also see how you can use it in combination with the %>% operator for more complex wrangling (including a lot of the things you would otherwise use for loops for).
And we’re going to do this in R Markdown, in the my-project repository we created this morning.
Here are the steps:
1. Make sure you are in the my-project repo (and if not, get there).
2. gapminder-dplyr.rmd is the R Markdown file we will work in.
Today’s materials again borrow from some excellent sources.
dplyr
Packages are bundles of functions, along with help pages and other goodies (e.g., vignettes) that make them easier for others to use.
So far we’ve been using packages included in ‘base R’; their functions work ‘out-of-the-box’. You can also install packages from online repositories. The most traditional source is CRAN, the Comprehensive R Archive Network. This is where you originally went to download R, and where you will go again to look for updates.
You don’t need to go to CRAN’s website to install packages; we can do it from within R with the command install.packages("package-name-in-quotes").
## from CRAN:
#install.packages("dplyr") ## do this once only to install the package on your computer.
library(dplyr) ## do this every time you restart R and need it
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
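The install-once, load-every-session pattern can also be written defensively. The following is a minimal sketch, not part of the original lesson: is_installed() is a hypothetical helper name introduced here for illustration, built on base R's requireNamespace(), which returns FALSE when a package is not available.

```r
# Hypothetical helper (not from the lesson): TRUE if the package is
# already installed on this computer, FALSE otherwise.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (!is_installed("dplyr")) {
  # Installing is the one-time step; we only print a reminder here.
  message("dplyr is not installed yet; run install.packages(\"dplyr\") once.")
} else {
  # Loading is the step you repeat in every new R session.
  library(dplyr)
}
```

A check like this avoids re-running install.packages() every time a script is executed, while still loading the package in each session.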
What’s the difference between install.packages() and library()? Here’s my analogy:
install.packages() is setting up electricity for your house. You just need to do this once (let’s ignore monthly bills).
library() is turning on the lights. You only turn them on when you need them, otherwise it wouldn’t be efficient. And when you quit R and come back, you’ll have to turn them on again with library(), but you already have your electricity set up.
Use dplyr::filter() to subset data row-wise.
First let’s read in the gapminder data.
# install.packages('gapminder') # instead of reading in the csv
library(gapminder) # this is the package name
str(gapminder) # and it's also the data.frame name, just like yesterday
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
filter() takes logical expressions and returns the rows for which all are TRUE. Visually, we are doing this (thanks RStudio for your cheatsheet):
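In code, the same idea can be sketched as follows. This is a minimal illustration that assumes both dplyr and gapminder are installed; the object names (low_life_exp, mexico, mexico_2002, mexico_again) and the specific conditions are chosen here for illustration only.

```r
library(dplyr)
library(gapminder)

# One logical expression: keep rows where life expectancy is below 29 years.
low_life_exp <- filter(gapminder, lifeExp < 29)

# One logical expression on a factor column: keep only the rows for Mexico.
mexico <- filter(gapminder, country == "Mexico")

# Several expressions separated by commas: a row is kept only when
# every expression is TRUE for that row.
mexico_2002 <- filter(gapminder, country == "Mexico", year == 2002)

# Namespace-qualified call; handy because dplyr masks stats::filter().
mexico_again <- dplyr::filter(gapminder, country == "Mexico")

mexico_2002
```

Note that filter() returns a new data frame and leaves gapminder itself unchanged, so each result is assigned to its own name.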