Here is an analogy to start us off. If you were a pilot, R would be your airplane. You can use R to go places! With practice you’ll gain skills and confidence; you can fly further distances and get through tricky situations. You will become an awesome pilot and can fly your plane anywhere.
And if R is the airplane, RStudio is the airport. RStudio provides support: runways, communication, and other services that make your overall life easier. So although you can fly your plane without an airport and we could learn R without RStudio, that’s not what we’re going to do.
We are learning R together with RStudio and its many supporting features.
Something else to start us off is to mention that you are learning a new language here. It’s an ongoing process, it takes time, you’ll make mistakes, it can be frustrating, but it will be overwhelmingly awesome in the long run. We all speak at least one language; it’s a similar process, really. And no matter how fluent you are, you’ll always be learning, you’ll be trying things in new contexts, etc, just like everybody else. And just like any form of communication, there will be miscommunications but hands down we are all better off because of it.
OK, let’s get going.
To learn R and RStudio we will be using Dr. Jenny Bryan’s lectures from STAT545 at UBC. I have modified them slightly here for our purposes; to see them in their full and awesome entirety, visit stat545-ubc.github.io. Specifically, we’ll be using these lectures:
Something we won’t cover today but that will be helpful to you in the future is:
I’ve modified them in part with my own text and in part with text from Software Carpentry’s R for reproducible scientific analysis, specifically:
(modified from Jenny Bryan’s STAT545)
Launch RStudio/R.
Notice the default panes:
FYI: you can change the default location of the panes, among many other things: Customizing RStudio.
There are other great features we don’t really have time for today as we walk through the IDE together. (IDE stands for integrated development environment.) Check out the webinar and RStudio IDE cheatsheet for more. (And this is my blog post about RStudio Awesomeness).
Go into the Console, where we interact with the live R process.
Make an assignment and then inspect the object you just created.
x <- 3 * 4
x
## [1] 12
In my head I hear, e.g., “x gets 12”.
All R statements where you create objects – “assignments” – have this form: objectName <- value. I’ll write it in the command line with a hashtag #, which is how R marks a comment, so it won’t be evaluated.
# objectName <- value
Object names cannot start with a digit and cannot contain certain other characters such as a comma or a space. You will be wise to adopt a convention for demarcating words in names.
# i_use_snake_case
# other.people.use.periods
# evenOthersUseCamelCase
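If you are unsure whether R will accept a name, here is a small sketch that checks it using base R’s make.names(), which rewrites a string into a syntactically valid name. The helper is_valid_name() is a made-up function for illustration, not something built into R.

```r
# Hypothetical helper (not part of base R): a string is a usable object name
# if make.names() leaves it unchanged.
is_valid_name <- function(x) {
  make.names(x) == x
}

is_valid_name("i_use_snake_case")   # TRUE
is_valid_name("2fast")              # FALSE: starts with a digit
is_valid_name("has space")          # FALSE: contains a space
make.names("2fast")                 # shows how R would repair the name
```

This is only a quick check; the rules themselves are described in ?make.names.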
Make an assignment
this_is_a_really_long_name <- 2.5
To inspect this variable, instead of typing it, we can press the up arrow key and call your command history, with the most recent commands first. Let’s do that, and then delete the assignment:
this_is_a_really_long_name
## [1] 2.5
Another way to inspect this variable is to begin typing this_ …and RStudio will automagically suggest completions that you can select by hitting the tab key; then press return.
Make another assignment
this_is_shorter <- 2 ^ 3
To inspect this, try out RStudio’s completion facility: type the first few characters, press TAB, add characters until you disambiguate, then press return.
this_is_shorter
## [1] 8
One more:
jenny_rocks <- 2
Let’s try to inspect:
jennyrocks
## Error in eval(expr, envir, enclos): object 'jennyrocks' not found
Implicit contract with the computer / scripting language: Computer will do tedious computation for you. In return, you will be completely precise in your instructions. Typos matter. Case matters. Get better at typing.
Remember that this is a language, not dissimilar to English! There are times you aren’t understood – your friend might say ‘what?’ but R will say ‘error’.
A moment about logical operators and expressions. We can ask questions about the objects we just made.
== means ‘is equal to’
!= means ‘is not equal to’
< means ‘is less than’
> means ‘is greater than’
<= means ‘is less than or equal to’
>= means ‘is greater than or equal to’
jenny_rocks == 2
## [1] TRUE
jenny_rocks <= 30
## [1] TRUE
jenny_rocks != 5
## [1] TRUE
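The same operators also work on a whole vector at once, returning one TRUE or FALSE per element. A minimal sketch, using a made-up vector called ages:

```r
ages <- c(2, 30, 5)   # made-up values for illustration

ages == 2             # compares each element: TRUE FALSE FALSE
ages <= 30            # TRUE TRUE TRUE
ages != 5             # TRUE TRUE FALSE

# TRUE counts as 1 and FALSE as 0, so sum() counts how many elements match
sum(ages > 3)         # 2
```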
Shortcuts
You will make lots of assignments and the operator <- is a pain to type. Don’t be lazy and use =, although it would work, because it will just sow confusion later. Instead, utilize RStudio’s keyboard shortcut: Alt + - (the minus sign). Notice that RStudio automagically surrounds <- with spaces, which demonstrates a useful code formatting practice. Code is miserable to read on a good day. Give your eyes a break and use spaces. RStudio offers many handy keyboard shortcuts. Also, Alt+Shift+K brings up a keyboard shortcut reference card.
My most common shortcuts include command-Z (undo), and arrow keys in combination with shift/option/command (moving quickly up, down, sideways, with or without highlighting).
R has a mind-blowing collection of built-in functions that are accessed like so
# functionName(arg1 = val1, arg2 = val2, and so on)
Let’s try using seq(), which makes regular sequences of numbers and, while we’re at it, demo more helpful features of RStudio.
Type se and hit TAB. A pop up shows you possible completions. Specify seq() by typing more to disambiguate or using the up/down arrows to select. Notice the floating tool-tip-type help that pops up, reminding you of a function’s arguments. If you want even more help, press F1 as directed to get the full documentation in the help tab of the lower right pane.
Type the arguments 1, 10 and hit return.
seq(1, 10)
## [1] 1 2 3 4 5 6 7 8 9 10
We could probably infer that the seq() function makes a sequence, but let’s learn for sure. Type the following (you can autocomplete) and let’s explore the help page:
?seq
help(seq) # same as ?seq
seq(from = 1, to = 10) # same as seq(1, 10); R assumes by position
## [1] 1 2 3 4 5 6 7 8 9 10
seq(from = 1, to = 10, by = 2)
## [1] 1 3 5 7 9
The above also demonstrates something about how R resolves function arguments. You can always specify in name = value form. But if you do not, R attempts to resolve by position. So above, it is assumed that we want a sequence from = 1 that goes to = 10. Since we didn’t specify step size, the default value of by in the function definition is used, which ends up being 1 in this case. For functions I call often, I might use this resolve by position for the first argument or maybe the first two. After that, I always use name = value.
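Here is a short sketch of the two matching styles side by side. It only uses seq() and its documented arguments from, to and length.out; the object names are made up.

```r
by_position <- seq(1, 10)               # matched by position: from = 1, to = 10
by_name     <- seq(to = 10, from = 1)   # matched by name, so the order is free
identical(by_position, by_name)         # TRUE

# Naming an argument lets you skip over the ones in between (here, by)
four_steps <- seq(1, 10, length.out = 4)
four_steps                              # 1 4 7 10
```

Naming arguments costs a few keystrokes but makes the call readable without the help page open.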
The help page shows the name of the package in the top left, and is broken down into sections:
The examples can be copy-pasted into the console for you to understand what’s going on. Let’s try it.
Exercise: Talk to your neighbor(s) and look up the help file for a function you know. Try the examples, see if you learn anything new. (Need ideas? ?getwd(), ?plot().)
Help for when you only sort of remember the function name: use a double question mark:
??install
Not all functions have (or require) arguments:
date()
## [1] "Thu Apr 14 22:36:25 2016"
Now look at your workspace – in the upper right pane. The workspace is where user-defined objects accumulate. You can also get a listing of these objects with commands:
objects()
## [1] "jenny_rocks" "this_is_a_really_long_name"
## [3] "this_is_shorter" "x"
ls()
## [1] "jenny_rocks" "this_is_a_really_long_name"
## [3] "this_is_shorter" "x"
If you want to remove the object named y, you can do this:
rm(y)
## Warning in rm(y): object 'y' not found
To remove everything:
rm(list = ls())
or click the broom in RStudio’s Environment pane.
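To see rm() do its job on an object that actually exists, here is a minimal sketch. The name tmp_value is made up, and exists() is the base R function that reports whether an object with a given name is defined.

```r
tmp_value <- 42                          # create an object
exists("tmp_value", inherits = FALSE)    # TRUE: it is in the workspace

rm(tmp_value)                            # remove just this one object
exists("tmp_value", inherits = FALSE)    # FALSE: it is gone
```

Removing objects one at a time like this is gentler than rm(list = ls()), which clears everything without asking.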
Exercise: Clear your workspace, then create a few new variables. Discuss what makes a good variable name. Hint: give variables short informative names (lifeExp versus “X5”).
One day you will need to quit R, go do something else and return to your analysis later.
One day you will have multiple analyses going that use R and you want to keep them separate.
One day you will want to collaborate with colleagues/friends–need a portable way to do this.
So, what about your analysis do you want to capture (what is ‘real’), and where should it ‘live’?
The Console is good for quick tests, but you really want to work in saved R scripts as “real”. Huge benefits:
So we will talk about scripts in a moment, but first let’s talk about where they should live.
We’re not going to cover workspaces today, but this is another alternative to scripts. You can learn about it in this RStudio article: Working Directories and Workspaces.
Any process running on your computer has a notion of its “working directory”. In R, this is where R will look, by default, for files you ask it to load. It is also where, by default, any files you write to disk will go.
You can explicitly check your working directory with:
getwd()
## [1] "/Users/julialowndes/github/2016-04-15-UCSB/R_RStudio"
It is also displayed at the top of the RStudio console.
As a beginning R user, it’s OK to let your home directory or any other weird directory on your computer be R’s working directory. Very soon, I urge you to evolve to the next level, where you organize your analytical projects into directories and, when working on Project A, set R’s working directory to Project A’s directory.
Although I do not recommend it, in case you’re curious, you can set R’s working directory at the command line like so. You could also do this in a script.
setwd("~/myCoolProject")
But there’s a better way. A way that also puts you on the path to managing your R work like an expert.
Keeping all the files associated with a project organized together – input data, R scripts, analytical results, figures – is such a wise and common practice that RStudio has built-in support for this via its projects.
Let’s make one to use for the rest of this workshop/class.
Do this: File > New Project … New Directory > Empty Project. The directory name you choose here will be the project name. Call it whatever you want (or follow me for convenience).
I created a directory and, therefore, RStudio project, called swc in my github directory, FYI. What do you notice about your RStudio pane? Look in the right corner–‘software-carpentry’.
Now check that the “home” directory for your project is the working directory of our current R process:
getwd()
# "/Users/julialowndes/tmp/software-carpentry"
I can’t print my output here because this document itself does not reside in the RStudio Project we just created.
This is the absolute path, just like we learned in the shell this morning. But from here, your paths within this project can be relative, and so our files within our project could work on your computer or mine, without worrying about the absolute paths.
Let’s enter a few commands in the Console, as if we are just beginning a project. Since we’re learning a new language here, an example is often the best way to see how things work. So we’re going to make an introductory plot using the cars dataset that is loaded into R.
cars
plot(cars)
z <- line(cars)
abline(coef(z), col = "purple")
dev.print(pdf, "toy_line_plot.pdf")
## quartz_off_screen
## 2
Let’s say this is a good start of an analysis and you’re ready to start preserving the logic and code. Visit the History tab of the upper right pane. Select these commands. Click “To Source”. Now you have a new pane containing a nascent R script. Click on the floppy disk to save. Give it a name ending in .R or .r (I used toy-line.r) and note that, by default, it will go in the directory associated with your project. It is traditional to save R scripts with a .R or .r suffix.
A few things:
Let’s comment our script: Comments start with one or more # symbols. Use them. RStudio helps you (de)comment selected lines with Ctrl+Shift+C (Windows and Linux) or Command+Shift+C (Mac).
Walk through line by line by keyboard shortcut (command + enter) or mouse (click Run in the upper right corner of editor pane).
Source the entire document – equivalent to entering source('toy-line.r') in the Console – by keyboard shortcut (shift command S) or mouse (click Source in the upper right corner of editor pane or select from the mini-menu accessible from the associated down triangle).
## toy-line.r
## J Lowndes lowndes@nceas.uscb.edu
## plots R's cars data with a fitted line ----
plot(cars)
z <- line(cars)
abline(coef(z), col = "purple")
## save as .pdf
dev.print(pdf, "toy_line_plot.pdf")
Notice that the notation with ---- in a comment also enables us to ‘jump’ to it in RStudio.
This workflow will serve you well in the future:
Avoid using the mouse for pieces of your analytical workflow, such as loading a dataset or saving a figure. This is terribly important for reproducibility and for making it possible to retrospectively determine how a numerical table or PDF was actually produced (searching on local disk on filename, among .R files, will lead to the relevant script).
To do before coffee: create a folder called data in your RStudio project folder and copy gapminder-FiveYearData.csv there. On my computer this is ~/tmp/software-carpentry/data/gapminder-FiveYearData.csv
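Inside the project, the relative version of that path is all you need. A minimal sketch, assuming the working directory is the project folder and the file has been copied into data as described; csv_path is a made-up name.

```r
# Build a path relative to the project folder (the working directory)
csv_path <- file.path("data", "gapminder-FiveYearData.csv")
csv_path                  # "data/gapminder-FiveYearData.csv"

# Check before reading; this stays FALSE until the file has been copied there
file.exists(csv_path)
```

Because the path never mentions /Users/..., the same line works unchanged on anyone’s computer as long as they open the project.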
(modified from Jenny Bryan’s STAT545)
Let’s start fresh.
You should clean out your workspace. In RStudio, click on the “Clear” broom icon from the Environment tab or use Session > Clear Workspace. You can also enter rm(list = ls()) in the Console to accomplish the same.
Now restart R. In RStudio, use Session > Restart R. Otherwise, quit R with q() and re-launch it.
Why do we do this? So that the code you write is complete and re-runnable.
Let’s check our working directory: getwd()
Finally, let’s create a new R script from scratch. We will develop and run our code from there. We’ll be using this script today and tomorrow.
In RStudio, use File > New File > R Script. Save this script with a name ending in .r or .R, containing no spaces or other funny stuff, and that evokes whatever it is we’re doing today. Example: software-carpentry-ucsb.r.
We will work with some of the data from the Gapminder project. Jenny Bryan has also released this as an R package, so you could also install it from CRAN and load it into R like so: install.packages("gapminder"); library(gapminder). But here we will use read.csv with the file we downloaded before class.
## read gapminder csv
gapminder <- read.csv('data/gapminder-FiveYearData.csv')
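One thing to be aware of: since R 4.0.0, read.csv() no longer converts text columns to factors by default, so the Factor columns in the output shown in this lesson assume an older R or an explicit stringsAsFactors = TRUE. The sketch below shows the argument on a tiny CSV typed inline; the country names and rows are invented for illustration, not real gapminder data.

```r
# Made-up rows in the same shape as the gapminder file (not real data)
toy_csv <- "country,year,continent
Freedonia,2002,Europe
Sylvania,2002,Europe
Borduria,2002,Americas"

# text = lets read.csv() parse a string instead of a file on disk
toy <- read.csv(text = toy_csv, stringsAsFactors = TRUE)

class(toy$continent)     # "factor"
levels(toy$continent)    # "Americas" "Europe"
```

Without stringsAsFactors = TRUE on a recent R, continent would be a character column and levels() would return NULL.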
Let’s inspect:
## explore the gapminder dataset
gapminder # this is super long! Let's inspect in different ways
Let’s use head and tail:
head(gapminder) # shows first 6
tail(gapminder) # shows last 6
head(gapminder, 10) # shows first X that you indicate
tail(gapminder, 12) # guess what this does!
str() will provide a sensible description of almost anything: when in doubt, just str() some of the recently created objects to get some ideas about what to do next.
str(gapminder) # ?str - displays the structure of an object
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
gapminder is a data.frame. We’ve got a mixture of character data (Factors) and quantitative data (integers and numeric).
We aren’t going to get into the other types of data receptacles today (‘arrays’, ‘matrices’), because working with data.frames is what you should primarily use. Why?
We can also see the gapminder variable in RStudio’s Environment pane (top right).
More ways to learn basic info on a data.frame.
names(gapminder)
## [1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
dim(gapminder) # ?dim dimension
## [1] 1704 6
ncol(gapminder) # ?ncol number of columns; same as dim(gapminder)[2]
## [1] 6
nrow(gapminder) # ?nrow number of rows; same as dim(gapminder)[1]
## [1] 1704
We can combine using c() to reverse-engineer dim()! Just a side-note here, but I wanted to introduce you to c(): we’ll use it later.
c(nrow(gapminder), ncol(gapminder)) # ?c combines values into a vector or list.
## [1] 1704 6
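It is easy to mix up which element of dim() is which, so here is a quick check on a tiny made-up data.frame (toy, id and score are invented names):

```r
toy <- data.frame(id = 1:4, score = c(2.5, 3.0, 4.5, 1.0))   # 4 rows, 2 columns

dim(toy)                     # 4 2: rows first, then columns
dim(toy)[1] == nrow(toy)     # TRUE
dim(toy)[2] == ncol(toy)     # TRUE
length(toy)                  # 2: for a data.frame, length() counts columns
```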
A statistical overview can be obtained with summary()
summary(gapminder)
## country year pop continent
## Afghanistan: 12 Min. :1952 Min. :6.001e+04 Africa :624
## Albania : 12 1st Qu.:1966 1st Qu.:2.794e+06 Americas:300
## Algeria : 12 Median :1980 Median :7.024e+06 Asia :396
## Angola : 12 Mean :1980 Mean :2.960e+07 Europe :360
## Argentina : 12 3rd Qu.:1993 3rd Qu.:1.959e+07 Oceania : 24
## Australia : 12 Max. :2007 Max. :1.319e+09
## (Other) :1632
## lifeExp gdpPercap
## Min. :23.60 Min. : 241.2
## 1st Qu.:48.20 1st Qu.: 1202.1
## Median :60.71 Median : 3531.8
## Mean :59.47 Mean : 7215.3
## 3rd Qu.:70.85 3rd Qu.: 9325.5
## Max. :82.60 Max. :113523.1
##
What other information would you want to know when first exploring new data? …How about plotting to see if these data make sense/have outliers?
Although we haven’t begun our formal coverage of visualization yet, it’s so important for smell-testing a dataset that we will make a few figures anyway. Here we use only base R graphics, which are very basic.
## plot gapminder
plot(gapminder$year, gapminder$lifeExp) # ?plot
plot(gapminder$gdpPercap, gapminder$lifeExp)
To specify a single variable from a data.frame, use the dollar sign $.
Let’s explore a numeric variable: life expectancy.
## explore numeric variable
summary(gapminder$lifeExp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.60 48.20 60.71 59.47 70.85 82.60
hist(gapminder$lifeExp)
Let’s explore a categorical variable (stored as a factor in R): continent.
## explore factor variable
summary(gapminder$continent)
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
nlevels(gapminder$continent)
## [1] 5
hist(gapminder$continent) # whaaaa!?
## Error in hist.default(gapminder$continent): 'x' must be numeric
This error is because of what factors are ‘under the hood’: R is really storing integer codes 1, 2, 3 here, but represents them as text to us. Factors can be problematic because of this, but you can learn to navigate them. There are resources to learn how to properly care for and feed factors.
One thing you’ll learn is which functions and packages to use to visualize factors.
class(gapminder$continent) # ?class returns the class type of the object
## [1] "factor"
table(gapminder$continent) # ?table builds a table based on factor levels
##
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
class(table(gapminder$continent)) # this has morphed the factor...
## [1] "table"
barplot(table(gapminder$continent)) # so we can plot! One bar per continent
I don’t want us to get too bogged down with what’s going on with table() and plotting factors, but I want to expose you to these situations because you will encounter them. Googling the error messages you get, and knowing how to look for good responses, is a critical skill. (I tend to look for responses from stackoverflow.com that are recent and have green checks, and ignore snarky comments).
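To make the ‘under the hood’ idea concrete, here is a small sketch on a made-up factor, showing the text labels, the integer codes behind them, and the counts that table() produces:

```r
continent <- factor(c("Asia", "Africa", "Asia", "Europe"))   # made-up example

levels(continent)          # "Africa" "Asia" "Europe": the labels, sorted
as.integer(continent)      # 2 1 2 3: the integer codes R actually stores
as.character(continent)    # back to the text we see printed

counts <- table(continent) # one count per level
counts[["Asia"]]           # 2
# barplot(counts) would draw one bar per level
```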
Exercise with your neighbor: Explore gapminder$gdpPercap. What kind of data is it? So which commands do you use?
You will want to isolate bits of your data.frames; maybe you want to just look at Africa or years since 2000. R calls this subsetting.
There is a stand-alone function called subset() that can isolate pieces of an object for inspection or assignment. subset()’s main argument is also (unfortunately) called subset. Remember your logical expressions from this morning? We’ll use == here.
## subset gapminder
subset(gapminder, subset = country == "Uruguay") # Ah, inspecting Uruguay. Self documenting!
## country year pop continent lifeExp gdpPercap
## 1621 Uruguay 1952 2252965 Americas 66.071 5716.767
## 1622 Uruguay 1957 2424959 Americas 67.044 6150.773
## 1623 Uruguay 1962 2598466 Americas 68.253 5603.358
## 1624 Uruguay 1967 2748579 Americas 68.468 5444.620
## 1625 Uruguay 1972 2829526 Americas 68.673 5703.409
## 1626 Uruguay 1977 2873520 Americas 69.481 6504.340
## 1627 Uruguay 1982 2953997 Americas 70.805 6920.223
## 1628 Uruguay 1987 3045153 Americas 71.918 7452.399
## 1629 Uruguay 1992 3149262 Americas 72.752 8137.005
## 1630 Uruguay 1997 3262838 Americas 74.223 9230.241
## 1631 Uruguay 2002 3363085 Americas 75.307 7727.002
## 1632 Uruguay 2007 3447496 Americas 76.384 10611.463
Contrast the above command with this one accomplishing the same thing:
gapminder[1621:1632, ] # No idea what we are inspecting. Don't do this.
## country year pop continent lifeExp gdpPercap
## 1621 Uruguay 1952 2252965 Americas 66.071 5716.767
## 1622 Uruguay 1957 2424959 Americas 67.044 6150.773
## 1623 Uruguay 1962 2598466 Americas 68.253 5603.358
## 1624 Uruguay 1967 2748579 Americas 68.468 5444.620
## 1625 Uruguay 1972 2829526 Americas 68.673 5703.409
## 1626 Uruguay 1977 2873520 Americas 69.481 6504.340
## 1627 Uruguay 1982 2953997 Americas 70.805 6920.223
## 1628 Uruguay 1987 3045153 Americas 71.918 7452.399
## 1629 Uruguay 1992 3149262 Americas 72.752 8137.005
## 1630 Uruguay 1997 3262838 Americas 74.223 9230.241
## 1631 Uruguay 2002 3363085 Americas 75.307 7727.002
## 1632 Uruguay 2007 3447496 Americas 76.384 10611.463
Yes, these both return the same result. But the second command is horrible for these reasons:
The row numbers 1621:1632 give no idea what we are inspecting.
If the rows of gapminder are reordered or if some observations are eliminated, these rows may no longer correspond to the Uruguay data.
In contrast, the first command, using subset(), is self-documenting; one does not need to be an R expert to take a pretty good guess at what’s happening. It’s also more robust. It will still produce the correct result even if gapminder has undergone some reasonable set of transformations (what if it were in reverse alphabetical order?).
You can use subset = and select = together to simultaneously filter rows and columns or variables.
subset(gapminder, subset = country == "Mexico",
select = c(country, year, lifeExp)) # ?c: combines values into a vector or list
## country year lifeExp
## 985 Mexico 1952 50.789
## 986 Mexico 1957 55.190
## 987 Mexico 1962 58.299
## 988 Mexico 1967 60.110
## 989 Mexico 1972 62.361
## 990 Mexico 1977 65.032
## 991 Mexico 1982 67.405
## 992 Mexico 1987 69.498
## 993 Mexico 1992 71.455
## 994 Mexico 1997 73.670
## 995 Mexico 2002 74.902
## 996 Mexico 2007 76.195
You can also subset on more than one condition using &, |, etc. Let’s take a peek at logical operators: ?"&"
subset(gapminder, subset = country == c("Mexico", "Uruguay") & year == 2007)
## country year pop continent lifeExp gdpPercap
## 1632 Uruguay 2007 3447496 Americas 76.384 10611.46
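Notice that only Uruguay came back, even though Mexico also has a 2007 row. That is because == compares element by element and recycles c("Mexico", "Uruguay") down the rows, so each row is checked against only one of the two names. When you mean ‘either of these’, %in% is the safer choice. A small sketch on a made-up data.frame (not the real gapminder values):

```r
# Made-up data for illustration
toy <- data.frame(
  country = c("Mexico", "Mexico", "Uruguay", "Uruguay"),
  year    = c(2002, 2007, 2002, 2007)
)

# == recycles the two names: row 1 vs "Mexico", row 2 vs "Uruguay",
# row 3 vs "Mexico", row 4 vs "Uruguay"
with_equals <- subset(toy, country == c("Mexico", "Uruguay") & year == 2007)
nrow(with_equals)    # 1: only Uruguay 2007 survives

# %in% asks, for every row, whether country is any of the listed values
with_in <- subset(toy, country %in% c("Mexico", "Uruguay") & year == 2007)
nrow(with_in)        # 2: Mexico 2007 and Uruguay 2007
```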
Exercise: with a partner,
1. subset data of interest using at least 2 conditionals.
2. assign this to a variable.
3. what did you learn?
# one potential exercise answer, no peeking
gap_sample <- subset(gapminder, subset = country %in% c("France", "Brazil") & year >= 2002) # %in% matches either country on every row
head(gap_sample)
str(gap_sample)
if and else
Often when we’re coding we want to control the flow of our actions. This can be done by setting actions to occur only if a condition or a set of conditions are met. Alternatively, we can also set an action to occur a particular number of times.
# if
if (condition is true) {
do something
}
# if ... else
if (condition is true) {
do something
} else { # that is, if the condition is false,
do something different
}
Say, for example, that we want R to print a message if the variable we just created has a particular value.
# sample a random number from a Poisson distribution
# with a mean (lambda) of 8
x <- rpois(1, lambda=8)
if (x >= 10) {
print("x is greater than or equal to 10")
}
## [1] "x is greater than or equal to 10"
x
## [1] 13
Note you may not get the same output as your neighbour because you may be sampling different random numbers from the same distribution.
Let’s go a step further:
x <- rpois(1, lambda=8)
if (x >= 10) {
print("x is greater than or equal to 10")
} else if (x > 5) {
print("x is greater than 5")
} else {
print("x is less than or equal to 5")
}
## [1] "x is greater than 5"
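Because x is random, it can be hard to see every branch fire. One way to try the chain on values you choose is to wrap it in a small function. This is only a sketch: describe_x() is a made-up name, and writing your own functions is not something this lesson covers.

```r
# Hypothetical helper: the same if / else if / else chain, wrapped so we
# can feed it chosen values instead of a random draw
describe_x <- function(x) {
  if (x >= 10) {
    "x is greater than or equal to 10"
  } else if (x > 5) {
    "x is greater than 5"
  } else {
    "x is less than or equal to 5"
  }
}

describe_x(13)   # first branch
describe_x(7)    # second branch
describe_x(5)    # last branch: 5 is not greater than 5
```

Only the first condition that is TRUE runs, which is why 13 never reaches the x > 5 check.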
Important: when R evaluates the condition inside if statements, it is looking for a logical element, i.e., TRUE or FALSE. This can cause some headaches. For example:
x <- 4 == 3
if (x) {
"4 equals 3"
}
As we can see, the message was not printed because the vector x is FALSE:
x <- 4 == 3
x
## [1] FALSE
Exercise: Use an if statement to print a suitable message reporting whether there are any records from 2002 in the gapminder dataset. Now do the same for 2012.
Did anyone get a warning message like this?
if (gapminder$year == 2002) {
print('this will only check the first element in gapminder$year')
}
## Warning in if (gapminder$year == 2002) {: the condition has length > 1 and
## only the first element will be used
If your condition evaluates to a vector with more than one logical element, if will still run, but will only evaluate the condition in the first element. (This describes the R version used here; from R 4.2.0 onward, a condition of length greater than one is an error rather than a warning.) Remember our analogy about spoken language? This is when R understood your command, but is flagging that you may have misspoken. R isn’t saying ‘what!?’ (that’s an error message), it’s saying ‘I understood what you said, but I want to alert you that it might not be what you meant’. These warning messages can be really helpful, but you can’t rely on them to catch all misinterpretations.
We’ll talk about a good way to do this tomorrow, but in case you wanted to just do a quick check to see if 2002 was even in the gapminder data, you could use any or %in%:
if (any(gapminder$year == 2002)) {
print('yes 2002 is included at least once in gapminder')
}
## [1] "yes 2002 is included at least once in gapminder"
if (2002 %in% gapminder$year) {
print('yes 2002 is included at least once in gapminder')
}
## [1] "yes 2002 is included at least once in gapminder"
If you want to iterate over a set of values, and perform the same operation on each, a for loop will do the job. We saw for loops in the shell lessons earlier.
The basic structure of a for loop is:
for(iterator in set of values){
do a thing
}
For example:
for(i in 1:10){
print(i)
}
The 1:10 bit creates a vector on the fly; you can iterate over any other vector as well.
We can use a for loop nested within another for loop to iterate over two things at once.
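For instance, here is a minimal sketch that loops over a vector we made ourselves and keeps a running total; the names values, value and total are made up.

```r
values <- c(2, 4, 6)    # any vector works, not just 1:10
total <- 0

for (value in values) {
  total <- total + value    # value takes each element in turn
}

total                   # 12
```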
for (i in 1:5){
for(j in c('a', 'b', 'c', 'd', 'e')){
print(paste(i,j))
}
}
for (gap_cont in unique(gapminder$continent)){ # gap_cont = 'Africa'
temp <- subset(gapminder, continent == gap_cont)
print(paste('mean life expectancy for', gap_cont, 'is', mean(temp$lifeExp)))
}
## [1] "mean life expectancy for Asia is 60.0649032323232"
## [1] "mean life expectancy for Europe is 71.9036861111111"
## [1] "mean life expectancy for Africa is 48.8653301282051"
## [1] "mean life expectancy for Americas is 64.6587366666667"
## [1] "mean life expectancy for Oceania is 74.3262083333333"
Rather than printing the results, we could write the loop output to a new object.
continent_mean_lifeExp <- c()
for (gap_cont in unique(gapminder$continent)){ # gap_cont = 'Africa'
temp <- subset(gapminder, continent == gap_cont)
temp_output <- paste(gap_cont, mean(temp$lifeExp))
continent_mean_lifeExp <- c(continent_mean_lifeExp, temp_output)
}
continent_mean_lifeExp
## [1] "Asia 60.0649032323232" "Europe 71.9036861111111"
## [3] "Africa 48.8653301282051" "Americas 64.6587366666667"
## [5] "Oceania 74.3262083333333"
This approach can be useful, but ‘growing your results’ (building the result object incrementally) is computationally inefficient, so avoid it when you are iterating through a lot of values.
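One common way around the ‘growing’ problem is to create the result at its full size first and then fill it in by position. The sketch below does that on a tiny made-up data.frame; the numbers are invented, and this is just one option rather than the approach this lesson builds on.

```r
# Made-up data for illustration (not real gapminder values)
toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  lifeExp   = c(60, 70, 75, 85),
  stringsAsFactors = FALSE
)

conts <- unique(toy$continent)

# Preallocate: a numeric vector with one slot per continent
mean_lifeExp <- numeric(length(conts))
names(mean_lifeExp) <- conts

for (i in seq_along(conts)) {
  temp <- subset(toy, continent == conts[i])
  mean_lifeExp[i] <- mean(temp$lifeExp)   # fill slot i instead of using c()
}

mean_lifeExp    # Asia 65, Europe 80
```

Keeping the means as numbers, with the continent as the name, also avoids pasting everything into text as the earlier loop did.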
For loops can also lead to temporary variables that you don’t need. Tomorrow we will learn about a few packages that will help your data wrangling well beyond for loops!
OK, let’s clean up and save your .r script; we’ll be using it again tomorrow! Restart R. This will ensure you don’t have any packages loaded from previous calls to library(). In RStudio, use Session > Restart R. Otherwise, quit R with q() and re-launch it.
Run through each line of code again, make sure your comments are good, delete anything you don’t need. Your script might look like this:
## explore the gapminder dataset ----
gapminder <- read.csv('data/gapminder-FiveYearData.csv')
str(gapminder) # displays the structure of an object
head(gapminder) # shows first 6 by default
tail(gapminder, 12) # shows last X that you indicate, or 6 by default
names(gapminder)
dim(gapminder) # ?dim dimension
ncol(gapminder) # ?ncol number of columns
nrow(gapminder) # ?nrow number of rows
length(gapminder) # ?length length; although better for vectors
summary(gapminder)
## plot gapminder
plot(lifeExp ~ year, gapminder)
plot(lifeExp ~ gdpPercap, gapminder)
## explore numeric variable
head(gapminder$lifeExp)
summary(gapminder$lifeExp)
hist(gapminder$lifeExp)
## explore numeric variable that functions like a categorical variable
head(gapminder$year)
summary(gapminder$year)
## explore factor variable
class(gapminder$continent)
summary(gapminder$continent)
levels(gapminder$continent)
nlevels(gapminder$continent)
barplot(table(gapminder$continent))
## subset gapminder. Self documenting!
subset(gapminder, subset = country == "Mexico",
select = c(country, year, lifeExp)) # ?c: combines values
## practice an if statement
x <- rpois(1, lambda=8)
if (x >= 10) {
print("x is greater than or equal to 10")
} else if (x > 5) {
print("x is greater than 5")
} else {
print("x is less than or equal to 5")
}
## two ways to see if values exist: `any` and `%in%`
if (any(gapminder$year == 2002)) {
print('yes 2002 is included at least once in gapminder')
}
if (2002 %in% gapminder$year) {
print('yes 2002 is included at least once in gapminder')
}
## practice a for loop
continent_mean_lifeExp <- c()
for (gap_cont in unique(gapminder$continent)){ # gap_cont = 'Africa'
temp <- subset(gapminder, continent == gap_cont)
temp_output <- paste(gap_cont, mean(temp$lifeExp))
continent_mean_lifeExp <- c(continent_mean_lifeExp, temp_output)
}
continent_mean_lifeExp