My unofficial mantra (again):
In Data organization with spreadsheets, we learned how to record our data, avoid common errors, and save our data so that you and others can read and understand it later.
Now let's learn how to write down ALL the steps needed to take you from raw data to publication-quality figures. That's the key to reproducibility: if every step is written down, then others can reproduce your findings! Below, you will learn how to clean your data in a reproducible way using OpenRefine and R.
Pro-tip
After spending all that time entering your raw data, NEVER change it again.
- Do not edit the data
- Do not edit the column headers
- Do not remove ‘outliers’
- Do not do calculations directly on the raw data
Store your data in a `raw_data` directory (folder) in your project directory, and never save/write over it!
I advise you to archive your data immediately upon collection to reduce the risk of data loss. Zenodo and Figshare both offer free options for archiving your raw data permanently, and have some awesome options for embargoes, restricted access, or even private storage to suit your privacy needs. More detail can be found in the Data Archiving & Version Control lesson.
By now I've probably said the words reproducible and reproducibility so often that they're starting to lose meaning. Trust me, this is important! (Or don't trust me and read "A manifesto for reproducible science" and "Reproducible Data Science with R".) By not 'hiding' your workflow, you help:
OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it and transforming it from one format into another. It's effectively a reproducible way to work in a spreadsheet; no coding is required on your part, since it generates a script that details what you did to the data, step by step.
First, let's open OpenRefine. You'll notice it opens in the browser, but it's running locally (it does not require an internet connection).
The first step is to load your data and create a project. If you haven't done so already, please download the zip file of the entire GitHub project repository (it contains the data files and all the R scripts used to make this very website!) and extract it somewhere convenient.
In OpenRefine, click Choose Files and find `larval abundance.csv`, which is in the `rawdata` directory of the project folder. Then click the Create Project button (you may want to rename the project).
You are now working on a copy of the raw data and changes you make in OpenRefine will not ‘break’ your original raw data. In here you can do all the regular ‘spreadsheet-y’ things. You can edit specific cells, sort, undo/redo, view subsets of your data (facet), etc. But the more powerful functions of OpenRefine are:
- Cluster (click on a column header arrow, then Edit cells > Cluster and edit...), which means "finding groups of different values that might be alternative representations of the same thing". For example, the two strings "New York" and "new york" very likely refer to the same concept and just have capitalization differences.
- Whitespace management (click on a column header arrow, then Edit cells > Common transforms > Trim leading and trailing whitespace and Edit cells > Common transforms > Collapse consecutive whitespace). Strings with spaces at the beginning or end are particularly hard for us humans to tell apart from strings without them, but the blank characters make a difference to the computer. We usually want to remove these.
OpenRefine saves every change, every edit you make to the larval abundance data, in a file you can save on your machine. If you had 20 files to clean, and they all had the same types of errors and all the same columns, you could save the script, open a new file to clean, paste in the script, and run it. Voila, clean data.
To save the script: in the Undo / Redo section, click Extract, select the steps you want using the check boxes, and save the result to a `.txt` file. To reuse it: in the Undo / Redo section, click Apply, paste in the contents of the `.txt` file, and click Apply.
For more information and tutorials on OpenRefine, please see Data Carpentry.
You can do everything mentioned above in R. It may at first appear more difficult to do it in R, but in my opinion, you will save time by streamlining your workflow using just one tool. I'm not at all discouraging the use of OpenRefine; it is open source and reproducible, so much kudos is due.
For this and future lessons, we will focus on achieving reproducibility using R, an open source language and environment for statistical computing and graphics. There are many other such languages (e.g. Python, Julia, MATLAB, etc.) used by conservationists, biologists, and oceanographers; however, we believe R is currently the most widely adopted among our colleagues, and it also has the most convenient set of statistical tools developed for our field.
In "the olden days", we had to walk to school uphill both ways and used R in the terminal or the built-in graphical user interface (GUI). Yes, before 2011, RStudio did not exist, and yes, R and RStudio are not the same thing!
R is accessible in the terminal (that thing that looks like DOS and, in case my 'old' is showing, is usually a black screen with a blinking cursor where you can only type in commands) by typing `R.exe` on Windows, or just `R` on Mac or Linux. In this way, you can type commands in one by one, or, similar to what we just saw with OpenRefine, you can save your steps/instructions/commands in a plain text file (with a `.R` extension instead of `.txt`) and run them in the terminal by typing `Rscript.exe scriptname.R` on Windows, or just `Rscript scriptname.R` on Mac or Linux. I don't often work this way anymore, but it is the only option when using Compute Canada's awesome resources. Using the commands above and a little server-specific magic, you can run your scripts on 100's of processors instead of the one lonely processor on your computer! I've used hundreds of years of computer time in a matter of weeks, all for free! If you are affiliated with any Canadian university, you can do this too!
R also comes with its own GUI, in which you have a script editor (essentially a plain text editor) to write/develop your script and an interactive R console where you can actually execute commands. The advantage of the GUI is that you can execute the entire script ('source') or run it line by line, all while recording your commands in the script file.
RStudio takes this GUI concept a bit further and provides you with several extra support windows. If the idea of having windows for your environment, your files, and your plots, as well as packages and a help tab, all at hand does not excite you, hold on tight, you'll get there.
There’s also a lot more information about RStudio on their cheatsheet. On the subject of cheatsheets, RStudio has developed several super useful cheatsheets; seriously, you probably will want to print most of these and put them on the wall in your office.
You read my mind! However, before we get to cleaning the data, we need to cover a few R fundamentals so that what we do in later steps makes sense.
Go back to the project folder you downloaded during the OpenRefine lesson and open the `2017-CHONe-Data.Rproj` file. This is an R project file that allows you to set a number of options for the project (see here), but for our purposes, just know that the project file sets the 'working directory': it tells R where all your files are. We'll get back to that later.
In R you can do math, type the command below in your R console and hit enter:
1+1
## [1] 2
You can also assign values to variables, or in R parlance 'objects', using the `<-` symbol (keyboard shortcut Alt + -).
a <- 2
b <- 1+2
You'll notice that there was no output this time. That's because the value on the right side of the `<-` symbol is assigned to an object: it goes to your environment (top right window) instead of being output to the console. In the 1+1 example above, there was no object to assign to, so it defaulted to printing in the console.
You can see the contents of an object in the environment window, or by typing the object's name into the console. You can also use these objects like algebra:
a
## [1] 2
b
## [1] 3
a/b
## [1] 0.6666667
Up to now, we've been dealing with numbers, but R can also deal with character strings if they are surrounded by single or double quotation marks. According to the R help (I learned this today!): "Single and double quotes delimit character constants. They can be used interchangeably but double quotes are preferred (and character constants are printed using double quotes), so single quotes are normally only used to delimit character constants containing double quotes."
f <- "This is a character string, you can tell because of the quotation marks"
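To see that quoting rule in action, here is a quick sketch (the string is made up for illustration): single quotes can delimit a string that itself contains double quotes.

```r
# single quotes also work, and are handy when the string contains double quotes
g <- 'She said "hello" to me'
# when printed, R displays the string with double quotes and escapes the inner ones
g
## [1] "She said \"hello\" to me"
```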
An object can also contain multiple values; this is called a vector. The `:` symbol essentially means 'to'.
x <- 1:3
Another way to do that, with more flexibility, is using `c()`; the c is short for concatenate, and the round brackets indicate that it's a function. So this concatenate function will concatenate all the 'arguments' (the things inside the round brackets), which are separated by commas. You can also combine these strategies:
x <- c(1,3,5)
y <- c(1:4,6,8)
There are many functions, but they all follow the format `functionName(argument1, argument2, argument3, ...)`, where the 'arguments' are the input to the function. Some are fairly straightforward:
mean(y)
## [1] 4
But even then there are some surprises; let's look at the help file for `mean()`. To do that you can:
- if you are on the active line in the console or anywhere in a script, put your cursor on the function and press F1
- in the console, type `?mean` (or `??mean` if you're not so sure `mean` is the name of the function)
- find the Help window (one of the tabs of the bottom right window) and use the search bar
- also, when all else fails, Google is your friend!
Any method should get you to something like this:

In R, in most cases, you could use `=` instead of the `<-` symbol with no problems when you are assigning a value to an object. However, it is best practice to use `<-` when assigning environment objects and `=` when defining function arguments. Oh, and `NA` in R means 'Not Available' / missing values. Like so:
x <- c(1,2,5,7,88,3,4,2,4,6,7,NA)
mean(x)
## [1] NA
mean(x, na.rm = TRUE)
## [1] 11.72727
I also snuck a `TRUE` in there; `TRUE` and `FALSE` are called logicals and are distinct from numerics or character strings. They are sometimes used as argument values, but they can also be used to test things. The `==` asks if both sides are equal (since the single `=` is already used for other things), and the `!=` asks if both sides are not equal.
2==1
## [1] FALSE
2==2
## [1] TRUE
2!=1
## [1] TRUE
Up until now, we've been playing in the console, which means the 'instructions' we need to save to reproduce our science are lost (well, not really; they can be retrieved from the History tab in the top right, or from the console if it hasn't rolled off the screen). It is a good idea to develop your analysis in a script file (those simple text files with the `.R` extension I was talking about earlier) because you can save your code easily.
To create a new script, you can click on the little paper with the plus symbol (see below), or you can hit Ctrl+Shift+N (Windows) or Command+Shift+N (Mac), and if that's not enough options, you can click File > New File > New Script.
These scripts are designed to be read by R from top to bottom when you hit the Source button, or Ctrl+Shift+S (Windows) or Command+Shift+S (Mac). Alternatively, you can run portions of your code with Ctrl+Enter (Windows) or Command+Enter (Mac), either putting your cursor on a line to run that entire line, or highlighting a subsection of code to run just that portion. This will allow us to build multi-step data processing and analysis scripts.
Pro-tip
If you don't want R to read something, use the `#`. Anything that is preceded by a `#` is regarded as a 'comment' by R, and it does not try to execute those lines (i.e. R ignores anything after a `#`). This is also useful if you want to avoid running a few lines of code while you are developing your script. Instead of typing a `#` in front of each line of code, you can highlight the lines you want commented out and hit Ctrl+Shift+C (Windows) or Command+Shift+C (Mac). Magic!
Commenting is super useful to include human readable instructions/documentation in your code. Let’s give this a try, write this chunk of code into your script, then run it line by line.
x <- 1
x <- 2
# x <- 3
What is the value of `x` after running all the lines, and why?
We already mentioned that we can have multiple values in an object, but so far this has been in only one dimension. Using `matrix()` (2D) and `array()` (>2D), we can store numbers in multiple dimensions.
# make a matrix
x <- matrix(data = c(1,1,2,5,3,4), nrow = 2, ncol = 3)
x
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 1 5 4
# make an array
y <- array(data = c(1:8), dim = c(2,2,2))
y
## , , 1
##
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## , , 2
##
## [,1] [,2]
## [1,] 5 7
## [2,] 6 8
Exercise
Try storing numeric and character strings in a matrix (or an array). What happens to the numerics?
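If you want to check your answer, here is one way to try it (a sketch; the values are made up for illustration). A matrix can only hold one type, so everything gets coerced to the most general type, character.

```r
# mixing numerics and characters: c() coerces the numbers to character strings
m <- matrix(c(1, 2, "three", "four"), nrow = 2, ncol = 2)
m
##      [,1] [,2]
## [1,] "1"  "three"
## [2,] "2"  "four"
# the quotation marks give it away: the numerics are now character strings
class(m[1, 1])
## [1] "character"
```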
That's great that we can store this data in multiple dimensions, but how do we get it back? That's what `[]` are for!
# in 1D
x <- c(1,2,3,4,6,34,2,1,5,6,7)
# if we want the 7th value
x[7]
## [1] 2
# if we want the 2nd and 7th value
x[c(2,7)]
## [1] 2 2
# if we want only values greater than 10
x[x>10]
## [1] 34
# whoa, that blew my mind! How did that work?
# well indexing works by giving the numeric index, or a logical vector the length of the vector we're working with
# so x>10 produces a logical vector the length of x
x>10
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# in 2D
x <- matrix(data = c(1,1,2,5,3,4), nrow = 2, ncol = 3)
# this works similar, but you need to provide 2 numbers (or 2 vectors of numbers), for row number and for column number
x[1,3]
## [1] 3
x[c(1,2),1]
## [1] 1 1
# if you leave one dimension blank, the whole row or column will be returned
x[,1]
## [1] 1 1
x[2,]
## [1] 1 5 4
# in 3D, add another dimension!
x <- array(data = c(1:24), dim = c(2,3,4))
x[1,2,3]
## [1] 15
x[,,3]
## [,1] [,2] [,3]
## [1,] 13 15 17
## [2,] 14 16 18
Data frames (`data.frame()`) are a special type of matrix that can hold numerics and character strings (and logicals, factors, geometries, etc.) without having to convert them all to a single type. A data frame is effectively a collection of vectors of the same length.
# make a data frame
x <- data.frame(nums = c(1,2,3),
chars = c("one","two","three"),
logis = c(TRUE, FALSE, TRUE))
# print the whole thing to the console
x
## nums chars logis
## 1 1 one TRUE
## 2 2 two FALSE
## 3 3 three TRUE
# or for a very big dataframe, you may want to use head to see the first 6 rows
head(x)
## nums chars logis
## 1 1 one TRUE
## 2 2 two FALSE
## 3 3 three TRUE
# use the str function to see its structure
str(x)
## 'data.frame': 3 obs. of 3 variables:
## $ nums : num 1 2 3
## $ chars: Factor w/ 3 levels "one","three",..: 1 3 2
## $ logis: logi TRUE FALSE TRUE
Pro-tip
Did you notice that the structure of chars was Factor and not chr? Factors are a special type of character string vector that conserves information about the vector's 'levels' (and you can also order those levels). Many functions, such as `data.frame()`, have an argument called `stringsAsFactors`, which is usually set to `TRUE` by default; you may want to set it to `FALSE`. Often, factors and character strings are interchangeable, but I have had many frustrating errors because I mistakenly had a character string where I needed a factor, and vice versa. Be aware that they are different; not knowing which one you have can lead to errors. You can easily convert back and forth with `as.factor()` (or `as.ordered()`) and `as.character()`.
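Here is a small sketch of those conversions (the vectors are made up for illustration). The classic trap is converting a factor of numbers straight to numeric: you get the level indices, not the original numbers.

```r
ch <- c("10", "2", "10")
fa <- as.factor(ch)
levels(fa)       # levels are sorted alphabetically, so "10" comes before "2"
## [1] "10" "2"
as.numeric(fa)   # these are the level indices, NOT the original numbers!
## [1] 1 2 1
as.numeric(as.character(fa))  # convert to character first to recover the numbers
## [1] 10 2 10
```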
# let's try having both strings and factors in the data frame
# make a data frame
x <- data.frame(nums = c(1,2,3),
chars = c("one","two","three"),
facts = as.factor(c("one","two","three")),
logis = c(TRUE, FALSE, TRUE),
stringsAsFactors = FALSE)
# print the whole thing to the console
x
## nums chars facts logis
## 1 1 one one TRUE
## 2 2 two two FALSE
## 3 3 three three TRUE
# or for a very big dataframe, you may want to use head to see the first 6 rows
head(x)
## nums chars facts logis
## 1 1 one one TRUE
## 2 2 two two FALSE
## 3 3 three three TRUE
# use the str function to see its structure
str(x)
## 'data.frame': 3 obs. of 4 variables:
## $ nums : num 1 2 3
## $ chars: chr "one" "two" "three"
## $ facts: Factor w/ 3 levels "one","three",..: 1 3 2
## $ logis: logi TRUE FALSE TRUE
Indexing in data frames works the same as for a 2D matrix, or you can use `$` to access columns by name.
# make a data frame
x <- data.frame(nums = c(1,2,3),
chars = c("one","two","three"),
facts = as.factor(c("one","two","three")),
logis = c(TRUE, FALSE, TRUE),
stringsAsFactors = FALSE)
x[1,2]
## [1] "one"
x[,2]
## [1] "one" "two" "three"
x$chars
## [1] "one" "two" "three"
# and you can also treat these columns as 1D vectors
x$chars[1]
## [1] "one"
That's all great, but our data is stored in a `.csv` file; how do we get at it? The simplest way is to use the `read.csv()` function. But first, we need to know where on the computer the file is stored.
Go back to the project folder you extracted and open the 2017-CHONe-Data.Rproj
file. This is an R project file that allows you to set a number of options for the project (see here), but for our purposes just know that the project file is setting the ‘working directory’, it tells R where all your files are.
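If you're not sure R is looking in the right place, you can ask it (the path shown is just an example of what you might see):

```r
# where does R think 'here' is? With the .Rproj file open, this is the project folder
getwd()
## e.g. [1] "C:/Users/yourname/Desktop/2017-CHONe-Data"
# and what can R see from there? You should spot the rawdata directory in this list
list.files()
```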
# you can hard code the whole file path, but try to never do that!
# your computer may not have a C drive, and your name almost certainly is not Remi, so this will not work for you
# larvalAbundance <- read.csv("C:/Users/Remi-Work/Desktop/2017-CHONe-Data/rawdata/larval abundance.csv")
# The above is unsurprisingly not reproducible! Always use relative paths!
# Relative paths are a shortened version of the above, you only need to type what comes after the project directory
# REMINDER: The project directory is where the .Rproj file is stored
larvalAbundance <- read.csv("rawdata/larval abundance.csv", stringsAsFactors = FALSE)
# writing it this way (relative path) means that your project is reproducible since you can move this whole directory to another location on your computer OR ANY OTHER COMPUTER!!!
This data comes from one of Remi’s PhD thesis chapters1 that was part of the first CHONe. I modified it from the original version archived on Dryad so we would have cleaning to do!
Common errors in data are: inconsistent entries for the same value (e.g. '8' vs 'August'), stray leading or trailing whitespace, incorrect values from typos during entry, and uninformative or swapped column names. We will run into all of these below.
The first step I always take is to make sure that the data loaded in as expected. Head over to the Environment window and click on `larvalAbundance`. You can do (temporary) sorting and filtering of the data in the data viewer. I also use the `str()` function to make sure all the variables (columns) are of the correct type.
str(larvalAbundance)
## 'data.frame': 79 obs. of 25 variables:
## $ year : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
## $ month : chr "8" "8" "8" "August" ...
## $ day : int 7 8 7 7 7 7 7 8 8 8 ...
## $ time : chr "10:40:00 " "9:05:00" "12:14:00" "13:47:00" ...
## $ depth : int 3 3 12 12 12 12 12 12 12 12 ...
## $ site : int 5 5 6 7 8 9 10 11 12 13 ...
## $ long : num 45.8 45.8 45.8 45.8 45.8 ...
## $ lat : num -61.9 -61.9 -61.8 -61.7 -61.6 ...
## $ Astyris.lunata : num 4.478 12.548 31.834 21.162 0.324 ...
## $ Bittiolum.alternatum : num 0.373 0 3.93 1.953 0.379 ...
## $ Margarites.spp. : num 9.52 5.38 11.14 4.88 1.03 ...
## $ Arrhoges.occidentalis : num 0.3732 0.717 0.786 0.1628 0.0541 ...
## $ Diaphana.minuta : num 1.679 0.896 4.978 1.302 0.135 ...
## $ Crepidula.spp. : num 4.478 7.708 22.664 2.767 0.027 ...
## $ Other.Gastropods : num 1.12 1.793 1.703 0.651 0.243 ...
## $ Mytilus.spp. : num 9.52 12.01 5.63 3.74 1.14 ...
## $ Modiolus.modiolus : num 0.187 0.538 0.262 0.326 0 ...
## $ Anomia.simplex : num 0.56 0.896 0.655 0.488 0.676 ...
## $ Other.Bivalve : num 8.77 4.661 1.965 1.302 0.352 ...
## $ Electra.pilosa : num 12.5 3.94 35.5 18.56 1.03 ...
## $ Membranipora.membranacea: num 0.187 0 0.393 0.163 0 ...
## $ Carcinus.maenas : num 0.0594 0.1051 0.0524 0 0 ...
## $ Cancer.irroratus : num 0.1696 0.1731 2.3057 1.172 0.0451 ...
## $ Neopanopeus.sayi : num 0 0.00618 0 0 0 ...
## $ Crangon.septemspinosa : num 0.27141 0.08036 1.07422 1.04182 0.00901 ...
If I wanted to make a correction manually, we can use indexing:
# say for example, I knew that the second observation was in fact taken at 12 m depth
larvalAbundance$depth[2]
## [1] 3
# the value in the data frame is indeed 3, which is WRONG! Let's correct it
larvalAbundance$depth[2] <- 12
head(larvalAbundance)
## year month day time depth site long lat Astyris.lunata
## 1 2008 8 7 10:40:00 3 5 45.75782 -61.87513 4.4783154
## 2 2008 8 8 9:05:00 12 5 45.75782 -61.87513 12.5483524
## 3 2008 8 7 12:14:00 12 6 45.77987 -61.79735 31.8337216
## 4 2008 August 7 13:47:00 12 7 45.80723 -61.69652 21.1619877
## 5 2008 8 7 15:28:00 12 8 45.82607 -61.58807 0.3244847
## 6 2008 8 7 17:05:00 12 9 45.73428 -61.62138 23.5602998
## Bittiolum.alternatum Margarites.spp. Arrhoges.occidentalis
## 1 0.3731930 9.516420 0.37319295
## 2 0.0000000 5.377865 0.71704871
## 3 3.9300891 11.135252 0.78601782
## 4 1.9534143 4.883536 0.16278452
## 5 0.3785655 1.027535 0.05408078
## 6 0.2926745 4.682793 0.14633727
## Diaphana.minuta Crepidula.spp. Other.Gastropods Mytilus.spp.
## 1 1.6793683 4.47831542 1.1195789 9.5164203
## 2 0.8963109 7.70827362 1.7926218 12.0105659
## 3 4.9781128 22.66351371 1.7030386 5.6331277
## 4 1.3022762 2.76733685 0.6511381 3.7440440
## 5 0.1352020 0.02704039 0.2433635 1.1356964
## 6 2.6340708 2.63407078 0.7316863 0.5853491
## Modiolus.modiolus Anomia.simplex Other.Bivalve Electra.pilosa
## 1 0.1865965 0.5597894 8.7700344 12.501964
## 2 0.5377865 0.8963109 4.6608166 3.943768
## 3 0.2620059 0.6550148 1.9650445 35.501805
## 4 0.3255690 0.4883536 1.3022762 18.557435
## 5 0.0000000 0.6760098 0.3515251 1.027535
## 6 0.2926745 0.2926745 0.7316863 23.852974
## Membranipora.membranacea Carcinus.maenas Cancer.irroratus
## 1 0.1865965 0.05937161 0.16963316
## 2 0.0000000 0.10508473 0.17308072
## 3 0.3930089 0.05240119 2.30565226
## 4 0.1627845 0.00000000 1.17204855
## 5 0.0000000 0.00000000 0.04506732
## 6 0.1463373 0.08780236 0.45364552
## Neopanopeus.sayi Crangon.septemspinosa
## 1 0.000000000 0.271413056
## 2 0.006181454 0.080358907
## 3 0.000000000 1.074224349
## 4 0.000000000 1.041820933
## 5 0.000000000 0.009013463
## 6 0.000000000 0.936558500
# The other issue you may notice is that my months values are mostly numbers, but there's at least 1 "August" in there
# We could correct just that one we see on line 4
larvalAbundance$month[4] <- 8
# Or we could get rid of all the "August"'s in one pass
larvalAbundance$month[larvalAbundance$month=="August"] <- 8
# but that column is still a character string, let's convert it to numeric
larvalAbundance$month <- as.numeric(larvalAbundance$month)
str(larvalAbundance)
## 'data.frame': 79 obs. of 25 variables:
## $ year : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
## $ month : num 8 8 8 8 8 8 8 8 8 8 ...
## $ day : int 7 8 7 7 7 7 7 8 8 8 ...
## $ time : chr "10:40:00 " "9:05:00" "12:14:00" "13:47:00" ...
## $ depth : num 3 12 12 12 12 12 12 12 12 12 ...
## $ site : int 5 5 6 7 8 9 10 11 12 13 ...
## $ long : num 45.8 45.8 45.8 45.8 45.8 ...
## $ lat : num -61.9 -61.9 -61.8 -61.7 -61.6 ...
## $ Astyris.lunata : num 4.478 12.548 31.834 21.162 0.324 ...
## $ Bittiolum.alternatum : num 0.373 0 3.93 1.953 0.379 ...
## $ Margarites.spp. : num 9.52 5.38 11.14 4.88 1.03 ...
## $ Arrhoges.occidentalis : num 0.3732 0.717 0.786 0.1628 0.0541 ...
## $ Diaphana.minuta : num 1.679 0.896 4.978 1.302 0.135 ...
## $ Crepidula.spp. : num 4.478 7.708 22.664 2.767 0.027 ...
## $ Other.Gastropods : num 1.12 1.793 1.703 0.651 0.243 ...
## $ Mytilus.spp. : num 9.52 12.01 5.63 3.74 1.14 ...
## $ Modiolus.modiolus : num 0.187 0.538 0.262 0.326 0 ...
## $ Anomia.simplex : num 0.56 0.896 0.655 0.488 0.676 ...
## $ Other.Bivalve : num 8.77 4.661 1.965 1.302 0.352 ...
## $ Electra.pilosa : num 12.5 3.94 35.5 18.56 1.03 ...
## $ Membranipora.membranacea: num 0.187 0 0.393 0.163 0 ...
## $ Carcinus.maenas : num 0.0594 0.1051 0.0524 0 0 ...
## $ Cancer.irroratus : num 0.1696 0.1731 2.3057 1.172 0.0451 ...
## $ Neopanopeus.sayi : num 0 0.00618 0 0 0 ...
## $ Crangon.septemspinosa : num 0.27141 0.08036 1.07422 1.04182 0.00901 ...
If you were wondering how many years I had sampled, you could do:
unique(larvalAbundance$year)
## [1] 2008 2009
# or for time
unique(larvalAbundance$time)
## [1] "10:40:00 " "9:05:00" "12:14:00" "13:47:00" "15:28:00 "
## [6] "17:05:00" "18:48:00" "11:00:00" "12:43:00" "13:30:00"
## [11] "15:18:00" "17:11:00 " "9:18:00" "12:29:00" "14:07:00"
## [16] "15:41:00" "17:18:00" "19:02:00" "11:14:00" "12:26:00"
## [21] "13:43:00" "15:33:00" "17:19:00" "20:06:00" "16:39:00"
## [26] "15:00:00" "13:17:00" "11:08:00" "9:15:00" "11:35:00"
## [31] "13:09:00" "14:28:00" "15:38:00" "20:19:00" "16:51:00"
## [36] "15:11:00" "13:03:00" "11:21:00" "9:32:00" "9:30:00"
## [41] "11:47:00" "13:22:00" "14:39:00" "15:50:00" "19:44:00"
## [46] "20:37:00" "18:19:00" "11:27:00" "12:58:00" "13:42:00"
## [51] "11:34:00" "9:59:00" "15:59:00" "16:11:00" "14:33:00"
## [56] "12:39:00" "16:52:00" "14:43:00" "17:47:00" "12:57:00"
## [61] "19:31:00" "20:25:00" "18:06:00" "11:13:00" "12:46:00"
## [66] "13:24:00" "11:03:00" "9:44:00" "11:30:00" "15:31:00"
## [71] "15:57:00" "14:17:00" "12:25:00" "16:34:00" "14:30:00"
## [76] "17:34:00" "12:40:00"
# as you can see, there are a few trailing white spaces; to do text substitutions, let's use gsub()
# but first let's see how this works; it matches a pattern in x and replaces it.
gsub(pattern = "ABC",replacement = "XYZ",x = "TUVWABC")
## [1] "TUVWXYZ"
gsub(pattern = "doesn't work",replacement = "works",x = "If this sentence no longer contains the pattern, then gsub doesn't work")
## [1] "If this sentence no longer contains the pattern, then gsub works"
# So now, let's use gsub to remove those spaces. The pattern we are matching is just a space and the replacement is nothing, so that will remove white spaces
larvalAbundance$time <- gsub(pattern = " ",replacement = "",x = larvalAbundance$time)
Feel the power! The `gsub()` function is very powerful, and the pattern matching works based on 'regular expressions' (a nearly universal pattern matching language/protocol). For example, if you had spaces you wanted to keep and only wanted to remove white spaces at the end, you could use `pattern = " +$"`, since the dollar sign in regex means 'ends with' and the plus means 'one or more', so gsub would match one or more spaces at the end of a character string. You can practice your 'regex' with regexpal and this cheatsheet.
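A quick sketch of that trailing-space pattern in action (the strings here are made up):

```r
# " +$" matches one or more spaces, but only at the very end of the string
gsub(pattern = " +$", replacement = "", x = "keep interior spaces   ")
## [1] "keep interior spaces"
# interior spaces are untouched because they don't sit at the end
gsub(pattern = " +$", replacement = "", x = "no trailing space here")
## [1] "no trailing space here"
```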
Other useful things you should check are the minimum and maximum values of each column, to make sure everything was entered on the same order of magnitude, or even a quick histogram:
min(larvalAbundance$Margarites.spp.)
## [1] 0.02709548
max(larvalAbundance$Margarites.spp.)
## [1] 203.875
hist(larvalAbundance$Margarites.spp.)
Lastly, there are some column names that are not ‘up to code’, so to avoid a stern talking to from CHONe’s data manager, let’s fix that now! (Also, you may have noticed that latitude and longitude were swapped!)
names(larvalAbundance)
## [1] "year" "month"
## [3] "day" "time"
## [5] "depth" "site"
## [7] "long" "lat"
## [9] "Astyris.lunata" "Bittiolum.alternatum"
## [11] "Margarites.spp." "Arrhoges.occidentalis"
## [13] "Diaphana.minuta" "Crepidula.spp."
## [15] "Other.Gastropods" "Mytilus.spp."
## [17] "Modiolus.modiolus" "Anomia.simplex"
## [19] "Other.Bivalve" "Electra.pilosa"
## [21] "Membranipora.membranacea" "Carcinus.maenas"
## [23] "Cancer.irroratus" "Neopanopeus.sayi"
## [25] "Crangon.septemspinosa"
names(larvalAbundance)[names(larvalAbundance)=="long"] <- "decimalLatitude"
names(larvalAbundance)[names(larvalAbundance)=="lat"] <- "decimalLongitude"
names(larvalAbundance)[names(larvalAbundance)=="site"] <- "locationID"
Everything now seems reasonable to me, but it is your responsibility to check that each column of your data 'makes sense'. For the sake of time, let's move on!
Now we have a choice, we can either run a data cleaning script every time we load the raw data, or we can save a ‘clean’ data product. Let’s do the latter.
# let's create a new folder for intermediate data products
dir.create("data")
## Warning in dir.create("data"): 'data' already exists
# Then let's save the cleaned data in that folder
write.csv(larvalAbundance, file = "data/larvalAbundanceClean.csv", row.names = FALSE)
Pro-tip
Notice that:
- we are not writing over the raw data
- we are not writing in the same folder as the raw data
- we are naming our new data file informatively
Part of R's awesomeness is that it already comes with a lot of functions that are very useful for everyday science. Additionally, there are many packages, which are essentially collections of new functions and help files created by users like you and me, that add to the already broad functionality of R.
Warning: shameless self promotion below!
Here is a package I created called BESTMPA and the peer-reviewed paper describing it: An adaptable toolkit to assess commercial fishery costs and benefits related to marine protected area network design
Making your own function follows the particular format below, let’s make one called custommean()
custommean <- function(x){
m <- sum(x)/length(x)
return(m)
}
# so we defined custommean as a function with arguments 'x', and it does what is inside the curly brackets
# it will return m which is the mean of x.
x <- c(1,2,3,6)
# does it work?
custommean(x = x)
## [1] 3
If you make a few of those and write some help files to go along with them, you can make your own package. If you're interested, see the "R packages" book by Hadley Wickham (Chief Scientist at RStudio; not the last time I will mention him), but making packages is beyond the scope of what I can cover in this workshop.
Anyway, there is a central organization called the 'Comprehensive R Archive Network', or CRAN, which houses all the official packages, but there are also other packages (like mine), as well as the development versions of many official packages on GitHub, that are worth taking a look at.
To install packages, you can either use the Packages window at the bottom right, or you can do it with written commands (my preference). Here are a few we will use tomorrow, try installing them now and let us know if you get any errors.
install.packages('tidyverse') # The tidyverse is a collection of R packages that share common philosophies and are designed to work together. (e.g. ggplot2, dplyr, tidyr)
install.packages('marmap') # Import xyz data from the NOAA (National Oceanic and Atmospheric Administration, <http://www.noaa.gov>), GEBCO (General Bathymetric Chart of the Oceans, <http://www.gebco.net>) and other sources, plot xyz data to prepare publication-ready figures
install.packages('raster') # Reading, writing, manipulating, analyzing and modeling of gridded spatial data (and also getting access to GADM basemaps)
install.packages('devtools') # Allows you to install packages from github
# the 'robis' package on CRAN does not work with the latest version of R (yet), so we need to get the latest version from github
devtools::install_github("iobis/robis")
# you already have the CRAN version of ggplot2, but that version is not compatible with the sf package yet
# So, we need the github version of ggplot2 as well!
devtools::install_github("tidyverse/ggplot2")
install.packages('gridExtra') # to be able to arrange multiple ggplot plots
install.packages('taxize') # To extract and validate species taxonomy
install.packages('rfishbase') # To access resources available on Fishbase and SeaLifeBase
install.packages('rglobi') # To access interactions data
devtools::install_github("ropensci/rnoaa") # To access environmental data from the NOAA databases
install.packages('knitr')
install.packages('biomod2') # to perform species distribution models
install.packages('igraph') # to produce network plots
install.packages('networkD3') # to produce html (a.k.a. interactive) network plots!
devtools::install_github("guiblanchet/HMSC") # package to perform hierarchical modeling of species communities
install.packages('coda') # package to summarize and plot outputs from Markov Chains
install.packages('corrplot') # visualization of correlation matrices
install.packages('circlize') # circular visualization of data
install.packages('ModelMetrics') # collection of metrics coded for efficiency in C++ using Rcpp
install.packages("pdftools") # to extract pdf content
install.packages('stringr') # Simple, Consistent Wrappers for Common String Operations
install.packages('tidytext') # text analysis package
install.packages('viridis') # Port of the new 'matplotlib' color maps
install.packages('tibble') # Simple Data Frames
devtools::install_github("dgrtwo/widyr") # Widen, process, and re-tidy a dataset
install.packages('ggraph') # An Implementation of Grammar of Graphics for Graphs and Networks
install.packages('wordcloud2') # wordle generator
install.packages('leaflet') # Create and customize interactive maps
install.packages('mapview') # Interactive Viewing of Spatial Objects in R
install.packages('scales') # Graphical scales map data to aesthetics
# to gain access to the functions in a package, you can use library(), eg:
library(tidyverse)
Every so often, R releases a new update. Many of you had install issues last night that were resolved by installing the newest version of R. To keep things running smoothly in the future, here is a great trick:
# install the package called installr
install.packages("installr")
# load the library
library(installr)
# run the updater function (this is best done outside Rstudio in the R gui)
updater()
This will prompt you to install the latest version of R and copy over all of your packages if you choose to do so (the alternative is installing them all by hand again), and it can also update your packages for you. It's really a great time saver!
Daigle RM, Metaxas A, deYoung B (2014) Bay-scale patterns in the distribution, aggregation and spatial variability of larvae of benthic invertebrates. Marine Ecology Progress Series 503:139-156. http://dx.doi.org/10.3354/meps10734