<<BACK

(Remi/David) Now, if you remember what we did yesterday, we loaded and cleaned my larval abundance data, but I did not follow the principles of ‘tidy’ data when I uploaded my data.

Again, this data comes from one of Remi’s PhD thesis chapters1 that was part of the first CHONe. We cleaned in the Data cleaning and raw data management and is also archived on Dryad so we would have cleaning to do!

Reminder: The ‘tidy’ concept is something we will revisit in the R section, but the rules of tidy data are2:

  1. Each variable you measure should be in one column
  2. Each different observation of that variable should be in a different row
  3. There should be one table for each “kind” of variable
  4. If you have multiple tables, they should include a column in the table that allows them to be linked

While it was certainly easier to enter the abundance data in a spreadsheet by giving each species a column, this does not abide by the ‘tidy’ rules and we will have a much easier time analysing and plotting our data if we first ‘tidy’ the data.

# let's load all packages and the data before we begin
library(sf)
library(raster)
library(marmap)
library(robis)
library(gridExtra)
library(tidyverse)
larvalAbundance <- read.csv("data/larvalAbundanceClean.csv", stringsAsFactors = FALSE)
head(larvalAbundance)

The tidyverse ‘meta’ package

The tidyverse is a package that is made up of other packages commonly used together for data analysis and vizualization. You can find much more detail in the book R for Data Science by Garrett Grolemund and Hadley Wickham. Below we will learn about 3 of them:

  • tidyr for tidying data
  • dplyr for manipulating data
  • ggplot2 for plotting data

The tidyr package

This aptly named package will allow us to tidy our data. Our data is currently in the wide format since we have multiple observations of species abundances in seperate columns. We want our data to be in the long format in order to facilitate analysis and plotting (and to abide by the ‘tidy’ rules).

We currently have a few ID type variable (i.e. they are not observations; year,month,day,time,depth,site,long,lat) and an observational variable for each species.

names(larvalAbundance)
 [1] "year"                     "month"                    "day"                      "time"                    
 [5] "depth"                    "locationID"               "decimalLatitude"          "decimalLongitude"        
 [9] "Astyris.lunata"           "Bittiolum.alternatum"     "Margarites.spp."          "Arrhoges.occidentalis"   
[13] "Diaphana.minuta"          "Crepidula.spp."           "Other.Gastropods"         "Mytilus.spp."            
[17] "Modiolus.modiolus"        "Anomia.simplex"           "Other.Bivalve"            "Electra.pilosa"          
[21] "Membranipora.membranacea" "Carcinus.maenas"          "Cancer.irroratus"         "Neopanopeus.sayi"        
[25] "Crangon.septemspinosa"   
larvalAbundanceLong <- gather(larvalAbundance,                                  # the old data frame
                              key = "scientificName",                           # what the column of old column names will be called
                              value = "larvalAbundanceIndPerm3",                # what the gathered numbers will be called
                              Astyris.lunata:Crangon.septemspinosa)             # which columns  to gather
# alternatively, I could of specified which columns not to include instead
larvalAbundanceLong <- gather(larvalAbundance,                                  # the old data frame
                              key = "scientificName",                           # what the column of old column names will be called
                              value = "larvalAbundanceIndPerm3",                # what the gathered numbers will be called
                              -year,-month,-day,-time,-depth,-locationID,-decimalLongitude,-decimalLatitude)  # which columns not to gather
head(larvalAbundanceLong)                           

We can also clean up and/or separate the scientificName column

# first let's swap multiple species 
larvalAbundanceLong <- separate(larvalAbundanceLong,           #old data frame
                                col = scientificName,          # column to be seperated
                                into = c("genus","species"),   # to be seperated into
                                extra = "merge",               # if there are too many ".", what to do with the extra column
                                remove = FALSE,                # whether to delete the seperated column
                                sep = "\\.")                   # regex for ".", but need escape character "\\" since . is a wildcard
# Let's check out our work
head(larvalAbundanceLong)
# genus and species don't look terrible, but scientificName still has that pesky period
# we could use gsub again, but the regex is tricky since there are spp. for species
# pasting the genus and species together will be much easier
larvalAbundanceLong$scientificName <- paste(larvalAbundanceLong$genus,larvalAbundanceLong$species,sep = " ")
# also, spp. is not a species designation, let's get rid of them
larvalAbundanceLong$species[larvalAbundanceLong$species=="spp."] <- NA
# Let's check out our work again
head(larvalAbundanceLong)

The dplyr package

Manipulation of dataframes means many things to many researchers, we often select certain observations (rows) or variables (columns), we often group the data by a certain variable(s), or we even calculate summary statistics. We can do these operations using the normal base R operations:

mean(larvalAbundanceLong[larvalAbundanceLong$scientificName == "Astyris lunata", "larvalAbundanceIndPerm3"])
[1] 13.85279
mean(larvalAbundanceLong[larvalAbundanceLong$scientificName == "Bittiolum alternatum", "larvalAbundanceIndPerm3"])
[1] 1.713291
mean(larvalAbundanceLong[larvalAbundanceLong$scientificName == "Arrhoges occidentalis", "larvalAbundanceIndPerm3"])
[1] 0.7773257

But this isn’t very nice because there is a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.

Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the dplyr grammar easier to read.

Here we’re going to cover 5 of the most commonly used functions as well as using pipes (%>%) to combine them.

  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()

Using select()

If, for example, we wanted to move forward with only a few of the variables in our dataframe we could use the select() function. This will keep only the variables you select.

yearAndSpecies <- select(larvalAbundanceLong,year,species,genus,larvalAbundanceIndPerm3)

If we open up yearAndSpecies we’ll see that it only contains the year, and the larval abundance for each species.

Above we used ‘normal’ grammar, but the strengths of dplyr lie in combining several functions using pipes (%>%). Since the pipes grammar is unlike anything we’ve seen in R before, let’s repeat what we’ve done above using pipes.

yearTaxonAbun <- larvalAbundanceLong %>% select(year,species,genus,larvalAbundanceIndPerm3)

To help you understand why we wrote that in that way, let’s walk through it step by step. In both cases, the left side of <- is yearTaxonAbun. First we summon the larvalAbundanceLong dataframe and pass it on, using the pipe symbol %>%, to the next step, which is the select() function. In this case we don’t specify which .data argument we use in the select() function since in gets that from the previous pipe. Essentially, the output of what is written before the pipe becomes the first argument, in this case .data, of the function after the pipe. Fun Fact: There is a chance you have encountered pipes before in the shell. In R, a pipe symbol is %>% while in the shell it is | but the concept is the same!

Pro-tip

Did you get an error while running the select() function? Something like: Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘select’ for signature ‘"data.frame"’ That’s because many packages have a select() function; In this case, it’s the raster package causing problems. It is possible to get around that problem the ‘hacky’ way by loading tidyverse after all the other packages so that the dplyr (which is in the tidyverse) version of select sort of overwrites the one loaded from raster.

Alternatively, you can designate which package using the :: (e.g. package::function()) when it’s not clear. yearTaxonAbun <- larvalAbundanceLong %>% dplyr::select(year,species,genus,larvalAbundanceIndPerm3)

Using filter()

If we now wanted to move forward with the above, but only with species for which we have a valid genus and species, we can combine select and filter

yearTaxonAbunValid <- larvalAbundanceLong %>%
    select(year,species,genus,larvalAbundanceIndPerm3) %>% 
    filter(genus!="Other") %>% 
    filter(!is.na(species))
# we can also string the two filter operations into one using the and symbol '&'
yearTaxonAbunValid <- larvalAbundanceLong %>%
    select(year,species,genus,larvalAbundanceIndPerm3) %>% 
    filter(genus!="Other" & !is.na(species))

** Challenge 1 **

Write a single command (which can span multiple lines and includes pipes) that will produce a dataframe that has the values for larvalAbundanceIndPerm3 for all species at locationID 5, but not for other sites. I also want locationID, and lat/long removed from the output. How many rows does your dataframe have and why? Solution

Using group_by()

Now, we were supposed to be reducing the error prone repetitiveness of what can be done with base R, but up to now we haven’t done that since we would have to repeat the above for each species. Instead of filter(), which will only pass observations that meet your criteria (in the above: locationID==5), we can use group_by(), which will essentially use every unique criteria that you could have used in filter.

str(larvalAbundanceLong)
'data.frame':   1343 obs. of  12 variables:
 $ year                   : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
 $ month                  : int  8 8 8 8 8 8 8 8 8 8 ...
 $ day                    : int  7 8 7 7 7 7 7 8 8 8 ...
 $ time                   : chr  "10:40:00" "9:05:00" "12:14:00" "13:47:00" ...
 $ depth                  : int  3 12 12 12 12 12 12 12 12 12 ...
 $ locationID             : int  5 5 6 7 8 9 10 11 12 13 ...
 $ decimalLatitude        : num  45.8 45.8 45.8 45.8 45.8 ...
 $ decimalLongitude       : num  -61.9 -61.9 -61.8 -61.7 -61.6 ...
 $ scientificName         : chr  "Astyris lunata" "Astyris lunata" "Astyris lunata" "Astyris lunata" ...
 $ genus                  : chr  "Astyris" "Astyris" "Astyris" "Astyris" ...
 $ species                : chr  "lunata" "lunata" "lunata" "lunata" ...
 $ larvalAbundanceIndPerm3: num  4.478 12.548 31.834 21.162 0.324 ...
str(larvalAbundanceLong %>% group_by(locationID))
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1343 obs. of  12 variables:
 $ year                   : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
 $ month                  : int  8 8 8 8 8 8 8 8 8 8 ...
 $ day                    : int  7 8 7 7 7 7 7 8 8 8 ...
 $ time                   : chr  "10:40:00" "9:05:00" "12:14:00" "13:47:00" ...
 $ depth                  : int  3 12 12 12 12 12 12 12 12 12 ...
 $ locationID             : int  5 5 6 7 8 9 10 11 12 13 ...
 $ decimalLatitude        : num  45.8 45.8 45.8 45.8 45.8 ...
 $ decimalLongitude       : num  -61.9 -61.9 -61.8 -61.7 -61.6 ...
 $ scientificName         : chr  "Astyris lunata" "Astyris lunata" "Astyris lunata" "Astyris lunata" ...
 $ genus                  : chr  "Astyris" "Astyris" "Astyris" "Astyris" ...
 $ species                : chr  "lunata" "lunata" "lunata" "lunata" ...
 $ larvalAbundanceIndPerm3: num  4.478 12.548 31.834 21.162 0.324 ...
 - attr(*, "vars")=List of 1
  ..$ : symbol locationID
 - attr(*, "drop")= logi TRUE
 - attr(*, "indices")=List of 16
  ..$ : int  0 1 12 23 34 45 46 62 63 79 ...
  ..$ : int  2 13 24 35 47 64 81 92 103 114 ...
  ..$ : int  3 14 25 36 48 65 82 93 104 115 ...
  ..$ : int  4 15 26 37 49 66 83 94 105 116 ...
  ..$ : int  5 16 27 38 50 67 84 95 106 117 ...
  ..$ : int  6 17 28 39 51 68 85 96 107 118 ...
  ..$ : int  7 18 29 40 52 69 86 97 108 119 ...
  ..$ : int  8 19 30 41 53 70 87 98 109 120 ...
  ..$ : int  9 20 31 42 54 71 88 99 110 121 ...
  ..$ : int  10 21 32 43 55 72 89 100 111 122 ...
  ..$ : int  11 22 33 44 56 73 90 101 112 123 ...
  ..$ : int  57 74 136 153 215 232 294 311 373 390 ...
  ..$ : int  58 75 137 154 216 233 295 312 374 391 ...
  ..$ : int  59 76 138 155 217 234 296 313 375 392 ...
  ..$ : int  60 77 139 156 218 235 297 314 376 393 ...
  ..$ : int  61 78 140 157 219 236 298 315 377 394 ...
 - attr(*, "group_sizes")= int  153 102 102 102 102 102 102 102 102 102 ...
 - attr(*, "biggest_group_size")= int 153
 - attr(*, "labels")='data.frame':  16 obs. of  1 variable:
  ..$ locationID: int  5 6 7 8 9 10 11 12 13 14 ...
  ..- attr(*, "vars")=List of 1
  .. ..$ : symbol locationID
  ..- attr(*, "drop")= logi TRUE

You will notice that the structure of the dataframe where we used group_by() (grouped_df) is not the same as the original larvalAbundanceLong (data.frame). A grouped_df can be thought of as a list where each item in the listis a data.frame which contains only the rows that correspond to the a particular value locationID (at least in the example above).

Using summarize()

The above was a bit on the uneventful side because group_by() much more exciting in conjunction with summarize(). This will allow use to create new variable(s) by using functions that repeat for each of the locationID-specific data frames. That is to say, using the group_by() function, we split our original dataframe into multiple pieces, then we can run functions (e.g. mean() or sd()) within summarize().

IndPerm3_byLocationID <- larvalAbundanceLong %>%
    group_by(locationID) %>%
    summarize(mean_larvalAbundanceIndPerm3 = mean(larvalAbundanceIndPerm3))

That allowed us to calculate the mean larval abundance for each location, but it gets so much better!

** Challenge 2 **

Calculate the mean larval abundance per locationID. Which has the highest mean larval abundance and which has the lowest mean larval abundance? Solution

The function group_by() allows us to group by multiple variables. Let’s group by locationID and scientificName.

IndPerm3_byLocationID_bySpp <- larvalAbundanceLong %>%
    group_by(locationID,scientificName ) %>%
    summarize(mean_larvalAbundanceIndPerm3 = mean(larvalAbundanceIndPerm3))

That is already quite powerful, but it gets even better! You’re not limited to defining 1 new variable in summarize().

IndPerm3_byLocationID_bySpp <- larvalAbundanceLong %>%
    group_by(locationID,scientificName ) %>%
    summarize(mean_larvalAbundanceIndPerm3 = mean(larvalAbundanceIndPerm3),
              sd_larvalAbundanceIndPerm3 = sd(larvalAbundanceIndPerm3),
              mean_depth = mean(depth),
              sd_depth= sd(depth))

Using mutate()

We can also create new variables prior to (or even after) summarizing information using mutate().

IndPerm3_byLocationID_bySpp <- larvalAbundanceLong %>%
    mutate(larvalAbundanceIndPercm3=larvalAbundanceIndPerm3*1000000) %>% 
    group_by(locationID,scientificName ) %>%
    summarize(mean_larvalAbundanceIndPercm3 = mean(larvalAbundanceIndPercm3),
              sd_larvalAbundanceIndPercm3 = sd(larvalAbundanceIndPercm3),
              mean_depth = mean(depth),
              sd_depth= sd(depth))

Plotting

R comes with basic plotting function, usually referred to as ‘base plot’, e.g:

plot(larvalAbundance$Margarites.spp.,larvalAbundance$Diaphana.minuta)

However, general concensus seems to be that ggplot2 (part of the tidyverse) is the way to go for plotting. I (Remi) have mixed feelings about this package (my age is showing again). It does make making quick plots easier and is compatible with the rest of the tidyverse, but until recently, ggplot2 really struggled with any spatial plotting (see ‘The classic and the cutting edge: sp vs sf’) so I had to know the base plot anyway. I still think base plot is more customizeable, but it may also be that I have not sufficiently mastered the ggplot world. We’re mostly going to cover ggplot2, but be aware that there are other plotting paradigms in R (e.g. base, lattice, etc)

The ggplot2 package

A system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”. You provide the data, tell ‘ggplot2’ how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details

So for the same plot as above, let me walk you through each line below step by step. The first line creates a new blank ggplot plot. Some people set aes() and data in ggplot(), but since different layers can take different aes() and data, I like to set those in the geom functions.

Unlike other tidyverse functions, you can tie together ggplot2 functions with + instead of %>%, but otherwise it works the same. So to our blank plot, we’re adding geom_point() which is the function for a scatterplot. We’re telling geom_point() that all the data it needs should be in larvalAbundance and what aesthetic mapping we want (aes()) which sets the variables used in the plot and takes arguments like x, y, colour, size.

Finally, I set the theme as theme_classic because the default plots are ugly! (try it, I dare you) #

ggplot() +
    geom_point(data = larvalAbundance, aes(x=Margarites.spp.,y=Diaphana.minuta)) +
    theme_classic()

Yeah, that was way more code than base plot just to get a simple plot! But now that we have the basic ggplot skeleton, our investment is going to pay off because it’s now super easy to modify. Let’s assign different colours for different depths.

ggplot() +
    geom_point(data = larvalAbundance, aes(x=Margarites.spp.,y=Diaphana.minuta,colour=as.character(depth))) +
    theme_classic()

I had to convert year and depth to characters since ggplot assumes that numeric variables are continous. Let’s combine what we learned above about the tidyverse.

# prepare data for plotting
larvalAbundancePlot <- larvalAbundance %>% 
    mutate(Year = as.character(year),
           Depth = as.character(depth))
# plot new data
ggplot() +
    geom_point(data=larvalAbundancePlot,aes(x=Margarites.spp.,y=Diaphana.minuta,colour=Depth,shape=Year)) +
    theme_classic()

Let’s add a linear regression and make the axes log scale:

ggplot() +
    geom_point(data=larvalAbundancePlot,aes(x=Margarites.spp.,y=Diaphana.minuta,colour=Depth,shape=Year)) +
    geom_smooth(method="lm")+
    scale_x_log10()+
    scale_y_log10()+
    theme_classic()

Try commenting out (#) some of those lines to see what they do.

Now that you understand how that works, let’s try a few different plot types

# prepare long data for plotting
larvalAbundanceLongPlot <- larvalAbundanceLong %>% 
    mutate(Year = as.character(year),
           Depth = as.character(depth))
# plot new data
ggplot() +
    geom_boxplot(data=larvalAbundanceLongPlot,aes(x=Depth,y=larvalAbundanceIndPerm3,fill=Depth)) +
    scale_y_log10()+
    theme_classic()

Ready to get your mind blown? By adding 1 line to the above, we can have a plot for each species

ggplot() +
    geom_boxplot(data=larvalAbundanceLongPlot,aes(x=Depth,y=larvalAbundanceIndPerm3,fill=Depth)) +
    scale_y_log10()+
    facet_wrap(~scientificName)+
    theme_classic()

Or swapping out the geom gives you a completely different (better) plot.

ggplot() +
    geom_violin(data=larvalAbundanceLongPlot,aes(x=Depth,y=larvalAbundanceIndPerm3,fill=Depth)) +
    scale_y_log10()+
    facet_wrap(~scientificName)+
    theme_classic()

Or swapping out the geom again and we can map the abundance.

# prepare long data for plotting
larvalAbundanceLongPlot <- larvalAbundanceLong %>% 
    group_by(scientificName,decimalLongitude,decimalLatitude) %>% 
    summarize(mean_IndPerm3 = mean(larvalAbundanceIndPerm3))
# plot new mean data
ggplot() +
    geom_point(data=larvalAbundanceLongPlot,aes(x=decimalLongitude,y=decimalLatitude,size=log10(mean_IndPerm3))) +
    facet_wrap(~scientificName)+
    theme_classic()

Let’s add a bit more to these maps with some proper mapping tools. But before we move away from ggplot, bookmark this excellent “R Graph Catalog”. You can visually find a graph you want to make and get the code! Super useful!

Mapping

You can always plot any spatial data and use latitude and longitude as normal Cartesian coordinates as we did above, but if you want to do anything with projections, or other GIS type operations (buffer, intersection, overlaps, etc), you really need to dig a little deeper.

Projection and datum information in R in both sp and sf (as well as many other applications) are handled by the proj.4 standard. We won’t go in to deep into the ‘how to choose a projection’, but there is substantial information regarding the parameters on the proj.4 website and this Projection Wizard is super helpful for recommending projection (and producing formatted proj4 strings) based on the extent of your map and what you want to optimize for (i.e. Equal-area, Conformal, Equidistant, Compromise).

Be aware that in most cases R will throw an error if you are trying to use multiple spatial objects with different transformations, so remember to transform them to a common projection before trying to combine them. Also, remember that some projections have units of latitude and longitude and some have units of meters and they are of course incompatible.

The classic and the cutting edge: sp vs sf

Before November 2016, if you wanted to do any spatial tasks in R your choice was a simple one. You needed to use the sp package (along with rgeos, and rgdal). Now, the sf package is replacing that set of packages, and that’s a good thing because it allows for faster computation and it is compatible with the tidyverse. However, many of the helper packages that were built to interact with sp have not been upgraded to sf (yet).

The sp package

The bread and butter ‘Spatial’ object class from the sppackage are of the form SpatialBLANKDataFrame where blank can be Points, Lines, Polygons, etc. Let’s create a SpatialPointsDataFrame from our larvalAbundanceLong

# first make spatial points and set the projection (more on that below)
spoints <- SpatialPoints(coords = larvalAbundanceLong[,c("decimalLongitude","decimalLatitude")],
                         proj4string = CRS("+proj=longlat +datum=WGS84"))
# then make a SpatialPointsDataFrame by addint the data frame back in
spointsdf <- SpatialPointsDataFrame(spoints,larvalAbundanceLong)
# now for each point in spointsdf there is a row in the data frame which can be accessed by spointsdf@data, or if you want a particular variable spointsdf$year (or spointsdf@data$year), but the data and the geometries are a little disjoint.
# plot it
plot(spointsdf)

That’s not particularly useful without a basemap, so let’s use the raster package to get some high resolution province level boundaries data (not the only way to get coastlines, but one of my favorite).

Canada <- getData('GADM', country="CAN", level=1)
plot(Canada)
# this is a little excessive since we only need Nova Scotia
NS <- Canada[Canada$NAME_1=="Nova Scotia",]
plot(NS,add=T,col='blue')

Now let’s combine my data with the coastline

# see the bounding box of the larval data
bbox(spointsdf)
                       min       max
decimalLongitude -61.88518 -61.57103
decimalLatitude   45.71147  46.05218
# transform both Spatial objects
proj_aea <- "+proj=aea +lat_1=45.34229371956944 +lat_2=46.154426884765485 +lon_0=-61.70471191406251"
spointsdf_aea <- spTransform(spointsdf,proj_aea)
NS_aea <- spTransform(NS,proj_aea)
# see the bounding box of the larval data after projection
bbox(spointsdf_aea)
                        min        max
decimalLongitude  -13996.94   10365.71
decimalLatitude  4714435.00 4752308.03
plot(spointsdf_aea,pch=16, xlim = c(-15000,12000), ylim = c(4705000,4755000))
plot(NS_aea, col = 'grey',add=T)

You can find more info about the other Spatial classes in my GIS tutorial I wrote before sf was created.

The sf package

The sf package is an an R implementation of “Simple Features” (officially Simple Feature Access) which is both an Open Geospatial Consortium (OGC) and International Organization for Standardization (ISO) standard ISO 19125 that specifies a common storage and access model of mostly two-dimensional geometries (point, line, polygon, multi-point, multi-line, etc.) used by geographic information systems.

The bread and butter object classes in the sf package are sf (created by st_sf()) which are effectively a normal data frame with the addition of one (or more) simple features list column (sfc; created by st_sfc()). This sfc column, which is really just another column in your data frame, contains the geometric information that gets plotted on a map. This geometric information is stored as a simple feature which can take the form of polygon(s) (created by st_polygon() or st_multipolygon() ), line(s) (created by st_st_linestring() or st_multist_linestring() ), or point(s) (created by st_point() or st_multipoint() )

Let’s start by turning our larvalAbundanceLong into an sf object and transform it into our desired projection in 1 command (reminder: sf is compatible with the tidyverse):

larvalAbundanceLong_sf <- larvalAbundanceLong %>% 
    st_as_sf(coords = c("decimalLongitude", "decimalLatitude"),
             crs="+proj=longlat +datum=WGS84",
             remove=FALSE) %>% 
    st_transform(proj_aea)
ggplot()+
    geom_sf(data = larvalAbundanceLong_sf)

As you may have noticed, geom_sf does not need a defined aes() since is already knows that latitude is your y-axis and longitude is the x-axis. You can of course toggle group, colour, size, and fill in the aes based on other columns in your data frame.

Let’s plot is with a basemap! raster only provides the GADM coastline data in sp format, but we can convert with st_as_sf() and then filter() to get just the Maritime provinces with dplyr

MarProv_sf <- getData('GADM', country="CAN", level=1) %>% 
    st_as_sf() %>% 
    filter(NAME_1=="Nova Scotia"|
               NAME_1=="Prince Edward Island"|
               NAME_1=="New Brunswick")%>% 
    st_transform(proj_aea)
ggplot()+
    geom_sf(data = larvalAbundanceLong_sf)+
    geom_sf(data = MarProv_sf)

And to make things fun, let’s have one zoomed in plot with the dots, and a zoomed out plot with a ‘zoom box’

#define the corners (with a buffer)
larvalbbox <- st_buffer(larvalAbundanceLong_sf, dist = 10000) %>% 
    st_bbox()

#create a matrix that traces a line the whole way around your box
bbsfc <- cbind(
            c(larvalbbox[1],larvalbbox[3],larvalbbox[3],larvalbbox[1],larvalbbox[1]),
            c(larvalbbox[2],larvalbbox[2],larvalbbox[4],larvalbbox[4],larvalbbox[2])
        ) %>% 
    list() %>%                  # put that matrix in a list
    st_polygon() %>%            # put that list in a simple features polygon
    st_sfc(crs = proj_aea) %>%  # put that polygon into a simple features column and give projection info
    st_sf(name="Study Site", geometry=.)    # finally make it an sf object

# plot it!
p1 <- ggplot()+
    geom_sf(data = MarProv_sf)+
    geom_sf(data = bbsfc, size = 2, fill = "transparent")

p2 <- ggplot()+
    geom_sf(data = larvalAbundanceLong_sf)+
    geom_sf(data = MarProv_sf) +
    coord_sf(xlim = larvalbbox[c(1,3)],ylim = larvalbbox[c(2,4)])

grid.arrange(p1,p2)

Finally, let’s bring facets back!

#define the corners (with a buffer)
larvalbbox <- st_buffer(larvalAbundanceLong_sf, dist = 10000) %>% 
    st_bbox()
#create a matrix that traces a line the whole way around your box
bbsfc <- cbind(
            c(larvalbbox[1],larvalbbox[3],larvalbbox[3],larvalbbox[1],larvalbbox[1]),
            c(larvalbbox[2],larvalbbox[2],larvalbbox[4],larvalbbox[4],larvalbbox[2])
        ) %>% 
    list() %>%                  # put that matrix in a list
    st_polygon() %>%            # put that list in a simple features polygon
    st_sfc(crs = proj_aea) %>%  # put that polygon into a simple features column and give projection info
    st_sf(name="Study Site", geometry=.)    # finally make it an sf object
# plot it!
p1 <- ggplot()+
    geom_sf(data = MarProv_sf)+
    geom_sf(data = bbsfc, size = 2, fill = "transparent")
p2 <- ggplot()+
    geom_sf(data = larvalAbundanceLong_sf)+
    geom_sf(data = MarProv_sf) +
    coord_sf(xlim = larvalbbox[c(1,3)],ylim = larvalbbox[c(2,4)])
grid.arrange(p1,p2)

# oh, and if you want to save a plot, pick your format, size, and resolution
ggsave(filename="speciesabundance.tiff", plot = p3, width = 20, height = 16.5, unit = "cm", dpi = 300)
ggsave(filename="speciesabundance.pdf", plot = p3, width = 20, height = 16.5, unit = "cm", dpi = 300)
ggsave(filename="speciesabundance.png", plot = p3, width = 20, height = 16.5, unit = "cm", dpi = 300)

Challenge answers

Challenge 1 solution

site5 <- larvalAbundanceLong %>%
   filter(locationID==5) %>%
   select(-locationID,-decimalLatitude,-decimalLatitude)

Note: The order of operations is very important in this case. If we used ‘select’ first, filter would not be able to find the variable continent since we would have removed it in the previous step.

Challenge 2 solution

IndPerm3_byLocationID <- larvalAbundanceLong %>%
   group_by(locationID) %>%
   summarize(mean_IndPerm3 = mean(larvalAbundanceIndPerm3)) %>% 
   filter(mean_IndPerm3 == min(mean_IndPerm3) | mean_IndPerm3 == max(mean_IndPerm3))

<<BACK


  1. Daigle RM, Metaxas A, deYoung B (2014) Bay-scale patterns in the distribution, aggregation and spatial variability of larvae of benthic invertebrates. Marine Ecology Progress Series 503:139-156. http://dx.doi.org/10.3354/meps10734

  2. Jeff Leek, The Elements of Data Analytic Style, Leanpub, 2015-03-02

---
title: "Data Analysis and Visualization in R"
output:
  html_document:
    toc: yes
    toc_float: yes
  html_notebook:
    toc: yes
    toc_float: yes
---
[<<BACK](https://remi-daigle.github.io/2017-CHONe-Data/)

(Remi/David)
Now, if you remember what we did yesterday, we loaded and cleaned my larval abundance data, but I did not follow the principles of 'tidy' data when I uploaded my data.

Again, this data comes from one of Remi's PhD thesis chapters[^1] that was part of the first CHONe. We cleaned in the [Data cleaning and raw data management](https://remi-daigle.github.io/2017-CHONe-Data/cleaning.nb.html) and is also archived on [Dryad](http://datadryad.org/handle/10255/dryad.59482) so we would have cleaning to do!

[^1]: Daigle RM, Metaxas A, deYoung B (2014) Bay-scale patterns in the distribution, aggregation and spatial variability of larvae of benthic invertebrates. Marine Ecology Progress Series 503:139-156. http://dx.doi.org/10.3354/meps10734

**Reminder**:
The 'tidy' concept is something we will revisit in the R section, but the rules of tidy data are[^2]:

1. Each variable you measure should be in one column
2. Each different observation of that variable should be in a different row
3. There should be one table for each “kind” of variable
4. If you have multiple tables, they should include a column in the table that allows them to be linked

While it was certainly easier to enter the abundance data in a spreadsheet by giving each species a column, this does not abide by the 'tidy' rules and we will have a much easier time analysing and plotting our data if we first 'tidy' the data.

[^2]: [Jeff Leek, The Elements of Data Analytic Style, Leanpub, 2015-03-02](https://leanpub.com/datastyle)

```{r,message=FALSE}
# let's load all packages and the data before we begin
library(sf)
library(raster)
library(marmap)
library(robis)
library(gridExtra)
library(tidyverse)

larvalAbundance <- read.csv("data/larvalAbundanceClean.csv", stringsAsFactors = FALSE)

head(larvalAbundance)
```


# The `tidyverse` 'meta' package

The `tidyverse` is a package that is made up of other packages commonly used together for data analysis and vizualization. You can find much more detail in the book [R for Data Science](http://r4ds.had.co.nz/) by Garrett Grolemund and Hadley Wickham. Below we will learn about 3 of them:

- `tidyr` for tidying data
- `dplyr` for manipulating data
- `ggplot2` for plotting data

## The `tidyr` package

This aptly named package will allow us to tidy our data. Our data is currently in the wide format since we have multiple observations of species abundances in seperate columns. We want our data to be in the long format in order to facilitate analysis and plotting (and to abide by the 'tidy' rules).


```{r widevslong,results='asis',echo=FALSE, message=FALSE}
require(DiagrammeR)
grViz('digraph html {
      table1 [shape=none, margin=0, label=<
      <TABLE BORDER="0" CELLBORDER="0" CELLSPACING="1" CELLPADDING="6">
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">ID</TD>
      <TD BGCOLOR="#FF7400" CELLPADDING="0">a1</TD>
      <TD BGCOLOR="#009999" CELLPADDING="0">a2</TD>
      <TD BGCOLOR="#00CC00" CELLPADDING="0">a3</TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000">1</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000">2</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000">3</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      </TABLE>>];
      
      table2 [shape=none, margin=0,label=<
      <TABLE BORDER="0" CELLBORDER="0" CELLSPACING="1" CELLPADDING="6">
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">ID</TD>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">ID2</TD>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">A</TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000">1</TD>
      <TD BGCOLOR="#FF7400">a1</TD>
      <TD BGCOLOR="#FF7400"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000">2</TD>
      <TD BGCOLOR="#FF7400">a1</TD>
      <TD BGCOLOR="#FF7400"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000">3</TD>
      <TD BGCOLOR="#FF7400">a1</TD>
      <TD BGCOLOR="#FF7400"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000">1</TD>
      <TD BGCOLOR="#009999">a2</TD>
      <TD BGCOLOR="#009999"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000">2</TD>
      <TD BGCOLOR="#009999">a2</TD>
      <TD BGCOLOR="#009999"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000">3</TD>
      <TD BGCOLOR="#009999">a2</TD>
      <TD BGCOLOR="#009999"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000">1</TD>
      <TD BGCOLOR="#00CC00">a3</TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000">2</TD>
      <TD BGCOLOR="#00CC00">a3</TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000">3</TD>
      <TD BGCOLOR="#00CC00">a3</TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      </TABLE>>];
      
      subgraph {
      rank = same; table1; table2;
      }
      
      
      labelloc="t";
      fontname="Courier";
      label="wide  vs  long";
      }
      ')
  

```

We currently have a few ID type variable (i.e. they are not observations; `year`,`month`,`day`,`time`,`depth`,`site`,`long`,`lat`) and an observational variable for each species.

```{r}
names(larvalAbundance)
larvalAbundanceLong <- gather(larvalAbundance,                                  # the old data frame
                              key = "scientificName",                           # what the column of old column names will be called
                              value = "larvalAbundanceIndPerm3",                # what the gathered numbers will be called
                              Astyris.lunata:Crangon.septemspinosa)             # which columns  to gather

# alternatively, I could of specified which columns not to include instead
larvalAbundanceLong <- gather(larvalAbundance,                                  # the old data frame
                              key = "scientificName",                           # what the column of old column names will be called
                              value = "larvalAbundanceIndPerm3",                # what the gathered numbers will be called
                              -year,-month,-day,-time,-depth,-locationID,-decimalLongitude,-decimalLatitude)  # which columns not to gather

head(larvalAbundanceLong)                           
```

We can also clean up and/or separate the `scientificName` column

```{r}
# first let's swap multiple species 
larvalAbundanceLong <- separate(larvalAbundanceLong,           #old data frame
                                col = scientificName,          # column to be seperated
                                into = c("genus","species"),   # to be seperated into
                                extra = "merge",               # if there are too many ".", what to do with the extra column
                                remove = FALSE,                # whether to delete the seperated column
                                sep = "\\.")                   # regex for ".", but need escape character "\\" since . is a wildcard

# Let's check out our work
head(larvalAbundanceLong)

# genus and species don't look terrible, but scientificName still has that pesky period
# we could use gsub again, but the regex is tricky since there are spp. for species
# pasting the genus and species together will be much easier
larvalAbundanceLong$scientificName <- paste(larvalAbundanceLong$genus,larvalAbundanceLong$species,sep = " ")

# also, spp. is not a species designation, let's get rid of them
larvalAbundanceLong$species[larvalAbundanceLong$species=="spp."] <- NA

# Let's check out our work again
head(larvalAbundanceLong)
```


## The `dplyr` package

Manipulation of dataframes means many things to many researchers, we often
select certain observations (rows) or variables (columns), we often group the
data by a certain variable(s), or we even calculate summary statistics. We can
do these operations using the normal base R operations:

```{r}
mean(larvalAbundanceLong[larvalAbundanceLong$scientificName == "Astyris lunata", "larvalAbundanceIndPerm3"])
mean(larvalAbundanceLong[larvalAbundanceLong$scientificName == "Bittiolum alternatum", "larvalAbundanceIndPerm3"])
mean(larvalAbundanceLong[larvalAbundanceLong$scientificName == "Arrhoges occidentalis", "larvalAbundanceIndPerm3"])
```

But this isn't very *nice* because there is a fair bit of repetition. Repeating
yourself will cost you time, both now and later, and potentially introduce some
nasty bugs.

Luckily, the [`dplyr`](https://cran.r-project.org/web/packages/dplyr/dplyr.pdf)
package provides a number of very useful functions for manipulating dataframes
in a way that will reduce the above repetition, reduce the probability of making
errors, and probably even save you some typing. As an added bonus, you might
even find the `dplyr` grammar easier to read.

Here we're going to cover 5 of the most commonly used functions as well as using
pipes (`%>%`) to combine them.

1. `select()`
2. `filter()`
3. `group_by()`
4. `summarize()`
5. `mutate()`

### Using `select()`

If, for example, we wanted to move forward with only a few of the variables in
our dataframe we could use the `select()` function. This will keep only the
variables you select.


```{r select,results='asis',echo=FALSE, message=FALSE}
grViz('digraph html {
      table1 [shape=none, margin=0,label=<
      <TABLE BORDER="0" CELLBORDER="0" CELLSPACING="1" CELLPADDING="6">
      <TR>
      <TD PORT="f0" BGCOLOR="#FF0000" CELLPADDING="0">a</TD>
      <TD BGCOLOR="#FF7400" CELLPADDING="0">b</TD>
      <TD PORT="f1" BGCOLOR="#009999" CELLPADDING="0">c</TD>
      <TD BGCOLOR="#00CC00" CELLPADDING="0">d</TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000"></TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000"></TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000"></TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      </TABLE>>];
      
      table2 [shape=none, margin=0, label=<
      <TABLE BORDER="0" CELLBORDER="0" CELLSPACING="1" CELLPADDING="6">
      <TR>
      <TD PORT="f0" BGCOLOR="#FF0000" CELLPADDING="0">a</TD>
      <TD PORT="f1" BGCOLOR="#009999" CELLPADDING="0">c</TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000"></TD>
      <TD BGCOLOR="#009999"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000"></TD>
      <TD BGCOLOR="#009999"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000"></TD>
      <TD BGCOLOR="#009999"></TD>
      </TR>
      </TABLE>>];
      
      table1:f1 -> table2:f1 
      table1:f0 -> table2:f0
      
      subgraph {
      rank = same; table1; table2;
      }
      
      labelloc="t";
      fontname="Courier";
      label="select(data.frame,a,c)";
      }
      ')
  
```

```{r}
yearAndSpecies <- select(larvalAbundanceLong,year,species,genus,larvalAbundanceIndPerm3)
```


If we open up `yearAndSpecies` we'll see that it only contains the year, and the larval abundance for each species. 

Above we used 'normal' grammar, but the strengths of `dplyr` lie in combining several functions using pipes (`%>%`). Since the pipes grammar is unlike anything we've seen in R before, let's repeat what we've done above using pipes.

```{r}
yearTaxonAbun <- larvalAbundanceLong %>% select(year,species,genus,larvalAbundanceIndPerm3)
```

To help you understand why we wrote that in that way, let's walk through it step by step. In both cases, the left side of `<-` is `yearTaxonAbun`. First we summon the `larvalAbundanceLong` dataframe and pass it on, using the pipe symbol `%>%`, to the next step, which is the `select()` function. In this case we don't specify which `.data` argument we use in the `select()` function since in gets that from the previous pipe. Essentially, the output of what is written before the pipe becomes the first argument, in this case `.data`, of the function after the pipe. **Fun Fact**: There is a chance you have encountered pipes before in the shell. In R, a pipe symbol is `%>%` while in the shell it is `|` but the concept is the same!

>**Pro-tip**
>
> Did you get an error while running the `select()` function? Something like:
> `Error in (function (classes, fdef, mtable)  : `
>  `unable to find an inherited method for function ‘select’ for signature ‘"data.frame"’`
> That's because many packages have a `select()` function; In this case, it's the `raster` package causing problems.
> It is possible to get around that problem the 'hacky' way by loading `tidyverse` after all the other packages so that the `dplyr` (which is in the `tidyverse`) version of `select` sort of overwrites the one loaded from `raster`.
>
> Alternatively, you can designate which package using the `::` (e.g. `package::function()`) when it's not clear. 
> `yearTaxonAbun <- larvalAbundanceLong %>% dplyr::select(year,species,genus,larvalAbundanceIndPerm3)`

### Using `filter()`

If we now wanted to move forward with the above, but only with species for which we have a valid `genus` and `species`, we can combine `select` and `filter`

```{r}
yearTaxonAbunValid <- larvalAbundanceLong %>%
    select(year,species,genus,larvalAbundanceIndPerm3) %>% 
    filter(genus!="Other") %>% 
    filter(!is.na(species))

# we can also string the two filter operations into one using the and symbol '&'

yearTaxonAbunValid <- larvalAbundanceLong %>%
    select(year,species,genus,larvalAbundanceIndPerm3) %>% 
    filter(genus!="Other" & !is.na(species))
```

> ** Challenge 1 **
>
> Write a single command (which can span multiple lines and includes pipes) that will produce a dataframe that has the values for `larvalAbundanceIndPerm3` for all species at `locationID` 5, but not for other sites. I also want locationID, and lat/long removed from the output. How many rows does your dataframe have and why?
> [Solution](https://remi-daigle.github.io/2017-CHONe-Data/dataviz.nb.html#challenge_answers)


### Using `group_by()`

Now, we were supposed to be reducing the error prone repetitiveness of what can
be done with base R, but up to now we haven't done that since we would have to
repeat the above for each species. Instead of `filter()`, which will only pass
observations that meet your criteria (in the above: `locationID==5`), we
can use `group_by()`, which will essentially use every unique criteria that you
could have used in filter.

```{r}
str(larvalAbundanceLong)

str(larvalAbundanceLong %>% group_by(locationID))
```
You will notice that the structure of the dataframe where we used `group_by()`
(`grouped_df`) is not the same as the original `larvalAbundanceLong` (`data.frame`). A
`grouped_df` can be thought of as a `list` where each item in the `list`is a
`data.frame` which contains only the rows that correspond to the a particular
value `locationID` (at least in the example above).

```{r group_by,results='asis',echo=FALSE, message=FALSE}
grViz('digraph html {
      table1 [shape=none, margin=0,label=<
      <TABLE BORDER="0" CELLBORDER="0" CELLSPACING="1" CELLPADDING="6">
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">a</TD>
      <TD BGCOLOR="#FF7400" CELLPADDING="0">b</TD>
      <TD BGCOLOR="#009999" CELLPADDING="0">c</TD>
      <TD BGCOLOR="#00CC00" CELLPADDING="0">d</TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">1</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD PORT="f0" BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">1</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">2</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD PORT="f1" BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">2</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">3</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD PORT="f2" BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">3</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      </TABLE>>];
      table2 [shape=none, margin=0,label=<
      <TABLE BORDER="0" CELLBORDER="0" CELLSPACING="1" CELLPADDING="6">
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">a</TD>
      <TD BGCOLOR="#FF7400" CELLPADDING="0">b</TD>
      <TD BGCOLOR="#009999" CELLPADDING="0">c</TD>
      <TD BGCOLOR="#00CC00" CELLPADDING="0">d</TD>
      </TR>
      
      <TR>
      <TD PORT="f0" BGCOLOR="#FF0000" CELLPADDING="0">1</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">1</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      </TABLE>>];
      table3 [shape=none, margin=0,label=<
      <TABLE BORDER="0" CELLBORDER="0" CELLSPACING="1" CELLPADDING="6">
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">a</TD>
      <TD BGCOLOR="#FF7400" CELLPADDING="0">b</TD>
      <TD BGCOLOR="#009999" CELLPADDING="0">c</TD>
      <TD BGCOLOR="#00CC00" CELLPADDING="0">d</TD>
      </TR>
      
      <TR>
      <TD PORT="f1" BGCOLOR="#FF0000" CELLPADDING="0">2</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">2</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      </TABLE>>];
      table4 [shape=none, margin=0,label=<
      <TABLE BORDER="0" CELLBORDER="0" CELLSPACING="1" CELLPADDING="6">
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">a</TD>
      <TD BGCOLOR="#FF7400" CELLPADDING="0">b</TD>
      <TD BGCOLOR="#009999" CELLPADDING="0">c</TD>
      <TD BGCOLOR="#00CC00" CELLPADDING="0">d</TD>
      </TR>
      
      <TR>
      <TD PORT="f2" BGCOLOR="#FF0000" CELLPADDING="0">3</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">3</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      </TABLE>>];
      
      
      
      table1:f0 -> table2:f0 
      table1:f1 -> table3:f1
      table1:f2 -> table4:f2
      
      subgraph {
      rank = same; table1; table2; table3 ;table4;
      }
      subgraph {
      table2
      table3
      table4
      }
      
      labelloc="t";
      fontname="Courier";
      label="data.frame %>% group_by(a)";
      }
      ')

```

### Using `summarize()`

The above was a bit on the uneventful side because `group_by()` much more
exciting in conjunction with `summarize()`. This will allow use to create new
variable(s) by using functions that repeat for each of the locationID-specific
data frames. That is to say, using the `group_by()` function, we split our
original dataframe into multiple pieces, then we can run functions
(e.g. `mean()` or `sd()`) within `summarize()`.

```{r}
IndPerm3_byLocationID <- larvalAbundanceLong %>%
    group_by(locationID) %>%
    summarize(mean_larvalAbundanceIndPerm3 = mean(larvalAbundanceIndPerm3))
```

```{r summarize,results='asis',echo=FALSE, message=FALSE}
grViz('digraph html {
      table1 [shape=none, margin=0,label=<
      <TABLE BORDER="0" CELLBORDER="0" CELLSPACING="1" CELLPADDING="6">
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">a</TD>
      <TD BGCOLOR="#FF7400" CELLPADDING="0">b</TD>
      <TD BGCOLOR="#009999" CELLPADDING="0">c</TD>
      <TD BGCOLOR="#00CC00" CELLPADDING="0">d</TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">1</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD PORT="f0" BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">1</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">2</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD PORT="f1" BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">2</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">3</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD PORT="f2" BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">3</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      </TABLE>>];
      
      table2 [shape=none, margin=0,label=<
      <TABLE BORDER="0" CELLBORDER="0" CELLSPACING="1" CELLPADDING="6">
      <TR>
      <TD PORT="f0" BGCOLOR="#FF0000" CELLPADDING="0">a</TD>
      <TD BGCOLOR="#FF7400" CELLPADDING="0">b</TD>
      <TD BGCOLOR="#009999" CELLPADDING="0">c</TD>
      <TD BGCOLOR="#00CC00" CELLPADDING="0">d</TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">1</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD PORT="f3" BGCOLOR="#FF0000" CELLPADDING="0">1</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      </TABLE>>];
      
      table3 [shape=none, margin=0,label=<
      <TABLE BORDER="0" CELLBORDER="0" CELLSPACING="1" CELLPADDING="6">
      <TR>
      <TD PORT="f1" BGCOLOR="#FF0000" CELLPADDING="0">a</TD>
      <TD BGCOLOR="#FF7400" CELLPADDING="0">b</TD>
      <TD BGCOLOR="#009999" CELLPADDING="0">c</TD>
      <TD BGCOLOR="#00CC00" CELLPADDING="0">d</TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">2</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD PORT="f3" BGCOLOR="#FF0000" CELLPADDING="0">2</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      </TABLE>>];
      
      table4 [shape=none, margin=0,label=<
      <TABLE BORDER="0" CELLBORDER="0" CELLSPACING="1" CELLPADDING="6">
      <TR>
      <TD PORT="f2" BGCOLOR="#FF0000" CELLPADDING="0">a</TD>
      <TD BGCOLOR="#FF7400" CELLPADDING="0">b</TD>
      <TD BGCOLOR="#009999" CELLPADDING="0">c</TD>
      <TD BGCOLOR="#00CC00" CELLPADDING="0">d</TD>
      </TR>
      
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">3</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      
      <TR>
      <TD PORT="f3" BGCOLOR="#FF0000" CELLPADDING="0">3</TD>
      <TD BGCOLOR="#FF7400"></TD>
      <TD BGCOLOR="#009999"></TD>
      <TD BGCOLOR="#00CC00"></TD>
      </TR>
      </TABLE>>];
      
      table5 [shape=none, margin=0,label=<
      <TABLE BORDER="0" CELLBORDER="0" CELLSPACING="1" CELLPADDING="6">
      <TR>
      <TD BGCOLOR="#FF0000" CELLPADDING="0">a</TD>
      <TD BGCOLOR="#FF7400" CELLPADDING="0">mean_b</TD>
      </TR>
      
      <TR>
      <TD PORT="f0" BGCOLOR="#FF0000" CELLPADDING="0">1</TD>
      <TD BGCOLOR="#FF7400"></TD>
      </TR>
      
      <TR>
      <TD PORT="f1" BGCOLOR="#FF0000" CELLPADDING="0">2</TD>
      <TD BGCOLOR="#FF7400"></TD>
      </TR>
      
      <TR>
      <TD PORT="f2" BGCOLOR="#FF0000" CELLPADDING="0">3</TD>
      <TD BGCOLOR="#FF7400"></TD>
      </TR>
      
      </TABLE>>];
      
      
      table1:f0 -> table2:f0 
      table1:f1 -> table3:f1
      table1:f2 -> table4:f2
      table2:f3 -> table5:f0 
      table3:f3 -> table5:f1
      table4:f3 -> table5:f2
      subgraph {
      rank = same; table1; table2; table3 ; table4;
      }
      subgraph {
      table2
      table3
      table4
      }
      
      labelloc="t";
      fontname="Courier";
      label="data.frame %>%\\l\tgroup_by(a) %>%\\l\tsummarize(mean_b=mean(b))\\l";
      }
      ')
```

That allowed us to calculate the mean larval abundance for each location, but it gets
so much better!

> ** Challenge 2 **
>
> Calculate the mean larval abundance per locationID. Which has the highest mean larval abundance and which has the lowest mean larval abundance?
> [Solution](https://remi-daigle.github.io/2017-CHONe-Data/dataviz.nb.html#challenge_answers)


The function `group_by()` allows us to group by multiple variables. Let's group by `locationID` and `scientificName`.


```{r}
IndPerm3_byLocationID_bySpp <- larvalAbundanceLong %>%
    group_by(locationID,scientificName ) %>%
    summarize(mean_larvalAbundanceIndPerm3 = mean(larvalAbundanceIndPerm3))
```

That is already quite powerful, but it gets even better! You're not limited to defining 1 new variable in `summarize()`.

```{r}
IndPerm3_byLocationID_bySpp <- larvalAbundanceLong %>%
    group_by(locationID,scientificName ) %>%
    summarize(mean_larvalAbundanceIndPerm3 = mean(larvalAbundanceIndPerm3),
              sd_larvalAbundanceIndPerm3 = sd(larvalAbundanceIndPerm3),
              mean_depth = mean(depth),
              sd_depth= sd(depth))
```

### Using `mutate()`

We can also create new variables prior to (or even after) summarizing information using `mutate()`.

```{r}
IndPerm3_byLocationID_bySpp <- larvalAbundanceLong %>%
    mutate(larvalAbundanceIndPercm3=larvalAbundanceIndPerm3*1000000) %>% 
    group_by(locationID,scientificName ) %>%
    summarize(mean_larvalAbundanceIndPercm3 = mean(larvalAbundanceIndPercm3),
              sd_larvalAbundanceIndPercm3 = sd(larvalAbundanceIndPercm3),
              mean_depth = mean(depth),
              sd_depth= sd(depth))
```

# Plotting
R comes with basic plotting function, usually referred to as 'base plot', e.g: 

```{r}
plot(larvalAbundance$Margarites.spp.,larvalAbundance$Diaphana.minuta)
```

However, general concensus seems to be that `ggplot2` (part of the tidyverse) is the way to go for plotting. I (Remi) have mixed feelings about this package (my age is showing again). It does make making quick plots easier and is compatible with the rest of the tidyverse, but until recently, `ggplot2` really struggled with any spatial plotting (see '[The classic and the cutting edge: `sp` vs `sf`](https://remi-daigle.github.io/2017-CHONe-Data/dataviz.nb.html#the_classic_and_the_cutting_edge:_sp_vs_sf)') so I had to know the base plot anyway. I still think base plot is more customizeable, but it may also be that I have not sufficiently mastered the ggplot world. We're mostly going to cover `ggplot2`, but be aware that there are other plotting paradigms in R (e.g. base, lattice, etc)

## The `ggplot2` package

A system for 'declaratively' creating graphics, based on "[The Grammar of Graphics](http://www.springer.com/gp/book/9780387245447)". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details

So for the same plot as above, let me walk you through each line below step by step. The first line creates a new blank ggplot plot. Some people set `aes()` and `data` in `ggplot()`, but since different layers can take different `aes()` and `data`, I like to set those in the geom functions. 

Unlike other `tidyverse` functions, you can tie together `ggplot2` functions with `+` instead of `%>%`, but otherwise it works the same. So to our blank plot, we're adding `geom_point()` which is the function for a scatterplot. We're telling `geom_point()` that all the data it needs should be in `larvalAbundance` and what aesthetic mapping we want (`aes()`) which sets the variables used in the plot and takes arguments like `x`, `y`, `colour`, `size`. 

Finally, I set the theme as `theme_classic` because the default plots are ugly! (try it, I dare you)
# 

```{r}
ggplot() +
    geom_point(data = larvalAbundance, aes(x=Margarites.spp.,y=Diaphana.minuta)) +
    theme_classic()
```

Yeah, that was way more code than base plot just to get a simple plot! But now that we have the basic ggplot skeleton, our investment is going to pay off because it's now super easy to modify. Let's assign different colours for different depths.

```{r}
ggplot() +
    geom_point(data = larvalAbundance, aes(x=Margarites.spp.,y=Diaphana.minuta,colour=as.character(depth))) +
    theme_classic()
```

I had to convert `year` and `depth` to characters since ggplot assumes that numeric variables are continous. Let's combine what we learned above about the tidyverse.

```{r}
# prepare data for plotting
larvalAbundancePlot <- larvalAbundance %>% 
    mutate(Year = as.character(year),
           Depth = as.character(depth))

# plot new data
ggplot() +
    geom_point(data=larvalAbundancePlot,aes(x=Margarites.spp.,y=Diaphana.minuta,colour=Depth,shape=Year)) +
    theme_classic()
```

Let's add a linear regression and make the axes log scale:

```{r}
ggplot() +
    geom_point(data=larvalAbundancePlot,aes(x=Margarites.spp.,y=Diaphana.minuta,colour=Depth,shape=Year)) +
    geom_smooth(method="lm")+
    scale_x_log10()+
    scale_y_log10()+
    theme_classic()
```

Try commenting out (`#`) some of those lines to see what they do.

Now that you understand how that works, let's try a few different plot types

```{r}
# prepare long data for plotting
larvalAbundanceLongPlot <- larvalAbundanceLong %>% 
    mutate(Year = as.character(year),
           Depth = as.character(depth))

# plot new data
ggplot() +
    geom_boxplot(data=larvalAbundanceLongPlot,aes(x=Depth,y=larvalAbundanceIndPerm3,fill=Depth)) +
    scale_y_log10()+
    theme_classic()
```

Ready to get your mind blown? By adding 1 line to the above, we can have a plot for each species

```{r}
ggplot() +
    geom_boxplot(data=larvalAbundanceLongPlot,aes(x=Depth,y=larvalAbundanceIndPerm3,fill=Depth)) +
    scale_y_log10()+
    facet_wrap(~scientificName)+
    theme_classic()
```

Or swapping out the geom gives you a completely different (better) plot.

```{r}
ggplot() +
    geom_violin(data=larvalAbundanceLongPlot,aes(x=Depth,y=larvalAbundanceIndPerm3,fill=Depth)) +
    scale_y_log10()+
    facet_wrap(~scientificName)+
    theme_classic()
```

Or swapping out the geom again and we can map the abundance.

```{r}
# prepare long data for plotting
larvalAbundanceLongPlot <- larvalAbundanceLong %>% 
    group_by(scientificName,decimalLongitude,decimalLatitude) %>% 
    summarize(mean_IndPerm3 = mean(larvalAbundanceIndPerm3))

# plot new mean data
ggplot() +
    geom_point(data=larvalAbundanceLongPlot,aes(x=decimalLongitude,y=decimalLatitude,size=log10(mean_IndPerm3))) +
    facet_wrap(~scientificName)+
    theme_classic()

```

Let's add a bit more to these maps with some proper mapping tools. But before we move away from ggplot, bookmark this excellent "[R Graph Catalog](http://shiny.stat.ubc.ca/r-graph-catalog/)". You can visually find a graph you want to make and get the code! Super useful!

# Mapping

You can always plot any spatial data and use latitude and longitude as normal Cartesian coordinates as we did above, but if you want to do anything with projections, or other GIS type operations (buffer, intersection, overlaps, etc), you really need to dig a little deeper. 

Projection and datum information in R in both `sp` and `sf` (as well as many other applications) are handled by the [proj.4](http://proj4.org/index.html) standard. We won't go in to deep into the 'how to choose a projection', but there is substantial information regarding the parameters on the [proj.4](http://proj4.org/index.html) website and this [Projection Wizard](http://projectionwizard.org/) is super helpful for recommending projection (and producing formatted proj4 strings) based on the extent of your map and what you want to optimize for (i.e. Equal-area, Conformal, Equidistant, Compromise).

Be aware that in most cases R will throw an error if you are trying to use multiple spatial objects with different transformations, so remember to transform them to a common projection before trying to combine them. Also, remember that some projections have units of latitude and longitude and some have units of meters and they are of course incompatible.

## The classic and the cutting edge: `sp` vs `sf`

Before November 2016, if you wanted to do any spatial tasks in R your choice was a simple one. You needed to use the `sp` package (along with `rgeos`, and `rgdal`). Now, the `sf` package is replacing that set of packages, and that's a good thing because it allows for faster computation and **it is compatible with the `tidyverse`**. However, many of the helper packages that were built to interact with `sp` have not been upgraded to `sf` (yet).

### The `sp` package

The bread and butter 'Spatial' object class from the `sp`package are of the form `SpatialBLANKDataFrame` where blank can be `Points`, `Lines`, `Polygons`, etc. Let's create a `SpatialPointsDataFrame` from our `larvalAbundanceLong`

```{r}
# first make spatial points and set the projection (more on that below)
spoints <- SpatialPoints(coords = larvalAbundanceLong[,c("decimalLongitude","decimalLatitude")],
                         proj4string = CRS("+proj=longlat +datum=WGS84"))

# then make a SpatialPointsDataFrame by addint the data frame back in
spointsdf <- SpatialPointsDataFrame(spoints,larvalAbundanceLong)

# now for each point in spointsdf there is a row in the data frame which can be accessed by spointsdf@data, or if you want a particular variable spointsdf$year (or spointsdf@data$year), but the data and the geometries are a little disjoint.

# plot it
plot(spointsdf)
```

That's not particularly useful without a basemap, so let's use the `raster` package to get some high resolution province level boundaries data (not the only way to get coastlines, but one of my favorite).

```{r}
Canada <- getData('GADM', country="CAN", level=1)
plot(Canada)

# this is a little excessive since we only need Nova Scotia

NS <- Canada[Canada$NAME_1=="Nova Scotia",]
plot(NS,add=T,col='blue')
```

Now let's combine my data with the coastline
```{r}
# see the bounding box of the larval data
bbox(spointsdf)

# transform both Spatial objects
proj_aea <- "+proj=aea +lat_1=45.34229371956944 +lat_2=46.154426884765485 +lon_0=-61.70471191406251"
spointsdf_aea <- spTransform(spointsdf,proj_aea)
NS_aea <- spTransform(NS,proj_aea)

# see the bounding box of the larval data after projection
bbox(spointsdf_aea)

plot(spointsdf_aea,pch=16, xlim = c(-15000,12000), ylim = c(4705000,4755000))
plot(NS_aea, col = 'grey',add=T)
```

You can find more info about the other Spatial classes in my [GIS tutorial](http://remi-daigle.github.io/GIS_mapping_in_R/) I wrote before `sf` was created.

### The `sf` package

The `sf` package is an an R implementation of "Simple Features" (officially Simple Feature Access) which is both an Open Geospatial Consortium (OGC) and International Organization for Standardization (ISO) standard ISO 19125 that specifies a common storage and access model of mostly two-dimensional geometries (point, line, polygon, multi-point, multi-line, etc.) used by geographic information systems.

The bread and butter object classes in the `sf` package are `sf` (created by `st_sf()`) which are effectively a normal data frame with the addition of one (or more) simple features list column (`sfc`; created by `st_sfc()`). This `sfc` column, which is really just another column in your data frame, contains the geometric information that gets plotted on a map. This geometric information is stored as a simple feature which can take the form of polygon(s) (created by `st_polygon()` or `st_multipolygon()` ), line(s) (created by `st_st_linestring()` or `st_multist_linestring()` ), or point(s) (created by `st_point()` or `st_multipoint()` )

Let's start by turning our `larvalAbundanceLong` into an `sf` object and transform it into our desired projection in 1 command (reminder: `sf` is compatible with the tidyverse):

```{r}
larvalAbundanceLong_sf <- larvalAbundanceLong %>% 
    st_as_sf(coords = c("decimalLongitude", "decimalLatitude"),
             crs="+proj=longlat +datum=WGS84",
             remove=FALSE) %>% 
    st_transform(proj_aea)

ggplot()+
    geom_sf(data = larvalAbundanceLong_sf)
```
As you may have noticed, `geom_sf` does not *need* a defined `aes()` since is already knows that latitude is your y-axis and longitude is the x-axis. You can of course toggle `group`, `colour`, `size`, and `fill` in the `aes` based on other columns in your data frame.

Let's plot is with a basemap! `raster` only provides the GADM coastline data in `sp` format, but we can convert with `st_as_sf()` and then `filter()` to get just the Maritime provinces with `dplyr`
```{r}
MarProv_sf <- getData('GADM', country="CAN", level=1) %>% 
    st_as_sf() %>% 
    filter(NAME_1=="Nova Scotia"|
               NAME_1=="Prince Edward Island"|
               NAME_1=="New Brunswick")%>% 
    st_transform(proj_aea)

ggplot()+
    geom_sf(data = larvalAbundanceLong_sf)+
    geom_sf(data = MarProv_sf)

```

And to make things fun, let's have one zoomed in plot with the dots, and a zoomed out plot with a 'zoom box'
```{r, bounding box}
#define the corners (with a buffer)
larvalbbox <- st_buffer(larvalAbundanceLong_sf, dist = 10000) %>% 
    st_bbox()

#create a matrix that traces a line the whole way around your box
bbsfc <- cbind(
            c(larvalbbox[1],larvalbbox[3],larvalbbox[3],larvalbbox[1],larvalbbox[1]),
            c(larvalbbox[2],larvalbbox[2],larvalbbox[4],larvalbbox[4],larvalbbox[2])
        ) %>% 
    list() %>%                  # put that matrix in a list
    st_polygon() %>%            # put that list in a simple features polygon
    st_sfc(crs = proj_aea) %>%  # put that polygon into a simple features column and give projection info
    st_sf(name="Study Site", geometry=.)    # finally make it an sf object

# plot it!
p1 <- ggplot()+
    geom_sf(data = MarProv_sf)+
    geom_sf(data = bbsfc, size = 2, fill = "transparent")

p2 <- ggplot()+
    geom_sf(data = larvalAbundanceLong_sf)+
    geom_sf(data = MarProv_sf) +
    coord_sf(xlim = larvalbbox[c(1,3)],ylim = larvalbbox[c(2,4)])

grid.arrange(p1,p2)
```


Finally, let's bring facets back!

```{r}
# prepare long data for plotting
larvalAbundanceLongPlot <- larvalAbundanceLong_sf %>% 
    group_by(scientificName,decimalLongitude,decimalLatitude) %>% 
    summarize(mean_IndPerm3 = mean(larvalAbundanceIndPerm3)) %>% 
    ungroup() %>% 
    mutate(lat_aea = unlist(lapply(geometry,'[',2)),
           long_aea = unlist(lapply(geometry,'[',1)))


p3 <- ggplot()+
    geom_point(data = larvalAbundanceLongPlot, aes(y = lat_aea, x = long_aea, size=mean_IndPerm3))+
    geom_sf(data = MarProv_sf) +
    facet_wrap(~scientificName)+
    coord_sf(xlim = larvalbbox[c(1,3)],ylim = larvalbbox[c(2,4)]) 
p3
```
```{r, eval = FALSE}
# oh, and if you want to save a plot, pick your format, size, and resolution
ggsave(filename="speciesabundance.tiff", plot = p3, width = 20, height = 16.5, unit = "cm", dpi = 300)
ggsave(filename="speciesabundance.pdf", plot = p3, width = 20, height = 16.5, unit = "cm", dpi = 300)
ggsave(filename="speciesabundance.png", plot = p3, width = 20, height = 16.5, unit = "cm", dpi = 300)


```


# Important links

[R for Data Science](http://r4ds.had.co.nz/)
[R Graph Catalog](http://shiny.stat.ubc.ca/r-graph-catalog/)
A [GIS tutorial](http://remi-daigle.github.io/GIS_mapping_in_R/) written by Remi (mostly in `sp`)
Projection stuff: [proj.4](http://proj4.org/index.html) and [Projection Wizard](http://projectionwizard.org/)


# Challenge answers
> **Challenge 1 solution **
>```{r,eval=FALSE}
>site5 <- larvalAbundanceLong %>%
>    filter(locationID==5) %>%
>    select(-locationID,-decimalLatitude,-decimalLatitude)
>```
>**Note:** The order of operations is very important in this case. If we used 'select' first, filter would not be able to find the variable continent since we would have removed it in the previous step.


> **Challenge 2 solution **
>```{r}
>IndPerm3_byLocationID <- larvalAbundanceLong %>%
>    group_by(locationID) %>%
>    summarize(mean_IndPerm3 = mean(larvalAbundanceIndPerm3)) %>% 
>    filter(mean_IndPerm3 == min(mean_IndPerm3) | mean_IndPerm3 == max(mean_IndPerm3))
>```


[<<BACK](https://remi-daigle.github.io/2017-CHONe-Data/)
