To begin, we wish to credit Julia Silge for much of the material presented here. Her post on tidytext
data analysis and the code therein inspired most of this portion of the workshop.
Let’s begin by downloading two documents written by our dear Dr. Paul Snelgrove. The first is his book entitled ‘Discoveries of the Census of Marine Life: Making Ocean Life Count’ and the second is his most cited paper (n = 886 according to Google Scholar), ‘Getting to the Bottom of Marine Biodiversity: Sedimentary Habitats: Ocean bottoms are the most widespread habitat on Earth and support high biodiversity and key ecosystem services’.
# Download file from the web
download.file('http://www.cambridge.org/download_file/153663','Snelgrove_Text_Only.pdf', mode = 'wb')
download.file('https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bioscience/49/2/10.2307/1313538/2/49-2-129.pdf?Expires=1493740003&Signature=Jp0aTSea-mCne3vgy6tE2JYz58l-K4iWPF3gnd8r-GagpywgUq8U9WcImKsSa~bDZOj5mY3216xNoqSDWXGLBdumI-WIRvUrHFKj1AetoS-Rsyup0NVO9nWE0te8dsIYDWEKkUXvu7-9xdHnmpa5QSNUlE8kM8V~4B68mdJR7W0eE-at~GH4p7IrPRDeAN9n8U~I2Kd-s~KkYgAV6ASxvzXhK4sLUaja~xs3n5eXdhTrcaJINFOxk2~2nU17jDgXM6PUw3W5Epvdywz4s6~eU4IyWCst4sokZAV3OczO7NiagYdKq3foTi~Y-EP~c0uvPr1gRFoxT6itFNGSSZCk6A__&Key-Pair-Id=APKAIUCZBIA4LVPAVW3Q/', '49-2-129.pdf', mode = 'wb')
The package pdftools, available through rOpenSci, allows you to read the content of a PDF rather easily. See the package details for more functions.
# Read pdf file
book <- pdf_text("Snelgrove_Text_Only.pdf")
paper <- pdf_text("49-2-129.pdf")
book[1]
[1] " Discoveries of the Census of Marine Life:\n Making Ocean Life Count\nOver the 10-year course of the recently completed Census of Marine Life,\na global network of researchers in more than 80 nations has collaborated\nto improve our understanding of marine biodiversity – past, present, and\nfuture.\n Providing insight into this remarkable project, this book explains\nthe rationale behind the Census and highlights some of its most important\nand dramatic findings, illustrated with full-color photographs throughout.\nIt explores how new technologies and partnerships have contributed to\ngreater knowledge of marine life, from unknown species and habitats, to\nmigration routes and distribution patterns, and to a better appreciation of\nhow the oceans are changing. Looking to the future, it identifies-what\nneeds to be done to close the remaining gaps in our knowledge, and\nprovides information that will enable us to manage resources more\neffectively, conserve diversity, reverse habitat losses, and respond to\nglobal climate change.\n PAUL SNELGROVE is a Professor in Memorial University of\nNewfoundland’s Ocean Sciences Centre and Biology Department. He\nchaired the Synthesis Group of the Census of Marine Life that has\noverseen the final phase of the program. He is now Director of the\nNSERC Canadian Healthy Oceans Network, a research collaboration of\n"
length(book)
[1] 398
paper[1]
[1] " Getting to the Bottom of Marine\n Biodiversity: Sedimentary Habitats\n Oceanbottomsare the most widespreadhabitaton Earthand\n support high biodiversityand key ecosystemservices\n Paul V. R. Snelgrove\nT heoceansencompasshabitats I\n Living in marine sediments\n ranging from highly produc-\n tive coastal regions to lightless, Estimates of total Organisms that live in marine sedi-\n ments face numerous challenges.\nhigh-pressure, and low-temperature\ndeep-sea environments. The benthic species numbers Except in the shallowest areas, where\n there is sufficient light to allow pho-\n(bottom-living) species that reside suggest that less\nwithin the sediments in these habi- tosynthesis at the bottom, most sedi-\n mentary organisms are dependent on\ntats form one of the richest species than 1 % of marine phytoplankton and other organic\npools in the oceans and perhaps on\nEarth. Even though 70.8% of the benthic species are material sinking down from surface\nearth is covered by oceans, and most waters above. The spatial decoupling\nocean floor is covered by sediments, presently known of production from most marine\n benthic environments makes these\nthere is still much to learn about\n environments fundamentally differ-\nbiodiversity in marine sediments. The\n terns are thought to exist, and why ent from those of terrestrial (Wall\nmajor reasons for the gaps in knowl-\n we should care. Further discussions and Moore 1999) and freshwater\nedge are logistics and effort. Ap-\n of marine biodiversity (NRC 1995), (Covich et al. 1999) benthos. With\nproximately 65.5% of the planet is\ncovered by ocean that is greater than and biodiversity in marine sediments increasing water depth, the amount\n130 m in depth (i.e., the approxi- in particular (Snelgrove et al. 1997), of material reaching the bottom de-\nmate depth limit of the continental may be found elsewhere. 
creases; most deep-sea sedimentary\n The oceans harbor tremendous environments are thought to be food\nshelf) and is accessible only by sub-\nmersibles or remote-sampling gear. biological diversity. Of the 29 limited.\nEven the remaining shallow areas nonsymbiont animal phyla that have To take advantage of whatever\n been described so far, all but one has food is present, some organisms (sus-\n(i.e., approximately 5% of the earth's\nsurface) present challenges in terms living representatives in the ocean, pension feeders) are able to remove\nof ship availability and cost, as well and 13 are represented only in the suspended particles from near-bot-\nas loss of experiments and ship time oceans; all of these phyla have repre- tom water; others (deposit feeders)\nto weather. sentatives in the benthos, and most rely on particles that have settled\n have representatives in marine sedi- onto the bottom. Some mega- and\n Despite these logistical difficul- ments. Most of the species diversity macrofaunal species suspension feed,\nties, it is important to improve our in marine ecosystems consists of in-\nunderstanding of biodiversity in many deposit feed, and a few\nmarine sediments. In this article, I vertebrates residing in (infauna) and macrofaunal species do both. Meio-\ndescribe the biodiversity of organ- on (epifauna) sediments. These in- fauna and microbiota depend on de-\nisms residing in the marine sedimen- vertebrates include large animals posited organic material. The mobil-\ntary environment, the patterns that (megafauna), such as scallops and ity of many benthic organisms is\nhave been observed, why these pat- crabs, that can be identified from relatively limited; many are sessile,\n bottom photographs. However, most and others have only limited mobil-\nPaulV. R. Snelgrove(psnelgro@gill.ifmt. species are polychaetes, crustaceans, ity within sediments. 
As a result, many\n mollusks (macrofauna, larger than benthic species rely completely on the\nnf.ca) is an associate chair of Fisheries\nConservationin the Fisheriesand Marine 300 gim), and tiny crustaceans and water above them to supply food.\nInstitute, Memorial University of New- nematodes (meiofauna, 44-300 gim). Water also supplies oxygen, a ba-\nfoundland, Box 4920, St. John's, New- In addition, there are the poorly known sic requirement for most organisms\nfoundland, Canada AiC 5R3. ? 1999 microbiota (smaller than 44 ,um), residing in sediments. As organisms\nAmericanInstituteof BiologicalSciences. which include bacteria and protists. respire and use up oxygen, sediments\nFebruary 1999 129\n"
length(paper)
[1] 10
As you can see, the result is coerced into a character vector with one string per page: 398 pages for the book and 10 for the paper, with lines separated by ‘\n’. If you look at the actual PDFs, you will also notice that tables are not imported into R by the pdf_text
function. Extracting data embedded in PDF tables can be highly useful; if you wish to do so, take a look at the package tabulizer
, which we will not cover in this workshop.
Now let’s tidy up the text to make it more easily usable for further analyses.
# Divide strings per line using '\n' as a separator
book <- str_split(book, '\n')
paper <- str_split(paper, '\n')
book[[1]][1:10]
[1] " Discoveries of the Census of Marine Life:"
[2] " Making Ocean Life Count"
[3] "Over the 10-year course of the recently completed Census of Marine Life,"
[4] "a global network of researchers in more than 80 nations has collaborated"
[5] "to improve our understanding of marine biodiversity – past, present, and"
[6] "future."
[7] " Providing insight into this remarkable project, this book explains"
[8] "the rationale behind the Census and highlights some of its most important"
[9] "and dramatic findings, illustrated with full-color photographs throughout."
[10] "It explores how new technologies and partnerships have contributed to"
# Trim whitespaces at the beginning and end of lines
book <- lapply(X = book, FUN = str_trim, side = 'both')
paper <- lapply(X = paper, FUN = str_trim, side = 'both')
book[[1]][1:10]
[1] "Discoveries of the Census of Marine Life:"
[2] "Making Ocean Life Count"
[3] "Over the 10-year course of the recently completed Census of Marine Life,"
[4] "a global network of researchers in more than 80 nations has collaborated"
[5] "to improve our understanding of marine biodiversity – past, present, and"
[6] "future."
[7] "Providing insight into this remarkable project, this book explains"
[8] "the rationale behind the Census and highlights some of its most important"
[9] "and dramatic findings, illustrated with full-color photographs throughout."
[10] "It explores how new technologies and partnerships have contributed to"
# Transform as a matrix
bookMat <- matrix(nrow = 0, ncol = 3, dimnames = list(c(), c('text','page','document')))
for(i in 1:length(book)) {
bk <- cbind(book[[i]], rep(i, length(book[[i]])), 'Discoveries of the Census of Marine Life')
bookMat <- rbind(bookMat, bk)
}
paperMat <- matrix(nrow = 0, ncol = 3, dimnames = list(c(), c('text','page','document')))
for(i in 1:length(paper)) {
bk <- cbind(paper[[i]], rep(i, length(paper[[i]])), 'Getting to the Bottom of Marine Biodiversity')
paperMat <- rbind(paperMat, bk)
}
kable(bookMat[1:10, ])
text | page | document |
---|---|---|
Discoveries of the Census of Marine Life: | 1 | Discoveries of the Census of Marine Life |
Making Ocean Life Count | 1 | Discoveries of the Census of Marine Life |
Over the 10-year course of the recently completed Census of Marine Life, | 1 | Discoveries of the Census of Marine Life |
a global network of researchers in more than 80 nations has collaborated | 1 | Discoveries of the Census of Marine Life |
to improve our understanding of marine biodiversity – past, present, and | 1 | Discoveries of the Census of Marine Life |
future. | 1 | Discoveries of the Census of Marine Life |
Providing insight into this remarkable project, this book explains | 1 | Discoveries of the Census of Marine Life |
the rationale behind the Census and highlights some of its most important | 1 | Discoveries of the Census of Marine Life |
and dramatic findings, illustrated with full-color photographs throughout. | 1 | Discoveries of the Census of Marine Life |
It explores how new technologies and partnerships have contributed to | 1 | Discoveries of the Census of Marine Life |
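Growing a matrix with rbind() inside a loop works, but it copies the whole matrix at every iteration and becomes slow on long documents. A base-R sketch of an equivalent one-pass approach, using a toy two-page list standing in for the real book object:

```r
# Toy stand-in for the list of page-wise line vectors produced by str_split()
book <- list(c("line one", "line two"), c("line three"))

# Build one small matrix per page, then bind them all at once
pages <- lapply(seq_along(book), function(i) {
  cbind(text = book[[i]],
        page = i,
        document = 'Discoveries of the Census of Marine Life')
})
bookMat <- do.call(rbind, pages)
```

The result has the same three columns ('text', 'page', 'document') as the loop version, with one row per line of text.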
# Remove empty strings
bookMat[bookMat[,'text'] == '', 'text'] <- NA
bookMat <- na.omit(bookMat)
paperMat[paperMat[,'text'] == '', 'text'] <- NA
paperMat <- na.omit(paperMat)
# Convert to data.frame
bookMat <- as.data.frame(bookMat)
bookMat[, 'text'] <- as.character(bookMat[, 'text'])
bookMat[, 'page'] <- as.numeric(paste(bookMat[, 'page']))
bookMat[, 'document'] <- as.character(paste(bookMat[, 'document']))
kable(bookMat[1:10, ])
text | page | document |
---|---|---|
Discoveries of the Census of Marine Life: | 1 | Discoveries of the Census of Marine Life |
Making Ocean Life Count | 1 | Discoveries of the Census of Marine Life |
Over the 10-year course of the recently completed Census of Marine Life, | 1 | Discoveries of the Census of Marine Life |
a global network of researchers in more than 80 nations has collaborated | 1 | Discoveries of the Census of Marine Life |
to improve our understanding of marine biodiversity – past, present, and | 1 | Discoveries of the Census of Marine Life |
future. | 1 | Discoveries of the Census of Marine Life |
Providing insight into this remarkable project, this book explains | 1 | Discoveries of the Census of Marine Life |
the rationale behind the Census and highlights some of its most important | 1 | Discoveries of the Census of Marine Life |
and dramatic findings, illustrated with full-color photographs throughout. | 1 | Discoveries of the Census of Marine Life |
It explores how new technologies and partnerships have contributed to | 1 | Discoveries of the Census of Marine Life |
paperMat <- as.data.frame(paperMat)
paperMat[, 'text'] <- as.character(paperMat[, 'text'])
paperMat[, 'page'] <- as.numeric(paste(paperMat[, 'page']))
paperMat[, 'document'] <- as.character(paste(paperMat[, 'document']))
kable(paperMat[1:10, ])
text | page | document |
---|---|---|
Getting to the Bottom of Marine | 1 | Getting to the Bottom of Marine Biodiversity |
Biodiversity: Sedimentary Habitats | 1 | Getting to the Bottom of Marine Biodiversity |
Oceanbottomsare the most widespreadhabitaton Earthand | 1 | Getting to the Bottom of Marine Biodiversity |
support high biodiversityand key ecosystemservices | 1 | Getting to the Bottom of Marine Biodiversity |
Paul V. R. Snelgrove | 1 | Getting to the Bottom of Marine Biodiversity |
T heoceansencompasshabitats I | 1 | Getting to the Bottom of Marine Biodiversity |
Living in marine sediments | 1 | Getting to the Bottom of Marine Biodiversity |
ranging from highly produc- | 1 | Getting to the Bottom of Marine Biodiversity |
tive coastal regions to lightless, Estimates of total Organisms that live in marine sedi- | 1 | Getting to the Bottom of Marine Biodiversity |
ments face numerous challenges. | 1 | Getting to the Bottom of Marine Biodiversity |
# Bind as a single data.frame
paul <- rbind(bookMat, paperMat)
The resulting data for the paper are not perfect (e.g. hyphenated words are not rejoined), but they will serve our purposes!
Now that we have our documents as an R data frame, we can start playing around with them. Multiple packages allow you to perform text analyses. We will focus on the package tidytext
as it aligns with the tidyverse, but the package tm
is another important package for text analysis in R. Each package uses certain object classes, but objects can be converted quite easily between packages. See this vignette for a description of the steps and functions used to achieve this.
We will first transform the data into a tibble class object for use in tidytext
# Convert to tibble class object for use in tidytext
paul <- as_tibble(paul)
paul
The book can easily be grouped by certain criteria. For example, we will start by grouping the data according to the part of the book to which they belong; there are a total of three parts to this book. Unfortunately, this particular process does not work for our PDF documents, but the code does work and we might use it again elsewhere, let’s say for an exercise (cough cough).
paulDoc <- paul %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^part [\\divxlc]", ignore_case = TRUE)))) %>%
ungroup()
paulDoc
unique(paulDoc[, 'chapter'])
This approach is not useful with these documents because they are not divided into chapters and it does not differentiate between documents, yet the regex code can still be useful to divide your text however you see fit.
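To see what the regex itself does, here is a minimal base-R sketch (using grepl() in place of str_detect(), on a few invented lines): lines beginning with “part” followed by a roman numeral or digit flip a flag, and the running sum assigns every line to a part.

```r
lines <- c("PART I", "some text", "part ii", "more text", "Part III")

# TRUE wherever a line starts with 'part' plus a digit or roman numeral
is_part <- grepl("^part [[:digit:]ivxlc]", lines, ignore.case = TRUE)

# Running count: each line inherits the number of the last part heading seen
chapter <- cumsum(is_part)
chapter
```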
A frequent analysis performed on text is to evaluate the frequency of the words it uses. This can be done by first unnesting the individual words from the text with the unnest_tokens
function and then counting them with the count
function.
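Conceptually, unnest_tokens() lowercases and splits each line into words and count() tallies them. A base-R sketch of the same idea, on a toy pair of lines (the crude tokenizer here is an assumption, not tidytext’s actual tokenizer):

```r
lines <- c("Marine life counts", "marine biodiversity matters")

# Crude tokenizer: lowercase, then split on anything that is not a letter
words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
words <- words[words != ""]

# Tally word frequencies, most frequent first
wordCount <- sort(table(words), decreasing = TRUE)
wordCount
```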
# Count per word
paulWord <- paul %>%
unnest_tokens(word, text) %>% # divide by word
count(word, sort = TRUE) # counts the word frequency
paulWord
# Total number of words
totalWords <- paulWord %>% summarize(total = sum(n))
totalWords
You could also apply this directly to the data grouped by document to obtain a word count per document.
# Count per word per document
paulWordDocs <- paul %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text) %>%
count(document, word, sort = TRUE) %>%
ungroup()
# Attach total number of words per document
totalWords <- paulWordDocs %>% group_by(document) %>% summarize(total = sum(n))
totalWords
paulWordDocs <- left_join(paulWordDocs, totalWords)
Joining, by = "document"
paulWordDocs
ggplot(paulWordDocs, aes(n/total, fill = document)) +
geom_histogram(show.legend = FALSE) +
xlim(NA, 0.0009) +
facet_wrap(~document, ncol = 2, scales = "free_y")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 290 rows containing non-finite values (stat_bin).
Words like ‘the’, ‘and’, and ‘of’ are by far the most common. These are referred to as stop words and are usually removed before text analysis. A list of such words is available in the stop_words
dataset and they can be removed from your dataset using the anti_join
function.
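The effect of anti_join(stop_words) can be illustrated in base R with %in%, using a toy word vector and a tiny stand-in stop list (the real stop_words dataset contains over a thousand entries):

```r
words <- c("the", "oceans", "and", "the", "benthos")
stopList <- c("the", "and", "of")  # tiny stand-in for tidytext::stop_words

# Keep only the words that are NOT in the stop list
kept <- words[!words %in% stopList]
kept
```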
# Count per word after stop words removal
data("stop_words")
paulWordDocs <- paul %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>% # remove stop words
count(document, word, sort = TRUE) %>%
ungroup()
Joining, by = "word"
totalWords <- paulWordDocs %>% group_by(document) %>% summarize(total = sum(n))
paulWordDocs <- left_join(paulWordDocs, totalWords)
Joining, by = "document"
paulWordDocs
ggplot(paulWordDocs, aes(n/total, fill = document)) +
geom_histogram(show.legend = FALSE) +
xlim(NA, 0.0009) +
facet_wrap(~document, ncol = 2, scales = "free_y")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 326 rows containing non-finite values (stat_bin).
Warning: Removed 2 rows containing missing values (geom_bar).
The frequency of words could also be compared between documents.
paulBook <- as_tibble(bookMat)
paulPaper <- as_tibble(paperMat)
paulBookWord <- paulBook %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>% # remove stop words
count(word, sort = TRUE) %>%
ungroup()
Joining, by = "word"
paulPaperWord <- paulPaper %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>% # remove stop words
count(word, sort = TRUE) %>%
ungroup()
Joining, by = "word"
frequency <- paulBookWord %>%
rename(Book = n) %>%
inner_join(paulPaperWord) %>%
rename(Paper = n) %>%
mutate(Book = Book / sum(Book),
Paper = Paper / sum(Paper)) %>%
ungroup()
Joining, by = "word"
ggplot(frequency, aes(x = Book, y = Paper, color = abs(Paper - Book))) +
geom_abline(color = "gray40") +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.4, height = 0.4) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
theme_minimal(base_size = 14) +
theme(legend.position="none") +
labs(title = "Comparing Word Frequencies",
subtitle = "Word frequencies in Paul Snelgroves's book and paper",
y = "Getting to the Bottom of Marine Biodiversity", x = "Discoveries of the Census of Marine Life")
tidytext
also gives you the opportunity to perform a cursory sentiment analysis, i.e. evaluating whether the text is more positive or negative, using the sentiments
dataset. While it may not be as useful to qualify science as positive or negative, it may reveal some insights into the author’s overall writing style (we are looking at you, Dr. Snelgrove).
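At its core this sentiment analysis is a dictionary lookup: each word is matched against a lexicon of positive and negative words, and the score is the count of positive matches minus the count of negative ones. A base-R sketch with an invented mini-lexicon (the real bing lexicon has thousands of entries):

```r
# Invented mini-lexicon: word -> sentiment label
lexicon <- c(rich = "positive", loss = "negative",
             healthy = "positive", gaps = "negative")

words <- c("rich", "loss", "loss", "healthy", "unknown")

# Look up only the words present in the lexicon ("unknown" is dropped)
matched <- lexicon[words[words %in% names(lexicon)]]

# Net sentiment: positive matches minus negative matches
score <- sum(matched == "positive") - sum(matched == "negative")
score
```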
# Gather list of sentiments from tidytext
bing <- sentiments %>%
filter(lexicon == "bing") %>%
dplyr::select(-score)
bing
paulSentiment <- paul %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>% # remove stop words
inner_join(bing) %>% # join with sentiment dataset
count(document, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
Joining, by = "word"
Joining, by = "word"
paulSentiment
# plot
ggplot(paulSentiment, aes(index, sentiment, fill = document)) +
geom_bar(stat = "identity", show.legend = FALSE) +
facet_wrap(~document, ncol = 2, scales = "free_x") +
theme_minimal(base_size = 13) +
labs(title = "Sentiment in Paul Snelgrove's writing",
y = "Sentiment") +
scale_fill_viridis(end = 0.75, discrete=TRUE, direction = -1) +
scale_x_discrete(expand=c(0.02,0)) +
theme(strip.text=element_text(hjust=0)) +
theme(strip.text = element_text(face = "italic")) +
theme(axis.title.x=element_blank()) +
theme(axis.ticks.x=element_blank()) +
theme(axis.text.x=element_blank())
# Most common positive and negative words
paulSentimentCount <- paul %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>% # remove stop words
inner_join(bing) %>% # join with sentiment dataset
count(document, word, sentiment, sort = TRUE) %>%
ungroup()
Joining, by = "word"
Joining, by = "word"
# Contribution to sentiment
paulSentimentCount %>%
filter(n > 10) %>%
mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Contribution to sentiment")
Words are not the only units of text that can be extracted using the unnest_tokens
function. Look at the package vignette for more information!
paulSentences <- paul %>%
group_by(document) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
ungroup()
paulSentences$sentence[1]
[1] "discoveries of the census of marine life: making ocean life count over the 10-year course of the recently completed census of marine life, a global network of researchers in more than 80 nations has collaborated to improve our understanding of marine biodiversity – past, present, and future."
You can also identify which words are most often used together with the function pairwise_count
from the package widyr
paulWord <- paul %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
paulWordOcc <- pairwise_count(paulWord, word, linenumber, sort = TRUE)
set.seed(1813)
paulWordOcc %>%
filter(n >= 25) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
geom_node_point(color = "darkslategray4", size = 5) +
geom_node_text(aes(label = name), vjust = 1.8) +
ggtitle(expression(paste("Word Network in Paul Snelgrove's book ",
italic("Census of Marine Life")))) +
theme_void()
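Under the hood, pairwise_count() tallies, for each pair of words, the number of groups (here, lines) in which both appear. A base-R sketch of the same idea on a toy two-line corpus:

```r
# Toy corpus: each element is the set of words on one line
lines <- list(c("marine", "life"),
              c("marine", "biodiversity", "life"))

# For each line, enumerate every unordered pair of distinct words
pairs <- do.call(rbind, lapply(lines, function(w) {
  t(combn(sort(unique(w)), 2))
}))

# Count how many lines each pair co-occurs in
pairCount <- table(paste(pairs[, 1], pairs[, 2]))
pairCount
```

Here “life” and “marine” co-occur on both lines, so their pair count is 2, while the pairs involving “biodiversity” occur only once.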
netD3 <- paulWordOcc %>%
filter(n >= 25) %>%
graph_from_data_frame()
netD3 <- networkD3::igraph_to_networkD3(netD3, group = rep(1, vcount(netD3)), what = 'both')
networkD3::forceNetwork(Links = netD3$links,
Nodes = netD3$nodes,
Source = 'source',
Target = 'target',
NodeID = 'name',
Group = 'group',
zoom = TRUE,
linkDistance = 50,
fontSize = 12,
opacity = 0.9,
charge = -10)
Another neat visual tool available in R is the ability to produce custom word clouds (wordles) based on the results of your analyses using the package wordcloud2
paulWord <- paul %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>% # remove stop words
count(document, word, sort = TRUE) %>% # counts the word frequency
ungroup() %>%
filter(n >= 20)
Joining, by = "word"
paulWordDF <- as.data.frame(paulWord[, c('word','n')])
# Basic wordle
wordcloud2::wordcloud2(paulWordDF, size = 1, color="random-light", backgroundColor=1)
# # Word shaped wordle
# wordcloud2::letterCloud(paulWordDF, word = "Paul")
#
# # Image shaped wordle
# wordcloud2::wordcloud2(paulWordDF, figPath = "./CoML_icon.png", size = 1.5)
#
# wordcloud2::wordcloud2(paulWordDF, figPath = "./CHONe.jpg", size = 1.5)