In this lab, we will practice working with text using the stringr, tidytext, and tm packages.
See the “Instructions” section of the Introduction to Lab Assignments page for more information about the labs. That page also gives descriptions for the datasets we will be using.
Required reading:
Optional reading:
str_extract_all with unnest
ex1. In senator_tweet_sample, find the correlation between the number of “mentions” (cases where you see an “@” followed by non-whitespace characters) in Tweet texts and the number of favorites they receive.
HINT: you can just look up the right regex string in threads like this one. No need to learn regex in much detail - most people just do this.
# Your answer here
tdf <- senator_tweet_sample %>%
  mutate(num_mentions=str_count(text, "@[A-Za-z0-9_]+")) %>%
  select(num_mentions, favorite_count)

cor.test(tdf$num_mentions, tdf$favorite_count)
##
## Pearson's product-moment correlation
##
## data: tdf$num_mentions and tdf$favorite_count
## t = -0.58149, df = 988, p-value = 0.561
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08070917 0.04385943
## sample estimates:
## cor
## -0.01849665
From this result, we fail to reject the null hypothesis of zero correlation: the estimated correlation is close to zero and the p-value is large, so we have no evidence of a relationship between the number of mentions and the number of favorites.
ex2. In senator_tweet_sample, find the average number of “mentions” (cases where you see an “@” followed by non-whitespace characters) in Tweet texts by gender.
# first match congress twitter handles with gender
congress_merged <- congress %>%
  left_join(congress_contact, by='bioguide_id') %>%
  filter(twitter != '') %>% # remove those with no twitter accounts
  mutate(twitter=tolower(twitter)) %>% # lower-case twitter handles for matching
  select(twitter, gender)
congress_merged %>% head()
## twitter gender
## 1 sensherrodbrown M
## 2 senatorcantwell F
## 3 senatorcardin M
## 4 senatorcarper M
## 5 senbobcasey M
## 6 senfeinstein F
# match gender with twitter handles, count the number of mentions
tdf <- senator_tweet_sample %>%
  mutate(num_mentions=str_count(text, "(?<=@)[^\\s:]+")) %>% # this is the key regex
  mutate(screen_name=tolower(screen_name)) %>% # lower-case to fix matches
  select(screen_name, num_mentions) %>%
  left_join(congress_merged, by=c('screen_name'='twitter')) %>%
  drop_na(gender) # drop those who didn't have matching Twitter handles
tdf %>% head()
## # A tibble: 6 × 3
## screen_name num_mentions gender
## <chr> <int> <fct>
## 1 senjohnkennedy 1 M
## 2 senjohnkennedy 2 M
## 3 chrisvanhollen 0 M
## 4 chrisvanhollen 1 M
## 5 chrisvanhollen 1 M
## 6 chrisvanhollen 8 M
# average by user and then gender (accounts for unequal # tweets for congress members)
tdf %>%
  group_by(gender, screen_name) %>%
  summarize(av_mentions = mean(num_mentions)) %>%
  group_by(gender) %>%
  summarize(av_mentions_gender = mean(av_mentions))
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
## # A tibble: 2 × 2
## gender av_mentions_gender
## <fct> <dbl>
## 1 F 0.525
## 2 M 0.691
1. Create a regular expression which matches a URL in the example string ex, and verify that it works using str_view_all (described in R for Data Science Ch. 14). The output should show both URLs highlighted. Now do the same for hashtags: strings that consist of a “#” symbol followed by any letters, numbers, or underscores, ignoring capitalization.
Hint: You should not need to learn regex in any detail to complete this problem. These are common tasks in cleaning and analyzing Tweet/text data, so doing some online research (e.g. a Google search) might save you a lot of time.
Hint: be wary of how R specifically interprets regex strings (for example, backslashes must be doubled inside R string literals). It might be helpful to look for regex strings specifically written for R.
# Your answer here
ex <- "BREAKING NEWS - @brumleader urges everyone to do their bit in order to tackle the threat posed by rising coronavirus case numbers in city. Full statement here:\n\nhttps://t.co/3tbc6xcRFP\n\n#KeepBrumSafe\n#Btogether\n#COVID19\n#Coronavirus https://t.co/mo5bPUgGgC"
# your solution here
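One possible sketch for the exercise above (the exact patterns are assumptions; simpler or stricter regexes would also be acceptable):

# URLs: "http" or "https", then "://", then any non-whitespace characters
url_pattern <- "https?://[^\\s]+"
str_view_all(ex, url_pattern)

# hashtags: "#" followed by letters, digits, or underscores (covers upper and lower case)
hashtag_pattern <- "#[A-Za-z0-9_]+"
str_view_all(ex, hashtag_pattern)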
2. Add two new columns to the senator_tweet_sample dataframe: n_link should include the number of URLs in the Tweet text, and n_ht should be the number of hashtags. Then, create a linear model using lm to predict retweet_count from n_link and n_ht. Show the model summary. Was either of these predictors statistically significant?
HINT: see the str_count function.
# your solution here
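As a starting point, a sketch along these lines should work (it reuses the URL and hashtag patterns assumed in question 1):

senator_tweet_sample <- senator_tweet_sample %>%
  mutate(n_link = str_count(text, "https?://[^\\s]+"),  # number of URLs per Tweet
         n_ht   = str_count(text, "#[A-Za-z0-9_]+"))    # number of hashtags per Tweet

# linear model predicting retweets from link and hashtag counts
m <- lm(retweet_count ~ n_link + n_ht, data = senator_tweet_sample)
summary(m)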
written response here
3. Using stringr and dplyr (not tm or tidytext), produce a dataframe consisting of the 5 most used hashtags in our Tweets with the number of times they were used. If there is more than one tied for 5th place, you can ignore them - just choose 5 total (i.e. you could just use head(5)).
HINT: try using str_extract_all in conjunction with unnest (not unnest_tokens) to extract the hashtags. This Stack Overflow solution may be helpful.
# your solution here
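A sketch of the approach the hint describes (the hashtag pattern is the same assumption as above):

senator_tweet_sample %>%
  mutate(hashtag = str_extract_all(text, "#[A-Za-z0-9_]+")) %>% # list-column of hashtags per Tweet
  unnest(hashtag) %>%              # one row per extracted hashtag
  count(hashtag, sort = TRUE) %>%  # tally hashtags, most frequent first
  head(5)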
4. Create a new column in senator_tweet_sample called cleaned which includes the original Tweets with hashtags and links removed. We will use this column for the remaining questions.
HINT: see the gsub or str_replace_all functions.
# your solution here
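One way to sketch this with str_replace_all, using the same assumed URL and hashtag patterns:

senator_tweet_sample <- senator_tweet_sample %>%
  mutate(cleaned = str_replace_all(text, "https?://[^\\s]+", ""),   # strip URLs
         cleaned = str_replace_all(cleaned, "#[A-Za-z0-9_]+", ""),  # strip hashtags
         cleaned = str_squish(cleaned))                             # clean up leftover whitespace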
5. Using tidytext, produce a dataframe showing the ten most common words in the Tweets after URLs and hashtags have been removed (use our new column cleaned). Then show the most common words excluding stopwords.
Hint: look at the tidytext docs for unnest_tokens.
# your solution here
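A possible sketch using unnest_tokens and tidytext's built-in stop_words table:

library(tidytext)

tokens <- senator_tweet_sample %>%
  unnest_tokens(word, cleaned)   # one row per word, lower-cased by default

# ten most common words overall
tokens %>% count(word, sort = TRUE) %>% head(10)

# ten most common words after removing stopwords
tokens %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  head(10)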
6. Using tm, create a document-term matrix from our cleaned text data. We will discuss what to do with a dtm in the next lab.
HINT: you might want to check out the cast_dtm function.
# your solution here
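A minimal sketch, assuming the dataframe has a unique Tweet identifier column (here called status_id; substitute whatever ID column your data actually has):

library(tm)

dtm <- senator_tweet_sample %>%
  unnest_tokens(word, cleaned) %>%
  count(status_id, word) %>%                               # word counts per Tweet
  cast_dtm(document = status_id, term = word, value = n)   # tidytext helper that builds a tm DocumentTermMatrix
dtm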
7. How could you potentially use text analysis in the final project you have been working on? (You don’t necessarily need to do it for the project, just think hypothetically).
response
8. Last week you proposed some datasets that you might be able to use for our final projects in the class. If you haven’t yet, try to download or otherwise get access to the dataset so you can start playing with it. Either way, what did you find? Did your data have the information you needed after all? Was it as easy to access as you expected? Even if you’re not able to get all the data by now, write something about your plan for getting access to the data.
response