In this lab, we will practice working with text using the stringr, tidytext, and tm packages.
See the “Instructions” section of the Introduction to Lab Assignments page for more information about the labs. That page also gives descriptions for the datasets we will be using.
Required reading:
Optional reading:
str_extract_all with unnest
ex1. In senator_tweet_sample, find the correlation between the number of “mentions” (cases where you see an “@” followed by non-whitespace characters) in Tweet texts and the number of favorites they receive.
HINT: you can just look up the right regex string in threads like this one. No need to learn regex in much detail - most people just do this.
# Your answer here
tdf <- senator_tweet_sample %>%
  mutate(num_mentions=str_count(text, "@[A-Za-z0-9_]+")) %>%
  select(num_mentions, favorite_count)

cor.test(tdf$num_mentions, tdf$favorite_count)
##
## Pearson's product-moment correlation
##
## data: tdf$num_mentions and tdf$favorite_count
## t = -0.58149, df = 988, p-value = 0.561
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08070917 0.04385943
## sample estimates:
## cor
## -0.01849665
From this result, we fail to reject the null hypothesis of zero correlation: the estimated correlation is close to zero and the p-value is large, so we have no evidence of a relationship between the number of mentions and the number of favorites.
ex2. In senator_tweet_sample, find the average number of “mentions” (cases where you see an “@” followed by non-whitespace characters) in Tweet texts by gender.
# first match congress twitter handles with gender
congress_merged <- congress %>%
  left_join(congress_contact, by='bioguide_id') %>%
  filter(twitter != '') %>% # remove those with no twitter accounts
  mutate(twitter=tolower(twitter)) %>% # lower-case twitter handles for matching
  select(twitter, gender)
congress_merged %>% head()
## twitter gender
## 1 sensherrodbrown M
## 2 senatorcantwell F
## 3 senatorcardin M
## 4 senatorcarper M
## 5 senbobcasey M
## 6 senfeinstein F
# match gender with twitter handles, count the number of mentions
tdf <- senator_tweet_sample %>%
  mutate(num_mentions=str_count(text, "(?<=@)[^\\s:]+")) %>% # this is the key regex
  mutate(screen_name=tolower(screen_name)) %>% # lower-case to fix matches
  select(screen_name, num_mentions) %>%
  left_join(congress_merged, by=c('screen_name'='twitter')) %>%
  drop_na(gender) # drop those who didn't have matching Twitter handles
tdf %>% head()
## # A tibble: 6 × 3
## screen_name num_mentions gender
## <chr> <int> <fct>
## 1 senjohnkennedy 1 M
## 2 senjohnkennedy 2 M
## 3 chrisvanhollen 0 M
## 4 chrisvanhollen 1 M
## 5 chrisvanhollen 1 M
## 6 chrisvanhollen 8 M
# average by user and then gender (accounts for unequal # tweets for congress members)
tdf %>%
  group_by(gender, screen_name) %>%
  summarize(av_mentions = mean(num_mentions)) %>%
  group_by(gender) %>%
  summarize(av_mentions_gender = mean(av_mentions))
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
## # A tibble: 2 × 2
## gender av_mentions_gender
## <fct> <dbl>
## 1 F 0.525
## 2 M 0.691
1. Create a regular expression which matches a URL in the example string ex, and verify that it works using str_view_all (described in R for Data Science Ch. 14). The output should show both URLs highlighted. Now do the same for hashtags: strings that consist of a “#” symbol followed by any letters, numbers, or underscores, ignoring capitalization.
Hint: You should not need to learn regex in any detail to complete this problem. These are common tasks in cleaning and analyzing Tweet/text data, so doing some online research (e.g. a Google search) might save you a lot of time.
Hint: be wary of how R specifically interprets regex strings (for example, backslashes must be doubled inside R string literals). It might be helpful to look for regex strings specifically written for R.
# Your answer here
ex <- "BREAKING NEWS - @brumleader urges everyone to do their bit in order to tackle the threat posed by rising coronavirus case numbers in city. Full statement here:\n\nhttps://t.co/3tbc6xcRFP\n\n#KeepBrumSafe\n#Btogether\n#COVID19\n#Coronavirus https://t.co/mo5bPUgGgC"
# your solution here
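One possible sketch for the exercise above (the exact patterns are assumptions; simpler or stricter regexes would also be acceptable):

# URLs: "http" or "https", then "://", then any non-whitespace characters
url_pattern <- "https?://[^\\s]+"
str_view_all(ex, url_pattern)

# hashtags: "#" followed by letters, digits, or underscores (covers upper and lower case)
hashtag_pattern <- "#[A-Za-z0-9_]+"
str_view_all(ex, hashtag_pattern)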
2. Add two new columns to the senator_tweet_sample dataframe: n_link should include the number of URLs in the Tweet text, and n_ht should be the number of hashtags. Then, create a linear model using lm to predict retweet_count from n_link and n_ht. Show the model summary. Was either of these predictors statistically significant?
HINT: see the str_count function.
# your solution here
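As a starting point, a sketch along these lines should work (it reuses the URL and hashtag patterns assumed in question 1):

senator_tweet_sample <- senator_tweet_sample %>%
  mutate(n_link = str_count(text, "https?://[^\\s]+"),  # number of URLs per Tweet
         n_ht   = str_count(text, "#[A-Za-z0-9_]+"))    # number of hashtags per Tweet

# linear model predicting retweets from link and hashtag counts
m <- lm(retweet_count ~ n_link + n_ht, data = senator_tweet_sample)
summary(m)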
written response here
3. Using stringr and dplyr (not tm or tidytext), produce a dataframe consisting of the 5 most used hashtags in our Tweets with the number of times they were used. If there is more than one tied for 5th place, you can ignore them - just choose 5 total (i.e. you could just use head(5)).
HINT: try using str_extract_all in conjunction with unnest (not unnest_tokens) to extract the hashtags. This Stack Overflow solution may be helpful.
# your solution here
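A sketch of the approach the hint describes (the hashtag pattern is the same assumption as above):

senator_tweet_sample %>%
  mutate(hashtag = str_extract_all(text, "#[A-Za-z0-9_]+")) %>% # list-column of hashtags per Tweet
  unnest(hashtag) %>%              # one row per extracted hashtag
  count(hashtag, sort = TRUE) %>%  # tally hashtags, most frequent first
  head(5)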
4. Create a new column in senator_tweet_sample called cleaned which includes the original Tweets with hashtags and links removed. We will use this column for the remaining questions.
HINT: see the gsub or str_replace_all functions.
# your solution here
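One way to sketch this with str_replace_all, using the same assumed URL and hashtag patterns:

senator_tweet_sample <- senator_tweet_sample %>%
  mutate(cleaned = str_replace_all(text, "https?://[^\\s]+", ""),   # strip URLs
         cleaned = str_replace_all(cleaned, "#[A-Za-z0-9_]+", ""),  # strip hashtags
         cleaned = str_squish(cleaned))                             # clean up leftover whitespace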
5. Using tidytext, produce a dataframe showing the ten most common words in the Tweets after URLs and hashtags have been removed (use our new column cleaned). Then show the most common words excluding stopwords.
Hint: look at the tidytext docs for unnest_tokens.
# your solution here
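A possible sketch using unnest_tokens and tidytext's built-in stop_words table:

library(tidytext)

tokens <- senator_tweet_sample %>%
  unnest_tokens(word, cleaned)   # one row per word, lower-cased by default

# ten most common words overall
tokens %>% count(word, sort = TRUE) %>% head(10)

# ten most common words after removing stopwords
tokens %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  head(10)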
6. Using tm, create a document-term matrix from our cleaned text data. We will discuss what to do with a dtm in the next lab.
HINT: you might want to check out the cast_dtm function.
# your solution here
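A minimal sketch, assuming the dataframe has a unique Tweet identifier column (here called status_id; substitute whatever ID column your data actually has):

library(tm)

dtm <- senator_tweet_sample %>%
  unnest_tokens(word, cleaned) %>%
  count(status_id, word) %>%                               # word counts per Tweet
  cast_dtm(document = status_id, term = word, value = n)   # tidytext helper that builds a tm DocumentTermMatrix
dtm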
7. How could you potentially use text analysis in the final project you have been working on? (You don’t necessarily need to do it for the project, just think hypothetically).
response
8. Last week you proposed some datasets that you might be able to use for our final projects in the class. If you haven’t yet, try to download or otherwise get access to the dataset so you can start playing with it. Either way, what did you find? Did your data have the information you needed after all? Was it as easy to access as you expected? Even if you’re not able to get all the data by now, write something about your plan for getting access to the data.
response