Lab #8: Dictionary-Based Text Analysis

In this lab, we will practice working with dictionary-based methods for text analysis.

See the “Instructions” section of the Introduction to Lab Assignments page for more information about the labs. That page also gives descriptions for the datasets we will be using.

Required reading:

Text Mining with R: Chapter 2: Sentiment analysis with tidy data
Text Mining with R: Chapter 3: Analyzing word and document frequency: tf-idf

Optional reading:

From previous labs:

R for Data Science: Working with strings (Chapter 14)
Text Mining with R: Chapter 1: The tidy text format
stringr package docs
tidytext package docs
Test and develop regex expressions on regexr.com

Example Questions

ex1. In the same way we can use sentiment dictionaries like bing or afinn (see required reading on sentiment analysis), we can create our own dictionaries that allow us to detect features of the text we may be interested in. For this example, I will create a custom dictionary that can help us detect instances when Tweets mention or reference the concepts (i.e. different spellings or wordings that mean the same thing) “children” and “taxes”. To construct the dictionary, start by looking closely at our Tweets to see if you can find any alternative spellings or words used to mean the same thing, and assign that set of relevant words to each category. From this, you can create a dataframe assigning each word to one of those categories. Finally, use the dictionary to count the proportion of Tweets that discuss these topics by weekday vs weekend.

NOTE: Be sure to remove URLs, stopwords, and “mentions” before performing word counts (cases where you see an “@” followed by non-whitespace characters - see Lab 8 example problems).

ALSO NOTE: see the example dictionaries in the required reading on sentiment analysis to see what your custom dictionary dataframe should look like.

# start by looking at Tweets that include the word "children"
senator_tweet_sample %>% 
  filter(str_detect(text, 'tax')) %>% 
  select(text) %>% 
  head(2)

## # A tibble: 2 × 1
##   text                                                                          
##   <chr>                                                                         
## 1 Almost half of all MDers use the state and local tax deduction—nearly 1.3M in…
## 2 There’s a lot left out of the GOP’s tax plan, including real reform that woul…

# we see a bunch of different words that are used as synonyms: "children", "child", "youth", "childs"

# now try looking at Tweets that include the string "tax"
senator_tweet_sample %>% 
  filter(str_detect(text, 'tax')) %>% 
  select(text) %>% 
  head(2)

## # A tibble: 2 × 1
##   text                                                                          
##   <chr>                                                                         
## 1 Almost half of all MDers use the state and local tax deduction—nearly 1.3M in…
## 2 There’s a lot left out of the GOP’s tax plan, including real reform that woul…

# I only see "tax" or "taxes", but you may find others

# now we create our dictionary in the format of a dataframe (see the sentiment analysis section to see the format of dictionaries)
custom_topic_dict <- data.frame(
  word=c('child', 'children', 'childs', 'youth', 'youths', 'tax', 'taxes'), # list of words
  category=c('children', 'children', 'children', 'children', 'children', 'taxes', 'taxes') # list of categories associated with each word
)
custom_topic_dict

##       word category
## 1    child children
## 2 children children
## 3   childs children
## 4    youth children
## 5   youths children
## 6      tax    taxes
## 7    taxes    taxes

Now we’ll clean the text, tokenize it, and apply our new dictionary. The dictionary is applied using a left join.

# define regex patterns for removal (you did these in previous problems)
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
mention_pattern <- "@[A-Za-z0-9_]+"

# tokenize and count the appearance of each word
token_counts <- senator_tweet_sample %>% 
  mutate(is_weekend=wday(created_at, label=T) %in% c('Sat', 'Sun')) %>% 
  select(status_id, is_weekend, text) %>% 
  mutate(text = str_remove_all(text, url_pattern)) %>% # remove urls
  mutate(text = str_remove_all(text, mention_pattern)) %>% # remove mentions
  unnest_tokens("word", text, token = "tweets") %>%  # tokenize, preserving hashtags
  anti_join(stop_words) %>% # remove stopwords
  count(status_id, is_weekend, word) # count the words by weekday/weekend

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## Joining, by = "word"

token_counts %>% head()

## # A tibble: 6 × 4
##   status_id           is_weekend word          n
##   <chr>               <lgl>      <chr>     <int>
## 1 1001849567137161216 FALSE      #mtpol        1
## 2 1001849567137161216 FALSE      days          1
## 3 1001849567137161216 FALSE      start         1
## 4 1003702749161238529 FALSE      amp           3
## 5 1003702749161238529 FALSE      china         1
## 6 1003702749161238529 FALSE      communist     1

# apply the dictionary to the word count dataframe
token_cats <- token_counts %>% 
  left_join(custom_topic_dict, by='word') # match topics with words
  
# compute proportion of tweets that mention each category by weekend/weekday
token_cats %>% 
  # first we detect whether or not each tweet mentions one of our categoreis
  group_by(status_id, is_weekend) %>% 
  summarize(mentioned_taxes='taxes' %in% category, mentioned_children='children' %in% category) %>% 
  group_by(is_weekend) %>% 
  summarize(
    # show basic average differences
    av_taxes=mean(mentioned_taxes), 
    av_children=mean(mentioned_children),
  )

## `summarise()` has grouped output by 'status_id'. You can override using the
## `.groups` argument.

## # A tibble: 2 × 3
##   is_weekend av_taxes av_children
##   <lgl>         <dbl>       <dbl>
## 1 FALSE        0.0257      0.0213
## 2 TRUE         0.0211      0.0316

# we can see that Tweets published on weekdays are more likely to mention taxes and less likely to mention children...

ex2. After removing URLs, stopwords, and “mentions” (cases where you see an “@” followed by non-whitespace characters - see Lab 8 example problems) from the tweets in senator_tweet_sample, show the words that most uniquely appear in Tweets published by Republicans compard to other parties using TF-IDF.

# first match congress twitter handles with party
congress_merged <- congress %>% 
  left_join(congress_contact, by='bioguide_id') %>% 
  filter(twitter != '') %>% # remove those with no twitter accoutns
  mutate(twitter=tolower(twitter)) %>% # lower-case twitter handles for matching
  mutate(is_repub=party=='Republican') %>% 
  select(twitter, is_repub)
congress_merged %>% head()

##           twitter is_repub
## 1 sensherrodbrown    FALSE
## 2 senatorcantwell    FALSE
## 3   senatorcardin    FALSE
## 4   senatorcarper    FALSE
## 5     senbobcasey    FALSE
## 6    senfeinstein    FALSE

# match Tweet text with screen name and political party
tdf <- senator_tweet_sample %>% 
  mutate(text = str_remove(text, url_pattern)) %>% # remove urls
  mutate(text = str_remove(text, mention_pattern)) %>% # remove mentions
  mutate(screen_name=tolower(screen_name)) %>% # lower-case to match with congress_merged
  left_join(congress_merged, by=c('screen_name'='twitter')) %>% 
  select(is_repub, text)
tdf %>% head()

## # A tibble: 6 × 2
##   is_repub text                                                                 
##   <lgl>    <chr>                                                                
## 1 TRUE     ".  said reforming the National Flood Insurance Program should be on…
## 2 TRUE     "On  Wednesday, @SenJohnKennedy said those who commit sexual harassm…
## 3 FALSE    "GOP leaders must stop burying their heads in the sand. If this isn’…
## 4 FALSE    "#GOP moving ahead w/ plans to shred the social safety net. Does tha…
## 5 FALSE    "I spoke to  this morning about the President’s lack of leadership a…
## 6 FALSE    "#FF new  - @RepJimMcDermott @RepBarbaraLee @davidcicilline @HakeemJ…

# tokenize texts (see 'token' parameter in docs)
party_token_counts <- tdf %>% 
  unnest_tokens("word", text, token = "tweets") %>% 
  anti_join(stop_words) %>% # remove stopwords
  count(is_repub, word)

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## Joining, by = "word"

party_token_counts %>% head()

## # A tibble: 6 × 3
##   is_repub word                n
##   <lgl>    <chr>           <int>
## 1 FALSE    ⁦⁩                    1
## 2 FALSE    ⁦                    3
## 3 FALSE    @ajitpaifcc         1
## 4 FALSE    @alexmillernews     1
## 5 FALSE    @b52malmet          1
## 6 FALSE    @beckyquick         1

# use the tf_idf to compare republican words with token frequency
tf_idf_scores <- party_token_counts %>% 
  bind_tf_idf(word, is_repub, n) %>% 
  arrange(desc(tf_idf)) %>% 
  select(is_repub, word, tf_idf)

## Warning: A value for tf_idf is negative:
##  Input should have exactly one row per document-term combination.

tf_idf_scores %>% head()

## # A tibble: 6 × 3
##   is_repub word        tf_idf
##   <lgl>    <chr>        <dbl>
## 1 FALSE    children   0.00242
## 2 TRUE     #taxreform 0.00184
## 3 TRUE     texas      0.00184
## 4 FALSE    nh         0.00121
## 5 TRUE     inhofe     0.00102
## 6 TRUE     liberty    0.00102

# now pick out republican words and show them in graph
tf_idf_scores %>% 
  filter(is_repub) %>% 
  arrange(desc(tf_idf)) %>% 
  head(10) %>% 
  ggplot(aes(y=reorder(word, tf_idf), x=tf_idf)) +
    geom_bar(stat='identity', position='dodge') +
    xlab('TF-IDF Score') + ylab('Word')

Questions

1. In which scenarios would it be best to consider dictionary-based approaches to text analysis? How does the decision to use dictionary-based approaches shape the research questions you can ask?

# your answer here

2. Create a bar graph showing the frequencies of the twenty most-used tokens in our senator_tweet_sample corpus after removing URLs, stopwords, and “mentions” (cases where you see an “@” followed by non-whitespace characters - see Lab 8 example problems).

# your answer here

3. For each of the top three most frequent non-stopword tokens, extract up to three tweets with the highest number of retweets that include the token. Based on the context provided in these Tweets, give a quick sentence about how they seem to be used and why they might appear frequently.

# your answer here

your written explanation here

4. Use TF-IDF to show the top 10 words that are most distinctive to Tweets published by Males and Females as a bar chart. To do this, you may combine texts from all Tweets published by these two genders, essentially treating our corpus as being two “documents” in the terminology that TF-IDF methods typically use. This approach provides a systematic way of comparing large number of texts along any dimension of interest.

# your answer here

your written explanation here

5. Create a new column in senator_tweet_sample that corresponds to the time of day that a given tweet was posted, and make a bar graph comparing the number of tweets published in daytime (5am-5pm, inclusive) vs night.

Hint: see the hour function of lubridate.

# your answer here

6. Use the “bing” sentiment dictionary to compare the average sentiment for Tweets published in daytime vs nighttime using a bar plot. You get to choose how you will create these sentiment scores for comparison - explain and justify your decision. Also explain your interpretation of the results.

HINT: use get_sentiments("bing") to get the Bing dictionary.

# your answer here

Explain why you chose to compute sentiment in this way.

7. Create a custom dictionary with at least two categories (e.g. positive/negative, happy/sad, solution/problem-oriented, etc) and compare daytime-nightime scores for each of the two categories. What does this result tell you about your data? What is your dictionary capturing here?

Hint: you may want to look at the bing dictionary (get_sentiments("bing")) to see how you should format your custom dictionary.

# your answer here

Explain what your dictionary is intended to capture and interpret the results.

8. Using the data you have collected for your final project, show one preliminary result or statistic from an analysis you ran. If you haven’t collected your dataset computationally, try to look anecdotally at the original source (e.g. if Twitter is your dataset, then just look on the Twitter website) and give one observation about the data. Try to make an observation or result based on one of the variables you will use for your final analysis. What do you see? Please send your figures and statistics directly to your TA in Slack - don’t add them to your script.

written description here

Lab #8: Dictionary-Based Text Analysis

Data Science and Society (Sociology 367)

Example Questions

Questions