Lab #10 Markdown File

Lab Instructions

In this lab, we will practice working with topic modeling algorithms. We will use a dataset of Wikipedia pages, one for each senator, and explore the ways topic modeling can help us learn about a corpus. First we will use the LDA algorithm to practice the basics of topic modeling, then we will use the structural topic model (see the stm package) to show how we can use information about each senator (age, gender, political party) in conjunction with our model. NOTE: if your code takes too long to run or your computer freezes up, use the substr function to truncate the Wikipedia page texts right after loading them.
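
For example, a minimal sketch of that truncation (assuming the full page text is in the text column and that 10,000 characters is enough to work with; the cutoff is arbitrary) could look like this:

# keep only the first 10,000 characters of each page text
senator_wiki <- senator_wiki %>%
  mutate(text = substr(text, 1, 10000))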

See the “Instructions” section of the Introduction to Lab Assignments page for more information about the labs. That page also gives descriptions for the datasets we will be using.

Required reading:

stm Package Vignette

Optional reading:

Load the datasets and libraries. You shouldn't need to change the URLs in the load calls below.

library(tidytext)
library(tidyverse)
library(tidyr)
library(dplyr)
library(tm)
library(stringr)
library(topicmodels)
library(ggplot2)
library(stm)
library(reshape2)
#install.packages('topicmodels')

load(url('https://dssoc.github.io/datasets/congress.RData'))
load(url('https://dssoc.github.io/datasets/senator_wiki.RData'))


Example Questions


ex1. Construct an LDA topic model with 10 topics from the text or subtext column of senator_wiki, then compute the correlation between the topic distributions and age/party. Do any topics appear to vary systematically with these covariates?

# join with congress dataset
wiki_info <- congress %>% inner_join(senator_wiki, by='bioguide_id')

# create the dtm from the subtext column
dtm <- wiki_info %>% 
  select(bioguide_id, subtext) %>% 
  unnest_tokens('word', 'subtext') %>% 
  anti_join(stop_words) %>% 
  count(bioguide_id, word) %>% 
  cast_dtm(bioguide_id, word, n)
## Joining, by = "word"
# create the topic model
tm <- LDA(dtm, k=10, control=list(seed=0))

# topic-word distributions
topic_words <- tm %>% tidy(matrix = "beta")
topic_words %>% head()
## # A tibble: 6 × 3
##   topic term      beta
##   <int> <chr>    <dbl>
## 1     1 10    0.00113 
## 2     2 10    0.000746
## 3     3 10    0.000940
## 4     4 10    0.00278 
## 5     5 10    0.000379
## 6     6 10    0.000765
# document-topic distributions
doc_topics <- tm %>% tidy(matrix = "gamma")
doc_topics %>% head()
## # A tibble: 6 × 3
##   document topic     gamma
##   <chr>    <int>     <dbl>
## 1 B000575      1 0.0000341
## 2 B000944      1 0.0000357
## 3 B001236      1 0.0000349
## 4 B001261      1 0.0000363
## 5 B001277      1 0.0000359
## 6 B001288      1 1.00
# compute correlations with age and party
doc_topics %>% 
  left_join(congress, by=c('document'='bioguide_id')) %>% 
  mutate(age=2022-birthyear, is_repub=party=='Republican') %>% 
  group_by(topic) %>% 
  summarize(
    age_cor=cor(gamma, age), 
    age_pval=cor.test(gamma, age)$p.value,
    is_repub_cor=cor(gamma, is_repub), 
    is_repub_pval=cor.test(gamma, as.numeric(is_repub))$p.value
  ) %>% arrange(age_cor)
## # A tibble: 10 × 5
##    topic age_cor age_pval is_repub_cor is_repub_pval
##    <int>   <dbl>    <dbl>        <dbl>         <dbl>
##  1     8 -0.238    0.0701      0.0171          0.898
##  2     9 -0.171    0.194       0.120           0.365
##  3     3 -0.0567   0.670       0.0712          0.592
##  4     2 -0.0276   0.836      -0.0212          0.873
##  5     1 -0.0258   0.846      -0.0952          0.473
##  6     4 -0.0173   0.896       0.00782         0.953
##  7     7  0.0114   0.932      -0.121           0.360
##  8    10  0.114    0.388       0.176           0.182
##  9     5  0.174    0.186      -0.0952          0.473
## 10     6  0.230    0.0792     -0.0916          0.490
# none of these correlations are significant at the conventional 0.05 level, so we do
# NOT have evidence that topics differ along age or party lines


Questions


1. Describe a document-term matrix (DTM) in your own words. Why is this data structure useful for text analysis?
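
If it helps to ground your answer, here is a small illustrative sketch (the two toy documents and their names are made up) showing how a DTM can be built with cast_dtm:

# two made-up documents for illustration
toy <- tibble(doc = c('d1', 'd2'),
              text = c('taxes and healthcare', 'healthcare and education'))

toy %>%
  unnest_tokens(word, text) %>%
  count(doc, word) %>%
  cast_dtm(doc, word, n) %>%
  inspect()  # rows are documents, columns are terms, cells are word counts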

your answer here


2. Answer each of the following questions:

What is a topic modeling algorithm?
What is the input to a topic modeling algorithm?
What is the structure of a fitted topic model?
How do you choose the number of topics?
What are the beta parameter estimates?


3. Construct an LDA topic model from the text or subtext column of senator_wiki (after removing stopwords) using a specified random seed (see the control parameter). You can choose the number of topics however you see fit; it may be useful to try several values. Finally, create a plot showing the word distributions for the top ten words associated with two topics of your choice.

NOTE: depending on the parameters you choose, this might take a little while to run.
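
As a starting point (not a complete solution), here is a hedged sketch of the plotting step, assuming your fitted model is called my_lda and you picked topics 1 and 2; adapt it to your own choices:

# top ten words (by beta) for two illustrative topics
my_lda %>%
  tidy(matrix = 'beta') %>%
  filter(topic %in% c(1, 2)) %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term)) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(~ topic, scales = 'free_y')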

# your answer here


4. For this problem, we want to identify any topics from the previous LDA model that may be strongly associated with the gender of the senator about whom each Wikipedia article was written. To establish this, compute the correlation between the gender variable in our congress dataset and each of the topics in your LDA model from the previous question. Did you find that any of the topics from your model are strongly associated with the senator's gender? Based on the word distributions of those topics, what is your explanation for this finding?

HINT: you should use a correlation test (i.e., cor.test rather than cor) to establish whether there is a statistically significant association.
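
Following the pattern from ex1, a hedged sketch might look like the following, assuming your document-topic tibble is called doc_topics and that gender is coded 'F'/'M' (check the actual coding in congress):

doc_topics %>%
  left_join(congress, by=c('document'='bioguide_id')) %>%
  mutate(is_female = gender == 'F') %>%   # assumed coding; adjust as needed
  group_by(topic) %>%
  summarize(
    gender_cor = cor(gamma, is_female),
    gender_pval = cor.test(gamma, as.numeric(is_female))$p.value
  ) %>% arrange(gender_pval)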

# your answer here
Written explanation here.


5. Create a structural topic model with the stm package, using politician gender, political party affiliation, and (approximate) age as covariates in the model. Then call plot on the fitted model with no additional arguments to show the prevalence of, and top words associated with, each topic.

HINT: to create the STM, start with Sections 3.1-3.3 of the stm Package Vignette listed in the required readings. You will use textProcessor, prepDocuments, and then stm to create the STM topic model.
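
A minimal sketch of that workflow, assuming the joined wiki_info data frame from ex1 and an arbitrary choice of 10 topics:

# metadata with an approximate age variable
wiki_meta <- wiki_info %>% mutate(age = 2022 - birthyear)

# process the text and prepare the documents for stm
processed <- textProcessor(wiki_meta$subtext, metadata = wiki_meta)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# fit the structural topic model with covariates on topic prevalence
stm_fit <- stm(documents = out$documents, vocab = out$vocab, K = 10,
               prevalence = ~ gender + party + age, data = out$meta, seed = 0)

# default plot: expected topic proportions and top words
plot(stm_fit)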

# your answer here
your answer here


6. Use labelTopics to view the words associated with two topics you find most interesting. Can you easily describe what these topics are capturing?

HINT: to use labelTopics, read Section 3.5 of the stm Package Vignette.
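
For example, assuming your fitted model is called stm_fit and topics 2 and 5 happen to be the interesting ones:

# highest-probability, FREX, lift, and score words for two illustrative topics
labelTopics(stm_fit, topics = c(2, 5))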

# your answer here
your answer here


7. Now we will try to understand how each of our covariates (politician age, gender, and political party) corresponds to each topic, primarily through the estimateEffect function. Use estimateEffect and summary to print out the models corresponding to each of our topics, and identify several situations where a covariate is predictive of a topic. Then create a plot showing those effect sizes with confidence intervals using the plot function, making sure the figure is readable. Which topics are the most interesting based on the covariate significance? What do these results tell you?

HINT: See the plot in section 3.6 of the stm Package Vignette under the heading “Topical content” on pages 18-19.
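
A hedged sketch, assuming the stm_fit model and out$meta object from the earlier sketch, 10 topics, and an 'F'/'M' coding for gender (all assumptions to check against your own objects):

# regress each topic's prevalence on the covariates
effects <- estimateEffect(1:10 ~ gender + party + age, stm_fit, metadata = out$meta)
summary(effects)

# effect of gender on one illustrative topic, with confidence intervals
plot(effects, covariate = "gender", topics = 3, model = stm_fit,
     method = "difference", cov.value1 = "F", cov.value2 = "M")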

# your answer here
your answer here


8. Before this assignment is due, make a short post about your final project in the #final-project-workshop channel in Slack, and give feedback or helpful suggestions on at least one other project posted there. This will be a good way to receive and offer help to your peers!

Congratulations! This is the last lab for the course. These labs were not easy, but you persisted, and I hope you learned a lot in the process. As you have probably noticed by now, learning data science is often about trying a bunch of things and searching the web to see what others have done. Of course, it also requires a bit of creativity that comes from experience and intuition about your dataset. Be sure to talk to Professor Bail and the TA to confirm you're on the right track for the final project. Good luck!