In this lab, we will practice working with topic modeling algorithms. We will use a dataset of Wikipedia pages for each senator, and explore the ways topic modeling can help us learn about the corpus. First we will use the LDA algorithm to practice the basics of topic modeling, then we will use the structural topic modeling algorithm (see the `stm` package) to show how we can use information about each senator (age, gender, political party) in conjunction with our model.
NOTE: if you run into problems where your code takes too long to run or your computer freezes up, use the `substr` function to truncate the Wikipedia page texts right after loading.
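A minimal sketch of that truncation, assuming `senator_wiki` has a `text` column and using 5,000 characters as an arbitrary cutoff:

```r
# keep only the first 5,000 characters of each page (cutoff is arbitrary)
# so that tokenizing and model fitting stay fast
senator_wiki$text <- substr(senator_wiki$text, 1, 5000)
```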
See the “Instructions” section of the Introduction to Lab Assignments page for more information about the labs. That page also gives descriptions for the datasets we will be using.
Required reading:
Optional reading:
Load the datasets and libraries. You shouldn’t need to change the URLs passed to the `load` function.
library(tidytext)
library(tidyverse)
library(tidyr)
library(dplyr)
library(tm)
library(stringr)
library(topicmodels)
library(ggplot2)
library(stm)
library(reshape2)
#install.packages('topicmodels')
load(url('https://dssoc.github.io/datasets/congress.RData'))
load(url('https://dssoc.github.io/datasets/senator_wiki.RData'))
ex1. Construct an LDA topic model from the `text` or `subtext` columns of `senator_wiki` with 10 topics, then compute the correlation between topic distributions and age/party. Do any topics appear to vary systematically with these covariates?
# join with congress dataset
wiki_info <- congress %>% inner_join(senator_wiki, by='bioguide_id')
# create the dtm from the subtext column
dtm <- wiki_info %>%
select(bioguide_id, subtext) %>%
unnest_tokens('word', 'subtext') %>%
anti_join(stop_words) %>%
count(bioguide_id, word) %>%
cast_dtm(bioguide_id, word, n)
## Joining, by = "word"
# create the topic model
tm <- LDA(dtm, k=10, control=list(seed=0))
# topic-word distributions
topic_words <- tm %>% tidy(matrix = "beta")
topic_words %>% head()
## # A tibble: 6 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 10 0.00113
## 2 2 10 0.000746
## 3 3 10 0.000940
## 4 4 10 0.00278
## 5 5 10 0.000379
## 6 6 10 0.000765
# document_topic distributions
doc_topics <- tm %>% tidy(matrix = "gamma")
doc_topics %>% head()
## # A tibble: 6 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 B000575 1 0.0000341
## 2 B000944 1 0.0000357
## 3 B001236 1 0.0000349
## 4 B001261 1 0.0000363
## 5 B001277 1 0.0000359
## 6 B001288 1 1.00
# compute correlations with age and party
doc_topics %>%
left_join(congress, by=c('document'='bioguide_id')) %>%
mutate(age=2022-birthyear, is_repub=party=='Republican') %>%
group_by(topic) %>%
summarize(
age_cor=cor(gamma, age),
age_pval=cor.test(gamma, age)$p.value,
is_repub_cor=cor(gamma, is_repub),
is_repub_pval=cor.test(gamma, as.numeric(is_repub))$p.value
) %>% arrange(age_cor)
## # A tibble: 10 × 5
## topic age_cor age_pval is_repub_cor is_repub_pval
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 8 -0.238 0.0701 0.0171 0.898
## 2 9 -0.171 0.194 0.120 0.365
## 3 3 -0.0567 0.670 0.0712 0.592
## 4 2 -0.0276 0.836 -0.0212 0.873
## 5 1 -0.0258 0.846 -0.0952 0.473
## 6 4 -0.0173 0.896 0.00782 0.953
## 7 7 0.0114 0.932 -0.121 0.360
## 8 10 0.114 0.388 0.176 0.182
## 9 5 0.174 0.186 -0.0952 0.473
## 10 6 0.230 0.0792 -0.0916 0.490
# none of the p-values are below 0.05, so we do NOT have evidence that topics differ along age or party lines
1. Describe a document-term matrix (DTM) in your own words. Why is this data structure useful for text analysis?
your answer here
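If it helps to see a DTM concretely, here is a small sketch with two invented toy documents showing how `tidytext` tokenizes text and casts the counts into a document-term matrix:

```r
library(dplyr)
library(tidytext)

# two toy documents invented for illustration
docs <- tibble(
  doc  = c("d1", "d2"),
  text = c("taxes and healthcare", "healthcare reform")
)

# one row per (document, word) count, then cast to a DTM:
# rows are documents, columns are terms, cells are counts
dtm <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc, word) %>%
  cast_dtm(doc, word, n)

dim(dtm)  # 2 documents x 4 unique terms
```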
2. Answer each of the following questions:
- What is a topic modeling algorithm?
- What is the input to a topic modeling algorithm?
- What is the structure of a fitted topic model?
- How do you choose the number of topics?
- What are the beta parameter estimates?
3. Construct an LDA topic model from the `text` or `subtext` columns of `senator_wiki` (after removing stopwords) using a specified random seed (see the `control` parameter). You can choose the number of topics however you see fit - it might be useful to try multiple values. Finally, create a plot showing the word distributions for the top ten words associated with two topics of your choice.
NOTE: depending on the parameters you choose, this might take a little while to run.
# your answer here
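One possible starting point, assuming the `dtm` object built in ex1 above; `k = 10`, the seed, and topics 1 and 2 are arbitrary choices:

```r
library(topicmodels)
library(tidytext)
library(dplyr)
library(ggplot2)

# fit the model with a fixed seed so results are reproducible
lda <- LDA(dtm, k = 10, control = list(seed = 0))

# plot the ten highest-beta words for two topics
tidy(lda, matrix = "beta") %>%
  filter(topic %in% c(1, 2)) %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term)) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(~ topic, scales = "free_y")
```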
4. For this problem, we want to identify any topics from the previous LDA model that may be strongly associated with the gender of the senator about whom the Wikipedia article was written. To establish this, compute the correlation between the `gender` variable in our `congress` dataset and each of the topics in your LDA model from the previous question. Did you find that any of the topics from your model are strongly associated with the gender of the senator? Based on the word distributions of those topics, what is your explanation for this finding?
HINT: you should use a correlation test (i.e. `cor.test` instead of `cor`) to establish whether or not there is a significant association.
# your answer here
Written explanation here.
5. Create a structural topic model with the `stm` package using politician gender, political party affiliation, and (approximate) age as covariates in the model. Then use `plot` with no additional parameters to show prevalence and top words associated with each topic.
HINT: to create the STM, start with Sections 3.1-3.3 of the stm Package Vignette listed in the required readings. You’ll use `textProcessor`, `prepDocuments`, and then `stm` to create the STM topic model.
# your answer here
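A rough sketch of the vignette's workflow, assuming the joined `wiki_info` data frame from ex1; `K = 10` is an arbitrary choice and age is approximated from `birthyear`:

```r
library(stm)

# clean and tokenize the raw text, keeping the covariates as metadata
processed <- textProcessor(wiki_info$subtext, metadata = wiki_info)
prepped <- prepDocuments(processed$documents, processed$vocab, processed$meta)
prepped$meta$age <- 2022 - prepped$meta$birthyear

# fit the STM with topic prevalence modeled as a function of the covariates
stm_fit <- stm(documents = prepped$documents, vocab = prepped$vocab,
               K = 10, prevalence = ~ gender + party + age,
               data = prepped$meta, seed = 0)

# with no extra parameters, plot() shows expected topic proportions
# alongside the top words for each topic
plot(stm_fit)
```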
your answer here
7. Use `labelTopics` to view the words associated with two topics you find most interesting. Can you easily describe what these topics are capturing?
HINT: to use `labelTopics`, read Section 3.5 of the stm Package Vignette.
# your answer here
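A sketch assuming the fitted `stm_fit` model from question 5; topics 1 and 2 are placeholders for whichever topics you pick:

```r
library(stm)

# prints the highest-probability, FREX, Lift, and Score words per topic
labelTopics(stm_fit, topics = c(1, 2), n = 10)
```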
your answer here
8. Now we will try to understand how each of our covariates (politician age, gender, and political party) corresponds to each topic. This is done primarily through use of the `estimateEffect` function. Use `estimateEffect` and `summary` to print out models corresponding to each of our topics. Identify several situations where a covariate is predictive of a topic. Then, create a plot showing those effect sizes with confidence intervals using the `plot` function. Make sure the figure is readable. Which topics are the most interesting based on the covariate significance? What do these results tell you?
HINT: See the plot in section 3.6 of the stm Package Vignette under the heading “Topical content” on pages 18-19.
# your answer here
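A sketch assuming the `stm_fit` model and `prepped` metadata from question 5; topic 3 in the plot call is a placeholder:

```r
library(stm)

# regress each topic's prevalence on the covariates
effects <- estimateEffect(1:10 ~ gender + party + age,
                          stm_fit, metadata = prepped$meta)
summary(effects)

# point estimates with confidence intervals for one covariate
plot(effects, covariate = "party", topics = 3,
     method = "pointestimate")
```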
your answer here
9. Before this assignment is due, make a short post about your final project in the `#final-project-workshop` channel in Slack, and give feedback or helpful suggestions to at least one other project posted there. This will be a good way to receive and offer help to your peers!
Congratulations! This is the last lab for the course. These labs were not easy, but you persisted and I hope you learned a lot in the process. As you probably noticed by now, learning data science is often about trying a bunch of things and doing research on the web to see what others have done. Of course, it also requires a bit of creativity that comes from experience and intuition about your dataset. Be sure to talk to Professor Bail and the TA to make sure you’re on the right track for the final project. Good luck!