In this lab, we will practice working with topic modeling algorithms. We will use a dataset of Wikipedia pages for members of Congress, and explore the ways topic modeling can help us learn about the corpus. First we will use the LDA algorithm to practice the basics of topic modeling, then we will use the structural topic modeling algorithm (see the stm package) to show how we can use information about each member (age, gender, political party) in conjunction with our model. NOTE: if your code takes too long to run or your computer freezes up, use the substr function to truncate the Wikipedia page texts right after loading.

Be sure to do the required readings first. While many of the problems can be solved using approaches from the lecture videos, lab videos, or required readings, you may need to do some searching on the internet to solve others. This is a valuable skill to develop as a data scientist.

This assignment should be submitted as both an R Markdown (.Rmd) file and a knitted HTML web page (.html) file, and you should start from the lab markdown file (link to file) with the author name replaced by your own. While the R Markdown file should include all of the code you used to generate your solutions, the knitted HTML file should ONLY display the solutions to the assignment, NOT the code you used to produce them. Imagine this is a document you will submit to a supervisor or professor: they should be able to reproduce the code/analysis if needed, but otherwise only need to see the results and write-up. The TA should be able to run your R Markdown file to generate exactly the same HTML file you submitted. Submit both files to your TA via direct message on Slack by the deadline indicated on the course website.

Required reading:

Optional reading:

Load the datasets and libraries. You shouldn’t need to change the URL in the load function.

library(tidyverse)  # attaches dplyr, tidyr, stringr, ggplot2, and friends
library(tidytext)
library(tm)
library(topicmodels)
library(stm)

# THIS SHOULD WORK AS-IS. If not, download the files from your browser at the same URLs.
load(url('https://dssoc.github.io/datasets/congress_wiki.RData'))
load(url('https://dssoc.github.io/datasets/congress.RData'))



# OPTIONAL speed fix: truncate each page to its first 1000 characters
congress_wiki <- congress_wiki %>% mutate(text = substr(text, 1, 1000))


Questions


1. Describe a document-term matrix (DTM) in your own words. Why is this data structure useful for text analysis?
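For intuition (this is an illustration, not your answer), here is a toy example of casting a tiny hand-made corpus into a DTM using the packages loaded above; the words and document ids are made up:

# a toy corpus: five (document, word) pairs spread across two documents
toy <- tibble(doc_id = c(1, 1, 2, 2, 2),
              word   = c("tax", "vote", "tax", "farm", "farm"))

# count word frequencies per document, then cast to a document-term matrix
toy %>%
  count(doc_id, word) %>%
  cast_dtm(doc_id, word, n)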

your answer here

2. Describe a topic modeling algorithm in your own words. What is the input to a topic modeling algorithm (after parsing the raw text)? What does the actual topic model look like? How do you choose the number of topics? What are the beta parameter estimates?
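As a pointer for the last part: once you have fit a model (Question 4), the beta estimates can be inspected with tidytext; lda_model here is a hypothetical name for a fitted model:

# each row is a (topic, term) pair; beta is the probability that
# the topic generates that term
tidy(lda_model, matrix = "beta") %>%
  arrange(topic, desc(beta)) %>%
  head()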

your answer here

3. Join the columns of congress with congress_wiki so that you have congress member information (gender, type, etc.) associated with every Wikipedia page in our dataset, then create a document-term matrix after removing stopwords.

Hint: your merged dataframe should have as many rows as the original congress_wiki dataframe.
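A minimal sketch of one possible approach, assuming the shared key is a column named bioguide_id (check names(congress) and names(congress_wiki) for the actual key in your data):

# attach member metadata to each page; left_join keeps one row per page
wiki_meta <- congress_wiki %>%
  left_join(congress, by = "bioguide_id")

# tokenize, drop stopwords, count words per document, and cast to a DTM
wiki_dtm <- wiki_meta %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(bioguide_id, word) %>%
  cast_dtm(bioguide_id, word, n)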

# your answer here

4. Construct a topic model with LDA using a specified random seed (see the control parameter). This might take a little while to run. You can choose the number of topics however you see fit - it might be useful to try multiple values.
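A sketch with the topicmodels package, assuming the DTM from Question 3 is named wiki_dtm; k = 10 and the seed value are arbitrary starting points:

# the seed inside control makes the fit reproducible across knits
lda_model <- LDA(wiki_dtm, k = 10, control = list(seed = 1234))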

# your answer here

5. Make a function that accepts (takes as an argument) a topic model and returns a plot showing the word distributions for the top ten words in each topic, then call the function with your topic model. Choose two topics which appear to be easiest to understand, and explain what you think they represent based on the word distributions.
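One possible shape for such a function, assuming the fitted lda_model from Question 4; reorder_within and scale_y_reordered come from tidytext:

plot_top_words <- function(topic_model, n_words = 10) {
  tidy(topic_model, matrix = "beta") %>%
    group_by(topic) %>%
    slice_max(beta, n = n_words) %>%   # top n words per topic
    ungroup() %>%
    mutate(term = reorder_within(term, beta, topic)) %>%
    ggplot(aes(beta, term)) +
    geom_col() +
    scale_y_reordered() +
    facet_wrap(~ topic, scales = "free_y")
}

plot_top_words(lda_model)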

# your answer here
your answer here

6. Create a structural topic model with the stm package using politician gender, political orientation, type (senate/house), and (approximate) age as covariates in the model. Then use labelTopics to view the words associated with two topics you find most interesting. Can you easily describe what the topics are capturing?

Hint: you’ll need to recall some of the string techniques we’ve used before to calculate age based on birth date. You can use an approximation for age by subtracting the birth year from the current year (2020).
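A sketch of the stm workflow, assuming the merged dataframe from Question 3 is named wiki_meta with columns text, gender, party, type, and birthday (substitute your actual column names), and with K = 10 as a placeholder:

# approximate age from the birth year (first four characters of birthday)
wiki_meta <- wiki_meta %>%
  mutate(age = 2020 - as.numeric(str_sub(birthday, 1, 4)))

# stm's own preprocessing keeps the metadata aligned with the documents
processed <- textProcessor(wiki_meta$text, metadata = wiki_meta)
prepped <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# covariates enter through the prevalence formula
stm_model <- stm(prepped$documents, prepped$vocab, K = 10,
                 prevalence = ~ gender + party + type + age,
                 data = prepped$meta)

labelTopics(stm_model)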

# your answer here
your answer here

7. Now try using searchK to identify the best number of topics for your dataset. Which K is optimal? How did you decide?

Hint: you should be able to reuse the same preprocessed and prepared objects from the previous question.
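A sketch reusing the prepped objects from Question 6; the candidate values of K are assumptions you should adjust:

# fits a model for each K and reports diagnostics such as held-out
# likelihood, semantic coherence, and residuals
k_search <- searchK(prepped$documents, prepped$vocab,
                    K = c(5, 10, 15, 20),
                    prevalence = ~ gender + party + type + age,
                    data = prepped$meta)
plot(k_search)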

# your answer here
your answer here

8. Use selectModel to generate a number of models and then plotModels to visualize their performance. Which model had the best performance, and how can you tell? What do the two dimensions plotted using plotModels mean? Then use plot to show an overview of the topics in the model. What does this tell you? Keep the optimal model for future questions.
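A sketch, with best_k standing in for whatever K you settled on in Question 7:

# run several initializations and keep the most promising candidates
model_select <- selectModel(prepped$documents, prepped$vocab, K = best_k,
                            prevalence = ~ gender + party + type + age,
                            data = prepped$meta, runs = 20)

# compares the surviving candidates on semantic coherence and exclusivity
plotModels(model_select)

# keep one of the surviving runs (e.g. the first), then summarize its topics
stm_best <- model_select$runout[[1]]
plot(stm_best, type = "summary")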

# your answer here
your answer here

9. After identifying the statistically optimal number of topics and choosing the best among models with a given number of topics, we now have a model that we can examine more closely. The first thing we will do is try to understand how each of our covariates (politician age, gender, political party, and type) corresponds to each topic. This is done primarily through use of the estimateEffect function. Use estimateEffect and summary to print out models corresponding to each of our topics. Identify several situations where a covariate is predictive of a topic. Then, create a plot showing those effect sizes with confidence intervals using the plot function. Make sure the figure is readable. Which topics are the most interesting based on the covariate significance? What do these results tell you?
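A sketch of the estimateEffect workflow, with stm_best as the hypothetical model name from Question 8; the topic number and the party labels in the plot call are placeholders you should match to your data:

# regress topic proportions on the covariates, one model per topic
# (replace 1:10 with 1:K for your chosen number of topics)
effects <- estimateEffect(1:10 ~ gender + party + type + age,
                          stm_best, metadata = prepped$meta)
summary(effects)

# example: difference in one topic's prevalence between the two parties,
# plotted with confidence intervals
plot(effects, covariate = "party", topics = 3,
     model = stm_best, method = "difference",
     cov.value1 = "Republican", cov.value2 = "Democrat")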

# your answer here
your answer here

10. Now we will inspect the topics in our selected model. First, use labelTopics to examine the top words for each topic you identified as interesting in the previous step. Then use plotQuote to show the first n characters of each document closely associated with the topic. Do you see any interesting patterns?
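A sketch combining findThoughts with plotQuote, assuming the metadata kept by prepDocuments still contains the text column; stm_best, the topic number, and the character cutoff are placeholders:

# top words for one topic of interest
labelTopics(stm_best, topics = 3)

# the three documents most strongly associated with topic 3,
# truncated to their first 200 characters
thoughts <- findThoughts(stm_best, texts = substr(prepped$meta$text, 1, 200),
                         topics = 3, n = 3)
plotQuote(thoughts$docs[[1]])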

# your answer here
your answer here

11. Examine the statistical models and the content of the topics closely to understand the patterns appearing in our data. What is your interpretation of the results? You should be able to draw concrete conclusions such as “Republican Wikipedia pages are more likely to discuss topic X, which is about ...”. Based on the nature of the underlying data, why do you think we observe these results?

your answer here

12. Before this assignment is due, make a short post about your final project in the #final-project-workshop channel in Slack, and give feedback or helpful suggestions to at least one other project posted there. This will be a good way to receive and offer help to your peers!

Congratulations! This is the last lab for the course. These labs were not easy, but you persisted and I hope you learned a lot in the process. As you probably noticed by now, learning data science is often about trying a bunch of things and doing research on the web to see what others have done. Of course, it also requires a bit of creativity that comes from experience and intuition about your dataset. Be sure to talk to Professor Bail and the TA to make sure you’re on the right track for the final project. Good luck!