In this document I’ll first give detailed instructions for the labs and then give descriptions for each dataset we will be using this semester.

Instructions

Each week’s lab should be submitted to your TA via Slack direct message by the deadline indicated on the course website. You should start with the markdown file linked at the top of the lab, and you will submit two files for each lab: an R Markdown (.Rmd) file containing solution code and written responses (if required) that answer each question, and an HTML (.html) file generated by knitting that file. The TA should be able to knit your R Markdown file to reproduce your HTML file exactly without any additional steps, and the knitted HTML file should display ONLY your code and the output needed to answer the question (please do not show intermediate output in your final product). The output from your solution code should be easy to read, and you will lose points if your knitted HTML file includes extraneous output that makes your solution harder to read.
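
One way to keep package start-up messages and warnings out of the knitted file is to set chunk options once, in a setup chunk near the top of the .Rmd file. The sketch below uses knitr::opts_chunk$set; which options to silence is an assumption for illustration, not a course requirement.

```r
# Place this in a setup chunk at the top of your .Rmd file.
# Assumption: hiding messages and warnings is enough to tidy your output;
# adjust to whatever your own lab actually prints.
library(knitr)

opts_chunk$set(
  echo = TRUE,      # keep showing the solution code
  message = FALSE,  # hide messages such as package start-up notes
  warning = FALSE   # hide warnings in the knitted HTML
)
```

Options set this way apply to every later chunk, and an individual chunk can still override them in its own header.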

Be sure to do the required readings first. While many of the problems can be solved using approaches from the lecture videos, lab videos, or required readings, you may need to do some searching on the internet to solve some of the problems. This will be a valuable skill to learn as you develop your data science skillset. Finding answers on the web might be hard at first as you learn the language of coding, so feel free to share web links on the #lab-help channel of Slack.

Your solution to every problem should be general enough that it would produce the same output if we swapped the input data for another dataframe with the same columns but a different row order. This means, among other things, that you should not rely on numerical subscripts (e.g. my_dataframe[1, ]) to get a specific row in your code (unless you sorted the rows first), because your code should still work when the dataframe rows occur in a different order. Your solutions also cannot include hard-coded answers: the code must produce the output directly from the provided datasets. You may not enter values related to your solution in the code (except as comments). Your code should always appear in the code blocks that include the text # your code here (or something similar). Some solutions require written text instead of a code block. In that case, I will provide a block that begins and ends with “```” instead of beginning with “```{r}” and ending with “```” as a code block does.
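
To make the row-order requirement concrete, here is a minimal sketch using a made-up dataframe (the names and values are hypothetical, not from the course data). It contrasts a positional subscript, which changes with row order, against a condition on the data, which does not.

```r
library(dplyr)

# Toy stand-in for a lab dataframe (hypothetical values).
toy <- data.frame(
  name = c("Ana", "Ben", "Cal"),
  birthyear = c(1970, 1950, 1990)
)

# The same rows in a different order.
shuffled <- toy[c(3, 1, 2), ]

# Fragile: the answer depends on which row happens to come first.
toy[1, "name"]       # "Ana"
shuffled[1, "name"]  # "Cal"

# Robust: pick the row by a condition on the data itself.
oldest <- function(df) {
  df %>%
    filter(birthyear == min(birthyear)) %>%
    pull(name)
}

oldest(toy)       # "Ben"
oldest(shuffled)  # "Ben"
```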

Visualizations should be readable. Each plot should have axis labels, all labels must be legible, and we should easily be able to tell what your figure is showing. Failure to make clear visualizations will result in point deductions.
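
As a rough illustration of a labeled plot, the sketch below uses ggplot2 with made-up counts (the data and label text are assumptions for the example). The labs() call is what sets the axis labels and title.

```r
library(ggplot2)

# Hypothetical toy data: number of members per party.
party_counts <- data.frame(
  party = c("Democrat", "Independent", "Republican"),
  n = c(5, 1, 4)
)

p <- ggplot(party_counts, aes(x = party, y = n)) +
  geom_col() +
  labs(
    x = "Party",
    y = "Number of members",
    title = "Members by party (toy data)"
  )

p
```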

Lab File List

This is a list of all Labs and their associated markdown files:

  1. Example R Markdown File (ungraded) (markdown)
  2. R Basics (markdown)
  3. Data Visualization (markdown)
  4. Data Wrangling (markdown)
  5. Programming Basics (markdown)
  6. Coding Social Networks (markdown)
  7. Working with APIs (markdown)
  8. Modeling (A Brief Introduction) (markdown)
  9. Introduction to text analysis (markdown)
  10. Word counts and Dictionaries (markdown)
  11. Topic Modeling (markdown)

Dataset Description

Now I will describe the datasets we will use for the class. We will use a total of four different RData files:

The congress.RData Dataset

Download congress.RData

This dataset contains information about each member of congress who was in office as of January 11, 2021 and their committee memberships. It was retrieved from the legislators-current dataset of the congress-legislators repository, a database “maintained through a combination of manual edits by volunteers (from GovTrack, ProPublica, MapLight, FiveThirtyEight, and others) and automated imports from a variety of sources.”

You will use this line of code to download the RData file from the course website and open it directly in RStudio. In theory, you could also download the file and point the load() function to that file on your computer, but please use this line so your code is easily reproducible on any computer for grading purposes.

load(url('https://dssoc.github.io/datasets/congress.RData'))
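
If you are unsure which objects a load() call created, note that load() invisibly returns their names. The sketch below demonstrates this with a temporary file and two made-up objects so that it runs without a network connection; the same pattern should apply to the load(url(...)) line above.

```r
# Self-contained demo: save two toy objects to a temporary .RData file,
# then load it the same way you would load the course file.
toy_a <- data.frame(x = 1:3)
toy_b <- c("a", "b")
path <- tempfile(fileext = ".RData")
save(toy_a, toy_b, file = path)
rm(toy_a, toy_b)

# load() invisibly returns the names of the objects it restored.
loaded <- load(path)
print(loaded)
```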

This dataset consists of two dataframes:

  • congress: basic information like name, birthdate, state of representation, gender, and political party for each member of congress.
  • congress_contact: contact information including social media accounts and phone number for each member of congress.

congress Dataframe

Each row in this dataframe corresponds to a member of congress. type, party, and gender are factor variables and birthdate is a parsed date column.

congress %>% summary()
##  bioguide_id         full_name          type             party    
##  Length:539         Length:539         rep:439   Democrat   :273  
##  Class :character   Class :character   sen:100   Independent:  2  
##  Mode  :character   Mode  :character             Republican :264  
##                                                                   
##                                                                   
##                                                                   
##     state             birthdate          gender    birthyear   
##  Length:539         Min.   :1933-06-09   F:147   Min.   :1933  
##  Class :character   1st Qu.:1953-04-01   M:392   1st Qu.:1953  
##  Mode  :character   Median :1961-03-07           Median :1961  
##                     Mean   :1961-12-06           Mean   :1961  
##                     3rd Qu.:1970-10-02           3rd Qu.:1970  
##                     Max.   :1995-08-01           Max.   :1995
congress %>% head()
##   bioguide_id            full_name type    party state  birthdate gender
## 1     B000944        Sherrod Brown  sen Democrat    OH 1952-11-09      M
## 2     C000127       Maria Cantwell  sen Democrat    WA 1958-10-13      F
## 3     C000141   Benjamin L. Cardin  sen Democrat    MD 1943-10-05      M
## 4     C000174     Thomas R. Carper  sen Democrat    DE 1947-01-23      M
## 5     C001070 Robert P. Casey, Jr.  sen Democrat    PA 1960-04-13      M
## 6     F000062     Dianne Feinstein  sen Democrat    CA 1933-06-22      F
##   birthyear
## 1      1952
## 2      1958
## 3      1943
## 4      1947
## 5      1960
## 6      1933

Here are some important notes about some of the columns:

What is bioguide_id?

This is a unique identifier for each member of congress. You will want to use this for data merging or other tasks that require unique identifiers because there is always the possibility that two congress members will have the same full name. You can find more information on congress.gov.

congress %>% 
  select(bioguide_id, full_name) %>% 
  head()
##   bioguide_id            full_name
## 1     B000944        Sherrod Brown
## 2     C000127       Maria Cantwell
## 3     C000141   Benjamin L. Cardin
## 4     C000174     Thomas R. Carper
## 5     C001070 Robert P. Casey, Jr.
## 6     F000062     Dianne Feinstein
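
To illustrate why the identifier matters for merging, here is a small sketch with made-up rows (the ids, names, and handles are hypothetical). Two members share a full name, and joining on bioguide_id still pairs each one with the right contact row.

```r
library(dplyr)

# Hypothetical toy rows with the same key column as the real dataframes.
toy_congress <- data.frame(
  bioguide_id = c("X000001", "X000002", "X000003"),
  full_name = c("Pat Smith", "Pat Smith", "Lee Jones")
)
toy_contact <- data.frame(
  bioguide_id = c("X000003", "X000001", "X000002"),
  twitter = c("LeeJones", "PatSmithA", "PatSmithB")
)

# Joining on the unique id keeps the two "Pat Smith" rows distinct.
merged <- toy_congress %>%
  inner_join(toy_contact, by = "bioguide_id")

merged
```

Joining on full_name instead would match each "Pat Smith" to both contact rows and produce duplicates.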

Senators vs. Representatives

In United States politics, members of congress are divided into two groups: senators and representatives. This information is stored in the type column. To get only senators, you can use type == 'sen', and to get only representatives, you can use type == 'rep'.

congress %>% 
  count(type)
##   type   n
## 1  rep 439
## 2  sen 100
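
A minimal sketch of those two conditions inside filter(), using a made-up dataframe with the same type column (the names are hypothetical):

```r
library(dplyr)

# Hypothetical toy rows; type is a factor, as in the real dataframe.
toy_congress <- data.frame(
  full_name = c("Pat Smith", "Lee Jones", "Sam Park"),
  type = factor(c("sen", "rep", "sen"))
)

senators <- toy_congress %>% filter(type == "sen")
representatives <- toy_congress %>% filter(type == "rep")

nrow(senators)         # 2
nrow(representatives)  # 1
```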

Political Parties

In United States politics, there are currently three party affiliations represented in congress: Democrats, Republicans, and Independents. Note that there are far fewer Independents than other members of congress. For some problems, you will be asked to filter out Independents.

congress %>% 
  count(party)
##         party   n
## 1    Democrat 273
## 2 Independent   2
## 3  Republican 264
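
Filtering out Independents might look like the sketch below, which uses made-up rows. The droplevels() step is optional and is an assumption about what you may want: it removes the now-unused "Independent" factor level so it does not linger in later summaries.

```r
library(dplyr)

# Hypothetical toy rows; party is a factor, as in the real dataframe.
toy_congress <- data.frame(
  full_name = c("Pat Smith", "Lee Jones", "Sam Park", "Kim Diaz"),
  party = factor(c("Democrat", "Independent", "Republican", "Democrat"))
)

no_independents <- toy_congress %>%
  filter(party != "Independent") %>%
  droplevels()  # drop the unused "Independent" level

no_independents %>% count(party)
levels(no_independents$party)  # "Democrat" "Republican"
```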

Birthdate and Date Columns

For convenience, I have parsed the birthdate data into a date type. You can use the lubridate package to create new variables from a date column. Here I show how to get the name of the month associated with each birthdate.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
congress %>% 
  mutate(month=month(birthdate, label=TRUE)) %>% 
  select(full_name, month) %>% 
  head()
##              full_name month
## 1        Sherrod Brown   Nov
## 2       Maria Cantwell   Oct
## 3   Benjamin L. Cardin   Oct
## 4     Thomas R. Carper   Jan
## 5 Robert P. Casey, Jr.   Apr
## 6     Dianne Feinstein   Jun

congress_contact Dataframe

This dataframe includes contact information for each member of congress.

congress_contact %>% summary()
##  bioguide_id           phone             twitter            facebook        
##  Length:539         Length:539         Length:539         Length:539        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##    youtube           youtube_id        wikipedia_id      
##  Length:539         Length:539         Length:539        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
congress_contact %>% head()
##   bioguide_id        phone         twitter            facebook          youtube
## 1     B000944 202-224-2315 SenSherrodBrown SenatorSherrodBrown SherrodBrownOhio
## 2     C000127 202-224-3441 SenatorCantwell     senatorcantwell  SenatorCantwell
## 3     C000141 202-224-4524   SenatorCardin    senatorbencardin    senatorcardin
## 4     C000174 202-224-2441   SenatorCarper           tomcarper    senatorcarper
## 5     C001070 202-224-6324     SenBobCasey     SenatorBobCasey  SenatorBobCasey
## 6     F000062 202-224-3841    SenFeinstein    senatorfeinstein SenatorFeinstein
##                 youtube_id     wikipedia_id
## 1 UCgy8jfERh-t_ixkKKoCmglQ    Sherrod Brown
## 2 UCN52UDqKgvHRk39ncySrIMw   Maria Cantwell
## 3 UCiQaJnMzlfzzG3VESgyZChA       Ben Cardin
## 4 UCgLnvbKwu4B3navofj6Qvvw       Tom Carper
## 5 UCtVssXhx-KuZa-hSvnsnJ0A    Bob Casey Jr.
## 6 UCtVC--6LR0ff2aOP8THpuEw Dianne Feinstein

The committees.RData Dataset

Download committees.RData

load(url('https://dssoc.github.io/datasets/committees.RData'))

This dataset was obtained from the committees-current and committee-membership-current data from the same source described above. Conceptually, each committee is composed of several subcommittees, and we have membership data at both levels. The parsed version I have created consists of three dataframes:

  • committees: a list of committees and descriptions of their jurisdictions. The thomas_id column is a unique reference to that committee.
  • subcommittees: a list of subcommittees and their parent committee.
  • committee_memberships: committee and subcommittee membership for each member of congress. Note that each row corresponds to a membership in either a committee or a subcommittee, not both.

committees

committees %>% summary()
##   thomas_id             name               type    jurisdiction      
##  Length:52          Length:52          house :26   Length:52         
##  Class :character   Class :character   joint : 5   Class :character  
##  Mode  :character   Mode  :character   senate:21   Mode  :character
committees %>% head()
## # A tibble: 6 × 4
##   thomas_id name                                   type  jurisdiction           
##   <chr>     <chr>                                  <fct> <chr>                  
## 1 HSAG      House Committee on Agriculture         house The House Committee on…
## 2 HSAP      House Committee on Appropriations      house The House Committee on…
## 3 HSAS      House Committee on Armed Services      house The House Committee on…
## 4 HSBA      House Committee on Financial Services  house The House Financial Se…
## 5 HSBU      House Committee on the Budget          house The House Committee on…
## 6 HSED      House Committee on Education and Labor house The committee has legi…

subcommittees

subcommittees %>% summary()
##   thomas_id         committee_thomas_id     name          
##  Length:201         Length:201          Length:201        
##  Class :character   Class :character    Class :character  
##  Mode  :character   Mode  :character    Mode  :character
subcommittees %>% head()
## # A tibble: 6 × 3
##   thomas_id committee_thomas_id name                                           
##   <chr>     <chr>               <chr>                                          
## 1 HSAG15    HSAG                Conservation and Forestry                      
## 2 HSAG22    HSAG                Commodity Exchanges, Energy, and Credit        
## 3 HSAG16    HSAG                General Farm Commodities and Risk Management   
## 4 HSAG29    HSAG                Livestock and Foreign Agriculture              
## 5 HSAG14    HSAG                Biotechnology, Horticulture, and Research      
## 6 HSAG03    HSAG                Nutrition, Oversight, and Department Operations

committee_memberships

This dataframe is a little trickier to work with than the other two because it links members of congress to BOTH full committees and subcommittees. Because subcommittees are already nested within committees, this can present some challenges. For instance, if we want to get the number of full committees that congress members belong to, we wouldn’t want to count both committees and subcommittees together. I’ll show how to work with these multiple levels below.

The columns are fairly straightforward. thomas_id is a unique reference to the committee or subcommittee, and bioguide_id is a reference to the member of congress. party, rank, and title give more information about each membership.

committee_memberships %>% summary()
##   thomas_id         bioguide_id             party           rank       
##  Length:4009        Length:4009        majority:2168   Min.   : 1.000  
##  Class :character   Class :character   minority:1841   1st Qu.: 3.000  
##  Mode  :character   Mode  :character                   Median : 5.000  
##                                                        Mean   : 6.653  
##                                                        3rd Qu.: 8.000  
##                                                        Max.   :37.000  
##     title          
##  Length:4009       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
committee_memberships %>% head()
## # A tibble: 6 × 5
##   thomas_id bioguide_id party     rank title   
##   <chr>     <chr>       <fct>    <int> <chr>   
## 1 SSAF      S000770     majority     1 Chairman
## 2 SSAF      L000174     majority     2 <NA>    
## 3 SSAF      B000944     majority     3 <NA>    
## 4 SSAF      K000367     majority     4 <NA>    
## 5 SSAF      B001267     majority     5 <NA>    
## 6 SSAF      G000555     majority     6 <NA>

If, for instance, we wanted to get information about full committees only, we’d join the committee_memberships dataframe with the committees dataframe. Because this is an inner join, it drops all rows of committee_memberships that are not associated with a full committee.

full_committee_memberships <- committees %>% 
  inner_join(committee_memberships, by="thomas_id")
full_committee_memberships %>% head()
## # A tibble: 6 × 8
##   thomas_id name                         type  juris…¹ biogu…² party  rank title
##   <chr>     <chr>                        <fct> <chr>   <chr>   <fct> <int> <chr>
## 1 HSAG      House Committee on Agricult… house The Ho… S001157 majo…     1 Chair
## 2 HSAG      House Committee on Agricult… house The Ho… T000467 mino…     1 Rank…
## 3 HSAG      House Committee on Agricult… house The Ho… C001059 majo…     2 <NA> 
## 4 HSAG      House Committee on Agricult… house The Ho… S001189 mino…     2 <NA> 
## 5 HSAG      House Committee on Agricult… house The Ho… M000312 majo…     3 <NA> 
## 6 HSAG      House Committee on Agricult… house The Ho… C001087 mino…     3 <NA> 
## # … with abbreviated variable names ¹​jurisdiction, ²​bioguide_id
full_committee_memberships %>% summary()
##   thomas_id             name               type     jurisdiction      
##  Length:1393        Length:1393        house :919   Length:1393       
##  Class :character   Class :character   joint : 59   Class :character  
##  Mode  :character   Mode  :character   senate:415   Mode  :character  
##                                                                       
##                                                                       
##                                                                       
##  bioguide_id             party          rank           title          
##  Length:1393        majority:754   Min.   : 1.000   Length:1393       
##  Class :character   minority:639   1st Qu.: 4.000   Class :character  
##  Mode  :character                  Median : 8.000   Mode  :character  
##                                    Mean   : 9.772                     
##                                    3rd Qu.:14.000                     
##                                    Max.   :37.000

Similarly, joining the subcommittees dataframe with committee_memberships drops all rows of committee_memberships that are not associated with a subcommittee.

subcommittee_memberships <- subcommittees %>% 
  inner_join(committee_memberships, by="thomas_id")
subcommittee_memberships %>% head()
## # A tibble: 6 × 7
##   thomas_id committee_thomas_id name                   biogu…¹ party  rank title
##   <chr>     <chr>               <chr>                  <chr>   <fct> <int> <chr>
## 1 HSAG15    HSAG                Conservation and Fore… S001209 majo…     1 Chair
## 2 HSAG15    HSAG                Conservation and Fore… L000578 mino…     1 Rank…
## 3 HSAG15    HSAG                Conservation and Fore… V000132 majo…     2 <NA> 
## 4 HSAG15    HSAG                Conservation and Fore… D000616 mino…     2 <NA> 
## 5 HSAG15    HSAG                Conservation and Fore… P000597 majo…     3 <NA> 
## 6 HSAG15    HSAG                Conservation and Fore… A000372 mino…     3 <NA> 
## # … with abbreviated variable name ¹​bioguide_id
subcommittee_memberships %>% summary()
##   thomas_id         committee_thomas_id     name           bioguide_id       
##  Length:2616        Length:2616         Length:2616        Length:2616       
##  Class :character   Class :character    Class :character   Class :character  
##  Mode  :character   Mode  :character    Mode  :character   Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##       party           rank           title          
##  majority:1414   Min.   : 1.000   Length:2616       
##  minority:1202   1st Qu.: 2.000   Class :character  
##                  Median : 4.000   Mode  :character  
##                  Mean   : 4.992                     
##                  3rd Qu.: 7.000                     
##                  Max.   :31.000

You will likely need to use these code snippets for some assignment problems.
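
As an example of building on these snippets, the sketch below counts how many full committees each member belongs to, using made-up tables that mirror the real column names (all ids are hypothetical). It uses semi_join() rather than the inner_join() shown above; both keep only full-committee rows, but semi_join() does not add the committee columns.

```r
library(dplyr)

# Hypothetical toy tables mirroring the real column names.
toy_committees <- data.frame(
  thomas_id = c("AAAA", "BBBB"),
  name = c("Committee A", "Committee B")
)
toy_memberships <- data.frame(
  thomas_id = c("AAAA", "AAAA01", "BBBB", "AAAA", "BBBB02"),
  bioguide_id = c("X000001", "X000001", "X000001", "X000002", "X000002")
)

# semi_join() keeps only membership rows whose thomas_id is a full committee,
# so the subcommittee rows (AAAA01, BBBB02) are not counted.
full_counts <- toy_memberships %>%
  semi_join(toy_committees, by = "thomas_id") %>%
  count(bioguide_id, name = "n_committees")

full_counts
```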

The senator_tweets.RData Dataset

Download senator_tweets.RData

Use this code to load the dataset:

load(url('https://dssoc.github.io/datasets/senator_tweets.RData'))

To collect this data, I began with a large list of status ids from this page.

  • senator_tweet_ids: random sample of 500 status ids. You can request the data from Twitter using rtweet::lookup_tweets.
  • senator_tweet_sample: random sample of 990 tweets from the full dataset collected using rtweet::lookup_tweets. This sample is different from senator_tweet_ids.

senator_tweet_ids %>% head()
## [1] "1072617542051618817" "1004868708970491904" "725757591490576384" 
## [4] "915609208640303105"  "859025969197219840"  "734772157922856960"
senator_tweet_sample %>% summary()
##   status_id           user_id          screen_name       
##  Length:990         Length:990         Length:990        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##    created_at                   is_quote       is_retweet     
##  Min.   :2008-09-04 19:05:07   Mode :logical   Mode :logical  
##  1st Qu.:2015-07-01 01:55:14   FALSE:941       FALSE:830      
##  Median :2016-11-22 05:01:15   TRUE :49        TRUE :160      
##  Mean   :2016-06-28 19:24:19                                  
##  3rd Qu.:2017-11-03 08:37:29                                  
##  Max.   :2018-12-31 22:15:00                                  
##  favorite_count    retweet_count         text          
##  Min.   :    0.0   Min.   :    0.0   Length:990        
##  1st Qu.:    1.0   1st Qu.:    3.0   Class :character  
##  Median :    7.0   Median :    8.0   Mode  :character  
##  Mean   :  240.9   Mean   :  127.7                     
##  3rd Qu.:   55.0   3rd Qu.:   35.0                     
##  Max.   :29573.0   Max.   :10835.0
senator_tweet_sample %>% head()
## # A tibble: 6 × 9
##   status_id  user_id scree…¹ created_at          is_qu…² is_re…³ favor…⁴ retwe…⁵
##   <chr>      <chr>   <chr>   <dttm>              <lgl>   <lgl>     <int>   <int>
## 1 100407438… 816683… SenJoh… 2018-06-05 18:56:02 FALSE   TRUE          0       1
## 2 104786755… 816683… SenJoh… 2018-10-04 15:14:28 FALSE   TRUE          0     104
## 3 983905611… 181377… ChrisV… 2018-04-11 03:12:32 TRUE    FALSE       562     194
## 4 192311825… 181377… ChrisV… 2012-04-17 18:01:26 FALSE   FALSE         1      12
## 5 954383700… 181377… ChrisV… 2018-01-19 16:03:00 FALSE   FALSE        73      40
## 6 292297483… 181377… ChrisV… 2013-01-18 15:48:45 FALSE   FALSE         0       4
## # … with 1 more variable: text <chr>, and abbreviated variable names
## #   ¹​screen_name, ²​is_quote, ³​is_retweet, ⁴​favorite_count, ⁵​retweet_count

As you can see from the output below, the status ids in senator_tweet_sample and those in senator_tweet_ids are entirely disjoint, even though both were sampled from the same full set of status ids.

senator_tweet_sample$status_id %>% setdiff(senator_tweet_ids) %>% length()
## [1] 990
senator_tweet_ids %>% setdiff(senator_tweet_sample$status_id) %>% length()
## [1] 500

The senator_wiki.RData Dataset

This dataset includes text from the Wikipedia pages for each of the Senators (not Representatives) in our dataset that aligned with the Wikipedia ids in the congress-legislators repository.

Download senator_wiki.RData

Use this code to load the dataset:

load(url('https://dssoc.github.io/datasets/senator_wiki.RData'))

The summary and text columns were provided directly from Wikipedia, and I created the subtext column by truncating the text data to exactly 5000 characters.

senator_wiki %>% head()
## # A tibble: 6 × 5
##   bioguide_id wikipedia_id     summary                             text  subtext
##   <chr>       <chr>            <chr>                               <chr> <chr>  
## 1 B000944     Sherrod Brown    "Sherrod Campbell Brown (; born No… "She… "Sherr…
## 2 C001070     Bob Casey Jr.    "Robert Patrick Casey Jr. (born Ap… "Rob… "Rober…
## 3 F000062     Dianne Feinstein "Dianne Goldman Berman Feinstein (… "Dia… "Diann…
## 4 K000367     Amy Klobuchar    "Amy Jean Klobuchar ( KLOH-bə-shar… "Amy… "Amy J…
## 5 M000639     Bob Menendez     "Robert Menendez (; born January 1… "Rob… "Rober…
## 6 S000033     Bernie Sanders   "Bernard  Sanders (born September … "Ber… "Berna…
senator_wiki %>% summary()
##  bioguide_id        wikipedia_id         summary              text          
##  Length:59          Length:59          Length:59          Length:59         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##    subtext         
##  Length:59         
##  Class :character  
##  Mode  :character

Now I’ll show the mean and standard deviation of the lengths for each of the text columns. Notice that every value in the subtext column has exactly 5000 characters, because the column was created by truncating the text column. This truncated version has the added benefit of working well with topic modeling algorithms.

senator_wiki %>% 
  summarize(
    summary_av=mean(str_length(summary)), summary_sd=sd(str_length(summary)), 
    text_av=mean(str_length(text)), text_sd=sd(str_length(text)), 
    subtext_av=mean(str_length(subtext)), subtext_sd=sd(str_length(subtext))
  )
## # A tibble: 1 × 6
##   summary_av summary_sd text_av text_sd subtext_av subtext_sd
##        <dbl>      <dbl>   <dbl>   <dbl>      <dbl>      <dbl>
## 1      1384.       663.  28102.  17498.       5000          0
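
The code that created subtext is not shown here, but truncation of this kind can be sketched with stringr::str_sub. The example below uses made-up strings and a limit of 5 characters instead of 5000 so the result is easy to inspect; it is an illustration, not the code used to build the dataset.

```r
library(stringr)

# Hypothetical toy texts; the real column holds full Wikipedia articles.
toy_text <- c("abcdefghij", "klmnopqrstuvwxyz")

# Keep the first 5 characters of each string (the course data uses 5000).
toy_subtext <- str_sub(toy_text, 1, 5)

toy_subtext              # "abcde" "klmno"
str_length(toy_subtext)  # 5 5
```

Every truncated value comes out at exactly the limit only when each original string is at least that long, which is consistent with the standard deviation of 0 reported for subtext above.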