In this document I’ll first give detailed instructions for the labs and then give descriptions for each dataset we will be using this semester.
Each week’s lab should be submitted to your TA via Slack direct message by the deadline indicated on the course website. You should start with the markdown file linked at the top of the lab, and you will submit two files for each lab: an R Markdown (.Rmd) file containing solution code and written responses (if required) that answer each question, and an HTML file (.html) generated by knitting that file. The TA should be able to knit your R Markdown file to reproduce your html file exactly without any additional steps, and the knitted HTML file should display ONLY your code and the output needed to answer the question (please do not show intermediate output in your final product). The output from your solution code should be easy to read and you will lose points if your knitted html file includes extraneous output that makes your solution harder to read.
Be sure to do the required readings first. While many of the problems can be solved using approaches from the lecture videos, lab videos, or required readings, you may need to do some searching on the internet to solve some of the problems. This will be a valuable skill to learn as you develop your data science skillset. Finding answers on the web might be hard at first as you learn the language of coding, so feel free to share web links on the #lab-help channel of Slack.
Your solution to every problem should be general enough that it would produce the same output if we swapped the input data for another dataframe with the same columns but a different row order. This means, among other things, that you should not rely on numerical subscripts (e.g. my_dataframe[1, ]) to get a specific row in your code unless you sorted the rows first, because your code should still work when the dataframe rows occur in a different order.
Your solutions also cannot include hard-coded answers: they must produce the output directly in the code using the provided datasets. You may not enter values related to your solution in the code (except as comments). Your code should always appear in the code blocks that include the text # your code here (or something similar).
Some solutions require written text instead of a code block. In that case, I will provide a block that begins and ends with “```” instead of beginning with “```{r}” and ending with “```” as a code block does.
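As a rough illustration of the row-order rule above, here is a minimal sketch using a made-up dataframe (the `toy` data and its values are hypothetical, not one of the course datasets). It contrasts picking a row by position with picking it by a stated rule, assuming the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration.
toy <- tibble(
  name = c("Ana", "Ben", "Cy"),
  birthyear = c(1960, 1943, 1975)
)

# Fragile: the result depends on whichever row happens to come first.
fragile <- toy[1, ]

# Robust: the rule for choosing the row is stated in the code.
oldest <- toy %>% slice_min(birthyear, n = 1)

# Reordering the rows changes the fragile result but not the robust one.
reordered <- toy %>% arrange(desc(name))
fragile_reordered <- reordered[1, ]
oldest_reordered <- reordered %>% slice_min(birthyear, n = 1)
```

Here `fragile` and `fragile_reordered` pick different people, while `oldest` and `oldest_reordered` agree.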
Visualizations should be readable. Each plot should have axis labels, all labels must be legible, and we should easily be able to tell what your figure is showing. Failure to make clear visualizations will result in point deductions.
This is a list of all Labs and their associated markdown files:
Now I will describe the datasets we will use for the class. We will use a total of four different RData files:
congress.RData: contains basic information about each member of congress.
committees.RData: contains information about congressional committees and subcommittees and the members of congress who belong to them.
senator_tweets.RData: a small sample of Tweets from accounts associated with each senator (a subset of all congress members) in the congress dataset.
senator_wiki.RData: text from Wikipedia pages of each Senator in the congress dataset with a valid Wikipedia ID.
congress.RData Dataset
This dataset contains information about each member of congress who was in office as of January 11, 2021 and their committee memberships. It was retrieved from the congress-current.csv file of the congress-legislators repository, a database “maintained through a combination of manual edits by volunteers (from GovTrack, ProPublica, MapLight, FiveThirtyEight, and others) and automated imports from a variety of sources.” From this source we use the legislators-current dataset.
You will use the following line of code to download the RData file from the course website and open it directly in RStudio. In theory, you could also download the file and point the load() function to that file on your computer, but please use this line so your code is easily reproducible on any computer for grading purposes.
load(url('https://dssoc.github.io/datasets/congress.RData'))
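One detail that can be surprising is that load() does not return the data. It restores objects into your workspace under the names they were saved with. The sketch below shows this with a throwaway object and a temporary file (both hypothetical, not part of the course materials).

```r
# Hypothetical object and temporary file, used only for illustration.
example_df <- data.frame(id = c("a", "b"), value = c(1, 2))
path <- tempfile(fileext = ".RData")
save(example_df, file = path)

# Remove the object, then restore it from the file.
rm(example_df)
loaded_names <- load(path)

# load() returns the names of the restored objects, not the objects themselves.
print(loaded_names)
```

After load() runs, `example_df` exists again without any assignment, which is the same way the dataframes described here become available after the load(url(...)) line.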
This dataset consists of two different variables:
congress: basic information like name, birthdate, state of representation, gender, and political party for each member of congress.
congress_contact: contact information including social media accounts and phone number for each member of congress.
congress Dataframe
Each row in this dataframe corresponds to a member of congress. type, party, and gender are factor variables and birthdate is a parsed date column.
congress %>% summary()
## bioguide_id full_name type party
## Length:539 Length:539 rep:439 Democrat :273
## Class :character Class :character sen:100 Independent: 2
## Mode :character Mode :character Republican :264
##
##
##
## state birthdate gender birthyear
## Length:539 Min. :1933-06-09 F:147 Min. :1933
## Class :character 1st Qu.:1953-04-01 M:392 1st Qu.:1953
## Mode :character Median :1961-03-07 Median :1961
## Mean :1961-12-06 Mean :1961
## 3rd Qu.:1970-10-02 3rd Qu.:1970
## Max. :1995-08-01 Max. :1995
congress %>% head()
## bioguide_id full_name type party state birthdate gender
## 1 B000944 Sherrod Brown sen Democrat OH 1952-11-09 M
## 2 C000127 Maria Cantwell sen Democrat WA 1958-10-13 F
## 3 C000141 Benjamin L. Cardin sen Democrat MD 1943-10-05 M
## 4 C000174 Thomas R. Carper sen Democrat DE 1947-01-23 M
## 5 C001070 Robert P. Casey, Jr. sen Democrat PA 1960-04-13 M
## 6 F000062 Dianne Feinstein sen Democrat CA 1933-06-22 F
## birthyear
## 1 1952
## 2 1958
## 3 1943
## 4 1947
## 5 1960
## 6 1933
Here are some important notes about some of the columns:
bioguide_id is a unique identifier for each member of congress. You will want to use this for data merging or other tasks that require unique identifiers because there is always the possibility that two congress members will have the same full name. You can find more information on congress.gov.
congress %>%
  select(bioguide_id, full_name) %>%
  head()
## bioguide_id full_name
## 1 B000944 Sherrod Brown
## 2 C000127 Maria Cantwell
## 3 C000141 Benjamin L. Cardin
## 4 C000174 Thomas R. Carper
## 5 C001070 Robert P. Casey, Jr.
## 6 F000062 Dianne Feinstein
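To make the merging advice concrete, here is a minimal sketch that joins two made-up dataframes on bioguide_id. The ids, names, and phone numbers are hypothetical and only mimic the shape of congress and congress_contact.

```r
library(dplyr)

# Hypothetical rows: two members share a full name but have distinct ids.
members <- tibble(
  bioguide_id = c("X000001", "X000002", "X000003"),
  full_name = c("Pat Smith", "Pat Smith", "Lee Jones")
)
contacts <- tibble(
  bioguide_id = c("X000001", "X000002", "X000003"),
  phone = c("555-0101", "555-0102", "555-0103")
)

# Joining on the unique id gives exactly one contact row per member.
by_id <- members %>% left_join(contacts, by = "bioguide_id")

# Names are not unique here, so they would be an ambiguous join key.
n_unique_names <- n_distinct(members$full_name)
```

In this toy data there are three members but only two distinct names, which is why the id is the safer key.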
In United States politics, members of congress are divided into two groups: senators and representatives. This information is stored in the type column. To get only senators, you can use type == 'sen' and to get representatives you can use type == 'rep'.
congress %>%
  count(type)
## type n
## 1 rep 439
## 2 sen 100
In United States politics, there are currently three political parties represented in congress: Democrats, Republicans, and Independents. Note that there are far fewer Independents than other members of congress. For some problems, you will be asked to filter out Independents.
congress %>%
  count(party)
## party n
## 1 Democrat 273
## 2 Independent 2
## 3 Republican 264
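For problems that ask you to filter out Independents, something like the following sketch may help. It uses a hypothetical toy dataframe with a factor party column. Because party is a factor, the unused "Independent" level stays attached after filtering unless it is dropped, which can show up in functions such as levels() or table().

```r
library(dplyr)

# Hypothetical toy data with the same kind of factor column as congress$party.
toy_party <- tibble(
  bioguide_id = c("X000001", "X000002", "X000003", "X000004"),
  party = factor(c("Democrat", "Independent", "Republican", "Democrat"))
)

major_parties <- toy_party %>%
  filter(party != "Independent") %>%   # keep Democrats and Republicans
  mutate(party = droplevels(party))    # remove the now-empty factor level

levels(major_parties$party)
```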
For convenience, I have parsed the birthdate data into a date type. You can use the lubridate package to create new variables from a date column. Here I show how to get the name of the month associated with each birthdate.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
congress %>%
  mutate(month=month(birthdate, label=TRUE)) %>%
  select(full_name, month) %>%
  head()
## full_name month
## 1 Sherrod Brown Nov
## 2 Maria Cantwell Oct
## 3 Benjamin L. Cardin Oct
## 4 Thomas R. Carper Jan
## 5 Robert P. Casey, Jr. Apr
## 6 Dianne Feinstein Jun
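The same pattern works for other date-derived variables. The sketch below uses a hypothetical two-row dataframe and computes the birth year, the weekday, and the age in whole years on a reference date. The reference date of January 11, 2021 is chosen here only because it matches the in-office date mentioned earlier.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data with a parsed date column.
toy_births <- tibble(
  full_name = c("Member A", "Member B"),
  birthdate = as.Date(c("1952-11-09", "1960-04-13"))
)
reference_date <- as.Date("2021-01-11")

toy_dates <- toy_births %>%
  mutate(
    birth_year = year(birthdate),
    birth_weekday = wday(birthdate, label = TRUE),
    # interval() spans the two dates; time_length() converts it to years.
    age = floor(time_length(interval(birthdate, reference_date), "years"))
  )
```

With these inputs the ages come out as 68 and 60.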
congress_contact Dataframe
This dataframe includes contact information for each member of congress.
congress_contact %>% summary()
## bioguide_id phone twitter facebook
## Length:539 Length:539 Length:539 Length:539
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## youtube youtube_id wikipedia_id
## Length:539 Length:539 Length:539
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
congress_contact %>% head()
## bioguide_id phone twitter facebook youtube
## 1 B000944 202-224-2315 SenSherrodBrown SenatorSherrodBrown SherrodBrownOhio
## 2 C000127 202-224-3441 SenatorCantwell senatorcantwell SenatorCantwell
## 3 C000141 202-224-4524 SenatorCardin senatorbencardin senatorcardin
## 4 C000174 202-224-2441 SenatorCarper tomcarper senatorcarper
## 5 C001070 202-224-6324 SenBobCasey SenatorBobCasey SenatorBobCasey
## 6 F000062 202-224-3841 SenFeinstein senatorfeinstein SenatorFeinstein
## youtube_id wikipedia_id
## 1 UCgy8jfERh-t_ixkKKoCmglQ Sherrod Brown
## 2 UCN52UDqKgvHRk39ncySrIMw Maria Cantwell
## 3 UCiQaJnMzlfzzG3VESgyZChA Ben Cardin
## 4 UCgLnvbKwu4B3navofj6Qvvw Tom Carper
## 5 UCtVssXhx-KuZa-hSvnsnJ0A Bob Casey Jr.
## 6 UCtVC--6LR0ff2aOP8THpuEw Dianne Feinstein
committees.RData Dataset
load(url('https://dssoc.github.io/datasets/committees.RData'))
This dataset was obtained from the committees-current and committee-membership-current data from the same source described above. Conceptually, each committee is composed of several subcommittees, and we have membership data at both levels. The parsed version I have created consists of three dataframes:
committees: a list of committees and descriptions of their jurisdictions. The thomas_id column is a unique reference to that committee.
subcommittees: a list of subcommittees and their parent committee.
committee_memberships: committee and subcommittee membership for each member of congress. Note that each row corresponds to a membership in either a committee or a subcommittee, not both.
committees %>% summary()
## thomas_id name type jurisdiction
## Length:52 Length:52 house :26 Length:52
## Class :character Class :character joint : 5 Class :character
## Mode :character Mode :character senate:21 Mode :character
committees %>% head()
## # A tibble: 6 × 4
## thomas_id name type jurisdiction
## <chr> <chr> <fct> <chr>
## 1 HSAG House Committee on Agriculture house The House Committee on…
## 2 HSAP House Committee on Appropriations house The House Committee on…
## 3 HSAS House Committee on Armed Services house The House Committee on…
## 4 HSBA House Committee on Financial Services house The House Financial Se…
## 5 HSBU House Committee on the Budget house The House Committee on…
## 6 HSED House Committee on Education and Labor house The committee has legi…
subcommittees
subcommittees %>% summary()
## thomas_id committee_thomas_id name
## Length:201 Length:201 Length:201
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
subcommittees %>% head()
## # A tibble: 6 × 3
## thomas_id committee_thomas_id name
## <chr> <chr> <chr>
## 1 HSAG15 HSAG Conservation and Forestry
## 2 HSAG22 HSAG Commodity Exchanges, Energy, and Credit
## 3 HSAG16 HSAG General Farm Commodities and Risk Management
## 4 HSAG29 HSAG Livestock and Foreign Agriculture
## 5 HSAG14 HSAG Biotechnology, Horticulture, and Research
## 6 HSAG03 HSAG Nutrition, Oversight, and Department Operations
committee_memberships
This dataframe is a little trickier to work with than the other two because it links members of congress to BOTH full committees and subcommittees. Because subcommittees are already nested within committees, it can present some challenges. For instance, if we want to get the number of full committees that congress members belong to, we wouldn’t want to count both committees and subcommittees together. I’ll show how to work with these multiple levels below.
The columns are fairly straightforward. thomas_id is a unique reference to the committee or subcommittee, and bioguide_id is a reference to the member of congress. party, rank, and title give more information about the particular kind of relationship.
committee_memberships %>% summary()
## thomas_id bioguide_id party rank
## Length:4009 Length:4009 majority:2168 Min. : 1.000
## Class :character Class :character minority:1841 1st Qu.: 3.000
## Mode :character Mode :character Median : 5.000
## Mean : 6.653
## 3rd Qu.: 8.000
## Max. :37.000
## title
## Length:4009
## Class :character
## Mode :character
##
##
##
committee_memberships %>% head()
## # A tibble: 6 × 5
## thomas_id bioguide_id party rank title
## <chr> <chr> <fct> <int> <chr>
## 1 SSAF S000770 majority 1 Chairman
## 2 SSAF L000174 majority 2 <NA>
## 3 SSAF B000944 majority 3 <NA>
## 4 SSAF K000367 majority 4 <NA>
## 5 SSAF B001267 majority 5 <NA>
## 6 SSAF G000555 majority 6 <NA>
If, for instance, we wanted to get information about full committees only, we’d join the committees dataframe with the committee_memberships dataframe. Because this is an inner join on thomas_id, it will filter out all rows of committee_memberships that are not associated with a full committee.
# inner_join() takes the key column through `by`, not `on`
full_committee_memberships <- committees %>%
  inner_join(committee_memberships, by = "thomas_id")
full_committee_memberships %>% head()
## # A tibble: 6 × 8
## thomas_id name type juris…¹ biogu…² party rank title
## <chr> <chr> <fct> <chr> <chr> <fct> <int> <chr>
## 1 HSAG House Committee on Agricult… house The Ho… S001157 majo… 1 Chair
## 2 HSAG House Committee on Agricult… house The Ho… T000467 mino… 1 Rank…
## 3 HSAG House Committee on Agricult… house The Ho… C001059 majo… 2 <NA>
## 4 HSAG House Committee on Agricult… house The Ho… S001189 mino… 2 <NA>
## 5 HSAG House Committee on Agricult… house The Ho… M000312 majo… 3 <NA>
## 6 HSAG House Committee on Agricult… house The Ho… C001087 mino… 3 <NA>
## # … with abbreviated variable names ¹jurisdiction, ²bioguide_id
full_committee_memberships %>% summary()
## thomas_id name type jurisdiction
## Length:1393 Length:1393 house :919 Length:1393
## Class :character Class :character joint : 59 Class :character
## Mode :character Mode :character senate:415 Mode :character
##
##
##
## bioguide_id party rank title
## Length:1393 majority:754 Min. : 1.000 Length:1393
## Class :character minority:639 1st Qu.: 4.000 Class :character
## Mode :character Median : 8.000 Mode :character
## Mean : 9.772
## 3rd Qu.:14.000
## Max. :37.000
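As one possible way to answer the "number of full committees per member" question raised earlier, the sketch below uses hypothetical miniature versions of committees and committee_memberships. It uses semi_join(), a filtering join that keeps matching membership rows without adding the committee columns. This is an alternative to the inner join above, not a replacement for it.

```r
library(dplyr)

# Hypothetical miniature dataframes; ids and names are made up.
toy_committees <- tibble(
  thomas_id = c("AAAA", "BBBB"),
  name = c("Committee A", "Committee B")
)
toy_memberships <- tibble(
  thomas_id = c("AAAA", "AAAA01", "BBBB", "AAAA", "BBBB02"),
  bioguide_id = c("X000001", "X000001", "X000001", "X000002", "X000002")
)

# Keep only rows whose thomas_id is a full committee, then count per member.
full_counts <- toy_memberships %>%
  semi_join(toy_committees, by = "thomas_id") %>%
  count(bioguide_id, name = "n_full_committees")
```

The subcommittee rows (AAAA01 and BBBB02) are excluded, so the first member is counted on two full committees and the second on one.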
Similarly, joining the subcommittees dataframe with committee_memberships will filter out all rows of committee_memberships that are not associated with a subcommittee.
# inner_join() takes the key column through `by`, not `on`
subcommittee_memberships <- subcommittees %>%
  inner_join(committee_memberships, by = "thomas_id")
subcommittee_memberships %>% head()
## # A tibble: 6 × 7
## thomas_id committee_thomas_id name biogu…¹ party rank title
## <chr> <chr> <chr> <chr> <fct> <int> <chr>
## 1 HSAG15 HSAG Conservation and Fore… S001209 majo… 1 Chair
## 2 HSAG15 HSAG Conservation and Fore… L000578 mino… 1 Rank…
## 3 HSAG15 HSAG Conservation and Fore… V000132 majo… 2 <NA>
## 4 HSAG15 HSAG Conservation and Fore… D000616 mino… 2 <NA>
## 5 HSAG15 HSAG Conservation and Fore… P000597 majo… 3 <NA>
## 6 HSAG15 HSAG Conservation and Fore… A000372 mino… 3 <NA>
## # … with abbreviated variable name ¹bioguide_id
subcommittee_memberships %>% summary()
## thomas_id committee_thomas_id name bioguide_id
## Length:2616 Length:2616 Length:2616 Length:2616
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## party rank title
## majority:1414 Min. : 1.000 Length:2616
## minority:1202 1st Qu.: 2.000 Class :character
## Median : 4.000 Mode :character
## Mean : 4.992
## 3rd Qu.: 7.000
## Max. :31.000
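If you also need the name of the parent committee next to each subcommittee membership, one option is a second join on committee_thomas_id. The sketch below uses hypothetical miniature dataframes. The `parent_lookup` helper and the `committee_name` column are names introduced here for illustration.

```r
library(dplyr)

# Hypothetical miniature dataframes; ids and names are made up.
toy_committees <- tibble(
  thomas_id = c("AAAA", "BBBB"),
  name = c("Committee A", "Committee B")
)
toy_subcommittees <- tibble(
  thomas_id = c("AAAA01", "BBBB02"),
  committee_thomas_id = c("AAAA", "BBBB"),
  name = c("Subcommittee A1", "Subcommittee B2")
)
toy_memberships <- tibble(
  thomas_id = c("AAAA", "AAAA01", "BBBB02"),
  bioguide_id = c("X000001", "X000001", "X000002")
)

# Rename the committee columns so they do not clash with subcommittee columns.
parent_lookup <- toy_committees %>%
  select(committee_thomas_id = thomas_id, committee_name = name)

sub_with_parent <- toy_subcommittees %>%
  inner_join(toy_memberships, by = "thomas_id") %>%
  left_join(parent_lookup, by = "committee_thomas_id")
```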
You will likely need to use these code snippets for some assignment problems.
senator_tweets.RData Dataset
Use this code to load the dataset:
load(url('https://dssoc.github.io/datasets/senator_tweets.RData'))
To collect this data, I began with a large list of status ids from this page. The dataset contains two objects:
senator_tweet_ids: random sample of 500 status ids. You can request the data from Twitter using rtweet::lookup_tweets.
senator_tweet_sample: random sample of 990 tweets from the full dataset collected using rtweet::lookup_tweets. This sample is different from senator_tweet_ids.
senator_tweet_ids %>% head()
## [1] "1072617542051618817" "1004868708970491904" "725757591490576384"
## [4] "915609208640303105" "859025969197219840" "734772157922856960"
senator_tweet_sample %>% summary()
## status_id user_id screen_name
## Length:990 Length:990 Length:990
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## created_at is_quote is_retweet
## Min. :2008-09-04 19:05:07 Mode :logical Mode :logical
## 1st Qu.:2015-07-01 01:55:14 FALSE:941 FALSE:830
## Median :2016-11-22 05:01:15 TRUE :49 TRUE :160
## Mean :2016-06-28 19:24:19
## 3rd Qu.:2017-11-03 08:37:29
## Max. :2018-12-31 22:15:00
## favorite_count retweet_count text
## Min. : 0.0 Min. : 0.0 Length:990
## 1st Qu.: 1.0 1st Qu.: 3.0 Class :character
## Median : 7.0 Median : 8.0 Mode :character
## Mean : 240.9 Mean : 127.7
## 3rd Qu.: 55.0 3rd Qu.: 35.0
## Max. :29573.0 Max. :10835.0
senator_tweet_sample %>% head()
## # A tibble: 6 × 9
## status_id user_id scree…¹ created_at is_qu…² is_re…³ favor…⁴ retwe…⁵
## <chr> <chr> <chr> <dttm> <lgl> <lgl> <int> <int>
## 1 100407438… 816683… SenJoh… 2018-06-05 18:56:02 FALSE TRUE 0 1
## 2 104786755… 816683… SenJoh… 2018-10-04 15:14:28 FALSE TRUE 0 104
## 3 983905611… 181377… ChrisV… 2018-04-11 03:12:32 TRUE FALSE 562 194
## 4 192311825… 181377… ChrisV… 2012-04-17 18:01:26 FALSE FALSE 1 12
## 5 954383700… 181377… ChrisV… 2018-01-19 16:03:00 FALSE FALSE 73 40
## 6 292297483… 181377… ChrisV… 2013-01-18 15:48:45 FALSE FALSE 0 4
## # … with 1 more variable: text <chr>, and abbreviated variable names
## # ¹screen_name, ²is_quote, ³is_retweet, ⁴favorite_count, ⁵retweet_count
As you can see from the output below, the status_ids of the full tweets and the standalone ids are entirely disjoint. They are sampled from the same full set of status_ids, though.
senator_tweet_sample$status_id %>% setdiff(senator_tweet_ids) %>% length()
## [1] 990
senator_tweet_ids %>% setdiff(senator_tweet_sample$status_id) %>% length()
## [1] 500
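The reasoning behind this check is that setdiff(a, b) returns the elements of a that are not in b, so getting back as many elements as a contains means nothing is shared. A more direct test is intersect(). The sketch below shows both on hypothetical id vectors.

```r
# Hypothetical id vectors, stored as character strings like the status ids.
sample_ids <- c("1001", "1002", "1003")
full_tweet_ids <- c("2001", "2002")

# Elements of the first vector that do not appear in the second.
only_in_sample <- setdiff(sample_ids, full_tweet_ids)

# Elements that appear in both vectors; empty when the sets are disjoint.
shared <- intersect(sample_ids, full_tweet_ids)
is_disjoint <- length(shared) == 0
```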
senator_wiki.RData Dataset
This dataset includes text from the Wikipedia pages for each of the Senators (not Representatives) in our dataset that aligned with the Wikipedia ids in the congress-legislators repository.
Use this code to load the dataset:
load(url('https://dssoc.github.io/datasets/senator_wiki.RData'))
The summary and text columns were provided directly from Wikipedia, and I created the subtext column by truncating the text data to exactly 5000 characters.
senator_wiki %>% head()
## # A tibble: 6 × 5
## bioguide_id wikipedia_id summary text subtext
## <chr> <chr> <chr> <chr> <chr>
## 1 B000944 Sherrod Brown "Sherrod Campbell Brown (; born No… "She… "Sherr…
## 2 C001070 Bob Casey Jr. "Robert Patrick Casey Jr. (born Ap… "Rob… "Rober…
## 3 F000062 Dianne Feinstein "Dianne Goldman Berman Feinstein (… "Dia… "Diann…
## 4 K000367 Amy Klobuchar "Amy Jean Klobuchar ( KLOH-bə-shar… "Amy… "Amy J…
## 5 M000639 Bob Menendez "Robert Menendez (; born January 1… "Rob… "Rober…
## 6 S000033 Bernie Sanders "Bernard Sanders (born September … "Ber… "Berna…
senator_wiki %>% summary()
## bioguide_id wikipedia_id summary text
## Length:59 Length:59 Length:59 Length:59
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## subtext
## Length:59
## Class :character
## Mode :character
Now I’ll show the average length and standard deviation for each of the text columns. Notice that every subtext value has exactly 5000 characters because the column was created by truncating the text column. This truncated version has the added benefit of working well with topic modeling algorithms.
senator_wiki %>%
  summarize(
    summary_av=mean(str_length(summary)), summary_sd=sd(str_length(summary)),
    text_av=mean(str_length(text)), text_sd=sd(str_length(text)),
    subtext_av=mean(str_length(subtext)), subtext_sd=sd(str_length(subtext))
  )
## # A tibble: 1 × 6
## summary_av summary_sd text_av text_sd subtext_av subtext_sd
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1384. 663. 28102. 17498. 5000 0
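I don't know exactly how the subtext column was produced, but a truncation like this can be sketched with stringr::str_sub(), shown here on hypothetical short strings with a limit of 10 characters instead of 5000. Note that truncation only yields exactly the limit when the input is at least that long.

```r
library(stringr)

# Hypothetical strings: one longer than the limit, one shorter.
toy_text <- c(strrep("a", 12), "short")

# Keep characters 1 through 10 of each string.
toy_subtext <- str_sub(toy_text, 1, 10)

str_length(toy_subtext)
# → 10 5
```

A standard deviation of 0 for the subtext lengths above is what you would see if every page's text had at least 5000 characters.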