In this lab we will be learning about APIs by working with the Twitter API. This lab requires more setup than previous labs because we need to retrieve authorization permissions.

Be sure to do the required readings first. While many of the problems can be solved using approaches from the lecture videos, lab videos, or required readings, you may need to do some searching on the internet to solve some of the problems. This will be a valuable skill to learn as you develop as a data scientist.

This file should be submitted as both an R Markdown (.Rmd) file and a knitted HTML web page (.html) file, and you should start from the Lab markdown file (link to file) with the author name replaced with your own. While the R Markdown file should include all of the code you used to generate your solutions, the knitted HTML file should ONLY display the solutions to the assignment, and NOT the code you used to produce them. Imagine this is a document you will submit to a supervisor or professor: they should be able to reproduce the code/analysis if needed, but otherwise only need to see the results and write-up. The TA should be able to run the R Markdown file to directly generate the exact same HTML file you submitted. Submit to your TA via direct message on Slack by the deadline indicated on the course website.

Required reading:

Optional resources:

API Setup

Follow these three steps to set up your environment for the exercises.

1. Set up your API credentials with Twitter.

If you don’t already have one, you will need to create a new Twitter account. Next, you need to apply for a developer account and access credentials (API keys) for retrieving data. In Twitter’s getting started guide, navigate to the section titled “How to get access to the Twitter API.” This will include applying for a developer account and retrieving the app’s keys and tokens. This tutorial may also be helpful. You’ll use these in the next step.

2. Store your credentials.

Copy and paste the JSON below into a new file named api_credentials.json. Note that a JSON file is just a text file whose filename ends in “.json”, so you can use Notepad or any text editor to create it. From the instructions in Step 1 you have the API key, API secret key, access token, and access token secret - replace the appropriate values in the JSON file and save it (in a place you will remember).

{
  "app": "<app name here>",
  "api_key": "<api key here>",
  "api_secret_key": "<api secret key here>",
  "access_token": "<access token here>",
  "access_token_secret": "<access token secret here>",
  "bearer_token": "<unused>"
}

3. Authenticate your application.

After you have the credentials stored in the JSON file, run this code to authenticate the application (be sure to use the filename of your actual file). It simply reads the JSON data and passes the values directly to the create_token function from the rtweet package. Once you complete this step, you should be able to access Twitter data through the API. See the rtweet package documentation for how to access different types of data.

library(rjson)   # provides fromJSON()
library(rtweet)  # provides create_token() and rate_limit()

# this code will read credentials from the JSON file you created.
creds <- fromJSON(file = 'api_credentials.json') # POINT THIS TO YOUR FILE

# will allow you to authenticate your application
token <- create_token(
  app = creds$app,
  consumer_key = creds$api_key,
  consumer_secret = creds$api_secret_key,
  access_token = creds$access_token,
  access_secret = creds$access_token_secret)

# this allows you to check your remaining API calls (rate limits) per endpoint
lim <- rate_limit()
lim[, 1:4]
## # A tibble: 217 x 4
##    query                  limit remaining reset        
##    <chr>                  <int>     <int> <drtn>       
##  1 lists/list                15        15 15.00499 mins
##  2 lists/memberships         75        75 15.00499 mins
##  3 lists/subscribers/show    15        15 15.00499 mins
##  4 lists/members            900       900 15.00499 mins
##  5 lists/subscriptions       15        15 15.00499 mins
##  6 lists/show                75        75 15.00499 mins
##  7 lists/ownerships          15        15 15.00499 mins
##  8 lists/subscribers        180       180 15.00499 mins
##  9 lists/members/show        15        15 15.00499 mins
## 10 lists/statuses           900       900 15.00499 mins
## # ... with 207 more rows
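
As a quick sanity check that authentication worked, you can request a small amount of data. Below is a minimal sketch, assuming the authentication code above has already run and using rtweet 0.x function names (later versions renamed some functions); the @rstudio handle is used only as an example account:

```r
library(rtweet)

# look up public profile information for one account; returns a data
# frame with columns such as screen_name, friends_count, statuses_count
usr <- lookup_users("rstudio")
usr[, c("screen_name", "friends_count", "statuses_count")]
```

If this returns a one-row data frame rather than an error, your token is working.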


1. In your own words, describe what an application programming interface is, and why it is useful to data scientists/computational social scientists.

# Your answer here

2. Use the Twitter API to augment senator information with the number of friends and the number of statuses they have posted on Twitter.

Note: If you haven’t already, filter the congress dataframe to include only senators (type=='sen') - we will only use senators for this assignment.

Note: we don’t need to get the actual lists of followers or Tweets - we just need the number of friends and tweets (i.e. do not try to use get_friends).

Note: for various reasons, not all senators in our dataset have valid Twitter handles - that’s okay for the purpose of this assignment.

Hint: You may need to refer to the rtweet package documentation to see which function to use. We need to look up user-level information.
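
To illustrate the shape of that kind of call (this is a sketch of the general pattern, not the full solution, and assumes rtweet 0.x function names), a user-level lookup on a vector of handles returns one row per user with count columns you can merge back onto your dataframe:

```r
library(rtweet)

# example handles only -- in the assignment you would use the senators'
# Twitter handles from the congress dataframe
handles <- c("SenSchumer", "LisaMurkowski")
info <- lookup_users(handles)
info[, c("screen_name", "friends_count", "statuses_count")]
```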

# Your answer here

3. Calculate the average number of friends for each senator by political party. For instance, you should be able to say something of the form "Republicans have an average of X friends and Democrats have an average of Y friends." What conclusions can you draw from this result? What assumptions are needed to come to these conclusions?

# Your answer here
written response here

4. Use ggplot to create a box plot or violin plot showing the number of statuses that each senator has posted by gender. What does the visualization tell us? What advantage does a violin plot have over calculating averages (as we did in the previous question)?
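
If you haven’t built a violin plot before, here is a minimal self-contained sketch on a built-in dataset (mtcars), just to show the ggplot pieces involved - adapt the aesthetics to the senator data:

```r
library(ggplot2)

# toy example: distribution of fuel economy by cylinder count
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_violin() +
  labs(x = "Cylinders", y = "Miles per gallon")
```

Swapping geom_violin() for geom_boxplot() gives the box-plot version.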

# Your answer here

5. Use the Twitter API to retrieve the last 10 Tweets from the 5 oldest male and female senators in our dataset, and merge them with the senator dataframe. Then create a violin or box plot showing the number of favorites those Tweets received, grouped by the gender of the senator who published them.

# Your answer here

6. The Twitter API uses different rate limits for different API endpoints. The easiest way to extract a large number of historical Tweets is to use previously collected unique Tweet IDs, which you may be able to find at various sources on the web. For this problem, download the full Tweet text associated with the Tweet IDs in senator_tweet_ids (already loaded at the beginning of this markdown file), and generate a histogram showing the number of favorites those Tweets received.

Note: I originally downloaded the full set of Tweet IDs from this page, but I created a subset of 500 Tweets which is loaded at the beginning of this markdown file, so you should NOT download the data externally.

Note: these are NOT the same Tweet IDs as the Tweets stored in senator_tweets, so you will need to collect them using the API.
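
Because the statuses-lookup endpoint limits how many IDs you can request per call, one possible approach (a sketch assuming rtweet 0.x’s lookup_statuses; some versions batch internally, but chunking manually is a safe fallback) is to split the IDs into groups of 100:

```r
library(rtweet)

# senator_tweet_ids is loaded at the beginning of this markdown file
ids <- senator_tweet_ids

# split into chunks of up to 100 ids, look each chunk up, and stack
# the resulting data frames
chunks <- split(ids, ceiling(seq_along(ids) / 100))
tweets <- do.call(rbind, lapply(chunks, lookup_statuses))
```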

# Your answer here

7. Identify another API, whether it has an associated R package or not, and describe how you might use the data available from it in a social/data scientific research project, and more specifically in your final project.

Written answer here

8. Develop a hypothesis for one of the research questions you described in the previous weeks. You can choose a new topic if you are no longer interested in your old ones, but make sure you’ll be able to test the hypothesis using available data. For example, the hypothesis could be something like “H: when x does y, we see more z.” This hypothesis is testable if we have empirical data about x, y, and z. Think carefully about what you might and might not be able to measure.

# Your answer here