In this lab we will be learning about APIs by working with the Twitter API. This lab requires more setup than previous labs because we need to retrieve authorization permissions.
See the “Instructions” section of the Introduction to Lab Assignments page for more information about the labs. That page also gives descriptions for the datasets we will be using.
Required reading:
Optional resources:
Follow these two steps to set up your program for the exercises.
If you don’t already have one, you will need to create a new Twitter account. Next, you need to apply for a developer account and access credentials (API keys) for retrieving data. In Twitter’s getting started guide, navigate to the section titled “How to get access to the Twitter API.” This will include applying for a developer account and retrieving the app’s keys and tokens. This tutorial may also be helpful. You’ll use these keys and tokens in the next step.
Copy and paste the JSON below into a new file named api_credentials.json. Note that a JSON file is just a text file whose filename ends in “.json”, so you can create it with Notepad or any text editor. From the instructions in Step 1 we have the API key, API secret key, access token, and access token secret; replace the appropriate values in the JSON file and save it somewhere you will remember.
{
"app": "<app name here>",
"api_key": "<api key here>",
"api_secret_key": "<api secret key here>",
"access_token": "<access token here>",
"access_token_secret": "<access token secret here>",
"bearer_token": "<unused>"
}
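Before authenticating, it can help to confirm that every placeholder in the file was actually replaced. The following is a minimal base-R sketch, not part of rtweet or rjson: the helper name find_unfilled_fields and the example values are made up for illustration, and the check assumes that unfilled values are either empty or still start with "<" as in the template above.

```r
# Hypothetical helper (not part of rtweet or rjson): returns the names of
# required fields that are missing, empty, or still hold a "<...>" placeholder.
find_unfilled_fields <- function(creds,
                                 required = c("app", "api_key", "api_secret_key",
                                              "access_token", "access_token_secret")) {
  is_unfilled <- vapply(required, function(field) {
    value <- creds[[field]]
    is.null(value) || !nzchar(value) || grepl("^<", value)
  }, logical(1))
  required[is_unfilled]
}

# Example with made-up values; in practice `creds` would be the list
# read from your own api_credentials.json file.
example_creds <- list(
  app = "my_lab_app",
  api_key = "abc123",
  api_secret_key = "<api secret key here>",  # placeholder left in by mistake
  access_token = "def456",
  access_token_secret = ""                   # accidentally left empty
)

unfilled <- find_unfilled_fields(example_creds)
print(unfilled)
```

An empty result would suggest that all five fields the authentication step reads have been filled in; it does not verify that the keys themselves are valid.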
After you have stored the credentials in the JSON file, run the code below to authenticate the application (be sure to use the filename corresponding to your actual file). It simply reads the JSON data and passes the values directly to the create_token function of the rtweet package. Once you complete this step, you should be able to access Twitter data through the API. See the rtweet package documentation to see how to access different types of data.
# this code will read credentials from the JSON file you created.
#install.packages(c("rjson", "rtweet"))
library('rjson')
library('rtweet') # provides create_token() and rate_limit()
creds <- fromJSON(file = 'api_credentials.json') # POINT THIS TO YOUR FILE
# will allow you to authenticate your application
token <- create_token(
app = creds$app,
consumer_key = creds$api_key,
consumer_secret = creds$api_secret_key,
access_token = creds$access_token,
access_secret = creds$access_token_secret)
# this allows you to check how many requests remain for each API endpoint
lim <- rate_limit()
lim[, 1:4]
## # A tibble: 262 × 4
## query limit remaining reset
## <chr> <int> <int> <drtn>
## 1 lists/list 15 15 15.00839 mins
## 2 lists/:id/tweets&GET 900 900 15.00839 mins
## 3 lists/:id/followers&GET 180 180 15.00839 mins
## 4 lists/memberships 75 75 15.00839 mins
## 5 lists/:id&DELETE 300 300 15.00839 mins
## 6 lists/subscriptions 15 15 15.00839 mins
## 7 lists/members 900 900 15.00839 mins
## 8 lists/:id&GET 75 75 15.00839 mins
## 9 lists/subscribers/show 15 15 15.00839 mins
## 10 lists/:id&PUT 300 300 15.00839 mins
## # … with 252 more rows
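The rate-limit table above can also be filtered to spot endpoints that are close to being exhausted. The sketch below uses a small made-up data frame in place of the real rate_limit() result; the column names follow the output shown above, but the values and the 20% threshold are assumptions chosen for illustration.

```r
# Made-up stand-in for the tibble returned by rate_limit(); the column names
# match the output above, the values are invented.
lim_example <- data.frame(
  query     = c("lists/list", "lists/members", "lists/memberships"),
  limit     = c(15L, 900L, 75L),
  remaining = c(2L, 900L, 40L),
  stringsAsFactors = FALSE
)

# Fraction of each endpoint's allowance that is still available.
lim_example$fraction_left <- lim_example$remaining / lim_example$limit

# Endpoints with less than 20% of their allowance left (arbitrary threshold).
nearly_exhausted <- lim_example[lim_example$fraction_left < 0.2, "query"]
print(nearly_exhausted)
```

With the real table, the same two lines could be applied to lim directly, assuming it has the limit and remaining columns shown above.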
ex1. How many senator Twitter accounts listed in our dataset were found on Twitter?
# Your answer here
# filter for just senators and get contact info for each
ex_senators_contact <- congress %>%
filter(type=='sen') %>%
left_join(congress_contact, by='bioguide_id') %>% # combine congress with contact info
mutate(twitter=tolower(twitter)) %>% # DON'T FORGET THIS! Need correct case for merging later
filter(twitter != '') # MUST INCLUDE THIS! Otherwise it will look for a username that is an empty string
ex_senators_contact %>% nrow()
## [1] 98
# download data from Twitter, count the rows
ex_twitter_users <- lookup_users(ex_senators_contact$twitter) %>%
mutate(screen_name=tolower(screen_name)) # DON'T FORGET THIS! Need correct case for merging later
ex_twitter_users %>% nrow() # show the number of Twitter accounts
## [1] 98
ex2. Give the average number of published Tweets for senators by gender using the Twitter data gathered in the previous question.
# combine congress info with the twitter user dataframe
ex_senator_info <- ex_senators_contact %>%
left_join(ex_twitter_users, by=c('twitter'='screen_name')) %>% # NOTE THAT I CHANGED THESE BOTH TO LOWER CASE
drop_na(created_at) # this will drop dataframe rows that did not have a matching username
# NOTE: THEY DON'T ALL MATCH UP - that is fine. You may drop the one inconsistent user
# note that this gives fewer rows - not all retrieved users had the same username, so this method is sub-optimal (I'll show this later)
ex_senator_info %>% nrow()
## [1] 97
# typical summary stat code
ex_senator_info %>%
group_by(gender) %>%
summarize(av_statuses_count=mean(statuses_count), ct=n())
## # A tibble: 2 × 3
## gender av_statuses_count ct
## <fct> <dbl> <int>
## 1 F 10272. 24
## 2 M 10528. 73
ex3. What is the correlation between a senator’s age and the time their Twitter account was created?
cor(ex_senator_info$birthyear, as.numeric(ex_senator_info$created_at))
## [1] -0.06126896
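Note that the code above correlates birth year, not age, with the account creation time. Because age is a decreasing linear function of birth year, the correlation with age has the same magnitude and the opposite sign. The sketch below illustrates this with invented values; the reference year 2022 is an arbitrary assumption and does not affect the result.

```r
# Invented values for illustration only.
birthyear  <- c(1940, 1952, 1961, 1975, 1980)
created_at <- as.POSIXct(c("2009-03-01", "2011-06-15", "2008-11-20",
                           "2013-01-05", "2010-09-30"), tz = "UTC")

# Age relative to a fixed reference year (arbitrary assumption).
age <- 2022 - birthyear

# as.numeric() turns a date-time into seconds since 1970-01-01 UTC,
# so larger values mean the account was created later.
cor_birthyear <- cor(birthyear, as.numeric(created_at))
cor_age       <- cor(age, as.numeric(created_at))

# Same magnitude, opposite sign.
print(c(cor_birthyear = cor_birthyear, cor_age = cor_age))
```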
ex4. Collect the last two Tweets published by each senator in our dataset, and show the average number of retweets they received by gender of the publishing senator.
NOTE: I only used a subset of senators here to help with the rate limits.
NOTE: we need to average by individual user and then by gender
# get last two Tweets
username_sample <- ex_senator_info$twitter %>% head(6) # to protect rate limit
ex_senator_tweets <- get_timeline(username_sample, n = 2)
# join tweet-level information with user-level information and average by individual
ex_tweet_info <- ex_senator_tweets %>%
inner_join(ex_senator_info, by='user_id')
# average by individual first and then by gender
ex_tweet_info %>%
group_by(user_id) %>%
summarize(gender=first(gender), user_av_retweet=mean(retweet_count.x)) %>% # note use of the "first" function - acts as a pass-through
group_by(gender) %>%
summarize(av_retweet=mean(user_av_retweet))
## # A tibble: 2 × 2
## gender av_retweet
## <fct> <dbl>
## 1 F 367.
## 2 M 11.4
1. In your own words, describe what an application programming interface is, and why it is useful to data scientists/computational social scientists.
# Your answer here
2. Use the Twitter API to augment Senator information with the number of followers and the number of statuses they have posted on Twitter, and create a bar graph showing the average number of followers by political party. Save the joined dataframe for future problems.
NOTE: We don’t need to get the actual lists of followers or Tweets; we just need the number of followers and Tweets (i.e. do not try to use get_followers). You may need to refer to the rtweet Package Documentation to see which function to use.
NOTE: Not all senators in our dataset have valid Twitter handles; we can ignore those for the purpose of this assignment. Additionally, Twitter users are allowed to change their usernames, which causes problems when the username listed in congress_contact is out of date. For this question, we will request from Twitter all of the valid usernames in our congress_contact dataframe. The dataframe that Twitter returns, however, will contain screen names that we did not originally request: Twitter recognized that a requested username had been changed and returned the account that the username previously referred to. Because congress_contact contains only Twitter usernames, not the unique user_id that would let us track an account across username changes, we have no way of automatically matching our original demographic information to the changed accounts when requesting in bulk (one workaround would be to request accounts one at a time so you can keep track of the requested and returned usernames). For the purpose of this assignment, you may discard retrieved Twitter information that does not match the usernames in congress_contact.
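The mismatch described in this note can be inspected by comparing the requested and returned usernames as sets. This is a minimal base-R sketch with invented handles; in practice the two vectors would come from the twitter column of your contact data and the screen_name column of the returned dataframe, both lowercased as in the exercises above.

```r
# Invented handles for illustration only.
requested <- tolower(c("SenExampleA", "SenExampleB", "SenExampleC"))
returned  <- tolower(c("senexamplea", "SenatorB_New", "SenExampleC"))

# Requested handles with no exact match in the returned data.
unmatched_requested <- setdiff(requested, returned)

# Returned handles that were never requested (possibly renamed accounts).
unexpected_returned <- setdiff(returned, requested)

print(unmatched_requested)
print(unexpected_returned)
```

Set differences only show that a mismatch exists; they cannot tell you which old username maps to which new one when several accounts have changed.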
HINT: We will only use senators for this assignment, not representatives.
# Write your answer here
3. Use ggplot to create a box plot or violin plot showing the distribution of the number of Tweets posted by gender. What does the visualization tell us? What advantage does a violin plot have over calculating averages (like we did in the previous question)?
# Write your answer here
Written response here.
4. Use the Twitter API to retrieve the last 5 Tweets from each of the 10 oldest senators in our dataset. Then, compute the average number of favorites that those Tweets received for each senator. Finally, create a violin or box plot showing the distribution of these senator-level averages by political party.
NOTE: see the note in question number 2 about Twitter usernames that have been changed.
# Your answer here
5. The Twitter API uses different rate limits for different API endpoints. The easiest way to extract a large number of historical Tweets is to use previously collected unique Tweet IDs, which you may be able to find at various sources on the web. For this problem, download the full Tweet text associated with the Tweet IDs in senator_tweet_ids (already loaded at the beginning of this markdown file), and compute the average number of favorites those Tweets received.
NOTE: see the Lab Instructions page for more details on this dataset.
NOTE: these are NOT the same Tweet IDs as the Tweets stored in senator_tweet_sample, so you will need to collect them using the API directly. Look for a function that can download status information directly from status IDs.
# Your answer here
6. Using the tweet data in senator_tweet_sample, create a single grouped bar plot that allows us to compare both favorites and retweet counts by gender.
HINT: See the suggested readings on this assignment for more information about grouped bar charts.
HINT: You will want to average by bioguide_id first so that we have within-person averages, then average by gender. Otherwise the proportion of tweets from each person will make the result misleading.
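To see why the order of averaging matters, consider the toy example below. It is a sketch in base R with invented data; the column name favorite_count is an assumption and may differ from the columns in senator_tweet_sample. One person with many Tweets dominates the pooled average, while the two-stage average weights each person equally.

```r
# Invented data: person A has four Tweets, persons B and C have one each.
tweets <- data.frame(
  bioguide_id    = c("A", "A", "A", "A", "B", "C"),
  gender         = c("F", "F", "F", "F", "F", "M"),
  favorite_count = c(10, 10, 10, 10, 110, 50)
)

# Pooled average: every Tweet counts equally, so A dominates the F group.
pooled <- aggregate(favorite_count ~ gender, data = tweets, FUN = mean)

# Two-stage average: first average within each person, then across people.
per_person <- aggregate(favorite_count ~ bioguide_id + gender,
                        data = tweets, FUN = mean)
two_stage  <- aggregate(favorite_count ~ gender, data = per_person, FUN = mean)

print(pooled)
print(two_stage)
```

The same idea carries over to dplyr: group by person and summarize, then group by gender and summarize again, as in ex4 above.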
# Write your answer here
7. Identify another API, whether it has an associated R package or not, and describe how you might use the data available from it in a social/data scientific research project, and more specifically in your final project.
Written answer here
8. Develop a hypothesis for one of the research questions you described in the previous weeks. You can choose a new topic if you are no longer interested in your old ones, but make sure you’ll be able to test the hypothesis using available data. For example, the hypothesis could be something like “H: when x does y, we see more z.” This hypothesis is testable if we have empirical data about x, y, and z. Think carefully about what you might and might not be able to measure.
# Your answer here