In this lab, we will practice making and visualizing statistical models of data using R.
See the “Instructions” section of the Introduction to Lab Assignments page for more information about the labs. That page also gives descriptions for the datasets we will be using.
Required reading:
Optional reading:
ex1. Make a box plot showing the average approximate age of congress members by gender.
# Your answer here
congress %>%
mutate(age=2022-birthyear) %>%
# no group_by()/summarize() step is needed: geom_boxplot() computes the per-gender summary itself
ggplot(aes(x=gender, y=age)) +
geom_boxplot()
ex2. Create a scatter plot showing the relationship between age and number of social media accounts that congress members have, and add a linear trendline.
# Your answer here
congress %>%
mutate(age=2022-birthyear) %>%
left_join(congress_contact, by='bioguide_id') %>%
mutate(num_accounts = (twitter!='') + (facebook!='') + (youtube!='')) %>%
ggplot(aes(x=age, y=num_accounts)) +
geom_point(alpha=0.1) +
geom_smooth(method='lm', formula= y ~ x)
ex3. Create a linear regression model predicting the number of social media accounts a congress member has from the following variables: type (senator or representative), age, party, and gender. Controlling for other factors, how would you communicate the relationship between age and the number of social media accounts a congress member has?
# create dataframe from which we will train the model
model_df <- congress %>%
mutate(age=2022-birthyear, is_repub=party=='Republican') %>%
left_join(congress_contact, by='bioguide_id') %>%
mutate(num_accounts = (twitter!='') + (facebook!='') + (youtube!=''))
# fit the full model with all the variables
m <- lm(num_accounts ~ type + age + is_repub + gender, data=model_df)
m %>% summary()
##
## Call:
## lm(formula = num_accounts ~ type + age + is_repub + gender, data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8774 -0.5236 0.1205 0.6146 1.8324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.149337 0.186545 -0.801 0.42375
## typesen 0.242193 0.087756 2.760 0.00598 **
## age 0.034899 0.002873 12.149 < 2e-16 ***
## is_repubTRUE -0.218618 0.069963 -3.125 0.00188 **
## genderM 0.341592 0.078062 4.376 1.46e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7773 on 534 degrees of freedom
## Multiple R-squared: 0.2812, Adjusted R-squared: 0.2758
## F-statistic: 52.22 on 4 and 534 DF, p-value: < 2.2e-16
Response: Holding type (senator or representative), party, and gender constant, congress members have approximately 0.035 more social media accounts, on average, for each additional year of age.
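One way to make that coefficient easier to communicate is to rescale it and attach an interval. The sketch below hard-codes the age estimate, standard error, and residual degrees of freedom from the summary printed above (the variable names are illustrative, not from the lab) and computes the per-decade effect and an approximate 95% confidence interval.

```r
# Values copied from the ex3 model summary (age row and residual df)
est <- 0.034899   # estimated change in num_accounts per year of age
se  <- 0.002873   # standard error of that estimate
df  <- 534        # residual degrees of freedom

# Expected difference for a 10-year age gap, other variables held constant
per_decade <- 10 * est

# Approximate 95% confidence interval for the per-year coefficient
ci <- est + c(-1, 1) * qt(0.975, df) * se

per_decade
ci
```

With the fitted model object available, `confint(m, "age")` should give the same interval without copying numbers by hand.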
ex4. Using the model from the previous question, generate the predicted number of social media accounts that a Republican female senator born in 1952 would have. How does this compare with the actual congress members that match that profile?
# generate a dummy dataframe for prediction
example_df <- data.frame(gender='F', is_repub=TRUE, type='sen', age=2022-1952)
m %>% predict(example_df)
## 1
## 2.317202
# show actual congress members
model_df %>%
filter(gender=='F', party=='Republican', type=='sen', birthyear==1952) %>%
select(full_name, num_accounts)
## full_name num_accounts
## 1 Susan M. Collins 3
## 2 Marsha Blackburn 3
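The model predicts that congress members with this profile would have an average of about 2.3 social media accounts, but both actual members who match it (Susan M. Collins and Marsha Blackburn) have 3 accounts, so the model underpredicts by about 0.7 accounts for each.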
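To see where the predicted value comes from, the prediction can be reproduced by hand from the ex3 coefficient table. This is a minimal sketch that hard-codes the rounded estimates from that summary, so it should agree with `predict()` only to about four decimal places; the vector names are illustrative.

```r
# Rounded coefficient estimates copied from the ex3 model summary
b <- c(intercept = -0.149337, typesen = 0.242193, age = 0.034899,
       is_repubTRUE = -0.218618, genderM = 0.341592)

# Profile: Republican female senator born in 1952 (age measured as of 2022).
# Dummy variables are 1 when the condition holds; genderM is 0 because she is female.
x <- c(intercept = 1, typesen = 1, age = 2022 - 1952,
       is_repubTRUE = 1, genderM = 0)

# A linear model's prediction is the sum of coefficient * predictor value
manual_pred <- sum(b * x)
manual_pred
```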
ex5. Create three statistical models to predict the number of social media accounts that congress members have: one with gender only, one with political party (you may create a dummy variable for Republicans or Democrats) only, and one that includes both gender and political party. What can you learn from the combination of these three models?
lm(num_accounts ~ gender, data=model_df) %>% summary()
##
## Call:
## lm(formula = num_accounts ~ gender, data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.22704 -0.94558 0.05442 0.77296 1.05442
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.94558 0.07469 26.049 < 2e-16 ***
## genderM 0.28146 0.08758 3.214 0.00139 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9056 on 537 degrees of freedom
## Multiple R-squared: 0.01887, Adjusted R-squared: 0.01704
## F-statistic: 10.33 on 1 and 537 DF, p-value: 0.001389
lm(num_accounts ~ is_repub, data=model_df) %>% summary()
##
## Call:
## lm(formula = num_accounts ~ is_repub, data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.26182 -1.03409 -0.03409 0.73818 0.96591
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.26182 0.05470 41.350 < 2e-16 ***
## is_repubTRUE -0.22773 0.07816 -2.914 0.00372 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9071 on 537 degrees of freedom
## Multiple R-squared: 0.01556, Adjusted R-squared: 0.01373
## F-statistic: 8.49 on 1 and 537 DF, p-value: 0.003721
lm(num_accounts ~ gender + is_repub, data=model_df) %>% summary()
##
## Call:
## lm(formula = num_accounts ~ gender + is_repub, data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.40827 -1.03188 -0.03188 0.90888 1.28527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.03188 0.07682 26.450 < 2e-16 ***
## genderM 0.37639 0.08965 4.199 3.14e-05 ***
## is_repubTRUE -0.31715 0.07987 -3.971 8.14e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8934 on 536 degrees of freedom
## Multiple R-squared: 0.04691, Adjusted R-squared: 0.04335
## F-statistic: 13.19 on 2 and 536 DF, p-value: 2.559e-06
From the first two models we can say that men tend to have more social media accounts and Republicans tend to have fewer accounts on average. In the third model, which includes both variables, both coefficients have larger magnitudes. This is likely because Republican members are disproportionately men, so each single-variable model partly absorbs the opposing effect of the variable it leaves out. We can verify this with a table or a simple correlation test, as shown below. By including only one or the other in the model, we get an incomplete picture of what each relationship looks like.
# Your answer here
table(model_df %>% select(gender, is_repub))
## is_repub
## gender FALSE TRUE
## F 107 40
## M 168 224
cor.test(as.numeric(model_df$gender=='M'), as.numeric(model_df$is_repub))
##
## Pearson's product-moment correlation
##
## data: as.numeric(model_df$gender == "M") and as.numeric(model_df$is_repub)
## t = 6.4117, df = 537, p-value = 3.156e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1864090 0.3433889
## sample estimates:
## cor
## 0.2666667
#cor.test(gender + is_repub, data=model_df)
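The link between the three ex5 models can also be checked numerically. For a linear model with two dummy predictors, the coefficient from a single-predictor model equals that predictor's coefficient in the joint model plus the other predictor's joint coefficient times the difference in group proportions (the usual omitted-variable relationship). The sketch below hard-codes the table counts and joint-model coefficients printed above; the variable names are illustrative, and the results should match the single-predictor estimates up to rounding.

```r
# Counts copied from the gender-by-is_repub table above
n_F_other <- 107; n_F_rep <- 40
n_M_other <- 168; n_M_rep <- 224

# Coefficients from the model that includes both gender and is_repub
b_gender_both <- 0.37639
b_repub_both  <- -0.31715

# Share of Republicans among men minus share of Republicans among women
diff_rep_by_gender <- n_M_rep / (n_M_rep + n_M_other) - n_F_rep / (n_F_rep + n_F_other)

# Share of men among Republicans minus share of men among non-Republicans
diff_male_by_party <- n_M_rep / (n_M_rep + n_F_rep) - n_M_other / (n_M_other + n_F_other)

# Implied single-predictor coefficients
b_gender_only <- b_gender_both + b_repub_both * diff_rep_by_gender
b_repub_only  <- b_repub_both + b_gender_both * diff_male_by_party

round(c(gender_only = b_gender_only, repub_only = b_repub_only), 5)
```

Compare the two values with the `genderM` and `is_repubTRUE` estimates in the first two model summaries: the shrinkage toward zero in those models is what the positive association between being male and being Republican would produce.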
1. When is statistical modeling used? Provide an example of an empirical question that could be answered using statistical modeling techniques.
Write here
2. Create the following new variables for each congress member: (a) approximate age, (b) the number of full committees that they are a part of, and (c) the percentage of instances where they hold a title in the full committees they belong to (i.e. when the title entry in the committee_membership dataframe is not empty). You will want to save these new variables for future problems. Then use the summary function to display summary statistics for (ONLY) these new variables.
# Your answer here
3. Create a scatter plot with a linear trendline (must be a straight line) to predict the number of full committees that congress members belong to from age. Describe the relationship. What does each of these (the scatter points and the linear trendline) show that you cannot see from the other?
Note: usually we see the dependent variable (number of committees in this case) on the y-axis and the independent variable (age in this case) on the x-axis.
# Your answer here
written explanation here
4. Create a bar graph showing the average number of full committees that congress members belong to by gender (i.e. a bar for M and a bar for F) with error bars. What can you see from this visualization? Does there appear to be a significant difference?
Hint: you may want to see geom_errorbar.
Hint: error bars are usually calculated by taking the average plus and minus the standard deviation.
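As a rough illustration of the two hints, the sketch below builds error bars on a small made-up dataset rather than the congress data. The data frame `toy` and its columns `group` and `value` are hypothetical, and the bars use the mean plus and minus one standard deviation as described in the hint.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; the column names are illustrative only
toy <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 2, 3, 2, 4, 6)
)

# One row per group with the mean and standard deviation
toy_summary <- toy %>%
  group_by(group) %>%
  summarize(av_value = mean(value), sd_value = sd(value))

# Bars show the group means; error bars span mean - sd to mean + sd
p <- ggplot(toy_summary, aes(x = group, y = av_value)) +
  geom_col() +
  geom_errorbar(aes(ymin = av_value - sd_value, ymax = av_value + sd_value),
                width = 0.2)
print(p)
```

Note that `geom_errorbar` needs the `ymin` and `ymax` aesthetics, which is why the summary table is computed before plotting.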
# Your answer here
written explanation here
5. Following section 24.2.2 of the R4DS required reading, construct a model using lm or glm to predict the proportion of instances where congress members hold titles in the committees they belong to from age, gender, and political party. Keep this model for future problems. Based on the model summary, which variables might be related to holding a title? Try removing and adding different variables. Does changing any of the included variables change your original interpretation?
HINT: see required readings for help interpreting regression models.
# Your answer here
written explanation here
6. Using that same model, make a line plot showing predicted likelihood of holding a committee title by age with separate lines for the two genders and holding political party constant to Republican. Your plot should have age on the x-axis and predicted proportion of instances where congress members hold a title on the y-axis. You should also include labels for gender. From this, we should be able to see model predictions for a Republican of any age and gender in our dataset.
NOTE: you must use a single model for this - the one you produced from the previous question.
HINT: you may want to create a dummy dataset where age and gender vary but political party is held constant to Republican in order to generate model predictions for the visualization. See the data_grid function.
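The idea of a prediction grid can be sketched on simulated data. The example below swaps in base R's `expand.grid` instead of `data_grid` to stay dependency-free; the data frame `toy`, the outcome `prop_title`, and the model `toy_model` are all hypothetical stand-ins, not the lab's data or model.

```r
set.seed(1)

# Hypothetical simulated data standing in for the real model data frame
toy <- data.frame(
  age      = sample(30:80, 100, replace = TRUE),
  gender   = sample(c("F", "M"), 100, replace = TRUE),
  is_repub = sample(c(TRUE, FALSE), 100, replace = TRUE)
)
toy$prop_title <- runif(100)

toy_model <- lm(prop_title ~ age + gender + is_repub, data = toy)

# Every combination of age and gender, with party held constant
grid <- expand.grid(age = 30:80, gender = c("F", "M"), is_repub = TRUE,
                    stringsAsFactors = FALSE)

# One model prediction per grid row
grid$pred <- predict(toy_model, newdata = grid)
head(grid)
```

Each row of `grid` is one hypothetical person, so plotting `pred` against `age` with one line per `gender` gives the kind of figure the question asks for.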
# Your answer here
Your answer here
7. How could you use statistical modeling to answer one of the Final Project hypotheses you provided in the last assignment? What inferences could you make?
# Your answer here
8. Describe one or two existing datasets that you would like to use for the project you have been developing since last week. Will you be able to download the data from somewhere, or can you use an API? Will you be making statistical models, analyzing networks, doing text analysis, or creating visualizations? See the “Final Project” section in the course description page on the website.
# Your answer here