Lab #7: Modeling with Statistics

In this lab, we will practice making and visualizing statistical models of data using R.

See the “Instructions” section of the Introduction to Lab Assignments page for more information about the labs. That page also gives descriptions for the datasets we will be using.

Required reading:

R for Data Science: Modeling (Chapters 23-25)
Quick Guide: Interpreting Simple Linear Model Output in R

Optional reading:

Simple Linear Regression - An example using R

Example Questions

ex1. Make a box plot showing the distribution of approximate age of congress members by gender.

# Your answer here
# compute approximate age, then draw one box per gender
congress %>% 
  mutate(age=2022-birthyear) %>% 
  ggplot(aes(x=gender, y=age)) +
    geom_boxplot()

ex2. Create a scatter plot showing the relationship between age and number of social media accounts that congress members have, and add a linear trendline.

# Your answer here
congress %>% 
  mutate(age=2022-birthyear) %>% 
  left_join(congress_contact, by='bioguide_id') %>% 
  mutate(num_accounts = (twitter!='') + (facebook!='') + (youtube!='')) %>% 
  ggplot(aes(x=age, y=num_accounts)) + 
    geom_point(alpha=0.1) + 
    geom_smooth(method='lm', formula= y ~ x)

ex3. Create a linear regression model predicting the number of social media accounts a congress member has from the following variables: type (senator or representative), age, party, and gender. Controlling for other factors, how would you communicate the relationship between age and the number of social media accounts a congress member has?

# create dataframe from which we will train the model
model_df <- congress %>% 
  mutate(age=2022-birthyear, is_repub=party=='Republican') %>% 
  left_join(congress_contact, by='bioguide_id') %>% 
  mutate(num_accounts = (twitter!='') + (facebook!='') + (youtube!=''))

# fit the full model with all the variables
m <- lm(num_accounts ~ type + age + is_repub + gender, data=model_df)
m %>% summary()

## 
## Call:
## lm(formula = num_accounts ~ type + age + is_repub + gender, data = model_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8774 -0.5236  0.1205  0.6146  1.8324 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.149337   0.186545  -0.801  0.42375    
## typesen       0.242193   0.087756   2.760  0.00598 ** 
## age           0.034899   0.002873  12.149  < 2e-16 ***
## is_repubTRUE -0.218618   0.069963  -3.125  0.00188 ** 
## genderM       0.341592   0.078062   4.376 1.46e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7773 on 534 degrees of freedom
## Multiple R-squared:  0.2812, Adjusted R-squared:  0.2758 
## F-statistic: 52.22 on 4 and 534 DF,  p-value: < 2.2e-16

Response: Holding type, party, and gender constant, each additional year of age is associated with approximately 0.035 more social media accounts on average.
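To make the size of this effect easier to communicate, the per-year coefficient can be scaled to a larger age gap. The sketch below does not refit the model; it hard-codes the age estimate and standard error printed in the summary above, and it assumes a normal approximation (1.96 standard errors) for the interval. The variable names are illustrative only.

```r
# Values copied from the ex3 model summary (not refit here)
age_coef <- 0.034899   # estimated change in accounts per additional year of age
age_se   <- 0.002873   # standard error of that estimate

# Expected difference between two otherwise identical members 10 years apart
diff_10yr <- 10 * age_coef

# Approximate 95% interval for the per-year effect (normal approximation)
ci_per_year <- age_coef + c(-1, 1) * 1.96 * age_se

print(round(diff_10yr, 2))
print(round(ci_per_year, 3))
```

Reporting the effect per decade (about a third of an account) is often easier for readers to picture than the per-year value.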

ex4. Using the model from the previous question, generate the predicted number of social media accounts that a Republican female senator born in 1952 would have. How does this compare with the actual congress members that match that profile?

# generate a dummy dataframe for prediction
example_df <- data.frame(gender='F', is_repub=TRUE, type='sen', age=2022-1952)
m %>% predict(example_df)

##        1 
## 2.317202

# show actual congress members
model_df %>% 
  filter(gender=='F', party=='Republican', type=='sen', birthyear==1952) %>% 
  select(full_name, num_accounts)

##          full_name num_accounts
## 1 Susan M. Collins            3
## 2 Marsha Blackburn            3

The model predicts an average of about 2.3 social media accounts for congress members with this profile, while both members who actually match it have 3 accounts each.
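The prediction can also be checked by hand, since a linear model's prediction is just the intercept plus each coefficient multiplied by its predictor value. The sketch below uses the rounded coefficients printed in the ex3 summary rather than the fitted model object, so the result matches the predict() output only up to rounding. The reference levels (female, non-Republican) contribute zero.

```r
# Coefficients copied from the ex3 summary output (rounded as printed)
coefs <- c(intercept = -0.149337, typesen = 0.242193, age = 0.034899,
           is_repubTRUE = -0.218618, genderM = 0.341592)

# Profile: Republican female senator born in 1952 (age = 2022 - birthyear)
profile <- c(intercept = 1, typesen = 1, age = 2022 - 1952,
             is_repubTRUE = 1, genderM = 0)

# Prediction = sum of coefficient * predictor value
manual_pred <- sum(coefs * profile)
print(round(manual_pred, 3))
```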

ex5. Create three statistical models to predict the number of social media accounts that congress members have: one with gender only, one with political party (you may create a dummy variable for Republicans or Democrats) only, and one that includes both gender and political party. What can you learn from the combination of these three models?

lm(num_accounts ~ gender, data=model_df) %>% summary()

## 
## Call:
## lm(formula = num_accounts ~ gender, data = model_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.22704 -0.94558  0.05442  0.77296  1.05442 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.94558    0.07469  26.049  < 2e-16 ***
## genderM      0.28146    0.08758   3.214  0.00139 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9056 on 537 degrees of freedom
## Multiple R-squared:  0.01887,    Adjusted R-squared:  0.01704 
## F-statistic: 10.33 on 1 and 537 DF,  p-value: 0.001389

lm(num_accounts ~ is_repub, data=model_df) %>% summary()

## 
## Call:
## lm(formula = num_accounts ~ is_repub, data = model_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.26182 -1.03409 -0.03409  0.73818  0.96591 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.26182    0.05470  41.350  < 2e-16 ***
## is_repubTRUE -0.22773    0.07816  -2.914  0.00372 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9071 on 537 degrees of freedom
## Multiple R-squared:  0.01556,    Adjusted R-squared:  0.01373 
## F-statistic:  8.49 on 1 and 537 DF,  p-value: 0.003721

lm(num_accounts ~ gender + is_repub, data=model_df) %>% summary()

## 
## Call:
## lm(formula = num_accounts ~ gender + is_repub, data = model_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.40827 -1.03188 -0.03188  0.90888  1.28527 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.03188    0.07682  26.450  < 2e-16 ***
## genderM       0.37639    0.08965   4.199 3.14e-05 ***
## is_repubTRUE -0.31715    0.07987  -3.971 8.14e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8934 on 536 degrees of freedom
## Multiple R-squared:  0.04691,    Adjusted R-squared:  0.04335 
## F-statistic: 13.19 on 2 and 536 DF,  p-value: 2.559e-06

From the first two models we can say that men tend to have more social media accounts and Republicans have fewer accounts on average. In the third model, which includes both variables, both coefficients have larger magnitudes. This is likely because Republicans have a higher proportion of men, so each single-variable model partly absorbs the opposing effect of the omitted variable. We can verify this with a table or a simple correlation test, as shown below. By including only one or the other in the model, we get an incomplete picture of each relationship.

# Your answer here
table(model_df %>% select(gender, is_repub))

##       is_repub
## gender FALSE TRUE
##      F   107   40
##      M   168  224

cor.test(as.numeric(model_df$gender=='M'), as.numeric(model_df$is_repub))

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(model_df$gender == "M") and as.numeric(model_df$is_repub)
## t = 6.4117, df = 537, p-value = 3.156e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1864090 0.3433889
## sample estimates:
##       cor 
## 0.2666667

#cor.test(gender + is_repub, data=model_df)
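The link between gender and party can be read directly off the counts in the table above. The sketch below rebuilds the table from those printed counts (it does not use model_df), computes the share of men within each party, and recomputes the correlation from one row per member. The object names are illustrative only.

```r
# Counts copied from the gender-by-party table above (filled column by column)
counts <- matrix(c(107, 168, 40, 224), nrow = 2,
                 dimnames = list(gender = c("F", "M"),
                                 is_repub = c("FALSE", "TRUE")))

# Share of each gender within each party (columns sum to 1)
share_by_party <- prop.table(counts, margin = 2)
print(round(share_by_party["M", ], 3))

# Rebuild one row per member from the counts, then correlate the two indicators
is_male  <- rep(c(0, 1, 0, 1), times = as.vector(counts))
is_repub <- rep(c(0, 0, 1, 1), times = as.vector(counts))
r <- cor(is_male, is_repub)
print(round(r, 4))
```

Men make up a larger share of Republicans than of non-Republicans in these counts, which is the pattern the correlation test summarizes.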

Questions

1. When is statistical modeling used? Provide an example of an empirical question that could be answered using statistical modeling techniques.

Write here

2. Create the following new variables for each congress member: (a) approximate age, (b) the number of full committees that they are a part of, and (c) the percentage of instances where they hold a title in the full committees they belong to (i.e. when the title entry in the committee_membership dataframe is not empty). You will want to save these new variables for future problems. Then use the summary function to display summary statistics for (ONLY) these new variables.

# Your answer here

3. Create a scatter plot with a linear trendline (must be a straight line) to predict the number of full committees that congress members belong to from age. Describe the relationship. What does each of these (the scatter points and the linear trendline) show that you cannot see from the other?

Note: usually we see the dependent variable (number of committees in this case) on the y-axis and the independent variable (age in this case) on the x-axis.

# Your answer here

written explanation here

4. Create a bar graph showing the average number of full committees that congress members belong to by gender (i.e. a bar for M and a bar for F) with error bars. What can you see from this visualization? Does there appear to be a significant difference?

Hint: you may want to see geom_errorbar.

Hint: error bars are usually calculated by taking the average plus and minus one standard deviation.
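As a rough illustration of the hints above, the sketch below computes error-bar endpoints on a small made-up dataset (not the congress data; the toy data frame and its column names are hypothetical). It uses the mean plus and minus one standard deviation, and also reports the standard error (standard deviation divided by the square root of the group size), which is a common alternative when comparing group means. The plot is only built if ggplot2 is installed.

```r
# Hypothetical toy data (not the congress dataset)
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  value = c(2, 3, 4, 3, 3, 5, 6, 4, 5, 5)
)

# Mean, standard deviation, and standard error per group
stats <- do.call(rbind, lapply(split(toy, toy$group), function(d) {
  data.frame(group = d$group[1],
             avg = mean(d$value),
             sd = sd(d$value),
             se = sd(d$value) / sqrt(nrow(d)))
}))

# Error-bar endpoints: mean plus and minus one standard deviation
stats$lower <- stats$avg - stats$sd
stats$upper <- stats$avg + stats$sd
print(stats)

# Build the bar chart with error bars only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(stats, aes(x = group, y = avg)) +
    geom_col() +
    geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2)
  # print(p) displays the plot in an interactive session
}
```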

# Your answer here

written explanation here

5. Following section 24.2.2 of the R4DS required reading, construct a model using lm or glm to predict the proportion of instances where congress members hold titles in the committees they belong to from age, gender, and political party. Keep this model for future problems. Based on the model summary, which variables might be related to holding a title? Try removing and adding different variables. Does changing any of the included variables change your original interpretation?

HINT: see required readings for help interpreting regression models.

# Your answer here

written explanation here

6. Using that same model, make a line plot showing the predicted likelihood of holding a committee title by age, with separate lines for the two genders and political party held constant at Republican. Your plot should have age on the x-axis and the predicted proportion of instances where congress members hold a title on the y-axis. You should also include labels for gender. From this, we should be able to see model predictions for a Republican of any age and gender in our dataset.

NOTE: you must use a single model for this - the one you produced from the previous question.

HINT: you may want to create a dummy dataset where age and gender vary but political party is held constant at Republican in order to generate model predictions for the visualization. See the data_grid function.
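The general idea of a prediction grid can be sketched on made-up data. The example below is not the congress model: the toy data frame, its columns (x, grp, flag), and the simulated outcome are all hypothetical, and base R's expand.grid is used here as a stand-in for data_grid. Two predictors vary over their observed values while the third is held at a single value, and predict() fills in the model's prediction for each row.

```r
# Hypothetical toy data (not the congress dataset)
set.seed(1)
toy <- data.frame(
  x = rep(1:10, times = 2),
  grp = rep(c("a", "b"), each = 10),
  flag = rep(c(TRUE, FALSE), times = 10)
)
toy$y <- 1 + 0.5 * toy$x + 2 * (toy$grp == "b") - 1 * toy$flag +
  rnorm(20, sd = 0.1)

# Fit a single model with all three predictors
fit <- lm(y ~ x + grp + flag, data = toy)

# Grid: x and grp vary over observed values, flag is held constant at TRUE
grid <- expand.grid(x = sort(unique(toy$x)),
                    grp = unique(toy$grp),
                    flag = TRUE,
                    stringsAsFactors = FALSE)

# One prediction per grid row, all from the same fitted model
grid$pred <- predict(fit, newdata = grid)
print(head(grid))
```

The resulting grid has one row per combination of the varying predictors, which is the shape a line plot with one line per group expects.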

# Your answer here

Your answer here

7. How could you use statistical modeling to answer one of the Final Project hypotheses you provided in the last assignment? What inferences could you make?

# Your answer here

8. Describe one or two existing datasets that you would like to use for the project you have been developing since last week. Will you be able to download the data from somewhere, or can you use an API? Will you be making statistical models, analyzing networks, doing text analysis, or creating visualizations? See the “Final Project” section in the course description page on the website.

# Your answer here

Data Science and Society (Sociology 367)
