In this lab, we will practice making and visualizing statistical models of data using R.

Be sure to do the required readings first. While many of the problems can be solved using approaches from the lecture videos, lab videos, or required readings, you may need to do some searching on the internet to solve some of the problems. This will be a valuable skill to learn as you develop as a data scientist.

Submit this assignment as both an R Markdown (.Rmd) file and a knitted HTML web page (.html) file. Start from the Lab markdown file (link to file) and replace the author name with your own. The R Markdown file should include all of the code you used to generate your solutions, but the knitted HTML file should display ONLY the solutions to the assignment, NOT the code you used to produce them. Imagine this is a document you will submit to a supervisor or professor - they should be able to reproduce the code/analysis if needed, but otherwise only need to see the results and write-up. The TA should be able to run the R Markdown file to generate exactly the same HTML file you submitted. Submit to your TA via direct message on Slack by the deadline indicated on the course website.
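
One common way to keep code in the .Rmd file but out of the knitted HTML, assuming the file is knitted with the knitr package (the usual engine for R Markdown), is to set the `echo` chunk option to `FALSE` for the whole document. The sketch below shows the line that would go inside a setup chunk near the top of the file; it is one possible approach, not a required one.

```r
# Assumption: the document is knitted with knitr.
# Placed in a setup chunk, this hides the source code of every later chunk
# in the knitted output while still running it and showing its results.
knitr::opts_chunk$set(echo = FALSE)
```

Individual chunks can still override this by setting `echo = TRUE` in their own chunk header.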

Required reading:

Optional reading:


Questions


1. In your own words, describe what statistical modeling means. When is it used? What does it allow data scientists to do?

# Your answer here


2. Create three new variables related to our congress dataset: (a) the age of the member, (b) the number of committees they are on, and (c) the percentage of instances where they hold a title in the committees they belong to (i.e. when the title entry in the committee membership dataframe is not empty). You will want to save these new variables for future problems. Then use the summary function to create summary statistics for these new variables.
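
The three kinds of variables described above can be sketched on toy data. Everything in this example is hypothetical: the data frames `members` and `memberships`, their columns (`id`, `birthday`, `title`), and the values are stand-ins, not the lab's congress dataset, so the names will need adapting.

```r
# Toy stand-ins for the lab data; all names and values are hypothetical.
members <- data.frame(
  id = c("A", "B", "C"),
  birthday = as.Date(c("1950-06-15", "1975-03-10", "1980-09-20")),
  stringsAsFactors = FALSE
)
memberships <- data.frame(
  id = c("A", "A", "A", "B", "B", "C"),
  title = c("Chair", "", "", "", "Ranking Member", ""),
  stringsAsFactors = FALSE
)

# (a) Approximate age in whole years on a fixed reference date
#     (Sys.Date() could be used instead to get the age today).
ref_date <- as.Date("2020-01-01")
members$age <- floor(as.numeric(ref_date - members$birthday) / 365.25)

# (b) Number of committee memberships per member.
#     Members with no rows in `memberships` would get NA here.
n_by_id <- table(memberships$id)
members$n_committees <- as.integer(n_by_id[members$id])

# (c) Percentage of memberships where the title is not empty.
pct_by_id <- tapply(memberships$title != "", memberships$id, mean) * 100
members$pct_titled <- as.numeric(pct_by_id[members$id])

# Summary statistics for the three new variables.
summary(members[, c("age", "n_committees", "pct_titled")])
```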

# Your answer here


3. Create a linear model predicting the number of committees that members belong to from age, then create a scatter plot with a linear trendline. Describe the relationship. What does each of these (the model summary and the plot) show that you cannot see from the other?

Note: usually we see the dependent variable (number of committees in this case) on the y-axis and the independent variable on the x-axis.
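
As a rough illustration of the convention in the note, the sketch below fits a simple linear model and draws a scatter plot with a linear trendline using ggplot2. The data frame `toy` and its columns `x` and `y` are simulated and hypothetical; they only show where the dependent variable (left of `~`, y-axis) and the independent variable (right of `~`, x-axis) go.

```r
library(ggplot2)

# Simulated toy data; the names and values are hypothetical.
set.seed(1)
toy <- data.frame(x = runif(50, 30, 80))
toy$y <- 2 + 0.05 * toy$x + rnorm(50)

# Dependent variable on the left of ~, independent variable on the right.
fit <- lm(y ~ x, data = toy)
summary(fit)  # coefficients, standard errors, p-values, R-squared

# Scatter plot with the dependent variable on the y-axis and a linear trendline.
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
print(p)
```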

# Your answer here
written explanation here


4. Create a bar graph showing the average number of committees that congress members belong to by gender (i.e. a bar for M and a bar for F) with error bars. What can you see from this visualization? Does there appear to be a significant difference?

Hint: you may want to see geom_errorbar.

Hint: error bars are often drawn at the mean plus and minus one standard deviation. The standard error of the mean is a common alternative when comparing group means.
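
A minimal sketch of the two hints, on a small made-up data frame: compute the per-group mean and spread first, then pass the lower and upper limits to `geom_errorbar`. The names `toy`, `group`, `value`, and `stats` are hypothetical, and the bars here use the standard deviation; swapping `sd` for `se` in the plot would draw standard-error bars instead.

```r
library(ggplot2)

# Made-up toy data; names and values are hypothetical.
toy <- data.frame(
  group = c("F", "F", "F", "F", "M", "M", "M", "M"),
  value = c(2, 4, 4, 6, 1, 3, 5, 7)
)

# Per-group mean, standard deviation, and standard error of the mean.
means <- tapply(toy$value, toy$group, mean)
sds <- tapply(toy$value, toy$group, sd)
ns <- tapply(toy$value, toy$group, length)
stats <- data.frame(
  group = names(means),
  mean = as.numeric(means),
  sd = as.numeric(sds),
  se = as.numeric(sds / sqrt(ns))
)

# One bar per group, with error bars at mean +/- one standard deviation.
p <- ggplot(stats, aes(x = group, y = mean)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.2)
print(p)
```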

# Your answer here
written explanation here


5. Construct a model predicting the percentage of time that a member holds a title in the committees they belong to from age, gender, and political party. Which variables might be related to holding a title? Try removing and adding different variables. Does changing which variables are included change your original interpretation?

Note: you may want to save the full model for the next question.
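
One way to try adding and removing predictors, sketched on simulated data, is to save the full model and then use `update()` to drop a term, comparing the coefficient tables of the two fits. The data frame `toy` and its columns (`age`, `gender`, `party`, `pct_titled`) are hypothetical, and the simulated outcome depends only on age by construction, so the numbers say nothing about the real data.

```r
# Simulated toy data; names and values are hypothetical.
set.seed(42)
n <- 100
toy <- data.frame(
  age = round(runif(n, 30, 85)),
  gender = sample(c("F", "M"), n, replace = TRUE),
  party = sample(c("Democrat", "Republican"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
toy$pct_titled <- 5 + 0.4 * toy$age + rnorm(n, sd = 10)

# Full model, saved for later use.
full <- lm(pct_titled ~ age + gender + party, data = toy)

# Same model with the party term removed.
reduced <- update(full, . ~ . - party)

# Compare the estimates, then test whether dropping party worsens the fit.
summary(full)$coefficients
summary(reduced)$coefficients
anova(reduced, full)
```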

# Your answer here
written explanation here


6. Use the model from the previous question to make a scatter plot that includes prediction lines for BOTH F and M Democrats. That is, your plot should include two prediction lines - one for M and one for F, and the visualization (not the model) should only include Democrats. This is important because our original model included information about all the variables, but we mainly want to visualize a single relationship, and how it might differ by gender. How do you interpret this plot?

Hint: you may need to follow the examples in the R4DS modeling section for this (see required readings) - see the modelr package.
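
The modelr approach mentioned in the hint can be sketched as follows: fit the model on all rows, build a prediction grid from the subset you want to show with `data_grid()`, attach predictions with `add_predictions()`, and draw them as lines over the subset's points. The data are simulated and every name (`toy`, `dems`, `grid`, and the column names) is hypothetical.

```r
library(ggplot2)
library(modelr)

# Simulated toy data; names and values are hypothetical.
set.seed(42)
n <- 100
toy <- data.frame(
  age = round(runif(n, 30, 85)),
  gender = sample(c("F", "M"), n, replace = TRUE),
  party = sample(c("Democrat", "Republican"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
toy$pct_titled <- 5 + 0.4 * toy$age + rnorm(n, sd = 10)

# The model uses all rows and all three predictors.
fit <- lm(pct_titled ~ age + gender + party, data = toy)

# The plot uses only one party: build the grid from that subset.
dems <- toy[toy$party == "Democrat", ]
grid <- data_grid(dems, age = seq_range(age, 20), gender, party)
grid <- add_predictions(grid, fit)  # adds a column named "pred"

# Points for the subset, plus one prediction line per gender.
p <- ggplot(dems, aes(x = age, y = pct_titled, colour = gender)) +
  geom_point() +
  geom_line(data = grid, aes(y = pred))
print(p)
```

Because this toy model has no interaction term, the two lines are parallel by construction; they differ only by the estimated gender coefficient.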

# Your answer here
written explanation here


7. How could you use statistical modeling to answer the hypothesis you provided in the last assignment? What inferences could you make?

# Your answer here


8. Describe one or two existing datasets that you would like to use for the project you have been developing since last week. Will you be able to download the data from somewhere, or can you use an API? Will you be making statistical models, analyzing networks, doing text analysis, or creating visualizations? See the “Final Project” section in the course description page on the website.

# Your answer here