In this lab, we will practice working with dictionary-based methods for text analysis.

Be sure to do the required readings first. Many of the problems can be solved using approaches from the lecture videos, lab videos, or required readings, but for some you may need to search the internet. This is a valuable skill to practice as you develop as a data scientist.

Submit this lab as both an R Markdown (.Rmd) file and a knitted HTML (.html) file. Start from the Lab markdown file (link to file) and replace the author name with your own. The R Markdown file should include all of the code you used to generate your solutions, but the knitted HTML file should display ONLY the solutions to the assignment, NOT the code you used to produce them. Imagine this is a document you will submit to a supervisor or professor - they should be able to reproduce the code/analysis if needed, but otherwise only need to see the results and write-up. The TA should be able to run the R Markdown file and generate exactly the same HTML file you submitted. Submit to your TA via direct message on Slack by the deadline indicated on the course website.

Required reading:

Optional reading:


Questions


1. In which scenarios would it be best to consider dictionary-based approaches to text analysis? How does the decision to use dictionary-based approaches shape the research questions you can ask?

# your answer here


2. Create a bar graph showing the frequencies of the twenty most-used tokens in our corpus after removing URLs and stopwords, but preserving hashtags as tokens (e.g. “#19thamendment” should be a single token). Now create a similar plot that ONLY includes the hashtags.

Hint: you can do hashtag preservation in many ways, but you might find an easy solution by browsing the documentation for unnest_tokens. Searching on the internet may also be a good strategy.
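
To illustrate the idea behind the hint, here is a minimal sketch of hashtag-preserving tokenization using only base R regular expressions. It is one possible approach, not the expected solution: the example tweets are made up, and the variable names (`tweets`, `tokens`, `hashtags`) are placeholders rather than objects from the lab data. A similar pattern may carry over to the tokenizer options described in the unnest_tokens documentation.

```r
# Toy tweets, invented for illustration (not taken from the lab corpus)
tweets <- c(
  "Celebrating the #19thAmendment today! https://example.com/a",
  "Proud to mark #19thAmendment history with colleagues"
)

# 1. Strip URLs first so they never become tokens
no_urls <- gsub("https?://[^[:space:]]+", "", tweets)

# 2. Lowercase, then extract runs of letters, digits, underscores, or
#    apostrophes, optionally prefixed by '#', so each hashtag stays whole
lowered <- tolower(no_urls)
tokens <- unlist(regmatches(lowered, gregexpr("#?[[:alnum:]_']+", lowered)))

# 3. Keep only the tokens that are hashtags
hashtags <- tokens[startsWith(tokens, "#")]

# Token frequencies, most frequent first (stopword removal is not shown here)
token_counts <- sort(table(tokens), decreasing = TRUE)
```

Removing URLs before tokenizing keeps fragments such as "https" out of the counts; stopwords would still need to be filtered before plotting.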

# your answer here


3. Create a bar graph showing the tf-idf scores of the ten tokens with the highest values in our corpus, again preserving hashtags as tokens and removing URLs and stopwords. What do these scores mean? Give a hypothesis for why the top three have the highest values.

# your answer here
your written explanation here


4. For each of the top three tf-idf tokens, extract up to five tweets with the highest number of retweets that include the token. Based on the context these tweets provide, write a sentence explaining what each token means. Do they fit your hypotheses from the previous question?

# your answer here
your written explanation here


5. Create a new column in senator_tweets that records the time of day at which each tweet was posted, and make a bar graph comparing the number of tweets published during the day (5am-5pm) vs. at night.

Hint: you may need to use as.POSIXlt to convert date information.
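
As a rough illustration of the hint, the sketch below parses a few made-up timestamps with as.POSIXlt and labels each one as day or night. The timestamp format, the UTC time zone, and the choice to treat 5pm itself as night are all assumptions; the actual date column in senator_tweets may be stored differently.

```r
# Hypothetical timestamps; the real column name and format may differ
created <- c(
  "2020-08-18 04:59:00",
  "2020-08-18 05:00:00",
  "2020-08-18 16:59:59",
  "2020-08-18 17:00:00"
)

# Parse the strings as date-times (the time zone is an assumption)
parsed <- as.POSIXlt(created, tz = "UTC", format = "%Y-%m-%d %H:%M:%S")

# POSIXlt objects expose the hour of day (0-23) as a component
hour_posted <- parsed$hour

# Day is taken as 5:00 up to, but not including, 17:00
time_of_day <- ifelse(hour_posted >= 5 & hour_posted < 17, "day", "night")
```

If the tweets should be bucketed by local time rather than UTC, the `tz` argument is the place to change that.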

# your answer here


6. Use the “bing” sentiment dictionary to compare the average sentiment for Tweets published in daytime vs nighttime using a bar plot. You get to choose how you will create these sentiment scores for comparison - explain and justify your decision. Also explain your interpretation of the results.

# your answer here
Explain why you chose to compute sentiment in this way.


7. Create a custom dictionary with at least two categories (e.g. positive/negative, happy/sad, solution/problem-oriented) and compare daytime vs. nighttime scores for each category. What does this result tell you about your data? What is your dictionary capturing here?

Hint: you may want to look at the bing dictionary to see how you should format your custom dictionary.
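
For reference, the bing lexicon as returned by tidytext's get_sentiments("bing") is a two-column table with a `word` column and a `sentiment` column. The sketch below builds a tiny dictionary in that shape and matches it against a few tokens using base R. The words, categories, and token table are invented purely for illustration; your own dictionary should reflect what you want to measure.

```r
# A toy custom dictionary in the same two-column shape as bing
# (the words and categories here are illustrative assumptions)
custom_dict <- data.frame(
  word = c("fix", "solve", "plan", "crisis", "failure", "threat"),
  sentiment = c("solution", "solution", "solution",
                "problem", "problem", "problem"),
  stringsAsFactors = FALSE
)

# Hypothetical one-token-per-row table with a time-of-day label
tokens <- data.frame(
  time_of_day = c("day", "day", "day", "night", "night"),
  word = c("plan", "crisis", "the", "threat", "failure"),
  stringsAsFactors = FALSE
)

# Inner join: tokens that are not in the dictionary are dropped
matched <- merge(tokens, custom_dict, by = "word")

# Count dictionary hits per time of day and category
counts <- table(matched$time_of_day, matched$sentiment)
```

Raw counts like these depend on how many tweets fall in each period, so some form of normalization (for example, hits per token) may be worth considering before comparing day and night.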

# your answer here
Explain what your dictionary is intended to capture and interpret the results.


8. Using the data you have collected for your final project, show one preliminary result or statistic from an analysis you ran. If you haven’t collected your dataset computationally, try to look anecdotally at the original source (e.g. if Twitter is your dataset, then just look on the Twitter website) and give one observation about the data. Try to make an observation or result based on one of the variables you will use for your final analysis. What do you see? Send your figures and statistics directly to your TA in Slack - don’t add them to your script.

written description here