Lab #9 Markdown File

Lab Instructions

In this lab, we will practice working with dictionary-based methods for text analysis.

See the “Instructions” section of the Introduction to Lab Assignments page for more information about the labs. That page also gives descriptions for the datasets we will be using.

Required reading:

Optional reading:


Questions


1. In which scenarios would it be best to consider dictionary-based approaches to text analysis? How does the decision to use dictionary-based approaches shape the research questions you can ask?

# your answer here


2. Create a bar graph showing the frequencies of the twenty most-used tokens in our senator_tweet_sample corpus after removing URLs and stopwords, but preserving hashtags as tokens (e.g. “#19thamendment” should be a single token). Now create a similar plot that ONLY includes the hashtags.

Hint: you can do hashtag preservation in many ways, but you might find an easy solution by browsing the documentation for unnest_tokens - see the token parameter. Searching on the internet may also be a good strategy.
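One way to read the hint, sketched on toy data rather than the real corpus: older tidytext releases offered a tweet-aware tokenizer through the token parameter, but whether it is available depends on the installed version, so this sketch uses token = "regex" instead. The column name `text`, the toy tweets, and the splitting pattern are all assumptions made for illustration.

```r
library(dplyr)
library(tidytext)
library(stringr)

# Toy data standing in for the real corpus; the column name `text` is an assumption.
toy_tweets <- tibble(
  id = 1:2,
  text = c(
    "Celebrating the #19thAmendment today! https://example.com/vote",
    "The #19thAmendment changed the Senate and the country."
  )
)

tokens <- toy_tweets %>%
  # Drop URLs before tokenizing so they never become tokens.
  mutate(text = str_remove_all(text, "https?://\\S+")) %>%
  # token = "regex" splits the text wherever the pattern matches. Splitting on
  # runs of characters that are NOT letters, digits, "#", "_" or "'" keeps
  # each hashtag together as one token.
  unnest_tokens(word, text, token = "regex", pattern = "[^A-Za-z0-9#_']+") %>%
  filter(word != "") %>%
  anti_join(stop_words, by = "word")

# Keep only the tokens that start with "#".
hashtags <- tokens %>% filter(str_starts(word, "#"))
```

Because unnest_tokens lowercases by default, the hashtag comes out as "#19thamendment". The pattern only covers ASCII letters, so it would need adjusting for accented characters or emoji.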

# your answer here


3. For each of the three most frequent non-stopword tokens, extract up to three of the most-retweeted tweets that contain the token. Based on the context these tweets provide, write a sentence describing how each token seems to be used.

Hint: it might be useful to use str_count here.
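A minimal sketch of how str_count could be combined with a retweet ranking, using toy data. The column names `text` and `retweets`, the token "vote", and the numbers are hypothetical; the real dataset may name things differently.

```r
library(dplyr)
library(stringr)

# Toy data; `text` and `retweets` are assumed column names.
toy_tweets <- tibble(
  text = c("Vote today", "Health care vote", "No mention here",
           "VOTE vote vote", "Another vote"),
  retweets = c(10, 50, 99, 5, 30)
)

target <- "vote"  # hypothetical token of interest

top_for_token <- toy_tweets %>%
  # str_count() returns the number of matches in each string;
  # fixed() treats the token as literal text rather than a regex.
  mutate(n_hits = str_count(str_to_lower(text), fixed(target))) %>%
  filter(n_hits > 0) %>%
  # Keep the three matching tweets with the most retweets.
  slice_max(retweets, n = 3, with_ties = FALSE)
```

Note that a literal substring match also counts longer words (for example, "voter" contains "vote"), so a pattern with word boundaries may be preferable for short tokens.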

# your answer here
your written explanation here


4. Create a bar graph showing the tf-idf scores of the ten tokens with the highest values in our corpus, again preserving hashtags as tokens and removing URLs and stopwords. What do these scores mean? Give a hypothesis for why the top three have the highest values.
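As a reminder of what a tf-idf score measures, here is a small sketch using tidytext's bind_tf_idf on toy counts. The documents "A" and "B" and the word counts are made up; what counts as a "document" in the real corpus (each tweet, each senator, or something else) is a choice this sketch does not make for you.

```r
library(dplyr)
library(tidytext)

# Toy per-document word counts (hypothetical values).
toy_counts <- tibble(
  doc  = c("A", "A", "B", "B"),
  word = c("tax", "vote", "vote", "health"),
  n    = c(2, 2, 1, 3)
)

scored <- toy_counts %>%
  # tf  = share of a document's tokens taken up by the word
  # idf = natural log of (number of documents / documents containing the word)
  bind_tf_idf(word, doc, n) %>%
  arrange(desc(tf_idf))
```

In this toy example "vote" appears in both documents, so its idf is ln(2/2) = 0 and its tf-idf is 0, while "health" appears in only one document and makes up most of it, so it scores highest.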

# your answer here
your written explanation here


5. Create a new column in senator_tweet_sample that records the time of day at which each tweet was posted, and make a bar graph comparing the number of tweets published during the day (5am-5pm) vs. at night.

Hint: see the hour function of lubridate.
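A minimal sketch of the hour function on toy timestamps. The column name `created_at`, the UTC time zone, and the decision to treat 5pm itself as night (a half-open 5am to 5pm interval) are assumptions; the real timestamps may need converting to a local time zone first.

```r
library(dplyr)
library(lubridate)

# Toy timestamps around the two boundaries; `created_at` is an assumed column name.
toy_tweets <- tibble(
  created_at = ymd_hms(
    c("2020-08-18 04:59:00", "2020-08-18 05:00:00",
      "2020-08-18 16:59:59", "2020-08-18 17:00:00"),
    tz = "UTC"
  )
)

labelled <- toy_tweets %>%
  mutate(
    # hour() extracts the hour of the day as a number from 0 to 23.
    hour_posted = hour(created_at),
    # Hours 5 through 16 count as day; everything else counts as night.
    time_of_day = if_else(hour_posted >= 5 & hour_posted < 17, "day", "night")
  )
```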

# your answer here


6. Use the “bing” sentiment dictionary to compare the average sentiment for tweets published in daytime vs nighttime using a bar plot. You get to choose how you will create these sentiment scores for comparison - explain and justify your decision. Also explain your interpretation of the results.

Hint: use get_sentiments("bing") to get the Bing dictionary.
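To show the mechanics of a dictionary join, here is a sketch on toy tokens. The net score (positive matches minus negative matches per tweet) is only one of several possible scoring choices, not a recommended answer, and the `tweet_id` column and toy words are assumptions.

```r
library(dplyr)
library(tidytext)

# Toy tokens, one row per token; `tweet_id` is an assumed column name.
toy_tokens <- tibble(
  tweet_id = c(1, 1, 1, 2, 2),
  word     = c("good", "happy", "senate", "bad", "good")
)

# The Bing dictionary has two columns: `word` and `sentiment`.
bing <- get_sentiments("bing")

scores <- toy_tokens %>%
  # inner_join() keeps only the tokens that appear in the dictionary.
  inner_join(bing, by = "word") %>%
  group_by(tweet_id) %>%
  # One possible score: positive matches minus negative matches.
  summarise(net = sum(sentiment == "positive") - sum(sentiment == "negative"))
```

Tweets with no dictionary matches drop out of an inner join entirely, which affects any average computed afterwards; whether to keep them with a score of zero is part of the design decision the question asks you to justify.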

# your answer here
Explain why you chose to compute sentiment in this way.


7. Create a custom dictionary with at least two categories (e.g. positive/negative, happy/sad, solution/problem-oriented) and compare daytime vs. nighttime scores for each category. What does this result tell you about your data? What is your dictionary capturing here?

Hint: you may want to look at the bing dictionary (get_sentiments("bing")) to see how you should format your custom dictionary.
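The Bing dictionary is a two-column table with `word` and `sentiment` columns, so a custom dictionary can follow the same shape. The sketch below is illustrative only: the category labels, the six words, and the toy tokens with a `time_of_day` column are hypothetical.

```r
library(dplyr)

# A hypothetical custom dictionary in the same two-column format as Bing.
custom_dict <- tibble(
  word      = c("fix", "solve", "plan", "crisis", "failure", "threat"),
  sentiment = c("solution", "solution", "solution",
                "problem", "problem", "problem")
)

# Toy tokens, already labelled with an assumed `time_of_day` column.
toy_tokens <- tibble(
  time_of_day = c("day", "day", "night", "night", "night"),
  word        = c("plan", "crisis", "threat", "failure", "senate")
)

category_counts <- toy_tokens %>%
  # Keep only tokens found in the custom dictionary.
  inner_join(custom_dict, by = "word") %>%
  # Count matches per time of day and dictionary category.
  count(time_of_day, sentiment)
```

Raw counts like these depend on how many tweets fall in each period, so dividing by the number of tweets or tokens per period may give a fairer comparison.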

# your answer here
Explain what your dictionary is intended to capture and interpret the results.


8. Using the data you have collected for your final project, show one preliminary result or statistic from an analysis you ran. If you haven’t collected your dataset computationally, try to look anecdotally at the original source (e.g. if Twitter is your dataset, then just look on the Twitter website) and give one observation about the data. Try to make an observation or result based on one of the variables you will use for your final analysis. What do you see? Send your figures and statistics directly to your TA in Slack - don’t add them to your script.

written description here