In this lab, we will practice working with text using the stringr, tidytext, and tm packages.

Be sure to do the required readings first. While many of the problems can be solved using approaches from the lecture videos, lab videos, or required readings, you may need to do some searching on the internet to solve some of the problems. This will be a valuable skill to learn as you develop as a data scientist.

This file should be submitted as both an R Markdown (.Rmd) file and a knitted HTML web page (.html) file, and you should start from the Lab markdown file (link to file) with the author name replaced with your own. While the R Markdown file should include all of the code you used to generate your solutions, the knitted HTML file should ONLY display the solutions to the assignment, NOT the code you used to produce them. Imagine this is a document you will submit to a supervisor or professor: they should be able to reproduce the code/analysis if needed, but otherwise only need to see the results and write-up. The TA should be able to run the R Markdown file to generate the exact same HTML file you submitted. Submit to your TA via direct message on Slack by the deadline indicated on the course website.

Required reading:

Optional reading:


1. Create a regular expression that matches a URL in the example string ex, and verify that it works using str_view_all (described in R for Data Science Ch. 14). The output should show both URLs highlighted. Now do the same for hashtags: strings consisting of a “#” symbol followed by one or more letters, numbers, or underscores, ignoring capitalization.

Hint: these are common tasks in cleaning and analyzing Tweet/text data, so doing some research might save you a lot of time.

Hint: be wary of how R specifically interprets regex strings. It might be helpful to look for regex strings specifically written for R.

# Your answer here

ex <- "BREAKING NEWS - @brumleader urges everyone to do their bit in order to tackle the threat posed by rising coronavirus case numbers in city. Full statement here:\n\n\n\n#KeepBrumSafe\n#Btogether\n#COVID19\n#Coronavirus"

# your solution here
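A possible starting point (not the only valid answer), assuming URLs begin with `http://` or `https://`; remember that in R string literals a regex backslash must be doubled:

```r
library(stringr)

# URLs: "http", an optional "s", "://", then any run of non-whitespace characters
url_pattern <- "https?://[^\\s]+"

# hashtags: "#" followed by one or more letters, digits, or underscores;
# the character class lists both cases, so capitalization is ignored
ht_pattern <- "#[A-Za-z0-9_]+"

str_view_all(ex, url_pattern)
str_view_all(ex, ht_pattern)
```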

2. Add two new columns to the senator_tweets dataframe: n_link should contain the number of URLs in the Tweet text, and n_ht the number of hashtags. Then, create a linear model predicting retweet_count from n_link and n_ht. Is either of these predictors statistically significant? Are they significant predictors of favorite_count? Be sure to show the model summaries.

Hint: be sure to read the stringr documentation.

# your solution here
written response here
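A sketch of one approach, assuming the Tweet text lives in a column named `text` (adjust to your dataframe's actual column name) and that the URL/hashtag patterns from question 1 are reused here:

```r
library(dplyr)
library(stringr)

senator_tweets <- senator_tweets %>%
  mutate(
    n_link = str_count(text, "https?://[^\\s]+"),  # URLs per Tweet
    n_ht   = str_count(text, "#[A-Za-z0-9_]+")     # hashtags per Tweet
  )

# one model per outcome variable
m_rt  <- lm(retweet_count ~ n_link + n_ht, data = senator_tweets)
m_fav <- lm(favorite_count ~ n_link + n_ht, data = senator_tweets)
summary(m_rt)
summary(m_fav)
```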

3. Using stringr and dplyr (not tm or tidytext), produce a dataframe of the 5 most used hashtags in our Tweets along with the number of times each was used. If more than one hashtag is tied for 5th place, you can ignore the tie and just choose 5 total (i.e. you could just use head(5)).

Hint: you may want to check out the unnest function (not unnest_tokens).

# your solution here
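One way to sketch this, again assuming a `text` column: str_extract_all() returns a list-column, which tidyr::unnest() flattens into one row per hashtag. Lowercasing first means capitalization variants of the same hashtag are counted together:

```r
library(dplyr)
library(stringr)
library(tidyr)

senator_tweets %>%
  mutate(hashtag = str_extract_all(str_to_lower(text), "#[a-z0-9_]+")) %>%
  unnest(hashtag) %>%            # one row per extracted hashtag
  count(hashtag, sort = TRUE) %>%
  head(5)
```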

4. Create a new column in senator_tweets called cleaned which includes the original Tweets with hashtags and links removed. We will use this column for the remaining questions.

# your solution here
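A minimal sketch, assuming the `text` column again: remove both patterns, then tidy up the whitespace left behind:

```r
library(dplyr)
library(stringr)

senator_tweets <- senator_tweets %>%
  mutate(
    cleaned = text %>%
      str_remove_all("https?://[^\\s]+") %>%  # drop URLs
      str_remove_all("#[A-Za-z0-9_]+") %>%    # drop hashtags
      str_squish()                            # collapse leftover whitespace
  )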

5. Using tidytext, produce a dataframe showing the ten most common words in the Tweets after URLs and hashtags have been removed (use our new cleaned column). Then show the most common words excluding stopwords.

Hint: look at the tidytext docs for unnest_tokens.

# your solution here
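A possible sketch: unnest_tokens() lowercases and strips punctuation by default, and tidytext ships a `stop_words` dataframe that can be removed with an anti_join:

```r
library(dplyr)
library(tidytext)

tweet_words <- senator_tweets %>%
  unnest_tokens(word, cleaned)   # one row per word

# ten most common words overall
tweet_words %>%
  count(word, sort = TRUE) %>%
  head(10)

# most common words with stopwords removed
tweet_words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  head(10)
```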

6. Create a document-term matrix that includes only English-language Tweets. We will discuss what to do with a dtm in the next lab.

# your solution here
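One sketch, assuming (as in rtweet output) a `lang` column marking Tweet language and a `status_id` column identifying each Tweet; adjust the names if your data differs:

```r
library(dplyr)
library(tidytext)

dtm <- senator_tweets %>%
  filter(lang == "en") %>%                 # keep English-language Tweets only
  unnest_tokens(word, cleaned) %>%
  count(status_id, word) %>%               # term counts per document
  cast_dtm(document = status_id, term = word, value = n)
dtm
```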

7. How could you potentially use text analysis in the final project you have been working on? (You don’t necessarily need to do it for the project, just think hypothetically).


8. Last week you proposed some datasets that you might be able to use for your final project in the class. If you haven’t yet, try to download or otherwise get access to the dataset so you can start playing with it. Either way, what did you find? Did your data have the information you needed after all? Was it as easy to access as you expected? Even if you’re not able to get all the data by now, write something about your plan for getting access to it.