Blog Post 1
Introduction
Hello!
This is the first blog post for Brown University’s CS1951a Data Science Final Project.
The students in our project group are: Jack Kates, Julia Windham, Leo Ryu, Nam Do, and Sandra Ha.
Vision
One of the greatest problems driving apart modern society is the bitter divide between political parties. Bipartisian dialogue gets drowned beneath the name-calling, insult-tossing, and tantrum-throwing antics of members on both sides of the political spectrum.
Our goal for the final project is to use Twitter data in order to ascertain public sentiment about various political issues. By doing so, we hope to discover which political issues create the most volatile impact amongst the general population.
Data
Our method of collecting data for this project is two fold.
The first is using tweets from the publically available datasets found here. One dataset contains tweets made by various politicians, and the second dataset contains tweets containing political information.
The second is using the Twitter API to streams real time tweets that contain a specific keyword. For example, we can use the API to stream 1000 tweets containing the word “debt”. By streaming tweets based on keywords, we hope to collect tweets that accurately represent current public sentiment.
Initial Analysis
We loaded the data into a MySQL database. Our tweet database contains 1.6 million tweets.
About 40% of the tweets in the database have never been retweeted, but the distribution is very long-tailed. The most popular tweet was retweeted more than 3.3 million times!
The dataset contains tweets of 341 Republicans and 306 Democrats, along with other public figures who contribute to political discourse.
Next Steps
To measure the polarity of tweets, we will use sentiment analysis and retweet count. Currently, we are setting up a system to run the tweets through an existing language model to classify their sentiment from negative to positive. Our first round of analysis will look at relationships like party vs. sentiment, sentiment vs. retweet count, and the most commonly used words tweeted by American political figures.