Practice Session: Sentiment Analysis with Twitter

0. Introduction to NLP and Sentiment Analysis

1. Natural Language Processing with NLTK

2. Intro to NLTK, Part 2

3. Build a sentiment analysis program: We finally use all we learnt above to make a program that analyses sentiment of movie reviews

4. Sentiment Analysis with Twitter: A practice session for you, with a bit of learning.


 

Okay, so the practice session.

This contains a mixture of me teaching you stuff (like how to read tweets from your NLTK corpora), plus code you write yourself.

Sentiment Twitter.ipynb is the file you are working with.

Okay, since this is a practice session, I won’t explain everything. You are supposed to learn. I will cover the new stuff related to reading Twitter data.

This Twitter data is in the NLTK corpora folder, under twitter_samples.

You have your positive and negative tweets, in JSON format, which is how Twitter returns them.

Side note: If you want to know how to gather your own Twitter samples, see here.

But you don’t have to worry about parsing JSON files. NLTK gives us handy functions to read the data.

Warning: The tweets in this data set were labeled negative or positive based on smileys. So if a tweet contained :) it was classified as positive. This will have ramifications for us later on.

Make sure you have the notebook open in a new tab, as I will only go over parts of it. I expect you to write the code for the rest yourself. If you get stuck, look at the last lesson, as the code is very similar.

As with the movie reviews, you can list the files in the corpus. Here you have the negative tweets, the positive tweets, and the complete list.

This is how you read the Twitter samples. I’m only showing you the negative file, but the process is the same for the positive.
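The notebook cell itself isn't reproduced here, so here is a minimal sketch of what the reading step could look like. It assumes the corpus has already been fetched with nltk.download('twitter_samples'). The helper names first_tweets and show_negative_sample are made up for this example; fileids() and strings() are methods of the twitter_samples corpus reader.

```python
def first_tweets(corpus, fileid, n=5):
    """Return the first n tweet texts from one file of a tweet corpus.

    `corpus` is expected to behave like nltk.corpus.twitter_samples,
    i.e. expose strings(fileid) returning a list of tweet texts.
    """
    return list(corpus.strings(fileid)[:n])


def show_negative_sample():
    """Print the corpus file names and the first 5 negative tweets."""
    # Imported here so first_tweets() stays usable without NLTK installed.
    from nltk.corpus import twitter_samples

    print(twitter_samples.fileids())
    for tweet in first_tweets(twitter_samples, "negative_tweets.json"):
        print(tweet)
```

Calling show_negative_sample() in a notebook cell should list the corpus files and then a handful of raw tweet strings. Swapping in "positive_tweets.json" gives you the positive ones.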

I only print the first 5 tweets above.

The next steps are to rewrite the create_word_features() function, and then collect the negative and positive tweets.
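If you want a starting point, here is one way the rewritten function could look. Treat it as a sketch, not the notebook's answer: it assumes the same shape as the movie review version (a dict mapping each kept word to True), the small STOP_WORDS set is a stand-in for NLTK's English stop word list, and label_tweets is a hypothetical helper that splits tweets on whitespace.

```python
# Stand-in stop word list (assumption); in the notebook you would more
# likely use nltk.corpus.stopwords.words("english").
STOP_WORDS = {"a", "an", "and", "i", "is", "it", "the", "to"}


def create_word_features(words, stop_words=STOP_WORDS):
    """Turn a list of tokens into a {word: True} feature dict."""
    return {
        word.lower(): True
        for word in words
        if word.lower() not in stop_words
    }


def label_tweets(tweets, label):
    """Pair each tweet's feature dict with its label, e.g. "positive"."""
    # str.split() is a deliberately naive tokenizer for this sketch.
    return [(create_word_features(tweet.split()), label) for tweet in tweets]
```

For example, label_tweets(["great movie"], "positive") gives a one-item list holding the feature dict for that tweet and the label "positive", which is the format NLTK classifiers train on.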

You will get 5000 tweets in each category. Create the training list with 8000 tweets, and the test list with the remaining 2000.

Create a Naive Bayes classifier and test your algorithm.
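The split and the training step can be sketched roughly as follows. I'm assuming a balanced split (4000 training and 1000 test tweets from each category), which is one way to reach 8000 and 2000 but not the only one. The function names split_train_test and train_and_score are hypothetical; NaiveBayesClassifier.train and nltk.classify.util.accuracy are the NLTK calls.

```python
def split_train_test(neg_tweets, pos_tweets, train_per_class=4000):
    """Split each class at train_per_class and merge the two halves."""
    train_set = neg_tweets[:train_per_class] + pos_tweets[:train_per_class]
    test_set = neg_tweets[train_per_class:] + pos_tweets[train_per_class:]
    return train_set, test_set


def train_and_score(train_set, test_set):
    """Train a Naive Bayes classifier and return (classifier, accuracy)."""
    # Imported lazily so split_train_test() works without NLTK installed.
    from nltk.classify import NaiveBayesClassifier
    from nltk.classify.util import accuracy

    classifier = NaiveBayesClassifier.train(train_set)
    return classifier, accuracy(classifier, test_set)
```

Both lists are expected to hold (feature_dict, label) pairs. With 5000 labeled tweets per category, split_train_test returns 8000 training items and 2000 test items.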

I got an accuracy of 97%, which just looks wrong.

Why? Take a minute to think.

The original data contains smileys. So our machine learning algorithm can ignore the text, and just look at the smiley to label a tweet positive or negative. We need to remove the smileys.

I give you some code:

The last line of that code is what you will use in your code to remove the smileys.

You will need to recreate the negative and positive tweets, this time removing the smileys using the code above.
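If you don't have the notebook's snippet in front of you, here is one possible way to strip smileys with a regular expression. This is an illustration under my own assumptions, not the notebook's code: the pattern only covers common ASCII smileys such as :) :-( :D and ;), and the names SMILEY_PATTERN and remove_smileys are made up.

```python
import re

# Illustrative pattern only (assumption): eyes, an optional "-" nose, then
# one or more mouth characters. It is not exhaustive, and the lookahead
# keeps it from eating letters in the middle of a word.
SMILEY_PATTERN = re.compile(r"[:;=]-?[)(\]\[DPp]+(?!\w)")


def remove_smileys(text):
    """Delete smileys, then collapse the whitespace they leave behind."""
    without_smileys = SMILEY_PATTERN.sub(" ", text)
    return " ".join(without_smileys.split())
```

You would apply remove_smileys to each tweet before building its word features, so the classifier never sees the smiley.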

Re-run the Naive Bayes classifier. This time the accuracy goes down to 75%, but the result is more trustworthy, as the classifier now relies on the actual tweet text and not on the presence of a smiley (which may not even be there).

This is the key lesson: Make sure the data you work with is representative of what you are testing.

Test your classifier with real tweets about movies. Is it accurate?
