Intro to NLTK, Part 2

0. Introduction to NLP and Sentiment Analysis

1. Natural Language Processing with NLTK

2. Intro to NLTK, Part 2

3. Build a sentiment analysis program: We finally use everything we learnt above to build a program that analyses the sentiment of movie reviews

4. Sentiment Analysis with Twitter: A practice session for you, with a bit of learning.


We will go over topics like stopwords and the movie reviews corpus.

This short video prepares the ground for sentiment analysis.

Nltk_Intro_Part2.ipynb is the file to work with.

Only interested in videos? Go here for the next video: Build a sentiment analysis program.

Let’s import the functions we need:

The first concept we want to learn is stop words. English (like other languages) has many words that are very common but carry little or no meaning on their own: the, a, I, is and so on. When doing language processing, we usually want to get rid of these words, as they take up a large part of any sentence without adding much context or information.

NLTK ships with built-in stop word lists for a number of languages. To see the stop words for English:

We are only printing the first 16 stop words.
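The lookup described above can be sketched roughly as follows. This is a minimal sketch, assuming the stopwords corpus was downloaded in the previous lesson. FALLBACK_STOPS is a hypothetical stand-in of my own (not NLTK's list), there only so the snippet still runs when NLTK or the corpus is missing.

```python
# Hypothetical stand-in, used only if NLTK or its stopwords corpus is missing.
# It is NOT NLTK's actual list.
FALLBACK_STOPS = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves",
                  "you", "your", "yours", "yourself", "he", "him", "his", "himself"]

try:
    from nltk.corpus import stopwords
    # words("english") returns a plain list of strings
    english_stops = stopwords.words("english")
except (ImportError, LookupError):
    english_stops = list(FALLBACK_STOPS)

# Slice the list so we only print the first 16 entries.
first_16 = english_stops[:16]
print(first_16)
```

The exact words and their order may differ between NLTK versions, so treat whatever gets printed as illustrative.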

To see how to use this, we are going to see an example. I am taking a paragraph from

In the code, I use the full paragraph. Here, I’ll print one sentence, as it’s enough to prove my point.

These are the words of one sentence. Let’s see if we can remove the stop words:

We use a list comprehension to drop every word that appears in the stop words list. You can see we got rid of words like was, to and all.
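Here is a minimal sketch of that filtering step. The sentence is a made-up example (not the paragraph from the notebook), the whitespace split is a simplification of proper tokenizing, and FALLBACK_STOPS is a hypothetical stand-in used only when NLTK or its stopwords corpus is unavailable.

```python
# Hypothetical stand-in stop words, used only if NLTK or its corpus is missing.
FALLBACK_STOPS = {"it", "was", "a", "to", "all", "of", "when", "the"}

try:
    from nltk.corpus import stopwords
    stops = set(stopwords.words("english"))  # a set makes membership tests fast
except (ImportError, LookupError):
    stops = set(FALLBACK_STOPS)

# Hypothetical sentence, not the paragraph used in the notebook.
sentence = "it was a relief to all of us when the storm finally passed"
words = sentence.split()  # simple whitespace split; assumes no punctuation

# Keep only the words that are not in the stop words set.
keywords = [w for w in words if w not in stops]
print(keywords)
```

Converting the list to a set first is a small design choice: the comprehension checks membership once per word, and set lookups stay fast even for long texts.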

The result is a much smaller list (which means less processing) that nevertheless contains all the keywords.

Next, we will look at movie reviews. When you downloaded the extra data in the previous lesson, it included a lot of free texts for analysis: movie reviews, Twitter data, works by Shakespeare and more.

On Windows, it is in C:\Users\<username>\AppData\Roaming\nltk_data\corpora. Linux/Mac users can see here. If you forgot where it is, you can also run the download function again, and it will tell you.
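If you would rather ask NLTK directly, something like the sketch below should work. It uses nltk.data.find, which is a different route from re-running the download; find_corpus_dir is a hypothetical helper name of my own, and it simply returns None when NLTK or the corpus is not installed.

```python
def find_corpus_dir(name="movie_reviews"):
    """Return the location of an NLTK corpus as a string, or None if not found."""
    try:
        import nltk
        # nltk.data.find searches the directories listed in nltk.data.path
        return str(nltk.data.find("corpora/" + name))
    except (ImportError, LookupError):
        return None

print(find_corpus_dir())
```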

If you open the movie_reviews folder, you will see there are neg and pos folders. neg contains negative reviews, pos positive.


There are hundreds of files in each folder, each containing the review of one person. We will use these for machine learning in the next lesson.


Now, the great thing about the NLTK corpora is that you don’t need to manually read the files, parse the data, etc. NLTK provides ready-made functions to do so, which leaves more time for useful analysis.

We already imported movie_reviews at the top. To remind you:

We can just start using the data.

To see all the words in all the reviews:

Of course, there are too many words, and only a few are printed. You can look at the first negative review text file, and see the words above are taken from there.

You can see the categories:

You can see the list of files. For example, to see the first four files:
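The three lookups described above (words, categories and file IDs) can be sketched together like this. It is a minimal sketch, assuming the movie_reviews corpus is installed; corpus_overview is a hypothetical helper name, and it returns None when NLTK or the corpus is missing instead of crashing.

```python
def corpus_overview():
    """Return categories, first four file IDs and first ten words, or None."""
    try:
        from nltk.corpus import movie_reviews
        return {
            "categories": movie_reviews.categories(),         # review labels
            "first_files": movie_reviews.fileids()[:4],       # first four file IDs
            "first_words": list(movie_reviews.words()[:10]),  # first ten tokens
        }
    except (ImportError, LookupError):
        # NLTK or the movie_reviews corpus is not available here.
        return None

overview = corpus_overview()
print(overview)
```

As far as I know, fileids() also accepts a category, so movie_reviews.fileids("pos") should list only the positive reviews.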

We can create a frequency distribution of the words, which will allow us to see which words are the most common in our reviews:
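A minimal sketch of the frequency count, assuming the movie_reviews corpus is installed. The fallback tokens are a made-up sample, and collections.Counter stands in for FreqDist only if NLTK cannot be imported (both offer most_common).

```python
from collections import Counter

try:
    from nltk import FreqDist  # FreqDist builds on collections.Counter
except ImportError:
    FreqDist = Counter  # stand-in with the same most_common() method

def load_review_words():
    """Return all review tokens, or a tiny made-up sample if unavailable."""
    try:
        from nltk.corpus import movie_reviews
        return movie_reviews.words()
    except (ImportError, LookupError):
        # Hypothetical sample tokens, NOT taken from the corpus.
        return "the film is a mess , and the plot is the worst part of the film .".split()

# Count every token, then take the 20 most frequent (token, count) pairs.
all_words = FreqDist(load_review_words())
top = all_words.most_common(20)
print(top)
```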

Most of these words are stop words. When we build our sentiment analysis program, we’ll have to get rid of them.

We now have all the tools needed to write some code that will perform sentiment analysis on the movie reviews. Let’s do that now.

Next: Build a sentiment analysis program
