Introduction to NLTK (Natural Language Processing) with Python

0. Introduction to NLP and Sentiment Analysis

1. Natural Language Processing with NLTK

2. Intro to NLTK, Part 2

3. Build a sentiment analysis program: We finally use all we learnt above to make a program that analyses the sentiment of movie reviews.

4. Sentiment Analysis with Twitter: A practice session for you, with a bit of learning.


 

This first video is just a quick introduction to the NLTK library in Python.

All the source code for this and the upcoming videos is here.

Use Ntlk_Intro.ipynb for this lesson.

You need to download some data sources first. Run these commands (instructions in video as well):
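The exact commands are in the notebook and the video. As a rough sketch, assuming the standard NLTK downloader, they look something like the block below. The package list and the helper name download_lesson_data are assumptions for illustration, not the notebook's own code. Newer NLTK releases may ask for differently named packages; if so, the LookupError message names the one to fetch.

```python
import nltk

# Option 1 (interactive): opens the NLTK downloader so you can pick packages.
# Left commented out here because it waits for user input.
# nltk.download()

# Option 2 (non-interactive): fetch only a handful of packages.
# These names are assumptions about what this lesson needs.
LESSON_PACKAGES = ["punkt", "averaged_perceptron_tagger", "tagsets", "wordnet"]


def download_lesson_data(packages=LESSON_PACKAGES):
    """Download each package and report whether each download succeeded."""
    return {name: nltk.download(name, quiet=True) for name in packages}


results = download_lesson_data()
print(results)
```

Fetching packages by name also avoids pulling in everything the full download offers.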

Be prepared for it to take some time. There is one file I mention in the video, panlex_lite or something like that. Don’t download it, or just press cancel when the downloader reaches that file (because it is huge, and pretty useless).

Everything else should be easy.

Only interested in the video? Go to part 2 here: Intro to NLTK, Part 2

There is a great, completely free book for learning Natural Language Processing at http://www.nltk.org/book/

It also introduces you to Python if you are new to it. But there is a lot of information there, and it can be a bit overwhelming. What we will try to do in this lesson is go over the main features of the Python NLTK library.

We import everything we need. We will go over these functions as we use them.
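The notebook's import cell is not reproduced here. Going by the functions used in the rest of this lesson, it would look roughly like this (an assumed reconstruction):

```python
import nltk                               # top-level package (nltk.help, nltk.download)
from nltk.tokenize import word_tokenize   # splits text into word and punctuation tokens
from nltk import pos_tag                  # labels tokens with part-of-speech tags
from nltk.corpus import wordnet           # dictionary-style lookups: definitions, synonyms, antonyms
```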

To start off, say we have a sentence, and we want to extract all the words from it.

We can split the sentence on a space (" ") to get all the words. The problem is that punctuation marks like full stops stay attached to the neighbouring words, and such a simple approach will not be able to handle every type of sentence.
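Here is a small illustration of that limitation. The sentence is a made-up example, not the one from the notebook:

```python
sentence = "The quick brown fox jumps over the lazy dog."

# Plain string splitting: no NLTK involved.
words = sentence.split(" ")
print(words)
# The full stop stays glued to the last word: the final item is 'dog.'
```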

Which is why we should use the word tokenizer provided by the NLTK library. This correctly identifies punctuation marks:

The word_tokenize() function is very useful, and we will be using it later.

Another useful feature is that NLTK can figure out whether the parts of a sentence are nouns, adverbs, verbs, etc.

The pos_tag() function works on the output of word_tokenize():

If you want to know what those tags mean, you can search for them online, or use the inbuilt functions:

So for example, NN means singular noun, JJ means adjective, and so on.

NLTK has many great features, like finding the meaning of words, finding examples of words, finding similar and opposite words, etc. You can see how useful these features would be if you were building something like a search engine or a text parser.

Let’s look at a few of these features.

The first thing you can do is find the definition of any word.

The name() function gives the internal name of the word, since a word can have multiple definitions. In the example above, the word computer can mean the machine (stored internally as computer.n.01), or a human who performs calculations (stored as calculator.n.01).

Interesting side history in no way related to this lesson: Till World War II, computers were humans, usually women, whose job was to manually calculate the trajectories of missiles, artillery etc. Of course, they had cheat sheets, but they did the calculations by hand. Only after WWII, when computers became fast enough to do this job, did the word computer come to mean the machine. The definition above makes the distinction between the words.

You can also look at example usage of words:

Very useful advice- for me!

If you don’t know what hyponyms and hypernyms are, the Wikipedia page is a good place to look. Actually, just the main image gives you a good idea:

A hypernym is the broader category a word belongs to, like color in the image above. Hyponyms are the more specific words under it, like the colors red, blue and green.

You can see that communicate is the hypernym of speak, while speak has dozens of hyponyms, each a more specific way of speaking. Babble, chatter, rant, troll and yack are a few.

What if you want to find the opposite word (antonym) of a word?

We have to do it this way because antonyms are attached to lemmas, not to synsets. A lemma is the NLTK term for one specific word within a synset.

Lemmas can also be used to find all the synonyms of a word:

You will see Bible and Quran in the list above, because they are synonyms (similar words) of the word book. You will also see words like script and record.

This was a quick intro to the NLTK library. I won’t go over every feature, as the free book linked to earlier covers much more.

In the next lesson, we will look at some more features in the NLTK library that will help us build our sentiment analysis program.

Intro to NLTK, Part 2
