Create a Word Counter in Python


This chapter is for those new to Python, but I recommend everyone go through it, just so that we are all on equal footing.

Baby steps: Read and print a file

Okay folks, we are going to start gentle. We will build a simple utility called word counter. Those of you who have used Linux will know this as the wc utility. On Linux, you can type wc followed by a file name to get the number of words, lines and characters in that file. The wc utility is quite advanced, of course, since it has been around for a long time. We are going to build a baby version of that. This is more interesting than just printing Hello World to the screen.

With that in mind, let’s start. The file we are working with is read_file.py, which is in the folder WordCount.

The first line that starts with a #! is used mainly on Linux systems. It tells the shell that this is a Python file, and should be run as such. It also tells Linux which interpreter to use (Python in our case). It doesn’t do any harm on Windows (as anything that starts with a # is a comment in Python), so we keep it in.

Let’s start looking at the code.

We simply open a file called “birds.txt”. It must exist in the current directory (i.e., the directory you are running the code from). Later on, we will cover reading from the command line, but for now, the path is hard-coded. The r means the file will be opened in read-only mode. Other common modes are w for write and a for append. You can also read/write binary files; however, we won’t go into that for the moment. Our files are plain text.

After opening the file, we read its contents into a variable called data, and close the file.
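Put together, the steps so far might look like the sketch below, which also includes the print step described next. It is a reconstruction under assumptions, not the book's own listing: the interpreter path on the first line is a common choice rather than a confirmed one, and the block at the top creates a small made-up birds.txt only if none exists, so that the sketch runs on its own.

```python
#!/usr/bin/env python3
import os

# Assumption: create a tiny stand-in birds.txt if there is none yet.
# The two sentences are made up; skip this if you have the real file.
if not os.path.exists("birds.txt"):
    f = open("birds.txt", "w")
    f.write("Crows are clever birds.\nOwls hunt at night.\n")
    f.close()

# Open the file in read-only mode ("r"), read everything, then close it.
f = open("birds.txt", "r")
data = f.read()
f.close()

# Show the contents on the screen.
print(data)
```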

And we print the file.

And now, to test our code. If you are on Linux, you can just type ./read_file.py from inside the WordCount folder to run it. You might have to make the file executable first (chmod +x read_file.py). On Windows, go to the folder called WordCount and run the file there with python read_file.py.

And there you go. Your first Python program.

Count words and lines

Okay, so we can read a file and print it on the screen. Now, to count the number of words. We’ll be using the file count_words.py in the WordCount folder.

These lines should be familiar by now. We open the file and read it.

Python has several built-in functions for strings. One is the split() function, which splits the string on the given parameter. In our code, we are splitting on a space. The function returns a list (which is what Python calls arrays) of the pieces of the string that were separated by spaces.

To see how this works, I’ll fire up an IPython console.

I took a sentence “I am a boy” and split it on a space. Python returned a list with four elements: ['I', 'am', 'a', 'boy'].
We can split on anything. Here, I split on a comma:
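A small stand-in for that console session, which can be pasted into any Python prompt. The comma-separated string is a made-up example:

```python
sentence = "I am a boy"
words = sentence.split(" ")
print(words)   # ['I', 'am', 'a', 'boy']

# Made-up example string, split on a comma instead of a space.
fruits = "apples,oranges,pears".split(",")
print(fruits)  # ['apples', 'oranges', 'pears']
```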

Coming back to our example:

You should know what we are doing now. We are splitting the file we read on spaces. This should give us the number of words, as in English, words are separated by space (as if you didn’t know already).

So we print the words that we found. Next, we call the len() function, which returns the length of a list. Remember I said the split() function breaks the string into a list? Well, by using the len() function, we can find out how many elements the list has, and hence the number of words.
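As a rough sketch of the word-counting steps just described (the variable names words and num_words are guesses, and a short made-up string stands in for the contents of birds.txt):

```python
# Made-up stand-in for the text read from birds.txt.
data = "Crows are clever birds.\nOwls hunt at night.\n"

# Break the text into a list of words, splitting on single spaces.
words = data.split(" ")
print("The words in the text are:")
print(words)

# The length of the list is the number of words found.
num_words = len(words)
print("The number of words is", num_words)
```

Splitting on a single space has a limit worth knowing about: the last word of one line stays glued to the first word of the next, because a newline rather than a space sits between them. Calling split() with no argument splits on any run of whitespace instead, which is one possible refinement.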

Next, we find the number of lines by using the same method.
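A sketch of that step, again with a made-up string standing in for the file contents and guessed variable names:

```python
# Made-up stand-in for the text read from birds.txt.
data = "Crows are clever birds.\nOwls hunt at night.\n"

# Break the text into a list of lines, splitting on the newline character.
lines = data.split("\n")

# The length of the list is taken as the number of lines.
num_lines = len(lines)
print("The number of lines is", num_lines)
```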

We do the same thing, except this time we split on the newline character (“\n”). For those who don’t know, the newline character is the code that tells the editor to insert a new line, a return. By counting the number of newline characters, we can get the number of lines in the file.

Run the file count_words.py, and see the results.

Now open the file birds.txt and count the number of lines by hand. You’ll find the answers are different. That’s because there is a bug in our code. It is counting empty lines as well. We need to fix that now.

Count lines fixed

This is the old code, the one we need to correct.
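A sketch of that line-counting step, run on a made-up text containing a blank line, shows where the extra count comes from:

```python
# Made-up text: two lines of content with one blank line between them.
data = "Crows are clever birds.\n\nOwls hunt at night.\n"

lines = data.split("\n")
print(lines)
# ['Crows are clever birds.', '', 'Owls hunt at night.', '']

# The empty strings are counted too, so the result is too high.
print("The number of lines is", len(lines))  # 4, but only 2 have text
```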

For loop in Python

The syntax of the for loop is:
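In outline it is the keyword for, a variable name, the keyword in, the thing to loop over, and a colon, followed by an indented body. A minimal sketch with a made-up list:

```python
birds = ["crow", "owl", "robin"]

for bird in birds:
    # This indented line runs once for every item in the list.
    print(bird)
```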

 

A few key things. There is a colon (:) after the for instruction. And in Python, there are no braces {} or start-end keywords. For example, if you come from a C/C++/Java/C# type world, you would write your for loop along the lines of for (int i = 0; i < 10; i++) { ... }, with the body of the loop between the braces.

 

The curly braces {} tell the compiler that this code is under the for loop. Python doesn’t have these braces. Instead, it uses white space/indentation. If you don’t indent the code under the for loop, Python will complain with an IndentationError.

 

The correct way to do it is:
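A sketch of both cases, with a made-up loop. The unindented version is kept in a string and handed to the built-in compile() only so that the error can be shown without stopping the script; the correctly indented version follows it:

```python
# Wrong: the body of the loop is not indented.
bad_code = "for i in range(3):\nprint(i)\n"

try:
    compile(bad_code, "<example>", "exec")
    complained = False
except IndentationError as err:
    complained = True
    print("Python complains:", err.msg)

# Right: the body is indented by four spaces.
for i in range(3):
    print(i)
```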

 

How much indentation to use? Four spaces is recommended. If you are using a good text editor like Sublime Text, it will do that automatically.
Coming back to our code,

 

Let’s go over this line by line.

We are looping over our list, lines. On each pass through the loop, the variable l holds the current line.

As a side note, those of you who come from a C/C++ background may be surprised that we are not using an index into the list. We don’t need to. Python will take the list lines and automatically loop over it. We don’t need to do lines[0], lines[1], lines[2] etc. like you would do in C/C++. In fact, doing so is an anti-pattern.

So now we have each line. We need to now check if it is empty. There are many ways to do it. One is to test len(l) == 0. This checks if the current line has a length of 0, which is fine, but there is a more elegant way of doing it: if not l.

 

The not keyword does the emptiness check for us: an empty string counts as false in Python, so not l is true exactly when the line is empty. If the line is empty, we remove it from the list using the remove() method.

Again, like the for loop, we need to give four spaces to let Python know that this instruction is under the if condition.
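Put together, the fix might look like the sketch below, again on a made-up text. One caveat: removing items from a list while looping over that same list can skip elements, for instance the second of two empty lines in a row. The sketch therefore loops over a copy made with list(lines), which is an adjustment on top of the steps described above.

```python
# Made-up text with two blank lines in a row.
data = "Crows are clever birds.\n\n\nOwls hunt at night.\n"
lines = data.split("\n")

# Loop over a copy so that removing from lines does not disturb the loop.
for l in list(lines):
    if not l:            # true when l is an empty string
        lines.remove(l)  # removes the first empty string still in lines

print(lines)
print("The number of lines is", len(lines))
```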

We should now have the correct number of lines. Run count_lines_fixed.py to see the results.

Bringing it all together

Now we need to tie it all together. word_count.py is our final file.

The only new thing here is the import sys statement. This is needed to read from the command line.

We will break our code into functions now. The way to write a function in Python is:

 

def defines a function. Notice the colon (:) and the white space? Like loops and if conditions, you need to use indentation for code under the function definition.
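A minimal sketch of that shape, using a made-up function that has nothing to do with the word counter:

```python
def add_numbers(a, b):
    # Everything indented under def belongs to the function.
    total = a + b
    return total

print(add_numbers(2, 3))  # prints 5
```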

Our first function counts the number of words:

 

It takes in the string data, and returns the number of words. Keep in mind this is the exact same code as before; the only difference is that now it is in a function.
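Based on that description, the function could be sketched as follows. The local variable names are guesses:

```python
def count_words(data):
    # Split the text on spaces and count the resulting pieces.
    words = data.split(" ")
    num_words = len(words)
    return num_words

print(count_words("I am a boy"))  # prints 4
```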

The function to count lines is similar:
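A sketch of it, assuming it wraps the empty-line fix from the previous section. The loop runs over a copy of the list, since removing from a list while looping over it directly can skip items:

```python
def count_lines(data):
    lines = data.split("\n")
    # Drop the empty strings left behind by blank lines.
    for l in list(lines):
        if not l:
            lines.remove(l)
    return len(lines)

print(count_lines("first line\n\nsecond line\n"))  # prints 2
```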

 

 

The next part is one of the most Googled lines: if __name__ == "__main__":

There are two ways to call Python files:

  1. You can call the file directly, python filename, which is what we’ve been doing.

  2. You can call the file as an external library.

We haven’t covered calling the file as a library yet. If you wanted to use the function count_words in another file, you would write from word_count import count_words at the top of that file.

This will take the function count_words and make it available in the new file. You can also write from word_count import * instead.

This will import all functions and variables, but generally, this approach isn’t recommended. You should only import what you need.

Now, sometimes you’ll have code you only want to run if the file is being called directly, i.e., without an import. If so, you can put it under the line if __name__ == "__main__":

This means (in simple English): Only run this code if I am running this file from the command line (or something similar). If we import this file in another, all of this code will be ignored.

By using this syntax, you can ensure your code only runs when someone calls your program directly, and not when they import it as a module.

__name__ is an internal variable that the Python interpreter sets to the string "__main__" when we are running the program standalone.

Now in our examples, we have been calling the files directly, but I thought I’d show you this syntax in case you ever saw it on the web. It is optional in our case, but good practice in case you want to turn your code into a library. Even if you don’t want to at the moment, it’s a good idea to use the command as it’s only one line.
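A small sketch of the guard in use. Run directly, the file prints a count; imported from another file, it only makes count_words available and prints nothing. The sample sentence is made up:

```python
def count_words(data):
    return len(data.split(" "))

# This block runs only when the file is started directly,
# not when another file imports it.
if __name__ == "__main__":
    print(count_words("I am a boy"))
```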

 

Remember we imported the sys library? It gives access to several system-related functions and values, one of which is sys.argv, a list that holds the command line arguments. You already know our old friend len(). We check if the number of command line arguments is less than two (the first is always the name of the file), and if so, print a message and exit. This is what happens when you run word_count.py without giving it a file name.

 

The next line takes the file name from sys.argv[1].

As I said, the first element of sys.argv (sys.argv[0]) will be the name of the file itself (word_count.py in our case). The second will be the file the user entered. We read that.

We read the data from the file.

And now we call our functions to count the number of words and lines, and print the results. Voila! A simple word counter.
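The whole of word_count.py might then look roughly like the sketch below. It follows the steps described in this section, with a few assumptions: the usage message and the output wording are made up, the main() helper is an arrangement chosen here to keep the pieces easy to try out, and the missing-argument case returns from main() instead of calling sys.exit(), which ends the program all the same because nothing follows the call.

```python
#!/usr/bin/env python3
import sys


def count_words(data):
    words = data.split(" ")
    return len(words)


def count_lines(data):
    lines = data.split("\n")
    # Loop over a copy so that removing items does not skip any.
    for l in list(lines):
        if not l:
            lines.remove(l)
    return len(lines)


def main(argv):
    # argv[0] is the name of the script, argv[1] the file to count.
    if len(argv) < 2:
        print("Usage: python word_count.py <file>")
        return

    filename = argv[1]
    f = open(filename, "r")
    data = f.read()
    f.close()

    print("The number of words:", count_words(data))
    print("The number of lines:", count_lines(data))


if __name__ == "__main__":
    main(sys.argv)
```

Run as python word_count.py birds.txt, it prints the two counts; run without a file name, it prints the usage message and stops.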

 

The word counter isn’t perfect, and if you try it with different files, you will find many bugs. But it’s good enough for us to move on to the next chapter.