Data Analysis with Pandas

You must have seen in the chapter on plotting that Python can be used to parse csv files. You might also have noted that it is fairly painful. While the standard library has the csv module, it still isn't perfect: you have to do a lot of the work manually.

Enter Pandas, a great library for data analysis. It is quite high level, so you don't have to muck about with low-level details unless you really want to.

If you are dealing with complicated or large datasets, seriously consider Pandas. It is built on top of numpy/scipy, sort of a superset of them, and it adds a lot of powerful features: reading directly from a Microsoft Excel file, doing data joins (like you would in SQL), and more. We will go over some of these features below.

Just one more thing before we continue. Pandas has two basic data structures: Series and DataFrames.

A Series is a one-dimensional labelled array, a bit like a cross between numpy's array and a dictionary, with a lot of extra features. A DataFrame is a two-dimensional data structure that contains both column and row information, like the fields of an Excel file. Remember that an Excel file has rows and columns, and an optional header row; all of that can be represented in a DataFrame. That's what we'll use in our examples below.

Installing pandas

You should get Pandas if you installed Anaconda, as recommended.

Note to people without a graphical interface: If you can’t view the graphs because you have no GUI, you can save the graphs instead. See here for details.

Analyse obesity in England

I’m using the 2014 data from here:

http://data.gov.uk/dataset/statistics_on_obesity_physical_activity_and_diet_england

I'll also put it in the Github directory, so you don't have to download it. Make sure you open the file in Excel (or OpenOffice) and view it, so you know what we're talking about. Do that now, and have a look at the different sections, especially sections 7.1 and 7.2, as those are what we'll focus on.

Let’s get started then.

Pandas is imported as pd to save typing, in the same way we import numpy as np.

We are opening the xls file. The great thing about pandas is that you can open Excel files directly. Normally, most libraries can only work with csv files. We then print all the sheet names.

For those of you not comfortable with Excel, a sheet is one “page”, as it were, of data. Rather than having all the data in one huge unmanageable sheet, users break the data into multiple sheets. Above, we printed all the available sheets.

Section 7.1: Obesity by gender

Let’s have a look at sheet 1:

There are four columns: year, total, males and females. The year column doesn't have a header: if you look at row 5 of the sheet, you will see the header cell for the year is empty.

I’ll show you two ways to read in data. In the first one, you define the header columns yourself.

So I’m defining a list with four header entries: ‘year’, ‘total’, ‘males’ and ‘females’.

We read sheet 7.1. If you look at the actual sheet, the top 4 and bottom 14 rows contain useless info, so we skip them (skiprows=4, skipfooter=14). Finally, we tell pandas to name the column headers using our list (names=columns1).

We then print data_gender.
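
Those steps can be sketched like this. The filename is an assumption (use whatever you saved the spreadsheet as), and the read is guarded so the snippet still runs if the file isn't present:

```python
from pathlib import Path

import pandas as pd

# Hypothetical local filename for the data.gov.uk spreadsheet.
xls_file = Path("Obes-phys-acti-diet-eng-2014-tab.xls")
columns1 = ['year', 'total', 'males', 'females']

if xls_file.exists():
    data = pd.ExcelFile(xls_file)
    print(data.sheet_names)
    # Skip the 4 header rows and 14 footer rows of notes,
    # and apply our own column names.
    data_gender = data.parse('7.1', skiprows=4, skipfooter=14, names=columns1)
    print(data_gender)
else:
    print("Spreadsheet not found - download it from the link above first.")
```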

 

If you look at the entry for 0, it is NaN. Why's that? Look at the original spreadsheet: there is an empty row at line 6, left blank to make the sheet easier to read. Since it is empty, pandas reads it as Not a Number (NaN). How do we get rid of it? Easy.

Using the inbuilt dropna() function, of course. inplace=True means modify the existing DataFrame rather than returning a new copy. If we look at the output again:

 

we see the NaN values are now gone. There is another problem, though. We still have the default numeric index at the beginning (the first column, going 1,2,3…). We don't need it; instead, we want the year to be the index. Let's fix that next.
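
The two fixes can be sketched on a small stand-in DataFrame (the numbers below are placeholders, not the real figures from the spreadsheet):

```python
import numpy as np
import pandas as pd

# Stand-in for what parse() returned: the first row is the blank
# spacer line from the spreadsheet, which pandas reads as NaN.
data_gender = pd.DataFrame({'year': [np.nan, '1993', '1994', '1995'],
                            'total': [np.nan, 1.0, 2.0, 3.0],
                            'males': [np.nan, 1.0, 2.0, 3.0],
                            'females': [np.nan, 1.0, 2.0, 3.0]})

data_gender.dropna(inplace=True)             # drop the all-NaN spacer row
data_gender.set_index('year', inplace=True)  # make the year the index
print(data_gender)
```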

That’s better.

We can see that while obesity for men has gone up, obesity for women has risen even more steeply.

Section 7.2: Obesity by age

We will read the data slightly differently this time. Last time, we defined the headers ourselves. This time, we'll let pandas pick them up from the sheet.

If you remember, the year column didn't have a header, which is why pandas names it 'Unnamed: 0'. Let's rename it:

That's better. Let's drop the NaN values and set the index to the year.
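
A sketch of those steps, again on a stand-in DataFrame with placeholder numbers ('Unnamed: 0' is the name pandas gives a headerless column):

```python
import numpy as np
import pandas as pd

# Stand-in for data.parse('7.2', skiprows=4, skipfooter=14).
data_age = pd.DataFrame({'Unnamed: 0': [np.nan, '2002', '2003'],
                         'Total': [np.nan, 1.0, 2.0],
                         'Under 16': [np.nan, 1.0, 2.0]})

data_age.rename(columns={'Unnamed: 0': 'year'}, inplace=True)
data_age.dropna(inplace=True)
data_age.set_index('year', inplace=True)
print(data_age)
```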

Let’s plot all the ages first.

You see a big problem: the Total column is huge and squashes the other lines. So we need to get rid of it to make the graph clearer.

The drop() function removes entries from a DataFrame. Here, we are dropping the Total column.
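
A sketch of the drop, on a stand-in DataFrame with placeholder numbers and a couple of assumed column names:

```python
import pandas as pd

# Placeholder values, indexed by year as in the section above.
data_age = pd.DataFrame({'Total': [21.0, 22.0],
                         'Under 16': [14.0, 13.0],
                         '35-44': [18.0, 21.0]},
                        index=['2002', '2003'])

# axis=1 means drop a column (axis=0 would drop a row by label).
data_age_minus_total = data_age.drop('Total', axis=1)
print(data_age_minus_total)
# data_age_minus_total.plot() would now graph the age groups
# on a comparable scale.
```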

We can quickly see that the age groups 35-44 and 45-54 have had the highest obesity growth.

What if we want to compare age groups? How do children compare to adults?

In Pandas, you can view any column by doing data_age[column_name].

This gives the data for just the under 16s. We can plot any age group this way.
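
A sketch of selecting one column, with placeholder numbers and the assumed column name 'Under 16':

```python
import pandas as pd

# Placeholder values, using column names like those in sheet 7.2.
data_age = pd.DataFrame({'Under 16': [14.0, 13.0],
                         '35-44': [18.0, 21.0]},
                        index=['2002', '2003'])

# Selecting a single column returns a Series.
under_16 = data_age['Under 16']
print(under_16)
# under_16.plot() would graph just this age group.
```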

We see that while children’s obesity has gone down, that for adults has ballooned. Maybe adults should start following the advice they give to children?

MovieLens

The MovieLens dataset is a collection of users' ratings of movies. It is one of the most popular open datasets out there. Download the 100K version from http://grouplens.org/datasets/movielens/.

It is also different in that the data isn't in an Excel format, but plain text, which makes it easy to parse. We have seen that pandas makes parsing Excel files easy as well, but many programming languages don't have that feature.

Download the data and have a look at it. There is a large amount of data, and we will only work with a small subset. If you want to know the details of the files, they are in the Readme.

In a text editor, open the first file we will work with, u.item:

It contains the movie id, movie name, release date, imdb.com link, and a series of flags for genre, all separated by a pipe ( | ).
I will only use the first two entries (movie id and name).
Open u.user:

The details are user id, age, gender, occupation and post code. (These details are in the Readme, if you forget.)
Again, I will only use the first three entries (id, age and gender).

Finally, we have u.data

The entries are user id, movie id, rating and timestamp (which I’ll ignore in this example).

If you look at the other files, you’ll see there is a wealth of data. You can draw many conclusions from the data, which is why the dataset is so popular.

We’ll focus on a small set of the data, mainly on how users of different ages and genders rated movies.

Reading the data

Since the data is plain text separated by |, it doesn't have column headers the way Excel files do, which means we'll have to provide our own, like I did in the first part of the last section.

Pandas has a function for reading plain text files, which is very similar to numpy's file-reading functions we covered in the chapter on plotting (not surprising, since pandas is built on numpy).

On to the code.

We declare user_columns for the three entries we want to read from u.user. Note that u.user has more than three entries per row, but we will ignore the others.

We will use the read_csv() function to read the file, even though it isn't technically a csv (comma-separated values) file. The first argument is the file name. The second says that our values are separated by |. The third merely passes in our user_columns as the column names. The final argument says we only need to read three values per row; without it, pandas would try to read the whole line and cause a mess.
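
A sketch of the call, using an in-memory sample with the same pipe-separated layout as u.user (with the real file, pass the filename 'u.user' instead of the sample):

```python
import io

import pandas as pd

# Two sample rows in the same layout as u.user:
# user id | age | gender | occupation | post code
sample = io.StringIO("1|24|M|technician|85711\n"
                     "2|53|F|other|94043\n")

user_columns = ['user_id', 'age', 'gender']
# usecols=range(3) keeps only the first three fields of each row.
users = pd.read_csv(sample, sep='|', names=user_columns, usecols=range(3))
print(users)
```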

In exactly the same way, we load the other values as well:

Now comes the important part. The data in the MovieLens dataset is spread over multiple files, but that is no good to us: we need to merge it together so we can analyse it in one go. If you have used SQL, you will know it has a JOIN keyword to join tables. Pandas has something similar.

Finally, we've added encoding='iso-8859-1' when reading u.item. This is to keep Python 3 happy: the file contains non-ASCII characters, and while Python 2 had a wink-wink, I'll-let-you-get-away-with-it approach, Python 3 is stricter. We need to specify that the file uses Latin-1 (ISO-8859-1, a Western European encoding).
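
The other two reads can be sketched the same way, with in-memory samples (note that u.data is tab-separated, hence sep='\t'; with the real u.item you would add the encoding argument as described):

```python
import io

import pandas as pd

# Sample rows mimicking u.item (pipe-separated) and u.data (tab-separated).
item_sample = io.StringIO("1|Toy Story (1995)|01-Jan-1995||url|0|0|1\n"
                          "2|GoldenEye (1995)|01-Jan-1995||url|0|1|0\n")
data_sample = io.StringIO("196\t242\t3\t881250949\n"
                          "186\t302\t3\t891717742\n")

movie_columns = ['movie_id', 'title']
movies = pd.read_csv(item_sample, sep='|', names=movie_columns,
                     usecols=range(2))
# With the real file, add the encoding:
# movies = pd.read_csv('u.item', sep='|', names=movie_columns,
#                      usecols=range(2), encoding='iso-8859-1')

rating_columns = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv(data_sample, sep='\t', names=rating_columns,
                      usecols=range(3))
print(movies)
print(ratings)
```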

The merge() function does the joining. It takes two DataFrames and merges them together. You can specify a column to merge on, but pandas is smart enough to find the common column and merge on it.

So for the first merge:

pandas will see that movie_id is common between movies and ratings, so will merge on that.

We then merge this newly created movie_ratings with users:

Again, pandas will figure out user_id is common and merge around that.
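
The two merges can be sketched on tiny stand-in DataFrames (the ages and ratings here are placeholders, not values from the dataset):

```python
import pandas as pd

# Tiny stand-ins for the three DataFrames read above.
movies = pd.DataFrame({'movie_id': [1, 2],
                       'title': ['Toy Story (1995)', 'GoldenEye (1995)']})
ratings = pd.DataFrame({'user_id': [196, 186, 196],
                        'movie_id': [1, 2, 2],
                        'rating': [3, 3, 4]})
users = pd.DataFrame({'user_id': [196, 186],
                      'age': [49, 39],
                      'gender': ['M', 'F']})

# pandas spots the shared movie_id / user_id columns and merges on them.
movie_ratings = pd.merge(movies, ratings)
movie_data = pd.merge(movie_ratings, users)
print(movie_data)
```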

The final result is a dataset that contains all the info we need, so we can start working on it.

The first thing we are going to do is find the movies with the most ratings. The code may look big and confusing, but it shows a handy feature of pandas: you can chain multiple commands together. Let's break it down.

The groupby() function allows you to group the data by a chosen column (remember, the data is normally arranged by index; here, we are saying arrange it by title instead).

After we have grouped the movies by title, we call the size() function, which counts the number of ratings each title received. Normally, this comes out in ascending order, so:

we call the sort_values(ascending=False) function, which arranges the data in descending order (so movies with more ratings appear at the top). Finally, we use [:20] to show only the top 20 movies.
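
The whole chain, sketched on a stand-in movie_data (placeholder ratings):

```python
import pandas as pd

movie_data = pd.DataFrame({'title': ['Star Wars (1977)', 'Star Wars (1977)',
                                     'Contact (1997)'],
                           'rating': [5, 4, 3]})

# Count ratings per title, biggest first, and keep the top 20.
most_rated = (movie_data.groupby('title')
                        .size()
                        .sort_values(ascending=False)[:20])
print(most_rated)
```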

The second number is the number of votes. So Star Wars has 583 votes, making it the most rated movie.

To no one's surprise, all the Star Wars movies are there (only the real ones, none of that Jar Jar Binks garbage). Scream and Liar Liar were a bit surprising, as they aren't really classics. But as we'll see later, some people may consider them classics.

Let's compare how the tastes of teenagers (13-19) and the oldies (60+) differ.

If you want to see all movies rated by people older than 60 years, the easiest way is:

 

You just pass the condition movie_data.age > 60 into the square brackets, and pandas will return a DataFrame containing only the rows which meet this condition. Let's do that now:
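
A sketch of the boolean selection, on a tiny stand-in DataFrame:

```python
import pandas as pd

movie_data = pd.DataFrame({'title': ['Star Wars (1977)', 'Contact (1997)'],
                           'age': [65, 17]})   # placeholder rows

# The comparison produces a boolean Series; indexing with it keeps
# only the rows where the condition is True.
oldies = movie_data[movie_data.age > 60]
print(oldies)
```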

 

This returns the movies sorted by index. We want the movies sorted by number of votes (as above). Let's do that now:

This is the same code as above, except this time we will see the most rated movies among old people only.

Let’s do the same for teenagers.

 

I'm doing the same as above, except this time I have two checks: (movie_data.age > 12) & (movie_data.age < 20). We can have as many checks as we want. We could have added movie_data.gender == 'F' if we wanted teenage girls.

In the next line, again we sort by number of votes.

And now, let’s print the results, only printing the top 10 results:
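
The teenager filter and the sort can be sketched together (stand-in data again):

```python
import pandas as pd

movie_data = pd.DataFrame({'title': ['Scream (1996)', 'Scream (1996)',
                                     'The English Patient (1996)'],
                           'age': [15, 17, 65]})   # placeholder rows

# Combine conditions with &, wrapping each one in parentheses.
teens = movie_data[(movie_data.age > 12) & (movie_data.age < 20)]
teen_favs = teens.groupby('title').size().sort_values(ascending=False)
print(teen_favs[:10])
```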

You can see that Scream is only liked by teenagers, while The English Patient is the older people's favourite. But note that it only has 17 votes, which may just reflect the fact that there are fewer older people than teenagers in the dataset.

Let's now compare the ratings by gender.

We count the ratings per movie with size().

 

We want to avoid movies that have only one or two reviews, so we select only those movies with at least 250 reviews. This is an arbitrary cutoff, but it helps us filter down to popular movies only.
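
A sketch of the cutoff, using a stand-in with one popular and one unpopular placeholder title:

```python
import pandas as pd

# 300 ratings for title 'A', 10 for title 'B' (placeholder data).
movie_data = pd.DataFrame({'title': ['A'] * 300 + ['B'] * 10,
                           'rating': [4] * 310})

ratings_count = movie_data.groupby('title').size()
# Keep only the titles with at least 250 ratings.
popular_movies = ratings_count[ratings_count >= 250].index
print(list(popular_movies))
```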

The next feature is very powerful. Say you want to reshape a dataset. You could perform multiple select and join operations, but the pivot_table() function allows us to do this more simply. Here, I want to create a new table which has the ratings of the movies broken down by gender.

pivot_table() takes movie_data and reshapes it. The first argument says we want the values to come from the rating column, the second says the index should be the title, and the third says the columns should be based on gender. Let's see what we get:

You can see we have a new table that shows the ratings by gender. But it has a lot of movies that may only have one or two ratings.

.ix was another way to select data (it has since been removed from pandas; .loc does the same selection by label). We are selecting only the rows that are in popular_movies, which means each movie has at least 250 ratings. If I print ratings_by_gender now:
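
The pivot and the selection can be sketched together (placeholder data; the snippet uses .loc, since .ix is gone from current pandas):

```python
import pandas as pd

movie_data = pd.DataFrame({'title': ['A', 'A', 'B'],
                           'gender': ['F', 'M', 'F'],
                           'rating': [5, 3, 4]})   # placeholder rows
popular_movies = ['A']   # titles that passed the 250-rating cutoff

# Average rating per title, one column per gender.
ratings_by_gender = movie_data.pivot_table(values='rating', index='title',
                                           columns='gender')
# Keep only the popular titles (label-based selection).
ratings_by_gender = ratings_by_gender.loc[popular_movies]
print(ratings_by_gender)
```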

That’s better. We are getting movies we’ve heard of.

Let’s see which movies were top rated by women:

You should understand what's going on by now. We are sorting ratings_by_gender by the 'F' column, which, remember, stands for female. Let's print that:
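
A sketch of the sort, with placeholder ratings:

```python
import pandas as pd

ratings_by_gender = pd.DataFrame({'F': [4.2, 4.8], 'M': [4.5, 4.1]},
                                 index=['Fargo (1996)',
                                        'Sense and Sensibility (1995)'])

# Sort by the female-rating column, highest first, and show the top 10.
top_female = ratings_by_gender.sort_values(by='F', ascending=False)
print(top_female[:10])
```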

As you can see, most movies top rated by women also got high ratings from men, except for the last one, Sense and Sensibility (1995). Are there other movies men and women disagree on?

We are going to add a new column to our table, which will calculate the difference between the men and women’s ratings:

Here, diff is a new column. This is what it will look like:
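
A sketch of adding the column (placeholder ratings again; a positive diff means men rated the movie higher):

```python
import pandas as pd

ratings_by_gender = pd.DataFrame({'F': [4.8, 3.5], 'M': [4.1, 4.0]},
                                 index=['Sense and Sensibility (1995)',
                                        'Terminator, The (1984)'])

# Assigning to a new column name creates that column.
ratings_by_gender['diff'] = ratings_by_gender['M'] - ratings_by_gender['F']
print(ratings_by_gender)
```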

Now, we want to see which movies had the greatest difference in votes. After I wrote this code, I saw Wes McKinney's book on pandas (Wes wrote the pandas library), and he too used this example. He used the standard deviation, which will probably give a more accurate answer; I use the simple absolute value of the diff column we created.

I store just the differences in a new DataFrame.

And take the absolute values, to remove the negative signs.

Sort it in descending order, so the biggest differences show up at the top. And finally, we graph it:

barh is a type of bar graph that is drawn horizontally:
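
The last few steps, sketched with placeholder differences (the plot line is left as a comment so the snippet doesn't need a display):

```python
import pandas as pd

ratings_by_gender = pd.DataFrame(
    {'diff': [-0.7, 0.5, 0.1]},        # placeholder differences
    index=['Sense and Sensibility (1995)', 'Terminator, The (1984)',
           'Fargo (1996)'])

# Absolute value removes the sign; sort descending so the biggest
# disagreements come first.
disagreements = ratings_by_gender['diff'].abs().sort_values(ascending=False)
print(disagreements)
# disagreements.plot(kind='barh') draws the same data as horizontal bars.
```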

Conclusion

That was a quick intro to Pandas. My advice is: search for some open data (just Google "open data"; many countries, like the UK and the USA, put a lot of their government data online for free) and find something you like. Pandas is quite easy to use, and there is a lot of help online. Any time you get stuck, just Google it.