Cross Validation and Model Selection

To start off, watch this presentation that goes over what Cross Validation is.

Note: There are 3 videos + transcript in this series. The videos are mixed with the transcripts, so scroll down if you are only interested in the videos. Make sure you turn on HD.

There is no transcript for the presentation, though the original PowerPoint is in the code repo. That said, I strongly recommend you watch the video, as I talk over several points that may not be obvious from the slides alone.

Now, to the main code. The source code is here. We will be starting with the file Iris.ipynb.

We are working with the Iris dataset in this example. Gathered in 1936, it is a very simple dataset, which is why it is often used for teaching. It also comes bundled with the Scikit-learn library.

The raw data is here. A sample:
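(These are the first few rows of the raw file, for illustration:)

```
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
```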

The values are sepal length, sepal width, petal length and petal width, and the last entry is the name of the flower species. The aim is to predict the flower species using the sepal and petal measurements.

If you don’t know what a sepal is (as I didn’t), this image taken from Wikipedia will give you an idea:

Let’s start with the code. We import everything we need (which I won’t show here). Let’s go to the main code:
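A minimal sketch of that step (the variable name iris is my choice; in current versions of Scikit-learn, load_iris() lives in sklearn.datasets):

```python
from sklearn.datasets import load_iris

iris = load_iris()   # loads the bundled Iris dataset into memory
```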

The load_iris() function loads the data into memory. You can print the returned object if you want to see what the data is. It is the raw data, which includes a description of the dataset. A small sample (which I’ve edited, as it’s too big to print):
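(A sketch of what you’d see; the field names come from the object load_iris() returns, and the output here is heavily truncated:)

```python
print(iris)
# {'data': array([[5.1, 3.5, 1.4, 0.2],
#                 [4.9, 3. , 1.4, 0.2],
#                 ...]),
#  'target': array([0, 0, 0, ..., 2, 2, 2]),
#  'target_names': array(['setosa', 'versicolor', 'virginica'], ...),
#  'DESCR': '...'}
```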

We don’t need to work with the raw dump above, as Scikit gives us easier ways to read the data. We can directly access the input values (sepal & petal measurements) and the expected output (flower species):
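(A sketch; the variable names X and Y are my choice:)

```python
X = iris.data     # sepal & petal measurements, one row per flower
Y = iris.target   # species, encoded as 0, 1 or 2
```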

Where the input is of the format:
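(For example, the first three rows:)

```python
print(X[:3])
# [[5.1  3.5  1.4  0.2]
#  [4.9  3.   1.4  0.2]
#  [4.7  3.2  1.3  0.2]]
```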

These are the 4 measurements: sepal length and width, petal length and width.

The output is:
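(Again as a sketch:)

```python
print(Y)
# [0 0 0 ... 0 1 1 1 ... 1 2 2 2 ... 2]   # 50 flowers of each species
```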

It’s just an array containing the species in numerical format. So for example, 0 is Iris-setosa.

You can see how simple the data is, and why it is useful for learning concepts.

Okay, remember this slide from the presentation:

[Slide cval1: k-fold cross validation, with the data divided into 4 train/test splits]

The above is a simple kfold with 4 folds (as the data is divided into 4 test/train splits).

Let’s see how we would do this in Python:
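Something like this (a minimal sketch; in current versions of Scikit-learn, KFold lives in sklearn.model_selection, takes the number of folds as n_splits, and the samples themselves are passed to split() later):

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)               # 10 samples, numbered 0-9, standing in for our data
kf = KFold(n_splits=5, shuffle=True)  # 5 folds; shuffle randomises which samples land in each fold
```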

In the example above, we ask Scikit to create a kfold for us. The 10 means 10 samples, which we represent with a list of the values 0-9. There are 5 folds, and shuffle means randomise the data.

How do we use this?
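Continuing the sketch above, we can loop over the splits and print them:

```python
for train_index, test_index in kf.split(samples):
    print("Train:", train_index, "Test:", test_index)
```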

As you can see, Scikit has created 5 folds, and randomly put our numbers in either the test or train set.

Feel free to experiment with the number of samples and folds. Here are a few examples. First, we lower the number of folds to 2. This increases the number of samples in each test set, but there are now fewer iterations (each fold is a possible test case, and more test cases give a more reliable accuracy estimate):
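(Same sketch, now with 2 folds:)

```python
kf = KFold(n_splits=2, shuffle=True)
for train_index, test_index in kf.split(np.arange(10)):   # still 10 samples
    print("Train:", train_index, "Test:", test_index)
```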

What if we increase the number of samples to 15, and the folds to 10?
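(Again as a sketch:)

```python
kf = KFold(n_splits=10, shuffle=True)
for train_index, test_index in kf.split(np.arange(15)):   # 15 samples, 10 folds
    print("Train:", train_index, "Test:", test_index)
```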

More test cases, but each test set now only has 1-2 samples. This trade-off is something you will have to decide for your use case.

How to use the folds?

So the question is, what do we do with the folds above? One way is to call your machine learning algorithm 5 times (or whatever your kfold value is), each time giving it a different train/test set. You then calculate the accuracy for each stage, and average it at the end.
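A sketch of that manual loop (reusing the X, Y arrays from earlier, with a Random Forest standing in as the classifier):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True)
accuracies = []
for train_index, test_index in kf.split(X):
    clf = RandomForestClassifier()
    clf.fit(X[train_index], Y[train_index])                     # train on this fold's training set
    accuracies.append(clf.score(X[test_index], Y[test_index]))  # accuracy on this fold's test set

print(np.mean(accuracies))                                      # average accuracy over the folds
```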

This is such a common requirement that Scikit provides a ready-made helper function for it, cross_val_score(), which we’ll use below.

Before we go ahead: we will be comparing 3 machine learning algorithms in this lesson. You’ve already looked at Random Forests; we will also be looking at Logistic Regression and SVM.

How do you normally choose between algorithms? Always start with Random Forests, as they are a good tool for most cases. If for any reason RFs are not working, then look at other algorithms used in your domain. A simple Google search for your domain will help, e.g. “natural language machine learning algorithms”.

Once you narrow down 2-3 algorithms, you can use the techniques we will see now.

Let’s create instances for random forests, logistic regression and svm:
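A sketch with default parameters (the variable names are my choice):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rf_clf = RandomForestClassifier()
log_clf = LogisticRegression()
svm_clf = SVC()
```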

The function used for kfold cross validation is called cross_val_score():
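For the Random Forest, the call looks something like this (reusing rf_clf and the X, Y arrays from above):

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(rf_clf, X, Y, scoring='accuracy', cv=10)
print(scores)   # one accuracy value per fold
```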

The first argument is the Random Forest object, then we pass in our inputs and output. scoring=’accuracy’ means measure the algorithm for accuracy (you can measure other things, I just stick to accuracy). Finally, we have 10 folds.

The function will run the Random Forest classifier with our input/outputs ten times, and measure the accuracy each time. This is the output:

The answers are between 0 and 1. You can see several 1s in there, which means an accuracy of 100%. We are getting this only because the data is very simple.

What if we want to find the average accuracy? We use the mean() function to find the mean of the ten values.
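Continuing the sketch:

```python
print(scores.mean())   # average accuracy across the 10 folds
```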

So we get an accuracy of 95.33%

Let’s do the same for all three algorithms.
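One way to write that (a sketch reusing the three instances from above):

```python
for name, clf in [("Random Forest", rf_clf),
                  ("Logistic Regression", log_clf),
                  ("SVM", svm_clf)]:
    scores = cross_val_score(clf, X, Y, scoring='accuracy', cv=10)
    print(name, scores.mean())
```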

From the example above, SVM is the most accurate, but keep in mind there is little practical difference between 95% and 98%; the two are very close. If you look at the individual results, several of them are 100%.

But you do see how easy it is to compare different machine learning algorithms, and find the most accurate for your use case. You will get to practice with a more realistic dataset in the practice session.

There is one final thing I want to show you. And that is, how easy Scikit-learn makes it for you to try out different algorithms.

Say you wrote your whole code with Random Forests, and want to move to Logistic Regression. It is easy, as the interfaces in the Scikit library are consistent. An example will make it clear.

We create our test/train split:
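Something like this (the 70/30 split is an arbitrary choice on my part):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
```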

First we use Random Forests:
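(A sketch, reusing the instances and the split from above:)

```python
rf_clf.fit(X_train, Y_train)           # train on the training set
print(rf_clf.score(X_test, Y_test))    # accuracy on the held-out test set
```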

The 2 functions we have are fit() and score(). What if we want to move to Logistic regression? It supports the exact same functions, with the same inputs/outputs:
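(In sketch form, with the same variable names:)

```python
log_clf.fit(X_train, Y_train)
print(log_clf.score(X_test, Y_test))
```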

As does SVM:
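(And likewise:)

```python
svm_clf.fit(X_train, Y_train)
print(svm_clf.score(X_test, Y_test))
```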

You see you don’t need to rewrite your code at all. I’m sure other languages like R or Java provide similar features, but I had been using a proprietary tool where each algorithm was implemented differently, so if you wanted to move from SVM to regression, you had to completely rewrite your code.

Coming to Python, it was a surprise to see you could just try a new algorithm with a one line change of code. But people who have used other (well implemented) open source tools will not be surprised.

If you look at the notebook, I have some code at the end that shows you how to shuffle/randomise the data when using cross_val_score() (as this does not automatically happen). I didn’t find much of an improvement with shuffling, which is why I won’t go over the code, but I will leave it in for reference.

Now to the practice session. CV_practice.ipynb is the file to use:

The practice session works with the Pima Indians Diabetes database (https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/).

The video contains hints, but the biggest hint is, look at the previous example, and just repeat what you learnt there.

I already read the data for you, saving you typing time:
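(A sketch of how the data might be read; the file path and column names here are my own labels, and the notebook’s exact code may differ:)

```python
import pandas as pd

# the UCI file has no header row; these column names are my own labels
columns = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
           "insulin", "bmi", "pedigree", "age", "outcome"]
data = pd.read_csv("pima-indians-diabetes.data", header=None, names=columns)

X = data.drop("outcome", axis=1)   # the 8 measurements are the inputs
Y = data["outcome"]                # 1 = diabetic, 0 = not
```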

All you need to do is use the cross_val_score() function to compare the 3 algorithms, and find which one performs best for that dataset.

Finally, the last step is to run each algorithm individually, to see how easy it is to switch between them.

The session should be easy, as you are doing the same thing as before, just with a different dataset.