Scraping via APIs: Using the Github Jobs API

As I mentioned in the presentation in the first part of this series, not all companies want you scraping their websites. And yet, if they hold important data, people will keep trying, so most companies provide an API instead, with the caveat that you may be banned if you don't use it.

I’ve already covered the Reddit API and the Twitter API, so I won’t go over those again.

The example we are going over here is much simpler: the Github Jobs API, which is a plain REST API. Before we go ahead, read up on what REST is, if you don’t know already: http://stackoverflow.com/questions/671118/what-exactly-is-restful-programming

You also need to know about the Json format. In short: Json is a great format, especially for Python, as it’s essentially a Python dictionary. You could read Json objects directly in Python, though we will be using a library.
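To see how close the two are, here is a tiny example (the job title and location values are made up for illustration):

```python
import json

# A Json object looks almost exactly like a Python dict literal.
text = '{"title": "Python Developer", "location": "New York"}'

job = json.loads(text)  # parse the Json string into a Python dict
print(job["title"])
```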

Let’s look at the Github API. This is the main site: https://jobs.github.com/api

I’d like you to look at two versions of the above. The first is the human readable form: http://jobs.github.com/positions?description=python&location=new+york

The second is the Json format, which is easier for machines to parse: https://jobs.github.com/positions.json?description=python&location=new+york

Open both the above links in separate tabs, and have a look at them.

The key thing to understand is that both links return the same data. If you are not sure, confirm it: look at the title and description in the Json version, and compare them to the web version.

Also look at the link itself. There is a description query parameter that holds the skill you are searching for, and a location parameter for the location.

We will use this when we create our own link.

Now that we know how to get the Json version of the jobs, let’s see how we can use a script to get the jobs.

We import requests and the json library.

I am creating the link dynamically using a keyword and location, so that in future, these can be read from the user or a file.

We then print the link. Check that the link is correct by opening it in a new tab.

We then use requests to fetch the link. The body of the response is a Json string. Feel free to print it, but it will be messy and hard to read.
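Putting those steps together, a minimal sketch (the keyword and location values are just placeholders, and the try/except keeps the script from crashing if the request fails):

```python
import urllib.parse

import requests  # third-party library: pip install requests

# In future, these could be read from the user or a file.
keyword = "python"
location = "new york"

# Build the same query string we saw in the browser; spaces become '+'.
query = urllib.parse.urlencode({"description": keyword, "location": location})
url = "https://jobs.github.com/positions.json?" + query
print(url)  # check it is correct by opening it in a new tab

# Fetch the link; the response body is the raw Json string.
try:
    response = requests.get(url, timeout=10)
    print(response.text)  # messy and hard to read
except requests.RequestException as exc:
    print("Request failed:", exc)
```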

Instead of parsing the page directly, we will load it with the json library. This will allow us to do things like print the title for the first job:
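For example, with a made-up two-job sample standing in for the live response (in the real script, raw would be response.text):

```python
import json

# Made-up stand-in for response.text; the field names (title, url,
# how_to_apply, description) match the real response, the values do not.
raw = '''[
  {"title": "Python Developer",
   "url": "https://jobs.github.com/positions/1",
   "how_to_apply": "Email us",
   "description": "We need Python and Django skills."},
  {"title": "Front End Engineer",
   "url": "https://jobs.github.com/positions/2",
   "how_to_apply": "Apply online",
   "description": "Javascript, React and a little Java."}
]'''

jobs = json.loads(raw)   # parse the Json string into a list of dicts
print(jobs[0]["title"])  # the title of the first job
```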

jobs is a list containing all the jobs. If you want to see what keys are available:
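With jobs holding the parsed list (a one-job made-up sample here keeps the snippet self-contained):

```python
import json

# Made-up sample; in the real script, jobs comes from the API response.
jobs = json.loads('[{"title": "Python Developer", '
                  '"url": "https://jobs.github.com/positions/1", '
                  '"how_to_apply": "Email us", '
                  '"description": "We need Python and Django skills."}]')

# Each job is a dict; its keys are the fields available for that job.
print(list(jobs[0].keys()))
```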

So we have keys like how_to_apply and title, which we’ve already seen, and description, which contains the full details of the job.

Let’s print all the job titles and the url for the jobs:
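Looping over the parsed list (again with made-up sample data in place of the live response):

```python
import json

# Made-up sample; in the real script, jobs comes from the API response.
jobs = json.loads('''[
  {"title": "Python Developer", "url": "https://jobs.github.com/positions/1"},
  {"title": "Front End Engineer", "url": "https://jobs.github.com/positions/2"}
]''')

# Print every job's title alongside its url.
for job in jobs:
    print(job["title"], "-", job["url"])
```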

What if we want to look for Front End jobs?

We just search for Front End in the title. Your results may differ when you run the code, as the listed jobs change over time.
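A minimal version of that search, filtering the same kind of made-up sample (lowercasing both sides so "Front End" and "front end" both match):

```python
import json

# Made-up sample; in the real script, jobs comes from the API response.
jobs = json.loads('''[
  {"title": "Python Developer", "url": "https://jobs.github.com/positions/1"},
  {"title": "Front End Engineer", "url": "https://jobs.github.com/positions/2"}
]''')

# Keep only the jobs whose title mentions "front end".
front_end = [job for job in jobs if "front end" in job["title"].lower()]
for job in front_end:
    print(job["title"], "-", job["url"])
```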

What if we want to search through the job descriptions? Say you want jobs that require both Python and Java?
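The same idea applied to the description field, sketched with made-up sample data (note the substring match is naive: "java" would also match "javascript", so a real script might want word-level matching):

```python
import json

# Made-up sample; in the real script, jobs comes from the API response.
jobs = json.loads('''[
  {"title": "Python Developer",
   "description": "We need Python and Django skills."},
  {"title": "Full Stack Engineer",
   "description": "Python on the backend, Java on Android."}
]''')

# Keep only jobs whose description mentions both python and java.
matches = [job for job in jobs
           if "python" in job["description"].lower()
           and "java" in job["description"].lower()]
for job in matches:
    print(job["title"])
```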

And that’s it. I just wanted to show you how easy parsing the Github API is, thanks to Github’s efforts to make it simple. Compare it to Reddit’s API, for example, which is messy and hard to use.

And I hope you learnt how easy it is to parse Json objects. Almost any website that provides a REST API will return Json (currently, though this may change), and knowing how to parse Json is a useful skill.