Getting Started with Web Scraping

To start off, view this presentation:

It introduces some concepts you should be aware of. The presentation is in the code repo, but I still recommend you watch the video, as I cover points that may not be clear from the ppt file alone.

After that, I recommend brushing up on HTML. This page should be enough: http://www.w3schools.com/html/html_elements.asp

Also skim through XPath. Go through all 5-6 pages (they are tiny) here: http://www.w3schools.com/xsl/xpath_intro.asp

Finally, if you are on Firefox, install Firebug. Firefox and Chrome come with their own inspectors, but Firebug is the only one I’ve used and can recommend.

Right, to the first coding video:

 

To start off, get the code from the repo. Scraping1.ipynb is the file we are working with.

Requests is the library we are working with. It is a lot simpler to use than urllib2, which is the library most people go with. We are also using BeautifulSoup, which allows us to work with malformed HTML.

We will be working with this page: http://pythonforengineers.com/pythonforengineersbook/

We import what we need.

And that’s how easy it is to open a page. You can now look at the status code by doing r.status_code, the page text by r.text and so on.
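For reference, a minimal sketch of the import and the request might look like the following. The helper name fetch_page and the timeout value are illustrative assumptions, not names taken from the notebook.

```python
import requests

# The page used in this lesson.
PAGE_URL = "http://pythonforengineers.com/pythonforengineersbook/"


def fetch_page(url, timeout=10):
    """Download a page and return the requests Response object.

    fetch_page and timeout=10 are hypothetical choices for this sketch.
    """
    return requests.get(url, timeout=timeout)


# Example use (needs network access, so it is left commented out):
#   r = fetch_page(PAGE_URL)
#   print(r.status_code)   # HTTP status code, e.g. 200 on success
#   print(r.text[:200])    # first 200 characters of the page source
```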

But we will first convert this to a BeautifulSoup object, to remove any bad HTML (on my website? As if!):
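A small sketch of that conversion, assuming a short hand-written HTML string in place of r.text so it runs without a network connection. The choice of the built-in "html.parser" is also an assumption; the notebook may use a different parser.

```python
from bs4 import BeautifulSoup

# Stand-in for r.text: note that the <p> and <b> tags are never closed.
html = "<html><head><title>Demo</title></head><body><p>Hello <b>world</body></html>"

# "html.parser" ships with Python; lxml or html5lib can be used if installed.
soup = BeautifulSoup(html, "html.parser")

# The unclosed <b> tag is still found and its text is readable.
bold_text = soup.b.get_text()
print(bold_text)
```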

Now we can look at things like the title of the page:
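As a sketch, reading the title from a soup object could look like this. The HTML string and its title text are placeholders, not the real page.

```python
from bs4 import BeautifulSoup

# Placeholder HTML; on the real page you would pass r.text instead.
html = "<html><head><title>Sample page</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title          # the whole <title> element
title_text = soup.title.string  # just the text inside it

print(title_tag)
print(title_text)
```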

You can print all the p elements on the page.
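One way to do this is with find_all, sketched here on a made-up HTML snippet rather than the real page.

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for the downloaded page.
html = "<body><p>First <b>paragraph</b></p><p>Second paragraph</p></body>"
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag, in document order.
paragraphs = soup.find_all("p")
print(paragraphs)
```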

This will print the raw html, including all tags. To have a clean print, we do:
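A sketch of such a loop, again on a made-up snippet. The list clean_text is only there to collect the results for this example.

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for the downloaded page.
html = "<body><p>First <b>paragraph</b></p><p>Second paragraph</p></body>"
soup = BeautifulSoup(html, "html.parser")

clean_text = []  # hypothetical helper list, not from the notebook
for p in soup.find_all("p"):
    text = p.get_text()  # strips the tags and keeps only the text
    clean_text.append(text)
    print(text)
```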

We loop over the p elements (paragraphs), and use the get_text() function to extract just the text part of the html.

What if we just want to print all the links on the page:

The a tag contains the html links. We get the href part of the a tag, which only contains the link.
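The link extraction described above can be sketched as follows. The HTML and the URLs in it are invented for illustration, and skipping a tags without an href is an extra safety check that may not be in the notebook.

```python
from bs4 import BeautifulSoup

# Invented snippet: two real links and one <a> tag with no href.
html = (
    '<body><a href="http://example.com/one">One</a> '
    '<a href="/two">Two</a> <a name="anchor">No link</a></body>'
)
soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a"):
    href = a.get("href")  # returns None when the tag has no href
    if href is not None:
        links.append(href)
        print(href)
```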

And that’s how simple it is.

In the next video, we’ll see how we can extract pricing info from a dummy page.

We will be working with the dummy sales page: http://pythonforengineers.com/dummy-sales-page/

Scraping2.ipynb is the file we are working with.

If you scroll down, you will see some dummy price info.

This section requires some knowledge of regular expressions. In the video, I give a quick intro to regexes. If you haven’t watched it, you’ll need to read up on them. This is a good link, though it uses Python 2: http://www.thegeekstuff.com/2014/07/python-regex-examples/

In the video, I develop the code slowly. Here, I’ll just give the final version.

To start off, let’s read the page:

This is the main code:

We create an empty list called price_list, and then we loop over all the p elements on the page.

This is the key part of the code. We are trying to find each item and its price. The first regex string is:

A great cheat sheet for regular expressions: http://www.pyregex.com/

The key part of the regex is in the square brackets []. The star (*) at the end means the characters in the brackets can repeat zero or more times, so we match a whole run of them rather than a single character.

We are searching for all alphanumeric characters (\w), spaces and colons (:). Why is that? Because our first part contains:

Price for Item A:

Searching for spaces and colons will get us Item A.
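To see what such a character class matches, here is a small sketch. The pattern [\w\s:]* is reconstructed from the description above and may not be character-for-character what the notebook uses; the sample text is also an assumption about how a line on the dummy page looks.

```python
import re

# Assumed sample text; the exact wording on the dummy page may differ.
text = "Price for Item A: $99"

# Character class: word characters (\w), whitespace (\s) and colons.
# The * repeats the class zero or more times.
item_match = re.search(r"[\w\s:]*", text)

# The match stops at the first character outside the class, the "$".
print(repr(item_match.group()))
```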

The second regex is:

Again, we search for alphanumeric characters and the dollar symbol (as our price is in dollars). This will allow us to get the price, like $99.
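One way the two character classes could work together is as two groups in a single pattern. This combination is an assumption made for illustration; the notebook may apply the two regexes separately.

```python
import re

# Assumed sample text; the exact wording on the dummy page may differ.
text = "Price for Item A: $99"

# Group 1: the label (word characters, whitespace, colons).
# Group 2: the price (word characters and the dollar sign).
pattern = re.compile(r"([\w\s:]*)([\w$]*)")
m = pattern.match(text)

item_part = m.group(1)
price_part = m.group(2)
print(repr(item_part), repr(price_part))
```

Note that a price with a decimal point, such as $99.50, would be cut off at the dot, because "." is not in the second character class.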

If we find the price, we add the item and price to a dictionary and print it at the end:
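Putting the pieces together, the whole loop might look roughly like this. It is a sketch under several assumptions: the HTML and prices are invented stand-ins for the dummy sales page, the combined two-group pattern is a reconstruction, and storing one small dictionary per item inside price_list is just one reading of the description above.

```python
import re
from bs4 import BeautifulSoup

# Invented stand-in for the dummy sales page.
html = """
<body>
<p>Welcome to the dummy sales page.</p>
<p>Price for Item A: $99</p>
<p>Price for Item B: $25</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# Reconstructed pattern: label group, then price group.
pattern = re.compile(r"([\w\s:]*)([\w$]*)")

price_list = []
for p in soup.find_all("p"):
    m = pattern.match(p.get_text())
    item, price = m.group(1), m.group(2)
    # Keep only paragraphs where the second group is really a price.
    if price.startswith("$"):
        label = item.strip().rstrip(":")
        price_list.append({"item": label, "price": price})

print(price_list)
```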

Again, all very simple.

In the next lesson, we’ll see how to download all images from the page.