Automating Web Browsers with Selenium

 

So I start with a story of how I was trying to build a web service that would scrape Gog.com (a great site for DRM free games, a much better alternative to Steam, though with less newer games). The plan was to scrape the website daily and note the prices of games. Then, the user could enter their favourite games, and would get an email when their favorite games were on discount.

Sounds like a great idea, doesn’t it?

So what happened?

The problem with simple scraping

Based on what you have learnt, try to scrape Gog.com using Requests. That’s what I started with in the video. Look at what you scraped vs what you see on the screen. What happened?

Many websites use Javascript to load their content. Some even use full fledged Javascript web framework like AngularJS (which is what gog.com does as well). Just a simple static scraping won’t work, as many of the elements won’t load until the Javascript has loaded– and Javascript usually only loads in a full fledged web browser.

Now, gog.com is a really complicated website, so we’ll start with something simple.

Have a look at this page: http://pythonforengineers.com/secret/

There is some hidden text that is only visible when you click the button. Give it a try.

Let’s try to scrape this page. The notebook is Selenium 1.ipynb.

All this is old code. You should understand what’s going on. We can’t see the secret text, because we haven’t clicked the button. How do we get around that?

That’s where Selenium comes in

Selenium opens a full fledged web browser for you, from within Python, which means that everything you can do in a web browser (like click buttons), you can do in Selenium.

We are opening a Firefox instance from within our code. You should see a new Firefox window opening up. It will use a default environment, so that even if you have Firefox installed, any plugins or extras you have will not show up.

The new web browser that opened will go to the secret page above.

This is the same result as we were getting before. Nothing new till now.

Selenium has many functions to search for html elements. We are using the find_element_by_tag_name() function, which allows you to search by tags, the p element in this case. And then we print the text part of the p tag.

Now, we want to find the button and click it. How do we do that? There are multiple ways, and I show you two.

For the first, you need to find the xpath of the button, using Firebug. It’s hard to describe in text, please have a look at the video if you’ve never done it before.

So we find the button and read the text on it. Now, in the video I ask you  a question. The above is a very bad way. Take five minutes to think about why.

The answer is: The xpath is hard coded above. If you change the text, the xpath may change, and your code will break.

A better way is to see if we can find a method that will not fail if the page text changes. What we will do is, search for the text on the button. That is less likely to change.

The above is a slightly confusing method, that is not documented in the official docs. I found it on Stackoverflow.

I always forgot the format and have to Google it. Anyway, we are searching for all xpaths with the text Super on it. This will work because there is only one button with Super in it, but if you had multiple things, you’d have to give the exact text to match.

Anyway, so we have selected the button. Now, we just click it.

So simple! You should see the button clicked in the Firefox browser that was opened.

Let’s search for the text again:

We see the secret text this time.

And we close the web browser:

Right, to the practice session.

Practice Session

 

The practice session is fairly simple. Go to http://pythonforengineers.com/articles/ and click on the search button. You don’t need to fill anything in the form, just click the button. The video shows you how to find the id of the button so you can locate and click it.

Solution

The solution is in Selenium Practice Solution.ipynb.

We open the page.

Using Firebug, we knew the id of the button is searchsubmit. We find it, and then click it. That’s it.

In the next lesson, we’ll see how we can search through my website, clicking on buttons, downloading the source code.