Automating Web Browsers with Selenium Part 2

In this lesson, we will see how we can search for a term on my website, click on links and follow the articles. The code is in Selenium2.ipynb.

Specifically, we’ll see how we can navigate through the articles in this series.

In the next lesson, we’ll download the source code from the page, but for now, let’s see how to navigate the links.

This lesson starts directly from the practice session in the last video, so make sure you’ve completed that first. We start on the same page as before.

Using Firebug, we find that the name of the search box is s.

We use find_element_by_name() to find the search box. We then send the words build reddit bot to it. Remember, we are searching for the Build a Reddit Bot series of articles.

In the practice session, we pressed the search button. This time, I’m showing you another way. You can also send the Return key, which is the same as pressing enter after typing in the keywords.
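Putting those steps together, a minimal sketch might look like this. It assumes the old-style find_element_by_* API used in this series, and the start URL is my guess at the page from the last lesson, so adjust it if yours differs:

```python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Assumption: the start page is the one from the last practice session
start_url = "https://www.pythonforengineers.com"

driver = webdriver.Firefox()
driver.get(start_url)

# The search box is named "s" (found with Firebug)
search_box = driver.find_element_by_name("s")
search_box.send_keys("build reddit bot")

# Sending the Return key is the same as pressing Enter after typing
search_box.send_keys(Keys.RETURN)
```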

The search will return multiple results. We look for the exact article we want and click on it.
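In code, that might look like this; the exact link text is an assumption based on the series title:

```python
# Find the search result for the series we want and click it
result = driver.find_element_by_partial_link_text("Build a Reddit Bot")
result.click()
```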

That will take us to the first article. What if we now want to navigate to the next part?

Each article has a Next link on it. We just click that.
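That is a single call, assuming the link text is exactly Next:

```python
# Follow the "Next" link to the next article in the series
driver.find_element_by_link_text("Next").click()
```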

This takes us to part 2. We do the same to go to Part 3.

If you try to click Next again after the last part, you will get an error, as there are no more articles in that series.

Okay, now that we’ve seen how easy it is to navigate the series, let’s download the code.

Get the source code from each article in the Reddit bot series

In the video, I mention how Selenium really stumped me: if you want to do anything beyond the basics, the documentation sucks.

The code on my website is displayed using a plugin called Crayon, which generates some fancy HTML. While that HTML could be parsed directly, my code would break if the Crayon markup ever changed. The code was also getting really messy, simply because the underlying HTML was so complex.

I decided to use BeautifulSoup with Selenium, just to avoid all the pain.

The code is in Selenium3.ipynb.

There is a function to get the code, but we’ll come back to it.

All this is the same as before. We search for our terms, find our link, and click on it. The new code is:
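Roughly this, where get_code is my assumed name for the function described next:

```python
# Pass the raw HTML of the current page to our extraction function
get_code(driver.page_source)
```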

We pass in the page source (raw HTML) to our function. Let’s look at the function now:
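Here is a sketch of that function, reconstructed from the walkthrough below; the name and the details are assumptions on my part:

```python
from bs4 import BeautifulSoup

def get_code(page_source):
    # Create a BeautifulSoup instance from the raw page source
    soup = BeautifulSoup(page_source, "html.parser")

    # The Crayon plugin stores the source code in textarea tags
    for code_block in soup.find_all("textarea"):
        # print(code_block.text)  # commented-out code from the first version

        # Append the code to a file instead of printing it
        with open("code.txt", "a") as f:
            f.write(code_block.text)
            f.write("\n")
```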

Let’s go over it line by line:

We take the page source, and create a BeautifulSoup instance from it.

Now, by spending hours going through my webpage with Firebug, I know that the source code is stored in textarea HTML tags. This is not the usual place, but it is how the Crayon plugin for WordPress works. Keep that in mind.

We loop over these textarea tags. You will note that there is some commented-out code to print the source code; I had that in the first version. Feel free to put it back in.

What we will do, instead, is write the code to a file:
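These are the relevant lines from the sketch above; appending ("a") means code from later pages is added rather than overwritten:

```python
with open("code.txt", "a") as f:
    f.write(code_block.text)
    f.write("\n")
```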

We write all the source code we find on the page to a file called code.txt.

Let’s come back to the main code now.
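A sketch of that loop, using the same assumed API and the get_code function from above:

```python
from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        # Look for the Next link on the current article
        next_link = driver.find_element_by_link_text("Next")
        next_link.click()

        # Download all the code on the page we just navigated to
        get_code(driver.page_source)
    except NoSuchElementException:
        # No Next link means we have reached the end of the series
        break
```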

Let me explain this line by line.

In the previous example, we manually clicked on each of the Next links. Obviously, that is not a very efficient method.

Here, I loop over the pages.

We have an infinite loop, but don’t worry, it will exit.

We try to find the Next link. If we find it, we click it, and then download all the code on the next page.

Once we reach the end of the series, we will hit this except. In that case, we exit the loop.

Look at code.txt. It should have all the source code from all three examples.

For fun, run the whole code by using Run all. It looks quite ghostly, the way the windows are opened and buttons are clicked. It’s like a ghost or a hacker is controlling your system!

And that’s it for this series. Hopefully, you’ve seen how easy it is to automate the web browser. This skill can be used not just for scraping, but for testing web apps as well.