Download all Images from a Website

Now that we know the basics of web scraping, let’s look at a more complicated example: how to download all the images from a website.

The file we are working with is “Scraping 3.ipynb”.

In the video, I go over the code step by step, explaining how I wrote the code, and the decisions I made. In the transcript, I’ll just explain the final version of the code.

We are back with the Python for Engineers book page. The code above is simple: we open the page in Python.

To start off, we loop over all the img tags on the page (because images are stored in img tags).

If you remember your img tags (and I always have to Google this), the actual link to the image is stored in the src attribute. Like this example, taken from here:

So we read the source of our image:
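Putting the steps so far together, the pattern looks something like this. Since the notebook code isn’t reproduced here, this is a sketch that uses a small inline HTML snippet in place of the live page; the URLs and variable names are my own made-up examples, not the ones from the notebook:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the real page; the notebook opens the
# live page over the network instead. These image URLs are made up.
html = """
<img src="//i0.example.com/wp-content/uploads/Python_for_Scientists-small.jpg?resize=300%2C450">
<img src="//i0.example.com/wp-content/uploads/another-cover.jpg?resize=300%2C450">
"""
soup = BeautifulSoup(html, "html.parser")

# Loop over every img tag and read the link stored in its src attribute
for img in soup.find_all("img"):
    src = img.get("src")
    print(src)
```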

Look at the link above. Is there something wrong with it?

I mentioned this in the presentation: the problem with web scraping is that code can break a lot. An earlier version of these web scraping lessons broke within months, because WordPress (my CMS) changed the way it stored images. If you look at the link above, not only is there no http in front of the image, the actual images are stored on an external server rather than my website. This makes loading the images faster, but makes web scraping harder.

There are other problems too. Look at the question mark in the link above: everything after it is an instruction to resize the image. That instruction is useful for the web browser, but it will make our job harder, so we’ll have to get rid of it.

First, let’s add http in front of the path of the image, so that our code can recognise it:
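The links are protocol-relative (they start with //), so one way to fix them, sketched here with a made-up URL, is to put "http:" in front:

```python
# Hypothetical src value read from an img tag: no http in front
src = "//i0.example.com/wp-content/uploads/Python_for_Scientists-small.jpg?resize=300%2C450"

# If the link starts with //, prepend "http:" so it becomes a full URL
if src.startswith("//"):
    src = "http:" + src
print(src)
```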

That looks like a proper http link. Let’s get rid of the ?resize part now.

We find the location of the question mark using find, and remove everything after it. This gives us a clean link. Why was this necessary? Because now we can extract the name of the image file (Python_for_Scientists-small.jpg in the example above).
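A minimal sketch of that step, again with a made-up link:

```python
# Hypothetical image link with a ?resize instruction at the end
link = "http://i0.example.com/wp-content/uploads/Python_for_Scientists-small.jpg?resize=300%2C450"

# find() returns the index of "?" (or -1 if there is none);
# slicing up to that index drops everything after the question mark
q = link.find("?")
if q != -1:
    link = link[:q]
print(link)  # http://i0.example.com/wp-content/uploads/Python_for_Scientists-small.jpg
```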

How do we get the file name? We use the os.path.split() function, which will separate the directory and the file. In the notebook, I give some examples. So our image is:

If we split it, we get a tuple with two entries: the directory and file:

So if we just want the file name:

And so we get our file name:
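For example, assuming the cleaned-up link from before, os.path.split() behaves like this:

```python
import os

# Hypothetical cleaned-up link (the ?resize part already removed)
link = "http://i0.example.com/wp-content/uploads/Python_for_Scientists-small.jpg"

# os.path.split() returns a (directory, filename) tuple
directory, filename = os.path.split(link)
print(directory)  # http://i0.example.com/wp-content/uploads
print(filename)   # Python_for_Scientists-small.jpg
```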

We now have both the full path and the name of the file. Let’s go get that file.

requests can fetch not only web pages, but images as well. We now write the file:

The only new thing there is r2.content, which contains the raw byte data of the image. You can print it if you want: it will be a long stream of bytes, shown as hex escape codes.
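The download-and-save step could be sketched as a small helper. The function name and URL below are my own for illustration, not from the notebook:

```python
import requests

def save_image(link, filename):
    # requests can fetch images just like web pages
    r2 = requests.get(link)
    # r2.content holds the raw bytes of the image,
    # so the file must be opened in binary ("wb") mode
    with open(filename, "wb") as f:
        f.write(r2.content)

# Usage (hypothetical URL):
# save_image("http://i0.example.com/wp-content/uploads/cover.jpg", "cover.jpg")
```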

And that’s it. Your folder should have all the images downloaded now.

The Practice Session

The practice session is very simple. Open up Scraping Practice Session.ipynb.

You have to read this page:

Somewhere on that page are the words “source code”. Those words are a link, and I want you to find out where that link points:

I mention that you will not find an exact match for “source code”, so you will have to work around that.
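One possible workaround (my own sketch, not the official solution, and the HTML here is an inline stand-in for the real page) is a substring match on each link’s text:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the practice page; note the link text only
# *contains* "source code", it is not an exact match
html = '<p>Grab the <a href="https://github.com/example/repo">full source code here</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

found = None
for a in soup.find_all("a"):
    # Case-insensitive substring match instead of an exact match
    if "source code" in a.get_text().lower():
        found = a["href"]
print(found)
```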

The second challenge is to count the number of images on the page:

This can be done by looping over all the img tags and incrementing a counter.
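A minimal sketch of that counter, run against a made-up snippet instead of the real page:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the practice page
html = "<img src='a.jpg'><p>Some text</p><img src='b.jpg'><img src='c.jpg'>"
soup = BeautifulSoup(html, "html.parser")

# Count the images by looping over the img tags
count = 0
for img in soup.find_all("img"):
    count += 1
print(count)  # 3
```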

A fairly simple practice session, but the whole topic is quite simple (once you know the basics).