Web scraping is the process of automatically identifying and downloading data from a webpage. This blog post looks at a few Python web scraping options.
Alternatives to web scraping include:
– Using an API. Usually a better option than web scraping if an API is available – a short example follows this list. See also Creating an API with Flask, and open data APIs.
– Manually copying data from a website. Can be a better option if there are only a small number of target pages or if there is a lot of variation in the website data. See also an example of manual copying used to ‘scrape’ Wikipedia data.
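To give a flavour of the API route, here is a minimal sketch using Wikipedia's REST API, which returns structured JSON so no HTML parsing is needed (the endpoint shown is just an illustrative example):

import requests

# The Wikimedia REST API returns a JSON summary of a page
url = "https://en.wikipedia.org/api/rest_v1/page/summary/North_Atlantic_air_ferry_route_in_World_War_II"
summary = requests.get(url).json()

print(summary["extract"])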
Python Web Scraping Options
I’ve picked out a few different Python web scraping options. There is some overlap in terms of concepts and packages used, as the essential idea is the same: programmatically grab data from a webpage and then look for certain aspects within it.
Common Python web scraping packages include requests, Beautiful Soup and the built-in re module for regex.
Requests
Probably the core of modern Python web scraping is the requests library, which allows you to easily capture webpage data in a Python script. Note that earlier approaches to Python web scraping may have used the urllib library – read about the differences between requests and urllib.
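As a rough illustration of the difference, here is the same page fetched with both libraries – a minimal sketch:

import urllib.request

import requests

url = "https://en.wikipedia.org/wiki/North_Atlantic_air_ferry_route_in_World_War_II"

# urllib returns bytes, which we decode ourselves
with urllib.request.urlopen(url) as response:
    html_via_urllib = response.read().decode("utf-8")

# requests handles the decoding for us via .text
html_via_requests = requests.get(url).text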
The data from requests comes back ‘raw’, and is likely to contain lots of HTML tags around the actual information you are interested in.
In very simple cases, such as looking for occurrences of a particular name or phrase, you could use standard Python string functions. Here is an example:
import requests

url = "https://en.wikipedia.org/wiki/North_Atlantic_air_ferry_route_in_World_War_II"
data = requests.get(url).text

print("Nova Scotia" in data)
So the text back from requests can be used just like any other string.
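For example, continuing from the snippet above, the usual string methods work directly on the downloaded text:

# Count occurrences, and find the position of the first one
print(data.count("Nova Scotia"))
print(data.find("Nova Scotia"))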
Simple Regex
Introducing regex is a good next step once you move beyond simply downloading the data with requests. Using an online regex tool can be helpful when writing regex.
Regex is powerful, but it is not possible to fully parse HTML using regex. For this reason regex is generally suited to simpler tasks such as picking out IP addresses, postcodes etc. from a webpage.
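For instance, a loose pattern like the one below will pull out anything that looks like an IPv4 address (the page used here is just an illustrative example, and the pattern will also match invalid addresses such as 999.999.999.999):

import re

import requests

url = "https://en.wikipedia.org/wiki/IP_address"
data = requests.get(url).text

# Four groups of 1-3 digits separated by dots
ip_pattern = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")
print(ip_pattern.findall(data))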
The following example uses regex to find latitude and longitude coordinates in the webpage.
import re

import requests

url = "https://en.wikipedia.org/wiki/North_Atlantic_air_ferry_route_in_World_War_II"
data = requests.get(url).text

# Find pairs of coordinates (a latitude followed by the nearest longitude)
pattern = re.compile(
    r"(\d{1,3}°\d{1,3}′\d{1,3}″N).*?(\d{1,3}°\d{1,3}′\d{1,3}″W)")

for match in pattern.finditer(data):
    print(match.groups())
Read more about capturing multiple groups using finditer.
Beautiful Soup
Beautiful Soup is probably the most popular Python web scraping tool. It is powerful and flexible, and I won’t be able to cover it all here, but the examples below hopefully give a flavour of how it can be used.
The basic Beautiful Soup usage is as follows:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/North_Atlantic_air_ferry_route_in_World_War_II"
data = requests.get(url).text

soup = BeautifulSoup(data, 'html.parser')
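A quick way to check the parse has worked is to print something simple from the soup object, such as the page title:

# The text of the page's <title> element
print(soup.title.text)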
Finding All Links
Because Beautiful Soup incorporates an HTML parser, it can be used to extract information from particular HTML elements.
In this simple example we use Beautiful Soup to list the href of all links (which use the ‘a’ tag).
for link in soup.find_all('a'):
    print(link.get('href'))
We can combine finding links with regex or simple string functions to find particular links.
for link in soup.find_all('a'):
    if 'wikimedia' in link.get('href', ''):
        print(link)
Finding All Body Text
Next, an example of getting all of the text found in the HTML ‘body’.
We could technically do body = soup.find_all('body')[0], but typically there would only be one body.
body = soup.body

for string in body.strings:
    print(string)
To be more selective about which text to capture we can also look for paragraphs (see next section). Note that here we use ‘strings’ rather than ‘text’.
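The difference is that strings is a generator yielding each text fragment separately, whereas text joins everything into one long string – a small sketch, reusing the body object from above:

# One long string containing all of the body text
all_text = body.text

# The individual text fragments
fragments = list(body.strings)

print(len(all_text), len(fragments))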
Finding All Paragraphs
In a similar fashion we can find all text in a paragraph block.
paras = soup.find_all('p')

for para in paras:
    print(para.text)
Finding Tables
This final example is more complex and shows how Beautiful Soup can be used to extract table data from a webpage.
Firstly we can find all the ‘table’ elements in the target page.
tables = soup.find_all('table')
To select particular types of table, it can be useful to list the class of every table:
for table in soup.find_all('table'):
    print(table.get('class'))
Note that here we use get rather than ‘[]’, so that we can provide a default value if needed.
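The difference matters for tables with no class attribute: indexing with [] raises a KeyError, whereas get() returns None or a default of our choosing – a small sketch:

table = soup.find('table')

# table['class'] would raise a KeyError if the table has no class attribute
# get() returns None, or a default we supply, instead
print(table.get('class'))
print(table.get('class', 'no-class'))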
The output of that gives some indication that the class of interest is ‘wikitable’:
for table in soup.find_all('table', class_='wikitable'):
    print(table.text)
Once we have identified the tables we need to work through individual rows and columns.
It’s not always obvious how to handle the data coming back from the webpage, so building up an understanding can help. In this case we can use a simple loop to print out what belongs in each row and column, and match it to the corresponding rows and columns on the target webpage.
n = 1
for row in tables[1].find_all('tr'):
    print('Row: ', n)
    m = 1
    for column in row.find_all('td'):
        print('Column: ', m)
        print(column.text)
        m += 1
    n += 1
Once we have a better understanding of how the table is laid out in HTML we can transform it into a pandas dataframe. In this case we build up a list for each column and then combine them into a dataframe.
import pandas as pd

names = []
locations = []
coordinates = []
notes = []

for row in tables[1].find_all('tr'):
    col_num = 1
    for column in row.find_all('td'):
        if col_num == 1:
            names.append(column.text.strip())
        if col_num == 2:
            locations.append(column.text.strip())
        if col_num == 3:
            coordinates.append(column.text.strip())
        if col_num == 4:
            notes.append(column.text.strip())
        col_num += 1

df = pd.DataFrame({'Name': names,
                   'Location': locations,
                   'Coordinates': coordinates,
                   'Notes': notes})
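A quick check that the dataframe has come out as expected:

print(df.shape)
print(df.head())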
One problem you may encounter is nested tables, i.e. tables within tables. One way to approach this challenge is to pass recursive=False as an extra parameter to find_all().
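A minimal sketch of the idea: with recursive=False, find_all() only looks at direct children, so rows belonging to a nested table are not picked up when looping over the outer table.

table = soup.find('table', class_='wikitable')

# Rows may sit inside a <tbody>; search from there if so, otherwise from the table itself
container = table.tbody or table

# recursive=False limits the search to direct children,
# so any rows inside nested tables are excluded
rows = container.find_all('tr', recursive=False)
print(len(rows))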
pandas read_html
Finally, I wanted to show that pandas has a function ‘read_html‘, which can be used to parse HTML table data without directly working with Beautiful Soup. read_html is particularly suited to simpler pages with well-defined tables. You may find it doesn’t have the flexibility to deal with more challenging pages.
import pandas as pd
import requests

url = 'https://en.wikipedia.org/wiki/North_Atlantic_air_ferry_route_in_World_War_II'
html = requests.get(url).content

df_list = pd.read_html(html)
In this case it works really well, apart from the fact that the first table it finds is irrelevant.
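A quick look at what came back – here we assume the second dataframe in the list is the first table of interest:

print(len(df_list))

# Skip the first (irrelevant) table and look at the next one
print(df_list[1].head())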
A word on robots.txt
Robots.txt is a file that gives website owners the opportunity to specify what automated tools are allowed to access on their website. Wikipedia has a good example of a detailed robots.txt, which reveals that a big issue for Wikipedia is requests coming in at too fast a rate.
It is possible to ignore robots.txt, but generally speaking you should respect it.
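Python’s built-in urllib.robotparser can be used to check whether a particular URL is allowed before scraping it – a minimal sketch:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

url = "https://en.wikipedia.org/wiki/North_Atlantic_air_ferry_route_in_World_War_II"
print(rp.can_fetch("*", url))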