Web scraping is the process of automatically identifying and downloading data from a webpage. This blog post looks at a few Python web scraping options.
Alternatives to web scraping include:
– Using an API. Usually a better option than web scraping if an API is available – a short example follows this list. See also Creating an API with Flask, and open data APIs.
– Manually copying data from a website. Can be a better option if there are only a small number of target pages or if there is a lot of variation in the website data. See also an example of manual copying used to ‘scrape’ Wikipedia data.
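To give a flavour of the API route, here is a minimal sketch using Wikipedia's REST API, which returns structured JSON so no HTML parsing is needed (the endpoint shown is just an illustrative example):

import requests

# The Wikimedia REST API returns a JSON summary of a page
url = "https://en.wikipedia.org/api/rest_v1/page/summary/North_Atlantic_air_ferry_route_in_World_War_II"
summary = requests.get(url).json()

print(summary["extract"])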
Python Web Scraping Options
I’ve picked out a few different Python web scraping options. There is some overlap in terms of concepts and packages used, as the essential idea is the same: programmatically grab data from a webpage and then look for certain aspects within it.
Common Python web scraping packages include requests, Beautiful Soup and the built-in re module for regex.
Requests
Probably the core of modern Python web scraping is the requests library, which allows you to easily capture webpage data in a Python script. Note that earlier approaches to Python web scraping may have used the urllib library – read about the differences between requests and urllib.
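As a rough illustration of the difference, here is the same page fetched with both libraries – a minimal sketch:

import urllib.request

import requests

url = "https://en.wikipedia.org/wiki/North_Atlantic_air_ferry_route_in_World_War_II"

# urllib returns bytes, which we decode ourselves
with urllib.request.urlopen(url) as response:
    html_via_urllib = response.read().decode("utf-8")

# requests handles the decoding for us via .text
html_via_requests = requests.get(url).text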
The data from requests comes back ‘raw’, and is likely to contain lots of HTML tags around the actual information you are interested in.
In very simple cases, such as looking for occurrences of a particular name or phrase, you could use standard Python string functions. Here is an example:
import requests

url = "https://en.wikipedia.org/wiki/North_Atlantic_air_ferry_route_in_World_War_II"
data = requests.get(url).text

print("Nova Scotia" in data)
So the text back from requests can be used just like any other string.
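For example, continuing from the snippet above, the usual string methods work directly on the downloaded text:

# Count occurrences, and find the position of the first one
print(data.count("Nova Scotia"))
print(data.find("Nova Scotia"))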
Simple Regex
Introducing regex is a good next step once you move beyond simply downloading the data with requests. Using an online regex tool can be helpful when writing regex.
Regex is powerful, but it is not possible to fully parse HTML using regex. For this reason regex is generally suited to simpler tasks such as picking out IP addresses, postcodes etc. from a webpage.
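For instance, a loose pattern like the one below will pull out anything that looks like an IPv4 address (the page used here is just an illustrative example, and the pattern will also match invalid addresses such as 999.999.999.999):

import re

import requests

url = "https://en.wikipedia.org/wiki/IP_address"
data = requests.get(url).text

# Four groups of 1-3 digits separated by dots
ip_pattern = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")
print(ip_pattern.findall(data))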
The following example uses regex to find latitude and longitude coordinates in the webpage.
import re

import requests

url = "https://en.wikipedia.org/wiki/North_Atlantic_air_ferry_route_in_World_War_II"
data = requests.get(url).text

# Find pairs of coordinates (a latitude followed by the nearest longitude)
pattern = re.compile(
    r"(\d{1,3}°\d{1,3}′\d{1,3}″N).*?(\d{1,3}°\d{1,3}′\d{1,3}″W)")

for match in pattern.finditer(data):
    print(match.groups())
Read more about capturing multiple groups using finditer.
Beautiful Soup
Beautiful Soup is probably the most popular Python web scraping tool. It is powerful and flexible, and I won’t be able to cover it all here, but the examples below hopefully give a flavour of how it can be used.
The basic Beautiful Soup usage is as follows:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/North_Atlantic_air_ferry_route_in_World_War_II"
data = requests.get(url).text

soup = BeautifulSoup(data, 'html.parser')
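A quick way to check the parse has worked is to print something simple from the soup object, such as the page title:

# The text of the page's <title> element
print(soup.title.text)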
Finding All Links
Because Beautiful Soup incorporates an HTML parser, it can be used to extract information from particular HTML elements.
In this simple example we use Beautiful Soup to list the href of all links (which use the ‘a’ tag).
for link in soup.find_all('a'):
    print(link.get('href'))
We can combine finding links with regex or simple string functions to find particular links.
for link in soup.find_all('a'):
    if 'wikimedia' in link.get('href', ''):
        print(link)
Finding All Body Text
Next, an example of getting all of the text found in the HTML ‘body’.
We could technically do body = soup.find_all('body')[0], but typically there would only be one body.
body = soup.body

for string in body.strings:
    print(string)
To be more selective about which text to capture we can also look for paragraphs (see next section). Note that here we use ‘strings’ rather than ‘text’.
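The difference is that strings is a generator yielding each text fragment separately, whereas text joins everything into one long string – a small sketch, reusing the body object from above:

# One long string containing all of the body text
all_text = body.text

# The individual text fragments
fragments = list(body.strings)

print(len(all_text), len(fragments))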
Finding All Paragraphs
In a similar fashion we can find all text in a paragraph block.
paras = soup.find_all('p')

for para in paras:
    print(para.text)
Finding Tables
This final example is more complex and shows how Beautiful Soup can be used to extract table data from a webpage.
Firstly we can find all the ‘table’ elements in the target page.
tables = soup.find_all('table')
To select particular types of table, it can be useful to list the class of every table:
for table in soup.find_all('table'):
    print(table.get('class'))
Note that here we use get rather than ‘[]’, so that we can provide a default value if needed.
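The difference matters for tables with no class attribute: indexing with [] raises a KeyError, whereas get() returns None or a default of our choosing – a small sketch:

table = soup.find('table')

# table['class'] would raise a KeyError if the table has no class attribute
# get() returns None, or a default we supply, instead
print(table.get('class'))
print(table.get('class', 'no-class'))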
The output of that gives some indication that the class of interest is ‘wikitable’:
for table in soup.find_all('table', class_='wikitable'):
    print(table.text)
Once we have identified the tables we need to work through individual rows and columns.
It’s not always obvious how to handle the data coming back from the webpage, so building up an understanding can help. In this case we can use a simple loop to print out what belongs in each row and column, and match it to the corresponding rows and columns on the target webpage.
n = 1
for row in tables[1].find_all('tr'):
    print('Row: ', n)
    m = 1
    for column in row.find_all('td'):
        print('Column: ', m)
        print(column.text)
        m += 1
    n += 1
Once we have a better understanding of how the table is laid out in HTML we can transform it into a pandas dataframe. In this case we build up a list for each column and then combine them into a dataframe.
import pandas as pd

names = []
locations = []
coordinates = []
notes = []

for row in tables[1].find_all('tr'):
    col_num = 1
    for column in row.find_all('td'):
        if col_num == 1:
            names.append(column.text.strip())
        if col_num == 2:
            locations.append(column.text.strip())
        if col_num == 3:
            coordinates.append(column.text.strip())
        if col_num == 4:
            notes.append(column.text.strip())
        col_num += 1

df = pd.DataFrame({'Name': names,
                   'Location': locations,
                   'Coordinates': coordinates,
                   'Notes': notes})
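A quick check that the dataframe has come out as expected:

print(df.shape)
print(df.head())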
One problem you may encounter is nested tables, i.e. tables within tables. One way to approach this challenge is to pass recursive=False as an extra parameter to find_all().
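A minimal sketch of the idea: with recursive=False, find_all() only looks at direct children, so rows belonging to a nested table are not picked up when looping over the outer table.

table = soup.find('table', class_='wikitable')

# Rows may sit inside a <tbody>; search from there if so, otherwise from the table itself
container = table.tbody or table

# recursive=False limits the search to direct children,
# so any rows inside nested tables are excluded
rows = container.find_all('tr', recursive=False)
print(len(rows))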
pandas read_html
Finally, I wanted to show that pandas has a function ‘read_html‘, which can be used to parse HTML table data without directly working with Beautiful Soup. read_html is particularly suited to simpler pages with well-defined tables. You may find it doesn’t have the flexibility to deal with more challenging pages.
import pandas as pd
import requests

url = 'https://en.wikipedia.org/wiki/North_Atlantic_air_ferry_route_in_World_War_II'
html = requests.get(url).content

df_list = pd.read_html(html)
In this case it works really well, apart from the fact that the first table it finds is irrelevant.
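A quick look at what came back – here we assume the second dataframe in the list is the first table of interest:

print(len(df_list))

# Skip the first (irrelevant) table and look at the next one
print(df_list[1].head())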
A word on robots.txt
Robots.txt is a file that gives website owners the opportunity to specify what automated tools are allowed to access on their website. Wikipedia has a good example of a detailed robots.txt, which reveals that a big issue for Wikipedia is requests coming in at too fast a rate.
It is possible to ignore robots.txt, but generally speaking you should respect it.
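Python’s built-in urllib.robotparser can be used to check whether a particular URL is allowed before scraping it – a minimal sketch:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

url = "https://en.wikipedia.org/wiki/North_Atlantic_air_ferry_route_in_World_War_II"
print(rp.can_fetch("*", url))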