Web Scraping 101
More and more organizations are publishing their data on the web. This is great, but websites often don’t offer an option to download a clean and complete dataset. In this situation, you have two options. First, you (or some unlucky intern) can hunker down and spend a week wearing out the ‘c’ and ‘v’ keys on your keyboard as you copy and paste ad nauseam from the website into an Excel spreadsheet. Second, you can use a tool like Python to automatically “scrape” the data from the website. In this blog post, I’ll demonstrate how to use Python to scrape the names of all IDinsight staff from our website.
Step 1 - Check out the raw HTML
Often, the first step in scraping a website is to get familiar with the site’s raw HTML. Let’s do this for the IDinsight staff webpage. Open “http://idinsight.org/about/team/staff/” in a browser and then view the raw HTML for the page. (To do this in Safari, select “Show Page Source” from the “Develop” menu. If you don’t see the “Develop” menu, follow these steps to add it.)
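If you prefer to stay in Python, you can also dump the raw HTML from there. The snippet below is just a small preview of the “requests” package introduced in step 2; it downloads the page and prints the first 1,000 characters so you can skim the source:
# download the page and print the start of the raw HTML
import requests
response = requests.get("http://idinsight.org/about/team/staff/")
print(response.text[:1000])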
When you look through the HTML, you will see that the details for each staff member are coded in a block similar to the following:
<div class="bio-info">
<h3>Abhinav Gupta</h3>
<h4>Senior Finance and Operations Manager</h4>
<div class="bio-details" class="hide">
<h3>Abhinav Gupta</h3>
<p>
<p><a href="mailto:abhinav.gupta@idinsight.org">abhinav.gupta@IDinsight.org</a></p>
<p>Abhinav is a Senior Finance and Operations Manager at IDinsight’s India office. He brings more than eight years of financial advisory and consulting experience in India and U.K.</p>
<p>Prior to joining IDinsight, Abhinav was a Manager with the M&A Transaction Services department of Deloitte in India. As a consultant, he advised clients across several sectors in infrastructure, telecommunications, retail and e-commerce.</p>
<p>Abhinav holds a master’s degree in Economics from University of Cambridge, UK and is a qualified Chartered Accountant from Institute of Chartered Accountants of England and Wales.</p>
</p>
</div>
</div>
This suggests that if we want to get the names of the staff members, one strategy would be to search the HTML for each block starting with <div class="bio-info"> and then, inside each of these blocks, grab the text between the <h3> and </h3> tags.
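To make this strategy concrete, here is a quick sketch that applies it to just the snippet above, pasted into a Python string (the variable names are my own, and I use the built-in "html.parser" rather than lxml for this mini example):
from bs4 import BeautifulSoup
# the bio block from above, stored as a string for illustration
snippet = """
<div class="bio-info">
<h3>Abhinav Gupta</h3>
<h4>Senior Finance and Operations Manager</h4>
</div>
"""
mini_soup = BeautifulSoup(snippet, "html.parser")
# find the div with class 'bio-info', then the first h3 tag inside it
bio = mini_soup.find("div", {"class": "bio-info"})
print(bio.find("h3").contents[0])  # prints: Abhinav Gupta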
Step 2 - Use BeautifulSoup to parse the HTML
The Python package “requests” downloads the HTML code from a webpage, and the Python package “BeautifulSoup” parses that HTML so that you can easily find specific code blocks. The following Python code downloads and parses the HTML for the IDinsight staff page.
# import the packages that we will need to scrape the website
# the BeautifulSoup package parses HTML
from bs4 import BeautifulSoup
# the requests package downloads the HTML
import requests
# download the HTML for the IDinsight staff webpage
response = requests.get("http://idinsight.org/about/team/staff/")
# parse the HTML with BeautifulSoup (the "lxml" parser requires the
# lxml package to be installed; "html.parser" is a built-in alternative)
soup = BeautifulSoup(response.content, "lxml")
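One caveat: the code above assumes the download succeeded. If you want a quick guard before parsing, requests provides a built-in check that raises an error for a failed response:
# stop early if the download failed (e.g. a 404 or 500 response)
response.raise_for_status()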
Step 3 - Use BeautifulSoup to search the parsed HTML
Now we want to search the parsed HTML for the blocks starting with <div class="bio-info"> and, inside each of these blocks, grab the text between the <h3> and </h3> tags. The following Python code performs these operations. BeautifulSoup is a little tricky, so I’m not going to explain this code in detail, but as you can see it is quite powerful. For more on the BeautifulSoup package, see the official documentation.
# use the BeautifulSoup method 'find_all' to get a list of div tags with the class 'bio-info'
bios = soup.find_all('div', {'class': "bio-info"})
# for each div tag with class 'bio-info', find the first h3 tag and get the contents of that tag
names = [bio.find('h3').contents[0] for bio in bios]
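As an aside, BeautifulSoup also understands CSS selectors, so the same search can be written as a one-liner (shown here only as an alternative, not what the code above does):
# the same search as a CSS selector: h3 tags that are direct children of a
# div with class 'bio-info' (the '>' skips the second h3 nested in 'bio-details')
names = [tag.get_text() for tag in soup.select("div.bio-info > h3")]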
Lastly, we can use the following Python code to check that we successfully scraped all of the staff names.
# print the names of all staff
for person in names:
    print(person)
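And since the whole point was to avoid pasting names into a spreadsheet by hand, here is a minimal sketch that writes the scraped names to a CSV file that Excel can open (the filename ‘staff_names.csv’ is just an example):
import csv
# write the scraped names to a CSV file, one name per row
with open("staff_names.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name"])  # header row
    for person in names:
        writer.writerow([person])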
Voilà! We’re done. For a more advanced example of web scraping, this link includes code that downloads the metadata for all of the studies in the 3ie impact evaluation repository. (If you just want the full dataset of metadata, let me know and I can share it with you.)