Web Scraping with Python — Part Two — Library overview of requests, urllib2, BeautifulSoup, lxml, Scrapy, and more!

Welcome to part 2 of the Big-Ish Data general web scraping writeups! I wrote the first one a little bit ago, got some good feedback, and figured I should take some time to go through some of the many Python libraries that you can use for scraping, talk about them a little, and then give suggestions on how to use them.

If you want to check the code I used and not just copy and paste from the sections below, I pushed the code to github in my bigishdata repo. In that folder you’ll find a requirements.txt file with all the libraries you need to pip install, and I highly suggest using a virtualenv to install them. Gotta keep it all contained and easier to deploy if that’s the type of project you’re working on. On this front, also let me know if you’re running this and have any issues!

Overall, the goal of the scraping project in this post is to grab all the information – text, headings, code segments and image urls – from the first post on this subject. We want to get the headings (both h1 and h3), paragraphs, and code sections and print them into local files, one for each tag. This task is very simple overall which means it doesn’t require super advanced parts of the libraries. Some scraping tasks require authentication, remote JSON data loading, or scheduling the scraping tasks. I might write an article about other scraping projects that require this type of knowledge, but it does not apply here. The goal here is to show basics of all the libraries as an introduction.

In this article, there’ll be three sections. First, I’ll talk about libraries that execute http requests to obtain HTML. Second, I’ll talk about libraries that are great for parsing HTML to allow you to scrape the data. Third, I’ll write about libraries that perform both actions at once. And if you have more suggestions of libraries to show, let me know on twitter and I’ll throw them in here.

Finally, a couple notes:

Note 1: There are many different ways of web scraping. People like using different methods, different libraries, different code structures, etc. I recognize that there are other useful methods out there; this is just what I’ve found to be successful over time.

Note 2: I’m not here to tell you that it’s legal to scrape every website. There are laws about what data is copyrighted, what data is owned by the company, and whether or not public data is actually legal to scrape. You might have to check things like robots.txt, their Terms of Service, or maybe a frequently asked questions page.

Note 3: If you’re looking for data, or need help with any other data engineering task, get in contact and we’ll see what I can do!
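On the robots.txt point from Note 2, here’s a minimal sketch of checking a site’s rules with Python 3’s built-in urllib.robotparser. The robots.txt contents and the example.com URLs are made up for illustration, and the rules are parsed from a string so nothing is downloaded. This only covers robots.txt; it says nothing about a site’s Terms of Service.

```python
from urllib import robotparser

# Hypothetical robots.txt contents, used here instead of a live download.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser without any network call."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(EXAMPLE_ROBOTS_TXT)

# can_fetch(useragent, url) reports whether the rules allow that user agent to request the url
allowed_post = parser.can_fetch("*", "https://example.com/2017/05/11/some-post/")
allowed_private = parser.can_fetch("*", "https://example.com/private/data.html")
print(allowed_post, allowed_private)
```

For a real site you would point the parser at the live file with set_url() and read() instead of parsing a string.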

Ok! That all being said, it’s time to get going!

Requesting the Page

The first section here is showing a few libraries that can hit web servers and ask nicely for the HTML.

For all the examples here, I request the page, and then save the HTML in a local file. The other note on this section is that if you’re going to use one of these libraries, this is part one of the scraping! I talked about that a lot in the first post of this series, how you need to make sure you split up getting the HTML, and then work on scraping the data from the HTML.

First library on the first section is the famous, popular, and very simple to use library, requests.

requests

Let’s see it in action.

import requests
import helpers #the helper module used throughout this post to write the output files

url = "https://bigishdata.com/2017/05/11/general-tips-for-web-scraping-with-python/"
params = {"limit": 48, 'p': 2} #example query string (?) values; this page doesn't need them, so they aren't passed below
headers = {'user-agent' : 'Jack Schultz, bigishdata.com, contact@bigishdata.com'}
page = requests.get(url, headers=headers)
helpers.write_html('requests', page.text.encode('UTF-8'))

Like I said, incredibly simple.
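A quick aside on the params dictionary in the example above: requests.get accepts a params argument and turns a dictionary like that into the query string for you. Here’s a small sketch of what that query string looks like, built with Python 3’s standard library rather than requests so that nothing is actually requested. The example.com URL is a placeholder, not a real endpoint.

```python
from urllib.parse import urlencode

url = "https://example.com/search"  # placeholder URL for illustration only
params = {"limit": 48, "p": 2}

# urlencode turns the dictionary into "key=value" pairs joined by "&"
query_string = urlencode(params)
full_url = url + "?" + query_string
print(full_url)
```

With requests itself, the equivalent would be passing params=params to requests.get and letting the library build the URL.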

Requests also handles more advanced features like SSL verification, credentials, cookies, and more. As mentioned, I’m not going to go into those features here (but maybe later). This post sticks to simple examples for an actual project.

Overall, even before talking about the other libraries below, requests is the way to go.

urllib / urllib2

Ok, time to ignore that last sentence in the requests section, and move on to another simple library, urllib2. If you’re using Python 2.X, then it’s very simple to request a single page. And by simple, I mean a couple of lines.

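import urllib2
import helpers #the helper module used throughout this post to write the output files

url = "https://bigishdata.com/2017/05/11/general-tips-for-web-scraping-with-python/"
headers = {'user-agent' : 'Jack Schultz, bigishdata.com, contact@bigishdata.com'}
request = urllib2.Request(url, headers=headers) #wrap the url so the headers actually get sent
fstring = urllib2.urlopen(request).read() #returns HTML string
helpers.write_html('urllib2', fstring)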

Tada! Just as simple for this basic request. The difference between urllib2 and requests is that urllib2 lacks some of the simplicity around ssl, cookies, authentication, posting files, etc. Not that those aren’t all possible with urllib2, but urllib2 requires extra lines of code when compared to requests. But in this case, urllib2 is nice and simple.
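If you’re on Python 3, urllib2 no longer exists under that name; its pieces live in urllib.request. Here’s a rough sketch of the same idea there, showing the “extra lines” part: custom headers go onto a Request object first. Building the Request doesn’t touch the network, so the actual urlopen call is left as a comment.

```python
import urllib.request

url = "https://bigishdata.com/2017/05/11/general-tips-for-web-scraping-with-python/"
headers = {"User-Agent": "Jack Schultz, bigishdata.com, contact@bigishdata.com"}

# Creating the Request object only describes the request; nothing is sent yet.
req = urllib.request.Request(url, headers=headers)

method = req.get_method()                 # "GET", since no data was attached
user_agent = req.get_header("User-agent") # header names are stored capitalized like this
print(method, req.full_url)
print(user_agent)

# Sending it would look like:
# html = urllib.request.urlopen(req).read()
```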

httplib

Googling “python http requests”, I came across another library built into Python that runs HTTP requests, called httplib. And just like moving from requests to urllib2, the shift from urllib2 to httplib requires being more specific in the code.

Somewhat comically, one of the first lines in the documentation mentions how people should basically use the requests library instead of httplib, but I figure I might as well give an example of this!

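import httplib
import helpers #the helper module used throughout this post to write the output files

#note that the url here is split into the base and the path
#use httplib.HTTPSConnection instead if the site only serves HTTPS
conn = httplib.HTTPConnection("bigishdata.com")
conn.request("GET", "/2017/05/11/general-tips-for-web-scraping-with-python/")
response = conn.getresponse()
helpers.write_html('httplib', response.read())
conn.close()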

Also very simple for this task, but you’ll see that you have to spell out the specifics of the HTTP connection and the actual HTTP request rather than just letting the library deal with it. Simple, but no reason to use this library instead of requests.
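Since httplib wants the host and the path separately, here’s a small sketch of splitting a full url into those two pieces instead of typing them by hand. It uses Python 3’s urllib.parse (in Python 2 the same function lives in the urlparse module), and it only parses the string, so no connection is opened.

```python
from urllib.parse import urlsplit

url = "https://bigishdata.com/2017/05/11/general-tips-for-web-scraping-with-python/"

parts = urlsplit(url)
scheme = parts.scheme  # tells you whether a plain or TLS connection is needed
host = parts.netloc    # what gets passed when creating the connection
path = parts.path      # what gets passed to conn.request("GET", path)

print(scheme, host, path)
```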

Also note, when writing this code, there were some issues with SSL errors, HTTP vs. HTTPS, which depend on the version of Python you’re using and if you’re using a virtual environment or not. So if you’re having issues running this code on your own, make sure you’re running a virtual environment.

Scraping the Page

Moving on! Once you have the HTML saved as a local file, the next step is writing code to gather the data from the page. Like I mentioned in the first post, when you’re working on the code that will scrape the data from the HTML, don’t keep requesting the page from the server. Save it locally and then practice figuring out the correct classes and ids that hold the data and how to use one of these libraries to get that data.
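Here’s one way that “save it locally, don’t keep requesting” habit could look in code. This is a sketch in Python 3: load_or_fetch is a name I made up, and the fetch argument stands in for whichever requesting library you picked. The demo uses a fake fetch function and a temporary folder so that it runs without hitting any server.

```python
import os
import tempfile

def load_or_fetch(path, fetch):
    """Return the bytes saved at path, calling fetch() only if the file is missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    html = fetch()
    with open(path, "wb") as f:
        f.write(html)
    return html

# Fake fetcher that records how many times it was called (stands in for a real request).
calls = []
def fake_fetch():
    calls.append(1)
    return b"<html><body>cached page</body></html>"

cache_path = os.path.join(tempfile.mkdtemp(), "page.html")
first = load_or_fetch(cache_path, fake_fetch)   # file missing, so fetch runs
second = load_or_fetch(cache_path, fake_fetch)  # file exists, so fetch is skipped
print(len(calls))
```

While you’re iterating on the scraping code, every run after the first reads from disk and the server only sees one request.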

With each library in this section, I’ll show the code of how to store all the paragraphs of text, the headings, the urls of the images, and the examples of code each into their own text file.
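The examples below all hand their results to helpers.write_data, which isn’t shown in this post. As a hypothetical sketch only (the real version is in the repo and may well differ), a function that writes one text file per tag could look like this in Python 3. The directory argument, the file naming, and the returned paths are my own assumptions.

```python
import os
import tempfile

def write_data(library, text, directory="data"):
    """Write each tag's strings to <directory>/<library>_<tag>.txt, one item per line."""
    os.makedirs(directory, exist_ok=True)
    paths = {}
    for tag, items in text.items():
        path = os.path.join(directory, "%s_%s.txt" % (library, tag))
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(items))
        paths[tag] = path
    return paths

# Small demo with made-up content, written to a temporary folder.
demo_text = {"h1": ["Title"], "p": ["First paragraph.", "Second paragraph."]}
paths = write_data("bs", demo_text, directory=tempfile.mkdtemp())
with open(paths["p"], encoding="utf-8") as f:
    saved_paragraphs = f.read().split("\n")
print(saved_paragraphs)
```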

BeautifulSoup

Ahh yes, BeautifulSoup. Frankly, it’s a fantastically simple library to use when you’re looking to get the data. Basically, the requirements for using this library are to figure out the tags, classes, ids where the data is stored.

from bs4 import BeautifulSoup as bs
import helpers #the helper module used throughout this post to write the output files

#page_string holds the HTML that was saved to a local file in the first section
soup = bs(page_string, "html.parser")
article = soup.find('div', {'class' : 'entry-content'})

text = {}
text['p'] = []
text['h1'] = []
text['h3'] = []
text['pre'] = []
text['imgsrc'] = [] #has to exist before the image loop below appends to it
for tag in article.contents:
  #multiple if statements here to make it easier to read
  if tag is not None and tag.name is not None:
    if tag.name == "p":
      text['p'].append(tag.text)
    elif tag.name == 'h1':
      text['h1'].append(tag.text)
    elif tag.name == 'h3':
      text['h3'].append(tag.text)
    elif tag.name == 'pre':
      text['pre'].append(tag.text)
for tag in article.find_all('img'): #find_all searches every level under the article div
  text['imgsrc'].append(tag['src'])
helpers.write_data('bs', text)

Boom, and frankly, simple.

LXML

The other giant and popular HTML scraping library for Python is LXML. It’s very similar in setup to BeautifulSoup, and in this case, since the data I’m scraping is pretty standard and simple to get, the only difference is the names of the functions that look for tags with specific classes.

import lxml.html

with open('page.html', 'r') as f:
  page_string = f.read()

page = lxml.html.fromstring(page_string)
post = page.find_class('entry-content')[0] #0 since only one tag with that class

text = []
imgsrc = [] #image urls get their own list, since text is a list and not a dict
for tag in post.findall('p'):
  #text_content() includes text inside nested tags like links; .text stops at the first child tag
  text.append({'type': 'paragraph', 'text': tag.text_content()})
  for img in tag.findall('img'): #images in paragraphs, so need to check here
    imgsrc.append(img.attrib['src'])
for tag in post.findall('h1'):
  text.append({'type': 'h1', 'text': tag.text_content()})
for tag in post.findall('h3'):
  text.append({'type': 'h3', 'text': tag.text_content()})
for tag in post.findall('pre'):
  text.append({'type': 'code', 'text': tag.text_content()})

This library does offer several different ways to grab the data you’re looking for, like the ElementTree API, XPath expressions, or CSS selectors, beyond the simple lookups used here. This means picking BeautifulSoup or LXML depends on the file and data you want. Just like I mentioned above in the requesting the page section of the post, I’m not going deep into the advanced usage of this library.
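To give a taste of the ElementTree and XPath style of lookup without needing lxml installed, here’s a sketch using Python 3’s built-in xml.etree.ElementTree, which is the API that lxml’s find and findall calls are modeled on. Big caveat: the built-in module only understands a small subset of XPath and only parses well-formed markup, so it’s a stand-in for illustration and not a replacement for lxml.html on real pages. The snippet of HTML is made up.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet shaped like the post being scraped.
snippet = """
<html><body>
  <div class="entry-content">
    <h3>Requesting the Page</h3>
    <p>First <em>paragraph</em> here.</p>
    <p><img src="/images/example.png" /></p>
  </div>
</body></html>
""".strip()

root = ET.fromstring(snippet)

# ".//div[@class='entry-content']" means: any div below the root with that exact class attribute
post = root.find(".//div[@class='entry-content']")

headings = [h.text for h in post.findall("h3")]
# itertext() walks nested tags too, so the text inside <em> is included
paragraphs = ["".join(p.itertext()) for p in post.findall("p")]
image_urls = [img.get("src") for img in post.findall(".//img")]

print(headings, paragraphs, image_urls)
```

Note how the first paragraph’s .text alone would only be "First ", because .text stops at the first child tag.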

HTMLParser

Moving down! How about looking at the HTMLParser library that’s built into base Python? I had never heard of this standard library module before, but it came up when I searched Google for “python html parsing libraries”, so I figured why not. But before I continue here, I’ll just say that you should probably not use this. It took me quite a while to learn enough to even write this section, which I had to base on the response to this question from StackOverflow.

The library is very class based, so you’ll need to write a subclass with functions that are called by the library when it hits a new tag. You’ll see in the code below that you have to deal with every tag: starting, ending, and the data within the tag itself. Which means you’ll need to use instance variables to keep track of which tags of the HTML you’re in before being able to grab the text.

And again, writing and figuring out how this library works took a while. I needed to google a lot of tutorials and examples to know I got the code correct. In the end, it works. But it took way longer to write than the simple BeautifulSoup and lxml.

from HTMLParser import HTMLParser
import helpers #the helper module used throughout this post to write the output files

desired_tags = (u'p', u'h1', u'h3', u'pre')
class BigIshDataParser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.inside_entry_content = 0
    self.current_tag = None
    self.current_text = []
    self.overall_text = {}
    self.overall_text['p'] = []
    self.overall_text['h1'] = []
    self.overall_text['h3'] = []
    self.overall_text['pre'] = []
    self.overall_text['img'] = [] #has to exist before handle_starttag appends image urls to it

  def handle_starttag(self, tag, attributes):
    if self.inside_entry_content and tag in desired_tags:
      self.current_tag = tag
    if tag == 'div':
      for name, value in attributes:
        if name == 'class' and value == 'entry-content': #if this is correct div
          self.inside_entry_content += 1
          return #don't keep going through the attributes since there could be a ton of them
    if tag == 'img' and self.inside_entry_content: #need to deal with images here since they're only a start tag
      for attr in attributes:
        if attr[0] == 'src':
          self.overall_text['img'].append(attr[1])
          break
  def handle_endtag(self, tag):
    if tag == 'div' and self.inside_entry_content:
      self.inside_entry_content -= 1 #moving on down the divs
    if tag == self.current_tag:
      tstring = ''.join(self.current_text)
      self.overall_text[self.current_tag].append(tstring)
      self.current_text = []
      self.current_tag = None

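  def handle_data(self, data):
    #only collect text while inside one of the desired tags, so stray text between tags isn't mixed in
    if self.inside_entry_content and self.current_tag:
      self.current_text.append(data)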

p = BigIshDataParser()
page_string = p.unescape(page_string.decode('UTF-8'))
p.feed(page_string)
helpers.write_data('htmlparser', p.overall_text)
p.close()

Woof. Now, this code works, but it sure took a while. The one thing to think about here is that it could be useful depending on the task you have at hand. Finding the text in a blog post is very simple, but in a different case, HTMLParser could be ideal. By reading this post, you now know about its existence, so keep that in mind when you’re trying to figure out which libraries to use for your project.
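If the full class above is a lot to take in, here’s a stripped-down sketch of the same callback pattern that only collects h3 headings. It’s written for Python 3, where the module is html.parser rather than HTMLParser. The HeadingCollector name and the little test string are made up for illustration.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect the text of every <h3> tag, including text in nested tags."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False   # are we currently between <h3> and </h3>?
        self.parts = []      # text pieces seen inside the current <h3>
        self.headings = []   # finished headings

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "h3" and self.in_h3:
            self.headings.append("".join(self.parts))
            self.in_h3 = False

    def handle_data(self, data):
        # called for every run of text; keep it only while inside an <h3>
        if self.in_h3:
            self.parts.append(data)

collector = HeadingCollector()
collector.feed("<div><h3>Requesting the <em>Page</em></h3><p>text</p><h3>Combos!</h3></div>")
collector.close()
print(collector.headings)
```

The shape is the same as the bigger example: one flag to remember where you are, one buffer for the text, and the end tag callback is where the buffer gets saved.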

Combos!

If you’re scraping a site, you’ll need to pick a library from the first section that grabs the HTML, and then another from the second section that will scrape the information you’re looking for from that file. On the other hand, there are libraries out there that handle both of those tasks!

Scrapy

I’d never used Scrapy before, but I’ve heard it talked about a lot. When I first checked out the docs, I had to give them a shoutout, since they talk about how BeautifulSoup and LXML are also ways of scraping the data!

Scrapy is very class based, similar to HTMLParser above. To use scrapy, you create a subclass of scrapy.Spider, set start_urls to the url of the page you want to scrape (here, the first post), and then override the parse function, where you write the code that finds your data in the HTML.

Scrapy deals with the requests, turns the response string into something searchable, and supplies a Selector class where you specify what data you’re looking for using XPath or CSS expressions. It’s similar to BeautifulSoup and lxml, and it’s not difficult to learn the correct way to do this. Just read the docs on selectors.

Another big part about Scrapy is that all you have to do in the parse function is yield the data structure that holds the data you’ve scraped and Scrapy will turn that into JSON and write that to a file you specify. In the example below, I also write it to the file I use for the other examples, but yielding and specifying where you want those results to go is handled for you.

And finally, the other big thing with scrapy is that it makes deployment simple and automated. There are cloud based services where you can just deploy a configured folder and then sit back as it handles the requests and data saving and scheduled requests. The non combo options are decently easy to deploy on Heroku or AWS, but this is way simpler on this front.

import helpers

import scrapy
from scrapy.selector import Selector

class DataSpider(scrapy.Spider):
  name = "data"
  start_urls = [
     'https://bigishdata.com/2017/05/11/general-tips-for-web-scraping-with-python/'
  ]

  desired_tags = (u'p', u'h1', u'h3', u'pre')
  text = {}

  def words_from_tags(self, tag, response):
    total = []
    div = response.xpath("//div[contains(@class, 'entry-content')]")
    for para in div.xpath(".//%s" % tag):
      combined = []
      for words in para.xpath('./descendant-or-self::*/text()'):
        combined.append(words.extract())
      total.append(' '.join(combined))
    return total

  def parse(self, response):
    selector = Selector(response=response)
    for tag in self.desired_tags:
      self.text[tag] = self.words_from_tags(tag, response)
    helpers.write_data('scrapy', self.text)
    yield self.text #how scrapy returns the json object you created

Selenium

Very often, the Selenium library is used for testing websites you’re building. Unlike requests or the other libraries I talked about in the first section, Selenium acts like an actual browser sending the request rather than just asking for the plain HTML. This means that it will ask for the URL, and then run the javascript if that’s involved with creating the actual web page.

In this case, it’s also very useful for scraping if you’re looking to scrape sites that load their data remotely as JSON, and also sites that try to detect requests that don’t come from a person clicking through in a browser.

Since the page I’m scraping here isn’t loaded remotely, that benefit of Selenium isn’t necessary here, but it’s an advantage to keep in mind.

In the code below, I’m using PhantomJS, which is basically a headless browser, meaning one that runs from code without opening a window. You could use a bunch of different browsers, like Chrome or Firefox, if you’re trying to specifically test with those. This means that the headers you send the request with look like that browser’s, so it’d be hard for a site to detect that someone is trying to scrape it, which ties into what I talked about above in terms of being nice to sites you’re scraping. And also, since it’s acting like a legit browser, the wordpress analytics has been saying I’ve been getting more hits when running this code rather than just getting the page!

import helpers

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

url = 'https://bigishdata.com/2017/05/11/general-tips-for-web-scraping-with-python/'

driver = webdriver.PhantomJS()
driver.get(url)
elem = driver.find_element_by_class_name('entry-content')

text = {}
desired_tags = (u'p', u'h1', u'h3', u'pre')
for tag in desired_tags:
  tags = elem.find_elements_by_tag_name(tag)
  text[tag] = []
  for data in tags:
    text[tag].append(data.text)

helpers.write_data('selenium', text)
driver.quit() #shut down the browser process once the data is saved

Anyway, Selenium seems like a very in depth library with tons of features depending on the type of project you’re working on. It’s great for site testing, and it also works for scraping sites when you don’t want to deal with the remote data requests yourself. And compared to scrapy, it allows simple line by line code if you’re just looking for text in an HTML page.

In Summary…

I know I talked about this a lot in all the sections above, but the main suggestion I have is to try out these different libraries, and pick the ones that you’re most comfortable with and that work for the scraping project you’re working on. In my case, I use requests and BeautifulSoup. I format the code simply, and then use something like Redis Queue to run the scraping jobs on the schedule that I need, and then deploy on Heroku with an AWS db. On the other hand, I could just be using scrapy, which can handle all of those tasks on its own!

No matter what you choose, just enjoy writing code to get the data you need, because gathering data is an incredibly underrated task and one of the most important ones for machine learning.

Again, get in contact if you need some data engineering work, and follow on twitter if you’re interested in thoughts and what I’m up to.
