In my post from a while back, I gave an example of using the standard library's HTMLParser. HTMLParser is not the easiest way to glean information from HTML. There are two modules outside the standard Python distribution that can shorten development time considerably. The first is BeautifulSoup.
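For context, here is roughly what the HTMLParser approach looks like. This is a condensed sketch rather than the exact code from the earlier post; the class name HeadlineParser is invented here, and it assumes the headline links sit inside div tags with class="header", as in the versions below.

#!/usr/bin/env python

from HTMLParser import HTMLParser
from mechanize import Browser

class HeadlineParser(HTMLParser):
    # Track whether we are currently inside a header div and inside a link
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_header = False
        self.in_link = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and attrs.get('class') == 'header':
            self.in_header = True
        elif tag == 'a' and self.in_header:
            self.in_link = True

    def handle_endtag(self, tag):
        # Naive state handling: any closing div ends the header section
        if tag == 'div':
            self.in_header = False
        elif tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.headlines.append(data)

br = Browser()
response = br.open('http://tylerlesmann.com/')
parser = HeadlineParser()
parser.feed(response.read())
for headline in parser.headlines:
    print headline

Here is the same task using BeautifulSoup instead of HTMLParser: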

#!/usr/bin/env python

from BeautifulSoup import BeautifulSoup
from mechanize import Browser

br = Browser()
response = br.open('http://tylerlesmann.com/')
soup = BeautifulSoup(response.read())
headers = soup.findAll('div', attrs={'class': 'header'})
headlines = []
for header in headers:
    links = header.findAll('a')
    for link in links:
        headlines.append(link.string)
for headline in headlines:
    print headline

This is a lot shorter: 16 lines instead of 38. It also took about 20 seconds to write. There is one gotcha here. Both scripts do the same task, but the one using BeautifulSoup takes over twice as long to run. CPU time is much cheaper than development time, though.
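If you want to see the difference on your own machine, a quick and dirty harness like the following works. This is just an illustration, not the measurement behind this post: it fetches the page once and times only the parse-and-extract step, since the network round trip would otherwise dominate.

#!/usr/bin/env python

import time
from BeautifulSoup import BeautifulSoup
from mechanize import Browser

# Fetch the page once so we measure parsing, not the network
br = Browser()
html = br.open('http://tylerlesmann.com/').read()

start = time.time()
for i in range(50):
    soup = BeautifulSoup(html)
    for header in soup.findAll('div', attrs={'class': 'header'}):
        for link in header.findAll('a'):
            link.string
print 'BeautifulSoup: %.3f seconds for 50 parses' % (time.time() - start)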

The next module is lxml. Here's the lxml version of the code.

#!/usr/bin/env python

from lxml.html import parse
from mechanize import Browser

br = Browser()
response = br.open('http://tylerlesmann.com/')
doc = parse(response).getroot()
for link in doc.cssselect('div.header a'):
    print link.text_content()

As you can see, it is even shorter than the BeautifulSoup version, at 10 lines. On top of that, lxml is faster than HTMLParser. So what is the catch? The lxml module is built on C code, so you will not be able to use it on Google's AppEngine or on Jython.
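As a side note, cssselect does its work by translating the CSS selector into an XPath expression, so you can also write the XPath yourself. This equivalent snippet is my own addition, not from the original post, and it assumes the divs carry exactly class="header" and nothing else in the class attribute:

# Same selection as doc.cssselect('div.header a'), as explicit XPath
for link in doc.xpath("//div[@class='header']//a"):
    print link.text_content()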

Posted by Tyler Lesmann on January 14, 2009 at 6:13
Tagged as: beautifulsoup lxml mechanize python screen_scraping