In the last post, I illustrated how to efficiently fetch HTML for data mining using the mechanize module. Now that we have our HTML, we can parse it for the information we want. To do this, we will use the HTMLParser module. This is part of Python's standard library, so you don't have to install anything.

In this example, we will glean all of the headlines from the main page of this blog.

#!/usr/bin/env python

from HTMLParser import HTMLParser
from mechanize import Browser

class HeadLineParser(HTMLParser):
    def __init__(self):
        self.in_header = False
        self.in_headline = False
        self.headlines = []
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            # attrs is a list of tuple pairs, a dictionary is more useful
            dattrs = dict(attrs)
            if 'class' in dattrs and dattrs['class'] == 'header':
                self.in_header = True
        if tag == 'a' and self.in_header:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_header = False
        if tag == 'a':
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines.append(data)

br = Browser()
response = br.open('http://tylerlesmann.com/')
hlp = HeadLineParser()
hlp.feed(response.read())
for headline in hlp.headlines:
    print headline
hlp.close()

You use HTMLParser by extending it. The four methods you'll need are __init__, handle_starttag, handle_endtag, and handle_data. HTMLParser can be confusing at first because it works in an event-driven manner. Whenever an opening HTML tag is encountered, handle_starttag is called. Whenever a closing tag is found, handle_endtag is called. Whenever anything in between tags is encountered, handle_data is called.
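To see these callbacks fire in order, here is a minimal sketch that records every event it receives. Note this is written for Python 3, where the module was renamed to html.parser; the EventLogger class and the sample markup are my own, not part of the script above:

```python
from html.parser import HTMLParser  # Python 3 name for the HTMLParser module

class EventLogger(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # called once for each opening tag
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        # called once for each closing tag
        self.events.append(('end', tag))

    def handle_data(self, data):
        # called for the text between tags
        self.events.append(('data', data))

logger = EventLogger()
logger.feed('<p>Hello</p>')
print(logger.events)
```

Feeding `<p>Hello</p>` produces three events: the start of the p tag, the data between the tags, and the end of the p tag.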

The way to actually use HTMLParser is with a system of flags, like in_header and in_headline in the example. We toggle them on in handle_starttag and off in handle_endtag. If you look at the HTML of this blog, you'll see that headlines are enclosed in classless <a> tags. There are a lot of <a>s on this site. We need something unique to flag the headline <a>s. If you look carefully, you'll see that all of the headlines are enclosed in <div>s with a header class. We can flag those, and flag the <a>s only inside them, which is what the example does.

Now that the script has all the proper flags to detect headlines, we can simply have handle_data append any text to a list of headlines whenever our in_headline flag is True.
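The flag technique can be verified offline by running the same parser against a small literal snippet instead of a live page. This sketch is a Python 3 port of the class above (html.parser instead of HTMLParser, print as a function), and the sample HTML is invented to mimic the structure just described:

```python
from html.parser import HTMLParser

class HeadLineParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_header = False
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            # attrs is a list of tuple pairs, a dictionary is more useful
            dattrs = dict(attrs)
            if dattrs.get('class') == 'header':
                self.in_header = True
        if tag == 'a' and self.in_header:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_header = False
        if tag == 'a':
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines.append(data)

# invented markup mimicking the blog's structure: headlines live in
# classless <a> tags inside <div class="header">, other links do not
html = (
    '<div class="header"><a href="/post/1">First headline</a></div>'
    '<p><a href="/elsewhere">not a headline</a></p>'
    '<div class="header"><a href="/post/2">Second headline</a></div>'
)
parser = HeadLineParser()
parser.feed(html)
print(parser.headlines)
```

Only the two links inside header divs are collected; the third <a>, outside any header div, is ignored because in_header is False when it is encountered.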

To use our new parser, we simply make an instance of it and use the instance's feed method to run HTML through the parser. We can then access the headlines attribute directly, as we can with any attribute on a Python object.
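One convenient property of feed is that it can be called repeatedly with successive chunks of HTML; the parser buffers incomplete markup between calls, so you can stream a large response rather than reading it all at once. A small Python 3 sketch (the TextCollector class is mine, for illustration):

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

collector = TextCollector()
# feed() may be called many times; a tag split across chunk boundaries
# (here, the '</' of '</p>') is buffered and parsed when the rest arrives
for chunk in ('<p>Hello</', 'p><p>world</p>'):
    collector.feed(chunk)
collector.close()  # flush any remaining buffered data
print(''.join(collector.chunks))
```

The collected text joins to "Helloworld" even though the first closing tag was split between the two feed calls.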

Posted by Tyler Lesmann on October 4, 2008 at 7:09
Tagged as: htmlparser mechanize python screen_scraping