In the last post, I illustrated how to fetch HTML efficiently for data mining using the mechanize module. Now that we have our HTML, we can parse it for the information we want. To do this, we will use the HTMLParser module. It is part of the Python standard library, so you don't have to install anything.
In this example, we will glean all of the headlines from the main page of this blog.
    #!/usr/bin/env python
    from HTMLParser import HTMLParser
    from mechanize import Browser

    class HeadLineParser(HTMLParser):
        def __init__(self):
            self.in_header = False
            self.in_headline = False
            self.headlines = []
            HTMLParser.__init__(self)

        def handle_starttag(self, tag, attrs):
            if tag == 'div':
                # attrs is a list of tuple pairs, a dictionary is more useful
                dattrs = dict(attrs)
                if 'class' in dattrs and dattrs['class'] == 'header':
                    self.in_header = True
            if tag == 'a' and self.in_header:
                self.in_headline = True

        def handle_endtag(self, tag):
            if tag == 'div':
                self.in_header = False
            if tag == 'a':
                self.in_headline = False

        def handle_data(self, data):
            if self.in_headline:
                self.headlines.append(data)

    br = Browser()
    response = br.open('http://tylerlesmann.com/')
    hlp = HeadLineParser()
    hlp.feed(response.read())
    for headline in hlp.headlines:
        print headline
    hlp.close()
You use HTMLParser by extending it. The four methods you'll need are __init__, handle_starttag, handle_endtag, and handle_data. HTMLParser can be confusing at first because it works in an unusual manner: you never ask it for elements; it calls your methods as it reads through the document. Whenever an opening HTML tag is encountered, handle_starttag is called. Whenever a closing tag is found, handle_endtag is called. Whenever text between tags is encountered, handle_data is called.
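To see the order in which these methods fire, here is a minimal sketch that records every call. It assumes Python 3, where the module is named html.parser rather than HTMLParser; the EventLogger class and the sample markup are made up for illustration and are not part of the script above.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class EventLogger(HTMLParser):
    """Hypothetical helper: records each handler call in the order it happens."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag, e.g. <p> or <b>
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        # Called for every closing tag, e.g. </b> or </p>
        self.events.append(('end', tag))

    def handle_data(self, data):
        # Called for the text found between tags
        self.events.append(('data', data))


logger = EventLogger()
logger.feed('<p>Hello <b>world</b></p>')
logger.close()
for event in logger.events:
    print(event)
```

Each handler sees only one tag or one piece of text at a time, with no information about what surrounds it. That is why the parser in this post has to remember its position itself.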
Since each handler only sees one tag or piece of text at a time, the way to actually use HTMLParser is with a system of flags, like in_header and in_headline from the example. We toggle them on in handle_starttag and off in handle_endtag. If you look at the HTML of this blog, you'll see that headlines are enclosed in classless <a> tags. There are a lot of <a>s on this site, so we need something unique to flag the headline <a>s. If you look carefully, you will see that all of the headlines are enclosed in <div>s with a header class. We can flag those and then flag only the <a>s inside them, which is what the example does.
Now that the script has all the proper flags to detect headlines, we can simply have handle_data append any text to a list of headlines whenever our in_headline flag is True.
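If you want to try the flag logic without touching the network, the sketch below runs the same idea against a small inline document. It assumes Python 3 (html.parser instead of HTMLParser), and the SAMPLE_HTML markup is invented for the test; it only imitates the structure described above and is not the blog's real HTML.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module

# Hypothetical markup: two headline links inside header <div>s,
# plus two other links that should be ignored.
SAMPLE_HTML = """
<div class="sidebar"><a href="/about/">About</a></div>
<div class="header"><a href="/post/1/">First post</a></div>
<div class="body"><a href="/elsewhere/">A link in the text</a></div>
<div class="header"><a href="/post/2/">Second post</a></div>
"""


class HeadlineParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_header = False
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict is easier to query
        if tag == 'div' and dict(attrs).get('class') == 'header':
            self.in_header = True
        if tag == 'a' and self.in_header:
            self.in_headline = True

    def handle_endtag(self, tag):
        # Any closing <div> or <a> switches the matching flag back off
        if tag == 'div':
            self.in_header = False
        if tag == 'a':
            self.in_headline = False

    def handle_data(self, data):
        # Keep text only while we are inside a headline link
        if self.in_headline:
            self.headlines.append(data)


parser = HeadlineParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.headlines)
```

The links in the sidebar and body <div>s are skipped because in_header is never set for them, so only the two headline texts end up in the list.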
To use our new parser, we simply make an instance of it and use the instance's feed method to run HTML through the parser. We can then access the headlines attribute directly, like we can with any object in Python.
