Python is an excellent language for mining data, on the web or otherwise. It comes with modules for accessing the Internet, like urllib and urllib2, but if you are short on time, or would rather not write more code than you have to, you will want to use the mechanize package. One gotcha with mechanize is that you will want to get the latest version. You can download it directly from the project's site, along with its dependency ClientForm, or you can use easy_install, if you have it.

$ sudo easy_install mechanize

Using the mechanize module is almost as simple as using a web browser. Here's an example of use:

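#!/usr/bin/env python

from mechanize import Browser

br = Browser()
br.open('http://tylerlesmann.com/')  # Load the front page

# Follow the link whose text is the article title
response = br.follow_link(text='Data Mining the Web with Python: Part 1')

print(response.read())  # Print the article's raw HTML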

This script goes to this site, follows the link to this article, downloads the HTML, and prints it. That's pretty simple. Mechanize can do much more interesting things.
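If follow_link cannot find a matching link it raises an error, so it can help to see which links mechanize found on the page first. Here is a minimal sketch of that, assuming the object passed in behaves like a mechanize Browser that has already opened a page. The helper name matching_links and the pattern in the usage comment are made up for illustration.

```python
def matching_links(browser, pattern):
    """Return (text, absolute_url) pairs for links whose text matches pattern.

    `browser` is assumed to behave like mechanize.Browser after open().
    `matching_links` is a hypothetical helper, not part of mechanize.
    """
    return [(link.text, link.absolute_url)
            for link in browser.links(text_regex=pattern)]

# Possible usage, assuming mechanize is installed and the site is reachable:
#   from mechanize import Browser
#   br = Browser()
#   br.open('http://tylerlesmann.com/')
#   for text, url in matching_links(br, 'Data Mining'):
#       print(text, url)
```

Once the link you want shows up in that list, follow_link with the same text should find it. Forms take only a little more work than links.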

#!/usr/bin/env python

from mechanize import Browser

br = Browser()
br.open('http://finance.yahoo.com/')
br.select_form(name='quote')  # Pick the form named 'quote'
br['s'] = 'rht'  # 's' is the input field in the quote form
response = br.submit()  # Submit the form just like a web browser
print(response.read())  # Print the quote page's raw HTML

This script goes to http://finance.yahoo.com, enters rht into the Get Quotes field, submits the form, and prints the HTML of the quote page for Red Hat Inc. It illustrates how simple it is to use forms with mechanize. You can use the very same methods to log into web sites. Mechanize handles all of the fun of cookies and sessions. You just have to tell it where to go.
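To make the login case concrete, here is a rough sketch using the same open, select_form, and submit calls as the quote script. Everything site-specific is an assumption: the helper name log_in, the example.com URLs, and the field names 'username' and 'password' are placeholders, and a real login page will need its own form and field names looked up in its HTML.

```python
def log_in(browser, login_url, username, password,
           user_field='username', pass_field='password'):
    """Fill in and submit the first form found at login_url.

    `browser` is assumed to behave like mechanize.Browser. The default
    field names are hypothetical; real sites will use their own.
    """
    browser.open(login_url)
    browser.select_form(nr=0)        # nr=0 selects the first form on the page
    browser[user_field] = username
    browser[pass_field] = password
    return browser.submit()          # the response to the login request

# Possible usage, assuming mechanize is installed (URLs are placeholders):
#   from mechanize import Browser
#   br = Browser()
#   log_in(br, 'http://example.com/login', 'me', 'secret')
#   page = br.open('http://example.com/members')  # same Browser, same cookies
```

The point of passing the same Browser object around is that any cookies set during login stay with it, so later requests made through it should be treated as part of the same session.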

In the next post, I'll detail how to parse the fetched HTML with the HTMLParser module.

Posted by Tyler Lesmann on October 3, 2008 at 13:09
Tagged as: mechanize python screen_scraping