Python is a great language for mining data, on the web or otherwise. Python comes with modules for accessing the Internet, like urllib and urllib2, but if you are short on time or would rather not write more code than you have to, you will want to use the mechanize package. One gotcha with mechanize is that you will want to get the latest version. You can download it directly from the project's site, along with its dependency ClientForm, or you can use easy_install, if you have it.
$ sudo easy_install mechanize
Using the mechanize module is almost as simple as using a web browser. Here's an example of use:
    #!/usr/bin/env python

    from mechanize import Browser

    br = Browser()
    br.open('http://tylerlesmann.com/')
    response = br.follow_link(text='Data Mining the Web with Python: Part 1')
    print response.read()
This script opens this site, follows the link to this article, downloads the HTML, and prints it. That's pretty simple, but mechanize can do much more interesting things.
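If you don't know the exact link text ahead of time, it can help to look at the links on a page before following one. Here's a minimal sketch of that idea. The helper name `matching_links` is hypothetical (it is not part of mechanize), and it assumes the browser object offers `open()` and `links()`, with each link carrying `text` and `url` attributes, as mechanize's `Browser` does.

```python
import re


def matching_links(br, url, pattern):
    """Open url and return (text, url) pairs for links whose text matches pattern.

    br is assumed to behave like mechanize.Browser: open() loads a page and
    links() iterates over objects that have .text and .url attributes.
    """
    br.open(url)
    regex = re.compile(pattern)
    found = []
    for link in br.links():
        # Some links may carry no text at all, so guard before searching.
        if link.text and regex.search(link.text):
            found.append((link.text, link.url))
    return found


# Example use (needs mechanize installed and performs real network access):
#
#     from mechanize import Browser
#     for text, href in matching_links(Browser(), 'http://tylerlesmann.com/',
#                                      r'Data Mining'):
#         print(text + ' -> ' + href)
```

Once you have spotted the link you want, you can pass its text to `follow_link` just like in the script above.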
    #!/usr/bin/env python

    from mechanize import Browser

    br = Browser()
    br.open('http://finance.yahoo.com/')
    br.select_form(name='quote')
    br['s'] = 'rht' # 's' is the input field in the quote form
    response = br.submit() # Submit the form just like a web browser
    print response.read()
This script goes to http://finance.yahoo.com, enters rht into the Get Quotes field, submits the form, and prints the HTML of the quote page for Red Hat Inc. It illustrates how simple it is to use forms with mechanize. You can use the very same methods to log into web sites. Mechanize handles all of the fun of cookies and sessions; you just have to tell it where to go.
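To make the login case concrete, here's a rough sketch that reuses the same form methods as the quote example. Everything site-specific in it is an assumption: the helper name `log_in` is hypothetical, the field names `username` and `password` are placeholders, and it guesses that the login form is the first form on the page (`nr=0`). Check the real page's HTML for the actual form and field names.

```python
def log_in(br, login_url, username, password,
           user_field='username', pass_field='password'):
    """Fill in and submit a login form, returning the response.

    br is assumed to behave like mechanize.Browser. The default field names
    are placeholders, not anything a particular site is known to use.
    """
    br.open(login_url)
    br.select_form(nr=0)        # assumption: the login form is the first form
    br[user_field] = username   # same item assignment as br['s'] = 'rht'
    br[pass_field] = password
    return br.submit()


# Example use (needs mechanize installed and performs real network access;
# the URLs and credentials are made up):
#
#     from mechanize import Browser
#     br = Browser()
#     log_in(br, 'http://example.com/login', 'alice', 'secret')
#     # Later requests on the same Browser reuse the session cookies.
#     page = br.open('http://example.com/account').read()
```

Keeping one `Browser` object around for the whole session is the point here, since that object is what holds on to the cookies between requests.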
In the next post, I'll detail how to parse the fetched HTML with the HTMLParser module.
