Archive

If you are using Python 2.6 or higher, you should get to know the multiprocessing module as soon as possible. It works around the GIL to give true multiprocessing capabilities to Python. Here is an example showing how to spider sites with several worker processes. The logging module is imperative for debugging these multiprocess programs.
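
To illustrate the logging point before diving in, here is a minimal sketch, separate from the spider below, of Pool workers writing debug output to stderr; the fetch() function and URLs are placeholders of my own.

#!/usr/bin/env python

import logging
from multiprocessing import Pool

def fetch(url):
    # On Linux, the forked workers inherit the logging configuration set up
    # in the parent before the Pool was created.
    logging.debug('Worker fetching %s' % url)
    return url

if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    pool = Pool(4)
    urls = ['http://example.com/page%d' % n for n in range(8)]
    for url in pool.map(fetch, urls):
        logging.info('Done with %s' % url)

Now, the spider itself: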

#!/usr/bin/env python

"""
Spider steam boycott group and tally who followed through and who didn't.
"""

import logging
import urllib2
from cStringIO import StringIO
from multiprocessing import Pool
from lxml.html import parse

def glean_games(url):
    logging.debug('Getting %s' % url)
    doc = parse(urllib2.urlopen(url)).getroot()
    game_elements = doc.cssselect('#mainContents h4')
    return [e.text_content() for e in game_elements]

def glean_users(url=None, html=None):
    if html is None:
        logging.debug('Getting %s' % url)
        doc = parse(urllib2.urlopen(url)).getroot()
    else:
        doc = parse(StringIO(html)).getroot()
    user_links = doc.cssselect(
    'a.linkFriend_offline, a.linkFriend_online, a.linkFriend_in-game')
    return [(link.text_content(), link.attrib['href']) for link in user_links]

def spider(url, pool_size=20):
    logging.debug('Getting %s' % url)
    response = urllib2.urlopen(url)
    html = response.read() # Necessary for multiprocessing; arguments need to be pickleable
    group_page = parse(StringIO(html)).getroot()
    page_links = group_page.cssselect('.pageLinks a')
    page_count = page_links[-2].attrib['href'].split('=')[-1]
    urls = ['%s?p=%d' % (url, page) for page in xrange(2, int(page_count) + 1)]

    pool = Pool(pool_size)
    results = []
    results.append(pool.apply_async(glean_users, (), {'html': html}))
    results.extend([pool.apply_async(glean_users, (url,)) for url in urls])

    users = []
    for result in results:
        users.extend(result.get())

    logging.info('Found %d users!' % len(users))

    game_results = []
    for username, url in users:
        game_results.append(
            (username, pool.apply_async(glean_games, (url + '/games',))))

    for username, result in game_results:
        yield username, result.get()

def main():
    import sys
    logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)
    game = 'Call of Duty: Modern Warfare 2'
    has = []
    has_not = []
    for username, games in spider(
        'http://steamcommunity.com/groups/BOYCOTTMW2/members'):
        if game in games:
            logging.info('%s has %s' % (username, game))
            has.append(username)
        else:
            logging.info('%s has not' % (username))
            has_not.append(username)
    print '%d users have %s and %d do not.' % (len(has), game, len(has_not))

if __name__ == '__main__':
    main()
Posted by Tyler Lesmann on November 12, 2009 at 18:46
Tagged as: lxml multiprocessing python screen_scraping

I saw this blog post yesterday and I was inspired. I forgot that Qt has a nice little browser object, QWebView. I have to say that Siva's example could not be less pythonic though. Siva's primary language is Objective-C and it shows in that code. I've rewritten the whole thing to be pythonic.
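
If you have never touched QWebView, this is all it takes to get a rendering browser widget on screen. This is a minimal sketch of my own, not part of Siva's code or the rewrite; the URL is just a placeholder.

#!/usr/bin/env python

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebView

# Show a single browser window and load a placeholder page.
app = QApplication(sys.argv)
view = QWebView()
view.load(QUrl('http://www.example.com/'))
view.show()
sys.exit(app.exec_())

Here is the full rewrite: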

#!/usr/bin/env python

import os
import sys
from PyQt4.QtCore import QUrl, SIGNAL
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage, QWebView
from urllib2 import urlopen

JQUERY_URL = 'http://jqueryjs.googlecode.com/files/jquery-1.3.2.min.js'
JQUERY_FILE = JQUERY_URL.split('/')[-1]
JQUERY_PATH = os.path.join(os.path.dirname(__file__), JQUERY_FILE)

def get_jquery(jquery_url=JQUERY_URL, jquery_path=JQUERY_PATH):
    """
    Returns jquery source.

    If the source is not available at jquery_path, then we will download it from
    jquery_url.
    """
    if not os.path.exists(jquery_path):
        jquery = urlopen(jquery_url).read()
        f = open(jquery_path, 'w')
        f.write(jquery)
        f.close()
    else:
        f = open(jquery_path)
        jquery = f.read()
        f.close()
    return jquery

class WebPage(QWebPage):
    """
    QWebPage that prints Javascript errors to stderr.
    """
    def javaScriptConsoleMessage(self, message, lineNumber, sourceID):
        sys.stderr.write('Javascript error at line number %d\n' % lineNumber)
        sys.stderr.write('%s\n' % message)
        sys.stderr.write('Source ID: %s\n' % sourceID)

class GoogleSearchBot(QApplication):
    def __init__(self, argv, show_window=True):
        super(GoogleSearchBot, self).__init__(argv)
        self.jquery = get_jquery()
        self.web_view = QWebView()
        self.web_page = WebPage()
        self.web_view.setPage(self.web_page)
        if show_window is True:
            self.web_view.show()
        self.connect(self.web_view, SIGNAL("loadFinished(bool)"),
            self.load_finished)
        self.set_load_function(None)

    def google_search(self, keyword_string):
        self.set_load_function(self.parse_google_search)
        current_frame = self.web_view.page().currentFrame()
        current_frame.evaluateJavaScript(
            r"""
            $("input[title=Google Search]").val("%s");
            $("input[value=Google Search]").parents("form").submit();
            """ % keyword_string
        )

    def load_finished(self, ok):
        current_frame = self.web_page.currentFrame()
        current_frame.evaluateJavaScript(self.jquery)
        self.load_function(*self.load_function_args,
            **self.load_function_kwargs)

    def parse_google_search(self):
        current_frame = self.web_page.currentFrame()
        results = current_frame.evaluateJavaScript(
            r"""
            var results = "";
            $("h3[class=r]").each(function(i) {
                results += $(this).text() + "\n";
            });
            results;
            """
        )
        print('Google search result\n====================')
        for i, result in enumerate(unicode(results.toString(),'utf-8').splitlines()):
            print('%d. %s' % (i + 1, result))
        self.exit()

    def search(self, keyword):
        self.set_load_function(self.google_search, keyword)
        self.web_page.currentFrame().load(QUrl('http://www.google.com/ncr'))

    def set_load_function(self, load_function, *args, **kwargs):
        self.load_function = load_function
        self.load_function_args = args
        self.load_function_kwargs = kwargs

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Usage: %s <keyword>" % sys.argv[0])
        raise SystemExit, 255

    googleSearchBot = GoogleSearchBot(sys.argv)
    googleSearchBot.search(sys.argv[1])
    sys.exit(googleSearchBot.exec_())

So what is the good and bad of using this method for web scraping?

Good

  • Javascript is not a problem anymore! Javascript is usually a pain in the world of web scraping, as one must read the Javascript and emulate it. This is especially awful with obfuscated Javascript. By using a real browser, Javascript becomes a tool instead of a hindrance. AJAX applications become worlds easier to automate.
  • User gets more visual feedback through the browser rendering the page.

Bad

  • Javascript is hard to debug. I was looking for the equivalent of the Firefox error console in QWebView or its attributes, which would fix this problem. FIXED! I extended QWebPage to print Javascript errors to stderr.
  • QWebView takes a bit more resources than mechanize. Of course, we get page rendering and a Javascript engine.
  • This is not as easily implemented for Windows and OS X as it is for Linux/BSD. This is not a big problem for me, as Fedora has PyQt4 and its prerequisites whenever you install KDE. You may not be so lucky.
Posted by Tyler Lesmann on October 1, 2009 at 6:22 and commented on 7 times
Tagged as: pyqt4 python screen_scraping

Our Solarwinds Network Performance Monitor occasionally has a problem rendering custom reports. For something like that, there isn't an existing Nagios plugin. Writing these plugins is easy; all there is to it is returning the right exit status. After reading this, you should have an idea of how to write a Nagios plugin for a variety of web applications.
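
To show just the exit-status contract, here is a stripped-down skeleton. This is only a sketch; check_something() is a made-up stand-in for whatever your plugin actually tests.

#!/usr/bin/env python

# Exit statuses recognized by Nagios
OK = 0
WARNING = 1
CRITICAL = 2

def check_something():
    # Hypothetical check; replace with a real test of your application.
    return True

if check_something():
    print 'OK - everything looks fine'
    raise SystemExit, OK
else:
    print 'CRITICAL - something is broken'
    raise SystemExit, CRITICAL

With that in mind, here is the plugin for the Solarwinds report: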

#!/usr/bin/env python

from mechanize import Browser
from optparse import OptionParser

# Exit statuses recognized by Nagios
UNKNOWN = -1
OK = 0
WARNING = 1
CRITICAL = 2

def open_url(br, url):
    """Use a given mechanize.Browser to open url.

    If an exception is raised, then exit with CRITICAL status for Nagios.
    """
    try:
        response = br.open(url)
    except Exception, e:
        # Catching all exceptions is usually a bad idea.  We want to catch
        # them all to report to Nagios here.
        print 'CRITICAL - Could not reach page at %s: %s' % (url, e)
        raise SystemExit, CRITICAL
    return response

# I'm going to be using optparse.OptionParser from now on.  It makes
# command-line args a breeze.
parser = OptionParser()
parser.add_option('-H', '--hostname', dest='hostname')
parser.add_option('-u', '--username', dest='username')
parser.add_option('-p', '--password', dest='password')
parser.add_option('-r', '--report_url', dest='url',
    help="""Path to report relative to root, like
    /NetPerfMon/Report.asp?Report=Hostname+__+IPs""")
parser.add_option('-v', '--verbose', dest='verbose', action='store_true',
    default=False)
parser.add_option('-q', '--quiet', dest='verbose', action='store_false')

options, args = parser.parse_args()

# Check for required options
for option in ('hostname', 'username', 'password', 'url'):
    if not getattr(options, option):
        print 'CRITICAL - %s not specified' % option.capitalize()
        raise SystemExit, CRITICAL

# Go to the report and get a login page
br = Browser()
report_url = 'https://%s%s' % (options.hostname, options.url)
open_url(br, report_url)
br.select_form('aspnetForm')

# Solarwinds has interesting field names
# Maybe something with asp.net
br['ctl00$ContentPlaceHolder1$Username'] = options.username
br['ctl00$ContentPlaceHolder1$Password'] = options.password

# Attempt to login.  If we can't, tell Nagios.
try:
    report = br.submit()
except Exception, e:
    print 'CRITICAL - Error logging in: %s' % e
    raise SystemExit, CRITICAL

report_html = report.read()
# class=Property occurs in every cell in a Solarwinds report.  If it's not
# there, something is wrong.
if 'class=Property' not in report_html:
    print 'CRITICAL - Report at %s is down' % report_url
    raise SystemExit, CRITICAL

# If we got this far, let's tell Nagios the report is okay.
print 'OK - Report at %s is up' % report_url
raise SystemExit, OK

To use our plugin, we need to do a bit of Nagios configuration. First, we need to define a command.

define command{
    command_name    check_npm_reports
    command_line    /usr/local/bin/reportmonitor.py -H $HOSTADDRESS$ $ARG1$
}

After that, we define a service.

define service{
    use         generic-service
    host_name           solarwinds-server
    service_description Solarwinds reports
    check_command       check_npm_reports!-u nagios -p some_password -r '/NetPerfMon/Report.asp?Report=Hostname+__+IPs'
}
Posted by Tyler Lesmann on September 3, 2009 at 13:37 and commented on 2 times
Tagged as: mechanize nagios optparse python screen_scraping

I found a blog post today that gleans names and messages from Twitter search. As an exercise, I decided to rewrite it using mechanize and lxml. My code writes to standard output instead of a file; the user can redirect the output for the same effect. Note: I am aware that Twitter has JSON and several APIs, and using those would be easier than this. This is an exercise; a rough sketch of the JSON route follows the script.

#!/usr/bin/env python
import getopt
import sys
from mechanize import Browser, _mechanize
from lxml.html import parse

baseurl = "http://search.twitter.com/search?lang=en&q="

def search_twitter(terms, pages=1):
    """
    terms = a list of search terms
    pages (optional) = number of pages to retrieve

    returns a list of dictionaries
    """
    br = Browser()
    br.set_handle_robots(False)
    results = []
    response = br.open("".join([baseurl, "+".join(terms)]))
    while(pages > 0):
        doc = parse(response).getroot()
        for msg in doc.cssselect('div.msg'):
            name = msg.cssselect('a')[0].text_content()
            text = msg.cssselect('span')[0].text_content()
            text = text.replace(' (expand)', '')
            results.append({
                'name': name,
                'text': text,
            })
        try:
            response = br.follow_link(text='Older')
        except _mechanize.LinkNotFoundError:
            break # No more pages :(
        pages -= 1
    return results

if __name__ == '__main__':
    optlist, args = getopt.getopt(sys.argv[1:], 'p:', ['pages='])
    optd = dict(optlist)
    pages = 1
    if '-p' in optd:
        pages = int(optd['-p'])
    if '--pages' in optd:
        pages = int(optd['--pages'])
    if len(args) < 1:
        print """
        Usage: %s [-p] [--pages] search terms
            -p, --pages = number of pages to retrieve
        """ % sys.argv[0]
        raise SystemExit, 1
    results = search_twitter(args, pages)
    for result in results:
        print "%(name)-20s%(text)s" % result
Posted by Tyler Lesmann on January 14, 2009 at 15:16
Tagged as: lxml mechanize python screen_scraping

In my post from a while back, I gave an example of using the standard HTMLParser. HTMLParser is not the easiest way to glean information from HTML. There are two modules outside the standard Python distribution that can shorten the development time. The first is BeautifulSoup. Here is the code from the previous episode using BeautifulSoup instead of HTMLParser.

#!/usr/bin/env python

from BeautifulSoup import BeautifulSoup
from mechanize import Browser

br = Browser()
response = br.open('http://tylerlesmann.com/')
soup = BeautifulSoup(response.read())
headers = soup.findAll('div', attrs={'class': 'header'})
headlines = []
for header in headers:
    links = header.findAll('a')
    for link in links:
        headlines.append(link.string)
for headline in headlines:
    print headline

This is a lot shorter: 16 lines instead of 38. It also took about 20 seconds to write. There is one gotcha here: both scripts do the same task, but the BeautifulSoup version takes over twice as long to run. CPU time is much cheaper than development time, though.
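
If you want to check the timing difference yourself, something like this would do it. This is a rough sketch that assumes the two versions are wrapped as functions in a hypothetical scrapers module.

#!/usr/bin/env python

import timeit

# Hypothetical module holding the two scripts above wrapped as functions.
for name in ('scrape_with_beautifulsoup', 'scrape_with_lxml'):
    seconds = timeit.timeit('%s()' % name,
        setup='from scrapers import %s' % name, number=5)
    print '%s: average %.2f seconds per run' % (name, seconds / 5)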

The next module is lxml. Here's the lxml version of the code.

#!/usr/bin/env python

from lxml.html import parse
from mechanize import Browser

br = Browser()
response = br.open('http://tylerlesmann.com/')
doc = parse(response).getroot()
for link in doc.cssselect('div.header a'):
    print link.text_content()

As you can see, it is even shorter than BeautifulSoup at 10 lines. On top of that, lxml is faster than HTMLParser. So what is the catch? The lxml module uses C code, so you will not be able to use it on Google's AppEngine or on Jython.

Posted by Tyler Lesmann on January 14, 2009 at 6:13
Tagged as: beautifulsoup lxml mechanize python screen_scraping