Archive

If you are using Python 2.6 or higher, you should get to know the multiprocessing module as soon as possible. It works around the GIL by using separate processes, giving Python true parallelism. Here is an example showing how to spider sites with several worker processes. Use of the logging module is essential for debugging these multiprocess programs.
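
Before the full spider, here is a minimal sketch of the pattern it relies on: a Pool of worker processes, apply_async to queue jobs, get to collect the results, and logging so you can see what each worker is doing. The fetch_length function and the URLs are placeholders for illustration, not part of the spider itself.

#!/usr/bin/env python
import logging
import sys
import urllib2
from multiprocessing import Pool

def fetch_length(url):
    # Runs in a worker process; the argument and the return value
    # must both be pickleable.
    logging.debug('Getting %s' % url)
    return url, len(urllib2.urlopen(url).read())

if __name__ == '__main__':
    logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)
    pool = Pool(4)  # four worker processes
    urls = ['http://example.com/', 'http://example.org/']
    results = [pool.apply_async(fetch_length, (url,)) for url in urls]
    for result in results:
        url, length = result.get()  # blocks until that job finishes
        print '%s is %d bytes' % (url, length)

The spider below uses the same Pool/apply_async/get cycle, just with lxml parsing inside the workers.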

#!/usr/bin/env python

"""
Spider steam boycott group and tally who followed through and who didn't.
"""

import logging
import urllib2
from cStringIO import StringIO
from multiprocessing import Pool
from lxml.html import parse

def glean_games(url):
    logging.debug('Getting %s' % url)
    doc = parse(urllib2.urlopen(url)).getroot()
    game_elements = doc.cssselect('#mainContents h4')
    return [e.text_content() for e in game_elements]

def glean_users(url=None, html=None):
    if html is None:
        logging.debug('Getting %s' % url)
        doc = parse(urllib2.urlopen(url)).getroot()
    else:
        doc = parse(StringIO(html)).getroot()
    user_links = doc.cssselect(
        'a.linkFriend_offline, a.linkFriend_online, a.linkFriend_in-game')
    return [(link.text_content(), link.attrib['href']) for link in user_links]

def spider(url, pool_size=20):
    logging.debug('Getting %s' % url)
    response = urllib2.urlopen(url)
    html = response.read() # Necessary for multiprocessing; needs to be pickleable
    group_page = parse(StringIO(html)).getroot()
    page_links = group_page.cssselect('.pageLinks a')
    page_count = page_links[-2].attrib['href'].split('=')[-1]
    urls = ['%s?p=%d' % (url, page) for page in xrange(2, int(page_count) + 1)]

    pool = Pool(pool_size)
    results = []
    results.append(pool.apply_async(glean_users, (), {'html': html}))
    results.extend([pool.apply_async(glean_users, (url,)) for url in urls])

    users = []
    for result in results:
        users.extend(result.get())

    logging.info('Found %d users!' % len(users))

    game_results = []
    for username, user_url in users:
        game_results.append(
            (username, pool.apply_async(glean_games, (user_url + '/games',))))

    for username, result in game_results:
        yield username, result.get()

def main():
    import sys
    logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)
    game = 'Call of Duty: Modern Warfare 2'
    has = []
    has_not = []
    for username, games in spider(
        'http://steamcommunity.com/groups/BOYCOTTMW2/members'):
        if game in games:
            logging.info('%s has %s' % (username, game))
            has.append(username)
        else:
            logging.info('%s has not' % (username))
            has_not.append(username)
    print '%d users have %s and %d do not.' % (len(has), game, len(has_not))

if __name__ == '__main__':
    main()
Posted by Tyler Lesmann on November 12, 2009 at 18:46
Tagged as: lxml multiprocessing python screen_scraping

If you do not know about ZoomInfo yet, it is a business intelligence search service. With it, you can find information about companies and the people working in them. The great thing about ZoomInfo is that they offer an API for accessing their service. I have written up a little convenience module for this API. The only requirement is lxml.

import urllib2
from lxml import objectify
from urllib import urlencode

apiurl = 'http://api.zoominfo.com/PartnerAPI/XmlOutput.aspx?'

class ZoomInfoException(Exception):
    pass

def zoom_query(qtype, key, **kwargs):
    args = {
        'query_type': qtype,
        'pc': key,
    }
    args.update(kwargs)
    resp = urllib2.urlopen(''.join([apiurl, urlencode(args)]))
    resptree = objectify.parse(resp).getroot()
    if hasattr(resptree, 'ErrorMessage'):
        raise ZoomInfoException, resptree.ErrorMessage
    return resptree

def company_competitors(key, **kwargs):
    return zoom_query('company_competitors', key, **kwargs)

def company_detail(key, **kwargs):
    return zoom_query('company_detail', key, **kwargs)

def company_search_query(key, **kwargs):
    return zoom_query('company_search_query', key, **kwargs)

def people_search_query(key, **kwargs):
    return zoom_query('people_search_query', key, **kwargs)

This module is only 32 lines, but it covers the full functionality of the current ZoomInfo API. It is lacking in documentation, but it follows the API documentation to the letter. Here are some examples of its use.

#!/usr/bin/env python
from lxml import etree
from time import sleep
import zoominfo

key = 'Your_API_key_here'

xml = zoominfo.company_search_query(key, companyName="red hat")
for rec in xml.CompanySearchResults.CompanyRecord:
    print rec.CompanyID, rec.CompanyName, rec.Website
sleep(2)

xml = zoominfo.company_detail(key, CompanyDomain='www.redhat.com')
for person in xml.KeyPerson:
    print person.JobTitle, '-', person.FirstName, person.LastName
redhatid = xml.CompanyID
sleep(2)

try:
    xml = zoominfo.company_competitors(key, CompanyID=redhatid)
except zoominfo.ZoomInfoException, e:
    print 'Caught Error from ZoomInfo:', e
    print 'You need to upgrade your ZoomInfo to use this function'
sleep(2)

xml = zoominfo.people_search_query(key, firstName='Paul', lastName='Cormier')
rec = xml.PeopleSearchResults.PersonRecord[0]
print rec.CurrentEmployment.JobTitle, '@', rec.CurrentEmployment.Company.CompanyName

If you use this module, you are still subject to ZoomInfo's use restrictions and branding requirements. This is why my example includes a few time.sleep calls between queries.
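
If scattering sleep calls around client code feels fragile, the delay can be wrapped around zoom_query itself. Here is a minimal sketch of that idea; the two-second interval is just the spacing used in the example above, not a limit published by ZoomInfo.

import time
import zoominfo

class Throttled(object):
    """Wrap a callable so consecutive calls are at least `interval`
    seconds apart."""

    def __init__(self, func, interval=2.0):
        self.func = func
        self.interval = interval
        self.last_call = 0.0

    def __call__(self, *args, **kwargs):
        wait = self.interval - (time.time() - self.last_call)
        if wait > 0:
            time.sleep(wait)
        self.last_call = time.time()
        return self.func(*args, **kwargs)

# Queries made through this wrapper are automatically spaced out.
throttled_query = Throttled(zoominfo.zoom_query)

The convenience functions could then call throttled_query instead, and the sleeps in the example above would no longer be needed.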

If you have any suggestions, please leave a comment.

Posted by Tyler Lesmann on February 13, 2009 at 9:05
Tagged as: lxml python

I have been using lxml to generate XML to interface with Authorize.net's CIM API, and I noticed something. Element does not behave quite like the list it is supposed to resemble. The key difference shows up with references to the same Element.

#!/usr/bin/env python
from lxml import etree

root = etree.Element('root')
child = etree.Element('child')
root.append(child)
root.append(child)
print etree.tostring(root, pretty_print=True)

If you run this, you will get the following output:

<root>
  <child/>
</root>

One child when we are expecting two. This is not actually a bug: in lxml, an Element can have only one parent, so the second append moves the existing child rather than adding another copy. The workaround is to append a deepcopy.

#!/usr/bin/env python
from copy import deepcopy
from lxml import etree

root = etree.Element('root')
child = etree.Element('child')
root.append(child)
root.append(deepcopy(child))
print etree.tostring(root, pretty_print=True)

This results in the expected output:

<root>
  <child/>
  <child/>
</root>
Posted by Tyler Lesmann on February 10, 2009 at 12:55 and commented on 1 time
Tagged as: gotcha lxml python

I found a blog post today that gleans the names and messages from Twitter search. As an exercise, I decided to rewrite it using mechanize and lxml. My code writes to standard output instead of a file; the user can redirect the output for the same effect. Note: I am aware that Twitter has JSON output and several APIs, and using those would be easier than this. This is an exercise; a rough sketch of the JSON approach follows the script below.

#!/usr/bin/env python
import getopt
import sys
from mechanize import Browser, _mechanize
from lxml.html import parse

baseurl = "http://search.twitter.com/search?lang=en&q="

def search_twitter(terms, pages=1):
    """
    terms = a list of search terms
    pages (optional) = number of pages to retrieve

    returns a list of dictionaries
    """
    br = Browser()
    br.set_handle_robots(False)
    results = []
    response = br.open("".join([baseurl, "+".join(terms)]))
    while pages > 0:
        doc = parse(response).getroot()
        for msg in doc.cssselect('div.msg'):
            name = msg.cssselect('a')[0].text_content()
            text = msg.cssselect('span')[0].text_content()
            text = text.replace(' (expand)', '')
            results.append({
                'name': name,
                'text': text,
            })
        try:
            response = br.follow_link(text='Older')
        except _mechanize.LinkNotFoundError:
            break # No more pages :(
        pages -= 1
    return results

if __name__ == '__main__':
    optlist, args = getopt.getopt(sys.argv[1:], 'p:', ['pages='])
    optd = dict(optlist)
    pages = 1
    if '-p' in optd:
        pages = int(optd['-p'])
    if '--pages' in optd:
        pages = int(optd['--pages'])
    if len(args) < 1:
        print """
        Usage: %s [-p] [--pages] search terms
            -p, --pages = number of pages to retrieve
        """ % sys.argv[0]
        raise SystemExit, 1
    results = search_twitter(args, pages)
    for result in results:
        print "%(name)-20s%(text)s" % result
Posted by Tyler Lesmann on January 14, 2009 at 15:16
Tagged as: lxml mechanize python screen_scraping

In my post from a while back, I gave an example of using the standard HTMLParser module. HTMLParser is not the easiest way to glean information from HTML. There are two modules outside the standard Python distribution that can shorten the development time. The first is BeautifulSoup. Here is the code from the previous episode using BeautifulSoup instead of HTMLParser.

#!/usr/bin/env python

from BeautifulSoup import BeautifulSoup
from mechanize import Browser

br = Browser()
response = br.open('http://tylerlesmann.com/')
soup = BeautifulSoup(response.read())
headers = soup.findAll('div', attrs={'class': 'header'})
headlines = []
for header in headers:
    links = header.findAll('a')
    for link in links:
        headlines.append(link.string)
for headline in headlines:
    print headline

This is a lot shorter: 16 lines instead of 38. It also took about 20 seconds to write. There is one gotcha here. Both scripts do the same task, but the BeautifulSoup version takes over twice as long to run. CPU time is much cheaper than development time, though.

The next module is lxml. Here's the lxml version of the code.

#!/usr/bin/env python

from lxml.html import parse
from mechanize import Browser

br = Browser()
response = br.open('http://tylerlesmann.com/')
doc = parse(response).getroot()
for link in doc.cssselect('div.header a'):
    print link.text_content()

As you can see, it is even shorter than the BeautifulSoup version, at 10 lines. On top of that, lxml is faster than HTMLParser. So what is the catch? The lxml module uses C code, so you will not be able to use it on Google App Engine or on Jython.
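
If you want to check the speed difference yourself, here is a rough sketch of a timing harness using the standard timeit module. It fetches the page once up front so only the parsing is measured; the exact numbers will vary with your machine and whatever the page contains that day.

#!/usr/bin/env python
# Rough timing harness for the parsing comparison above.
import timeit
import urllib2
from cStringIO import StringIO

html = urllib2.urlopen('http://tylerlesmann.com/').read()

def with_beautifulsoup():
    from BeautifulSoup import BeautifulSoup
    soup = BeautifulSoup(html)
    return [link.string
            for header in soup.findAll('div', attrs={'class': 'header'})
            for link in header.findAll('a')]

def with_lxml():
    from lxml.html import parse
    doc = parse(StringIO(html)).getroot()
    return [link.text_content() for link in doc.cssselect('div.header a')]

for func in (with_beautifulsoup, with_lxml):
    print '%s: %.2f seconds for 100 runs' % (
        func.__name__, timeit.timeit(func, number=100))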

Posted by Tyler Lesmann on January 14, 2009 at 6:13
Tagged as: beautifulsoup lxml mechanize python screen_scraping