I saw this blog post yesterday and was inspired. I had forgotten that Qt has a nice little browser object, QWebView. I have to say that Siva's example could not be less pythonic, though. Siva's primary language is Objective-C, and it shows in that code. I've rewritten the whole thing to be pythonic.

#!/usr/bin/env python

import os
import sys
from PyQt4.QtCore import QUrl, SIGNAL
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage, QWebView
from urllib2 import urlopen

JQUERY_URL = 'http://jqueryjs.googlecode.com/files/jquery-1.3.2.min.js'
JQUERY_FILE = JQUERY_URL.split('/')[-1]
JQUERY_PATH = os.path.join(os.path.dirname(__file__), JQUERY_FILE)

def get_jquery(jquery_url=JQUERY_URL, jquery_path=JQUERY_PATH):
    """
    Returns jquery source.

    If the source is not available at jquery_path, then we will download it from
    jquery_url.
    """
    if not os.path.exists(jquery_path):
        jquery = urlopen(jquery_url).read()
        f = open(jquery_path, 'w')
        f.write(jquery)
        f.close()
    else:
        f = open(jquery_path)
        jquery = f.read()
        f.close()
    return jquery

class WebPage(QWebPage):
    """
    QWebPage that prints Javascript errors to stderr.
    """
    def javaScriptConsoleMessage(self, message, lineNumber, sourceID):
        sys.stderr.write('Javascript error at line number %d\n' % lineNumber)
        sys.stderr.write('%s\n' % message)
        sys.stderr.write('Source ID: %s\n' % sourceID)

class GoogleSearchBot(QApplication):
    def __init__(self, argv, show_window=True):
        super(GoogleSearchBot, self).__init__(argv)
        self.jquery = get_jquery()
        self.web_view = QWebView()
        self.web_page = WebPage()
        self.web_view.setPage(self.web_page)
        if show_window is True:
            self.web_view.show()
        self.connect(self.web_view, SIGNAL("loadFinished(bool)"),
            self.load_finished)
        self.set_load_function(None)

    def google_search(self, keyword_string):
        self.set_load_function(self.parse_google_search)
        current_frame = self.web_view.page().currentFrame()
        current_frame.evaluateJavaScript(
            r"""
            $("input[title=Google Search]").val("%s");
            $("input[value=Google Search]").parents("form").submit();
            """ % keyword_string
        )

    def load_finished(self, ok):
        current_frame = self.web_page.currentFrame()
        current_frame.evaluateJavaScript(self.jquery)
        self.load_function(*self.load_function_args,
            **self.load_function_kwargs)

    def parse_google_search(self):
        current_frame = self.web_page.currentFrame()
        results = current_frame.evaluateJavaScript(
            r"""
            var results = "";
            $("h3[class=r]").each(function(i) {
                results += $(this).text() + "\n";
            });
            results;
            """
        )
        print('Google search result\n====================')
        for i, result in enumerate(unicode(results.toString(),'utf-8').splitlines()):
            print('%d. %s' % (i + 1, result))
        self.exit()

    def search(self, keyword):
        self.set_load_function(self.google_search, keyword)
        self.web_page.currentFrame().load(QUrl('http://www.google.com/ncr'))

    def set_load_function(self, load_function, *args, **kwargs):
        self.load_function = load_function
        self.load_function_args = args
        self.load_function_kwargs = kwargs

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Usage: %s <keyword>" % sys.argv[0])
        raise SystemExit, 255

    googleSearchBot = GoogleSearchBot(sys.argv)
    googleSearchBot.search(sys.argv[1])
    sys.exit(googleSearchBot.exec_())
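
The script chains page loads by swapping the callback that runs each time loadFinished fires. Below is a minimal, Qt-free sketch of that pattern, written for Python 3. FakeLoader, Bot, the step names, and the example.invalid URLs are all hypothetical stand-ins, and the None guard in load_finished is an addition that the script above does not have.

```python
class FakeLoader(object):
    """Hypothetical stand-in for QWebView: 'loads' a URL, then calls back."""

    def __init__(self):
        self.on_load_finished = None

    def load(self, url):
        # A real QWebView would fetch and render here, then emit loadFinished.
        self.on_load_finished(True)


class Bot(object):
    def __init__(self):
        self.loader = FakeLoader()
        self.loader.on_load_finished = self.load_finished
        self.log = []
        self.set_load_function(None)

    def set_load_function(self, load_function, *args, **kwargs):
        # Remember which step should run after the next page load.
        self.load_function = load_function
        self.load_function_args = args
        self.load_function_kwargs = kwargs

    def load_finished(self, ok):
        # Do nothing when no step is registered for this page load.
        if self.load_function is not None:
            self.load_function(*self.load_function_args,
                               **self.load_function_kwargs)

    def search(self, keyword):
        self.set_load_function(self.submit_query, keyword)
        self.loader.load('http://example.invalid/')

    def submit_query(self, keyword):
        self.log.append('submit %s' % keyword)
        self.set_load_function(self.parse_results)
        # Submitting the form is what triggers the next page load.
        self.loader.load('http://example.invalid/results')

    def parse_results(self):
        self.log.append('parse')
        self.set_load_function(None)


bot = Bot()
bot.search('pyqt4')
```

Each step registers its successor before it triggers the next load, which is the same order of operations google_search uses in the real script.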

So what are the good and bad points of using this method for web scraping?

Good

  • Javascript is not a problem anymore! Javascript is usually a pain in the world of web scraping, as one must read the Javascript and emulate it. This is especially awful with obfuscated Javascript. By using a real browser, Javascript becomes a tool instead of a hindrance. AJAX applications become worlds easier to automate.
  • The user gets more visual feedback because the browser renders the page.

Bad

  • Javascript is hard to debug. I'm looking for the equivalent of the Firefox error console in QWebView or its attributes. That would fix this problem. FIXED! Extended QWebPage to add printing of Javascript errors to stderr.
  • QWebView uses more resources than mechanize. In exchange, we get page rendering and a Javascript engine.
  • This is not as easily implemented for Windows and OS X as it is for Linux/BSD. This is not a big problem for me, as Fedora has PyQt4 and its prerequisites whenever you install KDE. You may not be so lucky.
Posted by Tyler Lesmann on October 1, 2009 at 6:22 and commented on 7 times
Tagged as: pyqt4 python screen_scraping

Want to have fun? Try migrating an existing web application between different database technologies! With Django and SQLAlchemy, it actually isn't that hard! I used the following procedure to migrate both deathcat.org and this blog to Postgres. I'm assuming you know how to use Postgres and that you are doing this as a Postgres superuser. All of this assumes ident authentication for Postgres, but it should be easy to tweak for other configurations.

Make a directory in your Django application to store these scripts, like scripts/. Make sure this directory resides at the same level as manage.py. Now, get the code for my SQLAlchemy table copier and put it in a new file called puller.py. Comment out the line that reads table.metadata.create_all(dengine).

Now put this in a file called migrate2pg.sh:

#!/bin/bash

database=my_django_db
mysql_user=django_user
mysql_pass=django_passwd
mysql_connection_string="mysql://$mysql_user:$mysql_pass@localhost/$database?charset=utf8"
postgres_connection_string="postgres:///$database"

tables=$(echo 'show tables' | mysql -u $mysql_user -p"$mysql_pass" $database | xargs echo | cut -d ' ' -f 2-)

echo $tables

echo "Dropping old postgres database, if any"
dropdb $database

echo "Creating new database"
createdb $database

echo "Setting up Django schema"
../manage.py syncdb --noinput

echo "Removing initial data"
echo 'DELETE FROM auth_permission' | psql $database
echo 'DELETE FROM django_content_type' | psql $database

echo "Importing data from MySQL"
python puller.py \
    -f "$mysql_connection_string" \
    -t "$postgres_connection_string" \
    $tables

echo "Fixing sequences"
for table in $tables
do
    echo Fixing "${table}'s sequence"
    echo "select setval('${table}_id_seq', max(id)) from ${table};" | psql $database
done

Tweak the variables at the top as necessary for your case. Run bash migrate2pg.sh from inside the scripts/ directory and read the messages. One error you will see comes during the Fixing sequences phase, when the script attempts to fix the django_session_id_seq sequence. Ignore this error.
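
The django_session error is harmless because that table's primary key is session_key, so it has no id column and no sequence to fix. Here is a minimal Python 3 sketch of the same sequence-fixing step that skips such tables up front. The function name, the no_id_tables parameter, and the blog_post table are hypothetical, and this is not part of the migration script itself.

```python
def setval_statements(tables, no_id_tables=('django_session',)):
    """Return one setval() statement per table that has an integer id column."""
    statements = []
    for table in tables:
        # Tables without an id column have no <table>_id_seq to reset.
        if table in no_id_tables:
            continue
        statements.append(
            "SELECT setval('%s_id_seq', max(id)) FROM %s;" % (table, table))
    return statements


for statement in setval_statements(['auth_user', 'django_session', 'blog_post']):
    print(statement)
```

The printed statements could be piped to psql in the same way the shell loop does.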

The final part is to give permissions or ownership to the user who will be accessing the data. I'm assuming you can do this, but if you are using Postgres ident authentication and apache, then here's a helpful script for you.

#!/bin/bash

database='my_django_db'

echo "Granting apache rights to ${database}"
echo "GRANT ALL ON DATABASE ${database} TO apache;" | psql $database

tables=$(echo '\dt' | psql $database | awk -F '|' '/table/ {print $2}')
sequences=$(echo '\ds' | psql $database | awk -F '|' '/sequence/ {print $2}')

echo "Tables:" $tables
echo

echo "Sequences:" $sequences
echo

tablesql=$(for table in $tables; do echo "ALTER TABLE $table OWNER TO apache;"; done)
seqsql=$(for seq in $sequences; do echo "ALTER TABLE $seq OWNER TO apache;"; done)

echo "Table Alteration SQL:" $tablesql
echo
echo "Sequence Alteration SQL:" $seqsql
echo

echo $tablesql $seqsql | psql $database
Posted by Tyler Lesmann on September 4, 2009 at 16:47 and commented on 3 times
Tagged as: django mysql postgres python sql sqlalchemy

Our Solarwinds Network Performance Monitor occasionally has a problem rendering custom reports. Nagios has no existing plugin for something like that, but writing one is easy: a plugin only has to print a status line and exit with the right status. After reading this, you should have an idea of how to write a Nagios plugin for a variety of web applications.

#!/usr/bin/env python

from mechanize import Browser
from optparse import OptionParser

# Exit statuses recognized by Nagios
UNKNOWN = 3
OK = 0
WARNING = 1
CRITICAL = 2

def open_url(br, url):
    """Use a given mechanize.Browser to open url.

    If an exception is raised, then exit with CRITICAL status for Nagios.
    """
    try:
        response = br.open(url)
    except Exception, e:
        # Catching all exceptions is usually a bad idea.  We want to catch
        # them all to report to Nagios here.
        print 'CRITICAL - Could not reach page at %s: %s' % (url, e)
        raise SystemExit, CRITICAL
    return response

# I'm going to be using optparse.OptionParser from now on.  It makes
# command-line args a breeze.
parser = OptionParser()
parser.add_option('-H', '--hostname', dest='hostname')
parser.add_option('-u', '--username', dest='username')
parser.add_option('-p', '--password', dest='password')
parser.add_option('-r', '--report_url', dest='url',
    help="""Path to report relative to root, like
    /NetPerfMon/Report.asp?Report=Hostname+__+IPs""")
parser.add_option('-v', '--verbose', dest='verbose', action='store_true',
    default=False)
parser.add_option('-q', '--quiet', dest='verbose', action='store_false')

options, args = parser.parse_args()

# Check for required options
for option in ('hostname', 'username', 'password', 'url'):
    if not getattr(options, option):
        print 'CRITICAL - %s not specified' % option.capitalize()
        raise SystemExit, CRITICAL

# Go to the report and get a login page
br = Browser()
report_url = 'https://%s%s' % (options.hostname, options.url)
open_url(br, report_url)
br.select_form('aspnetForm')

# Solarwinds has interesting field names
# Maybe something with asp.net
br['ctl00$ContentPlaceHolder1$Username'] = options.username
br['ctl00$ContentPlaceHolder1$Password'] = options.password

# Attempt to login.  If we can't, tell Nagios.
try:
    report = br.submit()
except Exception, e:
    print 'CRITICAL - Error logging in: %s' % e
    raise SystemExit, CRITICAL

report_html = report.read()
# class=Property occurs in every cell in a Solarwinds report.  If it's not
# there, something is wrong.
if 'class=Property' not in report_html:
    print 'CRITICAL - Report at %s is down' % report_url
    raise SystemExit, CRITICAL

# If we got this far, let's tell Nagios the report is okay.
print 'OK - Report at %s is up' % report_url
raise SystemExit, OK

To use our plugin, we need to do a bit of Nagios configuration. First, we need to define a command.

define command{
    command_name    check_npm_reports
    command_line    /usr/local/bin/reportmonitor.py -H $HOSTADDRESS$ $ARG1$
}

After that, we define a service.

define service{
    use         generic-service
    host_name           solarwinds-server
    service_description Solarwinds reports
    check_command       check_npm_reports!-u nagios -p some_password -r '/NetPerfMon/Report.asp?Report=Hostname+__+IPs'
}
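
The contract between Nagios and a plugin boils down to one summary line on stdout plus an exit status. Here is a minimal Python 3 sketch of that contract, kept separate from any HTTP code. The classify and run_check helpers and the message wording are hypothetical; the class=Property test is borrowed from the plugin above.

```python
import sys

# Exit statuses defined by the Nagios plugin guidelines.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: 'OK', WARNING: 'WARNING', CRITICAL: 'CRITICAL', UNKNOWN: 'UNKNOWN'}


def classify(report_html):
    """Map a fetched report body to a (status, message) pair."""
    if report_html is None:
        return UNKNOWN, 'No response to inspect'
    if 'class=Property' not in report_html:
        return CRITICAL, 'Report is down'
    return OK, 'Report is up'


def run_check(report_html):
    """Print the one-line summary Nagios displays and return the exit status."""
    status, message = classify(report_html)
    print('%s - %s' % (LABELS[status], message))
    return status


# A real plugin would finish with something like:
#     sys.exit(run_check(fetched_html))
```

Keeping the decision in classify makes the status logic easy to exercise without a live Solarwinds server.
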
Posted by Tyler Lesmann on September 3, 2009 at 13:37 and commented on 2 times
Tagged as: mechanize nagios optparse python screen_scraping

Doing anything with SOAP is a pain without a WSDL, which is the case with Zimbra. All of the howtos I found about SOAP and Ruby either required a WSDL or involved making several classes in a special, undocumented way to trick a SOAP::RPC::Driver instance into working. Both were unacceptable. After much hardship, I found an easier-to-read way to do SOAP without a WSDL in Ruby: building SOAP::SOAPElements myself. Here is the code, documented to be easy to read, use, and extend.

# Incomplete library for interacting with Zimbra
#
#  require 'zimbra'
#
#  host = 'zimbra.tylerlesmann.com'
#  user = 'root'
#  passwd = 'hard_password'
#  creds = Zimbra.authenticate(host, user, passwd)
#  usercreds = Zimbra.masquerade(host, creds.authToken, 'tlesmann')
#  Zimbra.createappointment(host, usercreds.authToken,
#    Time.local(2009, 6, 26), 'Make a blog post', 'Maybe some Java', [
#    '/home/tlesmann/Documents/java.png',
#    '/home/tlesmann/Documents/tutorial.pdf',
#  ])

require 'net/http'
require 'net/https'
require 'soap/element'
require 'soap/rpc/driver'
require 'soap/processor'
require 'soap/streamHandler'
require 'soap/property'
require 'zimbra/multipart'

module Zimbra
  # Builds and sends AuthRequest to a provided Zimbra host.
  #
  # Returns a SOAP::Mapping instance, with an authToken attribute
  def self.authenticate(host, name, password)
    header = SOAP::SOAPHeader.new
    body = SOAP::SOAPBody.new(element('AuthRequest', nil,
      {
        'xmlns' => 'urn:zimbraAdmin',
      },
      [
        element('name', name),
        element('password', password),
      ]
    ))
    envelope = SOAP::SOAPEnvelope.new(header, body)
    return send_soap(envelope, host)
  end

  # Builds and sends CreateAppointmentRequest to a provided Zimbra host.  The
  # attachments argument expects a list of filename strings.
  #
  # Returns a SOAP::Mapping instance
  def self.createappointment(host, authToken, start, subject, description='',
    attachments=[])
    header = SOAP::SOAPHeader.new
    context = element('context', nil, {'xmlns' => 'urn:zimbra'}, [
      element('authToken', authToken)
    ])
    header.add('context', context)
    aids = []
    for attachment in attachments
      aids << upload_attachment(host, authToken, attachment)
    end
    if aids.empty?
      attach = nil
    else
      attach = element('attach', nil, {
      'aid' => aids.join(",")
      })
    end
    body = SOAP::SOAPBody.new(element('CreateAppointmentRequest', nil,
      {
        'xmlns' => 'urn:zimbraMail'
      },
      [
        element('m', nil, {}, [
          element('inv', nil, {}, [
            element('comp', nil,
              {
                'status' => 'CONF',
                'allDay' => 1,
                'fb' => 'F',
                'name' => subject,
                'noBlob' => 1,
              },
              [
                datetime('s', start),
                datetime('e', start),
                element('descHtml', description),
                element('alarm', nil,
                  {
                    'action' => 'DISPLAY'
                  },
                  [
                    element('trigger', nil, {}, [
                      element('rel', nil, {
                        'm' => 1
                      })
                  ]),
                  element('desc', subject),
                  ]
                ),
              ]
            ),
          ]),
          attach
        ]),
      ]
    ))
    envelope = SOAP::SOAPEnvelope.new(header, body)
    send_soap(envelope, host)
  end

  # builds SOAP::SOAPElement with tag name with a *d* attribute of the
  # provided ruby Time
  def self.datetime(name, time)
    return element(name, nil, {'d' => time.strftime("%Y%m%d")})
  end

  # builds SOAP::SOAPElements the way SOAP::SOAPElement constructor _should_
  #
  #  element('AuthRequest', nil,
  #    {
  #      'xmlns' => 'urn:zimbraAdmin',
  #    },
  #    [
  #    element('name', 'whoa'),
  #    element('password', 'man'),
  #    ]
  #  )
  #
  # The returned SOAP::SOAPElement converted to XML would be:
  #
  #  <AuthRequest xmlns="urn:zimbraAdmin">
  #    <name>whoa</name>
  #    <password>man</password>
  #  </AuthRequest>
  def self.element(name, value=nil, attrs={}, children=[])
    element = SOAP::SOAPElement.new(name, value)
    element.extraattr.update(attrs)
    for child in children
      if child
        element.add(child)
      end
    end
    return element
  end

  # Builds and sends DelegateAuth Request to a provided Zimbra host.  The
  # authToken must be that of an admin!  The account arg is nothing fancy, just
  # the username of the user to spoof.
  #
  # Returns a SOAP::Mapping instance, with an authToken attribute
  def self.masquerade(host, authToken, account)
    header = SOAP::SOAPHeader.new
    context = element('context', nil, {'xmlns' => 'urn:zimbra'}, [
      element('authToken', authToken)
    ])
    header.add('context', context)
    body = SOAP::SOAPBody.new(element('DelegateAuthRequest', nil,
      {
          'xmlns' => 'urn:zimbraAdmin'
      },
      [
        element('account', account, {
          'by' => 'name',
        })
      ]
    ))
    envelope = SOAP::SOAPEnvelope.new(header, body)
    return send_soap(envelope, host)
  end

  # Marshals SOAP::Envelopes and sends them to a given Zimbra host
  #
  # Returns response as a SOAP::Mapping instance
  def self.send_soap(envelope, host)
    url = 'https://' + host + ':7071/service/admin/soap/'
    stream = SOAP::HTTPStreamHandler.new(SOAP::Property.new)
    request_string = SOAP::Processor.marshal(envelope)
    puts request_string if $DEBUG
    request = SOAP::StreamHandler::ConnectionData.new(request_string)
    response_string = stream.send(url, request).receive_string
    puts response_string if $DEBUG
    env = SOAP::Processor.unmarshal(response_string)
    return SOAP::Mapping.soap2obj(env.body.root_node)
  end

  # Uploads file to given Zimbra host
  #
  # Returns a string containing the Zimbra attachment id.  These attachments are
  # only accessible to the user that uploaded them.
  def self.upload_attachment(host, authToken, filename)
    params = Hash.new
    file = File.open(filename, "rb")
    params["attachment"] = file
    mp = Multipart::MultipartPost.new
    query, headers = mp.prepare_query(params)
    file.close
    headers['Cookie'] = 'ZM_AUTH_TOKEN=' + authToken
    url = URI.parse('https://' + host + '/service/upload')
    client = Net::HTTP.new(url.host, url.port)
    client.use_ssl = true
    response = client.post(url.path + '?fmt=raw', query, headers)
    return response.body.split(',')[2].strip.slice(1..-2)
  end
end

Note: I would have done this in python, if it were not needed for an existing rails application. ;)
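
For comparison, here is a rough Python 3 sketch of the same element-building idea using the standard library's xml.etree.ElementTree. It only covers constructing the request element, not the SOAP envelope or the HTTPS transport, and the element helper is my own hypothetical port, not part of any Zimbra or SOAP library.

```python
import xml.etree.ElementTree as ET


def element(name, value=None, attrs=None, children=()):
    """Build an Element with text, attributes, and children in one call."""
    node = ET.Element(name, dict(attrs or {}))
    node.text = value
    for child in children:
        # Mirror the Ruby helper: silently skip missing (None) children.
        if child is not None:
            node.append(child)
    return node


auth_request = element('AuthRequest', None, {'xmlns': 'urn:zimbraAdmin'}, [
    element('name', 'whoa'),
    element('password', 'man'),
])
xml = ET.tostring(auth_request, encoding='unicode')
```

Setting xmlns as a plain attribute keeps the sketch close to the Ruby version, which also passes it through the attribute hash.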

Posted by Tyler Lesmann on June 24, 2009 at 16:14 and commented on 9 times
Tagged as: mechanize ruby soap zimbra

With the announcement of the Android Scripting Engine, I had to check it out, what with the ability to code Python on the G1. It has one large inconvenience at the moment: there is no built-in way to use scripts on the SD card. I have a workaround for the time being. With the following script, which runs with ASE, you will be able to import scripts from the ase folder of your SD card into ASE's normal script directory.

import android
import os
import shutil

src = '/sdcard/ase'
dst = '/data/data/com.google.ase/scripts'

droid = android.Android()
for file in os.listdir(src):
    shutil.copy(os.path.join(src, file), dst)
    # Interesting permissions on Android
    os.chmod(os.path.join(dst, file), 0666)

droid.makeToast('Import Complete')

Note that you will have to type this script into your Android device through ASE. It is the last script you will have to enter that way. You will have to close and reopen ASE before the imported scripts appear.

Posted by Tyler Lesmann on June 10, 2009 at 11:56
Tagged as: android fix python