I saw this blog post yesterday and I was inspired. I had forgotten that Qt has a nice little browser object, QWebView. I have to say, though, that Siva's example could not be less pythonic. Siva's primary language is Objective-C, and it shows in that code. I've rewritten the whole thing to be pythonic.

#!/usr/bin/env python

import os
import sys
from PyQt4.QtCore import QUrl, SIGNAL
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage, QWebView
from urllib2 import urlopen

JQUERY_URL = 'http://jqueryjs.googlecode.com/files/jquery-1.3.2.min.js'
JQUERY_FILE = JQUERY_URL.split('/')[-1]
JQUERY_PATH = os.path.join(os.path.dirname(__file__), JQUERY_FILE)

def get_jquery(jquery_url=JQUERY_URL, jquery_path=JQUERY_PATH):
    """
    Returns jquery source.

    If the source is not available at jquery_path, then we will download it from
    jquery_url.
    """
    if not os.path.exists(jquery_path):
        # First run: download jQuery and cache it next to this script.
        jquery = urlopen(jquery_url).read()
        with open(jquery_path, 'w') as f:
            f.write(jquery)
    else:
        # Later runs: reuse the cached copy.
        with open(jquery_path) as f:
            jquery = f.read()
    return jquery

class WebPage(QWebPage):
    """
    QWebPage that prints Javascript errors to stderr.
    """
    def javaScriptConsoleMessage(self, message, lineNumber, sourceID):
        sys.stderr.write('Javascript error at line number %d\n' % lineNumber)
        sys.stderr.write('%s\n' % message)
        sys.stderr.write('Source ID: %s\n' % sourceID)

class GoogleSearchBot(QApplication):
    def __init__(self, argv, show_window=True):
        super(GoogleSearchBot, self).__init__(argv)
        self.jquery = get_jquery()
        self.web_view = QWebView()
        self.web_page = WebPage()
        self.web_view.setPage(self.web_page)
        if show_window:
            self.web_view.show()
        self.connect(self.web_view, SIGNAL("loadFinished(bool)"),
            self.load_finished)
        self.set_load_function(None)

    def google_search(self, keyword_string):
        self.set_load_function(self.parse_google_search)
        current_frame = self.web_view.page().currentFrame()
        current_frame.evaluateJavaScript(
            r"""
            $("input[title=Google Search]").val("%s");
            $("input[value=Google Search]").parents("form").submit();
            """ % keyword_string
        )

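    def load_finished(self, ok):
        # No callback is pending (the initial state), so there is nothing to do.
        if self.load_function is None:
            return
        # Inject jQuery into the freshly loaded page before running the callback.
        current_frame = self.web_page.currentFrame()
        current_frame.evaluateJavaScript(self.jquery)
        self.load_function(*self.load_function_args,
            **self.load_function_kwargs)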

    def parse_google_search(self):
        current_frame = self.web_page.currentFrame()
        results = current_frame.evaluateJavaScript(
            r"""
            var results = "";
            $("h3[class=r]").each(function(i) {
                results += $(this).text() + "\n";
            });
            results;
            """
        )
        print('Google search result\n====================')
        for i, result in enumerate(unicode(results.toString(),'utf-8').splitlines()):
            print('%d. %s' % (i + 1, result))
        self.exit()

    def search(self, keyword):
        self.set_load_function(self.google_search, keyword)
        self.web_page.currentFrame().load(QUrl('http://www.google.com/ncr'))

    def set_load_function(self, load_function, *args, **kwargs):
        self.load_function = load_function
        self.load_function_args = args
        self.load_function_kwargs = kwargs

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Usage: %s <keyword>" % sys.argv[0])
        raise SystemExit(255)

    googleSearchBot = GoogleSearchBot(sys.argv)
    googleSearchBot.search(sys.argv[1])
    sys.exit(googleSearchBot.exec_())
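
One thing to watch in google_search above: the keyword is pasted into the JavaScript with %s, so a keyword containing a double quote or a backslash will break the script. Below is a minimal sketch of one way around that, relying on the fact that json.dumps produces a quoted, escaped string that is also a valid JavaScript string literal. build_search_js is a hypothetical helper name, not part of the script above.

```python
import json


def build_search_js(keyword):
    """Return the search snippet with the keyword as a quoted JS literal.

    build_search_js is a hypothetical helper, not part of the original script.
    """
    # json.dumps adds the surrounding double quotes and escapes any quotes,
    # backslashes and non-ASCII characters inside the keyword.
    keyword_literal = json.dumps(keyword)
    return (
        '$("input[title=Google Search]").val(%s);\n'
        '$("input[value=Google Search]").parents("form").submit();\n'
        % keyword_literal
    )


print(build_search_js('say "hi"'))
```

The returned string could be passed to evaluateJavaScript in place of the inline template, which keeps the quoting logic in one place.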

So what are the good and bad points of using this method for web scraping?

Good

  • JavaScript is not a problem anymore! JavaScript is usually a pain in the world of web scraping, as one must read the JavaScript and emulate it. This is especially awful with obfuscated JavaScript. By using a real browser, JavaScript becomes a tool instead of a hindrance. AJAX applications become worlds easier to automate.
  • The user gets more visual feedback, because the browser renders the page.

Bad

  • JavaScript is hard to debug. I was looking for the equivalent of the Firefox error console in QWebView or its attributes. FIXED! I extended QWebPage to print JavaScript errors to stderr.
  • QWebView uses more resources than mechanize. In exchange, of course, we get page rendering and a JavaScript engine.
  • This is not as easily implemented for Windows and OS X as it is for Linux/BSD. This is not a big problem for me, as Fedora has PyQt4 and its prerequisites whenever you install KDE. You may not be so lucky.
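
The stderr reporting mentioned in the first point can be tried out without Qt installed. Here is a small sketch, assuming Python 3. report_js_message and its stream parameter are hypothetical names, and the error text is made up for illustration. It writes the same three lines as the WebPage class above.

```python
import io
import sys


def report_js_message(message, line_number, source_id, stream=None):
    """Write one console message in the same three-line layout as WebPage."""
    # Default to stderr, but allow another stream so the output can be captured.
    stream = sys.stderr if stream is None else stream
    stream.write('Javascript error at line number %d\n' % line_number)
    stream.write('%s\n' % message)
    stream.write('Source ID: %s\n' % source_id)


# Capture the report in memory instead of printing it to stderr.
buffer = io.StringIO()
report_js_message('ReferenceError: $ is not defined', 3, 'about:blank', buffer)
print(buffer.getvalue())
```

Keeping the formatting in a plain function like this would let javaScriptConsoleMessage simply forward its three arguments.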
Posted by Tyler Lesmann on October 1, 2009 at 6:22
Tagged as: pyqt4 python screen_scraping
Comments
#1 Wayne wrote this 5 years ago

This is really cool. Thanks for sharing.
I noticed if you run with the number 5 as an argument it pukes up a unicode exception. One possible fix is:

for i, result in enumerate(unicode(results.toString(),'utf-8').splitlines()):

#2 Tyler Lesmann wrote this 5 years ago

Thanks for the fix. I've updated the example.

#3 Mikael wrote this 4 years, 4 months ago

Thanks, excellent example. Had a very hard time finding the documentation/examples to do this.

#4 Evandro Myller wrote this 4 years, 2 months ago

Hi

This is a great piece of code.
I wonder if I could do something similar to Mechanize, but with QtWebKit as the back-end. I still could not figure out how to handle the Qt main event loop in order to make a 'Browser' instance reusable outside the scope of the program (it may be importable, instantiated then used).

Do you have any idea about it? Please ping me by e-mail, if possible; I've been looking for this for a while. :)

Thanks.

#5 chx wrote this 4 years, 2 months ago

Have you considered using QWebElement instead of jQuery?

#6 Evandro Myller wrote this 4 years, 2 months ago

Hello. This article was very useful for me, but I had to create something that fit my needs. I was able to figure out how to do what I said above. Here[1]'s the source, as a contribution.

[1]: http://github.com/emyller/webkitcrawler

#7 fminer wrote this 3 years, 4 months ago

I've made a web scraping tool using PySide named fminer (http://www.fminer.com). This article gave me great help, thanks.
