Archive for October 2009
I have moved the code repository to Google Code. In addition, the latest version respects YouTube's URL GET parameters, such as hd=1.
Grab a clone like so:
hg clone https://python-markdown-video.googlecode.com/hg/ python-markdown-video
The repository contains only the version compatible with Python Markdown 2.0. The version for earlier releases of Python Markdown is now deprecated and will not be maintained.
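To illustrate what "respecting GET values" could look like, here is a minimal sketch that separates the video id from any extra parameters such as hd=1. It is not the extension's actual code: the helper name `split_youtube_url` is hypothetical, and the sketch assumes Python 3's `urllib.parse` rather than whatever the extension uses internally.

```python
from urllib.parse import urlparse, parse_qsl


def split_youtube_url(url):
    """Return (video_id, extra_params) for a YouTube watch URL.

    Hypothetical helper for illustration only; it is not part of
    python-markdown-video.
    """
    video_id = None
    extra_params = []
    # parse_qsl keeps the parameters in the order they appear in the URL.
    for key, value in parse_qsl(urlparse(url).query):
        if key == 'v' and video_id is None:
            video_id = value
        else:
            extra_params.append((key, value))
    return video_id, extra_params


video_id, extra_params = split_youtube_url(
    'http://www.youtube.com/watch?v=abc123&hd=1')
```

The extra parameters could then be appended to whatever embed URL is generated, so that options like hd=1 survive the conversion.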
Posted on October 20, 2009 at 9:33
I saw this blog post yesterday and was inspired. I had forgotten that Qt has a nice little browser widget, QWebView. I have to say, though, that Siva's example could hardly be less Pythonic. Siva's primary language is Objective-C, and it shows in that code. I've rewritten the whole thing to be Pythonic.
    #!/usr/bin/env python
    # Written for Python 2 and PyQt4.

    import os
    import sys
    from PyQt4.QtCore import QUrl, SIGNAL
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage, QWebView
    from urllib2 import urlopen

    JQUERY_URL = 'http://jqueryjs.googlecode.com/files/jquery-1.3.2.min.js'
    JQUERY_FILE = JQUERY_URL.split('/')[-1]
    JQUERY_PATH = os.path.join(os.path.dirname(__file__), JQUERY_FILE)

    def get_jquery(jquery_url=JQUERY_URL, jquery_path=JQUERY_PATH):
        """
        Returns jquery source.

        If the source is not available at jquery_path, then we will download
        it from jquery_url.
        """
        if not os.path.exists(jquery_path):
            jquery = urlopen(jquery_url).read()
            f = open(jquery_path, 'w')
            f.write(jquery)
            f.close()
        else:
            f = open(jquery_path)
            jquery = f.read()
            f.close()
        return jquery

    class WebPage(QWebPage):
        """
        QWebPage that prints Javascript errors to stderr.
        """
        def javaScriptConsoleMessage(self, message, lineNumber, sourceID):
            sys.stderr.write('Javascript error at line number %d\n' % lineNumber)
            sys.stderr.write('%s\n' % message)
            sys.stderr.write('Source ID: %s\n' % sourceID)

    class GoogleSearchBot(QApplication):
        def __init__(self, argv, show_window=True):
            super(GoogleSearchBot, self).__init__(argv)
            self.jquery = get_jquery()
            self.web_view = QWebView()
            self.web_page = WebPage()
            self.web_view.setPage(self.web_page)
            if show_window:
                self.web_view.show()
            self.connect(self.web_view, SIGNAL("loadFinished(bool)"),
                         self.load_finished)
            self.set_load_function(None)

        def google_search(self, keyword_string):
            # The form submit triggers another page load; parse that one.
            self.set_load_function(self.parse_google_search)
            current_frame = self.web_view.page().currentFrame()
            current_frame.evaluateJavaScript(
                r"""
                $("input[title=Google Search]").val("%s");
                $("input[value=Google Search]").parents("form").submit();
                """ % keyword_string
            )

        def load_finished(self, ok):
            # Inject jQuery into every freshly loaded page, then run whatever
            # step was registered with set_load_function().
            current_frame = self.web_page.currentFrame()
            current_frame.evaluateJavaScript(self.jquery)
            if self.load_function is not None:
                self.load_function(*self.load_function_args,
                                   **self.load_function_kwargs)

        def parse_google_search(self):
            current_frame = self.web_page.currentFrame()
            results = current_frame.evaluateJavaScript(
                r"""
                var results = "";
                $("h3[class=r]").each(function(i) {
                    results += $(this).text() + "\n";
                });
                results;
                """
            )
            print('Google search result\n====================')
            for i, result in enumerate(
                    unicode(results.toString(), 'utf-8').splitlines()):
                print('%d. %s' % (i + 1, result))
            self.exit()

        def search(self, keyword):
            self.set_load_function(self.google_search, keyword)
            self.web_page.currentFrame().load(QUrl('http://www.google.com/ncr'))

        def set_load_function(self, load_function, *args, **kwargs):
            # Remember what to call the next time a page finishes loading.
            self.load_function = load_function
            self.load_function_args = args
            self.load_function_kwargs = kwargs

    if __name__ == '__main__':
        if len(sys.argv) != 2:
            print("Usage: %s <keyword>" % sys.argv[0])
            raise SystemExit(255)

        googleSearchBot = GoogleSearchBot(sys.argv)
        googleSearchBot.search(sys.argv[1])
        sys.exit(googleSearchBot.exec_())
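The bot works by swapping out a callback each time a page finishes loading: search() registers google_search, which in turn registers parse_google_search. The sketch below isolates that pattern without Qt so it can be run anywhere. The class name `LoadDispatcher` and the two step functions are hypothetical stand-ins, and page loads are simulated by calling load_finished() by hand.

```python
class LoadDispatcher(object):
    """Hypothetical stand-in for the bot's load_function bookkeeping."""

    def __init__(self):
        self.set_load_function(None)

    def set_load_function(self, load_function, *args, **kwargs):
        # Remember what to call when the next page load finishes.
        self.load_function = load_function
        self.load_function_args = args
        self.load_function_kwargs = kwargs

    def load_finished(self, ok=True):
        # Qt would call this from the loadFinished(bool) signal.
        if self.load_function is None:
            return
        self.load_function(*self.load_function_args,
                           **self.load_function_kwargs)


log = []
dispatcher = LoadDispatcher()


def parse_results():
    log.append('parse')


def submit_search(keyword):
    log.append('search %s' % keyword)
    # Submitting the form causes another load, so queue the next step.
    dispatcher.set_load_function(parse_results)


dispatcher.set_load_function(submit_search, 'pyqt4')
dispatcher.load_finished()  # simulated load of the search page
dispatcher.load_finished()  # simulated load of the results page
```

Each step is responsible for registering its successor, which keeps the chain readable for a two- or three-page flow. Longer flows would probably be easier to follow with an explicit list of steps.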
So what are the good and bad points of using this method for web scraping?
Good
- JavaScript is not a problem anymore! JavaScript is usually a pain in the world of web scraping, as one must read the JavaScript and emulate it. This is especially awful with obfuscated JavaScript. By using a real browser, JavaScript becomes a tool instead of a hindrance. AJAX applications become far easier to automate.
- The user gets more visual feedback, because the browser renders the page.
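One caveat once JavaScript becomes the tool: the snippets are built from Python strings, so a keyword containing a double quote would break the generated script. A minimal sketch of one way around this, assuming `json.dumps` is an acceptable way to produce a JavaScript string literal (the helper name `build_search_js` is hypothetical and not part of the script above):

```python
import json


def build_search_js(keyword):
    """Return a jQuery snippet with the keyword embedded as a quoted literal.

    Hypothetical helper; json.dumps adds the surrounding double quotes and
    escapes quotes, backslashes, and non-ASCII characters.
    """
    return (
        '$("input[title=Google Search]").val(%s);\n'
        '$("input[value=Google Search]").parents("form").submit();'
        % json.dumps(keyword)
    )


snippet = build_search_js('say "hi"')
```

The resulting string can be handed to evaluateJavaScript() in place of the hand-formatted one.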
Bad
- JavaScript is hard to debug. I was looking for the equivalent of the Firefox error console in QWebView or its attributes. FIXED! I extended QWebPage to print JavaScript errors to stderr.
- QWebView uses more resources than mechanize. In exchange, of course, we get page rendering and a JavaScript engine.
- This is not as easy to set up on Windows and OS X as it is on Linux/BSD. This is not a big problem for me, as Fedora installs PyQt4 and its prerequisites whenever you install KDE. You may not be so lucky.
Posted on October 1, 2009 at 6:22 and commented on 7 times
