This was a nice little learning exercise, so I'd like to share it. The script parses the main dovecot log and the rawlogs for each mailbox to generate an HTML report of which host/IP has done what. The actions are still raw IMAP commands, but they are pretty understandable.

#!/usr/bin/env python

import cgi
import datetime
import glob
import os
import re
import socket
import sys

HOMEDIRSPATH = '/home'
MAILLOG = '/var/log/mail.log'

# Regex for mail.log
TIMESTAMP_RE = re.compile('.*(\d\d:\d\d:\d\d)')
RIP_RE = re.compile('.*rip=(\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3})')
MB_RE = re.compile('.*user=<(\w+?)>')

# Regex for dovecot.rawlog
RAWLOG_RES = [
    re.compile('\w+? CREATE "', re.I), # Folder Created
    re.compile('\w+? DELETE "', re.I), # Folder Deleted
    re.compile('\w+? RENAME "', re.I), # Folder Moved/Renamed
    re.compile('\w+? APPEND "', re.I), # Mail Added
    re.compile('\w+? UID STORE.*DELETED', re.I), # Mail Deleted
]
RAWLOG_SELECT_RE = re.compile('\w+? SELECT "', re.I) # Folder selected
RAWLOG_COPY_RE = re.compile('\w+? UID COPY', re.I) # Folder Copied/Being Moved
RAWLOG_STRIP_RE = re.compile('\w+? (.*)') # Remove the action id

HTML_HEADER = """<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
        <title>%s</title>
        <style type="text/css">
            .mb { background-color: #BDB; margin: 0px 0px 40px 0px; padding: 5px; }
            .ip { background-color: #CEC; margin: 10px; padding: 5px; }
            .log { background-color: #DFD; margin: 10px; padding: 5px; }
            span { font-size: small; }
        </style>
    </head>
    <body>
"""

HTML_FOOTER = """
    </body>
</html>
"""


class parsedc:

    def __init__(self, day=None, maillog=MAILLOG, homedirs=HOMEDIRSPATH):
        if day is None:
            # Set to yesterday by default
            self.day = datetime.date.today() - datetime.timedelta(days=1)
        else:
            self.day = day
        self.maillog = maillog
        self.homedirs = homedirs 
        self.results = {}
        self.current_mb = ''
        self.current_ip = ''

    def feed(self):
        f = open(self.maillog, 'r')
        s = f.read()
        f.close()
        dayts = self.day.strftime('%Y%m%d')
        timed = self.parsetimes(s, self.day) 
        mailboxes = timed.keys()
        mailboxes.sort()
        for mb in mailboxes:
            self.current_mb = mb
            if not mb in self.results:
                self.results[mb] = {}
            os.chdir(os.path.join(self.homedirs, mb, 'dovecot.rawlog'))
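            # Several sessions can share the same timestamp; 'offset' steps
            # through the rawlog files that glob returns for that second.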
            offset = 0
            last = ''
            for rec in timed[mb]:
                time = rec[0]
                self.current_ip = ip = rec[1]
                if not ip in self.results[mb]:
                    self.results[mb][ip] = []
                if time == last:
                    offset += 1
                else:
                    offset = 0
                    last = time
                logs = glob.glob('-'.join([dayts, time, '*.in']))
                try:
                    f = open(logs[offset])
                except IndexError:
                    continue # dovecot may not have made a rawlog
                self.parserawlog(f)
                f.close()

    def parsetimes(self, s, dt):
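        # strftime('%c') yields e.g. 'Thu Oct 23 14:55:02 2008'; slicing [4:10]
        # keeps the 'Oct 23' part, which matches the syslog date prefix on
        # mail.log lines (this assumes a C/English locale).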
        monthday = self.day.strftime('%c')[4:10]
        times = {}
        for line in s.split('\n'):
            if line.startswith(monthday):
                time = rip = mb = ''
                m = TIMESTAMP_RE.match(line)
                if m:
                    time = m.group(1).replace(':', '')
                m = RIP_RE.match(line)
                if m:
                    rip = m.group(1)
                m = MB_RE.match(line)
                if m:
                    mb = m.group(1)
                if time and rip and mb:
                    if not mb in times:
                        times[mb] = [] 
                    times[mb].append((time, rip))
        return times

    def parserawlog(self, f):
        lastselect = ''
        for line in f.readlines():
            for p in RAWLOG_RES:
                if p.match(line):
                    self.results[self.current_mb][self.current_ip].append(line)
                    continue
            if RAWLOG_COPY_RE.match(line):
                self.results[self.current_mb][self.current_ip].extend([lastselect,
                    line])
            if RAWLOG_SELECT_RE.match(line):
                lastselect = line

    def print_report(self):
        mailboxes = self.results.keys()
        mailboxes.sort()
        print HTML_HEADER % self.day.isoformat()
        for mb in mailboxes:
            print '<div class="mb"><span>', mb, '</span>'
            ips = self.results[mb].keys()
            ips.sort()
            for ip in ips:
                if self.results[mb][ip]:
                    try:
                        host, aliases, addrs = socket.gethostbyaddr(ip)
                    except socket.herror:
                        host = None
                    print '<div class="ip"><span>'
                    if not host is None:
                        print '(%s)' % host
                    print ip
                    print '</span>'
                    print '<div class="log"><span>'
                    for line in self.results[mb][ip]:
                        m = RAWLOG_STRIP_RE.match(line.strip())
                        print '', '', cgi.escape(m.group(1)), '<br />'
                    print '</span></div>'
                    print '</div>'
            print '</div>'
        print HTML_FOOTER

if __name__ == '__main__':
    pdc = parsedc()
    pdc.feed()
    pdc.print_report()
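
To produce a report, run the script and redirect its output to a file. The script name and output path below are only examples, and reading /var/log/mail.log and the users' rawlogs will usually require root:

python dovecot_report.py > /var/www/dovecot-report.html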
Posted by Tyler Lesmann on October 24, 2008 at 15:40
Tagged as: dovecot linux parsing python

I had a recent problem at work. We needed to copy a folder in one IMAP mailbox to another. It wasn't as easy as it should have been. The folder in question is just below 15GB. The obvious route is to copy it using a mail client. The first one I tried was Evolution 2.24. Evolution does copy folders, but it only copies the mail out of folders you've opened. This is a problem with 3500+ folders. I'll probably file a bug with GNOME about this. The next client was Kmail.
Kmail 3.5.10 does the copying perfectly...until it crashes. Kmail is pathetic in the realm of stability. The KDE team needs to make Kmail more resilient; it should display an error instead of dying. I didn't have a Windows machine available to try Outlook Express on, or I would have tried that too.

Since everything seems to have trouble with folders this size, I decided to try my hand at the task with Python. I ended up with this.

#!/usr/bin/env python

import getpass
import imaplib
import sys

def cpimap(user1, host1, pass1, user2, host2, pass2, target='/'):
    m1 = imaplib.IMAP4_SSL(host1, 993)
    m2 = imaplib.IMAP4_SSL(host2, 993)
    m1.login(user1,pass1)
    m2.login(user2,pass2)

    folders = [folder.split(' "/" ')[1][1:-1] for folder in m1.list(target)[1]]

    folders.insert(0, target) # Copy messages in the root of the target too

    print 'Copying', len(folders), 'folders...'

    for f in folders:
        if '\\' in f:
            print 'Skipping', f
            # imaplib does not support backslashes in mailbox names!
            continue
        print 'Copying', f
        m2.create(f)
        m1.select(f)
        print 'Fetching messages...'
        typ, data = m1.search(None, 'ALL')

        msgs = data[0].split()

        sys.stdout.write(" ".join(['Copying', str(len(msgs)), 'messages']))

        for num in msgs:
            typ, data = m1.fetch(num, '(RFC822)')
            sys.stdout.write('.')
            m2.append(f, None, None, data[0][1])
        sys.stdout.write('\n')

If you look through the code, you may notice that imaplib is a messy module to work with. At the very least, the user of imaplib doesn't need to know exactly how the IMAP protocol works. There is a lot more output than is needed. With a few tutorials, like this one, you can get a usable piece of code in a few minutes.

One limitation of imaplib involves operations on folders containing backslashes. The module does not properly escape them: if you try to do anything with a folder named back\slash, it will try to apply the operation to back\\slash. This is a bug in imaplib.
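
For what it's worth, here is one way the function might be invoked. The hostnames, usernames, and folder name below are placeholders rather than values from the original setup, and getpass keeps the passwords out of the shell history.

if __name__ == '__main__':
    # Placeholders: substitute your own accounts, servers, and folder.
    src_pass = getpass.getpass('Password for user1 on imap1.example.com: ')
    dst_pass = getpass.getpass('Password for user2 on imap2.example.com: ')
    cpimap('user1', 'imap1.example.com', src_pass,
           'user2', 'imap2.example.com', dst_pass,
           target='Archive')  # the folder (and its subfolders) to copy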

Posted by Tyler Lesmann on October 10, 2008 at 12:39
Tagged as: imaplib python

Install the Kerberos client packages:

apt-get install krb5-user libpam-krb5

Copy /etc/krb5.conf from the server. You should double-check the kdc and admin_server lines.
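
If you don't have the server's copy handy, the relevant parts of krb5.conf look roughly like this; EXAMPLE.COM and kdc.example.com are placeholders for your own realm and KDC host:

[libdefaults]
    default_realm = EXAMPLE.COM

[realms]
    EXAMPLE.COM = {
        kdc = kdc.example.com
        admin_server = kdc.example.com
    }

[domain_realm]
    .example.com = EXAMPLE.COM
    example.com = EXAMPLE.COM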

Edit the PAM configuration to tell Linux to ask Kerberos for authentication. There are four files: /etc/pam.d/common-{account,auth,password,session}.

Keep a session logged in as root until you verify that you can still log in after making these changes!

# /etc/pam.d/common-account - authorization settings common to all services
account    sufficient    pam_unix.so
account    sufficient    pam_krb5.so
account    required    pam_deny.so

# /etc/pam.d/common-auth - authentication settings common to all services
auth    sufficient    pam_unix.so nullok_secure
auth    sufficient    pam_krb5.so use_first_pass
auth    required    pam_deny.so

# /etc/pam.d/common-password - password-related modules common to all services
password    sufficient    pam_unix.so nullok obscure min=4 max=8 md5
password    sufficient    pam_krb5.so use_first_pass
password    required    pam_deny.so

# /etc/pam.d/common-session - session-related modules common to all services
session    optional    pam_unix.so
session    optional    pam_krb5.so

You should now be able to authenticate using Kerberos. Remember that you will still need to create local accounts, e.g. with useradd, before you will be able to log in.

Important note: Make sure that the machine can resolve its own hostname to an IP address. This is as simple as adding an entry to /etc/hosts.
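
For example, a line like the following, where the address and names are placeholders for your own machine:

192.168.1.25    client.example.com    client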

Posted by Tyler Lesmann on October 6, 2008 at 14:18
Tagged as: debian kerberos linux

In the last post, I illustrated how to efficiently fetch HTML for data mining using the mechanize module. Now that we have our HTML, we can parse it for the information we want. To do this, we will use the HTMLParser module. This is a standard module in Python, so you don't have to install anything.

In this example, we will glean all of the headlines from the main page of this blog.

#!/usr/bin/env python

from HTMLParser import HTMLParser
from mechanize import Browser

class HeadLineParser(HTMLParser):
    def __init__(self):
        self.in_header = False
        self.in_headline = False
        self.headlines = []
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            # attrs is a list of tuple pairs, a dictionary is more useful
            dattrs = dict(attrs)
            if 'class' in dattrs and dattrs['class'] == 'header':
                self.in_header = True
        if tag == 'a' and self.in_header:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_header = False
        if tag == 'a':
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines.append(data)

br = Browser()
response = br.open('http://tylerlesmann.com/')
hlp = HeadLineParser()
hlp.feed(response.read())
for headline in hlp.headlines:
    print headline
hlp.close()

You use HTMLParser by extending it. The four methods you'll need are __init__, handle_starttag, handle_endtag, and handle_data. HTMLParser can be confusing at first because it works in a unique manner. Whenever an opening HTML tag is encountered, handle_starttag is called. Whenever a closing tag is found, handle_endtag is called. Whenever anything in between tags is encountered, handle_data is called.

The way to actually use HTMLParser is with a system of flags, like in_header and in_headline in the example. We toggle them on in handle_starttag and off in handle_endtag. If you look at the HTML of this blog, you'll see that headlines are enclosed in classless <a> tags. There are a lot of <a>s on this site, so we need something unique to flag the headline <a>s. If you look carefully, you'll see that all of the headlines are enclosed in <div>s with a header class. We can flag those and then flag only the <a>s inside them, which is what the example does.

Now that the script has all the proper flags to detect headlines, we can simply have handle_data append any text to a list of headlines whenever the in_headline flag is True.

To use our new parser, we simply make an instance of it and use the instance's feed method to run HTML through the parser. We can then read the headlines attribute directly, like any attribute of a Python object.

Posted by Tyler Lesmann on October 4, 2008 at 7:09
Tagged as: htmlparser mechanize python screen_scraping

Python is the perfect language for mining data, on the web or otherwise. Python comes with modules for accessing the Internet, like urllib and urllib2, but if you are short on time or don't want to write more code than you have to, you will want to use the mechanize package. One gotcha with mechanize is that you will want to get the latest version. You can download it directly from their site, along with its dependency ClientForm, or you can use easy_install, if you have it.

$ sudo easy_install mechanize

Using the mechanize module is almost as simple as using a web browser. Here's an example of use:

#!/usr/bin/env python

from mechanize import Browser

br = Browser()
br.open('http://tylerlesmann.com/')

response = br.follow_link(text='Data Mining the Web with Python: Part 1')

print response.read()

This script will go to this site, follow the link to this article, download the HTML, and print it. That's pretty simple. Mechanize can do much more interesting things.

#!/usr/bin/env python

from mechanize import Browser

br = Browser()
br.open('http://finance.yahoo.com/')
br.select_form(name='quote')
br['s'] = 'rht' # 's' is the input field in the quote form
response = br.submit() # Submit the form just like a web browser
print response.read()

This script goes to http://finance.yahoo.com, enters rht into the Get Quotes field, submits the form, and prints out the quote page HTML for Red Hat Inc. This is to illustrate how simple it is to use forms with mechanize. You can use the very same methods to log into websites. Mechanize handles all of the fun of cookies and sessions; you just have to tell it where to go.
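
As a rough sketch of what a login looks like, something along these lines would work; the URL, form name, and field names are hypothetical and will differ for any real site.

#!/usr/bin/env python

from mechanize import Browser

br = Browser()
br.open('http://www.example.com/login')  # hypothetical login page
br.select_form(name='login')             # hypothetical form name
br['username'] = 'myuser'                # hypothetical field names and values
br['password'] = 'mypassword'
response = br.submit()
print response.read()

# The Browser instance keeps the session cookies, so later br.open() calls
# on the same site are made as the logged-in user.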

In the next post, I'll detail how to parse the fetched HTML with the HTMLParser module.

Posted by Tyler Lesmann on October 3, 2008 at 13:09
Tagged as: mechanize python screen_scraping