This was a nice little learning exercise for me, so I'd like to share it. The script parses the main dovecot log and the rawlogs for each mailbox to generate an HTML report of which host/IP has done what. The actions are still raw IMAP, but they are pretty understandable.
#!/usr/bin/env python

import cgi
import datetime
import glob
import os
import re
import socket
import sys

HOMEDIRSPATH = '/home'
MAILLOG = '/var/log/mail.log'

# Regex for mail.log
TIMESTAMP_RE = re.compile(r'.*(\d\d:\d\d:\d\d)')
RIP_RE = re.compile(r'.*rip=(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})')
MB_RE = re.compile(r'.*user=<(\w+?)>')

# Regex for dovecot.rawlog
RAWLOG_RES = [
    re.compile(r'\w+? CREATE "', re.I), # Folder Created
    re.compile(r'\w+? DELETE "', re.I), # Folder Deleted
    re.compile(r'\w+? RENAME "', re.I), # Folder Moved/Renamed
    re.compile(r'\w+? APPEND "', re.I), # Mail Added
    re.compile(r'\w+? UID STORE.*DELETED', re.I), # Mail Deleted
]
RAWLOG_SELECT_RE = re.compile(r'\w+? SELECT "', re.I) # Folder selected
RAWLOG_COPY_RE = re.compile(r'\w+? UID COPY', re.I) # Folder Copied/Being Moved
RAWLOG_STRIP_RE = re.compile(r'\w+? (.*)') # Remove the action id

HTML_HEADER = """<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>%s</title>
<style type="text/css">
.mb { background-color: #BDB; margin: 0px 0px 40px 0px; padding: 5px; }
.ip { background-color: #CEC; margin: 10px; padding: 5px; }
.log { background-color: #DFD; margin: 10px; padding: 5px; }
span { font-size: small; }
</style>
</head>
<body>
"""
HTML_FOOTER = """
</body>
</html>
"""

class parsedc:
    def __init__(self, day=None, maillog=MAILLOG, homedirs=HOMEDIRSPATH):
        if day is None:
            # Set to yesterday by default
            self.day = datetime.date.today() - datetime.timedelta(days=1)
        else:
            self.day = day
        self.maillog = maillog
        self.homedirs = homedirs
        self.results = {}
        self.current_mb = ''
        self.current_ip = ''

    def feed(self):
        f = open(self.maillog, 'r')
        s = f.read()
        f.close()
        dayts = self.day.strftime('%Y%m%d')
        timed = self.parsetimes(s, self.day)
        mailboxes = timed.keys()
        mailboxes.sort()
        for mb in mailboxes:
            self.current_mb = mb
            if not mb in self.results:
                self.results[mb] = {}
            os.chdir(os.path.join(self.homedirs, mb, 'dovecot.rawlog'))
            # Several logins in the same second produce several rawlogs with
            # the same timestamp; offset picks the matching one in order.
            offset = 0
            last = ''
            for rec in timed[mb]:
                time = rec[0]
                self.current_ip = ip = rec[1]
                if not ip in self.results[mb]:
                    self.results[mb][ip] = []
                if time == last:
                    offset += 1
                else:
                    offset = 0
                last = time
                logs = glob.glob('-'.join([dayts, time, '*.in']))
                try:
                    f = open(logs[offset])
                except IndexError:
                    continue # dovecot may not have made a rawlog
                self.parserawlog(f)
                f.close()

    def parsetimes(self, s, dt):
        # Syslog lines start with e.g. "Dec  1"; slice that out of '%c'
        monthday = self.day.strftime('%c')[4:10]
        times = {}
        for line in s.split('\n'):
            if line.startswith(monthday):
                time = rip = mb = ''
                m = TIMESTAMP_RE.match(line)
                if m:
                    time = m.group(1).replace(':', '')
                m = RIP_RE.match(line)
                if m:
                    rip = m.group(1)
                m = MB_RE.match(line)
                if m:
                    mb = m.group(1)
                if time and rip and mb:
                    if not mb in times:
                        times[mb] = []
                    times[mb].append((time, rip))
        return times

    def parserawlog(self, f):
        lastselect = ''
        for line in f.readlines():
            for p in RAWLOG_RES:
                if p.match(line):
                    self.results[self.current_mb][self.current_ip].append(line)
                    break # one match is enough, stop checking patterns
            if RAWLOG_COPY_RE.match(line):
                # Report the source folder along with the copy itself
                self.results[self.current_mb][self.current_ip].extend([lastselect, line])
            if RAWLOG_SELECT_RE.match(line):
                lastselect = line

    def print_report(self):
        mailboxes = self.results.keys()
        mailboxes.sort()
        print HTML_HEADER % self.day.isoformat()
        for mb in mailboxes:
            print '<div class="mb"><span>', mb, '</span>'
            ips = self.results[mb].keys()
            ips.sort()
            for ip in ips:
                if self.results[mb][ip]:
                    try:
                        host, aliases, addrs = socket.gethostbyaddr(ip)
                    except socket.herror:
                        host = None
                    print '<div class="ip"><span>'
                    if not host is None:
                        print '(%s)' % host
                    print ip
                    print '</span>'
                    print '<div class="log"><span>'
                    for line in self.results[mb][ip]:
                        m = RAWLOG_STRIP_RE.match(line.strip())
                        print '', '', cgi.escape(m.group(1)), '<br />'
                    print '</span></div>'
                    print '</div>'
            print '</div>'
        print HTML_FOOTER

if __name__ == '__main__':
    pdc = parsedc()
    pdc.feed()
    pdc.print_report()
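To make the mail.log step easier to follow, here is a small sketch of how the three login regexes pull the time, remote IP, and mailbox out of a single line. It is written for Python 3, the helper name parse_login_line is mine, and the sample log line is made up for illustration, so your dovecot login lines may look different.

```python
import re

# The same three patterns as in the script, as raw strings with escaped dots.
TIMESTAMP_RE = re.compile(r'.*(\d\d:\d\d:\d\d)')
RIP_RE = re.compile(r'.*rip=(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})')
MB_RE = re.compile(r'.*user=<(\w+?)>')


def parse_login_line(line):
    """Return (HHMMSS, remote_ip, mailbox), or None if any field is missing."""
    t = TIMESTAMP_RE.match(line)
    ip = RIP_RE.match(line)
    mb = MB_RE.match(line)
    if not (t and ip and mb):
        return None
    # The rawlog file names use the time without colons.
    return (t.group(1).replace(':', ''), ip.group(1), mb.group(1))


# Hypothetical log line, invented for this example.
SAMPLE = ('Dec  1 10:15:32 mail dovecot: imap-login: Login: user=<alice>, '
          'method=PLAIN, rip=192.168.1.10, lip=192.168.1.2, TLS')

print(parse_login_line(SAMPLE))
```

The tuple that comes back is what the script uses to find the matching rawlog file for that login.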
I had a recent problem at work. We needed to copy a folder in one IMAP mailbox to another. It wasn't as easy as it should have been. The folder in question is just below 15GB. The obvious route is to copy it using a mail client. The first one I tried was Evolution 2.24. Evolution does copy folders, but it only copies the mail out of folders you've opened. This is a problem with 3500+ folders. I'll probably file a bug with GNOME about this. The next client was Kmail.
Kmail 3.5.10 does the copying perfectly...until it crashes. Kmail is pathetic in the realm of stability. The KDE team needs to make Kmail more resilient. It should display an error instead of dying. I didn't have a Windows machine available to try Outlook Express on, or I would have tried that too.
Since everything seems to have trouble with folders this size, I decided to try my hand at the task with Python. I ended up with this.
#!/usr/bin/env python

import getpass
import imaplib
import sys

def cpimap(user1, host1, pass1, user2, host2, pass2, target='/'):
    m1 = imaplib.IMAP4_SSL(host1, 993)
    m2 = imaplib.IMAP4_SSL(host2, 993)
    m1.login(user1,pass1)
    m2.login(user2,pass2)
    folders = [folder.split(' "/" ')[1][1:-1] for folder in m1.list(target)[1]]
    folders.insert(0, target) # Copy messages in the root of the target too
    print 'Copying', len(folders), 'folders...'
    for f in folders:
        if '\\' in f:
            print 'Skipping', f # imaplib does not support backslashes in mailbox names!
            continue
        print 'Copying', f
        m2.create(f)
        m1.select(f)
        print 'Fetching messages...'
        typ, data = m1.search(None, 'ALL')
        msgs = data[0].split()
        sys.stdout.write(" ".join(['Copying', str(len(msgs)), 'messages']))
        for num in msgs:
            typ, data = m1.fetch(num, '(RFC822)')
            sys.stdout.write('.')
            m2.append(f, None, None, data[0][1])
        sys.stdout.write('\n')
If you look through the code, you may notice that imaplib is a messy module to work with. At the very least, the user of imaplib doesn't need to know exactly how the IMAP protocol works. There is a lot more output than what is needed. With a few tutorials, like this one, you can get a usable piece of code in a few minutes.
One limitation of imaplib is operations on folders containing backslashes. The module does not properly escape them. If you try to do anything with a folder named back\slash, then it will try to apply the procedure on back\\slash. This is a bug in imaplib.
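The folder list in the script is pulled apart with a plain split on ' "/" ', which only works when the server uses / as the delimiter and quotes every name. As a rough sketch of a more careful approach, here is a Python 3 helper that parses one LIST response line with a regex. The names LIST_RE, unquote and parse_list_line are my own, and the sketch assumes the server sends names as atoms or quoted strings; it does not handle IMAP literals or modified UTF-7 names.

```python
import re

# One LIST response line looks like: (flags) "delimiter" mailbox-name
LIST_RE = re.compile(
    rb'\((?P<flags>[^)]*)\) (?P<delim>"(?:\\.|[^"\\])*"|NIL) (?P<name>.*)')


def unquote(token):
    """Strip surrounding double quotes and undo quoted-string escapes."""
    text = token.decode('utf-8')
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        # Inside quotes, a backslash escapes the character that follows it.
        text = re.sub(r'\\(.)', r'\1', text[1:-1])
    return text


def parse_list_line(line):
    """Return (flags, delimiter, name) for one LIST line, or None."""
    m = LIST_RE.match(line)
    if m is None:
        return None
    flags = m.group('flags').decode('ascii').split()
    delim = m.group('delim')
    delim = None if delim == b'NIL' else unquote(delim)
    return flags, delim, unquote(m.group('name'))


print(parse_list_line(b'(\\HasNoChildren) "/" "INBOX/Sent Items"'))
```

In Python 3, imaplib hands back the LIST data as bytes, which is why the pattern is a bytes regex. This only helps with reading folder names; it does not change how imaplib quotes names when sending commands.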
Install the Kerberos client packages:
apt-get install krb5-user libpam-krb5
Copy the /etc/krb5.conf from the server. You should double-check the kdc and admin_server lines.
Edit the PAM configuration to tell Linux to ask Kerberos for authentication. There are four files, /etc/pam.d/common-{account,auth,password,session}.
Keep a session logged in as root until you verify that you can still login after making these changes!
# /etc/pam.d/common-account - authorization settings common to all services
account sufficient pam_unix.so
account sufficient pam_krb5.so
account required pam_deny.so

# /etc/pam.d/common-auth - authentication settings common to all services
auth sufficient pam_unix.so nullok_secure
auth sufficient pam_krb5.so use_first_pass
auth required pam_deny.so

# /etc/pam.d/common-password - password-related modules common to all services
password sufficient pam_unix.so nullok obscure min=4 max=8 md5
password sufficient pam_krb5.so use_first_pass
password required pam_deny.so

# /etc/pam.d/common-session - session-related modules common to all services
session optional pam_unix.so
session optional pam_krb5.so

You should now be able to authenticate using Kerberos. Remember that you will still need to create accounts, e.g. with useradd, before you will be able to log in.
Important note: Make sure that the machine can resolve its hostname to an IP address. This is as simple as adding an entry to /etc/hosts.
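If you want a quick way to check the hostname requirement before locking yourself out, something like the following Python 3 sketch would do. The function name resolves and its resolver parameter are my own inventions so the check can be tried without touching DNS; the real lookup is just socket.gethostbyname.

```python
import socket


def resolves(hostname, resolver=socket.gethostbyname):
    """Return the IP address hostname resolves to, or None on failure."""
    try:
        return resolver(hostname)
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return None


def main():
    name = socket.gethostname()
    address = resolves(name)
    if address is None:
        print('%s does not resolve; consider adding it to /etc/hosts' % name)
    else:
        print('%s resolves to %s' % (name, address))


if __name__ == '__main__':
    main()
```

Run it on the client machine after editing /etc/hosts. If it reports that the name does not resolve, fix that before testing a Kerberos login.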
In the last post, I illustrated how to most efficiently fetch html for data mining using the mechanize module. Now that we have our html, we can parse it for the information we want. To do this, we will use the HTMLParser module. This is a standard module in Python, so you don't have to install anything.
In this example, we will glean all of the headlines from the main page of this blog.
#!/usr/bin/env python

from HTMLParser import HTMLParser
from mechanize import Browser

class HeadLineParser(HTMLParser):
    def __init__(self):
        self.in_header = False
        self.in_headline = False
        self.headlines = []
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            # attrs is a list of tuple pairs, a dictionary is more useful
            dattrs = dict(attrs)
            if 'class' in dattrs and dattrs['class'] == 'header':
                self.in_header = True
        if tag == 'a' and self.in_header:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_header = False
        if tag == 'a':
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines.append(data)

br = Browser()
response = br.open('http://tylerlesmann.com/')
hlp = HeadLineParser()
hlp.feed(response.read())
for headline in hlp.headlines:
    print headline
hlp.close()
You use HTMLParser by extending it. The four methods you'll need are __init__, handle_starttag, handle_endtag, and handle_data. HTMLParser can be confusing at first because it works in a unique manner. Whenever an opening HTML tag is encountered, handle_starttag is called. Whenever a closing tag is found, handle_endtag is called. Whenever any text in between tags is encountered, handle_data is called.
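To see that call order for yourself, here is a tiny sketch that just records every callback. It is written for Python 3, where the module lives at html.parser, and the EventLogger class and the HTML snippet are made up for this illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record each parser callback in the order it fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))


logger = EventLogger()
logger.feed('<div class="header"><a href="/post/1">Hello</a></div>')
logger.close()
for event in logger.events:
    print(event)
```

The events come out as start div, start a, data Hello, end a, end div, which is exactly the sequence the flag approach relies on.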
The way to actually use HTMLParser is to use a system of flags, like in_header and in_headline from the example. We toggle them on in handle_starttag and off in handle_endtag. If you look at the HTML of this blog, you'll see that headlines are enclosed in classless <a> tags. There are a lot of <a>s on this site. We need something unique to flag the headline <a>s. If you look carefully, you will see that all of the headlines are enclosed in <div>s with a header class. We can flag those and flag the <a>s only inside them, which is what the example does.
Now that the script has all the proper flags to detect headlines, we can simply have handle_data append any text to a list of headlines when our in_headline flag is True.
To use our new parser, we simply make an instance of it and use the instance's feed method to run HTML through the parser. We can access the headlines attribute directly, like we can with any object in Python.
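One thing to watch with simple on/off flags: in the example, any closing </div> switches in_header off, so a <div> nested inside the header would end the header early. Below is a hedged Python 3 sketch of one way around that, counting nesting depth instead of using a boolean. The header_depth attribute and the sample HTML are my own; this is a variation on the example, not what the script above does.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.header_depth = 0   # > 0 while inside a <div class="header">
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            if self.header_depth:
                self.header_depth += 1      # nested div inside a header
            elif dict(attrs).get('class') == 'header':
                self.header_depth = 1       # entering a header div
        elif tag == 'a' and self.header_depth:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == 'div' and self.header_depth:
            self.header_depth -= 1
        elif tag == 'a':
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines.append(data)


# Made-up page fragment for illustration.
SAMPLE = '''
<div class="header"><div class="date">May 1</div><a href="/1">First post</a></div>
<div class="body"><a href="/about">About</a></div>
<div class="header"><a href="/2">Second post</a></div>
'''

parser = HeadlineParser()
parser.feed(SAMPLE)
parser.close()
print(parser.headlines)
```

Only the links inside header divs are collected, even though the first header contains another <div> before its link.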
Python is the perfect language for mining data, on the web or otherwise. Python comes with modules for accessing the Internet, like urllib and urllib2, but if you are short on time or don't want to write more than you have to, you will want to use the mechanize package. One gotcha with mechanize is that you will want to get the latest version. You can download it directly from their site, along with its dependency ClientForm, or you can use easy_install, if you have it.
$ sudo easy_install mechanize
Using the mechanize module is almost as simple as using a web browser. Here's an example of use:
#!/usr/bin/env python

from mechanize import Browser

br = Browser()
br.open('http://tylerlesmann.com/')
response = br.follow_link(text='Data Mining the Web with Python: Part 1')
print response.read()
This script will go to this site, go to this article via its link, download the html, and print it. That's pretty simple. Mechanize can do much more interesting things.
#!/usr/bin/env python

from mechanize import Browser

br = Browser()
br.open('http://finance.yahoo.com/')
br.select_form(name='quote')
br['s'] = 'rht' # 's' is the input field in the quote form
response = br.submit() # Submit the form just like a web browser
print response.read()
This script goes to http://finance.yahoo.com, enters rht into the Get Quotes field, submits the form, and prints out the quote page html for Red Hat Inc. This is to illustrate how simple it is to use forms with mechanize. You can use the very same methods to log into web sites. Mechanize handles all of the fun of cookies and sessions. You just have to tell it where to go.
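For a sense of what a form submission boils down to without mechanize, here is a standard-library sketch in Python 3 that builds, but does not send, a POST request with URL-encoded fields. The build_form_request helper and the example.com URL are hypothetical, and this is not how mechanize is implemented internally; it only shows the kind of request a form produces.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(action_url, fields):
    """Build (but do not send) a POST request carrying URL-encoded form fields."""
    body = urlencode(fields).encode('ascii')
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    return Request(action_url, data=body, headers=headers)


# Hypothetical form target and field, for illustration only.
req = build_form_request('http://example.com/quote', {'s': 'rht'})
print(req.get_method(), req.full_url, req.data)
```

Actually sending it, and keeping cookies between requests, would mean opening the request through an opener built with urllib.request.build_opener and an HTTPCookieProcessor. Finding the form, filling in defaults, and tracking the session is the bookkeeping mechanize saves you from.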
In the next post, I'll detail how to parse the fetched HTML with the HTMLParser module.
