| From | Nobody <nobody@nowhere.com> |
|---|---|
| Subject | Re: Reading Huge UnixMailbox Files |
| Date | Tue, 26 Apr 2011 21:23:31 +0100 |
| Message-Id | <pan.2011.04.26.20.23.29.625000@nowhere.com> |
| Newsgroups | comp.lang.python |
| References | <mailman.866.1303846801.9059.python-list@python.org> |
On Tue, 26 Apr 2011 15:39:37 -0400, Brandon McGinty wrote:
> I'm trying to import hundreds of thousands of e-mail messages into a
> database with Python.
> However, some of these mailboxes are so large that they are giving
> errors when being read with the standard mailbox module.
> I created a buffered reader, that reads chunks of the mailbox, splits
> them using the re.split function with a compiled regexp, and imports
> each chunk as a message.
> The regular expression work is where the bottle-neck appears to be,
> based on timings.
> I'm wondering if there is a faster way to do this, or some other method
> that you all would recommend.
Consider using awk. In my experience, high-level languages tend to have
slower regex libraries than simple tools such as sed and awk.
E.g. the following script reads a mailbox on stdin and writes a separate
file for each message:
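    #!/usr/bin/awk -f
    # Split an mbox read from stdin into one file per message.
    BEGIN {
        num = 0;
        ofile = "";
    }
    # A line starting with "From " begins a new message.
    /^From / {
        if (ofile != "") close(ofile);
        ofile = sprintf("%06d.mbox", num);
        num++;
    }
    # Skip anything that precedes the first "From " line.
    ofile != "" {
        print > ofile;
    }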
It would be simple to modify it to start a new file after a given number
of messages or a given number of lines.
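For comparison, here is a rough sketch of the "new file after a given number of messages" variant written in Python rather than awk. It avoids regular expressions entirely by testing each line with startswith, and it reads the input line by line so the whole mailbox never has to fit in memory. The function name split_mbox and the messages_per_file parameter are made up for illustration. Like the awk script, it assumes that any body line beginning with "From " has already been escaped (e.g. as ">From "), which is the usual mbox convention but not guaranteed.

```python
import os


def split_mbox(src_path, out_dir, messages_per_file=1000):
    """Split an mbox file into smaller mbox files.

    Hypothetical helper: each output file holds at most
    messages_per_file messages. Returns the list of paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    count = 0  # messages written to the current output file
    with open(src_path, "rb") as src:
        for line in src:
            # Assumption: only message separators start with "From ".
            if line.startswith(b"From "):
                if out is None or count >= messages_per_file:
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, "%06d.mbox" % len(paths))
                    paths.append(path)
                    out = open(path, "wb")
                    count = 0
                count += 1
            # Anything before the first "From " line is skipped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return paths
```

Whether this is fast enough compared with awk would need to be timed on the real mailboxes; the point is only that a plain prefix test may remove the regex cost.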
You can then read the resulting smaller mailboxes using your Python script.
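As a minimal sketch of that last step, the split files could be read back with the standard mailbox module, one small file at a time. The function name iter_messages and the directory layout are assumptions for illustration; it expects the files to end in ".mbox" and to sort in message order, as the zero-padded names above do.

```python
import glob
import mailbox
import os


def iter_messages(split_dir):
    """Yield (path, message) for every message in every *.mbox file.

    Hypothetical helper: files are visited in sorted name order.
    """
    for path in sorted(glob.glob(os.path.join(split_dir, "*.mbox"))):
        # create=False: do not create the file if it is missing.
        box = mailbox.mbox(path, create=False)
        try:
            for msg in box:
                yield path, msg
        finally:
            box.close()
```

Each yielded message supports the usual header lookups such as msg["Subject"], so the database import code can stay as it is and simply consume this generator.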