Re: Reading Huge UnixMailbox Files

From Nobody <nobody@nowhere.com>
Subject Re: Reading Huge UnixMailbox Files
Date 2011-04-26 21:23 +0100
Message-Id <pan.2011.04.26.20.23.29.625000@nowhere.com>
Newsgroups comp.lang.python
References <mailman.866.1303846801.9059.python-list@python.org>
Organization Zen Internet


On Tue, 26 Apr 2011 15:39:37 -0400, Brandon McGinty wrote:

> I'm trying to import hundreds of thousands of e-mail messages into a
> database with Python.
> However, some of these mailboxes are so large that they are giving
> errors when being read with the standard mailbox module.
> I created a buffered reader, that reads chunks of the mailbox, splits
> them using the re.split function with a compiled regexp, and imports
> each chunk as a message.
> The regular expression work is where the bottle-neck appears to be,
> based on timings.
> I'm wondering if there is a faster way to do this, or some other method
> that you all would recommend.

Consider using awk. In my experience, high-level languages tend to have
slower regex libraries than simple tools such as sed and awk.

E.g. the following script reads a mailbox on stdin and writes a separate
file for each message:

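	#!/usr/bin/awk -f
	BEGIN {
		num = 0;
		ofile = "";
	}
	
	# A line starting with "From " begins a new message: switch output files.
	/^From / {
		if (ofile != "") close(ofile);
		ofile = sprintf("%06d.mbox", num);
		num++;
	}
	
	# Skip any lines that precede the first "From " line (no file is open yet).
	ofile != "" {
		print > ofile;
	}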

It would be simple to modify it to start a new file after a given number
of messages or a given number of lines.

You can then read the resulting smaller mailboxes using your Python script.
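
As a rough sketch of that last step (not the original poster's script), the split files could be walked with the standard mailbox module. This assumes the files sit in one directory and keep the %06d.mbox names from the awk script; iter_split_messages is a made-up helper name, not part of any library.

```python
import glob
import mailbox
import os


def iter_split_messages(directory, pattern="*.mbox"):
    """Yield (path, message) for every message in the split files.

    Hypothetical helper: files are visited in sorted order, which matches
    the zero-padded names (000000.mbox, 000001.mbox, ...) assumed above.
    """
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        # create=False: do not create a mailbox file if the path is missing
        box = mailbox.mbox(path, create=False)
        try:
            for message in box:
                yield path, message
        finally:
            # Release the file handle before moving on to the next file
            box.close()
```

Each yielded message behaves like an email.message.Message, so message["Subject"] and message.get_payload() are available for the database import. Only one small file is open at a time.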
