Groups > comp.lang.python > #10673 > unrolled thread

Complex sort on big files

Started by	aliman <alimanfoo@googlemail.com>
First post	2011-08-01 08:33 -0700
Last post	2011-08-09 15:20 -0700
Articles	9 — 8 participants

Back to article view | Back to comp.lang.python

  Complex sort on big files aliman <alimanfoo@googlemail.com> - 2011-08-01 08:33 -0700
    Re: Complex sort on big files Peter Otten <__peter__@web.de> - 2011-08-01 19:00 +0200
    Re: Complex sort on big files Alistair Miles <alimanfoo@googlemail.com> - 2011-08-02 11:25 +0100
    Re: python module to determine if a machine is idle/free Chris Rebert <clp2@rebertia.com> - 2011-08-03 21:38 -0700
    Re: Complex sort on big files sturlamolden <sturlamolden@yahoo.no> - 2011-08-05 18:31 -0700
      Re: Complex sort on big files Roy Smith <roy@panix.com> - 2011-08-05 22:54 -0400
        Re: Complex sort on big files Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-08-06 13:30 +1000
    Re: Complex sort on big files sturlamolden <sturlamolden@yahoo.no> - 2011-08-06 10:53 -0700
      Re: Complex sort on big files John Nagle <nagle@animats.com> - 2011-08-09 15:20 -0700

#10673 — Complex sort on big files

From	aliman <alimanfoo@googlemail.com>
Date	2011-08-01 08:33 -0700
Subject	Complex sort on big files
Message-ID	<062306c6-3ee6-43ce-936f-ca9cc0f013c9@eb9g2000vbb.googlegroups.com>

Hi all,

Apologies I'm sure this has been asked many times, but I'm trying to
figure out the most efficient way to do a complex sort on very large
files.

I've read the recipe at [1] and understand that the way to sort a
large file is to break it into chunks, sort each chunk and write
sorted chunks to disk, then use heapq.merge to combine the chunks as
you read them.

What I'm having trouble figuring out is what to do when I want to sort
by one key ascending then another key descending (a "complex sort").

I understand that sorts are stable, so I could just repeat the whole
sort process once for each key in turn, but that would involve going
to and from disk once for each step in the sort, and I'm wondering if
there is a better way.

I also thought you could apply the complex sort to each chunk before
writing it to disk, so each chunk was completely sorted, but then the
heapq.merge wouldn't work properly, because afaik you can only give it
one key.

Any help much appreciated (I may well be missing something glaringly
obvious).

Cheers,

Alistair

[1] http://code.activestate.com/recipes/576755-sorting-big-files-the-python-26-way/

[toc] | [next] | [standalone]

#10675

From	Peter Otten <__peter__@web.de>
Date	2011-08-01 19:00 +0200
Message-ID	<j16m24$4q4$1@solani.org>
In reply to	#10673

aliman wrote:

> Apologies I'm sure this has been asked many times, but I'm trying to
> figure out the most efficient way to do a complex sort on very large
> files.
> 
> I've read the recipe at [1] and understand that the way to sort a
> large file is to break it into chunks, sort each chunk and write
> sorted chunks to disk, then use heapq.merge to combine the chunks as
> you read them.
> 
> What I'm having trouble figuring out is what to do when I want to sort
> by one key ascending then another key descending (a "complex sort").
> 
> I understand that sorts are stable, so I could just repeat the whole
> sort process once for each key in turn, but that would involve going
> to and from disk once for each step in the sort, and I'm wondering if
> there is a better way.
> 
> I also thought you could apply the complex sort to each chunk before
> writing it to disk, so each chunk was completely sorted, but then the
> heapq.merge wouldn't work properly, because afaik you can only give it
> one key.

You can make that key as complex as needed:

>>> class Key(object):
...     def __init__(self, obj):
...             self.asc = obj[1]
...             self.desc = obj[2]
...     def __cmp__(self, other):
...             return cmp(self.asc, other.asc) or -cmp(self.desc, 
other.desc)
...
>>> sorted(["abc", "aba", "bbb", "aaa", "aab"], key=Key)
['aab', 'aaa', 'abc', 'bbb', 'aba']

See also

http://docs.python.org/library/functools.html#functools.total_ordering

[toc] | [prev] | [next] | [standalone]

#10712

From	Alistair Miles <alimanfoo@googlemail.com>
Date	2011-08-02 11:25 +0100
Message-ID	<mailman.1758.1312280708.1164.python-list@python.org>
In reply to	#10673

Hi Dan,

Thanks for the reply.

On Mon, Aug 1, 2011 at 5:45 PM, Dan Stromberg <drsalists@gmail.com> wrote:
>
> Python 2.x, or Python 3.x?

Currently Python 2.x.

> What are the types of your sort keys?

Both numbers and strings.

> If you're on 3.x and the key you need reversed is numeric, you can negate
> the key.

I did wonder about that. Would that not be doable also in Python 2.7,
using sorted(key=...)?

> If you're on 2.x, you can use an object with a __cmp__ method to compare
> objects however you require.

OK, right.

Looking at the HowTo/Sorting again [1] and the bit about cmp_to_key,
could you also achieve the same effect by returning a key with custom
implementations of rich comparison functions?

> You probably should timsort the chunks (which is the standard list_.sort() -
> it's a very good in-memory sort), and then merge them afterward using the
> merge step of merge sort.

Yes, that's what I understood by the activestate recipe [2].

So I guess my question boils down to, how do you do the merge step for
a complex sort? (Assuming each chunk had been completely sorted
first.)

Maybe the answer is also to construct a key with custom implementation
of rich comparisons?

Now I'm also wondering about the best way to sort each chunk. The
examples in [1] of complex sorts suggest the best way to do it is to
first sort by the secondary key, then sort by the primary key, relying
on the stability of the sort to get the desired outcome. But would it
not be better to call sorted() once, supplying a custom key function?

(As an aside, at the end of the section in [1] on Sort Stability and
Complex sorts, it says "The Timsort algorithm used in Python does
multiple sorts efficiently because it can take advantage of any
ordering already present in a dataset." - I get that that's true, but
I don't see how that's relevant to this strategy for doing complex
sorts. I.e., if you sort first by the secondary key, you don't get any
ordering that's helpful when you subsequently sort by the primary key.
...?)

(Sorry, another side question, I'm guessing reading a chunk of data
into a list and using Timsort, i.e., calling list.sort() or
sorted(mylist), is quicker than using bisect to keep the chunk sorted
as you build it?)

> heapq's not unreasonable for the merging, but I think it's more common to
> use a short list.

Do you mean a regular Python list, and calling min()?

> I have a bunch of Python sorts at
> http://stromberg.dnsalias.org/svn/sorts/compare/trunk/ - if you find your
> need is specialized (EG, 3.x sorting by a secondary key that's a string, in
> reverse), you could adapt one of these to do what you need.

Thanks. Re 3.x sorting by a secondary key that's a string, in reverse,
which one were you thinking of in particular?

> heapq is not bad, but if you find you need a logn datastructure, you might
> check out http://stromberg.dnsalias.org/~dstromberg/treap/ - a treap is also
> logn, but has a very good amortized cost because it sacrifices keeping
> things perfectly balanced (and rebalancing, and rebalancing...) to gain
> speed.  But still, you might be better off with a short list and min.

Thanks, that's really helpful.

Cheers,

Alistair

[1] http://wiki.python.org/moin/HowTo/Sorting/
[2] http://code.activestate.com/recipes/576755-sorting-big-files-the-python-26-way/

> On Mon, Aug 1, 2011 at 8:33 AM, aliman <alimanfoo@googlemail.com> wrote:
>>
>> Hi all,
>>
>> Apologies I'm sure this has been asked many times, but I'm trying to
>> figure out the most efficient way to do a complex sort on very large
>> files.
>>
>> I've read the recipe at [1] and understand that the way to sort a
>> large file is to break it into chunks, sort each chunk and write
>> sorted chunks to disk, then use heapq.merge to combine the chunks as
>> you read them.
>>
>> What I'm having trouble figuring out is what to do when I want to sort
>> by one key ascending then another key descending (a "complex sort").
>>
>> I understand that sorts are stable, so I could just repeat the whole
>> sort process once for each key in turn, but that would involve going
>> to and from disk once for each step in the sort, and I'm wondering if
>> there is a better way.
>>
>> I also thought you could apply the complex sort to each chunk before
>> writing it to disk, so each chunk was completely sorted, but then the
>> heapq.merge wouldn't work properly, because afaik you can only give it
>> one key.
>>
>> Any help much appreciated (I may well be missing something glaringly
>> obvious).
>>
>> Cheers,
>>
>> Alistair
>>
>> [1]
>> http://code.activestate.com/recipes/576755-sorting-big-files-the-python-26-way/
>> --
>> http://mail.python.org/mailman/listinfo/python-list
>
>

-- 
--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Web: http://purl.org/net/aliman
Email: alimanfoo@gmail.com
Tel: +44 (0)1865 287669

[toc] | [prev] | [next] | [standalone]

#10840 — Re: python module to determine if a machine is idle/free

From	Chris Rebert <clp2@rebertia.com>
Date	2011-08-03 21:38 -0700
Subject	Re: python module to determine if a machine is idle/free
Message-ID	<mailman.1876.1312432716.1164.python-list@python.org>
In reply to	#10673

On Wed, Aug 3, 2011 at 9:06 PM, Danny Wong (dannwong)
<dannwong@cisco.com> wrote:
> Hi all,
>
>                 I have 5 server machines that are using to process
> information. I would like to write a quick server python script that
> determines which of the machines are not in use. Any recommendations on
> which python module I should use to detect if a machine is not performing
> idle (ex. Some specific task is not running)?

Yes, psutil:
http://code.google.com/p/psutil/

os.getloadavg() may or may not also be useful to you:
http://docs.python.org/library/os.html#os.getloadavg

Cheers,
Chris

[toc] | [prev] | [next] | [standalone]

#10946

From	sturlamolden <sturlamolden@yahoo.no>
Date	2011-08-05 18:31 -0700
Message-ID	<b1b79dee-edc5-4e20-9a6f-fc32c051add8@j15g2000yqf.googlegroups.com>
In reply to	#10673

On Aug 1, 5:33 pm, aliman <aliman...@googlemail.com> wrote:

> I understand that sorts are stable, so I could just repeat the whole
> sort process once for each key in turn, but that would involve going
> to and from disk once for each step in the sort, and I'm wondering if
> there is a better way.

I would consider using memory mapping the file and sorting it inline.
Sorting a binary file of bytes with NumPy is as easy as this:

import numpy as np
f = np.memmap(filename, mode='rwb', dtype=np.uint8)
f.sort(kind='quicksort')
del f

(You can define dtype for any C data type or struct.)

If the file is really big, use 64-bit Python.

With memory mapping you don't have to worry about processing the file
in chunks, because the operating
systems will take care of those details.

I am not sure how to achieve this (inline file sort) with standard
library mmap and timsort, so I'll leave that out.

Sturla

[toc] | [prev] | [next] | [standalone]

#10949

From	Roy Smith <roy@panix.com>
Date	2011-08-05 22:54 -0400
Message-ID	<roy-D51908.22540405082011@news.panix.com>
In reply to	#10946

Wow.

I was going to suggest using the unix command-line sort utility via 
popen() or subprocess.  My arguments were that it's written in C, has 30 
years of optimizing in it, etc, etc, etc.  It almost certainly has to be 
faster than anything you could do in Python.

Then I tried the experiment.  I generated a file of 1 million random 
integers in the range 0 to 5000.  I wrote a little sorting program:

numbers = [int(line) for line in open('numbers')]
numbers.sort()
for i in numbers:
    print i

and ran it on my MacBook Pro (8 Gig, 2 x 2.4 GHz cores), Python 2.6.1.

$ time ./sort.py  > py-sort
real  0m2.706s
user  0m2.491s
sys   0m0.057s

and did the same with the unix utility:

$ time sort -n numbers  > cli-sort
real  0m5.123s
user  0m4.745s
sys   0m0.063s

Python took just about half the time.  Certainly knocked my socks off.  
Hard to believe, actually.

[toc] | [prev] | [next] | [standalone]

#10951

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-08-06 13:30 +1000
Message-ID	<4e3cb55a$0$29985$c3e8da3$5496439d@news.astraweb.com>
In reply to	#10949

Roy Smith wrote:

> Wow.
> 
> I was going to suggest using the unix command-line sort utility via
> popen() or subprocess.  My arguments were that it's written in C, has 30
> years of optimizing in it, etc, etc, etc.  It almost certainly has to be
> faster than anything you could do in Python.
> 
> Then I tried the experiment.  I generated a file of 1 million random
> integers in the range 0 to 5000.  I wrote a little sorting program:
[...]
> Python took just about half the time.  Certainly knocked my socks off.
> Hard to believe, actually.

One million integers isn't very big. If each integer can fit in four-byte
long, that's less than 4MB. That's almost small enough to fit in your CPU's
cache, with room left over for the first few chapters of "War And Peace"
*wink*

So you're comparing Python's timsort, which is Awesome with a capital AWE
but only works on data that fits in memory, versus something which can also
work on files too big to fit into memory.

Try generating a twenty gigabyte file of data, and sort that. Actually,
don't, because just *reading it in* to Python will probably fail, and very
possibly lock your PC up for the duration.

Unix sort does an external R-Way merge sort: if you have more data than
memory, it slices the data up into a bunch of smaller pieces, each of which
will fit in memory, sorts each one to a temporary file, then merges the
lot. It does a lot more work on such big files, because it *takes* a lot
more work.

For something as small as one million numbers, chances are the Unix sort
falls back on a heapsort or a quicksort, which will be pretty fast, but it
ain't no timsort.

So yes, Python's timsort is awesome, but so is Unix's sort, just in
different ways.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#10981

From	sturlamolden <sturlamolden@yahoo.no>
Date	2011-08-06 10:53 -0700
Message-ID	<e20a4684-08b1-4a5f-a141-cd9fef807181@cf8g2000vbb.googlegroups.com>
In reply to	#10673

On Aug 1, 5:33 pm, aliman <aliman...@googlemail.com> wrote:

> I've read the recipe at [1] and understand that the way to sort a
> large file is to break it into chunks, sort each chunk and write
> sorted chunks to disk, then use heapq.merge to combine the chunks as
> you read them.

Or just memory map the file (mmap.mmap) and do an inline .sort() on
the bytearray (Python 3.2). With Python 2.7, use e.g. numpy.memmap
instead. If the file is large, use 64-bit Python. You don't have to
process the file in chunks as the operating system will take care of
those details.

Sturla

[toc] | [prev] | [next] | [standalone]

#11082

From	John Nagle <nagle@animats.com>
Date	2011-08-09 15:20 -0700
Message-ID	<4e41b2cd$0$2138$742ec2ed@news.sonic.net>
In reply to	#10981

On 8/6/2011 10:53 AM, sturlamolden wrote:
> On Aug 1, 5:33 pm, aliman<aliman...@googlemail.com>  wrote:
>
>> I've read the recipe at [1] and understand that the way to sort a
>> large file is to break it into chunks, sort each chunk and write
>> sorted chunks to disk, then use heapq.merge to combine the chunks as
>> you read them.
>
> Or just memory map the file (mmap.mmap) and do an inline .sort() on
> the bytearray (Python 3.2). With Python 2.7, use e.g. numpy.memmap
> instead. If the file is large, use 64-bit Python. You don't have to
> process the file in chunks as the operating system will take care of
> those details.
>
> Sturla

    No, no, no.  If the file is too big to fit in memory, trying to
page it will just cause thrashing as the file pages in and out from
disk.

    The UNIX sort program is probably good enough.  There are better
approaches, if you have many gigabytes to sort, (see Syncsort, which
is a commercial product) but few people need them.

				John Nagle

[toc] | [prev] | [standalone]

csiph-web

Complex sort on big files

Contents

#10673 — Complex sort on big files

#10675

#10712

#10840 — Re: python module to determine if a machine is idle/free

#10946

#10949

#10951

#10981

#11082