Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #10673 > unrolled thread
| Started by | aliman <alimanfoo@googlemail.com> |
|---|---|
| First post | 2011-08-01 08:33 -0700 |
| Last post | 2011-08-09 15:20 -0700 |
| Articles | 9 — 8 participants |
Back to article view | Back to comp.lang.python
Complex sort on big files aliman <alimanfoo@googlemail.com> - 2011-08-01 08:33 -0700
Re: Complex sort on big files Peter Otten <__peter__@web.de> - 2011-08-01 19:00 +0200
Re: Complex sort on big files Alistair Miles <alimanfoo@googlemail.com> - 2011-08-02 11:25 +0100
Re: python module to determine if a machine is idle/free Chris Rebert <clp2@rebertia.com> - 2011-08-03 21:38 -0700
Re: Complex sort on big files sturlamolden <sturlamolden@yahoo.no> - 2011-08-05 18:31 -0700
Re: Complex sort on big files Roy Smith <roy@panix.com> - 2011-08-05 22:54 -0400
Re: Complex sort on big files Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-08-06 13:30 +1000
Re: Complex sort on big files sturlamolden <sturlamolden@yahoo.no> - 2011-08-06 10:53 -0700
Re: Complex sort on big files John Nagle <nagle@animats.com> - 2011-08-09 15:20 -0700
| From | aliman <alimanfoo@googlemail.com> |
|---|---|
| Date | 2011-08-01 08:33 -0700 |
| Subject | Complex sort on big files |
| Message-ID | <062306c6-3ee6-43ce-936f-ca9cc0f013c9@eb9g2000vbb.googlegroups.com> |
Hi all, Apologies I'm sure this has been asked many times, but I'm trying to figure out the most efficient way to do a complex sort on very large files. I've read the recipe at [1] and understand that the way to sort a large file is to break it into chunks, sort each chunk and write sorted chunks to disk, then use heapq.merge to combine the chunks as you read them. What I'm having trouble figuring out is what to do when I want to sort by one key ascending then another key descending (a "complex sort"). I understand that sorts are stable, so I could just repeat the whole sort process once for each key in turn, but that would involve going to and from disk once for each step in the sort, and I'm wondering if there is a better way. I also thought you could apply the complex sort to each chunk before writing it to disk, so each chunk was completely sorted, but then the heapq.merge wouldn't work properly, because afaik you can only give it one key. Any help much appreciated (I may well be missing something glaringly obvious). Cheers, Alistair [1] http://code.activestate.com/recipes/576755-sorting-big-files-the-python-26-way/
[toc] | [next] | [standalone]
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2011-08-01 19:00 +0200 |
| Message-ID | <j16m24$4q4$1@solani.org> |
| In reply to | #10673 |
aliman wrote: > Apologies I'm sure this has been asked many times, but I'm trying to > figure out the most efficient way to do a complex sort on very large > files. > > I've read the recipe at [1] and understand that the way to sort a > large file is to break it into chunks, sort each chunk and write > sorted chunks to disk, then use heapq.merge to combine the chunks as > you read them. > > What I'm having trouble figuring out is what to do when I want to sort > by one key ascending then another key descending (a "complex sort"). > > I understand that sorts are stable, so I could just repeat the whole > sort process once for each key in turn, but that would involve going > to and from disk once for each step in the sort, and I'm wondering if > there is a better way. > > I also thought you could apply the complex sort to each chunk before > writing it to disk, so each chunk was completely sorted, but then the > heapq.merge wouldn't work properly, because afaik you can only give it > one key. You can make that key as complex as needed: >>> class Key(object): ... def __init__(self, obj): ... self.asc = obj[1] ... self.desc = obj[2] ... def __cmp__(self, other): ... return cmp(self.asc, other.asc) or -cmp(self.desc, other.desc) ... >>> sorted(["abc", "aba", "bbb", "aaa", "aab"], key=Key) ['aab', 'aaa', 'abc', 'bbb', 'aba'] See also http://docs.python.org/library/functools.html#functools.total_ordering
[toc] | [prev] | [next] | [standalone]
| From | Alistair Miles <alimanfoo@googlemail.com> |
|---|---|
| Date | 2011-08-02 11:25 +0100 |
| Message-ID | <mailman.1758.1312280708.1164.python-list@python.org> |
| In reply to | #10673 |
Hi Dan, Thanks for the reply. On Mon, Aug 1, 2011 at 5:45 PM, Dan Stromberg <drsalists@gmail.com> wrote: > > Python 2.x, or Python 3.x? Currently Python 2.x. > What are the types of your sort keys? Both numbers and strings. > If you're on 3.x and the key you need reversed is numeric, you can negate > the key. I did wonder about that. Would that not be doable also in Python 2.7, using sorted(key=...)? > If you're on 2.x, you can use an object with a __cmp__ method to compare > objects however you require. OK, right. Looking at the HowTo/Sorting again [1] and the bit about cmp_to_key, could you also achieve the same effect by returning a key with custom implementations of rich comparison functions? > You probably should timsort the chunks (which is the standard list_.sort() - > it's a very good in-memory sort), and then merge them afterward using the > merge step of merge sort. Yes, that's what I understood by the activestate recipe [2]. So I guess my question boils down to, how do you do the merge step for a complex sort? (Assuming each chunk had been completely sorted first.) Maybe the answer is also to construct a key with custom implementation of rich comparisons? Now I'm also wondering about the best way to sort each chunk. The examples in [1] of complex sorts suggest the best way to do it is to first sort by the secondary key, then sort by the primary key, relying on the stability of the sort to get the desired outcome. But would it not be better to call sorted() once, supplying a custom key function? (As an aside, at the end of the section in [1] on Sort Stability and Complex sorts, it says "The Timsort algorithm used in Python does multiple sorts efficiently because it can take advantage of any ordering already present in a dataset." - I get that that's true, but I don't see how that's relevant to this strategy for doing complex sorts. I.e., if you sort first by the secondary key, you don't get any ordering that's helpful when you subsequently sort by the primary key. ...?) (Sorry, another side question, I'm guessing reading a chunk of data into a list and using Timsort, i.e., calling list.sort() or sorted(mylist), is quicker than using bisect to keep the chunk sorted as you build it?) > heapq's not unreasonable for the merging, but I think it's more common to > use a short list. Do you mean a regular Python list, and calling min()? > I have a bunch of Python sorts at > http://stromberg.dnsalias.org/svn/sorts/compare/trunk/ - if you find your > need is specialized (EG, 3.x sorting by a secondary key that's a string, in > reverse), you could adapt one of these to do what you need. Thanks. Re 3.x sorting by a secondary key that's a string, in reverse, which one were you thinking of in particular? > heapq is not bad, but if you find you need a logn datastructure, you might > check out http://stromberg.dnsalias.org/~dstromberg/treap/ - a treap is also > logn, but has a very good amortized cost because it sacrifices keeping > things perfectly balanced (and rebalancing, and rebalancing...) to gain > speed. But still, you might be better off with a short list and min. Thanks, that's really helpful. Cheers, Alistair [1] http://wiki.python.org/moin/HowTo/Sorting/ [2] http://code.activestate.com/recipes/576755-sorting-big-files-the-python-26-way/ > On Mon, Aug 1, 2011 at 8:33 AM, aliman <alimanfoo@googlemail.com> wrote: >> >> Hi all, >> >> Apologies I'm sure this has been asked many times, but I'm trying to >> figure out the most efficient way to do a complex sort on very large >> files. >> >> I've read the recipe at [1] and understand that the way to sort a >> large file is to break it into chunks, sort each chunk and write >> sorted chunks to disk, then use heapq.merge to combine the chunks as >> you read them. >> >> What I'm having trouble figuring out is what to do when I want to sort >> by one key ascending then another key descending (a "complex sort"). >> >> I understand that sorts are stable, so I could just repeat the whole >> sort process once for each key in turn, but that would involve going >> to and from disk once for each step in the sort, and I'm wondering if >> there is a better way. >> >> I also thought you could apply the complex sort to each chunk before >> writing it to disk, so each chunk was completely sorted, but then the >> heapq.merge wouldn't work properly, because afaik you can only give it >> one key. >> >> Any help much appreciated (I may well be missing something glaringly >> obvious). >> >> Cheers, >> >> Alistair >> >> [1] >> http://code.activestate.com/recipes/576755-sorting-big-files-the-python-26-way/ >> -- >> http://mail.python.org/mailman/listinfo/python-list > > -- -- Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health <http://cggh.org> The Wellcome Trust Centre for Human Genetics Roosevelt Drive Oxford OX3 7BN United Kingdom Web: http://purl.org/net/aliman Email: alimanfoo@gmail.com Tel: +44 (0)1865 287669
[toc] | [prev] | [next] | [standalone]
| From | Chris Rebert <clp2@rebertia.com> |
|---|---|
| Date | 2011-08-03 21:38 -0700 |
| Subject | Re: python module to determine if a machine is idle/free |
| Message-ID | <mailman.1876.1312432716.1164.python-list@python.org> |
| In reply to | #10673 |
On Wed, Aug 3, 2011 at 9:06 PM, Danny Wong (dannwong) <dannwong@cisco.com> wrote: > Hi all, > > I have 5 server machines that are using to process > information. I would like to write a quick server python script that > determines which of the machines are not in use. Any recommendations on > which python module I should use to detect if a machine is not performing > idle (ex. Some specific task is not running)? Yes, psutil: http://code.google.com/p/psutil/ os.getloadavg() may or may not also be useful to you: http://docs.python.org/library/os.html#os.getloadavg Cheers, Chris
[toc] | [prev] | [next] | [standalone]
| From | sturlamolden <sturlamolden@yahoo.no> |
|---|---|
| Date | 2011-08-05 18:31 -0700 |
| Message-ID | <b1b79dee-edc5-4e20-9a6f-fc32c051add8@j15g2000yqf.googlegroups.com> |
| In reply to | #10673 |
On Aug 1, 5:33 pm, aliman <aliman...@googlemail.com> wrote: > I understand that sorts are stable, so I could just repeat the whole > sort process once for each key in turn, but that would involve going > to and from disk once for each step in the sort, and I'm wondering if > there is a better way. I would consider using memory mapping the file and sorting it inline. Sorting a binary file of bytes with NumPy is as easy as this: import numpy as np f = np.memmap(filename, mode='rwb', dtype=np.uint8) f.sort(kind='quicksort') del f (You can define dtype for any C data type or struct.) If the file is really big, use 64-bit Python. With memory mapping you don't have to worry about processing the file in chunks, because the operating systems will take care of those details. I am not sure how to achieve this (inline file sort) with standard library mmap and timsort, so I'll leave that out. Sturla
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2011-08-05 22:54 -0400 |
| Message-ID | <roy-D51908.22540405082011@news.panix.com> |
| In reply to | #10946 |
Wow.
I was going to suggest using the unix command-line sort utility via
popen() or subprocess. My arguments were that it's written in C, has 30
years of optimizing in it, etc, etc, etc. It almost certainly has to be
faster than anything you could do in Python.
Then I tried the experiment. I generated a file of 1 million random
integers in the range 0 to 5000. I wrote a little sorting program:
numbers = [int(line) for line in open('numbers')]
numbers.sort()
for i in numbers:
print i
and ran it on my MacBook Pro (8 Gig, 2 x 2.4 GHz cores), Python 2.6.1.
$ time ./sort.py > py-sort
real 0m2.706s
user 0m2.491s
sys 0m0.057s
and did the same with the unix utility:
$ time sort -n numbers > cli-sort
real 0m5.123s
user 0m4.745s
sys 0m0.063s
Python took just about half the time. Certainly knocked my socks off.
Hard to believe, actually.
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2011-08-06 13:30 +1000 |
| Message-ID | <4e3cb55a$0$29985$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #10949 |
Roy Smith wrote: > Wow. > > I was going to suggest using the unix command-line sort utility via > popen() or subprocess. My arguments were that it's written in C, has 30 > years of optimizing in it, etc, etc, etc. It almost certainly has to be > faster than anything you could do in Python. > > Then I tried the experiment. I generated a file of 1 million random > integers in the range 0 to 5000. I wrote a little sorting program: [...] > Python took just about half the time. Certainly knocked my socks off. > Hard to believe, actually. One million integers isn't very big. If each integer can fit in four-byte long, that's less than 4MB. That's almost small enough to fit in your CPU's cache, with room left over for the first few chapters of "War And Peace" *wink* So you're comparing Python's timsort, which is Awesome with a capital AWE but only works on data that fits in memory, versus something which can also work on files too big to fit into memory. Try generating a twenty gigabyte file of data, and sort that. Actually, don't, because just *reading it in* to Python will probably fail, and very possibly lock your PC up for the duration. Unix sort does an external R-Way merge sort: if you have more data than memory, it slices the data up into a bunch of smaller pieces, each of which will fit in memory, sorts each one to a temporary file, then merges the lot. It does a lot more work on such big files, because it *takes* a lot more work. For something as small as one million numbers, chances are the Unix sort falls back on a heapsort or a quicksort, which will be pretty fast, but it ain't no timsort. So yes, Python's timsort is awesome, but so is Unix's sort, just in different ways. -- Steven
[toc] | [prev] | [next] | [standalone]
| From | sturlamolden <sturlamolden@yahoo.no> |
|---|---|
| Date | 2011-08-06 10:53 -0700 |
| Message-ID | <e20a4684-08b1-4a5f-a141-cd9fef807181@cf8g2000vbb.googlegroups.com> |
| In reply to | #10673 |
On Aug 1, 5:33 pm, aliman <aliman...@googlemail.com> wrote: > I've read the recipe at [1] and understand that the way to sort a > large file is to break it into chunks, sort each chunk and write > sorted chunks to disk, then use heapq.merge to combine the chunks as > you read them. Or just memory map the file (mmap.mmap) and do an inline .sort() on the bytearray (Python 3.2). With Python 2.7, use e.g. numpy.memmap instead. If the file is large, use 64-bit Python. You don't have to process the file in chunks as the operating system will take care of those details. Sturla
[toc] | [prev] | [next] | [standalone]
| From | John Nagle <nagle@animats.com> |
|---|---|
| Date | 2011-08-09 15:20 -0700 |
| Message-ID | <4e41b2cd$0$2138$742ec2ed@news.sonic.net> |
| In reply to | #10981 |
On 8/6/2011 10:53 AM, sturlamolden wrote:
> On Aug 1, 5:33 pm, aliman<aliman...@googlemail.com> wrote:
>
>> I've read the recipe at [1] and understand that the way to sort a
>> large file is to break it into chunks, sort each chunk and write
>> sorted chunks to disk, then use heapq.merge to combine the chunks as
>> you read them.
>
> Or just memory map the file (mmap.mmap) and do an inline .sort() on
> the bytearray (Python 3.2). With Python 2.7, use e.g. numpy.memmap
> instead. If the file is large, use 64-bit Python. You don't have to
> process the file in chunks as the operating system will take care of
> those details.
>
> Sturla
No, no, no. If the file is too big to fit in memory, trying to
page it will just cause thrashing as the file pages in and out from
disk.
The UNIX sort program is probably good enough. There are better
approaches, if you have many gigabytes to sort, (see Syncsort, which
is a commercial product) but few people need them.
John Nagle
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web