Groups > comp.lang.python > #61630 > unrolled thread

Optimizing list processing

Started by	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
First post	2013-12-11 23:54 +0000
Last post	2013-12-12 13:07 -0500
Articles	19 — 9 participants

Back to article view | Back to comp.lang.python

  Optimizing list processing Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-11 23:54 +0000
    Re: Optimizing list processing MRAB <python@mrabarnett.plus.com> - 2013-12-12 00:59 +0000
      Re: Optimizing list processing Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-12 01:43 +0000
        Re: Optimizing list processing MRAB <python@mrabarnett.plus.com> - 2013-12-12 02:09 +0000
    Re: Optimizing list processing duncan smith <buzzard@invalid.invalid> - 2013-12-12 01:02 +0000
    Re: Optimizing list processing Ben Finney <ben+python@benfinney.id.au> - 2013-12-12 12:18 +1100
    Re: Optimizing list processing Terry Reedy <tjreedy@udel.edu> - 2013-12-11 21:26 -0500
      Re: Optimizing list processing Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-12 12:08 +0000
        Re: Optimizing list processing Chris Angelico <rosuav@gmail.com> - 2013-12-12 23:25 +1100
        Re: Optimizing list processing MRAB <python@mrabarnett.plus.com> - 2013-12-12 13:32 +0000
        Re: Optimizing list processing Chris Angelico <rosuav@gmail.com> - 2013-12-13 01:06 +1100
        Re: Optimizing list processing Terry Reedy <tjreedy@udel.edu> - 2013-12-12 13:40 -0500
          Re: Optimizing list processing Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-13 00:14 +0000
            Re: Optimizing list processing Chris Angelico <rosuav@gmail.com> - 2013-12-13 12:01 +1100
    Re: Optimizing list processing Stefan Behnel <stefan_ml@behnel.de> - 2013-12-12 12:09 +0100
    Re: Optimizing list processing Peter Otten <__peter__@web.de> - 2013-12-12 16:08 +0100
      Re: Optimizing list processing Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-13 03:01 +0000
        Re: Optimizing list processing rusi <rustompmody@gmail.com> - 2013-12-12 21:35 -0800
    Re: Optimizing list processing Terry Reedy <tjreedy@udel.edu> - 2013-12-12 13:07 -0500

#61630 — Optimizing list processing

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-12-11 23:54 +0000
Subject	Optimizing list processing
Message-ID	<52a8fb2d$0$29992$c3e8da3$5496439d@news.astraweb.com>

I have some code which produces a list from an iterable using at least 
one temporary list, using a Decorate-Sort-Undecorate idiom. The algorithm 
looks something like this (simplified):

table = sorted([(x, i) for i,x in enumerate(iterable)])
table = [i for x,i in table]

The problem here is that for large iterables, say 10 million items or so, 
this is *painfully* slow, as my system has to page memory like mad to fit 
two large lists into memory at once. So I came up with an in-place 
version that saves (approximately) two-thirds of the memory needed.

table = [(x, i) for i,x in enumerate(iterable)]
table.sort()
for x, i in table:
    table[i] = x

For giant iterables (ten million items), this version is a big 
improvement, about three times faster than the list comp version. Since 
we're talking about the difference between 4 seconds and 12 seconds (plus 
an additional 40-80 seconds of general slow-down as the computer pages 
memory into and out of virtual memory), this is a good, solid 
optimization.

Except that for more reasonably sized iterables, it's a pessimization. 
With one million items, the ratio is the other way around: the list comp 
version is 2-3 times faster than the in-place version. For smaller lists, 
the ratio varies, but the list comp version is typically around twice as 
fast. A good example of trading memory for time.

So, ideally I'd like to write my code like this:


table = [(x, i) for i,x in enumerate(iterable)]
table.sort()
if len(table) < ?????:
    table = [i for x,i in table]
else:
    for x, i in table:
        table[i] = x

where ????? no doubt will depend on how much memory is available in one 
contiguous chunk.

Is there any way to determine which branch I should run, apart from hard-
coding some arbitrary and constant cut-off value?



-- 
Steven

[toc] | [next] | [standalone]

#61632

From	MRAB <python@mrabarnett.plus.com>
Date	2013-12-12 00:59 +0000
Message-ID	<mailman.3942.1386809980.18130.python-list@python.org>
In reply to	#61630

On 11/12/2013 23:54, Steven D'Aprano wrote:
> I have some code which produces a list from an iterable using at least
> one temporary list, using a Decorate-Sort-Undecorate idiom. The algorithm
> looks something like this (simplified):
>
> table = sorted([(x, i) for i,x in enumerate(iterable)])
> table = [i for x,i in table]
>
> The problem here is that for large iterables, say 10 million items or so,
> this is *painfully* slow, as my system has to page memory like mad to fit
> two large lists into memory at once. So I came up with an in-place
> version that saves (approximately) two-thirds of the memory needed.
>
> table = [(x, i) for i,x in enumerate(iterable)]
> table.sort()

This looks wrong to me:

> for x, i in table:
>      table[i] = x

Couldn't it replace an item it'll need later on?

Let me see if I can find an example where it would fail.

Start with:

 >>> table = [('b', 0), ('a', 1)]

Sort it and you get:

 >>> table.sort()
 >>> table
[('a', 1), ('b', 0)]

Run that code:

 >>> for x, i in table:
	table[i] = x

	
Traceback (most recent call last):
   File "<pyshell#18>", line 1, in <module>
     for x, i in table:
ValueError: need more than 1 value to unpack

Why did it fail?

 >>> table
[('a', 1), 'a']

The 2 methods give different results anyway: the first returns a list
of indexes, and the second returns a list of items from the iterable.

[snip]

[toc] | [prev] | [next] | [standalone]

#61644

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-12-12 01:43 +0000
Message-ID	<52a914c3$0$29992$c3e8da3$5496439d@news.astraweb.com>
In reply to	#61632

On Thu, 12 Dec 2013 00:59:42 +0000, MRAB wrote:

>> table = [(x, i) for i,x in enumerate(iterable)] 
>> table.sort()
> 
> This looks wrong to me:
> 
>> for x, i in table:
>>      table[i] = x

Yes, you're right, I over-simplified the example, and in doing so 
introduced a bug. What I actually use is:

for i, x in enumerate(table):
    table[i] = x[1]

modulo any further bugs introduced while copying and pasting :-)


-- 
Steven

[toc] | [prev] | [next] | [standalone]

#61648

From	MRAB <python@mrabarnett.plus.com>
Date	2013-12-12 02:09 +0000
Message-ID	<mailman.3951.1386814181.18130.python-list@python.org>
In reply to	#61644

On 12/12/2013 01:43, Steven D'Aprano wrote:
> On Thu, 12 Dec 2013 00:59:42 +0000, MRAB wrote:
>
>>> table = [(x, i) for i,x in enumerate(iterable)]
>>> table.sort()
>>
>> This looks wrong to me:
>>
>>> for x, i in table:
>>>      table[i] = x
>
> Yes, you're right, I over-simplified the example, and in doing so
> introduced a bug. What I actually use is:
>
> for i, x in enumerate(table):
>      table[i] = x[1]
>
> modulo any further bugs introduced while copying and pasting :-)
>
How does this compare:

table = list(iterable)
indexes = list(range(len(table)))
indexes.sort(key=table.__getitem__)

[toc] | [prev] | [next] | [standalone]

#61634

From	duncan smith <buzzard@invalid.invalid>
Date	2013-12-12 01:02 +0000
Message-ID	<52a90b21$0$47892$862e30e2@ngroups.net>
In reply to	#61630

On 11/12/13 23:54, Steven D'Aprano wrote:
> I have some code which produces a list from an iterable using at least
> one temporary list, using a Decorate-Sort-Undecorate idiom. The algorithm
> looks something like this (simplified):
>
> table = sorted([(x, i) for i,x in enumerate(iterable)])
> table = [i for x,i in table]
>
> The problem here is that for large iterables, say 10 million items or so,
> this is *painfully* slow, as my system has to page memory like mad to fit
> two large lists into memory at once. So I came up with an in-place
> version that saves (approximately) two-thirds of the memory needed.
>
> table = [(x, i) for i,x in enumerate(iterable)]
> table.sort()
> for x, i in table:
>      table[i] = x
>
> For giant iterables (ten million items), this version is a big
> improvement, about three times faster than the list comp version. Since
> we're talking about the difference between 4 seconds and 12 seconds (plus
> an additional 40-80 seconds of general slow-down as the computer pages
> memory into and out of virtual memory), this is a good, solid
> optimization.
>
> Except that for more reasonably sized iterables, it's a pessimization.
> With one million items, the ratio is the other way around: the list comp
> version is 2-3 times faster than the in-place version. For smaller lists,
> the ratio varies, but the list comp version is typically around twice as
> fast. A good example of trading memory for time.
>
> So, ideally I'd like to write my code like this:
>
>
> table = [(x, i) for i,x in enumerate(iterable)]
> table.sort()
> if len(table) < ?????:
>      table = [i for x,i in table]
> else:
>      for x, i in table:
>          table[i] = x
>
> where ????? no doubt will depend on how much memory is available in one
> contiguous chunk.
>
> Is there any way to determine which branch I should run, apart from hard-
> coding some arbitrary and constant cut-off value?
>

I had a slightly similar problem a while ago. I actually wanted to 
process data from a large file in sorted order. In the end I read chunks 
of data from the file, sorted them, then wrote each chunk of data to a 
temporary file. Then I used heapq.merge to merge the data in the 
temporary files. It vastly reduced memory consumption, and was 'quick 
enough'. It was based on Guido's solution for sorting a million 32-bit 
integers in 2MB of RAM 
(http://neopythonic.blogspot.co.uk/2008/10/sorting-million-32-bit-integers-in-2mb.html). 
Cheers.

Duncan

[toc] | [prev] | [next] | [standalone]

#61638

From	Ben Finney <ben+python@benfinney.id.au>
Date	2013-12-12 12:18 +1100
Message-ID	<mailman.3947.1386811122.18130.python-list@python.org>
In reply to	#61630

Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:

> For giant iterables (ten million items), this version is a big 
> improvement, about three times faster than the list comp version. […]
>
> Except that for more reasonably sized iterables, it's a pessimization.
> With one million items, the ratio is the other way around: the list
> comp version is 2-3 times faster than the in-place version. For
> smaller lists, the ratio varies, but the list comp version is
> typically around twice as fast. A good example of trading memory for
> time.

> Is there any way to determine which branch I should run, apart from
> hard- coding some arbitrary and constant cut-off value?

Hmm. The code isn't going to be able to accurately judge in advance the
time it'll take to run. But perhaps it could quickly and accurately
calculate the memory usage of your data structure? Is that useful for
determining which branch to take?

-- 
 \      “The fact that a believer is happier than a skeptic is no more |
  `\   to the point than the fact that a drunken man is happier than a |
_o__)                                 sober one.” —George Bernard Shaw |
Ben Finney

[toc] | [prev] | [next] | [standalone]

#61650

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-12-11 21:26 -0500
Message-ID	<mailman.3953.1386815228.18130.python-list@python.org>
In reply to	#61630

On 12/11/2013 6:54 PM, Steven D'Aprano wrote:
> I have some code which produces a list from an iterable using at least
> one temporary list, using a Decorate-Sort-Undecorate idiom.

It is a non-standard version thereof, as DSU usually means to decorate 
with a key that gets discarded.

A couple of examples of input and expected output would have been good ;-).

> The algorithm looks something like this (simplified):
>
> table = sorted([(x, i) for i,x in enumerate(iterable)])

This makes two temporaries when only one is needed, and shows why we 
added generator expressions.

table = sorted((x, i) for i,x in enumerate(iterable))
is equivalent to the table; table.sort lines below.

The following (untested) should be faster as it avoids tuple unpacking 
and repacking.

from itertools import count
table = sorted(t for t in zip(iterable, count))

 > table = [i for x,i in table]

Did your original un-simplified use zip instead enumerate? Table now has 
the original index of the items in iterable sorted by the items value. 
The use for this is not obvious.

> The problem here is that for large iterables, say 10 million items or so,
> this is *painfully* slow, as my system has to page memory like mad to fit
> two large lists into memory at once. So I came up with an in-place
> version that saves (approximately) two-thirds of the memory needed.

With 1/3 saved by using a genex, it saves 1/2 of the remainder.

> table = [(x, i) for i,x in enumerate(iterable)]
> table.sorted()

(You meant table.sort().)  This is an expansion of sorted(genex). It 
might be slightly faster as list comp may be faster than list(equivalent 
genex).

> for x, i in table:
>      table[i] = x

I cannot understand what you are aiming for here. Besides the fact that 
this does not work, it keeps x values rather than i indexes as before.

 > table = [i for x,i in table]

done in place is

for j, (x,i) in enumerate(table):
   table[j] = i

I expect the list comp be faster than in-place as long as the result 
list can be allocated and held in memory without paging. (This of course 
depends on system memory and other memory uses.)  List iterators have a 
__length_hint__ method giving the length of the underlying list, so the 
list comp result list can potentially be allocated once and then filled 
in by enumeration and replacement, but in C rather than Python code.

-- 
Terry Jan Reedy

[toc] | [prev] | [next] | [standalone]

#61701

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-12-12 12:08 +0000
Message-ID	<52a9a75a$0$29992$c3e8da3$5496439d@news.astraweb.com>
In reply to	#61650

On Wed, 11 Dec 2013 21:26:51 -0500, Terry Reedy wrote:

> On 12/11/2013 6:54 PM, Steven D'Aprano wrote:
>> I have some code which produces a list from an iterable using at least
>> one temporary list, using a Decorate-Sort-Undecorate idiom.
> 
> It is a non-standard version thereof, as DSU usually means to decorate
> with a key that gets discarded.

I do decorate with a key that gets discarded. The actual values that are 
being sorted are the enumerated values 0, 1, 2, ... and the temporary key 
values come from the iterable.

Confused yet? :-)

> A couple of examples of input and expected output would have been good
> ;-).

Only if I wanted you to understand the algorithm, which isn't really 
important. My question is about swapping between in-place modifications 
and external copying. If I wanted to be really mysterious (annoying), I 
could have just give pseudo-code :-)

Please don't focus on the algorithm I gave. Focus on the fact that I 
could write it like this:

if some condition to do with the computer's available memory:
     make modifications in place
else:
     make a copy of the data containing the modifications

if only I knew what that condition was and how to test for it.

>> The algorithm looks something like this (simplified):
>>
>> table = sorted([(x, i) for i,x in enumerate(iterable)])
> 
> This makes two temporaries when only one is needed, and shows why we
> added generator expressions.

You may be right in this particular case, I don't know I haven't tried 
it, but there are cases where list comps are significantly faster than 
generator expressions. For example, when passed to str.join, list comps 
are (on my computer) consistently 15% faster:

[steve@ando ~]$ python3.3 -m timeit "''.join([chr(n) for n in range(1, 
201)])"
10000 loops, best of 3: 100 usec per loop
[steve@ando ~]$ python3.3 -m timeit "''.join([chr(n) for n in range(1, 
201)])"
10000 loops, best of 3: 94.1 usec per loop
[steve@ando ~]$ python3.3 -m timeit "''.join(chr(n) for n in range(1, 
201))"
10000 loops, best of 3: 116 usec per loop
[steve@ando ~]$ python3.3 -m timeit "''.join(chr(n) for n in range(1, 
201))"
10000 loops, best of 3: 109 usec per loop

> table = sorted((x, i) for i,x in enumerate(iterable)) is equivalent to
> the table; table.sort lines below.
> 
> The following (untested) should be faster as it avoids tuple unpacking
> and repacking.

I'm not terribly fussed about micro-optimizations here. I'm concerned 
about *macro-optimizing* the case where creating two (or more) lists 
forces the computer to page memory in and out of virtual memory. Saving a 
few milliseconds? I don't care. Saving 5 or 6 seconds? That I care about.

[...]
> I expect the list comp be faster than in-place as long as the result
> list can be allocated and held in memory without paging. (This of course
> depends on system memory and other memory uses.)  

Exactly!

> List iterators have a
> __length_hint__ method giving the length of the underlying list, so the
> list comp result list can potentially be allocated once and then filled
> in by enumeration and replacement, but in C rather than Python code.

I don't want to pre-allocate the list comp. I want to avoid having to 
have two LARGE lists in memory at the same time, one containing the 
decorated values, and one not. I know that this is a good optimization 
for large enough lists. On my computer, ten million items is sufficient 
to demonstrate the optimization, and with sufficient testing I could 
determine a rough cut-off value, below which list comps are more 
efficient and above which in-place modifications are better.

But I don't know how to generalize that cut-off value. If I buy extra 
RAM, the cut-off value will shift. If I run it on another computer, it 
will shift.

P.S. The algorithm I'm working on is a way of generating index and rank 
tables. Not that it really matters -- what matters is determining whether 
or not to shift from "make a copy of the list" to "modify the list in 
place".

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#61703

From	Chris Angelico <rosuav@gmail.com>
Date	2013-12-12 23:25 +1100
Message-ID	<mailman.3984.1386851879.18130.python-list@python.org>
In reply to	#61701

On Thu, Dec 12, 2013 at 11:08 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> P.S. The algorithm I'm working on is a way of generating index and rank
> tables. Not that it really matters -- what matters is determining whether
> or not to shift from "make a copy of the list" to "modify the list in
> place".

So you're currently looking at...

if len(table) < ?????:
    table = [i for x,i in table]
else:
    for x, i in table:
        table[i] = x

Can I throw a spanner [1] in the works with other suggestions to try timing?

table[:] = [i for x,i in table]  # Does slice assignment get optimized?

SPLIT = 1048576  # Pick some useful cutoff
for part in range(0,len(table),SPLIT):
    table[part:part+SPLIT] = [i for x,i in table[part:part+SPLIT]]

If slice assignment is reasonably fast (which I suspect it is), the
one-liner should be practically identical to your small-table
one-liner. Then if the million-record splits can be done inside
memory, it ought to be possible to do this in comparable time, even if
the total table length is huge. The choice of SPLIT would then matter
a lot less than the cutoff that you're trying to find; you might have
been able to do it in half the number of sections, but that won't make
as big a difference as suddenly paging out. Ideally, what I'd like to
see is that a higher SPLIT improves performance slightly until it gets
too high, at which point you go bust and the dealer wins... but the
critical word there is "slightly", meaning that it wouldn't cost too
much for SPLIT to be lower than it needs to be. That's the criteria
for the experiment; do you have the data on which to try it?

[1] Monkey wrench, for the Yanks in the room

ChrisA

[toc] | [prev] | [next] | [standalone]

#61706

From	MRAB <python@mrabarnett.plus.com>
Date	2013-12-12 13:32 +0000
Message-ID	<mailman.3986.1386855129.18130.python-list@python.org>
In reply to	#61701

On 12/12/2013 12:25, Chris Angelico wrote:
> On Thu, Dec 12, 2013 at 11:08 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> P.S. The algorithm I'm working on is a way of generating index and rank
>> tables. Not that it really matters -- what matters is determining whether
>> or not to shift from "make a copy of the list" to "modify the list in
>> place".
>
> So you're currently looking at...
>
> if len(table) < ?????:
>      table = [i for x,i in table]
> else:
>      for x, i in table:
>          table[i] = x
>
>
> Can I throw a spanner [1] in the works with other suggestions to try timing?
>
> table[:] = [i for x,i in table]  # Does slice assignment get optimized?
>
[snip]

If you're trying that, you could also try:

table[:] = (i for x,i in table)

[toc] | [prev] | [next] | [standalone]

#61713

From	Chris Angelico <rosuav@gmail.com>
Date	2013-12-13 01:06 +1100
Message-ID	<mailman.3991.1386857197.18130.python-list@python.org>
In reply to	#61701

On Fri, Dec 13, 2013 at 12:32 AM, MRAB <python@mrabarnett.plus.com> wrote:
>> table[:] = [i for x,i in table]  # Does slice assignment get optimized?
>>
> [snip]
>
> If you're trying that, you could also try:
>
> table[:] = (i for x,i in table)

Right, a tweak which could be applied also to the split version. Thanks MRAB.

ChrisA

[toc] | [prev] | [next] | [standalone]

#61739

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-12-12 13:40 -0500
Message-ID	<mailman.4010.1386873654.18130.python-list@python.org>
In reply to	#61701

On 12/12/2013 7:08 AM, Steven D'Aprano wrote:

> Please don't focus on the algorithm I gave. Focus on the fact that I
> could write it like this:
>
> if some condition to do with the computer's available memory:
>       make modifications in place
> else:
>       make a copy of the data containing the modifications
>
> if only I knew what that condition was and how to test for it.

In other words, you want a magic formula that depends on just about 
everything: the (static) memory in the machine, the other programs 
running, the memory otherwise used by the current program, the number 
items, the nature of the items, and possibly the memory fragmentation state.

> You may be right in this particular case, I don't know I haven't tried
> it, but there are cases where list comps are significantly faster than
> generator expressions.

Stefan noted that neither is needed if zip produces the items wanted.

> I'm not terribly fussed about micro-optimizations here. I'm concerned
> about *macro-optimizing* the case where creating two (or more) lists
> forces the computer to page memory in and out of virtual memory. Saving a
> few milliseconds? I don't care. Saving 5 or 6 seconds? That I care about.

Why would you spend an hour or more of your time to save 5 seconds? 
Theoretical interest or a practical problem with enough runs to turns 
seconds into hours or days? A non-trivial practical answer will require 
more info about the practical context, including the distribution of 
machine memory sizes and of problem sizes.

> I don't want to pre-allocate the list comp. I want to avoid having to
> have two LARGE lists in memory at the same time, one containing the
> decorated values, and one not. I know that this is a good optimization
> for large enough lists. On my computer, ten million items is sufficient
> to demonstrate the optimization, and with sufficient testing I could
> determine a rough cut-off value, below which list comps are more
> efficient and above which in-place modifications are better.
>
> But I don't know how to generalize that cut-off value. If I buy extra
> RAM, the cut-off value will shift. If I run it on another computer, it
> will shift.

The simplest answer is to always avoid catastrophe by always modifying 
the one list. For 'short' lists, the time difference will be relatively 
small.

If the fraction of short lists is large enough to make the cumulative 
time difference large enough, and the list object sizes are more or less 
constant or at least bounded, a possible answer is to pick a minimum 
memory size, get a machine with that size, and work out a list size 
small enough to avoid thrashing.

-- 
Terry Jan Reedy

[toc] | [prev] | [next] | [standalone]

#61777

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-12-13 00:14 +0000
Message-ID	<52aa5149$0$29992$c3e8da3$5496439d@news.astraweb.com>
In reply to	#61739

On Thu, 12 Dec 2013 13:40:36 -0500, Terry Reedy wrote:

> On 12/12/2013 7:08 AM, Steven D'Aprano wrote:
> 
>> Please don't focus on the algorithm I gave. Focus on the fact that I
>> could write it like this:
>>
>> if some condition to do with the computer's available memory:
>>       make modifications in place
>> else:
>>       make a copy of the data containing the modifications
>>
>> if only I knew what that condition was and how to test for it.
> 
> In other words, you want a magic formula that depends on just about
> everything: the (static) memory in the machine, the other programs
> running, the memory otherwise used by the current program, the number
> items, the nature of the items, and possibly the memory fragmentation
> state.

Yes. Surely there's a way to check whether allocating a list of N items 
will succeed without paging, without actually having to allocate a list 
of N items?

Back in the 1980s, when I was a Mac user and occasional programmer, there 
were memory manager routines which (if I recall correctly) would tell you 
whether or not an allocation would succeed or not. Macs of that vintage 
didn't have virtual memory, it is true, but the principle remains sound: 
"is there a single contiguous block of N bytes available? if so, use a 
list comprehension, if not, switch to an in-place algorithm". Until 
Python, I had never used a programming language that didn't give you at 
least a rough-and-ready way to determine how much memory was available so 
you could make an informed guess as to whether an expensive operation 
would succeed before actually attempting it.

I will be very disappointed (but hardly surprised) to learn that 
functionality which was standard in 1984 computers is not possible in 
2013.

But I don't *specifically* care about measuring memory size, only as a 
proxy for whether or not I should switch algorithms. That's why this post 
is called "Optimizing list processing" rather than "Finding out how much 
memory is available". If there's another way to solve the same problem, 
I'm happy to hear it.

[...]
>> I'm not terribly fussed about micro-optimizations here. I'm concerned
>> about *macro-optimizing* the case where creating two (or more) lists
>> forces the computer to page memory in and out of virtual memory. Saving
>> a few milliseconds? I don't care. Saving 5 or 6 seconds? That I care
>> about.
> 
> Why would you spend an hour or more of your time to save 5 seconds?

For a number of reasons:

- Trading off increased memory use for faster code is a good optimization 
technique so long as it actually does result in faster code. I measured, 
and discovered that there comes a point where it results in serious 
performance degradation. Not just a little bit slower by 5% or 10%, but 
nearly 300% slower. We're told not to optimize until we've measured. Well 
I've measured. Swapping algorithms to in-place for large enough lists is 
a BIG time saver.

- It's not just the five seconds that I save, but the 30 or 50 seconds of 
general slowdown of *everything* on the computer as it frantically pages 
blocks of memory. And not just on *this* computer, if I'm running the 
code on a server it will effect all the other machines relying on that 
server. That five seconds of time for a single process may affect all the 
users of that machine for a minute or two before the system settles down 
to normal performance again.

Paging is *deadly* for performance -- moving memory around is typically 
O(N**2), and hard drive speeds are typically millions of times slower 
than RAM speeds. Here is a graphical view of the latency of various media:

http://i.imgur.com/X1Hi1.gif

[...]
> The simplest answer is to always avoid catastrophe by always modifying
> the one list. For 'short' lists, the time difference will be relatively
> small.

It seems perverse to avoid writing idiomatic, and usually faster, Python 
code just because occasionally it performs badly. What you describe is a 
last resort.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#61779

From	Chris Angelico <rosuav@gmail.com>
Date	2013-12-13 12:01 +1100
Message-ID	<mailman.4042.1386896504.18130.python-list@python.org>
In reply to	#61777

On Fri, Dec 13, 2013 at 11:14 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Back in the 1980s, when I was a Mac user and occasional programmer, there
> were memory manager routines which (if I recall correctly) would tell you
> whether or not an allocation would succeed or not. Macs of that vintage
> didn't have virtual memory, it is true, but the principle remains sound:
> "is there a single contiguous block of N bytes available? if so, use a
> list comprehension, if not, switch to an in-place algorithm". Until
> Python, I had never used a programming language that didn't give you at
> least a rough-and-ready way to determine how much memory was available so
> you could make an informed guess as to whether an expensive operation
> would succeed before actually attempting it.
>
> I will be very disappointed (but hardly surprised) to learn that
> functionality which was standard in 1984 computers is not possible in
> 2013.

The situation is a bit different. Virtualization of memory makes it a
lot harder to figure out what can/can't be allocated without paging.
As long as you're okay with it being really rough, you could probably
get some info from the system. On my Linux box, /proc/meminfo tells
me:

MemTotal:       16452036 kB
MemFree:         1689636 kB
Buffers:          251240 kB
Cached:          6228724 kB

That suggests I can allocate 1.5GB + .25GB + 6.25GB = 8GB, roughly,
before hitting the swapper. (Buffers and Cached represent memory that
will be reclaimed to satisfy other requests, but is being used - most
of it caching stuff off the primary filesystem.) In very VERY round
figures, that would be how much it ought to be safe to allocate. Now,
how that compares with your list requirements is a separate question.
I've no idea how many list elements would fit in that 8GB, but it
probably won't be a billion (which would be the theoretical, one
64-bit pointer per element).

This might tie in nicely with the theory I suggested about splitting
the job. You could pick a SPLIT by figuring how much memory is free,
dividing that by your pointer size or other estimate of usage, halving
it because you need two lists, and then halving again for safety.

On this same 64-bit system, sys.getsizeof(list(range(N))) for various
values of N seems to suggest that it approaches 9*N. So by the
guesstimate above, I come up with a plausible SPLIT of 226933 ==
(1689636+251240+6228724)//9//2//2, not forgetting that this is scaled
by one SI unit (so 226 million elements would be safe). Alas, that's
not true, because I just tried that, and he hit the swapper... even
though sys.getsizeof() returned the expected N*9+112 figure.

Which is because I'm an idiot. Of course. The reason it's more is
because I'm creating all those integers too... actually, I'm quite
impressed with that, that's costing almost nothing more.

>>> sys.getsizeof(list(range(100000000)))
900000112
>>> sys.getsizeof([None]*100000000)
800000064

Not quite sure what's going on there. I know those integers are taking
up more than 1 byte of storage. Guess getsizeof() isn't the best way
to calculate. Anyway, a bit of experimentation should tell you how
much you can do safely, and as long as there's a good margin of error,
this would be an extremely rough guide.

Now, getting that info in a cross-platform way... that's the problem!

ChrisA

[toc] | [prev] | [next] | [standalone]

#61697

From	Stefan Behnel <stefan_ml@behnel.de>
Date	2013-12-12 12:09 +0100
Message-ID	<mailman.3980.1386846588.18130.python-list@python.org>
In reply to	#61630

Terry Reedy, 12.12.2013 03:26:
> from itertools import count
> table = sorted(t for t in zip(iterable, count))

This is a useless use of a generator expression. sorted(zip(...)) is enough
(and likely to be substantially faster also).

Stefan

[toc] | [prev] | [next] | [standalone]

#61726

From	Peter Otten <__peter__@web.de>
Date	2013-12-12 16:08 +0100
Message-ID	<mailman.3999.1386860910.18130.python-list@python.org>
In reply to	#61630

Steven D'Aprano wrote:

> I have some code which produces a list from an iterable using at least
> one temporary list, using a Decorate-Sort-Undecorate idiom. The algorithm
> looks something like this (simplified):
> 
> table = sorted([(x, i) for i,x in enumerate(iterable)])
> table = [i for x,i in table]
> 
> The problem here is that for large iterables, say 10 million items or so,
> this is *painfully* slow, as my system has to page memory like mad to fit
> two large lists into memory at once. So I came up with an in-place
> version that saves (approximately) two-thirds of the memory needed.
> 
> table = [(x, i) for i,x in enumerate(iterable)]
> table.sort()
> for x, i in table:
>     table[i] = x
> 
> For giant iterables (ten million items), this version is a big
> improvement, about three times faster than the list comp version. Since
> we're talking about the difference between 4 seconds and 12 seconds (plus
> an additional 40-80 seconds of general slow-down as the computer pages
> memory into and out of virtual memory), this is a good, solid
> optimization.
> 
> Except that for more reasonably sized iterables, it's a pessimization.
> With one million items, the ratio is the other way around: the list comp
> version is 2-3 times faster than the in-place version. For smaller lists,
> the ratio varies, but the list comp version is typically around twice as
> fast. A good example of trading memory for time.
> 
> So, ideally I'd like to write my code like this:
> 
> 
> table = [(x, i) for i,x in enumerate(iterable)]
> table.sort()
> if len(table) < ?????:
>     table = [i for x,i in table]
> else:
>     for x, i in table:
>         table[i] = x
> 
> where ????? no doubt will depend on how much memory is available in one
> contiguous chunk.
> 
> Is there any way to determine which branch I should run, apart from hard-
> coding some arbitrary and constant cut-off value?

How about using two lists?

keys = list(iterable)
values = range(len(keys))
values.sort(key=keys.__getitem__)
del keys

The intention is to save memory used for the 2-tuples; I don't know if they 
pop up elsewhere.

[toc] | [prev] | [next] | [standalone]

#61785

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-12-13 03:01 +0000
Message-ID	<52aa7890$0$29992$c3e8da3$5496439d@news.astraweb.com>
In reply to	#61726

On Thu, 12 Dec 2013 16:08:33 +0100, Peter Otten wrote:

> Steven D'Aprano wrote:
[...]
>> So, ideally I'd like to write my code like this:
>> 
>> 
>> table = [(x, i) for i,x in enumerate(iterable)] table.sort()
>> if len(table) < ?????:
>>     table = [i for x,i in table]
>> else:
>>     for x, i in table:
>>         table[i] = x
>> 
>> where ????? no doubt will depend on how much memory is available in one
>> contiguous chunk.
>> 
>> Is there any way to determine which branch I should run, apart from
>> hard- coding some arbitrary and constant cut-off value?
> 
> How about using two lists?
> 
> keys = list(iterable)
> values = range(len(keys))
> values.sort(key=keys.__getitem__)
> del keys
> 
> The intention is to save memory used for the 2-tuples; I don't know if
> they pop up elsewhere.

They don't. On the other hand, you've now got two lists where before I 
only had one.

In my post, I said that the code I showed was a simplification. What I 
actually have is two branches. If the given iterable is a sequence type 
(say, a list) I use something quite similar to your code above:

    if isinstance(iterable, collections.Sequence):
        getitem = iterable.__getitem__
        if key is not None:
            # Compose the given key with getitem.
            f = lambda i: key(getitem(i))
        else:
            f = getitem
        table = range(len(iterable))  # Python 2.x version
        table.sort(key=f, reverse=reverse)

which avoids any temporary two-tuples, and creates only the one 
additional list. I can't avoid making that list, since it's the result 
required. I don't think that this branch of the code can be optimized in 
any serious way, I can't see any additional fat to be trimmed off that. 
If the input list is so large as to cause thrashing, the only solution is 
to buy more memory.

The other branch handles iterators, and I have two versions. The 
idiomatic version is faster for small data sets (small in this case 
meaning "up to at least one million items on my computer") but ends up 
building at least one extra temporary list which is thrown away:

    else:
        table = [(x, i) for (i, x) in enumerate(iterable)]
        table.sort(key=key, reverse=reverse)
        table = [i for x, i in table]

I've experimented with slight variations, e.g. using sorted() and a 
generator expression instead of a list comp, and the differences are 
trivial. The important thing here is that there's a temporary list which 
in some cases is *very large* but destined to be thrown away. Having two 
very large lists (for large enough input sizes) causes thrashing. 

To avoid that thrashing, I have a second version which builds only one 
list and then modifies it in place. For very large input sizes (ten 
million items) this is significantly faster:

    else:
        table = [(x, i) for (i, x) in enumerate(iterable)]
        table.sort(key=key, reverse=reverse)
        for x, i in enumerate(table):
            table[i] = x

but much slower, and less idiomatic, for small input sizes.

I don't know of any reasonable way to tell at runtime which of the two 
algorithms I ought to take. Hard-coding an arbitrary value 
("if len(table) > 5000000") is not the worst idea I've ever had, but I'm 
hoping for something more deterministic which can be cheaply calculated 
at runtime.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#61792

From	rusi <rustompmody@gmail.com>
Date	2013-12-12 21:35 -0800
Message-ID	<3a844cdb-218a-43e2-a1ad-7a7186dea282@googlegroups.com>
In reply to	#61785

On Friday, December 13, 2013 8:31:37 AM UTC+5:30, Steven D'Aprano wrote:

> I don't know of any reasonable way to tell at runtime which of the two 
> algorithms I ought to take. Hard-coding an arbitrary value 
> ("if len(table) > 5000000") is not the worst idea I've ever had, but I'm 
> hoping for something more deterministic which can be cheaply calculated 
> at runtime.

Have you seen guppy and heapy?

http://guppy-pe.sourceforge.net/

Even if you dont use it directly, it should have syscalls that you can steal

[toc] | [prev] | [next] | [standalone]

#61735

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-12-12 13:07 -0500
Message-ID	<mailman.4006.1386871684.18130.python-list@python.org>
In reply to	#61630

On 12/12/2013 6:09 AM, Stefan Behnel wrote:
> Terry Reedy, 12.12.2013 03:26:
>> from itertools import count
>> table = sorted(t for t in zip(iterable, count))
>
> This is a useless use of a generator expression. sorted(zip(...)) is enough
> (and likely to be substantially faster also).

Yes, definitely, and thank you. I did not squeeze enough.

-- 
Terry Jan Reedy

[toc] | [prev] | [standalone]

csiph-web

Optimizing list processing

Contents

#61630 — Optimizing list processing

#61632

#61644

#61648

#61634

#61638

#61650

#61701

#61703

#61706

#61713

#61739

#61777

#61779

#61697

#61726

#61785

#61792

#61735