Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!bcyclone04.am1.xlned.com!bcyclone04.am1.xlned.com!newsfeed.xs4all.nl!newsfeed2a.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.005 X-Spam-Evidence: '*H*': 0.99; '*S*': 0.00; '"this': 0.03; 'cpython': 0.05; 'sufficient': 0.05; 'that?': 0.05; '64-bit': 0.07; 'exists.': 0.07; 'etc).': 0.09; 'happen?': 0.09; 'sure,': 0.09; 'python': 0.11; 'bug': 0.12; 'random': 0.14; '24,': 0.16; 'at- least': 0.16; 'element,': 0.16; 'elements,': 0.16; 'merely': 0.16; 'reasonably': 0.16; 'remembers': 0.16; 'sorting': 0.16; 'sorting.': 0.16; 'elements': 0.16; 'do,': 0.16; 'fix': 0.17; 'wrote:': 0.18; 'wed,': 0.18; 'figures': 0.19; 'feb': 0.22; 'email addr:gmail.com>': 0.22; 'proposed': 0.22; 'saying': 0.22; 'error': 0.23; 'bytes': 0.24; 'skip': 0.24; 'stick': 0.24; 'source': 0.25; '>': 0.26; 'certain': 0.27; 'gets': 0.27; 'header:In-Reply-To:1': 0.27; 'point': 0.28; 'appear': 0.29; 'fixed': 0.29; 'chris': 0.29; 'leave': 0.29; 'am,': 0.29; 'array': 0.29; 'absolute': 0.30; 'said,': 0.30; 'sets': 0.30; 'message- id:@mail.gmail.com': 0.30; 'went': 0.31; 'code': 0.31; 'easier': 0.31; '(my': 0.31; '25,': 0.31; 'crash': 0.31; 'documenting': 0.31; 'stands': 0.31; 'testing.': 0.31; 'lists': 0.32; 'quite': 0.32; 'guess': 0.33; 'minimal': 0.33; 'comment': 0.34; 'received:74.125.82': 0.34; 'could': 0.34; 'agree': 0.35; 'something': 0.35; 'test': 0.35; 'but': 0.35; 'received:google.com': 0.35; 'add': 0.35; 'shows': 0.36; 'wrong': 0.37; 'list': 0.37; 'being': 0.38; 'implement': 0.38; 'minimum': 0.38; 'problems': 0.38; 'easiest': 0.38; 'machines': 0.38; 'massive': 0.38; 'somebody': 0.38; 'to:addr:python-list': 0.38; 'issue': 0.38; 'pm,': 0.38; 'skip:& 20': 0.39; 'received:74.125': 0.39; 'does': 0.39; 'realize': 0.39; 'to:addr:python.org': 0.39; 'enough': 0.39; 'space': 0.40; 'even': 0.60; 'skip:u 10': 0.60; 'easy': 0.60; 'future': 0.60; 'most': 0.60; 'back': 0.62; 'road': 0.65; 'finally': 0.65; 'occur': 0.65; 'due': 0.66; 'moments': 0.68; 'eight': 0.74; 'obvious': 0.74; 'million': 0.74; 'potentially': 0.81; '(probably': 0.84; '2015': 0.84; 'comment.': 0.84; 'cost,': 0.84; 'presumably': 0.84; 'understand,': 0.84; 'dealt': 0.91 X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:content-type; bh=p5wYwmTHA+3ycLUeVMEeun8zRo3RKyICgtp2GUjE2sY=; b=dyhM/ewUq/0jVsKkzlE6uuOzSXgjeVlXAQOV4W7Hb8dTVCjtVdAHBTUNEMev9S7bsz bdrwMNneaRe25z3lDkFYo9XLVtXLPDHM9UBTRfhxBz8oggSnwcom9zolYMW7Vxpd7MZ2 KJwTZKZDVpz2b9z4aPn0LQs0UPLAph/4t7IQxz9TNaiAwOsyRH+JFclbgfR5LW1pQQBx h+nX4tk+oL0EVoprUgdabqLLMsiU179H66b3fcdSgvaR0VAQUxueih6aOqDTRf+3Y0hj iLeLzgMlSwQYWxkvZGT10N16GQT8W8K6erJzXDWunCcBZdsbDICbsTSpVFik6ovd8WeN C8ww== X-Gm-Message-State: ALoCoQkFCNu5/M1c5Ip7rC6Mb8fOIIgnIbz2KOxWDrga4yKOI8spUmMHvqkHYq86H4kVIKOg63Ab X-Received: by 10.194.48.12 with SMTP id h12mr1176455wjn.74.1424824748931; Tue, 24 Feb 2015 16:39:08 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <1cf84559-3a63-4799-a879-ae8e513d387e@googlegroups.com> <54ED0B7E.7030801@mrabarnett.plus.com> From: Chris Kaynor Date: Tue, 24 Feb 2015 16:38:48 -0800 Subject: Re: Bug in timsort!? To: Python Content-Type: multipart/alternative; boundary=089e013c6704070fe8050fdedd50 X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Newsgroups: comp.lang.python Message-ID: Lines: 118 NNTP-Posting-Host: 2001:888:2000:d::a6 X-Trace: 1424824756 news.xs4all.nl 2921 [2001:888:2000:d::a6]:41467 X-Complaints-To: abuse@xs4all.nl X-Received-Bytes: 11287 X-Received-Body-CRC: 2581964393 Xref: csiph.com comp.lang.python:86364 --089e013c6704070fe8050fdedd50 Content-Type: text/plain; charset=UTF-8 On Tue, Feb 24, 2015 at 4:07 PM, Chris Angelico wrote: > On Wed, Feb 25, 2015 at 10:50 AM, Skip Montanaro > wrote: > > Even if/when we get to the point where machines can hold an array of > > 2**49 elements, I suspect people won't be using straight Python to > > wrangle them. > > Looking just at CPython, what is the absolute minimum storage space > for a massive list like that? My guess is that a 64-bit CPython might > get as good as eight bytes per element; playing around with smaller > figures (up to circa a million elements) shows it ranging from 8 to 9 > bytes per element, bottoming out at just a smidge over 8, presumably > at the moments when there's no slack space (there's a fixed overhead, > which gets diminishingly significant). So an array of 2**49 elements > would take 2**52 bytes (4 petabytes) of storage - addressable, to be > sure, but an awful lot to be sorting. > > Would it be sufficient to stick a comment into the source saying "This > may have problems with lists in excess of 2**49 elements", and leave > it until that's actually likely to happen? > I would agree that it will be quite a while as it stands before the bug appears, so documenting it MAY be appropriate, as it is likely to be far enough in the future that a number of unforeseen things may block the bug from being an issue (TimSort may be replaced by something better, CPython could die, etc). That said, from what I understand, their proposed fix would be reasonably easy to use, and have minimal cost, so it may be worthwhile to use just because the issue is likely to be in use for quite a while before somebody remembers the issue exists. Once somebody does hit it, it may take a while to realize that TimSort is at fault, as it will only occur on some data sets (my understanding is that the elements must be in certain configurations with at-least the minimum number), so it will likely appear as just a random crash occurring. If it were fully up to me, I would do one of the following, in order of preference: 1) Implement and test the fix provided on the blog. This requires the most work due to licensing, reviewing, and testing. 2) Add a noisy warning/error to the code when sorting lists large enough to potentially hit the error (probably 2**48 to leave some wiggle room). This is much easier to do, but merely pushes the can down the road to be dealt with later. It does still make it obvious what is going long when it finally breaks. 3) Add a comment explaining the issue (likely linking back to the blog). While by far the easiest "solution", again, this merely pushes the can down the road, and it may be very non-obvious what went wrong when it breaks, despite the comment. Chris --089e013c6704070fe8050fdedd50 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
On T= ue, Feb 24, 2015 at 4:07 PM, Chris Angelico <rosuav@gmail.com> wrote:
On Wed, Feb 25, 2015 at = 10:50 AM, Skip Montanaro
<skip.mont= anaro@gmail.com> wrote:
> Even if/when we get to the point where machines can hold an array of > 2**49 elements, I suspect people won't be using straight Python to=
> wrangle them.

Looking just at CPython, what is the absolute minimum storage space<= br> for a massive list like that? My guess is that a 64-bit CPython might
get as good as eight bytes per element; playing around with smaller
figures (up to circa a million elements) shows it ranging from 8 to 9
bytes per element, bottoming out at just a smidge over 8, presumably
at the moments when there's no slack space (there's a fixed overhea= d,
which gets diminishingly significant). So an array of 2**49 elements
would take 2**52 bytes (4 petabytes) of storage - addressable, to be
sure, but an awful lot to be sorting.

Would it be sufficient to stick a comment into the source saying "This=
may have problems with lists in excess of 2**49 elements", and leave it until that's actually likely to happen?

I would agree that it will be quite a while as it stands before the= bug appears, so documenting it MAY be appropriate, as it is likely to be f= ar enough in the future that a number of unforeseen things may block the bu= g from being an issue (TimSort may be replaced by something better, CPython= could die, etc).

That said, from what I understan= d, their proposed fix would be reasonably easy to use, and have minimal cos= t, so it may be worthwhile to use just because the issue is likely to be in= use for quite a while before somebody remembers the issue exists. Once som= ebody does hit it, it may take a while to realize that TimSort is at fault,= as it will only occur on some data sets (my understanding is that the elem= ents must be in certain configurations with at-least the minimum number), s= o it will likely appear as just a random crash occurring.
If it were fully up to me, I would do one of the following, in = order of preference:
1) Implement and test the fix provided on th= e blog. This requires the most work due to licensing, reviewing, and testin= g.
2) Add a noisy warning/error to the code when sorting lists la= rge enough to potentially hit the error (probably 2**48 to leave some wiggl= e room). This is much easier to do, but merely pushes the can down the road= to be dealt with later. It does still make it obvious what is going long w= hen it finally breaks.
3) Add a comment explaining the issue (lik= ely linking back to the blog). While by far the easiest "solution"= ;, again, this merely pushes the can down the road, and it may be very non-= obvious what went wrong when it breaks, despite the comment.

=
Chris

--089e013c6704070fe8050fdedd50--