Groups > comp.lang.python > #28380 > unrolled thread

Comparing strings from the back?

Started by	Roy Smith <roy@panix.com>
First post	2012-09-03 21:54 -0400
Last post	2012-09-10 21:52 +0000
Articles	20 on this page of 101 — 21 participants

Back to article view | Back to comp.lang.python

  Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-03 21:54 -0400
    Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-04 12:07 +1000
    Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 02:17 +0000
      Re: Comparing strings from the back? Dan Sommers <dan@tombstonezero.net> - 2012-09-03 21:56 -0700
      Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-04 08:50 +0100
      Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-04 18:32 +0200
        Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 18:07 +0000
          Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:17 +0200
        Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-05 07:59 +1000
          Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:24 +0200
            Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:43 +0200
              Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-05 14:30 +0000
                Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 16:51 +0200
                  Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-05 16:24 +0000
                    Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-05 22:47 +0000
                      Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-06 08:33 +0000
                        Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 06:07 -0400
                          Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-07 04:06 +0000
                            Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-07 19:10 +0000
                              Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-08 00:55 +0000
                                Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-08 11:53 +0000
                                RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-13 18:39 +0000
                                Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 15:37 -0400
                                  Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 20:48 -0700
                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 00:46 -0400
                                      Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 21:54 -0700
                                        Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 01:38 -0400
                                          Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 23:06 -0700
                                            Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 04:03 -0400
                                              Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-14 01:20 -0700
                                                Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 04:53 -0400
                                                  Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-14 03:26 -0700
                                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 07:36 -0400
                                                Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-14 10:16 +0000
                                                  Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 07:43 -0400
                                                RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-14 22:43 +0000
                                                Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 23:10 -0400
                                                  Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-16 18:11 -0700
                                                    Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-17 14:05 +1000
                                                      Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-16 23:06 -0700
                                                Re: Comparing strings from the back? Ethan Furman <ethan@stoneleaf.us> - 2012-09-17 13:35 -0700
                                                  Re: Comparing strings from the back? Neil Hodgson <nhodgson@iinet.net.au> - 2012-09-18 09:14 +1000
                                                    Re: Comparing strings from the back? Ethan Furman <ethan@stoneleaf.us> - 2012-09-18 08:12 -0700
                                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 11:55 -0400
                                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 11:59 -0400
                                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 12:17 -0400
                                                      Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-19 00:47 +0000
                                                    Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-19 02:20 +1000
                                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 16:40 -0400
                                                    Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-19 00:48 +0100
                                Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-13 20:53 +0100
                                Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 17:06 -0400
                                  Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-14 03:39 +0000
                                    Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-14 14:15 +1000
                                Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-13 22:17 +0100
                                Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 17:35 -0400
                                RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-14 21:32 +0000
                            Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-07 19:40 +0000
                            Re: Comparing strings from the back? Gelonida N <gelonida@gmail.com> - 2012-09-08 17:52 +0200
                              Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-10 08:59 +0000
                                Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-10 13:45 +0000
                                  Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:06 +0000
                                    Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-11 09:51 +0000
                                      Re: Comparing strings from the back? Terry Reedy <tjreedy@udel.edu> - 2012-09-11 11:55 -0400
                                  Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-11 00:26 +1000
                                  Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:32 +0000
                                  Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:43 +0000
                                  Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-11 00:56 +1000
                                  Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-11 09:41 +0000
                        Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-06 12:04 +0100
                          Re: Comparing strings from the back? Steve Howell <showell30@yahoo.com> - 2012-09-14 17:51 -0700
                        Re: Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-06 08:13 -0400
                          Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-06 22:29 +1000
                        Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 15:43 +0200
                          Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 10:23 -0400
                            Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:33 +0200
                            Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:42 +0200
                              Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 11:54 -0400
                          Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:34 +0200
                        Re: Comparing strings from the back? Gelonida N <gelonida@gmail.com> - 2012-09-08 17:50 +0200
                    Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 15:37 +0200
                      Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-07 00:39 +1000
                        Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 17:36 +0200
                          Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 17:44 +0200
                        Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-07 04:07 +0000
    Re: Comparing strings from the back? Terry Reedy <tjreedy@udel.edu> - 2012-09-04 01:13 -0400
    Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-04 08:56 +0100
    Re: Comparing strings from the back? Alain Ketterlin <alain@dpt-info.u-strasbg.fr> - 2012-09-04 11:58 +0200
    Re: Comparing strings from the back? Neil Hodgson <nhodgson@iinet.net.au> - 2012-09-05 12:18 +1000
      Re: Comparing strings from the back? MRAB <python@mrabarnett.plus.com> - 2012-09-05 03:39 +0100
      Re: Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-04 22:48 -0400
      Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 16:33 +0200
    Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 10:29 +0200
    Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-05 18:33 +1000
    Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 11:48 +0200
    Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 17:45 +0200
    Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 18:07 +0200
    Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 16:33 +0000
    Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 19:32 +0200
    Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 19:44 +0200
    Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 21:52 +0000

Page 4 of 6 — ← Prev page 1 2 3 [4] 5 6 Next page →

#28818

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-10 13:45 +0000
Message-ID	<504deedc$0$29981$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28807

On Mon, 10 Sep 2012 08:59:37 +0000, Duncan Booth wrote:

> Gelonida N <gelonida@gmail.com> wrote:
> 
>> On 09/07/2012 06:06 AM, Steven D'Aprano wrote:
>>> On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote:
>>>
>>>
>>> Also of some interest is the best case: O(1) for unequal strings (they
>>> differ at the first character) and O(N) for equal strings.
>> 
>> The worst case is O(N) or N characters the average case is O(1) or two
>> characters.
>> 
>> For denial of service attacks or systems, that are NEVER allowed to
>> fail the worst case is important.
>> 
>> For most other cases the average complexity counts.
>> 
>> However I still wonder for how many applications the complexity of
>> string comparisons would be the limiting factor.
>> 
>> 
> and of course if you ever do find an application where that worst case
> matters there's an easy way round it: just call intern() on all the
> strings when they are created.

There are other good reasons for interning strings, but speeding up 
"astring == bstring" is not one of them.

[steve@ando ~]$ python -m timeit -s "s = 'abc'*100000" -s \
> "t = s[:-1] + 'x'" "s == t"
1000 loops, best of 3: 910 usec per loop
[steve@ando ~]$ python -m timeit -s "s = 'abc'*100000" -s \
> "t = s[:-1] + 'x'" -s "intern(s); intern(t)" "s == t"
1000 loops, best of 3: 914 usec per loop

No significant difference.

To state the bleedin' obvious, the computational effort required to 
*compare two strings* pre-supposes that those strings already exist, and 
is *completely independent* of the complexity of creating the strings.

> For the comparison to be the limiting factor you have to be doing a lot
> of comparisons 

Correct.

> on the same string (otherwise creating the string would be the limiting
> factor),

Not necessarily.

Look, it's really hard to come up with a realistic, non-contrived example 
where string comparisons are a significant bottleneck in a non-toy 
application. So let me first off say that *nearly always* you're not 
going to care whether "s == t" looks at all N characters or just the 
first 2 (or 20, or 100). This argument is rather academic (the best sort 
of argument!). Until you start getting up into truly huge strings, we're 
arguing about how many angels can dance on a CPU cache.

But for the record, in principle string comparisons *could* be the 
bottleneck. Example: you have 10000 strings, which are each created once 
and stored in a list. Then you iterate over the list, comparing every 
string against every other string. And due to some weird vagary of the 
data, the strings are nearly all equal.

(Why would you do this? I don't know, maybe it's a programmers' challenge 
found on the Internet, make up your own scenario...)

Total number of strings created: 10000.
Total number of strings compared: 100000000.

The overhead of creating the strings is trivial compared to the overhead 
of comparing them, and since each string is only created once anyway, 
interning them is just a waste of time.

> so at the expense of a single dictionary
> insertion when the string is created you can get guaranteed O(1) on all
> the comparisons.

What interning buys you is that "s == t" is an O(1) pointer compare if 
they are equal. But if s and t differ in the last character, __eq__ will 
still inspect every character. There is no way to tell Python "all 
strings are interned, if s is not t then s != t as well".

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#28820

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2012-09-10 14:06 +0000
Message-ID	<mailman.444.1347285990.27098.python-list@python.org>
In reply to	#28818

On 2012-09-10, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> On Mon, 10 Sep 2012 08:59:37 +0000, Duncan Booth wrote:
>
>> Gelonida N <gelonida@gmail.com> wrote:
>> 
>> so at the expense of a single dictionary
>> insertion when the string is created you can get guaranteed O(1) on all
>> the comparisons.
>
> What interning buys you is that "s == t" is an O(1) pointer compare if 
> they are equal. But if s and t differ in the last character, __eq__ will 
> still inspect every character. There is no way to tell Python "all 
> strings are interned, if s is not t then s != t as well".
>

I thought that if *both* strings were interned then a pointer comparison could
decide if they were unequal without needing to check the characters.

Have I misunderstood how intern() works?

Oscar

[toc] | [prev] | [next] | [standalone]

#28883

From	Duncan Booth <duncan.booth@invalid.invalid>
Date	2012-09-11 09:51 +0000
Message-ID	<XnsA0CB6DD177060duncanbooth@127.0.0.1>
In reply to	#28820

Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:

>> What interning buys you is that "s == t" is an O(1) pointer compare
>> if they are equal. But if s and t differ in the last character,
>> __eq__ will still inspect every character. There is no way to tell
>> Python "all strings are interned, if s is not t then s != t as well".
>>
> 
> I thought that if *both* strings were interned then a pointer
> comparison could decide if they were unequal without needing to check
> the characters. 
> 
> Have I misunderstood how intern() works?
> 

I don't think you've misunderstood how it work, but so far as I can see the 
code doesn't attempt to short circuit the "not equal but interned" case. 
The comparison code doesn't look at interning at all, it only looks for 
identity as a shortcut.

-- 
Duncan Booth http://kupuguy.blogspot.com

[toc] | [prev] | [next] | [standalone]

#28888

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-09-11 11:55 -0400
Message-ID	<mailman.509.1347378993.27098.python-list@python.org>
In reply to	#28883

On 9/11/2012 6:40 AM, Oscar Benjamin wrote:
> On 11 September 2012 10:51, Duncan Booth <duncan.booth@invalid.invalid
> <mailto:duncan.booth@invalid.invalid>> wrote:
>
>     Oscar Benjamin <oscar.j.benjamin@gmail.com
>     <mailto:oscar.j.benjamin@gmail.com>> wrote:
>
>      >> What interning buys you is that "s == t" is an O(1) pointer compare
>      >> if they are equal. But if s and t differ in the last character,
>      >> __eq__ will still inspect every character. There is no way to tell
>      >> Python "all strings are interned, if s is not t then s != t as
>     well".
>      >>
>      >
>      > I thought that if *both* strings were interned then a pointer
>      > comparison could decide if they were unequal without needing to check
>      > the characters.
>      >
>      > Have I misunderstood how intern() works?
>      >
>
>     I don't think you've misunderstood how it work, but so far as I can
>     see the
>     code doesn't attempt to short circuit the "not equal but interned" case.
>     The comparison code doesn't look at interning at all, it only looks for
>     identity as a shortcut.
>
>
> It also doesn't seem to check if the hash values have been set. I guess
> the cached hash value is only used in contexts where the hash is
> explicitly desired.-

I believe the internal use of interning and hash comparison has varied 
from release to release. However, the main use of string comparison is 
for dict keys, especially the internal dicts for namespaces and 
attributes. Since the dict lookup code needs hash values anyway, to find 
slots with possible conflicts, I am sure it does not use the generic 
comparison operators but starts with hash comparisons.

Terry Jan Reedy

[toc] | [prev] | [next] | [standalone]

#28821

From	Chris Angelico <rosuav@gmail.com>
Date	2012-09-11 00:26 +1000
Message-ID	<mailman.445.1347287192.27098.python-list@python.org>
In reply to	#28818

On Tue, Sep 11, 2012 at 12:06 AM, Oscar Benjamin
<oscar.j.benjamin@gmail.com> wrote:
> On 2012-09-10, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>> What interning buys you is that "s == t" is an O(1) pointer compare if
>> they are equal. But if s and t differ in the last character, __eq__ will
>> still inspect every character. There is no way to tell Python "all
>> strings are interned, if s is not t then s != t as well".
>>
>
> I thought that if *both* strings were interned then a pointer comparison could
> decide if they were unequal without needing to check the characters.
>
> Have I misunderstood how intern() works?

In a language where _all_ strings are guaranteed to be interned (such
as Lua, I think), you do indeed gain this. Pointer inequality implies
string inequality. But when interning is optional (as in Python), you
cannot depend on that, unless there's some way of recognizing interned
strings. Of course, that may indeed be the case; a simple bit flag
"this string has been interned" would suffice, and if both strings are
interned AND their pointers differ, THEN you can be sure the strings
differ.

I have no idea whether or not CPython version X.Y.Z does this. The
value of such an optimization really depends on how likely strings are
to be interned; for instance, if the compiler automatically interns
all the names of builtins, this could be quite beneficial. Otherwise,
probably not; most Python scripts don't bother interning anything.

ChrisA

[toc] | [prev] | [next] | [standalone]

#28822

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2012-09-10 14:32 +0000
Message-ID	<mailman.446.1347287554.27098.python-list@python.org>
In reply to	#28818

On 2012-09-10, Chris Angelico <rosuav@gmail.com> wrote:
> On Tue, Sep 11, 2012 at 12:06 AM, Oscar Benjamin
><oscar.j.benjamin@gmail.com> wrote:
>> On 2012-09-10, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>>> What interning buys you is that "s == t" is an O(1) pointer compare if
>>> they are equal. But if s and t differ in the last character, __eq__ will
>>> still inspect every character. There is no way to tell Python "all
>>> strings are interned, if s is not t then s != t as well".
>>>
>>
>> I thought that if *both* strings were interned then a pointer comparison
>> could decide if they were unequal without needing to check the characters.
>>
>> Have I misunderstood how intern() works?
>
> In a language where _all_ strings are guaranteed to be interned (such
> as Lua, I think), you do indeed gain this. Pointer inequality implies
> string inequality. But when interning is optional (as in Python), you
> cannot depend on that, unless there's some way of recognizing interned
> strings. Of course, that may indeed be the case; a simple bit flag
> "this string has been interned" would suffice, and if both strings are
> interned AND their pointers differ, THEN you can be sure the strings
> differ.
>
> I have no idea whether or not CPython version X.Y.Z does this. The
> value of such an optimization really depends on how likely strings are
> to be interned; for instance, if the compiler automatically interns
> all the names of builtins, this could be quite beneficial. Otherwise,
> probably not; most Python scripts don't bother interning anything.
>

I haven't looked at the source but my understanding was precisely that there
is an intern() bit and that not only the builtins module but all the literals
in any byte-compiled module are interned.

Oscar

[toc] | [prev] | [next] | [standalone]

#28823

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2012-09-10 14:43 +0000
Message-ID	<mailman.447.1347288231.27098.python-list@python.org>
In reply to	#28818

On 2012-09-10, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
> On 2012-09-10, Chris Angelico <rosuav@gmail.com> wrote:
>> On Tue, Sep 11, 2012 at 12:06 AM, Oscar Benjamin
>><oscar.j.benjamin@gmail.com> wrote:
>>> On 2012-09-10, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>>>> What interning buys you is that "s == t" is an O(1) pointer compare if
>>>> they are equal. But if s and t differ in the last character, __eq__ will
>>>> still inspect every character. There is no way to tell Python "all
>>>> strings are interned, if s is not t then s != t as well".
>>>>
>>>
>>> I thought that if *both* strings were interned then a pointer comparison
>>> could decide if they were unequal without needing to check the characters.
>>>
>>> Have I misunderstood how intern() works?
>>
>> In a language where _all_ strings are guaranteed to be interned (such
>> as Lua, I think), you do indeed gain this. Pointer inequality implies
>> string inequality. But when interning is optional (as in Python), you
>> cannot depend on that, unless there's some way of recognizing interned
>> strings. Of course, that may indeed be the case; a simple bit flag
>> "this string has been interned" would suffice, and if both strings are
>> interned AND their pointers differ, THEN you can be sure the strings
>> differ.
>>
>> I have no idea whether or not CPython version X.Y.Z does this. The
>> value of such an optimization really depends on how likely strings are
>> to be interned; for instance, if the compiler automatically interns
>> all the names of builtins, this could be quite beneficial. Otherwise,
>> probably not; most Python scripts don't bother interning anything.
>>
>
> I haven't looked at the source but my understanding was precisely that there
> is an intern() bit and that not only the builtins module but all the literals
> in any byte-compiled module are interned.
>

s/literals/identifiers/

You can see the interned flag in the PyUnicodeObject struct here:
http://hg.python.org/cpython/file/3ffd6ad93fe4/Include/unicodeobject.h#l303

Oscar

[toc] | [prev] | [next] | [standalone]

#28825

From	Chris Angelico <rosuav@gmail.com>
Date	2012-09-11 00:56 +1000
Message-ID	<mailman.448.1347288972.27098.python-list@python.org>
In reply to	#28818

On Tue, Sep 11, 2012 at 12:43 AM, Oscar Benjamin
<oscar.j.benjamin@gmail.com> wrote:
> On 2012-09-10, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
>> I haven't looked at the source but my understanding was precisely that there
>> is an intern() bit and that not only the builtins module but all the literals
>> in any byte-compiled module are interned.
>>
>
> s/literals/identifiers/
>
> You can see the interned flag in the PyUnicodeObject struct here:
> http://hg.python.org/cpython/file/3ffd6ad93fe4/Include/unicodeobject.h#l303

Ah, yep, so that's there. In that case, it's possible to have that
optimization. However, I may be misreading this, but it seems the only
Unicode comparison function is a rich compare, which is unable to take
advantage of a known difference:
http://hg.python.org/cpython/file/b48ef168d8c5/Objects/unicodeobject.c#l6114

Different pointers prove the strings differ, but don't tell you which
is to be sorted earlier. You could use this if you roll your own
comparison in C; or, if you already know the strings are interned, you
can use 'is' / 'is not'. But that seems to be the extent of it.

ChrisA

[toc] | [prev] | [next] | [standalone]

#28882

From	Duncan Booth <duncan.booth@invalid.invalid>
Date	2012-09-11 09:41 +0000
Message-ID	<XnsA0CB6C9F85D97duncanbooth@127.0.0.1>
In reply to	#28818

Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> But for the record, in principle string comparisons *could* be the 
> bottleneck. Example: you have 10000 strings, which are each created
> once and stored in a list. Then you iterate over the list, comparing
> every string against every other string. And due to some weird vagary
> of the data, the strings are nearly all equal.
> 
> (Why would you do this? I don't know, maybe it's a programmers'
> challenge found on the Internet, make up your own scenario...)
> 
> Total number of strings created: 10000.
> Total number of strings compared: 100000000.

which is exactly what I meant by doing a lot of comparisons (10000) on
the same string. Perhaps if I'd phrased it more as "you have to be doing
many more comparisons than string creation operations" it would have
been clearer what I meant. 

> The overhead of creating the strings is trivial compared to the
> overhead of comparing them, and since each string is only created once
> anyway, interning them is just a waste of time.

No, you created 10k strings many of which are equal and then did 10k
comparisons on each most of which found 'yes' they are equal. Interning
them would have reduced all the 'true' comparisons to an identity check 
at the cost of 1 hash and 1 full comparison per string. 

>> so at the expense of a single dictionary
>> insertion when the string is created you can get guaranteed O(1) on
>> all the comparisons.
> 
> What interning buys you is that "s == t" is an O(1) pointer compare if
> they are equal. But if s and t differ in the last character, __eq__
> will still inspect every character. There is no way to tell Python
> "all strings are interned, if s is not t then s != t as well".

Right if the strings differ only in the last character, but the rest of
this thread has been about how, for random strings, the not-equal case
is O(1) as well. 

-- 
Duncan Booth http://kupuguy.blogspot.com

[toc] | [prev] | [next] | [standalone]

#28572

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2012-09-06 12:04 +0100
Message-ID	<mailman.289.1346929477.27098.python-list@python.org>
In reply to	#28557

On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel <d@davea.name> wrote:
> For random strings (as defined below), the average compare time is
> effectively unrelated to the size of the string, once the size 
passes
> some point.


> Define random string as being a selection from a set of characters, 
with
> replacement.  So if we pick some set of characters, say 10 (or 256, 
it
> doesn't really matter), the number of possible strings is 10**N.


> The likelihood of not finding a mismatch within k characters is  
> (1/10)**k   The likelihood of actually reaching the end of the 
random
> string is (1/10)**N.  (which is the reciprocal of the number of 
strings,
> naturally)


> If we wanted an average number of comparisons, we'd have to 
calculate a
> series, where each term is a probability times a value for k.
>    sum((k * 9*10**-k) for k in range(1, 10))




> Those terms very rapidly approach 0, so it's safe to stop after a 
few. 
> Looking at the first 9 items, I see a value of 1.1111111


> This may not be quite right, but the value is certainly well under 
2 for
> a population of 10 characters, chosen randomly.  And notice that N
> doesn't really come into it.

It's exactly right. You can obtain this result analytically from 
Johannes' formula above. Just replace 256 with 10 to get that the 
expected number of comparisons is

(10/9)*(1 - 10**(-N))

The last term shows the dependence on N and is tiny even for N=9.

Oscar

[toc] | [prev] | [next] | [standalone]

#29196

From	Steve Howell <showell30@yahoo.com>
Date	2012-09-14 17:51 -0700
Message-ID	<6da9ea89-c3ee-4c2d-9d86-c7b164113bb0@kr6g2000pbb.googlegroups.com>
In reply to	#28572

On Sep 6, 4:04 am, Oscar Benjamin <oscar.j.benja...@gmail.com> wrote:
> On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel <d...@davea.name> wrote:
> > For random strings (as defined below), the average compare time is
> > effectively unrelated to the size of the string, once the size
> passes
> > some point.
> > Define random string as being a selection from a set of characters,
> with
> > replacement.  So if we pick some set of characters, say 10 (or 256,
> it
> > doesn't really matter), the number of possible strings is 10**N.
> > The likelihood of not finding a mismatch within k characters is
> > (1/10)**k   The likelihood of actually reaching the end of the
> random
> > string is (1/10)**N.  (which is the reciprocal of the number of
> strings,
> > naturally)
> > If we wanted an average number of comparisons, we'd have to
> calculate a
> > series, where each term is a probability times a value for k.
> >    sum((k * 9*10**-k) for k in range(1, 10))
> > Those terms very rapidly approach 0, so it's safe to stop after a
> few.
> > Looking at the first 9 items, I see a value of 1.1111111
> > This may not be quite right, but the value is certainly well under
> 2 for
> > a population of 10 characters, chosen randomly.  And notice that N
> > doesn't really come into it.
>
> It's exactly right. You can obtain this result analytically from
> Johannes' formula above. Just replace 256 with 10 to get that the
> expected number of comparisons is
>
> (10/9)*(1 - 10**(-N))

The math here is simple, but of course the average running time is
also driven by usage patterns.

Let's say you have two independently produced strings with N number of
random bits.  The average amount of bit comparisons that it takes to
determine that the strings are different is between 1 and 2 checks for
any value of N, for the same reason that the average number of times
you need to flip a coin before either coming up heads or exhausting
your N tries is always less than 2, and pretty darn close to 2 for
even relatively small values of N.

    def compute_checks_for_different_strings(n):
      x = 1
      p = 0.5
      for i in range(0, n):
        x += p * i
        p = p * 0.5
      print n, x

    for n in [10, 100, 1000, 10000]:
      compute_checks_for_different_strings(n)

     10 1.9892578125
     100 2.0
     1000 2.0
     10000 2.0

Now let's say your string comparison function is used to verify 100-
bit passwords, and 99% of the time the password is from a legit user
who types it correctly.  Except for the 1% of a time when somebody fat
fingers the copy/paste of their password, every strcmp call is gonna
have to compare 100 pairs of characters, or N pairs.  If the usage
pattern says that 99% of the time you're simply verifying that two
strings are equal, then the number of bits is the main driving factor
for running time.

[toc] | [prev] | [next] | [standalone]

#28582

From	Roy Smith <roy@panix.com>
Date	2012-09-06 08:13 -0400
Message-ID	<roy-F077EE.08130706092012@news.panix.com>
In reply to	#28557

In article <50485fca$0$29977$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> In any case, the *worst* case for string equality 
> testing is certainly O(N) (every character must be looked at), and the 
> *best* case is O(1) obviously (the first character fails to match).

The best case is O(0), if either string is empty (ducking and running).

[toc] | [prev] | [next] | [standalone]

#28586

From	Chris Angelico <rosuav@gmail.com>
Date	2012-09-06 22:29 +1000
Message-ID	<mailman.299.1346934558.27098.python-list@python.org>
In reply to	#28582

n Thu, Sep 6, 2012 at 10:13 PM, Roy Smith <roy@panix.com> wrote:
> In article <50485fca$0$29977$c3e8da3$5496439d@news.astraweb.com>,
>  Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>
>> In any case, the *worst* case for string equality
>> testing is certainly O(N) (every character must be looked at), and the
>> *best* case is O(1) obviously (the first character fails to match).
>
> The best case is O(0), if either string is empty (ducking and running).

No, O(0) would be when the application decides not to compare at all.

ChrisA (also ducking)

[toc] | [prev] | [next] | [standalone]

#28594

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-09-06 15:43 +0200
Message-ID	<k2a99e$i5p$1@news.albasani.net>
In reply to	#28557

On 06.09.2012 10:33, Steven D'Aprano wrote:

> But you know, it really doesn't make a difference. Equality testing will 
> *still* be O(N) since the asymptomatic behaviour for sufficiently large 
> string comparisons will be bounded by O(N). Multiplicative constants 
> ("half the string" versus "0.001 of the string") do not matter.

Wrong, at least for randomized strings (i.e. every character with the
same probability). O(N) is worst-case, O(log N) is correct for
randomized strings.

Then again, since you were nitpicking about Landau notation earlier this
thread, every function bound by O(log N) is also bound by O(N), since,
as you correctly stated, it's only a upper bound. We should be talking
about the asymptotic sharp bound, i.e. capital Theta.

> I may have been overly-conservative earlier when I said that on average 
> string equality has to compare half the characters. I thought I had 
> remembered that from a computer science textbook, but I can't find that 
> reference now, so possibly I was thinking of something else. (String 
> searching perhaps?). In any case, the *worst* case for string equality 
> testing is certainly O(N) (every character must be looked at), and the 
> *best* case is O(1) obviously (the first character fails to match). But 
> I'm not so sure about the average case. Further thought is required.

Yes, worst-case is O(N), best case O(1). Average is O(n log n).

Best regards,
Johannes

-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

[toc] | [prev] | [next] | [standalone]

#28600

From	Dave Angel <d@davea.name>
Date	2012-09-06 10:23 -0400
Message-ID	<mailman.304.1346941450.27098.python-list@python.org>
In reply to	#28594

On 09/06/2012 09:43 AM, Johannes Bauer wrote:
> <snip>
> Yes, worst-case is O(N), best case O(1). Average is O(n log n).

Can't see how you came up with an average of n log(n).  Fourteen minutes
before you made this post, you demonstrated it was less than 2 for any N.

And I previously posted that for a character set of size 10, the average
was 1.1111

-- 

DaveA

[toc] | [prev] | [next] | [standalone]

#28602

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-09-06 16:33 +0200
Message-ID	<k2ac6g$o2n$1@news.albasani.net>
In reply to	#28600

On 06.09.2012 16:23, Dave Angel wrote:
> On 09/06/2012 09:43 AM, Johannes Bauer wrote:
>> <snip>
>> Yes, worst-case is O(N), best case O(1). Average is O(n log n).
> 
> Can't see how you came up with an average of n log(n).  Fourteen minutes
> before you made this post, you demonstrated it was less than 2 for any N.

O(log n) is what I meant, sorry! Damnit.

> And I previously posted that for a character set of size 10, the average
> was 1.1111

For any given string n and and alphabet size l, the average is:

sum(i = 0..n) (1 / (l ^ n))

So with larger word length, the total complexity constantly increases.
The amount by which it increases however is shrinking exponentially with
the word length. Therefore O(log n).

Sorry about the n log n mistake, argh.

Best regards & thanks for the correction,
Johannes


-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

[toc] | [prev] | [next] | [standalone]

#28605

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-09-06 16:42 +0200
Message-ID	<k2acou$p8n$1@news.albasani.net>
In reply to	#28600

On 06.09.2012 16:23, Dave Angel wrote:
> On 09/06/2012 09:43 AM, Johannes Bauer wrote:
>> <snip>
>> Yes, worst-case is O(N), best case O(1). Average is O(n log n).
> 
> Can't see how you came up with an average of n log(n).  Fourteen minutes
> before you made this post, you demonstrated it was less than 2 for any N.
> 
> And I previously posted that for a character set of size 10, the average
> was 1.1111

Again playing with the equations and thinking about it again. And I
completely missed your point. It wasn't about n log n vs. log n. Both
are wrong. This surprises me. I was somehow thinking about the limit of
sum (1/n), n -> infty. But since the summands are shrinking
exponentially, the limit is different.

I think the limit of the average comparisons for a given wordlength n
against infinity with alphabet size l is

l / (l - 1)

i.e. for bitstrings it's 2 and for bytestrings it's 256/255.

This would mean string comparison of randomized strings is O(1). Can
that really be true? It looks like it.

Best regards,
Johannes

-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

[toc] | [prev] | [next] | [standalone]

#28615

From	Dave Angel <d@davea.name>
Date	2012-09-06 11:54 -0400
Message-ID	<mailman.313.1346946924.27098.python-list@python.org>
In reply to	#28605

On 09/06/2012 10:42 AM, Johannes Bauer wrote:
> On 06.09.2012 16:23, Dave Angel wrote:
>> On 09/06/2012 09:43 AM, Johannes Bauer wrote:
>>> <snip>
>>> Yes, worst-case is O(N), best case O(1). Average is O(n log n).
>> Can't see how you came up with an average of n log(n).  Fourteen minutes
>> before you made this post, you demonstrated it was less than 2 for any N.
>>
>> And I previously posted that for a character set of size 10, the average
>> was 1.1111
> Again playing with the equations and thinking about it again. And I
> completely missed your point. It wasn't about n log n vs. log n. Both
> are wrong. This surprises me. I was somehow thinking about the limit of
> sum (1/n), n -> infty. But since the summands are shrinking
> exponentially, the limit is different.
>
> I think the limit of the average comparisons for a given wordlength n
> against infinity with alphabet size l is
>
> l / (l - 1)
>
> i.e. for bitstrings it's 2 and for bytestrings it's 256/255.
>
> This would mean string comparison of randomized strings is O(1). Can
> that really be true? It looks like it.
>
(Just lost the internet in a storm, so I'm not sure how long it'll be
before this sends.)

Thanks, that was exactly my point.  Since el is at least 2, the average
number of comparisons is no larger than 2, for any value of N.  That's
why, when I'm visually comparing MD5 values, I usually only look at the
first few, and last few hex digits.

However, Chris Angelico (at 10:39) pointed out again that totally random
strings aren't real-world equivalent.

He has now proposed that somebody measure real-world cases here, by
patching the string comparison to gather statistics.  I think that's
beyond me at this point.  I only joined this thread when the cases under
study were well-defined random, and therefore predictable.  <g>

-- 

DaveA

[toc] | [prev] | [next] | [standalone]

#28603

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-09-06 16:34 +0200
Message-ID	<k2ac8i$o2n$2@news.albasani.net>
In reply to	#28594

On 06.09.2012 15:43, Johannes Bauer wrote:

> Wrong, at least for randomized strings (i.e. every character with the
> same probability). O(N) is worst-case, O(log N) is correct for
> randomized strings.
^^
Here I write the right thing. Then further below...

> Yes, worst-case is O(N), best case O(1). Average is O(n log n).

...I write the wrong thing. O(log n) is what I meant, as Dave correctly
noticed.

Best regards,
Johannes

-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

[toc] | [prev] | [next] | [standalone]

#28727

From	Gelonida N <gelonida@gmail.com>
Date	2012-09-08 17:50 +0200
Message-ID	<mailman.385.1347119434.27098.python-list@python.org>
In reply to	#28557

On 09/06/2012 10:33 AM, Steven D'Aprano wrote:
> On Wed, 05 Sep 2012 22:47:14 +0000, Oscar Benjamin wrote:
>
> I may have been overly-conservative earlier when I said that on average
> string equality has to compare half the characters. I thought I had
> remembered that from a computer science textbook, but I can't find that
> reference now, so possibly I was thinking of something else. (String
> searching perhaps?).

Yeah I think you mixed it up with searching an entry in an unsorted list 
of length N


That's one of the most common N/2 cases, that one hits daily with many 
straight forward implementations

[toc] | [prev] | [next] | [standalone]

Page 4 of 6 — ← Prev page 1 2 3 [4] 5 6 Next page →

csiph-web

Comparing strings from the back?

Contents

#28818

#28820

#28883

#28888

#28821

#28822

#28823

#28825

#28882

#28572

#29196

#28582

#28586

#28594

#28600

#28602

#28605

#28615

#28603

#28727