Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #28380 > unrolled thread
| Started by | Roy Smith <roy@panix.com> |
|---|---|
| First post | 2012-09-03 21:54 -0400 |
| Last post | 2012-09-10 21:52 +0000 |
| Articles | 20 on this page of 101 — 21 participants |
Back to article view | Back to comp.lang.python
Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-03 21:54 -0400
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-04 12:07 +1000
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 02:17 +0000
Re: Comparing strings from the back? Dan Sommers <dan@tombstonezero.net> - 2012-09-03 21:56 -0700
Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-04 08:50 +0100
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-04 18:32 +0200
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 18:07 +0000
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:17 +0200
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-05 07:59 +1000
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:24 +0200
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:43 +0200
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-05 14:30 +0000
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 16:51 +0200
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-05 16:24 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-05 22:47 +0000
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-06 08:33 +0000
Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 06:07 -0400
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-07 04:06 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-07 19:10 +0000
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-08 00:55 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-08 11:53 +0000
RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-13 18:39 +0000
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 15:37 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 20:48 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 00:46 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 21:54 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 01:38 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 23:06 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 04:03 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-14 01:20 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 04:53 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-14 03:26 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 07:36 -0400
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-14 10:16 +0000
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 07:43 -0400
RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-14 22:43 +0000
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 23:10 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-16 18:11 -0700
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-17 14:05 +1000
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-16 23:06 -0700
Re: Comparing strings from the back? Ethan Furman <ethan@stoneleaf.us> - 2012-09-17 13:35 -0700
Re: Comparing strings from the back? Neil Hodgson <nhodgson@iinet.net.au> - 2012-09-18 09:14 +1000
Re: Comparing strings from the back? Ethan Furman <ethan@stoneleaf.us> - 2012-09-18 08:12 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 11:55 -0400
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 11:59 -0400
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 12:17 -0400
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-19 00:47 +0000
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-19 02:20 +1000
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 16:40 -0400
Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-19 00:48 +0100
Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-13 20:53 +0100
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 17:06 -0400
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-14 03:39 +0000
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-14 14:15 +1000
Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-13 22:17 +0100
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 17:35 -0400
RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-14 21:32 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-07 19:40 +0000
Re: Comparing strings from the back? Gelonida N <gelonida@gmail.com> - 2012-09-08 17:52 +0200
Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-10 08:59 +0000
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-10 13:45 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:06 +0000
Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-11 09:51 +0000
Re: Comparing strings from the back? Terry Reedy <tjreedy@udel.edu> - 2012-09-11 11:55 -0400
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-11 00:26 +1000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:32 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:43 +0000
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-11 00:56 +1000
Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-11 09:41 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-06 12:04 +0100
Re: Comparing strings from the back? Steve Howell <showell30@yahoo.com> - 2012-09-14 17:51 -0700
Re: Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-06 08:13 -0400
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-06 22:29 +1000
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 15:43 +0200
Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 10:23 -0400
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:33 +0200
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:42 +0200
Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 11:54 -0400
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:34 +0200
Re: Comparing strings from the back? Gelonida N <gelonida@gmail.com> - 2012-09-08 17:50 +0200
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 15:37 +0200
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-07 00:39 +1000
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 17:36 +0200
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 17:44 +0200
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-07 04:07 +0000
Re: Comparing strings from the back? Terry Reedy <tjreedy@udel.edu> - 2012-09-04 01:13 -0400
Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-04 08:56 +0100
Re: Comparing strings from the back? Alain Ketterlin <alain@dpt-info.u-strasbg.fr> - 2012-09-04 11:58 +0200
Re: Comparing strings from the back? Neil Hodgson <nhodgson@iinet.net.au> - 2012-09-05 12:18 +1000
Re: Comparing strings from the back? MRAB <python@mrabarnett.plus.com> - 2012-09-05 03:39 +0100
Re: Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-04 22:48 -0400
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 16:33 +0200
Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 10:29 +0200
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-05 18:33 +1000
Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 11:48 +0200
Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 17:45 +0200
Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 18:07 +0200
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 16:33 +0000
Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 19:32 +0200
Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 19:44 +0200
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 21:52 +0000
Page 4 of 6 — ← Prev page 1 2 3 [4] 5 6 Next page →
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-09-10 13:45 +0000 |
| Message-ID | <504deedc$0$29981$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #28807 |
On Mon, 10 Sep 2012 08:59:37 +0000, Duncan Booth wrote: > Gelonida N <gelonida@gmail.com> wrote: > >> On 09/07/2012 06:06 AM, Steven D'Aprano wrote: >>> On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote: >>> >>> >>> Also of some interest is the best case: O(1) for unequal strings (they >>> differ at the first character) and O(N) for equal strings. >> >> The worst case is O(N) or N characters the average case is O(1) or two >> characters. >> >> For denial of service attacks or systems, that are NEVER allowed to >> fail the worst case is important. >> >> For most other cases the average complexity counts. >> >> However I still wonder for how many applications the complexity of >> string comparisons would be the limiting factor. >> >> > and of course if you ever do find an application where that worst case > matters there's an easy way round it: just call intern() on all the > strings when they are created. There are other good reasons for interning strings, but speeding up "astring == bstring" is not one of them. [steve@ando ~]$ python -m timeit -s "s = 'abc'*100000" -s \ > "t = s[:-1] + 'x'" "s == t" 1000 loops, best of 3: 910 usec per loop [steve@ando ~]$ python -m timeit -s "s = 'abc'*100000" -s \ > "t = s[:-1] + 'x'" -s "intern(s); intern(t)" "s == t" 1000 loops, best of 3: 914 usec per loop No significant difference. To state the bleedin' obvious, the computational effort required to *compare two strings* pre-supposes that those strings already exist, and is *completely independent* of the complexity of creating the strings. > For the comparison to be the limiting factor you have to be doing a lot > of comparisons Correct. > on the same string (otherwise creating the string would be the limiting > factor), Not necessarily. Look, it's really hard to come up with a realistic, non-contrived example where string comparisons are a significant bottleneck in a non-toy application. So let me first off say that *nearly always* you're not going to care whether "s == t" looks at all N characters or just the first 2 (or 20, or 100). This argument is rather academic (the best sort of argument!). Until you start getting up into truly huge strings, we're arguing about how many angels can dance on a CPU cache. But for the record, in principle string comparisons *could* be the bottleneck. Example: you have 10000 strings, which are each created once and stored in a list. Then you iterate over the list, comparing every string against every other string. And due to some weird vagary of the data, the strings are nearly all equal. (Why would you do this? I don't know, maybe it's a programmers' challenge found on the Internet, make up your own scenario...) Total number of strings created: 10000. Total number of strings compared: 100000000. The overhead of creating the strings is trivial compared to the overhead of comparing them, and since each string is only created once anyway, interning them is just a waste of time. > so at the expense of a single dictionary > insertion when the string is created you can get guaranteed O(1) on all > the comparisons. What interning buys you is that "s == t" is an O(1) pointer compare if they are equal. But if s and t differ in the last character, __eq__ will still inspect every character. There is no way to tell Python "all strings are interned, if s is not t then s != t as well". -- Steven
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2012-09-10 14:06 +0000 |
| Message-ID | <mailman.444.1347285990.27098.python-list@python.org> |
| In reply to | #28818 |
On 2012-09-10, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > On Mon, 10 Sep 2012 08:59:37 +0000, Duncan Booth wrote: > >> Gelonida N <gelonida@gmail.com> wrote: >> >> so at the expense of a single dictionary >> insertion when the string is created you can get guaranteed O(1) on all >> the comparisons. > > What interning buys you is that "s == t" is an O(1) pointer compare if > they are equal. But if s and t differ in the last character, __eq__ will > still inspect every character. There is no way to tell Python "all > strings are interned, if s is not t then s != t as well". > I thought that if *both* strings were interned then a pointer comparison could decide if they were unequal without needing to check the characters. Have I misunderstood how intern() works? Oscar
[toc] | [prev] | [next] | [standalone]
| From | Duncan Booth <duncan.booth@invalid.invalid> |
|---|---|
| Date | 2012-09-11 09:51 +0000 |
| Message-ID | <XnsA0CB6DD177060duncanbooth@127.0.0.1> |
| In reply to | #28820 |
Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote: >> What interning buys you is that "s == t" is an O(1) pointer compare >> if they are equal. But if s and t differ in the last character, >> __eq__ will still inspect every character. There is no way to tell >> Python "all strings are interned, if s is not t then s != t as well". >> > > I thought that if *both* strings were interned then a pointer > comparison could decide if they were unequal without needing to check > the characters. > > Have I misunderstood how intern() works? > I don't think you've misunderstood how it work, but so far as I can see the code doesn't attempt to short circuit the "not equal but interned" case. The comparison code doesn't look at interning at all, it only looks for identity as a shortcut. -- Duncan Booth http://kupuguy.blogspot.com
[toc] | [prev] | [next] | [standalone]
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-09-11 11:55 -0400 |
| Message-ID | <mailman.509.1347378993.27098.python-list@python.org> |
| In reply to | #28883 |
On 9/11/2012 6:40 AM, Oscar Benjamin wrote: > On 11 September 2012 10:51, Duncan Booth <duncan.booth@invalid.invalid > <mailto:duncan.booth@invalid.invalid>> wrote: > > Oscar Benjamin <oscar.j.benjamin@gmail.com > <mailto:oscar.j.benjamin@gmail.com>> wrote: > > >> What interning buys you is that "s == t" is an O(1) pointer compare > >> if they are equal. But if s and t differ in the last character, > >> __eq__ will still inspect every character. There is no way to tell > >> Python "all strings are interned, if s is not t then s != t as > well". > >> > > > > I thought that if *both* strings were interned then a pointer > > comparison could decide if they were unequal without needing to check > > the characters. > > > > Have I misunderstood how intern() works? > > > > I don't think you've misunderstood how it work, but so far as I can > see the > code doesn't attempt to short circuit the "not equal but interned" case. > The comparison code doesn't look at interning at all, it only looks for > identity as a shortcut. > > > It also doesn't seem to check if the hash values have been set. I guess > the cached hash value is only used in contexts where the hash is > explicitly desired.- I believe the internal use of interning and hash comparison has varied from release to release. However, the main use of string comparison is for dict keys, especially the internal dicts for namespaces and attributes. Since the dict lookup code needs hash values anyway, to find slots with possible conflicts, I am sure it does not use the generic comparison operators but starts with hash comparisons. Terry Jan Reedy
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-09-11 00:26 +1000 |
| Message-ID | <mailman.445.1347287192.27098.python-list@python.org> |
| In reply to | #28818 |
On Tue, Sep 11, 2012 at 12:06 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote: > On 2012-09-10, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: >> What interning buys you is that "s == t" is an O(1) pointer compare if >> they are equal. But if s and t differ in the last character, __eq__ will >> still inspect every character. There is no way to tell Python "all >> strings are interned, if s is not t then s != t as well". >> > > I thought that if *both* strings were interned then a pointer comparison could > decide if they were unequal without needing to check the characters. > > Have I misunderstood how intern() works? In a language where _all_ strings are guaranteed to be interned (such as Lua, I think), you do indeed gain this. Pointer inequality implies string inequality. But when interning is optional (as in Python), you cannot depend on that, unless there's some way of recognizing interned strings. Of course, that may indeed be the case; a simple bit flag "this string has been interned" would suffice, and if both strings are interned AND their pointers differ, THEN you can be sure the strings differ. I have no idea whether or not CPython version X.Y.Z does this. The value of such an optimization really depends on how likely strings are to be interned; for instance, if the compiler automatically interns all the names of builtins, this could be quite beneficial. Otherwise, probably not; most Python scripts don't bother interning anything. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2012-09-10 14:32 +0000 |
| Message-ID | <mailman.446.1347287554.27098.python-list@python.org> |
| In reply to | #28818 |
On 2012-09-10, Chris Angelico <rosuav@gmail.com> wrote: > On Tue, Sep 11, 2012 at 12:06 AM, Oscar Benjamin ><oscar.j.benjamin@gmail.com> wrote: >> On 2012-09-10, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: >>> What interning buys you is that "s == t" is an O(1) pointer compare if >>> they are equal. But if s and t differ in the last character, __eq__ will >>> still inspect every character. There is no way to tell Python "all >>> strings are interned, if s is not t then s != t as well". >>> >> >> I thought that if *both* strings were interned then a pointer comparison >> could decide if they were unequal without needing to check the characters. >> >> Have I misunderstood how intern() works? > > In a language where _all_ strings are guaranteed to be interned (such > as Lua, I think), you do indeed gain this. Pointer inequality implies > string inequality. But when interning is optional (as in Python), you > cannot depend on that, unless there's some way of recognizing interned > strings. Of course, that may indeed be the case; a simple bit flag > "this string has been interned" would suffice, and if both strings are > interned AND their pointers differ, THEN you can be sure the strings > differ. > > I have no idea whether or not CPython version X.Y.Z does this. The > value of such an optimization really depends on how likely strings are > to be interned; for instance, if the compiler automatically interns > all the names of builtins, this could be quite beneficial. Otherwise, > probably not; most Python scripts don't bother interning anything. > I haven't looked at the source but my understanding was precisely that there is an intern() bit and that not only the builtins module but all the literals in any byte-compiled module are interned. Oscar
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2012-09-10 14:43 +0000 |
| Message-ID | <mailman.447.1347288231.27098.python-list@python.org> |
| In reply to | #28818 |
On 2012-09-10, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote: > On 2012-09-10, Chris Angelico <rosuav@gmail.com> wrote: >> On Tue, Sep 11, 2012 at 12:06 AM, Oscar Benjamin >><oscar.j.benjamin@gmail.com> wrote: >>> On 2012-09-10, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: >>>> What interning buys you is that "s == t" is an O(1) pointer compare if >>>> they are equal. But if s and t differ in the last character, __eq__ will >>>> still inspect every character. There is no way to tell Python "all >>>> strings are interned, if s is not t then s != t as well". >>>> >>> >>> I thought that if *both* strings were interned then a pointer comparison >>> could decide if they were unequal without needing to check the characters. >>> >>> Have I misunderstood how intern() works? >> >> In a language where _all_ strings are guaranteed to be interned (such >> as Lua, I think), you do indeed gain this. Pointer inequality implies >> string inequality. But when interning is optional (as in Python), you >> cannot depend on that, unless there's some way of recognizing interned >> strings. Of course, that may indeed be the case; a simple bit flag >> "this string has been interned" would suffice, and if both strings are >> interned AND their pointers differ, THEN you can be sure the strings >> differ. >> >> I have no idea whether or not CPython version X.Y.Z does this. The >> value of such an optimization really depends on how likely strings are >> to be interned; for instance, if the compiler automatically interns >> all the names of builtins, this could be quite beneficial. Otherwise, >> probably not; most Python scripts don't bother interning anything. >> > > I haven't looked at the source but my understanding was precisely that there > is an intern() bit and that not only the builtins module but all the literals > in any byte-compiled module are interned. > s/literals/identifiers/ You can see the interned flag in the PyUnicodeObject struct here: http://hg.python.org/cpython/file/3ffd6ad93fe4/Include/unicodeobject.h#l303 Oscar
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-09-11 00:56 +1000 |
| Message-ID | <mailman.448.1347288972.27098.python-list@python.org> |
| In reply to | #28818 |
On Tue, Sep 11, 2012 at 12:43 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote: > On 2012-09-10, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote: >> I haven't looked at the source but my understanding was precisely that there >> is an intern() bit and that not only the builtins module but all the literals >> in any byte-compiled module are interned. >> > > s/literals/identifiers/ > > You can see the interned flag in the PyUnicodeObject struct here: > http://hg.python.org/cpython/file/3ffd6ad93fe4/Include/unicodeobject.h#l303 Ah, yep, so that's there. In that case, it's possible to have that optimization. However, I may be misreading this, but it seems the only Unicode comparison function is a rich compare, which is unable to take advantage of a known difference: http://hg.python.org/cpython/file/b48ef168d8c5/Objects/unicodeobject.c#l6114 Different pointers prove the strings differ, but don't tell you which is to be sorted earlier. You could use this if you roll your own comparison in C; or, if you already know the strings are interned, you can use 'is' / 'is not'. But that seems to be the extent of it. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Duncan Booth <duncan.booth@invalid.invalid> |
|---|---|
| Date | 2012-09-11 09:41 +0000 |
| Message-ID | <XnsA0CB6C9F85D97duncanbooth@127.0.0.1> |
| In reply to | #28818 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > But for the record, in principle string comparisons *could* be the > bottleneck. Example: you have 10000 strings, which are each created > once and stored in a list. Then you iterate over the list, comparing > every string against every other string. And due to some weird vagary > of the data, the strings are nearly all equal. > > (Why would you do this? I don't know, maybe it's a programmers' > challenge found on the Internet, make up your own scenario...) > > Total number of strings created: 10000. > Total number of strings compared: 100000000. which is exactly what I meant by doing a lot of comparisons (10000) on the same string. Perhaps if I'd phrased it more as "you have to be doing many more comparisons than string creation operations" it would have been clearer what I meant. > The overhead of creating the strings is trivial compared to the > overhead of comparing them, and since each string is only created once > anyway, interning them is just a waste of time. No, you created 10k strings many of which are equal and then did 10k comparisons on each most of which found 'yes' they are equal. Interning them would have reduced all the 'true' comparisons to an identity check at the cost of 1 hash and 1 full comparison per string. >> so at the expense of a single dictionary >> insertion when the string is created you can get guaranteed O(1) on >> all the comparisons. > > What interning buys you is that "s == t" is an O(1) pointer compare if > they are equal. But if s and t differ in the last character, __eq__ > will still inspect every character. There is no way to tell Python > "all strings are interned, if s is not t then s != t as well". Right if the strings differ only in the last character, but the rest of this thread has been about how, for random strings, the not-equal case is O(1) as well. -- Duncan Booth http://kupuguy.blogspot.com
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2012-09-06 12:04 +0100 |
| Message-ID | <mailman.289.1346929477.27098.python-list@python.org> |
| In reply to | #28557 |
On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel <d@davea.name> wrote: > For random strings (as defined below), the average compare time is > effectively unrelated to the size of the string, once the size passes > some point. > Define random string as being a selection from a set of characters, with > replacement. So if we pick some set of characters, say 10 (or 256, it > doesn't really matter), the number of possible strings is 10**N. > The likelihood of not finding a mismatch within k characters is > (1/10)**k The likelihood of actually reaching the end of the random > string is (1/10)**N. (which is the reciprocal of the number of strings, > naturally) > If we wanted an average number of comparisons, we'd have to calculate a > series, where each term is a probability times a value for k. > sum((k * 9*10**-k) for k in range(1, 10)) > Those terms very rapidly approach 0, so it's safe to stop after a few. > Looking at the first 9 items, I see a value of 1.1111111 > This may not be quite right, but the value is certainly well under 2 for > a population of 10 characters, chosen randomly. And notice that N > doesn't really come into it. It's exactly right. You can obtain this result analytically from Johannes' formula above. Just replace 256 with 10 to get that the expected number of comparisons is (10/9)*(1 - 10**(-N)) The last term shows the dependence on N and is tiny even for N=9. Oscar
[toc] | [prev] | [next] | [standalone]
| From | Steve Howell <showell30@yahoo.com> |
|---|---|
| Date | 2012-09-14 17:51 -0700 |
| Message-ID | <6da9ea89-c3ee-4c2d-9d86-c7b164113bb0@kr6g2000pbb.googlegroups.com> |
| In reply to | #28572 |
On Sep 6, 4:04 am, Oscar Benjamin <oscar.j.benja...@gmail.com> wrote:
> On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel <d...@davea.name> wrote:
> > For random strings (as defined below), the average compare time is
> > effectively unrelated to the size of the string, once the size
> passes
> > some point.
> > Define random string as being a selection from a set of characters,
> with
> > replacement. So if we pick some set of characters, say 10 (or 256,
> it
> > doesn't really matter), the number of possible strings is 10**N.
> > The likelihood of not finding a mismatch within k characters is
> > (1/10)**k The likelihood of actually reaching the end of the
> random
> > string is (1/10)**N. (which is the reciprocal of the number of
> strings,
> > naturally)
> > If we wanted an average number of comparisons, we'd have to
> calculate a
> > series, where each term is a probability times a value for k.
> > sum((k * 9*10**-k) for k in range(1, 10))
> > Those terms very rapidly approach 0, so it's safe to stop after a
> few.
> > Looking at the first 9 items, I see a value of 1.1111111
> > This may not be quite right, but the value is certainly well under
> 2 for
> > a population of 10 characters, chosen randomly. And notice that N
> > doesn't really come into it.
>
> It's exactly right. You can obtain this result analytically from
> Johannes' formula above. Just replace 256 with 10 to get that the
> expected number of comparisons is
>
> (10/9)*(1 - 10**(-N))
The math here is simple, but of course the average running time is
also driven by usage patterns.
Let's say you have two independently produced strings with N number of
random bits. The average amount of bit comparisons that it takes to
determine that the strings are different is between 1 and 2 checks for
any value of N, for the same reason that the average number of times
you need to flip a coin before either coming up heads or exhausting
your N tries is always less than 2, and pretty darn close to 2 for
even relatively small values of N.
def compute_checks_for_different_strings(n):
x = 1
p = 0.5
for i in range(0, n):
x += p * i
p = p * 0.5
print n, x
for n in [10, 100, 1000, 10000]:
compute_checks_for_different_strings(n)
10 1.9892578125
100 2.0
1000 2.0
10000 2.0
Now let's say your string comparison function is used to verify 100-
bit passwords, and 99% of the time the password is from a legit user
who types it correctly. Except for the 1% of a time when somebody fat
fingers the copy/paste of their password, every strcmp call is gonna
have to compare 100 pairs of characters, or N pairs. If the usage
pattern says that 99% of the time you're simply verifying that two
strings are equal, then the number of bits is the main driving factor
for running time.
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-09-06 08:13 -0400 |
| Message-ID | <roy-F077EE.08130706092012@news.panix.com> |
| In reply to | #28557 |
In article <50485fca$0$29977$c3e8da3$5496439d@news.astraweb.com>, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > In any case, the *worst* case for string equality > testing is certainly O(N) (every character must be looked at), and the > *best* case is O(1) obviously (the first character fails to match). The best case is O(0), if either string is empty (ducking and running).
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-09-06 22:29 +1000 |
| Message-ID | <mailman.299.1346934558.27098.python-list@python.org> |
| In reply to | #28582 |
n Thu, Sep 6, 2012 at 10:13 PM, Roy Smith <roy@panix.com> wrote: > In article <50485fca$0$29977$c3e8da3$5496439d@news.astraweb.com>, > Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > >> In any case, the *worst* case for string equality >> testing is certainly O(N) (every character must be looked at), and the >> *best* case is O(1) obviously (the first character fails to match). > > The best case is O(0), if either string is empty (ducking and running). No, O(0) would be when the application decides not to compare at all. ChrisA (also ducking)
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2012-09-06 15:43 +0200 |
| Message-ID | <k2a99e$i5p$1@news.albasani.net> |
| In reply to | #28557 |
On 06.09.2012 10:33, Steven D'Aprano wrote:
> But you know, it really doesn't make a difference. Equality testing will
> *still* be O(N) since the asymptomatic behaviour for sufficiently large
> string comparisons will be bounded by O(N). Multiplicative constants
> ("half the string" versus "0.001 of the string") do not matter.
Wrong, at least for randomized strings (i.e. every character with the
same probability). O(N) is worst-case, O(log N) is correct for
randomized strings.
Then again, since you were nitpicking about Landau notation earlier this
thread, every function bound by O(log N) is also bound by O(N), since,
as you correctly stated, it's only a upper bound. We should be talking
about the asymptotic sharp bound, i.e. capital Theta.
> I may have been overly-conservative earlier when I said that on average
> string equality has to compare half the characters. I thought I had
> remembered that from a computer science textbook, but I can't find that
> reference now, so possibly I was thinking of something else. (String
> searching perhaps?). In any case, the *worst* case for string equality
> testing is certainly O(N) (every character must be looked at), and the
> *best* case is O(1) obviously (the first character fails to match). But
> I'm not so sure about the average case. Further thought is required.
Yes, worst-case is O(N), best case O(1). Average is O(n log n).
Best regards,
Johannes
--
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
- Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-09-06 10:23 -0400 |
| Message-ID | <mailman.304.1346941450.27098.python-list@python.org> |
| In reply to | #28594 |
On 09/06/2012 09:43 AM, Johannes Bauer wrote: > <snip> > Yes, worst-case is O(N), best case O(1). Average is O(n log n). Can't see how you came up with an average of n log(n). Fourteen minutes before you made this post, you demonstrated it was less than 2 for any N. And I previously posted that for a character set of size 10, the average was 1.1111 -- DaveA
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2012-09-06 16:33 +0200 |
| Message-ID | <k2ac6g$o2n$1@news.albasani.net> |
| In reply to | #28600 |
On 06.09.2012 16:23, Dave Angel wrote: > On 09/06/2012 09:43 AM, Johannes Bauer wrote: >> <snip> >> Yes, worst-case is O(N), best case O(1). Average is O(n log n). > > Can't see how you came up with an average of n log(n). Fourteen minutes > before you made this post, you demonstrated it was less than 2 for any N. O(log n) is what I meant, sorry! Damnit. > And I previously posted that for a character set of size 10, the average > was 1.1111 For any given string n and and alphabet size l, the average is: sum(i = 0..n) (1 / (l ^ n)) So with larger word length, the total complexity constantly increases. The amount by which it increases however is shrinking exponentially with the word length. Therefore O(log n). Sorry about the n log n mistake, argh. Best regards & thanks for the correction, Johannes -- >> Wo hattest Du das Beben nochmal GENAU vorhergesagt? > Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2012-09-06 16:42 +0200 |
| Message-ID | <k2acou$p8n$1@news.albasani.net> |
| In reply to | #28600 |
On 06.09.2012 16:23, Dave Angel wrote: > On 09/06/2012 09:43 AM, Johannes Bauer wrote: >> <snip> >> Yes, worst-case is O(N), best case O(1). Average is O(n log n). > > Can't see how you came up with an average of n log(n). Fourteen minutes > before you made this post, you demonstrated it was less than 2 for any N. > > And I previously posted that for a character set of size 10, the average > was 1.1111 Again playing with the equations and thinking about it again. And I completely missed your point. It wasn't about n log n vs. log n. Both are wrong. This surprises me. I was somehow thinking about the limit of sum (1/n), n -> infty. But since the summands are shrinking exponentially, the limit is different. I think the limit of the average comparisons for a given wordlength n against infinity with alphabet size l is l / (l - 1) i.e. for bitstrings it's 2 and for bytestrings it's 256/255. This would mean string comparison of randomized strings is O(1). Can that really be true? It looks like it. Best regards, Johannes -- >> Wo hattest Du das Beben nochmal GENAU vorhergesagt? > Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-09-06 11:54 -0400 |
| Message-ID | <mailman.313.1346946924.27098.python-list@python.org> |
| In reply to | #28605 |
On 09/06/2012 10:42 AM, Johannes Bauer wrote: > On 06.09.2012 16:23, Dave Angel wrote: >> On 09/06/2012 09:43 AM, Johannes Bauer wrote: >>> <snip> >>> Yes, worst-case is O(N), best case O(1). Average is O(n log n). >> Can't see how you came up with an average of n log(n). Fourteen minutes >> before you made this post, you demonstrated it was less than 2 for any N. >> >> And I previously posted that for a character set of size 10, the average >> was 1.1111 > Again playing with the equations and thinking about it again. And I > completely missed your point. It wasn't about n log n vs. log n. Both > are wrong. This surprises me. I was somehow thinking about the limit of > sum (1/n), n -> infty. But since the summands are shrinking > exponentially, the limit is different. > > I think the limit of the average comparisons for a given wordlength n > against infinity with alphabet size l is > > l / (l - 1) > > i.e. for bitstrings it's 2 and for bytestrings it's 256/255. > > This would mean string comparison of randomized strings is O(1). Can > that really be true? It looks like it. > (Just lost the internet in a storm, so I'm not sure how long it'll be before this sends.) Thanks, that was exactly my point. Since el is at least 2, the average number of comparisons is no larger than 2, for any value of N. That's why, when I'm visually comparing MD5 values, I usually only look at the first few, and last few hex digits. However, Chris Angelico (at 10:39) pointed out again that totally random strings aren't real-world equivalent. He has now proposed that somebody measure real-world cases here, by patching the string comparison to gather statistics. I think that's beyond me at this point. I only joined this thread when the cases under study were well-defined random, and therefore predictable. <g> -- DaveA
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2012-09-06 16:34 +0200 |
| Message-ID | <k2ac8i$o2n$2@news.albasani.net> |
| In reply to | #28594 |
On 06.09.2012 15:43, Johannes Bauer wrote: > Wrong, at least for randomized strings (i.e. every character with the > same probability). O(N) is worst-case, O(log N) is correct for > randomized strings. ^^ Here I write the right thing. Then further below... > Yes, worst-case is O(N), best case O(1). Average is O(n log n). ...I write the wrong thing. O(log n) is what I meant, as Dave correctly noticed. Best regards, Johannes -- >> Wo hattest Du das Beben nochmal GENAU vorhergesagt? > Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
| From | Gelonida N <gelonida@gmail.com> |
|---|---|
| Date | 2012-09-08 17:50 +0200 |
| Message-ID | <mailman.385.1347119434.27098.python-list@python.org> |
| In reply to | #28557 |
On 09/06/2012 10:33 AM, Steven D'Aprano wrote: > On Wed, 05 Sep 2012 22:47:14 +0000, Oscar Benjamin wrote: > > I may have been overly-conservative earlier when I said that on average > string equality has to compare half the characters. I thought I had > remembered that from a computer science textbook, but I can't find that > reference now, so possibly I was thinking of something else. (String > searching perhaps?). Yeah I think you mixed it up with searching an entry in an unsorted list of length N That's one of the most common N/2 cases, that one hits daily with many straight forward implementations
[toc] | [prev] | [next] | [standalone]
Page 4 of 6 — ← Prev page 1 2 3 [4] 5 6 Next page →
Back to top | Article view | comp.lang.python
csiph-web