Groups > comp.lang.python > #28380 > unrolled thread

Comparing strings from the back?

Started by	Roy Smith <roy@panix.com>
First post	2012-09-03 21:54 -0400
Last post	2012-09-10 21:52 +0000
Articles	20 on this page of 101 — 21 participants

Back to article view | Back to comp.lang.python

  Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-03 21:54 -0400
    Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-04 12:07 +1000
    Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 02:17 +0000
      Re: Comparing strings from the back? Dan Sommers <dan@tombstonezero.net> - 2012-09-03 21:56 -0700
      Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-04 08:50 +0100
      Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-04 18:32 +0200
        Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 18:07 +0000
          Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:17 +0200
        Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-05 07:59 +1000
          Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:24 +0200
            Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:43 +0200
              Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-05 14:30 +0000
                Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 16:51 +0200
                  Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-05 16:24 +0000
                    Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-05 22:47 +0000
                      Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-06 08:33 +0000
                        Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 06:07 -0400
                          Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-07 04:06 +0000
                            Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-07 19:10 +0000
                              Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-08 00:55 +0000
                                Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-08 11:53 +0000
                                RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-13 18:39 +0000
                                Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 15:37 -0400
                                  Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 20:48 -0700
                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 00:46 -0400
                                      Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 21:54 -0700
                                        Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 01:38 -0400
                                          Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 23:06 -0700
                                            Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 04:03 -0400
                                              Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-14 01:20 -0700
                                                Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 04:53 -0400
                                                  Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-14 03:26 -0700
                                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 07:36 -0400
                                                Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-14 10:16 +0000
                                                  Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 07:43 -0400
                                                RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-14 22:43 +0000
                                                Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 23:10 -0400
                                                  Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-16 18:11 -0700
                                                    Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-17 14:05 +1000
                                                      Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-16 23:06 -0700
                                                Re: Comparing strings from the back? Ethan Furman <ethan@stoneleaf.us> - 2012-09-17 13:35 -0700
                                                  Re: Comparing strings from the back? Neil Hodgson <nhodgson@iinet.net.au> - 2012-09-18 09:14 +1000
                                                    Re: Comparing strings from the back? Ethan Furman <ethan@stoneleaf.us> - 2012-09-18 08:12 -0700
                                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 11:55 -0400
                                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 11:59 -0400
                                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 12:17 -0400
                                                      Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-19 00:47 +0000
                                                    Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-19 02:20 +1000
                                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 16:40 -0400
                                                    Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-19 00:48 +0100
                                Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-13 20:53 +0100
                                Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 17:06 -0400
                                  Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-14 03:39 +0000
                                    Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-14 14:15 +1000
                                Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-13 22:17 +0100
                                Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 17:35 -0400
                                RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-14 21:32 +0000
                            Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-07 19:40 +0000
                            Re: Comparing strings from the back? Gelonida N <gelonida@gmail.com> - 2012-09-08 17:52 +0200
                              Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-10 08:59 +0000
                                Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-10 13:45 +0000
                                  Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:06 +0000
                                    Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-11 09:51 +0000
                                      Re: Comparing strings from the back? Terry Reedy <tjreedy@udel.edu> - 2012-09-11 11:55 -0400
                                  Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-11 00:26 +1000
                                  Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:32 +0000
                                  Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:43 +0000
                                  Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-11 00:56 +1000
                                  Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-11 09:41 +0000
                        Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-06 12:04 +0100
                          Re: Comparing strings from the back? Steve Howell <showell30@yahoo.com> - 2012-09-14 17:51 -0700
                        Re: Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-06 08:13 -0400
                          Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-06 22:29 +1000
                        Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 15:43 +0200
                          Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 10:23 -0400
                            Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:33 +0200
                            Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:42 +0200
                              Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 11:54 -0400
                          Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:34 +0200
                        Re: Comparing strings from the back? Gelonida N <gelonida@gmail.com> - 2012-09-08 17:50 +0200
                    Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 15:37 +0200
                      Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-07 00:39 +1000
                        Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 17:36 +0200
                          Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 17:44 +0200
                        Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-07 04:07 +0000
    Re: Comparing strings from the back? Terry Reedy <tjreedy@udel.edu> - 2012-09-04 01:13 -0400
    Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-04 08:56 +0100
    Re: Comparing strings from the back? Alain Ketterlin <alain@dpt-info.u-strasbg.fr> - 2012-09-04 11:58 +0200
    Re: Comparing strings from the back? Neil Hodgson <nhodgson@iinet.net.au> - 2012-09-05 12:18 +1000
      Re: Comparing strings from the back? MRAB <python@mrabarnett.plus.com> - 2012-09-05 03:39 +0100
      Re: Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-04 22:48 -0400
      Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 16:33 +0200
    Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 10:29 +0200
    Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-05 18:33 +1000
    Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 11:48 +0200
    Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 17:45 +0200
    Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 18:07 +0200
    Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 16:33 +0000
    Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 19:32 +0200
    Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 19:44 +0200
    Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 21:52 +0000

Page 1 of 6 [1] 2 3 4 5 6 Next page →

#28380 — Comparing strings from the back?

From	Roy Smith <roy@panix.com>
Date	2012-09-03 21:54 -0400
Subject	Comparing strings from the back?
Message-ID	<roy-B37C28.21540103092012@news.panix.com>

There's been a bunch of threads lately about string implementations, and 
that got me thinking (which is often a dangerous thing).

Let's assume you're testing two strings for equality.  You've already 
done the obvious quick tests (i.e they're the same length), and you're 
down to the O(n) part of comparing every character.

I'm wondering if it might be faster to start at the ends of the strings 
instead of at the beginning?  If the strings are indeed equal, it's the 
same amount of work starting from either end.  But, if it turns out that 
for real-life situations, the ends of strings have more entropy than the 
beginnings, the odds are you'll discover that they're unequal quicker by 
starting at the end.

It doesn't seem un-plausible that this is the case.  For example, most 
of the filenames I work with begin with "/home/roy/".  Most of the 
strings I use as memcache keys have one of a small number of prefixes.  
Most of the strings I use as logger names have common leading 
substrings.  Things like credit card and telephone numbers tend to have 
much more entropy in the trailing digits.

On the other hand, hostnames (and thus email addresses) exhibit the 
opposite pattern.

Anyway, it's just a thought.  Has anybody studied this for real-life 
usage patterns?

I'm also not sure how this work with all the possible UCS/UTF encodings.  
With some of them, you may get the encoding semantics wrong if you don't 
start from the front.

[toc] | [next] | [standalone]

#28383

From	Chris Angelico <rosuav@gmail.com>
Date	2012-09-04 12:07 +1000
Message-ID	<mailman.164.1346724445.27098.python-list@python.org>
In reply to	#28380

On Tue, Sep 4, 2012 at 11:54 AM, Roy Smith <roy@panix.com> wrote:
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning?

> I'm also not sure how this work with all the possible UCS/UTF encodings.
> With some of them, you may get the encoding semantics wrong if you don't
> start from the front.

No problem there; Python uses only fixed-width encodings. Also, any
canonical encoding can be safely compared byte-for-byte; two identical
Unicode strings will be bit-wise identical in (say) UTF-8.

There's issues of cache locality and such that quite possibly mean
it's not going to be faster overall, but it wouldn't be difficult to
tweak the Python sources, recompile, and run some tests. I'm sure
python-dev or python-list will be most happy to discuss some benchmark
figures!

ChrisA

[toc] | [prev] | [next] | [standalone]

#28384

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-04 02:17 +0000
Message-ID	<504564ba$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28380

On Mon, 03 Sep 2012 21:54:01 -0400, Roy Smith wrote:

> Let's assume you're testing two strings for equality.  You've already
> done the obvious quick tests (i.e they're the same length), and you're
> down to the O(n) part of comparing every character.
> 
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning?  If the strings are indeed equal, it's the
> same amount of work starting from either end.  But, if it turns out that
> for real-life situations, the ends of strings have more entropy than the
> beginnings, the odds are you'll discover that they're unequal quicker by
> starting at the end.

And if the strings have the same end and different beginnings:

    running
    jumping
    cooking

then you'll find out quicker by starting at the beginning.

No general-purpose string comparison operator can make assumptions about 
the content of the strings, like "they are nearly always pathnames on 
Unix-like systems like /home/user/x and /home/user/y".

I suppose you could have two different operators, == for "test from the 
front" and === for "test from the back", and push that decision onto the 
user, who will then spend hours angsting about the "most efficient" 
equality test and days painstakingly profiling his data to save four 
nanoseconds at runtime.

Besides, then somebody will say "Yes, but what about the cases where the 
prefix and the suffix are both equal, but the middle will be different?" 
and propose a third string-equality operator ==== and then there will be 
bloodshed.

On average, string equality needs to check half the characters in the 
string. Any individual comparisons may require fewer, or more, than that, 
but whichever way you scan the string, some comparisons will take more 
and some will take fewer. With no general way of telling ahead of time 
which will be which, on average you can't do better than O(N/2) and no 
complexity justification for picking "start from the end" from "start 
from the start".

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#28389

From	Dan Sommers <dan@tombstonezero.net>
Date	2012-09-03 21:56 -0700
Message-ID	<mailman.169.1346735738.27098.python-list@python.org>
In reply to	#28384

On 2012-09-04 at 02:17:30 +0000,
Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> Besides, then somebody will say "Yes, but what about the cases where
> the prefix and the suffix are both equal, but the middle will be
> different?"  and propose a third string-equality operator ==== and
> then there will be bloodshed.

Lisp has several equality forms, although they're for different kinds or
levels of equality rather than for different algorithms for determining
that equality.  I imagine that a discussion similar to this one spawned
many of those forms.  I agree with Steven:  decisions such as this
belong to the application developer, not the standard library.  Don't
forget the fourth string-equality operator for case-insensitive
comparisons and the fifth string-equality operator for strings that are
likely to be palindromes.

That said, if I really wanted bloodshed, I would propose = for the third
string-equality operator!  ;-)

Dan

[toc] | [prev] | [next] | [standalone]

#28394

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-09-04 08:50 +0100
Message-ID	<mailman.171.1346745011.27098.python-list@python.org>
In reply to	#28384

On 04/09/2012 05:56, Dan Sommers wrote:
>
> That said, if I really wanted bloodshed, I would propose = for the third
> string-equality operator!  ;-)
>
> Dan
>

Dan "agent provocateur" Sommers? :)

-- 
Cheers.

Mark Lawrence.

[toc] | [prev] | [next] | [standalone]

#28405

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-09-04 18:32 +0200
Message-ID	<k25afd$7vj$1@news.albasani.net>
In reply to	#28384

On 04.09.2012 04:17, Steven D'Aprano wrote:

> On average, string equality needs to check half the characters in the 
> string.

How do you arrive at that conclusion? When comparing two random strings,
I just derived

n = (256 / 255) * (1 - 256 ^ (-c))

where n is the average number of character comparisons and c. The
rationale as follows: The first character has to be compared in any
case. The second with a probability of 1/256, the third with 1/(256^2)
and so on.

Best regards,
Johannes

-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

[toc] | [prev] | [next] | [standalone]

#28411

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-04 18:07 +0000
Message-ID	<50464361$0$29981$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28405

On Tue, 04 Sep 2012 18:32:57 +0200, Johannes Bauer wrote:

> On 04.09.2012 04:17, Steven D'Aprano wrote:
> 
>> On average, string equality needs to check half the characters in the
>> string.
> 
> How do you arrive at that conclusion? 

Take two non-empty strings of the same length, N. If the strings are 
equal, you have to make N comparisons to be sure they are equal (you 
check all N pairs of characters). If they are unequal, you have to check 
each pair of characters up to the first pair that are different:

def string_equality(astr, bstr):
    # Assumes lengths are equal.
    for achr, bchr in zip(astr, bstr):
        if achr != bchr:
            return False
    return True

For the unequal case, how many comparisons do you do? Ahead of time, we 
know very little about the strings. We don't even know how many possible 
characters there are (are they English alphanumeric, ASCII, Greek, Thai, 
or from the full range of 1114111 Unicode code points?) or what their 
probability distribution is.

A reasonable, conservative assumption is to calculate the largest 
possible value of the average for random strings. That largest value 
occurs when the alphabet is as small as possible, namely two characters. 
In practice, strings come from a larger alphabet, up to 1114111 different 
characters for full Unicode strings, so the average for them will be less 
than the average we calculate now.

So for unequal strings, the number of comparisons is equally likely to be 
1, 2, 3, ..., N. The average then is:

sum([1, 2, 3, ..., N])/N

which by a bit of simple algebra works out to be (N+1)/2, or half the 
characters as I said.

(Note that this average assumes the strings are completely random. In 
practice, strings tend to come from strongly non-uniform distributions of 
characters. For instance, in English texts, 'e' is much more likely than 
'q' and most characters are vanishingly rare, and in practice, strings 
are likely to be highly non-random.)

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#28471

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-09-05 11:17 +0200
Message-ID	<k275b1$h20$1@news.albasani.net>
In reply to	#28411

On 04.09.2012 20:07, Steven D'Aprano wrote:
> A reasonable, conservative assumption is to calculate the largest 
> possible value of the average for random strings. That largest value 
> occurs when the alphabet is as small as possible, namely two characters. 
> In practice, strings come from a larger alphabet, up to 1114111 different 
> characters for full Unicode strings, so the average for them will be less 
> than the average we calculate now.
> 
> So for unequal strings, the number of comparisons is equally likely to be 
> 1, 2, 3, ..., N. The average then is:

That is exactly what I don't agree with. For unequal random strings, the
distribution is *not* equally likely to have the same amount of
comparisons. For demonstration, let's look at bitstrings and the string
"1010". There 15 bitstrings of same length which are unequal and I'll
put the number of comparisons needed next to them going left to right:

0000  1
0001  1
0010  1
0011  1
0100  1
0101  1
0110  1
0111  1
1100  2
1101  2
1110  2
1111  2
1000  3
1001  3
1011  4

i.e. about 1.73 (26/15) comparisons in average with random strings
compared to the 2.5 comparisons you end up with. My formula does respect
lazy termination (because the probabilty of higher comparisons gets
exponentially lower).

> sum([1, 2, 3, ..., N])/N
> 
> which by a bit of simple algebra works out to be (N+1)/2, or half the 
> characters as I said.

Yes, but it's a flawed assumption you're making since you're neglecting
lazy evaluation and early abort of comparison.

Best regards,
Johannes

-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

[toc] | [prev] | [next] | [standalone]

#28428

From	Chris Angelico <rosuav@gmail.com>
Date	2012-09-05 07:59 +1000
Message-ID	<mailman.194.1346795987.27098.python-list@python.org>
In reply to	#28405

On Wed, Sep 5, 2012 at 2:32 AM, Johannes Bauer <dfnsonfsduifb@gmx.de> wrote:
> How do you arrive at that conclusion? When comparing two random strings,
> I just derived
>
> n = (256 / 255) * (1 - 256 ^ (-c))
>
> where n is the average number of character comparisons and c. The
> rationale as follows: The first character has to be compared in any
> case. The second with a probability of 1/256, the third with 1/(256^2)
> and so on.

That would be for comparing two random areas of memory. Python strings
don't have 256 options per character; and in terms of actual strings,
there's so many possibilities. The strings that a program is going to
compare for equality are going to use a vastly restricted alphabet;
for a lot of cases, there might be only a few dozen plausible
characters.

But even so, it's going to scale approximately linearly with the
string length. If they're really random, then yes, there's little
chance that either a 1MB string or a 2MB string will be the same, but
with real data, they might very well have a long common prefix. So
it's still going to be more or less O(n).

ChrisA

[toc] | [prev] | [next] | [standalone]

#28472

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-09-05 11:24 +0200
Message-ID	<k275on$i1l$1@news.albasani.net>
In reply to	#28428

On 04.09.2012 23:59, Chris Angelico wrote:

>> n = (256 / 255) * (1 - 256 ^ (-c))
>>
>> where n is the average number of character comparisons and c. The
>> rationale as follows: The first character has to be compared in any
>> case. The second with a probability of 1/256, the third with 1/(256^2)
>> and so on.
> 
> That would be for comparing two random areas of memory. Python strings
> don't have 256 options per character; and in terms of actual strings,
> there's so many possibilities. 

The unit of comparison is completely arbitraty and can equally be used
to compare bitstrings if changed accordingly.

> The strings that a program is going to
> compare for equality are going to use a vastly restricted alphabet;
> for a lot of cases, there might be only a few dozen plausible
> characters.

This is very true. But if so, I would like to see the assumtion that
you're making (which might very well be plausible) in order to arrive at
a average of N/2 comparisons (which just seems *way* off).

> But even so, it's going to scale approximately linearly with the
> string length. If they're really random, then yes, there's little
> chance that either a 1MB string or a 2MB string will be the same, but
> with real data, they might very well have a long common prefix. So
> it's still going to be more or less O(n).

I understand what you're saying and surely this is true to some extent
(someone mentioned filenames, which is an excellent example). However in
order to derive a *formula* you need to formularize your assumtions.
With random data, it's definitely not O(n). And even in reality (with a
more limited alphabet) it usually will early abort:

Consider sorting a large dictionary. Sorting is in O(n log(n)), but this
assumes O(1) comparisons! If the comparison operation itself were in
O(n), the total sorting complexity would be O(n^2 log(n)), which is
definitely false. Most comparisons will abort *very* early (after the
first character).

Best regards,
Johannes

-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

[toc] | [prev] | [next] | [standalone]

#28474

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-09-05 11:43 +0200
Message-ID	<k276r0$k5k$1@news.albasani.net>
In reply to	#28472

On 05.09.2012 11:24, Johannes Bauer wrote:

> Consider sorting a large dictionary. Sorting is in O(n log(n)), but this
> assumes O(1) comparisons! If the comparison operation itself were in
> O(n), the total sorting complexity would be O(n^2 log(n)), which is
> definitely false. Most comparisons will abort *very* early (after the
> first character).

I've just written a program to demonstrate this (below it's attached). I
used my aspell dictionary, which contains about 100k words. To have
complexity down I went for only words wich occur most often (in this
case 9 characters or ~13k words). Here's the results (note it's
randomized, so they'll always differ a little, but are pretty stable)

Max Cmp: 100
Sum Cmp: 32111
Avg Cmp: 2.393842254361115
Num words  : 13414
Min wordlen: 9
Max wordlen: 9
Avg wordlen: 9.0

Going up in wordlength (only showing part now)

Avg Cmp: 2.2374929735806632
Num words  : 10674
Avg wordlen: 10.0

Avg Cmp: 2.261727078891258
Num words  : 7504
Avg wordlen: 11.0

Avg Cmp: 2.2335647202939977
Num words  : 4898
Avg wordlen: 12.0

Avg Cmp: 2.1743341404358354
Num words  : 2891
Avg wordlen: 13.0

Avg Cmp: 2.156782549420586
Num words  : 1467
Avg wordlen: 14.0

Avg Cmp: 2.112449799196787
Num words  : 747
Avg wordlen: 15.0

So, interestingly, in this real-world case(tm), the complexity does not
scale with O(n). It actually goes down (which is probably due to the
limit amount of words of higher length).

In summary, let's please not debate "feelings" or such. Either
measurements (with clear indiciation on what data they're based on and
how the test was conducted) or derivations (with an explanation of the
assumtions). Feelings can be very misleading in cases like this.

Best regards,
Johannes



#!/usr/bin/python3
import random

class Word():
	def __init__(self, s):
		self._s = s
		self._c = 0

	def __lt__(self, other):
		if len(self) < len(other):
			return True
		elif len(self) > len(other):
			return False

		for i in range(len(self)):
			self._c += 1
			if self._s[i] < other._s[i]:
				return True
		return False

	def __repr__(self):
		return "%s<%d>" % (self._s, self._c)

	def __len__(self):
		return len(self._s)

	def getcmp(self):
		return self._c

wordlist = [ ]
for line in open("dict", "r"):
	word = Word(line[:-1])
	if len(word) == 9:
		wordlist.append(word)

random.shuffle(wordlist)
wordlist.sort()

comparisons = [ word.getcmp() for word in wordlist ]
print("Max Cmp:", max(comparisons))
print("Sum Cmp:", sum(comparisons))
print("Avg Cmp:", sum(comparisons) / len(comparisons))

wordlengths = [ len(word) for word in wordlist ]
print("Min wordlen:", min(wordlengths))
print("Max wordlen:", max(wordlengths))
print("Avg wordlen:", sum(wordlengths) / len(wordlengths))


-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

[toc] | [prev] | [next] | [standalone]

#28500

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-05 14:30 +0000
Message-ID	<504761ef$0$29981$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28474

On Wed, 05 Sep 2012 11:43:08 +0200, Johannes Bauer wrote:

> On 05.09.2012 11:24, Johannes Bauer wrote:
> 
>> Consider sorting a large dictionary. Sorting is in O(n log(n)), but
>> this assumes O(1) comparisons! If the comparison operation itself were
>> in O(n), the total sorting complexity would be O(n^2 log(n)), 

This shows some significant confusion about Big Oh complexity analysis 
here. When you say that sorting the list of words is O(N log N), the N 
refers to the number of words in the dictionary. But when you speak about 
comparisons being O(N), the N doesn't refer to the size of the dict, but 
the size of the individual strings being compared. There is no reasonable 
comparison algorithm where the effort required to compare two strings 
depends on the size of the dictionary they have come from.

Since these are *completely different Ns*, you can't combine them to get 
O(N**2 log N) as the overall algorithmic complexity. Doing so is sheer 
nonsense. If you state it like this:

sorting the list of words is O(N log N) where N = length of the list
comparing the words is O(M) where M = the average length of the words

then the overall complexity is O(N log N)*O(M), but since M is a constant 
for any particular dictionary (its just the average length of the words) 
that reduces down to O(N log N).

To say that sorting a dictionary is O(N log N) does *not* imply that 
string comparisons are O(1). It only implies that string comparisons 
don't depend on the size of the dictionary. Comparing:

s1 = 'a'*30002
s2 = 'a'*30001 + 'b'

takes the same amount of time whether they are in a dictionary of 100 
words or one of 100000 words. For the purposes of calculating the 
algorithmic complexity of sorting a large list of words, we can treat 
comparisons as an elementary operation even though they actually aren't.

>> which is
>> definitely false. Most comparisons will abort *very* early (after the
>> first character).

You are making unjustified assumptions about the distribution of letters 
in the words. This might be a list of long chemical compounds where the 
words typically differ only in their suffix. It might be a list of people 
with titles:

Herr Professor Frederick Schmidt
Herr Professor Frederick Wagner
...

But either way, it doesn't matter for the Big Oh behaviour of sorting the 
words.

> I've just written a program to demonstrate this (below it's attached).

I have no idea why you think this program demonstrates anything about 
string equality comparisons. What you are counting is not the average 
number of comparisons for string equality, but the number of comparisons 
when sorting. These are *completely different*, because sorting a list 
does not require testing each string against every other string.

Whatever you have measured, it has nothing to do with the Big Oh 
complexity of string comparisons.

It seems to me that you have misunderstood the purpose of Big Oh 
notation. It gives an asymptotic upper bound on the amount of work done, 
ignoring all lower terms and multiplicative factors, not an exact formula.

If something is O(N), then this tells you:

1) the number of algorithmic steps is proportional to N, plus or
   minus some terms which are less than N (e.g. sqrt(N), or log(N), 
   or 1/N, or some constant, etc.);

2) if you ignore those other terms, and keep all other influences
   equal, then doubling N will *at worst* double the number of
   algorithmic steps;

3) if all those algorithmic steps take exactly the same amount of
   time, then doubling N will take twice as much time.

But of course *none* of those assumptions hold for measurements of actual 
elapsed time. In real life, operations rarely take constant time, you 
don't always see worst case behaviour, and the lower terms are not 
necessarily insignificant enough to ignore.

[...]
> So, interestingly, in this real-world case(tm), the complexity does not
> scale with O(n). It actually goes down (which is probably due to the
> limit amount of words of higher length).

I would guess that it probably has more to do with the ability of the 
timsort algorithm to exploit local order within a list and avoid 
performing comparisons.

> In summary, let's please not debate "feelings" or such.

I have no idea what feelings you are referring to.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#28505

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-09-05 16:51 +0200
Message-ID	<k27osg$q7t$1@news.albasani.net>
In reply to	#28500

On 05.09.2012 16:30, Steven D'Aprano wrote:

> Since these are *completely different Ns*, you can't combine them to get 
> O(N**2 log N) as the overall algorithmic complexity. Doing so is sheer 
> nonsense. If you state it like this:

Yes, you are correct here.

> You are making unjustified assumptions about the distribution of letters 
> in the words. This might be a list of long chemical compounds where the 
> words typically differ only in their suffix. It might be a list of people 
> with titles:

Actually, I'm not. I'm stating exactly what assumptions I'm making to
get my calculation. I'm comparing *random* character strings or bitstrings.

You, on the other hand, are making vague assumptions which you do not
care for formalize and yet you claim that "the number of comparisons is
equally likely to be 1, 2, 3, ..., N. The average then is". Without any
explanation for this. At all.

> Herr Professor Frederick Schmidt
> Herr Professor Frederick Wagner
> ...

Is your assumtion that we're comparing words that have the common prefix
"Herr Professor Frederick "? Why don't you clearly state what you're
comparing. In the example you gave in the other thread you claimed "Note
that this average assumes the strings are completely random.", yet now
you're talking about specific word patterns.

> I have no idea why you think this program demonstrates anything about 
> string equality comparisons. What you are counting is not the average 
> number of comparisons for string equality, but the number of comparisons 
> when sorting. These are *completely different*, because sorting a list 
> does not require testing each string against every other string.

You're wrong. It's not counting the number of string comparisons, it's
counting the number of character comparisons. Which is exactly what
we're talking about in this thread, are we not? The sorting of the list
is just to show some example where lots of strings are compared.

>> So, interestingly, in this real-world case(tm), the complexity does not
>> scale with O(n). It actually goes down (which is probably due to the
>> limit amount of words of higher length).
> 
> I would guess that it probably has more to do with the ability of the 
> timsort algorithm to exploit local order within a list and avoid 
> performing comparisons.

Again, it's not counting the number of string comparisons. It's counting
the average number of character comparisons per string comparison. The
total amount of string comparisons has nothing to do with it.

>> In summary, let's please not debate "feelings" or such.
> 
> I have no idea what feelings you are referring to.

I'm referring to "the number of comparisons is equally likely to be 1,
2, 3, ..., N. The average then is" without any deduction or rationale.
Without any explanation or assumption.

Regards,
Johannes

-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

[toc] | [prev] | [next] | [standalone]

#28519

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-05 16:24 +0000
Message-ID	<50477cbb$0$29981$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28505

On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote:
[...]
>> You are making unjustified assumptions about the distribution of
>> letters in the words. This might be a list of long chemical compounds
>> where the words typically differ only in their suffix. It might be a
>> list of people with titles:
> 
> Actually, I'm not. I'm stating exactly what assumptions I'm making to
> get my calculation. I'm comparing *random* character strings or
> bitstrings.

Excuse me, you are not. You are comparing English words which are highly 
non-random.

> You, on the other hand, are making vague assumptions which you do not
> care for formalize and yet you claim that "the number of comparisons is
> equally likely to be 1, 2, 3, ..., N. The average then is". Without any
> explanation for this. At all.

I will accept that my explanation was not good enough, but I strongly 
disagree that I gave no explanation at all.

>> Herr Professor Frederick Schmidt
>> Herr Professor Frederick Wagner
>> ...
> 
> Is your assumtion that we're comparing words that have the common prefix
> "Herr Professor Frederick "? 

No, I am pointing out that *your* assumption that most string comparisons 
will halt close to the beginning of the string is an invalid assumption. 
Your assumption only holds for some non-random strings.

[...]
>> I have no idea why you think this program demonstrates anything about
>> string equality comparisons. What you are counting is not the average
>> number of comparisons for string equality, but the number of
>> comparisons when sorting. These are *completely different*, because
>> sorting a list does not require testing each string against every other
>> string.
> 
> You're wrong. It's not counting the number of string comparisons, 

I didn't say it was. 

> it's counting the number of character comparisons.

Correct. But only out of an extremely limited subset of all possible 
string comparisons, namely those very few that happen when sorting lists 
of English words using a very specific algorithm, namely timsort.

> Which is exactly what
> we're talking about in this thread, are we not? The sorting of the list
> is just to show some example where lots of strings are compared.

It is not good enough to extract a non-random subset of strings (only 
English words) and then decrease the data set even further by only 
comparing an extremely non-uniform subset of those strings: specifically 
those few string comparisons that happen to occur using timsort.

Whatever figure you have calculated by taking this non-random selection, 
it is irrelevant to the question of the average performance of string 
equality given random strings.

>>> So, interestingly, in this real-world case(tm), the complexity does
>>> not scale with O(n). It actually goes down (which is probably due to
>>> the limit amount of words of higher length).
>> 
>> I would guess that it probably has more to do with the ability of the
>> timsort algorithm to exploit local order within a list and avoid
>> performing comparisons.
> 
> Again, it's not counting the number of string comparisons. It's counting
> the average number of character comparisons per string comparison. The
> total amount of string comparisons has nothing to do with it.

I didn't say anything about counting string comparisons. But you are 
wrong -- the numbers you get are *strongly* influenced by the number of 
string comparisons performed. That is just one of the reasons why the 
results you get are biased.

Let me spell it out for you: given a list of five letter words:

['fruit', 'apple', 'claim', 'pears', ... , 'toast']

sorting the list may compare:

'fruit' with 'apple'
'apple' with 'claim'

but almost certainly will not compare:

'apple' with 'toast'

and it definitely won't compare:

'zzzax' with 'zzzaq'

because strings like those never get into the list in the first place.

For the sake of simple calculations, let's pretend that there are only 
1000 five-letter strings possible. We want to know how many character-
comparisons are done on average when testing two random five-letter 
strings for equality. To calculate this average, you must compare every 
string to every other string, *including* itself.

This gives 1000*1000 = one million equality tests to be performed. For 
each equality test, we count the number of character comparisons 
performed, sum all those counts, and divide by 1e6. That is the average 
number of char comparisons for random strings of five letters.

But that's not what you do. First you eliminate 900 out of the 1000 
possible strings by only sampling English words -- strings like "xxoij" 
cannot possibly be selected. So you are left with the highly non-random 
subset of 10000 equality tests.

But you haven't finished. Instead of checking all 10000 equality tests, 
you then decide to only check the 2000 tests that happen to randomly 
occur when using the timsort algorithm to sort a list.

So from the population of one million possible equality tests, you throw 
away the vast majority, then average the *strongly non-random* few that 
are remaining. Naturally the result you get is strongly biased, and it 
has nothing to do with the average Big Oh behaviour of equality on random 
strings.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#28544

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2012-09-05 22:47 +0000
Message-ID	<mailman.272.1346885254.27098.python-list@python.org>
In reply to	#28519

In news.gmane.comp.python.general, you wrote:
> On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote:
> [...]
>>> You are making unjustified assumptions about the distribution of
>>> letters in the words. This might be a list of long chemical compounds
>>> where the words typically differ only in their suffix. It might be a
>>> list of people with titles:
>> 
>> Actually, I'm not. I'm stating exactly what assumptions I'm making to
>> get my calculation. I'm comparing *random* character strings or
>> bitstrings.
>
> Excuse me, you are not. You are comparing English words which are highly 
> non-random.

Evidently we have different understandings of what 'random' means. I don't
think it's unreasonable to say that strings drawn uniformly from the set of
all strings in the English language (having a given number of characters) is
random. The distribution is not uniform over the set of all possible character
strings but it is still random. I think Johannes deliberately chose these
strings to emulate a particular kind of 'real' distribution of strings that
might occur in practise.

>
>
>> You, on the other hand, are making vague assumptions which you do not
>> care for formalize and yet you claim that "the number of comparisons is
>> equally likely to be 1, 2, 3, ..., N. The average then is". Without any
>> explanation for this. At all.
>
> I will accept that my explanation was not good enough, but I strongly 
> disagree that I gave no explanation at all.
>
>
>>> Herr Professor Frederick Schmidt
>>> Herr Professor Frederick Wagner
>>> ...
>> 
>> Is your assumtion that we're comparing words that have the common prefix
>> "Herr Professor Frederick "? 
>
> No, I am pointing out that *your* assumption that most string comparisons 
> will halt close to the beginning of the string is an invalid assumption. 
> Your assumption only holds for some non-random strings.

I think you have this backwards. The case where this assumption is provably
true is precisely for random strings. To be clear, when I say 'random' in this
context I mean that each character is chosen independently from the same
probability distribution over the possible characters regardless of which
index it has in the string and regardless of what the other characters are
(IID). In this case the probability that comparison terminates at the jth
character decreases exponentially with j. This means that for large strings
the expected number of character comparisons is independent of the number of
characters in the string as the probability of reaching the later parts of the
string is too small for them to have any significant effect. This is provable
and applies regardless of how many possible characters there are and whether
or not each character is equally likely (except for the pathological case
where one character has a probability of 1).

For strings from 'real' distributions it is harder to make statements about
the 'average case' and it is possible to construct situations where the
comparison would regularly need to compare a common prefix. However, to get
asymptotic performance worse than O(1) it is not sufficient to say that there
may be a common prefix such as 'Herr' in the distribution of strings. It is
necessary that, somehow, the common prefix is likely to grow as the size of
the strings grows.

For example, the set of all strings of length N whose first N//2 characters
are always 'a' and whose remaining characters are chosen IID would lead to
O(N) performance. This is why the file paths example chosen at the start of
this thread is a good one. If a program is dealing with a number of large
strings representing file paths then it is not uncommon that many of those
paths would refer to files in the same deeply nested directory and hence
compare equal for a significant number of characters. This could lead to
average case O(N) performance.

I think it's appropriate to compare string comparison with dict insertion:
Best case O(1) (no hash collisions)
Worst case O(N) (collides with every key)
Average case O(1) (as long as you don't use pathological data)

The only difference with string comparison is that there are some conceivable,
non-malicious cases where the pathological data can occur (such as with file
paths). However, I suspect that out of all the different uses of python
strings these cases are a small minority.

In saying that, it's not inconceivable that someone could exploit string
comparison by providing pathological data to make normally O(1) operations
behave as O(N). If I understand correctly it was precisely this kind of
problem with dict insertion/lookup that lead to the recent hash-seed security
update.

Oscar

[toc] | [prev] | [next] | [standalone]

#28557

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-06 08:33 +0000
Message-ID	<50485fca$0$29977$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28544

On Wed, 05 Sep 2012 22:47:14 +0000, Oscar Benjamin wrote:

> In news.gmane.comp.python.general, you wrote:
>> On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote: [...]
>>>> You are making unjustified assumptions about the distribution of
>>>> letters in the words. This might be a list of long chemical compounds
>>>> where the words typically differ only in their suffix. It might be a
>>>> list of people with titles:
>>> 
>>> Actually, I'm not. I'm stating exactly what assumptions I'm making to
>>> get my calculation. I'm comparing *random* character strings or
>>> bitstrings.
>>
>> Excuse me, you are not. You are comparing English words which are
>> highly non-random.
> 
> Evidently we have different understandings of what 'random' means. I
> don't think it's unreasonable to say that strings drawn uniformly from
> the set of all strings in the English language (having a given number of
> characters) is random. The distribution is not uniform over the set of
> all possible character strings but it is still random. I think Johannes
> deliberately chose these strings to emulate a particular kind of 'real'
> distribution of strings that might occur in practise.

Of course for some "real" applications, your strings being compared will 
be English words. And for other applications, they will be strings of 
numeric digits uniformly distributed between 20000 and 30000. Or taken 
from a Gaussian distribution centered around 7000. Or strings made up of 
deeply nested sets of parentheses (((((( ... )))))). Whichever special 
distribution of strings you pick, I don't think calling them "random" is 
terribly useful in context, even if they are generated randomly from some 
non-uniform distribution.

But you know, it really doesn't make a difference. Equality testing will 
*still* be O(N) since the asymptomatic behaviour for sufficiently large 
string comparisons will be bounded by O(N). Multiplicative constants 
("half the string" versus "0.001 of the string") do not matter.

I may have been overly-conservative earlier when I said that on average 
string equality has to compare half the characters. I thought I had 
remembered that from a computer science textbook, but I can't find that 
reference now, so possibly I was thinking of something else. (String 
searching perhaps?). In any case, the *worst* case for string equality 
testing is certainly O(N) (every character must be looked at), and the 
*best* case is O(1) obviously (the first character fails to match). But 
I'm not so sure about the average case. Further thought is required.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#28569

From	Dave Angel <d@davea.name>
Date	2012-09-06 06:07 -0400
Message-ID	<mailman.287.1346926087.27098.python-list@python.org>
In reply to	#28557

On 09/06/2012 04:33 AM, Steven D'Aprano wrote:
> <snip>
>
> I may have been overly-conservative earlier when I said that on
> average string equality has to compare half the characters. I thought
> I had remembered that from a computer science textbook, but I can't
> find that reference now, so possibly I was thinking of something else.
> (String searching perhaps?). In any case, the *worst* case for string
> equality testing is certainly O(N) (every character must be looked
> at), and the *best* case is O(1) obviously (the first character fails
> to match). But I'm not so sure about the average case. Further thought
> is required. 

For random strings (as defined below), the average compare time is
effectively unrelated to the size of the string, once the size passes
some point.

Define random string as being a selection from a set of characters, with
replacement.  So if we pick some set of characters, say 10 (or 256, it
doesn't really matter), the number of possible strings is 10**N.

The likelihood of not finding a mismatch within k characters is  
(1/10)**k   The likelihood of actually reaching the end of the random
string is (1/10)**N.  (which is the reciprocal of the number of strings,
naturally)

If we wanted an average number of comparisons, we'd have to calculate a
series, where each term is a probability times a value for k.
   sum((k * 9*10**-k) for k in range(1, 10))

Those terms very rapidly approach 0, so it's safe to stop after a few. 
Looking at the first 9 items, I see a value of 1.1111111

This may not be quite right, but the value is certainly well under 2 for
a population of 10 characters, chosen randomly.  And notice that N
doesn't really come into it. 

-- 

DaveA

[toc] | [prev] | [next] | [standalone]

#28676

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-07 04:06 +0000
Message-ID	<504972d1$0$29981$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28569

On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote:

> On 09/06/2012 04:33 AM, Steven D'Aprano wrote:
>> <snip>
>>
>> I may have been overly-conservative earlier when I said that on average
>> string equality has to compare half the characters. I thought I had
>> remembered that from a computer science textbook, but I can't find that
>> reference now, so possibly I was thinking of something else. (String
>> searching perhaps?). In any case, the *worst* case for string equality
>> testing is certainly O(N) (every character must be looked at), and the
>> *best* case is O(1) obviously (the first character fails to match). But
>> I'm not so sure about the average case. Further thought is required.
> 
> For random strings (as defined below), the average compare time is
> effectively unrelated to the size of the string, once the size passes
> some point.

Okay, I've spent some time looking for the reference, and can't find it, 
so I'm now sure I must have been conflating it with something else.

After further thought, and giving consideration to the arguments given by 
people here, I'm now satisfied to say that for equal-length strings, 
string equality is best described as O(N).

1) If the strings are equal, a == b will always compare all N 
   characters in each string.

2) If the strings are unequal, a == b will *at worst* compare
   all N characters.

3) Following usual practice in this field, worst case is the
   one which conventionally is meant when discussing Big Oh
   behaviour. See, for example, "Introduction To Algorithms" 
   by Cormen, Leiserson and Rivest.

Also of some interest is the best case: O(1) for unequal strings (they 
differ at the first character) and O(N) for equal strings.

Also of interest is the case that has caused the majority of the 
discussion, the average case. I am now satisfied that the average number 
of comparisons for unequal strings is O(1). To be precise, it is bounded 
below by 1 comparison (you always have to compare at least one pair of 
characters) and bounded above by 2 comparisons.

(I'm talking about the average here -- the actual number of comparisons 
can range all the way up to N, but the average is <= 2.)

If I've done the maths right, the exact value for the average is:

((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)

for random strings of length N taken from an alphabet of size M.

For M = 2, that average approaches but never exceeds 2 as N increases; 
for M = 3, the average approaches 1.5, for M = 4 it approaches 1.333... 
and so forth.

Thanks to all who contributed.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#28705

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2012-09-07 19:10 +0000
Message-ID	<mailman.366.1347045032.27098.python-list@python.org>
In reply to	#28676

On 2012-09-07, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> <snip>
>
> After further thought, and giving consideration to the arguments given by 
> people here, I'm now satisfied to say that for equal-length strings, 
> string equality is best described as O(N).
>
> 1) If the strings are equal, a == b will always compare all N 
>    characters in each string.
>
> 2) If the strings are unequal, a == b will *at worst* compare
>    all N characters.
>
> 3) Following usual practice in this field, worst case is the
>    one which conventionally is meant when discussing Big Oh
>    behaviour. See, for example, "Introduction To Algorithms" 
>    by Cormen, Leiserson and Rivest.

Would you say, then, that dict insertion is O(N)?

>
> Also of some interest is the best case: O(1) for unequal strings (they 
> differ at the first character) and O(N) for equal strings.
>
> Also of interest is the case that has caused the majority of the 
> discussion, the average case. I am now satisfied that the average number 
> of comparisons for unequal strings is O(1). To be precise, it is bounded 
> below by 1 comparison (you always have to compare at least one pair of 
> characters) and bounded above by 2 comparisons.

I find this idea of separating into the comparison of equal strings versus the
comparison of unequal strings rather odd. If the strings you compare come from
a distribution where they are guaranteed to be equal (or unequal) then you can
just use the O(0) comparison method.

Since string comparison is only useful if the strings can be equal or unequal,
the average case depends on how often they are equal/unequal as well as the
average complexity of both. For random strings the frequency of equal strings
decreases very fast as N increases so that the comparison of random strings is
O(1).

>
> (I'm talking about the average here -- the actual number of comparisons 
> can range all the way up to N, but the average is <= 2.)
>
> If I've done the maths right, the exact value for the average is:
>
> ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)

I'm not sure where the extra N comes from --------^ but otherwise good.

I would have written that as:

(1 - p) * sum(i * p**(i-1) for i in range(1, N+1))

where p is the probability of a match (1/M for M equally likely characters) or
in closed form:

 ⎛     N                 ⎞
 ⎝1 - p ⋅(1 + N ⋅(1 - p))⎠
 ─────────────────────────
           1 - p

>
> for random strings of length N taken from an alphabet of size M.
>
> For M = 2, that average approaches but never exceeds 2 as N increases; 
> for M = 3, the average approaches 1.5, for M = 4 it approaches 1.333... 
> and so forth.

It approaches 1 / (1 - p) or, if you prefer: M / (M - 1)

Oscar

[toc] | [prev] | [next] | [standalone]

#28713

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-08 00:55 +0000
Message-ID	<504a9785$0$29981$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28705

On Fri, 07 Sep 2012 19:10:16 +0000, Oscar Benjamin wrote:

> On 2012-09-07, Steven D'Aprano <steve+comp.lang.python@pearwood.info>
> wrote:
>> <snip>
>>
>> After further thought, and giving consideration to the arguments given
>> by people here, I'm now satisfied to say that for equal-length strings,
>> string equality is best described as O(N).
>>
>> 1) If the strings are equal, a == b will always compare all N
>>    characters in each string.
>>
>> 2) If the strings are unequal, a == b will *at worst* compare
>>    all N characters.
>>
>> 3) Following usual practice in this field, worst case is the
>>    one which conventionally is meant when discussing Big Oh behaviour.
>>    See, for example, "Introduction To Algorithms" by Cormen, Leiserson
>>    and Rivest.
> 
> Would you say, then, that dict insertion is O(N)?

Pedantically, yes. 

But since we're allowed to state (or even imply *wink*) whatever 
assumptions we like, we're allowed to assume "in the absence of 
significant numbers of hash collisions" and come up with amortized O(1) 
for dict insertions and lookups.

(Provided, of course, that your computer has an infinite amount of 
unfragmented memory and the OS never starts paging your dict to disk. 
Another unstated assumption that gets glossed over when we talk about 
complexity analysis -- on real world computers, for big enough N, 
*everything* is O(2**N) or worse.)

Big Oh analysis, despite the formal mathematics used, is not an exact 
science. Essentially, it is a way of bringing some vague order to hand-
wavy estimates of complexity, and the apparent mathematical rigour is 
built on some awfully shaky foundations. But despite that, it actually is 
useful.

Coming back to strings... given that in any real-world application, you 
are likely to have some string comparisons on equal strings and some on 
unequal strings, and more importantly you don't know which are which 
ahead of time, which attitude is less likely to give you a nasty surprise 
when you run your code?

"I have many millions of 100K strings to compare against other 100K 
strings, and string comparisons are O(1) so that will be fast."

"I have many millions of 100K strings to compare against other 100K 
strings, and string comparisons are O(N) so that will be slow, better 
find another algorithm."

Remember too that "for small enough N, everything is O(1)". Getting hung 
up on Big Oh is just as much a mistake as ignoring it completely.

> I find this idea of separating into the comparison of equal strings
> versus the comparison of unequal strings rather odd. If the strings you
> compare come from a distribution where they are guaranteed to be equal
> (or unequal) then you can just use the O(0) comparison method.

If you know that they're (un)equal, you don't need to call == at all.

If you know that "most" strings will be unequal, then you might be 
justified in treating comparisons as O(1) "most of the time" and not 
stress about the occasional slow call. But of course most of the time you 
don't know this, which is why it is appropriate to treat string 
comparisons as O(N) rather than O(1), since that's the upper bound.

> Since string comparison is only useful if the strings can be equal or
> unequal, the average case depends on how often they are equal/unequal as
> well as the average complexity of both. For random strings the frequency
> of equal strings decreases very fast as N increases so that the
> comparison of random strings is O(1).

But that is not an upper bound, and Big Oh analysis is strictly defined 
in terms of upper bounds.

>> (I'm talking about the average here -- the actual number of comparisons
>> can range all the way up to N, but the average is <= 2.)
>>
>> If I've done the maths right, the exact value for the average is:
>>
>> ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)
> 
> I'm not sure where the extra N comes from --------^ but otherwise good.

The extra N comes from the case where you compare string S with itself, 
which takes exactly N comparisons.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

Page 1 of 6 [1] 2 3 4 5 6 Next page →

csiph-web

Comparing strings from the back?

Contents

#28380 — Comparing strings from the back?

#28383

#28384

#28389

#28394

#28405

#28411

#28471

#28428

#28472

#28474

#28500

#28505

#28519

#28544

#28557

#28569

#28676

#28705

#28713