Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #28380 > unrolled thread
| Started by | Roy Smith <roy@panix.com> |
|---|---|
| First post | 2012-09-03 21:54 -0400 |
| Last post | 2012-09-10 21:52 +0000 |
| Articles | 20 on this page of 101 — 21 participants |
Back to article view | Back to comp.lang.python
Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-03 21:54 -0400
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-04 12:07 +1000
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 02:17 +0000
Re: Comparing strings from the back? Dan Sommers <dan@tombstonezero.net> - 2012-09-03 21:56 -0700
Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-04 08:50 +0100
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-04 18:32 +0200
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 18:07 +0000
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:17 +0200
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-05 07:59 +1000
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:24 +0200
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:43 +0200
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-05 14:30 +0000
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 16:51 +0200
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-05 16:24 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-05 22:47 +0000
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-06 08:33 +0000
Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 06:07 -0400
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-07 04:06 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-07 19:10 +0000
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-08 00:55 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-08 11:53 +0000
RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-13 18:39 +0000
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 15:37 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 20:48 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 00:46 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 21:54 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 01:38 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 23:06 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 04:03 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-14 01:20 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 04:53 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-14 03:26 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 07:36 -0400
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-14 10:16 +0000
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 07:43 -0400
RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-14 22:43 +0000
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 23:10 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-16 18:11 -0700
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-17 14:05 +1000
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-16 23:06 -0700
Re: Comparing strings from the back? Ethan Furman <ethan@stoneleaf.us> - 2012-09-17 13:35 -0700
Re: Comparing strings from the back? Neil Hodgson <nhodgson@iinet.net.au> - 2012-09-18 09:14 +1000
Re: Comparing strings from the back? Ethan Furman <ethan@stoneleaf.us> - 2012-09-18 08:12 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 11:55 -0400
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 11:59 -0400
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 12:17 -0400
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-19 00:47 +0000
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-19 02:20 +1000
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 16:40 -0400
Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-19 00:48 +0100
Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-13 20:53 +0100
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 17:06 -0400
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-14 03:39 +0000
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-14 14:15 +1000
Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-13 22:17 +0100
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 17:35 -0400
RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-14 21:32 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-07 19:40 +0000
Re: Comparing strings from the back? Gelonida N <gelonida@gmail.com> - 2012-09-08 17:52 +0200
Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-10 08:59 +0000
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-10 13:45 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:06 +0000
Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-11 09:51 +0000
Re: Comparing strings from the back? Terry Reedy <tjreedy@udel.edu> - 2012-09-11 11:55 -0400
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-11 00:26 +1000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:32 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:43 +0000
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-11 00:56 +1000
Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-11 09:41 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-06 12:04 +0100
Re: Comparing strings from the back? Steve Howell <showell30@yahoo.com> - 2012-09-14 17:51 -0700
Re: Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-06 08:13 -0400
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-06 22:29 +1000
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 15:43 +0200
Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 10:23 -0400
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:33 +0200
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:42 +0200
Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 11:54 -0400
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:34 +0200
Re: Comparing strings from the back? Gelonida N <gelonida@gmail.com> - 2012-09-08 17:50 +0200
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 15:37 +0200
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-07 00:39 +1000
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 17:36 +0200
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 17:44 +0200
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-07 04:07 +0000
Re: Comparing strings from the back? Terry Reedy <tjreedy@udel.edu> - 2012-09-04 01:13 -0400
Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-04 08:56 +0100
Re: Comparing strings from the back? Alain Ketterlin <alain@dpt-info.u-strasbg.fr> - 2012-09-04 11:58 +0200
Re: Comparing strings from the back? Neil Hodgson <nhodgson@iinet.net.au> - 2012-09-05 12:18 +1000
Re: Comparing strings from the back? MRAB <python@mrabarnett.plus.com> - 2012-09-05 03:39 +0100
Re: Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-04 22:48 -0400
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 16:33 +0200
Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 10:29 +0200
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-05 18:33 +1000
Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 11:48 +0200
Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 17:45 +0200
Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 18:07 +0200
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 16:33 +0000
Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 19:32 +0200
Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 19:44 +0200
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 21:52 +0000
Page 1 of 6 [1] 2 3 4 5 6 Next page →
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-09-03 21:54 -0400 |
| Subject | Comparing strings from the back? |
| Message-ID | <roy-B37C28.21540103092012@news.panix.com> |
There's been a bunch of threads lately about string implementations, and that got me thinking (which is often a dangerous thing). Let's assume you're testing two strings for equality. You've already done the obvious quick tests (i.e they're the same length), and you're down to the O(n) part of comparing every character. I'm wondering if it might be faster to start at the ends of the strings instead of at the beginning? If the strings are indeed equal, it's the same amount of work starting from either end. But, if it turns out that for real-life situations, the ends of strings have more entropy than the beginnings, the odds are you'll discover that they're unequal quicker by starting at the end. It doesn't seem un-plausible that this is the case. For example, most of the filenames I work with begin with "/home/roy/". Most of the strings I use as memcache keys have one of a small number of prefixes. Most of the strings I use as logger names have common leading substrings. Things like credit card and telephone numbers tend to have much more entropy in the trailing digits. On the other hand, hostnames (and thus email addresses) exhibit the opposite pattern. Anyway, it's just a thought. Has anybody studied this for real-life usage patterns? I'm also not sure how this work with all the possible UCS/UTF encodings. With some of them, you may get the encoding semantics wrong if you don't start from the front.
[toc] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-09-04 12:07 +1000 |
| Message-ID | <mailman.164.1346724445.27098.python-list@python.org> |
| In reply to | #28380 |
On Tue, Sep 4, 2012 at 11:54 AM, Roy Smith <roy@panix.com> wrote: > I'm wondering if it might be faster to start at the ends of the strings > instead of at the beginning? > I'm also not sure how this work with all the possible UCS/UTF encodings. > With some of them, you may get the encoding semantics wrong if you don't > start from the front. No problem there; Python uses only fixed-width encodings. Also, any canonical encoding can be safely compared byte-for-byte; two identical Unicode strings will be bit-wise identical in (say) UTF-8. There's issues of cache locality and such that quite possibly mean it's not going to be faster overall, but it wouldn't be difficult to tweak the Python sources, recompile, and run some tests. I'm sure python-dev or python-list will be most happy to discuss some benchmark figures! ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-09-04 02:17 +0000 |
| Message-ID | <504564ba$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #28380 |
On Mon, 03 Sep 2012 21:54:01 -0400, Roy Smith wrote:
> Let's assume you're testing two strings for equality. You've already
> done the obvious quick tests (i.e they're the same length), and you're
> down to the O(n) part of comparing every character.
>
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning? If the strings are indeed equal, it's the
> same amount of work starting from either end. But, if it turns out that
> for real-life situations, the ends of strings have more entropy than the
> beginnings, the odds are you'll discover that they're unequal quicker by
> starting at the end.
And if the strings have the same end and different beginnings:
running
jumping
cooking
then you'll find out quicker by starting at the beginning.
No general-purpose string comparison operator can make assumptions about
the content of the strings, like "they are nearly always pathnames on
Unix-like systems like /home/user/x and /home/user/y".
I suppose you could have two different operators, == for "test from the
front" and === for "test from the back", and push that decision onto the
user, who will then spend hours angsting about the "most efficient"
equality test and days painstakingly profiling his data to save four
nanoseconds at runtime.
Besides, then somebody will say "Yes, but what about the cases where the
prefix and the suffix are both equal, but the middle will be different?"
and propose a third string-equality operator ==== and then there will be
bloodshed.
On average, string equality needs to check half the characters in the
string. Any individual comparisons may require fewer, or more, than that,
but whichever way you scan the string, some comparisons will take more
and some will take fewer. With no general way of telling ahead of time
which will be which, on average you can't do better than O(N/2) and no
complexity justification for picking "start from the end" from "start
from the start".
--
Steven
[toc] | [prev] | [next] | [standalone]
| From | Dan Sommers <dan@tombstonezero.net> |
|---|---|
| Date | 2012-09-03 21:56 -0700 |
| Message-ID | <mailman.169.1346735738.27098.python-list@python.org> |
| In reply to | #28384 |
On 2012-09-04 at 02:17:30 +0000, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > Besides, then somebody will say "Yes, but what about the cases where > the prefix and the suffix are both equal, but the middle will be > different?" and propose a third string-equality operator ==== and > then there will be bloodshed. Lisp has several equality forms, although they're for different kinds or levels of equality rather than for different algorithms for determining that equality. I imagine that a discussion similar to this one spawned many of those forms. I agree with Steven: decisions such as this belong to the application developer, not the standard library. Don't forget the fourth string-equality operator for case-insensitive comparisons and the fifth string-equality operator for strings that are likely to be palindromes. That said, if I really wanted bloodshed, I would propose = for the third string-equality operator! ;-) Dan
[toc] | [prev] | [next] | [standalone]
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-09-04 08:50 +0100 |
| Message-ID | <mailman.171.1346745011.27098.python-list@python.org> |
| In reply to | #28384 |
On 04/09/2012 05:56, Dan Sommers wrote: > > That said, if I really wanted bloodshed, I would propose = for the third > string-equality operator! ;-) > > Dan > Dan "agent provocateur" Sommers? :) -- Cheers. Mark Lawrence.
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2012-09-04 18:32 +0200 |
| Message-ID | <k25afd$7vj$1@news.albasani.net> |
| In reply to | #28384 |
On 04.09.2012 04:17, Steven D'Aprano wrote: > On average, string equality needs to check half the characters in the > string. How do you arrive at that conclusion? When comparing two random strings, I just derived n = (256 / 255) * (1 - 256 ^ (-c)) where n is the average number of character comparisons and c. The rationale as follows: The first character has to be compared in any case. The second with a probability of 1/256, the third with 1/(256^2) and so on. Best regards, Johannes -- >> Wo hattest Du das Beben nochmal GENAU vorhergesagt? > Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-09-04 18:07 +0000 |
| Message-ID | <50464361$0$29981$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #28405 |
On Tue, 04 Sep 2012 18:32:57 +0200, Johannes Bauer wrote:
> On 04.09.2012 04:17, Steven D'Aprano wrote:
>
>> On average, string equality needs to check half the characters in the
>> string.
>
> How do you arrive at that conclusion?
Take two non-empty strings of the same length, N. If the strings are
equal, you have to make N comparisons to be sure they are equal (you
check all N pairs of characters). If they are unequal, you have to check
each pair of characters up to the first pair that are different:
def string_equality(astr, bstr):
# Assumes lengths are equal.
for achr, bchr in zip(astr, bstr):
if achr != bchr:
return False
return True
For the unequal case, how many comparisons do you do? Ahead of time, we
know very little about the strings. We don't even know how many possible
characters there are (are they English alphanumeric, ASCII, Greek, Thai,
or from the full range of 1114111 Unicode code points?) or what their
probability distribution is.
A reasonable, conservative assumption is to calculate the largest
possible value of the average for random strings. That largest value
occurs when the alphabet is as small as possible, namely two characters.
In practice, strings come from a larger alphabet, up to 1114111 different
characters for full Unicode strings, so the average for them will be less
than the average we calculate now.
So for unequal strings, the number of comparisons is equally likely to be
1, 2, 3, ..., N. The average then is:
sum([1, 2, 3, ..., N])/N
which by a bit of simple algebra works out to be (N+1)/2, or half the
characters as I said.
(Note that this average assumes the strings are completely random. In
practice, strings tend to come from strongly non-uniform distributions of
characters. For instance, in English texts, 'e' is much more likely than
'q' and most characters are vanishingly rare, and in practice, strings
are likely to be highly non-random.)
--
Steven
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2012-09-05 11:17 +0200 |
| Message-ID | <k275b1$h20$1@news.albasani.net> |
| In reply to | #28411 |
On 04.09.2012 20:07, Steven D'Aprano wrote: > A reasonable, conservative assumption is to calculate the largest > possible value of the average for random strings. That largest value > occurs when the alphabet is as small as possible, namely two characters. > In practice, strings come from a larger alphabet, up to 1114111 different > characters for full Unicode strings, so the average for them will be less > than the average we calculate now. > > So for unequal strings, the number of comparisons is equally likely to be > 1, 2, 3, ..., N. The average then is: That is exactly what I don't agree with. For unequal random strings, the distribution is *not* equally likely to have the same amount of comparisons. For demonstration, let's look at bitstrings and the string "1010". There 15 bitstrings of same length which are unequal and I'll put the number of comparisons needed next to them going left to right: 0000 1 0001 1 0010 1 0011 1 0100 1 0101 1 0110 1 0111 1 1100 2 1101 2 1110 2 1111 2 1000 3 1001 3 1011 4 i.e. about 1.73 (26/15) comparisons in average with random strings compared to the 2.5 comparisons you end up with. My formula does respect lazy termination (because the probabilty of higher comparisons gets exponentially lower). > sum([1, 2, 3, ..., N])/N > > which by a bit of simple algebra works out to be (N+1)/2, or half the > characters as I said. Yes, but it's a flawed assumption you're making since you're neglecting lazy evaluation and early abort of comparison. Best regards, Johannes -- >> Wo hattest Du das Beben nochmal GENAU vorhergesagt? > Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-09-05 07:59 +1000 |
| Message-ID | <mailman.194.1346795987.27098.python-list@python.org> |
| In reply to | #28405 |
On Wed, Sep 5, 2012 at 2:32 AM, Johannes Bauer <dfnsonfsduifb@gmx.de> wrote: > How do you arrive at that conclusion? When comparing two random strings, > I just derived > > n = (256 / 255) * (1 - 256 ^ (-c)) > > where n is the average number of character comparisons and c. The > rationale as follows: The first character has to be compared in any > case. The second with a probability of 1/256, the third with 1/(256^2) > and so on. That would be for comparing two random areas of memory. Python strings don't have 256 options per character; and in terms of actual strings, there's so many possibilities. The strings that a program is going to compare for equality are going to use a vastly restricted alphabet; for a lot of cases, there might be only a few dozen plausible characters. But even so, it's going to scale approximately linearly with the string length. If they're really random, then yes, there's little chance that either a 1MB string or a 2MB string will be the same, but with real data, they might very well have a long common prefix. So it's still going to be more or less O(n). ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2012-09-05 11:24 +0200 |
| Message-ID | <k275on$i1l$1@news.albasani.net> |
| In reply to | #28428 |
On 04.09.2012 23:59, Chris Angelico wrote: >> n = (256 / 255) * (1 - 256 ^ (-c)) >> >> where n is the average number of character comparisons and c. The >> rationale as follows: The first character has to be compared in any >> case. The second with a probability of 1/256, the third with 1/(256^2) >> and so on. > > That would be for comparing two random areas of memory. Python strings > don't have 256 options per character; and in terms of actual strings, > there's so many possibilities. The unit of comparison is completely arbitraty and can equally be used to compare bitstrings if changed accordingly. > The strings that a program is going to > compare for equality are going to use a vastly restricted alphabet; > for a lot of cases, there might be only a few dozen plausible > characters. This is very true. But if so, I would like to see the assumtion that you're making (which might very well be plausible) in order to arrive at a average of N/2 comparisons (which just seems *way* off). > But even so, it's going to scale approximately linearly with the > string length. If they're really random, then yes, there's little > chance that either a 1MB string or a 2MB string will be the same, but > with real data, they might very well have a long common prefix. So > it's still going to be more or less O(n). I understand what you're saying and surely this is true to some extent (someone mentioned filenames, which is an excellent example). However in order to derive a *formula* you need to formularize your assumtions. With random data, it's definitely not O(n). And even in reality (with a more limited alphabet) it usually will early abort: Consider sorting a large dictionary. Sorting is in O(n log(n)), but this assumes O(1) comparisons! If the comparison operation itself were in O(n), the total sorting complexity would be O(n^2 log(n)), which is definitely false. Most comparisons will abort *very* early (after the first character). Best regards, Johannes -- >> Wo hattest Du das Beben nochmal GENAU vorhergesagt? > Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2012-09-05 11:43 +0200 |
| Message-ID | <k276r0$k5k$1@news.albasani.net> |
| In reply to | #28472 |
On 05.09.2012 11:24, Johannes Bauer wrote:
> Consider sorting a large dictionary. Sorting is in O(n log(n)), but this
> assumes O(1) comparisons! If the comparison operation itself were in
> O(n), the total sorting complexity would be O(n^2 log(n)), which is
> definitely false. Most comparisons will abort *very* early (after the
> first character).
I've just written a program to demonstrate this (below it's attached). I
used my aspell dictionary, which contains about 100k words. To have
complexity down I went for only words wich occur most often (in this
case 9 characters or ~13k words). Here's the results (note it's
randomized, so they'll always differ a little, but are pretty stable)
Max Cmp: 100
Sum Cmp: 32111
Avg Cmp: 2.393842254361115
Num words : 13414
Min wordlen: 9
Max wordlen: 9
Avg wordlen: 9.0
Going up in wordlength (only showing part now)
Avg Cmp: 2.2374929735806632
Num words : 10674
Avg wordlen: 10.0
Avg Cmp: 2.261727078891258
Num words : 7504
Avg wordlen: 11.0
Avg Cmp: 2.2335647202939977
Num words : 4898
Avg wordlen: 12.0
Avg Cmp: 2.1743341404358354
Num words : 2891
Avg wordlen: 13.0
Avg Cmp: 2.156782549420586
Num words : 1467
Avg wordlen: 14.0
Avg Cmp: 2.112449799196787
Num words : 747
Avg wordlen: 15.0
So, interestingly, in this real-world case(tm), the complexity does not
scale with O(n). It actually goes down (which is probably due to the
limit amount of words of higher length).
In summary, let's please not debate "feelings" or such. Either
measurements (with clear indiciation on what data they're based on and
how the test was conducted) or derivations (with an explanation of the
assumtions). Feelings can be very misleading in cases like this.
Best regards,
Johannes
#!/usr/bin/python3
import random
class Word():
def __init__(self, s):
self._s = s
self._c = 0
def __lt__(self, other):
if len(self) < len(other):
return True
elif len(self) > len(other):
return False
for i in range(len(self)):
self._c += 1
if self._s[i] < other._s[i]:
return True
return False
def __repr__(self):
return "%s<%d>" % (self._s, self._c)
def __len__(self):
return len(self._s)
def getcmp(self):
return self._c
wordlist = [ ]
for line in open("dict", "r"):
word = Word(line[:-1])
if len(word) == 9:
wordlist.append(word)
random.shuffle(wordlist)
wordlist.sort()
comparisons = [ word.getcmp() for word in wordlist ]
print("Max Cmp:", max(comparisons))
print("Sum Cmp:", sum(comparisons))
print("Avg Cmp:", sum(comparisons) / len(comparisons))
wordlengths = [ len(word) for word in wordlist ]
print("Min wordlen:", min(wordlengths))
print("Max wordlen:", max(wordlengths))
print("Avg wordlen:", sum(wordlengths) / len(wordlengths))
--
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
- Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-09-05 14:30 +0000 |
| Message-ID | <504761ef$0$29981$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #28474 |
On Wed, 05 Sep 2012 11:43:08 +0200, Johannes Bauer wrote: > On 05.09.2012 11:24, Johannes Bauer wrote: > >> Consider sorting a large dictionary. Sorting is in O(n log(n)), but >> this assumes O(1) comparisons! If the comparison operation itself were >> in O(n), the total sorting complexity would be O(n^2 log(n)), This shows some significant confusion about Big Oh complexity analysis here. When you say that sorting the list of words is O(N log N), the N refers to the number of words in the dictionary. But when you speak about comparisons being O(N), the N doesn't refer to the size of the dict, but the size of the individual strings being compared. There is no reasonable comparison algorithm where the effort required to compare two strings depends on the size of the dictionary they have come from. Since these are *completely different Ns*, you can't combine them to get O(N**2 log N) as the overall algorithmic complexity. Doing so is sheer nonsense. If you state it like this: sorting the list of words is O(N log N) where N = length of the list comparing the words is O(M) where M = the average length of the words then the overall complexity is O(N log N)*O(M), but since M is a constant for any particular dictionary (its just the average length of the words) that reduces down to O(N log N). To say that sorting a dictionary is O(N log N) does *not* imply that string comparisons are O(1). It only implies that string comparisons don't depend on the size of the dictionary. Comparing: s1 = 'a'*30002 s2 = 'a'*30001 + 'b' takes the same amount of time whether they are in a dictionary of 100 words or one of 100000 words. For the purposes of calculating the algorithmic complexity of sorting a large list of words, we can treat comparisons as an elementary operation even though they actually aren't. >> which is >> definitely false. Most comparisons will abort *very* early (after the >> first character). You are making unjustified assumptions about the distribution of letters in the words. This might be a list of long chemical compounds where the words typically differ only in their suffix. It might be a list of people with titles: Herr Professor Frederick Schmidt Herr Professor Frederick Wagner ... But either way, it doesn't matter for the Big Oh behaviour of sorting the words. > I've just written a program to demonstrate this (below it's attached). I have no idea why you think this program demonstrates anything about string equality comparisons. What you are counting is not the average number of comparisons for string equality, but the number of comparisons when sorting. These are *completely different*, because sorting a list does not require testing each string against every other string. Whatever you have measured, it has nothing to do with the Big Oh complexity of string comparisons. It seems to me that you have misunderstood the purpose of Big Oh notation. It gives an asymptotic upper bound on the amount of work done, ignoring all lower terms and multiplicative factors, not an exact formula. If something is O(N), then this tells you: 1) the number of algorithmic steps is proportional to N, plus or minus some terms which are less than N (e.g. sqrt(N), or log(N), or 1/N, or some constant, etc.); 2) if you ignore those other terms, and keep all other influences equal, then doubling N will *at worst* double the number of algorithmic steps; 3) if all those algorithmic steps take exactly the same amount of time, then doubling N will take twice as much time. But of course *none* of those assumptions hold for measurements of actual elapsed time. In real life, operations rarely take constant time, you don't always see worst case behaviour, and the lower terms are not necessarily insignificant enough to ignore. [...] > So, interestingly, in this real-world case(tm), the complexity does not > scale with O(n). It actually goes down (which is probably due to the > limit amount of words of higher length). I would guess that it probably has more to do with the ability of the timsort algorithm to exploit local order within a list and avoid performing comparisons. > In summary, let's please not debate "feelings" or such. I have no idea what feelings you are referring to. -- Steven
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2012-09-05 16:51 +0200 |
| Message-ID | <k27osg$q7t$1@news.albasani.net> |
| In reply to | #28500 |
On 05.09.2012 16:30, Steven D'Aprano wrote: > Since these are *completely different Ns*, you can't combine them to get > O(N**2 log N) as the overall algorithmic complexity. Doing so is sheer > nonsense. If you state it like this: Yes, you are correct here. > You are making unjustified assumptions about the distribution of letters > in the words. This might be a list of long chemical compounds where the > words typically differ only in their suffix. It might be a list of people > with titles: Actually, I'm not. I'm stating exactly what assumptions I'm making to get my calculation. I'm comparing *random* character strings or bitstrings. You, on the other hand, are making vague assumptions which you do not care for formalize and yet you claim that "the number of comparisons is equally likely to be 1, 2, 3, ..., N. The average then is". Without any explanation for this. At all. > Herr Professor Frederick Schmidt > Herr Professor Frederick Wagner > ... Is your assumtion that we're comparing words that have the common prefix "Herr Professor Frederick "? Why don't you clearly state what you're comparing. In the example you gave in the other thread you claimed "Note that this average assumes the strings are completely random.", yet now you're talking about specific word patterns. > I have no idea why you think this program demonstrates anything about > string equality comparisons. What you are counting is not the average > number of comparisons for string equality, but the number of comparisons > when sorting. These are *completely different*, because sorting a list > does not require testing each string against every other string. You're wrong. It's not counting the number of string comparisons, it's counting the number of character comparisons. Which is exactly what we're talking about in this thread, are we not? The sorting of the list is just to show some example where lots of strings are compared. >> So, interestingly, in this real-world case(tm), the complexity does not >> scale with O(n). It actually goes down (which is probably due to the >> limit amount of words of higher length). > > I would guess that it probably has more to do with the ability of the > timsort algorithm to exploit local order within a list and avoid > performing comparisons. Again, it's not counting the number of string comparisons. It's counting the average number of character comparisons per string comparison. The total amount of string comparisons has nothing to do with it. >> In summary, let's please not debate "feelings" or such. > > I have no idea what feelings you are referring to. I'm referring to "the number of comparisons is equally likely to be 1, 2, 3, ..., N. The average then is" without any deduction or rationale. Without any explanation or assumption. Regards, Johannes -- >> Wo hattest Du das Beben nochmal GENAU vorhergesagt? > Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-09-05 16:24 +0000 |
| Message-ID | <50477cbb$0$29981$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #28505 |
On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote: [...] >> You are making unjustified assumptions about the distribution of >> letters in the words. This might be a list of long chemical compounds >> where the words typically differ only in their suffix. It might be a >> list of people with titles: > > Actually, I'm not. I'm stating exactly what assumptions I'm making to > get my calculation. I'm comparing *random* character strings or > bitstrings. Excuse me, you are not. You are comparing English words which are highly non-random. > You, on the other hand, are making vague assumptions which you do not > care for formalize and yet you claim that "the number of comparisons is > equally likely to be 1, 2, 3, ..., N. The average then is". Without any > explanation for this. At all. I will accept that my explanation was not good enough, but I strongly disagree that I gave no explanation at all. >> Herr Professor Frederick Schmidt >> Herr Professor Frederick Wagner >> ... > > Is your assumtion that we're comparing words that have the common prefix > "Herr Professor Frederick "? No, I am pointing out that *your* assumption that most string comparisons will halt close to the beginning of the string is an invalid assumption. Your assumption only holds for some non-random strings. [...] >> I have no idea why you think this program demonstrates anything about >> string equality comparisons. What you are counting is not the average >> number of comparisons for string equality, but the number of >> comparisons when sorting. These are *completely different*, because >> sorting a list does not require testing each string against every other >> string. > > You're wrong. It's not counting the number of string comparisons, I didn't say it was. > it's counting the number of character comparisons. Correct. But only out of an extremely limited subset of all possible string comparisons, namely those very few that happen when sorting lists of English words using a very specific algorithm, namely timsort. > Which is exactly what > we're talking about in this thread, are we not? The sorting of the list > is just to show some example where lots of strings are compared. It is not good enough to extract a non-random subset of strings (only English words) and then decrease the data set even further by only comparing an extremely non-uniform subset of those strings: specifically those few string comparisons that happen to occur using timsort. Whatever figure you have calculated by taking this non-random selection, it is irrelevant to the question of the average performance of string equality given random strings. >>> So, interestingly, in this real-world case(tm), the complexity does >>> not scale with O(n). It actually goes down (which is probably due to >>> the limit amount of words of higher length). >> >> I would guess that it probably has more to do with the ability of the >> timsort algorithm to exploit local order within a list and avoid >> performing comparisons. > > Again, it's not counting the number of string comparisons. It's counting > the average number of character comparisons per string comparison. The > total amount of string comparisons has nothing to do with it. I didn't say anything about counting string comparisons. But you are wrong -- the numbers you get are *strongly* influenced by the number of string comparisons performed. That is just one of the reasons why the results you get are biased. Let me spell it out for you: given a list of five letter words: ['fruit', 'apple', 'claim', 'pears', ... , 'toast'] sorting the list may compare: 'fruit' with 'apple' 'apple' with 'claim' but almost certainly will not compare: 'apple' with 'toast' and it definitely won't compare: 'zzzax' with 'zzzaq' because strings like those never get into the list in the first place. For the sake of simple calculations, let's pretend that there are only 1000 five-letter strings possible. We want to know how many character- comparisons are done on average when testing two random five-letter strings for equality. To calculate this average, you must compare every string to every other string, *including* itself. This gives 1000*1000 = one million equality tests to be performed. For each equality test, we count the number of character comparisons performed, sum all those counts, and divide by 1e6. That is the average number of char comparisons for random strings of five letters. But that's not what you do. First you eliminate 900 out of the 1000 possible strings by only sampling English words -- strings like "xxoij" cannot possibly be selected. So you are left with the highly non-random subset of 10000 equality tests. But you haven't finished. Instead of checking all 10000 equality tests, you then decide to only check the 2000 tests that happen to randomly occur when using the timsort algorithm to sort a list. So from the population of one million possible equality tests, you throw away the vast majority, then average the *strongly non-random* few that are remaining. Naturally the result you get is strongly biased, and it has nothing to do with the average Big Oh behaviour of equality on random strings. -- Steven
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2012-09-05 22:47 +0000 |
| Message-ID | <mailman.272.1346885254.27098.python-list@python.org> |
| In reply to | #28519 |
In news.gmane.comp.python.general, you wrote: > On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote: > [...] >>> You are making unjustified assumptions about the distribution of >>> letters in the words. This might be a list of long chemical compounds >>> where the words typically differ only in their suffix. It might be a >>> list of people with titles: >> >> Actually, I'm not. I'm stating exactly what assumptions I'm making to >> get my calculation. I'm comparing *random* character strings or >> bitstrings. > > Excuse me, you are not. You are comparing English words which are highly > non-random. Evidently we have different understandings of what 'random' means. I don't think it's unreasonable to say that strings drawn uniformly from the set of all strings in the English language (having a given number of characters) is random. The distribution is not uniform over the set of all possible character strings but it is still random. I think Johannes deliberately chose these strings to emulate a particular kind of 'real' distribution of strings that might occur in practise. > > >> You, on the other hand, are making vague assumptions which you do not >> care for formalize and yet you claim that "the number of comparisons is >> equally likely to be 1, 2, 3, ..., N. The average then is". Without any >> explanation for this. At all. > > I will accept that my explanation was not good enough, but I strongly > disagree that I gave no explanation at all. > > >>> Herr Professor Frederick Schmidt >>> Herr Professor Frederick Wagner >>> ... >> >> Is your assumtion that we're comparing words that have the common prefix >> "Herr Professor Frederick "? > > No, I am pointing out that *your* assumption that most string comparisons > will halt close to the beginning of the string is an invalid assumption. > Your assumption only holds for some non-random strings. I think you have this backwards. The case where this assumption is provably true is precisely for random strings. To be clear, when I say 'random' in this context I mean that each character is chosen independently from the same probability distribution over the possible characters regardless of which index it has in the string and regardless of what the other characters are (IID). In this case the probability that comparison terminates at the jth character decreases exponentially with j. This means that for large strings the expected number of character comparisons is independent of the number of characters in the string as the probability of reaching the later parts of the string is too small for them to have any significant effect. This is provable and applies regardless of how many possible characters there are and whether or not each character is equally likely (except for the pathological case where one character has a probability of 1). For strings from 'real' distributions it is harder to make statements about the 'average case' and it is possible to construct situations where the comparison would regularly need to compare a common prefix. However, to get asymptotic performance worse than O(1) it is not sufficient to say that there may be a common prefix such as 'Herr' in the distribution of strings. It is necessary that, somehow, the common prefix is likely to grow as the size of the strings grows. For example, the set of all strings of length N whose first N//2 characters are always 'a' and whose remaining characters are chosen IID would lead to O(N) performance. This is why the file paths example chosen at the start of this thread is a good one. If a program is dealing with a number of large strings representing file paths then it is not uncommon that many of those paths would refer to files in the same deeply nested directory and hence compare equal for a significant number of characters. This could lead to average case O(N) performance. I think it's appropriate to compare string comparison with dict insertion: Best case O(1) (no hash collisions) Worst case O(N) (collides with every key) Average case O(1) (as long as you don't use pathological data) The only difference with string comparison is that there are some conceivable, non-malicious cases where the pathological data can occur (such as with file paths). However, I suspect that out of all the different uses of python strings these cases are a small minority. In saying that, it's not inconceivable that someone could exploit string comparison by providing pathological data to make normally O(1) operations behave as O(N). If I understand correctly it was precisely this kind of problem with dict insertion/lookup that lead to the recent hash-seed security update. Oscar
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-09-06 08:33 +0000 |
| Message-ID | <50485fca$0$29977$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #28544 |
On Wed, 05 Sep 2012 22:47:14 +0000, Oscar Benjamin wrote:
> In news.gmane.comp.python.general, you wrote:
>> On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote: [...]
>>>> You are making unjustified assumptions about the distribution of
>>>> letters in the words. This might be a list of long chemical compounds
>>>> where the words typically differ only in their suffix. It might be a
>>>> list of people with titles:
>>>
>>> Actually, I'm not. I'm stating exactly what assumptions I'm making to
>>> get my calculation. I'm comparing *random* character strings or
>>> bitstrings.
>>
>> Excuse me, you are not. You are comparing English words which are
>> highly non-random.
>
> Evidently we have different understandings of what 'random' means. I
> don't think it's unreasonable to say that strings drawn uniformly from
> the set of all strings in the English language (having a given number of
> characters) is random. The distribution is not uniform over the set of
> all possible character strings but it is still random. I think Johannes
> deliberately chose these strings to emulate a particular kind of 'real'
> distribution of strings that might occur in practise.
Of course for some "real" applications, your strings being compared will
be English words. And for other applications, they will be strings of
numeric digits uniformly distributed between 20000 and 30000. Or taken
from a Gaussian distribution centered around 7000. Or strings made up of
deeply nested sets of parentheses (((((( ... )))))). Whichever special
distribution of strings you pick, I don't think calling them "random" is
terribly useful in context, even if they are generated randomly from some
non-uniform distribution.
But you know, it really doesn't make a difference. Equality testing will
*still* be O(N) since the asymptomatic behaviour for sufficiently large
string comparisons will be bounded by O(N). Multiplicative constants
("half the string" versus "0.001 of the string") do not matter.
I may have been overly-conservative earlier when I said that on average
string equality has to compare half the characters. I thought I had
remembered that from a computer science textbook, but I can't find that
reference now, so possibly I was thinking of something else. (String
searching perhaps?). In any case, the *worst* case for string equality
testing is certainly O(N) (every character must be looked at), and the
*best* case is O(1) obviously (the first character fails to match). But
I'm not so sure about the average case. Further thought is required.
--
Steven
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-09-06 06:07 -0400 |
| Message-ID | <mailman.287.1346926087.27098.python-list@python.org> |
| In reply to | #28557 |
On 09/06/2012 04:33 AM, Steven D'Aprano wrote: > <snip> > > I may have been overly-conservative earlier when I said that on > average string equality has to compare half the characters. I thought > I had remembered that from a computer science textbook, but I can't > find that reference now, so possibly I was thinking of something else. > (String searching perhaps?). In any case, the *worst* case for string > equality testing is certainly O(N) (every character must be looked > at), and the *best* case is O(1) obviously (the first character fails > to match). But I'm not so sure about the average case. Further thought > is required. For random strings (as defined below), the average compare time is effectively unrelated to the size of the string, once the size passes some point. Define random string as being a selection from a set of characters, with replacement. So if we pick some set of characters, say 10 (or 256, it doesn't really matter), the number of possible strings is 10**N. The likelihood of not finding a mismatch within k characters is (1/10)**k The likelihood of actually reaching the end of the random string is (1/10)**N. (which is the reciprocal of the number of strings, naturally) If we wanted an average number of comparisons, we'd have to calculate a series, where each term is a probability times a value for k. sum((k * 9*10**-k) for k in range(1, 10)) Those terms very rapidly approach 0, so it's safe to stop after a few. Looking at the first 9 items, I see a value of 1.1111111 This may not be quite right, but the value is certainly well under 2 for a population of 10 characters, chosen randomly. And notice that N doesn't really come into it. -- DaveA
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-09-07 04:06 +0000 |
| Message-ID | <504972d1$0$29981$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #28569 |
On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote: > On 09/06/2012 04:33 AM, Steven D'Aprano wrote: >> <snip> >> >> I may have been overly-conservative earlier when I said that on average >> string equality has to compare half the characters. I thought I had >> remembered that from a computer science textbook, but I can't find that >> reference now, so possibly I was thinking of something else. (String >> searching perhaps?). In any case, the *worst* case for string equality >> testing is certainly O(N) (every character must be looked at), and the >> *best* case is O(1) obviously (the first character fails to match). But >> I'm not so sure about the average case. Further thought is required. > > For random strings (as defined below), the average compare time is > effectively unrelated to the size of the string, once the size passes > some point. Okay, I've spent some time looking for the reference, and can't find it, so I'm now sure I must have been conflating it with something else. After further thought, and giving consideration to the arguments given by people here, I'm now satisfied to say that for equal-length strings, string equality is best described as O(N). 1) If the strings are equal, a == b will always compare all N characters in each string. 2) If the strings are unequal, a == b will *at worst* compare all N characters. 3) Following usual practice in this field, worst case is the one which conventionally is meant when discussing Big Oh behaviour. See, for example, "Introduction To Algorithms" by Cormen, Leiserson and Rivest. Also of some interest is the best case: O(1) for unequal strings (they differ at the first character) and O(N) for equal strings. Also of interest is the case that has caused the majority of the discussion, the average case. I am now satisfied that the average number of comparisons for unequal strings is O(1). To be precise, it is bounded below by 1 comparison (you always have to compare at least one pair of characters) and bounded above by 2 comparisons. (I'm talking about the average here -- the actual number of comparisons can range all the way up to N, but the average is <= 2.) If I've done the maths right, the exact value for the average is: ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N) for random strings of length N taken from an alphabet of size M. For M = 2, that average approaches but never exceeds 2 as N increases; for M = 3, the average approaches 1.5, for M = 4 it approaches 1.333... and so forth. Thanks to all who contributed. -- Steven
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2012-09-07 19:10 +0000 |
| Message-ID | <mailman.366.1347045032.27098.python-list@python.org> |
| In reply to | #28676 |
On 2012-09-07, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> <snip>
>
> After further thought, and giving consideration to the arguments given by
> people here, I'm now satisfied to say that for equal-length strings,
> string equality is best described as O(N).
>
> 1) If the strings are equal, a == b will always compare all N
> characters in each string.
>
> 2) If the strings are unequal, a == b will *at worst* compare
> all N characters.
>
> 3) Following usual practice in this field, worst case is the
> one which conventionally is meant when discussing Big Oh
> behaviour. See, for example, "Introduction To Algorithms"
> by Cormen, Leiserson and Rivest.
Would you say, then, that dict insertion is O(N)?
>
> Also of some interest is the best case: O(1) for unequal strings (they
> differ at the first character) and O(N) for equal strings.
>
> Also of interest is the case that has caused the majority of the
> discussion, the average case. I am now satisfied that the average number
> of comparisons for unequal strings is O(1). To be precise, it is bounded
> below by 1 comparison (you always have to compare at least one pair of
> characters) and bounded above by 2 comparisons.
I find this idea of separating into the comparison of equal strings versus the
comparison of unequal strings rather odd. If the strings you compare come from
a distribution where they are guaranteed to be equal (or unequal) then you can
just use the O(0) comparison method.
Since string comparison is only useful if the strings can be equal or unequal,
the average case depends on how often they are equal/unequal as well as the
average complexity of both. For random strings the frequency of equal strings
decreases very fast as N increases so that the comparison of random strings is
O(1).
>
> (I'm talking about the average here -- the actual number of comparisons
> can range all the way up to N, but the average is <= 2.)
>
> If I've done the maths right, the exact value for the average is:
>
> ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)
I'm not sure where the extra N comes from --------^ but otherwise good.
I would have written that as:
(1 - p) * sum(i * p**(i-1) for i in range(1, N+1))
where p is the probability of a match (1/M for M equally likely characters) or
in closed form:
⎛ N ⎞
⎝1 - p ⋅(1 + N ⋅(1 - p))⎠
─────────────────────────
1 - p
>
> for random strings of length N taken from an alphabet of size M.
>
> For M = 2, that average approaches but never exceeds 2 as N increases;
> for M = 3, the average approaches 1.5, for M = 4 it approaches 1.333...
> and so forth.
It approaches 1 / (1 - p) or, if you prefer: M / (M - 1)
Oscar
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-09-08 00:55 +0000 |
| Message-ID | <504a9785$0$29981$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #28705 |
On Fri, 07 Sep 2012 19:10:16 +0000, Oscar Benjamin wrote: > On 2012-09-07, Steven D'Aprano <steve+comp.lang.python@pearwood.info> > wrote: >> <snip> >> >> After further thought, and giving consideration to the arguments given >> by people here, I'm now satisfied to say that for equal-length strings, >> string equality is best described as O(N). >> >> 1) If the strings are equal, a == b will always compare all N >> characters in each string. >> >> 2) If the strings are unequal, a == b will *at worst* compare >> all N characters. >> >> 3) Following usual practice in this field, worst case is the >> one which conventionally is meant when discussing Big Oh behaviour. >> See, for example, "Introduction To Algorithms" by Cormen, Leiserson >> and Rivest. > > Would you say, then, that dict insertion is O(N)? Pedantically, yes. But since we're allowed to state (or even imply *wink*) whatever assumptions we like, we're allowed to assume "in the absence of significant numbers of hash collisions" and come up with amortized O(1) for dict insertions and lookups. (Provided, of course, that your computer has an infinite amount of unfragmented memory and the OS never starts paging your dict to disk. Another unstated assumption that gets glossed over when we talk about complexity analysis -- on real world computers, for big enough N, *everything* is O(2**N) or worse.) Big Oh analysis, despite the formal mathematics used, is not an exact science. Essentially, it is a way of bringing some vague order to hand- wavy estimates of complexity, and the apparent mathematical rigour is built on some awfully shaky foundations. But despite that, it actually is useful. Coming back to strings... given that in any real-world application, you are likely to have some string comparisons on equal strings and some on unequal strings, and more importantly you don't know which are which ahead of time, which attitude is less likely to give you a nasty surprise when you run your code? "I have many millions of 100K strings to compare against other 100K strings, and string comparisons are O(1) so that will be fast." "I have many millions of 100K strings to compare against other 100K strings, and string comparisons are O(N) so that will be slow, better find another algorithm." Remember too that "for small enough N, everything is O(1)". Getting hung up on Big Oh is just as much a mistake as ignoring it completely. > I find this idea of separating into the comparison of equal strings > versus the comparison of unequal strings rather odd. If the strings you > compare come from a distribution where they are guaranteed to be equal > (or unequal) then you can just use the O(0) comparison method. If you know that they're (un)equal, you don't need to call == at all. If you know that "most" strings will be unequal, then you might be justified in treating comparisons as O(1) "most of the time" and not stress about the occasional slow call. But of course most of the time you don't know this, which is why it is appropriate to treat string comparisons as O(N) rather than O(1), since that's the upper bound. > Since string comparison is only useful if the strings can be equal or > unequal, the average case depends on how often they are equal/unequal as > well as the average complexity of both. For random strings the frequency > of equal strings decreases very fast as N increases so that the > comparison of random strings is O(1). But that is not an upper bound, and Big Oh analysis is strictly defined in terms of upper bounds. >> (I'm talking about the average here -- the actual number of comparisons >> can range all the way up to N, but the average is <= 2.) >> >> If I've done the maths right, the exact value for the average is: >> >> ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N) > > I'm not sure where the extra N comes from --------^ but otherwise good. The extra N comes from the case where you compare string S with itself, which takes exactly N comparisons. -- Steven
[toc] | [prev] | [next] | [standalone]
Page 1 of 6 [1] 2 3 4 5 6 Next page →
Back to top | Article view | comp.lang.python
csiph-web