Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #28380 > unrolled thread
| Started by | Roy Smith <roy@panix.com> |
|---|---|
| First post | 2012-09-03 21:54 -0400 |
| Last post | 2012-09-10 21:52 +0000 |
| Articles | 20 on this page of 101 — 21 participants |
Back to article view | Back to comp.lang.python
Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-03 21:54 -0400
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-04 12:07 +1000
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 02:17 +0000
Re: Comparing strings from the back? Dan Sommers <dan@tombstonezero.net> - 2012-09-03 21:56 -0700
Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-04 08:50 +0100
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-04 18:32 +0200
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 18:07 +0000
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:17 +0200
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-05 07:59 +1000
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:24 +0200
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:43 +0200
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-05 14:30 +0000
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 16:51 +0200
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-05 16:24 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-05 22:47 +0000
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-06 08:33 +0000
Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 06:07 -0400
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-07 04:06 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-07 19:10 +0000
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-08 00:55 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-08 11:53 +0000
RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-13 18:39 +0000
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 15:37 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 20:48 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 00:46 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 21:54 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 01:38 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 23:06 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 04:03 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-14 01:20 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 04:53 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-14 03:26 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 07:36 -0400
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-14 10:16 +0000
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 07:43 -0400
RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-14 22:43 +0000
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 23:10 -0400
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-16 18:11 -0700
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-17 14:05 +1000
Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-16 23:06 -0700
Re: Comparing strings from the back? Ethan Furman <ethan@stoneleaf.us> - 2012-09-17 13:35 -0700
Re: Comparing strings from the back? Neil Hodgson <nhodgson@iinet.net.au> - 2012-09-18 09:14 +1000
Re: Comparing strings from the back? Ethan Furman <ethan@stoneleaf.us> - 2012-09-18 08:12 -0700
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 11:55 -0400
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 11:59 -0400
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 12:17 -0400
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-19 00:47 +0000
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-19 02:20 +1000
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 16:40 -0400
Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-19 00:48 +0100
Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-13 20:53 +0100
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 17:06 -0400
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-14 03:39 +0000
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-14 14:15 +1000
Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-13 22:17 +0100
Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 17:35 -0400
RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-14 21:32 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-07 19:40 +0000
Re: Comparing strings from the back? Gelonida N <gelonida@gmail.com> - 2012-09-08 17:52 +0200
Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-10 08:59 +0000
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-10 13:45 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:06 +0000
Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-11 09:51 +0000
Re: Comparing strings from the back? Terry Reedy <tjreedy@udel.edu> - 2012-09-11 11:55 -0400
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-11 00:26 +1000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:32 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:43 +0000
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-11 00:56 +1000
Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-11 09:41 +0000
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-06 12:04 +0100
Re: Comparing strings from the back? Steve Howell <showell30@yahoo.com> - 2012-09-14 17:51 -0700
Re: Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-06 08:13 -0400
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-06 22:29 +1000
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 15:43 +0200
Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 10:23 -0400
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:33 +0200
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:42 +0200
Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 11:54 -0400
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:34 +0200
Re: Comparing strings from the back? Gelonida N <gelonida@gmail.com> - 2012-09-08 17:50 +0200
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 15:37 +0200
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-07 00:39 +1000
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 17:36 +0200
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 17:44 +0200
Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-07 04:07 +0000
Re: Comparing strings from the back? Terry Reedy <tjreedy@udel.edu> - 2012-09-04 01:13 -0400
Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-04 08:56 +0100
Re: Comparing strings from the back? Alain Ketterlin <alain@dpt-info.u-strasbg.fr> - 2012-09-04 11:58 +0200
Re: Comparing strings from the back? Neil Hodgson <nhodgson@iinet.net.au> - 2012-09-05 12:18 +1000
Re: Comparing strings from the back? MRAB <python@mrabarnett.plus.com> - 2012-09-05 03:39 +0100
Re: Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-04 22:48 -0400
Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 16:33 +0200
Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 10:29 +0200
Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-05 18:33 +1000
Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 11:48 +0200
Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 17:45 +0200
Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 18:07 +0200
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 16:33 +0000
Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 19:32 +0200
Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 19:44 +0200
Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 21:52 +0000
Page 5 of 6 — ← Prev page 1 2 3 4 [5] 6 Next page →
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2012-09-06 15:37 +0200 |
| Message-ID | <k2a8u4$hdt$1@news.albasani.net> |
| In reply to | #28519 |
On 05.09.2012 18:24, Steven D'Aprano wrote:
> On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote:
> [...]
>>> You are making unjustified assumptions about the distribution of
>>> letters in the words. This might be a list of long chemical compounds
>>> where the words typically differ only in their suffix. It might be a
>>> list of people with titles:
>>
>> Actually, I'm not. I'm stating exactly what assumptions I'm making to
>> get my calculation. I'm comparing *random* character strings or
>> bitstrings.
>
> Excuse me, you are not. You are comparing English words which are highly
> non-random.
Not in my original post. If you read it again, you will clearly see that
I was talking about purely random strings. And since you like to
nitpick, I'll clarify further: I'm talking about bitstrings in which
every bit of every character has the same probability of occurence, 50%.
You then replied by mentioning probability distributions of characters
in different languages/encodings, to which I tried giving a example as
it might (and does) occur in the real world, like sorting a dictionary.
But the original point is still valid: Sorting of randomized bitstrings
is definitely not O(N), you got that wrong.
>> You, on the other hand, are making vague assumptions which you do not
>> care for formalize and yet you claim that "the number of comparisons is
>> equally likely to be 1, 2, 3, ..., N. The average then is". Without any
>> explanation for this. At all.
>
> I will accept that my explanation was not good enough, but I strongly
> disagree that I gave no explanation at all.
What possible explanation could there be that comparison of purely
random bitstrings yields an equal amount of comparisons for its length?
>>> Herr Professor Frederick Schmidt
>>> Herr Professor Frederick Wagner
>>> ...
>>
>> Is your assumtion that we're comparing words that have the common prefix
>> "Herr Professor Frederick "?
>
> No, I am pointing out that *your* assumption that most string comparisons
> will halt close to the beginning of the string is an invalid assumption.
> Your assumption only holds for some non-random strings.
Definitely not. It *especially* holds true for random strings. For
random strings with an alphabet of size n, the probability that the
first character needs to be compared is n^0, i.e. 1. The probability
that the second character needs to be compared is n^(-1). For the xth
character, it is n^(-x). This is due to the lazy abort in comparison and
it results in the average number of comparisons being extremely short no
matter the bitlength n, or O(log n).
>> it's counting the number of character comparisons.
>
> Correct. But only out of an extremely limited subset of all possible
> string comparisons, namely those very few that happen when sorting lists
> of English words using a very specific algorithm, namely timsort.
Yes, but this was to look at a real-world example (in which way more
comparisons are needed than in the random case). You first were talking
about randomized bitstrings (and so was I), then you brought up
character sets and languages (which, for the original problem, are
entirely irrelevant).
> Whatever figure you have calculated by taking this non-random selection,
> it is irrelevant to the question of the average performance of string
> equality given random strings.
Correct. I gave the estimate for random strings in my very first post.
> For the sake of simple calculations, let's pretend that there are only
> 1000 five-letter strings possible. We want to know how many character-
> comparisons are done on average when testing two random five-letter
> strings for equality. To calculate this average, you must compare every
> string to every other string, *including* itself.
>
> This gives 1000*1000 = one million equality tests to be performed. For
> each equality test, we count the number of character comparisons
> performed, sum all those counts, and divide by 1e6. That is the average
> number of char comparisons for random strings of five letters.
Easy enough. Since you didn't look at my original equation, here are
some numbers and the program attached:
64 combinations for length 6
Char comparisons: 8064
Comparison cnt : 4096
Avg comparison : 1.97
128 combinations for length 7
Char comparisons: 32512
Comparison cnt : 16384
Avg comparison : 1.98
256 combinations for length 8
Char comparisons: 130560
Comparison cnt : 65536
Avg comparison : 1.99
512 combinations for length 9
Char comparisons: 523264
Comparison cnt : 262144
Avg comparison : 2.00
1024 combinations for length 10
Char comparisons: 2095104
Comparison cnt : 1048576
Avg comparison : 2.00
2048 combinations for length 11
Char comparisons: 8384512
Comparison cnt : 4194304
Avg comparison : 2.00
4096 combinations for length 12
Char comparisons: 33546240
Comparison cnt : 16777216
Avg comparison : 2.00
Hopefully now you'll see that your assumption O(n) is dead wrong.
> But that's not what you do. First you eliminate 900 out of the 1000
> possible strings by only sampling English words -- strings like "xxoij"
> cannot possibly be selected. So you are left with the highly non-random
> subset of 10000 equality tests.
This is *exactly* what my original calculation did. Enumerate all
possibilites and divide by the total number.
Regards,
Johannes
#!/usr/bin/python3
import itertools
import random
alphabet = [ c for c in "01" ]
def cmpstr(s1, s2):
c = 0
for i in range(len(s1)):
c += 1
if s1[i] != s2[i]:
break
return c
for strlength in range(1, 16):
combinations = list(itertools.product(*([ alphabet ] * strlength)))
assert(len(combinations) == len(alphabet) ** strlength)
print("%d combinations for length %d" % (len(combinations), strlength))
compcnt = 0
comparisons = 0
for c1 in combinations:
for c2 in combinations:
comparisons += cmpstr(c1, c2)
compcnt += 1
print("Char comparisons: %d" % (comparisons))
print("Comparison cnt : %d" % (compcnt))
print("Avg comparison : %.2f" % (comparisons / compcnt))
--
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
- Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-09-07 00:39 +1000 |
| Message-ID | <mailman.306.1346942376.27098.python-list@python.org> |
| In reply to | #28591 |
On Thu, Sep 6, 2012 at 11:37 PM, Johannes Bauer <dfnsonfsduifb@gmx.de> wrote: > Not in my original post. If you read it again, you will clearly see that > I was talking about purely random strings. And since you like to > nitpick, I'll clarify further: I'm talking about bitstrings in which > every bit of every character has the same probability of occurence, 50%. That sort of string isn't a normal thing to be comparing, though. Here's an idea. Someone who's doing a lot of arguing in this thread should take Python, find the string comparison routine, and hack in some statistics-gathering. Then run *real code* on it. Maybe use this with one of those web frameworks and run your web site on it for an hour or two, or fire off some real scripts you really use. Then dump out the stats at the end. My guess: The bulk of string comparisons that get to actually comparing byte-for-byte will end up returning True. Most of the false comparisons will be proven earlier; if I understand correctly, Python will check for identity (easy true) and different lengths (easy false). But my guess could turn out to be flat wrong. In any case, it'll be far FAR more useful than arguing from totally random, or random word selection, or anything. Who's game? ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2012-09-06 17:36 +0200 |
| Message-ID | <k2aftn$qu$1@news.albasani.net> |
| In reply to | #28604 |
On 06.09.2012 16:39, Chris Angelico wrote: > In any case, it'll be far FAR more useful than arguing from > totally random, or random word selection, or anything. > > Who's game? Okay, patched against Python 3.2.3: http://pastebin.com/PRRN53P6 To invoke display of the stats, compare the string "pleasedumpstats" as LHO, i.e.: "pleasedumpstats" < "" Just ran it against a big script of mine which takes the stringified IMDb text files, parses it and dumps it into a sqlite3 database. Surprisingly little string comparisons however (the sqlite module without doubt uses its own routines internally). Although the database in the end has about 2 million records, these were the stats: strCmpEq 1017 strCmpLt 2802 strCmpGt 1633 strCmpTc 16219 strCmpCc 8570 which means 5452 comparisons of which 19% were equal and the rest inequal. Average string length is about 2.97 characters and aborted was in average after 1.57 characters. Maybe someone has a test which makes heavier use of string comparison. I don't want to make up something, however, since this is (by definition) going to be artificial. Best regards, Johannes -- >> Wo hattest Du das Beben nochmal GENAU vorhergesagt? > Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2012-09-06 17:44 +0200 |
| Message-ID | <k2agbm$1jg$1@news.albasani.net> |
| In reply to | #28611 |
On 06.09.2012 17:36, Johannes Bauer wrote: > "pleasedumpstats" < "" And against a XML-reading C code generator that uses mako: strCmpEq 39670 strCmpLt 2766215 strCmpGt 2744002 strCmpTc 37430821 strCmpCc 14048656 Compared strings: 5549887 Equal: 0.7% Average wordlength: 6.74 chars Average comparelength: 2.53 chars Best regards, Johannes -- >> Wo hattest Du das Beben nochmal GENAU vorhergesagt? > Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-09-07 04:07 +0000 |
| Message-ID | <504972fb$0$29981$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #28604 |
On Fri, 07 Sep 2012 00:39:33 +1000, Chris Angelico wrote: > On Thu, Sep 6, 2012 at 11:37 PM, Johannes Bauer <dfnsonfsduifb@gmx.de> > wrote: >> Not in my original post. If you read it again, you will clearly see >> that I was talking about purely random strings. And since you like to >> nitpick, I'll clarify further: I'm talking about bitstrings in which >> every bit of every character has the same probability of occurence, >> 50%. > > That sort of string isn't a normal thing to be comparing, though. > > Here's an idea. Someone who's doing a lot of arguing in this thread > should take Python, find the string comparison routine, and hack in some > statistics-gathering. Then run *real code* on it. Where's the fun in that? :-P -- Steven
[toc] | [prev] | [next] | [standalone]
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-09-04 01:13 -0400 |
| Message-ID | <mailman.168.1346735624.27098.python-list@python.org> |
| In reply to | #28380 |
On 9/3/2012 9:54 PM, Roy Smith wrote: > There's been a bunch of threads lately about string implementations, and > that got me thinking (which is often a dangerous thing). > > Let's assume you're testing two strings for equality. You've already > done the obvious quick tests (i.e they're the same length), and you're > down to the O(n) part of comparing every character. > > I'm wondering if it might be faster to start at the ends of the strings > instead of at the beginning? I posted the 3.3 str (in)equality compare a few days ago. It first compares length and then kind, then the actual bytes in whatever chunks. If the latter uses the C/system mem compare, that is faster than anything coded in Python or even C. -- Terry Jan Reedy
[toc] | [prev] | [next] | [standalone]
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-09-04 08:56 +0100 |
| Message-ID | <mailman.172.1346745307.27098.python-list@python.org> |
| In reply to | #28380 |
On 04/09/2012 02:54, Roy Smith wrote: > There's been a bunch of threads lately about string implementations, and > that got me thinking (which is often a dangerous thing). > > Let's assume you're testing two strings for equality. You've already > done the obvious quick tests (i.e they're the same length), and you're > down to the O(n) part of comparing every character. > > I'm wondering if it might be faster to start at the ends of the strings > instead of at the beginning? If the strings are indeed equal, it's the > same amount of work starting from either end. But, if it turns out that > for real-life situations, the ends of strings have more entropy than the > beginnings, the odds are you'll discover that they're unequal quicker by > starting at the end. > > It doesn't seem un-plausible that this is the case. For example, most > of the filenames I work with begin with "/home/roy/". Most of the > strings I use as memcache keys have one of a small number of prefixes. > Most of the strings I use as logger names have common leading > substrings. Things like credit card and telephone numbers tend to have > much more entropy in the trailing digits. > > On the other hand, hostnames (and thus email addresses) exhibit the > opposite pattern. > > Anyway, it's just a thought. Has anybody studied this for real-life > usage patterns? > > I'm also not sure how this work with all the possible UCS/UTF encodings. > With some of them, you may get the encoding semantics wrong if you don't > start from the front. > Good grief, what timing!!! The dust has yet to settle over the "Flexible string representation, unicode, typography, ..." thread and you ask this. I'll just cross my fingers and hope that people stick with facts, realistic benchmarks etc, and not good old fashioned FUD. -- Cheers. Mark Lawrence.
[toc] | [prev] | [next] | [standalone]
| From | Alain Ketterlin <alain@dpt-info.u-strasbg.fr> |
|---|---|
| Date | 2012-09-04 11:58 +0200 |
| Message-ID | <87ipbuyurz.fsf@dpt-info.u-strasbg.fr> |
| In reply to | #28380 |
Roy Smith <roy@panix.com> writes: > There's been a bunch of threads lately about string implementations, and > that got me thinking (which is often a dangerous thing). > > Let's assume you're testing two strings for equality. You've already > done the obvious quick tests (i.e they're the same length), and you're > down to the O(n) part of comparing every character. > > I'm wondering if it might be faster to start at the ends of the strings > instead of at the beginning? If the strings are indeed equal, it's the > same amount of work starting from either end. But, if it turns out that > for real-life situations, the ends of strings have more entropy than the > beginnings, the odds are you'll discover that they're unequal quicker by > starting at the end. I guess we all have examples with specific entropy distribution. Going backwards on long strings can be profitable with file paths. If that is the case, and if you spend a lot of time comparing strings, why not store them in reverse? > It doesn't seem un-plausible that this is the case. For example, most > of the filenames I work with begin with "/home/roy/". Most of the > strings I use as memcache keys have one of a small number of prefixes. > Most of the strings I use as logger names have common leading > substrings. In that case I would split the paths, and use some sort of string-id, or use intern() and then compare components with "is" instead of "==". (Basically, turning the string of chars into a strings of ids/ints/pointers.) > Things like credit card and telephone numbers tend to have much more > entropy in the trailing digits. On the other hand, hostnames (and thus > email addresses) exhibit the opposite pattern. Yes, real-world is a mess. > Anyway, it's just a thought. Has anybody studied this for real-life > usage patterns? I've tried the "intern()" trick several times with lists of strings, and was satisfied, but this was in specific cases (with a finite set of strings). For "short" strings of chars (up to, say a hundred of characters), I've never found anything significantly faster than the default sweep (that was in C/C++, not python). For longer and/or more structured strings/lists, making use of the structure is a good idea, but I see no general pattern here (as you note). > I'm also not sure how this work with all the possible UCS/UTF encodings. > With some of them, you may get the encoding semantics wrong if you don't > start from the front. If you're testing for equality, I can't see how this could matter, even with variable-length encodings. If you're comparing different encodings, then you need different access methods to random characters (but this doesn't affect the algorithm). If you're using variable-length encoding, e.g., UTF-8, accessing at a specific position is not possible. -- Alain.
[toc] | [prev] | [next] | [standalone]
| From | Neil Hodgson <nhodgson@iinet.net.au> |
|---|---|
| Date | 2012-09-05 12:18 +1000 |
| Message-ID | <-9CdnaqjTK6NKtvNnZ2dnUVZ_gednZ2d@westnet.com.au> |
| In reply to | #28380 |
Roy Smith:
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning? If the strings are indeed equal, it's the
> same amount of work starting from either end.
Most people write loops that go forwards. This leads to the
processor designers prioritizing detection mechanisms like cache
prefetching for that case.
However, its not always the situation: a couple of years ago Intel
contributed a memcpy implementation to glibc that went backwards for a
performance improvement. An explanation from a related patch involves
speculative and committed operations and address bits on some processors
and quickly lost me:
paragraph 3 of
http://lists.freedesktop.org/archives/pixman/2010-August/000423.html
The memcpy patch was controversial as it broke Adobe Flash which
assumed memcpy was safe like memmove.
Neil
[toc] | [prev] | [next] | [standalone]
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2012-09-05 03:39 +0100 |
| Message-ID | <mailman.204.1346812744.27098.python-list@python.org> |
| In reply to | #28440 |
On 05/09/2012 03:18, Neil Hodgson wrote: > Roy Smith: > >> I'm wondering if it might be faster to start at the ends of the strings >> instead of at the beginning? If the strings are indeed equal, it's the >> same amount of work starting from either end. > > Most people write loops that go forwards. This leads to the > processor designers prioritizing detection mechanisms like cache > prefetching for that case. > > However, its not always the situation: a couple of years ago Intel > contributed a memcpy implementation to glibc that went backwards for a > performance improvement. An explanation from a related patch involves > speculative and committed operations and address bits on some processors > and quickly lost me: > paragraph 3 of > http://lists.freedesktop.org/archives/pixman/2010-August/000423.html > > The memcpy patch was controversial as it broke Adobe Flash which > assumed memcpy was safe like memmove. > I don't think I've ever use memcpy, only memmove.
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-09-04 22:48 -0400 |
| Message-ID | <roy-920352.22482004092012@news.panix.com> |
| In reply to | #28440 |
In article <-9CdnaqjTK6NKtvNnZ2dnUVZ_gednZ2d@westnet.com.au>, Neil Hodgson <nhodgson@iinet.net.au> wrote: > The memcpy patch was controversial as it broke Adobe Flash An added benefit!
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2012-09-05 16:33 +0200 |
| Message-ID | <k27nr0$o42$1@news.albasani.net> |
| In reply to | #28440 |
On 05.09.2012 04:18, Neil Hodgson wrote: > The memcpy patch was controversial as it broke Adobe Flash which > assumed memcpy was safe like memmove. Adobe Flash was broken before, making an assumption that is not guaranteed by the standard. The patch only showed the problem. Regards, Johannes -- >> Wo hattest Du das Beben nochmal GENAU vorhergesagt? > Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-09-05 10:29 +0200 |
| Message-ID | <mailman.221.1346833766.27098.python-list@python.org> |
| In reply to | #28380 |
Roy Smith wrote: > There's been a bunch of threads lately about string implementations, and > that got me thinking (which is often a dangerous thing). > > Let's assume you're testing two strings for equality. You've already > done the obvious quick tests (i.e they're the same length), and you're > down to the O(n) part of comparing every character. > > I'm wondering if it might be faster to start at the ends of the strings > instead of at the beginning? If the strings are indeed equal, it's the > same amount of work starting from either end. But, if it turns out that > for real-life situations, the ends of strings have more entropy than the > beginnings, the odds are you'll discover that they're unequal quicker by > starting at the end. > > It doesn't seem un-plausible that this is the case. For example, most > of the filenames I work with begin with "/home/roy/". Most of the > strings I use as memcache keys have one of a small number of prefixes. > Most of the strings I use as logger names have common leading > substrings. Things like credit card and telephone numbers tend to have > much more entropy in the trailing digits. But do you actually compare them with each other? Even if you do I'd guess that Python looks at the hash values most of the time. > On the other hand, hostnames (and thus email addresses) exhibit the > opposite pattern. Nor do english words: $ python3 reverse_compare2.py compare /usr/share/dict/words --word-len 8 comparing every pair in a sample of 1000 8-char words taken from '/usr/share/dict/words' head 1: 477222 ************************************************************ 2: 18870 ** 3: 2870 4: 435 5: 74 6: 17 7: 12 tail 1: 386633 ************************************************************ 2: 74966 *********** 3: 29698 **** 4: 6536 * 5: 1475 6: 154 7: 28 8: 10 > Anyway, it's just a thought. Has anybody studied this for real-life > usage patterns? > > I'm also not sure how this work with all the possible UCS/UTF encodings. > With some of them, you may get the encoding semantics wrong if you don't > start from the front.
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-09-05 18:33 +1000 |
| Message-ID | <mailman.222.1346833988.27098.python-list@python.org> |
| In reply to | #28380 |
On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten <__peter__@web.de> wrote: > comparing every pair in a sample of 1000 8-char words > taken from '/usr/share/dict/words' > > head > 1: 477222 ************************************************************ > 2: 18870 ** > ... Not understanding this. What are the statistics, and what (if it's not obvious from the previous answer) do they prove? ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-09-05 11:48 +0200 |
| Message-ID | <mailman.228.1346838532.27098.python-list@python.org> |
| In reply to | #28380 |
Chris Angelico wrote:
> On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten <__peter__@web.de> wrote:
>> comparing every pair in a sample of 1000 8-char words
>> taken from '/usr/share/dict/words'
>>
>> head
>> 1: 477222 ************************************************************
>> 2: 18870 **
>> ...
tail
1: 386633 ************************************************************
2: 74966 ***********
> Not understanding this. What are the statistics,
I measured how many chars they have in common for all combinations of 1000
words taken from /usr/share/dict/words.
477222 pairs have one char in common, 18870 pairs have two chars in common
when compared from the *beginning* of the string.
386633 pairs have one char in common, 74966 pairs have two chars in common
when compared from the *end* of the string.
and what (if it's not obvious from the previous answer) do they prove?
They demonstrate that for two words from that particular corpus it is likely
that a[::-1] == b[::-1] has to take more (1.34 versus 1.05 on average)
characters into account than a == b, i. e. comparing from the back should be
slower rather than faster.
If that doesn't help, here's the code ;)
import random
import itertools
from contextlib import contextmanager
from collections import defaultdict
@contextmanager
def open_corpus(args):
with open(args.name) as instream:
words = (line.strip() for line in instream)
#words = (word for word in words if not word.endswith("'s"))
yield words
def pick(items, count):
items = list(items)
return random.sample(items, count)
def count_common(a, b):
i = 0
for i, (x, y) in enumerate(zip(a, b), 1):
if x != y:
break
return i
def corpus(args):
with open_corpus(args) as words:
wordlens = (len(line.strip()) for line in words)
freq = defaultdict(int)
for wordlen in wordlens:
freq[wordlen] += 1
show_histogram(freq)
def show_histogram(freq):
peak = max(freq.values())
freq_width = len(str(peak))
max_freq = max(freq)
len_width = len(str(max_freq))
for i in range(min(freq), max_freq+1):
value = freq[i]
print("{:{}}: {:{}} {}".format(
i, len_width, value,
freq_width, "*" * (60 * value // peak)))
def compare(args):
with open_corpus(args) as words:
words = pick(
(word for word in words if len(word) == args.word_len),
args.sample_size)
n = 0
head_common = defaultdict(int)
tail_common = defaultdict(int)
n_tail = n_head = 0
for n, (a, b) in enumerate(itertools.combinations(words, 2), 1):
cc = count_common(a, b)
n_head += cc
head_common[cc] += 1
ccr = count_common(a[::-1], b[::-1])
n_tail += ccr
tail_common[ccr] += 1
print(n, "combinations")
print("average common chars (head) {:.3}".format(n_head/n))
print("average common chars (tail) {:.3}".format(n_tail/n))
print("comparing every pair in a "
"sample of {sample} {len}-char words\n"
"taken from {name!r}".format(
sample=args.sample_size,
len=args.word_len,
name=args.name))
print()
print("head")
show_histogram(head_common)
print()
print("tail")
show_histogram(tail_common)
def main():
import argparse
parser = argparse.ArgumentParser()
parsers = parser.add_subparsers()
pcorpus = parsers.add_parser("corpus")
pcorpus.add_argument("name")
pcorpus.set_defaults(func=corpus)
pcompare = parsers.add_parser("compare")
pcompare.add_argument("name")
pcompare.add_argument("--sample-size", type=int, default=1000)
pcompare.add_argument("--word-len", type=int, default=8)
pcompare.set_defaults(func=compare)
args = parser.parse_args()
args.func(args)
if __name__ == "__main__":
main()
[toc] | [prev] | [next] | [standalone]
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-09-05 17:45 +0200 |
| Message-ID | <mailman.247.1346859964.27098.python-list@python.org> |
| In reply to | #28380 |
Oscar Benjamin wrote:
> On 5 September 2012 10:48, Peter Otten <__peter__@web.de> wrote:
>> def count_common(a, b):
[sorry for seriously broken implementation]
> This function will return 1 if the first character differs. It does not
> count the number of common characters but rather the more relevant
> quantity which is the number of comparisons required to decide if the
> strings are equal.
I remember stumbling over the non-zero frequency for len(word) but somehow
plodded on with beautifying the broken results :(
def count_common(a, b):
"""
>>> count_common("alpha", "beta")
0
>>> count_common("", "")
0
>>> count_common("alpha", "argument")
1
>>> count_common("alpha", "alpha")
5
>>> count_common("alpha", "alphx")
4
"""
i = 0
for x, y in zip(a, b):
if x != y:
break
i += 1
return i
$ python3 reverse_compare2.py compare /usr/share/dict/words
499500 combinations
average common chars (head) 0.0535
average common chars (tail) 0.322
comparing every pair in a sample of 1000 8-char words
taken from '/usr/share/dict/words'
head
0: 477371 ************************************************************
1: 18310 **
2: 3191
3: 523
4: 79
5: 15
6: 10
7: 1
tail
0: 385069 ************************************************************
1: 78742 ************
2: 26734 ****
3: 7462 *
4: 1297
5: 168
6: 22
7: 6
[toc] | [prev] | [next] | [standalone]
| From | Dan Goodman <dg.gmane@thesamovar.net> |
|---|---|
| Date | 2012-09-10 18:07 +0200 |
| Message-ID | <mailman.450.1347293273.27098.python-list@python.org> |
| In reply to | #28380 |
On 04/09/2012 03:54, Roy Smith wrote: > Let's assume you're testing two strings for equality. You've already > done the obvious quick tests (i.e they're the same length), and you're > down to the O(n) part of comparing every character. > > I'm wondering if it might be faster to start at the ends of the strings > instead of at the beginning? If the strings are indeed equal, it's the > same amount of work starting from either end. But, if it turns out that > for real-life situations, the ends of strings have more entropy than the > beginnings, the odds are you'll discover that they're unequal quicker by > starting at the end. From the rest of the thread, it looks like in most situations it won't make much difference as typically very few characters need to be compared if they are unequal. However, if you were in a situation with many strings which were almost equal, the most general way to improve the situation might be store a hash of the string along with the string, i.e. store (hash(x), x) and then compare equality of this tuple. Almost all of the time, if the strings are unequal the hash will be unequal. Or, as someone else suggested, use interned versions of the strings. This is basically the same solution but even better. In this case, your startup costs will be higher (creating the strings) but your comparisons will always be instant. Dan
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2012-09-10 16:33 +0000 |
| Message-ID | <mailman.455.1347294841.27098.python-list@python.org> |
| In reply to | #28380 |
On 2012-09-10, Dan Goodman <dg.gmane@thesamovar.net> wrote: > On 04/09/2012 03:54, Roy Smith wrote: >> Let's assume you're testing two strings for equality. You've already >> done the obvious quick tests (i.e they're the same length), and you're >> down to the O(n) part of comparing every character. >> >> I'm wondering if it might be faster to start at the ends of the strings >> instead of at the beginning? If the strings are indeed equal, it's the >> same amount of work starting from either end. But, if it turns out that >> for real-life situations, the ends of strings have more entropy than the >> beginnings, the odds are you'll discover that they're unequal quicker by >> starting at the end. > > From the rest of the thread, it looks like in most situations it won't > make much difference as typically very few characters need to be > compared if they are unequal. > > However, if you were in a situation with many strings which were almost > equal, the most general way to improve the situation might be store a > hash of the string along with the string, i.e. store (hash(x), x) and > then compare equality of this tuple. Almost all of the time, if the > strings are unequal the hash will be unequal. Or, as someone else > suggested, use interned versions of the strings. This is basically the > same solution but even better. In this case, your startup costs will be > higher (creating the strings) but your comparisons will always be instant. > Computing the hash always requires iterating over all characters in the string so is best case O(N) where string comparison is best case (and often average case) O(1). Also, so far as I know the hash value once computed is stored on the string object itself [1] and used for subsequent string comparisons so there's no need for you to do that in your code. Oscar [1] http://hg.python.org/cpython/file/71d94e79b0c3/Include/unicodeobject.h#l293
[toc] | [prev] | [next] | [standalone]
| From | Dan Goodman <dg.gmane@thesamovar.net> |
|---|---|
| Date | 2012-09-10 19:32 +0200 |
| Message-ID | <mailman.457.1347298387.27098.python-list@python.org> |
| In reply to | #28380 |
On 10/09/2012 18:33, Oscar Benjamin wrote: > Computing the hash always requires iterating over all characters in the string > so is best case O(N) where string comparison is best case (and often average > case) O(1). Yep, but you already have O(N) costs just creating the strings in the first place, so it's absorbed into that. It's only worth doing if you do many comparisons. > Also, so far as I know the hash value once computed is stored on the string > object itself [1] and used for subsequent string comparisons so there's no > need for you to do that in your code. Cool, Python is very clever as always! :) Dan
[toc] | [prev] | [next] | [standalone]
| From | Dan Goodman <dg.gmane@thesamovar.net> |
|---|---|
| Date | 2012-09-10 19:44 +0200 |
| Message-ID | <mailman.459.1347299097.27098.python-list@python.org> |
| In reply to | #28380 |
On 10/09/2012 18:07, Dan Goodman wrote: > On 04/09/2012 03:54, Roy Smith wrote: >> Let's assume you're testing two strings for equality. You've already >> done the obvious quick tests (i.e they're the same length), and you're >> down to the O(n) part of comparing every character. >> >> I'm wondering if it might be faster to start at the ends of the strings >> instead of at the beginning? If the strings are indeed equal, it's the >> same amount of work starting from either end. But, if it turns out that >> for real-life situations, the ends of strings have more entropy than the >> beginnings, the odds are you'll discover that they're unequal quicker by >> starting at the end. > > From the rest of the thread, it looks like in most situations it won't > make much difference as typically very few characters need to be > compared if they are unequal. > > However, if you were in a situation with many strings which were almost > equal, the most general way to improve the situation might be store a > hash of the string along with the string, i.e. store (hash(x), x) and > then compare equality of this tuple. Almost all of the time, if the > strings are unequal the hash will be unequal. Or, as someone else > suggested, use interned versions of the strings. This is basically the > same solution but even better. In this case, your startup costs will be > higher (creating the strings) but your comparisons will always be instant. Just had another thought about this. Although it's unlikely to be necessary in practice since (a) it's rarely necessary at all, and (b) when it is, hashing and optionally interning seems like the better approach, I had another idea that would be more general. Rather than starting from the beginning or the end, why not do something like: check the first and last character, then the len/2 character, then the len/4, then 3*len/4, then len/8, 3*len/8, etc. You'd need to be a bit clever about making sure you hit every character but I'm sure someone's already got an efficient algorithm for this. You could probably even make this cache efficient by working on cache line length blocks. Almost certainly entirely unnecessary, but I like the original question and it's a nice theoretical problem. Dan
[toc] | [prev] | [next] | [standalone]
Page 5 of 6 — ← Prev page 1 2 3 4 [5] 6 Next page →
Back to top | Article view | comp.lang.python
csiph-web