Groups > comp.lang.python > #28380 > unrolled thread

Comparing strings from the back?

Started by	Roy Smith <roy@panix.com>
First post	2012-09-03 21:54 -0400
Last post	2012-09-10 21:52 +0000
Articles	20 on this page of 101 — 21 participants

Back to article view | Back to comp.lang.python

  Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-03 21:54 -0400
    Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-04 12:07 +1000
    Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 02:17 +0000
      Re: Comparing strings from the back? Dan Sommers <dan@tombstonezero.net> - 2012-09-03 21:56 -0700
      Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-04 08:50 +0100
      Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-04 18:32 +0200
        Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 18:07 +0000
          Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:17 +0200
        Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-05 07:59 +1000
          Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:24 +0200
            Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 11:43 +0200
              Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-05 14:30 +0000
                Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 16:51 +0200
                  Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-05 16:24 +0000
                    Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-05 22:47 +0000
                      Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-06 08:33 +0000
                        Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 06:07 -0400
                          Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-07 04:06 +0000
                            Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-07 19:10 +0000
                              Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-08 00:55 +0000
                                Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-08 11:53 +0000
                                RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-13 18:39 +0000
                                Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 15:37 -0400
                                  Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 20:48 -0700
                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 00:46 -0400
                                      Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 21:54 -0700
                                        Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 01:38 -0400
                                          Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-13 23:06 -0700
                                            Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 04:03 -0400
                                              Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-14 01:20 -0700
                                                Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 04:53 -0400
                                                  Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-14 03:26 -0700
                                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 07:36 -0400
                                                Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-14 10:16 +0000
                                                  Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 07:43 -0400
                                                RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-14 22:43 +0000
                                                Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-14 23:10 -0400
                                                  Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-16 18:11 -0700
                                                    Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-17 14:05 +1000
                                                      Re: Comparing strings from the back? alex23 <wuwei23@gmail.com> - 2012-09-16 23:06 -0700
                                                Re: Comparing strings from the back? Ethan Furman <ethan@stoneleaf.us> - 2012-09-17 13:35 -0700
                                                  Re: Comparing strings from the back? Neil Hodgson <nhodgson@iinet.net.au> - 2012-09-18 09:14 +1000
                                                    Re: Comparing strings from the back? Ethan Furman <ethan@stoneleaf.us> - 2012-09-18 08:12 -0700
                                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 11:55 -0400
                                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 11:59 -0400
                                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 12:17 -0400
                                                      Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-19 00:47 +0000
                                                    Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-19 02:20 +1000
                                                    Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-18 16:40 -0400
                                                    Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-19 00:48 +0100
                                Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-13 20:53 +0100
                                Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 17:06 -0400
                                  Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-14 03:39 +0000
                                    Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-14 14:15 +1000
                                Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-13 22:17 +0100
                                Re: Comparing strings from the back? Dwight Hutto <dwightdhutto@gmail.com> - 2012-09-13 17:35 -0400
                                RE: Comparing strings from the back? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-09-14 21:32 +0000
                            Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-07 19:40 +0000
                            Re: Comparing strings from the back? Gelonida N <gelonida@gmail.com> - 2012-09-08 17:52 +0200
                              Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-10 08:59 +0000
                                Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-10 13:45 +0000
                                  Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:06 +0000
                                    Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-11 09:51 +0000
                                      Re: Comparing strings from the back? Terry Reedy <tjreedy@udel.edu> - 2012-09-11 11:55 -0400
                                  Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-11 00:26 +1000
                                  Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:32 +0000
                                  Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 14:43 +0000
                                  Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-11 00:56 +1000
                                  Re: Comparing strings from the back? Duncan Booth <duncan.booth@invalid.invalid> - 2012-09-11 09:41 +0000
                        Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-06 12:04 +0100
                          Re: Comparing strings from the back? Steve Howell <showell30@yahoo.com> - 2012-09-14 17:51 -0700
                        Re: Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-06 08:13 -0400
                          Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-06 22:29 +1000
                        Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 15:43 +0200
                          Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 10:23 -0400
                            Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:33 +0200
                            Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:42 +0200
                              Re: Comparing strings from the back? Dave Angel <d@davea.name> - 2012-09-06 11:54 -0400
                          Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 16:34 +0200
                        Re: Comparing strings from the back? Gelonida N <gelonida@gmail.com> - 2012-09-08 17:50 +0200
                    Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 15:37 +0200
                      Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-07 00:39 +1000
                        Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 17:36 +0200
                          Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-06 17:44 +0200
                        Re: Comparing strings from the back? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-07 04:07 +0000
    Re: Comparing strings from the back? Terry Reedy <tjreedy@udel.edu> - 2012-09-04 01:13 -0400
    Re: Comparing strings from the back? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-04 08:56 +0100
    Re: Comparing strings from the back? Alain Ketterlin <alain@dpt-info.u-strasbg.fr> - 2012-09-04 11:58 +0200
    Re: Comparing strings from the back? Neil Hodgson <nhodgson@iinet.net.au> - 2012-09-05 12:18 +1000
      Re: Comparing strings from the back? MRAB <python@mrabarnett.plus.com> - 2012-09-05 03:39 +0100
      Re: Comparing strings from the back? Roy Smith <roy@panix.com> - 2012-09-04 22:48 -0400
      Re: Comparing strings from the back? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-09-05 16:33 +0200
    Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 10:29 +0200
    Re: Comparing strings from the back? Chris Angelico <rosuav@gmail.com> - 2012-09-05 18:33 +1000
    Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 11:48 +0200
    Re: Comparing strings from the back? Peter Otten <__peter__@web.de> - 2012-09-05 17:45 +0200
    Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 18:07 +0200
    Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 16:33 +0000
    Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 19:32 +0200
    Re: Comparing strings from the back? Dan Goodman <dg.gmane@thesamovar.net> - 2012-09-10 19:44 +0200
    Re: Comparing strings from the back? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-09-10 21:52 +0000

Page 5 of 6 — ← Prev page 1 2 3 4 [5] 6 Next page →

#28591

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-09-06 15:37 +0200
Message-ID	<k2a8u4$hdt$1@news.albasani.net>
In reply to	#28519

On 05.09.2012 18:24, Steven D'Aprano wrote:
> On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote:
> [...]
>>> You are making unjustified assumptions about the distribution of
>>> letters in the words. This might be a list of long chemical compounds
>>> where the words typically differ only in their suffix. It might be a
>>> list of people with titles:
>>
>> Actually, I'm not. I'm stating exactly what assumptions I'm making to
>> get my calculation. I'm comparing *random* character strings or
>> bitstrings.
> 
> Excuse me, you are not. You are comparing English words which are highly 
> non-random.

Not in my original post. If you read it again, you will clearly see that
I was talking about purely random strings. And since you like to
nitpick, I'll clarify further: I'm talking about bitstrings in which
every bit of every character has the same probability of occurence, 50%.

You then replied by mentioning probability distributions of characters
in different languages/encodings, to which I tried giving a example as
it might (and does) occur in the real world, like sorting a dictionary.

But the original point is still valid: Sorting of randomized bitstrings
is definitely not O(N), you got that wrong.

>> You, on the other hand, are making vague assumptions which you do not
>> care for formalize and yet you claim that "the number of comparisons is
>> equally likely to be 1, 2, 3, ..., N. The average then is". Without any
>> explanation for this. At all.
> 
> I will accept that my explanation was not good enough, but I strongly 
> disagree that I gave no explanation at all.

What possible explanation could there be that comparison of purely
random bitstrings yields an equal amount of comparisons for its length?

>>> Herr Professor Frederick Schmidt
>>> Herr Professor Frederick Wagner
>>> ...
>>
>> Is your assumtion that we're comparing words that have the common prefix
>> "Herr Professor Frederick "? 
> 
> No, I am pointing out that *your* assumption that most string comparisons 
> will halt close to the beginning of the string is an invalid assumption. 
> Your assumption only holds for some non-random strings.

Definitely not. It *especially* holds true for random strings. For
random strings with an alphabet of size n, the probability that the
first character needs to be compared is n^0, i.e. 1. The probability
that the second character needs to be compared is n^(-1). For the xth
character, it is n^(-x). This is due to the lazy abort in comparison and
it results in the average number of comparisons being extremely short no
matter the bitlength n, or O(log n).

>> it's counting the number of character comparisons.
> 
> Correct. But only out of an extremely limited subset of all possible 
> string comparisons, namely those very few that happen when sorting lists 
> of English words using a very specific algorithm, namely timsort.

Yes, but this was to look at a real-world example (in which way more
comparisons are needed than in the random case). You first were talking
about randomized bitstrings (and so was I), then you brought up
character sets and languages (which, for the original problem, are
entirely irrelevant).

> Whatever figure you have calculated by taking this non-random selection, 
> it is irrelevant to the question of the average performance of string 
> equality given random strings.

Correct. I gave the estimate for random strings in my very first post.

> For the sake of simple calculations, let's pretend that there are only 
> 1000 five-letter strings possible. We want to know how many character-
> comparisons are done on average when testing two random five-letter 
> strings for equality. To calculate this average, you must compare every 
> string to every other string, *including* itself.
> 
> This gives 1000*1000 = one million equality tests to be performed. For 
> each equality test, we count the number of character comparisons 
> performed, sum all those counts, and divide by 1e6. That is the average 
> number of char comparisons for random strings of five letters.

Easy enough. Since you didn't look at my original equation, here are
some numbers and the program attached:

64 combinations for length 6
Char comparisons: 8064
Comparison cnt  : 4096
Avg comparison  : 1.97

128 combinations for length 7
Char comparisons: 32512
Comparison cnt  : 16384
Avg comparison  : 1.98

256 combinations for length 8
Char comparisons: 130560
Comparison cnt  : 65536
Avg comparison  : 1.99

512 combinations for length 9
Char comparisons: 523264
Comparison cnt  : 262144
Avg comparison  : 2.00

1024 combinations for length 10
Char comparisons: 2095104
Comparison cnt  : 1048576
Avg comparison  : 2.00

2048 combinations for length 11
Char comparisons: 8384512
Comparison cnt  : 4194304
Avg comparison  : 2.00

4096 combinations for length 12
Char comparisons: 33546240
Comparison cnt  : 16777216
Avg comparison  : 2.00

Hopefully now you'll see that your assumption O(n) is dead wrong.

> But that's not what you do. First you eliminate 900 out of the 1000 
> possible strings by only sampling English words -- strings like "xxoij" 
> cannot possibly be selected. So you are left with the highly non-random 
> subset of 10000 equality tests.

This is *exactly* what my original calculation did. Enumerate all
possibilites and divide by the total number.

Regards,
Johannes

#!/usr/bin/python3
import itertools
import random

alphabet = [ c for c in "01" ]

def cmpstr(s1, s2):
	c = 0
	for i in range(len(s1)):
		c += 1
		if s1[i] != s2[i]:
			break
	return c


for strlength in range(1, 16):
	combinations = list(itertools.product(*([ alphabet ] * strlength)))

	assert(len(combinations) == len(alphabet) ** strlength)
	print("%d combinations for length %d" % (len(combinations), strlength))

	compcnt = 0
	comparisons = 0
	for c1 in combinations:
		for c2 in combinations:
			comparisons += cmpstr(c1, c2)		
			compcnt += 1
	print("Char comparisons: %d" % (comparisons))
	print("Comparison cnt  : %d" % (compcnt))
	print("Avg comparison  : %.2f" % (comparisons / compcnt))

-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

[toc] | [prev] | [next] | [standalone]

#28604

From	Chris Angelico <rosuav@gmail.com>
Date	2012-09-07 00:39 +1000
Message-ID	<mailman.306.1346942376.27098.python-list@python.org>
In reply to	#28591

On Thu, Sep 6, 2012 at 11:37 PM, Johannes Bauer <dfnsonfsduifb@gmx.de> wrote:
> Not in my original post. If you read it again, you will clearly see that
> I was talking about purely random strings. And since you like to
> nitpick, I'll clarify further: I'm talking about bitstrings in which
> every bit of every character has the same probability of occurence, 50%.

That sort of string isn't a normal thing to be comparing, though.

Here's an idea. Someone who's doing a lot of arguing in this thread
should take Python, find the string comparison routine, and hack in
some statistics-gathering. Then run *real code* on it. Maybe use this
with one of those web frameworks and run your web site on it for an
hour or two, or fire off some real scripts you really use. Then dump
out the stats at the end. My guess: The bulk of string comparisons
that get to actually comparing byte-for-byte will end up returning
True. Most of the false comparisons will be proven earlier; if I
understand correctly, Python will check for identity (easy true) and
different lengths (easy false). But my guess could turn out to be flat
wrong. In any case, it'll be far FAR more useful than arguing from
totally random, or random word selection, or anything.

Who's game?

ChrisA

[toc] | [prev] | [next] | [standalone]

#28611

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-09-06 17:36 +0200
Message-ID	<k2aftn$qu$1@news.albasani.net>
In reply to	#28604

On 06.09.2012 16:39, Chris Angelico wrote:

> In any case, it'll be far FAR more useful than arguing from
> totally random, or random word selection, or anything.
> 
> Who's game?

Okay, patched against Python 3.2.3: http://pastebin.com/PRRN53P6

To invoke display of the stats, compare the string "pleasedumpstats" as
LHO, i.e.:

"pleasedumpstats" < ""

Just ran it against a big script of mine which takes the stringified
IMDb text files, parses it and dumps it into a sqlite3 database.
Surprisingly little string comparisons however (the sqlite module
without doubt uses its own routines internally). Although the database
in the end has about 2 million records, these were the stats:

strCmpEq 1017
strCmpLt 2802
strCmpGt 1633
strCmpTc 16219
strCmpCc 8570

which means 5452 comparisons of which 19% were equal and the rest inequal.

Average string length is about 2.97 characters and aborted was in
average after 1.57 characters.

Maybe someone has a test which makes heavier use of string comparison. I
don't want to make up something, however, since this is (by definition)
going to be artificial.

Best regards,
Johannes

-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

[toc] | [prev] | [next] | [standalone]

#28613

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-09-06 17:44 +0200
Message-ID	<k2agbm$1jg$1@news.albasani.net>
In reply to	#28611

On 06.09.2012 17:36, Johannes Bauer wrote:
> "pleasedumpstats" < ""

And against a XML-reading C code generator that uses mako:

strCmpEq 39670
strCmpLt 2766215
strCmpGt 2744002
strCmpTc 37430821
strCmpCc 14048656

Compared strings: 5549887
Equal: 0.7%
Average wordlength: 6.74 chars
Average comparelength: 2.53 chars

Best regards,
Johannes

-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

[toc] | [prev] | [next] | [standalone]

#28677

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-07 04:07 +0000
Message-ID	<504972fb$0$29981$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28604

On Fri, 07 Sep 2012 00:39:33 +1000, Chris Angelico wrote:

> On Thu, Sep 6, 2012 at 11:37 PM, Johannes Bauer <dfnsonfsduifb@gmx.de>
> wrote:
>> Not in my original post. If you read it again, you will clearly see
>> that I was talking about purely random strings. And since you like to
>> nitpick, I'll clarify further: I'm talking about bitstrings in which
>> every bit of every character has the same probability of occurence,
>> 50%.
> 
> That sort of string isn't a normal thing to be comparing, though.
> 
> Here's an idea. Someone who's doing a lot of arguing in this thread
> should take Python, find the string comparison routine, and hack in some
> statistics-gathering. Then run *real code* on it. 

Where's the fun in that?

:-P



-- 
Steven

[toc] | [prev] | [next] | [standalone]

#28388

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-09-04 01:13 -0400
Message-ID	<mailman.168.1346735624.27098.python-list@python.org>
In reply to	#28380

On 9/3/2012 9:54 PM, Roy Smith wrote:
> There's been a bunch of threads lately about string implementations, and
> that got me thinking (which is often a dangerous thing).
>
> Let's assume you're testing two strings for equality.  You've already
> done the obvious quick tests (i.e they're the same length), and you're
> down to the O(n) part of comparing every character.
>
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning?

I posted the 3.3 str (in)equality compare a few days ago. It first 
compares length and then kind, then the actual bytes in whatever chunks. 
If the latter uses the C/system mem compare, that is faster than 
anything coded in Python or even C.

-- 
Terry Jan Reedy

[toc] | [prev] | [next] | [standalone]

#28395

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-09-04 08:56 +0100
Message-ID	<mailman.172.1346745307.27098.python-list@python.org>
In reply to	#28380

On 04/09/2012 02:54, Roy Smith wrote:
> There's been a bunch of threads lately about string implementations, and
> that got me thinking (which is often a dangerous thing).
>
> Let's assume you're testing two strings for equality.  You've already
> done the obvious quick tests (i.e they're the same length), and you're
> down to the O(n) part of comparing every character.
>
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning?  If the strings are indeed equal, it's the
> same amount of work starting from either end.  But, if it turns out that
> for real-life situations, the ends of strings have more entropy than the
> beginnings, the odds are you'll discover that they're unequal quicker by
> starting at the end.
>
> It doesn't seem un-plausible that this is the case.  For example, most
> of the filenames I work with begin with "/home/roy/".  Most of the
> strings I use as memcache keys have one of a small number of prefixes.
> Most of the strings I use as logger names have common leading
> substrings.  Things like credit card and telephone numbers tend to have
> much more entropy in the trailing digits.
>
> On the other hand, hostnames (and thus email addresses) exhibit the
> opposite pattern.
>
> Anyway, it's just a thought.  Has anybody studied this for real-life
> usage patterns?
>
> I'm also not sure how this work with all the possible UCS/UTF encodings.
> With some of them, you may get the encoding semantics wrong if you don't
> start from the front.
>

Good grief, what timing!!!  The dust has yet to settle over the 
"Flexible string representation, unicode, typography, ..." thread and 
you ask this.  I'll just cross my fingers and hope that people stick 
with facts, realistic benchmarks etc, and not good old fashioned FUD.

-- 
Cheers.

Mark Lawrence.

[toc] | [prev] | [next] | [standalone]

#28397

From	Alain Ketterlin <alain@dpt-info.u-strasbg.fr>
Date	2012-09-04 11:58 +0200
Message-ID	<87ipbuyurz.fsf@dpt-info.u-strasbg.fr>
In reply to	#28380

Roy Smith <roy@panix.com> writes:

> There's been a bunch of threads lately about string implementations, and 
> that got me thinking (which is often a dangerous thing).
>
> Let's assume you're testing two strings for equality.  You've already 
> done the obvious quick tests (i.e they're the same length), and you're 
> down to the O(n) part of comparing every character.
>
> I'm wondering if it might be faster to start at the ends of the strings 
> instead of at the beginning?  If the strings are indeed equal, it's the 
> same amount of work starting from either end.  But, if it turns out that 
> for real-life situations, the ends of strings have more entropy than the 
> beginnings, the odds are you'll discover that they're unequal quicker by 
> starting at the end.

I guess we all have examples with specific entropy distribution. Going
backwards on long strings can be profitable with file paths. If that is
the case, and if you spend a lot of time comparing strings, why not
store them in reverse?

> It doesn't seem un-plausible that this is the case.  For example, most 
> of the filenames I work with begin with "/home/roy/".  Most of the 
> strings I use as memcache keys have one of a small number of prefixes.  
> Most of the strings I use as logger names have common leading 
> substrings.

In that case I would split the paths, and use some sort of string-id, or
use intern() and then compare components with "is" instead of "==".
(Basically, turning the string of chars into a strings of
ids/ints/pointers.)

> Things like credit card and telephone numbers tend to have much more
> entropy in the trailing digits. On the other hand, hostnames (and thus
> email addresses) exhibit the opposite pattern.

Yes, real-world is a mess.

> Anyway, it's just a thought.  Has anybody studied this for real-life 
> usage patterns?

I've tried the "intern()" trick several times with lists of strings, and
was satisfied, but this was in specific cases (with a finite set of
strings). For "short" strings of chars (up to, say a hundred of
characters), I've never found anything significantly faster than the
default sweep (that was in C/C++, not python). For longer and/or more
structured strings/lists, making use of the structure is a good idea,
but I see no general pattern here (as you note).

> I'm also not sure how this work with all the possible UCS/UTF encodings.  
> With some of them, you may get the encoding semantics wrong if you don't 
> start from the front.

If you're testing for equality, I can't see how this could matter, even
with variable-length encodings.

If you're comparing different encodings, then you need different access
methods to random characters (but this doesn't affect the algorithm). If
you're using variable-length encoding, e.g., UTF-8, accessing at a
specific position is not possible.

-- Alain.

[toc] | [prev] | [next] | [standalone]

#28440

From	Neil Hodgson <nhodgson@iinet.net.au>
Date	2012-09-05 12:18 +1000
Message-ID	<-9CdnaqjTK6NKtvNnZ2dnUVZ_gednZ2d@westnet.com.au>
In reply to	#28380

Roy Smith:

> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning?  If the strings are indeed equal, it's the
> same amount of work starting from either end.

    Most people write loops that go forwards. This leads to the 
processor designers prioritizing detection mechanisms like cache 
prefetching for that case.

    However, its not always the situation: a couple of years ago Intel 
contributed a memcpy implementation to glibc that went backwards for a 
performance improvement. An explanation from a related patch involves 
speculative and committed operations and address bits on some processors 
and quickly lost me:
paragraph 3 of
http://lists.freedesktop.org/archives/pixman/2010-August/000423.html

    The memcpy patch was controversial as it broke Adobe Flash which 
assumed memcpy was safe like memmove.

    Neil

[toc] | [prev] | [next] | [standalone]

#28441

From	MRAB <python@mrabarnett.plus.com>
Date	2012-09-05 03:39 +0100
Message-ID	<mailman.204.1346812744.27098.python-list@python.org>
In reply to	#28440

On 05/09/2012 03:18, Neil Hodgson wrote:
> Roy Smith:
>
>> I'm wondering if it might be faster to start at the ends of the strings
>> instead of at the beginning?  If the strings are indeed equal, it's the
>> same amount of work starting from either end.
>
>      Most people write loops that go forwards. This leads to the
> processor designers prioritizing detection mechanisms like cache
> prefetching for that case.
>
>      However, its not always the situation: a couple of years ago Intel
> contributed a memcpy implementation to glibc that went backwards for a
> performance improvement. An explanation from a related patch involves
> speculative and committed operations and address bits on some processors
> and quickly lost me:
> paragraph 3 of
> http://lists.freedesktop.org/archives/pixman/2010-August/000423.html
>
>      The memcpy patch was controversial as it broke Adobe Flash which
> assumed memcpy was safe like memmove.
>
I don't think I've ever use memcpy, only memmove.

[toc] | [prev] | [next] | [standalone]

#28442

From	Roy Smith <roy@panix.com>
Date	2012-09-04 22:48 -0400
Message-ID	<roy-920352.22482004092012@news.panix.com>
In reply to	#28440

In article <-9CdnaqjTK6NKtvNnZ2dnUVZ_gednZ2d@westnet.com.au>,
 Neil Hodgson <nhodgson@iinet.net.au> wrote:

> The memcpy patch was controversial as it broke Adobe Flash 

An added benefit!

[toc] | [prev] | [next] | [standalone]

#28501

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-09-05 16:33 +0200
Message-ID	<k27nr0$o42$1@news.albasani.net>
In reply to	#28440

On 05.09.2012 04:18, Neil Hodgson wrote:

>    The memcpy patch was controversial as it broke Adobe Flash which
> assumed memcpy was safe like memmove.

Adobe Flash was broken before, making an assumption that is not
guaranteed by the standard. The patch only showed the problem.

Regards,
Johannes

-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

[toc] | [prev] | [next] | [standalone]

#28465

From	Peter Otten <__peter__@web.de>
Date	2012-09-05 10:29 +0200
Message-ID	<mailman.221.1346833766.27098.python-list@python.org>
In reply to	#28380

Roy Smith wrote:

> There's been a bunch of threads lately about string implementations, and
> that got me thinking (which is often a dangerous thing).
> 
> Let's assume you're testing two strings for equality.  You've already
> done the obvious quick tests (i.e they're the same length), and you're
> down to the O(n) part of comparing every character.
> 
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning?  If the strings are indeed equal, it's the
> same amount of work starting from either end.  But, if it turns out that
> for real-life situations, the ends of strings have more entropy than the
> beginnings, the odds are you'll discover that they're unequal quicker by
> starting at the end.
> 
> It doesn't seem un-plausible that this is the case.  For example, most
> of the filenames I work with begin with "/home/roy/".  Most of the
> strings I use as memcache keys have one of a small number of prefixes.
> Most of the strings I use as logger names have common leading
> substrings.  Things like credit card and telephone numbers tend to have
> much more entropy in the trailing digits.

But do you actually compare them with each other? Even if you do I'd guess 
that Python looks at the hash values most of the time.
 
> On the other hand, hostnames (and thus email addresses) exhibit the
> opposite pattern.

Nor do english words:

$ python3 reverse_compare2.py compare /usr/share/dict/words --word-len 8

comparing every pair in a sample of 1000 8-char words
taken from '/usr/share/dict/words'

head
1: 477222 ************************************************************
2:  18870 **
3:   2870 
4:    435 
5:     74 
6:     17 
7:     12 

tail
1: 386633 ************************************************************
2:  74966 ***********
3:  29698 ****
4:   6536 *
5:   1475 
6:    154 
7:     28 
8:     10 

> Anyway, it's just a thought.  Has anybody studied this for real-life
> usage patterns?
> 
> I'm also not sure how this work with all the possible UCS/UTF encodings.
> With some of them, you may get the encoding semantics wrong if you don't
> start from the front.

[toc] | [prev] | [next] | [standalone]

#28466

From	Chris Angelico <rosuav@gmail.com>
Date	2012-09-05 18:33 +1000
Message-ID	<mailman.222.1346833988.27098.python-list@python.org>
In reply to	#28380

On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten <__peter__@web.de> wrote:
> comparing every pair in a sample of 1000 8-char words
> taken from '/usr/share/dict/words'
>
> head
> 1: 477222 ************************************************************
> 2:  18870 **
> ...

Not understanding this. What are the statistics, and what (if it's not
obvious from the previous answer) do they prove?

ChrisA

[toc] | [prev] | [next] | [standalone]

#28475

From	Peter Otten <__peter__@web.de>
Date	2012-09-05 11:48 +0200
Message-ID	<mailman.228.1346838532.27098.python-list@python.org>
In reply to	#28380

Chris Angelico wrote:

> On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten <__peter__@web.de> wrote:
>> comparing every pair in a sample of 1000 8-char words
>> taken from '/usr/share/dict/words'
>>
>> head
>> 1: 477222 ************************************************************
>> 2:  18870 **
>> ...

tail
1: 386633 ************************************************************
2:  74966 ***********

 
> Not understanding this. What are the statistics, 

I measured how many chars they have in common for all combinations of 1000 
words taken from /usr/share/dict/words.

477222 pairs have one char in common, 18870 pairs have two chars in common 
when compared from the *beginning* of the string.

386633 pairs have one char in common, 74966 pairs have two chars in common 
when compared from the *end* of the string.

and what (if it's not obvious from the previous answer) do they prove?

They demonstrate that for two words from that particular corpus it is likely 
that a[::-1] == b[::-1] has to take more (1.34 versus 1.05 on average) 
characters into account than a == b, i. e. comparing from the back should be 
slower rather than faster.

If that doesn't help, here's the code ;)

import random
import itertools
from contextlib import contextmanager
from collections import defaultdict

@contextmanager
def open_corpus(args):
    with open(args.name) as instream:
        words = (line.strip() for line in instream)
        #words = (word for word in words if not word.endswith("'s"))
        yield words

def pick(items, count):
    items = list(items)
    return random.sample(items, count)

def count_common(a, b):
    i = 0
    for i, (x, y) in enumerate(zip(a, b), 1):
        if x != y:
            break
    return i

def corpus(args):
    with open_corpus(args) as words:
        wordlens = (len(line.strip()) for line in words)
        freq = defaultdict(int)
        for wordlen in wordlens:
            freq[wordlen] += 1
    show_histogram(freq)

def show_histogram(freq):
    peak = max(freq.values())
    freq_width = len(str(peak))
    max_freq = max(freq)
    len_width = len(str(max_freq))
    for i in range(min(freq), max_freq+1):
        value = freq[i]
        print("{:{}}: {:{}} {}".format(
                i, len_width, value,
                freq_width, "*" * (60 * value // peak)))

def compare(args):
    with open_corpus(args) as words:
        words = pick(
            (word for word in words if len(word) == args.word_len),
            args.sample_size)
    n = 0
    head_common = defaultdict(int)
    tail_common = defaultdict(int)
    n_tail = n_head = 0
    for n, (a, b) in enumerate(itertools.combinations(words, 2), 1):
        cc = count_common(a, b)
        n_head += cc
        head_common[cc] += 1
        ccr = count_common(a[::-1], b[::-1])
        n_tail += ccr
        tail_common[ccr] += 1

    print(n, "combinations")
    print("average common chars (head) {:.3}".format(n_head/n))
    print("average common chars (tail) {:.3}".format(n_tail/n))

    print("comparing every pair in a "
          "sample of {sample} {len}-char words\n"
          "taken from {name!r}".format(
            sample=args.sample_size,
            len=args.word_len,
            name=args.name))
    print()
    print("head")
    show_histogram(head_common)
    print()
    print("tail")
    show_histogram(tail_common)

def main():
    import argparse
    parser = argparse.ArgumentParser()
    parsers = parser.add_subparsers()

    pcorpus = parsers.add_parser("corpus")
    pcorpus.add_argument("name")
    pcorpus.set_defaults(func=corpus)

    pcompare = parsers.add_parser("compare")
    pcompare.add_argument("name")
    pcompare.add_argument("--sample-size", type=int, default=1000)
    pcompare.add_argument("--word-len", type=int, default=8)
    pcompare.set_defaults(func=compare)
    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()

[toc] | [prev] | [next] | [standalone]

#28510

From	Peter Otten <__peter__@web.de>
Date	2012-09-05 17:45 +0200
Message-ID	<mailman.247.1346859964.27098.python-list@python.org>
In reply to	#28380

Oscar Benjamin wrote:

> On 5 September 2012 10:48, Peter Otten <__peter__@web.de> wrote:

>> def count_common(a, b):

[sorry for seriously broken implementation]

> This function will return 1 if the first character differs. It does not
> count the number of common characters but rather the more relevant
> quantity which is the number of comparisons required to decide if the
> strings are equal.

I remember stumbling over the non-zero frequency for len(word) but somehow 
plodded on with beautifying the broken results :(

def count_common(a, b):
    """
    >>> count_common("alpha", "beta")
    0
    >>> count_common("", "")
    0
    >>> count_common("alpha", "argument")
    1
    >>> count_common("alpha", "alpha")
    5
    >>> count_common("alpha", "alphx")
    4
    """
    i = 0
    for x, y in zip(a, b):
        if x != y:
            break
        i += 1
    return i


$ python3 reverse_compare2.py compare /usr/share/dict/words
499500 combinations
average common chars (head) 0.0535
average common chars (tail) 0.322
comparing every pair in a sample of 1000 8-char words
taken from '/usr/share/dict/words'

head
0: 477371 ************************************************************
1:  18310 **
2:   3191 
3:    523 
4:     79 
5:     15 
6:     10 
7:      1 

tail
0: 385069 ************************************************************
1:  78742 ************
2:  26734 ****
3:   7462 *
4:   1297 
5:    168 
6:     22 
7:      6

[toc] | [prev] | [next] | [standalone]

#28828

From	Dan Goodman <dg.gmane@thesamovar.net>
Date	2012-09-10 18:07 +0200
Message-ID	<mailman.450.1347293273.27098.python-list@python.org>
In reply to	#28380

On 04/09/2012 03:54, Roy Smith wrote:
> Let's assume you're testing two strings for equality.  You've already
> done the obvious quick tests (i.e they're the same length), and you're
> down to the O(n) part of comparing every character.
>
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning?  If the strings are indeed equal, it's the
> same amount of work starting from either end.  But, if it turns out that
> for real-life situations, the ends of strings have more entropy than the
> beginnings, the odds are you'll discover that they're unequal quicker by
> starting at the end.

 From the rest of the thread, it looks like in most situations it won't 
make much difference as typically very few characters need to be 
compared if they are unequal.

However, if you were in a situation with many strings which were almost 
equal, the most general way to improve the situation might be store a 
hash of the string along with the string, i.e. store (hash(x), x) and 
then compare equality of this tuple. Almost all of the time, if the 
strings are unequal the hash will be unequal. Or, as someone else 
suggested, use interned versions of the strings. This is basically the 
same solution but even better. In this case, your startup costs will be 
higher (creating the strings) but your comparisons will always be instant.

Dan

[toc] | [prev] | [next] | [standalone]

#28833

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2012-09-10 16:33 +0000
Message-ID	<mailman.455.1347294841.27098.python-list@python.org>
In reply to	#28380

On 2012-09-10, Dan Goodman <dg.gmane@thesamovar.net> wrote:
> On 04/09/2012 03:54, Roy Smith wrote:
>> Let's assume you're testing two strings for equality.  You've already
>> done the obvious quick tests (i.e they're the same length), and you're
>> down to the O(n) part of comparing every character.
>>
>> I'm wondering if it might be faster to start at the ends of the strings
>> instead of at the beginning?  If the strings are indeed equal, it's the
>> same amount of work starting from either end.  But, if it turns out that
>> for real-life situations, the ends of strings have more entropy than the
>> beginnings, the odds are you'll discover that they're unequal quicker by
>> starting at the end.
>
>  From the rest of the thread, it looks like in most situations it won't 
> make much difference as typically very few characters need to be 
> compared if they are unequal.
>
> However, if you were in a situation with many strings which were almost 
> equal, the most general way to improve the situation might be store a 
> hash of the string along with the string, i.e. store (hash(x), x) and 
> then compare equality of this tuple. Almost all of the time, if the 
> strings are unequal the hash will be unequal. Or, as someone else 
> suggested, use interned versions of the strings. This is basically the 
> same solution but even better. In this case, your startup costs will be 
> higher (creating the strings) but your comparisons will always be instant.
>

Computing the hash always requires iterating over all characters in the string
so is best case O(N) where string comparison is best case (and often average
case) O(1).

Also, so far as I know the hash value once computed is stored on the string
object itself [1] and used for subsequent string comparisons so there's no
need for you to do that in your code.

Oscar

[1] http://hg.python.org/cpython/file/71d94e79b0c3/Include/unicodeobject.h#l293

[toc] | [prev] | [next] | [standalone]

#28837

From	Dan Goodman <dg.gmane@thesamovar.net>
Date	2012-09-10 19:32 +0200
Message-ID	<mailman.457.1347298387.27098.python-list@python.org>
In reply to	#28380

On 10/09/2012 18:33, Oscar Benjamin wrote:
> Computing the hash always requires iterating over all characters in the string
> so is best case O(N) where string comparison is best case (and often average
> case) O(1).

Yep, but you already have O(N) costs just creating the strings in the 
first place, so it's absorbed into that. It's only worth doing if you do 
many comparisons.

> Also, so far as I know the hash value once computed is stored on the string
> object itself [1] and used for subsequent string comparisons so there's no
> need for you to do that in your code.

Cool, Python is very clever as always! :)

Dan

[toc] | [prev] | [next] | [standalone]

#28839

From	Dan Goodman <dg.gmane@thesamovar.net>
Date	2012-09-10 19:44 +0200
Message-ID	<mailman.459.1347299097.27098.python-list@python.org>
In reply to	#28380

On 10/09/2012 18:07, Dan Goodman wrote:
> On 04/09/2012 03:54, Roy Smith wrote:
>> Let's assume you're testing two strings for equality.  You've already
>> done the obvious quick tests (i.e they're the same length), and you're
>> down to the O(n) part of comparing every character.
>>
>> I'm wondering if it might be faster to start at the ends of the strings
>> instead of at the beginning?  If the strings are indeed equal, it's the
>> same amount of work starting from either end.  But, if it turns out that
>> for real-life situations, the ends of strings have more entropy than the
>> beginnings, the odds are you'll discover that they're unequal quicker by
>> starting at the end.
>
>  From the rest of the thread, it looks like in most situations it won't
> make much difference as typically very few characters need to be
> compared if they are unequal.
>
> However, if you were in a situation with many strings which were almost
> equal, the most general way to improve the situation might be store a
> hash of the string along with the string, i.e. store (hash(x), x) and
> then compare equality of this tuple. Almost all of the time, if the
> strings are unequal the hash will be unequal. Or, as someone else
> suggested, use interned versions of the strings. This is basically the
> same solution but even better. In this case, your startup costs will be
> higher (creating the strings) but your comparisons will always be instant.

Just had another thought about this. Although it's unlikely to be 
necessary in practice since (a) it's rarely necessary at all, and (b) 
when it is, hashing and optionally interning seems like the better 
approach, I had another idea that would be more general. Rather than 
starting from the beginning or the end, why not do something like: check 
the first and last character, then the len/2 character, then the len/4, 
then 3*len/4, then len/8, 3*len/8, etc. You'd need to be a bit clever 
about making sure you hit every character but I'm sure someone's already 
got an efficient algorithm for this. You could probably even make this 
cache efficient by working on cache line length blocks. Almost certainly 
entirely unnecessary, but I like the original question and it's a nice 
theoretical problem.

Dan

[toc] | [prev] | [next] | [standalone]

Page 5 of 6 — ← Prev page 1 2 3 4 [5] 6 Next page →

csiph-web

Comparing strings from the back?

Contents

#28591

#28604

#28611

#28613

#28677

#28388

#28395

#28397

#28440

#28441

#28442

#28501

#28465

#28466

#28475

#28510

#28828

#28833

#28837

#28839