Path: csiph.com!usenet.pasdenom.info!aioe.org!news.stack.nl!newsfeed.xs4all.nl!newsfeed6.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Date: Thu, 06 Sep 2012 06:07:38 -0400
From: Dave Angel <d@davea.name>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:14.0) Gecko/20120714 Thunderbird/14.0
MIME-Version: 1.0
To: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Subject: Re: Comparing strings from the back?
References: <roy-B37C28.21540103092012@news.panix.com> <504564ba$0$29978$c3e8da3$5496439d@news.astraweb.com> <k25afd$7vj$1@news.albasani.net> <mailman.194.1346795987.27098.python-list@python.org> <k275on$i1l$1@news.albasani.net> <k276r0$k5k$1@news.albasani.net> <504761ef$0$29981$c3e8da3$5496439d@news.astraweb.com> <k27osg$q7t$1@news.albasani.net> <50477cbb$0$29981$c3e8da3$5496439d@news.astraweb.com> <mailman.272.1346885254.27098.python-list@python.org> <50485fca$0$29977$c3e8da3$5496439d@news.astraweb.com>
In-Reply-To: <50485fca$0$29977$c3e8da3$5496439d@news.astraweb.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: python-list@python.org
Precedence: list
Reply-To: d@davea.name
Newsgroups: comp.lang.python
Message-ID: <mailman.287.1346926087.27098.python-list@python.org>
Lines: 42
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:28569

On 09/06/2012 04:33 AM, Steven D'Aprano wrote:
> <snip>
>
> I may have been overly-conservative earlier when I said that on
> average string equality has to compare half the characters. I thought
> I had remembered that from a computer science textbook, but I can't
> find that reference now, so possibly I was thinking of something else.
> (String searching perhaps?). In any case, the *worst* case for string
> equality testing is certainly O(N) (every character must be looked
> at), and the *best* case is O(1) obviously (the first character fails
> to match). But I'm not so sure about the average case. Further thought
> is required. 

For random strings (as defined below), the average compare time is
effectively unrelated to the size of the string, once the size passes
some point.

Define random string as being a selection from a set of characters, with
replacement.  So if we pick some set of characters, say 10 (or 256, it
doesn't really matter), the number of possible strings is 10**N.

The likelihood of not finding a mismatch within k characters is  
(1/10)**k   The likelihood of actually reaching the end of the random
string is (1/10)**N.  (which is the reciprocal of the number of strings,
naturally)

If we wanted an average number of comparisons, we'd have to calculate a
series, where each term is a probability times a value for k.
   sum((k * 9*10**-k) for k in range(1, 10))


Those terms very rapidly approach 0, so it's safe to stop after a few. 
Looking at the first 9 items, I see a value of 1.1111111

This may not be quite right, but the value is certainly well under 2 for
a population of 10 characters, chosen randomly.  And notice that N
doesn't really come into it. 

-- 

DaveA