Path: csiph.com!usenet.pasdenom.info!news.redatomik.org!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <d6e7bb84-3810-435a-bd7a-ad45da86f906@googlegroups.com>
References: <2e3a3c01-20b3-4948-9b32-bd80ed46822b@googlegroups.com> <mailman.93.1428417164.12925.python-list@python.org> <d6e7bb84-3810-435a-bd7a-ad45da86f906@googlegroups.com>
From: Ian Kelly <ian.g.kelly@gmail.com>
Date: Tue, 7 Apr 2015 14:15:35 -0600
Subject: Re: Best search algorithm to find condition within a range
To: Python <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.115.1428437779.12925.python-list@python.org>
Lines: 102
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:88616

On Tue, Apr 7, 2015 at 9:40 AM,  <jonas.thornvall@gmail.com> wrote:
> Well of course you use same principles like a binary search setting min a=
nd max, closing in on the digit. In this case the searched numbers  > base^=
exp  and number< base^exp+1.
>
> But since the search is within large bases upto 32-bit space, so base 429=
4967295 is the biggest allowed. I need to find the nearest less exp in base=
 for each (lets call them pseudo digits). But as you see there it will take=
 time to add them up. So better doing a binary search, you know min-max hal=
f (iteration). You can do the same for a random oracle min max within range=
, and if the number of tries in general over 22 i think a random oracle do =
it better than a binary search.

Okay, so you're talking about doing a binary search but picking the
partition point randomly instead of evenly?

> It was a long time since i did this, but i do know there is a threshold w=
here searching min max with the oracle will be faster than the binary searc=
h.

I don't know where you got this idea from, but it's fiction. If you
split the space in half, then you are always reducing the amount of
work remaining by a factor of 2, and the total work is O(log n). If
you do an uneven split, then while you will sometimes make a good
partition and reduce the amount of work by more than a factor of two,
more often the target will be in the larger subspace and you will have
reduced the amount of work by less than a factor of two. In the worst
case of this, you only remove one element from the search space at a
time and end up with a linear search.

As a demonstration of this, here's a function which does a binary
search for a number in a range and returns the number of partitions it
had to perform to find the number:

>>> def binsearch(n, min, max):
...   if min =3D=3D max: return 0
...   med =3D (max + min) // 2
...   if n <=3D med:
...     return 1 + binsearch(n, min, med)
...   else:
...     return 1 + binsearch(n, med+1, max)
...

And here's the average number of partitions needed to find a number in
various ranges:

>>> sum(binsearch(i, 1, n) for i in range(1, n+1)) / n
16.68928
>>> n =3D 1000000
>>> sum(binsearch(i, 1, n) for i in range(1, n+1)) / n
19.951424
>>> n =3D 10000000
>>> sum(binsearch(i, 1, n) for i in range(1, n+1)) / n
23.3222784

Now let's try a variant function that splits the range at one third
instead of evenly:

>>> def trisearch(n, min, max):
...     if min =3D=3D max: return 0
...     tern =3D (min + min + max) // 3
...     if n <=3D tern:
...         return 1 + trisearch(n, min, tern)
...     else:
...         return 1 + trisearch(n, tern+1, max)
...
>>> n =3D 100000
>>> sum(trisearch(i, 1, n) for i in range(1, n+1)) / n
17.88387
>>> n =3D 1000000
>>> sum(trisearch(i, 1, n) for i in range(1, n+1)) / n
21.502276
>>> n =3D 10000000
>>> sum(trisearch(i, 1, n) for i in range(1, n+1)) / n
25.1157592

The ternary version invariably takes more partitions on average to
find the goal number. Now let's try a third variant that partitions
the range randomly:

>>> def randsearch(n, min, max):
...     if min =3D=3D max: return 0
...     part =3D randrange(min, max)
...     if n <=3D part:
...         return 1 + randsearch(n, min, part)
...     else:
...         return 1 + randsearch(n, part+1, max)
...
>>> n =3D 100000
>>> sum(randsearch(i, 1, n) for i in range(1, n+1)) / n
22.17863
>>> n =3D 1000000
>>> sum(randsearch(i, 1, n) for i in range(1, n+1)) / n
26.78371
>>> n =3D 10000000
>>> sum(randsearch(i, 1, n) for i in range(1, n+1)) / n
31.3916557

As you can see, this one does a lot worse on average, and at n=3D1e6
it's already performing worse than either of the previous versions did
at n=3D1e7.  I'm not even going to try this for n=3D1e8 because it would
take too long to run.