Re: Good hash for pointers

From	Tim Rentsch <tr.17687@z991.linuxsc.com>
Newsgroups	comp.lang.c
Subject	Re: Good hash for pointers
Date	2024-06-16 04:38 -0700
Organization	A noiseless patient Spider
Message-ID	<86cyohktgt.fsf@linuxsc.com> (permalink)
References	(9 earlier) <v2voe7$3fr50$1@raubtier-asyl.eternal-september.org> <86ed9ofq14.fsf@linuxsc.com> <20240605005916.00001b33@yahoo.com> <86a5jzmle1.fsf@linuxsc.com> <20240606133534.00005c5d@yahoo.com>

Show all headers | View raw

Michael S <already5chosen@yahoo.com> writes:

> On Wed, 05 Jun 2024 08:58:46 -0700
> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>
>> If we're trying to write a general hash function, it should work
>> well across a wide range of input key sets, and also for different
>> kinds of hash table management.  Let's look at the output side
>> first.  There are two independent axes for hash tables:  table
>> size, and direct mapping versus using rehashing.  For table size,
>> the two prominent choices are some prime in the appropriate range,
>> or else a power of two.  Clearly any size could be chosen, but
>> primes have the nice property that they divide up the space of hash
>> values very evenly, whereas powers of two allow a cheaper way of
>> reducing a hash value to the range of table indices (anding versus a
>> modulo operation).  A good hash function should work well with both.
>> The choice of direct mapping versus rehashing depends on expected
>> occupancy rates (which I am used to calling "load factor").  For a
>> load factor of 75 to 80 percent or less, generally rehashing is
>> better.  Direct mapping has the advantage that it can handle load
>> factors above 100%, at a cost of dealing with whatever subsidiary
>> data structure (usually linked lists) is used when two or more keys
>> map to the same index in the hash table.  (I expect you know all
>> that, I wrote it mostly just to clarify my own thinking.)
>
> I never had interest in implementation of hash methods, typically
> just took what language or library provides and used it.
> All my knowledge of internals are 50 y.o.  news (TAoCP, and no I
> did not read it 50 years ago, but I did read it first time
> slightly less than 40 years ago and not sure if I re-read this
> particular topic since then).
> So, I am not sure that I understood everything above.
> In particular, w.r.t. rehashing, IIRC, Knuth came to conclusion
> that simple circular increment of the index is superior to more
> elaborate methods.  I don't know if conclusion was changed in the
> following years, not even sure that I recollect it correctly.

Hashing is discussed in volume 3 of TAOCP.  I first read volume 3
just after it came out;  I reviewed the section on hashing again
after reading your message.  Your remembrance doesn't match what
is said in that section.  Moreover some of the conclusions Knuth
wrote in 1973 are certainly out of date now.  General rehashing,
also called open addressing, is overall a net win, especially for
hash tables that have an above-average load factor.  Also there
are variations such as cuckoo hash that depend on using open
addressing.

> Despite (or due to?) my relative ignorance of the topic, I'd dare
> to disagree with couple of your points:
> 1. I don't think that hash function should be evaluated in
> isolation from the rest of algorithm.  IMHO, they should be
> co-developed.  There is nothing wrong with hash function that
> works well with one particular "back end" and poorly with all
> others as long as limitations are clearly stated.

There are several reasons why a co-development approach usually
isn't a good way to go.  Malcolm started this thread by asking for
a good hash function for pointers (to be used in a portable ANSI C
setting).  We don't know, nor do we have any control over, how
Malcolm might use the function (and that goes double for any random
reader of comp.lang.c who might see some posted hash function and
decide to use it).  Or we might want to use some sort of generic
table manager that takes a pointer-to-function to do the hashing,
but no way to affect how the hash values are used.  Another reason
has to do with total code size.  If there are M data types to be
hashed, and N applications that use hash values, coupling the code
that does the hashing with the code that makes use of the hash
values means there will be M*N parts to program, whereas separating
the two aspects means there will be only M+N parts to program.  The
M+N path almost certainly means a big reduction in the overall
amount of effort needed.

Of course there are some specialized applications, such as perfect
hashing, where the hashing method needs to be tailored to some sort
of external considerations.  In most cases though a general hashing
function is a better choice.

As for how evaluation should be done, it's important to evaluate
hash functions in isolation for the same reasons it's important to
evaluate random number generators in isolation:  the results need
to be repeatable, and they need to give a quantitative measure of
how "mixed up" the outputs are for similar inputs.  Knuth talks
about random numbers in volume 2 of TAOCP, spending more than 100
pages on the subject, but he gives an essential precept before the
end of page 5: "Random numbers should not be generated with a
method chosen at random.  Some theory should be used."  The same is
true of hash functions, and a repeatable quantitative evaluation of
the function by itself is a key element of that process.

> 2. I don't like "modulo" method of reduction of hash value to
> range.  I don't like it for any chosen size of the bucket, not
> just for power of two, but even for prime.  Of course, for power
> of two size my dislike to it is deeper.

I understand your reaction, but this question is a separate topic.
Part of the reason for designing a high-quality hash function is so
that it doesn't matter which bits are used:  high order bits, low
order bits, some of each, every other bit, or every third or fourth
bit, they should all give good results (where "good" means nearby
inputs give what appear to be uncorrelated outputs).  If we design
our hash function to do that then it doesn't matter whether a hash
table does a mod or uses a different method.  Obviously that is
better than a hash function that works okay with some hash tables
but works poorly with others.

Back to comp.lang.c | Previous | Next — Previous in thread | Next in thread | Find similar

Thread

Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 00:59 +0300
  Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-05 11:10 +0200
    Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 12:34 +0300
      Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-05 12:05 +0200
        Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 13:11 +0300
  Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-05 08:58 -0700
    AES problem (was: Good hash for pointers) Michael S <already5chosen@yahoo.com> - 2024-06-05 19:51 +0300
      Re: AES problem (was: Good hash for pointers) Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-05 21:24 -0700
    Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 19:59 +0300
      Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-05 21:40 -0700
        Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-06 11:00 +0300
          Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-17 00:56 -0700
            Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-17 12:39 +0300
              Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-18 10:47 -0700
                Re: Good hash for pointers Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2024-06-18 15:17 -0700
                Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-18 16:17 -0700
                Re: Good hash for pointers James Kuyper <jameskuyper@alumni.caltech.edu> - 2024-06-18 19:23 -0400
                Re: Good hash for pointers Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2024-06-18 17:17 -0700
                Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-23 11:23 -0700
    Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-06 13:35 +0300
      Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-16 04:38 -0700
        Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-06-16 12:34 -0700
  Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-07 20:53 +0200
    Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-09 13:35 +0200
      Re: Good hash for pointers Richard Harnden <richard.nospam@gmail.invalid> - 2024-06-09 12:40 +0100
        Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-09 15:09 +0200
      Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-06-10 01:34 +0100
        Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-09 18:31 -0700
          Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-10 15:14 +0300
            Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-10 14:40 +0200
            Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-15 20:32 -0700
        Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-10 09:35 +0200

csiph-web