Path: csiph.com!news.swapon.de!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Tim Rentsch
Newsgroups: comp.lang.c
Subject: Re: Good hash for pointers
Date: Sun, 16 Jun 2024 04:38:58 -0700
Organization: A noiseless patient Spider
Lines: 104
Message-ID: <86cyohktgt.fsf@linuxsc.com>
References: <86fru6gsqr.fsf@linuxsc.com> <8634q5hjsp.fsf@linuxsc.com> <86le3wfsmd.fsf@linuxsc.com> <86ed9ofq14.fsf@linuxsc.com> <20240605005916.00001b33@yahoo.com> <86a5jzmle1.fsf@linuxsc.com> <20240606133534.00005c5d@yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Date: Sun, 16 Jun 2024 13:39:01 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="582f9e6093daba5cc619f698edf5f43e"; logging-data="47354"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18Y3HlxMJcek5YM+xFzKmu+8O/REW3l0kY="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:rVqqitWu14SRVy0B2WNMdKw3QLU= sha1:3clxG6Q43UWyaK1UYBz0aq/gfIg=
Xref: csiph.com comp.lang.c:386020
Michael S writes:
> On Wed, 05 Jun 2024 08:58:46 -0700
> Tim Rentsch wrote:
>
>> If we're trying to write a general hash function, it should work
>> well across a wide range of input key sets, and also for different
>> kinds of hash table management. Let's look at the output side
>> first. There are two independent axes for hash tables: table
>> size, and direct mapping versus using rehashing. For table size,
>> the two prominent choices are some prime in the appropriate range,
>> or else a power of two. Clearly any size could be chosen, but
>> primes have the nice property that they divide up the space of hash
>> values very evenly, whereas powers of two allow a cheaper way of
>> reducing a hash value to the range of table indices (anding versus a
>> modulo operation). A good hash function should work well with both.
>> The choice of direct mapping versus rehashing depends on expected
>> occupancy rates (which I am used to calling "load factor"). For a
>> load factor of 75 to 80 percent or less, generally rehashing is
>> better. Direct mapping has the advantage that it can handle load
>> factors above 100%, at a cost of dealing with whatever subsidiary
>> data structure (usually linked lists) is used when two or more keys
>> map to the same index in the hash table. (I expect you know all
>> that, I wrote it mostly just to clarify my own thinking.)
>
> I never had interest in implementation of hash methods, typically
> just took what language or library provides and used it.
> All my knowledge of internals are 50 y.o. news (TAoCP, and no I
> did not read it 50 years ago, but I did read it first time
> slightly less than 40 years ago and not sure if I re-read this
> particular topic since then).
> So, I am not sure that I understood everything above.
> In particular, w.r.t. rehashing, IIRC, Knuth came to conclusion
> that simple circular increment of the index is superior to more
> elaborate methods. I don't know if conclusion was changed in the
> following years, not even sure that I recollect it correctly.
Hashing is discussed in volume 3 of TAOCP. I first read volume 3
just after it came out; I reviewed the section on hashing again
after reading your message. Your remembrance doesn't match what
is said in that section. Moreover some of the conclusions Knuth
wrote in 1973 are certainly out of date now. General rehashing,
also called open addressing, is overall a net win, especially for
hash tables that have an above-average load factor. Also there
are variations such as cuckoo hash that depend on using open
addressing.
> Despite (or due to?) my relative ignorance of the topic, I'd dare
> to disagree with couple of your points:
> 1. I don't think that hash function should be evaluated in
> isolation from the rest of algorithm. IMHO, they should be
> co-developed. There is nothing wrong with hash function that
> works well with one particular "back end" and poorly with all
> others as long as limitations are clearly stated.
There are several reasons why a co-development approach usually
isn't a good way to go. Malcolm started this thread by asking for
a good hash function for pointers (to be used in a portable ANSI C
setting). We don't know, nor do we have any control over, how
Malcolm might use the function (and that goes double for any random
reader of comp.lang.c who might see some posted hash function and
decide to use it). Or we might want to use some sort of generic
table manager that takes a pointer-to-function to do the hashing,
but no way to affect how the hash values are used. Another reason
has to do with total code size. If there are M data types to be
hashed, and N applications that use hash values, coupling the code
that does the hashing with the code that makes use of the hash
values means there will be M*N parts to program, whereas separating
the two aspects means there will be only M+N parts to program. The
M+N path almost certainly means a big reduction in the overall
amount of effort needed.
Of course there are some specialized applications, such as perfect
hashing, where the hashing method needs to be tailored to some sort
of external considerations. In most cases though a general hashing
function is a better choice.
As for how evaluation should be done, it's important to evaluate
hash functions in isolation for the same reasons it's important to
evaluate random number generators in isolation: the results need
to be repeatable, and they need to give a quantitative measure of
how "mixed up" the outputs are for similar inputs. Knuth talks
about random numbers in volume 2 of TAOCP, spending more than 100
pages on the subject, but he gives an essential precept before the
end of page 5: "Random numbers should not be generated with a
method chosen at random. Some theory should be used." The same is
true of hash functions, and a repeatable quantitative evaluation of
the function by itself is a key element of that process.
> 2. I don't like "modulo" method of reduction of hash value to
> range. I don't like it for any chosen size of the bucket, not
> just for power of two, but even for prime. Of course, for power
> of two size my dislike to it is deeper.
I understand your reaction, but this question is a separate topic.
Part of the reason for designing a high-quality hash function is so
that it doesn't matter which bits are used: high order bits, low
order bits, some of each, every other bit, or every third or fourth
bit, they should all give good results (where "good" means nearby
inputs give what appear to be uncorrelated outputs). If we design
our hash function to do that then it doesn't matter whether a hash
table does a mod or uses a different method. Obviously that is
better than a hash function that works okay with some hash tables
but works poorly with others.