Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.c > #386020
| From | Tim Rentsch <tr.17687@z991.linuxsc.com> |
|---|---|
| Newsgroups | comp.lang.c |
| Subject | Re: Good hash for pointers |
| Date | 2024-06-16 04:38 -0700 |
| Organization | A noiseless patient Spider |
| Message-ID | <86cyohktgt.fsf@linuxsc.com> (permalink) |
| References | (9 earlier) <v2voe7$3fr50$1@raubtier-asyl.eternal-september.org> <86ed9ofq14.fsf@linuxsc.com> <20240605005916.00001b33@yahoo.com> <86a5jzmle1.fsf@linuxsc.com> <20240606133534.00005c5d@yahoo.com> |
Michael S <already5chosen@yahoo.com> writes: > On Wed, 05 Jun 2024 08:58:46 -0700 > Tim Rentsch <tr.17687@z991.linuxsc.com> wrote: > >> If we're trying to write a general hash function, it should work >> well across a wide range of input key sets, and also for different >> kinds of hash table management. Let's look at the output side >> first. There are two independent axes for hash tables: table >> size, and direct mapping versus using rehashing. For table size, >> the two prominent choices are some prime in the appropriate range, >> or else a power of two. Clearly any size could be chosen, but >> primes have the nice property that they divide up the space of hash >> values very evenly, whereas powers of two allow a cheaper way of >> reducing a hash value to the range of table indices (anding versus a >> modulo operation). A good hash function should work well with both. >> The choice of direct mapping versus rehashing depends on expected >> occupancy rates (which I am used to calling "load factor"). For a >> load factor of 75 to 80 percent or less, generally rehashing is >> better. Direct mapping has the advantage that it can handle load >> factors above 100%, at a cost of dealing with whatever subsidiary >> data structure (usually linked lists) is used when two or more keys >> map to the same index in the hash table. (I expect you know all >> that, I wrote it mostly just to clarify my own thinking.) > > I never had interest in implementation of hash methods, typically > just took what language or library provides and used it. > All my knowledge of internals are 50 y.o. news (TAoCP, and no I > did not read it 50 years ago, but I did read it first time > slightly less than 40 years ago and not sure if I re-read this > particular topic since then). > So, I am not sure that I understood everything above. > In particular, w.r.t. rehashing, IIRC, Knuth came to conclusion > that simple circular increment of the index is superior to more > elaborate methods. I don't know if conclusion was changed in the > following years, not even sure that I recollect it correctly. Hashing is discussed in volume 3 of TAOCP. I first read volume 3 just after it came out; I reviewed the section on hashing again after reading your message. Your remembrance doesn't match what is said in that section. Moreover some of the conclusions Knuth wrote in 1973 are certainly out of date now. General rehashing, also called open addressing, is overall a net win, especially for hash tables that have an above-average load factor. Also there are variations such as cuckoo hash that depend on using open addressing. > Despite (or due to?) my relative ignorance of the topic, I'd dare > to disagree with couple of your points: > 1. I don't think that hash function should be evaluated in > isolation from the rest of algorithm. IMHO, they should be > co-developed. There is nothing wrong with hash function that > works well with one particular "back end" and poorly with all > others as long as limitations are clearly stated. There are several reasons why a co-development approach usually isn't a good way to go. Malcolm started this thread by asking for a good hash function for pointers (to be used in a portable ANSI C setting). We don't know, nor do we have any control over, how Malcolm might use the function (and that goes double for any random reader of comp.lang.c who might see some posted hash function and decide to use it). Or we might want to use some sort of generic table manager that takes a pointer-to-function to do the hashing, but no way to affect how the hash values are used. Another reason has to do with total code size. If there are M data types to be hashed, and N applications that use hash values, coupling the code that does the hashing with the code that makes use of the hash values means there will be M*N parts to program, whereas separating the two aspects means there will be only M+N parts to program. The M+N path almost certainly means a big reduction in the overall amount of effort needed. Of course there are some specialized applications, such as perfect hashing, where the hashing method needs to be tailored to some sort of external considerations. In most cases though a general hashing function is a better choice. As for how evaluation should be done, it's important to evaluate hash functions in isolation for the same reasons it's important to evaluate random number generators in isolation: the results need to be repeatable, and they need to give a quantitative measure of how "mixed up" the outputs are for similar inputs. Knuth talks about random numbers in volume 2 of TAOCP, spending more than 100 pages on the subject, but he gives an essential precept before the end of page 5: "Random numbers should not be generated with a method chosen at random. Some theory should be used." The same is true of hash functions, and a repeatable quantitative evaluation of the function by itself is a key element of that process. > 2. I don't like "modulo" method of reduction of hash value to > range. I don't like it for any chosen size of the bucket, not > just for power of two, but even for prime. Of course, for power > of two size my dislike to it is deeper. I understand your reaction, but this question is a separate topic. Part of the reason for designing a high-quality hash function is so that it doesn't matter which bits are used: high order bits, low order bits, some of each, every other bit, or every third or fourth bit, they should all give good results (where "good" means nearby inputs give what appear to be uncorrelated outputs). If we design our hash function to do that then it doesn't matter whether a hash table does a mod or uses a different method. Obviously that is better than a hash function that works okay with some hash tables but works poorly with others.
Back to comp.lang.c | Previous | Next — Previous in thread | Next in thread | Find similar
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 00:59 +0300
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-05 11:10 +0200
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 12:34 +0300
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-05 12:05 +0200
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 13:11 +0300
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-05 08:58 -0700
AES problem (was: Good hash for pointers) Michael S <already5chosen@yahoo.com> - 2024-06-05 19:51 +0300
Re: AES problem (was: Good hash for pointers) Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-05 21:24 -0700
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 19:59 +0300
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-05 21:40 -0700
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-06 11:00 +0300
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-17 00:56 -0700
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-17 12:39 +0300
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-18 10:47 -0700
Re: Good hash for pointers Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2024-06-18 15:17 -0700
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-18 16:17 -0700
Re: Good hash for pointers James Kuyper <jameskuyper@alumni.caltech.edu> - 2024-06-18 19:23 -0400
Re: Good hash for pointers Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2024-06-18 17:17 -0700
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-23 11:23 -0700
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-06 13:35 +0300
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-16 04:38 -0700
Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-06-16 12:34 -0700
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-07 20:53 +0200
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-09 13:35 +0200
Re: Good hash for pointers Richard Harnden <richard.nospam@gmail.invalid> - 2024-06-09 12:40 +0100
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-09 15:09 +0200
Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-06-10 01:34 +0100
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-09 18:31 -0700
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-10 15:14 +0300
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-10 14:40 +0200
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-15 20:32 -0700
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-10 09:35 +0200
csiph-web