Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.c > #384883 > unrolled thread
| Started by | Malcolm McLean <malcolm.arthur.mclean@gmail.com> |
|---|---|
| First post | 2024-05-23 12:11 +0100 |
| Last post | 2024-05-26 20:24 +0200 |
| Articles | 20 on this page of 112 — 15 participants |
Back to article view | Back to comp.lang.c
Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-23 12:11 +0100
Re: Good hash for pointers Richard Harnden <richard.nospam@gmail.invalid> - 2024-05-23 15:37 +0100
Re: Good hash for pointers Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2024-05-23 15:51 -0700
Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-24 00:42 +0100
Re: Good hash for pointers Kaz Kylheku <643-408-1753@kylheku.com> - 2024-05-23 20:34 +0000
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-23 15:49 -0700
Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-24 00:43 +0100
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-23 16:52 -0700
Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-24 01:28 +0100
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-23 18:39 -0700
Re: Good hash for pointers bart <bc@freeuk.com> - 2024-05-24 11:14 +0100
Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-24 12:05 +0100
Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-05-24 10:49 -0700
Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-05-24 10:51 -0700
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-24 06:18 -0700
Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-24 15:07 +0100
Re: Good hash for pointers scott@slp53.sl.home (Scott Lurndal) - 2024-05-24 14:51 +0000
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-25 02:49 -0700
Re: Good hash for pointers David Brown <david.brown@hesbynett.no> - 2024-05-24 17:00 +0200
Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-24 17:10 +0100
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-05-24 19:27 +0300
Re: Good hash for pointers David Brown <david.brown@hesbynett.no> - 2024-05-24 09:41 +0200
Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-05-23 17:32 -0700
Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-05-23 18:59 -0700
Re: Good hash for pointers jak <nospam@please.ty> - 2024-05-24 04:09 +0200
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-24 20:28 +0200
Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-24 19:57 +0100
Re: Good hash for pointers bart <bc@freeuk.com> - 2024-05-25 00:54 +0100
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-25 02:12 -0700
Re: Good hash for pointers bart <bc@freeuk.com> - 2024-05-25 12:28 +0100
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-25 11:12 -0700
Re: Good hash for pointers bart <bc@freeuk.com> - 2024-05-25 20:31 +0100
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-25 22:54 -0700
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-25 17:00 +0200
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-25 10:40 -0700
Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-25 18:56 +0100
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-25 11:23 -0700
Re: Good hash for pointers Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2024-05-25 23:13 +0200
Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-25 23:07 +0100
Re: Good hash for pointers bart <bc@freeuk.com> - 2024-05-25 23:42 +0100
Re: Good hash for pointers Richard Harnden <richard.nospam@gmail.invalid> - 2024-05-26 19:58 +0100
Re: Good hash for pointers Kaz Kylheku <643-408-1753@kylheku.com> - 2024-05-26 22:42 +0000
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 18:05 +0200
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 18:07 +0200
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 18:04 +0200
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-26 09:24 -0700
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 18:36 +0200
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-26 10:20 -0700
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 19:39 +0200
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 19:54 +0200
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-27 08:07 +0200
Re: Good hash for pointers Ben Bacarisse <ben@bsb.me.uk> - 2024-05-28 11:07 +0100
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-30 10:10 +0200
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-30 11:27 +0200
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-30 19:26 -0700
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-30 19:27 -0700
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-02 10:45 +0300
Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-06-02 12:42 -0700
Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-06-03 12:35 -0700
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-02 16:02 -0700
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-03 10:50 +0300
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-03 18:02 -0700
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-04 11:38 +0300
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-04 22:10 -0700
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-03 16:34 +0200
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-03 17:46 +0300
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-03 17:54 +0200
Re: Good hash for pointers bart <bc@freeuk.com> - 2024-06-03 17:24 +0100
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-03 20:16 +0300
Re: Good hash for pointers bart <bc@freeuk.com> - 2024-06-03 19:48 +0100
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-03 22:41 +0300
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-03 22:51 +0300
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-03 16:51 -0700
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-03 17:01 -0700
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-03 20:25 +0200
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-03 19:50 +0300
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-03 20:31 +0200
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 00:59 +0300
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-05 11:10 +0200
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 12:34 +0300
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-05 12:05 +0200
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 13:11 +0300
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-05 08:58 -0700
AES problem (was: Good hash for pointers) Michael S <already5chosen@yahoo.com> - 2024-06-05 19:51 +0300
Re: AES problem (was: Good hash for pointers) Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-05 21:24 -0700
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 19:59 +0300
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-05 21:40 -0700
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-06 11:00 +0300
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-17 00:56 -0700
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-17 12:39 +0300
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-18 10:47 -0700
Re: Good hash for pointers Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2024-06-18 15:17 -0700
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-18 16:17 -0700
Re: Good hash for pointers James Kuyper <jameskuyper@alumni.caltech.edu> - 2024-06-18 19:23 -0400
Re: Good hash for pointers Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2024-06-18 17:17 -0700
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-23 11:23 -0700
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-06 13:35 +0300
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-16 04:38 -0700
Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-06-16 12:34 -0700
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-07 20:53 +0200
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-09 13:35 +0200
Re: Good hash for pointers Richard Harnden <richard.nospam@gmail.invalid> - 2024-06-09 12:40 +0100
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-09 15:09 +0200
Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-06-10 01:34 +0100
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-09 18:31 -0700
Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-10 15:14 +0300
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-10 14:40 +0200
Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-15 20:32 -0700
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-10 09:35 +0200
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 20:06 +0200
Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-26 20:10 +0100
Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 20:24 +0200
Page 5 of 6 — ← Prev page 1 2 3 4 [5] 6 Next page →
| From | Bonita Montero <Bonita.Montero@gmail.com> |
|---|---|
| Date | 2024-06-05 12:05 +0200 |
| Message-ID | <v3pd8i$tutm$1@raubtier-asyl.eternal-september.org> |
| In reply to | #385585 |
Am 05.06.2024 um 11:34 schrieb Michael S: > I have no idea what you are talking about. > 18446744073709551557 == -59. > That's a very small number. You should take the value not signed. Addition and subtraction work the same signed and unsigned, but not multiplication and division. As I've shown you get a perfekt equal distribution if the input also has a equal distribution.
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2024-06-05 13:11 +0300 |
| Message-ID | <20240605131124.00005f5e@yahoo.com> |
| In reply to | #385586 |
On Wed, 5 Jun 2024 12:05:07 +0200 Bonita Montero <Bonita.Montero@gmail.com> wrote: > Am 05.06.2024 um 11:34 schrieb Michael S: > > > I have no idea what you are talking about. > > 18446744073709551557 == -59. > > That's a very small number. > > You should take the value not signed. Addition and subtraction work > the same signed and unsigned, but not multiplication and division. > As I've shown you get a perfekt equal distribution if the input > also has a equal distribution. You had not shown shite. If you can not understand simple things on paper then run my test bench.
[toc] | [prev] | [next] | [standalone]
| From | Tim Rentsch <tr.17687@z991.linuxsc.com> |
|---|---|
| Date | 2024-06-05 08:58 -0700 |
| Message-ID | <86a5jzmle1.fsf@linuxsc.com> |
| In reply to | #385561 |
Michael S <already5chosen@yahoo.com> writes: > On Sun, 26 May 2024 10:20:55 -0700 > Tim Rentsch <tr.17687@z991.linuxsc.com> wrote: [..comments on a hash function that simply multiplies the key by a large prime (18446744073709551557) to give the hash value, and noting that tests have shown poor performance..] > 18446744073709551557 is indeed very bad (too close to 2**64). > But I tried other prime and at least in my simple tests it performs > rather well. > May be my tests are to simplistic, I don't pretend to understand much > about it. > > [lots of C code] > > Compiled as: > gcc -s -Wall -O2 -march=native ptr_hash.c hash_bonita.c hash_aes4.c I couldn't get the AES-based hash function to compile. There was a complaint about not being able to inline an 'always-inline' function. I played around with it for a while but didn't get anywhere. I did get your own hash function out and put it into my little test rig. There may be summary comments below. > Run as: > ./a 100000 75000 80 1000 > > Result: > ref: 75000 keys, 100000 slots, 52799 occupied 17431 collisions. [...] > uut: 75000 keys, 100000 slots, 53201 occupied 17139 collisions. [...] Both of these occupancy rates are close to the expected theoretical average, which I calculated to be 52764. If I had been more ambitious I might have done some statistical sampling to get some idea of the standard deviation, but I wasn't that energetic today. :) If we're trying to write a general hash function, it should work well across a wide range of input key sets, and also for different kinds of hash table management. Let's look at the output side first. There are two independent axes for hash tables: table size, and direct mapping versus using rehashing. For table size, the two prominent choices are some prime in the appropriate range, or else a power of two. Clearly any size could be chosen, but primes have the nice property that they divide up the space of hash values very evenly, whereas powers of two allow a cheaper way of reducing a hash value to the range of table indices (anding versus a modulo operation). A good hash function should work well with both. The choice of direct mapping versus rehashing depends on expected occupancy rates (which I am used to calling "load factor"). For a load factor of 75 to 80 percent or less, generally rehashing is better. Direct mapping has the advantage that it can handle load factors above 100%, at a cost of dealing with whatever subsidiary data structure (usually linked lists) is used when two or more keys map to the same index in the hash table. (I expect you know all that, I wrote it mostly just to clarify my own thinking.) For the recent testing I did, I chose a power of two (65536) for the table size, and rehashing rather than direct mapping. Both of these choices make it easier to evaluate how well hash functions perform. Having a power-of-two table size reveals weaknesses in the hash function more readily than taking the hash value modulo a prime; using rehashing allows an easy computation of how many probes on average are needed to do an insertion, which is an easier metric to use for comparisons than occupancy rates (which typically need some statistical calculations to appreciate the meaning of different resulting occupancy rates). Let me emphasize that I don't mean to advocate any of the different hash table models over the others. The key point is that a good hash function should work with all of them. On the input side, it's important to have a wide range of input key sets, with different sorts of distributions. The input key sets might include: linear addresses out of a single array, so a+i*m with i in the range [0..k], for different values of m (to be thorough all m in the range [1..64], all multiples of 2 from 64 to 128, all multiples of 4 from 128 to 256, all multiples of 8 from 256 to 512, and so forth up to all multiples of 64 up from 2048 to 4096); results of malloc() using fixed sizes (all of the m values from the last example); results of malloc() using a range of sizes, for several kinds of random distributions); results of malloc() using linearly increasing sizes; and other kinds of variations. In my testing I used several fixed sizes to malloc() all less than 60; rand()%997 + 43 for malloc() sizes; malloc( i/7 + 1 ) for each key value sub i; malloc( i/17 + 1 ) for each key value sub i; linear addressing in an array, with various multipliers (ad hoc rather than systematic choices); quadratic addressing in an array (which means array + i*i*m, for index i and multiplier choice m) -- here again the choices for m were ad hoc rather than system. An interesting result under the "in array" choices is that they sometimes would change from run to run, presumably as the array was put at different starting positions. Sometimes the differences were surprisingly large. Like I said before, a good hash function should produce average or near-average values for all of these scenarios. It might be nice to see some above-average results under some workloads, but they don't make up for well below-average results under other workloads. The situation is pretty much the opposite of what happens with quicksort: the worst case behavior of quicksort is quadratic, but we know that in practice that almost never happens, so quicksort is used because its typical or average performance is better than that of other choices. With hashing, better-than-average performance is rare, and doesn't give much benefit, but worse-than-average performance is more likely, and can be disastrous. Obviously there can be other factors that might motivate other choices in some situations (eg, perfect hashing), but for a general-use hash function the first design criterion should be close-to-average behavior under all scenarios tested.
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2024-06-05 19:51 +0300 |
| Subject | AES problem (was: Good hash for pointers) |
| Message-ID | <20240605195159.00002f0e@yahoo.com> |
| In reply to | #385597 |
On Wed, 05 Jun 2024 08:58:46 -0700 Tim Rentsch <tr.17687@z991.linuxsc.com> wrote: > Michael S <already5chosen@yahoo.com> writes: > > > I couldn't get the AES-based hash function to compile. There > was a complaint about not being able to inline an 'always-inline' > function. I played around with it for a while but didn't get > anywhere. > Not that using my reference hash is particularly important (I needed it as a quick way to get something I could rely on; you already have something else with similar properties) but it's still worth investigation. There are two possibilities: 1. gcc compiler's -march=native logic incorrectly detects CPU type on your computer. 2. you CPU really does not support AES instructions. If the former, you can force compiler's hand with -maes flag. If the later, then you can compile it, but wouldn't be able to run my "reference" hash function on this particular computer What CPU are you using?
[toc] | [prev] | [next] | [standalone]
| From | Tim Rentsch <tr.17687@z991.linuxsc.com> |
|---|---|
| Date | 2024-06-05 21:24 -0700 |
| Subject | Re: AES problem (was: Good hash for pointers) |
| Message-ID | <865xumn1fb.fsf@linuxsc.com> |
| In reply to | #385598 |
Michael S <already5chosen@yahoo.com> writes: > On Wed, 05 Jun 2024 08:58:46 -0700 > Tim Rentsch <tr.17687@z991.linuxsc.com> wrote: > >> Michael S <already5chosen@yahoo.com> writes: >> >> >> I couldn't get the AES-based hash function to compile. There >> was a complaint about not being able to inline an 'always-inline' >> function. I played around with it for a while but didn't get >> anywhere. > > Not that using my reference hash is particularly important (I needed it > as a quick way to get something I could rely on; you already have > something else with similar properties) but it's still worth > investigation. > > There are two possibilities: > 1. gcc compiler's -march=native logic incorrectly detects CPU type on > your computer. > 2. you CPU really does not support AES instructions. > > If the former, you can force compiler's hand with -maes flag. > If the later, then you can compile it, but wouldn't be able to run my > "reference" hash function on this particular computer I finally got it to work. I tried -maes before, and it still failed, but maybe I was using some other option that messed things up. In any case it works now, and the AES hash gives great results AFAICT.
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2024-06-05 19:59 +0300 |
| Message-ID | <20240605195905.00002484@yahoo.com> |
| In reply to | #385597 |
On Wed, 05 Jun 2024 08:58:46 -0700 Tim Rentsch <tr.17687@z991.linuxsc.com> wrote: > > I did get your own hash function out and put it into my little > test rig. There may be summary comments below. > <snip> As you probably already paid attention, I like bottom lines. What is a bottom line here? Did you you encounter cases in which almost-bonita's-but-not-quite hash function performed badly?
[toc] | [prev] | [next] | [standalone]
| From | Tim Rentsch <tr.17687@z991.linuxsc.com> |
|---|---|
| Date | 2024-06-05 21:40 -0700 |
| Message-ID | <86y17ilm4k.fsf@linuxsc.com> |
| In reply to | #385599 |
Michael S <already5chosen@yahoo.com> writes: > On Wed, 05 Jun 2024 08:58:46 -0700 > Tim Rentsch <tr.17687@z991.linuxsc.com> wrote: > >> I did get your own hash function out and put it into my little >> test rig. There may be summary comments below. > > <snip> > > As you probably already paid attention, I like bottom lines. > What is a bottom line here? The bottom line is both the multiply-by-a-large-prime hashes should be avoided. > Did you you encounter cases in which almost-bonita's-but-not-quite > hash function performed badly? Yes. A clear example is using directly mapped hash table that has a power of two elements, with inputs all from malloc( 48 ) calls. All keys map to one of only 1024 entries, in a hash table that has 65536 slots.
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2024-06-06 11:00 +0300 |
| Message-ID | <20240606110009.00001096@yahoo.com> |
| In reply to | #385609 |
On Wed, 05 Jun 2024 21:40:27 -0700 Tim Rentsch <tr.17687@z991.linuxsc.com> wrote: > Michael S <already5chosen@yahoo.com> writes: > > > On Wed, 05 Jun 2024 08:58:46 -0700 > > Tim Rentsch <tr.17687@z991.linuxsc.com> wrote: > > > >> I did get your own hash function out and put it into my little > >> test rig. There may be summary comments below. > > > > <snip> > > > > As you probably already paid attention, I like bottom lines. > > What is a bottom line here? > > The bottom line is both the multiply-by-a-large-prime hashes > should be avoided. > I am not convinced. So far I had seen no disadvantages for particular application mentioned by Malcolm. Of course, it is no good for Malcolm, because it can't be coded in what Malcolm erroneously call "ANSI C". It is not easy to be coded even in ISO C11. But in my own practice it does not matter. I happily use language extensions, like gnu __int128 or Microsoft's __umulh() when they are the best tool for a job. > > Did you you encounter cases in which almost-bonita's-but-not-quite > > hash function performed badly? > > Yes. A clear example is using directly mapped hash table that > has a power of two elements, with inputs all from malloc( 48 ) > calls. All keys map to one of only 1024 entries, in a hash > table that has 65536 slots. I don't see it. $ ./a 0x10000 0xC000 48 ref: 49152 keys, 65536 slots, 34590 occupied 11295 collisions. Worst collision: 7 uut: 49152 keys, 65536 slots, 35791 occupied 11872 collisions. Worst collision: 3 It sounds like in your tests you don't use 'linear scaling to table size' for translation of hash value to table index. IMHO, you should. May be, for some particular hash functions other methods of translation also work, but I see no reasons to use other methods. This one is quite fast, robust and looks to me like the most logical. For power-of-2 sized tables it degenerates into simple right shift.
[toc] | [prev] | [next] | [standalone]
| From | Tim Rentsch <tr.17687@z991.linuxsc.com> |
|---|---|
| Date | 2024-06-17 00:56 -0700 |
| Message-ID | <86zfrkj93b.fsf@linuxsc.com> |
| In reply to | #385615 |
Michael S <already5chosen@yahoo.com> writes:
> On Wed, 05 Jun 2024 21:40:27 -0700
> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>
>> Michael S <already5chosen@yahoo.com> writes:
>>
>>> On Wed, 05 Jun 2024 08:58:46 -0700
>>> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>>>
>>>> I did get your own hash function out and put it into my little
>>>> test rig. There may be summary comments below.
>>>
>>> <snip>
>>>
>>> As you probably already paid attention, I like bottom lines.
>>> What is a bottom line here?
>>
>> The bottom line is both the multiply-by-a-large-prime hashes
>> should be avoided.
>
> I am not convinced.
Let me amend my earlier statement. A hash function that simply
multiplies its input by an integer constant should be avoided
if the goal is to produce a hash function of decent quaility
rather than one of mediocre quality.
> Of course, it is no good for Malcolm, because it can't be coded
> in what Malcolm erroneously call "ANSI C".
I don't know why you say that. C was an ANSI standard before it
was an ISO standard. Or is it that you think that the language
Malcolm is intent on using does not conform to C90/C89/ANSI C?
>>> Did you you encounter cases in which almost-bonita's-but-not-quite
>>> hash function performed badly?
>>
>> Yes. A clear example is using directly mapped hash table that
>> has a power of two elements, with inputs all from malloc( 48 )
>> calls. All keys map to one of only 1024 entries, in a hash
>> table that has 65536 slots.
>
> I don't see it.
> $ ./a 0x10000 0xC000 48
> ref: 49152 keys, 65536 slots, 34590 occupied 11295 collisions.
> Worst collision: 7
> uut: 49152 keys, 65536 slots, 35791 occupied 11872 collisions.
> Worst collision: 3
You have run a poor test. The point of quality evaluation
testing is to look for potential problems, not to disguise
them.
> It sounds like in your tests you don't use 'linear scaling to
> table size' for translation of hash value to table index.
> IMHO, you should.
Like I said in another response, the question of how hash tables
should make use of hash values is a separate topic. The point
of the tests I ran was to evaluate the quality of hash functions,
not to look at how hash tables should be coded.
Here are some statistics I gathered from a more comprehensive
set of tests for several hash functions. The tests look at
aggregates of measured values for all output bits (considered
eight bits at a time) as a function of all input bits (considered
sixteen bits at a time. The first line in each pair summarizes
the low values in each set, and the second line summarizes the
high values in each set, in each case showing the range, average,
and standard deviation. The first set, hash 0, uses the aes
code you posted. I think it's easy to see the quality get
worse going from top to bottom.
hash 0
186 .. 201 avg: 195.39 dev: 3.71
312 .. 339 avg: 322.31 dev: 5.85
hash 1
186 .. 203 avg: 195.48 dev: 4.01
310 .. 344 avg: 322.45 dev: 6.31
hash 2
160 .. 202 avg: 193.56 dev: 7.70
311 .. 348 avg: 322.22 dev: 7.91
hash 3
143 .. 202 avg: 185.52 dev: 10.15
311 .. 387 avg: 329.52 dev: 14.10
hash 4
0 .. 212 avg: 9.94 dev: 45.16
270 .. 65536 avg: 44546.56 dev: 28809.47
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2024-06-17 12:39 +0300 |
| Message-ID | <20240617123926.00006a12@yahoo.com> |
| In reply to | #386066 |
On Mon, 17 Jun 2024 00:56:40 -0700 Tim Rentsch <tr.17687@z991.linuxsc.com> wrote: > > I don't know why you say that. C was an ANSI standard before it > was an ISO standard. Or is it that you think that the language > Malcolm is intent on using does not conform to C90/C89/ANSI C? > > All I wanted to point by this comment is that ANSI recognizes ISO/IEC 9899:2018 as their current C Standard and probably will recognize the next ISO C Standard pretty soon. For that reason I think that names like C89 or C90 are preferable (to ANSI C) when we want to refer to this particular variant of the language.
[toc] | [prev] | [next] | [standalone]
| From | Tim Rentsch <tr.17687@z991.linuxsc.com> |
|---|---|
| Date | 2024-06-18 10:47 -0700 |
| Message-ID | <86ed8ujg7j.fsf@linuxsc.com> |
| In reply to | #386074 |
Michael S <already5chosen@yahoo.com> writes: > On Mon, 17 Jun 2024 00:56:40 -0700 > Tim Rentsch <tr.17687@z991.linuxsc.com> wrote: > >> I don't know why you say that. C was an ANSI standard before it >> was an ISO standard. Or is it that you think that the language >> Malcolm is intent on using does not conform to C90/C89/ANSI C? > > All I wanted to point by this comment is that ANSI recognizes ISO/IEC > 9899:2018 as their current C Standard and probably will recognize the > next ISO C Standard pretty soon. For that reason I think that names like > C89 or C90 are preferable (to ANSI C) when we want to refer to this > particular variant of the language. I see. So it isn't that you think "ANSI C" is wrong, just that it might be misleading or that C89 or C90 is preferable. Personally I would be surprised if someone used "ANSI C" to mean anything other than C89/C90, but certainly other people could have a different reaction.
[toc] | [prev] | [next] | [standalone]
| From | Keith Thompson <Keith.S.Thompson+u@gmail.com> |
|---|---|
| Date | 2024-06-18 15:17 -0700 |
| Message-ID | <87bk3xzyi4.fsf@nosuchdomain.example.com> |
| In reply to | #386188 |
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
> Michael S <already5chosen@yahoo.com> writes:
>> On Mon, 17 Jun 2024 00:56:40 -0700
>> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>>
>>> I don't know why you say that. C was an ANSI standard before it
>>> was an ISO standard. Or is it that you think that the language
>>> Malcolm is intent on using does not conform to C90/C89/ANSI C?
>>
>> All I wanted to point by this comment is that ANSI recognizes ISO/IEC
>> 9899:2018 as their current C Standard and probably will recognize the
>> next ISO C Standard pretty soon. For that reason I think that names like
>> C89 or C90 are preferable (to ANSI C) when we want to refer to this
>> particular variant of the language.
>
> I see. So it isn't that you think "ANSI C" is wrong, just
> that it might be misleading or that C89 or C90 is preferable.
> Personally I would be surprised if someone used "ANSI C" to
> mean anything other than C89/C90, but certainly other people
> could have a different reaction.
The term "ANSI C" almost universally refers to C89/C90. But someone
not familiar with the term might expect it to mean "the C standard
endorsed by ANSI", which is currently C17.
The term "ANSI C" started out as a way to refer to the newly
standardized language, distinguishing it from pre-standard versions
like the one documented in K&R1.
I don't necessarily complain when someone uses the phrase "ANSI C"
to mean C89/C90, but I try to avoid it myself in favor of "C89" or
"C90".
Hmm. It occurs to me that "K&R C", which usually refers to the
language defined in K&R1, is also potentially ambiguous. I'm not
going to worry about it too much. (One C compiler uses a "-ansic"
command-line option to cause it to attempt to conform to C99.)
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
[toc] | [prev] | [next] | [standalone]
| From | Tim Rentsch <tr.17687@z991.linuxsc.com> |
|---|---|
| Date | 2024-06-18 16:17 -0700 |
| Message-ID | <86sex9j0ww.fsf@linuxsc.com> |
| In reply to | #386203 |
Keith Thompson <Keith.S.Thompson+u@gmail.com> writes: > Tim Rentsch <tr.17687@z991.linuxsc.com> writes: > >> Michael S <already5chosen@yahoo.com> writes: >> >>> On Mon, 17 Jun 2024 00:56:40 -0700 >>> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote: >>> >>>> I don't know why you say that. C was an ANSI standard before it >>>> was an ISO standard. Or is it that you think that the language >>>> Malcolm is intent on using does not conform to C90/C89/ANSI C? >>> >>> All I wanted to point by this comment is that ANSI recognizes ISO/IEC >>> 9899:2018 as their current C Standard and probably will recognize the >>> next ISO C Standard pretty soon. For that reason I think that names like >>> C89 or C90 are preferable (to ANSI C) when we want to refer to this >>> particular variant of the language. >> >> I see. So it isn't that you think "ANSI C" is wrong, just >> that it might be misleading or that C89 or C90 is preferable. >> Personally I would be surprised if someone used "ANSI C" to >> mean anything other than C89/C90, but certainly other people >> could have a different reaction. > > [...] I don't necessarily complain when someone uses the phrase > "ANSI C" to mean C89/C90, but I try to avoid it myself in favor > of "C89" or "C90". I'm reminded that gcc accepts the option -ansi as a synonym for the option -std=c90.
[toc] | [prev] | [next] | [standalone]
| From | James Kuyper <jameskuyper@alumni.caltech.edu> |
|---|---|
| Date | 2024-06-18 19:23 -0400 |
| Message-ID | <v4t4tq$1j49k$1@dont-email.me> |
| In reply to | #386188 |
Tim Rentsch <tr.17687@z991.linuxsc.com> writes: ... > I see. So it isn't that you think "ANSI C" is wrong, just > that it might be misleading or that C89 or C90 is preferable. ANSI's documentation is quite clear about the fact that there is, at any time, only one ANSI C standard, which is the version most recently approved - the older versions cease to be ANSI standards as soon as newer ones are approved. C17 is the current ANSI standard for C. Therefore, using "ANSI C" to mean specifically C89 is inaccurate, unless the wording makes it clear that it's referring to a time period when that was the current ANSI C standard. > Personally I would be surprised if someone used "ANSI C" to > mean anything other than C89/C90, I would expect the term to be used almost exclusively by people who incorrectly think that it means C89. Since it became an ISO standard with C90, few people who care about the latest version of the C standard worry about the parallel ANSI standard. Most people never got into the habit of using "ISO C" to mean specifically C90, so they didn't need to break that habit when it was superseded by C99.
[toc] | [prev] | [next] | [standalone]
| From | Keith Thompson <Keith.S.Thompson+u@gmail.com> |
|---|---|
| Date | 2024-06-18 17:17 -0700 |
| Message-ID | <8734p9zsyt.fsf@nosuchdomain.example.com> |
| In reply to | #386206 |
James Kuyper <jameskuyper@alumni.caltech.edu> writes:
> Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
[...]
>> Personally I would be surprised if someone used "ANSI C" to
>> mean anything other than C89/C90,
>
> I would expect the term to be used almost exclusively by people who
> incorrectly think that it means C89. Since it became an ISO standard
> with C90, few people who care about the latest version of the C standard
> worry about the parallel ANSI standard. Most people never got into the
> habit of using "ISO C" to mean specifically C90, so they didn't need to
> break that habit when it was superseded by C99.
I mostly agree, except that I might argue that the meaning of a
phrase can be defined by common usage.
**This is a very minor nitpick.**
In a more formal context, a "string literal" is not a literal whose
value is necessarily a string. We can't infer the meaning of a
defined phrase from the meanings of its consituent words.
"ANSI C" is defined by informal common usage to mean C89 (that was
the more formal meaning as of 1989, but common usage has evolved).
It's unfortunate that the obvious meaning of the phrase based
on its constituent words is very different, currently C17.
But practically nobody refers to C17 as "ANSI C"; if anything,
it's more appropriately called "ISO C".
If I see "ANSI C", I can be *almost* certain that it refers to
C89/C90 (the same language specified by two different documents).
But if I want to talk about the C89 or C90 standard, I'll use "C89"
or "C90", and if I want to talk about the language they define,
I might use both.
gcc's "-ansi" option was introduced when "ANSI C" was a correct and
reasonable name for C89/C90. There are too many existing scripts
to allow changing it now.
English am difficult.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
[toc] | [prev] | [next] | [standalone]
| From | Tim Rentsch <tr.17687@z991.linuxsc.com> |
|---|---|
| Date | 2024-06-23 11:23 -0700 |
| Message-ID | <86v81zfrhv.fsf@linuxsc.com> |
| In reply to | #386206 |
James Kuyper <jameskuyper@alumni.caltech.edu> writes: > Tim Rentsch <tr.17687@z991.linuxsc.com> writes: > ... > >> I see. So it isn't that you think "ANSI C" is wrong, just >> that it might be misleading or that C89 or C90 is preferable. > > ANSI's documentation is quite clear about the fact that there is, at any > time, only one ANSI C standard, [...] I'm not talking about ANSI's documentation. I'm talking about how people use the term ANSI C.
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2024-06-06 13:35 +0300 |
| Message-ID | <20240606133534.00005c5d@yahoo.com> |
| In reply to | #385597 |
On Wed, 05 Jun 2024 08:58:46 -0700 Tim Rentsch <tr.17687@z991.linuxsc.com> wrote: > > If we're trying to write a general hash function, it should work > well across a wide range of input key sets, and also for different > kinds of hash table management. Let's look at the output side > first. There are two independent axes for hash tables: table > size, and direct mapping versus using rehashing. For table size, > the two prominent choices are some prime in the appropriate range, > or else a power of two. Clearly any size could be chosen, but > primes have the nice property that they divide up the space of hash > values very evenly, whereas powers of two allow a cheaper way of > reducing a hash value to the range of table indices (anding versus a > modulo operation). A good hash function should work well with both. > The choice of direct mapping versus rehashing depends on expected > occupancy rates (which I am used to calling "load factor"). For a > load factor of 75 to 80 percent or less, generally rehashing is > better. Direct mapping has the advantage that it can handle load > factors above 100%, at a cost of dealing with whatever subsidiary > data structure (usually linked lists) is used when two or more keys > map to the same index in the hash table. (I expect you know all > that, I wrote it mostly just to clarify my own thinking.) > I never had interest in implementation of hash methods, typically just took what language or library provides and used it. All my knowledge of internals are 50 y.o. news (TAoCP, and no I did not read it 50 years ago, but I did read it first time slightly less than 40 years ago and not sure if I re-read this particular topic since then). So, I am not sure that I understood everything above. In particular, w.r.t. rehashing, IIRC, Knuth came to conclusion that simple circular increment of the index is superior to more elaborate methods. I don't know if conclusion was changed in the following years, not even sure that I recollect it correctly. Despite (or due to?) my relative ignorance of the topic, I'd dare to disagree with couple of your points: 1. I don't think that hash function should be evaluated in isolation from the rest of algorithm. IMHO, they should be co-developed. There is nothing wrong with hash function that works well with one particular "back end" and poorly with all others as long as limitations are clearly stated. 2. I don't like "modulo" method of reduction of hash value to range. I don't like it for any chosen size of the bucket, not just for power of two, but even for prime. Of course, for power of two size my dislike to it is deeper.
[toc] | [prev] | [next] | [standalone]
| From | Tim Rentsch <tr.17687@z991.linuxsc.com> |
|---|---|
| Date | 2024-06-16 04:38 -0700 |
| Message-ID | <86cyohktgt.fsf@linuxsc.com> |
| In reply to | #385618 |
Michael S <already5chosen@yahoo.com> writes: > On Wed, 05 Jun 2024 08:58:46 -0700 > Tim Rentsch <tr.17687@z991.linuxsc.com> wrote: > >> If we're trying to write a general hash function, it should work >> well across a wide range of input key sets, and also for different >> kinds of hash table management. Let's look at the output side >> first. There are two independent axes for hash tables: table >> size, and direct mapping versus using rehashing. For table size, >> the two prominent choices are some prime in the appropriate range, >> or else a power of two. Clearly any size could be chosen, but >> primes have the nice property that they divide up the space of hash >> values very evenly, whereas powers of two allow a cheaper way of >> reducing a hash value to the range of table indices (anding versus a >> modulo operation). A good hash function should work well with both. >> The choice of direct mapping versus rehashing depends on expected >> occupancy rates (which I am used to calling "load factor"). For a >> load factor of 75 to 80 percent or less, generally rehashing is >> better. Direct mapping has the advantage that it can handle load >> factors above 100%, at a cost of dealing with whatever subsidiary >> data structure (usually linked lists) is used when two or more keys >> map to the same index in the hash table. (I expect you know all >> that, I wrote it mostly just to clarify my own thinking.) > > I never had interest in implementation of hash methods, typically > just took what language or library provides and used it. > All my knowledge of internals are 50 y.o. news (TAoCP, and no I > did not read it 50 years ago, but I did read it first time > slightly less than 40 years ago and not sure if I re-read this > particular topic since then). > So, I am not sure that I understood everything above. > In particular, w.r.t. rehashing, IIRC, Knuth came to conclusion > that simple circular increment of the index is superior to more > elaborate methods. I don't know if conclusion was changed in the > following years, not even sure that I recollect it correctly. Hashing is discussed in volume 3 of TAOCP. I first read volume 3 just after it came out; I reviewed the section on hashing again after reading your message. Your remembrance doesn't match what is said in that section. Moreover some of the conclusions Knuth wrote in 1973 are certainly out of date now. General rehashing, also called open addressing, is overall a net win, especially for hash tables that have an above-average load factor. Also there are variations such as cuckoo hash that depend on using open addressing. > Despite (or due to?) my relative ignorance of the topic, I'd dare > to disagree with couple of your points: > 1. I don't think that hash function should be evaluated in > isolation from the rest of algorithm. IMHO, they should be > co-developed. There is nothing wrong with hash function that > works well with one particular "back end" and poorly with all > others as long as limitations are clearly stated. There are several reasons why a co-development approach usually isn't a good way to go. Malcolm started this thread by asking for a good hash function for pointers (to be used in a portable ANSI C setting). We don't know, nor do we have any control over, how Malcolm might use the function (and that goes double for any random reader of comp.lang.c who might see some posted hash function and decide to use it). Or we might want to use some sort of generic table manager that takes a pointer-to-function to do the hashing, but no way to affect how the hash values are used. Another reason has to do with total code size. If there are M data types to be hashed, and N applications that use hash values, coupling the code that does the hashing with the code that makes use of the hash values means there will be M*N parts to program, whereas separating the two aspects means there will be only M+N parts to program. The M+N path almost certainly means a big reduction in the overall amount of effort needed. Of course there are some specialized applications, such as perfect hashing, where the hashing method needs to be tailored to some sort of external considerations. In most cases though a general hashing function is a better choice. As for how evaluation should be done, it's important to evaluate hash functions in isolation for the same reasons it's important to evaluate random number generators in isolation: the results need to be repeatable, and they need to give a quantitative measure of how "mixed up" the outputs are for similar inputs. Knuth talks about random numbers in volume 2 of TAOCP, spending more than 100 pages on the subject, but he gives an essential precept before the end of page 5: "Random numbers should not be generated with a method chosen at random. Some theory should be used." The same is true of hash functions, and a repeatable quantitative evaluation of the function by itself is a key element of that process. > 2. I don't like "modulo" method of reduction of hash value to > range. I don't like it for any chosen size of the bucket, not > just for power of two, but even for prime. Of course, for power > of two size my dislike to it is deeper. I understand your reaction, but this question is a separate topic. Part of the reason for designing a high-quality hash function is so that it doesn't matter which bits are used: high order bits, low order bits, some of each, every other bit, or every third or fourth bit, they should all give good results (where "good" means nearby inputs give what appear to be uncorrelated outputs). If we design our hash function to do that then it doesn't matter whether a hash table does a mod or uses a different method. Obviously that is better than a hash function that works okay with some hash tables but works poorly with others.
[toc] | [prev] | [next] | [standalone]
| From | "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> |
|---|---|
| Date | 2024-06-16 12:34 -0700 |
| Message-ID | <v4neot$6b51$4@dont-email.me> |
| In reply to | #386020 |
On 6/16/2024 4:38 AM, Tim Rentsch wrote: [...] Hashing pointers is useful... https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ
[toc] | [prev] | [next] | [standalone]
| From | Bonita Montero <Bonita.Montero@gmail.com> |
|---|---|
| Date | 2024-06-07 20:53 +0200 |
| Message-ID | <v3vkvp$268b2$1@raubtier-asyl.eternal-september.org> |
| In reply to | #385561 |
I tested the AES hash function with a C++20 program. The hash-function
is also re-written in C++-style:
#include <iostream>
#include <utility>
#include <vector>
#include <cstdint>
#include <algorithm>
#include <execution>
#include <thread>
#include <span>
#include <intrin.h>
using namespace std;
uint64_t MichaelsHash( uint64_t key );
int main()
{
static_assert(sizeof(size_t) == 8,"");
// number of buckets
#if defined(NDEBUG)
constexpr ptrdiff_t N_BUCKETS = 1ull << 32;
#else
constexpr ptrdiff_t N_BUCKETS = 1 << 16;
#endif
// don't zero-initialize update-array
union ndi_sz { size_t s; ndi_sz() {} };
vector<ndi_sz> rawUpdates( N_BUCKETS );
// native size_t span to update-array
span<size_t> updates( &rawUpdates[0].s, N_BUCKETS );
int hc = thread::hardware_concurrency();
// thread partitition end to update-array
ptrdiff_t end = N_BUCKETS;
// partitition index to update array
ptrdiff_t p = hc;
vector<jthread> threads;
threads.reserve( hc );
do
{
// begin to update array
ptrdiff_t begin = (ptrdiff_t)((double)--p / hc * N_BUCKETS);
threads.emplace_back( [&]( size_t begin, size_t end )
{
for( ; begin != end; ++begin )
updates[begin] = MichaelsHash( begin ) % N_BUCKETS;
}, begin, end );
// next end to update array is the begin before
end = begin;
} while( p > 0 );
// wait for all threads to finish
threads.resize( 0 );
// sort update array parallel
sort( execution::parallel_policy(), updates.begin(), updates.end() );
vector<uint8_t> buckets( N_BUCKETS );
// update buckets according to the update-list
for( size_t u : updates )
if( !++buckets[u] )
return EXIT_FAILURE;
// release memory of the update list
rawUpdates.clear();
rawUpdates.shrink_to_fit();
// sum up loads
vector<size_t> loads;
for( size_t bucketLoad : buckets )
{
if( loads.size() <= bucketLoad )
loads.resize( bucketLoad + 1 );
++loads[bucketLoad];
}
// print loads percentage
for( size_t ld = 0; ptrdiff_t nLoad : loads )
cout << ld++ << ": " << (double)nLoad / (ptrdiff_t)buckets.size() << endl;
}
uint64_t MichaelsHash( uint64_t key )
{
__m128i xkey = _mm_set_epi64x( key, 42 );
using bar_t = pair<uint64_t, uint64_t>;
static bar_t const bars[8] =
{
{ 0xBB09BBCC90B24BF2, 0x825C622FF2792A01 },
{ 0x94F0535CB06D4060, 0x939C756246DBFD1D },
{ 0x5B835E01A7E14CA1, 0xAC2BDAFC023CDD06 },
{ 0xE0B6A4735B774AEC, 0x9CAFB43E7DDE494C },
};
for( bar_t const &bar : bars )
{
__m128i rkey;
rkey.m128i_u64[0] = bar.first;
rkey.m128i_u64[1] = bar.second;
xkey = _mm_aesenc_si128( xkey, rkey );
}
return xkey.m128i_u64[0];
}
The hash function is tested with 2 ^ 32 buckets. The problem with that
is that the bucket-counters are updated with total random memory access,
which is extremely slow. So I first prepare an update-list, wich is a
vector of size_t's with the indices of the buckets being updated later.
The memory access for this update-list is updated lineary, which is very
fast. The update-list is written with the available number of threads.
This list is sorted after that and quicksorting also has linear memory
acces. To improve that further the quicksort is done with parallel
execution. The update-list takes 2 ^ 32 size_t's, which is 32GiB. The
sorted update list ressults in linear updates of the buckets.
This sounds rather complex, but I first tried it with random updates
to the buckets on a single core and on my 7950X Zen4-CPU I stopped
after 20min.
The resulting load-depths are according to the coupon collector'
problem (1 / (e * N!)):
C:\Users\Boni\Documents\Source\MichaelAesHash>timep
x64\Release\MichaelAesHash.exe
0: 36.7875%
1: 36.7884%
2: 18.3945%
3: 6.13053%
4: 1.53304%
5: 0.306619%
6: 0.0510489%
7: 0.00729787%
8: 0.000908948%
9: 0.000102189%
10: 1.07335e-05%
11: 8.14907e-07%
12: 4.65661e-08%
real 53.783s
user 11m:39.984s
sys 6.141s
cylces 3.446.315.603.052
[toc] | [prev] | [next] | [standalone]
Page 5 of 6 — ← Prev page 1 2 3 4 [5] 6 Next page →
Back to top | Article view | comp.lang.c
csiph-web