Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.c > #384883 > unrolled thread

Good hash for pointers

Started byMalcolm McLean <malcolm.arthur.mclean@gmail.com>
First post2024-05-23 12:11 +0100
Last post2024-05-26 20:24 +0200
Articles 20 on this page of 112 — 15 participants

Back to article view | Back to comp.lang.c


Contents

  Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-23 12:11 +0100
    Re: Good hash for pointers Richard Harnden <richard.nospam@gmail.invalid> - 2024-05-23 15:37 +0100
      Re: Good hash for pointers Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2024-05-23 15:51 -0700
        Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-24 00:42 +0100
    Re: Good hash for pointers Kaz Kylheku <643-408-1753@kylheku.com> - 2024-05-23 20:34 +0000
    Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-23 15:49 -0700
      Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-24 00:43 +0100
        Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-23 16:52 -0700
          Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-24 01:28 +0100
            Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-23 18:39 -0700
              Re: Good hash for pointers bart <bc@freeuk.com> - 2024-05-24 11:14 +0100
                Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-24 12:05 +0100
                  Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-05-24 10:49 -0700
                    Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-05-24 10:51 -0700
                Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-24 06:18 -0700
                  Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-24 15:07 +0100
                    Re: Good hash for pointers scott@slp53.sl.home (Scott Lurndal) - 2024-05-24 14:51 +0000
                    Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-25 02:49 -0700
                Re: Good hash for pointers David Brown <david.brown@hesbynett.no> - 2024-05-24 17:00 +0200
                  Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-24 17:10 +0100
                    Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-05-24 19:27 +0300
            Re: Good hash for pointers David Brown <david.brown@hesbynett.no> - 2024-05-24 09:41 +0200
    Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-05-23 17:32 -0700
      Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-05-23 18:59 -0700
        Re: Good hash for pointers jak <nospam@please.ty> - 2024-05-24 04:09 +0200
    Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-24 20:28 +0200
      Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-24 19:57 +0100
        Re: Good hash for pointers bart <bc@freeuk.com> - 2024-05-25 00:54 +0100
          Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-25 02:12 -0700
            Re: Good hash for pointers bart <bc@freeuk.com> - 2024-05-25 12:28 +0100
              Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-25 11:12 -0700
                Re: Good hash for pointers bart <bc@freeuk.com> - 2024-05-25 20:31 +0100
                  Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-25 22:54 -0700
            Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-25 17:00 +0200
              Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-25 10:40 -0700
                Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-25 18:56 +0100
                  Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-25 11:23 -0700
                  Re: Good hash for pointers Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2024-05-25 23:13 +0200
                    Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-25 23:07 +0100
                      Re: Good hash for pointers bart <bc@freeuk.com> - 2024-05-25 23:42 +0100
                        Re: Good hash for pointers Richard Harnden <richard.nospam@gmail.invalid> - 2024-05-26 19:58 +0100
                        Re: Good hash for pointers Kaz Kylheku <643-408-1753@kylheku.com> - 2024-05-26 22:42 +0000
                  Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 18:05 +0200
                    Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 18:07 +0200
                Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 18:04 +0200
                  Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-26 09:24 -0700
                    Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 18:36 +0200
                      Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-26 10:20 -0700
                        Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 19:39 +0200
                          Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 19:54 +0200
                          Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-27 08:07 +0200
                            Re: Good hash for pointers Ben Bacarisse <ben@bsb.me.uk> - 2024-05-28 11:07 +0100
                              Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-30 10:10 +0200
                                Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-30 11:27 +0200
                                Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-30 19:26 -0700
                          Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-05-30 19:27 -0700
                            Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-02 10:45 +0300
                              Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-06-02 12:42 -0700
                                Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-06-03 12:35 -0700
                              Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-02 16:02 -0700
                                Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-03 10:50 +0300
                                  Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-03 18:02 -0700
                                    Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-04 11:38 +0300
                                      Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-04 22:10 -0700
                              Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-03 16:34 +0200
                                Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-03 17:46 +0300
                                  Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-03 17:54 +0200
                                    Re: Good hash for pointers bart <bc@freeuk.com> - 2024-06-03 17:24 +0100
                                      Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-03 20:16 +0300
                                        Re: Good hash for pointers bart <bc@freeuk.com> - 2024-06-03 19:48 +0100
                                          Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-03 22:41 +0300
                                          Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-03 22:51 +0300
                                          Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-03 16:51 -0700
                                        Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-03 17:01 -0700
                                      Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-03 20:25 +0200
                                    Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-03 19:50 +0300
                                      Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-03 20:31 +0200
                        Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 00:59 +0300
                          Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-05 11:10 +0200
                            Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 12:34 +0300
                              Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-05 12:05 +0200
                                Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 13:11 +0300
                          Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-05 08:58 -0700
                            AES problem (was: Good hash for pointers) Michael S <already5chosen@yahoo.com> - 2024-06-05 19:51 +0300
                              Re: AES problem (was: Good hash for pointers) Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-05 21:24 -0700
                            Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-05 19:59 +0300
                              Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-05 21:40 -0700
                                Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-06 11:00 +0300
                                  Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-17 00:56 -0700
                                    Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-17 12:39 +0300
                                      Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-18 10:47 -0700
                                        Re: Good hash for pointers Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2024-06-18 15:17 -0700
                                          Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-18 16:17 -0700
                                        Re: Good hash for pointers James Kuyper <jameskuyper@alumni.caltech.edu> - 2024-06-18 19:23 -0400
                                          Re: Good hash for pointers Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2024-06-18 17:17 -0700
                                          Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-23 11:23 -0700
                            Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-06 13:35 +0300
                              Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-16 04:38 -0700
                                Re: Good hash for pointers "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-06-16 12:34 -0700
                          Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-07 20:53 +0200
                            Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-09 13:35 +0200
                              Re: Good hash for pointers Richard Harnden <richard.nospam@gmail.invalid> - 2024-06-09 12:40 +0100
                                Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-09 15:09 +0200
                              Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-06-10 01:34 +0100
                                Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-09 18:31 -0700
                                  Re: Good hash for pointers Michael S <already5chosen@yahoo.com> - 2024-06-10 15:14 +0300
                                    Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-10 14:40 +0200
                                    Re: Good hash for pointers Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-06-15 20:32 -0700
                                Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-06-10 09:35 +0200
    Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 20:06 +0200
      Re: Good hash for pointers Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2024-05-26 20:10 +0100
    Re: Good hash for pointers Bonita Montero <Bonita.Montero@gmail.com> - 2024-05-26 20:24 +0200

Page 5 of 6 — ← Prev page 1 2 3 4 [5] 6  Next page →


#385586

FromBonita Montero <Bonita.Montero@gmail.com>
Date2024-06-05 12:05 +0200
Message-ID<v3pd8i$tutm$1@raubtier-asyl.eternal-september.org>
In reply to#385585
Am 05.06.2024 um 11:34 schrieb Michael S:

> I have no idea what you are talking about.
> 18446744073709551557 == -59.
> That's a very small number. 

You should take the value not signed. Addition and subtraction work
the same signed and unsigned, but not multiplication and division.
As I've shown you get a perfekt equal distribution if the input
also has a equal distribution.

[toc] | [prev] | [next] | [standalone]


#385587

FromMichael S <already5chosen@yahoo.com>
Date2024-06-05 13:11 +0300
Message-ID<20240605131124.00005f5e@yahoo.com>
In reply to#385586
On Wed, 5 Jun 2024 12:05:07 +0200
Bonita Montero <Bonita.Montero@gmail.com> wrote:

> Am 05.06.2024 um 11:34 schrieb Michael S:
> 
> > I have no idea what you are talking about.
> > 18446744073709551557 == -59.
> > That's a very small number.   
> 
> You should take the value not signed. Addition and subtraction work
> the same signed and unsigned, but not multiplication and division.
> As I've shown you get a perfekt equal distribution if the input
> also has a equal distribution.

You had not shown shite.
If you can not understand simple things on paper then run my test
bench.

[toc] | [prev] | [next] | [standalone]


#385597

FromTim Rentsch <tr.17687@z991.linuxsc.com>
Date2024-06-05 08:58 -0700
Message-ID<86a5jzmle1.fsf@linuxsc.com>
In reply to#385561
Michael S <already5chosen@yahoo.com> writes:

> On Sun, 26 May 2024 10:20:55 -0700
> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

[..comments on a hash function that simply multiplies the key
 by a large prime (18446744073709551557) to give the hash
 value, and noting that tests have shown poor performance..]

> 18446744073709551557 is indeed very bad (too close to 2**64).
> But I tried other prime and at least in my simple tests it performs
> rather well.
> May be my tests are to simplistic, I don't pretend to understand much
> about it.
>
> [lots of C code]
>
> Compiled as:
> gcc -s -Wall -O2 -march=native ptr_hash.c hash_bonita.c hash_aes4.c

I couldn't get the AES-based hash function to compile.  There
was a complaint about not being able to inline an 'always-inline'
function.  I played around with it for a while but didn't get
anywhere.

I did get your own hash function out and put it into my little
test rig.  There may be summary comments below.

> Run as:
> ./a 100000 75000 80 1000
>
> Result:
> ref: 75000 keys, 100000 slots, 52799 occupied 17431 collisions.  [...]
> uut: 75000 keys, 100000 slots, 53201 occupied 17139 collisions.  [...]

Both of these occupancy rates are close to the expected theoretical
average, which I calculated to be 52764.  If I had been more
ambitious I might have done some statistical sampling to get some
idea of the standard deviation, but I wasn't that energetic
today. :)

If we're trying to write a general hash function, it should work
well across a wide range of input key sets, and also for different
kinds of hash table management.  Let's look at the output side
first.  There are two independent axes for hash tables:  table
size, and direct mapping versus using rehashing.  For table size,
the two prominent choices are some prime in the appropriate range,
or else a power of two.  Clearly any size could be chosen, but
primes have the nice property that they divide up the space of hash
values very evenly, whereas powers of two allow a cheaper way of
reducing a hash value to the range of table indices (anding versus a
modulo operation).  A good hash function should work well with both.
The choice of direct mapping versus rehashing depends on expected
occupancy rates (which I am used to calling "load factor").  For a
load factor of 75 to 80 percent or less, generally rehashing is
better.  Direct mapping has the advantage that it can handle load
factors above 100%, at a cost of dealing with whatever subsidiary
data structure (usually linked lists) is used when two or more keys
map to the same index in the hash table.  (I expect you know all
that, I wrote it mostly just to clarify my own thinking.)

For the recent testing I did, I chose a power of two (65536) for the
table size, and rehashing rather than direct mapping.  Both of these
choices make it easier to evaluate how well hash functions perform.
Having a power-of-two table size reveals weaknesses in the hash
function more readily than taking the hash value modulo a prime;
using rehashing allows an easy computation of how many probes on
average are needed to do an insertion, which is an easier metric
to use for comparisons than occupancy rates (which typically need
some statistical calculations to appreciate the meaning of different
resulting occupancy rates).

Let me emphasize that I don't mean to advocate any of the different
hash table models over the others.  The key point is that a good
hash function should work with all of them.

On the input side, it's important to have a wide range of input
key sets, with different sorts of distributions.  The input key
sets might include:  linear addresses out of a single array, so
a+i*m with i in the range [0..k], for different values of m (to
be thorough all m in the range [1..64], all multiples of 2 from
64 to 128, all multiples of 4 from 128 to 256, all multiples of 8
from 256 to 512, and so forth up to all multiples of 64 up from
2048 to 4096);  results of malloc() using fixed sizes (all of the
m values from the last example);  results of malloc() using a
range of sizes, for several kinds of random distributions);
results of malloc() using linearly increasing sizes; and other
kinds of variations.  In my testing I used several fixed sizes to
malloc() all less than 60;  rand()%997 + 43 for malloc() sizes;
malloc( i/7 + 1 ) for each key value sub i;  malloc( i/17 + 1 )
for each key value sub i;  linear addressing in an array, with
various multipliers (ad hoc rather than systematic choices);
quadratic addressing in an array (which means array + i*i*m, for
index i and multiplier choice m) -- here again the choices for
m were ad hoc rather than system.

An interesting result under the "in array" choices is that they
sometimes would change from run to run, presumably as the array was
put at different starting positions.  Sometimes the differences were
surprisingly large.

Like I said before, a good hash function should produce average
or near-average values for all of these scenarios.  It might be
nice to see some above-average results under some workloads, but
they don't make up for well below-average results under other
workloads.  The situation is pretty much the opposite of what
happens with quicksort:  the worst case behavior of quicksort is
quadratic, but we know that in practice that almost never
happens, so quicksort is used because its typical or average
performance is better than that of other choices.  With hashing,
better-than-average performance is rare, and doesn't give much
benefit, but worse-than-average performance is more likely, and
can be disastrous.  Obviously there can be other factors that
might motivate other choices in some situations (eg, perfect
hashing), but for a general-use hash function the first design
criterion should be close-to-average behavior under all scenarios
tested.

[toc] | [prev] | [next] | [standalone]


#385598 — AES problem (was: Good hash for pointers)

FromMichael S <already5chosen@yahoo.com>
Date2024-06-05 19:51 +0300
SubjectAES problem (was: Good hash for pointers)
Message-ID<20240605195159.00002f0e@yahoo.com>
In reply to#385597
On Wed, 05 Jun 2024 08:58:46 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

> Michael S <already5chosen@yahoo.com> writes:
> 
> 
> I couldn't get the AES-based hash function to compile.  There
> was a complaint about not being able to inline an 'always-inline'
> function.  I played around with it for a while but didn't get
> anywhere.
> 

Not that using my reference hash is particularly important (I needed it
as a quick way to get something I could rely on; you already have
something else with similar properties) but it's still worth
investigation.

There are two possibilities:
1. gcc compiler's -march=native logic incorrectly detects CPU type on
your computer.
2. you CPU really does not support AES instructions.

If the former, you can force compiler's hand with -maes flag.
If the later, then you can compile it, but wouldn't be able to run my
"reference" hash function on this particular computer

What CPU are you using?


[toc] | [prev] | [next] | [standalone]


#385608 — Re: AES problem (was: Good hash for pointers)

FromTim Rentsch <tr.17687@z991.linuxsc.com>
Date2024-06-05 21:24 -0700
SubjectRe: AES problem (was: Good hash for pointers)
Message-ID<865xumn1fb.fsf@linuxsc.com>
In reply to#385598
Michael S <already5chosen@yahoo.com> writes:

> On Wed, 05 Jun 2024 08:58:46 -0700
> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>
>> Michael S <already5chosen@yahoo.com> writes:
>>
>>
>> I couldn't get the AES-based hash function to compile.  There
>> was a complaint about not being able to inline an 'always-inline'
>> function.  I played around with it for a while but didn't get
>> anywhere.
>
> Not that using my reference hash is particularly important (I needed it
> as a quick way to get something I could rely on;  you already have
> something else with similar properties) but it's still worth
> investigation.
>
> There are two possibilities:
> 1. gcc compiler's -march=native logic incorrectly detects CPU type on
> your computer.
> 2. you CPU really does not support AES instructions.
>
> If the former, you can force compiler's hand with -maes flag.
> If the later, then you can compile it, but wouldn't be able to run my
> "reference" hash function on this particular computer

I finally got it to work.  I tried -maes before, and it still
failed, but maybe I was using some other option that messed
things up.  In any case it works now, and the AES hash gives
great results AFAICT.

[toc] | [prev] | [next] | [standalone]


#385599

FromMichael S <already5chosen@yahoo.com>
Date2024-06-05 19:59 +0300
Message-ID<20240605195905.00002484@yahoo.com>
In reply to#385597
On Wed, 05 Jun 2024 08:58:46 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

> 
> I did get your own hash function out and put it into my little
> test rig.  There may be summary comments below.
> 

<snip>

As you probably already paid attention, I like bottom lines.
What is a bottom line here?
Did you you encounter cases in which almost-bonita's-but-not-quite
hash function performed badly?

[toc] | [prev] | [next] | [standalone]


#385609

FromTim Rentsch <tr.17687@z991.linuxsc.com>
Date2024-06-05 21:40 -0700
Message-ID<86y17ilm4k.fsf@linuxsc.com>
In reply to#385599
Michael S <already5chosen@yahoo.com> writes:

> On Wed, 05 Jun 2024 08:58:46 -0700
> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>
>> I did get your own hash function out and put it into my little
>> test rig.  There may be summary comments below.
>
> <snip>
>
> As you probably already paid attention, I like bottom lines.
> What is a bottom line here?

The bottom line is both the multiply-by-a-large-prime hashes
should be avoided.

> Did you you encounter cases in which almost-bonita's-but-not-quite
> hash function performed badly?

Yes.  A clear example is using directly mapped hash table that
has a power of two elements, with inputs all from malloc( 48 )
calls.  All keys map to one of only 1024 entries, in a hash
table that has 65536 slots.

[toc] | [prev] | [next] | [standalone]


#385615

FromMichael S <already5chosen@yahoo.com>
Date2024-06-06 11:00 +0300
Message-ID<20240606110009.00001096@yahoo.com>
In reply to#385609
On Wed, 05 Jun 2024 21:40:27 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

> Michael S <already5chosen@yahoo.com> writes:
> 
> > On Wed, 05 Jun 2024 08:58:46 -0700
> > Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
> >  
> >> I did get your own hash function out and put it into my little
> >> test rig.  There may be summary comments below.  
> >
> > <snip>
> >
> > As you probably already paid attention, I like bottom lines.
> > What is a bottom line here?  
> 
> The bottom line is both the multiply-by-a-large-prime hashes
> should be avoided.
>

I am not convinced.
So far I had seen no disadvantages for particular application mentioned
by Malcolm.
Of course, it is no good for Malcolm, because it can't be coded in what
Malcolm erroneously call "ANSI C". It is not easy to be coded even in
ISO C11. But in my own practice it does not matter. I happily use
language extensions, like gnu __int128 or Microsoft's __umulh() when
they are the best tool for a job. 

> > Did you you encounter cases in which almost-bonita's-but-not-quite
> > hash function performed badly?  
> 
> Yes.  A clear example is using directly mapped hash table that
> has a power of two elements, with inputs all from malloc( 48 )
> calls.  All keys map to one of only 1024 entries, in a hash
> table that has 65536 slots.

I don't see it.
$ ./a 0x10000 0xC000 48
ref: 49152 keys, 65536 slots, 34590 occupied 11295 collisions. Worst
collision: 7 
uut: 49152 keys, 65536 slots, 35791 occupied 11872 collisions. Worst
collision: 3

It sounds like in your tests you don't use 'linear scaling to table
size' for translation of hash value to table index. IMHO, you should.
May be, for some particular hash functions other methods of translation
also work, but I see no reasons to use other methods. This one is quite
fast, robust and looks to me like the most logical. For power-of-2 sized
tables it degenerates into simple right shift.


[toc] | [prev] | [next] | [standalone]


#386066

FromTim Rentsch <tr.17687@z991.linuxsc.com>
Date2024-06-17 00:56 -0700
Message-ID<86zfrkj93b.fsf@linuxsc.com>
In reply to#385615
Michael S <already5chosen@yahoo.com> writes:

> On Wed, 05 Jun 2024 21:40:27 -0700
> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>
>> Michael S <already5chosen@yahoo.com> writes:
>>
>>> On Wed, 05 Jun 2024 08:58:46 -0700
>>> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>>>
>>>> I did get your own hash function out and put it into my little
>>>> test rig.  There may be summary comments below.
>>>
>>> <snip>
>>>
>>> As you probably already paid attention, I like bottom lines.
>>> What is a bottom line here?
>>
>> The bottom line is both the multiply-by-a-large-prime hashes
>> should be avoided.
>
> I am not convinced.

Let me amend my earlier statement.  A hash function that simply
multiplies its input by an integer constant should be avoided
if the goal is to produce a hash function of decent quaility
rather than one of mediocre quality.

> Of course, it is no good for Malcolm, because it can't be coded
> in what Malcolm erroneously call "ANSI C".

I don't know why you say that.  C was an ANSI standard before it
was an ISO standard.  Or is it that you think that the language
Malcolm is intent on using does not conform to C90/C89/ANSI C?


>>> Did you you encounter cases in which almost-bonita's-but-not-quite
>>> hash function performed badly?
>>
>> Yes.  A clear example is using directly mapped hash table that
>> has a power of two elements, with inputs all from malloc( 48 )
>> calls.  All keys map to one of only 1024 entries, in a hash
>> table that has 65536 slots.
>
> I don't see it.
> $ ./a 0x10000 0xC000 48
> ref: 49152 keys, 65536 slots, 34590 occupied 11295 collisions.
>    Worst collision: 7
> uut: 49152 keys, 65536 slots, 35791 occupied 11872 collisions.
>    Worst collision: 3

You have run a poor test.  The point of quality evaluation
testing is to look for potential problems, not to disguise
them.

> It sounds like in your tests you don't use 'linear scaling to
> table size' for translation of hash value to table index.
> IMHO, you should.

Like I said in another response, the question of how hash tables
should make use of hash values is a separate topic.  The point
of the tests I ran was to evaluate the quality of hash functions,
not to look at how hash tables should be coded.

Here are some statistics I gathered from a more comprehensive
set of tests for several hash functions.  The tests look at
aggregates of measured values for all output bits (considered
eight bits at a time) as a function of all input bits (considered
sixteen bits at a time.  The first line in each pair summarizes
the low values in each set, and the second line summarizes the
high values in each set, in each case showing the range, average,
and standard deviation.  The first set, hash 0, uses the aes
code you posted.  I think it's easy to see the quality get
worse going from top to bottom.

   hash 0
       186 ..   201   avg:   195.39   dev:  3.71
       312 ..   339   avg:   322.31   dev:  5.85

   hash 1
       186 ..   203   avg:   195.48   dev:  4.01
       310 ..   344   avg:   322.45   dev:  6.31

   hash 2
       160 ..   202   avg:   193.56   dev:  7.70
       311 ..   348   avg:   322.22   dev:  7.91

   hash 3
       143 ..   202   avg:   185.52   dev: 10.15
       311 ..   387   avg:   329.52   dev: 14.10

   hash 4
         0 ..   212   avg:     9.94   dev:    45.16
       270 .. 65536   avg: 44546.56   dev: 28809.47

[toc] | [prev] | [next] | [standalone]


#386074

FromMichael S <already5chosen@yahoo.com>
Date2024-06-17 12:39 +0300
Message-ID<20240617123926.00006a12@yahoo.com>
In reply to#386066
On Mon, 17 Jun 2024 00:56:40 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

> 
> I don't know why you say that.  C was an ANSI standard before it
> was an ISO standard.  Or is it that you think that the language
> Malcolm is intent on using does not conform to C90/C89/ANSI C?
> 
> 

All I wanted to point by this comment is that ANSI recognizes ISO/IEC
9899:2018 as their current C Standard and probably will recognize the
next ISO C Standard pretty soon. For that reason I think that names like
C89 or C90 are preferable (to ANSI C) when we want to refer to this
particular variant of the language.

[toc] | [prev] | [next] | [standalone]


#386188

FromTim Rentsch <tr.17687@z991.linuxsc.com>
Date2024-06-18 10:47 -0700
Message-ID<86ed8ujg7j.fsf@linuxsc.com>
In reply to#386074
Michael S <already5chosen@yahoo.com> writes:

> On Mon, 17 Jun 2024 00:56:40 -0700
> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>
>> I don't know why you say that.  C was an ANSI standard before it
>> was an ISO standard.  Or is it that you think that the language
>> Malcolm is intent on using does not conform to C90/C89/ANSI C?
>
> All I wanted to point by this comment is that ANSI recognizes ISO/IEC
> 9899:2018 as their current C Standard and probably will recognize the
> next ISO C Standard pretty soon.  For that reason I think that names like
> C89 or C90 are preferable (to ANSI C) when we want to refer to this
> particular variant of the language.

I see.  So it isn't that you think "ANSI C" is wrong, just
that it might be misleading or that C89 or C90 is preferable.
Personally I would be surprised if someone used "ANSI C" to
mean anything other than C89/C90, but certainly other people
could have a different reaction.

[toc] | [prev] | [next] | [standalone]


#386203

FromKeith Thompson <Keith.S.Thompson+u@gmail.com>
Date2024-06-18 15:17 -0700
Message-ID<87bk3xzyi4.fsf@nosuchdomain.example.com>
In reply to#386188
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
> Michael S <already5chosen@yahoo.com> writes:
>> On Mon, 17 Jun 2024 00:56:40 -0700
>> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>>
>>> I don't know why you say that.  C was an ANSI standard before it
>>> was an ISO standard.  Or is it that you think that the language
>>> Malcolm is intent on using does not conform to C90/C89/ANSI C?
>>
>> All I wanted to point by this comment is that ANSI recognizes ISO/IEC
>> 9899:2018 as their current C Standard and probably will recognize the
>> next ISO C Standard pretty soon.  For that reason I think that names like
>> C89 or C90 are preferable (to ANSI C) when we want to refer to this
>> particular variant of the language.
>
> I see.  So it isn't that you think "ANSI C" is wrong, just
> that it might be misleading or that C89 or C90 is preferable.
> Personally I would be surprised if someone used "ANSI C" to
> mean anything other than C89/C90, but certainly other people
> could have a different reaction.

The term "ANSI C" almost universally refers to C89/C90.  But someone
not familiar with the term might expect it to mean "the C standard
endorsed by ANSI", which is currently C17.

The term "ANSI C" started out as a way to refer to the newly
standardized language, distinguishing it from pre-standard versions
like the one documented in K&R1.

I don't necessarily complain when someone uses the phrase "ANSI C"
to mean C89/C90, but I try to avoid it myself in favor of "C89" or
"C90".

Hmm.  It occurs to me that "K&R C", which usually refers to the
language defined in K&R1, is also potentially ambiguous.  I'm not
going to worry about it too much.  (One C compiler uses a "-ansic"
command-line option to cause it to attempt to conform to C99.)

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */

[toc] | [prev] | [next] | [standalone]


#386205

FromTim Rentsch <tr.17687@z991.linuxsc.com>
Date2024-06-18 16:17 -0700
Message-ID<86sex9j0ww.fsf@linuxsc.com>
In reply to#386203
Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

> Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
>
>> Michael S <already5chosen@yahoo.com> writes:
>>
>>> On Mon, 17 Jun 2024 00:56:40 -0700
>>> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>>>
>>>> I don't know why you say that.  C was an ANSI standard before it
>>>> was an ISO standard.  Or is it that you think that the language
>>>> Malcolm is intent on using does not conform to C90/C89/ANSI C?
>>>
>>> All I wanted to point by this comment is that ANSI recognizes ISO/IEC
>>> 9899:2018 as their current C Standard and probably will recognize the
>>> next ISO C Standard pretty soon.  For that reason I think that names like
>>> C89 or C90 are preferable (to ANSI C) when we want to refer to this
>>> particular variant of the language.
>>
>> I see.  So it isn't that you think "ANSI C" is wrong, just
>> that it might be misleading or that C89 or C90 is preferable.
>> Personally I would be surprised if someone used "ANSI C" to
>> mean anything other than C89/C90, but certainly other people
>> could have a different reaction.
>
> [...] I don't necessarily complain when someone uses the phrase
> "ANSI C" to mean C89/C90, but I try to avoid it myself in favor
> of "C89" or "C90".

I'm reminded that gcc accepts the option -ansi as a synonym for
the option -std=c90.

[toc] | [prev] | [next] | [standalone]


#386206

FromJames Kuyper <jameskuyper@alumni.caltech.edu>
Date2024-06-18 19:23 -0400
Message-ID<v4t4tq$1j49k$1@dont-email.me>
In reply to#386188
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
...
> I see.  So it isn't that you think "ANSI C" is wrong, just
> that it might be misleading or that C89 or C90 is preferable.

ANSI's documentation is quite clear about the fact that there is, at any
time, only one ANSI C standard, which is the version most recently
approved - the older versions cease to be ANSI standards as soon as
newer ones are approved. C17 is the current ANSI standard for C.
Therefore, using "ANSI C" to mean specifically C89 is inaccurate, unless
the wording makes it clear that it's referring to a time period when
that was the current ANSI C standard.

> Personally I would be surprised if someone used "ANSI C" to
> mean anything other than C89/C90, 

I would expect the term to be used almost exclusively by people who
incorrectly think that it means C89. Since it became an ISO standard
with C90, few people who care about the latest version of the C standard
worry about the parallel ANSI standard. Most people never got into the
habit of using "ISO C" to mean specifically C90, so they didn't need to
break that habit when it was superseded by C99.

[toc] | [prev] | [next] | [standalone]


#386209

FromKeith Thompson <Keith.S.Thompson+u@gmail.com>
Date2024-06-18 17:17 -0700
Message-ID<8734p9zsyt.fsf@nosuchdomain.example.com>
In reply to#386206
James Kuyper <jameskuyper@alumni.caltech.edu> writes:
> Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
[...]
>> Personally I would be surprised if someone used "ANSI C" to
>> mean anything other than C89/C90, 
>
> I would expect the term to be used almost exclusively by people who
> incorrectly think that it means C89. Since it became an ISO standard
> with C90, few people who care about the latest version of the C standard
> worry about the parallel ANSI standard. Most people never got into the
> habit of using "ISO C" to mean specifically C90, so they didn't need to
> break that habit when it was superseded by C99.

I mostly agree, except that I might argue that the meaning of a
phrase can be defined by common usage.

**This is a very minor nitpick.**

In a more formal context, a "string literal" is not a literal whose
value is necessarily a string.  We can't infer the meaning of a
defined phrase from the meanings of its consituent words.

"ANSI C" is defined by informal common usage to mean C89 (that was
the more formal meaning as of 1989, but common usage has evolved).
It's unfortunate that the obvious meaning of the phrase based
on its constituent words is very different, currently C17.
But practically nobody refers to C17 as "ANSI C"; if anything,
it's more appropriately called "ISO C".

If I see "ANSI C", I can be *almost* certain that it refers to
C89/C90 (the same language specified by two different documents).
But if I want to talk about the C89 or C90 standard, I'll use "C89"
or "C90", and if I want to talk about the language they define,
I might use both.

gcc's "-ansi" option was introduced when "ANSI C" was a correct and
reasonable name for C89/C90.  There are too many existing scripts
to allow changing it now.

English am difficult.

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */

[toc] | [prev] | [next] | [standalone]


#386399

FromTim Rentsch <tr.17687@z991.linuxsc.com>
Date2024-06-23 11:23 -0700
Message-ID<86v81zfrhv.fsf@linuxsc.com>
In reply to#386206
James Kuyper <jameskuyper@alumni.caltech.edu> writes:

> Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
> ...
>
>> I see.  So it isn't that you think "ANSI C" is wrong, just
>> that it might be misleading or that C89 or C90 is preferable.
>
> ANSI's documentation is quite clear about the fact that there is, at any
> time, only one ANSI C standard, [...]

I'm not talking about ANSI's documentation.  I'm talking about
how people use the term ANSI C.

[toc] | [prev] | [next] | [standalone]


#385618

FromMichael S <already5chosen@yahoo.com>
Date2024-06-06 13:35 +0300
Message-ID<20240606133534.00005c5d@yahoo.com>
In reply to#385597
On Wed, 05 Jun 2024 08:58:46 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

> 
> If we're trying to write a general hash function, it should work
> well across a wide range of input key sets, and also for different
> kinds of hash table management.  Let's look at the output side
> first.  There are two independent axes for hash tables:  table
> size, and direct mapping versus using rehashing.  For table size,
> the two prominent choices are some prime in the appropriate range,
> or else a power of two.  Clearly any size could be chosen, but
> primes have the nice property that they divide up the space of hash
> values very evenly, whereas powers of two allow a cheaper way of
> reducing a hash value to the range of table indices (anding versus a
> modulo operation).  A good hash function should work well with both.
> The choice of direct mapping versus rehashing depends on expected
> occupancy rates (which I am used to calling "load factor").  For a
> load factor of 75 to 80 percent or less, generally rehashing is
> better.  Direct mapping has the advantage that it can handle load
> factors above 100%, at a cost of dealing with whatever subsidiary
> data structure (usually linked lists) is used when two or more keys
> map to the same index in the hash table.  (I expect you know all
> that, I wrote it mostly just to clarify my own thinking.)
>

I never had interest in implementation of hash methods, typically just
took what language or library provides and used it.
All my knowledge of internals are 50 y.o. news (TAoCP, and no I did not
read it 50 years ago, but I did read it first time slightly less than
40 years ago and not sure if I re-read this particular topic since
then).
So, I am not sure that I understood everything above. 
In particular, w.r.t. rehashing, IIRC, Knuth came to conclusion that
simple circular increment of the index is superior to more elaborate
methods. I don't know if conclusion was changed in the following years,
not even sure that I recollect it correctly.

Despite (or due to?) my relative ignorance of the topic, I'd dare to
disagree with couple of your points:
1. I don't think that hash function should be evaluated in isolation
from the rest of algorithm. IMHO, they should be co-developed. There
is nothing wrong with hash function that works well with one particular
"back end" and poorly with all others as long as limitations are
clearly stated.
2. I don't like "modulo" method of reduction of hash value to range. I
don't like it for any chosen size of the bucket, not just for power of
two, but even for prime. Of course, for power of two size my dislike to
it is deeper.

[toc] | [prev] | [next] | [standalone]


#386020

FromTim Rentsch <tr.17687@z991.linuxsc.com>
Date2024-06-16 04:38 -0700
Message-ID<86cyohktgt.fsf@linuxsc.com>
In reply to#385618
Michael S <already5chosen@yahoo.com> writes:

> On Wed, 05 Jun 2024 08:58:46 -0700
> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>
>> If we're trying to write a general hash function, it should work
>> well across a wide range of input key sets, and also for different
>> kinds of hash table management.  Let's look at the output side
>> first.  There are two independent axes for hash tables:  table
>> size, and direct mapping versus using rehashing.  For table size,
>> the two prominent choices are some prime in the appropriate range,
>> or else a power of two.  Clearly any size could be chosen, but
>> primes have the nice property that they divide up the space of hash
>> values very evenly, whereas powers of two allow a cheaper way of
>> reducing a hash value to the range of table indices (anding versus a
>> modulo operation).  A good hash function should work well with both.
>> The choice of direct mapping versus rehashing depends on expected
>> occupancy rates (which I am used to calling "load factor").  For a
>> load factor of 75 to 80 percent or less, generally rehashing is
>> better.  Direct mapping has the advantage that it can handle load
>> factors above 100%, at a cost of dealing with whatever subsidiary
>> data structure (usually linked lists) is used when two or more keys
>> map to the same index in the hash table.  (I expect you know all
>> that, I wrote it mostly just to clarify my own thinking.)
>
> I never had interest in implementation of hash methods, typically
> just took what language or library provides and used it.
> All my knowledge of internals are 50 y.o.  news (TAoCP, and no I
> did not read it 50 years ago, but I did read it first time
> slightly less than 40 years ago and not sure if I re-read this
> particular topic since then).
> So, I am not sure that I understood everything above.
> In particular, w.r.t. rehashing, IIRC, Knuth came to conclusion
> that simple circular increment of the index is superior to more
> elaborate methods.  I don't know if conclusion was changed in the
> following years, not even sure that I recollect it correctly.

Hashing is discussed in volume 3 of TAOCP.  I first read volume 3
just after it came out;  I reviewed the section on hashing again
after reading your message.  Your remembrance doesn't match what
is said in that section.  Moreover some of the conclusions Knuth
wrote in 1973 are certainly out of date now.  General rehashing,
also called open addressing, is overall a net win, especially for
hash tables that have an above-average load factor.  Also there
are variations such as cuckoo hash that depend on using open
addressing.

> Despite (or due to?) my relative ignorance of the topic, I'd dare
> to disagree with couple of your points:
> 1. I don't think that hash function should be evaluated in
> isolation from the rest of algorithm.  IMHO, they should be
> co-developed.  There is nothing wrong with hash function that
> works well with one particular "back end" and poorly with all
> others as long as limitations are clearly stated.

There are several reasons why a co-development approach usually
isn't a good way to go.  Malcolm started this thread by asking for
a good hash function for pointers (to be used in a portable ANSI C
setting).  We don't know, nor do we have any control over, how
Malcolm might use the function (and that goes double for any random
reader of comp.lang.c who might see some posted hash function and
decide to use it).  Or we might want to use some sort of generic
table manager that takes a pointer-to-function to do the hashing,
but no way to affect how the hash values are used.  Another reason
has to do with total code size.  If there are M data types to be
hashed, and N applications that use hash values, coupling the code
that does the hashing with the code that makes use of the hash
values means there will be M*N parts to program, whereas separating
the two aspects means there will be only M+N parts to program.  The
M+N path almost certainly means a big reduction in the overall
amount of effort needed.

Of course there are some specialized applications, such as perfect
hashing, where the hashing method needs to be tailored to some sort
of external considerations.  In most cases though a general hashing
function is a better choice.

As for how evaluation should be done, it's important to evaluate
hash functions in isolation for the same reasons it's important to
evaluate random number generators in isolation:  the results need
to be repeatable, and they need to give a quantitative measure of
how "mixed up" the outputs are for similar inputs.  Knuth talks
about random numbers in volume 2 of TAOCP, spending more than 100
pages on the subject, but he gives an essential precept before the
end of page 5: "Random numbers should not be generated with a
method chosen at random.  Some theory should be used."  The same is
true of hash functions, and a repeatable quantitative evaluation of
the function by itself is a key element of that process.

> 2. I don't like "modulo" method of reduction of hash value to
> range.  I don't like it for any chosen size of the bucket, not
> just for power of two, but even for prime.  Of course, for power
> of two size my dislike to it is deeper.

I understand your reaction, but this question is a separate topic.
Part of the reason for designing a high-quality hash function is so
that it doesn't matter which bits are used:  high order bits, low
order bits, some of each, every other bit, or every third or fourth
bit, they should all give good results (where "good" means nearby
inputs give what appear to be uncorrelated outputs).  If we design
our hash function to do that then it doesn't matter whether a hash
table does a mod or uses a different method.  Obviously that is
better than a hash function that works okay with some hash tables
but works poorly with others.

[toc] | [prev] | [next] | [standalone]


#386037

From"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>
Date2024-06-16 12:34 -0700
Message-ID<v4neot$6b51$4@dont-email.me>
In reply to#386020
On 6/16/2024 4:38 AM, Tim Rentsch wrote:
[...]

Hashing pointers is useful...

https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ

[toc] | [prev] | [next] | [standalone]


#385719

FromBonita Montero <Bonita.Montero@gmail.com>
Date2024-06-07 20:53 +0200
Message-ID<v3vkvp$268b2$1@raubtier-asyl.eternal-september.org>
In reply to#385561
I tested the AES hash function with a C++20 program. The hash-function
is also re-written in C++-style:

#include <iostream>
#include <utility>
#include <vector>
#include <cstdint>
#include <algorithm>
#include <execution>
#include <thread>
#include <span>
#include <intrin.h>

using namespace std;

uint64_t MichaelsHash( uint64_t key );

int main()
{
	static_assert(sizeof(size_t) == 8,"");
	// number of buckets
#if defined(NDEBUG)
	constexpr ptrdiff_t N_BUCKETS = 1ull << 32;
#else
	constexpr ptrdiff_t N_BUCKETS = 1 << 16;
#endif
	// don't zero-initialize update-array
	union ndi_sz { size_t s; ndi_sz() {} };
	vector<ndi_sz> rawUpdates( N_BUCKETS );
	// native size_t span to update-array
	span<size_t> updates( &rawUpdates[0].s, N_BUCKETS );
	int hc = thread::hardware_concurrency();
	// thread partitition end to update-array
	ptrdiff_t end = N_BUCKETS;
	// partitition index to update array
	ptrdiff_t p = hc;
	vector<jthread> threads;
	threads.reserve( hc );
	do
	{
		// begin to update array
		ptrdiff_t begin = (ptrdiff_t)((double)--p / hc * N_BUCKETS);
		threads.emplace_back( [&]( size_t begin, size_t end )
			{
				for( ; begin != end; ++begin )
					updates[begin] = MichaelsHash( begin ) % N_BUCKETS;
			}, begin, end );
		// next end to update array is the begin before
		end = begin;
	} while( p > 0 );
	// wait for all threads to finish
	threads.resize( 0 );
	// sort update array parallel
	sort( execution::parallel_policy(), updates.begin(), updates.end() );
	vector<uint8_t> buckets( N_BUCKETS );
	// update buckets according to the update-list
	for( size_t u : updates )
		if( !++buckets[u] )
			return EXIT_FAILURE;
	// release memory of the update list
	rawUpdates.clear();
	rawUpdates.shrink_to_fit();
	// sum up loads
	vector<size_t> loads;
	for( size_t bucketLoad : buckets )
	{
		if( loads.size() <= bucketLoad )
			loads.resize( bucketLoad + 1 );
		++loads[bucketLoad];
	}
	// print loads percentage
	for( size_t ld = 0; ptrdiff_t nLoad : loads )
		cout << ld++ << ": " << (double)nLoad / (ptrdiff_t)buckets.size() << endl;
}

uint64_t MichaelsHash( uint64_t key )
{
	__m128i xkey = _mm_set_epi64x( key, 42 );
	using bar_t = pair<uint64_t, uint64_t>;
	static bar_t const bars[8] =
	{
		{ 0xBB09BBCC90B24BF2, 0x825C622FF2792A01 },
		{ 0x94F0535CB06D4060, 0x939C756246DBFD1D },
		{ 0x5B835E01A7E14CA1, 0xAC2BDAFC023CDD06 },
		{ 0xE0B6A4735B774AEC, 0x9CAFB43E7DDE494C },
	};
	for( bar_t const &bar : bars )
	{
		__m128i rkey;
		rkey.m128i_u64[0] = bar.first;
		rkey.m128i_u64[1] = bar.second;
		xkey = _mm_aesenc_si128( xkey, rkey );
	}
	return xkey.m128i_u64[0];
}

The hash function is tested with 2 ^ 32 buckets. The problem with that
is that the bucket-counters are updated with total random memory access,
which is extremely slow. So I first prepare an update-list, wich is a
vector of size_t's with the indices of the buckets being updated later.
The memory access for this update-list is updated lineary, which is very
fast. The update-list is written with the available number of threads.
This list is sorted after that and quicksorting also has linear memory
acces. To improve that further the quicksort is done with parallel
execution. The update-list takes 2 ^ 32 size_t's, which is 32GiB. The
sorted update list ressults in linear updates of the buckets.
This sounds rather complex, but I first tried it with random updates
to the buckets on a single core and on my 7950X Zen4-CPU I stopped
after 20min.
The resulting load-depths are according to the coupon collector'
problem (1 / (e * N!)):

C:\Users\Boni\Documents\Source\MichaelAesHash>timep 
x64\Release\MichaelAesHash.exe
0: 36.7875%
1: 36.7884%
2: 18.3945%
3: 6.13053%
4: 1.53304%
5: 0.306619%
6: 0.0510489%
7: 0.00729787%
8: 0.000908948%
9: 0.000102189%
10: 1.07335e-05%
11: 8.14907e-07%
12: 4.65661e-08%
real   53.783s
user   11m:39.984s
sys    6.141s
cylces 3.446.315.603.052

[toc] | [prev] | [next] | [standalone]


Page 5 of 6 — ← Prev page 1 2 3 4 [5] 6  Next page →

Back to top | Article view | comp.lang.c


csiph-web