Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #16431 > unrolled thread

Re: order independent hash?

Started byPeter Otten <__peter__@web.de>
First post2011-11-30 13:47 +0100
Last post2011-12-02 04:33 +0000
Articles 11 on this page of 51 — 14 participants

Back to article view | Back to comp.lang.python

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: order independent hash? Peter Otten <__peter__@web.de> - 2011-11-30 13:47 +0100
    Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-01 07:35 -0800
    Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-01 07:35 -0800
      Re: order independent hash? Dave Angel <d@davea.name> - 2011-12-01 10:52 -0500
        Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-01 08:17 -0800
        Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-01 08:17 -0800
      Re: order independent hash? Chris Angelico <rosuav@gmail.com> - 2011-12-02 03:18 +1100
        Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-01 20:29 -0800
          Re: order independent hash? Chris Angelico <rosuav@gmail.com> - 2011-12-02 16:00 +1100
            Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-01 21:48 -0800
              Re: order independent hash? Lie Ryan <lie.1296@gmail.com> - 2011-12-05 00:40 +1100
            Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-01 21:48 -0800
              Re: order independent hash? Roel Schroeven <rschroev_nospam_ml@fastmail.fm> - 2011-12-04 14:41 +0100
                Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-04 07:39 -0800
                  Re: order independent hash? Ian Kelly <ian.g.kelly@gmail.com> - 2011-12-04 10:41 -0700
                    Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-04 10:06 -0800
                      Re: order independent hash? Ian Kelly <ian.g.kelly@gmail.com> - 2011-12-04 13:13 -0700
                        Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-04 15:17 -0800
                        Re: order independent hash? Matt Joiner <anacrolix@gmail.com> - 2011-12-05 10:21 +1100
                        Re: order independent hash? Ian Kelly <ian.g.kelly@gmail.com> - 2011-12-04 16:24 -0700
                          Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-04 16:52 -0800
                            Re: order independent hash? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-12-05 02:00 +0000
                          Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-04 16:52 -0800
                            Re: order independent hash? Lie Ryan <lie.1296@gmail.com> - 2011-12-05 12:59 +1100
                            Re: order independent hash? Ethan Furman <ethan@stoneleaf.us> - 2011-12-04 18:08 -0800
                              Re: order independent hash? Ben Finney <ben+python@benfinney.id.au> - 2011-12-05 15:52 +1100
                            Re: order independent hash? Matt Joiner <anacrolix@gmail.com> - 2011-12-05 14:02 +1100
                          Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-06 21:24 -0800
                          Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-06 21:24 -0800
                        Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-04 15:17 -0800
                          Re: order independent hash? Terry Reedy <tjreedy@udel.edu> - 2011-12-04 19:28 -0500
                    Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-04 10:06 -0800
            Re: order independent hash? Hrvoje Niksic <hniksic@xemacs.org> - 2011-12-02 10:53 +0100
              Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-02 03:48 -0800
              Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-02 09:00 -0800
              Re: order independent hash? Terry Reedy <tjreedy@udel.edu> - 2011-12-02 16:40 -0500
                Re: order independent hash? Hrvoje Niksic <hniksic@xemacs.org> - 2011-12-04 14:55 +0100
                  Re: order independent hash? Chris Angelico <rosuav@gmail.com> - 2011-12-05 01:08 +1100
                    Re: order independent hash? Hrvoje Niksic <hniksic@xemacs.org> - 2011-12-07 14:28 +0100
                      Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-07 09:48 -0800
                  Re: order independent hash? Tim Chase <python.list@tim.thechases.com> - 2011-12-04 12:57 -0600
                    Re: order independent hash? Hrvoje Niksic <hniksic@xemacs.org> - 2011-12-08 10:30 +0100
                      Re: order independent hash? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-12-09 11:27 +0000
                        Re: order independent hash? Hrvoje Niksic <hniksic@xemacs.org> - 2011-12-09 17:51 +0100
                          Re: order independent hash? Chris Angelico <rosuav@gmail.com> - 2011-12-10 04:13 +1100
                        Re: order independent hash? Lie Ryan <lie.1296@gmail.com> - 2011-12-11 10:58 +1100
                        Re: order independent hash? Chris Angelico <rosuav@gmail.com> - 2011-12-11 11:17 +1100
                        Re: order independent hash? Lie Ryan <lie.1296@gmail.com> - 2011-12-11 14:04 +1100
          Re: order independent hash? Lie Ryan <lie.1296@gmail.com> - 2011-12-05 00:29 +1100
        Re: order independent hash? 88888 Dihedral <dihedral88888@googlemail.com> - 2011-12-01 20:29 -0800
        Re: order independent hash? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-12-02 04:33 +0000

Page 3 of 3 — ← Prev page 1 2 [3]


#16624

FromTim Chase <python.list@tim.thechases.com>
Date2011-12-04 12:57 -0600
Message-ID<mailman.3277.1323025080.27778.python-list@python.org>
In reply to#16608
On 12/04/11 08:08, Chris Angelico wrote:
> 2011/12/5 Hrvoje Niksic<hniksic@xemacs.org>:
>> If a Python
>> implementation tried to implement dict as a tree, instances of classes
>> that define only __eq__ and __hash__ would not be correctly inserted in
>> such a dict.
>
> Couldn't you just make a tree of hash values? Okay, that's probably
> not the most useful way to do things, but technically it'd comply with
> the spec.

 From an interface perspective, I suppose it would work.  However 
one of the main computer-science reasons for addressing by a hash 
is to get O(1) access to items (modulo pessimal hash 
structures/algorithms which can approach O(N) if everything 
hashes to the same value/bucket), rather than the O(logN) time 
you'd get from a tree. So folks reaching for a hash/map might be 
surprised if performance degraded with the size of the contents.

-tkc


[toc] | [prev] | [next] | [standalone]


#16813

FromHrvoje Niksic <hniksic@xemacs.org>
Date2011-12-08 10:30 +0100
Message-ID<87y5un9zs6.fsf@xemacs.org>
In reply to#16624
Tim Chase <python.list@tim.thechases.com> writes:

> From an interface perspective, I suppose it would work.  However one
> of the main computer-science reasons for addressing by a hash is to
> get O(1) access to items (modulo pessimal hash structures/algorithms
> which can approach O(N) if everything hashes to the same
> value/bucket), rather than the O(logN) time you'd get from a tree. So
> folks reaching for a hash/map might be surprised if performance
> degraded with the size of the contents.

In a language like Python, the difference between O(1) and O(log n) is
not the primary reason why programmers use dict; they use it because
it's built-in, efficient compared to alternatives, and convenient to
use.  If Python dict had been originally implemented as a tree, I'm sure
it would be just as popular.

Omitting the factor of O(log n) as functionally equivalent to O(1) is
applicable to many situations and is sometimes called "soft-O" notation.
One example from practice is the pre-2011 C++, where the standardization
committee failed to standardize hash tables on time for the 1998
standard.  Although this was widely recognized as an oversight, a large
number of programs simply used tree-based std::maps and never noticed a
practical difference between between average-constant-time and
logarithmic complexity lookups.

[toc] | [prev] | [next] | [standalone]


#16907

FromSteven D'Aprano <steve+comp.lang.python@pearwood.info>
Date2011-12-09 11:27 +0000
Message-ID<4ee1f093$0$29977$c3e8da3$5496439d@news.astraweb.com>
In reply to#16813
On Thu, 08 Dec 2011 10:30:01 +0100, Hrvoje Niksic wrote:

> In a language like Python, the difference between O(1) and O(log n) is
> not the primary reason why programmers use dict; they use it because
> it's built-in, efficient compared to alternatives, and convenient to
> use.  If Python dict had been originally implemented as a tree, I'm sure
> it would be just as popular.

Except for people who needed dicts with tens of millions of items.

Remember also that dicts are used for looking up names in Python. Nearly 
all method calls, attribute accesses, global name lookups, function 
calls, etc. go through at least one and potentially multiple dict 
lookups. The simple statement:

n = len(x.y) + len(z)

likely requires nine dict lookups, and potentially more. In even a small 
application, there could be tens of millions of dict lookups; changing 
each of them from O(1) to O(log N) could result in a measurable slowdown 
to Python code in real applications. That is why dicts are highly 
optimized for speed.

As fast as dicts are, sometimes they aren't fast enough. One common micro-
optimization for tight loops and time-critical code is to create local 
variables from globals or builtins, because local variable access 
bypasses dict lookup. So people would notice if dicts were slower.


-- 
Steven

[toc] | [prev] | [next] | [standalone]


#16923

FromHrvoje Niksic <hniksic@xemacs.org>
Date2011-12-09 17:51 +0100
Message-ID<87ty59adtx.fsf@xemacs.org>
In reply to#16907
Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:

> Except for people who needed dicts with tens of millions of items.

Huge tree-based dicts would be somewhat slower than today's hash-based
dicts, but they would be far from unusable.  Trees are often used to
organize large datasets for quick access.

The case of dicts which require frequent access, such as those used to
implement namespaces, is different, and more interesting.  Those dicts
are typically quite small, and for them the difference between O(log n)
and O(1) is negligible in both theory (since n is "small", i.e. bounded)
and practice.  In fact, depending on the details of the implementation,
the lookup in a small tree could even be marginally faster.

[toc] | [prev] | [next] | [standalone]


#16925

FromChris Angelico <rosuav@gmail.com>
Date2011-12-10 04:13 +1100
Message-ID<mailman.3480.1323450805.27778.python-list@python.org>
In reply to#16923
On Sat, Dec 10, 2011 at 3:51 AM, Hrvoje Niksic <hniksic@xemacs.org> wrote:
> The case of dicts which require frequent access, such as those used to
> implement namespaces, is different, and more interesting.  Those dicts
> are typically quite small, and for them the difference between O(log n)
> and O(1) is negligible in both theory (since n is "small", i.e. bounded)
> and practice.  In fact, depending on the details of the implementation,
> the lookup in a small tree could even be marginally faster.

This is something where, I am sure, far greater minds than mine
delve... but, would a splay tree be effective for name lookups? In
most cases, you'll have a huge puddle of names of which you use the
tiniest fraction; and a splay tree would, in effect, automatically
optimize itself to handle tight loops.

ChrisA

[toc] | [prev] | [next] | [standalone]


#16972

FromLie Ryan <lie.1296@gmail.com>
Date2011-12-11 10:58 +1100
Message-ID<mailman.3503.1323561541.27778.python-list@python.org>
In reply to#16907
On 12/09/2011 10:27 PM, Steven D'Aprano wrote:
> On Thu, 08 Dec 2011 10:30:01 +0100, Hrvoje Niksic wrote:
>
>> In a language like Python, the difference between O(1) and O(log n) is
>> not the primary reason why programmers use dict; they use it because
>> it's built-in, efficient compared to alternatives, and convenient to
>> use.  If Python dict had been originally implemented as a tree, I'm sure
>> it would be just as popular.
>
> Except for people who needed dicts with tens of millions of items.

who should be using a proper DBMS in any case.

[toc] | [prev] | [next] | [standalone]


#16974

FromChris Angelico <rosuav@gmail.com>
Date2011-12-11 11:17 +1100
Message-ID<mailman.3505.1323562670.27778.python-list@python.org>
In reply to#16907
On Sun, Dec 11, 2011 at 10:58 AM, Lie Ryan <lie.1296@gmail.com> wrote:
> On 12/09/2011 10:27 PM, Steven D'Aprano wrote:
>> Except for people who needed dicts with tens of millions of items.
>
> who should be using a proper DBMS in any case.

Not necessarily. "Database" usually implies disk-based and relational,
features that may well be quite superfluous; and a pure-memory
database with no relational facilities... is basically a dict. So why
not use one?

ChrisA

[toc] | [prev] | [next] | [standalone]


#16976

FromLie Ryan <lie.1296@gmail.com>
Date2011-12-11 14:04 +1100
Message-ID<mailman.3507.1323572713.27778.python-list@python.org>
In reply to#16907
On 12/11/2011 11:17 AM, Chris Angelico wrote:
> On Sun, Dec 11, 2011 at 10:58 AM, Lie Ryan<lie.1296@gmail.com>  wrote:
>> On 12/09/2011 10:27 PM, Steven D'Aprano wrote:
>>> Except for people who needed dicts with tens of millions of items.
>>
>> who should be using a proper DBMS in any case.
>
> Not necessarily. "Database" usually implies disk-based and relational,
> features that may well be quite superfluous; and a pure-memory
> database with no relational facilities... is basically a dict. So why
> not use one?

It is very unlikely you'd have millions of items in a dict and you're 
not planning to do any data processing at all. In any case, there are 
very few use cases which requires the use of a dict with millions of 
items that wouldn't be better served by a proper database.

[toc] | [prev] | [next] | [standalone]


#16605

FromLie Ryan <lie.1296@gmail.com>
Date2011-12-05 00:29 +1100
Message-ID<mailman.3262.1323005372.27778.python-list@python.org>
In reply to#16523
On 12/02/2011 03:29 PM, 88888 Dihedral wrote:
>
> I clear my point a hash is a collection of (key, value) pairs that have
> well defined methods and behavior to be used in programming.
>
> The basic operations of a hash normally includes the following:
>
> 1. insertion of a (key, value) pair  into the hash
> 2. deletion of a (key, value) from the hash
> 3. inquiring  a hash by a key to retrieve the value if the (key, value)
> pair available in the hash. If no key matched, the hash will return
> a not found result.
>
> The hash can grow with (k,v) pairs accumulated in the run time.
> An auto memory management mechanism is required for a hash of a non-fixed size of (k,v) pairs.
>
> Some implementations of a hash might pose some restrictions of k and v
> for some reasons. But in object programming k and v can be objects
> to be manipulated by the programmer.

Strictly speaking, what you're describing is just a dictionary/mapping 
abstract data type (ADT), not a hashtable. Hashtable is a particular way 
to implement the dictionary/mapping ADT. Python's dictionary is 
implemented as hashtable, but there are other ways to implement a 
dictionary/mapping, such as using a sorted tree.

For a data structure to be considered a Hashtable, in addition to having 
the properties of a dictionary that you described, the data structure 
must also uses a "hashing function" to encode the dictionary's keys into 
integer that will be used to calculate the index for the corresponding 
value in its internal array. A hashtable also must provide mechanism to 
deal with hash collisions to maintains its invariants as a dictionary.

[toc] | [prev] | [next] | [standalone]


#16524

From88888 Dihedral <dihedral88888@googlemail.com>
Date2011-12-01 20:29 -0800
Message-ID<mailman.3215.1322800184.27778.python-list@python.org>
In reply to#16503
On Friday, December 2, 2011 12:18:29 AM UTC+8, Chris Angelico wrote:
> On Fri, Dec 2, 2011 at 2:52 AM, Dave Angel <d...@davea.name> wrote:
> > On 12/01/2011 10:35 AM, 88888 Dihedral wrote:
> >> I knew a hash can replace a bi-directional linked list.
> >> The value  can be a multi-field string  to be parsed for further actions.
> >> Is this what you are asking?
> >
> > A hash is a number, so I don't see how it can replace any kind of linked
> > list.  Perhaps you're thinking of some other language.
> 
> A hashtable is a form of data structure that involves hashing values
> (ie calculating hashes for them) and storing them for easier retrieval
> than a simple linked list offers. That may be what you're thinking of.
> 
> ChrisA

I clear my point a hash is a collection of (key, value) pairs that have
well defined methods and behavior to be used in programming. 

The basic operations of a hash normally includes the following:

1. insertion of a (key, value) pair  into the hash
2. deletion of a (key, value) from the hash
3. inquiring  a hash by a key to retrieve the value if the (key, value) 
pair available in the hash. If no key matched, the hash will return 
a not found result. 

The hash can grow with (k,v) pairs accumulated in the run time.
An auto memory management mechanism is required for a hash of a non-fixed size of (k,v) pairs. 

Some implementations of a hash might pose some restrictions of k and v
for some reasons. But in object programming k and v can be objects
to be manipulated by the programmer.  

[toc] | [prev] | [next] | [standalone]


#16525

FromSteven D'Aprano <steve+comp.lang.python@pearwood.info>
Date2011-12-02 04:33 +0000
Message-ID<4ed85535$0$29988$c3e8da3$5496439d@news.astraweb.com>
In reply to#16503
On Fri, 02 Dec 2011 03:18:29 +1100, Chris Angelico wrote:

> On Fri, Dec 2, 2011 at 2:52 AM, Dave Angel <d@davea.name> wrote:
>> On 12/01/2011 10:35 AM, 88888 Dihedral wrote:
>>> I knew a hash can replace a bi-directional linked list. The value  can
>>> be a multi-field string  to be parsed for further actions. Is this
>>> what you are asking?
>>
>> A hash is a number, so I don't see how it can replace any kind of
>> linked list.  Perhaps you're thinking of some other language.
> 
> A hashtable is a form of data structure that involves hashing values (ie
> calculating hashes for them) and storing them for easier retrieval than
> a simple linked list offers. That may be what you're thinking of.


Python dicts are hash tables.



-- 
Steven

[toc] | [prev] | [standalone]


Page 3 of 3 — ← Prev page 1 2 [3]

Back to top | Article view | comp.lang.python


csiph-web