Groups > comp.lang.python > #75456 > unrolled thread

Re: Dict when defining not returning multi value key error

Started by	Dan Stromberg <drsalists@gmail.com>
First post	2014-07-31 20:12 -0700
Last post	2014-08-02 12:13 +1000
Articles	13 — 6 participants

Back to article view | Back to comp.lang.python

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: Dict when defining not returning multi value key error Dan Stromberg <drsalists@gmail.com> - 2014-07-31 20:12 -0700
    Re: Dict when defining not returning multi value key error Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-08-01 04:04 +0000
      Re: Dict when defining not returning multi value key error Ben Finney <ben+python@benfinney.id.au> - 2014-08-01 14:17 +1000
        Re: Dict when defining not returning multi value key error Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-08-01 13:31 +0000
          Re: Dict when defining not returning multi value key error Chris Angelico <rosuav@gmail.com> - 2014-08-01 23:57 +1000
            Re: Dict when defining not returning multi value key error Marko Rauhamaa <marko@pacujo.net> - 2014-08-01 18:39 +0300
              Re: Dict when defining not returning multi value key error Terry Reedy <tjreedy@udel.edu> - 2014-08-01 17:42 -0400
              Re: Dict when defining not returning multi value key error Chris Angelico <rosuav@gmail.com> - 2014-08-02 09:57 +1000
                Re: Dict when defining not returning multi value key error Marko Rauhamaa <marko@pacujo.net> - 2014-08-02 09:41 +0300
                  Re: Dict when defining not returning multi value key error Marko Rauhamaa <marko@pacujo.net> - 2014-08-02 10:06 +0300
                  Re: Dict when defining not returning multi value key error Chris Angelico <rosuav@gmail.com> - 2014-08-02 20:58 +1000
              Re: Dict when defining not returning multi value key error Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-08-02 11:47 +1000
                Re: Dict when defining not returning multi value key error Chris Angelico <rosuav@gmail.com> - 2014-08-02 12:13 +1000

#75456 — Re: Dict when defining not returning multi value key error

From	Dan Stromberg <drsalists@gmail.com>
Date	2014-07-31 20:12 -0700
Subject	Re: Dict when defining not returning multi value key error
Message-ID	<mailman.12501.1406862741.18130.python-list@python.org>

On Thu, Jul 31, 2014 at 8:08 PM, Dan Stromberg <drsalists@gmail.com> wrote:

>> p = {'1':"value0",'1.0':"value1"}

> For 1 and 1.0 - they simply hash differently.  Dictionaries are
> resizeable hash tables.

I removed some quotes, and noticed that 1 and 1.0 hash the same.
That's a bit unexpected, but I suppose it's not completely
unreasonable.

EG:
$ pythons 'print("%s %s" % (hash(1), hash(1.0)))'
/usr/local/cpython-2.4/bin/python 1 1
/usr/local/cpython-2.5/bin/python 1 1
/usr/local/cpython-2.6/bin/python 1 1
/usr/local/cpython-2.7/bin/python 1 1
/usr/local/cpython-3.0/bin/python 1 1
/usr/local/cpython-3.1/bin/python 1 1
/usr/local/cpython-3.2/bin/python 1 1
/usr/local/cpython-3.3/bin/python 1 1
/usr/local/cpython-3.4/bin/python 1 1
/usr/local/pypy-2.3.1/bin/pypy 1 1
/usr/local/pypy3-2.3.1/bin/pypy 1 1
/usr/local/jython-2.7b2/bin/jython 1 1

[toc] | [next] | [standalone]

#75458

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-08-01 04:04 +0000
Message-ID	<53db11c1$0$29986$c3e8da3$5496439d@news.astraweb.com>
In reply to	#75456

On Thu, 31 Jul 2014 20:12:12 -0700, Dan Stromberg wrote:

> I removed some quotes, and noticed that 1 and 1.0 hash the same. That's
> a bit unexpected, but I suppose it's not completely unreasonable.

You should expect that two equal objects should hash to the same value in 
different versions of Python, or even from one run of Python to the next. 
But within a single session, then far from being "not completely 
unreasonable", having equal items hash to the same value is the only 
reasonable thing to do. It is part of the dict API that if x == y, then 
dict[x] == dict[y], and that cannot happen unless x and y hash to the 
same thing.

In general, it's not entirely possible to enforce that, especially given 
how flexible custom __eq__ methods can be, so Python doesn't even try. 
But with *numbers* (or at least those which are part of the numeric 
tower), Python does insist that each distinct value should hash to the 
same result no matter what type that value happens to be:

py> n = 23
py> hash(n) == hash(long(n)) == hash(float(n)) == hash(complex(n))
True
py> from fractions import Fraction
py> from decimal import Decimal
py> hash(n) == hash(Fraction(n)) == hash(Decimal(n))
True

With the possible exception of Decimal, which is not fully integrated 
with the numeric tower, all numbers in the standard library will obey the 
condition that if x == y, hash(x) == hash(y), regardless of type.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#75459

From	Ben Finney <ben+python@benfinney.id.au>
Date	2014-08-01 14:17 +1000
Message-ID	<mailman.12503.1406866676.18130.python-list@python.org>
In reply to	#75458

Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:

> On Thu, 31 Jul 2014 20:12:12 -0700, Dan Stromberg wrote:
>
> > I removed some quotes, and noticed that 1 and 1.0 hash the same.
> > That's a bit unexpected, but I suppose it's not completely
> > unreasonable.

I would expect it, for the reasons Steven explains. Why is it
unexpected?

> You should expect that two equal objects should hash to the same value
> in different versions of Python, or even from one run of Python to the
> next.

I suspect there's a “not” missing there.

To be explicit: You should not expect any particular comparisons to hold
true for hash values between different Python versions, or even from one
run to the next.

Only those promises made or logically implied in the API should be
expected. Implementation details should be expected to break assumptions
— in other words, don't depend on any particular behaviour of an
implementation detail.

If you expect the hash valuesto be the same, you're making a mistake; if
you expect the hash values to be different, you're making a mistake.

The correct attitude, with regard to implementation details, is to
expect uncertainty and not assume anything :-)

-- 
 \      “Natural catastrophes are rare, but they come often enough. We |
  `\   need not force the hand of nature.” —Carl Sagan, _Cosmos_, 1980 |
_o__)                                                                  |
Ben Finney

[toc] | [prev] | [next] | [standalone]

#75474

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-08-01 13:31 +0000
Message-ID	<53db96bc$0$29986$c3e8da3$5496439d@news.astraweb.com>
In reply to	#75459

On Fri, 01 Aug 2014 14:17:41 +1000, Ben Finney wrote:

> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> 
>> On Thu, 31 Jul 2014 20:12:12 -0700, Dan Stromberg wrote:
>>
>> > I removed some quotes, and noticed that 1 and 1.0 hash the same.
>> > That's a bit unexpected, but I suppose it's not completely
>> > unreasonable.
> 
> I would expect it, for the reasons Steven explains. Why is it
> unexpected?
> 
>> You should expect that two equal objects should hash to the same value
>> in different versions of Python, or even from one run of Python to the
>> next.
> 
> I suspect there's a “not” missing there.

Thanks for spotting that. Correct, I meant that you should *not* expect 
hashing to return any specific value.


[...]
> If you expect the hash valuesto be the same, you're making a mistake; if
> you expect the hash values to be different, you're making a mistake.

Yes. Although Python promises (at least for classes in the standard 
library) that x == y should imply that hash(x) == hash(y), it says 
nothing about the other way:

x == y implies that hash(x) == hash(y)

but hash(x) == hash(y) does NOT imply that x == y.


py> hash(2**128) == hash(1)
True


> The correct attitude, with regard to implementation details, is to
> expect uncertainty and not assume anything :-)

Yes, this!



-- 
Steven

[toc] | [prev] | [next] | [standalone]

#75477

From	Chris Angelico <rosuav@gmail.com>
Date	2014-08-01 23:57 +1000
Message-ID	<mailman.12512.1406901485.18130.python-list@python.org>
In reply to	#75474

On Fri, Aug 1, 2014 at 11:31 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Yes. Although Python promises (at least for classes in the standard
> library) that x == y should imply that hash(x) == hash(y), it says
> nothing about the other way:
>
> x == y implies that hash(x) == hash(y)

This is the entire point of the hashing system. If equal values can
hash differently, why bother calculating the hashes?

Or if you prefer, the point of hashing is the logical converse of the
above: hash(x) != hash(y) implies that x != y. If the hashes are
different, you can skip the equality check, ergo you can build a data
type that claims to search for the key that == the looked-up key, but
actually does a much faster hash check first, and skips everything
with a different hash.

> but hash(x) == hash(y) does NOT imply that x == y.
>

Hello, pigeonhole principle :) If this were false - that is, if equal
hashes DID imply equal objects - it would be necessary to completely
encode an object's state in its hash, and hashes would be impossibly
large. This would, in fact, destroy their value completely.

ChrisA

[toc] | [prev] | [next] | [standalone]

#75487

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-08-01 18:39 +0300
Message-ID	<87vbqcnrrh.fsf@elektro.pacujo.net>
In reply to	#75477

Chris Angelico <rosuav@gmail.com>:

>> but hash(x) == hash(y) does NOT imply that x == y.
>
> Hello, pigeonhole principle :) If this were false - that is, if equal
> hashes DID imply equal objects - it would be necessary to completely
> encode an object's state in its hash, and hashes would be impossibly
> large. This would, in fact, destroy their value completely.

Well, modern computing assumes precisely:

   hash(x) == hash(y) => x == y

That principle is at play with strong authentication (HTTPS et al),
version control (git et al),

A handful of bits uniquely identify an object regardless of its size.


Marko

[toc] | [prev] | [next] | [standalone]

#75502

From	Terry Reedy <tjreedy@udel.edu>
Date	2014-08-01 17:42 -0400
Message-ID	<mailman.12525.1406929354.18130.python-list@python.org>
In reply to	#75487

On 8/1/2014 11:39 AM, Marko Rauhamaa wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>>> but hash(x) == hash(y) does NOT imply that x == y.
>>
>> Hello, pigeonhole principle :) If this were false - that is, if equal
>> hashes DID imply equal objects - it would be necessary to completely
>> encode an object's state in its hash, and hashes would be impossibly
>> large. This would, in fact, destroy their value completely.
>
> Well, modern computing assumes precisely:
>
>     hash(x) == hash(y) => x == y

Assuming that a false statement is true does not make it true, and can 
and has gotten 'computing' into trouble.

> That principle is at play with strong authentication (HTTPS et al),
> version control (git et al),

The principle for these applications is

stronghash(x) == stronghash(y) => x == y with probability 1 (or 
indistinguishable from 1).

For mercurial, with no treat model, a 160 bit hash is used. Internet 
applications need more bits and carefully vetted algorithms to hopefully 
make the actual principle true.

-- 
Terry Jan Reedy

[toc] | [prev] | [next] | [standalone]

#75507

From	Chris Angelico <rosuav@gmail.com>
Date	2014-08-02 09:57 +1000
Message-ID	<mailman.12529.1406937431.18130.python-list@python.org>
In reply to	#75487

On Sat, Aug 2, 2014 at 7:42 AM, Terry Reedy <tjreedy@udel.edu> wrote:
> For mercurial, with no treat model, a 160 bit hash is used. Internet
> applications need more bits and carefully vetted algorithms to hopefully
> make the actual principle true.

Ditto git, which also has no threat model. I don't know of any
situation in HTTPS that has this, but the classic concept of hashed
passwords (quite independent of HTTPS) basically says "if I take an
arbitrary/random salt and combine it with your password, and hash
that, then the probability of a hash collision involving the same salt
and a different password approaches 0". And any time "approaches 0" is
provably false (or doesn't approach 0 closely enough), you have weak
passwords, which is why it's a really bad idea to use MD5 passwording.
Ergo MD5 is not (any more, at least) a "carefully vetted algorithm".
(That said, though, I will happily use md5sum across a huge pile of
files to find duplicates. It's a lot quicker than sha*sum, and I don't
have reason to expect malicious hash collisions on my own hard drive.
Plus, I can always just check some other way.)

ChrisA

[toc] | [prev] | [next] | [standalone]

#75516

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-08-02 09:41 +0300
Message-ID	<87zjfnmlzr.fsf@elektro.pacujo.net>
In reply to	#75507

Chris Angelico <rosuav@gmail.com>:

> On Sat, Aug 2, 2014 at 7:42 AM, Terry Reedy <tjreedy@udel.edu> wrote:

>> For mercurial, with no treat model, a 160 bit hash is used. Internet
>> applications need more bits and carefully vetted algorithms to
>> hopefully make the actual principle true.
>
> Ditto git, which also has no threat model.

I don't know why you way hg and git have no threat models. A great deal
of damage could be inflicted if you could sneak malicious edits into
version control systems without altering the hash.

Important systems absolutely rely on the fact that the hashes can be
used for identification. They are not just checksums. They are not
double-checked with bit-to-bit comparisons of the actual data.

Marko

[toc] | [prev] | [next] | [standalone]

#75518

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-08-02 10:06 +0300
Message-ID	<87vbqbmkv9.fsf@elektro.pacujo.net>
In reply to	#75516

Marko Rauhamaa <marko@pacujo.net>:

> Important systems absolutely rely on the fact that the hashes can be
> used for identification. They are not just checksums. They are not
> double-checked with bit-to-bit comparisons of the actual data.

And in fact, you can use the principle in Python:

    class Thingy:
        def __init__(self, ...):
            ...
            self.stronghash = self.compute_strong_hash()
            self.hash = struct.unpack("Q", self.stronghash[:8])[0]

        def __hash__(self):
            return self.hash

        def __eq__(self, other):
            try:
                return self.stronghash == other.stronghash
            except AttributeError:
                return False


Marko

[toc] | [prev] | [next] | [standalone]

#75528

From	Chris Angelico <rosuav@gmail.com>
Date	2014-08-02 20:58 +1000
Message-ID	<mailman.12539.1406977127.18130.python-list@python.org>
In reply to	#75516

On Sat, Aug 2, 2014 at 4:41 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> I don't know why you way hg and git have no threat models. A great deal
> of damage could be inflicted if you could sneak malicious edits into
> version control systems without altering the hash.

You would have to somehow push that change into someone else's
repository. I'm not sure how you'd go about doing that without
actually having a commit, which will be kinda visible.

ChrisA

[toc] | [prev] | [next] | [standalone]

#75510

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-08-02 11:47 +1000
Message-ID	<53dc433b$0$29979$c3e8da3$5496439d@news.astraweb.com>
In reply to	#75487

Marko Rauhamaa wrote:

> Chris Angelico <rosuav@gmail.com>:
> 
>>> but hash(x) == hash(y) does NOT imply that x == y.
>>
>> Hello, pigeonhole principle :) If this were false - that is, if equal
>> hashes DID imply equal objects - it would be necessary to completely
>> encode an object's state in its hash, and hashes would be impossibly
>> large. This would, in fact, destroy their value completely.
> 
> Well, modern computing assumes precisely:
> 
>    hash(x) == hash(y) => x == y

I really don't think that the behaviour of Python's built-in hash() function
is of fundamental concern to all of modern computing. Python's hash()
function is of fundamental concern to Python dicts, and that's all.

> That principle is at play with strong authentication (HTTPS et al),
> version control (git et al),

git is not even written in Python. HTTPS is a protocol, so it could be
written in Python, but typically isn't. Neither of them use Python's
hash(), because the people writing these applications aren't insane.
Collisions in Python's hash() are trivially easy to find, and the existence
of such collisions is well-known.

Perhaps you're talking about cryptographically strong hash functions, of
which there are a few? If so, a brief note that you were changing the
subject would be appreciated :-)

The pigeonhole principle applies to crypto hash functions too, and there is
active research into creating newer and stronger hash functions as the
older ones are broken. For instance, md4, md5, md2 and sha-0 hash functions
have all be broken, and shouldn't be used for anything where security is of
concern.

> A handful of bits uniquely identify an object regardless of its size.

No. The principle being applied is not that a handful of bits *uniquely*
identifies an object, since that is clearly and obviously untrue: consider
objects of size 1K, there are 2**1024 possible such objects. I don't know
what "a handful" of bits is, but let's say 128 bits, so there are 2**128
possible hashes. It doesn't take much mathematics to see that you cannot
*uniquely* hash 2**1024 objects into only 2**128 checksums without quite a
few collisions.

The actual principle being applied is that, with a cryptographically strong
hash function, a handful of bits can *practically* identify an object
regardless of its size. Let's take that 128 bit checksum again: that means
that there are 340282366920938463463374607431768211456 possible checksums.
If the input files are unpredictably hashed over that entire range, on
average you would have to generate half that number of hashes before
stumbling across a collision purely by chance. Half of 2**128 is still a
rather big number: if you generated a million hashes a second, on average
it would take you considerably more than a billion years to accidentally
hit a collision.

For practical purposes, that's good enough.

But note that there are numerous provisos there, including:

- It relies on the hash function generating a large enough handful of bits.
An 8 bit checksum only has 2**8 = 256 unique values, so you would find a
collision pretty quickly.

- It relies on the mapping of original object to checksum being randomly
distributed. If they are not randomly distributed, then there will be sets
of objects which collide more often than expected. For instance, suppose
that every object containing a 0 byte hashed to the same value. Then, on
average, you would not have to wait very long before finding your first
collision.

- It relies on the checksum being unpredictable, to prevent substitution
attacks: you're expecting object x with checksum a, but somebody
substitutes object y with checksum a instead.

Python's hash() function is not cryptographically strong and makes no claim
to be.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#75511

From	Chris Angelico <rosuav@gmail.com>
Date	2014-08-02 12:13 +1000
Message-ID	<mailman.12531.1406945609.18130.python-list@python.org>
In reply to	#75510

On Sat, Aug 2, 2014 at 11:47 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> - It relies on the checksum being unpredictable, to prevent substitution
> attacks: you're expecting object x with checksum a, but somebody
> substitutes object y with checksum a instead.

Note that this requirement is only an issue when there are actual
attacks involved. In many cases, hashing is used to detect either
errors in transfer (eg truncated files) or meaningfully different
files (eg MP3s of different songs). A collision between those won't be
the result of someone deliberately crafting a file with the right
checksum; it'll happen only by chance, so md5sum can be safely used
there.

But if you're trying to use this to prove that a file was downloaded
correctly, you do need to worry about that. If I say that you can
download the binary of my program from <this URL>, and that the MD5
checksum is 123456...DEF, then someone could do a DNS hack (cache
poisoning, proxy interference, whatever) to capture your attempted
download, and send you instead to his own server, where he has a
carefully crafted binary that does everything mine does, plus it tells
him all your passwords - and it has arbitrary junk buried in it to
make sure the MD5 sum matches. So an MD5 checksum is broken for
anything from the internet, but is quite usable for certain specific
cases.

There is one aspect of the unpredictability that's important even in
the simple cases, though, and that's the avalanche effect. If anything
changes in the file, the whole hash should completely and arbitrarily
change. That means you don't need to stare at the whole hash, trying
to see if that 8 became a 0; any change to the file will probably make
the first few digits visibly different, so it's easily obvious that
the hash has changed.

ChrisA

[toc] | [prev] | [standalone]

csiph-web

Re: Dict when defining not returning multi value key error

Contents

#75456 — Re: Dict when defining not returning multi value key error

#75458

#75459

#75474

#75477

#75487

#75502

#75507

#75516

#75518

#75528

#75510

#75511