Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #75456 > unrolled thread
| Started by | Dan Stromberg <drsalists@gmail.com> |
|---|---|
| First post | 2014-07-31 20:12 -0700 |
| Last post | 2014-08-02 12:13 +1000 |
| Articles | 13 — 6 participants |
Back to article view | Back to comp.lang.python
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Dict when defining not returning multi value key error Dan Stromberg <drsalists@gmail.com> - 2014-07-31 20:12 -0700
Re: Dict when defining not returning multi value key error Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-08-01 04:04 +0000
Re: Dict when defining not returning multi value key error Ben Finney <ben+python@benfinney.id.au> - 2014-08-01 14:17 +1000
Re: Dict when defining not returning multi value key error Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-08-01 13:31 +0000
Re: Dict when defining not returning multi value key error Chris Angelico <rosuav@gmail.com> - 2014-08-01 23:57 +1000
Re: Dict when defining not returning multi value key error Marko Rauhamaa <marko@pacujo.net> - 2014-08-01 18:39 +0300
Re: Dict when defining not returning multi value key error Terry Reedy <tjreedy@udel.edu> - 2014-08-01 17:42 -0400
Re: Dict when defining not returning multi value key error Chris Angelico <rosuav@gmail.com> - 2014-08-02 09:57 +1000
Re: Dict when defining not returning multi value key error Marko Rauhamaa <marko@pacujo.net> - 2014-08-02 09:41 +0300
Re: Dict when defining not returning multi value key error Marko Rauhamaa <marko@pacujo.net> - 2014-08-02 10:06 +0300
Re: Dict when defining not returning multi value key error Chris Angelico <rosuav@gmail.com> - 2014-08-02 20:58 +1000
Re: Dict when defining not returning multi value key error Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-08-02 11:47 +1000
Re: Dict when defining not returning multi value key error Chris Angelico <rosuav@gmail.com> - 2014-08-02 12:13 +1000
| From | Dan Stromberg <drsalists@gmail.com> |
|---|---|
| Date | 2014-07-31 20:12 -0700 |
| Subject | Re: Dict when defining not returning multi value key error |
| Message-ID | <mailman.12501.1406862741.18130.python-list@python.org> |
On Thu, Jul 31, 2014 at 8:08 PM, Dan Stromberg <drsalists@gmail.com> wrote:
>> p = {'1':"value0",'1.0':"value1"}
> For 1 and 1.0 - they simply hash differently. Dictionaries are
> resizeable hash tables.
I removed some quotes, and noticed that 1 and 1.0 hash the same.
That's a bit unexpected, but I suppose it's not completely
unreasonable.
EG:
$ pythons 'print("%s %s" % (hash(1), hash(1.0)))'
/usr/local/cpython-2.4/bin/python 1 1
/usr/local/cpython-2.5/bin/python 1 1
/usr/local/cpython-2.6/bin/python 1 1
/usr/local/cpython-2.7/bin/python 1 1
/usr/local/cpython-3.0/bin/python 1 1
/usr/local/cpython-3.1/bin/python 1 1
/usr/local/cpython-3.2/bin/python 1 1
/usr/local/cpython-3.3/bin/python 1 1
/usr/local/cpython-3.4/bin/python 1 1
/usr/local/pypy-2.3.1/bin/pypy 1 1
/usr/local/pypy3-2.3.1/bin/pypy 1 1
/usr/local/jython-2.7b2/bin/jython 1 1
[toc] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-08-01 04:04 +0000 |
| Message-ID | <53db11c1$0$29986$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #75456 |
On Thu, 31 Jul 2014 20:12:12 -0700, Dan Stromberg wrote: > I removed some quotes, and noticed that 1 and 1.0 hash the same. That's > a bit unexpected, but I suppose it's not completely unreasonable. You should expect that two equal objects should hash to the same value in different versions of Python, or even from one run of Python to the next. But within a single session, then far from being "not completely unreasonable", having equal items hash to the same value is the only reasonable thing to do. It is part of the dict API that if x == y, then dict[x] == dict[y], and that cannot happen unless x and y hash to the same thing. In general, it's not entirely possible to enforce that, especially given how flexible custom __eq__ methods can be, so Python doesn't even try. But with *numbers* (or at least those which are part of the numeric tower), Python does insist that each distinct value should hash to the same result no matter what type that value happens to be: py> n = 23 py> hash(n) == hash(long(n)) == hash(float(n)) == hash(complex(n)) True py> from fractions import Fraction py> from decimal import Decimal py> hash(n) == hash(Fraction(n)) == hash(Decimal(n)) True With the possible exception of Decimal, which is not fully integrated with the numeric tower, all numbers in the standard library will obey the condition that if x == y, hash(x) == hash(y), regardless of type. -- Steven
[toc] | [prev] | [next] | [standalone]
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2014-08-01 14:17 +1000 |
| Message-ID | <mailman.12503.1406866676.18130.python-list@python.org> |
| In reply to | #75458 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes: > On Thu, 31 Jul 2014 20:12:12 -0700, Dan Stromberg wrote: > > > I removed some quotes, and noticed that 1 and 1.0 hash the same. > > That's a bit unexpected, but I suppose it's not completely > > unreasonable. I would expect it, for the reasons Steven explains. Why is it unexpected? > You should expect that two equal objects should hash to the same value > in different versions of Python, or even from one run of Python to the > next. I suspect there's a “not” missing there. To be explicit: You should not expect any particular comparisons to hold true for hash values between different Python versions, or even from one run to the next. Only those promises made or logically implied in the API should be expected. Implementation details should be expected to break assumptions — in other words, don't depend on any particular behaviour of an implementation detail. If you expect the hash valuesto be the same, you're making a mistake; if you expect the hash values to be different, you're making a mistake. The correct attitude, with regard to implementation details, is to expect uncertainty and not assume anything :-) -- \ “Natural catastrophes are rare, but they come often enough. We | `\ need not force the hand of nature.” —Carl Sagan, _Cosmos_, 1980 | _o__) | Ben Finney
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-08-01 13:31 +0000 |
| Message-ID | <53db96bc$0$29986$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #75459 |
On Fri, 01 Aug 2014 14:17:41 +1000, Ben Finney wrote: > Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes: > >> On Thu, 31 Jul 2014 20:12:12 -0700, Dan Stromberg wrote: >> >> > I removed some quotes, and noticed that 1 and 1.0 hash the same. >> > That's a bit unexpected, but I suppose it's not completely >> > unreasonable. > > I would expect it, for the reasons Steven explains. Why is it > unexpected? > >> You should expect that two equal objects should hash to the same value >> in different versions of Python, or even from one run of Python to the >> next. > > I suspect there's a “not” missing there. Thanks for spotting that. Correct, I meant that you should *not* expect hashing to return any specific value. [...] > If you expect the hash valuesto be the same, you're making a mistake; if > you expect the hash values to be different, you're making a mistake. Yes. Although Python promises (at least for classes in the standard library) that x == y should imply that hash(x) == hash(y), it says nothing about the other way: x == y implies that hash(x) == hash(y) but hash(x) == hash(y) does NOT imply that x == y. py> hash(2**128) == hash(1) True > The correct attitude, with regard to implementation details, is to > expect uncertainty and not assume anything :-) Yes, this! -- Steven
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-08-01 23:57 +1000 |
| Message-ID | <mailman.12512.1406901485.18130.python-list@python.org> |
| In reply to | #75474 |
On Fri, Aug 1, 2014 at 11:31 PM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > Yes. Although Python promises (at least for classes in the standard > library) that x == y should imply that hash(x) == hash(y), it says > nothing about the other way: > > x == y implies that hash(x) == hash(y) This is the entire point of the hashing system. If equal values can hash differently, why bother calculating the hashes? Or if you prefer, the point of hashing is the logical converse of the above: hash(x) != hash(y) implies that x != y. If the hashes are different, you can skip the equality check, ergo you can build a data type that claims to search for the key that == the looked-up key, but actually does a much faster hash check first, and skips everything with a different hash. > but hash(x) == hash(y) does NOT imply that x == y. > Hello, pigeonhole principle :) If this were false - that is, if equal hashes DID imply equal objects - it would be necessary to completely encode an object's state in its hash, and hashes would be impossibly large. This would, in fact, destroy their value completely. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2014-08-01 18:39 +0300 |
| Message-ID | <87vbqcnrrh.fsf@elektro.pacujo.net> |
| In reply to | #75477 |
Chris Angelico <rosuav@gmail.com>: >> but hash(x) == hash(y) does NOT imply that x == y. > > Hello, pigeonhole principle :) If this were false - that is, if equal > hashes DID imply equal objects - it would be necessary to completely > encode an object's state in its hash, and hashes would be impossibly > large. This would, in fact, destroy their value completely. Well, modern computing assumes precisely: hash(x) == hash(y) => x == y That principle is at play with strong authentication (HTTPS et al), version control (git et al), A handful of bits uniquely identify an object regardless of its size. Marko
[toc] | [prev] | [next] | [standalone]
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2014-08-01 17:42 -0400 |
| Message-ID | <mailman.12525.1406929354.18130.python-list@python.org> |
| In reply to | #75487 |
On 8/1/2014 11:39 AM, Marko Rauhamaa wrote: > Chris Angelico <rosuav@gmail.com>: > >>> but hash(x) == hash(y) does NOT imply that x == y. >> >> Hello, pigeonhole principle :) If this were false - that is, if equal >> hashes DID imply equal objects - it would be necessary to completely >> encode an object's state in its hash, and hashes would be impossibly >> large. This would, in fact, destroy their value completely. > > Well, modern computing assumes precisely: > > hash(x) == hash(y) => x == y Assuming that a false statement is true does not make it true, and can and has gotten 'computing' into trouble. > That principle is at play with strong authentication (HTTPS et al), > version control (git et al), The principle for these applications is stronghash(x) == stronghash(y) => x == y with probability 1 (or indistinguishable from 1). For mercurial, with no treat model, a 160 bit hash is used. Internet applications need more bits and carefully vetted algorithms to hopefully make the actual principle true. -- Terry Jan Reedy
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-08-02 09:57 +1000 |
| Message-ID | <mailman.12529.1406937431.18130.python-list@python.org> |
| In reply to | #75487 |
On Sat, Aug 2, 2014 at 7:42 AM, Terry Reedy <tjreedy@udel.edu> wrote: > For mercurial, with no treat model, a 160 bit hash is used. Internet > applications need more bits and carefully vetted algorithms to hopefully > make the actual principle true. Ditto git, which also has no threat model. I don't know of any situation in HTTPS that has this, but the classic concept of hashed passwords (quite independent of HTTPS) basically says "if I take an arbitrary/random salt and combine it with your password, and hash that, then the probability of a hash collision involving the same salt and a different password approaches 0". And any time "approaches 0" is provably false (or doesn't approach 0 closely enough), you have weak passwords, which is why it's a really bad idea to use MD5 passwording. Ergo MD5 is not (any more, at least) a "carefully vetted algorithm". (That said, though, I will happily use md5sum across a huge pile of files to find duplicates. It's a lot quicker than sha*sum, and I don't have reason to expect malicious hash collisions on my own hard drive. Plus, I can always just check some other way.) ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2014-08-02 09:41 +0300 |
| Message-ID | <87zjfnmlzr.fsf@elektro.pacujo.net> |
| In reply to | #75507 |
Chris Angelico <rosuav@gmail.com>: > On Sat, Aug 2, 2014 at 7:42 AM, Terry Reedy <tjreedy@udel.edu> wrote: >> For mercurial, with no treat model, a 160 bit hash is used. Internet >> applications need more bits and carefully vetted algorithms to >> hopefully make the actual principle true. > > Ditto git, which also has no threat model. I don't know why you way hg and git have no threat models. A great deal of damage could be inflicted if you could sneak malicious edits into version control systems without altering the hash. Important systems absolutely rely on the fact that the hashes can be used for identification. They are not just checksums. They are not double-checked with bit-to-bit comparisons of the actual data. Marko
[toc] | [prev] | [next] | [standalone]
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2014-08-02 10:06 +0300 |
| Message-ID | <87vbqbmkv9.fsf@elektro.pacujo.net> |
| In reply to | #75516 |
Marko Rauhamaa <marko@pacujo.net>:
> Important systems absolutely rely on the fact that the hashes can be
> used for identification. They are not just checksums. They are not
> double-checked with bit-to-bit comparisons of the actual data.
And in fact, you can use the principle in Python:
class Thingy:
def __init__(self, ...):
...
self.stronghash = self.compute_strong_hash()
self.hash = struct.unpack("Q", self.stronghash[:8])[0]
def __hash__(self):
return self.hash
def __eq__(self, other):
try:
return self.stronghash == other.stronghash
except AttributeError:
return False
Marko
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-08-02 20:58 +1000 |
| Message-ID | <mailman.12539.1406977127.18130.python-list@python.org> |
| In reply to | #75516 |
On Sat, Aug 2, 2014 at 4:41 PM, Marko Rauhamaa <marko@pacujo.net> wrote: > I don't know why you way hg and git have no threat models. A great deal > of damage could be inflicted if you could sneak malicious edits into > version control systems without altering the hash. You would have to somehow push that change into someone else's repository. I'm not sure how you'd go about doing that without actually having a commit, which will be kinda visible. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-08-02 11:47 +1000 |
| Message-ID | <53dc433b$0$29979$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #75487 |
Marko Rauhamaa wrote: > Chris Angelico <rosuav@gmail.com>: > >>> but hash(x) == hash(y) does NOT imply that x == y. >> >> Hello, pigeonhole principle :) If this were false - that is, if equal >> hashes DID imply equal objects - it would be necessary to completely >> encode an object's state in its hash, and hashes would be impossibly >> large. This would, in fact, destroy their value completely. > > Well, modern computing assumes precisely: > > hash(x) == hash(y) => x == y I really don't think that the behaviour of Python's built-in hash() function is of fundamental concern to all of modern computing. Python's hash() function is of fundamental concern to Python dicts, and that's all. > That principle is at play with strong authentication (HTTPS et al), > version control (git et al), git is not even written in Python. HTTPS is a protocol, so it could be written in Python, but typically isn't. Neither of them use Python's hash(), because the people writing these applications aren't insane. Collisions in Python's hash() are trivially easy to find, and the existence of such collisions is well-known. Perhaps you're talking about cryptographically strong hash functions, of which there are a few? If so, a brief note that you were changing the subject would be appreciated :-) The pigeonhole principle applies to crypto hash functions too, and there is active research into creating newer and stronger hash functions as the older ones are broken. For instance, md4, md5, md2 and sha-0 hash functions have all be broken, and shouldn't be used for anything where security is of concern. > A handful of bits uniquely identify an object regardless of its size. No. The principle being applied is not that a handful of bits *uniquely* identifies an object, since that is clearly and obviously untrue: consider objects of size 1K, there are 2**1024 possible such objects. I don't know what "a handful" of bits is, but let's say 128 bits, so there are 2**128 possible hashes. It doesn't take much mathematics to see that you cannot *uniquely* hash 2**1024 objects into only 2**128 checksums without quite a few collisions. The actual principle being applied is that, with a cryptographically strong hash function, a handful of bits can *practically* identify an object regardless of its size. Let's take that 128 bit checksum again: that means that there are 340282366920938463463374607431768211456 possible checksums. If the input files are unpredictably hashed over that entire range, on average you would have to generate half that number of hashes before stumbling across a collision purely by chance. Half of 2**128 is still a rather big number: if you generated a million hashes a second, on average it would take you considerably more than a billion years to accidentally hit a collision. For practical purposes, that's good enough. But note that there are numerous provisos there, including: - It relies on the hash function generating a large enough handful of bits. An 8 bit checksum only has 2**8 = 256 unique values, so you would find a collision pretty quickly. - It relies on the mapping of original object to checksum being randomly distributed. If they are not randomly distributed, then there will be sets of objects which collide more often than expected. For instance, suppose that every object containing a 0 byte hashed to the same value. Then, on average, you would not have to wait very long before finding your first collision. - It relies on the checksum being unpredictable, to prevent substitution attacks: you're expecting object x with checksum a, but somebody substitutes object y with checksum a instead. Python's hash() function is not cryptographically strong and makes no claim to be. -- Steven
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-08-02 12:13 +1000 |
| Message-ID | <mailman.12531.1406945609.18130.python-list@python.org> |
| In reply to | #75510 |
On Sat, Aug 2, 2014 at 11:47 AM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > - It relies on the checksum being unpredictable, to prevent substitution > attacks: you're expecting object x with checksum a, but somebody > substitutes object y with checksum a instead. Note that this requirement is only an issue when there are actual attacks involved. In many cases, hashing is used to detect either errors in transfer (eg truncated files) or meaningfully different files (eg MP3s of different songs). A collision between those won't be the result of someone deliberately crafting a file with the right checksum; it'll happen only by chance, so md5sum can be safely used there. But if you're trying to use this to prove that a file was downloaded correctly, you do need to worry about that. If I say that you can download the binary of my program from <this URL>, and that the MD5 checksum is 123456...DEF, then someone could do a DNS hack (cache poisoning, proxy interference, whatever) to capture your attempted download, and send you instead to his own server, where he has a carefully crafted binary that does everything mine does, plus it tells him all your passwords - and it has arbitrary junk buried in it to make sure the MD5 sum matches. So an MD5 checksum is broken for anything from the internet, but is quite usable for certain specific cases. There is one aspect of the unpredictability that's important even in the simple cases, though, and that's the avalanche effect. If anything changes in the file, the whole hash should completely and arbitrarily change. That means you don't need to stare at the whole hash, trying to see if that 8 became a 0; any change to the file will probably make the first few digits visibly different, so it's easily obvious that the hash has changed. ChrisA
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web