Groups > comp.lang.python > #18958 > unrolled thread

Hash stability

Started by	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
First post	2012-01-14 04:42 +0000
Last post	2012-01-16 10:15 +0100
Articles	18 — 9 participants

Back to article view | Back to comp.lang.python

  Hash stability Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-01-14 04:42 +0000
    Re: Hash stability Peter Otten <__peter__@web.de> - 2012-01-14 10:46 +0100
    Re: Hash stability Heiko Wundram <modelnine@modelnine.org> - 2012-01-14 23:45 +0100
    Re: Hash stability Chris Angelico <rosuav@gmail.com> - 2012-01-15 11:36 +1100
      Re: Hash stability Bryan <bryanjugglercryptographer@yahoo.com> - 2012-01-15 04:03 -0800
        Re: Hash stability Chris Angelico <rosuav@gmail.com> - 2012-01-15 23:21 +1100
    Re: Hash stability Roy Smith <roy@panix.com> - 2012-01-14 21:26 -0500
      Re: Hash stability Terry Reedy <tjreedy@udel.edu> - 2012-01-14 23:07 -0500
    Re: Hash stability Stefan Behnel <stefan_ml@behnel.de> - 2012-01-15 11:13 +0100
    Re: Hash stability Heiko Wundram <modelnine@modelnine.org> - 2012-01-15 12:46 +0100
    Re: Hash stability Peter Otten <__peter__@web.de> - 2012-01-15 13:22 +0100
    Re: Hash stability Heiko Wundram <modelnine@modelnine.org> - 2012-01-15 17:07 +0100
    Re: Hash stability Chris Angelico <rosuav@gmail.com> - 2012-01-16 03:13 +1100
    Re: Hash stability Heiko Wundram <modelnine@modelnine.org> - 2012-01-15 17:51 +0100
    Re: Hash stability Stefan Behnel <stefan_ml@behnel.de> - 2012-01-15 18:20 +0100
    Re: Hash stability Peter Otten <__peter__@web.de> - 2012-01-16 09:18 +0100
    Re: Hash stability Christian Heimes <lists@cheimes.de> - 2012-01-16 09:44 +0100
    Re: Hash stability Heiko Wundram <modelnine@modelnine.org> - 2012-01-16 10:15 +0100

#18958 — Hash stability

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-01-14 04:42 +0000
Subject	Hash stability
Message-ID	<4f1107b7$0$29988$c3e8da3$5496439d@news.astraweb.com>

On the Python Dev mailing list, there is a discussion going on about the 
stability of the hash function for strings.

How many people rely on hash(some_string) being stable across Python 
versions? Does anyone have code that will be broken if the string hashing 
algorithm changes?



-- 
Steven

[toc] | [next] | [standalone]

#18967

From	Peter Otten <__peter__@web.de>
Date	2012-01-14 10:46 +0100
Message-ID	<mailman.4742.1326534409.27778.python-list@python.org>
In reply to	#18958

Steven D'Aprano wrote:

> On the Python Dev mailing list, there is a discussion going on about the
> stability of the hash function for strings.
> 
> How many people rely on hash(some_string) being stable across Python
> versions? Does anyone have code that will be broken if the string hashing
> algorithm changes?

Nobody who understands the question ;)

[toc] | [prev] | [next] | [standalone]

#18983

From	Heiko Wundram <modelnine@modelnine.org>
Date	2012-01-14 23:45 +0100
Message-ID	<mailman.4754.1326581730.27778.python-list@python.org>
In reply to	#18958

Am 14.01.2012 10:46, schrieb Peter Otten:
> Steven D'Aprano wrote:
>> How many people rely on hash(some_string) being stable across Python
>> versions? Does anyone have code that will be broken if the string hashing
>> algorithm changes?
>
> Nobody who understands the question ;)

Erm, not exactly true. There are actually some packages out there (take 
suds [https://fedorahosted.org/suds/], for example) that rely on the 
hashing algorithm to be stable to function "properly" (suds uses hash() 
of strings to create caches of objects/XML Schemas on the filesystem). 
This, in a different context, bit me at the end of last week, when 
required to use suds to access EWS.

I'd personally start debating the sensibility of this decision on the 
part of the suds developers, but... That's not the question. ;-)

-- 
--- Heiko.

[toc] | [prev] | [next] | [standalone]

#18986

From	Chris Angelico <rosuav@gmail.com>
Date	2012-01-15 11:36 +1100
Message-ID	<mailman.4756.1326587769.27778.python-list@python.org>
In reply to	#18958

On Sat, Jan 14, 2012 at 3:42 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On the Python Dev mailing list, there is a discussion going on about the
> stability of the hash function for strings.
>
> How many people rely on hash(some_string) being stable across Python
> versions? Does anyone have code that will be broken if the string hashing
> algorithm changes?

On reading your post I immediately thought that you could, if changing
algorithm, simultaneously fix the issue of malicious collisions, but
that appears to be what you're doing it for primarily :)

Suggestion: Create a subclass of dict, the SecureDict or something,
which could either perturb the hashes or even use a proper
cryptographic hash function; normal dictionaries can continue to use
the current algorithm. The description in Objects/dictnotes.txt
suggests that it's still well worth keeping the current system for
programmer-controlled dictionaries, and only change user-controlled
ones (such as POST data etc).

It would then be up to the individual framework and module authors to
make use of this, but it would not impose any cost on the myriad other
uses of dictionaries - there's no point adding extra load to every
name lookup just because of a security issue in an extremely narrow
situation. It would also mean that code relying on hash(str) stability
wouldn't be broken.

ChrisA

[toc] | [prev] | [next] | [standalone]

#19004

From	Bryan <bryanjugglercryptographer@yahoo.com>
Date	2012-01-15 04:03 -0800
Message-ID	<994ca5fa-59b0-4128-8f9a-696d46db6856@4g2000pbz.googlegroups.com>
In reply to	#18986

Chris Angelico wrote:
> Suggestion: Create a subclass of dict, the SecureDict or something,
> which could either perturb the hashes or even use a proper
> cryptographic hash function; normal dictionaries can continue to use
> the current algorithm. The description in Objects/dictnotes.txt
> suggests that it's still well worth keeping the current system for
> programmer-controlled dictionaries, and only change user-controlled
> ones (such as POST data etc).

I have to disagree; that's not how the world works, at least not
anymore. Competent, skilled, dedicated programmers have over and over
again failed to appreciate the importance and the difficulty of
maintaining proper function in an adversarial environment. The tactic
of ignoring security issues unless and until they are proven
problematic stands utterly discredited.

> It would then be up to the individual framework and module authors to
> make use of this, but it would not impose any cost on the myriad other
> uses of dictionaries - there's no point adding extra load to every
> name lookup just because of a security issue in an extremely narrow
> situation. It would also mean that code relying on hash(str) stability
> wouldn't be broken.

That seemingly "extremely narrow situation" turns out to be wide as
Montana. Maybe Siberia. Does your program take input? Does it accept a
format that could possibly be downloaded from a malicious site on the
Internet? Does your market include users who occasionally make
mistakes? If not, enjoy your utter irrelevance. If so,
congratulations: you write Internet software.

Varying the hash function is just the first step. Plausible attacks
dynamically infer how to induce degenerate behavior. Replacing the
dictionary hash function with a "proper cryptographic hash function"
is a naive non-solution; all things considered it's somewhat worse
than useless. An old and interesting and relevant exercise is to
implement a dictionary with O(1) insert, look-up, and delete in the
average non-adversarial case; and O(lg n) insert, look-up, and delete
in the worse case.

[toc] | [prev] | [next] | [standalone]

#19005

From	Chris Angelico <rosuav@gmail.com>
Date	2012-01-15 23:21 +1100
Message-ID	<mailman.4768.1326630087.27778.python-list@python.org>
In reply to	#19004

On Sun, Jan 15, 2012 at 11:03 PM, Bryan
<bryanjugglercryptographer@yahoo.com> wrote:
> Chris Angelico wrote:
>> Suggestion: Create a subclass of dict, the SecureDict or something,
>> ... there's no point adding extra load to every
>> name lookup just because of a security issue in an extremely narrow
>> situation.
>
> That seemingly "extremely narrow situation" turns out to be wide as
> Montana. Maybe Siberia. Does your program take input? Does it accept a
> format that could possibly be downloaded from a malicious site on the
> Internet? Does your market include users who occasionally make
> mistakes? If not, enjoy your utter irrelevance. If so,
> congratulations: you write Internet software.

Yes, but in that "Internet software", there will only be a small
number of dictionaries that an attacker can stuff with keys (GET/POST
data, headers, cookies, etc, and anything derived therefrom); compare
the huge number of dictionaries that exist elsewhere in your Python
program. Adding load to dictionaries will add load to a huge number of
lookups that can never come under attack.

However, since posting that I've read the entire thread on the
python-dev archive. (It is, I might mention, a LOT of text.) A number
of suggestions and arguments are put forth, including a subclassing
notion similar to my postulation, and the same point is raised: that
app/framework developers won't secure their apps. Other options are
also offered (personally, I'm liking the one where an exception is
raised if something collides with too many keys - current suggestion
1000, although it could possibly work well with something that scales
with the dictionary size), and I'm sure that something will be done
that's a lot smarter than one quick idea spun off in response to a
separate query. So, I retract this idea :)

ChrisA

[toc] | [prev] | [next] | [standalone]

#18990

From	Roy Smith <roy@panix.com>
Date	2012-01-14 21:26 -0500
Message-ID	<roy-3399D9.21262614012012@news.panix.com>
In reply to	#18958

In article <4f1107b7$0$29988$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> On the Python Dev mailing list, there is a discussion going on about the 
> stability of the hash function for strings.
> 
> How many people rely on hash(some_string) being stable across Python 
> versions? Does anyone have code that will be broken if the string hashing 
> algorithm changes?

I would never rely on something like that unless the docs unambiguously 
stated it were so.  Which they don't.  All I can find about hash() is:

"Return the hash value of the object (if it has one). Hash values are 
integers. They are used to quickly compare dictionary keys during a 
dictionary lookup. Numeric values that compare equal have the same hash 
value (even if they are of different types, as is the case for 1 and 
1.0)."

[toc] | [prev] | [next] | [standalone]

#18993

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-01-14 23:07 -0500
Message-ID	<mailman.4761.1326600469.27778.python-list@python.org>
In reply to	#18990

On 1/14/2012 9:26 PM, Roy Smith wrote:
>   Steven D'Aprano<steve+comp.lang.python@pearwood.info>  wrote:
>> How many people rely on hash(some_string) being stable across Python
>> versions? Does anyone have code that will be broken if the string hashing
>> algorithm changes?
>
> I would never rely on something like that unless the docs unambiguously
> stated it were so.  Which they don't.  All I can find about hash() is:
>
> "Return the hash value of the object (if it has one).

Based on the pydev discussion since, it appears that enough people have 
inferred stability either from that or empirical stability that it will 
not be broken, by default, in pre-3.3 releases. What ever option is 
chosen to guard against attacks will probably be the default in 3.3.

-- 
Terry Jan Reedy

[toc] | [prev] | [next] | [standalone]

#19000

From	Stefan Behnel <stefan_ml@behnel.de>
Date	2012-01-15 11:13 +0100
Message-ID	<mailman.4763.1326622458.27778.python-list@python.org>
In reply to	#18958

Heiko Wundram, 14.01.2012 23:45:
> Am 14.01.2012 10:46, schrieb Peter Otten:
>> Steven D'Aprano wrote:
>>> How many people rely on hash(some_string) being stable across Python
>>> versions? Does anyone have code that will be broken if the string hashing
>>> algorithm changes?
>>
>> Nobody who understands the question ;)
> 
> Erm, not exactly true. There are actually some packages out there (take
> suds [https://fedorahosted.org/suds/], for example) that rely on the
> hashing algorithm to be stable to function "properly" (suds uses hash() of
> strings to create caches of objects/XML Schemas on the filesystem).

That's a stupid design. Using a hash function that the application does not
control to index into persistent storage just screams for getting the code
broken at some point.

Stefan

[toc] | [prev] | [next] | [standalone]

#19003

From	Heiko Wundram <modelnine@modelnine.org>
Date	2012-01-15 12:46 +0100
Message-ID	<mailman.4767.1326627982.27778.python-list@python.org>
In reply to	#18958

Am 15.01.2012 11:13, schrieb Stefan Behnel:
> That's a stupid design. Using a hash function that the application does not
> control to index into persistent storage just screams for getting the code
> broken at some point.

I agree completely with that (I hit the corresponding problem with suds 
while transitioning from 32-bit Python to 64-bit Python, where hashes 
aren't stable either), but as stated in my mail: that wasn't the 
original question. ;-)

-- 
--- Heiko.

[toc] | [prev] | [next] | [standalone]

#19006

From	Peter Otten <__peter__@web.de>
Date	2012-01-15 13:22 +0100
Message-ID	<mailman.4769.1326630152.27778.python-list@python.org>
In reply to	#18958

Heiko Wundram wrote:

> Am 15.01.2012 11:13, schrieb Stefan Behnel:
>> That's a stupid design. Using a hash function that the application does
>> not control to index into persistent storage just screams for getting the
>> code broken at some point.
> 
> I agree completely with that (I hit the corresponding problem with suds
> while transitioning from 32-bit Python to 64-bit Python, where hashes
> aren't stable either), but as stated in my mail: that wasn't the
> original question. ;-)

I'm curious: did you actually get false cache hits or just slower responses?

[toc] | [prev] | [next] | [standalone]

#19010

From	Heiko Wundram <modelnine@modelnine.org>
Date	2012-01-15 17:07 +0100
Message-ID	<mailman.4774.1326643677.27778.python-list@python.org>
In reply to	#18958

Am 15.01.2012 13:22, schrieb Peter Otten:
> Heiko Wundram wrote:
>> I agree completely with that (I hit the corresponding problem with suds
>> while transitioning from 32-bit Python to 64-bit Python, where hashes
>> aren't stable either), but as stated in my mail: that wasn't the
>> original question. ;-)
>
> I'm curious: did you actually get false cache hits or just slower responses?

It broke the application using suds, not due to false cache hits, but 
due to not getting a cache hit anymore at all.

Long story: to interpret WSDL-files, suds has to get all related DTDs 
for the WSDL file, and Microsoft (as I wrote I was querying Exchange Web 
Services) insists on using http://www.w3.org/2001/xml.dtd for the XML 
spec path. This path is sometimes functional as a GET URL, but mostly 
not (due to overload of the W3-servers), so basically I worked around 
the problem by creating an appropriate cache entry with the appropriate 
name based on hash() using a local copy of xml.dtd I had around. This 
took place on a development machine (32-bit), and when migrating the 
application to a production machine (64-bit), the cache file wasn't used 
anymore (due to the hash not being stable).

It's not that this came as a surprise (I quickly knew the "workaround" 
by simply rehashing on the target machine and moving the cache file 
appropriately), and I already said that this is mostly just a plain bad 
design decision on the part of the suds developers, but it's one of 
those cases where a non-stable hash() can break applications, and except 
if you know the internal workings of suds, this will seriously bite the 
developer.

I don't know the prevalence of suds, but I guess there's more people 
than me using it to query SOAP-services - all of those will be affected 
if the hash() output is changed. Additionally, if hash() isn't stable 
between runs (the randomized hash() solution which is preferred, and 
would also be my preference), suds caching becomes completely useless. 
And for the results, see above.

-- 
--- Heiko.

[toc] | [prev] | [next] | [standalone]

#19011

From	Chris Angelico <rosuav@gmail.com>
Date	2012-01-16 03:13 +1100
Message-ID	<mailman.4775.1326643983.27778.python-list@python.org>
In reply to	#18958

On Mon, Jan 16, 2012 at 3:07 AM, Heiko Wundram <modelnine@modelnine.org> wrote:
> I don't know the prevalence of suds, but I guess there's more people than me
> using it to query SOAP-services - all of those will be affected if the
> hash() output is changed. Additionally, if hash() isn't stable between runs
> (the randomized hash() solution which is preferred, and would also be my
> preference), suds caching becomes completely useless. And for the results,
> see above.

Or you could just monkey-patch it so that 'hash' points to an old
hashing function. If the current hash() is kept in builtins as (say)
hash_320() or hash_272() or something, then anyone who wants the old
version of the hash can still get it.

Of course, it's still dodgy to depend on the stability of something
that isn't proclaimed stable, and would be far better to use some
other hashing algorithm (MD5 or SHA for uberreliability).

ChrisA

[toc] | [prev] | [next] | [standalone]

#19012

From	Heiko Wundram <modelnine@modelnine.org>
Date	2012-01-15 17:51 +0100
Message-ID	<mailman.4776.1326646275.27778.python-list@python.org>
In reply to	#18958

Am 15.01.2012 17:13, schrieb Chris Angelico:
> On Mon, Jan 16, 2012 at 3:07 AM, Heiko Wundram<modelnine@modelnine.org>  wrote:
>> I don't know the prevalence of suds, but I guess there's more people than me
>> using it to query SOAP-services - all of those will be affected if the
>> hash() output is changed. Additionally, if hash() isn't stable between runs
>> (the randomized hash() solution which is preferred, and would also be my
>> preference), suds caching becomes completely useless. And for the results,
>> see above.
>
> Or you could just monkey-patch it so that 'hash' points to an old
> hashing function. If the current hash() is kept in builtins as (say)
> hash_320() or hash_272() or something, then anyone who wants the old
> version of the hash can still get it.

Or even easier: overwrite the default caching module (called FileCache) 
with something that implements "sensible" caching, for example by using 
the complete URL (with special characters replaced) of the DTD as a 
cache index, instead of hash()ing it. ;-)

There's "workarounds", I know - and I may be implementing one of them if 
the time comes. Again, my mail was only to point at the fact that there 
are (serious) projects out there relying on the "stableness" of hash(), 
and that these will get bitten when hash() is replaced. Which is not a 
bad thing if you ask me. ;-)

-- 
--- Heiko.

[toc] | [prev] | [next] | [standalone]

#19013

From	Stefan Behnel <stefan_ml@behnel.de>
Date	2012-01-15 18:20 +0100
Message-ID	<mailman.4777.1326648019.27778.python-list@python.org>
In reply to	#18958

Chris Angelico, 15.01.2012 17:13:
> Of course, it's still dodgy to depend on the stability of something
> that isn't proclaimed stable, and would be far better to use some
> other hashing algorithm (MD5 or SHA for uberreliability).

I've seen things like MD5 or SHA* being used quite commonly for file caches
(or file storage in general, e.g. for related files referenced in a text
document). Given that these algorithms are right there in the stdlib, I
find them a rather obvious choice.

However, note that they may also be subject to complexity attacks at some
point, although likely requiring substantially more input data. In the
specific case of a cache, an attacker may only need an arbitrary set of
colliding hashes. Those can be calculated in advance for a given hash
function. For example, Wikipedia currently presents MD5 with a collision
complexity of ~2^20, that sounds a bit weak. Something like SHA256 should
be substantially more robust.

https://en.wikipedia.org/wiki/Cryptographic_hash_function#Cryptographic_hash_algorithms

Stefan

[toc] | [prev] | [next] | [standalone]

#19033

From	Peter Otten <__peter__@web.de>
Date	2012-01-16 09:18 +0100
Message-ID	<mailman.4787.1326701934.27778.python-list@python.org>
In reply to	#18958

Heiko Wundram wrote:

> Am 15.01.2012 13:22, schrieb Peter Otten:

>> I'm curious: did you actually get false cache hits 

in a suds cache

>> or just slower responses?

> It broke the application using suds, not due to false cache hits, but
> due to not getting a cache hit anymore at all.

> so basically I worked around
> the problem by creating an appropriate cache entry with the appropriate
> name based on hash() using a local copy of xml.dtd I had around. This
> took place on a development machine (32-bit), and when migrating the
> application to a production machine (64-bit), the cache file wasn't used
> anymore (due to the hash not being stable).

Thanks for the explanation.

> if the hash() output is changed. Additionally, if hash() isn't stable
> between runs (the randomized hash() solution which is preferred, and
> would also be my preference), suds caching becomes completely useless.
> And for the results, see above.

I've taken a quick look into the suds source; the good news is that you have 
to change a single method, reader.Reader.mangle(), to fix the problem with 
hash stability. 

However, I didn't see any code to deal with hash collisions at all.

[toc] | [prev] | [next] | [standalone]

#19034

From	Christian Heimes <lists@cheimes.de>
Date	2012-01-16 09:44 +0100
Message-ID	<mailman.4788.1326703463.27778.python-list@python.org>
In reply to	#18958

Am 16.01.2012 09:18, schrieb Peter Otten:
> I've taken a quick look into the suds source; the good news is that you have 
> to change a single method, reader.Reader.mangle(), to fix the problem with 
> hash stability. 
> 
> However, I didn't see any code to deal with hash collisions at all.

It smells like suds is vulnerable to cache poisoning.

[toc] | [prev] | [next] | [standalone]

#19035

From	Heiko Wundram <modelnine@modelnine.org>
Date	2012-01-16 10:15 +0100
Message-ID	<mailman.4789.1326705327.27778.python-list@python.org>
In reply to	#18958

Am 16.01.2012 09:44, schrieb Christian Heimes:
> Am 16.01.2012 09:18, schrieb Peter Otten:
>> I've taken a quick look into the suds source; the good news is that you have
>> to change a single method, reader.Reader.mangle(), to fix the problem with
>> hash stability.
>>
>> However, I didn't see any code to deal with hash collisions at all.
>
> It smells like suds is vulnerable to cache poisoning.

That it is, yes, at least partially. Generally, this is only relevant in 
case you are actually caching DTDs (which is the default) and in case 
you are querying untrusted SOAP-servers (in which case you'll most 
likely/should not use caching anyway), and in case the attacker has 
control over the URL namespace of a DTD-serving host (because the 
host-part of the DTD URL is used in the cache filename, unhashed, only 
the actual path is hashed to form the cache index).

The easier way to poison the cache is most probably through actual 
traffic modification, as most DTD URLs are served through plain http and 
thus are suspect to MitM-modifications, anyway.

-- 
--- Heiko.

[toc] | [prev] | [standalone]

csiph-web

Hash stability

Contents

#18958 — Hash stability

#18967

#18983

#18986

#19004

#19005

#18990

#18993

#19000

#19003

#19006

#19010

#19011

#19012

#19013

#19033

#19034

#19035