Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #18958 > unrolled thread
| Started by | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| First post | 2012-01-14 04:42 +0000 |
| Last post | 2012-01-16 10:15 +0100 |
| Articles | 18 — 9 participants |
Back to article view | Back to comp.lang.python
Hash stability Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-01-14 04:42 +0000
Re: Hash stability Peter Otten <__peter__@web.de> - 2012-01-14 10:46 +0100
Re: Hash stability Heiko Wundram <modelnine@modelnine.org> - 2012-01-14 23:45 +0100
Re: Hash stability Chris Angelico <rosuav@gmail.com> - 2012-01-15 11:36 +1100
Re: Hash stability Bryan <bryanjugglercryptographer@yahoo.com> - 2012-01-15 04:03 -0800
Re: Hash stability Chris Angelico <rosuav@gmail.com> - 2012-01-15 23:21 +1100
Re: Hash stability Roy Smith <roy@panix.com> - 2012-01-14 21:26 -0500
Re: Hash stability Terry Reedy <tjreedy@udel.edu> - 2012-01-14 23:07 -0500
Re: Hash stability Stefan Behnel <stefan_ml@behnel.de> - 2012-01-15 11:13 +0100
Re: Hash stability Heiko Wundram <modelnine@modelnine.org> - 2012-01-15 12:46 +0100
Re: Hash stability Peter Otten <__peter__@web.de> - 2012-01-15 13:22 +0100
Re: Hash stability Heiko Wundram <modelnine@modelnine.org> - 2012-01-15 17:07 +0100
Re: Hash stability Chris Angelico <rosuav@gmail.com> - 2012-01-16 03:13 +1100
Re: Hash stability Heiko Wundram <modelnine@modelnine.org> - 2012-01-15 17:51 +0100
Re: Hash stability Stefan Behnel <stefan_ml@behnel.de> - 2012-01-15 18:20 +0100
Re: Hash stability Peter Otten <__peter__@web.de> - 2012-01-16 09:18 +0100
Re: Hash stability Christian Heimes <lists@cheimes.de> - 2012-01-16 09:44 +0100
Re: Hash stability Heiko Wundram <modelnine@modelnine.org> - 2012-01-16 10:15 +0100
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-01-14 04:42 +0000 |
| Subject | Hash stability |
| Message-ID | <4f1107b7$0$29988$c3e8da3$5496439d@news.astraweb.com> |
On the Python Dev mailing list, there is a discussion going on about the stability of the hash function for strings. How many people rely on hash(some_string) being stable across Python versions? Does anyone have code that will be broken if the string hashing algorithm changes? -- Steven
[toc] | [next] | [standalone]
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-01-14 10:46 +0100 |
| Message-ID | <mailman.4742.1326534409.27778.python-list@python.org> |
| In reply to | #18958 |
Steven D'Aprano wrote: > On the Python Dev mailing list, there is a discussion going on about the > stability of the hash function for strings. > > How many people rely on hash(some_string) being stable across Python > versions? Does anyone have code that will be broken if the string hashing > algorithm changes? Nobody who understands the question ;)
[toc] | [prev] | [next] | [standalone]
| From | Heiko Wundram <modelnine@modelnine.org> |
|---|---|
| Date | 2012-01-14 23:45 +0100 |
| Message-ID | <mailman.4754.1326581730.27778.python-list@python.org> |
| In reply to | #18958 |
Am 14.01.2012 10:46, schrieb Peter Otten: > Steven D'Aprano wrote: >> How many people rely on hash(some_string) being stable across Python >> versions? Does anyone have code that will be broken if the string hashing >> algorithm changes? > > Nobody who understands the question ;) Erm, not exactly true. There are actually some packages out there (take suds [https://fedorahosted.org/suds/], for example) that rely on the hashing algorithm to be stable to function "properly" (suds uses hash() of strings to create caches of objects/XML Schemas on the filesystem). This, in a different context, bit me at the end of last week, when required to use suds to access EWS. I'd personally start debating the sensibility of this decision on the part of the suds developers, but... That's not the question. ;-) -- --- Heiko.
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-01-15 11:36 +1100 |
| Message-ID | <mailman.4756.1326587769.27778.python-list@python.org> |
| In reply to | #18958 |
On Sat, Jan 14, 2012 at 3:42 PM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > On the Python Dev mailing list, there is a discussion going on about the > stability of the hash function for strings. > > How many people rely on hash(some_string) being stable across Python > versions? Does anyone have code that will be broken if the string hashing > algorithm changes? On reading your post I immediately thought that you could, if changing algorithm, simultaneously fix the issue of malicious collisions, but that appears to be what you're doing it for primarily :) Suggestion: Create a subclass of dict, the SecureDict or something, which could either perturb the hashes or even use a proper cryptographic hash function; normal dictionaries can continue to use the current algorithm. The description in Objects/dictnotes.txt suggests that it's still well worth keeping the current system for programmer-controlled dictionaries, and only change user-controlled ones (such as POST data etc). It would then be up to the individual framework and module authors to make use of this, but it would not impose any cost on the myriad other uses of dictionaries - there's no point adding extra load to every name lookup just because of a security issue in an extremely narrow situation. It would also mean that code relying on hash(str) stability wouldn't be broken. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Bryan <bryanjugglercryptographer@yahoo.com> |
|---|---|
| Date | 2012-01-15 04:03 -0800 |
| Message-ID | <994ca5fa-59b0-4128-8f9a-696d46db6856@4g2000pbz.googlegroups.com> |
| In reply to | #18986 |
Chris Angelico wrote: > Suggestion: Create a subclass of dict, the SecureDict or something, > which could either perturb the hashes or even use a proper > cryptographic hash function; normal dictionaries can continue to use > the current algorithm. The description in Objects/dictnotes.txt > suggests that it's still well worth keeping the current system for > programmer-controlled dictionaries, and only change user-controlled > ones (such as POST data etc). I have to disagree; that's not how the world works, at least not anymore. Competent, skilled, dedicated programmers have over and over again failed to appreciate the importance and the difficulty of maintaining proper function in an adversarial environment. The tactic of ignoring security issues unless and until they are proven problematic stands utterly discredited. > It would then be up to the individual framework and module authors to > make use of this, but it would not impose any cost on the myriad other > uses of dictionaries - there's no point adding extra load to every > name lookup just because of a security issue in an extremely narrow > situation. It would also mean that code relying on hash(str) stability > wouldn't be broken. That seemingly "extremely narrow situation" turns out to be wide as Montana. Maybe Siberia. Does your program take input? Does it accept a format that could possibly be downloaded from a malicious site on the Internet? Does your market include users who occasionally make mistakes? If not, enjoy your utter irrelevance. If so, congratulations: you write Internet software. Varying the hash function is just the first step. Plausible attacks dynamically infer how to induce degenerate behavior. Replacing the dictionary hash function with a "proper cryptographic hash function" is a naive non-solution; all things considered it's somewhat worse than useless. An old and interesting and relevant exercise is to implement a dictionary with O(1) insert, look-up, and delete in the average non-adversarial case; and O(lg n) insert, look-up, and delete in the worse case.
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-01-15 23:21 +1100 |
| Message-ID | <mailman.4768.1326630087.27778.python-list@python.org> |
| In reply to | #19004 |
On Sun, Jan 15, 2012 at 11:03 PM, Bryan <bryanjugglercryptographer@yahoo.com> wrote: > Chris Angelico wrote: >> Suggestion: Create a subclass of dict, the SecureDict or something, >> ... there's no point adding extra load to every >> name lookup just because of a security issue in an extremely narrow >> situation. > > That seemingly "extremely narrow situation" turns out to be wide as > Montana. Maybe Siberia. Does your program take input? Does it accept a > format that could possibly be downloaded from a malicious site on the > Internet? Does your market include users who occasionally make > mistakes? If not, enjoy your utter irrelevance. If so, > congratulations: you write Internet software. Yes, but in that "Internet software", there will only be a small number of dictionaries that an attacker can stuff with keys (GET/POST data, headers, cookies, etc, and anything derived therefrom); compare the huge number of dictionaries that exist elsewhere in your Python program. Adding load to dictionaries will add load to a huge number of lookups that can never come under attack. However, since posting that I've read the entire thread on the python-dev archive. (It is, I might mention, a LOT of text.) A number of suggestions and arguments are put forth, including a subclassing notion similar to my postulation, and the same point is raised: that app/framework developers won't secure their apps. Other options are also offered (personally, I'm liking the one where an exception is raised if something collides with too many keys - current suggestion 1000, although it could possibly work well with something that scales with the dictionary size), and I'm sure that something will be done that's a lot smarter than one quick idea spun off in response to a separate query. So, I retract this idea :) ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-01-14 21:26 -0500 |
| Message-ID | <roy-3399D9.21262614012012@news.panix.com> |
| In reply to | #18958 |
In article <4f1107b7$0$29988$c3e8da3$5496439d@news.astraweb.com>, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > On the Python Dev mailing list, there is a discussion going on about the > stability of the hash function for strings. > > How many people rely on hash(some_string) being stable across Python > versions? Does anyone have code that will be broken if the string hashing > algorithm changes? I would never rely on something like that unless the docs unambiguously stated it were so. Which they don't. All I can find about hash() is: "Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0)."
[toc] | [prev] | [next] | [standalone]
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-01-14 23:07 -0500 |
| Message-ID | <mailman.4761.1326600469.27778.python-list@python.org> |
| In reply to | #18990 |
On 1/14/2012 9:26 PM, Roy Smith wrote: > Steven D'Aprano<steve+comp.lang.python@pearwood.info> wrote: >> How many people rely on hash(some_string) being stable across Python >> versions? Does anyone have code that will be broken if the string hashing >> algorithm changes? > > I would never rely on something like that unless the docs unambiguously > stated it were so. Which they don't. All I can find about hash() is: > > "Return the hash value of the object (if it has one). Based on the pydev discussion since, it appears that enough people have inferred stability either from that or empirical stability that it will not be broken, by default, in pre-3.3 releases. What ever option is chosen to guard against attacks will probably be the default in 3.3. -- Terry Jan Reedy
[toc] | [prev] | [next] | [standalone]
| From | Stefan Behnel <stefan_ml@behnel.de> |
|---|---|
| Date | 2012-01-15 11:13 +0100 |
| Message-ID | <mailman.4763.1326622458.27778.python-list@python.org> |
| In reply to | #18958 |
Heiko Wundram, 14.01.2012 23:45: > Am 14.01.2012 10:46, schrieb Peter Otten: >> Steven D'Aprano wrote: >>> How many people rely on hash(some_string) being stable across Python >>> versions? Does anyone have code that will be broken if the string hashing >>> algorithm changes? >> >> Nobody who understands the question ;) > > Erm, not exactly true. There are actually some packages out there (take > suds [https://fedorahosted.org/suds/], for example) that rely on the > hashing algorithm to be stable to function "properly" (suds uses hash() of > strings to create caches of objects/XML Schemas on the filesystem). That's a stupid design. Using a hash function that the application does not control to index into persistent storage just screams for getting the code broken at some point. Stefan
[toc] | [prev] | [next] | [standalone]
| From | Heiko Wundram <modelnine@modelnine.org> |
|---|---|
| Date | 2012-01-15 12:46 +0100 |
| Message-ID | <mailman.4767.1326627982.27778.python-list@python.org> |
| In reply to | #18958 |
Am 15.01.2012 11:13, schrieb Stefan Behnel: > That's a stupid design. Using a hash function that the application does not > control to index into persistent storage just screams for getting the code > broken at some point. I agree completely with that (I hit the corresponding problem with suds while transitioning from 32-bit Python to 64-bit Python, where hashes aren't stable either), but as stated in my mail: that wasn't the original question. ;-) -- --- Heiko.
[toc] | [prev] | [next] | [standalone]
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-01-15 13:22 +0100 |
| Message-ID | <mailman.4769.1326630152.27778.python-list@python.org> |
| In reply to | #18958 |
Heiko Wundram wrote: > Am 15.01.2012 11:13, schrieb Stefan Behnel: >> That's a stupid design. Using a hash function that the application does >> not control to index into persistent storage just screams for getting the >> code broken at some point. > > I agree completely with that (I hit the corresponding problem with suds > while transitioning from 32-bit Python to 64-bit Python, where hashes > aren't stable either), but as stated in my mail: that wasn't the > original question. ;-) I'm curious: did you actually get false cache hits or just slower responses?
[toc] | [prev] | [next] | [standalone]
| From | Heiko Wundram <modelnine@modelnine.org> |
|---|---|
| Date | 2012-01-15 17:07 +0100 |
| Message-ID | <mailman.4774.1326643677.27778.python-list@python.org> |
| In reply to | #18958 |
Am 15.01.2012 13:22, schrieb Peter Otten: > Heiko Wundram wrote: >> I agree completely with that (I hit the corresponding problem with suds >> while transitioning from 32-bit Python to 64-bit Python, where hashes >> aren't stable either), but as stated in my mail: that wasn't the >> original question. ;-) > > I'm curious: did you actually get false cache hits or just slower responses? It broke the application using suds, not due to false cache hits, but due to not getting a cache hit anymore at all. Long story: to interpret WSDL-files, suds has to get all related DTDs for the WSDL file, and Microsoft (as I wrote I was querying Exchange Web Services) insists on using http://www.w3.org/2001/xml.dtd for the XML spec path. This path is sometimes functional as a GET URL, but mostly not (due to overload of the W3-servers), so basically I worked around the problem by creating an appropriate cache entry with the appropriate name based on hash() using a local copy of xml.dtd I had around. This took place on a development machine (32-bit), and when migrating the application to a production machine (64-bit), the cache file wasn't used anymore (due to the hash not being stable). It's not that this came as a surprise (I quickly knew the "workaround" by simply rehashing on the target machine and moving the cache file appropriately), and I already said that this is mostly just a plain bad design decision on the part of the suds developers, but it's one of those cases where a non-stable hash() can break applications, and except if you know the internal workings of suds, this will seriously bite the developer. I don't know the prevalence of suds, but I guess there's more people than me using it to query SOAP-services - all of those will be affected if the hash() output is changed. Additionally, if hash() isn't stable between runs (the randomized hash() solution which is preferred, and would also be my preference), suds caching becomes completely useless. And for the results, see above. -- --- Heiko.
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-01-16 03:13 +1100 |
| Message-ID | <mailman.4775.1326643983.27778.python-list@python.org> |
| In reply to | #18958 |
On Mon, Jan 16, 2012 at 3:07 AM, Heiko Wundram <modelnine@modelnine.org> wrote: > I don't know the prevalence of suds, but I guess there's more people than me > using it to query SOAP-services - all of those will be affected if the > hash() output is changed. Additionally, if hash() isn't stable between runs > (the randomized hash() solution which is preferred, and would also be my > preference), suds caching becomes completely useless. And for the results, > see above. Or you could just monkey-patch it so that 'hash' points to an old hashing function. If the current hash() is kept in builtins as (say) hash_320() or hash_272() or something, then anyone who wants the old version of the hash can still get it. Of course, it's still dodgy to depend on the stability of something that isn't proclaimed stable, and would be far better to use some other hashing algorithm (MD5 or SHA for uberreliability). ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Heiko Wundram <modelnine@modelnine.org> |
|---|---|
| Date | 2012-01-15 17:51 +0100 |
| Message-ID | <mailman.4776.1326646275.27778.python-list@python.org> |
| In reply to | #18958 |
Am 15.01.2012 17:13, schrieb Chris Angelico: > On Mon, Jan 16, 2012 at 3:07 AM, Heiko Wundram<modelnine@modelnine.org> wrote: >> I don't know the prevalence of suds, but I guess there's more people than me >> using it to query SOAP-services - all of those will be affected if the >> hash() output is changed. Additionally, if hash() isn't stable between runs >> (the randomized hash() solution which is preferred, and would also be my >> preference), suds caching becomes completely useless. And for the results, >> see above. > > Or you could just monkey-patch it so that 'hash' points to an old > hashing function. If the current hash() is kept in builtins as (say) > hash_320() or hash_272() or something, then anyone who wants the old > version of the hash can still get it. Or even easier: overwrite the default caching module (called FileCache) with something that implements "sensible" caching, for example by using the complete URL (with special characters replaced) of the DTD as a cache index, instead of hash()ing it. ;-) There's "workarounds", I know - and I may be implementing one of them if the time comes. Again, my mail was only to point at the fact that there are (serious) projects out there relying on the "stableness" of hash(), and that these will get bitten when hash() is replaced. Which is not a bad thing if you ask me. ;-) -- --- Heiko.
[toc] | [prev] | [next] | [standalone]
| From | Stefan Behnel <stefan_ml@behnel.de> |
|---|---|
| Date | 2012-01-15 18:20 +0100 |
| Message-ID | <mailman.4777.1326648019.27778.python-list@python.org> |
| In reply to | #18958 |
Chris Angelico, 15.01.2012 17:13: > Of course, it's still dodgy to depend on the stability of something > that isn't proclaimed stable, and would be far better to use some > other hashing algorithm (MD5 or SHA for uberreliability). I've seen things like MD5 or SHA* being used quite commonly for file caches (or file storage in general, e.g. for related files referenced in a text document). Given that these algorithms are right there in the stdlib, I find them a rather obvious choice. However, note that they may also be subject to complexity attacks at some point, although likely requiring substantially more input data. In the specific case of a cache, an attacker may only need an arbitrary set of colliding hashes. Those can be calculated in advance for a given hash function. For example, Wikipedia currently presents MD5 with a collision complexity of ~2^20, that sounds a bit weak. Something like SHA256 should be substantially more robust. https://en.wikipedia.org/wiki/Cryptographic_hash_function#Cryptographic_hash_algorithms Stefan
[toc] | [prev] | [next] | [standalone]
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-01-16 09:18 +0100 |
| Message-ID | <mailman.4787.1326701934.27778.python-list@python.org> |
| In reply to | #18958 |
Heiko Wundram wrote: > Am 15.01.2012 13:22, schrieb Peter Otten: >> I'm curious: did you actually get false cache hits in a suds cache >> or just slower responses? > It broke the application using suds, not due to false cache hits, but > due to not getting a cache hit anymore at all. > so basically I worked around > the problem by creating an appropriate cache entry with the appropriate > name based on hash() using a local copy of xml.dtd I had around. This > took place on a development machine (32-bit), and when migrating the > application to a production machine (64-bit), the cache file wasn't used > anymore (due to the hash not being stable). Thanks for the explanation. > if the hash() output is changed. Additionally, if hash() isn't stable > between runs (the randomized hash() solution which is preferred, and > would also be my preference), suds caching becomes completely useless. > And for the results, see above. I've taken a quick look into the suds source; the good news is that you have to change a single method, reader.Reader.mangle(), to fix the problem with hash stability. However, I didn't see any code to deal with hash collisions at all.
[toc] | [prev] | [next] | [standalone]
| From | Christian Heimes <lists@cheimes.de> |
|---|---|
| Date | 2012-01-16 09:44 +0100 |
| Message-ID | <mailman.4788.1326703463.27778.python-list@python.org> |
| In reply to | #18958 |
Am 16.01.2012 09:18, schrieb Peter Otten: > I've taken a quick look into the suds source; the good news is that you have > to change a single method, reader.Reader.mangle(), to fix the problem with > hash stability. > > However, I didn't see any code to deal with hash collisions at all. It smells like suds is vulnerable to cache poisoning.
[toc] | [prev] | [next] | [standalone]
| From | Heiko Wundram <modelnine@modelnine.org> |
|---|---|
| Date | 2012-01-16 10:15 +0100 |
| Message-ID | <mailman.4789.1326705327.27778.python-list@python.org> |
| In reply to | #18958 |
Am 16.01.2012 09:44, schrieb Christian Heimes: > Am 16.01.2012 09:18, schrieb Peter Otten: >> I've taken a quick look into the suds source; the good news is that you have >> to change a single method, reader.Reader.mangle(), to fix the problem with >> hash stability. >> >> However, I didn't see any code to deal with hash collisions at all. > > It smells like suds is vulnerable to cache poisoning. That it is, yes, at least partially. Generally, this is only relevant in case you are actually caching DTDs (which is the default) and in case you are querying untrusted SOAP-servers (in which case you'll most likely/should not use caching anyway), and in case the attacker has control over the URL namespace of a DTD-serving host (because the host-part of the DTD URL is used in the cache filename, unhashed, only the actual path is hashed to form the cache index). The easier way to poison the cache is most probably through actual traffic modification, as most DTD URLs are served through plain http and thus are suspect to MitM-modifications, anyway. -- --- Heiko.
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web