Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!goblin1!goblin2!goblin.stu.neva.ru!newsfeed1.swip.net!npeer01.iad.highwinds-media.com!news.highwinds-media.com!feed-me.highwinds-media.com!spln!extra.newsguy.com!newsp.newsguy.com!news3 From: Michael Wojcik Newsgroups: comp.lang.java.programmer,comp.programming,comp.lang.java.databases Subject: Re: Storing large strings for future equality checks Date: Thu, 09 Jun 2011 17:32:12 -0400 Organization: Micro Focus Lines: 34 Message-ID: References: NNTP-Posting-Host: pc9a9a4f7df75b99e9b1916247b8f156a2b0ed3239e1f683f.newsdawg.com Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.23) Gecko/20090812 Thunderbird/2.0.0.23 Mnenhy/0.7.5.0 In-Reply-To: Xref: x330-a1.tempe.blueboxinc.net comp.lang.java.programmer:5166 comp.programming:451 comp.lang.java.databases:473 Gene Wirchenko wrote: > On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya > wrote: > >> I considered using an SHA-512 hash of these strings and storing them in >> the database. However, while these will save on storage space, it will >> take time to do the hashing before comparing an incoming string. So I'm >> still wasting time. (Collisions due to hashing will not be a problem, >> since an occasional false positive will not be fatal for my application). > > What does "occasional" mean? Assuming the application doesn't accidentally run afoul of a hitherto-unknown flaw in SHA-512 - a tremendously unlikely event - it means "once in every N/2**256 uses", where N is the current number of hashes in the database. (2**256 because of the Birthday Paradox; we're interested in the probability of two arbitrary colliding pre-images.) Or, in other words, "probably not before the heat death of the universe". (Or false vacuum decay, zombie apocalypse, Rapture, etc.) Worrying about the time to hash the string is silly. It's linear in the length of the string, so it's roughly equivalent to the time taken to do a few comparisons (where "a few" depends on how long, on average, the matching prefix of the new string and the candidates is). However, as various others have pointed out, this is premature optimization. There's no reason to use any design other than the obvious until a problem is demonstrated. -- Michael Wojcik Micro Focus Rhetoric & Writing, Michigan State University