From: Lothar Kimmeringer
Newsgroups: comp.lang.java.programmer,comp.programming,comp.lang.java.databases
Subject: Re: Storing large strings for future equality checks
Followup-To: comp.lang.java.programmer
Date: Wed, 8 Jun 2011 20:28:11 +0200
Message-ID: <171dpt2926br2.dlg@kimmeringer.de>

F'up to cljp

Abu Yahya wrote:

> I considered using an SHA-512 hash of these strings and storing them in
> the database. However, while these will save on storage space, it will
> take time to do the hashing before comparing an incoming string. So I'm
> still wasting time. (Collisions due to hashing will not be a problem,
> since an occasional false positive will not be fatal for my application).

If you write seldom and read often, why not use two columns:

string_hashcode
sha1_hashcode

If the first matches a stored value, you calculate the SHA-1 hash of the
string to be checked, and if that matches as well, you can consider the
strings equal. I expect it to be very, very unlikely that both hashes
collide at the same time (which is why I changed the other algorithm to
SHA-1, which should be considerably faster than SHA-512).
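A rough sketch of the two-step check in Java, to make the idea concrete. The class name TwoTierLookup, the in-memory map standing in for the two database columns, and the sha1Calls counter are all made up for illustration; a real version would run a query against the table instead.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for a table with the two columns
// string_hashcode and sha1_hashcode; not a real database layer.
class TwoTierLookup {
    // string_hashcode -> all sha1_hashcode values stored under it
    private final Map<Integer, Set<String>> rows = new HashMap<>();
    // Counts SHA-1 computations, only to show when the expensive step runs.
    int sha1Calls = 0;

    // Writing: both hashes are computed and stored.
    void store(String s) {
        rows.computeIfAbsent(s.hashCode(), k -> new HashSet<>()).add(sha1Hex(s));
    }

    // Reading: the cheap hash is checked first, SHA-1 only on a match.
    boolean contains(String s) {
        Set<String> candidates = rows.get(s.hashCode());
        if (candidates == null) {
            return false; // no row with this string_hashcode, SHA-1 is skipped
        }
        return candidates.contains(sha1Hex(s));
    }

    private String sha1Hex(String s) {
        sha1Calls++;
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on the Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

The strings "Aa" and "BB" have the same String.hashCode() value, so after storing "Aa" a lookup of "BB" passes the first check and is rejected only by the SHA-1 comparison. A string whose hashcode is not stored at all never reaches the SHA-1 step.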
So the more expensive algorithm is only calculated when storing to the
database and when checking a string whose string_hashcode matches a stored
one, which in most cases means a string that is already in the database.
Even if that case comes up very often, String.hashCode plus SHA-1 might
still perform better than SHA-512 alone.

> What would be the best approach?

There is no single best approach, only an optimal one. Which one that is
depends on what makes one way better than another (in terms of
performance, storage space, collision rates, etc.).

Regards, Lothar
-- 
Lothar Kimmeringer E-Mail: spamfang@kimmeringer.de
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!