From: markspace <-@.>
Newsgroups: comp.lang.java.programmer,comp.programming,comp.lang.java.databases
Subject: Re: Storing large strings for future equality checks
Date: Wed, 08 Jun 2011 09:49:30 -0700

On 6/8/2011 9:35 AM, Abu Yahya wrote:
> I considered using an SHA-512 hash of these strings and storing them in
> the database. However, while these will save on storage space, it will
> take time to do the hashing before comparing an incoming string. So I'm
> still wasting time. (Collisions due to hashing will not be a problem,
> since an occasional false positive will not be fatal for my application).

You have to store the whole string. Even if the SHA-512 hash codes are equal, it could be that the strings are different. You'll have to eventually compare the raw string, even if the SHA is used as a quick-out case.

No one can really tell what is "faster" or "wasting time" until you can better characterize the usage patterns. How big can these strings get? How often will you get an actual duplicate? What's the penalty when you need to add a new string?
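
To make the "quick-out" idea concrete, here is a minimal sketch, not a definitive implementation. The class StringStore and its methods are hypothetical names, and the HashMap is only an in-memory stand-in for the database table. The digest narrows the search to a small bucket, and the raw strings are still compared before declaring a match.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the database table.
class StringStore {
    // Hex SHA-512 digest -> every stored string that has this digest.
    private final Map<String, List<String>> byDigest = new HashMap<>();

    // Computes the SHA-512 digest of the UTF-8 bytes, as lowercase hex.
    static String sha512Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-512");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Assumption: the running JDK provides SHA-512 (standard providers do).
            throw new IllegalStateException(e);
        }
    }

    // Stores s unless an identical string is already present.
    // Returns true if the string was new.
    boolean add(String s) {
        List<String> bucket =
                byDigest.computeIfAbsent(sha512Hex(s), k -> new ArrayList<>());
        if (bucket.contains(s)) {
            return false; // digest matched AND raw string matched
        }
        bucket.add(s);
        return true;
    }

    // Quick-out: a missing digest means the string is definitely absent.
    // On a digest hit, the raw strings are compared to rule out a collision.
    boolean contains(String s) {
        List<String> bucket = byDigest.get(sha512Hex(s));
        return bucket != null && bucket.contains(s);
    }
}
```

Calling add("hello") twice returns true and then false. Whether the hashing cost pays off depends on the string sizes and duplicate rate asked about above, so it is worth profiling against a plain comparison.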
You'll need to implement a few algorithms, profile them, and then make a decision based on actual data.

For Java, I'd store the strings in a WeakHashMap or similar to allow them to be cached, but tossed away when more storage is needed. Also, you should look into getting a DB caching library; that is much easier than implementing this yourself (sorry, I can't personally recommend any).
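
As a rough illustration of the "or similar" option, here is a small bounded cache sketch. Note that it swaps in a different technique: an LRU cache built on LinkedHashMap rather than a WeakHashMap. A WeakHashMap drops an entry once its key is no longer strongly referenced elsewhere, which is not quite the same as evicting under memory pressure, so a fixed-size LRU is easier to reason about and to test. The class name SeenStringCache and the capacity parameter are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded cache of strings already known to be in the database.
class SeenStringCache {
    private final Map<String, Boolean> lru;

    SeenStringCache(final int capacity) {
        // accessOrder = true keeps entries ordered least-recently-used first.
        this.lru = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                // Evict the least-recently-used entry once capacity is exceeded.
                return size() > capacity;
            }
        };
    }

    // True if the string is cached; a hit also marks it as recently used.
    // A miss only means "not cached", so the caller still has to ask the database.
    boolean isCached(String s) {
        return lru.get(s) != null;
    }

    // Records a string that has been confirmed to exist in the database.
    void remember(String s) {
        lru.put(s, Boolean.TRUE);
    }

    int size() {
        return lru.size();
    }
}
```

With a capacity of 2, remembering "a" and "b", looking up "a", and then remembering "c" evicts "b", since "a" was used more recently. This sketch is not thread-safe; a real caching library would handle that and the database round trip for you.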