Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!aioe.org!.POSTED!not-for-mail From: Abu Yahya Newsgroups: comp.lang.java.programmer,comp.programming,comp.lang.java.databases Subject: Re: Storing large strings for future equality checks Date: Wed, 08 Jun 2011 23:45:00 +0530 Organization: Aioe.org NNTP Server Lines: 38 Message-ID: References: NNTP-Posting-Host: LePVoNEtezBuiMA9+cM5gA.user.speranza.aioe.org Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Complaints-To: abuse@aioe.org User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.17) Gecko/20110414 Thunderbird/3.1.10 X-Notice: Filtered by postfilter v. 0.8.2 Xref: x330-a1.tempe.blueboxinc.net comp.lang.java.programmer:5123 comp.programming:441 comp.lang.java.databases:463 On 6/8/2011 10:19 PM, markspace wrote: > On 6/8/2011 9:35 AM, Abu Yahya wrote: > >> I considered using an SHA-512 hash of these strings and storing them in >> the database. However, while these will save on storage space, it will >> take time to do the hashing before comparing an incoming string. So I'm >> still wasting time. (Collisions due to hashing will not be a problem, >> since an occasional false positive will not be fatal for my application). > > > You have to store the whole string. Even if the SHA-512 hash codes are > equal, it could be that the strings are different. You'll have to > eventually compare the raw string, even if the SHA is used as a > quick-out case. You're right about comparing the whole strings anyway, but, for this application I'm creating, I wouldn't mind very very rare incorrect results. > > No one can really tell what is "faster" or "wasting time" until you can > better characterize the usage patterns. How big can these strings get? > How often will you get an actual duplicate? What's the penalty when you > need to add a new string? You'll need to implement a few algorithms, > profile them and then make a decision based on actual data. That makes sense. I'd need to analyze my input data and then run some empirical tests. > > For Java, I'd store the strings in a WeakHashMap or similar to allow > them to be cached, but tossed away when more storage is needed. Also you > should look into getting some DB caching library, much easier than > implementing this yourself (sorry I can't personally recommend any). The WeakHashMap idea looks useful. As for a DB caching library, would you recommend this as a replacement for the WeakHashMap, or as a complement? Thanks for the useful pointers.