Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!aioe.org!.POSTED!not-for-mail
From: Abu Yahya <abu_yahya@invalid.com>
Newsgroups: comp.lang.java.programmer,comp.programming,comp.lang.java.databases
Subject: Re: Storing large strings for future equality checks
Date: Wed, 08 Jun 2011 23:45:00 +0530
Organization: Aioe.org NNTP Server
Lines: 38
Message-ID: <isoe78$pdj$1@speranza.aioe.org>
References: <iso8cm$a80$1@speranza.aioe.org> <iso96r$pml$1@dont-email.me>
NNTP-Posting-Host: LePVoNEtezBuiMA9+cM5gA.user.speranza.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Complaints-To: abuse@aioe.org
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.17) Gecko/20110414 Thunderbird/3.1.10
X-Notice: Filtered by postfilter v. 0.8.2
Xref: x330-a1.tempe.blueboxinc.net comp.lang.java.programmer:5123 comp.programming:441 comp.lang.java.databases:463

On 6/8/2011 10:19 PM, markspace wrote:
> On 6/8/2011 9:35 AM, Abu Yahya wrote:
>
>> I considered using an SHA-512 hash of these strings and storing them in
>> the database. However, while these will save on storage space, it will
>> take time to do the hashing before comparing an incoming string. So I'm
>> still wasting time. (Collisions due to hashing will not be a problem,
>> since an occasional false positive will not be fatal for my application).
>
>
> You have to store the whole string. Even if the SHA-512 hash codes are
> equal, it could be that the strings are different. You'll have to
> eventually compare the raw string, even if the SHA is used as a
> quick-out case.

You're right about comparing the whole strings anyway, but, for this 
application I'm creating, I wouldn't mind very very rare incorrect results.

>
> No one can really tell what is "faster" or "wasting time" until you can
> better characterize the usage patterns. How big can these strings get?
> How often will you get an actual duplicate? What's the penalty when you
> need to add a new string? You'll need to implement a few algorithms,
> profile them and then make a decision based on actual data.

That makes sense. I'd need to analyze my input data and then run some 
empirical tests.

>
> For Java, I'd store the strings in a WeakHashMap or similar to allow
> them to be cached, but tossed away when more storage is needed. Also you
> should look into getting some DB caching library, much easier than
> implementing this yourself (sorry I can't personally recommend any).

The WeakHashMap idea looks useful. As for a DB caching library, would 
you recommend this as a replacement for the WeakHashMap, or as a complement?

Thanks for the useful pointers.