From: rossum
Newsgroups: comp.lang.java.programmer,comp.programming,comp.lang.java.databases
Subject: Re: Storing large strings for future equality checks
Date: Wed, 08 Jun 2011 21:38:51 +0100

On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya wrote:

>A small application that I'm making requires me to store very long
>strings (>1000 characters) in a database.
>
>I will need to use these strings later to compare for equality to
>incoming strings from another application. I will also want to add some
>of the incoming strings to the storage, if they meet certain criteria.
>
>For my application, I get a feeling that storing these strings in my
>table will be a waste of space, and will impact performance due to
>retrieval and storage times, as well as comparison times.
>
>I considered using an SHA-512 hash of these strings and storing them in
>the database. However, while these will save on storage space, it will
>take time to do the hashing before comparing an incoming string. So I'm
>still wasting time. (Collisions due to hashing will not be a problem,
>since an occasional false positive will not be fatal for my application).
>
>What would be the best approach?

As others have said, write the simple obvious approach and see if that
is good enough.  Tune where required after measuring.

Lothar's suggestion of using SHA-1 is good.  You could even drop back
to MD4 if you are sure that nobody is going to be deliberately trying
to create collisions.  MD4 is much too badly broken for any
cryptographic purpose, but it is even faster than SHA-1.

If the amount of storage needed is a problem then you might want to
zip the strings before storing them.  If you can be sure that equal
strings always give identical zipped versions (not always possible
with Unicode combining characters, where the same text can be encoded
as different character sequences) then you could hash the zipped
version rather than the originals for more time saving.

rossum
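
For the hashing approach, a minimal Java sketch might look like the following.  It is only an illustration under some assumptions: the class name StringFingerprint and its methods are hypothetical, the strings are assumed to be encoded as UTF-8, and the NFC normalization step is one possible way to deal with the combining-character issue, not something the thread settled on.  SHA-1 is used because the JDK is required to provide it; MD4 is not generally available through MessageDigest without an extra provider.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of the original post.
final class StringFingerprint {

    private StringFingerprint() {}

    // Returns the SHA-1 digest of the string as 40 lowercase hex characters.
    // Assumption: the text is normalized to NFC and encoded as UTF-8 first,
    // so canonically equivalent strings produce the same bytes and hash.
    static String sha1Hex(String s) {
        String normalized = Normalizer.normalize(s, Normalizer.Form.NFC);
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-1")
                    .digest(normalized.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm, so this should not happen.
            throw new IllegalStateException("SHA-1 not available", e);
        }
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(Character.forDigit((b >> 4) & 0xF, 16));
            hex.append(Character.forDigit(b & 0xF, 16));
        }
        return hex.toString();
    }

    // Compares an incoming string against a stored fingerprint.
    // A match means "equal, barring a hash collision".
    static boolean matches(String storedHex, String incoming) {
        return storedHex.equals(sha1Hex(incoming));
    }

    public static void main(String[] args) {
        String stored = sha1Hex("some very long string ...");
        System.out.println(stored);
        System.out.println(matches(stored, "some very long string ..."));
        System.out.println(matches(stored, "a different string"));
    }
}
```

Only the 40-character fingerprint would need to be stored and indexed, so the database compares short fixed-length values instead of the full strings.  Whether the hashing cost matters is still something to measure rather than guess.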