From: Joshua Maurice
Newsgroups: comp.lang.java.programmer,comp.programming,comp.lang.java.databases
Subject: Re: Storing large strings for future equality checks
Date: Thu, 9 Jun 2011 15:01:27 -0700 (PDT)
Message-ID: <21013c4d-3ae9-4e81-8999-d8c18e620e5c@f31g2000pri.googlegroups.com>

On Jun 8, 9:35 am, Abu Yahya wrote:
> A small application that I'm making requires me to store very long
> strings (>1000 characters) in a database.
>
> I will need to use these strings later to compare for equality to
> incoming strings from another application. I will also want to add some
> of the incoming strings to the storage, if they meet certain criteria.
>
> For my application, I get a feeling that storing these strings in my
> table will be a waste of space, and will impact performance due to
> retrieval and storage times, as well as comparison times.
>
> I considered using an SHA-512 hash of these strings and storing them in
> the database. However, while these will save on storage space, it will
> take time to do the hashing before comparing an incoming string. So I'm
> still wasting time. (Collisions due to hashing will not be a problem,
> since an occasional false positive will not be fatal for my application).
>
> What would be the best approach?

If this matters enough that you're asking, measure first to see whether
it's actually a problem. If you're concerned that it will be, then code
a number of reasonable alternatives and measure those.

Presumably you need to do a Map lookup on the incoming strings. I
thought about some interning scheme, but that won't work if you're
receiving a lot of new incoming strings. Storing hashes could work.

Do you need to store the strings in a database? If you can store them
locally, maybe a trie?
http://en.wikipedia.org/wiki/Trie
I somewhat doubt (maybe?) that you're going to get much better lookup
performance than a trie (but of course I would measure too).
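
To make the two options concrete, here is a rough Java sketch of both: a
set of SHA-512 digests and a plain in-memory trie. The class names
(DigestStore, StringTrie, StringStoreSketch) are made up for
illustration, the strings are assumed to be encoded as UTF-8 before
hashing, and nothing here touches a database. Treat it as a starting
point for measuring, not as a recommendation of either one.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical store that keeps only SHA-512 digests, not the strings. */
class DigestStore {
    private final Set<String> digests = new HashSet<>();

    // Assumption: UTF-8 encoding; the digest is kept as a Base64 string
    // so it can be used directly as a HashSet element.
    private static String digest(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-512");
            byte[] hash = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(hash);
        } catch (NoSuchAlgorithmException e) {
            // SHA-512 ships with common JDKs, but it is not guaranteed everywhere.
            throw new IllegalStateException("SHA-512 not available", e);
        }
    }

    /** Returns true if the digest of s was not already stored. */
    boolean add(String s) {
        return digests.add(digest(s));
    }

    /** May report a false positive only if two strings collide on SHA-512. */
    boolean contains(String s) {
        return digests.contains(digest(s));
    }
}

/** Hypothetical in-memory trie with one node per char of each stored string. */
class StringTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a stored string ends at this node
    }

    private final Node root = new Node();

    void add(String s) {
        Node n = root;
        for (int i = 0; i < s.length(); i++) {
            n = n.children.computeIfAbsent(s.charAt(i), c -> new Node());
        }
        n.terminal = true;
    }

    boolean contains(String s) {
        Node n = root;
        for (int i = 0; i < s.length(); i++) {
            n = n.children.get(s.charAt(i));
            if (n == null) {
                return false; // mismatch found without reading the rest of s
            }
        }
        return n.terminal;
    }
}

class StringStoreSketch {
    public static void main(String[] args) {
        DigestStore digests = new DigestStore();
        StringTrie trie = new StringTrie();
        digests.add("some very long string");
        trie.add("some very long string");
        System.out.println(digests.contains("some very long string"));
        System.out.println(trie.contains("some very long"));
    }
}
```

The digest lookup always has to read the whole incoming string to hash
it, while the trie can bail out at the first character that doesn't
match, at the cost of a lot more memory per stored string. Which one
wins for your data is exactly the thing to measure.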