Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.java.programmer > #5113 > unrolled thread

Storing large strings for future equality checks

Started byAbu Yahya <abu_yahya@invalid.com>
First post2011-06-08 22:05 +0530
Last post2011-06-09 15:01 -0700
Articles 2 on this page of 22 — 15 participants

Back to article view | Back to comp.lang.java.programmer


Contents

  Storing large strings for future equality checks Abu Yahya <abu_yahya@invalid.com> - 2011-06-08 22:05 +0530
    Re: Storing large strings for future equality checks markspace <-@.> - 2011-06-08 09:49 -0700
      Re: Storing large strings for future equality checks Willem <willem@toad.stack.nl> - 2011-06-08 17:28 +0000
        Re: Storing large strings for future equality checks Abu Yahya <abu_yahya@invalid.com> - 2011-06-08 23:45 +0530
      Re: Storing large strings for future equality checks Abu Yahya <abu_yahya@invalid.com> - 2011-06-08 23:45 +0530
    Re: Storing large strings for future equality checks David Kerber <dkerber@WarrenRogersAssociates.invalid> - 2011-06-08 12:58 -0400
      Re: Storing large strings for future equality checks Abu Yahya <abu_yahya@invalid.com> - 2011-06-08 23:49 +0530
      Re: Storing large strings for future equality checks Lothar Kimmeringer <news200709@kimmeringer.de> - 2011-06-08 20:31 +0200
        Re: Storing large strings for future equality checks Harry Tuttle <OTPXDAJCSJVU@spammotel.com> - 2011-06-09 10:50 +0200
          Re: Storing large strings for future equality checks bugbear <bugbear@trim_papermule.co.uk_trim> - 2011-06-09 11:44 +0100
        Re: Storing large strings for future equality checks Harry Tuttle <OTPXDAJCSJVU@spammotel.com> - 2011-06-10 10:15 +0200
    Re: Storing large strings for future equality checks Gene Wirchenko <genew@ocis.net> - 2011-06-08 11:07 -0700
      Re: Storing large strings for future equality checks Abu Yahya <abu_yahya@invalid.com> - 2011-06-08 23:58 +0530
      Re: Storing large strings for future equality checks Hallvard B Furuseth <h.b.furuseth@usit.uio.no> - 2011-06-09 12:38 +0200
      Re: Storing large strings for future equality checks Michael Wojcik <mwojcik@newsguy.com> - 2011-06-09 17:32 -0400
        Re: Storing large strings for future equality checks bugbear <bugbear@trim_papermule.co.uk_trim> - 2011-06-10 10:51 +0100
    Re: Storing large strings for future equality checks Lothar Kimmeringer <news200709@kimmeringer.de> - 2011-06-08 20:28 +0200
      Re: Storing large strings for future equality checks Martin Gregorie <martin@address-in-sig.invalid> - 2011-06-08 22:02 +0000
    Re: Storing large strings for future equality checks rossum <rossum48@coldmail.com> - 2011-06-08 21:38 +0100
    Re: Storing large strings for future equality checks Robert Klemme <shortcutter@googlemail.com> - 2011-06-08 23:20 +0200
    Re: Storing large strings for future equality checks Tom Anderson <twic@urchin.earth.li> - 2011-06-08 23:02 +0100
    Re: Storing large strings for future equality checks Joshua Maurice <joshuamaurice@gmail.com> - 2011-06-09 15:01 -0700

Page 2 of 2 — ← Prev page 1 [2]


#5140

FromTom Anderson <twic@urchin.earth.li>
Date2011-06-08 23:02 +0100
Message-ID<alpine.DEB.2.00.1106082242210.21659@urchin.earth.li>
In reply to#5113
On Wed, 8 Jun 2011, Abu Yahya wrote:

> A small application that I'm making requires me to store very long 
> strings (>1000 characters) in a database.

So >1000, but by any chance <2000? <4000?

> I will need to use these strings later to compare for equality to 
> incoming strings from another application. I will also want to add some 
> of the incoming strings to the storage, if they meet certain criteria.
>
> For my application, I get a feeling that storing these strings in my 
> table will be a waste of space, and will impact performance due to 
> retrieval and storage times, as well as comparison times.

I assume by 'table' you mean an in-memory hashtable. I wouldn't be so 
quick to discount this - how many strings do you have? If you had 25 000 
2000-character strings, that would be 100 MB; that's not an exorbitant 
amount. Access would be pretty quick.

> I considered using an SHA-512 hash of these strings and storing them in 
> the database. However, while these will save on storage space, it will 
> take time to do the hashing before comparing an incoming string. So I'm 
> still wasting time. (Collisions due to hashing will not be a problem, 
> since an occasional false positive will not be fatal for my 
> application).

If you're using a database, don't bother with a hash, just put the whole 
string in, index the column, and then do equality queries on it. A B-tree 
index will deal with this kind of query pretty efficiently.

If you wanted to pursue in-memory approaches, i'd suggest constructing a 
trie of some sort. Tries are very fast at retrieval, don't need to access 
the whole key to identify a miss, and only need to access the whole key 
once to identify a hit - you always walk through the key from beginning to 
end (perhaps skipping some characters in some kinds of tree), stopping as 
soon as the key doesn't match, and only reaching the end if it does match. 
I haven't found a good overview of the current state of the art in tries, 
but look up Patricia tries, Judy tries, burst tries, fusion tries, and HAT 
tries. You could consider only storing a prefix of the strings - the first 
100 characters, say - in the trie, to save memory, with the leaves having 
pointers to where the complete strings are stored on disk.

tom

-- 
The sun just came out, I can't believe it

[toc] | [prev] | [next] | [standalone]


#5167

FromJoshua Maurice <joshuamaurice@gmail.com>
Date2011-06-09 15:01 -0700
Message-ID<21013c4d-3ae9-4e81-8999-d8c18e620e5c@f31g2000pri.googlegroups.com>
In reply to#5113
On Jun 8, 9:35 am, Abu Yahya <abu_ya...@invalid.com> wrote:
> A small application that I'm making requires me to store very long
> strings (>1000 characters) in a database.
>
> I will need to use these strings later to compare for equality to
> incoming strings from another application. I will also want to add some
> of the incoming strings to the storage, if they meet certain criteria.
>
> For my application, I get a feeling that storing these strings in my
> table will be a waste of space, and will impact performance due to
> retrieval and storage times, as well as comparison times.
>
> I considered using an SHA-512 hash of these strings and storing them in
> the database. However, while these will save on storage space, it will
> take time to do the hashing before comparing an incoming string. So I'm
> still wasting time. (Collisions due to hashing will not be a problem,
> since an occasional false positive will not be fatal for my application).
>
> What would be the best approach?

If it's that relevant that you're asking, measure first to see if it's
a problem. If you're that concerned that it will be, then code a
number of reasonable alternatives and measure.

Presumably you need to do a Map lookup on the incoming strings. I
thought about some itern scheme, but that won't work if you're
receiving a lot of incoming new strings. Storing hashs could work. Do
you need to store the strings in a database? If you can store them
locally, maybe a trie?
http://en.wikipedia.org/wiki/Trie
I somewhat doubt (maybe?) that you're going to get much better lookup
performance than a trie (but of course I would measure too).

[toc] | [prev] | [standalone]


Page 2 of 2 — ← Prev page 1 [2]

Back to top | Article view | comp.lang.java.programmer


csiph-web