Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.java.programmer > #5113 > unrolled thread
| Started by | Abu Yahya <abu_yahya@invalid.com> |
|---|---|
| First post | 2011-06-08 22:05 +0530 |
| Last post | 2011-06-09 15:01 -0700 |
| Articles | 20 on this page of 22 — 15 participants |
Back to article view | Back to comp.lang.java.programmer
Storing large strings for future equality checks Abu Yahya <abu_yahya@invalid.com> - 2011-06-08 22:05 +0530
Re: Storing large strings for future equality checks markspace <-@.> - 2011-06-08 09:49 -0700
Re: Storing large strings for future equality checks Willem <willem@toad.stack.nl> - 2011-06-08 17:28 +0000
Re: Storing large strings for future equality checks Abu Yahya <abu_yahya@invalid.com> - 2011-06-08 23:45 +0530
Re: Storing large strings for future equality checks Abu Yahya <abu_yahya@invalid.com> - 2011-06-08 23:45 +0530
Re: Storing large strings for future equality checks David Kerber <dkerber@WarrenRogersAssociates.invalid> - 2011-06-08 12:58 -0400
Re: Storing large strings for future equality checks Abu Yahya <abu_yahya@invalid.com> - 2011-06-08 23:49 +0530
Re: Storing large strings for future equality checks Lothar Kimmeringer <news200709@kimmeringer.de> - 2011-06-08 20:31 +0200
Re: Storing large strings for future equality checks Harry Tuttle <OTPXDAJCSJVU@spammotel.com> - 2011-06-09 10:50 +0200
Re: Storing large strings for future equality checks bugbear <bugbear@trim_papermule.co.uk_trim> - 2011-06-09 11:44 +0100
Re: Storing large strings for future equality checks Harry Tuttle <OTPXDAJCSJVU@spammotel.com> - 2011-06-10 10:15 +0200
Re: Storing large strings for future equality checks Gene Wirchenko <genew@ocis.net> - 2011-06-08 11:07 -0700
Re: Storing large strings for future equality checks Abu Yahya <abu_yahya@invalid.com> - 2011-06-08 23:58 +0530
Re: Storing large strings for future equality checks Hallvard B Furuseth <h.b.furuseth@usit.uio.no> - 2011-06-09 12:38 +0200
Re: Storing large strings for future equality checks Michael Wojcik <mwojcik@newsguy.com> - 2011-06-09 17:32 -0400
Re: Storing large strings for future equality checks bugbear <bugbear@trim_papermule.co.uk_trim> - 2011-06-10 10:51 +0100
Re: Storing large strings for future equality checks Lothar Kimmeringer <news200709@kimmeringer.de> - 2011-06-08 20:28 +0200
Re: Storing large strings for future equality checks Martin Gregorie <martin@address-in-sig.invalid> - 2011-06-08 22:02 +0000
Re: Storing large strings for future equality checks rossum <rossum48@coldmail.com> - 2011-06-08 21:38 +0100
Re: Storing large strings for future equality checks Robert Klemme <shortcutter@googlemail.com> - 2011-06-08 23:20 +0200
Re: Storing large strings for future equality checks Tom Anderson <twic@urchin.earth.li> - 2011-06-08 23:02 +0100
Re: Storing large strings for future equality checks Joshua Maurice <joshuamaurice@gmail.com> - 2011-06-09 15:01 -0700
Page 1 of 2 [1] 2 Next page →
| From | Abu Yahya <abu_yahya@invalid.com> |
|---|---|
| Date | 2011-06-08 22:05 +0530 |
| Subject | Storing large strings for future equality checks |
| Message-ID | <iso8cm$a80$1@speranza.aioe.org> |
A small application that I'm making requires me to store very long strings (>1000 characters) in a database. I will need to use these strings later to compare for equality to incoming strings from another application. I will also want to add some of the incoming strings to the storage, if they meet certain criteria. For my application, I get a feeling that storing these strings in my table will be a waste of space, and will impact performance due to retrieval and storage times, as well as comparison times. I considered using an SHA-512 hash of these strings and storing them in the database. However, while these will save on storage space, it will take time to do the hashing before comparing an incoming string. So I'm still wasting time. (Collisions due to hashing will not be a problem, since an occasional false positive will not be fatal for my application). What would be the best approach?
[toc] | [next] | [standalone]
| From | markspace <-@.> |
|---|---|
| Date | 2011-06-08 09:49 -0700 |
| Message-ID | <iso96r$pml$1@dont-email.me> |
| In reply to | #5113 |
On 6/8/2011 9:35 AM, Abu Yahya wrote: > I considered using an SHA-512 hash of these strings and storing them in > the database. However, while these will save on storage space, it will > take time to do the hashing before comparing an incoming string. So I'm > still wasting time. (Collisions due to hashing will not be a problem, > since an occasional false positive will not be fatal for my application). You have to store the whole string. Even if the SHA-512 hash codes are equal, it could be that the strings are different. You'll have to eventually compare the raw string, even if the SHA is used as a quick-out case. No one can really tell what is "faster" or "wasting time" until you can better characterize the usage patterns. How big can these strings get? How often will you get an actual duplicate? What's the penalty when you need to add a new string? You'll need to implement a few algorithms, profile them and then make a decision based on actual data. For Java, I'd store the strings in a WeakHashMap or similar to allow them to be cached, but tossed away when more storage is needed. Also you should look into getting some DB caching library, much easier than implementing this yourself (sorry I can't personally recommend any).
[toc] | [prev] | [next] | [standalone]
| From | Willem <willem@toad.stack.nl> |
|---|---|
| Date | 2011-06-08 17:28 +0000 |
| Message-ID | <slrniuvcac.2j1v.willem@toad.stack.nl> |
| In reply to | #5115 |
markspace wrote:
) On 6/8/2011 9:35 AM, Abu Yahya wrote:
)> I considered using an SHA-512 hash of these strings and storing them in
)> the database. However, while these will save on storage space, it will
)> take time to do the hashing before comparing an incoming string. So I'm
)> still wasting time. (Collisions due to hashing will not be a problem,
)> since an occasional false positive will not be fatal for my application).
)
) You have to store the whole string. Even if the SHA-512 hash codes are
) equal, it could be that the strings are different. You'll have to
) eventually compare the raw string, even if the SHA is used as a
) quick-out case.
No he doesn't. Read again. Especially the last bit between parentheses.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
[toc] | [prev] | [next] | [standalone]
| From | Abu Yahya <abu_yahya@invalid.com> |
|---|---|
| Date | 2011-06-08 23:45 +0530 |
| Message-ID | <isoe88$pdj$2@speranza.aioe.org> |
| In reply to | #5118 |
On 6/8/2011 10:58 PM, Willem wrote: > markspace wrote: > ) On 6/8/2011 9:35 AM, Abu Yahya wrote: > )> I considered using an SHA-512 hash of these strings and storing them in > )> the database. However, while these will save on storage space, it will > )> take time to do the hashing before comparing an incoming string. So I'm > )> still wasting time. (Collisions due to hashing will not be a problem, > )> since an occasional false positive will not be fatal for my application). > ) > ) You have to store the whole string. Even if the SHA-512 hash codes are > ) equal, it could be that the strings are different. You'll have to > ) eventually compare the raw string, even if the SHA is used as a > ) quick-out case. > > No he doesn't. Read again. Especially the last bit between parentheses. > Yeah, it seems he missed that last part or read it incorrectly.
[toc] | [prev] | [next] | [standalone]
| From | Abu Yahya <abu_yahya@invalid.com> |
|---|---|
| Date | 2011-06-08 23:45 +0530 |
| Message-ID | <isoe78$pdj$1@speranza.aioe.org> |
| In reply to | #5115 |
On 6/8/2011 10:19 PM, markspace wrote: > On 6/8/2011 9:35 AM, Abu Yahya wrote: > >> I considered using an SHA-512 hash of these strings and storing them in >> the database. However, while these will save on storage space, it will >> take time to do the hashing before comparing an incoming string. So I'm >> still wasting time. (Collisions due to hashing will not be a problem, >> since an occasional false positive will not be fatal for my application). > > > You have to store the whole string. Even if the SHA-512 hash codes are > equal, it could be that the strings are different. You'll have to > eventually compare the raw string, even if the SHA is used as a > quick-out case. You're right about comparing the whole strings anyway, but, for this application I'm creating, I wouldn't mind very very rare incorrect results. > > No one can really tell what is "faster" or "wasting time" until you can > better characterize the usage patterns. How big can these strings get? > How often will you get an actual duplicate? What's the penalty when you > need to add a new string? You'll need to implement a few algorithms, > profile them and then make a decision based on actual data. That makes sense. I'd need to analyze my input data and then run some empirical tests. > > For Java, I'd store the strings in a WeakHashMap or similar to allow > them to be cached, but tossed away when more storage is needed. Also you > should look into getting some DB caching library, much easier than > implementing this yourself (sorry I can't personally recommend any). The WeakHashMap idea looks useful. As for a DB caching library, would you recommend this as a replacement for the WeakHashMap, or as a complement? Thanks for the useful pointers.
[toc] | [prev] | [next] | [standalone]
| From | David Kerber <dkerber@WarrenRogersAssociates.invalid> |
|---|---|
| Date | 2011-06-08 12:58 -0400 |
| Message-ID | <MPG.285974489deb29be989729@news.eternal-september.org> |
| In reply to | #5113 |
[This followup was posted to comp.lang.java.databases and a copy was sent to the cited author.] In article <iso8cm$a80$1@speranza.aioe.org>, abu_yahya@invalid.com says... > > A small application that I'm making requires me to store very long > strings (>1000 characters) in a database. Unless you're storing millions of these strings or using Access, I'd say just store and use the string. I think you'll find that the speed penalty is nearly unnoticeable. D
[toc] | [prev] | [next] | [standalone]
| From | Abu Yahya <abu_yahya@invalid.com> |
|---|---|
| Date | 2011-06-08 23:49 +0530 |
| Message-ID | <isoeeo$pdj$3@speranza.aioe.org> |
| In reply to | #5116 |
On 6/8/2011 10:28 PM, David Kerber wrote: > [This followup was posted to comp.lang.java.databases and a copy was > sent to the cited author.] > > In article<iso8cm$a80$1@speranza.aioe.org>, abu_yahya@invalid.com > says... >> >> A small application that I'm making requires me to store very long >> strings (>1000 characters) in a database. > > Unless you're storing millions of these strings or using Access, I'd say > just store and use the string. I think you'll find that the speed > penalty is nearly unnoticeable. Thanks - simply storing them the way they are does seem to be the best way forward.
[toc] | [prev] | [next] | [standalone]
| From | Lothar Kimmeringer <news200709@kimmeringer.de> |
|---|---|
| Date | 2011-06-08 20:31 +0200 |
| Message-ID | <19aedpuwexqkx$.dlg@kimmeringer.de> |
| In reply to | #5116 |
David Kerber wrote:
> In article <iso8cm$a80$1@speranza.aioe.org>, abu_yahya@invalid.com
> says...
>>
>> A small application that I'm making requires me to store very long
>> strings (>1000 characters) in a database.
>
> Unless you're storing millions of these strings or using Access, I'd say
> just store and use the string. I think you'll find that the speed
> penalty is nearly unnoticeable.
Depends on the database. Some of them force you to use CLOBS for
text-columns with more than 255 characters. CLOBS are a PITA in
terms of indexing.
Regards, Lothar
--
Lothar Kimmeringer E-Mail: spamfang@kimmeringer.de
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)
Always remember: The answer is forty-two, there can only be wrong
questions!
[toc] | [prev] | [next] | [standalone]
| From | Harry Tuttle <OTPXDAJCSJVU@spammotel.com> |
|---|---|
| Date | 2011-06-09 10:50 +0200 |
| Message-ID | <95bfpaFgboU1@mid.individual.net> |
| In reply to | #5130 |
Lothar Kimmeringer, 08.06.2011 20:31: > Depends on the database. Some of them force you to use CLOBS for > text-columns with more than 255 characters. CLOBS are a PITA in > terms of indexing. No serious DBMS requires you to use a CLOB for columns with more than 255 characters. The least flexible "big" name is Oracle, which has a limit of 4000 bytes. Beyond that a CLOB is needed. Most others have a much larger limit on VARCHAR columns (32k, 64k, 1GB, ...)
[toc] | [prev] | [next] | [standalone]
| From | bugbear <bugbear@trim_papermule.co.uk_trim> |
|---|---|
| Date | 2011-06-09 11:44 +0100 |
| Message-ID | <__KdncFXQ4icOW3QnZ2dnUVZ8qednZ2d@brightview.co.uk> |
| In reply to | #5152 |
Harry Tuttle wrote: > Lothar Kimmeringer, 08.06.2011 20:31: >> Depends on the database. Some of them force you to use CLOBS for >> text-columns with more than 255 characters. CLOBS are a PITA in >> terms of indexing. > > No serious DBMS requires you to use a CLOB for columns with more than > 255 characters. > > The least flexible "big" name is Oracle, which has a limit of 4000 > bytes. Beyond that a CLOB is needed. > > Most others have a much larger limit on VARCHAR columns (32k, 64k, 1GB, > ...) Agreed, seconded etc. To OP; just store the strings as you need, make sure the column is indexed, and simply query with your input string in the "obvious" way. Unless this *proves* to be a catastrophe (which I doubt)... job done. BugBear
[toc] | [prev] | [next] | [standalone]
| From | Harry Tuttle <OTPXDAJCSJVU@spammotel.com> |
|---|---|
| Date | 2011-06-10 10:15 +0200 |
| Message-ID | <95e23mFcv3U1@mid.individual.net> |
| In reply to | #5130 |
Lothar Kimmeringer, 08.06.2011 20:31: > Depends on the database. Some of them force you to use CLOBS for > text-columns with more than 255 characters. CLOBS are a PITA in > terms of indexing. No serious DBMS requires you to use a CLOB for columns with more than 255 characters. The least flexible "big" name is Oracle, which has a limit of 4000 bytes. Beyond that a CLOB is needed. Most others have a much larger limit on VARCHAR columns (32k, 64k, 1GB, ...)
[toc] | [prev] | [next] | [standalone]
| From | Gene Wirchenko <genew@ocis.net> |
|---|---|
| Date | 2011-06-08 11:07 -0700 |
| Message-ID | <g6evu6t9847r0g5u2nj17aj61gsktspb93@4ax.com> |
| In reply to | #5113 |
On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya <abu_yahya@invalid.com>
wrote:
>A small application that I'm making requires me to store very long
>strings (>1000 characters) in a database.
>
>I will need to use these strings later to compare for equality to
>incoming strings from another application. I will also want to add some
>of the incoming strings to the storage, if they meet certain criteria.
>
>For my application, I get a feeling that storing these strings in my
>table will be a waste of space, and will impact performance due to
>retrieval and storage times, as well as comparison times.
Your feeling is irrelevant. You should benchmark. If you do
not, you may end up jumping through hoops for something that is
unneeded (though you may never find out it that it is unneeded).
>I considered using an SHA-512 hash of these strings and storing them in
>the database. However, while these will save on storage space, it will
>take time to do the hashing before comparing an incoming string. So I'm
>still wasting time. (Collisions due to hashing will not be a problem,
>since an occasional false positive will not be fatal for my application).
What does "occasional" mean?
>What would be the best approach?
If it is a small app, why are you worrying about it so much?
Create a straightforward design, code it, and see if it is fast
enough. If it is, keep it. If it is not, then and only then, concern
yourself with how to speed it up.
Quit wasting time posting here, and benchmark something. <BEG>
Sincerely,
Gene Wirchenko
[toc] | [prev] | [next] | [standalone]
| From | Abu Yahya <abu_yahya@invalid.com> |
|---|---|
| Date | 2011-06-08 23:58 +0530 |
| Message-ID | <isof0m$rcp$1@speranza.aioe.org> |
| In reply to | #5122 |
On 6/8/2011 11:37 PM, Gene Wirchenko wrote: > On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya<abu_yahya@invalid.com> > wrote: > >> I considered using an SHA-512 hash of these strings and storing them in >> the database. However, while these will save on storage space, it will >> take time to do the hashing before comparing an incoming string. So I'm >> still wasting time. (Collisions due to hashing will not be a problem, >> since an occasional false positive will not be fatal for my application). > > What does "occasional" mean? > The application will be doing some probabilistic evaluation using the data it is storing, and needs to be fairly close to the actual /most/ of the time. Because it would anyway be impossible to be correct 100% of the time, using a wrong value for a particular piece of input data will not matter, provided it does not happen too often. (In my text above, "rare" would have been a better choice of word than "occasional")
[toc] | [prev] | [next] | [standalone]
| From | Hallvard B Furuseth <h.b.furuseth@usit.uio.no> |
|---|---|
| Date | 2011-06-09 12:38 +0200 |
| Message-ID | <hbf.20110609mhkc@bombur.uio.no> |
| In reply to | #5122 |
Gene Wirchenko writes: >On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya <abu_yahya@invalid.com> >wrote: >>For my application, I get a feeling that storing these strings in my >>table will be a waste of space, and will impact performance due to >>retrieval and storage times, as well as comparison times. > > Your feeling is irrelevant. You should benchmark. If you do > not, you may end up jumping through hoops for something that is > unneeded (though you may never find out it that it is unneeded). Indeed. No point in using a lot of time to solve a non-problem. But after the benchmark, the decision can depend on who the application is for. If it scales poorly, that can bite other users with different input data. OTOH if he'll be the only user and he finds that full strings and SHA are both too slow: Another approach would be to use a faster hash, count hash collisions, and don't bother with more if the count is acceptable. Or try tries, as Tom Anderson suggested. -- Hallvard
[toc] | [prev] | [next] | [standalone]
| From | Michael Wojcik <mwojcik@newsguy.com> |
|---|---|
| Date | 2011-06-09 17:32 -0400 |
| Message-ID | <isre5k32k44@news3.newsguy.com> |
| In reply to | #5122 |
Gene Wirchenko wrote: > On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya <abu_yahya@invalid.com> > wrote: > >> I considered using an SHA-512 hash of these strings and storing them in >> the database. However, while these will save on storage space, it will >> take time to do the hashing before comparing an incoming string. So I'm >> still wasting time. (Collisions due to hashing will not be a problem, >> since an occasional false positive will not be fatal for my application). > > What does "occasional" mean? Assuming the application doesn't accidentally run afoul of a hitherto-unknown flaw in SHA-512 - a tremendously unlikely event - it means "once in every N/2**256 uses", where N is the current number of hashes in the database. (2**256 because of the Birthday Paradox; we're interested in the probability of two arbitrary colliding pre-images.) Or, in other words, "probably not before the heat death of the universe". (Or false vacuum decay, zombie apocalypse, Rapture, etc.) Worrying about the time to hash the string is silly. It's linear in the length of the string, so it's roughly equivalent to the time taken to do a few comparisons (where "a few" depends on how long, on average, the matching prefix of the new string and the candidates is). However, as various others have pointed out, this is premature optimization. There's no reason to use any design other than the obvious until a problem is demonstrated. -- Michael Wojcik Micro Focus Rhetoric & Writing, Michigan State University
[toc] | [prev] | [next] | [standalone]
| From | bugbear <bugbear@trim_papermule.co.uk_trim> |
|---|---|
| Date | 2011-06-10 10:51 +0100 |
| Message-ID | <z6qdnRH0SMqAdGzQnZ2dnUVZ8n-dnZ2d@brightview.co.uk> |
| In reply to | #5166 |
Michael Wojcik wrote: > Gene Wirchenko wrote: >> On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya<abu_yahya@invalid.com> >> wrote: >> >>> I considered using an SHA-512 hash of these strings and storing them in >>> the database. However, while these will save on storage space, it will >>> take time to do the hashing before comparing an incoming string. So I'm >>> still wasting time. (Collisions due to hashing will not be a problem, >>> since an occasional false positive will not be fatal for my application). >> >> What does "occasional" mean? > > Assuming the application doesn't accidentally run afoul of a > hitherto-unknown flaw in SHA-512 - a tremendously unlikely event - it > means "once in every N/2**256 uses", where N is the current number of > hashes in the database. (2**256 because of the Birthday Paradox; we're > interested in the probability of two arbitrary colliding pre-images.) > > Or, in other words, "probably not before the heat death of the > universe". (Or false vacuum decay, zombie apocalypse, Rapture, etc.) > > Worrying about the time to hash the string is silly. It's linear in > the length of the string, so it's roughly equivalent to the time taken > to do a few comparisons (where "a few" depends on how long, on > average, the matching prefix of the new string and the candidates is). > > However, as various others have pointed out, this is premature > optimization. There's no reason to use any design other than the > obvious until a problem is demonstrated. I any case, it's plausible that the DB is using "something" to index the strings, in other words the string search in the DB already has some-kind-of optimisation. BugBear
[toc] | [prev] | [next] | [standalone]
| From | Lothar Kimmeringer <news200709@kimmeringer.de> |
|---|---|
| Date | 2011-06-08 20:28 +0200 |
| Message-ID | <171dpt2926br2.dlg@kimmeringer.de> |
| In reply to | #5113 |
F'up to cljp
Abu Yahya wrote:
> I considered using an SHA-512 hash of these strings and storing them in
> the database. However, while these will save on storage space, it will
> take time to do the hashing before comparing an incoming string. So I'm
> still wasting time. (Collisions due to hashing will not be a problem,
> since an occasional false positive will not be fatal for my application).
If you write seldom and read often, why not using two columns:
string_hashcode
sha1_hashcode
If the first is equal, you can calculate the sha1-hash for the string
to be checked and if that is equal as well, you can consider the
string as equal. That both hashes collide I expect to be very
very unlikely (which is why I changed the other alg to sha-1, that
should be considerably more performant than sha512).
So calculation of the more complex algorithm is only done while
storing to the database and when checking a string that is already
in the database. If you have that case very often you still might
get a better performance with String.hashcode and SHA1 than with
just SHA512.
> What would be the best approach?
There is no single best approach, only an optimal one. Which
one it is dependend on what defines one way to be better than
the other (in terms of performance, storage-space, collision-
rates, etc).
Regards, Lothar
--
Lothar Kimmeringer E-Mail: spamfang@kimmeringer.de
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)
Always remember: The answer is forty-two, there can only be wrong
questions!
[toc] | [prev] | [next] | [standalone]
| From | Martin Gregorie <martin@address-in-sig.invalid> |
|---|---|
| Date | 2011-06-08 22:02 +0000 |
| Message-ID | <isorht$97k$1@localhost.localdomain> |
| In reply to | #5128 |
On Wed, 08 Jun 2011 20:28:11 +0200, Lothar Kimmeringer wrote: > If you write seldom and read often, why not using two columns: > > string_hashcode > sha1_hashcode > You haven't given us a maximum size for the string or the name of the target database, so its possible that implementation constraints will prevent the strings from fitting into a character() column and you'll have to use a character varying, text, clob etc. instead. If this type isn't allowed as the table's primary key, you can use the hash code as the primary key. Since it doesn't matter if some strings have clashing hash codes this can't cause a problem. -- martin@ | Martin Gregorie gregorie. | Essex, UK org |
[toc] | [prev] | [next] | [standalone]
| From | rossum <rossum48@coldmail.com> |
|---|---|
| Date | 2011-06-08 21:38 +0100 |
| Message-ID | <eimvu65abk3j0n0l80a85u979bkij3rv8a@4ax.com> |
| In reply to | #5113 |
On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya <abu_yahya@invalid.com> wrote: >A small application that I'm making requires me to store very long >strings (>1000 characters) in a database. > >I will need to use these strings later to compare for equality to >incoming strings from another application. I will also want to add some >of the incoming strings to the storage, if they meet certain criteria. > >For my application, I get a feeling that storing these strings in my >table will be a waste of space, and will impact performance due to >retrieval and storage times, as well as comparison times. > >I considered using an SHA-512 hash of these strings and storing them in >the database. However, while these will save on storage space, it will >take time to do the hashing before comparing an incoming string. So I'm >still wasting time. (Collisions due to hashing will not be a problem, >since an occasional false positive will not be fatal for my application). > >What would be the best approach? As others have said, write the simple obvious approach and see if that is good enough. Tune where required after measuring. Lothar's suggestion of using SHA-1 is good. You could even drop back to MD-4 if you are sure that nobody is going to be deliberately trying to create false collisions. MD-4 is much too badly broken for any cryptographic purposes, but is even faster than SHA-1. If the amount of storage needed is a problem then you might want to zip the strings before storing them. If you can be sure that the zipped versions are identical (not always possible with unicode combining characters) then you could hash the zipped version rather than the originals for more time saving. rossum
[toc] | [prev] | [next] | [standalone]
| From | Robert Klemme <shortcutter@googlemail.com> |
|---|---|
| Date | 2011-06-08 23:20 +0200 |
| Message-ID | <95a7d0Frr9U1@mid.individual.net> |
| In reply to | #5113 |
On 08.06.2011 18:35, Abu Yahya wrote: > A small application that I'm making requires me to store very long > strings (>1000 characters) in a database. > > I will need to use these strings later to compare for equality to > incoming strings from another application. I will also want to add some > of the incoming strings to the storage, if they meet certain criteria. > > For my application, I get a feeling that storing these strings in my > table will be a waste of space, and will impact performance due to > retrieval and storage times, as well as comparison times. > > I considered using an SHA-512 hash of these strings and storing them in > the database. However, while these will save on storage space, it will > take time to do the hashing before comparing an incoming string. So I'm > still wasting time. (Collisions due to hashing will not be a problem, > since an occasional false positive will not be fatal for my application). > > What would be the best approach? Just out of curiosity: what do those strings represent? What do you do with them? Kind regards robert -- remember.guy do |as, often| as.you_can - without end http://blog.rubybestpractices.com/
[toc] | [prev] | [next] | [standalone]
Page 1 of 2 [1] 2 Next page →
Back to top | Article view | comp.lang.java.programmer
csiph-web