Groups > comp.lang.java.programmer > #5113 > unrolled thread

Storing large strings for future equality checks

Started by	Abu Yahya <abu_yahya@invalid.com>
First post	2011-06-08 22:05 +0530
Last post	2011-06-09 15:01 -0700
Articles	20 on this page of 22 — 15 participants

Back to article view | Back to comp.lang.java.programmer

  Storing large strings for future equality checks Abu Yahya <abu_yahya@invalid.com> - 2011-06-08 22:05 +0530
    Re: Storing large strings for future equality checks markspace <-@.> - 2011-06-08 09:49 -0700
      Re: Storing large strings for future equality checks Willem <willem@toad.stack.nl> - 2011-06-08 17:28 +0000
        Re: Storing large strings for future equality checks Abu Yahya <abu_yahya@invalid.com> - 2011-06-08 23:45 +0530
      Re: Storing large strings for future equality checks Abu Yahya <abu_yahya@invalid.com> - 2011-06-08 23:45 +0530
    Re: Storing large strings for future equality checks David Kerber <dkerber@WarrenRogersAssociates.invalid> - 2011-06-08 12:58 -0400
      Re: Storing large strings for future equality checks Abu Yahya <abu_yahya@invalid.com> - 2011-06-08 23:49 +0530
      Re: Storing large strings for future equality checks Lothar Kimmeringer <news200709@kimmeringer.de> - 2011-06-08 20:31 +0200
        Re: Storing large strings for future equality checks Harry Tuttle <OTPXDAJCSJVU@spammotel.com> - 2011-06-09 10:50 +0200
          Re: Storing large strings for future equality checks bugbear <bugbear@trim_papermule.co.uk_trim> - 2011-06-09 11:44 +0100
        Re: Storing large strings for future equality checks Harry Tuttle <OTPXDAJCSJVU@spammotel.com> - 2011-06-10 10:15 +0200
    Re: Storing large strings for future equality checks Gene Wirchenko <genew@ocis.net> - 2011-06-08 11:07 -0700
      Re: Storing large strings for future equality checks Abu Yahya <abu_yahya@invalid.com> - 2011-06-08 23:58 +0530
      Re: Storing large strings for future equality checks Hallvard B Furuseth <h.b.furuseth@usit.uio.no> - 2011-06-09 12:38 +0200
      Re: Storing large strings for future equality checks Michael Wojcik <mwojcik@newsguy.com> - 2011-06-09 17:32 -0400
        Re: Storing large strings for future equality checks bugbear <bugbear@trim_papermule.co.uk_trim> - 2011-06-10 10:51 +0100
    Re: Storing large strings for future equality checks Lothar Kimmeringer <news200709@kimmeringer.de> - 2011-06-08 20:28 +0200
      Re: Storing large strings for future equality checks Martin Gregorie <martin@address-in-sig.invalid> - 2011-06-08 22:02 +0000
    Re: Storing large strings for future equality checks rossum <rossum48@coldmail.com> - 2011-06-08 21:38 +0100
    Re: Storing large strings for future equality checks Robert Klemme <shortcutter@googlemail.com> - 2011-06-08 23:20 +0200
    Re: Storing large strings for future equality checks Tom Anderson <twic@urchin.earth.li> - 2011-06-08 23:02 +0100
    Re: Storing large strings for future equality checks Joshua Maurice <joshuamaurice@gmail.com> - 2011-06-09 15:01 -0700

Page 1 of 2 [1] 2 Next page →

#5113 — Storing large strings for future equality checks

From	Abu Yahya <abu_yahya@invalid.com>
Date	2011-06-08 22:05 +0530
Subject	Storing large strings for future equality checks
Message-ID	<iso8cm$a80$1@speranza.aioe.org>

A small application that I'm making requires me to store very long 
strings (>1000 characters) in a database.

I will need to use these strings later to compare for equality to 
incoming strings from another application. I will also want to add some 
of the incoming strings to the storage, if they meet certain criteria.

For my application, I get a feeling that storing these strings in my 
table will be a waste of space, and will impact performance due to 
retrieval and storage times, as well as comparison times.

I considered using an SHA-512 hash of these strings and storing them in 
the database. However, while these will save on storage space, it will 
take time to do the hashing before comparing an incoming string. So I'm 
still wasting time. (Collisions due to hashing will not be a problem, 
since an occasional false positive will not be fatal for my application).

What would be the best approach?

[toc] | [next] | [standalone]

#5115

From	markspace <-@.>
Date	2011-06-08 09:49 -0700
Message-ID	<iso96r$pml$1@dont-email.me>
In reply to	#5113

On 6/8/2011 9:35 AM, Abu Yahya wrote:

> I considered using an SHA-512 hash of these strings and storing them in
> the database. However, while these will save on storage space, it will
> take time to do the hashing before comparing an incoming string. So I'm
> still wasting time. (Collisions due to hashing will not be a problem,
> since an occasional false positive will not be fatal for my application).

You have to store the whole string.  Even if the SHA-512 hash codes are 
equal, it could be that the strings are different.  You'll have to 
eventually compare the raw string, even if the SHA is used as a 
quick-out case.

No one can really tell what is "faster" or "wasting time" until you can 
better characterize the usage patterns.  How big can these strings get? 
  How often will you get an actual duplicate?  What's the penalty when 
you need to add a new string?  You'll need to implement a few 
algorithms, profile them and then make a decision based on actual data.

For Java, I'd store the strings in a WeakHashMap or similar to allow 
them to be cached, but tossed away when more storage is needed.  Also 
you should look into getting some DB caching library, much easier than 
implementing this yourself (sorry I can't personally recommend any).

[toc] | [prev] | [next] | [standalone]

#5118

From	Willem <willem@toad.stack.nl>
Date	2011-06-08 17:28 +0000
Message-ID	<slrniuvcac.2j1v.willem@toad.stack.nl>
In reply to	#5115

markspace wrote:
) On 6/8/2011 9:35 AM, Abu Yahya wrote:
)> I considered using an SHA-512 hash of these strings and storing them in
)> the database. However, while these will save on storage space, it will
)> take time to do the hashing before comparing an incoming string. So I'm
)> still wasting time. (Collisions due to hashing will not be a problem,
)> since an occasional false positive will not be fatal for my application).
)
) You have to store the whole string.  Even if the SHA-512 hash codes are 
) equal, it could be that the strings are different.  You'll have to 
) eventually compare the raw string, even if the SHA is used as a 
) quick-out case.

No he doesn't.  Read again.  Especially the last bit between parentheses.


SaSW, Willem
-- 
Disclaimer: I am in no way responsible for any of the statements
            made in the above text. For all I know I might be
            drugged or something..
            No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

[toc] | [prev] | [next] | [standalone]

#5124

From	Abu Yahya <abu_yahya@invalid.com>
Date	2011-06-08 23:45 +0530
Message-ID	<isoe88$pdj$2@speranza.aioe.org>
In reply to	#5118

On 6/8/2011 10:58 PM, Willem wrote:
> markspace wrote:
> ) On 6/8/2011 9:35 AM, Abu Yahya wrote:
> )>  I considered using an SHA-512 hash of these strings and storing them in
> )>  the database. However, while these will save on storage space, it will
> )>  take time to do the hashing before comparing an incoming string. So I'm
> )>  still wasting time. (Collisions due to hashing will not be a problem,
> )>  since an occasional false positive will not be fatal for my application).
> )
> ) You have to store the whole string.  Even if the SHA-512 hash codes are
> ) equal, it could be that the strings are different.  You'll have to
> ) eventually compare the raw string, even if the SHA is used as a
> ) quick-out case.
>
> No he doesn't.  Read again.  Especially the last bit between parentheses.
>

Yeah, it seems he missed that last part or read it incorrectly.

[toc] | [prev] | [next] | [standalone]

#5123

From	Abu Yahya <abu_yahya@invalid.com>
Date	2011-06-08 23:45 +0530
Message-ID	<isoe78$pdj$1@speranza.aioe.org>
In reply to	#5115

On 6/8/2011 10:19 PM, markspace wrote:
> On 6/8/2011 9:35 AM, Abu Yahya wrote:
>
>> I considered using an SHA-512 hash of these strings and storing them in
>> the database. However, while these will save on storage space, it will
>> take time to do the hashing before comparing an incoming string. So I'm
>> still wasting time. (Collisions due to hashing will not be a problem,
>> since an occasional false positive will not be fatal for my application).
>
>
> You have to store the whole string. Even if the SHA-512 hash codes are
> equal, it could be that the strings are different. You'll have to
> eventually compare the raw string, even if the SHA is used as a
> quick-out case.

You're right about comparing the whole strings anyway, but, for this 
application I'm creating, I wouldn't mind very very rare incorrect results.

>
> No one can really tell what is "faster" or "wasting time" until you can
> better characterize the usage patterns. How big can these strings get?
> How often will you get an actual duplicate? What's the penalty when you
> need to add a new string? You'll need to implement a few algorithms,
> profile them and then make a decision based on actual data.

That makes sense. I'd need to analyze my input data and then run some 
empirical tests.

>
> For Java, I'd store the strings in a WeakHashMap or similar to allow
> them to be cached, but tossed away when more storage is needed. Also you
> should look into getting some DB caching library, much easier than
> implementing this yourself (sorry I can't personally recommend any).

The WeakHashMap idea looks useful. As for a DB caching library, would 
you recommend this as a replacement for the WeakHashMap, or as a complement?

Thanks for the useful pointers.

[toc] | [prev] | [next] | [standalone]

#5116

From	David Kerber <dkerber@WarrenRogersAssociates.invalid>
Date	2011-06-08 12:58 -0400
Message-ID	<MPG.285974489deb29be989729@news.eternal-september.org>
In reply to	#5113

[This followup was posted to comp.lang.java.databases and a copy was 
sent to the cited author.]

In article <iso8cm$a80$1@speranza.aioe.org>, abu_yahya@invalid.com 
says...
> 
> A small application that I'm making requires me to store very long 
> strings (>1000 characters) in a database.

Unless you're storing millions of these strings or using Access, I'd say 
just store and use the string.  I think you'll find that the speed 
penalty is nearly unnoticeable.

D

[toc] | [prev] | [next] | [standalone]

#5126

From	Abu Yahya <abu_yahya@invalid.com>
Date	2011-06-08 23:49 +0530
Message-ID	<isoeeo$pdj$3@speranza.aioe.org>
In reply to	#5116

On 6/8/2011 10:28 PM, David Kerber wrote:
> [This followup was posted to comp.lang.java.databases and a copy was
> sent to the cited author.]
>
> In article<iso8cm$a80$1@speranza.aioe.org>, abu_yahya@invalid.com
> says...
>>
>> A small application that I'm making requires me to store very long
>> strings (>1000 characters) in a database.
>
> Unless you're storing millions of these strings or using Access, I'd say
> just store and use the string.  I think you'll find that the speed
> penalty is nearly unnoticeable.

Thanks - simply storing them the way they are does seem to be the best 
way forward.

[toc] | [prev] | [next] | [standalone]

#5130

From	Lothar Kimmeringer <news200709@kimmeringer.de>
Date	2011-06-08 20:31 +0200
Message-ID	<19aedpuwexqkx$.dlg@kimmeringer.de>
In reply to	#5116

David Kerber wrote:

> In article <iso8cm$a80$1@speranza.aioe.org>, abu_yahya@invalid.com 
> says...
>> 
>> A small application that I'm making requires me to store very long 
>> strings (>1000 characters) in a database.
> 
> Unless you're storing millions of these strings or using Access, I'd say 
> just store and use the string.  I think you'll find that the speed 
> penalty is nearly unnoticeable.

Depends on the database. Some of them force you to use CLOBS for
text-columns with more than 255 characters. CLOBS are a PITA in
terms of indexing.


Regards, Lothar
-- 
Lothar Kimmeringer                E-Mail: spamfang@kimmeringer.de
               PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
                 questions!

[toc] | [prev] | [next] | [standalone]

#5152

From	Harry Tuttle <OTPXDAJCSJVU@spammotel.com>
Date	2011-06-09 10:50 +0200
Message-ID	<95bfpaFgboU1@mid.individual.net>
In reply to	#5130

Lothar Kimmeringer, 08.06.2011 20:31:
> Depends on the database. Some of them force you to use CLOBS for
> text-columns with more than 255 characters. CLOBS are a PITA in
> terms of indexing.

No serious DBMS requires you to use a CLOB for columns with more than 255 characters.

The least flexible "big" name is Oracle, which has a limit of 4000 bytes. Beyond that a CLOB is needed.

Most others have a much larger limit on VARCHAR columns (32k, 64k, 1GB, ...)

[toc] | [prev] | [next] | [standalone]

#5155

From	bugbear <bugbear@trim_papermule.co.uk_trim>
Date	2011-06-09 11:44 +0100
Message-ID	<__KdncFXQ4icOW3QnZ2dnUVZ8qednZ2d@brightview.co.uk>
In reply to	#5152

Harry Tuttle wrote:
> Lothar Kimmeringer, 08.06.2011 20:31:
>> Depends on the database. Some of them force you to use CLOBS for
>> text-columns with more than 255 characters. CLOBS are a PITA in
>> terms of indexing.
>
> No serious DBMS requires you to use a CLOB for columns with more than
> 255 characters.
>
> The least flexible "big" name is Oracle, which has a limit of 4000
> bytes. Beyond that a CLOB is needed.
>
> Most others have a much larger limit on VARCHAR columns (32k, 64k, 1GB,
> ...)

Agreed, seconded etc.

To OP; just store the strings as you need, make sure the
column is indexed, and simply query with your input
string in the "obvious" way.

Unless this *proves* to be a catastrophe (which I doubt)...
job done.

   BugBear

[toc] | [prev] | [next] | [standalone]

#5170

From	Harry Tuttle <OTPXDAJCSJVU@spammotel.com>
Date	2011-06-10 10:15 +0200
Message-ID	<95e23mFcv3U1@mid.individual.net>
In reply to	#5130

Lothar Kimmeringer, 08.06.2011 20:31:
> Depends on the database. Some of them force you to use CLOBS for
> text-columns with more than 255 characters. CLOBS are a PITA in
> terms of indexing.

No serious DBMS requires you to use a CLOB for columns with more than 255 characters.

The least flexible "big" name is Oracle, which has a limit of 4000 bytes. Beyond that a CLOB is needed.

Most others have a much larger limit on VARCHAR columns (32k, 64k, 1GB, ...)

[toc] | [prev] | [next] | [standalone]

#5122

From	Gene Wirchenko <genew@ocis.net>
Date	2011-06-08 11:07 -0700
Message-ID	<g6evu6t9847r0g5u2nj17aj61gsktspb93@4ax.com>
In reply to	#5113

On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya <abu_yahya@invalid.com>
wrote:

>A small application that I'm making requires me to store very long 
>strings (>1000 characters) in a database.
>
>I will need to use these strings later to compare for equality to 
>incoming strings from another application. I will also want to add some 
>of the incoming strings to the storage, if they meet certain criteria.
>
>For my application, I get a feeling that storing these strings in my 
>table will be a waste of space, and will impact performance due to 
>retrieval and storage times, as well as comparison times.

     Your feeling is irrelevant.  You should benchmark.  If you do
not, you may end up jumping through hoops for something that is
unneeded (though you may never find out it that it is unneeded).

>I considered using an SHA-512 hash of these strings and storing them in 
>the database. However, while these will save on storage space, it will 
>take time to do the hashing before comparing an incoming string. So I'm 
>still wasting time. (Collisions due to hashing will not be a problem, 
>since an occasional false positive will not be fatal for my application).

     What does "occasional" mean?

>What would be the best approach?

     If it is a small app, why are you worrying about it so much?
Create a straightforward design, code it, and see if it is fast
enough.  If it is, keep it.  If it is not, then and only then, concern
yourself with how to speed it up.

     Quit wasting time posting here, and benchmark something.  <BEG>

Sincerely,

Gene Wirchenko

[toc] | [prev] | [next] | [standalone]

#5129

From	Abu Yahya <abu_yahya@invalid.com>
Date	2011-06-08 23:58 +0530
Message-ID	<isof0m$rcp$1@speranza.aioe.org>
In reply to	#5122

On 6/8/2011 11:37 PM, Gene Wirchenko wrote:
> On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya<abu_yahya@invalid.com>
> wrote:
>
>> I considered using an SHA-512 hash of these strings and storing them in
>> the database. However, while these will save on storage space, it will
>> take time to do the hashing before comparing an incoming string. So I'm
>> still wasting time. (Collisions due to hashing will not be a problem,
>> since an occasional false positive will not be fatal for my application).
>
>       What does "occasional" mean?
>

The application will be doing some probabilistic evaluation using the 
data it is storing, and needs to be fairly close to the actual /most/ of 
the time. Because it would anyway be impossible to be correct 100% of 
the time, using a wrong value for a particular piece of input data will 
not matter, provided it does not happen too often. (In my text above, 
"rare" would have been a better choice of word than "occasional")

[toc] | [prev] | [next] | [standalone]

#5154

From	Hallvard B Furuseth <h.b.furuseth@usit.uio.no>
Date	2011-06-09 12:38 +0200
Message-ID	<hbf.20110609mhkc@bombur.uio.no>
In reply to	#5122

Gene Wirchenko writes:
>On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya <abu_yahya@invalid.com>
>wrote:
>>For my application, I get a feeling that storing these strings in my 
>>table will be a waste of space, and will impact performance due to 
>>retrieval and storage times, as well as comparison times.
>
>      Your feeling is irrelevant.  You should benchmark.  If you do
> not, you may end up jumping through hoops for something that is
> unneeded (though you may never find out it that it is unneeded).

Indeed.  No point in using a lot of time to solve a non-problem.  But
after the benchmark, the decision can depend on who the application is
for.  If it scales poorly, that can bite other users with different
input data.

OTOH if he'll be the only user and he finds that full strings and SHA
are both too slow: Another approach would be to use a faster hash, count
hash collisions, and don't bother with more if the count is acceptable.

Or try tries, as Tom Anderson suggested.

-- 
Hallvard

[toc] | [prev] | [next] | [standalone]

#5166

From	Michael Wojcik <mwojcik@newsguy.com>
Date	2011-06-09 17:32 -0400
Message-ID	<isre5k32k44@news3.newsguy.com>
In reply to	#5122

Gene Wirchenko wrote:
> On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya <abu_yahya@invalid.com>
> wrote:
> 
>> I considered using an SHA-512 hash of these strings and storing them in 
>> the database. However, while these will save on storage space, it will 
>> take time to do the hashing before comparing an incoming string. So I'm 
>> still wasting time. (Collisions due to hashing will not be a problem, 
>> since an occasional false positive will not be fatal for my application).
> 
>      What does "occasional" mean?

Assuming the application doesn't accidentally run afoul of a
hitherto-unknown flaw in SHA-512 - a tremendously unlikely event - it
means "once in every N/2**256 uses", where N is the current number of
hashes in the database. (2**256 because of the Birthday Paradox; we're
interested in the probability of two arbitrary colliding pre-images.)

Or, in other words, "probably not before the heat death of the
universe". (Or false vacuum decay, zombie apocalypse, Rapture, etc.)

Worrying about the time to hash the string is silly. It's linear in
the length of the string, so it's roughly equivalent to the time taken
to do a few comparisons (where "a few" depends on how long, on
average, the matching prefix of the new string and the candidates is).

However, as various others have pointed out, this is premature
optimization. There's no reason to use any design other than the
obvious until a problem is demonstrated.

-- 
Michael Wojcik
Micro Focus
Rhetoric & Writing, Michigan State University

[toc] | [prev] | [next] | [standalone]

#5174

From	bugbear <bugbear@trim_papermule.co.uk_trim>
Date	2011-06-10 10:51 +0100
Message-ID	<z6qdnRH0SMqAdGzQnZ2dnUVZ8n-dnZ2d@brightview.co.uk>
In reply to	#5166

Michael Wojcik wrote:
> Gene Wirchenko wrote:
>> On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya<abu_yahya@invalid.com>
>> wrote:
>>
>>> I considered using an SHA-512 hash of these strings and storing them in
>>> the database. However, while these will save on storage space, it will
>>> take time to do the hashing before comparing an incoming string. So I'm
>>> still wasting time. (Collisions due to hashing will not be a problem,
>>> since an occasional false positive will not be fatal for my application).
>>
>>       What does "occasional" mean?
>
> Assuming the application doesn't accidentally run afoul of a
> hitherto-unknown flaw in SHA-512 - a tremendously unlikely event - it
> means "once in every N/2**256 uses", where N is the current number of
> hashes in the database. (2**256 because of the Birthday Paradox; we're
> interested in the probability of two arbitrary colliding pre-images.)
>
> Or, in other words, "probably not before the heat death of the
> universe". (Or false vacuum decay, zombie apocalypse, Rapture, etc.)
>
> Worrying about the time to hash the string is silly. It's linear in
> the length of the string, so it's roughly equivalent to the time taken
> to do a few comparisons (where "a few" depends on how long, on
> average, the matching prefix of the new string and the candidates is).
>
> However, as various others have pointed out, this is premature
> optimization. There's no reason to use any design other than the
> obvious until a problem is demonstrated.

I any case, it's plausible that the DB is using "something"
to index the strings, in other words the string search in the DB
already has some-kind-of optimisation.

   BugBear

[toc] | [prev] | [next] | [standalone]

#5128

From	Lothar Kimmeringer <news200709@kimmeringer.de>
Date	2011-06-08 20:28 +0200
Message-ID	<171dpt2926br2.dlg@kimmeringer.de>
In reply to	#5113

F'up to cljp

Abu Yahya wrote:

> I considered using an SHA-512 hash of these strings and storing them in 
> the database. However, while these will save on storage space, it will 
> take time to do the hashing before comparing an incoming string. So I'm 
> still wasting time. (Collisions due to hashing will not be a problem, 
> since an occasional false positive will not be fatal for my application).

If you write seldom and read often, why not using two columns:

string_hashcode
sha1_hashcode

If the first is equal, you can calculate the sha1-hash for the string
to be checked and if that is equal as well, you can consider the
string as equal. That both hashes collide I expect to be very
very unlikely (which is why I changed the other alg to sha-1, that
should be considerably more performant than sha512).

So calculation of the more complex algorithm is only done while
storing to the database and when checking a string that is already
in the database. If you have that case very often you still might
get a better performance with String.hashcode and SHA1 than with
just SHA512.

> What would be the best approach?

There is no single best approach, only an optimal one. Which
one it is dependend on what defines one way to be better than
the other (in terms of performance, storage-space, collision-
rates, etc).

Regards, Lothar
-- 
Lothar Kimmeringer                E-Mail: spamfang@kimmeringer.de
               PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
                 questions!

[toc] | [prev] | [next] | [standalone]

#5141

From	Martin Gregorie <martin@address-in-sig.invalid>
Date	2011-06-08 22:02 +0000
Message-ID	<isorht$97k$1@localhost.localdomain>
In reply to	#5128

On Wed, 08 Jun 2011 20:28:11 +0200, Lothar Kimmeringer wrote:

> If you write seldom and read often, why not using two columns:
> 
> string_hashcode
> sha1_hashcode
>
You haven't given us a maximum size for the string or the name of the 
target database, so its possible that implementation constraints will 
prevent the strings from fitting into a character() column and you'll 
have to use a character varying, text, clob etc. instead. 

If this type  isn't allowed as the table's primary key, you can use the 
hash code as the primary key. Since it doesn't matter if some strings 
have clashing hash codes this can't cause a problem.


-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |

[toc] | [prev] | [next] | [standalone]

#5134

From	rossum <rossum48@coldmail.com>
Date	2011-06-08 21:38 +0100
Message-ID	<eimvu65abk3j0n0l80a85u979bkij3rv8a@4ax.com>
In reply to	#5113

On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya <abu_yahya@invalid.com>
wrote:

>A small application that I'm making requires me to store very long 
>strings (>1000 characters) in a database.
>
>I will need to use these strings later to compare for equality to 
>incoming strings from another application. I will also want to add some 
>of the incoming strings to the storage, if they meet certain criteria.
>
>For my application, I get a feeling that storing these strings in my 
>table will be a waste of space, and will impact performance due to 
>retrieval and storage times, as well as comparison times.
>
>I considered using an SHA-512 hash of these strings and storing them in 
>the database. However, while these will save on storage space, it will 
>take time to do the hashing before comparing an incoming string. So I'm 
>still wasting time. (Collisions due to hashing will not be a problem, 
>since an occasional false positive will not be fatal for my application).
>
>What would be the best approach?
As others have said, write the simple obvious approach and see if that
is good enough.  Tune where required after measuring.

Lothar's suggestion of using SHA-1 is good.  You could even drop back
to MD-4 if you are sure that nobody is going to be deliberately trying
to create false collisions.  MD-4 is much too badly broken for any
cryptographic purposes, but is even faster than SHA-1.

If the amount of storage needed is a problem then you might want to
zip the strings before storing them.

If you can be sure that the zipped versions are identical (not always
possible with unicode combining characters) then you could hash the
zipped version rather than the originals for more time saving.

rossum

[toc] | [prev] | [next] | [standalone]

#5139

From	Robert Klemme <shortcutter@googlemail.com>
Date	2011-06-08 23:20 +0200
Message-ID	<95a7d0Frr9U1@mid.individual.net>
In reply to	#5113

On 08.06.2011 18:35, Abu Yahya wrote:
> A small application that I'm making requires me to store very long
> strings (>1000 characters) in a database.
>
> I will need to use these strings later to compare for equality to
> incoming strings from another application. I will also want to add some
> of the incoming strings to the storage, if they meet certain criteria.
>
> For my application, I get a feeling that storing these strings in my
> table will be a waste of space, and will impact performance due to
> retrieval and storage times, as well as comparison times.
>
> I considered using an SHA-512 hash of these strings and storing them in
> the database. However, while these will save on storage space, it will
> take time to do the hashing before comparing an incoming string. So I'm
> still wasting time. (Collisions due to hashing will not be a problem,
> since an occasional false positive will not be fatal for my application).
>
> What would be the best approach?

Just out of curiosity: what do those strings represent?  What do you do 
with them?

Kind regards

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

[toc] | [prev] | [next] | [standalone]

Page 1 of 2 [1] 2 Next page →

csiph-web

Storing large strings for future equality checks

Contents

#5113 — Storing large strings for future equality checks

#5115

#5118

#5124

#5123

#5116

#5126

#5130

#5152

#5155

#5170

#5122

#5129

#5154

#5166

#5174

#5128

#5141

#5134

#5139