Groups > comp.lang.ruby > #2037

Re: Inserting hash value slows down as table gets larger

Path	csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!news.albasani.net!news.musoftware.de!wum.musoftware.de!de-l.enfer-du-nord.net!feeder1.enfer-du-nord.net!talisker.lacave.net!lacave.net!not-for-mail
From	Philip Rhoades <phil@pricom.com.au>
Newsgroups	comp.lang.ruby
Subject	Re: Inserting hash value slows down as table gets larger
Date	Thu, 31 Mar 2011 04:58:30 -0500
Organization	Service de news de lacave.net
Lines	173
Message-ID	<4D944FFA.7060601@pricom.com.au> (permalink)
References	<4D8204C0.4040004@pricom.com.au> <882988B5-32AF-4C5B-8C04-E5A91EFDA389@3dlabs.com> <f50b25a82a9044dea45150b0d1d74092@ruby-forum.com> <4D8298E8.9080004@pricom.com.au> <f16a1415b298c9ff56433eac95bccf0b@ruby-forum.com> <4D8474A4.4020106@pricom.com.au> <4d1919a5808a55af36b034cf3feac5a9@ruby-forum.com> <4D859407.7060904@pricom.com.au>
NNTP-Posting-Host	bristol.highgroove.com
Content-Type	text/plain; charset=ISO-8859-1
Content-Transfer-Encoding	7bit
X-Trace	talisker.lacave.net 1301567408 33004 65.111.164.187 (31 Mar 2011 10:30:08 GMT)
X-Complaints-To	abuse@lacave.net
NNTP-Posting-Date	Thu, 31 Mar 2011 10:30:08 +0000 (UTC)
In-Reply-To	<4D859407.7060904@pricom.com.au>
X-Received-From	This message has been automatically forwarded from the ruby-talk mailing list by a gateway at comp.lang.ruby. If it is SPAM, it did not originate at comp.lang.ruby. Please report the original sender, and not us. Thanks! For more details about this gateway, please visit: http://blog.grayproductions.net/categories/the_gateway
X-Mail-Count	380682
X-Ml-Name	ruby-talk
X-Rubymirror	Yes
X-Ruby-Talk	<4D944FFA.7060601@pricom.com.au>
Xref	x330-a1.tempe.blueboxinc.net comp.lang.ruby:2037

Show key headers only | View raw

People,

To add to my last post:


On 2011-03-20 16:44, Philip Rhoades wrote:
> Brian,
>
>
> On 2011-03-20 08:31, Brian Candler wrote:
>> Philip Rhoades wrote in post #988264:
>>>> Your sample code looks like it's handling numeric-style data
>>>> (although I
>>>> realise this is just a test case for the problems you're having).
>>>> Integers in the range -2^30..+2^30 (or larger in on a 64-bit machine)
>>>> have their values encoded within the reference, so no memory allocation
>>>> is done.
>>>
>>>
>>> Are you talking about the hash key or the hash values?
>>
>> Either.
>
>
> Right - for the following in my test script:
>
> h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){ Array.new( 20,
> rand(100) ) } }
> h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = Array.new(2){ Array.new(1){
> Array.new( 20, rand(100) ) } }
> h1[ "#{a}.#{b}.#{c}.#{d}".to_i ] = Array.new(2){ Array.new(1){
> Array.new( 20, rand(100) ) } }
> h1[ "#{a}.#{b}.#{c}.#{d}".to_i.freeze ] = Array.new(2){ Array.new(1){
> Array.new( 20, rand(100) ) } }
>
> I get the following times:
>
> 18.350s
> 18.113s
> 4.724s
> 4.896s
>
> So I guess I should live with the slight decrease of readability when
> searching for particular results in the JSON output file by using ints
> instead of strings for the hash keys.
>
>
>>> - the values in
>>> the real script will all be floats . .
>>
>> Then they will be allocated on the heap, just like strings. I presume
>> you're aware of the inherent inaccuracy of floats (in any language), and
>> are OK with this.
>>
>>>> 1.0/2.0 == 1.0 + 1.0/2.0 - 1.0
>> => true
>>>> 1.0/10.0 == 1.0 + 1.0/10.0 - 1.0
>> => false
>
>
> I suppose I could convert them all to six or eight digit ints . . they
> are measures of biological diversity and changing them backwards and
> forwards is a bit of a hassle but maybe it is worth doing for the speed
> advantage? - I will try my test script with floats and see what happens.
>
>
>>>> Or, if you're handling a relatively small set of unique values, you
>>>> could use symbols instead of strings. Each symbol reference again
>>>> doesn't allocate any memory; it just points to the entry in the symbol
>>>> table.
>>>
>>> Not sure what you mean - example?
>>
>> a = []
>> a[0] = :foo
>> a[1] = :foo
>> a[2] = :foo
>>
>> puts a[0].object_id
>> puts a[1].object_id
>> puts a[2].object_id
>
>
> The reason the numbers are called "seeds" is that they correspond to the
> seed for the random dumber generator in the C/C++ simulation program -
> so they are all unique for each of the 32,000 simulations.
>
>
>>>> Or you could use frozen strings and share the references.
>>>>
>>>> LABEL1 = "00".freeze
>>>> LABEL2 = "01".freeze
>>>> MAP = {LABEL1 => LABEL1, LABEL2=>LABEL2}
>>>> a = MAP["00"]
>>>> puts a.object_id
>>>> puts LABEL1.object_id
>>>
>>>
>>> I ran that code but I don't understand how it helps . .
>>
>> It uses less memory if you have (say) millions of identical strings.
>
>
> Not the case for the keys and unlikely for the values.
>
>
>> It
>> may help garbage collection performance, but not much else
>>
>>>> Although that's more work than symbols, it might be useful depending on
>>>> your use case. For example, you could replace a subset of the values
>>>> you
>>>> see with these frozen strings (which covers the majority of the data),
>>>> whilst still allowing arbitrary other strings.
>>>
>>>
>>> Still not clear - examples?
>>
>> Suppose the strings "foo" and "bar" comprise 80% of your hash keys or
>> values. Then mapping them to the same frozen string means that you only
>> have one instance of string "foo" and one instance of string "bar" in
>> the system, instead of (say) millions of distinct strings. You can still
>> use individual strings for the other 20%.
>
>
> Unfortunately this doesn't correspond to my case . .
>
>
>> This is really an edge optimisation though, you really shouldn't need to
>> be worrying about these things - if they are significant, then perhaps
>> ruby is the wrong language for the problem in hand.
>>
>>> The other thing that occurred to me was that on my 64-bit machine maybe
>>> I could run 2-3 threads for inserting into the hash table?
>>
>> Noooo..... even in ruby 1.9, there is a global interpreter lock.
>> Multiple threads gain you nothing really, except for threads which are
>> blocked on I/O.
>
>
> Right.
>
>
>> Even if there were not, having multiple threads contending on the same
>> hash (and controlling access via, say, a mutex) would be pretty much
>> guaranteed to make performance worse not better.
>
>
> OK - oh well it was worth a thought!


I ran the normal number of the inner two loops and only one of the 
outermost loop (ie 1/50th of the total) and got the following with 
different versions of Ruby:

ruby-1.8.7-p334 [ x86_64 ]	=  31.41
ruby-1.9.2-p180 [ x86_64 ]	=  19.53
rbx-head [ ]			= 128.42

Many thanks!

Regards,

Phil.

-- 
Philip Rhoades

GPO Box 3411
Sydney NSW	2001
Australia
E-mail:  phil@pricom.com.au

Back to comp.lang.ruby | Previous | Next | Find similar | Unroll thread

Thread

Re: Inserting hash value slows down as table gets larger Philip Rhoades <phil@pricom.com.au> - 2011-03-31 04:58 -0500

csiph-web