Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.java.programmer > #19910 > unrolled thread

SortedMap question?

Started byKnute Johnson <nospam@rabbitbrush.frazmtn.com>
First post2012-11-24 15:23 -0800
Last post2012-11-25 14:00 +0000
Articles 20 on this page of 38 — 13 participants

Back to article view | Back to comp.lang.java.programmer


Contents

  SortedMap question? Knute Johnson <nospam@rabbitbrush.frazmtn.com> - 2012-11-24 15:23 -0800
    Re: SortedMap question? Joerg Meier <joergmmeier@arcor.de> - 2012-11-25 00:57 +0100
      Re: SortedMap question? David Lamb <dalamb@cs.queensu.ca> - 2012-11-25 08:49 -0500
        Re: SortedMap question? David Lamb <dalamb@cs.queensu.ca> - 2012-11-25 08:59 -0500
          Re: SortedMap question? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-25 14:02 +0000
            Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 14:20 -0500
          Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 14:24 -0500
            Re: SortedMap question? David Lamb <dalamb@cs.queensu.ca> - 2012-11-25 21:25 -0500
              Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 22:24 -0500
              Re: SortedMap question? Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-11-26 15:45 -0800
                Re: SortedMap question? Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-26 20:29 -0500
      Re: SortedMap question? Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-25 09:02 -0500
        Re: SortedMap question? Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-25 17:47 +0000
          Re: SortedMap question? Knute Johnson <nospam@rabbitbrush.frazmtn.com> - 2012-11-25 16:25 -0800
            Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 19:36 -0500
              Re: SortedMap question? Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-25 22:58 -0500
            Re: SortedMap question? Lew <lewbloch@gmail.com> - 2012-11-25 16:53 -0800
            Re: SortedMap question? Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-26 02:20 +0000
        Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 14:20 -0500
          Re: SortedMap question? Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-25 14:44 -0500
            Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 16:18 -0500
              Re: SortedMap question? Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-25 17:35 -0500
                Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 19:58 -0500
                Re: SortedMap question? Jim Janney <jjanney@shell.xmission.com> - 2012-11-26 12:45 -0700
                  Re: SortedMap question? Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-26 14:26 -0600
          Re: SortedMap question? Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-25 14:16 -0600
            Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 16:21 -0500
              Re: SortedMap question? Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-25 17:07 -0600
                Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 20:05 -0500
            Re: SortedMap question? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-26 01:34 +0000
              Re: SortedMap question? Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-25 20:36 -0600
                Re: SortedMap question? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-12-15 12:27 +0000
    Re: SortedMap question? Robert Klemme <shortcutter@googlemail.com> - 2012-11-25 01:36 +0100
    Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 20:00 -0500
    Re: SortedMap question? markspace <-@.> - 2012-11-24 17:34 -0800
      Re: SortedMap question? Robert Klemme <shortcutter@googlemail.com> - 2012-11-25 15:28 +0100
        Re: SortedMap question? Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-25 17:49 +0000
    Re: SortedMap question? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-25 14:00 +0000

Page 1 of 2  [1] 2  Next page →


#19910 — SortedMap question?

FromKnute Johnson <nospam@rabbitbrush.frazmtn.com>
Date2012-11-24 15:23 -0800
SubjectSortedMap question?
Message-ID<k8rkud$i3o$1@dont-email.me>
I have been operating under a mis-perception that a SortedMap would be 
more quickly searched than an unsorted Map.  More specifically that the 
cost would come when adding an element to the SortedMap.  I wrote a 
little test program to measure the amount of time it took to look for 
non-existent data in a Map and a SortedMap and it took somewhere between 
twice and three times as long to look for the data in the SortedMap.

So am I know correct in thinking that the only real advantage of a 
SortedMap is if you wish to iterate over the keys of the map in some order?

Thanks,

knute...

[toc] | [next] | [standalone]


#19911

FromJoerg Meier <joergmmeier@arcor.de>
Date2012-11-25 00:57 +0100
Message-ID<q4qoyde15lpk$.13amx8nbfrj28.dlg@40tude.net>
In reply to#19910
On Sat, 24 Nov 2012 15:23:55 -0800, Knute Johnson wrote:

> I have been operating under a mis-perception that a SortedMap would be 
> more quickly searched than an unsorted Map.  More specifically that the 
> cost would come when adding an element to the SortedMap.  I wrote a 
> little test program to measure the amount of time it took to look for 
> non-existent data in a Map and a SortedMap and it took somewhere between 
> twice and three times as long to look for the data in the SortedMap.

Considering Map is an Interface, I question your claim that you measured
its get(x) performance :p

Assuming you used HashMap, as far as I recall, its get(x) performance is
already at O(1) unless you mess up your hashcodes. Doesn't really go lower.

Liebe Gruesse,
		Joerg

-- 
Ich lese meine Emails nicht, replies to Email bleiben also leider
ungelesen.

[toc] | [prev] | [next] | [standalone]


#19933

FromDavid Lamb <dalamb@cs.queensu.ca>
Date2012-11-25 08:49 -0500
Message-ID<k8t7l5$l22$2@dont-email.me>
In reply to#19911
On 24/11/2012 6:57 PM, Joerg Meier wrote:
> On Sat, 24 Nov 2012 15:23:55 -0800, Knute Johnson wrote:
>>  I wrote a
>> little test program to measure the amount of time it took to look for
>> non-existent data in a Map and a SortedMap and it took somewhere between
>> twice and three times as long to look for the data in the SortedMap.
>
> Considering Map is an Interface, I question your claim that you measured
> its get(x) performance :p
>
> Assuming you used HashMap, as far as I recall, its get(x) performance is
> already at O(1) unless you mess up your hashcodes. Doesn't really go lower.

The constant factors can matter a lot. 100 > 1 even though both are O(1).

[toc] | [prev] | [next] | [standalone]


#19935

FromDavid Lamb <dalamb@cs.queensu.ca>
Date2012-11-25 08:59 -0500
Message-ID<k8t87f$rsq$2@dont-email.me>
In reply to#19933
On 25/11/2012 8:49 AM, David Lamb wrote:
> On 24/11/2012 6:57 PM, Joerg Meier wrote:
>> Assuming you used HashMap, as far as I recall, its get(x) performance is
>> already at O(1) unless you mess up your hashcodes. Doesn't really go
>> lower.
>
> The constant factors can matter a lot. 100 > 1 even though both are O(1).

Which, sigh, of course they're not, as Arne pointed out. O(logN) for 
SortedMap, as I should have immediately picked up on from the name alone.

[toc] | [prev] | [next] | [standalone]


#19938

From"Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org>
Date2012-11-25 14:02 +0000
Message-ID<xL2dnTzi4dJnuS_NnZ2dnUVZ7r2dnZ2d@bt.com>
In reply to#19935
David Lamb wrote:

> > The constant factors can matter a lot. 100 > 1 even though both are
> > O(1).
>
> Which, sigh, of course they're not, as Arne pointed out. O(logN) for
> SortedMap, as I should have immediately picked up on from the name alone.

Still a good point though ;-)

    -- chris 

[toc] | [prev] | [next] | [standalone]


#19950

FromArne Vajhøj <arne@vajhoej.dk>
Date2012-11-25 14:20 -0500
Message-ID<50b26f8d$0$295$14726298@news.sunsite.dk>
In reply to#19938
On 11/25/2012 9:02 AM, Chris Uppal wrote:
> David Lamb wrote:
>
>>> The constant factors can matter a lot. 100 > 1 even though both are
>>> O(1).
>>
>> Which, sigh, of course they're not, as Arne pointed out. O(logN) for
>> SortedMap, as I should have immediately picked up on from the name alone.
>
> Still a good point though ;-)

:-)

Arne

[toc] | [prev] | [next] | [standalone]


#19951

FromArne Vajhøj <arne@vajhoej.dk>
Date2012-11-25 14:24 -0500
Message-ID<50b2705c$0$295$14726298@news.sunsite.dk>
In reply to#19935
On 11/25/2012 8:59 AM, David Lamb wrote:
> On 25/11/2012 8:49 AM, David Lamb wrote:
>> On 24/11/2012 6:57 PM, Joerg Meier wrote:
>>> Assuming you used HashMap, as far as I recall, its get(x) performance is
>>> already at O(1) unless you mess up your hashcodes. Doesn't really go
>>> lower.
>>
>> The constant factors can matter a lot. 100 > 1 even though both are O(1).
>
> Which, sigh, of course they're not, as Arne pointed out. O(logN) for
> SortedMap, as I should have immediately picked up on from the name alone.

SortedMap is still just an interface.

Arne

[toc] | [prev] | [next] | [standalone]


#19974

FromDavid Lamb <dalamb@cs.queensu.ca>
Date2012-11-25 21:25 -0500
Message-ID<k8ujuj$mu4$2@dont-email.me>
In reply to#19951
On 25/11/2012 2:24 PM, Arne Vajhøj wrote:
> On 11/25/2012 8:59 AM, David Lamb wrote:
>> Which, sigh, of course they're not, as Arne pointed out. O(logN) for
>> SortedMap, as I should have immediately picked up on from the name alone.
>
> SortedMap is still just an interface.

Maybe so, but I suspect programmers would be unhappy if an 
implementation didn't actually sort anything. I suppose there might be 
special cases; sortedMap from small integers to anything could be O(1) 
for get()

[toc] | [prev] | [next] | [standalone]


#19977

FromArne Vajhøj <arne@vajhoej.dk>
Date2012-11-25 22:24 -0500
Message-ID<50b2e108$0$285$14726298@news.sunsite.dk>
In reply to#19974
On 11/25/2012 9:25 PM, David Lamb wrote:
> On 25/11/2012 2:24 PM, Arne Vajhøj wrote:
>> On 11/25/2012 8:59 AM, David Lamb wrote:
>>> Which, sigh, of course they're not, as Arne pointed out. O(logN) for
>>> SortedMap, as I should have immediately picked up on from the name
>>> alone.
>>
>> SortedMap is still just an interface.
>
> Maybe so, but I suspect programmers would be unhappy if an
> implementation didn't actually sort anything. I suppose there might be
> special cases; sortedMap from small integers to anything could be O(1)
> for get()

Better is difficult for the general case. Worse is easy.

Arne

[toc] | [prev] | [next] | [standalone]


#19989

FromDaniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date2012-11-26 15:45 -0800
Message-ID<yaTss.19810$W21.8324@newsfe27.iad>
In reply to#19974
On 11/25/12 6:25 PM, David Lamb wrote:
> On 25/11/2012 2:24 PM, Arne Vajhøj wrote:
>> On 11/25/2012 8:59 AM, David Lamb wrote:
>>> Which, sigh, of course they're not, as Arne pointed out. O(logN) for
>>> SortedMap, as I should have immediately picked up on from the name
>>> alone.
>>
>> SortedMap is still just an interface.
>
> Maybe so, but I suspect programmers would be unhappy if an
> implementation didn't actually sort anything. I suppose there might be
> special cases; sortedMap from small integers to anything could be O(1)
> for get()
>
That special case happens to be called BitSet, which doesn't implement 
SortedMap :-)

[toc] | [prev] | [next] | [standalone]


#19991

FromEric Sosman <esosman@comcast-dot-net.invalid>
Date2012-11-26 20:29 -0500
Message-ID<k9152f$kl4$2@dont-email.me>
In reply to#19989
On 11/26/2012 6:45 PM, Daniel Pitts wrote:
> On 11/25/12 6:25 PM, David Lamb wrote:
>> On 25/11/2012 2:24 PM, Arne Vajhøj wrote:
>>> On 11/25/2012 8:59 AM, David Lamb wrote:
>>>> Which, sigh, of course they're not, as Arne pointed out. O(logN) for
>>>> SortedMap, as I should have immediately picked up on from the name
>>>> alone.
>>>
>>> SortedMap is still just an interface.
>>
>> Maybe so, but I suspect programmers would be unhappy if an
>> implementation didn't actually sort anything. I suppose there might be
>> special cases; sortedMap from small integers to anything could be O(1)
>> for get()
>>
> That special case happens to be called BitSet, which doesn't implement
> SortedMap :-)

     Another name for that special case is Value[] -- which
doesn't actually implement SortedMap, but does all that's needed.

-- 
Eric Sosman
esosman@comcast-dot-net.invalid

[toc] | [prev] | [next] | [standalone]


#19937

FromEric Sosman <esosman@comcast-dot-net.invalid>
Date2012-11-25 09:02 -0500
Message-ID<k8t8dp$tv5$1@dont-email.me>
In reply to#19911
On 11/24/2012 6:57 PM, Joerg Meier wrote:
> On Sat, 24 Nov 2012 15:23:55 -0800, Knute Johnson wrote:
>
>> I have been operating under a mis-perception that a SortedMap would be
>> more quickly searched than an unsorted Map.  More specifically that the
>> cost would come when adding an element to the SortedMap.  I wrote a
>> little test program to measure the amount of time it took to look for
>> non-existent data in a Map and a SortedMap and it took somewhere between
>> twice and three times as long to look for the data in the SortedMap.
>
> Considering Map is an Interface, I question your claim that you measured
> its get(x) performance :p
>
> Assuming you used HashMap, as far as I recall, its get(x) performance is
> already at O(1) unless you mess up your hashcodes. Doesn't really go lower.

     "HashMap is O(1)" is sort of true, but sweeps a lot of dirt
under the rug.  Most obviously, there's the possibility of a bad
or unlucky hash distribution -- by no means a remote possibility,
as <http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/>
demonstrates.  Even if the hash distribution is acceptably uniform,
there's also the matter of building the map in the first place: As
the insertions accumulate, HashMap may need to expand and re-insert
everything it already contains.  With N=n1+n2+...+nk mappings, there
could be n1+(n1+n2)+...+(n1+n2+...+nk) = k*n1+(k-1)*n2+...+1*nk
insertions altogether (a good up-front estimate of N can help here).

     Consider also the operation cost.  Searching in a TreeMap of
N mappings takes about lg(N) comparisons -- but with String keys
the comparisons will be of varying difficulties.  Comparisons near
the root will probably examine only the first few characters to
determine the result; only as the search approaches the leaves will
the Strings' tail characters come into play.  On a successful search
HashMap must iterate over the key String's entire length twice: Once
to compute the hashCode(), and again to verify that a match has in
fact been found (comparisons against non-matching keys in the same
bucket will be quick, like those near the root of a TreeMap).

     Finally, O() itself hides a lot of detail -- that is, after all,
its purpose.  Given the choice between O(N*N) Algorithm X and O(1)
Algorithm Y, which would you choose?  Might you change your mind
if you learned that X takes N*N nanoseconds while Y takes a constant
eighteen hours?  Might you change your mind yet again if you learned
that N was ten million?  O() alone is not enough to decide a question
of speed.

     Still and all: HashMap makes a good "default" choice, and is
"likely" to outperform TreeMap over a "reasonable range" of inputs.
Turn to TreeMap when the order of the keys is significant -- for
example, when  you need "closest match" for an absent key.

-- 
Eric Sosman
esosman@comcast-dot-net.invalid

[toc] | [prev] | [next] | [standalone]


#19944

FromMartin Gregorie <martin@address-in-sig.invalid>
Date2012-11-25 17:47 +0000
Message-ID<k8tljj$6jr$1@localhost.localdomain>
In reply to#19937
On Sun, 25 Nov 2012 09:02:29 -0500, Eric Sosman wrote:

> 
>      Still and all: HashMap makes a good "default" choice, and is
> "likely" to outperform TreeMap over a "reasonable range" of inputs. Turn
> to TreeMap when the order of the keys is significant -- for example,
> when  you need "closest match" for an absent key.
>
Or, if the key comparison is cheap, i.e. short strings or integer 
comparisons, and the key population is very large TreeMap can become 
attractive: with 8 million nodes you only need 23 comparisions to find a 
miss. Growing the tree isn't too expensive either: rebalancing a balanced 
red-black tree, which is necessary from time to time as it grows, is only 
a matter of adjusting pointers. 

There is one possible demerit that I found in what was the standard C 
implementation a while back: adding a node was very expensive because it 
required three mallocs: there were separate bits of heap storage for the 
pointers (left, right and parent), the key storage and the data storage.

On the box we were running at the time (a Dec Alphaserver running DEC 
Unix) the best I could get in a situation where each key was looked up 
and then added if it wasn't in the tree was around 700 lookups a second 
on a small dev box: not nearly enough. First I threw out the library 
code, combined pointers, key and data into one struct and rewrote the red-
black tree handler using Sedgewick. This pushed the rate up to 1500 and 
proved that malloc was the bottleneck, but it was still too slow for 
projected data volumes, so I moved memory management into user land by 
only mallocing for several MB at a time and allocating bits of these 
large chunks to the nodes. This rocked: the lookup rate immediately 
jumped to about 35,000 a second and, as a bonus, because all my pointers 
were relative to the big memory chunks and their places in a chunk table, 
it was possible to bring the live system's start-up time down to 2-3 
minutes by saving and reloading the chunks as files: rebuilding the red-
black tree from a database had been taking 40 minutes.

Yes, I know: all this is C and the JVM should be much better at heap 
management because 'top' shows that it only grabs and releases memory in 
large chunks which minimises the malloc/sbrk overheads. However, it does 
show that serious runtime optimisation may require you to know what's 
under the hood of library code and to realise that this code may have 
sacrificed speed on the altar of flexibility.
 

-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |

[toc] | [prev] | [next] | [standalone]


#19965

FromKnute Johnson <nospam@rabbitbrush.frazmtn.com>
Date2012-11-25 16:25 -0800
Message-ID<k8ucsh$o2v$1@dont-email.me>
In reply to#19944
On 11/25/2012 09:47, Martin Gregorie wrote:
> Or, if the key comparison is cheap, i.e. short strings or integer
> comparisons, and the key population is very large TreeMap can become
> attractive: with 8 million nodes you only need 23 comparisions to find a
> miss. Growing the tree isn't too expensive either: rebalancing a balanced
> red-black tree, which is necessary from time to time as it grows, is only
> a matter of adjusting pointers.

That's what I thought, that the searching for a match would be very 
quick once the data was in the TreeMap.  The test code I wrote may not 
have been any good.  I created a map with Strings for the keys and 
values.  I could get about 2 million entries before I started running 
into virtual memory issues slowing things down.  I searched for 
non-existent elements.  Using both a HashMap and a TreeMap, the TreeMap 
was significantly slower than the HashMap.  I even tried a 
ConcurrentHashMap and multiple threads to do the search to see if that 
would speed things up.  While that technique was better than TreeMap it 
was still slower than a plain HashMap.

So for the actual case that I am working on, I have a collection of 
about 5000 elements and am using an Integer as the key.  Data is rarely 
changed but often accessed.  There should never be a case where the 
searched for key will not exist.  What would you use, a HashMap or a 
TreeMap?

Thanks,

knute...

[toc] | [prev] | [next] | [standalone]


#19966

FromArne Vajhøj <arne@vajhoej.dk>
Date2012-11-25 19:36 -0500
Message-ID<50b2b971$0$294$14726298@news.sunsite.dk>
In reply to#19965
On 11/25/2012 7:25 PM, Knute Johnson wrote:
> On 11/25/2012 09:47, Martin Gregorie wrote:
>> Or, if the key comparison is cheap, i.e. short strings or integer
>> comparisons, and the key population is very large TreeMap can become
>> attractive: with 8 million nodes you only need 23 comparisions to find a
>> miss. Growing the tree isn't too expensive either: rebalancing a balanced
>> red-black tree, which is necessary from time to time as it grows, is only
>> a matter of adjusting pointers.
>
> That's what I thought, that the searching for a match would be very
> quick once the data was in the TreeMap.  The test code I wrote may not
> have been any good.  I created a map with Strings for the keys and
> values.  I could get about 2 million entries before I started running
> into virtual memory issues slowing things down.  I searched for
> non-existent elements.  Using both a HashMap and a TreeMap, the TreeMap
> was significantly slower than the HashMap.  I even tried a
> ConcurrentHashMap and multiple threads to do the search to see if that
> would speed things up.  While that technique was better than TreeMap it
> was still slower than a plain HashMap.
>
> So for the actual case that I am working on, I have a collection of
> about 5000 elements and am using an Integer as the key.  Data is rarely
> changed but often accessed.  There should never be a case where the
> searched for key will not exist.  What would you use, a HashMap or a
> TreeMap?

If it is important then you should measure!

But I would still expect HashMap to be faster than TreeMap.

Arne

[toc] | [prev] | [next] | [standalone]


#19978

FromEric Sosman <esosman@comcast-dot-net.invalid>
Date2012-11-25 22:58 -0500
Message-ID<k8updu$f62$1@dont-email.me>
In reply to#19966
On 11/25/2012 7:36 PM, Arne Vajhøj wrote:
> On 11/25/2012 7:25 PM, Knute Johnson wrote:
>> On 11/25/2012 09:47, Martin Gregorie wrote:
>>> Or, if the key comparison is cheap, i.e. short strings or integer
>>> comparisons, and the key population is very large TreeMap can become
>>> attractive: with 8 million nodes you only need 23 comparisions to find a
>>> miss. Growing the tree isn't too expensive either: rebalancing a
>>> balanced
>>> red-black tree, which is necessary from time to time as it grows, is
>>> only
>>> a matter of adjusting pointers.
>>
>> That's what I thought, that the searching for a match would be very
>> quick once the data was in the TreeMap.  The test code I wrote may not
>> have been any good.  I created a map with Strings for the keys and
>> values.  I could get about 2 million entries before I started running
>> into virtual memory issues slowing things down.  I searched for
>> non-existent elements.  Using both a HashMap and a TreeMap, the TreeMap
>> was significantly slower than the HashMap.  I even tried a
>> ConcurrentHashMap and multiple threads to do the search to see if that
>> would speed things up.  While that technique was better than TreeMap it
>> was still slower than a plain HashMap.
>>
>> So for the actual case that I am working on, I have a collection of
>> about 5000 elements and am using an Integer as the key.  Data is rarely
>> changed but often accessed.  There should never be a case where the
>> searched for key will not exist.  What would you use, a HashMap or a
>> TreeMap?
>
> If it is important then you should measure!

     Aaay-men!

> But I would still expect HashMap to be faster than TreeMap.

     Yes.  A HashMap with 5000 keys and the default 0.75 load
factor will want 5000/0.75 ~= 6667 buckets, which it will round
up to 8192.  With normal luck the 5000 keys will land in ~3743
buckets, about two-thirds of which will hold just one key.  So
a typical search involves a few arithmetic operations to locate
the bucket, then about 5000/3743 ~= 1.3 Integer#equals() calls.

     Searching a TreeMap of 5000 keys, on the other hand, will
use about lg(5000) ~= 11.3 Integer#compareTo() calls.  Uses of
equals() and compareTo() have different costs (equals() needs to
work with Objects and check their classes before comparing, but
compareTo() needs to produce a three-way answer rather than a
simpler boolean ...), but any differences are almost certainly
swamped by the disparity in call counts.

     If the key values are not too widely scattered, other
possibilities exist.  A BitSet can answer "present or absent"
queries very quickly and without taking much space, or a simple
array can serve as a Map stand-in.  But follow Arne's advice
when making your choice: Measure It!

-- 
Eric Sosman
esosman@comcast-dot-net.invalid

[toc] | [prev] | [next] | [standalone]


#19967

FromLew <lewbloch@gmail.com>
Date2012-11-25 16:53 -0800
Message-ID<858d42ec-16f0-453f-971e-d3d7d0de3bdf@googlegroups.com>
In reply to#19965
Knute Johnson wrote:
> That's what I thought, that the searching for a match would be very 
> quick once the data was in the TreeMap.  The test code I wrote may not 
> have been any good.  I created a map with Strings for the keys and 
> values.  I could get about 2 million entries before I started running 
> into virtual memory issues slowing things down.  I searched for 
> non-existent elements.  Using both a HashMap and a TreeMap, the TreeMap 
> was significantly slower than the HashMap.  I even tried a 
> ConcurrentHashMap and multiple threads to do the search to see if that 
> would speed things up.  While that technique was better than TreeMap it 
> was still slower than a plain HashMap.
> 
> So for the actual case that I am working on, I have a collection of 
> about 5000 elements and am using an Integer as the key.  Data is rarely 
> changed but often accessed.  There should never be a case where the 
> searched for key will not exist.  What would you use, a HashMap or a 
> TreeMap?

That would depend entirely on whether I needed to iterate the map in 
sorted order. There is little reason to prefer 'TreeMap' absent a sorting 
requirement except, as pointed out upthread, if you need guaranteed O(log n)
performance for gets rather than the probabilistic promise of hashing.
Such scenarios would require specialized knowledge of the data sets to be 
mapped.

Kudos for your efforts to obtain objective performance evidence.

-- 
Lew

[toc] | [prev] | [next] | [standalone]


#19973

FromMartin Gregorie <martin@address-in-sig.invalid>
Date2012-11-26 02:20 +0000
Message-ID<k8ujlm$enu$1@localhost.localdomain>
In reply to#19965
On Sun, 25 Nov 2012 16:25:02 -0800, Knute Johnson wrote:

> So for the actual case that I am working on, I have a collection of
> about 5000 elements and am using an Integer as the key.  Data is rarely
> changed but often accessed.  There should never be a case where the
> searched for key will not exist.  What would you use, a HashMap or a
> TreeMap?
>
I had an additional reason for thinking that an ordered storage method 
would help us: compression. 

We had, for the day, a very large data warehouse in which many of the 
keys were relatively large. It was storing phone network-related data 
and, for instance, it turned out that things like IMEIs and dialed 
numbers are very much larger than a 32 bit soft database key (sdbk), so 
we got a worthwhile compression merely by holding a single copy of, say, 
a phone number as a dimension table replacing it by the sdbk in the call-
related data. Rinse, wash and repeat for all the dimension tables in that 
database. This gave as good a compression ratio as the zip algorithm and 
without the compress/decompress cpu overhead, but it meant that we needed 
to do the sdbk substitutions before we stored the records. At this point 
we found a problem: theoretically the datawarehouse could do the lookups 
but in practise it hit a brick wall at 350 lookups/second and we needed 
at least 15,000/sec to handle the projected daily input volume. 

Hence the giant lookup table: the last time I saw it there were around 
300M phone numbers in the tree. We didn't have the memory to touch that 
with only one number per node but fortunately a late night brain storming 
session came up with a cunning plot. We discovered that, although actual 
the phone number space is sparsely populated, the least significant two 
digits are quite densely grouped into clusters, so after digging 
statistics out of a sample of 500,000 numbers, we constructed the key 
from the phone number less the last two digits and stored these as a 
bitmap. This is why we needed an ordered map: its job was to ensure that 
the numbers in the cluster all hit the same node. The lookup procedure 
was: 
- take off the last two digits
- use the prefix as the B-tree index
- test the bit indexed by the last two digits.
- if unset, there is no match

Please excuse the use of xcess bandwidth, but hopefully somebody may find 
this type of approach helpful.
 

-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |

[toc] | [prev] | [next] | [standalone]


#19949

FromArne Vajhøj <arne@vajhoej.dk>
Date2012-11-25 14:20 -0500
Message-ID<50b26f67$0$295$14726298@news.sunsite.dk>
In reply to#19937
On 11/25/2012 9:02 AM, Eric Sosman wrote:
>      Finally, O() itself hides a lot of detail -- that is, after all,
> its purpose.  Given the choice between O(N*N) Algorithm X and O(1)
> Algorithm Y, which would you choose?  Might you change your mind
> if you learned that X takes N*N nanoseconds while Y takes a constant
> eighteen hours?  Might you change your mind yet again if you learned
> that N was ten million?  O() alone is not enough to decide a question
> of speed.

Obviously true.

But I would consider it very rare that a worse big-O
algorithm/data-structure is faster in a case where it has
a measurable impact on overall performance.

>      Still and all: HashMap makes a good "default" choice, and is
> "likely" to outperform TreeMap over a "reasonable range" of inputs.
> Turn to TreeMap when the order of the keys is significant -- for
> example, when  you need "closest match" for an absent key.

Yep.

Arne


[toc] | [prev] | [next] | [standalone]


#19953

FromEric Sosman <esosman@comcast-dot-net.invalid>
Date2012-11-25 14:44 -0500
Message-ID<k8tsfj$nn5$1@dont-email.me>
In reply to#19949
On 11/25/2012 2:20 PM, Arne Vajhøj wrote:
> On 11/25/2012 9:02 AM, Eric Sosman wrote:
>>      Finally, O() itself hides a lot of detail -- that is, after all,
>> its purpose.  Given the choice between O(N*N) Algorithm X and O(1)
>> Algorithm Y, which would you choose?  Might you change your mind
>> if you learned that X takes N*N nanoseconds while Y takes a constant
>> eighteen hours?  Might you change your mind yet again if you learned
>> that N was ten million?  O() alone is not enough to decide a question
>> of speed.
>
> Obviously true.
>
> But I would consider it very rare that a worse big-O
> algorithm/data-structure is faster in a case where it has
> a measurable impact on overall performance.

     That's why Quicksort implementations never use an O(N*N)
InsertionSort for small subfiles.

     Oh -- Hey, wait ...

-- 
Eric Sosman
esosman@comcast-dot-net.invalid

[toc] | [prev] | [next] | [standalone]


Page 1 of 2  [1] 2  Next page →

Back to top | Article view | comp.lang.java.programmer


csiph-web