Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.java.programmer > #19910 > unrolled thread
| Started by | Knute Johnson <nospam@rabbitbrush.frazmtn.com> |
|---|---|
| First post | 2012-11-24 15:23 -0800 |
| Last post | 2012-11-25 14:00 +0000 |
| Articles | 20 on this page of 38 — 13 participants |
Back to article view | Back to comp.lang.java.programmer
SortedMap question? Knute Johnson <nospam@rabbitbrush.frazmtn.com> - 2012-11-24 15:23 -0800
Re: SortedMap question? Joerg Meier <joergmmeier@arcor.de> - 2012-11-25 00:57 +0100
Re: SortedMap question? David Lamb <dalamb@cs.queensu.ca> - 2012-11-25 08:49 -0500
Re: SortedMap question? David Lamb <dalamb@cs.queensu.ca> - 2012-11-25 08:59 -0500
Re: SortedMap question? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-25 14:02 +0000
Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 14:20 -0500
Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 14:24 -0500
Re: SortedMap question? David Lamb <dalamb@cs.queensu.ca> - 2012-11-25 21:25 -0500
Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 22:24 -0500
Re: SortedMap question? Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-11-26 15:45 -0800
Re: SortedMap question? Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-26 20:29 -0500
Re: SortedMap question? Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-25 09:02 -0500
Re: SortedMap question? Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-25 17:47 +0000
Re: SortedMap question? Knute Johnson <nospam@rabbitbrush.frazmtn.com> - 2012-11-25 16:25 -0800
Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 19:36 -0500
Re: SortedMap question? Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-25 22:58 -0500
Re: SortedMap question? Lew <lewbloch@gmail.com> - 2012-11-25 16:53 -0800
Re: SortedMap question? Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-26 02:20 +0000
Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 14:20 -0500
Re: SortedMap question? Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-25 14:44 -0500
Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 16:18 -0500
Re: SortedMap question? Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-25 17:35 -0500
Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 19:58 -0500
Re: SortedMap question? Jim Janney <jjanney@shell.xmission.com> - 2012-11-26 12:45 -0700
Re: SortedMap question? Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-26 14:26 -0600
Re: SortedMap question? Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-25 14:16 -0600
Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 16:21 -0500
Re: SortedMap question? Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-25 17:07 -0600
Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 20:05 -0500
Re: SortedMap question? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-26 01:34 +0000
Re: SortedMap question? Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-25 20:36 -0600
Re: SortedMap question? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-12-15 12:27 +0000
Re: SortedMap question? Robert Klemme <shortcutter@googlemail.com> - 2012-11-25 01:36 +0100
Re: SortedMap question? Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 20:00 -0500
Re: SortedMap question? markspace <-@.> - 2012-11-24 17:34 -0800
Re: SortedMap question? Robert Klemme <shortcutter@googlemail.com> - 2012-11-25 15:28 +0100
Re: SortedMap question? Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-25 17:49 +0000
Re: SortedMap question? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-25 14:00 +0000
Page 1 of 2 [1] 2 Next page →
| From | Knute Johnson <nospam@rabbitbrush.frazmtn.com> |
|---|---|
| Date | 2012-11-24 15:23 -0800 |
| Subject | SortedMap question? |
| Message-ID | <k8rkud$i3o$1@dont-email.me> |
I have been operating under a mis-perception that a SortedMap would be more quickly searched than an unsorted Map. More specifically that the cost would come when adding an element to the SortedMap. I wrote a little test program to measure the amount of time it took to look for non-existent data in a Map and a SortedMap and it took somewhere between twice and three times as long to look for the data in the SortedMap. So am I know correct in thinking that the only real advantage of a SortedMap is if you wish to iterate over the keys of the map in some order? Thanks, knute...
[toc] | [next] | [standalone]
| From | Joerg Meier <joergmmeier@arcor.de> |
|---|---|
| Date | 2012-11-25 00:57 +0100 |
| Message-ID | <q4qoyde15lpk$.13amx8nbfrj28.dlg@40tude.net> |
| In reply to | #19910 |
On Sat, 24 Nov 2012 15:23:55 -0800, Knute Johnson wrote: > I have been operating under a mis-perception that a SortedMap would be > more quickly searched than an unsorted Map. More specifically that the > cost would come when adding an element to the SortedMap. I wrote a > little test program to measure the amount of time it took to look for > non-existent data in a Map and a SortedMap and it took somewhere between > twice and three times as long to look for the data in the SortedMap. Considering Map is an Interface, I question your claim that you measured its get(x) performance :p Assuming you used HashMap, as far as I recall, its get(x) performance is already at O(1) unless you mess up your hashcodes. Doesn't really go lower. Liebe Gruesse, Joerg -- Ich lese meine Emails nicht, replies to Email bleiben also leider ungelesen.
[toc] | [prev] | [next] | [standalone]
| From | David Lamb <dalamb@cs.queensu.ca> |
|---|---|
| Date | 2012-11-25 08:49 -0500 |
| Message-ID | <k8t7l5$l22$2@dont-email.me> |
| In reply to | #19911 |
On 24/11/2012 6:57 PM, Joerg Meier wrote: > On Sat, 24 Nov 2012 15:23:55 -0800, Knute Johnson wrote: >> I wrote a >> little test program to measure the amount of time it took to look for >> non-existent data in a Map and a SortedMap and it took somewhere between >> twice and three times as long to look for the data in the SortedMap. > > Considering Map is an Interface, I question your claim that you measured > its get(x) performance :p > > Assuming you used HashMap, as far as I recall, its get(x) performance is > already at O(1) unless you mess up your hashcodes. Doesn't really go lower. The constant factors can matter a lot. 100 > 1 even though both are O(1).
[toc] | [prev] | [next] | [standalone]
| From | David Lamb <dalamb@cs.queensu.ca> |
|---|---|
| Date | 2012-11-25 08:59 -0500 |
| Message-ID | <k8t87f$rsq$2@dont-email.me> |
| In reply to | #19933 |
On 25/11/2012 8:49 AM, David Lamb wrote: > On 24/11/2012 6:57 PM, Joerg Meier wrote: >> Assuming you used HashMap, as far as I recall, its get(x) performance is >> already at O(1) unless you mess up your hashcodes. Doesn't really go >> lower. > > The constant factors can matter a lot. 100 > 1 even though both are O(1). Which, sigh, of course they're not, as Arne pointed out. O(logN) for SortedMap, as I should have immediately picked up on from the name alone.
[toc] | [prev] | [next] | [standalone]
| From | "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> |
|---|---|
| Date | 2012-11-25 14:02 +0000 |
| Message-ID | <xL2dnTzi4dJnuS_NnZ2dnUVZ7r2dnZ2d@bt.com> |
| In reply to | #19935 |
David Lamb wrote:
> > The constant factors can matter a lot. 100 > 1 even though both are
> > O(1).
>
> Which, sigh, of course they're not, as Arne pointed out. O(logN) for
> SortedMap, as I should have immediately picked up on from the name alone.
Still a good point though ;-)
-- chris
[toc] | [prev] | [next] | [standalone]
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-25 14:20 -0500 |
| Message-ID | <50b26f8d$0$295$14726298@news.sunsite.dk> |
| In reply to | #19938 |
On 11/25/2012 9:02 AM, Chris Uppal wrote: > David Lamb wrote: > >>> The constant factors can matter a lot. 100 > 1 even though both are >>> O(1). >> >> Which, sigh, of course they're not, as Arne pointed out. O(logN) for >> SortedMap, as I should have immediately picked up on from the name alone. > > Still a good point though ;-) :-) Arne
[toc] | [prev] | [next] | [standalone]
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-25 14:24 -0500 |
| Message-ID | <50b2705c$0$295$14726298@news.sunsite.dk> |
| In reply to | #19935 |
On 11/25/2012 8:59 AM, David Lamb wrote: > On 25/11/2012 8:49 AM, David Lamb wrote: >> On 24/11/2012 6:57 PM, Joerg Meier wrote: >>> Assuming you used HashMap, as far as I recall, its get(x) performance is >>> already at O(1) unless you mess up your hashcodes. Doesn't really go >>> lower. >> >> The constant factors can matter a lot. 100 > 1 even though both are O(1). > > Which, sigh, of course they're not, as Arne pointed out. O(logN) for > SortedMap, as I should have immediately picked up on from the name alone. SortedMap is still just an interface. Arne
[toc] | [prev] | [next] | [standalone]
| From | David Lamb <dalamb@cs.queensu.ca> |
|---|---|
| Date | 2012-11-25 21:25 -0500 |
| Message-ID | <k8ujuj$mu4$2@dont-email.me> |
| In reply to | #19951 |
On 25/11/2012 2:24 PM, Arne Vajhøj wrote: > On 11/25/2012 8:59 AM, David Lamb wrote: >> Which, sigh, of course they're not, as Arne pointed out. O(logN) for >> SortedMap, as I should have immediately picked up on from the name alone. > > SortedMap is still just an interface. Maybe so, but I suspect programmers would be unhappy if an implementation didn't actually sort anything. I suppose there might be special cases; sortedMap from small integers to anything could be O(1) for get()
[toc] | [prev] | [next] | [standalone]
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-25 22:24 -0500 |
| Message-ID | <50b2e108$0$285$14726298@news.sunsite.dk> |
| In reply to | #19974 |
On 11/25/2012 9:25 PM, David Lamb wrote: > On 25/11/2012 2:24 PM, Arne Vajhøj wrote: >> On 11/25/2012 8:59 AM, David Lamb wrote: >>> Which, sigh, of course they're not, as Arne pointed out. O(logN) for >>> SortedMap, as I should have immediately picked up on from the name >>> alone. >> >> SortedMap is still just an interface. > > Maybe so, but I suspect programmers would be unhappy if an > implementation didn't actually sort anything. I suppose there might be > special cases; sortedMap from small integers to anything could be O(1) > for get() Better is difficult for the general case. Worse is easy. Arne
[toc] | [prev] | [next] | [standalone]
| From | Daniel Pitts <newsgroup.nospam@virtualinfinity.net> |
|---|---|
| Date | 2012-11-26 15:45 -0800 |
| Message-ID | <yaTss.19810$W21.8324@newsfe27.iad> |
| In reply to | #19974 |
On 11/25/12 6:25 PM, David Lamb wrote: > On 25/11/2012 2:24 PM, Arne Vajhøj wrote: >> On 11/25/2012 8:59 AM, David Lamb wrote: >>> Which, sigh, of course they're not, as Arne pointed out. O(logN) for >>> SortedMap, as I should have immediately picked up on from the name >>> alone. >> >> SortedMap is still just an interface. > > Maybe so, but I suspect programmers would be unhappy if an > implementation didn't actually sort anything. I suppose there might be > special cases; sortedMap from small integers to anything could be O(1) > for get() > That special case happens to be called BitSet, which doesn't implement SortedMap :-)
[toc] | [prev] | [next] | [standalone]
| From | Eric Sosman <esosman@comcast-dot-net.invalid> |
|---|---|
| Date | 2012-11-26 20:29 -0500 |
| Message-ID | <k9152f$kl4$2@dont-email.me> |
| In reply to | #19989 |
On 11/26/2012 6:45 PM, Daniel Pitts wrote:
> On 11/25/12 6:25 PM, David Lamb wrote:
>> On 25/11/2012 2:24 PM, Arne Vajhøj wrote:
>>> On 11/25/2012 8:59 AM, David Lamb wrote:
>>>> Which, sigh, of course they're not, as Arne pointed out. O(logN) for
>>>> SortedMap, as I should have immediately picked up on from the name
>>>> alone.
>>>
>>> SortedMap is still just an interface.
>>
>> Maybe so, but I suspect programmers would be unhappy if an
>> implementation didn't actually sort anything. I suppose there might be
>> special cases; sortedMap from small integers to anything could be O(1)
>> for get()
>>
> That special case happens to be called BitSet, which doesn't implement
> SortedMap :-)
Another name for that special case is Value[] -- which
doesn't actually implement SortedMap, but does all that's needed.
--
Eric Sosman
esosman@comcast-dot-net.invalid
[toc] | [prev] | [next] | [standalone]
| From | Eric Sosman <esosman@comcast-dot-net.invalid> |
|---|---|
| Date | 2012-11-25 09:02 -0500 |
| Message-ID | <k8t8dp$tv5$1@dont-email.me> |
| In reply to | #19911 |
On 11/24/2012 6:57 PM, Joerg Meier wrote:
> On Sat, 24 Nov 2012 15:23:55 -0800, Knute Johnson wrote:
>
>> I have been operating under a mis-perception that a SortedMap would be
>> more quickly searched than an unsorted Map. More specifically that the
>> cost would come when adding an element to the SortedMap. I wrote a
>> little test program to measure the amount of time it took to look for
>> non-existent data in a Map and a SortedMap and it took somewhere between
>> twice and three times as long to look for the data in the SortedMap.
>
> Considering Map is an Interface, I question your claim that you measured
> its get(x) performance :p
>
> Assuming you used HashMap, as far as I recall, its get(x) performance is
> already at O(1) unless you mess up your hashcodes. Doesn't really go lower.
"HashMap is O(1)" is sort of true, but sweeps a lot of dirt
under the rug. Most obviously, there's the possibility of a bad
or unlucky hash distribution -- by no means a remote possibility,
as <http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/>
demonstrates. Even if the hash distribution is acceptably uniform,
there's also the matter of building the map in the first place: As
the insertions accumulate, HashMap may need to expand and re-insert
everything it already contains. With N=n1+n2+...+nk mappings, there
could be n1+(n1+n2)+...+(n1+n2+...+nk) = k*n1+(k-1)*n2+...+1*nk
insertions altogether (a good up-front estimate of N can help here).
Consider also the operation cost. Searching in a TreeMap of
N mappings takes about lg(N) comparisons -- but with String keys
the comparisons will be of varying difficulties. Comparisons near
the root will probably examine only the first few characters to
determine the result; only as the search approaches the leaves will
the Strings' tail characters come into play. On a successful search
HashMap must iterate over the key String's entire length twice: Once
to compute the hashCode(), and again to verify that a match has in
fact been found (comparisons against non-matching keys in the same
bucket will be quick, like those near the root of a TreeMap).
Finally, O() itself hides a lot of detail -- that is, after all,
its purpose. Given the choice between O(N*N) Algorithm X and O(1)
Algorithm Y, which would you choose? Might you change your mind
if you learned that X takes N*N nanoseconds while Y takes a constant
eighteen hours? Might you change your mind yet again if you learned
that N was ten million? O() alone is not enough to decide a question
of speed.
Still and all: HashMap makes a good "default" choice, and is
"likely" to outperform TreeMap over a "reasonable range" of inputs.
Turn to TreeMap when the order of the keys is significant -- for
example, when you need "closest match" for an absent key.
--
Eric Sosman
esosman@comcast-dot-net.invalid
[toc] | [prev] | [next] | [standalone]
| From | Martin Gregorie <martin@address-in-sig.invalid> |
|---|---|
| Date | 2012-11-25 17:47 +0000 |
| Message-ID | <k8tljj$6jr$1@localhost.localdomain> |
| In reply to | #19937 |
On Sun, 25 Nov 2012 09:02:29 -0500, Eric Sosman wrote: > > Still and all: HashMap makes a good "default" choice, and is > "likely" to outperform TreeMap over a "reasonable range" of inputs. Turn > to TreeMap when the order of the keys is significant -- for example, > when you need "closest match" for an absent key. > Or, if the key comparison is cheap, i.e. short strings or integer comparisons, and the key population is very large TreeMap can become attractive: with 8 million nodes you only need 23 comparisions to find a miss. Growing the tree isn't too expensive either: rebalancing a balanced red-black tree, which is necessary from time to time as it grows, is only a matter of adjusting pointers. There is one possible demerit that I found in what was the standard C implementation a while back: adding a node was very expensive because it required three mallocs: there were separate bits of heap storage for the pointers (left, right and parent), the key storage and the data storage. On the box we were running at the time (a Dec Alphaserver running DEC Unix) the best I could get in a situation where each key was looked up and then added if it wasn't in the tree was around 700 lookups a second on a small dev box: not nearly enough. First I threw out the library code, combined pointers, key and data into one struct and rewrote the red- black tree handler using Sedgewick. This pushed the rate up to 1500 and proved that malloc was the bottleneck, but it was still too slow for projected data volumes, so I moved memory management into user land by only mallocing for several MB at a time and allocating bits of these large chunks to the nodes. This rocked: the lookup rate immediately jumped to about 35,000 a second and, as a bonus, because all my pointers were relative to the big memory chunks and their places in a chunk table, it was possible to bring the live system's start-up time down to 2-3 minutes by saving and reloading the chunks as files: rebuilding the red- black tree from a database had been taking 40 minutes. Yes, I know: all this is C and the JVM should be much better at heap management because 'top' shows that it only grabs and releases memory in large chunks which minimises the malloc/sbrk overheads. However, it does show that serious runtime optimisation may require you to know what's under the hood of library code and to realise that this code may have sacrificed speed on the altar of flexibility. -- martin@ | Martin Gregorie gregorie. | Essex, UK org |
[toc] | [prev] | [next] | [standalone]
| From | Knute Johnson <nospam@rabbitbrush.frazmtn.com> |
|---|---|
| Date | 2012-11-25 16:25 -0800 |
| Message-ID | <k8ucsh$o2v$1@dont-email.me> |
| In reply to | #19944 |
On 11/25/2012 09:47, Martin Gregorie wrote: > Or, if the key comparison is cheap, i.e. short strings or integer > comparisons, and the key population is very large TreeMap can become > attractive: with 8 million nodes you only need 23 comparisions to find a > miss. Growing the tree isn't too expensive either: rebalancing a balanced > red-black tree, which is necessary from time to time as it grows, is only > a matter of adjusting pointers. That's what I thought, that the searching for a match would be very quick once the data was in the TreeMap. The test code I wrote may not have been any good. I created a map with Strings for the keys and values. I could get about 2 million entries before I started running into virtual memory issues slowing things down. I searched for non-existent elements. Using both a HashMap and a TreeMap, the TreeMap was significantly slower than the HashMap. I even tried a ConcurrentHashMap and multiple threads to do the search to see if that would speed things up. While that technique was better than TreeMap it was still slower than a plain HashMap. So for the actual case that I am working on, I have a collection of about 5000 elements and am using an Integer as the key. Data is rarely changed but often accessed. There should never be a case where the searched for key will not exist. What would you use, a HashMap or a TreeMap? Thanks, knute...
[toc] | [prev] | [next] | [standalone]
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-25 19:36 -0500 |
| Message-ID | <50b2b971$0$294$14726298@news.sunsite.dk> |
| In reply to | #19965 |
On 11/25/2012 7:25 PM, Knute Johnson wrote: > On 11/25/2012 09:47, Martin Gregorie wrote: >> Or, if the key comparison is cheap, i.e. short strings or integer >> comparisons, and the key population is very large TreeMap can become >> attractive: with 8 million nodes you only need 23 comparisions to find a >> miss. Growing the tree isn't too expensive either: rebalancing a balanced >> red-black tree, which is necessary from time to time as it grows, is only >> a matter of adjusting pointers. > > That's what I thought, that the searching for a match would be very > quick once the data was in the TreeMap. The test code I wrote may not > have been any good. I created a map with Strings for the keys and > values. I could get about 2 million entries before I started running > into virtual memory issues slowing things down. I searched for > non-existent elements. Using both a HashMap and a TreeMap, the TreeMap > was significantly slower than the HashMap. I even tried a > ConcurrentHashMap and multiple threads to do the search to see if that > would speed things up. While that technique was better than TreeMap it > was still slower than a plain HashMap. > > So for the actual case that I am working on, I have a collection of > about 5000 elements and am using an Integer as the key. Data is rarely > changed but often accessed. There should never be a case where the > searched for key will not exist. What would you use, a HashMap or a > TreeMap? If it is important then you should measure! But I would still expect HashMap to be faster than TreeMap. Arne
[toc] | [prev] | [next] | [standalone]
| From | Eric Sosman <esosman@comcast-dot-net.invalid> |
|---|---|
| Date | 2012-11-25 22:58 -0500 |
| Message-ID | <k8updu$f62$1@dont-email.me> |
| In reply to | #19966 |
On 11/25/2012 7:36 PM, Arne Vajhøj wrote:
> On 11/25/2012 7:25 PM, Knute Johnson wrote:
>> On 11/25/2012 09:47, Martin Gregorie wrote:
>>> Or, if the key comparison is cheap, i.e. short strings or integer
>>> comparisons, and the key population is very large TreeMap can become
>>> attractive: with 8 million nodes you only need 23 comparisions to find a
>>> miss. Growing the tree isn't too expensive either: rebalancing a
>>> balanced
>>> red-black tree, which is necessary from time to time as it grows, is
>>> only
>>> a matter of adjusting pointers.
>>
>> That's what I thought, that the searching for a match would be very
>> quick once the data was in the TreeMap. The test code I wrote may not
>> have been any good. I created a map with Strings for the keys and
>> values. I could get about 2 million entries before I started running
>> into virtual memory issues slowing things down. I searched for
>> non-existent elements. Using both a HashMap and a TreeMap, the TreeMap
>> was significantly slower than the HashMap. I even tried a
>> ConcurrentHashMap and multiple threads to do the search to see if that
>> would speed things up. While that technique was better than TreeMap it
>> was still slower than a plain HashMap.
>>
>> So for the actual case that I am working on, I have a collection of
>> about 5000 elements and am using an Integer as the key. Data is rarely
>> changed but often accessed. There should never be a case where the
>> searched for key will not exist. What would you use, a HashMap or a
>> TreeMap?
>
> If it is important then you should measure!
Aaay-men!
> But I would still expect HashMap to be faster than TreeMap.
Yes. A HashMap with 5000 keys and the default 0.75 load
factor will want 5000/0.75 ~= 6667 buckets, which it will round
up to 8192. With normal luck the 5000 keys will land in ~3743
buckets, about two-thirds of which will hold just one key. So
a typical search involves a few arithmetic operations to locate
the bucket, then about 5000/3743 ~= 1.3 Integer#equals() calls.
Searching a TreeMap of 5000 keys, on the other hand, will
use about lg(5000) ~= 11.3 Integer#compareTo() calls. Uses of
equals() and compareTo() have different costs (equals() needs to
work with Objects and check their classes before comparing, but
compareTo() needs to produce a three-way answer rather than a
simpler boolean ...), but any differences are almost certainly
swamped by the disparity in call counts.
If the key values are not too widely scattered, other
possibilities exist. A BitSet can answer "present or absent"
queries very quickly and without taking much space, or a simple
array can serve as a Map stand-in. But follow Arne's advice
when making your choice: Measure It!
--
Eric Sosman
esosman@comcast-dot-net.invalid
[toc] | [prev] | [next] | [standalone]
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2012-11-25 16:53 -0800 |
| Message-ID | <858d42ec-16f0-453f-971e-d3d7d0de3bdf@googlegroups.com> |
| In reply to | #19965 |
Knute Johnson wrote: > That's what I thought, that the searching for a match would be very > quick once the data was in the TreeMap. The test code I wrote may not > have been any good. I created a map with Strings for the keys and > values. I could get about 2 million entries before I started running > into virtual memory issues slowing things down. I searched for > non-existent elements. Using both a HashMap and a TreeMap, the TreeMap > was significantly slower than the HashMap. I even tried a > ConcurrentHashMap and multiple threads to do the search to see if that > would speed things up. While that technique was better than TreeMap it > was still slower than a plain HashMap. > > So for the actual case that I am working on, I have a collection of > about 5000 elements and am using an Integer as the key. Data is rarely > changed but often accessed. There should never be a case where the > searched for key will not exist. What would you use, a HashMap or a > TreeMap? That would depend entirely on whether I needed to iterate the map in sorted order. There is little reason to prefer 'TreeMap' absent a sorting requirement except, as pointed out upthread, if you need guaranteed O(log n) performance for gets rather than the probabilistic promise of hashing. Such scenarios would require specialized knowledge of the data sets to be mapped. Kudos for your efforts to obtain objective performance evidence. -- Lew
[toc] | [prev] | [next] | [standalone]
| From | Martin Gregorie <martin@address-in-sig.invalid> |
|---|---|
| Date | 2012-11-26 02:20 +0000 |
| Message-ID | <k8ujlm$enu$1@localhost.localdomain> |
| In reply to | #19965 |
On Sun, 25 Nov 2012 16:25:02 -0800, Knute Johnson wrote: > So for the actual case that I am working on, I have a collection of > about 5000 elements and am using an Integer as the key. Data is rarely > changed but often accessed. There should never be a case where the > searched for key will not exist. What would you use, a HashMap or a > TreeMap? > I had an additional reason for thinking that an ordered storage method would help us: compression. We had, for the day, a very large data warehouse in which many of the keys were relatively large. It was storing phone network-related data and, for instance, it turned out that things like IMEIs and dialed numbers are very much larger than a 32 bit soft database key (sdbk), so we got a worthwhile compression merely by holding a single copy of, say, a phone number as a dimension table replacing it by the sdbk in the call- related data. Rinse, wash and repeat for all the dimension tables in that database. This gave as good a compression ratio as the zip algorithm and without the compress/decompress cpu overhead, but it meant that we needed to do the sdbk substitutions before we stored the records. At this point we found a problem: theoretically the datawarehouse could do the lookups but in practise it hit a brick wall at 350 lookups/second and we needed at least 15,000/sec to handle the projected daily input volume. Hence the giant lookup table: the last time I saw it there were around 300M phone numbers in the tree. We didn't have the memory to touch that with only one number per node but fortunately a late night brain storming session came up with a cunning plot. We discovered that, although actual the phone number space is sparsely populated, the least significant two digits are quite densely grouped into clusters, so after digging statistics out of a sample of 500,000 numbers, we constructed the key from the phone number less the last two digits and stored these as a bitmap. This is why we needed an ordered map: its job was to ensure that the numbers in the cluster all hit the same node. The lookup procedure was: - take off the last two digits - use the prefix as the B-tree index - test the bit indexed by the last two digits. - if unset, there is no match Please excuse the use of xcess bandwidth, but hopefully somebody may find this type of approach helpful. -- martin@ | Martin Gregorie gregorie. | Essex, UK org |
[toc] | [prev] | [next] | [standalone]
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-25 14:20 -0500 |
| Message-ID | <50b26f67$0$295$14726298@news.sunsite.dk> |
| In reply to | #19937 |
On 11/25/2012 9:02 AM, Eric Sosman wrote: > Finally, O() itself hides a lot of detail -- that is, after all, > its purpose. Given the choice between O(N*N) Algorithm X and O(1) > Algorithm Y, which would you choose? Might you change your mind > if you learned that X takes N*N nanoseconds while Y takes a constant > eighteen hours? Might you change your mind yet again if you learned > that N was ten million? O() alone is not enough to decide a question > of speed. Obviously true. But I would consider it very rare that a worse big-O algorithm/data-structure is faster in a case where it has a measurable impact on overall performance. > Still and all: HashMap makes a good "default" choice, and is > "likely" to outperform TreeMap over a "reasonable range" of inputs. > Turn to TreeMap when the order of the keys is significant -- for > example, when you need "closest match" for an absent key. Yep. Arne
[toc] | [prev] | [next] | [standalone]
| From | Eric Sosman <esosman@comcast-dot-net.invalid> |
|---|---|
| Date | 2012-11-25 14:44 -0500 |
| Message-ID | <k8tsfj$nn5$1@dont-email.me> |
| In reply to | #19949 |
On 11/25/2012 2:20 PM, Arne Vajhøj wrote:
> On 11/25/2012 9:02 AM, Eric Sosman wrote:
>> Finally, O() itself hides a lot of detail -- that is, after all,
>> its purpose. Given the choice between O(N*N) Algorithm X and O(1)
>> Algorithm Y, which would you choose? Might you change your mind
>> if you learned that X takes N*N nanoseconds while Y takes a constant
>> eighteen hours? Might you change your mind yet again if you learned
>> that N was ten million? O() alone is not enough to decide a question
>> of speed.
>
> Obviously true.
>
> But I would consider it very rare that a worse big-O
> algorithm/data-structure is faster in a case where it has
> a measurable impact on overall performance.
That's why Quicksort implementations never use an O(N*N)
InsertionSort for small subfiles.
Oh -- Hey, wait ...
--
Eric Sosman
esosman@comcast-dot-net.invalid
[toc] | [prev] | [next] | [standalone]
Page 1 of 2 [1] 2 Next page →
Back to top | Article view | comp.lang.java.programmer
csiph-web