Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!eternal-september.org!feeder.eternal-september.org!mx04.eternal-september.org!.POSTED!not-for-mail
From: Eric Sosman <esosman@comcast-dot-net.invalid>
Newsgroups: comp.lang.java.programmer
Subject: Re: SortedMap question?
Date: Sun, 25 Nov 2012 09:02:29 -0500
Organization: A noiseless patient Spider
Lines: 54
Message-ID: <k8t8dp$tv5$1@dont-email.me>
References: <k8rkud$i3o$1@dont-email.me> <q4qoyde15lpk$.13amx8nbfrj28.dlg@40tude.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 25 Nov 2012 14:02:33 +0000 (UTC)
Injection-Info: mx04.eternal-september.org; posting-host="ffb8f7085759b339c1002252b48331a4"; logging-data="30693"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+PBuAC01XpxLXeRp15dvvw"
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/17.0 Thunderbird/17.0
In-Reply-To: <q4qoyde15lpk$.13amx8nbfrj28.dlg@40tude.net>
Cancel-Lock: sha1:Rrn/Tq8r6yVsu5skbQZ7wpEImuY=
Xref: csiph.com comp.lang.java.programmer:19937

On 11/24/2012 6:57 PM, Joerg Meier wrote:
> On Sat, 24 Nov 2012 15:23:55 -0800, Knute Johnson wrote:
>
>> I have been operating under a mis-perception that a SortedMap would be
>> more quickly searched than an unsorted Map.  More specifically that the
>> cost would come when adding an element to the SortedMap.  I wrote a
>> little test program to measure the amount of time it took to look for
>> non-existent data in a Map and a SortedMap and it took somewhere between
>> twice and three times as long to look for the data in the SortedMap.
>
> Considering Map is an Interface, I question your claim that you measured
> its get(x) performance :p
>
> Assuming you used HashMap, as far as I recall, its get(x) performance is
> already at O(1) unless you mess up your hashcodes. Doesn't really go lower.

     "HashMap is O(1)" is sort of true, but sweeps a lot of dirt
under the rug.  Most obviously, there's the possibility of a bad
or unlucky hash distribution -- by no means a remote possibility,
as <http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/>
demonstrates.  Even if the hash distribution is acceptably uniform,
there's also the matter of building the map in the first place: As
the insertions accumulate, HashMap may need to expand and re-insert
everything it already contains.  With N=n1+n2+...+nk mappings, there
could be n1+(n1+n2)+...+(n1+n2+...+nk) = k*n1+(k-1)*n2+...+1*nk
insertions altogether (a good up-front estimate of N can help here).

     Consider also the operation cost.  Searching in a TreeMap of
N mappings takes about lg(N) comparisons -- but with String keys
the comparisons will be of varying difficulties.  Comparisons near
the root will probably examine only the first few characters to
determine the result; only as the search approaches the leaves will
the Strings' tail characters come into play.  On a successful search
HashMap must iterate over the key String's entire length twice: Once
to compute the hashCode(), and again to verify that a match has in
fact been found (comparisons against non-matching keys in the same
bucket will be quick, like those near the root of a TreeMap).

     Finally, O() itself hides a lot of detail -- that is, after all,
its purpose.  Given the choice between O(N*N) Algorithm X and O(1)
Algorithm Y, which would you choose?  Might you change your mind
if you learned that X takes N*N nanoseconds while Y takes a constant
eighteen hours?  Might you change your mind yet again if you learned
that N was ten million?  O() alone is not enough to decide a question
of speed.

     Still and all: HashMap makes a good "default" choice, and is
"likely" to outperform TreeMap over a "reasonable range" of inputs.
Turn to TreeMap when the order of the keys is significant -- for
example, when  you need "closest match" for an absent key.

-- 
Eric Sosman
esosman@comcast-dot-net.invalid