Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!usenet.ukfsn.org!not-for-mail
From: Martin Gregorie <martin@address-in-sig.invalid>
Newsgroups: comp.lang.java.programmer
Subject: Re: SortedMap question?
Date: Mon, 26 Nov 2012 02:20:38 +0000 (UTC)
Organization: UK Free Software Network
Lines: 49
Message-ID: <k8ujlm$enu$1@localhost.localdomain>
References: <k8rkud$i3o$1@dont-email.me> <q4qoyde15lpk$.13amx8nbfrj28.dlg@40tude.net> <k8t8dp$tv5$1@dont-email.me> <k8tljj$6jr$1@localhost.localdomain> <k8ucsh$o2v$1@dont-email.me>
NNTP-Posting-Host: 84.45.235.129
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: localhost.localdomain 1353896438 15102 84.45.235.129 (26 Nov 2012 02:20:38 GMT)
X-Complaints-To: usenet@localhost.localdomain
NNTP-Posting-Date: Mon, 26 Nov 2012 02:20:38 +0000 (UTC)
User-Agent: Pan/0.139 (Sexual Chocolate; GIT bf56508 git://git.gnome.org/pan2)
Xref: csiph.com comp.lang.java.programmer:19973

On Sun, 25 Nov 2012 16:25:02 -0800, Knute Johnson wrote:

> So for the actual case that I am working on, I have a collection of
> about 5000 elements and am using an Integer as the key.  Data is rarely
> changed but often accessed.  There should never be a case where the
> searched for key will not exist.  What would you use, a HashMap or a
> TreeMap?
>
I had an additional reason for thinking that an ordered storage method 
would help us: compression. 

We had, for the day, a very large data warehouse in which many of the 
keys were relatively large. It was storing phone network-related data 
and, for instance, it turned out that things like IMEIs and dialed 
numbers are very much larger than a 32 bit soft database key (sdbk), so 
we got a worthwhile compression merely by holding a single copy of, say, 
a phone number as a dimension table replacing it by the sdbk in the call-
related data. Rinse, wash and repeat for all the dimension tables in that 
database. This gave as good a compression ratio as the zip algorithm and 
without the compress/decompress cpu overhead, but it meant that we needed 
to do the sdbk substitutions before we stored the records. At this point 
we found a problem: theoretically the datawarehouse could do the lookups 
but in practise it hit a brick wall at 350 lookups/second and we needed 
at least 15,000/sec to handle the projected daily input volume. 

Hence the giant lookup table: the last time I saw it there were around 
300M phone numbers in the tree. We didn't have the memory to touch that 
with only one number per node but fortunately a late night brain storming 
session came up with a cunning plot. We discovered that, although actual 
the phone number space is sparsely populated, the least significant two 
digits are quite densely grouped into clusters, so after digging 
statistics out of a sample of 500,000 numbers, we constructed the key 
from the phone number less the last two digits and stored these as a 
bitmap. This is why we needed an ordered map: its job was to ensure that 
the numbers in the cluster all hit the same node. The lookup procedure 
was: 
- take off the last two digits
- use the prefix as the B-tree index
- test the bit indexed by the last two digits.
- if unset, there is no match

Please excuse the use of xcess bandwidth, but hopefully somebody may find 
this type of approach helpful.
 

-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |