Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!usenet.ukfsn.org!not-for-mail From: Martin Gregorie Newsgroups: comp.lang.java.programmer Subject: Re: SortedMap question? Date: Mon, 26 Nov 2012 02:20:38 +0000 (UTC) Organization: UK Free Software Network Lines: 49 Message-ID: References: NNTP-Posting-Host: 84.45.235.129 Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Trace: localhost.localdomain 1353896438 15102 84.45.235.129 (26 Nov 2012 02:20:38 GMT) X-Complaints-To: usenet@localhost.localdomain NNTP-Posting-Date: Mon, 26 Nov 2012 02:20:38 +0000 (UTC) User-Agent: Pan/0.139 (Sexual Chocolate; GIT bf56508 git://git.gnome.org/pan2) Xref: csiph.com comp.lang.java.programmer:19973 On Sun, 25 Nov 2012 16:25:02 -0800, Knute Johnson wrote: > So for the actual case that I am working on, I have a collection of > about 5000 elements and am using an Integer as the key. Data is rarely > changed but often accessed. There should never be a case where the > searched for key will not exist. What would you use, a HashMap or a > TreeMap? > I had an additional reason for thinking that an ordered storage method would help us: compression. We had, for the day, a very large data warehouse in which many of the keys were relatively large. It was storing phone network-related data and, for instance, it turned out that things like IMEIs and dialed numbers are very much larger than a 32 bit soft database key (sdbk), so we got a worthwhile compression merely by holding a single copy of, say, a phone number as a dimension table replacing it by the sdbk in the call- related data. Rinse, wash and repeat for all the dimension tables in that database. This gave as good a compression ratio as the zip algorithm and without the compress/decompress cpu overhead, but it meant that we needed to do the sdbk substitutions before we stored the records. At this point we found a problem: theoretically the datawarehouse could do the lookups but in practise it hit a brick wall at 350 lookups/second and we needed at least 15,000/sec to handle the projected daily input volume. Hence the giant lookup table: the last time I saw it there were around 300M phone numbers in the tree. We didn't have the memory to touch that with only one number per node but fortunately a late night brain storming session came up with a cunning plot. We discovered that, although actual the phone number space is sparsely populated, the least significant two digits are quite densely grouped into clusters, so after digging statistics out of a sample of 500,000 numbers, we constructed the key from the phone number less the last two digits and stored these as a bitmap. This is why we needed an ordered map: its job was to ensure that the numbers in the cluster all hit the same node. The lookup procedure was: - take off the last two digits - use the prefix as the B-tree index - test the bit indexed by the last two digits. - if unset, there is no match Please excuse the use of xcess bandwidth, but hopefully somebody may find this type of approach helpful. -- martin@ | Martin Gregorie gregorie. | Essex, UK org |