Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!news.wtal.de!news.tal.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: v_borchert@despammed.com (Volker Borchert)
Newsgroups: comp.lang.java.programmer
Subject: Re: optimsed HashMap
Date: 24 Nov 2012 08:05:55 GMT
Organization: Private site at Eddersheim, Germany
Lines: 75
Distribution: world
Message-ID: <k8pv53$7gl$1@Gaia.teknon.de>
References: <8i70b8d0pm6ibk03ti4t2pv60jd0bctlcs@4ax.com>
X-Trace: individual.net LQ2g0PmO24uHBQMlyEChGg419QJ+z8AB3PS1afNW8iuNpdyEAK4LzBnNBn/IYSJ5we
Cancel-Lock: sha1:DLBs90sdUJU/Nq6D9c6QDawVZDs=
Xref: csiph.com comp.lang.java.programmer:19886

Roedy Green wrote:
> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.
> 
> I have to scan an entire website looking up every word.

Unless said web site is www.archive.org, just create a HashMap
and use a Collections.unmodifiableMap() wrapper for lookup. I'd
expect memory to become a problem way earlier than processor.

Depending on how often each String is looked up, and how long the
Strings are, a data structure that does not need String.hashCode()
might be faster.

I'd consider a tree along the following lines:

Characters not in [a-zA-Z] are expressed as `nnnn - the ASCII value of
'`' is 'a' - 1. Characters in [A-Z] are lowercased. Nodes contain a
boolean and a Node[27]. The tree contains a word if there is a path
from the root to a Node whose boolean is true and that is reached via
Node[String.charAt(l)] at level l.

For example, a tree containing exactly
	"a", "at", "and", "no", "nil", "not"
would be

N0 = { false, [ null, N1, 12*null, N2, 12*null ] }        // ""
N1 = { true,  [ null, 13*null, N3, 5*null, N4, 6*null ] } // "a"
N2 = { false, [ null, 8*null, N5, 5*null, N6, 11*null ] } // "n"
N3 = { false, [ null, 13*null, N7, 12*null ] }            // "an"
N4 = { true,  [ null, 26*null ] }                         // "at"
N5 = { false, [ null, 11*null, N8, 14*null ] }            // "ni"
N6 = { true,  [ null, 19*null, N9, 6*null ] }             // "no"
N7 = { true,  [ null, 26*null ] }                         // "and"
N8 = { true,  [ null, 26*null ] }                         // "nil"
N9 = { true,  [ null, 26*null ] }                         // "not"

For a Map, replace the boolean with a field holding the value.

Possible improvements:

- Since we have three (32bit VM) or seven (64bit VM) padding bytes
  in each Node, we could use two of them to hold a short indicating
  the maximum length of Strings in the subtree.

- Instantiate Node[] only if actually needed to hold a child, and only
  large enough to hold all currently known children. Use one of the
  padding bytes to hold an index offset, such that a Node with only
  a single child only holds a Node[1]:
  N4 = { true, 0, null }
  N6 = { true, 20, [ N9 ] }

- Use an interface and specialized concrete classes for
  - leaf node (no fields at all, because boolean will always be true)
  - single child node { boolean, Node }
    or even "true" and "false" single child nodes
  - ...
  but this will considerable increase preprocessing cost.

- If many of your words contain non-[a-z] characters, provide array
  positions for the most common of these, or use a linear probing
  hash table for the non-[a-z] children.

- If the tree is very sparse, doing so even for the [a-z] children
  might be better in the long run because more of it fits into caches.

This surely has been invented before and given a nice name...
Apache's or Google's collection libraries might even feature
high quality implementations.

-- 

"I'm a doctor, not a mechanic." Dr Leonard McCoy <mccoy@ncc1701.starfleet.fed>
"I'm a mechanic, not a doctor." Volker Borchert  <v_borchert@despammed.com>