Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!news.wtal.de!news.tal.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail From: v_borchert@despammed.com (Volker Borchert) Newsgroups: comp.lang.java.programmer Subject: Re: optimsed HashMap Date: 24 Nov 2012 08:05:55 GMT Organization: Private site at Eddersheim, Germany Lines: 75 Distribution: world Message-ID: References: <8i70b8d0pm6ibk03ti4t2pv60jd0bctlcs@4ax.com> X-Trace: individual.net LQ2g0PmO24uHBQMlyEChGg419QJ+z8AB3PS1afNW8iuNpdyEAK4LzBnNBn/IYSJ5we Cancel-Lock: sha1:DLBs90sdUJU/Nq6D9c6QDawVZDs= Xref: csiph.com comp.lang.java.programmer:19886 Roedy Green wrote: > Is there something like HashMap but that optimised when nearly always > the thing you are looking up is not in the list, and when you can add > the list of words to look up and then freeze it. > > I have to scan an entire website looking up every word. Unless said web site is www.archive.org, just create a HashMap and use a Collections.unmodifiableMap() wrapper for lookup. I'd expect memory to become a problem way earlier than processor. Depending on how often each String is looked up, and how long the Strings are, a data structure that does not need String.hashCode() might be faster. I'd consider a tree along the following lines: Characters not in [a-zA-Z] are expressed as `nnnn - the ASCII value of '`' is 'a' - 1. Characters in [A-Z] are lowercased. Nodes contain a boolean and a Node[27]. The tree contains a word if there is a path from the root to a Node whose boolean is true and that is reached via Node[String.charAt(l)] at level l. For example, a tree containing exactly "a", "at", "and", "no", "nil", "not" would be N0 = { false, [ null, N1, 12*null, N2, 12*null ] } // "" N1 = { true, [ null, 13*null, N3, 5*null, N4, 6*null ] } // "a" N2 = { false, [ null, 8*null, N5, 5*null, N6, 11*null ] } // "n" N3 = { false, [ null, 13*null, N7, 12*null ] } // "an" N4 = { true, [ null, 26*null ] } // "at" N5 = { false, [ null, 11*null, N8, 14*null ] } // "ni" N6 = { true, [ null, 19*null, N9, 6*null ] } // "no" N7 = { true, [ null, 26*null ] } // "and" N8 = { true, [ null, 26*null ] } // "nil" N9 = { true, [ null, 26*null ] } // "not" For a Map, replace the boolean with a field holding the value. Possible improvements: - Since we have three (32bit VM) or seven (64bit VM) padding bytes in each Node, we could use two of them to hold a short indicating the maximum length of Strings in the subtree. - Instantiate Node[] only if actually needed to hold a child, and only large enough to hold all currently known children. Use one of the padding bytes to hold an index offset, such that a Node with only a single child only holds a Node[1]: N4 = { true, 0, null } N6 = { true, 20, [ N9 ] } - Use an interface and specialized concrete classes for - leaf node (no fields at all, because boolean will always be true) - single child node { boolean, Node } or even "true" and "false" single child nodes - ... but this will considerable increase preprocessing cost. - If many of your words contain non-[a-z] characters, provide array positions for the most common of these, or use a linear probing hash table for the non-[a-z] children. - If the tree is very sparse, doing so even for the [a-z] children might be better in the long run because more of it fits into caches. This surely has been invented before and given a nice name... Apache's or Google's collection libraries might even feature high quality implementations. -- "I'm a doctor, not a mechanic." Dr Leonard McCoy "I'm a mechanic, not a doctor." Volker Borchert