Groups > comp.lang.java.programmer > #19875 > unrolled thread

optimsed HashMap

Started by	Roedy Green <see_website@mindprod.com.invalid>
First post	2012-11-23 17:12 -0800
Last post	2012-11-27 14:16 -0800
Articles	13 on this page of 33 — 15 participants

Back to article view | Back to comp.lang.java.programmer

  optimsed HashMap Roedy Green <see_website@mindprod.com.invalid> - 2012-11-23 17:12 -0800
    Re: optimsed HashMap Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 20:19 -0500
    Re: optimsed HashMap markspace <-@.> - 2012-11-23 17:33 -0800
      Re: optimsed HashMap Roedy Green <see_website@mindprod.com.invalid> - 2012-11-23 22:42 -0800
        Re: optimsed HashMap Roedy Green <see_website@mindprod.com.invalid> - 2012-11-24 03:34 -0800
          Re: optimsed HashMap Knute Johnson <nospam@knutejohnson.com> - 2012-11-24 08:39 -0800
            Re: optimsed HashMap Knute Johnson <nospam@rabbitbrush.frazmtn.com> - 2012-11-24 15:14 -0800
        Re: optimsed HashMap Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 13:24 -0500
        Re: optimsed HashMap markspace <-@.> - 2012-11-24 10:44 -0800
          Re: optimsed HashMap "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-25 13:40 +0000
        Re: optimsed HashMap Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-11-26 22:03 +0100
          Re: optimsed HashMap Robert Klemme <shortcutter@googlemail.com> - 2012-11-26 23:32 +0100
            Re: optimsed HashMap Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-11-27 03:24 +0100
              Re: optimsed HashMap Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-11-27 03:35 +0100
                Re: optimsed HashMap Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-27 08:44 -0500
                  Re: optimsed HashMap Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-11-27 14:20 -0800
                    Re: optimsed HashMap Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-11-30 03:35 +0100
    Re: optimsed HashMap Patricia Shanahan <pats@acm.org> - 2012-11-23 19:51 -0800
      Re: optimsed HashMap "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-24 10:21 +0000
        Re: optimsed HashMap Roedy Green <see_website@mindprod.com.invalid> - 2012-11-24 03:39 -0800
          Re: optimsed HashMap Robert Klemme <shortcutter@googlemail.com> - 2012-11-24 16:24 +0100
            Re: optimsed HashMap "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-25 13:50 +0000
              Re: optimsed HashMap Robert Klemme <shortcutter@googlemail.com> - 2012-11-25 15:30 +0100
        Re: optimsed HashMap "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-26 21:13 +0000
      Re: optimsed HashMap Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 13:16 -0500
    Re: optimsed HashMap v_borchert@despammed.com (Volker Borchert) - 2012-11-24 08:05 +0000
    Re: optimsed HashMap Silvio <silvio@internet.com> - 2012-11-26 11:57 +0100
    Re: optimsed HashMap Jim Janney <jjanney@shell.xmission.com> - 2012-11-26 11:13 -0700
    Re: optimsed HashMap Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-11-26 15:44 -0800
      Re: optimsed HashMap Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-26 20:28 -0500
        Re: optimsed HashMap Arved Sandstrom <asandstrom2@eastlink.ca> - 2012-11-27 06:01 -0400
          Re: optimsed HashMap Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-27 08:56 -0500
        Re: optimsed HashMap Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-11-27 14:16 -0800

Page 2 of 2 — ← Prev page 1 [2]

#19895

From	Robert Klemme <shortcutter@googlemail.com>
Date	2012-11-24 16:24 +0100
Message-ID	<ahc75eFn0gqU1@mid.individual.net>
In reply to	#19892

On 24.11.2012 12:39, Roedy Green wrote:
> On Sat, 24 Nov 2012 10:21:14 -0000, "Chris Uppal"
> <chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly
> quoted someone who said :
>
>> Look into the literature on fast text searching (for instance bit-parallel
>> matching).  It's not entirely clear to me what Roedy is trying to do, but it
>> sounds as if "bulk" matching/searching might be relevant.
>
> Yes a Boyer-Moore to simultaneously search for the whole list of
> words, then when it has a hit see if it has word in isolation rather
> than a word fragment.

Here's another approach:

1. fill a HashMap with the translations.
2. Create a tree or trie from the keys.
3. Convert the trie to a regular expression optimized for NFA automata 
(such as is used in Java std. library).
4. Surround the regexp with additional regexp to ensure word matches and 
probably exclude matching inside HTML tags
5. Scan the document with Matcher.find()

The idea of item 3 is to create a regexp with as little backtracking as 
possible.  For example, from

foo
foot
fuss

you make

(?:f(?:oot?)|uss)

Not sure though whether it is dramatically faster or slower than a 
standard string search like Boyer-Moore - probably not.

Kind regards

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

[toc] | [prev] | [next] | [standalone]

#19934

From	"Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org>
Date	2012-11-25 13:50 +0000
Message-ID	<fNCdnchke6quvy_NnZ2dnUVZ8hCdnZ2d@bt.com>
In reply to	#19895

Robert Klemme wrote:

> 1. fill a HashMap with the translations.
> 2. Create a tree or trie from the keys.
> 3. Convert the trie to a regular expression optimized for NFA automata
> (such as is used in Java std. library).
> 4. Surround the regexp with additional regexp to ensure word matches and
> probably exclude matching inside HTML tags
> 5. Scan the document with Matcher.find()
>
> The idea of item 3 is to create a regexp with as little backtracking as
> possible.  For example, from
>
> foo
> foot
> fuss
>
> you make
>
> (?:f(?:oot?)|uss)

Alternatively, use a real DFA matcher which will do all of that work for you as 
it creates a minimal (or maybe only nearly mininal) DFA.

> Not sure though whether it is dramatically faster or slower than a standard 
> string search like Boyer-Moore - probably not.

As I mentioned elsewhere BM is no longer considered the best (there's a nice, 
albeit somewhat specialised, book by Navarro and Raffinot "Flexible Pattern 
Matching in Strings").  But my gut feeling (another way of saying that I 
haven't bothered to test it) is that IO would probably dominate the exection 
time whatever algorithm was used (assuming it wasn't implemented 
inefficiently).

    -- chris

    -- chris

[toc] | [prev] | [next] | [standalone]

#19942

From	Robert Klemme <shortcutter@googlemail.com>
Date	2012-11-25 15:30 +0100
Message-ID	<aheoclF9ot2U2@mid.individual.net>
In reply to	#19934

On 25.11.2012 14:50, Chris Uppal wrote:
> Robert Klemme wrote:

> Alternatively, use a real DFA matcher which will do all of that work for you as
> it creates a minimal (or maybe only nearly mininal) DFA.

Right, that's probably an even better option. :-)

>> Not sure though whether it is dramatically faster or slower than a standard
>> string search like Boyer-Moore - probably not.
>
> As I mentioned elsewhere BM is no longer considered the best (there's a nice,
> albeit somewhat specialised, book by Navarro and Raffinot "Flexible Pattern
> Matching in Strings").

Thank you for that pointer, Chris!

>  But my gut feeling (another way of saying that I
> haven't bothered to test it) is that IO would probably dominate the exection
> time whatever algorithm was used (assuming it wasn't implemented
> inefficiently).

Yes, that seems likely.

Kind regards

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

[toc] | [prev] | [next] | [standalone]

#19986

From	"Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org>
Date	2012-11-26 21:13 +0000
Message-ID	<ZPednYB0iPbTQS7NnZ2dnUVZ8l2dnZ2d@bt.com>
In reply to	#19890

I wrote:

> Also worth considering (assuming that the standard HashSet isn't doing
> the job well enough):

And Bloom Filters.

(I always forget Bloom Filters, God knows why, like radix sort...)

    -- chris

[toc] | [prev] | [next] | [standalone]

#19901

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-11-24 13:16 -0500
Message-ID	<50b10f1e$0$285$14726298@news.sunsite.dk>
In reply to	#19882

On 11/23/2012 10:51 PM, Patricia Shanahan wrote:
> On 11/23/2012 5:12 PM, Roedy Green wrote:
>> Is there something like HashMap but that optimised when nearly always
>> the thing you are looking up is not in the list, and when you can add
>> the list of words to look up and then freeze it.
>>
>> I have to scan an entire website looking up every word.
>
> Look up "perfect hash". The idea is that, given a set of strings, you
> can calculate a hash function that maps each of those strings to a
> different bucket. That avoids any chain searching due to bucket
> collisions, and simplifies the data structures.

I would expects gains by using a better hash function
compared to the standard Java one to be very small for
String's containing words (English and Java code).

Arne

[toc] | [prev] | [next] | [standalone]

#19886

From	v_borchert@despammed.com (Volker Borchert)
Date	2012-11-24 08:05 +0000
Message-ID	<k8pv53$7gl$1@Gaia.teknon.de>
In reply to	#19875

Roedy Green wrote:
> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.
> 
> I have to scan an entire website looking up every word.

Unless said web site is www.archive.org, just create a HashMap
and use a Collections.unmodifiableMap() wrapper for lookup. I'd
expect memory to become a problem way earlier than processor.

Depending on how often each String is looked up, and how long the
Strings are, a data structure that does not need String.hashCode()
might be faster.

I'd consider a tree along the following lines:

Characters not in [a-zA-Z] are expressed as `nnnn - the ASCII value of
'`' is 'a' - 1. Characters in [A-Z] are lowercased. Nodes contain a
boolean and a Node[27]. The tree contains a word if there is a path
from the root to a Node whose boolean is true and that is reached via
Node[String.charAt(l)] at level l.

For example, a tree containing exactly
	"a", "at", "and", "no", "nil", "not"
would be

N0 = { false, [ null, N1, 12*null, N2, 12*null ] }        // ""
N1 = { true,  [ null, 13*null, N3, 5*null, N4, 6*null ] } // "a"
N2 = { false, [ null, 8*null, N5, 5*null, N6, 11*null ] } // "n"
N3 = { false, [ null, 13*null, N7, 12*null ] }            // "an"
N4 = { true,  [ null, 26*null ] }                         // "at"
N5 = { false, [ null, 11*null, N8, 14*null ] }            // "ni"
N6 = { true,  [ null, 19*null, N9, 6*null ] }             // "no"
N7 = { true,  [ null, 26*null ] }                         // "and"
N8 = { true,  [ null, 26*null ] }                         // "nil"
N9 = { true,  [ null, 26*null ] }                         // "not"

For a Map, replace the boolean with a field holding the value.

Possible improvements:

- Since we have three (32bit VM) or seven (64bit VM) padding bytes
  in each Node, we could use two of them to hold a short indicating
  the maximum length of Strings in the subtree.

- Instantiate Node[] only if actually needed to hold a child, and only
  large enough to hold all currently known children. Use one of the
  padding bytes to hold an index offset, such that a Node with only
  a single child only holds a Node[1]:
  N4 = { true, 0, null }
  N6 = { true, 20, [ N9 ] }

- Use an interface and specialized concrete classes for
  - leaf node (no fields at all, because boolean will always be true)
  - single child node { boolean, Node }
    or even "true" and "false" single child nodes
  - ...
  but this will considerable increase preprocessing cost.

- If many of your words contain non-[a-z] characters, provide array
  positions for the most common of these, or use a linear probing
  hash table for the non-[a-z] children.

- If the tree is very sparse, doing so even for the [a-z] children
  might be better in the long run because more of it fits into caches.

This surely has been invented before and given a nice name...
Apache's or Google's collection libraries might even feature
high quality implementations.

-- 

"I'm a doctor, not a mechanic." Dr Leonard McCoy <mccoy@ncc1701.starfleet.fed>
"I'm a mechanic, not a doctor." Volker Borchert  <v_borchert@despammed.com>

[toc] | [prev] | [next] | [standalone]

#19980

From	Silvio <silvio@internet.com>
Date	2012-11-26 11:57 +0100
Message-ID	<50b34b1c$0$6865$e4fe514c@news2.news.xs4all.nl>
In reply to	#19875

On 11/24/2012 02:12 AM, Roedy Green wrote:
> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.
>
> I have to scan an entire website looking up every word.
>

The use case you describe sounds like a good fit for a character trie. 
The root represents the empty input string and each node has a child 
node for each character extension of its current input path that needs 
to be mapped. A node can optionally have a replacement string if it 
marks the end of a word-to-be-mapped. If it doesn't than it marks a 
sub-word only.
You simply read word characters and step down into the trie. If you walk 
off the tree halfway through a word or if you end on a node that has no 
replacement value then the word does not exist and you need to copy it 
to your output. If the last character of a word leads to a node with a 
replacement string than that is what you output.

Setting up the trie might take more time than setting up a plain word 
map but the processing should be a lot faster.

Silvio

[toc] | [prev] | [next] | [standalone]

#19981

From	Jim Janney <jjanney@shell.xmission.com>
Date	2012-11-26 11:13 -0700
Message-ID	<ydnzk248c77.fsf@shell.xmission.com>
In reply to	#19875

Roedy Green <see_website@mindprod.com.invalid> writes:

> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.
>
> I have to scan an entire website looking up every word.

In this case, where you are always looking up a newly created string,
it's likely that computing the hash code takes more time than the actual
lookup: java.lang.String.hashCode() looks at every character in the
string.  You could experiment with a weaker hash code that is cheaper to
compute but generates more collisions, for example only hash the first n
characters for some small value of n.

Or take the length of the string (very cheap to compute) and check it
against a BitSet.  Or do the same with the first (or last) character.
How well these strategies work will depend very much on the actual data.

-- 
Jim Janney

[toc] | [prev] | [next] | [standalone]

#19988

From	Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date	2012-11-26 15:44 -0800
Message-ID	<b9Tss.19809$W21.10542@newsfe27.iad>
In reply to	#19875

On 11/23/12 5:12 PM, Roedy Green wrote:
> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.
>
> I have to scan an entire website looking up every word.
>

I don't think your use-case needs this kind of (read "micro") 
optimization.  However, I have often wished for a way to get a Map.Entry 
for a key, which could then be used to insert a new value efficiently.

Map.Entry<K,V> getOrAddEntry(K key);

Map.Entry<K,V> e = map.getOrAddEntry(myKey);

if (e.getValue() == null) {
    e.setValue(newValue(myKey));
}

useValue(e.getValue());

Alas, not exactly possible to add it now.

[toc] | [prev] | [next] | [standalone]

#19990

From	Eric Sosman <esosman@comcast-dot-net.invalid>
Date	2012-11-26 20:28 -0500
Message-ID	<k914ve$kl4$1@dont-email.me>
In reply to	#19988

On 11/26/2012 6:44 PM, Daniel Pitts wrote:
> On 11/23/12 5:12 PM, Roedy Green wrote:
>> Is there something like HashMap but that optimised when nearly always
>> the thing you are looking up is not in the list, and when you can add
>> the list of words to look up and then freeze it.
>>
>> I have to scan an entire website looking up every word.
>>
>
> I don't think your use-case needs this kind of (read "micro")
> optimization.  However, I have often wished for a way to get a Map.Entry
> for a key, which could then be used to insert a new value efficiently.
>
> Map.Entry<K,V> getOrAddEntry(K key);
>
>
> Map.Entry<K,V> e = map.getOrAddEntry(myKey);
>
> if (e.getValue() == null) {
>     e.setValue(newValue(myKey));
> }
>
> useValue(e.getValue());
>
> Alas, not exactly possible to add it now.

     java.util.concurrent.ConcurrentMap has putIfAbsent().

-- 
Eric Sosman
esosman@comcast-dot-net.invalid

[toc] | [prev] | [next] | [standalone]

#19994

From	Arved Sandstrom <asandstrom2@eastlink.ca>
Date	2012-11-27 06:01 -0400
Message-ID	<jc0ts.12172$v83.1077@newsfe09.iad>
In reply to	#19990

On 11/26/2012 09:28 PM, Eric Sosman wrote:
> On 11/26/2012 6:44 PM, Daniel Pitts wrote:
>> On 11/23/12 5:12 PM, Roedy Green wrote:
>>> Is there something like HashMap but that optimised when nearly always
>>> the thing you are looking up is not in the list, and when you can add
>>> the list of words to look up and then freeze it.
>>>
>>> I have to scan an entire website looking up every word.
>>>
>>
>> I don't think your use-case needs this kind of (read "micro")
>> optimization.  However, I have often wished for a way to get a Map.Entry
>> for a key, which could then be used to insert a new value efficiently.
>>
>> Map.Entry<K,V> getOrAddEntry(K key);
>>
>>
>> Map.Entry<K,V> e = map.getOrAddEntry(myKey);
>>
>> if (e.getValue() == null) {
>>     e.setValue(newValue(myKey));
>> }
>>
>> useValue(e.getValue());
>>
>> Alas, not exactly possible to add it now.
>
>      java.util.concurrent.ConcurrentMap has putIfAbsent().
>
It's almost certainly a reflection (no pun intended) of what I do with 
Maps, but a common variant use case for me is

putEmptyListAndAddValueToListIfAbsent

I don't normally use ConcurrentHashMap so the usual idiom for me is a 
contains() check that may put() the new empty list as a value, 
afterwards the list (empty or otherwise) is always got, and the actual 
value added to it.

AHS

[toc] | [prev] | [next] | [standalone]

#19996

From	Eric Sosman <esosman@comcast-dot-net.invalid>
Date	2012-11-27 08:56 -0500
Message-ID	<k92gpu$2ah$1@dont-email.me>
In reply to	#19994

On 11/27/2012 5:01 AM, Arved Sandstrom wrote:
> On 11/26/2012 09:28 PM, Eric Sosman wrote:
>> On 11/26/2012 6:44 PM, Daniel Pitts wrote:
>>>
>>> I don't think your use-case needs this kind of (read "micro")
>>> optimization.  However, I have often wished for a way to get a Map.Entry
>>> for a key, which could then be used to insert a new value efficiently.
>>>
>>> Map.Entry<K,V> getOrAddEntry(K key);
>>>
>>> Map.Entry<K,V> e = map.getOrAddEntry(myKey);
>>> if (e.getValue() == null) {
>>>     e.setValue(newValue(myKey));
>>> }
>>> useValue(e.getValue());
>>>
>>> Alas, not exactly possible to add it now.
>>
>>      java.util.concurrent.ConcurrentMap has putIfAbsent().
>>
> It's almost certainly a reflection (no pun intended) of what I do with
> Maps, but a common variant use case for me is
>
> putEmptyListAndAddValueToListIfAbsent
>
> I don't normally use ConcurrentHashMap so the usual idiom for me is a
> contains() check that may put() the new empty list as a value,
> afterwards the list (empty or otherwise) is always got, and the actual
> value added to it.

     I usually write it as

	List<V> list = map.get(key);
	if (list == null) {
	    list = new SomeKindOfList();
	    map.put(key, list);
	}
	list.add(newValue);

... which involves one extra lookup per mapped key, not one
extra lookup per query.

     The imaginary getOrAddEntry() would be nice, although it
might complicate the Map.Entry interface itself (you'd want a
way to distinguish "This key maps to null" from "This key is
newly entered").

-- 
Eric Sosman
esosman@comcast-dot-net.invalid

[toc] | [prev] | [next] | [standalone]

#19999

From	Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date	2012-11-27 14:16 -0800
Message-ID	<8Zats.20121$W21.7789@newsfe27.iad>
In reply to	#19990

On 11/26/12 5:28 PM, Eric Sosman wrote:
> On 11/26/2012 6:44 PM, Daniel Pitts wrote:
>> On 11/23/12 5:12 PM, Roedy Green wrote:
>>> Is there something like HashMap but that optimised when nearly always
>>> the thing you are looking up is not in the list, and when you can add
>>> the list of words to look up and then freeze it.
>>>
>>> I have to scan an entire website looking up every word.
>>>
>>
>> I don't think your use-case needs this kind of (read "micro")
>> optimization.  However, I have often wished for a way to get a Map.Entry
>> for a key, which could then be used to insert a new value efficiently.
>>
>> Map.Entry<K,V> getOrAddEntry(K key);
>>
>>
>> Map.Entry<K,V> e = map.getOrAddEntry(myKey);
>>
>> if (e.getValue() == null) {
>>     e.setValue(newValue(myKey));
>> }
>>
>> useValue(e.getValue());
>>
>> Alas, not exactly possible to add it now.
>
>      java.util.concurrent.ConcurrentMap has putIfAbsent().
>
putIfAbsent isn't exactly what I want. First, I might not want to go to 
the expense of creating a new value *unless* there isn't already one. 
Second, this is an operation that may not need the overhead of a 
concurrency-safe map.

The first problem *can* be overcome by using a wrapper object, but that 
adds additional overhead, and the point of this is to reduce overhead.

[toc] | [prev] | [standalone]

Page 2 of 2 — ← Prev page 1 [2]

csiph-web

optimsed HashMap

Contents

#19895

#19934

#19942

#19986

#19901

#19886

#19980

#19981

#19988

#19990

#19994

#19996

#19999