Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.java.programmer > #19875 > unrolled thread
| Started by | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| First post | 2012-11-23 17:12 -0800 |
| Last post | 2012-11-27 14:16 -0800 |
| Articles | 13 on this page of 33 — 15 participants |
Back to article view | Back to comp.lang.java.programmer
optimsed HashMap Roedy Green <see_website@mindprod.com.invalid> - 2012-11-23 17:12 -0800
Re: optimsed HashMap Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 20:19 -0500
Re: optimsed HashMap markspace <-@.> - 2012-11-23 17:33 -0800
Re: optimsed HashMap Roedy Green <see_website@mindprod.com.invalid> - 2012-11-23 22:42 -0800
Re: optimsed HashMap Roedy Green <see_website@mindprod.com.invalid> - 2012-11-24 03:34 -0800
Re: optimsed HashMap Knute Johnson <nospam@knutejohnson.com> - 2012-11-24 08:39 -0800
Re: optimsed HashMap Knute Johnson <nospam@rabbitbrush.frazmtn.com> - 2012-11-24 15:14 -0800
Re: optimsed HashMap Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 13:24 -0500
Re: optimsed HashMap markspace <-@.> - 2012-11-24 10:44 -0800
Re: optimsed HashMap "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-25 13:40 +0000
Re: optimsed HashMap Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-11-26 22:03 +0100
Re: optimsed HashMap Robert Klemme <shortcutter@googlemail.com> - 2012-11-26 23:32 +0100
Re: optimsed HashMap Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-11-27 03:24 +0100
Re: optimsed HashMap Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-11-27 03:35 +0100
Re: optimsed HashMap Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-27 08:44 -0500
Re: optimsed HashMap Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-11-27 14:20 -0800
Re: optimsed HashMap Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-11-30 03:35 +0100
Re: optimsed HashMap Patricia Shanahan <pats@acm.org> - 2012-11-23 19:51 -0800
Re: optimsed HashMap "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-24 10:21 +0000
Re: optimsed HashMap Roedy Green <see_website@mindprod.com.invalid> - 2012-11-24 03:39 -0800
Re: optimsed HashMap Robert Klemme <shortcutter@googlemail.com> - 2012-11-24 16:24 +0100
Re: optimsed HashMap "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-25 13:50 +0000
Re: optimsed HashMap Robert Klemme <shortcutter@googlemail.com> - 2012-11-25 15:30 +0100
Re: optimsed HashMap "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-26 21:13 +0000
Re: optimsed HashMap Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 13:16 -0500
Re: optimsed HashMap v_borchert@despammed.com (Volker Borchert) - 2012-11-24 08:05 +0000
Re: optimsed HashMap Silvio <silvio@internet.com> - 2012-11-26 11:57 +0100
Re: optimsed HashMap Jim Janney <jjanney@shell.xmission.com> - 2012-11-26 11:13 -0700
Re: optimsed HashMap Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-11-26 15:44 -0800
Re: optimsed HashMap Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-26 20:28 -0500
Re: optimsed HashMap Arved Sandstrom <asandstrom2@eastlink.ca> - 2012-11-27 06:01 -0400
Re: optimsed HashMap Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-27 08:56 -0500
Re: optimsed HashMap Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-11-27 14:16 -0800
Page 2 of 2 — ← Prev page 1 [2]
| From | Robert Klemme <shortcutter@googlemail.com> |
|---|---|
| Date | 2012-11-24 16:24 +0100 |
| Message-ID | <ahc75eFn0gqU1@mid.individual.net> |
| In reply to | #19892 |
On 24.11.2012 12:39, Roedy Green wrote: > On Sat, 24 Nov 2012 10:21:14 -0000, "Chris Uppal" > <chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly > quoted someone who said : > >> Look into the literature on fast text searching (for instance bit-parallel >> matching). It's not entirely clear to me what Roedy is trying to do, but it >> sounds as if "bulk" matching/searching might be relevant. > > Yes a Boyer-Moore to simultaneously search for the whole list of > words, then when it has a hit see if it has word in isolation rather > than a word fragment. Here's another approach: 1. fill a HashMap with the translations. 2. Create a tree or trie from the keys. 3. Convert the trie to a regular expression optimized for NFA automata (such as is used in Java std. library). 4. Surround the regexp with additional regexp to ensure word matches and probably exclude matching inside HTML tags 5. Scan the document with Matcher.find() The idea of item 3 is to create a regexp with as little backtracking as possible. For example, from foo foot fuss you make (?:f(?:oot?)|uss) Not sure though whether it is dramatically faster or slower than a standard string search like Boyer-Moore - probably not. Kind regards robert -- remember.guy do |as, often| as.you_can - without end http://blog.rubybestpractices.com/
[toc] | [prev] | [next] | [standalone]
| From | "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> |
|---|---|
| Date | 2012-11-25 13:50 +0000 |
| Message-ID | <fNCdnchke6quvy_NnZ2dnUVZ8hCdnZ2d@bt.com> |
| In reply to | #19895 |
Robert Klemme wrote:
> 1. fill a HashMap with the translations.
> 2. Create a tree or trie from the keys.
> 3. Convert the trie to a regular expression optimized for NFA automata
> (such as is used in Java std. library).
> 4. Surround the regexp with additional regexp to ensure word matches and
> probably exclude matching inside HTML tags
> 5. Scan the document with Matcher.find()
>
> The idea of item 3 is to create a regexp with as little backtracking as
> possible. For example, from
>
> foo
> foot
> fuss
>
> you make
>
> (?:f(?:oot?)|uss)
Alternatively, use a real DFA matcher which will do all of that work for you as
it creates a minimal (or maybe only nearly mininal) DFA.
> Not sure though whether it is dramatically faster or slower than a standard
> string search like Boyer-Moore - probably not.
As I mentioned elsewhere BM is no longer considered the best (there's a nice,
albeit somewhat specialised, book by Navarro and Raffinot "Flexible Pattern
Matching in Strings"). But my gut feeling (another way of saying that I
haven't bothered to test it) is that IO would probably dominate the exection
time whatever algorithm was used (assuming it wasn't implemented
inefficiently).
-- chris
-- chris
[toc] | [prev] | [next] | [standalone]
| From | Robert Klemme <shortcutter@googlemail.com> |
|---|---|
| Date | 2012-11-25 15:30 +0100 |
| Message-ID | <aheoclF9ot2U2@mid.individual.net> |
| In reply to | #19934 |
On 25.11.2012 14:50, Chris Uppal wrote: > Robert Klemme wrote: > Alternatively, use a real DFA matcher which will do all of that work for you as > it creates a minimal (or maybe only nearly mininal) DFA. Right, that's probably an even better option. :-) >> Not sure though whether it is dramatically faster or slower than a standard >> string search like Boyer-Moore - probably not. > > As I mentioned elsewhere BM is no longer considered the best (there's a nice, > albeit somewhat specialised, book by Navarro and Raffinot "Flexible Pattern > Matching in Strings"). Thank you for that pointer, Chris! > But my gut feeling (another way of saying that I > haven't bothered to test it) is that IO would probably dominate the exection > time whatever algorithm was used (assuming it wasn't implemented > inefficiently). Yes, that seems likely. Kind regards robert -- remember.guy do |as, often| as.you_can - without end http://blog.rubybestpractices.com/
[toc] | [prev] | [next] | [standalone]
| From | "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> |
|---|---|
| Date | 2012-11-26 21:13 +0000 |
| Message-ID | <ZPednYB0iPbTQS7NnZ2dnUVZ8l2dnZ2d@bt.com> |
| In reply to | #19890 |
I wrote:
> Also worth considering (assuming that the standard HashSet isn't doing
> the job well enough):
And Bloom Filters.
(I always forget Bloom Filters, God knows why, like radix sort...)
-- chris
[toc] | [prev] | [next] | [standalone]
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-24 13:16 -0500 |
| Message-ID | <50b10f1e$0$285$14726298@news.sunsite.dk> |
| In reply to | #19882 |
On 11/23/2012 10:51 PM, Patricia Shanahan wrote: > On 11/23/2012 5:12 PM, Roedy Green wrote: >> Is there something like HashMap but that optimised when nearly always >> the thing you are looking up is not in the list, and when you can add >> the list of words to look up and then freeze it. >> >> I have to scan an entire website looking up every word. > > Look up "perfect hash". The idea is that, given a set of strings, you > can calculate a hash function that maps each of those strings to a > different bucket. That avoids any chain searching due to bucket > collisions, and simplifies the data structures. I would expects gains by using a better hash function compared to the standard Java one to be very small for String's containing words (English and Java code). Arne
[toc] | [prev] | [next] | [standalone]
| From | v_borchert@despammed.com (Volker Borchert) |
|---|---|
| Date | 2012-11-24 08:05 +0000 |
| Message-ID | <k8pv53$7gl$1@Gaia.teknon.de> |
| In reply to | #19875 |
Roedy Green wrote:
> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.
>
> I have to scan an entire website looking up every word.
Unless said web site is www.archive.org, just create a HashMap
and use a Collections.unmodifiableMap() wrapper for lookup. I'd
expect memory to become a problem way earlier than processor.
Depending on how often each String is looked up, and how long the
Strings are, a data structure that does not need String.hashCode()
might be faster.
I'd consider a tree along the following lines:
Characters not in [a-zA-Z] are expressed as `nnnn - the ASCII value of
'`' is 'a' - 1. Characters in [A-Z] are lowercased. Nodes contain a
boolean and a Node[27]. The tree contains a word if there is a path
from the root to a Node whose boolean is true and that is reached via
Node[String.charAt(l)] at level l.
For example, a tree containing exactly
"a", "at", "and", "no", "nil", "not"
would be
N0 = { false, [ null, N1, 12*null, N2, 12*null ] } // ""
N1 = { true, [ null, 13*null, N3, 5*null, N4, 6*null ] } // "a"
N2 = { false, [ null, 8*null, N5, 5*null, N6, 11*null ] } // "n"
N3 = { false, [ null, 13*null, N7, 12*null ] } // "an"
N4 = { true, [ null, 26*null ] } // "at"
N5 = { false, [ null, 11*null, N8, 14*null ] } // "ni"
N6 = { true, [ null, 19*null, N9, 6*null ] } // "no"
N7 = { true, [ null, 26*null ] } // "and"
N8 = { true, [ null, 26*null ] } // "nil"
N9 = { true, [ null, 26*null ] } // "not"
For a Map, replace the boolean with a field holding the value.
Possible improvements:
- Since we have three (32bit VM) or seven (64bit VM) padding bytes
in each Node, we could use two of them to hold a short indicating
the maximum length of Strings in the subtree.
- Instantiate Node[] only if actually needed to hold a child, and only
large enough to hold all currently known children. Use one of the
padding bytes to hold an index offset, such that a Node with only
a single child only holds a Node[1]:
N4 = { true, 0, null }
N6 = { true, 20, [ N9 ] }
- Use an interface and specialized concrete classes for
- leaf node (no fields at all, because boolean will always be true)
- single child node { boolean, Node }
or even "true" and "false" single child nodes
- ...
but this will considerable increase preprocessing cost.
- If many of your words contain non-[a-z] characters, provide array
positions for the most common of these, or use a linear probing
hash table for the non-[a-z] children.
- If the tree is very sparse, doing so even for the [a-z] children
might be better in the long run because more of it fits into caches.
This surely has been invented before and given a nice name...
Apache's or Google's collection libraries might even feature
high quality implementations.
--
"I'm a doctor, not a mechanic." Dr Leonard McCoy <mccoy@ncc1701.starfleet.fed>
"I'm a mechanic, not a doctor." Volker Borchert <v_borchert@despammed.com>
[toc] | [prev] | [next] | [standalone]
| From | Silvio <silvio@internet.com> |
|---|---|
| Date | 2012-11-26 11:57 +0100 |
| Message-ID | <50b34b1c$0$6865$e4fe514c@news2.news.xs4all.nl> |
| In reply to | #19875 |
On 11/24/2012 02:12 AM, Roedy Green wrote: > Is there something like HashMap but that optimised when nearly always > the thing you are looking up is not in the list, and when you can add > the list of words to look up and then freeze it. > > I have to scan an entire website looking up every word. > The use case you describe sounds like a good fit for a character trie. The root represents the empty input string and each node has a child node for each character extension of its current input path that needs to be mapped. A node can optionally have a replacement string if it marks the end of a word-to-be-mapped. If it doesn't than it marks a sub-word only. You simply read word characters and step down into the trie. If you walk off the tree halfway through a word or if you end on a node that has no replacement value then the word does not exist and you need to copy it to your output. If the last character of a word leads to a node with a replacement string than that is what you output. Setting up the trie might take more time than setting up a plain word map but the processing should be a lot faster. Silvio
[toc] | [prev] | [next] | [standalone]
| From | Jim Janney <jjanney@shell.xmission.com> |
|---|---|
| Date | 2012-11-26 11:13 -0700 |
| Message-ID | <ydnzk248c77.fsf@shell.xmission.com> |
| In reply to | #19875 |
Roedy Green <see_website@mindprod.com.invalid> writes: > Is there something like HashMap but that optimised when nearly always > the thing you are looking up is not in the list, and when you can add > the list of words to look up and then freeze it. > > I have to scan an entire website looking up every word. In this case, where you are always looking up a newly created string, it's likely that computing the hash code takes more time than the actual lookup: java.lang.String.hashCode() looks at every character in the string. You could experiment with a weaker hash code that is cheaper to compute but generates more collisions, for example only hash the first n characters for some small value of n. Or take the length of the string (very cheap to compute) and check it against a BitSet. Or do the same with the first (or last) character. How well these strategies work will depend very much on the actual data. -- Jim Janney
[toc] | [prev] | [next] | [standalone]
| From | Daniel Pitts <newsgroup.nospam@virtualinfinity.net> |
|---|---|
| Date | 2012-11-26 15:44 -0800 |
| Message-ID | <b9Tss.19809$W21.10542@newsfe27.iad> |
| In reply to | #19875 |
On 11/23/12 5:12 PM, Roedy Green wrote:
> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.
>
> I have to scan an entire website looking up every word.
>
I don't think your use-case needs this kind of (read "micro")
optimization. However, I have often wished for a way to get a Map.Entry
for a key, which could then be used to insert a new value efficiently.
Map.Entry<K,V> getOrAddEntry(K key);
Map.Entry<K,V> e = map.getOrAddEntry(myKey);
if (e.getValue() == null) {
e.setValue(newValue(myKey));
}
useValue(e.getValue());
Alas, not exactly possible to add it now.
[toc] | [prev] | [next] | [standalone]
| From | Eric Sosman <esosman@comcast-dot-net.invalid> |
|---|---|
| Date | 2012-11-26 20:28 -0500 |
| Message-ID | <k914ve$kl4$1@dont-email.me> |
| In reply to | #19988 |
On 11/26/2012 6:44 PM, Daniel Pitts wrote:
> On 11/23/12 5:12 PM, Roedy Green wrote:
>> Is there something like HashMap but that optimised when nearly always
>> the thing you are looking up is not in the list, and when you can add
>> the list of words to look up and then freeze it.
>>
>> I have to scan an entire website looking up every word.
>>
>
> I don't think your use-case needs this kind of (read "micro")
> optimization. However, I have often wished for a way to get a Map.Entry
> for a key, which could then be used to insert a new value efficiently.
>
> Map.Entry<K,V> getOrAddEntry(K key);
>
>
> Map.Entry<K,V> e = map.getOrAddEntry(myKey);
>
> if (e.getValue() == null) {
> e.setValue(newValue(myKey));
> }
>
> useValue(e.getValue());
>
> Alas, not exactly possible to add it now.
java.util.concurrent.ConcurrentMap has putIfAbsent().
--
Eric Sosman
esosman@comcast-dot-net.invalid
[toc] | [prev] | [next] | [standalone]
| From | Arved Sandstrom <asandstrom2@eastlink.ca> |
|---|---|
| Date | 2012-11-27 06:01 -0400 |
| Message-ID | <jc0ts.12172$v83.1077@newsfe09.iad> |
| In reply to | #19990 |
On 11/26/2012 09:28 PM, Eric Sosman wrote:
> On 11/26/2012 6:44 PM, Daniel Pitts wrote:
>> On 11/23/12 5:12 PM, Roedy Green wrote:
>>> Is there something like HashMap but that optimised when nearly always
>>> the thing you are looking up is not in the list, and when you can add
>>> the list of words to look up and then freeze it.
>>>
>>> I have to scan an entire website looking up every word.
>>>
>>
>> I don't think your use-case needs this kind of (read "micro")
>> optimization. However, I have often wished for a way to get a Map.Entry
>> for a key, which could then be used to insert a new value efficiently.
>>
>> Map.Entry<K,V> getOrAddEntry(K key);
>>
>>
>> Map.Entry<K,V> e = map.getOrAddEntry(myKey);
>>
>> if (e.getValue() == null) {
>> e.setValue(newValue(myKey));
>> }
>>
>> useValue(e.getValue());
>>
>> Alas, not exactly possible to add it now.
>
> java.util.concurrent.ConcurrentMap has putIfAbsent().
>
It's almost certainly a reflection (no pun intended) of what I do with
Maps, but a common variant use case for me is
putEmptyListAndAddValueToListIfAbsent
I don't normally use ConcurrentHashMap so the usual idiom for me is a
contains() check that may put() the new empty list as a value,
afterwards the list (empty or otherwise) is always got, and the actual
value added to it.
AHS
[toc] | [prev] | [next] | [standalone]
| From | Eric Sosman <esosman@comcast-dot-net.invalid> |
|---|---|
| Date | 2012-11-27 08:56 -0500 |
| Message-ID | <k92gpu$2ah$1@dont-email.me> |
| In reply to | #19994 |
On 11/27/2012 5:01 AM, Arved Sandstrom wrote:
> On 11/26/2012 09:28 PM, Eric Sosman wrote:
>> On 11/26/2012 6:44 PM, Daniel Pitts wrote:
>>>
>>> I don't think your use-case needs this kind of (read "micro")
>>> optimization. However, I have often wished for a way to get a Map.Entry
>>> for a key, which could then be used to insert a new value efficiently.
>>>
>>> Map.Entry<K,V> getOrAddEntry(K key);
>>>
>>> Map.Entry<K,V> e = map.getOrAddEntry(myKey);
>>> if (e.getValue() == null) {
>>> e.setValue(newValue(myKey));
>>> }
>>> useValue(e.getValue());
>>>
>>> Alas, not exactly possible to add it now.
>>
>> java.util.concurrent.ConcurrentMap has putIfAbsent().
>>
> It's almost certainly a reflection (no pun intended) of what I do with
> Maps, but a common variant use case for me is
>
> putEmptyListAndAddValueToListIfAbsent
>
> I don't normally use ConcurrentHashMap so the usual idiom for me is a
> contains() check that may put() the new empty list as a value,
> afterwards the list (empty or otherwise) is always got, and the actual
> value added to it.
I usually write it as
List<V> list = map.get(key);
if (list == null) {
list = new SomeKindOfList();
map.put(key, list);
}
list.add(newValue);
... which involves one extra lookup per mapped key, not one
extra lookup per query.
The imaginary getOrAddEntry() would be nice, although it
might complicate the Map.Entry interface itself (you'd want a
way to distinguish "This key maps to null" from "This key is
newly entered").
--
Eric Sosman
esosman@comcast-dot-net.invalid
[toc] | [prev] | [next] | [standalone]
| From | Daniel Pitts <newsgroup.nospam@virtualinfinity.net> |
|---|---|
| Date | 2012-11-27 14:16 -0800 |
| Message-ID | <8Zats.20121$W21.7789@newsfe27.iad> |
| In reply to | #19990 |
On 11/26/12 5:28 PM, Eric Sosman wrote:
> On 11/26/2012 6:44 PM, Daniel Pitts wrote:
>> On 11/23/12 5:12 PM, Roedy Green wrote:
>>> Is there something like HashMap but that optimised when nearly always
>>> the thing you are looking up is not in the list, and when you can add
>>> the list of words to look up and then freeze it.
>>>
>>> I have to scan an entire website looking up every word.
>>>
>>
>> I don't think your use-case needs this kind of (read "micro")
>> optimization. However, I have often wished for a way to get a Map.Entry
>> for a key, which could then be used to insert a new value efficiently.
>>
>> Map.Entry<K,V> getOrAddEntry(K key);
>>
>>
>> Map.Entry<K,V> e = map.getOrAddEntry(myKey);
>>
>> if (e.getValue() == null) {
>> e.setValue(newValue(myKey));
>> }
>>
>> useValue(e.getValue());
>>
>> Alas, not exactly possible to add it now.
>
> java.util.concurrent.ConcurrentMap has putIfAbsent().
>
putIfAbsent isn't exactly what I want. First, I might not want to go to
the expense of creating a new value *unless* there isn't already one.
Second, this is an operation that may not need the overhead of a
concurrency-safe map.
The first problem *can* be overcome by using a wrapper object, but that
adds additional overhead, and the point of this is to reduce overhead.
[toc] | [prev] | [standalone]
Page 2 of 2 — ← Prev page 1 [2]
Back to top | Article view | comp.lang.java.programmer
csiph-web