Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.java.programmer > #19875 > unrolled thread

optimsed HashMap

Started byRoedy Green <see_website@mindprod.com.invalid>
First post2012-11-23 17:12 -0800
Last post2012-11-27 14:16 -0800
Articles 20 on this page of 33 — 15 participants

Back to article view | Back to comp.lang.java.programmer


Contents

  optimsed HashMap Roedy Green <see_website@mindprod.com.invalid> - 2012-11-23 17:12 -0800
    Re: optimsed HashMap Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 20:19 -0500
    Re: optimsed HashMap markspace <-@.> - 2012-11-23 17:33 -0800
      Re: optimsed HashMap Roedy Green <see_website@mindprod.com.invalid> - 2012-11-23 22:42 -0800
        Re: optimsed HashMap Roedy Green <see_website@mindprod.com.invalid> - 2012-11-24 03:34 -0800
          Re: optimsed HashMap Knute Johnson <nospam@knutejohnson.com> - 2012-11-24 08:39 -0800
            Re: optimsed HashMap Knute Johnson <nospam@rabbitbrush.frazmtn.com> - 2012-11-24 15:14 -0800
        Re: optimsed HashMap Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 13:24 -0500
        Re: optimsed HashMap markspace <-@.> - 2012-11-24 10:44 -0800
          Re: optimsed HashMap "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-25 13:40 +0000
        Re: optimsed HashMap Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-11-26 22:03 +0100
          Re: optimsed HashMap Robert Klemme <shortcutter@googlemail.com> - 2012-11-26 23:32 +0100
            Re: optimsed HashMap Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-11-27 03:24 +0100
              Re: optimsed HashMap Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-11-27 03:35 +0100
                Re: optimsed HashMap Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-27 08:44 -0500
                  Re: optimsed HashMap Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-11-27 14:20 -0800
                    Re: optimsed HashMap Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-11-30 03:35 +0100
    Re: optimsed HashMap Patricia Shanahan <pats@acm.org> - 2012-11-23 19:51 -0800
      Re: optimsed HashMap "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-24 10:21 +0000
        Re: optimsed HashMap Roedy Green <see_website@mindprod.com.invalid> - 2012-11-24 03:39 -0800
          Re: optimsed HashMap Robert Klemme <shortcutter@googlemail.com> - 2012-11-24 16:24 +0100
            Re: optimsed HashMap "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-25 13:50 +0000
              Re: optimsed HashMap Robert Klemme <shortcutter@googlemail.com> - 2012-11-25 15:30 +0100
        Re: optimsed HashMap "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-11-26 21:13 +0000
      Re: optimsed HashMap Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 13:16 -0500
    Re: optimsed HashMap v_borchert@despammed.com (Volker Borchert) - 2012-11-24 08:05 +0000
    Re: optimsed HashMap Silvio <silvio@internet.com> - 2012-11-26 11:57 +0100
    Re: optimsed HashMap Jim Janney <jjanney@shell.xmission.com> - 2012-11-26 11:13 -0700
    Re: optimsed HashMap Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-11-26 15:44 -0800
      Re: optimsed HashMap Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-26 20:28 -0500
        Re: optimsed HashMap Arved Sandstrom <asandstrom2@eastlink.ca> - 2012-11-27 06:01 -0400
          Re: optimsed HashMap Eric Sosman <esosman@comcast-dot-net.invalid> - 2012-11-27 08:56 -0500
        Re: optimsed HashMap Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-11-27 14:16 -0800

Page 1 of 2  [1] 2  Next page →


#19875 — optimsed HashMap

FromRoedy Green <see_website@mindprod.com.invalid>
Date2012-11-23 17:12 -0800
Subjectoptimsed HashMap
Message-ID<8i70b8d0pm6ibk03ti4t2pv60jd0bctlcs@4ax.com>
Is there something like HashMap but that optimised when nearly always
the thing you are looking up is not in the list, and when you can add
the list of words to look up and then freeze it.

I have to scan an entire website looking up every word.
-- 
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish 
as couch potatoes who hire others to go to the gym for them. 

[toc] | [next] | [standalone]


#19876

FromArne Vajhøj <arne@vajhoej.dk>
Date2012-11-23 20:19 -0500
Message-ID<50b0208e$0$286$14726298@news.sunsite.dk>
In reply to#19875
On 11/23/2012 8:12 PM, Roedy Green wrote:
> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.
>
> I have to scan an entire website looking up every word.

HashMap get is already O(1). There is very little
room for optimization.

Arne

[toc] | [prev] | [next] | [standalone]


#19877

Frommarkspace <-@.>
Date2012-11-23 17:33 -0800
Message-ID<k8p85p$hqr$1@dont-email.me>
In reply to#19875
On 11/23/2012 5:12 PM, Roedy Green wrote:
> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.


I'm not sure what you are trying to say there.  You want the case where 
you do not find something in a hash map to be optimized?  "Optimized" how?

What do you mean "add to the list of words" and "freeze"?

>
> I have to scan an entire website looking up every word.
>

Google for "search engine."  Wikipedia has an entry for it.

[toc] | [prev] | [next] | [standalone]


#19883

FromRoedy Green <see_website@mindprod.com.invalid>
Date2012-11-23 22:42 -0800
Message-ID<8ip0b8p7blu31eub502so8cus1h9so3m9s@4ax.com>
In reply to#19877
On Fri, 23 Nov 2012 17:33:43 -0800, markspace <-@.> wrote, quoted or
indirectly quoted someone who said :

>I'm not sure what you are trying to say there.  You want the case where 
>you do not find something in a hash map to be optimized?  "Optimized" how?

>What do you mean "add to the list of words" and "freeze"?

The following is not the real problem, but it might more simply
illustrate what I am asking.

Think of an ordinary HashMap<String,String>

What it does is translate a few English words with French derivation,
putting the French accents on them. e.g. naive -> na&iuml;ve Napoleon
-> Napol&acute;on 

Let us say you have 100 such words you want to transform. (In my
actual problem I have about 1500  words).

You go through the files for a website looking at each word of text
(avoiding HTML markup) in the HashMap. If you find it you replace it.

Most of the time word you look up is not in the list.

This is a time-consuming process.  I would like to speed it up.

My lookup has two properties that might be exploited in some variant
HashMap.

1. nearly always the lookup fails. The code should be optimised for
this case.  If it has some fast way of knowing the elt is not there,
it should do that first.

2. the list of words to lookup does not change after initial
preparation. I can afford to do some special calculation to prime the
lookup. For example, I once heard of  some tweaking to avoid long
collision chains for a C implementation of HashMap.

My question had two purposes.  To see if there was something available
off the shelf, and to stimulate thought on some new algorithm that
could have wider application that just my problem.

Another way of looking at the problem is it would be nice to have a
HashSet implementation that was considerably faster than a HashMap.
IIRC, currently HashSet is implemented as a HashMap.

Such an algorithm could be used to fix your most common spelling
mistakes, to add links to magic words, to add markup to magic words
to find and report the presence of certain words, or in my case find
acronyms and replace them with a macro for that acronym that displays
the meaning of the acronym the first time it is used on a page.
-- 
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish 
as couch potatoes who hire others to go to the gym for them. 

[toc] | [prev] | [next] | [standalone]


#19891

FromRoedy Green <see_website@mindprod.com.invalid>
Date2012-11-24 03:34 -0800
Message-ID<m4c1b8t1mosig1qfohauol1gqftjongadt@4ax.com>
In reply to#19883
On Fri, 23 Nov 2012 22:42:40 -0800, Roedy Green
<see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

>Another way of looking at the problem is it would be nice to have a
>HashSet implementation that was considerably faster than a HashMap.
>IIRC, currently HashSet is implemented as a HashMap.

A a 3- valued HashSet.contains would still be useful
no -- not in set
yes - in set
maybe -- don't know 

At least you could eliminate the no's from further processing.
-- 
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish 
as couch potatoes who hire others to go to the gym for them. 

[toc] | [prev] | [next] | [standalone]


#19899

FromKnute Johnson <nospam@knutejohnson.com>
Date2012-11-24 08:39 -0800
Message-ID<k8qt7n$52u$1@dont-email.me>
In reply to#19891
On 11/24/2012 3:34 AM, Roedy Green wrote:
> On Fri, 23 Nov 2012 22:42:40 -0800, Roedy Green
> <see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
> someone who said :
>
>> Another way of looking at the problem is it would be nice to have a
>> HashSet implementation that was considerably faster than a HashMap.
>> IIRC, currently HashSet is implemented as a HashMap.
>
> A a 3- valued HashSet.contains would still be useful
> no -- not in set
> yes - in set
> maybe -- don't know
>
> At least you could eliminate the no's from further processing.
>

What about a sorted map?  I would think searching a sorted map would be 
much quicker than an unsorted one.

-- 

Knute Johnson

[toc] | [prev] | [next] | [standalone]


#19909

FromKnute Johnson <nospam@rabbitbrush.frazmtn.com>
Date2012-11-24 15:14 -0800
Message-ID<k8rkco$epr$1@dont-email.me>
In reply to#19899
On 11/24/2012 08:39 AM, Knute Johnson wrote:
> On 11/24/2012 3:34 AM, Roedy Green wrote:
>> On Fri, 23 Nov 2012 22:42:40 -0800, Roedy Green
>> <see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
>> someone who said :
>>
>>> Another way of looking at the problem is it would be nice to have a
>>> HashSet implementation that was considerably faster than a HashMap.
>>> IIRC, currently HashSet is implemented as a HashMap.
>>
>> A a 3- valued HashSet.contains would still be useful
>> no -- not in set
>> yes - in set
>> maybe -- don't know
>>
>> At least you could eliminate the no's from further processing.
>>
>
> What about a sorted map?  I would think searching a sorted map would be
> much quicker than an unsorted one.
>

Actually it's not faster, I tried it.

knute...

[toc] | [prev] | [next] | [standalone]


#19902

FromArne Vajhøj <arne@vajhoej.dk>
Date2012-11-24 13:24 -0500
Message-ID<50b110df$0$293$14726298@news.sunsite.dk>
In reply to#19883
On 11/24/2012 1:42 AM, Roedy Green wrote:
> On Fri, 23 Nov 2012 17:33:43 -0800, markspace <-@.> wrote, quoted or
> indirectly quoted someone who said :
>
>> I'm not sure what you are trying to say there.  You want the case where
>> you do not find something in a hash map to be optimized?  "Optimized" how?
>
>> What do you mean "add to the list of words" and "freeze"?
>
> The following is not the real problem, but it might more simply
> illustrate what I am asking.
>
> Think of an ordinary HashMap<String,String>
>
> What it does is translate a few English words with French derivation,
> putting the French accents on them. e.g. naive -> na&iuml;ve Napoleon
> -> Napol&acute;on
>
> Let us say you have 100 such words you want to transform. (In my
> actual problem I have about 1500  words).
>
> You go through the files for a website looking at each word of text
> (avoiding HTML markup) in the HashMap. If you find it you replace it.
>
> Most of the time word you look up is not in the list.
>
> This is a time-consuming process.  I would like to speed it up.

Reading the HTML files, splitting them up in words and discarding
HTML seems more likely to be the bottleneck than the lookups in a
HashMap.

What does your measurements show CPU time and wall time
distributed between the different activities?

You did measure right??

> My lookup has two properties that might be exploited in some variant
> HashMap.
>
> 1. nearly always the lookup fails. The code should be optimised for
> this case.  If it has some fast way of knowing the elt is not there,
> it should do that first.

HashMap should already do that.

> 2. the list of words to lookup does not change after initial
> preparation. I can afford to do some special calculation to prime the
> lookup. For example, I once heard of  some tweaking to avoid long
> collision chains for a C implementation of HashMap.

If you have long collision chains then you can obviously improve
by choosing a better hash functions.

But do you have long collision chains?

Java String hash is not that bad.

You can easily test number of collisions.

> Another way of looking at the problem is it would be nice to have a
> HashSet implementation that was considerably faster than a HashMap.
> IIRC, currently HashSet is implemented as a HashMap.

Not carrying data will not make it considerable faster.

Hashing and lookup are the same.

Arne


[toc] | [prev] | [next] | [standalone]


#19903

Frommarkspace <-@.>
Date2012-11-24 10:44 -0800
Message-ID<k8r4j2$kab$1@dont-email.me>
In reply to#19883
On 11/23/2012 10:42 PM, Roedy Green wrote:

> My question had two purposes.  To see if there was something available
> off the shelf, and to stimulate thought on some new algorithm that
> could have wider application that just my problem.


I think it's unlikely that you're going to get a better algorithm than 
Boyer-Moore or some other standard algorithm.  Those are standard 
algorithms for a reason: no one's come up with anything better.

You might try compiling a regex to do it.  Regex can be pretty efficient 
in many cases.  Nothing can beat a hand code parser though, so if you 
really need speed, that's the ticket.

Example regex:

"\\b(word1|word2|word3|...etc.)\\b"

You also might consider the case where you do have several matches in 
your character buffer. If you use a string and standard replacement, 
you'll copy the entire buffer for each match.  If the buffer is large, 
that's a lot of time spent just copying the same bytes around.  Look 
into making your own buffer/string that doesn't require making copies 
for each replacement.  However if you really do seldom have a match, 
this might not be worth your time investment.



[toc] | [prev] | [next] | [standalone]


#19932

From"Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org>
Date2012-11-25 13:40 +0000
Message-ID<scqdnb16Ttn_vS_NnZ2dnUVZ8t6dnZ2d@bt.com>
In reply to#19903
markspace wrote:

> I think it's unlikely that you're going to get a better algorithm than
> Boyer-Moore or some other standard algorithm.  Those are standard
> algorithms for a reason: no one's come up with anything better.

The current state-of-the-art lies well beyond Boyer-Moore (or 
Boyer-Moore-Horspool).  But I agree with the main point you make.  There are 
standard algorithms; use them!

> You might try compiling a regex to do it.  Regex can be pretty efficient
> in many cases.

If there is a need for extreme speed (as yet unproven) then I'd be very wary of 
the Java built-in regexp implementation.  At least the last time I looked 
(several years ago) it used some wretched non-linear backtracking 
implementation even when it wasn't forced to (by someone feeding it Perl-style 
"regexps").

    -- chris 

[toc] | [prev] | [next] | [standalone]


#19985

FromDaniele Futtorovic <da.futt.news@laposte-dot-net.invalid>
Date2012-11-26 22:03 +0100
Message-ID<k90les$q1j$1@dont-email.me>
In reply to#19883
On 24/11/2012 07:42, Roedy Green allegedly wrote:
> On Fri, 23 Nov 2012 17:33:43 -0800, markspace <-@.> wrote, quoted or
> indirectly quoted someone who said :
> 
>> I'm not sure what you are trying to say there.  You want the case where 
>> you do not find something in a hash map to be optimized?  "Optimized" how?
> 
>> What do you mean "add to the list of words" and "freeze"?
> 
> The following is not the real problem, but it might more simply
> illustrate what I am asking.
> 
> Think of an ordinary HashMap<String,String>
> 
> What it does is translate a few English words with French derivation,
> putting the French accents on them. e.g. naive -> na&iuml;ve Napoleon
> -> Napol&acute;on 
> 
> Let us say you have 100 such words you want to transform. (In my
> actual problem I have about 1500  words).
> 
> You go through the files for a website looking at each word of text
> (avoiding HTML markup) in the HashMap. If you find it you replace it.
> 
> Most of the time word you look up is not in the list.
> 
> This is a time-consuming process.  I would like to speed it up.

You might want to intern() the input to avoid having to recompute the
hash every time (if applicable). Other than that, you'll either be
wanting a better hashing algorithm, to avoid collisions, or indeed
something altogether fancier (but riskier in terms or RoI).

*shrug*

-- 
DF.

[toc] | [prev] | [next] | [standalone]


#19987

FromRobert Klemme <shortcutter@googlemail.com>
Date2012-11-26 23:32 +0100
Message-ID<ahi90qF48bvU1@mid.individual.net>
In reply to#19985
On 11/26/2012 10:03 PM, Daniele Futtorovic wrote:
> On 24/11/2012 07:42, Roedy Green allegedly wrote:

>> You go through the files for a website looking at each word of text
>> (avoiding HTML markup) in the HashMap. If you find it you replace it.
>>
>> Most of the time word you look up is not in the list.
>>
>> This is a time-consuming process.  I would like to speed it up.
>
> You might want to intern() the input to avoid having to recompute the
> hash every time (if applicable). Other than that, you'll either be
> wanting a better hashing algorithm, to avoid collisions, or indeed
> something altogether fancier (but riskier in terms or RoI).

How would interning help?  The input is read only once anyway and if you 
mean to intern individual words of the input then how does the JVM do 
the interning?  My guess would be that some form of hashing would be 
used there as well - plus that internal data structure must be thread 
safe...

Kind regards

	robert

[toc] | [prev] | [next] | [standalone]


#19992

FromDaniele Futtorovic <da.futt.news@laposte-dot-net.invalid>
Date2012-11-27 03:24 +0100
Message-ID<k918bb$45a$1@dont-email.me>
In reply to#19987
On 26/11/2012 23:32, Robert Klemme allegedly wrote:
> On 11/26/2012 10:03 PM, Daniele Futtorovic wrote:
>> On 24/11/2012 07:42, Roedy Green allegedly wrote:
> 
>>> You go through the files for a website looking at each word of text
>>> (avoiding HTML markup) in the HashMap. If you find it you replace it.
>>>
>>> Most of the time word you look up is not in the list.
>>>
>>> This is a time-consuming process.  I would like to speed it up.
>>
>> You might want to intern() the input to avoid having to recompute the
>> hash every time (if applicable). Other than that, you'll either be
>> wanting a better hashing algorithm, to avoid collisions, or indeed
>> something altogether fancier (but riskier in terms or RoI).
> 
> How would interning help?  The input is read only once anyway 

Depends on the input, of course. But natural text on the web (which
appears to be what this is about) is quite likely to contain the same
words more than once each.

> and if you
> mean to intern individual words of the input then how does the JVM do
> the interning?  

Like it does all interning? I must admit I couldn't lay out the details
off the top of my head, but the JLS should have them within reasonable
accuracy.

Of course, this would only be an option for a batch-like program. You
wouldn't want to clutter the string pool of a long-running app.

Interning would also perhaps allow one to use an IdentityHashMap, and
thus doing away with the (probably relatively costly) string comparisons.

For sure, this wouldn't be a replacement for more sophisticated
solutions, but could one of the things to try if it is to be kept "simple".

> My guess would be that some form of hashing would be
> used there as well - plus that internal data structure must be thread
> safe...

True.

-- 
DF.

[toc] | [prev] | [next] | [standalone]


#19993

FromDaniele Futtorovic <da.futt.news@laposte-dot-net.invalid>
Date2012-11-27 03:35 +0100
Message-ID<k918u3$686$1@dont-email.me>
In reply to#19992
On 27/11/2012 03:24, Daniele Futtorovic allegedly wrote:
> On 26/11/2012 23:32, Robert Klemme allegedly wrote:
>> On 11/26/2012 10:03 PM, Daniele Futtorovic wrote:
>>> On 24/11/2012 07:42, Roedy Green allegedly wrote:
>>
>>>> You go through the files for a website looking at each word of text
>>>> (avoiding HTML markup) in the HashMap. If you find it you replace it.
>>>>
>>>> Most of the time word you look up is not in the list.
>>>>
>>>> This is a time-consuming process.  I would like to speed it up.
>>>
>>> You might want to intern() the input to avoid having to recompute the
>>> hash every time (if applicable). Other than that, you'll either be
>>> wanting a better hashing algorithm, to avoid collisions, or indeed
>>> something altogether fancier (but riskier in terms or RoI).
>>
>> How would interning help?  The input is read only once anyway 
> 
> Depends on the input, of course. But natural text on the web (which
> appears to be what this is about) is quite likely to contain the same
> words more than once each.
> 
>> and if you
>> mean to intern individual words of the input then how does the JVM do
>> the interning?  
> 
> Like it does all interning? I must admit I couldn't lay out the details
> off the top of my head, but the JLS should have them within reasonable
> accuracy.
> 
> Of course, this would only be an option for a batch-like program. You
> wouldn't want to clutter the string pool of a long-running app.
> 
> Interning would also perhaps allow one to use an IdentityHashMap, and
> thus doing away with the (probably relatively costly) string comparisons.
> 
> For sure, this wouldn't be a replacement for more sophisticated
> solutions, but could one of the things to try if it is to be kept "simple".
> 
>> My guess would be that some form of hashing would be
>> used there as well - plus that internal data structure must be thread
>> safe...
> 
> True.
> 

Hm. According to Roedy himself
(<http://www.mindprod.com/jgloss/interned.html#UNDERTHEHOOD>), the JVM
uses a HashMap for intern()'d string lookup. So there may be no point in
doing it after all.

-- 
DF.

[toc] | [prev] | [next] | [standalone]


#19995

FromEric Sosman <esosman@comcast-dot-net.invalid>
Date2012-11-27 08:44 -0500
Message-ID<k92g3k$te0$1@dont-email.me>
In reply to#19993
On 11/26/2012 9:35 PM, Daniele Futtorovic wrote:
>
> Hm. According to Roedy himself
> (<http://www.mindprod.com/jgloss/interned.html#UNDERTHEHOOD>), the JVM
> uses a HashMap for intern()'d string lookup. So there may be no point in
> doing it after all.

     Here is the implementation of intern(), quoted in its
entirety from the Java 1.7 source:

	public native String intern();

Draw your own conclusions about its use of HashMap.

-- 
Eric Sosman
esosman@comcast-dot-net.invalid

[toc] | [prev] | [next] | [standalone]


#20000

FromDaniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date2012-11-27 14:20 -0800
Message-ID<x0bts.20122$W21.3302@newsfe27.iad>
In reply to#19995
On 11/27/12 5:44 AM, Eric Sosman wrote:
> On 11/26/2012 9:35 PM, Daniele Futtorovic wrote:
>>
>> Hm. According to Roedy himself
>> (<http://www.mindprod.com/jgloss/interned.html#UNDERTHEHOOD>), the JVM
>> uses a HashMap for intern()'d string lookup. So there may be no point in
>> doing it after all.
>
>      Here is the implementation of intern(), quoted in its
> entirety from the Java 1.7 source:
>
>      public native String intern();
>
> Draw your own conclusions about its use of HashMap.
>

I suspect it uses a hash map, not necessarily a HashMap. Of course, the 
underlying mechanism is probably implementation dependent.

[toc] | [prev] | [next] | [standalone]


#20020

FromDaniele Futtorovic <da.futt.news@laposte-dot-net.invalid>
Date2012-11-30 03:35 +0100
Message-ID<k9964h$rli$1@dont-email.me>
In reply to#20000
On 27/11/2012 23:20, Daniel Pitts allegedly wrote:
> On 11/27/12 5:44 AM, Eric Sosman wrote:
>> On 11/26/2012 9:35 PM, Daniele Futtorovic wrote:
>>>
>>> Hm. According to Roedy himself
>>> (<http://www.mindprod.com/jgloss/interned.html#UNDERTHEHOOD>), the JVM
>>> uses a HashMap for intern()'d string lookup. So there may be no point in
>>> doing it after all.
>>
>>      Here is the implementation of intern(), quoted in its
>> entirety from the Java 1.7 source:
>>
>>      public native String intern();
>>
>> Draw your own conclusions about its use of HashMap.
>>
> 
> I suspect it uses a hash map, not necessarily a HashMap. Of course, the
> underlying mechanism is probably implementation dependent.

Yeah, the camel casing was out of habit.

-- 
DF.

[toc] | [prev] | [next] | [standalone]


#19882

FromPatricia Shanahan <pats@acm.org>
Date2012-11-23 19:51 -0800
Message-ID<5Yidnbg3DrTK2S3NnZ2dnUVZ_uudnZ2d@earthlink.com>
In reply to#19875
On 11/23/2012 5:12 PM, Roedy Green wrote:
> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.
>
> I have to scan an entire website looking up every word.
>

Look up "perfect hash". The idea is that, given a set of strings, you
can calculate a hash function that maps each of those strings to a
different bucket. That avoids any chain searching due to bucket
collisions, and simplifies the data structures.

Patricia

[toc] | [prev] | [next] | [standalone]


#19890

From"Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org>
Date2012-11-24 10:21 +0000
Message-ID<AOKdncX-1L4NPC3NnZ2dnUVZ8k2dnZ2d@bt.com>
In reply to#19882
Patricia Shanahan wrote:
> > Is there something like HashMap but that optimised when nearly always
> > the thing you are looking up is not in the list, and when you can add
> > the list of words to look up and then freeze it.
[...]
> Look up "perfect hash".

Also worth considering (assuming that the standard HashSet isn't doing the job 
well enough):

Use hash set with a better hash function (but not a pre-computed perfect hash).

Look into the literature on fast text searching (for instance bit-parallel 
matching).  It's not entirely clear to me what Roedy is trying to do, but it 
sounds as if "bulk" matching/searching might be relevant.

    -- chris 

[toc] | [prev] | [next] | [standalone]


#19892

FromRoedy Green <see_website@mindprod.com.invalid>
Date2012-11-24 03:39 -0800
Message-ID<p7c1b8dk6uebqjvil7iuo56cq40j0haj94@4ax.com>
In reply to#19890
On Sat, 24 Nov 2012 10:21:14 -0000, "Chris Uppal"
<chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly
quoted someone who said :

>Look into the literature on fast text searching (for instance bit-parallel 
>matching).  It's not entirely clear to me what Roedy is trying to do, but it 
>sounds as if "bulk" matching/searching might be relevant.

Yes a Boyer-Moore to simultaneously search for the whole list of
words, then when it has a hit see if it has word in isolation rather
than a word fragment.
-- 
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish 
as couch potatoes who hire others to go to the gym for them. 

[toc] | [prev] | [next] | [standalone]


Page 1 of 2  [1] 2  Next page →

Back to top | Article view | comp.lang.java.programmer


csiph-web