
Re: email stop words

From Eric Sosman <esosman@comcast-dot-net.invalid>
Newsgroups comp.lang.java.programmer
Subject Re: email stop words
Date 2013-03-21 14:15 -0400
Organization A noiseless patient Spider
Message-ID <kifijo$uaa$1@dont-email.me>
References <kidh9f$57s$1@dont-email.me> <kidrti$hgn$1@dont-email.me> <kif1jc$jrr$1@dont-email.me> <kifckl$p1f$1@dont-email.me>



On 3/21/2013 12:33 PM, markspace wrote:
> On 3/21/2013 6:24 AM, Eric Sosman wrote:
>>
>>      Integer count = map.get(word);
>>      map.put(word, count == null ? 1 : count + 1);
>
> Basically, yes.
>
>>
>> ... and that you switched to something more like
>>
>>      Integer count = map.get(word);
>>      map.put(word, new Integer(count == null
>>          ? 1 : count.intValue() + 1));
>>
>
> No, I made a Counter with a primitive and a reference to the word:
>
>    Counter counter = map.get( word );
>    if( counter == null ) {
>      counter = new Counter();
>      counter.word = word;
>      counter.count = 1;
>      map.put( word, counter );
>    } else
>      counter.count++;
>
>> If so, the slowdown is probably due to increased memory pressure
>> and garbage collection: `new' actually creates a new object every
>
> Yeah, that's what I thought too.  Although since there's only as many
> Counters as there are Strings (words), I don't get why just making a 2x
> change would slow the system as horribly as it did.  There should be
> only 4 million Strings and therefore also 4 million Counters.  I can't
> figure out why that would be a problem.

     It might be the "long tail" I mentioned earlier.  With the
second scheme you need four million Counter objects, while the
original used (perhaps) a hundred thousand large Integers plus
3.9 million references to the few small Integers in the static pool.
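
     To make the "static pool" concrete, here is a minimal sketch of
how autoboxing reuses small Integer objects. The class and method names
are mine, for illustration only. The language guarantees the shared
cache for values from -128 to 127; what happens above that range depends
on the JVM and its settings, so the last line of output is not something
to rely on.

```java
// Sketch: small boxed Integers come from a shared cache; large ones usually do not.
class BoxingCacheDemo {
    // True when boxing the same value twice yields the very same object.
    static boolean sharesInstance(int value) {
        Integer a = value;   // autoboxing goes through Integer.valueOf(value)
        Integer b = value;
        return a == b;       // identity comparison on purpose, not equals()
    }

    public static void main(String[] args) {
        System.out.println("1    shared: " + sharesInstance(1));
        System.out.println("127  shared: " + sharesInstance(127));
        // Typically false with default settings, but not guaranteed either way.
        System.out.println("1000 shared: " + sharesInstance(1000));
    }
}
```

     A word seen once or twice therefore costs no Integer allocation
at all in the autoboxing version, which is why the long tail matters.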

     Back of the envelope: The Map holds four million references
to Map.Entry objects, each of which holds a key reference, a
value reference, and a link.  With the Integer original, to this
you add a hundred thousand (same out-of-thin-air figure as before)
Integer instances.  Total: 16 million references, 4.1 million objects.

     The change to a "word-aware" Counter adds four million more
references and 3.9 million more objects.  Yeah, I can see where
that might have a teeny-tiny impact ...
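
     The same envelope arithmetic, written out so the figures can be
checked. The four million words and the hundred thousand uncached
Integers are the guesses from this thread, not measurements, and the
String keys are left out of the object count just as they are above.

```java
// Sketch of the back-of-the-envelope counts; all inputs are guesses from the thread.
class EnvelopeMath {
    static final long WORDS = 4_000_000L;         // distinct words in the map
    static final long LARGE_INTEGERS = 100_000L;  // out-of-thin-air figure

    // Table references to the entries, plus key, value, and link in each entry.
    static long integerSchemeReferences() {
        return WORDS + 3 * WORDS;
    }

    // One Map.Entry per word plus the Integers that fall outside the cache.
    static long integerSchemeObjects() {
        return WORDS + LARGE_INTEGERS;
    }

    // Every Counter holds one more reference, back to its word.
    static long counterExtraReferences() {
        return WORDS;
    }

    // Four million Counters take the place of a hundred thousand Integers.
    static long counterExtraObjects() {
        return WORDS - LARGE_INTEGERS;
    }

    public static void main(String[] args) {
        System.out.println("Integer scheme references: " + integerSchemeReferences());
        System.out.println("Integer scheme objects:    " + integerSchemeObjects());
        System.out.println("Counter extra references:  " + counterExtraReferences());
        System.out.println("Counter extra objects:     " + counterExtraObjects());
    }
}
```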

> Also, any thoughts on the best way to observe a GC that is thrashing?
> I'm really curious to pin this down to some sort of root cause.  I
> couldn't rule out a coding error somewhere either.

     Hmmm: I used to know something about tuning GC, but my knowledge
is about a decade out of date -- in an area that's had a lot of R&D
in the meantime.  There's some Java 6 stuff at

http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html

... but I haven't read it and can't assess it.
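
     One low-tech way to watch the collector from inside the program
is the standard java.lang.management API. The sketch below is only
that: the class name and the throwaway allocation loop are made up for
illustration, and the loop stands in for the real word-counting run.
Running the JVM with -verbose:gc, or pointing jstat -gcutil at the
process, should give a similar picture from the outside.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: compare GC counts and accumulated GC time before and after a workload.
class GcWatch {
    // Collections run so far, summed over every collector the JVM reports.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long n = gc.getCollectionCount();   // -1 means "undefined" for this collector
            if (n > 0) total += n;
        }
        return total;
    }

    // Approximate milliseconds spent collecting so far, summed the same way.
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long ms = gc.getCollectionTime();   // -1 means "undefined" for this collector
            if (ms > 0) total += ms;
        }
        return total;
    }

    public static void main(String[] args) {
        long countBefore = totalCollections();
        long millisBefore = totalCollectionMillis();
        long start = System.currentTimeMillis();

        // Placeholder workload: churn through short-lived objects.
        long sink = 0;
        for (int i = 0; i < 2_000_000; i++) {
            sink += new int[16].length;
        }

        long wall = System.currentTimeMillis() - start;
        System.out.println("collections: " + (totalCollections() - countBefore));
        System.out.println("gc millis:   " + (totalCollectionMillis() - millisBefore));
        System.out.println("wall millis: " + wall + " (sink " + sink + ")");
    }
}
```

     If the GC time is a large fraction of the wall time, that points
at the collector; if not, a coding error becomes the likelier suspect.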

>>      My suggestion would be to implement a Counter class that
>> wraps a mutable integer value.  Then you'd use
>
> Thanks, I'll take a look at this when I get a chance.  A good suggestion!

     If I've understood you correctly, you've already done this --
and that's when the trouble started.  Perhaps the hybrid Integer-
or-Counter approach would help, though.
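
     For what it's worth, here is one possible reading of the hybrid
idea, sketched under my own assumptions: keep counts as cached Integers
while they are small, and switch to a mutable Counter only when a count
outgrows the guaranteed cache range. The class names, the threshold of
127, and the Map<String, Object> value type are all illustrative, not
something taken from the original code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a hybrid scheme: cached Integers for rare words, Counters for common ones.
class HybridCounts {
    // Hypothetical mutable holder, allocated only for frequently seen words.
    static final class Counter {
        int count;
        Counter(int count) { this.count = count; }
    }

    // Upper end of the Integer cache range the language guarantees.
    static final int CACHE_MAX = 127;

    private final Map<String, Object> map = new HashMap<>();

    void add(String word) {
        Object value = map.get(word);
        if (value == null) {
            map.put(word, Integer.valueOf(1));           // comes from the cache
        } else if (value instanceof Counter) {
            ((Counter) value).count++;                   // mutate in place, no put
        } else {
            int next = ((Integer) value).intValue() + 1;
            if (next <= CACHE_MAX) {
                map.put(word, Integer.valueOf(next));    // still a cached Integer
            } else {
                map.put(word, new Counter(next));        // promote once, then mutate
            }
        }
    }

    int get(String word) {
        Object value = map.get(word);
        if (value == null) return 0;
        if (value instanceof Counter) return ((Counter) value).count;
        return ((Integer) value).intValue();
    }

    public static void main(String[] args) {
        HybridCounts counts = new HybridCounts();
        for (int i = 0; i < 200; i++) counts.add("the");
        counts.add("rare");
        System.out.println("the:  " + counts.get("the"));
        System.out.println("rare: " + counts.get("rare"));
    }
}
```

     The long tail of rare words then allocates nothing beyond its
Map.Entry, while the few common words pay for one Counter each.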

-- 
Eric Sosman
esosman@comcast-dot-net.invalid
