From: Eric Sosman
Newsgroups: comp.lang.java.programmer
Subject: Re: email stop words
Date: Thu, 21 Mar 2013 14:15:10 -0400

On 3/21/2013 12:33 PM, markspace wrote:
> On 3/21/2013 6:24 AM, Eric Sosman wrote:
>>
>>     Integer count = map.get(word);
>>     map.put(word, count == null ? 1 : count + 1);
>
> Basically, yes.
>
>>
>> ... and that you switched to something more like
>>
>>     Integer count = map.get(word);
>>     map.put(word, new Integer(count == null
>>         ? 1 : count.intValue() + 1));
>>
>
> No, I made a Counter with a primitive and a reference to the word:
>
>     Counter counter = map.get( word );
>     if( counter == null ) {
>         counter = new Counter();
>         counter.word = word;
>         counter.count = 1;
>         map.put( word, counter );
>     } else
>         counter.count++;
>
>> If so, the slowdown is probably due to increased memory pressure
>> and garbage collection: `new' actually creates a new object every
>
> Yeah, that's what I thought too.  Although since there's only as many
> Counters as there are Strings (words), I don't get why just making a 2x
> change would slow the system as horribly as it did.  There should be
> only 4 million Strings and therefore also 4 million Counters.
> I can't figure out why that would be a problem.

It might be the "long tail" I mentioned earlier.  With the
second scheme you need four million Counter objects, while the
original used (perhaps) a hundred thousand large Integers plus
3.9 million references to the few small Integers in the static
pool.

Back of the envelope: The Map holds four million references
to Map.Entry objects, each of which holds a key reference, a
value reference, and a link.  With the Integer original, to this
you add a hundred thousand (same out-of-thin-air figure as before)
Integer instances.  Total: 16 million references, 4.1 million
objects.  The change to a "word-aware" Counter adds four million
more references and 3.9 million more objects.  Yeah, I can see
where that might have a teeny-tiny impact ...

> Also, any thoughts on the best way to observe a GC that is thrashing?
> I'm really curious to pin this down to some sort of root cause.  I
> couldn't rule out a coding error somewhere either.

Hmmm: I used to know something about tuning GC, but my
knowledge is about a decade out of date -- in an area that's had
a lot of R&D in the meantime.  There's some Java 6 stuff at

http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html

... but I haven't read it and can't assess it.

>> My suggestion would be to implement a Counter class that
>> wraps a mutable integer value.  Then you'd use
>
> Thanks, I'll take a look at this when I get a chance.  A good suggestion!

If I've understood you correctly, you've already done this --
and that's when the trouble started.  Perhaps the hybrid
Integer-or-Counter approach would help, though.

-- 
Eric Sosman
esosman@comcast-dot-net.invalid
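
A minimal sketch of what a hybrid Integer-or-Counter scheme could look
like.  The earlier description of the hybrid is not reproduced in this
message, so everything here is an assumption made for illustration: the
class name HybridWordCount, the Map<String, Object> value type, and the
promotion threshold of 127 (chosen because Integer.valueOf is guaranteed
to return cached instances for -128..127).  Rare words keep sharing the
pooled small Integers, and only words whose counts outgrow the pool get
a mutable Counter of their own.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the code from either poster.
final class HybridWordCount {

    /** Mutable holder, allocated only once a count outgrows the Integer cache. */
    static final class Counter {
        int count;
        Counter(int initial) { count = initial; }
    }

    // Assumption: promote at the top of the range Integer.valueOf must cache.
    static final int SMALL_LIMIT = 127;

    // Values are either a cached Integer (small counts) or a Counter (large counts).
    private final Map<String, Object> map = new HashMap<>();

    void add(String word) {
        Object value = map.get(word);
        if (value == null) {
            map.put(word, Integer.valueOf(1));      // shared instance, no allocation
        } else if (value instanceof Counter) {
            ((Counter) value).count++;              // mutate in place, no allocation
        } else {
            int next = ((Integer) value).intValue() + 1;
            if (next <= SMALL_LIMIT) {
                map.put(word, Integer.valueOf(next));   // still within the cache
            } else {
                map.put(word, new Counter(next));       // one allocation per frequent word
            }
        }
    }

    int count(String word) {
        Object value = map.get(word);
        if (value == null) {
            return 0;
        }
        if (value instanceof Counter) {
            return ((Counter) value).count;
        }
        return ((Integer) value).intValue();
    }

    public static void main(String[] args) {
        HybridWordCount wc = new HybridWordCount();
        for (int i = 0; i < 300; i++) {
            wc.add("the");
        }
        wc.add("zygote");
        System.out.println("the=" + wc.count("the") + " zygote=" + wc.count("zygote"));
    }
}
```

The cost is an instanceof test and a cast on every update, and the map
loses its static value type.  Whether that trade is worth it for a given
word distribution would need measuring.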