Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.java.programmer > #23027

Re: email stop words

From Eric Sosman <esosman@comcast-dot-net.invalid>
Newsgroups comp.lang.java.programmer
Subject Re: email stop words
Date 2013-03-21 09:24 -0400
Organization A noiseless patient Spider
Message-ID <kif1jc$jrr$1@dont-email.me> (permalink)
References <kidh9f$57s$1@dont-email.me> <kidrti$hgn$1@dont-email.me>

Show all headers | View raw


On 3/20/2013 10:41 PM, markspace wrote:
> On 3/20/2013 4:40 PM, markspace wrote:
>> When indexing text files, there's a concept known as "stop words", which
>> are basically really common words that you don't normally want to index.
>>
>> I just got done with a preliminary part of a project, where I indexed my
>> gmail inbox by parsing out all the white-space separated words from all
>> of my emails.  For what it's worth, here's the 19 most common words in
>> my inbox, out of over 600 million characters, nearly 4 million words,
>> and probably almost 400,000 email messages.
>
>
> OK, weird.  I removed the auto-boxing from the HashMap I was using, and
> it got MUCH slower.  I'd estimate 30x slower, although I didn't let the
> biggest test run to completion.
>
> Any ideas?
>
> Mine run to: I ended up making a lot more objects to avoid the immutable
> Integer, and ended up using so much memory the garbage collector started
> trashing.

     You didn't say exactly what your code was like, but I imagine
you were using a HashMap<Word,Integer> and processing each Word
with something along the lines of

	Integer count = map.get(word);
	map.put(word, count == null ? 1 : count + 1);

... and that you switched to something more like

	Integer count = map.get(word);
	map.put(word, new Integer(count == null
	    ? 1 : count.intValue() + 1);

If so, the slowdown is probably due to increased memory pressure
and garbage collection: `new' actually creates a new object every
time, while auto-boxing uses (the equivalent of) Integer.valueOf().
The latter maintains a pool of a couple hundred small-valued Integers
and doles them out whenever needed, using `new' only for un-pooled
values.

     (It's probably not the frequent words that kill you, since
their counts will grow too large to be satisfied from the pool.
But the "long tail" may be murder: Using `new' you'll have many
millions of Integer objects representing small numbers, whereas
by using auto-boxing every "five" would be the same object.)

     My suggestion would be to implement a Counter class that
wraps a mutable integer value.  Then you'd use

	HashMap<Word,Counter> map = ...;
	Counter count = map.get(word);
	if (count == null)
	    map.put(word, new Counter());
	count.increment();

This way you'd create just one Counter per unique Word, instead
of one Integer for every occurrence of every word.  To deal with
the long tail, make Counter extend Number and use

	HashMap<Word,Number> map = ...;
	Number count = map.get(word);
	if (count == null) {
	    map.put(word, 1);
	} else if (count < THRESHOLD) {
	    map.put(word, count + 1);
	} else {
	    if (count == THRESHOLD) {
	        count = new Counter(THRESHOLD);
	        map.put(word, count);
	    }
	    count.increment();
	}

This uses the pooled Integers as long as they last (they're created
anyhow, so they take no additional space), and then one Counter
object for each Word that appears more than THRESHOLD times.

     Or, you could just go back to auto-boxing.

-- 
Eric Sosman
esosman@comcast-dot-net.invalid

Back to comp.lang.java.programmer | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

email stop words markspace <markspace@nospam.nospam> - 2013-03-20 16:40 -0700
  Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:13 -0400
    Re: email stop words Lew <lewbloch@gmail.com> - 2013-03-20 17:21 -0700
      Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:41 -0400
    Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 17:21 -0700
      Re: email stop words lipska the kat <"nospam at neversurrender dot co dot uk"> - 2013-03-21 09:31 +0000
  Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-20 20:51 -0500
  Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 19:41 -0700
    Re: email stop words Jukka Lahtinen <jtfjdehf@hotmail.com.invalid> - 2013-03-21 08:29 +0200
    Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 09:24 -0400
      Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 09:33 -0700
        Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 14:15 -0400
    Re: email stop words Joerg Meier <joergmmeier@arcor.de> - 2013-03-21 14:29 +0100
    Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-21 15:38 -0500
      Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 16:49 -0700
  Re: email stop words Fredrik Jonson <fredrik@jonson.org> - 2013-03-21 06:58 +0000

csiph-web