Groups > comp.lang.java.programmer > #22994 > unrolled thread

email stop words

Started by	markspace <markspace@nospam.nospam>
First post	2013-03-20 16:40 -0700
Last post	2013-03-21 06:58 +0000
Articles	16 — 9 participants

Back to article view | Back to comp.lang.java.programmer

  email stop words markspace <markspace@nospam.nospam> - 2013-03-20 16:40 -0700
    Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:13 -0400
      Re: email stop words Lew <lewbloch@gmail.com> - 2013-03-20 17:21 -0700
        Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:41 -0400
      Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 17:21 -0700
        Re: email stop words lipska the kat <"nospam at neversurrender dot co dot uk"> - 2013-03-21 09:31 +0000
    Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-20 20:51 -0500
    Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 19:41 -0700
      Re: email stop words Jukka Lahtinen <jtfjdehf@hotmail.com.invalid> - 2013-03-21 08:29 +0200
      Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 09:24 -0400
        Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 09:33 -0700
          Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 14:15 -0400
      Re: email stop words Joerg Meier <joergmmeier@arcor.de> - 2013-03-21 14:29 +0100
      Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-21 15:38 -0500
        Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 16:49 -0700
    Re: email stop words Fredrik Jonson <fredrik@jonson.org> - 2013-03-21 06:58 +0000

#22994 — email stop words

From	markspace <markspace@nospam.nospam>
Date	2013-03-20 16:40 -0700
Subject	email stop words
Message-ID	<kidh9f$57s$1@dont-email.me>

When indexing text files, there's a concept known as "stop words", which 
are basically really common words that you don't normally want to index.

I just got done with a preliminary part of a project, where I indexed my 
gmail inbox by parsing out all the white-space separated words from all 
of my emails.  For what it's worth, here's the 19 most common words in 
my inbox, out of over 600 million characters, nearly 4 million words, 
and probably almost 400,000 email messages.

So what I have here is a list of stop words for email. Here it is 
without further ado, enjoy.

3888905 words
top words:
 >, 1730013
to, 582496
=A0, 552868
the, 544917
with, 476503
by, 451309
Received:, 398679
id, 380817
this, 324269
of, 296506
SMTP, 285885
for, 252344
from, 244664
-0700, 234140
2010, 231221
 >>, 227200
a, 224202
(PDT), 220162
and, 217103
BUILD SUCCESSFUL (total time: 1 minute 8 seconds)

[toc] | [next] | [standalone]

#22995

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2013-03-20 20:13 -0400
Message-ID	<514a50a0$0$32115$14726298@news.sunsite.dk>
In reply to	#22994

On 3/20/2013 7:40 PM, markspace wrote:
> When indexing text files, there's a concept known as "stop words", which
> are basically really common words that you don't normally want to index.
>
> I just got done with a preliminary part of a project, where I indexed my
> gmail inbox by parsing out all the white-space separated words from all
> of my emails.  For what it's worth, here's the 19 most common words in
> my inbox, out of over 600 million characters, nearly 4 million words,
> and probably almost 400,000 email messages.
>
> So what I have here is a list of stop words for email. Here it is
> without further ado, enjoy.
>
> 3888905 words
> top words:
>  >, 1730013
> to, 582496
> =A0, 552868
> the, 544917
> with, 476503
> by, 451309
> Received:, 398679
> id, 380817
> this, 324269
> of, 296506
> SMTP, 285885
> for, 252344
> from, 244664
> -0700, 234140
> 2010, 231221
>  >>, 227200
> a, 224202
> (PDT), 220162
> and, 217103
> BUILD SUCCESSFUL (total time: 1 minute 8 seconds)

I would have discarded special characters:
    >=-()
up front.

Arne

[toc] | [prev] | [next] | [standalone]

#22997

From	Lew <lewbloch@gmail.com>
Date	2013-03-20 17:21 -0700
Message-ID	<5ca94924-be65-45ee-9e0d-38afde16a808@googlegroups.com>
In reply to	#22995

Arne Vajhøj wrote:
> I would have discarded special characters:
> 
>     >=-()
> 
> up front.

You need a hyphen to spell words like "laissez-faire" and "higgledy-piggledy".

Apostrophe is a "special" character but very common in English words (most possessives, for example).

Plus-sign appears in, for example, "Google+" and "G+" and "+1".

So you need to be judicious in your definition of "special".

-- 
Lew

[toc] | [prev] | [next] | [standalone]

#23003

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2013-03-20 20:41 -0400
Message-ID	<514a5752$0$32116$14726298@news.sunsite.dk>
In reply to	#22997

On 3/20/2013 8:21 PM, Lew wrote:
> Arne Vajhøj wrote:
>> I would have discarded special characters:
>>
>>      >=-()
>>
>> up front.
>
> You need a hyphen to spell words like "laissez-faire" and "higgledy-piggledy".

I would be willing to search for the two words "laissez" and "faire" and
manually sort out those cases where they were not hyphened together.

> Apostrophe is a "special" character but very common in English words (most possessives, for example).

I would again prefer to search for the base word.

> Plus-sign appears in, for example, "Google+" and "G+" and "+1".

I can see that point.

Arne

[toc] | [prev] | [next] | [standalone]

#22998

From	markspace <markspace@nospam.nospam>
Date	2013-03-20 17:21 -0700
Message-ID	<kidjmt$f8j$1@dont-email.me>
In reply to	#22995

On 3/20/2013 5:13 PM, Arne Vajhøj wrote:
>
> I would have discarded special characters:
>     >=-()
> up front.

It's not nearly that sophisticated yet.  In time, it will find actual 
words.  Right now I'm just making sure I can read the whole mess in a 
timely fashion.  You should have seen how long it took before 
ByteBuffers and using Regex.

[toc] | [prev] | [next] | [standalone]

#23016

From	lipska the kat <"nospam at neversurrender dot co dot uk">
Date	2013-03-21 09:31 +0000
Message-ID	<AOSdnVU1JL7lTtfMnZ2dnUVZ7o2dnZ2d@bt.com>
In reply to	#22998

On 21/03/13 00:21, markspace wrote:
> On 3/20/2013 5:13 PM, Arne Vajhøj wrote:
>>
>> I would have discarded special characters:
>> >=-()
>> up front.
>
> It's not nearly that sophisticated yet. In time, it will find actual
> words. Right now I'm just making sure I can read the whole mess in a
> timely fashion. You should have seen how long it took before ByteBuffers
> and using Regex.

This is just a suggestion relating to speed of indexing and may not be 
suitable but have you ever used lucene.

http://lucene.apache.org

I've used it to index and search text databases for several years now.
I'm always surprised at how fast the indexing is.

It's a pretty sophisticated piece of kit though and the code can get 
quite verbose.

Like I say, just a suggestion WRT speed

lipska

-- 
Lipska the Kat©: Troll hunter, sandbox destroyer
and farscape dreamer of Aeryn Sun

[toc] | [prev] | [next] | [standalone]

#23004

From	Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid>
Date	2013-03-20 20:51 -0500
Message-ID	<kidp0d$67v$1@dont-email.me>
In reply to	#22994

On 3/20/2013 6:40 PM, markspace wrote:
> When indexing text files, there's a concept known as "stop words", which
> are basically really common words that you don't normally want to index.
>
> I just got done with a preliminary part of a project, where I indexed my
> gmail inbox by parsing out all the white-space separated words from all
> of my emails.  For what it's worth, here's the 19 most common words in
> my inbox, out of over 600 million characters, nearly 4 million words,
> and probably almost 400,000 email messages.
>
> So what I have here is a list of stop words for email. Here it is
> without further ado, enjoy.
>
> 3888905 words
> top words:
>  >, 1730013
> to, 582496
> =A0, 552868

The presence of =XX (where X is a hexadecimal character) is heavy 
indication that you are failing to decode quoted-printable bodies 
correctly, which makes the rest of your data suspect.

> Received:, 398679

This is clearly an RFC 822 message header. You should probably be 
performing some sort of normalization on message bodies to avoid parsing 
this (you might be parsing, e.g., message/rfc822 as plain text or a 
message/delivery-notification as plaintext.

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth

[toc] | [prev] | [next] | [standalone]

#23006

From	markspace <markspace@nospam.nospam>
Date	2013-03-20 19:41 -0700
Message-ID	<kidrti$hgn$1@dont-email.me>
In reply to	#22994

On 3/20/2013 4:40 PM, markspace wrote:
> When indexing text files, there's a concept known as "stop words", which
> are basically really common words that you don't normally want to index.
>
> I just got done with a preliminary part of a project, where I indexed my
> gmail inbox by parsing out all the white-space separated words from all
> of my emails.  For what it's worth, here's the 19 most common words in
> my inbox, out of over 600 million characters, nearly 4 million words,
> and probably almost 400,000 email messages.

OK, weird.  I removed the auto-boxing from the HashMap I was using, and 
it got MUCH slower.  I'd estimate 30x slower, although I didn't let the 
biggest test run to completion.

Any ideas?

Mine run to: I ended up making a lot more objects to avoid the immutable 
Integer, and ended up using so much memory the garbage collector started 
trashing.

Or, Integer shares low values (<127) and doesn't create new objects. 
There's so many of those low numbers that this ends up saving memory, 
and objects, in the long run (my biggest test case, in other words).

Other thoughts?

[toc] | [prev] | [next] | [standalone]

#23010

From	Jukka Lahtinen <jtfjdehf@hotmail.com.invalid>
Date	2013-03-21 08:29 +0200
Message-ID	<lvsj3p8dvy.fsf@saunalahti.fi>
In reply to	#23006

markspace <markspace@nospam.nospam> writes:

> Mine run to: I ended up making a lot more objects to avoid the immutable
> Integer, and ended up using so much memory the garbage collector started
> trashing.
> Or, Integer shares low values (<127) and doesn't create new
> objects. There's so many of those low numbers that this ends up saving
> memory, and objects, in the long run (my biggest test case, in other

Do you use new Integer(x) where you should use Integer.parseInt(x)?

-- 
Jukka Lahtinen

[toc] | [prev] | [next] | [standalone]

#23027

From	Eric Sosman <esosman@comcast-dot-net.invalid>
Date	2013-03-21 09:24 -0400
Message-ID	<kif1jc$jrr$1@dont-email.me>
In reply to	#23006

On 3/20/2013 10:41 PM, markspace wrote:
> On 3/20/2013 4:40 PM, markspace wrote:
>> When indexing text files, there's a concept known as "stop words", which
>> are basically really common words that you don't normally want to index.
>>
>> I just got done with a preliminary part of a project, where I indexed my
>> gmail inbox by parsing out all the white-space separated words from all
>> of my emails.  For what it's worth, here's the 19 most common words in
>> my inbox, out of over 600 million characters, nearly 4 million words,
>> and probably almost 400,000 email messages.
>
>
> OK, weird.  I removed the auto-boxing from the HashMap I was using, and
> it got MUCH slower.  I'd estimate 30x slower, although I didn't let the
> biggest test run to completion.
>
> Any ideas?
>
> Mine run to: I ended up making a lot more objects to avoid the immutable
> Integer, and ended up using so much memory the garbage collector started
> trashing.

     You didn't say exactly what your code was like, but I imagine
you were using a HashMap<Word,Integer> and processing each Word
with something along the lines of

	Integer count = map.get(word);
	map.put(word, count == null ? 1 : count + 1);

... and that you switched to something more like

	Integer count = map.get(word);
	map.put(word, new Integer(count == null
	    ? 1 : count.intValue() + 1);

If so, the slowdown is probably due to increased memory pressure
and garbage collection: `new' actually creates a new object every
time, while auto-boxing uses (the equivalent of) Integer.valueOf().
The latter maintains a pool of a couple hundred small-valued Integers
and doles them out whenever needed, using `new' only for un-pooled
values.

     (It's probably not the frequent words that kill you, since
their counts will grow too large to be satisfied from the pool.
But the "long tail" may be murder: Using `new' you'll have many
millions of Integer objects representing small numbers, whereas
by using auto-boxing every "five" would be the same object.)

     My suggestion would be to implement a Counter class that
wraps a mutable integer value.  Then you'd use

	HashMap<Word,Counter> map = ...;
	Counter count = map.get(word);
	if (count == null)
	    map.put(word, new Counter());
	count.increment();

This way you'd create just one Counter per unique Word, instead
of one Integer for every occurrence of every word.  To deal with
the long tail, make Counter extend Number and use

	HashMap<Word,Number> map = ...;
	Number count = map.get(word);
	if (count == null) {
	    map.put(word, 1);
	} else if (count < THRESHOLD) {
	    map.put(word, count + 1);
	} else {
	    if (count == THRESHOLD) {
	        count = new Counter(THRESHOLD);
	        map.put(word, count);
	    }
	    count.increment();
	}

This uses the pooled Integers as long as they last (they're created
anyhow, so they take no additional space), and then one Counter
object for each Word that appears more than THRESHOLD times.

     Or, you could just go back to auto-boxing.

-- 
Eric Sosman
esosman@comcast-dot-net.invalid

[toc] | [prev] | [next] | [standalone]

#23034

From	markspace <markspace@nospam.nospam>
Date	2013-03-21 09:33 -0700
Message-ID	<kifckl$p1f$1@dont-email.me>
In reply to	#23027

On 3/21/2013 6:24 AM, Eric Sosman wrote:
>
>      Integer count = map.get(word);
>      map.put(word, count == null ? 1 : count + 1);

Basically, yes.

>
> ... and that you switched to something more like
>
>      Integer count = map.get(word);
>      map.put(word, new Integer(count == null
>          ? 1 : count.intValue() + 1);
>

No, I made a Counter with a primitive and a reference to the word:

   Counter counter = map.get( word );
   if( counter == null ) {
     counter = new Counter();
     counter.word = word;
     counter.count = 1;
     map.put( word, counter );
   } else
     counter.count++;

> If so, the slowdown is probably due to increased memory pressure
> and garbage collection: `new' actually creates a new object every

Yeah, that's what I thought too.  Although since there's only as many 
Counters as there are Strings (words), I don't get why just making a 2x 
change would slow the system as horribly as it did.  There should be 
only 4 million Strings and therefore also 4 million Counters.  I can't 
figure out why that would be a problem.

> time, while auto-boxing uses (the equivalent of) Integer.valueOf().
> The latter maintains a pool of a couple hundred small-valued Integers
> and doles them out whenever needed, using `new' only for un-pooled
> values.

I think it would be worth it to change the JVM memory parameters from 
the defaults and see if that makes a difference.

Also, any thoughts on the best way to observe a GC that is thrashing? 
I'm really curious to pin this down to some sort of root cause.  I 
couldn't rule out a coding error somewhere either.

>      My suggestion would be to implement a Counter class that
> wraps a mutable integer value.  Then you'd use

Thanks, I'll take a look at this when I get a chance.  A good suggestion!

>      Or, you could just go back to auto-boxing.

Yes, A-B-A testing works.  Going back to auto-boxing restored the 
previous run times, so I'm fairly certain it's related to memory 
pressure or something similar.

[toc] | [prev] | [next] | [standalone]

#23037

From	Eric Sosman <esosman@comcast-dot-net.invalid>
Date	2013-03-21 14:15 -0400
Message-ID	<kifijo$uaa$1@dont-email.me>
In reply to	#23034

On 3/21/2013 12:33 PM, markspace wrote:
> On 3/21/2013 6:24 AM, Eric Sosman wrote:
>>
>>      Integer count = map.get(word);
>>      map.put(word, count == null ? 1 : count + 1);
>
> Basically, yes.
>
>>
>> ... and that you switched to something more like
>>
>>      Integer count = map.get(word);
>>      map.put(word, new Integer(count == null
>>          ? 1 : count.intValue() + 1);
>>
>
> No, I made a Counter with a primitive and a reference to the word:
>
>    Counter counter = map.get( word );
>    if( counter == null ) {
>      counter = new Counter();
>      counter.word = word;
>      counter.count = 1;
>      map.put( word, counter );
>    } else
>      counter.count++;
>
>> If so, the slowdown is probably due to increased memory pressure
>> and garbage collection: `new' actually creates a new object every
>
> Yeah, that's what I thought too.  Although since there's only as many
> Counters as there are Strings (words), I don't get why just making a 2x
> change would slow the system as horribly as it did.  There should be
> only 4 million Strings and therefore also 4 million Counters.  I can't
> figure out why that would be a problem.

     It might be the "long tail" I mentioned earlier.  With the
second scheme you need four million Counter objects, while the
original used (perhaps) a hundred thousand large Integers plus
3.9 million references to the few small Integers in the static pool.

     Back of the envelope: The Map holds four million references
to Map.Entry objects, each of which holds a key reference, a
value reference, and a link.  With the Integer original, to this
you add a hundred thousand (same out-of-thin-air figure as before)
Integer instances.  Total: 16 million references, 4.1 million objects.

     The change to a "word-aware" Counter adds four million more
references and 3.9 million more objects.  Yeah, I can see where
that might have a teeny-tiny impact ...

> Also, any thoughts on the best way to observe a GC that is thrashing?
> I'm really curious to pin this down to some sort of root cause.  I
> couldn't rule out a coding error somewhere either.

     Hmmm: I used to know something about tuning GC, but my knowledge
is about a decade out of date -- in an area that's had a lot of R&D
in the meantime.  There's some Java 6 stuff at

http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html

... but I haven't read it and can't assess it.

>>      My suggestion would be to implement a Counter class that
>> wraps a mutable integer value.  Then you'd use
>
> Thanks, I'll take a look at this when I get a chance.  A good suggestion!

     If I've understood you correctly, you've already done this --
and that's when the trouble started.  Perhaps the hybrid Integer-
or-Counter approach would help, though.

-- 
Eric Sosman
esosman@comcast-dot-net.invalid

[toc] | [prev] | [next] | [standalone]

#23028

From	Joerg Meier <joergmmeier@arcor.de>
Date	2013-03-21 14:29 +0100
Message-ID	<127vq2mlerlkc$.1vkw26e34n0o8$.dlg@40tude.net>
In reply to	#23006

On Wed, 20 Mar 2013 19:41:38 -0700, markspace wrote:

> Or, Integer shares low values (<127) and doesn't create new objects. 

If you are asking: yes, it does.

Liebe Gruesse,
		Joerg

-- 
Ich lese meine Emails nicht, replies to Email bleiben also leider
ungelesen.

[toc] | [prev] | [next] | [standalone]

#23042

From	Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid>
Date	2013-03-21 15:38 -0500
Message-ID	<kifr01$ip2$1@dont-email.me>
In reply to	#23006

On 3/20/2013 9:41 PM, markspace wrote:
> On 3/20/2013 4:40 PM, markspace wrote:
>> When indexing text files, there's a concept known as "stop words", which
>> are basically really common words that you don't normally want to index.
>>
>> I just got done with a preliminary part of a project, where I indexed my
>> gmail inbox by parsing out all the white-space separated words from all
>> of my emails.  For what it's worth, here's the 19 most common words in
>> my inbox, out of over 600 million characters, nearly 4 million words,
>> and probably almost 400,000 email messages.
>
>
> OK, weird.  I removed the auto-boxing from the HashMap I was using, and
> it got MUCH slower.  I'd estimate 30x slower, although I didn't let the
> biggest test run to completion.
>
> Any ideas?

The JVM is optimizing autoboxed integers? Looking briefly at the code 
for OpenJDK, there is definitely optimizations for autoboxing that do 
things like "if foo is from the Integer cache, turn foo.value into 
address_of_foo - address_of_IntegerCache"

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth

[toc] | [prev] | [next] | [standalone]

#23052

From	markspace <markspace@nospam.nospam>
Date	2013-03-21 16:49 -0700
Message-ID	<kig67g$jqu$1@dont-email.me>
In reply to	#23042

On 3/21/2013 1:38 PM, Joshua Cranmer 🐧 wrote:
>
> The JVM is optimizing autoboxed integers?

That occurred to me, but since I know nothing about how the JVM might do 
that, I didn't speculate.

> Looking briefly at the code
> for OpenJDK, there is definitely optimizations for autoboxing that do
> things like "if foo is from the Integer cache, turn foo.value into
> address_of_foo - address_of_IntegerCache"
>

Interesting.  I mostly just increment Integers by one.  If the addresses 
of pre-allocated Integers (<127) are all stored sequentially, then 
that's equivalent to adding a constant offset to each address, about the 
same overhead as just adding 1.  Hmmm, that's 100% speculation, but a 
cool optimization if they do it that way.

[toc] | [prev] | [next] | [standalone]

#23011

From	Fredrik Jonson <fredrik@jonson.org>
Date	2013-03-21 06:58 +0000
Message-ID	<slrnkklbsp.vov.fredrik@biggles.jonson.org>
In reply to	#22994

Stefan Ram wrote:
 
>  cat mbox | egrep -o '\w+' | sort | uniq -c | sort -nr | head -9

You have a "useless use of cat" in that row of pipes. ITYM:

 egrep -o '\w+' mbox | sort | uniq -c | sort -nr | head -9

http://www.partmaps.org/era/unix/award.html

Tip of the hat for the neat use of command line tools.

-- 
Fredrik Jonson

[toc] | [prev] | [standalone]

csiph-web

email stop words

Contents

#22994 — email stop words

#22995

#22997

#23003

#22998

#23016

#23004

#23006

#23010

#23027

#23034

#23037

#23028

#23042

#23052

#23011