Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.java.programmer > #22994 > unrolled thread
| Started by | markspace <markspace@nospam.nospam> |
|---|---|
| First post | 2013-03-20 16:40 -0700 |
| Last post | 2013-03-21 06:58 +0000 |
| Articles | 16 — 9 participants |
Back to article view | Back to comp.lang.java.programmer
email stop words markspace <markspace@nospam.nospam> - 2013-03-20 16:40 -0700
Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:13 -0400
Re: email stop words Lew <lewbloch@gmail.com> - 2013-03-20 17:21 -0700
Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:41 -0400
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 17:21 -0700
Re: email stop words lipska the kat <"nospam at neversurrender dot co dot uk"> - 2013-03-21 09:31 +0000
Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-20 20:51 -0500
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 19:41 -0700
Re: email stop words Jukka Lahtinen <jtfjdehf@hotmail.com.invalid> - 2013-03-21 08:29 +0200
Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 09:24 -0400
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 09:33 -0700
Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 14:15 -0400
Re: email stop words Joerg Meier <joergmmeier@arcor.de> - 2013-03-21 14:29 +0100
Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-21 15:38 -0500
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 16:49 -0700
Re: email stop words Fredrik Jonson <fredrik@jonson.org> - 2013-03-21 06:58 +0000
| From | markspace <markspace@nospam.nospam> |
|---|---|
| Date | 2013-03-20 16:40 -0700 |
| Subject | email stop words |
| Message-ID | <kidh9f$57s$1@dont-email.me> |
When indexing text files, there's a concept known as "stop words", which are basically really common words that you don't normally want to index. I just got done with a preliminary part of a project, where I indexed my gmail inbox by parsing out all the white-space separated words from all of my emails. For what it's worth, here's the 19 most common words in my inbox, out of over 600 million characters, nearly 4 million words, and probably almost 400,000 email messages. So what I have here is a list of stop words for email. Here it is without further ado, enjoy. 3888905 words top words: >, 1730013 to, 582496 =A0, 552868 the, 544917 with, 476503 by, 451309 Received:, 398679 id, 380817 this, 324269 of, 296506 SMTP, 285885 for, 252344 from, 244664 -0700, 234140 2010, 231221 >>, 227200 a, 224202 (PDT), 220162 and, 217103 BUILD SUCCESSFUL (total time: 1 minute 8 seconds)
[toc] | [next] | [standalone]
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2013-03-20 20:13 -0400 |
| Message-ID | <514a50a0$0$32115$14726298@news.sunsite.dk> |
| In reply to | #22994 |
On 3/20/2013 7:40 PM, markspace wrote:
> When indexing text files, there's a concept known as "stop words", which
> are basically really common words that you don't normally want to index.
>
> I just got done with a preliminary part of a project, where I indexed my
> gmail inbox by parsing out all the white-space separated words from all
> of my emails. For what it's worth, here's the 19 most common words in
> my inbox, out of over 600 million characters, nearly 4 million words,
> and probably almost 400,000 email messages.
>
> So what I have here is a list of stop words for email. Here it is
> without further ado, enjoy.
>
> 3888905 words
> top words:
> >, 1730013
> to, 582496
> =A0, 552868
> the, 544917
> with, 476503
> by, 451309
> Received:, 398679
> id, 380817
> this, 324269
> of, 296506
> SMTP, 285885
> for, 252344
> from, 244664
> -0700, 234140
> 2010, 231221
> >>, 227200
> a, 224202
> (PDT), 220162
> and, 217103
> BUILD SUCCESSFUL (total time: 1 minute 8 seconds)
I would have discarded special characters:
>=-()
up front.
Arne
[toc] | [prev] | [next] | [standalone]
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2013-03-20 17:21 -0700 |
| Message-ID | <5ca94924-be65-45ee-9e0d-38afde16a808@googlegroups.com> |
| In reply to | #22995 |
Arne Vajhøj wrote: > I would have discarded special characters: > > >=-() > > up front. You need a hyphen to spell words like "laissez-faire" and "higgledy-piggledy". Apostrophe is a "special" character but very common in English words (most possessives, for example). Plus-sign appears in, for example, "Google+" and "G+" and "+1". So you need to be judicious in your definition of "special". -- Lew
[toc] | [prev] | [next] | [standalone]
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2013-03-20 20:41 -0400 |
| Message-ID | <514a5752$0$32116$14726298@news.sunsite.dk> |
| In reply to | #22997 |
On 3/20/2013 8:21 PM, Lew wrote: > Arne Vajhøj wrote: >> I would have discarded special characters: >> >> >=-() >> >> up front. > > You need a hyphen to spell words like "laissez-faire" and "higgledy-piggledy". I would be willing to search for the two words "laissez" and "faire" and manually sort out those cases where they were not hyphened together. > Apostrophe is a "special" character but very common in English words (most possessives, for example). I would again prefer to search for the base word. > Plus-sign appears in, for example, "Google+" and "G+" and "+1". I can see that point. Arne
[toc] | [prev] | [next] | [standalone]
| From | markspace <markspace@nospam.nospam> |
|---|---|
| Date | 2013-03-20 17:21 -0700 |
| Message-ID | <kidjmt$f8j$1@dont-email.me> |
| In reply to | #22995 |
On 3/20/2013 5:13 PM, Arne Vajhøj wrote: > > I would have discarded special characters: > >=-() > up front. It's not nearly that sophisticated yet. In time, it will find actual words. Right now I'm just making sure I can read the whole mess in a timely fashion. You should have seen how long it took before ByteBuffers and using Regex.
[toc] | [prev] | [next] | [standalone]
| From | lipska the kat <"nospam at neversurrender dot co dot uk"> |
|---|---|
| Date | 2013-03-21 09:31 +0000 |
| Message-ID | <AOSdnVU1JL7lTtfMnZ2dnUVZ7o2dnZ2d@bt.com> |
| In reply to | #22998 |
On 21/03/13 00:21, markspace wrote: > On 3/20/2013 5:13 PM, Arne Vajhøj wrote: >> >> I would have discarded special characters: >> >=-() >> up front. > > It's not nearly that sophisticated yet. In time, it will find actual > words. Right now I'm just making sure I can read the whole mess in a > timely fashion. You should have seen how long it took before ByteBuffers > and using Regex. This is just a suggestion relating to speed of indexing and may not be suitable but have you ever used lucene. http://lucene.apache.org I've used it to index and search text databases for several years now. I'm always surprised at how fast the indexing is. It's a pretty sophisticated piece of kit though and the code can get quite verbose. Like I say, just a suggestion WRT speed lipska -- Lipska the Kat©: Troll hunter, sandbox destroyer and farscape dreamer of Aeryn Sun
[toc] | [prev] | [next] | [standalone]
| From | Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> |
|---|---|
| Date | 2013-03-20 20:51 -0500 |
| Message-ID | <kidp0d$67v$1@dont-email.me> |
| In reply to | #22994 |
On 3/20/2013 6:40 PM, markspace wrote: > When indexing text files, there's a concept known as "stop words", which > are basically really common words that you don't normally want to index. > > I just got done with a preliminary part of a project, where I indexed my > gmail inbox by parsing out all the white-space separated words from all > of my emails. For what it's worth, here's the 19 most common words in > my inbox, out of over 600 million characters, nearly 4 million words, > and probably almost 400,000 email messages. > > So what I have here is a list of stop words for email. Here it is > without further ado, enjoy. > > 3888905 words > top words: > >, 1730013 > to, 582496 > =A0, 552868 The presence of =XX (where X is a hexadecimal character) is heavy indication that you are failing to decode quoted-printable bodies correctly, which makes the rest of your data suspect. > Received:, 398679 This is clearly an RFC 822 message header. You should probably be performing some sort of normalization on message bodies to avoid parsing this (you might be parsing, e.g., message/rfc822 as plain text or a message/delivery-notification as plaintext. -- Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
[toc] | [prev] | [next] | [standalone]
| From | markspace <markspace@nospam.nospam> |
|---|---|
| Date | 2013-03-20 19:41 -0700 |
| Message-ID | <kidrti$hgn$1@dont-email.me> |
| In reply to | #22994 |
On 3/20/2013 4:40 PM, markspace wrote: > When indexing text files, there's a concept known as "stop words", which > are basically really common words that you don't normally want to index. > > I just got done with a preliminary part of a project, where I indexed my > gmail inbox by parsing out all the white-space separated words from all > of my emails. For what it's worth, here's the 19 most common words in > my inbox, out of over 600 million characters, nearly 4 million words, > and probably almost 400,000 email messages. OK, weird. I removed the auto-boxing from the HashMap I was using, and it got MUCH slower. I'd estimate 30x slower, although I didn't let the biggest test run to completion. Any ideas? Mine run to: I ended up making a lot more objects to avoid the immutable Integer, and ended up using so much memory the garbage collector started trashing. Or, Integer shares low values (<127) and doesn't create new objects. There's so many of those low numbers that this ends up saving memory, and objects, in the long run (my biggest test case, in other words). Other thoughts?
[toc] | [prev] | [next] | [standalone]
| From | Jukka Lahtinen <jtfjdehf@hotmail.com.invalid> |
|---|---|
| Date | 2013-03-21 08:29 +0200 |
| Message-ID | <lvsj3p8dvy.fsf@saunalahti.fi> |
| In reply to | #23006 |
markspace <markspace@nospam.nospam> writes: > Mine run to: I ended up making a lot more objects to avoid the immutable > Integer, and ended up using so much memory the garbage collector started > trashing. > Or, Integer shares low values (<127) and doesn't create new > objects. There's so many of those low numbers that this ends up saving > memory, and objects, in the long run (my biggest test case, in other Do you use new Integer(x) where you should use Integer.parseInt(x)? -- Jukka Lahtinen
[toc] | [prev] | [next] | [standalone]
| From | Eric Sosman <esosman@comcast-dot-net.invalid> |
|---|---|
| Date | 2013-03-21 09:24 -0400 |
| Message-ID | <kif1jc$jrr$1@dont-email.me> |
| In reply to | #23006 |
On 3/20/2013 10:41 PM, markspace wrote:
> On 3/20/2013 4:40 PM, markspace wrote:
>> When indexing text files, there's a concept known as "stop words", which
>> are basically really common words that you don't normally want to index.
>>
>> I just got done with a preliminary part of a project, where I indexed my
>> gmail inbox by parsing out all the white-space separated words from all
>> of my emails. For what it's worth, here's the 19 most common words in
>> my inbox, out of over 600 million characters, nearly 4 million words,
>> and probably almost 400,000 email messages.
>
>
> OK, weird. I removed the auto-boxing from the HashMap I was using, and
> it got MUCH slower. I'd estimate 30x slower, although I didn't let the
> biggest test run to completion.
>
> Any ideas?
>
> Mine run to: I ended up making a lot more objects to avoid the immutable
> Integer, and ended up using so much memory the garbage collector started
> trashing.
You didn't say exactly what your code was like, but I imagine
you were using a HashMap<Word,Integer> and processing each Word
with something along the lines of
Integer count = map.get(word);
map.put(word, count == null ? 1 : count + 1);
... and that you switched to something more like
Integer count = map.get(word);
map.put(word, new Integer(count == null
? 1 : count.intValue() + 1);
If so, the slowdown is probably due to increased memory pressure
and garbage collection: `new' actually creates a new object every
time, while auto-boxing uses (the equivalent of) Integer.valueOf().
The latter maintains a pool of a couple hundred small-valued Integers
and doles them out whenever needed, using `new' only for un-pooled
values.
(It's probably not the frequent words that kill you, since
their counts will grow too large to be satisfied from the pool.
But the "long tail" may be murder: Using `new' you'll have many
millions of Integer objects representing small numbers, whereas
by using auto-boxing every "five" would be the same object.)
My suggestion would be to implement a Counter class that
wraps a mutable integer value. Then you'd use
HashMap<Word,Counter> map = ...;
Counter count = map.get(word);
if (count == null)
map.put(word, new Counter());
count.increment();
This way you'd create just one Counter per unique Word, instead
of one Integer for every occurrence of every word. To deal with
the long tail, make Counter extend Number and use
HashMap<Word,Number> map = ...;
Number count = map.get(word);
if (count == null) {
map.put(word, 1);
} else if (count < THRESHOLD) {
map.put(word, count + 1);
} else {
if (count == THRESHOLD) {
count = new Counter(THRESHOLD);
map.put(word, count);
}
count.increment();
}
This uses the pooled Integers as long as they last (they're created
anyhow, so they take no additional space), and then one Counter
object for each Word that appears more than THRESHOLD times.
Or, you could just go back to auto-boxing.
--
Eric Sosman
esosman@comcast-dot-net.invalid
[toc] | [prev] | [next] | [standalone]
| From | markspace <markspace@nospam.nospam> |
|---|---|
| Date | 2013-03-21 09:33 -0700 |
| Message-ID | <kifckl$p1f$1@dont-email.me> |
| In reply to | #23027 |
On 3/21/2013 6:24 AM, Eric Sosman wrote:
>
> Integer count = map.get(word);
> map.put(word, count == null ? 1 : count + 1);
Basically, yes.
>
> ... and that you switched to something more like
>
> Integer count = map.get(word);
> map.put(word, new Integer(count == null
> ? 1 : count.intValue() + 1);
>
No, I made a Counter with a primitive and a reference to the word:
Counter counter = map.get( word );
if( counter == null ) {
counter = new Counter();
counter.word = word;
counter.count = 1;
map.put( word, counter );
} else
counter.count++;
> If so, the slowdown is probably due to increased memory pressure
> and garbage collection: `new' actually creates a new object every
Yeah, that's what I thought too. Although since there's only as many
Counters as there are Strings (words), I don't get why just making a 2x
change would slow the system as horribly as it did. There should be
only 4 million Strings and therefore also 4 million Counters. I can't
figure out why that would be a problem.
> time, while auto-boxing uses (the equivalent of) Integer.valueOf().
> The latter maintains a pool of a couple hundred small-valued Integers
> and doles them out whenever needed, using `new' only for un-pooled
> values.
I think it would be worth it to change the JVM memory parameters from
the defaults and see if that makes a difference.
Also, any thoughts on the best way to observe a GC that is thrashing?
I'm really curious to pin this down to some sort of root cause. I
couldn't rule out a coding error somewhere either.
> My suggestion would be to implement a Counter class that
> wraps a mutable integer value. Then you'd use
Thanks, I'll take a look at this when I get a chance. A good suggestion!
> Or, you could just go back to auto-boxing.
Yes, A-B-A testing works. Going back to auto-boxing restored the
previous run times, so I'm fairly certain it's related to memory
pressure or something similar.
[toc] | [prev] | [next] | [standalone]
| From | Eric Sosman <esosman@comcast-dot-net.invalid> |
|---|---|
| Date | 2013-03-21 14:15 -0400 |
| Message-ID | <kifijo$uaa$1@dont-email.me> |
| In reply to | #23034 |
On 3/21/2013 12:33 PM, markspace wrote:
> On 3/21/2013 6:24 AM, Eric Sosman wrote:
>>
>> Integer count = map.get(word);
>> map.put(word, count == null ? 1 : count + 1);
>
> Basically, yes.
>
>>
>> ... and that you switched to something more like
>>
>> Integer count = map.get(word);
>> map.put(word, new Integer(count == null
>> ? 1 : count.intValue() + 1);
>>
>
> No, I made a Counter with a primitive and a reference to the word:
>
> Counter counter = map.get( word );
> if( counter == null ) {
> counter = new Counter();
> counter.word = word;
> counter.count = 1;
> map.put( word, counter );
> } else
> counter.count++;
>
>> If so, the slowdown is probably due to increased memory pressure
>> and garbage collection: `new' actually creates a new object every
>
> Yeah, that's what I thought too. Although since there's only as many
> Counters as there are Strings (words), I don't get why just making a 2x
> change would slow the system as horribly as it did. There should be
> only 4 million Strings and therefore also 4 million Counters. I can't
> figure out why that would be a problem.
It might be the "long tail" I mentioned earlier. With the
second scheme you need four million Counter objects, while the
original used (perhaps) a hundred thousand large Integers plus
3.9 million references to the few small Integers in the static pool.
Back of the envelope: The Map holds four million references
to Map.Entry objects, each of which holds a key reference, a
value reference, and a link. With the Integer original, to this
you add a hundred thousand (same out-of-thin-air figure as before)
Integer instances. Total: 16 million references, 4.1 million objects.
The change to a "word-aware" Counter adds four million more
references and 3.9 million more objects. Yeah, I can see where
that might have a teeny-tiny impact ...
> Also, any thoughts on the best way to observe a GC that is thrashing?
> I'm really curious to pin this down to some sort of root cause. I
> couldn't rule out a coding error somewhere either.
Hmmm: I used to know something about tuning GC, but my knowledge
is about a decade out of date -- in an area that's had a lot of R&D
in the meantime. There's some Java 6 stuff at
http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
... but I haven't read it and can't assess it.
>> My suggestion would be to implement a Counter class that
>> wraps a mutable integer value. Then you'd use
>
> Thanks, I'll take a look at this when I get a chance. A good suggestion!
If I've understood you correctly, you've already done this --
and that's when the trouble started. Perhaps the hybrid Integer-
or-Counter approach would help, though.
--
Eric Sosman
esosman@comcast-dot-net.invalid
[toc] | [prev] | [next] | [standalone]
| From | Joerg Meier <joergmmeier@arcor.de> |
|---|---|
| Date | 2013-03-21 14:29 +0100 |
| Message-ID | <127vq2mlerlkc$.1vkw26e34n0o8$.dlg@40tude.net> |
| In reply to | #23006 |
On Wed, 20 Mar 2013 19:41:38 -0700, markspace wrote: > Or, Integer shares low values (<127) and doesn't create new objects. If you are asking: yes, it does. Liebe Gruesse, Joerg -- Ich lese meine Emails nicht, replies to Email bleiben also leider ungelesen.
[toc] | [prev] | [next] | [standalone]
| From | Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> |
|---|---|
| Date | 2013-03-21 15:38 -0500 |
| Message-ID | <kifr01$ip2$1@dont-email.me> |
| In reply to | #23006 |
On 3/20/2013 9:41 PM, markspace wrote: > On 3/20/2013 4:40 PM, markspace wrote: >> When indexing text files, there's a concept known as "stop words", which >> are basically really common words that you don't normally want to index. >> >> I just got done with a preliminary part of a project, where I indexed my >> gmail inbox by parsing out all the white-space separated words from all >> of my emails. For what it's worth, here's the 19 most common words in >> my inbox, out of over 600 million characters, nearly 4 million words, >> and probably almost 400,000 email messages. > > > OK, weird. I removed the auto-boxing from the HashMap I was using, and > it got MUCH slower. I'd estimate 30x slower, although I didn't let the > biggest test run to completion. > > Any ideas? The JVM is optimizing autoboxed integers? Looking briefly at the code for OpenJDK, there is definitely optimizations for autoboxing that do things like "if foo is from the Integer cache, turn foo.value into address_of_foo - address_of_IntegerCache" -- Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
[toc] | [prev] | [next] | [standalone]
| From | markspace <markspace@nospam.nospam> |
|---|---|
| Date | 2013-03-21 16:49 -0700 |
| Message-ID | <kig67g$jqu$1@dont-email.me> |
| In reply to | #23042 |
On 3/21/2013 1:38 PM, Joshua Cranmer 🐧 wrote: > > The JVM is optimizing autoboxed integers? That occurred to me, but since I know nothing about how the JVM might do that, I didn't speculate. > Looking briefly at the code > for OpenJDK, there is definitely optimizations for autoboxing that do > things like "if foo is from the Integer cache, turn foo.value into > address_of_foo - address_of_IntegerCache" > Interesting. I mostly just increment Integers by one. If the addresses of pre-allocated Integers (<127) are all stored sequentially, then that's equivalent to adding a constant offset to each address, about the same overhead as just adding 1. Hmmm, that's 100% speculation, but a cool optimization if they do it that way.
[toc] | [prev] | [next] | [standalone]
| From | Fredrik Jonson <fredrik@jonson.org> |
|---|---|
| Date | 2013-03-21 06:58 +0000 |
| Message-ID | <slrnkklbsp.vov.fredrik@biggles.jonson.org> |
| In reply to | #22994 |
Stefan Ram wrote: > cat mbox | egrep -o '\w+' | sort | uniq -c | sort -nr | head -9 You have a "useless use of cat" in that row of pipes. ITYM: egrep -o '\w+' mbox | sort | uniq -c | sort -nr | head -9 http://www.partmaps.org/era/unix/award.html Tip of the hat for the neat use of command line tools. -- Fredrik Jonson
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.java.programmer
csiph-web