| From | markspace <markspace@nospam.nospam> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: email stop words |
| Date | 2013-03-20 19:41 -0700 |
| Organization | A noiseless patient Spider |
| Message-ID | <kidrti$hgn$1@dont-email.me> (permalink) |
| References | <kidh9f$57s$1@dont-email.me> |
On 3/20/2013 4:40 PM, markspace wrote:
> When indexing text files, there's a concept known as "stop words", which
> are basically really common words that you don't normally want to index.
>
> I just got done with a preliminary part of a project, where I indexed my
> gmail inbox by parsing out all the white-space separated words from all
> of my emails. For what it's worth, here's the 19 most common words in
> my inbox, out of over 600 million characters, nearly 4 million words,
> and probably almost 400,000 email messages.

OK, weird. I removed the auto-boxing from the HashMap I was using, and it got MUCH slower. I'd estimate 30x slower, although I didn't let the biggest test run to completion.

Any ideas? Mine run to:

I ended up making a lot more objects to avoid the immutable Integer, and ended up using so much memory that the garbage collector started thrashing.

Or, auto-boxing shares cached Integer objects for low values (-128 to 127) and doesn't create new objects. There are so many of those low numbers that this ends up saving memory, and objects, in the long run (my biggest test case, in other words).

Other thoughts?
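
For anyone following along, here is a minimal sketch of the two counting styles being compared. The post does not show the actual code, so everything here is an assumption: the class `WordCountSketch`, the `Counter` holder, and both method names are hypothetical. The boxed version lets auto-boxing produce the `Integer` values (cached for -128 to 127, so the many rare words share a handful of objects), while the mutable version allocates one `Counter` per distinct word. It only illustrates where objects get created; it is not a benchmark and does not by itself explain a 30x difference.

```java
import java.util.HashMap;
import java.util.Map;

class WordCountSketch {

    /** Hypothetical mutable holder: one instance is allocated per distinct word. */
    static final class Counter {
        int value;
    }

    /** Counts words using boxed Integer values; each update auto-boxes the new count. */
    static Map<String, Integer> countBoxed(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // split() yields one empty token for blank input
            }
            Integer old = counts.get(word);
            // Auto-boxing goes through Integer.valueOf, which reuses cached
            // objects for values in -128..127.
            counts.put(word, old == null ? 1 : old + 1);
        }
        return counts;
    }

    /** Counts words using a mutable counter, so increments allocate nothing. */
    static Map<String, Counter> countMutable(String text) {
        Map<String, Counter> counts = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            Counter c = counts.get(word);
            if (c == null) {
                c = new Counter(); // the only allocation for this word
                counts.put(word, c);
            }
            c.value++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "the cat and the dog and the bird";
        System.out.println(countBoxed(text).get("the"));
        System.out.println(countMutable(text).get("the").value);
        // Values in the cached range box to the same object.
        System.out.println(Integer.valueOf(100) == Integer.valueOf(100));
    }
}
```

If most words in the inbox occur fewer than 128 times, the boxed map's values are almost all shared cache entries, whereas the mutable map carries one extra object per distinct word. Whether that accounts for the slowdown would need a profiler or GC logging to confirm.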
email stop words markspace <markspace@nospam.nospam> - 2013-03-20 16:40 -0700
Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:13 -0400
Re: email stop words Lew <lewbloch@gmail.com> - 2013-03-20 17:21 -0700
Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:41 -0400
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 17:21 -0700
Re: email stop words lipska the kat <"nospam at neversurrender dot co dot uk"> - 2013-03-21 09:31 +0000
Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-20 20:51 -0500
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 19:41 -0700
Re: email stop words Jukka Lahtinen <jtfjdehf@hotmail.com.invalid> - 2013-03-21 08:29 +0200
Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 09:24 -0400
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 09:33 -0700
Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 14:15 -0400
Re: email stop words Joerg Meier <joergmmeier@arcor.de> - 2013-03-21 14:29 +0100
Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-21 15:38 -0500
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 16:49 -0700
Re: email stop words Fredrik Jonson <fredrik@jonson.org> - 2013-03-21 06:58 +0000