Re: email stop words

Path csiph.com!usenet.pasdenom.info!gegeweb.org!eternal-september.org!feeder.eternal-september.org!mx05.eternal-september.org!.POSTED!not-for-mail
From markspace <markspace@nospam.nospam>
Newsgroups comp.lang.java.programmer
Subject Re: email stop words
Date Wed, 20 Mar 2013 19:41:38 -0700
Organization A noiseless patient Spider
Lines 29
Message-ID <kidrti$hgn$1@dont-email.me>
References <kidh9f$57s$1@dont-email.me>
Mime-Version 1.0
Content-Type text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding 7bit
Injection-Date Thu, 21 Mar 2013 02:39:46 +0000 (UTC)
Injection-Info mx05.eternal-september.org; posting-host="fba3415ba68d85d643935af2f52f0b4b"; logging-data="17943"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX192PSRhwIzKymHL3X/ME8CwvioO2yi6wm8="
User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130307 Thunderbird/17.0.4
In-Reply-To <kidh9f$57s$1@dont-email.me>
Cancel-Lock sha1:H6O+jSW9EaGKxiQWPGM6c1JLqmo=
Xref csiph.com comp.lang.java.programmer:23006


On 3/20/2013 4:40 PM, markspace wrote:
> When indexing text files, there's a concept known as "stop words", which
> are basically really common words that you don't normally want to index.
>
> I just got done with a preliminary part of a project, where I indexed my
> gmail inbox by parsing out all the white-space separated words from all
> of my emails.  For what it's worth, here's the 19 most common words in
> my inbox, out of over 600 million characters, nearly 4 million words,
> and probably almost 400,000 email messages.


OK, weird.  I removed the auto-boxing from the HashMap I was using, and 
it got MUCH slower.  I'd estimate 30x slower, although I didn't let the 
biggest test run to completion.

Any ideas?

My guesses run to: I ended up making a lot more objects to avoid the 
immutable Integer, and ended up using so much memory the garbage 
collector started thrashing.
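
For illustration, here is a minimal sketch of the two counting styles being compared. The original code wasn't posted, so the class name, the Counter helper, and both method names are hypothetical stand-ins, not the actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {

    // Hypothetical mutable counter, standing in for whatever replaced Integer.
    static final class Counter {
        int value;
    }

    // Variant 1: auto-boxed Integer values. Every update unboxes the old
    // count, adds one, and boxes the result again before storing it.
    static Map<String, Integer> countBoxed(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            Integer old = counts.get(w);
            counts.put(w, old == null ? 1 : old + 1);
        }
        return counts;
    }

    // Variant 2: one mutable Counter per distinct word, incremented in place.
    static Map<String, Counter> countMutable(String[] words) {
        Map<String, Counter> counts = new HashMap<>();
        for (String w : words) {
            Counter c = counts.get(w);
            if (c == null) {
                c = new Counter();
                counts.put(w, c);
            }
            c.value++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = "the cat and the hat and the bat".split("\\s+");
        System.out.println(countBoxed(words).get("the"));
        System.out.println(countMutable(words).get("the").value);
    }
}
```

Written this way, variant 2 allocates only one Counter per distinct word, so if the real code looked like this, the extra objects would have to be coming from somewhere else.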

Or, Integer shares low values (-128 to 127) and doesn't create new 
objects for them.  There are so many of those low numbers that this ends 
up saving memory, and objects, in the long run (my biggest test case, in 
other words).
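
The sharing of small Integer values can be checked directly. A minimal sketch, with a hypothetical class and helper name; Integer.valueOf is documented to always cache -128 to 127 inclusive, and may or may not cache values outside that range depending on the JVM and its settings:

```java
public class IntegerCacheSketch {

    // Hypothetical helper: true if boxing n twice yields the very same object.
    static boolean sameBoxedObject(int n) {
        return Integer.valueOf(n) == Integer.valueOf(n);
    }

    public static void main(String[] args) {
        // Inside the guaranteed cache range, both calls return one shared object.
        System.out.println(sameBoxedObject(100));

        // Outside that range the result depends on the JVM, so nothing is
        // assumed here; it is only printed for inspection.
        System.out.println(sameBoxedObject(100000));

        // equals() compares values and is the reliable comparison either way.
        System.out.println(Integer.valueOf(100000).equals(Integer.valueOf(100000)));
    }
}
```

Auto-boxing an int into an Integer goes through the same cache, so a map full of small counts would reuse those shared objects rather than allocating new ones.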

Other thoughts?



Thread

email stop words markspace <markspace@nospam.nospam> - 2013-03-20 16:40 -0700
  Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:13 -0400
    Re: email stop words Lew <lewbloch@gmail.com> - 2013-03-20 17:21 -0700
      Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:41 -0400
    Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 17:21 -0700
      Re: email stop words lipska the kat <"nospam at neversurrender dot co dot uk"> - 2013-03-21 09:31 +0000
  Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-20 20:51 -0500
  Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 19:41 -0700
    Re: email stop words Jukka Lahtinen <jtfjdehf@hotmail.com.invalid> - 2013-03-21 08:29 +0200
    Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 09:24 -0400
      Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 09:33 -0700
        Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 14:15 -0400
    Re: email stop words Joerg Meier <joergmmeier@arcor.de> - 2013-03-21 14:29 +0100
    Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-21 15:38 -0500
      Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 16:49 -0700
  Re: email stop words Fredrik Jonson <fredrik@jonson.org> - 2013-03-21 06:58 +0000
