Re: email stop words

From Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid>
Newsgroups comp.lang.java.programmer
Subject Re: email stop words
Date 2013-03-21 15:38 -0500
Organization A noiseless patient Spider
Message-ID <kifr01$ip2$1@dont-email.me>
References <kidh9f$57s$1@dont-email.me> <kidrti$hgn$1@dont-email.me>


On 3/20/2013 9:41 PM, markspace wrote:
> On 3/20/2013 4:40 PM, markspace wrote:
>> When indexing text files, there's a concept known as "stop words", which
>> are basically really common words that you don't normally want to index.
>>
>> I just got done with a preliminary part of a project, where I indexed my
>> gmail inbox by parsing out all the white-space separated words from all
>> of my emails.  For what it's worth, here's the 19 most common words in
>> my inbox, out of over 600 million characters, nearly 4 million words,
>> and probably almost 400,000 email messages.
>
>
> OK, weird.  I removed the auto-boxing from the HashMap I was using, and
> it got MUCH slower.  I'd estimate 30x slower, although I didn't let the
> biggest test run to completion.
>
> Any ideas?

The JVM is optimizing autoboxed integers? Looking briefly at the code 
for OpenJDK, there are definitely optimizations for autoboxing that do 
things like "if foo is from the Integer cache, turn foo.value into 
address_of_foo - address_of_IntegerCache".
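
For anyone who wants to poke at the Integer cache itself, here is a minimal sketch. It is not taken from the OpenJDK sources and it is not markspace's code (that isn't shown in this thread); the class and method names are made up, and I'm assuming the counting map was something like a HashMap<String, Integer>. The only behavior it relies on is that boxing an int in the range -128..127 always yields the same cached object; outside that range the result depends on the JVM and its flags, so the sketch makes no claim about it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; the names are illustrative, not from the thread.
public class BoxedCountDemo {

    // True when boxing the same int twice yields the same object.
    static boolean sameBoxedInstance(int value) {
        Integer a = value; // autoboxing goes through Integer.valueOf(value)
        Integer b = value;
        return a == b;     // reference comparison on purpose
    }

    // Counts whitespace-separated words using autoboxed Integer values.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // blank input splits into a single empty string
            }
            Integer old = counts.get(word);
            counts.put(word, old == null ? 1 : old + 1); // unbox, add, rebox
        }
        return counts;
    }

    public static void main(String[] args) {
        // Values in -128..127 are required to come from the cache.
        System.out.println(sameBoxedInstance(127));
        // Larger values may or may not be cached, depending on the JVM.
        System.out.println(sameBoxedInstance(100_000));
        System.out.println(countWords("the cat and the hat"));
    }
}
```

This doesn't explain the 30x difference by itself. It only shows the cache that the optimization I mentioned keys on, and where the boxing happens in a typical counting loop.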

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth

Thread

email stop words markspace <markspace@nospam.nospam> - 2013-03-20 16:40 -0700
  Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:13 -0400
    Re: email stop words Lew <lewbloch@gmail.com> - 2013-03-20 17:21 -0700
      Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:41 -0400
    Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 17:21 -0700
      Re: email stop words lipska the kat <"nospam at neversurrender dot co dot uk"> - 2013-03-21 09:31 +0000
  Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-20 20:51 -0500
  Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 19:41 -0700
    Re: email stop words Jukka Lahtinen <jtfjdehf@hotmail.com.invalid> - 2013-03-21 08:29 +0200
    Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 09:24 -0400
      Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 09:33 -0700
        Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 14:15 -0400
    Re: email stop words Joerg Meier <joergmmeier@arcor.de> - 2013-03-21 14:29 +0100
    Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-21 15:38 -0500
      Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 16:49 -0700
  Re: email stop words Fredrik Jonson <fredrik@jonson.org> - 2013-03-21 06:58 +0000