From: Joshua Cranmer 🐧
Newsgroups: comp.lang.java.programmer
Subject: Re: email stop words
Date: Thu, 21 Mar 2013 15:38:08 -0500

On 3/20/2013 9:41 PM, markspace wrote:
> On 3/20/2013 4:40 PM, markspace wrote:
>> When indexing text files, there's a concept known as "stop words", which
>> are basically really common words that you don't normally want to index.
>>
>> I just got done with a preliminary part of a project, where I indexed my
>> gmail inbox by parsing out all the white-space separated words from all
>> of my emails. For what it's worth, here's the 19 most common words in
>> my inbox, out of over 600 million characters, nearly 4 million words,
>> and probably almost 400,000 email messages.
>
> OK, weird. I removed the auto-boxing from the HashMap I was using, and
> it got MUCH slower. I'd estimate 30x slower, although I didn't let the
> biggest test run to completion.
>
> Any ideas? The JVM is optimizing autoboxed integers?

Looking briefly at the code for OpenJDK, there are definitely optimizations for autoboxing that do things like "if foo is from the Integer cache, turn foo.value into address_of_foo - address_of_IntegerCache".

-- 
Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
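
For reference, the Integer cache mentioned above can be observed from plain Java. The following is only a minimal sketch, not the code from the original project: the class name `IntegerCacheDemo` and the helpers `sameBoxedObject` and `countWords` are made up for illustration. It does not demonstrate the JIT optimization itself, only the cache that such an optimization would rely on. Autoboxing goes through `Integer.valueOf`, which is required to return the same object for values from -128 to 127; outside that range the result depends on the JVM and its settings, so the sketch makes no claim about it.

```java
import java.util.HashMap;
import java.util.Map;

public class IntegerCacheDemo {

    // Boxes the same int twice and compares the references.
    // For values in -128..127 the language spec guarantees one shared object.
    // For other values the outcome is implementation-dependent.
    static boolean sameBoxedObject(int value) {
        Integer a = value; // autoboxing, compiled as Integer.valueOf(value)
        Integer b = value;
        return a == b;     // reference comparison, not a numeric one
    }

    // Hypothetical word counter in the spirit of the quoted post:
    // splits on whitespace and keeps an autoboxed count per word.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.split("\\s+")) {
            if (word.isEmpty()) {
                continue; // split() yields an empty token for leading whitespace
            }
            counts.merge(word, 1, Integer::sum); // unbox, add, rebox
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("100 shares one boxed object: " + sameBoxedObject(100));
        System.out.println("1000 shares one boxed object: " + sameBoxedObject(1000));
        System.out.println(countWords("the cat and the hat"));
    }
}
```

Whether the small-value cache explains the 30x difference reported above is not something this sketch can show; that would need a proper benchmark of both versions of the HashMap code.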