From: markspace
Newsgroups: comp.lang.java.programmer
Subject: Re: email stop words
Date: Wed, 20 Mar 2013 19:41:38 -0700

On 3/20/2013 4:40 PM, markspace wrote:
> When indexing text files, there's a concept known as "stop words", which
> are basically really common words that you don't normally want to index.
>
> I just got done with a preliminary part of a project, where I indexed my
> gmail inbox by parsing out all the white-space separated words from all
> of my emails. For what it's worth, here's the 19 most common words in
> my inbox, out of over 600 million characters, nearly 4 million words,
> and probably almost 400,000 email messages.

OK, weird. I removed the auto-boxing from the HashMap I was using, and
it got MUCH slower. I'd estimate 30x slower, although I didn't let the
biggest test run to completion.

Any ideas? My own guesses so far:

I ended up making a lot more objects to avoid the immutable Integer, and
ended up using so much memory that the garbage collector started
thrashing.

Or, Integer shares low values (-128 through 127 when auto-boxing) and
doesn't create new objects.
There are so many of those low numbers that this ends up saving memory,
and objects, in the long run (in my biggest test case, that is).

Other thoughts?
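
For anyone who wants to poke at this, below is a minimal sketch of the two approaches being compared, assuming a plain word-count loop over a HashMap. The names WordCountSketch, Counter, countBoxed and countMutable are hypothetical and are not the actual project code. It also shows the Integer sharing mentioned above; it makes no claim about which variant is faster.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: all names here are illustrative, not from the project.
class WordCountSketch {

    /** Mutable counter; one instance is allocated per distinct word. */
    static final class Counter {
        int value;
    }

    /** Counts words using auto-boxed, immutable Integer values. */
    static Map<String, Integer> countBoxed(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            Integer old = counts.get(w);
            // The int result is boxed via Integer.valueOf; small values
            // come from the shared cache instead of a fresh object.
            counts.put(w, old == null ? 1 : old + 1);
        }
        return counts;
    }

    /** Counts words using a mutable counter object instead of Integer. */
    static Map<String, Counter> countMutable(String[] words) {
        Map<String, Counter> counts = new HashMap<>();
        for (String w : words) {
            Counter c = counts.get(w);
            if (c == null) {
                // Every distinct word gets its own Counter, even if it
                // only ever occurs once.
                c = new Counter();
                counts.put(w, c);
            }
            c.value++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = "the cat and the dog and the bird".split(" ");
        System.out.println("boxed:   " + countBoxed(words));
        System.out.println("mutable the = " + countMutable(words).get("the").value);

        // Boxing a value in -128..127 is guaranteed to yield a shared object.
        Integer a = 127, b = 127;
        System.out.println("127 shared:  " + (a == b));

        // Outside that range sharing is not guaranteed; on default JVM
        // settings these are usually two distinct objects.
        Integer c = 1000, d = 1000;
        System.out.println("1000 shared: " + (c == d));
    }
}
```

With the mutable version, a word that occurs only once still costs a Counter object, whereas the boxed version reuses the cached Integer for 1. With lots of rare words that could plausibly add up, but it would take a profiler or a heap dump to confirm that this is what is happening here.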