From: lipska the kat <"nospam at neversurrender dot co dot uk">
Organization: Trollbusters 3
Newsgroups: comp.lang.java.programmer
Subject: Re: email stop words
Date: Thu, 21 Mar 2013 09:31:35 +0000

On 21/03/13 00:21, markspace wrote:
> On 3/20/2013 5:13 PM, Arne Vajhøj wrote:
>>
>> I would have discarded special characters:
>> >=-()
>> up front.
>
> It's not nearly that sophisticated yet. In time, it will find actual
> words. Right now I'm just making sure I can read the whole mess in a
> timely fashion. You should have seen how long it took before ByteBuffers
> and using Regex.

This is just a suggestion relating to speed of indexing, and it may not
be suitable, but have you ever used Lucene?
http://lucene.apache.org

I've used it to index and search text databases for several years now.
I'm always surprised at how fast the indexing is. It's a pretty
sophisticated piece of kit, though, and the code can get quite verbose.

Like I say, just a suggestion WRT speed.

lipska

-- 
Lipska the Kat©: Troll hunter, sandbox destroyer
and farscape dreamer of Aeryn Sun