Re: email stop words

Date 2013-03-21 09:31 +0000
From lipska the kat <"nospam at neversurrender dot co dot uk">
Organization Trollbusters 3
Newsgroups comp.lang.java.programmer
Subject Re: email stop words
References <kidh9f$57s$1@dont-email.me> <514a50a0$0$32115$14726298@news.sunsite.dk> <kidjmt$f8j$1@dont-email.me>
Message-ID <AOSdnVU1JL7lTtfMnZ2dnUVZ7o2dnZ2d@bt.com>


On 21/03/13 00:21, markspace wrote:
> On 3/20/2013 5:13 PM, Arne Vajhøj wrote:
>>
>> I would have discarded special characters:
>> >=-()
>> up front.
>
> It's not nearly that sophisticated yet. In time, it will find actual
> words. Right now I'm just making sure I can read the whole mess in a
> timely fashion. You should have seen how long it took before ByteBuffers
> and using Regex.
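
For reference, the "discard special characters up front" idea quoted above
could look roughly like the sketch below. It is only an illustration using
java.util.regex from the standard library, not markspace's actual code; the
class name SpecialCharFilter and the method tokens() are made up for this
example, and the character set is just the five characters Arne listed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the thread.
final class SpecialCharFilter {
    // The five characters named in the quoted post: > = - ( )
    private static final Pattern SPECIAL = Pattern.compile("[>=\\-()]");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replaces each special character with a space, then splits on
    // runs of whitespace and drops empty tokens.
    static List<String> tokens(String line) {
        String cleaned = SPECIAL.matcher(line).replaceAll(" ");
        List<String> result = new ArrayList<>();
        for (String token : WHITESPACE.split(cleaned.trim())) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints: [It's, not, nearly, that, sophisticated]
        System.out.println(tokens("> It's not (nearly) that-sophisticated"));
    }
}
```

The special characters are replaced with a space rather than deleted so that
something like "x=1" does not fuse into "x1". The side effect is that
hyphenated words are split in two, which may or may not be what you want
for a stop word list.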

This is just a suggestion about indexing speed, and it may not be
suitable, but have you ever used Lucene?

http://lucene.apache.org

I've used it to index and search text databases for several years now.
I'm always surprised at how fast the indexing is.

It's a pretty sophisticated piece of kit though and the code can get 
quite verbose.

Like I say, just a suggestion WRT speed.

lipska

-- 
Lipska the Kat©: Troll hunter, sandbox destroyer
and farscape dreamer of Aeryn Sun
