


Re: email stop words

From Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid>
Newsgroups comp.lang.java.programmer
Subject Re: email stop words
Date 2013-03-20 20:51 -0500
Organization A noiseless patient Spider
Message-ID <kidp0d$67v$1@dont-email.me>
References <kidh9f$57s$1@dont-email.me>



On 3/20/2013 6:40 PM, markspace wrote:
> When indexing text files, there's a concept known as "stop words", which
> are basically really common words that you don't normally want to index.
>
> I just got done with a preliminary part of a project, where I indexed my
> gmail inbox by parsing out all the white-space separated words from all
> of my emails.  For what it's worth, here's the 19 most common words in
> my inbox, out of over 600 million characters, nearly 4 million words,
> and probably almost 400,000 email messages.
>
> So what I have here is a list of stop words for email. Here it is
> without further ado, enjoy.
>
> 3888905 words
> top words:
>  >, 1730013
> to, 582496
> =A0, 552868

The presence of =XX (where each X is a hexadecimal digit) is a strong 
indication that you are failing to decode quoted-printable bodies 
correctly, which makes the rest of your data suspect.
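
For illustration, here is a minimal sketch of what decoding such a body could look like, using only the standard library. The class and method names are hypothetical, and the charset is assumed to come from the part's Content-Type header; a real indexer would more likely rely on a mail library's decoder. Note that =A0 is the byte 0xA0, which is the no-break space in ISO-8859-1, so once decoded it would behave like whitespace rather than a word.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical helper, not part of any library: a small quoted-printable
// decoder following the rules of RFC 2045, section 6.7.
class QuotedPrintableSketch {

    /** Decodes a quoted-printable body into text using the given charset. */
    static String decode(String body, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n = body.length();
        int i = 0;
        while (i < n) {
            char c = body.charAt(i);
            if (c != '=') {
                // Literal character; a well-formed QP body is pure ASCII.
                out.write(c);
                i++;
            } else if (i + 2 < n && body.charAt(i + 1) == '\r' && body.charAt(i + 2) == '\n') {
                i += 3; // soft line break "=CRLF": drop it
            } else if (i + 1 < n && body.charAt(i + 1) == '\n') {
                i += 2; // soft line break with a bare LF: drop it
            } else if (i + 2 < n
                    && Character.digit(body.charAt(i + 1), 16) >= 0
                    && Character.digit(body.charAt(i + 2), 16) >= 0) {
                // "=XX": one encoded byte
                int hi = Character.digit(body.charAt(i + 1), 16);
                int lo = Character.digit(body.charAt(i + 2), 16);
                out.write((hi << 4) | lo);
                i += 3;
            } else {
                // Malformed escape: keep the '=' as-is rather than fail.
                out.write('=');
                i++;
            }
        }
        return new String(out.toByteArray(), charset);
    }
}
```

The bytes are collected first and converted to a string only at the end, so multi-byte sequences such as =C3=A9 in UTF-8 decode to a single character.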

> Received:, 398679

This is clearly an RFC 822 message header. You should probably be 
performing some sort of normalization on message bodies to avoid parsing 
this (you might be parsing, e.g., a message/rfc822 part or a 
message/delivery-status part as plain text).
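
A rough sketch of that kind of normalization, assuming the indexer sees each part as a raw string plus the value of its Content-Type header. The class and method names are hypothetical, and the rule "index only text/* parts" is one possible policy, not the only reasonable one; an embedded message/rfc822 part would need to be parsed recursively instead of being tokenized.

```java
import java.util.Locale;

// Hypothetical helper for deciding what to feed to the word counter.
class MessagePartSketch {

    /** Returns the text after the first blank line, i.e. without the header block. */
    static String bodyOf(String rawMessage) {
        int crlf = rawMessage.indexOf("\r\n\r\n");
        int lf = rawMessage.indexOf("\n\n");
        if (crlf >= 0 && (lf < 0 || crlf < lf)) {
            return rawMessage.substring(crlf + 4);
        }
        if (lf >= 0) {
            return rawMessage.substring(lf + 2);
        }
        return ""; // no blank line: headers only, nothing to index
    }

    /** Extracts the lower-cased media type from a Content-Type header value. */
    static String mediaType(String contentTypeValue) {
        if (contentTypeValue == null) {
            return "text/plain"; // default when the header is absent (RFC 2045)
        }
        int semi = contentTypeValue.indexOf(';');
        String type = semi >= 0 ? contentTypeValue.substring(0, semi) : contentTypeValue;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** Assumed policy: tokenize only textual parts, never embedded messages or reports. */
    static boolean isIndexableAsText(String contentTypeValue) {
        String type = mediaType(contentTypeValue);
        return type.startsWith("text/") && !type.equals("text/rfc822-headers");
    }
}
```

With a filter like this in place, header names such as Received: would be counted only if they appear in the visible text of a message.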

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth



Thread

email stop words markspace <markspace@nospam.nospam> - 2013-03-20 16:40 -0700
  Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:13 -0400
    Re: email stop words Lew <lewbloch@gmail.com> - 2013-03-20 17:21 -0700
      Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:41 -0400
    Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 17:21 -0700
      Re: email stop words lipska the kat <"nospam at neversurrender dot co dot uk"> - 2013-03-21 09:31 +0000
  Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-20 20:51 -0500
  Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 19:41 -0700
    Re: email stop words Jukka Lahtinen <jtfjdehf@hotmail.com.invalid> - 2013-03-21 08:29 +0200
    Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 09:24 -0400
      Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 09:33 -0700
        Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 14:15 -0400
    Re: email stop words Joerg Meier <joergmmeier@arcor.de> - 2013-03-21 14:29 +0100
    Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-21 15:38 -0500
      Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 16:49 -0700
  Re: email stop words Fredrik Jonson <fredrik@jonson.org> - 2013-03-21 06:58 +0000
