


email stop words

From markspace <markspace@nospam.nospam>
Newsgroups comp.lang.java.programmer
Subject email stop words
Date 2013-03-20 16:40 -0700
Organization A noiseless patient Spider
Message-ID <kidh9f$57s$1@dont-email.me>



When indexing text files, there's a concept known as "stop words": 
very common words that you normally don't want to index.
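
As a minimal sketch of how a stop list is typically applied, the filter below drops stop words before the remaining words would go to an index. The class name, method name, and the five-word stop list are made up for illustration and are not taken from any particular indexing library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {

    // Hypothetical stop list, chosen only for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "to", "of", "a", "and"));

    /** Returns the whitespace-separated words of text that are not stop words. */
    static List<String> indexableWords(String text) {
        List<String> kept = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            // Compare case-insensitively, but keep the word as written.
            if (!word.isEmpty() && !STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Prints [Send, report, head, sales]
        System.out.println(indexableWords("Send the report to the head of sales"));
    }
}
```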

I just finished a preliminary part of a project, in which I indexed my 
Gmail inbox by parsing out all the whitespace-separated words from all 
of my emails.  For what it's worth, here are the 19 most common words in 
my inbox, out of over 600 million characters, nearly 4 million words, 
and probably almost 400,000 email messages.

So what I have here is a list of stop words for email. Here it is 
without further ado; enjoy.

3888905 words
top words:
 >, 1730013
to, 582496
=A0, 552868
the, 544917
with, 476503
by, 451309
Received:, 398679
id, 380817
this, 324269
of, 296506
SMTP, 285885
for, 252344
from, 244664
-0700, 234140
2010, 231221
 >>, 227200
a, 224202
(PDT), 220162
and, 217103
BUILD SUCCESSFUL (total time: 1 minute 8 seconds)
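
For anyone who wants to produce a similar list, here is a rough sketch of the kind of counting described above: split the text on whitespace, tally each token, and print the most frequent ones in the same "word, count" form. It is not the program that generated the numbers in this post; the class and method names are invented for illustration, and reading the mail itself (from mbox files, IMAP, or wherever) is left out.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordFrequency {

    /** Tallies every whitespace-separated token read from source. */
    static Map<String, Integer> count(Reader source) {
        Map<String, Integer> counts = new HashMap<>();
        new BufferedReader(source).lines().forEach(line -> {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {          // a blank line splits to one empty string
                    counts.merge(word, 1, Integer::sum);
                }
            }
        });
        return counts;
    }

    /** Returns up to n entries, highest count first; ties are ordered by word. */
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // Tiny stand-in for real mail text.
        Reader sample = new StringReader("> to the list\n> to the\nto");
        Map<String, Integer> counts = count(sample);
        for (Map.Entry<String, Integer> e : top(counts, 19)) {
            System.out.println(e.getKey() + ", " + e.getValue());
        }
    }
}
```

Because the split is purely on whitespace, quote markers, header names, and encoding leftovers are counted as words too, which is why tokens such as "Received:" and "=A0" can rank so high.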



Thread

email stop words markspace <markspace@nospam.nospam> - 2013-03-20 16:40 -0700
  Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:13 -0400
    Re: email stop words Lew <lewbloch@gmail.com> - 2013-03-20 17:21 -0700
      Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:41 -0400
    Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 17:21 -0700
      Re: email stop words lipska the kat <"nospam at neversurrender dot co dot uk"> - 2013-03-21 09:31 +0000
  Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-20 20:51 -0500
  Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 19:41 -0700
    Re: email stop words Jukka Lahtinen <jtfjdehf@hotmail.com.invalid> - 2013-03-21 08:29 +0200
    Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 09:24 -0400
      Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 09:33 -0700
        Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 14:15 -0400
    Re: email stop words Joerg Meier <joergmmeier@arcor.de> - 2013-03-21 14:29 +0100
    Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-21 15:38 -0500
      Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 16:49 -0700
  Re: email stop words Fredrik Jonson <fredrik@jonson.org> - 2013-03-21 06:58 +0000
