| From | markspace <markspace@nospam.nospam> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | email stop words |
| Date | Wed, 20 Mar 2013 16:40:15 -0700 |
| Organization | A noiseless patient Spider |
| Message-ID | <kidh9f$57s$1@dont-email.me> |
When indexing text files, there's a concept known as "stop words": really common words that you don't normally want to index.

I just finished a preliminary part of a project in which I indexed my Gmail inbox by parsing out all the whitespace-separated words from all of my emails. For what it's worth, here are the 19 most common words in my inbox, out of over 600 million characters, nearly 4 million words, and probably almost 400,000 email messages.

So what I have here is a list of stop words for email. Here it is without further ado; enjoy.

    3888905 words
    top words:
    >, 1730013
    to, 582496
    =A0, 552868
    the, 544917
    with, 476503
    by, 451309
    Received:, 398679
    id, 380817
    this, 324269
    of, 296506
    SMTP, 285885
    for, 252344
    from, 244664
    -0700, 234140
    2010, 231221
    >>, 227200
    a, 224202
    (PDT), 220162
    and, 217103
    BUILD SUCCESSFUL (total time: 1 minute 8 seconds)
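The post does not include the code that produced these counts, so here is only a minimal sketch of the counting step it describes: split text on whitespace, tally each token, and list the most frequent ones. It assumes Java 8 or later, and the names `TopWords`, `countWords`, and `topWords` are hypothetical, not taken from the original project. Reading the mailbox itself is left out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper class; not the original poster's code.
final class TopWords {

    /** Tallies every whitespace-separated token in the text, exactly as written (no case folding). */
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {              // an empty input yields one empty token
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the n most frequent tokens, highest count first; ties are ordered alphabetically. */
    static List<Map.Entry<String, Integer>> topWords(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Tiny stand-in for message text; a real run would feed in each email.
        Map<String, Integer> counts = countWords("to the to > > >");
        for (Map.Entry<String, Integer> e : topWords(counts, 2)) {
            System.out.println(e.getKey() + ", " + e.getValue());
        }
    }
}
```

Because the tokens are counted as-is, header fragments and quoting markers such as `Received:`, `>`, and `=A0` end up in the tally alongside ordinary words, which matches the kind of list shown above.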
email stop words markspace <markspace@nospam.nospam> - 2013-03-20 16:40 -0700
Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:13 -0400
Re: email stop words Lew <lewbloch@gmail.com> - 2013-03-20 17:21 -0700
Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:41 -0400
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 17:21 -0700
Re: email stop words lipska the kat <"nospam at neversurrender dot co dot uk"> - 2013-03-21 09:31 +0000
Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-20 20:51 -0500
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 19:41 -0700
Re: email stop words Jukka Lahtinen <jtfjdehf@hotmail.com.invalid> - 2013-03-21 08:29 +0200
Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 09:24 -0400
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 09:33 -0700
Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 14:15 -0400
Re: email stop words Joerg Meier <joergmmeier@arcor.de> - 2013-03-21 14:29 +0100
Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-21 15:38 -0500
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 16:49 -0700
Re: email stop words Fredrik Jonson <fredrik@jonson.org> - 2013-03-21 06:58 +0000