| From | Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: email stop words |
| Date | 2013-03-20 20:51 -0500 |
| Organization | A noiseless patient Spider |
| Message-ID | <kidp0d$67v$1@dont-email.me> |
| References | <kidh9f$57s$1@dont-email.me> |
On 3/20/2013 6:40 PM, markspace wrote:
> When indexing text files, there's a concept known as "stop words", which
> are basically really common words that you don't normally want to index.
>
> I just got done with a preliminary part of a project, where I indexed my
> gmail inbox by parsing out all the white-space separated words from all
> of my emails. For what it's worth, here's the 19 most common words in
> my inbox, out of over 600 million characters, nearly 4 million words,
> and probably almost 400,000 email messages.
>
> So what I have here is a list of stop words for email. Here it is
> without further ado, enjoy.
>
> 3888905 words
> top words:
> >, 1730013
> to, 582496
> =A0, 552868

The presence of =XX (where each X is a hexadecimal digit) is a strong indication that you are failing to decode quoted-printable bodies correctly, which makes the rest of your data suspect.

> Received:, 398679

This is clearly an RFC 822 message header. You should probably be performing some sort of normalization on message bodies to avoid parsing this (you might be parsing, e.g., message/rfc822 as plain text or a message/delivery-notification as plaintext).

-- 
Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
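To illustrate the quoted-printable point above: a token like =A0 is the encoded form of a single byte (0xA0, a non-breaking space in ISO-8859-1), so the body has to be decoded before it is split into words. What follows is only a minimal sketch using the standard library; the class name `QuotedPrintable` and its `decode` method are made up for illustration, and the charset is assumed to come from the part's Content-Type header. In a real indexer, a MIME library (for example JavaMail's `MimeUtility.decode`) would normally do this for you.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class QuotedPrintable {

    /** Decodes a quoted-printable body into text using the given charset. */
    static String decode(String encoded, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n = encoded.length();
        int i = 0;
        while (i < n) {
            char c = encoded.charAt(i);
            if (c != '=') {
                // Literal characters in quoted-printable are plain ASCII.
                out.write(c);
                i++;
            } else if (i + 1 < n && encoded.charAt(i + 1) == '\n') {
                // Soft line break with a bare LF: drop "=" and the newline.
                i += 2;
            } else if (i + 2 < n && encoded.charAt(i + 1) == '\r'
                    && encoded.charAt(i + 2) == '\n') {
                // Soft line break with CRLF: drop "=" and the line ending.
                i += 3;
            } else if (i + 2 < n
                    && Character.digit(encoded.charAt(i + 1), 16) >= 0
                    && Character.digit(encoded.charAt(i + 2), 16) >= 0) {
                // "=XX": two hex digits encode one raw byte.
                int hi = Character.digit(encoded.charAt(i + 1), 16);
                int lo = Character.digit(encoded.charAt(i + 2), 16);
                out.write((hi << 4) | lo);
                i += 3;
            } else {
                // Malformed escape: keep the "=" as-is rather than fail.
                out.write('=');
                i++;
            }
        }
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // Assumed sample input; the charset would come from Content-Type.
        String body = "stop=A0words are com=\r\nmon";
        String text = decode(body, StandardCharsets.ISO_8859_1);
        // Replace non-breaking spaces so whitespace splitting sees them.
        for (String word : text.replace('\u00A0', ' ').split("\\s+")) {
            System.out.println(word);
        }
    }
}
```

The same idea applies to the Received: count: only parts whose decoded content type is actually text should reach the tokenizer, so headers of embedded messages are not counted as words.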
email stop words markspace <markspace@nospam.nospam> - 2013-03-20 16:40 -0700
Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:13 -0400
Re: email stop words Lew <lewbloch@gmail.com> - 2013-03-20 17:21 -0700
Re: email stop words Arne Vajhøj <arne@vajhoej.dk> - 2013-03-20 20:41 -0400
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 17:21 -0700
Re: email stop words lipska the kat <"nospam at neversurrender dot co dot uk"> - 2013-03-21 09:31 +0000
Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-20 20:51 -0500
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-20 19:41 -0700
Re: email stop words Jukka Lahtinen <jtfjdehf@hotmail.com.invalid> - 2013-03-21 08:29 +0200
Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 09:24 -0400
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 09:33 -0700
Re: email stop words Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-21 14:15 -0400
Re: email stop words Joerg Meier <joergmmeier@arcor.de> - 2013-03-21 14:29 +0100
Re: email stop words Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-21 15:38 -0500
Re: email stop words markspace <markspace@nospam.nospam> - 2013-03-21 16:49 -0700
Re: email stop words Fredrik Jonson <fredrik@jonson.org> - 2013-03-21 06:58 +0000