| Path | csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail |
|---|---|
| From | Peter Otten <__peter__@web.de> |
| Newsgroups | comp.lang.python |
| Subject | Re: Catogorising strings into random versus non-random |
| Date | Mon, 21 Dec 2015 09:24:24 +0100 |
| Organization | None |
| Lines | 122 |
| Message-ID | <mailman.16.1450686280.2237.python-list@python.org> (permalink) |
| References | <56776b9d$0$1615$c3e8da3$5496439d@news.astraweb.com> |
Steven D'Aprano wrote:
> I have a large number of strings (originally file names) which tend to
> fall into two groups. Some are human-meaningful, but not necessarily
> dictionary words e.g.:
>
>
>     baby lions at play
>     saturday_morning12
>     Fukushima
>     ImpossibleFork
>
>
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
>
>
>     xy39mGWbosjY
>     9sjz7s8198ghwt
>     rz4sdko-28dbRW00u
>
>
> Let's call the second group "random" and the first "non-random", without
> getting bogged down into arguments about whether they are really random or
> not. I wish to process the strings and automatically determine whether
> each string is random or not. I need to split the strings into three
> groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.
>
> Strings are *mostly* ASCII but may include a few non-ASCII characters.
>
> Note that false positives (detecting a meaningful non-random string as
> random) are worse for me than false negatives (miscategorising a random
> string as non-random).
>
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:
>
> - I think nltk has a "language detection" function, would that be
> suitable?
>
> - If not nltk, are there any suitable language detection libraries?
>
> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
>
> - How about Bayesian filters, e.g. SpamBayes?
A dead simple approach: collect the character pairs that occur in real
words, then calculate for each string the ratio

    pairs-also-found-in-real-words/num-pairs
$ cat score.py
import sys

WORDLIST = "/usr/share/dict/words"

SAMPLE = """\
baby lions at play
saturday_morning12
Fukushima
ImpossibleFork
xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u
""".splitlines()


def extract_pairs(text):
    # Yield every pair of adjacent characters in text.
    for i in range(len(text)-1):
        yield text[i:i+2]


def load_pairs():
    # Collect all character pairs that occur in the word list.
    pairs = set()
    with open(WORDLIST) as f:
        for line in f:
            pairs.update(extract_pairs(line.strip()))
    return pairs


def get_score(text, popular_pairs):
    # Fraction of the pairs in text that also occur in real words.
    m = 0
    i = 0  # stays 0 if text has fewer than two characters
    for i, p in enumerate(extract_pairs(text), 1):
        if p in popular_pairs:
            m += 1
    return m/i if i else 0.0


def main():
    popular_pairs = load_pairs()
    for text in sys.argv[1:] or SAMPLE:
        score = get_score(text, popular_pairs)
        print("%4.2f %s" % (score, text))


if __name__ == "__main__":
    main()
$ python3 score.py
0.65 baby lions at play
0.76 saturday_morning12
1.00 Fukushima
0.92 ImpossibleFork
0.36 xy39mGWbosjY
0.31 9sjz7s8198ghwt
0.31 rz4sdko-28dbRW00u
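To get the three groups asked for, the score could be compared against two cut-offs. This is only a sketch: the function name classify() and the thresholds 0.4 and 0.6 are assumptions, picked by eye so that they separate the sample scores above, and would need tuning on real data. Since a meaningful string flagged as random is the worse mistake, it may be safer to keep the lower cut-off small.

```python
def classify(score, low=0.4, high=0.6):
    """Map a pair score in [0, 1] to one of three labels.

    low and high are hypothetical cut-offs, chosen only to separate
    the sample scores shown above; tune them on real data.
    """
    if score < low:
        return "random"
    if score >= high:
        return "non-random"
    return "unsure"


# Scores copied from the sample run above.
sample_scores = [
    (0.65, "baby lions at play"),
    (0.76, "saturday_morning12"),
    (1.00, "Fukushima"),
    (0.92, "ImpossibleFork"),
    (0.36, "xy39mGWbosjY"),
    (0.31, "9sjz7s8198ghwt"),
    (0.31, "rz4sdko-28dbRW00u"),
]

for score, text in sample_scores:
    print("%-10s %4.2f %s" % (classify(score), score, text))
```

Lowering `low` moves borderline strings from "random" into "unsure", which trades more manual checking for fewer meaningful names thrown away.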
However, a meaningful string whose characters have been shuffled still gets
a high score:
$ python3 -c 'import random, sys; a = list(sys.argv[1]); random.shuffle(a); print("".join(a))' 'baby lions at play'
bnsip atl ayba loy
$ python3 score.py 'bnsip atl ayba loy'
0.65 bnsip atl ayba loy
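One possible refinement, not part of the script above and not tested against the full word list: count how often each pair occurs instead of only recording that it occurs, and score a string by the average log-probability of its pairs. A pair seen once in the dictionary then counts for much less than a common one. In this sketch the tiny WORDS list is a stand-in for /usr/share/dict/words, and count_pairs() and weighted_score() are hypothetical names.

```python
import math
from collections import Counter

# Tiny stand-in word list (an assumption for this sketch); in practice
# the counts would come from a file such as /usr/share/dict/words.
WORDS = ["baby", "lions", "play", "saturday", "morning", "impossible", "fork"]


def extract_pairs(text):
    # Yield every pair of adjacent characters in text.
    for i in range(len(text) - 1):
        yield text[i:i+2]


def count_pairs(words):
    # Count how often each character pair occurs across all words.
    counts = Counter()
    for word in words:
        counts.update(extract_pairs(word))
    return counts


def weighted_score(text, counts):
    """Average log-probability of the pairs in text (higher = more word-like)."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen pairs
    logs = [
        math.log((counts[p] + 1) / (total + vocab))  # add-one smoothing
        for p in extract_pairs(text)
    ]
    # Strings with fewer than two characters have no pairs to score.
    return sum(logs) / len(logs) if logs else float("-inf")


counts = count_pairs(WORDS)
for text in ["lions", "snoil"]:
    print("%6.2f %s" % (weighted_score(text, counts), text))
```

The scores are negative log-probabilities rather than ratios in [0, 1], so any cut-offs would have to be chosen afresh.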