Re: Catogorising strings into random versus non-random

From	Peter Otten <__peter__@web.de>
Newsgroups	comp.lang.python
Subject	Re: Catogorising strings into random versus non-random
Date	2015-12-21 09:24 +0100
Organization	None
Message-ID	<mailman.16.1450686280.2237.python-list@python.org> (permalink)
References	<56776b9d$0$1615$c3e8da3$5496439d@news.astraweb.com>

Show all headers | View raw

Steven D'Aprano wrote:

> I have a large number of strings (originally file names) which tend to
> fall into two groups. Some are human-meaningful, but not necessarily
> dictionary words e.g.:
> 
> 
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
> 
> 
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
> 
> 
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
> 
> 
> Let's call the second group "random" and the first "non-random", without
> getting bogged down into arguments about whether they are really random or
> not. I wish to process the strings and automatically determine whether
> each string is random or not. I need to split the strings into three
> groups:
> 
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
> 
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.
> 
> Strings are *mostly* ASCII but may include a few non-ASCII characters.
> 
> Note that false positives (detecting a meaningful non-random string as
> random) is worse for me than false negatives (miscategorising a random
> string as non-random).
> 
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:
> 
> - I think nltk has a "language detection" function, would that be
> suitable?
> 
> - If not nltk, are there are suitable language detection libraries?
> 
> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
> 
> - How about Bayesian filters, e.g. SpamBayes?

A dead simple approach -- look at the pairs in real words and calculate the 
ratio

pairs-also-found-in-real-words/num-pairs

$ cat score.py
import sys
WORDLIST = "/usr/share/dict/words"

SAMPLE = """\
baby lions at play
saturday_morning12
Fukushima
ImpossibleFork
xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u
""".splitlines()

def extract_pairs(text):
    for i in range(len(text)-1):
        yield text[i:i+2]


def load_pairs():
    pairs = set()
    with open(WORDLIST) as f:
        for line in f:
            pairs.update(extract_pairs(line.strip()))
    return pairs


def get_score(text, popular_pairs):
    m = 0
    for i, p in enumerate(extract_pairs(text), 1):
        if p in popular_pairs:
            m += 1
    return m/i


def main():
    popular_pairs = load_pairs()
    for text in sys.argv[1:] or SAMPLE:
        score = get_score(text, popular_pairs)
        print("%4.2f  %s" % (score, text))


if __name__ == "__main__":
    main()

$ python3 score.py
0.65  baby lions at play
0.76  saturday_morning12
1.00  Fukushima
0.92  ImpossibleFork
0.36  xy39mGWbosjY
0.31  9sjz7s8198ghwt
0.31  rz4sdko-28dbRW00u

However:

$ python3 -c 'import random, sys; a = list(sys.argv[1]); random.shuffle(a); 
print("".join(a))' 'baby lions at play'
bnsip atl ayba loy
$ python3 score.py 'bnsip atl ayba loy'
0.65  bnsip atl ayba loy

Thread

Catogorising strings into random versus non-random Steven D'Aprano <steve@pearwood.info> - 2015-12-21 14:01 +1100
  Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random) Ben Finney <ben+python@benfinney.id.au> - 2015-12-21 14:45 +1100
    Re: Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-12-21 19:47 +1100
  Re: Catogorising strings into random versus non-random Chris Angelico <rosuav@gmail.com> - 2015-12-21 15:22 +1100
    Re: Catogorising strings into random versus non-random Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-12-21 19:57 +1100
    Re: Catogorising strings into random versus non-random Rick Johnson <rantingrickjohnson@gmail.com> - 2015-12-21 17:45 -0800
  Re: Catogorising strings into random versus non-random Peter Otten <__peter__@web.de> - 2015-12-21 09:24 +0100
    Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 10:56 +0100
      Re: Catogorising strings into random versus non-random Steven D'Aprano <steve@pearwood.info> - 2015-12-21 21:36 +1100
        Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 11:53 +0100
          Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 11:56 +0100
  Re: Catogorising strings into random versus non-random Vlastimil Brom <vlastimil.brom@gmail.com> - 2015-12-21 14:25 +0100
  Re: Catogorising strings into random versus non-random Vincent Davis <vincent@vincentdavis.net> - 2015-12-21 07:51 -0600
  Re: Catogorising strings into random versus non-random duncan smith <duncan@invalid.invalid> - 2015-12-21 16:40 +0000
    Re: Catogorising strings into random versus non-random Ian Kelly <ian.g.kelly@gmail.com> - 2015-12-21 09:49 -0700
      Re: Catogorising strings into random versus non-random duncan smith <duncan@invalid.invalid> - 2015-12-21 17:41 +0000
    Re: Catogorising strings into random versus non-random Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-12-21 17:09 +0000
  Re: Catogorising strings into random versus non-random Paul Rubin <no.email@nospam.invalid> - 2015-12-21 09:20 -0800

csiph-web