Re: Catogorising strings into random versus non-random

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Newsgroups	comp.lang.python
Subject	Re: Catogorising strings into random versus non-random
Date	Mon, 21 Dec 2015 14:25:28 +0100
Message-ID	<mailman.27.1450704337.2237.python-list@python.org>
References	<56776b9d$0$1615$c3e8da3$5496439d@news.astraweb.com>


2015-12-21 4:01 GMT+01:00 Steven D'Aprano <steve@pearwood.info>:
> I have a large number of strings (originally file names) which tend to fall
> into two groups. Some are human-meaningful, but not necessarily dictionary
> words e.g.:
>
>
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
>
>
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
>
>
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
>
>
> Let's call the second group "random" and the first "non-random", without
> getting bogged down into arguments about whether they are really random or
> not. I wish to process the strings and automatically determine whether each
> string is random or not. I need to split the strings into three groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.
>
> Strings are *mostly* ASCII but may include a few non-ASCII characters.
>
> Note that false positives (detecting a meaningful non-random string as
> random) is worse for me than false negatives (miscategorising a random
> string as non-random).
>
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:
>
> - I think nltk has a "language detection" function, would that be suitable?
>
> - If not nltk, are there are suitable language detection libraries?
>
> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
>
> - How about Bayesian filters, e.g. SpamBayes?
>
>
>
>
> --
> Steven

Hi,
as you probably already know, NLTK could be helpful for some parts of
this task. If you can handle the most likely "word" splitting (on
underscores, CamelCase, etc.), you could tag the parts of speech of
the resulting words and interpret the results according to your
needs.
In the online demo
http://text-processing.com/demo/tag/
your sample (with different approaches to splitting the words) yields:

baby/NN lions/NNS at/IN play/VB saturday/NN morning/NN 12/CD
Fukushima/NNP Impossible/JJ Fork/NNP xy39mGWbosjY/-None-
9sjz7s8198ghwt/-None- rz4sdko/-None- -/: 28dbRW00u/-None-

or, with additional splits on case and letter-digit boundaries:
baby/NN lions/NNS at/IN play/VB saturday/NN morning/NN 12/CD
Fukushima/NNP Impossible/JJ Fork/NNP xy/-None- 39/CD m/-None- G/NNP
Wbosj/-None- Y/-None- 9/CD sjz/-None- 7/CD s/-None- 8198/-NONE-
ghwt/-None- rz/-None- 4/CD sdko/-None- -/: 28/CD db/-None- R/NNP
W/-None- 00/-None- u/-None-
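
The two kinds of splitting mentioned above could be approximated with
regular expressions. This is only a rough sketch, not the code used for
the demo output; the function names split_coarse and split_fine are
made up for illustration, and the patterns assume ASCII letters only
(non-ASCII letters would need extra handling).

```python
import re

# Fine-grained pattern: a capitalised word, a lowercase run, a single
# capital, a digit run, or a run of other punctuation (e.g. "-").
# Underscores and whitespace match no alternative, so they act as separators.
_FINE = re.compile(r"[A-Z][a-z]+|[a-z]+|[A-Z]|\d+|[^\sA-Za-z0-9_]+")


def split_coarse(name):
    """Split only on whitespace and underscores."""
    return [part for part in re.split(r"[\s_]+", name) if part]


def split_fine(name):
    """Also split on case changes and letter-digit boundaries."""
    return _FINE.findall(name)


if __name__ == "__main__":
    for sample in ("baby lions at play", "saturday_morning12",
                   "ImpossibleFork", "xy39mGWbosjY", "rz4sdko-28dbRW00u"):
        print(sample, split_coarse(sample), split_fine(sample))
```

The resulting tokens could then be passed to a tagger; how aggressive
the splitting should be probably depends on the actual file names.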

The tagset might be compatible with the Penn Treebank tagset:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

There is sample code producing output comparable to this demo:
http://stackoverflow.com/questions/23953709/how-do-i-tag-a-sentence-with-the-brown-or-conll2000-tagger-chunker

For the given minimal sample, the results look useful (maybe with the
exception of capitalised words sometimes being tagged as proper names,
but that might not be relevant here).
Of course, this approach does not provide any score by itself, but you
could check the proportion of recognised "words" relative to the total
number of "words" in each filename.
Training the tagger should also be possible in NLTK, but I have no
experience with this.
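
A minimal sketch of the proportion idea follows, with some assumptions
of mine: the set KNOWN_WORDS is only a stand-in for "recognised by the
tagger" (in practice you would ask the tagger or a real word list),
digit runs are ignored rather than counted, and the thresholds 0.25 and
0.75 are arbitrary values to be tuned. The names recognised_ratio and
classify are hypothetical.

```python
import re

# Alphabetic tokens only: capitalised word, lowercase run, or single capital.
# Digits, underscores and punctuation are skipped (an assumption of this sketch).
_WORD = re.compile(r"[A-Z][a-z]+|[a-z]+|[A-Z]")

# Hypothetical stand-in for "words the tagger recognises".
KNOWN_WORDS = {"baby", "lions", "at", "play", "saturday", "morning",
               "impossible", "fork", "fukushima"}


def recognised_ratio(name, known=KNOWN_WORDS):
    """Return the fraction of alphabetic tokens found in `known` (0.0 to 1.0)."""
    tokens = _WORD.findall(name)
    if not tokens:
        return 0.0  # nothing alphabetic to recognise
    hits = sum(1 for tok in tokens if tok.lower() in known)
    return hits / len(tokens)


def classify(name, low=0.25, high=0.75):
    """Map the score to one of three groups; `low` and `high` are tunable."""
    score = recognised_ratio(name)
    if score >= high:
        return "non-random"
    if score <= low:
        return "random"
    return "unsure"


if __name__ == "__main__":
    for sample in ("baby lions at play", "saturday_morning12",
                   "xy39mGWbosjY", "rz4sdko-28dbRW00u"):
        print(sample, recognised_ratio(sample), classify(sample))
```

Since wrongly labelling a meaningful name as random is the worse error
for you, keeping `low` small should push doubtful names into the
"unsure" group instead.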

regards,
     vbr

Thread

Catogorising strings into random versus non-random Steven D'Aprano <steve@pearwood.info> - 2015-12-21 14:01 +1100
  Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random) Ben Finney <ben+python@benfinney.id.au> - 2015-12-21 14:45 +1100
    Re: Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-12-21 19:47 +1100
  Re: Catogorising strings into random versus non-random Chris Angelico <rosuav@gmail.com> - 2015-12-21 15:22 +1100
    Re: Catogorising strings into random versus non-random Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-12-21 19:57 +1100
    Re: Catogorising strings into random versus non-random Rick Johnson <rantingrickjohnson@gmail.com> - 2015-12-21 17:45 -0800
  Re: Catogorising strings into random versus non-random Peter Otten <__peter__@web.de> - 2015-12-21 09:24 +0100
    Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 10:56 +0100
      Re: Catogorising strings into random versus non-random Steven D'Aprano <steve@pearwood.info> - 2015-12-21 21:36 +1100
        Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 11:53 +0100
          Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 11:56 +0100
  Re: Catogorising strings into random versus non-random Vlastimil Brom <vlastimil.brom@gmail.com> - 2015-12-21 14:25 +0100
  Re: Catogorising strings into random versus non-random Vincent Davis <vincent@vincentdavis.net> - 2015-12-21 07:51 -0600
  Re: Catogorising strings into random versus non-random duncan smith <duncan@invalid.invalid> - 2015-12-21 16:40 +0000
    Re: Catogorising strings into random versus non-random Ian Kelly <ian.g.kelly@gmail.com> - 2015-12-21 09:49 -0700
      Re: Catogorising strings into random versus non-random duncan smith <duncan@invalid.invalid> - 2015-12-21 17:41 +0000
    Re: Catogorising strings into random versus non-random Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-12-21 17:09 +0000
  Re: Catogorising strings into random versus non-random Paul Rubin <no.email@nospam.invalid> - 2015-12-21 09:20 -0800
