Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #100678

Re: Catogorising strings into random versus non-random

Subject Re: Catogorising strings into random versus non-random
Newsgroups comp.lang.python
References <56776b9d$0$1615$c3e8da3$5496439d@news.astraweb.com>
From duncan smith <duncan@invalid.invalid>
Message-ID <oYVdy.22469$Hz3.17030@fx43.iad> (permalink)
Organization blocknews - www.blocknews.net
Date 2015-12-21 16:40 +0000

Show all headers | View raw


On 21/12/15 03:01, Steven D'Aprano wrote:
> I have a large number of strings (originally file names) which tend to fall
> into two groups. Some are human-meaningful, but not necessarily dictionary
> words e.g.:
> 
> 
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
> 
> 
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
> 
> 
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
> 
> 
> Let's call the second group "random" and the first "non-random", without
> getting bogged down into arguments about whether they are really random or
> not. I wish to process the strings and automatically determine whether each
> string is random or not. I need to split the strings into three groups:
> 
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
> 
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.
> 
> Strings are *mostly* ASCII but may include a few non-ASCII characters.
> 
> Note that false positives (detecting a meaningful non-random string as
> random) is worse for me than false negatives (miscategorising a random
> string as non-random).
> 
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:
> 
> - I think nltk has a "language detection" function, would that be suitable?
> 
> - If not nltk, are there are suitable language detection libraries?
> 
> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
> 
> - How about Bayesian filters, e.g. SpamBayes?
> 
> 
> 
> 

Finite state machine / transition matrix. Learn from some English text
source. Then process your strings by lower casing, replacing underscores
with spaces, removing trailing numeric characters etc. Base your score
on something like the mean transition probability. I'd expect to see two
pretty well separated groups of scores.

Duncan

Back to comp.lang.python | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

Catogorising strings into random versus non-random Steven D'Aprano <steve@pearwood.info> - 2015-12-21 14:01 +1100
  Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random) Ben Finney <ben+python@benfinney.id.au> - 2015-12-21 14:45 +1100
    Re: Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-12-21 19:47 +1100
  Re: Catogorising strings into random versus non-random Chris Angelico <rosuav@gmail.com> - 2015-12-21 15:22 +1100
    Re: Catogorising strings into random versus non-random Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-12-21 19:57 +1100
    Re: Catogorising strings into random versus non-random Rick Johnson <rantingrickjohnson@gmail.com> - 2015-12-21 17:45 -0800
  Re: Catogorising strings into random versus non-random Peter Otten <__peter__@web.de> - 2015-12-21 09:24 +0100
    Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 10:56 +0100
      Re: Catogorising strings into random versus non-random Steven D'Aprano <steve@pearwood.info> - 2015-12-21 21:36 +1100
        Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 11:53 +0100
          Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 11:56 +0100
  Re: Catogorising strings into random versus non-random Vlastimil Brom <vlastimil.brom@gmail.com> - 2015-12-21 14:25 +0100
  Re: Catogorising strings into random versus non-random Vincent Davis <vincent@vincentdavis.net> - 2015-12-21 07:51 -0600
  Re: Catogorising strings into random versus non-random duncan smith <duncan@invalid.invalid> - 2015-12-21 16:40 +0000
    Re: Catogorising strings into random versus non-random Ian Kelly <ian.g.kelly@gmail.com> - 2015-12-21 09:49 -0700
      Re: Catogorising strings into random versus non-random duncan smith <duncan@invalid.invalid> - 2015-12-21 17:41 +0000
    Re: Catogorising strings into random versus non-random Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-12-21 17:09 +0000
  Re: Catogorising strings into random versus non-random Paul Rubin <no.email@nospam.invalid> - 2015-12-21 09:20 -0800

csiph-web