Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #100678
| Subject | Re: Catogorising strings into random versus non-random |
|---|---|
| Newsgroups | comp.lang.python |
| References | <56776b9d$0$1615$c3e8da3$5496439d@news.astraweb.com> |
| From | duncan smith <duncan@invalid.invalid> |
| Message-ID | <oYVdy.22469$Hz3.17030@fx43.iad> (permalink) |
| Organization | blocknews - www.blocknews.net |
| Date | 2015-12-21 16:40 +0000 |
On 21/12/15 03:01, Steven D'Aprano wrote: > I have a large number of strings (originally file names) which tend to fall > into two groups. Some are human-meaningful, but not necessarily dictionary > words e.g.: > > > baby lions at play > saturday_morning12 > Fukushima > ImpossibleFork > > > (note that some use underscores, others spaces, and some CamelCase) while > others are completely meaningless (or mostly so): > > > xy39mGWbosjY > 9sjz7s8198ghwt > rz4sdko-28dbRW00u > > > Let's call the second group "random" and the first "non-random", without > getting bogged down into arguments about whether they are really random or > not. I wish to process the strings and automatically determine whether each > string is random or not. I need to split the strings into three groups: > > - those that I'm confident are random > - those that I'm unsure about > - those that I'm confident are non-random > > Ideally, I'll get some sort of numeric score so I can tweak where the > boundaries fall. > > Strings are *mostly* ASCII but may include a few non-ASCII characters. > > Note that false positives (detecting a meaningful non-random string as > random) is worse for me than false negatives (miscategorising a random > string as non-random). > > Does anyone have any suggestions for how to do this? Preferably something > already existing. I have some thoughts and/or questions: > > - I think nltk has a "language detection" function, would that be suitable? > > - If not nltk, are there are suitable language detection libraries? > > - Is this the sort of problem that neural networks are good at solving? > Anyone know a really good tutorial for neural networks in Python? > > - How about Bayesian filters, e.g. SpamBayes? > > > > Finite state machine / transition matrix. Learn from some English text source. Then process your strings by lower casing, replacing underscores with spaces, removing trailing numeric characters etc. Base your score on something like the mean transition probability. I'd expect to see two pretty well separated groups of scores. Duncan
Back to comp.lang.python | Previous | Next — Previous in thread | Next in thread | Find similar | Unroll thread
Catogorising strings into random versus non-random Steven D'Aprano <steve@pearwood.info> - 2015-12-21 14:01 +1100
Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random) Ben Finney <ben+python@benfinney.id.au> - 2015-12-21 14:45 +1100
Re: Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-12-21 19:47 +1100
Re: Catogorising strings into random versus non-random Chris Angelico <rosuav@gmail.com> - 2015-12-21 15:22 +1100
Re: Catogorising strings into random versus non-random Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-12-21 19:57 +1100
Re: Catogorising strings into random versus non-random Rick Johnson <rantingrickjohnson@gmail.com> - 2015-12-21 17:45 -0800
Re: Catogorising strings into random versus non-random Peter Otten <__peter__@web.de> - 2015-12-21 09:24 +0100
Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 10:56 +0100
Re: Catogorising strings into random versus non-random Steven D'Aprano <steve@pearwood.info> - 2015-12-21 21:36 +1100
Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 11:53 +0100
Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 11:56 +0100
Re: Catogorising strings into random versus non-random Vlastimil Brom <vlastimil.brom@gmail.com> - 2015-12-21 14:25 +0100
Re: Catogorising strings into random versus non-random Vincent Davis <vincent@vincentdavis.net> - 2015-12-21 07:51 -0600
Re: Catogorising strings into random versus non-random duncan smith <duncan@invalid.invalid> - 2015-12-21 16:40 +0000
Re: Catogorising strings into random versus non-random Ian Kelly <ian.g.kelly@gmail.com> - 2015-12-21 09:49 -0700
Re: Catogorising strings into random versus non-random duncan smith <duncan@invalid.invalid> - 2015-12-21 17:41 +0000
Re: Catogorising strings into random versus non-random Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-12-21 17:09 +0000
Re: Catogorising strings into random versus non-random Paul Rubin <no.email@nospam.invalid> - 2015-12-21 09:20 -0800
csiph-web