
Re: Catogorising strings into random versus non-random

From Christian Gollwitzer <auriocus@gmx.de>
Newsgroups comp.lang.python
Subject Re: Catogorising strings into random versus non-random
Date 2015-12-21 11:53 +0100
Organization A noiseless patient Spider
Message-ID <n58lhv$nok$1@dont-email.me>
References <56776b9d$0$1615$c3e8da3$5496439d@news.astraweb.com> <mailman.16.1450686280.2237.python-list@python.org> <n58i7f$cbd$1@dont-email.me> <5677d62e$0$1605$c3e8da3$5496439d@news.astraweb.com>



On 21.12.15 at 11:36, Steven D'Aprano wrote:
> On Mon, 21 Dec 2015 08:56 pm, Christian Gollwitzer wrote:
>
>> Apfelkiste:Tests chris$ python score_my.py
>> -8.74  baby lions at play
>> -7.63  saturday_morning12
>> -6.38  Fukushima
>> -5.72  ImpossibleFork
>> -10.6  xy39mGWbosjY
>> -12.9  9sjz7s8198ghwt
>> -12.1  rz4sdko-28dbRW00u
>> Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
>> -9.43  bnsip atl ayba loy
>
> Thanks Christian and Peter for the suggestion, I'll certainly investigate
> this further.
>
> But the scoring doesn't seem very good. "baby lions at play" is 100% English
> words, and ought to have a radically different score from (say)
> xy39mGWbosjY which is extremely non-English like. (How many English words
> do you know of with W, X, two Y, and J?) And yet they are only two units
> apart. "baby lions..." is a score almost as negative as the authentic
> gibberish, while Fukushima (a Japanese word) has a much less negative
> score.

It is the spaces, which do not occur in the training word list (I
mentioned that above, maybe not prominently enough):
/usr/share/dict/words contains one word per line. The underscore _ is
probably what drags "saturday_morning12" down, just as the spaces drag
"baby lions at play" down. Using trigraphs (three-character sequences):


Apfelkiste:Tests chris$ python score_my.py
-11.5  baby lions at play
-9.88  saturday_morning12
-9.85  Fukushima
-7.68  ImpossibleFork
-13.4  xy39mGWbosjY
-14.2  9sjz7s8198ghwt
-14.2  rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'babylionsatplay'
-8.74  babylionsatplay
Apfelkiste:Tests chris$ python score_my.py 'saturdaymorning12'
-8.93  saturdaymorning12
Apfelkiste:Tests chris$

So for the spaces, either use proper training material (some long
corpus from Wikipedia or the like) with punctuation removed, so that the
model picks up the correct probabilities at word boundaries, or
preprocess the input by removing the spaces.
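
A rough sketch of the second option (removing the spaces before scoring). This is not the actual score_my.py, which is not reproduced in this post. The names normalise, train and score, the add-one smoothing, the lower-casing, and the "^" / "$" boundary markers are all assumptions made for illustration, and the toy word list only stands in for /usr/share/dict/words.

```python
import math
import re
from collections import Counter


def normalise(text):
    """Lower-case and drop everything that is not an ASCII letter or digit.

    Assumption: spaces, underscores and punctuation are simply removed,
    so 'baby lions at play' is scored as 'babylionsatplay'.
    """
    return re.sub(r"[^a-z0-9]", "", text.lower())


def train(words, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    ngrams, contexts = Counter(), Counter()
    for word in words:
        # '^' pads the start, '$' marks the end (hypothetical markers).
        padded = "^" * (n - 1) + normalise(word) + "$"
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts


def score(text, model, n=3, alphabet_size=37):
    """Mean log-probability per n-gram; closer to zero means more word-like.

    Uses add-one smoothing so unseen n-grams do not give log(0).
    alphabet_size = 26 letters + 10 digits + the end marker (assumed).
    """
    ngrams, contexts = model
    padded = "^" * (n - 1) + normalise(text) + "$"
    total = 0.0
    count = 0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + alphabet_size)
        total += math.log(p)
        count += 1
    return total / count


# Toy training data; a real run would read a word list or a long corpus.
toy_words = ["baby", "lions", "at", "play", "saturday", "morning"]
model = train(toy_words)
for candidate in ["baby lions at play", "saturday_morning12", "xy39mGWbosjY"]:
    print("%6.2f  %s" % (score(candidate, model), candidate))
```

Because normalise is applied both when training and when scoring, a string with spaces gets exactly the same score as the same string with the spaces removed. With the first option (training on a real corpus), the space would instead be kept as an ordinary character in the alphabet.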

	Christian
