| From | Christian Gollwitzer <auriocus@gmx.de> |
|---|---|
| Newsgroups | comp.lang.python |
| Subject | Re: Catogorising strings into random versus non-random |
| Date | 2015-12-21 11:53 +0100 |
| Organization | A noiseless patient Spider |
| Message-ID | <n58lhv$nok$1@dont-email.me> |
| References | <56776b9d$0$1615$c3e8da3$5496439d@news.astraweb.com> <mailman.16.1450686280.2237.python-list@python.org> <n58i7f$cbd$1@dont-email.me> <5677d62e$0$1605$c3e8da3$5496439d@news.astraweb.com> |
On 21.12.15 at 11:36, Steven D'Aprano wrote:
> On Mon, 21 Dec 2015 08:56 pm, Christian Gollwitzer wrote:
>
>> Apfelkiste:Tests chris$ python score_my.py
>> -8.74 baby lions at play
>> -7.63 saturday_morning12
>> -6.38 Fukushima
>> -5.72 ImpossibleFork
>> -10.6 xy39mGWbosjY
>> -12.9 9sjz7s8198ghwt
>> -12.1 rz4sdko-28dbRW00u
>> Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
>> -9.43 bnsip atl ayba loy
>
> Thanks Christian and Peter for the suggestion, I'll certainly investigate
> this further.
>
> But the scoring doesn't seem very good. "baby lions at play" is 100% English
> words, and ought to have a radically different score from (say)
> xy39mGWbosjY which is extremely non-English like. (How many English words
> do you know of with W, X, two Y, and J?) And yet they are only two units
> apart. "baby lions..." is a score almost as negative as the authentic
> gibberish, while Fukushima (a Japanese word) has a much less negative
> score.

It is the spaces, which do not occur in the training wordlist (I
mentioned that above, maybe not prominently enough).
/usr/share/dict/words contains one word per line. The underscore _ is
probably pulling saturday_morning12 down, while the spaces pull the
baby lions down.

Using trigraphs:

Apfelkiste:Tests chris$ python score_my.py
-11.5 baby lions at play
-9.88 saturday_morning12
-9.85 Fukushima
-7.68 ImpossibleFork
-13.4 xy39mGWbosjY
-14.2 9sjz7s8198ghwt
-14.2 rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'babylionsatplay'
-8.74 babylionsatplay
Apfelkiste:Tests chris$ python score_my.py 'saturdaymorning12'
-8.93 saturdaymorning12
Apfelkiste:Tests chris$

So for the spaces, there are two options. Either use proper training
material (some long corpus from Wikipedia or such) with punctuation
removed, so that it catches the correct probabilities at word
boundaries. Or preprocess the input by removing the spaces.

Christian
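The second option (stripping separators before scoring) can be sketched roughly as follows. This is not the score_my.py from the post, which is not shown in this message; it is a minimal illustration that assumes a plain character n-gram frequency model with add-one smoothing, and the names `ngrams`, `train`, and `score`, the `^`/`$` padding, and the `strip` parameter are all hypothetical.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return the overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(words, n=3):
    """Count character n-grams over a wordlist (one word per entry)."""
    counts = Counter()
    for word in words:
        # Assumption: pad with '^' and '$' so word starts and ends are learned.
        padded = "^" * (n - 1) + word.lower() + "$"
        counts.update(ngrams(padded, n))
    return counts


def score(text, counts, n=3, strip=" _"):
    """Mean log-probability per n-gram; closer to zero is more word-like."""
    # Preprocess: drop separators that never occur in the training wordlist.
    for ch in strip:
        text = text.replace(ch, "")
    padded = "^" * (n - 1) + text.lower() + "$"
    grams = ngrams(padded, n)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen n-grams
    # Add-one smoothing, so an unseen n-gram does not produce log(0).
    logp = sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams)
    return logp / len(grams)
```

With a real wordlist one would train via something like `train(line.strip() for line in open("/usr/share/dict/words"))` and then compare `score("baby lions at play", counts)` against `score("xy39mGWbosjY", counts)`. The absolute numbers will differ from the ones quoted above, since the model and smoothing here are only assumptions.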
Catogorising strings into random versus non-random Steven D'Aprano <steve@pearwood.info> - 2015-12-21 14:01 +1100
Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random) Ben Finney <ben+python@benfinney.id.au> - 2015-12-21 14:45 +1100
Re: Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-12-21 19:47 +1100
Re: Catogorising strings into random versus non-random Chris Angelico <rosuav@gmail.com> - 2015-12-21 15:22 +1100
Re: Catogorising strings into random versus non-random Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-12-21 19:57 +1100
Re: Catogorising strings into random versus non-random Rick Johnson <rantingrickjohnson@gmail.com> - 2015-12-21 17:45 -0800
Re: Catogorising strings into random versus non-random Peter Otten <__peter__@web.de> - 2015-12-21 09:24 +0100
Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 10:56 +0100
Re: Catogorising strings into random versus non-random Steven D'Aprano <steve@pearwood.info> - 2015-12-21 21:36 +1100
Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 11:53 +0100
Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 11:56 +0100
Re: Catogorising strings into random versus non-random Vlastimil Brom <vlastimil.brom@gmail.com> - 2015-12-21 14:25 +0100
Re: Catogorising strings into random versus non-random Vincent Davis <vincent@vincentdavis.net> - 2015-12-21 07:51 -0600
Re: Catogorising strings into random versus non-random duncan smith <duncan@invalid.invalid> - 2015-12-21 16:40 +0000
Re: Catogorising strings into random versus non-random Ian Kelly <ian.g.kelly@gmail.com> - 2015-12-21 09:49 -0700
Re: Catogorising strings into random versus non-random duncan smith <duncan@invalid.invalid> - 2015-12-21 17:41 +0000
Re: Catogorising strings into random versus non-random Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-12-21 17:09 +0000
Re: Catogorising strings into random versus non-random Paul Rubin <no.email@nospam.invalid> - 2015-12-21 09:20 -0800