| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Newsgroups | comp.lang.python |
| Subject | Re: Catogorising strings into random versus non-random |
| Date | 2015-12-21 15:22 +1100 |
| Message-ID | <mailman.14.1450671763.2237.python-list@python.org> |
| References | <56776b9d$0$1615$c3e8da3$5496439d@news.astraweb.com> |
On Mon, Dec 21, 2015 at 2:01 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> I have a large number of strings (originally file names) which tend to fall
> into two groups. Some are human-meaningful, but not necessarily dictionary
> words e.g.:
>
>
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
>
>
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
>
>
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
>
> I need to split the strings into three groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.

The first thing that comes to my mind is poking the string into a search engine and seeing how many results come back. You might need to do some preprocessing to recognize multi-word forms (maybe a handful of recognized cases like snake_case, CamelCase, CamelCasewiththeLittleWordsLeftUnchanged, etc), but doing that manually on the above text gives me:

* baby lions at play
* saturday morning 12
* fukushima
* impossible fork
* xy 39 mgwbosjy
* 9 sjz 7 s 8198 ghwt
* rz 4 sdko 28 dbrw 00 u

Putting those into Google without quotes yields:

* About 23,800,000 results
* About 227,000,000 results
* About 32,500,000 results
* About 16,400,000 results
* About 1,180 results
* 7 results
* About 30,300 results

DuckDuckGo doesn't give a result count, so I skipped it. Yahoo search yielded:

* 6,040,000 results
* 123,000,000 results
* 3,920,000 results
* 720,000 results
* No results at all
* No results at all
* 2 results

Bing produces much more chaotic results, though:

* 34,000,000 RESULTS
* 15,600,000 RESULTS
* 11,000,000 RESULTS
* 1,620,000 RESULTS
* 5,720,000 RESULTS
* 1,580,000,000 RESULTS
* 3,380,000 RESULTS

This suggests that search engine results MAY be useful, but in some cases, tweaks may be necessary (I couldn't force Bing to do phrase search, for some reason probably related to my inexperience with it), and also that the boundary between "meaningful" and "non-meaningful" will depend on the engine used (I'd use 1,000,000 as the boundary with Google, but probably 100,000 with Yahoo). You might want to handle numerics differently, too - converting "9" into "nine" could improve the result reliability.

How many of these keywords would you be looking up, and would a network transaction (a search engine API call) for each one be too expensive?

ChrisA
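
A minimal Python sketch of the preprocessing and boundary ideas above, not the method actually used in the thread. The function names (`split_words`, `spell_digits`, `classify`), the regular expression, and the `margin` factor that widens the single boundary into an "unsure" band are all assumptions made for illustration; only the 1,000,000 (Google) and 100,000 (Yahoo) boundaries come from the message. Result counts are supplied by the caller, since no particular search API is assumed here.

```python
import re

# Boundaries suggested in the message; everything else below is hypothetical.
THRESHOLDS = {"google": 1_000_000, "yahoo": 100_000}

# One token is: a run of capitals not followed by a lowercase letter (acronym),
# an optionally capitalised lowercase word, or a run of digits.
# Spaces, underscores, hyphens and other separators are simply skipped.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")

# Spelled-out forms for single digits only (an assumption; the message just
# mentions turning "9" into "nine").
_DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def split_words(name):
    """Split a name at separators, case changes and letter/digit boundaries."""
    return [token.lower() for token in _TOKEN.findall(name)]


def spell_digits(tokens):
    """Replace single-digit tokens with words; leave longer numbers alone."""
    return [_DIGIT_WORDS.get(token, token) for token in tokens]


def classify(hits, threshold, margin=10):
    """Map a result count onto three groups around an engine's boundary.

    Counts at least `margin` times the boundary are 'non-random', counts at
    most 1/`margin` of it are 'random', and anything in between is 'unsure'.
    """
    if hits >= threshold * margin:
        return "non-random"
    if hits * margin <= threshold:
        return "random"
    return "unsure"


if __name__ == "__main__":
    for name in ["baby lions at play", "saturday_morning12", "ImpossibleFork"]:
        print(name, "->", " ".join(split_words(name)))
    # Counts copied from the Google and Yahoo lists in the message.
    print(classify(23_800_000, THRESHOLDS["google"]))
    print(classify(720_000, THRESHOLDS["yahoo"]))
```

This splitter breaks at every case change, so mixed-case gibberish such as xy39mGWbosjY comes out in more fragments than the hand-made split in the message. Adjusting `margin` moves the edges of the "unsure" band without touching the per-engine boundary.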