Re: Catogorising strings into random versus non-random

From	Chris Angelico <rosuav@gmail.com>
Newsgroups	comp.lang.python
Subject	Re: Catogorising strings into random versus non-random
Date	2015-12-21 15:22 +1100
Message-ID	<mailman.14.1450671763.2237.python-list@python.org>
References	<56776b9d$0$1615$c3e8da3$5496439d@news.astraweb.com>

On Mon, Dec 21, 2015 at 2:01 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> I have a large number of strings (originally file names) which tend to fall
> into two groups. Some are human-meaningful, but not necessarily dictionary
> words e.g.:
>
>
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
>
>
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
>
>
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
>
> I need to split the strings into three groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.

The first thing that comes to my mind is poking the string into a
search engine and seeing how many results come back. You might need to
do some preprocessing to recognize multi-word forms (maybe a handful
of recognized cases like snake_case, CamelCase,
CamelCasewiththeLittleWordsLeftUnchanged, etc), but doing that
manually on the above text gives me:

* baby lions at play
* saturday morning 12
* fukushima
* impossible fork
* xy 39 mgwbosjy
* 9 sjz 7 s 8198 ghwt
* rz 4 sdko 28 dbrw 00 u
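
A rough sketch of how that preprocessing might be automated. The rules
here are my own guess at what reproduces the list above, not a tested
recipe: spaces, underscores and hyphens separate words, letters and
digits are split apart, and a run of letters is only treated as
CamelCase if it consists entirely of capitalised words. The name
split_words and the regexes are made up for this example, and it only
handles ASCII letters.

```python
import re

# Runs of letters or runs of digits; anything else (space, "_", "-")
# acts as a separator.
_RUNS = re.compile(r"[A-Za-z]+|[0-9]+")
# Assumption: a letter run is CamelCase only if it is made up entirely of
# capitalised words, optionally led by one lowercase word. This keeps
# junk such as "mGWbosjY" in one piece.
_CAMEL = re.compile(r"(?:[a-z]+|[A-Z][a-z]+)(?:[A-Z][a-z]+)+")
_CAMEL_WORD = re.compile(r"[A-Z]?[a-z]+")

def split_words(name):
    """Return a lowercased, space-separated search query for a file name."""
    words = []
    for run in _RUNS.findall(name):
        if _CAMEL.fullmatch(run):
            words.extend(_CAMEL_WORD.findall(run))
        else:
            words.append(run)
    return " ".join(word.lower() for word in words)

for name in ["saturday_morning12", "ImpossibleFork", "rz4sdko-28dbRW00u"]:
    print(split_words(name))
```

The CamelCasewiththeLittleWordsLeftUnchanged form would still need
something smarter, probably a dictionary lookup, since nothing in the
string marks where the little words start.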

Putting those into Google without quotes yields:

* About 23,800,000 results
* About 227,000,000 results
* About 32,500,000 results
* About 16,400,000 results
* About 1,180 results
* 7 results
* About 30,300 results

DuckDuckGo doesn't give a result count, so I skipped it. Yahoo search yielded:

* 6,040,000 results
* 123,000,000 results
* 3,920,000 results
* 720,000 results
* No results at all
* No results at all
* 2 results

Bing produces much more chaotic results, though:

* 34,000,000 RESULTS
* 15,600,000 RESULTS
* 11,000,000 RESULTS
* 1,620,000 RESULTS
* 5,720,000 RESULTS
* 1,580,000,000 RESULTS
* 3,380,000 RESULTS

This suggests that search engine results MAY be useful, but that some
engines will need tweaking (I couldn't force Bing to do phrase search,
probably because of my inexperience with it). The boundary between
"meaningful" and "non-meaningful" will also depend on the engine used:
I'd use 1,000,000 as the boundary with Google, but probably 100,000
with Yahoo. You might want to handle numerics differently, too -
converting "9" into "nine" could improve the result reliability.
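
One way to turn a result count into the numeric score you asked for,
as a minimal sketch: measure how many orders of magnitude the count
sits above or below the per-engine boundary, and call anything close
to the boundary "unsure". The function names and the width of the
unsure band (0.5, roughly a factor of three either side) are
assumptions of mine for illustration; only the two boundary figures
come from the numbers above.

```python
import math

def score(result_count, boundary):
    """Orders of magnitude above (positive) or below (negative) the boundary."""
    # The +1 keeps a count of zero ("No results at all") finite.
    return math.log10(result_count + 1) - math.log10(boundary)

def classify(result_count, boundary, band=0.5):
    """Sort a count into one of three groups around the boundary.

    band is a hypothetical tuning knob: scores within +/- band of the
    boundary are reported as "unsure".
    """
    s = score(result_count, boundary)
    if s >= band:
        return "non-random"
    if s <= -band:
        return "random"
    return "unsure"

# Boundaries suggested above; the counts are the ones each engine reported.
print(classify(16_400_000, boundary=1_000_000))  # Google, "impossible fork"
print(classify(30_300, boundary=1_000_000))      # Google, "rz 4 sdko 28 dbrw 00 u"
print(classify(0, boundary=100_000))             # Yahoo, no results at all
```

Widening or narrowing band is how you'd tweak where the three groups
fall without touching the per-engine boundary.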

How many of these keywords would you be looking up, and would a
network transaction (a search engine API call) for each one be too
expensive?

ChrisA
