| Started by | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| First post | 2015-12-21 14:01 +1100 |
| Last post | 2015-12-21 09:20 -0800 |
| Articles | 18 — 13 participants |
Catogorising strings into random versus non-random Steven D'Aprano <steve@pearwood.info> - 2015-12-21 14:01 +1100
Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random) Ben Finney <ben+python@benfinney.id.au> - 2015-12-21 14:45 +1100
Re: Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-12-21 19:47 +1100
Re: Catogorising strings into random versus non-random Chris Angelico <rosuav@gmail.com> - 2015-12-21 15:22 +1100
Re: Catogorising strings into random versus non-random Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-12-21 19:57 +1100
Re: Catogorising strings into random versus non-random Rick Johnson <rantingrickjohnson@gmail.com> - 2015-12-21 17:45 -0800
Re: Catogorising strings into random versus non-random Peter Otten <__peter__@web.de> - 2015-12-21 09:24 +0100
Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 10:56 +0100
Re: Catogorising strings into random versus non-random Steven D'Aprano <steve@pearwood.info> - 2015-12-21 21:36 +1100
Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 11:53 +0100
Re: Catogorising strings into random versus non-random Christian Gollwitzer <auriocus@gmx.de> - 2015-12-21 11:56 +0100
Re: Catogorising strings into random versus non-random Vlastimil Brom <vlastimil.brom@gmail.com> - 2015-12-21 14:25 +0100
Re: Catogorising strings into random versus non-random Vincent Davis <vincent@vincentdavis.net> - 2015-12-21 07:51 -0600
Re: Catogorising strings into random versus non-random duncan smith <duncan@invalid.invalid> - 2015-12-21 16:40 +0000
Re: Catogorising strings into random versus non-random Ian Kelly <ian.g.kelly@gmail.com> - 2015-12-21 09:49 -0700
Re: Catogorising strings into random versus non-random duncan smith <duncan@invalid.invalid> - 2015-12-21 17:41 +0000
Re: Catogorising strings into random versus non-random Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-12-21 17:09 +0000
Re: Catogorising strings into random versus non-random Paul Rubin <no.email@nospam.invalid> - 2015-12-21 09:20 -0800
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2015-12-21 14:01 +1100 |
| Subject | Catogorising strings into random versus non-random |
| Message-ID | <56776b9d$0$1615$c3e8da3$5496439d@news.astraweb.com> |
I have a large number of strings (originally file names) which tend to fall into two groups. Some are human-meaningful, but not necessarily dictionary words e.g.:

baby lions at play
saturday_morning12
Fukushima
ImpossibleFork

(note that some use underscores, others spaces, and some CamelCase) while others are completely meaningless (or mostly so):

xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u

Let's call the second group "random" and the first "non-random", without getting bogged down into arguments about whether they are really random or not. I wish to process the strings and automatically determine whether each string is random or not. I need to split the strings into three groups:

- those that I'm confident are random
- those that I'm unsure about
- those that I'm confident are non-random

Ideally, I'll get some sort of numeric score so I can tweak where the boundaries fall.

Strings are *mostly* ASCII but may include a few non-ASCII characters.

Note that false positives (detecting a meaningful non-random string as random) is worse for me than false negatives (miscategorising a random string as non-random).

Does anyone have any suggestions for how to do this? Preferably something already existing. I have some thoughts and/or questions:

- I think nltk has a "language detection" function, would that be suitable?
- If not nltk, are there any suitable language detection libraries?
- Is this the sort of problem that neural networks are good at solving? Anyone know a really good tutorial for neural networks in Python?
- How about Bayesian filters, e.g. SpamBayes?

--
Steven
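Whatever scoring method ends up being used, the three-way split with tweakable boundaries described in this post could be sketched roughly as follows. The function name `categorise`, the two threshold values, and the convention that a higher score means "more meaningful" are all assumptions made for illustration; none of them come from the thread.

```python
def categorise(score, low=0.4, high=0.7):
    """Map a numeric 'meaningfulness' score to one of three groups.

    Assumed convention: a higher score means the string looks more
    meaningful. The low and high boundaries are hypothetical and are
    meant to be tuned by hand.
    """
    if score < low:
        return "random"
    if score >= high:
        return "non-random"
    return "unsure"
```

Because a meaningful string wrongly flagged as random is the more costly mistake here, lowering `low` would make the "random" bucket more conservative at the price of a larger "unsure" bucket.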
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2015-12-21 14:45 +1100 |
| Subject | Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random) |
| Message-ID | <mailman.12.1450669549.2237.python-list@python.org> |
| In reply to | #100643 |
Steven D'Aprano <steve@pearwood.info> writes:

> Let's call the second group "random" and the first "non-random",
> without getting bogged down into arguments about whether they are
> really random or not.

I think we should discuss it, even at risk of getting bogged down. As you know better than I, “random” is not an observable property of the value, but of the process that produced it.

So, I don't think “random” is at all helpful as a descriptor of the criteria you need for discriminating these values.

Can you give a better definition of what criteria distinguish the values, based only on their observable properties?

You used “meaningless”; that seems at least more hopeful as a criterion we can use by examining text values. So, what counts as meaningless?

> I wish to process the strings and automatically determine whether each
> string is random or not. I need to split the strings into three groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.

Perhaps you could measure Shannon entropy (“expected information value”) <URL:https://en.wikipedia.org/wiki/Entropy_%28information_theory%29> as a proxy? Or maybe I don't quite understand the criteria.

--
 \     “Actually I made up the term “object-oriented”, and I can tell |
  `\   you I did not have C++ in mind.” —Alan Kay, creator of |
_o__)  Smalltalk, at OOPSLA 1997 |
Ben Finney
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-12-21 19:47 +1100 |
| Subject | Re: Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random) |
| Message-ID | <5677bcb2$0$2890$c3e8da3$76491128@news.astraweb.com> |
| In reply to | #100644 |
On Monday 21 December 2015 14:45, Ben Finney wrote:

> Steven D'Aprano <steve@pearwood.info> writes:
>
>> Let's call the second group "random" and the first "non-random",
>> without getting bogged down into arguments about whether they are
>> really random or not.
>
> I think we should discuss it, even at risk of getting bogged down. As
> you know better than I, “random” is not an observable property of the
> value, but of the process that produced it.
>
> So, I don't think “random” is at all helpful as a descriptor of the
> criteria you need for discriminating these values.
>
> Can you give a better definition of what criteria distinguish the
> values, based only on their observable properties?

No, not really. This *literally* is a case of "I'll know it when I see it", which suggests that some sort of machine-learning solution (neural network?) may be useful. I can train it on a bunch of strings which I can hand-classify, and let the machine pick out the correlations, then apply it to the rest of the strings.

The best I can say is that the "non-random" strings either are, or consist of, mostly English words, names, or things which look like they might be English words, containing no more than a few non-ASCII characters, punctuation, or digits.

> You used “meaningless”; that seems at least more hopeful as a criterion
> we can use by examining text values. So, what counts as meaningless?

Strings made up of random-looking sequences of characters, like you often see on sites like imgur or tumblr. Characters from non-Latin character sets that I can't read (e.g. Japanese, Korean, Arabic, etc). Jumbled up words, e.g. "python" is non-random, "nyohtp" would be random.

[...]

> Perhaps you could measure Shannon entropy (“expected information value”)
> <URL:https://en.wikipedia.org/wiki/Entropy_%28information_theory%29> as
> a proxy? Or maybe I don't quite understand the criteria.

That's a possibility. At least, it might be able to distinguish some strings, although if I understand correctly, the two strings "python" and "nhoypt" have identical entropy, so this alone won't be sufficient.

--
Steve
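The point about identical entropy can be checked with a small sketch. It assumes the simplest reading of "Shannon entropy of a string", namely the entropy of its per-character frequency distribution; the helper name `char_entropy` is made up for this example.

```python
from collections import Counter
from math import log2


def char_entropy(text):
    """Shannon entropy, in bits per character, of the character frequencies."""
    counts = Counter(text)
    n = len(text)
    # Sum -p * log2(p) over the distinct characters of the string.
    return -sum((c / n) * log2(c / n) for c in counts.values())


print(char_entropy("python"))
print(char_entropy("nhoypt"))
```

Both calls print the same value (log2(6), about 2.585 bits), because this measure only looks at how often each character occurs and not at the order in which the characters appear. Any approach that is meant to separate a word from its anagram needs to look at character order, for example through digraphs.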
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-12-21 15:22 +1100 |
| Message-ID | <mailman.14.1450671763.2237.python-list@python.org> |
| In reply to | #100643 |
On Mon, Dec 21, 2015 at 2:01 PM, Steven D'Aprano <steve@pearwood.info> wrote:

> I have a large number of strings (originally file names) which tend to fall
> into two groups. Some are human-meaningful, but not necessarily dictionary
> words e.g.:
>
>
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
>
>
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
>
>
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
>
> I need to split the strings into three groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.

The first thing that comes to my mind is poking the string into a search engine and seeing how many results come back. You might need to do some preprocessing to recognize multi-word forms (maybe a handful of recognized cases like snake_case, CamelCase, CamelCasewiththeLittleWordsLeftUnchanged, etc), but doing that manually on the above text gives me:

* baby lions at play
* saturday morning 12
* fukushima
* impossible fork
* xy 39 mgwbosjy
* 9 sjz 7 s 8198 ghwt
* rz 4 sdko 28 dbrw 00 u

Putting those into Google without quotes yields:

* About 23,800,000 results
* About 227,000,000 results
* About 32,500,000 results
* About 16,400,000 results
* About 1,180 results
* 7 results
* About 30,300 results

DuckDuckGo doesn't give a result count, so I skipped it. Yahoo search yielded:

* 6,040,000 results
* 123,000,000 results
* 3,920,000 results
* 720,000 results
* No results at all
* No results at all
* 2 results

Bing produces much more chaotic results, though:

* 34,000,000 RESULTS
* 15,600,000 RESULTS
* 11,000,000 RESULTS
* 1,620,000 RESULTS
* 5,720,000 RESULTS
* 1,580,000,000 RESULTS
* 3,380,000 RESULTS

This suggests that search engine results MAY be useful, but in some cases, tweaks may be necessary (I couldn't force Bing to do phrase search, for some reason probably related to my inexperience with it), and also that the boundary between "meaningful" and "non-meaningful" will depend on the engine used (I'd use 1,000,000 as the boundary with Google, but probably 100,000 with Yahoo). You might want to handle numerics differently, too - converting "9" into "nine" could improve the result reliability.

How many of these keywords would you be looking up, and would a network transaction (a search engine API call) for each one be too expensive?

ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-12-21 19:57 +1100 |
| Message-ID | <5677bf02$0$1530$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #100646 |
On Monday 21 December 2015 15:22, Chris Angelico wrote:

> On Mon, Dec 21, 2015 at 2:01 PM, Steven D'Aprano <steve@pearwood.info>
> wrote:
>> I have a large number of strings (originally file names) which tend to
>> fall into two groups. Some are human-meaningful, but not necessarily
>> dictionary words e.g.:

[...]

> The first thing that comes to my mind is poking the string into a
> search engine and seeing how many results come back. You might need to
> do some preprocessing to recognize multi-word forms (maybe a handful
> of recognized cases like snake_case, CamelCase,
> CamelCasewiththeLittleWordsLeftUnchanged, etc),

I could possibly split the string into "words", based on CamelCase, spaces, hyphens or underscores. That would cover most of the cases.

> How many of these keywords would you be looking up, and would a
> network transaction (a search engine API call) for each one be too
> expensive?

Tens or hundreds of thousands of strings, and yes a network transaction probably would be a bit much. I'd rather not have Google or Bing be a dependency :-)

--
Steve
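The word splitting mentioned in this reply could be sketched with the standard `re` module. The function name `split_words` and the exact rules are assumptions: the sketch only breaks at lowercase-to-uppercase transitions and at runs of spaces, hyphens, and underscores, so forms such as CamelCasewiththeLittleWordsLeftUnchanged are not handled.

```python
import re


def split_words(name):
    """Split a file name into lowercase 'words' (illustrative rules only)."""
    # Insert a space at every lowercase-to-uppercase boundary (CamelCase).
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    # Split on runs of whitespace, underscores, and hyphens; drop empties.
    return [w.lower() for w in re.split(r"[\s_\-]+", spaced) if w]


for sample in ["baby lions at play", "saturday_morning12", "ImpossibleFork"]:
    print(split_words(sample))
```

Digits are left attached to their neighbouring letters here ("morning12" stays one token); splitting at letter-digit boundaries would be a further tweak.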
| From | Rick Johnson <rantingrickjohnson@gmail.com> |
|---|---|
| Date | 2015-12-21 17:45 -0800 |
| Message-ID | <4b131565-a03e-44e1-9fb3-03efb18cb8f6@googlegroups.com> |
| In reply to | #100646 |
On Sunday, December 20, 2015 at 10:22:57 PM UTC-6, Chris Angelico wrote:

> DuckDuckGo doesn't give a result count, so I skipped it. Yahoo search yielded:

So why bother to mention it then? Is this another one of your "pikeish" propaganda campaigns?
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2015-12-21 09:24 +0100 |
| Message-ID | <mailman.16.1450686280.2237.python-list@python.org> |
| In reply to | #100643 |
Steven D'Aprano wrote:
> I have a large number of strings (originally file names) which tend to
> fall into two groups. Some are human-meaningful, but not necessarily
> dictionary words e.g.:
>
>
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
>
>
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
>
>
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
>
>
> Let's call the second group "random" and the first "non-random", without
> getting bogged down into arguments about whether they are really random or
> not. I wish to process the strings and automatically determine whether
> each string is random or not. I need to split the strings into three
> groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.
>
> Strings are *mostly* ASCII but may include a few non-ASCII characters.
>
> Note that false positives (detecting a meaningful non-random string as
> random) is worse for me than false negatives (miscategorising a random
> string as non-random).
>
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:
>
> - I think nltk has a "language detection" function, would that be
> suitable?
>
> - If not nltk, are there any suitable language detection libraries?
>
> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
>
> - How about Bayesian filters, e.g. SpamBayes?
A dead simple approach -- look at the pairs in real words and calculate the
ratio
pairs-also-found-in-real-words/num-pairs
$ cat score.py
import sys

WORDLIST = "/usr/share/dict/words"

SAMPLE = """\
baby lions at play
saturday_morning12
Fukushima
ImpossibleFork
xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u
""".splitlines()


def extract_pairs(text):
    # Yield every overlapping two-character slice (digraph) of the text.
    for i in range(len(text)-1):
        yield text[i:i+2]


def load_pairs():
    # Collect the set of digraphs that occur in the dictionary words.
    pairs = set()
    with open(WORDLIST) as f:
        for line in f:
            pairs.update(extract_pairs(line.strip()))
    return pairs


def get_score(text, popular_pairs):
    # Fraction of the text's digraphs that also occur in real words.
    m = 0
    i = 0
    for i, p in enumerate(extract_pairs(text), 1):
        if p in popular_pairs:
            m += 1
    # A text shorter than two characters has no digraphs; score it 0.
    return m/i if i else 0.0


def main():
    popular_pairs = load_pairs()
    for text in sys.argv[1:] or SAMPLE:
        score = get_score(text, popular_pairs)
        print("%4.2f %s" % (score, text))


if __name__ == "__main__":
    main()
$ python3 score.py
0.65 baby lions at play
0.76 saturday_morning12
1.00 Fukushima
0.92 ImpossibleFork
0.36 xy39mGWbosjY
0.31 9sjz7s8198ghwt
0.31 rz4sdko-28dbRW00u
However:
$ python3 -c 'import random, sys; a = list(sys.argv[1]); random.shuffle(a); print("".join(a))' 'baby lions at play'
bnsip atl ayba loy
$ python3 score.py 'bnsip atl ayba loy'
0.65 bnsip atl ayba loy
| From | Christian Gollwitzer <auriocus@gmx.de> |
|---|---|
| Date | 2015-12-21 10:56 +0100 |
| Message-ID | <n58i7f$cbd$1@dont-email.me> |
| In reply to | #100650 |
Am 21.12.15 um 09:24 schrieb Peter Otten:
> Steven D'Aprano wrote:
>
>> I have a large number of strings (originally file names) which tend to
>> fall into two groups. Some are human-meaningful, but not necessarily
>> dictionary words e.g.:
>>
>>
>> baby lions at play
>> saturday_morning12
>> Fukushima
>> ImpossibleFork
>>
>>
>> (note that some use underscores, others spaces, and some CamelCase) while
>> others are completely meaningless (or mostly so):
>>
>>
>> xy39mGWbosjY
>> 9sjz7s8198ghwt
>> rz4sdko-28dbRW00u
>>
>>
>> Let's call the second group "random" and the first "non-random", without
>> getting bogged down into arguments about whether they are really random or
>> not. I wish to process the strings and automatically determine whether
>> each string is random or not. I need to split the strings into three
>> groups:
>>
>> - those that I'm confident are random
>> - those that I'm unsure about
>> - those that I'm confident are non-random
>>
>> Ideally, I'll get some sort of numeric score so I can tweak where the
>> boundaries fall.
>>
>> Strings are *mostly* ASCII but may include a few non-ASCII characters.
>>
>> Note that false positives (detecting a meaningful non-random string as
>> random) is worse for me than false negatives (miscategorising a random
>> string as non-random).
>>
>> Does anyone have any suggestions for how to do this? Preferably something
>> already existing. I have some thoughts and/or questions:
>>
>> - I think nltk has a "language detection" function, would that be
>> suitable?
>>
>> - If not nltk, are there any suitable language detection libraries?
>>
>> - Is this the sort of problem that neural networks are good at solving?
>> Anyone know a really good tutorial for neural networks in Python?
>>
>> - How about Bayesian filters, e.g. SpamBayes?
>
> A dead simple approach -- look at the pairs in real words and calculate the
> ratio
>
> pairs-also-found-in-real-words/num-pairs
Sounds reasonable. Building on this approach, three simple improvements:
- calculate the log-likelihood instead, which also makes use of the
frequency of the digraphs in the training set
- use trigraphs instead of digraphs
- preprocess the string (lowercase); more sophisticated
preprocessing could be an option (i.e. converting under_scores and
CamelCase to spaces)
The main reason for the low score of the baby lions is the space
character, I think - the word list does not contain that many spaces.
Maybe one should feed in some long Wikipedia article to calculate the
digraph/trigraph probabilities.
=====================================
Apfelkiste:Tests chris$ cat score_my.py
from __future__ import division
from collections import Counter, defaultdict
from math import log
import sys

WORDLIST = "/usr/share/dict/words"

SAMPLE = """\
baby lions at play
saturday_morning12
Fukushima
ImpossibleFork
xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u
""".splitlines()


def extract_pairs(text):
    # Lowercase once, then yield every overlapping digraph.
    text = text.lower()
    for i in range(len(text)-1):
        yield text[i:i+2]
    # or len(text)-2 and i:i+3


def load_pairs():
    # Count how often each digraph occurs in the dictionary words.
    pairs = Counter()
    with open(WORDLIST) as f:
        for line in f:
            pairs.update(extract_pairs(line.strip()))
    # normalize to sum
    total_count = sum(pairs.values())
    N = total_count+len(pairs)
    # Seen pairs get (count+1)/N; unseen pairs fall back to 1/N.
    dist = defaultdict(lambda: 1/N, ((x, (pairs[x]+1)/N) for x in pairs))
    return dist


def get_score(text, dist):
    # Mean log-likelihood per digraph; less negative means more word-like.
    # Note: assumes the text has at least two characters.
    ll = 0
    for i, x in enumerate(extract_pairs(text), 1):
        ll += log(dist[x])
    return ll / i


def main():
    pair_dist = load_pairs()
    for text in sys.argv[1:] or SAMPLE:
        score = get_score(text, pair_dist)
        print("%.3g %s" % (score, text))


if __name__ == "__main__":
    main()
Apfelkiste:Tests chris$ python score_my.py
-8.74 baby lions at play
-7.63 saturday_morning12
-6.38 Fukushima
-5.72 ImpossibleFork
-10.6 xy39mGWbosjY
-12.9 9sjz7s8198ghwt
-12.1 rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
-9.43 bnsip atl ayba loy
Apfelkiste:Tests chris$
and using trigraphs:
Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
-12.5 bnsip atl ayba loy
Apfelkiste:Tests chris$ python score_my.py
-11.5 baby lions at play
-9.88 saturday_morning12
-9.85 Fukushima
-7.68 ImpossibleFork
-13.4 xy39mGWbosjY
-14.2 9sjz7s8198ghwt
-14.2 rz4sdko-28dbRW00u
==============================
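The switch between digraphs and trigraphs is only hinted at in the script above by the comment "or len(text)-2 and i:i+3". One way to make the slice length a parameter is sketched below; the name `extract_ngrams` and its `n` argument are hypothetical and are not part of the posted script.

```python
def extract_ngrams(text, n=2):
    """Yield every overlapping n-character slice of the lowercased text."""
    text = text.lower()
    # A text shorter than n characters yields nothing.
    for i in range(len(text) - n + 1):
        yield text[i:i + n]


print(list(extract_ngrams("Fork", 2)))
print(list(extract_ngrams("Fork", 3)))
```

With n=2 this yields the same slices as `extract_pairs` in the posted script, and n=3 gives the trigraph variant.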
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2015-12-21 21:36 +1100 |
| Message-ID | <5677d62e$0$1605$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #100654 |
On Mon, 21 Dec 2015 08:56 pm, Christian Gollwitzer wrote:

> Apfelkiste:Tests chris$ python score_my.py
> -8.74 baby lions at play
> -7.63 saturday_morning12
> -6.38 Fukushima
> -5.72 ImpossibleFork
> -10.6 xy39mGWbosjY
> -12.9 9sjz7s8198ghwt
> -12.1 rz4sdko-28dbRW00u
> Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
> -9.43 bnsip atl ayba loy

Thanks Christian and Peter for the suggestion, I'll certainly investigate this further.

But the scoring doesn't seem very good. "baby lions at play" is 100% English words, and ought to have a radically different score from (say) xy39mGWbosjY which is extremely non-English like. (How many English words do you know of with W, X, two Y, and J?) And yet they are only two units apart. "baby lions..." is a score almost as negative as the authentic gibberish, while Fukushima (a Japanese word) has a much less negative score.

Using trigraphs doesn't change that:

> -11.5 baby lions at play
> -9.85 Fukushima
> -13.4 xy39mGWbosjY

So this test appears to find that English-like words are nearly as "random" as actual random strings.

But it's certainly worth looking into.

--
Steven
| From | Christian Gollwitzer <auriocus@gmx.de> |
|---|---|
| Date | 2015-12-21 11:53 +0100 |
| Message-ID | <n58lhv$nok$1@dont-email.me> |
| In reply to | #100656 |
Am 21.12.15 um 11:36 schrieb Steven D'Aprano:

> On Mon, 21 Dec 2015 08:56 pm, Christian Gollwitzer wrote:
>
>> Apfelkiste:Tests chris$ python score_my.py
>> -8.74 baby lions at play
>> -7.63 saturday_morning12
>> -6.38 Fukushima
>> -5.72 ImpossibleFork
>> -10.6 xy39mGWbosjY
>> -12.9 9sjz7s8198ghwt
>> -12.1 rz4sdko-28dbRW00u
>> Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
>> -9.43 bnsip atl ayba loy
>
> Thanks Christian and Peter for the suggestion, I'll certainly investigate
> this further.
>
> But the scoring doesn't seem very good. "baby lions at play" is 100% English
> words, and ought to have a radically different score from (say)
> xy39mGWbosjY which is extremely non-English like. (How many English words
> do you know of with W, X, two Y, and J?) And yet they are only two units
> apart. "baby lions..." is a score almost as negative as the authentic
> gibberish, while Fukushima (a Japanese word) has a much less negative
> score.

It is the spaces, which do not occur in the training wordlist (I mentioned that above, maybe not prominently enough). /usr/share/dict/words contains one word per line. The underscore _ is probably putting the saturday morning low, while the spaces put the babies low. Using trigraphs:

Apfelkiste:Tests chris$ python score_my.py
-11.5 baby lions at play
-9.88 saturday_morning12
-9.85 Fukushima
-7.68 ImpossibleFork
-13.4 xy39mGWbosjY
-14.2 9sjz7s8198ghwt
-14.2 rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'babylionsatplay'
-8.74 babylionsatplay
Apfelkiste:Tests chris$ python score_my.py 'saturdaymorning12'
-8.93 saturdaymorning12
Apfelkiste:Tests chris$

So for the spaces, either use proper training material (some long corpus from Wikipedia or such), with punctuation removed. Then it will catch the correct probabilities at word boundaries. Or preprocess by removing the spaces.

Christian
| From | Christian Gollwitzer <auriocus@gmx.de> |
|---|---|
| Date | 2015-12-21 11:56 +0100 |
| Message-ID | <n58lnc$nok$2@dont-email.me> |
| In reply to | #100657 |
Am 21.12.15 um 11:53 schrieb Christian Gollwitzer:

> So for the spaces, either use proper training material (some long
> corpus from Wikipedia or such), with punctuation removed. Then it will
> catch the correct probabilities at word boundaries. Or preprocess by
> removing the spaces.
>
> Christian

PS: The real log-likelihood would become -infinity when some pair does not appear at all in the training set (esp. the numbers, e.g.). I used the 1/total in the defaultdict to mitigate that. You could tweak that value a bit. The larger the corpus, the sharper it will divide by itself, too.

Christian
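The effect described in this PS can be shown on toy data. The sketch below mirrors the counting scheme of the posted script (seen pairs get (count+1)/N, unseen pairs get 1/N, with N equal to the total count plus the number of distinct pairs); the helper name `smoothed_log_prob` and the toy counts are assumptions for illustration. As in the posted script, the resulting values are not a properly normalised distribution over all possible pairs.

```python
from collections import Counter
from math import log


def smoothed_log_prob(counts, pair):
    """Log of (count + 1) / N, with N = total count + number of distinct pairs."""
    total = sum(counts.values()) + len(counts)
    # Counter returns 0 for a missing key, so unseen pairs get log(1 / N)
    # instead of log(0), which would raise an error.
    return log((counts[pair] + 1) / total)


counts = Counter({"th": 3, "he": 1})    # toy digraph counts
print(smoothed_log_prob(counts, "th"))  # seen pair
print(smoothed_log_prob(counts, "q7"))  # unseen pair, still finite
```

Raising or lowering the floor given to unseen pairs is the "tweak" mentioned above: a lower floor penalises unfamiliar pairs more heavily.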
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
|---|---|
| Date | 2015-12-21 14:25 +0100 |
| Message-ID | <mailman.27.1450704337.2237.python-list@python.org> |
| In reply to | #100643 |
2015-12-21 4:01 GMT+01:00 Steven D'Aprano <steve@pearwood.info>:
> I have a large number of strings (originally file names) which tend to fall
> into two groups. Some are human-meaningful, but not necessarily dictionary
> words e.g.:
>
>
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
>
>
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
>
>
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
>
>
> Let's call the second group "random" and the first "non-random", without
> getting bogged down into arguments about whether they are really random or
> not. I wish to process the strings and automatically determine whether each
> string is random or not. I need to split the strings into three groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.
>
> Strings are *mostly* ASCII but may include a few non-ASCII characters.
>
> Note that false positives (detecting a meaningful non-random string as
> random) is worse for me than false negatives (miscategorising a random
> string as non-random).
>
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:
>
> - I think nltk has a "language detection" function, would that be suitable?
>
> - If not nltk, are there any suitable language detection libraries?
>
> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
>
> - How about Bayesian filters, e.g. SpamBayes?
Hi,
as you probably already know, NLTK could be helpful for some parts of
this task; if you can handle the most likely "word" splitting involved
by underscores, CamelCase etc., you could try to tag the parts of
speech of the words and interpret the results according to your
needs.
In the online demo
http://text-processing.com/demo/tag/
your sample (with different approaches to split the words) yields:
baby/NN lions/NNS at/IN play/VB saturday/NN morning/NN 12/CD
Fukushima/NNP Impossible/JJ Fork/NNP xy39mGWbosjY/-None-
9sjz7s8198ghwt/-None- rz4sdko/-None- -/: 28dbRW00u/-None-
or with more splittings on case or letter-digit boundaries:
baby/NN lions/NNS at/IN play/VB saturday/NN morning/NN 12/CD
Fukushima/NNP Impossible/JJ Fork/NNP xy/-None- 39/CD m/-None- G/NNP
Wbosj/-None- Y/-None- 9/CD sjz/-None- 7/CD s/-None- 8198/-NONE-
ghwt/-None- rz/-None- 4/CD sdko/-None- -/: 28/CD db/-None- R/NNP
W/-None- 00/-None- u/-None-
the tagset might be compatible with
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
There is sample code with a comparable output to this demo:
http://stackoverflow.com/questions/23953709/how-do-i-tag-a-sentence-with-the-brown-or-conll2000-tagger-chunker
For the given minimal sample, the results look useful (maybe with
exception of the capitalised words sometimes tagged as proper names -
but it might not be that relevant here).
Of course, no scoring is available with this approach, but you
could maybe check the proportion of the recognised "words" compared
to the total number of "words" for the respective filename.
Training the tagger should be possible too in NLTK, but I don't have
experience with this.
regards,
vbr
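The proportion check suggested in this post could look something like the sketch below. To stay self-contained it uses a plain set of known words rather than an NLTK tagger, so it is a stand-in for the tagging step and not a reproduction of it; the `KNOWN` set and the name `recognised_ratio` are hypothetical.

```python
def recognised_ratio(words, known):
    """Fraction of the given words that appear in the known-word set."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() in known)
    return hits / len(words)


# Toy vocabulary; a real run would load a word list or use a tagger's output.
KNOWN = {"baby", "lions", "at", "play", "impossible", "fork"}

print(recognised_ratio(["baby", "lions", "at", "play"], KNOWN))
print(recognised_ratio(["xy", "39", "m", "G", "Wbosj", "Y"], KNOWN))
```

With a tagger, a token tagged `-None-` in the demo output above would count as unrecognised and everything else as a hit.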
| From | Vincent Davis <vincent@vincentdavis.net> |
|---|---|
| Date | 2015-12-21 07:51 -0600 |
| Message-ID | <mailman.28.1450705921.2237.python-list@python.org> |
| In reply to | #100643 |
On Mon, Dec 21, 2015 at 7:25 AM, Vlastimil Brom <vlastimil.brom@gmail.com> wrote:

> > baby lions at play
> > saturday_morning12
> > Fukushima
> > ImpossibleFork
> >
> >
> > (note that some use underscores, others spaces, and some CamelCase) while
> > others are completely meaningless (or mostly so):
> >
> >
> > xy39mGWbosjY
> > 9sjz7s8198ghwt
> > rz4sdko-28dbRW00u

My first thought is to search Google for each word or phrase and count the results (Google gives a count). For example, if you search for "xy39mGWbosjY" there is one result as of now, which is an archive of this thread. If you search for any given word or even the phrase, for example "baby lions at play", you get a much larger set of results (~500). I assume there are many ways to search Google with Python; this looks like one:
https://pypi.python.org/pypi/google

Vincent Davis
| From | duncan smith <duncan@invalid.invalid> |
|---|---|
| Date | 2015-12-21 16:40 +0000 |
| Message-ID | <oYVdy.22469$Hz3.17030@fx43.iad> |
| In reply to | #100643 |
On 21/12/15 03:01, Steven D'Aprano wrote:

> I have a large number of strings (originally file names) which tend to fall
> into two groups. Some are human-meaningful, but not necessarily dictionary
> words e.g.:
>
>
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
>
>
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
>
>
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
>
>
> Let's call the second group "random" and the first "non-random", without
> getting bogged down into arguments about whether they are really random or
> not. I wish to process the strings and automatically determine whether each
> string is random or not. I need to split the strings into three groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.
>
> Strings are *mostly* ASCII but may include a few non-ASCII characters.
>
> Note that false positives (detecting a meaningful non-random string as
> random) is worse for me than false negatives (miscategorising a random
> string as non-random).
>
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:
>
> - I think nltk has a "language detection" function, would that be suitable?
>
> - If not nltk, are there any suitable language detection libraries?
>
> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
>
> - How about Bayesian filters, e.g. SpamBayes?

Finite state machine / transition matrix. Learn from some English text source. Then process your strings by lower casing, replacing underscores with spaces, removing trailing numeric characters etc. Base your score on something like the mean transition probability. I'd expect to see two pretty well separated groups of scores.

Duncan
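A minimal sketch of the transition-matrix idea follows, assuming a first-order model over single characters. The training sentence is a toy stand-in for "some English text source", and the names `preprocess`, `train_transitions`, and `mean_transition_prob` are hypothetical.

```python
import re
from collections import Counter, defaultdict


def preprocess(name):
    """Lowercase, turn underscores into spaces, strip trailing digits."""
    return re.sub(r"\d+$", "", name.lower().replace("_", " "))


def train_transitions(text):
    """Estimate P(next char | current char) from a training text."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    # Normalise each row of counts into probabilities.
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}


def mean_transition_prob(text, trans):
    """Mean probability of each character-to-character step in the text."""
    steps = list(zip(text, text[1:]))
    if not steps:
        return 0.0
    # Transitions never seen in training contribute probability 0.
    return sum(trans.get(a, {}).get(b, 0.0) for a, b in steps) / len(steps)


trans = train_transitions("the cat sat on the mat and the hat")
for name in ["the_cat12", "xqzj"]:
    print(name, mean_transition_prob(preprocess(name), trans))
```

With such a tiny training text the numbers mean little; the expectation stated above, two well separated groups of scores, would need to be checked against a realistic corpus.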
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2015-12-21 09:49 -0700 |
| Message-ID | <mailman.33.1450716637.2237.python-list@python.org> |
| In reply to | #100678 |
On Mon, Dec 21, 2015 at 9:40 AM, duncan smith <duncan@invalid.invalid> wrote:

> Finite state machine / transition matrix. Learn from some English text
> source. Then process your strings by lower casing, replacing underscores
> with spaces, removing trailing numeric characters etc. Base your score
> on something like the mean transition probability. I'd expect to see two
> pretty well separated groups of scores.

Sounds like a case for a Hidden Markov Model.
| From | duncan smith <duncan@invalid.invalid> |
|---|---|
| Date | 2015-12-21 17:41 +0000 |
| Message-ID | <kRWdy.44154$Xk5.39385@fx17.iad> |
| In reply to | #100679 |
On 21/12/15 16:49, Ian Kelly wrote:

> On Mon, Dec 21, 2015 at 9:40 AM, duncan smith <duncan@invalid.invalid> wrote:
>> Finite state machine / transition matrix. Learn from some English text
>> source. Then process your strings by lower casing, replacing underscores
>> with spaces, removing trailing numeric characters etc. Base your score
>> on something like the mean transition probability. I'd expect to see two
>> pretty well separated groups of scores.
>
> Sounds like a case for a Hidden Markov Model.

Perhaps. That would allow the encoding of marginal probabilities and distinct transition matrices for each class - if we could learn those extra parameters.

Duncan
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2015-12-21 17:09 +0000 |
| Message-ID | <mailman.34.1450717787.2237.python-list@python.org> |
| In reply to | #100678 |
On 21/12/2015 16:49, Ian Kelly wrote:

> On Mon, Dec 21, 2015 at 9:40 AM, duncan smith <duncan@invalid.invalid> wrote:
>> Finite state machine / transition matrix. Learn from some English text
>> source. Then process your strings by lower casing, replacing underscores
>> with spaces, removing trailing numeric characters etc. Base your score
>> on something like the mean transition probability. I'd expect to see two
>> pretty well separated groups of scores.
>
> Sounds like a case for a Hidden Markov Model.

In which case https://pypi.python.org/pypi/Markov/0.1 would seem to be a starting point.

--
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.

Mark Lawrence
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2015-12-21 09:20 -0800 |
| Message-ID | <874mfbk9by.fsf@nightsong.com> |
| In reply to | #100643 |
Steven D'Aprano <steve@pearwood.info> writes:

> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:

I think I'd just look at the set of digraphs or trigraphs in each name and see if there are a lot that aren't found in English.

> - I think nltk has a "language detection" function, would that be suitable?
> - If not nltk, are there any suitable language detection libraries?

I suspect these need longer strings to work.

> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
> - How about Bayesian filters, e.g. SpamBayes?

You want large training sets for these approaches.