Groups > comp.lang.python > #31507 > unrolled thread
| Started by | nwaits <nowaits@gmail.com> |
|---|---|
| First post | 2012-10-17 07:31 -0700 |
| Last post | 2012-10-17 08:32 -0700 |
| Articles | 13 — 7 participants |
Script for finding words of any size that do NOT contain vowels with acute diacritic marks? nwaits <nowaits@gmail.com> - 2012-10-17 07:31 -0700
Re: Script for finding words of any size that do NOT contain vowels with acute diacritic marks? Dave Angel <d@davea.name> - 2012-10-17 11:00 -0400
Re: Script for finding words of any size that do NOT contain vowels with acute diacritic marks? wxjmfauth@gmail.com - 2012-10-17 08:32 -0700
Re: Script for finding words of any size that do NOT contain vowels with acute diacritic marks? Ian Kelly <ian.g.kelly@gmail.com> - 2012-10-17 11:07 -0600
Re: Script for finding words of any size that do NOT contain vowels with acute diacritic marks? wxjmfauth@gmail.com - 2012-10-17 11:17 -0700
Re: Script for finding words of any size that do NOT contain vowels with acute diacritic marks? Chris Angelico <rosuav@gmail.com> - 2012-10-18 05:22 +1100
Re: Script for finding words of any size that do NOT contain vowels with acute diacritic marks? Ian Kelly <ian.g.kelly@gmail.com> - 2012-10-17 12:27 -0600
Re: Script for finding words of any size that do NOT contain vowels with acute diacritic marks? wxjmfauth@gmail.com - 2012-10-17 11:33 -0700
Re: Script for finding words of any size that do NOT contain vowels with acute diacritic marks? wxjmfauth@gmail.com - 2012-10-17 11:33 -0700
Re: Script for finding words of any size that do NOT contain vowels with acute diacritic marks? wxjmfauth@gmail.com - 2012-10-17 11:17 -0700
Re: Script for finding words of any size that do NOT contain vowels with acute diacritic marks? David Robinow <drobinow@gmail.com> - 2012-10-17 13:16 -0400
Re: Script for finding words of any size that do NOT contain vowels with acute diacritic marks? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-10-17 23:18 +0000
Re: Script for finding words of any size that do NOT contain vowels with acute diacritic marks? wxjmfauth@gmail.com - 2012-10-17 08:32 -0700
| From | nwaits <nowaits@gmail.com> |
|---|---|
| Date | 2012-10-17 07:31 -0700 |
| Subject | Script for finding words of any size that do NOT contain vowels with acute diacritic marks? |
| Message-ID | <a7454cb7-e6dc-4167-b72a-56a67a5873a7@googlegroups.com> |
I'm very impressed with python's wordlist script for plain text. Is there a script for finding words that do NOT have certain diacritic marks, like acute or grave accents (utf-8), over the vowels? Thank you.
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-10-17 11:00 -0400 |
| Message-ID | <mailman.2350.1350486045.27098.python-list@python.org> |
| In reply to | #31507 |
On 10/17/2012 10:31 AM, nwaits wrote:
> I'm very impressed with python's wordlist script for plain text. Is there a script for finding words that do NOT have certain diacritic marks, like acute or grave accents (utf-8), over the vowels?
> Thank you.

If you can construct a list of "illegal" characters, then you can simply check each character of the word against the list, and if it succeeds for all of the characters, it's a winner.

If that's not fast enough, you can build a translation table from the list of illegal characters, and use translate on each word. Then it becomes a question of checking if the translated word is all zeroes. More setup time, but much faster looping for each word.

--
DaveA
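A minimal Python 3 sketch of both ideas, under stated assumptions: the set of "illegal" characters below is only an example (acute-accented vowels), the function names are hypothetical, and the words are assumed to use precomposed characters such as U+00E9 rather than a base letter followed by a combining accent. The translate variant here deletes the illegal characters and compares the result with the original word, which is a swapped-in alternative to the "all zeroes" check described above.

```python
import unicodedata

# Example "illegal" characters (an assumption): vowels with an acute accent.
# NFC keeps them as single precomposed code points.
ACUTE_VOWELS = frozenset(unicodedata.normalize("NFC", "áéíóúÁÉÍÓÚ"))

# Translation table that deletes every illegal character; the third
# argument of str.maketrans lists the characters to remove.
DELETE_ACUTE = str.maketrans("", "", "".join(ACUTE_VOWELS))


def is_clean(word):
    """Per-character check: True if no character of word is illegal."""
    return ACUTE_VOWELS.isdisjoint(word)


def is_clean_translate(word):
    """translate() variant: True if deleting illegal characters changes nothing."""
    return word.translate(DELETE_ACUTE) == word


def clean_words(text):
    """Return the whitespace-separated words with no illegal character."""
    return [w for w in text.split() if is_clean(w)]
```

Which variant is faster depends on the word list; measuring with timeit on real data is the only reliable way to tell.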
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-10-17 08:32 -0700 |
| Message-ID | <748e561a-7e75-4b13-be6b-91831d3b59c4@googlegroups.com> |
| In reply to | #31516 |
On Wednesday, 17 October 2012 17:00:46 UTC+2, Dave Angel wrote:
> On 10/17/2012 10:31 AM, nwaits wrote:
> > I'm very impressed with python's wordlist script for plain text. Is there a script for finding words that do NOT have certain diacritic marks, like acute or grave accents (utf-8), over the vowels?
> > Thank you.
>
> if you can construct a list of "illegal" characters, then you can simply
> check each character of the word against the list, and if it succeeds
> for all of the characters, it's a winner.
>
> If that's not fast enough, you can build a translation table from the
> list of illegal characters, and use translate on each word. Then it
> becomes a question of checking if the translated word is all zeroes.
> More setup time, but much faster looping for each word.
>
> --
> DaveA
Lazy way.
Py3.2
>>> import unicodedata
>>> def HasDiacritics(w):
...     w_decomposed = unicodedata.normalize('NFKD', w)
...     return 'no' if len(w) == len(w_decomposed) else 'yes'
...
>>> HasDiacritics('éléphant')
'yes'
>>> HasDiacritics('elephant')
'no'
>>> HasDiacritics('\N{LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON}')
'yes'
>>> HasDiacritics('U')
'no'
>>>
Should be ok for the CombiningDiacriticalMarks unicode range
(common diacritics)
jmf
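The length comparison above reports any character that NFKD expands, so it also fires on grave accents, diaereses, and compatibility characters such as the ligature U+FB01, which NFKD expands to "fi". Since the original question asks about acute accents specifically, a narrower check could look like the sketch below. The function name is hypothetical, and it assumes that "acute" means the presence of U+0301 COMBINING ACUTE ACCENT after canonical decomposition (NFD).

```python
import unicodedata

COMBINING_ACUTE = "\N{COMBINING ACUTE ACCENT}"  # U+0301


def has_acute(word):
    """Return True if any character of word carries an acute accent."""
    # NFD splits a precomposed letter such as U+00E9 into its base
    # letter followed by U+0301, so a substring test is enough.
    return COMBINING_ACUTE in unicodedata.normalize("NFD", word)
```

With this check a word containing only a grave accent, or only a ligature, counts as having no acute accent.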
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-10-17 11:07 -0600 |
| Message-ID | <mailman.2354.1350493663.27098.python-list@python.org> |
| In reply to | #31519 |
On Wed, Oct 17, 2012 at 9:32 AM, <wxjmfauth@gmail.com> wrote:
> >>> import unicodedata
> >>> def HasDiacritics(w):
> ...     w_decomposed = unicodedata.normalize('NFKD', w)
> ...     return 'no' if len(w) == len(w_decomposed) else 'yes'
> ...
> >>> HasDiacritics('éléphant')
> 'yes'
> >>> HasDiacritics('elephant')
> 'no'
> >>> HasDiacritics('\N{LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON}')
> 'yes'
> >>> HasDiacritics('U')
> 'no'
Is there something wrong with True and False that you had to replace
them with strings?
"return len(w) != len(w_decomposed)" is all you need.
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-10-17 11:17 -0700 |
| Message-ID | <84d4571f-430a-4fb3-8e54-cf6769a2c76f@googlegroups.com> |
| In reply to | #31526 |
On Wednesday, 17 October 2012 19:07:43 UTC+2, Ian wrote:
> On Wed, Oct 17, 2012 at 9:32 AM, <wxjmfauth@gmail.com> wrote:
> > >>> import unicodedata
> > >>> def HasDiacritics(w):
> > ...     w_decomposed = unicodedata.normalize('NFKD', w)
> > ...     return 'no' if len(w) == len(w_decomposed) else 'yes'
> > ...
> > >>> HasDiacritics('éléphant')
> > 'yes'
> > >>> HasDiacritics('elephant')
> > 'no'
> > >>> HasDiacritics('\N{LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON}')
> > 'yes'
> > >>> HasDiacritics('U')
> > 'no'
>
> Is there something wrong with True and False that you had to replace
> them with strings?
>
> "return len(w) != len(w_decomposed)" is all you need.
Not at all, I knew this. In this I decided to program like
this.
Do you get it? Yes/No or True/False
jmf
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-10-18 05:22 +1100 |
| Message-ID | <mailman.2359.1350498151.27098.python-list@python.org> |
| In reply to | #31530 |
On Thu, Oct 18, 2012 at 5:17 AM, <wxjmfauth@gmail.com> wrote:
> Not at all, I knew this. In this I decided to program like
> this.
>
> Do you get it? Yes/No or True/False

Yes but why? When you're returning a boolean concept, why not return a boolean value? You don't even use values with one that compares-as-true and the other that compares-as-false (for instance, you could write the function so that it returns just the diacritic-containing characters, meaning it'll return "" if there aren't any). To what benefit? Puzzled.

ChrisA
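The alternative mentioned in parentheses can be sketched as follows. This is only one possible reading of it: the function name is hypothetical, and "diacritic-containing" is taken to mean that the canonical decomposition (NFD) of the character contains at least one combining mark.

```python
import unicodedata


def diacritic_chars(word):
    """Return the characters of word whose NFD form has a combining mark."""
    return "".join(
        ch
        for ch in word
        # unicodedata.combining() returns 0 for non-combining characters.
        if any(unicodedata.combining(c) for c in unicodedata.normalize("NFD", ch))
    )
```

The result is an empty string, which is false in a boolean context, when the word has no such characters, so the return value works directly in an if test while still saying which characters were found.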
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-10-17 12:27 -0600 |
| Message-ID | <mailman.2360.1350498464.27098.python-list@python.org> |
| In reply to | #31530 |
On Wed, Oct 17, 2012 at 12:17 PM, <wxjmfauth@gmail.com> wrote:
> Not at all, I knew this. In this I decided to program like
> this.
>
> Do you get it? Yes/No or True/False
It's just bad style, because both 'yes' and 'no' evaluate true.
if HasDiacritics('éléphant'):
    print('Correct!')
if HasDiacritics('elephant'):
    print('Error!')
Prints:
Correct!
Error!
You could replace the test with "if HasDiacritics('elephant') ==
'yes':", but why force the caller to write that out when the former
test is more natural and less prone to error (e.g. typoing 'yes')?
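For comparison, a sketch of the same length test returning a real bool, so that the natural form of the test works. The snake_case names are hypothetical, and the filter function is added only to show the bool being used directly.

```python
import unicodedata


def has_diacritics(word):
    """Return True if NFKD decomposition lengthens word, else False."""
    return len(word) != len(unicodedata.normalize("NFKD", word))


def words_without_diacritics(words):
    """Keep only the words for which has_diacritics() is false."""
    return [w for w in words if not has_diacritics(w)]
```

Because the return value is a bool, callers can write "if has_diacritics(word):" or "if not has_diacritics(word):" without comparing against a string.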
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-10-17 11:33 -0700 |
| Message-ID | <64995d1a-5c04-4a90-a6a1-e73aa2ed5e34@googlegroups.com> |
| In reply to | #31533 |
On Wednesday, 17 October 2012 20:28:21 UTC+2, Ian wrote:
> On Wed, Oct 17, 2012 at 12:17 PM, <wxjmfauth@gmail.com> wrote:
> > Not at all, I knew this. In this I decided to program like
> > this.
> >
> > Do you get it? Yes/No or True/False
>
> It's just bad style, because both 'yes' and 'no' evaluate true.
>
> if HasDiacritics('éléphant'):
>     print('Correct!')
>
> if HasDiacritics('elephant'):
>     print('Error!')
>
> Prints:
>
> Correct!
> Error!
>
> You could replace the test with "if HasDiacritics('elephant') ==
> 'yes':", but why force the caller to write that out when the former
> test is more natural and less prone to error (e.g. typoing 'yes')?
I *know* all this. In my prev. msg, the goal was to emph. the
usage of *unicodedata.normalize()*.
jmf
| From | David Robinow <drobinow@gmail.com> |
|---|---|
| Date | 2012-10-17 13:16 -0400 |
| Message-ID | <mailman.2355.1350494207.27098.python-list@python.org> |
| In reply to | #31519 |
On Wed, Oct 17, 2012 at 1:07 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> "return len(w) != len(w_decomposed)" is all you need.

Thanks for helping, but I already knew that.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-10-17 23:18 +0000 |
| Message-ID | <507f3cab$0$6599$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #31527 |
On Wed, 17 Oct 2012 13:16:43 -0400, David Robinow wrote:
> On Wed, Oct 17, 2012 at 1:07 PM, Ian Kelly <ian.g.kelly@gmail.com>
> wrote:
>> "return len(w) != len(w_decomposed)" is all you need.
>
> Thanks for helping, but I already knew that.

David, Ian was directly responding to wxjmfauth@gmail.com, whose suggestion included an entirely unnecessary conversion from a bool flag to the strings 'yes' and 'no'. That can be seen in the part of Ian's post that you deleted.

Regardless of whether *you personally* already knew that jmf's function was unidiomatic and a poor design, you weren't directly the target of the comment. I'm glad you already knew what Ian said, but you're not the only person reading this thread.

--
Steven