Script for finding words of any size that do NOT contain vowels with acute diacritic marks?

Started by	nwaits <nowaits@gmail.com>
First post	2012-10-17 07:31 -0700
Last post	2012-10-17 08:32 -0700
Articles	13 — 7 participants

#31507 — Script for finding words of any size that do NOT contain vowels with acute diacritic marks?

From	nwaits <nowaits@gmail.com>
Date	2012-10-17 07:31 -0700
Subject	Script for finding words of any size that do NOT contain vowels with acute diacritic marks?
Message-ID	<a7454cb7-e6dc-4167-b72a-56a67a5873a7@googlegroups.com>

I'm very impressed with python's wordlist script for plain text.  Is there a script for finding words that do NOT have certain diacritic marks, like acute or grave accents (utf-8), over the vowels?  
Thank you.

#31516

From	Dave Angel <d@davea.name>
Date	2012-10-17 11:00 -0400
Message-ID	<mailman.2350.1350486045.27098.python-list@python.org>
In reply to	#31507

On 10/17/2012 10:31 AM, nwaits wrote:
> I'm very impressed with python's wordlist script for plain text.  Is there a script for finding words that do NOT have certain diacritic marks, like acute or grave accents (utf-8), over the vowels?  
> Thank you.

If you can construct a list of "illegal" characters, then you can simply
check each character of the word against the list; if none of the word's
characters is in the list, the word is a winner.
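
A minimal sketch of the per-character check described above, assuming Python 3. The `ILLEGAL` set, the `is_clean` name, and the sample words are hypothetical and not from the post; the set here only covers vowels with an acute accent.

```python
import unicodedata

# Hypothetical "illegal" set: vowels carrying an acute accent.
# Normalizing to NFC makes each accented vowel a single code point.
ILLEGAL = set(unicodedata.normalize("NFC", "áéíóúÁÉÍÓÚ"))

def is_clean(word, illegal=ILLEGAL):
    """Return True if none of the word's characters is in the illegal set."""
    word = unicodedata.normalize("NFC", word)
    return not any(ch in illegal for ch in word)

words = ["éléphant", "elephant", "café", "tea"]
clean_words = [w for w in words if is_clean(w)]
print(clean_words)
```

Using a set rather than a list keeps each membership test cheap; the NFC step is there so that input typed as a base letter plus a combining accent is still caught.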

If that's not fast enough, you can build a translation table from the
list of illegal characters, and use translate on each word.  Then it
becomes a question of checking if the translated word is all zeroes.  
More setup time, but much faster looping for each word.
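
One possible reading of the translation-table idea in Python 3, offered as a sketch rather than the exact check described above: instead of testing for "all zeroes", the table deletes the illegal characters and the word is accepted if nothing was deleted. `ILLEGAL`, `DELETE_ILLEGAL`, and `is_clean` are hypothetical names, and whether this is actually faster would need measuring on real data.

```python
import unicodedata

# Hypothetical list of illegal characters (vowels with acute or grave accents).
ILLEGAL = unicodedata.normalize("NFC", "áéíóúàèìòùÁÉÍÓÚÀÈÌÒÙ")

# Built once: a table that deletes every illegal character.
DELETE_ILLEGAL = str.maketrans("", "", ILLEGAL)

def is_clean(word):
    """Return True if translate() had nothing to delete from the word."""
    word = unicodedata.normalize("NFC", word)
    return word.translate(DELETE_ILLEGAL) == word
```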

-- 

DaveA

#31519

From	wxjmfauth@gmail.com
Date	2012-10-17 08:32 -0700
Message-ID	<748e561a-7e75-4b13-be6b-91831d3b59c4@googlegroups.com>
In reply to	#31516

On Wednesday, 17 October 2012 17:00:46 UTC+2, Dave Angel wrote:
> On 10/17/2012 10:31 AM, nwaits wrote:
> > I'm very impressed with python's wordlist script for plain text.  Is there a script for finding words that do NOT have certain diacritic marks, like acute or grave accents (utf-8), over the vowels?
> > Thank you.
>
> if you can construct a list of "illegal" characters, then you can simply
> check each character of the word against the list, and if it succeeds
> for all of the characters, it's a winner.
>
> If that's not fast enough, you can build a translation table from the
> list of illegal characters, and use translate on each word.  Then it
> becomes a question of checking if the translated word is all zeroes.
> More setup time, but much faster looping for each word.
>
> --
> DaveA

Lazy way.
Py3.2

>>> import unicodedata
>>> def HasDiacritics(w):
...     w_decomposed = unicodedata.normalize('NFKD', w)
...     return 'no' if len(w) == len(w_decomposed) else 'yes'
...     
>>> HasDiacritics('éléphant')
'yes'
>>> HasDiacritics('elephant')
'no'
>>> HasDiacritics('\N{LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON}')
'yes'
>>> HasDiacritics('U')
'no'
>>>

Should be ok for the CombiningDiacriticalMarks unicode range
(common diacritics)
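
For the original question (acute and grave accents only), a sketch along the same lines could decompose the word and look for the two specific combining marks. This assumes Python 3; the function name and the sample words are made up for illustration, and the check flags an acute or grave accent on any letter, not only on vowels.

```python
import unicodedata

COMBINING_GRAVE = "\u0300"  # COMBINING GRAVE ACCENT
COMBINING_ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

def has_acute_or_grave(word):
    """Return True if the decomposed word contains an acute or grave mark."""
    decomposed = unicodedata.normalize("NFD", word)
    return COMBINING_ACUTE in decomposed or COMBINING_GRAVE in decomposed

words = ["éléphant", "elephant", "über", "voilà", "naïve"]
plain = [w for w in words if not has_acute_or_grave(w)]
print(plain)
```

Words with other marks, such as the diaeresis in 'über', pass this filter because only U+0300 and U+0301 are tested.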

jmf

#31526

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-10-17 11:07 -0600
Message-ID	<mailman.2354.1350493663.27098.python-list@python.org>
In reply to	#31519

On Wed, Oct 17, 2012 at 9:32 AM,  <wxjmfauth@gmail.com> wrote:
> >>> import unicodedata
> >>> def HasDiacritics(w):
> ...     w_decomposed = unicodedata.normalize('NFKD', w)
> ...     return 'no' if len(w) == len(w_decomposed) else 'yes'
> ...
> >>> HasDiacritics('éléphant')
> 'yes'
> >>> HasDiacritics('elephant')
> 'no'
> >>> HasDiacritics('\N{LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON}')
> 'yes'
> >>> HasDiacritics('U')
> 'no'

Is there something wrong with True and False that you had to replace
them with strings?

"return len(w) != len(w_decomposed)" is all you need.
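
Written out in full, the boolean version might look like the sketch below (Python 3 assumed; the lowercase function names are a choice made here). The second function is an assumption added for comparison, not something proposed in the thread: the length test also fires on compatibility characters such as the 'ﬁ' ligature, which NFKD expands to two letters without any diacritic being involved, whereas testing for combining marks does not.

```python
import unicodedata

def has_diacritics(w):
    """Length test from the thread, returning a real bool."""
    return len(w) != len(unicodedata.normalize("NFKD", w))

def has_combining_marks(w):
    """Stricter variant (an assumption, not from the thread):
    True only if the decomposition contains a combining character."""
    decomposed = unicodedata.normalize("NFKD", w)
    return any(unicodedata.combining(ch) for ch in decomposed)

# The result can be used directly in a condition.
if not has_diacritics("elephant"):
    print("no diacritics")
```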

#31530

From	wxjmfauth@gmail.com
Date	2012-10-17 11:17 -0700
Message-ID	<84d4571f-430a-4fb3-8e54-cf6769a2c76f@googlegroups.com>
In reply to	#31526

On Wednesday, 17 October 2012 19:07:43 UTC+2, Ian wrote:
> On Wed, Oct 17, 2012 at 9:32 AM,  <wxjmfauth@gmail.com> wrote:
> > >>> import unicodedata
> > >>> def HasDiacritics(w):
> > ...     w_decomposed = unicodedata.normalize('NFKD', w)
> > ...     return 'no' if len(w) == len(w_decomposed) else 'yes'
> > ...
> > >>> HasDiacritics('éléphant')
> > 'yes'
> > >>> HasDiacritics('elephant')
> > 'no'
> > >>> HasDiacritics('\N{LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON}')
> > 'yes'
> > >>> HasDiacritics('U')
> > 'no'
>
> Is there something wrong with True and False that you had to replace
> them with strings?
>
> "return len(w) != len(w_decomposed)" is all you need.

Not at all, I knew this. In this case I decided to program it like
this.

Do you get it?  Yes/No  or True/False

jmf

#31532

From	Chris Angelico <rosuav@gmail.com>
Date	2012-10-18 05:22 +1100
Message-ID	<mailman.2359.1350498151.27098.python-list@python.org>
In reply to	#31530

On Thu, Oct 18, 2012 at 5:17 AM,  <wxjmfauth@gmail.com> wrote:
> Not at all, I knew this. In this I decided to program like
> this.
>
> Do you get it?  Yes/No  or True/False

Yes but why? When you're returning a boolean concept, why not return a
boolean value? You don't even use values with one that
compares-as-true and the other that compares-as-false (for instance,
you could write the function so that it returns just the
diacritic-containing characters, meaning it'll return "" if there
aren't any). To what benefit?
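
The alternative mentioned in parentheses could be sketched as follows, assuming Python 3; `chars_with_diacritics` is a hypothetical name. The return value is a string that is empty, and therefore false in a condition, when the word has no diacritics.

```python
import unicodedata

def chars_with_diacritics(word):
    """Return the characters of `word` whose decomposition
    contains at least one combining mark ("" if there are none)."""
    composed = unicodedata.normalize("NFC", word)
    return "".join(
        ch for ch in composed
        if any(unicodedata.combining(c)
               for c in unicodedata.normalize("NFD", ch))
    )

# A non-empty result is truthy, an empty one is falsy.
if not chars_with_diacritics("elephant"):
    print("no diacritics")
```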

Puzzled.

ChrisA

#31533

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-10-17 12:27 -0600
Message-ID	<mailman.2360.1350498464.27098.python-list@python.org>
In reply to	#31530

On Wed, Oct 17, 2012 at 12:17 PM,  <wxjmfauth@gmail.com> wrote:
> Not at all, I knew this. In this I decided to program like
> this.
>
> Do you get it?  Yes/No  or True/False

It's just bad style, because both 'yes' and 'no' evaluate true.

if HasDiacritics('éléphant'):
    print('Correct!')

if HasDiacritics('elephant'):
    print('Error!')

Prints:

Correct!
Error!

You could replace the test with "if HasDiacritics('elephant') ==
'yes':", but why force the caller to write that out when the former
test is more natural and less prone to error (e.g. typoing 'yes')?

#31534

From	wxjmfauth@gmail.com
Date	2012-10-17 11:33 -0700
Message-ID	<64995d1a-5c04-4a90-a6a1-e73aa2ed5e34@googlegroups.com>
In reply to	#31533

On Wednesday, 17 October 2012 20:28:21 UTC+2, Ian wrote:
> On Wed, Oct 17, 2012 at 12:17 PM,  <wxjmfauth@gmail.com> wrote:
> > Not at all, I knew this. In this I decided to program like
> > this.
> >
> > Do you get it?  Yes/No  or True/False
>
> It's just bad style, because both 'yes' and 'no' evaluate true.
>
> if HasDiacritics('éléphant'):
>     print('Correct!')
>
> if HasDiacritics('elephant'):
>     print('Error!')
>
> Prints:
>
> Correct!
> Error!
>
> You could replace the test with "if HasDiacritics('elephant') ==
> 'yes':", but why force the caller to write that out when the former
> test is more natural and less prone to error (e.g. typoing 'yes')?

I *know* all this. In my previous message, the goal was to emphasize the
usage of *unicodedata.normalize()*.

jmf

#31527

From	David Robinow <drobinow@gmail.com>
Date	2012-10-17 13:16 -0400
Message-ID	<mailman.2355.1350494207.27098.python-list@python.org>
In reply to	#31519

On Wed, Oct 17, 2012 at 1:07 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> "return len(w) != len(w_decomposed)" is all you need.
 Thanks for helping, but I already knew that.

#31549

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-10-17 23:18 +0000
Message-ID	<507f3cab$0$6599$c3e8da3$5496439d@news.astraweb.com>
In reply to	#31527

On Wed, 17 Oct 2012 13:16:43 -0400, David Robinow wrote:

> On Wed, Oct 17, 2012 at 1:07 PM, Ian Kelly <ian.g.kelly@gmail.com>
> wrote:
>> "return len(w) != len(w_decomposed)" is all you need.
>
>  Thanks for helping, but I already knew that.

David, Ian was directly responding to wxjmfauth@gmail.com, whose 
suggestion included an entirely unnecessary conversion from a bool flag 
to the strings 'yes' and 'no'. That can be seen in the part of Ian's post 
that you deleted.

Regardless of whether *you personally* already knew that jmf's function 
was unidiomatic and a poor design, you weren't directly the target of the 
comment. I'm glad you already knew what Ian said, but you're not the only 
person reading this thread.

-- 
Steven
