Newsgroups: comp.lang.python
Date: Wed, 17 Oct 2012 08:32:52 -0700 (PDT)
Subject: Re: Script for finding words of any size that do NOT contain vowels with acute diacritic marks?
From: wxjmfauth@gmail.com

On Wednesday, 17 October 2012 17:00:46 UTC+2, Dave Angel wrote:
> On 10/17/2012 10:31 AM, nwaits wrote:
> > I'm very impressed with python's wordlist script for plain text.  Is there a script for finding words that do NOT have certain diacritic marks, like acute or grave accents (utf-8), over the vowels?
> > Thank you.
>
> if you can construct a list of "illegal" characters, then you can simply
> check each character of the word against the list, and if it succeeds
> for all of the characters, it's a winner.
>
> If that's not fast enough, you can build a translation table from the
> list of illegal characters, and use translate on each word.  Then it
> becomes a question of checking if the translated word is all zeroes.
> More setup time, but much faster looping for each word.
>
> --
> DaveA
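
For reference, a minimal sketch of the two checks described above, written for Python 3. The character list and the function names are my own assumptions, picked for illustration. The translate variant is a variation on the suggestion: instead of mapping to zeroes, it deletes the illegal characters and checks whether the word changed.

```python
# Hypothetical list of "illegal" characters: precomposed vowels with an acute accent.
ILLEGAL = set('áéíóúýÁÉÍÓÚÝ')

def is_clean(word):
    """True if no character of word is in the illegal list."""
    return not any(ch in ILLEGAL for ch in word)

# Translation table mapping each illegal code point to None, which deletes it.
DELETE_ILLEGAL = dict.fromkeys(map(ord, ILLEGAL))

def is_clean_translate(word):
    """Same check via str.translate: a clean word is left unchanged."""
    return word.translate(DELETE_ILLEGAL) == word

words = ['éléphant', 'elephant', 'canción', 'père']
print([w for w in words if is_clean(w)])
print([w for w in words if is_clean_translate(w)])
```

Both only catch precomposed characters. A word stored as 'e' followed by a separate combining accent would slip through unless it is first normalized with unicodedata.normalize('NFC', word).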

Lazy way, on Py3.2: decompose the word and see whether its length changes.

>>> import unicodedata
>>> def HasDiacritics(w):
...     w_decomposed = unicodedata.normalize('NFKD', w)
...     return 'no' if len(w) == len(w_decomposed) else 'yes'
...
>>> HasDiacritics('éléphant')
'yes'
>>> HasDiacritics('elephant')
'no'
>>> HasDiacritics('\N{LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON}')
'yes'
>>> HasDiacritics('U')
'no'
>>>

This should be OK for marks in the Combining Diacritical Marks Unicode
block (U+0300 to U+036F), which covers the common diacritics.
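
The length comparison has one catch: NFKD also expands compatibility characters such as the 'ﬁ' ligature, so the length can change with no diacritic involved. A possible variant, sketched here with function names that are my own, uses NFD and looks at the combining class of each character instead. It also shows one way to answer the original question by testing for the acute accent (U+0301) only. Note that it flags an acute on any letter, not just vowels.

```python
import unicodedata

ACUTE = '\u0301'  # COMBINING ACUTE ACCENT

def has_diacritics(word):
    """True if the NFD form contains a character with a nonzero combining class."""
    decomposed = unicodedata.normalize('NFD', word)
    return any(unicodedata.combining(ch) for ch in decomposed)

def has_acute(word):
    """True if any letter of word carries an acute accent."""
    # NFD splits precomposed letters such as 'é' into 'e' + U+0301.
    return ACUTE in unicodedata.normalize('NFD', word)

def words_without_acute(words):
    """Keep only the words that contain no acute accent."""
    return [w for w in words if not has_acute(w)]

sample = ['éléphant', 'elephant', 'père', 'canción', 'über']
print(words_without_acute(sample))   # ['elephant', 'père', 'über']
print(has_diacritics('\ufb01'))      # False: a ligature, not a diacritic
```

Swapping ACUTE for '\u0300' would give the same test for grave accents.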

jmf