Newsgroups: comp.lang.python
Date: Wed, 17 Oct 2012 08:32:52 -0700 (PDT)
Subject: Re: Script for finding words of any size that do NOT contain vowels with acute diacritic marks?
From: wxjmfauth@gmail.com
To: comp.lang.python@googlegroups.com
Cc: python-list@python.org, d@davea.name, nwaits

On Wednesday, 17 October 2012 17:00:46 UTC+2, Dave Angel wrote:
> On 10/17/2012 10:31 AM, nwaits wrote:
> > I'm very impressed with python's wordlist script for plain text. Is there a script for finding words that do NOT have certain diacritic marks, like acute or grave accents (utf-8), over the vowels?
> > Thank you.
>
> if you can construct a list of "illegal" characters, then you can simply
> check each character of the word against the list, and if it succeeds
> for all of the characters, it's a winner.
>
> If that's not fast enough, you can build a translation table from the
> list of illegal characters, and use translate on each word. Then it
> becomes a question of checking if the translated word is all zeroes.
> More setup time, but much faster looping for each word.
>
> --
> DaveA

Lazy way. Py3.2

>>> import unicodedata
>>> def HasDiacritics(w):
...     w_decomposed = unicodedata.normalize('NFKD', w)
...     return 'no' if len(w) == len(w_decomposed) else 'yes'
...
>>> HasDiacritics('éléphant')
'yes'
>>> HasDiacritics('elephant')
'no'
>>> HasDiacritics('\N{LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON}')
'yes'
>>> HasDiacritics('U')
'no'
>>>

Should be ok for the CombiningDiacriticalMarks unicode range
(common diacritics)

jmf
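
A follow-up sketch, not from the thread: the length comparison above flags any character that decomposes, and under NFKD that includes compatibility characters such as ligatures, not only accented letters. If only certain marks over vowels matter, as in the original question, one possible refinement is to decompose with NFD and look at the combining marks that follow each base letter. The names ACUTE, GRAVE, VOWELS, has_marked_vowel and words_without_marks are made up for illustration, and the vowel set is assumed to be the plain Latin a, e, i, o, u.

```python
import unicodedata

# Hypothetical names, chosen for this sketch only.
ACUTE = '\N{COMBINING ACUTE ACCENT}'   # U+0301
GRAVE = '\N{COMBINING GRAVE ACCENT}'   # U+0300
VOWELS = set('aeiouAEIOU')             # assumption: plain Latin vowels

def has_marked_vowel(word, marks=(ACUTE, GRAVE)):
    """Return True if a vowel in word carries one of the given marks."""
    # NFD splits 'é' into 'e' followed by U+0301, and so on.
    decomposed = unicodedata.normalize('NFD', word)
    base = None
    for ch in decomposed:
        if unicodedata.combining(ch):
            # A combining mark belongs to the most recent base character.
            if ch in marks and base in VOWELS:
                return True
        else:
            base = ch
    return False

def words_without_marks(words, marks=(ACUTE, GRAVE)):
    """Keep only the words with no marked vowel."""
    return [w for w in words if not has_marked_vowel(w, marks)]

if __name__ == '__main__':
    sample = ['éléphant', 'elephant', 'über', 'voilà']
    print(words_without_marks(sample))
```

With the sample list, 'éléphant' and 'voilà' are dropped, while 'über' is kept because a diaeresis is not among the marks being tested. Pass marks=(ACUTE,) to test for acute accents only.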