Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!news.dougwise.org!aioe.org!feeder.news-service.com!newsfeed.xs4all.nl!newsfeed6.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
Sender: chris@rebertia.com
In-Reply-To: <4d963c1d$0$1584$426a34cc@news.free.fr>
References: <4d963c1d$0$1584$426a34cc@news.free.fr>
Date: Fri, 1 Apr 2011 16:10:06 -0700
Subject: Re: Extracting "true" words
From: Chris Rebert <clp2@rebertia.com>
Cc: python-list@python.org
Content-Type: text/plain; charset=UTF-8
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.109.1301699409.2990.python-list@python.org>
Lines: 31
NNTP-Posting-Host: 82.94.164.166
Xref: x330-a1.tempe.blueboxinc.net comp.lang.python:2414

On Fri, Apr 1, 2011 at 1:55 PM, candide <candide@free.invalid> wrote:
> Back again with my study of regular expressions ;) There exists a special
> character allowing alphanumeric extraction, the special character \w (BTW,
> what the letter 'w' refers to?).

"Word" presumably/intuitively; hence the non-standard "[:word:]"
POSIX-like character class alias for \w in some environments.

> But this feature doesn't permit to extract
> true words; by "true" I mean word composed only of _alphabetic_ letters (not
> digit nor underscore).

Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
And what of hyphenated terms (e.g. "re-lock")?
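
Both questions change what counts as a "letter". As a rough sketch,
assuming Python 3 (where str is Unicode) and a hypothetical helper named
is_true_word that is not part of the original question, str.isalpha()
shows how the built-in notion of "alphabetic" treats these cases:

```python
# Assumes Python 3: str is Unicode and str.isalpha() follows the
# Unicode "Letter" general categories.
samples = ["word", "漢字", "re-lock", "abc123", "snake_case"]

def is_true_word(s):
    """Hypothetical helper: True if s is non-empty and all letters."""
    return s.isalpha()

for s in samples:
    # CJK ideographs count as alphabetic here; the hyphen, digits and
    # underscore make the test fail.
    print(repr(s), is_true_word(s))
```

So, under this definition, CJK ideographs are letters and "re-lock" is
not a single word unless the hyphen is handled separately.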

> So I was wondering what is the pattern to extract (or to match) _true_ words
> ? Of course, I don't restrict myself to the ascii universe so that the
> pattern [a-zA-Z]+ doesn't meet my needs.

AFAICT, there doesn't appear to be a direct way to do this in Python
using the standard library `re` module, but I'm not a regex guru.
POSIX character classes are unsupported, which rules out "[:alpha:]".
\w can be made Unicode- or locale-sensitive (via the re.UNICODE and
re.LOCALE flags), but it includes digits and the underscore, as you've
already pointed out.
\p (Unicode property/block testing), which would allow for
"\p{Alphabetic}" or similar, is likewise unsupported.

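One workaround that stays within `re`, offered as a sketch rather than a
definitive answer: since \w is Unicode-aware, a negated class can
subtract digits and the underscore from it. The names TRUE_WORD,
HYPHENATED, true_words and hyphenated_words below are made up for
illustration, and the code assumes Python 3:

```python
import re

# [^\W\d_] means "not a non-word character, not a digit, not an
# underscore", i.e. a \w character that is neither a digit nor "_".
# Python 3 str patterns are Unicode-aware by default; on Python 2 the
# pattern and text would need to be unicode objects, with re.UNICODE.
TRUE_WORD = re.compile(r"[^\W\d_]+")

# Optional variant that keeps hyphenated terms such as "re-lock" whole.
HYPHENATED = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def true_words(text):
    """Return maximal runs of letter-like characters in text."""
    return TRUE_WORD.findall(text)

def hyphenated_words(text):
    """Like true_words, but allows single hyphens inside a word."""
    return HYPHENATED.findall(text)

print(true_words("caf\u00e9_2011 foo42bar"))
print(hyphenated_words("re-lock the door"))
```

This is only an approximation of "alphabetic": I believe \w also admits
some numeric characters that \d does not cover (superscript digits, for
instance), so they would slip through as well.
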
Cheers,
Chris
--
http://blog.rebertia.com