Date: Sat, 02 Apr 2011 00:15:22 +0100
From: MRAB <python@mrabarnett.plus.com>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.2.15) Gecko/20110303 Thunderbird/3.1.9
MIME-Version: 1.0
To: python-list@python.org
Subject: Re: Extracting "true" words
References: <4d963c1d$0$1584$426a34cc@news.free.fr>
In-Reply-To: <4d963c1d$0$1584$426a34cc@news.free.fr>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: list
Reply-To: python-list@python.org
Newsgroups: comp.lang.python
Message-ID: <mailman.110.1301699738.2990.python-list@python.org>

On 01/04/2011 21:55, candide wrote:
> Back again with my study of regular expressions ;) There exists a
> special character allowing alphanumeric extraction, the special
> character \w (BTW, what does the letter 'w' refer to?). But this feature
> doesn't allow extracting true words; by "true" I mean words composed
> only of _alphabetic_ letters (no digits or underscores).
>
The 'w' refers to a 'word' character, although in regex it refers to
letters, digits and the underscore character '_' due to its use in
computer languages (basically, the characters of an identifier or name).
>
> So I was wondering what the pattern is to extract (or to match) _true_
> words? Of course, I don't restrict myself to the ASCII universe, so
> the pattern [a-zA-Z]+ doesn't meet my needs.
>
Using the re module, you would have to create a character class out of
all the possible letters, something like this:

     letter_class = u"[" + u"".join(unichr(c) for c in range(0x10000)
                                    if unichr(c).isalpha()) + u"]"
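
For anyone reading this on Python 3, where unichr() and the u"" prefix are
gone, here is a rough sketch of the same idea. It is my own port, not tested
against every Unicode version: the function name build_letter_class is made
up, it scans the whole code point range instead of stopping at 0x10000, and
it collapses consecutive letters into ranges (written as \UXXXXXXXX escapes)
so the resulting class stays small.

```python
import re
import sys


def build_letter_class():
    """Return a regex character class covering every code point
    for which str.isalpha() is true (hypothetical helper name)."""
    ranges = []
    start = None
    # Go one past the last code point so a trailing run gets flushed.
    for cp in range(sys.maxunicode + 2):
        is_letter = cp <= sys.maxunicode and chr(cp).isalpha()
        if is_letter and start is None:
            start = cp
        elif not is_letter and start is not None:
            ranges.append((start, cp - 1))
            start = None

    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append("\\U%08x" % lo)
        else:
            parts.append("\\U%08x-\\U%08x" % (lo, hi))
    return "[" + "".join(parts) + "]"


# Build once and reuse; the scan over all code points is not free.
WORD_RE = re.compile(build_letter_class() + "+")
```

You would then call WORD_RE.findall(text) to get the runs of letters, with
digits and underscores acting as separators.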

Alternatively, you could try the new regex implementation here:

     http://pypi.python.org/pypi/regex

which adds support for Unicode properties, and do something like this:

     import regex
     words = regex.findall(ur"\p{Letter}+", unicode_text)
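
If installing a third-party module isn't an option, the Letter property can
be approximated with the standard library alone. \p{Letter} corresponds to
the Unicode general categories that start with "L" (Lu, Ll, Lt, Lm, Lo), and
unicodedata.category() reports those. The sketch below assumes Python 3; the
names is_letter and letter_runs are made up for illustration, and this is
not how the regex module does it internally.

```python
import unicodedata
from itertools import groupby


def is_letter(ch):
    # General categories Lu, Ll, Lt, Lm and Lo all begin with "L".
    return unicodedata.category(ch).startswith("L")


def letter_runs(text):
    """Return the maximal runs of letter characters in text."""
    return ["".join(group)
            for key, group in groupby(text, key=is_letter)
            if key]
```

letter_runs(text) walks the string once and keeps only the groups whose
characters are letters, so digits, underscores, punctuation and whitespace
all end a word.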