Re: Extracting "true" words

Date	Fri, 1 Apr 2011 16:10:06 -0700
From	Chris Rebert <clp2@rebertia.com>
Newsgroups	comp.lang.python
Message-ID	<mailman.109.1301699409.2990.python-list@python.org>


On Fri, Apr 1, 2011 at 1:55 PM, candide <candide@free.invalid> wrote:
> Back again with my study of regular expressions ;) There exists a special
> character allowing alphanumeric extraction, the special character \w (BTW,
> what the letter 'w' refers to?).

"Word" presumably/intuitively; hence the non-standard "[:word:]"
POSIX-like character class alias for \w in some environments.

> But this feature doesn't permit to extract
> true words; by "true" I mean word composed only of _alphabetic_ letters (not
> digit nor underscore).

Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
And what of hyphenated terms (e.g. "re-lock")?
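
To make those two edge cases concrete, here is a small sketch, assuming
Python 3 (where str is Unicode) and a hypothetical helper name of my own
choosing. It only shows how the built-in str.isalpha() classifies such
tokens; it is not a proposed definition of a "true" word.

```python
def is_alphabetic_token(token):
    """Hypothetical helper: True if every character is a Unicode letter."""
    # str.isalpha() is False for the empty string and for any token
    # containing a digit, underscore, hyphen, or other non-letter.
    return token.isalpha()

# CJK ideographs count as letters here; hyphenated terms do not.
for token in ["word", "\u6f22\u5b57", "re-lock", "x2", "snake_case"]:
    print(token, is_alphabetic_token(token))
```

So whether CJK text and hyphenated terms count as words is a decision
you have to make explicitly, whichever pattern you end up with.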

> So I was wondering what is the pattern to extract (or to match) _true_ words
> ? Of course, I don't restrict myself to the ascii universe so that the
> pattern [a-zA-Z]+ doesn't meet my needs.

AFAICT, there doesn't appear to be a nice way to do this in Python
using the std lib `re` module, but I'm not a regex guru.
POSIX character classes are unsupported, which rules out "[:alpha:]".
\w can be made Unicode- or locale-sensitive (via the re.UNICODE or
re.LOCALE flags), but it still includes digits and the underscore, as
you've already pointed out.
\p (Unicode property/block testing), which would allow for
"\p{Alphabetic}" or similar, is likewise unsupported.
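
That said, one workaround that stays within the `re` module is a double
negative: a negated character class that excludes non-word characters,
digits, and the underscore, which leaves (roughly) letters. The sketch
below assumes Python 3, where str patterns are Unicode-aware by default;
the names LETTERS_ONLY and extract_true_words are mine, purely for
illustration.

```python
import re

# "Not (non-word char, digit, or underscore)" leaves letters only.
# re.UNICODE is already the default for str patterns in Python 3;
# it is spelled out here to make the intent explicit.
LETTERS_ONLY = re.compile(r"[^\W\d_]+", re.UNICODE)

def extract_true_words(text):
    """Return runs of letters, skipping digits, underscores, punctuation."""
    return LETTERS_ONLY.findall(text)

# Digits and underscores split or drop tokens; so does the hyphen.
print(extract_true_words("naive_user re-lock x2 2011"))
# Accented (precomposed) letters are kept.
print(extract_true_words("\u00c9t\u00e9 2011"))
```

Caveats: hyphenated terms come back as two separate words, and as far
as I can tell the class may still admit a few numeric characters that
are not decimal digits (e.g. superscript digits), since \d only covers
decimal digits. Treat it as an approximation of "alphabetic", not a
substitute for real \p{Alphabetic} support.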

Cheers,
Chris
--
http://blog.rebertia.com

Thread

Extracting "true" words candide <candide@free.invalid> - 2011-04-01 22:55 +0200
  Re: Extracting "true" words Chris Rebert <clp2@rebertia.com> - 2011-04-01 16:10 -0700
    Re: Extracting "true" words John Nagle <nagle@animats.com> - 2011-04-01 21:04 -0700
    Re: Extracting "true" words candide <candide@free.invalid> - 2011-04-02 15:18 +0200
  Re: Extracting "true" words MRAB <python@mrabarnett.plus.com> - 2011-04-02 00:15 +0100
