From: Chris Rebert
Newsgroups: comp.lang.python
Subject: Re: Extracting "true" words
Date: Fri, 1 Apr 2011 16:10:06 -0700

On Fri, Apr 1, 2011 at 1:55 PM, candide wrote:
> Back again with my study of regular expressions ;) There exists a special
> character allowing alphanumeric extraction, the special character \w (BTW,
> what the letter 'w' refers to?).

"Word" presumably/intuitively; hence the non-standard "[:word:]"
POSIX-like character class alias for \w in some environments.

> But this feature doesn't permit to extract
> true words; by "true" I mean word composed only of _alphabetic_ letters (not
> digit nor underscore).

Are you intentionally excluding CJK ideographs (as not
"letters"/alphabetic)? And what of hyphenated terms (e.g. "re-lock")?

> So I was wondering what is the pattern to extract (or to match) _true_ words
> ? Of course, I don't restrict myself to the ascii universe so that the
> pattern [a-zA-Z]+ doesn't meet my needs.

AFAICT, there doesn't appear to be a nice way to do this in Python
using the std lib `re` module, but I'm not a regex guru.
POSIX character classes are unsupported, which rules out "[:alpha:]".
\w can be made Unicode/locale-sensitive, but includes digits and the
underscore, as you've already pointed out.
\p (Unicode property/block testing), which would allow for
"\p{Alphabetic}" or similar, is likewise unsupported.

Cheers,
Chris
--
http://blog.rebertia.com
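
One possible workaround with the std lib `re` module, offered as a rough sketch rather than a definitive answer: subtract digits and the underscore from \w with a negated class. This assumes Python 3, where str patterns are Unicode-aware by default (on Python 2 you would need unicode strings plus re.UNICODE). The names TRUE_WORD and true_words are made up for illustration. It only approximates "\p{Alphabetic}": it can still let through some numeric characters that \d does not cover, and it splits hyphenated terms.

```python
import re

# Assumption: Python 3, so \w and \d are Unicode-aware for str patterns.
# [^\W\d_] reads as "a \w character that is neither a digit nor an underscore".
TRUE_WORD = re.compile(r"[^\W\d_]+")  # hypothetical name, for illustration only


def true_words(text):
    """Return the maximal runs of letter-like characters found in text."""
    return TRUE_WORD.findall(text)


if __name__ == "__main__":
    # "caf\u00e9" is "café" written with a precomposed e-acute.
    print(true_words("caf\u00e9 2011 foo_bar re-lock"))
    # prints ['café', 'foo', 'bar', 're', 'lock']
```

Note that "re-lock" comes back as two words, so hyphenated terms would need separate handling. If real property testing is required, the third-party `regex` module (not part of the std lib) supports \p{...} classes.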