Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #2414

Re: Extracting "true" words

References <4d963c1d$0$1584$426a34cc@news.free.fr>
Date 2011-04-01 16:10 -0700
Subject Re: Extracting "true" words
From Chris Rebert <clp2@rebertia.com>
Newsgroups comp.lang.python
Message-ID <mailman.109.1301699409.2990.python-list@python.org> (permalink)

Show all headers | View raw


On Fri, Apr 1, 2011 at 1:55 PM, candide <candide@free.invalid> wrote:
> Back again with my study of regular expressions ;) There exists a special
> character allowing alphanumeric extraction, the special character \w (BTW,
> what the letter 'w' refers to?).

"Word" presumably/intuitively; hence the non-standard "[:word:]"
POSIX-like character class alias for \w in some environments.

> But this feature doesn't permit to extract
> true words; by "true" I mean word composed only of _alphabetic_ letters (not
> digit nor underscore).

Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
And what of hyphenated terms (e.g. "re-lock")?

> So I was wondering what is the pattern to extract (or to match) _true_ words
> ? Of course, I don't restrict myself to the ascii universe so that the
> pattern [a-zA-Z]+ doesn't meet my needs.

AFAICT, there doesn't appear to be a nice way to do this in Python
using the std lib `re` module, but I'm not a regex guru.
POSIX character classes are unsupported, which rules out "[:alpha:]".
\w can be made Unicode/locale-sensitive, but includes digits and the
underscore, as you've already pointed out.
\p (Unicode property/block testing), which would allow for
"\p{Alphabetic}" or similar, is likewise unsupported.

Cheers,
Chris
--
http://blog.rebertia.com

Back to comp.lang.python | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

Extracting "true" words candide <candide@free.invalid> - 2011-04-01 22:55 +0200
  Re: Extracting "true" words Chris Rebert <clp2@rebertia.com> - 2011-04-01 16:10 -0700
    Re: Extracting "true" words John Nagle <nagle@animats.com> - 2011-04-01 21:04 -0700
    Re: Extracting "true" words candide <candide@free.invalid> - 2011-04-02 15:18 +0200
  Re: Extracting "true" words MRAB <python@mrabarnett.plus.com> - 2011-04-02 00:15 +0100

csiph-web