| Date | 2011-04-02 00:15 +0100 |
|---|---|
| From | MRAB <python@mrabarnett.plus.com> |
| Subject | Re: Extracting "true" words |
| References | <4d963c1d$0$1584$426a34cc@news.free.fr> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.110.1301699738.2990.python-list@python.org> |
On 01/04/2011 21:55, candide wrote:
> Back again with my study of regular expressions ;) There exists a
> special character allowing alphanumeric extraction, the special
> character \w (BTW, what does the letter 'w' refer to?). But this feature
> doesn't allow extracting true words; by "true" I mean a word composed
> only of _alphabetic_ letters (no digits or underscores).
>
The 'w' refers to a 'word' character, although in regex it refers to
letters, digits and the underscore character '_' due to its use in
computer languages (basically, the characters of an identifier or name).
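
To make that concrete, here is a small sketch (written for Python 3, which is my assumption; the sample string is made up) showing that \w+ swallows digits and underscores along with letters:

```python
import re

sample = "foo_bar42 x-y"

# \w matches letters, digits and the underscore, so the whole
# identifier-like token comes back as a single match.
print(re.findall(r"\w+", sample))        # ['foo_bar42', 'x', 'y']

# An ASCII-only letter class splits on the underscore and drops the digits,
# but it would miss accented or non-Latin letters.
print(re.findall(r"[a-zA-Z]+", sample))  # ['foo', 'bar', 'x', 'y']
```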
>
> So I was wondering what the pattern is to extract (or to match) _true_
> words? Of course, I don't restrict myself to the ASCII universe, so
> the pattern [a-zA-Z]+ doesn't meet my needs.
>
Using the re module, you would have to create a character class out of
all the possible letters, something like this:
    letter_class = u"[" + u"".join(unichr(c) for c in range(0x10000)
                                   if unichr(c).isalpha()) + u"]"
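
For anyone on Python 3, the same idea might look roughly like the sketch below. It is only an illustration under a few assumptions: the helper name build_letter_class and the sample text are hypothetical, the scan stops at 0x10000 (the Basic Multilingual Plane, as in the snippet above), and consecutive letters are merged into ranges purely to keep the pattern short.

```python
import re

def build_letter_class(limit=0x10000):
    """Return a regex character class matching every code point below
    `limit` for which str.isalpha() is true, written as compact ranges."""
    ranges = []
    start = prev = None
    for cp in range(limit):
        if chr(cp).isalpha():
            if start is None:
                start = cp          # open a new run of letters
            prev = cp
        elif start is not None:
            ranges.append((start, prev))  # close the current run
            start = None
    if start is not None:
        ranges.append((start, prev))

    # No alphabetic character is special inside [...], so no escaping is needed.
    parts = [chr(lo) if lo == hi else chr(lo) + "-" + chr(hi)
             for lo, hi in ranges]
    return "[" + "".join(parts) + "]"

letter_class = build_letter_class()
word_re = re.compile(letter_class + "+")

words = word_re.findall("L'été 2011, foo_bar42 naïve")
# words == ['L', 'été', 'foo', 'bar', 'naïve']
```

Letters outside the scanned range would not be matched; raising limit to sys.maxunicode + 1 should cover them at the cost of a slower scan.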
Alternatively, you could try the new regex implementation here:
http://pypi.python.org/pypi/regex
which adds support for Unicode properties, and do something like this:
    import regex
    words = regex.findall(ur"\p{Letter}+", unicode_text)
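
A Python 3 flavoured sketch of the same call follows. The function name extract_words and the sample text are hypothetical, and the fallback branch is my own addition so the example still runs where the third-party regex module is not installed; it groups characters with str.isalpha() instead of using \p{Letter}.

```python
from itertools import groupby

try:
    import regex  # third-party module from PyPI, not the stdlib re
except ImportError:
    regex = None

def extract_words(text):
    """Return the maximal runs of Unicode letters found in `text`."""
    if regex is not None:
        # \p{Letter} matches any code point in a Unicode letter category.
        return regex.findall(r"\p{Letter}+", text)
    # Fallback (assumption, not part of the regex module): keep the runs
    # of consecutive characters for which str.isalpha() is true.
    return ["".join(run)
            for is_letter, run in groupby(text, key=str.isalpha)
            if is_letter]

words = extract_words("L'été 2011, foo_bar42 naïve")
# words == ['L', 'été', 'foo', 'bar', 'naïve']
```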