Groups > comp.lang.python > #2401 > unrolled thread
| Started by | candide <candide@free.invalid> |
|---|---|
| First post | 2011-04-01 22:55 +0200 |
| Last post | 2011-04-02 00:15 +0100 |
| Articles | 5 — 4 participants |
Extracting "true" words candide <candide@free.invalid> - 2011-04-01 22:55 +0200
Re: Extracting "true" words Chris Rebert <clp2@rebertia.com> - 2011-04-01 16:10 -0700
Re: Extracting "true" words John Nagle <nagle@animats.com> - 2011-04-01 21:04 -0700
Re: Extracting "true" words candide <candide@free.invalid> - 2011-04-02 15:18 +0200
Re: Extracting "true" words MRAB <python@mrabarnett.plus.com> - 2011-04-02 00:15 +0100
| From | candide <candide@free.invalid> |
|---|---|
| Date | 2011-04-01 22:55 +0200 |
| Subject | Extracting "true" words |
| Message-ID | <4d963c1d$0$1584$426a34cc@news.free.fr> |
Back again with my study of regular expressions ;) There exists a special character allowing alphanumeric extraction, the special character \w (BTW, what the letter 'w' refers to?). But this feature doesn't permit to extract true words; by "true" I mean word composed only of _alphabetic_ letters (not digit nor underscore). So I was wondering what is the pattern to extract (or to match) _true_ words ? Of course, I don't restrict myself to the ascii universe so that the pattern [a-zA-Z]+ doesn't meet my needs.
| From | Chris Rebert <clp2@rebertia.com> |
|---|---|
| Date | 2011-04-01 16:10 -0700 |
| Message-ID | <mailman.109.1301699409.2990.python-list@python.org> |
| In reply to | #2401 |
On Fri, Apr 1, 2011 at 1:55 PM, candide <candide@free.invalid> wrote:
> Back again with my study of regular expressions ;) There exists a special
> character allowing alphanumeric extraction, the special character \w (BTW,
> what the letter 'w' refers to?).
"Word" presumably/intuitively; hence the non-standard "[:word:]"
POSIX-like character class alias for \w in some environments.
> But this feature doesn't permit to extract
> true words; by "true" I mean word composed only of _alphabetic_ letters (not
> digit nor underscore).
Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
And what of hyphenated terms (e.g. "re-lock")?
> So I was wondering what is the pattern to extract (or to match) _true_ words
> ? Of course, I don't restrict myself to the ascii universe so that the
> pattern [a-zA-Z]+ doesn't meet my needs.
AFAICT, there doesn't appear to be a nice way to do this in Python
using the std lib `re` module, but I'm not a regex guru.
POSIX character classes are unsupported, which rules out "[:alpha:]".
\w can be made Unicode/locale-sensitive, but includes digits and the
underscore, as you've already pointed out.
\p (Unicode property/block testing), which would allow for
"\p{Alphabetic}" or similar, is likewise unsupported.
Cheers,
Chris
--
http://blog.rebertia.com
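For readers on a current Python 3, one workaround that stays inside the standard `re` module is the double-negative class `[^\W\d_]`. This is only a sketch and is not something proposed in the reply above; the helper name `true_words` is made up for illustration. It assumes Python 3, where `str` patterns are Unicode-aware by default. It is an approximation: characters that `\w` accepts but `\d` does not (some superscript or other numeric characters, for example) may still slip through.

```python
import re
import unicodedata

# [^\W\d_] reads as "a \w character that is neither a digit nor an underscore".
# In Python 3, str patterns are Unicode-aware without any extra flag.
TRUE_WORD = re.compile(r"[^\W\d_]+")


def true_words(text):
    """Return runs of letters. `true_words` is a hypothetical helper name."""
    # Compose "e" + combining accent into one letter first; otherwise the
    # combining mark (which is not a \w character) would split the word.
    return TRUE_WORD.findall(unicodedata.normalize("NFC", text))


print(true_words("déjà vu, naïve_user42 café"))
# ['déjà', 'vu', 'naïve', 'user', 'café']
```

The underscore and the digits act as separators here, so `naïve_user42` yields two words rather than one.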
| From | John Nagle <nagle@animats.com> |
|---|---|
| Date | 2011-04-01 21:04 -0700 |
| Message-ID | <4d96a062$0$10535$742ec2ed@news.sonic.net> |
| In reply to | #2414 |
On 4/1/2011 4:10 PM, Chris Rebert wrote:
> On Fri, Apr 1, 2011 at 1:55 PM, candide<candide@free.invalid> wrote:
>> Back again with my study of regular expressions ;) There exists a special
>> character allowing alphanumeric extraction, the special character \w (BTW,
>> what the letter 'w' refers to?).
>
> "Word" presumably/intuitively; hence the non-standard "[:word:]"
> POSIX-like character class alias for \w in some environments.
>
>> But this feature doesn't permit to extract
>> true words; by "true" I mean word composed only of _alphabetic_ letters (not
>> digit nor underscore).
>
> Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
> And what of hyphenated terms (e.g. "re-lock")?
It's an interesting parsing problem to find word breaks in mixed
language text. It's quite common to find English and Japanese text
mixed. (See "http://www.dokidoki6.com/00_index1.html". Caution,
excessively cute.) Each ideograph is a "word", of course.
Parse this into words:
★12/25/2009★
6%DOKIDOKI VISUAL FILE vol.4を公開しました。
アルバムの上部で再生操作、下部でサムネイルがご覧いただけます。
John Nagle
| From | candide <candide@free.invalid> |
|---|---|
| Date | 2011-04-02 15:18 +0200 |
| Message-ID | <4d97229c$0$4785$426a74cc@news.free.fr> |
| In reply to | #2414 |
On 02/04/2011 01:10, Chris Rebert wrote:
> "Word" presumably/intuitively; hence the non-standard "[:word:]"
> POSIX-like character class alias for \w in some environments.
OK
> Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
Yes, CJK ideographs don't belong to the locale I'm working with ;)
> And what of hyphenated terms (e.g. "re-lock")?
I'm interested only in ASCII letters and ASCII letters with diacritics.
Thanks for your response.
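The narrower requirement stated here (ASCII letters, plus ASCII letters carrying diacritics) can be approximated with `unicodedata`. The following is a sketch under assumptions, not a method from the thread: the function names are hypothetical, and it treats a character as acceptable when its canonical decomposition is an ASCII letter followed only by combining marks. Letters that have no such decomposition, such as "ø" or the ligature "œ", are rejected by this test, which may or may not suit a given locale.

```python
import string
import unicodedata
from itertools import groupby


def is_ascii_based_letter(ch):
    """True for ASCII letters and for ASCII letters with diacritics."""
    # NFD splits "é" into "e" + U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", ch)
    return (decomposed[0] in string.ascii_letters
            and all(unicodedata.combining(c) for c in decomposed[1:]))


def ascii_based_words(text):
    """Return maximal runs of characters accepted by the test above."""
    # Work on composed characters so each accented letter is one code point.
    text = unicodedata.normalize("NFC", text)
    return ["".join(run)
            for accepted, run in groupby(text, key=is_ascii_based_letter)
            if accepted]


print(ascii_based_words("Le café coûte 3€, naïve_idée 日本語"))
# ['Le', 'café', 'coûte', 'naïve', 'idée']
```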
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2011-04-02 00:15 +0100 |
| Message-ID | <mailman.110.1301699738.2990.python-list@python.org> |
| In reply to | #2401 |
On 01/04/2011 21:55, candide wrote:
> Back again with my study of regular expressions ;) There exists a
> special character allowing alphanumeric extraction, the special
> character \w (BTW, what the letter 'w' refers to?). But this feature
> doesn't permit to extract true words; by "true" I mean word composed
> only of _alphabetic_ letters (not digit nor underscore).
>
The 'w' refers to a 'word' character, although in regex it refers to
letters, digits and the underscore character '_' due to its use in
computer languages (basically, the characters of an identifier or name).
>
> So I was wondering what is the pattern to extract (or to match) _true_
> words ? Of course, I don't restrict myself to the ascii universe so that
> the pattern [a-zA-Z]+ doesn't meet my needs.
>
Using the re module, you would have to create a character class out of
all the possible letters, something like this:
    letter_class = u"[" + u"".join(unichr(c) for c in range(0x10000) if
                                   unichr(c).isalpha()) + u"]"
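The snippet above is Python 2 (`unichr`, `u""` literals). A rough Python 3 rendering of the same idea is sketched below, with one change that is not in the original post: consecutive code points are collapsed into ranges so the class stays short. The function name `build_letter_class` and its `limit` parameter are illustrative assumptions; like the original, the default covers only the Basic Multilingual Plane.

```python
import re


def build_letter_class(limit=0x10000):
    """Build a regex character class of every code point where isalpha() holds."""
    ranges = []
    start = None
    for cp in range(limit):
        if chr(cp).isalpha():
            if start is None:
                start = cp              # open a new run of letters
        elif start is not None:
            ranges.append((start, cp - 1))  # close the current run
            start = None
    if start is not None:
        ranges.append((start, limit - 1))

    parts = []
    for lo, hi in ranges:
        # \UXXXXXXXX escapes are understood by re in Python 3.3 and later.
        if lo == hi:
            parts.append("\\U%08x" % lo)
        else:
            parts.append("\\U%08x-\\U%08x" % (lo, hi))
    return "[" + "".join(parts) + "]"


LETTERS = re.compile(build_letter_class() + "+")
print(LETTERS.findall("d\u00e9j\u00e0 vu_42 caf\u00e9"))
# ['déjà', 'vu', 'café']
```

Building the class scans 65,536 code points once, so it is best done at import time rather than per call.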
Alternatively, you could try the new regex implementation here:
http://pypi.python.org/pypi/regex
which adds support for Unicode properties, and do something like this:
words = regex.findall(ur"\p{Letter}+", unicode_text)
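The `ur"..."` prefix in the line above is Python 2 only. Under Python 3 the same call might look like the sketch below. It assumes the third-party `regex` package is installed; the fallback branch is an addition for illustration (not part of the post) and only approximates the Unicode letter property with the standard `re` module. `find_words` is a hypothetical name.

```python
try:
    import regex  # third-party package, e.g. installed with: pip install regex
    LETTER_RUN = regex.compile(r"\p{L}+")   # \p{L} is the Unicode "Letter" category
except ImportError:
    import re
    # Rough standard-library stand-in, used only if regex is unavailable.
    LETTER_RUN = re.compile(r"[^\W\d_]+")


def find_words(unicode_text):
    """Return every run of letters in the given str."""
    return LETTER_RUN.findall(unicode_text)


print(find_words("d\u00e9j\u00e0 vu_42"))
# ['déjà', 'vu']
```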