| Date | 2011-04-01 21:04 -0700 |
|---|---|
| From | John Nagle <nagle@animats.com> |
| Newsgroups | comp.lang.python |
| Subject | Re: Extracting "true" words |
| References | <4d963c1d$0$1584$426a34cc@news.free.fr> <mailman.109.1301699409.2990.python-list@python.org> |
| Message-ID | <4d96a062$0$10535$742ec2ed@news.sonic.net> |
| Organization | Sonic.Net |
On 4/1/2011 4:10 PM, Chris Rebert wrote:
> On Fri, Apr 1, 2011 at 1:55 PM, candide<candide@free.invalid> wrote:
>> Back again with my study of regular expressions ;) There exists a special
>> character allowing alphanumeric extraction, the special character \w (BTW,
>> what the letter 'w' refers to?).
>
> "Word" presumably/intuitively; hence the non-standard "[:word:]"
> POSIX-like character class alias for \w in some environments.
>
>> But this feature doesn't permit to extract
>> true words; by "true" I mean word composed only of _alphabetic_ letters (not
>> digit nor underscore).
>
> Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
> And what of hyphenated terms (e.g. "re-lock")?
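One way to approximate "alphabetic only" with Python's standard re module is the character class [^\W\d_], which reads as "a word character that is neither a digit nor an underscore". The sketch below assumes Python 3, where str patterns are Unicode-aware by default; the helper name alphabetic_runs is made up for illustration. Note that it splits "re-lock" in two and treats a run of kana and kanji as a single "word", which is exactly the ambiguity raised in the questions above.

```python
import re

# [^\W\d_] = any \w character except digits and underscore, i.e. letters.
LETTERS = re.compile(r"[^\W\d_]+")

def alphabetic_runs(text):
    """Return maximal runs of letters found in text (hypothetical helper)."""
    return LETTERS.findall(text)

print(alphabetic_runs("foo_bar 42 baz9 re-lock"))
print(alphabetic_runs("vol.4を公開しました。"))
```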
It's an interesting parsing problem to find word breaks in
mixed-language text. It's quite common to find English and Japanese
text mixed. (See http://www.dokidoki6.com/00_index1.html. Caution:
excessively cute.) Each ideograph is a "word", of course.
Parse this into words:
★12/25/2009★
6%DOKIDOKI VISUAL FILE vol.4を公開しました。
アルバムの上部で再生操作、下部でサムネイルがご覧いただけます。
John Nagle
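
A rough sketch of how the sample above might be split, assuming Python 3 and only the standard re module. It takes "each ideograph is a word" literally for the CJK Unified Ideographs block (U+4E00 to U+9FFF), groups hiragana and katakana into runs by script, and treats any other run of letters as a word. The name split_words and the choice of code point ranges are assumptions for illustration; this is not a real Japanese segmenter, and kana runs in particular will not line up with dictionary words.

```python
import re

# Alternatives are tried left to right at each position.
TOKEN = re.compile(
    r"[\u4e00-\u9fff]"       # a single CJK unified ideograph is one token
    r"|[\u3040-\u309f]+"     # a run of hiragana
    r"|[\u30a0-\u30ff]+"     # a run of katakana
    # any other run of letters (no digits, no underscore, no kana/kanji)
    r"|[^\W\d_\u3040-\u30ff\u4e00-\u9fff]+"
)

def split_words(text):
    """Crude mixed-script word splitter (hypothetical helper)."""
    return TOKEN.findall(text)

sample = "6%DOKIDOKI VISUAL FILE vol.4を公開しました。"
print(split_words(sample))
```

Digits, punctuation, and symbols such as ★ are skipped entirely, following the original question's definition of a "true" word as letters only.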