Re: Correct handling of case in unicode and regexps

Date 2013-02-23 18:12 +0000
From MRAB <python@mrabarnett.plus.com>
Subject Re: Correct handling of case in unicode and regexps
References (1 earlier) <CAHzaPEMmSExoFunOp_OyRCEOKE-+WzEO-hdb61DUiZFnzOG_rw@mail.gmail.com> <CABicbJJ0RoyQVdX9Hyd-fYeumS4faH2TVpYHiMwW0MRuPZUL8g@mail.gmail.com> <CABicbJ+fQW0og8rJsL5Bio_uTNCUtNwEN2MAtSdWmg49Zw7r8Q@mail.gmail.com> <5128FF37.7060500@mrabarnett.plus.com> <CABicbJJ0aPB-bytZo9g8OwmP7RTKmVkKoz=uxqz6XCSKecR2eA@mail.gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.2362.1361643158.2939.python-list@python.org>

On 2013-02-23 17:51, Devin Jeanpierre wrote:
> On Sat, Feb 23, 2013 at 12:41 PM, MRAB <python@mrabarnett.plus.com>
> wrote:
>> Getting full case folding to work can be tricky. There's always
>> going to be a limit to what's worth doing.
>>
>> There are also areas where it's not clear what the result should
>> be. You've already mentioned matching 's' against 'ß' (fails) and
>> matching 'ss' against 'ß' (succeeds), but how about matching
>> '(s)(s)' against 'ß' (fails)?
>>
>> For the record, Perl also says that 'ss' matches 'ß', but 's+' does
>> not.
>
> I would find it helpful to know the exact rules. The regex module
> docs say that it works, but don't say what it means to "work".
>
The basic rule is that a series of characters in the regex must match a
series of characters in the text as whole units: neither side may be
matched only in part.

For example, 'ss' can match 'ß', but 's' can't match 'ß' because that
would be matching part of 'ß'.
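To make that concrete, here is a minimal sketch using only the standard library's str.casefold(), which applies full case folding. The helper name folds_equal is hypothetical, and this only compares two whole literals; it is not how the regex module does its matching.

```python
def folds_equal(pattern_literal, text):
    """Compare two whole strings under full case folding (illustration only)."""
    return pattern_literal.casefold() == text.casefold()

# 'ß' folds to the two-character sequence 'ss'.
sharp_s_folded = "ß".casefold()

# 'ss' covers the whole folded form of 'ß', so the comparison succeeds.
whole_match = folds_equal("ss", "ß")

# 's' would cover only the first half of the folded form, so it fails.
partial_match = folds_equal("s", "ß")

print(sharp_s_folded, whole_match, partial_match)
```

The single 's' fails because there is nothing it could pair with except half of the folded 'ß'.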

In a regex like 's+', you're asking it to match one or more repetitions
of 's', but that would mean that 's' would have to match part of 'ß' in
the first iteration and the remainder of 'ß' in the second iteration.
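A toy model of that restriction, assuming each regex item (a literal, a group, or one iteration of a repeat) is treated as an indivisible "atom" that has to line up with a whole number of text characters. The function match_atoms and the atom lists are hypothetical names made up for this sketch; the real regex module is implemented quite differently.

```python
def match_atoms(atoms, text):
    """Return True if the atoms, in order, consume all of `text` under full
    case folding, with every atom covering only whole text characters."""
    def helper(index, pos):
        if index == len(atoms):
            return pos == len(text)
        target = atoms[index].casefold()
        # Try ever longer runs of whole text characters for this atom.
        for end in range(pos + 1, len(text) + 1):
            folded = text[pos:end].casefold()
            if folded == target and helper(index + 1, end):
                return True
            if len(folded) >= len(target):
                # Adding more characters only makes the folded run longer.
                break
        return False
    return helper(0, 0)

# 'ss' as one atom lines up with the whole of 'ß'.
one_atom = match_atoms(["ss"], "ß")

# Two separate 's' atoms, as in '(s)(s)' or two iterations of 's+',
# would each need half of 'ß', so this fails.
two_atoms = match_atoms(["s", "s"], "ß")

# Against plain 'SS' the two atoms each get a whole character.
two_atoms_plain = match_atoms(["s", "s"], "SS")

print(one_atom, two_atoms, two_atoms_plain)
```

Supporting 's+' against 'ß' would mean letting an atom stop partway through the folded form of a text character and letting the next atom resume from there, which is the extra bookkeeping this model leaves out.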

Although it's theoretically possible to do that, the code is already
difficult enough. The cost outweighs the potential benefit.

If you'd like to have a go at implementing it, the code _is_ open
source. :-)
