Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!eu.feeder.erje.net!eweka.nl!lightspeed.eweka.nl!194.109.133.83.MISMATCH!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Date: Sat, 23 Feb 2013 20:23:29 +0000
From: MRAB <python@mrabarnett.plus.com>
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/20130215 Thunderbird/17.0.3
MIME-Version: 1.0
To: python-list@python.org
Subject: Re: Correct handling of case in unicode and regexps
References: <CABicbJLzQ9AHrGuaooiBRk45U5CHZYw6CodJFiQvAuF4+7kToA@mail.gmail.com> <CAHzaPEMmSExoFunOp_OyRCEOKE-+WzEO-hdb61DUiZFnzOG_rw@mail.gmail.com> <CABicbJJ0RoyQVdX9Hyd-fYeumS4faH2TVpYHiMwW0MRuPZUL8g@mail.gmail.com> <CABicbJ+fQW0og8rJsL5Bio_uTNCUtNwEN2MAtSdWmg49Zw7r8Q@mail.gmail.com> <5128FF37.7060500@mrabarnett.plus.com> <CABicbJJ0aPB-bytZo9g8OwmP7RTKmVkKoz=uxqz6XCSKecR2eA@mail.gmail.com> <51290699.8050209@mrabarnett.plus.com> <CABicbJJyF9OLBhB8dJCHjE30KfkUsBAwch5GpEfEby8X19=JYg@mail.gmail.com>
In-Reply-To: <CABicbJJyF9OLBhB8dJCHjE30KfkUsBAwch5GpEfEby8X19=JYg@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Precedence: list
Reply-To: python-list@python.org
Newsgroups: comp.lang.python
Message-ID: <mailman.2377.1361651008.2939.python-list@python.org>
Lines: 63
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:39710

On 2013-02-23 18:57, Devin Jeanpierre wrote:
> On Sat, Feb 23, 2013 at 1:12 PM, MRAB <python@mrabarnett.plus.com>
> wrote:
>> The basic rule is that a series of characters in the regex must
>> match a series of characters in the text, with no partial matches
>> in either.
>>
>> For example, 'ss' can match 'ß', but 's' can't match 'ß' because
>> that would be matching part of 'ß'.
>>
>> In a regex like 's+', you're asking it to match one or more
>> repetitions of 's', but that would mean that 's' would have to
>> match part of 'ß' in the first iteration and the remainder of 'ß'
>> in the second iteration.
>
> That makes sense. I'll have to think about this and run some tests
> through regex, as well.
>
>> Although it's theoretically possible to do that, the code is
>> already difficult enough. The cost outweighs the potential
>> benefit.
>>
>> If you'd like to have a go at implementing it, the code _is_ open
>> source. :-)
>
> Actually, the reason it's relevant to me is that I'm reimplementing
> the re module using a more automata theoretic approach (it's my
> second attack at the problem). Also, I've read the _sre source code
> and it's unpleasant. Is regex much better?
>
I like to think so! ;-)

Part of the problem may be that it also supports fuzzy (approximate)
matching.

> At least the way I'm planning on going about it, supporting this is
> easier, as long as one can figure out what it means to match halfway
> inside a ß. Since case folding is a homomorphism*, I can case fold
> the regex** and case fold the input and then I'm done. Case folding
> of the input can be done character by character, and to emulate the
> regex module behavior I'd need to check at certain places whether or
> not I'm in the middle of a casefolding expansion, and fail if so. On
> the other hand, if I don't emulate the regex module's behavior in at
> least some cases, I'd need to figure out what the value of a match
> of 's' against 'ß' would be.
>
It seems like a reasonable restriction to me that start and end
positions should be integers, i.e. forbid partial matches within a
character.

This means that matching '(ss)' against 'ß' would be OK, as would '(s+)'
against 'ß', but '(s)' or '(s)(s)' against 'ß' would not, otherwise you
could get:

>>> match('(s)', 'ß').span(0)
(0, 0.5)
>>> match('(s)(s)', 'ß').span(0, 1, 2)
((0, 1), (0, 0.5), (0.5, 1)).

> [*]  i.e. it can be done character by character (see Unicode 3.13
> Default Case Algorithms) [**] Not as trivial as it sounds, but still
>  easy. [ßa-z] goes to e.g. [a-z]|ss (not [ssa-z]).
>