


Re: Correct handling of case in unicode and regexps

Date Sat, 23 Feb 2013 18:12:41 +0000
From MRAB <python@mrabarnett.plus.com>
To python-list@python.org
Subject Re: Correct handling of case in unicode and regexps
Newsgroups comp.lang.python
Message-ID <mailman.2362.1361643158.2939.python-list@python.org>



On 2013-02-23 17:51, Devin Jeanpierre wrote:
> On Sat, Feb 23, 2013 at 12:41 PM, MRAB <python@mrabarnett.plus.com>
> wrote:
>> Getting full case folding to work can be tricky. There's always
>> going to be a limit to what's worth doing.
>>
>> There are also areas where it's not clear what the result should
>> be. You've already mentioned matching 's' against 'ß' (fails) and
>> matching 'ss' against 'ß' (succeeds), but how about matching
>> '(s)(s)' against 'ß' (fails)?
>>
>> For the record, Perl also says that 'ss' matches 'ß', but 's+' does
>> not.
>
> I would find it helpful to know the exact rules. The regex module
> docs say that it works, but don't say what it means to "work".
>
The basic rule is that a series of whole characters in the regex must
match a series of whole characters in the text; neither side may be
matched only in part.

For example, 'ss' can match 'ß', but 's' can't match 'ß' because that
would be matching part of 'ß'.

In a regex like 's+', you're asking it to match one or more repetitions
of 's', but that would mean that 's' would have to match part of 'ß' in
the first iteration and the remainder of 'ß' in the second iteration.
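
As a rough illustration of that rule (this is not the regex module's
actual code), here is a stdlib-only sketch. The helper names match_atom
and match_atoms are made up for the example, and it assumes the pattern
is nothing more than a sequence of literal pieces compared via
str.casefold(). Each piece has to consume whole characters of the text.

```python
def match_atom(atom, text, pos):
    """Try to match one literal piece of the pattern at text[pos:].

    Hypothetical helper: returns the end position if the piece matches
    a run of *whole* text characters under case folding, else None.
    """
    target = atom.casefold()
    folded = ""
    end = pos
    # Consume whole text characters until we have folded at least as
    # many characters as the piece needs.
    while end < len(text) and len(folded) < len(target):
        folded += text[end].casefold()
        end += 1
    # If we overshot (e.g. 's' against 'ß', which folds to 'ss'), the
    # piece would cover only part of a text character: no match.
    return end if folded == target else None


def match_atoms(atoms, text):
    """Match a sequence of literal pieces against the whole text."""
    pos = 0
    for atom in atoms:
        pos = match_atom(atom, text, pos)
        if pos is None:
            return False
    return pos == len(text)


print(match_atoms(["ss"], "ß"))      # True: 'ss' covers all of 'ß'
print(match_atoms(["s"], "ß"))       # False: 's' is only part of 'ß'
print(match_atoms(["s", "s"], "ß"))  # False: each 's' would need half of 'ß'
print(match_atoms(["ß"], "SS"))      # True: it works in the other direction too
```

The third call stands in for both '(s)(s)' and two iterations of 's+':
each piece is tried on its own, so neither can claim half of 'ß'.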

Although it's theoretically possible to do that, the code is already
difficult enough. The cost outweighs the potential benefit.

If you'd like to have a go at implementing it, the code _is_ open
source. :-)


