| Path | csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail |
|---|---|
| Return-Path | <jeanpierreda@gmail.com> |
| X-Original-To | python-list@python.org |
| Delivered-To | python-list@mail.python.org |
| MIME-Version | 1.0 |
| In-Reply-To | <51290699.8050209@mrabarnett.plus.com> |
| References | <CABicbJLzQ9AHrGuaooiBRk45U5CHZYw6CodJFiQvAuF4+7kToA@mail.gmail.com> <CAHzaPEMmSExoFunOp_OyRCEOKE-+WzEO-hdb61DUiZFnzOG_rw@mail.gmail.com> <CABicbJJ0RoyQVdX9Hyd-fYeumS4faH2TVpYHiMwW0MRuPZUL8g@mail.gmail.com> <CABicbJ+fQW0og8rJsL5Bio_uTNCUtNwEN2MAtSdWmg49Zw7r8Q@mail.gmail.com> <5128FF37.7060500@mrabarnett.plus.com> <CABicbJJ0aPB-bytZo9g8OwmP7RTKmVkKoz=uxqz6XCSKecR2eA@mail.gmail.com> <51290699.8050209@mrabarnett.plus.com> |
| From | Devin Jeanpierre <jeanpierreda@gmail.com> |
| Date | Sat, 23 Feb 2013 13:57:18 -0500 |
| Subject | Re: Correct handling of case in unicode and regexps |
| To | python-list@python.org |
| Content-Type | text/plain; charset=UTF-8 |
| Content-Transfer-Encoding | quoted-printable |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.2367.1361645886.2939.python-list@python.org> (permalink) |
| Lines | 47 |
| NNTP-Posting-Host | 2001:888:2000:d::a6 |
| X-Trace | 1361645886 news.xs4all.nl 6959 [2001:888:2000:d::a6]:40662 |
| X-Complaints-To | abuse@xs4all.nl |
| Xref | csiph.com comp.lang.python:39699 |
On Sat, Feb 23, 2013 at 1:12 PM, MRAB <python@mrabarnett.plus.com> wrote:
> The basic rule is that a series of characters in the regex must match a
> series of characters in the text, with no partial matches in either.
>
> For example, 'ss' can match 'ß', but 's' can't match 'ß' because that
> would be matching part of 'ß'.
>
> In a regex like 's+', you're asking it to match one or more repetitions
> of 's', but that would mean that 's' would have to match part of 'ß' in
> the first iteration and the remainder of 'ß' in the second iteration.

That makes sense. I'll have to think about this and run some tests
through regex, as well. Thanks!

> Although it's theoretically possible to do that, the code is already
> difficult enough. The cost outweighs the potential benefit.
>
> If you'd like to have a go at implementing it, the code _is_ open
> source. :-)

Actually, the reason it's relevant to me is that I'm reimplementing
the re module using a more automata theoretic approach (it's my second
attack at the problem). Also, I've read the _sre source code and it's
unpleasant. Is regex much better?

At least the way I'm planning on going about it, supporting this is
easier, as long as one can figure out what it means to match halfway
inside a ß.

Since case folding is a homomorphism[*], I can case fold the regex[**]
and case fold the input and then I'm done. Case folding of the input
can be done character by character, and to emulate the regex module
behavior I'd need to check at certain places whether or not I'm in the
middle of a casefolding expansion, and fail if so.

On the other hand, if I don't emulate the regex module's behavior in
at least some cases, I'd need to figure out what the value of a match
of 's' against 'ß' would be.

[*] i.e. it can be done character by character (see Unicode 3.13
Default Case Algorithms)
[**] Not as trivial as it sounds, but still easy. [ßa-z] goes to e.g.
[a-z]|ss (not [ssa-z]).

-- Devin
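
The "fold both sides, then check for mid-expansion positions" idea from the post can be roughed out in a few lines. This is only a sketch under several assumptions, not the regex module's method or the planned automata-based engine: it uses the standard `re` module and `str.casefold()` (Python 3.3+), it expects the pattern to have been case folded by hand already, and the helper names `fold_with_offsets`, `on_boundary`, and `search_folded` are made up for illustration. It only checks the two ends of each candidate match, so unlike the rule MRAB describes it does not look at where individual repetitions fall, and it does not backtrack into a rejected match.

```python
import re


def fold_with_offsets(text):
    """Case fold text one character at a time.

    Returns the folded string plus, for every folded character, a pair
    (index of the source character, offset inside that character's expansion).
    """
    folded = []
    origin = []
    for i, ch in enumerate(text):
        for k, out in enumerate(ch.casefold()):
            folded.append(out)
            origin.append((i, k))
    return "".join(folded), origin


def on_boundary(origin, pos):
    """True if folded position pos sits between two source characters."""
    return pos == len(origin) or origin[pos][1] == 0


def search_folded(folded_pattern, text):
    """Return the (start, end) span in text of the first acceptable match.

    folded_pattern is assumed to be case folded already (hypothetical
    precondition). Matches that start or end in the middle of a
    case-folding expansion are rejected. Returns None if nothing survives.
    """
    folded, origin = fold_with_offsets(text)

    def to_source(pos):
        # Map a boundary position in the folded string back to the source.
        return len(text) if pos == len(origin) else origin[pos][0]

    for m in re.finditer(folded_pattern, folded):
        if on_boundary(origin, m.start()) and on_boundary(origin, m.end()):
            return to_source(m.start()), to_source(m.end())
    return None


print(search_folded("ss", "Maße"))  # span covering the single character 'ß'
print(search_folded("s", "ß"))      # rejected: would split the expansion
print(search_folded("s+", "ß"))     # accepted here, see the note below
```

With this end-points-only check, 's' fails against 'ß' as in MRAB's description, but 's+' succeeds because the whole expansion is consumed. Emulating the regex module's stricter behavior would need the boundary check inside the matcher itself, at each repetition, which is the "check at certain places" step mentioned in the post.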