Date: Sat, 23 Feb 2013 16:11:36 +0100
Subject: Re: Correct handling of case in unicode and regexps
From: Vlastimil Brom <vlastimil.brom@gmail.com>
To: Devin Jeanpierre <jeanpierreda@gmail.com>
Cc: "comp.lang.python" <python-list@python.org>
Newsgroups: comp.lang.python
Message-ID: <mailman.2348.1361632298.2939.python-list@python.org>

2013/2/23 Devin Jeanpierre <jeanpierreda@gmail.com>:
> Hi folks,
>
> I'm pretty unsure of myself when it comes to unicode. As I understand
> it, you're generally supposed to compare things in a case insensitive
> manner by case folding, right? So instead of a.lower() == b.lower()
> (the ASCII way), you do a.casefold() == b.casefold()
>
> However, I'm struggling to figure out how regular expressions should
> treat case. Python's re module doesn't "work properly" to my
> understanding, because:
>
>     >>> import re
>     >>> a = 'ss'
>     >>> b = 'ß'
>     >>> a.casefold() == b.casefold()
>     True
>     >>> re.match(re.escape(a), b, re.UNICODE | re.IGNORECASE)
>     >>> # oh dear!
>
> In addition, it seems improbable that this ever _could_ work. Because
> if it did work like that, then what would the value be of
> re.match('s', 'ß', re.UNICODE | re.IGNORECASE).end() ? 0.5?
>
> I'd really like to hear the thoughts of people more experienced with
> unicode. What is the ideal correct behavior here? Or do I
> misunderstand things?
>
> -- Devin

Hi,
you may want to check the new regex implementation
https://pypi.python.org/pypi/regex
which supports case folding in case-insensitive matches (besides
many other features and improvements compared to re).
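
A minimal sketch of the difference, assuming the third-party regex
package is installed and that its FULLCASE flag is combined with
IGNORECASE to request full case folding. The variable names are only
for illustration, and the regex part is skipped if the package is
missing:

```python
import re

try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None

a = "ss"
b = "\N{LATIN SMALL LETTER SHARP S}"  # the character 'ß'

# Full case folding maps 'ß' to 'ss', so the folded strings compare equal.
folded_equal = a.casefold() == b.casefold()

# The standard re module does not apply full case folding, so the
# two-character pattern 'ss' cannot match the single character 'ß'.
re_match = re.match(re.escape(a), b, re.IGNORECASE)

# Assumption: with regex, FULLCASE makes IGNORECASE use full case folding.
regex_match = None
if regex is not None:
    regex_match = regex.match(regex.escape(a), b,
                              regex.IGNORECASE | regex.FULLCASE)

print("casefold equal:", folded_equal)
print("re match:", re_match)
if regex_match is not None:
    print("regex span:", regex_match.span())
```

If regex is available, the last line shows which part of the subject
string the match covers. That is the module's answer to the .end()
question for the 'ss' pattern.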

hth,
  vbr