From: Devin Jeanpierre <jeanpierreda@gmail.com>
Date: Sat, 23 Feb 2013 09:26:17 -0500
Subject: Correct handling of case in unicode and regexps
To: "comp.lang.python" <python-list@python.org>
Newsgroups: comp.lang.python
Message-ID: <mailman.2347.1361629625.2939.python-list@python.org>

Hi folks,

I'm pretty unsure of myself when it comes to Unicode. As I understand
it, you're generally supposed to compare things in a case-insensitive
manner by case folding, right? So instead of a.lower() == b.lower()
(the ASCII way), you do a.casefold() == b.casefold().
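
To make the difference concrete, here is a minimal sketch comparing the two approaches on the German sharp s. The variable names lower_equal and fold_equal are only for illustration, and str.casefold() requires Python 3.3 or later.

```python
# Compare lower() with casefold() on 'ss' and the German sharp s.
a = 'ss'
b = 'ß'

# lower() leaves 'ß' unchanged, so the lowercased strings differ.
lower_equal = a.lower() == b.lower()

# casefold() expands 'ß' to 'ss', so the folded strings are equal.
fold_equal = a.casefold() == b.casefold()

print(lower_equal, fold_equal)
```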

However, I'm struggling to figure out how regular expressions should
treat case. Python's re module doesn't "work properly" to my
understanding, because:

    >>> import re
    >>> a = 'ss'
    >>> b = 'ß'
    >>> a.casefold() == b.casefold()
    True
    >>> re.match(re.escape(a), b, re.UNICODE | re.IGNORECASE)
    >>> # oh dear! (no match: the call returned None)

In addition, it seems improbable that this ever _could_ work, because
if it did work like that, then what would be the value of
re.match('s', 'ß', re.UNICODE | re.IGNORECASE).end()? 0.5?
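
One way to see the offset problem is to case fold both sides by hand before matching. This is only a sketch under assumptions: casefold_match is a hypothetical helper, not part of the re module, it only handles literal patterns, and it is not necessarily the "ideal" behavior. It does show that the reported offsets then refer to the folded string rather than the original one.

```python
import re

def casefold_match(literal, text):
    """Hypothetical helper: match a literal against text after case folding both."""
    return re.match(re.escape(literal.casefold()), text.casefold())

# 'ß' folds to 'ss', so the folded strings match in full.
m = casefold_match('ss', 'ß')
full_end = m.end()        # offset into the folded text 'ss', not into 'ß'

# 's' matches only the first half of the folded 'ß'.
partial = casefold_match('s', 'ß')
partial_end = partial.end()

# The original text has one character, the folded text has two.
print(len('ß'), full_end, partial_end)
```

Here an end offset of 1 in the two-character folded text corresponds to "halfway through" the single original character, which is the 0.5 asked about above.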

I'd really like to hear the thoughts of people more experienced with
Unicode. What is the ideal, correct behavior here? Or do I
misunderstand things?

-- Devin