
Correct handling of case in unicode and regexps

Started by	Devin Jeanpierre <jeanpierreda@gmail.com>
First post	2013-02-23 09:26 -0500
Last post	2013-02-24 11:28 -0800
Articles	2 — 2 participants



#39671 — Correct handling of case in unicode and regexps

From	Devin Jeanpierre <jeanpierreda@gmail.com>
Date	2013-02-23 09:26 -0500
Subject	Correct handling of case in unicode and regexps
Message-ID	<mailman.2347.1361629625.2939.python-list@python.org>

Hi folks,

I'm pretty unsure of myself when it comes to unicode. As I understand
it, you're generally supposed to compare things in a case insensitive
manner by case folding, right? So instead of a.lower() == b.lower()
(the ASCII way), you do a.casefold() == b.casefold()
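
A minimal sketch of the difference, using 'ss' and 'ß'. The names
lower_equal and folded_equal are only illustrative:

```python
# Compare lower() and casefold() on the German sharp s.
a = 'ss'
b = 'ß'

# lower() leaves 'ß' unchanged, so the two strings still differ.
lower_equal = a.lower() == b.lower()

# casefold() maps 'ß' to 'ss', so the two strings compare equal.
folded_equal = a.casefold() == b.casefold()

print(lower_equal, folded_equal)  # False True
```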

However, I'm struggling to figure out how regular expressions should
treat case. Python's re module doesn't "work properly" to my
understanding, because:

    >>> import re
    >>> a = 'ss'
    >>> b = 'ß'
    >>> a.casefold() == b.casefold()
    True
    >>> re.match(re.escape(a), b, re.UNICODE | re.IGNORECASE)
    >>> # oh dear!

In addition, it seems improbable that this ever _could_ work. Because
if it did work like that, then what would the value be of
re.match('s', 'ß', re.UNICODE | re.IGNORECASE).end() ? 0.5?

I'd really like to hear the thoughts of people more experienced with
unicode. What is the ideal correct behavior here? Or do I
misunderstand things?

-- Devin


#39776

From	jmfauth <wxjmfauth@gmail.com>
Date	2013-02-24 11:28 -0800
Message-ID	<63ffb861-f38e-4b4b-ad56-21e5c1bdc6bd@g16g2000vbf.googlegroups.com>
In reply to	#39671

On 23 Feb, 15:26, Devin Jeanpierre <jeanpierr...@gmail.com> wrote:
> Hi folks,
>
> I'm pretty unsure of myself when it comes to unicode. As I understand
> it, you're generally supposed to compare things in a case insensitive
> manner by case folding, right? So instead of a.lower() == b.lower()
> (the ASCII way), you do a.casefold() == b.casefold()
>
> However, I'm struggling to figure out how regular expressions should
> treat case. Python's re module doesn't "work properly" to my
> understanding, because:
>
>     >>> a = 'ss'
>     >>> b = 'ß'
>     >>> a.casefold() == b.casefold()
>     True
>     >>> re.match(re.escape(a), b, re.UNICODE | re.IGNORECASE)
>     >>> # oh dear!
>
> In addition, it seems improbable that this ever _could_ work. Because
> if it did work like that, then what would the value be of
> re.match('s', 'ß', re.UNICODE | re.IGNORECASE).end() ? 0.5?
>
> I'd really like to hear the thoughts of people more experienced with
> unicode. What is the ideal correct behavior here? Or do I
> misunderstand things?

-----

I wonder whether there is a real issue here. After all,
this is only a question of conventions. Unicode has its
conventions, and a regex module may (indeed has to) adopt some too.

It seems to me that the safest way is to preprocess the text
that has to be examined.
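
A minimal sketch of that kind of preprocessing, assuming the pattern is
a plain literal and that it is acceptable to case fold both the pattern
and the text with str.casefold() before handing them to re. The helper
name folded_search is hypothetical, not part of the re module:

```python
import re

def folded_search(literal, text):
    """Search for a literal in text after case folding both sides.

    Hypothetical helper. The returned match object refers to positions
    in the folded text, not in the original text.
    """
    pattern = re.escape(literal.casefold())
    return re.search(pattern, text.casefold())

m = folded_search('ss', 'ß')
print(m is not None)  # True: the folded text is 'ss'
print(m.span())       # (0, 2): offsets into the folded text
```

The offsets cannot be used directly to slice the original input, because
'ß' is one character before folding and two characters after it.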

Proposed case study:
How should ss/ß/SS/ẞ be interpreted?

'Richard-Strauss-Straße'
'Richard-Strauss-Strasse'
'RICHARD-STRAUSS-STRASSE'
'RICHARD-STRAUSS-STRAẞE'
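
As a rough illustration of one possible convention, the sketch below
applies str.lower() and str.casefold() to these four spellings and
collects the distinct results. The variable names are only illustrative,
and case folding is just one convention among several:

```python
names = [
    'Richard-Strauss-Straße',
    'Richard-Strauss-Strasse',
    'RICHARD-STRAUSS-STRASSE',
    'RICHARD-STRAUSS-STRAẞE',
]

# lower() keeps 'ß' and maps 'ẞ' to 'ß', so two spellings remain.
lowered = {name.lower() for name in names}

# casefold() maps both 'ß' and 'ẞ' to 'ss', so one spelling remains.
folded = {name.casefold() for name in names}

print(len(lowered), len(folded))  # 2 1
```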


The situation is more or less the same with sorting.
Unicode cannot do everything, and it may be necessary to
preprocess the "input".

E.g., here is a function I once wrote for fun. It sorts French
words (without unicodedata or locale).

>>> import libfrancais
>>> z = ['oeuf', 'œuf', 'od', 'of']
>>> zo = libfrancais.sortedfr(z)
>>> zo
['od', 'oeuf', 'œuf', 'of']
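
libfrancais is not shown here, so the following is only a guess at the
general idea, not its implementation: a sort key that expands the
ligature 'œ' to 'oe' before comparing. The name french_key is
hypothetical, and real French collation involves more rules than this:

```python
def french_key(word):
    """Hypothetical sort key: treat the ligature 'œ' as 'oe'."""
    return word.casefold().replace('œ', 'oe')

z = ['oeuf', 'œuf', 'od', 'of']

# sorted() is stable, so 'oeuf' and 'œuf' keep their input order
# because their keys are equal.
zo = sorted(z, key=french_key)
print(zo)  # ['od', 'oeuf', 'œuf', 'of']
```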

jmf

