

Groups > comp.lang.python > #39671 > unrolled thread

Correct handling of case in unicode and regexps

Started by: Devin Jeanpierre <jeanpierreda@gmail.com>
First post: 2013-02-23 09:26 -0500
Last post: 2013-02-24 11:28 -0800
Articles: 2 (2 participants)




#39671 — Correct handling of case in unicode and regexps

From: Devin Jeanpierre <jeanpierreda@gmail.com>
Date: 2013-02-23 09:26 -0500
Subject: Correct handling of case in unicode and regexps
Message-ID: <mailman.2347.1361629625.2939.python-list@python.org>

Hi folks,

I'm pretty unsure of myself when it comes to unicode. As I understand
it, you're generally supposed to compare things in a case insensitive
manner by case folding, right? So instead of a.lower() == b.lower()
(the ASCII way), you do a.casefold() == b.casefold()
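
A minimal sketch of the difference between the two comparisons, using only built-in str methods (the variable names lower_equal and folded_equal are illustrative, not from the post):

```python
# lower() leaves 'ß' unchanged, so the lowercase comparison fails.
# casefold() applies full Unicode case folding, which maps 'ß' to 'ss'.
a = 'ss'
b = 'ß'

lower_equal = a.lower() == b.lower()
folded_equal = a.casefold() == b.casefold()

print(lower_equal)   # False
print(folded_equal)  # True
```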

However, I'm struggling to figure out how regular expressions should
treat case. Python's re module doesn't "work properly" to my
understanding, because:

    >>> import re
    >>> a = 'ss'
    >>> b = 'ß'
    >>> a.casefold() == b.casefold()
    True
    >>> re.match(re.escape(a), b, re.UNICODE | re.IGNORECASE)
    >>> # oh dear!

In addition, it seems improbable that this ever _could_ work. Because
if it did work like that, then what would the value be of
re.match('s', 'ß', re.UNICODE | re.IGNORECASE).end() ? 0.5?

I'd really like to hear the thoughts of people more experienced with
unicode. What is the ideal correct behavior here? Or do I
misunderstand things?

-- Devin



#39776

From: jmfauth <wxjmfauth@gmail.com>
Date: 2013-02-24 11:28 -0800
Message-ID: <63ffb861-f38e-4b4b-ad56-21e5c1bdc6bd@g16g2000vbf.googlegroups.com>
In reply to: #39671

On 23 Feb, 15:26, Devin Jeanpierre <jeanpierr...@gmail.com> wrote:
> Hi folks,
>
> I'm pretty unsure of myself when it comes to unicode. As I understand
> it, you're generally supposed to compare things in a case insensitive
> manner by case folding, right? So instead of a.lower() == b.lower()
> (the ASCII way), you do a.casefold() == b.casefold()
>
> However, I'm struggling to figure out how regular expressions should
> treat case. Python's re module doesn't "work properly" to my
> understanding, because:
>
>     >>> a = 'ss'
>     >>> b = 'ß'
>     >>> a.casefold() == b.casefold()
>     True
>     >>> re.match(re.escape(a), b, re.UNICODE | re.IGNORECASE)
>     >>> # oh dear!
>
> In addition, it seems improbable that this ever _could_ work. Because
> if it did work like that, then what would the value be of
> re.match('s', 'ß', re.UNICODE | re.IGNORECASE).end() ? 0.5?
>
> I'd really like to hear the thoughts of people more experienced with
> unicode. What is the ideal correct behavior here? Or do I
> misunderstand things?

-----

I'm just wondering whether there is a real issue here. After all,
this is only a question of conventions. Unicode has its conventions,
and an re module may (indeed must) adopt some conventions of its own.

It seems to me that the safest way is to preprocess the text
that is to be examined.
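
One way such preprocessing could look, as a rough sketch only: case-fold both the search string and the text, then run an ordinary literal search. The helper name folded_search is hypothetical. Note that the match positions refer to the folded copy, not to the original text, which sidesteps rather than answers the `.end()` question.

```python
import re

def folded_search(needle, text):
    """Hypothetical helper: literal, case-insensitive search on folded copies.

    Both strings are case-folded first, so the returned match object
    describes positions in the folded text, not in the original text.
    """
    return re.search(re.escape(needle.casefold()), text.casefold())

original = 'Richard-Strauss-Straße'
m = folded_search('STRASSE', original)

print(m is not None)   # True
print(m.span())        # (16, 23), offsets into the folded text
print(len(original))   # 22, so the end offset lies past the original string
```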

Proposed case study:
How should ss/ß/SS/ẞ be interpreted?

'Richard-Strauss-Straße'
'Richard-Strauss-Strasse'
'RICHARD-STRAUSS-STRASSE'
'RICHARD-STRAUSS-STRAẞE'
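
For what it is worth, a quick check of how Python's str.casefold() treats these four spellings (a sketch; the names list and the folded set are illustrative). Under full case folding they all collapse to one string, which also means 'Strauss' and a hypothetical 'Strauß' would no longer be distinguishable. Whether that is desirable is exactly the question of convention.

```python
names = [
    'Richard-Strauss-Straße',
    'Richard-Strauss-Strasse',
    'RICHARD-STRAUSS-STRASSE',
    'RICHARD-STRAUSS-STRAẞE',
]

# Collect the distinct case-folded forms; ß and ẞ both fold to 'ss'.
folded = {name.casefold() for name in names}

print(folded)  # {'richard-strauss-strasse'}
```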


The situation with sorting is more or less the same.
Unicode cannot do everything, and it may be necessary to
preprocess the input.

E.g., here is a function I once wrote for fun. It sorts French
words (without unicodedata or locale).

>>> import libfrancais
>>> z = ['oeuf', 'œuf', 'od', 'of']
>>> zo = libfrancais.sortedfr(z)
>>> zo
['od', 'oeuf', 'œuf', 'of']
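
The source of libfrancais is not shown here, so the following is only a guess at the general idea, not the actual implementation: preprocess each word into a sort key by expanding the ligatures before comparing. The names LIGATURES, sort_key_fr and sortedfr_sketch are all hypothetical, and accented letters are deliberately not handled.

```python
# Hypothetical stand-in for libfrancais.sortedfr; NOT the original code.
# Assumption: expanding ligatures to letter pairs is enough for this example.
LIGATURES = str.maketrans({'œ': 'oe', 'Œ': 'OE', 'æ': 'ae', 'Æ': 'AE'})

def sort_key_fr(word):
    """Expand ligatures, then case-fold, to build a comparison key."""
    return word.translate(LIGATURES).casefold()

def sortedfr_sketch(words):
    """Sort by the preprocessed key; sorted() is stable, so words with
    equal keys ('oeuf' and 'œuf') keep their input order."""
    return sorted(words, key=sort_key_fr)

z = ['oeuf', 'œuf', 'od', 'of']
print(sortedfr_sketch(z))  # ['od', 'oeuf', 'œuf', 'of']
```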

jmf


