Groups > comp.lang.python > #39671 > unrolled thread
| Started by | Devin Jeanpierre <jeanpierreda@gmail.com> |
|---|---|
| First post | 2013-02-23 09:26 -0500 |
| Last post | 2013-02-24 11:28 -0800 |
| Articles | 2 — 2 participants |
Correct handling of case in unicode and regexps Devin Jeanpierre <jeanpierreda@gmail.com> - 2013-02-23 09:26 -0500
Re: Correct handling of case in unicode and regexps jmfauth <wxjmfauth@gmail.com> - 2013-02-24 11:28 -0800
| From | Devin Jeanpierre <jeanpierreda@gmail.com> |
|---|---|
| Date | 2013-02-23 09:26 -0500 |
| Subject | Correct handling of case in unicode and regexps |
| Message-ID | <mailman.2347.1361629625.2939.python-list@python.org> |
Hi folks,
I'm pretty unsure of myself when it comes to unicode. As I understand
it, you're generally supposed to compare things in a case insensitive
manner by case folding, right? So instead of a.lower() == b.lower()
(the ASCII way), you do a.casefold() == b.casefold()
However, I'm struggling to figure out how regular expressions should
treat case. Python's re module doesn't "work properly" to my
understanding, because:
>>> import re
>>> a = 'ss'
>>> b = 'ß'
>>> a.casefold() == b.casefold()
True
>>> re.match(re.escape(a), b, re.UNICODE | re.IGNORECASE)
>>> # oh dear!
In addition, it seems improbable that this ever _could_ work. Because
if it did work like that, then what would the value be of
re.match('s', 'ß', re.UNICODE | re.IGNORECASE).end() ? 0.5?
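A minimal sketch of why the offsets are the sticking point, assuming Python 3.3+ (where str.casefold() exists). The example word and the variable names are illustrative only. Case folding is not length-preserving, so a match found in the folded text reports positions that need not exist in the original string:

```python
import re

original = 'Straße'
folded = original.casefold()   # full case folding expands 'ß' to 'ss'

# One character became two, so the folded string is longer.
print(len(original), len(folded))

# Offsets reported here index into `folded`, not into `original`.
m = re.search('sse', folded)
print(m.span())
print(m.end() > len(original))
```

Here the match ends at offset 7 in the folded text, while the original string has only 6 characters, so the offset cannot be reported back against the original without some extra mapping.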
I'd really like to hear the thoughts of people more experienced with
unicode. What is the ideal correct behavior here? Or do I
misunderstand things?
-- Devin
| From | jmfauth <wxjmfauth@gmail.com> |
|---|---|
| Date | 2013-02-24 11:28 -0800 |
| Message-ID | <63ffb861-f38e-4b4b-ad56-21e5c1bdc6bd@g16g2000vbf.googlegroups.com> |
| In reply to | #39671 |
On 23 Feb, 15:26, Devin Jeanpierre <jeanpierr...@gmail.com> wrote:
> Hi folks,
>
> I'm pretty unsure of myself when it comes to unicode. As I understand
> it, you're generally supposed to compare things in a case insensitive
> manner by case folding, right? So instead of a.lower() == b.lower()
> (the ASCII way), you do a.casefold() == b.casefold()
>
> However, I'm struggling to figure out how regular expressions should
> treat case. Python's re module doesn't "work properly" to my
> understanding, because:
>
> >>> a = 'ss'
> >>> b = 'ß'
> >>> a.casefold() == b.casefold()
> True
> >>> re.match(re.escape(a), b, re.UNICODE | re.IGNORECASE)
> >>> # oh dear!
>
> In addition, it seems improbable that this ever _could_ work. Because
> if it did work like that, then what would the value be of
> re.match('s', 'ß', re.UNICODE | re.IGNORECASE).end() ? 0.5?
>
> I'd really like to hear the thoughts of people more experienced with
> unicode. What is the ideal correct behavior here? Or do I
> misunderstand things?
-----
I'm just wondering whether there is a real issue here. After all,
this is only a question of conventions. Unicode has some
conventions, and re modules may (have to) use some conventions too.
It seems to me that the safest way is to preprocess the text
that has to be examined.
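One possible reading of "preprocess the text", sketched under the assumption that the pattern is a plain literal rather than a general regular expression. The helper name fold_match is hypothetical. Both sides are case folded first, and only then handed to re:

```python
import re

def fold_match(literal, text):
    """Hypothetical helper: match a literal at the start of text,
    comparing the case-folded forms of both strings."""
    # Only literals are safe to fold: folding a real regex would
    # also change escapes such as \S into \s.
    pattern = re.escape(literal.casefold())
    return re.match(pattern, text.casefold())

m = fold_match('ss', 'ß')
print(m is not None)
print(m.span())   # offsets refer to the folded text, not to 'ß'
```

With this convention 'ss' matches 'ß', but the reported span (0, 2) describes the folded string, which is the price of doing the folding outside the regex engine.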
Proposed case study:
How should ss/ß/SS/ẞ be interpreted?
'Richard-Strauss-Straße'
'Richard-Strauss-Strasse'
'RICHARD-STRAUSS-STRASSE'
'RICHARD-STRAUSS-STRAẞE'
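For reference, a small sketch of what full case folding does with these four spellings, assuming Python 3.3+ and str.casefold(). Whether treating them as identical is the right convention for a given application is exactly the open question:

```python
names = [
    'Richard-Strauss-Straße',
    'Richard-Strauss-Strasse',
    'RICHARD-STRAUSS-STRASSE',
    'RICHARD-STRAUSS-STRAẞE',
]

# Collect the distinct case-folded forms of the four spellings.
folded = {name.casefold() for name in names}
print(folded)
```

All four collapse to the single form 'richard-strauss-strasse', so under this convention the "ss" in "Strauss" and the "ß" in "Straße" can no longer be told apart.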
The situation is more or less the same with sorting.
Unicode cannot do everything, and it may be necessary to
preprocess the input.
For example, here is a function I once wrote for fun. It sorts French
words (without unicodedata or locale).
>>> import libfrancais
>>> z = ['oeuf', 'œuf', 'od', 'of']
>>> zo = libfrancais.sortedfr(z)
>>> zo
['od', 'oeuf', 'œuf', 'of']
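The source of libfrancais is not shown, so the following is only a guess at the general idea, not the actual implementation. It assumes that preprocessing means expanding the ligatures œ and æ before comparing; the names LIGATURES and sorted_fr_sketch are hypothetical, and accented letters are not handled at all:

```python
# Hypothetical ligature table: expand before comparing.
LIGATURES = {'œ': 'oe', 'Œ': 'OE', 'æ': 'ae', 'Æ': 'AE'}

def sorted_fr_sketch(words):
    """Sort words by a preprocessed key (ligatures expanded, case folded)."""
    def key(word):
        expanded = ''.join(LIGATURES.get(ch, ch) for ch in word)
        # The original word is the tie-breaker, so 'oeuf' precedes 'œuf'.
        return (expanded.casefold(), word)
    return sorted(words, key=key)

z = ['oeuf', 'œuf', 'od', 'of']
print(sorted(z))            # plain code point order puts 'œuf' last
print(sorted_fr_sketch(z))
```

Plain sorted() orders by code point and places 'œuf' after 'of', while the preprocessed key reproduces the order shown above for this input.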
jmf