Date: Thu, 15 Sep 2011 15:06:08 +0100
From: MRAB <python@mrabarnett.plus.com>
To: python-list@python.org
Subject: Re: Turkic I and re
References: <4E71F763.2010109@mrabarnett.plus.com> <4E71FA9F.8090702@alanplum.com> <CAJPn68QhTcscQ93d5mjfhw-nHdC8H5bmNYdp=SXj41dH0qjGkA@mail.gmail.com>
In-Reply-To: <CAJPn68QhTcscQ93d5mjfhw-nHdC8H5bmNYdp=SXj41dH0qjGkA@mail.gmail.com>
Newsgroups: comp.lang.python
Message-ID: <mailman.1166.1316095568.27778.python-list@python.org>

On 15/09/2011 14:44, John-John Tedro wrote:
> On Thu, Sep 15, 2011 at 1:16 PM, Alan Plum <me@alanplum.com> wrote:
>
>     On 2011-09-15 15:02, MRAB wrote:
>
>         The regex module at http://pypi.python.org/pypi/regex currently
>         uses a compromise, where it matches 'I' with 'i' and also 'I'
>         with 'ı' and 'İ' with 'i'.
>
>         I was wondering if it would be preferable to have a TURKIC flag
>         instead ("(?T)" or "(?T:...)" in the pattern).
>
>
>     I think the problem many people ignore when coming up with solutions
>     like this is that while this behaviour is pretty much unique for
>     Turkish script, there is no guarantee that Turkish substrings won't
>     appear in other language strings (or vice versa).
>
>     For example, foreign names in Turkish are often given as spelled in
>     their native (non-Turkish) script variants. Likewise, Turkish names
>     in other languages are often given as spelled in Turkish.
>
>     The Turkish 'I' is a peculiarity that will probably haunt us
>     programmers until hell freezes over. Unless Turkey abandons its
>     traditional orthography or people start speaking only a single
>     language at a time (including names), there's no easy way to deal
>     with this.
>
>     In other words: the only way to make use of your proposed flag is if
>     you have a fully language-tagged input (e.g. an XML document making
>     extensive use of xml:lang) and only ever apply regular expressions
>     to substrings containing one culture at a time.
>
>
>
> Python does not appear to support special case mappings; in effect, it
> is not 100% compliant with the Unicode standard.
>
> The locale-specific 'i' casing in Turkic is mentioned in section 5.18
> (Case Mappings) of the Unicode standard:
> http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf#G21180
>
> AFAIK, the case methods of Python strings seem to be built around the
> assumption that len("string") == len("string".upper()), but some of
> these casing rules require that the string grow, like uppercasing the
> German sharp s "ß", which should become the expanded string "SS".
> These special cases should be triggered in specific locales, but I have
> not been able to verify that the Turkic uppercasing of "i" works on
> any of Python 2.6, 2.7 or 3.1:
>
>    import locale
>    # Warning: requires the Turkish locale to be installed on your system.
>    locale.setlocale(locale.LC_ALL, "tr_TR.utf8")
>    ord("i".upper()) == 0x130 # is False for me, but should be True
>
> I wouldn't be surprised if these issues carry over into the 're' module.
>
There has been some discussion on the Python-dev list about improving
Unicode support in Python 3.

It's somewhat unlikely that Unicode case mapping will become
locale-dependent in Python because it would cause problems; you don't
want:

     "i".upper() == "I"

to be maybe true, maybe false.

An option would be to let the caller specify explicitly whether the
case mapping should be locale-dependent.
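
As a minimal sketch of that idea (the helper names and the "turkic"
parameter are hypothetical, not part of Python or of the regex module),
the caller asks for Turkic rules explicitly and gets the usual
locale-independent mapping otherwise:

```python
DOTTED_CAPITAL_I = "\u0130"   # İ LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # ı LATIN SMALL LETTER DOTLESS I


def upper(text, turkic=False):
    """Upper-case text; with turkic=True, 'i' becomes 'İ' instead of 'I'."""
    if turkic:
        # Only 'i' needs special handling: str.upper() already maps 'ı' to 'I'.
        text = text.replace("i", DOTTED_CAPITAL_I)
    return text.upper()


def lower(text, turkic=False):
    """Lower-case text; with turkic=True, 'İ' becomes 'i' and 'I' becomes 'ı'."""
    if turkic:
        # Assumption: decomposed input ('I' followed by U+0307) is not handled.
        text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()
```

The result then depends only on the arguments, never on the process-wide
locale, so "i".upper() == "I" stays reliably true.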

> The only support appears to be the 'L' switch, but it only makes "\w, \W,
> \b, \B, \s and \S dependent on the current locale".

That flag is for locale-dependent 8-bit encodings. The ASCII (Python
3), LOCALE and UNICODE flags are mutually exclusive.
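
A quick way to see this with the standard re module, assuming a recent
CPython 3 (the try_compile helper is only for illustration; older 3.x
releases were more lenient about LOCALE with str patterns):

```python
import re


def try_compile(pattern, flags):
    """Return 'ok' if the pattern compiles with these flags, else 'ValueError'."""
    try:
        re.compile(pattern, flags)
        return "ok"
    except ValueError:
        return "ValueError"


# str patterns are Unicode by default; ASCII opts out, and combining both fails.
print(try_compile(r"\w+", re.ASCII))
print(try_compile(r"\w+", re.ASCII | re.UNICODE))

# LOCALE is meant for bytes patterns in 8-bit encodings.
print(try_compile(br"\w+", re.LOCALE))
print(try_compile(br"\w+", re.LOCALE | re.ASCII))
print(try_compile(r"\w+", re.LOCALE))
```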

> That probably does not cover the special rules mentioned above, but
> I could be wrong. Make sure that your locale is correct and test again.
>
> If you are unsuccessful, I don't see a 'Turkic flag' being introduced
> into the re module any time soon, given the following from PEP 20:
> "Special cases aren't special enough to break the rules"
>
That's why I'm interested in the view of Turkish users. The rest of us
will probably never have to worry about it! :-)

(There's a report in the Python bug tracker about this issue, which is
why the regex module has the compromise.)
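
To make the compromise concrete, here is a rough illustration of the
pairings described earlier in the thread. It is only a sketch: the names
are made up for this example and it is not the regex module's actual
implementation.

```python
# Hypothetical illustration of the compromise; not the regex module's code.
COMPROMISE_PAIRS = {
    frozenset({"I", "i"}),        # the ordinary pair
    frozenset({"I", "\u0131"}),   # 'I' with dotless 'ı'
    frozenset({"\u0130", "i"}),   # dotted 'İ' with 'i'
}


def chars_match_ignoring_case(a, b):
    """Compare two single characters case-insensitively under the compromise."""
    if a == b or frozenset({a, b}) in COMPROMISE_PAIRS:
        return True
    # Everything else falls back to the locale-independent lower-casing.
    return a.lower() == b.lower()
```

Note that 'i' and 'ı' still don't match each other here, and neither do
'İ' and 'ı'; only the three pairings listed above are accepted.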