From: MRAB
Date: Thu, 15 Sep 2011 15:06:08 +0100
Subject: Re: Turkic I and re
To: python-list@python.org
Newsgroups: comp.lang.python
References: <4E71F763.2010109@mrabarnett.plus.com> <4E71FA9F.8090702@alanplum.com>

On 15/09/2011 14:44, John-John Tedro wrote:
> On Thu, Sep 15, 2011 at 1:16 PM, Alan Plum wrote:
>
> > On 2011-09-15 15:02, MRAB wrote:
> >
> > > The regex module at http://pypi.python.org/pypi/regex currently
> > > uses a compromise, where it matches 'I' with 'i' and also 'I'
> > > with 'ı' and 'İ' with 'i'.
> > >
> > > I was wondering if it would be preferable to have a TURKIC flag
> > > instead ("(?T)" or "(?T:...)" in the pattern).
> >
> > I think the problem many people ignore when coming up with solutions
> > like this is that while this behaviour is pretty much unique for
> > Turkish script, there is no guarantee that Turkish substrings won't
> > appear in other language strings (or vice versa).
> >
> > For example, foreign names in Turkish are often given as spelled in
> > their native (non-Turkish) script variants. Likewise, Turkish names
> > in other languages are often given as spelled in Turkish.
> >
> > The Turkish 'I' is a peculiarity that will probably haunt us
> > programmers until hell freezes over. Unless Turkey abandons its
> > traditional orthography or people start speaking only a single
> > language at a time (including names), there's no easy way to deal
> > with this.
> >
> > In other words: the only way to make use of your proposed flag is if
> > you have a fully language-tagged input (e.g. an XML document making
> > extensive use of xml:lang) and only ever apply regular expressions
> > to substrings containing one culture at a time.
>
> Python does not appear to support special cases mapping, in effect, it
> is not 100% compliant with the unicode standard.
>
> The locale specific 'i' casing in Turkic is mentioned in 5.18 (Case
> Mappings) of the unicode standard.
> http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf#G21180
>
> AFAIK, the case methods of python strings seems to be built around the
> assumption that len("string") == len("string".upper()), but some of
> these casing rules require that the string grow. Like uppercasing of the
> german sharp s "ß" which should be translated to the expanded string "SS".
> These special cases should be triggered on specific locales, but I have
> not been able to verify that the Turkic uppercasing of "i" works on
> either python 2.6, 2.7 or 3.1:
>
>     locale.setlocale(locale.LC_ALL, "tr_TR.utf8")  # warning, requires turkish locale on your system.
>     ord("i".upper()) == 0x130  # is False for me, but should be True
>
> I wouldn't be surprised if these issues are translated into the 're' module.
>
There has been some discussion on the Python-dev list about improving
Unicode support in Python 3.

It's somewhat unlikely that Unicode case mapping will become
locale-dependent in Python, because that would cause problems; you don't
want:

    "i".upper() == "I"

to be maybe true, maybe false.

An option would be to specify explicitly whether it should be
locale-dependent.

> The only support appears to be 'L' switch, but it only makes "\w, \W,
> \b, \B, \s and \S dependent on the current locale".

That flag is for locale-dependent 8-bit encodings. The ASCII (Python 3),
LOCALE and UNICODE flags are mutually exclusive.

> Which probably does not yield to the special rules mentioned above, but
> I could be wrong. Make sure that your locale is correct and test again.
>
> If you are unsuccessful, I don't see a 'Turkic flag' being introduced
> into re module any time soon, given the following from PEP 20
> "Special cases aren't special enough to break the rules"
>
That's why I'm interested in the view of Turkish users. The rest of us
will probably never have to worry about it! :-)

(There's a report in the Python bug tracker about this issue, which is
why the regex module has the compromise.)
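
For illustration, here is a minimal sketch of the "specify whether it should be locale-dependent" option mentioned above. It assumes Python 3.3 or later, where str.upper() and str.lower() apply the full, locale-independent Unicode case mappings. The helper names turkic_upper and turkic_lower are hypothetical; they are not part of Python or of the regex module, and this is not how the regex module implements its compromise.

```python
# Hypothetical helpers: explicit, opt-in Turkic casing layered on top of
# the default locale-independent str methods. Assumes Python 3.3+.

def turkic_upper(s):
    # Map the two Turkic-specific pairs first (i -> U+0130, U+0131 -> I),
    # then let str.upper() handle every other character.
    return s.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkic_lower(s):
    # Map U+0130 -> i and I -> U+0131 before the default lowercasing.
    # A decomposed "I" + U+0307 sequence is not handled in this sketch.
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()


# The default methods give the same answer whatever the locale is:
print("i".upper() == "I")           # True
print("ß".upper())                  # SS (the string grows)

# The Turkic rules apply only when asked for explicitly:
print(turkic_upper("diyarbakır"))   # DİYARBAKIR
print(turkic_lower("IĞDIR"))        # ığdır
```

The caller decides per string which rules apply, so "i".upper() == "I" stays true everywhere. That still leaves the mixed-language problem raised earlier in the thread: the caller has to know which substrings are Turkish.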