
Re: Turkic I and re

Started by	Alan Plum <me@alanplum.com>
First post	2011-09-15 15:16 +0200
Last post	2011-09-16 17:25 +1000
Articles	3 — 3 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: Turkic I and re Alan Plum <me@alanplum.com> - 2011-09-15 15:16 +0200
    Re: Turkic I and re Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2011-09-16 09:01 +0200
      Re: Turkic I and re Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-09-16 17:25 +1000

#13321 — Re: Turkic I and re

From	Alan Plum <me@alanplum.com>
Date	2011-09-15 15:16 +0200
Subject	Re: Turkic I and re
Message-ID	<mailman.1163.1316092594.27778.python-list@python.org>

On 2011-09-15 15:02, MRAB wrote:
> The regex module at http://pypi.python.org/pypi/regex currently uses a
> compromise, where it matches 'I' with 'i' and also 'I' with 'ı' and 'İ'
> with 'i'.
>
> I was wondering if it would be preferable to have a TURKIC flag instead
> ("(?T)" or "(?T:...)" in the pattern).

I think the problem many people ignore when coming up with solutions
like this is that, while this behaviour is pretty much unique to Turkish
script, there is no guarantee that Turkish substrings won't appear in
strings in other languages (or vice versa).
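
For illustration, the compromise quoted above can be approximated with the standard-library re module by spelling out the character classes by hand. This is only a sketch: loose_literal and _COMPROMISE are made-up names, and this is not how the regex module implements its matching.

```python
import re

# Hand-written equivalents for the compromise quoted above:
# 'I' with 'i', 'I' with dotless 'ı' (U+0131), and dotted 'İ' (U+0130) with 'i'.
_COMPROMISE = {
    "I": "Ii\u0131",
    "i": "iI\u0130",
    "\u0131": "\u0131I",
    "\u0130": "\u0130i",
}


def loose_literal(text):
    """Build a pattern matching `text` case-insensitively, i-variants included."""
    parts = []
    for ch in text:
        if ch in _COMPROMISE:
            alternatives = _COMPROMISE[ch]
        else:
            # Keep only single-character case variants (skips e.g. 'ß' -> 'SS').
            variants = {ch, ch.lower(), ch.upper()}
            alternatives = "".join(sorted(v for v in variants if len(v) == 1))
        if len(alternatives) == 1:
            parts.append(re.escape(alternatives))
        else:
            parts.append("[" + "".join(re.escape(c) for c in alternatives) + "]")
    return "".join(parts)


if __name__ == "__main__":
    pattern = loose_literal("DIYARBAKIR")
    print(bool(re.fullmatch(pattern, "diyarbak\u0131r")))
```

Because the pattern matches both readings at once, it cannot tell a Turkish 'I' from a non-Turkish one, which is exactly the ambiguity at issue.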

For example, foreign names in Turkish are often given as spelled in 
their native (non-Turkish) script variants. Likewise, Turkish names in 
other languages are often given as spelled in Turkish.

The Turkish 'I' is a peculiarity that will probably haunt us programmers 
until hell freezes over. Unless Turkey abandons its traditional 
orthography or people start speaking only a single language at a time 
(including names), there's no easy way to deal with this.

In other words: the only way to make use of your proposed flag is if you 
have a fully language-tagged input (e.g. an XML document making 
extensive use of xml:lang) and only ever apply regular expressions to 
substrings containing one culture at a time.
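
To make that concrete, here is a minimal sketch of what folding on language-tagged substrings might look like. The helper names (fold, equal_ignoring_case) and the set of language codes are assumptions for the example, not part of any library.

```python
# Assumed set of language tags that use the dotted/dotless i distinction.
TURKIC_LANGS = {"tr", "az"}

# Turkic lowercasing: 'I' -> dotless 'ı' (U+0131), dotted 'İ' (U+0130) -> 'i'.
_TURKIC_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})


def fold(text, lang):
    """Case-fold `text`, applying the Turkic i mapping first when `lang` calls for it."""
    if lang in TURKIC_LANGS:
        text = text.translate(_TURKIC_LOWER)
    return text.casefold()


def equal_ignoring_case(a, b, lang):
    """Compare two substrings that are both tagged with the same language."""
    return fold(a, lang) == fold(b, lang)


if __name__ == "__main__":
    print(equal_ignoring_case("ISPARTA", "\u0131sparta", "tr"))
    print(equal_ignoring_case("ISPARTA", "\u0131sparta", "en"))
```

The lang argument is the whole point: without a reliable tag per substring, there is no way to pick the right branch.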


#13367

From	Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de>
Date	2011-09-16 09:01 +0200
Message-ID	<j4usl8$jel$1@r03.glglgl.eu>
In reply to	#13321

On 15.09.2011 15:16, Alan Plum wrote:

> The Turkish 'I' is a peculiarity that will probably haunt us programmers
> until hell freezes over.

That's why it would have been nice if the Unicode guys had defined "both 
Turkish i-s" at separate codepoints.

Then one could have the three pairs
I, i ("normal")
I (other one), ı

and

İ, i (the other one).

But alas, they haven't.
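
For reference, a small sketch that lists the four code points that do exist, together with the language-neutral case mappings Python 3 applies to them. case_table is a hypothetical helper name; the names and mappings come from the unicodedata module and the str methods.

```python
import unicodedata

# The four code points involved: two shared with every other Latin-script
# language, plus dotted capital I (U+0130) and dotless small i (U+0131).
I_LETTERS = ("I", "i", "\u0130", "\u0131")


def case_table(letters=I_LETTERS):
    """Return (code point, Unicode name, lower(), upper()) for each letter."""
    rows = []
    for ch in letters:
        rows.append(
            ("U+%04X" % ord(ch), unicodedata.name(ch), ch.lower(), ch.upper())
        )
    return rows


if __name__ == "__main__":
    for codepoint, name, lower, upper in case_table():
        print(codepoint, name, ascii(lower), ascii(upper))
```

The table shows the asymmetry: dotless ı uppercases to the ordinary 'I', and dotted İ lowercases to 'i' followed by a combining dot above (U+0307), so neither round-trips under the default mappings.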


Thomas


#13369

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-09-16 17:25 +1000
Message-ID	<4e72f9f4$0$30003$c3e8da3$5496439d@news.astraweb.com>
In reply to	#13367

Thomas Rachel wrote:

> On 15.09.2011 15:16, Alan Plum wrote:
> 
>> The Turkish 'I' is a peculiarity that will probably haunt us programmers
>> until hell freezes over.

Meh, I don't think it's much more peculiar than any other diacritic issue.
If I'm German or English, I probably want ö and O to match during
case-insensitive comparisons, so that Zöe and ZOE match. If I'm Icelandic,
I don't. I don't really see why Turkic gets singled out.
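
A rough sketch of the diacritic-insensitive comparison described here, assuming that dropping combining marks after NFD decomposition is an acceptable stand-in for "ignore diacritics". strip_marks and loose_equal are made-up names for the example.

```python
import unicodedata


def strip_marks(text):
    """Decompose to NFD and drop combining marks, e.g. 'ö' becomes 'o'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def loose_equal(a, b):
    """Compare while ignoring both case and diacritics."""
    return strip_marks(a).casefold() == strip_marks(b).casefold()


def strict_equal(a, b):
    """Compare while ignoring case only, keeping diacritics significant."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    print(loose_equal("Z\u00f6e", "ZOE"))
    print(strict_equal("Z\u00f6e", "ZOE"))
```

Which of the two functions is "right" depends on the reader's language, which is the same problem as with the Turkic I.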

> That's why it would have been nice if the Unicode guys had defined "both
> Turkish i-s" at separate codepoints.
> 
> Then one could have the three pairs
> I, i ("normal")
> I (other one), ı
> 
> and
> 
> İ, i (the other one).

And then people will say, "How can I match both sorts of dotless uppercase I
but not dotted I when I'm doing comparisons?"

-- 
Steven

