| Header | Value |
|---|---|
| Date | Thu, 15 Sep 2011 15:16:15 +0200 |
| From | Alan Plum <me@alanplum.com> |
| Subject | Re: Turkic I and re |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.1163.1316092594.27778.python-list@python.org> |
On 2011-09-15 15:02, MRAB wrote:
> The regex module at http://pypi.python.org/pypi/regex currently uses a
> compromise, where it matches 'I' with 'i' and also 'I' with 'ı' and 'İ'
> with 'i'.
>
> I was wondering if it would be preferable to have a TURKIC flag instead
> ("(?T)" or "(?T:...)" in the pattern).

I think the problem many people ignore when coming up with solutions
like this is that, while this behaviour is pretty much unique to Turkish
orthography, there is no guarantee that Turkish substrings won't appear
in strings in other languages (or vice versa).

For example, foreign names in Turkish text are often written in their
native (non-Turkish) spelling. Likewise, Turkish names in other
languages are often written as spelled in Turkish.
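
To make the mixed-language problem concrete, here is a minimal Python sketch. The sample string and the helper `turkish_lower` are made up for illustration (they are not part of the regex module or of any proposal in this thread); the helper simply applies the Turkish mapping 'I' -> 'ı' and 'İ' -> 'i' before lowercasing.

```python
def turkish_lower(text: str) -> str:
    """Lowercase with the Turkish rule: 'I' -> 'ı' and 'İ' -> 'i'.

    Hypothetical helper for illustration only.
    """
    return text.replace("İ", "i").replace("I", "ı").lower()


# A Turkish phrase containing an English name and a Turkish name
# ("ve" is Turkish for "and"; both names are invented examples).
mixed = "Isaac Newton ve Işık Kaya"

default_result = mixed.lower()         # language-neutral rule: 'I' -> 'i'
turkish_result = turkish_lower(mixed)  # Turkish rule: 'I' -> 'ı'

print(default_result)  # isaac newton ve işık kaya
print(turkish_result)  # ısaac newton ve ışık kaya
```

Neither rule gets the whole string right: the default mapping turns the Turkish name into "işık", and the Turkish mapping turns the English name into "ısaac". A single per-pattern flag would presumably run into the same issue.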

The Turkish 'I' is a peculiarity that will probably haunt us programmers
until hell freezes over. Unless Turkey abandons its traditional
orthography or people start speaking only a single language at a time
(including names), there's no easy way to deal with this.

In other words: the only way to make use of your proposed flag is to
have fully language-tagged input (e.g. an XML document making extensive
use of xml:lang) and to only ever apply regular expressions to
substrings in one language at a time.
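
As a rough sketch of what "fully language-tagged input" could look like in practice, the snippet below folds case per segment. The `(lang, text)` pairs stand in for what something like xml:lang might provide, and `fold_for_lang` is a hypothetical helper, not an existing API; it assumes the Turkic mapping should apply only to segments tagged `tr` or `az`.

```python
def fold_for_lang(text: str, lang: str) -> str:
    """Case-fold text, applying the Turkic I mapping only for 'tr'/'az'.

    Hypothetical helper; 'lang' is a language tag such as 'en' or 'tr-TR'.
    """
    primary = lang.split("-")[0].lower()
    if primary in {"tr", "az"}:
        # Turkic rule: dotted capital maps to 'i', plain capital to 'ı'.
        text = text.replace("İ", "i").replace("I", "ı")
    return text.casefold()


# Invented, pre-tagged segments of one mixed-language string.
segments = [("en", "Isaac Newton"), ("tr", "ve"), ("tr", "Işık Kaya")]

folded = [fold_for_lang(text, lang) for lang, text in segments]
print(folded)  # ['isaac newton', 've', 'ışık kaya']
```

This only works because every segment carries its own tag; with an untagged string there is nothing to base the choice of mapping on.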