From: Ned Batchelder
Newsgroups: comp.lang.python
To: python-list@python.org
Subject: Re: unicode as valid naming symbols
Date: Tue, 01 Apr 2014 09:33:33 -0400

On 4/1/14 9:00 AM, Chris Angelico wrote:
> On Tue, Apr 1, 2014 at 10:59 PM, Antoon Pardon wrote:
>> On 01-04-14 12:58, Chris Angelico wrote:
>>> But because, in the future, Python may choose to create new operators,
>>> the simplest and safest way to ensure safety is to put a boundary on
>>> what can be operators and what can be names; Unicode character classes
>>> are perfect for this. It's also possible that all Unicode whitespace
>>> characters might become legal for indentation and separation (maybe
>>> they are already??), so obviously they're ruled out as identifiers;
>>> anyway, I honestly do not think people would want to use U+2007 FIGURE
>>> SPACE inside a name. So if we deny whitespace, and accept letters and
>>> digits, it makes good sense to deny mathematical symbols so as to keep
>>> them available for operators. (It also makes reasonable sense to
>>> *permit* mathematical symbols, thus allowing you to use them for
>>> functions/methods, in the same way that you can use "n", "o", and "t",
>>> but not "not"; but with word operators, the entire word has to be used
>>> as-is before it's a collision - with a symbolic one, any instance of
>>> that symbol inside a name will change parsing entirely. It's a
>>> trade-off, and Python's made a decision one way and not the other.)
>>
>> This mostly makes sense to me. The only caveat I have is that since we
>> also allow _ (U+005F LOW LINE) in names which belongs to the category
>> [Pc, connector punctuation], we should allow other symbols within this
>> category in a name.
>>
>> But I confess that is mostly personal taste, since I find names_like_this
>> ugly. Names-like-this look better to me but that wouldn't be workable
>> in Python. But maybe there is some connector that would be aesthetically
>> pleasing and not causing other problems.
>
> That's reasonable.
> The Pc category doesn't have much in it:
>
> http://www.fileformat.info/info/unicode/category/Pc/list.htm
>
> If the definition of "characters permitted in identifiers" is derived
> exclusively from the Unicode categories, including Pc would make fine
> sense. Probably the definition should be: First character is L* or Pc,
> subsequent characters are L*, N*, or Pc, and either Mn or M*
> (combining characters). Or something like that.

Maybe I'm misunderstanding the discussion... It seems like we're talking
about a hypothetical definition of identifiers based on Unicode character
categories, but there's no need: Python 3 has defined precisely that.

From the docs
(https://docs.python.org/3/reference/lexical_analysis.html#identifiers):

------------
Python 3.0 introduces additional characters from outside the ASCII range
(see PEP 3131). For these characters, the classification uses the version
of the Unicode Character Database as included in the unicodedata module.

Identifiers are unlimited in length. Case is significant.

    identifier   ::=  xid_start xid_continue*
    id_start     ::=  <all characters in general categories Lu, Ll, Lt,
                      Lm, Lo, Nl, the underscore, and characters with
                      the Other_ID_Start property>
    id_continue  ::=  <all characters in id_start, plus characters in
                      the categories Mn, Mc, Nd, Pc and others with the
                      Other_ID_Continue property>
    xid_start    ::=  <all characters in id_start whose NFKC
                      normalization is in "id_start xid_continue*">
    xid_continue ::=  <all characters in id_continue whose NFKC
                      normalization is in "id_continue*">

The Unicode category codes mentioned above stand for:

    Lu - uppercase letters
    Ll - lowercase letters
    Lt - titlecase letters
    Lm - modifier letters
    Lo - other letters
    Nl - letter numbers
    Mn - nonspacing marks
    Mc - spacing combining marks
    Nd - decimal numbers
    Pc - connector punctuations
    Other_ID_Start - explicit list of characters in PropList.txt to
        support backwards compatibility
    Other_ID_Continue - likewise

All identifiers are converted into the normal form NFKC while parsing;
comparison of identifiers is based on NFKC.
--------

>
> ChrisA
>

--
Ned Batchelder, http://nedbatchelder.com
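
For anyone who wants to poke at the rules quoted above, here is a rough
sketch, assuming CPython 3. The helper names describe() and binds_as() and
the sample names are made up for illustration and appear nowhere in the
thread or the docs. It asks str.isidentifier() about a few candidates,
lists each character's general category, and checks which name an
assignment actually binds after NFKC normalization.

```python
import unicodedata


def describe(name):
    """Return (is this a valid identifier?, general category of each char)."""
    return name.isidentifier(), [unicodedata.category(ch) for ch in name]


def binds_as(source_name):
    """Assign to source_name in a fresh namespace; return the names bound."""
    namespace = {}
    exec(f"{source_name} = 1", namespace)
    # exec() adds "__builtins__" to the dict, so leave that out.
    return [key for key in namespace if key != "__builtins__"]


# Hypothetical sample names: underscore, U+203F UNDERTIE (also Pc) in the
# middle and at the start, a hyphen, U+2007 FIGURE SPACE, and U+2211
# N-ARY SUMMATION (a math symbol).
for name in ["a_b", "a\u203fb", "\u203fab", "a-b", "a\u2007b", "a\u2211"]:
    print(ascii(name), describe(name))

# U+FB01 is the "fi" ligature; the parser normalizes identifiers to NFKC.
print(binds_as("\ufb01le"))
```

If I'm reading the grammar right, the undertie should be accepted inside
a name but not as its first character, since Pc is only in id_continue
(the underscore is special-cased into id_start). The figure space and the
summation sign should both be rejected, and the ligature spelling should
end up bound under the plain ASCII name.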