From: Travis Griggs
Newsgroups: comp.lang.python
Subject: Re: 'Straße' ('Strasse') and Python 2
Date: Wed, 15 Jan 2014 08:28:49 -0800

On Jan 15, 2014, at 4:50 AM, Robin Becker wrote:

> On 15/01/2014 12:13, Ned Batchelder wrote:
> ........
>>> On my utf8 based system
>>>
>>>
>>>> robin@everest ~:
>>>> $ cat ooo.py
>>>> if __name__=='__main__':
>>>>     import sys
>>>>     s='A̅B'
>>>>     print('version_info=%s\nlen(%s)=%d' % (sys.version_info,s,len(s)))
>>>> robin@everest ~:
>>>> $ python ooo.py
>>>> version_info=sys.version_info(major=3, minor=3, micro=3,
>>>> releaselevel='final', serial=0)
>>>> len(A̅B)=3
>>>> robin@everest ~:
>>>> $
>>>
>>>
> ........
>> You are right that more than one codepoint makes up a grapheme, and that you'll
>> need code to deal with the correspondence between them. But let's not muddy
>> these already confusing waters by referring to that mapping as an encoding.
>>
>> In Unicode terms, an encoding is a mapping between codepoints and bytes. Python
>> 3's str is a sequence of codepoints.
>>
> Semantics is everything. For me graphemes are the endpoint (or should be); to get a proper rendering of a sequence of graphemes I can use either a sequence of bytes or a sequence of codepoints. They are both encodings of the graphemes; what unicode says is an encoding doesn't define what encodings are ie mappings from some source alphabet to a target alphabet.

But you're talking about two levels of encoding. One runs on top of the other. So insisting that you be able to call them all encodings makes the term pointless, because now it's ambiguous what you're referring to. Are you referring to encoding in the sense of representing code points with bytes? Or are you referring to what the Unicode guys call "forms" (normalization forms)?

For example, the NFC form of 'ñ' is '\u00F1'. The NFD form represents the exact same grapheme, but is '\u006e\u0303'. You can call them encodings if you want, but I echo Ned's sentiment that you keep that to yourself. Conventionally, they're different forms, not different encodings. You can encode either form with an encoding, e.g.

    '\u00F1'.encode('utf8')
    '\u00F1'.encode('utf16')
    '\u006e\u0303'.encode('utf8')
    '\u006e\u0303'.encode('utf16')
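
To make the two levels concrete, here is a minimal Python 3 sketch of the distinction between forms and encodings. It is an illustration only, not part of the original exchange. The variable names (nfc, nfd) are mine. I also use 'utf-16-le' rather than plain 'utf16' so the bytes come out without a BOM and in a fixed byte order.

```python
import unicodedata

# Level 1: normalization forms. Same grapheme, different code point sequences.
nfc = unicodedata.normalize('NFC', '\u006e\u0303')  # composed: U+00F1
nfd = unicodedata.normalize('NFD', '\u00f1')        # decomposed: U+006E U+0303

print(len(nfc), len(nfd))  # 1 2
print(nfc == nfd)          # False: str compares code points, not graphemes

# Level 2: encodings. Either form can be mapped to bytes by any codec.
# 'utf-16-le' is an assumption made here to avoid the BOM that 'utf16' adds.
for form_name, text in (('NFC', nfc), ('NFD', nfd)):
    for codec in ('utf-8', 'utf-16-le'):
        print(form_name, codec, text.encode(codec))
```

Choosing a form and choosing a codec are independent decisions: two forms times two codecs gives four distinct byte strings for the one grapheme.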