From: Joshua Landau
Date: Sun, 14 Jul 2013 08:13:59 +0100
Subject: Re: Beazley 4E P.E.R, Page29: Unicode
To: vek.m1234@gmail.com
Cc: python-list
Newsgroups: comp.lang.python
In-Reply-To: <51cbaddd-c29d-48a3-97ab-3beb1d944f1a@googlegroups.com>

On 14 July 2013 04:09, <vek.m1234@gmail.com> wrote:
> http://stackoverflow.com/questions/17632246/beazley-4e-p-e-r-page29-unicode
>
> "directly writing a raw UTF-8 encoded string such as 'Jalape\xc3\xb1o' simply produces a nine-character string U+004A, U+0061, U+006C, U+0061, U+0070, U+0065, U+00C3, U+00B1, U+006F, which is probably not what you intended. This is because in UTF-8, the multi-byte sequence \xc3\xb1 is supposed to represent the single character U+00F1, not the two characters U+00C3 and U+00B1."

Correct.

> My original question was: Shouldn't this be 8 characters - not 9?

No, Python tends to be right on these things.

> He says: \xc3\xb1 is supposed to represent the single character. However after some interaction with fellow Pythonistas i'm even more confused.

You would be, given the way he said it.

> With reference to the above para:
> 1. What does he mean by "writing a raw UTF-8 encoded string"??

Well, that doesn't really mean much with no context like he gave it.

> In Python2, once can do 'Jalape funny-n o'. This is a 'bytes' string where each glyph is 1 byte long when stored internally so each glyph is associated with an integer as per charset ASCII or Latin-1. If these charsets have a funny-n glyph then yay! else nay! There is no UTF-8 here!! or UTF-16!! These are plain bytes (8 bits).
>
> Unicode is a really big mapping table between glyphs and integers and are denoted as Uxxxx or Uxxxx-xxxx.

*Waits for our resident unicode experts to explain why you're actually wrong*

> UTF-8 UTF-16 are encodings to store those big integers in an efficient manner.
> So when DB says "writing a raw UTF-8 encoded string" - well the only way to do this is to use Python3 where the default string literals are stored in Unicode which then will use a UTF-8 UTF-16 internally to store the bytes in their respective structures; or, one could use u'Jalape' which is unicode in both languages (note the leading 'u').

Correct.

> 2. So assuming this is Python 3: 'Jalape \xYY \xZZ o' (spaces for readability) what DB is saying is that, the stupid-user would expect Jalapeno with a squiggly-n but instead he gets is: Jalape funny1 funny2 o (spaces for readability) -9 glyphs or 9 Unicode-points or 9-UTF8 characters. Correct?

I think so.

> 3. Which leaves me wondering what he means by:
> "This is because in UTF-8, the multi-byte sequence \xc3\xb1 is supposed to represent the single character U+00F1, not the two characters U+00C3 and U+00B1"

He's mixed some things up, AFAICT.

> Could someone take the time to read carefully and clarify what DB is saying??

Here's a simple explanation: you're both wrong (or you're both *almost* right):

As of Python 3:

>>> "\xc3\xb1"
'Ã±'

>>> b"\xc3\xb1".decode()
'ñ'

"WHAT?!" you scream, "THAT'S WRONG!" But it's not. Let me explain.

Python 3's strings want you to give each character separately (*winces in case I'm wrong*). Python is interpreting the "\xc3" as "\N{LATIN CAPITAL LETTER A WITH TILDE}" and "\xb1" as "\N{PLUS-MINUS SIGN}"¹. This means that Python is given *two* characters. Python is basically doing this:

number = int("c3", 16)  # Convert from base16
chr(number)             # Turn to the character from the Unicode mapping

When you give Python *raw bytes*, you are saying that this is what the string looks like *when encoded* -- you are not giving Python Unicode, but *encoded Unicode*. This means that when you decode it (.decode()) it is free to convert multibyte sections to their relevant characters.
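
A minimal sketch of that distinction, assuming Python 3 (the variable names are illustrative, not from the book or the thread). It puts the same escapes in a str literal and in a bytes literal and compares the lengths:

```python
import unicodedata

# In a str literal, each \xNN escape is one code point, U+00NN.
text = "Jalape\xc3\xb1o"

# In a bytes literal, each \xNN escape is one raw byte.
raw = b"Jalape\xc3\xb1o"

# Decoding the bytes as UTF-8 merges the pair c3 b1 into one character.
decoded = raw.decode("utf-8")

print(len(text))     # number of code points in the str literal
print(len(raw))      # number of bytes in the bytes literal
print(len(decoded))  # number of code points after decoding

# Names of the two characters the str literal actually contains.
print([unicodedata.name(c) for c in text[6:8]])
```

The str literal has nine code points and the decoded bytes have eight, which is the nine-versus-eight discrepancy the original question asks about.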

To see how an *encoded string* is not the same as the string itself, see:

>>> "Jalapeño".encode("ASCII", errors="xmlcharrefreplace")
b'Jalape&#241;o'

Those *represent* the same thing, but the first (according to Python) *is* the thing, the second needs to be *decoded*.

Now, bringing this back to the original:

>>> "\xc3\xb1".encode()
b'\xc3\x83\xc2\xb1'

You can see that the *encoded* bytes represent the *two* characters; the string you see above is *not the encoded one*. The encoding is *internal to Python*.

I hope that helps; good luck.

¹ Note that I find the "\N{...}" form much easier to read, and recommend it.
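
As a rough illustration of the footnote, assuming Python 3 (the names `n_tilde` and `repaired` are made up for this sketch): the \N{...} form names the intended character directly, and encoding it shows where the bytes c3 b1 come from.

```python
# \N{...} names the character itself, so bytes and code points cannot be confused.
n_tilde = "\N{LATIN SMALL LETTER N WITH TILDE}"

print(n_tilde == "\xf1")        # the same single code point, U+00F1
print(n_tilde.encode("utf-8"))  # its two-byte UTF-8 encoding

# The two-character string from the book, spelled out by name.
print("\xc3\xb1" == "\N{LATIN CAPITAL LETTER A WITH TILDE}\N{PLUS-MINUS SIGN}")

# If that two-character string was written by mistake, the intended character
# can be recovered: Latin-1 maps code points 0-255 straight back to bytes 0-255,
# and those bytes can then be decoded as UTF-8.
repaired = "\xc3\xb1".encode("latin-1").decode("utf-8")
print(repaired == n_tilde)
```

The Latin-1 round trip is only a repair trick for strings already built from byte values; writing the character by name avoids the problem in the first place.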