Re: Beazley 4E P.E.R, Page29: Unicode

From	Joshua Landau <joshua@landau.ws>
Date	Sun, 14 Jul 2013 08:13:59 +0100
Subject	Re: Beazley 4E P.E.R, Page29: Unicode
To	vek.m1234@gmail.com
Cc	python-list <python-list@python.org>
Newsgroups	comp.lang.python


On 14 July 2013 04:09,  <vek.m1234@gmail.com> wrote:
> http://stackoverflow.com/questions/17632246/beazley-4e-p-e-r-page29-unicode
>
> "directly writing a raw UTF-8 encoded string such as 'Jalape\xc3\xb1o' simply produces a nine-character string U+004A, U+0061, U+006C, U+0061, U+0070, U+0065, U+00C3, U+00B1, U+006F, which is probably not what you intended. This is because in UTF-8, the multi-byte sequence \xc3\xb1 is supposed to represent the single character U+00F1, not the two characters U+00C3 and U+00B1."

Correct.

> My original question was: Shouldn't this be 8 characters - not 9?

No, Python tends to be right on these things.
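
A quick way to check the count yourself, as a minimal sketch assuming Python 3 (the variable names are just for illustration):

```python
# The literal from the book: \xc3 and \xb1 are two separate escapes,
# so each one contributes its own character to the string.
s = 'Jalape\xc3\xb1o'

# List every character as a U+XXXX code point.
code_points = [f"U+{ord(ch):04X}" for ch in s]

print(len(s))
print(code_points)
```

The length comes out as 9, and the list matches the one Beazley gives.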

> He says: \xc3\xb1 is supposed to represent the single character. However after some interaction with fellow Pythonistas i'm even more confused.

You would be, given the way he said it.

> With reference to the above para:
> 1. What does he mean by "writing a raw UTF-8 encoded string"??

Well, that doesn't really mean much without context, which is how he gave it.

> In Python2, one can do 'Jalape funny-n o'. This is a 'bytes' string where each glyph is 1 byte long when stored internally so each glyph is associated with an integer as per charset ASCII or Latin-1. If these charsets have a funny-n glyph then yay! else nay! There is no UTF-8 here!! or UTF-16!! These are plain bytes (8 bits).
>
> Unicode is a really big mapping table between glyphs and integers and are denoted as Uxxxx or Uxxxx-xxxx.

*Waits for our resident unicode experts to explain why you're actually wrong*

> UTF-8 UTF-16 are encodings to store those big integers in an efficient manner. So when DB says "writing a raw UTF-8 encoded string" - well the only way to do this is to use Python3 where the default string literals are stored in Unicode which then will use a UTF-8 UTF-16 internally to store the bytes in their respective structures; or, one could use u'Jalape' which is unicode in both languages (note the leading 'u').

Correct.

> 2. So assuming this is Python 3: 'Jalape \xYY \xZZ o' (spaces for readability) what DB is saying is that, the stupid-user would expect Jalapeno with a squiggly-n but instead what he gets is: Jalape funny1 funny2 o (spaces for readability) - 9 glyphs or 9 Unicode-points or 9-UTF8 characters. Correct?

I think so.

> 3. Which leaves me wondering what he means by:
> "This is because in UTF-8, the multi-byte sequence \xc3\xb1 is supposed to represent the single character U+00F1, not the two characters U+00C3 and U+00B1"

He's mixed some things up, AFAICT.

> Could someone take the time to read carefully and clarify what DB is saying??

Here's a simple explanation: you're both wrong (or you're both *almost* right):

As of Python 3:

>>> "\xc3\xb1"
'Ã±'
>>> b"\xc3\xb1".decode()
'ñ'

"WHAT?!" you scream, "THAT'S WRONG!" But it's not. Let me explain.

In a Python 3 string literal, each "\xNN" escape names one *character*
(the code point U+00NN), not one byte (*winces in case I'm wrong*).
Python is interpreting the "\xc3" as "\N{LATIN CAPITAL LETTER A WITH
TILDE}" and "\xb1" as "\N{PLUS-MINUS SIGN}"¹. This means that Python
is given *two* characters. Python is basically doing this:

number = int("c3", 16) # Parse the hex digits of the escape: 195
chr(number) # Look up that code point in the Unicode mapping: 'Ã'
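
The same idea as a runnable sketch. The helper name hex_escape_to_char is made up for illustration; Python does this work in its parser, not through any function you can call:

```python
def hex_escape_to_char(hex_digits: str) -> str:
    """Hypothetical helper: mimic what a "\\xNN" escape means in a str literal."""
    number = int(hex_digits, 16)  # convert from base 16
    return chr(number)            # character at that Unicode code point

first = hex_escape_to_char("c3")
second = hex_escape_to_char("b1")

# Two escapes give two characters, the same ones the literal produces.
print(first + second == "\xc3\xb1")
print(len(first + second))
```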

When you give Python *raw bytes*, you are saying that this is what the
string looks like *when encoded* -- you are not giving Python Unicode,
but *encoded Unicode*. This means that when you decode it (.decode())
it is free to convert multibyte sections to their relevant characters.
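
A small sketch of that difference, assuming Python 3. The codec names are the standard ones; the variable names are only for illustration:

```python
raw = b"\xc3\xb1"  # two bytes: what the text looks like *when encoded*

# Decoding as UTF-8 is free to fold the two-byte sequence into one character.
as_utf8 = raw.decode("utf-8")

# Decoding as Latin-1 maps every byte to the code point with the same number,
# which gives the same two characters as the str literal "\xc3\xb1".
as_latin1 = raw.decode("latin-1")

print(len(raw), len(as_utf8), len(as_latin1))
print(as_latin1 == "\xc3\xb1")
```

So the bytes alone do not say what the text is; the codec you decode with does.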

To see how an *encoded string* is not the same as the string itself, see:

>>> "Jalapeño".encode("ASCII", errors="xmlcharrefreplace")
b'Jalape&#241;o'

Those *represent* the same thing, but the first (according to Python)
*is* the thing, the second needs to be *decoded*.
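
To make the "needs to be decoded" part concrete, here is a rough round trip. Using html.unescape to undo the character references is my own choice for the sketch, not something from the book:

```python
import html

original = "Jalape\u00f1o"  # the string itself, with a real n-with-tilde

# Encode to ASCII, writing anything non-ASCII as an XML character reference.
encoded = original.encode("ascii", errors="xmlcharrefreplace")

# Getting back takes two steps: bytes -> str, then references -> characters.
text = encoded.decode("ascii")
restored = html.unescape(text)

print(encoded)
print(restored == original)
```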

Now, bringing this back to the original:

>>> "\xc3\xb1".encode()
b'\xc3\x83\xc2\xb1'

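You can see that the *encoded* bytes represent the *two* characters;
the string you typed is *not the encoded one*. A string is a sequence
of characters; how Python stores it is *internal to Python*, and you
only get bytes when you ask for them with .encode().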
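
If you do end up with the nine-character string and wanted the eight-character one, a common repair (an addition of mine, and only valid when every character is below U+0100) is to turn the characters back into bytes with Latin-1 and then decode those bytes as UTF-8:

```python
mojibake = "Jalape\xc3\xb1o"  # nine characters, as in the book's example

# Latin-1 maps code points 0-255 straight to byte values 0-255, so this
# recovers the bytes the author presumably meant to write.
as_bytes = mojibake.encode("latin-1")

# Now the two-byte sequence can be decoded as the single character U+00F1.
repaired = as_bytes.decode("utf-8")

print(len(mojibake), len(repaired))
print(repaired)
```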

I hope that helps; good luck.

¹ Note that I find the "\N{...}" form much easier to read, and recommend it.
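
For comparison, a short sketch of the named form next to the hex escapes. unicodedata is in the standard library; the variable names are illustrative:

```python
import unicodedata

# The two characters the literal "\xc3\xb1" really contains, spelled by name.
two_chars = "\N{LATIN CAPITAL LETTER A WITH TILDE}\N{PLUS-MINUS SIGN}"

# The single character that was probably intended.
one_char = "\N{LATIN SMALL LETTER N WITH TILDE}"

# unicodedata.name() goes the other way, from a character to its name.
names = [unicodedata.name(ch) for ch in "\xc3\xb1\xf1"]

print(two_chars == "\xc3\xb1", one_char == "\xf1")
print(names)
```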
