Sender: joshua.landau.ws@gmail.com
In-Reply-To: <51cbaddd-c29d-48a3-97ab-3beb1d944f1a@googlegroups.com>
References: <51cbaddd-c29d-48a3-97ab-3beb1d944f1a@googlegroups.com>
From: Joshua Landau <joshua@landau.ws>
Date: Sun, 14 Jul 2013 08:13:59 +0100
Subject: Re: Beazley 4E P.E.R, Page29: Unicode
To: vek.m1234@gmail.com
Cc: python-list <python-list@python.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.4696.1373786087.3114.python-list@python.org>

On 14 July 2013 04:09,  <vek.m1234@gmail.com> wrote:
> http://stackoverflow.com/questions/17632246/beazley-4e-p-e-r-page29-unicode
>
> "directly writing a raw UTF-8 encoded string such as 'Jalape\xc3\xb1o'
> simply produces a nine-character string U+004A, U+0061, U+006C, U+0061,
> U+0070, U+0065, U+00C3, U+00B1, U+006F, which is probably not what you
> intended. This is because in UTF-8, the multi-byte sequence \xc3\xb1 is
> supposed to represent the single character U+00F1, not the two
> characters U+00C3 and U+00B1."

Correct.

> My original question was: Shouldn't this be 8 characters - not 9?

No, Python tends to be right on these things.

> He says: \xc3\xb1 is supposed to represent the single character. However
> after some interaction with fellow Pythonistas I'm even more confused.

You would be, given the way he said it.

> With reference to the above para:
> 1. What does he mean by "writing a raw UTF-8 encoded string"??

Well, that doesn't really mean much with no context like he gave it.

> In Python2, one can do 'Jalape funny-n o'. This is a 'bytes' string
> where each glyph is 1 byte long when stored internally so each glyph is
> associated with an integer as per charset ASCII or Latin-1. If these
> charsets have a funny-n glyph then yay! else nay! There is no UTF-8
> here!! or UTF-16!! These are plain bytes (8 bits).
>
> Unicode is a really big mapping table between glyphs and integers and
> are denoted as Uxxxx or Uxxxx-xxxx.

*Waits for our resident Unicode experts to explain why you're actually
wrong*

> UTF-8 UTF-16 are encodings to store those big integers in an efficient
> manner. So when DB says "writing a raw UTF-8 encoded string" - well the
> only way to do this is to use Python3 where the default string literals
> are stored in Unicode which then will use a UTF-8 UTF-16 internally to
> store the bytes in their respective structures; or, one could use
> u'Jalape' which is unicode in both languages (note the leading 'u').

Correct.

> 2. So assuming this is Python 3: 'Jalape \xYY \xZZ o' (spaces for
> readability) what DB is saying is that, the stupid-user would expect
> Jalapeno with a squiggly-n but instead he gets is: Jalape funny1 funny2 o
> (spaces for readability) -9 glyphs or 9 Unicode-points or 9-UTF8
> characters. Correct?

I think so.

> 3. Which leaves me wondering what he means by:
> "This is because in UTF-8, the multi-byte sequence \xc3\xb1 is supposed
> to represent the single character U+00F1, not the two characters U+00C3
> and U+00B1"

He's mixed some things up, AFAICT.

> Could someone take the time to read carefully and clarify what DB is
> saying??

Here's a simple explanation: you're both wrong (or you're both *almost*
right):

As of Python 3:

>>> "\xc3\xb1"
'Ã±'
>>> b"\xc3\xb1".decode()
'ñ'

"WHAT?!" you scream, "THAT'S WRONG!" But it's not. Let me explain.

Python 3's strings want you to give each character separately (*winces
in case I'm wrong*): inside a str literal, "\xNN" means the character
with code point U+00NN, not the byte 0xNN. So Python is interpreting
the "\xc3" as "\N{LATIN CAPITAL LETTER A WITH TILDE}" and "\xb1" as
"\N{PLUS-MINUS SIGN}"¹. This means that Python is given *two*
characters. Python is basically doing this:

number = int("c3", 16) # Convert from base16
chr(number) # Turn to the character from the Unicode mapping
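
As a small sketch of the same idea (the variable names are mine, purely
for illustration): in a Python 3 str literal the \xNN escape, chr() and
the \N{...} form all name the same code point.

```python
import unicodedata

# In a str literal, "\xc3" names code point U+00C3, not the byte 0xC3.
a_tilde = chr(int("c3", 16))
plus_minus = chr(int("b1", 16))

# All three spellings give the very same one-character string.
same_a = a_tilde == "\xc3" == "\N{LATIN CAPITAL LETTER A WITH TILDE}"
same_pm = plus_minus == "\xb1" == "\N{PLUS-MINUS SIGN}"
print(same_a, same_pm)

# Ask the Unicode database what each character is called.
print(unicodedata.name(a_tilde))
print(unicodedata.name(plus_minus))
```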

When you give Python *raw bytes*, you are saying that this is what the
string looks like *when encoded* -- you are not giving Python Unicode,
but *encoded Unicode*. This means that when you decode it (.decode())
it is free to convert multibyte sections to their relevant characters.
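
To make the book's example concrete, here is a hedged sketch (the
variable names are mine) comparing the str literal with the bytes
literal decoded as UTF-8. The last step assumes the only damage is that
UTF-8 bytes were typed as \xNN escapes in a str; Latin-1 maps U+0000 to
U+00FF back to single bytes one-to-one, which is why it can undo that.

```python
# str literal: every \xNN escape is one character, so 9 characters.
text_literal = "Jalape\xc3\xb1o"

# bytes literal: \xc3\xb1 are two raw bytes, the UTF-8 form of U+00F1.
raw_bytes = b"Jalape\xc3\xb1o"
decoded = raw_bytes.decode("utf-8")

print(len(text_literal))  # 9 characters
print(len(raw_bytes))     # 9 bytes
print(len(decoded))       # 8 characters
print(decoded == "Jalape\N{LATIN SMALL LETTER N WITH TILDE}o")

# Turn the 9-character str back into bytes, then decode it properly.
repaired = text_literal.encode("latin-1").decode("utf-8")
print(repaired == decoded)
```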

To see how an *encoded string* is not the same as the string itself, see:

>>> "Jalapeño".encode("ASCII", errors="xmlcharrefreplace")
b'Jalape&#241;o'

Those *represent* the same thing, but the first (according to Python)
*is* the thing, the second needs to be *decoded*.
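
A rough sketch of that "needs to be decoded" step, assuming the bytes
came from the xmlcharrefreplace handler: decode the ASCII bytes, then
let html.unescape (standard library, Python 3.4+) resolve the character
reference. The variable names are mine.

```python
import html

original = "Jalape\N{LATIN SMALL LETTER N WITH TILDE}o"

# Encode: characters ASCII cannot hold become &#NNN; references.
encoded = original.encode("ascii", errors="xmlcharrefreplace")
print(encoded)

# Decode: bytes -> str, then resolve the character reference.
round_tripped = html.unescape(encoded.decode("ascii"))
print(round_tripped == original)
```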

Now, bringing this back to the original:

>>> "\xc3\xb1".encode()
b'\xc3\x83\xc2\xb1'

You can see that the *encoded* bytes represent the *two* characters;
the string you see above is *not the encoded one*. The encoding is
*internal to Python*.
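
For comparison, a minimal sketch (variable names are mine) that lists
the code points of each string next to its UTF-8 encoding: the
two-character string encodes to four bytes, while the single character
U+00F1 encodes to the two bytes \xc3\xb1 from the book.

```python
two_chars = "\xc3\xb1"                             # U+00C3, U+00B1
one_char = "\N{LATIN SMALL LETTER N WITH TILDE}"   # U+00F1

for label, s in (("two_chars", two_chars), ("one_char", one_char)):
    code_points = [f"U+{ord(c):04X}" for c in s]
    print(label, code_points, s.encode("utf-8"))
```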


I hope that helps; good luck.


¹ Note that I find the "\N{...}" form much easier to read, and
recommend it.