Beazley 4E P.E.R, Page29: Unicode

Newsgroups	comp.lang.python
Date	2013-07-13 20:09 -0700
Message-ID	<51cbaddd-c29d-48a3-97ab-3beb1d944f1a@googlegroups.com>
Subject	Beazley 4E P.E.R, Page29: Unicode
From	vek.m1234@gmail.com

http://stackoverflow.com/questions/17632246/beazley-4e-p-e-r-page29-unicode

"directly writing a raw UTF-8 encoded string such as 'Jalape\xc3\xb1o' simply produces a nine-character string U+004A, U+0061, U+006C, U+0061, U+0070, U+0065, U+00C3, U+00B1, U+006F, which is probably not what you intended. This is because in UTF-8, the multi-byte sequence \xc3\xb1 is supposed to represent the single character U+00F1, not the two characters U+00C3 and U+00B1."

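For reference, the nine-character claim in the quote can be checked directly. This is a minimal sketch assuming Python 3 `str` semantics (a `u'...'` literal in Python 2 behaves the same way); the names `s` and `code_points` are only illustrative and do not come from the book.

```python
# Assumption: Python 3, where a plain string literal is a Unicode str.
# In a str literal each \xNN escape yields ONE code point (U+00NN), not one byte.
s = 'Jalape\xc3\xb1o'

# List the code points in the same U+XXXX notation the book uses.
code_points = ['U+%04X' % ord(ch) for ch in s]

print(len(s))
print(code_points)
```

The two escapes become the separate characters U+00C3 and U+00B1, which is where the count of nine comes from.
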
My original question was: Shouldn't this be 8 characters, not 9? He says: \xc3\xb1 is supposed to represent the single character. However, after some interaction with fellow Pythonistas I'm even more confused.

With reference to the above para:
1. What does he mean by "writing a raw UTF-8 encoded string"??
In Python 2, one can write 'Jalape funny-n o'. This is a 'bytes' string where each glyph is 1 byte long when stored internally, so each glyph is associated with an integer as per a charset such as ASCII or Latin-1. If the charset has a funny-n glyph then yay, else nay! There is no UTF-8 here, or UTF-16! These are plain bytes (8 bits).

Unicode is a really big mapping table between glyphs and integers; the integers are denoted as Uxxxx or Uxxxx-xxxx. UTF-8 and UTF-16 are encodings that store those big integers in an efficient manner. So when DB says "writing a raw UTF-8 encoded string", the only way to do this is to use Python 3, where string literals are Unicode by default and are then stored internally as UTF-8 or UTF-16 bytes in their respective structures; or one could use u'Jalape', which is unicode in both languages (note the leading 'u').
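
The distinction between code points and their encoded bytes can be sketched as follows. This assumes Python 3; `word`, `utf8_bytes` and `utf16_bytes` are hypothetical names, and 'utf-16-le' is chosen only so that no byte-order mark is added to the output.

```python
# Assumption: Python 3. The str holds 8 code points; U+00F1 is n-with-tilde.
word = 'Jalape\u00f1o'

# Encoding turns code points into bytes; the byte count depends on the encoding.
utf8_bytes = word.encode('utf-8')       # U+00F1 becomes the two bytes C3 B1
utf16_bytes = word.encode('utf-16-le')  # two bytes per character for this word

print(len(word), len(utf8_bytes), len(utf16_bytes))
print(utf8_bytes)
```

The string has the same 8 characters in both cases; only the stored byte sequences differ.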

2. So assuming this is Python 3: 'Jalape \xYY \xZZ o' (spaces for readability). What DB is saying is that the stupid user would expect Jalapeno with a squiggly-n, but what he gets instead is: Jalape funny1 funny2 o (spaces for readability), i.e. 9 glyphs, or 9 Unicode code points, or 9 UTF-8 characters. Correct?

3. Which leaves me wondering what he means by:
"This is because in UTF-8, the multi-byte sequence \xc3\xb1 is supposed to represent the single character U+00F1, not the two characters U+00C3 and U+00B1"

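One possible reading of that sentence, sketched under the assumption of Python 3: the escapes \xc3\xb1 only mean "n-with-tilde" when they are bytes that get decoded as UTF-8. The names `raw`, `text`, `mistaken` and `repaired` are illustrative, and the Latin-1 round trip is just one way to recover the intended text, not something the book prescribes.

```python
# Assumption: Python 3. As BYTES, \xc3\xb1 is the UTF-8 encoding of U+00F1.
raw = b'Jalape\xc3\xb1o'       # 9 bytes
text = raw.decode('utf-8')     # 8 characters, the 7th is U+00F1

# As a STR literal the same escapes are two code points (U+00C3, U+00B1).
mistaken = 'Jalape\xc3\xb1o'   # 9 characters

# Latin-1 maps code points 0-255 straight back to bytes 0-255, so the
# mistaken str can be turned into bytes again and decoded properly.
repaired = mistaken.encode('latin-1').decode('utf-8')

print(len(raw), len(text), len(mistaken), repaired == text)
```
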
Could someone take the time to read carefully and clarify what DB is saying??

Thread

Beazley 4E P.E.R, Page29: Unicode vek.m1234@gmail.com - 2013-07-13 20:09 -0700
  Re: Beazley 4E P.E.R, Page29: Unicode Terry Reedy <tjreedy@udel.edu> - 2013-07-14 03:08 -0400
  Re: Beazley 4E P.E.R, Page29: Unicode Joshua Landau <joshua@landau.ws> - 2013-07-14 08:13 +0100
    Re: Beazley 4E P.E.R, Page29: Unicode vek.m1234@gmail.com - 2013-07-14 01:10 -0700
  Re: Beazley 4E P.E.R, Page29: Unicode Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-14 08:18 +0000
    Re: Beazley 4E P.E.R, Page29: Unicode vek.m1234@gmail.com - 2013-07-14 02:39 -0700
