Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!newsfeed.xs4all.nl!newsfeed1.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <6dfa3707-80f4-407a-a109-66dbb0130513@googlegroups.com>
References: <6dfa3707-80f4-407a-a109-66dbb0130513@googlegroups.com>
Date: Sun, 9 Jun 2013 13:18:08 +0100
Subject: Re: A few questiosn about encoding
From: =?ISO-8859-1?Q?F=E1bio_Santos?= <fabiosantosart@gmail.com>
To: =?ISO-8859-7?B?zenq/Ovh7/Igyu/98eHy?= <nikos.gr33k@gmail.com>
Content-Type: multipart/alternative; boundary=e89a8f50389c2ee89904deb7a87e
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.2915.1370780298.3114.python-list@python.org>
Lines: 126
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:47454

--e89a8f50389c2ee89904deb7a87e
Content-Type: text/plain; charset=ISO-8859-7
Content-Transfer-Encoding: quoted-printable

On 9 Jun 2013 11:49, "=CD=E9=EA=FC=EB=E1=EF=F2 =CA=EF=FD=F1=E1=F2" <nikos.g=
r33k@gmail.com> wrote:
>
> A few questions about encoding please:
>
> >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
> >> values up to 256?
>
> >Because then how do you tell when you need one byte, and when you need
> >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
> >characters, with ordinal values 0x4C and 0xFA, or one character with
> >ordinal value 0x4CFA?
>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant
> up to 256, not above 256.
>
>
> >> UTF-8 and UTF-16 and UTF-32
> >> I thought the number beside UTF- was to declare how many bits the
> >> character set was using to store a character into the hdd, no?
>
> >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
> >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
> >values to make a surrogate pair.
>
> A surrogate pair is like hitting, for example, Ctrl-A, which is a
> combination character that consists of 2 different characters?
> Is this what a surrogate is? A pair of 2 chars?
>
>
> >UTF-8 uses 8-bit values, but sometimes
> >it combines two, three or four of them to represent a single code-point.
>
> 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal =3D 65)
> '=E1=B4' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal
> is > 127 )
> 'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ?
> (since ordinal > 65000 )
>
> The amount of bytes needed to store a character solely depends on the
> character's ordinal value in the Unicode table?

In short, a character takes 1 to 4 bytes in UTF-8, 2 or 4 bytes in
UTF-16, and always 4 bytes in UTF-32. In UTF-8 the size depends only on
the character's code point.
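
A rough way to check those sizes yourself, assuming Python 3. The helper
name sizes() is mine, and the little-endian codecs are used only so that
no byte order mark gets counted:

```python
def sizes(ch):
    """Return (UTF-8, UTF-16, UTF-32) byte counts for one character."""
    return tuple(len(ch.encode(codec))
                 for codec in ("utf-8", "utf-16-le", "utf-32-le"))

# 'a', Greek alpha with tonos, a CJK ideograph, and an emoji that lies
# outside the Basic Multilingual Plane (escapes keep this ASCII-only).
for ch in ("a", "\u03ac", "\u4e2d", "\U0001f600"):
    print("U+%04X" % ord(ch), sizes(ch))

# U+0061 (1, 2, 4)
# U+03AC (2, 2, 4)
# U+4E2D (3, 2, 4)
# U+1F600 (4, 4, 4)
```

Note that the ideograph needs 3 bytes in UTF-8, not 4. Only code points
above U+FFFF need 4 bytes in UTF-8, and those are also the ones that
become a surrogate pair (two 16-bit units) in UTF-16.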

Turning characters into bytes is called encoding. The opposite, turning
bytes back into characters, is decoding. Python handles both with the
encode() and decode() methods. Most of the time you don't need to think
about the byte-level details.
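
A minimal sketch of the two directions, again assuming Python 3. The
wrappers to_bytes() and to_text() are just illustrative names around the
built-in encode() and decode():

```python
def to_bytes(text, codec):
    """Encode: characters in, bytes out."""
    return text.encode(codec)

def to_text(raw, codec):
    """Decode: bytes in, characters out."""
    return raw.decode(codec)

# Greek small alpha with tonos, U+03AC, becomes two bytes in UTF-8.
print(to_bytes("\u03ac", "utf-8"))              # b'\xce\xac'
# Decoding with the same codec gives the original character back.
print(ascii(to_text(b"\xce\xac", "utf-8")))     # '\u03ac'
# Decoding with another codec gives different text, here 2 characters.
print(len(to_text(b"\xce\xac", "iso-8859-7")))  # 2
```

The last line is why the codec has to be known when reading bytes: the
same two bytes are one character in UTF-8 but two in ISO-8859-7.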

--e89a8f50389c2ee89904deb7a87e--