From: Fábio Santos
Newsgroups: comp.lang.python
Subject: Re: A few questiosn about encoding
Date: Sun, 9 Jun 2013 13:18:08 +0100
To: Νικόλαος Κούρας
Cc: python-list@python.org
In-Reply-To: <6dfa3707-80f4-407a-a109-66dbb0130513@googlegroups.com>
References: <6dfa3707-80f4-407a-a109-66dbb0130513@googlegroups.com>

On 9 Jun 2013 11:49, "Νικόλαος Κούρας" wrote:
>
> A few questions about encoding please:
>
> >> Since 1 byte can hold up to 256 chars, why doesn't utf-8 use 1 byte for
> >> values up to 256?
>
> >Because then how do you tell when you need one byte, and when you need
> >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
> >characters, with ordinal values 0x4C and 0xFA, or one character with
> >ordinal value 0x4CFA?
>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>
>
> >> UTF-8 and UTF-16 and UTF-32
> >> I thought the number beside UTF- was to declare how many bits the
> >> character set was using to store a character into the hdd, no?
>
> >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
> >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
> >values to make a surrogate pair.
>
> A surrogate pair is like hitting for example Ctrl-A, which means it is a combination character that consists of 2 different characters?
> Is this what a surrogate is? a pair of 2 chars?
>
>
> >UTF-8 uses 8-bit values, but sometimes
> >it combines two, three or four of them to represent a single code-point.
>
> 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
> 'a chinese ideogram' to be utf8 encoded needs 4 byte to be stored ? (since ordinal > 65000 )
>
> The amount of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table?

In short, a UTF-8 character takes 1 to 4 bytes. A UTF-16 character takes 2 or 4 bytes. A UTF-32 character always takes 4 bytes.

Turning characters into bytes is called encoding. The opposite, turning bytes back into characters, is called decoding. Python handles both with the encode() and decode() methods, so you normally don't need to care about this kind of thing.


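To make the byte counts concrete, here is a minimal sketch, assuming Python 3 (where str holds code points and encode() returns bytes). The helper name encoded_sizes and the four sample characters are my own illustrative choices, not something from the thread. It measures how many bytes one character needs in each encoding, then does an encode/decode round trip.

```python
# Assumption: Python 3. The helper name and sample characters are illustrative.

def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    # The "-le" codecs are used so that no byte order mark is prepended,
    # which would otherwise inflate the UTF-16 and UTF-32 counts.
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(ch.encode(codec)) for codec in codecs)


samples = [
    "a",           # U+0061, Latin small letter a
    "\u03ac",      # U+03AC, Greek small letter alpha with tonos
    "\u4e2d",      # U+4E2D, a common CJK ideograph
    "\U0001F600",  # U+1F600, a code point above U+FFFF
]

for ch in samples:
    print("U+%04X" % ord(ch), encoded_sizes(ch))

# Encoding turns characters into bytes; decoding turns bytes back into characters.
data = "\u03ac".encode("utf-8")
assert data == b"\xce\xac"
assert data.decode("utf-8") == "\u03ac"
```

This prints (1, 2, 4), (2, 2, 4), (3, 2, 4) and (4, 4, 4) for the four samples. Note that 'a' is code point 97 (65 is 'A'), that a typical Chinese ideograph needs 3 bytes in UTF-8 rather than 4, and that the last sample takes 4 bytes in UTF-16 because it is stored as a surrogate pair. In each encoding the size depends only on the code point's value.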