A few questiosn about encoding

Started by	Νικόλαος Κούρας <nikos.gr33k@gmail.com>
First post	2013-06-09 03:44 -0700
Last post	2013-06-14 10:28 +0300
Articles	20 on this page of 110 — 36 participants

  A few questiosn about encoding Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 03:44 -0700
    Re: A few questiosn about encoding Fábio Santos <fabiosantosart@gmail.com> - 2013-06-09 13:18 +0100
    Re: A few questiosn about encoding Nobody <nobody@nowhere.com> - 2013-06-09 18:01 +0100
    Re: A few questiosn about encoding Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-06-09 19:12 +0200
      Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-12 09:09 +0000
        Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-12 09:24 +0000
          Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-12 14:23 +0300
            Re: A few questiosn about encoding Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2013-06-12 14:52 +0200
            Re: A few questiosn about encoding Nobody <nobody@nowhere.com> - 2013-06-12 21:30 +0100
              Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 01:40 +0000
                Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 12:01 +1000
                  Re: A few questiosn about encoding Nobody <nobody@nowhere.com> - 2013-06-13 11:02 +0100
              Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 09:21 +0300
                Re: A few questiosn about encoding jmfauth <wxjmfauth@gmail.com> - 2013-06-12 23:28 -0700
                Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 16:48 +1000
            Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 00:13 +0000
              Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 09:09 +0300
                Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 07:11 +0000
                  Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 10:42 +0300
                    Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 17:58 +1000
                      Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 11:08 +0300
                        Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 18:20 +1000
                          Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 12:41 +0300
                            Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 11:49 +0000
                              Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 17:19 +0300
                                Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 11:00 +1000
                                  Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 09:59 +0300
                                    Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 20:14 +1000
                                      Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 16:58 +0300
                                        Re: A few questiosn about encoding Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-14 11:21 -0400
                                          Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 18:26 +0300
                                            Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-15 03:03 +1000
                                              Re: A few questiosn about encoding Walter Hurry <walterhurry@lavabit.com> - 2013-06-14 23:32 +0000
                                        Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-15 10:26 +1000
                                        Re: A few questiosn about encoding Denis McMahon <denismfmcmahon@gmail.com> - 2013-06-15 06:34 +0000
                                          Re: A few questiosn about encoding Grant Edwards <invalid@invalid.invalid> - 2013-06-15 14:44 +0000
                                            Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 17:49 +0300
                                              Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-15 15:30 +0000
                                            Re: A few questiosn about encoding Roy Smith <roy@panix.com> - 2013-06-15 10:59 -0400
                                              Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 18:14 +0300
                                                Re: A few questiosn about encoding Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-15 11:35 -0400
                                        Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 22:26 +0300
                                          Re: A few questiosn about encoding Benjamin Schollnick <benjamin@schollnick.net> - 2013-06-15 16:35 -0400
                                          Re: A few questiosn about encoding Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-06-16 15:45 +0200
                        Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 09:36 +0200
                          Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 10:49 +0300
                            Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 10:22 +0200
                              Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 11:37 +0300
                                Don't feed the troll... (was: Re: A few questiosn about encoding) Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 11:06 +0200
                                  Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 12:32 +0300
                                    Re: Don't feed the troll... Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 13:09 +0200
                                      Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:36 +0300
                                        Re: Don't feed the troll... Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-14 08:44 -0400
                                        Re: Don't feed the troll... Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 15:25 +0200
                                          Re: Don't feed the troll... Neil Cerutti <neilc@norwich.edu> - 2013-06-14 15:54 +0000
                                    Re: Don't feed the troll... Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 12:15 +0200
                                    Re: Don't feed the troll... Guy Scree <nobody@nowhere.com> - 2013-06-14 18:50 -0400
                                    Re: Don't feed the troll... Denis McMahon <denismfmcmahon@gmail.com> - 2013-06-15 06:31 +0000
                                      Re: Don't feed the troll... Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-06-15 13:04 -0400
                                    Re: Don't feed the troll... Guy Scree <nobody@nowhere.com> - 2013-06-17 16:15 -0400
                                      Re: Don't feed the troll... Chris Angelico <rosuav@gmail.com> - 2013-06-18 07:46 +1000
                                Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 20:19 +1000
                                  Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:41 +0300
                                Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Fábio Santos <fabiosantosart@gmail.com> - 2013-06-14 11:20 +0100
                                  Re: Don't feed the troll... (was: Re: A few questiosn about encoding) rusi <rustompmody@gmail.com> - 2013-06-14 04:51 -0700
                                    Re: Don't feed the help-vampire rusi <rustompmody@gmail.com> - 2013-06-14 05:09 -0700
                                      Re: Don't feed the help-vampire Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 14:31 +0200
                                      Re: Don't feed the help-vampire Ian Kelly <ian.g.kelly@gmail.com> - 2013-06-14 10:51 -0600
                                    Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:50 +0300
                                      Re: Don't feed the troll... Zero Piraeus <schesis@gmail.com> - 2013-06-14 09:33 -0400
                                  Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:45 +0300
                                    Re: Don't feed the troll... Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 14:58 +0200
                                    Re: Don't feed the troll... Fábio Santos <fabiosantosart@gmail.com> - 2013-06-14 14:25 +0100
                                    Re: Don't feed the troll... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-14 17:12 +0100
                                Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 12:50 +0200
                                  Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:59 +0300
                                    Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 15:52 +0200
                                    Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-15 10:28 +1000
                                    Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-17 08:49 +0200
                                Re: Don't feed the troll... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-14 12:57 +0100
                                Re: Don't feed the troll... (was: Re: A few questiosn about encoding) "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 13:13 -0400
                                Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Chris Angelico <rosuav@gmail.com> - 2013-06-15 03:31 +1000
                                  Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Grant Edwards <invalid@invalid.invalid> - 2013-06-14 19:40 +0000
                                Re: Don't feed the troll "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 13:56 -0400
                                Re: Don't feed the troll Tim Chase <python.list@tim.thechases.com> - 2013-06-14 14:00 -0500
                                Re: Don't feed the troll "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 15:17 -0400
                                Re: Don't feed the troll... Ben Finney <ben+python@benfinney.id.au> - 2013-06-15 10:42 +1000
                  Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-19 18:46 -0700
                    Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-20 06:26 +0000
                      Re: A few questiosn about encoding MRAB <python@mrabarnett.plus.com> - 2013-06-20 12:43 +0100
                        Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-20 09:27 -0700
                          Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 02:37 +1000
                          Re: A few questiosn about encoding MRAB <python@mrabarnett.plus.com> - 2013-06-20 18:17 +0100
                            Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-23 08:51 -0700
                              Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-23 16:30 +0000
                                Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-25 13:16 -0700
                          Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 03:21 +1000
                          Re: A few questiosn about encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-20 20:43 +0100
                      Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-20 06:40 -0700
                        Re: A few questiosn about encoding Andrew Berg <robotsondrugs@gmail.com> - 2013-06-20 09:04 -0500
                          Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-20 08:12 -0700
                            Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 01:26 +1000
                            Re: A few questiosn about encoding Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2013-06-20 20:25 +0300
                        Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 01:28 +1000
                        Re: A few questiosn about encoding Andreas Perstinger <andipersti@gmail.com> - 2013-06-20 19:08 +0200
          Re: A few questiosn about encoding Dave Angel <davea@davea.name> - 2013-06-12 08:43 -0400
        Re: A few questiosn about encoding Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-06-13 18:46 -0400
          Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 08:34 +0300
            Re: A few questiosn about encoding Zero Piraeus <schesis@gmail.com> - 2013-06-14 02:00 -0400
              Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 10:28 +0300

#47448 — A few questiosn about encoding

From	Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date	2013-06-09 03:44 -0700
Subject	A few questiosn about encoding
Message-ID	<6dfa3707-80f4-407a-a109-66dbb0130513@googlegroups.com>

A few questiosn about encoding please:

>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
>> values up to 256? 

>Because then how do you tell when you need one byte, and when you need 
>two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
>characters, with ordinal values 0x4C and 0xFA, or one character with 
>ordinal value 0x4CFA? 

I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256. 


>> UTF-8 and UTF-16 and UTF-32 
>> I thought the number beside of UTF- was to declare how many bits the 
>> character set was using to store a character into the hdd, no? 

>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. 
>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit 
>values to make a surrogate pair. 

A surrogate pair is like hitting, for example, Ctrl-A, which is a combination that consists of 2 different characters? 
Is this what a surrogate is? A pair of 2 chars? 


>UTF-8 uses 8-bit values, but sometimes 
>it combines two, three or four of them to represent a single code-point. 

'a' to be utf8 encoded needs 1 byte to be stored? (since ordinal = 97) 
'α΄' to be utf8 encoded needs 2 bytes to be stored? (since ordinal is > 127) 
'a Chinese ideogram' to be utf8 encoded needs 4 bytes to be stored? (since ordinal > 65000) 

The amount of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table?

#47454

From	Fábio Santos <fabiosantosart@gmail.com>
Date	2013-06-09 13:18 +0100
Message-ID	<mailman.2915.1370780298.3114.python-list@python.org>
In reply to	#47448

On 9 Jun 2013 11:49, "Νικόλαος Κούρας" <nikos.gr33k@gmail.com> wrote:
>
> A few questiosn about encoding please:
>
> >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
> >> values up to 256?
>
> >Because then how do you tell when you need one byte, and when you need
> >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
> >characters, with ordinal values 0x4C and 0xFA, or one character with
> >ordinal value 0x4CFA?
>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant
> up to 256, not above 256.
>
>
> >> UTF-8 and UTF-16 and UTF-32
> >> I thought the number beside of UTF- was to declare how many bits the
> >> character set was using to store a character into the hdd, no?
>
> >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
> >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
> >values to make a surrogate pair.
>
> A surrogate pair is like hitting, for example, Ctrl-A, which is a
> combination that consists of 2 different characters?
> Is this what a surrogate is? A pair of 2 chars?
>
>
> >UTF-8 uses 8-bit values, but sometimes
> >it combines two, three or four of them to represent a single code-point.
>
> 'a' to be utf8 encoded needs 1 byte to be stored? (since ordinal = 97)
> 'α΄' to be utf8 encoded needs 2 bytes to be stored? (since ordinal is > 127)
> 'a Chinese ideogram' to be utf8 encoded needs 4 bytes to be stored?
> (since ordinal > 65000)
>
> The amount of bytes needed to store a character solely depends on the
> character's ordinal value in the Unicode table?

In short, a UTF-8 character takes 1 to 4 bytes. A UTF-16 character takes 2
or 4 bytes. A UTF-32 character always takes 4 bytes.

Converting characters to bytes is called encoding. The opposite,
converting bytes back to characters, is decoding. Python handles both
with the encode() and decode() methods. You normally don't need to care
about this kind of thing.
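
As a rough illustration (a sketch assuming Python 3, where str holds text and bytes holds encoded data; the sample characters and the names `samples` and `sizes` are arbitrary choices, not something from the thread), the byte counts can be checked directly:

```python
# Sample characters, written as escapes so the source stays pure ASCII.
samples = {
    "a": "a",                # U+0061, ASCII
    "alpha": "\u03b1",       # U+03B1, Greek small letter alpha
    "cjk": "\u4e2d",         # U+4E2D, a common CJK ideograph
    "bold_b": "\U0001d401",  # U+1D401, outside the BMP
}

# Bytes per character in each encoding. The explicit "-be" codecs are used
# so that no byte order mark is added to the output.
sizes = {
    name: tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be"))
    for name, ch in samples.items()
}
for name, (u8, u16, u32) in sizes.items():
    print(f"{name}: utf-8={u8} utf-16={u16} utf-32={u32}")

# encode() turns text (str) into bytes; decode() turns bytes back into text.
data = samples["alpha"].encode("utf-8")
text = data.decode("utf-8")
```

The plain "utf-16" and "utf-32" codecs prepend a byte order mark when encoding, which would inflate the counts by 2 and 4 bytes respectively.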

#47470

From	Nobody <nobody@nowhere.com>
Date	2013-06-09 18:01 +0100
Message-ID	<pan.2013.06.09.17.01.19.553000@nowhere.com>
In reply to	#47448

On Sun, 09 Jun 2013 03:44:57 -0700, Νικόλαος Κούρας wrote:

>>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
>>> values up to 256? 
> 
>>Because then how do you tell when you need one byte, and when you need 
>>two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
>>characters, with ordinal values 0x4C and 0xFA, or one character with 
>>ordinal value 0x4CFA? 
> 
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I
> meant up to 256, not above 256.

But then you've used up all 256 possible bytes for storing the first 256
characters, and there aren't any left for use in multi-byte sequences.

You need some means to distinguish between a single-byte character and an
individual byte within a multi-byte sequence.

UTF-8 does that by allocating specific ranges to specific purposes.
0x00-0x7F are single-byte characters, 0x80-0xBF are continuation bytes of
multi-byte sequences, 0xC0-0xFF are leading bytes of multi-byte sequences.
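
A minimal sketch of that classification, assuming Python 3 (the function name `classify_utf8_byte` is made up for illustration). It follows the three ranges exactly as listed; note that current UTF-8 never actually uses 0xC0, 0xC1 or 0xF5-0xFF as leading bytes, which this sketch does not check:

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte value (0-255) by the role it can play in UTF-8."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "single"        # 0xxxxxxx: a complete one-byte character
    if b <= 0xBF:
        return "continuation"  # 10xxxxxx: second or later byte of a sequence
    return "lead"              # 11xxxxxx: first byte of a multi-byte sequence

# 'a' followed by Greek alpha (U+03B1), which encodes to three bytes in total.
roles = [classify_utf8_byte(b) for b in "a\u03b1".encode("utf-8")]
print(roles)
```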

This scheme has the advantage of making UTF-8 non-modal, i.e. if a byte is
corrupted, added or removed, it will only affect the character containing
that particular byte; the decoder can re-synchronise at the beginning of
the following character.

OTOH, with encodings such as UTF-16, UTF-32 or ISO-2022, adding or
removing a byte will result in desynchronisation, with all subsequent
characters being corrupted.
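
The difference can be seen in a small experiment (a sketch assuming Python 3 and its built-in codecs with errors="replace"; the Greek sample string is an arbitrary choice): drop the first byte of the same text in both encodings and decode what is left.

```python
text = "\u03b1\u03b2\u03b3\u03b4"  # Greek alpha, beta, gamma, delta
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")

# Lose the first byte of each encoded string, then decode leniently:
# undecodable input is replaced by U+FFFD instead of raising an error.
damaged8 = utf8[1:].decode("utf-8", errors="replace")
damaged16 = utf16[1:].decode("utf-16-be", errors="replace")

# UTF-8: only the first character is lost, the rest decodes as before.
print(ascii(damaged8))
# UTF-16: every 16-bit unit is now misaligned, so no character survives.
print(ascii(damaged16))
```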

> A surrogate pair is like hitting, for example, Ctrl-A, which is a
> combination that consists of 2 different characters? Is this
> what a surrogate is? A pair of 2 chars?

A surrogate pair is a pair of 16-bit codes used to represent a single
Unicode character whose code is greater than 0xFFFF.

The 2048 codepoints from 0xD800 to 0xDFFF inclusive aren't used to
represent characters, but "surrogates". Unicode characters with codes
in the range 0x10000-0x10FFFF are represented in UTF-16 as a pair of
surrogates. First, 0x10000 is subtracted from the code, giving a value in
the range 0-0xFFFFF (20 bits). The top ten bits are added to 0xD800 to
give a value in the range 0xD800-0xDBFF, while the bottom ten bits are
added to 0xDC00 to give a value in the range 0xDC00-0xDFFF.
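
The arithmetic in that paragraph can be sketched as follows (assuming Python 3; `to_surrogate_pair` is an illustrative name, not a standard library function):

```python
from typing import Tuple


def to_surrogate_pair(codepoint: int) -> Tuple[int, int]:
    """Split a code point above 0xFFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only code points 0x10000-0x10FFFF need surrogates")
    offset = codepoint - 0x10000     # 20-bit value in the range 0-0xFFFFF
    high = 0xD800 + (offset >> 10)   # top ten bits    -> 0xD800-0xDBFF
    low = 0xDC00 + (offset & 0x3FF)  # bottom ten bits -> 0xDC00-0xDFFF
    return high, low


# U+1D401 MATHEMATICAL BOLD CAPITAL B
high, low = to_surrogate_pair(0x1D401)
print(hex(high), hex(low))
```

The result can be compared with what Python's own encoder produces via chr(0x1D401).encode("utf-16-be").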

Because the codes used for surrogates aren't valid as individual
characters, scanning a string for a particular character won't
accidentally match part of a multi-word character.

> 'a' to be utf8 encoded needs 1 byte to be stored? (since ordinal = 97)
> 'α΄' to be utf8 encoded needs 2 bytes to be stored? (since ordinal is > 127)
> 'a Chinese ideogram' to be utf8 encoded needs 4 bytes to be stored?
> (since ordinal > 65000)

Most Chinese, Japanese and Korean (CJK) characters have codepoints within
the BMP (i.e. <= 0xFFFF), so they only require 3 bytes in UTF-8. The
codepoints above the BMP are mostly for archaic ideographs (those no
longer in normal use), mathematical symbols, dead languages, etc.

> The amount of bytes needed to store a character solely depends on the
> character's ordinal value in the Unicode table?

Yes. UTF-8 is essentially a mechanism for representing 31-bit unsigned
integers such that smaller integers require fewer bytes than larger
integers (subsequent revisions of Unicode cap the range of possible
codepoints to 0x10FFFF, as that's all that UTF-16 can handle).
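
To make the "smaller integers need fewer bytes" idea concrete, here is a hand-rolled encoder for a single code point. This is only a sketch assuming Python 3 and the current 4-byte limit; the name `utf8_encode` is invented here, and real code should simply call str.encode("utf-8").

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one code point as UTF-8 by packing its bits by hand."""
    if not 0 <= codepoint <= 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if codepoint <= 0x7F:
        return bytes([codepoint])                       # 0xxxxxxx
    if codepoint <= 0x7FF:
        return bytes([0xC0 | (codepoint >> 6),          # 110xxxxx
                      0x80 | (codepoint & 0x3F)])       # 10xxxxxx
    if codepoint <= 0xFFFF:
        return bytes([0xE0 | (codepoint >> 12),         # 1110xxxx
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    return bytes([0xF0 | (codepoint >> 18),             # 11110xxx
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])


# Compare with the built-in encoder for a code point from each size class.
for cp in (0x61, 0x3B1, 0x4E2D, 0x1D401):
    print(hex(cp), utf8_encode(cp).hex(), chr(cp).encode("utf-8").hex())
```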

#47472

From	Chris “Kwpolska” Warrick <kwpolska@gmail.com>
Date	2013-06-09 19:12 +0200
Message-ID	<mailman.2923.1370797972.3114.python-list@python.org>
In reply to	#47448

On Sun, Jun 9, 2013 at 12:44 PM, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
> A few questiosn about encoding please:
>
>>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>>> values up to 256?
>
>>Because then how do you tell when you need one byte, and when you need
>>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>>characters, with ordinal values 0x4C and 0xFA, or one character with
>>ordinal value 0x4CFA?
>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.

It is required so the computer can know where characters begin.
0x0080 (first non-ASCII character) becomes 0xC280 in UTF-8.  Further
details here: http://en.wikipedia.org/wiki/UTF-8#Description

>>> UTF-8 and UTF-16 and UTF-32
>>> I thought the number beside of UTF- was to declare how many bits the
>>> character set was using to store a character into the hdd, no?
>
>>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>>values to make a surrogate pair.
>
> A surrogate pair is like hitting, for example, Ctrl-A, which is a combination that consists of 2 different characters?
> Is this what a surrogate is? A pair of 2 chars?

http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B10000_to_U.2B10FFFF

Long story short: codepoint - 0x10000 (up to 20 bits) → two 10-bit
numbers → 0xD800 + first_half, 0xDC00 + second_half.  Rephrasing:

We take MATHEMATICAL BOLD CAPITAL B (U+1D401).  If you have UTF-8: 𝐁

It is over 0xFFFF, so we need to use a surrogate pair.  Subtracting
0x10000, we end up with 0xD401, or 0b1101010000000001.  Neither
representation is usable as it stands, as we have a 16-bit number, not
a 20-bit one.  We throw in some leading zeroes and end up with
0b00001101010000000001.  Split it in half and we get 0b0000110101 and
0b0000000001, which we can now shorten to 0b110101 and 0b1, or
translate to hex as 0x0035 and 0x0001.  0xD800 + 0x0035 and 0xDC00 +
0x0001 → 0xD835 0xDC01.  Type it into python and:

>>> b'\xD8\x35\xDC\x01'.decode('utf-16be')
'𝐁'

And before you ask: that “BE” stands for Big-Endian.  Little-Endian
would mean reversing the bytes in each 16-bit code unit, which would
make it '\x35\xD8\x01\xDC' (this is easiest to see in the first 256
characters: 'a' is 0x6100 in a little-endian encoding).
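
A quick way to compare the two byte orders, as a sketch assuming Python 3.5 or later (for bytes.hex()); the variable names are illustrative:

```python
ch = "\U0001D401"  # MATHEMATICAL BOLD CAPITAL B

be = ch.encode("utf-16-be")      # high-order byte of each 16-bit unit first
le = ch.encode("utf-16-le")      # low-order byte of each 16-bit unit first
a_le = "a".encode("utf-16-le")   # U+0061 in little-endian order

print(be.hex(), le.hex(), a_le.hex())
```

Only the bytes inside each 16-bit unit swap places; the high surrogate still comes before the low surrogate in both byte orders.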

Another question you may ask: 0xD800…0xDFFF are reserved in Unicode
for the purposes of UTF-16, so there are no conflicts.

>>UTF-8 uses 8-bit values, but sometimes
>>it combines two, three or four of them to represent a single code-point.
>
> 'a' to be utf8 encoded needs 1 byte to be stored? (since ordinal = 97)
> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )

yup.  α is at 0x03B1, or 945 decimal.

> 'a Chinese ideogram' to be utf8 encoded needs 4 bytes to be stored? (since ordinal > 65000)

Not necessarily, as CJK characters start at U+2E80, which is in the
3-byte range (0x0800 through 0xFFFF) — the table is here:
http://en.wikipedia.org/wiki/UTF-8#Description

--
Kwpolska <http://kwpolska.tk> | GPG KEY: 5EAAEA16
stop html mail                | always bottom-post
http://asciiribbon.org        | http://caliburn.nl/topposting.html

#47762

From	Νικόλαος Κούρας <support@superhost.gr>
Date	2013-06-12 09:09 +0000
Message-ID	<kp9drh$1o0t$1@news.ntua.gr>
In reply to	#47472

>> (*) infact UTF8 also indicates the end of each character

> Up to a point.  The initial byte encodes the length and the top few
> bits, but the subsequent octets aren’t distinguishable as final in
> isolation.  0x80-0xBF can all be either medial or final.


So, the first high-bits are a directive that UTF-8 uses to know how many 
bytes each character is being represented as.

0-127 codepoints(characters) use 1 bit to signify they need 1 bit for 
storage and the rest 7 bits to actually store the character ?

while

128-256 codepoints(characters) use 2 bit to signify they need 2 bits for 
storage and the rest 14 bits to actually store the character ?

Isn't 14 bits way too many to store a character?

#47767

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-06-12 09:24 +0000
Message-ID	<51b83e5a$0$29998$c3e8da3$5496439d@news.astraweb.com>
In reply to	#47762

On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:

> Isn't 14 bits way too many to store a character?

No.

There are 1114112 possible code points in Unicode. (And in Japan, they 
sometimes use TRON instead of Unicode, which has even more.)

If you list out all the combinations of 14 bits:

0000 0000 0000 00
0000 0000 0000 01
0000 0000 0000 10
0000 0000 0000 11
[...]
1111 1111 1111 10
1111 1111 1111 11

you will see that there are only 16384 (2**14) such values. You can't 
fit 1114112 code points into just 16384 values.
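
The counts are easy to check in the interpreter (a minimal sketch assuming Python 3; the variable names are made up for illustration):

```python
values_in_14_bits = 2 ** 14            # every combination of 14 bits
unicode_code_points = 0x10FFFF + 1     # U+0000 through U+10FFFF inclusive
bits_needed = (0x10FFFF).bit_length()  # bits needed for the largest code point

print(values_in_14_bits, unicode_code_points, bits_needed)
```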



-- 
Steven

#47783

From	Νικόλαος Κούρας <support@superhost.gr>
Date	2013-06-12 14:23 +0300
Message-ID	<kp9lo6$9l5$2@news.ntua.gr>
In reply to	#47767

On 12/6/2013 12:24 μμ, Steven D'Aprano wrote:
> On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:
>
>> Isn't 14 bits way too many to store a character?
>
> No.
>
> There are 1114112 possible code points in Unicode. (And in Japan, they
> sometimes use TRON instead of Unicode, which has even more.)
>
> If you list out all the combinations of 14 bits:
>
> 0000 0000 0000 00
> 0000 0000 0000 01
> 0000 0000 0000 10
> 0000 0000 0000 11
> [...]
> 1111 1111 1111 10
> 1111 1111 1111 11
>
> you will see that there are only 16384 (2**14) such values. You can't
> fit 1114112 code points into just 16384 values.
>
>
>
Thanks Steven,
So, how many bytes does UTF-8 store for codepoints > 127?

example for codepoint 256, 1345, 16474 ?

#47800

From	Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com>
Date	2013-06-12 14:52 +0200
Message-ID	<pg7m8a-mto.ln1@satorlaser.homedns.org>
In reply to	#47783

Am 12.06.2013 13:23, schrieb Νικόλαος Κούρας:
> So, how many bytes does UTF-8 store for codepoints > 127?

What has your research turned up? I personally consider it lazy and 
disrespectful to be given lots of pointers that you could use for further 
research and then ask for more info before you have even followed those links.


> example for codepoint 256, 1345, 16474 ?

Yes, examples exist. Gee, if there only was an information network that 
you could access and where you could locate information on various 
programming-related topics somehow. Seriously, someone should invent 
this thing! But still, even without it, you have all the tools (i.e. 
Python) in your hand to generate these examples yourself! Check out ord, 
bin, encode, decode for a start.
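
For instance, a short sketch along those lines for the three code points asked about (assuming Python 3; the `results` dictionary is just an illustrative way to keep the output around):

```python
results = {}
for codepoint in (256, 1345, 16474):
    ch = chr(codepoint)              # code point -> one-character string
    encoded = ch.encode("utf-8")     # string -> bytes
    results[codepoint] = encoded
    bits = " ".join(bin(byte) for byte in encoded)
    print(f"{codepoint} = {hex(ord(ch))}: {len(encoded)} bytes, {bits}")
    # decode() reverses encode()
    assert encoded.decode("utf-8") == ch
```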


Uli

#47844

From	Nobody <nobody@nowhere.com>
Date	2013-06-12 21:30 +0100
Message-ID	<pan.2013.06.12.20.30.22.31000@nowhere.com>
In reply to	#47783

On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:

> So, how many bytes does UTF-8 store for codepoints > 127?

U+0000..U+007F      1 byte
U+0080..U+07FF      2 bytes
U+0800..U+FFFF      3 bytes
U+10000..U+10FFFF   4 bytes

So, 1 byte for ASCII, 2 bytes for other Latin characters, Greek, Cyrillic,
Arabic, and Hebrew, 3 bytes for Chinese/Japanese/Korean, 4 bytes for dead
languages and mathematical symbols.

The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a total
of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than 20 bits).
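
The table above translates almost directly into code. The following is a sketch assuming Python 3, with `utf8_length` as a made-up helper name, cross-checked against the built-in encoder at the edges of each range:

```python
def utf8_length(codepoint: int) -> int:
    """Number of bytes UTF-8 uses for one code point, following the table."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if codepoint <= 0x7F:
        return 1
    if codepoint <= 0x7FF:
        return 2
    if codepoint <= 0xFFFF:
        return 3
    return 4


# First and last code point of each row in the table.
boundaries = (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF)
lengths = {cp: utf8_length(cp) for cp in boundaries}
agrees = all(len(chr(cp).encode("utf-8")) == n for cp, n in lengths.items())
print(lengths, agrees)
```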

#47883

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-06-13 01:40 +0000
Message-ID	<51b9231b$0$29997$c3e8da3$5496439d@news.astraweb.com>
In reply to	#47844

On Wed, 12 Jun 2013 21:30:23 +0100, Nobody wrote:

> The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a
> total of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than
> 20 bits).

Same with UTF-8 and UTF-32, both of which are limited to U+10FFFF because 
that is what Unicode is limited to.

The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but 
that's not UTF-8, that's UTF-8-plus-extra-codepoints. Likewise the 
mechanism of UTF-32 could go up to 0xFFFFFFFF, but doing so means you 
don't have Unicode chars any more, and hence your byte-string is not 
valid UTF-32:

py> b = b'\xFF'*8
py> b.decode('UTF-32')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3: 
codepoint not in range(0x110000)

-- 
Steven

#47886

From	Chris Angelico <rosuav@gmail.com>
Date	2013-06-13 12:01 +1000
Message-ID	<mailman.3153.1371088918.3114.python-list@python.org>
In reply to	#47883

On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
> that's not UTF-8, that's UTF-8-plus-extra-codepoints.

And a proper UTF-8 decoder will reject "\xC0\x80" and "\xed\xa0\x80",
even though mathematically they would translate into U+0000 and U+D800
respectively. The UTF-16 *mechanism* is limited to no more than
Unicode has currently used, but I'm left wondering if that's actually
the other way around - that Unicode planes were deemed to stop at the
point where UTF-16 can't encode any more. Not that it matters; with
most of the current planes completely unallocated, it seems unlikely
we'll be needing more.
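
Python 3's own decoder is strict in this sense, which can be checked with a small sketch (the helper name `is_valid_utf8` is invented for the example):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the strict built-in decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


overlong_nul = b"\xC0\x80"           # two-byte form of U+0000, not allowed
encoded_surrogate = b"\xED\xA0\x80"  # would be U+D800, a surrogate
plain_nul = b"\x00"                  # the only valid encoding of U+0000

print(is_valid_utf8(overlong_nul),
      is_valid_utf8(encoded_surrogate),
      is_valid_utf8(plain_nul))
```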

ChrisA

#47931

From	Nobody <nobody@nowhere.com>
Date	2013-06-13 11:02 +0100
Message-ID	<pan.2013.06.13.10.02.38.693000@nowhere.com>
In reply to	#47886

On Thu, 13 Jun 2013 12:01:55 +1000, Chris Angelico wrote:

> On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
>> that's not UTF-8, that's UTF-8-plus-extra-codepoints.
> 
> And a proper UTF-8 decoder will reject "\xC0\x80" and "\xed\xa0\x80", even
> though mathematically they would translate into U+0000 and U+D800
> respectively. The UTF-16 *mechanism* is limited to no more than Unicode
> has currently used, but I'm left wondering if that's actually the other
> way around - that Unicode planes were deemed to stop at the point where
> UTF-16 can't encode any more.

Indeed. 5-byte and 6-byte sequences were originally part of the UTF-8
specification, allowing for 31 bits. Later revisions of the standard
imposed the UTF-16 limit on Unicode as a whole.
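
The arithmetic behind both limits can be worked through in a few lines. This is only a sketch of the counting, assuming the standard surrogate ranges (D800..DBFF and DC00..DFFF) and the original six-byte UTF-8 layout; the variable names are illustrative.

```python
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1   # 1024 possible lead values
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1    # 1024 possible trail values

# Each surrogate pair addresses one code point above the 16-bit range.
pairs = HIGH_SURROGATES * LOW_SURROGATES
highest = 0x10000 + pairs - 1
print(hex(highest))

# Payload bits in the old 6-byte UTF-8 form:
# 1 bit in the lead byte plus 6 bits in each of 5 continuation bytes.
six_byte_bits = 1 + 5 * 6
print(six_byte_bits)
```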


#47904

From	Νικόλαος Κούρας <support@superhost.gr>
Date	2013-06-13 09:21 +0300
Message-ID	<kpboda$qvk$3@news.ntua.gr>
In reply to	#47844

On 12/6/2013 11:30 μμ, Nobody wrote:
> On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
>
>> So, how many bytes does UTF-8 store for codepoints > 127 ?
>
> U+0000..U+007F  1 byte
> U+0080..U+07FF  2 bytes
> U+0800..U+FFFF  3 bytes
> >=U+10000       4 bytes

'U' stands for a Unicode code-point, which means a character, right?

How can you tell up to what character UTF-8 needs 1 byte or 2 
bytes or 3?

And some of the bytes' bits are used to tell where a code-point's 
representation stops, right? I mean, if we have a code-point that needs 
2 bytes to be stored, then the high bit must be set to 1 to signify that 
this character's encoding stops at 2 bytes.

I just know that 2^8 = 256, so at first look that's 256 places, which means 
256 positions to hold a code-point, which in turn means a character.

We take the high bit out and then we have 2^7, which is enough positions 
for 0-127 standard ASCII. The high bit is set to '0' to signify that the char is 
encoded in 1 byte.

Please tell me that I have understood correctly so far.

But how about for 2 or 3 or 4 bytes?

Am I saying it correctly?
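
For reference, the length signalling asked about here can be made visible by encoding code points by hand. This is a minimal sketch, assuming Python 3 and the standard UTF-8 bit layout; utf8_encode is a hypothetical helper written for illustration, and it deliberately skips validation (surrogates and values above U+10FFFF are not rejected).

```python
def utf8_encode(cp):
    """Encode one code point by hand, following the standard UTF-8 bit layout."""
    if cp < 0x80:
        return bytes([cp])                        # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),           # 110xxxxx
                      0x80 | (cp & 0x3F)])        # 10xxxxxx
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),          # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),  # 10xxxxxx
                      0x80 | (cp & 0x3F)])        # 10xxxxxx
    return bytes([0xF0 | (cp >> 18),              # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),     # 10xxxxxx
                  0x80 | ((cp >> 6) & 0x3F),      # 10xxxxxx
                  0x80 | (cp & 0x3F)])            # 10xxxxxx

# Show the bit pattern of each byte for a 1-, 2-, 3- and 4-byte example.
for cp in (0x41, 0x3B1, 0x405A, 0x1F600):
    encoded = utf8_encode(cp)
    print(hex(cp), ' '.join(format(b, '08b') for b in encoded))
```

The number of leading 1 bits in the first byte gives the length of the sequence (none for a single byte), and every following byte starts with 10, so a decoder always knows where a character ends.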


#47905

From	jmfauth <wxjmfauth@gmail.com>
Date	2013-06-12 23:28 -0700
Message-ID	<7d1f7756-31f4-4e0f-a5d3-6b736c2eef3c@k3g2000vbn.googlegroups.com>
In reply to	#47904

------

UTF-8, Unicode (consortium): 1 to 4 *Unicode Transformation Unit*

UTF-8, ISO 10646: 1 to 6 *Unicode Transformation Unit*

(still current, unless really freshly modified)

jmf


#47910

From	Chris Angelico <rosuav@gmail.com>
Date	2013-06-13 16:48 +1000
Message-ID	<mailman.3167.1371106123.3114.python-list@python.org>
In reply to	#47904

On Thu, Jun 13, 2013 at 4:21 PM, Νικόλαος Κούρας <support@superhost.gr> wrote:
> How can you tell up to what character UTF-8 needs 1 byte or 2
> bytes or 3?

You look up Wikipedia, using the handy links that have been put to you
MULTIPLE TIMES.

ChrisA


#47866

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-06-13 00:13 +0000
Message-ID	<51b90ead$0$29997$c3e8da3$5496439d@news.astraweb.com>
In reply to	#47783

On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:

> So, how many bytes does UTF-8 store for codepoints > 127 ?

Two, three or four, depending on the codepoint.

> example for codepoint 256, 1345, 16474 ?

You can do this yourself. I have already given you enough information in 
previous emails to answer this question on your own, but here it is again:

Open an interactive Python session, and run this code:

c = ord(16474)
len(c.encode('utf-8'))

That will tell you how many bytes are used for that example.

-- 
Steven


#47902

From	Νικόλαος Κούρας <support@superhost.gr>
Date	2013-06-13 09:09 +0300
Message-ID	<kpbnmg$qvk$2@news.ntua.gr>
In reply to	#47866

On 13/6/2013 3:13 πμ, Steven D'Aprano wrote:
> On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
>
>> So, how many bytes does UTF-8 store for codepoints > 127 ?
>
> Two, three or four, depending on the codepoint.

The number of bytes needed by UTF-8 to store a code-point (character) 
depends on the ordinal value of the code-point in the Unicode charset, 
correct?

If this is correct, then the higher the ordinal value (which is a decimal 
integer) in the Unicode charset, the more bytes are needed for storage.

It's like the bigger a decimal integer is, the bigger the binary number it 
produces.

Is this correct?

>> example for codepoint 256, 1345, 16474 ?
>
> You can do this yourself. I have already given you enough information in
> previous emails to answer this question on your own, but here it is again:
>
> Open an interactive Python session, and run this code:
>
> c = ord(16474)
> len(c.encode('utf-8'))
>
>
> That will tell you how many bytes are used for that example.
This is actually wrong.

ord()'s argument must be a character whose ordinal value we want.

 >>> chr(16474)
'䁚'

Some Chinese symbol.
So code-point '䁚' has a Unicode ordinal value of 16474, correct?

Then, encoding this glyph's ordinal value to binary gives us 
the following bytes:

 >>> bin(16474).encode('utf-8')
b'0b100000001011010'

Now, we take two symbols out:

the 'b' prefix, which is there to tell us that we are looking at a bytes 
object, as well as
the '0b' prefix, which is there to tell us that we are looking at a binary 
representation of a bytes object.

Thus, we count 15 bits left.
So it says 15 bits, which is 1 bit less than 2 bytes.
Are the above statements correct, please?

but thinking this through more and more:

 >>> chr(16474).encode('utf-8')
b'\xe4\x81\x9a'
 >>> len(b'\xe4\x81\x9a')
3

It seems that the bytestring the encode process produces is of length 3.

So I take it that it is 3 bytes?

But there is a mismatch between what bin(16474).encode('utf-8') and 
chr(16474).encode('utf-8') are telling us here.

Care to explain that too, please?


#47912

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-06-13 07:11 +0000
Message-ID	<51b9708b$0$29872$c3e8da3$5496439d@news.astraweb.com>
In reply to	#47902

On Thu, 13 Jun 2013 09:09:19 +0300, Νικόλαος Κούρας wrote:

> On 13/6/2013 3:13 πμ, Steven D'Aprano wrote:

>> Open an interactive Python session, and run this code:
>>
>> c = ord(16474)
>> len(c.encode('utf-8'))
>>
>>
>> That will tell you how many bytes are used for that example.
> This is actually wrong.
> 
> ord()'s argument must be a character whose ordinal value we want.

Gah! 

That's twice I've screwed that up. Sorry about that!

>  >>> chr(16474)
> '䁚'
> 
> Some Chinese symbol.
> So code-point '䁚' has a Unicode ordinal value of 16474, correct?

Correct.

> Then, encoding this glyph's ordinal value to binary gives us
> the following bytes:
> 
>  >>> bin(16474).encode('utf-8')
> b'0b100000001011010'

No! That creates a string from 16474 in base two:

'0b100000001011010'

The leading 0b is just syntax to tell you "this is base 2, not base 8 
(0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.

Then you encode the string '0b100000001011010' into UTF-8. There are 17 
characters in this string, and they are all ASCII characters so they take 
up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form). In 
hex form, they are:

b'\x30\x62\x31\x30\x30\x30\x30\x30\x30\x30\x31\x30\x31\x31\x30\x31\x30'

which takes up a lot more room, which is why Python prefers to show ASCII 
characters as characters rather than as hex.

What you want is:

chr(16474).encode('utf-8')
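
Applied to the three code points asked about (256, 1345 and 16474), that looks like the sketch below. It assumes Python 3; the name lengths is chosen here for illustration.

```python
# Map each code point to the number of bytes its UTF-8 encoding takes.
lengths = {}
for codepoint in (256, 1345, 16474):
    encoded = chr(codepoint).encode('utf-8')
    lengths[codepoint] = len(encoded)
    print(codepoint, len(encoded), encoded)
```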

[...]
> Thus, we count 15 bits left.
> So it says 15 bits, which is 1 bit less than 2 bytes. Are the above
> statements correct, please?

No. There are 17 BYTES there. The string "0" doesn't get turned into a 
single bit. It still takes up a full byte, 0x30, which is 8 bits.
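
The byte count is easy to confirm in the interpreter. A small sketch, assuming Python 3.5 or later for bytes.hex(); the variable names are illustrative.

```python
text = bin(16474)            # a str made of the characters '0', 'b' and '1'
data = text.encode('utf-8')  # one byte per ASCII character

print(type(text).__name__, len(text))
print(len(data), data.hex())
```

Each '0' character becomes the byte 0x30 and each '1' becomes 0x31, so the 17 characters occupy 17 bytes, or 136 bits.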

> but thinking this through more and more:
> 
>  >>> chr(16474).encode('utf-8')
> b'\xe4\x81\x9a'
>  >>> len(b'\xe4\x81\x9a')
> 3
> 
> it seems that the bytestring the encode process produces is of length 3.

Correct! Now you have got the right idea.

-- 
Steven


#47917

From	Νικόλαος Κούρας <support@superhost.gr>
Date	2013-06-13 10:42 +0300
Message-ID	<kpbt5i$7bj$1@news.ntua.gr>
In reply to	#47912

On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:

>>   >>> chr(16474)
>> '䁚'
>>
>> Some Chinese symbol.
>> So code-point '䁚' has a Unicode ordinal value of 16474, correct?
>
> Correct.
>
>
>> Then, encoding this glyph's ordinal value to binary gives us
>> the following bytes:
>>
>>   >>> bin(16474).encode('utf-8')
>> b'0b100000001011010'

An observation here that I would like you to confirm as valid.

1. A code-point and the code-point's ordinal value are associated in the 
Unicode charset. They have a so-called 1:1 mapping.

So, I was under the impression that encoding the code-point into 
UTF-8 was the same as encoding the code-point's ordinal value into UTF-8.

That is why I tried:
bin(16474).encode('utf-8') instead of chr(16474).encode('utf-8')

So, now I believe they are two different things.
The code-point *is what actually* needs to be encoded and *not* its 
ordinal value.

> The leading 0b is just syntax to tell you "this is base 2, not base 8
> (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.

But byte objects are represented as '\x' instead of the aforementioned 
'0x'. Why is that?
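
On the '\x' versus '0x' question, a short sketch (assuming Python 3) may help: 0x is the prefix of an integer literal written in base 16, while \x is an escape sequence used inside string and bytes literals to write one character or byte by its hex value.

```python
number = 0x5A      # int literal in base 16
data = b'\x5a'     # bytes literal: a single byte with the value 0x5A

print(number)             # an int, shown in base 10
print(data)               # printable ASCII bytes are displayed as characters
print(data[0] == number)  # indexing a bytes object gives an int
```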

 > No! That creates a string from 16474 in base two:
 > '0b100000001011010'

I disagree here.
16474 is a number in base 10. Doing bin(16474) we get the binary 
representation of the number 16474 and not a string.
Why do you say we receive a string while Python presents a binary number?

> Then you encode the string '0b100000001011010' into UTF-8. There are 17
> characters in this string, and they are all ASCII characters so they take
> up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form).

0b100000001011010 stands for a number in base 2 for me, not a string.
Have I understood something wrong?


#47919

From	Chris Angelico <rosuav@gmail.com>
Date	2013-06-13 17:58 +1000
Message-ID	<mailman.3172.1371110288.3114.python-list@python.org>
In reply to	#47917

On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας <support@superhost.gr> wrote:
> On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:
>> No! That creates a string from 16474 in base two:
>> '0b100000001011010'
>
> I disagree here.
> 16474 is a number in base 10. Doing bin(16474) we get the binary
> representation of the number 16474 and not a string.
> Why do you say we receive a string while Python presents a binary number?

You can disagree all you like. Steven cited a simple point of fact,
one which can be verified in any Python interpreter. Nikos, you are
flat wrong here; bin(16474) creates a string.
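
For anyone following along, the check takes a few lines in Python 3. This is just a sketch; the variable name s is arbitrary.

```python
s = bin(16474)

print(type(s))                       # the result is a str object
print(repr(s))                       # note the quotes around it
print(int(s, 2) == 16474)            # parsing the text gives the int back
print(0b100000001011010 == 16474)    # the same digits as an int literal
```

Written without quotes in source code, 0b100000001011010 is an integer literal; what bin() returns is text that happens to spell that literal.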

ChrisA

