| Started by | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| First post | 2013-06-09 03:44 -0700 |
| Last post | 2013-06-14 10:28 +0300 |
| Articles | 20 on this page of 110 — 36 participants |
A few questiosn about encoding Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 03:44 -0700
Re: A few questiosn about encoding Fábio Santos <fabiosantosart@gmail.com> - 2013-06-09 13:18 +0100
Re: A few questiosn about encoding Nobody <nobody@nowhere.com> - 2013-06-09 18:01 +0100
Re: A few questiosn about encoding Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-06-09 19:12 +0200
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-12 09:09 +0000
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-12 09:24 +0000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-12 14:23 +0300
Re: A few questiosn about encoding Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2013-06-12 14:52 +0200
Re: A few questiosn about encoding Nobody <nobody@nowhere.com> - 2013-06-12 21:30 +0100
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 01:40 +0000
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 12:01 +1000
Re: A few questiosn about encoding Nobody <nobody@nowhere.com> - 2013-06-13 11:02 +0100
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 09:21 +0300
Re: A few questiosn about encoding jmfauth <wxjmfauth@gmail.com> - 2013-06-12 23:28 -0700
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 16:48 +1000
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 00:13 +0000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 09:09 +0300
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 07:11 +0000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 10:42 +0300
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 17:58 +1000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 11:08 +0300
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 18:20 +1000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 12:41 +0300
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 11:49 +0000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 17:19 +0300
Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 11:00 +1000
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 09:59 +0300
Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 20:14 +1000
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 16:58 +0300
Re: A few questiosn about encoding Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-14 11:21 -0400
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 18:26 +0300
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-15 03:03 +1000
Re: A few questiosn about encoding Walter Hurry <walterhurry@lavabit.com> - 2013-06-14 23:32 +0000
Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-15 10:26 +1000
Re: A few questiosn about encoding Denis McMahon <denismfmcmahon@gmail.com> - 2013-06-15 06:34 +0000
Re: A few questiosn about encoding Grant Edwards <invalid@invalid.invalid> - 2013-06-15 14:44 +0000
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 17:49 +0300
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-15 15:30 +0000
Re: A few questiosn about encoding Roy Smith <roy@panix.com> - 2013-06-15 10:59 -0400
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 18:14 +0300
Re: A few questiosn about encoding Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-15 11:35 -0400
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 22:26 +0300
Re: A few questiosn about encoding Benjamin Schollnick <benjamin@schollnick.net> - 2013-06-15 16:35 -0400
Re: A few questiosn about encoding Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-06-16 15:45 +0200
Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 09:36 +0200
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 10:49 +0300
Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 10:22 +0200
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 11:37 +0300
Don't feed the troll... (was: Re: A few questiosn about encoding) Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 11:06 +0200
Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 12:32 +0300
Re: Don't feed the troll... Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 13:09 +0200
Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:36 +0300
Re: Don't feed the troll... Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-14 08:44 -0400
Re: Don't feed the troll... Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 15:25 +0200
Re: Don't feed the troll... Neil Cerutti <neilc@norwich.edu> - 2013-06-14 15:54 +0000
Re: Don't feed the troll... Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 12:15 +0200
Re: Don't feed the troll... Guy Scree <nobody@nowhere.com> - 2013-06-14 18:50 -0400
Re: Don't feed the troll... Denis McMahon <denismfmcmahon@gmail.com> - 2013-06-15 06:31 +0000
Re: Don't feed the troll... Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-06-15 13:04 -0400
Re: Don't feed the troll... Guy Scree <nobody@nowhere.com> - 2013-06-17 16:15 -0400
Re: Don't feed the troll... Chris Angelico <rosuav@gmail.com> - 2013-06-18 07:46 +1000
Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 20:19 +1000
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:41 +0300
Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Fábio Santos <fabiosantosart@gmail.com> - 2013-06-14 11:20 +0100
Re: Don't feed the troll... (was: Re: A few questiosn about encoding) rusi <rustompmody@gmail.com> - 2013-06-14 04:51 -0700
Re: Don't feed the help-vampire rusi <rustompmody@gmail.com> - 2013-06-14 05:09 -0700
Re: Don't feed the help-vampire Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 14:31 +0200
Re: Don't feed the help-vampire Ian Kelly <ian.g.kelly@gmail.com> - 2013-06-14 10:51 -0600
Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:50 +0300
Re: Don't feed the troll... Zero Piraeus <schesis@gmail.com> - 2013-06-14 09:33 -0400
Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:45 +0300
Re: Don't feed the troll... Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 14:58 +0200
Re: Don't feed the troll... Fábio Santos <fabiosantosart@gmail.com> - 2013-06-14 14:25 +0100
Re: Don't feed the troll... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-14 17:12 +0100
Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 12:50 +0200
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:59 +0300
Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 15:52 +0200
Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-15 10:28 +1000
Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-17 08:49 +0200
Re: Don't feed the troll... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-14 12:57 +0100
Re: Don't feed the troll... (was: Re: A few questiosn about encoding) "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 13:13 -0400
Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Chris Angelico <rosuav@gmail.com> - 2013-06-15 03:31 +1000
Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Grant Edwards <invalid@invalid.invalid> - 2013-06-14 19:40 +0000
Re: Don't feed the troll "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 13:56 -0400
Re: Don't feed the troll Tim Chase <python.list@tim.thechases.com> - 2013-06-14 14:00 -0500
Re: Don't feed the troll "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 15:17 -0400
Re: Don't feed the troll... Ben Finney <ben+python@benfinney.id.au> - 2013-06-15 10:42 +1000
Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-19 18:46 -0700
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-20 06:26 +0000
Re: A few questiosn about encoding MRAB <python@mrabarnett.plus.com> - 2013-06-20 12:43 +0100
Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-20 09:27 -0700
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 02:37 +1000
Re: A few questiosn about encoding MRAB <python@mrabarnett.plus.com> - 2013-06-20 18:17 +0100
Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-23 08:51 -0700
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-23 16:30 +0000
Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-25 13:16 -0700
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 03:21 +1000
Re: A few questiosn about encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-20 20:43 +0100
Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-20 06:40 -0700
Re: A few questiosn about encoding Andrew Berg <robotsondrugs@gmail.com> - 2013-06-20 09:04 -0500
Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-20 08:12 -0700
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 01:26 +1000
Re: A few questiosn about encoding Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2013-06-20 20:25 +0300
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 01:28 +1000
Re: A few questiosn about encoding Andreas Perstinger <andipersti@gmail.com> - 2013-06-20 19:08 +0200
Re: A few questiosn about encoding Dave Angel <davea@davea.name> - 2013-06-12 08:43 -0400
Re: A few questiosn about encoding Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-06-13 18:46 -0400
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 08:34 +0300
Re: A few questiosn about encoding Zero Piraeus <schesis@gmail.com> - 2013-06-14 02:00 -0400
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 10:28 +0300
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-09 03:44 -0700 |
| Subject | A few questiosn about encoding |
| Message-ID | <6dfa3707-80f4-407a-a109-66dbb0130513@googlegroups.com> |
A few questiosn about encoding please:
>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>> values up to 256?
>Because then how do you tell when you need one byte, and when you need
>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>characters, with ordinal values 0x4C and 0xFA, or one character with
>ordinal value 0x4CFA?
I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>> UTF-8 and UTF-16 and UTF-32
>> I though the number beside of UTF- was to declare how many bits the
>> character set was using to store a character into the hdd, no?
>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>values to make a surrogate pair.
A surrogate pair is like itting for example Ctrl-A, which means is a combination character that consists of 2 different characters?
Is this what a surrogate is? a pari of 2 chars?
>UTF-8 uses 8-bit values, but sometimes
>it combines two, three or four of them to represent a single code-point.
'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal > 65000 )
The amount of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table?
| From | Fábio Santos <fabiosantosart@gmail.com> |
|---|---|
| Date | 2013-06-09 13:18 +0100 |
| Message-ID | <mailman.2915.1370780298.3114.python-list@python.org> |
| In reply to | #47448 |
On 9 Jun 2013 11:49, "Νικόλαος Κούρας" <nikos.gr33k@gmail.com> wrote:
>
> A few questiosn about encoding please:
>
> >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
> >> values up to 256?
>
> >Because then how do you tell when you need one byte, and when you need
> >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
> >characters, with ordinal values 0x4C and 0xFA, or one character with
> >ordinal value 0x4CFA?
>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>
> >> UTF-8 and UTF-16 and UTF-32
> >> I though the number beside of UTF- was to declare how many bits the
> >> character set was using to store a character into the hdd, no?
>
> >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
> >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
> >values to make a surrogate pair.
>
> A surrogate pair is like itting for example Ctrl-A, which means is a combination character that consists of 2 different characters?
> Is this what a surrogate is? a pari of 2 chars?
>
> >UTF-8 uses 8-bit values, but sometimes
> >it combines two, three or four of them to represent a single code-point.
>
> 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
> 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal > 65000 )
>
> The amount of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table?
> --
> http://mail.python.org/mailman/listinfo/python-list
In short, a UTF-8 character takes 1 to 4 bytes. A UTF-16 character takes 2 or 4 bytes. A UTF-32 character always takes 4 bytes.
Turning characters into bytes is called encoding. The opposite, turning bytes back into characters, is decoding.
Python makes this transparent with the encode() and decode() methods. You normally don't need to care about this kind of thing.
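As a small illustration of the two directions (a sketch in Python 3; the sample string and variable names are just chosen for this example):

```python
# str -> bytes is encoding; bytes -> str is decoding.
text = "αβγ"                      # three Greek letters
data = text.encode("utf-8")       # encode: characters -> bytes
roundtrip = data.decode("utf-8")  # decode: bytes -> characters

print(data)                  # the raw UTF-8 bytes
print(len(text), len(data))  # number of characters vs. number of bytes
print(roundtrip == text)     # decoding undoes encoding
```

Each of these Greek letters falls in the 2-byte range of UTF-8, so the byte string is twice as long as the text.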
| From | Nobody <nobody@nowhere.com> |
|---|---|
| Date | 2013-06-09 18:01 +0100 |
| Message-ID | <pan.2013.06.09.17.01.19.553000@nowhere.com> |
| In reply to | #47448 |
On Sun, 09 Jun 2013 03:44:57 -0700, Νικόλαος Κούρας wrote:
>>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>>> values up to 256?
>
>>Because then how do you tell when you need one byte, and when you need
>>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>>characters, with ordinal values 0x4C and 0xFA, or one character with
>>ordinal value 0x4CFA?
>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I
> meant up to 256, not above 256.
But then you've used up all 256 possible bytes for storing the first 256
characters, and there aren't any left for use in multi-byte sequences.
You need some means to distinguish between a single-byte character and an
individual byte within a multi-byte sequence.
UTF-8 does that by allocating specific ranges to specific purposes.
0x00-0x7F are single-byte characters, 0x80-0xBF are continuation bytes of
multi-byte sequences, 0xC0-0xFF are leading bytes of multi-byte sequences.
This scheme has the advantage of making UTF-8 non-modal, i.e. if a byte is
corrupted, added or removed, it will only affect the character containing
that particular byte; the decoder can re-synchronise at the beginning of
the following character.
OTOH, with encodings such as UTF-16, UTF-32 or ISO-2022, adding or
removing a byte will result in desynchronisation, with all subsequent
characters being corrupted.
> A surrogate pair is like itting for example Ctrl-A, which means is a
> combination character that consists of 2 different characters? Is this
> what a surrogate is? a pari of 2 chars?
A surrogate pair is a pair of 16-bit codes used to represent a single
Unicode character whose code is greater than 0xFFFF.
The 2048 codepoints from 0xD800 to 0xDFFF inclusive aren't used to
represent characters, but "surrogates". Unicode characters with codes
in the range 0x10000-0x10FFFF are represented in UTF-16 as a pair of
surrogates.
First, 0x10000 is subtracted from the code, giving a value in the range
0-0xFFFFF (20 bits). The top ten bits are added to 0xD800 to give a value
in the range 0xD800-0xDBFF, while the bottom ten bits are added to 0xDC00
to give a value in the range 0xDC00-0xDFFF.
Because the codes used for surrogates aren't valid as individual
characters, scanning a string for a particular character won't
accidentally match part of a multi-word character.
> 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is >
> 127 ) 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be
> stored ? (since ordinal > 65000 )
Most Chinese, Japanese and Korean (CJK) characters have codepoints within
the BMP (i.e. <= 0xFFFF), so they only require 3 bytes in UTF-8. The
codepoints above the BMP are mostly for archaic ideographs (those no
longer in normal use), mathematical symbols, dead languages, etc.
> The amount of bytes needed to store a character solely depends on the
> character's ordinal value in the Unicode table?
Yes. UTF-8 is essentially a mechanism for representing 31-bit unsigned
integers such that smaller integers require fewer bytes than larger
integers (subsequent revisions of Unicode cap the range of possible
codepoints to 0x10FFFF, as that's all that UTF-16 can handle).
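The surrogate-pair arithmetic described in this post can be sketched in a few lines of Python 3. The helper name `to_surrogate_pair` and the sample code point are made up for this illustration; the final line only cross-checks the result against Python's own UTF-16 codec.

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into its two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000        # 20-bit value, 0..0xFFFFF
    high = 0xD800 + (offset >> 10)       # top ten bits    -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom ten bits -> 0xDC00..0xDFFF
    return high, low

high, low = to_surrogate_pair(0x1F600)   # arbitrary sample above U+FFFF
print(hex(high), hex(low))

# Cross-check against the built-in big-endian UTF-16 codec.
manual = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(manual == chr(0x1F600).encode("utf-16-be"))
```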
| From | Chris “Kwpolska” Warrick <kwpolska@gmail.com> |
|---|---|
| Date | 2013-06-09 19:12 +0200 |
| Message-ID | <mailman.2923.1370797972.3114.python-list@python.org> |
| In reply to | #47448 |
On Sun, Jun 9, 2013 at 12:44 PM, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
> A few questiosn about encoding please:
>
>>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>>> values up to 256?
>
>>Because then how do you tell when you need one byte, and when you need
>>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>>characters, with ordinal values 0x4C and 0xFA, or one character with
>>ordinal value 0x4CFA?
>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
It is required so the computer can know where characters begin.
0x0080 (first non-ASCII character) becomes 0xC280 in UTF-8. Further
details here: http://en.wikipedia.org/wiki/UTF-8#Description
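A rough sketch of how the byte ranges let a reader find character boundaries, in Python 3. The `classify` helper is hypothetical, written only for this example, and it follows the simple three-range split discussed in this thread (valid UTF-8 never actually contains 0xC0, 0xC1 or 0xF5-0xFF).

```python
def classify(byte):
    """Label one UTF-8 byte by the range its value falls in."""
    if byte <= 0x7F:
        return "single"        # a complete one-byte (ASCII) character
    if byte <= 0xBF:
        return "continuation"  # a later byte of a multi-byte sequence
    return "lead"              # the first byte of a multi-byte sequence

encoded = "\u0080".encode("utf-8")
print(encoded)                                # U+0080 becomes two bytes
print([classify(b) for b in encoded])

mixed = "aα".encode("utf-8")
print([classify(b) for b in mixed])
```

Because no value is shared between the three roles, a decoder that lands in the middle of a character can skip continuation bytes until it sees the next single or lead byte.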
>>> UTF-8 and UTF-16 and UTF-32
>>> I though the number beside of UTF- was to declare how many bits the
>>> character set was using to store a character into the hdd, no?
>
>>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>>values to make a surrogate pair.
>
> A surrogate pair is like itting for example Ctrl-A, which means is a combination character that consists of 2 different characters?
> Is this what a surrogate is? a pari of 2 chars?
http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B10000_to_U.2B10FFFF
Long story short: codepoint - 0x10000 (up to 20 bits) → two 10-bit
numbers → 0xD800 + first_half 0xDC00 + second_half. Rephrasing:
We take MATHEMATICAL BOLD CAPITAL B (U+1D401). If you have UTF-8: 𝐁
It is over 0xFFFF, and we need to use surrogate pairs. We end up with
0xD401, or 0b1101010000000001. Both representations are worthless, as
we have a 16-bit number, not a 20-bit one. We throw in some leading
zeroes and end up with 0b00001101010000000001. Split it in half and
we get 0b0000110101 and 0b0000000001, which we can now shorten to
0b110101 and 0b1, or translate to hex as 0x0035 and 0x0001. 0xD800 +
0x0035 and 0xDC00 + 0x0001 → 0xD835 0xDC01. Type it into python and:
>>> b'\xD8\x35\xDC\x01'.decode('utf-16be')
'𝐁'
And before you ask: that “BE” stands for Big-Endian. Little-Endian
would mean reversing the two bytes of each 16-bit unit, which would make it
'\x35\xD8\x01\xDC' (the name says which end of each unit comes first:
'a', U+0061, is stored as 0x61 0x00 in a little-endian encoding).
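The byte orders can be compared directly in Python 3. This is only a sketch; `bytes.hex()` needs Python 3.5 or later, and the variable names are invented for the example.

```python
bold_b = "\U0001D401"                 # MATHEMATICAL BOLD CAPITAL B

be = bold_b.encode("utf-16-be")       # big-endian: high byte of each unit first
le = bold_b.encode("utf-16-le")       # little-endian: low byte of each unit first
print(be.hex(), le.hex())

# The same swap for a plain ASCII letter, U+0061.
a_be = "a".encode("utf-16-be")
a_le = "a".encode("utf-16-le")
print(a_be, a_le)
```

Only the bytes inside each 16-bit unit are swapped; the order of the two surrogates stays the same.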
Another question you may ask: 0xD800…0xDFFF are reserved in Unicode
for the purposes of UTF-16, so there are no conflicts.
>>UTF-8 uses 8-bit values, but sometimes
>>it combines two, three or four of them to represent a single code-point.
>
> 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
yup. α is at 0x03B1, or 945 decimal.
> 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal > 65000 )
Not necessarily, as CJK characters start at U+2E80, which is in the
3-byte range (0x0800 through 0xFFFF) — the table is here:
http://en.wikipedia.org/wiki/UTF-8#Description
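One way to check these byte counts is simply to ask Python 3. The sample characters below are picked for this sketch: an ASCII letter, Greek alpha, the first CJK code point mentioned above (U+2E80), a common CJK ideograph, and a character above U+FFFF.

```python
samples = ["a", "\u03b1", "\u2e80", "\u4e2d", "\U0001D401"]

# Map each code point to the number of bytes its UTF-8 encoding takes.
sizes = {ord(ch): len(ch.encode("utf-8")) for ch in samples}

for code_point, size in sizes.items():
    print("U+%04X" % code_point, size)
```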
--
Kwpolska <http://kwpolska.tk> | GPG KEY: 5EAAEA16
stop html mail | always bottom-post
http://asciiribbon.org | http://caliburn.nl/topposting.html
| From | Νικόλαος Κούρας <support@superhost.gr> |
|---|---|
| Date | 2013-06-12 09:09 +0000 |
| Message-ID | <kp9drh$1o0t$1@news.ntua.gr> |
| In reply to | #47472 |
>> (*) infact UTF8 also indicates the end of each character
> Up to a point. The initial byte encodes the length and the top few
> bits, but the subsequent octets aren’t distinguishable as final in
> isolation. 0x80-0xBF can all be either medial or final.
So, the first high-bits are a directive that UTF-8 uses to know how many bytes each character is being represented as.
0-127 codepoints(characters) use 1 bit to signify they need 1 bit for storage and the rest 7 bits to actually store the character ?
while 128-256 codepoints(characters) use 2 bit to signify they need 2 bits for storage and the rest 14 bits to actually store the character ?
Isn't 14 bits way to many to store a character ?
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-12 09:24 +0000 |
| Message-ID | <51b83e5a$0$29998$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #47762 |
On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:
> Isn't 14 bits way to many to store a character ?
No.
There are 1114112 possible code points in Unicode. (And in Japan, they
sometimes use TRON instead of Unicode, which has even more.)
If you list out all the combinations of 14 bits:
0000 0000 0000 00
0000 0000 0000 01
0000 0000 0000 10
0000 0000 0000 11
[...]
1111 1111 1111 10
1111 1111 1111 11
you will see that there are only 16384 (2**14) such values. You can't
fit 1114112 code points into just 16384 values.
--
Steven
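A quick check of the arithmetic in Python 3, by my count: 14 bits give 2**14 distinct values, while Unicode's range U+0000..U+10FFFF contains 0x110000 code points. The variable names are just for this sketch.

```python
values_in_14_bits = 2 ** 14           # every combination of 14 bits
unicode_code_points = 0x10FFFF + 1    # U+0000 through U+10FFFF inclusive
bits_needed = (0x10FFFF).bit_length() # bits required for the largest code point

print(values_in_14_bits)
print(unicode_code_points)
print(bits_needed)
```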
| From | Νικόλαος Κούρας <support@superhost.gr> |
|---|---|
| Date | 2013-06-12 14:23 +0300 |
| Message-ID | <kp9lo6$9l5$2@news.ntua.gr> |
| In reply to | #47767 |
On 12/6/2013 12:24 PM, Steven D'Aprano wrote:
> On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:
>
>> Isn't 14 bits way to many to store a character ?
>
> No.
>
> There are 1114112 possible code points in Unicode. (And in Japan, they
> sometimes use TRON instead of Unicode, which has even more.)
>
> If you list out all the combinations of 14 bits:
>
> 0000 0000 0000 00
> 0000 0000 0000 01
> 0000 0000 0000 10
> 0000 0000 0000 11
> [...]
> 1111 1111 1111 10
> 1111 1111 1111 11
>
> you will see that there are only 16384 (2**14) such values. You can't
> fit 1114112 code points into just 16384 values.
Thanks Steven,
So, how many bytes does UTF-8 stored for codepoints > 127 ?
example for codepoint 256, 1345, 16474 ?
| From | Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> |
|---|---|
| Date | 2013-06-12 14:52 +0200 |
| Message-ID | <pg7m8a-mto.ln1@satorlaser.homedns.org> |
| In reply to | #47783 |
On 12.06.2013 13:23, Νικόλαος Κούρας wrote:
> So, how many bytes does UTF-8 stored for codepoints > 127 ?
What has your research turned up? I personally consider it lazy and
disrespectful to get lots of pointers that you could use for further
research and then ask for more info before you have even followed these links.
> example for codepoint 256, 1345, 16474 ?
Yes, examples exist. Gee, if only there was an information network that
you could access and where you could locate information on various
programming-related topics somehow. Seriously, someone should invent this
thing! But still, even without it, you have all the tools (i.e. Python)
in your hand to generate these examples yourself! Check out ord, bin,
encode, decode for a start.
Uli
| From | Nobody <nobody@nowhere.com> |
|---|---|
| Date | 2013-06-12 21:30 +0100 |
| Message-ID | <pan.2013.06.12.20.30.22.31000@nowhere.com> |
| In reply to | #47783 |
On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
> So, how many bytes does UTF-8 stored for codepoints > 127 ?
U+0000..U+007F 1 byte
U+0080..U+07FF 2 bytes
U+0800..U+FFFF 3 bytes
>=U+10000 4 bytes
So, 1 byte for ASCII, 2 bytes for other Latin characters, Greek, Cyrillic,
Arabic, and Hebrew, 3 bytes for Chinese/Japanese/Korean, 4 bytes for dead
languages and mathematical symbols.
The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a total
of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than 20 bits).
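The range table translates almost directly into code. The sketch below is Python 3; `utf8_length` is a made-up helper that only mirrors the four ranges listed here, and the loop compares it with what the real codec produces for the three code points asked about.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for a code point, per the range table."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

for cp in (256, 1345, 16474):
    actual = len(chr(cp).encode("utf-8"))   # ask Python's own encoder
    print(cp, hex(cp), utf8_length(cp), actual)
```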
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-13 01:40 +0000 |
| Message-ID | <51b9231b$0$29997$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #47844 |
On Wed, 12 Jun 2013 21:30:23 +0100, Nobody wrote:
> The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a
> total of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than
> 20 bits).
Same with UTF-8 and UTF-32, both of which are limited to U+10FFFF because
that is what Unicode is limited to.
The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
that's not UTF-8, that's UTF-8-plus-extra-codepoints. Likewise the
mechanism of UTF-32 could go up to 0xFFFFFFFF, but doing so means you
don't have Unicode chars any more, and hence your byte-string is not
valid UTF-32:
py> b = b'\xFF'*8
py> b.decode('UTF-32')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
codepoint not in range(0x110000)
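The same upper limit can be seen from the other direction in Python 3: `chr` accepts U+10FFFF and refuses anything larger. A small sketch, with variable names chosen only for this example:

```python
highest = chr(0x10FFFF)                    # the largest Unicode code point
utf8_size = len(highest.encode("utf-8"))   # bytes needed in UTF-8
utf32_size = len(highest.encode("utf-32-be"))

try:
    chr(0x110000)                          # one past the Unicode range
    beyond_range_ok = True
except ValueError:
    beyond_range_ok = False

print(utf8_size, utf32_size, beyond_range_ok)
```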
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-06-13 12:01 +1000 |
| Message-ID | <mailman.3153.1371088918.3114.python-list@python.org> |
| In reply to | #47883 |
On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
> that's not UTF-8, that's UTF-8-plus-extra-codepoints.
And a proper UTF-8 decoder will reject "\xC0\x80" and "\xed\xa0\x80",
even though mathematically they would translate into U+0000 and U+D800
respectively. The UTF-16 *mechanism* is limited to no more than Unicode
has currently used, but I'm left wondering if that's actually the other
way around - that Unicode planes were deemed to stop at the point where
UTF-16 can't encode any more. Not that it matters; with most of the
current planes completely unallocated, it seems unlikely we'll be
needing more.
ChrisA
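Python 3's own UTF-8 codec behaves this way, which makes for an easy experiment. In the sketch below, `try_decode` is a hypothetical wrapper written for this example; the last line uses the standard `surrogatepass` error handler to show what the second sequence would decode to if surrogates were allowed through.

```python
def try_decode(data):
    """Decode strictly as UTF-8, reporting a rejection instead of raising."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return "rejected: " + exc.reason

overlong_nul = try_decode(b"\xc0\x80")        # overlong form of U+0000
lone_surrogate = try_decode(b"\xed\xa0\x80")  # would be U+D800
plain_nul = try_decode(b"\x00")               # the valid encoding of U+0000

print(overlong_nul)
print(lone_surrogate)
print(repr(plain_nul))

# Explicitly opting in to surrogates recovers the "mathematical" value.
print(hex(ord(b"\xed\xa0\x80".decode("utf-8", "surrogatepass"))))
```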
| From | Nobody <nobody@nowhere.com> |
|---|---|
| Date | 2013-06-13 11:02 +0100 |
| Message-ID | <pan.2013.06.13.10.02.38.693000@nowhere.com> |
| In reply to | #47886 |
On Thu, 13 Jun 2013 12:01:55 +1000, Chris Angelico wrote:
> On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
>> that's not UTF-8, that's UTF-8-plus-extra-codepoints.
>
> And a proper UTF-8 decoder will reject "\xC0\x80" and "\xed\xa0\x80", even
> though mathematically they would translate into U+0000 and U+D800
> respectively. The UTF-16 *mechanism* is limited to no more than Unicode
> has currently used, but I'm left wondering if that's actually the other
> way around - that Unicode planes were deemed to stop at the point where
> UTF-16 can't encode any more.
Indeed. 5-byte and 6-byte sequences were originally part of the UTF-8
specification, allowing for 31 bits. Later revisions of the standard
imposed the UTF-16 limit on Unicode as a whole.
| From | Νικόλαος Κούρας <support@superhost.gr> |
|---|---|
| Date | 2013-06-13 09:21 +0300 |
| Message-ID | <kpboda$qvk$3@news.ntua.gr> |
| In reply to | #47844 |
On 12/6/2013 11:30 PM, Nobody wrote:
> On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
>
>> So, how many bytes does UTF-8 stored for codepoints > 127 ?
>
> U+0000..U+007F 1 byte
> U+0080..U+07FF 2 bytes
> U+0800..U+FFFF 3 bytes
> >=U+10000 4 bytes
'U' stands for Unicode code-point, which means a character, right?
How can you tell up to what character utf-8 needs 1 byte or 2 bytes or 3?
And some of the bytes' bits are used to tell where a code-point's
representation stops, right? I mean, if we have a code-point that needs
2 bytes to be stored, then the high bit must be set to 1 to signify that
this character's encoding stops at 2 bytes.
I just know that 2^8 = 256, that's at first look 256 places, which means
256 positions to hold a code-point, which in turn means a character.
We take the high bit out and then we have 2^7, which is enough positions
for 0-127 standard ASCII. High bit is set to '0' to signify that the char
is encoded in 1 byte.
Please tell me that I understood correctly so far.
But how about for 2 or 3 or 4 bytes?
Am I saying it correctly?
| From | jmfauth <wxjmfauth@gmail.com> |
|---|---|
| Date | 2013-06-12 23:28 -0700 |
| Message-ID | <7d1f7756-31f4-4e0f-a5d3-6b736c2eef3c@k3g2000vbn.googlegroups.com> |
| In reply to | #47904 |
UTF-8, Unicode (consortium): 1 to 4 *Unicode Transformation Unit*
UTF-8, ISO 10646: 1 to 6 *Unicode Transformation Unit*
(still current, unless really freshly modified)
jmf
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-06-13 16:48 +1000 |
| Message-ID | <mailman.3167.1371106123.3114.python-list@python.org> |
| In reply to | #47904 |
On Thu, Jun 13, 2013 at 4:21 PM, Νικόλαος Κούρας <support@superhost.gr> wrote:
> How can you be able to tell up to what character utf-8 needs 1 byte or 2
> bytes or 3?
You look up Wikipedia, using the handy links that have been put to you
MULTIPLE TIMES.
ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-13 00:13 +0000 |
| Message-ID | <51b90ead$0$29997$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #47783 |
On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
> So, how many bytes does UTF-8 store for codepoints > 127 ?
Two, three or four, depending on the codepoint.
> example for codepoint 256, 1345, 16474 ?
You can do this yourself. I have already given you enough information in
previous emails to answer this question on your own, but here it is again:
Open an interactive Python session, and run this code:
c = ord(16474)
len(c.encode('utf-8'))
That will tell you how many bytes are used for that example.
--
Steven
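A hedged correction of the snippet above, assuming `chr` was intended rather than `ord` (`ord` takes a character, `chr` takes an integer), applied to all three code points from the question:

```python
# chr() turns an integer code point into a one-character string;
# encoding that string to UTF-8 shows how many bytes it occupies.
byte_counts = {}
for code_point in (256, 1345, 16474):
    encoded = chr(code_point).encode('utf-8')
    byte_counts[code_point] = len(encoded)
    print(code_point, encoded, len(encoded))
```

256 and 1345 both fall below U+0800 and so need two bytes, while 16474 needs three.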
| From | Νικόλαος Κούρας <support@superhost.gr> |
|---|---|
| Date | 2013-06-13 09:09 +0300 |
| Message-ID | <kpbnmg$qvk$2@news.ntua.gr> |
| In reply to | #47866 |
On 13/6/2013 3:13 πμ, Steven D'Aprano wrote:
> On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
>
>> So, how many bytes does UTF-8 store for codepoints > 127 ?
>
> Two, three or four, depending on the codepoint.
The number of bytes needed by UTF-8 to store a code-point (character)
depends on the ordinal value of the code-point in the Unicode charset,
correct?
If this is correct, then the higher the ordinal value (which is a decimal
integer) in the Unicode charset, the more bytes are needed for storage.
It's like the bigger a decimal integer is, the bigger the binary number it
produces.
Is this correct?
>> example for codepoint 256, 1345, 16474 ?
>
> You can do this yourself. I have already given you enough information in
> previous emails to answer this question on your own, but here it is again:
>
> Open an interactive Python session, and run this code:
>
> c = ord(16474)
> len(c.encode('utf-8'))
>
>
> That will tell you how many bytes are used for that example.
This is actually wrong.
ord()'s argument must be a character whose ordinal value we want.
>>> chr(16474)
'䁚'
Some Chinese symbol.
So code-point '䁚' has a Unicode ordinal value of 16474, correct?
Encoding this glyph's ordinal value to binary then gives us
the following bytes:
>>> bin(16474).encode('utf-8')
b'0b100000001011010'
Now, we take two symbols out:
the 'b' prefix, which is there to tell us that we are looking at a bytes
object, as well as
the '0b' prefix, which is there to tell us that we are looking at a binary
representation of a bytes object.
Thus, we count 15 bits left.
So it says 15 bits, which is 1 bit less than 2 bytes.
Are the above statements correct please?
but thinking this through more and more:
>>> chr(16474).encode('utf-8')
b'\xe4\x81\x9a'
>>> len(b'\xe4\x81\x9a')
3
It seems that the bytestring the encode process produces is of length 3.
So I take it that it is 3 bytes?
But there is a mismatch between what bin(16474).encode('utf-8') and
chr(16474).encode('utf-8') are telling us here.
Care to explain that too, please?
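The gap between 15 bits and 3 bytes comes from the UTF-8 framing bits. As a sketch (variable names are illustrative, and the arithmetic only applies to code points in the U+0800..U+FFFF range), the bits of 16474 can be spread by hand over the three-byte pattern `1110xxxx 10xxxxxx 10xxxxxx`:

```python
code_point = 16474                               # U+405A

# Split the code point into 4 + 6 + 6 payload bits and add the UTF-8 markers.
byte1 = 0b11100000 | (code_point >> 12)          # 1110xxxx: lead byte of a 3-byte sequence
byte2 = 0b10000000 | ((code_point >> 6) & 0x3F)  # 10xxxxxx: continuation byte
byte3 = 0b10000000 | (code_point & 0x3F)         # 10xxxxxx: continuation byte

manual = bytes([byte1, byte2, byte3])
print(manual)                                    # same bytes the encoder produces
assert manual == chr(code_point).encode('utf-8')
```

Of the 24 bits in those three bytes, 8 are markers and 16 are payload, which is why a 15-bit number does not fit in two bytes of UTF-8.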
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-13 07:11 +0000 |
| Message-ID | <51b9708b$0$29872$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #47902 |
On Thu, 13 Jun 2013 09:09:19 +0300, Νικόλαος Κούρας wrote:
> On 13/6/2013 3:13 πμ, Steven D'Aprano wrote:
>> Open an interactive Python session, and run this code:
>>
>> c = ord(16474)
>> len(c.encode('utf-8'))
>>
>>
>> That will tell you how many bytes are used for that example.
> This is actually wrong.
>
> ord()'s arguments must be a character for which we expect its ordinal
> value.
Gah!
That's twice I've screwed that up. Sorry about that!
> >>> chr(16474)
> '䁚'
>
> Some Chinese symbol.
> So code-point '䁚' has a Unicode ordinal value of 16474, correct?
Correct.
> where in after encoding this glyph's ordinal value to binary gives us
> the following bytes:
>
> >>> bin(16474).encode('utf-8')
> b'0b100000001011010'
No! That creates a string from 16474 in base two:
'0b100000001011010'
The leading 0b is just syntax to tell you "this is base 2, not base 8
(0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
Then you encode the string '0b100000001011010' into UTF-8. There are 17
characters in this string, and they are all ASCII characters so they take
up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form). In
hex form, they are:
b'\x30\x62\x31\x30\x30\x30\x30\x30\x30\x30\x31\x30\x31\x31\x30\x31\x30'
which takes up a lot more room, which is why Python prefers to show ASCII
characters as characters rather than as hex.
What you want is:
chr(16474).encode('utf-8')
[...]
> Thus, there we count 15 bits left.
> So it says 15 bits, which is 1-bit less than 2 bytes. Is the above
> statements correct please?
No. There are 17 BYTES there. The string "0" doesn't get turned into a
single bit. It still takes up a full byte, 0x30, which is 8 bits.
> but thinking this through more and more:
>
> >>> chr(16474).encode('utf-8')
> b'\xe4\x81\x9a'
> >>> len(b'\xe4\x81\x9a')
> 3
>
> it seems that the bytestring the encode process produces is of length 3.
Correct! Now you have got the right idea.
--
Steven
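To make the 17-byte count concrete, a small sketch that encodes the text returned by `bin()` and compares it with encoding the character itself:

```python
text = bin(16474)                       # the 17-character string '0b100000001011010'
as_bytes = text.encode('utf-8')         # one byte per ASCII character

print(len(text), len(as_bytes))         # both are 17
print(as_bytes.hex())                   # 0x30 is '0', 0x31 is '1', 0x62 is 'b'

# Encoding the character itself is a different operation entirely.
char_bytes = chr(16474).encode('utf-8')
print(char_bytes, len(char_bytes))
```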
| From | Νικόλαος Κούρας <support@superhost.gr> |
|---|---|
| Date | 2013-06-13 10:42 +0300 |
| Message-ID | <kpbt5i$7bj$1@news.ntua.gr> |
| In reply to | #47912 |
On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:
>> >>> chr(16474)
>> '䁚'
>>
>> Some Chinese symbol.
>> So code-point '䁚' has a Unicode ordinal value of 16474, correct?
>
> Correct.
>
>
>> where in after encoding this glyph's ordinal value to binary gives us
>> the following bytes:
>>
>> >>> bin(16474).encode('utf-8')
>> b'0b100000001011010'
An observation here that I'd like you to confirm as valid.
1. A code-point and the code-point's ordinal value are associated in the
Unicode charset. They have a so-called 1:1 mapping.
So, I was under the impression that encoding the code-point into
UTF-8 was the same as encoding the code-point's ordinal value into UTF-8.
That is why i tried to:
bin(16474).encode('utf-8') instead of chr(16474).encode('utf-8')
So, now i believe they are two different things.
The code-point *is what actually* needs to be encoded and *not* its
ordinal value.
> The leading 0b is just syntax to tell you "this is base 2, not base 8
> (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
But byte objects are represented as '\x' instead of the aforementioned
'0x'. Why is that?
> No! That creates a string from 16474 in base two:
> '0b100000001011010'
I disagree here.
16474 is a number in base 10. Doing bin(16474) we get the binary
representation of the number 16474 and not a string.
Why do you say we receive a string while Python presents a binary number?
> Then you encode the string '0b100000001011010' into UTF-8. There are 17
> characters in this string, and they are all ASCII characters so they take
> up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form).
To me, 0b100000001011010 stands for a number in base 2, not a string.
Have I understood something wrong?
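On the `\x` versus `0x` question, a brief sketch of the distinction as Python treats it: `0x` prefixes an integer literal written in hex, while `\x` is an escape inside a string or bytes literal that stands for one byte (or character) with that hex value.

```python
number = 0xe4                 # integer literal in base 16, equal to 228
one_byte = b'\xe4'            # bytes object holding a single byte with value 228

print(number, one_byte[0])    # indexing a bytes object yields an int
assert one_byte == bytes([number])

# Printable ASCII bytes are shown as characters, everything else as \x escapes.
print(bytes([0x30, 0x62, 0xe4]))
```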
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-06-13 17:58 +1000 |
| Message-ID | <mailman.3172.1371110288.3114.python-list@python.org> |
| In reply to | #47917 |
On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας <support@superhost.gr> wrote:
> On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:
>> No! That creates a string from 16474 in base two:
>> '0b100000001011010'
>
> I disagree here.
> 16474 is a number in base 10. Doing bin(16474) we get the binary
> representation of the number 16474 and not a string.
> Why do you say we receive a string while Python presents a binary number?
You can disagree all you like. Steven cited a simple point of fact, one
which can be verified in any Python interpreter. Nikos, you are flat
wrong here; bin(16474) creates a string.
ChrisA
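This is easy to check in an interpreter. A short sketch separating the integer, the binary literal, and the string that `bin()` returns:

```python
n = 16474
s = bin(n)

print(type(n).__name__, type(s).__name__)   # int, str
assert isinstance(s, str)

# The unquoted literal is an int, and equals the decimal one.
assert 0b100000001011010 == n

# int() with base 2 converts the string back to the number.
assert int(s, 2) == n
```

The interpreter echoes the result of `bin()` inside quotes, which is the visible sign that it is a string rather than a number.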