Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #47472

Re: A few questiosn about encoding

References <6dfa3707-80f4-407a-a109-66dbb0130513@googlegroups.com>
From Chris “Kwpolska” Warrick <kwpolska@gmail.com>
Date 2013-06-09 19:12 +0200
Subject Re: A few questiosn about encoding
Newsgroups comp.lang.python
Message-ID <mailman.2923.1370797972.3114.python-list@python.org> (permalink)

Show all headers | View raw


On Sun, Jun 9, 2013 at 12:44 PM, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
> A few questiosn about encoding please:
>
>>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>>> values up to 256?
>
>>Because then how do you tell when you need one byte, and when you need
>>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>>characters, with ordinal values 0x4C and 0xFA, or one character with
>>ordinal value 0x4CFA?
>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.

It is required so the computer can know where characters begin.
0x0080 (first non-ASCII character) becomes 0xC280 in UTF-8.  Further
details here: http://en.wikipedia.org/wiki/UTF-8#Description

>>> UTF-8 and UTF-16 and UTF-32
>>> I though the number beside of UTF- was to declare how many bits the
>>> character set was using to store a character into the hdd, no?
>
>>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>>values to make a surrogate pair.
>
> A surrogate pair is like itting for example Ctrl-A, which means is a combination character that consists of 2 different characters?
> Is this what a surrogate is? a pari of 2 chars?

http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B10000_to_U.2B10FFFF

Long story short: codepoint - 0x10000 (up to 20 bits) → two 10-bit
numbers → 0xD800 + first_half 0xDC00 + second_half.  Rephrasing:

We take MATHEMATICAL BOLD CAPITAL B (U+1D401).  If you have UTF-8: 𝐁

It is over 0xFFFF, and we need to use surrogate pairs.  We end up with
0xD401, or 0b1101010000000001.  Both representations are worthless, as
we have a 16-bit number, not a 20-bit one.  We throw in some leading
zeroes and end up with 0b00001101010000000001.  Split it in half and
we get 0b0000110101 and 0b0000000001, which we can now shorten to
0b110101 and 0b1, or translate to hex as 0x0035 and 0x0001.  0xD800 +
0x0035 and 0xDC00 + 0x0035 → 0xD835 0xDC00.  Type it into python and:

>>> b'\xD8\x35\xDC\x01'.decode('utf-16be')
'𝐁'

And before you ask: that “BE” stands for Big-Endian.  Little-Endian
would mean reversing the bytes in a codepoint, which would make it
'\x35\xD8\x01\xDC' (the name is based on the first 256 characters,
which are 0x6500 for 'a' in a little-endian encoding.

Another question you may ask: 0xD800…0xDFFF are reserved in Unicode
for the purposes of UTF-16, so there is no conflicts.

>>UTF-8 uses 8-bit values, but sometimes
>>it combines two, three or four of them to represent a single code-point.
>
> 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )

yup.  α is at 0x03B1, or 945 decimal.

> 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal >  65000 )

Not necessarily, as CJK characters start at U+2E80, which is in the
3-byte range (0x0800 through 0xFFFF) — the table is here:
http://en.wikipedia.org/wiki/UTF-8#Description

--
Kwpolska <http://kwpolska.tk> | GPG KEY: 5EAAEA16
stop html mail                | always bottom-post
http://asciiribbon.org        | http://caliburn.nl/topposting.html

Back to comp.lang.python | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

A few questiosn about encoding Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 03:44 -0700
  Re: A few questiosn about encoding Fábio Santos <fabiosantosart@gmail.com> - 2013-06-09 13:18 +0100
  Re: A few questiosn about encoding Nobody <nobody@nowhere.com> - 2013-06-09 18:01 +0100
  Re: A few questiosn about encoding Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-06-09 19:12 +0200
    Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-12 09:09 +0000
      Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-12 09:24 +0000
        Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-12 14:23 +0300
          Re: A few questiosn about encoding Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2013-06-12 14:52 +0200
          Re: A few questiosn about encoding Nobody <nobody@nowhere.com> - 2013-06-12 21:30 +0100
            Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 01:40 +0000
              Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 12:01 +1000
                Re: A few questiosn about encoding Nobody <nobody@nowhere.com> - 2013-06-13 11:02 +0100
            Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 09:21 +0300
              Re: A few questiosn about encoding jmfauth <wxjmfauth@gmail.com> - 2013-06-12 23:28 -0700
              Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 16:48 +1000
          Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 00:13 +0000
            Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 09:09 +0300
              Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 07:11 +0000
                Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 10:42 +0300
                Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 17:58 +1000
                Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 11:08 +0300
                Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 18:20 +1000
                Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 12:41 +0300
                Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 11:49 +0000
                Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 17:19 +0300
                Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 11:00 +1000
                Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 09:59 +0300
                Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 20:14 +1000
                Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 16:58 +0300
                Re: A few questiosn about encoding Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-14 11:21 -0400
                Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 18:26 +0300
                Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-15 03:03 +1000
                Re: A few questiosn about encoding Walter Hurry <walterhurry@lavabit.com> - 2013-06-14 23:32 +0000
                Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-15 10:26 +1000
                Re: A few questiosn about encoding Denis McMahon <denismfmcmahon@gmail.com> - 2013-06-15 06:34 +0000
                Re: A few questiosn about encoding Grant Edwards <invalid@invalid.invalid> - 2013-06-15 14:44 +0000
                Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 17:49 +0300
                Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-15 15:30 +0000
                Re: A few questiosn about encoding Roy Smith <roy@panix.com> - 2013-06-15 10:59 -0400
                Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 18:14 +0300
                Re: A few questiosn about encoding Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-15 11:35 -0400
                Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 22:26 +0300
                Re: A few questiosn about encoding Benjamin Schollnick <benjamin@schollnick.net> - 2013-06-15 16:35 -0400
                Re: A few questiosn about encoding Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-06-16 15:45 +0200
                Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 09:36 +0200
                Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 10:49 +0300
                Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 10:22 +0200
                Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 11:37 +0300
                Don't feed the troll... (was: Re: A few questiosn about encoding) Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 11:06 +0200
                Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 12:32 +0300
                Re: Don't feed the troll... Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 13:09 +0200
                Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:36 +0300
                Re: Don't feed the troll... Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-14 08:44 -0400
                Re: Don't feed the troll... Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 15:25 +0200
                Re: Don't feed the troll... Neil Cerutti <neilc@norwich.edu> - 2013-06-14 15:54 +0000
                Re: Don't feed the troll... Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 12:15 +0200
                Re: Don't feed the troll... Guy Scree <nobody@nowhere.com> - 2013-06-14 18:50 -0400
                Re: Don't feed the troll... Denis McMahon <denismfmcmahon@gmail.com> - 2013-06-15 06:31 +0000
                Re: Don't feed the troll... Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-06-15 13:04 -0400
                Re: Don't feed the troll... Guy Scree <nobody@nowhere.com> - 2013-06-17 16:15 -0400
                Re: Don't feed the troll... Chris Angelico <rosuav@gmail.com> - 2013-06-18 07:46 +1000
                Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 20:19 +1000
                Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:41 +0300
                Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Fábio Santos <fabiosantosart@gmail.com> - 2013-06-14 11:20 +0100
                Re: Don't feed the troll... (was: Re: A few questiosn about encoding) rusi <rustompmody@gmail.com> - 2013-06-14 04:51 -0700
                Re: Don't feed the help-vampire rusi <rustompmody@gmail.com> - 2013-06-14 05:09 -0700
                Re: Don't feed the help-vampire Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 14:31 +0200
                Re: Don't feed the help-vampire Ian Kelly <ian.g.kelly@gmail.com> - 2013-06-14 10:51 -0600
                Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:50 +0300
                Re: Don't feed the troll... Zero Piraeus <schesis@gmail.com> - 2013-06-14 09:33 -0400
                Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:45 +0300
                Re: Don't feed the troll... Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 14:58 +0200
                Re: Don't feed the troll... Fábio Santos <fabiosantosart@gmail.com> - 2013-06-14 14:25 +0100
                Re: Don't feed the troll... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-14 17:12 +0100
                Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 12:50 +0200
                Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:59 +0300
                Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 15:52 +0200
                Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-15 10:28 +1000
                Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-17 08:49 +0200
                Re: Don't feed the troll... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-14 12:57 +0100
                Re: Don't feed the troll... (was: Re: A few questiosn about encoding) "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 13:13 -0400
                Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Chris Angelico <rosuav@gmail.com> - 2013-06-15 03:31 +1000
                Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Grant Edwards <invalid@invalid.invalid> - 2013-06-14 19:40 +0000
                Re: Don't feed the troll "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 13:56 -0400
                Re: Don't feed the troll Tim Chase <python.list@tim.thechases.com> - 2013-06-14 14:00 -0500
                Re: Don't feed the troll "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 15:17 -0400
                Re: Don't feed the troll... Ben Finney <ben+python@benfinney.id.au> - 2013-06-15 10:42 +1000
                Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-19 18:46 -0700
                Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-20 06:26 +0000
                Re: A few questiosn about encoding MRAB <python@mrabarnett.plus.com> - 2013-06-20 12:43 +0100
                Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-20 09:27 -0700
                Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 02:37 +1000
                Re: A few questiosn about encoding MRAB <python@mrabarnett.plus.com> - 2013-06-20 18:17 +0100
                Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-23 08:51 -0700
                Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-23 16:30 +0000
                Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-25 13:16 -0700
                Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 03:21 +1000
                Re: A few questiosn about encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-20 20:43 +0100
                Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-20 06:40 -0700
                Re: A few questiosn about encoding Andrew Berg <robotsondrugs@gmail.com> - 2013-06-20 09:04 -0500
                Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-20 08:12 -0700
                Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 01:26 +1000
                Re: A few questiosn about encoding Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2013-06-20 20:25 +0300
                Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 01:28 +1000
                Re: A few questiosn about encoding Andreas Perstinger <andipersti@gmail.com> - 2013-06-20 19:08 +0200
        Re: A few questiosn about encoding Dave Angel <davea@davea.name> - 2013-06-12 08:43 -0400
      Re: A few questiosn about encoding Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-06-13 18:46 -0400
        Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 08:34 +0300
          Re: A few questiosn about encoding Zero Piraeus <schesis@gmail.com> - 2013-06-14 02:00 -0400
            Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 10:28 +0300

csiph-web