Path: csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder2.hal-mli.net!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
To: python-list@python.org
From: Dennis Lee Bieber <wlfraed@ix.netcom.com>
Subject: Re: A few questiosn about encoding
Date: Thu, 13 Jun 2013 18:46:12 -0400
Organization: IISS Elusive Unicorn
References: <6dfa3707-80f4-407a-a109-66dbb0130513@googlegroups.com> <mailman.2923.1370797972.3114.python-list@python.org> <kp9drh$1o0t$1@news.ntua.gr>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.3238.1371163584.3114.python-list@python.org>
Lines: 54
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:48038

On Wed, 12 Jun 2013 09:09:05 +0000 (UTC), ???????? ??????
<support@superhost.gr> declaimed the following:

>>> (*) infact UTF8 also indicates the end of each character
>
>> Up to a point.  The initial byte encodes the length and the top few
>> bits, but the subsequent octets aren't distinguishable as final in
>> isolation.  0x80-0xBF can all be either medial or final.
>
>
>So, the first high-bits are a directive that UTF-8 uses to know how many 
>bytes each character is being represented as.
>
>0-127 codepoints(characters) use 1 bit to signify they need 1 bit for 
>storage and the rest 7 bits to actually store the character ?
>
	Not quite... The leading bit is a 0 -> which means 0..127 are sent
as-is, no manipulation: one byte each (a byte, not a bit), with the
other 7 bits holding the value.
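
	A quick way to see this for yourself (a minimal sketch, the sample
values are just ones I picked):

```python
# Each ASCII code point becomes exactly one byte with the same value,
# so the top bit of that byte is always 0.
samples = [0, ord("A"), 127]
encoded = [chr(cp).encode("utf-8") for cp in samples]

for cp, raw in zip(samples, encoded):
    # Show the code point, the encoded byte, and its 8 bits.
    print(cp, raw, format(raw[0], "08b"))
```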

>while
>
>128-256 codepoints(characters) use 2 bit to signify they need 2 bits for 
>storage and the rest 14 bits to actually store the character ?
>
	128..255 -- in what encoding? These all have the leading bit with a
value of 1. In 8-bit encodings (ISO-Latin-1) the meaning of those values is
inherent in the specified encoding and they are sent as-is.
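
	To make the "in what encoding?" point concrete, here is the same
character encoded both ways. The choice of U+00E9 is mine, any code point
in 128..255 would behave the same:

```python
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, code point 233

latin1 = ch.encode("latin-1")  # 8-bit encoding: the value goes out as-is
utf8 = ch.encode("utf-8")      # UTF-8: needs a two-byte sequence

print([format(b, "08b") for b in latin1])  # ['11101001']
print([format(b, "08b") for b in utf8])    # ['11000011', '10101001']
```

	One byte in Latin-1, two bytes in UTF-8 (a 110 lead byte followed by
a 10 continuation byte).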

	BUT, in UTF-8, a byte with a leading 1-bit signals that the byte
belongs to a multi-byte sequence. Cf.:
https://en.wikipedia.org/wiki/UTF-8#Description

	So anything that starts with bits 110 is the first byte of a two-byte
sequence (and the second byte must start with bits 10 to be valid).

	1110 starts a three-byte sequence, 11110 starts a four-byte sequence...
Basically, count the number of leading 1-bits before the first 0-bit, and
that tells you how many bytes are in the multi-byte sequence. All bytes
that start with 10 are continuations of a multi-byte sequence (they never
signal a 1-byte entry -- those always have a leading 0).
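
	The counting rule can be sketched in a few lines. The function name is
my own invention, and this only reads the lead byte; it does not check that
the bytes that follow are valid continuations:

```python
def utf8_sequence_length(first_byte):
    """Return the sequence length announced by a byte (given as an int).

    Returns 1 for 0xxxxxxx, and 0 for a continuation byte (10xxxxxx),
    which cannot start a sequence. No further validation is done.
    """
    ones = 0
    mask = 0x80
    # Count leading 1-bits, stopping at the first 0-bit.
    while mask and first_byte & mask:
        ones += 1
        mask >>= 1
    if ones == 0:
        return 1  # plain 7-bit value, sent as-is
    if ones == 1:
        return 0  # continuation byte, not a lead byte
    return ones


for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    raw = ch.encode("utf-8")
    print(format(raw[0], "08b"), utf8_sequence_length(raw[0]), len(raw))
```

	For each sample the length read from the first byte should match the
actual number of bytes Python produced.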

>Isn't 14 bits way to many to store a character ? 

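	Two bytes are 16 bits on the wire, not 14, and only 11 of them carry
the character: 5 in the lead byte (after 110) and 6 in the continuation
(after 10).

	Original UTF-8 allowed for 31 bits to specify a character. The longest
form used 6 bytes -- 48 bits total, but 7 bits of the first byte were the
flag (6 leading 1-bits and a 0-bit), as were two bits (leading 10) of each
of the 5 continuation bytes, leaving 1 + 5*6 = 31. The current definition
(RFC 3629) stops at four-byte sequences, which is enough for U+10FFFF.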
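
	The bit arithmetic for every sequence length can be worked out the
same way. A rough sketch, with a helper name I made up, covering the
original 1..6 byte scheme:

```python
def payload_bits(n_bytes):
    """Bits left for the code point in an n-byte sequence (1..6)."""
    if not 1 <= n_bytes <= 6:
        raise ValueError("original UTF-8 sequences are 1 to 6 bytes long")
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    # The lead byte spends n 1-bits plus one 0-bit on the length flag.
    lead = 8 - (n_bytes + 1)
    # Every continuation byte (10xxxxxx) carries 6 bits.
    continuation = 6 * (n_bytes - 1)
    return lead + continuation


for n in range(1, 7):
    print(n, "bytes ->", payload_bits(n), "bits")
```

	That gives 7, 11, 16, 21, 26 and 31 bits for 1 through 6 bytes.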

	
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/