Path: csiph.com!usenet.pasdenom.info!news.albasani.net!rt.uk.eu.org!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Date: Sun, 9 Jun 2013 19:12:36 +1000
From: Cameron Simpson <cs@zip.com.au>
To: =?utf-8?B?zp3Ouc66z4zOu86xzr/PgiDOms6/z43Pgc6xz4I=?= <nikos.gr33k@gmail.com>
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <5b0d3d7c-e3a4-436d-a55f-26bd40064fd5@googlegroups.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
References: <5b0d3d7c-e3a4-436d-a55f-26bd40064fd5@googlegroups.com>
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.2911.1370769172.3114.python-list@python.org>
Lines: 49
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:47437

On 09Jun2013 02:00, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
| Steven wrote:
| >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
| >> values up to 256? 
| 
| >Because then how do you tell when you need one byte, and when you need 
| >two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
| >characters, with ordinal values 0x4C and 0xFA, or one character with 
| >ordinal value 0x4CFA? 
| 
| I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.

Then it would not be UTF-8. UTF-8 can encode any Unicode code point. Your suggestion could only encode the first 256.

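I'd point out that if you did this, you'd be back in the same
situation you just encountered with ASCII: the first above-255 value
would raise a UnicodeEncodeError (an error which encoding to UTF-8
essentially never raises at present :-) What you describe is in
effect Latin-1 (ISO-8859-1), which stores exactly the first 256 code
points in one byte each and nothing more.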
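
To make that concrete, here is a small sketch. It assumes Latin-1 as a
stand-in for the proposed scheme, since Latin-1 is the codec that maps
code points 0..255 to one byte each. The helper name try_encode is
mine, purely for illustration.

```python
def try_encode(text, codec):
    """Return the encoded bytes, or the name of the exception raised."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError as e:
        return type(e).__name__

# U+00E9 (e with acute accent) is below 256.
small = "\u00e9"
# U+03B1 (Greek small letter alpha) is above 255.
alpha = "\u03b1"

latin1_small = try_encode(small, "latin-1")  # one byte
latin1_alpha = try_encode(alpha, "latin-1")  # cannot be encoded
utf8_small = try_encode(small, "utf-8")      # two bytes
utf8_alpha = try_encode(alpha, "utf-8")      # two bytes

print(latin1_small, latin1_alpha, utf8_small, utf8_alpha)
```

The one-byte scheme is more compact for U+00E9 but has no answer at
all for the Greek letter, which is the trade-off being discussed.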

| >> UTF-8 and UTF-16 and UTF-32 
| >> I thought the number beside of UTF- was to declare how many bits the 
| >> character set was using to store a character into the hdd, no? 
| 
| >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. 
| >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit 
| >values to make a surrogate pair.
| 
| A surrogate pair is like hitting for example Ctrl-A, which means it is a combination character that consists of 2 different characters?
| Is this what a surrogate is? A pair of 2 chars?

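Essentially, though the two halves are 16-bit code units rather than
characters in their own right. The combination represents a single
code point above U+FFFF.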
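
If it helps, the arithmetic behind a surrogate pair can be sketched in
a few lines of Python. The function name surrogate_pair and the sample
code point are my own choices for illustration; the result is
cross-checked against Python's built-in utf-16-be codec.

```python
def surrogate_pair(codepoint):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    offset = codepoint - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

cp = 0x1F600  # a code point outside the Basic Multilingual Plane
high, low = surrogate_pair(cp)

# Cross-check against Python's own UTF-16 codec (big-endian, no BOM).
encoded = chr(cp).encode("utf-16-be")
units = (int.from_bytes(encoded[0:2], "big"),
         int.from_bytes(encoded[2:4], "big"))

print(hex(high), hex(low), units == (high, low))
```

Neither 16-bit value means anything on its own; only the pair, taken
together, identifies the code point.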

| >UTF-8 uses 8-bit values, but sometimes 
| >it combines two, three or four of them to represent a single code-point.
| 
| 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
| 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
| 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal >  65000 )
| 
| The amount of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table?

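Essentially, though not quite with those numbers: 'a' is ordinal 97
(65 is 'A'), and most Chinese ideographs sit below U+FFFF and so take
3 bytes. Code points up to U+007F take 1 byte, up to U+07FF take 2,
up to U+FFFF take 3, and everything above takes 4. You can read up on
the exact process in Wikipedia or the Unicode Standard.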
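
You can also check this from the interpreter. The sketch below is only
an illustration: utf8_length is a hypothetical helper of mine that
applies the standard UTF-8 range boundaries, and the sample characters
are my own picks. It compares the prediction with what the codec
actually produces.

```python
def utf8_length(codepoint):
    """Number of bytes UTF-8 uses for a code point, from its value alone."""
    if codepoint <= 0x7F:
        return 1
    if codepoint <= 0x7FF:
        return 2
    if codepoint <= 0xFFFF:
        return 3
    return 4

samples = [
    "a",           # U+0061, ordinal 97
    "\u03b1",      # U+03B1, Greek small letter alpha
    "\u4e2d",      # U+4E2D, a common Chinese ideograph
    "\U00020000",  # U+20000, a rarer ideograph beyond U+FFFF
]

results = []
for ch in samples:
    predicted = utf8_length(ord(ch))
    actual = len(ch.encode("utf-8"))
    results.append((ord(ch), predicted, actual))
    print("U+%04X predicted=%d actual=%d" % (ord(ch), predicted, actual))
```

The surrogate range U+D800..U+DFFF is left out of the samples on
purpose, since those values are not encodable as characters in UTF-8.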

Cheers,
-- 
Cameron Simpson <cs@zip.com.au>

The most annoying thing about being without my files after our disc crash was
discovering once again how widespread BLINK was on the web.