Path: csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <8471f19b-e21a-4859-9842-92a97d75a840@googlegroups.com>
References: <5b0d3d7c-e3a4-436d-a55f-26bd40064fd5@googlegroups.com> <mailman.2911.1370769172.3114.python-list@python.org> <8471f19b-e21a-4859-9842-92a97d75a840@googlegroups.com>
Date: Sun, 9 Jun 2013 13:01:15 -0700
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)
From: Benjamin Kaplan <benjamin.kaplan@case.edu>
To: "python-list@python.org" <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.2934.1370808466.3114.python-list@python.org>
Lines: 114
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:47487

On Sun, Jun 9, 2013 at 2:20 AM, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
> On Sunday, June 9, 2013 at 12:12:36 PM UTC+3, Cameron Simpson wrote:
>> On 09Jun2013 02:00, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
>>
>> | Steven wrote:
>> | >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>> | >> values up to 256?
>> |
>> | >Because then how do you tell when you need one byte, and when you need
>> | >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>> | >characters, with ordinal values 0x4C and 0xFA, or one character with
>> | >ordinal value 0x4CFA?
>> |
>> | I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>>
>> Then it would not be UTF-8. UTF-8 will encode a Unicode codepoint. Your
>> suggestion will not.
>
> I don't follow.
>

The point of the UTF formats is that they can encode any of the roughly
1.1 million code points available in Unicode. Your suggestion can only
encode 256 code points. We have that encoding already: it's called
Latin-1, and it can't encode any of your Greek characters (which is why
ISO-8859-7 exists; it can encode the Greek characters but not the
accented Latin ones).
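
Here is a quick sketch of that trade-off, in case it helps. It tries one
Greek letter and one accented Latin letter in each encoding. The helper
name try_encode is made up for this example; the codec names are the
standard Python ones.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# One Greek letter and one accented Latin letter, in three encodings.
for ch in ("α", "é"):
    for enc in ("latin-1", "iso-8859-7", "utf-8"):
        print(repr(ch), enc, try_encode(ch, enc))
```

Only UTF-8 should give you bytes for both characters; each of the
single-byte encodings returns None for one of them.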

If you were to use the whole byte to store the first 256 characters,
you wouldn't be able to store anything past character number 255,
because the computer wouldn't be able to tell the difference between
character 257 (0x01 0x01) and two chr(1)s. UTF-8 gets around this by
reserving the top bit as an "am I part of a multibyte sequence" flag,
which is why only the first 128 code points fit in a single byte.
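
To make the top-bit idea concrete, here is a rough sketch. The function
byte_role is a name I made up for illustration, and it deliberately
ignores byte values that are never valid in UTF-8. It encodes character
257 and two chr(1)s, then classifies each resulting byte by its leading
bits.

```python
def byte_role(byte):
    """Classify one UTF-8 byte by its top bits (sketch; ignores invalid bytes)."""
    if byte & 0b10000000 == 0:
        return "single"        # 0xxxxxxx: a complete one-byte character
    if byte & 0b11000000 == 0b10000000:
        return "continuation"  # 10xxxxxx: middle or end of a multibyte sequence
    return "lead"              # 11xxxxxx: start of a multibyte sequence

one_char = chr(257).encode("utf-8")       # character 257 as a single code point
two_chars = (chr(1) * 2).encode("utf-8")  # two chr(1)s

print(one_char, [byte_role(b) for b in one_char])
print(two_chars, [byte_role(b) for b in two_chars])
```

The two encodings come out as different byte strings, so a decoder never
has to guess which reading was meant.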

>> | >> UTF-8 and UTF-16 and UTF-32
>> | >> I thought the number beside UTF- was to declare how many bits the
>> | >> character set was using to store a character on the hdd, no?
>> |
>> | >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>> | >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>> | >values to make a surrogate pair.
>> |
>> | A surrogate pair is like hitting, for example, Ctrl-A, which is a combination character that consists of 2 different characters?
>> | Is this what a surrogate is? A pair of 2 chars?
>>
>> Essentially. The combination represents a code point.
>>
>> | >UTF-8 uses 8-bit values, but sometimes
>> | >it combines two, three or four of them to represent a single code-point.
>> |
>> | 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
>> | 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
>> | 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal >  65000 )
>> |
>> | The amount of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table?
>>
>> Essentially. You can read up on the exact process in Wikipedia or the Unicode Standard.
>
> When you say essentially means you agree with my statements?

In UTF-8 or UTF-16, the number of bytes required for the character is
dependent on its code point, yes. That isn't the case for UTF-32,
where every character uses exactly four bytes.
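
If you want to check this yourself, here is a small sketch. The function
encoded_sizes is a name I made up, and the sample characters are my own
picks. I use the "-le" codec variants so that Python does not add a
byte-order mark to the count.

```python
def encoded_sizes(ch):
    """Return the number of bytes ch occupies in each UTF encoding."""
    return {enc: len(ch.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# ASCII letter, Greek letter, a common CJK ideograph, and an emoji
# that lies outside the Basic Multilingual Plane.
for ch in ("a", "α", "\u4e2d", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Note that ord('a') is 97, and that a common ideograph such as U+4E2D
takes three bytes in UTF-8, not four. Four-byte UTF-8 sequences (and
UTF-16 surrogate pairs) only appear for code points above U+FFFF.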