From: Νικόλαος Κούρας
Date: Sun, 09 Jun 2013 07:46:40 +0300
Newsgroups: comp.lang.python
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)
To: Cameron Simpson
Cc: python-list@python.org
References: <20130608223258.GA29311@cskk.homeip.net>

On 9/6/2013 1:32 AM, Cameron Simpson wrote:
> On 08Jun2013 14:14, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
> | On Saturday, 8 June 2013 10:01:57 PM UTC+3, Steven D'Aprano wrote:
> | > ASCII actually needs 7 bits to store a character. Since computers are
> | > optimized to work with bytes, not bits, normally ASCII characters are
> | > stored in a single byte, with one bit wasted.
> |
> | So ASCII and Unicode are 2 encoding systems currently in use.
> | How should I imagine them, visualize them?
> | Like tables 'A' = 65, 'B' = 66 and so on?
>
> Yes, that works.
>
> | But if I do, then that would be the visualization of a 'charset', not of an encoding system.
> | What is the difference between an encoding system and a charset?
>
> An encoding system is the method of transcribing these values to bytes and back again.
So we have:

( 'A' mapped to the value 65 ) => encoding process (e.g. UTF-8) => bytes
bytes => decoding process (e.g. UTF-8) => ( 65 mapped to the character 'A' )
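
In Python terms, the two steps above might be sketched like this. This is a minimal illustration assuming Python 3; the variable names are made up for the example. ord()/chr() stand in for the charset table and encode()/decode() for the encoding:

```python
# Step 1: character <-> ordinal (the "charset" table lookup).
ordinal = ord('A')                     # 65
greek_ordinal = ord('\u03b1')          # Greek small alpha, 945

# Step 2: ordinal <-> bytes (the encoding, here UTF-8).
encoded = 'A'.encode('utf-8')          # b'A', the single byte 0x41
greek_encoded = '\u03b1'.encode('utf-8')   # b'\xce\xb1', two bytes

# And back again: bytes -> character, ordinal -> character.
decoded = encoded.decode('utf-8')      # 'A'
from_ordinal = chr(ordinal)            # 'A'

print(ordinal, list(encoded), decoded)
print(greek_ordinal, greek_encoded.hex())
```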

Why does every character in a character set need to be associated with a numeric value?
I mean, couldn't we just have character sets without numeric associations, like:

'A' => encoding process (e.g. UTF-8) => bytes
bytes => decoding process (e.g. UTF-8) => character 'A'



> EBCDIC and ASCII and Unicode and Greek-ISO (iso-8859-7) are all character sets
> (1:1 mappings of characters to numbers/ordinals).
>
> An encoding is a way of writing these values to bytes.
> Decoding reads bytes and emits character values.
>
> Because all of the EBCDIC, ASCII and iso-8859-x character sets fit in the range 0-255,
> they are usually transcribed (encoded) directly, one byte per ordinal.
>
> Unicode is much larger. It cannot be transcribed (encoded) as one byte per value.
> There are several ways of transcribing Unicode. UTF-8 is a popular and usually compact form,
> using one byte for values below 128 and multiple bytes for higher values.

An ordinal = ordered numbers like 7, 8, 9, 10 and so on?

Since 1 byte can hold 256 distinct values, why doesn't UTF-8 use 1 byte for values up to 255?
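
As a quick check of the "below 128" rule, here is a minimal sketch assuming Python 3; the sample characters are arbitrary picks. UTF-8 reserves bytes with the high bit set (0x80-0xFF) for multi-byte sequences, so only ordinals 0-127 get a single byte, whereas a single-byte charset such as iso-8859-7 uses all 256 byte values directly:

```python
# Sample characters: 'A', DEL (127), U+0080, e-acute, Greek alpha, euro sign.
samples = ['A', '\x7f', '\x80', '\u00e9', '\u03b1', '\u20ac']

for ch in samples:
    data = ch.encode('utf-8')
    print(f"U+{ord(ch):04X} -> {len(data)} byte(s): {data.hex()}")

# Number of UTF-8 bytes per sample, in the same order as `samples`.
utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]

# The same Greek alpha in a single-byte charset takes exactly one byte.
alpha_iso = '\u03b1'.encode('iso-8859-7')   # b'\xe1'
alpha_utf8 = '\u03b1'.encode('utf-8')       # b'\xce\xb1'
```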

UTF-8 and UTF-16 and UTF-32:
I thought the number next to "UTF-" declared how many bits the character set uses to store a character on the hard disk, no?
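
For comparison, a small sketch (assuming Python 3; the three test characters are arbitrary) that measures how many bytes each UTF form actually produces. The number in the name is the size in bits of one code unit, and a character may take more than one unit:

```python
# 'A', Greek small alpha, and a character outside the 16-bit range (U+1F600).
chars = ['A', '\u03b1', '\U0001F600']
# The "-le" variants fix the byte order so no byte-order mark is prepended.
codecs_to_try = ('utf-8', 'utf-16-le', 'utf-32-le')

# Bytes produced per character, per codec, in the same order as `chars`.
sizes = {
    codec: [len(ch.encode(codec)) for ch in chars]
    for codec in codecs_to_try
}

for codec, lengths in sizes.items():
    print(f"{codec:<9} -> {lengths}")
```

Only UTF-32 comes out as a fixed width here; UTF-8 and UTF-16 both grow for the larger ordinals.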

"Narrow" Unicode uses two bytes per character. Since two bytes is only 
enough for about 65,000 characters, not 1,000,000+, the rest of the 
characters are stored as pairs of two-byte "surrogates".

Can you please explain the line "the rest of the characters are stored as pairs of two-byte "surrogates"" in simpler terms?
I'm still having trouble understanding what a surrogate is.
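
For what it is worth, the arithmetic behind a surrogate pair can be sketched as below. This is a minimal illustration assuming Python 3; to_surrogates is a hypothetical helper written for this example, not a standard library function. A code point above U+FFFF has 0x10000 subtracted, and the remaining 20 bits are split into two 10-bit halves that are added to the reserved ranges starting at 0xD800 and 0xDC00:

```python
import struct

def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low

ch = '\U0001F600'                        # one character beyond the 16-bit range
high, low = to_surrogates(ord(ch))

# Cross-check against Python's own UTF-16 encoder: two 16-bit units.
units = struct.unpack('<2H', ch.encode('utf-16-le'))

print(f"U+{ord(ch):X} -> {high:04X} {low:04X}")
```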

Again, thank you very much for explaining the encodings to me; they have been giving me trouble for years in all of my scripts.


And one last thing.
When the locale on a Linux system is set to UTF-8, that would mean that Linux applications should encode strings written to the hard disk using the system's default encoding, UTF-8, and read them back from bytes also using UTF-8. Is that correct?
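
A way to test this from Python might look like the sketch below. It assumes Python 3; the file name and sample text are invented for the example. locale.getpreferredencoding() reports what the locale suggests, and open() without an encoding argument has traditionally followed that value, so passing encoding='utf-8' explicitly removes the dependence on the locale:

```python
import locale
import os
import tempfile

# What the current locale suggests as the default text encoding.
print(locale.getpreferredencoding(False))

text = '\u03b1\u03c1\u03c7\u03b5\u03af\u03bf'   # six Greek letters
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Write with an explicit encoding rather than relying on the locale.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# The bytes on disk are the UTF-8 encoding of the string.
with open(path, 'rb') as f:
    raw = f.read()

# Reading back with the same encoding recovers the original string.
with open(path, encoding='utf-8') as f:
    read_back = f.read()

print(len(text), 'characters ->', len(raw), 'bytes')
```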