From: Νικόλαος Κούρας
Date: Sat, 08 Jun 2013 21:01:23 +0300
To: Cameron Simpson
Cc: python-list@python.org
Newsgroups: comp.lang.python
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)

On 8/6/2013 5:49 AM, Cameron Simpson wrote:
On 07Jun2013 04:53, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
| On Friday, 7 June 2013 11:53:04 AM UTC+3, Cameron Simpson wrote:
| > | >| Does errors='replace' mean don't break in case of error?
| > 
| > | >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
| > | >for something that would not decode smoothly.
| > 
| > | How can it be correct? We have encoded our string in utf-8 and then
| > | we tried to decode it as greek-iso? How can this possibly be
| > | correct?
| 
| > If it is a valid iso-8859-7 sequence (which might cover everything, 
| > since I expect it is an 8-bit 1:1 mapping from bytes values to a 
| > set of codepoints, just like iso-8859-1) then it may decode to the 
| > "wrong" characters, but the reverse process (characters encoded as
| > bytes) should produce the original bytes.  With a mapping like this, 
| > errors='replace' may mean nothing; there will be no errors because
| > the only Unicode characters in play are all from iso-8859-7 to start
| > with. Of course another string may not be safe. 
| 
| > Visually, the names will be garbage. And if you go:
| >   mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
| > while using the iso-8859-7 locale, the wrong thing will occur
| > (assuming it even works, though I think it should because all these
| > characters are represented in iso-8859-7, yes?)
| 
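The round trip described in the quoted text can be tried directly. This is only a sketch, assuming Python 3 and CPython's bundled iso-8859-7 codec, which leaves bytes 0xAE, 0xD2 and 0xFF unmapped; the helper name round_trips is made up for illustration. It suggests that the round trip holds for some names but not for every one, which is where errors='replace' starts to matter.

```python
# Sketch: decode UTF-8 bytes with the "wrong" codec, then try to get them back.
# Assumption: CPython's iso-8859-7 codec has no mapping for 0xAE, 0xD2, 0xFF.

def round_trips(text, wrong_codec):
    """True if the UTF-8 bytes of text survive decode + re-encode via wrong_codec."""
    raw = text.encode("utf-8")
    try:
        return raw.decode(wrong_codec).encode(wrong_codec) == raw
    except UnicodeDecodeError:
        # Some byte has no meaning in wrong_codec, so strict decoding fails.
        return False

survives = round_trips("του", "iso-8859-7")     # no unmapped bytes involved
breaks = round_trips("Ευχή", "iso-8859-7")      # 'ή' is CE AE in UTF-8
latin1_ok = round_trips("Ευχή", "iso-8859-1")   # latin-1 maps all 256 byte values

# With errors='replace' the unmapped byte becomes U+FFFD and is lost for good.
mangled = "Ευχή".encode("utf-8").decode("iso-8859-7", errors="replace")
lost_a_byte = "\ufffd" in mangled

print(survives, breaks, latin1_ok, lost_a_byte)
```

The replacement character in the mangled file name above sits exactly where the 'ή' of 'Ευχή' was, which is consistent with this.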
| I understood all the rest; only the quoted part above is still unclear to me.
| I can't seem to understand it.
| 
| Do you mean that utf-8, latin-iso, greek-iso and ASCII share the same first 128 codepoints (0-127)?

Yes. It is certainly true for utf-8 and latin-iso and ASCII.
I expect it to be so for greek-iso, but have not checked.

They're all essentially the ASCII set plus a range of other character
codepoints for the upper values.  The 8-bit sets iso-8859-1 (which
I take you to mean by "latin-iso") and iso-8859-7 (which I take you
to mean by "greek-iso") are single-byte mappings with the top half
mapped to characters commonly used in a particular region.

Unicode has a much greater range, but the UTF-8 encoding of Unicode
deliberately has the bottom 0-127 identical to ASCII, and higher
values represented by multibyte sequences commencing with at least
the first byte in the 128-255 range. In this way pure ASCII files
are already in UTF-8 (and, in fact, work just fine for the iso-8859-x
encodings as well).
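
The ASCII compatibility described above can be checked with a short sketch, assuming Python 3. The file name and the variable names are only illustrative. It encodes one pure-ASCII name with four codecs and compares the bytes, then looks at the bytes UTF-8 produces for a Greek word.

```python
# A pure-ASCII name encodes to the same bytes under all of these codecs.
ascii_name = "999-E.mp3"
codecs_to_try = ["ascii", "utf-8", "iso-8859-1", "iso-8859-7"]
encodings_of_ascii = {c: ascii_name.encode(c) for c in codecs_to_try}
all_identical = len(set(encodings_of_ascii.values())) == 1

# Non-ASCII characters become multibyte sequences in UTF-8,
# and every byte of those sequences is in the 128-255 range.
greek_utf8 = "Ευχή".encode("utf-8")
all_high = all(byte >= 128 for byte in greek_utf8)

# The same word in iso-8859-7 takes one byte per character.
greek_iso7 = "Ευχή".encode("iso-8859-7")

print(all_identical, all_high, len(greek_utf8), len(greek_iso7))
```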

Hold on!

In the beginning there was ASCII, with values 0-127, and then came Unicode, with ASCII's 0-127 plus I don't know how many more?

Now, ASCII needs 1 byte to store a single character, while Unicode needs 2 bytes to store a character, and that is because it has more than 256 (2^8) characters to represent?

Is this correct?
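
One way to probe the "2 bytes" question is to separate the number Unicode assigns to a character (its code point) from the bytes an encoding uses to store it. The sketch below assumes Python 3.3 or later; the helper byte_sizes is hypothetical. It suggests there is no single "Unicode size": the byte count depends on which encoding is chosen.

```python
import sys

def byte_sizes(ch):
    """Bytes needed for one character in three Unicode encodings."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# 'a', Greek eta with tonos, the euro sign, and a character beyond U+FFFF.
for ch in ["a", "ή", "€", "\U0001F600"]:
    print("U+%04X" % ord(ch), byte_sizes(ch))

# The highest code point Python 3.3+ can represent: far more than 2 bytes' worth.
print(hex(sys.maxunicode))
```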

So UTF-8, latin-iso, greek-iso, etc. are WAYS of storing characters on the hard drive?

I ask because in some posts I have read the phrase 'UTF-8 encoding of Unicode'.
Can you please explain to me the difference between ASCII and Unicode themselves, and then how they compare to 'charsets'? I'm still confused about this.

Is it like what we say in C++:
'int a',    a variable with name 'a' of type integer.
'char a',   a variable with name 'a' of type char.

So, taking the above example (the closest I could think of), the way I understand it is:

A 'string' can be of type Unicode or ASCII, and that type needs a way (that's a charset) to store the string on the HDD as a sequence of bytes?
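
The picture in that last question can be tried out in a small sketch, assuming Python 3, where a str holds Unicode code points and an encoding is chosen only when the text is written out as bytes. The helper bytes_on_disk is made up for illustration and uses a temporary file.

```python
import os
import tempfile

text = "Ευχή"  # a str: a sequence of Unicode code points, no bytes yet

def bytes_on_disk(text, encoding):
    """Write text using the given encoding and return the raw bytes stored."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)

utf8_bytes = bytes_on_disk(text, "utf-8")         # 8 bytes
greek_bytes = bytes_on_disk(text, "iso-8859-7")   # 4 bytes

# Same string, different bytes; decoding with the matching codec restores it.
print(utf8_bytes, greek_bytes)
print(utf8_bytes.decode("utf-8") == greek_bytes.decode("iso-8859-7") == text)
```

Reading the bytes back with a codec other than the one used to write them is what produces garbled names like the one quoted earlier in this thread.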





