Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Date: Sat, 08 Jun 2013 21:01:23 +0300
From: =?UTF-8?B?zp3Ouc66z4zOu86xzr/PgiDOms6/z43Pgc6xz4I=?= <nikos.gr33k@gmail.com>
User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:22.0) Gecko/20100101 Thunderbird/22.0
MIME-Version: 1.0
To: Cameron Simpson <cs@zip.com.au>
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)
References: <7d8da6c9-fb92-4329-b207-4280f29ba664@googlegroups.com> <20130608024931.GA77888@cskk.homeip.net>
In-Reply-To: <20130608024931.GA77888@cskk.homeip.net>
Content-Type: multipart/alternative; boundary="------------010502050402070304030906"
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.2891.1370714502.3114.python-list@python.org>
Lines: 199
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:47396

This is a multi-part message in MIME format.
--------------010502050402070304030906
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

On 8/6/2013 5:49 AM, Cameron Simpson wrote:
> On 07Jun2013 04:53, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= <nikos.gr33k@gmail.com> wrote:
> | On Friday, 7 June 2013 11:53:04 AM UTC+3, Cameron Simpson wrote:
> | > | >| Does errors='replace' mean don't break in case of an error?
> | >
> | > | >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
> | > | >for something that would not decode smoothly.
> | >
> | > | How can it be correct? We have encoded our string in utf-8 and then
> | > | we tried to decode it as greek-iso? How can this possibly be
> | > | correct?
> |
> | > If it is a valid iso-8859-7 sequence (which might cover everything,
> | > since I expect it is an 8-bit 1:1 mapping from byte values to a
> | > set of codepoints, just like iso-8859-1) then it may decode to the
> | > "wrong" characters, but the reverse process (characters encoded as
> | > bytes) should produce the original bytes.  With a mapping like this,
> | > errors='replace' may mean nothing; there will be no errors because
> | > the only Unicode characters in play are all from iso-8859-7 to start
> | > with. Of course another string may not be safe.
> |
> | > Visually, the names will be garbage. And if you go:
> | >   mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
> | > while using the iso-8859-7 locale, the wrong thing will occur
> | > (assuming it even works, though I think it should because all these
> | > characters are represented in iso-8859-7, yes?)
> |
> | I understood all the rest; only the quotes above are still unclear to me.
> | I can't seem to understand them.
> |
> | Do you mean that utf-8, latin-iso, greek-iso and ASCII share the same first 128 code points (0-127)?
>
> Yes. It is certainly true for utf-8 and latin-iso and ASCII.
> I expect it to be so for greek-iso, but have not checked.
>
> They're all essentially the ASCII set plus a range of other character
> codepoints for the upper values.  The 8-bit sets iso-8859-1 (which
> I take you to mean by "latin-iso") and iso-8859-7 (which I take you
> to mean by "greek-iso") are single-byte mappings with the top half
> mapped to characters commonly used in a particular region.
>
> Unicode has a much greater range, but the UTF-8 encoding of Unicode
> deliberately has the bottom 0-127 identical to ASCII, and higher
> values represented by multibyte sequences that commence with at least
> the first byte in the 128-255 range. In this way pure ASCII files
> are already in UTF-8 (and, in fact, work just fine for the iso-8859-x
> encodings as well).
>
Hold on!

In the beginning there was ASCII with values 0-127, and then there was
Unicode with ASCII's 0-127 plus I don't know how many more?

Now ASCII needs 1 byte to store a single character, while Unicode needs
2 bytes to store a character, because it has more than 256 (2^8)
characters to store?

Is this correct?
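
Here is a quick experiment that seems to bear on this question. It is
only a sketch, assuming Python 3 (where a str holds Unicode code points);
the helper name utf8_length and the sample characters are my own choices,
not anything from the thread. It prints how many bytes UTF-8 uses for a
few characters:

```python
# Assumption: Python 3. Escapes are used so the file itself stays pure ASCII.
samples = [
    "A",           # U+0041 LATIN CAPITAL LETTER A (plain ASCII)
    "\u00e9",      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    "\u03c5",      # U+03C5 GREEK SMALL LETTER UPSILON
    "\u20ac",      # U+20AC EURO SIGN
    "\U0001f600",  # U+1F600, a code point outside the 16-bit range
]


def utf8_length(ch):
    """Return how many bytes UTF-8 uses to store one character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s) in UTF-8")
```

If I read the output right, the byte count is 1 for the ASCII letter, 2
for the Greek letter and more for the others, so the size seems to belong
to the encoding (UTF-8 here) rather than to Unicode itself, and it is not
a fixed 2 bytes.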

Now UTF-8, latin-iso, greek-iso etc. are WAYS of storing characters
on the hard drive?

Because in some post I have read about the 'UTF-8 encoding of Unicode'.
Can you please explain to me the difference between ASCII and Unicode
themselves, and then how they compare to 'charsets'? I'm still confused
about this.
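
To make the question concrete, here is a small Python 3 sketch. I am
assuming that Python's codec names 'utf-8', 'iso-8859-1' and 'iso-8859-7'
correspond to what we have been calling utf-8, latin-iso and greek-iso;
the variable names are just mine. It takes one Greek letter and looks at
its Unicode code point and at the bytes each encoding produces:

```python
# Assumption: Python 3 and its standard codec names.
upsilon = "\u03c5"  # GREEK SMALL LETTER UPSILON

# The code point is a number assigned by Unicode, independent of storage.
code_point = ord(upsilon)  # 0x3C5

# Each encoding ("charset") turns that same code point into different bytes.
as_utf8 = upsilon.encode("utf-8")        # b'\xcf\x85' (two bytes)
as_greek = upsilon.encode("iso-8859-7")  # b'\xf5'     (one byte)

# iso-8859-1 has no slot for this letter, so encoding fails there.
try:
    upsilon.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Pure ASCII text comes out as the same bytes in all of these encodings.
ascii_bytes = {
    enc: "mp3".encode(enc)
    for enc in ("ascii", "utf-8", "iso-8859-1", "iso-8859-7")
}

print(hex(code_point), as_utf8, as_greek, latin1_ok)
print(ascii_bytes)
```

So it looks like Unicode gives the character its number, and each
encoding is a different rule for writing that number down as bytes.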

Is it like what we say in C++:
'int a',    a variable with name 'a' of type integer.
'char a',   a variable with name 'a' of type char.

So taking the above example (the closest I could think of), the way I
understand them is:

A 'string' can be of type Unicode or ASCII, and that type needs a
way (that's a charset) to store the string on the HDD as a sequence of
bytes?
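
Here is how I would sketch that picture in Python 3, applied to the
filename case quoted above. The name below is a shortened, hypothetical
version of that filename, and the variable names are mine. The string is
turned into bytes with one encoding, then read back both with the same
encoding and with the wrong one (iso-8859-7 with errors='replace'):

```python
# Assumption: Python 3; 'name' is a hypothetical, shortened filename.
name = "999-E\u03c5\u03c7\u03ae.mp3"  # Latin E, then Greek upsilon, chi, eta-tonos

# str -> bytes: this is what actually gets stored.
stored = name.encode("utf-8")

# bytes -> str with the SAME encoding gives the original string back.
restored = stored.decode("utf-8")

# bytes -> str with the WRONG encoding gives garbage characters.
garbled = stored.decode("iso-8859-7", errors="replace")

# Byte 0xAE (second byte of eta-tonos in UTF-8) has no meaning in
# iso-8859-7, so errors='replace' substitutes U+FFFD for it.
lossy = "\ufffd" in garbled

# Without errors='replace' the same decode raises instead.
try:
    stored.decode("iso-8859-7")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Going back to bytes no longer reproduces what was stored.
roundtrip = garbled.encode("iso-8859-7", errors="replace")

print(restored == name, lossy, strict_ok, roundtrip == stored)
```

If this is right, then for this particular name the wrong decode is not
fully reversible: the replaced byte is lost, which would also explain
the replacement mark in the garbled name in the mv command above.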

-- 
Webhost <http://superhost.gr> && Weblog <http://psariastonafro.wordpress.com>

--------------010502050402070304030906--