Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!news.mixmin.net!feeds.phibee-telecom.net!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Date: Sun, 09 Jun 2013 07:46:40 +0300
From: =?UTF-8?B?zp3Ouc66z4zOu86xzr/PgiDOms6/z43Pgc6xz4I=?= <nikos.gr33k@gmail.com>
User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:22.0) Gecko/20100101 Thunderbird/22.0
MIME-Version: 1.0
To: Cameron Simpson <cs@zip.com.au>
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)
References: <e1cfd5ed-798d-44fa-8bf7-17f3549a288e@googlegroups.com> <20130608223258.GA29311@cskk.homeip.net>
In-Reply-To: <20130608223258.GA29311@cskk.homeip.net>
Content-Type: multipart/alternative; boundary="------------050603000002090308010604"
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.2906.1370753210.3114.python-list@python.org>
Lines: 234
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:47422

This is a multi-part message in MIME format.
--------------050603000002090308010604
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

On 9/6/2013 1:32 AM, Cameron Simpson wrote:
> On 08Jun2013 14:14, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
> | On Saturday, 8 June 2013 10:01:57 PM UTC+3, Steven D'Aprano wrote:
> | > ASCII actually needs 7 bits to store a character. Since computers are
> | > optimized to work with bytes, not bits, normally ASCII characters are
> | > stored in a single byte, with one bit wasted.
> |
> | So ASCII and Unicode are 2 Encoding Systems currently in use.
> | How should I imagine them, visualize them?
> | Like tables 'A' = 65, 'B' = 66 and so on?
>
> Yes, that works.
>
> | But if I do, then that would be the visualization of a 'charset', not of an encoding system.
> | What is the difference between an encoding system and a charset?
>
> An encoding system is the method of transcribing these values to bytes and back again.
So we have:

( 'A' mapped to the value 65 ) => encoding process (e.g. utf-8) => bytes
bytes => decoding process (e.g. utf-8) =>  ( 65 mapped to character 'A' )
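
A minimal Python 3 sketch of that round trip, as far as I understand it: ord()/chr() for the character-to-number step and str.encode()/bytes.decode() for the number-to-bytes step. The variable names are only for illustration.

```python
# Step 1: the character set maps a character to a number (its code point).
ch = 'A'
code_point = ord(ch)                 # 65

# Step 2: the encoding turns the character into bytes.
raw = ch.encode('utf-8')             # b'A', the single byte 0x41 (65)

# Decoding reverses step 2, and chr() reverses step 1.
assert raw.decode('utf-8') == 'A'
assert chr(code_point) == 'A'

# The same character can become different bytes under different encodings.
alpha = '\u03b1'                     # Greek small letter alpha
print(ord(alpha))                    # 945
print(alpha.encode('utf-8'))         # b'\xce\xb1'
print(alpha.encode('iso-8859-7'))    # b'\xe1'
```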

Why does every character in a character set need to be associated with 
a numeric value?
I mean, couldn't we just have character sets that have no numeric 
associations, like:

'A'  => encoding process (e.g. utf-8) => bytes
bytes => decoding process (e.g. utf-8) =>  character 'A'


>
> EBCDIC and ASCII and Unicode and Greek-ISO (iso-8859-7) are all character sets
> (1:1 mappings of characters to numbers/ordinals).
>
> An encoding is a way of writing these values to bytes.
> Decoding reads bytes and emits character values.
>
> Because all of EBCDIC, ASCII and the iso-8859-x character sets fit in the range 0-255,
> they are usually transcribed (encoded) directly, one byte per ordinal.
>
> Unicode is much larger. It cannot be transcribed (encoded) as one byte per value.
> There are several ways of transcribing Unicode. UTF-8 is a popular and usually compact form,
> using one byte for values below 128 and multiple bytes for higher values.
So an ordinal means ordered numbers like 7, 8, 9, 10 and so on?

Since 1 byte can hold 256 different values, why doesn't UTF-8 use 1 byte 
for values up to 255?
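
A quick Python 3 check of how many bytes UTF-8 actually produces near that boundary (a sketch; the sample values are picked arbitrarily):

```python
# Number of bytes UTF-8 uses for a few code points.
for cp in (65, 127, 128, 255, 945, 8364, 128512):
    encoded = chr(cp).encode('utf-8')
    print(cp, len(encoded), encoded)

# In valid UTF-8, a byte with its high bit set is always part of a
# multi-byte sequence, so only values 0-127 fit in a single byte.
assert len(chr(127).encode('utf-8')) == 1
assert len(chr(128).encode('utf-8')) == 2
assert len(chr(255).encode('utf-8')) == 2

# Latin-1 (iso-8859-1), by contrast, stores 0-255 as one byte each,
# but cannot represent anything above 255.
assert len(chr(255).encode('latin-1')) == 1
```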

About UTF-8, UTF-16 and UTF-32:
I thought the number after "UTF-" declared how many bits the encoding 
uses to store one character on disk, no?
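
For what it's worth, a quick Python 3 comparison of how many bytes each of the three uses per character. This is only a sketch: the sample characters are my own picks, and the "-le" codec variants are used just to keep the byte-order mark out of the counts.

```python
samples = {'A': 'A', 'alpha': '\u03b1', 'emoji': '\U0001F600'}
encodings = ('utf-8', 'utf-16-le', 'utf-32-le')

# Bytes needed per character under each encoding.
sizes = {
    name: [len(ch.encode(enc)) for enc in encodings]
    for name, ch in samples.items()
}
for name, row in sizes.items():
    print(name, dict(zip(encodings, row)))

# The number in the name is the size in bits of one code unit:
# UTF-8 and UTF-16 use one or more units per character, while
# UTF-32 always uses exactly one 32-bit unit.
```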

"Narrow" Unicode uses two bytes per character. Since two bytes is only
enough for about 65,000 characters, not 1,000,000+, the rest of the
characters are stored as pairs of two-byte "surrogates".

Can you please explain the line "the rest of the characters are stored 
as pairs of two-byte "surrogates"" in a way that is easier for me to 
understand? I'm still having trouble understanding what a surrogate is.
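
If I read the UTF-16 rules correctly, the arithmetic would look like this in Python 3 (a sketch; the sample code point U+1F600 is just one I picked, and the names are illustrative):

```python
cp = 0x1F600                      # a code point above U+FFFF

# Subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.
offset = cp - 0x10000
high = 0xD800 + (offset >> 10)    # high (lead) surrogate, range D800-DBFF
low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate, range DC00-DFFF
print(hex(high), hex(low))        # 0xd83d 0xde00

# Python's own UTF-16 encoder produces the same two 16-bit units.
expected = high.to_bytes(2, 'big') + low.to_bytes(2, 'big')
assert chr(cp).encode('utf-16-be') == expected
```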

Again, thank you very much for explaining encodings to me; they have 
been giving me trouble for years in all of my scripts.


And one last thing.
When the locale of a Linux system is set to UTF-8, does that mean that 
Linux applications should encode strings to disk using UTF-8 as the 
system's default encoding, and read them back from bytes using UTF-8 
as well? Is that correct?
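
This is how I would check it from Python 3 (a sketch under my own assumptions: the temporary file name and sample text are made up, and the printed values depend on the machine's locale settings):

```python
import locale
import os
import sys
import tempfile

# What Python derived from the environment (LANG / LC_ALL / LC_CTYPE on Linux).
print(locale.getpreferredencoding(False))  # typically the default for text-mode open()
print(sys.getfilesystemencoding())         # used for file names

# Passing encoding= explicitly makes the result independent of the locale.
text = 'Greek: \u03b1\u03b2\u03b3'
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read the raw bytes, then decode them back with the same encoding.
with open(path, 'rb') as f:
    raw = f.read()
with open(path, encoding='utf-8') as f:
    assert f.read() == text
```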
-- 
Webhost <http://superhost.gr> && Weblog <http://psariastonafro.wordpress.com>

--------------050603000002090308010604--