

Groups > comp.lang.python > #47322 > unrolled thread

Re: Changing filenames from Greeklish => Greek (subprocess complain)

Started by: Cameron Simpson <cs@zip.com.au>
First post: 2013-06-07 18:53 +1000
Last post: 2013-06-10 13:28 -0700
Articles: 20 on this page of 68; 14 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-07 18:53 +1000
    Re: Changing filenames from Greeklish => Greek (subprocess complain) alex23 <wuwei23@gmail.com> - 2013-06-07 02:41 -0700
    Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-07 04:53 -0700
      Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-07 15:29 +0100
        Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-07 11:52 -0700
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Zero Piraeus <schesis@gmail.com> - 2013-06-07 15:31 -0400
          Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-07 21:45 +0100
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Zero Piraeus <schesis@gmail.com> - 2013-06-07 19:24 -0400
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-08 12:52 +1000
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-07 23:49 -0700
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-08 16:58 +1000
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-08 07:26 +0000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-08 17:40 +1000
              Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-08 17:32 +0100
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 09:53 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 10:35 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-08 18:48 +0100
      Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-07 15:33 +0000
      Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-08 12:49 +1000
      Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 21:01 +0300
        Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-08 19:01 +0000
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 14:14 -0700
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 08:32 +1000
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 07:46 +0300
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 06:25 +0000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 18:02 +1000
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:03 -0700
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 14:21 -0700
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-09 08:10 +1000
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 01:11 -0700
      Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-09 04:47 +1000
        Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-08 22:09 -0700
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 06:45 +0000
            Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-09 00:00 -0700
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 08:15 +0000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:14 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 03:32 -0700
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 19:16 +1000
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 12:36 +0000
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-09 10:25 -0700
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Lele Gaifax <lele@metapensiero.it> - 2013-06-09 10:55 +0200
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:08 -0700
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Lele Gaifax <lele@metapensiero.it> - 2013-06-09 11:20 +0200
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:38 -0700
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Andreas Perstinger <andipersti@gmail.com> - 2013-06-09 14:24 +0200
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 13:13 +0000
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Benjamin Kaplan <benjamin.kaplan@case.edu> - 2013-06-09 13:05 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:42 -0700
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 03:37 -0700
                      Re: Changing filenames from Greeklish => Greek (subprocess complain) Larry Hudson <orgnut@yahoo.com> - 2013-06-10 00:51 -0700
                        Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 01:11 -0700
                          Re: Changing filenames from Greeklish => Greek (subprocess complain) Larry Hudson <orgnut@yahoo.com> - 2013-06-11 00:20 -0700
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 11:50 +0000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 05:18 -0700
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:00 -0700
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 19:12 +1000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:20 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Benjamin Kaplan <benjamin.kaplan@case.edu> - 2013-06-09 13:01 -0700
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 12:31 +0000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-10 00:10 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Andreas Perstinger <andipersti@gmail.com> - 2013-06-10 10:15 +0200
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 01:54 -0700
                      Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 02:59 -0700
                        Re: Changing filenames from Greeklish => Greek (subprocess complain) Andreas Perstinger <andipersti@gmail.com> - 2013-06-10 12:42 +0200
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-10 11:59 +0000
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 07:27 -0700
                      Re: Changing filenames from Greeklish => Greek (subprocess complain) jmfauth <wxjmfauth@gmail.com> - 2013-06-10 12:48 -0700
                        Re: Changing filenames from Greeklish => Greek (subprocess complain) Ned Batchelder <ned@nedbatchelder.com> - 2013-06-10 13:28 -0700



#47433

From: Lele Gaifax <lele@metapensiero.it>
Date: 2013-06-09 10:55 +0200
Message-ID: <mailman.2910.1370768144.3114.python-list@python.org>
In reply to: #47428
Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:

> On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:
>
>> chr('A') would give me the mapping of this char, the number 65 while
>> ord(65) would output the char 'A' likewise.
>
> Correct. Python uses Unicode, where code-point 65 ("ordinal value 65") 
> means letter "A".

Actually, that's the other way around:

    >>> chr(65)
    'A'
    >>> ord('A')
    65

>> What would happen if we try to re-encode bytes on the disk? Like
>> trying:
>> 
>> s = "νίκος"
>> utf8_bytes = s.encode('utf-8')
>> greek_bytes = utf8_bytes.encode('iso-8859-7')
>> 
>> Can we re-encode twice, or as many times as we want, and then decode
>> back respectively, like that?
>
> Of course. Bytes have no memory of where they came from, or what they are 
> used for. All you are doing is flipping bits on a memory chip, or on a 
> hard drive. So long as *you* remember which encoding is the right one, 
> there is no problem. If you forget, and start using the wrong one, you 
> will get garbage characters, mojibake, or errors.

Uhm, no: "encode" transforms a Unicode string into an array of bytes,
"decode" does the opposite transformation. You cannot do the former on
an "arbitrary" array of bytes:

    >>> s = "νίκος"
    >>> utf8_bytes = s.encode('utf-8')
    >>> greek_bytes = utf8_bytes.encode('iso-8859-7')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'bytes' object has no attribute 'encode'

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
lele@metapensiero.it  |                 -- Fortunato Depero, 1929.



#47436

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-09 02:08 -0700
Message-ID: <9a0ea98b-f37b-48da-9933-e2caf6fdfdff@googlegroups.com>
In reply to: #47433
On Sunday, 9 June 2013 11:55:43 AM UTC+3, Lele Gaifax wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>
> > On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:
> >
> >> chr('A') would give me the mapping of this char, the number 65 while
> >> ord(65) would output the char 'A' likewise.
> >
> > Correct. Python uses Unicode, where code-point 65 ("ordinal value 65")
> > means letter "A".
>
> Actually, that's the other way around:
>
>     >>> chr(65)
>     'A'
>     >>> ord('A')
>     65
>
> >> What would happen if we try to re-encode bytes on the disk? Like
> >> trying:
> >>
> >> s = "νίκος"
> >> utf8_bytes = s.encode('utf-8')
> >> greek_bytes = utf8_bytes.encode('iso-8859-7')
> >>
> >> Can we re-encode twice, or as many times as we want, and then decode
> >> back respectively, like that?
> >
> > Of course. Bytes have no memory of where they came from, or what they are
> > used for. All you are doing is flipping bits on a memory chip, or on a
> > hard drive. So long as *you* remember which encoding is the right one,
> > there is no problem. If you forget, and start using the wrong one, you
> > will get garbage characters, mojibake, or errors.
>
> Uhm, no: "encode" transforms a Unicode string into an array of bytes,
> "decode" does the opposite transformation. You cannot do the former on
> an "arbitrary" array of bytes:
>
>     >>> s = "νίκος"
>     >>> utf8_bytes = s.encode('utf-8')
>     >>> greek_bytes = utf8_bytes.encode('iso-8859-7')
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     AttributeError: 'bytes' object has no attribute 'encode'

So something encoded into bytes cannot be re-encoded to some other bytes.

How about a string, I wonder?
s = "νίκος"
what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')



#47441

From: Lele Gaifax <lele@metapensiero.it>
Date: 2013-06-09 11:20 +0200
Message-ID: <mailman.2913.1370769662.3114.python-list@python.org>
In reply to: #47436
Νικόλαος Κούρας <nikos.gr33k@gmail.com> writes:

> On Sunday, 9 June 2013 11:55:43 AM UTC+3, Lele Gaifax wrote:
>> Uhm, no: "encode" transforms a Unicode string into an array of bytes,
>> "decode" does the opposite transformation. You cannot do the former on
>> an "arbitrary" array of bytes:
>> 
>>     >>> s = "νίκος"
>>     >>> utf8_bytes = s.encode('utf-8')
>>     >>> greek_bytes = utf8_bytes.encode('iso-8869-7')
>>     Traceback (most recent call last):
>>       File "<stdin>", line 1, in <module>
>>     AttributeError: 'bytes' object has no attribute 'encode'
>
> So something encoded into bytes cannot be re-encoded to some other bytes.
>
> How about a string i wonder?
> s = "νίκος"
> what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')

Ignoring the usual syntax error, this is just a variant of the code I
posted: "s.encode('iso-8859-7')" produces a bytes instance which
*cannot* be "re-encoded" again in whatever encoding.

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
lele@metapensiero.it  |                 -- Fortunato Depero, 1929.



#47442

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-09 02:38 -0700
Message-ID: <7e01dc4a-ffc0-43ce-8d6b-8bc069a63f19@googlegroups.com>
In reply to: #47441
On Sunday, 9 June 2013 12:20:58 PM UTC+3, Lele Gaifax wrote:

> > How about a string i wonder? 
> > s = "νίκος" 
> > what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')

> Ignoring the usual syntax error, this is just a variant of the code I 
> posted: "s.encode('iso-8869-7')" produces a bytes instance which
> *cannot* be "re-encoded" again in whatever encoding.

s = 'a'
s = s.encode('iso-8859-7').decode('utf-8')
print( s )

a (we got the original character back)
================================
s = 'α'
s = s.encode('iso-8859-7').decode('utf-8')
print( s )

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: unexpected end of data

Why this error? Is it because the ordinal value of 'α' is > 127?



#47455

From: Andreas Perstinger <andipersti@gmail.com>
Date: 2013-06-09 14:24 +0200
Message-ID: <mailman.2916.1370780648.3114.python-list@python.org>
In reply to: #47442
On 09.06.2013 11:38, Νικόλαος Κούρας wrote:
> s = 'α'
> s = s.encode('iso-8859-7').decode('utf-8')
> print( s )
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: unexpected end of data
>
> Why this error? Is it because the ordinal value of 'α' is > 127?

 >>> s = 'α'
 >>> s.encode('iso-8859-7')
b'\xe1'
 >>> bin(0xe1)
'0b11100001'

Now look at the table on https://en.wikipedia.org/wiki/UTF-8#Description 
to find out how many bytes a UTF-8 decoder expects when it reads that value.
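
The lookup in that table can be sketched in a few lines of Python. This is only an illustration: the helper name `utf8_expected_length` is made up for this example, and it is deliberately simplified (it checks the leading bit pattern only, and does not reject lead bytes such as 0xC0, 0xC1 or 0xF5-0xF7 that a real decoder also refuses).

```python
def utf8_expected_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence starting with lead_byte spans.

    Returns 0 for bytes that cannot start a sequence in this simplified
    model (continuation bytes 10xxxxxx and the values 0xF8-0xFF).
    """
    if lead_byte < 0x80:            # 0xxxxxxx: ASCII, stands alone
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: lead byte of a 2-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: lead byte of a 3-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: lead byte of a 4-byte sequence
        return 4
    return 0


raw = 'α'.encode('iso-8859-7')      # the single byte b'\xe1'
expected = utf8_expected_length(raw[0])
print(f"{raw[0]:#04x} = {raw[0]:08b}: decoder expects {expected} bytes, got {len(raw)}")
```

The byte 0xe1 starts with the bits 1110, so a UTF-8 decoder waits for two more bytes that never arrive, which matches the "unexpected end of data" message.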

Bye, Andreas



#47461

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-06-09 13:13 +0000
Message-ID: <51b47f82$0$30001$c3e8da3$5496439d@news.astraweb.com>
In reply to: #47442
On Sun, 09 Jun 2013 02:38:13 -0700, Νικόλαος Κούρας wrote:

> s = 'α'
> s = s.encode('iso-8859-7').decode('utf-8')
> 
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0:
> unexpected end of data
> 
> Why this error? Is it because the ordinal value of 'α' is > 127?


Look at it this way... consider encoding and decoding to be like 
translating from one language to another.

Suppose you start with the English word "street". You encode it to German 
by looking it up in an English-To-German dictionary:

street -> Straße

Then you decode the German by looking "Straße" up in a German-To-English 
dictionary:

Straße -> street

and everything is good. But suppose that after encoding the English to 
German, you get confused, and think that it is Italian, not German. So 
when it comes to decoding, you try to look up 'Straße' in an 
Italian-To-English dictionary, and discover that there is no such thing 
as letter ß in Italian. So you cannot look the word up, and you get 
frustrated and shout "this is rubbish, there's no such thing as ß, 
that's not a letter!"

Not in Italian, but it is a perfectly good letter in German. But you're 
looking it up in the wrong dictionary.

Same thing with UTF-8. You encoded the string 'α' by looking it up in the 
"Unicode To ISO-8859-7 bytes" dictionary. Then you try to decode it by 
looking for those bytes in the "UTF-8 bytes To Unicode" dictionary. But 
you can't find byte 0xe1 on its own in UTF-8 bytes, so Python shouts 
"this is rubbish, there's no such thing as 0xe1 on its own in UTF-8!" and 
raises UnicodeDecodeError.


Sometimes you don't get an exception. Suppose that you are encoding from 
French to German:

qui -> die  (both words mean "who" in English)


Now if you get confused, and decode the word 'die' by looking it up in an 
English-To-French dictionary, instead of German-To-French, you get:

die -> mourir

So instead of getting 'qui' back again, you get 'mourir'. This is like 
mojibake: the results are garbage, but there is no exception raised to 
warn you.
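
Both outcomes of using the wrong "dictionary" can be reproduced in Python 3. The following is a small sketch using the same character 'α' as in the question; the variable names are only illustrative.

```python
s = 'α'

# Wrong "dictionary" that fails loudly: the ISO-8859-7 byte for 'α'
# is not a complete UTF-8 sequence, so decoding raises an exception.
try:
    s.encode('iso-8859-7').decode('utf-8')
    outcome = 'decoded'
except UnicodeDecodeError as exc:
    outcome = f'error: {exc.reason}'

# Wrong "dictionary" that fails silently: each UTF-8 byte of 'α' happens
# to be a valid ISO-8859-7 character, so the result is mojibake.
garbled = s.encode('utf-8').decode('iso-8859-7')

# Matching "dictionaries" round-trip cleanly.
round_trip = s.encode('iso-8859-7').decode('iso-8859-7')

print(outcome)
print(repr(garbled), repr(round_trip))
```

The first case is the 'Straße' situation (an exception), the second is the 'die' -> 'mourir' situation (garbage with no warning).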


-- 
Steven



#47486

From: Benjamin Kaplan <benjamin.kaplan@case.edu>
Date: 2013-06-09 13:05 -0700
Message-ID: <mailman.2933.1370808420.3114.python-list@python.org>
In reply to: #47442
On Sun, Jun 9, 2013 at 2:38 AM, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
> On Sunday, 9 June 2013 12:20:58 PM UTC+3, Lele Gaifax wrote:
>
>> > How about a string i wonder?
>> > s = "νίκος"
>> > what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')
>
>> Ignoring the usual syntax error, this is just a variant of the code I
>> posted: "s.encode('iso-8869-7')" produces a bytes instance which
>> *cannot* be "re-encoded" again in whatever encoding.
>
> s = 'a'
> s = s.encode('iso-8859-7').decode('utf-8')
> print( s )
>
> a (we got the original character back)
> ================================
> s = 'α'
> s = s.encode('iso-8859-7').decode('utf-8')
> print( s )
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: unexpected end of data
>
> Why this error? Is it because the ordinal value of 'α' is > 127?

No. You get that error because the bytes are not encoded in UTF-8.
They're encoded in ISO-8859-7. For ASCII strings (ord(x) < 128),
ISO-8859-7 and UTF-8 look exactly the same. For anything else, they
are different. If you were to try to decode it as ISO-8859-1, it would
succeed, but you would get the character "á" back instead of α.

You're misunderstanding the decode function. Decode doesn't turn it
into a string with the specified encoding. It takes bytes *in* the
specified encoding and turns them into Python's internal string
representation. In Python 3.3, that internal representation doesn't
even have a name because it's not a standard encoding. So you want the
decode argument to match the encode argument.



#47443

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-09 02:42 -0700
Message-ID: <4500f6f7-2296-4320-b6b9-dbc71c732500@googlegroups.com>
In reply to: #47441
s = 'a'
s = s.encode('utf-8').decode('iso-8859-7')
print ( s )

a
==========================

s = 'α'
s = s.encode('utf-8').decode('iso-8859-7')
print ( s )

Ξ±
==========================
Is the above a garbage character? Where did it come from?



#47446

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-09 03:37 -0700
Message-ID: <82414c38-cec0-404d-8d5d-435fed0750c7@googlegroups.com>
In reply to: #47443
I know I have been a pain in the ass these days, but this is the last explanation I want from you, just to understand it completely.

>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
>> values up to 256? 

>Because then how do you tell when you need one byte, and when you need 
>two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
>characters, with ordinal values 0x4C and 0xFA, or one character with 
>ordinal value 0x4CFA? 

I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.


>> UTF-8 and UTF-16 and UTF-32 
>> I thought the number beside of UTF- was to declare how many bits the 
>> character set was using to store a character into the hdd, no? 

>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. 
>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit 
>values to make a surrogate pair.

Is a surrogate pair like hitting, for example, Ctrl-A, meaning a combination character that consists of 2 different characters?
Is this what a surrogate is? A pair of 2 chars?


>UTF-8 uses 8-bit values, but sometimes 
>it combines two, three or four of them to represent a single code-point.

Does 'a', utf-8 encoded, need 1 byte to be stored? (since ordinal = 97)
Does 'α', utf-8 encoded, need 2 bytes to be stored? (since ordinal is > 127)
Does a Chinese ideogram, utf-8 encoded, need 4 bytes to be stored? (since ordinal > 65000)

Does the number of bytes needed to store a character depend solely on the character's ordinal value in the Unicode table?


>UTF-8 solves this problem by reserving some values to mean "this byte, on 
>its own", and others to mean "this byte, plus the next byte, together", 
>and so forth, up to four bytes.

So some of the utf-8 bits that are used to represent a character's ordinal value are also being used to separate or join the ordinal values themselves?
Can you give an example please? How are they being separated?


>Computers are digital and work with numbers.


So character 'A' <-> 65 (in decimal, as used in the charset's table)  <-> 01011100 (as binary, stored on disk) <-> 0xEF (as hex, when we open the file with a hex editor)

Is this how the thing works? (above values are fictional)



#47526

From: Larry Hudson <orgnut@yahoo.com>
Date: 2013-06-10 00:51 -0700
Message-ID: <__GdnVhoRbgbGCjMnZ2dnUVZ_qqdnZ2d@giganews.com>
In reply to: #47446
On 06/09/2013 03:37 AM, Νικόλαος Κούρας wrote:

>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>
NO!!

0 - 127, yes.
128 - 255 -> one byte of a multibyte code.

That's why the decode fails, it sees it as incomplete data so it can't do anything with it.

>
> Is a surrogate pair like hitting, for example, Ctrl-A, meaning a combination character that consists of 2 different characters?
> Is this what a surrogate is? A pair of 2 chars?
>
You're confusing character encodings with the way NON-CHARACTER keys on the KEYBOARD are encoded 
(function keys, arrow keys and such).  These are NOT text characters but KEYBOARD key codes. 
These are NOT text codes and are entirely different and not related to any character encoding. 
How programs interpret and use these codes depends entirely on the individual programs.  There 
are common conventions on how many are used, but there are no standards.

Also the control-codes are the first 32 values of the ASCII (and ASCII-compatible) character set 
and are not multi-character key codes like the keyboard non-character keys.

However, there are a few keyboard keys that actually produce control-codes.  A few examples:

Return/Enter -> Ctrl-M
Tab -> Ctrl-I
Backspace -> Ctrl-H
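
The three pairs above can be checked from Python. This is a rough sketch that assumes the conventional ASCII mapping (Ctrl plus a letter gives the letter's code with bit 0x40 cleared); what a given keyboard or terminal really sends for a key can differ, and the `keys` list is made up for the example.

```python
# Each entry: (key name, character the key conventionally sends, Ctrl-letter).
keys = [("Return/Enter", "\r", "M"), ("Tab", "\t", "I"), ("Backspace", "\b", "H")]

for name, char, letter in keys:
    # Ctrl-<letter> is the letter's ASCII code with bit 0x40 cleared.
    ctrl_code = ord(letter) & ~0x40
    print(f"{name:13} code {ord(char):2}  Ctrl-{letter} -> {ctrl_code:2}")
```

All of these values are below 32, so they fall in the control-code range of ASCII.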

>
> So character 'A' <-> 65 (in decimal, as used in the charset's table)  <-> 01011100 (as binary, stored on disk) <-> 0xEF (as hex, when we open the file with a hex editor)
>
You are trying to put too much meaning to this.  The value stored on disk, in memory, or 
whatever is binary bits, nothing else.  How you describe the value, in decimal, in octal, in 
hex, in base-12, or... is totally irrelevant.  These are simply different ways of describing or 
naming these numeric values.

It's the same as saying 3 in English is three, in Spanish is tres, in German is drei...  (I 
don't know Greek, sorry.)  No matter what you call it, it is still the numeric integer value 
that is between 2 and 4.
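
The same idea in Python, as a small sketch with the real values for 'A' rather than the fictional ones in the question: decimal, binary, octal and hex are four spellings of one integer.

```python
n = ord('A')                       # the code point of 'A'

# Four spellings of one and the same integer.
print(n, bin(n), oct(n), hex(n))   # decimal, binary, octal, hexadecimal

# Parsing each spelling gives back the identical value.
same = 65 == 0b1000001 == 0o101 == 0x41 == int('41', 16)
print(same, chr(0x41))
```

A hex editor showing 41 and a table listing 65 are describing the same stored bits.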



#47530

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-10 01:11 -0700
Message-ID: <ebd7c249-1c6d-4807-a5d1-9dc3f8006cc3@googlegroups.com>
In reply to: #47526
On Monday, 10 June 2013 10:51:34 AM UTC+3, Larry Hudson wrote:

> > I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.

> 0 - 127, yes.
> 128 - 255 -> one byte of a multibyte code.

You mean that in utf-8, for 1 character to be stored, we need 2 bytes?
I am still having trouble understanding this.

Since 2^8 = 256, utf-8 would need 1 byte to store the 1st 256 characters, but instead it's using 1 byte up to the first 127 values and then 2 bytes for anything above.  Why?



#47645

From: Larry Hudson <orgnut@yahoo.com>
Date: 2013-06-11 00:20 -0700
Message-ID: <sKadnX8GT48pUivMnZ2dnUVZ_h2dnZ2d@giganews.com>
In reply to: #47530
On 06/10/2013 01:11 AM, Νικόλαος Κούρας wrote:
> On Monday, 10 June 2013 10:51:34 AM UTC+3, Larry Hudson wrote:
>
>>> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>
>> 0 - 127, yes.
>> 128 - 255 -> one byte of a multibyte code.
>
> You mean that in utf-8, for 1 character to be stored, we need 2 bytes?
> I am still having trouble understanding this.
>
Utf-8 characters are encoded in different sizes, NOT a single fixed number of bytes.
The high _bits_ of the first byte define the number of bytes of the individual character code.

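(I'm copying this from Wikipedia...)
0xxxxxxx -> 1 byte
110xxxxx -> 2 bytes
1110xxxx -> 3 bytes
11110xxx -> 4 bytes
111110xx -> 5 bytes
1111110x -> 6 bytes

(The 5- and 6-byte forms come from the original UTF-8 design. RFC 3629 
restricts UTF-8 to at most 4 bytes, which is enough for every Unicode 
code point.)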

Notice that in the 1-byte version, since the high bit is always 0, only 7 bits are available for 
the character code, and this is the standard 0-127 ASCII (and ASCII-compatible) code set.

> Since 2^8 = 256, utf-8 would need 1 byte to store the 1st 256 characters, but instead it's using 1 byte up to the first 127 values and then 2 bytes for anything above.  Why?
>
As I indicated above, one bit is reserved as a flag to indicate that the code is a one-byte code 
and not a multibyte code, so only 7 bits are available for the actual 1-byte (ASCII) code.
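
To see those leading-bit patterns on real characters, here is a short Python 3 sketch. The four sample characters are arbitrary picks for illustration (an ASCII letter, a Greek letter, a CJK ideograph, and an emoji outside the 16-bit range).

```python
samples = ['a', 'α', '中', '\U0001F600']   # ASCII, Greek, CJK, emoji

for ch in samples:
    encoded = ch.encode('utf-8')
    # Show every byte as 8 binary digits so the leading bits are visible.
    bits = ' '.join(f'{b:08b}' for b in encoded)
    print(f'U+{ord(ch):04X}  {len(encoded)} byte(s)  {bits}')

lengths = [len(ch.encode('utf-8')) for ch in samples]
```

In the printed bit strings, the first byte of each character starts with 0, 110, 1110 or 11110, and every following byte starts with 10, which marks it as a continuation byte.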



#47450

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-06-09 11:50 +0000
Message-ID: <51b46bea$0$30001$c3e8da3$5496439d@news.astraweb.com>
In reply to: #47433
On Sun, 09 Jun 2013 10:55:43 +0200, Lele Gaifax wrote:

> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> 
>> On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:
>>
>>> chr('A') would give me the mapping of this char, the number 65 while
>>> ord(65) would output the char 'A' likewise.
>>
>> Correct. Python uses Unicode, where code-point 65 ("ordinal value 65")
>> means letter "A".
> 
> Actually, that's the other way around:
> 
>     >>> chr(65)
>     'A'
>     >>> ord('A')
>     65

/facepalm 

Of course you are right.


>>> What would happen if we try to re-encode bytes on the disk? Like
>>> trying:
>>> 
>>> s = "νίκος"
>>> utf8_bytes = s.encode('utf-8')
>>> greek_bytes = utf8_bytes.encode('iso-8859-7')
>>> 
>>> Can we re-encode twice, or as many times as we want, and then decode
>>> back respectively, like that?
>>
>> Of course. [...]

> Uhm, no: "encode" transforms a Unicode string into an array of bytes,
> "decode" does the opposite transformation. You cannot do the former on
> an "arbitrary" array of bytes:

And two for two. I misunderstood Nikos' question.

As you point out, no, Python 3 will not allow you to re-encode bytes. You 
must first decode them to a string, then encode them using a 
different encoding. (I thought that this was what Nikos actually meant, 
but on re-reading his question more closely, that's not actually what 
he asked.)
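
For reference, the decode-then-encode route looks roughly like this in Python 3. It is a minimal sketch reusing the names from the question; the intermediate variable `text` is added here for clarity.

```python
s = "νίκος"
utf8_bytes = s.encode('utf-8')

# bytes has no .encode(); decode back to str first, then encode again.
text = utf8_bytes.decode('utf-8')
greek_bytes = text.encode('iso-8859-7')

print(len(utf8_bytes), len(greek_bytes))
print(greek_bytes.decode('iso-8859-7') == s)
```

The two byte strings differ in length and content, yet each decodes back to the same text as long as the matching codec is used.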

Sorry for any confusion.


-- 
Steven



#47453

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-09 05:18 -0700
Message-ID: <cfa80b5f-879f-4276-89ac-d1900ccc8c0f@googlegroups.com>
In reply to: #47450
Please tell me that this can actually be solved.
I am willing to try anything for 'files.py' to load properly.
Everything works as expected on my website; I have managed to correct pelatologio.py and koukos.py.

This is the last thing the website needs, that is, files.py to load so users can grab important files in Greek format.



#47434

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-09 02:00 -0700
Message-ID: <5b0d3d7c-e3a4-436d-a55f-26bd40064fd5@googlegroups.com>
In reply to: #47428
Steven wrote:
>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
>> values up to 256? 

>Because then how do you tell when you need one byte, and when you need 
>two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
>characters, with ordinal values 0x4C and 0xFA, or one character with 
>ordinal value 0x4CFA? 

I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.


>> UTF-8 and UTF-16 and UTF-32 
>> I thought the number beside UTF- was to declare how many bits the 
>> character set was using to store a character on the hdd, no? 

>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. 
>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit 
>values to make a surrogate pair.

Is a surrogate pair like hitting, for example, Ctrl-A, meaning it is a combination character that consists of 2 different characters?
Is this what a surrogate is? A pair of 2 chars?


>UTF-8 uses 8-bit values, but sometimes 
>it combines two, three or four of them to represent a single code-point.

'a' to be utf8 encoded needs 1 byte to be stored? (since ordinal = 97)
'α΄' to be utf8 encoded needs 2 bytes to be stored? (since ordinal is > 127)
'a Chinese ideogram' to be utf8 encoded needs 4 bytes to be stored? (since ordinal > 65000)

The number of bytes needed to store a character depends solely on the character's ordinal value in the Unicode table?


>UTF-8 solves this problem by reserving some values to mean "this byte, on 
>its own", and others to mean "this byte, plus the next byte, together", 
>and so forth, up to four bytes.

Are some of the UTF-8 bits that are used to represent a character's ordinal value also used to separate or join the ordinal values themselves?
Can you give an example, please? How are they being separated?


>Computers are digital and work with numbers.


So character 'A' <-> 65 (in decimal, as used in the charset's table) <-> 01011100 (as binary, stored on disk) <-> 0xEF (as hex, when we open the file with a hex editor)

Is this how the thing works? (above values are fictional)



#47437

From: Cameron Simpson <cs@zip.com.au>
Date: 2013-06-09 19:12 +1000
Message-ID: <mailman.2911.1370769172.3114.python-list@python.org>
In reply to: #47434

On 09Jun2013 02:00, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
| Steven wrote:
| >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
| >> values up to 256? 
| 
| >Because then how do you tell when you need one byte, and when you need 
| >two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
| >characters, with ordinal values 0x4C and 0xFA, or one character with 
| >ordinal value 0x4CFA? 
| 
| I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.

Then it would not be UTF-8. UTF-8 will encode any Unicode code point. Your suggestion will not.

I'd point out that if you did this, you'd be back in the same
situation you just encountered with ASCII: the first above-255 value
would raise a UnicodeEncodeError (an error which UTF-8 does not give
you for ordinary text at present :-)
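
A small sketch of that failure mode, using Latin-1 as a stand-in for the proposed "one byte for the first 256 code points" scheme. The sample string is illustrative:

```python
text = 'Aα'   # 'A' is code point 65, 'α' is code point 945

# Latin-1 maps code points 0..255 to single bytes and nothing else.
try:
    text.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError as exc:
    latin1_ok = False
    print('latin-1 failed:', exc.reason)

# UTF-8 encodes both characters; only 'α' needs a second byte.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)
```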

| >> UTF-8 and UTF-16 and UTF-32 
| >> I thought the number beside UTF- was to declare how many bits the 
| >> character set was using to store a character on the hdd, no? 
| 
| >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. 
| >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit 
| >values to make a surrogate pair.
| 
| Is a surrogate pair like hitting, for example, Ctrl-A, meaning it is a combination character that consists of 2 different characters?
| Is this what a surrogate is? A pair of 2 chars?

Essentially. The combination represents a code point.
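
As a rough illustration (the code point U+1F600 is picked arbitrarily as something above U+FFFF), the two 16-bit units can be pulled apart and recombined like this:

```python
ch = '\U0001F600'   # a code point above U+FFFF, chosen for illustration

# UTF-16 big-endian without a byte-order mark: two 16-bit units here.
units = ch.encode('utf-16-be')
high = int.from_bytes(units[:2], 'big')
low = int.from_bytes(units[2:], 'big')
print(hex(high), hex(low))

# Recombine the pair into the original code point.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(code_point), code_point == ord(ch))
```

Neither unit means anything on its own; only the pair identifies the code point.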

| >UTF-8 uses 8-bit values, but sometimes 
| >it combines two, three or four of them to represent a single code-point.
| 
| 'a' to be utf8 encoded needs 1 byte to be stored? (since ordinal = 97)
| 'α΄' to be utf8 encoded needs 2 bytes to be stored? (since ordinal is > 127)
| 'a Chinese ideogram' to be utf8 encoded needs 4 bytes to be stored? (since ordinal > 65000)
| 
| The number of bytes needed to store a character depends solely on the character's ordinal value in the Unicode table?

Essentially. You can read up on the exact process in Wikipedia or the Unicode Standard.

Cheers,
-- 
Cameron Simpson <cs@zip.com.au>

The most annoying thing about being without my files after our disc crash was
discovering once again how widespread BLINK was on the web.



#47440

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-09 02:20 -0700
Message-ID: <8471f19b-e21a-4859-9842-92a97d75a840@googlegroups.com>
In reply to: #47437

On Sunday, 9 June 2013 12:12:36 PM UTC+3, Cameron Simpson wrote:
> On 09Jun2013 02:00, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
> | Steven wrote:
> | >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
> | >> values up to 256?
> |
> | >Because then how do you tell when you need one byte, and when you need
> | >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
> | >characters, with ordinal values 0x4C and 0xFA, or one character with
> | >ordinal value 0x4CFA?
> |
> | I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>
> Then it would not be UTF-8. UTF-8 will encode any Unicode code point. Your suggestion will not.

I don't follow.

> | >> UTF-8 and UTF-16 and UTF-32
> | >> I thought the number beside UTF- was to declare how many bits the
> | >> character set was using to store a character on the hdd, no?
> |
> | >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
> | >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
> | >values to make a surrogate pair.
> |
> | Is a surrogate pair like hitting, for example, Ctrl-A, meaning it is a combination character that consists of 2 different characters?
> | Is this what a surrogate is? A pair of 2 chars?
>
> Essentially. The combination represents a code point.
>
> | >UTF-8 uses 8-bit values, but sometimes
> | >it combines two, three or four of them to represent a single code-point.
> |
> | 'a' to be utf8 encoded needs 1 byte to be stored? (since ordinal = 97)
> | 'α΄' to be utf8 encoded needs 2 bytes to be stored? (since ordinal is > 127)
> | 'a Chinese ideogram' to be utf8 encoded needs 4 bytes to be stored? (since ordinal > 65000)
> |
> | The number of bytes needed to store a character depends solely on the character's ordinal value in the Unicode table?
>
> Essentially. You can read up on the exact process in Wikipedia or the Unicode Standard.

When you say "essentially", does that mean you agree with my statements?



#47487

From: Benjamin Kaplan <benjamin.kaplan@case.edu>
Date: 2013-06-09 13:01 -0700
Message-ID: <mailman.2934.1370808466.3114.python-list@python.org>
In reply to: #47440

On Sun, Jun 9, 2013 at 2:20 AM, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
> On Sunday, 9 June 2013 12:12:36 PM UTC+3, Cameron Simpson wrote:
>> On 09Jun2013 02:00, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
>> | Steven wrote:
>> | >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>> | >> values up to 256?
>> |
>> | >Because then how do you tell when you need one byte, and when you need
>> | >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>> | >characters, with ordinal values 0x4C and 0xFA, or one character with
>> | >ordinal value 0x4CFA?
>> |
>> | I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>>
>> Then it would not be UTF-8. UTF-8 will encode any Unicode code point. Your suggestion will not.
>
> I don't follow.

The point of the UTF formats is that they can encode any of the 1.1
million code points available in Unicode. Your suggestion can only
encode 256 code points. We have that encoding already: it's called
Latin-1, and it can't encode any of your Greek characters (which is why
ISO-8859-7 exists; it can encode the Greek characters but not the
accented Latin ones).

If you were to use the whole byte to store the first 256 characters,
you wouldn't be able to store anything from character number 256 up,
because the computer wouldn't be able to tell the difference between
character 257 (0x01 0x01) and two chr(1)s. UTF-8 gets around this by
reserving the top bit as an "am I part of a multibyte sequence" flag.
>> | >> UTF-8 and UTF-16 and UTF-32
>> | >> I thought the number beside UTF- was to declare how many bits the
>> | >> character set was using to store a character on the hdd, no?
>> |
>> | >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>> | >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>> | >values to make a surrogate pair.
>> |
>> | Is a surrogate pair like hitting, for example, Ctrl-A, meaning it is a combination character that consists of 2 different characters?
>> | Is this what a surrogate is? A pair of 2 chars?
>>
>> Essentially. The combination represents a code point.
>>
>> | >UTF-8 uses 8-bit values, but sometimes
>> | >it combines two, three or four of them to represent a single code-point.
>> |
>> | 'a' to be utf8 encoded needs 1 byte to be stored? (since ordinal = 97)
>> | 'α΄' to be utf8 encoded needs 2 bytes to be stored? (since ordinal is > 127)
>> | 'a Chinese ideogram' to be utf8 encoded needs 4 bytes to be stored? (since ordinal > 65000)
>> |
>> | The number of bytes needed to store a character depends solely on the character's ordinal value in the Unicode table?
>>
>> Essentially. You can read up on the exact process in Wikipedia or the Unicode Standard.
>
> When you say "essentially", does that mean you agree with my statements?

In UTF-8 or UTF-16, the number of bytes required for the character is
dependent on its code point, yes. That isn't the case for UTF-32,
where every character uses exactly four bytes.
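
To make that concrete, here is a short sketch that counts the encoded bytes for a few code points. The sample characters are my own picks, and the `-be` codecs are used so that no byte-order mark is added to the counts:

```python
# Illustrative code points in increasing order: U+0041, U+03B1, U+2125, U+20000.
samples = ['A', '\u03b1', '\u2125', '\U00020000']
encodings = ('utf-8', 'utf-16-be', 'utf-32-be')

# Map each character to its encoded length under each encoding.
sizes = {ch: tuple(len(ch.encode(enc)) for enc in encodings) for ch in samples}

for ch, per_encoding in sizes.items():
    print(f'U+{ord(ch):04X}', dict(zip(encodings, per_encoding)))
```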



#47457

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-06-09 12:31 +0000
Message-ID: <51b475b0$0$30001$c3e8da3$5496439d@news.astraweb.com>
In reply to: #47434

On Sun, 09 Jun 2013 02:00:46 -0700, Νικόλαος Κούρας wrote:

> Steven wrote:
>>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>>> values up to 256?
> 
>>Because then how do you tell when you need one byte, and when you need
>>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>>characters, with ordinal values 0x4C and 0xFA, or one character with
>>ordinal value 0x4CFA?
> 
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I
> meant up to 256, not above 256.

Think about it. Draw up a big table of one million plus characters:

Ordinal   Character
========  ====================
0         NUL control code
1         SOH control code
...
84        LATIN CAPITAL LETTER T
85        LATIN CAPITAL LETTER U
...
255       LATIN SMALL LETTER Y WITH DIAERESIS
256       LATIN CAPITAL LETTER A WITH MACRON
...
8485      OUNCE SIGN


and so on, all the way to 1114111. Now, suppose you read a file, and see 
two bytes, shown in decimal: 84 followed by 85, or in hexadecimal, 0x54 
followed by 0x55.

How do you tell whether that means two characters, T followed by U, or a 
single character with ordinal value 0x5455 (decimal 21589)?

With UTF-32, you can, because every value takes exactly the same space. 
So a T followed by a U is:

0x00000054
0x00000055

while a single ℥ is:

0x00002125

and it is easy to tell them apart: each block of 4 bytes is exactly one 
character. But notice how many NUL bytes there are? In the three 
characters shown, there are eight NUL bytes. Most text will be filled 
with NUL bytes, which is very wasteful.
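
The byte counts above can be checked with a short sketch. It assumes the big-endian codec so that no byte-order mark is included in the count:

```python
data = 'TU\u2125'.encode('utf-32-be')   # T, U and OUNCE SIGN, no byte-order mark

total_bytes = len(data)
nul_bytes = data.count(b'\x00')
print(total_bytes, nul_bytes)
```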

UTF-8 is designed to be compact, and also to be backwards-compatible with 
ASCII. Characters which are in ASCII will be a single byte, so there are 
no null bytes used for padding, (except for NUL itself, of course). So 
the three characters TU℥ will be:

0x54
0x55
0xE2
0x84
0xA5

Five bytes in total, instead of 12 for UTF-32. But the only tricky part 
is that character with ordinal value 0xE2 (decimal 226, â) cannot be 
encoded as the single byte 0xE2, otherwise you would mistake the three 
bytes 0xE284A5 as starting with 'â' followed by two more characters. And 
indeed, 'â' is encoded as two bytes:

0xC3
0xA2

Likewise, character with ordinal value 0xC3 (decimal 195, Ã) is also 
encoded as two bytes:

0xC3
0x83

And so on. This way, there is never any confusion as to whether (say) 
three bytes are three one-byte characters, or one three-byte character.
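
A quick way to confirm these byte sequences in Python 3 (a sketch; the characters are written as escapes so the source file's own encoding does not matter):

```python
# 'TU' + OUNCE SIGN, then the two Latin-1 letters discussed above.
for text in ('TU\u2125', '\u00e2', '\u00c3'):
    encoded = text.encode('utf-8')
    print(repr(text), ' '.join(f'0x{b:02X}' for b in encoded))

# Decoding the five bytes gives back exactly three characters.
decoded = bytes([0x54, 0x55, 0xE2, 0x84, 0xA5]).decode('utf-8')
print(len(decoded))
```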


>>> UTF-8 and UTF-16 and UTF-32
>>> I thought the number beside UTF- was to declare how many bits the
>>> character set was using to store a character on the hdd, no?
> 
>>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>>values to make a surrogate pair.
> 
> Is a surrogate pair like hitting, for example, Ctrl-A, meaning it is a
> combination character that consists of 2 different characters? Is this
> what a surrogate is? A pair of 2 chars?

Yes, a surrogate pair is a pair of two "characters". But they're not 
*real* characters. They don't exist in any human language. They are just 
values that tell the program "these go together, and count as a single 
character".

(This is why Unicode prefers to talk about *code points* rather than 
characters. Some code points are characters, and some are not.)
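
As a small illustration of a code point that is not a character (the value U+D83D is an arbitrary high surrogate), Python reports its category as "surrogate" and refuses to encode it on its own as UTF-8:

```python
import unicodedata

lone = '\ud83d'   # a high surrogate on its own (illustrative value)

category = unicodedata.category(lone)   # Unicode general category
print(category)

try:
    lone.encode('utf-8')
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```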

>>UTF-8 uses 8-bit values, but sometimes it combines two, three or four of
>>them to represent a single code-point.
> 
> 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 97)

Correct.


> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is >
> 127 ) 

That looks like two characters to me, 'α' followed by '΄'. That will take 
4 bytes, two for 'α' and two for '΄'.
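
A sketch that checks this, assuming the second character is U+0384 GREEK TONOS (the escapes below spell out that assumption):

```python
s = '\u03b1\u0384'   # assumed: GREEK SMALL LETTER ALPHA followed by GREEK TONOS

code_points = len(s)                                # number of code points
per_char = [len(ch.encode('utf-8')) for ch in s]    # bytes per code point
total = len(s.encode('utf-8'))                      # bytes in total
print(code_points, per_char, total)
```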


> 'a Chinese ideogram' to be utf8 encoded needs 4 bytes to be stored?
> (since ordinal > 65000)

Not necessarily four bytes. Could be three. Depends on the ideogram.
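
For instance (the two ideographs below are arbitrary picks, one from the Basic Multilingual Plane and one from CJK Extension B):

```python
bmp_ideograph = '\u4e2d'        # U+4E2D, inside the Basic Multilingual Plane
ext_ideograph = '\U00020000'    # U+20000, in CJK Extension B

bmp_len = len(bmp_ideograph.encode('utf-8'))
ext_len = len(ext_ideograph.encode('utf-8'))
print(bmp_len, ext_len)
```

Code points below U+10000 fit in three UTF-8 bytes; only those from U+10000 up need four.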

> The number of bytes needed to store a character depends solely on the
> character's ordinal value in the Unicode table?

Yes.


>>UTF-8 solves this problem by reserving some values to mean "this byte,
>>on its own", and others to mean "this byte, plus the next byte,
>>together", and so forth, up to four bytes.
> 
> Are some of the utf-8 bits that are used to represent a character's ordinal
> value also used to separate or join the ordinal values themselves?
> Can you give an example please? How are they being separated?

Did you look up UTF-8 on Wikipedia like I suggested?
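
In the meantime, here is a hand-rolled sketch of the bit layout. `utf8_encode` is a hypothetical helper written only for illustration; it does no validation (surrogates and out-of-range values are not rejected), so use `str.encode('utf-8')` for real work:

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand (hypothetical helper, no validation)."""
    if code_point < 0x80:                      # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the hand-built bytes with Python's own encoder and show the bits.
for ch in ('A', '\u03b1', '\u2125', '\U0001F600'):
    by_hand = utf8_encode(ord(ch))
    assert by_hand == ch.encode('utf-8')
    print(f'U+{ord(ch):04X}', ' '.join(f'{b:08b}' for b in by_hand))
```

The leading bits of the first byte (0, 110, 1110 or 11110) say how long the sequence is, and every following byte starts with 10, which is how a decoder knows where one character ends and the next begins.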


>>Computers are digital and work with numbers.
> 
> So character 'A' <-> 65 (in decimal, as used in the charset's table) <->
> 01011100 (as binary, stored on disk) <-> 0xEF (as hex, when we open the
> file with a hex editor)
> 
> Is this how the thing works? (above values are fictional)

You can check this in Python:


py> c = 'A'
py> ord(c)
65
py> bin(65)
'0b1000001'
py> hex(65)
'0x41'


py> c = 'α'
py> ord(c)
945
py> c.encode('utf-8')
b'\xce\xb1'
py> c.encode('utf-16be')
b'\x03\xb1'
py> c.encode('utf-32be')
b'\x00\x00\x03\xb1'
py> c.encode('iso-8859-7')
b'\xe1'


-- 
Steven



#47522

From: nagia.retsina@gmail.com
Date: 2013-06-10 00:10 -0700
Message-ID: <a64ba08f-2616-4715-818c-073f3d1e2ffb@googlegroups.com>
In reply to: #47457

On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:

> py> c = 'α'
> py> ord(c)
> 945

The number 945 is the ordinal value of the character 'α' in the Unicode charset, correct?

The command in the python interactive session to show me how many bytes
this character will take upon encoding to utf-8 is:

>>> s = 'α'
>>> s.encode('utf-8')
b'\xce\xb1'

I see that the encoding of this char takes 2 bytes. But why two exactly?
How do I calculate how many bits are needed to store this char into bytes?


Trying to do the same here, but it gave me no bytes back.

>>> s = 'a'
>>> s.encode('utf-8')
b'a'


> py> c.encode('utf-8')
> b'\xce\xb1'

2 bytes here. Why 2?

> py> c.encode('utf-16be')
> b'\x03\xb1'

2 bytes here also, but why are the bytes different? The ordinal value of the char 'α' is the same in Unicode. The encoding system just takes the ordinal value and encodes it, but since it uses 2 bytes, shouldn't these 2 bytes be the same?

> py> c.encode('utf-32be')
> b'\x00\x00\x03\xb1'

Every char here takes exactly 4 bytes to be stored. Okay.

> py> c.encode('iso-8859-7')
> b'\xe1'

Also, does '\x' mean that the value is being represented in hex?
And when I run bin(65) I see '0b1000001'.

I should expect to see 8 bits of 1s and 0s. What is the 'b' trying to say?


