| Started by | Cameron Simpson <cs@zip.com.au> |
|---|---|
| First post | 2013-06-07 18:53 +1000 |
| Last post | 2013-06-10 13:28 -0700 |
| Articles | 20 on this page of 68 — 14 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-07 18:53 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) alex23 <wuwei23@gmail.com> - 2013-06-07 02:41 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-07 04:53 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-07 15:29 +0100
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-07 11:52 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Zero Piraeus <schesis@gmail.com> - 2013-06-07 15:31 -0400
Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-07 21:45 +0100
Re: Changing filenames from Greeklish => Greek (subprocess complain) Zero Piraeus <schesis@gmail.com> - 2013-06-07 19:24 -0400
Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-08 12:52 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-07 23:49 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-08 16:58 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-08 07:26 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-08 17:40 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-08 17:32 +0100
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 09:53 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 10:35 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-08 18:48 +0100
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-07 15:33 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-08 12:49 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 21:01 +0300
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-08 19:01 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 14:14 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 08:32 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 07:46 +0300
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 06:25 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 18:02 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:03 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 14:21 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-09 08:10 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 01:11 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-09 04:47 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-08 22:09 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 06:45 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-09 00:00 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 08:15 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:14 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 03:32 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 19:16 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 12:36 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-09 10:25 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Lele Gaifax <lele@metapensiero.it> - 2013-06-09 10:55 +0200
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:08 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Lele Gaifax <lele@metapensiero.it> - 2013-06-09 11:20 +0200
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:38 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Andreas Perstinger <andipersti@gmail.com> - 2013-06-09 14:24 +0200
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 13:13 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Benjamin Kaplan <benjamin.kaplan@case.edu> - 2013-06-09 13:05 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:42 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 03:37 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Larry Hudson <orgnut@yahoo.com> - 2013-06-10 00:51 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 01:11 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Larry Hudson <orgnut@yahoo.com> - 2013-06-11 00:20 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 11:50 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 05:18 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:00 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 19:12 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:20 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Benjamin Kaplan <benjamin.kaplan@case.edu> - 2013-06-09 13:01 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 12:31 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-10 00:10 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Andreas Perstinger <andipersti@gmail.com> - 2013-06-10 10:15 +0200
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 01:54 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 02:59 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Andreas Perstinger <andipersti@gmail.com> - 2013-06-10 12:42 +0200
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-10 11:59 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 07:27 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) jmfauth <wxjmfauth@gmail.com> - 2013-06-10 12:48 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Ned Batchelder <ned@nedbatchelder.com> - 2013-06-10 13:28 -0700
| From | Lele Gaifax <lele@metapensiero.it> |
|---|---|
| Date | 2013-06-09 10:55 +0200 |
| Message-ID | <mailman.2910.1370768144.3114.python-list@python.org> |
| In reply to | #47428 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:
>
>> chr('A') would give me the mapping of this char, the number 65 while
>> ord(65) would output the char 'A' likewise.
>
> Correct. Python uses Unicode, where code-point 65 ("ordinal value 65")
> means letter "A".
Actually, that's the other way around:
>>> chr(65)
'A'
>>> ord('A')
65
>> What would happen if we we try to re-encode bytes on the disk? like
>> trying:
>>
>> s = "νίκος"
>> utf8_bytes = s.encode('utf-8')
>> greek_bytes = utf_bytes.encode('iso-8869-7')
>>
>> Can we re-encode twice or as many times we want and then decode back
>> respectively lke?
>
> Of course. Bytes have no memory of where they came from, or what they are
> used for. All you are doing is flipping bits on a memory chip, or on a
> hard drive. So long as *you* remember which encoding is the right one,
> there is no problem. If you forget, and start using the wrong one, you
> will get garbage characters, mojibake, or errors.
Uhm, no: "encode" transforms a Unicode string into an array of bytes,
"decode" does the opposite transformation. You cannot do the former on
an "arbitrary" array of bytes:
>>> s = "νίκος"
>>> utf8_bytes = s.encode('utf-8')
>>> greek_bytes = utf8_bytes.encode('iso-8869-7')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'
ciao, lele.
--
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
lele@metapensiero.it | -- Fortunato Depero, 1929.
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-09 02:08 -0700 |
| Message-ID | <9a0ea98b-f37b-48da-9933-e2caf6fdfdff@googlegroups.com> |
| In reply to | #47433 |
On Sunday, 9 June 2013 11:55:43 AM UTC+3, user Lele Gaifax wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>
> > On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:
> >
> >> chr('A') would give me the mapping of this char, the number 65 while
> >> ord(65) would output the char 'A' likewise.
> >
> > Correct. Python uses Unicode, where code-point 65 ("ordinal value 65")
> > means letter "A".
>
> Actually, that's the other way around:
>
> >>> chr(65)
> 'A'
> >>> ord('A')
> 65
>
> >> What would happen if we we try to re-encode bytes on the disk? like
> >> trying:
> >>
> >> s = "νίκος"
> >> utf8_bytes = s.encode('utf-8')
> >> greek_bytes = utf_bytes.encode('iso-8869-7')
> >>
> >> Can we re-encode twice or as many times we want and then decode back
> >> respectively lke?
> >
> > Of course. Bytes have no memory of where they came from, or what they are
> > used for. All you are doing is flipping bits on a memory chip, or on a
> > hard drive. So long as *you* remember which encoding is the right one,
> > there is no problem. If you forget, and start using the wrong one, you
> > will get garbage characters, mojibake, or errors.
>
> Uhm, no: "encode" transforms a Unicode string into an array of bytes,
> "decode" does the opposite transformation. You cannot do the former on
> an "arbitrary" array of bytes:
>
> >>> s = "νίκος"
> >>> utf8_bytes = s.encode('utf-8')
> >>> greek_bytes = utf8_bytes.encode('iso-8869-7')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> AttributeError: 'bytes' object has no attribute 'encode'
So something encoded into bytes cannot be re-encoded to some other bytes.
How about a string i wonder?
s = "νίκος"
what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')
| From | Lele Gaifax <lele@metapensiero.it> |
|---|---|
| Date | 2013-06-09 11:20 +0200 |
| Message-ID | <mailman.2913.1370769662.3114.python-list@python.org> |
| In reply to | #47436 |
Νικόλαος Κούρας <nikos.gr33k@gmail.com> writes:
> On Sunday, 9 June 2013 11:55:43 AM UTC+3, user Lele Gaifax wrote:
>> Uhm, no: "encode" transforms a Unicode string into an array of bytes,
>> "decode" does the opposite transformation. You cannot do the former on
>> an "arbitrary" array of bytes:
>>
>> >>> s = "νίκος"
>> >>> utf8_bytes = s.encode('utf-8')
>> >>> greek_bytes = utf8_bytes.encode('iso-8869-7')
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> AttributeError: 'bytes' object has no attribute 'encode'
>
> So something encoded into bytes cannot be re-encoded to some other bytes.
>
> How about a string i wonder?
> s = "νίκος"
> what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')
Ignoring the usual syntax error, this is just a variant of the code I
posted: “s.encode('iso-8869-7')” produces a bytes instance which
*cannot* be "re-encoded" again in whatever encoding.
ciao, lele.
--
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
lele@metapensiero.it | -- Fortunato Depero, 1929.
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-09 02:38 -0700 |
| Message-ID | <7e01dc4a-ffc0-43ce-8d6b-8bc069a63f19@googlegroups.com> |
| In reply to | #47441 |
On Sunday, 9 June 2013 12:20:58 PM UTC+3, user Lele Gaifax wrote:
> > How about a string i wonder?
> > s = "νίκος"
> > what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')
> Ignoring the usual syntax error, this is just a variant of the code I
> posted: "s.encode('iso-8869-7')" produces a bytes instance which
> *cannot* be "re-encoded" again in whatever encoding.
s = 'a'
s = s.encode('iso-8859-7').decode('utf-8')
print( s )
a (we got the original character back)
================================
s = 'α'
s = s.encode('iso-8859-7').decode('utf-8')
print( s )
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: unexpected end of data
Why this error? Because the ordinal value of 'α' is > 127?
| From | Andreas Perstinger <andipersti@gmail.com> |
|---|---|
| Date | 2013-06-09 14:24 +0200 |
| Message-ID | <mailman.2916.1370780648.3114.python-list@python.org> |
| In reply to | #47442 |
On 09.06.2013 11:38, Νικόλαος Κούρας wrote:
> s = 'α'
> s = s.encode('iso-8859-7').decode('utf-8')
> print( s )
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: unexpected end of data
>
> Why this error? because 'a' ordinal value > 127 ?
>>> s = 'α'
>>> s.encode('iso-8859-7')
b'\xe1'
>>> bin(0xe1)
'0b11100001'
Now look at the table on https://en.wikipedia.org/wiki/UTF-8#Description
to find out how many bytes a UTF-8 decoder expects when it reads that value.
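For readers who would rather not read the table by hand, here is a minimal sketch of that lookup. The helper name `utf8_sequence_length` is made up for illustration, and it only inspects the high bits of the first byte; it does not apply the finer validity rules a real decoder enforces.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 decoder expects, judging by the first byte.

    Hypothetical helper for illustration only: it checks the leading bit
    pattern and nothing else.
    """
    if lead_byte & 0b10000000 == 0b00000000:   # 0xxxxxxx
        return 1
    if lead_byte & 0b11100000 == 0b11000000:   # 110xxxxx
        return 2
    if lead_byte & 0b11110000 == 0b11100000:   # 1110xxxx
        return 3
    if lead_byte & 0b11111000 == 0b11110000:   # 11110xxx
        return 4
    raise ValueError(f"0x{lead_byte:02x} cannot start a UTF-8 sequence")


# The single byte produced by encoding 'α' as ISO-8859-7.
lead = "α".encode("iso-8859-7")[0]
expected = utf8_sequence_length(lead)
```

The lone byte 0xe1 announces a three-byte sequence, but only one byte is present, which is consistent with the "unexpected end of data" message quoted above.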
Bye, Andreas
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-09 13:13 +0000 |
| Message-ID | <51b47f82$0$30001$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #47442 |
On Sun, 09 Jun 2013 02:38:13 -0700, Νικόλαος Κούρας wrote:
> s = 'α'
> s = s.encode('iso-8859-7').decode('utf-8')
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0:
> unexpected end of data
>
> Why this error? because 'a' ordinal value > 127 ?
Look at it this way... consider encoding and decoding to be like
translating from one language to another.
Suppose you start with the English word "street". You encode it to German
by looking it up in an English-To-German dictionary:
street -> Straße
Then you decode the German by looking "Straße" up in a German-To-English
dictionary:
Straße -> street
and everything is good. But suppose that after encoding the English to
German, you get confused, and think that it is Italian, not German. So
when it comes to decoding, you try to look up 'Straße' in an Italian-To-English
dictionary, and discover that there is no such thing as letter ß
in Italian. So you cannot look the word up, and you get frustrated and
shout "this is rubbish, there's no such thing as ß, that's not a letter!"
Not in Italian, but it is a perfectly good letter in German. But you're
looking it up in the wrong dictionary.
Same thing with UTF-8. You encoded the string 'α' by looking it up in the
"Unicode To ISO-8859-7 bytes" dictionary. Then you try to decode it by
looking for those bytes in the "UTF-8 bytes To Unicode" dictionary. But
you can't find byte 0xe1 on its own in UTF-8 bytes, so Python shouts
"this is rubbish, there's no such thing as 0xe1 on its own in UTF-8!" and
raises UnicodeDecodeError.
Sometimes you don't get an exception. Suppose that you are encoding from
French to German:
qui -> die (both words mean "who" in English)
Now if you get confused, and decode the word 'die' by looking it up in an
English-To-French dictionary, instead of German-To-French, you get:
die -> mourir
So instead of getting 'qui' back again, you get 'mourir'. This is like
mojibake: the results are garbage, but there is no exception raised to
warn you.
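Both outcomes of the analogy can be reproduced in a few lines. This is only a sketch: the character 'é' and the ISO-8859-1 codec are illustrative choices, not taken from the thread.

```python
text = "é"
utf8_bytes = text.encode("utf-8")            # two bytes: 0xc3 0xa9

# Wrong "dictionary", no exception: ISO-8859-1 maps every byte to some
# character, so the decode succeeds but the result is mojibake.
mojibake = utf8_bytes.decode("iso-8859-1")

# Wrong "dictionary", with an exception: a lone 0xe9 byte is not valid UTF-8.
latin1_bytes = text.encode("iso-8859-1")     # one byte: 0xe9
try:
    latin1_bytes.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
```

Whether the mistake is reported or silently produces garbage depends on whether the bytes happen to be valid in the wrong codec.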
--
Steven
| From | Benjamin Kaplan <benjamin.kaplan@case.edu> |
|---|---|
| Date | 2013-06-09 13:05 -0700 |
| Message-ID | <mailman.2933.1370808420.3114.python-list@python.org> |
| In reply to | #47442 |
On Sun, Jun 9, 2013 at 2:38 AM, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
> On Sunday, 9 June 2013 12:20:58 PM UTC+3, user Lele Gaifax wrote:
>
>> > How about a string i wonder?
>> > s = "νίκος"
>> > what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')
>
>> Ignoring the usual syntax error, this is just a variant of the code I
>> posted: "s.encode('iso-8869-7')" produces a bytes instance which
>> *cannot* be "re-encoded" again in whatever encoding.
>
> s = 'a'
> s = s.encode('iso-8859-7').decode('utf-8')
> print( s )
>
> a (we got the original character back)
> ================================
> s = 'α'
> s = s.encode('iso-8859-7').decode('utf-8')
> print( s )
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: unexpected end of data
>
> Why this error? because 'a' ordinal value > 127 ?
> --
No. You get that error because the string is not encoded in UTF-8.
It's encoded in ISO-8859-7. For ASCII strings (ord(x) < 128),
ISO-8859-7 and UTF-8 look exactly the same. For anything else, they
are different. If you were to try to decode it as ISO-8859-1, it would
succeed, but you would get the character "á" back instead of α.
You're misunderstanding the decode function. Decode doesn't turn the data
into a string with the specified encoding. It takes the bytes, which are
*in* the specified encoding, and turns them into Python's internal
string representation. In Python 3.3, that representation doesn't even have
a name because it's not a standard encoding. So you want the decode
argument to match the encode argument.
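A short sketch of the three cases described in this reply, using only the standard `str.encode` and `bytes.decode` methods; the variable names are chosen here for illustration.

```python
original = "α"
greek_bytes = original.encode("iso-8859-7")       # the single byte 0xe1

# Matching codec on the way back: the original character returns.
round_trip = greek_bytes.decode("iso-8859-7")

# Mismatched single-byte codec: no error, but a different character.
wrong_codec = greek_bytes.decode("iso-8859-1")

# For pure ASCII text the two encodings produce identical bytes.
ascii_same = "abc".encode("iso-8859-7") == "abc".encode("utf-8")
```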
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-09 02:42 -0700 |
| Message-ID | <4500f6f7-2296-4320-b6b9-dbc71c732500@googlegroups.com> |
| In reply to | #47441 |
s = 'a'
s = s.encode('utf-8').decode('iso-8859-7')
print ( s )
a
==========================
s = 'α'
s = s.encode('utf-8').decode('iso-8859-7')
print ( s )
Ξ±
==========================
Is the above a garbage character? Where did this come from?
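One way to see where the two characters come from is to look at the UTF-8 bytes one at a time. This is a sketch, not an authoritative diagnosis; the names `breakdown`, `garbled` and `recovered` are made up for the example.

```python
utf8_bytes = "α".encode("utf-8")     # two bytes: 0xce 0xb1

# Decode each byte on its own as ISO-8859-7 to see what it turns into.
breakdown = [(hex(b), bytes([b]).decode("iso-8859-7")) for b in utf8_bytes]

# Decoding both bytes at once gives the two-character result shown above.
garbled = utf8_bytes.decode("iso-8859-7")

# Because every byte survived, the mistake can be undone by reversing it.
recovered = garbled.encode("iso-8859-7").decode("utf-8")
```

So the output is mojibake rather than random garbage: each of the two UTF-8 bytes was read as a separate ISO-8859-7 character.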
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-09 03:37 -0700 |
| Message-ID | <82414c38-cec0-404d-8d5d-435fed0750c7@googlegroups.com> |
| In reply to | #47443 |
I know I have been a pain in the ass these days, but this is the last explanation I want from you, just to understand it completely.
>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>> values up to 256?
> Because then how do you tell when you need one byte, and when you need
> two? If you read two bytes, and see 0x4C 0xFA, does that mean two
> characters, with ordinal values 0x4C and 0xFA, or one character with
> ordinal value 0x4CFA?
I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>> UTF-8 and UTF-16 and UTF-32
>> I though the number beside of UTF- was to declare how many bits the
>> character set was using to store a character into the hdd, no?
> Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
> UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
> values to make a surrogate pair.
Is a surrogate pair like hitting, for example, Ctrl-A, meaning a combination character that consists of 2 different characters?
Is this what a surrogate is? A pair of 2 chars?
> UTF-8 uses 8-bit values, but sometimes
> it combines two, three or four of them to represent a single code-point.
'a' to be utf8 encoded needs 1 byte to be stored? (since ordinal = 65)
'α' to be utf8 encoded needs 2 bytes to be stored? (since ordinal is > 127)
'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored? (since ordinal > 65000)
Does the number of bytes needed to store a character depend solely on the character's ordinal value in the Unicode table?
> UTF-8 solves this problem by reserving some values to mean "this byte, on
> its own", and others to mean "this byte, plus the next byte, together",
> and so forth, up to four bytes.
So some of the utf-8 bits that are used to represent a character's ordinal value are actually also being used to separate or join the ordinal values themselves? Can you give an example please? How are they being separated?
> Computers are digital and work with numbers.
So character 'A' <-> 65 (in decimal, as used in the charset's table) <-> 01011100 (as binary stored on disk) <-> 0xEF (as hex, when we open the file with a hex editor)
Is this how the thing works? (above values are fictional)
| From | Larry Hudson <orgnut@yahoo.com> |
|---|---|
| Date | 2013-06-10 00:51 -0700 |
| Message-ID | <__GdnVhoRbgbGCjMnZ2dnUVZ_qqdnZ2d@giganews.com> |
| In reply to | #47446 |
On 06/09/2013 03:37 AM, Νικόλαος Κούρας wrote:
>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>
NO!! 0 - 127, yes. 128 - 255 -> one byte of a multibyte code.
That's why the decode fails, it sees it as incomplete data so it can't do anything with it.
>
> A surrogate pair is like itting for example Ctrl-A, which means is a combination character that consists of 2 different characters?
> Is this what a surrogate is? a pari of 2 chars?
>
You're confusing character encodings with the way NON-CHARACTER keys on the KEYBOARD are encoded (function keys, arrow keys and such). These are NOT text characters but KEYBOARD key codes. These are NOT text codes and are entirely different and not related to any character encoding. How programs interpret and use these codes depends entirely on the individual programs. There are common conventions on how many are used, but there are no standards.
Also the control-codes are the first 32 values of the ASCII (and ASCII-compatible) character set and are not multi-character key codes like the keyboard non-character keys. However, there are a few keyboard keys that actually produce control-codes. A few examples:
Return/Enter -> Ctrl-M
Tab -> Ctrl-I
Backspace -> Ctrl-H
>
> So character 'A' <-> 65 (in decimal uses in charset's table) <-> 01011100 (as binary stored in disk) <-> 0xEF (as hex, when we open the file with a hex editor)
>
You are trying to put too much meaning to this. The value stored on disk, in memory, or whatever is binary bits, nothing else. How you describe the value, in decimal, in octal, in hex, in base-12, or... is totally irrelevant. These are simply different ways of describing or naming these numeric values.
It's the same as saying 3 in English is three, in Spanish is tres, in German is drei... (I don't know Greek, sorry.) No matter what you call it, it is still the numeric integer value that is between 2 and 4.
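Both points can be checked at the interpreter. The sketch below assumes nothing beyond built-ins; the variable names are illustrative, and the "letter minus 64" rule is the usual ASCII convention for Ctrl-letter codes.

```python
value = ord("A")     # the number behind the character 'A'

# The same number written as binary, octal and hex literals.
same_number = (value == 0b1000001 == 0o101 == 0x41)

# Different textual descriptions of that one number.
descriptions = (bin(value), oct(value), hex(value))

# Return/Enter, Tab and Backspace as control codes (Ctrl-M, Ctrl-I, Ctrl-H):
# in ASCII, Ctrl-<letter> is the letter's code minus 64.
control_codes = {"\r": ord("M") - 64, "\t": ord("I") - 64, "\b": ord("H") - 64}
matches = all(ord(ch) == code for ch, code in control_codes.items())
```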
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-10 01:11 -0700 |
| Message-ID | <ebd7c249-1c6d-4807-a5d1-9dc3f8006cc3@googlegroups.com> |
| In reply to | #47526 |
On Monday, 10 June 2013 10:51:34 AM UTC+3, user Larry Hudson wrote:
> > I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
> 0 - 127, yes.
> 128 - 255 -> one byte of a multibyte code.
You mean that in utf-8, for 1 character to be stored, we need 2 bytes?
I am still having trouble understanding this.
Since 2^8 = 256, utf-8 would need 1 byte to store the 1st 256 characters, but instead it's using 1 byte up to the first 127 values and then 2 bytes for anything above. Why?
| From | Larry Hudson <orgnut@yahoo.com> |
|---|---|
| Date | 2013-06-11 00:20 -0700 |
| Message-ID | <sKadnX8GT48pUivMnZ2dnUVZ_h2dnZ2d@giganews.com> |
| In reply to | #47530 |
On 06/10/2013 01:11 AM, Νικόλαος Κούρας wrote:
> On Monday, 10 June 2013 10:51:34 AM UTC+3, user Larry Hudson wrote:
>
>>> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>
>> 0 - 127, yes.
>> 128 - 255 -> one byte of a multibyte code.
>
> you mean that in utf-8 for 1 character to be stored, we need 2 bytes?
> I still havign troubl e understanding this.
>
Utf-8 characters are encoded in different sizes, NOT a single fixed number of bytes. The high _bits_ of the first byte define the number of bytes of the individual character code.
(I'm copying this from Wikipedia...)
0xxxxxxx -> 1 byte
110xxxxx -> 2 bytes
1110xxxx -> 3 bytes
11110xxx -> 4 bytes
111110xx -> 5 bytes
1111110x -> 6 bytes
(The 5- and 6-byte forms come from the original design; since RFC 3629, valid UTF-8 stops at 4 bytes.)
Notice that in the 1-byte version, since the high bit is always 0, only 7 bits are available for the character code, and this is the standard 0-127 ASCII (and ASCII-compatible) code set.
>
> Since 2^8 = 256, utf-8 would need 1 byte to store the 1st 256 characters but instead its using 1 byte up to the first 127 value and then 2 bytes for anyhtign above. Why?
>
As I indicated above, one bit is reserved as a flag to indicate that the code is a one-byte code and not a multibyte code, so only 7 bits are available for the actual 1-byte (ASCII) code.
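The bit patterns in that table can be observed directly. A small sketch, with sample characters picked here purely for illustration (a Latin letter, a Greek letter, a common Chinese ideogram, and an emoji from outside the 16-bit range):

```python
samples = ["a", "α", "\u4e2d", "\U0001F600"]   # 'a', 'α', '中', an emoji

# Number of bytes each character needs in UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

# The encoded bytes written out in binary, so the leading flag bits are visible.
bit_patterns = {
    ch: " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))
    for ch in samples
}
```

In these samples the first byte starts with `0`, `110`, `1110` and `11110` respectively, and every following byte starts with `10`. Note that the ideogram chosen here takes three bytes, not four; four bytes are only needed above U+FFFF.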
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-09 11:50 +0000 |
| Message-ID | <51b46bea$0$30001$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #47433 |
On Sun, 09 Jun 2013 10:55:43 +0200, Lele Gaifax wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>
>> On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:
>>
>>> chr('A') would give me the mapping of this char, the number 65 while
>>> ord(65) would output the char 'A' likewise.
>>
>> Correct. Python uses Unicode, where code-point 65 ("ordinal value 65")
>> means letter "A".
>
> Actually, that's the other way around:
>
> >>> chr(65)
> 'A'
> >>> ord('A')
> 65
/facepalm
Of course you are right.
>>> What would happen if we we try to re-encode bytes on the disk? like
>>> trying:
>>>
>>> s = "νίκος"
>>> utf8_bytes = s.encode('utf-8')
>>> greek_bytes = utf_bytes.encode('iso-8869-7')
>>>
>>> Can we re-encode twice or as many times we want and then decode back
>>> respectively lke?
>>
>> Of course. [...]
> Uhm, no: "encode" transforms a Unicode string into an array of bytes,
> "decode" does the opposite transformation. You cannot do the former on
> an "arbitrary" array of bytes:
And two for two. I misunderstood Nikos' question.
As you point out, no, Python 3 will not allow you to re-encode bytes. You
must first decode them to a string, then encode them using a
different encoding. (I thought that this was what Nikos actually meant,
but on re-reading his question more closely, that's not actually what
he asked.)
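As a minimal sketch of that decode-then-encode step, assuming the bytes really are UTF-8 to begin with and using the correctly spelled codec name 'iso-8859-7':

```python
text = "νίκος"
utf8_bytes = text.encode("utf-8")

# bytes -> str (decode with the encoding the bytes are actually in),
# then str -> bytes (encode with the target encoding).
greek_bytes = utf8_bytes.decode("utf-8").encode("iso-8859-7")
```

The string in the middle is the common currency: each codec only converts between text and its own byte format.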
Sorry for any confusion.
--
Steven
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-09 05:18 -0700 |
| Message-ID | <cfa80b5f-879f-4276-89ac-d1900ccc8c0f@googlegroups.com> |
| In reply to | #47450 |
Please tell me that this can actually be solved. I am willing to try anything for 'files.py' to load properly. Everything works as expected on my website; I have managed to correct pelatologio.py and koukos.py. This is the last thing the website needs: for files.py to load so users can grab important files in Greek format.
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-09 02:00 -0700 |
| Message-ID | <5b0d3d7c-e3a4-436d-a55f-26bd40064fd5@googlegroups.com> |
| In reply to | #47428 |
Steven wrote:
>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>> values up to 256?
> Because then how do you tell when you need one byte, and when you need
> two? If you read two bytes, and see 0x4C 0xFA, does that mean two
> characters, with ordinal values 0x4C and 0xFA, or one character with
> ordinal value 0x4CFA?
I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>> UTF-8 and UTF-16 and UTF-32
>> I though the number beside of UTF- was to declare how many bits the
>> character set was using to store a character into the hdd, no?
> Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
> UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
> values to make a surrogate pair.
Is a surrogate pair like hitting, for example, Ctrl-A, meaning a combination character that consists of 2 different characters?
Is this what a surrogate is? A pair of 2 chars?
> UTF-8 uses 8-bit values, but sometimes
> it combines two, three or four of them to represent a single code-point.
'a' to be utf8 encoded needs 1 byte to be stored? (since ordinal = 65)
'α' to be utf8 encoded needs 2 bytes to be stored? (since ordinal is > 127)
'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored? (since ordinal > 65000)
Does the number of bytes needed to store a character depend solely on the character's ordinal value in the Unicode table?
> UTF-8 solves this problem by reserving some values to mean "this byte, on
> its own", and others to mean "this byte, plus the next byte, together",
> and so forth, up to four bytes.
So some of the utf-8 bits that are used to represent a character's ordinal value are actually also being used to separate or join the ordinal values themselves? Can you give an example please? How are they being separated?
> Computers are digital and work with numbers.
So character 'A' <-> 65 (in decimal, as used in the charset's table) <-> 01011100 (as binary stored on disk) <-> 0xEF (as hex, when we open the file with a hex editor)
Is this how the thing works? (above values are fictional)
| From | Cameron Simpson <cs@zip.com.au> |
|---|---|
| Date | 2013-06-09 19:12 +1000 |
| Message-ID | <mailman.2911.1370769172.3114.python-list@python.org> |
| In reply to | #47434 |
On 09Jun2013 02:00, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
| Steven wrote:
| >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
| >> values up to 256?
|
| >Because then how do you tell when you need one byte, and when you need
| >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
| >characters, with ordinal values 0x4C and 0xFA, or one character with
| >ordinal value 0x4CFA?
|
| I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.

Then it would not be UTF-8. UTF-8 will encode any Unicode code point. Your suggestion will not.

I'd point out that if you did this, you'd be back in the same situation you just encountered with ASCII: the first above-255 value would raise a UnicodeEncodeError (an error which does not even exist at present:-)

| >> UTF-8 and UTF-16 and UTF-32
| >> I thought the number beside UTF- was to declare how many bits the
| >> character set was using to store a character into the hdd, no?
|
| >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
| >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
| >values to make a surrogate pair.
|
| A surrogate pair is like hitting, for example, Ctrl-A, which means it is a combination character that consists of 2 different characters?
| Is this what a surrogate is? A pair of 2 chars?

Essentially. The combination represents a code point.

| >UTF-8 uses 8-bit values, but sometimes
| >it combines two, three or four of them to represent a single code-point.
|
| 'a' to be utf8 encoded needs 1 byte to be stored? (since ordinal = 65)
| 'α΄' to be utf8 encoded needs 2 bytes to be stored? (since ordinal is > 127)
| 'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored? (since ordinal > 65000)
|
| The number of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table?

Essentially. You can read up on the exact process in Wikipedia or the Unicode Standard.

Cheers,
--
Cameron Simpson <cs@zip.com.au>

The most annoying thing about being without my files after our disc crash was
discovering once again how widespread BLINK was on the web.
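To make the one-byte-per-character idea concrete, here is a small sketch using Latin-1, which already is such a scheme for the first 256 code points. The helper name `try_encode` is made up for illustration; only `str.encode` and `UnicodeEncodeError` come from Python itself.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# Code point 255 (LATIN SMALL LETTER Y WITH DIAERESIS) still fits in one byte...
print(try_encode('\u00ff', 'latin-1'))   # b'\xff'
# ...but code point 256 and Greek alpha (945) cannot be encoded at all.
print(try_encode('\u0100', 'latin-1'))   # None
print(try_encode('\u03b1', 'latin-1'))   # None
# UTF-8 copes with all three, at two bytes each.
print(len(try_encode('\u00ff\u0100\u03b1', 'utf-8')))   # 6
```

So a flat one-byte scheme stops at 255, which is the limit UTF-8 is designed to avoid.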
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-09 02:20 -0700 |
| Message-ID | <8471f19b-e21a-4859-9842-92a97d75a840@googlegroups.com> |
| In reply to | #47437 |
On Sunday, 9 June 2013 12:12:36 PM UTC+3, Cameron Simpson wrote:
> On 09Jun2013 02:00, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
> | Steven wrote:
> | >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
> | >> values up to 256?
> |
> | >Because then how do you tell when you need one byte, and when you need
> | >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
> | >characters, with ordinal values 0x4C and 0xFA, or one character with
> | >ordinal value 0x4CFA?
> |
> | I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>
> Then it would not be UTF-8. UTF-8 will encode any Unicode code point. Your
> suggestion will not.

I don't follow.

> | >> UTF-8 and UTF-16 and UTF-32
> | >> I thought the number beside UTF- was to declare how many bits the
> | >> character set was using to store a character into the hdd, no?
> |
> | >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
> | >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
> | >values to make a surrogate pair.
> |
> | A surrogate pair is like hitting, for example, Ctrl-A, which means it is a combination character that consists of 2 different characters?
> | Is this what a surrogate is? A pair of 2 chars?
>
> Essentially. The combination represents a code point.
>
> | >UTF-8 uses 8-bit values, but sometimes
> | >it combines two, three or four of them to represent a single code-point.
> |
> | 'a' to be utf8 encoded needs 1 byte to be stored? (since ordinal = 65)
> | 'α΄' to be utf8 encoded needs 2 bytes to be stored? (since ordinal is > 127)
> | 'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored? (since ordinal > 65000)
> |
> | The number of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table?
>
> Essentially. You can read up on the exact process in Wikipedia or the Unicode Standard.

When you say "essentially", does that mean you agree with my statements?
| From | Benjamin Kaplan <benjamin.kaplan@case.edu> |
|---|---|
| Date | 2013-06-09 13:01 -0700 |
| Message-ID | <mailman.2934.1370808466.3114.python-list@python.org> |
| In reply to | #47440 |
On Sun, Jun 9, 2013 at 2:20 AM, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
> On Sunday, 9 June 2013 12:12:36 PM UTC+3, Cameron Simpson wrote:
>> On 09Jun2013 02:00, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
>> | Steven wrote:
>> | >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>> | >> values up to 256?
>> |
>> | >Because then how do you tell when you need one byte, and when you need
>> | >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>> | >characters, with ordinal values 0x4C and 0xFA, or one character with
>> | >ordinal value 0x4CFA?
>> |
>> | I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>>
>> Then it would not be UTF-8. UTF-8 will encode any Unicode code point. Your
>> suggestion will not.
>
> I don't follow.

The point of the UTF formats is that they can encode any of the 1.1 million code points available in Unicode. Your suggestion can only encode 256 code points. We have that encoding already: it's called Latin-1, and it can't encode any of your Greek characters (which is why ISO-8859-7 exists; it can encode the Greek characters but not the accented Latin ones).

If you were to use the whole byte to store the first 256 characters, you wouldn't be able to store character number 256 or anything above it, because the computer wouldn't be able to tell the difference between character 257 (0x01 0x01) and two chr(1)s. UTF-8 gets around this by reserving the top bit as an "am I part of a multibyte sequence" flag.

>> | >> UTF-8 and UTF-16 and UTF-32
>> | >> I thought the number beside UTF- was to declare how many bits the
>> | >> character set was using to store a character into the hdd, no?
>> |
>> | >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>> | >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>> | >values to make a surrogate pair.
>> |
>> | A surrogate pair is like hitting, for example, Ctrl-A, which means it is a combination character that consists of 2 different characters?
>> | Is this what a surrogate is? A pair of 2 chars?
>>
>> Essentially. The combination represents a code point.
>>
>> | >UTF-8 uses 8-bit values, but sometimes
>> | >it combines two, three or four of them to represent a single code-point.
>> |
>> | 'a' to be utf8 encoded needs 1 byte to be stored? (since ordinal = 65)
>> | 'α΄' to be utf8 encoded needs 2 bytes to be stored? (since ordinal is > 127)
>> | 'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored? (since ordinal > 65000)
>> |
>> | The number of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table?
>>
>> Essentially. You can read up on the exact process in Wikipedia or the Unicode Standard.
>
> When you say "essentially", does that mean you agree with my statements?

In UTF-8 or UTF-16, the number of bytes required for the character is dependent on its code point, yes. That isn't the case for UTF-32, where every character uses exactly four bytes.
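The variable-width versus fixed-width difference is easy to check. This is only a sketch: the sample characters and the helper name `byte_lengths` are my own choices, and the big-endian codec names are used so that no byte order mark is added to the counts.

```python
samples = {
    'A': 'Latin capital A',
    '\u03b1': 'Greek small alpha',
    '\u4e2d': 'a CJK ideograph inside the Basic Multilingual Plane',
    '\U00020000': 'a CJK ideograph outside it',
}

def byte_lengths(ch):
    """Bytes needed for ch in UTF-8, UTF-16 and UTF-32 (big-endian, no BOM)."""
    return tuple(len(ch.encode(enc)) for enc in ('utf-8', 'utf-16-be', 'utf-32-be'))

for ch, description in samples.items():
    print(f'U+{ord(ch):04X} {description}: {byte_lengths(ch)}')
```

The last number in each tuple is always 4, while the first two grow with the code point.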
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-09 12:31 +0000 |
| Message-ID | <51b475b0$0$30001$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #47434 |
On Sun, 09 Jun 2013 02:00:46 -0700, Νικόλαος Κούρας wrote:
> Steven wrote:
>>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>>> values up to 256?
>
>>Because then how do you tell when you need one byte, and when you need
>>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>>characters, with ordinal values 0x4C and 0xFA, or one character with
>>ordinal value 0x4CFA?
>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I
> meant up to 256, not above 256.
Think about it. Draw up a big table of one million plus characters:
Ordinal Character
======== ====================
0 NUL control code
1 SOH control code
...
84 LATIN CAPITAL LETTER T
85 LATIN CAPITAL LETTER U
...
255 LATIN SMALL LETTER Y WITH DIAERESIS
256 LATIN CAPITAL LETTER A WITH MACRON
...
8485 OUNCE SIGN
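and so on, all the way to 1114111. Now, suppose you read a file, and see
two bytes, shown in decimal: 84 followed by 85, or in hexadecimal, 0x54
followed by 0x55.
How do you tell whether that means two characters, T followed by U, or a
single character whose ordinal is built from both bytes? (Running the
decimal digits together gives 8485, ℥ OUNCE SIGN, which is the character
used in the examples that follow. Strictly, the bytes 0x54 0x55 read as one
16-bit number are 0x5455, a different character, but the ambiguity is the
same.)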
With UTF-32, you can, because every value takes exactly the same space.
So a T followed by a U is:
0x00000054
0x00000055
while a single ℥ is:
0x00002125
and it is easy to tell them apart: each block of 4 bytes is exactly one
character. But notice how many NUL bytes there are? In the three
characters shown, there are eight NUL bytes. Most text will be filled
with NUL bytes, which is very wasteful.
UTF-8 is designed to be compact, and also to be backwards-compatible with
ASCII. Characters which are in ASCII will be a single byte, so there are
no null bytes used for padding, (except for NUL itself, of course). So
the three characters TU℥ will be:
0x54
0x55
0xE2
0x84
0xA5
Five bytes in total, instead of 12 for UTF-32. But the only tricky part
is that character with ordinal value 0xE2 (decimal 226, â) cannot be
encoded as the single byte 0xE2, otherwise you would mistake the three
bytes 0xE284A5 as starting with 'â' followed by two more characters. And
indeed, 'â' is encoded as two bytes:
0xC3
0xA2
Likewise, character with ordinal value 0xC3 (decimal 195, Ã) is also
encoded as two bytes:
0xC3
0x83
And so on. This way, there is never any confusion as to whether (say)
three bytes are three one-byte characters, or one three-byte character.
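If you want to check those byte counts yourself, something like the following should do it. It is a sketch that assumes Python 3.8 or later (for the separator arguments to `bytes.hex`); the variable names are mine.

```python
text = 'TU\u2125'   # T, U and OUNCE SIGN

utf32 = text.encode('utf-32-be')   # big-endian, no byte order mark
utf8 = text.encode('utf-8')

print(utf32.hex(' ', 4))       # 00000054 00000055 00002125
print(utf32.count(0))          # 8 NUL bytes of padding
print(utf8.hex(' '))           # 54 55 e2 84 a5
print(len(utf32), len(utf8))   # 12 5

# Code points 0xE2 and 0xC3 each need two bytes, as described above.
print('\u00e2'.encode('utf-8'))   # b'\xc3\xa2'
print('\u00c3'.encode('utf-8'))   # b'\xc3\x83'

# Decoding the five bytes recovers exactly three characters.
assert utf8.decode('utf-8') == text
```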
>>> UTF-8 and UTF-16 and UTF-32
>>> I thought the number beside UTF- was to declare how many bits the
>>> character set was using to store a character into the hdd, no?
>
>>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>>values to make a surrogate pair.
>
> A surrogate pair is like hitting, for example, Ctrl-A, which means it is a
> combination character that consists of 2 different characters? Is this
> what a surrogate is? A pair of 2 chars?
Yes, a surrogate pair is a pair of two "characters". But they're not
*real* characters. They don't exist in any human language. They are just
values that tell the program "these go together, and count as a single
character".
(This is why Unicode prefers to talk about *code points* rather than
characters. Some code points are characters, and some are not.)
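For the curious, here is roughly how the two halves of a surrogate pair are worked out. Treat it as a sketch: the function name `surrogate_pair` and the choice of U+1F600 as the example are mine, and the arithmetic is the standard UTF-16 rule for code points above U+FFFF.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('only code points above U+FFFF need a surrogate pair')
    offset = code_point - 0x10000      # leaves a 20-bit number
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1F600)    # U+1F600 GRINNING FACE
print(hex(high), hex(low))             # 0xd83d 0xde00

# Python's own UTF-16 encoder produces the same two 16-bit values.
print('\U0001F600'.encode('utf-16-be').hex(' ', 2))   # d83d de00
```

Neither 0xD83D nor 0xDE00 is a character on its own; only the pair means something.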
>>UTF-8 uses 8-bit values, but sometimes it combines two, three or four of
>>them to represent a single code-point.
>
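> 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
Correct: one byte. (Strictly, 65 is the ordinal of 'A'; lowercase 'a' is 97.
Both are below 128, so both take a single byte.)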
> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is >
> 127 )
That looks like two characters to me, 'α' followed by '΄'. That will take
4 bytes, two for 'α' and two for '΄'.
> 'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored
> ? (since ordinal > 65000 )
Not necessarily four bytes. Could be three. Depends on the ideogram.
> The number of bytes needed to store a character solely depends on the
> character's ordinal value in the Unicode table?
Yes.
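The rule can be written down in a few lines. This is a sketch with a made-up helper name, `utf8_length`; the thresholds are the UTF-8 range boundaries, and the loop cross-checks each answer against Python's real encoder.

```python
def utf8_length(ch):
    """Bytes needed to encode one character in UTF-8, from its ordinal alone."""
    code_point = ord(ch)
    if code_point < 0x80:       # 0..127: ASCII
        return 1
    if code_point < 0x800:      # 128..2047
        return 2
    if code_point < 0x10000:    # 2048..65535
        return 3
    return 4                    # 65536..1114111

for ch in ('a', '\u03b1', '\u0384', '\u4e2d', '\U00020000'):
    # Cross-check against the actual encoder.
    assert utf8_length(ch) == len(ch.encode('utf-8'))
    print(f'U+{ord(ch):04X}: {utf8_length(ch)} byte(s)')
```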
>>UTF-8 solves this problem by reserving some values to mean "this byte,
>>on its own", and others to mean "this byte, plus the next byte,
>>together", and so forth, up to four bytes.
>
> Are some of the utf-8 bits that represent a character's ordinal value
> also used to separate or join the ordinal values themselves? Can you
> give an example please? How are they separated?
Did you look up UTF-8 on Wikipedia like I suggested?
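In case a worked example helps in the meantime, here is a sketch that prints the bits of each UTF-8 byte. The helper name `utf8_bits` is mine. A byte starting with 0 stands alone, one starting with 110 or 1110 opens a two- or three-byte sequence, and one starting with 10 is a continuation byte.

```python
def utf8_bits(ch):
    """Show each UTF-8 byte of ch as 8 binary digits."""
    return [format(byte, '08b') for byte in ch.encode('utf-8')]

print(utf8_bits('A'))        # ['01000001']
print(utf8_bits('\u03b1'))   # ['11001110', '10110001']
print(utf8_bits('\u2125'))   # ['11100010', '10000100', '10100101']

# Rebuild alpha's ordinal by hand: drop the '110' and '10' marker bits
# and join the remaining payload bits.
lead, continuation = '\u03b1'.encode('utf-8')
payload = ((lead & 0b00011111) << 6) | (continuation & 0b00111111)
print(payload)               # 945
```

So the marker bits do the separating, and only the remaining bits carry the ordinal value.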
>>Computers are digital and work with numbers.
>
> So character 'A' <-> 65 (in decimal, as used in the charset's table) <->
> 01011100 (as binary stored on disk) <-> 0xEF (as hex, when we open the
> file with a hex editor)
>
> Is this how the thing works? (above values are fictional)
You can check this in Python:
py> c = 'A'
py> ord(c)
65
py> bin(65)
'0b1000001'
py> hex(65)
'0x41'
py> c = 'α'
py> ord(c)
945
py> c.encode('utf-8')
b'\xce\xb1'
py> c.encode('utf-16be')
b'\x03\xb1'
py> c.encode('utf-32be')
b'\x00\x00\x03\xb1'
py> c.encode('iso-8859-7')
b'\xe1'
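A couple of display details are worth knowing when reading that output. The sketch below assumes nothing beyond the built-in `format`, `str.encode` and `bytes.hex`; the variable names are mine.

```python
c = 'A'
ordinal = ord(c)

# bin() drops leading zeros; format() can pad to a full 8 bits.
padded_bits = format(ordinal, '08b')
print(bin(ordinal), padded_bits)      # 0b1000001 01000001

# The '0b' and '0x' prefixes only say "binary" and "hexadecimal".
print(hex(ordinal))                   # 0x41

# A bytes object shows printable ASCII as itself and everything else as \x..
encoded = c.encode('utf-8')
print(encoded, encoded.hex())         # b'A' 41
print('\u03b1'.encode('utf-8'))       # b'\xce\xb1'
```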
--
Steven
| From | nagia.retsina@gmail.com |
|---|---|
| Date | 2013-06-10 00:10 -0700 |
| Message-ID | <a64ba08f-2616-4715-818c-073f3d1e2ffb@googlegroups.com> |
| In reply to | #47457 |
On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:
> py> c = 'α'
> py> ord(c)
> 945
The number 945 is the ordinal value of the character 'α' in the Unicode charset, correct?
The command in the Python interactive session to show me how many bytes
this character will take upon encoding to utf-8 is:
>>> s = 'α'
>>> s.encode('utf-8')
b'\xce\xb1'
I see that the encoding of this char takes 2 bytes. But why two exactly?
How do I calculate how many bits are needed to store this char into bytes?
Trying to do the same here, but it gave me no bytes back.
>>> s = 'a'
>>> s.encode('utf-8')
b'a'
> py> c.encode('utf-8')
> b'\xce\xb1'
2 bytes here. Why 2?
> py> c.encode('utf-16be')
> b'\x03\xb1'
2 bytes here also, but why different bytes? The ordinal value of the char 'α' is the same in Unicode. The encoding system just takes the ordinal value and encodes it, but since it uses 2 bytes, shouldn't these 2 bytes be the same?
> py> c.encode('utf-32be')
> b'\x00\x00\x03\xb1'
Every char here takes exactly 4 bytes to be stored. Okay.
> py> c.encode('iso-8859-7')
> b'\xe1'
And also, does '\x' mean that the value is being represented in hex?
And when I bin(65) I see '0b1000001'.
I would expect to see 8 bits of 1s and 0s. What is the 'b' trying to say?