| Started by | harrismh777 <harrismh777@charter.net> |
|---|---|
| First post | 2011-05-11 16:37 -0500 |
| Last post | 2011-05-11 15:34 -0700 |
| Articles | 20 on this page of 32 — 12 participants |
unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 16:37 -0500
Re: unicode by default Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-11 16:09 -0600
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 17:51 -0500
Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 09:32 +1000
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 20:22 -0500
Re: unicode by default MRAB <python@mrabarnett.plus.com> - 2011-05-12 03:31 +0100
Re: unicode by default Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-05-12 03:16 +0000
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 22:44 -0500
Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-12 00:12 -0400
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-12 01:43 -0500
Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 14:14 +1000
Re: unicode by default Benjamin Kaplan <benjamin.kaplan@case.edu> - 2011-05-11 21:14 -0700
Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 14:41 +1000
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-12 01:14 -0500
Re: unicode by default TheSaint <nobody@nowhere.net.no> - 2011-05-12 20:40 +0800
Re: unicode by default Ben Finney <ben+python@benfinney.id.au> - 2011-05-12 14:07 +1000
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-12 01:31 -0500
Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 17:58 +1000
Re: unicode by default Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-12 10:17 -0600
Re: unicode by default jmfauth <wxjmfauth@gmail.com> - 2011-05-12 23:28 -0700
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-13 14:53 -0500
Re: unicode by default Robert Kern <robert.kern@gmail.com> - 2011-05-13 15:18 -0500
Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-13 21:41 -0400
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-14 02:41 -0500
Re: unicode by default jmfauth <wxjmfauth@gmail.com> - 2011-05-14 03:26 -0700
Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-14 16:26 -0400
Re: unicode by default Ben Finney <ben+python@benfinney.id.au> - 2011-05-15 09:47 +1000
Re: unicode by default Nobody <nobody@nowhere.com> - 2011-05-14 09:34 +0100
Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-12 16:42 -0400
Re: unicode by default Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-12 16:25 -0600
Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 13:54 +1000
Re: unicode by default Benjamin Kaplan <benjamin.kaplan@case.edu> - 2011-05-11 15:34 -0700
| From | harrismh777 <harrismh777@charter.net> |
|---|---|
| Date | 2011-05-11 16:37 -0500 |
| Subject | unicode by default |
| Message-ID | <OkDyp.2983$M61.450@newsfe07.iad> |
hi folks,
I am puzzled by unicode generally, and within the context of python
specifically. For one thing, what do we mean that unicode is used in
python 3.x by default. (I know what default means, I mean, what changed?)
I think part of my problem is that I'm spoiled (American, ascii
heritage) and have been either stuck in ascii knowingly, or UTF-8
without knowing (just because the code points lined up). I am confused
by the implications for using 3.x, because I am reading that there are
significant things to be aware of... what?
On my installation 2.6 sys.maxunicode comes up with 1114111, and my
2.7 and 3.2 installs come up with 65535 each. So, I am assuming that 2.6
was compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that
the default compile option for 2.7 & 3.2 (I didn't change anything) is
set for UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much
correctly?
The books say that the .py sources are UTF-8 by default... and that
3.x is either UCS-2 or UCS-4. If I use the file handling capabilities
of Python in 3.x (by default) what encoding will be used, and how will
that affect the output?
If I do not specify any code points above ascii 0xFF does any of
this matter anyway?
Thanks.
kind regards,
m harris
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2011-05-11 16:09 -0600 |
| Message-ID | <mailman.1433.1305151801.9059.python-list@python.org> |
| In reply to | #5163 |
On Wed, May 11, 2011 at 3:37 PM, harrismh777 <harrismh777@charter.net> wrote:
> hi folks,
> I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default. (I know what default means, I mean, what changed?)
The `unicode' class was renamed to `str', and a stripped-down version
of the 2.X `str' class was renamed to `bytes'.
> I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?
Mainly Python 3 no longer does implicit conversion between bytes and
unicode, requiring the programmer to be explicit about such
conversions. If you have Python 2 code that is sloppy about this, you
may get some Unicode encode/decode errors when trying to run the same
code in Python 3. The 2to3 tool can help somewhat with this, but it
can't prevent all problems.
> On my installation 2.6 sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much correctly?
I think that UCS-2 has always been the default unicode width for
CPython, although the exact representation used internally is an
implementation detail.
> The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4. If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?
If you open a file in binary mode, the result is a non-decoded byte stream.
If you open a file in text mode and do not specify an encoding, then
the result of locale.getpreferredencoding() is used for decoding, and
the result is a unicode stream.
> If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?
You mean 0x7F, and probably, due to the need to explicitly encode and decode.
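A minimal sketch of the str/bytes split described in this reply, assuming Python 3. The variable names are illustrative and do not come from the thread.

```python
# Python 3 keeps text (str) and binary data (bytes) as separate types.
text = "pound sign \u00a3"       # str: a sequence of code points
data = text.encode("utf-8")      # bytes: produced by an explicit conversion

# Mixing the two without an explicit conversion is an error in Python 3.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoding with the same encoding round-trips back to the original text.
round_trip = data.decode("utf-8")

print(type(text).__name__, type(data).__name__)  # str bytes
print(mixed_ok)                                   # False
print(round_trip == text)                         # True
```

In Python 2 the equivalent concatenation would often succeed silently for ASCII-only data and fail only once non-ASCII input appeared, which is the "sloppy" case the reply warns about.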
| From | harrismh777 <harrismh777@charter.net> |
|---|---|
| Date | 2011-05-11 17:51 -0500 |
| Message-ID | <vpEyp.981$dL5.736@newsfe08.iad> |
| In reply to | #5168 |
Ian Kelly wrote:
Ian, Benjamin, thanks much.
> The `unicode' class was renamed to `str', and a stripped-down version
> of the 2.X `str' class was renamed to `bytes'.
... thank you, this is very helpful.
>> > If I do not specify any code points above ascii 0xFF does any of this
>> > matter anyway?
> You mean 0x7F, and probably, due to the need to explicitly encode and decode.
Yes, actually, I did... and from Benjamin's reply it seems that
this matters only if I am working with bytes. Is it true that if I am
working without using bytes sequences that I will not need to care about
the encoding anyway, unless of course I need to specify a unicode code
point?
Thanks again.
kind regards,
m harris
| From | "John Machin" <sjmachin@lexicon.net> |
|---|---|
| Date | 2011-05-12 09:32 +1000 |
| Message-ID | <mailman.1435.1305157329.9059.python-list@python.org> |
| In reply to | #5174 |
On Thu, May 12, 2011 8:51 am, harrismh777 wrote:
> Is it true that if I am
> working without using bytes sequences that I will not need to care about
> the encoding anyway, unless of course I need to specify a unicode code
> point?
Quite the contrary.
(1) You cannot work without using bytes sequences. Files are byte
sequences. Web communication is in bytes. You need to (know / assume / be
able to extract / guess) the input encoding. You need to encode your
output using an encoding that is expected by the consumer (or use an
output method that will do it for you).
(2) You don't need to use bytes to specify a Unicode code point. Just use
an escape sequence e.g. "\u0404" is a Cyrillic character.
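Point (2) can be checked interactively. This is a small sketch, assuming Python 3 and the standard `unicodedata` module; it looks the character name up at run time rather than hard-coding it.

```python
import unicodedata

ch = "\u0404"                 # the escape sequence from point (2)

print(len(ch))                # 1: one code point, however it is stored
print(hex(ord(ch)))           # 0x404
print(unicodedata.name(ch))   # the official Unicode name of the character
print(ch.encode("utf-8"))     # b'\xd0\x84': bytes appear only when encoding
```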
| From | harrismh777 <harrismh777@charter.net> |
|---|---|
| Date | 2011-05-11 20:22 -0500 |
| Message-ID | <KDGyp.180$0t1.7@newsfe04.iad> |
| In reply to | #5177 |
John Machin wrote:
> (1) You cannot work without using bytes sequences. Files are byte
> sequences. Web communication is in bytes. You need to (know / assume / be
> able to extract / guess) the input encoding. You need to encode your
> output using an encoding that is expected by the consumer (or use an
> output method that will do it for you).
>
> (2) You don't need to use bytes to specify a Unicode code point. Just use
> an escape sequence e.g. "\u0404" is a Cyrillic character.
>
Thanks John. In reverse order, I understand point (2). I'm less clear on
point (1).
If I generate a string of characters that I presume to be ascii/utf-8
(no \u0404 type characters) and write them to a file (stdout) how does
default encoding affect that file by default...? I'm not seeing that
there is anything unusual going on... If I open the file with vi? If I
open the file with gedit? emacs?
....
Another question... in mail I'm receiving many small blocks that look
like sprites with four small hex codes, scattered about the mail...
mostly punctuation, maybe? ... guessing, are these unicode code points,
and if so what is the best way to 'guess' the encoding? ... is it coded
in the stream somewhere...protocol?
thanks
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2011-05-12 03:31 +0100 |
| Message-ID | <mailman.1439.1305167541.9059.python-list@python.org> |
| In reply to | #5181 |
On 12/05/2011 02:22, harrismh777 wrote:
> John Machin wrote:
>> (1) You cannot work without using bytes sequences. Files are byte
>> sequences. Web communication is in bytes. You need to (know / assume / be
>> able to extract / guess) the input encoding. You need to encode your
>> output using an encoding that is expected by the consumer (or use an
>> output method that will do it for you).
>>
>> (2) You don't need to use bytes to specify a Unicode code point. Just use
>> an escape sequence e.g. "\u0404" is a Cyrillic character.
>>
>
> Thanks John. In reverse order, I understand point (2). I'm less clear on
> point (1).
>
> If I generate a string of characters that I presume to be ascii/utf-8
> (no \u0404 type characters) and write them to a file (stdout) how does
> default encoding affect that file by default...? I'm not seeing that
> there is anything unusual going on... If I open the file with vi? If I
> open the file with gedit? emacs?
>
> ....
>
> Another question... in mail I'm receiving many small blocks that look
> like sprites with four small hex codes, scattered about the mail...
> mostly punctuation, maybe? ... guessing, are these unicode code points,
> and if so what is the best way to 'guess' the encoding? ... is it coded
> in the stream somewhere...protocol?
>
You need to understand the difference between characters and bytes.
A string contains characters, a file contains bytes.
The encoding specifies how a character is represented as bytes.
For example:
In the Latin-1 encoding, the character "£" is represented by the
byte 0xA3.
In the UTF-8 encoding, the character "£" is represented by the byte
sequence 0xC2 0xA3.
In the ASCII encoding, the character "£" can't be represented at all.
The advantage of UTF-8 is that it can represent _all_ Unicode
characters (codepoints, actually) as byte sequences, and all those in
the ASCII range are represented by the same single bytes which the
original ASCII system used. Use the UTF-8 encoding unless you have to
use a different one.
A file contains only bytes, a socket handles only bytes. Which encoding
you should use for characters is down to protocol. A system such as
email, which can handle different encodings, should have a way of
specifying the encoding, and perhaps also a default encoding.
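The three cases in this post can be reproduced directly. A minimal sketch, assuming Python 3; the variable names are illustrative.

```python
pound = "\u00a3"  # the character "£"

latin1_bytes = pound.encode("latin-1")
utf8_bytes = pound.encode("utf-8")

# ASCII has no representation for this character, so encoding fails.
try:
    pound.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(latin1_bytes)  # b'\xa3'
print(utf8_bytes)    # b'\xc2\xa3'
print(ascii_ok)      # False

# Text in the ASCII range gives identical bytes in all three encodings.
plain = "pound sign"
same = plain.encode("ascii") == plain.encode("latin-1") == plain.encode("utf-8")
print(same)          # True
```

The last check is why ASCII-only files look the same whichever of these encodings wrote them, and why the difference only shows up once a character such as "£" appears.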
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2011-05-12 03:16 +0000 |
| Message-ID | <4dcb50f8$0$29973$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #5182 |
On Thu, 12 May 2011 03:31:18 +0100, MRAB wrote:
>> Another question... in mail I'm receiving many small blocks that look
>> like sprites with four small hex codes, scattered about the mail...
>> mostly punctuation, maybe? ... guessing, are these unicode code points,
>> and if so what is the best way to 'guess' the encoding? ... is it coded
>> in the stream somewhere...protocol?
>>
> You need to understand the difference between characters and bytes.
http://www.joelonsoftware.com/articles/Unicode.html
is also a good resource.
--
Steven
| From | harrismh777 <harrismh777@charter.net> |
|---|---|
| Date | 2011-05-11 22:44 -0500 |
| Message-ID | <rIIyp.3007$M61.2987@newsfe07.iad> |
| In reply to | #5183 |
Steven D'Aprano wrote:
>> You need to understand the difference between characters and bytes.
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
> is also a good resource.
Thanks for being patient guys, here's what I've done:
>>> astr="pound sign"
>>> asym=" \u00A3"
>>> afile=open("myfile", mode='w')
>>> afile.write(astr + asym)
12
>>> afile.close()
When I edit "myfile" with vi I see the 'characters' :
pound sign £
... same with emacs, same with gedit ...
When I hexdump myfile I see this:
0000000 6f70 6e75 2064 6973 6e67 c220 00a3
This is *not* what I expected... well it is (little-endian) right up to
the 'c2' and that is what is confusing me....
I did not open the file with an encoding of UTF-8... so I'm assuming
UTF-16 by default (python3) so I was expecting a '00A3' little-endian as
'A300' but what I got instead was UTF-8 little-endian 'c2a3' ....
See my problem?... when I open the file with emacs I see the character
pound sign... same with gedit... they're all using UTF-8 by default. By
default it looks like Python3 is writing output with UTF-8 as default...
and I thought that by default Python3 was using either UTF-16 or UTF-32.
So, I'm confused here... also, I used the character sequence \u00A3
which I thought was UTF-16... but Python3 changed my intent to 'c2a3'
which is the normal UTF-8...
Thanks again for your patience... I really do hate to be dense about
this... but this is another area where I'm just beginning to dabble and
I'd like to know soon what I'm doing...
Thanks for the link Steve... I'm headed there now...
kind regards,
m harris
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2011-05-12 00:12 -0400 |
| Message-ID | <mailman.1443.1305173553.9059.python-list@python.org> |
| In reply to | #5187 |
On 5/11/2011 11:44 PM, harrismh777 wrote:
> Steven D'Aprano wrote:
>>> You need to understand the difference between characters and bytes.
>>
>> http://www.joelonsoftware.com/articles/Unicode.html
>>
>> is also a good resource.
>
> Thanks for being patient guys, here's what I've done:
>
> >>> astr="pound sign"
> >>> asym=" \u00A3"
> >>> afile=open("myfile", mode='w')
> >>> afile.write(astr + asym)
> 12
> >>> afile.close()
>
>
> When I edit "myfile" with vi I see the 'characters' :
>
> pound sign £
>
> ... same with emacs, same with gedit ...
>
>
> When I hexdump myfile I see this:
>
> 0000000 6f70 6e75 2064 6973 6e67 c220 00a3
> This is *not* what I expected... well it is (little-endian) right up to
> the 'c2' and that is what is confusing me....
> I did not open the file with an encoding of UTF-8... so I'm assuming
> UTF-16 by default (python3) so I was expecting a '00A3' little-endian as
> 'A300' but what I got instead was UTF-8 little-endian 'c2a3' ....
>
> See my problem?... when I open the file with emacs I see the character
> pound sign... same with gedit... they're all using UTF-8 by default. By
> default it looks like Python3 is writing output with UTF-8 as default...
> and I thought that by default Python3 was using either UTF-16 or UTF-32.
> So, I'm confused here... also, I used the character sequence \u00A3
> which I thought was UTF-16... but Python3 changed my intent to 'c2a3'
> which is the normal UTF-8...
If you open a file as binary (bytes), you must write bytes, and they are
stored without transformation. If you open in text mode, you must write
text (string as unicode in 3.2) and Python will encode to bytes using
either some default or the encoding you specified in the open statement.
It does not matter how Python stored the unicode internally. Does this
help? Your intent is signalled by how you open the file.
--
Terry Jan Reedy
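The difference between the two modes can be sketched as follows, assuming Python 3. A temporary directory is used so the example leaves nothing behind, and the encoding is given explicitly so the result does not depend on the machine's locale.

```python
import os
import tempfile

text = "pound sign \u00a3"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile")

    # Text mode: write str; Python encodes it with the encoding given here.
    with open(path, mode="w", encoding="utf-8") as f:
        chars_written = f.write(text)

    # Binary mode: read back the raw bytes, with no decoding applied.
    with open(path, mode="rb") as f:
        raw = f.read()

    # Binary mode accepts only bytes; writing str raises TypeError.
    try:
        with open(path, mode="wb") as f:
            f.write(text)
        binary_accepts_str = True
    except TypeError:
        binary_accepts_str = False

print(chars_written)       # 12: characters written, matching the session quoted above
print(len(raw))            # 13: bytes on disk, because "£" takes two bytes in UTF-8
print(raw)                 # b'pound sign \xc2\xa3'
print(binary_accepts_str)  # False
```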
| From | harrismh777 <harrismh777@charter.net> |
|---|---|
| Date | 2011-05-12 01:43 -0500 |
| Message-ID | <ekLyp.1001$dL5.753@newsfe08.iad> |
| In reply to | #5194 |
Terry Reedy wrote:
> It does not matter how Python stored the unicode internally. Does this
> help? Your intent is signalled by how you open the file.
Very much, actually, thanks. I was missing the 'internal' piece, and did
not realize that if I didn't specify the encoding on the open that
python would pull the default encoding from locale...
kind regards,
m harris
| From | "John Machin" <sjmachin@lexicon.net> |
|---|---|
| Date | 2011-05-12 14:14 +1000 |
| Message-ID | <mailman.1444.1305173678.9059.python-list@python.org> |
| In reply to | #5187 |
On Thu, May 12, 2011 1:44 pm, harrismh777 wrote:
> By
> default it looks like Python3 is writing output with UTF-8 as default...
> and I thought that by default Python3 was using either UTF-16 or UTF-32.
> So, I'm confused here... also, I used the character sequence \u00A3
> which I thought was UTF-16... but Python3 changed my intent to 'c2a3'
> which is the normal UTF-8...
Python uses either a 16-bit or a 32-bit INTERNAL representation of Unicode
code points. Those NN bits have nothing to do with the UTF-NN encodings,
which can be used to encode the codepoints as byte sequences for EXTERNAL
purposes. In your case, UTF-8 has been used as it is the default encoding
on your platform.
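To make the internal/external distinction concrete, here is a small sketch, assuming Python 3.5 or later for `bytes.hex()`. The same one-character string is encoded with several external encodings; the `\u00A3` escape only names the code point and says nothing about any of them.

```python
s = "\u00a3"  # one code point, U+00A3

encoded = {name: s.encode(name).hex()
           for name in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le")}

for name, hexbytes in encoded.items():
    print(name, hexbytes)
# utf-8 c2a3
# utf-16-le a300
# utf-16-be 00a3
# utf-32-le a3000000
```

The `a300` that the original poster expected is what UTF-16 (little-endian) would have produced, had that encoding been requested when opening the file.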
| From | Benjamin Kaplan <benjamin.kaplan@case.edu> |
|---|---|
| Date | 2011-05-11 21:14 -0700 |
| Message-ID | <mailman.1445.1305173694.9059.python-list@python.org> |
| In reply to | #5187 |
On Wed, May 11, 2011 at 8:44 PM, harrismh777 <harrismh777@charter.net> wrote:
> Steven D'Aprano wrote:
>>>
>>> You need to understand the difference between characters and bytes.
>>
>> http://www.joelonsoftware.com/articles/Unicode.html
>>
>> is also a good resource.
>
> Thanks for being patient guys, here's what I've done:
>
> >>> astr="pound sign"
> >>> asym=" \u00A3"
> >>> afile=open("myfile", mode='w')
> >>> afile.write(astr + asym)
> 12
> >>> afile.close()
>
>
> When I edit "myfile" with vi I see the 'characters' :
>
> pound sign £
>
> ... same with emacs, same with gedit ...
>
>
> When I hexdump myfile I see this:
>
> 0000000 6f70 6e75 2064 6973 6e67 c220 00a3
>
>
> This is *not* what I expected... well it is (little-endian) right up to the
> 'c2' and that is what is confusing me....
>
> I did not open the file with an encoding of UTF-8... so I'm assuming UTF-16
> by default (python3) so I was expecting a '00A3' little-endian as 'A300' but
> what I got instead was UTF-8 little-endian 'c2a3' ....
>
quick note here: UTF-8 doesn't have an endian-ness. It's always read
from left to right, with the high bit telling you whether you need to
continue or not. So it's always "little endian".
> See my problem?... when I open the file with emacs I see the character pound
> sign... same with gedit... they're all using UTF-8 by default. By default it
> looks like Python3 is writing output with UTF-8 as default... and I thought
> that by default Python3 was using either UTF-16 or UTF-32. So, I'm confused
> here... also, I used the character sequence \u00A3 which I thought was
> UTF-16... but Python3 changed my intent to 'c2a3' which is the normal
> UTF-8...
>
The fact that CPython uses UCS-2 or UCS-4 internally is an
implementation detail and isn't actually part of the Python
specification. As far as a Python program is concerned, a Unicode
string is a list of character objects, not bytes. Much like any other
object, a unicode character needs to be serialized before it can be
written to a file. An encoding is a serialization function for
characters.
If the file you're writing to doesn't specify an encoding, Python will
default to locale.getdefaultencoding(), which tries to get your
system's preferred encoding from environment variables (in other
words, the same source that emacs and gedit will use to get the
default encoding).
| From | "John Machin" <sjmachin@lexicon.net> |
|---|---|
| Date | 2011-05-12 14:41 +1000 |
| Message-ID | <mailman.1446.1305175310.9059.python-list@python.org> |
| In reply to | #5187 |
On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote:
>
> If the file you're writing to doesn't specify an encoding, Python will
> default to locale.getdefaultencoding(),
No such attribute. Perhaps you mean locale.getpreferredencoding()
| From | harrismh777 <harrismh777@charter.net> |
|---|---|
| Date | 2011-05-12 01:14 -0500 |
| Message-ID | <gVKyp.26577$Vp.24961@newsfe14.iad> |
| In reply to | #5197 |
John Machin wrote:
> On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote:
>>
>> If the file you're writing to doesn't specify an encoding, Python will
>> default to locale.getdefaultencoding(),
>
> No such attribute. Perhaps you mean locale.getpreferredencoding()
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
>>>
Yessssssss! :)
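A sketch of how the locale default relates to `open()`, assuming Python 3. The printed defaults depend on the machine and on the Python version, so no expected values are shown for them; only the explicitly requested encoding is predictable.

```python
import locale
import os
import tempfile

preferred = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # No encoding given: Python picks a platform-dependent default.
    with open(path, "w") as f:
        default_used = f.encoding

    # Encoding given: the result no longer depends on the user's locale.
    with open(path, "w", encoding="utf-8") as f:
        explicit_used = f.encoding

print(preferred)       # platform dependent
print(default_used)    # platform dependent
print(explicit_used)   # utf-8
```

For a program that will run on other people's machines, passing `encoding=` explicitly is the usual way to avoid having to guess the user's default.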
| From | TheSaint <nobody@nowhere.net.no> |
|---|---|
| Date | 2011-05-12 20:40 +0800 |
| Message-ID | <iqgkf8$vob$1@speranza.aioe.org> |
| In reply to | #5197 |
John Machin wrote:
> On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote:
>>
>> If the file you're writing to doesn't specify an encoding, Python will
>> default to locale.getdefaultencoding(),
>
> No such attribute. Perhaps you mean locale.getpreferredencoding()
What about sys.getfilesystemencoding()?
When distributing a program, how can one guess which encoding the user
will have?
--
goto /dev/null
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2011-05-12 14:07 +1000 |
| Message-ID | <874o50k1eb.fsf@benfinney.id.au> |
| In reply to | #5182 |
MRAB <python@mrabarnett.plus.com> writes:
> You need to understand the difference between characters and bytes.
Yep. Those who don't need to join us in the third millennium, and the
resources pointed out in this thread are good to help that.
> A string contains characters, a file contains bytes.
That's not true for Python 2.
I'd phrase that as:
* Text is a sequence of characters. Most inputs to the program,
  including files, sockets, etc., contain a sequence of bytes.
* Always know whether you're dealing with text or with bytes. No object
  can be both.
* In Python 2, ‘str’ is the type for a sequence of bytes. ‘unicode’ is
  the type for text.
* In Python 3, ‘str’ is the type for text. ‘bytes’ is the type for a
  sequence of bytes.
--
“I went to a garage sale. ‘How much for the garage?’ ‘It's not for sale.’” -- Steven Wright
Ben Finney
| From | harrismh777 <harrismh777@charter.net> |
|---|---|
| Date | 2011-05-12 01:31 -0500 |
| Message-ID | <U8Lyp.1000$dL5.14@newsfe08.iad> |
| In reply to | #5193 |
Ben Finney wrote:
> I'd phrase that as:
> * Text is a sequence of characters. Most inputs to the program,
> including files, sockets, etc., contain a sequence of bytes.
> * Always know whether you're dealing with text or with bytes. No object
> can be both.
> * In Python 2, ‘str’ is the type for a sequence of bytes. ‘unicode’ is
> the type for text.
> * In Python 3, ‘str’ is the type for text. ‘bytes’ is the type for a
> sequence of bytes.
That is very helpful... thanks
MRAB, Steve, John, Terry, Ben F, Ben K, Ian...
...thank you guys so much, I think I've got a better picture now of
what is going on... this is also one place where I don't think the books
are as clear as they need to be at least for me...(Lutz, Summerfield).
So, the UTF-16 UTF-32 is INTERNAL only, for Python... and text in/out is
based on locale... in my case UTF-8 ...that is enormously helpful for
me... understanding locale on this system is as mystifying as unicode is
in the first place.
Well, after reading about unicode tonight (about four hours) I realize
that it's not really that hard... there are just a lot of details that have
to come together. Straightening out that whole tower-of-babel thing is
sure a pain in the butt.
I also was not aware that UTF-8 chars could be up to six(6) bytes long
from left to right. I see now that the little-endianness I was
ascribing to python is just a function of hexdump... and I was a little
disappointed to find that hexdump does not support UTF-8, just ascii...doh.
Anyway, thanks again... I've got enough now to play around a bit...
PS thanks Steve for that link, informative and entertaining too... Joel
says, "If you are a programmer . . . and you don't know the basics of
characters, character sets, encodings, and Unicode, and I catch you, I'm
going to punish you by making you peel onions for 6 months in a
submarine. I swear I will". :)
kind regards,
m harris
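The apparent little-endianness in the hexdump output can be reproduced without Python's string internals being involved at all. This sketch assumes the dump was produced by hexdump's default display, which groups the file into 16-bit words on a little-endian machine; `le_words` is a hypothetical helper written for this illustration.

```python
data = "pound sign \u00a3".encode("utf-8")

def le_words(raw):
    """Group bytes into 16-bit little-endian words, zero-padding an odd tail."""
    if len(raw) % 2:
        raw += b"\x00"
    return [f"{raw[i + 1]:02x}{raw[i]:02x}" for i in range(0, len(raw), 2)]

# Bytes in the order they are stored in the file.
print(" ".join(f"{b:02x}" for b in data))
# 70 6f 75 6e 64 20 73 69 67 6e 20 c2 a3

# The same bytes, paired and swapped the way a default hexdump shows them.
print(" ".join(le_words(data)))
# 6f70 6e75 2064 6973 6e67 c220 00a3
```

The file holds `c2 a3` in that order; the `c220 00a3` grouping comes purely from how the words are displayed. A byte-wise view such as `hexdump -C` avoids the swap.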
| From | "John Machin" <sjmachin@lexicon.net> |
|---|---|
| Date | 2011-05-12 17:58 +1000 |
| Message-ID | <mailman.1450.1305187110.9059.python-list@python.org> |
| In reply to | #5209 |
On Thu, May 12, 2011 4:31 pm, harrismh777 wrote:
>
> So, the UTF-16 UTF-32 is INTERNAL only, for Python
NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are
encodings for the EXTERNAL representation of Unicode characters in byte
streams.
> I also was not aware that UTF-8 chars could be up to six(6) bytes long
> from left to right.
It could be, once upon a time in ISO faerieland, when it was thought that
Unicode could grow to 2**32 codepoints. However ISO and the Unicode
consortium have agreed that 17 planes is the utter max, and accordingly a
valid UTF-8 byte sequence can be no longer than 4 bytes ... see below
>>> chr(17 * 65536)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(0x110000)
>>> chr(17 * 65536 - 1)
'\U0010ffff'
>>> _.encode('utf8')
b'\xf4\x8f\xbf\xbf'
>>> b'\xf5\x8f\xbf\xbf'.decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\python32\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf5 in position 0:
invalid start byte
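As a complement to the session above, this sketch (assuming Python 3) measures the UTF-8 length at the boundary code points where the encoded size changes. The chosen values are illustrative and deliberately avoid the surrogate range, which cannot be encoded.

```python
# Code points on either side of each UTF-8 length boundary.
boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

lengths = {hex(cp): len(chr(cp).encode("utf-8")) for cp in boundaries}

for cp, n in lengths.items():
    print(cp, n)
# 0x7f 1
# 0x80 2
# 0x7ff 2
# 0x800 3
# 0xffff 3
# 0x10000 4
# 0x10ffff 4
```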
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2011-05-12 10:17 -0600 |
| Message-ID | <mailman.1476.1305217085.9059.python-list@python.org> |
| In reply to | #5209 |
On Thu, May 12, 2011 at 1:58 AM, John Machin <sjmachin@lexicon.net> wrote:
> On Thu, May 12, 2011 4:31 pm, harrismh777 wrote:
>
>>
>> So, the UTF-16 UTF-32 is INTERNAL only, for Python
>
> NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are
> encodings for the EXTERNAL representation of Unicode characters in byte
> streams.
Right. *Under the hood* Python uses UCS-2 (which is not exactly the
same thing as UTF-16, by the way) to represent Unicode strings.
However, this is entirely transparent. To the Python programmer, a
unicode string is just an abstraction of a sequence of code-points.
You don't need to think about UCS-2 at all. The only times you need
to worry about encodings are when you're encoding unicode characters
to byte strings, or decoding bytes to unicode characters, or opening a
stream in text mode; and in those cases the only encoding that matters
is the external one.
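The difference between UCS-2 and UTF-16 shows up only for code points above U+FFFF, which UTF-16 represents as a surrogate pair and UCS-2 cannot represent at all. The sketch below assumes Python 3.3 or later, where narrow builds no longer exist and `sys.maxunicode` is always 0x10FFFF; on the 3.2 builds discussed in this thread the first two results could differ.

```python
import sys

ch = "\U0001F600"               # a code point outside the Basic Multilingual Plane
utf16 = ch.encode("utf-16-le")

print(hex(sys.maxunicode))      # 0x10ffff on Python 3.3+
print(len(ch))                  # 1 on Python 3.3+ (2 on an old narrow build)
print(len(utf16))               # 4: two 16-bit units, i.e. a surrogate pair
print(utf16.hex())              # 3dd800de, the pair D83D DE00 in little-endian order
```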
| From | jmfauth <wxjmfauth@gmail.com> |
|---|---|
| Date | 2011-05-12 23:28 -0700 |
| Message-ID | <492f8500-fd53-4b52-bd7b-cd90a3118d38@k16g2000yqm.googlegroups.com> |
| In reply to | #5244 |
On 12 May, 18:17, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> ...
> to worry about encodings are when you're encoding unicode characters
> to byte strings, or decoding bytes to unicode characters
A small but important correction/clarification:
In Unicode, "unicode" does not encode a *character*. It
encodes a *code point*, a number, the integer associated
to the character.
jmf
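The character/code point distinction is visible through the built-ins `ord()` and `chr()`. A short sketch, assuming Python 3; the sample string is illustrative.

```python
text = "sign \u00a3"

# ord() maps each character to its code point, an ordinary integer.
code_points = [ord(c) for c in text]
print(code_points)             # [115, 105, 103, 110, 32, 163]

# chr() maps the integers back; no bytes or encodings are involved so far.
rebuilt = "".join(chr(cp) for cp in code_points)
print(rebuilt == text)         # True

# An encoding is what turns those integers into bytes.
print(text.encode("utf-8"))    # b'sign \xc2\xa3'
```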