
unicode by default

Started by	harrismh777 <harrismh777@charter.net>
First post	2011-05-11 16:37 -0500
Last post	2011-05-11 15:34 -0700
Articles	20 on this page of 32 — 12 participants


  unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 16:37 -0500
    Re: unicode by default Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-11 16:09 -0600
      Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 17:51 -0500
        Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 09:32 +1000
          Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 20:22 -0500
            Re: unicode by default MRAB <python@mrabarnett.plus.com> - 2011-05-12 03:31 +0100
              Re: unicode by default Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-05-12 03:16 +0000
                Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 22:44 -0500
                  Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-12 00:12 -0400
                    Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-12 01:43 -0500
                  Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 14:14 +1000
                  Re: unicode by default Benjamin Kaplan <benjamin.kaplan@case.edu> - 2011-05-11 21:14 -0700
                  Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 14:41 +1000
                    Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-12 01:14 -0500
                    Re: unicode by default TheSaint <nobody@nowhere.net.no> - 2011-05-12 20:40 +0800
              Re: unicode by default Ben Finney <ben+python@benfinney.id.au> - 2011-05-12 14:07 +1000
                Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-12 01:31 -0500
                  Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 17:58 +1000
                  Re: unicode by default Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-12 10:17 -0600
                    Re: unicode by default jmfauth <wxjmfauth@gmail.com> - 2011-05-12 23:28 -0700
                      Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-13 14:53 -0500
                        Re: unicode by default Robert Kern <robert.kern@gmail.com> - 2011-05-13 15:18 -0500
                        Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-13 21:41 -0400
                          Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-14 02:41 -0500
                            Re: unicode by default jmfauth <wxjmfauth@gmail.com> - 2011-05-14 03:26 -0700
                            Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-14 16:26 -0400
                              Re: unicode by default Ben Finney <ben+python@benfinney.id.au> - 2011-05-15 09:47 +1000
                        Re: unicode by default Nobody <nobody@nowhere.com> - 2011-05-14 09:34 +0100
                  Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-12 16:42 -0400
                  Re: unicode by default Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-12 16:25 -0600
            Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 13:54 +1000
    Re: unicode by default Benjamin Kaplan <benjamin.kaplan@case.edu> - 2011-05-11 15:34 -0700


#5163 — unicode by default

From	harrismh777 <harrismh777@charter.net>
Date	2011-05-11 16:37 -0500
Subject	unicode by default
Message-ID	<OkDyp.2983$M61.450@newsfe07.iad>

hi folks,
    I am puzzled by unicode generally, and within the context of python 
specifically. For one thing, what do we mean that unicode is used in 
python 3.x by default. (I know what default means, I mean, what changed?)

    I think part of my problem is that I'm spoiled (American, ascii 
heritage) and have been either stuck in ascii knowingly, or UTF-8 
without knowing (just because the code points lined up). I am confused 
by the implications for using 3.x, because I am reading that there are 
significant things to be aware of... what?

    On my installation 2.6  sys.maxunicode comes up with 1114111, and my 
2.7 and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 
was compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that 
the default compile option for 2.7 & 3.2 (I didn't change anything) is 
set for UCS-2 (UTF-16) or 2 byte unicode(?).   Do I understand this much 
correctly?

    The books say that the .py sources are UTF-8 by default... and that 
3.x is either UCS-2 or UCS-4.  If I use the file handling capabilities 
of Python in 3.x (by default) what encoding will be used, and how will 
that affect the output?

    If I do not specify any code points above ascii 0xFF does any of 
this matter anyway?



Thanks.

kind regards,
m harris


#5168

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2011-05-11 16:09 -0600
Message-ID	<mailman.1433.1305151801.9059.python-list@python.org>
In reply to	#5163

On Wed, May 11, 2011 at 3:37 PM, harrismh777 <harrismh777@charter.net> wrote:
> hi folks,
>   I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default. (I know what default means, I mean, what changed?)

The `unicode' class was renamed to `str', and a stripped-down version
of the 2.X `str' class was renamed to `bytes'.
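
A minimal sketch of what that rename looks like from the Python 3 side. The variable names are illustrative only, not from the thread:

```python
# Python 3: a plain string literal is text (str); a b"..." literal is bytes.
text = "pound sign \u00a3"
raw = b"pound sign"

print(type(text).__name__)   # str
print(type(raw).__name__)    # bytes

# Indexing text yields a one-character str; indexing bytes yields an int.
first_char = text[0]
first_byte = raw[0]
print(first_char, first_byte)   # p 112
```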

>   I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?

Mainly Python 3 no longer does implicit conversion between bytes and
unicode, requiring the programmer to be explicit about such
conversions.  If you have Python 2 code that is sloppy about this, you
may get some Unicode encode/decode errors when trying to run the same
code in Python 3.  The 2to3 tool can help somewhat with this, but it
can't prevent all problems.
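
As a rough illustration of that strictness (a sketch assuming Python 3; the sample string is arbitrary), mixing the two types fails until the conversion is spelled out:

```python
text = "caf\u00e9"
data = text.encode("utf-8")        # explicit str -> bytes

try:
    text + data                    # Python 3 refuses to mix str and bytes
    mixed_ok = True
except TypeError:
    mixed_ok = False

round_trip = data.decode("utf-8")  # explicit bytes -> str
print(mixed_ok, round_trip == text)   # False True
```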

>   On my installation 2.6  sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?).   Do I understand this much correctly?

I think that UCS-2 has always been the default unicode width for
CPython, although the exact representation used internally is an
implementation detail.

>   The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4.  If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?

If you open a file in binary mode, the result is a non-decoded byte stream.

If you open a file in text mode and do not specify an encoding, then
the result of locale.getpreferredencoding() is used for decoding, and
the result is a unicode stream.
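
The two modes can be sketched as follows. The scratch file name is hypothetical, and passing an explicit encoding= is one way to avoid depending on the locale at all:

```python
import locale
import os
import tempfile

# What text mode falls back to when no encoding is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical scratch file

# Text mode with an explicit encoding: independent of the locale.
with open(path, "w", encoding="utf-8") as f:
    f.write("\u00a3")

# Binary mode: raw, undecoded bytes.
with open(path, "rb") as f:
    raw = f.read()

# Text mode again: the bytes are decoded back into a str.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

print(raw, text)
```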

>   If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?

You mean 0x7F, and probably, due to the need to explicitly encode and decode.


#5174

From	harrismh777 <harrismh777@charter.net>
Date	2011-05-11 17:51 -0500
Message-ID	<vpEyp.981$dL5.736@newsfe08.iad>
In reply to	#5168

Ian Kelly wrote:

    Ian, Benjamin,  thanks much.

> The `unicode' class was renamed to `str', and a stripped-down version
> of the 2.X `str' class was renamed to `bytes'.

    ... thank you, this is very helpful.

>> >     If I do not specify any code points above ascii 0xFF does any of this
>> >  matter anyway?

> You mean 0x7F, and probably, due to the need to explicitly encode and decode.

     Yes, actually, I did... and from Benjamin's reply it seems that 
this matters only if I am working with bytes.  Is it true that if I am 
working without using bytes sequences that I will not need to care about 
the encoding anyway, unless of course I need to specify a unicode code 
point?

     Thanks again.

kind regards,
m harris


#5177

From	"John Machin" <sjmachin@lexicon.net>
Date	2011-05-12 09:32 +1000
Message-ID	<mailman.1435.1305157329.9059.python-list@python.org>
In reply to	#5174

On Thu, May 12, 2011 8:51 am, harrismh777 wrote:
> Is it true that if I am
> working without using bytes sequences that I will not need to care about
> the encoding anyway, unless of course I need to specify a unicode code
> point?

Quite the contrary.

(1) You cannot work without using bytes sequences. Files are byte
sequences. Web communication is in bytes. You need to (know / assume / be
able to extract / guess) the input encoding. You need to encode your
output using an encoding that is expected by the consumer (or use an
output method that will do it for you).

(2) You don't need to use bytes to specify a Unicode code point. Just use
an escape sequence e.g. "\u0404" is a Cyrillic character.
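
For instance (a small sketch assuming Python 3; the \N{...} form is an additional escape not mentioned above), the escape produces a single character whose code point can be inspected directly:

```python
import unicodedata

ch = "\u0404"                  # \u takes exactly four hex digits
print(len(ch), hex(ord(ch)))   # 1 0x404
print(unicodedata.name(ch))    # the character's official Unicode name

# The same character can be built from its number, and characters can
# also be written by name.
same = ch == chr(0x0404)
pound = "\N{POUND SIGN}"
print(same, pound == "\u00a3")   # True True
```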


#5181

From	harrismh777 <harrismh777@charter.net>
Date	2011-05-11 20:22 -0500
Message-ID	<KDGyp.180$0t1.7@newsfe04.iad>
In reply to	#5177

John Machin wrote:
> (1) You cannot work without using bytes sequences. Files are byte
> sequences. Web communication is in bytes. You need to (know / assume / be
> able to extract / guess) the input encoding. You need to encode your
> output using an encoding that is expected by the consumer (or use an
> output method that will do it for you).
>
> (2) You don't need to use bytes to specify a Unicode code point. Just use
> an escape sequence e.g. "\u0404" is a Cyrillic character.
>

Thanks John.  In reverse order, I understand point (2). I'm less clear 
on point (1).

If I generate a string of characters that I presume to be ascii/utf-8 
(no \u0404 type characters) and write them to a file (stdout), how does 
the default encoding affect that file?   I'm not seeing that 
there is anything unusual going on...   If I open the file with vi?  If 
I open the file with gedit?  emacs?

....

Another question... in mail I'm receiving many small blocks that look 
like sprites with four small hex codes, scattered about the mail... 
mostly punctuation, maybe?   ... guessing, are these unicode code 
points, and if so what is the best way to 'guess' the encoding? ... is 
it coded in the stream somewhere...protocol?

thanks


#5182

From	MRAB <python@mrabarnett.plus.com>
Date	2011-05-12 03:31 +0100
Message-ID	<mailman.1439.1305167541.9059.python-list@python.org>
In reply to	#5181

On 12/05/2011 02:22, harrismh777 wrote:
> John Machin wrote:
>> (1) You cannot work without using bytes sequences. Files are byte
>> sequences. Web communication is in bytes. You need to (know / assume / be
>> able to extract / guess) the input encoding. You need to encode your
>> output using an encoding that is expected by the consumer (or use an
>> output method that will do it for you).
>>
>> (2) You don't need to use bytes to specify a Unicode code point. Just use
>> an escape sequence e.g. "\u0404" is a Cyrillic character.
>>
>
> Thanks John. In reverse order, I understand point (2). I'm less clear on
> point (1).
>
> If I generate a string of characters that I presume to be ascii/utf-8
> (no \u0404 type characters) and write them to a file (stdout) how does
> default encoding affect that file.by default..? I'm not seeing that
> there is anything unusual going on... If I open the file with vi? If I
> open the file with gedit? emacs?
>
> ....
>
> Another question... in mail I'm receiving many small blocks that look
> like sprites with four small hex codes, scattered about the mail...
> mostly punctuation, maybe? ... guessing, are these unicode code points,
> and if so what is the best way to 'guess' the encoding? ... is it coded
> in the stream somewhere...protocol?
>
You need to understand the difference between characters and bytes.

A string contains characters, a file contains bytes.

The encoding specifies how a character is represented as bytes.

For example:

     In the Latin-1 encoding, the character "£" is represented by the 
byte 0xA3.

     In the UTF-8 encoding, the character "£" is represented by the byte 
sequence 0xC2 0xA3.

     In the ASCII encoding, the character "£" can't be represented at all.

The advantage of UTF-8 is that it can represent _all_ Unicode
characters (codepoints, actually) as byte sequences, and all those in
the ASCII range are represented by the same single bytes which the
original ASCII system used. Use the UTF-8 encoding unless you have to
use a different one.
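
The three cases above can be checked directly; this is a sketch assuming Python 3, with illustrative variable names:

```python
pound = "\u00a3"

latin1_bytes = pound.encode("latin-1")   # b'\xa3'
utf8_bytes = pound.encode("utf-8")       # b'\xc2\xa3'

try:
    pound.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False                     # ASCII has no pound sign

# Pure-ASCII text produces identical bytes under ASCII and UTF-8.
ascii_text = "pound sign"
same = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(latin1_bytes, utf8_bytes, ascii_ok, same)
```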

A file contains only bytes, a socket handles only bytes. Which encoding
you should use for characters is down to protocol. A system such as
email, which can handle different encodings, should have a way of
specifying the encoding, and perhaps also a default encoding.


#5183

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-05-12 03:16 +0000
Message-ID	<4dcb50f8$0$29973$c3e8da3$5496439d@news.astraweb.com>
In reply to	#5182

On Thu, 12 May 2011 03:31:18 +0100, MRAB wrote:

>> Another question... in mail I'm receiving many small blocks that look
>> like sprites with four small hex codes, scattered about the mail...
>> mostly punctuation, maybe? ... guessing, are these unicode code points,
>> and if so what is the best way to 'guess' the encoding? ... is it coded
>> in the stream somewhere...protocol?
>>
> You need to understand the difference between characters and bytes.


http://www.joelonsoftware.com/articles/Unicode.html

is also a good resource.


-- 
Steven


#5187

From	harrismh777 <harrismh777@charter.net>
Date	2011-05-11 22:44 -0500
Message-ID	<rIIyp.3007$M61.2987@newsfe07.iad>
In reply to	#5183

Steven D'Aprano wrote:
>> You need to understand the difference between characters and bytes.
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
> is also a good resource.

Thanks for being patient guys, here's what I've done:

 >>> astr="pound sign"
 >>> asym=" \u00A3"
 >>> afile=open("myfile", mode='w')
 >>> afile.write(astr + asym)
12
 >>> afile.close()

When I edit "myfile" with vi I see the 'characters' :

pound sign £

    ... same with emacs, same with gedit  ...

When I hexdump myfile I see this:

0000000 6f70 6375 2064 6973 6e67 c220 00a3

This is *not* what I expected... well it is (little-endian) right up to 
the 'c2' and that is what is confusing me....

I did not open the file with an encoding of UTF-8... so I'm assuming 
UTF-16 by default (python3) so I was expecting a '00A3' little-endian as 
'A300' but what I got instead was UTF-8 little-endian  'c2a3' ....

See my problem?... when I open the file with emacs I see the character 
pound sign... same with gedit... they're all using UTF-8 by default. By 
default it looks like Python3 is writing output with UTF-8 as default... 
and I thought that by default Python3 was using either UTF-16 or UTF-32. 
So, I'm confused here...  also, I used the character sequence \u00A3 
which I thought was UTF-16... but Python3 changed my intent to  'c2a3' 
which is the normal UTF-8...

Thanks again for your patience... I really do hate to be dense about 
this...  but this is another area where I'm just beginning to dabble and 
I'd like to know soon what I'm doing...

Thanks for the link Steve... I'm headed there now...

kind regards,
m harris


#5194

From	Terry Reedy <tjreedy@udel.edu>
Date	2011-05-12 00:12 -0400
Message-ID	<mailman.1443.1305173553.9059.python-list@python.org>
In reply to	#5187

On 5/11/2011 11:44 PM, harrismh777 wrote:
> Steven D'Aprano wrote:
>>> You need to understand the difference between characters and bytes.
>>
>> http://www.joelonsoftware.com/articles/Unicode.html
>>
>> is also a good resource.
>
> Thanks for being patient guys, here's what I've done:
>
>>>>> astr="pound sign"
>>>>> asym=" \u00A3"
>>>>> afile=open("myfile", mode='w')
>>>>> afile.write(astr + asym)
>> 12
>>>>> afile.close()
>
>
> When I edit "myfile" with vi I see the 'characters' :
>
> pound sign £
>
> ... same with emacs, same with gedit ...
>
>
> When I hexdump myfile I see this:
>
> 0000000 6f70 6375 2064 6973 6e67 c220 00a3

> This is *not* what I expected... well it is (little-endian) right up to
> the 'c2' and that is what is confusing me....

> I did not open the file with an encoding of UTF-8... so I'm assuming
> UTF-16 by default (python3) so I was expecting a '00A3' little-endian as
> 'A300' but what I got instead was UTF-8 little-endian 'c2a3' ....
>
> See my problem?... when I open the file with emacs I see the character
> pound sign... same with gedit... they're all using UTF-8 by default. By
> default it looks like Python3 is writing output with UTF-8 as default...
> and I thought that by default Python3 was using either UTF-16 or UTF-32.
> So, I'm confused here... also, I used the character sequence \u00A3
> which I thought was UTF-16... but Python3 changed my intent to 'c2a3'
> which is the normal UTF-8...

If you open a file as binary (bytes), you must write bytes, and they are 
stored without transformation. If you open in text mode, you must write 
text (string as unicode in 3.2) and Python will encode to bytes using 
either some default or the encoding you specified in the open statement. 
It does not matter how Python stored the unicode internally. Does this 
help? Your intent is signalled by how you open the file.
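
A sketch of that distinction, reusing the string from the post (the temporary directory is an assumption so the example does not clobber a real file):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myfile")

# Text mode: write() takes str and returns the number of characters written.
with open(path, "w", encoding="utf-8") as afile:
    n_chars = afile.write("pound sign \u00a3")

n_bytes = os.path.getsize(path)   # one byte more: the pound sign takes two

# Binary mode: str is rejected, bytes are stored untouched.
with open(path, "wb") as bfile:
    try:
        bfile.write("pound sign \u00a3")
        accepted = True
    except TypeError:
        accepted = False
    bfile.write("pound sign \u00a3".encode("utf-8"))

print(n_chars, n_bytes, accepted)   # 12 13 False
```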

-- 
Terry Jan Reedy


#5211

From	harrismh777 <harrismh777@charter.net>
Date	2011-05-12 01:43 -0500
Message-ID	<ekLyp.1001$dL5.753@newsfe08.iad>
In reply to	#5194

Terry Reedy wrote:
> It does not matter how Python stored the unicode internally. Does this
> help? Your intent is signalled by how you open the file.

Very much, actually, thanks.  I was missing the 'internal' piece, and 
did not realize that if I didn't specify the encoding on the open that 
python would pull the default encoding from locale...

kind regards,
m harris


#5195

From	"John Machin" <sjmachin@lexicon.net>
Date	2011-05-12 14:14 +1000
Message-ID	<mailman.1444.1305173678.9059.python-list@python.org>
In reply to	#5187

On Thu, May 12, 2011 1:44 pm, harrismh777 wrote:
> By
> default it looks like Python3 is writing output with UTF-8 as default...
> and I thought that by default Python3 was using either UTF-16 or UTF-32.
> So, I'm confused here...  also, I used the character sequence \u00A3
> which I thought was UTF-16... but Python3 changed my intent to  'c2a3'
> which is the normal UTF-8...

Python uses either a 16-bit or a 32-bit INTERNAL representation of Unicode
code points. Those NN bits have nothing to do with the UTF-NN encodings,
which can be used to encode the codepoints as byte sequences for EXTERNAL
purposes. In your case, UTF-8 has been used as it is the default encoding
on your platform.
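
To make the internal/external split concrete (a sketch assuming Python 3), the same 12-character string has one length as text but a different byte count under each external encoding:

```python
s = "pound sign \u00a3"

# The length of the text does not depend on any encoding.
print(len(s))   # 12

# The byte count depends entirely on which external encoding is chosen.
sizes = {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)

# Every one of them decodes back to the identical string.
round_trips = all(s.encode(enc).decode(enc) == s for enc in sizes)
print(round_trips)   # True
```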


#5196

From	Benjamin Kaplan <benjamin.kaplan@case.edu>
Date	2011-05-11 21:14 -0700
Message-ID	<mailman.1445.1305173694.9059.python-list@python.org>
In reply to	#5187

On Wed, May 11, 2011 at 8:44 PM, harrismh777 <harrismh777@charter.net> wrote:
> Steven D'Aprano wrote:
>>>
>>> You need to understand the difference between characters and bytes.
>>
>> http://www.joelonsoftware.com/articles/Unicode.html
>>
>> is also a good resource.
>
> Thanks for being patient guys, here's what I've done:
>
>>>>> astr="pound sign"
>>>>> asym=" \u00A3"
>>>>> afile=open("myfile", mode='w')
>>>>> afile.write(astr + asym)
>>
>> 12
>>>>>
>>>>> afile.close()
>
>
> When I edit "myfile" with vi I see the 'characters' :
>
> pound sign £
>
>   ... same with emacs, same with gedit  ...
>
>
> When I hexdump myfile I see this:
>
> 0000000 6f70 6375 2064 6973 6e67 c220 00a3
>
>
> This is *not* what I expected... well it is (little-endian) right up to the
> 'c2' and that is what is confusing me....
>
> I did not open the file with an encoding of UTF-8... so I'm assuming UTF-16
> by default (python3) so I was expecting a '00A3' little-endian as 'A300' but
> what I got instead was UTF-8 little-endian  'c2a3' ....
>
quick note here: UTF-8 doesn't have an endian-ness. It is a sequence of
single bytes, always read from left to right, with the high bits of each
byte telling you whether it starts a character or continues one. So byte
order never comes into it; "little endian" doesn't apply.
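
A short sketch of where byte order does and does not matter, assuming Python 3: UTF-8 gives the same bytes everywhere, while UTF-16 comes in two byte orders and the generic codec adds a byte order mark to say which one was used.

```python
pound = "\u00a3"

utf8 = pound.encode("utf-8")     # b'\xc2\xa3', the same on every platform
le = pound.encode("utf-16-le")   # b'\xa3\x00', low byte first
be = pound.encode("utf-16-be")   # b'\x00\xa3', high byte first

# The generic "utf-16" codec prepends a BOM in the platform's byte order.
with_bom = pound.encode("utf-16")
starts_with_bom = with_bom[:2] in (b"\xff\xfe", b"\xfe\xff")
print(utf8, le, be, starts_with_bom)
```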

> See my problem?... when I open the file with emacs I see the character pound
> sign... same with gedit... they're all using UTF-8 by default. By default it
> looks like Python3 is writing output with UTF-8 as default... and I thought
> that by default Python3 was using either UTF-16 or UTF-32. So, I'm confused
> here...  also, I used the character sequence \u00A3 which I thought was
> UTF-16... but Python3 changed my intent to  'c2a3' which is the normal
> UTF-8...
>

The fact that CPython uses UCS-2 or UCS-4 internally is an
implementation detail and isn't actually part of the Python
specification. As far as a Python program is concerned, a Unicode
string is a list of character objects, not bytes. Much like any other
object, a unicode character needs to be serialized before it can be
written to a file. An encoding is a serialization function for
characters.

If the file you're writing to doesn't specify an encoding, Python will
default to locale.getdefaultencoding(), which tries to get your
system's preferred encoding from environment variables (in other
words, the same source that emacs and gedit will use to get the
default encoding).


#5197

From	"John Machin" <sjmachin@lexicon.net>
Date	2011-05-12 14:41 +1000
Message-ID	<mailman.1446.1305175310.9059.python-list@python.org>
In reply to	#5187

On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote:
>
> If the file you're writing to doesn't specify an encoding, Python will
> default to locale.getdefaultencoding(),

No such attribute. Perhaps you mean locale.getpreferredencoding()


#5204

From	harrismh777 <harrismh777@charter.net>
Date	2011-05-12 01:14 -0500
Message-ID	<gVKyp.26577$Vp.24961@newsfe14.iad>
In reply to	#5197

John Machin wrote:
> On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote:
>>
>> If the file you're writing to doesn't specify an encoding, Python will
>> default to locale.getdefaultencoding(),
>
> No such attribute. Perhaps you mean locale.getpreferredencoding()


 >>> import locale
 >>> locale.getpreferredencoding()
'UTF-8'
 >>>

Yessssssss!


:)


#5231

From	TheSaint <nobody@nowhere.net.no>
Date	2011-05-12 20:40 +0800
Message-ID	<iqgkf8$vob$1@speranza.aioe.org>
In reply to	#5197

John Machin wrote:

> On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote:
>>
>> If the file you're writing to doesn't specify an encoding, Python will
>> default to locale.getdefaultencoding(),
> 
> No such attribute. Perhaps you mean locale.getpreferredencoding()

What about sys.getfilesystemencoding()?
When distributing a program, how can one guess which encoding the user 
will have?

-- 
goto /dev/null


#5193

From	Ben Finney <ben+python@benfinney.id.au>
Date	2011-05-12 14:07 +1000
Message-ID	<874o50k1eb.fsf@benfinney.id.au>
In reply to	#5182

MRAB <python@mrabarnett.plus.com> writes:

> You need to understand the difference between characters and bytes.

Yep. Those who don't understand it need to join us in the third
millennium, and the resources pointed out in this thread are a good help
with that.

> A string contains characters, a file contains bytes.

That's not true for Python 2.

I'd phrase that as:

* Text is a sequence of characters. Most inputs to the program,
  including files, sockets, etc., contain a sequence of bytes.

* Always know whether you're dealing with text or with bytes. No object
  can be both.

* In Python 2, ‘str’ is the type for a sequence of bytes. ‘unicode’ is
  the type for text.

* In Python 3, ‘str’ is the type for text. ‘bytes’ is the type for a
  sequence of bytes.
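
One common way to follow these rules in Python 3 is to decode bytes as soon as they enter the program and encode only on the way out. The helper below is a hypothetical sketch, and UTF-8 is an assumed default rather than something the rules prescribe:

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)

incoming = b"pound sign \xc2\xa3"         # e.g. read from a socket or a binary file
text = to_text(incoming)                  # text only, inside the program
outgoing = text.upper().encode("utf-8")   # bytes again at the output boundary
print(text, outgoing)
```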

-- 
 \      “I went to a garage sale. ‘How much for the garage?’ ‘It's not |
  `\                                        for sale.’” —Steven Wright |
_o__)                                                                  |
Ben Finney


#5209

From	harrismh777 <harrismh777@charter.net>
Date	2011-05-12 01:31 -0500
Message-ID	<U8Lyp.1000$dL5.14@newsfe08.iad>
In reply to	#5193

Ben Finney wrote:
> I'd phrase that as:

> * Text is a sequence of characters. Most inputs to the program,
>    including files, sockets, etc., contain a sequence of bytes.

> * Always know whether you're dealing with text or with bytes. No object
>    can be both.

> * In Python 2, ‘str’ is the type for a sequence of bytes. ‘unicode’ is
>    the type for text.

> * In Python 3, ‘str’ is the type for text. ‘bytes’ is the type for a
>    sequence of bytes.

That is very helpful...   thanks

MRAB, Steve, John, Terry, Ben F, Ben K, Ian...
    ...thank you guys so much, I think I've got a better picture now of 
what is going on... this is also one place where I don't think the books 
are as clear as they need to be at least for me...(Lutz, Summerfield).

So, the UTF-16 UTF-32 is INTERNAL only, for Python... and text in/out is 
based on locale... in my case UTF-8  ...that is enormously helpful for 
me... understanding locale on this system is as mystifying as unicode is 
in the first place.
Well, after reading about unicode tonight (about four hours) I realize 
that it's not really that hard... there's just a lot of details that have 
to come together. Straightening out that whole tower-of-babel thing is 
sure a pain in the butt.
I also was not aware that UTF-8 chars could be up to six(6) bytes long 
from left to right.  I see now that the little-endianness I was 
ascribing to python is just a function of hexdump... and I was a little 
disappointed to find that hexdump does not support UTF-8, just ascii...doh.
Anyway, thanks again... I've got enough now to play around a bit...
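
The hexdump effect can be reproduced in a few lines. This sketch assumes that hexdump's default output groups the file into 16-bit words shown in little-endian order and pads an odd final byte with zero, which matches the dump shown earlier in the thread:

```python
data = "pound sign \u00a3".encode("utf-8")

# Bytes in file order, one at a time.
plain = " ".join("%02x" % b for b in data)

# The same bytes grouped into 16-bit little-endian words: each pair is
# printed with its two bytes swapped, and an odd tail is zero-padded.
padded = data + b"\x00" * (len(data) % 2)
words = ["%02x%02x" % (padded[i + 1], padded[i]) for i in range(0, len(padded), 2)]

print(plain)
print(" ".join(words))
```

The file itself holds c2 then a3; only the word-wise display makes it look like "c220 00a3". The second word prints as 6e75 here, so the 6375 in the dump quoted above may be a transcription slip. Running hexdump -C instead prints the bytes one at a time in file order.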

PS thanks Steve for that link, informative and entertaining too... Joe 
says, "If you are a programmer . . . and you don't know the basics of 
characters, character sets, encodings, and Unicode, and I catch you, I'm 
going to punish you by making you peel onions for 6 months in a 
submarine. I swear I will".     :)

kind regards,
m harris


#5216

From	"John Machin" <sjmachin@lexicon.net>
Date	2011-05-12 17:58 +1000
Message-ID	<mailman.1450.1305187110.9059.python-list@python.org>
In reply to	#5209

On Thu, May 12, 2011 4:31 pm, harrismh777 wrote:

>
> So, the UTF-16 UTF-32 is INTERNAL only, for Python

NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are
encodings for the EXTERNAL representation of Unicode characters in byte
streams.

> I also was not aware that UTF-8 chars could be up to six(6) bytes long
> from left to right.

It could be, once upon a time in ISO faerieland, when it was thought that
Unicode could grow to 2**32 codepoints. However ISO and the Unicode
consortium have agreed that 17 planes is the utter max, and accordingly a
valid UTF-8 byte sequence can be no longer than 4 bytes ... see below

    >>> chr(17 * 65536)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: chr() arg not in range(0x110000)
    >>> chr(17 * 65536 - 1)
    '\U0010ffff'
    >>> _.encode('utf8')
    b'\xf4\x8f\xbf\xbf'
    >>> b'\xf5\x8f\xbf\xbf'.decode('utf8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\python32\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xf5 in position 0: invalid start byte
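
The length limit can also be seen by encoding the code points on either side of each UTF-8 size boundary. This is a sketch assuming Python 3; the list of boundary values is chosen for illustration:

```python
# Code points just below and at each point where UTF-8 needs one more byte.
boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

lengths = {hex(cp): len(chr(cp).encode("utf-8")) for cp in boundaries}
for cp, n in lengths.items():
    print(cp, n)

# No valid code point needs more than four bytes.
longest = max(lengths.values())
print(longest)   # 4
```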


#5244

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2011-05-12 10:17 -0600
Message-ID	<mailman.1476.1305217085.9059.python-list@python.org>
In reply to	#5209

On Thu, May 12, 2011 at 1:58 AM, John Machin <sjmachin@lexicon.net> wrote:
> On Thu, May 12, 2011 4:31 pm, harrismh777 wrote:
>
>>
>> So, the UTF-16 UTF-32 is INTERNAL only, for Python
>
> NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are
> encodings for the EXTERNAL representation of Unicode characters in byte
> streams.

Right.  *Under the hood* Python uses UCS-2 (which is not exactly the
same thing as UTF-16, by the way) to represent Unicode strings.
However, this is entirely transparent.  To the Python programmer, a
unicode string is just an abstraction of a sequence of code-points.
You don't need to think about UCS-2 at all.  The only times you need
to worry about encodings are when you're encoding unicode characters
to byte strings, or decoding bytes to unicode characters, or opening a
stream in text mode; and in those cases the only encoding that matters
is the external one.
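
A sketch of that abstraction, assuming Python 3.3 or later, where every code point counts as one item regardless of how it is stored. A narrow 3.2 build like the ones discussed above would report a length of 4 for this string because it stores the last character as a surrogate pair.

```python
# An ASCII letter, the pound sign, and a code point outside the
# Basic Multilingual Plane.
s = "a\u00a3\U0001F40D"

n = len(s)                                # counts code points, not bytes
code_points = [hex(ord(c)) for c in s]
print(n, code_points)

# Encodings only appear once bytes are requested.
utf8_len = len(s.encode("utf-8"))         # 1 + 2 + 4 bytes
print(utf8_len)   # 7
```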


#5281

From	jmfauth <wxjmfauth@gmail.com>
Date	2011-05-12 23:28 -0700
Message-ID	<492f8500-fd53-4b52-bd7b-cd90a3118d38@k16g2000yqm.googlegroups.com>
In reply to	#5244

On 12 mai, 18:17, Ian Kelly <ian.g.ke...@gmail.com> wrote:

> ...
> to worry about encodings are when you're encoding unicode characters
> to byte strings, or decoding bytes to unicode characters

A small but important correction/clarification:

Strictly speaking, a Unicode encoding does not encode a *character*.
It encodes a *code point*, a number, the integer associated
with the character.

jmf

