"Decoding unicode is not supported" in unusual situation

Started by	John Nagle <nagle@animats.com>
First post	2012-03-07 00:25 -0800
Last post	2012-03-09 13:23 +0000
Articles	15 — 7 participants

  "Decoding unicode is not supported" in unusual situation John Nagle <nagle@animats.com> - 2012-03-07 00:25 -0800
    Re: "Decoding unicode is not supported" in unusual situation deets@web.de (Diez B. Roggisch) - 2012-03-07 11:20 +0100
      Re: "Decoding unicode is not supported" in unusual situation Ben Finney <ben+python@benfinney.id.au> - 2012-03-07 22:18 +1100
        Re: "Decoding unicode is not supported" in unusual situation Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-07 11:42 +0000
          Re: "Decoding unicode is not supported" in unusual situation John Nagle <nagle@animats.com> - 2012-03-07 11:25 -0800
            Re: "Decoding unicode is not supported" in unusual situation Ben Finney <ben+python@benfinney.id.au> - 2012-03-08 08:48 +1100
              Re: "Decoding unicode is not supported" in unusual situation Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-07 23:26 +0000
                Re: "Decoding unicode is not supported" in unusual situation Terry Reedy <tjreedy@udel.edu> - 2012-03-07 19:03 -0500
                Re: "Decoding unicode is not supported" in unusual situation Ben Finney <ben+python@benfinney.id.au> - 2012-03-08 13:18 +1100
                  Re: "Decoding unicode is not supported" in unusual situation John Nagle <nagle@animats.com> - 2012-03-08 14:23 -0800
                    RE: "Decoding unicode is not supported" in unusual situation "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-08 22:58 +0000
                      Re: "Decoding unicode is not supported" in unusual situation John Nagle <nagle@animats.com> - 2012-03-09 10:11 -0800
                        Re: "Decoding unicode is not supported" in unusual situation Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-10 00:57 +0000
                          Re: "Decoding unicode is not supported" in unusual situation John Nagle <nagle@animats.com> - 2012-03-10 22:12 -0800
                    Re: "Decoding unicode is not supported" in unusual situation Neil Cerutti <neilc@norwich.edu> - 2012-03-09 13:23 +0000

#21308 — "Decoding unicode is not supported" in unusual situation

From	John Nagle <nagle@animats.com>
Date	2012-03-07 00:25 -0800
Subject	"Decoding unicode is not supported" in unusual situation
Message-ID	<4f571b94$0$12037$742ec2ed@news.sonic.net>

I'm getting

line 79, in tounicode
return(unicode(s, errors='replace'))
TypeError: decoding Unicode is not supported

from this, under Python 2.7:

def tounicode(s) :
     if type(s) == unicode :
         return(s)
     return(unicode(s, errors='replace'))

That would seem to be impossible.  But it's not.
"s" is generated from the "suds" SOAP client.  The documentation
for "suds" says:

"Suds leverages python meta programming to provide an intuative API for 
consuming web services. Runtime objectification of types defined in the 
WSDL is provided without class generation."

I think that somewhere in "suds", they subclass the "unicode" type.
That's almost too cute.

The proper test is

	isinstance(s,unicode)
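
To illustrate the difference between the two tests, here is a minimal sketch written for Python 3, where `str` plays the role that `unicode` plays in 2.7. The `SudsLikeText` subclass is hypothetical and only stands in for whatever suds actually returns.

```python
class SudsLikeText(str):
    """Hypothetical stand-in for a library-defined text subclass."""


def is_text_exact(s):
    # Exact type comparison: rejects subclasses.
    return type(s) == str


def is_text(s):
    # isinstance accepts str and any subclass of it.
    return isinstance(s, str)


s = SudsLikeText("hello")
print(is_text_exact(s))  # False
print(is_text(s))        # True
```

The exact comparison falls through to the decoding branch for the subclass, which is how the TypeError above can be reached.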


					John Nagle


#21312

From	deets@web.de (Diez B. Roggisch)
Date	2012-03-07 11:20 +0100
Message-ID	<m2obs890ge.fsf@web.de>
In reply to	#21308

John Nagle <nagle@animats.com> writes:

> I think that somewhere in "suds", they subclass the "unicode" type.
> That's almost too cute.
>
> The proper test is
>
> 	isinstance(s,unicode)


Woot, you finally discovered polymorphism - congratulations!

Diez


#21315

From	Ben Finney <ben+python@benfinney.id.au>
Date	2012-03-07 22:18 +1100
Message-ID	<8762egmzfp.fsf@benfinney.id.au>
In reply to	#21312

deets@web.de (Diez B. Roggisch) writes:

> John Nagle <nagle@animats.com> writes:
>
> > I think that somewhere in "suds", they subclass the "unicode" type.
> > That's almost too cute.
> >
> > The proper test is
> >
> > 	isinstance(s,unicode)
>
> Woot, you finally discovered polymorphism - congratulations!

If by “discovered” you mean “broke”.

John, polymorphism entails that it *doesn't matter* whether the object
inherits from any particular type; it only matters whether the object
behaves correctly.

So rather than testing whether the object inherits from ‘unicode’, test
whether it behaves how you expect – preferably by just using it as
though it does behave that way.

-- 
 \     Lucifer: “Just sign the Contract, sir, and the Piano is yours.” |
  `\     Ray: “Sheesh! This is long! Mind if I sign it now and read it |
_o__)                                later?” —http://www.achewood.com/ |
Ben Finney


#21316

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-03-07 11:42 +0000
Message-ID	<4f5749bc$0$29989$c3e8da3$5496439d@news.astraweb.com>
In reply to	#21315

On Wed, 07 Mar 2012 22:18:50 +1100, Ben Finney wrote:

> deets@web.de (Diez B. Roggisch) writes:
> 
>> John Nagle <nagle@animats.com> writes:
>>
>> > I think that somewhere in "suds", they subclass the "unicode" type.
>> > That's almost too cute.
>> >
>> > The proper test is
>> >
>> > 	isinstance(s,unicode)
>>
>> Woot, you finally discovered polymorphism - congratulations!
> 
> If by “discovered” you mean “broke”.
> 
> John, polymorphism entails that it *doesn't matter* whether the object
> inherits from any particular type; it only matters whether the object
> behaves correctly.
> 
> So rather than testing whether the object inherits from ‘unicode’, test
> whether it behaves how you expect – preferably by just using it as
> though it does behave that way.

I must admit that I can't quite understand John Nagle's original post, so 
I could be wrong, but I *think* that both you and Diez have misunderstood 
the nature of John's complaint.

I *think* he is complaining that some other library -- suds? -- has a 
broken test for Unicode, by using:

if type(s) is unicode: ...

instead of

if isinstance(s, unicode): ...

Consequently, when the library passes a unicode *subclass* to the 
tounicode function, the "type() is unicode" test fails. That's a bad bug.

It's arguable that the library shouldn't even use isinstance, but that's 
an argument for another day.

-- 
Steven


#21331

From	John Nagle <nagle@animats.com>
Date	2012-03-07 11:25 -0800
Message-ID	<4f57b63f$0$11986$742ec2ed@news.sonic.net>
In reply to	#21316

On 3/7/2012 3:42 AM, Steven D'Aprano wrote:

> I *think* he is complaining that some other library -- suds? -- has a
> broken test for Unicode, by using:
>
> if type(s) is unicode: ...
>
> instead of
>
> if isinstance(s, unicode): ...
>
> Consequently, when the library passes a unicode *subclass* to the
> tounicode function, the "type() is unicode" test fails. That's a bad bug.

    No, that was my bug.

    The library bug, if any, is that you can't apply

	unicode(s, errors='replace')

to a Unicode string. TypeError("Decoding unicode is not supported") is 
raised.  However

   	unicode(s)

will accept Unicode input.

The Python documentation
("http://docs.python.org/library/functions.html#unicode") does not 
mention this.  It is therefore necessary to check the type before
calling "unicode", or catch the undocumented TypeError exception
afterward.
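
One way to do the type check described above, sketched for Python 3 (where the same constructor is spelled `str` and the message reads "decoding str is not supported"). The helper name `to_text` and its default arguments are assumptions made for illustration.

```python
def to_text(value, encoding="utf-8", errors="replace"):
    """Return value as text, decoding only when it is actually bytes."""
    if isinstance(value, str):
        # Already text (including subclasses): nothing to decode.
        return value
    if isinstance(value, (bytes, bytearray)):
        # Bytes need a codec; undecodable bytes become U+FFFD.
        return str(value, encoding, errors)
    # Anything else goes through its normal string conversion.
    return str(value)


try:
    str("already text", errors="replace")
except TypeError as exc:
    print("TypeError:", exc)

print(to_text("already text"))
print(to_text(b"caf\xc3\xa9"))
print(to_text(42))
```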

					John Nagle


#21353

From	Ben Finney <ben+python@benfinney.id.au>
Date	2012-03-08 08:48 +1100
Message-ID	<871up4m69h.fsf@benfinney.id.au>
In reply to	#21331

John Nagle <nagle@animats.com> writes:

>    The library bug, if any, is that you can't apply
>
> 	unicode(s, errors='replace')
>
> to a Unicode string. TypeError("Decoding unicode is not supported") is
> raised.  However
>
>   	unicode(s)
>
> will accept Unicode input.

I think that's a Python bug. If the latter succeeds as a no-op, the
former should also succeed as a no-op. Neither should ever get any
errors when ‘s’ is a ‘unicode’ object already.

> The Python documentation
> ("http://docs.python.org/library/functions.html#unicode") does not
> mention this. It is therefore necessary to check the type before
> calling "unicode", or catch the undocumented TypeError exception
> afterward.

Yes, this check should not be necessary; calling the ‘unicode’
constructor with an object that's already an instance of ‘unicode’
should just return the object as-is, IMO. It shouldn't matter that
you've specified how decoding errors are to be handled, because in that
case no decoding happens anyway.

Care to report that bug to <URL:http://bugs.python.org/>, John?

-- 
 \          “Those who write software only for pay should go hurt some |
  `\                 other field.” —Erik Naggum, in _gnu.misc.discuss_ |
_o__)                                                                  |
Ben Finney


#21360

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-03-07 23:26 +0000
Message-ID	<4f57eeac$0$29989$c3e8da3$5496439d@news.astraweb.com>
In reply to	#21353

On Thu, 08 Mar 2012 08:48:58 +1100, Ben Finney wrote:

> John Nagle <nagle@animats.com> writes:
> 
>>    The library bug, if any, is that you can't apply
>>
>> 	unicode(s, errors='replace')
>>
>> to a Unicode string. TypeError("Decoding unicode is not supported") is
>> raised.  However
>>
>>   	unicode(s)
>>
>> will accept Unicode input.
> 
> I think that's a Python bug. If the latter succeeds as a no-op, the
> former should also succeed as a no-op. Neither should ever get any
> errors when ‘s’ is a ‘unicode’ object already.

No. The semantics of the unicode function (technically: a type 
constructor) are well-defined, and there are two distinct behaviours:

unicode(obj)

is analogous to str(obj), and it attempts to convert obj to a unicode 
string by calling obj.__unicode__, if it exists, or __str__ if it 
doesn't. No encoding or decoding is attempted in the event that obj is a 
unicode instance.

unicode(obj, encoding, errors) 

is explicitly stated in the docs as decoding obj if EITHER of encoding or 
errors is given, AND that obj must be either an 8-bit string (bytes) or a 
buffer object.

It is true that u''.decode() will succeed, in Python 2, but the fact that 
unicode objects have a decode method at all is IMO a bug. It has also 
been corrected in Python 3, where (unicode) str objects no longer have a 
decode method, and bytes objects no longer have an encode method.
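
The two modes can be seen side by side in a small Python 3 sketch, where `str` takes the place of `unicode` and `__str__` takes the place of `__unicode__`. The `Greeting` class is made up for the example.

```python
class Greeting:
    def __str__(self):
        return "hello"


# Mode 1: plain conversion, which calls __str__ on the object.
print(str(Greeting()))

# Mode 2: decoding, which requires a bytes-like object.
print(str(b"hello", "ascii", "strict"))

# Passing encoding or errors with a non-bytes object is a TypeError.
for obj in (Greeting(), "hello"):
    try:
        str(obj, errors="replace")
    except TypeError as exc:
        print(type(obj).__name__, "->", exc)

# In Python 3 the mismatched methods are gone.
print(hasattr(str, "decode"), hasattr(bytes, "encode"))
```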

>> The Python documentation
>> ("http://docs.python.org/library/functions.html#unicode") does not
>> mention this.

Yes it does. It is the SECOND sentence, immediately after the summary 
line:

unicode([object[, encoding[, errors]]])
    Return the Unicode string version of object using one of the
    following modes:

    If encoding and/or errors are given, unicode() will decode the object
    which can either be an 8-bit string or a character buffer using the
    codec for encoding. ...

Admittedly, it doesn't *explicitly* state that TypeError will be raised, 
but what other exception kind would you expect when you supply an 
argument of the wrong type?

>> It is therefore necessary to check the type before
>> calling "unicode", or catch the undocumented TypeError exception
>> afterward.
> 
> Yes, this check should not be necessary; calling the ‘unicode’
> constructor with an object that's already an instance of ‘unicode’
> should just return the object as-is, IMO. It shouldn't matter that
> you've specified how decoding errors are to be handled, because in that
> case no decoding happens anyway.

I don't believe that it is the job of unicode() to Do What I Mean, but 
only to Do What I Say. If I *explicitly* tell unicode() to decode the 
argument (by specifying either the codec or the error handler or both) 
then it should not second-guess me and ignore the extra parameters.

End-user applications may, with care, try to be smart and DWIM, but 
library functions should be dumb and should do what they are told.

-- 
Steven


#21361

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-03-07 19:03 -0500
Message-ID	<mailman.495.1331165057.3037.python-list@python.org>
In reply to	#21360

On 3/7/2012 6:26 PM, Steven D'Aprano wrote:
> On Thu, 08 Mar 2012 08:48:58 +1100, Ben Finney wrote:
>
>> John Nagle<nagle@animats.com>  writes:
>>
>>>     The library bug, if any, is that you can't apply
>>>
>>> 	unicode(s, errors='replace')
>>>
>>> to a Unicode string. TypeError("Decoding unicode is not supported") is
>>> raised.  However
>>>
>>>    	unicode(s)
>>>
>>> will accept Unicode input.
>>
>> I think that's a Python bug. If the latter succeeds as a no-op, the
>> former should also succeed as a no-op. Neither should ever get any
>> errors when ‘s’ is a ‘unicode’ object already.

> No. The semantics of the unicode function (technically: a type
> constructor) are well-defined, and there are two distinct behaviours:
>
> unicode(obj)
>
> is analogous to str(obj), and it attempts to convert obj to a unicode
> string by calling obj.__unicode__, if it exists, or __str__ if it
> doesn't. No encoding or decoding is attempted in the event that obj is a
> unicode instance.
>
> unicode(obj, encoding, errors)
>
> is explicitly stated in the docs as decoding obj if EITHER of encoding or
> errors is given, AND that obj must be either an 8-bit string (bytes) or a
> buffer object.
>
> It is true that u''.decode() will succeed, in Python 2, but the fact that
> unicode objects have a decode method at all is IMO a bug. It has also

I believe that is because in Py 2, codecs and .encode/.decode were also
used for same-type recoding such as base64 and uu coding. That was
simplified in Py3 so that 'decoding' is bytes to string and 'encoding' is
string to bytes, and base64, etc., are only done in their separate modules
and not also duplicated in the codecs machinery.
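
As a sketch of the Python 3 arrangement described here, the `base64` module handles the bytes-to-bytes transform while `encode` and `decode` move between text and bytes. The sample string is arbitrary.

```python
import base64

text = "hello"
raw = text.encode("utf-8")        # text -> bytes
armored = base64.b64encode(raw)   # bytes -> bytes
# Undo both steps: base64 back to raw bytes, then bytes back to text.
restored = base64.b64decode(armored).decode("utf-8")

print(armored)
print(restored == text)
```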

> been corrected in Python 3, where (unicode) str objects no longer have a
> decode method, and bytes objects no longer have an encode method.
>
>
>>> The Python documentation
>>> ("http://docs.python.org/library/functions.html#unicode") does not
>>> mention this.
>
> Yes it does. It is the SECOND sentence, immediately after the summary
> line:
>
> unicode([object[, encoding[, errors]]])
>      Return the Unicode string version of object using one of the
>      following modes:
>
>      If encoding and/or errors are given, unicode() will decode the object
>      which can either be an 8-bit string or a character buffer using the
>      codec for encoding. ...
>
>
> Admittedly, it doesn't *explicitly* state that TypeError will be raised,
> but what other exception kind would you expect when you supply an
> argument of the wrong type?

What you have correctly pointed out is that there is no discrepancy 
between doc and behavior and hence no bug for the purpose of the 
tracker. Thanks.

>>> It is therefore necessary to check the type before
>>> calling "unicode", or catch the undocumented TypeError exception
>>> afterward.
>>
>> Yes, this check should not be necessary; calling the ‘unicode’
>> constructor with an object that's already an instance of ‘unicode’
>> should just return the object as-is, IMO. It shouldn't matter that
>> you've specified how decoding errors are to be handled, because in that
>> case no decoding happens anyway.
>
> I don't believe that it is the job of unicode() to Do What I Mean, but
> only to Do What I Say. If I *explicitly* tell unicode() to decode the
> argument (by specifying either the codec or the error handler or both)
> then it should not double-guess me and ignore the extra parameters.
>
> End-user applications may, with care, try to be smart and DWIM, but
> library functions should be dumb and should do what they are told.

-- 
Terry Jan Reedy


#21367

From	Ben Finney <ben+python@benfinney.id.au>
Date	2012-03-08 13:18 +1100
Message-ID	<87sjhjltsp.fsf@benfinney.id.au>
In reply to	#21360

Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:

> On Thu, 08 Mar 2012 08:48:58 +1100, Ben Finney wrote:
> > I think that's a Python bug. If the latter succeeds as a no-op, the
> > former should also succeed as a no-op. Neither should ever get any
> > errors when ‘s’ is a ‘unicode’ object already.
>
> No. The semantics of the unicode function (technically: a type 
> constructor) are well-defined, and there are two distinct behaviours:

That is documented, right. Thanks for drawing my attention to it.

> > Yes, this check should not be necessary; calling the ‘unicode’
> > constructor with an object that's already an instance of ‘unicode’
> > should just return the object as-is, IMO. It shouldn't matter that
> > you've specified how decoding errors are to be handled, because in
> > that case no decoding happens anyway.
>
> I don't believe that it is the job of unicode() to Do What I Mean, but 
> only to Do What I Say. If I *explicitly* tell unicode() to decode the 
> argument (by specifying either the codec or the error handler or both) 

That's where I disagree. Specifying what to do in the case of decoding
errors is *not* explicitly requesting to decode.

The decision of whether to decode is up to the object, not the caller.
Specifying an error handler *in case* decoding errors happen is not the
same as specifying that decoding must happen.

In other words: I think specifying an encoding is saying “decode this”,
but I don't think the same is true of specifying an error handler.

> End-user applications may, with care, try to be smart and DWIM, but 
> library functions should be dumb and should do what they are told.

Agreed, and I think this is compatible with my position.

-- 
 \     “Creativity can be a social contribution, but only in so far as |
  `\         society is free to use the results.” —Richard M. Stallman |
_o__)                                                                  |
Ben Finney


#21402

From	John Nagle <nagle@animats.com>
Date	2012-03-08 14:23 -0800
Message-ID	<4f593178$0$11963$742ec2ed@news.sonic.net>
In reply to	#21367

On 3/7/2012 6:18 PM, Ben Finney wrote:
> Steven D'Aprano<steve+comp.lang.python@pearwood.info>  writes:
>
>> On Thu, 08 Mar 2012 08:48:58 +1100, Ben Finney wrote:
>>> I think that's a Python bug. If the latter succeeds as a no-op, the
>>> former should also succeed as a no-op. Neither should ever get any
>>> errors when ‘s’ is a ‘unicode’ object already.
>>
>> No. The semantics of the unicode function (technically: a type
>> constructor) are well-defined, and there are two distinct behaviours:

    Right. The real problem is that Python 2.7 doesn't have distinct
"str" and "bytes" types.  type(bytes()) returns <type 'str'>.
"str" is assumed to be ASCII 0..127, but that's not enforced.
"bytes" and "str" should have been distinct types, but
that would have broken much old code.  If they were distinct, then
constructors could distinguish between string type conversion
(which requires no encoding information) and byte stream decoding.

    So it's possible to get junk characters in a "str", and they
won't convert to Unicode.  I've had this happen with databases which
were supposed to be ASCII, but occasionally a non-ASCII character
would slip through.
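
The "junk character" problem can be reproduced in Python 3 with an explicit bytes value. This is only a sketch; the sample bytes are invented and stand in for a database field that was expected to be ASCII.

```python
field = b"caf\xe9"  # one non-ASCII byte in supposedly ASCII data

try:
    field.decode("ascii")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

# An error handler keeps the good characters and marks the bad byte.
print(field.decode("ascii", errors="replace"))
```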

    This is all different in Python 3.x, where "str" is Unicode and
"bytes" really are a distinct type.

				John Nagle


#21407

From	"Prasad, Ramit" <ramit.prasad@jpmorgan.com>
Date	2012-03-08 22:58 +0000
Message-ID	<mailman.522.1331247546.3037.python-list@python.org>
In reply to	#21402

>     Right. The real problem is that Python 2.7 doesn't have distinct
> "str" and "bytes" types.  type(bytes()) returns <type 'str'>.
> "str" is assumed to be ASCII 0..127, but that's not enforced.
> "bytes" and "str" should have been distinct types, but
> that would have broken much old code.  If they were distinct, then
> constructors could distinguish between string type conversion
> (which requires no encoding information) and byte stream decoding.
> 
>     So it's possible to get junk characters in a "str", and they
> won't convert to Unicode.  I've had this happen with databases which
> were supposed to be ASCII, but occasionally a non-ASCII character
> would slip through.

bytes and str are just aliases for each other. 

>>> id( bytes )
505366496
>>> id( str )
505366496
>>> type( bytes )
<type 'type'>
>>> type( str )
<type 'type'>
>>> bytes == str 
True
>>> bytes is str
True


And I do not think they were ever intended to be just 
ASCII because chr() takes 0 - 256 (non-inclusive) and 
returns a str.
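
For contrast, a quick Python 3 check, where the two names no longer refer to the same type. This is a sketch only and says nothing new about 2.7.

```python
print(bytes is str)           # False in Python 3
print(type(b"abc").__name__)  # bytes
print(type("abc").__name__)   # str

# A single byte still covers 0 <= i < 256.
print(bytes([255]))
try:
    bytes([256])
except ValueError as exc:
    print("ValueError:", exc)
```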


Ramit


Ramit Prasad | JPMorgan Chase Investment Bank | Currencies Technology
712 Main Street | Houston, TX 77002
work phone: 713 - 216 - 5423



#21434

From	John Nagle <nagle@animats.com>
Date	2012-03-09 10:11 -0800
Message-ID	<4f5a47f0$0$11979$742ec2ed@news.sonic.net>
In reply to	#21407

On 3/8/2012 2:58 PM, Prasad, Ramit wrote:
>>      Right. The real problem is that Python 2.7 doesn't have distinct
>> "str" and "bytes" types.  type(bytes()) returns <type 'str'>.
>> "str" is assumed to be ASCII 0..127, but that's not enforced.
>> "bytes" and "str" should have been distinct types, but
>> that would have broken much old code.  If they were distinct, then
>> constructors could distinguish between string type conversion
>> (which requires no encoding information) and byte stream decoding.
>>
>>      So it's possible to get junk characters in a "str", and they
>> won't convert to Unicode.  I've had this happen with databases which
>> were supposed to be ASCII, but occasionally a non-ASCII character
>> would slip through.
>
> bytes and str are just aliases for each other.

    That's true in Python 2.7, but not in 3.x.  From 2.6 forward,
"bytes" and "str" were slowly being separated.  See PEP 358.
Some of the problems in Python 2.7 come from this ambiguity.
Logically, "unicode" of "str" should be a simple type conversion
from ASCII to Unicode, while "unicode" of "bytes" should
require an encoding.  But because of the bytes/str ambiguity
in Python 2.6/2.7, the behavior couldn't be type-based.

				John Nagle


#21440

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-03-10 00:57 +0000
Message-ID	<4f5aa70e$0$30002$c3e8da3$5496439d@news.astraweb.com>
In reply to	#21434

On Fri, 09 Mar 2012 10:11:58 -0800, John Nagle wrote:

> On 3/8/2012 2:58 PM, Prasad, Ramit wrote:
>>>      Right. The real problem is that Python 2.7 doesn't have distinct
>>> "str" and "bytes" types.  type(bytes()) returns <type 'str'>. "str" is
>>> assumed to be ASCII 0..127, but that's not enforced. "bytes" and "str"
>>> should have been distinct types, but that would have broken much old
>>> code.  If they were distinct, then constructors could distinguish
>>> between string type conversion (which requires no encoding
>>> information) and byte stream decoding.
>>>
>>>      So it's possible to get junk characters in a "str", and they
>>> won't convert to Unicode.  I've had this happen with databases which
>>> were supposed to be ASCII, but occasionally a non-ASCII character
>>> would slip through.
>>
>> bytes and str are just aliases for each other.
> 
>     That's true in Python 2.7, but not in 3.x.  From 2.6 forward,
> "bytes" and "str" were slowly being separated.  See PEP 358. Some of the
> problems in Python 2.7 come from this ambiguity. Logically, "unicode" of
> "str" should be a simple type conversion from ASCII to Unicode, while
> "unicode" of "bytes" should require an encoding.  But because of the
> bytes/str ambiguity in Python 2.6/2.7, the behavior couldn't be
> type-based.

This demonstrates a gross confusion about both Unicode and Python. John, 
I honestly don't mean to be rude here, but if you actually believe that 
(rather than merely expressing yourself poorly), then it seems to me that 
you are desperately misinformed about Unicode and are working on the 
basis of some serious misapprehensions about the nature of strings.

I recommend you start with this:

http://www.joelonsoftware.com/articles/Unicode.html

In Python 2.6/2.7, there is no ambiguity between str/bytes. The two names 
are aliases for each other. The older name, "str", is a misnomer, since 
it *actually* refers to bytes (and always has, all the way back to the 
earliest days of Python). At best, it could be read as "byte string" or 
"8-bit string", but the emphasis should always be on the *bytes*.

str is NOT "assumed to be ASCII 0..127", and it never has been. Python's 
str prior to version 3.0 has *always* been bytes, it just never used that 
name. For example, in Python 2.4, help(chr) explicitly supports 
characters with ordinal 0...255:

Help on built-in function chr in module __builtin__:

chr(...)
    chr(i) -> character

    Return a string of one character with ordinal i; 0 <= i < 256.

I can go all the way back to Python 0.9, which was so primitive it didn't 
even accept "" as string delimiters, and the str type was still based on 
bytes, with explicit support for non-ASCII values:

steve@runes:~/Downloads/python-0.9.1$ ./python0.9.1 
>>> print 'This is *not* ASCII \xCA see the non-ASCII byte.'
This is *not* ASCII � see the non-ASCII byte.

Any conversion from bytes (including Python 2 strings) to Unicode is 
ALWAYS a decoding operation. It can't possibly be anything else. If you 
think that it can be, you don't understand the relationship between 
strings, Unicode and bytes.
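
A small Python 3 sketch of why this is always a decoding step: the same byte from the example above maps to different text, or to an error, depending on the codec chosen. The choice of codecs here is arbitrary.

```python
data = b"\xca"

as_latin1 = data.decode("latin-1")
as_cp437 = data.decode("cp437")
print(as_latin1 == as_cp437)  # False: same byte, different characters

try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8 on its own")
```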

-- 
Steven


#21481

From	John Nagle <nagle@animats.com>
Date	2012-03-10 22:12 -0800
Message-ID	<4f5c4262$0$11961$742ec2ed@news.sonic.net>
In reply to	#21440

On 3/9/2012 4:57 PM, Steven D'Aprano wrote:
> On Fri, 09 Mar 2012 10:11:58 -0800, John Nagle wrote:
> This demonstrates a gross confusion about both Unicode and Python. John,
> I honestly don't mean to be rude here, but if you actually believe that
> (rather than merely expressing yourself poorly), then it seems to me that
> you are desperately misinformed about Unicode and are working on the
> basis of some serious misapprehensions about the nature of strings.
>
> In Python 2.6/2.7, there is no ambiguity between str/bytes. The two names
> are aliases for each other. The older name, "str", is a misnomer, since
> it *actually* refers to bytes (and always has, all the way back to the
> earliest days of Python). At best, it could be read as "byte string" or
> "8-bit string", but the emphasis should always be on the *bytes*.

    There's an inherent ambiguity in that "bytes" and "str" are really
the same type in Python 2.6/2.7.  That's a hack for backwards
compatibility, and it goes away in 3.x.  The notes for PEP 358
admit this.

    It's implicit in allowing

	unicode(s)

with no encoding, on type "str", that there is an implicit
assumption that s is ASCII.  Arguably, "unicode()" should
have required an encoding in all cases.

Or "str" and "bytes" should have been made separate types in
Python 2.7, in which case unicode() of a str would be a safe
ASCII to Unicode translation, and unicode() of a bytes object
would require an encoding.  But that would break too much old code.
So we have an ambiguity and a hack.

"While Python 2 also has a unicode string type, the fundamental 
ambiguity of the core string type, coupled with Python 2's default 
behavior of supporting automatic coercion from 8-bit strings to unicode 
objects when the two are combined, often leads to UnicodeErrors"
- PEP 404
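
For comparison, Python 3 refuses the automatic coercion that the quote describes. A minimal sketch with made-up values:

```python
text = "abc"
data = b"def"

try:
    text + data
except TypeError as exc:
    print("TypeError:", exc)

# The conversion has to be spelled out, codec included.
print(text + data.decode("ascii"))
```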

				John Nagle


#21424

From	Neil Cerutti <neilc@norwich.edu>
Date	2012-03-09 13:23 +0000
Message-ID	<9ruei9FnpgU3@mid.individual.net>
In reply to	#21402

On 2012-03-08, John Nagle <nagle@animats.com> wrote:
> So it's possible to get junk characters in a "str", and they
> won't convert to Unicode.  I've had this happen with databases
> which were supposed to be ASCII, but occasionally a non-ASCII
> character would slip through.

Perhaps encode and then decode, rather than try to encode a
non-encoded str.

-- 
Neil Cerutti

