Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!xlned.com!feeder7.xlned.com!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
To: python-list@python.org
From: Terry Reedy <tjreedy@udel.edu>
Subject: Re: "Decoding unicode is not supported" in unusual situation
Date: Wed, 07 Mar 2012 19:03:41 -0500
References: <4f571b94$0$12037$742ec2ed@news.sonic.net> <m2obs890ge.fsf@web.de> <8762egmzfp.fsf@benfinney.id.au> <4f5749bc$0$29989$c3e8da3$5496439d@news.astraweb.com> <4f57b63f$0$11986$742ec2ed@news.sonic.net> <871up4m69h.fsf@benfinney.id.au> <4f57eeac$0$29989$c3e8da3$5496439d@news.astraweb.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20111105 Thunderbird/8.0
In-Reply-To: <4f57eeac$0$29989$c3e8da3$5496439d@news.astraweb.com>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.495.1331165057.3037.python-list@python.org>
Lines: 109
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:21361

On 3/7/2012 6:26 PM, Steven D'Aprano wrote:
> On Thu, 08 Mar 2012 08:48:58 +1100, Ben Finney wrote:
>
>> John Nagle<nagle@animats.com>  writes:
>>
>>>     The library bug, if any, is that you can't apply
>>>
>>> 	unicode(s, errors='replace')
>>>
>>> to a Unicode string. TypeError("Decoding unicode is not supported") is
>>> raised.  However
>>>
>>>    	unicode(s)
>>>
>>> will accept Unicode input.
>>
>> I think that's a Python bug. If the latter succeeds as a no-op, the
>> former should also succeed as a no-op. Neither should ever get any
>> errors when ‘s’ is a ‘unicode’ object already.

> No. The semantics of the unicode function (technically: a type
> constructor) are well-defined, and there are two distinct behaviours:
>
> unicode(obj)
>
> is analogous to str(obj), and it attempts to convert obj to a unicode
> string by calling obj.__unicode__, if it exists, or __str__ if it
> doesn't. No encoding or decoding is attempted in the event that obj is a
> unicode instance.
>
> unicode(obj, encoding, errors)
>
> is explicitly stated in the docs as decoding obj if EITHER of encoding or
> errors is given, AND that obj must be either an 8-bit string (bytes) or a
> buffer object.
>
> It is true that u''.decode() will succeed, in Python 2, but the fact that
> unicode objects have a decode method at all is IMO a bug. It has also

I believe that is because in Py 2, codecs and .encode/.decode were also used
for same-type recoding such as base64 and uu coding. That was simplified in
Py3 so that 'decoding' is bytes to string and 'encoding' is string to
bytes, and base64, etc., are only done in their separate modules and not
also duplicated in the codecs machinery.
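
A minimal sketch of that Py3 split, assuming a Python 3 interpreter and
UTF-8 as the codec (the variable names are only for illustration):

```python
import base64

text = "caf\u00e9"

# 'Encoding' is str -> bytes, 'decoding' is bytes -> str.
data = text.encode("utf-8")
assert isinstance(data, bytes)
assert data.decode("utf-8") == text

# Same-type (bytes -> bytes) recoding such as base64 is done in its own module.
packed = base64.b64encode(data)
assert base64.b64decode(packed) == data

# The methods that pointed the "wrong way" are absent from the Py3 types.
has_str_decode = hasattr(str, "decode")
has_bytes_encode = hasattr(bytes, "encode")
```

Both has_* flags should come out False on Python 3.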

> been corrected in Python 3, where (unicode) str objects no longer have a
> decode method, and bytes objects no longer have an encode method.
>
>
>>> The Python documentation
>>> ("http://docs.python.org/library/functions.html#unicode") does not
>>> mention this.
>
> Yes it does. It is the SECOND sentence, immediately after the summary
> line:
>
> unicode([object[, encoding[, errors]]])
>      Return the Unicode string version of object using one of the
>      following modes:
>
>      If encoding and/or errors are given, unicode() will decode the object
>      which can either be an 8-bit string or a character buffer using the
>      codec for encoding. ...
>
>
> Admittedly, it doesn't *explicitly* state that TypeError will be raised,
> but what other exception kind would you expect when you supply an
> argument of the wrong type?

What you have correctly pointed out is that there is no discrepancy
between doc and behavior and hence no bug for the purpose of the
tracker. Thanks.
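
For anyone who wants the pass-through behavior anyway, the type check can
live in a small wrapper. This is only a sketch, written against Python 3,
where str() plays the role of Py 2's unicode() and likewise refuses to
decode an object that is already text. The name to_text and its defaults
are hypothetical, not anything from the library:

```python
def to_text(value, encoding="utf-8", errors="replace"):
    """Hypothetical helper: decode bytes, return text unchanged."""
    if isinstance(value, str):
        # Already text, so there is nothing to decode.
        return value
    # Explicit decode request; value must be bytes or a buffer object.
    return str(value, encoding, errors)


# The bare constructor does what it is told: asking it to decode
# something that is already text raises TypeError.
try:
    str("abc", errors="replace")
    raised = False
except TypeError:
    raised = True
```

That keeps the constructor itself strict and puts the "do what I mean"
choice in the caller's own code.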

>>> It is therefore necessary to check the type before
>>> calling "unicode", or catch the undocumented TypeError exception
>>> afterward.
>>
>> Yes, this check should not be necessary; calling the ‘unicode’
>> constructor with an object that's already an instance of ‘unicode’
>> should just return the object as-is, IMO. It shouldn't matter that
>> you've specified how decoding errors are to be handled, because in that
>> case no decoding happens anyway.
>
> I don't believe that it is the job of unicode() to Do What I Mean, but
> only to Do What I Say. If I *explicitly* tell unicode() to decode the
> argument (by specifying either the codec or the error handler or both)
> then it should not double-guess me and ignore the extra parameters.
>
> End-user applications may, with care, try to be smart and DWIM, but
> library functions should be dumb and should do what they are told.

-- 
Terry Jan Reedy