Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!news.albasani.net!weretis.net!feeder4.news.weretis.net!feeds.phibee-telecom.net!newsfeed.xs4all.nl!newsfeed2.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <538bcfff$0$29978$c3e8da3$5496439d@news.astraweb.com>
References: <mailman.10509.1401552642.18130.python-list@python.org> <538a8f48$0$29978$c3e8da3$5496439d@news.astraweb.com> <mailman.10531.1401663275.18130.python-list@python.org> <538bcfff$0$29978$c3e8da3$5496439d@news.astraweb.com>
Date: Mon, 2 Jun 2014 12:23:05 +1000
Subject: Re: Python 3.2 has some deadly infection
From: Tim Delaney <timothy.c.delaney@gmail.com>
To: "Steven D'Aprano" <steve+comp.lang.python@pearwood.info>
Content-Type: multipart/alternative; boundary=047d7b86d782484f4604fad113b2
Cc: Python-List <python-list@python.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.10534.1401676125.18130.python-list@python.org>
Lines: 126
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:72390

--047d7b86d782484f4604fad113b2
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On 2 June 2014 11:14, Steven D'Aprano <steve+comp.lang.python@pearwood.info>
wrote:

> On Mon, 02 Jun 2014 08:54:33 +1000, Tim Delaney wrote:
> > I'm currently working on a product that interacts with lots of other
> > products. These other products can be using any encoding - but most of
> > the functions that interact with I/O assume the system default encoding
> > of the machine that is collecting the data. The product has been in
> > production for nearly a decade, so there's a lot of pushback against
> > changes deep in the code for fear that it will break working systems.
> > The fact that they are working largely by accident appears to escape
> > them ...
> >
> > FWIW, changing to use iso-latin-1 by default would be the most sensible
> > option (effectively treating everything as bytes), with the option for
> > another encoding if/when more information is known (e.g. there's often a
> > call to return the encoding, and the output of that call is guaranteed
> > to be ASCII).
>
> Python 2 does what you suggest, and it is *broken*. Python 2.7 creates
> moji-bake, while Python 3 gets it right:
>

The purpose of my example was to show a case where no thought was put into
encodings - the assumption was that the system encoding and the remote
system encoding would be the same. This is most definitely not the case a
lot of the time.

I should also have been clearer that *in the particular situation I was
talking about* iso-latin-1 as the default would be the right thing to do,
not in the general case. Quite often we won't know the correct encoding
until we've executed a command via ssh - iso-latin-1 lets us extract the
info we need (which will generally be 7-bit ASCII) without any possibility
of a decoding error. Sure, we may get mojibake, but that's better than
the alternative when we don't yet know the correct encoding.
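
To make that concrete, here is a minimal sketch of the approach, not the
product's actual code. The "encoding=<name>" first line and the function
name decode_when_encoding_known are hypothetical, made up purely for
illustration; the real call that reports the encoding will look different.

```python
import re

def decode_when_encoding_known(raw: bytes) -> str:
    """Decode provisionally as latin-1, then re-decode once the encoding is known.

    Assumes a hypothetical wire format whose first line is ASCII
    "encoding=<name>"; this is an illustration only.
    """
    # latin-1 maps every byte 0x00-0xFF to a code point, so this never raises.
    provisional = raw.decode("latin-1")

    # The part we care about is 7-bit ASCII, so it survives the provisional decode.
    match = re.match(r"encoding=([A-Za-z0-9_.\-]+)\n", provisional)
    if match is None:
        # Encoding still unknown: keep the lossless latin-1 view (possibly mojibake).
        return provisional

    # Encoding back to latin-1 recovers the original bytes exactly,
    # so they can now be decoded with the real encoding.
    body = provisional[match.end():]
    return body.encode("latin-1").decode(match.group(1))

raw = b"encoding=utf-8\n" + "δжç".encode("utf-8")
text = decode_when_encoding_known(raw)
print(text)
```

The point is that the provisional latin-1 decode loses nothing: the
original bytes can always be recovered and decoded again once the real
encoding turns up.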


> Latin-1 is one of those legacy encodings which needs to die, not to be
> entrenched as the default. My terminal uses UTF-8 by default (as it
> should), and if I use the terminal to input "=CE=B4=D0=B6=C3=A7", Python =
ought to see
> what I input, not Latin-1 moji-bake.
>

For some purposes, there needs to be a way to treat an arbitrary stream of
bytes as an arbitrary stream of 8-bit characters. iso-latin-1 is a
convenient way to do that. It's not the only way, but settling on it and
being consistent is better than not having a way.
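
A short sketch of what I mean, assuming Python 3. The variable names are
just for illustration. It also shows one of the other ways (the
"surrogateescape" error handler) for comparison.

```python
# Every possible byte value, in order.
all_bytes = bytes(range(256))

# latin-1 maps byte N to code point N, so any byte string decodes
# and encodes back to exactly the same bytes.
as_text = all_bytes.decode("latin-1")
latin1_round_trips = as_text.encode("latin-1") == all_bytes
same_values = all(ord(ch) == b for ch, b in zip(as_text, all_bytes))

# A strict UTF-8 decode of arbitrary bytes can fail.
try:
    all_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Another way: keep UTF-8 but smuggle undecodable bytes through
# as lone surrogates, which also round-trips.
escaped = all_bytes.decode("utf-8", errors="surrogateescape")
escape_round_trips = escaped.encode("utf-8", errors="surrogateescape") == all_bytes

print(latin1_round_trips, same_values, utf8_failed, escape_round_trips)
```

Either works as long as the same choice is made everywhere; latin-1 has
the advantage that the resulting str contains only ordinary 8-bit
characters.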

Tim Delaney

--047d7b86d782484f4604fad113b2--