From: Tim Delaney
To: "Steven D'Aprano"
Cc: Python-List
Newsgroups: comp.lang.python
Subject: Re: Python 3.2 has some deadly infection
Date: Mon, 2 Jun 2014 12:23:05 +1000

On 2 June 2014 11:14, Steven D'Aprano wrote:

> On Mon, 02 Jun 2014 08:54:33 +1000, Tim Delaney wrote:
> > I'm currently working on a product that interacts with lots of other
> > products. These other products can be using any encoding - but most of
> > the functions that interact with I/O assume the system default encoding
> > of the machine that is collecting the data. The product has been in
> > production for nearly a decade, so there's a lot of pushback against
> > changes deep in the code for fear that it will break working systems.
> > The fact that they are working largely by accident appears to escape
> > them ...
> >
> > FWIW, changing to use iso-latin-1 by default would be the most sensible
> > option (effectively treating everything as bytes), with the option for
> > another encoding if/when more information is known (e.g. there's often a
> > call to return the encoding, and the output of that call is guaranteed
> > to be ASCII).
>
> Python 2 does what you suggest, and it is *broken*. Python 2.7 creates
> moji-bake, while Python 3 gets it right:

The purpose of my example was to show a case where no thought was put into
encodings - the assumption was that the system encoding and the remote
system encoding would be the same. This is most definitely not the case a
lot of the time.

I also should have been more clear that *in the particular situation I was
talking about* iso-latin-1 as default would be the right thing to do, not in
the general case. Quite often we won't know the correct encoding until we've
executed a command via ssh - iso-latin-1 will allow us to extract the info
we need (which will generally be 7-bit ASCII) without the possibility of an
invalid encoding. Sure we may get mojibake, but that's better than the
alternative when we don't yet know the correct encoding.

> Latin-1 is one of those legacy encodings which needs to die, not to be
> entrenched as the default. My terminal uses UTF-8 by default (as it
> should), and if I use the terminal to input "δжç", Python ought to see
> what I input, not Latin-1 moji-bake.

For some purposes, there needs to be a way to treat an arbitrary stream of
bytes as an arbitrary stream of 8-bit characters. iso-latin-1 is a
convenient way to do that. It's not the only way, but settling on it and
being consistent is better than not having a way.

Tim Delaney