
Trying to understand this moji-bake

Started by	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
First post	2014-01-25 04:37 +0000
Last post	2014-01-25 21:15 +0000
Articles	11 — 8 participants


  Trying to understand this moji-bake Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-25 04:37 +0000
    Re: Trying to understand this moji-bake Cameron Simpson <cs@zip.com.au> - 2014-01-25 16:08 +1100
    Re: Trying to understand this moji-bake Chris Angelico <rosuav@gmail.com> - 2014-01-25 17:08 +1100
      Re: Trying to understand this moji-bake Peter Pearson <ppearson@nowhere.invalid> - 2014-01-25 17:56 +0000
        Re: Trying to understand this moji-bake Chris Angelico <rosuav@gmail.com> - 2014-01-26 06:13 +1100
        Re: Trying to understand this moji-bake Terry Reedy <tjreedy@udel.edu> - 2014-01-25 22:31 -0500
      Re: Trying to understand this moji-bake Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-26 02:04 +0000
        Re: Trying to understand this moji-bake Chris Angelico <rosuav@gmail.com> - 2014-01-26 13:08 +1100
    Re: Trying to understand this moji-bake Peter Otten <__peter__@web.de> - 2014-01-25 09:56 +0100
    Re: Trying to understand this moji-bake wxjmfauth@gmail.com - 2014-01-25 01:24 -0800
    Re: Trying to understand this moji-bake Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-01-25 21:15 +0000

#64719 — Trying to understand this moji-bake

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-01-25 04:37 +0000
Subject	Trying to understand this moji-bake
Message-ID	<52e33f8d$0$29999$c3e8da3$5496439d@news.astraweb.com>

I have an unexpected display error when dealing with Unicode strings, and 
I cannot understand where the error is occurring. I suspect it's not 
actually a Python issue, but I thought I'd ask here to start.

Using Python 3.3, if I print a unicode string from the command line, it 
displays correctly. I'm using the KDE 3.5 Konsole application, with the 
encoding set to the default (which ought to be UTF-8, I believe, although 
I'm not completely sure). This displays correctly:

[steve@ando ~]$ python3.3 -c "print(u'ñøλπйж')"
ñøλπйж


Likewise for Python 3.2:

[steve@ando ~]$ python3.2 -c "print('ñøλπйж')"
ñøλπйж


But using Python 2.7, I get a really bad case of moji-bake:

[steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
Ã±Ã¸Î»ÏÐ¹Ð¶


However, interactively it works fine:

[steve@ando ~]$ python2.7 -E
Python 2.7.2 (default, May 18 2012, 18:25:10)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print u'ñøλπйж'
ñøλπйж


This occurs on at least two different machines, one using Centos and the 
other Debian.

Anyone have any idea what's going on? I can replicate the display error 
using Python 3 like this:

py> s = 'ñøλπйж'
py> print(s.encode('utf-8').decode('latin-1'))
Ã±Ã¸Î»ÏÐ¹Ð¶

but I'm not sure why it's happening at the command line. Anyone have any 
ideas?
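
For reference, the Python 3 reproduction above can be run in both directions. This is only a minimal sketch and not part of the original experiment: garble() and repair() are names made up for illustration, and the repair step only works if the wrong codec really was Latin-1 and no bytes were lost on the way.

```python
# Python 3 sketch: UTF-8 bytes misread as Latin-1, and the reverse repair.
# garble() and repair() are hypothetical helper names.

def garble(text):
    """Encode as UTF-8, then wrongly decode those bytes as Latin-1."""
    return text.encode('utf-8').decode('latin-1')


def repair(garbled):
    """Undo garble(): recover the original bytes, then decode as UTF-8."""
    return garbled.encode('latin-1').decode('utf-8')


# The six characters from the post, written as escapes so the sketch
# does not depend on the encoding of its own source.
s = '\u00f1\u00f8\u03bb\u03c0\u0439\u0436'
bad = garble(s)

print(len(s), len(bad))      # 6 12
print(repair(bad) == s)      # True
```

Each of the six characters takes two bytes in UTF-8, which is why the garbled string has twelve characters.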



-- 
Steven


#64722

From	Cameron Simpson <cs@zip.com.au>
Date	2014-01-25 16:08 +1100
Message-ID	<mailman.5964.1390627595.18130.python-list@python.org>
In reply to	#64719

On 25Jan2014 04:37, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> I have an unexpected display error when dealing with Unicode strings, and 
> I cannot understand where the error is occurring. I suspect it's not 
> actually a Python issue, but I thought I'd ask here to start.
> 
> Using Python 3.3, if I print a unicode string from the command line, it 
> displays correctly. I'm using the KDE 3.5 Konsole application, with the 
> encoding set to the default (which ought to be UTF-8, I believe, although 
> I'm not completely sure).

There are at least 2 layers: the encoding Python uses when writing
to the terminal, and the decoding the terminal applies to the byte
stream to decide what to display.

The former can be printed with:

  import sys
  print(sys.stdout.encoding)

The latter depends on your desktop settings and KDE settings I
guess. I would hope the Konsole will decide based on your environment
settings. Running the shell command:

  locale

will print the settings derived from that. Provided your environment
matches that which invoked the Konsole, that should be informative.

But I expect the Konsole is decoding using UTF-8 because so much
else works for you already.
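
A small script can collect the Python side of this in one go. This is a sketch only: encoding_report() is a made-up name, and it only reports what Python has chosen, not what the terminal is actually doing with the bytes.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python has chosen in this process."""
    return {
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
        'locale': locale.getpreferredencoding(False),
        'default': sys.getdefaultencoding(),
        'filesystem': sys.getfilesystemencoding(),
    }


for name, value in sorted(encoding_report().items()):
    print('%-10s %s' % (name, value))
```

Running it once via -c and once from a file (or piped into another command) should show whether the invocation method changes any of the values.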

I would point out that you could perhaps debug with something like this:

  python2.7 ..... | od -c

which will print the output bytes. By printing to the terminal,
you're letting the terminal's decoding get in your way. It is fine
for seeing correct/incorrect results, but not so fine for seeing
the bytes causing them.
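
If od is not to hand, the same inspection can be done from Python. A minimal sketch, assuming you can get at the bytes in question; hex_bytes() is a made-up helper name.

```python
import binascii


def hex_bytes(data):
    """Return the bytes of data as space-separated two-digit hex values."""
    hexed = binascii.hexlify(data).decode('ascii')
    return ' '.join(hexed[i:i + 2] for i in range(0, len(hexed), 2))


# The six characters from the example, written as escapes.
text = u'\u00f1\u00f8\u03bb\u03c0\u0439\u0436'

# Correct UTF-8 output for this string is twelve bytes.
print(hex_bytes(text.encode('utf-8')))
# c3 b1 c3 b8 ce bb cf 80 d0 b9 d0 b6
```

If the bytes reaching the terminal are longer than this for the same six characters, something has re-encoded them along the way.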

> This displays correctly:
> [steve@ando ~]$ python3.3 -c "print(u'ñøλπйж')"
> ñøλπйж
> 
> 
> Likewise for Python 3.2:
> [steve@ando ~]$ python3.2 -c "print('ñøλπйж')"
> ñøλπйж
> 
> But using Python 2.7, I get a really bad case of moji-bake:
> [steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
> Ã±Ã¸Î»ÏÐ¹Ð¶
> 
> However, interactively it works fine:
[...]

Debug by printing sys.stdout.encoding at this point.

I do recall getting different output encodings depending on how
Python was invoked; I forget the pattern, but I also remember writing
some ghastly hack to work around it, which I can't find at the
moment...

Also see "man python2.7", in particular the PYTHONIOENCODING environment
variable. That might let you exert more control.
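
To see what PYTHONIOENCODING does without involving the terminal at all, one can run a child interpreter and capture its raw output. A sketch under the assumption that sys.executable is the interpreter you want to test; child_output() is a made-up name.

```python
import os
import subprocess
import sys


def child_output(io_encoding):
    """Run a child Python that prints U+00F1; return its raw stdout bytes."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    # The child's source is pure ASCII, so source encoding plays no part.
    code = "print(u'\\u00f1')"
    return subprocess.check_output([sys.executable, '-c', code], env=env)


for enc in ('utf-8', 'latin-1'):
    print(enc, repr(child_output(enc).strip()))
```

With utf-8 the child should emit the two bytes c3 b1, and with latin-1 the single byte f1.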

Cheers,
-- 
Cameron Simpson <cs@zip.com.au>

ASCII  n s. [from the greek]  Those people who, at certain times of the year,
have no shadow at noon; such are the inhabitants of the torrid zone.
        - 1837 copy of Johnson's Dictionary


#64723

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-25 17:08 +1100
Message-ID	<mailman.5965.1390630146.18130.python-list@python.org>
In reply to	#64719

On Sat, Jan 25, 2014 at 3:37 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> But using Python 2.7, I get a really bad case of moji-bake:
>
> [steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
> Ã±Ã¸Î»ÏÐ¹Ð¶

What's 2.7's default source code encoding? I thought it was ascii, but
maybe it's assuming (in the absence of a magic cookie) that it's
Latin-1.

ChrisA


#64748

From	Peter Pearson <ppearson@nowhere.invalid>
Date	2014-01-25 17:56 +0000
Message-ID	<bkic64FrnhdU1@mid.individual.net>
In reply to	#64723

On Sat, 25 Jan 2014 17:08:56 +1100, Chris Angelico <rosuav@gmail.com> wrote:
> On Sat, Jan 25, 2014 at 3:37 PM, Steven D'Aprano
><steve+comp.lang.python@pearwood.info> wrote:
>> But using Python 2.7, I get a really bad case of moji-bake:
>>
>> [steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
>> Ã±Ã¸Î»ÏÐ¹Ð¶
>
> What's 2.7's default source code encoding? I thought it was ascii, but
> maybe it's assuming (in the absence of a magic cookie) that it's
> Latin-1.
>
> ChrisA

I seem to be getting the same behavior as Steven:

$ python2.7 -c "print u'ñøλπйж'"
Ã±Ã¸Î»ÏÐ¹Ð¶
$ python2.7 -c "import sys; print(sys.stdout.encoding)"
UTF-8
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
$ python2.7 -c "import sys; print(sys.stdin.encoding)"
UTF-8

Also, my GNOME Terminal 3.4.1.1 character encoding is "Unicode (UTF-8)".

HTH

-- 
To email me, substitute nowhere->spamcop, invalid->net.


#64750

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-26 06:13 +1100
Message-ID	<mailman.5979.1390677239.18130.python-list@python.org>
In reply to	#64748

On Sun, Jan 26, 2014 at 4:56 AM, Peter Pearson <ppearson@nowhere.invalid> wrote:
> $ python2.7 -c "import sys; print(sys.stdin.encoding)"
> UTF-8

This isn't from stdin, though, it's about the interpretation of the
bytes of source code without a magic cookie.

According to PEP 263 [1], the default encoding should have become
"ascii" as of Python 2.5. That's what puzzles me.

ChrisA

[1] http://www.python.org/dev/peps/pep-0263/
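
For comparison, Python 3 exposes its cookie detection in the standard library as tokenize.detect_encoding(). The sketch below shows how a PEP 263 declaration is picked up there; note that it reflects Python 3, where the default is UTF-8, and says nothing directly about what 2.7 does with -c. declared_encoding() is a made-up wrapper name.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python 3's tokenizer would pick for source_bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _ = tokenize.detect_encoding(readline)
    return encoding


no_cookie = b"print('hello')\n"
with_cookie = b"# -*- coding: latin-1 -*-\nprint('hello')\n"

print(declared_encoding(no_cookie))     # utf-8
print(declared_encoding(with_cookie))   # iso-8859-1
```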


#64759

From	Terry Reedy <tjreedy@udel.edu>
Date	2014-01-25 22:31 -0500
Message-ID	<mailman.5985.1390707137.18130.python-list@python.org>
In reply to	#64748

On 1/25/2014 2:13 PM, Chris Angelico wrote:
> On Sun, Jan 26, 2014 at 4:56 AM, Peter Pearson <ppearson@nowhere.invalid> wrote:
>> $ python2.7 -c "import sys; print(sys.stdin.encoding)"
>> UTF-8
>
> This isn't from stdin, though, it's about the interpretation of the
> bytes of source code without a magic cookie.
>
> According to PEP 263 [1], the default encoding should have become
> "ascii" as of Python 2.5. That's what puzzles me.

I believe it is actually (but unofficially) latin-1 so that latin-1 
accented chars can be used in identifiers even though only ascii is 
officially supported.

-- 
Terry Jan Reedy


#64755

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-01-26 02:04 +0000
Message-ID	<52e46d38$0$29999$c3e8da3$5496439d@news.astraweb.com>
In reply to	#64723

On Sat, 25 Jan 2014 17:08:56 +1100, Chris Angelico wrote:

> On Sat, Jan 25, 2014 at 3:37 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> But using Python 2.7, I get a really bad case of moji-bake:
>>
>> [steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
>> Ã±Ã¸Î»ÏÐ¹Ð¶
> 
> What's 2.7's default source code encoding? I thought it was ascii, but
> maybe it's assuming (in the absence of a magic cookie) that it's
> Latin-1.

I think that's it! Python 2.7 ought to raise a SyntaxError, since there's 
no source encoding declared, while Python 3.3 defaults to UTF-8 which is 
the same as my terminal. If there's a bug, it is that Python 2.7 doesn't 
raise SyntaxError when called with -c and there are non-ASCII literals in 
the source. Instead, it seems to be defaulting to Latin-1, hence the moji-
bake.

Thanks to everyone who responded!

-- 
Steven


#64756

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-26 13:08 +1100
Message-ID	<mailman.5983.1390702113.18130.python-list@python.org>
In reply to	#64755

On Sun, Jan 26, 2014 at 1:04 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> If there's a bug, it is that Python 2.7 doesn't
> raise SyntaxError when called with -c and there are non-ASCII literals in
> the source. Instead, it seems to be defaulting to Latin-1, hence the moji-
> bake.

That might well be a bug! I was reading the PEP, which was pretty
clear about it needing to be ASCII by default. It's not so clear about
-c but I would expect it to do the same.

ChrisA


#64733

From	Peter Otten <__peter__@web.de>
Date	2014-01-25 09:56 +0100
Message-ID	<mailman.5970.1390640181.18130.python-list@python.org>
In reply to	#64719

Steven D'Aprano wrote:

> I have an unexpected display error when dealing with Unicode strings, and
> I cannot understand where the error is occurring. I suspect it's not
> actually a Python issue, but I thought I'd ask here to start.

I suppose it is a Python issue -- where Python fails to guess an encoding it 
usually falls back to ascii.

> But using Python 2.7, I get a really bad case of moji-bake:
> 
> [steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
> Ã±Ã¸Î»ÏÐ¹Ð¶
> 
> 
> However, interactively it works fine:
> 
> [steve@ando ~]$ python2.7 -E
> Python 2.7.2 (default, May 18 2012, 18:25:10)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-52)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> print u'ñøλπйж'
> ñøλπйж

You can provoke it with exec:

>>> exec "print u'ñøλπйж'"
Ã±Ã¸Î»ÏÐ¹Ð¶
>>> exec u"print u'ñøλπйж'"
ñøλπйж
>>> exec "# -*- coding: utf-8 -*-\nprint u'ñøλπйж'"
ñøλπйж
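
A rough Python 3 analogue of the exec experiment: compile() and exec() honour a PEP 263 cookie when given bytes, so declaring the wrong encoding reproduces the garbling. This is a sketch, not an explanation of the 2.7 internals; run_source() is a made-up helper name.

```python
# Python 3 sketch: a coding cookie changes how bytes source is decoded.
# run_source() is a hypothetical helper name.

def run_source(source_bytes):
    """Execute bytes source and return the value it binds to `s`."""
    namespace = {}
    exec(compile(source_bytes, '<bytes>', 'exec'), namespace)
    return namespace['s']


# Source bytes as a UTF-8 terminal would hand them over.
utf8_source = "s = '\u00f1'\n".encode('utf-8')

# No cookie: Python 3 assumes UTF-8, so the literal is one character.
as_utf8 = run_source(utf8_source)

# Wrong cookie: the same two bytes are read as two Latin-1 characters.
as_latin1 = run_source(b"# -*- coding: latin-1 -*-\n" + utf8_source)

print(len(as_utf8), len(as_latin1))   # 1 2
```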

> This occurs on at least two different machines, one using Centos and the
> other Debian.
> 
> Anyone have any idea what's going on? I can replicate the display error
> using Python 3 like this:
> 
> py> s = 'ñøλπйж'
> py> print(s.encode('utf-8').decode('latin-1'))
> Ã±Ã¸Î»ÏÐ¹Ð¶
> 
> but I'm not sure why it's happening at the command line. Anyone have any
> ideas?

It is probably buried in the C code -- after a few indirections I lost 
track :(


#64734

From	wxjmfauth@gmail.com
Date	2014-01-25 01:24 -0800
Message-ID	<42dab079-6766-4efd-aa64-33fdce2d3178@googlegroups.com>
In reply to	#64719

On Saturday, 25 January 2014 05:37:34 UTC+1, Steven D'Aprano wrote:
> I have an unexpected display error when dealing with Unicode strings, and 
> I cannot understand where the error is occurring. I suspect it's not 
> actually a Python issue, but I thought I'd ask here to start.
> 
> Using Python 3.3, if I print a unicode string from the command line, it 
> displays correctly. I'm using the KDE 3.5 Konsole application, with the 
> encoding set to the default (which ought to be UTF-8, I believe, although 
> I'm not completely sure). This displays correctly:
> 
> [steve@ando ~]$ python3.3 -c "print(u'ñøλπйж')"
> ñøλπйж
> 
> Likewise for Python 3.2:
> 
> [steve@ando ~]$ python3.2 -c "print('ñøλπйж')"
> ñøλπйж
> 
> But using Python 2.7, I get a really bad case of moji-bake:
> 
> [steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
> Ã±Ã¸Î»ÏÐ¹Ð¶
> 
> However, interactively it works fine:
> 
> [steve@ando ~]$ python2.7 -E
> Python 2.7.2 (default, May 18 2012, 18:25:10)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-52)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> print u'ñøλπйж'
> ñøλπйж
> 
> This occurs on at least two different machines, one using Centos and the 
> other Debian.
> 
> Anyone have any idea what's going on? I can replicate the display error 
> using Python 3 like this:
> 
> py> s = 'ñøλπйж'
> py> print(s.encode('utf-8').decode('latin-1'))
> Ã±Ã¸Î»ÏÐ¹Ð¶
> 
> but I'm not sure why it's happening at the command line. Anyone have any 
> ideas?

The basic problem is neither Python, nor the system (OS), nor
the terminal, nor the GUI console. The basic problem is that
all these elements [*] are not "speaking" the same language.

The second problem lies in Python itself. Python attempts
to solve this problem by doing its own "cooking" based on the
elements I pointed to above [*], with the side effect that the
situation may just become more confused and/or simply not work
properly (sys.std***.encoding, print, GUI/terminal, source
coding, ...)

The third problem is more *x specific. In many cases,
the Python "distribution" is tweaked to make it work on a
specific *x version/distribution (sys.getdefaultencoding(),
site.py, sitecustomize.py), and the end result is a Python
that does not work properly.

Fourth problem. GUI applications that are supposed to mimic the
"real" terminal do so by adding their own "recipes".

Fifth problem. The user who has to understand all this
stuff.

n-th problem, ...
jmf

PS I already understood all this stuff ten years ago!


#64752

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2014-01-25 21:15 +0000
Message-ID	<mailman.5981.1390684541.18130.python-list@python.org>
In reply to	#64719

On 25 January 2014 04:37, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>
> But using Python 2.7, I get a really bad case of moji-bake:
>
> [steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
> Ã±Ã¸Î»ÏÐ¹Ð¶
>
> However, interactively it works fine:
>
> [steve@ando ~]$ python2.7 -E
> Python 2.7.2 (default, May 18 2012, 18:25:10)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-52)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> print u'ñøλπйж'
> ñøλπйж
>
> This occurs on at least two different machines, one using Centos and the
> other Debian.

Same for me. It's to do with using a u literal:

$ python2.7 -c "print('ñøλπйж')"
ñøλπйж
$ python2.7 -c "print(u'ñøλπйж')"
Ã±Ã¸Î»Ï€Ð¹Ð¶
$ python2.7 -c "print(repr('ñøλπйж'))"
'\xc3\xb1\xc3\xb8\xce\xbb\xcf\x80\xd0\xb9\xd0\xb6'
$ python2.7 -c "print(repr(u'ñøλπйж'))"
u'\xc3\xb1\xc3\xb8\xce\xbb\xcf\x80\xd0\xb9\xd0\xb6'

$ python2.7
Python 2.7.5+ (default, Sep 19 2013, 13:49:51)
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> b='\xc3\xb1\xc3\xb8\xce\xbb\xcf\x80\xd0\xb9\xd0\xb6'
>>> print(b)
ñøλπйж
>>> s=u'\xc3\xb1\xc3\xb8\xce\xbb\xcf\x80\xd0\xb9\xd0\xb6'
>>> print(s)
Ã±Ã¸Î»Ï€Ð¹Ð¶
>>> print(s.encode('latin-1'))
ñøλπйж
>>> import sys
>>> sys.getdefaultencoding()
'ascii'

It works in the interactive prompt:

>>> s = 'ñøλπйж'
>>> print(s)
ñøλπйж
>>> s = u'ñøλπйж'
>>> print(s)
ñøλπйж

But the interactive prompt has an associated encoding:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'

If I put it into a utf-8 file with no encoding declared I get a SyntaxError:
$ cat tmp.py
s = u'ñøλπйж'
print(s)
oscar@tonis-laptop:~$ python2.7 tmp.py
  File "tmp.py", line 1
SyntaxError: Non-ASCII character '\xc3' in file tmp.py on line 1, but
no encoding declared; see http://www.python.org/peps/pep-0263.html for
details

If I add the encoding declaration it works:

oscar@tonis-laptop:~$ vim tmp.py
oscar@tonis-laptop:~$ cat tmp.py
# -*- coding: utf-8 -*-
s = u'ñøλπйж'
print(s)
oscar@tonis-laptop:~$ python2.7 tmp.py
ñøλπйж
oscar@tonis-laptop:~$

So I'd say that your original example should be a SyntaxError with
Python 2.7 but instead it implicitly uses latin-1.
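
The two code paths can be modelled in Python 3 at the byte level. This is only a sketch of what the repr output above suggests, assuming the Latin-1 fallback is really what 2.7 does for -c and that stdout is UTF-8; it is not taken from the interpreter source.

```python
# Python 3 model of the two python2.7 -c cases in this thread.

# What the shell hands to python2.7 -c on a UTF-8 terminal.
source_bytes = '\u00f1\u00f8\u03bb\u03c0\u0439\u0436'.encode('utf-8')

# print('...'): a Python 2 byte string, written to stdout unchanged.
plain_out = source_bytes

# print(u'...'): the bytes are apparently decoded as Latin-1, then the
# resulting unicode object is encoded with sys.stdout.encoding (UTF-8).
unicode_obj = source_bytes.decode('latin-1')
unicode_out = unicode_obj.encode('utf-8')

# What a UTF-8 terminal would then display in each case.
shown_plain = plain_out.decode('utf-8')
shown_unicode = unicode_out.decode('utf-8')

print(len(plain_out), len(unicode_out))       # 12 24
print(len(shown_plain), len(shown_unicode))   # 6 12
```

In this model the byte string survives only because nothing decodes it before the terminal does, while the unicode literal is encoded twice.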


Oscar

