Groups > comp.lang.python > #64719 > unrolled thread
| Started by | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| First post | 2014-01-25 04:37 +0000 |
| Last post | 2014-01-25 21:15 +0000 |
| Articles | 11 — 8 participants |
Trying to understand this moji-bake Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-25 04:37 +0000
Re: Trying to understand this moji-bake Cameron Simpson <cs@zip.com.au> - 2014-01-25 16:08 +1100
Re: Trying to understand this moji-bake Chris Angelico <rosuav@gmail.com> - 2014-01-25 17:08 +1100
Re: Trying to understand this moji-bake Peter Pearson <ppearson@nowhere.invalid> - 2014-01-25 17:56 +0000
Re: Trying to understand this moji-bake Chris Angelico <rosuav@gmail.com> - 2014-01-26 06:13 +1100
Re: Trying to understand this moji-bake Terry Reedy <tjreedy@udel.edu> - 2014-01-25 22:31 -0500
Re: Trying to understand this moji-bake Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-26 02:04 +0000
Re: Trying to understand this moji-bake Chris Angelico <rosuav@gmail.com> - 2014-01-26 13:08 +1100
Re: Trying to understand this moji-bake Peter Otten <__peter__@web.de> - 2014-01-25 09:56 +0100
Re: Trying to understand this moji-bake wxjmfauth@gmail.com - 2014-01-25 01:24 -0800
Re: Trying to understand this moji-bake Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-01-25 21:15 +0000
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-01-25 04:37 +0000 |
| Subject | Trying to understand this moji-bake |
| Message-ID | <52e33f8d$0$29999$c3e8da3$5496439d@news.astraweb.com> |
I have an unexpected display error when dealing with Unicode strings, and
I cannot understand where the error is occurring. I suspect it's not
actually a Python issue, but I thought I'd ask here to start.
Using Python 3.3, if I print a unicode string from the command line, it
displays correctly. I'm using the KDE 3.5 Konsole application, with the
encoding set to the default (which ought to be UTF-8, I believe, although
I'm not completely sure). This displays correctly:
[steve@ando ~]$ python3.3 -c "print(u'ñøλπйж')"
ñøλπйж
Likewise for Python 3.2:
[steve@ando ~]$ python3.2 -c "print('ñøλπйж')"
ñøλπйж
But using Python 2.7, I get a really bad case of moji-bake:
[steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
ñøλÏйж
However, interactively it works fine:
[steve@ando ~]$ python2.7 -E
Python 2.7.2 (default, May 18 2012, 18:25:10)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print u'ñøλπйж'
ñøλπйж
This occurs on at least two different machines, one using Centos and the
other Debian.
Anyone have any idea what's going on? I can replicate the display error
using Python 3 like this:
py> s = 'ñøλπйж'
py> print(s.encode('utf-8').decode('latin-1'))
ñøλÏйж
but I'm not sure why it's happening at the command line. Anyone have any
ideas?
--
Steven
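For readers following along, here is a minimal Python 3 sketch of the mechanism in the last example above: the text is encoded correctly as UTF-8, but the bytes are then decoded as Latin-1, so every byte turns into its own character. The helper name `make_mojibake` is invented for illustration and is not from the thread.

```python
def make_mojibake(text, wrong_encoding="latin-1"):
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


s = "ñøλπйж"
garbled = make_mojibake(s)

# Six characters become twelve: one character per UTF-8 byte.
print(len(s), len(garbled))

# ascii() shows the code points without depending on the terminal.
print(ascii(garbled))
```

Printing with `ascii()` keeps the terminal's own decoding out of the picture, which is the layer under suspicion here.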
| From | Cameron Simpson <cs@zip.com.au> |
|---|---|
| Date | 2014-01-25 16:08 +1100 |
| Message-ID | <mailman.5964.1390627595.18130.python-list@python.org> |
| In reply to | #64719 |
On 25Jan2014 04:37, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> I have an unexpected display error when dealing with Unicode strings, and
> I cannot understand where the error is occurring. I suspect it's not
> actually a Python issue, but I thought I'd ask here to start.
>
> Using Python 3.3, if I print a unicode string from the command line, it
> displays correctly. I'm using the KDE 3.5 Konsole application, with the
> encoding set to the default (which ought to be UTF-8, I believe, although
> I'm not completely sure).
There are at least 2 layers: the encoding python is using for
transcription to the terminal and the decoding the terminal is
making of the byte stream to decide what to display.
The former can be printed with:
import sys
print(sys.stdout.encoding)
The latter depends on your desktop settings and KDE settings I
guess. I would hope the Konsole will decide based on your environment
settings. Running the shell command:
locale
will print the settings derived from that. Provided your environment
matches that which invoked the Konsole, that should be informative.
But I expect the Konsole is decoding using UTF-8 because so much
else works for you already.
I would point out that you could perhaps debug with something like this:
python2.7 ..... | od -c
which will print the output bytes. By printing to the terminal,
you're letting the terminal's decoding get in your way. It is fine
for seeing correct/incorrect results, but not so fine for seeing
the bytes causing them.
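In the same spirit as piping through `od -c`, the bytes can also be inspected from inside Python, so the terminal never gets a chance to reinterpret them. This is a rough Python 3 sketch; the function name `describe_output` is hypothetical.

```python
import locale
import sys


def describe_output(text, encoding=None):
    """Return (encoding, hex dump) for `text` encoded with `encoding`.

    With no encoding given, fall back to sys.stdout.encoding, and then
    to UTF-8 if stdout reports no encoding at all.
    """
    enc = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    data = text.encode(enc, errors="backslashreplace")
    return enc, " ".join("%02x" % b for b in data)


# The two settings Python itself reports.
print("stdout encoding:", getattr(sys.stdout, "encoding", None))
print("locale encoding:", locale.getpreferredencoding(False))

# The bytes that would be written for the sample string under UTF-8.
print(describe_output("ñøλπйж", "utf-8"))
```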
> This displays correctly:
> [steve@ando ~]$ python3.3 -c "print(u'ñøλπйж')"
> ñøλπйж
>
>
> Likewise for Python 3.2:
> [steve@ando ~]$ python3.2 -c "print('ñøλπйж')"
> ñøλπйж
>
> But using Python 2.7, I get a really bad case of moji-bake:
> [steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
> ñøλÏйж
>
> However, interactively it works fine:
[...]
Debug by printing sys.stdout.encoding at this point.
I do recall getting different output encodings depending on how
Python was invoked; I forget the pattern, but I also remember writing
some ghastly hack to work around it, which I can't find at the
moment...
Also see "man python2.7" in particular the PYTHONIOENCODING environment
variable. That might let you exert more control.
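A small sketch of what PYTHONIOENCODING controls, assuming a Python 3 interpreter (whichever one runs the script, via `sys.executable`) rather than the 2.7 from the thread. It starts a child interpreter with the variable set and captures the raw bytes the child writes. The helper name `child_output_bytes` is made up for this example.

```python
import os
import subprocess
import sys


def child_output_bytes(io_encoding):
    """Run a child interpreter that prints 'ñ'; return the raw bytes written."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    # The child's source is pure ASCII; \xf1 is the escape for 'ñ'.
    result = subprocess.run(
        [sys.executable, "-c", r"print('\xf1')"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.strip()


print(child_output_bytes("utf-8"))
print(child_output_bytes("latin-1"))
```

The same character comes out as two bytes under UTF-8 and as one byte under Latin-1, which is the kind of difference a terminal then has to guess at.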
Cheers,
--
Cameron Simpson <cs@zip.com.au>
ASCII n s. [from the greek] Those people who, at certain times of the year,
have no shadow at noon; such are the inhabitants of the torrid zone.
- 1837 copy of Johnson's Dictionary
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-25 17:08 +1100 |
| Message-ID | <mailman.5965.1390630146.18130.python-list@python.org> |
| In reply to | #64719 |
On Sat, Jan 25, 2014 at 3:37 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> But using Python 2.7, I get a really bad case of moji-bake:
>
> [steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
> ñøλÏйж
What's 2.7's default source code encoding? I thought it was ascii, but
maybe it's assuming (in the absence of a magic cookie) that it's
Latin-1.
ChrisA
| From | Peter Pearson <ppearson@nowhere.invalid> |
|---|---|
| Date | 2014-01-25 17:56 +0000 |
| Message-ID | <bkic64FrnhdU1@mid.individual.net> |
| In reply to | #64723 |
On Sat, 25 Jan 2014 17:08:56 +1100, Chris Angelico <rosuav@gmail.com> wrote:
> On Sat, Jan 25, 2014 at 3:37 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> But using Python 2.7, I get a really bad case of moji-bake:
>>
>> [steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
>> ñøλÏйж
>
> What's 2.7's default source code encoding? I thought it was ascii, but
> maybe it's assuming (in the absence of a magic cookie) that it's
> Latin-1.
>
> ChrisA
I seem to be getting the same behavior as Steven:
$ python2.7 -c "print u'ñøλπйж'"
ñøλÏйж
$ python2.7 -c "import sys; print(sys.stdout.encoding)"
UTF-8
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
$ python2.7 -c "import sys; print(sys.stdin.encoding)"
UTF-8
Also, my GNOME Terminal 3.4.1.1 character encoding is "Unicode (UTF-8)".
HTH
--
To email me, substitute nowhere->spamcop, invalid->net.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-26 06:13 +1100 |
| Message-ID | <mailman.5979.1390677239.18130.python-list@python.org> |
| In reply to | #64748 |
On Sun, Jan 26, 2014 at 4:56 AM, Peter Pearson
<ppearson@nowhere.invalid> wrote:
> $ python2.7 -c "import sys; print(sys.stdin.encoding)"
> UTF-8
This isn't from stdin, though, it's about the interpretation of the
bytes of source code without a magic cookie.
According to PEP 263 [1], the default encoding should have become
"ascii" as of Python 2.5. That's what puzzles me.
ChrisA
[1] http://www.python.org/dev/peps/pep-0263/
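To make the "magic cookie" concrete, here is a hedged sketch using the standard library's `tokenize.detect_encoding`, which reports the source encoding a Python 3 interpreter would assume for a block of bytes. Note the assumption: this shows Python 3, where the default is UTF-8, not the Python 2.7 behaviour being debated. The wrapper name `declared_encoding` is invented here.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python 3 would assume for these bytes."""
    encoding, _first_lines = tokenize.detect_encoding(
        io.BytesIO(source_bytes).readline
    )
    return encoding


# No cookie: Python 3 falls back to UTF-8.
print(declared_encoding(b"print('hi')\n"))

# A PEP 263 cookie on the first line overrides the default.
print(declared_encoding(b"# -*- coding: latin-1 -*-\nprint('hi')\n"))
```

`detect_encoding` normalizes the spelling of the name, so `latin-1` is reported as `iso-8859-1`.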
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2014-01-25 22:31 -0500 |
| Message-ID | <mailman.5985.1390707137.18130.python-list@python.org> |
| In reply to | #64748 |
On 1/25/2014 2:13 PM, Chris Angelico wrote:
> On Sun, Jan 26, 2014 at 4:56 AM, Peter Pearson
> <ppearson@nowhere.invalid> wrote:
>> $ python2.7 -c "import sys; print(sys.stdin.encoding)"
>> UTF-8
>
> This isn't from stdin, though, it's about the interpretation of the
> bytes of source code without a magic cookie.
>
> According to PEP 263 [1], the default encoding should have become
> "ascii" as of Python 2.5. That's what puzzles me.
I believe it is actually (but unofficially) latin-1 so that latin-1
accented chars can be used in identifiers even though only ascii is
officially supported.
--
Terry Jan Reedy
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-01-26 02:04 +0000 |
| Message-ID | <52e46d38$0$29999$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #64723 |
On Sat, 25 Jan 2014 17:08:56 +1100, Chris Angelico wrote:
> On Sat, Jan 25, 2014 at 3:37 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> But using Python 2.7, I get a really bad case of moji-bake:
>>
>> [steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
>> ñøλÏйж
>
> What's 2.7's default source code encoding? I thought it was ascii, but
> maybe it's assuming (in the absence of a magic cookie) that it's
> Latin-1.
I think that's it! Python 2.7 ought to raise a SyntaxError, since there's
no source encoding declared, while Python 3.3 defaults to UTF-8 which is
the same as my terminal. If there's a bug, it is that Python 2.7 doesn't
raise SyntaxError when called with -c and there are non-ASCII literals in
the source. Instead, it seems to be defaulting to Latin-1, hence the
moji-bake.
Thanks to everyone who responded!
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-26 13:08 +1100 |
| Message-ID | <mailman.5983.1390702113.18130.python-list@python.org> |
| In reply to | #64755 |
On Sun, Jan 26, 2014 at 1:04 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> If there's a bug, it is that Python 2.7 doesn't
> raise SyntaxError when called with -c and there are non-ASCII literals in
> the source. Instead, it seems to be defaulting to Latin-1, hence the
> moji-bake.
That might well be a bug! I was reading the PEP, which was pretty clear
about it needing to be ASCII by default. It's not so clear about -c but
I would expect it to do the same.
ChrisA
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2014-01-25 09:56 +0100 |
| Message-ID | <mailman.5970.1390640181.18130.python-list@python.org> |
| In reply to | #64719 |
Steven D'Aprano wrote:
> I have an unexpected display error when dealing with Unicode strings, and
> I cannot understand where the error is occurring. I suspect it's not
> actually a Python issue, but I thought I'd ask here to start.
I suppose it is a Python issue -- where Python fails to guess an encoding it
usually falls back to ascii.
> But using Python 2.7, I get a really bad case of moji-bake:
>
> [steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
> ñøλÏйж
>
>
> However, interactively it works fine:
>
> [steve@ando ~]$ python2.7 -E
> Python 2.7.2 (default, May 18 2012, 18:25:10)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-52)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> print u'ñøλπйж'
> ñøλπйж
You can provoke it with exec:
>>> exec "print u'ñøλπйж'"
ñøλÏйж
>>> exec u"print u'ñøλπйж'"
ñøλπйж
>>> exec "# -*- coding: utf-8 -*-\nprint u'ñøλπйж'"
ñøλπйж
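The three `exec` cases above are Python 2. A rough Python 3 analogue, offered as a sketch under that assumption, is to hand raw source bytes to `compile()`, which decodes them as UTF-8 unless a coding cookie says otherwise. The helper `run_source` and the file name `"<demo>"` are invented for illustration.

```python
def run_source(source_bytes):
    """Compile and run raw source bytes; return the value bound to `s`."""
    namespace = {}
    exec(compile(source_bytes, "<demo>", "exec"), namespace)
    return namespace["s"]


# The same UTF-8 bytes are used for both runs.
body = "s = 'ñøλπйж'\n".encode("utf-8")

# No cookie: the bytes are read as UTF-8 and the string survives.
as_utf8 = run_source(body)

# A latin-1 cookie makes the compiler read each byte as one character.
as_latin1 = run_source(b"# -*- coding: latin-1 -*-\n" + body)

print(ascii(as_utf8))
print(ascii(as_latin1))
```

The second result has twelve characters instead of six, matching the garbled output seen with `python2.7 -c`.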
> This occurs on at least two different machines, one using Centos and the
> other Debian.
>
> Anyone have any idea what's going on? I can replicate the display error
> using Python 3 like this:
>
> py> s = 'ñøλπйж'
> py> print(s.encode('utf-8').decode('latin-1'))
> ñøλÏйж
>
> but I'm not sure why it's happening at the command line. Anyone have any
> ideas?
It is probably buried in the C code -- after a few indirections I lost
track :(
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2014-01-25 01:24 -0800 |
| Message-ID | <42dab079-6766-4efd-aa64-33fdce2d3178@googlegroups.com> |
| In reply to | #64719 |
On Saturday, 25 January 2014 05:37:34 UTC+1, Steven D'Aprano wrote:
> I have an unexpected display error when dealing with Unicode strings, and
> I cannot understand where the error is occurring. I suspect it's not
> actually a Python issue, but I thought I'd ask here to start.
>
> Using Python 3.3, if I print a unicode string from the command line, it
> displays correctly. I'm using the KDE 3.5 Konsole application, with the
> encoding set to the default (which ought to be UTF-8, I believe, although
> I'm not completely sure). This displays correctly:
>
> [steve@ando ~]$ python3.3 -c "print(u'ñøλπйж')"
> ñøλπйж
>
> Likewise for Python 3.2:
>
> [steve@ando ~]$ python3.2 -c "print('ñøλπйж')"
> ñøλπйж
>
> But using Python 2.7, I get a really bad case of moji-bake:
>
> [steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
> ñøλÏйж
>
> However, interactively it works fine:
>
> [steve@ando ~]$ python2.7 -E
> Python 2.7.2 (default, May 18 2012, 18:25:10)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-52)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> print u'ñøλπйж'
> ñøλπйж
>
> This occurs on at least two different machines, one using Centos and the
> other Debian.
>
> Anyone have any idea what's going on? I can replicate the display error
> using Python 3 like this:
>
> py> s = 'ñøλπйж'
> py> print(s.encode('utf-8').decode('latin-1'))
> ñøλÏйж
>
> but I'm not sure why it's happening at the command line. Anyone have any
> ideas?
The basic problem is neither Python, nor the system (OS), nor
the terminal, nor the GUI console. The basic problem is that
all these elements [*] are not "speaking" the same language.
The second problem lies in Python itself. Python attempts
to solve this problem by doing its own "cooking" based on the
elements I pointed to above [*], with the side effect that the
situation may just become more confused and/or just not work
properly (sys.std***.encoding, print, GUI/terminal, source
coding, ...)
The third problem is more *x specific. In many cases,
the Python "distribution" is tweaked in such a way as to
make it work on a specific *x-version/distribution
(sys.getdefaultencoding(), site.py, sitecustomize.py),
finally resulting in a Python that does not work properly.
Fourth problem. GUI applications that are supposed to mimic the
"real" terminal do so by adding their own "recipes".
Fifth problem. The user who has to understand all this
stuff.
n-th problem, ...
jmf
PS I already understood all this stuff ten years ago!
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2014-01-25 21:15 +0000 |
| Message-ID | <mailman.5981.1390684541.18130.python-list@python.org> |
| In reply to | #64719 |
On 25 January 2014 04:37, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>
> But using Python 2.7, I get a really bad case of moji-bake:
>
> [steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
> ñøλÏйж
>
> However, interactively it works fine:
>
> [steve@ando ~]$ python2.7 -E
> Python 2.7.2 (default, May 18 2012, 18:25:10)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-52)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> print u'ñøλπйж'
> ñøλπйж
>
> This occurs on at least two different machines, one using Centos and the
> other Debian.
Same for me. It's to do with using a u literal:
$ python2.7 -c "print('ñøλπйж')"
ñøλπйж
$ python2.7 -c "print(u'ñøλπйж')"
ñøλπйж
$ python2.7 -c "print(repr('ñøλπйж'))"
'\xc3\xb1\xc3\xb8\xce\xbb\xcf\x80\xd0\xb9\xd0\xb6'
$ python2.7 -c "print(repr(u'ñøλπйж'))"
u'\xc3\xb1\xc3\xb8\xce\xbb\xcf\x80\xd0\xb9\xd0\xb6'
$ python2.7
Python 2.7.5+ (default, Sep 19 2013, 13:49:51)
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> b='\xc3\xb1\xc3\xb8\xce\xbb\xcf\x80\xd0\xb9\xd0\xb6'
>>> print(b)
ñøλπйж
>>> s=u'\xc3\xb1\xc3\xb8\xce\xbb\xcf\x80\xd0\xb9\xd0\xb6'
>>> print(s)
ñøλπйж
>>> print(s.encode('latin-1'))
ñøλπйж
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
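The `s.encode('latin-1')` step in the session above is the usual repair for this kind of damage. A minimal Python 3 sketch of it follows, assuming the text really is UTF-8 that was decoded as Latin-1; `repair_mojibake` is a made-up name, and the function gives up rather than guessing when that assumption does not hold.

```python
def repair_mojibake(garbled, wrong_encoding="latin-1"):
    """Undo a UTF-8-read-as-Latin-1 mistake; return None if it does not apply."""
    try:
        return garbled.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


# The code points from the repr() output above, as a Python 3 str.
garbled = "\xc3\xb1\xc3\xb8\xce\xbb\xcf\x80\xd0\xb9\xd0\xb6"
fixed = repair_mojibake(garbled)

# Twelve mis-decoded characters collapse back to the original six.
print(len(garbled), len(fixed))
```

Text that was never damaged usually fails one of the two steps, so the function returns `None` for it instead of mangling it further.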
It works in the interactive prompt:
>>> s = 'ñøλπйж'
>>> print(s)
ñøλπйж
>>> s = u'ñøλπйж'
>>> print(s)
ñøλπйж
But the interactive prompt has an associated encoding:
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
If I put it into a utf-8 file with no encoding declared I get a SyntaxError:
$ cat tmp.py
s = u'ñøλπйж'
print(s)
oscar@tonis-laptop:~$ python2.7 tmp.py
File "tmp.py", line 1
SyntaxError: Non-ASCII character '\xc3' in file tmp.py on line 1, but
no encoding declared; see http://www.python.org/peps/pep-0263.html for
details
If I add the encoding declaration it works:
oscar@tonis-laptop:~$ vim tmp.py
oscar@tonis-laptop:~$ cat tmp.py
# -*- coding: utf-8 -*-
s = u'ñøλπйж'
print(s)
oscar@tonis-laptop:~$ python2.7 tmp.py
ñøλπйж
oscar@tonis-laptop:~$
So I'd say that your original example should be a SyntaxError with
Python 2.7 but instead it implicitly uses latin-1.
Oscar
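For comparison, a hedged sketch of the same file experiment under Python 3, which assumes UTF-8 when no encoding is declared. The nearest equivalent is therefore a Latin-1 encoded file: without a cookie the child interpreter should stop with a SyntaxError, and with one it runs. The helper `run_file` and the file name `tmp_demo.py` are hypothetical, and the child is whichever interpreter `sys.executable` points to.

```python
import os
import subprocess
import sys
import tempfile

# Byte 0xF1 is 'ñ' in Latin-1 but is not valid UTF-8 in this position.
BODY = b"s = '\xf1'\nprint(ascii(s))\n"


def run_file(source_bytes):
    """Write source_bytes to a temporary script and run it in a child process."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tmp_demo.py")
        with open(path, "wb") as f:
            f.write(source_bytes)
        return subprocess.run(
            [sys.executable, path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )


undeclared = run_file(BODY)
declared = run_file(b"# -*- coding: latin-1 -*-\n" + BODY)

print(undeclared.returncode, b"SyntaxError" in undeclared.stderr)
print(declared.returncode, declared.stdout.strip())
```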