
UnicodeEncodeError in compile

Started by	pyscripter@gmail.com
First post	2012-01-09 20:24 -0800
Last post	2012-01-11 03:27 -0800
Articles	17 — 5 participants

  UnicodeEncodeError in compile pyscripter@gmail.com - 2012-01-09 20:24 -0800
    Re: UnicodeEncodeError in compile Terry Reedy <tjreedy@udel.edu> - 2012-01-10 03:08 -0500
      Re: UnicodeEncodeError in compile jmfauth <wxjmfauth@gmail.com> - 2012-01-10 01:42 -0800
      Re: UnicodeEncodeError in compile 88888 Dihedral <dihedral88888@googlemail.com> - 2012-01-10 02:53 -0800
        Re: UnicodeEncodeError in compile jmfauth <wxjmfauth@gmail.com> - 2012-01-10 04:28 -0800
          Re: UnicodeEncodeError in compile jmfauth <wxjmfauth@gmail.com> - 2012-01-10 05:43 -0800
            Re: UnicodeEncodeError in compile Terry Reedy <tjreedy@udel.edu> - 2012-01-10 19:56 -0500
              Re: UnicodeEncodeError in compile jmfauth <wxjmfauth@gmail.com> - 2012-01-11 01:29 -0800
              Re: UnicodeEncodeError in compile jmfauth <wxjmfauth@gmail.com> - 2012-01-10 23:05 -0800
      Re: UnicodeEncodeError in compile 88888 Dihedral <dihedral88888@googlemail.com> - 2012-01-10 02:53 -0800
    Re: UnicodeEncodeError in compile pyscripter@gmail.com - 2012-01-10 02:04 -0800
    Re: UnicodeEncodeError in compile Terry Reedy <tjreedy@udel.edu> - 2012-01-10 22:50 -0500
      Re: UnicodeEncodeError in compile pyscripter@gmail.com - 2012-01-11 03:27 -0800
        Re: UnicodeEncodeError in compile Dave Angel <d@davea.name> - 2012-01-11 06:45 -0500
          Re: UnicodeEncodeError in compile pyscripter@gmail.com - 2012-01-11 04:14 -0800
          Re: UnicodeEncodeError in compile pyscripter@gmail.com - 2012-01-11 04:14 -0800
      Re: UnicodeEncodeError in compile pyscripter@gmail.com - 2012-01-11 03:27 -0800

#18731 — UnicodeEncodeError in compile

From	pyscripter@gmail.com
Date	2012-01-09 20:24 -0800
Subject	UnicodeEncodeError in compile
Message-ID	<9043309.329.1326169476466.JavaMail.geo-discussion-forums@yqhi24>

Using python 3.2 in Windows 7 I am getting the following:

>>> compile('pass', r'c:\temp\工具\module1.py', 'exec')
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

Can anybody explain why the compile function tries to convert the unicode filename using mbcs?  I know that sys.getfilesystemencoding() returns 'mbcs' on Windows, but I thought that this is not used when unicode file names are provided.
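
For anyone who wants to reproduce this, here is a minimal sketch. The helper name try_compile is made up for illustration. Whether the error appears depends on the Python version and platform, so the code reports what happened rather than assuming either outcome.

```python
import sys


def try_compile(filename):
    """Compile a trivial module under the given filename label.

    Returns (True, co_filename) on success, or (False, description)
    if the interpreter fails to encode the filename.
    """
    try:
        code = compile("pass", filename, "exec")
    except UnicodeEncodeError as exc:
        return False, "{}: {}".format(exc.encoding, exc.reason)
    return True, code.co_filename


ok, detail = try_compile(r"c:\temp\工具\module1.py")
print(sys.version_info[:2], sys.getfilesystemencoding(), ok, detail)
```

Note that the file does not have to exist for compile() to be called with its name.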

#18740

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-01-10 03:08 -0500
Message-ID	<mailman.4584.1326182952.27778.python-list@python.org>
In reply to	#18731

On 1/9/2012 11:24 PM, pyscripter@gmail.com wrote:
> Using python 3.2 in Windows 7 I am getting the following:
>
>>> compile('pass', r'c:\temp\工具\module1.py', 'exec')
> UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
>
> Can anybody explain why the compile statement tries to convert the unicode filename using mbcs?  I know that sys.getfilesystemencoding returns 'mbcs' in Windows, but I thought that this is not used when unicode file names are provided.

I get the same error running 3.2.2 under IDLE but not when pasting into 
Command Prompt. However, Command Prompt may be cheating by replacing the 
Chinese chars with '??' upon pasting, so that Python never gets them -- 
whereas they appear just fine in IDLE.

-- 
Terry Jan Reedy

#18743

From	jmfauth <wxjmfauth@gmail.com>
Date	2012-01-10 01:42 -0800
Message-ID	<f8a1cfd0-bba7-4d2c-b8e3-72e4238c7bf0@d10g2000vbh.googlegroups.com>
In reply to	#18740

1) If I copy/paste these CJK chars from Google Groups into two of my
interactive interpreters (no "dos/cmd console"), I have no problem.

>>> import unicodedata as ud
>>> ud.name('工')
'CJK UNIFIED IDEOGRAPH-5DE5'
>>> ud.name('具')
'CJK UNIFIED IDEOGRAPH-5177'
>>> hex(ord(('工')))
'0x5de5'
>>> hex(ord('具'))
'0x5177'
>>>

2) It seems the mbcs codec has some difficulties with
these chars.

>>> '\u5de5'.encode('mbcs')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position
0--1: invalid character
>>> '\u5de5'.encode('utf-8')
b'\xe5\xb7\xa5'
>>> '\u5de5'.encode('utf-32-be')
b'\x00\x00]\xe5'

3) As for why mbcs is used in file I/O at all: that is a question for the core devs.

My conclusion:
The bottleneck is on the mbcs side.

jmf

#18745

From	88888 Dihedral <dihedral88888@googlemail.com>
Date	2012-01-10 02:53 -0800
Message-ID	<mailman.4585.1326192839.27778.python-list@python.org>
In reply to	#18740

On Tuesday, January 10, 2012 at 4:08:40 PM UTC+8, Terry Reedy wrote:
> On 1/9/2012 11:24 PM, pyscr...@gmail.com wrote:
> > Using python 3.2 in Windows 7 I am getting the following:
> >
> >>> compile('pass', r'c:\temp\工具\module1.py', 'exec')
> > UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
> >
> > Can anybody explain why the compile statement tries to convert the unicode filename using mbcs?  I know that sys.getfilesystemencoding returns 'mbcs' in Windows, but I thought that this is not used when unicode file names are provided.
> 
> I get the same error running 3.2.2 under IDLE but not when pasting into 
> Command Prompt. However, Command Prompt may be cheating by replacing the 
> Chinese chars with '??' upon pasting, so that Python never gets them -- 
> whereas they appear just fine in IDLE.
> 
> -- 
> Terry Jan Reedy

Thank you for the trick. 
Use some wildcard pattern to get the name.py compiled to pwc in some 
directory with utf-8 encoded chars.

#18752

From	jmfauth <wxjmfauth@gmail.com>
Date	2012-01-10 04:28 -0800
Message-ID	<e8448df4-76f6-4444-a785-53a1103d3f39@a11g2000vbz.googlegroups.com>
In reply to	#18745

On 10 jan, 11:53, 88888 Dihedral <dihedral88...@googlemail.com> wrote:
> On Tuesday, January 10, 2012 at 4:08:40 PM UTC+8, Terry Reedy wrote:
>
>
> > I get the same error running 3.2.2 under IDLE but not when pasting into
> > Command Prompt. However, Command Prompt may be cheating by replacing the
> > Chinese chars with '??' upon pasting, so that Python never gets them --
> > whereas they appear just fine in IDLE.
>
> > --


Tested with *my* Windows GUI interactive interpreters.

It seems to me there is a problem with the mbcs codec.

>>> hex(ord('工'))
'0x5de5'
>>> '\u5de5'
'工'
>>> '\u5de5'.encode('mbcs')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position
0--1: invalid character
>>> '\u5de5'.encode('utf-8')
b'\xe5\xb7\xa5'
>>> '\u5de5'.encode('utf-32-be')
b'\x00\x00]\xe5'
>>> sys.version
'3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)]'
>>> '\u5de5'.encode('mbcs', 'replace')
b'?'

----------

>>> u'\u5de5'.encode('mbcs', 'replace')
'?'
>>> repr(u'\u5de5'.encode('utf-8'))
"'\\xe5\\xb7\\xa5'"
>>> repr(u'\u5de5'.encode('utf-32-be'))
"'\\x00\\x00]\\xe5'"
>>> sys.version
'2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)]'


jmf

#18760

From	jmfauth <wxjmfauth@gmail.com>
Date	2012-01-10 05:43 -0800
Message-ID	<3c9fd9e7-6a0e-40cc-a048-1a82e477c013@p4g2000vbt.googlegroups.com>
In reply to	#18752

On 10 jan, 13:28, jmfauth <wxjmfa...@gmail.com> wrote:

Addendum, Python console ("dos box")

D:\>c:\python32\python.exe
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u5de5'.encode('utf-8')
b'\xe5\xb7\xa5'
>>> '\u5de5'.encode('mbcs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
>>> ^Z


D:\>c:\python27\python.exe
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u5de5'.encode('utf-8')
'\xe5\xb7\xa5'
>>> u'\u5de5'.encode('mbcs')
'?'
>>> ^Z


D:\>

jmf

#18790

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-01-10 19:56 -0500
Message-ID	<mailman.4618.1326243425.27778.python-list@python.org>
In reply to	#18760

On 1/10/2012 8:43 AM, jmfauth wrote:
> D:\>c:\python32\python.exe
> Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> '\u5de5'.encode('utf-8')
> b'\xe5\xb7\xa5'
> >>> '\u5de5'.encode('mbcs')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

> D:\>c:\python27\python.exe
> Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> u'\u5de5'.encode('utf-8')
> '\xe5\xb7\xa5'
> >>> u'\u5de5'.encode('mbcs')
> '?'

mbcs encodes according to the current codepage. Only the Chinese 
codepage(s) can encode the Chinese char. So the Unicode error is correct, 
and 2.7 has a bug in that it is doing "errors='replace'" when it 
is supposedly doing "errors='strict'". The Py3 fix was done in
http://bugs.python.org/issue850997
2.7 was intentionally left alone because of backward-compatibility 
considerations. (None of this addresses the OP's question.)

-- 
Terry Jan Reedy
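
The strict versus replace difference can be seen on any platform by naming a code page explicitly instead of relying on mbcs, which exists only on Windows. This is a rough sketch, and it assumes that cp1252 stands in for a Western European ANSI code page and cp936 for a Simplified Chinese one. It is not what mbcs does internally.

```python
ch = "\u5de5"  # the first character of the directory name in the original post

# cp1252 has no mapping for this character, so the default
# errors="strict" mode raises instead of producing bytes.
try:
    ch.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# errors="replace" silently substitutes a question mark (lossy).
replaced = ch.encode("cp1252", "replace")

# A Chinese code page can represent the character and round-trips it.
chinese = ch.encode("cp936")

print("strict cp1252 succeeded:", strict_ok)
print("replace cp1252:", replaced)
print("cp936 round-trips:", chinese.decode("cp936") == ch)
```

The lossy b'?' result is the same thing the 2.7 sessions in this thread show for mbcs.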

#18813

From	jmfauth <wxjmfauth@gmail.com>
Date	2012-01-11 01:29 -0800
Message-ID	<ad5c938c-8c88-4b81-90c3-5f745b205537@d10g2000vbh.googlegroups.com>
In reply to	#18790

On 11 jan, 01:56, Terry Reedy <tjre...@udel.edu> wrote:
> On 1/10/2012 8:43 AM, jmfauth wrote:
>
>
>
> > D:\>c:\python32\python.exe
> > Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
> > Type "help", "copyright", "credits" or "license" for more information.
> > >>> '\u5de5'.encode('utf-8')
> > b'\xe5\xb7\xa5'
> > >>> '\u5de5'.encode('mbcs')
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> > UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
> > D:\>c:\python27\python.exe
> > Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
> > Type "help", "copyright", "credits" or "license" for more information.
> > >>> u'\u5de5'.encode('utf-8')
> > '\xe5\xb7\xa5'
> > >>> u'\u5de5'.encode('mbcs')
> > '?'
>
> mbcs encodes according to the current codepage. Only the chinese
> codepage(s) can encode the chinese char. So the unicode error is correct
> and 2.7 has a bug in that it is doing "errors='replace'" when it
> supposedly is doing "errors='strict'". The Py3 fix was done in http://bugs.python.org/issue850997
> 2.7 was intentionally left alone because of back-compatibility
> considerations. (None of this addresses the OP's question.)
>
> --

OK. I was not aware of this.
PS: My previous post seems to have gotten lost.

#18814

From	jmfauth <wxjmfauth@gmail.com>
Date	2012-01-10 23:05 -0800
Message-ID	<362fecda-1d4a-42d1-8139-4a3b340e44fb@h13g2000vbn.googlegroups.com>
In reply to	#18790

On 11 jan, 01:56, Terry Reedy <tjre...@udel.edu> wrote:
> On 1/10/2012 8:43 AM, jmfauth wrote:
>
> ...
>
> mbcs encodes according to the current codepage. Only the chinese
> codepage(s) can encode the chinese char. So the unicode error is correct
> and 2.7 has a bug in that it is doing "errors='replace'" when it
> supposedly is doing "errors='strict'". The Py3 fix was done in http://bugs.python.org/issue850997
> 2.7 was intentionally left alone because of back-compatibility
> considerations. (None of this addresses the OP's question.)
>
> --

win7, cp1252

Ok. I was not aware of this.

>>> '\N{CYRILLIC SMALL LETTER A}'.encode('mbcs')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position
0--1: invalid character
>>> '\N{GREEK SMALL LETTER ALPHA}'.encode('mbcs')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position
0--1: invalid character

jmf

#18744

From	pyscripter@gmail.com
Date	2012-01-10 02:04 -0800
Message-ID	<6733632.476.1326189850532.JavaMail.geo-discussion-forums@yqbl25>
In reply to	#18731

See a more complete version of the question at http://stackoverflow.com/questions/8798591/unicodeencodeerror-when-using-the-compile-function

#18801

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-01-10 22:50 -0500
Message-ID	<mailman.4625.1326253880.27778.python-list@python.org>
In reply to	#18731

On 1/10/2012 3:08 AM, Terry Reedy wrote:
> On 1/9/2012 11:24 PM, pyscripter@gmail.com wrote:
>> Using python 3.2 in Windows 7 I am getting the following:
>>
>>>> compile('pass', r'c:\temp\工具\module1.py', 'exec')

Is this a filename that could be an actual, valid filename on your system?

>> UnicodeEncodeError: 'mbcs' codec can't encode characters in position
>> 0--1: invalid character
>>
>> Can anybody explain why the compile statement tries to convert the
>> unicode filename using mbcs?

Good question. I believe this holdover from 2.x should be deleted.
I argued that in http://bugs.python.org/issue10114
(which was about a different problem) and now, directly, in
http://bugs.python.org/issue13758

If you (or anyone) can make a better argument for the requested change, 
or for also changing compile on *nix, than I did, please do so.

-- 
Terry Jan Reedy
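
To illustrate why the filename arguably need not be encoded at all: compile() records it as a label on the code object, and that label later shows up in tracebacks. This is a minimal sketch, assuming an interpreter where the compile() call itself succeeds. The file name is hypothetical and nothing is read from disk.

```python
import traceback

source = "1 / 0\n"
label = "工具/module1.py"  # hypothetical name; the file does not need to exist

# The second argument is stored on the code object as co_filename.
code = compile(source, label, "exec")

try:
    exec(code, {})
except ZeroDivisionError:
    report = traceback.format_exc()

print(code.co_filename)
print(report)
```

The traceback text names the label as the file where the error occurred, even though no such file was ever opened.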

#18818

From	pyscripter@gmail.com
Date	2012-01-11 03:27 -0800
Message-ID	<9664479.1553.1326281266242.JavaMail.geo-discussion-forums@yqlp13>
In reply to	#18801


On Wednesday, January 11, 2012 5:50:51 AM UTC+2, Terry Reedy wrote:
> On 1/10/2012 3:08 AM, Terry Reedy wrote:
> Is this a filename that could be an actual, valid filename on your system?

Yes it is. open works on that file.

> Good question. I believe this holdover from 2.x should be deleted.
> I argued that in http://bugs.python.org/issue10114
> (which was about a different problem) and now, directly, in
> http://bugs.python.org/issue13758
> 
Maybe the example from this question can be added to issue 13758 as proof that compile fails on valid file names.

But I think the real issue is why on modern Windows systems the file system encoding is mbcs.  Shouldn't it be utf-16?
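
A small sketch for checking what a given interpreter actually uses as its filesystem encoding, and whether a name survives it. The helper name fs_roundtrips is made up for illustration. On Windows, Python 3.6 and later are documented (PEP 529) to report 'utf-8' here rather than 'mbcs', so results will differ from the 3.2 behaviour discussed in this thread.

```python
import os
import sys


def fs_roundtrips(name):
    """Return True if name survives os.fsencode() followed by os.fsdecode()."""
    try:
        return os.fsdecode(os.fsencode(name)) == name
    except UnicodeEncodeError:
        # The filesystem encoding cannot represent some character in name.
        return False


print("filesystem encoding:", sys.getfilesystemencoding())
print("ASCII name round-trips:", fs_roundtrips("module1.py"))
print("CJK name round-trips:", fs_roundtrips("工具"))
```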

#18822

From	Dave Angel <d@davea.name>
Date	2012-01-11 06:45 -0500
Message-ID	<mailman.4641.1326282354.27778.python-list@python.org>
In reply to	#18818

On 01/11/2012 06:27 AM, pyscripter@gmail.com wrote:
> <SNIP>
> Maybe the example from this question can be added to issue 13758 as proof that compile fails on valid file names.
>
> But I think the real issue is why on modern Windows systems the file system encoding is mbcs.  Shouldn't it be utf-16?
Depends what you mean by modern. The following isn't true for Windows 
95, 98, or ME.  But they weren't modern when they were first released.

NT systems (which include Win2k, XP, Vista, and Win7) have used Unicode 
for the file system for at least the last 15 years.  They also 
supply an "ASCII" interface.  If Python is using the latter, then it 
won't be able to access all possible files.

Now, it may be the fault of the C library that CPython uses.  I haven't 
looked at any of the code for CPython.

This is all from memory, as I haven't actively used Windows for some 
time now.  But I think the DLL name is kernel32.dll, and the entry 
points have names like  CreateFileW() for the unicode open, and 
CreateFileA() for the "ASCII" open.

-- 

DaveA
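
The difference between the two kinds of entry point can be roughly simulated in pure Python, without calling any Windows API. This is only a sketch: it assumes cp1252 as the ANSI code page and uses errors="replace" to imitate a lossy conversion, which may not match exactly what Windows does.

```python
name = "c:\\temp\\工具\\module1.py"

# A wide ("W") entry point receives UTF-16 code units, which can
# represent every character in the name.
wide = name.encode("utf-16-le")

# A byte ("A") entry point receives text in the ANSI code page
# (cp1252 assumed here); characters outside it are lost.
narrow = name.encode("cp1252", "replace")

print("wide round-trips:", wide.decode("utf-16-le") == name)
print("narrow decodes to:", narrow.decode("cp1252"))
```

Under this assumption the narrow form no longer identifies the original directory, which is the sense in which a byte interface cannot reach all possible files.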

#18823

From	pyscripter@gmail.com
Date	2012-01-11 04:14 -0800
Message-ID	<11173914.1374.1326284096109.JavaMail.geo-discussion-forums@yqiq10>
In reply to	#18822

Indeed, on Windows NT the file system encoding should not be mbcs, since it creates UnicodeEncodeErrors on perfectly valid file names.
