Groups > comp.lang.python > #18731 > unrolled thread
| Started by | pyscripter@gmail.com |
|---|---|
| First post | 2012-01-09 20:24 -0800 |
| Last post | 2012-01-11 03:27 -0800 |
| Articles | 17 — 5 participants |
UnicodeEncodeError in compile pyscripter@gmail.com - 2012-01-09 20:24 -0800
Re: UnicodeEncodeError in compile Terry Reedy <tjreedy@udel.edu> - 2012-01-10 03:08 -0500
Re: UnicodeEncodeError in compile jmfauth <wxjmfauth@gmail.com> - 2012-01-10 01:42 -0800
Re: UnicodeEncodeError in compile 88888 Dihedral <dihedral88888@googlemail.com> - 2012-01-10 02:53 -0800
Re: UnicodeEncodeError in compile jmfauth <wxjmfauth@gmail.com> - 2012-01-10 04:28 -0800
Re: UnicodeEncodeError in compile jmfauth <wxjmfauth@gmail.com> - 2012-01-10 05:43 -0800
Re: UnicodeEncodeError in compile Terry Reedy <tjreedy@udel.edu> - 2012-01-10 19:56 -0500
Re: UnicodeEncodeError in compile jmfauth <wxjmfauth@gmail.com> - 2012-01-11 01:29 -0800
Re: UnicodeEncodeError in compile jmfauth <wxjmfauth@gmail.com> - 2012-01-10 23:05 -0800
Re: UnicodeEncodeError in compile 88888 Dihedral <dihedral88888@googlemail.com> - 2012-01-10 02:53 -0800
Re: UnicodeEncodeError in compile pyscripter@gmail.com - 2012-01-10 02:04 -0800
Re: UnicodeEncodeError in compile Terry Reedy <tjreedy@udel.edu> - 2012-01-10 22:50 -0500
Re: UnicodeEncodeError in compile pyscripter@gmail.com - 2012-01-11 03:27 -0800
Re: UnicodeEncodeError in compile Dave Angel <d@davea.name> - 2012-01-11 06:45 -0500
Re: UnicodeEncodeError in compile pyscripter@gmail.com - 2012-01-11 04:14 -0800
Re: UnicodeEncodeError in compile pyscripter@gmail.com - 2012-01-11 04:14 -0800
Re: UnicodeEncodeError in compile pyscripter@gmail.com - 2012-01-11 03:27 -0800
| From | pyscripter@gmail.com |
|---|---|
| Date | 2012-01-09 20:24 -0800 |
| Subject | UnicodeEncodeError in compile |
| Message-ID | <9043309.329.1326169476466.JavaMail.geo-discussion-forums@yqhi24> |
Using python 3.2 in Windows 7 I am getting the following:
>>> compile('pass', r'c:\temp\工具\module1.py', 'exec')
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
Can anybody explain why the compile statement tries to convert the unicode filename using mbcs? I know that sys.getfilesystemencoding returns 'mbcs' in Windows, but I thought that this is not used when unicode file names are provided.
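A quick way to check whether a path can be represented in the filesystem encoding is sketched below. `fs_encodable` is a hypothetical helper, not a standard-library function; it only wraps `str.encode` with the encoding reported by `sys.getfilesystemencoding()`. The `cp1252` code page is used as a portable stand-in for what `mbcs` might resolve to, which is an assumption about the machine in question.

```python
import sys

def fs_encodable(path, encoding=None):
    """Return True if `path` can be encoded with `encoding` in strict mode."""
    # Fall back to the interpreter's filesystem encoding
    # (reported as 'mbcs' on Windows in the question above).
    encoding = encoding or sys.getfilesystemencoding()
    try:
        path.encode(encoding)  # errors='strict' is the default
    except UnicodeEncodeError:
        return False
    return True

path = r'c:\temp\工具\module1.py'
# 'cp1252' is an assumed stand-in for an 'mbcs' code page without CJK support.
print(fs_encodable(path, 'cp1252'))
print(fs_encodable(path, 'utf-8'))
```

The first call should report that the path is not encodable and the second that it is, which mirrors the failure seen with `compile`.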
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-01-10 03:08 -0500 |
| Message-ID | <mailman.4584.1326182952.27778.python-list@python.org> |
| In reply to | #18731 |
On 1/9/2012 11:24 PM, pyscripter@gmail.com wrote:
> Using python 3.2 in Windows 7 I am getting the following:
>
>>> compile('pass', r'c:\temp\工具\module1.py', 'exec')
> UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
>
> Can anybody explain why the compile statement tries to convert the unicode filename using mbcs? I know that sys.getfilesystemencoding returns 'mbcs' in Windows, but I thought that this is not used when unicode file names are provided.
I get the same error running 3.2.2 under IDLE but not when pasting into
Command Prompt. However, Command Prompt may be cheating by replacing the
Chinese chars with '??' upon pasting, so that Python never gets them --
whereas they appear just fine in IDLE.
--
Terry Jan Reedy
| From | jmfauth <wxjmfauth@gmail.com> |
|---|---|
| Date | 2012-01-10 01:42 -0800 |
| Message-ID | <f8a1cfd0-bba7-4d2c-b8e3-72e4238c7bf0@d10g2000vbh.googlegroups.com> |
| In reply to | #18740 |
1) If I copy/paste these CJK chars from Google Groups into two of my
interactive interpreters (no "dos/cmd console"), I have no problem.
>>> import unicodedata as ud
>>> ud.name('工')
'CJK UNIFIED IDEOGRAPH-5DE5'
>>> ud.name('具')
'CJK UNIFIED IDEOGRAPH-5177'
>>> hex(ord(('工')))
'0x5de5'
>>> hex(ord('具'))
'0x5177'
>>>
2) It seems the mbcs codec has some difficulties with
these chars.
>>> '\u5de5'.encode('mbcs')
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position
0--1: invalid character
>>> '\u5de5'.encode('utf-8')
b'\xe5\xb7\xa5'
>>> '\u5de5'.encode('utf-32-be')
b'\x00\x00]\xe5'
3) As for the use of mbcs in file I/O: that is a question for the core devs.
My conclusion:
The bottleneck is on the mbcs side.
jmf
| From | 88888 Dihedral <dihedral88888@googlemail.com> |
|---|---|
| Date | 2012-01-10 02:53 -0800 |
| Message-ID | <mailman.4585.1326192839.27778.python-list@python.org> |
| In reply to | #18740 |
On Tuesday, January 10, 2012 at 4:08:40 PM UTC+8, Terry Reedy wrote:
> On 1/9/2012 11:24 PM, pyscr...@gmail.com wrote:
> > Using python 3.2 in Windows 7 I am getting the following:
> >
> >>> compile('pass', r'c:\temp\工具\module1.py', 'exec')
> > UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
> >
> > Can anybody explain why the compile statement tries to convert the unicode filename using mbcs? I know that sys.getfilesystemencoding returns 'mbcs' in Windows, but I thought that this is not used when unicode file names are provided.
>
> I get the same error running 3.2.2 under IDLE but not when pasting into
> Command Prompt. However, Command Prompt may be cheating by replacing the
> Chinese chars with '??' upon pasting, so that Python never gets them --
> whereas they appear just fine in IDLE.
>
> --
> Terry Jan Reedy
Thank you for the trick.
Use some wildcard pattern to get the name.py compiled to .pyc in some
directory with utf-8 encoded chars.
| From | jmfauth <wxjmfauth@gmail.com> |
|---|---|
| Date | 2012-01-10 04:28 -0800 |
| Message-ID | <e8448df4-76f6-4444-a785-53a1103d3f39@a11g2000vbz.googlegroups.com> |
| In reply to | #18745 |
On 10 jan, 11:53, 88888 Dihedral <dihedral88...@googlemail.com> wrote:
> On Tuesday, January 10, 2012 at 4:08:40 PM UTC+8, Terry Reedy wrote:
>
>
> > I get the same error running 3.2.2 under IDLE but not when pasting into
> > Command Prompt. However, Command Prompt may be cheating by replacing the
> > Chinese chars with '??' upon pasting, so that Python never gets them --
> > whereas they appear just fine in IDLE.
>
> > --
Tested with *my* Windows GUI interactive interpreters.
It seems to me there is a problem with the mbcs codec.
>>> hex(ord('工'))
'0x5de5'
>>> '\u5de5'
'工'
>>> '\u5de5'.encode('mbcs')
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position
0--1: invalid character
>>> '\u5de5'.encode('utf-8')
b'\xe5\xb7\xa5'
>>> '\u5de5'.encode('utf-32-be')
b'\x00\x00]\xe5'
>>> sys.version
'3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)]'
>>> '\u5de5'.encode('mbcs', 'replace')
b'?'
----------
>>> u'\u5de5'.encode('mbcs', 'replace')
'?'
>>> repr(u'\u5de5'.encode('utf-8'))
"'\\xe5\\xb7\\xa5'"
>>> repr(u'\u5de5'.encode('utf-32-be'))
"'\\x00\\x00]\\xe5'"
>>> sys.version
'2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)]'
jmf
| From | jmfauth <wxjmfauth@gmail.com> |
|---|---|
| Date | 2012-01-10 05:43 -0800 |
| Message-ID | <3c9fd9e7-6a0e-40cc-a048-1a82e477c013@p4g2000vbt.googlegroups.com> |
| In reply to | #18752 |
On 10 jan, 13:28, jmfauth <wxjmfa...@gmail.com> wrote:
Addendum, Python console ("dos box")
D:\>c:\python32\python.exe
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit
(Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u5de5'.encode('utf-8')
b'\xe5\xb7\xa5'
>>> '\u5de5'.encode('mbcs')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position
0--1: inval
id character
>>> ^Z
D:\>c:\python27\python.exe
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit
(Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u5de5'.encode('utf-8')
'\xe5\xb7\xa5'
>>> u'\u5de5'.encode('mbcs')
'?'
>>> ^Z
D:\>
jmf
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-01-10 19:56 -0500 |
| Message-ID | <mailman.4618.1326243425.27778.python-list@python.org> |
| In reply to | #18760 |
On 1/10/2012 8:43 AM, jmfauth wrote:
> D:\>c:\python32\python.exe
> Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit
> (Intel)] on win
> 32
> Type "help", "copyright", "credits" or "license" for more information.
>>>> '\u5de5'.encode('utf-8')
> b'\xe5\xb7\xa5'
>>>> '\u5de5'.encode('mbcs')
> Traceback (most recent call last):
> File "<stdin>", line 1, in<module>
> UnicodeEncodeError: 'mbcs' codec can't encode characters in position
> 0--1: inval
> id character
> D:\>c:\python27\python.exe
> Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit
> (Intel)] on win
> 32
> Type "help", "copyright", "credits" or "license" for more information.
>>>> u'\u5de5'.encode('utf-8')
> '\xe5\xb7\xa5'
>>>> u'\u5de5'.encode('mbcs')
> '?'
mbcs encodes according to the current code page. Only the Chinese
code page(s) can encode the Chinese characters. So the Unicode error is
correct, and 2.7 has a bug in that it is doing "errors='replace'" when it
is supposedly doing "errors='strict'". The Py3 fix was done in
http://bugs.python.org/issue850997
2.7 was intentionally left alone because of backward-compatibility
considerations. (None of this addresses the OP's question.)
--
Terry Jan Reedy
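The difference between the two error handlers can be sketched portably as follows. This is only an illustration: `encode_with` is a hypothetical helper, and `cp1252` and `gbk` are assumed stand-ins for a Western and a Chinese Windows code page, since the `mbcs` codec itself exists only on Windows.

```python
def encode_with(text, encoding, errors):
    """Encode `text`; return the bytes, or the exception's class name on failure."""
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError as exc:
        return type(exc).__name__

char = '\u5de5'  # CJK UNIFIED IDEOGRAPH-5DE5

# Strict mode refuses characters the code page cannot represent.
strict_result = encode_with(char, 'cp1252', 'strict')
# Replace mode silently substitutes '?', as the 2.7 transcript shows.
replace_result = encode_with(char, 'cp1252', 'replace')
# A Chinese code page can encode the character without loss.
gbk_result = encode_with(char, 'gbk', 'strict')

print(strict_result, replace_result, gbk_result)
```

Strict mode surfaces the problem immediately, while replace mode yields a byte string that no longer identifies the original character.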
| From | jmfauth <wxjmfauth@gmail.com> |
|---|---|
| Date | 2012-01-11 01:29 -0800 |
| Message-ID | <ad5c938c-8c88-4b81-90c3-5f745b205537@d10g2000vbh.googlegroups.com> |
| In reply to | #18790 |
On 11 jan, 01:56, Terry Reedy <tjre...@udel.edu> wrote:
> On 1/10/2012 8:43 AM, jmfauth wrote:
>
>
>
> > D:\>c:\python32\python.exe
> > Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit
> > (Intel)] on win
> > 32
> > Type "help", "copyright", "credits" or "license" for more information.
> >>>> '\u5de5'.encode('utf-8')
> > b'\xe5\xb7\xa5'
> >>>> '\u5de5'.encode('mbcs')
> > Traceback (most recent call last):
> > File "<stdin>", line 1, in<module>
> > UnicodeEncodeError: 'mbcs' codec can't encode characters in position
> > 0--1: inval
> > id character
> > D:\>c:\python27\python.exe
> > Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit
> > (Intel)] on win
> > 32
> > Type "help", "copyright", "credits" or "license" for more information.
> >>>> u'\u5de5'.encode('utf-8')
> > '\xe5\xb7\xa5'
> >>>> u'\u5de5'.encode('mbcs')
> > '?'
>
> mbcs encodes according to the current codepage. Only the chinese
> codepage(s) can encode the chinese char. So the unicode error is correct
> and 2.7 has a bug in that it is doing "errors='replace'" when it
> supposedly is doing "errors='strict'". The Py3 fix was done in http://bugs.python.org/issue850997
> 2.7 was intentionally left alone because of back-compatibility
> considerations. (None of this addresses the OP's question.)
>
> --
Ok. I was not aware of this.
PS: My previous post seems to have been lost.
| From | jmfauth <wxjmfauth@gmail.com> |
|---|---|
| Date | 2012-01-10 23:05 -0800 |
| Message-ID | <362fecda-1d4a-42d1-8139-4a3b340e44fb@h13g2000vbn.googlegroups.com> |
| In reply to | #18790 |
On 11 jan, 01:56, Terry Reedy <tjre...@udel.edu> wrote:
> On 1/10/2012 8:43 AM, jmfauth wrote:
>
> ...
>
> mbcs encodes according to the current codepage. Only the chinese
> codepage(s) can encode the chinese char. So the unicode error is correct
> and 2.7 has a bug in that it is doing "errors='replace'" when it
> supposedly is doing "errors='strict'". The Py3 fix was done in http://bugs.python.org/issue850997
> 2.7 was intentionally left alone because of back-compatibility
> considerations. (None of this addresses the OP's question.)
>
> --
win7, cp1252
Ok. I was not aware of this.
>>> '\N{CYRILLIC SMALL LETTER A}'.encode('mbcs')
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position
0--1: invalid character
>>> '\N{GREEK SMALL LETTER ALPHA}'.encode('mbcs')
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position
0--1: invalid character
jmf
| From | pyscripter@gmail.com |
|---|---|
| Date | 2012-01-10 02:04 -0800 |
| Message-ID | <6733632.476.1326189850532.JavaMail.geo-discussion-forums@yqbl25> |
| In reply to | #18731 |
See a more complete version of the question at http://stackoverflow.com/questions/8798591/unicodeencodeerror-when-using-the-compile-function
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-01-10 22:50 -0500 |
| Message-ID | <mailman.4625.1326253880.27778.python-list@python.org> |
| In reply to | #18731 |
On 1/10/2012 3:08 AM, Terry Reedy wrote:
> On 1/9/2012 11:24 PM, pyscripter@gmail.com wrote:
>> Using python 3.2 in Windows 7 I am getting the following:
>>
>>>> compile('pass', r'c:\temp\工具\module1.py', 'exec')
Is this a filename that could be an actual, valid filename on your system?
>> UnicodeEncodeError: 'mbcs' codec can't encode characters in position
>> 0--1: invalid character
>>
>> Can anybody explain why the compile statement tries to convert the
>> unicode filename using mbcs?
Good question. I believe this holdover from 2.x should be deleted.
I argued that in http://bugs.python.org/issue10114
(which was about a different problem) and now, directly, in
http://bugs.python.org/issue13758
If you (or anyone) can make a better argument for the requested change,
or for also changing compile on *nix, than I did, please do so.
--
Terry Jan Reedy
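One way to see why the encoding step looks unnecessary: `compile` does not open the file it is given a name for; it records the name on the resulting code objects. A minimal sketch, using a made-up file name that does not exist on disk (it says nothing about what CPython 3.2 did with the name internally):

```python
# The file name below is hypothetical and is never opened.
source = 'def f():\n    return 1 / 0\n'
code = compile(source, 'not_a_real_file.py', 'exec')

namespace = {}
exec(code, namespace)

# The name is stored as metadata on the module-level and nested code objects,
# where tracebacks and debuggers later read it.
print(code.co_filename)
print(namespace['f'].__code__.co_filename)
```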
| From | pyscripter@gmail.com |
|---|---|
| Date | 2012-01-11 03:27 -0800 |
| Message-ID | <9664479.1553.1326281266242.JavaMail.geo-discussion-forums@yqlp13> |
| In reply to | #18801 |
On Wednesday, January 11, 2012 5:50:51 AM UTC+2, Terry Reedy wrote:
> On 1/10/2012 3:08 AM, Terry Reedy wrote:
> Is this a filename that could be an actual, valid filename on your system?
Yes it is. open works on that file.
> Good question. I believe this holdover from 2.x should be deleted.
> I argued that in http://bugs.python.org/issue10114
> (which was about a different problem) and now, directly, in
> http://bugs.python.org/issue13758
>
Maybe the example in this question can be added to issue 13758 as proof that compile fails on valid file names.
But I think the real issue is why the file system encoding is mbcs on modern Windows systems. Shouldn't it be utf-16?
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-01-11 06:45 -0500 |
| Message-ID | <mailman.4641.1326282354.27778.python-list@python.org> |
| In reply to | #18818 |
On 01/11/2012 06:27 AM, pyscripter@gmail.com wrote:
> <SNIP>
> Maybe the example of this question can be added to the issue 13758 as a proof that compile fails on valid file names.
>
> But I think the real issue is why on modern Windows systems the file system encoding is mbcs. Shouldn't it be utf-16?
Depends what you mean by modern. The following isn't true for Windows 95, 98, or ME. But they weren't modern when they were first released.
NT systems (which include Win2k, XP, Vista, and Win7) have used Unicode for the file system for at least the last 15 years. They also supply an "ASCII" interface. If Python is using the latter, then it won't be able to access all possible files. Now, it may be the fault of the C library that CPython uses. I haven't looked at any of the code for CPython.
This is all from memory, as I haven't actively used Windows for some time now. But I think the DLL name is kernel32.dll, and the entry points have names like CreateFileW() for the Unicode open and CreateFileA() for the "ASCII" open.
--
DaveA
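The contrast between the wide-character and "ASCII" interfaces can be illustrated without calling any Windows API. The sketch below only compares the byte forms a name would need: UTF-16 for a wide-character call, and a single code page for the narrow one. `cp1252` is an assumed example code page, and no actual `CreateFileW` or `CreateFileA` call is made.

```python
name = '工具'

# UTF-16 (little-endian) can represent the name, so a wide-character
# interface has no trouble with it.
wide = name.encode('utf-16-le')
print(wide)

# A narrow interface is limited to the active code page; with cp1252
# (assumed here) the name has no byte representation at all.
try:
    narrow = name.encode('cp1252')
except UnicodeEncodeError:
    narrow = None
print(narrow)
```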
| From | pyscripter@gmail.com |
|---|---|
| Date | 2012-01-11 04:14 -0800 |
| Message-ID | <11173914.1374.1326284096109.JavaMail.geo-discussion-forums@yqiq10> |
| In reply to | #18822 |
Indeed, on Windows NT the file system encoding should not be mbcs, since it creates UnicodeEncodeErrors on perfectly valid file names.