Groups > comp.lang.python > #19074 > unrolled thread
| Started by | Olive <diolu@bigfoot.com> |
|---|---|
| First post | 2012-01-18 08:16 +0100 |
| Last post | 2012-01-19 02:40 -0800 |
| Articles | 6 — 4 participants |
sys.argv as a list of bytes Olive <diolu@bigfoot.com> - 2012-01-18 08:16 +0100
Re: sys.argv as a list of bytes Peter Otten <__peter__@web.de> - 2012-01-18 09:05 +0100
Re: sys.argv as a list of bytes Olive <diolu@bigfoot.com> - 2012-01-18 11:16 +0100
Re: sys.argv as a list of bytes Peter Otten <__peter__@web.de> - 2012-01-18 15:01 +0100
Re: sys.argv as a list of bytes Nobody <nobody@nowhere.com> - 2012-01-19 05:05 +0000
Re: sys.argv as a list of bytes jmfauth <wxjmfauth@gmail.com> - 2012-01-19 02:40 -0800
| From | Olive <diolu@bigfoot.com> |
|---|---|
| Date | 2012-01-18 08:16 +0100 |
| Subject | sys.argv as a list of bytes |
| Message-ID | <20120118081612.13745187@bigfoot.com> |
In Unix, the operating system passes arguments as a list of C strings. But C strings correspond to the bytes notion of Python 3. Is it possible to have sys.argv as a list of bytes? What happens if I pass to a program an argument containing funny "characters", for example (with a bash shell)?
python -i ./test.py $'\x01'$'\x05'$'\xFF'
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-01-18 09:05 +0100 |
| Message-ID | <mailman.4823.1326873960.27778.python-list@python.org> |
| In reply to | #19074 |
Olive wrote:
> In Unix, the operating system passes arguments as a list of C strings. But
> C strings correspond to the bytes notion of Python 3. Is it
> possible to have sys.argv as a list of bytes? What happens if I pass
> to a program an argument containing funny "characters", for example
> (with a bash shell)?
>
> python -i ./test.py $'\x01'$'\x05'$'\xFF'
Python has a special error handler, "surrogateescape", to deal with bytes that are not
valid UTF-8. The repr of such a string shows the escaped surrogate, but if you try to print the string itself you get an error:
$ python3 -c'import sys; print(repr(sys.argv[1]))' $'\x01'$'\x05'$'\xFF'
'\x01\x05\udcff'
$ python3 -c'import sys; print(sys.argv[1])' $'\x01'$'\x05'$'\xFF'
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 2: surrogates not allowed
It is still possible to get the original bytes:
$ python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))' $'\x01'$'\x05'$'\xFF'
b'\x01\x05\xff'
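The round trip shown above can also be sketched inside a script. This is only an illustration, assuming the argument was decoded as UTF-8 with "surrogateescape"; the helper names to_text, to_bytes and printable are made up for this example and are not part of Python.

```python
def to_text(raw, encoding="utf-8"):
    """Decode bytes so that undecodable bytes become lone surrogates."""
    return raw.decode(encoding, "surrogateescape")


def to_bytes(text, encoding="utf-8"):
    """Reverse to_text(): lone surrogates turn back into the original bytes."""
    return text.encode(encoding, "surrogateescape")


def printable(text):
    """Return a copy that is safe to print: surrogates become backslash escapes."""
    return text.encode("utf-8", "backslashreplace").decode("utf-8")


raw = b"\x01\x05\xff"
arg = to_text(raw)            # the kind of str that ends up in sys.argv
restored = to_bytes(arg)      # the original bytes again
print(repr(arg))
print(repr(printable(arg)))
print(restored)
```

The printable() helper avoids the UnicodeEncodeError from the traceback above by escaping the surrogate instead of trying to encode it.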
| From | Olive <diolu@bigfoot.com> |
|---|---|
| Date | 2012-01-18 11:16 +0100 |
| Message-ID | <20120118111627.14d490ac@bigfoot.com> |
| In reply to | #19075 |
On Wed, 18 Jan 2012 09:05:42 +0100
Peter Otten <__peter__@web.de> wrote:
> Olive wrote:
>
> > In Unix, the operating system passes arguments as a list of C strings.
> > But C strings correspond to the bytes notion of Python 3. Is
> > it possible to have sys.argv as a list of bytes? What happens if I
> > pass to a program an argument containing funny "characters", for
> > example (with a bash shell)?
> >
> > python -i ./test.py $'\x01'$'\x05'$'\xFF'
>
> Python has a special errorhandler, "surrogateescape" to deal with
> bytes that are not valid UTF-8. If you try to print such a string you
> get an error:
>
> $ python3 -c'import sys; print(repr(sys.argv[1]))' $'\x01'$'\x05'$'\xFF'
> '\x01\x05\udcff'
> $ python3 -c'import sys; print(sys.argv[1])' $'\x01'$'\x05'$'\xFF'
> Traceback (most recent call last):
> File "<string>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in
> position 2: surrogates not allowed
>
> It is still possible to get the original bytes:
>
> $ python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))' $'\x01'$'\x05'$'\xFF'
> b'\x01\x05\xff'
>
>
But is it safe even if the locale is not UTF-8? I would like to be able
to pass a file name to a script. I can use bytes for file names in the
open function. If I keep the filename as bytes everywhere it will work
reliably whatever the locale or strange character the file name may
contain.
Olive
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-01-18 15:01 +0100 |
| Message-ID | <mailman.4827.1326895312.27778.python-list@python.org> |
| In reply to | #19079 |
Olive wrote:
> On Wed, 18 Jan 2012 09:05:42 +0100
> Peter Otten <__peter__@web.de> wrote:
>
>> Olive wrote:
>>
>> > In Unix, the operating system passes arguments as a list of C strings.
>> > But C strings correspond to the bytes notion of Python 3. Is
>> > it possible to have sys.argv as a list of bytes? What happens if I
>> > pass to a program an argument containing funny "characters", for
>> > example (with a bash shell)?
>> >
>> > python -i ./test.py $'\x01'$'\x05'$'\xFF'
>>
>> Python has a special errorhandler, "surrogateescape" to deal with
>> bytes that are not valid UTF-8. If you try to print such a string you
>> get an error:
>>
>> $ python3 -c'import sys; print(repr(sys.argv[1]))' $'\x01'$'\x05'$'\xFF'
>> '\x01\x05\udcff'
>> $ python3 -c'import sys; print(sys.argv[1])' $'\x01'$'\x05'$'\xFF'
>> Traceback (most recent call last):
>> File "<string>", line 1, in <module>
>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in
>> position 2: surrogates not allowed
>>
>> It is still possible to get the original bytes:
>>
>> $ python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))' $'\x01'$'\x05'$'\xFF'
>> b'\x01\x05\xff'
>>
>>
>
> But is it safe even if the locale is not UTF-8? I would like to be able
> to pass a file name to a script. I can use bytes for file names in the
> open function. If I keep the filename as bytes everywhere it will work
> reliably whatever the locale or strange character the file name may
> contain.
I believe you need not convert back to bytes explicitly; you can open the
file with open(sys.argv[i]). I don't know if there are corner cases where
that won't work; maybe http://www.python.org/dev/peps/pep-0383/ can help you
figure it out.
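A small sketch of that idea, assuming a Unix-like system whose file system accepts arbitrary bytes in file names (Linux does; some others may refuse). The file name and the helper read_via_text_name are hypothetical, and os.fsdecode stands in for whatever decoding produced sys.argv.

```python
import os
import tempfile


def read_via_text_name(directory, raw_name):
    """Open a file through the str form of its name, as open(sys.argv[i]) would."""
    text_path = os.fsdecode(os.path.join(directory, raw_name))
    with open(text_path, "rb") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)
    raw_name = b"caf\xe9.txt"  # not valid UTF-8 on its own
    # Create the file using the bytes name directly.
    with open(os.path.join(tmp_bytes, raw_name), "wb") as handle:
        handle.write(b"hello")
    # Read it back using only the decoded (str) name.
    content = read_via_text_name(tmp_bytes, raw_name)

print(content)
```

If the str name reaches open() unchanged, the escaped surrogate is turned back into the original byte when the path is handed to the operating system.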
| From | Nobody <nobody@nowhere.com> |
|---|---|
| Date | 2012-01-19 05:05 +0000 |
| Message-ID | <pan.2012.01.19.05.05.31.519000@nowhere.com> |
| In reply to | #19079 |
On Wed, 18 Jan 2012 09:05:42 +0100, Peter Otten wrote:
>> Python has a special errorhandler, "surrogateescape" to deal with
>> bytes that are not valid UTF-8.
On Wed, 18 Jan 2012 11:16:27 +0100, Olive wrote:
> But is it safe even if the locale is not UTF-8?
Yes. Peter's reference to UTF-8 is misleading. The surrogateescape
mechanism is used to represent anything which cannot be decoded according
to the locale's encoding. E.g. in the "C" locale, any byte >= 128 will be
decoded to a lone surrogate (U+DC80 through U+DCFF).
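To make the "C" locale case concrete, here is a minimal sketch that models it with the "ascii" codec plus "surrogateescape". That choice of codec is an assumption for illustration (newer Python versions may handle the "C" locale differently), and decode_like_c_locale is a made-up name.

```python
def decode_like_c_locale(raw):
    """Model of the "C" locale: ASCII passes through, other bytes are escaped."""
    return raw.decode("ascii", "surrogateescape")


text = decode_like_c_locale(b"caf\xc3\xa9")
code_points = [hex(ord(ch)) for ch in text]
print(code_points)

# Each byte b >= 128 maps to the code point 0xDC00 + b, so the mapping is reversible.
round_trip = text.encode("ascii", "surrogateescape")
print(round_trip)
```

Because every escaped byte gets its own code point, encoding with the same codec and error handler gives back exactly the bytes that came in.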
On Wed, 18 Jan 2012 09:05:42 +0100, Peter Otten wrote:
> It is still possible to get the original bytes:
>
> python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))'
Except, it isn't. Because the Python devs can't make up their minds which
encoding sys.argv uses, or even document it.
AFAICT:
On Windows, there never was a bytes version of sys.argv to start with
(the OS supplies the command line using wide strings).
On Mac OS X, the command line is always decoded using UTF-8.
On Unix, the command line is decoded using mbstowcs(). There isn't a
Python function to query which encoding this uses (if there even _is_ a
corresponding Python encoding).
Except on Windows (where OS APIs take wide string parameters), if a
library function needs to pass a Unicode string to an API function, it
will normally decode it using sys.getfilesystemencoding(), which isn't
guaranteed to be the encoding which was used to fabricate sys.argv in
the first place.
In short: if you need to write "system" scripts on Unix, and you need them
to work reliably, you need to stick with Python 2.x.
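For comparison, a hedged sketch of what getting bytes back looks like with os.fsencode(), which exists since Python 3.2 and uses sys.getfilesystemencoding() with "surrogateescape" on Unix. It only recovers the original bytes if sys.argv was decoded with that same encoding, which is exactly the point disputed above. The function name argv_as_bytes is invented for this example.

```python
import os
import sys


def argv_as_bytes(argv=None):
    """Return the arguments as bytes, assuming they were decoded with the
    file system encoding and the "surrogateescape" error handler."""
    if argv is None:
        argv = sys.argv
    return [os.fsencode(arg) for arg in argv]


# Simulate an argument that arrived as raw bytes and was decoded by Python.
simulated = [os.fsdecode(b"\x01\x05\xff")]
print(argv_as_bytes(simulated))
```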
| From | jmfauth <wxjmfauth@gmail.com> |
|---|---|
| Date | 2012-01-19 02:40 -0800 |
| Message-ID | <fbe32369-594d-445b-9735-d19cbfd57de3@t30g2000vbx.googlegroups.com> |
| In reply to | #19114 |
> In short: if you need to write "system" scripts on Unix, and you need them
> to work reliably, you need to stick with Python 2.x.
I think, understanding the coding of the characters helps a bit.
I can not figure out how the example below could not be done on other systems.
D:\tmp>chcp
Page de codes active : 1252
D:\tmp>c:\python32\python.exe sysarg.py a b é € \u0430 \u03b1 z
arg: 1 unicode name: LATIN SMALL LETTER A
arg: 2 unicode name: LATIN SMALL LETTER B
arg: 3 unicode name: LATIN SMALL LETTER E WITH ACUTE
arg: 4 unicode name: EURO SIGN
arg: 5 unicode name: CYRILLIC SMALL LETTER A
arg: 6 unicode name: GREEK SMALL LETTER ALPHA
arg: 7 unicode name: LATIN SMALL LETTER Z
jmf
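The sysarg.py script itself is not shown in the post. The following is only a guess at what such a script might look like, reconstructed from the output format; describe_args is a hypothetical name.

```python
import sys
import unicodedata


def describe_args(args):
    """Build one line per character, giving the argument number and Unicode name."""
    lines = []
    for index, arg in enumerate(args, start=1):
        for char in arg:
            # Characters without a name (e.g. control codes) get a placeholder.
            name = unicodedata.name(char, "<unnamed>")
            lines.append("arg: {} unicode name: {}".format(index, name))
    return lines


if __name__ == "__main__":
    for line in describe_args(sys.argv[1:]):
        print(line)
```

Run with single-character arguments, this prints one line per argument in the same shape as the output quoted above.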