sys.argv as a list of bytes

Started by	Olive <diolu@bigfoot.com>
First post	2012-01-18 08:16 +0100
Last post	2012-01-19 02:40 -0800
Articles	6 — 4 participants

  sys.argv as a list of bytes Olive <diolu@bigfoot.com> - 2012-01-18 08:16 +0100
    Re: sys.argv as a list of bytes Peter Otten <__peter__@web.de> - 2012-01-18 09:05 +0100
      Re: sys.argv as a list of bytes Olive <diolu@bigfoot.com> - 2012-01-18 11:16 +0100
        Re: sys.argv as a list of bytes Peter Otten <__peter__@web.de> - 2012-01-18 15:01 +0100
        Re: sys.argv as a list of bytes Nobody <nobody@nowhere.com> - 2012-01-19 05:05 +0000
          Re: sys.argv as a list of bytes jmfauth <wxjmfauth@gmail.com> - 2012-01-19 02:40 -0800

#19074 — sys.argv as a list of bytes

From	Olive <diolu@bigfoot.com>
Date	2012-01-18 08:16 +0100
Subject	sys.argv as a list of bytes
Message-ID	<20120118081612.13745187@bigfoot.com>

In Unix the operating system passes arguments as a list of C strings. But
C strings correspond to the bytes notion of Python 3. Is it
possible to have sys.argv as a list of bytes? What happens if I pass
to a program an argument containing funny "characters", for example
(with a bash shell)?

python -i ./test.py $'\x01'$'\x05'$'\xFF'

#19075

From	Peter Otten <__peter__@web.de>
Date	2012-01-18 09:05 +0100
Message-ID	<mailman.4823.1326873960.27778.python-list@python.org>
In reply to	#19074

Olive wrote:

> In Unix the operating system passes arguments as a list of C strings. But
> C strings correspond to the bytes notion of Python 3. Is it
> possible to have sys.argv as a list of bytes? What happens if I pass
> to a program an argument containing funny "characters", for example
> (with a bash shell)?
> 
> python -i ./test.py $'\x01'$'\x05'$'\xFF'

Python has a special error handler, "surrogateescape", to deal with bytes that are not 
valid UTF-8. Taking the repr() of such a string works, but if you try to print it you get an error:

$ python3 -c'import sys; print(repr(sys.argv[1]))' $'\x01'$'\x05'$'\xFF'
'\x01\x05\udcff'
$ python3 -c'import sys; print(sys.argv[1])' $'\x01'$'\x05'$'\xFF'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 2: surrogates not allowed

It is still possible to get the original bytes:

$ python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))' $'\x01'$'\x05'$'\xFF'
b'\x01\x05\xff'
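
The same round trip can be sketched without a shell. This is only an illustration: the helper names to_str and to_bytes are made up for the example, and UTF-8 is assumed as the encoding, as in the commands above.

```python
def to_str(raw: bytes) -> str:
    # Undecodable bytes become lone surrogates (0xFF -> U+DCFF).
    return raw.decode("utf-8", "surrogateescape")


def to_bytes(text: str) -> bytes:
    # The same error handler turns the lone surrogates back into bytes.
    return text.encode("utf-8", "surrogateescape")


raw = b"\x01\x05\xff"
text = to_str(raw)

try:
    # A strict encode fails the same way print() does in the traceback above.
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

Here text is '\x01\x05\udcff', strict_ok ends up False, and to_bytes(text) gives back the original three bytes.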

#19079

From	Olive <diolu@bigfoot.com>
Date	2012-01-18 11:16 +0100
Message-ID	<20120118111627.14d490ac@bigfoot.com>
In reply to	#19075

On Wed, 18 Jan 2012 09:05:42 +0100
Peter Otten <__peter__@web.de> wrote:

> Olive wrote:
> 
> > In Unix the operating system passes arguments as a list of C strings.
> > But C strings correspond to the bytes notion of Python 3. Is
> > it possible to have sys.argv as a list of bytes? What happens if I
> > pass to a program an argument containing funny "characters", for
> > example (with a bash shell)?
> > 
> > python -i ./test.py $'\x01'$'\x05'$'\xFF'
> 
> Python has a special error handler, "surrogateescape", to deal with
> bytes that are not valid UTF-8. If you try to print such a string you
> get an error:
> 
> $ python3 -c'import sys; print(repr(sys.argv[1]))' $'\x01'$'\x05'$'\xFF'
> '\x01\x05\udcff'
> $ python3 -c'import sys; print(sys.argv[1])' $'\x01'$'\x05'$'\xFF'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 2: surrogates not allowed
> 
> It is still possible to get the original bytes:
> 
> $ python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))' $'\x01'$'\x05'$'\xFF'
> b'\x01\x05\xff'
> 
> 

But is it safe even if the locale is not UTF-8? I would like to be able
to pass a file name to a script. I can use bytes for file names in the
open function. If I keep the file name as bytes everywhere, it will work
reliably whatever the locale and whatever strange characters the file
name may contain.

Olive

#19081

From	Peter Otten <__peter__@web.de>
Date	2012-01-18 15:01 +0100
Message-ID	<mailman.4827.1326895312.27778.python-list@python.org>
In reply to	#19079

Olive wrote:

> On Wed, 18 Jan 2012 09:05:42 +0100
> Peter Otten <__peter__@web.de> wrote:
> 
>> Olive wrote:
>> 
>> > In Unix the operating system passes arguments as a list of C strings.
>> > But C strings correspond to the bytes notion of Python 3. Is
>> > it possible to have sys.argv as a list of bytes? What happens if I
>> > pass to a program an argument containing funny "characters", for
>> > example (with a bash shell)?
>> > 
>> > python -i ./test.py $'\x01'$'\x05'$'\xFF'
>> 
>> Python has a special error handler, "surrogateescape", to deal with
>> bytes that are not valid UTF-8. If you try to print such a string you
>> get an error:
>> 
>> $ python3 -c'import sys; print(repr(sys.argv[1]))' $'\x01'$'\x05'$'\xFF'
>> '\x01\x05\udcff'
>> $ python3 -c'import sys; print(sys.argv[1])' $'\x01'$'\x05'$'\xFF'
>> Traceback (most recent call last):
>>   File "<string>", line 1, in <module>
>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 2: surrogates not allowed
>> 
>> It is still possible to get the original bytes:
>> 
>> $ python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))' $'\x01'$'\x05'$'\xFF'
>> b'\x01\x05\xff'
>> 
>> 
> 
> But is it safe even if the locale is not UTF-8? I would like to be able
> to pass a file name to a script. I can use bytes for file names in the
> open function. If I keep the file name as bytes everywhere, it will work
> reliably whatever the locale and whatever strange characters the file
> name may contain.

I believe you need not convert back to bytes explicitly; you can open the 
file with open(sys.argv[i]). I don't know if there are corner cases where 
that won't work; maybe http://www.python.org/dev/peps/pep-0383/ can help you 
figure it out.
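
A rough way to try this without involving the command line, as a sketch only: it assumes a POSIX file system that accepts arbitrary bytes in file names (typical on Linux, not on Mac OS X), and it assumes that os.fsdecode() produces the same str that sys.argv would hold for those bytes. The function name roundtrip_open is made up for the example.

```python
import os
import tempfile


def roundtrip_open(raw_name: bytes) -> bytes:
    """Create a file under a bytes name, then reopen it via its str name."""
    with tempfile.TemporaryDirectory() as tmp:
        path_bytes = os.path.join(os.fsencode(tmp), raw_name)
        with open(path_bytes, "wb") as f:
            f.write(b"ok")
        # Assumption: this is the str form sys.argv would carry for the
        # same bytes (file system encoding plus surrogateescape).
        path_str = os.fsdecode(path_bytes)
        with open(path_str, "rb") as f:
            return f.read()


# 0xE9 on its own is not valid UTF-8, so the str name holds a lone
# surrogate under a UTF-8 locale.
content = roundtrip_open(b"caf\xe9.txt")
```

If open() accepts the str name, content holds the two bytes written through the bytes name.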

#19114

From	Nobody <nobody@nowhere.com>
Date	2012-01-19 05:05 +0000
Message-ID	<pan.2012.01.19.05.05.31.519000@nowhere.com>
In reply to	#19079

On Wed, 18 Jan 2012 09:05:42 +0100, Peter Otten wrote:

>> Python has a special error handler, "surrogateescape", to deal with
>> bytes that are not valid UTF-8.

On Wed, 18 Jan 2012 11:16:27 +0100, Olive wrote:

> But is it safe even if the locale is not UTF-8?

Yes. Peter's reference to UTF-8 is misleading. The surrogateescape
mechanism is used to represent anything which cannot be decoded according
to the locale's encoding. E.g. in the "C" locale, any byte >= 128 will be
decoded to a lone surrogate.
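
A small sketch of that mapping, with one loud assumption: the "ascii" codec is used here as a stand-in for the "C" locale's encoding, since what an actual interpreter does in that locale depends on the Python version. The function name decode_like_c_locale is made up for the example.

```python
def decode_like_c_locale(raw: bytes) -> str:
    # Assumption: "ascii" stands in for the "C" locale's encoding.
    return raw.decode("ascii", "surrogateescape")


raw = b"a\x80\xe9\xff"
text = decode_like_c_locale(raw)

# Each byte >= 128 maps to the code point 0xDC00 + byte; ASCII is untouched.
code_points = [hex(ord(ch)) for ch in text]

# Encoding with the same codec and error handler restores the bytes.
restored = text.encode("ascii", "surrogateescape")
```

Here code_points is ['0x61', '0xdc80', '0xdce9', '0xdcff'] and restored equals raw.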

On Wed, 18 Jan 2012 09:05:42 +0100, Peter Otten wrote:

> It is still possible to get the original bytes:
> 
> python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))'

Except, it isn't, because the Python devs can't make up their minds about
which encoding sys.argv uses, or even document it.

AFAICT:

On Windows, there never was a bytes version of sys.argv to start with
(the OS supplies the command line using wide strings).

On Mac OS X, the command line is always decoded using UTF-8.

On Unix, the command line is decoded using mbstowcs(). There isn't a
Python function to query which encoding this used (if there even _is_ a
corresponding Python encoding).

Except on Windows (where OS APIs take wide string parameters), if a
library function needs to pass a Unicode string to an API function, it
will normally decode it using sys.getfilesystemencoding(), which isn't
guaranteed to be the encoding which was used to fabricate sys.argv in
the first place.

In short: if you need to write "system" scripts on Unix, and you need them
to work reliably, you need to stick with Python 2.x.
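
For comparison, here is a sketch of the approach the Python 3 documentation for sys.argv suggests, namely os.fsencode() on each argument. It assumes a POSIX system and assumes that sys.argv was decoded with the file system encoding and surrogateescape, which is exactly the point in dispute above. The names CHILD and argv_bytes_seen_by_child are made up for the example.

```python
import subprocess
import sys

# Child program: convert its first argument back to bytes and print the repr.
CHILD = "import os, sys; sys.stdout.write(repr(os.fsencode(sys.argv[1])))"


def argv_bytes_seen_by_child(raw: bytes) -> str:
    # POSIX only: subprocess passes a bytes argument through unchanged.
    result = subprocess.run(
        [sys.executable, "-c", CHILD, raw],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii")


seen = argv_bytes_seen_by_child(b"\x01\x05\xff")
```

If the assumption holds on the system at hand, seen is the repr of the original bytes; a mismatch would be the kind of failure described above.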

#19120

From	jmfauth <wxjmfauth@gmail.com>
Date	2012-01-19 02:40 -0800
Message-ID	<fbe32369-594d-445b-9735-d19cbfd57de3@t30g2000vbx.googlegroups.com>
In reply to	#19114

>
> In short: if you need to write "system" scripts on Unix, and you need them
> to work reliably, you need to stick with Python 2.x.


I think understanding the encoding of the characters helps a bit.

I cannot see why the example below could not be
done on other systems.

D:\tmp>chcp
Page de codes active : 1252

D:\tmp>c:\python32\python.exe sysarg.py a b é € \u0430 \u03b1 z
arg: 1   unicode name: LATIN SMALL LETTER A
arg: 2   unicode name: LATIN SMALL LETTER B
arg: 3   unicode name: LATIN SMALL LETTER E WITH ACUTE
arg: 4   unicode name: EURO SIGN
arg: 5   unicode name: CYRILLIC SMALL LETTER A
arg: 6   unicode name: GREEK SMALL LETTER ALPHA
arg: 7   unicode name: LATIN SMALL LETTER Z
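
sysarg.py itself is not shown. A hypothetical reconstruction that would produce output in this shape might look like the sketch below. It assumes every argument is a single character and that \u0430 and \u03b1 in the command line above stand for the actual Cyrillic and Greek letters; the function name describe is made up.

```python
import sys
import unicodedata


def describe(args):
    """Return one line per argument, giving its Unicode character name."""
    lines = []
    for i, arg in enumerate(args, start=1):
        # unicodedata.name() expects exactly one character per call.
        lines.append(f"arg: {i}   unicode name: {unicodedata.name(arg)}")
    return lines


if __name__ == "__main__":
    for line in describe(sys.argv[1:]):
        print(line)
```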

jmf
