Problem writing some strings (UnicodeEncodeError)

Started by	Paulo da Silva <p_s_d_a_s_i_l_v_a@netcabo.pt>
First post	2014-01-12 15:36 +0000
Last post	2014-01-12 08:55 -0800
Articles	13 — 3 participants

  Problem writing some strings (UnicodeEncodeError) Paulo da Silva <p_s_d_a_s_i_l_v_a@netcabo.pt> - 2014-01-12 15:36 +0000
    Re: Problem writing some strings (UnicodeEncodeError) Peter Otten <__peter__@web.de> - 2014-01-12 17:23 +0100
      Re: Problem writing some strings (UnicodeEncodeError) Paulo da Silva <p_s_d_a_s_i_l_v_a@netcabo.pt> - 2014-01-12 17:51 +0000
        Re: Problem writing some strings (UnicodeEncodeError) Peter Otten <__peter__@web.de> - 2014-01-12 19:50 +0100
          Re: Problem writing some strings (UnicodeEncodeError) Paulo da Silva <p_s_d_a_s_i_l_v_a@netcabo.pt> - 2014-01-12 19:41 +0000
            Re: Problem writing some strings (UnicodeEncodeError) Peter Otten <__peter__@web.de> - 2014-01-12 21:29 +0100
              Re: Problem writing some strings (UnicodeEncodeError) Paulo da Silva <p_s_d_a_s_i_l_v_a@netcabo.pt> - 2014-01-12 23:53 +0000
                Re: Problem writing some strings (UnicodeEncodeError) Peter Otten <__peter__@web.de> - 2014-01-13 09:48 +0100
                Re: Problem writing some strings (UnicodeEncodeError) Peter Otten <__peter__@web.de> - 2014-01-13 09:58 +0100
                  Re: Problem writing some strings (UnicodeEncodeError) Paulo da Silva <p_s_d_a_s_i_l_v_a@netcabo.pt> - 2014-01-13 16:14 +0000
                    Re: Problem writing some strings (UnicodeEncodeError) Peter Otten <__peter__@web.de> - 2014-01-13 18:29 +0100
                      Re: Problem writing some strings (UnicodeEncodeError) Paulo da Silva <p_s_d_a_s_i_l_v_a@netcabo.pt> - 2014-01-13 18:44 +0000
    Re: Problem writing some strings (UnicodeEncodeError) Emile van Sebille <emile@fenx.com> - 2014-01-12 08:55 -0800

#63778 — Problem writing some strings (UnicodeEncodeError)

From	Paulo da Silva <p_s_d_a_s_i_l_v_a@netcabo.pt>
Date	2014-01-12 15:36 +0000
Subject	Problem writing some strings (UnicodeEncodeError)
Message-ID	<laucp8$890$1@speranza.aioe.org>

Hi!

I am using a python3 script to produce a bash script from lots of
filenames obtained with os.walk.

I have a template string for each bash command in which I replace a
special string with the filename and then write the command to the bash
script file.

Something like this:

shf=open(bashfilename,'w')
filenames=getfilenames() # uses os.walk
for fn in filenames:
	...
	cmd=templ.replace("<fn>",fn)
	shf.write(cmd)

For certain filenames I got a UnicodeEncodeError exception at
shf.write(cmd)!
I use utf-8 and have # -*- coding: utf-8 -*- in the source .py.

How can I fix this?

Thanks for any help/comments.

#63781

From	Peter Otten <__peter__@web.de>
Date	2014-01-12 17:23 +0100
Message-ID	<mailman.5374.1389543800.18130.python-list@python.org>
In reply to	#63778

Paulo da Silva wrote:

> I am using a python3 script to produce a bash script from lots of
> filenames got using os.walk.
> 
> I have a template string for each bash command in which I replace a
> special string with the filename and then write the command to the bash
> script file.
> 
> Something like this:
> 
> shf=open(bashfilename,'w')
> filenames=getfilenames() # uses os.walk
> for fn in filenames:
>     ...
>     cmd=templ.replace("<fn>",fn)
>     shf.write(cmd)
> 
> For certain filenames I got a UnicodeEncodeError exception at
> shf.write(cmd)!
> I use utf-8 and have # -*- coding: utf-8 -*- in the source .py.
> 
> How can I fix this?
> 
> Thanks for any help/comments.

You make it harder to debug your problem by not giving the complete 
traceback. If the error message contains 'surrogates not allowed' like in 
the demo below

>>> with open("tmp.txt", "w") as f:
...     f.write("\udcef")
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in 
position 0: surrogates not allowed

you have filenames on your hard disk that are not valid UTF-8.

A possible fix would be to use bytes instead of str. For that you need to 
open `bashfilename` in binary mode ("wb") and pass bytes to the os.walk() 
call. 

Or you just go and fix the offending names.
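
To find the offending names before deciding on a fix, something like the
following sketch may help. It relies on the fact that a name with
undecodable bytes contains lone surrogates, which strict UTF-8 refuses to
encode. The helper names `has_lone_surrogates` and `find_undecodable` are
made up for this illustration and are not part of the standard library.

```python
import os

def has_lone_surrogates(name):
    """Return True if `name` cannot be encoded as strict UTF-8."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        # Only surrogate code points make a str fail strict UTF-8 encoding.
        return True
    return False

def find_undecodable(top):
    """Yield paths below `top` whose own name holds undecodable bytes."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            if has_lone_surrogates(name):
                yield os.path.join(dirpath, name)
```

A loop such as `for path in find_undecodable("."): print(ascii(path))` would
list the candidates; `ascii()` shows the surrogates as `\udcXX` escapes
instead of raising on output.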

#63792

From	Paulo da Silva <p_s_d_a_s_i_l_v_a@netcabo.pt>
Date	2014-01-12 17:51 +0000
Message-ID	<laukne$sp6$1@speranza.aioe.org>
In reply to	#63781

On 12-01-2014 16:23, Peter Otten wrote:
> You make it harder to debug your problem by not giving the complete 
> traceback. If the error message contains 'surrogates not allowed' like in 
> the demo below
> 
> >>> with open("tmp.txt", "w") as f:
> ...     f.write("\udcef")
> ... 
> Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in 
> position 0: surrogates not allowed

That is the situation. I just lost the traceback, and it would take a few
hours to reproduce it. Sorry.


> 
> you have filenames that are not valid UTF-8 on your harddisk. 
> 
> A possible fix would be to use bytes instead of str. For that you need to 
> open `bashfilename` in binary mode ("wb") and pass bytes to the os.walk() 
> call. 
This is my 1st time with python3, so I am confused!

As far as I can understand, it seems that os.walk is returning the
filenames exactly as they are on disk, just bytes like in C.

My template is a string. What is the result of the replace command? Is
there any change in the filename from os.walk contents?

Now, if the result of the replace has the replaced filename unchanged
how do I "convert" it to bytes type, without changing its contents, so
that I can write to the bashfile opened with "wb"?


> 
> Or you just go and fix the offending names.
This is impossible in my case.
I need a bash script with the names as they are on disk.

#63795

From	Peter Otten <__peter__@web.de>
Date	2014-01-12 19:50 +0100
Message-ID	<mailman.5382.1389552633.18130.python-list@python.org>
In reply to	#63792

Paulo da Silva wrote:

> On 12-01-2014 16:23, Peter Otten wrote:
>> You make it harder to debug your problem by not giving the complete
>> traceback. If the error message contains 'surrogates not allowed' like in
>> the demo below
>> 
>> >>> with open("tmp.txt", "w") as f:
>> ...     f.write("\udcef")
>> ...
>> Traceback (most recent call last):
>>   File "<stdin>", line 2, in <module>
>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in
>> position 0: surrogates not allowed
> 
> That is the situation. I just lost the traceback, and it would take a few
> hours to reproduce it. Sorry.
> 
> 
>> 
>> you have filenames that are not valid UTF-8 on your harddisk.
>> 
>> A possible fix would be to use bytes instead of str. For that you need to
>> open `bashfilename` in binary mode ("wb") and pass bytes to the os.walk()
>> call.
> This is my 1st time with python3, so I am confused!
> 
> As far as I can understand, it seems that os.walk is returning the
> filenames exactly as they are on disk, just bytes like in C.

No, they are decoded with the filesystem encoding (usually the locale's
preferred encoding). With UTF-8 that can fail, and if it does the
surrogateescape error handler replaces the offending bytes with lone
surrogate code points in the range U+DC80..U+DCFF:

>>> import os
>>> with open(b"\xe4\xf6\xfc", "w") as f: f.write("whatever")
... 
8
>>> os.listdir()
['\udce4\udcf6\udcfc']

You can bypass the decoding process by providing a bytes argument to 
os.listdir() (or os.walk() which uses os.listdir() internally):

>>> os.listdir(b".")
[b'\xe4\xf6\xfc']

To write these raw bytes into a file the file has of course to be binary, 
too.
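
The bytes-only route described above might look roughly like the sketch
below. It is an illustration under assumptions, not the original poster's
code: the function name `write_commands_bytes` and the `ls <fn>` template
are made up, and shell quoting is deliberately ignored here. A str that
came from os.walk() can be turned back into its original bytes with
os.fsencode(), which is why the `top` argument is converted that way.

```python
import os

def write_commands_bytes(top, script_path, template=b"ls <fn>\n"):
    """Write one command per file below `top`, keeping names as raw bytes."""
    # A bytes argument makes os.walk() yield bytes, so nothing is decoded.
    top = os.fsencode(top)
    # Binary mode: the bytes go to disk exactly as given.
    with open(script_path, "wb") as shf:
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # The template must be bytes too; str and bytes do not mix.
                shf.write(template.replace(b"<fn>", path))
```

The price of this approach is that every template and every string
operation on the names has to be done with bytes.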

> My template is a string. What is the result of the replace command? Is
> there any change in the filename from os.walk contents?
> 
> Now, if the result of the replace has the replaced filename unchanged
> how do I "convert" it to bytes type, without changing its contents, so
> that I can write to the bashfile opened with "wb"?
> 
> 
>> 
>> Or you just go and fix the offending names.
> This is impossible in my case.
> I need a bash script with the names as they are on disk.

I think instead of the hard way sketched out above it will be sufficient to 
specify the error handler when opening the destination file

shf = open(bashfilename, 'w', errors="surrogateescape")

but I have not tried it myself. Also, some bytes may need to be escaped, 
either to be understood by the shell, or to address security concerns:

>>> import os
>>> template = "ls <fn>"
>>> for filename in os.listdir():
...     print(template.replace("<fn>", filename))
... 
ls foo; rm bar
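
The errors="surrogateescape" suggestion can be checked without touching any
real filenames. The sketch below decodes some invalid bytes the way Python
does for filenames on a UTF-8 POSIX system, writes the resulting str to a
text file opened with that error handler, and reads the file back in binary
mode. The function name `roundtrip` and the explicit encoding="utf-8" and
newline="\n" arguments are choices made for this example only.

```python
import os
import tempfile

def roundtrip(raw_name):
    """Write a surrogate-escaped name to a text file; return the file's bytes."""
    # Each undecodable byte becomes one lone surrogate (0xE4 -> U+DCE4).
    name = raw_name.decode("utf-8", errors="surrogateescape")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "script.sh")
        # The same handler on the output side turns the surrogates back
        # into the original bytes instead of raising UnicodeEncodeError.
        with open(path, "w", encoding="utf-8",
                  errors="surrogateescape", newline="\n") as shf:
            shf.write("ls " + name + "\n")
        with open(path, "rb") as f:
            return f.read()
```

If the handler works as described, roundtrip(b"\xe4\xf6\xfc") gives back
b"ls \xe4\xf6\xfc\n", so the script contains the name exactly as it is on
disk.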

#63798

From	Paulo da Silva <p_s_d_a_s_i_l_v_a@netcabo.pt>
Date	2014-01-12 19:41 +0000
Message-ID	<laur49$dj8$1@speranza.aioe.org>
In reply to	#63795

> 
> I think instead of the hard way sketched out above it will be sufficient to 
> specify the error handler when opening the destination file
> 
> shf = open(bashfilename, 'w', errors="surrogateescape")
This seems to fix everything!
I tried with a small test set and it worked.

> 
> but I have not tried it myself. Also, some bytes may need to be escaped, 
> either to be understood by the shell, or to address security concerns:
> 

Since I am putting the file names between "", the only char that needs to
be escaped is the " itself.

I'm gonna try with the real thing.

Thank you very much for the fix and for everything I have learned here.

#63801

From	Peter Otten <__peter__@web.de>
Date	2014-01-12 21:29 +0100
Message-ID	<mailman.5386.1389558607.18130.python-list@python.org>
In reply to	#63798

Paulo da Silva wrote:

>> but I have not tried it myself. Also, some bytes may need to be escaped,
>> either to be understood by the shell, or to address security concerns:
>>
> 
> Since I am putting the file names between "", the only char that needs to
> be escaped is the " itself.

What about the escape char?

#63807

From	Paulo da Silva <p_s_d_a_s_i_l_v_a@netcabo.pt>
Date	2014-01-12 23:53 +0000
Message-ID	<lav9u1$hc8$1@speranza.aioe.org>
In reply to	#63801

On 12-01-2014 20:29, Peter Otten wrote:
> Paulo da Silva wrote:
> 
>>> but I have not tried it myself. Also, some bytes may need to be escaped,
>>> either to be understood by the shell, or to address security concerns:
>>>
>>
>> Since I am putting the file names between "", the only char that needs to
>> be escaped is the " itself.
> 
> What about the escape char?
> 
Just this: fn=fn.replace('"','\\"')

So far I didn't find any problem, but the script is still running.

#63816

From	Peter Otten <__peter__@web.de>
Date	2014-01-13 09:48 +0100
Message-ID	<mailman.5396.1389602915.18130.python-list@python.org>
In reply to	#63807

Paulo da Silva wrote:

> On 12-01-2014 20:29, Peter Otten wrote:
>> Paulo da Silva wrote:
>> 
>>>> but I have not tried it myself. Also, some bytes may need to be
>>>> escaped, either to be understood by the shell, or to address security
>>>> concerns:
>>>>
>>>
>>> Since I am putting the file names between "", the only char that needs to
>>> be escaped is the " itself.
>> 
>> What about the escape char?
>> 
> Just this: fn=fn.replace('"','\\"')
> 
> So far I didn't find any problem, but the script is still running.

To be a bit more explicit:

>>> for filename in os.listdir():
...     print(template.replace("<fn>", filename.replace('"', '\\"')))
... 
ls "\\"; rm whatever; ls \"

#63817

From	Peter Otten <__peter__@web.de>
Date	2014-01-13 09:58 +0100
Message-ID	<mailman.5397.1389603536.18130.python-list@python.org>
In reply to	#63807

Peter Otten wrote:

> To be a bit more explicit:
> 
> >>> for filename in os.listdir():
> ...     print(template.replace("<fn>", filename.replace('"', '\\"')))
> ...
> ls "\\"; rm whatever; ls \"

The complete session:

>>> import os
>>> template = 'ls "<fn>"'
>>> with open('\\"; rm whatever; ls \\', "w") as f: pass
... 
>>> for filename in os.listdir():
...     print(template.replace("<fn>", filename.replace('"', '\\"')))
... 
ls "\\"; rm whatever; ls \"


Shell variable substitution is another problem. c.l.py is probably not the 
best place to get the complete list of possibilities.
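
For what it is worth, a rough sketch of escaping for a name that stays
inside double quotes follows. It assumes the generated script is run
non-interactively by bash, where backslash, double quote, dollar sign and
backquote are the characters that keep a special meaning between double
quotes. The helper name `escape_for_double_quotes` is made up for this
example, and the list of characters should be checked against the bash
manual before relying on it.

```python
import re

# Characters that bash still treats specially inside double quotes.
_DQ_SPECIAL = re.compile(r'([\\"$`])')

def escape_for_double_quotes(name):
    """Prefix each special character with a backslash."""
    # r"\\\1" is a literal backslash followed by the matched character.
    return _DQ_SPECIAL.sub(r"\\\1", name)
```

With this, the backslash in the file name from the session above is
escaped as well, so it can no longer neutralize the backslash that was
added in front of the double quote.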

#63835

From	Paulo da Silva <p_s_d_a_s_i_l_v_a@netcabo.pt>
Date	2014-01-13 16:14 +0000
Message-ID	<lb13cq$j8g$1@speranza.aioe.org>
In reply to	#63817

On 13-01-2014 08:58, Peter Otten wrote:
> Peter Otten wrote:
> 
> Shell variable substitution is another problem. c.l.py is probably not the 
> best place to get the complete list of possibilities.
I see what you mean.
This is a tedious problem. I don't know if there is a simple solution in
Python for this. I have to think about it ...
For a more general and serious application I would not produce a bash
script. I would do all the work in Python.

That's not the case here, however. This script will be run only a few
times, for a very special purpose. The only problem was the occurrence of
some Portuguese characters in old filenames encoded in something other
than UTF-8. A very few also include the " character.

The worst thing that could happen is that the bash script aborts. Then it
would be easy to fix it using a simple editor.

#63842

From	Peter Otten <__peter__@web.de>
Date	2014-01-13 18:29 +0100
Message-ID	<mailman.5416.1389634177.18130.python-list@python.org>
In reply to	#63835

Paulo da Silva wrote:

> On 13-01-2014 08:58, Peter Otten wrote:
> I see what you mean.
> This is a tedious problem. I don't know if there is a simple solution in
> Python for this. I have to think about it ...
> For a more general and serious application I would not produce a bash
> script. I would do all the work in Python.
> 
> That's not the case here, however. This script will be run only a few
> times, for a very special purpose. The only problem was the occurrence of
> some Portuguese characters in old filenames encoded in something other
> than UTF-8. A very few also include the " character.
> 
> The worst thing that could happen is that the bash script aborts. Then it
> would be easy to fix it using a simple editor.

I looked around in the stdlib and found shlex.quote(). It uses ' instead of 
" which simplifies things, and special-cases only ':

>>> import shlex
>>> print(shlex.quote("alpha'beta"))
'alpha'"'"'beta'

So the answer is simpler than I had expected.
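
Putting the pieces of this thread together, a generator script could look
roughly like the sketch below. It combines shlex.quote() with the
errors="surrogateescape" handler discussed earlier. The function name
`write_script` and the `ls <fn>` template are assumptions for the example;
note that the template no longer puts its own quotes around `<fn>`, because
shlex.quote() adds them when needed.

```python
import os
import shlex

def write_script(top, script_path, template="ls <fn>\n"):
    """Write one shell command per file below `top` into `script_path`."""
    # surrogateescape writes undecodable filename bytes back unchanged.
    with open(script_path, "w", encoding="utf-8",
              errors="surrogateescape", newline="\n") as shf:
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # shlex.quote() wraps unsafe names in single quotes.
                shf.write(template.replace("<fn>", shlex.quote(path)))
```

Inside single quotes the shell performs no variable or command
substitution, so only the single quote itself needs special treatment,
and shlex.quote() takes care of that.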

#63853

From	Paulo da Silva <p_s_d_a_s_i_l_v_a@netcabo.pt>
Date	2014-01-13 18:44 +0000
Message-ID	<lb1c5f$cc9$1@speranza.aioe.org>
In reply to	#63842

On 13-01-2014 17:29, Peter Otten wrote:
> Paulo da Silva wrote:
> 
>> On 13-01-2014 08:58, Peter Otten wrote:

> 
> I looked around in the stdlib and found shlex.quote(). It uses ' instead of 
> " which simplifies things, and special-cases only ':
> 
> >>> print(shlex.quote("alpha'beta"))
> 'alpha'"'"'beta'
> 
> So the answer is simpler than I had expected.
> 
Yes, it should work, at least in this case.
Although it is Python-oriented, it seems to work for bash as well.
I need to remove the "" from the templates and use shlex.quote for
filenames. I'll give it a try.

Thanks

#63788

From	Emile van Sebille <emile@fenx.com>
Date	2014-01-12 08:55 -0800
Message-ID	<mailman.5378.1389546006.18130.python-list@python.org>
In reply to	#63778

On 01/12/2014 07:36 AM, Paulo da Silva wrote:
> Hi!
>
> I am using a python3 script to produce a bash script from lots of
> filenames got using os.walk.
>
> I have a template string for each bash command in which I replace a
> special string with the filename and then write the command to the bash
> script file.
>
> Something like this:
>
> shf=open(bashfilename,'w')
> filenames=getfilenames() # uses os.walk
> for fn in filenames:
> 	...
> 	cmd=templ.replace("<fn>",fn)
> 	shf.write(cmd)
>
> For certain filenames I got a UnicodeEncodeError exception at
> shf.write(cmd)!
> I use utf-8 and have # -*- coding: utf-8 -*- in the source .py.
>
> How can I fix this?

Not sure exactly, but I'd try


shf=open(bashfilename,'wb')

as a start.

HTH,

Emile
