How do I encode and decode this data to write to a file?

Started by	cl@isbd.net
First post	2013-04-29 10:47 +0100
Last post	2013-05-01 19:36 -0400
Articles	13 — 9 participants

  How do I encode and decode this data to write to a file? cl@isbd.net - 2013-04-29 10:47 +0100
    Re: How do I encode and decode this data to write to a file? Andrew Berg <bahamutzero8825@gmail.com> - 2013-04-29 05:11 -0500
      Re: How do I encode and decode this data to write to a file? cl@isbd.net - 2013-04-29 13:50 +0100
    Re: How do I encode and decode this data to write to a file? Peter Otten <__peter__@web.de> - 2013-04-29 12:33 +0200
    Re: How do I encode and decode this data to write to a file? Dave Angel <davea@davea.name> - 2013-04-29 07:46 -0400
      Re: How do I encode and decode this data to write to a file? cl@isbd.net - 2013-04-29 13:59 +0100
        Re: How do I encode and decode this data to write to a file? Robert Kern <robert.kern@gmail.com> - 2013-04-29 14:11 +0100
          Re: How do I encode and decode this data to write to a file? cl@isbd.net - 2013-04-29 15:38 +0100
            Re: How do I encode and decode this data to write to a file? Skip Montanaro <skip@pobox.com> - 2013-04-29 09:56 -0500
    Re: How do I encode and decode this data to write to a file? Terry Jan Reedy <tjreedy@udel.edu> - 2013-04-29 18:02 -0400
    Re: How do I encode and decode this data to write to a file? Tony the Tiger <tony@tiger.invalid> - 2013-05-01 16:20 -0500
      Re: How do I encode and decode this data to write to a file? Ned Batchelder <ned@nedbatchelder.com> - 2013-05-01 18:01 -0400
    Re: How do I encode and decode this data to write to a file? Ned Batchelder <ned@nedbatchelder.com> - 2013-05-01 19:36 -0400

#44489 — How do I encode and decode this data to write to a file?

From	cl@isbd.net
Date	2013-04-29 10:47 +0100
Subject	How do I encode and decode this data to write to a file?
Message-ID	<27s15a-943.ln1@chris.zbmc.eu>

I am debugging some code that creates a static HTML gallery from a
directory hierarchy full of images. It's this package:-
    https://pypi.python.org/pypi/Gallery2.py/2.0


It's basically working and does pretty much what I want so I'm happy to
put some effort into it and fix things.

The problem I'm currently chasing is that it can't cope with directory
names that have accented characters in them, it fails when it tries to
write the HTML that creates the page with the thumbnails on.

The code that's failing is:-

        raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
        file = open(raw, "w")
        file.write("".join(html).encode('utf-8'))
        file.close()

The variable html is a list containing the lines of HTML to write to the
file.  It fails when it contains accented characters (an é in this
case).  Here's the traceback:-

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 41, in run self._recurse()
  File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 272, in _recurse os.path.walk(self.props["sourcedir"], self.processDir, None)
  File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg)
  File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py", line 238, in walk func(arg, top, names)
  File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 263, in processDir self.createGallery()
  File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 215, in createGallery self.picturemanager.createPictureHTMLs(self.footer)
  File "/usr/local/lib/python2.7/dist-packages/gallery/picturemanager.py", line 84, in createPictureHTMLs curPic.createPictureHTML(self.galleryDirectory, self.getStylesheet(), self.fullsize, footer)
  File "/usr/local/lib/python2.7/dist-packages/gallery/picture.py", line 361, in createPictureHTML file.write("".join(html).encode('utf-8')) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 783: ordinal not in range(128)



If I understand correctly the encode() is saying that it can't
understand the data in the html because there's a character 0xc3 in it.
I *think* this means that the é is encoded in UTF-8 already in the
incoming data stream (should be as my system is wholly UTF-8 as far as I
know and I created the directory name).

So how do I change the code so I don't get the error?  Do I just
decode() the data first and then encode() it?

-- 
Chris Green

#44490

From	Andrew Berg <bahamutzero8825@gmail.com>
Date	2013-04-29 05:11 -0500
Message-ID	<mailman.1146.1367230281.3114.python-list@python.org>
In reply to	#44489

On 2013.04.29 04:47, cl@isbd.net wrote:
> If I understand correctly the encode() is saying that it can't
> understand the data in the html because there's a character 0xc3 in it.
> I *think* this means that the é is encoded in UTF-8 already in the
> incoming data stream (should be as my system is wholly UTF-8 as far as I
> know and I created the directory name).
You can verify that your filesystem is set to use UTF-8 with sys.getfilesystemencoding(). If it returns 'ascii', then your locale settings
are incorrect.

-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1

#44501

From	cl@isbd.net
Date	2013-04-29 13:50 +0100
Message-ID	<ps625a-c45.ln1@chris.zbmc.eu>
In reply to	#44490

Andrew Berg <bahamutzero8825@gmail.com> wrote:
> On 2013.04.29 04:47, cl@isbd.net wrote:
> > If I understand correctly the encode() is saying that it can't
> > understand the data in the html because there's a character 0xc3 in it.
> > I *think* this means that the é is encoded in UTF-8 already in the
> > incoming data stream (should be as my system is wholly UTF-8 as far as I
> > know and I created the directory name).
> You can verify that your filesystem is set to use UTF-8 with sys.getfilesystemencoding(). 
> If it returns 'ascii', then your locale settings 
> are incorrect.
> 

    chris$ python
    Python 2.7.3 (default, Sep 26 2012, 21:51:14) 
    [GCC 4.7.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.getfilesystemencoding()
    'UTF-8'
    >>> 

So I am set up right for UTF-8.

-- 
Chris Green

#44492

From	Peter Otten <__peter__@web.de>
Date	2013-04-29 12:33 +0200
Message-ID	<mailman.1149.1367231585.3114.python-list@python.org>
In reply to	#44489

cl@isbd.net wrote:

> I am debugging some code that creates a static HTML gallery from a
> directory hierarchy full of images. It's this package:-
>     https://pypi.python.org/pypi/Gallery2.py/2.0
> 
> 
> It's basically working and does pretty much what I want so I'm happy to
> put some effort into it and fix things.
> 
> The problem I'm currently chasing is that it can't cope with directory
> names that have accented characters in them, it fails when it tries to
> write the HTML that creates the page with the thumbnails on.
> 
> The code that's failing is:-
> 
>         raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
>         file = open(raw, "w")
>         file.write("".join(html).encode('utf-8'))
>         file.close()
> 
> The variable html is a list containing the lines of HTML to write to the
> file.  It fails when it contains accented characters (an é in this
> case).  Here's the traceback:-
> 
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line
>   41, in run self._recurse() File
>   "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 272,
>   in _recurse os.path.walk(self.props["sourcedir"], self.processDir, None)
>   File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name,
>   func, arg) File "/usr/lib/python2.7/posixpath.py", line 246, in walk
>   walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py", line 246,
>   in walk walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py",
>   line 238, in walk func(arg, top, names) File
>   "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 263,
>   in processDir self.createGallery() File
>   "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 215,
>   in createGallery self.picturemanager.createPictureHTMLs(self.footer)
>   File "/usr/local/lib/python2.7/dist-packages/gallery/picturemanager.py",
>   line 84, in createPictureHTMLs
>   curPic.createPictureHTML(self.galleryDirectory, self.getStylesheet(),
>   self.fullsize, footer) File
>   "/usr/local/lib/python2.7/dist-packages/gallery/picture.py", line 361,
>   in createPictureHTML file.write("".join(html).encode('utf-8'))
>   UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
>   783: ordinal not in range(128)
> 
> 
> 
> If I understand correctly the encode() is saying that it can't
> understand the data in the html because there's a character 0xc3 in it.
> I *think* this means that the é is encoded in UTF-8 already in the
> incoming data stream (should be as my system is wholly UTF-8 as far as I
> know and I created the directory name).
> 
> So how do I change the code so I don't get the error?  Do I just
> decode() the data first and then encode() it?
> 

Note that you are getting a *UnicodeDecodeError*, not a UnicodeEncodeError. 
Try omitting the encode() step, i.e. instead of

>         file.write("".join(html).encode('utf-8'))

use

file.write("".join(html))

Background (applies to Python 2 only): the str type deals with bytes, not 
code points. The right thing to do is to use .decode(...) to convert from 
str to unicode and .encode(...) to convert from unicode to str. In Python 2 
however the str type has an encode(...) method which is basically equivalent 
to

class str:
   # imaginary python implementation of python2's str
   ...
   def encode(self, encoding):
       return self.decode("ascii").encode(encoding)

and is almost never called intentionally.

PS Python3 has relabeled unicode to str and thus uses unicode by default. 
str was renamed to bytes and the annoying bytes.encode() method is gone.
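
To make the hidden step visible, here is a minimal sketch of that behaviour. It is written for Python 3 (an assumption; the code in this thread runs on Python 2.7), where the implicit decode has to be spelled out by hand. The helper name py2_style_encode is made up for illustration.

```python
def py2_style_encode(data, encoding):
    """Mimic Python 2's str.encode(): implicit ASCII decode, then encode.

    `data` is a bytes object; this helper is hypothetical and only
    illustrates the two-step behaviour described above.
    """
    return data.decode("ascii").encode(encoding)


ascii_bytes = b"<p>plain</p>"
utf8_bytes = "<p>caf\u00e9</p>".encode("utf-8")  # contains bytes 0xc3 0xa9

# Pure ASCII data survives the round trip unchanged.
print(py2_style_encode(ascii_bytes, "utf-8"))

# Non-ASCII bytes fail in the *decode* half, hence a UnicodeDecodeError
# even though the call that was written is encode().
try:
    py2_style_encode(utf8_bytes, "utf-8")
except UnicodeDecodeError as exc:
    print("failed:", exc)
```

This is why the traceback names the 'ascii' codec although the source only mentions 'utf-8'.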

#44495

From	Dave Angel <davea@davea.name>
Date	2013-04-29 07:46 -0400
Message-ID	<mailman.1151.1367236032.3114.python-list@python.org>
In reply to	#44489

On 04/29/2013 05:47 AM, cl@isbd.net wrote:

A couple of generic comments:  your email program made a mess of the 
traceback by appending each source line to the location information.

Please mention your Python version & OS.  Apparently you're running 2.7 
on Linux or similar.

> I am debugging some code that creates a static HTML gallery from a
> directory hierarchy full of images. It's this package:-
>      https://pypi.python.org/pypi/Gallery2.py/2.0
>
>
> It's basically working and does pretty much what I want so I'm happy to
> put some effort into it and fix things.
>
> The problem I'm currently chasing is that it can't cope with directory
> names that have accented characters in them, it fails when it tries to
> write the HTML that creates the page with the thumbnails on.
>
> The code that's failing is:-
>
>          raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
>          file = open(raw, "w")
>          file.write("".join(html).encode('utf-8'))

You can't encode byte data, it's already encoded. So you're forcing the 
Python system to implicitly decode it (using ASCII codec) before letting 
you encode it to utf-8.  If you think it's already in utf-8, then omit 
the encode() call there.

Additionally, you can debug things with some simple print statements, at 
least if you decompose your 3-function line so you can get at the 
intermediate data.  Split the line into three parts:
     temp1 = "".join(html)     #temp1 is byte data
     temp2 = temp1.decode()    #temp2 is unicode data
     temp3 = temp2.encode("utf-8")  #temp3 is byte data again
     file.write(temp3)

Now, you'll presumably get the error on the second line, so examine the 
bytes around byte 783.  Make sure it's really in utf-8, and if it is, 
then skip the decode and the encode.  If it's not, then Andrew's advice 
is pertinent.

I would also look at the variable html.  It's a list, but what are the 
types of the elements in it?
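
A rough sketch of those checks, assuming Python 3 (the thread's code is Python 2.7). The helper names bytes_around, is_valid_utf8 and element_types, and the sample html list, are invented for illustration and are not part of the gallery package.

```python
def bytes_around(data, position, width=8):
    """Return the slice of `data` within `width` bytes of `position`."""
    start = max(0, position - width)
    return data[start:position + width]


def is_valid_utf8(data):
    """Report whether `data` (bytes) decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def element_types(items):
    """Return the set of type names found in a list such as `html`."""
    return {type(item).__name__ for item in items}


# Made-up stand-in for the real list built in picture.py.
html = [b"<html><body>", "caf\u00e9".encode("utf-8"), b"</body></html>"]
joined = b"".join(html)

print(element_types(html))
print(repr(bytes_around(joined, joined.index(b"\xc3"))))
print(is_valid_utf8(joined))
```

In the real code the position to look at would be 783, taken from the error message.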

>          file.close()
>
> The variable html is a list containing the lines of HTML to write to the
> file.  It fails when it contains accented characters (an é in this
> case).  Here's the traceback:-
>
> Traceback (most recent call last):
>    File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 41, in run self._recurse()
>    File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 272, in _recurse os.path.walk(self.props["sourcedir"], self.processDir, None)
>    File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg)
>    File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py", line 238, in walk func(arg, top, names)
>    File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 263, in processDir self.createGallery()
>    File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 215, in createGallery self.picturemanager.createPictureHTMLs(self.footer)
>    File "/usr/local/lib/python2.7/dist-packages/gallery/picturemanager.py", line 84, in createPictureHTMLs curPic.createPictureHTML(self.galleryDirectory, self.getStylesheet(), self.fullsize, footer)
>    File "/usr/local/lib/python2.7/dist-packages/gallery/picture.py", line 361, in createPictureHTML file.write("".join(html).encode('utf-8')) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 783: ordinal not in range(128)
>
>
>
> If I understand correctly the encode() is saying that it can't
> understand the data in the html because there's a character 0xc3 in it.
> I *think* this means that the é is encoded in UTF-8 already in the
> incoming data stream (should be as my system is wholly UTF-8 as far as I
> know and I created the directory name).
>
> So how do I change the code so I don't get the error?  Do I just
> decode() the data first and then encode() it?
>


-- 
DaveA

#44502

From	cl@isbd.net
Date	2013-04-29 13:59 +0100
Message-ID	<ee725a-c45.ln1@chris.zbmc.eu>
In reply to	#44495

Dave Angel <davea@davea.name> wrote:
> On 04/29/2013 05:47 AM, cl@isbd.net wrote:
> 
> A couple of generic comments:  your email program made a mess of the 
> traceback by appending each source line to the location information.
> 
What's me email program got to do with it?  :-)   I'm using a dedicated
newsreader (tin) as I posted via the gmane/usenet interface.  The posting
looks perfectly OK to me when I read it back from usenet.


> Please mention your Python version & OS.  Apparently you're running 2.7 
> on Linux or similar.
> 
Sorry, yes you're spot on.


> > I am debugging some code that creates a static HTML gallery from a
> > directory hierarchy full of images. It's this package:-
> >      https://pypi.python.org/pypi/Gallery2.py/2.0
> >
> >
> > It's basically working and does pretty much what I want so I'm happy to
> > put some effort into it and fix things.
> >
> > The problem I'm currently chasing is that it can't cope with directory
> > names that have accented characters in them, it fails when it tries to
> > write the HTML that creates the page with the thumbnails on.
> >
> > The code that's failing is:-
> >
> >          raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
> >          file = open(raw, "w")
> >          file.write("".join(html).encode('utf-8'))
> 
> You can't encode byte data, it's already encoded. So you're forcing the 
> Python system to implicitly decode it (using ASCII codec) before letting 
> you encode it to utf-8.  If you think it's already in utf-8, then omit 
> the encode() call there.
> 
It's the way the code was as I installed it from pypi.  What you say
makes a lot of sense though, I'll remove the encode().


> Additionally, you can debug things with some simple print statements, at 
> least if you decompose your 3-function line so you can get at the 
> intermediate data.  Split the line into three parts;
>      temp1 = "".join(html)     #temp1 is byte data
>      temp2 = temp1.decode()    #temp2 is unicode data
>      temp3 = temp2.encode("utf-8")  #temp3 is byte data again
>      file.write(temp3)
> 
OK, thanks for this and all the other advice on this thread.

-- 
Chris Green

#44503

From	Robert Kern <robert.kern@gmail.com>
Date	2013-04-29 14:11 +0100
Message-ID	<mailman.1154.1367241090.3114.python-list@python.org>
In reply to	#44502

On 2013-04-29 13:59, cl@isbd.net wrote:
> Dave Angel <davea@davea.name> wrote:
>> On 04/29/2013 05:47 AM, cl@isbd.net wrote:
>>
>> A couple of generic comments:  your email program made a mess of the
>> traceback by appending each source line to the location information.
>>
> What's me email program got to do with it?  :-)   I'm using a dedicated
> newsreader (tin) as I posted via the gmane/usenet interface.  The posting
> looks perfectly OK to me when I read it back from usenet.

FWIW, I see the same problem Dave sees. I'm using gmane via Thunderbird.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco

#44504

From	cl@isbd.net
Date	2013-04-29 15:38 +0100
Message-ID	<t8d25a-pd6.ln1@chris.zbmc.eu>
In reply to	#44503

Robert Kern <robert.kern@gmail.com> wrote:
> On 2013-04-29 13:59, cl@isbd.net wrote:
> > Dave Angel <davea@davea.name> wrote:
> >> On 04/29/2013 05:47 AM, cl@isbd.net wrote:
> >>
> >> A couple of generic comments:  your email program made a mess of the
> >> traceback by appending each source line to the location information.
> >>
> > What's me email program got to do with it?  :-)   I'm using a dedicated
> > newsreader (tin) as I posted via the gmane/usenet interface.  The posting
> > looks perfectly OK to me when I read it back from usenet.
> 
> FWIW, I see the same problem Dave sees. I'm using gmane via Thunderbird.
> 
How strange.  I think it must be something to do with the gmane
interface between news and mail then.

-- 
Chris Green

#44505

From	Skip Montanaro <skip@pobox.com>
Date	2013-04-29 09:56 -0500
Message-ID	<mailman.1155.1367247394.3114.python-list@python.org>
In reply to	#44504

> How strange.  I think it must be something to do with the gmane
> interface between news and mail then.

Probably.  It was borked in Gmail as well...

Skip

#44525

From	Terry Jan Reedy <tjreedy@udel.edu>
Date	2013-04-29 18:02 -0400
Message-ID	<mailman.1167.1367272922.3114.python-list@python.org>
In reply to	#44489

On 4/29/2013 5:47 AM, cl@isbd.net wrote:

> case).  Here's the traceback:-
>

>    File "/usr/local/lib/python2.7/dist-packages/gallery/picture.py", line 361,
>    in createPictureHTML file.write("".join(html).encode('utf-8'))
>    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
>    783: ordinal not in range(128)

Generic advice for anyone getting unicode errors:
unpack the composition producing the error
so that one can see which operation produced it.

In this case
s = "".join(html)
s = s.encode('utf-8')
file.write(s)

This also makes it possible to print intermediate results.
   print(type(s), s)  # would have been useful
Doing so would have immediately shown that in this case the error was 
the encode operation, because s was already bytes.
For many other posts, the error with the same type of message has been 
the print or write operation, due to output encoding issues, but that was 
not the case here.

#44603

From	Tony the Tiger <tony@tiger.invalid>
Date	2013-05-01 16:20 -0500
Message-ID	<Zridnb6s37SKGhzMnZ2dnUVZ8u6dnZ2d@giganews.com>
In reply to	#44489

On Mon, 29 Apr 2013 10:47:46 +0100, cl wrote:

>         raw = os.path.join(directory, self.getNameNoExtension()) +
>         ".html"
>         file = open(raw, "w")
>         file.write("".join(html).encode('utf-8'))
>         file.close()

This works for me:

Python 2.7.3 (default, Aug  1 2012, 05:16:07) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> html='<html><head><title>Blah</title><body>éåäö</body></html>'
>>> f=open('test.html', 'w')
>>> f.write(''.join(html.decode('utf-8').encode('utf-8')))
>>> f.close()

Perhaps there are better ways to do it.


 /Grrr
-- 
          ___                  ___
 (\_--_/)  | _ ._    _|_|_  _   |o _  _ ._
 ( 9  9 )  |(_)| |\/  |_| |(/_  ||(_|(/_|
 stripes are forever - as overripe ferrets

#44604

From	Ned Batchelder <ned@nedbatchelder.com>
Date	2013-05-01 18:01 -0400
Message-ID	<mailman.1224.1367445706.3114.python-list@python.org>
In reply to	#44603

On 5/1/2013 5:20 PM, Tony the Tiger wrote:
> On Mon, 29 Apr 2013 10:47:46 +0100, cl wrote:
>
>>          raw = os.path.join(directory, self.getNameNoExtension()) +
>>          ".html"
>>          file = open(raw, "w")
>>          file.write("".join(html).encode('utf-8'))
>>          file.close()
> This works for me:
>
> Python 2.7.3 (default, Aug  1 2012, 05:16:07)
> [GCC 4.6.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> html='<html><head><title>Blah</title><body>éåäö</body></html>'
>>>> f=open('test.html', 'w')
>>>> f.write(''.join(html.decode('utf-8').encode('utf-8')))
>>>> f.close()
> Perhaps there are better ways to do it.

Your .write() line is exactly equivalent to:

     f.write(html)

Because: if X is a UTF-8 bytestring, then:

     X.decode('utf-8').encode('utf-8') == X

And if X is a bytestring, then:

     ''.join(X) == X
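
Both identities can be checked with a short sketch. It assumes Python 3, where iterating over bytes yields integers, so the join identity is shown on the decoded text string rather than on the bytestring as it would be in Python 2.

```python
# A UTF-8 bytestring holding the same accented characters as the example.
x = "<body>\u00e9\u00e5\u00e4\u00f6</body>".encode("utf-8")

# Decoding and re-encoding with the same codec is a no-op for valid UTF-8.
assert x.decode("utf-8").encode("utf-8") == x

# Joining the characters of a string with "" rebuilds the same string.
text = x.decode("utf-8")
assert "".join(text) == text
```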

--Ned.

>
>   /Grrr

#44607

From	Ned Batchelder <ned@nedbatchelder.com>
Date	2013-05-01 19:36 -0400
Message-ID	<mailman.1226.1367451385.3114.python-list@python.org>
In reply to	#44489

On 4/29/2013 5:47 AM, cl@isbd.net wrote:
> If I understand correctly the encode() is saying that it can't
> understand the data in the html because there's a character 0xc3 in it.
> I *think* this means that the é is encoded in UTF-8 already in the
> incoming data stream (should be as my system is wholly UTF-8 as far as I
> know and I created the directory name).
>
> So how do I change the code so I don't get the error?  Do I just
> decode() the data first and then encode() it?
>

BTW, I did a presentation at PyCon 2012 that many people have found 
helpful: Pragmatic Unicode, or, How Do I Stop the Pain: 
http://nedbatchelder.com/text/unipain.html .  It explains the principles 
at work here.
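
One of those principles is to decode bytes as they come in, work with text in the middle, and encode only on the way out. A minimal sketch of that idea follows, assuming Python 3; io.open is used because it also exists on Python 2.7 and is the same function as the built-in open on Python 3. The function write_html and its source_encoding parameter are hypothetical and not part of the gallery package.

```python
import io
import os
import tempfile


def write_html(path, lines, source_encoding="utf-8"):
    """Decode any byte strings, then let the file object do the encoding.

    `write_html` and `source_encoding` are made-up names for illustration.
    """
    text_lines = [
        line.decode(source_encoding) if isinstance(line, bytes) else line
        for line in lines
    ]
    # The file object encodes text on write, so no manual .encode() call.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(u"".join(text_lines))


# Mixed input: two byte strings and one text string.
target = os.path.join(tempfile.mkdtemp(), "test.html")
write_html(target, [b"<body>", u"caf\u00e9".encode("utf-8"), u"</body>"])

with io.open(target, "rb") as handle:
    print(repr(handle.read()))
```

With this shape there is only one place where an encoding is chosen for output, which makes a stray implicit ASCII conversion much harder to hit.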

--Ned.
