| Started by | cl@isbd.net |
|---|---|
| First post | 2013-04-29 10:47 +0100 |
| Last post | 2013-05-01 19:36 -0400 |
| Articles | 13 — 9 participants |
How do I encode and decode this data to write to a file? cl@isbd.net - 2013-04-29 10:47 +0100
Re: How do I encode and decode this data to write to a file? Andrew Berg <bahamutzero8825@gmail.com> - 2013-04-29 05:11 -0500
Re: How do I encode and decode this data to write to a file? cl@isbd.net - 2013-04-29 13:50 +0100
Re: How do I encode and decode this data to write to a file? Peter Otten <__peter__@web.de> - 2013-04-29 12:33 +0200
Re: How do I encode and decode this data to write to a file? Dave Angel <davea@davea.name> - 2013-04-29 07:46 -0400
Re: How do I encode and decode this data to write to a file? cl@isbd.net - 2013-04-29 13:59 +0100
Re: How do I encode and decode this data to write to a file? Robert Kern <robert.kern@gmail.com> - 2013-04-29 14:11 +0100
Re: How do I encode and decode this data to write to a file? cl@isbd.net - 2013-04-29 15:38 +0100
Re: How do I encode and decode this data to write to a file? Skip Montanaro <skip@pobox.com> - 2013-04-29 09:56 -0500
Re: How do I encode and decode this data to write to a file? Terry Jan Reedy <tjreedy@udel.edu> - 2013-04-29 18:02 -0400
Re: How do I encode and decode this data to write to a file? Tony the Tiger <tony@tiger.invalid> - 2013-05-01 16:20 -0500
Re: How do I encode and decode this data to write to a file? Ned Batchelder <ned@nedbatchelder.com> - 2013-05-01 18:01 -0400
Re: How do I encode and decode this data to write to a file? Ned Batchelder <ned@nedbatchelder.com> - 2013-05-01 19:36 -0400
| From | cl@isbd.net |
|---|---|
| Date | 2013-04-29 10:47 +0100 |
| Subject | How do I encode and decode this data to write to a file? |
| Message-ID | <27s15a-943.ln1@chris.zbmc.eu> |
I am debugging some code that creates a static HTML gallery from a
directory hierarchy full of images. It's this package:-
https://pypi.python.org/pypi/Gallery2.py/2.0
It's basically working and does pretty much what I want so I'm happy to
put some effort into it and fix things.
The problem I'm currently chasing is that it can't cope with directory
names that have accented characters in them, it fails when it tries to
write the HTML that creates the page with the thumbnails on.
The code that's failing is:-
raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
file = open(raw, "w")
file.write("".join(html).encode('utf-8'))
file.close()
The variable html is a list containing the lines of HTML to write to the
file. It fails when it contains accented characters (an é in this
case). Here's the traceback:-
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 41, in run
    self._recurse()
  File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 272, in _recurse
    os.path.walk(self.props["sourcedir"], self.processDir, None)
  File "/usr/lib/python2.7/posixpath.py", line 246, in walk
    walk(name, func, arg)
  File "/usr/lib/python2.7/posixpath.py", line 246, in walk
    walk(name, func, arg)
  File "/usr/lib/python2.7/posixpath.py", line 246, in walk
    walk(name, func, arg)
  File "/usr/lib/python2.7/posixpath.py", line 238, in walk
    func(arg, top, names)
  File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 263, in processDir
    self.createGallery()
  File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 215, in createGallery
    self.picturemanager.createPictureHTMLs(self.footer)
  File "/usr/local/lib/python2.7/dist-packages/gallery/picturemanager.py", line 84, in createPictureHTMLs
    curPic.createPictureHTML(self.galleryDirectory, self.getStylesheet(), self.fullsize, footer)
  File "/usr/local/lib/python2.7/dist-packages/gallery/picture.py", line 361, in createPictureHTML
    file.write("".join(html).encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 783: ordinal not in range(128)
If I understand correctly the encode() is saying that it can't
understand the data in the html because there's a character 0xc3 in it.
I *think* this means that the é is encoded in UTF-8 already in the
incoming data stream (should be as my system is wholly UTF-8 as far as I
know and I created the directory name).
So how do I change the code so I don't get the error? Do I just
decode() the data first and then encode() it?
--
Chris Green
| From | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| Date | 2013-04-29 05:11 -0500 |
| Message-ID | <mailman.1146.1367230281.3114.python-list@python.org> |
| In reply to | #44489 |
On 2013.04.29 04:47, cl@isbd.net wrote:
> If I understand correctly the encode() is saying that it can't
> understand the data in the html because there's a character 0xc3 in it.
> I *think* this means that the é is encoded in UTF-8 already in the
> incoming data stream (should be as my system is wholly UTF-8 as far as I
> know and I created the directory name).
You can verify that your filesystem is set to use UTF-8 with sys.getfilesystemencoding().
If it returns 'ascii', then your locale settings are incorrect.
--
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1
| From | cl@isbd.net |
|---|---|
| Date | 2013-04-29 13:50 +0100 |
| Message-ID | <ps625a-c45.ln1@chris.zbmc.eu> |
| In reply to | #44490 |
Andrew Berg <bahamutzero8825@gmail.com> wrote:
> On 2013.04.29 04:47, cl@isbd.net wrote:
> > If I understand correctly the encode() is saying that it can't
> > understand the data in the html because there's a character 0xc3 in it.
> > I *think* this means that the é is encoded in UTF-8 already in the
> > incoming data stream (should be as my system is wholly UTF-8 as far as I
> > know and I created the directory name).
> You can verify that your filesystem is set to use UTF-8 with sys.getfilesystemencoding().
> If it returns 'ascii', then your locale settings
> are incorrect.
>
chris$ python
Python 2.7.3 (default, Sep 26 2012, 21:51:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getfilesystemencoding()
'UTF-8'
>>>
So I am set up right for UTF-8.
--
Chris Green
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2013-04-29 12:33 +0200 |
| Message-ID | <mailman.1149.1367231585.3114.python-list@python.org> |
| In reply to | #44489 |
cl@isbd.net wrote:
> I am debugging some code that creates a static HTML gallery from a
> directory hierarchy full of images. It's this package:-
> https://pypi.python.org/pypi/Gallery2.py/2.0
>
>
> It's basically working and does pretty much what I want so I'm happy to
> put some effort into it and fix things.
>
> The problem I'm currently chasing is that it can't cope with directory
> names that have accented characters in them, it fails when it tries to
> write the HTML that creates the page with the thumbnails on.
>
> The code that's failing is:-
>
> raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
> file = open(raw, "w")
> file.write("".join(html).encode('utf-8'))
> file.close()
>
> The variable html is a list containing the lines of HTML to write to the
> file. It fails when it contains accented characters (an é in this
> case). Here's the traceback:-
>
> Traceback (most recent call last):
> File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line
> 41, in run self._recurse() File
> "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 272,
> in _recurse os.path.walk(self.props["sourcedir"], self.processDir, None)
> File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name,
> func, arg) File "/usr/lib/python2.7/posixpath.py", line 246, in walk
> walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py", line 246,
> in walk walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py",
> line 238, in walk func(arg, top, names) File
> "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 263,
> in processDir self.createGallery() File
> "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 215,
> in createGallery self.picturemanager.createPictureHTMLs(self.footer)
> File "/usr/local/lib/python2.7/dist-packages/gallery/picturemanager.py",
> line 84, in createPictureHTMLs
> curPic.createPictureHTML(self.galleryDirectory, self.getStylesheet(),
> self.fullsize, footer) File
> "/usr/local/lib/python2.7/dist-packages/gallery/picture.py", line 361,
> in createPictureHTML file.write("".join(html).encode('utf-8'))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
> 783: ordinal not in range(128)
>
>
>
> If I understand correctly the encode() is saying that it can't
> understand the data in the html because there's a character 0xc3 in it.
> I *think* this means that the é is encoded in UTF-8 already in the
> incoming data stream (should be as my system is wholly UTF-8 as far as I
> know and I created the directory name).
>
> So how do I change the code so I don't get the error? Do I just
> decode() the data first and then encode() it?
>
Note that you are getting a *UnicodeDecodeError*, not a UnicodeEncodeError.
Try omitting the encode() step, i. e. instead of
> file.write("".join(html).encode('utf-8'))
use
file.write("".join(html))
Background (applies to Python 2 only): the str type deals with bytes, not
code points. The right thing to do is to use .decode(...) to convert from
str to unicode and .encode(...) to convert from unicode to str. In Python 2
however the str type has an encode(...) method which is basically equivalent
to
class str:
    # imaginary python implementation of python2's str
    ...
    def encode(self, encoding):
        return self.decode("ascii").encode(encoding)
and is almost never called intentionally.
PS Python3 has relabeled unicode to str and thus uses unicode by default.
str was renamed to bytes and the annoying bytes.encode() method is gone.
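The implicit step described above can be reproduced on Python 3, where the hidden decode has to be written out by hand. This is only a sketch: `py2_str_encode` is a hypothetical helper that mimics the Python 2 behaviour, and the sample page is made up for illustration.

```python
def py2_str_encode(data, encoding):
    """Mimic Python 2's str.encode(): implicit ASCII decode, then encode."""
    return data.decode("ascii").encode(encoding)


# Made-up sample: "é" encoded as UTF-8 is the two bytes 0xc3 0xa9.
page = "<p>caf\u00e9</p>".encode("utf-8")

error_name = None
bad_byte = None
try:
    py2_str_encode(page, "utf-8")
except UnicodeDecodeError as exc:
    # The failure comes from the hidden decode, not from the encode.
    error_name = type(exc).__name__
    bad_byte = page[exc.start]

print(error_name, bad_byte)
```

The exception is raised before the requested encoding is ever used, which is why the traceback names the 'ascii' codec even though the code asked for 'utf-8'.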
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-04-29 07:46 -0400 |
| Message-ID | <mailman.1151.1367236032.3114.python-list@python.org> |
| In reply to | #44489 |
On 04/29/2013 05:47 AM, cl@isbd.net wrote:
A couple of generic comments: your email program made a mess of the
traceback by appending each source line to the location information.
Please mention your Python version & OS. Apparently you're running 2.7
on Linux or similar.
> I am debugging some code that creates a static HTML gallery from a
> directory hierarchy full of images. It's this package:-
> https://pypi.python.org/pypi/Gallery2.py/2.0
>
>
> It's basically working and does pretty much what I want so I'm happy to
> put some effort into it and fix things.
>
> The problem I'm currently chasing is that it can't cope with directory
> names that have accented characters in them, it fails when it tries to
> write the HTML that creates the page with the thumbnails on.
>
> The code that's failing is:-
>
> raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
> file = open(raw, "w")
> file.write("".join(html).encode('utf-8'))
You can't encode byte data, it's already encoded. So you're forcing the
Python system to implicitly decode it (using ASCII codec) before letting
you encode it to utf-8. If you think it's already in utf-8, then omit
the encode() call there.
Additionally, you can debug things with some simple print statements, at
least if you decompose your 3-function line so you can get at the
intermediate data. Split the line into three parts;
temp1 = "".join(html) #temp1 is byte data
temp2 = temp1.decode() #temp2 is unicode data
temp3 = temp2.encode("utf-8") #temp3 is byte data again
file.write(temp3)
Now, you'll presumably get the error on the second line, so examine the
bytes around byte 783. Make sure it's really in utf-8, and if it is,
then skip the decode and the encode. If it's not, then Andrew's advice
is pertinent.
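One way to carry out that inspection, sketched for Python 3. The helper names `bytes_around` and `is_valid_utf8` are hypothetical, and the sample data merely stands in for the real joined HTML, which is not shown in the thread.

```python
def bytes_around(data, position, width=8):
    """Return the slice of `data` surrounding byte offset `position`."""
    start = max(0, position - width)
    return data[start:position + width]


def is_valid_utf8(data):
    """Report whether `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical stand-in for "".join(html).
joined = b'<a href="caf\xc3\xa9/index.html">caf\xc3\xa9</a>'
offset = joined.index(b"\xc3")  # in the thread the failing offset was 783

print(repr(bytes_around(joined, offset)))
print(is_valid_utf8(joined))
print(is_valid_utf8(b"caf\xe9"))  # a Latin-1 encoded é is not valid UTF-8
```

If the bytes around the reported offset form a valid UTF-8 sequence such as 0xc3 0xa9, the data is already encoded and can be written as it is.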
I would also look at the variable html. It's a list, but what are the
types of the elements in it?
> file.close()
>
> The variable html is a list containing the lines of HTML to write to the
> file. It fails when it contains accented characters (an é in this
> case). Here's the traceback:-
>
> Traceback (most recent call last):
> File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 41, in run self._recurse()
> File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 272, in _recurse os.path.walk(self.props["sourcedir"], self.processDir, None)
> File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg)
> File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py", line 238, in walk func(arg, top, names)
> File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 263, in processDir self.createGallery()
> File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 215, in createGallery self.picturemanager.createPictureHTMLs(self.footer)
> File "/usr/local/lib/python2.7/dist-packages/gallery/picturemanager.py", line 84, in createPictureHTMLs curPic.createPictureHTML(self.galleryDirectory, self.getStylesheet(), self.fullsize, footer)
> File "/usr/local/lib/python2.7/dist-packages/gallery/picture.py", line 361, in createPictureHTML file.write("".join(html).encode('utf-8')) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 783: ordinal not in range(128)
>
>
>
> If I understand correctly the encode() is saying that it can't
> understand the data in the html because there's a character 0xc3 in it.
> I *think* this means that the é is encoded in UTF-8 already in the
> incoming data stream (should be as my system is wholly UTF-8 as far as I
> know and I created the directory name).
>
> So how do I change the code so I don't get the error? Do I just
> decode() the data first and then encode() it?
>
--
DaveA
| From | cl@isbd.net |
|---|---|
| Date | 2013-04-29 13:59 +0100 |
| Message-ID | <ee725a-c45.ln1@chris.zbmc.eu> |
| In reply to | #44495 |
Dave Angel <davea@davea.name> wrote:
> On 04/29/2013 05:47 AM, cl@isbd.net wrote:
>
> A couple of generic comments: your email program made a mess of the
> traceback by appending each source line to the location information.
>
What's me email program got to do with it? :-) I'm using a dedicated
newsreader (tin) as I posted via the gmane/usenet interface. The posting
looks perfectly OK to me when I read it back from usenet.
> Please mention your Python version & OS. Apparently you're running 2.7
> on Linux or similar.
>
Sorry, yes you're spot on.
> > I am debugging some code that creates a static HTML gallery from a
> > directory hierarchy full of images. It's this package:-
> > https://pypi.python.org/pypi/Gallery2.py/2.0
> >
> >
> > It's basically working and does pretty much what I want so I'm happy to
> > put some effort into it and fix things.
> >
> > The problem I'm currently chasing is that it can't cope with directory
> > names that have accented characters in them, it fails when it tries to
> > write the HTML that creates the page with the thumbnails on.
> >
> > The code that's failing is:-
> >
> > raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
> > file = open(raw, "w")
> > file.write("".join(html).encode('utf-8'))
>
> You can't encode byte data, it's already encoded. So you're forcing the
> Python system to implicitly decode it (using ASCII codec) before letting
> you encode it to utf-8. If you think it's already in utf-8, then omit
> the encode() call there.
>
It's the way the code was as I installed it from pypi. What you say
makes a lot of sense though, I'll remove the encode().
> Additionally, you can debug things with some simple print statements, at
> least if you decompose your 3-function line so you can get at the
> intermediate data. Split the line into three parts;
> temp1 = "".join(html) #temp1 is byte data
> temp2 = temp1.decode() #temp2 is unicode data
> temp3 = temp2.encode("utf-8") #temp3 is byte data again
> file.write(temp3)
>
OK, thanks for this and all the other advice on this thread.
--
Chris Green
| From | Robert Kern <robert.kern@gmail.com> |
|---|---|
| Date | 2013-04-29 14:11 +0100 |
| Message-ID | <mailman.1154.1367241090.3114.python-list@python.org> |
| In reply to | #44502 |
On 2013-04-29 13:59, cl@isbd.net wrote:
> Dave Angel <davea@davea.name> wrote:
>> On 04/29/2013 05:47 AM, cl@isbd.net wrote:
>>
>> A couple of generic comments: your email program made a mess of the
>> traceback by appending each source line to the location information.
>>
> What's me email program got to do with it? :-) I'm using a dedicated
> newsreader (tin) as I posted via the gmane/usenet interface. The posting
> looks perfectly OK to me when I read it back from usenet.
FWIW, I see the same problem Dave sees. I'm using gmane via Thunderbird.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
| From | cl@isbd.net |
|---|---|
| Date | 2013-04-29 15:38 +0100 |
| Message-ID | <t8d25a-pd6.ln1@chris.zbmc.eu> |
| In reply to | #44503 |
Robert Kern <robert.kern@gmail.com> wrote:
> On 2013-04-29 13:59, cl@isbd.net wrote:
> > Dave Angel <davea@davea.name> wrote:
> >> On 04/29/2013 05:47 AM, cl@isbd.net wrote:
> >>
> >> A couple of generic comments: your email program made a mess of the
> >> traceback by appending each source line to the location information.
> >>
> > What's me email program got to do with it? :-) I'm using a dedicated
> > newsreader (tin) as I posted via the gmane/usenet interface. The posting
> > looks perfectly OK to me when I read it back from usenet.
>
> FWIW, I see the same problem Dave sees. I'm using gmane via Thunderbird.
>
How strange. I think it must be something to do with the gmane interface
between news and mail then.
--
Chris Green
| From | Skip Montanaro <skip@pobox.com> |
|---|---|
| Date | 2013-04-29 09:56 -0500 |
| Message-ID | <mailman.1155.1367247394.3114.python-list@python.org> |
| In reply to | #44504 |
> How strange. I think it must be something to do with the gmane
> interface between news and mail then.
Probably. It was borked in Gmail as well...
Skip
| From | Terry Jan Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2013-04-29 18:02 -0400 |
| Message-ID | <mailman.1167.1367272922.3114.python-list@python.org> |
| In reply to | #44489 |
On 4/29/2013 5:47 AM, cl@isbd.net wrote:
> case). Here's the traceback:-
>
> File "/usr/local/lib/python2.7/dist-packages/gallery/picture.py", line 361,
> in createPictureHTML file.write("".join(html).encode('utf-8'))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
> 783: ordinal not in range(128)
Generic advice for anyone getting unicode errors:
unpack the composition producing the error
so that one can see which operation produced it.
In this case
s = "".join(html)
s = s.encode('utf-8')
file.write(s)
This also makes it possible to print intermediate results.
print(type(s), s) # would have been useful
Doing so would have immediately shown that in this case the error was
the encode operation, because s was already bytes.
For many other posts, the error with the same type of message has been
the print or write operation, due to output encoding issues, but that was
not the case here.
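The same unpacking can be sketched for Python 3, where bytes and text are distinct types and the check becomes explicit. The names `describe` and `encode_once` are hypothetical helpers, and the `html` list is made-up sample data.

```python
def describe(label, value):
    """Return a one-line summary of an intermediate value's type and content."""
    return "%s: %s %r" % (label, type(value).__name__, value)


def encode_once(value, encoding="utf-8"):
    """Encode text to bytes, but pass already-encoded bytes through untouched."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# Made-up sample: a list of already-encoded chunks, as in the failing case.
html = [b"<p>", "caf\u00e9".encode("utf-8"), b"</p>"]

joined = b"".join(html)
print(describe("joined", joined))

payload = encode_once(joined)
print(describe("payload", payload))
```

Printing the type after each step shows at a glance whether a value is already encoded, which is the information that was missing from the one-line version.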
| From | Tony the Tiger <tony@tiger.invalid> |
|---|---|
| Date | 2013-05-01 16:20 -0500 |
| Message-ID | <Zridnb6s37SKGhzMnZ2dnUVZ8u6dnZ2d@giganews.com> |
| In reply to | #44489 |
On Mon, 29 Apr 2013 10:47:46 +0100, cl wrote:
> raw = os.path.join(directory, self.getNameNoExtension()) +
> ".html"
> file = open(raw, "w")
> file.write("".join(html).encode('utf-8'))
> file.close()
This works for me:
Python 2.7.3 (default, Aug 1 2012, 05:16:07)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> html='<html><head><title>Blah</title><body>éåäö</body></html>'
>>> f=open('test.html', 'w')
>>> f.write(''.join(html.decode('utf-8').encode('utf-8')))
>>> f.close()
Perhaps there are better ways to do it.
/Grrr
--
___ ___
(\_--_/) | _ ._ _|_|_ _ |o _ _ ._
( 9 9 ) |(_)| |\/ |_| |(/_ ||(_|(/_|
stripes are forever - as overripe ferrets
| From | Ned Batchelder <ned@nedbatchelder.com> |
|---|---|
| Date | 2013-05-01 18:01 -0400 |
| Message-ID | <mailman.1224.1367445706.3114.python-list@python.org> |
| In reply to | #44603 |
On 5/1/2013 5:20 PM, Tony the Tiger wrote:
> On Mon, 29 Apr 2013 10:47:46 +0100, cl wrote:
>
>> raw = os.path.join(directory, self.getNameNoExtension()) +
>> ".html"
>> file = open(raw, "w")
>> file.write("".join(html).encode('utf-8'))
>> file.close()
> This works for me:
>
> Python 2.7.3 (default, Aug 1 2012, 05:16:07)
> [GCC 4.6.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> html='<html><head><title>Blah</title><body>éåäö</body></html>'
>>>> f=open('test.html', 'w')
>>>> f.write(''.join(html.decode('utf-8').encode('utf-8')))
>>>> f.close()
> Perhaps there are better ways to do it.
Your .write() line is exactly equivalent to:
f.write(html)
Because: if X is a UTF-8 bytestring, then:
X.decode('utf-8').encode('utf-8') == X
And if X is a bytestring, then:
''.join(X) == X
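Both identities can be checked with a short sketch, written here for Python 3. On Python 3 the join identity holds for text strings only (joining a bytes object with '' raises TypeError there), so the second check uses the decoded text; the sample string is the one from the quoted session.

```python
# "éåäö" as a UTF-8 byte string.
X = "\u00e9\u00e5\u00e4\u00f6".encode("utf-8")

# Decoding valid UTF-8 and re-encoding it gives back the same bytes.
round_trip = X.decode("utf-8").encode("utf-8")
print(round_trip == X)

# Joining the characters of a string with an empty separator rebuilds it.
text = X.decode("utf-8")
print("".join(text) == text)
```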
--Ned.
>
> /Grrr
| From | Ned Batchelder <ned@nedbatchelder.com> |
|---|---|
| Date | 2013-05-01 19:36 -0400 |
| Message-ID | <mailman.1226.1367451385.3114.python-list@python.org> |
| In reply to | #44489 |
On 4/29/2013 5:47 AM, cl@isbd.net wrote:
> If I understand correctly the encode() is saying that it can't
> understand the data in the html because there's a character 0xc3 in it.
> I *think* this means that the é is encoded in UTF-8 already in the
> incoming data stream (should be as my system is wholly UTF-8 as far as I
> know and I created the directory name).
>
> So how do I change the code so I don't get the error? Do I just
> decode() the data first and then encode() it?
>
BTW, I did a presentation at PyCon 2012 that many people have found helpful:
Pragmatic Unicode, or, How Do I Stop the Pain:
http://nedbatchelder.com/text/unipain.html . It explains the principles at
work here.
--Ned.