Groups > comp.lang.python > #38740 > unrolled thread

UnicodeEncodeError when not running script from IDE

Started by: Magnus Pettersson <magpettersson@gmail.com>
First post: 2013-02-12 02:43 -0800
Last post: 2013-02-12 11:07 -0500
Articles: 20 (8 participants)



Contents

  UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 02:43 -0800
    Re: UnicodeEncodeError when not running script from IDE Andrew Berg <bahamutzero8825@gmail.com> - 2013-02-12 05:01 -0600
      Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 06:24 -0800
        Re: UnicodeEncodeError when not running script from IDE Peter Otten <__peter__@web.de> - 2013-02-12 15:49 +0100
          Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 07:29 -0800
            Re: UnicodeEncodeError when not running script from IDE Peter Otten <__peter__@web.de> - 2013-02-12 16:48 +0100
            Re: UnicodeEncodeError when not running script from IDE Dave Angel <davea@davea.name> - 2013-02-12 10:58 -0500
              Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 09:12 -0800
                Re: UnicodeEncodeError when not running script from IDE Fabio Zadrozny <fabiofz@gmail.com> - 2013-02-12 18:04 -0200
                Re: UnicodeEncodeError when not running script from IDE Dave Angel <davea@davea.name> - 2013-02-12 15:51 -0500
                  Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 16:20 -0800
                    Re: UnicodeEncodeError when not running script from IDE Dave Angel <davea@davea.name> - 2013-02-12 22:51 -0500
                Re: UnicodeEncodeError when not running script from IDE Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-02-13 11:21 +1100
                  Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 16:40 -0800
          Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 07:29 -0800
        Re: UnicodeEncodeError when not running script from IDE MRAB <python@mrabarnett.plus.com> - 2013-02-12 21:03 +0000
      Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 06:24 -0800
    Re: UnicodeEncodeError when not running script from IDE Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-02-12 22:43 +1100
      Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 04:34 -0800
        Re: UnicodeEncodeError when not running script from IDE Terry Reedy <tjreedy@udel.edu> - 2013-02-12 11:07 -0500

#38740 — UnicodeEncodeError when not running script from IDE

From: Magnus Pettersson <magpettersson@gmail.com>
Date: 2013-02-12 02:43 -0800
Subject: UnicodeEncodeError when not running script from IDE
Message-ID: <650d144e-da3d-4ca7-ad3a-49f44ce9cbaa@googlegroups.com>
I am using Eclipse to write my Python scripts, and when I run them from inside Eclipse they work fine without errors.

But in almost every script that handles some form of special characters, like Swedish åäö and Chinese characters, I get Unicode errors when running the script externally with python.exe or pythonw.exe, even though the same scripts run completely fine from within Eclipse (standard PyDev projects, Python 2.7). I have usually launched the script GUI from within Eclipse because of this error, but now I want to get to the bottom of this so I don't have to open Eclipse every time I want to run a script!

Here is the error I get now when running the script with python.exe:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u898b' in position 32: character maps to <undefined>

What can I do to fix this?



#38741

From: Andrew Berg <bahamutzero8825@gmail.com>
Date: 2013-02-12 05:01 -0600
Message-ID: <mailman.1696.1360666894.2939.python-list@python.org>
In reply to: #38740
On 2013.02.12 04:43, Magnus Pettersson wrote:
> I am using Eclipse to write my python scripts and when i run them from inside eclipse they work fine without errors. 
> 
> But almost in every script that handle some form of special characters like swedish åäö and chinese characters etc i get Unicode errors when running the script externally with python.exe or pythonw.exe (but the scripts run completely fine from within Eclipse (standard pydev projects, python2.7). I have usually launched the script gui from wihin eclipse because of this error but now i want to get the bottom of this so i dont have to open eclipse everytime i want to run a script!
> 
> Here is the error i get now when running the script with python.exe:
> UnicodeEncodeError:'charmap' codec cant encode character u'\u898b' in position 32: character maps to <undefined>
> 
> what can i do to fix this?
> 
Since you didn't say what code actually does this, I'll turn to my
crystal ball. It says you are trying to print characters to a terminal
that doesn't support them. If that is the case, you could try changing
the code page (but only 3.3 supports cp65001, so that probably won't
help) or use replacement characters when printing.
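
A minimal sketch of the "replacement characters" option, assuming the terminal's code page is cp1252 (the error message above only says 'charmap', so the exact code page is a guess). The helper name `console_safe` is made up for illustration and is not part of the original script:

```python
def console_safe(text, encoding="cp1252"):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # errors="replace" substitutes '?' instead of raising UnicodeEncodeError.
    raw = text.encode(encoding, errors="replace")
    # Decode again so the caller still gets text, now safe to print.
    return raw.decode(encoding)


print(console_safe(u"kanji: \u898b"))
```

The kanji is lost in the output, so this only makes sense for diagnostic printing, not for data that has to be kept intact.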

-- 
CPython 3.3.0 | Windows NT 6.2.9200.16461 / FreeBSD 9.1-RELEASE



#38749

From: Magnus Pettersson <magpettersson@gmail.com>
Date: 2013-02-12 06:24 -0800
Message-ID: <0d6d513d-fa12-4d51-a33d-7bb38f1ee6b2@googlegroups.com>
In reply to: #38741
I have now tried removing the printing to the terminal and keeping only the writing to a .txt file on disk (which is the script's purpose):

with open(filepath,"a") as f:
    for card in cardlist:
        f.write(card+"\n")

The file it writes to exists and I'm just appending to it. When I run the script through Eclipse, all is fine. When I run it in a terminal I get this error instead:

File "K:\dev\python\webscraping\kanji_anki.py", line 69, in savefile
    f.write(card+"\n")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u898b' in position 32: ordinal not in range(128)

On Tuesday, February 12, 2013 12:01:19 PM UTC+1, Andrew Berg wrote:
> On 2013.02.12 04:43, Magnus Pettersson wrote:
> > I am using Eclipse to write my python scripts and when i run them from inside eclipse they work fine without errors. 
> > 
> > But almost in every script that handle some form of special characters like swedish åäö and chinese characters etc i get Unicode errors when running the script externally with python.exe or pythonw.exe (but the scripts run completely fine from within Eclipse (standard pydev projects, python2.7). I have usually launched the script gui from wihin eclipse because of this error but now i want to get the bottom of this so i dont have to open eclipse everytime i want to run a script!
> > 
> > Here is the error i get now when running the script with python.exe:
> > UnicodeEncodeError:'charmap' codec cant encode character u'\u898b' in position 32: character maps to <undefined>
> > 
> > what can i do to fix this?
> > 
> Since you didn't say what code actually does this, I'll turn to my
> crystal ball. It says you are trying to print characters to a terminal
> that doesn't support them. If that is the case, you could try changing
> the code page (but only 3.3 supports cp65001, so that probably won't
> help) or use replacement characters when printing.
> 
> -- 
> CPython 3.3.0 | Windows NT 6.2.9200.16461 / FreeBSD 9.1-RELEASE



#38751

From: Peter Otten <__peter__@web.de>
Date: 2013-02-12 15:49 +0100
Message-ID: <mailman.1700.1360680572.2939.python-list@python.org>
In reply to: #38749
Magnus Pettersson wrote:

> I have tried now to take away printing to terminal and just keeping the
> writing to a .txt file to disk (which is what the scripts purpose is):
> 
> with open(filepath,"a") as f:
>     for card in cardlist:
>         f.write(card+"\n")
> 
> The file it writes to exists and im just appending to it, but when i run
> the script trough eclipse, all is fine. When i run in terminal i get this
> error instead:
> 
> File "K:\dev\python\webscraping\kanji_anki.py", line 69, in savefile
>     f.write(card+"\n")
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u898b' in
> position 32: ordinal not in range(128)

Are you sure you are writing the same data? That would mean that pydev 
changes the default encoding -- which is evil.

A portable approach would be to use codecs.open() or io.open() instead of 
the built-in:

import io
with io.open(filepath, "a") as f:
    ...

io.open() uses UTF-8 by default, but you can specify other encodings with
io.open(filepath, mode, encoding=whatever).



#38760

From: Magnus Pettersson <magpettersson@gmail.com>
Date: 2013-02-12 07:29 -0800
Message-ID: <780d353a-de5c-4d04-8f51-11d81802351b@googlegroups.com>
In reply to: #38751
> Are you sure you are writing the same data? That would mean that pydev 
> changes the default encoding -- which is evil.
> 
> A portable approach would be to use codecs.open() or io.open() instead of 
> the built-in:
> 
> import io
> with io.open(filepath, "a") as f:
>     ...
> 
> io.open() uses UTF-8 by default, but you can specify other encodings with
> io.open(filepath, mode, encoding=whatever).


Interesting. PyDev must be doing something behind the scenes, because when I changed open() to io.open() I now get an error inside Eclipse:

f.write(card+"\n")
  File "C:\python27\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character u'\u53c8' in position 32: character maps to <undefined>

....

with io.open(filepath, "a", encoding="UTF-8") as f:

Then it works in Eclipse. But I seem to have encoding problems all over the place: things work in Eclipse but don't work outside of Eclipse/PyDev.

Here is the flow of my data. I'm terrible at using unicode/encode/decode, so I could use some help here:

kanji_anki_gui.py:

def on_addButton_clicked(self):
    #code
    # self.kanji.text() comes from a kanji letter written into a pyqt4 QLineEdit
    kanji = unicode(self.kanji.text())
    card = kanji_anki.scrapeKanji(kanji,tags)
    #more code

kanji_anki.py:

def scrapeKanji(kanji, tags="", onlymeaning=False):
    baseurl = unicode("http://www.romajidesu.com/kanji/")
    url = unicode(baseurl+kanji)
    #test to write out url to disk, works outside of eclipse now
    savefile([url])
    
    #getting webpage works fine in eclipse, prints "Oh no..." in terminal
    try:
        page = urllib2.urlopen(url)
    except:
        print "OH no website dont work"
        return None

    #Code that does some scraping and returns a string containing kanji letters
    return card

def savefile(cardlist,filepath="D:/iknow_kanji.txt"):
    with io.open(filepath, "a") as f:
        for card in cardlist:
            f.write(card+"\n")
    return True



#38764

From: Peter Otten <__peter__@web.de>
Date: 2013-02-12 16:48 +0100
Message-ID: <mailman.1709.1360684126.2939.python-list@python.org>
In reply to: #38760
Magnus Pettersson wrote:

>> io.open() uses UTF-8 by default, but you can specify other encodings with
>> 
>> io.open(filepath, mode, encoding=whatever).
> 
> 
> Interesting. Pydev must be doing something behind the scenes because when
> i changed open() to io.open() i get error inside of eclipse now:
> 
> f.write(card+"\n")
>   File "C:\python27\lib\encodings\cp1252.py", line 19, in encode
>     return codecs.charmap_encode(input,self.errors,encoding_table)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character u'\u53c8' in
> position 32: character maps to <undefined>
> 
> ....
> 
> io.open(filepath, "a", encoding="UTF-8") as f:
> 
> Then it works in eclipse. But I seem to be having an encoding problem all
> over the place that works in eclipse but dosnt work outside of eclipse
> pydev.

No, I was wrong about the default; it is actually 
locale.getpreferredencoding(). Sorry for the confusion.
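
To see why the same script behaves differently in the IDE and in a terminal, it may help to print the encodings Python falls back on in each environment. The following is only a diagnostic sketch; the function name `report_encodings` is hypothetical, and the comments describe the Python versions discussed in this thread:

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python falls back on when none is given."""
    return {
        # io.open() falls back to this when encoding= is omitted.
        "preferred": locale.getpreferredencoding(False),
        # Used when printing; can be None on Python 2 if output is redirected.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Used for implicit str/unicode conversions on Python 2.
        "default": sys.getdefaultencoding(),
    }


for name, value in sorted(report_encodings().items()):
    print("%s: %s" % (name, value))
```

Running it once from inside Eclipse and once from python.exe should show which of the values differ between the two environments.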



#38766

From: Dave Angel <davea@davea.name>
Date: 2013-02-12 10:58 -0500
Message-ID: <mailman.1711.1360684727.2939.python-list@python.org>
In reply to: #38760
On 02/12/2013 10:29 AM, Magnus Pettersson wrote:
>> Are you sure you are writing the same data? That would mean that pydev
>> changes the default encoding -- which is evil.
>>
>> A portable approach would be to use codecs.open() or io.open() instead of
>> the built-in:
>>
>> import io
>> with io.open(filepath, "a") as f:
>>      ...
>>
>> io.open() uses UTF-8 by default, but you can specify other encodings with

I think you are using Python 2.x, not Python 3.  So you'd better be 
explicit what encodings you want for each file.

>>
>> io.open(filepath, mode, encoding=whatever).
>
>
> Interesting. Pydev must be doing something behind the scenes because when i changed open() to io.open() i get error inside of eclipse now:

What encoding is this file?  Since you're appending to it, you really 
need to match the pre-existing encoding, or the next program to deal 
with it is in big trouble.  So using the io.open() without the encoding= 
keyword is probably a mistake.

>
> f.write(card+"\n")
>    File "C:\python27\lib\encodings\cp1252.py", line 19, in encode
>      return codecs.charmap_encode(input,self.errors,encoding_table)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character u'\u53c8' in position 32: character maps to <undefined>
>
> ....
>


-- 
DaveA



#38773

From: Magnus Pettersson <magpettersson@gmail.com>
Date: 2013-02-12 09:12 -0800
Message-ID: <a80a49be-b3c4-4549-bf94-523605dbbeec@googlegroups.com>
In reply to: #38766
> What encoding is this file?  Since you're appending to it, you really 
> need to match the pre-existing encoding, or the next program to deal 
> with it is in big trouble.  So using the io.open() without the encoding= 
> keyword is probably a mistake.

The .txt file is in UTF-8.

I have got it to work in the terminal now, but I don't understand what I'm doing, or why I didn't need all the unicode strings and encode mumbo jumbo in Eclipse.

#Here kanji = u"私"
baseurl = u"http://www.romajidesu.com/kanji/"
url = baseurl+kanji
savefile([url]) #this test works now. uses: io.open(filepath, "a",encoding="UTF-8") as f:
# This made the fetching of the website work. Why did I have to write url.encode("UTF-8") when url is already unicode? I feel I don't have a good understanding of this.
page = urllib2.urlopen(url.encode("UTF-8"))


....
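
The URL is text, but what goes over the wire is bytes, so the kanji has to be encoded at some point. One common convention is to encode the non-ASCII part as UTF-8 and then percent-encode it. The sketch below assumes the site accepts percent-encoded UTF-8 paths, which the thread does not confirm; `build_url` is a hypothetical helper:

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2


def build_url(base, kanji):
    """Append kanji to base as percent-encoded UTF-8 bytes."""
    # quote() turns every non-ASCII byte into a %XX escape,
    # so the result is plain ASCII and safe to hand to urlopen().
    return base + quote(kanji.encode("utf-8"))


print(build_url(u"http://www.romajidesu.com/kanji/", u"\u79c1"))
```

For u"\u79c1" (私) this yields a URL ending in %E7%A7%81, the three UTF-8 bytes of that character written as escapes.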



#38779

From: Fabio Zadrozny <fabiofz@gmail.com>
Date: 2013-02-12 18:04 -0200
Message-ID: <mailman.1722.1360699522.2939.python-list@python.org>
In reply to: #38773


Just to note, PyDev does something behind the scenes (it sets the encoding
for the console).

You may specify which encoding you want at your launch configuration (in
the 'common' tab you can set the encoding you want for the shell).
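
One mechanism a launcher can use to set the console encoding is the PYTHONIOENCODING environment variable, which Python reads at startup. Whether PyDev uses exactly this mechanism is an assumption here, not something stated above. The sketch starts a child interpreter with the variable set and reports what the child sees; `child_stdout_encoding` is a hypothetical helper:

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set; return its stdout encoding."""
    # Copy the current environment and override only the one variable.
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    return out.decode("ascii").strip()


print(child_stdout_encoding("utf-8"))
```

Setting the same variable in a terminal before launching the script is one thing to try when output that works in the IDE fails outside it, although the terminal itself must also be able to display the result.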

Cheers,

Fabio


On Tue, Feb 12, 2013 at 3:12 PM, Magnus Pettersson <magpettersson@gmail.com> wrote:

> > What encoding is this file?  Since you're appending to it, you really
> > need to match the pre-existing encoding, or the next program to deal
> > with it is in big trouble.  So using the io.open() without the encoding=
> > keyword is probably a mistake.
>
> The .txt file is in UTF-8
>
> I have got it to work now in the terminal, but i dont understand what im
> doing and why i didnt need to do all the unicode strings and encode mumbo
> jumbo in eclipse
>
> #Here kanji = u"私"
> baseurl = u"http://www.romajidesu.com/kanji/"
> url = baseurl+kanji
> savefile([url]) #this test works now. uses: io.open(filepath,
> "a",encoding="UTF-8") as f:
> # This made the fetching of the website work. Why did i have to write
> url.encode("UTF-8") when url already is unicode? I feel i dont have a good
> understanding of this.
> page = urllib2.urlopen(url.encode("UTF-8"))
>
>
> ....



#38784

From: Dave Angel <davea@davea.name>
Date: 2013-02-12 15:51 -0500
Message-ID: <mailman.1725.1360702303.2939.python-list@python.org>
In reply to: #38773
On 02/12/2013 12:12 PM, Magnus Pettersson wrote:
>> < snip >
>>
> #Here kanji = u"私"
> baseurl = u"http://www.romajidesu.com/kanji/"
> url = baseurl+kanji
> savefile([url]) #this test works now. uses: io.open(filepath, "a",encoding="UTF-8") as f:
> # This made the fetching of the website work.

You don't show the code that actually does the io.open(), nor the 
url.encode, so I'm not going to guess what you're actually doing.


> Why did i have to write url.encode("UTF-8") when url already is unicode? I feel i dont have a good understanding of this.
> page = urllib2.urlopen(url.encode("UTF-8"))

utf-8 is NOT unicode;  they are entirely different.  Unicode is 
conceptually 32 bits per character, and is an internal representation. 
There is room for about a million code points.  Nearly always when you're 
talking to an external device, you need bytes.  Since you can't cram 32 
bits into 8, you have to encode it.  Examples of devices would be any 
file, or the console.  Notice that sometimes you can use unicode 
directly for certain functions.  For example, the Windows file name is 
composed of Unicode characters, so Windows has function calls that 
accept Unicode directly.  But back to 8 bits:

One encoding is called ASCII, which is simply the bottommost 7 bits. 
But of course it gets an error if there are any characters above 127.

Other encodings try to pick an 8 bit subset of the million possible 
characters.  Again, if you happen to have a character that's not in that 
subset, you'll get an error.

There are also other encodings which are hard to describe, but 
fortunately pretty rare these days.

Then there's utf-8, which uses a variable length  bunch of bytes for 
each character.  It's designed to use the ASCII encoding for characters 
which are below 128, but uses two or more bytes for all the other 
characters.  So it works out well when most characters happen to be ASCII.

Once encoded, a stream of bytes can only be successfully interpreted if 
you use the same decoding when processing them.
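
The points above can be checked directly in an interpreter. This is a small illustrative sketch; the sample characters are chosen here and are not taken from the original script:

```python
# ASCII letter, Swedish a-ring (U+00E5), and the kanji from the traceback (U+898B).
samples = [u"a", u"\xe5", u"\u898b"]

for ch in samples:
    utf8 = ch.encode("utf-8")
    # UTF-8 uses one byte for ASCII and two or more for everything else.
    print("U+%04X -> %d byte(s) in UTF-8" % (ord(ch), len(utf8)))

# ASCII covers only code points 0..127, so anything above raises an error.
try:
    u"\xe5".encode("ascii")
except UnicodeEncodeError as exc:
    print("ascii failed: %s" % exc.reason)

# Decoding with the wrong codec raises no error but yields the wrong text.
wrong = u"\xe5".encode("utf-8").decode("latin-1")
print(wrong == u"\xe5")
```

The last comparison is the dangerous case: nothing fails, but one character has silently become two different ones.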



-- 
DaveA



#38791

From: Magnus Pettersson <magpettersson@gmail.com>
Date: 2013-02-12 16:20 -0800
Message-ID: <d7da5405-7de9-4eb4-935e-fafc131194d9@googlegroups.com>
In reply to: #38784
> You don't show the code that actually does the io.open(), nor the 
> url.encode, so I'm not going to guess what you're actually doing.

Hmm, I'm not sure what you mean, but I posted all the code needed in a previous post, so maybe you missed that one :)
In short, I basically just have:
import io
with io.open(myfile, "a", encoding="UTF-8") as f:
    f.write(my_ustring_with_kanji)

The url.encode() is my unicode string variable named "url" calling its built-in .encode() method, which was the thing I wondered why I needed to use. You explained that very well, thank you!

Just one more question, since all this is still a little fuzzy in my head.

When do I need to use .decode() in my code? Is it when I read lines from, for example, a UTF-8 file? And why didn't I have to use .encode() on my unicode string when running from within Eclipse/PyDev? Someone wrote that it has a default codec setting, so maybe that handles it for me there (which is kind of dangerous, since my programs won't work outside of Eclipse: I didn't do any encoding or use unicode strings in my script before, and it still worked).

--Magnus



#38803

From: Dave Angel <davea@davea.name>
Date: 2013-02-12 22:51 -0500
Message-ID: <mailman.1734.1360727537.2939.python-list@python.org>
In reply to: #38791
On 02/12/2013 07:20 PM, Magnus Pettersson wrote:
>
>> You don't show the code that actually does the io.open(), nor the
>>
>> url.encode, so I'm not going to guess what you're actually doing.
>
> Hmm im not sure what you mean but I wrote all code needed in a previous post so maybe you missed that one :)
> In short I basically just have:
> import io
> with io.open(myfile, "a", encoding="UTF-8") as f:
>      f.write(my_ustring_with_kanji)
>
> the url.encode() is my unicode string variable named "url" using the type built in  function .encode() which was the thing i wondered why i needed to use, which you explained very well, thank you!
>
> Just one more question since all this is still a little fuzzy in my head.
>
> When do i need to use .decode() in my code? is it when i read lines from f.ex a UTF-8 file? And why didn't I have to use .encode() on my unicode string when running from within eclipse pydev? someone wrote that it has a default codec setting so maybe that handles it for me there (which is kinda dangerous since my programs wont work running outside of eclipse since i didnt do any encoding or using of unicode strings before in my script and it still worked)
>

decode goes from bytes to unicode, the exact reverse.  And you're right, 
you'd need it on input from a file, and theoretically on input from a 
keyboard.

Conceptually, the easiest (not necessarily the fastest) thing to do is 
to always convert any input that comes in byte form to unicode, 
immediately on getting it. Then all processing in the code should be 
done in unicode form.  And you encode any output just before it goes out 
to a byte-device.
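
A minimal sketch of that pattern: decode at the input boundary, work on unicode, encode at the output boundary. The function `shout` and its uppercase "processing" step are invented for illustration only:

```python
def shout(raw_bytes, encoding="utf-8"):
    """Decode bytes on the way in, process as text, encode on the way out."""
    text = raw_bytes.decode(encoding)  # bytes -> unicode, immediately on input
    text = text.upper()                # all processing happens on unicode
    return text.encode(encoding)       # unicode -> bytes, just before output


incoming = u"sm\xf6rg\xe5s".encode("utf-8")  # pretend this was read from a file
outgoing = shout(incoming)
print(len(incoming) == len(outgoing))
```

Because the uppercase step runs on decoded text, the non-ASCII letters are handled as characters rather than as individual bytes.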

Python 3 makes that a natural, as the string type is already unicode, 
and it's byte strings that are the exception.  But all that really 
changes is the syntax you use.

There are defaults all over the place on these conversions.  And 
apparently, your IDE sets those defaults for you, which is a nasty 
thing, since it means things that run in the IDE will run differently 
outside of it.  You're just lucky the difference was an error.  If there 
weren't an error, you might have merrily been creating files with a 
mixture of encodings, which is a real disaster.

One other place where decoding happens is in your source file.  There is 
an optional encoding line you can place at the top of the file 
(immediately after the shebang line) to change how unicode literals with 
non-ASCII characters are interpreted.  Remember your source file is a 
byte file edited with some text editor, and it has been encoded, 
deliberately or accidentally by that editor.  You can avoid the issue by 
always using escape sequences, but if for example, you copy/paste some 
unicode string from an email message into your source code, you'd like 
it to be equivalent.  If your email program, your text editor, and your 
Python compiler are all on the same page, it works amazingly simply.

(That encoding line may affect other things;  I know in Python 3, it 
makes non-ASCII attribute names possible, but I'm not sure if it matters 
in Python 2.x other than for unicode literal strings)
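
A short sketch of the encoding line in use, assuming the file is actually saved as UTF-8 by the editor. The declaration only has an effect when it is the first or second line of the file:

```python
# -*- coding: utf-8 -*-
# The line above tells Python 2 how the bytes of this source file are encoded.
# (Python 3 assumes UTF-8 when no declaration is present.)

literal = u"見"      # relies on the declared source encoding being correct
escaped = u"\u898b"  # pure ASCII source, independent of the file's encoding

print(literal == escaped)
```

If the editor saved the file in a different encoding than the one declared, the literal would decode to different characters (or fail to compile), while the escaped form would be unaffected.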


-- 
DaveA



#38792

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-02-13 11:21 +1100
Message-ID: <511adc9a$0$29972$c3e8da3$5496439d@news.astraweb.com>
In reply to: #38773
Magnus Pettersson wrote:


> # This made the fetching of the website work. Why did i have to write
> # url.encode("UTF-8") when url already is unicode? I feel i dont have a
> # good understanding of this.
> page = urllib2.urlopen(url.encode("UTF-8"))


Start here:

"The Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No Excuses!)"

http://www.joelonsoftware.com/articles/Unicode.html


Basically, Unicode is an in-memory data format. Python knows about Unicode
characters (to be technical: code points), but files on disk do not.
Neither do network protocols, or terminals, or other simple devices. They
only understand bytes.

So when you have Unicode text, and you want to write it to a file on disk,
or print it, or send it over the network to another machine, it has to be
*encoded* into bytes, and then *decoded* back into Unicode when you read it
from the file again. Sometimes the system will "helpfully" do that encoding
and decoding automatically for you, which is fine when it works but when it
doesn't it can be perplexing.

There are many, many, many different *encoding schemes*. ASCII is one. UTF-8
is another. And then there are about a bazillion legacy encodings which, if
you are lucky, you will never need to care about. Only some encodings can
deal with the entire range of Unicode characters, most can only deal with a
(typically small) subset of possible characters. E.g. ASCII only knows
about 128 characters out of the million-plus that Unicode deals with.
Latin-1 can handle close to 256 different characters. If you have a say in
the matter, always use UTF-8, since it can handle the full set of Unicode
characters in the most efficient manner.
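
To make the coverage difference concrete, here is a small sketch that tries a few codecs on the kinds of text mentioned in this thread. The helper `can_encode` and the sample strings are illustrative choices, not part of the original script:

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


samples = {"swedish": u"\xe5\xe4\xf6", "kanji": u"\u898b"}

for codec in ("ascii", "latin-1", "cp1252", "utf-8"):
    results = [(name, can_encode(text, codec)) for name, text in sorted(samples.items())]
    print("%s: %s" % (codec, results))
```

Of the four codecs tried, only UTF-8 accepts both samples, which matches the two different errors ('ascii' and 'charmap') reported earlier in the thread.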


-- 
Steven



#38793

From: Magnus Pettersson <magpettersson@gmail.com>
Date: 2013-02-12 16:40 -0800
Message-ID: <b9306c81-7129-4af8-ae2c-ceae66416de3@googlegroups.com>
In reply to: #38792
Thanks a lot Steven, you gave me a good AHA experience! :)

Now I understand why I had to use encoding when calling urllib2! So basically Eclipse PyDev does this in the background for me, and its console supports UTF-8, so that's why I never had to think about it before (and why some scripts tend to fail with Unicode errors when run outside of the Eclipse IDE).

cheers
Magnus

> Start here:
> 
> "The Absolute Minimum Every Software Developer Absolutely, Positively Must
> Know About Unicode and Character Sets (No Excuses!)"
> 
> http://www.joelonsoftware.com/articles/Unicode.html
> 
> Basically, Unicode is an in-memory data format. Python knows about Unicode
> characters (to be technical: code points), but files on disk do not.
> Neither do network protocols, or terminals, or other simple devices. They
> only understand bytes.
> 
> So when you have Unicode text, and you want to write it to a file on disk,
> or print it, or send it over the network to another machine, it has to be
> *encoded* into bytes, and then *decoded* back into Unicode when you read it
> from the file again. Sometimes the system will "helpfully" do that encoding
> and decoding automatically for you, which is fine when it works but when it
> doesn't it can be perplexing.
> 
> There are many, many, many different *encoding schemes*. ASCII is one. UTF-8
> is another. And then there are about a bazillion legacy encodings which, if
> you are lucky, you will never need to care about. Only some encodings can
> deal with the entire range of Unicode characters, most can only deal with a
> (typically small) subset of possible characters. E.g. ASCII only knows
> about 127 characters out of the million-plus that Unicode deals with.
> Latin-1 can handle close to 256 different characters. If you have a say in
> the matter, always use UTF-8, since it can handle the full set of Unicode
> characters in the most efficient manner.
> 
> -- 
> Steven



#38786

From: MRAB <python@mrabarnett.plus.com>
Date: 2013-02-12 21:03 +0000
Message-ID: <mailman.1727.1360702992.2939.python-list@python.org>
In reply to: #38749
On 2013-02-12 14:24, Magnus Pettersson wrote:
> I have tried now to take away printing to terminal and just keeping the writing to a .txt file to disk (which is what the scripts purpose is):
>
> with open(filepath,"a") as f:
>      for card in cardlist:
>          f.write(card+"\n")
>
> The file it writes to exists and im just appending to it, but when i run the script trough eclipse, all is fine. When i run in terminal i get this error instead:
>
> File "K:\dev\python\webscraping\kanji_anki.py", line 69, in savefile
>      f.write(card+"\n")
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u898b' in position 32: ordinal not in range(128)
>
When you open the file, tell it what encoding to use. For example:

with open(filepath, "a", encoding="utf-8") as f:
     for card in cardlist:
         f.write(card + "\n")
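
Note that the built-in open() accepts encoding= only on Python 3; on Python 2.7, which the original poster is using, it raises a TypeError. io.open() takes the same argument on both versions. Below is a sketch of savefile written that way; the temporary path and the sample cards are made up so the example can run on its own:

```python
import io
import os
import tempfile


def savefile(cardlist, filepath, encoding="utf-8"):
    """Append each card as one line, with an explicit encoding."""
    with io.open(filepath, "a", encoding=encoding) as f:
        for card in cardlist:
            # io.open() in text mode expects unicode, hence the u"\n".
            f.write(card + u"\n")


demo_path = os.path.join(tempfile.mkdtemp(), "iknow_kanji.txt")
savefile([u"\u898b", u"\u53c8"], demo_path)

# Read the raw bytes back to confirm they really are UTF-8.
with io.open(demo_path, "rb") as f:
    raw = f.read()
print(len(raw))
```

Passing the encoding explicitly means the result no longer depends on whether the script is started from Eclipse or from a terminal.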

[toc] | [prev] | [next] | [standalone]


#38780

FromMagnus Pettersson <magpettersson@gmail.com>
Date2013-02-12 06:24 -0800
Message-ID<mailman.1723.1360700325.2939.python-list@python.org>
In reply to#38741
I have tried now to take away printing to terminal and just keeping the writing to a .txt file to disk (which is what the scripts purpose is):

with open(filepath,"a") as f:
    for card in cardlist:
        f.write(card+"\n")

The file it writes to exists and im just appending to it, but when i run the script trough eclipse, all is fine. When i run in terminal i get this error instead:

File "K:\dev\python\webscraping\kanji_anki.py", line 69, in savefile
    f.write(card+"\n")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u898b' in position 3
2: ordinal not in range(128)

On Tuesday, February 12, 2013 12:01:19 PM UTC+1, Andrew Berg wrote:
> On 2013.02.12 04:43, Magnus Pettersson wrote:
> > I am using Eclipse to write my Python scripts, and when I run them from inside Eclipse they work fine without errors.
> >
> > But in almost every script that handles some form of special characters, like Swedish åäö or Chinese characters, I get Unicode errors when running the script externally with python.exe or pythonw.exe (the scripts run completely fine from within Eclipse; standard PyDev projects, Python 2.7). I have usually launched the script GUI from within Eclipse because of this error, but now I want to get to the bottom of this so I don't have to open Eclipse every time I want to run a script!
> >
> > Here is the error I get now when running the script with python.exe:
> > UnicodeEncodeError: 'charmap' codec can't encode character u'\u898b' in position 32: character maps to <undefined>
> >
> > What can I do to fix this?
> >
> Since you didn't say what code actually does this, I'll turn to my
> crystal ball. It says you are trying to print characters to a terminal
> that doesn't support them. If that is the case, you could try changing
> the code page (but only 3.3 supports cp65001, so that probably won't
> help) or use replacement characters when printing.
>
> -- 
> CPython 3.3.0 | Windows NT 6.2.9200.16461 / FreeBSD 9.1-RELEASE



#38743

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-02-12 22:43 +1100
Message-ID: <511a2ac5$0$29988$c3e8da3$5496439d@news.astraweb.com>
In reply to: #38740

Magnus Pettersson wrote:

> I am using Eclipse to write my Python scripts, and when I run them from
> inside Eclipse they work fine without errors.
> 
> But in almost every script that handles some form of special characters,
> like Swedish åäö or Chinese characters,

A comment: they are not "special" characters. They're merely not American.


> I get Unicode errors when
> running the script externally with python.exe or pythonw.exe (the
> scripts run completely fine from within Eclipse; standard PyDev projects,
> Python 2.7). I have usually launched the script GUI from within Eclipse
> because of this error, but now I want to get to the bottom of this so I don't
> have to open Eclipse every time I want to run a script!
> 
> Here is the error I get now when running the script with python.exe:
> UnicodeEncodeError: 'charmap' codec can't encode character u'\u898b' in
> position 32: character maps to <undefined>

Please show the *complete* traceback, including the line of code that causes
the exception.

 
> What can I do to fix this?

My guess is that you are trying to print a character which your terminal
cannot display. My terminal is set to use UTF-8, and so it can display it
fine:

py> c = u'\u898b'
py> print(c)
見


(or at least it would display fine if the font used had a glyph for that
character). Why there are still terminals in the world that don't default
to UTF-8 is beyond me.

If I manually change the terminal's encoding to Western European ISO 8859-1,
I get some mojibake:

py> print(c)
è¦


I can't replicate the exception you give, so I assume it is specific to
Windows.
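
Both behaviours described above can be reproduced without touching the terminal settings, by encoding the character by hand. This is only a sketch: cp850 is an assumed stand-in for a legacy Windows console code page (the thread never says which one the original poster's console uses), and the variable names are illustrative.

```python
c = u"\u898b"

# A legacy single-byte code page has no slot for this character,
# so encoding to it raises UnicodeEncodeError.
error = None
try:
    c.encode("cp850")
except UnicodeEncodeError as exc:
    error = exc

# With the "replace" error handler the unencodable character
# becomes "?" instead of raising.
replaced = c.encode("cp850", "replace")

# Mojibake: the UTF-8 bytes of the character read back as ISO 8859-1.
# The third character is an invisible control code, which is why only
# two characters show up on screen.
mojibake = c.encode("utf-8").decode("latin-1")
```

Printing to a console amounts to the first step done implicitly with the console's encoding, which is why the same script can succeed in one environment and fail in another.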




-- 
Steven



#38745

From: Magnus Pettersson <magpettersson@gmail.com>
Date: 2013-02-12 04:34 -0800
Message-ID: <9bc96ed6-8cb7-421a-a25a-4a7c9404463a@googlegroups.com>
In reply to: #38743

Ahh, so it's the actual printing that makes it error out outside of Eclipse, because it's printing to a different terminal. It's the default DOS terminal in Windows that runs when I run the script with python.exe, and I guess it's the same when I run it with pythonw.exe, just that the terminal window is not opened, only the PyQt GUI in this case.

I will try to fix it now that I know what it is :)

I never thought about the terminal. Last time I had the same problem I just played around for hours with unicode encode and decode and all that not-so-fun stuff :)

Andrew Berg: Thanks, your crystal ball seems to be right :P

On Tuesday, February 12, 2013 12:43:00 PM UTC+1, Steven D'Aprano wrote:
> Magnus Pettersson wrote:
>
> > I am using Eclipse to write my Python scripts, and when I run them from
> > inside Eclipse they work fine without errors.
> >
> > But in almost every script that handles some form of special characters,
> > like Swedish åäö or Chinese characters,
>
> A comment: they are not "special" characters. They're merely not American.
>
> > I get Unicode errors when
> > running the script externally with python.exe or pythonw.exe (the
> > scripts run completely fine from within Eclipse; standard PyDev projects,
> > Python 2.7). I have usually launched the script GUI from within Eclipse
> > because of this error, but now I want to get to the bottom of this so I don't
> > have to open Eclipse every time I want to run a script!
> >
> > Here is the error I get now when running the script with python.exe:
> > UnicodeEncodeError: 'charmap' codec can't encode character u'\u898b' in
> > position 32: character maps to <undefined>
>
> Please show the *complete* traceback, including the line of code that causes
> the exception.
>
> > What can I do to fix this?
>
> My guess is that you are trying to print a character which your terminal
> cannot display. My terminal is set to use UTF-8, and so it can display it
> fine:
>
> py> c = u'\u898b'
> py> print(c)
> 見
>
> (or at least it would display fine if the font used had a glyph for that
> character). Why there are still terminals in the world that don't default
> to UTF-8 is beyond me.
>
> If I manually change the terminal's encoding to Western European ISO 8859-1,
> I get some mojibake:
>
> py> print(c)
> è¦
>
> I can't replicate the exception you give, so I assume it is specific to
> Windows.
>
> -- 
> Steven



#38769

From: Terry Reedy <tjreedy@udel.edu>
Date: 2013-02-12 11:07 -0500
Message-ID: <mailman.1712.1360686721.2939.python-list@python.org>
In reply to: #38745

On 2/12/2013 7:34 AM, Magnus Pettersson wrote:
> Ahh, so it's the actual printing that makes it error out outside of
> Eclipse, because it's printing to a different terminal. It's
> the default DOS terminal in Windows that runs when I run the script
> with python.exe, and I guess it's the same when I run it with pythonw.exe,
> just that the terminal window is not opened, only the PyQt GUI in
> this case.

Writing

txt = <expression involving coding>
print(txt)

rather than

print(<expression involving coding>)

makes it easier to tell whether a UnicodeError comes from evaluating the 
expression or from the print operation.
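
As a rough illustration of that separation, the sketch below runs the two stages in separate try blocks and reports which one failed. All names here (describe_failure, ascii_terminal) are hypothetical, and the "terminal" is simulated by an ASCII encode rather than a real print, so this is an assumption-laden demo rather than a diagnostic tool.

```python
def describe_failure(make_text, emit):
    """Return which stage raised a UnicodeError: building or emitting."""
    try:
        txt = make_text()   # stage 1: the expression involving coding
    except UnicodeError:
        return "expression"
    try:
        emit(txt)           # stage 2: the print operation
    except UnicodeError:
        return "output"
    return "ok"


def ascii_terminal(text):
    # Stand-in for printing to a terminal that only accepts ASCII.
    text.encode("ascii")


raw = b"\xe8\xa6\x8b"  # UTF-8 bytes of u'\u898b'

# Decoding works, so the failure is in the output stage.
stage_good_decode = describe_failure(lambda: raw.decode("utf-8"), ascii_terminal)

# Decoding with the wrong codec fails before anything is printed.
stage_bad_decode = describe_failure(lambda: raw.decode("ascii"), ascii_terminal)
```

In a real script the same effect comes from simply putting the assignment and the print on separate lines, so the traceback line number points at one stage or the other.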

Using 3.3 instead of 2.7 will make using unicode somewhat easier.

-- 
Terry Jan Reedy


