Groups > comp.lang.python > #38740 > unrolled thread
| Started by | Magnus Pettersson <magpettersson@gmail.com> |
|---|---|
| First post | 2013-02-12 02:43 -0800 |
| Last post | 2013-02-12 11:07 -0500 |
| Articles | 20 — 8 participants |
UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 02:43 -0800
Re: UnicodeEncodeError when not running script from IDE Andrew Berg <bahamutzero8825@gmail.com> - 2013-02-12 05:01 -0600
Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 06:24 -0800
Re: UnicodeEncodeError when not running script from IDE Peter Otten <__peter__@web.de> - 2013-02-12 15:49 +0100
Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 07:29 -0800
Re: UnicodeEncodeError when not running script from IDE Peter Otten <__peter__@web.de> - 2013-02-12 16:48 +0100
Re: UnicodeEncodeError when not running script from IDE Dave Angel <davea@davea.name> - 2013-02-12 10:58 -0500
Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 09:12 -0800
Re: UnicodeEncodeError when not running script from IDE Fabio Zadrozny <fabiofz@gmail.com> - 2013-02-12 18:04 -0200
Re: UnicodeEncodeError when not running script from IDE Dave Angel <davea@davea.name> - 2013-02-12 15:51 -0500
Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 16:20 -0800
Re: UnicodeEncodeError when not running script from IDE Dave Angel <davea@davea.name> - 2013-02-12 22:51 -0500
Re: UnicodeEncodeError when not running script from IDE Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-02-13 11:21 +1100
Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 16:40 -0800
Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 07:29 -0800
Re: UnicodeEncodeError when not running script from IDE MRAB <python@mrabarnett.plus.com> - 2013-02-12 21:03 +0000
Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 06:24 -0800
Re: UnicodeEncodeError when not running script from IDE Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-02-12 22:43 +1100
Re: UnicodeEncodeError when not running script from IDE Magnus Pettersson <magpettersson@gmail.com> - 2013-02-12 04:34 -0800
Re: UnicodeEncodeError when not running script from IDE Terry Reedy <tjreedy@udel.edu> - 2013-02-12 11:07 -0500
| From | Magnus Pettersson <magpettersson@gmail.com> |
|---|---|
| Date | 2013-02-12 02:43 -0800 |
| Subject | UnicodeEncodeError when not running script from IDE |
| Message-ID | <650d144e-da3d-4ca7-ad3a-49f44ce9cbaa@googlegroups.com> |
I am using Eclipse to write my Python scripts, and when I run them from inside Eclipse they work fine without errors.
But in almost every script that handles some form of special characters like Swedish åäö and Chinese characters, I get Unicode errors when running the script externally with python.exe or pythonw.exe (the same scripts run completely fine from within Eclipse: standard PyDev projects, Python 2.7). I have usually launched the script GUI from within Eclipse because of this error, but now I want to get to the bottom of this so I don't have to open Eclipse every time I want to run a script!
Here is the error I get now when running the script with python.exe:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u898b' in position 32: character maps to <undefined>
What can I do to fix this?
| From | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| Date | 2013-02-12 05:01 -0600 |
| Message-ID | <mailman.1696.1360666894.2939.python-list@python.org> |
| In reply to | #38740 |
On 2013.02.12 04:43, Magnus Pettersson wrote:
> I am using Eclipse to write my python scripts and when i run them from inside eclipse they work fine without errors.
>
> But almost in every script that handle some form of special characters like swedish åäö and chinese characters etc i get Unicode errors when running the script externally with python.exe or pythonw.exe (but the scripts run completely fine from within Eclipse (standard pydev projects, python2.7). I have usually launched the script gui from wihin eclipse because of this error but now i want to get the bottom of this so i dont have to open eclipse everytime i want to run a script!
>
> Here is the error i get now when running the script with python.exe:
> UnicodeEncodeError:'charmap' codec cant encode character u'\u898b' in position 32: character maps to <undefined>
>
> what can i do to fix this?
>
Since you didn't say what code actually does this, I'll turn to my crystal ball. It says you are trying to print characters to a terminal that doesn't support them. If that is the case, you could try changing the code page (but only 3.3 supports cp65001, so that probably won't help) or use replacement characters when printing.
--
CPython 3.3.0 | Windows NT 6.2.9200.16461 / FreeBSD 9.1-RELEASE
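The "replacement characters" suggestion above can be sketched roughly as follows. This is an illustration only, not code from the thread: the helper name safe_for_console is hypothetical, and the cp1252 default is an assumption standing in for whatever encoding the console really uses.

```python
from __future__ import print_function


def safe_for_console(text, encoding="cp1252"):
    """Return text with characters the target encoding lacks replaced by '?'.

    `encoding` is an assumption; pass the encoding your console actually uses.
    """
    # errors="replace" substitutes '?' instead of raising UnicodeEncodeError.
    raw = text.encode(encoding, errors="replace")
    # Decode again so the caller still gets text, not bytes.
    return raw.decode(encoding)


# The kanji is not in cp1252, so it is replaced rather than crashing print.
print(safe_for_console(u"kanji: \u898b"))
```

The trade-off is that the output is lossy: the kanji is gone from what is printed, so this suits diagnostic output better than data that has to be preserved.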
| From | Magnus Pettersson <magpettersson@gmail.com> |
|---|---|
| Date | 2013-02-12 06:24 -0800 |
| Message-ID | <0d6d513d-fa12-4d51-a33d-7bb38f1ee6b2@googlegroups.com> |
| In reply to | #38741 |
I have now tried removing the printing to the terminal and keeping only the writing to a .txt file on disk (which is the script's purpose):
    with open(filepath,"a") as f:
        for card in cardlist:
            f.write(card+"\n")
The file it writes to exists and I'm just appending to it. When I run the script through Eclipse, all is fine. When I run it in a terminal I get this error instead:
  File "K:\dev\python\webscraping\kanji_anki.py", line 69, in savefile
    f.write(card+"\n")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u898b' in position 32: ordinal not in range(128)
On Tuesday, February 12, 2013 12:01:19 PM UTC+1, Andrew Berg wrote:
> On 2013.02.12 04:43, Magnus Pettersson wrote:
> > I am using Eclipse to write my python scripts and when i run them from inside eclipse they work fine without errors.
> >
> > But almost in every script that handle some form of special characters like swedish åäö and chinese characters etc i get Unicode errors when running the script externally with python.exe or pythonw.exe (but the scripts run completely fine from within Eclipse (standard pydev projects, python2.7). I have usually launched the script gui from wihin eclipse because of this error but now i want to get the bottom of this so i dont have to open eclipse everytime i want to run a script!
> >
> > Here is the error i get now when running the script with python.exe:
> > UnicodeEncodeError:'charmap' codec cant encode character u'\u898b' in position 32: character maps to <undefined>
> >
> > what can i do to fix this?
> >
> Since you didn't say what code actually does this, I'll turn to my
> crystal ball. It says you are trying to print characters to a terminal
> that doesn't support them. If that is the case, you could try changing
> the code page (but only 3.3 supports cp65001, so that probably won't
> help) or use replacement characters when printing.
>
> --
> CPython 3.3.0 | Windows NT 6.2.9200.16461 / FreeBSD 9.1-RELEASE
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2013-02-12 15:49 +0100 |
| Message-ID | <mailman.1700.1360680572.2939.python-list@python.org> |
| In reply to | #38749 |
Magnus Pettersson wrote:
> I have tried now to take away printing to terminal and just keeping the
> writing to a .txt file to disk (which is what the scripts purpose is):
>
> with open(filepath,"a") as f:
>     for card in cardlist:
>         f.write(card+"\n")
>
> The file it writes to exists and im just appending to it, but when i run
> the script trough eclipse, all is fine. When i run in terminal i get this
> error instead:
>
> File "K:\dev\python\webscraping\kanji_anki.py", line 69, in savefile
> f.write(card+"\n")
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u898b' in
> position 32: ordinal not in range(128)
Are you sure you are writing the same data? That would mean that pydev
changes the default encoding -- which is evil.
A portable approach would be to use codecs.open() or io.open() instead of
the built-in:
    import io
    with io.open(filepath, "a") as f:
        ...
io.open() uses UTF-8 by default, but you can specify other encodings with
io.open(filepath, mode, encoding=whatever).
| From | Magnus Pettersson <magpettersson@gmail.com> |
|---|---|
| Date | 2013-02-12 07:29 -0800 |
| Message-ID | <780d353a-de5c-4d04-8f51-11d81802351b@googlegroups.com> |
| In reply to | #38751 |
> Are you sure you are writing the same data? That would mean that pydev
> changes the default encoding -- which is evil.
>
> A portable approach would be to use codecs.open() or io.open() instead of
> the built-in:
>
> import io
> with io.open(filepath, "a") as f:
>     ...
>
> io.open() uses UTF-8 by default, but you can specify other encodings with
> io.open(filepath, mode, encoding=whatever).
Interesting. PyDev must be doing something behind the scenes, because when I changed open() to io.open() I now get an error inside Eclipse:
    f.write(card+"\n")
  File "C:\python27\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character u'\u53c8' in position 32: character maps to <undefined>
....
    with io.open(filepath, "a", encoding="UTF-8") as f:
Then it works in Eclipse. But I seem to have encoding problems all over the place: things work in Eclipse but don't work outside of Eclipse PyDev.
Here is the flow of my data. I'm terrible at using unicode/encode/decode, so I could use some help here:
kanji_anki_gui.py:
    def on_addButton_clicked(self):
        #code
        # self.kanji.text() comes from a kanji letter written into a pyqt4 QLineEdit
        kanji = unicode(self.kanji.text())
        card = kanji_anki.scrapeKanji(kanji,tags)
        #more code
kanji_anki.py:
    def scrapeKanji(kanji, tags="", onlymeaning=False):
        baseurl = unicode("http://www.romajidesu.com/kanji/")
        url = unicode(baseurl+kanji)
        #test to write out url to disk, works outside of eclipse now
        savefile([url])
        #getting webpage works fine in eclipse, prints "Oh no..." in terminal
        try:
            page = urllib2.urlopen(url)
        except:
            print "OH no website dont work"
            return None
        #Code that does some scraping and returns a string containing kanji letters
        return card
    def savefile(cardlist,filepath="D:/iknow_kanji.txt"):
        with io.open(filepath, "a") as f:
            for card in cardlist:
                f.write(card+"\n")
        return True
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2013-02-12 16:48 +0100 |
| Message-ID | <mailman.1709.1360684126.2939.python-list@python.org> |
| In reply to | #38760 |
Magnus Pettersson wrote:
>> io.open() uses UTF-8 by default, but you can specify other encodings with
>> io.open(filepath, mode, encoding=whatever).
>
> Interesting. Pydev must be doing something behind the scenes because when
> i changed open() to io.open() i get error inside of eclipse now:
>
>     f.write(card+"\n")
>   File "C:\python27\lib\encodings\cp1252.py", line 19, in encode
>     return codecs.charmap_encode(input,self.errors,encoding_table)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character u'\u53c8' in
> position 32: character maps to <undefined>
>
> ....
>
> io.open(filepath, "a", encoding="UTF-8") as f:
>
> Then it works in eclipse. But I seem to be having an encoding problem all
> over the place that works in eclipse but dosnt work outside of eclipse
> pydev.
No, I was wrong about the default; it is actually locale.getpreferredencoding(). Sorry for the confusion.
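To see the correction above in practice, the following sketch prints the locale-dependent default and then writes a file with an explicit encoding instead. The temporary file path is a stand-in, not the file from the thread, and the default reported will vary by machine (the traceback above suggests cp1252 on the poster's system; newer Pythons running in UTF-8 mode report UTF-8).

```python
from __future__ import print_function

import io
import locale
import os
import tempfile

# The encoding text-mode io.open() falls back to when encoding= is omitted.
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "cards.txt")

# Passing encoding= explicitly removes the dependence on the locale.
with io.open(path, "a", encoding="utf-8") as f:
    f.write(u"\u53c8\n")

# Read back with the same encoding to get the original text.
with io.open(path, "r", encoding="utf-8") as f:
    content = f.read()
```

With the explicit encoding, the write succeeds regardless of which launcher or console starts the script.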
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-02-12 10:58 -0500 |
| Message-ID | <mailman.1711.1360684727.2939.python-list@python.org> |
| In reply to | #38760 |
On 02/12/2013 10:29 AM, Magnus Pettersson wrote:
>> Are you sure you are writing the same data? That would mean that pydev
>> changes the default encoding -- which is evil.
>>
>> A portable approach would be to use codecs.open() or io.open() instead of
>> the built-in:
>>
>> import io
>> with io.open(filepath, "a") as f:
>>     ...
>>
>> io.open() uses UTF-8 by default, but you can specify other encodings with
I think you are using Python 2.x, not Python 3. So you'd better be explicit what encodings you want for each file.
>> io.open(filepath, mode, encoding=whatever).
>
> Interesting. Pydev must be doing something behind the scenes because when i changed open() to io.open() i get error inside of eclipse now:
What encoding is this file? Since you're appending to it, you really need to match the pre-existing encoding, or the next program to deal with it is in big trouble. So using the io.open() without the encoding= keyword is probably a mistake.
>     f.write(card+"\n")
>   File "C:\python27\lib\encodings\cp1252.py", line 19, in encode
>     return codecs.charmap_encode(input,self.errors,encoding_table)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character u'\u53c8' in position 32: character maps to <undefined>
>
> ....
>
--
DaveA
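One way to act on the advice about matching the pre-existing encoding is to check the file before appending to it. The sketch below is an assumption-laden illustration: the helper name append_utf8 is hypothetical, and "decodes cleanly as UTF-8" is only a heuristic for "is UTF-8" (a pure-ASCII file passes too, which is harmless because ASCII is a subset of UTF-8).

```python
import io
import os
import tempfile


def append_utf8(path, lines):
    """Append text lines as UTF-8, refusing if existing content is not UTF-8."""
    if os.path.exists(path):
        with io.open(path, "rb") as f:
            existing = f.read()
        # Raises UnicodeDecodeError if the file holds bytes that are not
        # valid UTF-8, so mixed encodings never end up in one file.
        existing.decode("utf-8")
    with io.open(path, "a", encoding="utf-8") as f:
        for line in lines:
            f.write(line + u"\n")


# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "iknow_kanji.txt")
append_utf8(path, [u"\u898b"])
append_utf8(path, [u"\u53c8"])

with io.open(path, "rb") as f:
    raw = f.read()
```

Reading the whole file on every append is wasteful for large files; it is done here only to keep the idea visible.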
| From | Magnus Pettersson <magpettersson@gmail.com> |
|---|---|
| Date | 2013-02-12 09:12 -0800 |
| Message-ID | <a80a49be-b3c4-4549-bf94-523605dbbeec@googlegroups.com> |
| In reply to | #38766 |
> What encoding is this file? Since you're appending to it, you really
> need to match the pre-existing encoding, or the next program to deal
> with it is in big trouble. So using the io.open() without the encoding=
> keyword is probably a mistake.
The .txt file is in UTF-8.
I have got it to work in the terminal now, but I don't understand what I'm doing, or why I didn't need all the unicode strings and encode mumbo jumbo in Eclipse.
    #Here kanji = u"私"
    baseurl = u"http://www.romajidesu.com/kanji/"
    url = baseurl+kanji
    savefile([url]) #this test works now. uses: with io.open(filepath, "a",encoding="UTF-8") as f:
    # This made the fetching of the website work. Why did I have to write url.encode("UTF-8") when url already is unicode? I feel I don't have a good understanding of this.
    page = urllib2.urlopen(url.encode("UTF-8"))
....
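On the question of why the URL needed encoding: a URL goes over the network as bytes, so the non-ASCII character has to be turned into bytes at some point. A common convention is to percent-encode the UTF-8 bytes of the non-ASCII part. The sketch below uses Python 3's urllib.parse.quote (the thread itself uses Python 2.7's urllib2); whether the site in question expects percent-encoded UTF-8 is an assumption, and no network request is made.

```python
from urllib.parse import quote

base_url = u"http://www.romajidesu.com/kanji/"
kanji = u"\u79c1"  # the character from the post above

# Encoding turns the text into the bytes that actually go on the wire.
utf8_bytes = kanji.encode("utf-8")

# quote() percent-encodes those UTF-8 bytes, giving an ASCII-only URL.
url = base_url + quote(kanji)
print(url)  # http://www.romajidesu.com/kanji/%E7%A7%81
```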
| From | Fabio Zadrozny <fabiofz@gmail.com> |
|---|---|
| Date | 2013-02-12 18:04 -0200 |
| Message-ID | <mailman.1722.1360699522.2939.python-list@python.org> |
| In reply to | #38773 |
Just to note, PyDev does something behind the scenes (it sets the encoding
for the console).
You may specify which encoding you want at your launch configuration (in
the 'common' tab you can set the encoding you want for the shell).
Cheers,
Fabio
On Tue, Feb 12, 2013 at 3:12 PM, Magnus Pettersson <magpettersson@gmail.com> wrote:
> > What encoding is this file? Since you're appending to it, you really
> > need to match the pre-existing encoding, or the next program to deal
> > with it is in big trouble. So using the io.open() without the encoding=
> > keyword is probably a mistake.
>
> The .txt file is in UTF-8
>
> I have got it to work now in the terminal, but i dont understand what im
> doing and why i didnt need to do all the unicode strings and encode mumbo
> jumbo in eclipse
>
> #Here kanji = u"私"
> baseurl = u"http://www.romajidesu.com/kanji/"
> url = baseurl+kanji
> savefile([url]) #this test works now. uses: io.open(filepath,
> "a",encoding="UTF-8") as f:
> # This made the fetching of the website work. Why did i have to write
> url.encode("UTF-8") when url already is unicode? I feel i dont have a good
> understanding of this.
> page = urllib2.urlopen(url.encode("UTF-8"))
>
>
> ....
> --
> http://mail.python.org/mailman/listinfo/python-list
>
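To compare what a script sees when launched from PyDev versus from python.exe, a small diagnostic like the one below can be run in both environments. The thread does not say which mechanism PyDev uses to set the console encoding, so this sketch simply reports the usual suspects; the function name encoding_report is hypothetical.

```python
from __future__ import print_function

import locale
import os
import sys


def encoding_report():
    """Collect encoding-related settings that can differ between launchers."""
    return {
        # Encoding used when printing text to stdout; may be None.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Environment variable that overrides the stdio encoding when set.
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        # Default used by io.open() when no encoding= is passed.
        "preferred": locale.getpreferredencoding(False),
        # Codec used for implicit str/unicode conversions on Python 2.
        "default": sys.getdefaultencoding(),
    }


for name, value in sorted(encoding_report().items()):
    print(name, "=", value)
```

Any line that differs between the two runs is a candidate explanation for code that works in one place and fails in the other.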
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-02-12 15:51 -0500 |
| Message-ID | <mailman.1725.1360702303.2939.python-list@python.org> |
| In reply to | #38773 |
On 02/12/2013 12:12 PM, Magnus Pettersson wrote:
>> < snip >
>>
> #Here kanji = u"私"
> baseurl = u"http://www.romajidesu.com/kanji/"
> url = baseurl+kanji
> savefile([url]) #this test works now. uses: io.open(filepath, "a",encoding="UTF-8") as f:
> # This made the fetching of the website work.
You don't show the code that actually does the io.open(), nor the
url.encode, so I'm not going to guess what you're actually doing.
> Why did i have to write url.encode("UTF-8") when url already is unicode? I feel i dont have a good understanding of this.
> page = urllib2.urlopen(url.encode("UTF-8"))
utf-8 is NOT unicode; they are entirely different. Unicode is
conceptually 32 bits per character, and is an internal representation.
There are a million or so characters defined. Nearly always when you're
talking to an external device, you need bytes. Since you can't cram 32
bits into 8, you have to encode it. Examples of devices would be any
file, or the console. Notice that sometimes you can use unicode
directly for certain functions. For example, the Windows file name is
composed of Unicode characters, so Windows has function calls that
accept Unicode directly. But back to 8 bits:
One encoding is called ASCII, which is simply the bottommost 7 bits.
But of course it gets an error if there are any characters above 127.
Other encodings try to pick an 8 bit subset of the million possible
characters. Again, if you happen to have a character that's not in that
subset, you'll get an error.
There are also other encodings which are hard to describe, but
fortunately pretty rare these days.
Then there's utf-8, which uses a variable length bunch of bytes for
each character. It's designed to use the ASCII encoding for characters
which are below 128, but uses two or more bytes for all the other
characters. So it works out well when most characters happen to be ASCII.
Once encoded, a stream of bytes can only be successfully interpreted if
you use the same decoding when processing them.
--
DaveA
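The description above of ASCII, 8-bit subsets, and UTF-8 can be checked directly. This is a small sketch; cp1252 is used as the example 8-bit encoding only because it appears in a traceback elsewhere in the thread.

```python
from __future__ import print_function

samples = [u"a", u"\u00e5", u"\u898b"]  # ASCII letter, Swedish a-ring, kanji

# UTF-8 is variable length: 1 byte for ASCII, more for everything else.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3]


def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# ASCII covers only code points 0-127; cp1252 is one 8-bit subset;
# UTF-8 can represent all three samples.
for encoding in ("ascii", "cp1252", "utf-8"):
    print(encoding, [can_encode(ch, encoding) for ch in samples])
```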
| From | Magnus Pettersson <magpettersson@gmail.com> |
|---|---|
| Date | 2013-02-12 16:20 -0800 |
| Message-ID | <d7da5405-7de9-4eb4-935e-fafc131194d9@googlegroups.com> |
| In reply to | #38784 |
> You don't show the code that actually does the io.open(), nor the
> url.encode, so I'm not going to guess what you're actually doing.
Hmm, I'm not sure what you mean, but I wrote all the code needed in a previous post, so maybe you missed that one :)
In short, I basically just have:
    import io
    with io.open(myfile,"a",encoding="UTF-8") as f:
        f.write(my_ustring_with_kanji)
url.encode() is my unicode string variable named "url" using the unicode type's built-in .encode() method, which was the thing I wondered why I needed to use, and which you explained very well, thank you!
Just one more question, since all this is still a little fuzzy in my head.
When do I need to use .decode() in my code? Is it when I read lines from, for example, a UTF-8 file? And why didn't I have to use .encode() on my unicode string when running from within Eclipse PyDev? Someone wrote that it has a default codec setting, so maybe that handles it for me there (which is kind of dangerous, since my programs won't work outside of Eclipse: I didn't do any encoding or use unicode strings before in my script and it still worked).
--Magnus
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-02-12 22:51 -0500 |
| Message-ID | <mailman.1734.1360727537.2939.python-list@python.org> |
| In reply to | #38791 |
On 02/12/2013 07:20 PM, Magnus Pettersson wrote:
>
>> You don't show the code that actually does the io.open(), nor the
>> url.encode, so I'm not going to guess what you're actually doing.
>
> Hmm im not sure what you mean but I wrote all code needed in a previous post so maybe you missed that one :)
> In short I basically just have:
> import io
> io.open(myfile,"a",encode="UTF-8") as f:
>     f.write(my_ustring_with_kanji)
>
> the url.encode() is my unicode string variable named "url" using the type built in function .encode() which was the thing i wondered why i needed to use, which you explained very well, thank you!
>
> Just one more question since all this is still a little fuzzy in my head.
>
> When do i need to use .decode() in my code? is it when i read lines from f.ex a UTF-8 file? And why didn't I have to use .encode() on my unicode string when running from within eclipse pydev? someone wrote that it has a default codec setting so maybe that handles it for me there (which is kinda dangerous since my programs wont work running outside of eclipse since i didnt do any encoding or using of unicode strings before in my script and it still worked)
>
decode goes from bytes to unicode, the exact reverse. And you're right, you'd need it on input from a file, and theoretically on input from a keyboard.
Conceptually, the easiest (not necessarily the fastest) thing to do is to always convert any input that comes in byte form to unicode, immediately on getting it. Then all processing in the code should be done in unicode form. And you encode any output just before it goes out to a byte-device.
Python 3 makes that a natural, as the string type is already unicode, and it's byte strings that are the exception. But all that really changes is the syntax you use.
There are defaults all over the place on these conversions. And apparently, your IDE sets those defaults for you, which is a nasty thing, since it means things that run in the IDE will run differently outside of it. You're just lucky the difference was an error. If there weren't an error, you might have merrily been creating files with a mixture of encodings, which is a real disaster.
One other place where decoding happens is in your source file. There is an optional encoding line you can place at the top of the file (immediately after the shebang line) to change how unicode literals with non-ASCII characters are interpreted. Remember your source file is a byte file edited with some text editor, and it has been encoded, deliberately or accidentally by that editor. You can avoid the issue by always using escape sequences, but if for example, you copy/paste some unicode string from an email message into your source code, you'd like it to be equivalent. If your email program, your text editor, and your Python compiler are all on the same page, it works amazingly simply.
(That encoding line may affect other things; I know in Python 3, it makes non-ASCII attribute names possible, but I'm not sure if it matters in Python 2.x other than for unicode literal strings)
--
DaveA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-02-13 11:21 +1100 |
| Message-ID | <511adc9a$0$29972$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #38773 |
Magnus Pettersson wrote:
> # This made the fetching of the website work. Why did i have to write
> # url.encode("UTF-8") when url already is unicode? I feel i dont have a
> # good understanding of this.
> page = urllib2.urlopen(url.encode("UTF-8"))
Start here:
"The Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No Excuses!)"
http://www.joelonsoftware.com/articles/Unicode.html
Basically, Unicode is an in-memory data format. Python knows about Unicode
characters (to be technical: code points), but files on disk do not.
Neither do network protocols, or terminals, or other simple devices. They
only understand bytes.
So when you have Unicode text, and you want to write it to a file on disk,
or print it, or send it over the network to another machine, it has to be
*encoded* into bytes, and then *decoded* back into Unicode when you read it
from the file again. Sometimes the system will "helpfully" do that encoding
and decoding automatically for you, which is fine when it works but when it
doesn't it can be perplexing.
There are many, many, many different *encoding schemes*. ASCII is one. UTF-8
is another. And then there are about a bazillion legacy encodings which, if
you are lucky, you will never need to care about. Only some encodings can
deal with the entire range of Unicode characters, most can only deal with a
(typically small) subset of possible characters. E.g. ASCII only knows
about 128 characters out of the million-plus that Unicode deals with.
Latin-1 can handle close to 256 different characters. If you have a say in
the matter, always use UTF-8, since it can handle the full set of Unicode
characters in the most efficient manner.
--
Steven
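The sizes mentioned above are easy to verify, along with the encode/decode round trip. This sketch uses Python 3 (chr() yields a one-character text string) and only scans the first 65,536 code points to keep it quick; the helper name count_encodable is hypothetical.

```python
def count_encodable(encoding, code_points):
    """Count how many of the given code points the encoding can represent."""
    count = 0
    for cp in code_points:
        try:
            chr(cp).encode(encoding)
            count += 1
        except UnicodeEncodeError:
            pass
    return count


bmp = range(0x10000)  # first 65,536 code points only
ascii_count = count_encodable("ascii", bmp)
latin1_count = count_encodable("latin-1", bmp)
print(ascii_count, latin1_count)  # 128 256

# Encoding and then decoding with the same codec returns the original text.
text = u"\u898b"
round_tripped = text.encode("utf-8").decode("utf-8")
```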
| From | Magnus Pettersson <magpettersson@gmail.com> |
|---|---|
| Date | 2013-02-12 16:40 -0800 |
| Message-ID | <b9306c81-7129-4af8-ae2c-ceae66416de3@googlegroups.com> |
| In reply to | #38792 |
Thanks a lot Steven, you gave me a good AHA experience! :)
Now I understand why I had to use encoding when calling urllib2!
So basically Eclipse PyDev does this in the background for me, and its console supports UTF-8, so that's why I never had to think about it before (and why some scripts tend to fail with Unicode errors when run outside of the Eclipse IDE).
cheers
Magnus
> Start here:
>
> "The Absolute Minimum Every Software Developer Absolutely, Positively Must
> Know About Unicode and Character Sets (No Excuses!)"
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
> Basically, Unicode is an in-memory data format. Python knows about Unicode
> characters (to be technical: code points), but files on disk do not.
> Neither do network protocols, or terminals, or other simple devices. They
> only understand bytes.
>
> So when you have Unicode text, and you want to write it to a file on disk,
> or print it, or send it over the network to another machine, it has to be
> *encoded* into bytes, and then *decoded* back into Unicode when you read it
> from the file again. Sometimes the system will "helpfully" do that encoding
> and decoding automatically for you, which is fine when it works but when it
> doesn't it can be perplexing.
>
> There are many, many, many different *encoding schemes*. ASCII is one. UTF-8
> is another. And then there are about a bazillion legacy encodings which, if
> you are lucky, you will never need to care about. Only some encodings can
> deal with the entire range of Unicode characters, most can only deal with a
> (typically small) subset of possible characters. E.g. ASCII only knows
> about 128 characters out of the million-plus that Unicode deals with.
> Latin-1 can handle close to 256 different characters. If you have a say in
> the matter, always use UTF-8, since it can handle the full set of Unicode
> characters in the most efficient manner.
>
> --
> Steven
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2013-02-12 21:03 +0000 |
| Message-ID | <mailman.1727.1360702992.2939.python-list@python.org> |
| In reply to | #38749 |
On 2013-02-12 14:24, Magnus Pettersson wrote:
> I have tried now to take away printing to terminal and just keeping the writing to a .txt file to disk (which is what the scripts purpose is):
>
> with open(filepath,"a") as f:
>     for card in cardlist:
>         f.write(card+"\n")
>
> The file it writes to exists and im just appending to it, but when i run the script trough eclipse, all is fine. When i run in terminal i get this error instead:
>
> File "K:\dev\python\webscraping\kanji_anki.py", line 69, in savefile
> f.write(card+"\n")
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u898b' in position 32: ordinal not in range(128)
>
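When you open the file, tell it what encoding to use. For example (in Python 2.7 the built-in open() does not accept an encoding argument, so use io.open(); in Python 3 the two are the same function):
    import io
    with io.open(filepath, "a", encoding="utf-8") as f:
        for card in cardlist:
            f.write(card + "\n")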
[toc] | [prev] | [next] | [standalone]
| From | Magnus Pettersson <magpettersson@gmail.com> |
|---|---|
| Date | 2013-02-12 06:24 -0800 |
| Message-ID | <mailman.1723.1360700325.2939.python-list@python.org> |
| In reply to | #38741 |
I have tried now to take away printing to terminal and just keeping the writing to a .txt file to disk (which is what the scripts purpose is):
with open(filepath,"a") as f:
for card in cardlist:
f.write(card+"\n")
The file it writes to exists and im just appending to it, but when i run the script trough eclipse, all is fine. When i run in terminal i get this error instead:
File "K:\dev\python\webscraping\kanji_anki.py", line 69, in savefile
f.write(card+"\n")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u898b' in position 3
2: ordinal not in range(128)
On Tuesday, February 12, 2013 12:01:19 PM UTC+1, Andrew Berg wrote:
> On 2013.02.12 04:43, Magnus Pettersson wrote:
> > I am using Eclipse to write my Python scripts, and when I run them from inside Eclipse they work fine without errors.
> >
> > But in almost every script that handles some form of special characters, like Swedish åäö and Chinese characters etc., I get Unicode errors when running the script externally with python.exe or pythonw.exe (but the scripts run completely fine from within Eclipse; standard PyDev projects, Python 2.7). I have usually launched the script GUI from within Eclipse because of this error, but now I want to get to the bottom of this so I don't have to open Eclipse every time I want to run a script!
> >
> > Here is the error I get now when running the script with python.exe:
> > UnicodeEncodeError: 'charmap' codec can't encode character u'\u898b' in position 32: character maps to <undefined>
> >
> > What can I do to fix this?
>
> Since you didn't say what code actually does this, I'll turn to my
> crystal ball. It says you are trying to print characters to a terminal
> that doesn't support them. If that is the case, you could try changing
> the code page (but only 3.3 supports cp65001, so that probably won't
> help) or use replacement characters when printing.
>
> --
> CPython 3.3.0 | Windows NT 6.2.9200.16461 / FreeBSD 9.1-RELEASE
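
The "replacement characters when printing" idea from the quoted reply can be sketched roughly as follows. This is only an illustration: safe_text is a made-up helper, and cp850 is an assumed console code page chosen for the demo, since the thread does not say which one the terminal actually uses.

```python
import sys


def safe_text(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Fall back to ASCII when stdout has no usable encoding (e.g. when piped).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # Round-trip through the codec; the "replace" handler swaps in '?'.
    return text.encode(encoding, "replace").decode(encoding)


# cp850 is an assumption standing in for a Western European DOS console.
print(safe_text(u"\u898b = to see", "cp850"))
```

The output loses the characters the console cannot show, but the script no longer stops with a UnicodeEncodeError.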
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-02-12 22:43 +1100 |
| Message-ID | <511a2ac5$0$29988$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #38740 |
Magnus Pettersson wrote:

> I am using Eclipse to write my Python scripts, and when I run them from
> inside Eclipse they work fine without errors.
>
> But in almost every script that handles some form of special characters,
> like Swedish åäö and Chinese characters etc.

A comment: they are not "special" characters. They're merely not American.

> I get Unicode errors when
> running the script externally with python.exe or pythonw.exe (but the
> scripts run completely fine from within Eclipse; standard PyDev projects,
> Python 2.7). I have usually launched the script GUI from within Eclipse
> because of this error, but now I want to get to the bottom of this so I don't
> have to open Eclipse every time I want to run a script!
>
> Here is the error I get now when running the script with python.exe:
> UnicodeEncodeError: 'charmap' codec can't encode character u'\u898b' in
> position 32: character maps to <undefined>

Please show the *complete* traceback, including the line of code that causes the exception.

> What can I do to fix this?

My guess is that you are trying to print a character which your terminal cannot display. My terminal is set to use UTF-8, and so it can display it fine:

    py> c = u'\u898b'
    py> print(c)
    見

(or at least it would display fine if the font used had a glyph for that character). Why there are still terminals in the world that don't default to UTF-8 is beyond me.

If I manually change the terminal's encoding to Western European ISO 8859-1, I get some moji-bake:

    py> print(c)
    è¦

I can't replicate the exception you give, so I assume it is specific to Windows.

--
Steven
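
The three behaviours described in this post (correct display, moji-bake, and an exception) can be reproduced without touching any terminal settings by encoding the character by hand. A small sketch; cp437 is an assumption standing in for a Windows console code page, as the thread does not state which one was active:

```python
c = u"\u898b"

# A UTF-8 terminal receives these three bytes and renders one glyph.
utf8_bytes = c.encode("utf-8")

# A terminal set to ISO 8859-1 interprets the same bytes as three
# separate characters, which is the moji-bake shown in the post.
mojibake = utf8_bytes.decode("latin-1")

# A charmap-based code page has no slot for the character at all,
# so the encode step itself fails.
try:
    c.encode("cp437")
    charmap_failed = False
except UnicodeEncodeError:
    charmap_failed = True
```

The third case matches the kind of 'charmap' error reported by the original poster, which is consistent with the guess that the problem is specific to the Windows console.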
| From | Magnus Pettersson <magpettersson@gmail.com> |
|---|---|
| Date | 2013-02-12 04:34 -0800 |
| Message-ID | <9bc96ed6-8cb7-421a-a25a-4a7c9404463a@googlegroups.com> |
| In reply to | #38743 |
Ahh, so it's the actual printing that makes it error out outside of Eclipse, because it's a different terminal that it's printing to. It's the default DOS terminal in Windows that runs when I run the script with python.exe, and I guess it's the same when I run with pythonw.exe, just that the terminal window is not opened up, only the PyQt GUI in this case.

I will try to fix it now that I know what it is :) I never thought about the terminal; last time I had the same problem I was just playing around for hours with unicode encode and decode and all that not-so-fun stuff :)

Andrew Berg: Thanks, your crystal ball seems to be right :P

On Tuesday, February 12, 2013 12:43:00 PM UTC+1, Steven D'Aprano wrote:
> Magnus Pettersson wrote:
>
> > I am using Eclipse to write my Python scripts, and when I run them from
> > inside Eclipse they work fine without errors.
> >
> > But in almost every script that handles some form of special characters,
> > like Swedish åäö and Chinese characters etc.
>
> A comment: they are not "special" characters. They're merely not American.
>
> > I get Unicode errors when
> > running the script externally with python.exe or pythonw.exe (but the
> > scripts run completely fine from within Eclipse; standard PyDev projects,
> > Python 2.7). I have usually launched the script GUI from within Eclipse
> > because of this error, but now I want to get to the bottom of this so I don't
> > have to open Eclipse every time I want to run a script!
> >
> > Here is the error I get now when running the script with python.exe:
> > UnicodeEncodeError: 'charmap' codec can't encode character u'\u898b' in
> > position 32: character maps to <undefined>
>
> Please show the *complete* traceback, including the line of code that causes
> the exception.
>
> > What can I do to fix this?
>
> My guess is that you are trying to print a character which your terminal
> cannot display. My terminal is set to use UTF-8, and so it can display it
> fine:
>
>     py> c = u'\u898b'
>     py> print(c)
>     見
>
> (or at least it would display fine if the font used had a glyph for that
> character). Why there are still terminals in the world that don't default
> to UTF-8 is beyond me.
>
> If I manually change the terminal's encoding to Western European ISO 8859-1,
> I get some moji-bake:
>
>     py> print(c)
>     è¦
>
> I can't replicate the exception you give, so I assume it is specific to
> Windows.
>
> --
> Steven
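
To see how the environments differ, it may help to run the same small diagnostic once from Eclipse and once from the console and compare the results. A sketch only; describe_output_encoding is a made-up helper name, and the values it reports depend entirely on the machine:

```python
import locale
import sys


def describe_output_encoding():
    """Collect the encodings Python may use when text leaves the program."""
    return {
        # Encoding attached to stdout; may be None when there is no terminal.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Fallback codec for implicit str/unicode conversions.
        "default_encoding": sys.getdefaultencoding(),
        # What the operating system locale suggests for text files.
        "preferred_encoding": locale.getpreferredencoding(False),
    }


info = describe_output_encoding()
for name in sorted(info):
    print("%s: %s" % (name, info[name]))
```

If the stdout encoding differs between the two runs, that difference is a likely explanation for why printing works in one place and fails in the other.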
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2013-02-12 11:07 -0500 |
| Message-ID | <mailman.1712.1360686721.2939.python-list@python.org> |
| In reply to | #38745 |
On 2/12/2013 7:34 AM, Magnus Pettersson wrote:

> Ahh, so it's the actual printing that makes it error out outside of
> Eclipse, because it's a different terminal that it's printing to. It's
> the default DOS terminal in Windows that runs when I run the script
> with python.exe, and I guess it's the same when I run with pythonw.exe,
> just that the terminal window is not opened up, only the PyQt GUI in
> this case.

Writing

    txt = <expression involving coding>
    print(txt)

rather than

    print(<expression involving coding>)

makes it easier to tell whether a UnicodeError comes from evaluating the expression or from the print operation.

Using 3.3 instead of 2.7 will make using unicode somewhat easier.

--
Terry Jan Reedy
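
The two-step idea above can be sketched as a small helper that reports which step failed. Everything here is illustrative: locate_unicode_error is a made-up name, and encoding to ASCII merely stands in for printing to a terminal that cannot show the character.

```python
def locate_unicode_error(build_text, emit):
    """Run the two steps separately and name the one that raised."""
    try:
        txt = build_text()      # step 1: evaluate the expression
    except UnicodeError:
        return "expression"
    try:
        emit(txt)               # step 2: the output operation
    except UnicodeError:
        return "output"
    return "ok"


raw = b"\xe8\xa6\x8b"  # UTF-8 bytes for u'\u898b', standing in for scraped data

# Decoding succeeds, but the ASCII-only "terminal" cannot take the result.
output_case = locate_unicode_error(
    lambda: raw.decode("utf-8"),
    lambda txt: txt.encode("ascii"),
)

# Here the expression itself is wrong: the bytes are not ASCII.
expression_case = locate_unicode_error(
    lambda: raw.decode("ascii"),
    lambda txt: txt.encode("utf-8"),
)
```

In an ordinary script the same information comes from the traceback: with the expression and the print on separate lines, the reported line number shows which of the two failed.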