Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!eu.feeder.erje.net!xlned.com!feeder7.xlned.com!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Date: Tue, 12 Feb 2013 15:51:22 -0500
From: Dave Angel <davea@davea.name>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130106 Thunderbird/17.0.2
MIME-Version: 1.0
To: python-list@python.org
Subject: Re: UnicodeEncodeError when not running script from IDE
References: <650d144e-da3d-4ca7-ad3a-49f44ce9cbaa@googlegroups.com> <mailman.1696.1360666894.2939.python-list@python.org> <0d6d513d-fa12-4d51-a33d-7bb38f1ee6b2@googlegroups.com> <mailman.1700.1360680572.2939.python-list@python.org> <780d353a-de5c-4d04-8f51-11d81802351b@googlegroups.com> <mailman.1711.1360684727.2939.python-list@python.org> <a80a49be-b3c4-4549-bf94-523605dbbeec@googlegroups.com>
In-Reply-To: <a80a49be-b3c4-4549-bf94-523605dbbeec@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.1725.1360702303.2939.python-list@python.org>
Lines: 48
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:38784

On 02/12/2013 12:12 PM, Magnus Pettersson wrote:
>> < snip >
>>
> #Here kanji = u"私"
> baseurl = u"http://www.romajidesu.com/kanji/"
> url = baseurl+kanji
> savefile([url]) #this test works now. uses: io.open(filepath, "a",encoding="UTF-8") as f:
> # This made the fetching of the website work.

You don't show the code that actually does the io.open(), nor the 
url.encode, so I'm not going to guess what you're actually doing.


> Why did i have to write url.encode("UTF-8") when url already is unicode? I feel i dont have a good understanding of this.
> page = urllib2.urlopen(url.encode("UTF-8"))

utf-8 is NOT the same thing as unicode.  Unicode assigns a number (a 
code point) to each character.  Code points run from 0 to 0x10FFFF, so 
there is room for a bit over a million of them, though far fewer are 
actually assigned.  A unicode string holds those code points; how they 
are stored in memory is an internal detail.  Nearly always when you're 
talking to an external device, you need bytes.  Since most code points 
won't fit in a single 8-bit byte, you have to encode them.  Examples of 
devices would be any file, the console, or a network connection.  Notice 
that sometimes you can use unicode directly for certain functions.  For 
example, Windows file names are composed of Unicode characters, so 
Windows has function calls that accept Unicode directly.  But back to 
8 bits:
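
To make the text-versus-bytes distinction concrete, here is a minimal 
sketch.  The variable names are mine, not from your code, and I've 
written the kanji as an escape so the snippet doesn't depend on the 
source file's own encoding.  It should behave the same on 2.7 and 3.3+.

```python
# Hypothetical names; u"\u79c1" is the kanji from the quoted example.
kanji = u"\u79c1"

# Encoding turns the single code point into a sequence of bytes.
encoded = kanji.encode("utf-8")

print(len(kanji))     # 1 character
print(len(encoded))   # 3 bytes
print(repr(encoded))  # the bytes 0xE7 0xA7 0x81
```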

One encoding is called ASCII, which covers only the lowest 128 code 
points (7 bits).  Encoding to ASCII raises an error if the string 
contains any character above 127.
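
A small sketch of that failure, assuming nothing beyond the standard 
str/unicode encode() method (the names here are only for illustration):

```python
# Hypothetical names, for illustration only.
kanji = u"\u79c1"          # code point 0x79C1, far above 127
error = None

try:
    kanji.encode("ascii")
except UnicodeEncodeError as exc:
    # This is the same exception class named in the subject line.
    error = exc
    print(exc)

# Pure-ASCII text encodes without complaint, one byte per character.
plain = u"kanji".encode("ascii")
print(len(plain))  # 5
```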

Other encodings (Latin-1, cp1252, and so on) each pick a set of at most 
256 characters that fits in 8 bits.  Again, if you happen to have a 
character that's not in that set, you'll get an error.
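
Latin-1 is one such 8-bit encoding, and I'm using it here purely as an 
example; the names are made up for the sketch.  An accented "e" is in 
its subset, the kanji is not:

```python
# Hypothetical names; Latin-1 chosen only as a representative 8-bit codec.
e_acute = u"\u00e9"                    # LATIN SMALL LETTER E WITH ACUTE
latin1_bytes = e_acute.encode("latin-1")
print(len(latin1_bytes))               # 1 byte, value 0xE9

failed = False
try:
    u"\u79c1".encode("latin-1")        # the kanji is outside Latin-1
except UnicodeEncodeError:
    failed = True
print(failed)                          # True
```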

There are also other encodings which are hard to describe, but 
fortunately pretty rare these days.

Then there's utf-8, which uses a variable number of bytes for each 
character.  It's designed to use the ASCII encoding for characters 
which are below 128, but uses two to four bytes for all the other 
characters.  So it works out well when most characters happen to be 
ASCII, and unlike the 8-bit encodings it can represent every character.
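
You can see the variable length directly.  A rough sketch, with sample 
characters I picked myself (one from each length class):

```python
# Hypothetical sample characters, one for each UTF-8 length.
samples = [
    u"a",           # U+0061, ASCII range
    u"\u00e9",      # U+00E9, e with acute accent
    u"\u79c1",      # U+79C1, the kanji from the quoted example
    u"\U0001F600",  # U+1F600, outside the Basic Multilingual Plane
]

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]

# For ASCII characters the UTF-8 bytes and the ASCII bytes are identical.
same = u"a".encode("utf-8") == u"a".encode("ascii")
print(same)     # True
```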

Once encoded, a stream of bytes can only be successfully interpreted if 
you decode it with the same encoding that produced it.
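
Here is a sketch of what a mismatch looks like.  Latin-1 and ASCII are 
just the wrong codecs I chose for the demonstration, and the names are 
hypothetical:

```python
# Hypothetical names; the bytes are the UTF-8 encoding of the kanji.
data = u"\u79c1".encode("utf-8")

# Same codec on both sides: the original character comes back.
right = data.decode("utf-8")
print(right == u"\u79c1")   # True

# Wrong codec that accepts any byte: no error, just three garbage chars.
wrong = data.decode("latin-1")
print([hex(ord(c)) for c in wrong])  # ['0xe7', '0xa7', '0x81']

# Wrong codec that is stricter: an exception instead of garbage.
decode_failed = False
try:
    data.decode("ascii")
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)        # True
```

Note that the silent garbage case is the nastier one, since nothing 
tells you the data is wrong until somebody looks at it.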



-- 
DaveA