MIME-Version: 1.0
In-Reply-To: <0604E20B5F6F2F4784C9C8C71C5DD4DD2E33300F5D@EMARC112VS01.exchad.jpmchase.net>
References: <4de40ee8$0$6623$9b4e6d93@newsspool2.arcor-online.net> <mailman.2315.1306841548.9059.python-list@python.org> <4de50cfd$0$6538$9b4e6d93@newsspool4.arcor-online.net> <0604E20B5F6F2F4784C9C8C71C5DD4DD2E33300F5D@EMARC112VS01.exchad.jpmchase.net>
Date: Wed, 1 Jun 2011 03:19:52 +1000
Subject: Re: sqlalchemy and Unicode strings: errormessage
From: Chris Angelico <rosuav@gmail.com>
To: python-list@python.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.2326.1306862395.9059.python-list@python.org>
Lines: 30
NNTP-Posting-Host: 82.94.164.166
Xref: x330-a1.tempe.blueboxinc.net comp.lang.python:6746

On Wed, Jun 1, 2011 at 2:31 AM, Prasad, Ramit <ramit.prasad@jpmchase.com> wrote:
>>line = unicode(line.strip(),'utf8')
>>and now i get really utf8-strings. It does work but i dont know why it
>>works. For me it looks like i change an utf8-string to an utf8-string.
>
>
> I would like to point out that UTF-8 is not exactly "Unicode". From what
> I understand, Unicode is a standard while UTF-8 is like an implementation
> of that standard (called an encoding). Being able to convert to Unicode
> (the standard) should mean you are then able to convert to any encoding
> that supports the Unicode characters used.

Unicode assigns a number (a code point) to each character; UTF-8 is one
way (of many) to represent those code points in bytes. UTF-16 and
UTF-32 are other ways of doing the same thing, and internally, Python
probably uses one of them - but there is no guarantee, and you should
never need to know. Unicode strings can be stored in memory and
manipulated in various ways, but they're a high-level construct on par
with lists and dictionaries - they can't be stored on disk or
transmitted to another computer without first being encoded into bytes.
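
As a rough sketch of that distinction: the snippet below is written for
Python 3, where data.decode('utf8') plays the role that
unicode(data, 'utf8') plays in Python 2. The sample line and the variable
names are made up for illustration and are not from the original thread.

```python
# Bytes as they might arrive from a file or socket: UTF-8 encoded text.
# (Hypothetical sample data, not taken from the original thread.)
raw = b"Gr\xc3\xbc\xc3\x9fe aus K\xc3\xb6ln\n"

# Decoding turns the bytes into a Unicode string. This is the step the
# quoted unicode(line.strip(), 'utf8') call performs in Python 2.
text = raw.strip().decode("utf8")

# len() counts bytes before decoding and characters after decoding.
byte_count = len(raw.strip())
char_count = len(text)

# Encoding goes the other way; any encoding that covers the characters
# in the string will do.
as_utf8 = text.encode("utf8")
as_utf16 = text.encode("utf-16-le")

print(byte_count, char_count, len(as_utf8), len(as_utf16))
```

So the call in the quoted code is not turning a UTF-8 string into a UTF-8
string; it is turning UTF-8 bytes into a Unicode string, which can later
be encoded again for storage or transmission.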

UTF-8 is an efficient way to translate Unicode text consisting
primarily of low-codepoint characters into bytes (ASCII characters take
one byte each). It's not so much an implementation of Unicode as a
means of converting the abstract concept of "Unicode characters" into
a concrete stream of bytes.
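
To put some numbers on "efficient", here is a small Python 3 sketch that
counts the bytes a few strings occupy in three encodings. The sample
strings and the encoded_sizes helper are my own, picked only to cover
different code point ranges.

```python
# Hypothetical sample strings chosen to cover different code point ranges.
samples = {
    "ascii": "hello",                # all characters in U+0000..U+007F
    "latin": "h\u00e9llo",           # one character in U+0080..U+07FF
    "cjk": "\u65e5\u672c\u8a9e",     # three characters in U+0800..U+FFFF
    "emoji": "\U0001F600",           # one character above U+FFFF
}


def encoded_sizes(text):
    """Return the number of bytes `text` occupies in three encodings."""
    # The -le variants are used so that no byte order mark is prepended.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


sizes = {name: encoded_sizes(text) for name, text in samples.items()}
for name, result in sizes.items():
    print(name, result)
```

For the ASCII-only string UTF-8 is the smallest of the three, while for
the CJK string UTF-16 comes out ahead, which is why "primarily low
codepoint" matters.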

Hope that clarifies things a little!

Chris Angelico