From: Chris Angelico <rosuav@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: Python 2.x or 3.x, which is faster?
Date: Thu, 10 Mar 2016 01:11:17 +1100

On Thu, Mar 10, 2016 at 1:03 AM, BartC <bc@freeuk.com> wrote:
> I've just tried a UTF-8 file and getting some odd results. With a file
> containing [three euro symbols]:
>
> €€€
>
> (including a 3-byte utf-8 marker at the start), and opened in text mode,
> Python 3 gives me this series of bytes (ie. the ord() of each character):
>
> 239
> 187
> 191
> 226
> 8218
> 172
> 226
> 8218
> 172
> 226
> 8218
> 172
>
> And prints the resulting string as: ï»¿â‚¬â‚¬â‚¬.

The first three bytes are the "UTF-8 BOM", which suggests you may have
created this in a broken editor like Notepad.

For the rest, I'm not sure how you told Python to open this as text,
but you certainly did NOT specify an encoding of UTF-8. The 8218
entries in there are completely bogus: a byte can't be larger than 255,
so those are code points produced by decoding with the wrong encoding.
Can you show your code, please, and also what you get if you open the
file as binary?
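
My guess, and it is only a guess until I see your code, is that the
file got decoded as Windows-1252 (a common default on Western-locale
Windows). Here's a rough sketch that rebuilds the raw bytes by hand
and decodes them both ways; the cp1252 part is my assumption, not
something you've told me:

```python
# Raw bytes of the file: UTF-8 BOM followed by three euro signs (U+20AC).
raw = b"\xef\xbb\xbf" + ("\u20ac" * 3).encode("utf-8")

# Assumption: the file was decoded as Windows-1252 instead of UTF-8.
wrong = raw.decode("cp1252")
wrong_ords = [ord(c) for c in wrong]
print(wrong_ords)  # 239, 187, 191, then 226, 8218, 172 three times

# Decoding as UTF-8 gives the real text; "utf-8-sig" also drops the BOM.
right = raw.decode("utf-8-sig")
right_ords = [ord(c) for c in right]
print(right_ords)  # three times 8364, i.e. three euro signs
```

Byte 0x82 maps to U+201A (decimal 8218) in Windows-1252, which would
account for every one of the numbers you posted.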

Unicode handling is easy as long as you (a) understand the fundamental
difference between text and bytes, and (b) declare your encodings.
Python isn't magical. It can't know the encoding without being told.
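
To make that concrete, here's a minimal sketch of reading such a file
as bytes and then as text with a declared encoding. The temp file
stands in for yours, so the path is made up; "utf-8-sig" is the stdlib
codec that decodes UTF-8 and strips a leading BOM if one is present:

```python
import os
import tempfile

# Stand-in for the real file: UTF-8 BOM plus three euro signs.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbf" + ("\u20ac" * 3).encode("utf-8"))

# Binary mode: the bytes exactly as stored, with no decoding at all.
with open(path, "rb") as f:
    data = f.read()
print(list(data))

# Text mode with the encoding declared, so nothing depends on the
# platform default.
with open(path, encoding="utf-8-sig") as f:
    text = f.read()
print([ord(c) for c in text])

os.remove(path)
```

The binary read should show 226, 130, 172 for each euro sign; no value
above 255 can ever appear there.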

ChrisA