From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: Pyhon 2.x or 3.x, which is faster?
Date: Thu, 10 Mar 2016 01:11:17 +1100

On Thu, Mar 10, 2016 at 1:03 AM, BartC wrote:
> I've just tried a UTF-8 file and getting some odd results.
> With a file containing [three euro symbols]:
>
> €€€
>
> (including a 3-byte utf-8 marker at the start), and opened in text mode,
> Python 3 gives me this series of bytes (ie. the ord() of each character):
>
> 239
> 187
> 191
> 226
> 8218
> 172
> 226
> 8218
> 172
> 226
> 8218
> 172
>
> And prints the resulting string as: ï»¿â‚¬â‚¬â‚¬.

The first three bytes are the "UTF-8 BOM", which suggests you may have
created this in a broken editor like Notepad. For the rest, I'm not
sure how you told Python to open this as text, but you certainly did
NOT specify an encoding of UTF-8. The 8218 entries in there are
completely bogus. Can you show your code, please, and also what you
get if you open the file as binary?

Unicode handling is easy as long as you (a) understand the fundamental
difference between text and bytes, and (b) declare your encodings.
Python isn't magical. It can't know the encoding without being told.

ChrisA
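For illustration, here is a minimal sketch of the two points above (text vs. bytes, and declaring the encoding). It assumes the file was decoded as Windows-1252 (cp1252); the post does not say which codec Python actually picked, and cp1252 is only a guess that happens to reproduce the quoted numbers. The file name `euros.txt` is hypothetical.

```python
import codecs
import os
import tempfile

# The file as described: a UTF-8 BOM followed by three euro signs (U+20AC).
raw = codecs.BOM_UTF8 + ("\u20ac" * 3).encode("utf-8")

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "euros.txt")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(raw)

    # Binary mode: no decoding at all, just the bytes on disk.
    with open(path, "rb") as f:
        as_bytes = list(f.read())

    # Text mode with the wrong codec (cp1252 is an assumption, see above):
    # each byte becomes one character, and byte 0x82 maps to U+201A (8218).
    with open(path, encoding="cp1252") as f:
        wrong = [ord(ch) for ch in f.read()]

    # Text mode with the encoding declared; "utf-8-sig" also strips the BOM.
    with open(path, encoding="utf-8-sig") as f:
        right = [ord(ch) for ch in f.read()]

print(as_bytes)  # the 12 raw bytes, BOM first
print(wrong)     # 12 "characters", including the 8218 entries
print(right)     # three code points, one per euro sign
```

Under that assumption, the binary read shows 130 where the text-mode read shows 8218: the codec, not the file, produced the 8218 entries.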