From: Chris Angelico
Cc: "python-list@python.org"
Newsgroups: comp.lang.python
Subject: Re: Everything you did not want to know about Unicode in Python 3
Date: Tue, 13 May 2014 18:38:03 +1000
In-Reply-To: <87tx8uccgd.fsf@elektro.pacujo.net>
References: <8P7cv.78617$Sp6.8377@fx15.am4> <537172eb$0$29980$c3e8da3$5496439d@news.astraweb.com> <87tx8uccgd.fsf@elektro.pacujo.net>

On Tue, May 13, 2014 at 6:25 PM, Marko Rauhamaa wrote:
> Johannes Bauer:
>
>> Having dealt with the UTF-8 problems on Python2 I can safely say that
>> I never, never ever want to go back to that freaky hell. If I deal
>> with strings, I want to be able to sanely manipulate them and I want
>> to be sure that after manipulation they're still valid strings.
>> Manipulating the bytes representation of unicode data just doesn't
>> work.
>
> Based on my background (network and system programming), I'm a bit
> suspicious of strings, that is, text. For example, is the stuff that
> goes to syslog bytes or text? Does an XML file contain bytes or
> (encoded) text? The answers are not obvious to me. Modern computing is
> full of ASCII-esque binary communication standards and formats.

These are problems that Unicode can't solve. In theory, XML should
contain text in a known encoding (defaulting to UTF-8). With syslog,
it's problematic - I don't remember what it's meant to be, but I know
there are issues. Same with other log files.

> Python 2's ambiguity allows me not to answer the tough philosophical
> questions. I'm not saying it's necessarily a good thing, but it has its
> benefits.

It's not a good thing. It means that you have the convenience of
pretending there's no problem, which means you don't notice trouble
until something happens... and then, in all probability, your app is
in production and you have no idea why stuff went wrong.

ChrisA
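
For illustration only (this is not from the thread): a minimal Python 3 sketch of the point about noticing trouble early. The sample string and the variable names are made up. It assumes the incoming data is UTF-8 and shows that mixing bytes with text, or decoding under the wrong assumption, fails at the boundary rather than later in production.

```python
# Hypothetical sample data: UTF-8 bytes as they might arrive from a
# socket, a log file, or an XML document.
raw = "na\u00efve caf\u00e9".encode("utf-8")

# Python 3 refuses to mix bytes and text, so the encoding question
# cannot be silently skipped.
try:
    raw + " (logged)"
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoding under the wrong assumption (ASCII) also fails immediately.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the known encoding gives real text to manipulate.
text = raw.decode("utf-8")

# The byte count and the character count differ for non-ASCII text.
print(mixed_ok, ascii_ok, len(raw), len(text))
```

In Python 2 the concatenation would have gone through for ASCII-only data and blown up only when a non-ASCII byte finally showed up, which is the "no idea why stuff went wrong" scenario.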