From: Chris Angelico
Date: Mon, 9 Mar 2015 08:13:48 +1100
Subject: Re: Newbie question about text encoding
Cc: "python-list@python.org"
Newsgroups: comp.lang.python

On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano wrote:
> Perhaps the bug is not UTF-8's inability to encode lone
> surrogates, but that Python allows you to create lone surrogates in the
> first place. That's not a rhetorical question. It's a genuine question.
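
For concreteness, here is a minimal Python 3 sketch of the behaviour in question. It assumes CPython's standard codecs, and the variable names are purely illustrative. The string reuses the same two code points as the Pike session in this message.

```python
# A str holding a noncharacter (U+FFFE) and a lone surrogate (U+DD00).
# Python constructs it without complaint.
s = "\ufffe\udd00"

# Strict UTF-8 encoding rejects the surrogate, not the noncharacter.
try:
    s.encode("utf-8")
except UnicodeEncodeError as exc:
    failed_at = exc.start  # index of the offending code point
    reason = exc.reason    # the codec's explanation
else:
    failed_at = None
    reason = None

# The "surrogatepass" error handler encodes the surrogate anyway,
# producing bytes that strict UTF-8 decoders will refuse.
raw = s.encode("utf-8", "surrogatepass")
round_tripped = raw.decode("utf-8", "surrogatepass")
```

So the string can be built and even round-tripped with a permissive error handler, but the default strict codec refuses the surrogate at encode time.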
As to the notion of rejecting the construction of strings containing
these invalid codepoints, I'm not sure. Are there any languages out
there that have a Unicode string type that requires that all
codepoints be valid (no surrogates, no U+FFFE, etc)? This is the kind
of thing that's usually done in an obscure language before it hits a
mainstream one.

Pike is similar to Python here. I can create a string with invalid
code points in it:

> "\uFFFE\uDD00";
(1) Result: "\ufffe\udd00"

but I can't UTF-8 encode that:

> string_to_utf8("\uFFFE\uDD00");
Character 0x0000dd00 at index 1 is in the surrogate range and therefore invalid.
Unknown program: string_to_utf8("\ufffe\udd00")
HilfeInput:1: HilfeInput()->___HilfeWrapper()

Or, using the streaming UTF-8 encoder instead of the short-hand:

> Charset.encoder("UTF-8")->feed("\uFFFE\uDD00")->drain();
Error encoding "\ufffe"[0xdd00] using utf8: Unsupported character 56576.
/usr/local/pike/8.1.0/lib/modules/_Charset.so:1:
_Charset.UTF8enc()->feed("\ufffe\udd00")
HilfeInput:1: HilfeInput()->___HilfeWrapper()

Does anyone know of a language where you can't even construct the string?

ChrisA