From: random832@fastmail.us
Newsgroups: comp.lang.python
Subject: Re: Newbie question about text encoding
Date: Thu, 05 Mar 2015 14:59:05 -0500

On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote:
> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in
> UTF-8 and UTF-32, since that goes against the grain of the system. You
> would have to program in artificial restrictions that otherwise don't
> exist.

UTF-8 as currently defined is already restricted from representing values above 0x10FFFF, whereas the underlying encoding scheme can "naturally" represent values up to 0x1FFFFF in four bytes, up to 0x3FFFFFF in five bytes, and up to 0x7FFFFFFF in six bytes. If anything, the BMP represents a natural boundary, since it coincides with the values that can be represented in at most three bytes. Likewise, UTF-32 can obviously represent values up to 0xFFFFFFFF. You're programming in artificial restrictions either way; it's just a question of what those restrictions are.
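
For anyone who wants to check the arithmetic, here is a rough sketch in Python. The helper max_value is my own name for illustration, not anything from the standard library, and it assumes the original UTF-8 bit layout (a lead byte with n one-bits and a zero, followed by 10xxxxxx continuation bytes).

```python
def max_value(n_bytes):
    """Largest value the original UTF-8 bit layout can hold in n_bytes (1 to 6)."""
    if not 1 <= n_bytes <= 6:
        raise ValueError("n_bytes must be between 1 and 6")
    if n_bytes == 1:
        return 0x7F  # 0xxxxxxx: 7 payload bits
    lead_bits = 7 - n_bytes  # lead byte: n one-bits, a zero, the rest is payload
    payload_bits = lead_bits + 6 * (n_bytes - 1)  # each 10xxxxxx byte adds 6 bits
    return (1 << payload_bits) - 1


for n in range(1, 7):
    print(n, hex(max_value(n)))

# The BMP boundary lines up with the three-byte limit.
print(len(chr(0xFFFF).encode("utf-8")))   # last BMP code point
print(len(chr(0x10000).encode("utf-8")))  # first code point outside the BMP

# Python enforces the 0x10FFFF cap even though four bytes could hold more.
try:
    chr(0x110000)
except ValueError as exc:
    print("rejected:", exc)
```

The 0x10FFFF cap is the one restriction here that the byte layout does not give you for free, which is the point about restrictions being a matter of choice.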