Date: Sun, 12 Feb 2012 17:40:38 -0500
From: Dave Angel
To: Roy Smith
Subject: Re: Python usage numbers
Cc: python-list@python.org
Newsgroups: comp.lang.python

On 02/12/2012 05:27 PM, Roy Smith wrote:
> In article,
> Chris Angelico wrote:
>
>> On Mon, Feb 13, 2012 at 9:07 AM, Terry Reedy wrote:
>>> The situation before ascii is like where we ended up *before* unicode.
>>> Unicode aims to replace all those byte encodings and character sets with
>>> *one* byte encoding for *one* character set, which will be a great
>>> simplification. It is the idea of ascii applied on a global rather than
>>> local basis.
>> Unicode doesn't deal with byte encodings; UTF-8 is an encoding, but so
>> are UTF-16, UTF-32, and as many more as you could hope for. But
>> broadly yes, Unicode IS the solution.
> I could hope for one and only one, but I know I'm just going to be
> disappointed. The last project I worked on used UTF-8 in most places,
> but also used some C and Java libraries which were only available for
> UTF-16. So it was transcoding hell all over the place.
>
> Hopefully, we will eventually reach the point where storage is so cheap
> that nobody minds how inefficient UTF-32 is and we all just start using
> that. Life will be a lot simpler then. No more transcoding, a string
> will be just as many bytes as it is characters, and everybody will be
> happy again.

Keep your in-memory character strings as Unicode, and only serialize
(encode) them when they go to/from a device, or to/from anachronistic
code. Then the cost is realized at the point of the problem. That is no
different from deciding how to serialize any other data type: do it
only at the point of entry/exit of your program.

But as long as devices are addressed as bytes, or as anything smaller
than 32-bit thingies, you will have encoding issues when writing to the
device, and decoding issues when reading. At the very least, you have
big-endian and little-endian ways to encode that UCS-4 code point.
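
A rough Python 3 sketch of the advice above, not anything posted in the thread. The helper names save_text and load_text are made up for the example, and UTF-8 is only an assumed choice for the on-disk encoding. It keeps the string as Unicode in memory, encodes and decodes only at the file boundary, and then shows that even fixed-width UTF-32 still leaves a byte-order decision.

```python
import os
import tempfile


def save_text(path, text, encoding="utf-8"):
    """Hypothetical helper: encode str -> bytes only at the exit point."""
    with open(path, "wb") as f:
        f.write(text.encode(encoding))


def load_text(path, encoding="utf-8"):
    """Hypothetical helper: decode bytes -> str only at the entry point."""
    with open(path, "rb") as f:
        return f.read().decode(encoding)


# In-memory Unicode string: 7 code points, two of them non-ASCII.
text = "na\u00efve \u20ac"

# Round trip through a temporary file; everything in between stays str.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    save_text(path, text)
    round_tripped = load_text(path)
finally:
    os.remove(path)

# The same code point has two UTF-32 byte layouts.
big = "A".encode("utf-32-be")     # most significant byte first
little = "A".encode("utf-32-le")  # least significant byte first

print(round_tripped == text)
print(big, little)
```

The rest of the program never sees bytes, so switching the storage encoding only touches the two helpers. Note that a reader of the UTF-32 data still has to be told, or has to detect from a byte order mark, which byte order was used.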