From: Chris Angelico
Date: Wed, 29 Aug 2012 13:59:27 +1000
Subject: Re: Flexible string representation, unicode, typography, ...
To: python-list@python.org
Newsgroups: comp.lang.python

On Wed, Aug 29, 2012 at 12:42 PM, rusi wrote:
> Clearly there are 3 string-engines in the python 3 world:
> - 3.2 narrow
> - 3.2 wide
> - 3.3 (flexible)
>
> How difficult would it be to give the choice of string engine as a
> command-line flag?
> This would avoid the nuisance of having two binaries -- narrow and
> wide.
> And it would give the python programmer a choice of efficiency
> profiles.

To what benefit?

3.2 narrow is, I would have to say, buggy. It handles everything up to
\uFFFF without problems, but once you have any character beyond that,
your indexing and slicing are wrong.

3.2 wide is fine but memory-inefficient.

3.3 is never worse than 3.2 except for some tiny checks, and will be
more memory-efficient in many cases.

Supporting narrow would require fixing the handling of surrogates.
Potentially a huge job, and you'll end up with ridiculous performance
in many cases.

So what you're really asking for is a command-line option to force all
strings to have their 'kind' set to 11 (UCS-4 storage). That would be
doable, I suppose; it wouldn't require many changes (just a quick check
in string creation functions). But what would be the advantage? Every
string would then require 4 bytes per character to store; an
optimization would have been lost.

ChrisA
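
A minimal sketch of the two points above, assuming CPython 3.3 or later. The helper names utf16_units and bytes_per_char are made up for illustration, and sys.getsizeof reports implementation-specific sizes, so treat the numbers as indicative rather than guaranteed. The first helper counts the 16-bit units a narrow build would have indexed over; the second estimates how many bytes the flexible representation spends per character.

```python
import sys


def utf16_units(s):
    """Count the 16-bit code units needed to store s as UTF-16.

    This approximates what a 3.2 narrow build indexed over: any
    character beyond \\uFFFF occupies two units (a surrogate pair).
    """
    return len(s.encode("utf-16-le")) // 2


def bytes_per_char(ch, n=1000):
    """Estimate storage per character for a string made only of ch.

    Comparing two lengths cancels out the fixed object header.
    Hypothetical helper; relies on CPython's sys.getsizeof.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


s = "a\U0001F600b"  # one character beyond \uFFFF
print(len(s), utf16_units(s))  # code points vs. UTF-16 code units

for ch in ("a", "\u20ac", "\U0001F600"):
    print(hex(ord(ch)), bytes_per_char(ch))
```

Where the two counts on the first printed line differ, a narrow build's indexing and slicing would have been off by the difference. The loop shows the per-character cost growing with the widest character in the string, which is the optimization that forcing UCS-4 everywhere would give up.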