From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: RE Module Performance
Date: Thu, 25 Jul 2013 20:07:41 +1000

On Thu, Jul 25, 2013 at 7:22 PM, Steven D'Aprano wrote:
> What I'm trying to say is that it is possible to use UTF-16 internally,
> but *not* assume that every code point (character) is represented by a
> single 2-byte unit. For example, the len() of a UTF-16 string should not
> be calculated by counting the number of bytes and dividing by two. You
> actually need to walk the string, inspecting each double-byte

Anything's possible. But since underlying representations can be changed
fairly easily (relative term of course - it's a lot of work, but it can
be changed in a single release, no deprecation required or anything),
there's very little reason to continue using UTF-16 underneath. May as
well switch to UTF-32 for convenience, or PEP 393 for convenience and
efficiency, or maybe some other system that's still mostly fixed-width.

ChrisA
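
To make the len() point concrete, here is a rough sketch (not how any Python implementation actually does it) of the difference between dividing the byte count by two and walking the 2-byte units. The helper name utf16_code_point_count is made up for illustration, and the input is assumed to be well-formed little-endian UTF-16 with no BOM.

```python
def utf16_code_point_count(data: bytes) -> int:
    """Count code points in UTF-16-LE bytes by walking the 2-byte units.

    Hypothetical helper; assumes well-formed UTF-16-LE without a BOM.
    """
    count = 0
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "little")
        # A low (trailing) surrogate is the second half of a pair,
        # so it does not start a new code point.
        if not 0xDC00 <= unit <= 0xDFFF:
            count += 1
    return count


text = "a\U0001F600b"  # 'a', one character outside the BMP, 'b'
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

naive = len(utf16) // 2                  # counts code units, not code points
walked = utf16_code_point_count(utf16)   # O(n) walk over the units
fixed_width = len(utf32) // 4            # plain division works for UTF-32

print(naive, walked, fixed_width, len(text))
```

With a fixed-width representation the length falls out of a single division, which is the convenience being argued for; the walk is only needed because UTF-16 is variable-width.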