Re: How is unicode implemented behind the scenes?

Date	Sun, 9 Mar 2014 14:19:00 +1100
Subject	Re: How is unicode implemented behind the scenes?
From	Chris Angelico <rosuav@gmail.com>
Cc	"python-list@python.org" <python-list@python.org>
Newsgroups	comp.lang.python
Message-ID	<mailman.7948.1394335143.18130.python-list@python.org>


On Sun, Mar 9, 2014 at 2:01 PM, Roy Smith <roy@panix.com> wrote:
> In article <531bd709$0$29985$c3e8da3$5496439d@news.astraweb.com>,
>  Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>
>> There are various common ways to store Unicode strings in RAM.
>>
>> The first, UTF-16.
>> [...]
>> Another option is UTF-32.
>> [...]
>> Another option is to use UTF-8 internally.
>> [...]
>> In Python 3.3, CPython introduced an internal scheme that gives the best
>> of all worlds. When a string is created, Python uses a different
>> implementation depending on the characters in the string:
>
> This was an excellent post, but I would take exception to the "best of
> all worlds" statement.  I would put it a little less absolutely and say
> something like, "a good compromise for many common use cases".  I would
> even go with, "... for most common use cases".  But, there are
> situations where it loses.

It's universally good for string indexing/slicing on binary CPUs
(there's no point using a 24-bit or 21-bit representation on an
Intel-compatible CPU, even though they'd be just as good as UTF-32).
It's not a compromise, so much as a recognition that Python offers
convenient operators for indexing and slicing. If, on the other hand,
Python fundamentally worked with U+0020-separated words (REXX has a
whole set of word-based functions), then it might be better to
represent strings as lists of words internally. Or if the string
operations are primarily based on the transitions between Unicode
types of "space" and "non-space", which would be more likely these
days, then something of that sort would still work. Anyway, it's based
on the operations the language makes convenient, and which will
therefore be common and expected to be fast: those are the operations
to optimize for.
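
To see the fixed-width storage that makes indexing cheap, here is a
rough sketch that estimates how many bytes CPython spends per
character. The helper name bytes_per_char is mine, not anything from
the standard library, and the numbers are a CPython implementation
detail (3.3 or later), not a language guarantee. On such a build I
would expect 1, 1, 2 and 4 for the four sample characters.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Estimate per-character storage for a string made only of `ch`.

    Hypothetical helper for illustration. Subtracting the sizes of two
    strings of different lengths cancels the fixed object header, so
    only the per-character cost remains.
    """
    longer = ch * (2 * n)
    shorter = ch * n
    return (sys.getsizeof(longer) - sys.getsizeof(shorter)) / n


if __name__ == "__main__":
    # ASCII, Latin-1, BMP, and astral (beyond U+FFFF) sample characters.
    for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X}: {bytes_per_char(ch):.0f} byte(s) per character")
```

Because every character in a given string has the same width, finding
character #n is a multiply and an add, whatever n is.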

If the only thing you ever do with a string is iterate sequentially
over its characters, UTF-8 would be the perfect representation. It's
compact, you can concatenate strings without re-encoding, and it
iterates forwards easily. But it sucks for "give me character #142857
from this string", so it's a bad choice for Python.
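
A minimal sketch of why that lookup is slow, assuming the string is
held as valid UTF-8 bytes. The names seq_len and utf8_char_at are
made up for this example. To reach character #n you have to step over
every earlier character, because each one takes between 1 and 4
bytes.

```python
def seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_char_at(data: bytes, index: int) -> str:
    """Return code point number `index` (0-based) from valid UTF-8 bytes.

    This is O(index): there is no way to jump straight to the right
    byte offset without scanning the sequences before it.
    """
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    pos = 0
    for _ in range(index):
        if pos >= len(data):
            raise IndexError("character index out of range")
        pos += seq_len(data[pos])
    if pos >= len(data):
        raise IndexError("character index out of range")
    return data[pos:pos + seq_len(data[pos])].decode("utf-8")


if __name__ == "__main__":
    text = "a\u00e9\u20ac\U0001F600z"
    data = text.encode("utf-8")
    print([utf8_char_at(data, i) for i in range(len(text))] == list(text))
    # Concatenation needs no re-encoding: the byte strings just join.
    print("ab".encode("utf-8") + "\u20ac".encode("utf-8")
          == "ab\u20ac".encode("utf-8"))
```

Sequential iteration uses the same stepping logic but only ever moves
forward by one sequence, which is why it stays cheap.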

ChrisA
