Path: csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
To: python-list@python.org
From: Ned Batchelder <ned@nedbatchelder.com>
Subject: Re: How is unicode implemented behind the scenes?
Date: Sat, 08 Mar 2014 22:48:51 -0500
References: <CAGGBd_rSN1bMHkQYix8Lo0TfXi3_k+Q9nu25vMokR1+Eumf5Cg@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) Gecko/20100101 Thunderbird/24.3.0
In-Reply-To: <CAGGBd_rSN1bMHkQYix8Lo0TfXi3_k+Q9nu25vMokR1+Eumf5Cg@mail.gmail.com>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.7949.1394336952.18130.python-list@python.org>
Lines: 34
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:68069

On 3/8/14 9:08 PM, Dan Stromberg wrote:
> OK, I know that Unicode data is stored in an encoding on disk.
>
> But how is it stored in RAM?
>
> I realize I shouldn't write code that depends on any relevant
> implementation details, but knowing some of the more common
> implementation options would probably help build an intuition for
> what's going on internally.
>
> I've heard that characters are no longer all c bytes wide internally,
> so is it sometimes utf-8?
>
> Thanks.
>

In abstract terms, a Unicode string is a sequence of integers (code 
points).  There are lots of ways to store a sequence of integers.
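
To make "sequence of integers" concrete, here is a small sketch. It
assumes Python 3 (where str is always Unicode), and the sample string is
just one I picked for illustration:

```python
# Illustrative sample: 'h', e-acute, the euro sign, and an emoji.
s = "h\u00e9\u20ac\U0001F600"

# ord() maps each character to its code point (a plain integer).
code_points = [ord(ch) for ch in s]
print(code_points)                      # [104, 233, 8364, 128512]

# chr() goes the other way, so the string can be rebuilt from the integers.
rebuilt = "".join(chr(cp) for cp in code_points)
print(rebuilt == s)                     # True
```

Nothing here says how those integers are laid out in memory; that is
the implementation detail the rest of this message is about.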

In Python 2.x, it's a vector of either 16-bit or 32-bit ints, depending 
on whether you have a "narrow" or "wide" build of Python.  These 
correspond to the Unicode representations known as UTF-16 and UTF-32, 
respectively.  On a narrow build, a code point above U+FFFF is stored 
as a surrogate pair, so it counts as two elements of the string.  You 
can tell the builds apart by examining sys.maxunicode, which is 65535 
(narrow) or 1114111 (wide).
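
The sys.maxunicode check can be sketched like this. The helper name
build_kind is my own invention, not anything in the standard library,
and note that on Python 3.3 and later the value is always 1114111, so
the narrow/wide distinction only means something on older versions:

```python
import sys

def build_kind(maxunicode=sys.maxunicode):
    """Classify a build from its sys.maxunicode value (hypothetical helper)."""
    if maxunicode == 0xFFFF:        # 65535: 16-bit code units
        return "narrow"
    if maxunicode == 0x10FFFF:      # 1114111: the full Unicode range
        return "wide"
    return "unknown"

print("sys.maxunicode = %d (%s)" % (sys.maxunicode, build_kind()))
```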

In Python 3.3, the representation changed from narrow/wide to the 
so-called Flexible String Representation, which others here have 
described.  It uses 1, 2, or 4 bytes per code point, depending on the 
largest code point in the string.  It's specified in PEP 393: 
http://legacy.python.org/dev/peps/pep-0393/
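
If you want to watch this happen, here is a rough sketch. It assumes
CPython 3.3 or later, since sys.getsizeof reports CPython-specific
object sizes and other implementations may behave differently. The
helper bytes_per_code_point is hypothetical, written for this example;
it measures how much a string grows when it gets longer, so the fixed
object header cancels out:

```python
import sys

def bytes_per_code_point(sample, n=100):
    """Estimate storage per code point for strings built from `sample`.

    Compares two strings of different lengths so that the fixed
    per-object overhead drops out of the difference.
    """
    small = sys.getsizeof(sample * n)
    large = sys.getsizeof(sample * 2 * n)
    return (large - small) // (len(sample) * n)

for label, sample in [
    ("ASCII", "a"),
    ("Latin-1", "\u00e9"),
    ("BMP", "\u20ac"),
    ("astral", "\U0001F600"),
    ("ASCII + astral", "a\U0001F600"),
]:
    print("%-15s %d" % (label, bytes_per_code_point(sample)))
```

The last case is the interesting one: a single code point above U+FFFF
is enough to make every code point in that string take 4 bytes, "a"
included.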

-- 
Ned Batchelder, http://nedbatchelder.com