How is unicode implemented behind the scenes?

Started by	Dan Stromberg <drsalists@gmail.com>
First post	2014-03-08 18:08 -0800
Last post	2014-03-09 05:46 +0000
Articles	6 — 6 participants

  How is unicode implemented behind the scenes? Dan Stromberg <drsalists@gmail.com> - 2014-03-08 18:08 -0800
    Re: How is unicode implemented behind the scenes? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-09 02:50 +0000
      Re: How is unicode implemented behind the scenes? Roy Smith <roy@panix.com> - 2014-03-08 22:01 -0500
        Re: How is unicode implemented behind the scenes? Chris Angelico <rosuav@gmail.com> - 2014-03-09 14:19 +1100
      Re: How is unicode implemented behind the scenes? Rustom Mody <rustompmody@gmail.com> - 2014-03-08 19:12 -0800
      Re: How is unicode implemented behind the scenes? Dan Sommers <dan@tombstonezero.net> - 2014-03-09 05:46 +0000

#68058 — How is unicode implemented behind the scenes?

From	Dan Stromberg <drsalists@gmail.com>
Date	2014-03-08 18:08 -0800
Subject	How is unicode implemented behind the scenes?
Message-ID	<mailman.7942.1394330927.18130.python-list@python.org>

OK, I know that Unicode data is stored in an encoding on disk.

But how is it stored in RAM?

I realize I shouldn't write code that depends on any relevant
implementation details, but knowing some of the more common
implementation options would probably help build an intuition for
what's going on internally.

I've heard that characters are no longer all c bytes wide internally,
so is it sometimes utf-8?

Thanks.

#68062

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-03-09 02:50 +0000
Message-ID	<531bd709$0$29985$c3e8da3$5496439d@news.astraweb.com>
In reply to	#68058

On Sat, 08 Mar 2014 18:08:38 -0800, Dan Stromberg wrote:

> OK, I know that Unicode data is stored in an encoding on disk.
> 
> But how is it stored in RAM?

There are various common ways to store Unicode strings in RAM.

The first, UTF-16, treats every character [aside: technically, a code 
point] as a double byte rather than a single byte. So the letter "A" is 
stored as two bytes 0x0041 (or 0x4100 depending on your platform's byte 
order). Using two bytes allows for a maximum of 65536 different 
characters, *way* too few for the whole Unicode character set, so UTF-16 
has an escaping mechanism where characters beyond ordinal 0xFFFF are 
stored as *two* "characters" (again, actually, code points) called 
surrogate pairs.

That means that a sequence of (say) four human-readable characters may, 
depending on those characters, take up anything from eight bytes to 
sixteen bytes, and you cannot tell which until you walk through the 
sequence inspecting each pair of bytes:

while there are still pairs of bytes to inspect:
    c = get_next_pair()
    if is_low_surrogate(c):
        error
    elif is_high_surrogate(c):
        d = get_next_pair()
        if not is_low_surrogate(d):
            error
        print make_char_from_surrogate_pair(c, d)
    else:
        print make_char_from_double_byte(c)

So UTF-16 is a *variable width* (could be 1 unit, could be 2 units) 
*double byte* encoding (each unit is two bytes).
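
For anyone who wants to run the loop above, here is a rough Python 3 sketch of it. The function name decode_utf16_units is made up for illustration, and it assumes the 16-bit units arrive as plain integers rather than as raw bytes, so byte order does not come into it.

```python
def decode_utf16_units(units):
    """Turn a sequence of 16-bit code units into a list of characters."""
    chars = []
    it = iter(units)
    for c in it:
        if 0xDC00 <= c <= 0xDFFF:
            # A low surrogate may only appear right after a high one.
            raise ValueError("unexpected low surrogate")
        if 0xD800 <= c <= 0xDBFF:
            # High surrogate: the next unit must be its low partner.
            d = next(it, None)
            if d is None or not 0xDC00 <= d <= 0xDFFF:
                raise ValueError("high surrogate without low surrogate")
            # Combine the two 10-bit payloads and add the 0x10000 offset.
            chars.append(chr(0x10000 + ((c - 0xD800) << 10) + (d - 0xDC00)))
        else:
            # Any other unit is a complete character on its own.
            chars.append(chr(c))
    return chars
```

Feeding it the three units 0x0041, 0xD83D, 0xDE00 should give back two characters, "A" and U+1F600, which is the variable-width behaviour described above.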

Prior to Python 3.3, UTF-16 was one of the options when building the 
interpreter from source. Such versions of the interpreter are called 
"narrow builds".

Another option is UTF-32. UTF-32 uses four bytes for every character. 
That's enough to store every Unicode character, and then some, so there 
are no surrogate pairs needed. But every character takes up four bytes: 
"A" would be stored as 0x00000041 or 0x41000000. Although UTF-32 is 
faster than UTF-16, because you don't have to walk the string checking 
each individual pair of bytes to see if they are part of a surrogate, 
strings use up to twice as much memory as UTF-16 whether they need it or 
not. (And four times more memory than ASCII strings.)
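
The size comparison is easy to check with the standard codecs. A small sketch, assuming a pure-ASCII sample string; the helper name encoded_sizes is hypothetical. The "-le" codec variants are used only so that no byte order mark gets counted.

```python
def encoded_sizes(text):
    """Return how many bytes text needs in each of three encodings."""
    # The "-le" variants are used so that no byte order mark is prepended.
    return {
        "ascii": len(text.encode("ascii")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

print(encoded_sizes("spam"))   # {'ascii': 4, 'utf-16': 8, 'utf-32': 16}
```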

Prior to Python 3.3, UTF-32 was a build option too. Such versions of the 
interpreter are called "wide builds".

Another option is to use UTF-8 internally. With UTF-8, every character 
uses between 1 and 4 bytes. By design, ASCII characters are stored using 
a single byte, the same byte they would have in old fashioned single-byte 
ASCII: the letter "A" is stored as 0x41. (The algorithm used by UTF-8 can 
continue up to six bytes, but there is no need to since there aren't that 
many Unicode characters.) Because it's variable-width, you have the same 
variable-width issues as UTF-16, only even more so, but because most 
common characters (at least for English speakers) use only 1 or 2 bytes, 
it's much more compact than either.
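
To see the 1 to 4 byte range in practice, one can simply encode single characters and count. A minimal sketch; utf8_width is a made-up helper name and the four sample characters are arbitrary picks, one from each width class.

```python
def utf8_width(ch):
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

# One sample per width: "A", e-acute, the euro sign, and an emoji.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch), utf8_width(ch))
# U+0041 1
# U+00E9 2
# U+20AC 3
# U+1F600 4
```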

No version of Python has, to my knowledge, used UTF-8 internally. Some 
other languages, such as Go and Haskell, do, and consequently string 
processing is slow for them.

In Python 3.3, CPython introduced an internal scheme that gives the best 
of all worlds. When a string is created, Python uses a different 
implementation depending on the characters in the string:

* If all the characters are ASCII or Latin-1, then the string uses 
  a single byte per character.

* If all the characters are no greater than ordinal value 0xFFFF, 
  then UTF-16 is used. Because none of the characters is above 0xFFFF, 
  no surrogate pairs are required and every character takes exactly 
  two bytes.

* Only if there is at least one ord() greater than 0xFFFF does 
  Python use UTF-32 for that string.
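
The three rules above can be sketched in a few lines of Python. This only mirrors the decision as described in this post; it is not CPython's actual code, and both function names are invented here. The second helper tries to observe the effect from the outside using sys.getsizeof, which is assumed to behave this way only on CPython 3.3 or later.

```python
import sys

def storage_width(s):
    """Bytes per character the three rules above would pick for s."""
    highest = max(map(ord, s), default=0)
    if highest <= 0xFF:
        return 1      # ASCII or Latin-1
    if highest <= 0xFFFF:
        return 2      # fits in two bytes, no surrogates needed
    return 4          # at least one character above 0xFFFF

def measured_width(ch, n=1000):
    """Estimate bytes per character from sys.getsizeof (CPython 3.3+ only)."""
    # Subtracting two sizes cancels the fixed per-object header.
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n
```

On a CPython 3.3+ interpreter, measured_width("a"), measured_width("\u20ac") and measured_width("\U0001F600") would be expected to agree with storage_width for the same characters. Other implementations may store strings differently.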

The end result is that creating strings is slightly slower, as Python may 
have to inspect each character up to twice to decide which representation 
to use. But memory use is much improved: Python has *many* strings (every 
function, method and class uses many strings in their implementation) and 
the memory savings can be considerable. Depending on your application and 
what you do with those strings, that may even lead to time savings as 
well as memory savings.

-- 
Steven D'Aprano
http://import-that.dreamwidth.org/

#68065

From	Roy Smith <roy@panix.com>
Date	2014-03-08 22:01 -0500
Message-ID	<roy-A8220E.22015908032014@news.panix.com>
In reply to	#68062

In article <531bd709$0$29985$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> There are various common ways to store Unicode strings in RAM.
> 
> The first, UTF-16.
> [...]
> Another option is UTF-32.
> [...]
> Another option is to use UTF-8 internally.
> [...]
> In Python 3.3, CPython introduced an internal scheme that gives the best 
> of all worlds. When a string is created, Python uses a different 
> implementation depending on the characters in the string:

This was an excellent post, but I would take exception to the "best of 
all worlds" statement.  I would put it a little less absolutely and say 
something like, "a good compromise for many common use cases".  I would 
even go with, "... for most common use cases".  But, there are 
situations where it loses.

#68067

From	Chris Angelico <rosuav@gmail.com>
Date	2014-03-09 14:19 +1100
Message-ID	<mailman.7948.1394335143.18130.python-list@python.org>
In reply to	#68065

On Sun, Mar 9, 2014 at 2:01 PM, Roy Smith <roy@panix.com> wrote:
> In article <531bd709$0$29985$c3e8da3$5496439d@news.astraweb.com>,
>  Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>
>> There are various common ways to store Unicode strings in RAM.
>>
>> The first, UTF-16.
>> [...]
>> Another option is UTF-32.
>> [...]
>> Another option is to use UTF-8 internally.
>> [...]
>> In Python 3.3, CPython introduced an internal scheme that gives the best
>> of all worlds. When a string is created, Python uses a different
>> implementation depending on the characters in the string:
>
> This was an excellent post, but I would take exception to the "best of
> all worlds" statement.  I would put it a little less absolutely and say
> something like, "a good compromise for many common use cases".  I would
> even go with, "... for most common use cases".  But, there are
> situations where it loses.

It's universally good for string indexing/slicing on binary CPUs
(there's no point using a 24-bit or 21-bit representation on an
Intel-compatible CPU, even though they'd be just as good as UTF-32).
It's not a compromise, so much as a recognition that Python offers
convenient operators for indexing and slicing. If, on the other hand,
Python fundamentally worked with U+0020 separated words (REXX has a
whole set of word-based functions), then it might be better to
represent strings as lists of words internally. Or if the string
operations are primarily based on the transitions between Unicode
types of "space" and "non-space", which would be more likely these
days, then something of that sort would still work. Anyway, it's based
on the operations the language makes convenient, and which will
therefore be common and expected to be fast: those are the operations
to optimize for.

If the only thing you ever do with a string is iterate sequentially
over its characters, UTF-8 would be the perfect representation. It's
compact, you can concatenate strings without re-encoding, and it
iterates forwards easily. But it sucks for "give me character #142857
from this string", so it's a bad choice for Python.
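
To make the indexing cost concrete, here is a rough sketch of what "give me character #N" looks like when all you have is UTF-8 bytes. The name utf8_char_at is hypothetical, the input is assumed to be valid UTF-8, and this is not how Python's own str works; it only shows why the lookup has to scan from the start.

```python
def utf8_char_at(data, index):
    """Return character number `index` from UTF-8 bytes by scanning from the start."""
    if index < 0:
        raise IndexError("negative indexes are not supported in this sketch")
    seen = -1       # index of the most recent character start found so far
    start = None    # byte offset where the wanted character begins
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:
            # Not a continuation byte, so a new character starts here.
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                # The next character has begun: the wanted one is complete.
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("character index out of range")
    return data[start:].decode("utf-8")
```

Every call walks the bytes one at a time, so fetching character #142857 means looking at everything before it. With a fixed-width representation the same lookup is a single multiplication and a memory read.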

ChrisA

#68066

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-03-08 19:12 -0800
Message-ID	<fd13b137-18a0-46c1-975b-b5fc8486d94f@googlegroups.com>
In reply to	#68062

On Sunday, March 9, 2014 8:20:49 AM UTC+5:30, Steven D'Aprano wrote:
> No version of Python has, to my knowledge, used UTF-8 internally. Some 
> other languages, such as Go and Haskell, do, and consequently string 
> processing is slow for them.

Haskell: It's more like: "Here's the menu, take your pick"
http://blog.ezyang.com/2010/08/strings-in-haskell/

#68070

From	Dan Sommers <dan@tombstonezero.net>
Date	2014-03-09 05:46 +0000
Message-ID	<lfgv6t$qmf$1@dont-email.me>
In reply to	#68062

On Sun, 09 Mar 2014 03:50:49 +0000, Steven D'Aprano wrote:

> ... UTF-16 ... the letter "A" is stored as two bytes 0x0041 (or 0x4100
> depending on your platform's byte order) ...

At the risk of being pedantic, the two bytes are 0x00 and 0x41, and the
order in which they appear in memory depends on your platform and even
your particular view of that platform (do stacks grow up or down?  are
addresses of higher memory larger or smaller?).
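
The individual bytes and their order can be inspected from Python by asking for an explicit byte order in the codec name. A small sketch; hex_bytes is a made-up helper, and it shows the bytes as the codecs emit them rather than what any particular interpreter keeps in RAM.

```python
import sys

def hex_bytes(text, encoding):
    """Hex dump of text in the given encoding, one byte at a time."""
    return " ".join("%02x" % b for b in text.encode(encoding))

print(hex_bytes("A", "utf-16-be"))   # 00 41
print(hex_bytes("A", "utf-16-le"))   # 41 00
print(hex_bytes("A", "utf-32-be"))   # 00 00 00 41
print(hex_bytes("A", "utf-32-le"))   # 41 00 00 00
print(sys.byteorder)                 # "little" or "big", depending on the machine
```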

> ... UTF-32 ... "A" would be stored as 0x00000041 or 0x41000000 ...

Or even some other sequence if you're on a PDP-11.

See <http://www.catb.org/jargon/html/M/middle-endian.html>.

But you knew that.  ;-)

Pedantic'ly yours,
Dan
