


Re: How is unicode implemented behind the scenes?

Started by: MRAB <python@mrabarnett.plus.com>
First post: 2014-03-09 02:42 +0000
Last post: 2014-03-09 02:42 +0000
Articles: 1 (1 participant)


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.



#68061 — Re: How is unicode implemented behind the scenes?

From: MRAB <python@mrabarnett.plus.com>
Date: 2014-03-09 02:42 +0000
Subject: Re: How is unicode implemented behind the scenes?
Message-ID: <mailman.7945.1394332967.18130.python-list@python.org>

On 2014-03-09 02:40, MRAB wrote:
> On 2014-03-09 02:08, Dan Stromberg wrote:
>> OK, I know that Unicode data is stored in an encoding on disk.
>>
>> But how is it stored in RAM?
>>
>> I realize I shouldn't write code that depends on any relevant
>> implementation details, but knowing some of the more common
>> implementation options would probably help build an intuition for
>> what's going on internally.
>>
>> I've heard that characters are no longer all c bytes wide internally,
>> so is it sometimes utf-8?
>>
> No.
>
>   From Python 3.3, it's an array of 1, 2 or 4 bytes per codepoint.
>
> In Python terms:
>
> if all(c <= '\xFF' for c in string):
>       use 1 byte per codepoint
> elif all(c <= '\xFFFF' for c in string):
>       use 2 bytes per codepoint
> else:
>       use 4 bytes per codepoint
>
Oops! That should, of course, be:

if all(c <= '\xFF' for c in string):
     use 1 byte per codepoint
elif all(c <= '\uFFFF' for c in string):
     use 2 bytes per codepoint
else:
     use 4 bytes per codepoint
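
For anyone who wants to try the rule above, here is a minimal runnable sketch. The function names bytes_per_code_point and measured_width are made up for illustration and are not part of Python. The second helper relies on sys.getsizeof, so what it reports is a CPython implementation detail (the flexible string representation from PEP 393) and should not be depended on in real code.

```python
import sys


def bytes_per_code_point(text):
    """Return the width (1, 2 or 4) implied by the rule above.

    Hypothetical helper: it mirrors the pseudocode by looking at the
    largest code point in the string.
    """
    largest = max(map(ord, text), default=0)  # an empty string counts as width 1
    if largest <= 0xFF:
        return 1
    if largest <= 0xFFFF:
        return 2
    return 4


def measured_width(ch):
    """Estimate how much one extra copy of ch adds to the object size.

    CPython-specific assumption: the fixed header cancels out when the
    sizes of ch * 2 and ch are subtracted.
    """
    return sys.getsizeof(ch * 2) - sys.getsizeof(ch)


if __name__ == "__main__":
    for sample in ("a", "\xe9", "\u20ac", "\U0001F600"):
        print(ascii(sample), bytes_per_code_point(sample), measured_width(sample))
```

Note that the width is chosen per string, not per character: a single code point above U+FFFF makes every code point in that string take 4 bytes.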
