Re: How is unicode implemented behind the scenes?

Newsgroups	comp.lang.python
Date	2014-03-09 00:39 -0800
References	<CAGGBd_rSN1bMHkQYix8Lo0TfXi3_k+Q9nu25vMokR1+Eumf5Cg@mail.gmail.com> <mailman.7943.1394332835.18130.python-list@python.org>
Message-ID	<751cbe5d-ebbe-4f4e-93a9-6012667297e3@googlegroups.com>
Subject	Re: How is unicode implemented behind the scenes?
From	wxjmfauth@gmail.com


On Sunday, 9 March 2014 03:40:28 UTC+1, MRAB wrote:
> On 2014-03-09 02:08, Dan Stromberg wrote:
> > OK, I know that Unicode data is stored in an encoding on disk.
> >
> > But how is it stored in RAM?
> >
> > I realize I shouldn't write code that depends on any relevant
> > implementation details, but knowing some of the more common
> > implementation options would probably help build an intuition for
> > what's going on internally.
> >
> > I've heard that characters are no longer all c bytes wide internally,
> > so is it sometimes utf-8?
> >
> No.
>
> From Python 3.3, it's an array of 1, 2 or 4 bytes per codepoint.
>
> In Python terms:
>
> if all(c <= '\xFF' for c in string):
>     width = 1  # use 1 byte per codepoint
> elif all(c <= '\uFFFF' for c in string):
>     width = 2  # use 2 bytes per codepoint
> else:
>     width = 4  # use 4 bytes per codepoint

A very, very nice recursive mathematical
absurdity.

jmf
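
For anyone who wants to try the rule quoted above, here is a minimal sketch of it as a runnable function. The names bytes_per_code_point and observed_growth are made up for this illustration and are not part of Python. The second helper relies on sys.getsizeof, whose results are implementation-specific, so treat what it prints as an observation about whichever interpreter you run it on, not a guarantee.

```python
import sys


def bytes_per_code_point(s):
    """Width the quoted rule would pick for s: 1, 2 or 4 bytes per code point."""
    # The widest code point decides the width for the whole string.
    widest = max(map(ord, s), default=0)
    if widest <= 0xFF:
        return 1
    if widest <= 0xFFFF:
        return 2
    return 4


def observed_growth(ch, n=1000):
    """Extra bytes reported when one more copy of ch is added (implementation-specific)."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


if __name__ == "__main__":
    for ch in ("a", "\xe9", "\u20ac", "\U0001F600"):
        # ascii() keeps the output readable on terminals without Unicode support.
        print(ascii(ch), bytes_per_code_point(ch), observed_growth(ch))
```

Note that a single wide character is enough to widen the whole string: bytes_per_code_point("abc\U0001F600") falls into the 4-byte case even though the first three characters are ASCII.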
