Re: How is unicode implemented behind the scenes?

Started by	MRAB <python@mrabarnett.plus.com>
First post	2014-03-09 02:40 +0000
Last post	2014-03-09 14:53 +0000
Articles	4 — 4 participants

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: How is unicode implemented behind the scenes? MRAB <python@mrabarnett.plus.com> - 2014-03-09 02:40 +0000
    Re: How is unicode implemented behind the scenes? wxjmfauth@gmail.com - 2014-03-09 00:39 -0800
      Re: How is unicode implemented behind the scenes? Rustom Mody <rustompmody@gmail.com> - 2014-03-09 03:32 -0700
        Re: How is unicode implemented behind the scenes? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-03-09 14:53 +0000

#68059 — Re: How is unicode implemented behind the scenes?

From	MRAB <python@mrabarnett.plus.com>
Date	2014-03-09 02:40 +0000
Subject	Re: How is unicode implemented behind the scenes?
Message-ID	<mailman.7943.1394332835.18130.python-list@python.org>

On 2014-03-09 02:08, Dan Stromberg wrote:
> OK, I know that Unicode data is stored in an encoding on disk.
>
> But how is it stored in RAM?
>
> I realize I shouldn't write code that depends on any relevant
> implementation details, but knowing some of the more common
> implementation options would probably help build an intuition for
> what's going on internally.
>
> I've heard that characters are no longer all c bytes wide internally,
> so is it sometimes utf-8?
>
No.

From Python 3.3, it's an array of 1, 2 or 4 bytes per codepoint.

In Python terms:

if all(c <= '\xFF' for c in string):
     use 1 byte per codepoint
elif all(c <= '\uFFFF' for c in string):
     use 2 bytes per codepoint
else:
     use 4 bytes per codepoint
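
The rule above can be sketched as a small runnable function. This only illustrates the selection logic, assuming CPython 3.3 or later; `storage_width` is a hypothetical helper name, not part of Python, and the real decision is made in C when the string is created.

```python
def storage_width(s: str) -> int:
    """Return the bytes per codepoint the rule above would pick for s.

    Hypothetical helper for illustration; not a CPython API.
    """
    # The widest codepoint decides the width for the whole string.
    widest = max(map(ord, s), default=0)
    if widest <= 0xFF:
        return 1
    elif widest <= 0xFFFF:
        return 2
    else:
        return 4


for sample in ("abc", "caf\xe9", "\u20ac", "\U0001F600"):
    print(ascii(sample), storage_width(sample))
```

On CPython, comparing `sys.getsizeof` for strings of equal length but different widest codepoints should show the same effect, though the exact byte counts are an implementation detail.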

#68073

From	wxjmfauth@gmail.com
Date	2014-03-09 00:39 -0800
Message-ID	<751cbe5d-ebbe-4f4e-93a9-6012667297e3@googlegroups.com>
In reply to	#68059

On Sunday, 9 March 2014 03:40:28 UTC+1, MRAB wrote:
> On 2014-03-09 02:08, Dan Stromberg wrote:
> > OK, I know that Unicode data is stored in an encoding on disk.
> >
> > But how is it stored in RAM?
> >
> > I realize I shouldn't write code that depends on any relevant
> > implementation details, but knowing some of the more common
> > implementation options would probably help build an intuition for
> > what's going on internally.
> >
> > I've heard that characters are no longer all c bytes wide internally,
> > so is it sometimes utf-8?
> >
> No.
>
> From Python 3.3, it's an array of 1, 2 or 4 bytes per codepoint.
>
> In Python terms:
>
> if all(c <= '\xFF' for c in string):
>      use 1 byte per codepoint
> elif all(c <= '\uFFFF' for c in string):
>      use 2 bytes per codepoint
> else:
>      use 4 bytes per codepoint

A very, very nice recursive mathematical
absurdity.

jmf

#68078

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-03-09 03:32 -0700
Message-ID	<a47099d2-45df-46eb-8317-59409b18ec9a@googlegroups.com>
In reply to	#68073

On Sunday, March 9, 2014 2:09:32 PM UTC+5:30, wxjm...@gmail.com wrote:
> On Sunday, 9 March 2014 03:40:28 UTC+1, MRAB wrote:
> > On 2014-03-09 02:08, Dan Stromberg wrote:
> > > OK, I know that Unicode data is stored in an encoding on disk.
> > > But how is it stored in RAM?
> > > I realize I shouldn't write code that depends on any relevant
> > > implementation details, but knowing some of the more common
> > > implementation options would probably help build an intuition for
> > > what's going on internally.
> > > I've heard that characters are no longer all c bytes wide internally,
> > > so is it sometimes utf-8?
> > No.
> >  From Python 3.3, it's an array of 1, 2 or 4 bytes per codepoint.
> > In Python terms:
> > if all(c <= '\xFF' for c in string):
> >      use 1 byte per codepoint
> > elif all(c <= '\uFFFF' for c in string):
> >      use 2 bytes per codepoint
> > else:
> >      use 4 bytes per codepoint

> A very, very nice recursive mathematical absurdity.

As a profoundly astute mathematician
v v n r m a
can be parsed in 42 different ways (5th Catalan number)

Which parse did you intend?
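
The count can be checked with the closed form C_n = C(2n, n) / (n + 1). A rough sketch, assuming each parse is a full binary bracketing of the six words, so the relevant number is C_5; `catalan` is a name made up here for illustration.

```python
from math import comb  # available from Python 3.8


def catalan(n: int) -> int:
    """n-th Catalan number, counting from C_0 = 1."""
    return comb(2 * n, n) // (n + 1)


words = "very very nice recursive mathematical absurdity".split()
# Six words need five binary joins, hence C_5 bracketings.
print(catalan(len(words) - 1))  # prints 42
```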

#68081

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2014-03-09 14:53 +0000
Message-ID	<mailman.7954.1394376810.18130.python-list@python.org>
In reply to	#68078

On 09/03/2014 10:32, Rustom Mody wrote:
> On Sunday, March 9, 2014 2:09:32 PM UTC+5:30, wxjm...@gmail.com wrote:
>> On Sunday, 9 March 2014 03:40:28 UTC+1, MRAB wrote:
>>> On 2014-03-09 02:08, Dan Stromberg wrote:
>>>> OK, I know that Unicode data is stored in an encoding on disk.
>>>> But how is it stored in RAM?
>>>> I realize I shouldn't write code that depends on any relevant
>>>> implementation details, but knowing some of the more common
>>>> implementation options would probably help build an intuition for
>>>> what's going on internally.
>>>> I've heard that characters are no longer all c bytes wide internally,
>>>> so is it sometimes utf-8?
>>> No.
>>>   From Python 3.3, it's an array of 1, 2 or 4 bytes per codepoint.
>>> In Python terms:
>>> if all(c <= '\xFF' for c in string):
>>>       use 1 byte per codepoint
>>> elif all(c <= '\uFFFF' for c in string):
>>>       use 2 bytes per codepoint
>>> else:
>>>       use 4 bytes per codepoint
>
>> A very, very nice recursive mathematical absurdity.
>
> As a profoundly astute mathematician
> v v n r m a
> can be parsed in 42 different ways (5th Catalan number)
>
> Which parse did you intend?
>
>

Please don't feed this particular troll, it's a complete waste of time.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence
