| Started by | MRAB <python@mrabarnett.plus.com> |
|---|---|
| First post | 2014-03-09 02:40 +0000 |
| Last post | 2014-03-09 14:53 +0000 |
| Articles | 4 — 4 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: How is unicode implemented behind the scenes? MRAB <python@mrabarnett.plus.com> - 2014-03-09 02:40 +0000
Re: How is unicode implemented behind the scenes? wxjmfauth@gmail.com - 2014-03-09 00:39 -0800
Re: How is unicode implemented behind the scenes? Rustom Mody <rustompmody@gmail.com> - 2014-03-09 03:32 -0700
Re: How is unicode implemented behind the scenes? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-03-09 14:53 +0000
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2014-03-09 02:40 +0000 |
| Subject | Re: How is unicode implemented behind the scenes? |
| Message-ID | <mailman.7943.1394332835.18130.python-list@python.org> |
On 2014-03-09 02:08, Dan Stromberg wrote:
> OK, I know that Unicode data is stored in an encoding on disk.
>
> But how is it stored in RAM?
>
> I realize I shouldn't write code that depends on any relevant
> implementation details, but knowing some of the more common
> implementation options would probably help build an intuition for
> what's going on internally.
>
> I've heard that characters are no longer all c bytes wide internally,
> so is it sometimes utf-8?
>
No.
From Python 3.3, it's an array of 1, 2 or 4 bytes per codepoint.
In Python terms:
    if all(c <= '\xFF' for c in string):
        use 1 byte per codepoint
    elif all(c <= '\uFFFF' for c in string):
        use 2 bytes per codepoint
    else:
        use 4 bytes per codepoint
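As a rough, runnable version of the rule above (the flexible string representation that PEP 393 introduced in Python 3.3), here is a minimal sketch. The names `storage_width` and `measured_width` are made up for illustration and are not part of any Python API. The measurement relies on how `sys.getsizeof` happens to grow in CPython, which is an implementation detail, so treat the measured numbers as an observation and not a guarantee.

```python
import sys


def storage_width(text: str) -> int:
    """Return 1, 2 or 4: the bytes per code point implied by the rule above."""
    # The widest code point in the string decides the width for the whole string.
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def measured_width(char: str, n: int = 100) -> int:
    """Estimate bytes per code point from how the object size grows.

    Assumption: CPython 3.3 or later. Other implementations may report
    something different, or not support sys.getsizeof at all.
    """
    return sys.getsizeof(char * (n + 1)) - sys.getsizeof(char * n)


for sample in ("a", "\xe9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(sample):04X}: rule says {storage_width(sample)}, "
          f"measured {measured_width(sample)}")
```

Note that the width is chosen per string, not per character: one code point above U+FFFF makes every code point in that string take 4 bytes.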
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2014-03-09 00:39 -0800 |
| Message-ID | <751cbe5d-ebbe-4f4e-93a9-6012667297e3@googlegroups.com> |
| In reply to | #68059 |
On Sunday, 9 March 2014 03:40:28 UTC+1, MRAB wrote:
> On 2014-03-09 02:08, Dan Stromberg wrote:
> > OK, I know that Unicode data is stored in an encoding on disk.
> >
> > But how is it stored in RAM?
> >
> > I realize I shouldn't write code that depends on any relevant
> > implementation details, but knowing some of the more common
> > implementation options would probably help build an intuition for
> > what's going on internally.
> >
> > I've heard that characters are no longer all c bytes wide internally,
> > so is it sometimes utf-8?
> >
> No.
>
> From Python 3.3, it's an array of 1, 2 or 4 bytes per codepoint.
>
> In Python terms:
>
>     if all(c <= '\xFF' for c in string):
>         use 1 byte per codepoint
>     elif all(c <= '\uFFFF' for c in string):
>         use 2 bytes per codepoint
>     else:
>         use 4 bytes per codepoint

A very, very nice recursive mathematical absurdity.

jmf
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-03-09 03:32 -0700 |
| Message-ID | <a47099d2-45df-46eb-8317-59409b18ec9a@googlegroups.com> |
| In reply to | #68073 |
On Sunday, March 9, 2014 2:09:32 PM UTC+5:30, wxjm...@gmail.com wrote:
> On Sunday, 9 March 2014 03:40:28 UTC+1, MRAB wrote:
> > On 2014-03-09 02:08, Dan Stromberg wrote:
> > > OK, I know that Unicode data is stored in an encoding on disk.
> > > But how is it stored in RAM?
> > > I realize I shouldn't write code that depends on any relevant
> > > implementation details, but knowing some of the more common
> > > implementation options would probably help build an intuition for
> > > what's going on internally.
> > > I've heard that characters are no longer all c bytes wide internally,
> > > so is it sometimes utf-8?
> > No.
> > From Python 3.3, it's an array of 1, 2 or 4 bytes per codepoint.
> > In Python terms:
> >     if all(c <= '\xFF' for c in string):
> >         use 1 byte per codepoint
> >     elif all(c <= '\uFFFF' for c in string):
> >         use 2 bytes per codepoint
> >     else:
> >         use 4 bytes per codepoint
> A very, very nice recursive mathematical absurdity.

As a profoundly astute mathematician
v v n r m a
can be parsed in 42 different ways (5th Catalan number)

Which parse did you intend?
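For anyone checking the arithmetic: the six words abbreviated as "v v n r m a" have five gaps between them, and the number of ways to fully bracket six items pairwise is the 5th Catalan number. A minimal sketch follows; the helper name `catalan` is just illustrative, and `math.comb` needs Python 3.8 or later.

```python
from math import comb


def catalan(n: int) -> int:
    """n-th Catalan number via the closed form C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


words = "very very nice recursive mathematical absurdity".split()
# Six words leave five places to split, so the count is catalan(5).
print(catalan(len(words) - 1))  # prints 42
```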
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2014-03-09 14:53 +0000 |
| Message-ID | <mailman.7954.1394376810.18130.python-list@python.org> |
| In reply to | #68078 |
On 09/03/2014 10:32, Rustom Mody wrote:
> On Sunday, March 9, 2014 2:09:32 PM UTC+5:30, wxjm...@gmail.com wrote:
>> On Sunday, 9 March 2014 03:40:28 UTC+1, MRAB wrote:
>>> On 2014-03-09 02:08, Dan Stromberg wrote:
>>>> OK, I know that Unicode data is stored in an encoding on disk.
>>>> But how is it stored in RAM?
>>>> I realize I shouldn't write code that depends on any relevant
>>>> implementation details, but knowing some of the more common
>>>> implementation options would probably help build an intuition for
>>>> what's going on internally.
>>>> I've heard that characters are no longer all c bytes wide internally,
>>>> so is it sometimes utf-8?
>>> No.
>>> From Python 3.3, it's an array of 1, 2 or 4 bytes per codepoint.
>>> In Python terms:
>>>     if all(c <= '\xFF' for c in string):
>>>         use 1 byte per codepoint
>>>     elif all(c <= '\uFFFF' for c in string):
>>>         use 2 bytes per codepoint
>>>     else:
>>>         use 4 bytes per codepoint
>
>> A very, very nice recursive mathematical absurdity.
>
> As a profoundly astute mathematician
> v v n r m a
> can be parsed in 42 different ways (5th Catalan number)
>
> Which parse did you intend?
>

Please don't feed this particular troll, it's a complete waste of time.

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence