
Re: How is unicode implemented behind the scenes?

Date Sun, 9 Mar 2014 13:56:39 +1100
Subject Re: How is unicode implemented behind the scenes?
From Chris Angelico <rosuav@gmail.com>
Cc Python List <python-list@python.org>
Newsgroups comp.lang.python
Message-ID <mailman.7947.1394333802.18130.python-list@python.org>


On Sun, Mar 9, 2014 at 1:08 PM, Dan Stromberg <drsalists@gmail.com> wrote:
> OK, I know that Unicode data is stored in an encoding on disk.
>
> But how is it stored in RAM?
>
> I realize I shouldn't write code that depends on any relevant
> implementation details, but knowing some of the more common
> implementation options would probably help build an intuition for
> what's going on internally.
>
> I've heard that characters are no longer all c bytes wide internally,
> so is it sometimes utf-8?
>

As of Python 3.3, it's as MRAB described. If you like, Python picks
one of three (or four) encodings, whichever is the narrowest that can
handle every character in the string:

1) ASCII (7-bit strings get some minor special treatment, e.g. it
knows the conversion to UTF-8 is the identity function)
2) Latin-1
3) UCS-2
4) UCS-4
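
One rough way to watch the width selection happen, assuming CPython 3.3
or later (sys.getsizeof reports implementation-specific sizes, so this
is a sketch and not something to rely on in real code). The helper name
bytes_per_char is made up for this illustration; it compares two
lengths of the same repeated character so that the fixed object header
cancels out:

```python
import sys

def bytes_per_char(ch):
    """Estimate per-character storage (CPython-specific assumption).

    Subtracting the sizes of two strings built from the same character
    removes the constant header, leaving 1000 characters' worth of data.
    """
    short, long_ = ch * 100, ch * 1100
    return (sys.getsizeof(long_) - sys.getsizeof(short)) // 1000

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),        # e with acute accent
    ("UCS-2", "\u20ac"),        # euro sign
    ("UCS-4", "\U0001f600"),    # an emoji outside the BMP
]
for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On a CPython 3.3+ build this should print 1, 1, 2 and 4 bytes per
character for the four rows.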

This means that finding the Nth codepoint in a string is simply a
matter of shifting N by either 0, 0, 1, or 2, and picking the right
number of bytes from that position. You can read the gory details in
PEP 393:

http://www.python.org/dev/peps/pep-0393/

but the important bit here is the "kind" field, which the PEP gives
in binary: 01 for Latin-1, 10 for UCS-2, 11 for UCS-4. (The
"ascii-only" flag is stored elsewhere.)
There's a functionally-identical field in Pike's strings, called
size_shift - 0 for ASCII or Latin-1, 1 for UCS-2, 2 for UCS-4.
Whichever it is, it's really efficient - and as an added bonus, all
those ASCII-only strings that scripts are full of (you know, words
like "print" and "len" and "int") are stored compactly, so it's much
tighter than the 3.2 builds, even narrow ones. It's pretty awesome!
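
The shift arithmetic described above can be imitated in pure Python.
This is only a model, not how CPython is written: pick_layout and
nth_codepoint are hypothetical helpers, and the standard codecs
latin-1, utf-16-le and utf-32-le stand in for the internal 1-, 2- and
4-byte buffers (utf-16-le matches UCS-2 here because it is only chosen
when every code point is below U+10000).

```python
def pick_layout(s):
    """Return (codec, shift) for the narrowest fixed-width layout for s."""
    highest = max(map(ord, s), default=0)
    if highest < 0x100:
        return "latin-1", 0      # 1 byte per code point (covers ASCII too)
    if highest < 0x10000:
        return "utf-16-le", 1    # 2 bytes per code point, no surrogate pairs
    return "utf-32-le", 2        # 4 bytes per code point

def nth_codepoint(s, n):
    """Fetch code point n by offset arithmetic on a fixed-width buffer."""
    codec, shift = pick_layout(s)
    # surrogatepass lets lone surrogates through the 2- and 4-byte codecs
    buf = s.encode(codec, "surrogatepass")
    width = 1 << shift           # 1, 2 or 4 bytes
    offset = n << shift          # the "shift N by 0, 1 or 2" step
    return int.from_bytes(buf[offset:offset + width], "little")

print(hex(nth_codepoint("h\xe9llo", 1)))
print(hex(nth_codepoint("a\u20acb", 1)))
print(chr(nth_codepoint("a\U0001f600b", 2)))
```

The point of the model is that lookup cost does not depend on N or on
the string's contents, only on one shift and one fixed-size read.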

ChrisA


