Date: Sat, 8 Mar 2014 18:08:38 -0800
Subject: How is unicode implemented behind the scenes?
From: Dan Stromberg
To: Python List
Newsgroups: comp.lang.python

OK, I know that Unicode data is stored in an encoding on disk. But how is it stored in RAM?

I realize I shouldn't write code that depends on such implementation details, but knowing some of the more common implementation options would probably help build an intuition for what's going on internally.

I've heard that characters are no longer all the same number of bytes wide internally, so is it sometimes UTF-8?

Thanks.
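
One rough way to probe the "not all the same width" claim is to measure how much a string's in-memory size grows per added character. The sketch below assumes CPython (sys.getsizeof is not meaningful on every implementation), and the helper name bytes_per_char is made up for illustration. It shows only what one interpreter happens to do, not what the language guarantees.

```python
import sys


def bytes_per_char(sample, n=1000):
    """Estimate in-memory bytes per character for a string built from `sample`.

    Hypothetical helper: compares two strings of different lengths so that
    the fixed per-object overhead cancels out of the difference.
    """
    small = sys.getsizeof(sample * n)
    large = sys.getsizeof(sample * (2 * n))
    return (large - small) / n


if __name__ == "__main__":
    # One ASCII, one Latin-1, one other BMP, and one non-BMP character.
    for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
        print("U+%04X: %s bytes per character" % (ord(ch), bytes_per_char(ch)))
```

If the per-character figure changes with the widest code point in the string, the interpreter is picking a fixed width per string rather than storing UTF-8. A UTF-8 representation would instead show 1, 2, 3 and 4 bytes for the four samples above.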