| Started by | Ned Batchelder <ned@nedbatchelder.com> |
|---|---|
| First post | 2014-03-08 22:48 -0500 |
| Last post | 2014-03-08 22:48 -0500 |
| Articles | 1 — 1 participant |
Re: How is unicode implemented behind the scenes? Ned Batchelder <ned@nedbatchelder.com> - 2014-03-08 22:48 -0500
| From | Ned Batchelder <ned@nedbatchelder.com> |
|---|---|
| Date | 2014-03-08 22:48 -0500 |
| Subject | Re: How is unicode implemented behind the scenes? |
| Message-ID | <mailman.7949.1394336952.18130.python-list@python.org> |
On 3/8/14 9:08 PM, Dan Stromberg wrote:
> OK, I know that Unicode data is stored in an encoding on disk.
>
> But how is it stored in RAM?
>
> I realize I shouldn't write code that depends on any relevant
> implementation details, but knowing some of the more common
> implementation options would probably help build an intuition for
> what's going on internally.
>
> I've heard that characters are no longer all c bytes wide internally,
> so is it sometimes utf-8?
>
> Thanks.
>

In abstract terms, a Unicode string is a sequence of integers (code points). There are lots of ways to store a sequence of integers.

In Python 2.x, it's either a vector of 16-bit ints, or 32-bit ints. These are the Unicode representations known as UTF-16 and UTF-32, respectively, and which you have depends on whether you have a "narrow" or "wide" build of Python. You can tell the difference by examining sys.maxunicode, which is 65535 (narrow) or 1114111 (wide).

In Python 3.3, the representation was changed from narrow/wide to the so-called Flexible String Representation which others here have described. It uses either 1-, 2-, or 4-bytes per code point, depending on the set of code points in the string. It's specified in PEP 393: http://legacy.python.org/dev/peps/pep-0393/

--
Ned Batchelder, http://nedbatchelder.com
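A rough way to see both points from the interpreter is sketched below. It is not part of the original post: the helper name bytes_per_code_point and the sample characters are made up for illustration, and the measurement relies on sys.getsizeof reporting the string object's real allocation, which is a CPython implementation detail and not a language guarantee.

```python
import sys

# 65535 on a narrow Python 2 build, 1114111 on a wide build
# and on every Python 3.3+ interpreter.
print(sys.maxunicode)


def bytes_per_code_point(ch, n=1000):
    """Estimate storage per code point for strings made only of `ch`.

    Hypothetical helper: it compares two strings of different lengths so
    that the fixed object header cancels out of the subtraction.
    Assumes CPython, where sys.getsizeof reflects the actual allocation.
    """
    short = ch * n
    long_ = ch * (2 * n)
    return (sys.getsizeof(long_) - sys.getsizeof(short)) // n


# Sample code points: ASCII, Latin-1, Basic Multilingual Plane, astral.
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch), bytes_per_code_point(ch))
```

On CPython 3.3 or later, the four sample characters should come out at 1, 1, 2, and 4 bytes per code point, because the widest code point in a string decides the width used for the whole string. Other implementations are free to store strings differently.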