From: Marko Rauhamaa
Newsgroups: comp.lang.python
Subject: Re: How to waste computer memory?
Date: Sat, 19 Mar 2016 17:02:29 +0200
Message-ID: <877fgylddm.fsf@elektro.pacujo.net>

Steven D'Aprano:

> On Sat, 19 Mar 2016 08:31 pm, Marko Rauhamaa wrote:
>
>
>> Using the surrogate mechanism, UTF-16 can support all 1,114,112
>> potential Unicode characters.
>>
>> But Unicode doesn't contain 1,114,112 characters—the surrogates are
>> excluded from Unicode, and definitely cannot be encoded using
>> UTF-anything.
>
> Surrogates are most certainly part of the Unicode standard, and they are
> necessary in UTF-16.

Yes, but UTF-16 produces 16-bit values that are outside Unicode. UTF-16
can encode *any* valid Unicode, but it cannot encode surrogate characters.
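To put numbers on that, here is a rough sketch of the arithmetic behind the 1,114,112 figure and of how a surrogate pair carries a code point above U+FFFF. The constant names and the helper to_surrogate_pair() are made up for illustration; only the final cross-check uses Python's real UTF-16 codec.

```python
# Sketch only: these names are illustrative, not part of any library.
MAX_CODE_POINT = 0x10FFFF
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF

code_points = MAX_CODE_POINT + 1                     # 17 planes * 65,536 = 1,114,112
surrogates = SURROGATE_LAST - SURROGATE_FIRST + 1    # 2,048 reserved values
scalar_values = code_points - surrogates             # 1,112,064 encodable values


def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into a UTF-16 (high, low) pair."""
    if not 0x10000 <= cp <= MAX_CODE_POINT:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = cp - 0x10000               # fits in 20 bits
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)

# Cross-check against Python's own encoder (big-endian, no BOM).
encoded = "\U0001F600".encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

The surrogate values show up only as 16-bit code units inside the encoded form. They are not among the 1,112,064 values that the encodings are meant to carry.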

    >>> '\udc10'.encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc10' in position 0: surrogates not allowed
    >>> '\udc10'.encode('utf-16')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-16' codec can't encode character '\udc10' in position 0: surrogates not allowed
    >>> '\udc10'.encode('utf-32')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-32' codec can't encode character '\udc10' in position 0: surrogates not allowed

>> We still don't know if the final result will be UCS-4 everywhere (with
>> all 2**32 code points allowed?!) or UTF-8 everywhere.
>
> Unicode does not have 2**32 code points. It is guaranteed to never
> exceed the 2**21 code points already allocated. (Many of those are
> still unused.)

Never say never.

> In the future, we'll have so much memory that the idea of using
> variable width in-memory formats will seem absurd.

I'm starting to think that future is already here.


Marko