Date: Wed, 4 Jun 2014 06:01:52 -0500
From: Tim Chase <python.list@tim.thechases.com>
To: Paul Rubin <no.email@nospam.invalid>
Subject: Re: Micro Python -- a lean and efficient implementation of Python 3
In-Reply-To: <7xoay9w1h0.fsf@ruckus.brouhaha.com>
References: <CANw+MznPsKgJiW6e_O370VUsmVVxBfQ=M_7WUyU7+wNh+-qefA@mail.gmail.com> <CAPTjJmoB0eMMMhjUz++yYV2CEv=2xUXx7P8UuRvCk7y7gB-4+Q@mail.gmail.com> <20140603194949.3147497d@x34f> <CAPTjJmrwGVaJKmzLiX8buZQmGxrGJV657Jnb7fsK7j1-pLxtVA@mail.gmail.com> <mailman.10646.1401831682.18130.python-list@python.org> <44acd692-5dcd-4e5f-8238-7fbe0de4db2a@googlegroups.com> <mailman.10673.1401853976.18130.python-list@python.org> <c04434ce-cbc4-49ab-b312-24f1631dd894@googlegroups.com> <mailman.10684.1401866176.18130.python-list@python.org> <538ecdef$0$11109$c3e8da3@news.astraweb.com> <7xoay9w1h0.fsf@ruckus.brouhaha.com>
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.10697.1401879750.18130.python-list@python.org>

On 2014-06-04 00:58, Paul Rubin wrote:
> Steven D'Aprano <steve@pearwood.info> writes:
> >> Maybe there's a use-case for a microcontroller that works in
> >> ISO-8859-5 natively, thus using only eight bits per character, 
> > That won't even make the Russians happy, since in Russia there
> > are multiple incompatible legacy encodings.
> 
> I've never understood why not use UTF-8 for everything.

If you use UTF-8 for everything, then you end up in a world where
string-indexing (see ChrisA's other side thread on this topic) is no
longer an O(1) operation, but an O(N) operation: characters take
anywhere from one to four bytes, so finding the Nth one means
scanning from the start.  Some of us slice strings for a living. ;-)
I understand that using UTF-32 would allow us to maintain O(1)
indexing at the cost of every string occupying 4 bytes per character.
The FSR (Flexible String Representation, PEP 393; again, as I
understand it) stores each string at 1, 2, or 4 bytes per character,
chosen by the widest character that string actually contains, so
indexing stays O(1) and narrow strings stay small.
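
To make the trade-off concrete, here is a minimal sketch (not how any
particular implementation does it).  The names utf8_index and
bytes_per_char are made up for illustration.  The first has to walk
the bytes to find the Nth character; the second estimates how many
bytes per character the interpreter spends, and assumes CPython 3.3
or later, where the FSR applies.

```python
import sys


def utf8_index(data: bytes, index: int) -> str:
    """Return the character at position `index` in UTF-8 encoded `data`.

    Hypothetical helper: it scans linearly, so the cost is O(N).
    Negative indices are not supported in this sketch.
    """
    if index < 0:
        raise IndexError("negative indices not supported in this sketch")
    seen = -1      # index of the most recent character start we passed
    start = None   # byte offset where the wanted character begins
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else
        # starts a new character.
        if (byte & 0xC0) != 0x80:
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("string index out of range")
    return data[start:].decode("utf-8")


def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Estimate per-character storage by comparing two string sizes.

    Assumes CPython 3.3+; other implementations may store strings
    differently and give other answers.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) / n


if __name__ == "__main__":
    text = "a\u0416\u20ac\U0001F600z"
    encoded = text.encode("utf-8")
    print([utf8_index(encoded, i) for i in range(len(text))])
    for ch in ("a", "\u0416", "\U0001F600"):
        print(repr(ch), bytes_per_char(ch))
```

The scan only ever looks at the top two bits of each byte, which is
what makes UTF-8 resynchronizable, but it still has to touch every
byte before the one you want.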

At the cost of complexity and some extra memory for the index, an
O(N) lookup could be tweaked down to O(log N) by using an internal
balanced tree of offsets-to-chunks (with the chunk size set at the
point below which scanning linearly is faster than navigating the
tree).  One might even endow the algorithm with FSR smarts, so each
chunk/fragment could use a different encoding in memory, and linearly
iterating over the string would walk the tree, returning each decoded
piece. </random_ramblings>
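
A rough sketch of that idea, under a couple of simplifying
assumptions: the string is immutable, so a sorted list of chunk start
offsets searched with bisect stands in for the balanced tree, and
every chunk is stored at a fixed width (1, 2, or 4 bytes per
character), so there is no linear scan inside a chunk.  The class
name ChunkedString and the max_chunk parameter are invented here, and
lone surrogates and slicing are not handled.

```python
from bisect import bisect_right
from itertools import groupby

# Fixed-width codec for each per-character width, in bytes.
_CODECS = {1: "latin-1", 2: "utf-16-le", 4: "utf-32-le"}


def _width(ch: str) -> int:
    """Narrowest fixed width (in bytes) that can hold this character."""
    cp = ord(ch)
    if cp < 0x100:
        return 1
    if cp < 0x10000:
        return 2
    return 4


class ChunkedString:
    """Hypothetical immutable string stored as per-chunk encoded pieces."""

    def __init__(self, text: str, max_chunk: int = 64):
        self._starts = []   # character offset at which each chunk begins
        self._chunks = []   # (width, encoded bytes) for each chunk
        pos = 0
        # Group runs of characters needing the same width, then cap
        # each run at max_chunk characters.
        for width, run in groupby(text, key=_width):
            run = "".join(run)
            for i in range(0, len(run), max_chunk):
                piece = run[i:i + max_chunk]
                self._starts.append(pos)
                self._chunks.append((width, piece.encode(_CODECS[width])))
                pos += len(piece)
        self._length = pos

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, index: int) -> str:
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError("string index out of range")
        # Binary search over chunk starts: O(log number_of_chunks).
        k = bisect_right(self._starts, index) - 1
        width, data = self._chunks[k]
        offset = (index - self._starts[k]) * width
        return data[offset:offset + width].decode(_CODECS[width])

    def __iter__(self):
        # Linear iteration walks the chunks in order, decoding each one.
        for width, data in self._chunks:
            yield from data.decode(_CODECS[width])


if __name__ == "__main__":
    s = ChunkedString("hello \u0416\u0443\u043a \U0001F600 world", max_chunk=4)
    print(len(s), s[6], s[10], "".join(s))
```

A mutable version would need a real tree so that inserts and deletes
don't force renumbering every later offset; the sorted list only
works because nothing changes after construction.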

-tkc