From: Chris Angelico <rosuav@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: How to waste computer memory?
Date: Fri, 18 Mar 2016 23:37:21 +1100

On Fri, Mar 18, 2016 at 10:46 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> On Fri, 18 Mar 2016 06:00 pm, Ian Kelly wrote:
>
>> On Thu, Mar 17, 2016 at 1:21 PM, Rick Johnson
>> <rantingrickjohnson@gmail.com> wrote:
>>> In the event that i change my mind about Unicode, and/or for
>>> the sake of others, who may want to know, please provide a
>>> list of languages that *YOU* think handle Unicode better than
>>> Python, starting with the best first. Thanks.
>
> Better than Python? Easy-peasy:
>
> List of languages with Unicode handling which is better than Python = []
>
> I'm not aware of any language with better or more complete Unicode
> functionality than Python's. (That doesn't necessarily mean that they don't
> exist.)

And this also doesn't preclude languages that have *as good* handling
as Python's, of which I know of one off-hand, and there may be any
number. (Trivial case: Take Python 3.5, change the definition of a
block to be { } instead of indentation, and release it as Bracethon
1.0. Voila, a distinct-yet-related language whose Unicode handling is
exactly as good as Python's.)

>> jmf has been asked this before, and as I recall he seems to feel that
>> UTF-8 should be used for all purposes, ignoring the limitations of
>> that encoding such as that indexing becomes an O(n) operation.
>
> Technically, UTF-8 doesn't *necessarily* imply indexing is O(n). For
> instance, your UTF-8 string might consist of an array of bytes containing
> the string, plus an array of indexes to the start of each code point. For
> example, the string:
>
> “abcπßЊ•𒀁”
>
> (including the quote marks) is 10 code points in length and 22 bytes as
> UTF-8. Grouping the (hex) bytes for each code point, we have:
>
> e2809c 61 62 63 cf80 c39f d08a e280a2 f0928081 e2809d
>
> so we could get an O(1) UTF-8 string by recording the bytes (in hex) plus the
> indexes (in decimal) in which each code point starts:
>
> e2809c616263cf80c39fd08ae280a2f0928081e2809d
>
> 0 3 4 5 6 8 10 12 15 19
>
> but (assuming each index needs 2 bytes, which supports strings up to 65535
> characters in length), that's actually LESS memory efficient than UTF-32:
> 42 bytes versus 40.
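
The quoted index-table scheme and its arithmetic can be checked with a
short Python sketch. The names `starts` and `char_at` are made up for
illustration, and the 2-byte index size is the assumption stated in
the quote, not anything a real implementation is known to use.

```python
# The example string from the quoted text, written with escapes:
# left quote, a, b, c, pi, sharp s, Cyrillic Nje, bullet, U+12001, right quote.
text = "\u201cabc\u03c0\u00df\u040a\u2022\U00012001\u201d"
data = text.encode("utf-8")

# A byte starts a code point unless it is a continuation byte (0b10xxxxxx).
starts = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]

def char_at(n):
    """Return code point n by slicing between two recorded start indexes."""
    end = starts[n + 1] if n + 1 < len(starts) else len(data)
    return data[starts[n]:end].decode("utf-8")

# Memory comparison, assuming 2 bytes per index as in the quote.
indexed_size = len(data) + 2 * len(starts)
utf32_size = len(text.encode("utf-32-le"))  # "-le" avoids counting a BOM

print(starts)
print(indexed_size, utf32_size)
```

Each `char_at` call is two list lookups and a slice, so it does not
depend on how far into the string the character sits.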

A lot of strings will have no more than 255 non-ASCII characters in
them. (For example, all strings with no more than 255 total
characters.) You could store, instead of the indexes themselves, a
series of one-byte offsets:

e2809c616263cf80c39fd08ae280a2f0928081e2809d
0 2 2 2 2 3 4 5 7 10

Locating a byte based on its character position is still O(1); you
look up that position in the offset table, add that to your original
character position, and you have the byte location. For strings with
too many non-ASCII codepoints, you'd need some other representation,
but at that point, it might be worth just switching to UTF-32.
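
Here is a rough Python sketch of that offset table, using the example
string from the quoted text. The names `offsets` and `byte_index` are
made up for illustration. One caveat, as far as I can tell: each
offset counts the accumulated extra bytes before that character, so
the one-byte limit is really on those extra bytes (up to three per
non-ASCII character) rather than on the number of characters.

```python
# Same example string as in the quoted text, written with escapes.
text = "\u201cabc\u03c0\u00df\u040a\u2022\U00012001\u201d"
data = text.encode("utf-8")

# Byte index at which each code point starts (skip continuation bytes).
starts = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]

# offsets[n] = (byte index of character n) - n, stored one byte each.
# bytes() raises ValueError if any offset exceeds 255, which is the
# point where some other representation would be needed.
offsets = bytes(s - n for n, s in enumerate(starts))

def byte_index(n):
    """O(1): character position plus its stored offset."""
    return n + offsets[n]

print(list(offsets))
print(byte_index(8))
```

With one byte per offset, this string costs 22 + 10 = 32 bytes, which
is under the 40 bytes of UTF-32.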

Of course, O(1) isn't the ultimate goal to the exclusion of all else.
For a simple sequential parser, indexing might be such a rare
operation that it's okay for it to be O(N), as you're never going to
index more than a few characters from a known position. Or if you're
trying to search a few gig of text, it's entirely possible that
transcoding into an indexable format is a complete waste of time, and
it's better to just work with a stream of bytes straight off the disk.
But for a general string type in a high level language, I'm normally
going to assume that indexing is fairly cheap.
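
For contrast, this is roughly what O(N) indexing over raw UTF-8 looks
like when there is no side table at all. It is only a sketch: the
function names are hypothetical, and it assumes the input is valid
UTF-8 and does no error checking.

```python
def char_len(lead):
    """Length in bytes of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    return 2

def index_by_scan(data, n):
    """Return code point n of UTF-8 bytes `data` by walking from the start."""
    pos = 0
    for _ in range(n):           # one step per preceding character: O(n)
        pos += char_len(data[pos])
    return data[pos:pos + char_len(data[pos])].decode("utf-8")

sample = "abc\u03c0\u00df".encode("utf-8")
print(index_by_scan(sample, 3))
```

Stepping a few characters forward from a known byte position uses the
same loop, which is why a sequential parser can live with this.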

ChrisA