Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Ian Kelly <ian.g.kelly@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: How to waste computer memory?
Date: Fri, 18 Mar 2016 07:57:57 -0600
Lines: 49
Message-ID: <mailman.318.1458309520.12893.python-list@python.org>
References: <a2639027-c69c-46df-a7a5-45a677b9e01d@googlegroups.com> <265377f4-741d-4aa2-9338-239f56f8bc57@googlegroups.com> <mailman.302.1458284448.12893.python-list@python.org> <56ebea83$0$1599$c3e8da3$5496439d@news.astraweb.com> <CAPTjJmoxXh2+894LjcVjPJ-qP=bJRWEeeSXU_3Dn673+0hyLJQ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <CAPTjJmoxXh2+894LjcVjPJ-qP=bJRWEeeSXU_3Dn673+0hyLJQ@mail.gmail.com>
Precedence: list
Xref: csiph.com comp.lang.python:105213

On Fri, Mar 18, 2016 at 6:37 AM, Chris Angelico <rosuav@gmail.com> wrote:
> On Fri, Mar 18, 2016 at 10:46 PM, Steven D'Aprano <steve@pearwood.info> wrote:
>> Technically, UTF-8 doesn't *necessarily* imply indexing is O(n). For
>> instance, your UTF-8 string might consist of an array of bytes containing
>> the string, plus an array of indexes to the start of each code point. For
>> example, the string:
>>
>> “abcπßЊ•𒀁”
>>
>> (including the quote marks) is 10 code points in length and 22 bytes as
>> UTF-8. Grouping the (hex) bytes for each code point, we have:
>>
>> e2809c 61 62 63 cf80 c39f d08a e280a2 f0928081 e2809d
>>
>> so we could get an O(1) UTF-8 string by recording the bytes (in hex) plus the
>> indexes (in decimal) at which each code point starts:
>>
>> e2809c616263cf80c39fd08ae280a2f0928081e2809d
>>
>> 0 3 4 5 6 8 10 12 15 19
>>
>> but (assuming each index needs 2 bytes, which supports strings up to 65535
>> characters in length), that's actually LESS memory efficient than UTF-32:
>> 42 bytes versus 40.
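
A minimal sketch of the index-table idea quoted above, in case it helps to see the arithmetic. The class name IndexedUTF8 and the INDEX_WIDTH constant are made up for illustration; this is not how any real Python implementation stores strings.

```python
from array import array

INDEX_WIDTH = 2  # assumed bytes per index entry, as in the quoted example


class IndexedUTF8:
    """Hypothetical UTF-8 string plus a table of code point start offsets."""

    def __init__(self, text):
        self.data = text.encode("utf-8")
        # "H" is an unsigned short, so byte offsets above 65535 will not fit.
        self.starts = array("H")
        pos = 0
        for ch in text:
            self.starts.append(pos)
            pos += len(ch.encode("utf-8"))

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, i):
        # Only non-negative indexes are handled in this sketch.
        if not 0 <= i < len(self.starts):
            raise IndexError("index out of range")
        start = self.starts[i]
        if i + 1 < len(self.starts):
            end = self.starts[i + 1]
        else:
            end = len(self.data)
        # Two table lookups and one slice: no scan over the bytes.
        return self.data[start:end].decode("utf-8")

    def payload_bytes(self):
        # UTF-8 bytes plus the index table, ignoring any object overhead.
        return len(self.data) + INDEX_WIDTH * len(self.starts)


# The example string from the quoted text, written with escapes.
sample = "\u201cabc\u03c0\u00df\u040a\u2022\U00012001\u201d"
indexed = IndexedUTF8(sample)
utf32_bytes = len(sample.encode("utf-32-le"))  # 4 bytes per code point, no BOM
```

With the sample string, the table holds 0 3 4 5 6 8 10 12 15 19, and the payload is 22 + 20 = 42 bytes against 40 for UTF-32.
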
>
> A lot of strings will have no more than 255 non-ASCII characters in
> them. (For example, all strings with no more than 255 total
> characters.) You could store, instead of the indexes themselves, a
> series of one-byte offsets:
>
> e2809c616263cf80c39fd08ae280a2f0928081e2809d
> 0 2 2 2 2 3 4 5 7 10
>
> Locating a byte based on its character position is still O(1); you
> look up that position in the offset table, add that to your original
> character position, and you have the byte location. For strings with
> too many non-ASCII codepoints, you'd need some other representation,
> but at that point, it might be worth just switching to UTF-32.
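
The one-byte offset variant can be sketched the same way. The helper names build_offsets, byte_index and char_at are hypothetical. Note that each offset is the number of extra bytes before that position, so in this sketch a string made of 4-byte characters overflows a one-byte offset well before it reaches 255 characters.

```python
def build_offsets(text):
    """Return (utf8_bytes, offsets) where offsets[i] = byte index - char index."""
    offsets = bytearray()
    byte_pos = 0
    for char_pos, ch in enumerate(text):
        delta = byte_pos - char_pos
        if delta > 255:
            # Past this point some other representation would be needed.
            raise ValueError("offset does not fit in one byte")
        offsets.append(delta)
        byte_pos += len(ch.encode("utf-8"))
    return text.encode("utf-8"), bytes(offsets)


def byte_index(offsets, char_pos):
    # O(1): one table lookup and one addition.
    return char_pos + offsets[char_pos]


def char_at(data, offsets, char_pos):
    start = byte_index(offsets, char_pos)
    if char_pos + 1 < len(offsets):
        end = byte_index(offsets, char_pos + 1)
    else:
        end = len(data)
    return data[start:end].decode("utf-8")


# The example string from the quoted text, written with escapes.
sample = "\u201cabc\u03c0\u00df\u040a\u2022\U00012001\u201d"
data, offsets = build_offsets(sample)
total_payload = len(data) + len(offsets)  # UTF-8 bytes plus one byte per char
```

For the sample string the offsets come out as 0 2 2 2 2 3 4 5 7 10, and the payload is 22 + 10 = 32 bytes.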

So this uses approximately twice as much memory as the FSR and still
requires switching on some form of character width in the
implementation? Yeah, I don't think the RUE is going to go for that.
8-)
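
A rough way to check the "twice as much memory" figure, counting character payload only. This assumes the FSR rule from PEP 393 (1, 2 or 4 bytes per character, chosen by the largest code point in the string) and ignores object headers; the function names are made up for the comparison.

```python
def fsr_payload(text):
    """Payload bytes under a PEP 393 style width rule (headers ignored)."""
    largest = max(map(ord, text), default=0)
    if largest < 0x100:
        width = 1
    elif largest < 0x10000:
        width = 2
    else:
        width = 4
    return width * len(text)


def offset_payload(text):
    """Payload bytes for UTF-8 plus one offset byte per character."""
    return len(text.encode("utf-8")) + len(text)


ascii_text = "hello world"
# The example string from the quoted text, written with escapes.
sample = "\u201cabc\u03c0\u00df\u040a\u2022\U00012001\u201d"

ascii_sizes = (fsr_payload(ascii_text), offset_payload(ascii_text))
sample_sizes = (fsr_payload(sample), offset_payload(sample))
```

For pure ASCII text the offset scheme costs exactly twice the FSR payload (22 bytes against 11 here). The quoted sample string is the other extreme: one astral character pushes the FSR to 4 bytes per character, 40 bytes against 32.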