Path: csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <0420de60-b9b5-4ac4-ba7b-ca5ac2ca65fe@googlegroups.com>
References: <mailman.4618.1373613834.3114.python-list@python.org> <571a6dfe-fd66-42cf-92fc-8b97cbe6e9e4@googlegroups.com> <51DFDE65.5040001@Gmail.com> <CAN1F8qUFP3uX57HhiiUPaYqO3h_HiT8Q_YD=vCYky3EAWsdE7Q@mail.gmail.com> <mailman.4666.1373670835.3114.python-list@python.org> <4f1067f6-bc99-42ad-9166-37fb228b90e8@googlegroups.com> <mailman.5039.1374677274.3114.python-list@python.org> <0420de60-b9b5-4ac4-ba7b-ca5ac2ca65fe@googlegroups.com>
Date: Thu, 25 Jul 2013 20:14:46 +1000
Subject: Re: RE Module Performance
From: Chris Angelico <rosuav@gmail.com>
To: python-list@python.org
Content-Type: text/plain; charset=ISO-8859-1
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.5090.1374747295.3114.python-list@python.org>
Lines: 63
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:51212

On Thu, Jul 25, 2013 at 7:27 PM,  <wxjmfauth@gmail.com> wrote:
> A coding scheme works with a unique set of characters (the repertoire),
> and the implementation (the programming) works with a unique set
> of encoded code points. The critical step is the path
> {unique set of characters} <--> {unique set of encoded code points}

That's called Unicode. It maps the character 'A' to the code point
U+0041 and so on. Code points are integers. In fact, they are very
well represented in Python that way (also in Pike, fwiw):

>>> ord('A')
65
>>> chr(65)
'A'
>>> chr(123456)
'\U0001e240'
>>> ord(_)
123456

> In the byte string world, this step is a no-op.
>
> In Unicode, it is exactly the purpose of a "utf" to achieve this
> step. "utf": a confusing name covering at the same time the
> process and the result of the process.
> A "utf chunk", a series of bits (not bytes), hold intrisically
> the information about the character it is representing.

No, now you're looking at another level: how to store code points in
memory. That requires representing them as bits and bytes, because
that's how computer memory works.
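
To make the two levels concrete, here's a rough sketch (the helper
name stored_sizes is made up for illustration). The code point stays
the same integer throughout; only its byte-level form changes with the
encoding you pick:

```python
ENCODINGS = ('utf-8', 'utf-16-le', 'utf-32-le')

def stored_sizes(code_point):
    """Return {encoding: byte count} for one code point (hypothetical helper)."""
    ch = chr(code_point)
    sizes = {}
    for enc in ENCODINGS:
        raw = ch.encode(enc)
        # Decoding recovers the same integer whatever the byte layout was.
        assert ord(raw.decode(enc)) == code_point
        sizes[enc] = len(raw)
    return sizes

print(stored_sizes(0x41))     # 'A'
print(stored_sizes(123456))   # the astral code point used above
```

The explicit "-le" codec names are used so that no byte order mark is
added to the output, which would otherwise inflate the counts.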

> utf32: as a pointed many times. You are already using it (maybe
> without knowing it). Where? in fonts (OpenType technology),
> rendering engines, pdf files. Why? Because there is not other
> way to do it better.

And UTF-32 is an excellent system... as long as you're okay with
spending four bytes for every character.
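
A small sketch of that trade-off, assuming plain ASCII sample text and
a made-up helper name (code_point_at): every character costs four
bytes, but in exchange the i-th character always sits at byte offset
4*i.

```python
text = 'RE Module Performance'

# The '-le' variants are used so that no byte order mark is prepended.
utf8 = text.encode('utf-8')
utf32 = text.encode('utf-32-le')

def code_point_at(buf, i):
    """Read the i-th code point from a UTF-32-LE buffer by direct offset."""
    return int.from_bytes(buf[4 * i:4 * i + 4], 'little')

# Character count, UTF-8 byte count, UTF-32 byte count.
print(len(text), len(utf8), len(utf32))

# Fixed width means indexing is simple arithmetic, no scanning needed.
print(chr(code_point_at(utf32, 3)) == text[3])
```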

> See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0

I refuse to click this link. Give us a link to the
python-list@python.org archive, or gmane, or something else more
suited to the audience. I'm not going to Google Groups just to figure
out what you're saying.

> If you are not understanding my "editor" analogy. One other
> proposed exercise. Build/create a flexible iso-8859-X coding
> scheme. You will quickly understand where the bottleneck
> is.
> Two working ways:
> - stupidly with an editor and your fingers.
> - lazily with a sheet of paper and you head.

What has this to do with the editor?

> There is a clear difference between FSR and ucs-4/utf32.

Yes. Memory usage. A PEP 393 string can take half, or even a quarter,
of the memory the same string would need in fixed-width UTF-32. Other
than that, there's no difference.
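
You can see this for yourself. The following is only a sketch: it
assumes CPython 3.3 or later (other implementations may store strings
differently), and bytes_per_char is a made-up helper name. It
estimates the per-character cost by comparing two strings of different
lengths, so the fixed object header cancels out:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character for a string made only of ch.

    Hypothetical helper; relies on CPython's sys.getsizeof reporting
    the size of the string's internal buffer.
    """
    short = ch * n
    long_ = ch * (2 * n)
    return (sys.getsizeof(long_) - sys.getsizeof(short)) // n

for ch in ('a', '\xe9', '\u20ac', '\U0001e240'):
    print('U+%04X' % ord(ch), bytes_per_char(ch))
```

The width is picked per string, based on the widest code point it
contains, so a single astral character makes the whole string use four
bytes per character, which is the same as fixed UTF-32.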

ChrisA