Re: Py 3.3, unicode / upper()

Date	2012-12-20 20:20 +0000
From	MRAB <python@mrabarnett.plus.com>
Subject	Re: Py 3.3, unicode / upper()
References	<2adb4a25-8ea3-441f-b8c0-ee6c87e4b19f@googlegroups.com> <a633063d-0887-48aa-8fd5-03d2bfe7d0f1@googlegroups.com>
Newsgroups	comp.lang.python
Message-ID	<mailman.1107.1356034858.29569.python-list@python.org>

On 2012-12-20 19:19, wxjmfauth@gmail.com wrote:
> Fact.
> In order to work comfortably and with efficiency with a "scheme for
> the coding of the characters", can be unicode or any coding scheme,
> one has to take into account two things: 1) work with a unique set
> of characters and 2) work with a contiguous block of code points.
>
> At this point, it should be noticed that I have not even written about
> the real coding, only about characters and code points.
>
> Now, let's take a look at what happens when one breaks the rules
> above and, precisely, if one attempts to work with multiple
> character sets or if one divides - artificially - the whole range
> of the unicode code points into chunks.
>
> The first (and it should be quite obvious) consequence is that
> you create bloated, unnecessary and useless code. I simplify
> the flexible string representation (FSR) and will use an "ascii" /
> "non-ascii" model/terminology.
>
> If you are an "ascii" user, a FSR model makes no sense. An
> "ascii" user will use, by definition, only "ascii characters".
>
> If you are a "non-ascii" user, the FSR model is also
> nonsense, because you are by definition a "non-ascii" user of
> "non-ascii" characters. Any optimisation for "ascii" users simply
> becomes irrelevant.
>
> In one sense, to escape from this, you have to be at the same time
> a non "ascii" user and a non "non-ascii" user. Impossible.
> In both cases, a FSR model is useless and in both cases you are
> forced to use bloated and unnecessary code.
>
> The rule is to treat every character of a unique set of characters
> of a coding scheme in, how to say, an "equal way". The problem
> can be seen the other way: every coding scheme has been built
> to work with a unique set of characters, otherwise it is not
> working properly!
>
[snip]
It's true that in an ideal world you would treat all codepoints the
same. However, this is a case where "practicality beats purity".

In order to accommodate every codepoint you need 3 bytes per codepoint
(although for pragmatic reasons it's 4 bytes per codepoint).
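
The arithmetic behind that figure can be checked quickly. This is only a
sketch; the names MAX_CODE_POINT, bits_needed and bytes_needed are mine,
chosen for illustration:

```python
import sys

# The Unicode code space ends at U+10FFFF.
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to represent the largest code point.
bits_needed = MAX_CODE_POINT.bit_length()

# Round up to whole bytes: this is the theoretical minimum width.
bytes_needed = (bits_needed + 7) // 8

print(bits_needed, bytes_needed)

# On Python 3.3 and later, sys.maxunicode reports the same upper limit.
print(hex(sys.maxunicode))
```

21 bits fit in 3 bytes, but a 3-byte unit is awkward to address, hence
the usual choice of 4.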

But not all codepoints are used equally. Those in the "astral planes",
for example, are used rarely, so the vast majority of the time you
would be using twice as much memory as strictly necessary. There are
also, in reality, many cases in which strings contain only ASCII-range
codepoints, although they may not be visible to the average user, being
the names of functions and attributes in program code, or tags and
attributes in HTML and XML.
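
To put rough numbers on that, here is a simplified model of the width
rule described in PEP 393: the widest codepoint in a string decides
whether every codepoint in it is stored in 1, 2 or 4 bytes. The helper
width_for and the sample strings are my own, for illustration only, and
the model ignores the fixed per-object header:

```python
def width_for(s):
    """Bytes per codepoint under a simplified PEP 393-style rule."""
    highest = max(map(ord, s), default=0)
    if highest <= 0xFF:      # ASCII and Latin-1 range
        return 1
    if highest <= 0xFFFF:    # rest of the Basic Multilingual Plane
        return 2
    return 4                 # astral planes


# Hypothetical samples: an identifier, an HTML tag, and some text.
samples = ["upper", "<div class='x'>", "na\u00efve",
           "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac",
           "\U0001F600 smile"]

for s in samples:
    flexible = width_for(s) * len(s)   # payload with a per-string width
    fixed = 4 * len(s)                 # payload at 4 bytes per codepoint
    print(ascii(s), flexible, fixed)
```

Only the last sample, which contains an astral codepoint, costs the
same under both schemes.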

FSR is a pragmatic solution to dealing with limited resources.
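
If you want to see the effect on a real interpreter, something like the
following should do. It is a sketch that assumes CPython 3.3 or later;
sys.getsizeof reports implementation-specific sizes, so other
implementations may print different numbers. The function name
measured_bytes_per_codepoint is mine:

```python
import sys


def measured_bytes_per_codepoint(ch, n=1000):
    """Growth in object size per extra codepoint (CPython-specific).

    Comparing two lengths of the same repeated character cancels out
    the fixed object header.
    """
    short = ch * n
    long_ = ch * (2 * n)
    return (sys.getsizeof(long_) - sys.getsizeof(short)) // n


for label, ch in [("ASCII", "a"),
                  ("Latin-1", "\u00e9"),
                  ("BMP", "\u20ac"),
                  ("astral", "\U0001F600")]:
    print(label, measured_bytes_per_codepoint(ch))
```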

Would you prefer there to be a switch that makes strings always use 4
bytes per codepoint for those users and systems where memory is no
object?

