
How to waste computer memory?

Started by	wxjmfauth@gmail.com
First post	2016-03-17 07:34 -0700
Last post	2016-03-18 11:18 -0700
Articles	20 on this page of 72 — 18 participants


  How to waste computer memory? wxjmfauth@gmail.com - 2016-03-17 07:34 -0700
    Re: How to waste computer memory? Rick Johnson <rantingrickjohnson@gmail.com> - 2016-03-17 12:21 -0700
      Re: How to waste computer memory? cl@isbd.net - 2016-03-17 20:31 +0000
        Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 07:42 +1100
          Re: How to waste computer memory? Grant Edwards <invalid@invalid.invalid> - 2016-03-17 21:08 +0000
            Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 08:13 +1100
              Re: How to waste computer memory? Paul Rubin <no.email@nospam.invalid> - 2016-03-17 14:30 -0700
            Re: How to waste computer memory? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-17 22:32 +0000
            Re: How to waste computer memory? cl@isbd.net - 2016-03-17 22:42 +0000
          Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-17 23:11 +0200
            Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 08:17 +1100
            Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-17 21:26 +0000
              Re: How to waste computer memory? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-17 22:38 +0000
              Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 10:02 +1100
          Re: How to waste computer memory? alister <alister.ware@ntlworld.com> - 2016-03-17 21:37 +0000
            Re: How to waste computer memory? alister <alister.ware@ntlworld.com> - 2016-03-17 21:43 +0000
            Re: How to waste computer memory? Gene Heskett <gheskett@wdtv.com> - 2016-03-17 20:51 -0400
              Re: How to waste computer memory? Rick Johnson <rantingrickjohnson@gmail.com> - 2016-03-17 18:47 -0700
              Re: How to waste computer memory? cl@isbd.net - 2016-03-18 10:44 +0000
                Re: How to waste computer memory? Gene Heskett <gheskett@wdtv.com> - 2016-03-18 10:11 -0400
                Re: How to waste computer memory? Grant Edwards <invalid@invalid.invalid> - 2016-03-19 13:50 +0000
      Re: How to waste computer memory? Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-18 01:00 -0600
        Re: How to waste computer memory? Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-03-18 10:26 +0200
          Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-18 17:26 +0200
            Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 03:58 +1100
            Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-18 23:02 +0200
              Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-18 23:28 +0200
                Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 00:03 +0200
                  Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 09:49 +0200
                    Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 10:22 +0200
                      Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 11:40 +0200
                  Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 19:38 +1100
              Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 00:14 -0700
                Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 02:17 -0700
              Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 19:14 +1100
                Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 11:31 +0200
                  Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 03:40 -0700
                  Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 13:07 +0200
                    Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-19 12:24 +0000
                      Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 14:43 +0200
                      Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 01:18 +1100
                        Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-19 15:14 +0000
                          Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-19 15:20 +0000
                  Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 22:32 +1100
                    Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 14:42 +0200
                      Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 01:39 +1100
                        Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 16:56 +0200
                    Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 07:01 -0700
                  Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 01:56 +1100
                    Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 17:02 +0200
                      Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 02:47 +1100
                        Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 18:12 +0200
                          Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 16:01 +1100
                            Re: How to waste computer memory? Rustom Mody <rustompmody@gmail.com> - 2016-03-19 23:20 -0700
                              Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 22:06 +1100
                                Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-20 22:22 +1100
                                  Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 23:14 +1100
                                    Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-20 23:27 +1100
                              Re: How to waste computer memory? Ben Bacarisse <ben.usenet@bsb.me.uk> - 2016-03-20 14:55 +0000
                                Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-20 17:36 +0200
                                Re: How to waste computer memory? Random832 <random832@fastmail.com> - 2016-03-20 14:17 -0400
                            Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-20 09:30 +0200
        Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-18 03:50 -0700
        Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-18 22:46 +1100
          Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-18 22:58 +1100
            Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-18 12:53 -0700
          Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 23:37 +1100
          Re: How to waste computer memory? Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-18 07:57 -0600
      Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 03:44 +1100
        Re: How to waste computer memory? Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-03-18 20:22 +0200
          Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-18 13:03 -0700
    Re: How to waste computer memory? sohcahtoa82@gmail.com - 2016-03-18 11:18 -0700


#105272

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-03-20 01:18 +1100
Message-ID	<56ed5f9a$0$1605$c3e8da3$5496439d@news.astraweb.com>
In reply to	#105264

On Sat, 19 Mar 2016 11:24 pm, BartC wrote about combining characters:

> So a string that looks like:
> 
> "ññññññññññññññññññññññññññññññññññññññññññññññññññ"
> 
> can have 2**50 different representations? 

Yes.

> And occupy somewhere between 50 and 200 bytes? Or is that 400?

The minimum storage would use a legacy encoding (like MacRoman, or Latin-1)
with the composed ñ character. That gives 50 x 1-byte characters, or 50
bytes.

The maximum storage would be if all 50 characters were decomposed into two
code points (giving 100 code points), and then stored as UTF-32, giving 400
bytes all up.
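
Both extremes are easy to check. A minimal sketch, assuming Python 3 and the standard unicodedata module; the variable names are illustrative only:

```python
import unicodedata

composed = "\u00f1" * 50                              # 50 precomposed ñ (U+00F1)
decomposed = unicodedata.normalize("NFD", composed)   # n + U+0303, 50 times

# Smallest case: one byte per character in a legacy 8-bit encoding.
min_bytes = len(composed.encode("latin-1"))

# Largest case: two code points per ñ, four bytes per code point.
# "utf-32-be" is used so that no byte order mark is added to the count.
max_bytes = len(decomposed.encode("utf-32-be"))

# Each of the 50 positions can independently be composed or decomposed.
representations = 2 ** 50

print(len(composed), len(decomposed), min_bytes, max_bytes)
```

Plain "utf-32" would add a 4-byte byte order mark, so the explicit big-endian codec keeps the count at 400.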


> OK...

You say that as if 400 bytes was a lot.

Besides, this is hardly any different from (say) a pure ASCII version of
the "permille" (per thousand) symbol. In Unicode I can write ‰ (two bytes
in UTF-16) but in ASCII I am forced to write O/oo (four bytes), or
worse, "per thousand" (12 bytes). Imagine a string of "‰"*50, written in
ASCII, for a total of 600 bytes...

Yes, this is silly. Really, if you've got 50 ñ in a string, they take up the
space they take up, and memory is cheap. The days of thinking that 128
characters is all you need (7 bit ASCII) are long, long gone, just like the
days when it was appropriate for ints to be 16 bits.

When I first started programming, the default "integer" type in Pascal,
Forth and other languages was 16 bits, which meant that the largest number
you can represent in a calculation was 32767. My four-function calculator
had an 8 digit display and could calculate up to 99999999, while Pascal
choked on 32767. (Or 65535 if you used unsigned numbers.) Now, I routinely
and without hesitation generate thousand-plus bit numbers like 2**10000,
and my computer calculates and prints the result faster than I can enter
the calculation in the first place. Worrying about the fact that characters
use more than 8 bits is oh-so-1990s.


-- 
Steven


#105277

From	BartC <bc@freeuk.com>
Date	2016-03-19 15:14 +0000
Message-ID	<ncjq6e$bvb$1@dont-email.me>
In reply to	#105272

On 19/03/2016 14:18, Steven D'Aprano wrote:
> On Sat, 19 Mar 2016 11:24 pm, BartC wrote about combining characters:

>> And occupy somewhere between 50 and 200 bytes? Or is that 400?

>> OK...
>
> You say that as if 400 bytes was a lot.

No, just unpredictable.

> Besides, this is hardly any different from (say) a pure ASCII version of
> the "permille" (per thousand) symbol. In Unicode I can write ‰ (two bytes
> in UTF-16) but in ASCII I am forced to write O/oo (four bytes), or
> worse, "per thousand" (12 bytes). Imagine a string of "‰"*50, written in
> ASCII, for a total of 600 bytes...

Those kinds of problems are well known with ASCII, for example needing 
to compare strings but ignoring case, or treating tabs as spaces. It's 
clear that dealing with those properly goes beyond the remit of basic 
string processing in a language.

With Unicode there are a whole bunch of other problems, and some people 
expect basic string handling to be able to deal with all of them. (I 
think Unicode should be dealt with at the next level up. Then some of 
us can stay at the bottom level that is more efficient and works 99% of 
the time on average, and just about 100% for most.)

> Yes, this is silly. Really, if you've got 50 ñ in a string, they take up the
> space they take up, and memory is cheap.

Which is about 3000 decimal digits, slightly more than 1KB in packed 
binary. In BCD it would be 1.5KB. At one-byte per digit (eg. ASCII) it's 
3KB. At 4 bytes per (eg. UCS4), it's 12KB.
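
For reference, these figures for 2**10000 can be reproduced with a few lines of integer arithmetic. This is only a sketch; the variable names are made up for illustration:

```python
n = 2 ** 10000

digits = len(str(n))                        # number of decimal digits
binary_bytes = (n.bit_length() + 7) // 8    # packed binary, rounded up to bytes
bcd_bytes = (digits + 1) // 2               # BCD: two decimal digits per byte
ascii_bytes = digits                        # one byte per digit
ucs4_bytes = digits * 4                     # four bytes per digit

print(digits, binary_bytes, bcd_bytes, ascii_bytes, ucs4_bytes)
```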

What would you say to someone advocating 12 times as much storage for 
long integers as is used now? After all memory is cheap!

> and my computer calculates and prints the result faster than I can enter
> the calculation in the first place. Worrying about the fact that characters
> use more than 8 bits is oh-so-1990s.

We still need to worry about it. Whatever memory is being used up (ram, 
cache, flash, disk) 16-bit characters will use twice as much, and 32-bit 
four times as much. And the bandwidth necessary to access or transmit will 
also be twice or four times as much.

But the existence of UTF-8 means something /has/ been done about it, or 
some of it; somebody /has/ worried about it.
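
The size difference is easy to measure. A small sketch, assuming Python 3; the sample strings and the helper name encoded_sizes are arbitrary choices for illustration:

```python
def encoded_sizes(text):
    """Return the encoded length in bytes for each UTF (no byte order mark)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

samples = {
    "ascii": "hello world",      # 11 ASCII characters
    "accented": "ma\u00f1ana",   # 6 characters, one of them non-ASCII
    "astral": "\U00010000",      # 1 supplementary-plane character
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

For pure ASCII text, UTF-8 costs one byte per character, while the fixed-width forms cost two and four.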

> The days of thinking that 128 characters is all you need (7 bit ASCII)
> are long, long gone, just like the days when it was appropriate for
> ints to be 16 bits.

Some things haven't actually changed that much.

Word sizes might have doubled from 32 bits on a mainframe to 64 bits now 
(temporarily reducing to 8 and 16 along the way for micros and minis).

But the English alphabet still has 26 letters. Keyboards still have 
around 100 keys. And programming languages and text formats still 
predominantly use an ASCII subset for their keywords and identifiers.

-- 
Bartc


#105278

From	BartC <bc@freeuk.com>
Date	2016-03-19 15:20 +0000
Message-ID	<ncjqiq$db4$1@dont-email.me>
In reply to	#105277

On 19/03/2016 15:14, BartC wrote:

> Which is about 3000 decimal digits, slightly more than 1KB in packed
> binary. In BCD it would be 1.5KB. At one-byte per digit (eg. ASCII) it's
> 3KB. At 4 bytes per (eg. UCS4), it's 12KB.

The comment refers to this which inexplicably got snipped (not my fault 
at all..):

[Steven D'Aprano:]
 >> Now, I routinely
 >> and without hesitation generate thousand-plus bit numbers like
 >> 2**10000,

-- 
Bartc


#105263

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-03-19 22:32 +1100
Message-ID	<56ed38bb$0$1584$c3e8da3$5496439d@news.astraweb.com>
In reply to	#105259

On Sat, 19 Mar 2016 09:18 pm, Chris Angelico wrote:

> On Sat, Mar 19, 2016 at 8:31 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Unicode made several (understandable but grave) mistakes along the way:
>>
>>    * normalization
>>
> 
> Elaborate please? What's such a big mistake here?

As usual, Unicode problems are generally due to backwards compatibility.
Blame the old legacy encodings, which invented the "dead keys"
a.k.a. "combining character" technique. Of course, they had a reasonable
excuse at the time, but Unicode's requirement of being able to losslessly
handle all legacy character set standards means that Unicode has to provide
the same functionality.

The problem is not so much the existence of combining characters, but that
*some* but not all accented characters are available in two forms: a
composed single code point, and a decomposed pair of code points. This adds
complexity and means that equality of characters is not well-defined.
(Hence Unicode punts on the whole "character" thing and just talks about
code points.)
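
To make the two forms concrete, here is a short sketch assuming Python 3. The q-with-tilde example is my own choice of a letter that, as far as I know, has no precomposed code point:

```python
import unicodedata

composed = "\u00f1"        # ñ as a single code point
decomposed = "n\u0303"     # n followed by COMBINING TILDE

# The two strings render the same but compare unequal as code point sequences.
raw_equal = composed == decomposed

# After normalising both to the same form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# q + tilde has no composed form, so NFC leaves it as two code points.
q_tilde = unicodedata.normalize("NFC", "q\u0303")

print(raw_equal, nfc_equal, len(q_tilde))
```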

-- 
Steven


#105265

From	Marko Rauhamaa <marko@pacujo.net>
Date	2016-03-19 14:42 +0200
Message-ID	<87oaaaljui.fsf@elektro.pacujo.net>
In reply to	#105263

Steven D'Aprano <steve@pearwood.info>:

> As usual, Unicode problems are generally due to backwards
> compatibility. Blame the old legacy encodings, which invented the
> "dead keys" a.k.a. "combining character" technique. Of course, they
> had a reasonable excuse at the time, but Unicode's requirement of
> being able to losslessly handle all legacy character set standards
> means that Unicode has to provide the same functionality.

The combining characters allow for a maze of twisty little combinations,
all alike. There's no limit to the number of diacritics you can pile on,
under and next to the base character.

Was that universality unavoidable? Maybe it was. Deep down, all scripts
are two-dimensional.

> The problem is not so much the existence of combining characters, but that
> *some* but not all accented characters are available in two forms: a
> composed single code point, and a decomposed pair of code points.

Also, is an a with ring on top and another ring on bottom the same
character as an a with ring on bottom and another ring on top?

> This adds complexity and means that equality of characters is not
> well-defined. (Hence Unicode punts on the whole "character" thing and
> just talks about code points.)

The problem is not theoretical. If I implement a web form and someone
enters "Aña" as their name, how do I make sure queries find the name
regardless of the unicode code point sequence? I have to normalize using
unicodedata.normalize().
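
A rough sketch of what that looks like in practice. NameIndex and normalize_key are hypothetical names invented for this example, and NFC is an arbitrary choice of form; the only real API used is unicodedata.normalize():

```python
import unicodedata

def normalize_key(text, form="NFC"):
    """Return text in one fixed normalization form, for use as a lookup key."""
    return unicodedata.normalize(form, text)

class NameIndex:
    """Hypothetical in-memory index that normalizes names on the way in and out."""

    def __init__(self):
        self._records = {}

    def add(self, name, record):
        self._records[normalize_key(name)] = record

    def find(self, name):
        return self._records.get(normalize_key(name))

index = NameIndex()
index.add("A\u00f1a", {"id": 1})    # stored with precomposed ñ
hit = index.find("An\u0303a")       # queried with n + combining tilde
print(hit)
```

The point of the design is that normalization happens at both boundaries, so callers never have to remember it.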

When glorifying Python's advanced Unicode capabilities, are we careful
to emphasize the necessity of unicodedata.normalize() everywhere? Should
Python normalize strings unconditionally and transparently? What does
the O(1) character lookup mean under normalization?

Some weeks ago I had to spend 30 minutes to debug my Python program when
a user complained it didn't work. Turns out they had accidentally
invoked the program using a space and a composing tilde instead of the
ASCII ~. There was no visual indication of a problem on the screen, but
the Python program acted up.
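
For what it is worth, input like that can at least be detected. A sketch assuming Python 3; find_combining_marks is a made-up helper, and it only catches characters with a nonzero canonical combining class, so it is not an exhaustive check:

```python
import unicodedata

def find_combining_marks(text):
    """Return (index, name) for each character with a nonzero combining class."""
    return [(i, unicodedata.name(ch, "<unnamed>"))
            for i, ch in enumerate(text)
            if unicodedata.combining(ch)]

suspicious = " \u0303/file"   # space + COMBINING TILDE, looks like "~/file"
plain = "~/file"              # ASCII tilde

print(find_combining_marks(suspicious))
print(find_combining_marks(plain))
```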


Marko


#105273

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-03-20 01:39 +1100
Message-ID	<56ed64b4$0$1596$c3e8da3$5496439d@news.astraweb.com>
In reply to	#105265

On Sat, 19 Mar 2016 11:42 pm, Marko Rauhamaa wrote:

> The problem is not theoretical. If I implement a web form and someone
> enters "Aña" as their name, how do I make sure queries find the name
> regardless of the unicode code point sequence? I have to normalize using
> unicodedata.normalize().

I didn't say that it was theoretical. It is a real problem, but it is a
problem with human languages: the number of characters-with-accents is
vast, possibly impossibly vast. They can't all have unique code points.

I must admit I had completely missed your example of multiple combining
characters, that's a good one. Here's the example again:

a + combining ring above + combining ring below, versus
a + combining ring below + combining ring above

Naturally just comparing them gives unequal:

py> s = "a\u030A\u0325"
py> t = "a\u0325\u030A"
py> s == t
False

But we can normalise them:

====  =============  =============  ==================  =================
Form  NFC            NFKC           NFD                 NFKD
====  =============  =============  ==================  =================
s     U+1E01,030A    U+1E01,030A    U+0061,0325,030A    U+0061,0325,030A
t     U+1E01,030A    U+1E01,030A    U+0061,0325,030A    U+0061,0325,030A
====  =============  =============  ==================  =================

As you can see, *any* of the normalisation forms will put the code points
into the same, canonical order, making them equal.
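
The table can be regenerated with a short loop. A sketch assuming Python 3; code_points is a throwaway helper named for this example:

```python
import unicodedata

s = "a\u030A\u0325"   # a + ring above + ring below
t = "a\u0325\u030A"   # a + ring below + ring above

def code_points(text):
    """Format a string as comma-separated hexadecimal code points."""
    return ",".join("%04X" % ord(ch) for ch in text)

results = {}
for form in ("NFC", "NFKC", "NFD", "NFKD"):
    ns = unicodedata.normalize(form, s)
    nt = unicodedata.normalize(form, t)
    results[form] = (code_points(ns), code_points(nt))
    print(form, code_points(ns), code_points(nt), ns == nt)
```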

> When glorifying Python's advanced Unicode capabilities, are we careful
> to emphasize the necessity of unicodedata.normalize() everywhere? Should
> Python normalize strings unconditionally and transparently? What does
> the O(1) character lookup mean under normalization?
> 
> Some weeks ago I had to spend 30 minutes to debug my Python program when
> a user complained it didn't work. Turns out they had accidentally
> invoked the program using a space and a composing tilde instead of the
> ASCII ~. There was no visual indication of a problem on the screen, but
> the Python program acted up.

We recently had somebody here who wrote capital I by pressing the lower case
l on the keyboard. Should a pure-ASCII program be able to operate without
malfunction if the user confuses 0 and O, or I l and 1? What about ' and `
or possibly even '' and "?

-- 
Steven


#105274

From	Marko Rauhamaa <marko@pacujo.net>
Date	2016-03-19 16:56 +0200
Message-ID	<87bn6aldmy.fsf@elektro.pacujo.net>
In reply to	#105273

Steven D'Aprano <steve@pearwood.info>:

> On Sat, 19 Mar 2016 11:42 pm, Marko Rauhamaa wrote:
>> When glorifying Python's advanced Unicode capabilities, are we
>> careful to emphasize the necessity of unicodedata.normalize()
>> everywhere? Should Python normalize strings unconditionally and
>> transparently? What does the O(1) character lookup mean under
>> normalization?
>> 
>> Some weeks ago I had to spend 30 minutes to debug my Python program
>> when a user complained it didn't work. Turns out they had
>> accidentally invoked the program using a space and a composing tilde
>> instead of the ASCII ~. There was no visual indication of a problem
>> on the screen, but the Python program acted up.
>
> We recently had somebody here who wrote capital I by pressing the
> lower case l on the keyboard. Should a pure-ASCII program be able to
> operate without malfunction if the user confuses 0 and O, or I l and
> 1? What about ' and ` or possibly even '' and "?

What I'm talking about is that maybe Python should treat canonically
equivalent strings equivalently, that is, indistinguishably under any
external inspection.

Anyway, Python's Unicode support is a great thing, but Unicode is a big
can of worms. Far from being a paradise, it's more of a case of picking
your poison.


Marko


#105271

From	wxjmfauth@gmail.com
Date	2016-03-19 07:01 -0700
Message-ID	<37f00078-41eb-44aa-bf40-7e4e6fa8e100@googlegroups.com>
In reply to	#105263

On Saturday, 19 March 2016 at 12:32:25 UTC+1, Steven D'Aprano wrote:
> On Sat, 19 Mar 2016 09:18 pm, Chris Angelico wrote:
> 
> > On Sat, Mar 19, 2016 at 8:31 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> >> Unicode made several (understandable but grave) mistakes along the way:
> >>
> >>    * normalization
> >>
> > 
> > Elaborate please? What's such a big mistake here?
> 
> As usual, Unicode problems are generally due to backwards compatibility.
> Blame the old legacy encodings, which invented the "dead keys"
> a.k.a. "combining character" technique. Of course, they had a reasonable
> excuse at the time, but Unicode's requirement of being able to losslessly
> handle all legacy character set standards means that Unicode has to provide
> the same functionality.
> 
> The problem is not so much the existence of combining characters, but that
> *some* but not all accented characters are available in two forms: a
> composed single code point, and a decomposed pair of code points. This adds
> complexity and means that equality of characters is not well-defined.
> (Hence Unicode punts on the whole "character" thing and just talks about
> code points.)
> 
> 
> 
> -- 
> Steven

I'm laughing, I'm laughing. You cannot imagine how
I'm laughing...


#105275

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-03-20 01:56 +1100
Message-ID	<56ed68bb$0$1604$c3e8da3$5496439d@news.astraweb.com>
In reply to	#105259

On Sat, 19 Mar 2016 08:31 pm, Marko Rauhamaa wrote:

>    Using the surrogate mechanism, UTF-16 can support all 1,114,112
>    potential Unicode characters.
> 
> But Unicode doesn't contain 1,114,112 characters—the surrogates are
> excluded from Unicode, and definitely cannot be encoded using
> UTF-anything.

Surrogates are most certainly part of the Unicode standard, and they are
necessary in UTF-16. (You cannot represent astral characters without them!)
So in a UTF-16 stream, a *pair* of surrogates is nothing unusual. They just
represent a SMP code point.

However, *single* surrogates are an error. For example, we see this FAQ:

Q: How do I convert an unpaired UTF-16 surrogate to UTF-32?

A: If an unpaired surrogate is encountered when converting ill-formed UTF-16
data, any conformant converter must treat this as an error. By representing
such an unpaired surrogate on its own, the resulting UTF-32 data stream
would become ill-formed. While it faithfully reflects the nature of the
input, Unicode conformance requires that encoding form conversion always
results in valid data stream.

http://www.unicode.org/faq/utf_bom.html#utf32-7

But nobody says that programming languages must deal with only conformant
converters and valid Unicode sequences. It is an unfortunate fact of life
that even if you don't generate them yourself, somebody else will, so you
need to be able to deal with them.
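
As an illustration of dealing with them, here is how Python's own UTF-16 decoder can be told to react to an unpaired surrogate. A sketch assuming Python 3; the byte string is a hand-built example:

```python
# UTF-16-LE bytes: a lone high surrogate (D800) followed by the letter "A".
ill_formed = b"\x00\xd8\x41\x00"

# The default (strict) error handler refuses the data, as a conformant
# converter is required to.
try:
    ill_formed.decode("utf-16-le")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The "replace" handler substitutes U+FFFD REPLACEMENT CHARACTER instead.
replaced = ill_formed.decode("utf-16-le", errors="replace")

print(strict_ok, ascii(replaced))
```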

[...]
> We still don't know if the final result will be UCS-4 everywhere (with
> all 2**32 code points allowed?!) or UTF-8 everywhere.

Unicode does not have 2**32 code points. It is guaranteed to never exceed
the 2**21 code points already allocated. (Many of those are still unused.)
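
The exact ceiling is U+10FFFF, which gives 1,114,112 code points and fits in 21 bits. Python exposes that limit directly; a quick check, assuming Python 3.3 or later:

```python
import sys

max_code_point = sys.maxunicode          # highest code point Python accepts
total_code_points = max_code_point + 1   # code points are numbered from 0

# Asking for a character beyond the ceiling is an error.
try:
    chr(total_code_points)
    beyond_ok = True
except ValueError:
    beyond_ok = False

print(hex(max_code_point), total_code_points, beyond_ok)
```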

As far as I am concerned, the future is clear:

UTF-8 for transmission and storage formats, where fast random access is not
necessary;

UTF-32 for in-memory formats, where O(1) random access is advantageous.
Possibly with certain in-memory optimizations to save space, where such can
be done transparently.
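
CPython 3.3 and later already do something along these lines: each string is stored with 1, 2 or 4 bytes per code point, depending on the widest character it contains. The sketch below assumes CPython specifically (sys.getsizeof results are implementation details), and bytes_per_char is a helper invented for this example:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing strings of length n and 2n."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

ascii_width = bytes_per_char("a")            # ASCII only
bmp_width = bytes_per_char("\u20ac")         # EURO SIGN, needs 16 bits
astral_width = bytes_per_char("\U00010000")  # supplementary plane, needs 21 bits

print(ascii_width, bmp_width, astral_width)
```

Subtracting two sizes cancels the fixed object header, so only the per-character cost remains.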

In the future, we will no more balk at using four whole bytes for a code
point than we now balk at using eight bytes for floating point numbers. The
mathematical advantages of float Doubles are just overwhelming, and the
only reason for using fewer than 64 bits is if you care more about getting
a fast answer than an accurate answer.

(I'm reminded of one of my wife's former roadies, back in the 70s, crossing
the US desert in a van. On being told that he was heading in the wrong
direction for their next gig, he replied "Who cares? We're making great
time!")

In the future, we'll have so much memory that the idea of using variable
width in-memory formats will seem absurd. 

-- 
Steven


#105276

From	Marko Rauhamaa <marko@pacujo.net>
Date	2016-03-19 17:02 +0200
Message-ID	<877fgylddm.fsf@elektro.pacujo.net>
In reply to	#105275

Steven D'Aprano <steve@pearwood.info>:

> On Sat, 19 Mar 2016 08:31 pm, Marko Rauhamaa wrote:
>
>
>>    Using the surrogate mechanism, UTF-16 can support all 1,114,112
>>    potential Unicode characters.
>> 
>> But Unicode doesn't contain 1,114,112 characters—the surrogates are
>> excluded from Unicode, and definitely cannot be encoded using
>> UTF-anything.
>
> Surrogates are most certainly part of the Unicode standard, and they are
> necessary in UTF-16.

Yes, but UTF-16 produces 16-bit values that are outside Unicode. UTF-16
can encode *any* valid Unicode, but it cannot encode surrogate
characters.

   >>> '\udc10'.encode('utf-8')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeEncodeError: 'utf-8' codec can't encode character '\udc10' in position 0: surrogates not allowed
   >>> '\udc10'.encode('utf-16')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeEncodeError: 'utf-16' codec can't encode character '\udc10' in position 0: surrogates not allowed
   >>> '\udc10'.encode('utf-32')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeEncodeError: 'utf-32' codec can't encode character '\udc10' in position 0: surrogates not allowed

>> We still don't know if the final result will be UCS-4 everywhere (with
>> all 2**32 code points allowed?!) or UTF-8 everywhere.
>
> Unicode does not have 2**32 code points. It is guaranteed to never
> exceed the 2**21 code points already allocated. (Many of those are
> still unused.)

Never say never.

> In the future, we'll have so much memory that the idea of using
> variable width in-memory formats will seem absurd.

I'm starting to think that future is already here.


Marko


#105281

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-03-20 02:47 +1100
Message-ID	<56ed749e$0$1583$c3e8da3$5496439d@news.astraweb.com>
In reply to	#105276

On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:

> Steven D'Aprano <steve@pearwood.info>:
> 
>> On Sat, 19 Mar 2016 08:31 pm, Marko Rauhamaa wrote:
>>
>>
>>>    Using the surrogate mechanism, UTF-16 can support all 1,114,112
>>>    potential Unicode characters.
>>> 
>>> But Unicode doesn't contain 1,114,112 characters—the surrogates are
>>> excluded from Unicode, and definitely cannot be encoded using
>>> UTF-anything.
>>
>> Surrogates are most certainly part of the Unicode standard, and they are
>> necessary in UTF-16.
> 
> Yes, but UTF-16 produces 16-bit values that are outside Unicode. 

Show me.

Before you answer, if your answer is "surrogate pairs", that is incorrect.
Surrogate pairs is how UTF-16 encodes astral characters.

For example, the UTF-16 *code unit sequence* 0xD800 0xDC00 does not
represent "code points U+D800,DC00". It represents the *single* code point
U+10000 "LINEAR B SYLLABLE B008 A". The code points U+D800 and U+DC00 are
reserved for the use of UTF-16 as surrogates.

This means that UTF-16 cannot encode lone surrogates. It cannot encode, say,
the code point U+D800 on its own, because it looks like half of a SMP code
point, which is an error. And it cannot encode U+D800 immediately followed
by U+DC00, because that would be interpreted as U+10000. So there is a
range of code points which cannot be represented in UTF-16.
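
The arithmetic behind surrogate pairs is short enough to write out. A sketch assuming Python 3; the two function names are my own, and the result is checked against Python's UTF-16 codec:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

pair = to_surrogate_pair(0x10000)
codec_bytes = chr(0x10000).encode("utf-16-be")

print([hex(unit) for unit in pair], codec_bytes)
```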

Where UTF-16 goes, UTF-8 and UTF-32 must follow. It is a requirement of
Unicode that you must be able to freely and losslessly convert between the
three UTFs. (I'm not sure if that also applies to UTF-7.) Since UTF-16
*cannot* represent this specific range of code points, then UTF-8 and
UTF-32 must be *forbidden* from doing the same.

Note that the UTF-8 and UTF-32 formats are perfectly capable of representing
lone surrogates. UTF-32, for example would simply pad the code point with
zeroes: U+D800 would be represented as the four bytes 0x0000D800. UTF-8 has
a well-defined 3-byte sequence that corresponds to it. But that is invalid,
since it violates the requirement that it be freely and losslessly
translatable into UTF-16.
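
Python lets you see those byte sequences if you explicitly opt out of the check with the "surrogatepass" error handler. A sketch assuming Python 3.4 or later, where that handler also works with the UTF-32 codecs:

```python
lone = "\ud800"   # an unpaired high surrogate

# Strict encoding refuses the lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" emits the bytes the format would otherwise forbid.
utf8_bytes = lone.encode("utf-8", errors="surrogatepass")
utf32_bytes = lone.encode("utf-32-be", errors="surrogatepass")

print(strict_ok, utf8_bytes, utf32_bytes)
```

The output of such an encode is not valid UTF-8 or UTF-32, so it should stay inside the program that produced it.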

Invalid Unicode strings have their uses, but they are not valid :-)

> UTF-16 can encode *any* valid Unicode, but it cannot encode surrogate
> characters.

Correct. 

But encoding of surrogates is not required in Unicode. Strictly speaking, it
is forbidden. Did you read the link from the Unicode consortium that I
provided?

>>> We still don't know if the final result will be UCS-4 everywhere (with
>>> all 2**32 code points allowed?!) or UTF-8 everywhere.
>>
>> Unicode does not have 2**32 code points. It is guaranteed to never
>> exceed the 2**21 code points already allocated. (Many of those are
>> still unused.)
> 
> Never say never.

The Unicode standard has published this guarantee. It is not going to
change. If somebody wants more than 2**21 code points, they can start their
own new, competing, standard.

>> In the future, we'll have so much memory that the idea of using
>> variable width in-memory formats will seem absurd.
> 
> I'm starting to think that future is already here.

I'm not *quite* ruling out the possibility that UTF-8 as internal
representation for in-memory strings is a good idea, but I think that for
non-embedded systems, it is very probably a waste of time.

-- 
Steven


#105282

From	Marko Rauhamaa <marko@pacujo.net>
Date	2016-03-19 18:12 +0200
Message-ID	<8737rmla4w.fsf@elektro.pacujo.net>
In reply to	#105281

Steven D'Aprano <steve@pearwood.info>:

> On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:
>> Yes, but UTF-16 produces 16-bit values that are outside Unicode. 
>
> Show me.
>
> Before you answer, if your answer is "surrogate pairs", that is
> incorrect. Surrogate pairs is how UTF-16 encodes astral characters.

UTF-16 inputs a Unicode stream and produces a stream of 16-bit numbers.
Thus, the output of UTF-16 is not Unicode.

Marko


#105292

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-03-20 16:01 +1100
Message-ID	<56ee2ebd$0$1597$c3e8da3$5496439d@news.astraweb.com>
In reply to	#105282

On Sun, 20 Mar 2016 03:12 am, Marko Rauhamaa wrote:

> Steven D'Aprano <steve@pearwood.info>:
> 
>> On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:
>>> Yes, but UTF-16 produces 16-bit values that are outside Unicode.
>>
>> Show me.
>>
>> Before you answer, if your answer is "surrogate pairs", that is
>> incorrect. Surrogate pairs is how UTF-16 encodes astral characters.
> 
> UTF-16 inputs a Unicode stream and produces a stream of 16-bit numbers.
> Thus, the output of UTF-16 is not Unicode.

I'm not sure what point you think you are making.

Unicode (the character set part of it) is a set of abstract 23-bit numbers,
or code points, representing (among other things) characters, and numbered
from U+0000 to U+10FFFF. Any UTF is, by definition, a transformation from
such abstract code points to sequences of machine words or bytes (and vice
versa). What's your point?

If your point is that the data you get from running UTF-16 on a sequence of
code points is "not Unicode, but 2-byte words", then I agree, but I'm not
sure why you think that's significant.

If you want to call those words "numbers", I cannot really object, but if
so, they aren't abstract numbers (like code points, which may have any
implementation you like), but have their actual base-2 structure specified
by the standard.

If your point is that a UTF-16 encoded stream of bytes is not the same as an
abstract sequence of code points, then I can't disagree, but I don't
understand why you think that's important.
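
To make the distinction concrete, here is a rough sketch, assuming Python 3.5 or later for `bytes.hex()`: one sequence of abstract code points, three different byte sequences, and the same code points back after decoding. The sample string is arbitrary.

```python
# Three code points: U+0041, U+20AC (euro sign), U+1F600 (an astral character).
text = "A\u20ac\U0001F600"
code_points = [ord(ch) for ch in text]

# The same code points under three different transformation formats.
encoded = {enc: text.encode(enc) for enc in ("utf-8", "utf-16-be", "utf-32-be")}
for enc, data in encoded.items():
    print(enc, len(data), data.hex())

# Each transformation is reversible: decoding recovers the code points.
round_trips = all(data.decode(enc) == text for enc, data in encoded.items())
print(round_trips)
```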

-- 
Steven

#105293

From	Rustom Mody <rustompmody@gmail.com>
Date	2016-03-19 23:20 -0700
Message-ID	<12db8cba-8edf-4cd0-a91d-2f6b6634c9d3@googlegroups.com>
In reply to	#105292

On Sunday, March 20, 2016 at 10:32:07 AM UTC+5:30, Steven D'Aprano wrote:
> On Sun, 20 Mar 2016 03:12 am, Marko Rauhamaa wrote:
> 
> > Steven D'Aprano :
> > 
> >> On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:
> >>> Yes, but UTF-16 produces 16-bit values that are outside Unicode.
> >>
> >> Show me.
> >>
> >> Before you answer, if your answer is "surrogate pairs", that is
> >> incorrect. Surrogate pairs is how UTF-16 encodes astral characters.
> > 
> > UTF-16 inputs a Unicode stream and produces a stream of 16-bit numbers.
> > Thus, the output of UTF-16 is not Unicode.
> 
> I'm not sure what point you think you are making.
> 
> Unicode (the character set part of it) is a set of abstract 23-bit numbers,

23? Or 21?
AIUI, if the 'least-count' is 1, it's 21.
If it's 8, it's 24.
If it's 16, it's 32.

More pertinently if the number of bits signifies, whatever is the sense of
the word 'abstract'?

> or code points, representing (among other things) characters, and numbered
> from U+0000 to U+10FFFF. Any UTF is, by definition, a transformation from
> such abstract code points to sequences of machine words or bytes (and vice
> versa). What's your point?

I think it's more useful to think of data transformations between formats
rather than calling one format more abstract than another.

#105298

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-03-20 22:06 +1100
Message-ID	<56ee8454$0$22142$c3e8da3$5496439d@news.astraweb.com>
In reply to	#105293

On Sun, 20 Mar 2016 05:20 pm, Rustom Mody wrote:

> On Sunday, March 20, 2016 at 10:32:07 AM UTC+5:30, Steven D'Aprano wrote:
>> On Sun, 20 Mar 2016 03:12 am, Marko Rauhamaa wrote:
>> 
>> > Steven D'Aprano :
>> > 
>> >> On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:
>> >>> Yes, but UTF-16 produces 16-bit values that are outside Unicode.
>> >>
>> >> Show me.
>> >>
>> >> Before you answer, if your answer is "surrogate pairs", that is
>> >> incorrect. Surrogate pairs is how UTF-16 encodes astral characters.
>> > 
>> > UTF-16 inputs a Unicode stream and produces a stream of 16-bit numbers.
>> > Thus, the output of UTF-16 is not Unicode.
>> 
>> I'm not sure what point you think you are making.
>> 
>> Unicode (the character set part of it) is a set of abstract 23-bit
>> numbers,
> 
> 23? Or 21?

Oops, you're right, it's 21 bits.

> More pertinently if the number of bits signifies, whatever is the sense of
> the word 'abstract'?

The Unicode standard does not, as far as I am aware, care how you represent
code points in memory, only that there are 0x110000 of them, numbered from
U+0000 to U+10FFFF. That's what I mean by abstract. The obvious
implementation is to use 32-bit integers, where 0x00000000 represents code
point U+0000, 0x00000001 represents U+0001, and so forth. This is
essentially equivalent to UTF-16, but it's not mandated or specified by the
Unicode standard, you could, if you choose, use something else.

On the other hand, I believe that the output of the UTF transformations is
explicitly described in terms of 8-bit bytes and 16- or 32-bit words. For
instance, the UTF-8 encoding of "A" has to be a single byte with value 0x41
(decimal 65). It isn't that this is the most obvious implementation, it's
that it can't be anything else and still be UTF-8.
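
A quick check of the "A" example, as a sketch in Python 3. It also checks the wider property that every code point below U+0080 encodes to the single byte with the same value; the variable names are illustrative.

```python
# The UTF-8 encoding of "A" is exactly one byte, 0x41.
encoded_a = "A".encode("utf-8")
print(encoded_a, len(encoded_a), hex(encoded_a[0]))

# The same holds across the whole ASCII range: code point N -> single byte N.
ascii_is_identity = all(
    chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x80)
)
print(ascii_is_identity)
```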

-- 
Steven

#105299

From	Chris Angelico <rosuav@gmail.com>
Date	2016-03-20 22:22 +1100
Message-ID	<mailman.404.1458472974.12893.python-list@python.org>
In reply to	#105298

On Sun, Mar 20, 2016 at 10:06 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> The Unicode standard does not, as far as I am aware, care how you represent
> code points in memory, only that there are 0x110000 of them, numbered from
> U+0000 to U+10FFFF. That's what I mean by abstract. The obvious
> implementation is to use 32-bit integers, where 0x00000000 represents code
> point U+0000, 0x00000001 represents U+0001, and so forth. This is
> essentially equivalent to UTF-16, but it's not mandated or specified by the
> Unicode standard, you could, if you choose, use something else.

(UTF-32)

The codepoints are not representable in *memory*; they are, by
definition, representable in a field of integers. If you choose to
represent those integers as little-endian 32-bit values, then yes, the
layout in memory will look like UTF-32LE, but that's because UTF-32LE
is defined in this extremely simple way. In fact, that's exactly how
the layers work - Unicode defines a mapping of characters to code
points, and then UTF-x defines a mapping of code points to bytes.
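
The two layers can be sketched in a few lines of Python 3. Here `ord()` stands in for the character-to-code-point mapping, and `struct.pack("<I", ...)` is one possible choice of "little-endian 32-bit values"; the sample string is arbitrary.

```python
import struct

text = "Hi\U0001F600"

# Layer 1: characters -> abstract integers (code points).
code_points = [ord(ch) for ch in text]

# Layer 2, done by hand: store each integer as a little-endian 32-bit value.
as_ints = b"".join(struct.pack("<I", cp) for cp in code_points)

# Layer 2, done by the codec.
as_utf32le = text.encode("utf-32-le")

same_layout = as_ints == as_utf32le
print(same_layout)
```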

> On the other hand, I believe that the output of the UTF transformations is
> explicitly described in terms of 8-bit bytes and 16- or 32-bit words. For
> instance, the UTF-8 encoding of "A" has to be a single byte with value 0x41
> (decimal 65). It isn't that this is the most obvious implementation, it's
> that it can't be anything else and still be UTF-8.

Exactly. Aside from the way UTF-16 and UTF-32 have LE and BE variants,
there is only one bitpattern for any given character sequence and
UTF-x (so if you work with eg "UTF-16LE", there's only one). This is
no accident. Unlike some encodings, in which there's a "one most
obvious" way to encode things but then a number of other legal ways,
UTF-x can be compared for equality [1] using simple byte-for-byte
comparisons. This means you don't have to worry about someone sneaking
a magic character past your filter; if you're checking a UTF-8 stream
for the character U+003C LESS-THAN SIGN, the only byte value to look
for is 0x3C - the sequence 0xC0 0xBC, despite mathematically
representing the number 003C, is explicitly forbidden.
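
The overlong case can be tried directly. A minimal sketch, assuming CPython 3's strict "utf-8" codec; the bit arithmetic just shows what the two bytes would mean if overlong forms were allowed.

```python
# Two-byte pattern 110xxxxx 10xxxxxx with payload bits spelling 0x3C.
overlong = b"\xc0\xbc"
payload = ((overlong[0] & 0x1F) << 6) | (overlong[1] & 0x3F)
print(hex(payload))

# A conforming decoder must reject it rather than return "<".
try:
    overlong.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False

# The only valid UTF-8 form of U+003C is the single byte 0x3C.
canonical = "<".encode("utf-8")
print(accepted, canonical)
```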

ChrisA

[1] Though not inequality - lexical sorting doesn't follow codepoint
order, and codepoint order won't always match byte order. But equality
is easy.
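
One case where byte order and codepoint order part ways, as a sketch in Python 3 (which compares str objects by code point). The pair U+FFFF and U+10000 is chosen to straddle the surrogate mechanism in UTF-16.

```python
a, b = "\uffff", "\U00010000"

# By code point, U+FFFF sorts before U+10000.
codepoint_order = a < b

# UTF-8 keeps that order for this pair: EF BF BF vs F0 90 80 80.
utf8_order = a.encode("utf-8") < b.encode("utf-8")

# UTF-16BE does not: FF FF vs D8 00 DC 00.
utf16_order = a.encode("utf-16-be") < b.encode("utf-16-be")

print(codepoint_order, utf8_order, utf16_order)
```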

#105300

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-03-20 23:14 +1100
Message-ID	<56ee9431$0$1620$c3e8da3$5496439d@news.astraweb.com>
In reply to	#105299

On Sun, 20 Mar 2016 10:22 pm, Chris Angelico wrote:

> On Sun, Mar 20, 2016 at 10:06 PM, Steven D'Aprano <steve@pearwood.info>
> wrote:
>> The Unicode standard does not, as far as I am aware, care how you
>> represent code points in memory, only that there are 0x110000 of them,
>> numbered from U+0000 to U+10FFFF. That's what I mean by abstract. The
>> obvious implementation is to use 32-bit integers, where 0x00000000
>> represents code point U+0000, 0x00000001 represents U+0001, and so forth.
>> This is essentially equivalent to UTF-16, but it's not mandated or
>> specified by the Unicode standard, you could, if you choose, use
>> something else.
> 
> (UTF-32)

D'oh!

I mean, yes, well done, you have passed my little test to see if anyone is
paying attention. Have a gold star.

> The codepoints are not representable in *memory*; they are, by
> definition, representable in a field of integers. 

They're not directly representable in memory because the definition of code
points is not given in terms of memory values. Hence, they are abstract
values, numbered in a certain way, and given certain semantics.

In other words, there's nothing in the Unicode standard that says that code
point U+0020 has to be stored as a byte 0x20, or a word 0x0020. But the
standard does say that the code point U+0020 represents a space character.

[...]
>> On the other hand, I believe that the output of the UTF transformations
>> is explicitly described in terms of 8-bit bytes and 16- or 32-bit words.
>> For instance, the UTF-8 encoding of "A" has to be a single byte with
>> value 0x41 (decimal 65). It isn't that this is the most obvious
>> implementation, it's that it can't be anything else and still be UTF-8.
> 
> Exactly. Aside from the way UTF-16 and UTF-32 have LE and BE variants,

Blame the chip manufacturers for that. Actually, I think we can blame Intel
specifically for that, for reversing the normal layout of words in memory.

-- 
Steven

#105302

From	Chris Angelico <rosuav@gmail.com>
Date	2016-03-20 23:27 +1100
Message-ID	<mailman.406.1458476886.12893.python-list@python.org>
In reply to	#105300

On Sun, Mar 20, 2016 at 11:14 PM, Steven D'Aprano <steve@pearwood.info> wrote:
>>> On the other hand, I believe that the output of the UTF transformations
>>> is explicitly described in terms of 8-bit bytes and 16- or 32-bit words.
>>> For instance, the UTF-8 encoding of "A" has to be a single byte with
>>> value 0x41 (decimal 65). It isn't that this is the most obvious
>>> implementation, it's that it can't be anything else and still be UTF-8.
>>
>> Exactly. Aside from the way UTF-16 and UTF-32 have LE and BE variants,
>
> Blame the chip manufacturers for that. Actually, I think we can blame Intel
> specifically for that, for reversing the normal layout of words in memory.

No, I disagree; it's inherent in the notion of representing a 16-bit
or 32-bit value across bytes. Maybe there could have been one
most-common standard, but there'd still have been another way of doing
it. Little-endianness and big-endianness are important enough to have
to deal with.
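
Both byte orders for one 16-bit value, as a small Python 3 sketch. The euro sign is an arbitrary choice; `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` are the standard byte order marks the plain "utf-16" decoder uses to tell the two apart.

```python
import codecs
import struct

text = "\u20ac"  # one 16-bit code unit, 0x20AC

le = text.encode("utf-16-le")
be = text.encode("utf-16-be")
print(le.hex(), be.hex())

# The codecs agree with a plain 16-bit integer packed either way round.
matches_struct = (
    le == struct.pack("<H", 0x20AC) and be == struct.pack(">H", 0x20AC)
)

# With a BOM in front, the generic "utf-16" decoder works out the order itself.
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")
print(matches_struct, from_le == text, from_be == text)
```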

ChrisA

#105303

From	Ben Bacarisse <ben.usenet@bsb.me.uk>
Date	2016-03-20 14:55 +0000
Message-ID	<874mc1mc5g.fsf@bsb.me.uk>
In reply to	#105293

Rustom Mody <rustompmody@gmail.com> writes:

> On Sunday, March 20, 2016 at 10:32:07 AM UTC+5:30, Steven D'Aprano wrote:
<snip>
>> Unicode (the character set part of it) is a set of abstract 23-bit numbers,
>
> 23? Or 21?

It's 21.  The reason being (or at least part of the reason being) that
21 bits can be UTF-8 encoded in 4 bytes: 11110xxx 10xxxxxx 10xxxxxx
10xxxxxx (3 + 3*6).
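
The bit layout can be spelled out by hand and compared with the built-in codec. This is only a sketch: `utf8_four_bytes` is a made-up name, and it handles the 4-byte form only (U+10000 to U+10FFFF).

```python
def utf8_four_bytes(cp):
    """Encode a code point in U+10000..U+10FFFF with the 4-byte UTF-8 pattern."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("4-byte form needs a code point in U+10000..U+10FFFF")
    return bytes([
        0xF0 | (cp >> 18),            # 11110xxx: top 3 bits
        0x80 | ((cp >> 12) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx: next 6 bits
        0x80 | (cp & 0x3F),           # 10xxxxxx: low 6 bits
    ])


payload_bits = 3 + 3 * 6
print(payload_bits)
print(utf8_four_bytes(0x1F600) == "\U0001F600".encode("utf-8"))
```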

<snip>
-- 
Ben.

#105304

From	Marko Rauhamaa <marko@pacujo.net>
Date	2016-03-20 17:36 +0200
Message-ID	<87shzl6u1i.fsf@elektro.pacujo.net>
In reply to	#105303

Ben Bacarisse <ben.usenet@bsb.me.uk>:

> It's 21. The reason being (or at least part of the reason being) that
> 21 bits can be UTF-8 encoded in 4 bytes: 11110xxx 10xxxxxx 10xxxxxx
> 10xxxxxx (3 + 3*6).

I bet the reason is UTF-16. Microsoft and Sun/Oracle would have insisted
on a maximum of 4 bytes per character. UTF-16 can just barely squeeze 21
bits into the scheme and only at the expense of creating an ugly hole
inside Unicode. Politics, politics.
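
For reference, the arithmetic behind the UTF-16 limit, as a sketch in Python 3. Two surrogates carry 10 bits each, giving 2**20 code points on top of the first 0x10000, which is where U+10FFFF comes from. The function name `to_surrogates` is invented here; the result is checked against the standard "utf-16-be" codec.

```python
import struct


def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000               # fits in 20 bits
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


pair = to_surrogates(0x1F600)
codec_pair = struct.unpack(">HH", "\U0001F600".encode("utf-16-be"))
print([hex(unit) for unit in pair], pair == codec_pair)

# Largest value reachable with 20 offset bits.
utf16_max = 0x10000 + 2 ** 20 - 1
print(hex(utf16_max))
```

The range U+D800 to U+DFFF reserved for these code units is the "hole" mentioned above.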


Marko
