Groups > comp.lang.python > #105093 > unrolled thread
| Started by | wxjmfauth@gmail.com |
|---|---|
| First post | 2016-03-17 07:34 -0700 |
| Last post | 2016-03-18 11:18 -0700 |
| Articles | 20 on this page of 72 — 18 participants |
How to waste computer memory? wxjmfauth@gmail.com - 2016-03-17 07:34 -0700
Re: How to waste computer memory? Rick Johnson <rantingrickjohnson@gmail.com> - 2016-03-17 12:21 -0700
Re: How to waste computer memory? cl@isbd.net - 2016-03-17 20:31 +0000
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 07:42 +1100
Re: How to waste computer memory? Grant Edwards <invalid@invalid.invalid> - 2016-03-17 21:08 +0000
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 08:13 +1100
Re: How to waste computer memory? Paul Rubin <no.email@nospam.invalid> - 2016-03-17 14:30 -0700
Re: How to waste computer memory? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-17 22:32 +0000
Re: How to waste computer memory? cl@isbd.net - 2016-03-17 22:42 +0000
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-17 23:11 +0200
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 08:17 +1100
Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-17 21:26 +0000
Re: How to waste computer memory? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-17 22:38 +0000
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 10:02 +1100
Re: How to waste computer memory? alister <alister.ware@ntlworld.com> - 2016-03-17 21:37 +0000
Re: How to waste computer memory? alister <alister.ware@ntlworld.com> - 2016-03-17 21:43 +0000
Re: How to waste computer memory? Gene Heskett <gheskett@wdtv.com> - 2016-03-17 20:51 -0400
Re: How to waste computer memory? Rick Johnson <rantingrickjohnson@gmail.com> - 2016-03-17 18:47 -0700
Re: How to waste computer memory? cl@isbd.net - 2016-03-18 10:44 +0000
Re: How to waste computer memory? Gene Heskett <gheskett@wdtv.com> - 2016-03-18 10:11 -0400
Re: How to waste computer memory? Grant Edwards <invalid@invalid.invalid> - 2016-03-19 13:50 +0000
Re: How to waste computer memory? Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-18 01:00 -0600
Re: How to waste computer memory? Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-03-18 10:26 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-18 17:26 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 03:58 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-18 23:02 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-18 23:28 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 00:03 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 09:49 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 10:22 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 11:40 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 19:38 +1100
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 00:14 -0700
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 02:17 -0700
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 19:14 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 11:31 +0200
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 03:40 -0700
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 13:07 +0200
Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-19 12:24 +0000
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 14:43 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 01:18 +1100
Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-19 15:14 +0000
Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-19 15:20 +0000
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 22:32 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 14:42 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 01:39 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 16:56 +0200
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 07:01 -0700
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 01:56 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 17:02 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 02:47 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 18:12 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 16:01 +1100
Re: How to waste computer memory? Rustom Mody <rustompmody@gmail.com> - 2016-03-19 23:20 -0700
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 22:06 +1100
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-20 22:22 +1100
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 23:14 +1100
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-20 23:27 +1100
Re: How to waste computer memory? Ben Bacarisse <ben.usenet@bsb.me.uk> - 2016-03-20 14:55 +0000
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-20 17:36 +0200
Re: How to waste computer memory? Random832 <random832@fastmail.com> - 2016-03-20 14:17 -0400
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-20 09:30 +0200
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-18 03:50 -0700
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-18 22:46 +1100
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-18 22:58 +1100
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-18 12:53 -0700
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 23:37 +1100
Re: How to waste computer memory? Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-18 07:57 -0600
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 03:44 +1100
Re: How to waste computer memory? Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-03-18 20:22 +0200
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-18 13:03 -0700
Re: How to waste computer memory? sohcahtoa82@gmail.com - 2016-03-18 11:18 -0700
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-03-20 01:18 +1100 |
| Message-ID | <56ed5f9a$0$1605$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #105264 |
On Sat, 19 Mar 2016 11:24 pm, BartC wrote about combining characters:

> So a string that looks like:
>
> "ññññññññññññññññññññññññññññññññññññññññññññññññññ"
>
> can have 2**50 different representations?

Yes.

> And occupy somewhere between 50 and 200 bytes? Or is that 400?

The minimum storage would use a legacy encoding (like MacRoman, or Latin-1) with the composed ñ character. That gives 50 x 1-byte characters, or 50 bytes.

The maximum storage would be if all 50 characters were decomposed into two code points (giving 100 code points), and then stored as UTF-32, giving 400 bytes all up.

> OK...

You say that as if 400 bytes was a lot.

Besides, this is hardly any different from (say) a pure ASCII version of the "permille" (per thousand) symbol. In Unicode I can write ‰ (two bytes in UTF-16) but in ASCII I am forced to write O/oo (four bytes), or worse, "per thousand" (12 bytes). Imagine a string of "‰"*50, written in ASCII, for a total of 600 bytes...

Yes, this is silly. Really, if you've got 50 ñ in a string, they take up the space they take up, and memory is cheap. The days of thinking that 127 characters is all you need (7 bit ASCII) are long, long gone, just like the days when it was appropriate for ints to be 16 bits.

When I first started programming, the default "integer" type in Pascal, Forth and other languages was 16 bits, which meant that the largest number you can represent in a calculation was 32767. My four-function calculator had an 8 digit display and could calculate up to 99999999, while Pascal choked on 32767. (Or 65535 if you used unsigned numbers.) Now, I routinely and without hesitation generate thousand-plus bit numbers like 2**10000, and my computer calculates and prints the result faster than I can enter the calculation in the first place. Worrying about the fact that characters use more than 8 bits is oh-so-1990s.

--
Steven
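The 50-byte and 400-byte bounds quoted in this post can be checked directly. The following is a minimal sketch, assuming Python 3; the variable names are illustrative, and the `-le` codec variants are chosen so that no byte order mark is counted.

```python
import unicodedata

# 50 copies of the precomposed character U+00F1 (ñ).
composed = "\u00f1" * 50
# The same text fully decomposed: "n" followed by U+0303 COMBINING TILDE.
decomposed = unicodedata.normalize("NFD", composed)

# Byte counts per (form, codec); the "-le" codecs do not emit a byte order mark.
sizes = {
    (label, codec): len(text.encode(codec))
    for label, text in (("composed", composed), ("decomposed", decomposed))
    for codec in ("latin-1", "utf-8", "utf-16-le", "utf-32-le")
    # U+0303 does not exist in Latin-1, so skip that combination.
    if not (label == "decomposed" and codec == "latin-1")
}

for (label, codec), size in sizes.items():
    print(f"{label:10} {codec:10} {size:4d} bytes")
```

The smallest entry is the composed Latin-1 form and the largest is the decomposed UTF-32 form, with the UTF-8 and UTF-16 sizes falling in between.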
| From | BartC <bc@freeuk.com> |
|---|---|
| Date | 2016-03-19 15:14 +0000 |
| Message-ID | <ncjq6e$bvb$1@dont-email.me> |
| In reply to | #105272 |
On 19/03/2016 14:18, Steven D'Aprano wrote:
> On Sat, 19 Mar 2016 11:24 pm, BartC wrote about combining characters:

>> And occupy somewhere between 50 and 200 bytes? Or is that 400?

>> OK...
>
> You say that as if 400 bytes was a lot.

No, just unpredictable.

> Besides, this is hardly any different from (say) a pure ASCII version of
> the "permille" (per thousand) symbol. In Unicode I can write ‰ (two bytes
> in UTF-16) but in ASCII I am forced to write O/oo (four bytes), or
> worse, "per thousand" (12 bytes). Imagine a string of "‰"*50, written in
> ASCII, for a total of 600 bytes...

Those kinds of problems are well known with ASCII, for example needing to compare strings but ignoring case, or treating tabs as spaces. It's clear that dealing with those properly goes beyond the remit of basic string processing in a language.

With Unicode there are a whole bunch of other problems, and some people expect basic string handling to be able to deal with all of them.

(I think Unicode should be dealt with at the next level up. Then some of us can stay at the bottom level that is more efficient and works 99% of the time on average, and just about 100% for most.)

> Yes, this is silly. Really, if you've got 50 ñ in a string, they take up the
> space they take up, and memory is cheap.

Which is about 3000 decimal digits, slightly more than 1KB in packed binary. In BCD it would be 1.5KB. At one-byte per digit (eg. ASCII) it's 3KB. At 4 bytes per (eg. UCS4), it's 12KB.

What would you say to someone advocating 12 times as much storage for long integers as is used now? After all memory is cheap!

> and my computer calculates and prints the result faster than I can enter
> the calculation in the first place. Worrying about the fact that characters
> use more than 8 bits is oh-so-1990s.

We still need to worry about it. Whatever memory is being used up (ram, cache, flash, disk) 16-bit characters will use twice as much, and 32-bit four times as much. And the bandwidth necessary to access or transmit will also be twice or four times as much.

But the existence of UTF-8 means something /has/ been done about it, or some of it; somebody /has/ worried about it.

> The days of thinking that 127
> characters is all you need (7 bit ASCII) are long, long gone, just like the
> days when it was appropriate for ints to be 16 bits.

Some things haven't actually changed that much. Word sizes might have doubled from 32 bits on a mainframe to 64 bits now (temporarily reducing to 8 and 16 along the way for micros and minis).

But the English alphabet still has 26 letters. Keyboards still have around 100 keys. And programming languages and text formats still predominantly use an ASCII subset for their keywords and identifiers.

--
Bartc
| From | BartC <bc@freeuk.com> |
|---|---|
| Date | 2016-03-19 15:20 +0000 |
| Message-ID | <ncjqiq$db4$1@dont-email.me> |
| In reply to | #105277 |
On 19/03/2016 15:14, BartC wrote:

> Which is about 3000 decimal digits, slightly more than 1KB in packed
> binary. In BCD it would be 1.5KB. At one-byte per digit (eg. ASCII) it's
> 3KB. At 4 bytes per (eg. UCS4), it's 12KB.

The comment refers to this which inexplicably got snipped (not my fault at all..):

[Steven D'Aprano:]
>> Now, I routinely
>> and without hesitation generate thousand-plus bit numbers like
>> 2**10000,

--
Bartc
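The storage figures for 2**10000 can be worked out in a few lines. This is only a sketch, assuming Python 3 and whole-byte rounding; the variable names are illustrative and not from the thread.

```python
n = 2 ** 10000

digits = len(str(n))                       # number of decimal digits
binary_bytes = (n.bit_length() + 7) // 8   # packed binary, rounded up to bytes
bcd_bytes = (digits + 1) // 2              # BCD packs two digits per byte
ascii_bytes = digits                       # one byte per digit
ucs4_bytes = digits * 4                    # four bytes per digit

print(digits, binary_bytes, bcd_bytes, ascii_bytes, ucs4_bytes)
```

The ratio between the UCS4 text form and packed binary comes out a little under ten, in the same ballpark as the rounded figures in the post.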
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-03-19 22:32 +1100 |
| Message-ID | <56ed38bb$0$1584$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #105259 |
On Sat, 19 Mar 2016 09:18 pm, Chris Angelico wrote:

> On Sat, Mar 19, 2016 at 8:31 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Unicode made several (understandable but grave) mistakes along the way:
>>
>> * normalization
>>
>
> Elaborate please? What's such a big mistake here?

As usual, Unicode problems are generally due to backwards compatibility. Blame the old legacy encodings, which invented the "dead keys" a.k.a. "combining character" technique. Of course, they had a reasonable excuse at the time, but Unicode's requirement of being able to losslessly handle all legacy character set standards means that Unicode has to provide the same functionality.

The problem is not so much the existence of combining characters, but that *some* but not all accented characters are available in two forms: a composed single code point, and a decomposed pair of code points. This adds complexity and means that equality of characters is not well-defined. (Hence Unicode punts on the whole "character" thing and just talks about code points.)

--
Steven
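The "two forms" problem is easy to reproduce. A small sketch, assuming Python 3 and using ñ as the example character (the choice of character and the variable names are illustrative):

```python
import unicodedata

composed = "\u00f1"      # LATIN SMALL LETTER N WITH TILDE, one code point
decomposed = "n\u0303"   # "n" followed by COMBINING TILDE, two code points

# Plain comparison works on code points, so the two spellings differ.
same_without_normalizing = composed == decomposed
lengths = (len(composed), len(decomposed))

# After normalizing both sides to the same form they compare equal.
same_after_nfc = (unicodedata.normalize("NFC", composed)
                  == unicodedata.normalize("NFC", decomposed))

print(same_without_normalizing, lengths, same_after_nfc)
```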
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-19 14:42 +0200 |
| Message-ID | <87oaaaljui.fsf@elektro.pacujo.net> |
| In reply to | #105263 |
Steven D'Aprano <steve@pearwood.info>:

> As usual, Unicode problems are generally due to backwards
> compatibility. Blame the old legacy encodings, which invented the
> "dead keys" a.k.a. "combining character" technique. Of course, they
> had a reasonable excuse at the time, but Unicode's requirement of
> being able to losslessly handle all legacy character set standards
> means that Unicode has to provide the same functionality.

The combining characters allow for a maze of twisty little combinations, all alike. There's no limit to the number of diacritics you can pile on, under and next to the base character.

Was that universality unavoidable? Maybe it was. Deep down, all scripts are two-dimensional.

> The problem is not so much the existence of combining characters, but that
> *some* but not all accented characters are available in two forms: a
> composed single code point, and a decomposed pair of code points.

Also, is an a with ring on top and another ring on bottom the same character as an a with ring on bottom and another ring on top?

> This adds complexity and means that equality of characters is not
> well-defined. (Hence Unicode punts on the whole "character" thing and
> just talks about code points.)

The problem is not theoretical. If I implement a web form and someone enters "Aña" as their name, how do I make sure queries find the name regardless of the Unicode code point sequence? I have to normalize using unicodedata.normalize().

When glorifying Python's advanced Unicode capabilities, are we careful to emphasize the necessity of unicodedata.normalize() everywhere? Should Python normalize strings unconditionally and transparently? What does the O(1) character lookup mean under normalization?

Some weeks ago I had to spend 30 minutes to debug my Python program when a user complained it didn't work. Turns out they had accidentally invoked the program using a space and a composing tilde instead of the ASCII ~. There was no visual indication of a problem on the screen, but the Python program acted up.

Marko
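One way to handle the web-form case is to normalize at the boundary, both when storing and when querying. The sketch below assumes Python 3; `name_key` and `registered` are hypothetical names invented for illustration, and NFC is an arbitrary choice of form (any single form works as long as it is applied consistently).

```python
import unicodedata

def name_key(name):
    """Hypothetical helper: canonical lookup key for a user-supplied name."""
    return unicodedata.normalize("NFC", name)

# Hypothetical in-memory "database" keyed by the normalized name.
registered = {}

# The name is stored from input that used the precomposed ñ (U+00F1)...
registered[name_key("A\u00f1a")] = "user-1"

# ...and later queried with "n" + COMBINING TILDE (U+0303).
found = registered.get(name_key("An\u0303a"))
print(found)
```

Note that this only covers canonical equivalence; case-insensitive matching or compatibility characters would need additional steps.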
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-03-20 01:39 +1100 |
| Message-ID | <56ed64b4$0$1596$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #105265 |
On Sat, 19 Mar 2016 11:42 pm, Marko Rauhamaa wrote:

> The problem is not theoretical. If I implement a web form and someone
> enters "Aña" as their name, how do I make sure queries find the name
> regardless of the unicode code point sequence? I have to normalize using
> unicodedata.normalize().

I didn't say that it was theoretical. It is a real problem, but it is a problem with human languages: the number of characters-with-accents is vast, possibly impossibly vast. They can't all have unique code points.

I must admit I had completely missed your example of multiple combining characters, that's a good one. Here's the example again:

    a + combining ring above + combining ring below

versus

    a + combining ring below + combining ring above

Naturally just comparing them gives unequal:

    py> s = "a\u030A\u0325"
    py> t = "a\u0325\u030A"
    py> s == t
    False

But we can normalise them:

    ====  =============  =============  ==================  ==================
    Form  NFC            NFKC           NFD                 NFKD
    ====  =============  =============  ==================  ==================
    s     U+1E01,030A    U+1E01,030A    U+0061,0325,030A    U+0061,0325,030A
    t     U+1E01,030A    U+1E01,030A    U+0061,0325,030A    U+0061,0325,030A
    ====  =============  =============  ==================  ==================

As you can see, *any* of the normalisation forms will put the code points into the same, canonical order, making them equal.

> When glorifying Python's advanced Unicode capabilities, are we careful
> to emphasize the necessity of unicodedata.normalize() everywhere? Should
> Python normalize strings unconditionally and transparently? What does
> the O(1) character lookup mean under normalization?
>
> Some weeks ago I had to spend 30 minutes to debug my Python program when
> a user complained it didn't work. Turns out they had accidentally
> invoked the program using a space and a composing tilde instead of the
> ASCII ~. There was no visual indication of a problem on the screen, but
> the Python program acted up.

We recently had somebody here who wrote capital I by pressing the lower case l on the keyboard. Should a pure-ASCII program be able to operate without malfunction if the user confuses 0 and O, or I l and 1? What about ' and ` or possibly even '' and "?

--
Steven
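The normalisation table in this post can be regenerated with `unicodedata.normalize`. A minimal sketch, assuming Python 3; the helper `code_points` is an illustrative name, not something from the thread.

```python
import unicodedata

s = "a\u030A\u0325"   # a + ring above + ring below
t = "a\u0325\u030A"   # a + ring below + ring above

def code_points(text):
    """Render a string as comma-separated hexadecimal code points."""
    return ",".join(f"{ord(ch):04X}" for ch in text)

# For each normalisation form, the code points of normalised s and t.
table = {
    form: (code_points(unicodedata.normalize(form, s)),
           code_points(unicodedata.normalize(form, t)))
    for form in ("NFC", "NFKC", "NFD", "NFKD")
}

for form, (s_points, t_points) in table.items():
    print(f"{form:5} s={s_points:16} t={t_points:16} equal={s_points == t_points}")
```

In every form the ring below (U+0325) is ordered before the ring above (U+030A), which is why `s` and `t` end up identical once normalised.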
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-19 16:56 +0200 |
| Message-ID | <87bn6aldmy.fsf@elektro.pacujo.net> |
| In reply to | #105273 |
Steven D'Aprano <steve@pearwood.info>:

> On Sat, 19 Mar 2016 11:42 pm, Marko Rauhamaa wrote:
>> When glorifying Python's advanced Unicode capabilities, are we
>> careful to emphasize the necessity of unicodedata.normalize()
>> everywhere? Should Python normalize strings unconditionally and
>> transparently? What does the O(1) character lookup mean under
>> normalization?
>>
>> Some weeks ago I had to spend 30 minutes to debug my Python program
>> when a user complained it didn't work. Turns out they had
>> accidentally invoked the program using a space and a composing tilde
>> instead of the ASCII ~. There was no visual indication of a problem
>> on the screen, but the Python program acted up.
>
> We recently had somebody here who wrote capital I by pressing the
> lower case l on the keyboard. Should a pure-ASCII program be able to
> operate without malfunction if the user confuses 0 and O, or I l and
> 1? What about ' and ` or possibly even '' and "?

What I'm talking about is that maybe Python should treat canonically equivalent strings equivalently, that is, indistinguishably under any external inspection.

Anyway, Python's Unicode support is a great thing, but Unicode is a big can of worms. Far from being a paradise, it's more of a case of picking your poison.

Marko
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2016-03-19 07:01 -0700 |
| Message-ID | <37f00078-41eb-44aa-bf40-7e4e6fa8e100@googlegroups.com> |
| In reply to | #105263 |
On Saturday, 19 March 2016 at 12:32:25 UTC+1, Steven D'Aprano wrote:
> On Sat, 19 Mar 2016 09:18 pm, Chris Angelico wrote:
>
> > On Sat, Mar 19, 2016 at 8:31 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> >> Unicode made several (understandable but grave) mistakes along the way:
> >>
> >> * normalization
> >>
> >
> > Elaborate please? What's such a big mistake here?
>
> As usual, Unicode problems are generally due to backwards compatibility.
> Blame the old legacy encodings, which invented the "dead keys"
> a.k.a. "combining character" technique. Of course, they had a reasonable
> excuse at the time, but Unicode's requirement of being able to losslessly
> handle all legacy character set standards means that Unicode has to provide
> the same functionality.
>
> The problem is not so much the existence of combining characters, but that
> *some* but not all accented characters are available in two forms: a
> composed single code point, and a decomposed pair of code points. This adds
> complexity and means that equality of characters is not well-defined.
> (Hence Unicode punts on the whole "character" thing and just talks about
> code points.)
>
>
>
> --
> Steven

I'm laughing, I'm laughing. You do not imagine how I'm laughing...
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-03-20 01:56 +1100 |
| Message-ID | <56ed68bb$0$1604$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #105259 |
On Sat, 19 Mar 2016 08:31 pm, Marko Rauhamaa wrote:

> Using the surrogate mechanism, UTF-16 can support all 1,114,112
> potential Unicode characters.
>
> But Unicode doesn't contain 1,114,112 characters—the surrogates are
> excluded from Unicode, and definitely cannot be encoded using
> UTF-anything.

Surrogates are most certainly part of the Unicode standard, and they are necessary in UTF-16. (You cannot represent astral characters without them!) So in a UTF-16 stream, a *pair* of surrogates is nothing unusual. They just represent a SMP code point.

However, *single* surrogates are an error. For example, we see this FAQ:

    Q: How do I convert an unpaired UTF-16 surrogate to UTF-32?

    A: If an unpaired surrogate is encountered when converting ill-formed UTF-16 data, any conformant converter must treat this as an error. By representing such an unpaired surrogate on its own, the resulting UTF-32 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in valid data stream.

http://www.unicode.org/faq/utf_bom.html#utf32-7

But nobody says that programming languages must deal with only conformant converters and valid Unicode sequences. It is an unfortunate fact of life that even if you don't generate them yourself, somebody else will, so you need to be able to deal with them.

[...]

> We still don't know if the final result will be UCS-4 everywhere (with
> all 2**32 code points allowed?!) or UTF-8 everywhere.

Unicode does not have 2**32 code points. It is guaranteed to never exceed the 2**21 code points already allocated. (Many of those are still unused.)

As far as I am concerned, the future is clear: UTF-8 for transmission and storage formats, where fast random access is not necessary; UTF-32 for in-memory formats, where O(1) random access is advantageous. Possibly with certain in-memory optimizations to save space, where such can be done transparently.

In the future, we will no more balk at using four whole bytes for a code point than we now balk at using eight bytes for floating point numbers. The mathematical advantages of float Doubles are just overwhelming, and the only reason for using fewer than 64 bits is if you care more about getting a fast answer than an accurate answer. (I'm reminded of one of my wife's former roadies, back in the 70s, crossing the US desert in a van. On being told that he was heading in the wrong direction for their next gig, he replied "Who cares? We're making great time!")

In the future, we'll have so much memory that the idea of using variable width in-memory formats will seem absurd.

--
Steven
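Dealing with someone else's ill-formed data usually comes down to choosing an error handler. The sketch below assumes Python 3 and a hand-built byte string containing a lone low surrogate; the variable names are illustrative.

```python
# Ill-formed little-endian UTF-16: a lone low surrogate (0xDC00), then "A".
data = b"\x00\xdc" + "A".encode("utf-16-le")

# The default "strict" handler treats the unpaired surrogate as an error,
# in line with the FAQ answer quoted above.
try:
    data.decode("utf-16-le")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" substitutes U+FFFD REPLACEMENT CHARACTER and carries on.
replaced = data.decode("utf-16-le", errors="replace")

print(strict_ok, ascii(replaced))
```

Which handler is appropriate depends on whether losing the offending unit is acceptable for the application.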
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-19 17:02 +0200 |
| Message-ID | <877fgylddm.fsf@elektro.pacujo.net> |
| In reply to | #105275 |
Steven D'Aprano <steve@pearwood.info>:
> On Sat, 19 Mar 2016 08:31 pm, Marko Rauhamaa wrote:
>
>
>> Using the surrogate mechanism, UTF-16 can support all 1,114,112
>> potential Unicode characters.
>>
>> But Unicode doesn't contain 1,114,112 characters—the surrogates are
>> excluded from Unicode, and definitely cannot be encoded using
>> UTF-anything.
>
> Surrogates are most certainly part of the Unicode standard, and they are
> necessary in UTF-16.
Yes, but UTF-16 produces 16-bit values that are outside Unicode. UTF-16
can encode *any* valid Unicode, but it cannot encode surrogate
characters.
>>> '\udc10'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc10' in position 0: surrogates not allowed
>>> '\udc10'.encode('utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16' codec can't encode character '\udc10' in position 0: surrogates not allowed
>>> '\udc10'.encode('utf-32')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-32' codec can't encode character '\udc10' in position 0: surrogates not allowed
>> We still don't know if the final result will be UCS-4 everywhere (with
>> all 2**32 code points allowed?!) or UTF-8 everywhere.
>
> Unicode does not have 2**32 code points. It is guaranteed to never
> exceed the 2**21 code points already allocated. (Many of those are
> still unused.)
Never say never.
> In the future, we'll have so much memory that the idea of using
> variable width in-memory formats will seem absurd.
I'm starting to think that future is already here.
Marko
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-03-20 02:47 +1100 |
| Message-ID | <56ed749e$0$1583$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #105276 |
On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:

> Steven D'Aprano <steve@pearwood.info>:
>
>> On Sat, 19 Mar 2016 08:31 pm, Marko Rauhamaa wrote:
>>
>>
>>> Using the surrogate mechanism, UTF-16 can support all 1,114,112
>>> potential Unicode characters.
>>>
>>> But Unicode doesn't contain 1,114,112 characters—the surrogates are
>>> excluded from Unicode, and definitely cannot be encoded using
>>> UTF-anything.
>>
>> Surrogates are most certainly part of the Unicode standard, and they are
>> necessary in UTF-16.
>
> Yes, but UTF-16 produces 16-bit values that are outside Unicode.

Show me.

Before you answer, if your answer is "surrogate pairs", that is incorrect. Surrogate pairs is how UTF-16 encodes astral characters. For example, the UTF-16 *code unit sequence* 0xD800 0xDC00 does not represent "code points U+D800,DC00". It represents the *single* code point U+10000 "LINEAR B SYLLABLE B008 A". The code points U+D800 and U+DC00 are reserved for the use of UTF-16 as surrogates.

This means that UTF-16 cannot encode lone surrogates. It cannot encode, say, the code point U+D800 on its own, because it looks like half of a SMP code point, which is an error. And it cannot encode U+D800 immediately followed by U+DC00, because that would be interpreted as U+10000. So there is a range of code points which cannot be represented in UTF-16.

Where UTF-16 goes, UTF-8 and UTF-32 must follow. It is a requirement of Unicode that you must be able to freely and losslessly convert between the three UTFs. (I'm not sure if that also applies to UTF-7.) Since UTF-16 *cannot* represent this specific range of code points, UTF-8 and UTF-32 must be *forbidden* from representing them as well.

Note that the UTF-8 and UTF-32 formats are perfectly capable of representing lone surrogates. UTF-32, for example would simply pad the code point with zeroes: U+D800 would be represented as the four bytes 0x0000D800. UTF-8 has a well-defined 3-byte sequence that corresponds to it. But that is invalid, since it violates the requirement that it be freely and losslessly translatable into UTF-16.

Invalid Unicode strings have their uses, but they are not valid :-)

> UTF-16 can encode *any* valid Unicode, but it cannot encode surrogate
> characters.

Correct. But encoding of surrogates is not required in Unicode. Strictly speaking, it is forbidden. Did you read the link from the Unicode consortium that I provided?

>>> We still don't know if the final result will be UCS-4 everywhere (with
>>> all 2**32 code points allowed?!) or UTF-8 everywhere.
>>
>> Unicode does not have 2**32 code points. It is guaranteed to never
>> exceed the 2**21 code points already allocated. (Many of those are
>> still unused.)
>
> Never say never.

The Unicode standard has published this guarantee. It is not going to change. If somebody wants more than 2**21 code points, they can start their own new, competing, standard.

>> In the future, we'll have so much memory that the idea of using
>> variable width in-memory formats will seem absurd.
>
> I'm starting to think that future is already here.

I'm not *quite* ruling out the possibility that UTF-8 as internal representation for in-memory strings is a good idea, but I think that for non-embedded systems, it is very probably a waste of time.

--
Steven
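The surrogate-pair arithmetic and the "capable but forbidden" encodings of a lone surrogate can both be shown in a few lines. This is a sketch assuming Python 3.4 or later (where the `surrogatepass` error handler also applies to the UTF-16 and UTF-32 codecs); `to_surrogate_pair` is a hypothetical helper written for illustration, not a standard library function.

```python
def to_surrogate_pair(code_point):
    """Hypothetical helper: split a supplementary code point into UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points use surrogate pairs")
    offset = code_point - 0x10000          # 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

# U+10000 becomes the pair 0xD800 0xDC00, matching the built-in codec.
pair = to_surrogate_pair(0x10000)
encoded = "\U00010000".encode("utf-16-be")

# A lone surrogate is rejected by default, but "surrogatepass" emits the
# byte patterns that UTF-8 and UTF-32 would use if it were allowed.
lone_utf8 = "\ud800".encode("utf-8", errors="surrogatepass")
lone_utf32 = "\ud800".encode("utf-32-be", errors="surrogatepass")

print([hex(unit) for unit in pair], encoded, lone_utf8, lone_utf32)
```

The output of `surrogatepass` is by definition not valid UTF-8 or UTF-32, so it is only suitable for round-tripping data within a program that expects it.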
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-19 18:12 +0200 |
| Message-ID | <8737rmla4w.fsf@elektro.pacujo.net> |
| In reply to | #105281 |
Steven D'Aprano <steve@pearwood.info>:

> On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:
>> Yes, but UTF-16 produces 16-bit values that are outside Unicode.
>
> Show me.
>
> Before you answer, if your answer is "surrogate pairs", that is
> incorrect. Surrogate pairs is how UTF-16 encodes astral characters.

UTF-16 inputs a Unicode stream and produces a stream of 16-bit numbers. Thus, the output of UTF-16 is not Unicode.

Marko
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-03-20 16:01 +1100 |
| Message-ID | <56ee2ebd$0$1597$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #105282 |
On Sun, 20 Mar 2016 03:12 am, Marko Rauhamaa wrote:

> Steven D'Aprano <steve@pearwood.info>:
>
>> On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:
>>> Yes, but UTF-16 produces 16-bit values that are outside Unicode.
>>
>> Show me.
>>
>> Before you answer, if your answer is "surrogate pairs", that is
>> incorrect. Surrogate pairs is how UTF-16 encodes astral characters.
>
> UTF-16 inputs a Unicode stream and produces a stream of 16-bit numbers.
> Thus, the output of UTF-16 is not Unicode.

I'm not sure what point you think you are making.

Unicode (the character set part of it) is a set of abstract 23-bit numbers, or code points, representing (among other things) characters, and numbered from U+0000 to U+10FFFF. Any UTF is, by definition, a transformation from such abstract code points to sequences of machine words or bytes (and vice versa). What's your point?

If your point is that the data you get from running UTF-16 on a sequence of code points is "not Unicode, but 2-byte words", then I agree, but I'm not sure why you think that's significant. If you want to call those words "numbers", I cannot really object, but if so, they aren't abstract numbers (like code points, which may have any implementation you like), but have their actual base-2 structure specified by the standard.

If your point is that a UTF-16 encoded stream of bytes is not the same as an abstract sequence of code points, then I can't disagree, but I don't understand why you think that's important.

--
Steven
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2016-03-19 23:20 -0700 |
| Message-ID | <12db8cba-8edf-4cd0-a91d-2f6b6634c9d3@googlegroups.com> |
| In reply to | #105292 |
On Sunday, March 20, 2016 at 10:32:07 AM UTC+5:30, Steven D'Aprano wrote:
> On Sun, 20 Mar 2016 03:12 am, Marko Rauhamaa wrote:
>
> > Steven D'Aprano :
> >
> >> On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:
> >>> Yes, but UTF-16 produces 16-bit values that are outside Unicode.
> >>
> >> Show me.
> >>
> >> Before you answer, if your answer is "surrogate pairs", that is
> >> incorrect. Surrogate pairs is how UTF-16 encodes astral characters.
> >
> > UTF-16 inputs a Unicode stream and produces a stream of 16-bit numbers.
> > Thus, the output of UTF-16 is not Unicode.
>
> I'm not sure what point you think you are making.
>
> Unicode (the character set part of it) is a set of abstract 23-bit numbers,

23? Or 21?

AIUI if the 'least-count' is 1 it's 21
If it's 8 it's 24
If it's 16 it's 32

More pertinently, if the number of bits signifies, whatever is the sense of the word 'abstract'?

> or code points, representing (among other things) characters, and numbered
> from U+0000 to U+10FFFF. Any UTF is, by definition, a transformation from
> such abstract code points to sequences of machine words or bytes (and vice
> versa). What's your point?

I think it's more useful to think of data transformations between formats
rather than calling one format more abstract than another.
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-03-20 22:06 +1100 |
| Message-ID | <56ee8454$0$22142$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #105293 |
On Sun, 20 Mar 2016 05:20 pm, Rustom Mody wrote:

> On Sunday, March 20, 2016 at 10:32:07 AM UTC+5:30, Steven D'Aprano wrote:
>> On Sun, 20 Mar 2016 03:12 am, Marko Rauhamaa wrote:
>>
>> > Steven D'Aprano :
>> >
>> >> On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:
>> >>> Yes, but UTF-16 produces 16-bit values that are outside Unicode.
>> >>
>> >> Show me.
>> >>
>> >> Before you answer, if your answer is "surrogate pairs", that is
>> >> incorrect. Surrogate pairs is how UTF-16 encodes astral characters.
>> >
>> > UTF-16 inputs a Unicode stream and produces a stream of 16-bit numbers.
>> > Thus, the output of UTF-16 is not Unicode.
>>
>> I'm not sure what point you think you are making.
>>
>> Unicode (the character set part of it) is a set of abstract 23-bit
>> numbers,
>
> 23? Or 21?

Oops, you're right, it's 21 bits.

> More pertinently if the number of bits signifies, whatever is the sense of
> the word 'abstract'?

The Unicode standard does not, as far as I am aware, care how you represent code points in memory, only that there are 0x110000 of them, numbered from U+0000 to U+10FFFF. That's what I mean by abstract. The obvious implementation is to use 32-bit integers, where 0x00000000 represents code point U+0000, 0x00000001 represents U+0001, and so forth. This is essentially equivalent to UTF-16, but it's not mandated or specified by the Unicode standard, you could, if you choose, use something else.

On the other hand, I believe that the output of the UTF transformations is explicitly described in terms of 8-bit bytes and 16- or 32-bit words. For instance, the UTF-8 encoding of "A" has to be a single byte with value 0x41 (decimal 65). It isn't that this is the most obvious implementation, it's that it can't be anything else and still be UTF-8.

--
Steven
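The numbers in this post are easy to check from a Python 3 prompt. The following is only a sketch, and the names `MAX_CODE_POINT` and `utf8_a` are chosen here for illustration. It treats code points as plain integers, confirms the 21-bit figure, and shows that the UTF-8 form of "A" is the single byte 0x41.

```python
# Code points are plain integers from 0 to 0x10FFFF, however they are stored.
MAX_CODE_POINT = 0x10FFFF

print(MAX_CODE_POINT.bit_length())     # 21
print(MAX_CODE_POINT + 1 == 0x110000)  # True

# chr() maps an integer to the corresponding code point and refuses
# anything past the last one.
try:
    chr(MAX_CODE_POINT + 1)
    out_of_range_accepted = True
except ValueError:
    out_of_range_accepted = False
print(out_of_range_accepted)           # False

# The UTF-8 encoding of "A" is fixed: exactly one byte, value 0x41.
utf8_a = "A".encode("utf-8")
print(list(utf8_a))                    # [65]
```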
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2016-03-20 22:22 +1100 |
| Message-ID | <mailman.404.1458472974.12893.python-list@python.org> |
| In reply to | #105298 |
On Sun, Mar 20, 2016 at 10:06 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> The Unicode standard does not, as far as I am aware, care how you represent
> code points in memory, only that there are 0x110000 of them, numbered from
> U+0000 to U+10FFFF. That's what I mean by abstract. The obvious
> implementation is to use 32-bit integers, where 0x00000000 represents code
> point U+0000, 0x00000001 represents U+0001, and so forth. This is
> essentially equivalent to UTF-16, but it's not mandated or specified by the
> Unicode standard, you could, if you choose, use something else.

(UTF-32)

The codepoints are not representable in *memory*; they are, by definition, representable in a field of integers. If you choose to represent those integers as little-endian 32-bit values, then yes, the layout in memory will look like UTF-32LE, but that's because UTF-32LE is defined in this extremely simple way. In fact, that's exactly how the layers work - Unicode defines a mapping of characters to code points, and then UTF-x defines a mapping of code points to bytes.

> On the other hand, I believe that the output of the UTF transformations is
> explicitly described in terms of 8-bit bytes and 16- or 32-bit words. For
> instance, the UTF-8 encoding of "A" has to be a single byte with value 0x41
> (decimal 65). It isn't that this is the most obvious implementation, it's
> that it can't be anything else and still be UTF-8.

Exactly. Aside from the way UTF-16 and UTF-32 have LE and BE variants, there is only one bitpattern for any given character sequence and UTF-x (so if you work with eg "UTF-16LE", there's only one). This is no accident. Unlike some encodings, in which there's a "one most obvious" way to encode things but then a number of other legal ways, UTF-x can be compared for equality [1] using simple byte-for-byte comparisons.

This means you don't have to worry about someone sneaking a magic character past your filter; if you're checking a UTF-8 stream for the character U+003C LESS-THAN SIGN, the only byte value to look for is 0x3C - the sequence 0xC0 0xBC, despite mathematically representing the number 003C, is explicitly forbidden.

ChrisA

[1] Though not inequality - lexical sorting doesn't follow codepoint order, and codepoint order won't always match byte order. But equality is easy.
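These claims can be sketched in Python 3, assuming CPython's standard codecs; the variable names are illustrative only. The snippet checks that UTF-32LE is just the code point as a 4-byte little-endian integer, that the overlong sequence 0xC0 0xBC is refused by the UTF-8 decoder, and that byte order and code point order can disagree (here for UTF-16BE).

```python
code_point = 0x3C  # U+003C LESS-THAN SIGN

# UTF-32LE is the code point written as a 4-byte little-endian integer.
utf32_matches = (
    chr(code_point).encode("utf-32-le") == code_point.to_bytes(4, "little")
)
print(utf32_matches)  # True

# The only valid UTF-8 form of U+003C is the single byte 0x3C.
utf8_lt = chr(code_point).encode("utf-8")
print(list(utf8_lt))  # [60]

# The overlong two-byte form 0xC0 0xBC must not decode.
try:
    bytes([0xC0, 0xBC]).decode("utf-8")
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False
print(overlong_accepted)  # False

# Footnote [1]: code point order and byte order need not agree.
a, b = "\uff5e", "\U00010000"
print(a < b)                                          # True (code points)
print(a.encode("utf-16-be") < b.encode("utf-16-be"))  # False (bytes)
```

The last comparison flips because U+10000 is encoded with a surrogate pair starting 0xD8, which sorts below the 0xFF lead byte of U+FF5E.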
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-03-20 23:14 +1100 |
| Message-ID | <56ee9431$0$1620$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #105299 |
On Sun, 20 Mar 2016 10:22 pm, Chris Angelico wrote:

> On Sun, Mar 20, 2016 at 10:06 PM, Steven D'Aprano <steve@pearwood.info>
> wrote:
>> The Unicode standard does not, as far as I am aware, care how you
>> represent code points in memory, only that there are 0x110000 of them,
>> numbered from U+0000 to U+10FFFF. That's what I mean by abstract. The
>> obvious implementation is to use 32-bit integers, where 0x00000000
>> represents code point U+0000, 0x00000001 represents U+0001, and so forth.
>> This is essentially equivalent to UTF-16, but it's not mandated or
>> specified by the Unicode standard, you could, if you choose, use
>> something else.
>
> (UTF-32)

D'oh!

I mean, yes, well done, you have passed my little test to see if anyone is paying attention. Have a gold star.

> The codepoints are not representable in *memory*; they are, by
> definition, representable in a field of integers.

They're not directly representable in memory because the definition of code points is not given in terms of memory values. Hence, they are abstract values, numbered in a certain way, and given certain semantics.

In other words, there's nothing in the Unicode standard that says that code point U+0020 has to be stored as a byte 0x20, or a word 0x0020. But the standard does say that the code point U+0020 represents a space character.

[...]

>> On the other hand, I believe that the output of the UTF transformations
>> is explicitly described in terms of 8-bit bytes and 16- or 32-bit words.
>> For instance, the UTF-8 encoding of "A" has to be a single byte with
>> value 0x41 (decimal 65). It isn't that this is the most obvious
>> implementation, it's that it can't be anything else and still be UTF-8.
>
> Exactly. Aside from the way UTF-16 and UTF-32 have LE and BE variants,

Blame the chip manufacturers for that. Actually, I think we can blame Intel specifically for that, for reversing the normal layout of words in memory.

--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2016-03-20 23:27 +1100 |
| Message-ID | <mailman.406.1458476886.12893.python-list@python.org> |
| In reply to | #105300 |
On Sun, Mar 20, 2016 at 11:14 PM, Steven D'Aprano <steve@pearwood.info> wrote:
>>> On the other hand, I believe that the output of the UTF transformations
>>> is explicitly described in terms of 8-bit bytes and 16- or 32-bit words.
>>> For instance, the UTF-8 encoding of "A" has to be a single byte with
>>> value 0x41 (decimal 65). It isn't that this is the most obvious
>>> implementation, it's that it can't be anything else and still be UTF-8.
>>
>> Exactly. Aside from the way UTF-16 and UTF-32 have LE and BE variants,
>
> Blame the chip manufacturers for that. Actually, I think we can blame Intel
> specifically for that, for reversing the normal layout of words in memory.

No, I disagree; it's inherent in the notion of representing a 16-bit or 32-bit value across bytes. Maybe there could have been one most-common standard, but there'd still have been another way of doing it. Little-endianness and big-endianness are important enough to have to deal with.

ChrisA
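To make the LE/BE point concrete, a small Python 3 sketch (variable names are illustrative): the same 16-bit value 0x0041 can be laid out across two bytes in either order, and the plain `utf-16` codec prepends a byte order mark so a decoder can tell which one it is reading.

```python
text = "A"  # U+0041

le = text.encode("utf-16-le")  # low byte first
be = text.encode("utf-16-be")  # high byte first
print(list(le))  # [65, 0]
print(list(be))  # [0, 65]

# Without an explicit byte order, Python writes a BOM followed by the
# platform's native order, so the exact bytes depend on the machine.
with_bom = text.encode("utf-16")
bom = with_bom[:2]
print(bom in (b"\xff\xfe", b"\xfe\xff"))  # True

# Decoding with the generic codec uses the BOM to pick the byte order.
round_trip = with_bom.decode("utf-16")
print(round_trip == text)  # True
```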
| From | Ben Bacarisse <ben.usenet@bsb.me.uk> |
|---|---|
| Date | 2016-03-20 14:55 +0000 |
| Message-ID | <874mc1mc5g.fsf@bsb.me.uk> |
| In reply to | #105293 |
Rustom Mody <rustompmody@gmail.com> writes:

> On Sunday, March 20, 2016 at 10:32:07 AM UTC+5:30, Steven D'Aprano wrote:
<snip>
>> Unicode (the character set part of it) is a set of abstract 23-bit numbers,
>
> 23? Or 21?

It's 21. The reason being (or at least part of the reason being) that 21 bits can be UTF-8 encoded in 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (3 + 3*6).

<snip>
--
Ben.
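The 3 + 3*6 layout can be written out by hand. Below is a minimal Python 3 sketch; the function name `encode_utf8_4byte` is hypothetical, and it handles only the four-byte case (U+10000 to U+10FFFF) and is compared against Python's own encoder.

```python
def encode_utf8_4byte(cp):
    """Pack a code point in U+10000..U+10FFFF as 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points needing four UTF-8 bytes are handled")
    return bytes([
        0xF0 | (cp >> 18),           # 11110xxx: top 3 bits
        0x80 | ((cp >> 12) & 0x3F),  # 10xxxxxx: next 6 bits
        0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | (cp & 0x3F),          # 10xxxxxx: low 6 bits
    ])


for cp in (0x10000, 0x1F600, 0x10FFFF):
    manual = encode_utf8_4byte(cp)
    builtin = chr(cp).encode("utf-8")
    print(hex(cp), manual.hex(), manual == builtin)
```

Since the largest code point is 0x10FFFF, the top three bits never exceed 100 in binary, so the lead byte is at most 0xF4.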
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-20 17:36 +0200 |
| Message-ID | <87shzl6u1i.fsf@elektro.pacujo.net> |
| In reply to | #105303 |
Ben Bacarisse <ben.usenet@bsb.me.uk>:

> It's 21. The reason being (or at least part of the reason being) that
> 21 bits can be UTF-8 encoded in 4 bytes: 11110xxx 10xxxxxx 10xxxxxx
> 10xxxxxx (3 + 3*6).

I bet the reason is UTF-16. Microsoft and Sun/Oracle would have insisted on a maximum of 4 bytes per character. UTF-16 can just barely squeeze 21 bits into the scheme and only at the expense of creating an ugly hole inside Unicode.

Politics, politics.

Marko
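The arithmetic behind the UTF-16 limit can be sketched in Python 3. The names `HIGH`, `LOW` and `decode_pair` are chosen here for illustration. The surrogate ranges are the standard ones: U+D800 to U+DBFF for the first unit and U+DC00 to U+DFFF for the second.

```python
HIGH = range(0xD800, 0xDC00)  # 1024 possible first units
LOW = range(0xDC00, 0xE000)   # 1024 possible second units

# 1024 * 1024 pairs on top of the 0x10000 values of the BMP.
pairs = len(HIGH) * len(LOW)
total = 0x10000 + pairs
print(hex(pairs), hex(total))  # 0x100000 0x110000


def decode_pair(high, low):
    """Combine a surrogate pair into the code point it encodes."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(decode_pair(0xD800, 0xDC00)))  # 0x10000
print(hex(decode_pair(0xDBFF, 0xDFFF)))  # 0x10ffff

# The "hole": a lone surrogate code point cannot be encoded as UTF-8.
try:
    "\ud800".encode("utf-8")
    lone_surrogate_encodable = True
except UnicodeEncodeError:
    lone_surrogate_encodable = False
print(lone_surrogate_encodable)  # False
```

So the last code point, U+10FFFF, is exactly the largest value a surrogate pair can express.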