Groups > comp.lang.python > #86311 > unrolled thread
| Started by | pierrick.brihaye@gmail.com |
|---|---|
| First post | 2015-02-24 02:49 -0800 |
| Last post | 2015-02-27 10:23 +1100 |
| Articles | 20 on this page of 158 — 19 participants |
Newbie question about text encoding pierrick.brihaye@gmail.com - 2015-02-24 02:49 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-24 22:09 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 06:25 -0500
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 15:55 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:03 +1100
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:06 +0100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-24 08:01 -0800
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:07 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:10 +1100
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:24 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:33 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-02-24 10:38 -0500
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 17:20 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 03:24 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 12:13 -0500
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:45 +0100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-02-25 00:21 +0200
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:20 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-25 06:34 -0800
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:57 +0100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:19 +1100
Re: Newbie question about text encoding Marcos Almeida Azevedo <marcos.al.azevedo@gmail.com> - 2015-02-25 12:54 +0800
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 15:41 -0500
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 04:40 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 05:15 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 00:24 +1100
Re: Newbie question about text encoding Sam Raker <sam.raker@gmail.com> - 2015-02-26 08:45 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:08 -0800
Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-02-26 12:02 -0500
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:59 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-26 12:20 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 09:13 +1100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 12:05 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-26 20:57 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 16:58 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 02:30 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 22:54 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 09:02 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 01:22 +1100
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:00 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 03:12 +1100
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:45 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 04:45 +1100
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:13 +0000
Re: Newbie question about text encoding MRAB <python@mrabarnett.plus.com> - 2015-02-27 19:14 +0000
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:09 +0000
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 15:52 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 08:04 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 10:24 -0500
Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:46 +0000
Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:47 +0000
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-27 01:06 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 11:59 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 10:03 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-03 10:36 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:45 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 15:54 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 21:05 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 01:06 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-05 06:59 -0800
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-05 14:59 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 09:33 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-05 20:53 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 16:20 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:02 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:06 -0800
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 08:33 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 00:39 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:03 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 01:11 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:27 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 03:26 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 20:54 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 02:07 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 01:50 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 02:27 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 07:37 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 08:20 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 03:45 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:41 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:58 -0800
Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-07 01:11 -0500
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 23:43 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 00:55 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 01:08 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:25 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 22:09 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 22:33 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 13:53 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 23:02 +1100
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:07 +0000
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 07:28 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 02:40 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 17:48 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:17 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:25 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:41 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:54 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:58 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:00 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:14 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:26 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:50 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:59 +1100
Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:02 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:13 +1100
Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:34 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:44 +1100
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 19:00 +0000
Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 19:16 +0000
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 21:01 +0200
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 16:40 +0000
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:48 +0200
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 17:02 +0000
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:16 +0200
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 18:18 +0000
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:06 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:53 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 11:03 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 12:45 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 09:20 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 18:37 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 10:09 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 19:23 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-08 01:18 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:25 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 22:09 +0200
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 12:43 +1100
Re: Newbie question about text encoding Ben Finney <ben+python@benfinney.id.au> - 2015-03-09 13:09 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-09 08:31 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 13:18 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-09 00:27 -0400
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 07:55 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 08:13 +1100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 17:34 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 17:44 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 02:08 -0700
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 07:26 -0700
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-09 05:28 -0700
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 19:01 +1100
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:13 +0000
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 23:23 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:30 +1100
Re: Newbie question about text encoding Cameron Simpson <cs@zip.com.au> - 2015-03-09 13:09 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-08 19:42 -0700
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:16 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 05:43 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 18:53 -0800
Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-03 18:30 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 13:54 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 14:02 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:05 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:16 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:14 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-04 02:16 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 04:29 +1100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 10:09 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 10:23 +1100
| From | random832@fastmail.us |
|---|---|
| Date | 2015-03-05 14:59 -0500 |
| Message-ID | <mailman.63.1425585548.21433.python-list@python.org> |
| In reply to | #86942 |
On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote:
> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in
> UTF-8 and UTF-32, since that goes against the grain of the system. You
> would have to program in artificial restrictions that otherwise don't
> exist.

UTF-8 is already restricted from representing values above 0x10FFFF, whereas UTF-8 can "naturally" represent values up to 0x1FFFFF in four bytes, up to 0x3FFFFFF in five bytes, and 0x7FFFFFFF in six bytes. If anything, the BMP represents a natural boundary, since it coincides with values that can be represented in three bytes. Likewise, UTF-32 can obviously represent values up to 0xFFFFFFFF.

You're programming in artificial restrictions either way, it's just a question of what those restrictions are.
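The byte-length limits quoted in this post can be checked with a little arithmetic. The sketch below is only an illustration: the table of payload bits describes the original six-byte UTF-8 design (modern UTF-8 stops at four bytes and U+10FFFF), and the names `PAYLOAD_BITS` and `max_value` are made up for this example.

```python
# Payload bits per sequence length in the original UTF-8 design.
# Assumption for illustration: this is the historical 1-to-6 byte scheme,
# not the modern one, which is capped at four bytes and U+10FFFF.
PAYLOAD_BITS = {1: 7, 2: 11, 3: 16, 4: 21, 5: 26, 6: 31}


def max_value(nbytes):
    """Largest integer an nbytes-long sequence could carry in that scheme."""
    return (1 << PAYLOAD_BITS[nbytes]) - 1


for n in sorted(PAYLOAD_BITS):
    print(n, hex(max_value(n)))

# In today's UTF-8 the last BMP code point still takes three bytes,
# and the first supplementary code point takes four.
print(len("\uffff".encode("utf-8")))
print(len("\U00010000".encode("utf-8")))
```

The three-byte ceiling of 0xFFFF is the "natural boundary" the post refers to.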
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-03-06 09:33 +1100 |
| Message-ID | <54f8d9c6$0$12993$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #86951 |
random832@fastmail.us wrote:

> On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote:
>> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in
>> UTF-8 and UTF-32, since that goes against the grain of the system. You
>> would have to program in artificial restrictions that otherwise don't
>> exist.
>
> UTF-8 is already restricted from representing values above 0x10FFFF,
> whereas UTF-8 can "naturally" represent values up to 0x1FFFFF in four
> bytes, up to 0x3FFFFFF in five bytes, and 0x7FFFFFFF in six bytes. If
> anything, the BMP represents a natural boundary, since it coincides with
> values that can be represented in three bytes. Likewise, UTF-32 can
> obviously represent values up to 0xFFFFFFFF. You're programming in
> artificial restrictions either way, it's just a question of what those
> restrictions are.

Good points, but they don't greatly change my conclusion. If you are implementing UTF-8 or UTF-32, it is no harder to deal with code points in the SMP than those in the BMP.

--
Steven
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2015-03-05 20:53 -0800 |
| Message-ID | <c6caaa76-f448-4c2f-8874-c1f2716da744@googlegroups.com> |
| In reply to | #86942 |
On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
>
> > On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote:
> >> On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody wrote:
> >> >
> >> > It lists some examples of software that somehow break/goof going from
> >> > BMP-only unicode to 7.0 unicode.
> >> >
> >> > IOW the suggestion is that the two-way classification
> >> > - ASCII
> >> > - Unicode
> >> >
> >> > is less useful and accurate than the 3-way
> >> >
> >> > - ASCII
> >> > - BMP
> >> > - Unicode
> >>
> >> How is that more useful? Aside from storage optimizations (in which
> >> the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is
> >> not significantly different from the rest of Unicode.
> >
> > Sorry... Don't understand.
>
> Chris is suggesting that going from BMP to all of Unicode is not the hard
> part. Going from ASCII to the BMP part of Unicode is the hard part. If you
> can do that, you can go the rest of the way easily.

Depends where the going is starting from.
I specifically named Java, Javascript, Windows... among others.
Here are some quotes from the supplementary chars doc of Java
http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html

| Supplementary characters are characters in the Unicode standard whose code
| points are above U+FFFF, and which therefore cannot be described as single
| 16-bit entities such as the char data type in the Java programming language.
| Such characters are generally rare, but some are used, for example, as part
| of Chinese and Japanese personal names, and so support for them is commonly
| required for government applications in East Asian countries...

| The introduction of supplementary characters unfortunately makes the
| character model quite a bit more complicated.

| Unicode was originally designed as a fixed-width 16-bit character encoding.
| The primitive data type char in the Java programming language was intended to
| take advantage of this design by providing a simple data type that could hold
| any character.... Version 5.0 of the J2SE is required to support version 4.0
| of the Unicode standard, so it has to support supplementary characters.

My conclusion: Early adopters of unicode -- Windows and Java -- were punished for their early adoption. You can blame the unicode consortium, you can blame the babel of human languages, particularly that some use characters and some only (the equivalent of) what we call words.

Or you can skip the blame-game and simply note the fact that large segments of extant code-bases are currently in bug-prone or plain buggy state.

This includes not just bug-prone-system code such as Java and Windows but seemingly working code such as python 3.

>
> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8
> and UTF-32, since that goes against the grain of the system. You would have
> to program in artificial restrictions that otherwise don't exist.

Yes UTF-8 and UTF-32 make most of the objections to unicode 7.0 irrelevant.
Large segments of the

>
> UTF-16 is different, and that's probably why you think supporting all of
> Unicode is hard. With UTF-16, there really is an obvious distinction
> between the BMP and the SMP: that's where you jump from a single 2-byte
> unit to a pair of 2-byte units. But that distinction doesn't exist in UTF-8
> or UTF-32:
>
> - In UTF-8, about 99.8% of the BMP requires multiple bytes. Whether you
>   support the SMP or not doesn't change the fact that you have to deal
>   with multi-byte characters.
>
> - In UTF-32, everything is fixed-width whether it is in the BMP or not.
>
> In both cases, supporting the SMPs is no harder than supporting the BMP.
> It's only UTF-16 that makes the SMP seem hard.
>
> Conclusion: faulty implementations of UTF-16 which incorrectly handle
> surrogate pairs should be replaced by non-faulty implementations, or
> changed to UTF-8 or UTF-32; incomplete Unicode implementations which assume
> that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should be
> upgraded.

Imagine for a moment a thought experiment -- we are not on a python but a java forum and please rewrite the above para.
Are you addressing the vanilla java programmer? Language implementer? Designer? The Java-funders -- earlier Sun, now Oracle?

>
> Wrong conclusion: SMPs are unnecessary and unneeded, and we need a new
> standard that is just like obsolete Unicode version 1.
>
> Unicode version 1 is obsolete for a reason. 16 bits is not enough for even
> existing languages, let alone all the code points and characters that are
> used in human communication.
>
>
> >> Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why
> >> do you keep talking about 7.0 as if it's a recent change?
> >
> > It is 2015 as of now. 7.0 is the current standard.
> >
> > The need for the adjective 'current' should be pondered upon.
>
> What's your point?
>
> The UTF encodings have not changed since they were first introduced. They
> have been stable for at least twenty years: UTF-8 has existed since 1993,
> and UTF-16 since 1996.
>
> Since version 2.0 of Unicode in 1996, the standard has made "stability
> guarantees" that no code points will be renamed or removed. Consequently,
> there has only been one version which removed characters, version 1.1.
> Since then, new versions of the standard have only added characters, never
> moved, renamed or deleted them.
>
> http://unicode.org/policies/stability_policy.html
>
> Some highlights in Unicode history:
>
> Unicode 1.0 (1991): initial version, defined 7161 code points.
>
> In January 1993, Rob Pike and Ken Thompson announced the design and working
> implementation of the UTF-8 encoding.
>
> 1.1 (1993): defined 34233 characters, finalised Han Unification. Removed
> some characters from the 1.0 set. This is the first and only time any code
> points have been removed.
>
> 2.0 (1996): First version to include code points in the Supplementary
> Multilingual Planes. Defined 38950 code points. Introduced the UTF-16
> encoding.
>
> 3.1 (2001): Defined 94205 code points, including 42711 additional Han
> ideographs, bringing the total number of CJK code points alone to 71793,
> too many to fit in 16 bits.
>
> 2006: The People's Republic Of China mandates support for the GB-18030
> character set for all software products sold in the PRC. GB-18030 supports
> the entire Unicode range, including the SMPs. Since this date, all software
> sold in China must support the SMPs.
>
> 6.0 (2010): The first emoji or emoticons were added to Unicode.
>
> 7.0 (2014): 113021 code points defined in total.
>
>
> > In practice, standards change.
> > However if a standard changes so frequently that users have to play
> > catching cook and keep asking: "Which version?" they are justified in
> > asking "Are the standard-makers doing due diligence?"
>
> Since Unicode has stability guarantees, and the encodings have not changed
> in twenty years and will not change in the future, this argument is bogus.
> Updating to a new version of the standard means, to a first approximation,
> merely allocating some new code points which had previously been undefined
> but are now defined.
>
> (Code points can be flagged deprecated, but they will never be removed.)

It's not about new code points; it's about "Fits in 2 bytes" to "Does not fit in 2 bytes".

If you call that argument bogus I call you a non computer scientist.
[Essentially this is my issue with the consortium: it seems to be working like a bunch of linguists, not computer scientists]

Here is Roy Smith's post that first started me thinking that something may be wrong with SMP
https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ

Some parts are here, some earlier and from my memory.
If details are wrong please correct:
- 200 million records
- Containing 4 strings with SMP characters
- System made with python and mysql. SMP works with python, breaks mysql.

So the whole system broke due to those 4 in 200,000,000 records.

I know enough (or not enough) of unicode to be chary of statistical conclusions from the above.
My conclusion is essentially an 'existence-proof':

SMP-chars can break systems.
The breakage is costly-fied by the combination
- layman statistical assumptions
- BMP → SMP exercises different code-paths

It is necessary but not sufficient to test print "hello world" in ASCII, BMP, SMP.
You also have to write the hello world in the database -- mysql
Read it from the webform -- javascript
etc etc

You could also choose to do with "astral crap" (Roy's words) what we all do with crap -- throw it out as early as possible.
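The three-tier test suggested in this post (ASCII, BMP, SMP) can be sketched in Python 3 roughly as follows. This is a minimal illustration, not anything taken from Roy Smith's system: the helper names `widest_tier` and `needs_four_byte_utf8` and the sample strings are invented here. The four-byte check is relevant because a storage layer limited to three-byte UTF-8 sequences, as MySQL's legacy `utf8` character set is, cannot hold supplementary characters.

```python
def widest_tier(text):
    """Return 'ascii', 'bmp' or 'supplementary' for the highest code point in text."""
    top = max(map(ord, text), default=0)
    if top < 0x80:
        return "ascii"
    if top <= 0xFFFF:
        return "bmp"
    return "supplementary"


def needs_four_byte_utf8(text):
    """True if any character lies above U+FFFF and so needs four bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


# Hypothetical test strings, one per tier.
samples = [
    "hello world",                # ASCII only
    "h\u00e9llo w\u00f6rld",      # BMP: Latin letters with diacritics
    "hello \U0001F30D",           # supplementary: a globe emoji
]

for s in samples:
    print(ascii(s), widest_tier(s), needs_four_byte_utf8(s))
```

Running every layer (database, web form, and so on) against one string from each tier is the kind of end-to-end check the post asks for; filtering on `needs_four_byte_utf8` at the input boundary would be the "throw it out early" alternative.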
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-06 16:20 +1100 |
| Message-ID | <mailman.88.1425619223.21433.python-list@python.org> |
| In reply to | #86986 |
On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody <rustompmody@gmail.com> wrote:
> My conclusion: Early adopters of unicode -- Windows and Java -- were punished
> for their early adoption. You can blame the unicode consortium, you can
> blame the babel of human languages, particularly that some use characters
> and some only (the equivalent of) what we call words.
>
> Or you can skip the blame-game and simply note the fact that large segments of
> extant code-bases are currently in bug-prone or plain buggy state.

For most of the 1990s, I was writing code in REXX, on OS/2. An even earlier adopter, REXX didn't have Unicode support _at all_, but instead had facilities for working with DBCS strings. You can't get everything right AND be the first to produce anything. Python didn't make Unicode strings the default until 3.0, but that's not Unicode's fault.

> This includes not just bug-prone-system code such as Java and Windows but
> seemingly working code such as python 3.
>
> Here is Roy Smith's post that first started me thinking that something may
> be wrong with SMP
> https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ
>
> Some parts are here some earlier and from my memory.
> If details are wrong please correct:
> - 200 million records
> - Containing 4 strings with SMP characters
> - System made with python and mysql. SMP works with python, breaks mysql.
> So whole system broke due to those 4 in 200,000,000 records
>
> I know enough (or not enough) of unicode to be chary of statistical conclusions
> from the above.
> My conclusion is essentially an 'existence-proof':

Hang on hang on. Why are you blaming Python or SMP characters for this? The problem here is MySQL, which doesn't adequately cope with the full Unicode range. (Or, didn't then, or doesn't with its default settings. I believe you can configure current versions of MySQL to work correctly, though I haven't actually checked. PostgreSQL gets it right, that's good enough for me.)

> SMP-chars can break systems.
> The breakage is costly-fied by the combination
> - layman statistical assumptions
> - BMP → SMP exercises different code-paths

Broken systems can be shown up by anything. Suppose you have a program that breaks when it gets a NUL character (not unknown in C code); is the fault with the Unicode consortium for allocating something at codepoint 0, or the code that can't cope with a perfectly normal character?

> You could also choose to do with "astral crap" (Roy's words) what we all do with
> crap -- throw it out as early as possible.

There's only one character that fits that description, and that's 1F4A9. Everything else is just "astral characters", and you shouldn't have any difficulties with them.

ChrisA
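For readers who want to see what an "astral" character looks like on the wire, here is a small Python 3 sketch using U+1F4A9, the code point mentioned above. The function name `surrogate_pair` is made up for this illustration; the arithmetic is the standard UTF-16 surrogate calculation, and the built-in codecs are printed alongside it as a cross-check.

```python
def surrogate_pair(code_point):
    """Split a supplementary code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000            # 20 significant bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low


ch = "\U0001F4A9"
high, low = surrogate_pair(ord(ch))
print(hex(high), hex(low))

# The same character through Python's own codecs.
print(ch.encode("utf-16-be").hex())   # two 16-bit code units
print(ch.encode("utf-8").hex())       # four bytes
print(ch.encode("utf-32-be").hex())   # one 32-bit unit
```

Only the UTF-16 form needs the extra pairing step, which is where the different code path for supplementary characters comes from.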
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2015-03-06 01:02 -0800 |
| Message-ID | <01dd9b83-db3e-4e7d-9022-dc6af75eb570@googlegroups.com> |
| In reply to | #86987 |
On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
> On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody wrote:
> > My conclusion: Early adopters of unicode -- Windows and Java -- were punished
> > for their early adoption. You can blame the unicode consortium, you can
> > blame the babel of human languages, particularly that some use characters
> > and some only (the equivalent of) what we call words.
> >
> > Or you can skip the blame-game and simply note the fact that large segments of
> > extant code-bases are currently in bug-prone or plain buggy state.
>
> For most of the 1990s, I was writing code in REXX, on OS/2. An even
> earlier adopter, REXX didn't have Unicode support _at all_, but
> instead had facilities for working with DBCS strings. You can't get
> everything right AND be the first to produce anything. Python didn't
> make Unicode strings the default until 3.0, but that's not Unicode's
> fault.
>
> > This includes not just bug-prone-system code such as Java and Windows but
> > seemingly working code such as python 3.
> >
> > Here is Roy Smith's post that first started me thinking that something may
> > be wrong with SMP
> > https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ
> >
> > Some parts are here some earlier and from my memory.
> > If details are wrong please correct:
> > - 200 million records
> > - Containing 4 strings with SMP characters
> > - System made with python and mysql. SMP works with python, breaks mysql.
> > So whole system broke due to those 4 in 200,000,000 records
> >
> > I know enough (or not enough) of unicode to be chary of statistical conclusions
> > from the above.
> > My conclusion is essentially an 'existence-proof':
>
> Hang on hang on. Why are you blaming Python or SMP characters for
> this? The problem here is MySQL, which doesn't adequately cope with
> the full Unicode range. (Or, didn't then, or doesn't with its default
> settings. I believe you can configure current versions of MySQL to
> work correctly, though I haven't actually checked. PostgreSQL gets it
> right, that's good enough for me.)
>
> > SMP-chars can break systems.
> > The breakage is costly-fied by the combination
> > - layman statistical assumptions
> > - BMP → SMP exercises different code-paths
>
> Broken systems can be shown up by anything. Suppose you have a program
> that breaks when it gets a NUL character (not unknown in C code); is
> the fault with the Unicode consortium for allocating something at
> codepoint 0, or the code that can't cope with a perfectly normal
> character?

Strawman.

Let's please stick to UTF-16 shall we?

Now tell me:
- Is it broken or not?
- Is it widely used or not?
- Should programmers be careful of it or not?
- Should programmers be warned about it or not?
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2015-03-06 01:06 -0800 |
| Message-ID | <d01a4428-d691-4620-88ba-076360366cff@googlegroups.com> |
| In reply to | #87001 |
On Friday, March 6, 2015 at 2:33:11 PM UTC+5:30, Rustom Mody wrote:
> Lets please stick to UTF-16 shall we?
>
> Now tell me:
> - Is it broken or not?
> - Is it widely used or not?
> - Should programmers be careful of it or not?
> - Should programmers be warned about it or not?

Also:
Can a programmer who is away from UTF-16 in one part of the system (say by
using python3) assume he is safe all over?
| From | random832@fastmail.us |
|---|---|
| Date | 2015-03-06 08:33 -0500 |
| Message-ID | <mailman.108.1425648784.21433.python-list@python.org> |
| In reply to | #87002 |
On Fri, Mar 6, 2015, at 04:06, Rustom Mody wrote:
> Also:
> Can a programmer who is away from UTF-16 in one part of the system (say
> by using python3)
> assume he is safe all over?

The most common failure of UTF-16 support, supposedly, is in programs
misusing the number of code units (for length or random access) as a
proxy for the number of characters.

However, when do you _really_ want the number of characters? You may
want to use it for, for example, the number of columns in a 'monospace'
font, which you've already screwed up because you haven't accounted for
double-wide characters or combining marks. Or you may want the position
that pressing an arrow key or backspace or forward-delete a number of
times will reach, which has its own rules in e.g. Indic languages (and
also fails on Latin with combining marks).
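As a rough illustration of the distinction between code units and characters drawn in this post, the following Python 3 sketch counts the same string three ways. The helper name `lengths` is made up for this example and does not come from the thread.

```python
def lengths(s):
    """Return (code points, UTF-16 code units, UTF-8 bytes) for s."""
    code_points = len(s)                           # Python 3 str counts code points
    utf16_units = len(s.encode("utf-16-le")) // 2  # each UTF-16 code unit is 2 bytes
    utf8_bytes = len(s.encode("utf-8"))
    return code_points, utf16_units, utf8_bytes


# One BMP letter followed by one SMP character (U+1F4A9).
print(lengths("a\U0001F4A9"))   # (2, 3, 5)

# 'e' plus a combining acute accent: one visible character, two code points.
print(lengths("e\u0301"))       # (2, 2, 3)
```

In the second case none of the three numbers matches the single character a reader would see, which is the situation the post describes.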
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-07 00:39 +1100 |
| Message-ID | <mailman.109.1425649169.21433.python-list@python.org> |
| In reply to | #87002 |
On Sat, Mar 7, 2015 at 12:33 AM, <random832@fastmail.us> wrote:
> However, when do you _really_ want the number of characters? You may
> want to use it for, for example, the number of columns in a 'monospace'
> font, which you've already screwed up because you haven't accounted for
> double-wide characters or combining marks. Or you may want the position
> that pressing an arrow key or backspace or forward-delete a number of
> times will reach, which has its own rules in e.g. Indic languages (and
> also fails on Latin with combining marks).

Number of code points is the most logical way to length-limit
something. If you want to allow users to set their display names but
not to make arbitrarily long ones, limiting them to X code points is
the safest way (and preferably do an NFC or NFD normalization before
counting, for consistency); this means you disallow pathological cases
where every base character has innumerable combining marks added.

ChrisA
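A minimal sketch of the check described in this post, assuming NFC is the chosen normalization. The function name `name_within_limit` and the limit of 20 are illustrative assumptions, not anything specified in the thread.

```python
import unicodedata

MAX_CODE_POINTS = 20  # hypothetical limit, chosen only for this example


def name_within_limit(name, limit=MAX_CODE_POINTS):
    """Normalize to NFC, then compare the code point count with the limit."""
    normalized = unicodedata.normalize("NFC", name)
    return len(normalized) <= limit


# Twenty decomposed "e + combining acute" pairs are 40 code points as typed,
# but NFC composes each pair into a single code point.
decomposed = "e\u0301" * 20
print(len(decomposed), name_within_limit(decomposed))   # 40 True

# One base character carrying thirty combining marks is still rejected.
print(name_within_limit("e" + "\u0301" * 30))            # False
```

Normalizing first means two visually identical names are counted the same way regardless of how they were typed.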
| From | random832@fastmail.us |
|---|---|
| Date | 2015-03-06 09:03 -0500 |
| Message-ID | <mailman.111.1425650593.21433.python-list@python.org> |
| In reply to | #87002 |
On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote:
> Number of code points is the most logical way to length-limit
> something. If you want to allow users to set their display names but
> not to make arbitrarily long ones, limiting them to X code points is
> the safest way (and preferably do an NFC or NFD normalization before
> counting, for consistency);

Why are you length-limiting it? Storage space? Limit it in whatever
encoding it's stored in. Why are combining marks "pathological" but
surrogate characters not? Display space? Limit it by columns. If you're
going to allow a Japanese user's name to be twice as wide, you've got a
problem when you go to display it.

> this means you disallow pathological cases
> where every base character has innumerable combining marks added.

No it doesn't. If you limit it to, say, fifty, someone can still post
two base characters with twenty combining marks each. If you actually
want to disallow this, you've got to do more work. You've disallowed
some of the pathological cases, some of the time, by coincidence. And
limiting the number of UTF-8 bytes, or the number of UTF-16 code units,
will accomplish this just as well.

Now, if you intend to _silently truncate_ it to the desired length, you
certainly don't want to leave half a character in, of course. But who's
to say the base character plus first few combining marks aren't also
"half a character"? If you're _splitting_ a string, rather than merely
truncating it, you probably don't want those combining marks at the
beginning of part two.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-07 01:11 +1100 |
| Message-ID | <mailman.112.1425651082.21433.python-list@python.org> |
| In reply to | #87002 |
On Sat, Mar 7, 2015 at 1:03 AM, <random832@fastmail.us> wrote:
> On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote:
>> Number of code points is the most logical way to length-limit
>> something. If you want to allow users to set their display names but
>> not to make arbitrarily long ones, limiting them to X code points is
>> the safest way (and preferably do an NFC or NFD normalization before
>> counting, for consistency);
>
> Why are you length-limiting it? Storage space? Limit it in whatever
> encoding they're stored in. Why are combining marks "pathological" but
> surrogate characters not? Display space? Limit it by columns. If you're
> going to allow a Japanese user's name to be twice as wide, you've got a
> problem when you go to display it.

To prevent people from putting three paragraphs of lipsum in and
calling it a username.

>> this means you disallow pathological cases
>> where every base character has innumerable combining marks added.
>
> No it doesn't. If you limit it to, say, fifty, someone can still post
> two base characters with twenty combining marks each. If you actually
> want to disallow this, you've got to do more work. You've disallowed
> some of the pathological cases, some of the time, by coincidence. And
> limiting the number of UTF-8 bytes, or the number of UTF-16 code points,
> will accomplish this just as well.

They can, but then they're limited to two base characters. They can't
have fifty base characters with twenty combining marks each. That's
the point.

> Now, if you intend to _silently truncate_ it to the desired length, you
> certainly don't want to leave half a character in, of course. But who's
> to say the base character plus first few combining marks aren't also
> "half a character"? If you're _splitting_ a string, rather than merely
> truncating it, you probably don't want those combining marks at the
> beginning of part two.

So you truncate to the desired length, then if the first character of
the trimmed-off section is a combining mark (based on its Unicode
character types), you keep trimming until you've removed a character
which isn't. Then, if you no longer have any content whatsoever,
reject the name. Simple.

ChrisA
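One possible reading of the truncation rule described in this post, sketched in Python 3. The function names are hypothetical, and treating any character whose general category starts with "M" as a combining mark is an assumption about what "Unicode character types" means here.

```python
import unicodedata


def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def truncate_name(name, limit):
    """Cut name to at most `limit` code points without splitting a
    base character from its combining marks."""
    kept, tail = name[:limit], name[limit:]
    if tail and is_mark(tail[0]):
        # The cut fell inside a base+marks sequence: drop the marks already
        # kept, then drop the base character they belonged to.
        while kept and is_mark(kept[-1]):
            kept = kept[:-1]
        kept = kept[:-1]
    if not kept:
        raise ValueError("nothing left after truncation; reject the name")
    return kept


print(truncate_name("abc", 2))                      # ab
print(repr(truncate_name("ae\u0301\u0302", 3)))     # 'a'
```

This handles only combining marks; it does not attempt full grapheme cluster segmentation, which needs more rules than the post describes.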
| From | random832@fastmail.us |
|---|---|
| Date | 2015-03-06 09:27 -0500 |
| Message-ID | <mailman.113.1425652066.21433.python-list@python.org> |
| In reply to | #87002 |
On Fri, Mar 6, 2015, at 09:11, Chris Angelico wrote:
> To prevent people from putting three paragraphs of lipsum in and
> calling it a username.

Limiting by UTF-8 bytes or UTF-16 units works just as well for that.

> So you truncate to the desired length, then if the first character of
> the trimmed-off section is a combining mark (based on its Unicode
> character types), you keep trimming until you've removed a character
> which isn't. Then, if you no longer have any content whatsoever,
> reject the name. Simple.

My entire point was that UTF-32 doesn't save you from that, so it cannot
be called a deficiency of UTF-16. My point is there are very few
problems to which "count of Unicode code points" is the only right
answer - that UTF-32 is good enough for but that are meaningfully
impacted by a naive usage of UTF-16, to the point where UTF-16 is
something you have to be "safe" from.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-03-07 03:26 +1100 |
| Message-ID | <54f9d51b$0$13014$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #87025 |
random832@fastmail.us wrote:

> My point is there are very few
> problems to which "count of Unicode code points" is the only right
> answer - that UTF-32 is good enough for but that are meaningfully
> impacted by a naive usage of UTF-16, to the point where UTF-16 is
> something you have to be "safe" from.

I'm not sure why you care about the "count of Unicode code points",
although that *is* a problem. Not for end-user reasons like "how long is my
password?", but because it makes your job as a programmer harder.

[steve@ando ~]$ python2.7 -c "print (len(u'\U00004444:\U00014445'))"
4
[steve@ando ~]$ python3.3 -c "print (len(u'\U00004444:\U00014445'))"
3

It's hard to reason about your code when something as fundamental as the
length of a string is implementation-dependent. (By the way, the right
answer should be 3, not 4.)

But an even more important problem is that broken-UTF-16 lets you create
invalid, impossible Unicode strings *by accident*. Naturally you can create
broken Unicode if you assemble strings of surrogates yourself, but
broken-UTF-16 means it can happen from otherwise innocuous operations like
reversing a string:

py> s = u'\U00004444:\U00014445' # Python 2.7 narrow build
py> s[::-1]
u'\udc45\ud811:\u4444'

It's hard for me to demonstrate that the reversed string is broken because
the shell I am using does an amazingly good job of handling broken Unicode.
Even if I print it, the shell just prints missing-character glyphs instead
of crashing (fortunately for me!). But the first two code points are in
illegal order:

\udc45 is a low surrogate, and must follow a high surrogate;
\ud811 is a high surrogate, and must precede a low surrogate;

I'm not convinced you should be allowed to create Unicode strings containing
mismatched surrogates like this deliberately, but you certainly shouldn't be
able to do so by accident.

--
Steven
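Python 3.3 and later no longer have narrow builds, so the reversal above cannot be reproduced there directly. The sketch below imitates it by working on a list of UTF-16 code units instead. The helper names are invented for this illustration; the surrogate ranges used (0xD800-0xDBFF for high/lead, 0xDC00-0xDFFF for low/trail) are those defined by the Unicode standard.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_well_formed(units):
    """True if every high surrogate is immediately followed by a low one
    and no low surrogate appears on its own."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:                      # high (lead) surrogate
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:                    # stray low (trail) surrogate
            return False
        else:
            i += 1
    return True


units = utf16_units("\u4444:\U00014445")
print([hex(u) for u in units])          # ['0x4444', '0x3a', '0xd811', '0xdc45']
print(is_well_formed(units))            # True
print(is_well_formed(units[::-1]))      # False: reversing split the pair
```

Reversing the code units puts the low surrogate 0xdc45 ahead of the high surrogate 0xd811, which is the ill-formed sequence shown in the Python 2.7 session.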
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-06 20:54 +1100 |
| Message-ID | <mailman.99.1425635649.21433.python-list@python.org> |
| In reply to | #87001 |
On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody <rustompmody@gmail.com> wrote:
>> Broken systems can be shown up by anything. Suppose you have a program
>> that breaks when it gets a NUL character (not unknown in C code); is
>> the fault with the Unicode consortium for allocating something at
>> codepoint 0, or the code that can't cope with a perfectly normal
>> character?
>
> Strawman.

Not really, no. I know of lots of programs that can't handle embedded
NULs, and which fail in various ways when given them (the most common
is simple truncation, but it's by far not the only way). And it's
exactly the same: a program that purports to handle arbitrary Unicode
text should be able to handle arbitrary Unicode text, not "Unicode
text as long as it contains only codepoints within the range X-Y". It
doesn't matter whether the code chokes on U+0000, U+005C, U+FFFC, or
U+1F4A3 - if your code blows up, it's a failure in your code.

> Lets please stick to UTF-16 shall we?
>
> Now tell me:
> - Is it broken or not?
> - Is it widely used or not?
> - Should programmers be careful of it or not?
> - Should programmers be warned about it or not?

No, UTF-16 is not itself broken. (It would be if we expected
codepoints >0x10FFFF, and it's because of UTF-16 that that's the cap
on Unicode, but it's looking unlikely that we'll be needing any more
than that anyway.) What's broken is code that tries to treat UTF-16 as
if it's UCS-2, and then breaks on surrogate pairs.

Yes, it's widely used. Programmers should probably be warned about it,
but only because its tradeoffs are generally poorer than UTF-8's. If
you use it correctly, there's no problem.

> Also:
> Can a programmer who is away from UTF-16 in one part of the system (say by using python3)
> assume he is safe all over?

I don't know what you mean here. Do you mean that your Python 3
program is "at risk" in some way because there might be some other
program that misuses UTF-16? Well, sure. And there might be some other
program that misuses buffer sizes, SQL queries, or shell invocations,
and makes your overall system vulnerable to buffer overruns or
injection attacks. These are significantly more likely AND more
serious than UTF-16 misuses. And you still have not proven anything
about SMP characters being a problem, but only that code can be
broken. Broken code is still broken code, no matter what your actual
brokenness.

ChrisA
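To make the surrogate-pair mechanism and the 0x10FFFF cap mentioned in this post concrete, here is a small sketch of the standard UTF-16 arithmetic. The function name `to_surrogates` is made up for the example; the constants are the ones fixed by the UTF-16 encoding form.

```python
def to_surrogates(code_point):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


print([hex(u) for u in to_surrogates(0x1F4A3)])    # ['0xd83d', '0xdca3']

# The largest pair, (0xDBFF, 0xDFFF), encodes U+10FFFF: with 10 bits in each
# half there is no room for anything higher.
print([hex(u) for u in to_surrogates(0x10FFFF)])   # ['0xdbff', '0xdfff']
```

Code that assumes one 16-bit unit per character sees two "characters" for U+1F4A3, which is the UCS-2 mistake the post refers to.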
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2015-03-06 02:07 -0800 |
| Message-ID | <dd0a2f6c-61f7-4d6f-a42c-d9e6940f5a7d@googlegroups.com> |
| In reply to | #87004 |
On Friday, March 6, 2015 at 3:24:48 PM UTC+5:30, Chris Angelico wrote:
> On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody wrote:
> >> Broken systems can be shown up by anything. Suppose you have a program
> >> that breaks when it gets a NUL character (not unknown in C code); is
> >> the fault with the Unicode consortium for allocating something at
> >> codepoint 0, or the code that can't cope with a perfectly normal
> >> character?
> >
> > Strawman.
>
> Not really, no. I know of lots of programs that can't handle embedded
> NULs, and which fail in various ways when given them (the most common
> is simple truncation, but it's by far not the only way).

Ah well, if you insist on pursuing the nul-char example...
No, the unicode consortium (or ASCII equivalent) is not wrong in allocating codepoint 0.
Nor is the code that "can't cope with a perfectly normal character".
The fault lies with C, for having a data structure called string with a 'hole' in it.

> And it's
> exactly the same: a program that purports to handle arbitrary Unicode
> text should be able to handle arbitrary Unicode text, not "Unicode
> text as long as it contains only codepoints within the range X-Y". It
> doesn't matter whether the code chokes on U+0000, U+005C, U+FFFC, or
> U+1F4A3 - if your code blows up, it's a failure in your code.
>
> > Lets please stick to UTF-16 shall we?
> >
> > Now tell me:
> > - Is it broken or not?
> > - Is it widely used or not?
> > - Should programmers be careful of it or not?
> > - Should programmers be warned about it or not?
>
> No, UTF-16 is not itself broken. (It would be if we expected
> codepoints >0x10FFFF, and it's because of UTF-16 that that's the cap
> on Unicode, but it's looking unlikely that we'll be needing any more
> than that anyway.) What's broken is code that tries to treat UTF-16 as
> if it's UCS-2, and then breaks on surrogate pairs.
>
> Yes, it's widely used. Programmers should probably be warned about it,
> but only because its tradeoffs are generally poorer than UTF-8's. If
> you use it correctly, there's no problem.
>
> > Also:
> > Can a programmer who is away from UTF-16 in one part of the system (say by using python3)
> > assume he is safe all over?
>
> I don't know what you mean here. Do you mean that your Python 3
> program is "at risk" in some way because there might be some other
> program that misuses UTF-16?

Yes, some other program/library/API etc. connected to the python one.

> Well, sure. And there might be some other
> program that misuses buffer sizes, SQL queries, or shell invocations,
> and makes your overall system vulnerable to buffer overruns or
> injection attacks. These are significantly more likely AND more
> serious than UTF-16 misuses. And you still have not proven anything
> about SMP characters being a problem, but only that code can be
> broken. Broken code is still broken code, no matter what your actual
> brokenness.

Roy Smith (and many other links I've cited) prove exactly that - an SMP character
broke the code.

Note: I have no objection to people supporting full unicode 7.
I'm just saying it may be significantly harder than just "Use python3 and you are done".
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-03-07 01:50 +1100 |
| Message-ID | <54f9bea1$0$12994$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #87001 |
Rustom Mody wrote:

> On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:

[snip example of an analogous situation with NULs]

> Strawman.

Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
they really should say is "Yes, that's a good argument, I'm afraid I can't
argue against it, at least not without considerable thought", I'd be a
wealthy man...

> Lets please stick to UTF-16 shall we?
>
> Now tell me:
> - Is it broken or not?

The UTF-16 standard is not broken. It is a perfectly adequate variable-width
encoding, and considerably better than most other variable-width encodings.

However, many implementations of UTF-16 are faulty, and assume a
fixed-width. *That* is broken, not UTF-16.

(The difference between specification and implementation is critical.)

> - Is it widely used or not?

It's quite widely used.

> - Should programmers be careful of it or not?

Programmers should be aware whether or not any specific language uses UTF-16
and whether the implementation is buggy. That will help them decide whether
or not to use that language.

> - Should programmers be warned about it or not?

I'm in favour of people having more knowledge rather than less. I don't
believe that ignorance is bliss, except perhaps in the case that a giant
asteroid the size of Texas is heading straight for us.

Programmers should be aware of the limitations or bugs in any UTF-16
implementation they are likely to run into. Hence my general
recommendation:

- For transmission over networks or storage on permanent media (e.g. the
content of text files), use UTF-8. It is well-implemented by nearly all
languages that support Unicode, as far as I know.

- If you are designing your own language, your implementation of Unicode
strings should use something like Python's FSR, or UTF-8 with tweaks to
make string indexing O(1) rather than O(N), or correctly-implemented
UTF-16, or even UTF-32 if you have the memory. (Choices, choices.) If, in
2015, you design your Unicode implementation as if UTF-16 is a fixed 2-byte
per code point format, you fail.

- If you are using an existing language, be aware of any bugs and
limitations in its Unicode implementation. You may or may not be able to
work around them, but at least you can decide whether or not you wish to
try.

- If you are writing your own file system layer, it's 2015 fer fecks sake,
file names should be Unicode strings, not bytes! (That's one part of the
Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
system, whichever you please, but again remember that both are
variable-width formats.

--
Steven
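For readers unfamiliar with the FSR (the flexible string representation of CPython 3.3 and later), the sketch below estimates how many bytes each character costs in strings of different "widths". It relies on `sys.getsizeof`, so the numbers are a CPython implementation detail rather than a language guarantee, and the helper name `bytes_per_char` is invented for this example.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate the storage cost of one extra copy of ch in a string
    by comparing a long string with a one-character string."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) / n


ascii_cost = bytes_per_char("a")            # ASCII only
bmp_cost = bytes_per_char("\u4444")         # needs 16 bits per code point
smp_cost = bytes_per_char("\U0001F4A9")     # needs more than 16 bits

# On CPython 3.3+ these come out at roughly 1, 2 and 4 bytes respectively.
print(ascii_cost, bmp_cost, smp_cost)
```

Whatever the storage width, `len()` and indexing operate on code points, which is why the surrogate problems discussed in this thread do not arise inside such a string.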
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-07 02:27 +1100 |
| Message-ID | <mailman.114.1425655645.21433.python-list@python.org> |
| In reply to | #87026 |
On Sat, Mar 7, 2015 at 1:50 AM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> Rustom Mody wrote:
>
>> On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
>
> [snip example of an analogous situation with NULs]
>
>> Strawman.
>
> Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
> they really should say is "Yes, that's a good argument, I'm afraid I can't
> argue against it, at least not without considerable thought", I'd be a
> wealthy man...

If I had a dollar for every time anyone said "If I had <insert
currency unit here> for every time...", I'd go meta all day long and
profit from it... :)

> - If you are writing your own file system layer, it's 2015 fer fecks sake,
> file names should be Unicode strings, not bytes! (That's one part of the
> Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
> system, whichever you please, but again remember that both are
> variable-width formats.

I agree that that part of the Unix model needs to change, but there
are two viable ways to move forward:

1) Keep file names as bytes, but mandate that they be valid UTF-8
streams, and recommend that they be decoded UTF-8 for display to a
human
2) Change the entire protocol stack from the file system upwards so
that file names become Unicode strings.

Trouble with #2 is that file names need to be passed around somehow,
which means bytes in memory. So ultimately, #2 really means "keep file
names as bytes, and mandate an encoding all the way up the stack"...
so it's a massive documentation change that really comes down to the
same thing as #1.

This is one area where, as I understand it, Mac OS got it right. It's
time for other Unix variants to adopt the same policy. The bulk of
file names will be ASCII-only anyway, so requiring UTF-8 won't affect
them; a lot of others are already UTF-8; so all we need is a
transition scheme for the remaining ones. If there's a known FS
encoding, it ought to be possible to have a file system conversion
tool that goes through everything, decodes, re-encodes UTF-8, and then
flags the file system as UTF-8 compliant. All that'd be left would be
the file names that are broken already - ones that don't decode in the
FS encoding - and there's nothing to be done with them but wrap them
up into something probably-meaningless-but-reversible.

When can we start doing this? ext5?

ChrisA
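Python 3 already offers one "probably-meaningless-but-reversible" wrapping for undecodable file name bytes: the `surrogateescape` error handler, which `os.fsdecode` and `os.fsencode` use on POSIX systems. The sketch below shows the round trip and, separately, the decode-then-re-encode step of a conversion tool. The function `convert_name` and the sample file name are hypothetical, written only for this illustration.

```python
def convert_name(raw, fs_encoding):
    """Re-encode a file name from a known FS encoding to UTF-8.
    Returns (new_bytes, True) on success or (raw, False) if raw does not
    decode in fs_encoding and has to be left for special handling."""
    try:
        return raw.decode(fs_encoding).encode("utf-8"), True
    except UnicodeDecodeError:
        return raw, False


raw = b"caf\xe9.txt"                      # Latin-1 bytes, not valid UTF-8

# Known FS encoding: a clean conversion.
print(convert_name(raw, "latin-1"))       # (b'caf\xc3\xa9.txt', True)

# No usable encoding: wrap the bad byte reversibly instead.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))                        # 'caf\udce9.txt'
print(text.encode("utf-8", errors="surrogateescape") == raw)   # True
```

The escaped string is not valid text for display or interchange, but it survives a round trip back to the original bytes, which is all the wrapping needs to do.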
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-03-06 07:37 -0800 |
| Message-ID | <bc230953-27ed-4d10-a509-32d3aa1eced9@googlegroups.com> |
| In reply to | #87026 |
On Friday, 6 March 2015 15:50:22 UTC+1, Steven D'Aprano wrote:
> Rustom Mody wrote:
>
> > On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
>
> [snip example of an analogous situation with NULs]
>
> > Strawman.
>
> Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
> they really should say is "Yes, that's a good argument, I'm afraid I can't
> argue against it, at least not without considerable thought", I'd be a
> wealthy man...
>
>
> > Lets please stick to UTF-16 shall we?
> >
> > Now tell me:
> > - Is it broken or not?
>
> The UTF-16 standard is not broken. It is a perfectly adequate variable-width
> encoding, and considerably better than most other variable-width encodings.
>
> However, many implementations of UTF-16 are faulty, and assume a
> fixed-width. *That* is broken, not UTF-16.
>
> (The difference between specification and implementation is critical.)
>
>
> > - Is it widely used or not?
>
> It's quite widely used.
>
>
> > - Should programmers be careful of it or not?
>
> Programmers should be aware whether or not any specific language uses UTF-16
> and whether the implementation is buggy. That will help them decide whether
> or not to use that language.
>
>
> > - Should programmers be warned about it or not?
>
> I'm in favour of people having more knowledge rather than less. I don't
> believe that ignorance is bliss, except perhaps in the case that a giant
> asteroid the size of Texas is heading straight for us.
>
> Programmers should be aware of the limitations or bugs in any UTF-16
> implementation they are likely to run into. Hence my general
> recommendation:
>
> - For transmission over networks or storage on permanent media (e.g. the
> content of text files), use UTF-8. It is well-implemented by nearly all
> languages that support Unicode, as far as I know.
>
> - If you are designing your own language, your implementation of Unicode
> strings should use something like Python's FSR, or UTF-8 with tweaks to
> make string indexing O(1) rather than O(N), or correctly-implemented
> UTF-16, or even UTF-32 if you have the memory. (Choices, choices.) If, in
> 2015, you design your Unicode implementation as if UTF-16 is a fixed 2-byte
> per code point format, you fail.
>
> - If you are using an existing language, be aware of any bugs and
> limitations in its Unicode implementation. You may or may not be able to
> work around them, but at least you can decide whether or not you wish to
> try.
>
> - If you are writing your own file system layer, it's 2015 fer fecks sake,
> file names should be Unicode strings, not bytes! (That's one part of the
> Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
> system, whichever you please, but again remember that both are
> variable-width formats.
>
>
>
> --
> Steven

===========

Sorry, but it's time to learn and to understand UNICODE.
(It is not so complicated.)

jmf
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2015-03-06 08:20 -0800 |
| Message-ID | <bb37d542-096f-46f0-9f4e-7cd9230ee2a0@googlegroups.com> |
| In reply to | #87026 |
On Friday, March 6, 2015 at 8:20:22 PM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
>
> > On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
>
> [snip example of an analogous situation with NULs]
>
> > Strawman.
>
> Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
> they really should say is "Yes, that's a good argument, I'm afraid I can't
> argue against it, at least not without considerable thought", I'd be a
> wealthy man...
Missed my addition? Here it is again – grammar slightly corrected.
===========
Ah well if you insist on pursuing the nul-char example...
- No, the unicode consortium (or ASCII equivalent) is not wrong in allocating codepoint 0
- No, the code that "can't cope with a perfectly normal character" is not wrong
- It is C that is wrong for designing a buggy string data structure that cannot
contain a valid char.
===========
In fact Chris's nul-char example supports my argument (the bugginess of UTF-16)
so strongly that it is perhaps too strong even for me.
To elaborate:
Take the buggy-plane analogy I gave in
http://blog.languager.org/2015/03/whimsical-unicode.html
If a plane model crashes once in 10,000 flights compared to others that crash once in
one million flights we can call it bug-prone though not strictly buggy – it does fly
9999 times safely!
OTOH if a plane is guaranteed to crash we can call it a buggy plane.
C's string is not bug-prone, it's plain buggy, as it cannot represent strings
with nulls.
I would not go that far for UTF-16.
It is bug-inviting but it can also be implemented correctly.
>
>
> > Lets please stick to UTF-16 shall we?
> >
> > Now tell me:
> > - Is it broken or not?
>
> The UTF-16 standard is not broken. It is a perfectly adequate variable-width
> encoding, and considerably better than most other variable-width encodings.
>
> However, many implementations of UTF-16 are faulty, and assume a
> fixed-width. *That* is broken, not UTF-16.
>
> (The difference between specification and implementation is critical.)
>
>
> > - Is it widely used or not?
>
> It's quite widely used.
>
>
> > - Should programmers be careful of it or not?
>
> Programmers should be aware whether or not any specific language uses UTF-16
> and whether the implementation is buggy. That will help them decide whether
> or not to use that language.
>
>
> > - Should programmers be warned about it or not?
>
> I'm in favour of people having more knowledge rather than less. I don't
> believe that ignorance is bliss, except perhaps in the case that a giant
> asteroid the size of Texas is heading straight for us.
>
> Programmers should be aware of the limitations or bugs in any UTF-16
> implementation they are likely to run into. Hence my general
> recommendation:
>
> - For transmission over networks or storage on permanent media (e.g. the
> content of text files), use UTF-8. It is well-implemented by nearly all
> languages that support Unicode, as far as I know.
>
> - If you are designing your own language, your implementation of Unicode
> strings should use something like Python's FSR, or UTF-8 with tweaks to
> make string indexing O(1) rather than O(N), or correctly-implemented
> UTF-16, or even UTF-32 if you have the memory. (Choices, choices.)
FSR is possible in python for very specific pythonic reasons
- dynamicness
- immutable strings
Drop either and FSR is impossible
> If, in 2015, you design your Unicode implementation as if UTF-16 is a fixed
> 2-byte per code point format, you fail.
Seems obvious enough.
So let's see...
Here's a 2-line python program -- it runs well enough when run as a command.
Program:
=========
pp = "💩"
print (pp)
=========
Try to open it in idle3 and you get (at least I get):
$ idle3 ff.py
Traceback (most recent call last):
  File "/usr/bin/idle3", line 5, in <module>
    main()
  File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in main
    if flist.open(filename) is None:
  File "/usr/lib/python3.4/idlelib/FileList.py", line 36, in open
    edit = self.EditorWindow(self, filename, key)
  File "/usr/lib/python3.4/idlelib/PyShell.py", line 126, in __init__
    EditorWindow.__init__(self, *args)
  File "/usr/lib/python3.4/idlelib/EditorWindow.py", line 294, in __init__
    if io.loadfile(filename):
  File "/usr/lib/python3.4/idlelib/IOBinding.py", line 236, in loadfile
    self.text.insert("1.0", chars)
  File "/usr/lib/python3.4/idlelib/Percolator.py", line 25, in insert
    self.top.insert(index, chars, tags)
  File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 81, in insert
    self.addcmd(InsertCommand(index, chars, tags))
  File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 116, in addcmd
    cmd.do(self.delegate)
  File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 219, in do
    text.insert(self.index1, self.chars, self.tags)
  File "/usr/lib/python3.4/idlelib/ColorDelegator.py", line 82, in insert
    self.delegate.insert(index, chars, tags)
  File "/usr/lib/python3.4/idlelib/WidgetRedirector.py", line 148, in __call__
    return self.tk_call(self.orig_and_operation + args)
_tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl
So who/what is broken?
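When text has to be handed to a component that accepts only BMP characters, as the Tcl error above reports, a caller can at least detect the problem up front. The following is a sketch of such a guard; both function names are hypothetical, and substituting U+FFFD is just one possible policy, not something proposed in the thread.

```python
def non_bmp_chars(text):
    """Return (index, character) pairs for code points above U+FFFF."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


def replace_non_bmp(text, replacement="\uFFFD"):
    """Swap every non-BMP character for a BMP placeholder."""
    return "".join(ch if ord(ch) <= 0xFFFF else replacement for ch in text)


source = 'pp = "\U0001F4A9"'
print(non_bmp_chars(source) == [(6, "\U0001F4A9")])   # True
print(ascii(replace_non_bmp(source)))                  # 'pp = "\ufffd"'
```

This loses information, so it works around the limitation of the receiving component rather than fixing it.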
>
> - If you are using an existing language, be aware of any bugs and
> limitations in its Unicode implementation. You may or may not be able to
> work around them, but at least you can decide whether or not you wish to
> try.
>
> - If you are writing your own file system layer, it's 2015 fer fecks sake,
> file names should be Unicode strings, not bytes! (That's one part of the
> Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
> system, whichever you please, but again remember that both are
> variable-width formats.
Correct.
Windows is broken for using UTF-16.
Linux is broken for conflating UTF-8 and byte strings.
Lot of breakage out here, don't you think?
Maybe related to the equation
UTF-16 = UCS-2 + Duct-tape
??
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-07 03:45 +1100 |
| Message-ID | <mailman.120.1425660339.21433.python-list@python.org> |
| In reply to | #87032 |
On Sat, Mar 7, 2015 at 3:20 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> C's string is not bug-prone its plain buggy as it cannot represent strings
> with nulls.
>
> I would not go that far for UTF-16.
> It is bug-inviting but it can also be implemented correctly

C's standard library string handling functions are restricted in that they handle a 255-byte alphabet. They do not handle Unicode, they do not handle NUL, that is simply how they are. But I never said I was talking about the C standard library. If you type a text string into a GUI entry field, or encode it quoted-printable and pass it to a web server, or whatever, you shouldn't know or care about what language the program is written in; and if that program barfs on a NUL, that's a limitation. That limitation might be caused by its naive use of strcpy() when it should have used memcpy(), but that's not your problem.

It's exactly the same here: if your program chokes on an SMP character, I don't care what your program was written in or what library functions your program called on. All I care is that your program - repeated for emphasis, *your* program - failed on that input. It's up to you to choose your underlying functions appropriately.

>> - If you are designing your own language, your implementation of Unicode
>> strings should use something like Python's FSR, or UTF-8 with tweaks to
>> make string indexing O(1) rather than O(N), or correctly-implemented
>> UTF-16, or even UTF-32 if you have the memory. (Choices, choices.)
>
> FSR is possible in python for very specific pythonic reasons
> - dynamicness
> - immutable strings
>
> Drop either and FSR is impossible

I don't know what you mean by "dynamicness". What you do need is a Unicode string type, such that the application program isn't aware of the underlying bytes, but simply treats this string as a sequence of code points.

The immutability isn't technically a requirement, but it does make the FSR much more manageable; in a language with mutable strings, it's probably more efficient to use UTF-32 for simplicity, but it's up to the language designer to figure that out. (It might be best to use something like the FSR, but where strings are never narrowed after being widened, so it'd be possible for an ASCII-only string to be stored UTF-32. That has consequences for comparisons, but might give a reasonable hybrid of storage and mutation performance.)

> _tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl
>
> So who/what is broken?

The exception is pretty clear on that point. Tcl can't handle SMP characters. So it's Tcl that's broken. Unless there's evidence to the contrary, that's what I would expect to be the case.

> Correct.
> Windows is broken for using UTF-16
> Linux is broken for conflating UTF-8 and byte string.
>
> Lot of breakage out here dont you think?
> May be related to the equation
>
> UTF-16 = UCS-2 + Duct-tape

UTF-16 is an encoding that was designed to be backward-compatible with UCS-2, just as UTF-8 was designed to be compatible with ASCII. Call it what you will, but backward compatibility is pretty important. Look at things like DES3 - if you use the same key three times, it's compatible with DES.

Linux isn't "broken" for conflating UTF-8 and byte strings. Linux is flawed in that it defines file names to be byte strings, which means that every file system could be different in what it actually uses as the encoding. Since file names exist for the benefit of humans, they should be treated as text, so we should work with them as text. But for reasons of backward compatibility, Linux hasn't yet changed.

Windows isn't broken for using UTF-16. I think it's a poor trade-off, given that so many file names are ASCII-only; and, of course, if any program treats a Windows file name as UCS-2, then that program is broken. But UTF-16 is not itself broken, any more than UTF-7 is. And UTF-7 is a lot harder to work with.

ChrisA
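To make the FSR idea concrete, here is a rough model of the width choice, assuming Python 3.4 or later. It is a simplification: the function name `fsr_width` is hypothetical, and CPython's real layout (PEP 393) has more detail than this, such as a separate compact form for pure-ASCII strings. The point is only that the widest code point in a string decides the storage width for the whole string.

```python
def fsr_width(text):
    """Bytes per code point that a PEP 393 style layout would pick."""
    highest = max(map(ord, text), default=0)  # empty string counts as narrow
    if highest < 0x100:
        return 1      # everything fits in Latin-1
    if highest < 0x10000:
        return 2      # everything fits in the BMP
    return 4          # at least one astral code point


for sample in ("abc", "abc\u9700", "abc\U0001F4A9"):
    print(ascii(sample), fsr_width(sample))

# Concatenation builds a new string, so the result is simply re-measured.
print(fsr_width("abc" + "\U0001F4A9"))
```

With immutable strings the width is fixed at creation time. A mutable string would have to be re-widened in place whenever a wider character is assigned into it, which is the cost discussed above.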
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-03-06 11:41 -0800 |
| Message-ID | <b67491eb-f4f5-49e8-9a88-d10304369822@googlegroups.com> |
| In reply to | #87032 |
Le vendredi 6 mars 2015 17:21:10 UTC+1, Rustom Mody a écrit :
> On Friday, March 6, 2015 at 8:20:22 PM UTC+5:30, Steven D'Aprano wrote:
> > Rustom Mody wrote:
> >
> > > On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
> >
> > [snip example of an analogous situation with NULs]
> >
> > > Strawman.
> >
> > Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
> > they really should say is "Yes, that's a good argument, I'm afraid I can't
> > argue against it, at least not without considerable thought", I'd be a
> > wealthy man...
>
> Missed my addition? Here it is again – grammar slightly corrected.
>
> ===========
> Ah well if you insist on pursuing the nul-char example...
> - No, the unicode consortium (or ASCII equivalent) is not wrong in allocating codepoint 0
>
> - No, the code that "can't cope with a perfectly normal character" is not wrong
>
> - It is C that is wrong for designing a buggy string data structure that cannot
> contain a valid char.
> ===========
>
> In fact Chris' nul-char example is so strongly supporting my argument – bugginess of UTF-16 –
> it is perhaps too strong even for me.
>
> To elaborate:
> Take the buggy-plane analogy I gave in
> http://blog.languager.org/2015/03/whimsical-unicode.html
>
> If a plane model crashes once in 10,000 flights compared to others that crash once in
> one million flights we can call it bug-prone though not strictly buggy – it does fly
> 9999 times safely!
> OTOH if a plane is guaranteed to crash we can call it a buggy plane.
>
> C's string is not bug-prone its plain buggy as it cannot represent strings
> with nulls.
>
> I would not go that far for UTF-16.
> It is bug-inviting but it can also be implemented correctly
> >
> >
> > > Lets please stick to UTF-16 shall we?
> > >
> > > Now tell me:
> > > - Is it broken or not?
> >
> > The UTF-16 standard is not broken. It is a perfectly adequate variable-width
> > encoding, and considerably better than most other variable-width encodings.
> >
> > However, many implementations of UTF-16 are faulty, and assume a
> > fixed-width. *That* is broken, not UTF-16.
> >
> > (The difference between specification and implementation is critical.)
> >
> >
> > > - Is it widely used or not?
> >
> > It's quite widely used.
> >
> >
> > > - Should programmers be careful of it or not?
> >
> > Programmers should be aware whether or not any specific language uses UTF-16
> > and whether the implementation is buggy. That will help them decide whether
> > or not to use that language.
> >
> >
> > > - Should programmers be warned about it or not?
> >
> > I'm in favour of people having more knowledge rather than less. I don't
> > believe that ignorance is bliss, except perhaps in the case that a giant
> > asteroid the size of Texas is heading straight for us.
> >
> > Programmers should be aware of the limitations or bugs in any UTF-16
> > implementation they are likely to run into. Hence my general
> > recommendation:
> >
> > - For transmission over networks or storage on permanent media (e.g. the
> > content of text files), use UTF-8. It is well-implemented by nearly all
> > languages that support Unicode, as far as I know.
> >
> > - If you are designing your own language, your implementation of Unicode
> > strings should use something like Python's FSR, or UTF-8 with tweaks to
> > make string indexing O(1) rather than O(N), or correctly-implemented
> > UTF-16, or even UTF-32 if you have the memory. (Choices, choices.)
>
> FSR is possible in python for very specific pythonic reasons
> - dynamicness
> - immutable strings
>
> Drop either and FSR is impossible
>
> > If, in 2015, you design your Unicode implementation as if UTF-16 is a fixed
> > 2-byte per code point format, you fail.
>
> Seems obvious enough.
> So lets see...
> Here's a 2-line python program -- runs well enough when run as a command.
> Program:
> =========
> pp = "💩"
> print (pp)
> =========
> Try open it in idle3 and you get (at least I get):
>
> $ idle3 ff.py
> Traceback (most recent call last):
> File "/usr/bin/idle3", line 5, in <module>
> main()
> File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in main
> if flist.open(filename) is None:
> File "/usr/lib/python3.4/idlelib/FileList.py", line 36, in open
> edit = self.EditorWindow(self, filename, key)
> File "/usr/lib/python3.4/idlelib/PyShell.py", line 126, in __init__
> EditorWindow.__init__(self, *args)
> File "/usr/lib/python3.4/idlelib/EditorWindow.py", line 294, in __init__
> if io.loadfile(filename):
> File "/usr/lib/python3.4/idlelib/IOBinding.py", line 236, in loadfile
> self.text.insert("1.0", chars)
> File "/usr/lib/python3.4/idlelib/Percolator.py", line 25, in insert
> self.top.insert(index, chars, tags)
> File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 81, in insert
> self.addcmd(InsertCommand(index, chars, tags))
> File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 116, in addcmd
> cmd.do(self.delegate)
> File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 219, in do
> text.insert(self.index1, self.chars, self.tags)
> File "/usr/lib/python3.4/idlelib/ColorDelegator.py", line 82, in insert
> self.delegate.insert(index, chars, tags)
> File "/usr/lib/python3.4/idlelib/WidgetRedirector.py", line 148, in __call__
> return self.tk_call(self.orig_and_operation + args)
> _tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl
>
> So who/what is broken?
>
> >
> > - If you are using an existing language, be aware of any bugs and
> > limitations in its Unicode implementation. You may or may not be able to
> > work around them, but at least you can decide whether or not you wish to
> > try.
> >
> > - If you are writing your own file system layer, it's 2015 fer fecks sake,
> > file names should be Unicode strings, not bytes! (That's one part of the
> > Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
> > system, whichever you please, but again remember that both are
> > variable-width formats.
>
> Correct.
> Windows is broken for using UTF-16
> Linux is broken for conflating UTF-8 and byte string.
>
> Lot of breakage out here dont you think?
> May be related to the equation
>
> UTF-16 = UCS-2 + Duct-tape
>
> ??
=============
1) A copy/paste of pp = ... from google group into
my Python interactive interpreter without intermediate
state.
2) Some manipulations.
3) A copy/paste from my interpreter into google group.
I hope the rendering will be correct.
Python 3.2.5 (default, May 15 2013, 23:06:03) [MSC v.1500 32 bit (Intel)] on win32
>>> eta runs etazero.py...
...etazero has been executed
>>> pp = "💩"
>>> print(pp)
💩
>>> len(pp)
2
>>> pp + pp + 'abc需' + pp
'💩💩abc需💩'
>>>
>>> # ok, nine glyphs, individually selectable.
>>>
Note:
len(pp) = 2 because of Py32. This is a deliberate
choice to keep the Py32 "behaviour" in my interpreter.
But also note:
The code point is correctly displayed with a single "glyph".
All the cut/copy/paste (e.g. Word, PDF, ...), cursor movement,
selection, caret position, text wrapping, character typing, ..., mainly
for rendering purposes, is done with my internal "artillery",
which is full Unicode.
In my other GUI applications, everything is working fine,
including string lengths, because my "artillery" works and
also handles glyphs (including diacritical signs).
Honestly, I'm not sure about bidi; however, the Hebrew I'm able
to test is working fine.
jmf
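The len(pp) = 2 result comes from a narrow build counting UTF-16 code units rather than code points. A small sketch of the two counts, assuming Python 3.3 or later (where `len` counts code points); the helper `utf16_code_units` is a made-up name that reproduces what a narrow build would have reported.

```python
def utf16_code_units(text):
    """Number of 16-bit units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


pp = "\U0001F4A9"
s = pp + pp + "abc\u9700" + pp

print(len(pp), utf16_code_units(pp))  # code points vs UTF-16 code units
print(len(s), utf16_code_units(s))
```

Neither count is a glyph count in general: combining marks and other multi-code-point sequences need separate handling on top of either representation.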