Newbie question about text encoding

Started by	pierrick.brihaye@gmail.com
First post	2015-02-24 02:49 -0800
Last post	2015-02-27 10:23 +1100
Articles	20 on this page of 158 — 19 participants

  Newbie question about text encoding pierrick.brihaye@gmail.com - 2015-02-24 02:49 -0800
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-24 22:09 +1100
    Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 06:25 -0500
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 15:55 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:03 +1100
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:06 +0100
      Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-24 08:01 -0800
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:07 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:10 +1100
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:24 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:33 +1100
    Re: Newbie question about text encoding random832@fastmail.us - 2015-02-24 10:38 -0500
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 17:20 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 03:24 +1100
    Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 12:13 -0500
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:45 +0100
      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-02-25 00:21 +0200
      Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:20 +1100
        Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-25 06:34 -0800
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:57 +0100
      Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:19 +1100
        Re: Newbie question about text encoding Marcos Almeida Azevedo <marcos.al.azevedo@gmail.com> - 2015-02-25 12:54 +0800
    Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 15:41 -0500
      Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 04:40 -0800
        Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 05:15 -0800
        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 00:24 +1100
          Re: Newbie question about text encoding Sam Raker <sam.raker@gmail.com> - 2015-02-26 08:45 -0800
            Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:08 -0800
        Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-02-26 12:02 -0500
          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:59 -0800
            Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-26 12:20 -0800
            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 09:13 +1100
            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 12:05 +1100
              Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-26 20:57 -0500
                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 16:58 +1100
                  Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 02:30 -0500
                    Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 22:54 +1100
                      Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 09:02 -0500
                      Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 01:22 +1100
                        Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:00 +0000
                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 03:12 +1100
                            Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:45 +0000
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 04:45 +1100
                                Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:13 +0000
                              Re: Newbie question about text encoding MRAB <python@mrabarnett.plus.com> - 2015-02-27 19:14 +0000
                                Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:09 +0000
                          Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 15:52 -0500
                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 08:04 +1100
                      Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 10:24 -0500
                      Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:46 +0000
                        Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:47 +0000
              Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-27 01:06 -0800
          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 11:59 -0800
          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 10:03 -0800
            Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-03 10:36 -0800
              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:45 -0800
                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 15:54 +1100
                  Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 21:05 -0800
                    Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 01:06 +1100
                      Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-05 06:59 -0800
                      Re: Newbie question about text encoding random832@fastmail.us - 2015-03-05 14:59 -0500
                        Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 09:33 +1100
                      Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-05 20:53 -0800
                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 16:20 +1100
                          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:02 -0800
                            Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:06 -0800
                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 08:33 -0500
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 00:39 +1100
                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:03 -0500
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 01:11 +1100
                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:27 -0500
                                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 03:26 +1100
                            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 20:54 +1100
                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 02:07 -0800
                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 01:50 +1100
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 02:27 +1100
                              Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 07:37 -0800
                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 08:20 -0800
                                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 03:45 +1100
                                Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:41 -0800
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:58 -0800
                                Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-07 01:11 -0500
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 23:43 -0800
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 00:55 -0800
                                    Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 01:08 -0800
                                  Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:25 -0800
                        Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 22:09 +1100
                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 22:33 +1100
                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 13:53 +0200
                            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 23:02 +1100
                            Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:07 +0000
                            Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 07:28 -0800
                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 02:40 +1100
                              Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 17:48 +0200
                                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:17 +1100
                                  Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:25 +0200
                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:41 +1100
                                      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:54 +0200
                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:58 +1100
                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:00 +1100
                                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:14 +0200
                                            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:26 +1100
                                              Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:50 +0200
                                                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:59 +1100
                                                  Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:02 +0000
                                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:13 +1100
                                                      Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:34 +0000
                                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:44 +1100
                                                        Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 19:00 +0000
                                                          Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 19:16 +0000
                                                        Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 21:01 +0200
                                    Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 16:40 +0000
                                      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:48 +0200
                                        Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 17:02 +0000
                                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:16 +0200
                                            Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 18:18 +0000
                                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:06 -0800
                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:53 +1100
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 11:03 -0800
                                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 12:45 +1100
                                  Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 09:20 +0200
                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 18:37 +1100
                                      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 10:09 +0200
                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 19:23 +1100
                                          Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-08 01:18 -0800
                                        Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:25 +1100
                                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 22:09 +0200
                                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 12:43 +1100
                                              Re: Newbie question about text encoding Ben Finney <ben+python@benfinney.id.au> - 2015-03-09 13:09 +1100
                                                Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-09 08:31 +0200
                                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 13:18 +1100
                                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-09 00:27 -0400
                                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 07:55 +1100
                                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 08:13 +1100
                                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 17:34 +1100
                                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 17:44 +1100
                                                Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 02:08 -0700
                                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 07:26 -0700
                                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-09 05:28 -0700
                                  Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 19:01 +1100
                          Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:13 +0000
                          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 23:23 -0800
                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:30 +1100
                          Re: Newbie question about text encoding Cameron Simpson <cs@zip.com.au> - 2015-03-09 13:09 +1100
                            Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-08 19:42 -0700
                  Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:16 +1100
            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 05:43 +1100
              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 18:53 -0800
            Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-03 18:30 -0500
            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 13:54 +1100
              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 14:02 +1100
              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:05 -0800
                Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:16 -0800
                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:14 +1100
                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-04 02:16 -0800
        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 04:29 +1100
          Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 10:09 +1100
            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 10:23 +1100

#86951

From	random832@fastmail.us
Date	2015-03-05 14:59 -0500
Message-ID	<mailman.63.1425585548.21433.python-list@python.org>
In reply to	#86942

On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote:
> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in
> UTF-8 and UTF-32, since that goes against the grain of the system. You
> would have to program in artificial restrictions that otherwise don't
> exist.

UTF-8 is already restricted from representing values above 0x10FFFF,
whereas UTF-8 can "naturally" represent values up to 0x1FFFFF in four
bytes, up to 0x3FFFFFF in five bytes, and 0x7FFFFFFF in six bytes. If
anything, the BMP represents a natural boundary, since it coincides with
values that can be represented in three bytes. Likewise, UTF-32 can
obviously represent values up to 0xFFFFFFFF. You're programming in
artificial restrictions either way, it's just a question of what those
restrictions are.
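
The arithmetic behind those limits can be sketched in a few lines of Python 3. This assumes the original UTF-8 design, in which an n-byte sequence has one lead byte followed by n-1 continuation bytes carrying 6 payload bits each; the helper name max_code_point is made up for illustration.

```python
def max_code_point(nbytes):
    """Largest value an nbytes-long sequence can carry in the original UTF-8 design."""
    if nbytes == 1:
        return 0x7F  # 0xxxxxxx: 7 payload bits
    # Lead byte: nbytes one-bits, a zero bit, then (7 - nbytes) payload bits.
    lead_bits = 7 - nbytes
    # Each continuation byte (10xxxxxx) adds 6 payload bits.
    payload_bits = lead_bits + 6 * (nbytes - 1)
    return (1 << payload_bits) - 1

for n in range(1, 7):
    print(n, hex(max_code_point(n)))

# Python's own encoder agrees on the three-byte / four-byte boundary.
print(len("\uffff".encode("utf-8")), len("\U00010000".encode("utf-8")))
```

Three bytes top out at 0xFFFF, which is exactly the end of the BMP, and four bytes could carry 0x1FFFFF even though Unicode stops at 0x10FFFF.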

#86959

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-03-06 09:33 +1100
Message-ID	<54f8d9c6$0$12993$c3e8da3$5496439d@news.astraweb.com>
In reply to	#86951

random832@fastmail.us wrote:

> On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote:
>> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in
>> UTF-8 and UTF-32, since that goes against the grain of the system. You
>> would have to program in artificial restrictions that otherwise don't
>> exist.
> 
> UTF-8 is already restricted from representing values above 0x10FFFF,
> whereas UTF-8 can "naturally" represent values up to 0x1FFFFF in four
> bytes, up to 0x3FFFFFF in five bytes, and 0x7FFFFFFF in six bytes. If
> anything, the BMP represents a natural boundary, since it coincides with
> values that can be represented in three bytes. Likewise, UTF-32 can
> obviously represent values up to 0xFFFFFFFF. You're programming in
> artificial restrictions either way, it's just a question of what those
> restrictions are.

Good points, but they don't greatly change my conclusion. If you are
implementing UTF-8 or UTF-32, it is no harder to deal with code points in
the SMP than those in the BMP.
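
A minimal Python 3 sketch of that claim, counting code units per character in each encoding. The sample characters and the helper name code_unit_counts are chosen here purely for illustration.

```python
samples = {
    "ASCII": "A",          # U+0041
    "BMP": "\u20ac",       # U+20AC EURO SIGN
    "SMP": "\U0001F600",   # U+1F600 GRINNING FACE
}

def code_unit_counts(ch):
    """Return the number of code units one character needs in each UTF."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit units
    }

for label, ch in samples.items():
    print(label, code_unit_counts(ch))
```

UTF-8 is already multi-unit for the BMP sample and UTF-32 is always one unit; only in the UTF-16 column does the count change exactly at the BMP/SMP boundary.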


-- 
Steven

#86986

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-03-05 20:53 -0800
Message-ID	<c6caaa76-f448-4c2f-8874-c1f2716da744@googlegroups.com>
In reply to	#86942

On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
> 
> > On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote:
> >> On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody  wrote:
> >> >
> >> > It lists some examples of software that somehow break/goof going from
> >> > BMP-only unicode to 7.0 unicode.
> >> >
> >> > IOW the suggestion is that the two-way classification
> >> > - ASCII
> >> > - Unicode
> >> >
> >> > is less useful and accurate than the 3-way
> >> >
> >> > - ASCII
> >> > - BMP
> >> > - Unicode
> >> 
> >> How is that more useful? Aside from storage optimizations (in which
> >> the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is
> >> not significantly different from the rest of Unicode.
> > 
> > Sorry... Dont understand.
> 
> Chris is suggesting that going from BMP to all of Unicode is not the hard
> part. Going from ASCII to the BMP part of Unicode is the hard part. If you
> can do that, you can go the rest of the way easily.

Depends where the going is starting from.
I specifically named Java, Javascript, Windows... among others.
Here's some quotes from the supplementary chars doc of Java
http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html

| Supplementary characters are characters in the Unicode standard whose code
| points are above U+FFFF, and which therefore cannot be described as single 
| 16-bit entities such as the char data type in the Java programming language. 
| Such characters are generally rare, but some are used, for example, as part 
| of Chinese and Japanese personal names, and so support for them is commonly 
| required for government applications in East Asian countries...

| The introduction of supplementary characters unfortunately makes the 
| character model quite a bit more complicated. 

| Unicode was originally designed as a fixed-width 16-bit character encoding. 
| The primitive data type char in the Java programming language was intended to 
| take advantage of this design by providing a simple data type that could hold 
| any character....  Version 5.0 of the J2SE is required to support version 4.0 
| of the Unicode standard, so it has to support supplementary characters. 
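
To make the "cannot be described as single 16-bit entities" part concrete, here is a rough Python 3 sketch of how a supplementary code point is split into two 16-bit surrogates in UTF-16. The function name to_surrogate_pair is hypothetical; the result is checked against Python's own utf-16-be codec.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

cp = 0x20000  # first code point of CJK Unified Ideographs Extension B
high, low = to_surrogate_pair(cp)

# Cross-check against the built-in codec.
encoded = chr(cp).encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(hex(high), hex(low))
```

Code that treats each 16-bit unit as a character sees two "characters" here, which is the kind of bug the Java document is describing.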

My conclusion: Early adopters of unicode -- Windows and Java -- were punished
for their early adoption.  You can blame the unicode consortium, you can
blame the babel of human languages, particularly that some use characters
and some only (the equivalent of) what we call words.

Or you can skip the blame-game and simply note the fact that large segments of
extant code-bases are currently in bug-prone or plain buggy state.

This includes not just bug-prone-system code such as Java and Windows but
seemingly working code such as python 3.
> 
> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8
> and UTF-32, since that goes against the grain of the system. You would have
> to program in artificial restrictions that otherwise don't exist.

Yes  UTF-8 and UTF-32 make most of the objections to unicode 7.0 irrelevant.
Large segments of the
> 
> UTF-16 is different, and that's probably why you think supporting all of
> Unicode is hard. With UTF-16, there really is an obvious distinction
> between the BMP and the SMP: that's where you jump from a single 2-byte
> unit to a pair of 2-byte units. But that distinction doesn't exist in UTF-8
> or UTF-32: 
> 
> - In UTF-8, about 99.8% of the BMP requires multiple bytes. Whether you
>   support the SMP or not doesn't change the fact that you have to deal
>   with multi-byte characters.
> 
> - In UTF-32, everything is fixed-width whether it is in the BMP or not.
> 
> In both cases, supporting the SMPs is no harder than supporting the BMP.
> It's only UTF-16 that makes the SMP seem hard.
> 
> Conclusion: faulty implementations of UTF-16 which incorrectly handle
> surrogate pairs should be replaced by non-faulty implementations, or
> changed to UTF-8 or UTF-32; incomplete Unicode implementations which assume
> that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should be
> upgraded.

Imagine for a moment a thought experiment: we are on a Java forum, not a
Python one. Please rewrite the above paragraph for that audience.
Are you addressing the vanilla java programmer? Language implementer? Designer?
The Java-funders -- earlier Sun, now Oracle?
> 
> Wrong conclusion: SMPs are unnecessary and unneeded, and we need a new
> standard that is just like obsolete Unicode version 1.
> 
> Unicode version 1 is obsolete for a reason. 16 bits is not enough for even
> existing languages, let alone all the code points and characters that are
> used in human communication.
> 
> 
> >> Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why
> >> do you keep talking about 7.0 as if it's a recent change?
> > 
> > It is 2015 as of now. 7.0 is the current standard.
> > 
> > The need for the adjective 'current' should be pondered upon.
> 
> What's your point?
> 
> The UTF encodings have not changed since they were first introduced. They
> have been stable for at least twenty years: UTF-8 has existed since 1993,
> and UTF-16 since 1996.
> 
> Since version 2.0 of Unicode in 1996, the standard has made "stability
> guarantees" that no code points will be renamed or removed. Consequently,
> there has only been one version which removed characters, version 1.1.
> Since then, new versions of the standard have only added characters, never
> moved, renamed or deleted them.
> 
> http://unicode.org/policies/stability_policy.html
> 
> Some highlights in Unicode history:
> 
> Unicode 1.0 (1991): initial version, defined 7161 code points.
> 
> In January 1993, Rob Pike and Ken Thompson announced the design and working
> implementation of the UTF-8 encoding.
> 
> 1.1 (1993): defined 34233 characters, finalised Han Unification. Removed
> some characters from the 1.0 set. This is the first and only time any code
> points have been removed.
> 
> 2.0 (1996): First version to include code points in the Supplementary
> Multilingual Planes. Defined 38950 code points. Introduced the UTF-16
> encoding.
> 
> 3.1 (2001): Defined 94205 code points, including 42711 additional Han
> ideographs, bringing the total number of CJK code points alone to 71793,
> too many to fit in 16 bits.
> 
> 2006: The People's Republic Of China mandates support for the GB-18030
> character set for all software products sold in the PRC. GB-18030 supports
> the entire Unicode range, including the SMPs. Since this date, all software
> sold in China must support the SMPs.
> 
> 6.0 (2010): The first emoji or emoticons were added to Unicode.
> 
> 7.0 (2014): 113021 code points defined in total.
> 
> 
> > In practice, standards change.
> > However if a standard changes so frequently that users have to play
> > catching cook and keep asking: "Which version?" they are justified in
> > asking "Are the standard-makers doing due diligence?"
> 
> Since Unicode has stability guarantees, and the encodings have not changed
> in twenty years and will not change in the future, this argument is bogus.
> Updating to a new version of the standard means, to a first approximation,
> merely allocating some new code points which had previously been undefined
> but are now defined.
> 
> (Code points can be flagged deprecated, but they will never be removed.)

It's not about new code points; it's about "Fits in 2 bytes" to "Does not fit in 2 bytes"

If you call that argument bogus I call you a non computer scientist.
[Essentially this is my issue with the consortium: it seems to be working like
a bunch of linguists, not computer scientists]

Here is Roy Smith's post that first started me thinking that something may
be wrong with SMP
https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ

Some details are in that post, some in earlier ones, and some from my memory.
If any details are wrong, please correct them:
- 200 million records
- Containing 4 strings with SMP characters
- System made with python and mysql. SMP works with python, breaks mysql.
  So whole system broke due to those 4 in 200,000,000 records

I know enough (or not enough) of unicode to be chary of statistical conclusions 
from the above.
My conclusion is essentially an 'existence-proof':

SMP-chars can break systems.
The breakage is costly-fied by the combination
- layman statistical assumptions
- BMP → SMP exercises different code-paths

It is necessary but not sufficient to test print "hello world" in ASCII, BMP, SMP.
You also have to write the hello world in the database -- mysql
Read it from the webform -- javascript 
etc etc
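
A sketch of that kind of end-to-end check in Python 3. The two "stores" are stand-ins invented for illustration, not real database drivers: one rejects anything outside the BMP, the other round-trips text through UTF-8.

```python
def bmp_only_store(text):
    """Stand-in for a backend that rejects anything outside the BMP."""
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError("cannot store U+%04X" % ord(ch))
    return text

def full_unicode_store(text):
    """Stand-in for a backend that round-trips text through UTF-8."""
    return text.encode("utf-8").decode("utf-8")

SAMPLES = {
    "ASCII": "hello world",
    "BMP": "h\u00e9llo w\u00f6rld \u4e16\u754c",
    "SMP": "hello world \U0001F30D",
}

def check_round_trip(store):
    """Return the labels of the samples that the store fails to round-trip."""
    failures = []
    for label, text in SAMPLES.items():
        try:
            ok = store(text) == text
        except ValueError:
            ok = False
        if not ok:
            failures.append(label)
    return failures

print(check_round_trip(full_unicode_store))
print(check_round_trip(bmp_only_store))
```

In a real system each layer (database, web form, and so on) would take the place of a store function, and the same three samples would be pushed through every one of them.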

You could also choose to do with "astral crap" (Roy's words) what we all do with
crap -- throw it out as early as possible.

#86987

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-06 16:20 +1100
Message-ID	<mailman.88.1425619223.21433.python-list@python.org>
In reply to	#86986

On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody <rustompmody@gmail.com> wrote:
> My conclusion: Early adopters of unicode -- Windows and Java -- were punished
> for their early adoption.  You can blame the unicode consortium, you can
> blame the babel of human languages, particularly that some use characters
> and some only (the equivalent of) what we call words.
>
> Or you can skip the blame-game and simply note the fact that large segments of
> extant code-bases are currently in bug-prone or plain buggy state.

For most of the 1990s, I was writing code in REXX, on OS/2. An even
earlier adopter, REXX didn't have Unicode support _at all_, but
instead had facilities for working with DBCS strings. You can't get
everything right AND be the first to produce anything. Python didn't
make Unicode strings the default until 3.0, but that's not Unicode's
fault.

> This includes not just bug-prone-system code such as Java and Windows but
> seemingly working code such as python 3.
>
> Here is Roy Smith's post that first started me thinking that something may
> be wrong with SMP
> https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ
>
> Some parts are from that post, some from earlier ones and from my memory.
> If any details are wrong, please correct them:
> - 200 million records
> - Containing 4 strings with SMP characters
> - System made with python and mysql. SMP works with python, breaks mysql.
>   So whole system broke due to those 4 in 200,000,000 records
>
> I know enough (or not enough) of unicode to be chary of statistical conclusions
> from the above.
> My conclusion is essentially an 'existence-proof':

Hang on hang on. Why are you blaming Python or SMP characters for
this? The problem here is MySQL, which doesn't adequately cope with
the full Unicode range. (Or, didn't then, or doesn't with its default
settings. I believe you can configure current versions of MySQL to
work correctly, though I haven't actually checked. PostgreSQL gets it
right, that's good enough for me.)
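
A minimal sketch of how one might spot such records before they reach a store that caps characters at three UTF-8 bytes. As far as I know that is the limit of MySQL's legacy `utf8` character set, with `utf8mb4` covering the full range. This assumes Python 3; the helper names and the sample string are made up for illustration:

```python
def astral_chars(text):
    """Return the characters of text that lie outside the BMP (above U+FFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def max_utf8_width(text):
    """Largest number of UTF-8 bytes needed by any single character of text."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


# Hypothetical sample: Latin-1, BMP and astral characters mixed together.
sample = "caf\u00e9 \u4444 \U0001F4A9"
print(len(astral_chars(sample)))   # 1
print(max_utf8_width(sample))      # 4
```

A width of 4 means the string needs a four-byte-capable column; anything at 3 or below fits in the BMP.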

> SMP-chars can break systems.
> The breakage is costly-fied by the combination
> - layman statistical assumptions
> - BMP → SMP exercises different code-paths

Broken systems can be shown up by anything. Suppose you have a program
that breaks when it gets a NUL character (not unknown in C code); is
the fault with the Unicode consortium for allocating something at
codepoint 0, or the code that can't cope with a perfectly normal
character?

> You could also choose to do with "astral crap" (Roy's words) what we all do with
> crap -- throw it out as early as possible.

There's only one character that fits that description, and that's
1F4A9. Everything else is just "astral characters", and you shouldn't
have any difficulties with them.

ChrisA


#87001

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-03-06 01:02 -0800
Message-ID	<01dd9b83-db3e-4e7d-9022-dc6af75eb570@googlegroups.com>
In reply to	#86987

On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
> On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody wrote:
> > My conclusion: Early adopters of unicode -- Windows and Java -- were punished
> > for their early adoption.  You can blame the unicode consortium, you can
> > blame the babel of human languages, particularly that some use characters
> > and some only (the equivalent of) what we call words.
> >
> > Or you can skip the blame-game and simply note the fact that large segments of
> > extant code-bases are currently in bug-prone or plain buggy state.
> 
> For most of the 1990s, I was writing code in REXX, on OS/2. An even
> earlier adopter, REXX didn't have Unicode support _at all_, but
> instead had facilities for working with DBCS strings. You can't get
> everything right AND be the first to produce anything. Python didn't
> make Unicode strings the default until 3.0, but that's not Unicode's
> fault.
> 
> > This includes not just bug-prone-system code such as Java and Windows but
> > seemingly working code such as python 3.
> >
> > Here is Roy Smith's post that first started me thinking that something may
> > be wrong with SMP
> > https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ
> >
> > Some parts are from that post, some from earlier ones and from my memory.
> > If any details are wrong, please correct them:
> > - 200 million records
> > - Containing 4 strings with SMP characters
> > - System made with python and mysql. SMP works with python, breaks mysql.
> >   So whole system broke due to those 4 in 200,000,000 records
> >
> > I know enough (or not enough) of unicode to be chary of statistical conclusions
> > from the above.
> > My conclusion is essentially an 'existence-proof':
> 
> Hang on hang on. Why are you blaming Python or SMP characters for
> this? The problem here is MySQL, which doesn't adequately cope with
> the full Unicode range. (Or, didn't then, or doesn't with its default
> settings. I believe you can configure current versions of MySQL to
> work correctly, though I haven't actually checked. PostgreSQL gets it
> right, that's good enough for me.)
> 
> > SMP-chars can break systems.
> > The breakage is costly-fied by the combination
> > - layman statistical assumptions
> > - BMP → SMP exercises different code-paths
> 
> Broken systems can be shown up by anything. Suppose you have a program
> that breaks when it gets a NUL character (not unknown in C code); is
> the fault with the Unicode consortium for allocating something at
> codepoint 0, or the code that can't cope with a perfectly normal
> character?

Strawman.

Let's please stick to UTF-16, shall we?

Now tell me:
- Is it broken or not?
- Is it widely used or not?
- Should programmers be careful of it or not?
- Should programmers be warned about it or not?


#87002

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-03-06 01:06 -0800
Message-ID	<d01a4428-d691-4620-88ba-076360366cff@googlegroups.com>
In reply to	#87001

On Friday, March 6, 2015 at 2:33:11 PM UTC+5:30, Rustom Mody wrote:
> Let's please stick to UTF-16, shall we?
> 
> Now tell me:
> - Is it broken or not?
> - Is it widely used or not?
> - Should programmers be careful of it or not?
> - Should programmers be warned about it or not?

Also:
Can a programmer who is away from UTF-16 in one part of the system (say by using python3)
assume he is safe all over?


#87020

From	random832@fastmail.us
Date	2015-03-06 08:33 -0500
Message-ID	<mailman.108.1425648784.21433.python-list@python.org>
In reply to	#87002

On Fri, Mar 6, 2015, at 04:06, Rustom Mody wrote:
> Also:
> Can a programmer who is away from UTF-16 in one part of the system (say
> by using python3)
> assume he is safe all over?

The most common failure of UTF-16 support, supposedly, is in programs
misusing the number of code units (for length or random access) as a
proxy for the number of characters.

However, when do you _really_ want the number of characters? You may
want to use it for, for example, the number of columns in a 'monospace'
font, which you've already screwed up because you haven't accounted for
double-wide characters or combining marks. Or you may want the position
that pressing an arrow key or backspace or forward-delete a number of
times will reach, which has its own rules in e.g. Indic languages (and
also fails on Latin with combining marks).
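
To make the distinction concrete, here is a small sketch (assuming Python 3, where `len()` counts code points; the function name is illustrative only) that shows how the different "lengths" of the same text diverge:

```python
def utf16_units(text):
    """Number of 16-bit code units text occupies when encoded as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


s = "\U00004444:\U00014445"           # one BMP char, a colon, one astral char
print(len(s))                          # 3 code points
print(utf16_units(s))                  # 4 UTF-16 code units
print(len(s.encode("utf-8")))          # 8 UTF-8 bytes

# Code points are not "characters" either: e + COMBINING ACUTE ACCENT
# displays as one glyph but is two code points.
accented = "e\u0301"
print(len(accented))                   # 2
```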


#87021

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-07 00:39 +1100
Message-ID	<mailman.109.1425649169.21433.python-list@python.org>
In reply to	#87002

On Sat, Mar 7, 2015 at 12:33 AM,  <random832@fastmail.us> wrote:
> However, when do you _really_ want the number of characters? You may
> want to use it for, for example, the number of columns in a 'monospace'
> font, which you've already screwed up because you haven't accounted for
> double-wide characters or combining marks. Or you may want the position
> that pressing an arrow key or backspace or forward-delete a number of
> times will reach, which has its own rules in e.g. Indic languages (and
> also fails on Latin with combining marks).

Number of code points is the most logical way to length-limit
something. If you want to allow users to set their display names but
not to make arbitrarily long ones, limiting them to X code points is
the safest way (and preferably do an NFC or NFD normalization before
counting, for consistency); this means you disallow pathological cases
where every base character has innumerable combining marks added.
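
A minimal sketch of that check, assuming Python 3 and the standard `unicodedata` module; the function name and the limit of 50 are placeholders, not a recommendation:

```python
import unicodedata


def within_limit(name, max_code_points=50):
    """True if name, after NFC normalization, has at most max_code_points code points."""
    normalized = unicodedata.normalize("NFC", name)
    return len(normalized) <= max_code_points


# Thirty decomposed e-acute sequences: 60 code points as typed, 30 after NFC.
decomposed = "e\u0301" * 30
print(len(decomposed))            # 60
print(within_limit(decomposed))   # True
```

Normalizing first means the same visible name is accepted or rejected regardless of whether the client sent it composed or decomposed.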

ChrisA


#87023

From	random832@fastmail.us
Date	2015-03-06 09:03 -0500
Message-ID	<mailman.111.1425650593.21433.python-list@python.org>
In reply to	#87002

On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote:
> Number of code points is the most logical way to length-limit
> something. If you want to allow users to set their display names but
> not to make arbitrarily long ones, limiting them to X code points is
> the safest way (and preferably do an NFC or NFD normalization before
> counting, for consistency);

Why are you length-limiting it? Storage space? Limit it in whatever
encoding they're stored in. Why are combining marks "pathological" but
surrogate characters not? Display space? Limit it by columns. If you're
going to allow a Japanese user's name to be twice as wide, you've got a
problem when you go to display it.

> this means you disallow pathological cases
> where every base character has innumerable combining marks added.

No it doesn't. If you limit it to, say, fifty, someone can still post
two base characters with twenty combining marks each. If you actually
want to disallow this, you've got to do more work. You've disallowed
some of the pathological cases, some of the time, by coincidence. And
limiting the number of UTF-8 bytes, or the number of UTF-16 code units,
will accomplish this just as well.

Now, if you intend to _silently truncate_ it to the desired length, you
certainly don't want to leave half a character in, of course. But who's
to say the base character plus first few combining marks aren't also
"half a character"? If you're _splitting_ a string, rather than merely
truncating it, you probably don't want those combining marks at the
beginning of part two.


#87024

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-07 01:11 +1100
Message-ID	<mailman.112.1425651082.21433.python-list@python.org>
In reply to	#87002

On Sat, Mar 7, 2015 at 1:03 AM,  <random832@fastmail.us> wrote:
> On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote:
>> Number of code points is the most logical way to length-limit
>> something. If you want to allow users to set their display names but
>> not to make arbitrarily long ones, limiting them to X code points is
>> the safest way (and preferably do an NFC or NFD normalization before
>> counting, for consistency);
>
> Why are you length-limiting it? Storage space? Limit it in whatever
> encoding they're stored in. Why are combining marks "pathological" but
> surrogate characters not? Display space? Limit it by columns. If you're
> going to allow a Japanese user's name to be twice as wide, you've got a
> problem when you go to display it.

To prevent people from putting three paragraphs of lipsum in and
calling it a username.

>> this means you disallow pathological cases
>> where every base character has innumerable combining marks added.
>
> No it doesn't. If you limit it to, say, fifty, someone can still post
> two base characters with twenty combining marks each. If you actually
> want to disallow this, you've got to do more work. You've disallowed
> some of the pathological cases, some of the time, by coincidence. And
> limiting the number of UTF-8 bytes, or the number of UTF-16 code units,
> will accomplish this just as well.

They can, but then they're limited to two base characters. They can't
have fifty base characters with twenty combining marks each. That's
the point.

> Now, if you intend to _silently truncate_ it to the desired length, you
> certainly don't want to leave half a character in, of course. But who's
> to say the base character plus first few combining marks aren't also
> "half a character"? If you're _splitting_ a string, rather than merely
> truncating it, you probably don't want those combining marks at the
> beginning of part two.

So you truncate to the desired length, then if the first character of
the trimmed-off section is a combining mark (based on its Unicode
character types), you keep trimming until you've removed a character
which isn't. Then, if you no longer have any content whatsoever,
reject the name. Simple.
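
One possible reading of that procedure, sketched in Python 3. It is an assumption that "combining mark" here means any character whose Unicode general category starts with "M"; the function name is made up:

```python
import unicodedata


def is_mark(ch):
    """True for characters in the Unicode general categories Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")


def truncate_name(name, limit):
    """Cut name to at most limit code points without splitting a base+marks cluster."""
    kept, dropped = name[:limit], name[limit:]
    if dropped and is_mark(dropped[0]):
        # The cut fell inside a cluster: drop its trailing marks, then its base.
        while kept and is_mark(kept[-1]):
            kept = kept[:-1]
        kept = kept[:-1]
    if not kept:
        raise ValueError("no content left after truncation")
    return kept


print(truncate_name("ae\u0301\u0302", 3))   # a
```

This handles combining marks only; it does not attempt full grapheme-cluster segmentation.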

ChrisA


#87025

From	random832@fastmail.us
Date	2015-03-06 09:27 -0500
Message-ID	<mailman.113.1425652066.21433.python-list@python.org>
In reply to	#87002

On Fri, Mar 6, 2015, at 09:11, Chris Angelico wrote:
> To prevent people from putting three paragraphs of lipsum in and
> calling it a username.

Limiting by UTF-8 bytes or UTF-16 units works just as well for that.

> So you truncate to the desired length, then if the first character of
> the trimmed-off section is a combining mark (based on its Unicode
> character types), you keep trimming until you've removed a character
> which isn't. Then, if you no longer have any content whatsoever,
> reject the name. Simple.

My entire point was that UTF-32 doesn't save you from that, so it cannot
be called a deficiency of UTF-16. My point is there are very few
problems to which "count of Unicode code points" is the only right
answer - that UTF-32 is good enough for but that are meaningfully
impacted by a naive usage of UTF-16, to the point where UTF-16 is
something you have to be "safe" from.


#87035

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-03-07 03:26 +1100
Message-ID	<54f9d51b$0$13014$c3e8da3$5496439d@news.astraweb.com>
In reply to	#87025

random832@fastmail.us wrote:

> My point is there are very few
> problems to which "count of Unicode code points" is the only right
> answer - that UTF-32 is good enough for but that are meaningfully
> impacted by a naive usage of UTF-16, to the point where UTF-16 is
> something you have to be "safe" from.

I'm not sure why you care about the "count of Unicode code points", although
that *is* a problem. Not for end-user reasons like "how long is my
password?", but because it makes your job as a programmer harder.

[steve@ando ~]$ python2.7 -c "print (len(u'\U00004444:\U00014445'))"
4
[steve@ando ~]$ python3.3 -c "print (len(u'\U00004444:\U00014445'))"
3

It's hard to reason about your code when something as fundamental as the
length of a string is implementation-dependent. (By the way, the right
answer should be 3, not 4.)

But an even more important problem is that broken-UTF-16 lets you create
invalid, impossible Unicode strings *by accident*. Naturally you can create
broken Unicode if you assemble strings of surrogates yourself, but
broken-UTF-16 means it can happen from otherwise innocuous operations like
reversing a string:

py> s = u'\U00004444:\U00014445'  # Python 2.7 narrow build
py> s[::-1]
u'\udc45\ud811:\u4444'

It's hard for me to demonstrate that the reversed string is broken because
the shell I am using does an amazingly good job of handling broken Unicode.
Even if I print it, the shell just prints missing-character glyphs instead
of crashing (fortunately for me!). But the first two code points are in
illegal order:

\udc45 is a low (trail) surrogate, and must follow a high surrogate;
\ud811 is a high (lead) surrogate, and must precede a low surrogate.

I'm not convinced you should be allowed to create Unicode strings containing
mismatched surrogates like this deliberately, but you certainly shouldn't
be able to do so by accident.
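
For readers without a narrow build to hand, the same effect can be simulated. This is a sketch assuming Python 3.3 or later, where `str` itself is safe, so the code-unit reversal is done by hand on the UTF-16 encoding; the helper names are made up:

```python
import struct


def reversed_utf16_units(text):
    """Encode text as UTF-16-LE and return its 16-bit code units in reverse order."""
    data = text.encode("utf-16-le")
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    return list(reversed(units))


def is_well_formed(units):
    """True if the sequence of 16-bit code units decodes as valid UTF-16."""
    data = struct.pack("<%dH" % len(units), *units)
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True


s = "\U00004444:\U00014445"
rev = reversed_utf16_units(s)
print([hex(u) for u in rev])                 # ['0xdc45', '0xd811', '0x3a', '0x4444']
print(is_well_formed(rev))                   # False
print(s[::-1] == "\U00014445:\U00004444")    # True: reversing code points is safe
```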

-- 
Steven


#87004

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-06 20:54 +1100
Message-ID	<mailman.99.1425635649.21433.python-list@python.org>
In reply to	#87001

On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody <rustompmody@gmail.com> wrote:
>> Broken systems can be shown up by anything. Suppose you have a program
>> that breaks when it gets a NUL character (not unknown in C code); is
>> the fault with the Unicode consortium for allocating something at
>> codepoint 0, or the code that can't cope with a perfectly normal
>> character?
>
> Strawman.

Not really, no. I know of lots of programs that can't handle embedded
NULs, and which fail in various ways when given them (the most common
is simple truncation, but it's by far not the only way). And it's
exactly the same: a program that purports to handle arbitrary Unicode
text should be able to handle arbitrary Unicode text, not "Unicode
text as long as it contains only codepoints within the range X-Y". It
doesn't matter whether the code chokes on U+0000, U+005C, U+FFFC, or
U+1F4A3 - if your code blows up, it's a failure in your code.
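
The "simple truncation" failure is easy to reproduce from Python 3 with the standard `ctypes` module, which exposes C's NUL-terminated view of a buffer. A small sketch, with a made-up payload:

```python
import ctypes

payload = b"user\x00name"                  # 9 bytes with an embedded NUL
buf = ctypes.create_string_buffer(payload)

print(len(payload))    # 9
print(buf.raw)         # b'user\x00name\x00' -- the full buffer, plus terminator
print(buf.value)       # b'user' -- what NUL-terminated string handling sees
```

The data is all still in memory; it is the C string convention that stops reading at the first NUL.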

> Let's please stick to UTF-16, shall we?
>
> Now tell me:
> - Is it broken or not?
> - Is it widely used or not?
> - Should programmers be careful of it or not?
> - Should programmers be warned about it or not?

No, UTF-16 is not itself broken. (It would be if we expected
codepoints >0x10FFFF, and it's because of UTF-16 that that's the cap
on Unicode, but it's looking unlikely that we'll be needing any more
than that anyway.) What's broken is code that tries to treat UTF-16 as
if it's UCS-2, and then breaks on surrogate pairs.

Yes, it's widely used. Programmers should probably be warned about it,
but only because its tradeoffs are generally poorer than UTF-8's. If
you use it correctly, there's no problem.
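
The surrogate-pair arithmetic, and the reason the cap is U+10FFFF, can be sketched in a few lines of Python 3 (function names are illustrative):

```python
def to_surrogates(code_point):
    """Split a code point in U+10000..U+10FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000            # 20 bits of payload
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print([hex(u) for u in to_surrogates(0x1F4A9)])    # ['0xd83d', '0xdca9']
print(hex(from_surrogates(0xDBFF, 0xDFFF)))        # 0x10ffff, the largest encodable value
```

Each surrogate carries 10 bits, so a pair covers 2**20 code points above the BMP, which ends exactly at U+10FFFF. Code that treats UTF-16 as UCS-2 skips this step and sees two "characters" instead of one.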

> Also:
> Can a programmer who is away from UTF-16 in one part of the system (say by using python3)
> assume he is safe all over?

I don't know what you mean here. Do you mean that your Python 3
program is "at risk" in some way because there might be some other
program that misuses UTF-16? Well, sure. And there might be some other
program that misuses buffer sizes, SQL queries, or shell invocations,
and makes your overall system vulnerable to buffer overruns or
injection attacks. These are significantly more likely AND more
serious than UTF-16 misuses. And you still have not proven anything
about SMP characters being a problem, but only that code can be
broken. Broken code is still broken code, no matter what your actual
brokenness.

ChrisA


#87009

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-03-06 02:07 -0800
Message-ID	<dd0a2f6c-61f7-4d6f-a42c-d9e6940f5a7d@googlegroups.com>
In reply to	#87004

On Friday, March 6, 2015 at 3:24:48 PM UTC+5:30, Chris Angelico wrote:
> On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody wrote:
> >> Broken systems can be shown up by anything. Suppose you have a program
> >> that breaks when it gets a NUL character (not unknown in C code); is
> >> the fault with the Unicode consortium for allocating something at
> >> codepoint 0, or the code that can't cope with a perfectly normal
> >> character?
> >
> > Strawman.
> 
> Not really, no. I know of lots of programs that can't handle embedded
> NULs, and which fail in various ways when given them (the most common
> is simple truncation, but it's by far not the only way).

Ah well, if you insist on pursuing the NUL-char example...
No, the Unicode consortium (or its ASCII equivalent) is not at fault for allocating codepoint 0.
Nor is the code that "can't cope with a perfectly normal character".

The fault lies with C, for having a data structure called string with a 'hole' in it.

> And it's
> exactly the same: a program that purports to handle arbitrary Unicode
> text should be able to handle arbitrary Unicode text, not "Unicode
> text as long as it contains only codepoints within the range X-Y". It
> doesn't matter whether the code chokes on U+0000, U+005C, U+FFFC, or
> U+1F4A3 - if your code blows up, it's a failure in your code.
> 
> > Let's please stick to UTF-16, shall we?
> >
> > Now tell me:
> > - Is it broken or not?
> > - Is it widely used or not?
> > - Should programmers be careful of it or not?
> > - Should programmers be warned about it or not?
> 
> No, UTF-16 is not itself broken. (It would be if we expected
> codepoints >0x10FFFF, and it's because of UTF-16 that that's the cap
> on Unicode, but it's looking unlikely that we'll be needing any more
> than that anyway.) What's broken is code that tries to treat UTF-16 as
> if it's UCS-2, and then breaks on surrogate pairs.
> 
> Yes, it's widely used. Programmers should probably be warned about it,
> but only because its tradeoffs are generally poorer than UTF-8's. If
> you use it correctly, there's no problem.
> 
> > Also:
> > Can a programmer who is away from UTF-16 in one part of the system (say by using python3)
> > assume he is safe all over?
> 
> I don't know what you mean here. Do you mean that your Python 3
> program is "at risk" in some way because there might be some other
> program that misuses UTF-16?

Yes some other program/library/API etc connected to the python one

> Well, sure. And there might be some other
> program that misuses buffer sizes, SQL queries, or shell invocations,
> and makes your overall system vulnerable to buffer overruns or
> injection attacks. These are significantly more likely AND more
> serious than UTF-16 misuses. And you still have not proven anything
> about SMP characters being a problem, but only that code can be
> broken. Broken code is still broken code, no matter what your actual
> brokenness.

Roy Smith's post (and many other links I've cited) proves exactly that: an
SMP character broke the code.

Note: I have no objection to people supporting full Unicode 7.
I'm just saying it may be significantly harder than just "Use python3 and you are done".


#87026

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-03-07 01:50 +1100
Message-ID	<54f9bea1$0$12994$c3e8da3$5496439d@news.astraweb.com>
In reply to	#87001

Rustom Mody wrote:

> On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:

[snip example of an analogous situation with NULs]

> Strawman.

Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
they really should say is "Yes, that's a good argument, I'm afraid I can't
argue against it, at least not without considerable thought", I'd be a
wealthy man...

> Let's please stick to UTF-16, shall we?
> 
> Now tell me:
> - Is it broken or not?

The UTF-16 standard is not broken. It is a perfectly adequate variable-width
encoding, and considerably better than most other variable-width encodings.

However, many implementations of UTF-16 are faulty, and assume a
fixed-width. *That* is broken, not UTF-16.

(The difference between specification and implementation is critical.)

> - Is it widely used or not?

It's quite widely used.

> - Should programmers be careful of it or not?

Programmers should be aware whether or not any specific language uses UTF-16
and whether the implementation is buggy. That will help them decide whether
or not to use that language.

> - Should programmers be warned about it or not?

I'm in favour of people having more knowledge rather than less. I don't
believe that ignorance is bliss, except perhaps in the case that a giant
asteroid the size of Texas is heading straight for us.

Programmers should be aware of the limitations or bugs in any UTF-16
implementation they are likely to run into. Hence my general
recommendation:

- For transmission over networks or storage on permanent media (e.g. the
content of text files), use UTF-8. It is well-implemented by nearly all
languages that support Unicode, as far as I know.

- If you are designing your own language, your implementation of Unicode
strings should use something like Python's FSR, or UTF-8 with tweaks to
make string indexing O(1) rather than O(N), or correctly-implemented
UTF-16, or even UTF-32 if you have the memory. (Choices, choices.) If, in
2015, you design your Unicode implementation as if UTF-16 is a fixed 2-byte
per code point format, you fail.

- If you are using an existing language, be aware of any bugs and
limitations in its Unicode implementation. You may or may not be able to
work around them, but at least you can decide whether or not you wish to
try.

- If you are writing your own file system layer, it's 2015 fer fecks sake,
file names should be Unicode strings, not bytes! (That's one part of the
Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
system, whichever you please, but again remember that both are
variable-width formats.
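
As a rough illustration of what "something like Python's FSR" buys you, the sketch below estimates the per-character storage of three strings. It assumes CPython 3.3 or later (PEP 393); `sys.getsizeof` figures are an implementation detail and may differ on other interpreters, and the helper name is made up:

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing strings of length 1 and n + 1."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) / n


for label, ch in [("ASCII", "a"), ("BMP", "\u4444"), ("astral", "\U00014445")]:
    # Under the FSR each string uses the narrowest fixed width that fits
    # its widest character, so indexing stays O(1).
    print(label, bytes_per_char(ch), len(ch * 3))
```

The length is 3 in every case; only the storage width changes with the widest character present.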

-- 
Steven


#87027

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-07 02:27 +1100
Message-ID	<mailman.114.1425655645.21433.python-list@python.org>
In reply to	#87026

On Sat, Mar 7, 2015 at 1:50 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Rustom Mody wrote:
>
>> On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
>
> [snip example of an analogous situation with NULs]
>
>> Strawman.
>
> Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
> they really should say is "Yes, that's a good argument, I'm afraid I can't
> argue against it, at least not without considerable thought", I'd be a
> wealthy man...

If I had a dollar for every time anyone said "If I had <insert
currency unit here> for every time...", I'd go meta all day long and
profit from it... :)

> - If you are writing your own file system layer, it's 2015 fer fecks sake,
> file names should be Unicode strings, not bytes! (That's one part of the
> Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
> system, whichever you please, but again remember that both are
> variable-width formats.

I agree that that part of the Unix model needs to change, but there
are two viable ways to move forward:

1) Keep file names as bytes, but mandate that they be valid UTF-8
streams, and recommend that they be decoded UTF-8 for display to a
human
2) Change the entire protocol stack from the file system upwards so
that file names become Unicode strings.

Trouble with #2 is that file names need to be passed around somehow,
which means bytes in memory. So ultimately, #2 really means "keep file
names as bytes, and mandate an encoding all the way up the stack"...
so it's a massive documentation change that really comes down to the
same thing as #1.

This is one area where, as I understand it, Mac OS got it right. It's
time for other Unix variants to adopt the same policy. The bulk of
file names will be ASCII-only anyway, so requiring UTF-8 won't affect
them; a lot of others are already UTF-8; so all we need is a
transition scheme for the remaining ones. If there's a known FS
encoding, it ought to be possible to have a file system conversion
tool that goes through everything, decodes, re-encodes UTF-8, and then
flags the file system as UTF-8 compliant. All that'd be left would be
the file names that are broken already - ones that don't decode in the
FS encoding - and there's nothing to be done with them but wrap them
up into something probably-meaningless-but-reversible.
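
Python 3 already ships one such reversible wrapping: the `surrogateescape` error handler from PEP 383, which maps each undecodable byte to a lone low surrogate and back. A minimal sketch, with a made-up file name:

```python
raw = b"report-\xff\xfe.txt"       # hypothetical file name; not valid UTF-8

# Undecodable bytes become lone surrogates U+DCFF and U+DCFE.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))                 # 'report-\udcff\udcfe.txt'

# The wrapping is reversible: encoding with the same handler restores the bytes.
restored = name.encode("utf-8", errors="surrogateescape")
print(restored == raw)             # True

# A strict encode refuses the wrapped name, which flags it as "broken already".
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    print("not valid Unicode text")
```

`os.fsdecode()` and `os.fsencode()` use this handler on most POSIX systems, which is how Python 3 copes with arbitrary byte file names today.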

When can we start doing this? ext5?

ChrisA


#87029

From	wxjmfauth@gmail.com
Date	2015-03-06 07:37 -0800
Message-ID	<bc230953-27ed-4d10-a509-32d3aa1eced9@googlegroups.com>
In reply to	#87026

Le vendredi 6 mars 2015 15:50:22 UTC+1, Steven D'Aprano a écrit :
> Rustom Mody wrote:
> 
> > On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
> 
> [snip example of an analogous situation with NULs]
> 
> > Strawman.
> 
> Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
> they really should say is "Yes, that's a good argument, I'm afraid I can't
> argue against it, at least not without considerable thought", I'd be a
> wealthy man...
> 
> 
> > Let's please stick to UTF-16, shall we?
> > 
> > Now tell me:
> > - Is it broken or not?
> 
> The UTF-16 standard is not broken. It is a perfectly adequate variable-width
> encoding, and considerably better than most other variable-width encodings.
> 
> However, many implementations of UTF-16 are faulty, and assume a
> fixed-width. *That* is broken, not UTF-16.
> 
> (The difference between specification and implementation is critical.)
> 
> 
> > - Is it widely used or not?
> 
> It's quite widely used.
> 
> 
> > - Should programmers be careful of it or not?
> 
> Programmers should be aware whether or not any specific language uses UTF-16
> and whether the implementation is buggy. That will help them decide whether
> or not to use that language.
> 
> 
> > - Should programmers be warned about it or not?
> 
> I'm in favour of people having more knowledge rather than less. I don't
> believe that ignorance is bliss, except perhaps in the case that a giant
> asteroid the size of Texas is heading straight for us.
> 
> Programmers should be aware of the limitations or bugs in any UTF-16
> implementation they are likely to run into. Hence my general
> recommendation:
> 
> - For transmission over networks or storage on permanent media (e.g. the
> content of text files), use UTF-8. It is well-implemented by nearly all
> languages that support Unicode, as far as I know.
> 
> - If you are designing your own language, your implementation of Unicode
> strings should use something like Python's FSR, or UTF-8 with tweaks to
> make string indexing O(1) rather than O(N), or correctly-implemented
> UTF-16, or even UTF-32 if you have the memory. (Choices, choices.) If, in
> 2015, you design your Unicode implementation as if UTF-16 is a fixed 2-byte
> per code point format, you fail.
> 
> - If you are using an existing language, be aware of any bugs and
> limitations in its Unicode implementation. You may or may not be able to
> work around them, but at least you can decide whether or not you wish to
> try.
> 
> - If you are writing your own file system layer, it's 2015 fer fecks sake,
> file names should be Unicode strings, not bytes! (That's one part of the
> Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
> system, whichever you please, but again remember that both are
> variable-width formats.
> 
> 
> 
> -- 
> Steven

===========

Sorry, but
it's time to learn and to understand UNICODE.
(It is not so complicated.)

jmf


#87032

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-03-06 08:20 -0800
Message-ID	<bb37d542-096f-46f0-9f4e-7cd9230ee2a0@googlegroups.com>
In reply to	#87026

On Friday, March 6, 2015 at 8:20:22 PM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
> 
> > On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
> 
> [snip example of an analogous situation with NULs]
> 
> > Strawman.
> 
> Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
> they really should say is "Yes, that's a good argument, I'm afraid I can't
> argue against it, at least not without considerable thought", I'd be a
> wealthy man...

Missed my addition? Here it is again –  grammar slightly corrected.

===========
Ah well if you insist on pursuing the nul-char example...
- No, the unicode consortium (or ASCII equivalent) is not wrong in allocating codepoint 0

- No, the code that "can't cope with a perfectly normal character" is not wrong

- It is C that is wrong for designing a buggy string data structure that cannot
contain a valid char.
===========

In fact Chris's NUL-char example supports my argument (the bugginess of UTF-16) so strongly
that it is perhaps too strong even for me.

To elaborate:
Take the buggy-plane analogy I gave in
http://blog.languager.org/2015/03/whimsical-unicode.html

If a plane model crashes once in 10,000 flights compared to others that crash once in
one million flights, we can call it bug-prone though not strictly buggy: it does fly
9999 times safely!
OTOH if a plane is guaranteed to crash we can call it a buggy plane.

C's string is not bug-prone, it's plain buggy, as it cannot represent strings
with NULs.

I would not go that far for UTF-16.
It is bug-inviting, but it can also be implemented correctly.
> 
> 
> > Let's please stick to UTF-16, shall we?
> > 
> > Now tell me:
> > - Is it broken or not?
> 
> The UTF-16 standard is not broken. It is a perfectly adequate variable-width
> encoding, and considerably better than most other variable-width encodings.
> 
> However, many implementations of UTF-16 are faulty, and assume a
> fixed-width. *That* is broken, not UTF-16.
> 
> (The difference between specification and implementation is critical.)
> 
> 
> > - Is it widely used or not?
> 
> It's quite widely used.
> 
> 
> > - Should programmers be careful of it or not?
> 
> Programmers should be aware whether or not any specific language uses UTF-16
> and whether the implementation is buggy. That will help them decide whether
> or not to use that language.
> 
> 
> > - Should programmers be warned about it or not?
> 
> I'm in favour of people having more knowledge rather than less. I don't
> believe that ignorance is bliss, except perhaps in the case that a giant
> asteroid the size of Texas is heading straight for us.
> 
> Programmers should be aware of the limitations or bugs in any UTF-16
> implementation they are likely to run into. Hence my general
> recommendation:
> 
> - For transmission over networks or storage on permanent media (e.g. the
> content of text files), use UTF-8. It is well-implemented by nearly all
> languages that support Unicode, as far as I know.
> 
> - If you are designing your own language, your implementation of Unicode
> strings should use something like Python's FSR, or UTF-8 with tweaks to
> make string indexing O(1) rather than O(N), or correctly-implemented
> UTF-16, or even UTF-32 if you have the memory. (Choices, choices.)

FSR is possible in Python for very specific Pythonic reasons:
- dynamicness
- immutable strings

Drop either and FSR is impossible.

> If, in 2015, you design your Unicode implementation as if UTF-16 is a fixed 
> 2-byte per code point format, you fail.

Seems obvious enough.
So let's see...
Here's a 2-line Python program; it runs well enough when run as a command.
Program:
=========
pp = "💩"
print (pp)
=========
Try to open it in idle3 and you get (at least I get):

$ idle3 ff.py 
Traceback (most recent call last):
  File "/usr/bin/idle3", line 5, in <module>
    main()
  File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in main
    if flist.open(filename) is None:
  File "/usr/lib/python3.4/idlelib/FileList.py", line 36, in open
    edit = self.EditorWindow(self, filename, key)
  File "/usr/lib/python3.4/idlelib/PyShell.py", line 126, in __init__
    EditorWindow.__init__(self, *args)
  File "/usr/lib/python3.4/idlelib/EditorWindow.py", line 294, in __init__
    if io.loadfile(filename):
  File "/usr/lib/python3.4/idlelib/IOBinding.py", line 236, in loadfile
    self.text.insert("1.0", chars)
  File "/usr/lib/python3.4/idlelib/Percolator.py", line 25, in insert
    self.top.insert(index, chars, tags)
  File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 81, in insert
    self.addcmd(InsertCommand(index, chars, tags))
  File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 116, in addcmd
    cmd.do(self.delegate)
  File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 219, in do
    text.insert(self.index1, self.chars, self.tags)
  File "/usr/lib/python3.4/idlelib/ColorDelegator.py", line 82, in insert
    self.delegate.insert(index, chars, tags)
  File "/usr/lib/python3.4/idlelib/WidgetRedirector.py", line 148, in __call__
    return self.tk_call(self.orig_and_operation + args)
_tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl

So who/what is broken?

> 
> - If you are using an existing language, be aware of any bugs and
> limitations in its Unicode implementation. You may or may not be able to
> work around them, but at least you can decide whether or not you wish to
> try.
> 
> - If you are writing your own file system layer, it's 2015 fer fecks sake,
> file names should be Unicode strings, not bytes! (That's one part of the
> Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
> system, whichever you please, but again remember that both are
> variable-width formats.

Correct.
Windows is broken for using UTF-16.
Linux is broken for conflating UTF-8 and byte strings.

Lot of breakage out here, don't you think?
Maybe related to the equation

UTF-16 = UCS-2 + Duct-tape

??


#87040

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-07 03:45 +1100
Message-ID	<mailman.120.1425660339.21433.python-list@python.org>
In reply to	#87032

On Sat, Mar 7, 2015 at 3:20 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> C's string is not bug-prone; it's plain buggy, as it cannot represent strings
> with NULs.
>
> I would not go that far for UTF-16.
> It is bug-inviting but it can also be implemented correctly

C's standard library string handling functions are restricted in that
they handle a 255-byte alphabet. They do not handle Unicode, they do
not handle NUL, that is simply how they are. But I never said I was
talking about the C standard library. If you type a text string into a
GUI entry field, or encode it quoted-printable and pass it to a web
server, or whatever, you shouldn't know or care about what language
the program is written in; and if that program barfs on a NUL, that's
a limitation. That limitation might be caused by its naive use of
strcpy() when it should have used memcpy(), but that's not your
problem.
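
A minimal sketch of that strcpy()/memcpy() difference, done from Python with ctypes rather than in C. The variable names are only illustrative; the point is that the same buffer gives different answers depending on whether it is read as a NUL-terminated string or by explicit length.

```python
import ctypes

data = b"ab\x00cd"  # five bytes with a NUL in the middle

# The buffer holds all five bytes plus a terminating NUL.
buf = ctypes.create_string_buffer(data)

# Read as a C string: stops at the first NUL, as strcpy()/strlen() would.
as_c_string = buf.value

# Read by explicit length: keeps the embedded NUL, as memcpy() would.
as_raw_bytes = buf.raw[:len(data)]

print(as_c_string)
print(as_raw_bytes)
```

The data is intact in memory either way; it is the NUL-terminated reading convention that loses the tail.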

It's exactly the same here: if your program chokes on an SMP
character, I don't care what your program was written in or what
library functions your program called on. All I care is that your
program - repeated for emphasis, *your* program - failed on that
input. It's up to you to choose your underlying functions
appropriately.

>> - If you are designing your own language, your implementation of Unicode
>> strings should use something like Python's FSR, or UTF-8 with tweaks to
>> make string indexing O(1) rather than O(N), or correctly-implemented
>> UTF-16, or even UTF-32 if you have the memory. (Choices, choices.)
>
> FSR is possible in python for very specific pythonic reasons
> - dynamicness
> - immutable strings
>
> Drop either and FSR is impossible

I don't know what you mean by "dynamicness". What you do need is a
Unicode string type, such that the application program isn't aware of
the underlying bytes, but simply treats this string as a sequence of
code points. The immutability isn't technically a requirement, but it
does make the FSR much more manageable; in a language with mutable
strings, it's probably more efficient to use UTF-32 for simplicity,
but it's up to the language designer to figure that out. (It might be
best to use something like the FSR, but where strings are never
narrowed after being widened, so it'd be possible for an ASCII-only
string to be stored as UTF-32. That has consequences for comparisons, but
might give a reasonable hybrid of storage and mutation performance.)
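
As a rough sketch of the idea behind the FSR (not CPython's actual code): the width per code point is picked from the highest code point in the string. The function name fsr_width is made up for this example, and the thresholds are the ones I understand PEP 393 to use.

```python
def fsr_width(s):
    """Bytes per code point a PEP 393-style representation would choose.

    Hypothetical helper for illustration; CPython does this in C.
    """
    highest = max(map(ord, s), default=0)
    if highest <= 0xFF:
        return 1  # fits in Latin-1
    if highest <= 0xFFFF:
        return 2  # fits in the BMP
    return 4      # needs the full range


print(fsr_width("abc"))
print(fsr_width("abc\u20ac"))
print(fsr_width("abc\U0001F4A9"))
```

With immutable strings the width is decided once, at creation. With mutable strings, a single assignment of a wider character would force the whole buffer to be re-encoded, which is the cost being weighed above.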

> _tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl
>
> So who/what is broken?

The exception is pretty clear on that point. Tcl can't handle SMP
characters. So it's Tcl that's broken. Unless there's evidence to the
contrary, that's what I would expect to be the case.
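
If you have to feed text to a Tk widget with that limit, one possible workaround is to check for non-BMP characters first and substitute something. A sketch, assuming Python 3.3+ (so iterating a str yields whole code points); both function names are invented for this example.

```python
def has_non_bmp(text):
    """True if any code point lies above U+FFFF."""
    return any(ord(ch) > 0xFFFF for ch in text)


def bmp_only(text, replacement="\ufffd"):
    """Replace every code point above U+FFFF with a placeholder."""
    return "".join(ch if ord(ch) <= 0xFFFF else replacement for ch in text)


source = 'pp = "\U0001F4A9"'
print(has_non_bmp(source))
print(ascii(bmp_only(source)))
```

This loses information, of course; it only keeps the program from falling over on the input.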

> Correct.
> Windows is broken for using UTF-16
> Linux is broken for conflating UTF-8 and byte string.
>
> Lot of breakage out here, don't you think?
> Maybe related to the equation
>
> UTF-16 = UCS-2 + Duct-tape

UTF-16 is an encoding that was designed to be backward-compatible with
UCS-2, just as UTF-8 was designed to be compatible with ASCII. Call it
what you will, but backward compatibility is pretty important. Look at
things like DES3 - if you use the same key three times, it's
compatible with DES.

Linux isn't "broken" for conflating UTF-8 and byte strings. Linux is
flawed in that it defines file names to be byte strings, which means
that every file system could be different in what it actually uses as
the encoding. Since file names exist for the benefit of humans, they
should be treated as text, so we should work with them as text. But
for reasons of backward compatibility, Linux hasn't yet changed.
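
To make the bytes-vs-text problem concrete, here is a small sketch using Python's "surrogateescape" error handler, which (as far as I know) is what os.fsdecode() and os.fsencode() rely on for file names on POSIX systems. The file name is an invented example of a name written in Latin-1 on a system that expects UTF-8.

```python
raw_name = b"caf\xe9.txt"  # "café.txt" encoded as Latin-1; not valid UTF-8

# A strict decode refuses the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the undecodable byte through as a lone surrogate...
as_text = raw_name.decode("utf-8", "surrogateescape")

# ...so that encoding again restores the original bytes exactly.
round_trip = as_text.encode("utf-8", "surrogateescape")

print(strict_ok)
print(ascii(as_text))
print(round_trip == raw_name)
```

The name survives the round trip, but as_text is not really text: it contains a lone surrogate that cannot be printed or encoded strictly.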

Windows isn't broken for using UTF-16. I think it's a poor trade-off,
given that so many file names are ASCII-only; and, of course, if any
program treats a Windows file name as UCS-2, then that program is
broken. But UTF-16 is not itself broken, any more than UTF-7 is. And
UTF-7 is a lot harder to work with.
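
For reference, the variable-width part that UCS-2-minded code ignores is small. A sketch of how one code point becomes one or two 16-bit code units; the function name is made up for this example.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units (as ints) for a single code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]            # BMP: one unit, same as UCS-2
    offset = code_point - 0x10000      # 20 bits left to split in two
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # low 10 bits -> low surrogate
    return [high, low]


print([hex(u) for u in to_utf16_units(ord("A"))])
print([hex(u) for u in to_utf16_units(0x1F4A9)])
```

Code that assumes one unit per character sees the second case as two characters, which is exactly the UCS-2 bug.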

ChrisA


#87055

From	wxjmfauth@gmail.com
Date	2015-03-06 11:41 -0800
Message-ID	<b67491eb-f4f5-49e8-9a88-d10304369822@googlegroups.com>
In reply to	#87032

On Friday, 6 March 2015 at 17:21:10 UTC+1, Rustom Mody wrote:

=============

1) A copy/paste of pp = ... from Google Groups straight into
my Python interactive interpreter, with no intermediate
step.
2) Some manipulations.
3) A copy/paste from my interpreter back into Google Groups.

I hope the rendering will be correct.

Python 3.2.5 (default, May 15 2013, 23:06:03) [MSC v.1500 32 bit (Intel)] on win32
>>> eta runs etazero.py...
...etazero has been executed
>>> pp = "💩"
>>> print(pp)
💩
>>> len(pp)
2
>>> pp + pp + 'abcéœ€' + pp
'💩💩abcéœ€💩'
>>> 
>>> # ok, nine glyphs, individually selectable.
>>> 


Note:

len(pp) = 2 because of Py32. This is a deliberate
choice to keep the Py32 "behaviour" in my interpreter.
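
A small sketch of where a length of 2 can come from, assuming it reflects a count of UTF-16 code units (as on a narrow Python 3.2 build) rather than code points. The helper name utf16_length is invented for this example; run on Python 3.3 or later it shows both numbers side by side.

```python
def utf16_length(s):
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


pp = "\U0001F4A9"

code_points = len(pp)            # what Python 3.3+ reports
code_units = utf16_length(pp)    # what a UTF-16 based len() would report

print(code_points, code_units)
```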

But also note:

The code point is correctly displayed as a single "glyph".
All the cut/copy/paste (e.g. Word, PDF, ...), cursor movement,
selection, caret position, text wrapping, character typing, ..., mainly
for rendering purposes, is done with my internal "artillery", which is
full Unicode.

In my other GUI applications everything works fine,
including string lengths, because my "artillery" works and
also handles glyphs (including diacritical signs).
Honestly, I'm not sure about bidi; however, the Hebrew I'm able
to test works fine.

jmf
