
Newbie question about text encoding

Started by	pierrick.brihaye@gmail.com
First post	2015-02-24 02:49 -0800
Last post	2015-02-27 10:23 +1100
Articles	18 on this page of 158 — 19 participants


  Newbie question about text encoding pierrick.brihaye@gmail.com - 2015-02-24 02:49 -0800
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-24 22:09 +1100
    Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 06:25 -0500
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 15:55 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:03 +1100
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:06 +0100
      Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-24 08:01 -0800
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:07 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:10 +1100
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:24 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:33 +1100
    Re: Newbie question about text encoding random832@fastmail.us - 2015-02-24 10:38 -0500
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 17:20 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 03:24 +1100
    Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 12:13 -0500
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:45 +0100
      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-02-25 00:21 +0200
      Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:20 +1100
        Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-25 06:34 -0800
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:57 +0100
      Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:19 +1100
        Re: Newbie question about text encoding Marcos Almeida Azevedo <marcos.al.azevedo@gmail.com> - 2015-02-25 12:54 +0800
    Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 15:41 -0500
      Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 04:40 -0800
        Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 05:15 -0800
        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 00:24 +1100
          Re: Newbie question about text encoding Sam Raker <sam.raker@gmail.com> - 2015-02-26 08:45 -0800
            Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:08 -0800
        Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-02-26 12:02 -0500
          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:59 -0800
            Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-26 12:20 -0800
            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 09:13 +1100
            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 12:05 +1100
              Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-26 20:57 -0500
                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 16:58 +1100
                  Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 02:30 -0500
                    Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 22:54 +1100
                      Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 09:02 -0500
                      Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 01:22 +1100
                        Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:00 +0000
                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 03:12 +1100
                            Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:45 +0000
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 04:45 +1100
                                Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:13 +0000
                              Re: Newbie question about text encoding MRAB <python@mrabarnett.plus.com> - 2015-02-27 19:14 +0000
                                Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:09 +0000
                          Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 15:52 -0500
                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 08:04 +1100
                      Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 10:24 -0500
                      Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:46 +0000
                        Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:47 +0000
              Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-27 01:06 -0800
          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 11:59 -0800
          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 10:03 -0800
            Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-03 10:36 -0800
              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:45 -0800
                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 15:54 +1100
                  Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 21:05 -0800
                    Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 01:06 +1100
                      Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-05 06:59 -0800
                      Re: Newbie question about text encoding random832@fastmail.us - 2015-03-05 14:59 -0500
                        Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 09:33 +1100
                      Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-05 20:53 -0800
                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 16:20 +1100
                          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:02 -0800
                            Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:06 -0800
                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 08:33 -0500
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 00:39 +1100
                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:03 -0500
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 01:11 +1100
                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:27 -0500
                                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 03:26 +1100
                            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 20:54 +1100
                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 02:07 -0800
                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 01:50 +1100
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 02:27 +1100
                              Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 07:37 -0800
                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 08:20 -0800
                                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 03:45 +1100
                                Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:41 -0800
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:58 -0800
                                Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-07 01:11 -0500
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 23:43 -0800
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 00:55 -0800
                                    Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 01:08 -0800
                                  Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:25 -0800
                        Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 22:09 +1100
                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 22:33 +1100
                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 13:53 +0200
                            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 23:02 +1100
                            Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:07 +0000
                            Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 07:28 -0800
                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 02:40 +1100
                              Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 17:48 +0200
                                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:17 +1100
                                  Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:25 +0200
                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:41 +1100
                                      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:54 +0200
                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:58 +1100
                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:00 +1100
                                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:14 +0200
                                            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:26 +1100
                                              Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:50 +0200
                                                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:59 +1100
                                                  Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:02 +0000
                                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:13 +1100
                                                      Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:34 +0000
                                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:44 +1100
                                                        Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 19:00 +0000
                                                          Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 19:16 +0000
                                                        Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 21:01 +0200
                                    Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 16:40 +0000
                                      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:48 +0200
                                        Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 17:02 +0000
                                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:16 +0200
                                            Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 18:18 +0000
                                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:06 -0800
                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:53 +1100
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 11:03 -0800
                                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 12:45 +1100
                                  Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 09:20 +0200
                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 18:37 +1100
                                      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 10:09 +0200
                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 19:23 +1100
                                          Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-08 01:18 -0800
                                        Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:25 +1100
                                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 22:09 +0200
                                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 12:43 +1100
                                              Re: Newbie question about text encoding Ben Finney <ben+python@benfinney.id.au> - 2015-03-09 13:09 +1100
                                                Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-09 08:31 +0200
                                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 13:18 +1100
                                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-09 00:27 -0400
                                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 07:55 +1100
                                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 08:13 +1100
                                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 17:34 +1100
                                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 17:44 +1100
                                                Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 02:08 -0700
                                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 07:26 -0700
                                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-09 05:28 -0700
                                  Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 19:01 +1100
                          Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:13 +0000
                          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 23:23 -0800
                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:30 +1100
                          Re: Newbie question about text encoding Cameron Simpson <cs@zip.com.au> - 2015-03-09 13:09 +1100
                            Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-08 19:42 -0700
                  Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:16 +1100
            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 05:43 +1100
              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 18:53 -0800
            Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-03 18:30 -0500
            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 13:54 +1100
              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 14:02 +1100
              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:05 -0800
                Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:16 -0800
                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:14 +1100
                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-04 02:16 -0800
        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 04:29 +1100
          Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 10:09 +1100
            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 10:23 +1100


#87088

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2015-03-07 14:13 +0000
Message-ID	<mailman.143.1425737633.21433.python-list@python.org>
In reply to	#87083

On 07/03/2015 11:09, Steven D'Aprano wrote:
> Rustom Mody wrote:
>
>>
>> This includes not just bug-prone-system code such as Java and Windows but
>> seemingly working code such as python 3.
>
> What Unicode bugs do you think Python 3.3 and above have?
>

Methinks somebody has been drinking too much loony juice.  Either that 
or taking too much notice of our RUE.  Not that I've done a proper 
analysis, but to my knowledge there's nothing like the number of issues 
on the bug tracker for Unicode bugs for Python 3 compared to Python 2.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence


#87134

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-03-07 23:23 -0800
Message-ID	<7cdb210c-c152-41a6-8afa-a0c0028f454e@googlegroups.com>
In reply to	#87083

On Saturday, March 7, 2015 at 4:39:48 PM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
> > This includes not just bug-prone-system code such as Java and Windows but
> > seemingly working code such as python 3.
> 
> What Unicode bugs do you think Python 3.3 and above have?

Literal/Legalistic answer:
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-2135

[And already quoted at
http://blog.languager.org/2015/03/whimsical-unicode.html
]

An answer more in the spirit of what I am trying to say:
Idle3, Roy's example and in general all systems that are
python-centric but use components outside of python that are unicode-broken

IOW I would expect people (at least people with good faith) reading my

> bug-prone-system code...seemingly working code such as python 3...

to interpret that NOT as

"python 3 is seemingly working but actually broken"

But as

"Apps made with working system code (eg python3) can end up being broken
because of other non-working system code - eg mysql, java, javascript, windows-shell, and ultimately windows, linux"


#87150

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-03-09 05:30 +1100
Message-ID	<54fc9556$0$12994$c3e8da3$5496439d@news.astraweb.com>
In reply to	#87134

Rustom Mody wrote:

> On Saturday, March 7, 2015 at 4:39:48 PM UTC+5:30, Steven D'Aprano wrote:
>> Rustom Mody wrote:
>> > This includes not just bug-prone-system code such as Java and Windows
>> > but seemingly working code such as python 3.
>> 
>> What Unicode bugs do you think Python 3.3 and above have?
> 
> Literal/Legalistic answer:
> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-2135

Nice one :-) but not exactly in the spirit of what we're discussing (as you
acknowledge below), so I won't discuss that.

> [And already quoted at
> http://blog.languager.org/2015/03/whimsical-unicode.html
> ]
> 
> An answer more in the spirit of what I am trying to say:
> Idle3, Roy's example and in general all systems that are
> python-centric but use components outside of python that are
> unicode-broken
> 
> IOW I would expect people (at least people with good faith) reading my
> 
>> bug-prone-system code...seemingly working code such as python 3...
> 
> to interpret that NOT as
> 
> "python 3 is seemingly working but actually broken"

Why not? That is the natural interpretation of the sentence, particularly in
the context of your previous sentence:

    [quote]
    Or you can skip the blame-game and simply note the fact that 
    large segments of extant code-bases are currently in bug-prone
    or plain buggy state.

    This includes not just bug-prone-system code such as Java and
    Windows but seemingly working code such as python 3.
    [end quote]

The natural interpretation of this is that Python 3 is only *seemingly*
working, but is also an example of a code base in "bug-prone or plain buggy
state".

If that's not your intended meaning, then rather than casting aspersions on
my honesty ("good faith" indeed) you might accept that perhaps you didn't
quite manage to get your message across.

> But as
> 
> "Apps made with working system code (eg python3) can end up being broken
> because of other non-working system code - eg mysql, java, javascript,
> windows-shell, and ultimately windows, linux"

Don't forget viruses or other malware, cosmic rays, processor bugs, dry
solder joints on the motherboard, faulty memory, and user-error.

I'm not sure what point you think you are making. If you want to discuss the
fact that complex systems have more interactions than simple systems, and
therefore more ways for things to go wrong, I will agree. I'll agree that
this is an issue with Python code that interacts with other systems which
may or may not implement Unicode correctly. There are a few ways to
interpret this:

(1) You're making a general point about the complexity of modern computing.

(2) You're making the point that dealing with text encodings in general, and
Unicode in particular, is hard because of the interaction of programming
language, database, file system, locale, etc.

(3) You're implying that Python ought to fix this problem somehow.

(4) You're implying that *Unicode* specifically is uniquely problematic in
this way. Or at least *unusual* to be problematic in this way.

I will agree with 1 and 2; I'll say that 3 would be nice but in the absence
of concrete proposals for how to fix it, it's just meaningless chatter. And
I'll disagree strongly with 4.

Unicode came into existence because legacy encodings suffer from similar
problems, only worse. (One major advantage of Unicode over previous
multi-byte encodings is that the UTF encodings are self-healing. A single
corrupted byte will, *at worst*, cause a single corrupted code point.)
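
A minimal sketch of the self-healing property, assuming UTF-8 and Python 3. The sample string and the helper name corrupt_byte are made up for illustration. One byte in the middle of the three-byte euro sign is overwritten, and decoding with errors="replace" still recovers everything on either side of the damaged character:

```python
def corrupt_byte(data: bytes, index: int, new_value: int) -> bytes:
    """Return a copy of data with one byte overwritten (hypothetical helper)."""
    damaged = bytearray(data)
    damaged[index] = new_value
    return bytes(damaged)


text = "price: €5 each"
encoded = text.encode("utf-8")

# The euro sign is the three-byte sequence E2 82 AC; damage its middle byte.
euro_start = encoded.index("€".encode("utf-8"))
damaged = corrupt_byte(encoded, euro_start + 1, 0x41)

# errors="replace" substitutes U+FFFD for each undecodable piece
# instead of raising UnicodeDecodeError.
decoded = damaged.decode("utf-8", errors="replace")

print(repr(decoded))
# The text on both sides of the damaged character is untouched.
print(decoded.startswith("price: "), decoded.endswith("5 each"))
```

The damaged character itself may come out as more than one replacement character, but the decoder resynchronises at the next lead byte, so the damage does not spread to its neighbours.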

In one sense, Unicode has solved these legacy encoding problems, in the
sense that if you always use a correct implementation of Unicode then you
won't *ever* suffer from problems like mojibake, broken strings and so
forth.

In another sense, Unicode hasn't solved these legacy problems because we
still have to deal with files using legacy encodings, as well as standards
organisations, operating systems, developers, applications and users who
continue to produce new content using legacy encodings, buggy or incorrect
implementations of the standard, also viruses, cosmic rays, dry solder
joints and user-error. How are these things Unicode's fault or
responsibility?
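
As a small illustration of the legacy-encoding side of this, here is a sketch (assuming Python 3; the sample word is arbitrary) of how mojibake arises when bytes written as UTF-8 are read back by something that assumes Latin-1. Nothing in it involves a bug in Unicode itself, only a mismatch between writer and reader:

```python
original = "café"

# Bytes written as UTF-8 ...
data = original.encode("utf-8")

# ... but read back by software that assumes Latin-1.
garbled = data.decode("latin-1")
print(garbled)  # cafÃ©

# Reading the same bytes with the right codec gives the text back.
print(data.decode("utf-8") == original)  # True
```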

-- 
Steven


#87168

From	Cameron Simpson <cs@zip.com.au>
Date	2015-03-09 13:09 +1100
Message-ID	<mailman.182.1425866969.21433.python-list@python.org>
In reply to	#87083

On 07Mar2015 22:09, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>Rustom Mody wrote:
>>[...big snip...]
>> Some parts are here some earlier and from my memory.
>> If details wrong please correct:
>> - 200 million records
>> - Containing 4 strings with SMP characters
>> - System made with python and mysql. SMP works with python, breaks mysql.
>>   So whole system broke due to those 4 in 200,000,000 records
>
>No, they broke because MySQL has buggy Unicode handling.
[...]
>> You could also choose do with "astral crap" (Roy's words) what we all do
>> with crap -- throw it out as early as possible.
>
>And when Roy's customers demand that his product support emoji, or complain
>that they cannot spell their own name because of his parochial and ignorant
>idea of "crap", perhaps he will consider doing what he should have done
>from the beginning:
>
>Stop using MySQL, which is a joke of a database[1], and use Postgres which
>does not have this problem.
>
>[1] So I have been told.

I use MySQL a fair bit, and Postgres very slightly. I would agree with your 
characterisation above; MySQL is littered with inconsistencies and arbitrary 
breakage, both in tools and SQL implementation. And Postgres has been a pure 
pleasure to work with, little though I have done that so far.

Cheers,
Cameron Simpson <cs@zip.com.au>

There is no human problem which could not be solved if people would simply
do as I advise. - Gore Vidal
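
For context on the SMP breakage quoted above: as far as I know, MySQL's legacy "utf8" character set stores at most three bytes per character, while any code point above U+FFFF needs four bytes in UTF-8 (utf8mb4 is the four-byte character set). A minimal sketch, assuming Python 3, of finding such records before they reach the database. The function name has_astral and the sample records are invented for illustration and are not taken from Roy's system:

```python
def has_astral(s: str) -> bool:
    """True if s contains any code point outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in s)


# Hypothetical sample data; the last two contain SMP characters.
records = ["plain ascii", "naïve café", "emoji \U0001F600", "gothic \U00010330"]

for record in records:
    # Width in bytes of the widest character once encoded as UTF-8.
    widest = max(len(ch.encode("utf-8")) for ch in record)
    print(f"{record!r}: astral={has_astral(record)}, widest char={widest} bytes")

astral_records = [r for r in records if has_astral(r)]
print(len(astral_records))  # 2
```

Whether to reject, escape or store such records is a separate decision; the sketch only shows that they are cheap to detect on the Python side.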


#87170

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-03-08 19:42 -0700
Message-ID	<bf5d739e-965a-4dbd-bd11-c322bd0dbe28@googlegroups.com>
In reply to	#87168

On Monday, March 9, 2015 at 7:39:42 AM UTC+5:30, Cameron Simpson wrote:
> On 07Mar2015 22:09, Steven D'Aprano  wrote:
> >Rustom Mody wrote:
> >>[...big snip...]
> >> Some parts are here some earlier and from my memory.
> >> If details wrong please correct:
> >> - 200 million records
> >> - Containing 4 strings with SMP characters
> >> - System made with python and mysql. SMP works with python, breaks mysql.
> >>   So whole system broke due to those 4 in 200,000,000 records
> >
> >No, they broke because MySQL has buggy Unicode handling.
> [...]
> >> You could also choose do with "astral crap" (Roy's words) what we all do
> >> with crap -- throw it out as early as possible.
> >
> >And when Roy's customers demand that his product support emoji, or complain
> >that they cannot spell their own name because of his parochial and ignorant
> >idea of "crap", perhaps he will consider doing what he should have done
> >from the beginning:
> >
> >Stop using MySQL, which is a joke of a database[1], and use Postgres which
> >does not have this problem.
> >
> >[1] So I have been told.
> 
> I use MySQL a fair bit, and Postgres very slightly. I would agree with your 
> characterisation above; MySQL is littered with inconsistencies and arbitrary 
> breakage, both in tools and SQL implementation. And Postgres has been a pure 
> pleasure to work with, little though I have done that so far.
> 
> Cheers,
> Cameron Simpson
> 
> There is no human problem which could not be solved if people would simply
> do as I advise. - Gore Vidal

I think that last quote sums up the issue best.
I've written to Intel asking them to make their next generation have 21-bit wide bytes.
Once they do that we will be back in the paradise we have been for the last 40 years
which I call the 'Unix-assumption'
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

Until then...

We have to continue living in the real world.
Which includes 10 times more windows than linux users.
Is windows 10 times better an OS than linux?

In the 'real world' people make choices for all sorts of reasons. My guess is the
top reason is the pointiness of the hair of pointy-haired-boss.

Just like people choose  windows over linux, people choose mysql over postgres,
and that's the context of this discussion -- people stuck in sub-optimal choices
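
The "21-bit" figure above follows from the size of the Unicode code space: the highest code point is U+10FFFF, which needs 21 bits. A small check, assuming CPython 3.3 or later, where sys.maxunicode is always 0x10FFFF; U+1F600 is used only as a sample astral code point:

```python
import sys

print(hex(sys.maxunicode))          # 0x10ffff
print(sys.maxunicode.bit_length())  # 21

# An 8-bit byte cannot hold a code point, so every encoding spreads
# it over several bytes.
ch = "\U0001F600"
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    print(codec, len(ch.encode(codec)))  # 4 bytes in each case
```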


#86892

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-03-04 19:16 +1100
Message-ID	<54f6bf5a$0$11122$c3e8da3@news.astraweb.com>
In reply to	#86886

Chris Angelico wrote:

> Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why
> do you keep talking about 7.0 as if it's a recent change?

This is the Internet. Lack of knowledge about something doesn't prevent 
people from having opinions about it.


-- 
Steven


#86859

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-04 05:43 +1100
Message-ID	<mailman.24.1425408236.21433.python-list@python.org>
In reply to	#86856

On Wed, Mar 4, 2015 at 5:03 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> What I was trying to say expanded here
> http://blog.languager.org/2015/03/whimsical-unicode.html
> [Hope  the word 'whimsical' is less jarring and more accurate than 'gibberish']

Re footnote #4: ½ is a single character for compatibility reasons.
⅟₁₀₀ doesn't need to be a single character, because there are
countably infinite vulgar fractions and only 0x110000 Unicode
characters.
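
A small sketch of the distinction, using Python's standard unicodedata module. It assumes the sequence quoted above is U+215F followed by three subscript digits; the variable names are only illustrative.

```python
import unicodedata

half = "\u00bd"                          # one precomposed compatibility character
one_over_100 = "\u215f\u2081\u2080\u2080"  # numerator-one plus three subscript digits

# List every code point that makes up each fraction.
for label, text in (("half", half), ("1/100", one_over_100)):
    print(label, len(text), "code point(s)")
    for ch in text:
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")

# NFKC shows what the compatibility character stands for:
# digit one, FRACTION SLASH (U+2044), digit two.
decomposed = unicodedata.normalize("NFKC", half)
print([f"U+{ord(ch):04X}" for ch in decomposed])
```

Any other fraction can be built the same way from a numerator, a fraction slash or U+215F, and a denominator, so no new code points are needed for them.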

ChrisA


#86873

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-03-03 18:53 -0800
Message-ID	<18d9d5a7-dfb9-4e13-ada3-5ef97cf0543d@googlegroups.com>
In reply to	#86859

On Wednesday, March 4, 2015 at 12:14:11 AM UTC+5:30, Chris Angelico wrote:
> On Wed, Mar 4, 2015 at 5:03 AM, Rustom Mody wrote:
> > What I was trying to say expanded here
> > http://blog.languager.org/2015/03/whimsical-unicode.html
> > [Hope  the word 'whimsical' is less jarring and more accurate than 'gibberish']
> 
> Re footnote #4: ½ is a single character for compatibility reasons.
> ⅟₁₀₀ ...
  ^^^

Neat 
Thanks
[And figured out some of the quopri module along the way]


#86871

From	Terry Reedy <tjreedy@udel.edu>
Date	2015-03-03 18:30 -0500
Message-ID	<mailman.27.1425425434.21433.python-list@python.org>
In reply to	#86856

On 3/3/2015 1:03 PM, Rustom Mody wrote:
> On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:

>> You should add emoticons, but not call them or the above 'gibberish'.
>> I think that this part of your post is more 'unprofessional' than the
>> character blocks.  It is very jarring and seems contrary to your main point.
>
> Ok Done
>
> References to gibberish removed from
> http://blog.languager.org/2015/02/universal-unicode.html
>
> What I was trying to say expanded here
> http://blog.languager.org/2015/03/whimsical-unicode.html
> [Hope  the word 'whimsical' is less jarring and more accurate than 'gibberish']

I agree with both.

-- 
Terry Jan Reedy


#86874

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-03-04 13:54 +1100
Message-ID	<54f673e4$0$12980$c3e8da3$5496439d@news.astraweb.com>
In reply to	#86856

Rustom Mody wrote:

> On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
>> On 2/26/2015 8:24 AM, Chris Angelico wrote:
>> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
>> >> Wrote something up on why we should stop using ASCII:
>> >> http://blog.languager.org/2015/02/universal-unicode.html
>> 
>> I think that the main point of the post, that many Unicode chars are
>> truly planetary rather than just national/regional, is excellent.
> 
> <snipped>
> 
>> You should add emoticons, but not call them or the above 'gibberish'.
>> I think that this part of your post is more 'unprofessional' than the
>> character blocks.  It is very jarring and seems contrary to your main
>> point.
> 
> Ok Done
> 
> References to gibberish removed from
> http://blog.languager.org/2015/02/universal-unicode.html

I consider it unethical to make semantic changes to a published work in
place without acknowledgement. Fixing minor typos or spelling errors, or
dead links, is okay. But any edit that changes the meaning should be
commented on, either by an explicit note on the page itself, or by striking
out the previous content and inserting the new.

As for the content of the essay, it is currently rather unfocused. It
appears to be more of a list of "here are some Unicode characters I think
are interesting, divided into subgroups, oh and here are some I personally
don't have any use for, which makes them silly" than any sort of discussion
about the universality of Unicode. That makes it rather idiosyncratic and
parochial. Why should obscure maths symbols be given more importance than
obscure historical languages?

I think that the universality of Unicode could be explained in a single
sentence:

"It is the aim of Unicode to be the one character set anyone needs to
represent every character, ideogram or symbol (but not necessarily distinct
glyph) from any existing or historical human language."

I can expand on that, but in a nutshell that is it.

You state:

"APL and Z Notation are two notable languages APL is a programming language
and Z a specification language that did not tie themselves down to a
restricted charset ..."

but I don't think that is correct. I'm pretty sure that neither APL nor Z
allowed you to define new characters. They might not have used ASCII alone,
but they still had a restricted character set. It was merely less
restricted than ASCII.

You make a comment about Cobol's relative unpopularity, but (1) Cobol
doesn't require you to write out numbers as English words, and (2) Cobol is
still used, there are uncounted billions of lines of Cobol code being used,
and if the number of Cobol programmers is less now than it was 16 years
ago, there are still a lot of them. Academics and FOSS programmers don't
think much of Cobol, but it has to count as one of the most amazing success
stories in the field of programming languages, despite its lousy design.

You list ideographs such as Cuneiform under "Icons". They are not icons.
They are a mixture of symbols used for consonants, syllables, and
logophonetic, consonantal alphabetic and syllabic signs. That sits them
firmly in the same categories as modern languages with consonants, ideogram
languages like Chinese, and syllabary languages like Cheyenne.

Just because native readers of Cuneiform are all dead doesn't make Cuneiform
unimportant. There are probably more people who need to write Cuneiform
than people who need to write APL source code.

You make a comment:

"To me – a unicode-layman – it looks unprofessional… Billions of computing
devices world over, each having billions of storage words having their
storage wasted on blocks such as these??"

But that is nonsense, and it contradicts your earlier quoting of Dave Angel.
Why are you so worried about an (illusionary) minor optimization?

Whether code points are allocated or not doesn't affect how much space they
take up. There are millions of unused Unicode code points today. If they
are allocated tomorrow, the space your documents take up will not increase
one byte.

Allocating code points to Cuneiform has not increased the space needed by
Unicode at all. Two bytes alone is not enough for even existing human
languages (thanks China). For hardware related reasons, it is faster and
more efficient to use four bytes than three, so the obvious and "dumb" (in
the sense of the simplest thing which will work) way to store Unicode is UTF-32, which
takes a full four bytes per code point, regardless of whether there are
65537 code points or 1114112. That makes it less expensive than floating
point numbers, which take eight. Would you like to argue that floating
point doubles are "unprofessional" and wasteful?

As Dave pointed out, and you apparently agreed with him enough to quote him
TWICE (once in each of two blog posts), history of computing is full of
premature optimizations for space. (In fact, some of these may have been
justified by the technical limitations of the day.) Technically Unicode is
also limited, but it is limited to over one million code points, 1114112 to
be exact, although some of them are reserved as invalid for technical
reasons, and there is no indication that we'll ever run out of space in
Unicode.
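
The arithmetic behind that figure can be checked in a few lines. This is only a sketch; the variable names are mine, and the "reserved" code points counted here are just the surrogate range, which is one of the technical reservations mentioned above.

```python
import sys

PLANES = 17            # plane 0 (the BMP) plus 16 supplementary planes
PER_PLANE = 0x10000    # 65536 code points per plane

total = PLANES * PER_PLANE
surrogates = 0xDFFF - 0xD800 + 1   # reserved for UTF-16, never characters
scalar_values = total - surrogates

print(total, hex(total))           # 1114112 0x110000
print(surrogates, scalar_values)   # 2048 1112064
print(sys.maxunicode + 1 == total)

# A lone surrogate is not valid text, so a strict encoder refuses it.
try:
    chr(0xD800).encode("utf-8")
    surrogate_rejected = False
except UnicodeEncodeError:
    surrogate_rejected = True
print(surrogate_rejected)
```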

In practice, there are three common Unicode encodings that nearly all
Unicode documents will use.

* UTF-8 will use between one and (by memory) four bytes per code 
  point. For Western European languages, that will be mostly one 
  or two bytes per character.

* UTF-16 uses a fixed two bytes per code point in the Basic Multilingual 
  Plane, which is enough for nearly all Western European writing and 
  much East Asian writing as well. For the rest, it uses a fixed four 
  bytes per code point.

* UTF-32 uses a fixed four bytes per code point. Hardly anyone uses 
  this as a storage format.

In *all three cases*, the existence of hieroglyphs and cuneiform in Unicode
doesn't change the space used. If you actually include a few hieroglyphs in
your document, the space increases only by the actual space used by those
hieroglyphs: four bytes per hieroglyph. At no time does the existence of a
single hieroglyph in your document force you to expand the non-hieroglyph
characters to use more space.
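
To make that concrete, here is a minimal sketch that measures the same text under the three encodings, with and without one hieroglyph. The sample strings and the helper name encoded_sizes are illustrative, not from the blog post.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes for UTF-8, UTF-16 and UTF-32."""
    # The "-le" variants are used so that no byte order mark is prepended,
    # which keeps the counts to the characters alone.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

plain = "cafe au lait"              # 12 ASCII characters
with_glyph = plain + "\U00013000"   # append U+13000, an Egyptian hieroglyph

before = encoded_sizes(plain)
after = encoded_sizes(with_glyph)

for name in before:
    print(name, before[name], after[name], after[name] - before[name])
```

In each encoding the total grows by exactly four bytes, and the twelve ASCII characters keep whatever size they had before.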

> What I was trying to say expanded here
> http://blog.languager.org/2015/03/whimsical-unicode.html

You have at least two broken links, referring to a non-existent page:

http://blog.languager.org/2015/03/unicode-universal-or-whimsical.html

This essay seems to be even more rambling and unfocused than the first. What
does the cost of semi-conductor plants have to do with whether or not
programmers support Unicode in their applications?

Your point about the UTF-8 "BOM" is valid only if you interpret it as a Byte
Order Mark. But if you interpret it as an explicit UTF-8 signature or mark,
it isn't so silly. If your text begins with the UTF-8 mark, treat it as
UTF-8. It's no more silly than any other heuristic, like HTML encoding tags
or text editor's encoding cookies.
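
As a rough illustration of that heuristic: the function below treats the three-byte mark as a signature and otherwise falls back to another encoding. The function name and the latin-1 fallback are assumptions made for the example, not a recommendation.

```python
import codecs

def decode_with_signature(data, fallback="latin-1"):
    """Decode bytes as UTF-8 if they start with the UTF-8 mark.

    The fallback encoding is an arbitrary choice for this sketch; a real
    program would pick it from context (HTTP headers, locale, and so on).
    """
    if data.startswith(codecs.BOM_UTF8):
        # Strip the signature itself, then decode the rest as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode(fallback)

marked = codecs.BOM_UTF8 + "na\u00efve".encode("utf-8")
unmarked = "na\u00efve".encode("latin-1")

print(decode_with_signature(marked))
print(decode_with_signature(unmarked))
```

Python also ships a "utf-8-sig" codec that strips the mark when it is present, which covers the first branch without any manual slicing.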

Your discussion of "complexifiers and simplifiers" doesn't seem to be
terribly relevant, or at least if it is relevant, you don't give any reason
for it. The whole thing about Moore's Law and the cost of semi-conductor
plants seems irrelevant to Unicode except in the most over-generalised
sense of "things are bigger today than in the past, we've gone from
five-bit Baudot codes to 23 bit Unicode". Yeah, okay. So what's your point?

You agree that 16-bits are not enough, and yet you criticise Unicode for using
more than 16-bits on wasteful, whimsical gibberish like Cuneiform? That is
an inconsistent position to take.

UTF-16 is not half-arsed Unicode support. UTF-16 is full Unicode support.

The problem is when your language treats UTF-16 as a fixed-width two-byte
format instead of a variable-width, two- or four-byte format. (That's more
or less like the old, obsolete, UCS-2 standard.) There are all sorts of
good ways to solve the problem of surrogate pairs and the SMPs in UTF-16.
If some programming languages or software fails to do so, they are buggy,
not UTF-16.
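
For anyone who has not seen the mechanism, here is a sketch of how a code point outside the BMP becomes two 16-bit code units, and how to count code points without being fooled by them. The function names are mine, and count_code_points assumes well-formed input.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

def count_code_points(units):
    """Count code points in a well-formed sequence of UTF-16 code units."""
    # Every code point contributes exactly one unit that is not a low
    # surrogate, so skipping low surrogates gives the right count.
    return sum(1 for unit in units if not 0xDC00 <= unit <= 0xDFFF)

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

units = [0x0041, high, low]              # "A" followed by U+1F600
print(len(units), count_code_points(units))
```

The buggy behaviour described above is what you get from reporting len(units) as the length of the string.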

After explaining that 16 bits are not enough, you then propose a 16 bit
standard. /face-palm

UTF-16 cannot break the fixed width invariant, because it has no fixed width
invariant. That's like arguing against UTF-8 because it breaks the fixed
width invariant "all characters are single byte ASCII characters".

If you cannot handle SMP characters, you are not supporting Unicode.

You suggest that Chinese users should be looking at Big5 or GB. I really,
really don't think so.

- Neither is universal. What makes you think that Chinese writers need 
  to use maths symbols, or include (say) Thai or Russian in their work 
  any less than Western writers do?

- Neither even support all of Chinese. Big5 supports Traditional 
  Chinese, but not Simplified Chinese. GB supports Simplified 
  Chinese, but not Traditional Chinese. 

- Big5 likewise doesn't support placenames, many people's names, and
  other less common parts of Chinese.

- Big5 is a shift-system, like Shift-JIS, and suffers from the same sort
  of data corruption issues.

- There is no one single Big5 standard, but a whole lot of vendor 
  extensions.

You say:

"I just want to suggest that the Unicode consortium going overboard in
adding zillions of codepoints of nearly zero usefulness, is in fact
undermining unicode’s popularity and spread."

Can you demonstrate this? Can you show somebody who says "Well, I was going
to support full Unicode, but since they added a snowman, I'm going to stick
to ASCII"?

The "whimsical" characters you are complaining about were important enough
to somebody to spend significant amounts of time and money to write up a
proposal, have it go through the Unicode Consortium bureaucracy, and
eventually have it accepted. That's not easy or cheap, and people didn't
add a snowman on a whim. They did it because there are a whole lot of
people who want a shared standard for map symbols.

It is easy to mock what is not important to you. I daresay kids adding emoji
to their 10 character tweets would mock all the useless maths symbols in
Unicode too.

-- 
Steven


#86875

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-04 14:02 +1100
Message-ID	<mailman.29.1425438171.21433.python-list@python.org>
In reply to	#86874

On Wed, Mar 4, 2015 at 1:54 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> It is easy to mock what is not important to you. I daresay kids adding emoji
> to their 10 character tweets would mock all the useless maths symbols in
> Unicode too.

Definitely! Who ever sings "do you wanna build an integral sign"?

ChrisA


#86882

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-03-03 20:05 -0800
Message-ID	<601f597e-719a-4721-9620-1a7ea43de57d@googlegroups.com>
In reply to	#86874

On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
> 
> > On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
> >> On 2/26/2015 8:24 AM, Chris Angelico wrote:
> >> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
> >> >> Wrote something up on why we should stop using ASCII:
> >> >> http://blog.languager.org/2015/02/universal-unicode.html
> >> 
> >> I think that the main point of the post, that many Unicode chars are
> >> truly planetary rather than just national/regional, is excellent.
> > 
> > <snipped>
> > 
> >> You should add emoticons, but not call them or the above 'gibberish'.
> >> I think that this part of your post is more 'unprofessional' than the
> >> character blocks.  It is very jarring and seems contrary to your main
> >> point.
> > 
> > Ok Done
> > 
> > References to gibberish removed from
> > http://blog.languager.org/2015/02/universal-unicode.html
> 
> I consider it unethical to make semantic changes to a published work in
> place without acknowledgement. Fixing minor typos or spelling errors, or
> dead links, is okay. But any edit that changes the meaning should be
> commented on, either by an explicit note on the page itself, or by striking
> out the previous content and inserting the new.

Dunno what you are grumping about…

Anyway the attribution is made more explicit – footnote 5 in
 http://blog.languager.org/2015/03/whimsical-unicode.html.

Note that Terry Reedy, who mainly objected, was already acked earlier.
I've just added one more ack¹
And JFTR the 'publication' (O how archaic!) is the whole blog, not a single page, just as it is for any dead-tree publication.

> 
> As for the content of the essay, it is currently rather unfocused.

True.

> It
> appears to be more of a list of "here are some Unicode characters I think
> are interesting, divided into subgroups, oh and here are some I personally
> don't have any use for, which makes them silly" than any sort of discussion
> about the universality of Unicode. That makes it rather idiosyncratic and
> parochial. Why should obscure maths symbols be given more importance than
> obscure historical languages?

Idiosyncratic ≠ parochial


> 
> I think that the universality of Unicode could be explained in a single
> sentence:
> 
> "It is the aim of Unicode to be the one character set anyone needs to
> represent every character, ideogram or symbol (but not necessarily distinct
> glyph) from any existing or historical human language."
> 
> I can expand on that, but in a nutshell that is it.
> 
> 
> You state:
> 
> "APL and Z Notation are two notable languages APL is a programming language
> and Z a specification language that did not tie themselves down to a
> restricted charset ..."

Tsk Tsk – dishonest snipping. I wrote

| APL and Z Notation are two notable languages APL is a programming language 
| and Z a specification language that did not tie themselves down to a 
| restricted charset even in the day that ASCII ruled.

so it's clear that the restricted applies to ASCII
> 
> You list ideographs such as Cuneiform under "Icons". They are not icons.
> They are a mixture of symbols used for consonants, syllables, and
> logophonetic, consonantal alphabetic and syllabic signs. That sits them
> firmly in the same categories as modern languages with consonants, ideogram
> languages like Chinese, and syllabary languages like Cheyenne.

Ok changed to iconic.
Obviously 2-3 millennia ago, when people wrote in hieroglyphs or cuneiform, they were languages.
In 2015 when someone sees them and recognizes them, they are 'those things that
Sumerians/Egyptians wrote'. No one except a rare expert knows those languages.

> 
> Just because native readers of Cuneiform are all dead doesn't make Cuneiform
> unimportant. There are probably more people who need to write Cuneiform
> than people who need to write APL source code.
> 
> You make a comment:
> 
> "To me – a unicode-layman – it looks unprofessional… Billions of computing
> devices world over, each having billions of storage words having their
> storage wasted on blocks such as these??"
> 
> But that is nonsense, and it contradicts your earlier quoting of Dave Angel.
> Why are you so worried about an (illusionary) minor optimization?

2 < 4 as far as I am concerned.
[If you disagree one man's illusionary is another's waking]

> 
> Whether code points are allocated or not doesn't affect how much space they
> take up. There are millions of unused Unicode code points today. If they
> are allocated tomorrow, the space your documents take up will not increase
> one byte.
> 
> Allocating code points to Cuneiform has not increased the space needed by
> Unicode at all. Two bytes alone is not enough for even existing human
> languages (thanks China). For hardware related reasons, it is faster and
> more efficient to use four bytes than three, so the obvious and "dumb" (in
> the simplest thing which will work) way to store Unicode is UTF-32, which
> takes a full four bytes per code point, regardless of whether there are
> 65537 code points or 1114112. That makes it less expensive than floating
> point numbers, which take eight. Would you like to argue that floating
> point doubles are "unprofessional" and wasteful?
> 
> As Dave pointed out, and you apparently agreed with him enough to quote him
> TWICE (once in each of two blog posts), history of computing is full of
> premature optimizations for space. (In fact, some of these may have been
> justified by the technical limitations of the day.) Technically Unicode is
> also limited, but it is limited to over one million code points, 1114112 to
> be exact, although some of them are reserved as invalid for technical
> reasons, and there is no indication that we'll ever run out of space in
> Unicode.
> 
> In practice, there are three common Unicode encodings that nearly all
> Unicode documents will use.
> 
> * UTF-8 will use between one and (by memory) four bytes per code 
>   point. For Western European languages, that will be mostly one 
>   or two bytes per character.
> 
> * UTF-16 uses a fixed two bytes per code point in the Basic Multilingual 
>   Plane, which is enough for nearly all Western European writing and 
>   much East Asian writing as well. For the rest, it uses a fixed four 
>   bytes per code point.
> 
> * UTF-32 uses a fixed four bytes per code point. Hardly anyone uses 
>   this as a storage format.
> 
> 
> In *all three cases*, the existence of hieroglyphs and cuneiform in Unicode
> doesn't change the space used. If you actually include a few hieroglyphs to
> your document, the space increases only by the actual space used by those
> hieroglyphs: four bytes per hieroglyph. At no time does the existence of a
> single hieroglyph in your document force you to expand the non-hieroglyph
> characters to use more space.
> 
> 
> > What I was trying to say expanded here
> > http://blog.languager.org/2015/03/whimsical-unicode.html
> 
> You have at least two broken links, referring to a non-existent page:
> 
> http://blog.languager.org/2015/03/unicode-universal-or-whimsical.html

Thanks corrected

> 
> This essay seems to be even more rambling and unfocused than the first. What
> does the cost of semi-conductor plants have to do with whether or not
> programmers support Unicode in their applications?
> 
> Your point about the UTF-8 "BOM" is valid only if you interpret it as a Byte
> Order Mark. But if you interpret it as an explicit UTF-8 signature or mark,
> it isn't so silly. If your text begins with the UTF-8 mark, treat it as
> UTF-8. It's no more silly than any other heuristic, like HTML encoding tags
> or text editor's encoding cookies.
> 
> Your discussion of "complexifiers and simplifiers" doesn't seem to be
> terribly relevant, or at least if it is relevant, you don't give any reason
> for it. The whole thing about Moore's Law and the cost of semi-conductor
> plants seems irrelevant to Unicode except in the most over-generalised
> sense of "things are bigger today than in the past, we've gone from
> five-bit Baudot codes to 23 bit Unicode". Yeah, okay. So what's your point?

- Most people need only 16 bits.
- Many notable examples of software fail going from 16 to 23.
- If you are a software writer, and you fail going 16 to 23, it's ok, but try to
give useful errors

> 
> You agree that 16-bits are not enough, and yet you criticise Unicode for using
> more than 16-bits on wasteful, whimsical gibberish like Cuneiform? That is
> an inconsistent position to take.

| ½-assed unicode support – BMP-only – is better than 1/100-assed⁴ support – 
| ASCII.  BMP-only Unicode is universal enough but within practical limits 
| whereas full (7.0) Unicode is 'really' universal at a cost of performance and 
| whimsicality.

Do you disagree that BMP-only = 16 bits?

> 
> UTF-16 is not half-arsed Unicode support. UTF-16 is full Unicode support.
> 
> The problem is when your language treats UTF-16 as a fixed-width two-byte
> format instead of a variable-width, two- or four-byte format. (That's more
> or less like the old, obsolete, UCS-2 standard.) There are all sorts of
> good ways to solve the problem of surrogate pairs and the SMPs in UTF-16.
> If some programming languages or software fails to do so, they are buggy,
> not UTF-16.
> 
> After explaining that 16 bits are not enough, you then propose a 16 bit
> standard. /face-palm
> 
> UTF-16 cannot break the fixed width invariant, because it has no fixed width
> invariant. That's like arguing against UTF-8 because it breaks the fixed
> width invariant "all characters are single byte ASCII characters".
> 
> If you cannot handle SMP characters, you are not supporting Unicode.


7.0

> 
> 
> You suggest that Chinese users should be looking at Big5 or GB. I really,
> really don't think so.
> 
> - Neither is universal. What makes you think that Chinese writers need 
>   to use maths symbols, or include (say) Thai or Russian in their work 
>   any less than Western writers do?
> 
> - Neither even support all of Chinese. Big5 supports Traditional 
>   Chinese, but not Simplified Chinese. GB supports Simplified 
>   Chinese, but not Traditional Chinese. 
> 
> - Big5 likewise doesn't support placenames, many people's names, and
>   other less common parts of Chinese.
> 
> - Big5 is a shift-system, like Shift-JIS, and suffers from the same sort
>   of data corruption issues.
> 
> - There is no one single Big5 standard, but a whole lot of vendor 
>   extensions.
> 
> 
> You say:
> 
> "I just want to suggest that the Unicode consortium going overboard in
> adding zillions of codepoints of nearly zero usefulness, is in fact
> undermining unicode’s popularity and spread."
> 
> Can you demonstrate this? Can you show somebody who says "Well, I was going
> to support full Unicode, but since they added a snowman, I'm going to stick
> to ASCII"?

I gave a list of software that goofs/breaks going from BMP to 7.0 Unicode.
> 
> The "whimsical" characters you are complaining about were important enough
> to somebody to spend significant amounts of time and money to write up a
> proposal, have it go through the Unicode Consortium bureaucracy, and
> eventually have it accepted. That's not easy or cheap, and people didn't
> add a snowman on a whim. They did it because there are a whole lot of
> people who want a shared standard for map symbols.
> 
> It is easy to mock what is not important to you. I daresay kids adding emoji
> to their 10 character tweets would mock all the useless maths symbols in
> Unicode too.

Head para of section 5 has:
| However (the following) are (in the standard)! So lets use them! 
Looks like mocking to you?

The only mocking is at 5.1. And even here I don't mock the users of these blocks
– now or millennia ago. I only mock the Unicode consortium for putting them into
Unicode.

----------------------
¹ And somewhere around here we get into Gödelian problems -- known to programmers
under the form "Write a program that prints itself". Likewise Acks.
I am going to deal with the Gödel-loop by the device:
- Address real issues/objects
- Smile at grumpiness


#86883

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-03-03 20:16 -0800
Message-ID	<debcdbc6-bb2d-4a22-9716-5f6c9afb2f37@googlegroups.com>
In reply to	#86882

On Wednesday, March 4, 2015 at 9:35:28 AM UTC+5:30, Rustom Mody wrote:
> - Most people need only 16 bits.
> - Many notable examples of software fail going from 16 to 23.
> - If you are a software writer, and you fail going 16 to 23 its ok but try to 
> give useful errors

Uh… 21 bits, not 23.
That's what makes 3 chars per 64-bit word a possibility (3 × 21 = 63 bits).
A possibility that could become realistic if/when Intel decides to add 'packed-unicode' string instructions.
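
A minimal sketch of the "3 chars per 64-bit word" idea, assuming a made-up layout of three 21-bit fields in the low 63 bits of the word. The names `pack3` and `unpack3` are hypothetical, and nothing here corresponds to an existing CPU instruction:

```python
MASK21 = (1 << 21) - 1  # 0x1FFFFF; the largest code point, 0x10FFFF, needs 21 bits

def pack3(a, b, c):
    """Pack three code points (ints) into one 64-bit word; the top bit stays unused."""
    for cp in (a, b, c):
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError("not a Unicode code point: %r" % (cp,))
    return a | (b << 21) | (c << 42)

def unpack3(word):
    """Recover the three code points packed by pack3()."""
    return (word & MASK21, (word >> 21) & MASK21, (word >> 42) & MASK21)

# 'a', the euro sign, and a cuneiform sign from outside the BMP
word = pack3(ord("a"), ord("\u20ac"), ord("\U00012000"))
print(hex(word), unpack3(word))
```

Whether this would beat UTF-8 or UTF-32 in practice is not something the sketch can show; it only checks that the arithmetic works.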


#86891

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-03-04 19:14 +1100
Message-ID	<54f6bee5$0$11122$c3e8da3@news.astraweb.com>
In reply to	#86882

Rustom Mody wrote:

> On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote:

>> I consider it unethical to make semantic changes to a published work in
>> place without acknowledgement. Fixing minor typos or spelling errors, or
>> dead links, is okay. But any edit that changes the meaning should be
>> commented on, either by an explicit note on the page itself, or by
>> striking out the previous content and inserting the new.
> 
> Dunno What you are grumping about…

You published something on a blog. And then you edited it, not to correct a 
typo, but to make a potentially substantial change to semantics, without 
noting that fact.

I consider that unethical. Reputable journalists also consider it unethical 
to change a published work in place without comment, that is why if they 
have to correct an online post or article, they put a note (usually at the 
bottom of the page) stating the nature of the correction made. E.g. "an 
earlier version of this story stated blah, which is incorrect and has now 
been corrected."

Putting the correction in another post is not good enough, for obvious 
reasons. People don't read a blog as a unified single piece, they read it as 
individual posts.

In this case, I *assume* that the change only changes the tone rather than 
the actual meaning of the text, since I haven't seen the before-and-after 
versions. I'm making a general comment about the ethics of blogging.

> And JFTR the 'publication' (O how archaic!) is the whole blog not a single
> page just as it is for any other dead-tree publication.

"Any other dead-tree publication"? An internet blog is not a dead-tree 
publication.

And there's nothing archaic about publishing work on the Internet. What a 
foolish thing to say.

>> As for the content of the essay, it is currently rather unfocused.
> 
> True.
> 
>  It
>> appears to be more of a list of "here are some Unicode characters I think
>> are interesting, divided into subgroups, oh and here are some I
>> personally don't have any use for, which makes them silly" than any sort
>> of discussion about the universality of Unicode. That makes it rather
>> idiosyncratic and parochial. Why should obscure maths symbols be given
>> more importance than obscure historical languages?
> 
> Idiosyncratic ≠ parochial

I know. That's why I said "idiosyncratic and parochial" rather than just 
picking one. It is both.

[...]
>> You state:
>> 
>> "APL and Z Notation are two notable languages APL is a programming
>> language and Z a specification language that did not tie themselves down
>> to a restricted charset ..."
> 
> Tsk Tsk – dishonest snipping. I wrote
> 
> | APL and Z Notation are two notable languages APL is a programming
> | language and Z a specification language that did not tie themselves down
> | to a restricted charset even in the day that ASCII ruled.
> 
> so it's clear that the restricted applies to ASCII

It is not clear at all, and in fact ASCII is irrelevant.

Even in the days that "ASCII ruled", there were dozens, maybe hundreds of 
restricted charsets. EBCDIC, national variants of ASCII, mutations of it 
like PETSCII (used on Commodore machines), 8-bit code pages...

APL was invented in 1964; the first public draft of ASCII was published in 
1963, just one year earlier. In 1964, ASCII was not commonly used in 
computing; it was a seven-bit teleprinter code. ASCII didn't get fully 
established in computing until 1968, when the US government mandated that, 
starting from 1969, all computers purchased by the government had to support 
ASCII.

When APL was invented, ASCII wasn't even relevant.

>> You list ideographs such as Cuneiform under "Icons". They are not icons.
>> They are a mixture of symbols used for consonants, syllables, and
>> logophonetic, consonantal alphabetic and syllabic signs. That sits them
>> firmly in the same categories as modern languages with consonants,
>> ideogram languages like Chinese, and syllabary languages like Cheyenne.
> 
> Ok changed to iconic.
> Obviously 2-3 millennia ago, when people spoke hieroglyphs or cuneiform

o_O

People don't speak hieroglyphs, except in Asterix the Gaul comics. People 
speak words.

> they were languages. In 2015 when someone sees them and recognizes them,
> they are 'those things that Sumerians/Egyptians wrote' No one except a
> rare expert knows those languages

True. But there are people who are not "rare experts" but still have need to 
use cuneiform or hieroglyphs in their works, just like not everybody who 
writes about mathematics is "a rare expert" mathematician.

>> Just because native readers of Cuneiform are all dead doesn't make
>> Cuneiform unimportant. There are probably more people who need to write
>> Cuneiform than people who need to write APL source code.
>> 
>> You make a comment:
>> 
>> "To me – a unicode-layman – it looks unprofessional… Billions of
>> computing devices world over, each having billions of storage words
>> having their storage wasted on blocks such as these??"
>> 
>> But that is nonsense, and it contradicts your earlier quoting of Dave
>> Angel. Why are you so worried about an (illusionary) minor optimization?
> 
> 2 < 4 as far as I am concerned.
> [If you disagree one man's illusionary is another's waking]

You can't have it both ways. You acknowledge that 16 bits are not sufficient 
for a universal character set, then criticize Unicode for using more than 
16 bits. This is inconsistent and foolish.

[...]
>> Your discussion of "complexifiers and simplifiers" doesn't seem to be
>> terribly relevant, or at least if it is relevant, you don't give any
>> reason for it. The whole thing about Moore's Law and the cost of
>> semi-conductor plants seems irrelevant to Unicode except in the most
>> over-generalised sense of "things are bigger today than in the past,
>> we've gone from five-bit Baudot codes to 23 bit Unicode". Yeah, okay. So
>> what's your point?
> 
> - Most people need only 16 bits.

I don't know about "most" people, but there are over one billion Chinese 
whose native language simply doesn't fit into 16 bits.

> - Many notable examples of software fail going from 16 to 23.
> - If you are a software writer, and you fail going 16 to 23 its ok but try
> to give useful errors

No it isn't okay.

>> You agree that 16-bits are not enough, and yet you criticize Unicode for
>> using more than 16-bits on wasteful, whimsical gibberish like Cuneiform?
>> That is an inconsistent position to take.
> 
> | ½-assed unicode support – BMP-only – is better than 1/100-assed⁴ support
> | –
> | ASCII.  BMP-only Unicode is universal enough but within practical limits
> | whereas full (7.0) Unicode is 'really' universal at a cost of
> | performance and whimsicality.
> 
> Do you disagree that BMP-only = 16 bits?

That point is not in question.

Unicode was extended beyond 16 bits because 16 bits *is not enough* even for 
existing human languages in common use.
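
A quick check of the sizes involved, as a sketch assuming Python 3 and its built-in codecs. The helper name `encoded_sizes` and the sample characters are chosen here for illustration; the `-le` codec variants are used only so that no byte order mark is counted:

```python
def encoded_sizes(ch):
    """Bytes needed for one character in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

# ASCII letter, accented Latin letter, a BMP ideograph,
# an ideograph outside the BMP, and an Egyptian hieroglyph
samples = ["a", "\u00e9", "\u4e2d", "\U00020000", "\U00013000"]
for ch in samples:
    print("U+%04X" % ord(ch), encoded_sizes(ch))

# Appending one hieroglyph costs only that hieroglyph's own bytes;
# the ASCII text in front of it is not widened.
plain = "hello world"
with_glyph = plain + "\U00013000"
extra = len(with_glyph.encode("utf-8")) - len(plain.encode("utf-8"))
print(extra)
```

The code point U+20000 is a CJK ideograph that does not fit in 16 bits, which is why UTF-16 needs a four-byte surrogate pair for it.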

As for performance, you contradict yourself. You've quoted Dave TWICE about 
all these artificial limits imposed which turned out to be too low, and here 
you are doing exactly the same thing.

[...]
>> You say:
>> 
>> "I just want to suggest that the Unicode consortium going overboard in
>> adding zillions of codepoints of nearly zero usefulness, is in fact
>> undermining unicode’s popularity and spread."
>> 
>> Can you demonstrate this? Can you show somebody who says "Well, I was
>> going to support full Unicode, but since they added a snowman, I'm going
>> to stick to ASCII"?
> 
> I gave a list of softwares which goof/break going BMP to 7.0 unicode

Irrelevant to my question.

You didn't say that Unicode was being undermined by buggy programming 
languages, you stated it was being undermined by the addition of characters 
of "nearly zero usefulness". Citation please.

>> 
>> The "whimsical" characters you are complaining about were important
>> enough to somebody to spend significant amounts of time and money to
>> write up a proposal, have it go through the Unicode Consortium
>> bureaucracy, and eventually have it accepted. That's not easy or cheap,
>> and people didn't add a snowman on a whim. They did it because there are
>> a whole lot of people who want a shared standard for map symbols.
>> 
>> It is easy to mock what is not important to you. I daresay kids adding
>> emoji to their 10 character tweets would mock all the useless maths
>> symbols in Unicode too.
> 
> Head para of section 5 has:
> | However (the following) are (in the standard)! So lets use them!
> Looks like mocking to you

No. The part where you say they are "gibberish" or "whimsical" and make zero 
effort to understand why they were added is mocking. The part where your 
argument basically boils down to "I personally have no need for these 
characters, therefore the Unicode Consortium is silly for adding them."

> The only mocking is at 5.1. And even here I don't mock the users of these
> blocks – now or millennia ago. I only mock the unicode consortium for
> putting them into unicode

Exactly.

-- 
Steve


#86895

From	wxjmfauth@gmail.com
Date	2015-03-04 02:16 -0800
Message-ID	<0b4484c7-b213-49ee-9098-1eeeb3aabcb6@googlegroups.com>
In reply to	#86891

Le mercredi 4 mars 2015 09:14:42 UTC+1, Steven D'Aprano a écrit :
> 
> o_O
> 
> People don't speak hieroglyphs, except in Asterix the Gaul comics. People 
> speak words.
> 
> 
http://www.asterix.com/asterix-de-a-a-z/les-personnages/tumeheris.html

jmf


#86523

From	Chris Angelico <rosuav@gmail.com>
Date	2015-02-27 04:29 +1100
Message-ID	<mailman.19277.1424971771.18130.python-list@python.org>
In reply to	#86495

On Fri, Feb 27, 2015 at 4:02 AM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 2/26/2015 8:24 AM, Chris Angelico wrote:
>>
>> On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody <rustompmody@gmail.com>
>> wrote:
>>>
>>> Wrote something up on why we should stop using ASCII:
>>> http://blog.languager.org/2015/02/universal-unicode.html
>
>
> I think that the main point of the post, that many Unicode chars are truly
> planetary rather than just national/regional, is excellent.

Agreed. Like you, though, I take exception to the "Gibberish" section.

Unicode offers us a number of types of character needed by linguists:

1) Letters[1] common to many languages, such as the unadorned Latin
and Cyrillic letters
2) Letters specific to one or very few languages, such as the Turkish dotless i
3) Diacritical marks, ready to be combined with various letters
4) Precomposed forms of various common "letter with diacritical" combinations
5) Other precomposed forms, eg ligatures and Hangul syllables
6) Symbols, punctuation, and various other marks
7) Spacing of various widths and attributes

Apart from #4 and #5, which could be avoided by using the decomposed
forms everywhere, each of these character types is vital. You can't
typeset a document without being able to adequately represent every
part of it. Then there are additional characters that aren't strictly
necessary, but are extremely convenient, such as the emoticon
sections. You can talk in text and still put in a nice little picture
of a globe, or the monkey-no-evil set, etc.

Most of these characters - in fact, all except #2 and maybe a few of
the diacritical marks - are used in multiple places/languages. Unicode
isn't about taking everyone's separate character sets and numbering
them all so we can reference characters from anywhere; if you wanted
that, you'd be much better off with something that lets you specify a
code page in 16 bits and a character in 8, which is roughly the same
size as Unicode anyway. What we have is, instead, a system that brings
them all together - LATIN SMALL LETTER A is U+0061 no matter whether
it's being used to write English, French, Malaysian, Turkish,
Croatian, Vietnamese, or Icelandic text. Unicode is truly planetary.

ChrisA

[1] I use the word "letter" loosely here; Chinese and Japanese don't
have a concept of letters as such, but their glyphs are still
represented.
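
To illustrate items 3 to 5 of the list above (combining marks versus precomposed forms), here is a small sketch using the standard `unicodedata` module, assuming Python 3. The sample characters are picked for illustration only:

```python
import unicodedata

precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT (two code points)

# The two spellings look the same but compare unequal until normalized.
same_before = (precomposed == decomposed)
same_after = (unicodedata.normalize("NFC", decomposed) == precomposed)
print(same_before, same_after)

# A precomposed Hangul syllable decomposes into its individual jamo under NFD.
hangul = "\ud55c"         # HANGUL SYLLABLE HAN
jamo = unicodedata.normalize("NFD", hangul)
print(len(hangul), len(jamo))

# The same code point serves every language that uses the letter.
print("U+%04X" % ord("a"), unicodedata.name("a"))
```

This is why the precomposed forms are a convenience rather than a necessity: text can be normalized to either form and compared consistently.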


#86553

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-02-27 10:09 +1100
Message-ID	<54efa7b6$0$12994$c3e8da3$5496439d@news.astraweb.com>
In reply to	#86523

Chris Angelico wrote:

> Unicode
> isn't about taking everyone's separate character sets and numbering
> them all so we can reference characters from anywhere; if you wanted
> that, you'd be much better off with something that lets you specify a
> code page in 16 bits and a character in 8, which is roughly the same
> size as Unicode anyway.

Well, except for the approximately 25% of people in the world whose native
language has more than 256 characters.

It sounds like you are referring to some sort of "shift code" system. Some
legacy East Asian encodings use a similar scheme, and depending on how they
are implemented they have great disadvantages. For example, Shift-JIS
suffers from a number of weaknesses including that a single byte corrupted
in transmission can cause large swaths of the following text to be
corrupted. With Unicode, a single corrupted byte can only corrupt a single
code point.

-- 
Steven
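
A rough sketch of the corruption point, assuming the text is stored as UTF-8 and decoded by Python 3 with `errors="replace"`. The sample string and the choice of which byte to damage are arbitrary, and the sketch says nothing about how Shift-JIS behaves:

```python
text = "caf\u00e9 au lait, \u65e5\u672c\u8a9e"
good = text.encode("utf-8")

# Damage exactly one byte: the lead byte of the two-byte sequence for 'é'.
pos = good.index("\u00e9".encode("utf-8"))
bad = bytearray(good)
bad[pos] = 0xFF           # 0xFF never occurs in valid UTF-8

damaged = bytes(bad).decode("utf-8", errors="replace")
print(ascii(damaged))

# Which of the original characters no longer appear anywhere in the result?
lost = [c for c in text if c not in damaged]
print(lost)
```

The decoder resynchronizes at the next valid lead byte, so the characters after the damaged one decode normally.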


#86554

From	Chris Angelico <rosuav@gmail.com>
Date	2015-02-27 10:23 +1100
Message-ID	<mailman.19295.1424993013.18130.python-list@python.org>
In reply to	#86553

On Fri, Feb 27, 2015 at 10:09 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Chris Angelico wrote:
>
>> Unicode
>> isn't about taking everyone's separate character sets and numbering
>> them all so we can reference characters from anywhere; if you wanted
>> that, you'd be much better off with something that lets you specify a
>> code page in 16 bits and a character in 8, which is roughly the same
>> size as Unicode anyway.
>
> Well, except for the approximately 25% of people in the world whose native
> language has more than 256 characters.

You could always allocate multiple code pages to one language. But
since I'm not advocating this system, I'm only guessing at solutions
to its problems.

> It sounds like you are referring to some sort of "shift code" system. Some
> legacy East Asian encodings use a similar scheme, and depending on how they
> are implemented they have great disadvantages. For example, Shift-JIS
> suffers from a number of weaknesses including that a single byte corrupted
> in transmission can cause large swaths of the following text to be
> corrupted. With Unicode, a single corrupted byte can only corrupt a single
> code point.

That's exactly what I was hinting at. There are plenty of systems like
that, and they are badly flawed compared to a simple universal system
for a number of reasons. One is the corruption issue you mention;
another is that a simple memory-based text search becomes utterly
useless (to locate text in a document, you'd need to do a whole lot of
stateful parsing - not to mention the difficulties of doing
"similar-to" searches across languages); concatenation of text also
becomes a stateful operation, and so do all sorts of other simple
manipulations. Unicode may demand a bit more storage in certain
circumstances (where an eight-bit encoding might have handled your
entire document), but it's so much easier for the general case.

ChrisA
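
The search and concatenation points can be checked with a few lines, as a sketch assuming UTF-8 as the storage encoding and Python 3. The sample strings are invented for illustration:

```python
a = "Unicode is "
b = "\u5730\u7403 planetary"

# Concatenation needs no state: joining the encoded bytes gives the same
# result as encoding the joined text.
concat_ok = (a + b).encode("utf-8") == a.encode("utf-8") + b.encode("utf-8")
print(concat_ok)

# A plain byte search finds encoded text without any parsing of the haystack.
haystack = "English, fran\u00e7ais, T\u00fcrk\u00e7e, \u65e5\u672c\u8a9e".encode("utf-8")
needle = "T\u00fcrk\u00e7e".encode("utf-8")
offset = haystack.find(needle)
found = haystack[offset:offset + len(needle)].decode("utf-8")
print(offset, found)
```

With a shift-code scheme, the same byte sequence could mean different things depending on which code page was selected earlier in the stream, so neither operation would be this simple.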
