Newbie question about text encoding

Started by	pierrick.brihaye@gmail.com
First post	2015-02-24 02:49 -0800
Last post	2015-02-27 10:23 +1100
Articles	20 on this page of 158 — 19 participants

  Newbie question about text encoding pierrick.brihaye@gmail.com - 2015-02-24 02:49 -0800
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-24 22:09 +1100
    Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 06:25 -0500
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 15:55 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:03 +1100
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:06 +0100
      Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-24 08:01 -0800
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:07 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:10 +1100
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:24 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:33 +1100
    Re: Newbie question about text encoding random832@fastmail.us - 2015-02-24 10:38 -0500
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 17:20 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 03:24 +1100
    Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 12:13 -0500
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:45 +0100
      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-02-25 00:21 +0200
      Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:20 +1100
        Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-25 06:34 -0800
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:57 +0100
      Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:19 +1100
        Re: Newbie question about text encoding Marcos Almeida Azevedo <marcos.al.azevedo@gmail.com> - 2015-02-25 12:54 +0800
    Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 15:41 -0500
      Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 04:40 -0800
        Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 05:15 -0800
        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 00:24 +1100
          Re: Newbie question about text encoding Sam Raker <sam.raker@gmail.com> - 2015-02-26 08:45 -0800
            Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:08 -0800
        Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-02-26 12:02 -0500
          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:59 -0800
            Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-26 12:20 -0800
            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 09:13 +1100
            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 12:05 +1100
              Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-26 20:57 -0500
                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 16:58 +1100
                  Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 02:30 -0500
                    Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 22:54 +1100
                      Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 09:02 -0500
                      Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 01:22 +1100
                        Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:00 +0000
                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 03:12 +1100
                            Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:45 +0000
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 04:45 +1100
                                Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:13 +0000
                              Re: Newbie question about text encoding MRAB <python@mrabarnett.plus.com> - 2015-02-27 19:14 +0000
                                Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:09 +0000
                          Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 15:52 -0500
                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 08:04 +1100
                      Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 10:24 -0500
                      Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:46 +0000
                        Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:47 +0000
              Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-27 01:06 -0800
          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 11:59 -0800
          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 10:03 -0800
            Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-03 10:36 -0800
              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:45 -0800
                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 15:54 +1100
                  Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 21:05 -0800
                    Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 01:06 +1100
                      Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-05 06:59 -0800
                      Re: Newbie question about text encoding random832@fastmail.us - 2015-03-05 14:59 -0500
                        Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 09:33 +1100
                      Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-05 20:53 -0800
                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 16:20 +1100
                          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:02 -0800
                            Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:06 -0800
                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 08:33 -0500
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 00:39 +1100
                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:03 -0500
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 01:11 +1100
                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:27 -0500
                                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 03:26 +1100
                            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 20:54 +1100
                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 02:07 -0800
                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 01:50 +1100
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 02:27 +1100
                              Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 07:37 -0800
                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 08:20 -0800
                                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 03:45 +1100
                                Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:41 -0800
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:58 -0800
                                Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-07 01:11 -0500
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 23:43 -0800
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 00:55 -0800
                                    Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 01:08 -0800
                                  Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:25 -0800
                        Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 22:09 +1100
                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 22:33 +1100
                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 13:53 +0200
                            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 23:02 +1100
                            Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:07 +0000
                            Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 07:28 -0800
                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 02:40 +1100
                              Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 17:48 +0200
                                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:17 +1100
                                  Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:25 +0200
                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:41 +1100
                                      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:54 +0200
                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:58 +1100
                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:00 +1100
                                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:14 +0200
                                            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:26 +1100
                                              Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:50 +0200
                                                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:59 +1100
                                                  Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:02 +0000
                                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:13 +1100
                                                      Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:34 +0000
                                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:44 +1100
                                                        Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 19:00 +0000
                                                          Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 19:16 +0000
                                                        Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 21:01 +0200
                                    Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 16:40 +0000
                                      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:48 +0200
                                        Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 17:02 +0000
                                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:16 +0200
                                            Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 18:18 +0000
                                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:06 -0800
                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:53 +1100
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 11:03 -0800
                                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 12:45 +1100
                                  Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 09:20 +0200
                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 18:37 +1100
                                      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 10:09 +0200
                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 19:23 +1100
                                          Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-08 01:18 -0800
                                        Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:25 +1100
                                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 22:09 +0200
                                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 12:43 +1100
                                              Re: Newbie question about text encoding Ben Finney <ben+python@benfinney.id.au> - 2015-03-09 13:09 +1100
                                                Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-09 08:31 +0200
                                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 13:18 +1100
                                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-09 00:27 -0400
                                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 07:55 +1100
                                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 08:13 +1100
                                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 17:34 +1100
                                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 17:44 +1100
                                                Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 02:08 -0700
                                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 07:26 -0700
                                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-09 05:28 -0700
                                  Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 19:01 +1100
                          Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:13 +0000
                          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 23:23 -0800
                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:30 +1100
                          Re: Newbie question about text encoding Cameron Simpson <cs@zip.com.au> - 2015-03-09 13:09 +1100
                            Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-08 19:42 -0700
                  Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:16 +1100
            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 05:43 +1100
              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 18:53 -0800
            Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-03 18:30 -0500
            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 13:54 +1100
              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 14:02 +1100
              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:05 -0800
                Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:16 -0800
                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:14 +1100
                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-04 02:16 -0800
        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 04:29 +1100
          Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 10:09 +1100
            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 10:23 +1100

#87056

From	wxjmfauth@gmail.com
Date	2015-03-06 11:58 -0800
Message-ID	<87d1076d-4b71-4705-8e5b-ef58c5086bcd@googlegroups.com>
In reply to	#87055

On Friday, 6 March 2015 at 20:41:36 UTC+1, wxjm...@gmail.com wrote:
> On Friday, 6 March 2015 at 17:21:10 UTC+1, Rustom Mody wrote:
> > On Friday, March 6, 2015 at 8:20:22 PM UTC+5:30, Steven D'Aprano wrote:
> > > Rustom Mody wrote:
> > > 
> > > > On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
> > > 
> > > [snip example of an analogous situation with NULs]
> > > 
> > > > Strawman.
> > > 
> > > Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
> > > they really should say is "Yes, that's a good argument, I'm afraid I can't
> > > argue against it, at least not without considerable thought", I'd be a
> > > wealthy man...
> > 
> > Missed my addition? Here it is again –  grammar slightly corrected.
> > 
> > ===========
> > Ah well if you insist on pursuing the nul-char example...
> > - No, the unicode consortium (or ASCII equivalent) is not wrong in allocating codepoint 0
> > 
> > - No, the code that "can't cope with a perfectly normal character" is not wrong
> > 
> > - It is C that is wrong for designing a buggy string data structure that cannot
> > contain a valid char.
> > ===========
> > 
> > In fact Chris' nul-char example is so strongly supporting my argument – bugginess of UTF-16 –
> > it is perhaps too strong even for me.
> > 
> > To elaborate:
> > Take the buggy-plane analogy I gave in
> > http://blog.languager.org/2015/03/whimsical-unicode.html
> > 
> > If a plane model crashes once in 10,000 flights compared to others that crash once in
> > one million flights we can call it bug-prone though not strictly buggy – it does fly  
> > 9999 times safely!
> > OTOH if a plane is guaranteed to crash we can call it a buggy plane.
> > 
> > C's string is not bug-prone, it's plain buggy, as it cannot represent strings
> > with nulls.
> > 
> > I would not go that far for UTF-16.
> > It is bug-inviting but it can also be implemented correctly
> > > 
> > > 
> > > > Lets please stick to UTF-16 shall we?
> > > > 
> > > > Now tell me:
> > > > - Is it broken or not?
> > > 
> > > The UTF-16 standard is not broken. It is a perfectly adequate variable-width
> > > encoding, and considerably better than most other variable-width encodings.
> > > 
> > > However, many implementations of UTF-16 are faulty, and assume a
> > > fixed-width. *That* is broken, not UTF-16.
> > > 
> > > (The difference between specification and implementation is critical.)
> > > 
> > > 
> > > > - Is it widely used or not?
> > > 
> > > It's quite widely used.
> > > 
> > > 
> > > > - Should programmers be careful of it or not?
> > > 
> > > Programmers should be aware whether or not any specific language uses UTF-16
> > > and whether the implementation is buggy. That will help them decide whether
> > > or not to use that language.
> > > 
> > > 
> > > > - Should programmers be warned about it or not?
> > > 
> > > I'm in favour of people having more knowledge rather than less. I don't
> > > believe that ignorance is bliss, except perhaps in the case that a giant
> > > asteroid the size of Texas is heading straight for us.
> > > 
> > > Programmers should be aware of the limitations or bugs in any UTF-16
> > > implementation they are likely to run into. Hence my general
> > > recommendation:
> > > 
> > > - For transmission over networks or storage on permanent media (e.g. the
> > > content of text files), use UTF-8. It is well-implemented by nearly all
> > > languages that support Unicode, as far as I know.
> > > 
> > > - If you are designing your own language, your implementation of Unicode
> > > strings should use something like Python's FSR, or UTF-8 with tweaks to
> > > make string indexing O(1) rather than O(N), or correctly-implemented
> > > UTF-16, or even UTF-32 if you have the memory. (Choices, choices.)
> > 
> > FSR is possible in python for very specific pythonic reasons
> > - dynamicness
> > - immutable strings
> > 
> > Drop either and FSR is impossible
> > 
> > > If, in 2015, you design your Unicode implementation as if UTF-16 is a fixed 
> > > 2-byte per code point format, you fail.
> > 
> > Seems obvious enough.
> > So lets see...
> > Here's a 2-line python program -- runs well enough when run as a command.
> > Program:
> > =========
> > pp = "💩"
> > print (pp)
> > =========
> > Try open it in idle3 and you get (at least I get):
> > 
> > $ idle3 ff.py 
> > Traceback (most recent call last):
> >   File "/usr/bin/idle3", line 5, in <module>
> >     main()
> >   File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in main
> >     if flist.open(filename) is None:
> >   File "/usr/lib/python3.4/idlelib/FileList.py", line 36, in open
> >     edit = self.EditorWindow(self, filename, key)
> >   File "/usr/lib/python3.4/idlelib/PyShell.py", line 126, in __init__
> >     EditorWindow.__init__(self, *args)
> >   File "/usr/lib/python3.4/idlelib/EditorWindow.py", line 294, in __init__
> >     if io.loadfile(filename):
> >   File "/usr/lib/python3.4/idlelib/IOBinding.py", line 236, in loadfile
> >     self.text.insert("1.0", chars)
> >   File "/usr/lib/python3.4/idlelib/Percolator.py", line 25, in insert
> >     self.top.insert(index, chars, tags)
> >   File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 81, in insert
> >     self.addcmd(InsertCommand(index, chars, tags))
> >   File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 116, in addcmd
> >     cmd.do(self.delegate)
> >   File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 219, in do
> >     text.insert(self.index1, self.chars, self.tags)
> >   File "/usr/lib/python3.4/idlelib/ColorDelegator.py", line 82, in insert
> >     self.delegate.insert(index, chars, tags)
> >   File "/usr/lib/python3.4/idlelib/WidgetRedirector.py", line 148, in __call__
> >     return self.tk_call(self.orig_and_operation + args)
> > _tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl
> > 
> > So who/what is broken?
> > 
> > > 
> > > - If you are using an existing language, be aware of any bugs and
> > > limitations in its Unicode implementation. You may or may not be able to
> > > work around them, but at least you can decide whether or not you wish to
> > > try.
> > > 
> > > - If you are writing your own file system layer, it's 2015 fer fecks sake,
> > > file names should be Unicode strings, not bytes! (That's one part of the
> > > Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
> > > system, whichever you please, but again remember that both are
> > > variable-width formats.
> > 
> > Correct.
> > Windows is broken for using UTF-16
> > Linux is broken for conflating UTF-8 and byte string.
> > 
> > Lot of breakage out here, don't you think?
> > Maybe related to the equation
> > 
> > UTF-16 = UCS-2 + Duct-tape
> > 
> > ??
> 
> =============
> 
> 1) A copy/paste of pp = ... from google group into
> my Python interactive interpreter without intermediate
> state.
> 2) Some manipulations.
> 3) A copy/paste from my interpreter into google group.
> 
> I hope the rendering will be correct.
> 
> Python 3.2.5 (default, May 15 2013, 23:06:03) [MSC v.1500 32 bit (Intel)] on win32
> >>> eta runs etazero.py...
> ...etazero has been executed
> >>> pp = "💩"
> >>> print(pp)
> 💩
> >>> len(pp)
> 2
> >>> pp + pp + 'abcéœ€' + pp
> '💩💩abcéœ€💩'
> >>> 
> >>> # ok, nine glyphs, individually selectable.
> >>> 
> 
> 
> Note:
> 
> len(pp) = 2 because of Py32. This is a deliberate
> choice to keep the Py32 "behaviour" in my interpreter.
> 
> but also note:
> 
> The code point is correctly displayed with a single "glyph".
> All the cut/copy/paste (e.g. Word, PDF, ...), cursor movement,
> selection, caret position, text wrapping, char typing, ..., mainly
> for rendering purposes, is done with my internal "artillery",
> full Unicode.
> 
> In my other GUI applications, everything is working fine,
> including string lengths, because my "artillery" works and
> also handles glyphs (including diacritical signs).
> Honestly, I'm not sure about bidi; however, the Hebrew I'm able
> to test is working fine.
> 
> jmf

======
Rest number 2.

Re-cut/copy/paste of what I sent, into my
interpreter.

>>> 
>>> len('💩💩abcéœ€💩')
12
>>>

Ok, fine.
Windows, Firefox, utf-16, ... are not so bad.

jmf
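
The two lengths reported above (2 for the single character, 12 for the nine-glyph string) are what you would expect if len() counts UTF-16 code units rather than code points. A minimal sketch of that arithmetic, assuming Python 3.3 or later, where len() counts code points; the helper name utf16_units is hypothetical and does not come from the thread:

```python
def utf16_units(text):
    """Return the number of 16-bit code units needed to store text as UTF-16."""
    # utf-16-le writes no byte-order mark, so every code unit is exactly 2 bytes.
    return len(text.encode("utf-16-le")) // 2


# The nine-glyph string from the post, written with escapes so the source
# stays pure ASCII: three astral characters plus a, b, c, e-acute, oe, euro.
s = "\U0001f4a9\U0001f4a9abc\u00e9\u0153\u20ac\U0001f4a9"

code_points = len(s)    # counts code points on Python 3.3+
units = utf16_units(s)  # counts UTF-16 code units, as a narrow build would

print(code_points, units)
```

Each character above U+FFFF takes a surrogate pair, so the three U+1F4A9 characters contribute six code units and the six BMP characters contribute one each.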

#87076

From	Terry Reedy <tjreedy@udel.edu>
Date	2015-03-07 01:11 -0500
Message-ID	<mailman.132.1425708701.21433.python-list@python.org>
In reply to	#87032

On 3/6/2015 11:20 AM, Rustom Mody wrote:

> =========
> pp = "💩"
> print (pp)
> =========
> Try open it in idle3 and you get (at least I get):
>
> $ idle3 ff.py
> Traceback (most recent call last):
>    File "/usr/bin/idle3", line 5, in <module>
>      main()
>    File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in main
>      if flist.open(filename) is None:
>    File "/usr/lib/python3.4/idlelib/FileList.py", line 36, in open
>      edit = self.EditorWindow(self, filename, key)
>    File "/usr/lib/python3.4/idlelib/PyShell.py", line 126, in __init__
>      EditorWindow.__init__(self, *args)
>    File "/usr/lib/python3.4/idlelib/EditorWindow.py", line 294, in __init__
>      if io.loadfile(filename):
>    File "/usr/lib/python3.4/idlelib/IOBinding.py", line 236, in loadfile
>      self.text.insert("1.0", chars)
>    File "/usr/lib/python3.4/idlelib/Percolator.py", line 25, in insert
>      self.top.insert(index, chars, tags)
>    File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 81, in insert
>      self.addcmd(InsertCommand(index, chars, tags))
>    File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 116, in addcmd
>      cmd.do(self.delegate)
>    File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 219, in do
>      text.insert(self.index1, self.chars, self.tags)
>    File "/usr/lib/python3.4/idlelib/ColorDelegator.py", line 82, in insert
>      self.delegate.insert(index, chars, tags)
>    File "/usr/lib/python3.4/idlelib/WidgetRedirector.py", line 148, in __call__
>      return self.tk_call(self.orig_and_operation + args)
> _tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl
>
> So who/what is broken?

tcl
The possible workaround is for Idle to translate "💩" to "\U0001f4a9" 
(10 chars) before sending it to tk.
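
The workaround described here could look roughly like the following. This is only a sketch under assumptions, not Idle's actual code: the function name escape_non_bmp is hypothetical, and it assumes Python 3.3 or later, where iterating over a str yields whole code points.

```python
def escape_non_bmp(text):
    """Replace every code point above U+FFFF with a 10-character escape.

    Characters inside the Basic Multilingual Plane are passed through
    unchanged, so the result only contains characters that a BMP-only
    widget can accept.
    """
    return "".join(
        ch if ord(ch) <= 0xFFFF else "\\U%08x" % ord(ch)
        for ch in text
    )


line = 'pp = "\U0001f4a9"'
escaped = escape_non_bmp(line)
print(escaped)
```

The escaped text is no longer the same source file, so an editor using this approach would still have to decide how to map the escape back when saving.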

But some perspective.  In the console interpreter:

 >>> print("\U0001f4a9")
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "C:\Programs\Python34\lib\encodings\cp437.py", line 19, in encode
     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4a9' in position 0: character maps to <undefined>

So what is broken?  Windows Command Prompt.
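
For readers who hit the same UnicodeEncodeError, one way to avoid the exception is to encode with an error handler before printing. A minimal sketch, assuming the console code page is cp437 as in the traceback; the helper name console_safe is hypothetical:

```python
def console_safe(text, encoding="cp437"):
    """Return text with characters the encoding cannot represent escaped."""
    # backslashreplace is a standard codec error handler: it substitutes a
    # backslash escape for each character that cannot be encoded.
    data = text.encode(encoding, errors="backslashreplace")
    return data.decode(encoding)


safe = console_safe("\U0001f4a9")
print(safe)
```

This trades the crash for a visible escape sequence; it does not make the console display the character.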

More perspective.  tk/Idle *will* print *something* for every BMP char.
Command Prompt will not.  It does not even do UCS-2 correctly.  So
which is more broken?  Windows Command Prompt.  Who has perhaps
1,000,000 times more resources, Microsoft or the tcl/tk group?  I think
we all know.

-- 
Terry Jan Reedy

#87078

From	wxjmfauth@gmail.com
Date	2015-03-06 23:43 -0800
Message-ID	<1d283a0a-914e-4a59-9d7a-da6975dbeb8f@googlegroups.com>
In reply to	#87076

On Saturday, 7 March 2015 at 07:11:53 UTC+1, Terry Reedy wrote:
> On 3/6/2015 11:20 AM, Rustom Mody wrote:
> 
> > =========
> > pp = "💩"
> > print (pp)
> > =========
> > Try open it in idle3 and you get (at least I get):
> >
> > $ idle3 ff.py
> > Traceback (most recent call last):
> >    File "/usr/bin/idle3", line 5, in <module>
> >      main()
> >    File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in main
> >      if flist.open(filename) is None:
> >    File "/usr/lib/python3.4/idlelib/FileList.py", line 36, in open
> >      edit = self.EditorWindow(self, filename, key)
> >    File "/usr/lib/python3.4/idlelib/PyShell.py", line 126, in __init__
> >      EditorWindow.__init__(self, *args)
> >    File "/usr/lib/python3.4/idlelib/EditorWindow.py", line 294, in __init__
> >      if io.loadfile(filename):
> >    File "/usr/lib/python3.4/idlelib/IOBinding.py", line 236, in loadfile
> >      self.text.insert("1.0", chars)
> >    File "/usr/lib/python3.4/idlelib/Percolator.py", line 25, in insert
> >      self.top.insert(index, chars, tags)
> >    File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 81, in insert
> >      self.addcmd(InsertCommand(index, chars, tags))
> >    File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 116, in addcmd
> >      cmd.do(self.delegate)
> >    File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 219, in do
> >      text.insert(self.index1, self.chars, self.tags)
> >    File "/usr/lib/python3.4/idlelib/ColorDelegator.py", line 82, in insert
> >      self.delegate.insert(index, chars, tags)
> >    File "/usr/lib/python3.4/idlelib/WidgetRedirector.py", line 148, in __call__
> >      return self.tk_call(self.orig_and_operation + args)
> > _tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl
> >
> > So who/what is broken?
> 
> tcl
> The possible workaround is for Idle to translate "💩" to "\U0001f4a9" 
> (10 chars) before sending it to tk.
> 
> But some perspective.  In the console interpreter:
> 
>  >>> print("\U0001f4a9")
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
>    File "C:\Programs\Python34\lib\encodings\cp437.py", line 19, in encode
>      return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4a9' in position 0: character maps to <undefined>
> 
> So what is broken?  Windows Command Prompt.
> 
> More perspective.  tk/Idle *will* print *something* for every BMP char. 
>   Command Prompt will not.  It does not even do ucs-2 correctly. So 
> which is more broken?  Windows Command Prompt.  Who has perhaps 
> 1,000,000 times more resources, Microsoft? or the tcl/tk group?  I think 
> we all know.
> 
> -- 
> Terry Jan Reedy

Well...

D:\jm>cd wuni

D:\jm\wuni>jmtest2
Py 3.2.5 (default, May 15 2013, 23:06:03) [MSC v.1500 32 bit (Intel)]
Quelques caractères: «abcéœ€ßÜÆŸçñö»
Loop: empty string => quit
—>abc
Votre entrée était : abc  3 caractère(s)
—>abcéœ€
Votre entrée était : abcéœ€  6 caractère(s)
—>abcéœ€\u20acz\u03b1\z\u0430z
Wahrscheinlich falsches \uxxxx, (single, invalid backslash)
—>abcéœ€\u20acz\u03b1z\u0430z
Votre entrée était : abcéœ€€zαzаz  12 caractère(s)
—>Москва\\Zürich\\Αθήνα
Votre entrée était : Москва\Zürich\Αθήνα  19 caractère(s)
—>
Fin

D:\jm\wuni>


Python is "more broken" than the Windows terminal.

C# works, Ruby works, julia works, go works, Python? NOT

jmf


#87079

From	wxjmfauth@gmail.com
Date	2015-03-07 00:55 -0800
Message-ID	<a683f51e-20c5-4ab3-92fd-506b092e1dcf@googlegroups.com>
In reply to	#87076

On Saturday, 7 March 2015 07:11:53 UTC+1, Terry Reedy wrote:
> tcl
> The possible workaround is for Idle to translate "💩" to "\U0001f4a9" 
> (10 chars) before sending it to tk.
> 

Both are correct. It's a question of perspective.

In an interpreter, which presents the "soul" of the
language, "\U0001f4a9" makes more sense than a glyph.

For a general application, for an end user, displaying
a glyph makes more sense.

See my previous comments.

----

Windows terminal:
I do not wish to defend MS, but despite its
"unicode limitations", it is working very
well and it is certainly not buggy.
Anyway, for serious apps, one writes GUI apps.

tcl/tk? Yes, it is buggy and unusable (at least
on Windows).

jmf


#87080

From	wxjmfauth@gmail.com
Date	2015-03-07 01:08 -0800
Message-ID	<60ad6440-340b-4d45-be5c-7f0c4ad6a8af@googlegroups.com>
In reply to	#87079

On Saturday, 7 March 2015 09:56:09 UTC+1, wxjm...@gmail.com wrote:
> 
> tcl/tk? Yes, it is buggy and unusable (at least
> on Windows).
> 
> jmf

Important addendum.
This is not because it fails to handle non-BMP (SMP)
chars. It is buggy even with BMP chars.

jmf


#87131

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-03-07 21:25 -0800
Message-ID	<7d2480b3-8a39-40d7-aa95-4f1aae95a1f8@googlegroups.com>
In reply to	#87076

On Saturday, March 7, 2015 at 11:41:53 AM UTC+5:30, Terry Reedy wrote:
> On 3/6/2015 11:20 AM, Rustom Mody wrote:
> 
> > =========
> > pp = "💩"
> > print (pp)
> > =========
> > Try open it in idle3 and you get (at least I get):
> >
> > $ idle3 ff.py
> > Traceback (most recent call last):
> >    File "/usr/bin/idle3", line 5, in <module>
> >      main()
> >    File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in main
> >      if flist.open(filename) is None:
> >    File "/usr/lib/python3.4/idlelib/FileList.py", line 36, in open
> >      edit = self.EditorWindow(self, filename, key)
> >    File "/usr/lib/python3.4/idlelib/PyShell.py", line 126, in __init__
> >      EditorWindow.__init__(self, *args)
> >    File "/usr/lib/python3.4/idlelib/EditorWindow.py", line 294, in __init__
> >      if io.loadfile(filename):
> >    File "/usr/lib/python3.4/idlelib/IOBinding.py", line 236, in loadfile
> >      self.text.insert("1.0", chars)
> >    File "/usr/lib/python3.4/idlelib/Percolator.py", line 25, in insert
> >      self.top.insert(index, chars, tags)
> >    File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 81, in insert
> >      self.addcmd(InsertCommand(index, chars, tags))
> >    File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 116, in addcmd
> >      cmd.do(self.delegate)
> >    File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 219, in do
> >      text.insert(self.index1, self.chars, self.tags)
> >    File "/usr/lib/python3.4/idlelib/ColorDelegator.py", line 82, in insert
> >      self.delegate.insert(index, chars, tags)
> >    File "/usr/lib/python3.4/idlelib/WidgetRedirector.py", line 148, in __call__
> >      return self.tk_call(self.orig_and_operation + args)
> > _tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl
> >
> > So who/what is broken?
> 
> tcl
> The possible workaround is for Idle to translate "💩" to "\U0001f4a9" 
> (10 chars) before sending it to tk.
> 
> But some perspective.  In the console interpreter:
> 
>  >>> print("\U0001f4a9")
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
>    File "C:\Programs\Python34\lib\encodings\cp437.py", line 19, in encode
>      return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4a9' in position 0: character maps to <undefined>
> 
> So what is broken?  Windows Command Prompt.
> 
> More perspective.  tk/Idle *will* print *something* for every BMP char. 
>   Command Prompt will not.  It does not even do ucs-2 correctly. So 
> which is more broken?  Windows Command Prompt.  Who has perhaps 
> 1,000,000 times more resources, Microsoft? or the tcl/tk group?  I think 
> we all know.

Thanks Terry for the perspective.

From my side:

No complaints about python or tcl (or idle -- it's actually neater than emacs
if only emacs was not burnt into my nervous system)

Even unicode -- only marginal complaints.
I wrote http://blog.languager.org/2015/02/universal-unicode.html
precisely to say that unicode is a wonderful thing and one should be
enthusiastic about it.
[You got that better than anyone else who has spoken -- Thanks]

Xah's pages are way more comprehensive than mine.
But comprehensive can be a negative -- ultimately the unicode standard is
the most comprehensive and correspondingly impenetrable without a compass.

The only very minor complaint I would make is:
If idle is unable to deal with SMP-chars and this is known and unlikely to change
(until TK changes), why not put up a dialog of the sort:
SMP char on line <nn>
SMP support currently unimplemented -- Sorry

instead of a backtrace?

[As I said just a suggestion]
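
A rough sketch of the check such a dialog would need, assuming a hypothetical helper named non_bmp_lines; the dialog itself and any hook into Idle are left out, and the message text is the one suggested above.

```python
def non_bmp_lines(source):
    """Return the 1-based numbers of lines containing a code point above U+FFFF."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if any(ord(ch) > 0xFFFF for ch in line)
    ]


source = 'pp = "\U0001f4a9"\nprint (pp)\n'
for lineno in non_bmp_lines(source):
    print("SMP char on line {}".format(lineno))
    print("SMP support currently unimplemented -- Sorry")
```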


#87083

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-03-07 22:09 +1100
Message-ID	<54fadc70$0$13004$c3e8da3$5496439d@news.astraweb.com>
In reply to	#86986

Rustom Mody wrote:

> On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote:
[...]
>> Chris is suggesting that going from BMP to all of Unicode is not the hard
>> part. Going from ASCII to the BMP part of Unicode is the hard part. If
>> you can do that, you can go the rest of the way easily.
> 
> Depends where the going is starting from.
> I specifically named Java, Javascript, Windows... among others.
> Here are some quotes from the supplementary chars doc of Java
> http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html
> 
> | Supplementary characters are characters in the Unicode standard whose
> | code points are above U+FFFF, and which therefore cannot be described as
> | single 16-bit entities such as the char data type in the Java
> | programming language. Such characters are generally rare, but some are
> | used, for example, as part of Chinese and Japanese personal names, and
> | so support for them is commonly required for government applications in
> | East Asian countries...
> 
> | The introduction of supplementary characters unfortunately makes the
> | character model quite a bit more complicated.
> 
> | Unicode was originally designed as a fixed-width 16-bit character
> | encoding. The primitive data type char in the Java programming language
> | was intended to take advantage of this design by providing a simple data
> | type that could hold
> | any character....  Version 5.0 of the J2SE is required to support
> | version 4.0 of the Unicode standard, so it has to support supplementary
> | characters.
> 
> My conclusion: Early adopters of unicode -- Windows and Java -- were
> punished for their early adoption.  You can blame the unicode consortium,
> you can blame the babel of human languages, particularly that some use
> characters and some only (the equivalent of) what we call words.

I see you are blaming everyone except the people actually to blame.

It is 2015. Unicode 2.0 introduced the SMPs in 1996, almost twenty years
ago, the same year as the 1.0 release of Java. Java has had eight major new
releases since then. Oracle, and Sun before them, are/were serious, tier-1,
world-class major IT companies. Why haven't they done something about
introducing proper support for Unicode in Java? It's not hard -- if Python
can do it using nothing but volunteers, Oracle can do it. They could even
do it in a backwards-compatible way, by leaving the existing APIs in place
and adding new APIs.

As for Microsoft, as a member of the Unicode Consortium they have no excuse.
But I think you exaggerate the lack of support for SMPs in Windows. Some
parts of Windows have no SMP support, but they tend to be the oldest and
least important (to Microsoft) parts, like the command prompt.

Anyone have Powershell and like to see how well it supports SMP?

This Stackoverflow question suggests that post-Windows 2000, the Windows
file system has proper support for code points in the supplementary planes:

http://stackoverflow.com/questions/7870014/how-does-windows-wchar-t-handle-unicode-characters-outside-the-basic-multilingua

Or maybe not.

> Or you can skip the blame-game and simply note the fact that large
> segments of extant code-bases are currently in bug-prone or plain buggy
> state.
> 
> This includes not just bug-prone-system code such as Java and Windows but
> seemingly working code such as python 3.

What Unicode bugs do you think Python 3.3 and above have?

>> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in
>> UTF-8 and UTF-32, since that goes against the grain of the system. You
>> would have to program in artificial restrictions that otherwise don't
>> exist.
> 
> Yes  UTF-8 and UTF-32 make most of the objections to unicode 7.0
> irrelevant.

Glad you agree about that much at least.

[...]
>> Conclusion: faulty implementations of UTF-16 which incorrectly handle
>> surrogate pairs should be replaced by non-faulty implementations, or
>> changed to UTF-8 or UTF-32; incomplete Unicode implementations which
>> assume that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should
>> be upgraded.
> 
> Imagine for a moment a thought experiment -- we are not on a python but a
> java forum and please rewrite the above para.

There is no need to re-write it. If Java's only implementation of Unicode
assumes that code points are 16 bits only, then Java needs a new Unicode
implementation. (I assume that the existing one cannot be changed for
backwards-compatibility reasons.)
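
For reference, this is the arithmetic any 16-bit API has to get right for code points above U+FFFF. A sketch of the UTF-16 surrogate-pair calculation; the function name is made up for illustration, and the result is cross-checked against Python's own encoder.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20 bits remain
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


high, low = to_surrogate_pair(0x1F4A9)
print("{:04X} {:04X}".format(high, low))

# Cross-check against the built-in UTF-16 encoder.
print("\U0001f4a9".encode("utf-16-be").hex())
```

Code that indexes or slices by 16-bit unit without this step is what splits a character in half.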

> Are you addressing the vanilla java programmer? Language implementer?
> Designer? The Java-funders -- earlier Sun, now Oracle?

The last three should be considered the same people.

The vanilla Java programmer is not responsible for the short-comings of
Java's implementation.

[...]
>> > In practice, standards change.
>> > However if a standard changes so frequently that users have to
>> > play catching cook and keep asking: "Which version?" they are justified
>> > in asking "Are the standard-makers doing due diligence?"
>> 
>> Since Unicode has stability guarantees, and the encodings have not
>> changed in twenty years and will not change in the future, this argument
>> is bogus. Updating to a new version of the standard means, to a first
>> approximation, merely allocating some new code points which had
>> previously been undefined but are now defined.
>> 
>> (Code points can be flagged deprecated, but they will never be removed.)
> 
> It's not about new code points; it's about "Fits in 2 bytes" to "Does not
> fit in 2 bytes"

I quote you again:

"if a standard changes so frequently..."

The move to more than 16 bits happened once. It happened almost 20 years
ago. In what way does this count as frequent changes?

> If you call that argument bogus I call you a non computer scientist.

I am not a computer scientist, and the argument remains bogus. Unicode does
not change "frequently", and changes are backward-compatible.

> [Essentially this is my issue with the consortium: it seems to be working
> like a bunch of linguists not computer scientists]

That's rather like complaining that some computer game looks like it was
designed by games players instead of theoreticians. "Why, people have FUN
playing this, almost like it was designed by professionals who think about
gaming!!!"

Unicode is a standard intended for the handling of human languages. It is
intended as a real-life working standard, not some theoretical toy for
academics to experiment with. It is designed to be used, not to have papers
written about it. The character set part of it has effectively been
designed by linguists, and that is a good thing. But the encoding side of
things has been designed by practising computer programmers such as Rob
Pike and Ken Thompson. You might have heard of them.

> Here is Roy Smith's post that first started me thinking that something may
> be wrong with SMP
> https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ

There are plenty of things wrong with some implementations of Unicode, those
that assume all code points are two bytes.

There may be a few things wrong with the current Unicode standard, such as
missing characters, characters given the wrong name, and so forth.

But there's nothing wrong with the design of the SMP. It allows the great
majority of text, probably 99% or more, to use two bytes (UTF-16) or no
more than three bytes (UTF-8), while only relatively specialised uses need
four bytes for some code points.
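
The byte counts are easy to check. A short sketch that prints the encoded length of one ASCII character, two other BMP characters and one supplementary character in each encoding form (the sample characters are arbitrary choices):

```python
samples = ["a", "\u00e9", "\u20ac", "\U0001f4a9"]

for ch in samples:
    widths = {
        enc: len(ch.encode(enc))
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }
    print("U+{:04X}".format(ord(ch)), widths)
```

The explicit -le variants are used so that no byte order mark is counted in the lengths.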

> Some parts are here, some are from earlier posts and from my memory.
> If any details are wrong please correct:
> - 200 million records
> - Containing 4 strings with SMP characters
> - System made with python and mysql. SMP works with python, breaks mysql.
>   So whole system broke due to those 4 in 200,000,000 records

No, they broke because MySQL has buggy Unicode handling.

Bugs are not unusual. I used to have a version of Apple's Hypercard which
would lock up the whole operating system if you tried to display the
string "0^0" in a message dialog. Given that classic Mac OS was not a
proper multi-tasking OS like Unix or OS-X or even Windows, this was a real
pain. My conclusion from that is that that version of Hypercard was buggy.
What is your conclusion?

> I know enough (or not enough) of unicode to be chary of statistical
> conclusions from the above.
> My conclusion is essentially an 'existence-proof':
> 
> SMP-chars can break systems.

Oh come on. How about this instead?

X can break systems, for every conceivable value of X.

> The breakage is costly-fied by the combination
> - layman statistical assumptions
> - BMP → SMP exercises different code-paths
> 
> It is necessary but not sufficient to test print "hello world" in ASCII,
> BMP, SMP. You also have to write the hello world in the database -- mysql
> Read it from the webform -- javascript
> etc etc

Yes. This is called "integration testing". That's what professionals do.

> You could also choose to do with "astral crap" (Roy's words) what we all do
> with crap -- throw it out as early as possible.

And when Roy's customers demand that his product support emoji, or complain
that they cannot spell their own name because of his parochial and ignorant
idea of "crap", perhaps he will consider doing what he should have done
from the beginning:

Stop using MySQL, which is a joke of a database[1], and use Postgres which
does not have this problem.

[1] So I have been told.

-- 
Steven


#87084

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-07 22:33 +1100
Message-ID	<mailman.137.1425728048.21433.python-list@python.org>
In reply to	#87083

On Sat, Mar 7, 2015 at 10:09 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Stop using MySQL, which is a joke of a database[1], and use Postgres which
> does not have this problem.

I agree with the recommendation, though to be fair to MySQL, it is now
possible to store full Unicode. Though personally, I think the whole
"UTF8MB3 vs UTF8MB4" split is an embarrassment and should be abolished
*immediately* - not "we may change the meaning of UTF8 to be an alias
for UTF8MB4 in the future", just completely abolish the distinction
right now. (And deprecate the longer words.) There should be no reason
to build any kind of "UTF-8 but limited to three bytes" encoding for
anything. Ever.

But at least you can, if you configure things correctly, store any
Unicode character in your TEXT field.
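
To see which characters a three-byte-limited encoding cannot hold, it is enough to look at the UTF-8 length of each code point. A sketch in plain Python (no MySQL API is involved, and the function name is hypothetical):

```python
def needs_four_byte_utf8(text):
    """Return the characters in text whose UTF-8 encoding is four bytes long."""
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


# Only the supplementary-plane character is reported.
print(needs_four_byte_utf8("na\u00efve \u20ac \U0001f4a9"))
```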

ChrisA


#87085

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-03-07 13:53 +0200
Message-ID	<87twxxxbvd.fsf@elektro.pacujo.net>
In reply to	#87083

Steven D'Aprano <steve+comp.lang.python@pearwood.info>:

> Rustom Mody wrote:
>> My conclusion: Early adopters of unicode -- Windows and Java -- were
>> punished for their early adoption. You can blame the unicode
>> consortium, you can blame the babel of human languages, particularly
>> that some use characters and some only (the equivalent of) what we
>> call words.
>
> I see you are blaming everyone except the people actually to blame.

I don't think you need to blame anybody. I think the UCS-2 mistake was
both deplorable and very understandable. At the time it looked like the
magic bullet to get out of the 8-bit mess. While 16-bit wide wchar_t's
looked like a hugely expensive price, it was deemed forward-looking to
pay it anyway to resolve the character set problem once and for all.

Linux was lucky to join the fray late enough to benefit from the bad
UCS-2 experience. That said, UTF-8 does suffer badly from its not being
a bijective mapping.

(Linux didn't quite dodge the bullet with pthreads, threads being
another sad fad of the 1990's. The hippies that cooked up the fork
system call should be awarded the next Millennium Prize. That foresight
or stroke of luck has withstood the challenge of half a century.)

> But there's nothing wrong with the design of the SMP. It allows the
> great majority of text, probably 99% or more, to use two bytes
> (UTF-16) or no more than three bytes (UTF-8), while only relatively
> specialised uses need four bytes for some code points.

The main dream was a fixed-width encoding scheme. People thought 16 bits
would be enough. The dream is so precious and true to us in the West
that people don't want to give it up.

It may yet be that UTF-32 replaces all previous schemes since it has all
the benefits of ASCII and only one drawback: redundancy. Maybe one day
we'll declare the byte 32 bits wide and be done with it. In so many
other aspects, 32-bit "bytes" are the de-facto reality already. Even C
coders routinely use 32 bits to express boolean values.

> And when Roy's customers demand that his product support emoji, or
> complain that they cannot spell their own name because of his
> parochial and ignorant idea of "crap", perhaps he will consider doing
> what he should have done from the beginning:

That's a recurring theme: Why didn't we do IPv6 from the get-go? Why
didn't we do multi-user from the get-go? Why didn't we do localization
from the get-go?

There comes a point when you have to release to start making money. You
then suffer the consequences until your company goes bankrupt.

Marko


#87086

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-07 23:02 +1100
Message-ID	<mailman.139.1425729786.21433.python-list@python.org>
In reply to	#87085

On Sat, Mar 7, 2015 at 10:53 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> The main dream was a fixed-width encoding scheme. People thought 16 bits
> would be enough. The dream is so precious and true to us in the West
> that people don't want to give it up.

So... use Pike, or Python 3.3+?

ChrisA


#87087

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2015-03-07 14:07 +0000
Message-ID	<mailman.142.1425737245.21433.python-list@python.org>
In reply to	#87085

On 07/03/2015 12:02, Chris Angelico wrote:
> On Sat, Mar 7, 2015 at 10:53 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> The main dream was a fixed-width encoding scheme. People thought 16 bits
>> would be enough. The dream is so precious and true to us in the West
>> that people don't want to give it up.
>
> So... use Pike, or Python 3.3+?
>
> ChrisA
>

Cue obligatory cobblers from our RUE.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence


#87090

From	wxjmfauth@gmail.com
Date	2015-03-07 07:28 -0800
Message-ID	<57fb30fd-4efb-4e50-9708-96f4e108b870@googlegroups.com>
In reply to	#87085

On Saturday, 7 March 2015 12:53:24 UTC+1, Marko Rauhamaa wrote:
> 
> It may yet be that UTF-32 replaces all previous schemes since it has all
> the benefits of ASCII and only one drawback: redundancy. Maybe one day
> we'll declare the byte 32 bits wide and be done with it. In so many
> other aspects, 32-bit "bytes" are the de-facto reality already. Even C
> coders routinely use 32 bits to express boolean values.
> 

Like many, I'm using utf-32 every day on my win7 box with
2 GB of RAM.
I have never once run into a problem.

jmf


#87091

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-03-08 02:40 +1100
Message-ID	<54fb1bf4$0$12993$c3e8da3$5496439d@news.astraweb.com>
In reply to	#87085

Marko Rauhamaa wrote:

> That said, UTF-8 does suffer badly from its not being
> a bijective mapping.

Can you explain?

As far as I am aware, every code point has one and only one valid UTF-8
encoding, and every UTF-8 encoding has one and only one valid code point.

There are *invalid* UTF-8 encodings, such as CESU-8, which is sometimes
mislabelled as UTF-8 (Oracle, I'm looking at you.) It violates the rule
that valid UTF-8 encodings are the shortest possible.

E.g. SMP code points should be encoded to four bytes using UTF-8:

py> u'\U0010FF01'.encode('utf-8')  # U+10FF01
'\xf4\x8f\xbc\x81'

But in CESU-8, the code point is first interpreted as a UTF-16 surrogate
pair:

py> u'\U0010FF01'.encode('utf-16be')
'\xdb\xff\xdf\x01'

then each surrogate of the pair is treated as a 16-bit code unit and
individually encoded to three bytes using UTF-8:

py> u'\udbff'.encode('utf-8')
'\xed\xaf\xbf'
py> u'\udf01'.encode('utf-8')
'\xed\xbc\x81'

giving six bytes in total:

'\xed\xaf\xbf\xed\xbc\x81'

This is not UTF-8! But some software mislabels it as UTF-8.
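
The transcripts above are Python 2. In Python 3 the same derivation needs the surrogatepass error handler, because the strict UTF-8 codec refuses to encode or decode surrogates. A sketch, with cesu8_encode as an illustrative helper rather than a standard codec:

```python
def cesu8_encode(text):
    """Encode text the CESU-8 way: each supplementary code point becomes
    two 3-byte sequences, one per UTF-16 surrogate."""
    out = bytearray()
    for ch in text:
        if ord(ch) > 0xFFFF:
            units = ch.encode("utf-16-be")   # 4 bytes: high then low surrogate
            for i in (0, 2):
                surrogate = int.from_bytes(units[i:i + 2], "big")
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


print("\U0010ff01".encode("utf-8"))   # real UTF-8, four bytes
print(cesu8_encode("\U0010ff01"))     # CESU-8, six bytes

try:
    cesu8_encode("\U0010ff01").decode("utf-8")
except UnicodeDecodeError:
    print("a strict UTF-8 decoder rejects the CESU-8 bytes")
```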

-- 
Steven


#87092

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-03-07 17:48 +0200
Message-ID	<87twxw4xlz.fsf@elektro.pacujo.net>
In reply to	#87091

Steven D'Aprano <steve+comp.lang.python@pearwood.info>:

> Marko Rauhamaa wrote:
>
>> That said, UTF-8 does suffer badly from its not being
>> a bijective mapping.
>
> Can you explain?

In Python terms, there are bytes objects b that don't satisfy:

   b.decode('utf-8').encode('utf-8') == b


Marko


#87099

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-08 03:17 +1100
Message-ID	<mailman.145.1425745085.21433.python-list@python.org>
In reply to	#87092

On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>
>> Marko Rauhamaa wrote:
>>
>>> That said, UTF-8 does suffer badly from its not being
>>> a bijective mapping.
>>
>> Can you explain?
>
> In Python terms, there are bytes objects b that don't satisfy:
>
>    b.decode('utf-8').encode('utf-8') == b

Please provide an example; that sounds like a bug. If there is any
invalid UTF-8 stream which decodes without an error, it is actually a
security bug, and should be fixed pronto in all affected and supported
versions.

ChrisA

[toc] | [prev] | [next] | [standalone]

#87100

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-03-07 18:25 +0200
Message-ID	<87k2ysydtk.fsf@elektro.pacujo.net>
In reply to	#87099

Chris Angelico <rosuav@gmail.com>:

> On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>>
>>> Marko Rauhamaa wrote:
>>>
>>>> That said, UTF-8 does suffer badly from its not being
>>>> a bijective mapping.
>>>
>>> Can you explain?
>>
>> In Python terms, there are bytes objects b that don't satisfy:
>>
>>    b.decode('utf-8').encode('utf-8') == b
>
> Please provide an example; that sounds like a bug. If there is any
> invalid UTF-8 stream which decodes without an error, it is actually a
> security bug, and should be fixed pronto in all affected and supported
> versions.

Here's an example:

   b = b'\x80'

Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
from str objects to bytes objects.
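
Spelled out as a runnable sketch: the lone continuation byte fails to decode, while bytes that are valid UTF-8 round-trip exactly (the sample string is an arbitrary choice).

```python
b = b"\x80"
try:
    b.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)

# For bytes that do decode, the round trip is exact.
valid = "Z\u00fcrich".encode("utf-8")
print(valid.decode("utf-8").encode("utf-8") == valid)
```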


Marko


#87103

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-08 03:41 +1100
Message-ID	<mailman.148.1425746496.21433.python-list@python.org>
In reply to	#87100

On Sun, Mar 8, 2015 at 3:25 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>> On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>>> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>>>
>>>> Marko Rauhamaa wrote:
>>>>
>>>>> That said, UTF-8 does suffer badly from its not being
>>>>> a bijective mapping.
>>>>
>>>> Can you explain?
>>>
>>> In Python terms, there are bytes objects b that don't satisfy:
>>>
>>>    b.decode('utf-8').encode('utf-8') == b
>>
>> Please provide an example; that sounds like a bug. If there is any
>> invalid UTF-8 stream which decodes without an error, it is actually a
>> security bug, and should be fixed pronto in all affected and supported
>> versions.
>
> Here's an example:
>
>    b = b'\x80'
>
> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
> from str objects to bytes objects.

That's not the same as what you said. All you've proven is that there
are bit patterns which are not UTF-8 streams... which is a very
deliberate feature. How does UTF-8 *suffer* from this? It benefits
hugely!

ChrisA


#87108

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-03-07 18:54 +0200
Message-ID	<87bnk4yci1.fsf@elektro.pacujo.net>
In reply to	#87103

Chris Angelico <rosuav@gmail.com>:

> On Sun, Mar 8, 2015 at 3:25 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>>>>> Marko Rauhamaa wrote:
>>>>>> That said, UTF-8 does suffer badly from its not being
>>>>>> a bijective mapping.
>>>>>
>> Here's an example:
>>
>>    b = b'\x80'
>>
>> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
>> from str objects to bytes objects.
>
> That's not the same as what you said.

Except that it's precisely what I said.

> All you've proven is that there are bit patterns which are not UTF-8
> streams...

And that causes problems.

> which is a very deliberate feature.

Well, nobody desired it. It was just something that had to give.

I believe you *could* have defined it as a bijective mapping but then
you would have lost the sorting order correspondence.

> How does UTF-8 *suffer* from this? It benefits hugely!

You can't operate on file names and text files using Python strings. Or
at least, you will need to add (nontrivial) exception catching logic.
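
One mechanism aimed at this case is the surrogateescape error handler from PEP 383, which lets undecodable bytes survive a round trip through str. A sketch with a made-up file name:

```python
raw_name = b"report-\x80.txt"          # not valid UTF-8

name = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(name))                     # the bad byte becomes a lone surrogate

# Encoding with the same handler restores the original bytes exactly.
print(name.encode("utf-8", errors="surrogateescape") == raw_name)
```

The resulting str still cannot be encoded with the strict handler, so the exception handling moves rather than disappears.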

Marko


#87109

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-08 03:58 +1100
Message-ID	<mailman.151.1425747492.21433.python-list@python.org>
In reply to	#87108

On Sun, Mar 8, 2015 at 3:54 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> All you've proven is that there are bit patterns which are not UTF-8
>> streams...
>
> And that causes problems.

Demonstrate.

ChrisA


#87110

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-08 04:00 +1100
Message-ID	<mailman.152.1425747654.21433.python-list@python.org>
In reply to	#87108

On Sun, Mar 8, 2015 at 3:54 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> You can't operate on file names and text files using Python strings. Or
> at least, you will need to add (nontrivial) exception catching logic.

You can't operate on a JPG file using a Unicode string, nor an array
of integers. What of it? You can't operate on an array of integers
using a dictionary, either. So? How is this a failing of UTF-8?

If you really REALLY can't use the bytes() type to work with something
that is, yaknow, bytes, then you could use an alternative encoding
that has a value for every byte. It's still not Unicode text, so it
doesn't much matter which encoding you use. But it's much better to
use the bytes type to work with bytes. It is not text, so don't treat
it as text.
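
Latin-1 is one encoding of that kind (which encoding is meant above is an assumption here): it maps each of the 256 byte values to the code point with the same number, so decoding never fails and the round trip is exact.

```python
data = bytes(range(256))               # every possible byte value

as_text = data.decode("latin-1")       # never raises: one code point per byte
print(len(as_text))
print(as_text.encode("latin-1") == data)
```

The result is a str, but it is only text if the bytes really were Latin-1 to begin with.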

ChrisA

