
Newbie question about text encoding

Started by	pierrick.brihaye@gmail.com
First post	2015-02-24 02:49 -0800
Last post	2015-02-27 10:23 +1100
Articles	20 on this page of 158 — 19 participants




#87133

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-03-08 09:20 +0200
Message-ID	<87y4n8uf9a.fsf@elektro.pacujo.net>
In reply to	#87128

Steven D'Aprano <steve+comp.lang.python@pearwood.info>:

> For those cases where you do wish to take an arbitrary byte stream and
> round-trip it, Python now provides an error handler for that.
>
> py> import random
> py> b = bytes([random.randint(0, 255) for _ in range(10000)])
> py> s = b.decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 0:
> invalid start byte
> py> s = b.decode('utf-8', errors='surrogateescape')
> py> s.encode('utf-8', errors='surrogateescape') == b
> True

That is indeed a valid workaround. With it we achieve

   b.decode('utf-8', errors='surrogateescape'). \
       encode('utf-8', errors='surrogateescape') == b

for any bytes b. It goes to great lengths to address the Linux
programmer's situation.
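
A minimal sketch of that round trip, for anyone who wants to try it. The
helper name `roundtrip` is made up for illustration; the only real API
used is the `errors='surrogateescape'` argument of `bytes.decode` and
`str.encode`.

```python
import os


def roundtrip(raw: bytes) -> bytes:
    """Decode arbitrary bytes as UTF-8 with surrogateescape, then encode back."""
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


# Random bytes are almost never valid UTF-8, yet they survive the round trip.
sample = os.urandom(10_000)
assert roundtrip(sample) == sample

# An undecodable byte 0xNN is carried through the str as the lone surrogate U+DCNN.
escaped = b"\x94".decode("utf-8", errors="surrogateescape")
print(ascii(escaped))  # '\udc94'
```

`ascii()` is used for printing because a lone surrogate usually cannot be
written to a UTF-8 terminal directly.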

However,

 * it's not UTF-8 but a variant of it,

 * it sacrifices the ordering correspondence of UTF-8:

   >>> '\udc80' > 'ä'
   True
   >>> '\udc80'.encode('utf-8', errors='surrogateescape') > \
   ...        'ä'.encode('utf-8', errors='surrogateescape')
   False

 * it still isn't bijective between str and bytes:

   >>> '\udd00'.encode('utf-8', errors='surrogateescape')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeEncodeError: 'utf-8' codec can't encode character 
   '\udd00' in position 0: surrogates not allowed
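
The ordering point can be checked with a short sketch. It assumes
nothing beyond the standard `sorted` built-in and the codecs already
used above; the sample strings are arbitrary.

```python
# For well-formed text, sorting by code point and sorting by UTF-8 bytes agree.
words = ["z", "ä", "€", "\U0001F600", "a"]
by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
assert by_code_point == by_utf8_bytes

# A byte smuggled in through surrogateescape breaks that agreement:
# as a str it sorts after 'ä', as bytes it sorts before it.
escaped = b"\x80".decode("utf-8", errors="surrogateescape")  # '\udc80'
assert escaped > "ä"
assert escaped.encode("utf-8", errors="surrogateescape") < "ä".encode("utf-8")
```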


Marko


#87136

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-08 18:37 +1100
Message-ID	<mailman.163.1425800257.21433.python-list@python.org>
In reply to	#87133

On Sun, Mar 8, 2015 at 6:20 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>  * it still isn't bijective between str and bytes:
>
>    >>> '\udd00'.encode('utf-8', errors='surrogateescape')
>    Traceback (most recent call last):
>      File "<stdin>", line 1, in <module>
>    UnicodeEncodeError: 'utf-8' codec can't encode character
>    '\udd00' in position 0: surrogates not allowed

Once again, you appear to be surprised that invalid data is failing.
Why is this so strange? U+DD00 is not a valid character. It is quite
correct to throw this error.

ChrisA


#87140

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-03-08 10:09 +0200
Message-ID	<87twxvvrjl.fsf@elektro.pacujo.net>
In reply to	#87136

Chris Angelico <rosuav@gmail.com>:

> Once again, you appear to be surprised that invalid data is failing.
> Why is this so strange? U+DD00 is not a valid character. It is quite
> correct to throw this error.

'\udd00' is a valid str object:

   >>> '\udd00'
   '\udd00'
   >>> '\udd00'.encode('utf-32')
   b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
   >>> '\udd00'.encode('utf-16')
   b'\xff\xfe\x00\xdd'

I was simply stating that UTF-8 is not a bijection between unicode
strings and octet strings (even forgetting Python). Enriching Unicode
with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
without side effects.
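
A sketch of the 128 escape code points and of one such side effect. This
is an illustration under the assumption that Python's `surrogateescape`
handler is the enrichment being discussed; the variable names are
invented for the example.

```python
# Each byte 0x80..0xFF is invalid UTF-8 on its own, so surrogateescape
# maps it to the matching code point in U+DC80..U+DCFF.
escaped_chars = [
    bytes([byte]).decode("utf-8", errors="surrogateescape")
    for byte in range(0x80, 0x100)
]
assert [ord(ch) for ch in escaped_chars] == list(range(0xDC80, 0xDD00))

# Side effect: two different str objects can encode to the same bytes,
# so the str -> bytes direction is not one-to-one.
smuggled = "\udcc3\udca4"
plain = "ä"
assert smuggled != plain
assert smuggled.encode("utf-8", errors="surrogateescape") == plain.encode("utf-8")
```

So bytes -> str -> bytes always gets you back where you started, but
str -> bytes -> str does not.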


Marko


#87141

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-08 19:23 +1100
Message-ID	<mailman.166.1425803025.21433.python-list@python.org>
In reply to	#87140

On Sun, Mar 8, 2015 at 7:09 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>> Once again, you appear to be surprised that invalid data is failing.
>> Why is this so strange? U+DD00 is not a valid character. It is quite
>> correct to throw this error.
>
> '\udd00' is a valid str object:
>
>    >>> '\udd00'
>    '\udd00'
>    >>> '\udd00'.encode('utf-32')
>    b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
>    >>> '\udd00'.encode('utf-16')
>    b'\xff\xfe\x00\xdd'
>
> I was simply stating that UTF-8 is not a bijection between unicode
> strings and octet strings (even forgetting Python). Enriching Unicode
> with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
> without side effects.

But it's not a valid Unicode string, so a Unicode encoding can't be
expected to cope with it. Mathematically, 0xC0 0x80 would represent
U+0000, and some UTF-8 codecs generate and accept this (in order to
allow U+0000 without ever yielding 0x00), but that doesn't mean that
UTF-8 should allow that byte sequence.

The only reason to craft some kind of Unicode string for any arbitrary
sequence of bytes is the "smuggling" effect used for file name
handling. There is no reason to support invalid Unicode codepoints.
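
A small sketch of both points, assuming CPython's built-in UTF-8 codec:
the strict decoder refuses the overlong spelling of U+0000, while the
`surrogateescape` handler smuggles the same two bytes through a str
unchanged. On POSIX systems `os.fsdecode` and `os.fsencode` rely on the
same handler for file names, as far as I know.

```python
overlong_nul = b"\xc0\x80"  # the overlong ("modified UTF-8") spelling of U+0000

# The strict codec rejects the sequence instead of yielding '\x00'.
try:
    overlong_nul.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected

# With surrogateescape, each offending byte is carried as U+DCxx and restored on encode.
smuggled = overlong_nul.decode("utf-8", errors="surrogateescape")
assert smuggled == "\udcc0\udc80"
assert smuggled.encode("utf-8", errors="surrogateescape") == overlong_nul
```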

ChrisA


#87143

From	wxjmfauth@gmail.com
Date	2015-03-08 01:18 -0800
Message-ID	<a15b8f18-2e4d-4dde-aa62-9be5b192dd27@googlegroups.com>
In reply to	#87141

On Sunday, 8 March 2015 at 09:24:30 UTC+1, Chris Angelico wrote:
> On Sun, Mar 8, 2015 at 7:09 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> > Chris Angelico <rosuav@gmail.com>:
> >
> >> Once again, you appear to be surprised that invalid data is failing.
> >> Why is this so strange? U+DD00 is not a valid character. It is quite
> >> correct to throw this error.
> >
> > '\udd00' is a valid str object:
> >
> >    >>> '\udd00'
> >    '\udd00'
> >    >>> '\udd00'.encode('utf-32')
> >    b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
> >    >>> '\udd00'.encode('utf-16')
> >    b'\xff\xfe\x00\xdd'
> >
> > I was simply stating that UTF-8 is not a bijection between unicode
> > strings and octet strings (even forgetting Python). Enriching Unicode
> > with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
> > without side effects.
> 
> But it's not a valid Unicode string, so a Unicode encoding can't be
> expected to cope with it. Mathematically, 0xC0 0x80 would represent
> U+0000, and some UTF-8 codecs generate and accept this (in order to
> allow U+0000 without ever yielding 0x00), but that doesn't mean that
> UTF-8 should allow that byte sequence.
> 
> The only reason to craft some kind of Unicode string for any arbitrary
> sequence of bytes is the "smuggling" effect used for file name
> handling. There is no reason to support invalid Unicode codepoints.
> 
> ChrisA

Python 3 and unicode?

A disaster reflecting a lack of understanding of Unicode.

but

A (buggy) jewel for those who wish to present and teach
Unicode.

jmf


#87149

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-03-09 05:25 +1100
Message-ID	<54fc9400$0$13009$c3e8da3$5496439d@news.astraweb.com>
In reply to	#87140

Marko Rauhamaa wrote:

> Chris Angelico <rosuav@gmail.com>:
> 
>> Once again, you appear to be surprised that invalid data is failing.
>> Why is this so strange? U+DD00 is not a valid character. 

But it is a valid code point (a surrogate), just not a character.

>> It is quite correct to throw this error.
> 
> '\udd00' is a valid str object:

Is it though? Perhaps the bug is not UTF-8's inability to encode lone
surrogates, but that Python allows you to create lone surrogates in the
first place. That's not a rhetorical question. It's a genuine question.

>    >>> '\udd00'
>    '\udd00'
>    >>> '\udd00'.encode('utf-32')
>    b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
>    >>> '\udd00'.encode('utf-16')
>    b'\xff\xfe\x00\xdd'

If you explicitly specify the endianness (say, utf-16-be or -le) then you
don't get the BOMs.
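
A quick sketch of the BOM behaviour, using an ordinary character so it
runs on any Python 3. The `surrogatepass` lines at the end are an
addition: as far as I know, Python 3.4 and later refuse to encode a lone
surrogate with the UTF-16 and UTF-32 codecs unless that handler is given.

```python
import codecs

text = "a"

# The plain 'utf-16' codec prepends a BOM in the machine's native byte order.
with_bom = text.encode("utf-16")
assert with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Naming the endianness explicitly gives the bare code units, no BOM.
assert text.encode("utf-16-le") == b"a\x00"
assert text.encode("utf-16-be") == b"\x00a"
assert text.encode("utf-32-le") == b"a\x00\x00\x00"

# A lone surrogate needs the surrogatepass handler on recent versions.
lone = "\udd00"
assert lone.encode("utf-16-le", errors="surrogatepass") == b"\x00\xdd"
assert lone.encode("utf-16-be", errors="surrogatepass") == b"\xdd\x00"
```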

> I was simply stating that UTF-8 is not a bijection between unicode
> strings and octet strings (even forgetting Python). Enriching Unicode
> with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
> without side effects.

-- 
Steven


#87156

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-03-08 22:09 +0200
Message-ID	<87d24juu8r.fsf@elektro.pacujo.net>
In reply to	#87149

Steven D'Aprano <steve+comp.lang.python@pearwood.info>:

> Marko Rauhamaa wrote:
>> '\udd00' is a valid str object:
>
> Is it though? Perhaps the bug is not UTF-8's inability to encode lone
> surrogates, but that Python allows you to create lone surrogates in
> the first place. That's not a rhetorical question. It's a genuine
> question.

The problem is that no matter how you shuffle surrogates, encoding
schemes, code points and the like, a wrinkle always remains.

I'm reminded of number sets where you go from ℕ to ℤ to ℚ to ℝ to ℂ. But
that's where the buck stops; traditional arithmetic functions are closed
under ℂ.

Unicode apparently hasn't found a similar closure.

That's why I think that while UTF-8 is a fabulous way to bring Unicode
to Linux, Linux should have taken the tack that Unicode is always an
application-level interpretation with few operating system tie-ins.
Unfortunately, the GNU world is busy trying to build a Unicode frosting
everywhere. The illusion can never be complete but is convincing enough
for application developers to forget to handle corner cases.

To answer your question, I think every code point from 0 to 1114111
should be treated as valid and analogous. Thus Python is correct here:

   >>> len('\udd00')
   1
   >>> len('\ufeff')
   1

The alternatives are far too messy to consider.
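
For reference, a sketch of how Python itself treats that range. It
assumes Python 3.3 or later, where `sys.maxunicode` is always 0x10FFFF;
`unicodedata.category` is used only to show that the two code points
above differ in kind even though both have length 1.

```python
import sys
import unicodedata

# The highest code point a str can hold is 1114111 (0x10FFFF).
assert sys.maxunicode == 0x10FFFF == 1114111

# Every code point in range, surrogates included, is one str element.
assert len(chr(0xDD00)) == 1
assert len(chr(sys.maxunicode)) == 1

# One past the end is refused.
try:
    chr(sys.maxunicode + 1)
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
assert not out_of_range_ok

# Same length, different general categories: surrogate vs. format character.
assert unicodedata.category("\udd00") == "Cs"
assert unicodedata.category("\ufeff") == "Cf"
```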

Marko


#87166

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-03-09 12:43 +1100
Message-ID	<54fcfac0$0$12995$c3e8da3$5496439d@news.astraweb.com>
In reply to	#87156

Marko Rauhamaa wrote:

> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
> 
>> Marko Rauhamaa wrote:
>>> '\udd00' is a valid str object:
>>
>> Is it though? Perhaps the bug is not UTF-8's inability to encode lone
>> surrogates, but that Python allows you to create lone surrogates in
>> the first place. That's not a rhetorical question. It's a genuine
>> question.
> 
> The problem is that no matter how you shuffle surrogates, encoding
> schemes, code points and the like, a wrinkle always remains.

Really? Define your terms. Can you define "wrinkles", and prove that it is
impossible to remove them? What's so bad about wrinkles anyway?

> I'm reminded of number sets where you go from ℕ to ℤ to ℚ to ℝ to ℂ. But
> that's where the buck stops; traditional arithmetic functions are closed
> under ℂ.

That's simply incorrect. What's z/(0+0i)?
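
Python's own complex type shows the gap. A minimal sketch, where
`try_divide` is a made-up helper and not anything from the standard
library:

```python
def try_divide(z: complex, w: complex):
    """Return z / w, or None where the quotient is undefined."""
    try:
        return z / w
    except ZeroDivisionError:
        return None


# Division by a non-zero complex number is fine...
assert try_divide(1 + 2j, 1j) == 2 - 1j

# ...but division is not closed over ℂ: dividing by 0+0i has no answer.
assert try_divide(1 + 2j, 0 + 0j) is None
```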

There are many more number sets used by mathematicians, some going back to
the 1800s. Here are just a few:

* ℝ-overbar or [−∞, +∞], which adds a pair of infinities to ℝ.

* ℝ-caret or ℝ+{∞}, which does the same but with a single 
  unsigned infinity.

* A similar extended version of ℂ with a single infinity.

* Split-complex or hyperbolic numbers, defined similarly to ℂ 
  except with i**2 = +1 (rather than the complex i**2 = -1).

* Dual numbers, which add a single infinitesimal number ε != 0 
  with the property that ε**2 = 0.

* Hyperreal numbers.

* John Conway's surreal numbers, which may be the largest 
  possible ordered field, in the sense that they contain all 
  finite, infinite and infinitesimal numbers. (The hyperreals 
  can be embedded in the surreals; the dual numbers cannot, 
  since ε**2 = 0 with ε != 0 is impossible in a field.)
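
To make one of these concrete, here is a rough sketch of the dual
numbers. The `Dual` class is invented for this example and supports only
addition and multiplication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A number a + b*ε, where ε != 0 but ε**2 == 0."""

    real: float
    eps: float = 0.0

    def __add__(self, other: "Dual") -> "Dual":
        return Dual(self.real + other.real, self.eps + other.eps)

    def __mul__(self, other: "Dual") -> "Dual":
        # (a + bε)(c + dε) = ac + (ad + bc)ε, because the ε² term vanishes.
        return Dual(
            self.real * other.real,
            self.real * other.eps + self.eps * other.real,
        )


EPSILON = Dual(0.0, 1.0)
ZERO = Dual(0.0, 0.0)

assert EPSILON != ZERO            # ε is not zero...
assert EPSILON * EPSILON == ZERO  # ...but its square is.
```

A non-zero element whose square is zero is exactly what a field cannot
contain, which is why the dual numbers are a separate construction and
not another step up the ladder from ℝ.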

The process of extending ℝ to ℂ is formally known as the Cayley–Dickson
construction, and there is an infinite number of algebras (and hence number
sets) which can be constructed this way. The next few are:

* Hamilton's quaternions ℍ, very useful for dealing with rotations 
  in 3D space. They fell out of favour for some decades, but are now
  experiencing something of a renaissance.

* Octonions or Cayley numbers.

* Sedenions.

> Unicode apparently hasn't found a similar closure.

Similar in what way? And why do you think this is important?

It is not a requirement for every possible byte sequence to be a valid
Unicode string, any more than it is a requirement for every possible byte
sequence to be valid JPG, zip archive, or ELF executable. Some byte strings
simply are not JPG images, zip archives or ELF executables -- or Unicode
strings. So what?

Why do you think that is a problem that needs fixing by the Unicode
standard? It may be a problem that needs fixing by (for example)
programming languages, and Python invented the surrogateescape error
handler to smuggle such invalid bytes into strings. Other solutions may
exist as well.
But that's not part of Unicode and it isn't a problem for Unicode.
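
For anyone who hasn't seen that mechanism, a small sketch of how Python 3 uses it (the sample bytes are arbitrary): each undecodable byte is mapped to a lone surrogate on the way in and mapped back on the way out.

```python
raw = b"abc\x80\xff"                    # not valid UTF-8

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")

# A strict encoder refuses the smuggled surrogates.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The resulting str is a Python-level convenience, not a valid Unicode string in the standard's sense, which is why the strict encoder rejects it.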

> That's why I think that while UTF-8 is a fabulous way to bring Unicode
> to Linux, Linux should have taken the tack that Unicode is always an
> application-level interpretation with few operating system tie-ins.

"Should have"? That is *exactly* the status quo, and while it was the only
practical solution given Linux's history, it's a horrible idea. That
Unicode is stuck on top of an OS which is unaware of Unicode is precisely
why we're left with problems like "how do you represent arbitrary bytes as
Unicode strings?".

> Unfortunately, the GNU world is busy trying to build a Unicode frosting
> everywhere. The illusion can never be complete but is convincing enough
> for application developers to forget to handle corner cases.
> 
> To answer your question, I think every code point from 0 to 1114111
> should be treated as valid and analogous. 

Your opinion isn't very relevant. What is relevant is what the Unicode
standard demands, and I think it requires that strings containing
surrogates are illegal (rather like x/0 is illegal in the real numbers).
Wikipedia states:

    The Unicode standard permanently reserves these code point 
    values [U+D800 to U+DFFF] for UTF-16 encoding of the high 
    and low surrogates, and they will never be assigned a 
    character, so there should be no reason to encode them. The 
    official Unicode standard says that no UTF forms, including 
    UTF-16, can encode these code points.

    However UCS-2, UTF-8, and UTF-32 can encode these code points
    in trivial and obvious ways, and large amounts of software 
    does so even though the standard states that such arrangements
    should be treated as encoding errors. It is possible to 
    unambiguously encode them in UTF-16 by using a code unit equal
    to the code point, as long as no sequence of two code units can
    be interpreted as a legal surrogate pair (that is, as long as a
    high surrogate is never followed by a low surrogate). The 
    majority of UTF-16 encoder and decoder implementations translate
    between encodings as though this were the case.

http://en.wikipedia.org/wiki/UTF-16
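
For reference, the surrogate-pair arithmetic that those reserved code points exist for can be sketched as follows. The function name `to_surrogate_pair` is made up for this example; the result is cross-checked against Python's own UTF-16 encoder.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F600)

# Cross-check against the big-endian UTF-16 encoder.
encoded = "\U0001F600".encode("utf-16-be")
```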

So yet again we are left with the conclusion that *buggy implementations* of
Unicode cause problems, not the Unicode standard itself.

> Thus Python is correct here: 
> 
>    >>> len('\udd00')
>    1
>    >>> len('\ufeff')
>    1
> 
> The alternatives are far too messy to consider.

Not at all. '\udd00' should be a SyntaxError.

-- 
Steven

#87167

From	Ben Finney <ben+python@benfinney.id.au>
Date	2015-03-09 13:09 +1100
Message-ID	<mailman.181.1425866967.21433.python-list@python.org>
In reply to	#87166

Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:

> '\udd00' should be a SyntaxError.

I find your argument convincing, that attempting to construct a Unicode
string of a lone surrogate should be an error.

Shouldn't the error type be a ValueError, though? The statement is not,
to my mind, erroneous syntax.

-- 
 \     “Please do not feed the animals. If you have any suitable food, |
  `\                     give it to the guard on duty.” —zoo, Budapest |
_o__)                                                                  |
Ben Finney

#87173

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-03-09 08:31 +0200
Message-ID	<87zj7mu1fj.fsf@elektro.pacujo.net>
In reply to	#87167

Ben Finney <ben+python@benfinney.id.au>:

> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>
>> '\udd00' should be a SyntaxError.
>
> I find your argument convincing, that attempting to construct a
> Unicode string of a lone surrogate should be an error.

Then we're back to square one:

   >>> b'\x80'.decode('utf-8', errors='surrogateescape')
   '\udc80'


Marko

#87169

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-09 13:18 +1100
Message-ID	<mailman.183.1425867521.21433.python-list@python.org>
In reply to	#87166

On Mon, Mar 9, 2015 at 1:09 PM, Ben Finney <ben+python@benfinney.id.au> wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>
>> '\udd00' should be a SyntaxError.
>
> I find your argument convincing, that attempting to construct a Unicode
> string of a lone surrogate should be an error.
>
> Shouldn't the error type be a ValueError, though? The statement is not,
> to my mind, erroneous syntax.

For the string literal, I would say SyntaxError is more appropriate
than ValueError, as a string object has to be constructed at
compilation time.

I'd still like to see a report from someone who has used a language
that specifically disallows all surrogates in strings. Does it help?
Is it more hassle than it's worth? Are there weird edge cases that it
breaks?

ChrisA

#87171

From	random832@fastmail.us
Date	2015-03-09 00:27 -0400
Message-ID	<mailman.184.1425875286.21433.python-list@python.org>
In reply to	#87166

On Sun, Mar 8, 2015, at 22:09, Ben Finney wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> 
> > '\udd00' should be a SyntaxError.
> 
> I find your argument convincing, that attempting to construct a Unicode
> string of a lone surrogate should be an error.
> 
> Shouldn't the error type be a ValueError, though? The statement is not,
> to my mind, erroneous syntax.

In this hypothetical, it's a problem with evaluating a literal - in the
same way that '\U12345', or '\U00110000', is.
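
A sketch of how current CPython 3 treats these literals, using `compile()` so the failures can be caught (the helper `literal_error` is invented for the example):

```python
def literal_error(source):
    """Compile a string-literal expression; return the exception class name, or None."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError as exc:
        return type(exc).__name__
    return None

truncated = literal_error(r"'\U12345'")       # too few hex digits
too_large = literal_error(r"'\U00110000'")    # beyond U+10FFFF
surrogate = literal_error(r"'\udd00'")        # accepted by current CPython

# The run-time analogue of an out-of-range code point is a ValueError.
try:
    chr(0x110000)
    chr_error = None
except ValueError as exc:
    chr_error = type(exc).__name__
```

So the existing precedent for bad escapes in literals is SyntaxError, while the same mistake made at run time through `chr()` gives ValueError.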

#87159

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-09 07:55 +1100
Message-ID	<mailman.174.1425848148.21433.python-list@python.org>
In reply to	#87149

On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Marko Rauhamaa wrote:
>
>> Chris Angelico <rosuav@gmail.com>:
>>
>>> Once again, you appear to be surprised that invalid data is failing.
>>> Why is this so strange? U+DD00 is not a valid character.
>
> But it is a valid non-character code point.
>
>>> It is quite correct to throw this error.
>>
>> '\udd00' is a valid str object:
>
> Is it though? Perhaps the bug is not UTF-8's inability to encode lone
> surrogates, but that Python allows you to create lone surrogates in the
> first place. That's not a rhetorical question. It's a genuine question.

Ah, I see the confusion. Yes, it is plausible to permit the UTF-8-like
encoding of surrogates; but it's illegal according to the RFC:

https://tools.ietf.org/html/rfc3629
"""
   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.
"""

They're not valid characters, and the UTF-8 spec explicitly says that
they must not be encoded. Python is fully spec-compliant in rejecting
these. Some encoders [1] will permit them, but the resulting stream is
invalid UTF-8, just as CESU-8 and Modified UTF-8 are (the latter being
"UTF-8, only U+0000 is represented as C0 80").

ChrisA

[1] eg http://pike.lysator.liu.se/generated/manual/modref/ex/predef_3A_3A/string_to_utf8.html
optionally

#87160

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-09 08:13 +1100
Message-ID	<mailman.175.1425849237.21433.python-list@python.org>
In reply to	#87149

On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Perhaps the bug is not UTF-8's inability to encode lone
> surrogates, but that Python allows you to create lone surrogates in the
> first place. That's not a rhetorical question. It's a genuine question.

As to the notion of rejecting the construction of strings containing
these invalid codepoints, I'm not sure. Are there any languages out
there that have a Unicode string type that requires that all
codepoints be valid (no surrogates, no U+FFFE, etc)? This is the kind
of thing that's usually done in an obscure language before it hits a
mainstream one.

Pike is similar to Python here. I can create a string with invalid
code points in it:

> "\uFFFE\uDD00";
(1) Result: "\ufffe\udd00"

but I can't UTF-8 encode that:

> string_to_utf8("\uFFFE\uDD00");
Character 0x0000dd00 at index 1 is in the surrogate range and therefore invalid.
Unknown program: string_to_utf8("\ufffe\udd00")
HilfeInput:1: HilfeInput()->___HilfeWrapper()

Or, using the streaming UTF-8 encoder instead of the short-hand:

> Charset.encoder("UTF-8")->feed("\uFFFE\uDD00")->drain();
Error encoding "\ufffe"[0xdd00] using utf8: Unsupported character 56576.
/usr/local/pike/8.1.0/lib/modules/_Charset.so:1:
    _Charset.UTF8enc()->feed("\ufffe\udd00")
HilfeInput:1: HilfeInput()->___HilfeWrapper()
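
For comparison, roughly the same experiment in Python 3, written as a script rather than an interactive session. The `surrogatepass` handler at the end is Python's opt-in for the non-standard "UTF-8-like" form.

```python
s = "\ufffe\udd00"          # constructing the string is allowed
length = len(s)

try:
    s.encode("utf-8")       # the strict encoder rejects the surrogate
    error_index = None
except UnicodeEncodeError as exc:
    error_index = exc.start  # index of the offending code point

# Explicitly opting in produces bytes that are not valid UTF-8.
permissive = s.encode("utf-8", errors="surrogatepass")
```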

Does anyone know of a language where you can't even construct the string?

ChrisA

#87174

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-03-09 17:34 +1100
Message-ID	<54fd3f10$0$12977$c3e8da3$5496439d@news.astraweb.com>
In reply to	#87160

Chris Angelico wrote:

> As to the notion of rejecting the construction of strings containing
> these invalid codepoints, I'm not sure. Are there any languages out
> there that have a Unicode string type that requires that all
> codepoints be valid (no surrogates, no U+FFFE, etc)?

U+FFFE and U+FFFF are *noncharacters*, not invalid. There are a total of 66
noncharacters in Unicode, and they are legal in strings.

http://www.unicode.org/faq/private_use.html#nonchar8

I think the only illegal code points are surrogates. Surrogates should only
appear as bytes in UTF-16 byte-strings.
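
The figure of 66 can be checked with a short sketch. The ranges below are the noncharacter ranges as I understand them from the Unicode FAQ: U+FDD0..U+FDEF plus the last two code points of every plane.

```python
import unicodedata

# U+FDD0..U+FDEF (32 code points) plus the last two code points of each
# of the 17 planes (34 code points).
noncharacters = list(range(0xFDD0, 0xFDF0))
for plane in range(17):
    noncharacters += [plane * 0x10000 + 0xFFFE, plane * 0x10000 + 0xFFFF]

count = len(noncharacters)

# Noncharacters encode without complaint; their category is Cn (unassigned).
encoded = "\ufffe".encode("utf-8")
nonchar_category = unicodedata.category("\ufffe")

# Surrogates have a category of their own, Cs.
surrogate_category = unicodedata.category("\udd00")
```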

-- 
Steven

#87175

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-09 17:44 +1100
Message-ID	<mailman.186.1425883497.21433.python-list@python.org>
In reply to	#87174

On Mon, Mar 9, 2015 at 5:34 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Chris Angelico wrote:
>
>> As to the notion of rejecting the construction of strings containing
>> these invalid codepoints, I'm not sure. Are there any languages out
>> there that have a Unicode string type that requires that all
>> codepoints be valid (no surrogates, no U+FFFE, etc)?
>
> U+FFFE and U+FFFF are *noncharacters*, not invalid. There are a total of 66
> noncharacters in Unicode, and they are legal in strings.
>
> http://www.unicode.org/faq/private_use.html#nonchar8
>
> I think the only illegal code points are surrogates. Surrogates should only
> appear as bytes in UTF-16 byte-strings.

U+FFFE would cause problems at the beginning of a UTF-16 stream, as it
could be mistaken for a BOM - that's why it's a noncharacter. But
sure, let's leave them out of the discussion. The question is whether
surrogates are legal or not.
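
The BOM confusion can be shown in a few lines of Python (a sketch using the standard `codecs` constants):

```python
import codecs

# U+FFFE serialised big-endian is the byte pair FF FE ...
swapped = "\ufffe".encode("utf-16-be")

# ... which is exactly the little-endian byte order mark.
looks_like_bom = swapped == codecs.BOM_UTF16_LE

# The generic "utf-16" decoder consumes a leading FF FE as a BOM and
# reads the remaining bytes as little-endian.
decoded = b"\xff\xfeA\x00".decode("utf-16")
```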

ChrisA

#87176

From	wxjmfauth@gmail.com
Date	2015-03-09 02:08 -0700
Message-ID	<c19fade0-587b-4055-85b9-695baa10cf97@googlegroups.com>
In reply to	#87175

********************************************************************

In Unicode, a string is a sequence of characters,
not a sequence of code points and definitely not
a sequence of bytes.

********************************************************************

jmf

#87207

From	wxjmfauth@gmail.com
Date	2015-03-09 07:26 -0700
Message-ID	<8908135b-a092-446b-a281-c9c081a5f3d0@googlegroups.com>
In reply to	#87176

Le lundi 9 mars 2015 10:08:48 UTC+1, wxjm...@gmail.com a écrit :
> ********************************************************************
> 
> In Unicode, a string is a sequence of characters,
> not a sequence of code points and definitely not
> a sequence of bytes.
> 
> ********************************************************************
> 
> jmf

Mea culpa. That's not really correct.
It did not express what I wanted to say.

jmf

#87193

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-03-09 05:28 -0700
Message-ID	<3c91b930-2642-456e-a26d-3edd4cd014d1@googlegroups.com>
In reply to	#87174

On Monday, March 9, 2015 at 12:05:05 PM UTC+5:30, Steven D'Aprano wrote:
> Chris Angelico wrote:
> 
> > As to the notion of rejecting the construction of strings containing
> > these invalid codepoints, I'm not sure. Are there any languages out
> > there that have a Unicode string type that requires that all
> > codepoints be valid (no surrogates, no U+FFFE, etc)?
> 
> U+FFFE and U+FFFF are *noncharacters*, not invalid. There are a total of 66
> noncharacters in Unicode, and they are legal in strings.

Interesting -- Thanks!
I wonder whether that's one more instance of the anti-pattern (other thread)?
Number that's not a number -- NaN
Pointer that points nowhere -- NULL
SQL data that's not there but there -- null

> 
> http://www.unicode.org/faq/private_use.html#nonchar8
> 
> I think the only illegal code points are surrogates. Surrogates should only
> appear as bytes in UTF-16 byte-strings.

Even more interesting: So there's a whole hierarchy of illegality??
Could you suggest some good reference for 'surrogate'?

#87139

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-03-08 19:01 +1100
Message-ID	<54fc01cf$0$12995$c3e8da3$5496439d@news.astraweb.com>
In reply to	#87128

Steven D'Aprano wrote:

> Marko Rauhamaa wrote:
> 
>> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>> 
>>> Marko Rauhamaa wrote:
>>>
>>>> That said, UTF-8 does suffer badly from its not being
>>>> a bijective mapping.
>>>
>>> Can you explain?
>> 
>> In Python terms, there are bytes objects b that don't satisfy:
>> 
>>    b.decode('utf-8').encode('utf-8') == b
> 
> Are you talking about the fact that not all byte streams are valid UTF-8?
> That is, some byte objects b may raise an exception on b.decode('utf-8').

Eh, I should have read the rest of the thread before replying...

> I don't see why that means UTF-8 "suffers badly" from this. Can you give
> an example of where you would expect to take an arbitrary byte-stream,
> decode it as UTF-8, and expect the results to be meaningful?

File names on Unix-like systems.
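
A sketch of what goes wrong, using a made-up file name in Latin-1 bytes (legal as a Linux file name, but not valid UTF-8):

```python
name = b"caf\xe9.txt"        # hypothetical file name, Latin-1 encoded

try:
    name.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "error"

# Replacing bad bytes with U+FFFD yields a string, but the round trip is lossy.
lossy = name.decode("utf-8", errors="replace")
round_trip = lossy.encode("utf-8")
```

Either the decode fails outright or the bytes that come back differ from the bytes that went in, which is the failure of `b.decode('utf-8').encode('utf-8') == b` being discussed.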

Unfortunately file names are a bit of a mess, but we're slowly converging on
Unicode support for files. I reckon that by 2070, 2080 tops, we'll have
that licked...

The three major operating systems have different levels of support for
Unicode file names:

* Apple OS X: HFS+ stores file names in decomposed form, using UTF-16. I
think this is the strictest Unicode support of all common file systems.
Well done Apple. Decomposed in this sense means that single code points may
be expanded where possible, e.g. é U+00E9 LATIN SMALL LETTER E WITH ACUTE
will be stored as two code points, U+0065 LATIN SMALL LETTER E + U+0301
COMBINING ACUTE ACCENT.

* Windows: NTFS stores file names as sequences of 16-bit code units except
0x0000. (Additional restrictions also apply: e.g. in POSIX mode, / is also
forbidden; in Win32 mode, / ? + etc. are forbidden.) The code units are
interpreted as UTF-16 but the file system doesn't prevent you from creating
file names with invalid sequences.

* Linux: ext2/ext3 stores file names as arbitrary bytes except for / and
nul. However most Linux distributions treat file names as if they were
UTF-8 (displaying ? glyphs for undecodable bytes), and many Linux GUI file
managers enforce the rule that file names are valid UTF-8.
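
The decomposition mentioned in the OS X item can be sketched with `unicodedata`. This uses standard NFD as an approximation; as far as I know, HFS+ applies its own variant of decomposition, so treat it as illustrative only.

```python
import unicodedata

composed = "\u00e9"                                   # é as one code point
decomposed = unicodedata.normalize("NFD", composed)   # e + combining acute

code_points = [f"U+{ord(ch):04X}" for ch in decomposed]

# The two spellings compare unequal until normalised the same way.
same_raw = composed == decomposed
same_nfc = unicodedata.normalize("NFC", decomposed) == composed
```

This is why a name typed in composed form may not compare equal to the name the file system hands back.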

File systems on removable media (FAT32, UDF, ISO-9660 with or without
extensions such as Joliet and Rock Ridge) have their own issues, but
generally speaking don't support Unicode well or at all.

So although the current situation is still a bit of a mess, there is a slow
move towards file names which are valid Unicode.

-- 
Steven
