| Started by | pierrick.brihaye@gmail.com |
|---|---|
| First post | 2015-02-24 02:49 -0800 |
| Last post | 2015-02-27 10:23 +1100 |
| Articles | 20 on this page of 158 — 19 participants |
Newbie question about text encoding pierrick.brihaye@gmail.com - 2015-02-24 02:49 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-24 22:09 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 06:25 -0500
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 15:55 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:03 +1100
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:06 +0100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-24 08:01 -0800
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:07 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:10 +1100
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:24 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:33 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-02-24 10:38 -0500
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 17:20 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 03:24 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 12:13 -0500
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:45 +0100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-02-25 00:21 +0200
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:20 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-25 06:34 -0800
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:57 +0100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:19 +1100
Re: Newbie question about text encoding Marcos Almeida Azevedo <marcos.al.azevedo@gmail.com> - 2015-02-25 12:54 +0800
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 15:41 -0500
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 04:40 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 05:15 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 00:24 +1100
Re: Newbie question about text encoding Sam Raker <sam.raker@gmail.com> - 2015-02-26 08:45 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:08 -0800
Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-02-26 12:02 -0500
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:59 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-26 12:20 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 09:13 +1100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 12:05 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-26 20:57 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 16:58 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 02:30 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 22:54 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 09:02 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 01:22 +1100
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:00 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 03:12 +1100
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:45 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 04:45 +1100
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:13 +0000
Re: Newbie question about text encoding MRAB <python@mrabarnett.plus.com> - 2015-02-27 19:14 +0000
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:09 +0000
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 15:52 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 08:04 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 10:24 -0500
Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:46 +0000
Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:47 +0000
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-27 01:06 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 11:59 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 10:03 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-03 10:36 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:45 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 15:54 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 21:05 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 01:06 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-05 06:59 -0800
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-05 14:59 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 09:33 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-05 20:53 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 16:20 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:02 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:06 -0800
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 08:33 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 00:39 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:03 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 01:11 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:27 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 03:26 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 20:54 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 02:07 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 01:50 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 02:27 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 07:37 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 08:20 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 03:45 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:41 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:58 -0800
Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-07 01:11 -0500
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 23:43 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 00:55 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 01:08 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:25 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 22:09 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 22:33 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 13:53 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 23:02 +1100
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:07 +0000
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 07:28 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 02:40 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 17:48 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:17 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:25 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:41 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:54 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:58 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:00 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:14 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:26 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:50 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:59 +1100
Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:02 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:13 +1100
Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:34 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:44 +1100
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 19:00 +0000
Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 19:16 +0000
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 21:01 +0200
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 16:40 +0000
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:48 +0200
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 17:02 +0000
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:16 +0200
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 18:18 +0000
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:06 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:53 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 11:03 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 12:45 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 09:20 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 18:37 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 10:09 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 19:23 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-08 01:18 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:25 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 22:09 +0200
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 12:43 +1100
Re: Newbie question about text encoding Ben Finney <ben+python@benfinney.id.au> - 2015-03-09 13:09 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-09 08:31 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 13:18 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-09 00:27 -0400
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 07:55 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 08:13 +1100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 17:34 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 17:44 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 02:08 -0700
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 07:26 -0700
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-09 05:28 -0700
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 19:01 +1100
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:13 +0000
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 23:23 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:30 +1100
Re: Newbie question about text encoding Cameron Simpson <cs@zip.com.au> - 2015-03-09 13:09 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-08 19:42 -0700
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:16 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 05:43 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 18:53 -0800
Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-03 18:30 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 13:54 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 14:02 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:05 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:16 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:14 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-04 02:16 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 04:29 +1100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 10:09 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 10:23 +1100
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-03-08 09:20 +0200 |
| Message-ID | <87y4n8uf9a.fsf@elektro.pacujo.net> |
| In reply to | #87128 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
> For those cases where you do wish to take an arbitrary byte stream and
> round-trip it, Python now provides an error handler for that.
>
> py> import random
> py> b = bytes([random.randint(0, 255) for _ in range(10000)])
> py> s = b.decode('utf-8')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 0:
> invalid start byte
> py> s = b.decode('utf-8', errors='surrogateescape')
> py> s.encode('utf-8', errors='surrogateescape') == b
> True
That is indeed a valid workaround. With it we achieve
b.decode('utf-8', errors='surrogateescape'). \
encode('utf-8', errors='surrogateescape') == b
for any bytes b. It goes to great lengths to address the Linux
programmer's situation.
However,
* it's not UTF-8 but a variant of it,
* it sacrifices the ordering correspondence of UTF-8:
>>> '\udc80' > 'ä'
True
>>> '\udc80'.encode('utf-8', errors='surrogateescape') > \
... 'ä'.encode('utf-8', errors='surrogateescape')
False
* it still isn't bijective between str and bytes:
>>> '\udd00'.encode('utf-8', errors='surrogateescape')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character
'\udd00' in position 0: surrogates not allowed
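The last bullet can be made concrete from the other direction as well. The sketch below is an illustration only: `smuggle` and `unsmuggle` are hypothetical helper names, not anything from the thread or the standard library. It checks that bytes -> str -> bytes is lossless for every two-byte input, and then shows a str built from escape code points that does not survive str -> bytes -> str.

```python
# Illustration only: the helper names are hypothetical, not from the thread.
HANDLER = "surrogateescape"

def smuggle(raw: bytes) -> str:
    """Decode arbitrary bytes; undecodable bytes become U+DC80..U+DCFF."""
    return raw.decode("utf-8", errors=HANDLER)

def unsmuggle(text: str) -> bytes:
    """Turn escape code points back into the raw bytes they stand for."""
    return text.encode("utf-8", errors=HANDLER)

# bytes -> str -> bytes: checked exhaustively for all 65536 two-byte inputs.
all_pairs_ok = all(
    unsmuggle(smuggle(bytes([hi, lo]))) == bytes([hi, lo])
    for hi in range(256)
    for lo in range(256)
)

# str -> bytes -> str: two escape code points whose bytes happen to form
# valid UTF-8 come back as the real character instead.
escaped = "\udcc3\udca4"        # escape forms of bytes 0xC3 and 0xA4
as_bytes = unsmuggle(escaped)   # the same bytes as 'ä'.encode('utf-8')
came_back = smuggle(as_bytes)   # decodes as 'ä', not as the two surrogates

print(all_pairs_ok, as_bytes, came_back == escaped)
```

So the guarantee runs one way: any bytes object survives the round trip, but not every str does, even when it contains only code points from the escape range.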
Marko
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-08 18:37 +1100 |
| Message-ID | <mailman.163.1425800257.21433.python-list@python.org> |
| In reply to | #87133 |
On Sun, Mar 8, 2015 at 6:20 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> * it still isn't bijective between str and bytes:
>
> >>> '\udd00'.encode('utf-8', errors='surrogateescape')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character
> '\udd00' in position 0: surrogates not allowed
Once again, you appear to be surprised that invalid data is failing.
Why is this so strange? U+DD00 is not a valid character. It is quite
correct to throw this error.
ChrisA
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-03-08 10:09 +0200 |
| Message-ID | <87twxvvrjl.fsf@elektro.pacujo.net> |
| In reply to | #87136 |
Chris Angelico <rosuav@gmail.com>:
> Once again, you appear to be surprised that invalid data is failing.
> Why is this so strange? U+DD00 is not a valid character. It is quite
> correct to throw this error.
'\udd00' is a valid str object:
>>> '\udd00'
'\udd00'
>>> '\udd00'.encode('utf-32')
b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
>>> '\udd00'.encode('utf-16')
b'\xff\xfe\x00\xdd'
I was simply stating that UTF-8 is not a bijection between unicode
strings and octet strings (even forgetting Python). Enriching Unicode
with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
without side effects.
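A small sketch of the 128 escape code points mentioned above, plus a note on versions. As far as I know, from Python 3.4 onward the strict utf-16 and utf-32 encoders also reject lone surrogates, so the transcript above would need the `surrogatepass` error handler to reproduce on a current interpreter. The variable and function names below are made up for the illustration.

```python
# Each undecodable byte 0x80..0xFF maps to the code point 0xDC00 + byte.
escape_map = {
    byte: bytes([byte]).decode("utf-8", errors="surrogateescape")
    for byte in range(0x80, 0x100)
}
maps_to_dc_range = all(ord(ch) == 0xDC00 + byte for byte, ch in escape_map.items())

lone = "\udd00"  # a surrogate outside the U+DC80..U+DCFF escape range

def strict_encode_fails(codec: str) -> bool:
    """True if the codec's default (strict) mode rejects the lone surrogate."""
    try:
        lone.encode(codec)
    except UnicodeEncodeError:
        return True
    return False

# 'surrogatepass' asks the codec to encode a surrogate like any other code point.
passed_utf8 = lone.encode("utf-8", errors="surrogatepass")
passed_utf16 = lone.encode("utf-16-le", errors="surrogatepass")
passed_utf32 = lone.encode("utf-32-le", errors="surrogatepass")

print(maps_to_dc_range, strict_encode_fails("utf-8"), passed_utf8, passed_utf16, passed_utf32)
```

The explicit `-le` codecs are used so that no BOM is mixed into the output.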
Marko
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-08 19:23 +1100 |
| Message-ID | <mailman.166.1425803025.21433.python-list@python.org> |
| In reply to | #87140 |
On Sun, Mar 8, 2015 at 7:09 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>> Once again, you appear to be surprised that invalid data is failing.
>> Why is this so strange? U+DD00 is not a valid character. It is quite
>> correct to throw this error.
>
> '\udd00' is a valid str object:
>
> >>> '\udd00'
> '\udd00'
> >>> '\udd00'.encode('utf-32')
> b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
> >>> '\udd00'.encode('utf-16')
> b'\xff\xfe\x00\xdd'
>
> I was simply stating that UTF-8 is not a bijection between unicode
> strings and octet strings (even forgetting Python). Enriching Unicode
> with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
> without side effects.
But it's not a valid Unicode string, so a Unicode encoding can't be
expected to cope with it. Mathematically, 0xC0 0x80 would represent
U+0000, and some UTF-8 codecs generate and accept this (in order to
allow U+0000 without ever yielding 0x00), but that doesn't mean that
UTF-8 should allow that byte sequence.
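To illustrate the 0xC0 0x80 point: applying the two-byte bit layout mechanically does yield U+0000, yet Python's strict UTF-8 decoder refuses the sequence. This is a minimal sketch; `decode_two_byte` is a hypothetical helper that deliberately skips all validity checks.

```python
def decode_two_byte(seq: bytes) -> int:
    """Apply the two-byte UTF-8 layout (110xxxxx 10xxxxxx) with no validity checks."""
    lead, cont = seq
    return ((lead & 0x1F) << 6) | (cont & 0x3F)

overlong_nul = b"\xc0\x80"
arithmetic_value = decode_two_byte(overlong_nul)  # the "mathematical" reading

# The real decoder rejects the overlong form.
try:
    overlong_nul.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

shortest_form = "\x00".encode("utf-8")  # the only encoding UTF-8 permits for U+0000

print(arithmetic_value, strict_accepts, shortest_form)
```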
The only reason to craft some kind of Unicode string for any arbitrary
sequence of bytes is the "smuggling" effect used for file name
handling. There is no reason to support invalid Unicode codepoints.
ChrisA
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-03-08 01:18 -0800 |
| Message-ID | <a15b8f18-2e4d-4dde-aa62-9be5b192dd27@googlegroups.com> |
| In reply to | #87141 |
Le dimanche 8 mars 2015 09:24:30 UTC+1, Chris Angelico a écrit :
> On Sun, Mar 8, 2015 at 7:09 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> > Chris Angelico <rosuav@gmail.com>:
> >
> >> Once again, you appear to be surprised that invalid data is failing.
> >> Why is this so strange? U+DD00 is not a valid character. It is quite
> >> correct to throw this error.
> >
> > '\udd00' is a valid str object:
> >
> > >>> '\udd00'
> > '\udd00'
> > >>> '\udd00'.encode('utf-32')
> > b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
> > >>> '\udd00'.encode('utf-16')
> > b'\xff\xfe\x00\xdd'
> >
> > I was simply stating that UTF-8 is not a bijection between unicode
> > strings and octet strings (even forgetting Python). Enriching Unicode
> > with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
> > without side effects.
>
> But it's not a valid Unicode string, so a Unicode encoding can't be
> expected to cope with it. Mathematically, 0xC0 0x80 would represent
> U+0000, and some UTF-8 codecs generate and accept this (in order to
> allow U+0000 without ever yielding 0x00), but that doesn't mean that
> UTF-8 should allow that byte sequence.
>
> The only reason to craft some kind of Unicode string for any arbitrary
> sequence of bytes is the "smuggling" effect used for file name
> handling. There is no reason to support invalid Unicode codepoints.
>
> ChrisA
Python 3 and unicode?
A disaster reflecting a non understanding of Unicode.
but
A (buggy) jewel for those who wish to present and teach
Unicode.
jmf
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-03-09 05:25 +1100 |
| Message-ID | <54fc9400$0$13009$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #87140 |
Marko Rauhamaa wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>> Once again, you appear to be surprised that invalid data is failing.
>> Why is this so strange? U+DD00 is not a valid character.
But it is a valid non-character code point.
>> It is quite correct to throw this error.
>
> '\udd00' is a valid str object:
Is it though? Perhaps the bug is not UTF-8's inability to encode lone
surrogates, but that Python allows you to create lone surrogates in the
first place. That's not a rhetorical question. It's a genuine question.
> >>> '\udd00'
> '\udd00'
> >>> '\udd00'.encode('utf-32')
> b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
> >>> '\udd00'.encode('utf-16')
> b'\xff\xfe\x00\xdd'
If you explicitly specify the endianness (say, utf-16-be or -le) then you
don't get the BOMs.
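A quick sketch of the BOM remark, using an ordinary character so that it runs on any Python 3 version. The plain `utf-16` codec writes in the platform's native byte order, so the check accepts either BOM.

```python
import codecs

text = "\u00e4"                     # 'ä'
with_bom = text.encode("utf-16")    # native byte order, BOM prepended
little = text.encode("utf-16-le")   # explicit endianness, no BOM
big = text.encode("utf-16-be")

starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
payload_matches = with_bom[2:] in (little, big)

print(with_bom, little, big, starts_with_bom, payload_matches)
```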
> I was simply stating that UTF-8 is not a bijection between unicode
> strings and octet strings (even forgetting Python). Enriching Unicode
> with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
> without side effects.
--
Steven
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-03-08 22:09 +0200 |
| Message-ID | <87d24juu8r.fsf@elektro.pacujo.net> |
| In reply to | #87149 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
> Marko Rauhamaa wrote:
>> '\udd00' is a valid str object:
>
> Is it though? Perhaps the bug is not UTF-8's inability to encode lone
> surrogates, but that Python allows you to create lone surrogates in
> the first place. That's not a rhetorical question. It's a genuine
> question.
The problem is that no matter how you shuffle surrogates, encoding
schemes, coding points and the like, a wrinkle always remains.
I'm reminded of number sets where you go from ℕ to ℤ to ℚ to ℝ to ℂ. But
that's where the buck stops; traditional arithmetic functions are closed
under ℂ.
Unicode apparently hasn't found a similar closure.
That's why I think that while UTF-8 is a fabulous way to bring Unicode
to Linux, Linux should have taken the tack that Unicode is always an
application-level interpretation with few operating system tie-ins.
Unfortunately, the GNU world is busy trying to build a Unicode frosting
everywhere. The illusion can never be complete but is convincing enough
for application developers to forget to handle corner cases.
To answer your question, I think every code point from 0 to 1114111
should be treated as valid and analogous. Thus Python is correct here:
>>> len('\udd00')
1
>>> len('\ufeff')
1
The alternatives are far too messy to consider.
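For reference, the range quoted above is what CPython itself exposes: 1114111 is `sys.maxunicode` (0x10FFFF), and `chr()` accepts every integer up to it, surrogates included. A minimal check:

```python
import sys

highest = sys.maxunicode        # 1114111 on Python 3.3 and later
top_char = chr(highest)

# One past the top of the range is refused.
try:
    chr(highest + 1)
    beyond_ok = True
except ValueError:
    beyond_ok = False

# Every code point, whether assigned, surrogate or BOM, is one element of a str.
lengths = [len(chr(cp)) for cp in (0x41, 0xDD00, 0xFEFF, highest)]

print(highest, beyond_ok, lengths)
```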
Marko
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-03-09 12:43 +1100 |
| Message-ID | <54fcfac0$0$12995$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #87156 |
Marko Rauhamaa wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>
>> Marko Rauhamaa wrote:
>>> '\udd00' is a valid str object:
>>
>> Is it though? Perhaps the bug is not UTF-8's inability to encode lone
>> surrogates, but that Python allows you to create lone surrogates in
>> the first place. That's not a rhetorical question. It's a genuine
>> question.
>
> The problem is that no matter how you shuffle surrogates, encoding
> schemes, coding points and the like, a wrinkle always remains.
Really? Define your terms. Can you define "wrinkles", and prove that it is
impossible to remove them? What's so bad about wrinkles anyway?
> I'm reminded of number sets where you go from ℕ to ℤ to ℚ to ℝ to ℂ. But
> that's where the buck stops; traditional arithmetic functions are closed
> under ℂ.
That's simply incorrect. What's z/(0+0i)?
There are many more number sets used by mathematicians, some going back to
the 1800s. Here are just a few:
* ℝ-overbar or [−∞, +∞], which adds a pair of infinities to ℝ.
* ℝ-caret or ℝ+{∞}, which does the same but with a single
unsigned infinity.
* A similar extended version of ℂ with a single infinity.
* Split-complex or hyperbolic numbers, defined similarly to ℂ
except with i**2 = +1 (rather than the complex i**2 = -1).
* Dual numbers, which add a single infinitesimal number ε != 0
with the property that ε**2 = 0.
* Hyperreal numbers.
* John Conway's surreal numbers, which may be the largest
possible set, in the sense that it can construct all finite,
infinite and infinitesimal numbers. (The hyperreals and dual
numbers can be considered subsets of the surreals.)
The process of extending ℝ to ℂ is formally known as Cayley–Dickson
construction, and there is an infinite number of algebras (and hence number
sets) which can be constructed this way. The next few are:
* Hamilton's quaternions ℍ, very useful for dealing with rotations
in 3D space. They fell out of favour for some decades, but are now
experiencing something of a renaissance.
* Octonions or Cayley numbers.
* Sedenions.
> Unicode apparently hasn't found a similar closure.
Similar in what way? And why do you think this is important?
It is not a requirement for every possible byte sequence to be a valid
Unicode string, any more than it is a requirement for every possible byte
sequence to be valid JPG, zip archive, or ELF executable. Some byte strings
simply are not JPG images, zip archives or ELF executables -- or Unicode
strings. So what?
Why do you think that is a problem that needs fixing by the Unicode
standard? It may be a problem that needs fixing by (for example)
programming languages, and Python invented the surrogateescape error handler to
smuggle such invalid bytes into strings. Other solutions may exist as well.
But that's not part of Unicode and it isn't a problem for Unicode.
> That's why I think that while UTF-8 is a fabulous way to bring Unicode
> to Linux, Linux should have taken the tack that Unicode is always an
> application-level interpretation with few operating system tie-ins.
"Should have"? That is *exactly* the status quo, and while it was the only
practical solution given Linux's history, it's a horrible idea. That
Unicode is stuck on top of an OS which is unaware of Unicode is precisely
why we're left with problems like "how do you represent arbitrary bytes as
Unicode strings?".
> Unfortunately, the GNU world is busy trying to build a Unicode frosting
> everywhere. The illusion can never be complete but is convincing enough
> for application developers to forget to handle corner cases.
>
> To answer your question, I think every code point from 0 to 1114111
> should be treated as valid and analogous.
Your opinion isn't very relevant. What is relevant is what the Unicode
standard demands, and I think it requires that strings containing
surrogates are illegal (rather like x/0 is illegal in the real numbers).
Wikipedia states:
The Unicode standard permanently reserves these code point
values [U+D800 to U+DFFF] for UTF-16 encoding of the high
and low surrogates, and they will never be assigned a
character, so there should be no reason to encode them. The
official Unicode standard says that no UTF forms, including
UTF-16, can encode these code points.
However UCS-2, UTF-8, and UTF-32 can encode these code points
in trivial and obvious ways, and large amounts of software
does so even though the standard states that such arrangements
should be treated as encoding errors. It is possible to
unambiguously encode them in UTF-16 by using a code unit equal
to the code point, as long as no sequence of two code units can
be interpreted as a legal surrogate pair (that is, as long as a
high surrogate is never followed by a low surrogate). The
majority of UTF-16 encoder and decoder implementations translate
between encodings as though this were the case.
http://en.wikipedia.org/wiki/UTF-16
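For readers who have not seen the mechanism the quoted passage refers to, here is a sketch of how UTF-16 uses the reserved range: a code point above U+FFFF is split into a high and a low surrogate. The function name is made up for the example, and the chosen code point (U+1F600) is arbitrary.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

cp = 0x1F600
high, low = to_surrogate_pair(cp)

# Compare against what the codec produces.
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
matches = chr(cp).encode("utf-16-be") == expected

print(hex(high), hex(low), matches)
```

This is why the surrogate code points are never assigned characters of their own: in UTF-16 they only ever appear as halves of such a pair.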
So yet again we are left with the conclusion that *buggy implementations* of
Unicode cause problems, not the Unicode standard itself.
> Thus Python is correct here:
>
> >>> len('\udd00')
> 1
> >>> len('\ufeff')
> 1
>
> The alternatives are far too messy to consider.
Not at all. '\udd00' should be a SyntaxError.
--
Steven
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2015-03-09 13:09 +1100 |
| Message-ID | <mailman.181.1425866967.21433.python-list@python.org> |
| In reply to | #87166 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> '\udd00' should be a SyntaxError.
I find your argument convincing, that attempting to construct a Unicode
string of a lone surrogate should be an error.
Shouldn't the error type be a ValueError, though? The statement is not,
to my mind, erroneous syntax.
--
 \ “Please do not feed the animals. If you have any suitable food, |
`\ give it to the guard on duty.” —zoo, Budapest |
_o__) | Ben Finney
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-03-09 08:31 +0200 |
| Message-ID | <87zj7mu1fj.fsf@elektro.pacujo.net> |
| In reply to | #87167 |
Ben Finney <ben+python@benfinney.id.au>:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>
>> '\udd00' should be a SyntaxError.
>
> I find your argument convincing, that attempting to construct a
> Unicode string of a lone surrogate should be an error.
Then we're back to square one:
>>> b'\x80'.decode('utf-8', errors='surrogateescape')
'\udc80'
Marko
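A short sketch (not from the thread) of why the 'surrogateescape' example above matters: the lone surrogate is how Python smuggles an undecodable byte through a str so that it can be restored later. The sample bytes are made up for illustration.

```python
raw = b'caf\x80'  # \x80 on its own is not valid UTF-8

# Undecodable bytes 0x80..0xFF become lone surrogates U+DC80..U+DCFF.
text = raw.decode('utf-8', errors='surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode('utf-8', errors='surrogateescape')

# A strict encode of the same str fails, because it holds a lone surrogate.
try:
    text.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(text), restored == raw, strict_ok)
```

If str refused to hold lone surrogates, this round trip would need some other representation for the escaped byte.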
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-09 13:18 +1100 |
| Message-ID | <mailman.183.1425867521.21433.python-list@python.org> |
| In reply to | #87166 |
On Mon, Mar 9, 2015 at 1:09 PM, Ben Finney <ben+python@benfinney.id.au> wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>
>> '\udd00' should be a SyntaxError.
>
> I find your argument convincing, that attempting to construct a Unicode
> string of a lone surrogate should be an error.
>
> Shouldn't the error type be a ValueError, though? The statement is not,
> to my mind, erroneous syntax.
For the string literal, I would say SyntaxError is more appropriate
than ValueError, as a string object has to be constructed at
compilation time.
I'd still like to see a report from someone who has used a language
that specifically disallows all surrogates in strings. Does it help?
Is it more hassle than it's worth? Are there weird edge cases that it
breaks?
ChrisA
[toc] | [prev] | [next] | [standalone]
| From | random832@fastmail.us |
|---|---|
| Date | 2015-03-09 00:27 -0400 |
| Message-ID | <mailman.184.1425875286.21433.python-list@python.org> |
| In reply to | #87166 |
On Sun, Mar 8, 2015, at 22:09, Ben Finney wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>
> > '\udd00' should be a SyntaxError.
>
> I find your argument convincing, that attempting to construct a Unicode
> string of a lone surrogate should be an error.
>
> Shouldn't the error type be a ValueError, though? The statement is not,
> to my mind, erroneous syntax.
In this hypothetical, it's a problem with evaluating a literal - in the
same way that '\U12345', or '\U00110000', is.
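The comparison can be checked directly. This is a small sketch (not from the thread) that compiles each literal as source text; the helper name literal_status is made up for illustration.

```python
def literal_status(source):
    """Hypothetical helper: compile a literal, return 'ok' or the exception name."""
    try:
        compile(source, '<literal>', 'eval')
        return 'ok'
    except SyntaxError as exc:
        return type(exc).__name__


# Raw strings keep the backslash escapes intact until compile() sees them.
literals = (r"'\U12345'", r"'\U00110000'", r"'\udd00'")
results = {src: literal_status(src) for src in literals}

for src, status in results.items():
    print(src, status)
```

In current CPython the truncated escape and the out-of-range code point are rejected at compile time, while the lone surrogate literal compiles, which is the behaviour the posters are debating.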
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-09 07:55 +1100 |
| Message-ID | <mailman.174.1425848148.21433.python-list@python.org> |
| In reply to | #87149 |
On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Marko Rauhamaa wrote:
>
>> Chris Angelico <rosuav@gmail.com>:
>>
>>> Once again, you appear to be surprised that invalid data is failing.
>>> Why is this so strange? U+DD00 is not a valid character.
>
> But it is a valid non-character code point.
>
>>> It is quite correct to throw this error.
>>
>> '\udd00' is a valid str object:
>
> Is it though? Perhaps the bug is not UTF-8's inability to encode lone
> surrogates, but that Python allows you to create lone surrogates in the
> first place. That's not a rhetorical question. It's a genuine question.
Ah, I see the confusion. Yes, it is plausible to permit the UTF-8-like
encoding of surrogates; but it's illegal according to the RFC:
https://tools.ietf.org/html/rfc3629
"""
The definition of UTF-8 prohibits encoding character numbers between
U+D800 and U+DFFF, which are reserved for use with the UTF-16
encoding form (as surrogate pairs) and do not directly represent
characters.
"""
They're not valid characters, and the UTF-8 spec explicitly says that
they must not be encoded. Python is fully spec-compliant in rejecting
these. Some encoders [1] will permit them, but the resulting stream is
invalid UTF-8, just as CESU-8 and Modified UTF-8 are (the latter being
"UTF-8, only U+0000 is represented as C0 80").
ChrisA
[1] eg http://pike.lysator.liu.se/generated/manual/modref/ex/predef_3A_3A/string_to_utf8.html
optionally
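A minimal sketch (not from the thread) of the strict behaviour described above, alongside the lenient "UTF-8-like" form that the 'surrogatepass' handler produces. The helper name strict_decodes is made up for illustration.

```python
lone = '\udd00'

# Strict UTF-8 refuses to encode a surrogate code point, as RFC 3629 requires.
try:
    lone.encode('utf-8')
    strict_result = 'encoded'
except UnicodeEncodeError as exc:
    strict_result = type(exc).__name__

# 'surrogatepass' emits the three-byte form anyway; the result is not valid UTF-8.
lenient = lone.encode('utf-8', 'surrogatepass')


def strict_decodes(data):
    """Hypothetical helper: does the strict UTF-8 decoder accept these bytes?"""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


surrogate_bytes_ok = strict_decodes(lenient)     # the encoded surrogate
modified_nul_ok = strict_decodes(b'\xc0\x80')    # Modified UTF-8 spelling of U+0000

print(strict_result, lenient, surrogate_bytes_ok, modified_nul_ok)
```

Both the encoded surrogate and the C0 80 spelling of U+0000 are rejected by the strict decoder, which matches the point that those streams are not valid UTF-8.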
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-09 08:13 +1100 |
| Message-ID | <mailman.175.1425849237.21433.python-list@python.org> |
| In reply to | #87149 |
On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Perhaps the bug is not UTF-8's inability to encode lone
> surrogates, but that Python allows you to create lone surrogates in the
> first place. That's not a rhetorical question. It's a genuine question.
As to the notion of rejecting the construction of strings containing
these invalid codepoints, I'm not sure. Are there any languages out
there that have a Unicode string type that requires that all
codepoints be valid (no surrogates, no U+FFFE, etc)? This is the kind
of thing that's usually done in an obscure language before it hits a
mainstream one.
Pike is similar to Python here. I can create a string with invalid
code points in it:
> "\uFFFE\uDD00";
(1) Result: "\ufffe\udd00"
but I can't UTF-8 encode that:
> string_to_utf8("\uFFFE\uDD00");
Character 0x0000dd00 at index 1 is in the surrogate range and therefore invalid.
Unknown program: string_to_utf8("\ufffe\udd00")
HilfeInput:1: HilfeInput()->___HilfeWrapper()
Or, using the streaming UTF-8 encoder instead of the short-hand:
> Charset.encoder("UTF-8")->feed("\uFFFE\uDD00")->drain();
Error encoding "\ufffe"[0xdd00] using utf8: Unsupported character 56576.
/usr/local/pike/8.1.0/lib/modules/_Charset.so:1:
_Charset.UTF8enc()->feed("\ufffe\udd00")
HilfeInput:1: HilfeInput()->___HilfeWrapper()
Does anyone know of a language where you can't even construct the string?
ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-03-09 17:34 +1100 |
| Message-ID | <54fd3f10$0$12977$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #87160 |
Chris Angelico wrote:
> As to the notion of rejecting the construction of strings containing
> these invalid codepoints, I'm not sure. Are there any languages out
> there that have a Unicode string type that requires that all
> codepoints be valid (no surrogates, no U+FFFE, etc)?
U+FFFE and U+FFFF are *noncharacters*, not invalid. There are a total of 66
noncharacters in Unicode, and they are legal in strings.
http://www.unicode.org/faq/private_use.html#nonchar8
I think the only illegal code points are surrogates. Surrogates should only
appear as bytes in UTF-16 byte-strings.
--
Steven
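The figure of 66 can be reproduced with a short sketch (not from the thread). It assumes the usual definition of the noncharacters: the block U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. The function name is_noncharacter is made up for illustration.

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
count = len(noncharacters)  # 32 in the block + 2 * 17 planes

# Unlike surrogates, noncharacters survive a strict UTF-8 round trip in Python.
all_round_trip = all(
    chr(cp).encode('utf-8').decode('utf-8') == chr(cp) for cp in noncharacters
)
fffe_bytes = '\ufffe'.encode('utf-8')

print(count, all_round_trip, fffe_bytes)
```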
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-09 17:44 +1100 |
| Message-ID | <mailman.186.1425883497.21433.python-list@python.org> |
| In reply to | #87174 |
On Mon, Mar 9, 2015 at 5:34 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Chris Angelico wrote:
>
>> As to the notion of rejecting the construction of strings containing
>> these invalid codepoints, I'm not sure. Are there any languages out
>> there that have a Unicode string type that requires that all
>> codepoints be valid (no surrogates, no U+FFFE, etc)?
>
> U+FFFE and U+FFFF are *noncharacters*, not invalid. There are a total of 66
> noncharacters in Unicode, and they are legal in strings.
>
> http://www.unicode.org/faq/private_use.html#nonchar8
>
> I think the only illegal code points are surrogates. Surrogates should only
> appear as bytes in UTF-16 byte-strings.
U+FFFE would cause problems at the beginning of a UTF-16 stream, as it
could be mistaken for a BOM - that's why it's a noncharacter. But
sure, let's leave them out of the discussion. The question is whether
surrogates are legal or not.
ChrisA
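A small sketch (not from the thread) of the BOM confusion mentioned above. The sample stream is made up for illustration: a big-endian UTF-16 stream that starts with U+FFFE looks, byte for byte, like a little-endian stream that starts with a BOM.

```python
import codecs

# U+FFFE serialised big-endian is identical to the little-endian BOM bytes.
swapped = '\ufffe'.encode('utf-16-be')
looks_like_le_bom = (swapped == codecs.BOM_UTF16_LE)

# Hypothetical big-endian stream: U+FFFE followed by 'A'.
stream = '\ufffeA'.encode('utf-16-be')

# Decoded with the intended byte order, the text comes back unchanged.
as_intended = stream.decode('utf-16-be')

# The generic 'utf-16' codec takes the first two bytes for a little-endian BOM,
# drops them, and reads the rest with the wrong byte order.
as_guessed = stream.decode('utf-16')

print(looks_like_le_bom, ascii(as_intended), ascii(as_guessed))
```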
[toc] | [prev] | [next] | [standalone]
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-03-09 02:08 -0700 |
| Message-ID | <c19fade0-587b-4055-85b9-695baa10cf97@googlegroups.com> |
| In reply to | #87175 |
********************************************************************
In Unicode, a string is a sequence of characters,
not a sequence of code points and definitely not
a sequence of bytes.
********************************************************************
jmf
[toc] | [prev] | [next] | [standalone]
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-03-09 07:26 -0700 |
| Message-ID | <8908135b-a092-446b-a281-c9c081a5f3d0@googlegroups.com> |
| In reply to | #87176 |
On Monday, 9 March 2015 10:08:48 UTC+1, wxjm...@gmail.com wrote:
> ********************************************************************
>
> In Unicode, a string is a sequence of characters,
> not a sequence of code points and definitely not
> a sequence of bytes.
>
> ********************************************************************
>
> jmf
Mea culpa. It's not really correct. It did not express what I wanted
to say.
jmf
[toc] | [prev] | [next] | [standalone]
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2015-03-09 05:28 -0700 |
| Message-ID | <3c91b930-2642-456e-a26d-3edd4cd014d1@googlegroups.com> |
| In reply to | #87174 |
On Monday, March 9, 2015 at 12:05:05 PM UTC+5:30, Steven D'Aprano wrote:
> Chris Angelico wrote:
>
> > As to the notion of rejecting the construction of strings containing
> > these invalid codepoints, I'm not sure. Are there any languages out
> > there that have a Unicode string type that requires that all
> > codepoints be valid (no surrogates, no U+FFFE, etc)?
>
> U+FFFE and U+FFFF are *noncharacters*, not invalid. There are a total of 66
> noncharacters in Unicode, and they are legal in strings.
Interesting -- Thanks!
I wonder whether that's one more instance of the anti-pattern (other thread)?
Number that's not a number -- NaN
Pointer that points nowhere -- NULL
SQL data that's not there but there -- null
>
> http://www.unicode.org/faq/private_use.html#nonchar8
>
> I think the only illegal code points are surrogates. Surrogates should only
> appear as bytes in UTF-16 byte-strings.
Even more interesting: So there's a whole hierarchy of illegality??
Could you suggest some good reference for 'surrogate'?
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-03-08 19:01 +1100 |
| Message-ID | <54fc01cf$0$12995$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #87128 |
Steven D'Aprano wrote:
> Marko Rauhamaa wrote:
>
>> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>>
>>> Marko Rauhamaa wrote:
>>>
>>>> That said, UTF-8 does suffer badly from its not being
>>>> a bijective mapping.
>>>
>>> Can you explain?
>>
>> In Python terms, there are bytes objects b that don't satisfy:
>>
>> b.decode('utf-8').encode('utf-8') == b
>
> Are you talking about the fact that not all byte streams are valid UTF-8?
> That is, some byte objects b may raise an exception on b.decode('utf-8').
Eh, I should have read the rest of the thread before replying...
> I don't see why that means UTF-8 "suffers badly" from this. Can you give
> an example of where you would expect to take an arbitrary byte-stream,
> decode it as UTF-8, and expect the results to be meaningful?
File names on Unix-like systems.
Unfortunately file names are a bit of a mess, but we're slowly converging on
Unicode support for files. I reckon that by 2070, 2080 tops, we'll have
that licked...
The three major operating systems have different levels of support for
Unicode file names:
* Apple OS X: HFS+ stores file names in decomposed form, using UTF-16. I
think this is the strictest Unicode support of all common file systems.
Well done Apple. Decomposed in this sense means that single code points may
be expanded where possible, e.g. é U+00E9 LATIN SMALL LETTER E WITH ACUTE
will be stored as two code points, U+0065 LATIN SMALL LETTER E + U+0301
COMBINING ACUTE ACCENT.
* Windows: NTFS stores file names as sequences of 16-bit code units except
0x0000. (Additional restrictions also apply: e.g. in POSIX mode, / is also
forbidden; in Win32 mode, / ? + etc. are forbidden.) The code units are
interpreted as UTF-16 but the file system doesn't prevent you from creating
file names with invalid sequences.
* Linux: ext2/ext3 stores file names as arbitrary bytes except for / and
nul. However most Linux distributions treat file names as if they were
UTF-8 (displaying ? glyphs for undecodable bytes), and many Linux GUI file
managers enforce the rule that file names are valid UTF-8.
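The decomposition described in the HFS+ item above can be illustrated with the standard unicodedata module. This is only a sketch (not from the thread): it uses plain NFD, and whether HFS+ applies exactly this normalization form is an assumption here.

```python
import unicodedata

composed = '\u00e9'  # LATIN SMALL LETTER E WITH ACUTE
decomposed = unicodedata.normalize('NFD', composed)

# One code point becomes two: the base letter plus a combining accent.
code_points = ['U+%04X' % ord(ch) for ch in decomposed]
names = [unicodedata.name(ch) for ch in decomposed]

# The two spellings render alike but compare unequal until normalized.
equal_raw = (composed == decomposed)
equal_nfc = (unicodedata.normalize('NFC', decomposed) == composed)

print(code_points, names, equal_raw, equal_nfc)
```

This is why a file name typed in composed form may not compare equal to the name read back from a file system that stores the decomposed form.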
File systems on removable media (FAT32, UDF, ISO-9660 with or without
extensions such as Joliet and Rock Ridge) have their own issues, but
generally speaking don't support Unicode well or at all.
So although the current situation is still a bit of a mess, there is a slow
move towards file names which are valid Unicode.
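As a sketch (not from the thread) of how Python copes with file names that are arbitrary bytes, os.fsdecode and os.fsencode convert between bytes and str. This assumes a POSIX system, where those functions use the 'surrogateescape' handler; the file name is made up, and no file is created.

```python
import os

# Hypothetical file name containing a byte that is not valid UTF-8.
raw_name = b'report-\xff.txt'

# On POSIX the undecodable byte is escaped rather than rejected.
as_text = os.fsdecode(raw_name)

# Converting back yields the original bytes, so the file can still be opened.
round_tripped = os.fsencode(as_text)

print(ascii(as_text), round_tripped == raw_name)
```

The exact str produced depends on the file system encoding in use, so only the round trip is something to rely on.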
--
Steven
[toc] | [prev] | [next] | [standalone]