Re: Encoding of surrogate code points to UTF-8

From Terry Reedy <tjreedy@udel.edu>
Subject Re: Encoding of surrogate code points to UTF-8
Date Tue, 08 Oct 2013 17:47:15 -0400
Newsgroups comp.lang.python

On 10/8/2013 9:52 AM, Steven D'Aprano wrote:
> I think this is a bug in Python's UTF-8 handling, but I'm not sure.
>
> If I've read the Unicode FAQs correctly, you cannot encode *lone*
> surrogate code points into UTF-8:
>
> http://www.unicode.org/faq/utf_bom.html#utf8-5
>
> Sure enough, using Python 3.3:
>
> py> surr = '\udc80'

I am pretty sure that if Python were being strict, that would raise an
error, as the result is not a valid Unicode string. Whether to allow the
above was debated, and laxness won out for at least the following
practical reasons.

1. Python itself uses the invalid surrogate codepoints for 
surrogateescape error-handling.
http://www.python.org/dev/peps/pep-0383/

2. Invalid strings are needed for tests ;-)
-- like the one you do next.

3. Invalid strings may be needed for interfacing with other C APIs.
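
Reason 1 can be illustrated with a minimal sketch, assuming Python 3.1
or later (where PEP 383 is implemented). The byte string `raw` is a
made-up example, not something from the original question. Decoding with
surrogateescape turns each undecodable byte into a lone low surrogate,
and encoding with the same handler turns it back:

```python
# A made-up byte string that is not valid UTF-8: 0x80 is a lone
# continuation byte.
raw = b'abc\x80'

# surrogateescape maps each undecodable byte 0x80..0xFF to the lone
# surrogate U+DC80..U+DCFF instead of raising UnicodeDecodeError.
text = raw.decode('utf-8', errors='surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode('utf-8', errors='surrogateescape')

print(ascii(text))
print(round_trip == raw)
```

The intermediate `text` is exactly the kind of "invalid" string under
discussion: it contains the same lone surrogate '\udc80' as the example
above.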

> py> surr.encode('utf-8')
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
> position 0: surrogates not allowed

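Default strict encoding (utf-8 or otherwise) will only encode valid
Unicode strings. To encode an invalid string containing surrogate
codepoints, use surrogateescape error handling, which covers the lone
surrogates U+DC80 to U+DCFF that it produces when decoding, or
surrogatepass, which lets the utf-8 codec encode any surrogate.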
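
As a rough sketch of the difference between the error handlers, assuming
Python 3.1 or later and reusing the `surr` value from the quoted
session: strict encoding fails, while the `surrogateescape` and
`surrogatepass` handlers each produce bytes, though different ones.

```python
surr = '\udc80'

# Strict (default) encoding rejects the lone surrogate.
try:
    surr.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# surrogateescape turns U+DC80 back into the single raw byte 0x80.
escaped = surr.encode('utf-8', errors='surrogateescape')

# surrogatepass writes the surrogate as a 3-byte sequence that strict
# UTF-8 decoders will refuse to read.
passed = surr.encode('utf-8', errors='surrogatepass')

print(strict_failed, escaped, passed)
```

Neither result is well-formed UTF-8, so both should only be handed to
code that expects them.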

> But reading the previous entry in the FAQs:
>
> http://www.unicode.org/faq/utf_bom.html#utf8-4
>
> I interpret this as meaning that I should be able to encode valid pairs
> of surrogates.

It says you should be able to 'convert' them, and that the result for
utf-8 encoding must be a single 4-byte sequence for the corresponding
supplementary codepoint.
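
A sketch of what that conversion amounts to, using the standard UTF-16
arithmetic. The helper name `combine_surrogates` is hypothetical, not a
stdlib function, and the pair 0xD800, 0xDC01 is the UTF-16 form of
LINEAR B SYLLABLE B038 E:

```python
def combine_surrogates(high, low):
    """Return the supplementary code point for a UTF-16 surrogate pair.

    Hypothetical helper for illustration; not part of the stdlib.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid high/low surrogate pair')
    # Each surrogate carries 10 bits of the value above 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD800, 0xDC01)
char = chr(code_point)

# The single supplementary code point encodes to one 4-byte sequence.
utf8 = char.encode('utf-8')

print(hex(code_point), utf8)
```

The conversion happens on the pair of UTF-16 code units, before any
UTF-8 encoding; a Python string holding the two surrogates as separate
code points is a different thing.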

> So if I find a code point that encodes to a surrogate pair
> in UTF-16:
>
> py> c = '\N{LINEAR B SYLLABLE B038 E}'
> py> surr_pair = c.encode('utf-16be')
> py> print(surr_pair)
> b'\xd8\x00\xdc\x01'
>
> and then use those same values as the code points, I ought to be able to
> encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
> point. But I can't:
>
> py> s = '\ud800\udc01'

This is now a string with two invalid codepoints instead of one ;-).
As above, it would be rejected if Python were being strict.
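
A small check, offered as a sketch, that the literal really holds two
separate surrogate code points rather than one supplementary character.
The helper `has_surrogates` is hypothetical:

```python
import unicodedata

s = '\ud800\udc01'

# Two code points, not one: the escapes are not combined into U+10001.
length = len(s)

# Both belong to the general category 'Cs' (surrogate).
categories = [unicodedata.category(ch) for ch in s]


def has_surrogates(text):
    """Return True if text contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


print(length, categories, has_surrogates(s))
```

A test like `has_surrogates` is one way to find out ahead of time
whether a strict encode will fail.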

> py> s.encode('utf-8')
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
> position 0: surrogates not allowed
>
>
> Have I misunderstood? I think that Python is being too strict about
> rejecting surrogate code points.

No, it is being too lax about allowing them at all.

I believe there is an issue on the tracker (maybe closed) about the doc
for Unicode escapes in string literals. Perhaps it should say more
clearly that inserting surrogates is allowed but results in an invalid
string that cannot be normally encoded.

-- 
Terry Jan Reedy
