From: Marko Rauhamaa
Newsgroups: comp.lang.python
Subject: Re: Newbie question about text encoding
Date: Sat, 07 Mar 2015 18:25:43 +0200

Chris Angelico:

> On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa wrote:
>> Steven D'Aprano:
>>
>>> Marko Rauhamaa wrote:
>>>
>>>> That said, UTF-8 does suffer badly from its not being
>>>> a bijective mapping.
>>>
>>> Can you explain?
>>
>> In Python terms, there are bytes objects b that don't satisfy:
>>
>>     b.decode('utf-8').encode('utf-8') == b
>
> Please provide an example; that sounds like a bug.
> If there is any invalid UTF-8 stream which decodes without an error, it
> is actually a security bug, and should be fixed pronto in all affected
> and supported versions.

Here's an example:

    b = b'\x80'

Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
from str objects to bytes objects.


Marko
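
The example above can be tried out with a minimal sketch like the following. The helper name utf8_round_trip is hypothetical and only serves the illustration; it reports whether a bytes object survives the decode/encode round trip, or returns None when the bytes cannot be decoded as UTF-8 at all.

```python
def utf8_round_trip(b: bytes):
    """Return True if b survives decode/encode unchanged,
    or None if b is not valid UTF-8 and cannot be decoded."""
    try:
        return b.decode('utf-8').encode('utf-8') == b
    except UnicodeDecodeError:
        # The decoder rejects the input, so no str object maps to these bytes.
        return None


# Valid UTF-8 round-trips cleanly.
print(utf8_round_trip('é'.encode('utf-8')))

# A lone continuation byte is not valid UTF-8; decoding raises.
print(utf8_round_trip(b'\x80'))
```

With strict decoding, the failure shows up as an exception rather than as a silently different result, which is why there is no str object whose UTF-8 encoding is b'\x80'.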