Date: Tue, 19 Jul 2011 04:08:42 +0100
From: MRAB <python@mrabarnett.plus.com>
To: python-list@python.org
Subject: Re: a little parsing challenge ☺
References: <36037253-086b-4467-a1db-9492d3772e78@r5g2000prf.googlegroups.com> <j01ph6$knt$1@speranza.aioe.org> <4e24c823$0$29981$c3e8da3$5496439d@news.astraweb.com> <j02oth$f4c$1@speranza.aioe.org>
In-Reply-To: <j02oth$f4c$1@speranza.aioe.org>
Newsgroups: comp.lang.python
Message-ID: <mailman.1233.1311044922.1164.python-list@python.org>

On 19/07/2011 03:07, Billy Mays wrote:
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>> Billy Mays wrote:
>>
>>> On 07/17/2011 03:47 AM, Xah Lee wrote:
>>>> 2011-07-16
>>>
>>> I gave it a shot. It doesn't do any of the Unicode delims, because
>>> let's face it, Unicode is for goobers.
>>
>> Goobers... that would be one of those new-fangled slang terms that the
>> young kids today use to mean its opposite, like "bad", "wicked" and
>> "sick", correct?
>>
>> I mention it only because some people might mistakenly interpret your
>> words as a childish and feeble insult against the 98% of the world who
>> want or need more than the 127 characters of ASCII, rather than
>> understand you meant it as a sign of the utmost respect for the richness
>> and diversity of human beings and their languages, cultures, maths and
>> sciences.
>>
>>
>
> TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem.
>
> As long as I have used Python (which I admit has only been 3 years),
> Unicode has never appeared to be implemented correctly. I'm probably
> repeating old arguments here, but whatever.
>
> Unicode is a mess. When someone says ASCII, you know that they can only
> mean characters 0-127. When someone says Unicode, do they mean real
> Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8? When
> using the 'u' datatype with the array module, the docs don't even tell
> you if it's 2 bytes wide or 4 bytes. Which is it? I'm sure that all of
> these can be figured out, but the problem is now I have to ask every
> one of these questions whenever I want to use strings.
>
That's down to whether it's a narrow or wide Python build (2 bytes per
code unit on a narrow build, 4 on a wide one). There's a PEP suggesting
a fix for that (PEP 393).
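
A quick way to see which kind of build you're running is to look at
sys.maxunicode. This is only a sketch: the helper name build_kind is made
up for illustration, and on interpreters that implement PEP 393 (Python
3.3 and later) it should always report "wide".

```python
import sys

def build_kind():
    """Return 'wide' if str can hold code points above U+FFFF directly."""
    # Narrow builds report 0xFFFF here; wide builds report 0x10FFFF.
    return "wide" if sys.maxunicode > 0xFFFF else "narrow"

print(build_kind(), hex(sys.maxunicode))
```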

> Secondly, Python doesn't do Unicode exception handling correctly (but I
> suspect that it's a broader problem with languages). A good example of
> this is with UTF-8 where there are invalid code points (such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).
>
Those aren't codepoints, those are invalid bytes for the UTF-8 encoding.
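
To make the distinction concrete, here is a small sketch (Python 3
syntax; the names is_valid_utf8 and never_valid are invented for this
example). It shows that the byte 0xC0 can never appear in UTF-8, while
the codepoint U+00C0 is perfectly valid and simply encodes to two other
bytes.

```python
def is_valid_utf8(data):
    """Return True if the bytes object decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# The byte values from the quoted list: 0xC0, 0xC1 and 0xF5..0xFF.
never_valid = [b for b in range(256) if b in (0xC0, 0xC1) or b >= 0xF5]

# None of those bytes decodes on its own.
print(any(is_valid_utf8(bytes([b])) for b in never_valid))

# The codepoint U+00C0 is fine; it encodes as the bytes C3 80.
print("\u00c0".encode("utf-8"))
```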

> When embedding Python in a long-running application where user input is
> received, it is very easy to make a mistake which brings down the whole
> program. If any user string isn't properly try/excepted, a user could
> craft a malformed string which a UTF-8 decoder would choke on. Using
> ASCII (or whatever 8 bit encoding) doesn't have these problems since all
> codepoints are valid.
>
What if you give an application an invalid JPEG, PNG or other image
file? Does that mean that image formats are bad too?
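
As with any untrusted file, the usual answer is to validate at the
boundary. A minimal sketch, assuming the input arrives as bytes and that
substituting U+FFFD for bad bytes is acceptable for your application
(safe_decode is a made-up name):

```python
def safe_decode(raw):
    """Decode untrusted bytes as UTF-8 without ever raising."""
    # errors="replace" swaps each undecodable byte for U+FFFD
    # instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="replace")

print(ascii(safe_decode(b"abc\xff")))
```

If silently replacing bytes is not acceptable, catching UnicodeDecodeError
once at the point where the input enters the program does the same job.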

> Another (this must have been a good laugh amongst the UniDevs) 'feature'
> of Unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
> Any string can masquerade as any other string by placing a few of these
> in a string. Any word filters you might have are now defeated by some
> cheesy Unicode nonsense character. Can you just check for these
> characters and strip them out? Yes. Should you have to? I would say no.
>
> Does it get better? Of course! International character sets used for
> domain name encoding use yet a different scheme (Punycode). Are the
> following two domain names the same: tést.com , xn--tst-bma.com ? Who
> knows!
>
> I suppose I can gloss over the pains of using Unicode in C with every
> string needing to be an LPS since 0x00 is now a valid code point in
> UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do
> strlen or concatenation operations.
>
0x00 is also a valid ASCII code, but C doesn't let you use it!

There's also "Modified UTF-8", in which U+0000 is encoded as 2 bytes,
so that a zero byte can be used as a terminator. You can't do that in
ASCII! :-)
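
Python has no built-in codec for this, so the following is only a sketch
of the NUL-handling part, with invented function names. Real Modified
UTF-8 also treats characters above U+FFFF differently, which this
deliberately ignores.

```python
def encode_nul_safe(text):
    """Encode as UTF-8, but write U+0000 as the two bytes C0 80."""
    # In valid UTF-8 a zero byte can only ever mean U+0000,
    # so a plain byte-level replace is safe here.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")

def decode_nul_safe(data):
    """Reverse encode_nul_safe: turn C0 80 back into U+0000."""
    # C0 never occurs in standard UTF-8, so this cannot hit real data.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")

encoded = encode_nul_safe("a\x00b")
print(encoded, b"\x00" in encoded)
```

The encoded form contains no zero byte, so it can be handed to C code
that expects a NUL-terminated string.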

> Can it get even better? Yep. We also now need to have a Byte Order Mark
> (BOM) to determine the endianness of our characters. Are they little
> endian or big endian? (or perhaps one of the two possible middle endian
> encodings?) Who knows? String processing with Unicode is unpleasant to
> say the least. I suppose that's what we get when things are designed
> by committee.
>
Proper UTF-8 doesn't have a BOM.

The rule (in Python, at least) is to decode on input and encode on
output. You don't have to worry about endianness when processing
Unicode strings internally; they're just a series of codepoints.
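
A rough illustration of that rule, in Python 3 syntax with made-up helper
names (read_text, write_text). The codec deals with any BOM at the edge,
and the string in between is the same whichever byte order it came from.

```python
import codecs

def read_text(raw, encoding="utf-8-sig"):
    """Decode on input; 'utf-8-sig' drops a leading UTF-8 BOM if present."""
    return raw.decode(encoding)

def write_text(text, encoding="utf-8"):
    """Encode on output; plain 'utf-8' writes no BOM."""
    return text.encode(encoding)

# The same text as UTF-16 in both byte orders, each with its BOM.
little = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")

# The "utf-16" codec reads the BOM and picks the byte order itself.
same = read_text(little, "utf-16") == read_text(big, "utf-16")
print(same, write_text(read_text(codecs.BOM_UTF8 + b"hi")))
```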

> But Hey! The great thing about standards is that there are so many to
> choose from.
>