From: MRAB
Date: Tue, 19 Jul 2011 04:08:42 +0100
To: python-list@python.org
Newsgroups: comp.lang.python
Subject: Re: a little parsing challenge ☺

On 19/07/2011 03:07, Billy Mays wrote:
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>> Billy Mays wrote:
>>
>>> On 07/17/2011 03:47 AM, Xah Lee wrote:
>>>> 2011-07-16
>>>
>>> I gave it a shot. It doesn't do any of the Unicode delims, because
>>> let's face it, Unicode is for goobers.
>>
>> Goobers... that would be one of those new-fangled slang terms that the
>> young kids today use to mean its opposite, like "bad", "wicked" and
>> "sick", correct?
>>
>> I mention it only because some people might mistakenly interpret your
>> words as a childish and feeble insult against the 98% of the world who
>> want or need more than the 127 characters of ASCII, rather than
>> understand you meant it as a sign of the utmost respect for the
>> richness and diversity of human beings and their languages, cultures,
>> maths and sciences.
>>
>
> TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem.
>
> As long as I have used Python (which I admit has only been 3 years)
> Unicode has never appeared to be implemented correctly. I'm probably
> repeating old arguments here, but whatever.
>
> Unicode is a mess. When someone says ASCII, you know that they can only
> mean characters 0-127. When someone says Unicode, do they mean real
> Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8? When
> using the 'u' datatype with the array module, the docs don't even tell
> you if it's 2 bytes wide or 4 bytes. Which is it? I'm sure that all
> of these can be figured out, but the problem is now I have to ask every
> one of these questions whenever I want to use strings.
>
That's down to whether it's a narrow or wide Python build: 2 bytes per
code unit on a narrow build, 4 on a wide one. There's a PEP suggesting
a fix for that (PEP 393).
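
If you want to see which kind of build you're on, something like this
rough sketch should do. The helper name build_kind is made up for
illustration; sys.maxunicode is the real attribute. On interpreters
where PEP 393 has landed (3.3 and later) every build reports 0x10FFFF,
so the "narrow" branch only fires on older builds.

```python
import sys


def build_kind(maxunicode: int = sys.maxunicode) -> str:
    """Classify a build from sys.maxunicode (hypothetical helper)."""
    # Narrow builds store UTF-16 code units, so the largest value is 0xFFFF.
    return "narrow" if maxunicode == 0xFFFF else "wide"


# A code point outside the Basic Multilingual Plane.
astral = "\U0001F600"

print(build_kind())
# On a narrow build this length is 2 (a surrogate pair); otherwise it is 1.
print(len(astral))
```

Either way, the width is a property of the interpreter build, not of
Unicode itself.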
> Secondly, Python doesn't do Unicode exception handling correctly (but I
> suspect that it's a broader problem with languages). A good example of
> this is with UTF-8 where there are invalid code points (such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).
>
Those aren't codepoints, those are invalid bytes for the UTF-8 encoding.

> When embedding Python in a long running application where user input is
> received, it is very easy to make a mistake which brings down the whole
> program. If any user string isn't properly try/excepted, a user could
> craft a malformed string which a UTF-8 decoder would choke on. Using
> ASCII (or whatever 8 bit encoding) doesn't have these problems since all
> codepoints are valid.
>
What if you give an application an invalid JPEG, PNG or other image
file? Does that mean that image formats are bad too?

> Another (this must have been a good laugh amongst the UniDevs) 'feature'
> of Unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
> Any string can masquerade as any other string by placing a few of these
> in a string. Any word filters you might have are now defeated by some
> cheesy Unicode nonsense character. Can you just check for these
> characters and strip them out? Yes. Should you have to? I would say no.
>
> Does it get better? Of course! International character sets used for
> domain name encoding use yet a different scheme (Punycode). Are the
> following two domain names the same: tést.com , xn--tst-bma.com ? Who
> knows!
>
> I suppose I can gloss over the pains of using Unicode in C with every
> string needing to be an LPS since 0x00 is now a valid code point in
> UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do
> strlen or concatenation operations.
>
0x00 is also a valid ASCII code, but C strings don't let you use it!
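
A quick way to check the NUL point from Python, as a minimal sketch (the
variable names are just for illustration): U+0000 encodes to the same
single zero byte in ASCII and in UTF-8, and it's the C-style
"stop at the first zero byte" convention that loses the rest.

```python
# U+0000 is an ordinary character in a Python str.
text = "abc\x00def"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# ASCII and UTF-8 agree: NUL is the single byte 0x00 in both encodings.
assert ascii_bytes == utf8_bytes == b"abc\x00def"

# A C-style reader that stops at the first zero byte sees only b"abc".
c_view = utf8_bytes.split(b"\x00", 1)[0]

print(len(text), c_view)
```
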
There's also "Modified UTF-8", in which U+0000 is encoded as 2 bytes, so
that a zero byte can be used as a terminator. You can't do that in
ASCII! :-)

> Can it get even better? Yep. We also now need to have a Byte Order Mark
> (BOM) to determine the endianness of our characters. Are they little
> endian or big endian? (or perhaps one of the two possible middle endian
> encodings?) Who knows? String processing with Unicode is unpleasant to
> say the least. I suppose that's what we get when things are designed
> by committee.
>
Proper UTF-8 doesn't need a BOM, since a byte-oriented encoding has no
endianness to mark. The rule (in Python, at least) is to decode on input
and encode on output. You don't have to worry about endianness when
processing Unicode strings internally; they're just a series of
codepoints.

> But hey! The great thing about standards is that there are so many to
> choose from.
>
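
To make the decode-on-input, encode-on-output rule concrete, here is a
minimal sketch. The function names read_text and write_text are made up
for illustration; the codec names and error handlers are the standard
ones. It also shows that a stray BOM or an invalid byte such as 0xC0
needn't bring the program down, provided the decoding happens in one
place.

```python
import codecs


def read_text(raw: bytes, errors: str = "replace") -> str:
    """Decode on input (hypothetical helper).

    'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present.
    With errors="replace", invalid bytes become U+FFFD instead of raising.
    """
    return raw.decode("utf-8-sig", errors=errors)


def write_text(text: str) -> bytes:
    """Encode on output (hypothetical helper)."""
    return text.encode("utf-8")


with_bom = codecs.BOM_UTF8 + "t\u00e9st".encode("utf-8")
malformed = b"abc\xc0def"  # 0xC0 can never appear in valid UTF-8

print(repr(read_text(with_bom)))
print(repr(read_text(malformed)))

# Strict decoding raises an exception that can be caught at the boundary.
try:
    read_text(malformed, errors="strict")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)

print(write_text("t\u00e9st"))
```

Everything between read_text and write_text deals only with str, so
byte order never comes up there.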