| Path | csiph.com!x330-a1.tempe.blueboxinc.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!weretis.net!feeder1.news.weretis.net!feeder.erje.net!newsfeed.xs4all.nl!newsfeed6.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail |
|---|---|
| Return-Path | <python@mrabarnett.plus.com> |
| X-Original-To | python-list@python.org |
| Delivered-To | python-list@mail.python.org |
| X-Spam-Status | OK 0.000 |
| X-IronPort-Anti-Spam-Filtered | true |
| X-IronPort-Anti-Spam-Result | Av0EAAH0JE7Unw4S/2dsb2JhbABThEmjAXeIfAKzNDuQbIErhAKBDwSXZ4tW |
| Date | Tue, 19 Jul 2011 04:08:42 +0100 |
| From | MRAB <python@mrabarnett.plus.com> |
| User-Agent | Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20110624 Thunderbird/5.0 |
| MIME-Version | 1.0 |
| To | python-list@python.org |
| Subject | Re: a little parsing challenge ☺ |
| References | <36037253-086b-4467-a1db-9492d3772e78@r5g2000prf.googlegroups.com> <j01ph6$knt$1@speranza.aioe.org> <4e24c823$0$29981$c3e8da3$5496439d@news.astraweb.com> <j02oth$f4c$1@speranza.aioe.org> |
| In-Reply-To | <j02oth$f4c$1@speranza.aioe.org> |
| Content-Type | text/plain; charset=UTF-8; format=flowed |
| Content-Transfer-Encoding | 8bit |
| X-BeenThere | python-list@python.org |
| X-Mailman-Version | 2.1.12 |
| Precedence | list |
| Reply-To | python-list@python.org |
| List-Id | General discussion list for the Python programming language <python-list.python.org> |
| List-Unsubscribe | <http://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe> |
| List-Archive | <http://mail.python.org/pipermail/python-list> |
| List-Post | <mailto:python-list@python.org> |
| List-Help | <mailto:python-list-request@python.org?subject=help> |
| List-Subscribe | <http://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.1233.1311044922.1164.python-list@python.org> (permalink) |
| Lines | 100 |
| NNTP-Posting-Host | 2001:888:2000:d::a6 |
| X-Trace | 1311044922 news.xs4all.nl 23893 [2001:888:2000:d::a6]:43462 |
| X-Complaints-To | abuse@xs4all.nl |
| Xref | x330-a1.tempe.blueboxinc.net comp.lang.python:9844 |
On 19/07/2011 03:07, Billy Mays wrote:
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>> Billy Mays wrote:
>>
>>> On 07/17/2011 03:47 AM, Xah Lee wrote:
>>>> 2011-07-16
>>>
>>> I gave it a shot. It doesn't do any of the Unicode delims, because
>>> let's face it, Unicode is for goobers.
>>
>> Goobers... that would be one of those new-fangled slang terms that the
>> young kids today use to mean its opposite, like "bad", "wicked" and
>> "sick", correct?
>>
>> I mention it only because some people might mistakenly interpret your
>> words as a childish and feeble insult against the 98% of the world who
>> want or need more than the 127 characters of ASCII, rather than
>> understand you meant it as a sign of the utmost respect for the
>> richness and diversity of human beings and their languages, cultures,
>> maths and sciences.
>>
>
> TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem).
>
> As long as I have used python (which I admit has only been 3 years)
> Unicode has never appeared to be implemented correctly. I'm probably
> repeating old arguments here, but whatever.
>
> Unicode is a mess. When someone says ASCII, you know that they can only
> mean characters 0-127. When someone says Unicode, do the mean real
> Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8? When
> using the 'u' datatype with the array module, the docs don't even tell
> you if its 2 bytes wide or 4 bytes. Which is it? I'm sure that all the
> of these can be figured out, but the problem is now I have to ask every
> one of these questions whenever I want to use strings.
>

That's down to whether it's a narrow or wide Python build. There's a
PEP suggesting a fix for that (PEP 393).

> Secondly, Python doesn't do Unicode exception handling correctly. (but I
> suspect that its a broader problem with languages) A good example of
> this is with UTF-8 where there are invalid code points ( such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).
>

Those aren't codepoints, those are invalid bytes for the UTF-8 encoding.

> When embedding Python in a long running application where user input is
> received, it is very easy to make mistake which bring down the whole
> program. If any user string isn't properly try/excepted, a user could
> craft a malformed string which a UTF-8 decoder would choke on. Using
> ASCII (or whatever 8 bit encoding) doesn't have these problems since all
> codepoints are valid.
>

What if you give an application an invalid JPEG, PNG or other image
file? Does that mean that image formats are bad too?

> Another (this must have been a good laugh amongst the UniDevs) 'feature'
> of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
> Any string can masquerade as any other string by placing few of these in
> a string. Any word filters you might have are now defeated by some
> cheesy Unicode nonsense character. Can you just just check for these
> characters and strip them out? Yes. Should you have to? I would say no.
>
> Does it get better? Of course! international character sets used for
> domain name encoding use yet a different scheme (Punycode). Are the
> following two domain names the same: tést.com , xn--tst-bma.com ? Who
> knows!
>
> I suppose I can gloss over the pains of using Unicode in C with every
> string needing to be an LPS since 0x00 is now a valid code point in
> UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do
> strlen or concatenation operations.
>

0x00 is also a valid ASCII code, but C doesn't let you use it!

There's also "Modified UTF-8", in which U+0000 is encoded as 2 bytes,
so that zero-byte can be used as a terminator. You can't do that in
ASCII! :-)

> Can it get even better? Yep. We also now need to have a Byte order Mark
> (BOM) to determine the endianness of our characters. Are they little
> endian or big endian? (or perhaps one of the two possible middle endian
> encodings?) Who knows? String processing with unicode is unpleasant to
> say the least. I suppose that's what we get when we things are designed
> by committee.
>

Proper UTF-8 doesn't have a BOM.

The rule (in Python, at least) is to decode on input and encode on
output. You don't have to worry about endianness when processing
Unicode strings internally; they're just a series of codepoints.

> But Hey! The great thing about standards is that there are so many to
> choose from.
>
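The "decode on input, encode on output" rule above can be sketched roughly as follows. This is only an illustration, assuming Python 3 and its standard codecs; the helper names `decode_input`, `strip_zero_width` and `encode_output` are made up for this example and do not come from the thread. It also touches the other points raised: an invalid byte such as 0xFF fails a strict UTF-8 decode but can be replaced rather than crashing the program, a stray BOM can be dropped with the `utf-8-sig` codec, zero width spaces can be stripped, and the two domain names from the quoted post compare equal once encoded with the built-in `idna` codec.

```python
ZERO_WIDTH_SPACE = "\u200b"  # encoded in UTF-8 as the bytes E2 80 8B


def decode_input(raw: bytes) -> str:
    """Decode untrusted bytes at the boundary without raising.

    Hypothetical helper: 'utf-8-sig' drops a leading BOM if one is
    present, and errors='replace' turns each invalid byte into U+FFFD
    instead of raising UnicodeDecodeError.
    """
    return raw.decode("utf-8-sig", errors="replace")


def strip_zero_width(text: str) -> str:
    """Remove zero width spaces, e.g. before applying a word filter."""
    return text.replace(ZERO_WIDTH_SPACE, "")


def encode_output(text: str) -> bytes:
    """Encode back to bytes only when the text leaves the program."""
    return text.encode("utf-8")


# A strict decode of an invalid byte raises, so it must be handled.
try:
    b"ab\xffcd".decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

cleaned = strip_zero_width(decode_input(b"sp\xe2\x80\x8bam"))
replaced = decode_input(b"ab\xffcd")
without_bom = decode_input(b"\xef\xbb\xbfhi")

# The built-in 'idna' codec applies Punycode to each non-ASCII label.
domain = "t\u00e9st.com".encode("idna")

print(strict_failed, cleaned, repr(replaced), without_bom, domain)
```

Whether to replace, ignore or reject malformed input is a policy choice for the application; `errors="replace"` is just one option, picked here so that bad bytes stay visible in the decoded text.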
a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-17 00:47 -0700
Re: a little parsing challenge ☺ Raymond Hettinger <python@rcn.com> - 2011-07-17 02:48 -0700
Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-17 15:20 +0200
Re: a little parsing challenge ☺ mhenn <michihenn@hotmail.com> - 2011-07-17 15:55 +0200
Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-17 18:01 +0200
Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-17 18:54 +0200
Re: a little parsing challenge ☺ Thomas Boell <tboell@domain.invalid> - 2011-07-17 17:49 +0200
Re: a little parsing challenge ☺ Raymond Hettinger <python@rcn.com> - 2011-07-17 12:16 -0700
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-18 07:39 -0700
Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-20 08:23 +0200
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-20 03:31 -0700
Re: a little parsing challenge ☺ "Uri Guttman" <uri@StemSystems.com> - 2011-07-20 12:31 -0400
Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-20 10:30 -0700
Re: a little parsing challenge ☺ merlyn@stonehenge.com (Randal L. Schwartz) - 2011-07-20 12:06 -0700
Re: a little parsing challenge ☺ Jason Earl <jearl@notengoamigos.org> - 2011-07-20 14:57 -0600
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 09:54 -0700
Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-19 20:07 +0200
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 05:58 -0700
Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-21 08:26 -0600
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 08:36 -0700
Re: a little parsing challenge ☺ python@bdurham.com - 2011-07-21 12:43 -0400
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 11:53 -0700
Re: a little parsing challenge ☺ Terry Reedy <tjreedy@udel.edu> - 2011-07-21 18:37 -0400
Re: a little parsing challenge ☺ John O'Hagan <research@johnohagan.com> - 2011-07-25 15:57 +1000
Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-19 12:08 -0600
Re: a little parsing challenge ☺ Chris Angelico <rosuav@gmail.com> - 2011-07-17 21:34 +1000
Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-17 04:52 -0700
Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-17 16:15 +0200
Re: a little parsing challenge ☺ Raymond Hettinger <python@rcn.com> - 2011-07-17 12:18 -0700
Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-17 22:16 +0200
Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-17 22:57 +0200
Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-17 23:43 +0200
Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-18 03:09 -0400
Re: a little parsing challenge ☺ Stefan Behnel <stefan_ml@behnel.de> - 2011-07-18 09:24 +0200
Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-18 04:04 -0400
Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-18 18:46 +0200
Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-18 14:14 -0400
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 06:23 -0700
Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-21 17:54 -0400
Re: a little parsing challenge ☺ gene heskett <gheskett@wdtv.com> - 2011-07-17 10:26 -0400
Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-17 08:31 -0700
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 10:49 -0700
Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-19 20:14 +0200
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 05:29 -0700
Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-21 15:21 +0200
Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-19 20:17 +0200
Re: a little parsing challenge ☺ rantingrick <rantingrick@gmail.com> - 2011-07-17 18:52 -0700
Re: a little parsing challenge ☺ Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com> - 2011-07-18 13:12 -0400
Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-18 12:10 -0600
Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-18 23:59 +0200
Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-19 08:09 +0200
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 10:32 -0700
Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-19 09:56 +1000
Re: a little parsing challenge ☺ Billy Mays <noway@nohow.com> - 2011-07-18 22:07 -0400
Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-18 19:50 -0700
Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-19 13:11 +1000
Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-18 21:59 -0700
Re: a little parsing challenge ☺ Chris Angelico <rosuav@gmail.com> - 2011-07-19 15:36 +1000
Re: a little parsing challenge ☺ MRAB <python@mrabarnett.plus.com> - 2011-07-19 04:08 +0100
Re: a little parsing challenge ☺ Benjamin Kaplan <benjamin.kaplan@case.edu> - 2011-07-18 20:54 -0700
Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-19 14:30 +1000
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 01:58 -0700
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 10:14 -0700
Re: a little parsing challenge ☺ Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com> - 2011-07-19 13:33 -0400
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 11:12 -0700
Re: a little parsing challenge ☺ Terry Reedy <tjreedy@udel.edu> - 2011-07-19 15:09 -0400
Re: a little parsing challenge ☺ jmfauth <wxjmfauth@gmail.com> - 2011-07-19 23:29 -0700
Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-20 01:29 -0600
Re: a little parsing challenge ☺ jmfauth <wxjmfauth@gmail.com> - 2011-07-20 00:54 -0700
Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-20 18:18 +1000
Re: a little parsing challenge ? sln@netherlands.com - 2011-07-18 12:34 -0700
Re: a little parsing challenge ☺ Mark Tarver <dr.mtarver@gmail.com> - 2011-07-19 22:43 -0700