Re: a little parsing challenge ☺

Date Tue, 19 Jul 2011 04:08:42 +0100
From MRAB <python@mrabarnett.plus.com>
Subject Re: a little parsing challenge ☺
Newsgroups comp.lang.python
Message-ID <mailman.1233.1311044922.1164.python-list@python.org>


On 19/07/2011 03:07, Billy Mays wrote:
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>> Billy Mays wrote:
>>
>>> On 07/17/2011 03:47 AM, Xah Lee wrote:
>>>> 2011-07-16
>>>
>>> I gave it a shot. It doesn't do any of the Unicode delims, because
>>> let's face it, Unicode is for goobers.
>>
>> Goobers... that would be one of those new-fangled slang terms that the
>> young
>> kids today use to mean its opposite, like "bad", "wicked" and "sick",
>> correct?
>>
>> I mention it only because some people might mistakenly interpret your
>> words
>> as a childish and feeble insult against the 98% of the world who want or
>> need more than the 127 characters of ASCII, rather than understand you
>> meant it as a sign of the utmost respect for the richness and
>> diversity of
>> human beings and their languages, cultures, maths and sciences.
>>
>>
>
> TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem.
>
> As long as I have used python (which I admit has only been 3 years)
> Unicode has never appeared to be implemented correctly. I'm probably
> repeating old arguments here, but whatever.
>
> Unicode is a mess. When someone says ASCII, you know that they can only
> mean characters 0-127. When someone says Unicode, do they mean real
> Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8? When
> using the 'u' datatype with the array module, the docs don't even tell
> you if it's 2 bytes wide or 4 bytes. Which is it? I'm sure that all
> of these can be figured out, but the problem is now I have to ask every
> one of these questions whenever I want to use strings.
>
That's down to whether it's a narrow or wide Python build. There's a
PEP suggesting a fix for that (PEP 393).
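
To see which case applies to a given interpreter, a minimal sketch follows. It relies only on `sys.maxunicode`; the helper name `build_kind` is invented for illustration. On Python 3.3 and later, where PEP 393 was adopted, it should always report "wide".

```python
import sys

def build_kind():
    """Return 'wide' if one str element can hold any code point, else 'narrow'."""
    # Narrow builds (possible before Python 3.3) report 0xFFFF here.
    return "wide" if sys.maxunicode > 0xFFFF else "narrow"

print(build_kind(), hex(sys.maxunicode))

# A character outside the Basic Multilingual Plane has length 1 on a wide
# build and length 2 (a surrogate pair) on a narrow build.
print(len("\U0001F600"))
```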

> Secondly, Python doesn't do Unicode exception handling correctly (but I
> suspect that it's a broader problem with languages). A good example of
> this is with UTF-8 where there are invalid code points (such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).
>
Those aren't codepoints, those are invalid bytes for the UTF-8 encoding.
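
The distinction can be checked directly. This is only a sketch: the names `INVALID_UTF8_BYTES` and `is_rejected` are made up for the example, and it assumes Python 3's strict built-in UTF-8 decoder.

```python
# Byte values that can never appear anywhere in well-formed UTF-8.
INVALID_UTF8_BYTES = [0xC0, 0xC1] + list(range(0xF5, 0x100))

def is_rejected(byte_value):
    """True if a lone byte with this value fails strict UTF-8 decoding."""
    try:
        bytes([byte_value]).decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

print(all(is_rejected(b) for b in INVALID_UTF8_BYTES))

# The code point U+00C0, by contrast, is a perfectly valid character.
# It is encoded as two bytes, neither of which is 0xC0.
print(chr(0xC0).encode("utf-8"))
```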

> When embedding Python in a long running application where user input is
> received, it is very easy to make a mistake which brings down the whole
> program. If any user string isn't properly try/excepted, a user could
> craft a malformed string which a UTF-8 decoder would choke on. Using
> ASCII (or whatever 8 bit encoding) doesn't have these problems since all
> codepoints are valid.
>
What if you give an application an invalid JPEG, PNG or other image
file? Does that mean that image formats are bad too?
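
As with any untrusted input, the decoding step can be guarded in one place. A minimal sketch, assuming the input arrives as bytes and that substituting U+FFFD for bad sequences is acceptable to the application (`decode_user_input` is a hypothetical helper name):

```python
def decode_user_input(raw, encoding="utf-8"):
    """Decode untrusted bytes without letting a malformed sequence escape."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Fall back to replacing each undecodable sequence with U+FFFD
        # rather than propagating the exception to the caller.
        return raw.decode(encoding, errors="replace")

print(decode_user_input(b"caf\xc3\xa9"))
print(ascii(decode_user_input(b"abc\xffdef")))
```

Whether to replace, ignore, or reject malformed input is a policy decision; the point is only that it can be made once, at the boundary.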

> Another (this must have been a good laugh amongst the UniDevs) 'feature'
> of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
> Any string can masquerade as any other string by placing a few of these in
> a string. Any word filters you might have are now defeated by some
> cheesy Unicode nonsense character. Can you just check for these
> characters and strip them out? Yes. Should you have to? I would say no.
>
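For reference, the "check and strip" step mentioned above might look like the sketch below. The zero width space is the code point U+200B; 0xE2 0x80 0x8B are its UTF-8 bytes. The set of characters in `ZERO_WIDTH` is an assumption chosen for illustration, not a complete list of invisible characters.

```python
# Map each unwanted code point to None so str.translate() deletes it.
# The selection below is illustrative, not exhaustive.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

def strip_zero_width(text):
    """Remove zero-width characters before running a word filter."""
    return text.translate(ZERO_WIDTH)

print(strip_zero_width("sp\u200bam") == "spam")
print("\u200b".encode("utf-8"))
```
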
> Does it get better? Of course! International character sets used for
> domain name encoding use yet a different scheme (Punycode). Are the
> following two domain names the same: tést.com , xn--tst-bma.com ? Who
> knows!
>
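The question about the two domain names can be answered mechanically. A sketch using Python's built-in `idna` codec, which implements the older IDNA 2003 rules; `same_domain` is a hypothetical helper, and the lower-casing is a simplification:

```python
def same_domain(a, b):
    """Compare two domain names by their ASCII (Punycode) form."""
    return a.encode("idna").lower() == b.encode("idna").lower()

print("t\u00e9st.com".encode("idna"))
print(same_domain("t\u00e9st.com", "xn--tst-bma.com"))
```
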
> I suppose I can gloss over the pains of using Unicode in C with every
> string needing to be an LPS since 0x00 is now a valid code point in
> UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do
> strlen or concatenation operations.
>
0x00 is also a valid ASCII code, but C doesn't let you use it!

There's also "Modified UTF-8", in which U+0000 is encoded as 2 bytes,
so that zero-byte can be used as a terminator. You can't do that in
ASCII! :-)
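
Python has no built-in Modified UTF-8 codec, so the following only sketches the U+0000 part of the scheme. The real format also encodes characters outside the Basic Multilingual Plane differently, which this sketch ignores. `encode_nul_free` is a made-up name.

```python
def encode_nul_free(text):
    """Encode to UTF-8, then swap each NUL byte for the pair 0xC0 0x80."""
    # In UTF-8 a 0x00 byte can only come from U+0000, so a plain byte
    # replacement is enough for this part of the scheme.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")

encoded = encode_nul_free("a\x00b")
print(encoded)
print(b"\x00" in encoded)
```

A strict UTF-8 decoder rejects the 0xC0 0x80 pair, so such data has to be converted back before being decoded as ordinary UTF-8.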

> Can it get even better? Yep. We also now need to have a Byte order Mark
> (BOM) to determine the endianness of our characters. Are they little
> endian or big endian? (or perhaps one of the two possible middle endian
> encodings?) Who knows? String processing with unicode is unpleasant to
> say the least. I suppose that's what we get when things are designed
> by committee.
>
Proper UTF-8 doesn't have a BOM.

The rule (in Python, at least) is to decode on input and encode on
output. You don't have to worry about endianness when processing
Unicode strings internally; they're just a series of codepoints.
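
A small sketch of that rule, using only the standard codecs. The sample string is arbitrary; byte order and any BOM exist only in the encoded bytes, never in the decoded string.

```python
text = "h\u00e9llo"

# The same code points serialize to different bytes depending on byte order...
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")
print(little != big)

# ...but decoding with the matching codec gives back the identical string.
print(little.decode("utf-16-le") == big.decode("utf-16-be") == text)

# The plain "utf-16" codec writes a BOM and uses it when reading back.
print(text.encode("utf-16").decode("utf-16") == text)

# "utf-8-sig" drops a leading UTF-8 signature; plain "utf-8" keeps it as U+FEFF.
with_signature = b"\xef\xbb\xbf" + text.encode("utf-8")
print(with_signature.decode("utf-8-sig") == text)
print(with_signature.decode("utf-8") == "\ufeff" + text)
```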

> But Hey! The great thing about standards is that there are so many to
> choose from.
>
