a little parsing challenge ☺

Started by	Xah Lee <xahlee@gmail.com>
First post	2011-07-17 00:47 -0700
Last post	2011-07-19 22:43 -0700
Articles	12 on this page of 72 — 28 participants

  a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-17 00:47 -0700
    Re: a little parsing challenge ☺ Raymond Hettinger <python@rcn.com> - 2011-07-17 02:48 -0700
      Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-17 15:20 +0200
        Re: a little parsing challenge ☺ mhenn <michihenn@hotmail.com> - 2011-07-17 15:55 +0200
          Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-17 18:01 +0200
            Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-17 18:54 +0200
      Re: a little parsing challenge ☺ Thomas Boell <tboell@domain.invalid> - 2011-07-17 17:49 +0200
        Re: a little parsing challenge ☺ Raymond Hettinger <python@rcn.com> - 2011-07-17 12:16 -0700
      Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-18 07:39 -0700
        Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-20 08:23 +0200
        Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-20 03:31 -0700
          Re: a little parsing challenge ☺ "Uri Guttman" <uri@StemSystems.com> - 2011-07-20 12:31 -0400
            Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-20 10:30 -0700
            Re: a little parsing challenge ☺ merlyn@stonehenge.com (Randal L. Schwartz) - 2011-07-20 12:06 -0700
              Re: a little parsing challenge ☺ Jason Earl <jearl@notengoamigos.org> - 2011-07-20 14:57 -0600
      Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 09:54 -0700
        Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-19 20:07 +0200
          Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 05:58 -0700
            Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-21 08:26 -0600
              Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 08:36 -0700
                Re: a little parsing challenge ☺ python@bdurham.com - 2011-07-21 12:43 -0400
                  Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 11:53 -0700
                    Re: a little parsing challenge ☺ Terry Reedy <tjreedy@udel.edu> - 2011-07-21 18:37 -0400
            Re: a little parsing challenge ☺ John O'Hagan <research@johnohagan.com> - 2011-07-25 15:57 +1000
        Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-19 12:08 -0600
    Re: a little parsing challenge ☺ Chris Angelico <rosuav@gmail.com> - 2011-07-17 21:34 +1000
      Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-17 04:52 -0700
      Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-17 16:15 +0200
        Re: a little parsing challenge ☺ Raymond Hettinger <python@rcn.com> - 2011-07-17 12:18 -0700
          Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-17 22:16 +0200
            Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-17 22:57 +0200
        Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-17 23:43 +0200
        Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-18 03:09 -0400
          Re: a little parsing challenge ☺ Stefan Behnel <stefan_ml@behnel.de> - 2011-07-18 09:24 +0200
            Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-18 04:04 -0400
          Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-18 18:46 +0200
            Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-18 14:14 -0400
          Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 06:23 -0700
            Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-21 17:54 -0400
    Re: a little parsing challenge ☺ gene heskett <gheskett@wdtv.com> - 2011-07-17 10:26 -0400
    Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-17 08:31 -0700
      Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 10:49 -0700
        Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-19 20:14 +0200
          Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 05:29 -0700
            Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-21 15:21 +0200
        Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-19 20:17 +0200
    Re: a little parsing challenge ☺ rantingrick <rantingrick@gmail.com> - 2011-07-17 18:52 -0700
    Re: a little parsing challenge ☺ Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com> - 2011-07-18 13:12 -0400
      Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-18 12:10 -0600
        Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-18 23:59 +0200
          Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-19 08:09 +0200
          Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 10:32 -0700
      Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-19 09:56 +1000
        Re: a little parsing challenge ☺ Billy Mays <noway@nohow.com> - 2011-07-18 22:07 -0400
          Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-18 19:50 -0700
            Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-19 13:11 +1000
              Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-18 21:59 -0700
                Re: a little parsing challenge ☺ Chris Angelico <rosuav@gmail.com> - 2011-07-19 15:36 +1000
          Re: a little parsing challenge ☺ MRAB <python@mrabarnett.plus.com> - 2011-07-19 04:08 +0100
          Re: a little parsing challenge ☺ Benjamin Kaplan <benjamin.kaplan@case.edu> - 2011-07-18 20:54 -0700
          Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-19 14:30 +1000
          Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 01:58 -0700
      Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 10:14 -0700
        Re: a little parsing challenge ☺ Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com> - 2011-07-19 13:33 -0400
          Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 11:12 -0700
            Re: a little parsing challenge ☺ Terry Reedy <tjreedy@udel.edu> - 2011-07-19 15:09 -0400
              Re: a little parsing challenge ☺ jmfauth <wxjmfauth@gmail.com> - 2011-07-19 23:29 -0700
                Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-20 01:29 -0600
                  Re: a little parsing challenge ☺ jmfauth <wxjmfauth@gmail.com> - 2011-07-20 00:54 -0700
                    Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-20 18:18 +1000
    Re: a little parsing challenge ? sln@netherlands.com - 2011-07-18 12:34 -0700
    Re: a little parsing challenge ☺ Mark Tarver <dr.mtarver@gmail.com> - 2011-07-19 22:43 -0700

#9848

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-07-19 14:30 +1000
Message-ID	<4e25085a$0$29997$c3e8da3$5496439d@news.astraweb.com>
In reply to	#9842

Billy Mays wrote:

> TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem).

Shorter version: FUD.

Yes, having a rich and varied character set requires work. Yes, the Unicode
standard itself, and any interface to it (including Python's) are imperfect
(like anything created by fallible humans). But your post is a long and
tedious list of FUD with not one bit of useful advice.

I'm not going to go through the whole post -- life is too short. But here
are two especially egregious examples showing that you have some fundamental
misapprehensions about what Unicode actually is:

> Python doesn't do Unicode exception handling correctly. (but I
> suspect that its a broader problem with languages) A good example of
> this is with UTF-8 where there are invalid code points ( such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).

and then later:

> Another (this must have been a good laugh amongst the UniDevs) 'feature'
> of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).

This is confused. Unicode text has code points; text which has been encoded
is nothing but bytes, not code points. "UTF-8 code point" does not even
mean anything.

The zero width space has code point U+200B. The bytes you get depend on
which encoding you want:

>>> zws = u'\N{Zero Width Space}'
>>> zws
u'\u200b'
>>> zws.encode('utf-8')
'\xe2\x80\x8b'
>>> zws.encode('utf-16')
'\xff\xfe\x0b '

But regardless of which bytes it is encoded into, ZWS always has just a
single code point: U+200B.

You say "A good example of this is with UTF-8 where there are invalid code
points ( such as 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF" but I don't
even understand why you think this is a problem with Unicode. 

0xC0 is not a code point, it is a byte. Not all combinations of bytes are
legal in all files. If you have byte 0xC0 in a file, it cannot be an ASCII
file: there is no ASCII character represented by byte 0xC0, because hex
0xC0 = 192, which is larger than 127.

Likewise, if you have a 0xC0 byte in a file, it cannot be UTF-8. It is as
simple as that. Trying to treat it as UTF-8 will give an error, just as
trying to view a mp3 file as if it were a jpeg will give an error. Why you
imagine this is a problem for Unicode is beyond me.
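
A minimal sketch of both points, written for Python 3 (an assumption; the sessions quoted in this thread are Python 2). The helper name try_decode and the sample bytes are made up for illustration. It shows one code point producing different bytes under different encodings, and a 0xC0 byte being rejected by both the ASCII and UTF-8 decoders unless a lossy error handler is requested:

```python
# Python 3 sketch; try_decode and the sample data are illustrative only.
zws = "\u200b"  # ZERO WIDTH SPACE: a single code point

# One code point, different bytes depending on the encoding chosen.
utf8_bytes = zws.encode("utf-8")       # three bytes
utf16_bytes = zws.encode("utf-16-le")  # two bytes, no byte order mark


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid for that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


bad = b"abc\xc0def"
print(try_decode(bad, "ascii"))   # None: 0xC0 is above 127
print(try_decode(bad, "utf-8"))   # None: 0xC0 never occurs in valid UTF-8

# A lossy alternative that substitutes U+FFFD instead of raising.
print(bad.decode("utf-8", errors="replace"))
```

Wrapping the decode step like this is also one way to keep a malformed user string from bringing down a long-running program.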

-- 
Steven


#9866

From	Xah Lee <xahlee@gmail.com>
Date	2011-07-19 01:58 -0700
Message-ID	<41feac99-d366-442b-bff5-4d08bf120367@f17g2000prf.googlegroups.com>
In reply to	#9842

On Jul 18, 7:07 pm, Billy Mays <no...@nohow.com> wrote:
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>
> > Billy Mays wrote:
>
> >> On 07/17/2011 03:47 AM, Xah Lee wrote:
> >>> 2011-07-16
>
> >> I gave it a shot.  It doesn't do any of the Unicode delims, because
> >> let's face it, Unicode is for goobers.
>
> > Goobers... that would be one of those new-fangled slang terms that the young
> > kids today use to mean its opposite, like "bad", "wicked" and "sick",
> > correct?
>
> > I mention it only because some people might mistakenly interpret your words
> > as a childish and feeble insult against the 98% of the world who want or
> > need more than the 127 characters of ASCII, rather than understand you
> > meant it as a sign of the utmost respect for the richness and diversity of
> > human beings and their languages, cultures, maths and sciences.
>
> TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem).
>
> As long as I have used python (which I admit has only been 3 years)
> Unicode has never appeared to be implemented correctly.  I'm probably
> repeating old arguments here, but whatever.
>
> Unicode is a mess.  When someone says ASCII, you know that they can only
> mean characters 0-127.  When someone says Unicode, do the mean real
> Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8?
> When using the 'u' datatype with the array module, the docs don't even
> tell you if its 2 bytes wide or 4 bytes.  Which is it?  I'm sure that
> all the of these can be figured out, but the problem is now I have to
> ask every one of these questions whenever I want to use strings.
>
> Secondly, Python doesn't do Unicode exception handling correctly. (but I
> suspect that its a broader problem with languages) A good example of
> this is with UTF-8 where there are invalid code points ( such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).
>
> When embedding Python in a long running application where user input is
> received, it is very easy to make mistake which bring down the whole
> program.  If any user string isn't properly try/excepted, a user could
> craft a malformed string which a UTF-8 decoder would choke on.  Using
> ASCII (or whatever 8 bit encoding) doesn't have these problems since all
> codepoints are valid.
>
> Another (this must have been a good laugh amongst the UniDevs) 'feature'
> of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
> Any string can masquerade as any other string by placing  few of these
> in a string.  Any word filters you might have are now defeated by some
> cheesy Unicode nonsense character.  Can you just just check for these
> characters and strip them out?  Yes.  Should you have to?  I would say no.
>
> Does it get better?  Of course! international character sets used for
> domain name encoding use yet a different scheme (Punycode).  Are the
> following two domain names the same: tést.com , xn--tst-bma.com ?  Who
> knows!
>
> I suppose I can gloss over the pains of using Unicode in C with every
> string needing to be an LPS since 0x00 is now a valid code point in
> UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do
> strlen or concatenation operations.
>
> Can it get even better?  Yep.  We also now need to have a Byte order
> Mark (BOM) to determine the endianness of our characters.  Are they
> little endian or big endian?  (or perhaps one of the two possible middle
> endian encodings?)  Who knows?  String processing with unicode is
> unpleasant to say the least.  I suppose that's what we get when we
> things are designed by committee.
>
> But Hey!  The great thing about standards is that there are so many to
> choose from.
>
> --
> Bill

might check out my take

〈Xah's Unicode Tutorial〉
http://xahlee.org/Periodic_dosage_dir/unicode.html

especially good for emacs users.

if you grew up with english, unicode might seem complex or difficult
due to unfamiliarity.

but for asian people, when you don't have alphabets, it's kinda strange
to think that a byte is a char. The notion simply doesn't exist and is
impossible to establish. There were many encodings for chinese before
unicode. Even today, unicode isn't used in taiwan or china. Taiwan
uses big5, china uses GB18030, which contains all chars of unicode.

~8 years ago i thought that it'd be great if china adopted unicode
sometime in the future... so that we all just have one charset to
deal with. But that's never gonna happen. On the contrary, am thinking
now there's the possibility that the world adopts GB18030 someday. lol
if you go to alexa.com for traffic ranking, a good percentage of the
top few are chinese these days. more and more as i observed since mid
2000s.

by the way, here's what these matching pairs are used for.

‹french quote›
«french quote»

the 〈〉 《》 are chinese brackets used for book titles etc. (CD, TV
program, show title, etc.)
the 「」 『』 are traditional chinese quotes, like english's ‘single
curly’, “double curly”
the 【】 〖〗 〔〕 and a few others are variant brackets, similar to english's
() {} [].

 Xah


#9895

From	Xah Lee <xahlee@gmail.com>
Date	2011-07-19 10:14 -0700
Message-ID	<8085725c-a600-4d64-8533-fd15505f94e1@p12g2000pre.googlegroups.com>
In reply to	#9818

On Jul 18, 10:12 am, Billy Mays
<81282ed9a88799d21e77957df2d84bd6514d9...@myhashismyemail.com> wrote:
> On 07/17/2011 03:47 AM,XahLee wrote:
>
> > 2011-07-16
>
> I gave it a shot.  It doesn't do any of the Unicode delims, because
> let's face it, Unicode is for goobers.
>
> import sys, os
>
> pairs = {'}':'{', ')':'(', ']':'[', '"':'"', "'":"'", '>':'<'}
> valid = set( v for pair in pairs.items() for v in pair )
>
> for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
>      for name in filenames:
>          stack = [' ']
>          with open(os.path.join(dirpath, name), 'rb') as f:
>              chars = (c for line in f for c in line if c in valid)
>              for c in chars:
>                  if c in pairs and stack[-1] == pairs[c]:
>                      stack.pop()
>                  else:
>                      stack.append(c)
>          print ("Good" if len(stack) == 1 else "Bad") + ': %s' % name
>
> --
> Bill

as Ian Kelly mentioned, your script fails because it doesn't report the
position or line/column number of the first mismatched bracket. This is
a rather significant part of this small problem. Avoiding unicode also
lessens the value of this exercise, because handling unicode in python
isn't trivial, at least with respect to this small exercise.
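
For reference, a rough Python 3 sketch of a stack-based check that does report a line and column. It is not anyone's posted solution; the function name first_mismatch and the reporting convention (spelled out in the docstring) are assumptions, since the thread never settles which bracket counts as "first". The bracket set is the one from the original challenge:

```python
# Python 3 sketch; names and conventions here are illustrative assumptions.
OPENERS = "({[“‹«【〈《「『"
CLOSERS = ")}]”›»】〉》」』"
PAIRS = dict(zip(CLOSERS, OPENERS))  # closer -> opener, as in the quoted script


def first_mismatch(text):
    """Return (line, column, char) for the first offending bracket, or None.

    Assumed convention: a closer that does not match the innermost open
    bracket is reported where it occurs; if the input ends with brackets
    still open, the earliest unclosed opener is reported. Lines and
    columns are 1-based and count characters, not bytes.
    """
    stack = []  # entries are (opener, line, column)
    line, col = 1, 0
    for ch in text:
        if ch == "\n":
            line, col = line + 1, 0
            continue
        col += 1
        if ch in OPENERS:
            stack.append((ch, line, col))
        elif ch in PAIRS:
            if not stack or stack[-1][0] != PAIRS[ch]:
                return (line, col, ch)
            stack.pop()
    if stack:
        opener, oline, ocol = stack[0]
        return (oline, ocol, opener)
    return None


print(first_mismatch("foo[(])bar"))
```

To run it over a file, the text would first have to be read as characters, for example with open(path, encoding="utf-8").read().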

I added other unicode brackets to your list of brackets, but it seems
your code still fails to catch a file that has mismatched curly quotes.
(e.g. http://xahlee.org/p/time_machine/tm-ch04.html )

LOL Billy.

 Xah


#9900

From	Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com>
Date	2011-07-19 13:33 -0400
Message-ID	<j04f5h$b2$1@speranza.aioe.org>
In reply to	#9895

On 07/19/2011 01:14 PM, Xah Lee wrote:
> I added other unicode brackets to your list of brackets, but it seems
> your code still fail to catch a file that has mismatched curly quotes.
> (e.g.http://xahlee.org/p/time_machine/tm-ch04.html  )
>
> LOL Billy.
>
>   Xah

I suspect it's due to the file being opened in 'rb' mode.  Also, in 
the dictionary of characters at the top, the closing token is the key, 
while the opening one is the value.  Not sure if that's obvious.
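
A small Python 3 illustration of the 'rb' suspicion (the script itself is Python 2, where iterating a binary line yields one-byte strings rather than integers, but the effect is the same). The sample string is made up. A curly quote is three bytes in UTF-8, so no single element of the raw data can equal the one-character key unless the data is decoded first:

```python
# Python 3 sketch; the sample text is illustrative only.
data = "“quoted”".encode("utf-8")  # roughly what an 'rb' read hands back

# Iterating over bytes yields integers (one-byte strings in Python 2);
# none of them compares equal to a curly-quote character.
byte_hits = [b for b in data if b in ("“", "”")]

# Decoding first yields code points, which do match the keys.
char_hits = [c for c in data.decode("utf-8") if c in ("“", "”")]

print(len(data), byte_hits, char_hits)
```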

Also returning the position of the first mismatched pair is somewhat 
ambiguous.  File systems store files as streams of octets (mine do 
anyways) rather than as characters.  When you ask for the position of 
the first mismatched pair, do you mean the position as per 
file.tell() or do you mean the nth character in the utf-8 stream?
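
To make the two readings concrete, here is a short Python 3 sketch with a made-up sample string. The same bracket sits at one position when counting characters and at another when counting UTF-8 octets, which is the kind of offset a binary-mode file position would give:

```python
# Python 3 sketch; the sample text is illustrative only.
text = "«a»(b"  # a stray "(" after two 2-byte guillemets

char_index = text.index("(")                           # counts code points
byte_offset = len(text[:char_index].encode("utf-8"))   # counts UTF-8 octets

print(char_index, byte_offset)
```

Both numbers are 0-based here; an editor-style "point" would typically be the character count plus one.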

Also, you may have answered this earlier but I'll ask again anyways: You 
ask for the first mismatched pair.  Are you referring to the innermost 
mismatched, or the outermost?  For example, suppose you have this file:

foo[(])bar

Would the "(" be the first mismatched character or would the "]"?

--
Bill


#9906

From	Xah Lee <xahlee@gmail.com>
Date	2011-07-19 11:12 -0700
Message-ID	<71aa9b11-5f47-4c89-bdbb-0fe4cac844e1@s33g2000prg.googlegroups.com>
In reply to	#9900

On Jul 19, 10:33 am, Billy Mays
<81282ed9a88799d21e77957df2d84bd6514d9...@myhashismyemail.com> wrote:
> On 07/19/2011 01:14 PM,XahLee wrote:
>
> > I added other unicode brackets to your list of brackets, but it seems
> > your code still fail to catch a file that has mismatched curly quotes.
> > (e.g.http://xahlee.org/p/time_machine/tm-ch04.html )
>
> > LOL Billy.
>
> >  Xah
>
> I suspect its due to the file mode being opened with 'rb' mode.  Also,
> the diction of characters at the top, the closing token is the key,
> while the opening one is the value.  Not sure if thats obvious.
>
> Also returning the position of the first mismatched pair is somewhat
> ambiguous.  File systems store files as streams of octets (mine do
> anyways) rather than as characters.  When you ask for the position of
> the the first mismatched pair, do you mean the position as per
> file.tell() or do you mean the nth character in the utf-8 stream?
>
> Also, you may have answered this earlier but I'll ask again anyways: You
> ask for the first mismatched pair, Are you referring to the inner most
> mismatched, or the outermost?  For example, suppose you have this file:
>
> foo[(])bar
>
> Would the "(" be the first mismatched character or would the "]"?

yes i haven't been precise. Thanks for bringing it up.

thinking about it now, i think it's a bit hard to define precisely. My
elisp code actually reports the “)”, so it's wrong too. LOL

 Xah


#9916

From	Terry Reedy <tjreedy@udel.edu>
Date	2011-07-19 15:09 -0400
Message-ID	<mailman.1272.1311102598.1164.python-list@python.org>
In reply to	#9906

On 7/19/2011 2:12 PM, Xah Lee wrote:

>> Also, you may have answered this earlier but I'll ask again anyways: You
>> ask for the first mismatched pair, Are you referring to the inner most
>> mismatched, or the outermost?  For example, suppose you have this file:
>>
>> foo[(])bar
>>
>> Would the "(" be the first mismatched character or would the "]"?
>
> yes i haven't been precise. Thanks for brining it up.
>
> thinking about it now, i think it's a bit hard to define precisely.

Then it is hard to code precisely.

> My elisp code actually reports the “)”, so it's wrong too. LOL

This sort of exercise should start with a series of test cases, starting 
with the simplest.

testpairs = (
   ('', True),  # or whatever you want the OK response to be
   ('a', True),
   ('abdsdfdsdff', True),

   ('()', True),  # and so on for each pair of fences
   ('(', False),  # or exact error output wanted
   (')', False),  # and so on

The above could be generated programmatically from the set of pairs that 
should be the input to the program, so that the pairs are not hardcoded 
into the logic.

   ('([)]', ???),
...
   )
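
A possible way to generate those simplest cases from the pair set, sketched in Python 3. The function name basic_cases and the use of True/False as the expected result are assumptions; the interleaved case is deliberately left out because its expected output is exactly what has not been defined yet:

```python
# Python 3 sketch; basic_cases and the PAIRS subset are illustrative only.
PAIRS = ["()", "[]", "{}", "“”"]  # each entry is opener + closer


def basic_cases(pairs):
    """Build (input, expected_ok) tuples for the simplest inputs."""
    cases = [("", True), ("a", True)]
    for opener, closer in pairs:
        cases.append((opener + closer, True))  # a matched pair
        cases.append((opener, False))          # opener never closed
        cases.append((closer, False))          # closer never opened
    return cases


for text, expected in basic_cases(PAIRS):
    print(repr(text), expected)
```

Adding a new fence type then means adding one entry to PAIRS rather than editing the test list by hand.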

-- 
Terry Jan Reedy


#9940

From	jmfauth <wxjmfauth@gmail.com>
Date	2011-07-19 23:29 -0700
Message-ID	<fe661969-c7dc-48d8-97bd-5a3b469b13fe@en1g2000vbb.googlegroups.com>
In reply to	#9916

On 19 juil, 21:09, Terry Reedy <tjre...@udel.edu> wrote:
> On 7/19/2011 2:12 PM, Xah Lee wrote:
>
> >> Also, you may have answered this earlier but I'll ask again anyways: You
> >> ask for the first mismatched pair, Are you referring to the inner most
> >> mismatched, or the outermost?  For example, suppose you have this file:
>
> >> foo[(])bar
>
> >> Would the "(" be the first mismatched character or would the "]"?
>
> > yes i haven't been precise. Thanks for brining it up.
>
> > thinking about it now, i think it's a bit hard to define precisely.
>
> Then it is hard to code precisely.
>

Not really. The trick is to count the different opener/closer
separately.
That is what I am doing to check balanced brackets in
chemical formulas. The rules are, however, not the same
as in math.
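
One possible reading of "count separately", sketched in Python 3. This is an assumption about the approach, not the poster's actual code, and the function name is made up. Each bracket type gets its own running depth that must never go negative and must end at zero. Note that this accepts interleaved input such as ([)], which may or may not be acceptable depending on the rules in force:

```python
# Python 3 sketch; one interpretation of per-type counting, names illustrative.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSER_TO_OPENER = {c: o for o, c in PAIRS.items()}


def balanced_by_counting(text):
    """True if each bracket type, taken on its own, opens before it closes."""
    depth = {opener: 0 for opener in PAIRS}
    for ch in text:
        if ch in depth:
            depth[ch] += 1
        elif ch in CLOSER_TO_OPENER:
            opener = CLOSER_TO_OPENER[ch]
            depth[opener] -= 1
            if depth[opener] < 0:  # closer seen before its opener
                return False
    return all(d == 0 for d in depth.values())


print(balanced_by_counting("Ca(OH)2"), balanced_by_counting("([)]"))
```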

Interestingly, I ran into this "problem". enumerate() is very
nice for parsing a string from left to right.

>>> for i, c in enumerate('abcd'):
...     print i, c
...
0 a
1 b
2 c
3 d
>>>

But, if I want to parse a string from right to left,
what's the trick?
The best I found so far:

>>> s = 'abcd'
>>> for i, c in enumerate(reversed(s)):
...     print len(s) - 1 - i, c
...
3 d
2 c
1 b
0 a
>>>


#9947

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2011-07-20 01:29 -0600
Message-ID	<mailman.1283.1311146981.1164.python-list@python.org>
In reply to	#9940

On Wed, Jul 20, 2011 at 12:29 AM, jmfauth <wxjmfauth@gmail.com> wrote:
>> Then it is hard to code precisely.
>>
>
> Not really. The trick is to count the different opener/closer
> separately.
> That is what I am doing to check balanced brackets in
> chemical formulas. The rules are howerver not the same
> as in math.

I think the difficulty is not in the algorithm, but in adhering to the
desired output when it is ambiguously described.

> But, if I want to parse a string from right to left,
> what's the trick?
> The best I found so far:
>
>>>> s = 'abcd'
>>>> for i, c in enumerate(reversed(s)):
> ...     print len(s) - 1 - i, c

That violates DRY, since you have reversal logic in the iterator
algebra and then again in the loop body.  I prefer to keep all such
logic in the iterator algebra, if possible.  This is one possibility,
if you don't mind it building an intermediate list:

>>> for i, c in reversed(list(enumerate(s))):
...

Otherwise, here's another non-DRY solution:

>>> from itertools import izip
>>> for i, c in izip(reversed(xrange(len(s))), reversed(s)):
...

Unfortunately, this is one space where there just doesn't seem to be a
single obvious way to do it.
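
For anyone reading this with Python 3 (an assumption; the snippets above are Python 2), izip and xrange are gone and the built-in zip and range are lazy, so the same idea can be wrapped in a small helper. The name renumerate is made up for illustration:

```python
# Python 3 sketch; renumerate is a hypothetical helper name.
def renumerate(seq):
    """Yield (index, item) pairs from the last element to the first."""
    return zip(range(len(seq) - 1, -1, -1), reversed(seq))


for i, c in renumerate("abcd"):
    print(i, c)
```

This keeps the reversal logic in one place and builds no intermediate list, though it only works for sequences that support len() and reversed().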


#9950

From	jmfauth <wxjmfauth@gmail.com>
Date	2011-07-20 00:54 -0700
Message-ID	<e500e9ba-121a-4f6d-a70a-5466b7b583e6@e7g2000vbj.googlegroups.com>
In reply to	#9947

On 20 juil, 09:29, Ian Kelly <ian.g.ke...@gmail.com> wrote:

> Otherwise, here's another non-DRY solution:
>
> >>> from itertools import izip
> >>> for i, c in izip(reversed(xrange(len(s))), reversed(s)):
>
> ...
>
> Unfortunately, this is one space where there just doesn't seem to be a
> single obvious way to do it.

Well, I see. Thanks.

There is still the good old solution, which is in fact what I'm using.

>>> s = 'abcd'
>>> for i in xrange(len(s)-1, -1, -1):
...     print i, s[i]
...
3 d
2 c
1 b
0 a
>>>

---

DRY?  acronym for ?


#9951

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-07-20 18:18 +1000
Message-ID	<4e268f62$0$29987$c3e8da3$5496439d@news.astraweb.com>
In reply to	#9950

On Wed, 20 Jul 2011 05:54 pm jmfauth wrote:

> DRY?  acronym for ?

I'd like to tell you, but I already told somebody else... 

*grins*


http://en.wikipedia.org/wiki/Don't_repeat_yourself
http://c2.com/cgi/wiki?DontRepeatYourself




-- 
Steven


#9829 — Re: a little parsing challenge ?

From	sln@netherlands.com
Date	2011-07-18 12:34 -0700
Subject	Re: a little parsing challenge ?
Message-ID	<482927hpr2mr4au3c0f98rkaghnna9230q@4ax.com>
In reply to	#9680

On Sun, 17 Jul 2011 00:47:42 -0700 (PDT), Xah Lee <xahlee@gmail.com> wrote:

>2011-07-16
>
>folks, this one will be interesting one.
>
>the problem is to write a script that can check a dir of text files
>(and all subdirs) and reports if a file has any mismatched matching
>brackets.
>
[snip]
>i hope you'll participate. Just post solution here. Thanks.
>

I have to hunt for a job so I'm not writing a solution for you.
Here is a thin regex framework that may get you started.

-sln

---------------------

use strict;
use warnings;

 my @samples = qw(
  A98(y[(np)r]x)tp[kk]a.exeb
  A98(y[(np)r]x)tp[kk]a}.exeb
  A98(‹ynprx)tpk›ka.mpeg
  ‹A98(ynprx)tpk›ka
  “A9«8(yn«pr{{[g[x].}*()+}»)tpkka».”
  “A9«8(yn«pr{{[g[x].]}*()+}»)tpkka».”
  “A9«8(yn«pr»)tpkka».”
  “A9«8(yn«pr»)»”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»”
  “A9«8(yn«pr»)”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»”
 );

 my $regex = qr/

  ^ (?&FileName) $

  (?(DEFINE)

      (?<Delim>     
            \( (?&Content) \)
          | \{ (?&Content) \}
          | \[ (?&Content) \]
          | \“ (?&Content) \”
          | \‹ (?&Content) \›
          | \« (?&Content) \»
             # add more here ..
      )

      (?<Content>
           (?:  (?> [^(){}\[\]“”‹›«»]+ ) # add more here ..
              | (?&Delim)
           )*
      ) 

      (?<FileName>
           (?&Content)
      )
    )
 /x;


 for (@samples)
 {
    print "$_ - ";
    if ( /$regex/ ) {
       print "passed \n";
    }
    else {
       print "failed \n";
    }
 }

__END__

Output:

A98(y[(np)r]x)tp[kk]a.exeb - passed 
A98(y[(np)r]x)tp[kk]a}.exeb - failed 
A98(‹ynprx)tpk›ka.mpeg - failed 
‹A98(ynprx)tpk›ka - passed 
“A9«8(yn«pr{{[g[x].}*()+}»)tpkka».” - failed 
“A9«8(yn«pr{{[g[x].]}*()+}»)tpkka».” - passed 
“A9«8(yn«pr»)tpkka».” - passed 
“A9«8(yn«pr»)»”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»” - passed 
“A9«8(yn«pr»)”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»” - failed


#9941

From	Mark Tarver <dr.mtarver@gmail.com>
Date	2011-07-19 22:43 -0700
Message-ID	<189130e4-9feb-4d4a-8c31-0e6778b7bba0@15g2000vbw.googlegroups.com>
In reply to	#9680

On Jul 17, 8:47 am, Xah Lee <xah...@gmail.com> wrote:
> 2011-07-16
>
> folks, this one will be interesting one.
>
> the problem is to write a script that can check a dir of text files
> (and all subdirs) and reports if a file has any mismatched matching
> brackets.
>
> • The files will be utf-8 encoded (unix style line ending).
>
> • If a file has mismatched matching-pairs, the script will display the
> file name, and the  line number and column number of the first
> instance where a mismatched bracket occures. (or, just the char number
> instead (as in emacs's “point”))
>
> • the matching pairs are all single unicode chars. They are these and
> nothing else: () {} [] “” ‹› «» 【】 〈〉 《》 「」 『』
> Note that ‘single curly quote’ is not consider matching pair here.
>
> • You script must be standalone. Must not be using some parser tools.
> But can call lib that's part of standard distribution in your lang.
>
> Here's a example of mismatched bracket: ([)], (“[[”), ((, 】etc. (and
> yes, the brackets may be nested. There are usually text between these
> chars.)
>
> I'll be writing a emacs lisp solution and post in 2 days. Ι welcome
> other lang implementations. In particular, perl, python, php, ruby,
> tcl, lua, Haskell, Ocaml. I'll also be able to eval common lisp
> (clisp) and Scheme lisp (scsh), Java. Other lang such as Clojure,
> Scala, C, C++, or any others, are all welcome, but i won't be able to
> eval it. javascript implementation will be very interesting too, but
> please indicate which and where to install the command line version.
>
> I hope you'll find this a interesting “challenge”. This is a parsing
> problem. I haven't studied parsers except some Wikipedia reading, so
> my solution will probably be naive. I hope to see and learn from your
> solution too.
>
> i hope you'll participate. Just post solution here. Thanks.
>
>  Xah

Parsing technology based on BNF enables an elegant solution.  First
take a basic bracket balancing program which parenthesises the
contents of the input.  e.g. in Shen-YACC

(defcc <br>
   "(" <br> ")" <br$> := [<br> | <br$>];
   <item> <br>;
   <e> := [];)

(defcc <br$>
  <br>;)

(defcc <item>
  -*- := (if (element? -*- ["(" ")"]) (fail) [-*-]);)

Given (compile <br> ["(" 1 2 3 ")" 4]) the program produces [[1 2 3]
4]. When this program is used to parse the input, whatever residue is
left indicates where the parse has failed.  In Shen-YACC

(define tellme
  Stuff -> (let Br (<br> (@p Stuff []))
                Residue (fst Br)
                (if (empty? Residue)
                    (snd Br)
                    (error "parse failure at position ~A~%"
                          (- (length Stuff) (length Residue))))))

e.g.

(tellme ["(" 1 2 3 ")" "(" 4])
parse failure at position 5

(tellme ["(" 1 2 3 ")" "(" ")" 4])
[[1 2 3] [] 4]

The extension of this program to the case described is fairly simple.
Qi-YACC is very similar.
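
For readers without Shen, a rough Python 3 analogue of the same idea, written as a hand-rolled recursive descent rather than a grammar. It is only a sketch under assumptions: the names parse_br and tellme mirror the ones above, but the behaviour on nested failures (it reports the outermost unmatched opener) is a choice made here and may not match what Shen-YACC does:

```python
# Python 3 sketch; an approximation of the residue idea, not a port of Shen-YACC.
def parse_br(tokens, i=0):
    """Parse tokens from index i; return (tree, index where parsing stopped)."""
    tree = []
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            sub, j = parse_br(tokens, i + 1)
            if j >= len(tokens) or tokens[j] != ")":
                return tree, i  # stop in front of the unmatched "("
            tree.append(sub)
            i = j + 1
        elif tok == ")":
            return tree, i      # leave the closer for the caller
        else:
            tree.append(tok)
            i += 1
    return tree, i


def tellme(tokens):
    """Return the nested structure, or raise with the position of the residue."""
    tree, stop = parse_br(tokens)
    if stop != len(tokens):
        raise ValueError("parse failure at position %d" % stop)
    return tree


print(tellme(["(", 1, 2, 3, ")", "(", ")", 4]))
```

The position reported is the number of tokens consumed before the unparsed residue, the same quantity as the length difference computed in tellme above.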

Nice problem.

I do not have further time to correspond right now.

Mark

