| Started by | Xah Lee <xahlee@gmail.com> |
|---|---|
| First post | 2011-07-17 00:47 -0700 |
| Last post | 2011-07-19 22:43 -0700 |
| Articles | 20 on this page of 72 — 28 participants |
a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-17 00:47 -0700
Re: a little parsing challenge ☺ Raymond Hettinger <python@rcn.com> - 2011-07-17 02:48 -0700
Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-17 15:20 +0200
Re: a little parsing challenge ☺ mhenn <michihenn@hotmail.com> - 2011-07-17 15:55 +0200
Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-17 18:01 +0200
Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-17 18:54 +0200
Re: a little parsing challenge ☺ Thomas Boell <tboell@domain.invalid> - 2011-07-17 17:49 +0200
Re: a little parsing challenge ☺ Raymond Hettinger <python@rcn.com> - 2011-07-17 12:16 -0700
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-18 07:39 -0700
Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-20 08:23 +0200
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-20 03:31 -0700
Re: a little parsing challenge ☺ "Uri Guttman" <uri@StemSystems.com> - 2011-07-20 12:31 -0400
Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-20 10:30 -0700
Re: a little parsing challenge ☺ merlyn@stonehenge.com (Randal L. Schwartz) - 2011-07-20 12:06 -0700
Re: a little parsing challenge ☺ Jason Earl <jearl@notengoamigos.org> - 2011-07-20 14:57 -0600
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 09:54 -0700
Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-19 20:07 +0200
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 05:58 -0700
Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-21 08:26 -0600
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 08:36 -0700
Re: a little parsing challenge ☺ python@bdurham.com - 2011-07-21 12:43 -0400
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 11:53 -0700
Re: a little parsing challenge ☺ Terry Reedy <tjreedy@udel.edu> - 2011-07-21 18:37 -0400
Re: a little parsing challenge ☺ John O'Hagan <research@johnohagan.com> - 2011-07-25 15:57 +1000
Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-19 12:08 -0600
Re: a little parsing challenge ☺ Chris Angelico <rosuav@gmail.com> - 2011-07-17 21:34 +1000
Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-17 04:52 -0700
Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-17 16:15 +0200
Re: a little parsing challenge ☺ Raymond Hettinger <python@rcn.com> - 2011-07-17 12:18 -0700
Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-17 22:16 +0200
Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-17 22:57 +0200
Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-17 23:43 +0200
Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-18 03:09 -0400
Re: a little parsing challenge ☺ Stefan Behnel <stefan_ml@behnel.de> - 2011-07-18 09:24 +0200
Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-18 04:04 -0400
Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-18 18:46 +0200
Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-18 14:14 -0400
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 06:23 -0700
Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-21 17:54 -0400
Re: a little parsing challenge ☺ gene heskett <gheskett@wdtv.com> - 2011-07-17 10:26 -0400
Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-17 08:31 -0700
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 10:49 -0700
Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-19 20:14 +0200
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 05:29 -0700
Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-21 15:21 +0200
Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-19 20:17 +0200
Re: a little parsing challenge ☺ rantingrick <rantingrick@gmail.com> - 2011-07-17 18:52 -0700
Re: a little parsing challenge ☺ Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com> - 2011-07-18 13:12 -0400
Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-18 12:10 -0600
Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-18 23:59 +0200
Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-19 08:09 +0200
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 10:32 -0700
Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-19 09:56 +1000
Re: a little parsing challenge ☺ Billy Mays <noway@nohow.com> - 2011-07-18 22:07 -0400
Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-18 19:50 -0700
Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-19 13:11 +1000
Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-18 21:59 -0700
Re: a little parsing challenge ☺ Chris Angelico <rosuav@gmail.com> - 2011-07-19 15:36 +1000
Re: a little parsing challenge ☺ MRAB <python@mrabarnett.plus.com> - 2011-07-19 04:08 +0100
Re: a little parsing challenge ☺ Benjamin Kaplan <benjamin.kaplan@case.edu> - 2011-07-18 20:54 -0700
Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-19 14:30 +1000
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 01:58 -0700
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 10:14 -0700
Re: a little parsing challenge ☺ Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com> - 2011-07-19 13:33 -0400
Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 11:12 -0700
Re: a little parsing challenge ☺ Terry Reedy <tjreedy@udel.edu> - 2011-07-19 15:09 -0400
Re: a little parsing challenge ☺ jmfauth <wxjmfauth@gmail.com> - 2011-07-19 23:29 -0700
Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-20 01:29 -0600
Re: a little parsing challenge ☺ jmfauth <wxjmfauth@gmail.com> - 2011-07-20 00:54 -0700
Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-20 18:18 +1000
Re: a little parsing challenge ☺ sln@netherlands.com - 2011-07-18 12:34 -0700
Re: a little parsing challenge ☺ Mark Tarver <dr.mtarver@gmail.com> - 2011-07-19 22:43 -0700
| From | Thomas Jollans <t@jollybox.de> |
|---|---|
| Date | 2011-07-17 08:31 -0700 |
| Message-ID | <7ae55705-6342-4f06-add8-59de05d111b9@x12g2000yql.googlegroups.com> |
| In reply to | #9680 |
On Jul 17, 9:47 am, Xah Lee <xah...@gmail.com> wrote:
> 2011-07-16
>
> folks, this one will be interesting one.
>
> the problem is to write a script that can check a dir of text files
> (and all subdirs) and reports if a file has any mismatched matching
> brackets.
>
> • The files will be utf-8 encoded (unix style line ending).
>
> • If a file has mismatched matching-pairs, the script will display the
> file name, and the line number and column number of the first
> instance where a mismatched bracket occurs. (or, just the char number
> instead (as in emacs's “point”))
>
> • the matching pairs are all single unicode chars. They are these and
> nothing else: () {} [] “” ‹› «» 【】 〈〉 《》 「」 『』
> Note that ‘single curly quote’ is not considered a matching pair here.
>
> • Your script must be standalone. Must not be using some parser tools.
> But can call lib that's part of standard distribution in your lang.
>
> Here's a example of mismatched bracket: ([)], (“[[”), ((, 】etc. (and
> yes, the brackets may be nested. There are usually text between these
> chars.)
>
> I'll be writing a emacs lisp solution and post in 2 days. Ι welcome
> other lang implementations. In particular, perl, python, php, ruby,
> tcl, lua, Haskell, Ocaml. I'll also be able to eval common lisp
> (clisp) and Scheme lisp (scsh), Java. Other lang such as Clojure,
> Scala, C, C++, or any others, are all welcome, but i won't be able to
> eval it. javascript implementation will be very interesting too, but
> please indicate which and where to install the command line version.
>
> I hope you'll find this a interesting “challenge”. This is a parsing
> problem. I haven't studied parsers except some Wikipedia reading, so
> my solution will probably be naive. I hope to see and learn from your
> solution too.
>
> i hope you'll participate. Just post solution here. Thanks.
I thought I'd have some fun with multi-processing:
https://gist.github.com/1087682
| From | Xah Lee <xahlee@gmail.com> |
|---|---|
| Date | 2011-07-19 10:49 -0700 |
| Message-ID | <22a6d549-b490-4534-a076-94cb6f21f8ce@h7g2000prf.googlegroups.com> |
| In reply to | #9712 |
On Jul 17, 8:31 am, Thomas Jollans <t...@jollybox.de> wrote:
> I thought I'd have some fun with multi-processing:
>
> https://gist.github.com/1087682
hi Thomas. I ran the program, all cpu went max (i have a quad), but
after i think 3 minutes nothing happens, so i killed it.
is there something special one should know to run the script?
I'm using Python 3.2.1 on Windows 7.
Xah
| From | Thomas Jollans <t@jollybox.de> |
|---|---|
| Date | 2011-07-19 20:14 +0200 |
| Message-ID | <mailman.1268.1311099294.1164.python-list@python.org> |
| In reply to | #9901 |
On 19/07/11 19:49, Xah Lee wrote:
> On Jul 17, 8:31 am, Thomas Jollans <t...@jollybox.de> wrote:
>>
>> I thought I'd have some fun with multi-processing:
>>
>> https://gist.github.com/1087682
>
> hi Thomas. I ran the program, all cpu went max (i have a quad), but
> after i think 3 minutes nothing happens, so i killed it.
>
> is there something special one should know to run the script?
>
> I'm using Python 3.2.1 on Windows 7.
>
> Xah

Well, it overdoes the multi-processing “a little”. Checking each
character in a separate process might have been overkill.

Here's a sane version:
https://gist.github.com/1087682/2240a0834463d490c29ed0f794ad15128849ff8e

old, crazy version:
https://gist.github.com/1087682/6841c3875f7e88c23e0a053ac0d0f0565d8713e2
| From | Xah Lee <xahlee@gmail.com> |
|---|---|
| Date | 2011-07-21 05:29 -0700 |
| Message-ID | <1fac9c74-33ea-4380-a060-d20ee5aff971@s33g2000prg.googlegroups.com> |
| In reply to | #9907 |
On Jul 19, 11:14 am, Thomas Jollans <t...@jollybox.de> wrote:
> I thought I'd have some fun with multi-processing:

Nice joke. ☺

> Here's a sane version:
>
> https://gist.github.com/1087682/2240a0834463d490c29ed0f794ad15128849ff8e

hi thomas,

i still cant get your code to work. I have a dir named xxdir with a
single test file xx.txt,with this content:

foo[(])bar

when i run your code
py3 validate_brackets_Thomas_Jollans_2.py

it simply exit and doesn't seem to do anything. I modded your code to
print the file name it's proccessing. Apparently it did process it.

my python isn't strong else i'd dive in. Thanks.

I'm on Python 3.2.1. Here's a shell log:

h3@H3-HP 2011-07-21 05:20:30 ~/web/xxst/find_elisp/validate matching brackets
py3 validate_brackets_Thomas_Jollans_2.py
h3@H3-HP 2011-07-21 05:20:34 ~/web/xxst/find_elisp/validate matching brackets
py3 validate_brackets_Thomas_Jollans_2.py
c:/Users/h3/web/xxst/find_elisp/validate matching brackets/xxdir\xx.txt
h3@H3-HP 2011-07-21 05:21:59 ~/web/xxst/find_elisp/validate matching brackets
py3 --version
Python 3.2.1
h3@H3-HP 2011-07-21 05:27:03 ~/web/xxst/find_elisp/validate matching brackets

Xah
| From | Thomas Jollans <t@jollybox.de> |
|---|---|
| Date | 2011-07-21 15:21 +0200 |
| Message-ID | <mailman.1319.1311254485.1164.python-list@python.org> |
| In reply to | #10020 |
On 21/07/11 14:29, Xah Lee wrote:
> On Jul 19, 11:14 am, Thomas Jollans <t...@jollybox.de> wrote:
>> I thought I'd have some fun with multi-processing:
>
> Nice joke. ☺
>
>> Here's a sane version:
>>
>> https://gist.github.com/1087682/2240a0834463d490c29ed0f794ad15128849ff8e
>
> hi thomas,
>
> i still cant get your code to work. I have a dir named xxdir with a
> single test file xx.txt,with this content:
>
> foo[(])bar
>
> when i run your code
> py3 validate_brackets_Thomas_Jollans_2.py
>
> it simply exit and doesn't seem to do anything. I modded your code to
> print the file name it's proccessing. Apparently it did process it.
>
> my python isn't strong else i'd dive in. Thanks.
Curious. Perhaps, in the Windows version of Python, subprocesses don't
use the same stdout? Windows doesn't have fork() (how could they
survive?), so who knows. Try replacing
    ex.submit(process_file, fullname)
with
    process_file(fullname)
for a non-concurrent version.
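
A second thing that may be worth ruling out (an assumption, since the gist itself is not reproduced in this thread): `Executor.submit` returns a future, and an exception raised inside the submitted call is stored on that future rather than printed. If nothing ever calls `result()`, a failing task looks exactly like a script that "doesn't seem to do anything". The sketch below uses a made-up `process_file` stand-in and a hypothetical `run_all` helper; it defaults to a thread pool so it runs anywhere, and `ProcessPoolExecutor` can be passed instead (on Windows that additionally needs the usual `if __name__ == '__main__':` guard around the code that starts the pool).

```python
from concurrent.futures import ThreadPoolExecutor


def process_file(path):
    # Hypothetical stand-in for the real per-file check.
    if path.endswith(".bad"):
        raise ValueError("cannot check " + path)
    return path + ": ok"


def run_all(paths, executor_factory=ThreadPoolExecutor):
    """Submit every path and surface each task's result or error."""
    results = []
    with executor_factory(max_workers=2) as ex:
        futures = [ex.submit(process_file, p) for p in paths]
        for fut in futures:
            try:
                # result() re-raises whatever the task raised.
                results.append(fut.result())
            except Exception as exc:
                results.append("error: %s" % exc)
    return results


if __name__ == "__main__":
    for line in run_all(["a.txt", "b.bad"]):
        print(line)
```

Collecting the futures and calling `result()` on each keeps the concurrent version while making failures visible.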
>
> I'm on Python 3.2.1. Here's a shell log:
>
> h3@H3-HP 2011-07-21 05:20:30 ~/web/xxst/find_elisp/validate matching
> brackets
> py3 validate_brackets_Thomas_Jollans_2.py
> h3@H3-HP 2011-07-21 05:20:34 ~/web/xxst/find_elisp/validate matching
> brackets
> py3 validate_brackets_Thomas_Jollans_2.py
> c:/Users/h3/web/xxst/find_elisp/validate matching brackets/xxdir
> \xx.txt
> h3@H3-HP 2011-07-21 05:21:59 ~/web/xxst/find_elisp/validate matching
> brackets
> py3 --version
> Python 3.2.1
> h3@H3-HP 2011-07-21 05:27:03 ~/web/xxst/find_elisp/validate matching
> brackets
>
> Xah
| From | Thomas Jollans <t@jollybox.de> |
|---|---|
| Date | 2011-07-19 20:17 +0200 |
| Message-ID | <mailman.1269.1311099446.1164.python-list@python.org> |
| In reply to | #9901 |
Oh, by the way:

On 19/07/11 19:49, Xah Lee wrote:
> I ran the program, all cpu went max

Mission accomplished.
| From | rantingrick <rantingrick@gmail.com> |
|---|---|
| Date | 2011-07-17 18:52 -0700 |
| Message-ID | <741a7641-7833-4d51-a17b-548a930b70f4@u26g2000vby.googlegroups.com> |
| In reply to | #9680 |
On Jul 17, 2:47 am, Xah Lee <xah...@gmail.com> wrote:
> 2011-07-16
>
> folks, this one will be interesting one.
>
> the problem is to write a script that can check a dir of text files
> (and all subdirs) and reports if a file has any mismatched matching
> brackets.
>
>[...]
>
> • You script must be standalone. Must not be using some parser tools.
> But can call lib that's part of standard distribution in your lang.

I stopped reading here and did...

>>> from HyperParser import HyperParser # python2.x

...and called it a day. ;-) This module is part of the stdlib
(idlelib\HyperParser) so as per your statement it is legal (may not be
the fastest solution).
| From | Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com> |
|---|---|
| Date | 2011-07-18 13:12 -0400 |
| Message-ID | <j01ph6$knt$1@speranza.aioe.org> |
| In reply to | #9680 |
On 07/17/2011 03:47 AM, Xah Lee wrote:
> 2011-07-16
I gave it a shot. It doesn't do any of the Unicode delims, because
let's face it, Unicode is for goobers.
import sys, os

# Map each closing delimiter to its opener (quotes pair with themselves).
pairs = {'}':'{', ')':'(', ']':'[', '"':'"', "'":"'", '>':'<'}
valid = set( v for pair in pairs.items() for v in pair )

for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
    for name in filenames:
        stack = [' ']  # sentinel, so stack[-1] always exists
        with open(os.path.join(dirpath, name), 'rb') as f:
            chars = (c for line in f for c in line if c in valid)
            for c in chars:
                if c in pairs and stack[-1] == pairs[c]:
                    stack.pop()
                else:
                    stack.append(c)
        # Python 2 print statement; only the sentinel is left if all matched
        print ("Good" if len(stack) == 1 else "Bad") + ': %s' % name
--
Bill
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2011-07-18 12:10 -0600 |
| Message-ID | <mailman.1223.1311012663.1164.python-list@python.org> |
| In reply to | #9818 |
On Mon, Jul 18, 2011 at 11:12 AM, Billy Mays
<81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com> wrote:
> I gave it a shot. It doesn't do any of the Unicode delims, because let's
> face it, Unicode is for goobers.

Uh, okay...

Your script also misses the requirement of outputting the index or row
and column of the first mismatched bracket.
| From | Thomas 'PointedEars' Lahn <PointedEars@web.de> |
|---|---|
| Date | 2011-07-18 23:59 +0200 |
| Message-ID | <2075498.BysXYHRu7a@PointedEars.de> |
| In reply to | #9821 |
Ian Kelly wrote:
> Billy Mays wrote:
>> I gave it a shot. It doesn't do any of the Unicode delims, because let's
>> face it, Unicode is for goobers.
>
> Uh, okay...
>
> Your script also misses the requirement of outputting the index or row
> and column of the first mismatched bracket.
Thanks to Python's expressiveness, this can be easily remedied (see below).
I also do not follow Billy's comment about Unicode. Unicode and the fact
that Python supports it *natively* cannot be appreciated enough in a
globalized world.
However, I have learned a lot about being pythonic from his posting (take
those generator expressions, for example!), and the idea of looking at the
top of a stack for reference is a really good one. Thank you, Billy!
Here is my improvement of his code, which should fill the mentioned gaps.
I have also reversed the order in the report line as I think it is more
natural this way. I have tested the code superficially with a directory
containing a single text file. Watch for word-wrap:
# encoding: utf-8
'''
Created on 2011-07-18

@author: Thomas 'PointedEars' Lahn <PointedEars@web.de>, based on an idea of
Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com>
in <news:j01ph6$knt$1@speranza.aioe.org>
'''
import sys, os

pairs = {u'}': u'{', u')': u'(', u']': u'[',
         u'”': u'“', u'›': u'‹', u'»': u'«',
         u'】': u'【', u'〉': u'〈', u'》': u'《',
         u'」': u'「', u'』': u'『'}
valid = set(v for pair in pairs.items() for v in pair)

if __name__ == '__main__':
    for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
        for name in filenames:
            stack = [' ']

            # you can use chardet etc. instead
            encoding = 'utf-8'

            with open(os.path.join(dirpath, name), 'r') as f:
                reported = False
                chars = ((c, line_no, col) for line_no, line in enumerate(f)
                    for col, c in enumerate(line.decode(encoding)) if c in valid)
                for c, line_no, col in chars:
                    if c in pairs:
                        if stack[-1] == pairs[c]:
                            stack.pop()
                        else:
                            if not reported:
                                first_bad = (c, line_no + 1, col + 1)
                                reported = True
                    else:
                        stack.append(c)

            print '%s: %s' % (name, ("good" if len(stack) == 1 else "bad '%s' at %s:%s" % first_bad))
--
PointedEars
Bitte keine Kopien per E-Mail. / Please do not Cc: me.
| From | Thomas 'PointedEars' Lahn <PointedEars@web.de> |
|---|---|
| Date | 2011-07-19 08:09 +0200 |
| Message-ID | <42296484.Y7NjKpOp6t@PointedEars.de> |
| In reply to | #9837 |
Thomas 'PointedEars' Lahn wrote:
> with open(os.path.join(dirpath, name), 'r') as f:
SHOULD be
with open(os.path.join(dirpath, name), 'rb') as f:
(as in the original), else some code units might not be read properly.
--
PointedEars
Bitte keine Kopien per E-Mail. / Please do not Cc: me.
| From | Xah Lee <xahlee@gmail.com> |
|---|---|
| Date | 2011-07-19 10:32 -0700 |
| Message-ID | <065729ac-d522-4da6-87ad-915be14e5ff4@e20g2000prf.googlegroups.com> |
| In reply to | #9837 |
On Jul 18, 2:59 pm, Thomas 'PointedEars' Lahn <PointedE...@web.de>
wrote:
> [...]
Thanks for the fix.
Though, it seems still wrong.
On the file http://xahlee.org/p/time_machine/tm-ch04.html
there is a mismatched curly double quote at 28319.
the script reports:
tm-ch04.html: bad ')' at 68:2
that doesn't seem right. Line 68 is empty. There's no opening or
closing round bracket anywhere close. Nearest are lines 11 and 127.
Maybe Billy Mays's algorithm is wrong.
Xah (fairly discouraged now, after running 3 python scripts all
failed)
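
One general point that may help when reading such reports: a stack-based check only notices a problem when a closing bracket arrives, so an opener left unclosed much earlier can show up as a complaint about a later, unrelated-looking closer. The following is a minimal Python 3 sketch, not any poster's code; the name `first_mismatch` and its return shape are made up for illustration. It reports the offending closer together with the opener it collided with, using the bracket pairs from the challenge.

```python
# Closing bracket -> opening bracket, as listed in the challenge.
PAIRS = {
    ")": "(", "}": "{", "]": "[",
    "\u201d": "\u201c", "\u203a": "\u2039", "\u00bb": "\u00ab",
    "\u3011": "\u3010", "\u3009": "\u3008", "\u300b": "\u300a",
    "\u300d": "\u300c", "\u300f": "\u300e",
}
OPENERS = set(PAIRS.values())


def first_mismatch(text):
    """Return (closer, opener) for the first mismatch, or None if balanced.

    closer and opener are (char, line, col) tuples, 1-based, or None:
    closer is None when an opener is simply never closed, and opener is
    None when a closer arrives with nothing open.
    """
    stack = []  # positions of openers that are still unclosed
    line, col = 1, 0
    for ch in text:
        if ch == "\n":
            line += 1
            col = 0
            continue
        col += 1
        if ch in OPENERS:
            stack.append((ch, line, col))
        elif ch in PAIRS:
            if not stack or stack[-1][0] != PAIRS[ch]:
                return (ch, line, col), (stack[-1] if stack else None)
            stack.pop()
    # End of text: report the earliest opener that was never closed.
    return (None, stack[0]) if stack else None


if __name__ == "__main__":
    print(first_mismatch("foo[(])bar"))
```

For a file, the text could be read with `open(path, encoding="utf-8").read()` before calling the function; printing the second tuple as well shows where the unclosed opener actually sits.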
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2011-07-19 09:56 +1000 |
| Message-ID | <4e24c823$0$29981$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #9818 |
Billy Mays wrote:
> On 07/17/2011 03:47 AM, Xah Lee wrote:
>> 2011-07-16
>
> I gave it a shot. It doesn't do any of the Unicode delims, because
> let's face it, Unicode is for goobers.

Goobers... that would be one of those new-fangled slang terms that the young
kids today use to mean its opposite, like "bad", "wicked" and "sick",
correct?

I mention it only because some people might mistakenly interpret your words
as a childish and feeble insult against the 98% of the world who want or
need more than the 127 characters of ASCII, rather than understand you
meant it as a sign of the utmost respect for the richness and diversity of
human beings and their languages, cultures, maths and sciences.

--
Steven
| From | Billy Mays <noway@nohow.com> |
|---|---|
| Date | 2011-07-18 22:07 -0400 |
| Message-ID | <j02oth$f4c$1@speranza.aioe.org> |
| In reply to | #9841 |
On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
> Billy Mays wrote:
>
>> On 07/17/2011 03:47 AM, Xah Lee wrote:
>>> 2011-07-16
>>
>> I gave it a shot. It doesn't do any of the Unicode delims, because
>> let's face it, Unicode is for goobers.
>
> Goobers... that would be one of those new-fangled slang terms that the young
> kids today use to mean its opposite, like "bad", "wicked" and "sick",
> correct?
>
> I mention it only because some people might mistakenly interpret your words
> as a childish and feeble insult against the 98% of the world who want or
> need more than the 127 characters of ASCII, rather than understand you
> meant it as a sign of the utmost respect for the richness and diversity of
> human beings and their languages, cultures, maths and sciences.

TL;DR version: international character sets are a problem, and Unicode
is not the answer to that problem.

As long as I have used python (which I admit has only been 3 years)
Unicode has never appeared to be implemented correctly. I'm probably
repeating old arguments here, but whatever.

Unicode is a mess. When someone says ASCII, you know that they can only
mean characters 0-127. When someone says Unicode, do they mean real
Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8?
When using the 'u' datatype with the array module, the docs don't even
tell you if it's 2 bytes wide or 4 bytes. Which is it? I'm sure that
all of these can be figured out, but the problem is now I have to
ask every one of these questions whenever I want to use strings.

Secondly, Python doesn't do Unicode exception handling correctly. (but I
suspect that it's a broader problem with languages) A good example of
this is with UTF-8 where there are invalid code points (such as 0xC0,
0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
well as everyone else who wants to use strings for some reason).

When embedding Python in a long running application where user input is
received, it is very easy to make a mistake which brings down the whole
program. If any user string isn't properly try/excepted, a user could
craft a malformed string which a UTF-8 decoder would choke on. Using
ASCII (or whatever 8 bit encoding) doesn't have these problems since all
codepoints are valid.

Another (this must have been a good laugh amongst the UniDevs) 'feature'
of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
Any string can masquerade as any other string by placing a few of these
in a string. Any word filters you might have are now defeated by some
cheesy Unicode nonsense character. Can you just check for these
characters and strip them out? Yes. Should you have to? I would say no.

Does it get better? Of course! international character sets used for
domain name encoding use yet a different scheme (Punycode). Are the
following two domain names the same: tést.com , xn--tst-bma.com ? Who
knows!

I suppose I can gloss over the pains of using Unicode in C with every
string needing to be an LPS since 0x00 is now a valid code point in
UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do
strlen or concatenation operations.

Can it get even better? Yep. We also now need to have a Byte order
Mark (BOM) to determine the endianness of our characters. Are they
little endian or big endian? (or perhaps one of the two possible middle
endian encodings?) Who knows? String processing with unicode is
unpleasant to say the least. I suppose that's what we get when
things are designed by committee.

But Hey! The great thing about standards is that there are so many to
choose from.

--
Bill
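
Three of the questions raised above (malformed UTF-8 input, zero width spaces, and whether the two domain names are the same) can be tried out with the Python 3 standard library. This is only a small sketch under stated assumptions: the helper names `decode_untrusted`, `strip_zero_width` and `same_domain` are invented for illustration, and `same_domain` relies on the built-in `idna` codec, which implements the older IDNA 2003 rules.

```python
ZERO_WIDTH_SPACE = "\u200b"  # encoded in UTF-8 as the bytes E2 80 8B


def decode_untrusted(raw):
    """Decode bytes as UTF-8; malformed bytes become U+FFFD instead of raising."""
    return raw.decode("utf-8", errors="replace")


def strip_zero_width(text):
    """Remove zero width spaces before comparing or filtering text."""
    return text.replace(ZERO_WIDTH_SPACE, "")


def same_domain(a, b):
    """Compare two host names by their ASCII (Punycode) form, ignoring case."""
    return a.encode("idna").lower() == b.encode("idna").lower()


if __name__ == "__main__":
    print(decode_untrusted(b"abc\xc0"))
    print(strip_zero_width("sp\u200bam"))
    print("t\u00e9st.com".encode("idna"))
    print(same_domain("t\u00e9st.com", "xn--tst-bma.com"))
```

The `errors="replace"` handler is one of several choices; `"strict"` (the default) raises `UnicodeDecodeError`, which the caller can catch at the boundary where user input enters the program.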
| From | rusi <rustompmody@gmail.com> |
|---|---|
| Date | 2011-07-18 19:50 -0700 |
| Message-ID | <feedc526-0c7b-4296-a011-1943950e6b61@j9g2000prj.googlegroups.com> |
| In reply to | #9842 |
On Jul 19, 7:07 am, Billy Mays <no...@nohow.com> wrote:
> [...]
> But Hey! The great thing about standards is that there are so many to
> choose from.
>
> --
> Bill

Thanks for writing that

Every time I try to understand unicode and remain stuck I come to the
conclusion that I must be an imbecile.

Seeing others (probably more intelligent than yours truly) gives me
some solace!

[And I am writing this from India where there are dozens of languages,
almost as many scripts and everyone speaks and writes at least a
couple of non-european ones]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2011-07-19 13:11 +1000 |
| Message-ID | <4e24f5c7$0$29981$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #9843 |
rusi wrote:
> Every time I try to understand unicode and remain stuck I come to the
> conclusion that I must be an imbecile.

http://www.joelonsoftware.com/articles/Unicode.html

--
Steven
| From | rusi <rustompmody@gmail.com> |
|---|---|
| Date | 2011-07-18 21:59 -0700 |
| Message-ID | <b8a46135-3d6f-4608-93cc-8278ad040483@u6g2000prc.googlegroups.com> |
| In reply to | #9845 |
On Jul 19, 8:11 am, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> rusi wrote:
> > Every time I try to understand unicode and remain stuck I come to the
> > conclusion that I must be an imbecile.
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
> --
> Steven

Yes, I've read that and understood a little bit more thanks to it. But for the points raised in this thread, this one from Joel is more relevant: http://www.joelonsoftware.com/articles/LeakyAbstractions.html

Some evidences of leakiness:
code point vs character vs byte
encoding and decoding
UTF-x and UCS-y

Very important and necessary distinctions? Maybe... But I did not need them when my world was built of the 127 bricks of ASCII.

My latest brush with unicode was when I tried to port construct to python3. http://construct.wikispaces.com/

If unicode 'just works' you should be able to do it in a jiffy? [And if you did I would be glad to be proved wrong :-) ]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2011-07-19 15:36 +1000 |
| Message-ID | <mailman.1238.1311053796.1164.python-list@python.org> |
| In reply to | #9853 |
On Tue, Jul 19, 2011 at 2:59 PM, rusi <rustompmody@gmail.com> wrote:
> Some evidences of leakiness:
> code point vs character vs byte
> encoding and decoding
> UTF-x and UCS-y
>
> Very important and necessary distinctions? Maybe... But I did not need
> them when my world was built of the 127 bricks of ASCII.

Codepoint vs byte is NOT an abstraction. Unicode consists of characters, where each character is represented by a number called its codepoint. Since computers work with bytes, we need a way of encoding those characters into bytes. It's no different from encoding a piece of music in bytes, and having it come out as 0x90 0x64 0x40. Are those bytes an abstraction of the note? No. They're an encoding of a MIDI message that requests that the note be struck. The note itself is an abstraction, if you like; but the bytes to create that note could be delivered in a variety of other ways.

A Python Unicode string, whether it's Python 2's 'unicode' or Python 3's 'str', is a sequence of characters. Since those characters are stored in memory, they must be encoded somehow, but that's not our problem. We need only care about encoding when we save those characters to disk, transmit them across the network, or in some other way need to store them as bytes. Otherwise, there is no abstraction, and no leak.

Chris Angelico
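The distinction drawn in this post can be sketched in a few lines of Python 3. This is only an illustration: the sample character and the variable names are assumptions, not anything taken from the thread. It shows one character with one code point, and three different byte sequences depending on the encoding chosen.

```python
# One character has exactly one code point, but many possible byte encodings.
ch = "\u00e9"                          # LATIN SMALL LETTER E WITH ACUTE
code_point = ord(ch)                   # the abstract number, independent of storage

utf8_bytes = ch.encode("utf-8")        # 2 bytes
utf16_bytes = ch.encode("utf-16-le")   # 2 bytes, little-endian, no BOM
utf32_bytes = ch.encode("utf-32-le")   # 4 bytes, little-endian, no BOM

# A str is measured in characters; bytes only appear once we encode.
word = "t\u00e9st"
char_count = len(word)
byte_count = len(word.encode("utf-8"))

print(hex(code_point), utf8_bytes, utf16_bytes, utf32_bytes)
print(char_count, byte_count)
```

The explicit `-le` codec names are used here so that no byte order mark is prepended; the plain `"utf-16"` and `"utf-32"` codecs add one when encoding.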
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2011-07-19 04:08 +0100 |
| Message-ID | <mailman.1233.1311044922.1164.python-list@python.org> |
| In reply to | #9842 |
On 19/07/2011 03:07, Billy Mays wrote:
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>> Billy Mays wrote:
>>
>>> On 07/17/2011 03:47 AM, Xah Lee wrote:
>>>> 2011-07-16
>>>
>>> I gave it a shot. It doesn't do any of the Unicode delims, because
>>> let's face it, Unicode is for goobers.
>>
>> Goobers... that would be one of those new-fangled slang terms that the young
>> kids today use to mean its opposite, like "bad", "wicked" and "sick",
>> correct?
>>
>> I mention it only because some people might mistakenly interpret your words
>> as a childish and feeble insult against the 98% of the world who want or
>> need more than the 127 characters of ASCII, rather than understand you
>> meant it as a sign of the utmost respect for the richness and diversity of
>> human beings and their languages, cultures, maths and sciences.
>>
>
> TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem).
>
> As long as I have used python (which I admit has only been 3 years)
> Unicode has never appeared to be implemented correctly. I'm probably
> repeating old arguments here, but whatever.
>
> Unicode is a mess. When someone says ASCII, you know that they can only
> mean characters 0-127. When someone says Unicode, do the mean real
> Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8? When
> using the 'u' datatype with the array module, the docs don't even tell
> you if its 2 bytes wide or 4 bytes. Which is it? I'm sure that all the
> of these can be figured out, but the problem is now I have to ask every
> one of these questions whenever I want to use strings.
>

That's down to whether it's a narrow or wide Python build. There's a PEP suggesting a fix for that (PEP 393).

> Secondly, Python doesn't do Unicode exception handling correctly. (but I
> suspect that its a broader problem with languages) A good example of
> this is with UTF-8 where there are invalid code points ( such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).
>

Those aren't codepoints, those are invalid bytes for the UTF-8 encoding.

> When embedding Python in a long running application where user input is
> received, it is very easy to make mistake which bring down the whole
> program. If any user string isn't properly try/excepted, a user could
> craft a malformed string which a UTF-8 decoder would choke on. Using
> ASCII (or whatever 8 bit encoding) doesn't have these problems since all
> codepoints are valid.
>

What if you give an application an invalid JPEG, PNG or other image file? Does that mean that image formats are bad too?

> Another (this must have been a good laugh amongst the UniDevs) 'feature'
> of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
> Any string can masquerade as any other string by placing few of these in
> a string. Any word filters you might have are now defeated by some
> cheesy Unicode nonsense character. Can you just just check for these
> characters and strip them out? Yes. Should you have to? I would say no.
>
> Does it get better? Of course! international character sets used for
> domain name encoding use yet a different scheme (Punycode). Are the
> following two domain names the same: tést.com , xn--tst-bma.com ? Who
> knows!
>
> I suppose I can gloss over the pains of using Unicode in C with every
> string needing to be an LPS since 0x00 is now a valid code point in
> UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do
> strlen or concatenation operations.
>

0x00 is also a valid ASCII code, but C doesn't let you use it!

There's also "Modified UTF-8", in which U+0000 is encoded as 2 bytes, so that zero-byte can be used as a terminator. You can't do that in ASCII! :-)

> Can it get even better? Yep. We also now need to have a Byte order Mark
> (BOM) to determine the endianness of our characters. Are they little
> endian or big endian? (or perhaps one of the two possible middle endian
> encodings?) Who knows? String processing with unicode is unpleasant to
> say the least. I suppose that's what we get when we things are designed
> by committee.
>

Proper UTF-8 doesn't have a BOM.

The rule (in Python, at least) is to decode on input and encode on output. You don't have to worry about endianness when processing Unicode strings internally; they're just a series of codepoints.

> But Hey! The great thing about standards is that there are so many to
> choose from.
>
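The "decode on input, encode on output" rule, together with the point about invalid UTF-8 bytes, might look roughly like this in Python 3. This is a minimal sketch under stated assumptions: the helper names `decode_input` and `encode_output` are hypothetical, and replacing bad bytes with U+FFFD is just one possible policy (rejecting the input outright is another).

```python
import codecs


def decode_input(raw: bytes) -> str:
    """Decode untrusted bytes as UTF-8 without raising on malformed input."""
    # "utf-8-sig" behaves like "utf-8" but drops a leading BOM if one is present.
    # errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
    return raw.decode("utf-8-sig", errors="replace")


def encode_output(text: str) -> bytes:
    """Encode text at the boundary; UTF-8 has no byte order to worry about."""
    return text.encode("utf-8")


# Input that starts with a UTF-8 BOM decodes to the bare text.
clean = decode_input(codecs.BOM_UTF8 + "t\u00e9st".encode("utf-8"))

# 0xC0 can never occur in valid UTF-8, so it is replaced rather than fatal.
damaged = decode_input(b"abc\xc0def")

# Strict decoding does raise, but it is an ordinary, catchable exception.
try:
    b"abc\xc0def".decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(clean), repr(damaged), strict_failed, encode_output(clean))
```

Between those two boundary calls the program only ever handles `str` objects, so endianness and byte validity never come up in the string-processing code itself.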
| From | Benjamin Kaplan <benjamin.kaplan@case.edu> |
|---|---|
| Date | 2011-07-18 20:54 -0700 |
| Message-ID | <mailman.1234.1311047771.1164.python-list@python.org> |
| In reply to | #9842 |
On Mon, Jul 18, 2011 at 7:07 PM, Billy Mays <noway@nohow.com> wrote:
>
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>>
>> Billy Mays wrote:
>>
>>> On 07/17/2011 03:47 AM, Xah Lee wrote:
>>>>
>>>> 2011-07-16
>>>
>>> I gave it a shot. It doesn't do any of the Unicode delims, because
>>> let's face it, Unicode is for goobers.
>>
>> Goobers... that would be one of those new-fangled slang terms that the young
>> kids today use to mean its opposite, like "bad", "wicked" and "sick",
>> correct?
>>
>> I mention it only because some people might mistakenly interpret your words
>> as a childish and feeble insult against the 98% of the world who want or
>> need more than the 127 characters of ASCII, rather than understand you
>> meant it as a sign of the utmost respect for the richness and diversity of
>> human beings and their languages, cultures, maths and sciences.
>>
>
> TL;DR version: international character sets are a problem, and Unicode is not the answer to that problem).
>
> As long as I have used python (which I admit has only been 3 years) Unicode has never appeared to be implemented correctly. I'm probably repeating old arguments here, but whatever.
>
> Unicode is a mess. When someone says ASCII, you know that they can only mean characters 0-127. When someone says Unicode, do the mean real Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8? When using the 'u' datatype with the array module, the docs don't even tell you if its 2 bytes wide or 4 bytes. Which is it? I'm sure that all the of these can be figured out, but the problem is now I have to ask every one of these questions whenever I want to use strings.
>

It doesn't matter. When you use the unicode data type in Python, you get to treat it as a sequence of characters, not a sequence of bytes. The fact that it's stored internally as UCS-2 or UCS-4 is irrelevant.

> Secondly, Python doesn't do Unicode exception handling correctly. (but I suspect that its a broader problem with languages) A good example of this is with UTF-8 where there are invalid code points ( such as 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as well as everyone else who wants to use strings for some reason).
>

A Unicode code point is of the form U+XXXX. 0xC0 is not a Unicode code point, it is a byte. It happens to be an invalid byte using the UTF-8 byte encoding (which is not Unicode, it's a byte string). The Unicode code point U+00C0 is perfectly valid; it's LATIN CAPITAL LETTER A WITH GRAVE.

> When embedding Python in a long running application where user input is received, it is very easy to make mistake which bring down the whole program. If any user string isn't properly try/excepted, a user could craft a malformed string which a UTF-8 decoder would choke on. Using ASCII (or whatever 8 bit encoding) doesn't have these problems since all codepoints are valid.
>

UTF-8 != Unicode. UTF-8 is one of several byte encodings capable of representing every character in the Unicode spec, but it is not Unicode. If you have a Unicode string, it is not a sequence of bytes, it is a sequence of characters. If you want a sequence of bytes, use a byte string. If you are attempting to interpret a sequence of bytes as a sequence of text, you're doing it wrong. There's a reason we have both text and binary modes for opening files; yes, there is a difference between them.

> Another (this must have been a good laugh amongst the UniDevs) 'feature' of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B). Any string can masquerade as any other string by placing few of these in a string. Any word filters you might have are now defeated by some cheesy Unicode nonsense character. Can you just just check for these characters and strip them out? Yes. Should you have to? I would say no.
>
> Does it get better? Of course! international character sets used for domain name encoding use yet a different scheme (Punycode). Are the following two domain names the same: tést.com , xn--tst-bma.com ? Who knows!
>
> I suppose I can gloss over the pains of using Unicode in C with every string needing to be an LPS since 0x00 is now a valid code point in UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do strlen or concatenation operations.
>

That is using UTF-8 in C. Which, again, is not the same thing as Unicode.

> Can it get even better? Yep. We also now need to have a Byte order Mark (BOM) to determine the endianness of our characters. Are they little endian or big endian? (or perhaps one of the two possible middle endian encodings?) Who knows? String processing with unicode is unpleasant to say the least. I suppose that's what we get when we things are designed by committee.
>

And that is UTF-16 and UTF-32. Again, those are byte encodings. They are not Unicode. When you use a library capable of handling Unicode, you never see those; you just have a string with characters in it.

> But Hey! The great thing about standards is that there are so many to choose from.
>
> --
> Bill
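The difference between the byte 0xC0 and the code point U+00C0, and the quoted question about the two domain names, can both be checked with the Python 3 standard library. This is a hedged sketch: the variable names are illustrative, and the built-in `idna` codec implements the older IDNA 2003 rules, so treat it as a demonstration rather than a full domain-name validator.

```python
import unicodedata

# U+00C0 is a valid code point with its own name.
name = unicodedata.name("\u00c0")

# Encoded as UTF-8 it becomes two bytes, neither of which is a lone 0xC0.
encoded = "\u00c0".encode("utf-8")

# A lone 0xC0 byte is rejected by a strict UTF-8 decoder...
try:
    b"\xc0".decode("utf-8")
    lone_byte_valid = True
except UnicodeDecodeError:
    lone_byte_valid = False

# ...while a single-byte encoding such as Latin-1 maps it to U+00C0.
latin1_char = b"\xc0".decode("latin-1")

# The two domain spellings are the Unicode and ASCII forms of the same name.
unicode_name = "t\u00e9st.com"
ascii_name = unicode_name.encode("idna")
round_trip = ascii_name.decode("idna")

print(name, encoded, lone_byte_valid, ascii_name, round_trip)
```

Comparing domain names after converting both to the same form (for example, both to the ASCII `xn--` form) avoids treating the two spellings as different hosts.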