a little parsing challenge ☺

Started by: Xah Lee <xahlee@gmail.com>
First post: 2011-07-17 00:47 -0700
Last post: 2011-07-19 22:43 -0700
Articles 20 on this page of 72 — 28 participants


Contents

  a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-17 00:47 -0700
    Re: a little parsing challenge ☺ Raymond Hettinger <python@rcn.com> - 2011-07-17 02:48 -0700
      Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-17 15:20 +0200
        Re: a little parsing challenge ☺ mhenn <michihenn@hotmail.com> - 2011-07-17 15:55 +0200
          Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-17 18:01 +0200
            Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-17 18:54 +0200
      Re: a little parsing challenge ☺ Thomas Boell <tboell@domain.invalid> - 2011-07-17 17:49 +0200
        Re: a little parsing challenge ☺ Raymond Hettinger <python@rcn.com> - 2011-07-17 12:16 -0700
      Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-18 07:39 -0700
        Re: a little parsing challenge ☺ Robert Klemme <shortcutter@googlemail.com> - 2011-07-20 08:23 +0200
        Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-20 03:31 -0700
          Re: a little parsing challenge ☺ "Uri Guttman" <uri@StemSystems.com> - 2011-07-20 12:31 -0400
            Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-20 10:30 -0700
            Re: a little parsing challenge ☺ merlyn@stonehenge.com (Randal L. Schwartz) - 2011-07-20 12:06 -0700
              Re: a little parsing challenge ☺ Jason Earl <jearl@notengoamigos.org> - 2011-07-20 14:57 -0600
      Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 09:54 -0700
        Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-19 20:07 +0200
          Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 05:58 -0700
            Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-21 08:26 -0600
              Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 08:36 -0700
                Re: a little parsing challenge ☺ python@bdurham.com - 2011-07-21 12:43 -0400
                  Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 11:53 -0700
                    Re: a little parsing challenge ☺ Terry Reedy <tjreedy@udel.edu> - 2011-07-21 18:37 -0400
            Re: a little parsing challenge ☺ John O'Hagan <research@johnohagan.com> - 2011-07-25 15:57 +1000
        Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-19 12:08 -0600
    Re: a little parsing challenge ☺ Chris Angelico <rosuav@gmail.com> - 2011-07-17 21:34 +1000
      Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-17 04:52 -0700
      Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-17 16:15 +0200
        Re: a little parsing challenge ☺ Raymond Hettinger <python@rcn.com> - 2011-07-17 12:18 -0700
          Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-17 22:16 +0200
            Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-17 22:57 +0200
        Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-17 23:43 +0200
        Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-18 03:09 -0400
          Re: a little parsing challenge ☺ Stefan Behnel <stefan_ml@behnel.de> - 2011-07-18 09:24 +0200
            Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-18 04:04 -0400
          Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-18 18:46 +0200
            Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-18 14:14 -0400
          Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 06:23 -0700
            Re: a little parsing challenge ☺ Rouslan Korneychuk <rouslank@msn.com> - 2011-07-21 17:54 -0400
    Re: a little parsing challenge ☺ gene heskett <gheskett@wdtv.com> - 2011-07-17 10:26 -0400
    Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-17 08:31 -0700
      Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 10:49 -0700
        Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-19 20:14 +0200
          Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-21 05:29 -0700
            Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-21 15:21 +0200
        Re: a little parsing challenge ☺ Thomas Jollans <t@jollybox.de> - 2011-07-19 20:17 +0200
    Re: a little parsing challenge ☺ rantingrick <rantingrick@gmail.com> - 2011-07-17 18:52 -0700
    Re: a little parsing challenge ☺ Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com> - 2011-07-18 13:12 -0400
      Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-18 12:10 -0600
        Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-18 23:59 +0200
          Re: a little parsing challenge ☺ Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-19 08:09 +0200
          Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 10:32 -0700
      Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-19 09:56 +1000
        Re: a little parsing challenge ☺ Billy Mays <noway@nohow.com> - 2011-07-18 22:07 -0400
          Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-18 19:50 -0700
            Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-19 13:11 +1000
              Re: a little parsing challenge ☺ rusi <rustompmody@gmail.com> - 2011-07-18 21:59 -0700
                Re: a little parsing challenge ☺ Chris Angelico <rosuav@gmail.com> - 2011-07-19 15:36 +1000
          Re: a little parsing challenge ☺ MRAB <python@mrabarnett.plus.com> - 2011-07-19 04:08 +0100
          Re: a little parsing challenge ☺ Benjamin Kaplan <benjamin.kaplan@case.edu> - 2011-07-18 20:54 -0700
          Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-19 14:30 +1000
          Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 01:58 -0700
      Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 10:14 -0700
        Re: a little parsing challenge ☺ Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com> - 2011-07-19 13:33 -0400
          Re: a little parsing challenge ☺ Xah Lee <xahlee@gmail.com> - 2011-07-19 11:12 -0700
            Re: a little parsing challenge ☺ Terry Reedy <tjreedy@udel.edu> - 2011-07-19 15:09 -0400
              Re: a little parsing challenge ☺ jmfauth <wxjmfauth@gmail.com> - 2011-07-19 23:29 -0700
                Re: a little parsing challenge ☺ Ian Kelly <ian.g.kelly@gmail.com> - 2011-07-20 01:29 -0600
                  Re: a little parsing challenge ☺ jmfauth <wxjmfauth@gmail.com> - 2011-07-20 00:54 -0700
                    Re: a little parsing challenge ☺ Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-07-20 18:18 +1000
    Re: a little parsing challenge ? sln@netherlands.com - 2011-07-18 12:34 -0700
    Re: a little parsing challenge ☺ Mark Tarver <dr.mtarver@gmail.com> - 2011-07-19 22:43 -0700


#9712

From: Thomas Jollans <t@jollybox.de>
Date: 2011-07-17 08:31 -0700
Message-ID: <7ae55705-6342-4f06-add8-59de05d111b9@x12g2000yql.googlegroups.com>
In reply to: #9680
On Jul 17, 9:47 am, Xah Lee <xah...@gmail.com> wrote:
> 2011-07-16
>
> folks, this one will be interesting one.
>
> the problem is to write a script that can check a dir of text files
> (and all subdirs) and reports if a file has any mismatched matching
> brackets.
>
> • The files will be utf-8 encoded (unix style line ending).
>
> • If a file has mismatched matching-pairs, the script will display the
> file name, and the line number and column number of the first
> instance where a mismatched bracket occurs. (or, just the char number
> instead (as in emacs's “point”))
>
> • the matching pairs are all single unicode chars. They are these and
> nothing else: () {} [] “” ‹› «» 【】 〈〉 《》 「」 『』
> Note that ‘single curly quote’ is not considered a matching pair here.
>
> • Your script must be standalone. Must not be using some parser tools.
> But can call lib that's part of standard distribution in your lang.
>
> Here's an example of mismatched bracket: ([)], (“[[”), ((, 】etc. (and
> yes, the brackets may be nested. There is usually text between these
> chars.)
>
> I'll be writing an emacs lisp solution and post in 2 days. Ι welcome
> other lang implementations. In particular, perl, python, php, ruby,
> tcl, lua, Haskell, Ocaml. I'll also be able to eval common lisp
> (clisp) and Scheme lisp (scsh), Java. Other lang such as Clojure,
> Scala, C, C++, or any others, are all welcome, but i won't be able to
> eval it. javascript implementation will be very interesting too, but
> please indicate which and where to install the command line version.
>
> I hope you'll find this an interesting “challenge”. This is a parsing
> problem. I haven't studied parsers except some Wikipedia reading, so
> my solution will probably be naive. I hope to see and learn from your
> solution too.
>
> i hope you'll participate. Just post solution here. Thanks.

I thought I'd have some fun with multi-processing:

https://gist.github.com/1087682


#9901

From: Xah Lee <xahlee@gmail.com>
Date: 2011-07-19 10:49 -0700
Message-ID: <22a6d549-b490-4534-a076-94cb6f21f8ce@h7g2000prf.googlegroups.com>
In reply to: #9712
On Jul 17, 8:31 am, Thomas Jollans <t...@jollybox.de> wrote:
> On Jul 17, 9:47 am, Xah Lee <xah...@gmail.com> wrote:
>
> > 2011-07-16
>
> > folks, this one will be interesting one.
>
> > the problem is to write a script that can check a dir of text files
> > (and all subdirs) and reports if a file has any mismatched matching
> > brackets.
>
> > • The files will be utf-8 encoded (unix style line ending).
>
> > • If a file has mismatched matching-pairs, the script will display the
> > file name, and the line number and column number of the first
> > instance where a mismatched bracket occurs. (or, just the char number
> > instead (as in emacs's “point”))
>
> > • the matching pairs are all single unicode chars. They are these and
> > nothing else: () {} [] “” ‹› «» 【】 〈〉 《》 「」 『』
> > Note that ‘single curly quote’ is not considered a matching pair here.
>
> > • Your script must be standalone. Must not be using some parser tools.
> > But can call lib that's part of standard distribution in your lang.
>
> > Here's an example of mismatched bracket: ([)], (“[[”), ((, 】etc. (and
> > yes, the brackets may be nested. There is usually text between these
> > chars.)
>
> > I'll be writing an emacs lisp solution and post in 2 days. Ι welcome
> > other lang implementations. In particular, perl, python, php, ruby,
> > tcl, lua, Haskell, Ocaml. I'll also be able to eval common lisp
> > (clisp) and Scheme lisp (scsh), Java. Other lang such as Clojure,
> > Scala, C, C++, or any others, are all welcome, but i won't be able to
> > eval it. javascript implementation will be very interesting too, but
> > please indicate which and where to install the command line version.
>
> > I hope you'll find this an interesting “challenge”. This is a parsing
> > problem. I haven't studied parsers except some Wikipedia reading, so
> > my solution will probably be naive. I hope to see and learn from your
> > solution too.
>
> > i hope you'll participate. Just post solution here. Thanks.
>
> I thought I'd have some fun with multi-processing:
>
> https://gist.github.com/1087682

hi Thomas. I ran the program, all cpu went max (i have a quad), but
after i think 3 minutes nothing happens, so i killed it.

is there something special one should know to run the script?

I'm using Python 3.2.1 on Windows 7.

 Xah


#9907

From: Thomas Jollans <t@jollybox.de>
Date: 2011-07-19 20:14 +0200
Message-ID: <mailman.1268.1311099294.1164.python-list@python.org>
In reply to: #9901
On 19/07/11 19:49, Xah Lee wrote:
> On Jul 17, 8:31 am, Thomas Jollans <t...@jollybox.de> wrote:
>>
>> I thought I'd have some fun with multi-processing:
>>
>> https://gist.github.com/1087682
> 
> hi Thomas. I ran the program, all cpu went max (i have a quad), but
> after i think 3 minutes nothing happens, so i killed it.
> 
> is there something special one should know to run the script?
> 
> I'm using Python 3.2.1 on Windows 7.
> 
>  Xah

Well, it overdoes the multi-processing “a little”. Checking each
character in a separate process might have been overkill.

Here's a sane version:

https://gist.github.com/1087682/2240a0834463d490c29ed0f794ad15128849ff8e

old, crazy version:
https://gist.github.com/1087682/6841c3875f7e88c23e0a053ac0d0f0565d8713e2


#10020

From: Xah Lee <xahlee@gmail.com>
Date: 2011-07-21 05:29 -0700
Message-ID: <1fac9c74-33ea-4380-a060-d20ee5aff971@s33g2000prg.googlegroups.com>
In reply to: #9907
On Jul 19, 11:14 am, Thomas Jollans <t...@jollybox.de> wrote:
> I thought I'd have some fun with multi-processing:

Nice joke. ☺

> Here's a sane version:
>
> https://gist.github.com/1087682/2240a0834463d490c29ed0f794ad15128849ff8e

hi thomas,

i still can't get your code to work. I have a dir named xxdir with a
single test file xx.txt, with this content:

 foo[(])bar

when i run your code
py3 validate_brackets_Thomas_Jollans_2.py

it simply exits and doesn't seem to do anything. I modded your code to
print the file name it's processing. Apparently it did process it.

my python isn't strong else i'd dive in. Thanks.

I'm on Python 3.2.1. Here's a shell log:

 h3@H3-HP 2011-07-21 05:20:30 ~/web/xxst/find_elisp/validate matching brackets
py3 validate_brackets_Thomas_Jollans_2.py
 h3@H3-HP 2011-07-21 05:20:34 ~/web/xxst/find_elisp/validate matching brackets
py3 validate_brackets_Thomas_Jollans_2.py
c:/Users/h3/web/xxst/find_elisp/validate matching brackets/xxdir\xx.txt
 h3@H3-HP 2011-07-21 05:21:59 ~/web/xxst/find_elisp/validate matching brackets
py3 --version
Python 3.2.1
 h3@H3-HP 2011-07-21 05:27:03 ~/web/xxst/find_elisp/validate matching brackets

 Xah


#10022

From: Thomas Jollans <t@jollybox.de>
Date: 2011-07-21 15:21 +0200
Message-ID: <mailman.1319.1311254485.1164.python-list@python.org>
In reply to: #10020
On 21/07/11 14:29, Xah Lee wrote:
> On Jul 19, 11:14 am, Thomas Jollans <t...@jollybox.de> wrote:
>> I thought I'd have some fun with multi-processing:
> 
> Nice joke. ☺
> 
>> Here's a sane version:
>>
>> https://gist.github.com/1087682/2240a0834463d490c29ed0f794ad15128849ff8e
> 
> hi thomas,
> 
> i still can't get your code to work. I have a dir named xxdir with a
> single test file xx.txt, with this content:
> 
>  foo[(])bar
> 
> when i run your code
> py3 validate_brackets_Thomas_Jollans_2.py
> 
> it simply exits and doesn't seem to do anything. I modded your code to
> print the file name it's processing. Apparently it did process it.
> 
> my python isn't strong else i'd dive in. Thanks.

Curious. Perhaps, in the Windows version of Python, subprocesses don't
use the same stdout? Windows doesn't have fork() (how could they
survive?), so who knows. Try replacing
    ex.submit(process_file, fullname)
with
    process_file(fullname)
for a non-concurrent version.
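
One possible explanation for a run that exits without printing anything
(a guess; the gist itself is not reproduced in this thread) is that an
exception raised inside a task passed to submit() is stored on the
returned Future and only surfaces when result() is called. The sketch
below shows that behaviour. process_file here is a stand-in that fails on
purpose, and ThreadPoolExecutor is used only so the example runs
anywhere; ProcessPoolExecutor offers the same submit()/result()
interface, and on Windows the code that creates it has to sit under an
`if __name__ == '__main__':` guard.

```python
from concurrent.futures import ThreadPoolExecutor


def process_file(name):
    # Stand-in for the real per-file check (hypothetical); it fails on
    # purpose for names ending in ".bad".
    if name.endswith('.bad'):
        raise ValueError('could not process %s' % name)
    return name


def run_all(names):
    """Submit every name, then collect results so errors are not lost."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=2) as ex:
        futures = [ex.submit(process_file, n) for n in names]
        for fut in futures:
            try:
                # result() re-raises whatever the worker raised.
                results.append(fut.result())
            except ValueError as exc:
                errors.append(str(exc))
    return results, errors
```

If the futures are never asked for their result, a failing task leaves no
trace on the console, which would look like a script that "simply exits".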

> 
> I'm on Python 3.2.1. Here's a shell log:
> 
>  h3@H3-HP 2011-07-21 05:20:30 ~/web/xxst/find_elisp/validate matching brackets
> py3 validate_brackets_Thomas_Jollans_2.py
>  h3@H3-HP 2011-07-21 05:20:34 ~/web/xxst/find_elisp/validate matching brackets
> py3 validate_brackets_Thomas_Jollans_2.py
> c:/Users/h3/web/xxst/find_elisp/validate matching brackets/xxdir\xx.txt
>  h3@H3-HP 2011-07-21 05:21:59 ~/web/xxst/find_elisp/validate matching brackets
> py3 --version
> Python 3.2.1
>  h3@H3-HP 2011-07-21 05:27:03 ~/web/xxst/find_elisp/validate matching brackets
> 
>  Xah


#9908

From: Thomas Jollans <t@jollybox.de>
Date: 2011-07-19 20:17 +0200
Message-ID: <mailman.1269.1311099446.1164.python-list@python.org>
In reply to: #9901
Oh, by the way:

On 19/07/11 19:49, Xah Lee wrote:
> I ran the program, all cpu went max 

Mission accomplished.


#9774

From: rantingrick <rantingrick@gmail.com>
Date: 2011-07-17 18:52 -0700
Message-ID: <741a7641-7833-4d51-a17b-548a930b70f4@u26g2000vby.googlegroups.com>
In reply to: #9680
On Jul 17, 2:47 am, Xah Lee <xah...@gmail.com> wrote:
> 2011-07-16
>
> folks, this one will be interesting one.
>
> the problem is to write a script that can check a dir of text files
> (and all subdirs) and reports if a file has any mismatched matching
> brackets.
>
>[...]
>
> • Your script must be standalone. Must not be using some parser tools.
> But can call lib that's part of standard distribution in your lang.

I stopped reading here and did...

>>> from HyperParser import HyperParser # python2.x

...and called it a day. ;-) This module is part of the stdlib
(idlelib\HyperParser) so as per your statement it is legal (may not be
the fastest solution).


#9818

From: Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com>
Date: 2011-07-18 13:12 -0400
Message-ID: <j01ph6$knt$1@speranza.aioe.org>
In reply to: #9680
On 07/17/2011 03:47 AM, Xah Lee wrote:
> 2011-07-16

I gave it a shot.  It doesn't do any of the Unicode delims, because 
let's face it, Unicode is for goobers.


import sys, os

pairs = {'}':'{', ')':'(', ']':'[', '"':'"', "'":"'", '>':'<'}
valid = set( v for pair in pairs.items() for v in pair )

for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
     for name in filenames:
         stack = [' ']
         with open(os.path.join(dirpath, name), 'rb') as f:
             chars = (c for line in f for c in line if c in valid)
             for c in chars:
                 if c in pairs and stack[-1] == pairs[c]:
                     stack.pop()
                 else:
                     stack.append(c)
         print ("Good" if len(stack) == 1 else "Bad") + ': %s' % name

--
Bill


#9821

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2011-07-18 12:10 -0600
Message-ID: <mailman.1223.1311012663.1164.python-list@python.org>
In reply to: #9818
On Mon, Jul 18, 2011 at 11:12 AM, Billy Mays
<81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com> wrote:
> I gave it a shot.  It doesn't do any of the Unicode delims, because let's
> face it, Unicode is for goobers.

Uh, okay...

Your script also misses the requirement of outputting the index or row
and column of the first mismatched bracket.


#9837

From: Thomas 'PointedEars' Lahn <PointedEars@web.de>
Date: 2011-07-18 23:59 +0200
Message-ID: <2075498.BysXYHRu7a@PointedEars.de>
In reply to: #9821
Ian Kelly wrote:

> Billy Mays wrote:
>> I gave it a shot.  It doesn't do any of the Unicode delims, because let's
>> face it, Unicode is for goobers.
> 
> Uh, okay...
> 
> Your script also misses the requirement of outputting the index or row
> and column of the first mismatched bracket.

Thanks to Python's expressiveness, this can be easily remedied (see below).  

I also do not follow Billy's comment about Unicode.  Unicode and the fact 
that Python supports it *natively* cannot be appreciated enough in a 
globalized world.

However, I have learned a lot about being pythonic from his posting (take 
those generator expressions, for example!), and the idea of looking at the 
top of a stack for reference is a really good one.  Thank you, Billy!

Here is my improvement of his code, which should fill the mentioned gaps.
I have also reversed the order in the report line as I think it is more 
natural this way.  I have tested the code superficially with a directory 
containing a single text file.  Watch for word-wrap:

# encoding: utf-8
'''
Created on 2011-07-18

@author: Thomas 'PointedEars' Lahn <PointedEars@web.de>, based on an idea of
Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com>
in <news:j01ph6$knt$1@speranza.aioe.org> 
'''
import sys, os

pairs = {u'}': u'{', u')': u'(', u']': u'[',
         u'”': u'“', u'›': u'‹', u'»': u'«',
         u'】': u'【', u'〉': u'〈', u'》': u'《',
         u'」': u'「', u'』': u'『'}
valid = set(v for pair in pairs.items() for v in pair)

if __name__ == '__main__':
    for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
        for name in filenames:
            stack = [' ']

            # you can use chardet etc. instead 
            encoding = 'utf-8'

            with open(os.path.join(dirpath, name), 'r') as f:
                reported = False
                chars = ((c, line_no, col)
                         for line_no, line in enumerate(f)
                         for col, c in enumerate(line.decode(encoding))
                         if c in valid)
                for c, line_no, col in chars:
                    if c in pairs:
                        if stack[-1] == pairs[c]:
                            stack.pop()
                        else:
                            if not reported:
                                first_bad = (c, line_no + 1, col + 1)
                                reported = True
                    else:
                        stack.append(c)

            print '%s: %s' % (name, ("good" if len(stack) == 1
                                     else "bad '%s' at %s:%s" % first_bad))

-- 
PointedEars

Bitte keine Kopien per E-Mail. / Please do not Cc: me.


#9856

From: Thomas 'PointedEars' Lahn <PointedEars@web.de>
Date: 2011-07-19 08:09 +0200
Message-ID: <42296484.Y7NjKpOp6t@PointedEars.de>
In reply to: #9837
Thomas 'PointedEars' Lahn wrote:

>             with open(os.path.join(dirpath, name), 'r') as f:

SHOULD be

            with open(os.path.join(dirpath, name), 'rb') as f:

(as in the original), else some code units might not be read properly.

-- 
PointedEars

Bitte keine Kopien per E-Mail. / Please do not Cc: me.


#9899

From: Xah Lee <xahlee@gmail.com>
Date: 2011-07-19 10:32 -0700
Message-ID: <065729ac-d522-4da6-87ad-915be14e5ff4@e20g2000prf.googlegroups.com>
In reply to: #9837
On Jul 18, 2:59 pm, Thomas 'PointedEars' Lahn <PointedE...@web.de>
wrote:
> Ian Kelly wrote:
> > Billy Mays wrote:
> >> I gave it a shot.  It doesn't do any of the Unicode delims, because let's
> >> face it, Unicode is for goobers.
>
> > Uh, okay...
>
> > Your script also misses the requirement of outputting the index or row
> > and column of the first mismatched bracket.
>
> Thanks to Python's expressiveness, this can be easily remedied (see below).  
>
> I also do not follow Billy's comment about Unicode.  Unicode and the fact
> that Python supports it *natively* cannot be appreciated enough in a
> globalized world.
>
> However, I have learned a lot about being pythonic from his posting (take
> those generator expressions, for example!), and the idea of looking at the
> top of a stack for reference is a really good one.  Thank you, Billy!
>
> Here is my improvement of his code, which should fill the mentioned gaps.
> I have also reversed the order in the report line as I think it is more
> natural this way.  I have tested the code superficially with a directory
> containing a single text file.  Watch for word-wrap:
>
> # encoding: utf-8
> '''
> Created on 2011-07-18
>
> @author: Thomas 'PointedEars' Lahn <PointedE...@web.de>, based on an idea of
> Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9...@myhashismyemail.com>
> in <news:j01ph6$knt$1@speranza.aioe.org>
> '''
> import sys, os
>
> pairs = {u'}': u'{', u')': u'(', u']': u'[',
>          u'”': u'“', u'›': u'‹', u'»': u'«',
>          u'】': u'【', u'〉': u'〈', u'》': u'《',
>          u'」': u'「', u'』': u'『'}
> valid = set(v for pair in pairs.items() for v in pair)
>
> if __name__ == '__main__':
>     for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
>         for name in filenames:
>             stack = [' ']
>
>             # you can use chardet etc. instead
>             encoding = 'utf-8'
>
>             with open(os.path.join(dirpath, name), 'r') as f:
>                 reported = False
>                 chars = ((c, line_no, col)
>                          for line_no, line in enumerate(f)
>                          for col, c in enumerate(line.decode(encoding))
>                          if c in valid)
>                 for c, line_no, col in chars:
>                     if c in pairs:
>                         if stack[-1] == pairs[c]:
>                             stack.pop()
>                         else:
>                             if not reported:
>                                 first_bad = (c, line_no + 1, col + 1)
>                                 reported = True
>                     else:
>                         stack.append(c)
>
>             print '%s: %s' % (name, ("good" if len(stack) == 1
>                                      else "bad '%s' at %s:%s" % first_bad))

Thanks for the fix.
Though, it seems still wrong.

On the file http://xahlee.org/p/time_machine/tm-ch04.html

there is a mismatched curly double quote at 28319.

the script reports:
tm-ch04.html: bad ')' at 68:2

that doesn't seem right. Line 68 is empty. There's no opening or
closing round bracket anywhere close. Nearest are lines 11 and 127.

Maybe Billy Mays's algorithm is wrong.

 Xah (fairly discouraged now, after running 3 python scripts all
failed)
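
For comparison, here is a minimal Python 3 sketch of the stack approach
that stops at the first offending bracket instead of continuing to scan.
It is not one of the scripts posted in this thread; the names
first_mismatch and check_tree are made up for illustration, and
reporting the most recently opened bracket when the text ends with
brackets still open is an assumption about what "first mismatch" should
mean in that case.

```python
import os

# Closing bracket -> the opening bracket it must match.
PAIRS = {')': '(', ']': '[', '}': '{', '”': '“', '›': '‹', '»': '«',
         '】': '【', '〉': '〈', '》': '《', '」': '「', '』': '『'}
OPENERS = set(PAIRS.values())


def first_mismatch(text):
    """Return (char, line, col) of the first bad bracket, or None.

    Line and column are 1-based; lines are split on '\n' only.
    """
    stack = []  # (opener, line, col) for every bracket still open
    for line_no, line in enumerate(text.split('\n'), 1):
        for col, ch in enumerate(line, 1):
            if ch in OPENERS:
                stack.append((ch, line_no, col))
            elif ch in PAIRS:
                # A closer with nothing open, or with the wrong opener
                # on top of the stack, is a mismatch: stop right here.
                if not stack or stack[-1][0] != PAIRS[ch]:
                    return (ch, line_no, col)
                stack.pop()
    # Anything left on the stack was never closed.
    return stack[-1] if stack else None


def check_tree(root):
    """Yield (path, mismatch) for every UTF-8 file under root that fails."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding='utf-8') as f:
                bad = first_mismatch(f.read())
            if bad:
                yield path, bad
```

On the test content foo[(])bar from earlier in the thread, this reports
the ']' at line 1, column 6.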


#9841

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2011-07-19 09:56 +1000
Message-ID: <4e24c823$0$29981$c3e8da3$5496439d@news.astraweb.com>
In reply to: #9818
Billy Mays wrote:

> On 07/17/2011 03:47 AM, Xah Lee wrote:
>> 2011-07-16
> 
> I gave it a shot.  It doesn't do any of the Unicode delims, because
> let's face it, Unicode is for goobers.

Goobers... that would be one of those new-fangled slang terms that the young
kids today use to mean its opposite, like "bad", "wicked" and "sick",
correct? 

I mention it only because some people might mistakenly interpret your words
as a childish and feeble insult against the 98% of the world who want or
need more than the 128 characters of ASCII, rather than understand you
meant it as a sign of the utmost respect for the richness and diversity of
human beings and their languages, cultures, maths and sciences.


-- 
Steven


#9842

From: Billy Mays <noway@nohow.com>
Date: 2011-07-18 22:07 -0400
Message-ID: <j02oth$f4c$1@speranza.aioe.org>
In reply to: #9841
On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
> Billy Mays wrote:
>
>> On 07/17/2011 03:47 AM, Xah Lee wrote:
>>> 2011-07-16
>>
>> I gave it a shot.  It doesn't do any of the Unicode delims, because
>> let's face it, Unicode is for goobers.
>
> Goobers... that would be one of those new-fangled slang terms that the young
> kids today use to mean its opposite, like "bad", "wicked" and "sick",
> correct?
>
> I mention it only because some people might mistakenly interpret your words
> as a childish and feeble insult against the 98% of the world who want or
> need more than the 128 characters of ASCII, rather than understand you
> meant it as a sign of the utmost respect for the richness and diversity of
> human beings and their languages, cultures, maths and sciences.
>
>

(TL;DR version: international character sets are a problem, and Unicode
is not the answer to that problem.)

As long as I have used python (which I admit has only been 3 years) 
Unicode has never appeared to be implemented correctly.  I'm probably 
repeating old arguments here, but whatever.

Unicode is a mess.  When someone says ASCII, you know that they can only
mean characters 0-127.  When someone says Unicode, do they mean real
Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8?
When using the 'u' datatype with the array module, the docs don't even
tell you if it's 2 bytes wide or 4 bytes.  Which is it?  I'm sure that
all of these can be figured out, but the problem is now I have to
ask every one of these questions whenever I want to use strings.
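
The questions in the paragraph above mix two things: Unicode assigns a
code point to each character, while UTF-8, UTF-16 and UTF-32 are
different byte encodings of those code points. A small Python 3 sketch
(the helper name encoded_sizes is made up for illustration) shows how
many bytes one character takes in each encoding:

```python
def encoded_sizes(ch):
    """Map encoding name -> number of bytes used to encode ch."""
    # The "-le" variants are used so that no byte order mark is added.
    return {enc: len(ch.encode(enc))
            for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}


if __name__ == '__main__':
    for ch in ('A', '\u263a', '\U0001F600'):
        print('U+%04X' % ord(ch), encoded_sizes(ch))
```

The smiley from the thread subject (U+263A) takes three bytes in UTF-8,
two in UTF-16 and four in UTF-32, yet it is a single code point in all
three.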

Secondly, Python doesn't do Unicode exception handling correctly (but I
suspect that it's a broader problem with languages).  A good example of
this is with UTF-8 where there are invalid code points (such as 0xC0,
0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
well as everyone else who wants to use strings for some reason).

When embedding Python in a long running application where user input is
received, it is very easy to make a mistake which brings down the whole
program.  If any user string isn't properly try/excepted, a user could
craft a malformed string which a UTF-8 decoder would choke on.  Using
ASCII (or whatever 8 bit encoding) doesn't have these problems since all
codepoints are valid.
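
A hedged sketch of how untrusted bytes can be decoded without an
unhandled exception, using only the standard error handlers of
bytes.decode (the helper name decode_user_input is made up for
illustration). Note that Python's 'ascii' codec also rejects bytes above
0x7F; of the common 8 bit encodings, 'latin-1' is one that accepts every
byte value.

```python
def decode_user_input(raw):
    """Decode bytes as UTF-8; return (text, was_valid).

    Malformed input never raises here: invalid bytes are replaced
    with U+FFFD and was_valid is set to False.
    """
    try:
        return raw.decode('utf-8'), True
    except UnicodeDecodeError:
        return raw.decode('utf-8', errors='replace'), False
```

For example, decode_user_input(b'ab\xc0cd') returns the text with the
stray 0xC0 byte replaced by U+FFFD, plus False to flag the problem.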

Another (this must have been a good laugh amongst the UniDevs) 'feature'
of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
Any string can masquerade as any other string by placing a few of these
in a string.  Any word filters you might have are now defeated by some
cheesy Unicode nonsense character.  Can you just check for these
characters and strip them out?  Yes.  Should you have to?  I would say no.
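
Stripping such characters can be sketched in a few lines of Python 3.
The set of characters removed here (zero width space, zero width
non-joiner and joiner, word joiner, and the BOM / zero width no-break
space) is an assumption chosen for illustration, not a complete list of
invisible characters, and strip_zero_width is a made-up name.

```python
# Map each code point to None so that str.translate deletes it.
ZERO_WIDTH = dict.fromkeys(map(ord, '\u200b\u200c\u200d\u2060\ufeff'))


def strip_zero_width(s):
    """Return s with the zero-width characters listed above removed."""
    return s.translate(ZERO_WIDTH)
```

A word filter would then compare strip_zero_width(word) rather than the
raw input.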

Does it get better?  Of course! International character sets used for
domain name encoding use yet a different scheme (Punycode).  Are the
following two domain names the same: tést.com , xn--tst-bma.com ?  Who
knows!
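
The question can be checked with Python's built-in 'idna' codec, which
implements the older IDNA 2003 rules. A small sketch (to_ascii and
same_domain are made-up helper names):

```python
def to_ascii(domain):
    """Return the ASCII (Punycode) form of a domain name."""
    return domain.encode('idna').decode('ascii')


def same_domain(a, b):
    """Compare two domain names by their ASCII forms, ignoring case."""
    return to_ascii(a).lower() == to_ascii(b).lower()


if __name__ == '__main__':
    print(to_ascii('t\u00e9st.com'))
    print(same_domain('t\u00e9st.com', 'xn--tst-bma.com'))
```

Under these rules the two names in the paragraph above do refer to the
same domain.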

I suppose I can gloss over the pains of using Unicode in C with every 
string needing to be an LPS since 0x00 is now a valid code point in 
UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do 
strlen or concatenation operations.

Can it get even better?  Yep.  We also now need to have a Byte Order
Mark (BOM) to determine the endianness of our characters.  Are they
little endian or big endian?  (or perhaps one of the two possible middle
endian encodings?)  Who knows?  String processing with unicode is
unpleasant to say the least.  I suppose that's what we get when
things are designed by committee.
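
For what it is worth, the byte order mark can be detected with the
constants in the standard codecs module. The sketch below (sniff_bom is
a made-up name) returns the name of a Python codec that consumes the BOM
while decoding, or None if no BOM is present. UTF-32 is tested before
UTF-16 because the UTF-32 little endian BOM begins with the same two
bytes as the UTF-16 one.

```python
import codecs

# (BOM bytes, codec that strips that BOM while decoding)
_BOMS = (
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
)


def sniff_bom(raw):
    """Return a codec name based on a leading BOM, or None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None
```

UTF-8 itself has no byte order, so files in that encoding (as in the
challenge) need no BOM at all.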

But Hey!  The great thing about standards is that there are so many to 
choose from.

--
Bill






#9843

From: rusi <rustompmody@gmail.com>
Date: 2011-07-18 19:50 -0700
Message-ID: <feedc526-0c7b-4296-a011-1943950e6b61@j9g2000prj.googlegroups.com>
In reply to: #9842
On Jul 19, 7:07 am, Billy Mays <no...@nohow.com> wrote:
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>
> > Billy Mays wrote:
>
> >> On 07/17/2011 03:47 AM, Xah Lee wrote:
> >>> 2011-07-16
>
> >> I gave it a shot.  It doesn't do any of the Unicode delims, because
> >> let's face it, Unicode is for goobers.
>
> > Goobers... that would be one of those new-fangled slang terms that the young
> > kids today use to mean its opposite, like "bad", "wicked" and "sick",
> > correct?
>
> > I mention it only because some people might mistakenly interpret your words
> > as a childish and feeble insult against the 98% of the world who want or
> > need more than the 128 characters of ASCII, rather than understand you
> > meant it as a sign of the utmost respect for the richness and diversity of
> > human beings and their languages, cultures, maths and sciences.
>
> (TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem.)
>
> As long as I have used python (which I admit has only been 3 years)
> Unicode has never appeared to be implemented correctly.  I'm probably
> repeating old arguments here, but whatever.
>
> Unicode is a mess.  When someone says ASCII, you know that they can only
> mean characters 0-127.  When someone says Unicode, do they mean real
> Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8?
> When using the 'u' datatype with the array module, the docs don't even
> tell you if it's 2 bytes wide or 4 bytes.  Which is it?  I'm sure that
> all of these can be figured out, but the problem is now I have to
> ask every one of these questions whenever I want to use strings.
>
> Secondly, Python doesn't do Unicode exception handling correctly (but I
> suspect that it's a broader problem with languages).  A good example of
> this is with UTF-8 where there are invalid code points (such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).
>
> When embedding Python in a long running application where user input is
> received, it is very easy to make a mistake which brings down the whole
> program.  If any user string isn't properly try/excepted, a user could
> craft a malformed string which a UTF-8 decoder would choke on.  Using
> ASCII (or whatever 8 bit encoding) doesn't have these problems since all
> codepoints are valid.
>
> Another (this must have been a good laugh amongst the UniDevs) 'feature'
> of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
> Any string can masquerade as any other string by placing a few of these
> in a string.  Any word filters you might have are now defeated by some
> cheesy Unicode nonsense character.  Can you just check for these
> characters and strip them out?  Yes.  Should you have to?  I would say no.
>
> Does it get better?  Of course! International character sets used for
> domain name encoding use yet a different scheme (Punycode).  Are the
> following two domain names the same: tést.com , xn--tst-bma.com ?  Who
> knows!
>
> I suppose I can gloss over the pains of using Unicode in C with every
> string needing to be an LPS since 0x00 is now a valid code point in
> UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do
> strlen or concatenation operations.
>
> Can it get even better?  Yep.  We also now need to have a Byte order
> Mark (BOM) to determine the endianness of our characters.  Are they
> little endian or big endian?  (or perhaps one of the two possible middle
> endian encodings?)  Who knows?  String processing with unicode is
> unpleasant to say the least.  I suppose that's what we get when we
> things are designed by committee.
>
> But Hey!  The great thing about standards is that there are so many to
> choose from.
>
> --
> Bill

Thanks for writing that.
Every time I try to understand unicode and remain stuck I come to the
conclusion that I must be an imbecile.
Seeing that others (probably more intelligent than yours truly) struggle
with it too gives me some solace!

[And I am writing this from India, where there are dozens of languages,
almost as many scripts, and everyone speaks and writes at least a
couple of non-European ones]



#9845

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2011-07-19 13:11 +1000
Message-ID: <4e24f5c7$0$29981$c3e8da3$5496439d@news.astraweb.com>
In reply to: #9843
rusi wrote:

> Every time I try to understand unicode and remain stuck I come to the
> conclusion that I must be an imbecile.

http://www.joelonsoftware.com/articles/Unicode.html


-- 
Steven



#9853

From: rusi <rustompmody@gmail.com>
Date: 2011-07-18 21:59 -0700
Message-ID: <b8a46135-3d6f-4608-93cc-8278ad040483@u6g2000prc.googlegroups.com>
In reply to: #9845
On Jul 19, 8:11 am, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> rusi wrote:
> > Every time I try to understand unicode and remain stuck I come to the
> > conclusion that I must be an imbecile.
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
> --
> Steven

Yes, I've read that and understood a little bit more thanks to it.
But for the points raised in this thread, this one from Joel is more
relevant:

http://www.joelonsoftware.com/articles/LeakyAbstractions.html

Some evidences of leakiness:
code point vs character vs byte
encoding and decoding
UTF-x and UCS-y

Very important and necessary distinctions? Maybe... But I did not need
them when my world was built of the 127 bricks of ASCII.

My latest brush with unicode was when I tried to port construct to
Python 3. http://construct.wikispaces.com/

If unicode 'just works' you should be able to do it in a jiffy?
[And if you did I would be glad to be proved wrong :-) ]



#9854

From: Chris Angelico <rosuav@gmail.com>
Date: 2011-07-19 15:36 +1000
Message-ID: <mailman.1238.1311053796.1164.python-list@python.org>
In reply to: #9853
On Tue, Jul 19, 2011 at 2:59 PM, rusi <rustompmody@gmail.com> wrote:
> Some evidences of leakiness:
> code point vs character vs byte
> encoding and decoding
> UTF-x and UCS-y
>
> Very important and necessary distinctions? Maybe... But I did not need
> them when my world was built of the 127 bricks of ASCII.

Codepoint vs byte is NOT an abstraction. Unicode consists of
characters, where each character is represented by a number called its
codepoint. Since computers work with bytes, we need a way of encoding
those characters into bytes. It's no different from encoding a piece
of music in bytes, and having it come out as 0x90 0x64 0x40. Are those
bytes an abstraction of the note? No. They're an encoding of a MIDI
message that requests that the note be struck. The note itself is an
abstraction, if you like; but the bytes to create that note could be
delivered in a variety of other ways.

A Python Unicode string, whether it's Python 2's 'unicode' or Python
3's 'str', is a sequence of characters. Since those characters are
stored in memory, they must be encoded somehow, but that's not our
problem. We need only care about encoding when we save those
characters to disk, transmit them across the network, or in some other
way need to store them as bytes. Otherwise, there is no abstraction,
and no leak.
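
To make the distinction concrete, here is a minimal sketch, assuming Python 3, where 'str' is the character sequence and 'bytes' is the encoded form. The sample word and variable names are illustrative only:

```python
text = "na\u00efve"  # "naïve": five characters

# The characters themselves are just numbers (code points).
code_points = [ord(ch) for ch in text]

# The same five characters produce different bytes under each encoding.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

print(code_points)
print(len(text), len(utf8), len(utf16), len(utf32))
```

The string's length stays at five whichever encoding is chosen; only the byte counts differ.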

Chris Angelico



#9844

From: MRAB <python@mrabarnett.plus.com>
Date: 2011-07-19 04:08 +0100
Message-ID: <mailman.1233.1311044922.1164.python-list@python.org>
In reply to: #9842
On 19/07/2011 03:07, Billy Mays wrote:
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>> Billy Mays wrote:
>>
>>> On 07/17/2011 03:47 AM, Xah Lee wrote:
>>>> 2011-07-16
>>>
>>> I gave it a shot. It doesn't do any of the Unicode delims, because
>>> let's face it, Unicode is for goobers.
>>
>> Goobers... that would be one of those new-fangled slang terms that the young
>> kids today use to mean its opposite, like "bad", "wicked" and "sick",
>> correct?
>>
>> I mention it only because some people might mistakenly interpret your words
>> as a childish and feeble insult against the 98% of the world who want or
>> need more than the 127 characters of ASCII, rather than understand you
>> meant it as a sign of the utmost respect for the richness and diversity of
>> human beings and their languages, cultures, maths and sciences.
>>
>>
>
> TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem).
>
> As long as I have used python (which I admit has only been 3 years)
> Unicode has never appeared to be implemented correctly. I'm probably
> repeating old arguments here, but whatever.
>
> Unicode is a mess. When someone says ASCII, you know that they can only
> mean characters 0-127. When someone says Unicode, do the mean real
> Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8? When
> using the 'u' datatype with the array module, the docs don't even tell
> you if its 2 bytes wide or 4 bytes. Which is it? I'm sure that all the
> of these can be figured out, but the problem is now I have to ask every
> one of these questions whenever I want to use strings.
>
That's down to whether it's a narrow or wide Python build. There's a
PEP suggesting a fix for that (PEP 393).
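
A quick way to tell the two build types apart, as a sketch assuming CPython: sys.maxunicode is 0xFFFF on a narrow build and 0x10FFFF on a wide build. From Python 3.3 on, where PEP 393 was implemented, it is always 0x10FFFF. The helper name build_width is hypothetical:

```python
import sys

def build_width():
    """Return 'narrow' or 'wide' depending on the largest code point
    that a single string element can hold in this interpreter."""
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"

print(hex(sys.maxunicode), build_width())
```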

> Secondly, Python doesn't do Unicode exception handling correctly. (but I
> suspect that its a broader problem with languages) A good example of
> this is with UTF-8 where there are invalid code points ( such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).
>
Those aren't codepoints, those are invalid bytes for the UTF-8 encoding.
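
A malformed byte does raise an exception on decoding, but it can be caught or avoided. A minimal sketch, assuming Python 3; the helper name safe_decode is hypothetical:

```python
def safe_decode(raw):
    """Decode UTF-8 input without letting malformed bytes propagate.

    On invalid input, fall back to substituting U+FFFD (the Unicode
    replacement character) for each undecodable byte.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("utf-8", errors="replace")

good = "caf\u00e9".encode("utf-8")
bad = b"abc\xc0def"   # 0xC0 can never appear in valid UTF-8

print(safe_decode(good))
print(safe_decode(bad))
```

Passing errors="replace" (or "ignore") directly is an alternative when the original exception is of no interest.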

> When embedding Python in a long running application where user input is
> received, it is very easy to make mistake which bring down the whole
> program. If any user string isn't properly try/excepted, a user could
> craft a malformed string which a UTF-8 decoder would choke on. Using
> ASCII (or whatever 8 bit encoding) doesn't have these problems since all
> codepoints are valid.
>
What if you give an application an invalid JPEG, PNG or other image
file? Does that mean that image formats are bad too?

> Another (this must have been a good laugh amongst the UniDevs) 'feature'
> of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
> Any string can masquerade as any other string by placing few of these in
> a string. Any word filters you might have are now defeated by some
> cheesy Unicode nonsense character. Can you just just check for these
> characters and strip them out? Yes. Should you have to? I would say no.
>
> Does it get better? Of course! international character sets used for
> domain name encoding use yet a different scheme (Punycode). Are the
> following two domain names the same: tést.com , xn--tst-bma.com ? Who
> knows!
>
> I suppose I can gloss over the pains of using Unicode in C with every
> string needing to be an LPS since 0x00 is now a valid code point in
> UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do
> strlen or concatenation operations.
>
0x00 is also a valid ASCII code, but C doesn't let you use it!

There's also "Modified UTF-8", in which U+0000 is encoded as 2 bytes,
so that zero-byte can be used as a terminator. You can't do that in
ASCII! :-)
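
The NUL trick can be sketched in a few lines. This assumes Python 3 and covers only the U+0000 rule; full Modified UTF-8 (as used by Java, for instance) also treats characters outside the BMP differently, which this sketch ignores. Both function names are hypothetical:

```python
def encode_nul_free(text):
    """Encode as UTF-8, but write U+0000 as the two bytes C0 80 so the
    result contains no zero byte and can be NUL-terminated in C."""
    # In UTF-8 a zero byte only ever encodes U+0000, so a plain
    # byte-level replace is safe here.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")

def decode_nul_free(raw):
    """Reverse encode_nul_free: map C0 80 back to a zero byte, then decode."""
    # 0xC0 never occurs in valid UTF-8, so this replace cannot hit
    # bytes that belong to another character.
    return raw.replace(b"\xc0\x80", b"\x00").decode("utf-8")

data = "a\x00b"
packed = encode_nul_free(data)
print(packed, decode_nul_free(packed) == data)
```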

> Can it get even better? Yep. We also now need to have a Byte order Mark
> (BOM) to determine the endianness of our characters. Are they little
> endian or big endian? (or perhaps one of the two possible middle endian
> encodings?) Who knows? String processing with unicode is unpleasant to
> say the least. I suppose that's what we get when we things are designed
> by committee.
>
Proper UTF-8 doesn't have a BOM.

The rule (in Python, at least) is to decode on input and encode on
output. You don't have to worry about endianness when processing
Unicode strings internally; they're just a series of codepoints.
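
As a rough illustration of that rule, a sketch assuming Python 3; the function name process_text and the choice of encodings are only examples:

```python
import codecs

def process_text(raw_in, in_encoding="utf-16", out_encoding="utf-8"):
    """Decode at the boundary, work on code points, encode on the way out."""
    text = raw_in.decode(in_encoding)    # decode on input (the BOM is consumed here)
    result = text.upper()                # no byte order or byte widths in sight
    return result.encode(out_encoding)   # encode on output

# The same text arriving in either byte order gives the same result.
little = codecs.BOM_UTF16_LE + "h\u00e9llo".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "h\u00e9llo".encode("utf-16-be")
print(process_text(little), process_text(big))
```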

> But Hey! The great thing about standards is that there are so many to
> choose from.
>



#9846

From: Benjamin Kaplan <benjamin.kaplan@case.edu>
Date: 2011-07-18 20:54 -0700
Message-ID: <mailman.1234.1311047771.1164.python-list@python.org>
In reply to: #9842
On Mon, Jul 18, 2011 at 7:07 PM, Billy Mays <noway@nohow.com> wrote:
>
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>>
>> Billy Mays wrote:
>>
>>> On 07/17/2011 03:47 AM, Xah Lee wrote:
>>>>
>>>> 2011-07-16
>>>
>>> I gave it a shot.  It doesn't do any of the Unicode delims, because
>>> let's face it, Unicode is for goobers.
>>
>> Goobers... that would be one of those new-fangled slang terms that the young
>> kids today use to mean its opposite, like "bad", "wicked" and "sick",
>> correct?
>>
>> I mention it only because some people might mistakenly interpret your words
>> as a childish and feeble insult against the 98% of the world who want or
>> need more than the 127 characters of ASCII, rather than understand you
>> meant it as a sign of the utmost respect for the richness and diversity of
>> human beings and their languages, cultures, maths and sciences.
>>
>>
>
> TL;DR version: international character sets are a problem, and Unicode is not the answer to that problem).
>
> As long as I have used python (which I admit has only been 3 years) Unicode has never appeared to be implemented correctly.  I'm probably repeating old arguments here, but whatever.
>
> Unicode is a mess.  When someone says ASCII, you know that they can only mean characters 0-127.  When someone says Unicode, do the mean real Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8? When using the 'u' datatype with the array module, the docs don't even tell you if its 2 bytes wide or 4 bytes.  Which is it?  I'm sure that all the of these can be figured out, but the problem is now I have to ask every one of these questions whenever I want to use strings.
>

It doesn't matter. When you use the unicode data type in Python, you
get to treat it as a sequence of characters, not a sequence of bytes.
The fact that it's stored internally as UCS-2 or UCS-4 is irrelevant.

>
> Secondly, Python doesn't do Unicode exception handling correctly. (but I suspect that its a broader problem with languages) A good example of this is with UTF-8 where there are invalid code points ( such as 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as well as everyone else who wants to use strings for some reason).
>

A Unicode code point is of the form U+XXXX. 0xC0 is not a Unicode code
point, it is a byte. It happens to be an invalid byte using the UTF-8
byte encoding (which is not Unicode, it's a byte string). The Unicode
code point U+00C0 is perfectly valid- it's a LATIN CAPITAL LETTER A
WITH GRAVE.
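
A small sketch of the difference, assuming Python 3 and the standard unicodedata module; the variable names are illustrative:

```python
import unicodedata

ch = "\u00c0"                    # the code point U+00C0
name = unicodedata.name(ch)      # its official character name
utf8 = ch.encode("utf-8")        # how UTF-8 writes it: two bytes

# The single byte 0xC0, by contrast, is not valid UTF-8 on its own.
lone_byte_is_valid = True
try:
    b"\xc0".decode("utf-8")
except UnicodeDecodeError:
    lone_byte_is_valid = False

print(name, utf8, lone_byte_is_valid)
```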

>
> When embedding Python in a long running application where user input is received, it is very easy to make mistake which bring down the whole program.  If any user string isn't properly try/excepted, a user could craft a malformed string which a UTF-8 decoder would choke on.  Using ASCII (or whatever 8 bit encoding) doesn't have these problems since all codepoints are valid.
>

UTF-8 != Unicode. UTF-8 is one of several byte encodings capable of
representing every character in the Unicode spec, but it is not
Unicode. If you have a Unicode string, it is not a sequence of bytes,
it is a sequence of characters. If you want a sequence of bytes, use a
byte string. If you are attempting to interpret a sequence of bytes as
a sequence of text, you're doing it wrong. There's a reason we have
both text and binary modes for opening files- yes, there is a
difference between them.

> Another (this must have been a good laugh amongst the UniDevs) 'feature' of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B). Any string can masquerade as any other string by placing  few of these in a string.  Any word filters you might have are now defeated by some cheesy Unicode nonsense character.  Can you just just check for these characters and strip them out?  Yes.  Should you have to?  I would say no.
>
> Does it get better?  Of course! international character sets used for domain name encoding use yet a different scheme (Punycode).  Are the following two domain names the same: tést.com , xn--tst-bma.com ?  Who knows!
>
> I suppose I can gloss over the pains of using Unicode in C with every string needing to be an LPS since 0x00 is now a valid code point in UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do strlen or concatenation operations.
>

That is using UTF-8 in C. Which, again, is not the same thing as Unicode.

> Can it get even better?  Yep.  We also now need to have a Byte order Mark (BOM) to determine the endianness of our characters.  Are they little endian or big endian?  (or perhaps one of the two possible middle endian encodings?)  Who knows?  String processing with unicode is unpleasant to say the least.  I suppose that's what we get when we things are designed by committee.
>

And that is UTF-16 and UTF-32. Again, those are byte encodings. They
are not Unicode. When you use a library capable of handling Unicode,
you never see those- you just have a string with characters in it.
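
For example, a sketch assuming Python 3 and the standard codecs module: byte order shows up only in the encoded bytes, and the BOM is consumed when decoding.

```python
import codecs

text = "hi"

# Byte order is a property of the encoded bytes only.
le = text.encode("utf-16-le")   # low byte first
be = text.encode("utf-16-be")   # high byte first

# With a BOM in front, the generic "utf-16" decoder works out the
# byte order itself and drops the BOM from the result.
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")

print(le, be, from_le == from_be == text)
```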

> But Hey!  The great thing about standards is that there are so many to choose from.
>
> --
> Bill


