
Re: how to avoid leading white spaces

Started by	Chris Rebert <clp2@rebertia.com>
First post	2011-06-01 10:11 -0700
Last post	2011-06-05 04:17 -0700
Articles	20 on this page of 64 — 19 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: how to avoid leading white spaces Chris Rebert <clp2@rebertia.com> - 2011-06-01 10:11 -0700
    Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-01 12:39 -0700
      Re: how to avoid leading white spaces Karim <karim.liateni@free.fr> - 2011-06-01 22:34 +0200
      Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-02 13:21 +0000
        Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-02 21:57 -0400
          Re: how to avoid leading white spaces MRAB <python@mrabarnett.plus.com> - 2011-06-03 03:41 +0100
          Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-03 02:58 +0000
            Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-02 23:44 -0400
              Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-03 13:52 +1000
              Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-03 13:54 +1000
              Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-03 04:30 +0000
                Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-03 14:11 +0100
            Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-03 14:18 +0100
            Re: how to avoid leading white spaces Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2011-06-04 13:41 +1200
              Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-04 20:44 +0100
            Re: how to avoid leading white spaces Ian <hobson42@gmail.com> - 2011-06-06 22:04 +0100
              Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-09 02:32 +0000
          Re: how to avoid leading white spaces Thorsten Kampe <thorsten@thorstenkampe.de> - 2011-06-03 10:32 +0200
        Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-03 05:51 -0700
          Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-03 13:17 +0000
            Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-03 08:14 -0700
          Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-03 14:25 +0000
            Re: how to avoid leading white spaces "D'Arcy J.M. Cain" <darcy@druid.net> - 2011-06-03 10:58 -0400
            Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-03 12:29 -0700
              Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-03 20:49 +0000
                Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-03 21:45 +0000
                  Re: how to avoid leading white spaces Ethan Furman <ethan@stoneleaf.us> - 2011-06-03 15:11 -0700
                  Re: how to avoid leading white spaces MRAB <python@mrabarnett.plus.com> - 2011-06-03 23:38 +0100
                  Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-05 22:47 -0700
                Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-05 22:44 -0700
                  Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-06 16:08 +0000
                    Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 10:29 -0600
                      Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-06 17:17 +0000
                        Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 11:40 -0600
                          Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-06 17:56 +0000
                    Re: how to avoid leading white spaces Ethan Furman <ethan@stoneleaf.us> - 2011-06-06 10:48 -0700
                    Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 11:42 -0600
              Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-04 02:05 +0000
                Re: how to avoid leading white spaces MRAB <python@mrabarnett.plus.com> - 2011-06-04 03:24 +0100
                  Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-04 04:59 +0000
                Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-03 22:30 -0400
                  Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-04 05:14 +0000
                    Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-04 09:39 -0400
                      Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-05 00:44 +0000
                    Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-04 09:36 -0700
                    Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-04 21:02 +0100
                      Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-05 01:01 +0000
                  Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-04 16:04 +1000
                Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-05 23:03 -0700
                  Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-06 07:11 +0000
                    Re: how to avoid leading white spaces "Octavian Rasnita" <orasnita@gmail.com> - 2011-06-06 11:51 +0300
                    Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-06 19:01 +1000
                    Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-06 07:33 -0700
                      Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-07 11:37 -0700
                        Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-07 20:30 -0400
                          Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-08 07:38 -0700
                            Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-08 09:14 -0700
                        Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-08 01:27 -0700
                  Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-06 15:29 +0000
                    Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 10:06 -0600
                    Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-07 09:00 -0700
                      Re: how to avoid leading white spaces Duncan Booth <duncan.booth@invalid.invalid> - 2011-06-08 09:01 +0000
                        Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-08 07:39 -0700
            Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-05 04:17 -0700


#6994

From	Roy Smith <roy@panix.com>
Date	2011-06-03 22:30 -0400
Message-ID	<roy-1DBFCA.22305903062011@news.panix.com>
In reply to	#6991

In article <4de992d7$0$29996$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> Of course, if you include both case-sensitive and insensitive tests in 
> the same calculation, that's a good candidate for a regex... or at least 
> it would be if regexes supported that :)

Of course they support that.

r'([A-Z]+) ([a-zA-Z]+) ([a-z]+)'

matches a word in upper case followed by a word in either (or mixed) 
case, followed by a word in lower case (for some narrow definition of 
"word").
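A minimal sketch of how the pattern above behaves in Python. The sample strings and the name `pattern` are illustrative assumptions, not taken from the post.

```python
import re

# An upper-case word, then a mixed-case word, then a lower-case word.
pattern = re.compile(r'([A-Z]+) ([a-zA-Z]+) ([a-z]+)')

m = pattern.match('HELLO World again')
print(m.groups())                              # ('HELLO', 'World', 'again')

# The first group is case-sensitive, so a lower-case first word is rejected.
print(pattern.match('hello World again'))      # None
```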

Another nice thing about regexes (as compared to string methods) is that 
they're both portable and serializable.  You can use the same regex in 
Perl, Python, Ruby, PHP, etc.  You can transmit them over a network 
connection to a cooperating process.  You can store them in a database 
or a config file, or allow users to enter them on the fly.

#6997

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-06-04 05:14 +0000
Message-ID	<4de9bf50$0$29996$c3e8da3$5496439d@news.astraweb.com>
In reply to	#6994

On Fri, 03 Jun 2011 22:30:59 -0400, Roy Smith wrote:

> In article <4de992d7$0$29996$c3e8da3$5496439d@news.astraweb.com>,
>  Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> 
>> Of course, if you include both case-sensitive and insensitive tests in
>> the same calculation, that's a good candidate for a regex... or at
>> least it would be if regexes supported that :)
> 
> Of course they support that.
> 
> r'([A-Z]+) ([a-zA-Z]+) ([a-z]+)'
> 
> matches a word in upper case followed by a word in either (or mixed)
> case, followed by a word in lower case (for some narrow definition of
> "word").

This fails to support non-ASCII letters, and you know quite well that 
having to spell out by hand regexes in both upper and lower (or mixed) 
case is not support for case-insensitive matching. That's why Python's re 
has a case insensitive flag.
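For illustration, a short sketch of both points, assuming Python 3 `str` patterns (the thread predates this; the scoped `(?i:...)` form needs Python 3.6 or later). The sample words and the name `mixed` are made up for the example.

```python
import re

word = 'caf\u00e9'   # "café", with a non-ASCII final letter

# A hand-written ASCII class stops at the accented letter.
print(re.match(r'[a-zA-Z]+$', word))             # None
# \w is Unicode-aware for str patterns (it also accepts digits and "_").
print(re.match(r'\w+$', word).group() == word)   # True

# The flag avoids spelling out both cases by hand.
print(re.match(r'customer', 'CUSTOMER', re.IGNORECASE).group())   # CUSTOMER

# Python 3.6+: the flag can be limited to one group, so a single pattern
# mixes case-insensitive and case-sensitive parts.
mixed = re.compile(r'(?i:customer) [A-Z]\d{3}')
print(bool(mixed.match('Customer A123')))        # True
print(bool(mixed.match('Customer a123')))        # False
```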

> Another nice thing about regexes (as compared to string methods) is that
> they're both portable and serializable.  You can use the same regex in
> Perl, Python, Ruby, PHP, etc.

Say what?

Regexes are anything but portable. Sure, if you limit yourself to some 
subset of regex syntax, you might find that many different languages and 
engines support your regex, but general regexes are not guaranteed to run 
in multiple engines.

The POSIX standard defines two different regexes; Tcl supports three; 
Grep supports the two POSIX syntaxes, plus Perl syntax; Python has two 
(regex and re modules); Perl 5 and Perl 6 have completely different 
syntax. Subtle differences, such as when hyphens in character classes 
count as a literal, abound. See, for example:

http://www.regular-expressions.info/refflavors.html

> You can transmit them over a network
> connection to a cooperating process.  You can store them in a database
> or a config file, or allow users to enter them on the fly.

Sure, but if those sorts of things are important to you, there's no 
reason why you can't create your own string-processing language. Apart 
from the time and effort required :)

-- 
Steven

#7007

From	Roy Smith <roy@panix.com>
Date	2011-06-04 09:39 -0400
Message-ID	<roy-FA7A90.09392404062011@news.panix.com>
In reply to	#6997

I wrote:
>> Another nice thing about regexes (as compared to string methods) is 
>> that they're both portable and serializable.  You can use the same 
>> regex in Perl, Python, Ruby, PHP, etc.

In article <4de9bf50$0$29996$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> Regexes are anything but portable. Sure, if you limit yourself to some 
> subset of regex syntax, you might find that many different languages and 
> engines support your regex, but general regexes are not guaranteed to run 
> in multiple engines.

To be sure, if you explore the edges of the regex syntax space, you can 
write non-portable expressions.  You don't even have to get very far out 
to the edge.  But, as you say, if you limit yourself to a subset, you 
can write portable ones.  I have a high level of confidence that I can 
execute:

^foo/bar

on any regex engine in the world and have it match the same thing that

my_string.startswith('foo/bar')

does.  The fact that not all regexes are portable doesn't negate the 
fact that many are portable and that this is useful in real life.

> > You can transmit them over a network
> > connection to a cooperating process.  You can store them in a database
> > or a config file, or allow users to enter them on the fly.
> 
> Sure, but if those sorts of things are important to you, there's no 
> reason why you can't create your own string-processing language. Apart 
> from the time and effort required :)

The time and effort required to write (and debug, and document) the 
language is just part of it.  The bigger part is that you've now got to 
teach this new language to all your users (i.e. another barrier to 
adoption of your system).

For example, I'm working with MongoDB on my current project.  It 
supports regex matching.  Pretty much everything I need to know is 
documented by the Mongo folks saying, "MongoDB uses PCRE for regular 
expressions" (with a link to the PCRE man page).  This lets me leverage 
my existing knowledge of regexes to perform sophisticated queries 
immediately.  Had they invented their own string processing language, I 
would have to invest time to learn that.

As another example, a project I used to work on was very much into NIH 
(Not Invented Here).  They wrote their own pattern matching language, 
loosely based on snobol.  Customers signed up for three-day classes to 
come learn this language so they could use the product.  Ugh.

#7026

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-06-05 00:44 +0000
Message-ID	<4dead160$0$29996$c3e8da3$5496439d@news.astraweb.com>
In reply to	#7007

On Sat, 04 Jun 2011 09:39:24 -0400, Roy Smith wrote:

> To be sure, if you explore the edges of the regex syntax space, you can
> write non-portable expressions.  You don't even have to get very far out
> to the edge.  But, as you say, if you limit yourself to a subset, you
> can write portable ones.  I have a high level of confidence that I can
> execute:
> 
> ^foo/bar
> 
> on any regex engine in the world and have it match the same thing that
> 
> my_string.startswith('foo/bar')
> 
> does.

Not the best choice you could have made, for two reasons:

(1) ^ can match at the start of each line, not just the start of the 
string.  Although this doesn't occur by default in Python, do you know 
whether all other engines default the same way?

(2) There is at least one major regex engine that doesn't support ^ for 
start of string matching at all, namely the W3C XML Schema pattern 
matcher.

As you say... not very far out to the edges at all.
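Point (1) can be seen within Python itself. This is a small sketch with a made-up two-line string: `^` follows `startswith` by default, diverges under `re.MULTILINE`, and `\A` anchors to the start of the string regardless of the flag.

```python
import re

text = 'first line\nfoo/bar on line two'

print(text.startswith('foo/bar'))                          # False
print(re.search(r'^foo/bar', text))                        # None
# With MULTILINE, ^ also matches just after each newline.
print(bool(re.search(r'^foo/bar', text, re.MULTILINE)))    # True
# \A matches only at the very start of the string.
print(re.search(r'\Afoo/bar', text, re.MULTILINE))         # None
```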

[...]
> As another example, a project I used to work on was very much into NIH
> (Not Invented Here).  They wrote their own pattern matching language,
> loosely based on snobol.  Customers signed up for three-day classes to
> come learn this language so they could use the product.  Ugh.

And you think that having customers sign up for a two-week class to learn 
regexes would be an improvement? *wink*

I don't know a lot about SNOBOL pattern matching, but I know that they're 
more powerful than regexes, and it seems to me that they're also easier 
to read and learn. I suspect that the programming world would have been 
much better off if SNOBOL pattern matching had won the popularity battle 
against regexes.

-- 
Steven

#7010

From	rusi <rustompmody@gmail.com>
Date	2011-06-04 09:36 -0700
Message-ID	<f39e414c-7201-4697-a6d2-79c295001df0@18g2000prd.googlegroups.com>
In reply to	#6997

The efficiency argument is specious. [This is a python list not a C
or assembly list]

The real issue is that complex regexes are hard to get right -- even
if one is experienced.
This is analogous to the fact that knotty programs can be hard to get
right even for experienced programmers.

The analogy stems from the fact that both programs in general and
regexes in particular are a code.
Regex in particular is a code for an interesting class of languages --
the so-called regular languages.  And like all involved cod(ing), can
be helped by a debugger.

And just as it is a clincher for effective C programming to have a C
debugger whereas it is less so for python, the effective use of
regexes needs good debugger(s).

I sometimes use regex-tool but there are better I guess (see
http://bc.tech.coop/blog/071103.html )
Most recently there was mention of a python specific tool: kodos
http://kodos.sourceforge.net/about.html

In short I would reword rurpy's complaint to: Regexes should be
recommended along with (the idea of) regex tools.

#7020

From	Nobody <nobody@nowhere.com>
Date	2011-06-04 21:02 +0100
Message-ID	<pan.2011.06.04.20.02.31.657000@nowhere.com>
In reply to	#6997

On Sat, 04 Jun 2011 05:14:56 +0000, Steven D'Aprano wrote:

> This fails to support non-ASCII letters, and you know quite well that 
> having to spell out by hand regexes in both upper and lower (or mixed) 
> case is not support for case-insensitive matching. That's why Python's re 
> has a case insensitive flag.

I find it slightly ironic that you pointed out the ASCII limitation while
overlooking the arbitrariness of upper/lower-case equivalence. Case isn't
the only type of equivalence; it's just the only one which affects ASCII.
Should we also have flags to treat half-width and full-width characters as
equivalent? What about traditional and simplified Chinese, hiragana and
katakana, or the various stylistic variants of the Latin and Greek
alphabets in the mathematical symbols block (U+1D400..U+1D7FF)?
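As an aside, some of these equivalences can be handled without a regex flag by normalising the text first. The sketch below assumes Unicode NFKC normalisation via the standard `unicodedata` module; the helper name `fold_width` is hypothetical. It covers full-width forms and the mathematical alphanumeric variants, and is not expected to help with hiragana versus katakana or traditional versus simplified Chinese.

```python
import re
import unicodedata

def fold_width(text):
    """Map compatibility variants (e.g. full-width forms) to their plain equivalents."""
    return unicodedata.normalize('NFKC', text)

fullwidth = '\uff26\uff2f\uff2f'       # FULLWIDTH LATIN CAPITAL LETTERS F, O, O
print(fold_width(fullwidth))           # FOO
print(bool(re.match(r'foo', fold_width(fullwidth), re.IGNORECASE)))   # True

# MATHEMATICAL BOLD CAPITAL A (U+1D400) folds to a plain "A".
print(fold_width('\U0001d400'))        # A
```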

#7028

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-06-05 01:01 +0000
Message-ID	<4dead55e$0$29996$c3e8da3$5496439d@news.astraweb.com>
In reply to	#7020

On Sat, 04 Jun 2011 21:02:32 +0100, Nobody wrote:

> On Sat, 04 Jun 2011 05:14:56 +0000, Steven D'Aprano wrote:
> 
>> This fails to support non-ASCII letters, and you know quite well that
>> having to spell out by hand regexes in both upper and lower (or mixed)
>> case is not support for case-insensitive matching. That's why Python's
>> re has a case insensitive flag.
> 
> I find it slightly ironic that you pointed out the ASCII limitation
> while overlooking the arbitrariness of upper/lower-case equivalence.

Case is hardly arbitrary. It's *extremely* common, at least in Western 
languages, which you may have noticed we're writing in :-P


> Case isn't the only type of equivalence; it's just the only one which
> affects ASCII. Should we also have flags to treat half-width and
> full-width characters as equivalent? What about traditional and
> simplified Chinese, hiragana and katakana, or the various stylistic
> variants of the Latin and Greek alphabets in the mathematical symbols
> block (U+1D400..U+1D7FF)?

Perhaps we should. But since Python regexes don't support such flags 
either, I fail to see your point.



-- 
Steven

#6998

From	Chris Angelico <rosuav@gmail.com>
Date	2011-06-04 16:04 +1000
Message-ID	<mailman.2450.1307167467.9059.python-list@python.org>
In reply to	#6994

On Sat, Jun 4, 2011 at 12:30 PM, Roy Smith <roy@panix.com> wrote:
> Another nice thing about regexes (as compared to string methods) is that
> they're both portable and serializable.  You can use the same regex in
> Perl, Python, Ruby, PHP, etc.  You can transmit them over a network
> connection to a cooperating process.  You can store them in a database
> or a config file, or allow users to enter them on the fly.
>

I wouldn't ever be transmitting them around the place, unless also
allowing users to enter them. But I have done exactly that - a
validator system that lets you put a header with a regex, and then
string content below that. That IS one advantage of the regex.
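A rough sketch of what such a validator could look like. The post does not describe the actual format, so the layout assumed here (pattern on the first line, one value per following line) and the function name `validate` are hypothetical.

```python
import re

def validate(document):
    """First line is the pattern; every remaining line must match it in full."""
    header, _, body = document.partition('\n')
    pattern = re.compile(header)   # raises re.error if the supplied pattern is invalid
    return [bool(pattern.fullmatch(line)) for line in body.splitlines()]

doc = '[A-Z]\\d{3}\nA123\nb456\nC7890'
print(validate(doc))   # [True, False, False]
```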

However, that's a very VERY specific situation. If I'm not asking a
third party to provide the match condition, then that's not a reason
to go regex.

Chris Angelico

#7075

From	"rurpy@yahoo.com" <rurpy@yahoo.com>
Date	2011-06-05 23:03 -0700
Message-ID	<ef48ad50-da06-47a8-978a-47d6f4271e75@d28g2000yqf.googlegroups.com>
In reply to	#6991

On 06/03/2011 08:05 PM, Steven D'Aprano wrote:
> On Fri, 03 Jun 2011 12:29:52 -0700, rurpy@yahoo.com wrote:
>
>>>> I often find myself changing, for example, a startswith() to a RE when
>>>> I realize that the input can contain mixed case
>>>
>>> Why wouldn't you just normalise the case?
>>
>> Because some of the text may be case-sensitive.
>
> Perhaps you misunderstood me. You don't have to throw away the
> unnormalised text, merely use the normalized text in the expression you
> need.
>
> Of course, if you include both case-sensitive and insensitive tests in
> the same calculation, that's a good candidate for a regex... or at least
> it would be if regexes supported that :)

I did not choose a good example to illustrate what I find often
motivates my use of regexes.

You are right that for a simple .startswith() using a regex "just
in case" is not a good choice, and in fact I would not do that.

The process that I find often occurs is that I write (or am about
to write) a string-method solution, and when I think more about the
input data (which is seldom well-specified), I realize that using
a regex I can get better error checking, do more of the "parsing"
in one place, and adapt to changes in input format better than I
could with a .startswith and a couple other such methods.

Thus what starts as
  if line.startswith ('CUSTOMER '):
    try: kw, first_initial, last_name, code, rest = line.split(None, 4)
    ...
often turns into (sometimes before it is written) something like
  m = re.match (r'CUSTOMER (\w+) (\w+) ([A-Z]\d{3})', line)
  if m: first_initial, last_name, code = m.group(...)
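
A runnable sketch of the regex version, to show the extra error checking it gives. The helper name `parse_customer`, the constant `CUSTOMER_RE`, and the sample lines are assumptions made for the example.

```python
import re

CUSTOMER_RE = re.compile(r'CUSTOMER (\w+) (\w+) ([A-Z]\d{3})')

def parse_customer(line):
    """Return (first_initial, last_name, code), or None if the line is malformed."""
    m = CUSTOMER_RE.match(line)
    return m.groups() if m else None

print(parse_customer('CUSTOMER J Smith A123 rest of line'))   # ('J', 'Smith', 'A123')
# A code that is not "letter plus three digits" is rejected outright,
# whereas split() would hand back '1234' unchecked.
print(parse_customer('CUSTOMER J Smith 1234 rest of line'))   # None
```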

>>>[...]
>>>> or that I have
>>>> to treat commas as well as spaces as delimiters.
>>>
>>> source.replace(",", " ").split(" ")
>>
>> Uhgg. create a whole new string just so you can split it on one rather
>> than two characters?
>
> You say that like it's expensive.

No, I said it like it was ugly.  Doing things unrelated to the
task at hand is ugly.  And not very adaptable -- see my reply
to Chris Torek's post.  I understand it is a common idiom and
I use it myself, but in this case there is a cleaner alternative
with re.split that expresses exactly what one is doing.
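
A small side-by-side sketch of the two approaches, with made-up input. On simple input they agree; the second call shows how little the regex has to change to absorb runs of delimiters.

```python
import re

source = 'alpha,beta gamma'

by_replace = source.replace(',', ' ').split(' ')
by_regex = re.split('[ ,]', source)
print(by_regex)                  # ['alpha', 'beta', 'gamma']
print(by_replace == by_regex)    # True

# One extra character makes the pattern treat ", " or "  " as a single delimiter.
print(re.split('[ ,]+', 'alpha, beta  gamma'))   # ['alpha', 'beta', 'gamma']
```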

> And how do you know what the regex engine is doing under the hood? For all you
> know, it could be making hundreds of temporary copies and throwing them
> away. Or something. It's a black box.

That's a silly argument.
And how do you know what replace is doing under the hood?
I would expect any regex processor to compile the regex into
an FSM.  As usual, I would expect to pay a small performance
price for the generality, but that is a reasonable tradeoff in
many cases.  If it were a potential problem, I would test it.
What I wouldn't do is throw away a useful tool because, "golly,
I don't know, maybe it'll be slow" -- that's just a form of
cargo cult programming.

> The fact that creating a whole new string to split on is faster than
> *running* the regex (never mind compiling it, loading the regex engine,
> and anything else that needs to be done) should tell you which does more
> work. Copying is cheap. Parsing is expensive.

In addition to being wrong (loading is done once, compilation is
typically done once or a few times, while the regex is used many
times inside a loop so the overhead cost is usually trivial compared
with the cost of starting Python or reading a file), this is another
micro-optimization argument.
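
The usual way to keep compilation out of the loop is sketched below, with hypothetical names (`DELIMS`, `split_lines`). The module-level `re` functions also cache recently compiled patterns, so the saving from an explicit `re.compile` is often small.

```python
import re

# Compiled once at import time, then reused for every line.
DELIMS = re.compile('[ ,]')

def split_lines(lines):
    """Split each line on spaces or commas using the precompiled pattern."""
    return [DELIMS.split(line) for line in lines]

print(split_lines(['a,b c', 'd e,f']))   # [['a', 'b', 'c'], ['d', 'e', 'f']]
```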

I'm not sure why you've suddenly developed this obsession with
wringing every last nanosecond out of your code.  Usually it
is not necessary.  Have you thought of buying a faster computer?
Or using C?  *wink*

>> Sorry, but I find
>>
>>     re.split ('[ ,]', source)
>>
>> states much more clearly exactly what is being done with no obfuscation.
>
> That's because you know regex syntax. And I'd hardly call the version
> with replace obfuscated.
>
> Certainly the regex is shorter, and I suppose it's reasonable to expect
> any reader to know at least enough regex to read that, so I'll grant you
> that this is a small win for clarity. A micro-optimization for
> readability, at the expense of performance.
>
>
>> Obviously this is a simple enough case that the difference is minor but
>> when the pattern gets only a little more complex, the clarity difference
>> becomes greater.
>
> Perhaps. But complicated tasks require complicated regexes, which are
> anything but clear.

Complicated tasks require complicated code as well.

As another post pointed out, there are ways to improve the
clarity of a regex such as the re.VERBOSE flag.
There is no doubt that a regex encapsulates information much more
densely than python string manipulation code.  One should not
be surprised that it might take as much time and effort to understand
a one-line regex as a dozen (or whatever) lines of Python code that
do the same thing.  In most cases I'll bet, given equal fluency
in regexes and Python, the regex will take less.
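
As an illustration of the re.VERBOSE point, here is a sketch that reuses the CUSTOMER pattern from earlier in this post; the layout and comments are one possible choice. In verbose mode unescaped whitespace is ignored, so the literal spaces are written as `[ ]`.

```python
import re

CUSTOMER_RE = re.compile(r'''
    CUSTOMER [ ]      # keyword, then one literal space
    (\w+)    [ ]      # first initial
    (\w+)    [ ]      # last name
    ([A-Z]\d{3})      # code: one upper-case letter and three digits
''', re.VERBOSE)

m = CUSTOMER_RE.match('CUSTOMER J Smith A123')
print(m.groups())   # ('J', 'Smith', 'A123')
```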

> [...]
>>>> After doing this a
>>>> number of times, one starts to use an RE right from the get go unless
>>>> one is VERY sure that there will be no requirements creep.
>>>
>>> YAGNI.
>>
>> IAHNI. (I actually have needed it.)
>
> I'm sure you have, and when you need it, it's entirely appropriate to use
> a regex solution. But you stated that you used regexes as insurance *just
> in case* the requirements changed. Why, is your text editor broken? You
> can't change a call to str.startswith(prefix) to re.match(prefix, str) if
> and when you need to? That's what I mean by YAGNI -- don't solve the
> problem you think you might have tomorrow.

Retracted above.

>>> There's no need to use a regex just because you think that you *might*,
>>> someday, possibly need a regex. That's just silly. If and when
>>> requirements change, then use a regex. Until then, write the simplest
>>> code that will solve the problem you have to solve now, not the problem
>>> you think you might have to solve later.
>>
>> I would not recommend you use a regex instead of a string method solely
>> because you might need a regex later.  But when you have to spend 10
>> minutes writing a half-dozen lines of python versus 1 minute writing a
>> regex, your evaluation of the possibility of requirements changing
>> should factor into your decision.
>
> Ah, but if your requirements are complicated enough that it takes you ten
> minutes and six lines of string method calls, that sounds to me like a
> situation that probably calls for a regex!

Recall that the post that started this discussion presented
a problem that took me six lines of code (actually spread out
over a few more for readability) to do without regexes versus
one line with.

So you do agree that a regex was a better solution in
that case?  I ask because we both seem to agree that
regexes are useful tools and preferable when the corresponding
Python code is "too" complex.  We also agree that when the
need can be handled by very simple python code, python may be
preferable.  So I'm trying to calibrate your switch-over point
a little better.

> Of course it depends on what the code actually does... if it counts the
> number of nested ( ) pairs, and you're trying to do that with a regex,
> you're sacked! *wink*

Right.  And again repeating what I said before, regexes
aren't a universal solution to every problem.  *wink*

> [...]
>>> There are a few problems with regexes:
>>>
>>> - they are another language to learn, a very cryptic and terse language;
>>
>> Chinese is cryptic too but there are a few billion people who don't seem
>> to be bothered by that.
>
> Chinese isn't cryptic to the Chinese, because they've learned it from
> childhood.
>
> But has anyone done any studies comparing reading comprehension speed
> between native Chinese readers and native European readers? For all I
> know, Europeans learn to read twice as quickly as Chinese, and once
> learned, read text twice as fast. Or possibly the other way around. Who
> knows? Not me.
>
> But I do know that English typists typing 26 letters of the alphabet
> leave Asian typists and their thousands of ideograms in the dust. There's
> no comparison -- it's like quicksort vs bubblesort *wink*.

70 years ago there was all sorts of scientific evidence
that showed white, Western-European culture did lots of
things better than everyone else, especially non-whites,
in the world.  Let's not go there.  *wink*

> [...]
>>> - debugging regexes is a nightmare;
>>
>> Very complex ones, perhaps.  "Nightmare" seems an overstatement.
>
> You *can't* debug regexes in Python, since there are no tools for (e.g.)
> single-stepping through the regex, displaying intermediate calculations,
> or anything other than making changes to the regex and running it again,
> hoping that it will do the right thing this time.

Thinking in addition to hoping will help quite a bit.

There are two factors that mitigate the lack of debuggers.

1) REs are not a Turing complete language so in some sense
are simpler than Python.

2) The vast majority of REs that I have had to fix or write
are not complex enough to require a debugger.  Often they simply
look complex due to all the parens and backslashes -- once you
reformat them (permanently with the re.VERBOSE flag, or
temporarily in a text editor), they don't look so bad.

> I suppose you can use external tools, like Regex Buddy, if you're on a
> supported platform and if they support your language's regex engine.
>
> [...]
>>> Regarding their syntax, I'd like to point out that even Larry Wall is
>>> dissatisfied with regex culture in the Perl community:
>>>
>>> http://www.perl.com/pub/2002/06/04/apo5.html
>>
>> You did see the very first sentence in this, right?
>>
>>   "Editor's Note: this Apocalypse is out of date and remains here for
>>   historic reasons. See Synopsis 05 for the latest information."
>
> Yes. And did you click through to see the Synopsis? It is a bare
> technical document with all the motivation removed. Since I was pointing
> to Larry Wall's motivation, it was appropriate to link to the Apocalypse
> document, not the Synopsis.

OK, fair enough.

>> (Note that "Apocalypse" is referring to a series of Perl design
>> documents and has nothing to do with regexes in particular.)
>
> But Apocalypse 5 specifically has everything to do with regexes. That's
> why I linked to that, and not (say) Apocalypse 2.

Where did I suggest that you should have linked to Apocalypse 2?
I wrote what I wrote to point out that the "Apocalypse" title was
not a pejorative comment on regexes.  I don't see how I could have
been clearer.

>> Synopsis 05 is (AFAICT with a quick scan) a proposal for revising regex
>> syntax.  I didn't see anything about de-emphasizing them in Perl.  (But
>> I have no idea what is going on for Perl 6 so I could be wrong about
>> that.)
>
> I never said anything about de-emphasizing them. I said that Larry Wall
> was dissatisfied with Perl's culture of regexes -- his own words were:
>
> "regular expression culture is a mess"

Right, and I quoted that.  But I don't know what he meant
by "culture of regexes".  Their syntax?  Their extensive use
in Perl?  Something else?  If you don't care about their
de-emphasis in Perl, then presumably their extensive use
there is not part of what you consider "culture of regexes",
yes?  So to you, "culture of regexes" refers only to the
syntax of Perl regexes?

I pointed out that regexes in Perl 6 (AFAICT from
the Synopsis 05 document) are still as widely used as in
Perl 5.  However, the document also describes changes in *how*
they are used within Perl (e.g., the production of Match objects).
So I conclude the *use* of regexes is part of Larry Wall's concept
of "regex culture".

Further, my guess is that the term means something else again
to many Python programmers -- something more akin to the
LW concept but with a much greater negative valuation.

> and he is also extremely critical of current (i.e. Perl 5) regex syntax.
> Since Python's regex syntax borrows heavily from Perl 5, that's extremely
> pertinent to the issue. When even the champion of regex culture says
> there is much broken about regex culture, we should all listen.

I'll just note that "extremely" is a description you have chosen
to apply.  He identified problems (some of which have developed
since regexes started being widely used) and changes to improve
them.  One could say GvR was "extremely" critical of the
str/unicode situation in Python 2.  It would be a bit much to use
that to say that one should avoid the use of text in Python 2
programs.

The Larry Wall who you claim is "extremely critical of current
regex syntax" proposed the following in the new "fixed" regex
syntax (from the Synopsis 05 doc):

    Unchanged syntactic features
      The following regex features use the same syntax as in Perl 5:
      Capturing: (...)
      Repetition quantifiers: *, +, and ?
      Alternatives: |
      Backslash escape: \
      Minimal matching suffix: ??, *?, +?

Those, with character classes (including "\"-named ones) and non-
capturing ()'s, constitute about 99+% of my regex uses and the
overwhelming majority of regexes I have had to work with.

Nobody here has claimed that regexes are perfect.  No doubt the
Perl 6 changes are an improvement but I doubt that they change
the nature of regexes anywhere near enough to overcome the complaints
against them voiced in this group.  Further, those changes will
likely take years or decades to make their way into the Python
standard library if at all.  (Perl is no longer the thought-leader
it once was, and the new syntax is competing against innumerable
established uses of the old syntax outside of Perl.)  Thus, although
I look forward to the new syntax, I don't see it as any kind of
justification not to use the existing syntax in the meantime.

>> As for the original reference, Wall points out a number of problems with
>> regexes, mostly details of their syntax.  For example that more
>> frequently used non-capturing groups require more characters than
>> less-frequently used capturing groups. Most of these criticisms seem
>> irrelevant to the question of whether hard-wired string manipulation
>> code or regexes should be preferred in a Python program.
>
> It is only relevant in so far as the readability and relative obfuscation
> of regex syntax is relevant. No further.

OK, again you are confirming it is only the syntax of regexes
that bothers you?

> You keep throwing out the term "hard-wired string manipulation", but I
> don't understand what point you're making. I don't understand what you
> see as "hard-wired", or why you think
>
> source.startswith(prefix)
>
> is more hard-wired than
>
> re.match(prefix, source)

What I mean is that I see regexes as being an extremely small,
highly restricted, domain specific language targeted specifically
at describing text patterns.  Thus they do that job better than
trying to describe patterns implicitly with Python code.
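
To make the two calls in the quoted question concrete, here is a small
sketch (the values of source and prefix are invented for illustration).
It assumes the prefix might contain a regex metacharacter, which is
where the two spellings stop being interchangeable:

```python
import re

source = "a.b = 1"
prefix = "a.b"

# Literal prefix test: no pattern language involved.
literal = source.startswith(prefix)

# re.match treats its first argument as a pattern, so "." matches any
# character and the prefix also "matches" a different string.
as_pattern = re.match(prefix, "axb = 1") is not None

# Escaping the prefix restores the literal meaning.
escaped = re.match(re.escape(prefix), "axb = 1") is not None
```

So the regex form describes a pattern even when only a fixed string
was intended, and re.escape is needed to get the literal behaviour.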

> [...]
>> Perhaps you stopped reading after seeing his "regular expression culture
>> is a mess" comment without trying to see what he meant by "culture" or
>> "mess"?
>
> Perhaps you are being over-sensitive and reading *far* too much into what
> I said.

Not sensitive at all.  I expressed an opinion that I thought
is under-represented here and could help some get over their
regex-phobia.  Since it doesn't have a provably right or wrong
answer, I expected it would be contested and have no problem
with that.

As for reading too much into what you said, possibly.  I look
forward to your clarifications.

> If regexes were more readable, as proposed by Wall, that would go
> a long way to reducing my suspicion of them.

I am delighted to read that you find the new syntax more
acceptable.  I guess that means that although you would
object to the Perl 5 regex

  /(?mi)^(?:[a-z]|\d){1,2}(?=\s)/

you find its Perl 6 form

  / :i ^^ [ <[a..z]> || \d ] ** 1..2 <?before \s> /

a big improvement?

And I presume, based on your lack of comment, the size of the
document required to describe the new syntax does not raise
any concerns for you?  Or the many additional new "line-noise"
meta-characters ("too few metacharacters" was one of the
problems LW described in the Apocalypse document you referred
us to)?  Again, I wonder if you and Larry Wall are really on
the same page with the faults you find in the Perl 5 syntax.

And again with the qualifier that I have not spent much time
reading about the changes, and further my regex-fu is at
a low enough level that I am probably unable to fully
appreciate many of the improvements, the syntax doesn't
really look different enough that I see it overcoming the
objections that I often read here.  Consequently I don't
find the argument "avoid using what is currently available"
very convincing.


#7077

From	Chris Torek <nospam@torek.net>
Date	2011-06-06 07:11 +0000
Message-ID	<ishuip02d4g@news6.newsguy.com>
In reply to	#7075

In article <ef48ad50-da06-47a8-978a-47d6f4271e75@d28g2000yqf.googlegroups.com>
rurpy@yahoo.com <rurpy@yahoo.com> wrote (in part):
[mass snippage]
>What I mean is that I see regexes as being an extremely small,
>highly restricted, domain specific language targeted specifically
>at describing text patterns.  Thus they do that job better than
>trying to describe patterns implicitly with Python code.

Indeed.

Kernighan has often used / supported the idea of "little languages";
see:

    http://www.princeton.edu/~hos/frs122/precis/kernighan.htm

In this case, regular expressions form a "little language" that is
quite well suited to some lexical analysis problems.  Since the
language is (modulo various concerns) targeted at the "right level",
as it were, it becomes easy (modulo various concerns :-) ) to
express the desired algorithm precisely yet concisely.

On the whole, this is a good thing.

The trick lies in knowing when it *is* the right level, and how to
use the language of REs.

>On 06/03/2011 08:05 PM, Steven D'Aprano wrote:
>> If regexes were more readable, as proposed by Wall, that would go
>> a long way to reducing my suspicion of them.

"Suspicion" seems like an odd term here.

Still, it is true that something (whether it be use of re.VERBOSE,
and whitespace-and-comments, or some New and Improved Syntax) could
help.  Dense and complex REs are quite powerful, but may also contain
and hide programming mistakes.  The ability to describe what is
intended -- which may differ from what is written -- is useful.
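
As a rough illustration of the re.VERBOSE option, here is a sketch
using a made-up date pattern (the pattern and the sample string are
assumptions for the example, not anything from this thread):

```python
import re

# Dense form: correct, but the intent has to be reverse-engineered.
dense = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# Same pattern with re.VERBOSE: whitespace in the pattern is ignored
# and "#" starts a comment, so each piece can say what it is for.
verbose = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.VERBOSE)

sample = "released 2011-06-06"
dense_groups = dense.search(sample).groups()
verbose_groups = verbose.search(sample).groups()
```

Both compiled patterns match the same text; only the second one
records what the author meant each group to be.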

As an interesting aside, even without the re.VERBOSE flag, one can
build complex, yet reasonably-understandable, REs in Python, by
breaking them into individual parts and giving them appropriate
names.  (This is also possible in perl, although the perl syntax
makes it less obvious, I think.)
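
A minimal sketch of that idea, assuming a hypothetical "CUSTOMER"
record format (the part names INITIAL, LAST_NAME and CODE, and the
sample line, are invented here):

```python
import re

# Each piece gets a name that says what it is meant to match.
INITIAL = r"[A-Z]"
LAST_NAME = r"[A-Za-z]+"
CODE = r"[A-Z]\d{3}"

# Named groups keep the assembled pattern readable at the point of use.
CUSTOMER = re.compile(
    r"CUSTOMER (?P<initial>%s) (?P<last_name>%s) (?P<code>%s)"
    % (INITIAL, LAST_NAME, CODE)
)

m = CUSTOMER.match("CUSTOMER J Smith A123 extra text")
fields = m.groupdict() if m else None
```

The individual parts can then be tested or reused on their own, which
is hard to do with a single long pattern string.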
-- 
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
email: gmail (figure it out)      http://web.torek.net/torek/index.html


#7079

From	"Octavian Rasnita" <orasnita@gmail.com>
Date	2011-06-06 11:51 +0300
Message-ID	<mailman.2485.1307350328.9059.python-list@python.org>
In reply to	#7077

It is not so hard to decide whether using RE is a good thing or not.

When speed is important and every millisecond counts, RE should be used 
only when there is no faster way, because RE is usually slower 
than the other core Perl/Python functions that can do matching and 
replacing.

When the speed is not such a big issue, RE should be used only if it is 
easier to understand and maintain than using the core functions. And of 
course, RE should be used when the core functions cannot do what RE can do.

In Python, the RE syntax is not as short and simple as in Perl, so using RE 
even for very simple things requires longer code, and other core 
functions may appear to be a better solution, because the RE version of the 
code is almost never as easy to read as the code that uses other core 
functions (or... for a very simple RE, the two are probably equally readable).

In Perl, RE syntax is very short and simple, and in some cases code that 
uses RE is easier to understand and maintain than code that uses other core 
functions.

For example, if somebody wants to check if the $var variable contains the 
letter "x", a solution without RE in Perl is:

if ( index( $var, 'x' ) >= 0 ) {
    print "ok";
}

while the solution with RE is:

if ( $var =~ /x/ ) {
    print "ok";
}

And obviously the solution that uses RE is shorter and easier to 
read and maintain, and is also much more flexible.
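
For comparison, a rough Python counterpart of the two Perl snippets 
above (only a sketch; the variable name var and its value are 
illustrative):

```python
import re

var = "example"

# Plain substring test with the `in` operator.
found_plain = "x" in var

# The same test through the re module.
found_regex = re.search("x", var) is not None
```

In Python the plain `in` test is the shorter of the two, which fits 
the point above about RE needing longer code for very simple things.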

Of course, sometimes an even better alternative is to use a module from CPAN 
like Regexp::Common that can use RE in a simpler and more readable way for 
matching numbers, profanity, balanced parens, programming language 
comments, IP and MAC addresses, zip codes... or a module like Email::Valid 
for verifying that an email address is correct, because it may be very hard to 
create a RE for matching an email address.

So... just like with Python, there are more ways to do it, but depending on 
the situation, some of them are better than others. :-)

--Octavian



#7080

From	Chris Angelico <rosuav@gmail.com>
Date	2011-06-06 19:01 +1000
Message-ID	<mailman.2486.1307350868.9059.python-list@python.org>
In reply to	#7077

On Mon, Jun 6, 2011 at 6:51 PM, Octavian Rasnita <orasnita@gmail.com> wrote:
> It is not so hard to decide whether using RE is a good thing or not.
>
> When speed is important and every millisecond counts, RE should be used
> only when there is no faster way, because RE is usually slower
> than the other core Perl/Python functions that can do matching and
> replacing.
>
> When the speed is not such a big issue, RE should be used only if it is
> easier to understand and maintain than using the core functions. And of
> course, RE should be used when the core functions cannot do what RE can do.

for X in features:
  "When speed is important and every millisecond counts, X should be
used only when there is no other faster way."
  "When speed is not such a big issue, X should be used only if it is
easier to understand and maintain than other ways."

I think that's fairly obvious. :)

Chris Angelico


#7086

From	rusi <rustompmody@gmail.com>
Date	2011-06-06 07:33 -0700
Message-ID	<89c031a7-4fbb-4407-9255-594845f40ee0@d26g2000prn.googlegroups.com>
In reply to	#7077

For any significant language feature (take recursion for example)
there are these issues:

1. Ease of reading/skimming (others') code
2. Ease of writing/designing one's own
3. Learning curve
4. Costs/payoffs (eg efficiency, succinctness) of use
5. Debug-ability

I'll start with 3.
When someone of Kernighan's calibre (thanks for the link BTW) says
that he found recursion difficult it could mean either that Kernighan
is a stupid guy (unlikely considering his other achievements) or
that C is not optimal (as compared to lisp say) for learning
recursion.

Evidently for syntactic, implementation and cultural reasons, Perl
programmers are likely to get (and then overuse) regexes faster than
python programmers.

1 is related but not the same as 3.  Someone with courses in automata,
compilers etc -- standard CS stuff -- is unlikely to find regexes a
problem.  Conversely an intelligent programmer without a CS background
may find them more forbidding.

On Jun 6, 12:11 pm, Chris Torek <nos...@torek.net> wrote:
>
> >On 06/03/2011 08:05 PM, Steven D'Aprano wrote:
> >> If regexes were more readable, as proposed by Wall, that would go
> >> a long way to reducing my suspicion of them.
>
> "Suspicion" seems like an odd term here.

When I was in school my mother warned me that in college I would have
to undergo a most terrifying course called 'calculus'.

Steven's 'suspicions' make me recall my mother's warning :-)


#7172

From	"rurpy@yahoo.com" <rurpy@yahoo.com>
Date	2011-06-07 11:37 -0700
Message-ID	<7271e6d5-fb81-46b6-9d7e-812acad1e91a@o10g2000prn.googlegroups.com>
In reply to	#7086

On 06/06/2011 08:33 AM, rusi wrote:
> For any significant language feature (take recursion for example)
> there are these issues:
>
> 1. Ease of reading/skimming (other's) code
> 2. Ease of writing/designing one's own
> 3. Learning curve
> 4. Costs/payoffs (eg efficiency, succinctness) of use
> 5. Debug-ability
>
> I'll start with 3.
> When someone of Kernighan's calibre (thanks for the link BTW) says
> that he found recursion difficult it could mean either that Kernighan
> is a stupid guy -- unlikely considering his other achievements. Or
> that C is not optimal (as compared to lisp say) for learning
> recursion.

Just as a side comment, I didn't see anything in the link
Chris Torek posted (repeated here since it got snipped:
http://www.princeton.edu/~hos/frs122/precis/kernighan.htm)
that said Kernighan found recursion difficult, just that it
was perceived as expensive.  Nor that the expense had anything
to do with programming language but rather was due to hardware
constraints of the time.
But maybe you are referring to some other source?

> Evidently for syntactic, implementation and cultural reasons, Perl
> programmers are likely to get (and then overuse) regexes faster than
> python programmers.

If by "get", you mean "understand", then I'm not sure why
the reasons you give should make a big difference.  Regex
syntax is pretty similar in both Python and Perl, and
virtually identical in terms of learning their basics.
There are some differences in how regexes are used
between Perl and Python that I mentioned in
http://groups.google.com/group/comp.lang.python/msg/39fca0d4589f4720?,
but as I said there, that wouldn't, particularly in light
of Python culture where one-liners and terseness are not
highly valued, seem very important.  And I don't see how
the different Perl and Python cultures themselves would
make learning regexes harder for Python programmers.  At
most I can see the Perl culture encouraging their use and
the Python culture discouraging it, but that doesn't change
the ease or difficulty of learning.

And why do you say "overuse" regexes?  Why isn't it the case
that Perl programmers use regexes appropriately in Perl?  Are
you not arbitrarily applying a Python-centric standard to a
different culture?  What if a Perl programmer says that Python
programmers under-use regexes?

> 1 is related but not the same as 3.  Someone with courses in automata,
> compilers etc -- standard CS stuff -- is unlikely to find regexes a
> problem.  Conversely an intelligent programmer without a CS background
> may find them more forbidding.

I'm not sure of that.  (Not sure it should be that way,
perhaps it may be that way in practice.)  I suspect that
a good theoretical understanding of automata theory would
be essential in writing a regex compiler but I'm not sure
it is necessary to use regexes.

It does I'm sure give one a solid understanding of the
limitations of regexes but a practical understanding of
those can be achieved without the full course I think.


#7202

From	Roy Smith <roy@panix.com>
Date	2011-06-07 20:30 -0400
Message-ID	<roy-E4BE4C.20300607062011@news.panix.com>
In reply to	#7172

On 06/06/2011 08:33 AM, rusi wrote:
>> Evidently for syntactic, implementation and cultural reasons, Perl
>> programmers are likely to get (and then overuse) regexes faster than
>> python programmers.

"rurpy@yahoo.com" <rurpy@yahoo.com> wrote:
> I don't see how the different Perl and Python cultures themselves 
> would make learning regexes harder for Python programmers.

Oh, that part's obvious.  People don't learn things in a vacuum.  They 
read about something, try it, fail, and ask for help.  If, in one 
community, the response they get is, "I see what's wrong with your 
regex, you need to ...", and in another they get, "You shouldn't be 
using a regex there, you should use this string method instead...", it 
should not be a surprise that it's easier to learn about regexes in the 
first community.


#7240

From	"rurpy@yahoo.com" <rurpy@yahoo.com>
Date	2011-06-08 07:38 -0700
Message-ID	<55df895c-0344-44c3-a29c-64f382d65e9a@z7g2000prh.googlegroups.com>
In reply to	#7202

On 06/07/2011 06:30 PM, Roy Smith wrote:
> On 06/06/2011 08:33 AM, rusi wrote:
>>> Evidently for syntactic, implementation and cultural reasons, Perl
>>> programmers are likely to get (and then overuse) regexes faster than
>>> python programmers.
>
> "rurpy@yahoo.com" <rurpy@yahoo.com> wrote:
>> I don't see how the different Perl and Python cultures themselves
>> would make learning regexes harder for Python programmers.
>
> Oh, that part's obvious.  People don't learn things in a vacuum.  They
> read about something, try it, fail, and ask for help.  If, in one
> community, the response they get is, "I see what's wrong with your
> regex, you need to ...", and in another they get, "You shouldn't be
> using a regex there, you should use this string method instead...", it
> should not be a surprise that it's easier to learn about regexes in the
> first community.

I think we are just using different definitions of "harder".

I said, immediately after the sentence you quoted,

>> At
>> most I can see the Perl culture encouraging their use and
>> the Python culture discouraging it, but that doesn't change
>> the ease or difficulty of learning.

Constantly being told not to use regexes certainly discourages
one from learning them, but I don't think that's the same as
being *harder* to learn in Python.  The syntax of regexes is,
at least at the basic level, pretty universal, and it is in
learning to understand that syntax that most of any difficulty
lies.  Whether to express a regex as "/code (blue)|(red)/i" in
Perl or "(r'code (blue)|(red)', re.I)" in Python is a superficial
difference, as is, say, using match results: "$alert = $1" vs
"alert = m.group(1)".

A Google for "python regular expression tutorial" produces
lots of results including the Python docs HOWTO.  And because
the syntax is pretty universal, leaving the "python" off that
search string will yield many, many more that are applicable.
Although one does get some "don't do that" responses to regex
questions on this list (and some are good advice), there are
also usually answers too.

So I think of it as more of a Python culture thing, rather
than being actually harder to learn to use regexes in Python,
although I see how one can view it your way too.
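
For concreteness, the Python spelling of the example regex above can
be run as follows.  This is only a sketch: the sample strings are
invented, and nothing here is specific to the original question.

```python
import re

pattern = re.compile(r'code (blue)|(red)', re.I)

m1 = pattern.search('Code Blue')
m2 = pattern.search('Code Red')

# "|" has the lowest precedence, so the two alternatives are
# "code (blue)" and "(red)"; each one fills a different group.
blue_groups = m1.groups()   # first alternative matched
red_groups = m2.groups()    # second alternative matched
```

If the intent were "code" followed by either colour, the alternation
would need its own group, e.g. r'code (blue|red)', and that reads the
same way in Perl.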


#7246

From	rusi <rustompmody@gmail.com>
Date	2011-06-08 09:14 -0700
Message-ID	<4a857a7d-1d59-49ba-a96b-c5e4d4e36726@r33g2000prh.googlegroups.com>
In reply to	#7240

On Jun 8, 7:38 pm, "ru...@yahoo.com" <ru...@yahoo.com> wrote:
> On 06/07/2011 06:30 PM, Roy Smith wrote:
>
>
>
> > On 06/06/2011 08:33 AM, rusi wrote:
> >>> Evidently for syntactic, implementation and cultural reasons, Perl
> >>> programmers are likely to get (and then overuse) regexes faster than
> >>> python programmers.
>
> > "ru...@yahoo.com" <ru...@yahoo.com> wrote:
> >> I don't see how the different Perl and Python cultures themselves
> >> would make learning regexes harder for Python programmers.
>
> > Oh, that part's obvious.  People don't learn things in a vacuum.  They
> > read about something, try it, fail, and ask for help.  If, in one
> > community, the response they get is, "I see what's wrong with your
> > regex, you need to ...", and in another they get, "You shouldn't be
> > using a regex there, you should use this string method instead...", it
> > should not be a surprise that it's easier to learn about regexes in the
> > first community.
>
> I think we are just using different definitions of "harder".
>
> I said, immediately after the sentence you quoted,
>
> >> At
> >> most I can see the Perl culture encouraging their use and
> >> the Python culture discouraging it, but that doesn't change
> >> the ease or difficulty of learning.
>
> Constantly being told not to use regexes certainly discourages
> one from learning them, but I don't think that's the same as
> being *harder* to learn in Python.  The syntax of regexes is,
> at least at the basic level, pretty universal, and it is in
> learning to understand that syntax that most of any difficulty
> lies.  Whether to express a regex as "/code (blue)|(red)/i" in
> Perl or "(r'code (blue)|(red)', re.I)" in Python is a superficial
> difference, as is, say, using match results: "$alert = $1" vs
> "alert = m.group(1)".
>
> A Google for "python regular expression tutorial" produces
> lots of results including the Python docs HOWTO.  And because
> the syntax is pretty universal, leaving the "python" off that
> search string will yield many, many more that are applicable.
> Although one does get some "don't do that" responses to regex
> questions on this list (and some are good advice), there are
> also usually answers too.
>
> So I think of it as more of a Python culture thing, rather
> than being actually harder to learn to use regexes in Python
> although I see how one can view it your way too.


... this is the old nature vs nurture debate: http://en.wikipedia.org/wiki/Nature_versus_nurture


#7223

From	rusi <rustompmody@gmail.com>
Date	2011-06-08 01:27 -0700
Message-ID	<efaa5be9-34c8-43c0-9b91-acde5bee372d@e17g2000prj.googlegroups.com>
In reply to	#7172

On Jun 7, 11:37 pm, "ru...@yahoo.com" <ru...@yahoo.com> wrote:
> On 06/06/2011 08:33 AM, rusi wrote:
>
> > For any significant language feature (take recursion for example)
> > there are these issues:
>
> > 1. Ease of reading/skimming (other's) code
> > 2. Ease of writing/designing one's own
> > 3. Learning curve
> > 4. Costs/payoffs (eg efficiency, succinctness) of use
> > 5. Debug-ability
>
> > I'll start with 3.
> > When someone of Kernighan's calibre (thanks for the link BTW) says
> > that he found recursion difficult it could mean either that Kernighan
> > is a stupid guy -- unlikely considering his other achievements. Or
> > that C is not optimal (as compared to lisp say) for learning
> > recursion.
>
> Just as a side comment, I didn't see anything in the link
> Chris Torek posted (repeated here since it got snipped:
> http://www.princeton.edu/~hos/frs122/precis/kernighan.htm)
> that said Kernighan found recursion difficult, just that it
> was perceived as expensive.  Nor that the expense had anything
> to do with programming language but rather was due to hardware
> constraints of the time.
> But maybe you are referring to some other source?

No, the same source; see:

> In his work Kernighan also experimented with writing structured and unstructured programs.
> He found writing structured programs (programs that did not use goto's) difficult at first,
> but now he cannot imagine writing programs in any other manner.  The idea of recursion
> in programs also seemed to develop slowly; the advantage to the programmer was clear,
> but recursion statements were generally perceived as expensive, and thus were discouraged.

Note the "also" -- it suggests that recursion and structured programming
went together for Kernighan.

>
> > Evidently for syntactic, implementation and cultural reasons, Perl
> > programmers are likely to get (and then overuse) regexes faster than
> > python programmers.
>
> If by "get", you mean "understand", then I'm not sure why
> the reasons you give should make a big difference.  Regex
> syntax is pretty similar in both Python and Perl, and
> virtually identical in terms of learning their basics.

Having it part of the language (rather than an import-ed module) makes
for a certain 'smoothness' (I imagine).

> There are some differences in the how regexes are used
> between Perl and Python that I mentioned in http://groups.google.com/group/comp.lang.python/msg/39fca0d4589f4720?,
> but as I said there, that wouldn't, particularly in light
> of Python culture where one-liners and terseness are not
> highly valued, seem very important.  And I don't see how
> the different Perl and Python cultures themselves would
> make learning regexes harder for Python programmers.  At
> most I can see the Perl culture encouraging their use and
> the Python culture discouraging it, but that doesn't change
> the ease or difficulty of learning.

See Roy's answer.


> What if a Perl programmer says that Python programmers under-use regexes?

That's what I gather they would say (and I guess you and I agree it's
true?)

>
> And why do you say "overuse" regexs?  Why isn't it the case
> that Perl programmers use regexes appropriately in Perl?  Are
> you not arbitrarily applying a Python-centric standard to a
> different culture?  


>
> > 1 is related but not the same as 3.  Someone with courses in automata,
> > compilers etc -- standard CS stuff -- is unlikely to find regexes a
> > problem.  Conversely an intelligent programmer without a CS background
> > may find them more forbidding.
>
> I'm not sure of that.  (Not sure it should be that way,
> perhaps it may be that way in practice.)  I suspect that
> a good theoretical understanding of automata theory would
> be essential in writing a regex compiler but I'm not sure
> it is necessary to use regexes.
>
> It does I'm sure give one a solid understanding of the
> limitations of regexes but a practical understanding of
> those can be achieved without the full course I think.

How do you answer when a regex-happy but CS-illiterate programmer asks
for a regex to match parentheses?
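
(The usual answer is that arbitrarily nested parentheses are not a
regular language, so a classical regular expression cannot count the
nesting depth, and a counter does the job instead.  A minimal sketch,
with a made-up function name:)

```python
def parens_balanced(text):
    """Return True if every "(" in text has a matching ")"."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a ")" appeared before its "("
                return False
    return depth == 0
```

This only checks balance; it does not extract the matched groups.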

Anyway you may be right and this is quite far from my main points --
which I would like to reiterate:

1. Regexes were invented by automata theorists and used mostly
   unchanged by early Unix hackers.
   This works up to a point and fails badly when they get too large.
2. Larry Wall's suggestions are on the whole good, and Python can
   leapfrog over Perl by implementing them, especially given that it is
   much easier for Python to add a newRe module than for perl5 to become
   perl6.
3. A big problem with regexes is not having re-debuggers. Things like
   emacs' re-builder, regex-tool and python's native kodos need more
   visibility.


#7087

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-06-06 15:29 +0000
Message-ID	<4decf252$0$29996$c3e8da3$5496439d@news.astraweb.com>
In reply to	#7075

On Sun, 05 Jun 2011 23:03:39 -0700, rurpy@yahoo.com wrote:

> Thus what starts as
>   if line.startswith ('CUSTOMER '):
>     try: 
>       kw, first_initial, last_name, code, rest = line.split(None, 4)
>       ...
> often turns into (sometimes before it is written) something like
>   m = re.match (r'CUSTOMER (\w+) (\w+) ([A-Z]\d{3})', line) 
>   if m:
>     first_initial, last_name, code = m.group(...)

I would argue that the first, non-regex solution is superior, as it 
clearly distinguishes the multiple steps of the solution:

* filter lines that start with "CUSTOMER"
* extract fields in that line
* validate fields (not shown in your code snippet)

while the regex tries to do all of these in a single command. This makes 
the regex an "all or nothing" solution: it matches *everything* or 
*nothing*. This means that your opportunity for giving meaningful error 
messages is much reduced. E.g. I'd like to give an error message like:

    found digit in customer name (field 2)

but with your regex, if it fails to match, I have no idea why it failed, 
so can't give any more meaningful error than:

    invalid customer line

and leave it to the caller to determine what makes it invalid. (Did I 
misspell "CUSTOMER"? Put a dot after the initial? Forget the code? Use 
two spaces between fields instead of one?)
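
To sketch what I mean by separate steps (this is only an illustration: 
the function name, the exact validation rules and the wording of the 
other messages are my assumptions, not part of your snippet):

```python
def parse_customer(line):
    """Parse 'CUSTOMER <initial> <last_name> <code> ...' step by step."""
    # Step 1: filter lines that start with "CUSTOMER".
    if not line.startswith("CUSTOMER "):
        raise ValueError("line does not start with CUSTOMER")
    # Step 2: extract the fields.
    parts = line.split(None, 4)
    if len(parts) < 4:
        raise ValueError("expected initial, last name and code")
    first_initial, last_name, code = parts[1:4]
    # Step 3: validate each field separately, so each failure
    # can report its own reason.
    if any(ch.isdigit() for ch in last_name):
        raise ValueError("found digit in customer name (field 2)")
    if not (len(code) == 4 and "A" <= code[0] <= "Z" and code[1:].isdigit()):
        raise ValueError("malformed customer code (field 3)")
    return first_initial, last_name, code
```

Each failure gets its own message, which is the part a single 
all-or-nothing match cannot give you without extra work.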

[...]
> I would expect
> any regex processor to compile the regex into an FSM.

Flying Spaghetti Monster?

I have been Touched by His Noodly Appendage!!!

[...]
>> The fact that creating a whole new string to split on is faster than
>> *running* the regex (never mind compiling it, loading the regex engine,
>> and anything else that needs to be done) should tell you which does
>> more work. Copying is cheap. Parsing is expensive.
> 
> In addition to being wrong (loading is done once, compilation is
> typically done once or a few times, while the regex is used many times
> inside a loop so the overhead cost is usually trivial compared with the
> cost of starting Python or reading a file), this is another
> micro-optimization argument.

Yes, but you have to pay the cost of loading the re engine. Even if it is 
a one-off cost, it's still a cost, and sometimes (not always!) it can be 
significant. It's quite hard to write fast, tiny Python scripts, because 
the initialization costs of the Python environment are so high. (Not as 
high as for, say, VB or Java, but much higher than, say, shell scripts.) 
In a tiny script, you may be better off avoiding regexes because it takes 
longer to load the engine than to run the rest of your script!

But yes, you are right that this is a micro-optimization argument. In a 
big application, it's less likely to be important.
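
One way to check the one-off cost on a given machine is to time the import in a fresh interpreter. This is only a sketch: measure_re_import is a hypothetical helper, it assumes Python 3.3 or later for time.perf_counter, and it passes -S so that the site module does not get a chance to import re first.

```python
import subprocess
import sys

# Code run in the child interpreter: note whether re was already
# loaded at startup, then time the import itself.
SNIPPET = (
    "import sys, time\n"
    "already = 're' in sys.modules\n"
    "start = time.perf_counter()\n"
    "import re\n"
    "elapsed = time.perf_counter() - start\n"
    "print(already, elapsed)\n"
)

def measure_re_import():
    """Return (already_loaded, seconds) for 'import re' in a new process."""
    out = subprocess.check_output([sys.executable, "-S", "-c", SNIPPET])
    already, elapsed = out.decode().split()
    return already == "True", float(elapsed)

if __name__ == "__main__":
    print(measure_re_import())
```

If the first value comes back True, the interpreter had already loaded re before the script ran, and the measured time says little about the real cost.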

> I'm not sure why you've suddenly developed this obsession with wringing
> every last nanosecond out of your code.  Usually it is not necessary. 
> Have you thought of buying a faster computer? Or using C?  *wink*

It's hardly an obsession. I'm just stating it as a relevant factor: for 
simple text parsing tasks, string methods are often *much* faster than 
regexes.
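
Anyone who wants to check this on their own machine can use a small timeit harness. The sketch below uses the task from the subject line (removing leading whitespace); the function names and the sample line are made up for illustration, and no particular timings are claimed:

```python
import re
import timeit

# Compiled once, so only the matching itself is timed.
LEADING_WS = re.compile(r"^\s+")

def strip_with_regex(line):
    return LEADING_WS.sub("", line)

def strip_with_str_method(line):
    return line.lstrip()

def time_both(line="    some text\n", number=100000):
    """Return (regex_seconds, str_method_seconds) for `number` calls each."""
    t_regex = timeit.timeit(lambda: strip_with_regex(line), number=number)
    t_str = timeit.timeit(lambda: strip_with_str_method(line), number=number)
    return t_regex, t_str

if __name__ == "__main__":
    print(time_both())
```

Both timings include the overhead of the lambda call, so the ratio matters more than the absolute numbers.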

[...]
>> Ah, but if your requirements are complicated enough that it takes you
>> ten minutes and six lines of string method calls, that sounds to me
>> like a situation that probably calls for a regex!
> 
> Recall that the post that started this discussion presented a problem
> that took me six lines of code (actually spread out over a few more for
> readability) to do without regexes versus one line with.
> 
> So you do agree that a regex was a better solution in that case?

I don't know... I'm afraid I can't find your six lines of code, and so 
can't judge it in comparison to your regex solution:

for line in f:
    fixed = re.sub(r"(TABLE='\S+)\s+'$", r"\1'", line)

My solution would probably be something like this:

for line in lines:
    if line.endswith("'"):
        line = line[:-1].rstrip() + "'"

although perhaps I've misunderstood the requirements.
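
To make the comparison concrete, here are the two approaches as functions that can be run on the same input. This is a sketch, not either poster's exact code: the function names are hypothetical, and the string-method version sets the trailing newline aside first, on the assumption that lines come straight from a file (otherwise endswith("'") would be False).

```python
import re

PATTERN = re.compile(r"(TABLE='\S+)\s+'$")

def fix_with_regex(line):
    # Drop whitespace between a TABLE value and its closing quote.
    return PATTERN.sub(r"\1'", line)

def fix_with_str_methods(line):
    # Assumption: separate the trailing newline before testing the end.
    body = line.rstrip("\n")
    newline = line[len(body):]
    if body.endswith("'"):
        body = body[:-1].rstrip() + "'"
    return body + newline

if __name__ == "__main__":
    for sample in ("TABLE='foo   '\n", "NAME='bar  '\n"):
        print(repr(fix_with_regex(sample)), repr(fix_with_str_methods(sample)))
```

The two agree on TABLE lines, but the string-method version also trims a line such as NAME='bar  ', which the regex leaves alone because it requires the TABLE= prefix. Whether that matters depends on the requirements.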

[...]
>>> (Note that "Apocalypse" is referring to a series of Perl design
>>> documents and has nothing to do with regexes in particular.)
>>
>> But Apocalypse 5 specifically has everything to do with regexes. That's
>> why I linked to that, and not (say) Apocalypse 2.
> 
> Where did I suggest that you should have linked to Apocalypse 2? I wrote
> what I wrote to point out that the "Apocalypse" title was not a
> pejorative comment on regexes.  I don't see how I could have been
> clearer.

Possibly by saying what you just said here?

I never suggested, or implied, or thought, that "Apocalypse" was a 
pejorative comment on *regexes*. The fact that I referenced Apocalypse 
FIVE suggests strongly that there are at least four others, presumably 
not about regexes.

[...]
>> It is only relevant in so far as the readability and relative
>> obfuscation of regex syntax is relevant. No further.
> 
> OK, again you are confirming it is only the syntax of regexes that
> bothers you?

The syntax of regexes is a big part of it. I won't say the only part.

[...]
>> If regexes were more readable, as proposed by Wall, that would go a
>> long way to reducing my suspicion of them.
> 
> I am delighted to read that you find the new syntax more acceptable.

Perhaps I wasn't as clear as I could have been. I don't know what the new 
syntax is. I was referring to the design principle of improving the 
readability of regexes. Whether Wall's new syntax actually does improve 
readability and ease of maintenance is a separate issue, one on which I 
don't have an opinion. I applaud his *intention* to reform regex 
syntax, without necessarily agreeing that he has done so.

-- 
Steven


#7089

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2011-06-06 10:06 -0600
Message-ID	<mailman.2489.1307376411.9059.python-list@python.org>
In reply to	#7087

On Mon, Jun 6, 2011 at 9:29 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> [...]
>> I would expect
>> any regex processor to compile the regex into an FSM.
>
> Flying Spaghetti Monster?
>
> I have been Touched by His Noodly Appendage!!!

Finite State Machine.
