Groups > comp.lang.python > #6811 > unrolled thread

Re: how to avoid leading white spaces

Started by	Chris Rebert <clp2@rebertia.com>
First post	2011-06-01 10:11 -0700
Last post	2011-06-05 04:17 -0700
Articles	20 on this page of 64 — 19 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: how to avoid leading white spaces Chris Rebert <clp2@rebertia.com> - 2011-06-01 10:11 -0700
    Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-01 12:39 -0700
      Re: how to avoid leading white spaces Karim <karim.liateni@free.fr> - 2011-06-01 22:34 +0200
      Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-02 13:21 +0000
        Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-02 21:57 -0400
          Re: how to avoid leading white spaces MRAB <python@mrabarnett.plus.com> - 2011-06-03 03:41 +0100
          Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-03 02:58 +0000
            Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-02 23:44 -0400
              Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-03 13:52 +1000
              Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-03 13:54 +1000
              Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-03 04:30 +0000
                Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-03 14:11 +0100
            Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-03 14:18 +0100
            Re: how to avoid leading white spaces Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2011-06-04 13:41 +1200
              Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-04 20:44 +0100
            Re: how to avoid leading white spaces Ian <hobson42@gmail.com> - 2011-06-06 22:04 +0100
              Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-09 02:32 +0000
          Re: how to avoid leading white spaces Thorsten Kampe <thorsten@thorstenkampe.de> - 2011-06-03 10:32 +0200
        Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-03 05:51 -0700
          Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-03 13:17 +0000
            Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-03 08:14 -0700
          Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-03 14:25 +0000
            Re: how to avoid leading white spaces "D'Arcy J.M. Cain" <darcy@druid.net> - 2011-06-03 10:58 -0400
            Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-03 12:29 -0700
              Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-03 20:49 +0000
                Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-03 21:45 +0000
                  Re: how to avoid leading white spaces Ethan Furman <ethan@stoneleaf.us> - 2011-06-03 15:11 -0700
                  Re: how to avoid leading white spaces MRAB <python@mrabarnett.plus.com> - 2011-06-03 23:38 +0100
                  Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-05 22:47 -0700
                Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-05 22:44 -0700
                  Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-06 16:08 +0000
                    Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 10:29 -0600
                      Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-06 17:17 +0000
                        Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 11:40 -0600
                          Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-06 17:56 +0000
                    Re: how to avoid leading white spaces Ethan Furman <ethan@stoneleaf.us> - 2011-06-06 10:48 -0700
                    Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 11:42 -0600
              Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-04 02:05 +0000
                Re: how to avoid leading white spaces MRAB <python@mrabarnett.plus.com> - 2011-06-04 03:24 +0100
                  Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-04 04:59 +0000
                Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-03 22:30 -0400
                  Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-04 05:14 +0000
                    Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-04 09:39 -0400
                      Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-05 00:44 +0000
                    Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-04 09:36 -0700
                    Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-04 21:02 +0100
                      Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-05 01:01 +0000
                  Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-04 16:04 +1000
                Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-05 23:03 -0700
                  Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-06 07:11 +0000
                    Re: how to avoid leading white spaces "Octavian Rasnita" <orasnita@gmail.com> - 2011-06-06 11:51 +0300
                    Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-06 19:01 +1000
                    Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-06 07:33 -0700
                      Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-07 11:37 -0700
                        Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-07 20:30 -0400
                          Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-08 07:38 -0700
                            Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-08 09:14 -0700
                        Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-08 01:27 -0700
                  Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-06 15:29 +0000
                    Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 10:06 -0600
                    Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-07 09:00 -0700
                      Re: how to avoid leading white spaces Duncan Booth <duncan.booth@invalid.invalid> - 2011-06-08 09:01 +0000
                        Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-08 07:39 -0700
            Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-05 04:17 -0700


#6950

From	"rurpy@yahoo.com" <rurpy@yahoo.com>
Date	2011-06-03 08:14 -0700
Message-ID	<70673ab2-106b-45b3-a40b-e0bc3e2ad732@c41g2000yqm.googlegroups.com>
In reply to	#6943

On 06/03/2011 07:17 AM, Neil Cerutti wrote:
> On 2011-06-03, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
>> The other tradeoff, applying both to Perl and Python is with
>> maintenance.  As mentioned above, even when today's
>> requirements can be solved with some code involving several
>> string functions, indexes, and conditionals, when those
>> requirements change, it is usually a lot harder to modify that
>> code than a RE.
>>
>> In short, although your observations are true to some extent,
>> they are not sufficient to justify the anti-RE attitude often
>> seen here.
>
> Very good article. Thanks. I mostly wanted to combat the notion
> that the alleged anti-RE attitude here might be caused by an
> opposition to Perl culture.
>
> I contend that the anti-RE attitude sometimes seen here is caused
> by dissatisfaction with regexes in general combined with an
> aversion to the re module. I agree that it's not that bad, but
> it's clunky enough that it does contribute to making it my last
> resort.

But I questioned the reasons given (not as efficient, not built
in, not often needed) for dissatisfaction with REs.[*]  If those
reasons are not strong, then is not their Perl-smell still a leading
candidate for explaining the anti-RE attitude here?

Of course the whole question, lacking some serious group-psychological
investigation, is pure speculation anyway.

----
[*] A reason for not using REs not mentioned yet is that REs take
some time to learn.  Thus, although most people will know how to use
Python string methods, only a subset of those will be familiar with
REs.  But that doesn't seem like a reason for RE bashing either
since REs are easier to learn than SQL and one frequently sees
recommendations here to use sqlite.


#6946

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-06-03 14:25 +0000
Message-ID	<4de8eef1$0$29996$c3e8da3$5496439d@news.astraweb.com>
In reply to	#6940

On Fri, 03 Jun 2011 05:51:18 -0700, rurpy@yahoo.com wrote:

> On 06/02/2011 07:21 AM, Neil Cerutti wrote:

>> > Python's str methods, when they're sufficient, are usually more
>> > efficient.
> 
> Unfortunately, except for the very simplest cases, they are often not
> sufficient.

Maybe so, but the very simplest cases occur very frequently.

> I often find myself changing, for example, a startswith() to
> a RE when I realize that the input can contain mixed case 

Why wouldn't you just normalise the case?

source.lower().startswith(prefix.lower())

Particularly if the two strings are short, this is likely to be much 
faster than a regex.

Admittedly, normalising the case in this fashion is not strictly correct. 
It works well enough for ASCII text, and probably Latin-1, but for 
general Unicode, not so much. But neither will a regex solution. If you 
need to support true case normalisation for arbitrary character sets, 
Python isn't going to be much help for you. But for the rest of us, a 
simple str.lower() or str.upper() might be technically broken but it will 
do the job.
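
For readers on a current Python: str.casefold() was added in Python 3.3, after this thread, and performs Unicode case folding, which covers some cases that lower() misses. A minimal sketch; the helper name startswith_nocase is made up for illustration, and case folding alone still does not normalise combining characters.

```python
def startswith_nocase(source, prefix):
    """Case-insensitive prefix test using Unicode case folding."""
    return source.casefold().startswith(prefix.casefold())

# lower() leaves the German sharp s alone, so this comparison fails ...
print("Straße".lower().startswith("STRASS".lower()))
# ... while casefold() maps it to "ss" and the prefix is found.
print(startswith_nocase("Straße", "STRASS"))
```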

> or that I have
> to treat commas as well as spaces as delimiters.

source.replace(",", " ").split(" ")

[steve@sylar ~]$ python -m timeit -s "source = 'a b c,d,e,f,g h i j k'" "source.replace(',', ' ').split(' ')"
100000 loops, best of 3: 2.69 usec per loop

[steve@sylar ~]$ python -m timeit -s "source = 'a b c,d,e,f,g h i j k'" -s "import re" "re.split(',| ', source)"
100000 loops, best of 3: 11.8 usec per loop

re.split is about four times slower than the simple solution.
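
The same comparison can be repeated from a script with the timeit module. This is only a sketch: the function names are illustrative, and the absolute numbers will differ by machine and Python version, so none are shown here.

```python
import re
import timeit

source = "a b c,d,e,f,g h i j k"

def with_str_methods(s):
    # Turn commas into spaces, then split on single spaces.
    return s.replace(",", " ").split(" ")

def with_regex(s):
    # Split on either a comma or a space.
    return re.split(",| ", s)

# Both approaches agree on this particular input.
assert with_str_methods(source) == with_regex(source)

# Time 100000 calls of each; results depend on the environment.
for func in (with_str_methods, with_regex):
    seconds = timeit.timeit(lambda: func(source), number=100_000)
    print(f"{func.__name__}: {seconds:.3f} s per 100000 calls")
```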

> After doing this a
> number of times, one starts to use an RE right from the get go unless
> one is VERY sure that there will be no requirements creep.

YAGNI.

There's no need to use a regex just because you think that you *might*, 
someday, possibly need a regex. That's just silly. If and when 
requirements change, then use a regex. Until then, write the simplest 
code that will solve the problem you have to solve now, not the problem 
you think you might have to solve later.

> And to regurgitate the mantra frequently used to defend Python when it
> is criticized for being slow, the real question should be, are REs fast
> enough?  The answer almost always is yes.

Well, perhaps so.

[...]
> In short, although your observations are true to some extent, they
> are not sufficient to justify the anti-RE attitude often seen here.

I don't think that there's really an *anti* RE attitude here. It's more a 
skeptical, cautious attitude to them, as a reaction to the Perl "when all 
you have is a hammer, everything looks like a nail" love affair with 
regexes.

There are a few problems with regexes:

- they are another language to learn, a very cryptic and terse language;
- hence code using many regexes tends to be obfuscated and brittle;
- they're over-kill for many simple tasks;
- and underpowered for complex jobs, and even some simple ones;
- debugging regexes is a nightmare;
- they're relatively slow;
- and thanks in part to Perl's over-reliance on them, there's a tendency 
among many coders (especially those coming from Perl) to abuse and/or 
misuse regexes; people react to that misuse by treating any use of 
regexes with suspicion.

But they have their role to play as a tool in the programmer's toolbox.

Regarding their syntax, I'd like to point out that even Larry Wall is 
dissatisfied with regex culture in the Perl community:

http://www.perl.com/pub/2002/06/04/apo5.html

-- 
Steven


#6948

From	"D'Arcy J.M. Cain" <darcy@druid.net>
Date	2011-06-03 10:58 -0400
Message-ID	<mailman.2430.1307113146.9059.python-list@python.org>
In reply to	#6946

On 03 Jun 2011 14:25:53 GMT
Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> source.replace(",", " ").split(" ")

I would do:

  source.replace(",", " ").split()

> [steve@sylar ~]$ python -m timeit -s "source = 'a b c,d,e,f,g h i j k'" 

What if the string is 'a b c, d, e,f,g h i j k'?

>>> source.replace(",", " ").split(" ")
['a', 'b', 'c', '', 'd', '', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
>>> source.replace(",", " ").split()
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']

Of course, it may be that the former is what you want but I think that
the latter would be more common.

> There's no need to use a regex just because you think that you *might*, 
> someday, possibly need a regex. That's just silly. If and when 
> requirements change, then use a regex. Until then, write the simplest 
> code that will solve the problem you have to solve now, not the problem 
> you think you might have to solve later.

I'm not sure if this should be rule #1 for programmers but it
definitely needs to be one of the very low numbers.  Trying to guess
the client's future requests is always a losing game.

-- 
D'Arcy J.M. Cain <darcy@druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.


#6963

From	"rurpy@yahoo.com" <rurpy@yahoo.com>
Date	2011-06-03 12:29 -0700
Message-ID	<1237a287-10b0-4a2d-ba35-97b5238deda1@n11g2000yqf.googlegroups.com>
In reply to	#6946

On 06/03/2011 08:25 AM, Steven D'Aprano wrote:
> On Fri, 03 Jun 2011 05:51:18 -0700, rurpy@yahoo.com wrote:
>
>> On 06/02/2011 07:21 AM, Neil Cerutti wrote:
>
>>> > Python's str methods, when they're sufficient, are usually more
>>> > efficient.
>>
>> Unfortunately, except for the very simplest cases, they are often not
>> sufficient.
>
> Maybe so, but the very simplest cases occur very frequently.

Right, and I stated that.

>> I often find myself changing, for example, a startswith() to
>> a RE when I realize that the input can contain mixed case
>
> Why wouldn't you just normalise the case?

Because some of the text may be case-sensitive.
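
One possible reading of that objection, sketched with a made-up input format ("key=Value"; neither the format nor the names come from the thread): only the keyword should match regardless of case, while the rest of the text stays case-sensitive. The scoped inline flag (?i:...) used here needs Python 3.6 or later.

```python
import re

# The keyword matches in any case; the value after "=" must match exactly.
pattern = re.compile(r"(?i:key)=Value")

print(bool(pattern.match("KEY=Value")))   # keyword case differs: still matches
print(bool(pattern.match("key=value")))   # value case differs: no match
```

Lower-casing the whole line first would make both inputs look the same, which is the situation described above.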

>[...]
>> or that I have
>> to treat commas as well as spaces as delimiters.
>
> source.replace(",", " ").split(" ")

Uhgg. create a whole new string just so you can split it on
one rather than two characters?  Sorry, but I find

    re.split ('[ ,]', source)

states much more clearly exactly what is being done with no
obfuscation.  Obviously this is a simple enough case that the
difference is minor but when the pattern gets only a little
more complex, the clarity difference becomes greater.

>[...]
> re.split is about four times slower than the simple solution.

If this processing is a bottleneck, by all means use a more
complex hard-coded replacement for a regex.  In most cases
that won't be necessary.

>> After doing this a
>> number of times, one starts to use an RE right from the get go unless
>> one is VERY sure that there will be no requirements creep.
>
> YAGNI.

IAHNI. (I actually have needed it.)

> There's no need to use a regex just because you think that you *might*,
> someday, possibly need a regex. That's just silly. If and when
> requirements change, then use a regex. Until then, write the simplest
> code that will solve the problem you have to solve now, not the problem
> you think you might have to solve later.

I would not recommend you use a regex instead of a string method
solely because you might need a regex later.  But when you have
to spend 10 minutes writing a half-dozen lines of python versus
1 minute writing a regex, your evaluation of the possibility of
requirements changing should factor into your decision.

> [...]
>> In short, although your observations are true to some extent, they
>> are not sufficient to justify the anti-RE attitude often seen here.
>
> I don't think that there's really an *anti* RE attitude here. It's more a
> skeptical, cautious attitude to them, as a reaction to the Perl "when all
> you have is a hammer, everything looks like a nail" love affair with
> regexes.

Yes, as I said, the regex attitude here seems in large part to
be a reaction to their frequent use in Perl.  It seems anti- to
me in that I often see cautions about their use but seldom see
anyone pointing out that they are often a better solution than
a mass of twisty little string methods and associated plumbing.

> There are a few problems with regexes:
>
> - they are another language to learn, a very cryptic and terse language;

Chinese is cryptic too but there are a few billion people who
don't seem to be bothered by that.

> - hence code using many regexes tends to be obfuscated and brittle;

No.  With regexes the code is likely to be less brittle than
a dozen or more lines of mixed string functions, indexes, and
conditionals.

> - they're over-kill for many simple tasks;
> - and underpowered for complex jobs, and even some simple ones;

Right, like all tools (including Python itself) they are suited
best for a specific range of problems.  That range is quite wide.

> - debugging regexes is a nightmare;

Very complex ones, perhaps.  "Nightmare" seems an overstatement.

> - they're relatively slow;

So is Python.  In both cases, if it is a bottleneck then
choosing another tool is appropriate.

> - and thanks in part to Perl's over-reliance on them, there's a tendency
> among many coders (especially those coming from Perl) to abuse and/or
> misuse regexes; people react to that misuse by treating any use of
> regexes with suspicion.

So you claim.  I have seen more postings in here where
REs were not used when they would have simplified the code,
than I have seen regexes used when a string method or two
would have done the same thing.

> But they have their role to play as a tool in the programmer's toolbox.

We agree.

> Regarding their syntax, I'd like to point out that even Larry Wall is
> dissatisfied with regex culture in the Perl community:
>
> http://www.perl.com/pub/2002/06/04/apo5.html

You did see the very first sentence in this, right?

  "Editor's Note: this Apocalypse is out of date and remains here
  for historic reasons. See Synopsis 05 for the latest information."

(Note that "Apocalypse" is referring to a series of Perl design
documents and has nothing to do with regexes in particular.)

Synopsis 05 is (AFAICT with a quick scan) a proposal for revising
regex syntax.  I didn't see anything about de-emphasizing them in
Perl.  (But I have no idea what is going on for Perl 6 so I could
be wrong about that.)

As for the original reference, Wall points out a number of
problems with regexes, mostly details of their syntax.  For
example that more frequently used non-capturing groups require
more characters than less-frequently used capturing groups.
Most of these criticisms seem irrelevant to the question of
whether hard-wired string manipulation code or regexes should
be preferred in a Python program.

And for the few criticisms that are relevant, nobody ever said
regexes were perfect.  The problems are well known, especially on
this list where we've all been told about them a million times.

The fact that REs are not perfect does not make them not useful.
We also know about Python's problems (slow, the GIL, excessively
terse and poorly organized documentation, etc) but that hardly
makes Python not useful.

Finally he is talking about *revising* regex syntax (in part by
replacing some magic character sequences with other "better" ones)
beyond the core CS textbook forms.  He was *not* AFAICT advocating
using hard-wired string manipulation code in place of regexes.
So it is hardly a condemnation of the concept of regexes, rather
just the opposite.

Perhaps you stopped reading after seeing his "regular expression
culture is a mess" comment without trying to see what he meant
by "culture" or "mess"?


#6971

From	Neil Cerutti <neilc@norwich.edu>
Date	2011-06-03 20:49 +0000
Message-ID	<94svm4Fe7eU1@mid.individual.net>
In reply to	#6963

On 2011-06-03, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
>>> or that I have to treat commas as well as spaces as
>>> delimiters.
>>
>> source.replace(",", " ").split(" ")
>
> Uhgg. create a whole new string just so you can split it on one
> rather than two characters?  Sorry, but I find
>
>     re.split ('[ ,]', source)

It's quibbling to complain about creating one more string in an
operation that already creates N strings.

Here's another alternative:

list(itertools.chain.from_iterable(elem.split(" ") 
  for elem in source.split(",")))

It's weird looking, but delimiting text with two different
delimiters is weird.
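
For anyone who wants to run that alternative, here is a self-contained version with the import it needs, checked against the other two approaches from this thread. The sample string is the one from the earlier timing test.

```python
import itertools
import re

source = "a b c,d,e,f,g h i j k"

# Split on commas first, then split each piece on spaces and flatten.
chained = list(itertools.chain.from_iterable(
    elem.split(" ") for elem in source.split(",")))

# All three approaches agree on this input.
assert chained == re.split("[ ,]", source)
assert chained == source.replace(",", " ").split(" ")
print(chained)
```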

> states much more clearly exactly what is being done with no
> obfuscation.  Obviously this is a simple enough case that the
> difference is minor but when the pattern gets only a little
> more complex, the clarity difference becomes greater.

re.split is a nice improvement over str.split. I use it often.
It's a happy place in the re module. Using a single capture group
it can perhaps also be used for applications of str.partition.
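
A small sketch of that idea, using made-up sample strings: with a capture group the separator is kept in the result, and maxsplit=1 stops after the first separator, which mirrors str.partition.

```python
import re

line = "name: value: more"

# The capture group keeps the separator; maxsplit=1 stops after the first one.
print(re.split(r"(:)", line, maxsplit=1))
print(list(line.partition(":")))

# Unlike str.partition, the separator can be a pattern.
print(re.split(r"([:=])", "key=a:b", maxsplit=1))

# When nothing matches, the shapes differ: a one-element list versus
# a 3-tuple padded with empty strings.
print(re.split(r"(:)", "no separator", maxsplit=1))
print("no separator".partition(":"))
```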

> I would not recommend you use a regex instead of a string
> method solely because you might need a regex later.  But when
> you have to spend 10 minutes writing a half-dozen lines of
> python versus 1 minute writing a regex, your evaluation of the
> possibility of requirements changing should factor into your
> decision.

Most of the simplest and clearest applications of the re module
are simply not necessary at all. If I'm inspecting a string with
what amounts to more than a couple of lines of basic Python, then
I break out the re module.

Of course often times that means I've got a context sensitive
parsing job on my hands, and I have to put it away again. ;)

> Yes, as I said, the regex attitude here seems in large part to
> be a reaction to their frequent use in Perl.  It seems anti- to
> me in that I often see cautions about their use but seldom see
> anyone pointing out that they are often a better solution than
> a mass of twisty little string methods and associated plumbing.

That doesn't seem to apply to the problem that prompted your
complaint, at least.

>> There are a few problems with regexes:
>>
>> - they are another language to learn, a very cryptic and terse
>> language;
>
> Chinese is cryptic too but there are a few billion people who
> don't seem to be bothered by that.

Chinese *would* be a problem if you proposed it as the solution
to a problem that could be solved by using a person's native
tongue instead.

>> - hence code using many regexes tends to be obfuscated and
>> brittle;
>
> No.  With regexes the code is likely to be less brittle than a
> dozen or more lines of mixed string functions, indexes, and
> conditionals.

That is the opposite of my experience, but YMMV.

>> - they're over-kill for many simple tasks;
>> - and underpowered for complex jobs, and even some simple ones;
>
> Right, like all tools (including Python itself) they are suited
> best for a specific range of problems.  That range is quite
> wide.
>
>> - debugging regexes is a nightmare;
>
> Very complex ones, perhaps.  "Nightmare" seems an
> overstatement.

I will abandon a re based solution long before the nightmare.

>> - they're relatively slow;
>
> So is Python.  In both cases, if it is a bottleneck then
> choosing another tool is appropriate.

It's not a problem at all until it is.

>> - and thanks in part to Perl's over-reliance on them, there's
>> a tendency among many coders (especially those coming from
>> Perl) to abuse and/or misuse regexes; people react to that
>> misuse by treating any use of regexes with suspicion.
>
> So you claim.  I have seen more postings in here where
> REs were not used when they would have simplified the code,
> than I have seen regexes used when a string method or two
> would have done the same thing.

Can you find an example or invent one? I simply don't remember
such problems coming up, but I admit it's possible.

-- 
Neil Cerutti


#6973

From	Chris Torek <nospam@torek.net>
Date	2011-06-03 21:45 +0000
Message-ID	<isbkl301v7v@news2.newsguy.com>
In reply to	#6971

>On 2011-06-03, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
[prefers]
>>     re.split ('[ ,]', source)

This is probably not what you want in dealing with
human-created text:

    >>> re.split('[ ,]', 'foo bar, spam,maps')
    ['foo', '', 'bar', '', 'spam', 'maps']

Instead, you probably want "a comma followed by zero or
more spaces; or, one or more spaces":

    >>> re.split(r',\s*|\s+', 'foo bar, spam,maps')
    ['foo', 'bar', 'spam', 'maps']

or perhaps (depending on how you want to treat multiple
adjacent commas) even this:

    >>> re.split(r',+\s*|\s+', 'foo bar, spam,maps,, eggs')
    ['foo', 'bar', 'spam', 'maps', 'eggs']

although eventually you might want to just give in and use the
csv module. :-)  (Especially if you want to be able to quote
commas, for instance.)
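
A minimal illustration of the quoting point, with a made-up input line: the csv module keeps a quoted comma inside its field, which a plain split on commas would not.

```python
import csv

line = 'foo,"bar, spam",maps'

# csv.reader accepts any iterable of lines; here a one-element list.
fields = next(csv.reader([line]))
print(fields)

# A plain split breaks the quoted field in two and keeps the quotes.
print(line.split(","))
```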

>> ...  With regexes the code is likely to be less brittle than a
>> dozen or more lines of mixed string functions, indexes, and
>> conditionals.

In article <94svm4Fe7eU1@mid.individual.net>
Neil Cerutti  <neilc@norwich.edu> wrote:
[lots of snippage]
>That is the opposite of my experience, but YMMV.

I suspect it depends on how familiar the user is with regular
expressions, their abilities, and their limitations.

People relatively new to REs always seem to want to use them
to count (to balance parentheses, for instance).  People who
have gone through the compiler course know better. :-)
-- 
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
email: gmail (figure it out)      http://web.torek.net/torek/index.html


#6975

From	Ethan Furman <ethan@stoneleaf.us>
Date	2011-06-03 15:11 -0700
Message-ID	<mailman.2440.1307138328.9059.python-list@python.org>
In reply to	#6973

Chris Torek wrote:
>> On 2011-06-03, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
> [prefers]
>>>     re.split ('[ ,]', source)
> 
> This is probably not what you want in dealing with
> human-created text:
> 
>     >>> re.split('[ ,]', 'foo bar, spam,maps')
>     ['foo', '', 'bar', '', 'spam', 'maps']

I think you've got a typo in there... this is what I get:

--> re.split('[ ,]', 'foo bar, spam,maps')
['foo', 'bar', '', 'spam', 'maps']

I would add a * to get rid of that empty element, myself:
--> re.split('[ ,]*', 'foo bar, spam,maps')
['foo', 'bar', 'spam', 'maps']

~Ethan~


#6979

From	MRAB <python@mrabarnett.plus.com>
Date	2011-06-03 23:38 +0100
Message-ID	<mailman.2442.1307140744.9059.python-list@python.org>
In reply to	#6973

On 03/06/2011 23:11, Ethan Furman wrote:
> Chris Torek wrote:
>>> On 2011-06-03, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
>> [prefers]
>>>> re.split ('[ ,]', source)
>>
>> This is probably not what you want in dealing with
>> human-created text:
>>
>> >>> re.split('[ ,]', 'foo bar, spam,maps')
>> ['foo', '', 'bar', '', 'spam', 'maps']
>
> I think you've got a typo in there... this is what I get:
>
> --> re.split('[ ,]', 'foo bar, spam,maps')
> ['foo', 'bar', '', 'spam', 'maps']
>
> I would add a * to get rid of that empty element, myself:
> --> re.split('[ ,]*', 'foo bar, spam,maps')
> ['foo', 'bar', 'spam', 'maps']
>
It's better to use + instead of * because you don't want it to be a
zero-width separator. The fact that it works should be treated as an
idiosyncrasy of the current re module, which can't split on a
zero-width match.
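
This caution matters on current Pythons: since Python 3.7, re.split does split on empty matches, so the * form no longer behaves as shown above. A short sketch of the difference, reusing the sample string from this subthread:

```python
import re

text = "foo bar, spam,maps"

# One or more delimiter characters: never matches the empty string.
print(re.split(r"[ ,]+", text))

# Zero or more: on Python 3.7+ this also splits at empty matches,
# so the result has many more pieces than the four words.
print(len(re.split(r"[ ,]*", text)))
```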


#7073

From	"rurpy@yahoo.com" <rurpy@yahoo.com>
Date	2011-06-05 22:47 -0700
Message-ID	<cde4af31-6471-4fbf-a6d3-6ddfe8bd18c1@v37g2000yqb.googlegroups.com>
In reply to	#6973

On 06/03/2011 03:45 PM, Chris Torek wrote:
>>On 2011-06-03, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
> [prefers]
>>>     re.split ('[ ,]', source)
>
> This is probably not what you want in dealing with
> human-created text:
>
>     >>> re.split('[ ,]', 'foo bar, spam,maps')
>     ['foo', '', 'bar', '', 'spam', 'maps']
>
> Instead, you probably want "a comma followed by zero or
> more spaces; or, one or more spaces":
>
>     >>> re.split(r',\s*|\s+', 'foo bar, spam,maps')
>     ['foo', 'bar', 'spam', 'maps']
>
> or perhaps (depending on how you want to treat multiple
> adjacent commas) even this:
>
>     >>> re.split(r',+\s*|\s+', 'foo bar, spam,maps,, eggs')
>     ['foo', 'bar', 'spam', 'maps', 'eggs']

Which to me, illustrates nicely the power of a regex to concisely
localize the specification of an input format and adapt easily
to changes in that specification.

> although eventually you might want to just give in and use the
> csv module. :-)  (Especially if you want to be able to quote
> commas, for instance.)

Which internally uses regexes, at least for the sniffer function.
(The main parser is in C presumably for speed, this being a
library module and all.)

>>> ...  With regexes the code is likely to be less brittle than a
>>> dozen or more lines of mixed string functions, indexes, and
>>> conditionals.
>
> In article <94svm4Fe7eU1@mid.individual.net>
> Neil Cerutti  <neilc@norwich.edu> wrote:
> [lots of snippage]
>>That is the opposite of my experience, but YMMV.
>
> I suspect it depends on how familiar the user is with regular
> expressions, their abilities, and their limitations.

I suspect so too at least in part.

> People relatively new to REs always seem to want to use them
> to count (to balance parentheses, for instance).  People who
> have gone through the compiler course know better. :-)

But also, a thing I think sometimes gets forgotten, is if the
max nesting depth is finite, parens can be balanced with a
regex.  This is nice for the particularly common case of a
nest depth of 1 (balanced but non-nested parens.)
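
A sketch of the depth-1 case, with a counter-based check for arbitrary nesting shown for comparison. The pattern and function names are illustrative, not taken from any post in this thread; re.fullmatch needs Python 3.4 or later.

```python
import re

# Balanced, non-nested parentheses: runs of non-paren text separated by
# "(...)" groups that themselves contain no parentheses.
DEPTH_ONE = re.compile(r"[^()]*(?:\([^()]*\)[^()]*)*")

def balanced_depth_one(text):
    """True if parens are balanced and never nested."""
    return DEPTH_ONE.fullmatch(text) is not None

def balanced_any_depth(text):
    """Counter-based check for arbitrary nesting, for comparison."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a ")" with no matching "("
                return False
    return depth == 0

print(balanced_depth_one("f(a) + g(b)"))   # balanced, not nested
print(balanced_depth_one("f(a(b))"))       # nested: rejected by the regex
print(balanced_any_depth("f(a(b))"))       # the counter handles nesting
```

Supporting a larger fixed depth means writing the pattern out once per level, so the regex grows with the depth while the counter stays the same.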


#7072

From	"rurpy@yahoo.com" <rurpy@yahoo.com>
Date	2011-06-05 22:44 -0700
Message-ID	<65164054-f11d-4f8e-a141-31513e70ca04@h9g2000yqk.googlegroups.com>
In reply to	#6971

On 06/03/2011 02:49 PM, Neil Cerutti wrote:
> On 2011-06-03, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
>>>> or that I have to treat commas as well as spaces as
>>>> delimiters.
>>>
>>> source.replace(",", " ").split(" ")
>>
>> Uhgg. create a whole new string just so you can split it on one
>> rather than two characters?  Sorry, but I find
>>
>>     re.split ('[ ,]', source)
>
> It's quibbling to complain about creating one more string in an
> operation that already creates N strings.

It's not the time it takes to create the string, it's the doing
of things that aren't really needed to accomplish the task:
The re.split says directly and with no extraneous actions,
"split 'source' on either spaces or commas".  This of course
is a trivial example but used thoughtfully, REs allow you to
be very precise about what you are doing, versus using "tricks"
like substituting individual characters first so you can split
on a single character afterwards.

> > Here's another alternative:
> >
> > list(itertools.chain.from_iterable(elem.split(" ")
> >   for elem in source.split(",")))

You seriously find that clearer than re.split('[ ,]') above?
I have no further comment. :-)
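
For anyone who wants to compare the three spellings side by side, a small sketch follows. The sample string is invented for illustration; all three calls are the ones quoted above, plus one variant with a "+" added to show how the regex adapts when runs of delimiters should collapse.

```python
import itertools
import re

source = "foo bar,spam, maps"   # made-up sample input

via_replace = source.replace(",", " ").split(" ")
via_regex = re.split("[ ,]", source)
via_chain = list(itertools.chain.from_iterable(
    elem.split(" ") for elem in source.split(",")))

# All three treat ", " as two separate delimiters, so each result
# contains an empty string between "spam" and "maps".
print(via_replace)
print(via_regex)
print(via_chain)

# Collapsing runs of delimiters is a one-character change to the regex.
print(re.split("[ ,]+", source))
```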

> > It's weird looking, but delimiting text with two different
> > delimiters is weird.

Perhaps, but real-world input data is often very weird.
Try parsing a text "database" of a circa 1980 telephone
company phone directory sometime. :-)

> >[...]
>>> >>> - they are another language to learn, a very cryptic and terse
>>> >>> language;
>> >>
>> >> Chinese is cryptic too but there are a few billion people who
>> >> don't seem to be bothered by that.
> >
> > Chinese *would* be a problem if you proposed it as the solution
> > to a problem that could be solved by using a person's native
> > tongue instead.

My point was that "cryptic" is in large part an inverse function
of knowledge.  If I always go out of my way to avoid regexes, then
likely I will never become comfortable with them and they will
always seem cryptic.  To someone who uses them more often, they
will seem less cryptic.  They may never have the clarity of Python
but neither is Python code a very clear way to describe text
patterns.

As for needing to learn them (S D'A comment), shrug.  Programmers
are expected to learn new things all the time, many even do so
for fun.  REs (practical use that is) in the grand scheme of things
are not that hard.

They are I think a lot easier to learn than SQL, yet it is common
here to see recommendations to use sqlite rather than an ad-hoc
concoction of Python dicts.

> >[...]
>>> >>> - and thanks in part to Perl's over-reliance on them, there's
>>> >>> a tendency among many coders (especially those coming from
>>> >>> Perl) to abuse and/or misuse regexes; people react to that
>>> >>> misuse by treating any use of regexes with suspicion.
>> >>
>> >> So you claim.  I have seen more postings in here where
>> >> REs were not used when they would have simplified the code,
>> >> than I have seen regexes used when a string method or two
>> >> would have done the same thing.
> >
> > Can you find an example or invent one? I simply don't remember
> > such problems coming up, but I admit it's possible.

Sure, the response to the OP of this thread.

#7090

From	Neil Cerutti <neilc@norwich.edu>
Date	2011-06-06 16:08 +0000
Message-ID	<954cb5F5qbU1@mid.individual.net>
In reply to	#7072

On 2011-06-06, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
> On 06/03/2011 02:49 PM, Neil Cerutti wrote:
> Can you find an example or invent one? I simply don't remember
> such problems coming up, but I admit it's possible.
>
> Sure, the response to the OP of this thread.

Here's a recap, along with two candidate solutions, one based on
your recommendation, and one using str functions and slicing. 

(I fixed a specification problem in your original regex, as one
of the lines of data contained a space after the closing ',
making the $ inappropriate)

data.txt:
//ACCDJ         EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
//         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCDJ       '
//ACCT          EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
//         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCT        '
//ACCUM         EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
//         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM       '
//ACCUM1        EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
//         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM1      ' 
^Z

import re

print("re solution")
with open("data.txt") as f:
    for line in f:
        fixed = re.sub(r"(TABLE='\S+)\s+'", r"\1'", line)
        print(fixed, end='')

print("non-re solution")
with open("data.txt") as f:
    for line in f:
        i = line.find("TABLE='")
        if i != -1:
            begin = line.index("'", i) + 1
            end = line.index("'", begin)
            field = line[begin: end].rstrip()
            print(line[:i] + line[i:begin] + field + line[end:], end='')
        else:
            print(line, end='')

These two solutions print identical output processing the sample
data. Slight changes in the data would reveal divergence in the
assumptions each solution made.
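
As a rough illustration of that divergence, the sketch below wraps both approaches in functions (the function names and the second and third sample lines are invented here, not taken from the original data) and feeds them a name containing a space and an all-blank field.

```python
import re

def strip_with_re(line):
    # Same substitution as the re solution above.
    return re.sub(r"(TABLE='\S+)\s+'", r"\1'", line)

def strip_with_str(line):
    # Same logic as the non-re solution above, returning instead of printing.
    i = line.find("TABLE='")
    if i == -1:
        return line
    begin = line.index("'", i) + 1
    end = line.index("'", begin)
    return line[:begin] + line[begin:end].rstrip() + line[end:]

samples = [
    "DATABAS=MBQV1D0A,TABLE='ACCDJ       '\n",   # as in data.txt
    "DATABAS=MBQV1D0A,TABLE='AC CDJ      '\n",   # hypothetical: space inside the name
    "DATABAS=MBQV1D0A,TABLE='            '\n",   # hypothetical: blank field
]
for sample in samples:
    print(repr(strip_with_re(sample)), repr(strip_with_str(sample)))
```

The regex assumes the name is a single run of non-whitespace, so it leaves the second and third lines alone; the string version strips whatever sits between the quotes.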

I agree with you that this is a very tempting candidate for
re.sub, and it probably would have been my first try as well.

-- 
Neil Cerutti

#7093

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2011-06-06 10:29 -0600
Message-ID	<mailman.2492.1307377787.9059.python-list@python.org>
In reply to	#7090

On Mon, Jun 6, 2011 at 10:08 AM, Neil Cerutti <neilc@norwich.edu> wrote:
> import re
>
> print("re solution")
> with open("data.txt") as f:
>    for line in f:
>        fixed = re.sub(r"(TABLE='\S+)\s+'", r"\1'", line)
>        print(fixed, end='')
>
> print("non-re solution")
> with open("data.txt") as f:
>    for line in f:
>        i = line.find("TABLE='")
>        if i != -1:
>            begin = line.index("'", i) + 1
>            end = line.index("'", begin)
>            field = line[begin: end].rstrip()
>            print(line[:i] + line[i:begin] + field + line[end:], end='')
>        else:
>            print(line, end='')

print("non-re solution")
with open("data.txt") as f:
    for line in f:
        try:
            start = line.index("TABLE='") + 7
            end = line.index("'", start)
        except ValueError:
            pass
        else:
            line = line[:start] + line[start:end].rstrip() + line[end:]
        print(line, end='')

#7096

From	Neil Cerutti <neilc@norwich.edu>
Date	2011-06-06 17:17 +0000
Message-ID	<954gd8Fh73U1@mid.individual.net>
In reply to	#7093

On 2011-06-06, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Mon, Jun 6, 2011 at 10:08 AM, Neil Cerutti <neilc@norwich.edu> wrote:
>> import re
>>
>> print("re solution")
>> with open("data.txt") as f:
>>    for line in f:
>>        fixed = re.sub(r"(TABLE='\S+)\s+'", r"\1'", line)
>>        print(fixed, end='')
>>
>> print("non-re solution")
>> with open("data.txt") as f:
>>    for line in f:
>>        i = line.find("TABLE='")
>>        if i != -1:
>>            begin = line.index("'", i) + 1
>>            end = line.index("'", begin)
>>            field = line[begin: end].rstrip()
>>            print(line[:i] + line[i:begin] + field + line[end:], end='')
>>        else:
>>            print(line, end='')
>
> print("non-re solution")
> with open("data.txt") as f:
>     for line in f:
>         try:
>             start = line.index("TABLE='") + 7

I wrestled with using addition like that, and decided against it.
The 7 is a magic number and repeats/hides information. I wanted
something like:

   prefix = "TABLE='"
   start = line.index(prefix) + len(prefix)

But I decided that searching for the opening ' was a bit better.

-- 
Neil Cerutti

#7099

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2011-06-06 11:40 -0600
Message-ID	<mailman.2496.1307382082.9059.python-list@python.org>
In reply to	#7096

On Mon, Jun 6, 2011 at 11:17 AM, Neil Cerutti <neilc@norwich.edu> wrote:
> I wrestled with using addition like that, and decided against it.
> The 7 is a magic number and repeats/hides information. I wanted
> something like:
>
>   prefix = "TABLE='"
>   start = line.index(prefix) + len(prefix)
>
> But I decided that searching for the opening ' was a bit better.

Fair enough, although if you ask me the + 1 is just as magical as the
+ 7 (it's still the length of the string that you're searching for).
Also, re-finding the opening ' still repeats information.

The main thing I wanted to fix was that the second .index() call had
the possibility of raising an unhandled ValueError.  There are really
two things we have to search for in the line, either of which could be
missing, and catching them both with the same except: clause feels
better to me than checking both of them for -1.
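
Putting the two ideas together (len(prefix) instead of a literal 7, and one except clause covering both searches) might look like the sketch below. The function name strip_table_field and the PREFIX constant are invented for illustration.

```python
PREFIX = "TABLE='"   # hypothetical constant; avoids the magic number 7

def strip_table_field(line, prefix=PREFIX):
    """Strip trailing whitespace inside the quoted TABLE value, if any."""
    try:
        start = line.index(prefix) + len(prefix)
        end = line.index("'", start)      # closing quote
    except ValueError:
        # Either the prefix or the closing quote is missing:
        # leave the line untouched.
        return line
    return line[:start] + line[start:end].rstrip() + line[end:]

print(repr(strip_table_field("DATABAS=MBQV1D0A,TABLE='ACCT        '\n")))
print(repr(strip_table_field("//ACCT          EXEC DB2UNLDC,\n")))
```

Whether a missing closing quote should pass silently, as here, or raise is a judgment call about how much the input is trusted.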

Cheers,
Ian

#7103

From	Neil Cerutti <neilc@norwich.edu>
Date	2011-06-06 17:56 +0000
Message-ID	<954imtF432U1@mid.individual.net>
In reply to	#7099

On 2011-06-06, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> Fair enough, although if you ask me the + 1 is just as magical
> as the + 7 (it's still the length of the string that you're
> searching for). Also, re-finding the opening ' still repeats
> information.

Heh, true. It doesn't really repeat information, though, as in my
version there could be intervening garbage after the TABLE=,
which probably isn't desirable.

> The main thing I wanted to fix was that the second .index()
> call had the possibility of raising an unhandled ValueError.
> There are really two things we have to search for in the line,
> either of which could be missing, and catching them both with
> the same except: clause feels better to me than checking both
> of them for -1.

I thought an unhandled ValueError was a good idea in that case. I
knew that TABLE= may not exist, but I assumed if it did, that the
quotes are supposed to be there.

-- 
Neil Cerutti

#7098

From	Ethan Furman <ethan@stoneleaf.us>
Date	2011-06-06 10:48 -0700
Message-ID	<mailman.2495.1307381710.9059.python-list@python.org>
In reply to	#7090

Ian Kelly wrote:
> On Mon, Jun 6, 2011 at 10:08 AM, Neil Cerutti <neilc@norwich.edu> wrote:
>> import re
>>
>> print("re solution")
>> with open("data.txt") as f:
>>    for line in f:
>>        fixed = re.sub(r"(TABLE='\S+)\s+'", r"\1'", line)
>>        print(fixed, end='')
>>
>> print("non-re solution")
>> with open("data.txt") as f:
>>    for line in f:
>>        i = line.find("TABLE='")
>>        if i != -1:
>>            begin = line.index("'", i) + 1
>>            end = line.index("'", begin)
>>            field = line[begin: end].rstrip()
>>            print(line[:i] + line[i:begin] + field + line[end:], end='')
>>        else:
>>            print(line, end='')
> 
> print("non-re solution")
> with open("data.txt") as f:
>     for line in f:
>         try:
>             start = line.index("TABLE='") + 7
>             end = line.index("'", start)
>         except ValueError:
>             pass
>         else:
>             line = line[:start] + line[start:end].rstrip() + line[end:]
>         print(line, end='')

I like the readability of this version, but isn't generating an 
exception on every other line going to kill performance?

~Ethan~

#7101

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2011-06-06 11:42 -0600
Message-ID	<mailman.2498.1307382178.9059.python-list@python.org>
In reply to	#7090

On Mon, Jun 6, 2011 at 11:48 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
> I like the readability of this version, but isn't generating an exception on
> every other line going to kill performance?

I timed it on the example data before I posted and found that it was
still 10 times as fast as the regex version.  I didn't time the
version without the exceptions.
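
For anyone who wants to repeat the measurement, a sketch along these lines should do. It is not the script used for the figure above; the helper names and the repeat count are arbitrary, the two sample lines come from data.txt, and the ratio will vary with machine and Python version.

```python
import re
import timeit

LINES = [
    "//ACCDJ         EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,\n",
    "//         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCDJ       '\n",
]

def with_re(line):
    return re.sub(r"(TABLE='\S+)\s+'", r"\1'", line)

def with_index(line):
    # The first sample line raises ValueError here, the second does not.
    try:
        start = line.index("TABLE='") + 7
        end = line.index("'", start)
    except ValueError:
        return line
    return line[:start] + line[start:end].rstrip() + line[end:]

def time_it(func, number=10000):
    # Total seconds for `number` passes over the sample lines.
    return timeit.timeit(lambda: [func(line) for line in LINES], number=number)

re_seconds = time_it(with_re)
index_seconds = time_it(with_index)
print("re.sub:    %.4f s" % re_seconds)
print("str.index: %.4f s" % index_seconds)
```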

#6991

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-06-04 02:05 +0000
Message-ID	<4de992d7$0$29996$c3e8da3$5496439d@news.astraweb.com>
In reply to	#6963

On Fri, 03 Jun 2011 12:29:52 -0700, rurpy@yahoo.com wrote:

>>> I often find myself changing, for example, a startswith() to a RE when
>>> I realize that the input can contain mixed case
>>
>> Why wouldn't you just normalise the case?
> 
> Because some of the text may be case-sensitive.

Perhaps you misunderstood me. You don't have to throw away the 
unnormalised text, merely use the normalized text in the expression you 
need.

Of course, if you include both case-sensitive and insensitive tests in 
the same calculation, that's a good candidate for a regex... or at least 
it would be if regexes supported that :)
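
A minimal sketch of what "use the normalized text in the expression" can look like in practice. The helper name starts_with_ci and the header line are made up for illustration.

```python
def starts_with_ci(text, prefix):
    """Case-insensitive startswith; the caller's text is not modified."""
    return text.lower().startswith(prefix.lower())

line = "Content-Type: Text/HTML"     # invented sample input

value = None
if starts_with_ci(line, "content-type:"):
    # The test used a lowered copy; the value keeps its original case.
    value = line.split(":", 1)[1].strip()

print(value)
```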

>>[...]
>>> or that I have
>>> to treat commas as well as spaces as delimiters.
>>
>> source.replace(",", " ").split(" ")
> 
> Uhgg. create a whole new string just so you can split it on one rather
> than two characters?

You say that like it's expensive.

And how do you know what the regex engine is doing under the hood? For all you
know, it could be making hundreds of temporary copies and throwing them 
away. Or something. It's a black box.

The fact that creating a whole new string to split on is faster than 
*running* the regex (never mind compiling it, loading the regex engine, 
and anything else that needs to be done) should tell you which does more 
work. Copying is cheap. Parsing is expensive.

> Sorry, but I find
> 
>     re.split ('[ ,]', source)
> 
> states much more clearly exactly what is being done with no obfuscation.

That's because you know regex syntax. And I'd hardly call the version 
with replace obfuscated.

Certainly the regex is shorter, and I suppose it's reasonable to expect 
any reader to know at least enough regex to read that, so I'll grant you 
that this is a small win for clarity. A micro-optimization for 
readability, at the expense of performance.

> Obviously this is a simple enough case that the difference is minor but
> when the pattern gets only a little more complex, the clarity difference
> becomes greater.

Perhaps. But complicated tasks require complicated regexes, which are 
anything but clear.

[...]
>>> After doing this a
>>> number of times, one starts to use an RE right from the get go unless
>>> one is VERY sure that there will be no requirements creep.
>>
>> YAGNI.
> 
> IAHNI. (I actually have needed it.)

I'm sure you have, and when you need it, it's entirely appropriate to use 
a regex solution. But you stated that you used regexes as insurance *just 
in case* the requirements changed. Why, is your text editor broken? You 
can't change a call to str.startswith(prefix) to re.match(prefix, str) if 
and when you need to? That's what I mean by YAGNI -- don't solve the 
problem you think you might have tomorrow.

>> There's no need to use a regex just because you think that you *might*,
>> someday, possibly need a regex. That's just silly. If and when
>> requirements change, then use a regex. Until then, write the simplest
>> code that will solve the problem you have to solve now, not the problem
>> you think you might have to solve later.
> 
> I would not recommend you use a regex instead of a string method solely
> because you might need a regex later.  But when you have to spend 10
> minutes writing a half-dozen lines of python versus 1 minute writing a
> regex, your evaluation of the possibility of requirements changing
> should factor into your decision.

Ah, but if your requirements are complicated enough that it takes you ten 
minutes and six lines of string method calls, that sounds to me like a 
situation that probably calls for a regex!

Of course it depends on what the code actually does... if it counts the 
number of nested ( ) pairs, and you're trying to do that with a regex, 
you're sacked! *wink*

[...]
>> There are a few problems with regexes:
>>
>> - they are another language to learn, a very cryptic and terse language;
> 
> Chinese is cryptic too but there are a few billion people who don't seem
> to be bothered by that.

Chinese isn't cryptic to the Chinese, because they've learned it from 
childhood. 

But has anyone done any studies comparing reading comprehension speed 
between native Chinese readers and native European readers? For all I 
know, Europeans learn to read twice as quickly as Chinese, and once 
learned, read text twice as fast. Or possibly the other way around. Who 
knows? Not me.

But I do know that English typists typing 26 letters of the alphabet 
leave Asian typists and their thousands of ideograms in the dust. There's 
no comparison -- it's like quicksort vs bubblesort *wink*.

[...]
>> - debugging regexes is a nightmare;
> 
> Very complex ones, perhaps.  "Nightmare" seems an overstatement.

You *can't* debug regexes in Python, since there are no tools for (e.g.) 
single-stepping through the regex, displaying intermediate calculations, 
or anything other than making changes to the regex and running it again, 
hoping that it will do the right thing this time.

I suppose you can use external tools, like Regex Buddy, if you're on a 
supported platform and if they support your language's regex engine.
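
For reference, the standard re module does offer two modest aids, though neither amounts to single-stepping: the re.VERBOSE flag lets a pattern carry whitespace and comments, and the re.DEBUG flag prints the parsed form of a pattern when it is compiled. A sketch, using the TABLE pattern from elsewhere in this thread (the name TABLE_FIELD is invented):

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows "#" comments.
TABLE_FIELD = re.compile(r"""
    (TABLE='\S+)   # group 1: keyword, opening quote, and the name
    \s+            # padding between the name and the closing quote
    '              # closing quote
""", re.VERBOSE)

print(TABLE_FIELD.sub(r"\1'", "DATABAS=MBQV1D0A,TABLE='ACCT        '"))

# re.DEBUG dumps the parsed pattern to stdout at compile time.
re.compile(r"(TABLE='\S+)\s+'", re.DEBUG)
```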

[...]
>> Regarding their syntax, I'd like to point out that even Larry Wall is
>> dissatisfied with regex culture in the Perl community:
>>
>> http://www.perl.com/pub/2002/06/04/apo5.html
> 
> You did see the very first sentence in this, right?
> 
>   "Editor's Note: this Apocalypse is out of date and remains here for
>   historic reasons. See Synopsis 05 for the latest information."

Yes. And did you click through to see the Synopsis? It is a bare 
technical document with all the motivation removed. Since I was pointing 
to Larry Wall's motivation, it was appropriate to link to the Apocalypse 
document, not the Synopsis.

> (Note that "Apocalypse" is referring to a series of Perl design
> documents and has nothing to do with regexes in particular.)

But Apocalypse 5 specifically has everything to do with regexes. That's 
why I linked to that, and not (say) Apocalypse 2.

> Synopsis 05 is (AFAICT with a quick scan) a proposal for revising regex
> syntax.  I didn't see anything about de-emphasizing them in Perl.  (But
> I have no idea what is going on for Perl 6 so I could be wrong about
> that.)

I never said anything about de-emphasizing them. I said that Larry Wall 
was dissatisfied with Perl's culture of regexes -- his own words were:

"regular expression culture is a mess"

and he is also extremely critical of current (i.e. Perl 5) regex syntax. 
Since Python's regex syntax borrows heavily from Perl 5, that's extremely 
pertinent to the issue. When even the champion of regex culture says 
there is much broken about regex culture, we should all listen.

> As for the original reference, Wall points out a number of problems with
> regexes, mostly details of their syntax.  For example that more
> frequently used non-capturing groups require more characters than
> less-frequently used capturing groups. Most of these criticisms seem
> irrelevant to the question of whether hard-wired string manipulation
> code or regexes should be preferred in a Python program.

It is only relevant in so far as the readability and relative obfuscation 
of regex syntax is relevant. No further.

You keep throwing out the term "hard-wired string manipulation", but I 
don't understand what point you're making. I don't understand what you 
see as "hard-wired", or why you think

source.startswith(prefix)

is more hard-wired than

re.match(prefix, source)
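
One practical difference between those two spellings is worth a sketch: re.match treats prefix as a pattern, so the calls agree only when the prefix contains no regex metacharacters or is passed through re.escape first. The sample strings are invented.

```python
import re

source = "axb = 1"
prefix = "a.b"       # contains a regex metacharacter

plain = source.startswith(prefix)                   # literal comparison
as_pattern = re.match(prefix, source) is not None   # "." matches any character
escaped = re.match(re.escape(prefix), source) is not None

print(plain, as_pattern, escaped)
```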

[...]
> Perhaps you stopped reading after seeing his "regular expression culture
> is a mess" comment without trying to see what he meant by "culture" or
> "mess"?

Perhaps you are being over-sensitive and reading *far* too much into what 
I said. If regexes were more readable, as proposed by Wall, that would go 
a long way to reducing my suspicion of them.

-- 
Steven

#6993

From	MRAB <python@mrabarnett.plus.com>
Date	2011-06-04 03:24 +0100
Message-ID	<mailman.2449.1307154367.9059.python-list@python.org>
In reply to	#6991

On 04/06/2011 03:05, Steven D'Aprano wrote:
> On Fri, 03 Jun 2011 12:29:52 -0700, rurpy@yahoo.com wrote:
>
>>>> I often find myself changing, for example, a startswith() to a RE when
>>>> I realize that the input can contain mixed case
>>>
>>> Why wouldn't you just normalise the case?
>>
>> Because some of the text may be case-sensitive.
>
> Perhaps you misunderstood me. You don't have to throw away the
> unnormalised text, merely use the normalized text in the expression you
> need.
>
> Of course, if you include both case-sensitive and insensitive tests in
> the same calculation, that's a good candidate for a regex... or at least
> it would be if regexes supported that :)
>
[snip]
Some regex implementations support scoped case sensitivity. :-)

I have at times thought that it would be useful if .startswith offered
the option of case insensitivity and there were also str.equal which
offered it.
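
As a note for later readers: the standard library re module gained scoped inline flags in Python 3.6, so the sketch below assumes 3.6 or newer and does not apply to the Python versions current when this thread was written. The pattern itself is made up for illustration.

```python
import re

# (?i:...) applies case-insensitivity only inside the group (Python 3.6+).
pattern = re.compile(r"(?i:table)='ACCT'")

print(pattern.match("Table='ACCT'") is not None)   # keyword case is ignored
print(pattern.match("TABLE='acct'") is not None)   # the value is still case-sensitive
```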

#6996

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-06-04 04:59 +0000
Message-ID	<4de9bbc0$0$29996$c3e8da3$5496439d@news.astraweb.com>
In reply to	#6993

On Sat, 04 Jun 2011 03:24:50 +0100, MRAB wrote:

> [snip]
> Some regex implementations support scoped case sensitivity. :-)

Yes, you should link to your regex library :)

Have you considered the suggested Perl 6 syntax? Much of it looks good to 
me.

> I have at times thought that it would be useful if .startswith offered
> the option of case insensitivity and there were also str.equal which
> offered it.

I agree. I wish string methods in general would support a case_sensitive 
flag. I think that's a common enough task to count as an exception to the
rule "don't include function boolean arguments that merely swap between 
two different actions".

-- 
Steven
