Groups > comp.lang.python > #64777 > unrolled thread

re Questions

Started by	Blake Adams <blakesadams@gmail.com>
First post	2014-01-26 08:59 -0800
Last post	2014-01-26 17:30 +0000
Articles	10 — 6 participants

  re Questions Blake Adams <blakesadams@gmail.com> - 2014-01-26 08:59 -0800
    Re: re Questions Larry Martell <larry.martell@gmail.com> - 2014-01-26 10:06 -0700
      Re: re Questions Blake Adams <blakesadams@gmail.com> - 2014-01-26 09:15 -0800
    Re: re Questions Chris Angelico <rosuav@gmail.com> - 2014-01-27 04:08 +1100
      Re: re Questions Roy Smith <roy@panix.com> - 2014-01-26 12:15 -0500
        Re: re Questions Chris Angelico <rosuav@gmail.com> - 2014-01-27 04:25 +1100
        Re: re Questions Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-26 17:39 +0000
        Re: re Questions Tim Chase <python.list@tim.thechases.com> - 2014-01-26 13:41 -0600
      Re: re Questions Blake Adams <blakesadams@gmail.com> - 2014-01-26 09:15 -0800
        Re: re Questions Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-26 17:30 +0000

#64777 — re Questions

From	Blake Adams <blakesadams@gmail.com>
Date	2014-01-26 08:59 -0800
Subject	re Questions
Message-ID	<3f568767-e13a-4c7d-a4fb-85caca2adf6e@googlegroups.com>

I'm pretty new to Python and understand most of the basics of Python re, but I am stumped by some unexpected matching behavior.

If I want to set up a match replicating the '\w' pattern I would assume that would be done with '[A-z0-9_]'.  However, when I run the following:

re.findall('[A-z0-9_]','^;z %C\@0~_') it matches ['^', 'z', 'C', '\\', '0', '_'].  I would expect the match to be ['z', 'C', '0', '_'].

Why does this happen?

Thanks in advance

Blake

#64779

From	Larry Martell <larry.martell@gmail.com>
Date	2014-01-26 10:06 -0700
Message-ID	<mailman.5995.1390756022.18130.python-list@python.org>
In reply to	#64777

On Sun, Jan 26, 2014 at 9:59 AM, Blake Adams <blakesadams@gmail.com> wrote:
> Im pretty new to Python and understand most of the basics of Python re but am stumped by a unexpected matching dynamics.
>
> If I want to set up a match replicating the '\w' pattern I would assume that would be done with '[A-z0-9_]'.  However, when I run the following:
>
> re.findall('[A-z0-9_]','^;z %C\@0~_') it matches ['^', 'z', 'C', '\\', '0', '_'].  I would expect the match to be ['z', 'C', '0', '_'].
>
> Why does this happen?

Because the characters \ ] ^ and _ are between Z and a in the ASCII
character set.

You need to do this:

re.findall('[A-Za-z0-9_]','^;z %C\@0~_')
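
A minimal sketch of the point above, using the test string from the original post. The variable names are illustrative only, and a raw string is assumed so that the backslash is passed through literally:

```python
import re

# Raw string so that \@ stays a literal backslash followed by '@'.
text = r'^;z %C\@0~_'

# The six ASCII characters that sit between 'Z' (90) and 'a' (97).
between = [chr(code) for code in range(ord('Z') + 1, ord('a'))]
print(between)   # ['[', '\\', ']', '^', '_', '`']

# [A-z] spans code points 65..122, so it also picks up '^' and '\'.
wide = re.findall(r'[A-z0-9_]', text)
print(wide)      # ['^', 'z', 'C', '\\', '0', '_']

# Two separate ranges skip the punctuation in between.
narrow = re.findall(r'[A-Za-z0-9_]', text)
print(narrow)    # ['z', 'C', '0', '_']
```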

#64782

From	Blake Adams <blakesadams@gmail.com>
Date	2014-01-26 09:15 -0800
Message-ID	<e319250e-0b5a-4364-9ea1-435d53648016@googlegroups.com>
In reply to	#64779

On Sunday, January 26, 2014 12:06:59 PM UTC-5, Larry....@gmail.com wrote:
> On Sun, Jan 26, 2014 at 9:59 AM, Blake Adams <blakesadams@gmail.com> wrote:
> > Im pretty new to Python and understand most of the basics of Python re but am stumped by a unexpected matching dynamics.
> >
> > If I want to set up a match replicating the '\w' pattern I would assume that would be done with '[A-z0-9_]'.  However, when I run the following:
> >
> > re.findall('[A-z0-9_]','^;z %C\@0~_') it matches ['^', 'z', 'C', '\\', '0', '_'].  I would expect the match to be ['z', 'C', '0', '_'].
> >
> > Why does this happen?
>
> Because the characters \ ] ^ and _ are between Z and a in the ASCII
> character set.
>
> You need to do this:
>
> re.findall('[A-Za-z0-9_]','^;z %C\@0~_')

Got it, that makes sense.  Thanks for the quick reply, Larry.

#64780

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-27 04:08 +1100
Message-ID	<mailman.5996.1390756093.18130.python-list@python.org>
In reply to	#64777

On Mon, Jan 27, 2014 at 3:59 AM, Blake Adams <blakesadams@gmail.com> wrote:
> If I want to set up a match replicating the '\w' pattern I would assume that would be done with '[A-z0-9_]'.  However, when I run the following:
>
> re.findall('[A-z0-9_]','^;z %C\@0~_') it matches ['^', 'z', 'C', '\\', '0', '_'].  I would expect the match to be ['z', 'C', '0', '_'].
>
> Why does this happen?

Because \w is not the same as [A-z0-9_]. Quoting from the docs:

"""
\w For Unicode (str) patterns: Matches Unicode word characters; this
includes most characters that can be part of a word in any language,
as well as numbers and the underscore. If the ASCII flag is used, only
[a-zA-Z0-9_] is matched (but the flag affects the entire regular
expression, so in such cases using an explicit [a-zA-Z0-9_] may be a
better choice).

For 8-bit (bytes) patterns: Matches characters considered alphanumeric
in the ASCII character set; this is equivalent to [a-zA-Z0-9_].
"""

If you're working with a byte string, then you're close, but A-z is
quite different from A-Za-z. The set [A-z] is equivalent to
[ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz] (that's
a literal backslash in there, btw), so it'll also catch several
non-alphabetic characters. With a Unicode string, it's quite
distinctly different. Either way, \w means "word characters", though,
so just go ahead and use it whenever you want word characters :)
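
As a rough illustration of the documentation quoted above, the following sketch compares \w on a str pattern, on a str pattern with the ASCII flag, and on a bytes pattern. The sample string is an assumption chosen only because it contains one non-ASCII letter:

```python
import re

sample = 'é1_'   # one non-ASCII letter, one digit, one underscore

# str pattern: \w is Unicode-aware, so the accented letter counts as a word character.
unicode_hits = re.findall(r'\w', sample)
print(unicode_hits)   # ['é', '1', '_']

# str pattern with re.ASCII: \w shrinks to [a-zA-Z0-9_].
ascii_hits = re.findall(r'\w', sample, re.ASCII)
print(ascii_hits)     # ['1', '_']

# bytes pattern: only ASCII alphanumerics and '_' match; the two UTF-8
# bytes that encode 'é' are not word characters.
bytes_hits = re.findall(rb'\w', sample.encode('utf-8'))
print(bytes_hits)     # [b'1', b'_']
```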

ChrisA

#64781

From	Roy Smith <roy@panix.com>
Date	2014-01-26 12:15 -0500
Message-ID	<roy-B66F25.12151426012014@news.panix.com>
In reply to	#64780

In article <mailman.5996.1390756093.18130.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> The set [A-z] is equivalent to
> [ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz]

I'm inclined to suggest the regex compiler should issue a warning for 
this.

I've never seen a character range other than A-Z, a-z, or 0-9.  Well, I 
suppose A-F or a-f if you're trying to match hex digits (and some 
variations on that for octal).  But, I can't imagine any example where 
somebody wrote A-z and it wasn't an error.
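
For what it is worth, a check along these lines could be approximated outside the regex compiler. The sketch below is a hypothetical helper, not anything the re module provides; it naively scans the pattern text for letter ranges whose endpoints differ in case and does not verify that they sit inside a character class:

```python
import re

# Hypothetical lint helper (not part of the re module).
# Finds "x-y" where both endpoints are ASCII letters.
_LETTER_RANGE = re.compile(r'([A-Za-z])-([A-Za-z])')

def suspicious_ranges(pattern):
    """Return ranges such as 'A-z' whose endpoints mix upper and lower case."""
    found = []
    for start, end in _LETTER_RANGE.findall(pattern):
        if start.isupper() != end.isupper():
            found.append(f'{start}-{end}')
    return found

print(suspicious_ranges('[A-z0-9_]'))      # ['A-z']
print(suspicious_ranges('[A-Za-z0-9_]'))   # []
```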

#64784

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-27 04:25 +1100
Message-ID	<mailman.5997.1390757137.18130.python-list@python.org>
In reply to	#64781

On Mon, Jan 27, 2014 at 4:15 AM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.5996.1390756093.18130.python-list@python.org>,
>  Chris Angelico <rosuav@gmail.com> wrote:
>
>> The set [A-z] is equivalent to
>> [ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz]
>
> I'm inclined to suggest the regex compiler should issue a warning for
> this.
>
> I've never seen a character range other than A-Z, a-z, or 0-9.  Well, I
> suppose A-F or a-f if you're trying to match hex digits (and some
> variations on that for octal).  But, I can't imagine any example where
> somebody wrote A-z and it wasn't an error.

I've used a variety of character ranges, certainly more than the 4-5
you listed, but I agree that A-z is extremely likely to be an error.
However, I've sometimes used a regex (bytes mode) to find, say, all
the ASCII printable characters - [ -~] - and I wouldn't want that
precluded. It's a bit tricky trying to figure out which are likely to
be errors and which are not, so I'd be inclined to keep things as they
are. No warnings.
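
A small sketch of the bytes-mode range mentioned above, which runs from space (0x20) to tilde (0x7E). The input bytes and the names PRINTABLE and runs are made up for illustration:

```python
import re

# [ -~] covers every printable ASCII character, space through tilde.
PRINTABLE = re.compile(rb'[ -~]+')

data = b'\x00hello\x7fworld\n'   # NUL, DEL and newline are not printable
runs = PRINTABLE.findall(data)
print(runs)   # [b'hello', b'world']
```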

ChrisA

#64786

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2014-01-26 17:39 +0000
Message-ID	<mailman.5999.1390757975.18130.python-list@python.org>
In reply to	#64781

On 26/01/2014 17:25, Chris Angelico wrote:
> On Mon, Jan 27, 2014 at 4:15 AM, Roy Smith <roy@panix.com> wrote:
>> In article <mailman.5996.1390756093.18130.python-list@python.org>,
>>   Chris Angelico <rosuav@gmail.com> wrote:
>>
>>> The set [A-z] is equivalent to
>>> [ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz]
>>
>> I'm inclined to suggest the regex compiler should issue a warning for
>> this.
>>
>> I've never seen a character range other than A-Z, a-z, or 0-9.  Well, I
>> suppose A-F or a-f if you're trying to match hex digits (and some
>> variations on that for octal).  But, I can't imagine any example where
>> somebody wrote A-z and it wasn't an error.
>
> I've used a variety of character ranges, certainly more than the 4-5
> you listed, but I agree that A-z is extremely likely to be an error.
> However, I've sometimes used a regex (bytes mode) to find, say, all
> the ASCII printable characters - [ -~] - and I wouldn't want that
> precluded. It's a bit tricky trying to figure out which are likely to
> be errors and which are not, so I'd be inclined to keep things as they
> are. No warnings.
>
> ChrisA
>

I suggest a single warning is always given "Regular expressions can be 
fickle.  Have you considered using string methods?".  My apologies to 
regex fans if they're currently choking over their tea, coffee, cocoa, 
beer, scotch, saki, ouzo or whatever :)

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence

#64788

From	Tim Chase <python.list@tim.thechases.com>
Date	2014-01-26 13:41 -0600
Message-ID	<mailman.6000.1390765257.18130.python-list@python.org>
In reply to	#64781

On 2014-01-26 12:15, Roy Smith wrote:
> > The set [A-z] is equivalent to
> > [ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz]  
> 
> I'm inclined to suggest the regex compiler should issue a warning
> for this.
> 
> I've never seen a character range other than A-Z, a-z, or 0-9.
> Well, I suppose A-F or a-f if you're trying to match hex digits
> (and some variations on that for octal).  But, I can't imagine any
> example where somebody wrote A-z and it wasn't an error.

I'd not object to warnings on that one literal "A-z" set, but I've
done some work with VINs¹ where the allowable character-set is A-Z and
digits, minus letters that can be hard to distinguish visually
(I/O/Q), so I've used ^[A-HJ-NPR-Z0-9]{17}$ as a first-pass filter
for VINs that were entered (often scanned, but occasionally
hand-keyed).  In some environments, I've been able to intercept I/O/Q
and remap them accordingly to 1/0/0 to do the disambiguation for the
user.  So I'd not want to see other character-classes touched, as
they can be perfectly legit.
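
The filter and remapping described above might look roughly like this. The pattern is the one given in the post; the function names, the strip/upper normalization, and the sample strings are assumptions added for illustration, not the original code:

```python
import re

# First-pass filter from the post: 17 characters, A-Z and digits, minus I, O and Q.
VIN_RE = re.compile(r'^[A-HJ-NPR-Z0-9]{17}$')

# Remap the visually ambiguous letters I/O/Q to 1/0/0.
REMAP = str.maketrans('IOQ', '100')

def normalize_vin(raw):
    """Trim, upper-case and disambiguate a hand-keyed VIN (assumed cleanup steps)."""
    return raw.strip().upper().translate(REMAP)

def looks_like_vin(candidate):
    """Return True if the candidate passes the first-pass character filter."""
    return VIN_RE.match(candidate) is not None

typed = '1hgcm82633aoo4352'   # illustrative input with letter O keyed instead of zero
print(looks_like_vin(typed.upper()))         # False
print(looks_like_vin(normalize_vin(typed)))  # True
```

This only checks the shape of the string; it says nothing about whether the VIN is actually valid.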

-tkc

¹ http://en.wikipedia.org/wiki/Vehicle_Identification_Number

#64783

From	Blake Adams <blakesadams@gmail.com>
Date	2014-01-26 09:15 -0800
Message-ID	<c5c75189-1280-40fb-8bd2-de00eba97257@googlegroups.com>
In reply to	#64780

On Sunday, January 26, 2014 12:08:01 PM UTC-5, Chris Angelico wrote:
> On Mon, Jan 27, 2014 at 3:59 AM, Blake Adams <blakesadams@gmail.com> wrote:
> > If I want to set up a match replicating the '\w' pattern I would assume that would be done with '[A-z0-9_]'.  However, when I run the following:
> >
> > re.findall('[A-z0-9_]','^;z %C\@0~_') it matches ['^', 'z', 'C', '\\', '0', '_'].  I would expect the match to be ['z', 'C', '0', '_'].
> >
> > Why does this happen?
>
> Because \w is not the same as [A-z0-9_]. Quoting from the docs:
>
> """
> \w For Unicode (str) patterns: Matches Unicode word characters; this
> includes most characters that can be part of a word in any language,
> as well as numbers and the underscore. If the ASCII flag is used, only
> [a-zA-Z0-9_] is matched (but the flag affects the entire regular
> expression, so in such cases using an explicit [a-zA-Z0-9_] may be a
> better choice). For 8-bit (bytes) patterns: Matches characters
> considered alphanumeric in the ASCII character set; this is equivalent
> to [a-zA-Z0-9_].
> """
>
> If you're working with a byte string, then you're close, but A-z is
> quite different from A-Za-z. The set [A-z] is equivalent to
> [ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz] (that's
> a literal backslash in there, btw), so it'll also catch several
> non-alphabetic characters. With a Unicode string, it's quite
> distinctly different. Either way, \w means "word characters", though,
> so just go ahead and use it whenever you want word characters :)
>
> ChrisA

Thanks Chris

#64785

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2014-01-26 17:30 +0000
Message-ID	<mailman.5998.1390757464.18130.python-list@python.org>
In reply to	#64783

On 26/01/2014 17:15, Blake Adams wrote:
> On Sunday, January 26, 2014 12:08:01 PM UTC-5, Chris Angelico wrote:
>> On Mon, Jan 27, 2014 at 3:59 AM, Blake Adams <blakesadams@gmail.com> wrote:
>>
>>> If I want to set up a match replicating the '\w' pattern I would assume that would be done with '[A-z0-9_]'.  However, when I run the following:
>>
>>>
>>
>>> re.findall('[A-z0-9_]','^;z %C\@0~_') it matches ['^', 'z', 'C', '\\', '0', '_'].  I would expect the match to be ['z', 'C', '0', '_'].
>>
>>>
>>
>>> Why does this happen?
>>
>>
>>
>> Because \w is not the same as [A-z0-9_]. Quoting from the docs:
>>
>>
>>
>> """
>>
>> \w For Unicode (str) patterns:Matches Unicode word characters; this
>>
>> includes most characters that can be part of a word in any language,
>>
>> as well as numbers and the underscore. If the ASCII flag is used, only
>>
>> [a-zA-Z0-9_] is matched (but the flag affects the entire regular
>>
>> expression, so in such cases using an explicit [a-zA-Z0-9_] may be a
>>
>> better choice).For 8-bit (bytes) patterns:Matches characters
>>
>> considered alphanumeric in the ASCII character set; this is equivalent
>>
>> to [a-zA-Z0-9_].
>>
>> """
>>
>>
>>
>> If you're working with a byte string, then you're close, but A-z is
>>
>> quite different from A-Za-z. The set [A-z] is equivalent to
>>
>> [ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz] (that's
>>
>> a literal backslash in there, btw), so it'll also catch several
>>
>> non-alphabetic characters. With a Unicode string, it's quite
>>
>> distinctly different. Either way, \w means "word characters", though,
>>
>> so just go ahead and use it whenever you want word characters :)
>>
>>
>>
>> ChrisA
>
> Thanks Chris
>

I'm pleased to see that your question has been answered.

Now would you please read and action this 
https://wiki.python.org/moin/GoogleGroupsPython to prevent us seeing the 
double line spacing above, thanks.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence
