Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!eu.feeder.erje.net!newsfeed.xs4all.nl!newsfeed2.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <3f568767-e13a-4c7d-a4fb-85caca2adf6e@googlegroups.com>
References: <3f568767-e13a-4c7d-a4fb-85caca2adf6e@googlegroups.com>
Date: Mon, 27 Jan 2014 04:08:01 +1100
Subject: Re: re Questions
From: Chris Angelico <rosuav@gmail.com>
Cc: "python-list@python.org" <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.5996.1390756093.18130.python-list@python.org>
Lines: 29
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:64780

On Mon, Jan 27, 2014 at 3:59 AM, Blake Adams <blakesadams@gmail.com> wrote:
> If I want to set up a match replicating the '\w' pattern I would assume that would be done with '[A-z0-9_]'.  However, when I run the following:
>
> re.findall('[A-z0-9_]','^;z %C\@0~_') it matches ['^', 'z', 'C', '\\', '0', '_'].  I would expect the match to be ['z', 'C', '0', '_'].
>
> Why does this happen?

Because \w is not the same as [A-z0-9_]. Quoting from the docs:

"""
\w For Unicode (str) patterns:Matches Unicode word characters; this
includes most characters that can be part of a word in any language,
as well as numbers and the underscore. If the ASCII flag is used, only
[a-zA-Z0-9_] is matched (but the flag affects the entire regular
expression, so in such cases using an explicit [a-zA-Z0-9_] may be a
better choice).For 8-bit (bytes) patterns:Matches characters
considered alphanumeric in the ASCII character set; this is equivalent
to [a-zA-Z0-9_].
"""

If you're working with a byte string, then you're close, but A-z is
quite different from A-Za-z. The set [A-z] is equivalent to
[ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz] (that's
a literal backslash in there, btw), so it'll also catch several
non-alphabetic characters. With a Unicode string, it's quite
distinctly different. Either way, \w means "word characters", though,
so just go ahead and use it whenever you want word characters :)

ChrisA