
On re / regex replacement

Started by	jmfauth <wxjmfauth@gmail.com>
First post	2011-08-28 01:43 -0700
Last post	2011-08-28 22:40 -0700
Articles	4 — 3 participants


  On re / regex replacement jmfauth <wxjmfauth@gmail.com> - 2011-08-28 01:43 -0700
    Re: On re / regex replacement Vlastimil Brom <vlastimil.brom@gmail.com> - 2011-08-28 15:40 +0200
    Re: On re / regex replacement MRAB <python@mrabarnett.plus.com> - 2011-08-28 19:40 +0100
      Re: On re / regex replacement jmfauth <wxjmfauth@gmail.com> - 2011-08-28 22:40 -0700

#12335 — On re / regex replacement

From	jmfauth <wxjmfauth@gmail.com>
Date	2011-08-28 01:43 -0700
Subject	On re / regex replacement
Message-ID	<8e5b2e1c-bb7a-45a3-a0a3-23d25c3d16a7@w28g2000yqw.googlegroups.com>

There is currently a discussion on the dev-list about replacing
"re" with "regex".

I'm neither a regular expression specialist nor a regex user.
However, there is one point about regex that bothers me a little.

The regex module offers flags to select the "coding" (wrong word,
just to be short):

The global flags are: ASCII, LOCALE, NEW, REVERSE, UNICODE.

While I can understand the ASCII flag, ASCII being the "lingua franca"
of almost all codings, I am more skeptical about the LOCALE/UNICODE
flags.

There is, in my mind, some kind of conflict here. What is 100% Unicode
compliant should be locale independent ("Unicode.org"), and a locale
dependency means a loss of Unicode compliance.

I fear some potential problems here: users or modules working
in one mode while others work in the other mode.

Nothing technical here. It seems to me nobody has pointed this
out.

jmf


#12340

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2011-08-28 15:40 +0200
Message-ID	<mailman.505.1314538808.27778.python-list@python.org>
In reply to	#12335

2011/8/28 jmfauth <wxjmfauth@gmail.com>:
> There is currently a discussion on the dev-list about replacing
> "re" with "regex".
>...
> While I can understand the ASCII flag, ASCII being the "lingua franca"
> of almost all codings, I am more skeptical about the LOCALE/UNICODE
> flags.
>
> There is, in my mind, some kind of conflict here. What is 100% Unicode
> compliant should be locale independent ("Unicode.org"), and a locale
> dependency means a loss of Unicode compliance.
>
> I fear some potential problems here: users or modules working
> in one mode while others work in the other mode.
>
>...
> jmf


As I understand it, regex was designed to be as compatible with re as
possible; some behaviour that is arguably problematic is even retained
as the default and only "corrected" via the NEW flag (e.g. zero-width
split). The LOCALE flag also seems to be considered a legacy feature
and is kept with the same behaviour as in re;
cf.: http://code.google.com/p/mrab-regex-hg/issues/detail?id=6&can=1
In my opinion, the LOCALE flag is not reliable (in the way I would
expect) in either re or regex.
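
For illustration only (not part of the original post): the zero-width
split case can be reproduced with the standard re module alone. This is
a minimal sketch, assuming Python 3.7 or later, where re.split began
splitting on empty matches; the re of 2011 returned the string unsplit
for the same call. The helper name split_on_boundaries is hypothetical.

```python
import re

def split_on_boundaries(text):
    # \b matches only the empty string at word boundaries, so every
    # split point here is a zero-width match.
    return re.split(r"\b", text)

print(split_on_boundaries("Words, words, words."))
```

On Python 3.7+ this prints
['', 'Words', ', ', 'words', ', ', 'words', '.'].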

As far as flags are concerned, regex should either work the same way
as re or just add more possibilities (REVERSE for backwards search,
ASCII as the complement of UNICODE, NEW to enable some incompatible
additions or corrections where existing code could rely on the
original behaviour).

The only (understandable) incompatibility I have encountered in regex
is the new features requiring special syntax, which would obviously
raise errors in re or be matched literally instead.
See
http://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails#Additional_features
for an overview of the additions.

Personally, I am very happy with regex, both with its features and
with the support and maintenance by its developer;
however, I mostly use it for manually entered patterns and less
for hardcoded operations.

regards,
   Vlastimil Brom


#12347

From	MRAB <python@mrabarnett.plus.com>
Date	2011-08-28 19:40 +0100
Message-ID	<mailman.513.1314556846.27778.python-list@python.org>
In reply to	#12335

On 28/08/2011 14:40, Vlastimil Brom wrote:
> 2011/8/28 jmfauth<wxjmfauth@gmail.com>:
>> There is currently a discussion on the dev-list about replacing
>> "re" with "regex".
>> ...
>> While I can understand the ASCII flag, ASCII being the "lingua franca"
>> of almost all codings, I am more skeptical about the LOCALE/UNICODE
>> flags.
>>
>> There is, in my mind, some kind of conflict here. What is 100% Unicode
>> compliant should be locale independent ("Unicode.org"), and a locale
>> dependency means a loss of Unicode compliance.
>>
>> I fear some potential problems here: users or modules working
>> in one mode while others work in the other mode.
>>
>> ...
>> jmf
>
>
> As I understand it, regex was designed to be as compatible with re as
> possible; some behaviour that is arguably problematic is even retained
> as the default and only "corrected" via the NEW flag (e.g. zero-width
> split). The LOCALE flag also seems to be considered a legacy feature
> and is kept with the same behaviour as in re;
> cf.: http://code.google.com/p/mrab-regex-hg/issues/detail?id=6&can=1
> In my opinion, the LOCALE flag is not reliable (in the way I would
> expect) in either re or regex.
>
In Python 2, re defaults to ASCII and you must use UNICODE for Unicode
strings (the str type is a bytestring). In Python 3, re defaults to
UNICODE and you must use ASCII for ASCII bytestrings (the str type is a
Unicode string).
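
A minimal sketch of the Python 3 side of this, added for illustration
and assuming a current Python 3 with str patterns only. The helper name
words and the sample text are hypothetical.

```python
import re

def words(text, flags=0):
    # For str patterns, \w follows Unicode rules unless re.ASCII is given.
    return re.findall(r"\w+", text, flags)

sample = "caf\u00e9 na\u00efve"  # "cafe naive" with accented e and i

print(words(sample))            # Unicode matching (the default)
print(words(sample, re.ASCII))  # ASCII-only matching
```

With the default, the accented letters count as word characters and two
words come back; with re.ASCII they act as separators and the result is
['caf', 'na', 've'].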

The LOCALE flag is for locale-dependent 8-bit bytestrings. It uses the
toupper and tolower functions of the underlying C library.
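
As a hedged illustration of the "8-bit bytestrings" point: the sketch
below assumes Python 3.6 or later, where re accepts LOCALE only for
bytes patterns and raises ValueError for a str pattern. The helper name
accepts_locale is hypothetical.

```python
import re

def accepts_locale(pattern):
    # True if the installed re accepts re.LOCALE for this pattern type.
    try:
        re.compile(pattern, re.LOCALE)
    except ValueError:
        return False
    return True

print(accepts_locale(rb"\w+"))  # bytes pattern
print(accepts_locale(r"\w+"))   # str pattern
```

Under that assumption the first call reports True and the second False.
What a LOCALE pattern then matches depends on the process locale, so no
match results are shown here.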

The regex module tries to be drop-in compatible. It supports the LOCALE
flag only because the re module has it. Even Perl has something similar.

> As far as flags are concerned, regex should either work the same way
> as re or just add more possibilities (REVERSE for backwards search,
> ASCII as the complement of UNICODE, NEW to enable some incompatible
> additions or corrections where existing code could rely on the
> original behaviour).
>
> The only (understandable) incompatibility I have encountered in regex
> is the new features requiring special syntax, which would obviously
> raise errors in re or be matched literally instead.
> See
> http://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails#Additional_features
> for an overview of the additions.
>
In the re module, unknown escape sequences are treated as literals, eg
\K is treated as K.

The regex module has more escape sequences, so that may break existing
regexes, eg \X isn't treated as X, but matches a grapheme. Unknown
escape sequences are still treated as literals, as in re.
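
The literal fallback described here is how re behaved in 2011. A
minimal sketch for checking what the installed re does today, assuming
Python 3.6 or later, where re rejects unknown escapes made of a
backslash and an ASCII letter instead of treating them as literals. The
helper name escape_status is hypothetical.

```python
import re

def escape_status(pattern):
    # Report whether the installed re accepts the pattern at all.
    try:
        re.compile(pattern)
    except re.error as exc:
        return "rejected ({})".format(exc)
    return "accepted"

print(escape_status(r"\d"))  # an escape re knows
print(escape_status(r"\K"))  # an escape re does not know
```

Under that assumption \d is accepted and \K is rejected with a "bad
escape" error, so a pattern relying on the old literal fallback now
fails when it is compiled.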

My view is that you shouldn't be relying on that behaviour. If it looks
like an escape sequence, it may very well be one. It's like their use
in string literals for file paths on Windows. I would've preferred
that an invalid escape sequence in a string literal raised an exception
(either it's valid and has a meaning, or it's invalid/reserved for
future use).
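
The Windows path comparison can be sketched as follows. This is an
illustration only; the path and the helper name show are made up, and
only the valid escapes \t and \n are used so the snippet runs without
warnings.

```python
def show(label, path):
    # repr() makes control characters such as tab and newline visible.
    print(label, len(path), repr(path))

plain = "C:\temp\new"  # \t and \n are valid escapes: a tab and a newline
raw = r"C:\temp\new"   # the raw literal keeps both backslashes

show("plain", plain)
show("raw", raw)
```

The plain literal is 9 characters long and contains no backslash at
all, while the raw literal keeps all 11 characters of the intended path.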

It's a balancing act. Requiring the NEW flag for _any_ deviation from
re would be very annoying.

> Personally, I am very happy with regex, both with its features and
> with the support and maintenance by its developer;
> however, I mostly use it for manually entered patterns and less
> for hardcoded operations.
>
And I'm very happy with your feedback. ;-)


#12390

From	jmfauth <wxjmfauth@gmail.com>
Date	2011-08-28 22:40 -0700
Message-ID	<5651b3de-40ae-4fda-b7d2-b59d06bd1e97@o26g2000vbi.googlegroups.com>
In reply to	#12347

On 28 Aug, 20:40, MRAB <pyt...@mrabarnett.plus.com> wrote:
> ...

> The regex module tries to be drop-in compatible. It supports the LOCALE
> flag only because the re module has it. Even Perl has something similar.
>  ...


Ok. That's quite logical.

jmf
