
Re: built-in regex matches wrong character

Started by	Eric Blake <eblake@redhat.com>
First post	2018-09-06 09:23 -0500
Last post	2018-09-06 12:58 -0500
Articles	3 — 2 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: built-in regex matches wrong character Eric Blake <eblake@redhat.com> - 2018-09-06 09:23 -0500
    Re: built-in regex matches wrong character arnold@skeeve.com (Aharon Robbins) - 2018-09-06 17:39 +0000
      Re: built-in regex matches wrong character Eric Blake <eblake@redhat.com> - 2018-09-06 12:58 -0500

#14554 — Re: built-in regex matches wrong character

From	Eric Blake <eblake@redhat.com>
Date	2018-09-06 09:23 -0500
Subject	Re: built-in regex matches wrong character
Message-ID	<mailman.444.1536243821.1284.bug-bash@gnu.org>

On 09/06/2018 09:17 AM, Chet Ramey wrote:
> On 9/5/18 4:39 PM, Eric Blake wrote:
> 
>> Or, you can use bash's 'shopt -s globasciiranges' which is
>> supposed to enable Rational Range Interpretation, where even in non-C
>> locales, a character range bounded by two ASCII characters takes on the C
>> locale definition of only the ASCII characters in that range, rather than
>> the locale's definition of whatever other characters might also be
>> equivalent (actually, while I know that shopt affects globbing, I don't
>> know if it also affects regex matching - but if it doesn't, that's probably
>> a bug that should be fixed).
> 
> Since bash uses the C library's regexp engine, and most C libraries don't
> implement RRI, much less expose it as a flags option available via
> regcomp(), there's no reason to expect that globasciiranges would have
> any effect on regular expression matching.

But bash could be taught to convert any regex that contains a range with 
both endpoints ASCII into a different bracket expression before handing 
things over to regcomp().  That is, if the user is matching against 
[a-d], bash hands [abcd] to regcomp() instead.  You don't need a flag in
regcomp() to get RRI, merely some pre-processing (and often memory
allocation, as the expansion of a range into a non-range tends to 
require more characters).
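
A minimal sketch of the pre-processing described above, written in Python for brevity rather than the C that bash would actually use. The function names are hypothetical and are not part of bash or any C library. It handles only the simple case: it assumes the input is the text between the brackets, and it ignores negation, character classes such as [:alpha:], and the placement rules for a literal ']' or '-' that a real implementation would have to respect after expansion.

```python
def expand_ascii_range(lo, hi):
    """Return every character from lo to hi, inclusive, in code point order."""
    if ord(lo) >= 128 or ord(hi) >= 128:
        raise ValueError("both endpoints must be ASCII")
    if ord(lo) > ord(hi):
        raise ValueError("range endpoints are reversed")
    return "".join(chr(c) for c in range(ord(lo), ord(hi) + 1))


def rewrite_bracket_body(body):
    """Rewrite the inside of a simple bracket expression.

    Ranges with two ASCII endpoints are spelled out character by
    character; anything else (including ranges with a non-ASCII
    endpoint) is copied through unchanged.
    """
    out = []
    i = 0
    while i < len(body):
        is_range = i + 2 < len(body) and body[i + 1] == "-"
        if is_range and ord(body[i]) < 128 and ord(body[i + 2]) < 128:
            out.append(expand_ascii_range(body[i], body[i + 2]))
            i += 3
        elif is_range:
            # Non-ASCII endpoint: leave the range to the locale.
            out.append(body[i:i + 3])
            i += 3
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(rewrite_bracket_body("a-d"))     # abcd
    print(rewrite_bracket_body("0-3x-z"))  # 0123xyz
```

The output is longer than the input whenever a range spans more than three characters, which is the memory allocation mentioned above.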

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


#14557

From	arnold@skeeve.com (Aharon Robbins)
Date	2018-09-06 17:39 +0000
Message-ID	<pmroop$tv8$1@dont-email.me>
In reply to	#14554

In article <mailman.444.1536243821.1284.bug-bash@gnu.org>,
Eric Blake  <eblake@redhat.com> wrote:
>But bash could be taught to convert any regex that contains a range with 
>both endpoints ASCII into a different bracket expression before handing 
>things over to regcomp().  That is, if the user is matching against 
>[a-d], bash hands [abcd] to regcomp() instead.  You don't need a flag in
>regcomp() to get RRI, merely some pre-processing (and often memory
>allocation, as the expansion of a range into a non-range tends to 
>require more characters).

This is easy and inexpensive for ASCII only.  Full RRI does the
same thing for wide character sets as well, though, and there
the possibility for using very large amounts of memory makes the
rewrite-the-range idea less palatable.
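
To put rough numbers on that, here is a small Python sketch that counts how many characters a range expands to and how many bytes the expanded text would take. The function names are hypothetical, and UTF-8 is assumed as the encoding purely for illustration; an implementation working in wchar_t would see different byte counts but the same character counts.

```python
def utf8_len(cp):
    """Number of bytes UTF-8 uses for one code point."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def expanded_cost(lo, hi):
    """Return (characters, UTF-8 bytes) for the range lo-hi spelled out."""
    first, last = ord(lo), ord(hi)
    if first > last:
        raise ValueError("range endpoints are reversed")
    count = last - first + 1
    size = sum(utf8_len(cp) for cp in range(first, last + 1))
    return count, size


if __name__ == "__main__":
    print(expanded_cost("a", "d"))            # (4, 4)
    print(expanded_cost("\u4e00", "\u9fff"))  # (20992, 62976)
```

An ASCII range can never expand past 128 characters, while a single range over a large Unicode block grows to tens of thousands.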
-- 
Aharon (Arnold) Robbins 		arnold AT skeeve DOT com


#14558

From	Eric Blake <eblake@redhat.com>
Date	2018-09-06 12:58 -0500
Message-ID	<mailman.454.1536256705.1284.bug-bash@gnu.org>
In reply to	#14557

On 09/06/2018 12:39 PM, Aharon Robbins wrote:
> In article <mailman.444.1536243821.1284.bug-bash@gnu.org>,
> Eric Blake  <eblake@redhat.com> wrote:
>> But bash could be taught to convert any regex that contains a range with
>> both endpoints ASCII into a different bracket expression before handing
>> things over to regcomp().  That is, if the user is matching against
>> [a-d], bash hands [abcd] to regcomp() instead.  You don't need a flag in
>> regcomp() to get RRI, merely some pre-processing (and often memory
>> allocation, as the expansion of a range into a non-range tends to
>> require more characters).
> 
> This is easy and inexpensive for ASCII only.  Full RRI does the
> same thing for wide character sets as well, though, and there
> the possibility for using very large amounts of memory makes the
> rewrite-the-range idea less palatable.

Indeed. But the bash option is named 'globasciiranges', and I find far 
more use in having ranges with both endpoints in single-byte ASCII
behave sanely than in ranges where one or both endpoints are
multibyte characters (by the time my regex involves multibyte
characters, I am already admitting that I am in locale-dependent 
territory, and RRI may no longer be the best action anyway).  That is, 
RRI makes the most sense when dealing with ASCII characters (< 128) in 
the first place, and that's a reasonable stopgap for immediate 
implementation, even if we don't get full RRI across all of Unicode 
(assuming that such might later become available via a new regcomp() flag).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
