From: Eric Blake
Newsgroups: gnu.bash.bug
Subject: Re: built-in regex matches wrong character
Date: Thu, 6 Sep 2018 12:58:17 -0500
Organization: Red Hat, Inc.
To: Aharon Robbins, bug-bash@gnu.org

On 09/06/2018 12:39 PM, Aharon Robbins wrote:
> In article ,
> Eric Blake wrote:
>> But bash could be taught to convert any regex that contains a range with
>> both endpoints ASCII into a different bracket expression before handing
>> things over to regcomp().  That is, if the user is matching against
>> [a-d], bash hands [abcd] to regcomp() instead.
>> You don't need a flag in
>> regcomp() to get RRI, just merely some pre-processing (and often memory
>> allocation, as the expansion of a range into a non-range tends to
>> require more characters).
>
> This is easy and inexpensive for ASCII only.  Full RRI does the
> same thing for wide character sets as well, though, and there
> the possibility for using very large amounts of memory makes the
> rewrite-the-range idea less palatable.

Indeed. But the bash option is named 'globasciiranges', and I find far
more use in having ranges with both endpoints in single-byte ASCII
behaving sanely than I do for ranges with one or more ends resulting in
a multibyte character (by the time my regex involves multibyte
characters, I am already admitting that I am in locale-dependent
territory, and RRI may no longer be the best action anyway).

That is, RRI makes the most sense when dealing with ASCII characters
(< 128) in the first place, and that's a reasonable stopgap for
immediate implementation, even if we don't get full RRI across all of
Unicode (assuming that such might later become available via a new
regcomp() flag).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
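
The pre-processing idea discussed above (rewriting [a-d] as [abcd] before the pattern reaches the regex compiler) can be illustrated with a minimal Python sketch. This is not bash's code and not a proposed patch: the function name expand_ascii_ranges is hypothetical, and the sketch makes a simplifying assumption that is not in the thread, namely that a range is expanded only when every character it covers is an ASCII letter or digit. That restriction avoids having to reposition ']', '^' or '-' inside the rewritten bracket expression. Ranges with a non-ASCII endpoint, and collating elements used as range endpoints, are passed through unchanged.

```python
def expand_ascii_ranges(regex: str) -> str:
    """Rewrite ranges such as a-d inside bracket expressions as abcd.

    Hypothetical helper for illustration. Only ranges that cover ASCII
    letters/digits exclusively are expanded; everything else is copied
    through verbatim.
    """
    out = []
    i, n = 0, len(regex)
    while i < n:
        ch = regex[i]
        if ch == "\\" and i + 1 < n:
            # Escaped character outside a bracket expression: copy both.
            out.append(regex[i:i + 2])
            i += 2
            continue
        if ch != "[":
            out.append(ch)
            i += 1
            continue

        # Start of a bracket expression.
        out.append("[")
        j = i + 1
        if j < n and regex[j] == "^":
            out.append("^")
            j += 1
        if j < n and regex[j] == "]":
            # A ']' in first position is a literal member.
            out.append("]")
            j += 1

        while j < n and regex[j] != "]":
            # Copy [:class:], [.coll.] and [=equiv=] verbatim.
            if regex[j] == "[" and j + 1 < n and regex[j + 1] in ":.=":
                end = regex.find(regex[j + 1] + "]", j + 2)
                if end == -1:
                    out.append(regex[j:])
                    j = n
                    break
                out.append(regex[j:end + 2])
                j = end + 2
                continue

            lo = regex[j]
            if j + 2 < n and regex[j + 1] == "-" and regex[j + 2] != "]":
                hi = regex[j + 2]
                first, last = ord(lo), ord(hi)
                safe = (
                    first <= last < 128
                    and all(chr(c).isalnum() for c in range(first, last + 1))
                )
                if safe:
                    # The rewritten list is longer than the range it replaces.
                    out.extend(chr(c) for c in range(first, last + 1))
                else:
                    out.append(regex[j:j + 3])
                j += 3
                continue

            out.append(lo)
            j += 1

        if j < n:
            out.append("]")
            j += 1
        i = j
    return "".join(out)


if __name__ == "__main__":
    for pattern in ["[a-d]", "^[0-3x]+$", "[[:alpha:]0-2]", "[A-z]"]:
        print(pattern, "->", expand_ascii_ranges(pattern))
```

As noted in the thread, the rewritten pattern needs more space than the original: a-z grows from 3 characters to 26. That growth is bounded for ASCII, which is why the rewrite is cheap there and less attractive for wide character sets.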