Path: csiph.com!3.us.feeder.erje.net!feeder.erje.net!news.linkpendium.com!news.linkpendium.com!panix!usenet.stanford.edu!not-for-mail
From: Bize Ma <binaryzebra@gmail.com>
Newsgroups: gnu.bash.bug
Subject: Re: Bash removes unrequested characters in bracket expressions (not a range).
Date: Sat, 24 Nov 2018 17:34:55 -0400
Lines: 48
Approved: bug-bash@gnu.org
Message-ID: <mailman.4547.1543095311.1284.bug-bash@gnu.org>
References: <CAFra36hcAjBHGgd_8sHjOV4wSzjmdCyLV2aQo8Ww1bwJqkxYQA@mail.gmail.com> <1c24a279-f439-a13c-be60-901096ccd4e1@case.edu>
NNTP-Posting-Host: lists.gnu.org
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Cc: bug-bash <bug-bash@gnu.org>, bash@packages.debian.org
To: Chester Ramey <chet.ramey@case.edu>
Envelope-to: bug-bash@gnu.org
In-Reply-To: <1c24a279-f439-a13c-be60-901096ccd4e1@case.edu>
Precedence: list
Xref: csiph.com gnu.bash.bug:14851

Chet Ramey (<chet.ramey@case.edu>) wrote:

> On 11/23/18 6:09 PM, Bize Ma wrote:
>
> > Bash Version: 4.4
> > Patch Level: 12
> > Release Status: release
>


> > Description:
> >
> > Bash is removing characters not explicitly listed in a bracket
> > expression (character range).
> > In this example, it is removing digits from other languages.
>
> What is your locale?
>
>
The locale used was en_US.utf-8, but it also happens with 459 of the
868 locales available under Debian (not in C, for example).

Also, in every affected locale except one (zh_CN.gb18030), setting
either LC_ALL=$loc or LC_COLLATE=$loc had the same effect.
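
A survey of this kind could be sketched roughly as below. This is only
an illustration, not the test from the original report: the helper
names and the sample character are assumptions, and it assumes that
`case` goes through the same pattern matcher as the expansion that
showed the problem.

```shell
# Hypothetical helper: succeeds only if $1 matches the explicit digit list.
is_listed_digit() {
    case $1 in
        [0123456789]) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical survey: print each locale in which a character that is
# NOT in the list still matches it. The sample is ARABIC-INDIC DIGIT ONE
# (U+0661) written as raw UTF-8 bytes; the choice of sample is an assumption.
survey_locales() {
    local loc sample
    sample=$(printf '\331\241')
    for loc in $(locale -a); do
        # Change the locale inside a subshell so the caller's is untouched.
        if (LC_ALL=$loc; is_listed_digit "$sample") 2>/dev/null; then
            printf '%s\n' "$loc"
        fi
    done
}
```

Running `survey_locales | wc -l` would then count the locales that
match; replacing `LC_ALL=$loc` with `LC_COLLATE=$loc` would check the
second setting.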

But IMO locale collation should not be used for an explicit list.

I have been made aware that there is a
      cstart = cend = FOLD (cstart);
inside the `sm_loop.c` file that turns many individual characters
into ranges. If that understanding is correct, that is the source of
the difference from other shells.

My understanding is that a collation table *must have a "total order"*,
in fact a strict total order. If two characters `a` and `b` could sort
as equal, the order could not confirm that a character is absent from
the list. Consider characters `a`, `b` and `c`: if `a` and `b` sort as
equal, a sorted list in which we find `a` followed by `c` doesn't
confirm that `b` is absent, as the order could just as well be `b a c`.

With a strict total order, there must not be any character other than
`a` in the range `a-a`, and using a range `a-a` is equivalent (just
slower and more complex) to the single character `a`.
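
That equivalence can be checked directly. The sketch below is only an
illustration with hypothetical helper names: it tests each candidate
character against both `[a]` and `[a-a]` and reports any character on
which the two disagree. Under a strict total order it should print
nothing in any locale.

```shell
# Hypothetical helpers: match one character against a single-character
# list and against the degenerate range with the same endpoints.
in_single() { case $1 in [a]) return 0 ;; *) return 1 ;; esac; }
in_range()  { case $1 in [a-a]) return 0 ;; *) return 1 ;; esac; }

# Report every candidate on which the two patterns disagree.
compare_patterns() {
    local ch s r
    for ch in "$@"; do
        if in_single "$ch"; then s=yes; else s=no; fi
        if in_range "$ch"; then r=yes; else r=no; fi
        [ "$s" = "$r" ] ||
            printf 'mismatch on %s: [a]=%s [a-a]=%s\n' "$ch" "$s" "$r"
    done
}

# Example call; add any other candidate characters as extra arguments.
compare_patterns a A b
```

Any line of output in some locale would point at a character that
collates as equal to `a` there.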

If this is not the case, the error is in the collation table, not in
using single (faster) characters, and IMO it is the collation table
that should be fixed.