From: Bize Ma
Newsgroups: gnu.bash.bug
Subject: Re: Bash removes unrequested characters in bracket expressions (not a range).
Date: Sat, 24 Nov 2018 17:34:55 -0400
References: <1c24a279-f439-a13c-be60-901096ccd4e1@case.edu>
To: Chester Ramey
Cc: bug-bash, bash@packages.debian.org

Chet Ramey wrote:

> On 11/23/18 6:09 PM, Bize Ma wrote:
>
> > Bash Version: 4.4
> > Patch Level: 12
> > Release Status: release
> >
> > Description:
> >
> > Bash is removing characters not explicitly listed in a bracket
> > expression (character range).
> > In this example, it is removing digits from other languages.
>
> What is your locale?

The locale used was en_US.utf-8, but this also happens with 459 of the 868 locales available under Debian (not in C, for example). In all affected locales except one, setting either LC_ALL=$loc or LC_COLLATE=$loc produced the same result. The exception is zh_CN.gb18030.

But IMO locale collation should not be used for an explicit list.

I have been made aware that there is a `cstart = cend = FOLD (cstart);` inside the `sm_loop.c` file that converts each individually listed character into a range. If that understanding is correct, that is the source of the difference with other shells.

I have the perception that a collation table *must have a "total order"*, in fact, a strict total order. If two characters `a` and `b` could sort as equal, the order would fail to confirm that a character is absent from the list.
Consider the characters `a`, `b` and `c`: if `a` and `b` sort as equal, a sorted list in which we find `a` followed by `c` doesn't confirm that `b` is absent, as the order could just as well be `b a c`. With a strict total order, there must not be any character other than `a` in the range `a-a`, and using the range `a-a` is equivalent (just slower and more complex) to using the single character `a`. If this is not the case, the error is in the collation table, not in using single (faster) characters. And what should be updated is that collation table, IMO.
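
The behaviour discussed above can be probed per locale with a small script. This is only a sketch under assumptions: the helper names `strip_with`, `list_matches_c`, `single_equals_range` and `report` are made up for illustration, the digit list `[0123456789]` stands in for the pattern from the original report (not quoted in this message), and a locale that is not installed silently falls back to C-like behaviour, so check `locale -a` first.

```shell
# strip_with LOCALE PATTERN STRING
# Print STRING with every character matching the glob PATTERN removed.
# A fresh bash is started with LC_ALL=LOCALE so the locale is in effect
# from startup; setlocale warnings for missing locales are discarded.
strip_with() {
    LC_ALL=$1 bash -c 'printf "%s\n" "${2//$1/}"' _ "$2" "$3" 2>/dev/null
}

# list_matches_c LOCALE STRING
# Succeed when an explicit digit list removes the same characters under
# LOCALE as it does under the C locale (assumed here as the baseline).
list_matches_c() {
    [ "$(strip_with C '[0123456789]' "$2")" = "$(strip_with "$1" '[0123456789]' "$2")" ]
}

# single_equals_range LOCALE STRING
# Succeed when the single character `a` and the range `a-a` remove the
# same characters from STRING under LOCALE.
single_equals_range() {
    [ "$(strip_with "$1" '[a]' "$2")" = "$(strip_with "$1" '[a-a]' "$2")" ]
}

# report STRING LOCALE...
# Print one line per locale saying whether the explicit list behaved as in C.
report() {
    sample=$1
    shift
    for loc in "$@"; do
        if list_matches_c "$loc" "$sample"; then
            echo "$loc: explicit list behaves as in C"
        else
            echo "$loc: explicit list removed different characters than in C"
        fi
    done
}
```

For example, `report "$sample" C en_US.utf8`, with `$sample` holding ASCII digits mixed with digits from another script, prints one line per locale. What it prints depends on the bash version and on the collation tables installed on the system.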