Path: csiph.com!3.us.feeder.erje.net!feeder.erje.net!news.snarked.org!news.linkpendium.com!news.linkpendium.com!panix!usenet.stanford.edu!not-for-mail
From: Greg Wooledge <wooledg@eeg.ccf.org>
Newsgroups: gnu.bash.bug
Subject: Re: Unicode range and enumeration support.
Date: Mon, 23 Dec 2019 08:20:49 -0500
Lines: 50
Approved: bug-bash@gnu.org
Message-ID: <mailman.1307.1577107281.1979.bug-bash@gnu.org>
References: <568aeaaa-22b3-c7b9-0e18-a92bef6d2ffb@iki.fi> <5DF2FE31.9070406@tlinx.org> <0ff3a920-94c2-b0c9-5631-0964955657aa@archlinux.org> <5DF3D78B.4090208@tlinx.org> <20191213184213.GO851@eeg.ccf.org> <5DF4BDF0.6000402@tlinx.org> <20191216163906.GV851@eeg.ccf.org> <5DFA7AE2.2060504@tlinx.org> <20191218194651.GH851@eeg.ccf.org> <5DFD68B9.3050202@tlinx.org> <20191223132049.GW851@eeg.ccf.org>
NNTP-Posting-Host: lists.gnu.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: usenet.stanford.edu 1577107282 31282 209.51.188.17 (23 Dec 2019 13:21:22 GMT)
X-Complaints-To: action@cs.stanford.edu
To: bug-bash <bug-bash@gnu.org>
Envelope-to: bug-bash@gnu.org
Mail-Followup-To: bug-bash <bug-bash@gnu.org>
Content-Disposition: inline
In-Reply-To: <5DFD68B9.3050202@tlinx.org>
User-Agent: Mutt/1.10.1 (2018-07-13)

On Fri, Dec 20, 2019 at 04:35:05PM -0800, L A Walsh wrote:
> On 2019/12/18 11:46, Greg Wooledge wrote:
> > To put it another way: you can write code that determines whether
> > an input character $c matches a glob or regex like [Z-a].  (Maybe.)
> > 
> > But, you CANNOT write code to generate all of the characters from Z to a
> This generates characters from decimal 8300 - 8400 (because that range
> includes raised and lowered digits which have the number and value
> properties equivalent to 0-9.
> 
> ----
> 
> No? 8300, 8400 arbitrary code points that contain raised and lowered numbers
> that have the number property (as does 0..9):
> 
> perl -we' use strict; use v5.16;
> my $c;
> for ($c=8300;$c<8400;++$c) {
[...]

As I said in the previous message, a brute force solution that enumerates
the ENTIRE Unicode code point space is not a valid answer.  I even gave
a bash program that does something very similar to your perl program,
just using a different segment of the code point space.

Given that both of us are capable of generating such a brute force
solution, how did you INTEND to use that solution to solve the actual
problem, which is "in an arbitrary locale, list all of the characters
from $start to $end in collating order"?

You can't simply translate $start and $end to single Unicode code point
values, enumerate the Unicode characters between those two points,
and translate those characters back to the user's locale.  That doesn't
give you the correct answer.  There will be extra characters in the
Unicode code point range that don't fit the solution, AND there will
be characters outside the Unicode code point range that SHOULD be in
the solution, but are missed.
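
To make the mismatch concrete, here is a small Python sketch (not bash,
and not anything from this thread).  It compares a plain code point
slice against a collation-based slice over the Latin-1 characters.
toy_key is a HYPOTHETICAL stand-in for a real locale's collation: it
ignores accents and case first, then breaks ties by the character
itself.  A real locale will order things differently, but the shape of
the problem is the same.

```python
import unicodedata


def toy_key(ch):
    """Hypothetical collation key (an assumption, not a real locale):
    base letter with accents and case stripped, then the character itself."""
    base = unicodedata.normalize("NFD", ch)[0].casefold()
    return (base, ch)


def code_point_slice(start, end):
    """Every character whose code point lies between start and end."""
    return {chr(cp) for cp in range(ord(start), ord(end) + 1)}


def collation_slice(start, end, universe, key=toy_key):
    """Every character in universe that collates between start and end."""
    lo, hi = key(start), key(end)
    return {ch for ch in universe if lo <= key(ch) <= hi}


# Candidate characters: printable ASCII plus the rest of Latin-1.
LATIN1 = [chr(cp) for cp in range(0x20, 0x100)]

by_code_point = code_point_slice("D", "f")
by_collation = collation_slice("D", "f", LATIN1)

# In the code point slice, but not between D and f in collating order.
extra = by_code_point - by_collation
# Between D and f in collating order, but outside the code point slice.
missed = by_collation - by_code_point
```

Under this toy ordering, extra picks up things like "G", "[" and "a",
while missed picks up accented letters such as "\u00e9", which sit far
outside the D..f code point range.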

The only way to do it is to iterate over the ENTIRE code point space,
which currently runs from U+0000 through U+10FFFF (roughly 1.1 million
code points).  Translate each of those code points back into the user's
locale, check whether that character sorts at or after $start, check
whether it sorts at or before $end, and include/exclude it from the
solution set.  Then, when you have the solution set, sort it one final
time to get it in order.
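
For illustration, the procedure described above looks roughly like this
in Python.  This is only a sketch under stated assumptions, not what
bash does or should do.  The function name brute_force_range and its
parameters are made up for the example.  The default key is
locale.strxfrm, which only reflects the user's locale if the caller has
already run locale.setlocale(locale.LC_COLLATE, ""); how well it copes
with every code point depends on the platform's C library.

```python
import locale
import sys


def brute_force_range(start, end, key=locale.strxfrm, code_points=None):
    """List every character that collates between start and end (inclusive).

    key maps a one-character string to something comparable; it stands in
    for the locale's collation order.  code_points defaults to the whole
    Unicode code point space except NUL, which strxfrm cannot accept.
    """
    if code_points is None:
        code_points = range(1, sys.maxunicode + 1)

    lo, hi = key(start), key(end)
    hits = []
    for cp in code_points:
        # Surrogates are not characters on their own; skip them.
        if 0xD800 <= cp <= 0xDFFF:
            continue
        ch = chr(cp)
        # Keep the character only if it sorts within [start, end].
        if lo <= key(ch) <= hi:
            hits.append(ch)

    # Final pass: put the survivors into collating order.
    hits.sort(key=key)
    return hits
```

With key=ord the collating order is just code point order, so
brute_force_range("Z", "a", key=ord) scans the whole space only to
return the eight ASCII characters from "Z" to "a".  With any real
locale key, the same full scan is needed before a single character can
be emitted.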

Is that what you are proposing bash should do, in order to get a working
brace expansion outside of the C locale?  I don't believe this is an
acceptable solution.