From: Greg Wooledge
Newsgroups: gnu.bash.bug
Subject: Re: Unicode range and enumeration support.
Date: Mon, 23 Dec 2019 08:20:49 -0500
Message-ID: <20191223132049.GW851@eeg.ccf.org>
In-Reply-To: <5DFD68B9.3050202@tlinx.org>

On Fri, Dec 20, 2019 at 04:35:05PM -0800, L A Walsh wrote:
> On 2019/12/18 11:46, Greg Wooledge wrote:
> > To put it another way: you can write code that determines whether
> > an input character $c matches a glob or regex like [Z-a]. (Maybe.)
> >
> > But, you CANNOT write code to generate all of the characters from Z to a
> This generates characters from decimal 8300 - 8400 (because that range
> includes raised and lowered digits which have the number and value
> properties equivalent to 0-9.
>
> ----
>
> No? 8300, 8400 arbitrary code points that contain raised and lowered numbers
> that have the number property (as does 0..9):
>
> perl -we' use strict; use v5.16;
> my $c;
> for ($c=8300;$c<8400;++$c) {
[...]

As I said in the previous message, a brute force solution that enumerates
the ENTIRE Unicode code point space is not a valid answer. I even gave a
bash program that does something very similar to your perl program, just
using a different segment of the code point space.

Given that both of us are capable of generating such a brute force
solution, how did you INTEND to use that solution to solve the actual
problem, which is "in an arbitrary locale, list all of the characters
from $start to $end in collating order"?

You can't simply translate $start and $end to single Unicode code point
values, enumerate the Unicode characters between those two points, and
translate those characters back to the user's locale. That doesn't give
you the correct answer. There will be extra characters in the Unicode
code point range that don't fit the solution, AND there will be characters
outside the Unicode code point range that SHOULD be in the solution, but
are missed.

The only way to do it is to iterate over the ENTIRE code point space,
however many millions or billions of characters that is today. Translate
each of those millions of characters back into the user's locale, check
whether that character sorts after $start, check whether that character
sorts before $end, and include/exclude it from the solution set. Then,
when you have the solution set, sort it one final time to get it in order.

Is that what you are proposing bash should do, in order to get a working
brace expansion outside of the C locale? I don't believe this is an
acceptable solution.
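
For concreteness, here is a rough sketch of the brute-force procedure
described above. It is written in Python rather than bash purely for
brevity, and it is not something bash does or is proposed to do. The
function name collating_range and its parameters (collate, limit,
encoding) are hypothetical, made up for this illustration. It assumes
the locale's collation is available through locale.strcoll, and that
"translate back into the user's locale" can be approximated by checking
whether the character encodes in the locale's charset.

```python
import locale
import sys
from functools import cmp_to_key


def collating_range(start, end, collate=locale.strcoll,
                    limit=sys.maxunicode + 1, encoding=None):
    """Return every character that collates between start and end, inclusive.

    Hypothetical helper: a brute-force scan over code points 1..limit-1.
    """
    hits = []
    for cp in range(1, limit):          # skip NUL: strcoll rejects embedded nulls
        if 0xD800 <= cp <= 0xDFFF:
            continue                    # surrogates are not characters
        ch = chr(cp)
        if encoding is not None:
            try:
                ch.encode(encoding)     # representable in the locale's charset?
            except UnicodeEncodeError:
                continue
        # Keep the character only if start <= ch <= end in collating order.
        if collate(start, ch) <= 0 and collate(ch, end) <= 0:
            hits.append(ch)
    # The scan visited characters in code point order, so sort once more
    # to put the survivors into collating order.
    hits.sort(key=cmp_to_key(collate))
    return hits
```

To try it against a real locale, call locale.setlocale(locale.LC_COLLATE, "")
first and then something like collating_range("a", "z"). With the default
limit, that walks roughly 1.1 million code points and makes up to two
collation calls for each one, for a single range.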