From: L A Walsh
Newsgroups: gnu.bash.bug
Subject: Unicode range and enumeration support.
Date: Wed, 18 Dec 2019 11:15:46 -0800
To: bug-bash@gnu.org
Message-ID: <5DFA7AE2.2060504@tlinx.org>
In-Reply-To: <20191216163906.GV851@eeg.ccf.org>

On 2019/12/16 08:39, Greg Wooledge wrote:
> On Sat, Dec 14, 2019 at 02:48:16AM -0800, L A Walsh wrote:
>
>> On 2019/12/13 10:42, Greg Wooledge wrote:
>>
>>> There's a larger issue to be addressed first.  The man page says,
>>> [...]
>>>        sary.  When characters are supplied, the expression expands to each
>>>        character lexicographically between x and y, inclusive, using the
>>>        default C locale.
>>>
>
>> ----
>>     If it says letters that lends stronger support to including
>> unicode ranges of letters and numbers since the shell handles unicode and
>> brace expansions with unicode filenames works just fine.  That ranges don't
>> seems a bit of a wart.
>>
>
> No, it won't include Unicode, because it very clearly says "C locale"
> right up there.
>
----
    At one point, Bash supported only the C locale for display and
input. That is no longer the case in current Bash. That something
wasn't so in the past doesn't mean it can't or won't change in the
future. If that were true, we wouldn't have computers.

> The problem is, it is *not possible* to extract the set of characters
> out of an arbitrary locale.  The locale interfaces simply are not built
> to allow it.
>
> You can do it in the C locale, simply because the C locale is a known,
> fixed quantity that you can hard-code.  You can't do it in any other locale.
>
----
    You can do it in Perl, JavaScript, Python, Ruby, C, and C++, among
others, where range matching can identify characters of a specific type
out of arbitrary locales.

For example (from https://www.regular-expressions.info/unicode.html):

\p{L} or \p{Letter}: any kind of letter from any language.
    \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an
        uppercase variant.
    \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a
        lowercase variant.
...
\p{Math_Symbol}: any mathematical symbol.
\p{N} or \p{Number}: any kind of numeric character in any script.
    \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in
        any script except ideographic scripts.

Those can be intersected with script-name properties for any script in
Unicode (Common, Arabic, Braille, Cherokee, Devanagari ... Thai,
Tibetan, Yi). The list of supported scripts is very extensive. Tables
are published in machine-readable form and are used to build range
matching and enumeration support for a huge number of characters.

    I.e. you can do it in pretty much any locale supported by Unicode,
not just the C locale.

    I can't begin to list all the references for this, but googling
"programming language support for ranges of numbers or alphabets in
unicode" will turn up a huge number of them.

Such features could be put in [a] loadable module[s], or made
"includable" at build time to manage memory if desired/needed.

    OTOH, I already said that if one didn't want to do ranges, one
could follow the easier path (I think) and allow any arbitrary Unicode
range to be enumerated while ensuring quoting of ASCII-ranged
metacharacters.
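
To make the "easier path" concrete, here is a rough sketch in Python
rather than Bash. It is only an illustration under some assumptions:
expand_range and shell_words are made-up names, Python's unicodedata
tables stand in for whatever tables Bash might be built with, and
shlex.quote stands in for Bash's own quoting. It walks a range by code
point, optionally keeps only one general category (the "L" or "Nd" of
\p{L} and \p{Nd}), and quotes each result.

```python
import shlex
import unicodedata


def expand_range(start, end, category_prefix=None):
    """Return the characters from start to end, inclusive, by code point.

    Hypothetical helper, not part of Bash. If category_prefix is given
    (e.g. "L" for letters, "Nd" for decimal digits), keep only characters
    whose Unicode general category starts with that prefix.
    """
    lo, hi = ord(start), ord(end)
    step = 1 if lo <= hi else -1  # allow descending ranges such as e..a
    chars = []
    for cp in range(lo, hi + step, step):
        ch = chr(cp)
        cat = unicodedata.category(ch)
        # Surrogates (Cs) and unassigned code points (Cn) are never emitted.
        if cat in ("Cs", "Cn"):
            continue
        if category_prefix and not cat.startswith(category_prefix):
            continue
        chars.append(ch)
    return chars


def shell_words(chars):
    """Join characters into one line, quoting anything a shell might
    interpret. shlex.quote also quotes non-ASCII characters, which is
    harmless but more than strictly needed."""
    return " ".join(shlex.quote(ch) for ch in chars)


if __name__ == "__main__":
    # Greek alpha through epsilon, enumerated by code point.
    print(shell_words(expand_range("\u03b1", "\u03b5")))
    # Arabic-Indic digits, filtered to general category Nd.
    print(shell_words(expand_range("\u0660", "\u0669", "Nd")))
    # ASCII punctuation: metacharacters come back quoted.
    print(shell_words(expand_range("#", "&")))
```

With this sketch, the range "#" through "&" comes back as
'#' '$' % '&', so the ASCII metacharacters are quoted and "%" is left
alone. Plain code point order sidesteps locale collation entirely,
which is the point of the easier path.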