Path: csiph.com!xmission!news.snarked.org!news.linkpendium.com!news.linkpendium.com!panix!usenet.stanford.edu!not-for-mail
From: L A Walsh <bash@tlinx.org>
Newsgroups: gnu.bash.bug
Subject: Unicode range and enumeration support.
Date: Wed, 18 Dec 2019 11:15:46 -0800
Lines: 98
Approved: bug-bash@gnu.org
Message-ID: <mailman.1098.1576696556.1979.bug-bash@gnu.org>
References: <5b5064a8-7175-42e7-1eb5-6374dee6c11e@redhat.com> <21761e28-c496-ff67-d7b7-628c9325085f@iki.fi> <9dd3a388-39b1-c059-de99-813f1e411764@case.edu> <5DF2987E.5000309@tlinx.org> <568aeaaa-22b3-c7b9-0e18-a92bef6d2ffb@iki.fi> <5DF2FE31.9070406@tlinx.org> <0ff3a920-94c2-b0c9-5631-0964955657aa@archlinux.org> <5DF3D78B.4090208@tlinx.org> <20191213184213.GO851@eeg.ccf.org> <5DF4BDF0.6000402@tlinx.org> <20191216163906.GV851@eeg.ccf.org> <5DFA7AE2.2060504@tlinx.org>
NNTP-Posting-Host: lists.gnu.org
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
X-Trace: usenet.stanford.edu 1576696556 26975 209.51.188.17 (18 Dec 2019 19:15:56 GMT)
X-Complaints-To: action@cs.stanford.edu
To: bug-bash@gnu.org
Envelope-to: bug-bash@gnu.org
User-Agent: Thunderbird
In-Reply-To: <20191216163906.GV851@eeg.ccf.org>
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x (no timestamps) [generic] [fuzzy]
X-Received-From: 173.164.175.65
X-BeenThere: bug-bash@gnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Bug reports for the GNU Bourne Again SHell <bug-bash.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/bug-bash>, <mailto:bug-bash-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/bug-bash>
List-Post: <mailto:bug-bash@gnu.org>
List-Help: <mailto:bug-bash-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/bug-bash>, <mailto:bug-bash-request@gnu.org?subject=subscribe>
X-Mailman-Original-Message-ID: <5DFA7AE2.2060504@tlinx.org>
X-Mailman-Original-References: <5b5064a8-7175-42e7-1eb5-6374dee6c11e@redhat.com> <21761e28-c496-ff67-d7b7-628c9325085f@iki.fi> <9dd3a388-39b1-c059-de99-813f1e411764@case.edu> <5DF2987E.5000309@tlinx.org> <568aeaaa-22b3-c7b9-0e18-a92bef6d2ffb@iki.fi> <5DF2FE31.9070406@tlinx.org> <0ff3a920-94c2-b0c9-5631-0964955657aa@archlinux.org> <5DF3D78B.4090208@tlinx.org> <20191213184213.GO851@eeg.ccf.org> <5DF4BDF0.6000402@tlinx.org> <20191216163906.GV851@eeg.ccf.org>
Xref: csiph.com gnu.bash.bug:15748

On 2019/12/16 08:39, Greg Wooledge wrote:
> On Sat, Dec 14, 2019 at 02:48:16AM -0800, L A Walsh wrote:
>
>> On 2019/12/13 10:42, Greg Wooledge wrote:
>>
>>> There's a larger issue to be addressed first.  The man page says,
>>>     [...]
>>>     sary.  When characters are supplied, the  expression  expands  to  each
>>>     character  lexicographically  between x and y, inclusive, using the
>>>     default C locale.
>>>
>
>
>> ----
>>    If it says letters that lends stronger support to including
>> unicode ranges of letters and numbers since the shell handles unicode and
>> brace expansions with unicode filenames works just fine.  That ranges don't
>> seems a bit of a wart.
>>
>
> No, it won't include Unicode, because it very clearly says "C locale"
> right up there.
>
----
    At one point in time, Bash only supported the C locale for display
and input.  That isn't the case in the current Bash.  Just because
something wasn't so in the past doesn't mean things can't or won't change
in the future.  If that were true, we wouldn't have computers.
> The problem is, it is *not possible* to extract the set of characters
> out of an arbitrary locale.  The locale interfaces simply are not built
> to allow it.
>
> You can do it in the C locale, simply because the C locale is a known,
> fixed quantity that you can hard-code.  You can't do it in any other locale.
>
----
    You can do it in Perl, JavaScript, Python, Ruby, C, and C++, among others,
where range matching can identify characters of a specific type out of
arbitrary locales.  For example (from
https://www.regular-expressions.info/unicode.html):


    \p{L} or \p{Letter}: any kind of letter from any language.
    \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an
        uppercase variant.
    \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a
        lowercase variant.
    ...
    \p{Math_Symbol}: any mathematical symbol.
    \p{N} or \p{Number}: any kind of numeric character in any script.
    \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any
        script except ideographic scripts.


    Those can be combined with script-name properties from any
script in Unicode (Common, Arabic, Braille, Cherokee, Devanagari...Thai,
Tibetan, Yi).  The list of supported scripts is very extensive.  Tables are
published in machine-readable form and are used to build support for
range matching and enumeration over a huge number of characters.
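
    As a rough illustration of enumeration driven by those published
tables, here is a minimal sketch in Python, chosen only because its
standard library ships the Unicode character database.  The function
name chars_in_range and its category_prefix parameter are made up for
this example; nothing like this exists in Bash today.

```python
import unicodedata


def chars_in_range(first, last, category_prefix="L"):
    """Return the characters from first to last (inclusive, by code point)
    whose Unicode general category starts with category_prefix.

    Hypothetical helper for illustration; not part of Bash or any library.
    """
    lo, hi = ord(first), ord(last)
    if lo > hi:
        lo, hi = hi, lo
    return [
        chr(cp)
        for cp in range(lo, hi + 1)
        if unicodedata.category(chr(cp)).startswith(category_prefix)
    ]


# Lowercase Greek letters from alpha (U+03B1) through omega (U+03C9).
greek = chars_in_range("α", "ω", "Ll")
print(" ".join(greek))

# Letters only between 'Z' and 'a': the ASCII punctuation in between is skipped.
print(chars_in_range("Z", "a", "L"))
```

Filtering on the general category is what keeps the punctuation that
sits between 'Z' and 'a' out of a "letters" range.  Filtering by script
would need data that Python's unicodedata module does not expose, so
that part is left out of the sketch.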

    I.e., you can do it in pretty much any locale supported by Unicode, not
just the C locale.  I can't begin to list all the references for this,
but just googling on:

"programming language support for ranges of numbers or alphabets in
unicode"

will show a huge number of references.

Such features could be put in [a] loadable module[s], or made "includable"
at build time to manage memory if desired/needed.

    OTOH, I already said if one didn't want to do ranges, one could follow
the easier path (I think) and allow any arbitrary Unicode range to be
enumerated while ensuring quoting of ASCII-ranged meta characters.
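
    A minimal sketch of that easier path, again in Python rather than in
Bash's own C source: walk the code points between the two endpoints and
quote only the ASCII ones a shell would treat specially.  The name
enumerate_range is hypothetical, and using shlex.quote to decide what
counts as a "meta character" is my assumption, not a statement of how
Bash would or should do it.

```python
import shlex


def enumerate_range(first, last):
    """Expand first..last by code point, quoting unsafe ASCII characters.

    Hypothetical helper for illustration; counts down when first > last.
    """
    lo, hi = ord(first), ord(last)
    step = 1 if lo <= hi else -1
    words = []
    for cp in range(lo, hi + step, step):
        ch = chr(cp)
        # Assumption: only ASCII characters can be shell meta characters,
        # so everything at or above U+0080 is passed through untouched.
        words.append(shlex.quote(ch) if cp < 128 else ch)
    return words


# Non-ASCII endpoints expand to the plain characters.
print(" ".join(enumerate_range("α", "ε")))

# An ASCII range that crosses punctuation gets the risky characters quoted.
print(" ".join(enumerate_range("Z", "a")))
```

No character-class tables are needed for this variant; the only
knowledge required is which ASCII characters are unsafe to emit bare.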