
PPC: restrict legal identifier names for security

Newsgroups	perl.perl5.porters
Message-ID	<58a71a37-0ebf-46af-be5d-767d2e817975@khwilliamson.com> (permalink)
Date	2026-02-08 15:45 -0700
Subject	PPC: restrict legal identifier names for security
From	perl@khwilliamson.com (Karl Williamson)


I'm resubmitting this with an indication that I want the PSC to consider 
it.  This would change the language we present to XS authors.

Accepting any combination of legal Unicode Identifier characters has led 
to security problems.  Hence Unicode has added guidance that we are not 
following.

I'm proposing that the PSC and other interested parties familiarize
themselves with https://www.unicode.org/reports/tr39/ "Unicode Security
Mechanisms" and https://www.unicode.org/reports/tr55/ "Unicode Source
Code Handling" so that we can discuss which recommendations we might
want to implement.  I think it's a no-brainer that we stop accepting
deprecated characters (there are 20-ish of these).  But there's much
more there.

The benefits are fewer potential security holes, including many that
none of us on this project has the background to be aware of.  We would be using 
accumulated knowledge from bitter experiences of others.  This could 
detect existing trojans and we could get them removed.

The downside is we might break existing legitimate code.  The more
restrictions we impose, the more likely breakage becomes.  This could
be alleviated by enabling some restrictions only under a 'use v5.xx'.


Unicode divides all its code points into two classes: Allowed and
Restricted.  These are further divided into subcategories that say why
a code point is in the given class.  My straw proposal would be to
restrict identifiers in Perl to the Allowed class.  That would exclude
characters (beyond what we already don't accept) from these Restricted
subcategories: Deprecated, Obsolete, Uncommon_Use, Limited_Use,
Exclusion, Technical, and Default_Ignorable.
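
A minimal sketch of what "restrict identifiers to Allowed" could look
like, written in Python for brevity rather than as a proposed perl
implementation.  It assumes the semicolon-separated layout of the
IdentifierStatus.txt data file that accompanies UTS #39.  The SAMPLE
table is hand-written for illustration and is not a copy of the real
file, and the names parse_allowed and restricted_chars are hypothetical.

```python
# Hand-written sample in a "code point(s) ; status # comment" layout.
# Illustrative only: the real UTS #39 data file is far larger.
SAMPLE = """\
0030..0039 ; Allowed # digits
0041..005A ; Allowed # A-Z
005F       ; Allowed # underscore
0061..007A ; Allowed # a-z
"""


def parse_allowed(text):
    """Return (first, last) inclusive code point ranges marked Allowed."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        points, status = (field.strip() for field in line.split(";"))
        if status != "Allowed":
            continue
        first, _, last = points.partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges


def restricted_chars(identifier, allowed):
    """Return the characters of identifier outside every Allowed range."""
    return [ch for ch in identifier
            if not any(lo <= ord(ch) <= hi for lo, hi in allowed)]


ALLOWED = parse_allowed(SAMPLE)
```

In this sketch anything missing from the table counts as not allowed, so
restricted_chars("my_var1", ALLOWED) returns an empty list, while an
identifier containing a Coptic letter such as U+2C80 gets that letter
reported back.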

Default_Ignorable characters are ones that you are free to skip when
rendering.  The most common example is the Soft Hyphen.  There are
roughly 4K of them.
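
To illustrate why such characters matter in identifiers, here is a small
Python sketch showing two names that usually render identically but
compare unequal.  The helper format_chars is hypothetical, and using
general category Cf is only a rough stand-in: the real
Default_Ignorable set is defined by its own Unicode property and covers
more than Cf.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

plain = "password"
sneaky = "pass" + SOFT_HYPHEN + "word"  # usually displays the same as plain


def format_chars(identifier):
    """Return characters whose general category is Cf (format).

    Assumption: Cf approximates Default_Ignorable well enough for a demo.
    """
    return [ch for ch in identifier if unicodedata.category(ch) == "Cf"]
```

Here plain == sneaky is False even though the two look alike on most
displays, and format_chars(sneaky) picks out the hidden Soft Hyphen.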

Exclusion characters are from scripts that Unicode doesn't think would
or should be in code.  Besides obvious things like hieroglyphics, there
is Coptic and other scripts.  There are roughly 22K of them.

Obsolete are characters that don't occur in modern usage.  Roughly 8K.

Uncommon_Use are characters that occur rarely in modern usage.  Roughly
83K of these.

Limited_Use characters have more modern use than Uncommon_Use ones.
These come from scripts less likely to be used in commercial computing,
such as Cherokee.  Roughly 5000.

Technical characters don't include any computing-related ones that I
could see; rather they are more for linguists, including Braille, and
things like Byzantine Musical notation.  Roughly 1900.

We could also require every character in an identifier to be in the
same script.  Unicode doesn't go this far, saying you could have
identifiers that mix multiple scripts, but each script would form a
"chunk".  The definition of chunk is not given.  But they have an
example for a web server in Russia whose name would begin with HTTP and
then switch to Cyrillic.  Both H and P have Cyrillic look-alikes that
have different meanings.  T is also a Cyrillic letter, but has the same
meaning in the Latin alphabet.
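
A rough Python sketch of a single-script check, using the HTTP example.
Python's standard library does not expose the Unicode Script property,
so this takes the first word of each letter's character name as a crude
proxy for its script.  That is an assumption made for illustration only;
a real implementation would consult the actual Script or
Script_Extensions data.  The function name scripts_used is hypothetical.

```python
import unicodedata


def scripts_used(identifier):
    """Return a crude set of script names for the letters in identifier.

    Assumption: the first word of a letter's Unicode name (LATIN,
    CYRILLIC, GREEK, ...) stands in for its Script property.
    Digits and underscores are ignored.
    """
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


latin_http = "HTTP"
# U+041D (Cyrillic En), U+0422 (Te), U+0422 (Te), U+0420 (Er)
cyrillic_lookalike = "\u041d\u0422\u0422\u0420"
# Latin H, T, then Cyrillic Te, then Latin P
mixed = "HT\u0422P"
```

With these inputs, scripts_used(latin_http) yields only LATIN,
scripts_used(cyrillic_lookalike) yields only CYRILLIC, and
scripts_used(mixed) yields both, which is what a same-script rule would
reject.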


