| Newsgroups | perl.perl5.porters |
|---|---|
| Message-ID | <58a71a37-0ebf-46af-be5d-767d2e817975@khwilliamson.com> (permalink) |
| Date | 2026-02-08 15:45 -0700 |
| Subject | PPC: restrict legal identifier names for security |
| From | perl@khwilliamson.com (Karl Williamson) |
I'm resubmitting this with an indication that I want the PSC to consider it. This would change the language we present to XS authors.

Accepting any combination of legal Unicode Identifier characters has led to security problems. Hence Unicode has added guidance that we are not following.

I'm proposing that the PSC and other interested parties familiarize themselves with https://www.unicode.org/reports/tr39/ "Unicode Security Mechanisms" and https://www.unicode.org/reports/tr55/ "Unicode Source Code Handling" so that we can discuss which parts we might want to implement.

I think it's a no-brainer that we stop accepting deprecated characters (there are 20-ish of these). But there's much more there.

The benefits are fewer potential security holes, including many that none of us on this project has the background to be aware of. We would be using accumulated knowledge from the bitter experiences of others. This could detect existing trojans and we could get them removed.

The downside is that we might break existing legitimate code. The more restrictions we impose, the more likely breakage becomes. This could be alleviated by enabling some restrictions only under a 'use v5.xx'.

Unicode divides all its code points into two classes: Allowed and Restricted. These are further divided into subcategories that say why a code point is in the given class.

My straw proposal would be to restrict identifiers in Perl to the Allowed class. That would exclude characters (beyond what we already don't accept) from these subcategories: Deprecated, Obsolete, Uncommon_Use, Limited_Use, Exclusion, Technical, and Default_Ignorable.

Default_Ignorable are ones that you are free to skip. The most common example is the Soft Hyphen. 4K of them.

Exclusion are from scripts that they don't think would or should be in code. Besides obvious things like hieroglyphics, there is Coptic and other scripts. 22K.

Obsolete are characters that don't occur in modern usage. 8K.

Uncommon_Use are characters that occur rarely in modern usage. 83K of these.

Limited_Use have more modern use than Uncommon_Use ones. These come from scripts less likely to be used in commercial computing, such as Cherokee. 5000.

Technical don't include any computer ones I could see; rather they are more for linguists, including Braille, and things like Byzantine Musical notation. 1900.

We could also restrict identifiers so that every character must be in the same script. Unicode doesn't go this far, saying you could have identifiers that contain multiple scripts, but each script would form a "chunk". The definition of chunk is not given. But they have an example of a web server in Russia whose name would begin with HTTP and then switch to Cyrillic. Both H and P have Cyrillic look-alikes that have different meanings. T is also a Cyrillic letter, but has the same meaning in the Latin alphabet.
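To make the HTTP look-alike case concrete, here is a minimal sketch in Python rather than Perl. It is not the TR39 algorithm and not a proposed implementation. Python's standard library does not expose the Unicode Script or Default_Ignorable properties, so the hypothetical helper `rough_script` guesses the script from the first word of the character name, and `contains_format_char` uses general category Cf as a loose stand-in for Default_Ignorable. Both are assumptions made only for illustration.

```python
import unicodedata


def rough_script(ch):
    """Crude stand-in for the Script property: first word of the character name."""
    if ch.isdigit() or ch == "_":
        return "COMMON"  # treat digits and underscore as script-neutral
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def scripts_in(identifier):
    """Return the set of (approximate) scripts used, ignoring neutral characters."""
    return {rough_script(ch) for ch in identifier} - {"COMMON"}


def is_single_script(identifier):
    """True if every script-bearing character appears to come from one script."""
    return len(scripts_in(identifier)) <= 1


def contains_format_char(identifier):
    """True if any character has general category Cf (e.g. U+00AD SOFT HYPHEN)."""
    return any(unicodedata.category(ch) == "Cf" for ch in identifier)


latin_http = "HTTP"
# Same visual shape, but with Cyrillic EN (U+041D) and ER (U+0420)
# standing in for Latin H and P.
spoofed_http = "\u041dTT\u0420"

for name in (latin_http, spoofed_http, "foo\u00adbar"):
    print(ascii(name), sorted(scripts_in(name)),
          is_single_script(name), contains_format_char(name))
```

Under the same-script rule, the all-Latin name passes while the spoofed one is rejected because it mixes Cyrillic and Latin. A real check would need the actual Script (or Script_Extensions) data and the TR39 identifier status tables rather than character names.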