Newsgroups: perl.perl5.porters
From: perl@khwilliamson.com (Karl Williamson)
Date: Sun, 8 Feb 2026 15:45:15 -0700
Subject: PPC: restrict legal identifier names for security
To: perl5-porters@perl.org
Message-ID: <58a71a37-0ebf-46af-be5d-767d2e817975@khwilliamson.com>

I'm resubmitting this with an indication that I want the PSC to consider it. This would change the language we present to XS authors.

Accepting any combination of legal Unicode Identifier characters has led to security problems. Hence Unicode has added guidance that we are not following.

I'm proposing that the PSC and other interested parties familiarize themselves with

https://www.unicode.org/reports/tr39/ "Unicode Security Mechanisms"

and

https://www.unicode.org/reports/tr55/ "Unicode Source Code Handling"

so that we can discuss which parts we might want to implement.

I think it's a no-brainer that we stop accepting deprecated characters (there are 20-ish of these). But there's much more there.

The benefits are fewer potential security holes, including many that none of us on this project has the background to be aware of. We would be using accumulated knowledge from the bitter experiences of others. This could detect existing trojans, and we could get them removed.

The downside is that we might break existing legitimate code. The more restrictions we impose, the more likely breakage becomes. This could be alleviated by enabling some restrictions only under a 'use v5.xx'.

Unicode divides all its code points into two classes: Allowed and Restricted. These are further divided into subcategories, as to why they are in the given class.

My straw proposal would be to restrict identifiers in Perl to the Allowed class. That would exclude characters (beyond what we already don't accept) from these subcategories: Deprecated, Obsolete, Uncommon_Use, Limited_Use, Exclusion, Technical, and Default_Ignorable.

Default_Ignorable are ones that you are free to skip. The most common example is the Soft Hyphen. There are 4K of them.

Exclusion are from scripts that Unicode doesn't think would or should be in code. Besides obvious things like hieroglyphics, there is Coptic and other scripts. There are 22K of these.

Obsolete are characters that don't occur in modern usage. There are 8K of these.

Uncommon_Use are characters that occur rarely in modern usage. There are 83K of these.

Limited_Use have more modern use than Uncommon_Use ones. These come from scripts less likely to be used in commercial computing, such as Cherokee. There are 5000 of these.

Technical don't include any computer ones I could see; rather they are more for linguists, including Braille, and things like Byzantine Musical notation. There are 1900 of these.

We could also restrict identifiers so that every character must be in the same script. Unicode doesn't go this far, saying that an identifier may contain multiple scripts, but each script would form a "chunk". The definition of chunk is not given. But they have an example of a web server in Russia whose name would begin with HTTP and then switch to Cyrillic. Both H and P have Cyrillic look-alikes that have different meanings. T is also a Cyrillic letter, but has the same meaning in the Latin alphabet.
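
To make the single-script and "chunk" ideas concrete, here is a minimal sketch in Python (chosen only for illustration; it is not a proposed implementation for perl). It rests on two loudly stated assumptions. First, the script of a character is guessed from the first word of its Unicode name, which is only a rough stand-in for the real Script property. Second, since the definition of a chunk is not given, a chunk is taken here to be a maximal run of characters from one script, with digits and underscore joining whichever run they touch. The helper names script_of, script_runs and is_single_script are hypothetical.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Guess a character's script from the first word of its Unicode name.

    Assumption: this approximates the Script property, which Python's
    standard library does not expose directly.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def script_runs(identifier: str) -> list[tuple[str, str]]:
    """Split an identifier into runs of same-script characters.

    Assumption: a "chunk" is a maximal same-script run; digits and
    underscore are treated as script-neutral and join the adjacent run.
    """
    runs: list[list[str]] = []
    for ch in identifier:
        script = "COMMON" if ch == "_" or ch.isdigit() else script_of(ch)
        joins = runs and (
            script == "COMMON" or runs[-1][0] in (script, "COMMON")
        )
        if joins:
            if runs[-1][0] == "COMMON":
                # A neutral run takes on the first real script it meets.
                runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, text) for script, text in runs]


def is_single_script(identifier: str) -> bool:
    """True if all non-neutral characters come from one script."""
    scripts = {s for s, _ in script_runs(identifier) if s != "COMMON"}
    return len(scripts) <= 1


if __name__ == "__main__":
    # Latin "HTTP" followed by a Cyrillic word, as in the Russian
    # web server example.
    name = "HTTP\u0441\u0435\u0440\u0432\u0435\u0440"
    print(script_runs(name))
    print(is_single_script(name))
    # U+041D (Cyrillic EN) looks like Latin H but is a different letter.
    print(script_of("H"), script_of("\u041d"))
```

Under the strict same-script rule the HTTP-then-Cyrillic name would be rejected, while a chunk-based rule would accept it as two runs. An identifier that swaps a Latin H for U+041D inside an otherwise Latin word would show up as an extra run under either rule.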