From: "G. Branden Robinson"
Newsgroups: gnu.groff.bug
Subject: [bug #58930] take baby steps toward Unicode
Date: Fri, 14 Aug 2020 06:00:02 -0400 (EDT)
To: "G. Branden Robinson", Dave, bug-groff@gnu.org
In-Reply-To: <20200810-095606.sv93119.42780@savannah.gnu.org>

Update of bug #58930 (project groff):

                  Status:                    None => Need Info
             Assigned to:                    None => gbranden

    _______________________________________________________

Follow-up Comment #1:

It's a little demoralizing that even these baby steps seem fraught with
complication.

1.
"U+00A0 NO-BREAK SPACE

This character is in the Latin-1 character set, which groff recognizes,
and when groff's input is in Latin-1 encoding, it correctly handles this
character (though I'm not certain whether it interprets it as "\~" or
"\ ")."

None of the above, it seems:

    $ cat EXPERIMENTS/spaces.groff
    .pl 1v
    .if '\ '\ ' \eSP = \eSP
    .if '\ '\~' \eSP = \e\[ti]
    .if '\ '\[u00A0]' \eSP = \e[u00A0]
    .br
    .if '\~'\ ' \e\[ti] = \eSP
    .if '\~'\~' \e\[ti] = \e\[ti]
    .if '\~'\[u00A0]' \e\[ti] = \e[u00A0]
    .br
    .if '\[u00A0]'\ ' \e[u00A0] = \eSP
    .if '\[u00A0]'\~' \e[u00A0] = \e\[ti]
    .if '\[u00A0]'\[u00A0]' \e[u00A0] = \e[u00A0]
    $ ./build/test-groff -Tutf8
    \SP = \SP
    \~ = \~
    \[u00A0] = \[u00A0]

Each of the three compares equal only to itself; none is equivalent to
either of the others. :-/

2. The behavior of \: when used as the RHS of a .char request does
indeed seem a bit strange. It looks like the transform is just not
happening:

    .pl 1v
    .char \[u200B] \:
    .ds a \[u200B]
    .length i \*a
    \ni
    8

    .pl 1v
    .ds a \[u200B]
    .length i \*a
    \ni
    8

    .pl 1v
    .char a b
    .ds a a
    \*a
    b

That unchanged length of 8, the exact character count of "\[u200B]", is
highly suspicious to me.

3. Narrow no-break space. Have you named all of the non-breaking spaces
in Unicode in this ticket? I know there are a bunch of other spaces
(hair space, thin space, ideographic space, ...) but I don't know what
their breaking semantics are in Unicode.

4. A non-breaking hyphen would then be something that looks like \[hy]
but doesn't actually break? I don't know that this is actually the
hardest of the tasks on this list. You can just use the character as-is
in input. groff doesn't know it's a hyphen, and no hyphenation patterns
include it, so it never gets a break after it.

    $ cat EXPERIMENTS/non-breaking-hyphen.groff
    .pl 1v
    .ds a a\[u2011]
    .nr b 50 -1
    .while \n+b \*a\c
    troff: warning [p 1, 0.0i]: can't break line
    a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑

Let me know what you think of these findings.
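On item 3, one rough way to enumerate candidates outside groff is to
ask the Unicode Character Database which characters carry a <noBreak>
compatibility decomposition. The sketch below uses Python's standard
unicodedata module; the helper name no_break_candidates and the scanned
range are assumptions made for illustration only. The <noBreak> tag is
just a hint: it is not the full line-breaking classification, so treat
the result as a starting list, not an authoritative answer about
breaking semantics.

```python
import unicodedata


def no_break_candidates(code_points):
    """Return (code point, name, decomposition) tuples for characters
    whose compatibility decomposition carries the <noBreak> tag.

    Hypothetical helper for illustration; not part of groff.
    """
    found = []
    for cp in code_points:
        ch = chr(cp)
        # decomposition() returns '' for characters without a mapping.
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith("<noBreak>"):
            found.append((f"U+{cp:04X}", unicodedata.name(ch, "?"), decomp))
    return found


if __name__ == "__main__":
    # Assumed range: Latin-1 Supplement through General Punctuation.
    for entry in no_break_candidates(range(0x0080, 0x2070)):
        print(*entry, sep="  ")
```

U+00A0, U+202F, and U+2011 are tagged this way, whereas thin space
(U+2009) and hair space (U+200A) decompose with <compat> instead, which
is consistent with their being ordinary breakable spaces.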
    _______________________________________________________

Reply to this item at:

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/