From: "G. Branden Robinson"
Newsgroups: gnu.groff.bug
Subject: [bug #58930] take baby steps toward Unicode
Date: Sat, 15 Aug 2020 00:05:39 -0400 (EDT)

Follow-up Comment #4, bug #58930 (project groff):

[comment #2 comment #2:]
> "\~" and "\ " _shouldn't_ be equivalent; they're documented as behaving
> differently.

No, not suggesting they should, just lamenting the total disjunctivity of the
set.

> The input string "\[u00A0]" being equivalent to neither of these is exactly
> the problem this plank of this bug report is looking to solve.
>
> It's only the character NO-BREAK SPACE in its Latin-1 form, which groff
> accepts as direct input, that groff recognizes and interprets as a
> nonbreaking space. groff_char(7) (which I only now thought to check) says it
> maps to \~. But that appears to be less than 100% accurate:
>
> $ LC_CTYPE=en_US.iso88591 printf ".if '\u00A0'\~' .tm equal\n" | groff
> $
>
> But the upshot is, however groff interprets a Latin-1 A0, it really ought to
> interpret the form of that character emitted by preconv, \[u00A0],
> identically.

Yes, I think I agree here. I can't think of a more appropriate mapping for it.

> So the only one I didn't cover was U+2007 FIGURE SPACE, which should map to
> groff's (already nonbreaking) \0.

Might as well sweep that one into this report, then. Once the "where" to fix
this has been determined, the incremental effort to handle that one will
probably be tiny.

> > there are bunch of others (hair space, thin space, ideographic space,
> > ...) but I don't know what their breaking semantics are in Unicode.
>
> Irrational, IMO. Unicode considers U+2009 THIN SPACE and
> U+200A HAIR SPACE breakable, for no good reason that I can see. Groff (quite
> sensibly, since the concept is sort of absurd) does not offer breaking
> versions of these spaces, and the only reason to add them would be strict
> compliance with a Unicode property that probably no one who uses those code
> points actually wants: I can't think of a single real-world use case for a
> breaking thin space (though perhaps this is merely a failure of my
> imagination).

Well, I can't think of one either.

> This is all another can of worms I intentionally didn't address in what I
> intended to be a simple change.

Hah. This is Sparta^Wgroff! Complexity rapidly ramifies.

> > 4. A non-breaking hyphen would then be something that looks
> > like \[hy] but doesn't actually break?
>
> Yes.
>
> > You can just use the character as-is in input.
>
> Ah, I guess you used -Tutf8 output, where that does work. (Somehow your
> groff command got stripped from your comment.)

The "somehow" was me not thinking to include it.

> All other output formats (notably -Tps and -Tpdf) produce "warning: can't
> find special character 'u2011'".

Okay, yes, that seems like another mapping issue. And it appears to be a
one-liner fix (morally). tmac/pdf.tmac sources tmac/ps.tmac so the fix only
has to be made in one place.

I did have to goose the loop count in the test up to 100.

$ ./build/test-groff -Tps ./EXPERIMENTS/non-breaking-hyphen.groff >|/tmp/2011.ps
troff: warning [p 1, 0.0i]: can't break line
$ ./build/test-groff -Tpdf ./EXPERIMENTS/non-breaking-hyphen.groff >|/tmp/2011.pdf
troff: warning [p 1, 0.0i]: can't break line
$ cat ./EXPERIMENTS/non-breaking-hyphen.groff
.pl 1v
.ds a a\[u2011]
.nr b 100 -1
.while \n+b \*a\c
$ git di tmac/ps.tmac
diff --git a/tmac/ps.tmac b/tmac/ps.tmac
index 18928765..860919e1 100644
--- a/tmac/ps.tmac
+++ b/tmac/ps.tmac
@@ -28,6 +28,9 @@
 .
 .cflags 8 \[an]
 .
+\# non-breaking hyphen
+.fchar \[u2011] -
+.
 .char \[radicalex] \h'-\w'\[sr]'u'\[radicalex]\h'\w'\[sr]'u'
 .fchar \[sqrtex] \[radicalex]
 .char \[mo] \h'.08m'\[mo]\h'-.08m'
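
As a side check on the Unicode properties discussed in this comment, the
following Python sketch (not part of the bug report; the helper name
is_no_break is hypothetical) lists which of the code points mentioned carry a
<noBreak> compatibility decomposition in the Unicode Character Database. That
tag is only a proxy for the line-breaking classes defined in UAX #14, but it
does separate U+00A0, U+2007, and U+2011 from the thin and hair spaces.

```python
import unicodedata

# Code points mentioned in this thread.
CODE_POINTS = [0x00A0, 0x2007, 0x2009, 0x200A, 0x2011]


def is_no_break(cp: int) -> bool:
    """Return True if the character has a <noBreak> compatibility decomposition.

    Assumption: the decomposition tag is used here as a stand-in for the
    UAX #14 line-breaking class, which the standard library does not expose.
    """
    return unicodedata.decomposition(chr(cp)).startswith("<noBreak>")


def report():
    """Return (label, name, no_break) tuples for the code points above."""
    return [
        (f"U+{cp:04X}", unicodedata.name(chr(cp)), is_no_break(cp))
        for cp in CODE_POINTS
    ]


if __name__ == "__main__":
    for label, name, no_break in report():
        print(f"{label}  {name:<20}  noBreak={no_break}")
```

Run with any Python 3 interpreter. U+00A0, U+2007, and U+2011 report
noBreak=True; U+2009 and U+200A report noBreak=False, which matches the
observation above that Unicode treats the thin and hair spaces as breakable.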