Path: csiph.com!goblin2!goblin1!goblin.stu.neva.ru!usenet.stanford.edu!not-for-mail
From: "G. Branden Robinson" <INVALID.NOREPLY@gnu.org>
Newsgroups: gnu.groff.bug
Subject: [bug #58930] take baby steps toward Unicode
Date: Sat, 15 Aug 2020 00:05:39 -0400 (EDT)
Lines: 120
Approved: bug-groff@gnu.org
Message-ID: <mailman.2152.1597464340.2739.bug-groff@gnu.org>
References: <20200810-095606.sv93119.42780@savannah.gnu.org> <20200814-100002.sv108747.62919@savannah.gnu.org> <20200814-220415.sv93119.59625@savannah.gnu.org> <20200814-222905.sv93119.21750@savannah.gnu.org> <20200815-040539.sv108747.29165@savannah.gnu.org>
NNTP-Posting-Host: lists.gnu.org
Mime-Version: 1.0
Content-Type: text/plain;charset=UTF-8
X-Trace: usenet.stanford.edu 1597464341 11785 209.51.188.17 (15 Aug 2020 04:05:41 GMT)
X-Complaints-To: action@cs.stanford.edu
To: "G. Branden Robinson" <g.branden.robinson@gmail.com>, Dave <saint.snit@gmail.com>, bug-groff@gnu.org
Envelope-to: bug-groff@gnu.org
X-PHP-Originating-Script: 1001:sendmail.php
X-Savane-Server: savannah.gnu.org:443 [209.51.188.72]
X-Savane-Project: groff
X-Savane-Tracker: bugs
X-Savane-Item-ID: 58930
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0
X-Apparently-From: 1.129.111.159 (Savane authenticated user gbranden)
In-Reply-To: <20200814-222905.sv93119.21750@savannah.gnu.org>
Xref: csiph.com gnu.groff.bug:1976

Follow-up Comment #4, bug #58930 (project groff):

[comment #2:]
> "\~" and "\ " _shouldn't_ be equivalent; they're documented as behaving
> differently.

No, not suggesting they should, just lamenting the total disjunctivity of the
set.

> 
> The input string "\[u00A0]" being equivalent to neither of these is exactly
> the problem this plank of this bug report is looking to solve.
> 
> It's only the character NO-BREAK SPACE in its Latin-1 form, which groff
> accepts as direct input, that groff recognizes and interprets as a nonbreaking
> space.  groff_char(7) (which I only now thought to check) says it maps to \~.
> But that appears to be less than 100% accurate:
> 
> 
> $ LC_CTYPE=en_US.iso88591 printf ".if '\u00A0'\~' .tm equal\n" | groff
> $ 
> 
> 
> But the upshot is, however groff interprets a Latin-1 A0, it really ought to
> interpret the form of that character emitted by preconv, \[u00A0],
> identically.

Yes, I think I agree here.  I can't think of a more appropriate mapping for
it.
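
For reference, the two input forms under discussion differ only in their
encoding.  What follows is a minimal sketch, assuming a POSIX shell with od
and tr available; the helper name hexbytes is made up for illustration and
is not part of groff.

```shell
# hexbytes is a hypothetical helper: print stdin as lowercase hex digits,
# with the spaces and newlines that od emits stripped out.
hexbytes() { od -An -tx1 | tr -d ' \n'; }

# NO-BREAK SPACE as a single Latin-1 byte, the form groff accepts directly.
latin1=$(printf '\240' | hexbytes)

# The same character encoded as UTF-8, the form preconv rewrites as \[u00A0].
utf8=$(printf '\302\240' | hexbytes)

echo "Latin-1: $latin1"
echo "UTF-8:   $utf8"
```

The first line reports the byte a0 and the second the bytes c2a0.  Both name
the same character, which is the argument for interpreting them identically.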

> So the only one I didn't cover was U+2007 FIGURE SPACE, which should map to
> groff's (already nonbreaking) \0.

Might as well sweep that one into this report, then.  Once the "where" to fix
this has been determined, the incremental effort to handle that one will
probably be tiny.

> > there are bunch of others (hair space, thin space, ideographic space,
> > ...) but I don't know what their breaking semantics are in Unicode.
> 
> Irrational, IMO.  Unicode considers U+2009 THIN SPACE and
> U+200A HAIR SPACE breakable, for no good reason that I can see.  Groff
> (quite sensibly, since the concept is sort of absurd) does not offer breaking
> versions of these spaces, and the only reason to add them would be strict
> compliance with a Unicode property that probably no one who uses those code
> points actually wants: I can't think of a single real-world use case for a
> breaking thin space (though perhaps this is merely a failure of my
> imagination).

Well, I can't think of one either.
 
> This is all another can of worms I intentionally didn't address in what I
> intended to be a simple change.

Hah.  This is Sparta^Wgroff!  Complexity rapidly ramifies.

> > 4. A non-breaking hyphen would then be something that looks
> > like \[hy] but doesn't actually break?
> 
> Yes.
> 
> > You can just use the character as-is in input.
> 
> Ah, I guess you used -Tutf8 output, where that does work.  (Somehow your
> groff command got stripped from your comment.)

The "somehow" was me not thinking to include it.

> All other output formats (notably -Tps and -Tpdf) produce "warning: can't
> find special character 'u2011'".

Okay, yes, that seems like another mapping issue.

And it appears to be a one-liner fix (morally).

tmac/pdf.tmac sources tmac/ps.tmac, so the fix only has to be made in one
place.

I did have to goose the loop count in the test up to 100.


$ ./build/test-groff -Tps ./EXPERIMENTS/non-breaking-hyphen.groff >|/tmp/2011.ps
troff: warning [p 1, 0.0i]: can't break line
$ ./build/test-groff -Tpdf ./EXPERIMENTS/non-breaking-hyphen.groff >|/tmp/2011.pdf
troff: warning [p 1, 0.0i]: can't break line
$ cat ./EXPERIMENTS/non-breaking-hyphen.groff
.pl 1v
.ds a a\[u2011]
.nr b 100 -1
.while \n+b \*a\c
$ git di tmac/ps.tmac
diff --git a/tmac/ps.tmac b/tmac/ps.tmac
index 18928765..860919e1 100644
--- a/tmac/ps.tmac
+++ b/tmac/ps.tmac
@@ -28,6 +28,9 @@
 .
 .cflags 8 \[an]
 .
+\# non-breaking hyphen
+.fchar \[u2011] -
+.
 .char \[radicalex] \h'-\w'\[sr]'u'\[radicalex]\h'\w'\[sr]'u'
 .fchar \[sqrtex] \[radicalex]
 .char \[mo] \h'.08m'\[mo]\h'-.08m'
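
To check several output devices in one go, something like the following
could be used.  It is only a sketch: check_devices and the GROFF variable
are hypothetical names invented here, and the helper simply searches troff's
standard error for the "can't find special character" warning quoted
earlier.

```shell
# check_devices FILE DEVICE...: hypothetical helper.  For each output
# device, report whether the formatter warns about a missing special
# character.  GROFF is an assumed override for the command to run.
check_devices() {
  file=$1
  shift
  for dev in "$@"; do
    # Send stderr down the pipe and discard the formatted output.
    if "${GROFF:-groff}" -T"$dev" "$file" 2>&1 >/dev/null |
        grep -q "can't find special character"; then
      echo "$dev: missing glyph"
    else
      echo "$dev: ok"
    fi
  done
}
```

In a build tree this might be run as
GROFF=./build/test-groff check_devices ./EXPERIMENTS/non-breaking-hyphen.groff ps pdf utf8
though it only detects this one warning and says nothing about whether the
line actually breaks.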


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?58930>
