[bug #58930] take baby steps toward Unicode

From	"G. Branden Robinson" <INVALID.NOREPLY@gnu.org>
Newsgroups	gnu.groff.bug
Subject	[bug #58930] take baby steps toward Unicode
Date	2020-08-14 06:00 -0400
Message-ID	<mailman.2034.1597399205.2739.bug-groff@gnu.org>
References	<20200810-095606.sv93119.42780@savannah.gnu.org> <20200814-100002.sv108747.62919@savannah.gnu.org>

Update of bug #58930 (project groff):

                  Status:                    None => Need Info              
             Assigned to:                    None => gbranden               

    _______________________________________________________

Follow-up Comment #1:

It's a little demoralizing that even these baby steps seem fraught with
complication.

1. "U+00A0 NO-BREAK SPACE

This character is in the Latin-1 character set, which groff recognizes, and
when groff's input is in Latin-1 encoding, it correctly handles this character
(though I'm not certain whether it interprets it as "\~" or "\ ")."

None of the above, it seems:


$ cat EXPERIMENTS/spaces.groff
.pl 1v
.if '\ '\ ' \eSP = \eSP
.if '\ '\~' \eSP = \e\[ti]
.if '\ '\[u00A0]' \eSP = \e[u00A0]
.br
.if '\~'\ ' \e\[ti] = \eSP
.if '\~'\~' \e\[ti] = \e\[ti]
.if '\~'\[u00A0]' \e\[ti] = \e[u00A0]
.br
.if '\[u00A0]'\ ' \e[u00A0] = \eSP
.if '\[u00A0]'\~' \e[u00A0] = \e\[ti]
.if '\[u00A0]'\[u00A0]' \e[u00A0] = \e[u00A0]
$ ./build/test-groff -Tutf8
\SP = \SP
\~ = \~
\[u00A0] = \[u00A0]


None of these are equivalent to the others. :-/

2. The behavior of \: when used as the RHS of a .char request does indeed seem
a bit strange.  It looks like the transform is just not happening:


.pl 1v
.char \[u200B] \:
.ds a \[u200B]
.length i \*a
\ni
8

.pl 1v
.ds a \[u200B]
.length i \*a
\ni
8

.pl 1v
.char a b
.ds a a
\*a
b


That unchanged length of 8, the exact character count of "\[u200B]", is highly
suspicious to me.
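
The count itself is easy to confirm outside groff.  A minimal Python sketch
follows; Python is used purely for illustration, and the variable name is
hypothetical.  It only counts the characters of the unexpanded escape
sequence and says nothing about how .length works internally.

```python
# The escape sequence exactly as typed in the input; a raw string keeps
# the backslash literal instead of treating it as a Python escape.
escape = r"\[u200B]"

# Number of characters in the unexpanded escape sequence.
print(len(escape))
```

If the .char definition were being applied before measurement, one would
presumably not expect .length to report this same figure.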

3. Narrow no-break space.  Have you named all of the non-breaking spaces in
Unicode in this ticket?  I know there are a bunch of other spaces (hair space,
thin space, ideographic space, ...) but I don't know what their breaking
semantics are in Unicode.
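
One rough way to get a first answer, sketched in Python under some
assumptions: the authoritative source for breaking behavior is the Line_Break
property, which Python's standard library does not expose, so this uses the
<noBreak> tag on a character's compatibility decomposition as a proxy.  The
candidate list and the helper name are illustrative only, and the list is not
exhaustive.

```python
import unicodedata

# Illustrative selection of space characters; not an exhaustive list.
CANDIDATES = [0x00A0, 0x2007, 0x2009, 0x200A, 0x202F, 0x3000]


def has_nobreak_tag(codepoint):
    """Return True if the character's decomposition is tagged <noBreak>.

    This is only a proxy for non-breaking behavior; the Line_Break
    property is the authoritative source and is not consulted here.
    """
    decomposition = unicodedata.decomposition(chr(codepoint))
    return decomposition.startswith("<noBreak>")


for cp in CANDIDATES:
    name = unicodedata.name(chr(cp))
    print(f"U+{cp:04X} {name}: {has_nobreak_tag(cp)}")
```

By this proxy, no-break space, figure space, and narrow no-break space carry
the tag, while thin space, hair space, and ideographic space do not.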

4. A non-breaking hyphen would then be something that looks like \[hy] but
doesn't actually break?  I don't know that this is actually the hardest of the
tasks on this list.  You can just use the character as-is in input.  groff
doesn't know it's a hyphen, and no hyphenation patterns include it, so it
never gets a break after it.


$ cat EXPERIMENTS/non-breaking-hyphen.groff
.pl 1v
.ds a a\[u2011]
.nr b 50 -1
.while \n+b \*a\c

troff: warning [p 1, 0.0i]: can't break line
a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑


Let me know what you think of these findings.

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?58930>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/
