Printing UTF8 (Unicode)

Newsgroups: comp.lang.postscript
From: David Newall <davidn@davidnewall.com>
Subject: Printing UTF8 (Unicode)
Message-ID: <4fe53d50-e66a-82b8-48fd-d0928e149698@davidnewall.com>
Date: 2022-01-21 21:56 +1100
Organization: Ausics - https://www.ausics.net

Hello All,

I've written some PostScript to allow me to print UTF-8-encoded strings:

    (UTF-8 Encoded String.....) utfshow

I'm happy to send you the full source, or, if appropriate, publish it 
here; however, the exposition below includes everything you should need.

I use a UTF-8 decoder which was written (in C) by Bjoern Hoehrmann (see
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/), translated here to
PostScript:

%/ Copyright (c) 2008-2010 Bjoern Hoehrmann <bjoern@hoehrmann.de>
%/ See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.

/UTF8_ACCEPT 0 def
/UTF8_REJECT 12 def

/utf8d [
%/ The first part of the table maps bytes to character classes, which
%/ reduces the size of the transition table and creates bitmasks.
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1   9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
  7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7   7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
  8 8 2 2 2 2 2 2 2 2 2 2 2 2 2 2   2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
10 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3  11 6 6 6 5 8 8 8 8 8 8 8 8 8 8 8

%/ The second part is a transition table that maps a combination
%/ of a state of the automaton and a character class to a state.
  0 12 24 36 60 96 84 12 12 12 48 72  12 12 12 12 12 12 12 12 12 12 12 12
12  0 12 12 12 12 12  0 12  0 12 12  12 24 12 12 12 12 12 24 12 24 12 12
12 12 12 12 12 12 12 24 12 12 12 12  12 24 12 12 12 12 12 12 12 24 12 12
12 12 12 12 12 12 12 36 12 36 12 12  12 36 12 12 12 12 12 36 12 36 12 12
12 36 12 12 12 12 12 12 12 12 12 12
] def

% codep state byte   decode   codep' state'
/decode {
   utf8d 1 index get                    % type
   % codep state byte type
   2 index UTF8_ACCEPT ne               % state not UTF8_ACCEPT?
     { exch 16#3F and 4 -1 roll 6 bitshift or }
     { dup neg 16#FF exch bitshift 3 -1 roll and 4 -1 roll pop }
   ifelse                               % state type codep'
   3 1 roll add 256 add utf8d exch get  % codep' state'
} def

%***************************************************************************/
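
As a minimal sketch of how decode might be driven, the fragment below
collects the code points of a string into an array.  It assumes utf8d,
UTF8_ACCEPT, UTF8_REJECT and decode are already defined as above;
utf8codepoints is a hypothetical helper, and recording -1 for an invalid
sequence is an arbitrary choice made for illustration.

```postscript
% Hypothetical helper, not part of the original post.
% string  utf8codepoints  [codepoint ...]
/utf8codepoints {
  [ exch                        % mark string
  0 UTF8_ACCEPT 3 -1 roll       % mark codep state string
  {                             % mark ... codep state byte
    decode                      % mark ... codep' state'
    dup UTF8_ACCEPT eq          % a complete code point?
      { 1 index 3 1 roll } if   % keep a copy below: ... codep' codep' state'
    dup UTF8_REJECT eq          % an invalid byte?
      { pop -1 exch UTF8_ACCEPT } if  % record -1, restart: ... -1 codep' ACCEPT
  } forall
  pop pop                       % drop the working codep and state
  ]
} def

% Bytes: A, then C3 A9 (U+00E9), then E2 82 AC (U+20AC)
(A\303\251\342\202\254) utf8codepoints ==   % [65 233 8364]
```

A sequence that is cut short at the end of the string leaves the
automaton in a non-accepting state, and this sketch silently drops it.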


I also use a table which Adobe published ("UNICODE translation table for 
non-ASCII characters"), which they say is for going from a glyph name to 
a Unicode codepoint.  I (ab)use it in the reverse direction: I turned 
it into a dictionary keyed on the codepoint.

The table is currently at https://github.com/adobe-type-tools/agl-aglfn. 
Some codepoints have multiple possible glyph names, so the dictionary 
has an array of potential glyph names for each codepoint.  Finally, 
fonts often have glyphs named /uniHHHH, where HHHH is the codepoint in 
hexadecimal, so that name is appended to each array as a last resort.

I converted the table to PS using awk:

BEGIN{FS="[; ]"}
/^#/{next}   # skip the comment lines in Adobe's file
{
   for(i=2; i<=NF; i++) {
     if(!($i in h)) {h[$i]=++n;v[n]=$i}
     g[$i]=g[$i]"/"$1
   }
}
END{
   print "/unicode <<"
   for(i=1;i<=n;i++) print "\t16#"v[i]"["g[v[i]]"/uni"toupper(v[i])"]"
   print ">> def"
}

Adobe's table is turned into this:

/unicode <<
   16#0041[/A/uni0041]
   16#00C6[/AE/uni00C6]
   ...
   16#305A[/zuhiragana/uni305A]
   16#30BA[/zukatakana/uni30BA]
 >> def
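
The generated dictionary only carries a /uniHHHH name for codepoints
that appear in Adobe's table.  As a sketch under that observation, the
name could also be built at run time for other codepoints.  Both uniname
and glyphnames are hypothetical helpers, the two-entry /unicode is only
a stand-in so the fragment runs on its own, and the helper is limited to
codepoints up to 16#FFFF.

```postscript
% Stand-in for the generated table; omit this line when the full
% /unicode dictionary is already defined.
/unicode << 16#0041 [/A /uni0041]  16#00C6 [/AE /uni00C6] >> def

% Hypothetical helper: integer (0..16#FFFF)  uniname  /uniHHHH
/uniname {
  16#10000 add 16 5 string cvrs     % (1HHHH): the extra bit forces leading zeros
  1 4 getinterval                   % (HHHH)
  7 string dup 0 (uni) putinterval  % (HHHH) (uni....)
  dup 3 4 -1 roll putinterval       % (uniHHHH)
  cvn                               % /uniHHHH
} def

% Hypothetical helper: integer  glyphnames  [/name ...]
/glyphnames {
  unicode 1 index known
    { unicode exch get }            % names from Adobe's table
    { [ exch uniname ] }            % otherwise only /uniHHHH
  ifelse
} def

16#00C6 glyphnames ==    % [/AE /uni00C6]
16#20AC glyphnames ==    % [/uni20AC] with the stand-in table above
```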

The crux of printing Unicode code points is to find which of the 
possible glyphs the current font defines.  I search currentfont's 
CharStrings.

% look for one of the glyphs in fontdict's CharStrings
% [/glyph ...] fontdict  chooseglyph  /glyph true
%                                            false
/chooseglyph {
   /CharStrings get exch % the glyphs defined in fontdict
   false 3 1 roll        % assume we don't find a glyph
                         % false CharStrings [glyphs]
   { 2 copy known {true 4 2 roll exch pop exit}{pop} ifelse } forall
   pop                   % remove CharStrings
} def
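
A short usage sketch, assuming chooseglyph is defined as above.  The
font and the candidate names are arbitrary choices for illustration, and
the result depends on which glyphs the interpreter's font defines.

```postscript
/Helvetica findfont 12 scalefont setfont

% Candidates are tried in array order; the first one present wins.
[/Euro /uni20AC] currentfont chooseglyph
  { (using glyph ) print = }               % name of the glyph that was found
  { (no candidate glyph in this font) = }
ifelse
```

Note that chooseglyph expects the font dictionary to have a CharStrings
entry, so a font without one would raise an undefined error here.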

I've noticed that Symbol sometimes contains glyphs that other fonts 
don't, so if I don't find a glyph in currentfont I look through Symbol.

I thought it might be a good idea to also try ZapfDingbats (the code 
below tries it before Symbol).  In retrospect, that might be a red 
herring.

Adobe also publish a table like the Unicode table for ZapfDingbats, 
giving the names of that font's glyphs.  It's in the same repository, 
and converts using the same awk script:

/zapf <<
   16#275E[/a100/uni275E]
   16#2761[/a101/uni2761]
   ...
   16#275D[/a99/uni275D]
   16#2720[/a9/uni2720]
 >> def


This is the code which prints a Unicode code point (or .notdef if a 
glyph cannot be found):

% SPDX-License-Identifier: LGPL-2.1-or-later
%
% Copyright (c) 2022 by davidnewall.com.  All rights reserved.

% print a single unicode codepoint:
% integer unicodeshow -
/unicodeshow {
   % load array of known glyph names for this code point
   unicode 1 index known
     {unicode exch get} % array of possible glyphs
     { pop []} % unknown code point
   ifelse
   {
     dup currentfont chooseglyph { glyphshow exit } if
     dup /ZapfDingbats findfont chooseglyph {
       currentfont exch /ZapfDingbats currentfontsize selectfont
       glyphshow setfont exit } if
     dup /Symbol findfont chooseglyph {
       currentfont exch /Symbol currentfontsize selectfont
       glyphshow setfont exit } if
     /.notdef glyphshow exit
   } loop
   pop
} def


I get the current font size by dividing the vertical scale in the 
current font's FontMatrix by that of its OrigFont:

/currentfontsize {
   currentfont dup /OrigFont get
   2 { /FontMatrix get 3 get exch } repeat div
} bind def
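
currentfontsize relies on the font dictionary having an OrigFont entry.
A guarded variant might look like the sketch below.  safefontsize is a
hypothetical name, and the fallback assumes 1000 glyph units per em,
which is usual for Type 1 fonts but is not true of every font type.

```postscript
% Hypothetical variant, not part of the original post.
% -  safefontsize  size
/safefontsize {
  currentfont /OrigFont known
    { currentfontsize }                             % the method from the post
    { currentfont /FontMatrix get 3 get 1000 mul }  % assumption: 1000 units per em
  ifelse
} def

/Helvetica findfont 12 scalefont setfont
safefontsize =     % prints the computed size
```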


Finally (at last!), to print a UTF-8 string:

% print a UTF-8 encoded string:
% string utfshow -
/utfshow {
   UTF8_ACCEPT 0 UTF8_ACCEPT % prev codep current
   4 -1 roll {               % prev codep current byte
     decode                  % prev codep' current'
     dup UTF8_ACCEPT eq { 1 index unicodeshow } if
     dup UTF8_REJECT eq {
       (%% Bad UTF-8 sequence\n) print pop
       UTF8_ACCEPT /.notdef glyphshow } if
     3 -1 roll pop dup 3 1 roll % prev = current
   } forall
   pop pop pop
} def
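
A usage sketch, assuming every definition above (utf8d, decode, unicode,
chooseglyph, unicodeshow, currentfontsize and utfshow) has been loaded.
The font, size and position are arbitrary.  The non-ASCII bytes are
written as octal escapes so the example does not depend on the encoding
of the file it is pasted into.

```postscript
/Helvetica findfont 24 scalefont setfont
72 720 moveto

% \303\251 is U+00E9 (e with acute), \342\202\254 is U+20AC (euro sign)
(Caf\303\251 \342\202\254 5) utfshow
showpage
```

Any codepoint for which no glyph is found in the current font,
ZapfDingbats or Symbol is shown as .notdef.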


Regards,

David
