Path: csiph.com!aioe.org!.POSTED.U158MBNsLt97drbZ2zludw.user.gioia.aioe.org!not-for-mail
From: news@zzo38computer.org.invalid
Newsgroups: comp.lang.postscript
Subject: Re: JSON reader/writer in PostScript (second version)
Date: Sun, 15 Sep 2019 18:34:51 +0000
Organization: Aioe.org NNTP Server
Lines: 34
Message-ID: <1568569984.bystand@zzo38computer.org>
References: <1567559086.bystand@zzo38computer.org> <1567977875.bystand@zzo38computer.org> <1ac2aa42-301c-4343-99de-2943b33fe7b0@googlegroups.com> <a148ab80-3ae6-4bd4-aa86-9c6e8533d8eb@googlegroups.com>
NNTP-Posting-Host: U158MBNsLt97drbZ2zludw.user.gioia.aioe.org
Mime-Version: 1.0
X-Complaints-To: abuse@aioe.org
User-Agent: bystand/0.6.2
X-Notice: Filtered by postfilter v. 0.9.2
Xref: csiph.com comp.lang.postscript:3452

luser droog <luser.droog@gmail.com> wrote:
> 
> The next issue is: I don't understand what to do with unicode 
> characters if they are discovered. It appears that OP's code
> reads in the multibyte sequences, constructs the codepoint in
> an int, and then truncates that to 8 bits and stores it in a
> string. That doesn't seem right, but I can't really think of
> anything better. Maybe an option either to leave the utf8 alone,
> or convert to arrays of integers? It's not clear to me what
> a PostScript program could hope to do with unicode data.
> 
> So I haven't written any utf8 handling. If I do add it, I think
> it should be added to the parser library itself as an input
> filter. The C version has these already.

The first version of my program treats \u escapes in the way you
describe: only the low 8 bits of each codepoint are kept.
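
To make the effect concrete, here is a small illustration in Python (not
the PostScript from my program; the function name is made up for this
example) of what keeping only the low 8 bits does to a codepoint:

```python
def truncate_codepoint(cp: int) -> int:
    """Keep only the low 8 bits of a codepoint (hypothetical helper)."""
    return cp & 0xFF

# U+00E9 already fits in one byte, so it survives unchanged.
print(hex(truncate_codepoint(0x00E9)))  # 0xe9
# U+20AC (euro sign) loses its high byte and becomes 0xAC.
print(hex(truncate_codepoint(0x20AC)))  # 0xac
```

So anything above U+00FF ends up as some unrelated byte value.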

The second version of my program has an option to instead convert any \u
escapes into UTF-8 encoding. (However, it will not convert surrogate pairs
into astral characters.)
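
Roughly, such a conversion looks like the sketch below. It is written in
Python for illustration and is not taken from my program; the function
names are made up, and the treatment of a lone surrogate half (encoded on
its own as three bytes) is an assumption of the sketch. The second
function shows the pairing step that would be needed to get astral
characters.

```python
def encode_u_escape(cp: int) -> bytes:
    """Encode one 16-bit \\u value (0..0xFFFF) as UTF-8 bytes.

    Assumption for this sketch: a surrogate half (0xD800..0xDFFF) is
    encoded like any other three-byte value, not paired with its partner.
    """
    if cp < 0x80:
        # One byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (cp >> 12),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def combine_surrogates(hi: int, lo: int) -> int:
    """Combine a high/low surrogate pair into one astral codepoint."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


print(encode_u_escape(0x20AC))                  # b'\xe2\x82\xac'
print(hex(combine_surrogates(0xD83D, 0xDE00)))  # 0x1f600
```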

Regardless of the version and of the option, any unescaped non-ASCII
bytes in the input are passed through as is; the program does not
interpret UTF-8 input at all.

You might be able to write UTF-8 text on the page by using a CMap whose
codespace ranges match the UTF-8 byte sequences, so maybe there is the
possibility to use Unicode data in that way.

If you want to add UTF-8 handling to your own program, though, you can
do it whichever way you think is best.

-- 
Note: I am not always able to read/post messages during Monday-Friday.