Minor utf32-to-utf8 bug

Started by	István Pásztor <pasztorpisti@gmail.com>
First post	2019-11-10 14:07 +0000
Last post	2019-11-10 14:07 +0000
Articles	1 — 1 participant

#15591 — Minor utf32-to-utf8 bug

From	István Pásztor <pasztorpisti@gmail.com>
Date	2019-11-10 14:07 +0000
Subject	Minor utf32-to-utf8 bug
Message-ID	<mailman.1182.1573396992.13325.bug-bash@gnu.org>

[Multipart message — attachments visible in raw view] — view raw

Hi

The encoding of six-byte UTF-8 sequences is buggy. Today Unicode requires
UTF-8 sequences of at most 4 bytes, but if we handle 5- and 6-byte sequences
too, then let's do it the right way.
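
For reference, here is a minimal sketch of the original (pre-RFC 3629) UTF-8 layout, which allows sequences of up to 6 bytes for 31-bit values. It is not the code from bash or from the patch, neither of which is shown here; the function name encode_utf8_extended is made up for illustration. The lead-byte markers (0xC0, 0xE0, 0xF0, 0xF8, 0xFC) and the range limits follow the historical UTF-8 definition.

```python
def encode_utf8_extended(cp: int) -> bytes:
    """Encode a 31-bit value with the original UTF-8 scheme (1 to 6 bytes).

    Hypothetical helper for illustration only; not bash's implementation.
    """
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if cp < 0x80:
        # Plain ASCII: a single byte, no marker bits.
        return bytes([cp])

    # (exclusive upper limit, lead-byte marker, number of continuation bytes)
    layouts = [
        (0x800, 0xC0, 1),       # 2-byte sequence: 110xxxxx
        (0x10000, 0xE0, 2),     # 3-byte sequence: 1110xxxx
        (0x200000, 0xF0, 3),    # 4-byte sequence: 11110xxx
        (0x4000000, 0xF8, 4),   # 5-byte sequence: 111110xx
        (0x80000000, 0xFC, 5),  # 6-byte sequence: 1111110x
    ]
    for limit, lead, ncont in layouts:
        if cp < limit:
            # The lead byte carries the marker plus the topmost payload bits.
            out = [lead | (cp >> (6 * ncont))]
            # Each continuation byte is 10xxxxxx with the next 6 payload bits.
            for i in range(ncont - 1, -1, -1):
                out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
            return bytes(out)
    raise AssertionError("unreachable: range was checked above")


if __name__ == "__main__":
    # Largest 31-bit value: lead byte 0xFD followed by five 0xBF bytes.
    print(encode_utf8_extended(0x7FFFFFFF).hex())
```

Under this layout a six-byte sequence must start with a lead byte of 0xFC or 0xFD, so an encoder that emits any other lead byte for values of 0x4000000 and above produces an invalid sequence.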

The attached patch was created using a fresh master clone
(d894cfd104086ddf68c286e67a5fb2e02eb43b7b).

I'm writing a tool (pxargs, an xargs variant) that can accept input strings
in shell-quoted format, and I used the bash manual and source code as
references. I haven't compiled the latest bash sources to confirm the bug, but
it is likely to affect ANSI-C quoted strings like $'\U7fffffff'.

Best Regards,
Istvan Pasztor
