Path: csiph.com!xmission!news.snarked.org!news.linkpendium.com!news.linkpendium.com!panix!usenet.stanford.edu!not-for-mail
From: Kalle Olavi Niemitalo
Newsgroups: gnu.bash.bug
Subject: printf '\uFEFF' outputs invalid UTF-8 on Windows
Date: Mon, 05 Nov 2018 19:09:06 +0200
Lines: 43
Approved: bug-bash@gnu.org
NNTP-Posting-Host: lists.gnu.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: usenet.stanford.edu 1541441576 31586 208.118.235.17 (5 Nov 2018 18:12:56 GMT)
X-Complaints-To: action@cs.stanford.edu
To: bug-bash@gnu.org
Envelope-to: bug-bash@gnu.org
User-Agent: Gnus/5.110007 (No Gnus v0.7) Emacs/23.0.51 (gnu/linux)
X-Accept-Language: fi;q=1.0, en;q=0.9, sv;q=0.5, de;q=0.1
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x
X-Received-From: 62.142.5.110
X-Mailman-Approved-At: Mon, 05 Nov 2018 13:12:54 -0500
X-BeenThere: bug-bash@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: Bug reports for the GNU Bourne Again SHell
Xref: csiph.com gnu.bash.bug:14763

Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: msys
Compiler: gcc
Compilation CFLAGS: -DPROGRAM='bash.exe' -DCONF_HOSTTYPE='x86_64'
    -DCONF_OSTYPE='msys' -DCONF_MACHTYPE='x86_64-pc-msys' -DCONF_VENDOR='pc'
    -DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash' -DSHELL -DHAVE_CONFIG_H
    -DRECYCLES_PIDS -I. -I. -I./include -I./lib -DWORDEXP_OPTION
    -Wno-discarded-qualifiers -march=x86-64 -mtune=generic -O2 -pipe
    -Wno-parentheses -Wno-format-security -D_STATIC_BUILD -g
uname output: MINGW64_NT-6.1 fjkallen 2.10.0(0.325/5/3) 2018-07-25 13:06 x86_64 Msys
Machine Type: x86_64-pc-msys

Bash Version: 4.4
Patch Level: 19
Release Status: release

Description:
    The builtin printf '\uFEFF' outputs ED 9F BF ED BB BF in a UTF-8
    locale on Microsoft Windows, where sizeof(wchar_t) == 2.
    It should output EF BB BF, like printf (GNU coreutils) 8.30 does.

    The incorrect output ED 9F BF ED BB BF is a UTF-8-like encoding of
    U+D7FF U+DEFF, which looks somewhat like a UTF-16 surrogate pair
    but the U+D7FF character is not in the surrogate range.

Repeat-By:
    Install Git for Windows 2.19.1, on Windows 7 SP1.
    Start "Git Bash" from the Start menu.
    Run the command:

    env --ignore-environment LANG=en_US.UTF-8 \
    /usr/bin/bash --noprofile -c 'builtin printf "\ufeff"' \
    | od -t x1

Fix:
    In lib/sh/unicode.c, change u32toutf16 to treat characters in
    the U+E000...U+FFFF range just like the U+0000...U+D7FF range,
    i.e. copy them unchanged to the output and not make a surrogate
    pair. I did not test that change but the function clearly has
    a bug and it matches the symptoms perfectly.
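
The proposed fix can be sketched in C roughly as follows. This is an untested illustration, not a patch against lib/sh/unicode.c: the names encode_utf16 and missplit_bmp are hypothetical, and uint16_t stands in for a 2-byte wchar_t so that the arithmetic is the same on every platform. The second function assumes that the surrogate branch uses the standard UTF-16 arithmetic (subtract 0x10000, then split into two 10-bit halves) on an unsigned 32-bit value; under that assumption it reproduces the U+D7FF U+DEFF pair described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for u32toutf16 with the suggested range check.
   Writes 1 or 2 UTF-16 code units to out and returns how many were written.
   Returns 0 for surrogate code points and for values above U+10FFFF. */
size_t encode_utf16(uint32_t c, uint16_t out[2])
{
    assert(out != NULL);

    /* All BMP characters outside the surrogate block, including
       U+E000...U+FFFF, are copied unchanged as a single code unit. */
    if (c < 0xD800 || (c >= 0xE000 && c <= 0xFFFF)) {
        out[0] = (uint16_t)c;
        return 1;
    }

    /* Only supplementary characters become a surrogate pair. */
    if (c >= 0x10000 && c <= 0x10FFFF) {
        c -= 0x10000;
        out[0] = (uint16_t)(0xD800 + (c >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 + (c & 0x3FF));  /* low surrogate */
        return 2;
    }

    return 0;
}

/* Illustration of the symptom only: applies the surrogate arithmetic to a
   BMP character. The subtraction wraps around in unsigned 32-bit
   arithmetic, and truncating to 16 bits yields a bogus "pair". */
void missplit_bmp(uint32_t c, uint16_t out[2])
{
    c -= 0x10000;                                /* wraps for c < 0x10000 */
    out[0] = (uint16_t)(0xD800 + (c >> 10));
    out[1] = (uint16_t)(0xDC00 + (c & 0x3FF));
}
```

With this sketch, encode_utf16(0xFEFF, buf) stores the single unit 0xFEFF, whereas missplit_bmp(0xFEFF, buf) stores 0xD7FF 0xDEFF, which encode to ED 9F BF ED BB BF when each unit is converted to UTF-8 separately.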