Path: csiph.com!xmission!news.glorb.com!usenet.stanford.edu!not-for-mail
From: Ulrich Mueller <ulm@gentoo.org>
Newsgroups: gnu.bash.bug
Subject: bash-4.3: casemod word expansions broken with UTF-8
Date: Mon, 16 Nov 2015 16:12:15 +0100
Lines: 57
Approved: bug-bash@gnu.org
Message-ID: <mailman.3.1447720279.31583.bug-bash@gnu.org>
NNTP-Posting-Host: lists.gnu.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Trace: usenet.stanford.edu 1447720280 6201 208.118.235.17 (17 Nov 2015 00:31:20 GMT)
X-Complaints-To: action@cs.stanford.edu
To: bug-bash@gnu.org
Envelope-to: bug-bash@gnu.org
X-Mailer: VM 8.2.0b under 24.3.1 (x86_64-pc-linux-gnu)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 134.93.134.1
X-Mailman-Approved-At: Mon, 16 Nov 2015 15:58:54 -0500
Xref: csiph.com gnu.bash.bug:11890

[Resending; apparently my first message didn't make it to the list.]

Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: x86_64-pc-linux-gnu-gcc
Compilation CFLAGS:  -DPROGRAM='bash' -DCONF_HOSTTYPE='x86_64' -DCONF_OSTYPE='linux-gnu' -DCONF_MACHTYPE='x86_64-pc-linux-gnu' -DCONF_VENDOR='pc' -DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash' -DSHELL -DHAVE_CONFIG_H   -I. -I./include -I. -I./include -I./lib  -DDEFAULT_PATH_VALUE='/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin' -DSTANDARD_UTILS_PATH='/bin:/usr/bin:/sbin:/usr/sbin' -DSYS_BASHRC='/etc/bash/bashrc' -DSYS_BASH_LOGOUT='/etc/bash/bash_logout' -DNON_INTERACTIVE_LOGIN_SHELLS -DSSH_SOURCE_BASHRC -march=core2 -ggdb -O2 -pipe
uname output: Linux juno 3.18.24-gentoo #1 SMP Sun Nov 8 10:43:05 CET 2015 x86_64 Intel(R) Core(TM)2 Duo CPU T6570 @ 2.10GHz GenuineIntel GNU/Linux
Machine Type: x86_64-pc-linux-gnu

Bash Version: 4.3
Patch Level: 42
Release Status: release

Description:
	In a UTF-8 locale like en_US.UTF-8, the case-modifying
	parameter expansions sometimes return invalid UTF-8 encodings.

	This seems to happen when the UTF-8 byte sequences encoding
	the uppercase and lowercase forms of a character have
	different lengths.
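
	A minimal sketch of that length mismatch for the three
	characters used in this report. The helper name byte_len and
	the variable names are illustrative assumptions only; the byte
	sequences are the standard UTF-8 encodings of the code points
	named in the comments.

```shell
# Count the bytes in a string, independent of the current locale.
byte_len() {
	echo $(( $(printf '%s' "$1" | wc -c) ))
}

# Case pairs written as explicit UTF-8 byte sequences.
dotless_i=$'\xc4\xb1'          # U+0131 LATIN SMALL LETTER DOTLESS I
sharp_s_cap=$'\xe1\xba\x9e'    # U+1E9E LATIN CAPITAL LETTER SHARP S
sharp_s=$'\xc3\x9f'            # U+00DF LATIN SMALL LETTER SHARP S
turned_a=$'\xc9\x90'           # U+0250 LATIN SMALL LETTER TURNED A
turned_a_cap=$'\xe2\xb1\xaf'   # U+2C6F LATIN CAPITAL LETTER TURNED A

# Source length -> length after case conversion.
printf 'U+0131 -> U+0049: %d -> %d bytes\n' "$(byte_len "$dotless_i")" "$(byte_len I)"
printf 'U+1E9E -> U+00DF: %d -> %d bytes\n' "$(byte_len "$sharp_s_cap")" "$(byte_len "$sharp_s")"
printf 'U+0250 -> U+2C6F: %d -> %d bytes\n' "$(byte_len "$turned_a")" "$(byte_len "$turned_a_cap")"
```

	The first two pairs shrink by one byte and the third grows by
	one byte.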

Repeat-By:
	$ LC_ALL=en_US.UTF-8
	$ x=$'\xc4\xb1' # LATIN SMALL LETTER DOTLESS I
	$ echo -n "${x^}" | od -t x1
	0000000 49 b1
	0000002

	The output should have been "49" (for "I") only. The trailing
	"b1" is a continuation byte, which is illegal as the first
	byte of a UTF-8 sequence.

	$ x=$'\xe1\xba\x9e' # LATIN CAPITAL LETTER SHARP S
	$ echo -n "${x,}" | od -t x1
	0000000 c3 9f 9e
	0000003

	The output should have been "c3 9f" (for "sharp s") only. The
	stray "9e" is the last byte of the original three-byte
	sequence.
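
	As a rough way to confirm that both results above are
	malformed, the following sketch checks the lead-byte and
	continuation-byte structure of a list of hex bytes as printed
	by od. The function name utf8_hex_ok is a hypothetical helper,
	and the check is deliberately simplified: it does not reject
	overlong forms or surrogates.

```shell
# Return 0 if the space-separated hex bytes in $1 are structurally
# well-formed UTF-8 (lead bytes followed by the right number of
# continuation bytes). Simplified: no overlong or surrogate checks.
utf8_hex_ok() {
	local need=0 byte b
	for byte in $1; do                    # unquoted on purpose: split on spaces
		b=$((16#$byte))
		if (( need > 0 )); then
			# Inside a multi-byte sequence: must be 10xxxxxx.
			(( (b & 0xC0) == 0x80 )) || return 1
			need=$((need - 1))
		elif (( b < 0x80 )); then
			need=0                        # plain ASCII
		elif (( (b & 0xE0) == 0xC0 )); then
			need=1                        # lead byte of a 2-byte sequence
		elif (( (b & 0xF0) == 0xE0 )); then
			need=2                        # lead byte of a 3-byte sequence
		elif (( (b & 0xF8) == 0xF0 )); then
			need=3                        # lead byte of a 4-byte sequence
		else
			return 1                      # stray continuation or invalid lead byte
		fi
	done
	(( need == 0 ))                       # fail on a truncated sequence
}

for bytes in "49 b1" "c3 9f 9e" "49" "c3 9f"; do
	if utf8_hex_ok "$bytes"; then
		echo "$bytes: ok"
	else
		echo "$bytes: malformed"
	fi
done
```

	The two observed outputs are reported as malformed, and the
	two expected outputs pass.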

	Even more interesting effects happen if the string contains
	a character whose UTF-8 encoding gets *longer* after case
	conversion, because then the terminating null byte will be
	overwritten.

	For example, U+0250 "LATIN SMALL LETTER TURNED A" is
	represented by a two-byte sequence in UTF-8, while its
	uppercase equivalent U+2C6F needs three bytes:

	$ LC_ALL=en_US.UTF-8
	$ x=$'aaaaa\xc9\x90'
	$ y=${x^^}
	$ echo -n "$y" | od -t x1
	0000000 41 41 41 41 41 e2 90 af 6f 6d 65 2f 75 6c 6d
	0000017

	Variable y contains trailing garbage: the bytes after
	"e2 90 af" decode to "ome/ulm", which could be part of
	$HOME or $PWD.
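
	For comparison, the correct UTF-8 encoding of U+2C6F is
	"e2 b1 af", so the expected result is eight bytes long. The
	sketch below compares the actual expansion against that
	expected byte string. The helper name hex_bytes is an
	illustrative assumption, and the script assumes that the
	en_US.UTF-8 locale is installed and that the shell supports
	the ${x^^} expansion (bash 4.0 or later).

```shell
# Print the bytes of a string as space-separated hex values.
hex_bytes() {
	printf '%s' "$1" | od -An -v -t x1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# Expected result: "AAAAA" followed by U+2C6F (e2 b1 af).
expected=$'AAAAA\xe2\xb1\xaf'

LC_ALL=en_US.UTF-8        # assumption: this locale is available
x=$'aaaaa\xc9\x90'        # "aaaaa" followed by U+0250 (c9 90)
y=${x^^}

if [ "$(hex_bytes "$y")" = "$(hex_bytes "$expected")" ]; then
	echo "ok"
else
	echo "mismatch"
	echo "  expected: $(hex_bytes "$expected")"
	echo "  actual:   $(hex_bytes "$y")"
fi
```

	If the locale is missing, the shell falls back to converting
	ASCII letters only, and the script reports a mismatch for that
	reason rather than because of this bug.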