Path: csiph.com!xmission!news.glorb.com!usenet.stanford.edu!not-for-mail
From: Ulrich Mueller <ulm@gentoo.org>
Newsgroups: gnu.bash.bug
Subject: Re: bash-4.3: casemod word expansions broken with UTF-8
Date: Sun, 15 Nov 2015 17:56:59 +0100
Lines: 26
Approved: bug-bash@gnu.org
Message-ID: <mailman.1.1447720277.31583.bug-bash@gnu.org>
References: <22088.36043.764500.752406@a1i15.kph.uni-mainz.de>
NNTP-Posting-Host: lists.gnu.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Trace: usenet.stanford.edu 1447720278 6199 208.118.235.17 (17 Nov 2015 00:31:18 GMT)
X-Complaints-To: action@cs.stanford.edu
To: bug-bash@gnu.org
Envelope-to: bug-bash@gnu.org
In-Reply-To: <22088.36043.764500.752406@a1i15.kph.uni-mainz.de>
X-Mailer: VM 8.2.0b under 24.3.1 (x86_64-pc-linux-gnu)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 134.93.134.1
X-Mailman-Approved-At: Mon, 16 Nov 2015 15:58:54 -0500
X-BeenThere: bug-bash@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Bug reports for the GNU Bourne Again SHell <bug-bash.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/bug-bash>, <mailto:bug-bash-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/bug-bash>
List-Post: <mailto:bug-bash@gnu.org>
List-Help: <mailto:bug-bash-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/bug-bash>, <mailto:bug-bash-request@gnu.org?subject=subscribe>
Xref: csiph.com gnu.bash.bug:11888

>>>>> On Sun, 15 Nov 2015, Ulrich Mueller wrote:

> Description:
> 	In a UTF-8 locale like en_US.UTF-8, the case-modifying
> 	parameter expansions sometimes return invalid UTF-8 encodings.

> 	This seems to happen when the UTF-8 byte sequences that are
> 	encoding upper and lower case have different lengths.

Even more interesting effects happen if the string contains a
character whose UTF-8 encoding gets *longer* after case conversion,
because then the terminating null byte will be overwritten.

For example, U+0250 "LATIN SMALL LETTER TURNED A" is represented by a
two-byte sequence in UTF-8, while its uppercase equivalent U+2C6F
needs three bytes:

	$ LC_ALL=en_US.UTF-8
	$ x=$'aaaaa\xc9\x90'
	$ y=${x^^}
	$ echo -n "$y" | od -t x1
	0000000 41 41 41 41 41 e2 90 af 6f 6d 65 2f 75 6c 6d
	0000017

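y contains trailing garbage: the bytes 6f 6d 65 2f 75 6c 6d are ASCII
"ome/ulm", which could be part of $HOME or $PWD. The converted
character itself is also wrong: the output has e2 90 af, whereas
U+2C6F encodes as e2 b1 af.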
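
The length difference between the two encodings can be checked without
involving the case-modifying expansion at all. A minimal sketch,
assuming a POSIX shell with printf and wc; the variable names
lower_len and upper_len are only illustrative, and the bytes are
written as octal escapes so that the result does not depend on the
locale:

```shell
# U+0250 LATIN SMALL LETTER TURNED A: bytes c9 90 (octal 311 220)
# wc -c counts bytes; the $(( )) wrapper strips any padding wc may add.
lower_len=$(( $(printf '\311\220' | wc -c) ))

# U+2C6F LATIN CAPITAL LETTER TURNED A: bytes e2 b1 af (octal 342 261 257)
upper_len=$(( $(printf '\342\261\257' | wc -c) ))

echo "lowercase: $lower_len bytes, uppercase: $upper_len bytes"
```

This should print "lowercase: 2 bytes, uppercase: 3 bytes", so the
uppercased string needs one byte more than the original.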