From: random832@fastmail.us
To: python-list@python.org
In-Reply-To: <CAPkN8xK452YWZvVRipTjq05eKthxEE28CB5W5-MN+6DZnXHOGg@mail.gmail.com>
References: <CAPkN8xKTXJu2nhvocG8KuyO1XkJVfK_WsmY6dM=hWsVyg+BVyA@mail.gmail.com> <CAPkN8xK452YWZvVRipTjq05eKthxEE28CB5W5-MN+6DZnXHOGg@mail.gmail.com>
Subject: Re: Fwd: Lossless bulletproof conversion to unicode (backslashing)
Date: Wed, 27 May 2015 08:40:38 -0400
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.85.1432730441.5151.python-list@python.org>

On Wed, May 27, 2015, at 07:15, anatoly techtonik wrote:
> The solution is to have filter preprocess the binary string to escape all
> non-unicode symbols so that the following lossless transformation
> becomes possible:
> 
>    binary -> escaped utf-8 string -> unicode -> binary
> 
> I want to know if that's real? I need to accomplish that with
> Python 2.x, but the use case is probably valid for Python 3 as well.

In Python 3, you could *in principle* use surrogateescape (this would be
more of a binary -> escaped unicode workflow), but see below. It is
worth noting that when you *read* POSIX filenames in unicode form (e.g.
os.listdir with a str argument), they are decoded with surrogateescape,
and can be converted back to bytes with
fn.encode(sys.getfilesystemencoding(), errors='surrogateescape').
However, keep in mind that on *Windows*, the native filename format is a
sequence of 16-bit WCHAR values, not a sequence of bytes.
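
A minimal sketch of that round trip, assuming Python 3 and UTF-8 as the
encoding (the filename bytes here are made up for illustration):

```python
# Hypothetical raw filename: ASCII text plus two bytes that are not valid UTF-8.
raw = b"report-\xff\xfe.txt"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same codec and error handler restores the original bytes.
restored = name.encode("utf-8", errors="surrogateescape")

print(ascii(name))      # ascii() keeps the surrogates visible as \udcXX escapes
print(restored == raw)
```

On POSIX, os.fsdecode() and os.fsencode() do the same thing using the
filesystem encoding. Note that a str holding lone surrogates cannot be
encoded with the default strict error handler, so it can't simply be
printed or written to a UTF-8 file as-is.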

> This stuff is critical to port SCons to Python 3.x and I expect for other
> similar tools that have to deal with unknown ascii-binary strings too.

Even if your filename *is* valid UTF-8 (or whatever other encoding), it
might contain invisible control characters that make it difficult to
read. You'd probably be better off working directly with the binary
representation, iterating over it and replacing every byte that is not
printable ASCII with an escaped representation. As it happens, the
repr() function should work well for doing exactly this. (Note: repr()
on a *unicode* string in Python 3 leaves printable non-ASCII characters
unescaped, but ideally you're working with byte strings.)
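
Something along these lines, as a rough sketch for Python 3 (the helper
names escape_bytes and unescape_bytes are made up here, and using
ast.literal_eval to get the bytes back is my own assumption about how
you'd want to reverse it):

```python
import ast


def escape_bytes(raw):
    """Return a printable, pure-ASCII representation of a byte string."""
    # repr() of bytes escapes control bytes and everything >= 0x80.
    return repr(raw)


def unescape_bytes(text):
    """Reverse escape_bytes(). literal_eval parses literals only; it runs no code."""
    value = ast.literal_eval(text)
    if not isinstance(value, bytes):
        raise ValueError("not a bytes literal: %r" % (text,))
    return value


# Hypothetical filename: UTF-8 for 'é', a BEL control byte and a newline.
raw = b"caf\xc3\xa9\x07\n.txt"
shown = escape_bytes(raw)
print(shown)
print(unescape_bytes(shown) == raw)
```

The escaped form is plain ASCII, so it is safe to log or display, and
it converts back to the exact original bytes.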

There's no real need to go beyond this unless you're working in a
problem domain where filenames are likely to legitimately include
non-ASCII characters (e.g. documents belonging to non-technical users
who use languages other than English).