Fwd: Lossless bulletproof conversion to unicode (backslashing)

Return-Path	<techtonik@gmail.com>
X-Original-To	python-list@python.org
Delivered-To	python-list@mail.python.org
MIME-Version	1.0
In-Reply-To	<CAPkN8xKTXJu2nhvocG8KuyO1XkJVfK_WsmY6dM=hWsVyg+BVyA@mail.gmail.com>
References	<CAPkN8xKTXJu2nhvocG8KuyO1XkJVfK_WsmY6dM=hWsVyg+BVyA@mail.gmail.com>
From	anatoly techtonik <techtonik@gmail.com>
Date	Wed, 27 May 2015 14:15:06 +0300
Subject	Fwd: Lossless bulletproof conversion to unicode (backslashing)
To	python-list@python.org
Content-Type	text/plain; charset=UTF-8
X-BeenThere	python-list@python.org
X-Mailman-Version	2.1.20+
Precedence	list
List-Id	General discussion list for the Python programming language <python-list.python.org>
List-Unsubscribe	<https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive	<http://mail.python.org/pipermail/python-list/>
List-Post	<mailto:python-list@python.org>
List-Help	<mailto:python-list-request@python.org?subject=help>
List-Subscribe	<https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe>
Newsgroups	comp.lang.python
Message-ID	<mailman.78.1432725335.5151.python-list@python.org> (permalink)
Lines	67
NNTP-Posting-Host	2001:888:2000:d::a6
X-Trace	1432725335 news.xs4all.nl 2916 [2001:888:2000:d::a6]:54802
X-Complaints-To	abuse@xs4all.nl
Xref	csiph.com comp.lang.python:91294


Hi.

This was labelled offtopic in python-ideas, so I edited and forwarded
it here. Please CC as I am not subscribed.


In short: I need a bulletproof way to convert from anything to
unicode. This requires some kind of escaping to go forward and back,
with helper functions like u2b() (unicode to binary) and b2u() (which
also removes escaping). So far I can't find any code that does just
that.


Background story: I need to print the SCons graph. SCons is a build
tool, so it has a graph of nodes - what depends on what. I have no idea
what a node object could be. I know only that it can have a human
readable representation. Sometimes a node is a filename in some
encoding that is not utf-8, and without knowing the encoding,
converting it to unicode is not possible without losing the information
about that filename.

So, here is what Python proposes:

https://docs.python.org/2.7/library/functions.html?highlight=unicode#unicode

The unicode() type constructor doesn't allow you to do the conversion
without losing data. It offers only two basic strategies - crash or
corrupt - through three error modes:

1. ignore  - meaning skip and corrupt the data
2. replace  - just corrupt the data
3. strict - just crash
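
A minimal sketch of the three modes listed above, to make the data loss
concrete. It is written for Python 3, where bytes.decode() takes the same
error names as unicode() in 2.x; the sample filename is made up for
illustration.

```python
# A Latin-1 encoded filename; the byte 0xe9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# strict: refuses to decode and raises UnicodeDecodeError.
try:
    raw.decode("utf-8", "strict")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "crashed"

# ignore: silently drops the offending byte.
ignored = raw.decode("utf-8", "ignore")

# replace: substitutes U+FFFD, so the original byte value is gone.
replaced = raw.decode("utf-8", "replace")

print(strict_result, repr(ignored), repr(replaced))
```

Neither the ignored nor the replaced string can be encoded back into the
original bytes, which is the problem described here.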

Python's design leaves the decision of how to implement safe
interoperability to you, and that's basically the reason why Python 3
fails. Without a safe approach (get my binary data back from that
unicode) people just can't wrap their heads around that.

Python's design assumes that people know the encoding of the data they
are processing, but that's not true in many cases. The data may also
be just broken or invalid. So, the real world coding assumptions are:

1. external data encoding is unknown or varies
2. external data has binary chunks that are invalid for
conversion to unicode

In the real world, a UnicodeDecodeError crash is not an option for
dealing with unknown, broken or invalid input (such as when I need to
print a human representation of a Node to the screen). In many (most?)
situations lossless garbage is more welcome than a crash or data loss,
and that should be the default behaviour.


The solution is to have a filter preprocess the binary string to escape
all non-unicode symbols so that the following lossless transformation
becomes possible:

   binary -> escaped utf-8 string -> unicode -> binary
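
One possible shape for that chain, as a rough sketch rather than a
finished solution. It targets Python 3 and would need adapting for 2.x.
The names b2u() and u2b() are borrowed from the start of this message;
here b2u() adds the escaping and u2b() removes it, which is an assumption
about their roles. The handler name "bytes_backslash" and the \xNN escape
format are also invented for this example.

```python
import codecs
import re


def _escape_invalid_bytes(err):
    """Decode error handler: turn each undecodable byte into \\xNN text."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join("\\x%02x" % byte for byte in bad), err.end


# "bytes_backslash" is a made-up handler name for this sketch.
codecs.register_error("bytes_backslash", _escape_invalid_bytes)


def b2u(data):
    """bytes -> text; never raises and never drops a byte."""
    # Double literal backslashes first so they cannot be mistaken for
    # escapes. The byte 0x5c never occurs inside a multi-byte UTF-8
    # sequence, so this replacement is safe on raw bytes.
    return data.replace(b"\\", b"\\\\").decode("utf-8", "bytes_backslash")


_ESCAPE = re.compile(br"\\(\\|x[0-9a-f]{2})")


def _unescape(match):
    token = match.group(1)
    if token == b"\\":
        return b"\\"
    # token is b"xNN"; convert the two hex digits back into one byte.
    return bytes([int(token[1:].decode("ascii"), 16)])


def u2b(text):
    """Inverse of b2u(): text -> the original bytes."""
    return _ESCAPE.sub(_unescape, text.encode("utf-8"))
```

With these definitions u2b(b2u(raw)) should give back raw for any byte
string, and b2u(b"caf\xe9.txt") is the printable text caf\xe9.txt. The
reverse direction is not guaranteed: u2b() is only meant for text that
came out of b2u().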

I want to know whether that is feasible. I need to accomplish it with
Python 2.x, but the use case is probably valid for Python 3 as well.
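
For the Python 3 side, the standard library has an error handler that
may already cover part of this: "surrogateescape" (PEP 383), which maps
undecodable bytes to lone surrogates and back. A small sketch, assuming
UTF-8 as the guessed encoding and a made-up filename; it does not help
on Python 2.x, where that handler is not available.

```python
raw = b"caf\xe9.txt"

# Undecodable bytes become lone surrogates (0xe9 -> U+DCE9).
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", "surrogateescape")

# Lone surrogates cannot be printed as-is on most terminals, so make an
# ASCII-only form for display with backslashreplace.
printable = text.encode("ascii", "backslashreplace").decode("ascii")

print(restored == raw, printable)
```

The printable form is for display only; the round trip back to bytes
goes through the surrogate-carrying string, not through the printed
text.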

This stuff is critical for porting SCons to Python 3.x and, I expect,
for other similar tools that have to deal with unknown ascii-binary
strings too.

-- 
anatoly t.

