
Fwd: Lossless bulletproof conversion to unicode (backslashing)

Started by	anatoly techtonik <techtonik@gmail.com>
First post	2015-05-27 14:15 +0300
Last post	2015-05-27 07:00 -0700
Articles	4 — 3 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Fwd: Lossless bulletproof conversion to unicode (backslashing) anatoly techtonik <techtonik@gmail.com> - 2015-05-27 14:15 +0300
    Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) Steven D'Aprano <steve@pearwood.info> - 2015-05-27 22:47 +1000
      Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) wxjmfauth@gmail.com - 2015-05-27 06:31 -0700
        Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) wxjmfauth@gmail.com - 2015-05-27 07:00 -0700

#91294 — Fwd: Lossless bulletproof conversion to unicode (backslashing)

From	anatoly techtonik <techtonik@gmail.com>
Date	2015-05-27 14:15 +0300
Subject	Fwd: Lossless bulletproof conversion to unicode (backslashing)
Message-ID	<mailman.78.1432725335.5151.python-list@python.org>

Hi.

This was labelled offtopic in python-ideas, so I edited and forwarded
it here. Please CC as I am not subscribed.


In short: I need a bulletproof way to convert from anything to
unicode. This requires some kind of escaping to go forward and back.
Some helper function like u2b() (unicode to binary) and b2u() (that
also removes escaping). So far I can't find any code that does just
that.


Background story. I need to print the SCons graph. SCons is a build tool,
so it has a graph of nodes - what depends on what. I have no idea
what a node object could be. I know only that it can have a human
readable representation. Sometimes a node is a filename in some
encoding that is not utf-8, and without knowing the encoding,
converting it to unicode is not possible without losing the information
about that filename.

So, here is what Python offers:

https://docs.python.org/2.7/library/functions.html?highlight=unicode#unicode

The unicode() type constructor doesn't allow you to do the conversion
without losing data. It offers only two basic strategies - crash or
corrupt - through three error handlers:

1. ignore  - meaning skip and corrupt the data
2. replace  - just corrupt the data
3. strict - just crash
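
The three handlers listed above can be seen side by side in a minimal sketch. It uses Python 3 syntax (on 2.7 the same `errors` values are accepted by `str.decode`), and the sample byte string `raw` is made up for illustration:

```python
# Hypothetical sample: "caf" followed by 0xE9, which is "é" in Latin-1
# but an incomplete sequence in UTF-8.
raw = b"caf\xe9"

ignored = raw.decode("utf-8", errors="ignore")    # bad byte silently dropped
replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD

try:
    raw.decode("utf-8", errors="strict")          # the default handler
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), strict_failed)
```

In both non-crashing cases the original byte 0xE9 can no longer be recovered from the result.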

Python's design leaves the decision of how to implement safe
interoperability to you, and that's basically the reason why Python 3
fails. Without a safe approach (get my binary data back from that
unicode) people just can't wrap their heads around that.

Python's design assumes that people know the encoding of the data they
are processing, but that's not true in many cases. The data may also
be just broken or invalid. So, the real world coding assumptions are:

1. external data encoding is unknown or varies
2. external data has binary chunks that are invalid for
conversion to unicode

In the real world, crashing with UnicodeDecodeError is not an option for
dealing with unknown, broken or invalid input (such as when I need to
print a human-readable representation of a Node to the screen). In many
(most?) situations lossless garbage is more welcome than a crash or data
loss, and that should be the default behaviour.


The solution is to have a filter preprocess the binary string and escape
all non-unicode symbols, so that the following lossless transformation
becomes possible:

   binary -> escaped utf-8 string -> unicode -> binary

Is this feasible? I need to accomplish it with
Python 2.x, but the use case is probably valid for Python 3 as well.
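
A minimal sketch of the pipeline above, under some assumptions: `b2u` and `u2b` borrow the helper names mentioned earlier, the handler name `escape-bad-bytes` and the `\xNN` escape syntax are invented for illustration, and UTF-8 is assumed to be the "expected" encoding. It sticks to calls that exist in both Python 2.7 and 3 (`codecs.register_error`, `bytearray`), but its behaviour on 2.7 should be treated as unchecked:

```python
import codecs
import re


def _escape_bad_bytes(exc):
    # Decode error handler: replace each undecodable byte with the
    # text "\xNN" and resume decoding after the bad span.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = bytearray(exc.object[exc.start:exc.end])
    return u"".join(u"\\x%02x" % b for b in bad), exc.end


# "escape-bad-bytes" is a made-up handler name, not a built-in one.
codecs.register_error("escape-bad-bytes", _escape_bad_bytes)

# Matches the two escape forms b2u() produces: a doubled backslash
# and a backslash followed by "x" and two lowercase hex digits.
_TOKEN = re.compile(r"\\\\|\\x[0-9a-f]{2}")


def b2u(data):
    # Double literal backslashes first so the escapes stay unambiguous,
    # then decode valid UTF-8 and escape everything else.
    return data.replace(b"\\", b"\\\\").decode("utf-8", "escape-bad-bytes")


def u2b(text):
    # Reverse of b2u(): undo the escapes, encode the rest as UTF-8.
    out = bytearray()
    pos = 0
    for match in _TOKEN.finditer(text):
        out += text[pos:match.start()].encode("utf-8")
        token = match.group()
        if token == u"\\\\":
            out += b"\\"
        else:
            out.append(int(token[2:], 16))
        pos = match.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)


raw = b"My Russian \xe4\xea name"   # KOI8-R bytes, invalid as UTF-8
print(b2u(raw))
print(u2b(b2u(raw)) == raw)
```

Doubling literal backslashes before decoding is what keeps the scheme reversible: a filename that really contains the four characters `\xe4` comes back as those four characters, not as a single byte.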

This is critical for porting SCons to Python 3.x, and I expect the same
for other tools that have to deal with unknown ascii-binary strings.

-- 
anatoly t.


#91303

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-05-27 22:47 +1000
Message-ID	<5565bcf3$0$12978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#91294

On Wed, 27 May 2015 09:15 pm, anatoly techtonik wrote:

> Hi.
> 
> This was labelled offtopic in python-ideas, so I edited and forwarded
> it here. Please CC as I am not subscribed.
> 
> 
> In short: I need a bulletproof way to convert from anything to
> unicode. This requires some kind of escaping to go forward and back.

Why do you need to go back? Just keep the node, and use that.

> Some helper function like u2b() (unicode to binary) and b2u() (that
> also removes escaping). So far I can't find any code that does just
> that.

def bytes2unicode(data):
    # Convert bytes to Unicode, allowing garbage (mojibake); never fails.
    return data.decode('latin1')

def unicode2bytes(text):
    # Convert Unicode (incl. mojibake) to bytes; fails above U+00FF.
    return text.encode('latin1')

It correctly does the round trip from any sequence of bytes to unicode and
back to bytes, losslessly:

py> import random
py> node = bytes([random.randrange(0, 256) for _ in range(100000)])
py> uni = bytes2unicode(node)
py> b = unicode2bytes(uni)
py> b == node
True
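
The round trip works because Latin-1 assigns each of the 256 possible byte values to the code point with the same number, so decoding can never fail. A quick sketch of that property (Python 3; `all_bytes` is just an illustrative name):

```python
all_bytes = bytes(range(256))
text = all_bytes.decode("latin1")

# Each byte value N becomes the code point U+00NN, so nothing is lost.
same_numbers = [ord(ch) for ch in text] == list(all_bytes)
round_trips = text.encode("latin1") == all_bytes
print(same_numbers, round_trips)
```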

But take careful note that you can't start with arbitrary Unicode and
still expect to round-trip losslessly. Many perfectly readable Unicode
strings can *not* be encoded as Latin-1 at all:

py> unicode2bytes(u'ДЙ')  # two Cyrillic letters
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in unicode2bytes
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1:
ordinal not in range(256)

Conversely, if you start from correctly encoded bytes, they will
round-trip, but the decoded text will display as garbage:

py> s = u'ДЙ'
py> node = s.encode('utf-8')
py> print(node)  # Correctly encoded UTF-8
b'\xd0\x94\xd0\x99'
py> node == unicode2bytes(bytes2unicode(node))  # round trips okay
True
py> print(repr(bytes2unicode(node)))  # but prints as crap
'Ð\x94Ð\x99'

> Background story. I need to print the SCons graph. SCons is a build tool,
> so it has a graph of nodes - what depends on what. I have no idea
> what a node object could be. I know only that it can have a human
> readable representation. Sometimes a node is a filename in some
> encoding that is not utf-8, and without knowing the encoding,
> converting it to unicode is not possible without losing the information
> about that filename.

py> filename = "My Russian ДЙ name"  # Unicode
py> b = filename.encode('koi8-r')  # Oops, not UTF-8!
py> b.decode("utf-8")  # Fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 11:
invalid continuation byte
py> b.decode("utf-8", errors="replace")  # lossy, but works
'My Russian �� name'
py> s = b.decode("utf-8", errors="surrogateescape")  # magic!
py> s
'My Russian \udce4\udcea name'

It round-trips as well:

py> s.encode("utf-8", errors="surrogateescape") == b
True
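
One caveat with the surrogate-escaped string: lone surrogates are rejected by a strict codec, so writing `s` to a UTF-8 terminal or file can raise `UnicodeEncodeError`. A hedged sketch of one way to get a printable form for display while keeping `s` itself for the round trip (Python 3; `printable` and `strict_ok` are illustrative names):

```python
b = "My Russian ДЙ name".encode("koi8-r")
s = b.decode("utf-8", errors="surrogateescape")

try:
    s.encode("utf-8")            # strict: lone surrogates are rejected
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# For display only: render the surrogates as \udcXX escapes.
printable = s.encode("ascii", errors="backslashreplace").decode("ascii")
print(printable)
```

The escaped form is meant for human eyes only; the bytes are still recovered from `s` with `errors="surrogateescape"`.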

Converting this back to Python 2.7 is left as an exercise for the reader.

-- 
Steven


#91308

From	wxjmfauth@gmail.com
Date	2015-05-27 06:31 -0700
Message-ID	<f456b61c-b546-4490-a7f6-bfbff14cfe85@googlegroups.com>
In reply to	#91303

This is so brilliant, I do not think
it is worth commenting on.

jmf


#91311

From	wxjmfauth@gmail.com
Date	2015-05-27 07:00 -0700
Message-ID	<053a0efd-0677-4302-8ad5-ed2c859ec0b0@googlegroups.com>
In reply to	#91308

==========

- Fair play, jmf. Fair play.
- Yes

>>> "\udce4" == "Д"
False
>>> "\udcea" == "Й"
False
>>>
