| Started by | anatoly techtonik <techtonik@gmail.com> |
|---|---|
| First post | 2015-05-27 14:15 +0300 |
| Last post | 2015-05-27 07:00 -0700 |
| Articles | 4 — 3 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Fwd: Lossless bulletproof conversion to unicode (backslashing) anatoly techtonik <techtonik@gmail.com> - 2015-05-27 14:15 +0300
Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) Steven D'Aprano <steve@pearwood.info> - 2015-05-27 22:47 +1000
Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) wxjmfauth@gmail.com - 2015-05-27 06:31 -0700
Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) wxjmfauth@gmail.com - 2015-05-27 07:00 -0700
| From | anatoly techtonik <techtonik@gmail.com> |
|---|---|
| Date | 2015-05-27 14:15 +0300 |
| Subject | Fwd: Lossless bulletproof conversion to unicode (backslashing) |
| Message-ID | <mailman.78.1432725335.5151.python-list@python.org> |
Hi.

This was labelled offtopic in python-ideas, so I edited and forwarded it here. Please CC as I am not subscribed.

In short. What I need is a bulletproof way to convert from anything to unicode. This requires some kind of escaping to go forward and back. Some helper functions like u2b() (unicode to binary) and b2u() (that also removes escaping). So far I can't find any code that does just that.

Background story. I need to print the SCons graph. SCons is a build tool, so it has a graph of nodes - what depends on what. I have no idea what a node object could be. I know only that it can have a human readable representation. Sometimes a node is a filename in some encoding that is not utf-8, and without knowing the encoding, converting it to unicode is not possible without losing the information about that filename.

So, here is what Python proposes:

https://docs.python.org/2.7/library/functions.html?highlight=unicode#unicode

The unicode() type constructor, which doesn't allow you to do the conversion without losing data. It offers only two basic strategies - crash or corrupt:

1. ignore - meaning skip and corrupt the data
2. replace - just corrupt the data
3. strict - just crash

Python design leaves the decision of how to implement safe interoperability to you, and that's basically the reason why Python 3 fails. Without a safe approach (get my binary data back from that unicode) people just can't wrap their heads around that.

Python design assumes that people know the encoding of the data they are processing, but that's not true in many cases. The data may also be just broken or invalid. So, the real world coding assumptions are:

1. external data encoding is unknown or varies
2. external data has binary chunks that are invalid for conversion to unicode

In the real world, UnicodeDecodeError crashes are not an option for dealing with unknown, broken or invalid input (such as when I need to print the human representation of a Node to the screen). In many (most?) situations lossless garbage is more welcome than a crash or data loss, and that should be the default behaviour.

The solution is to have a filter preprocess the binary string to escape all non-unicode symbols so that the following lossless transformation becomes possible:

binary -> escaped utf-8 string -> unicode -> binary

I want to know if that's realistic. I need to accomplish that with Python 2.x, but the use case is probably valid for Python 3 as well. This stuff is critical to port SCons to Python 3.x, and I expect the same for other similar tools that have to deal with unknown ascii-binary strings too.

--
anatoly t.
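For reference, the three error strategies listed in the post above can be seen side by side in a short sketch. It uses Python 3 syntax and `bytes.decode()` rather than the Python 2 `unicode()` constructor the post links to, and the sample bytes are made up for illustration.

```python
# Hypothetical sample: Latin-1 encoded text, which is not valid UTF-8.
raw = b"caf\xe9 \xff"

# "ignore" silently drops the undecodable bytes.
ignored = raw.decode("utf-8", errors="ignore")

# "replace" substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# "strict" (the default) raises instead of returning anything.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), strict_failed)
```

In both non-raising cases the original bytes can no longer be recovered from the result, which is the loss the post complains about.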
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2015-05-27 22:47 +1000 |
| Message-ID | <5565bcf3$0$12978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #91294 |
On Wed, 27 May 2015 09:15 pm, anatoly techtonik wrote:
> Hi.
>
> This was labelled offtopic in python-ideas, so I edited and forwarded
> it here. Please CC as I am not subscribed.
>
>
> In short. What I need is a bulletproof way to convert from anything to
> unicode. This requires some kind of escaping to go forward and back.
Why do you need to go back? Just keep the node, and use that.
> Some helper function like u2b() (unicode to binary) and b2u() (that
> also removes escaping). So far I can't find any code that does just
> that.
def bytes2unicode(bytes):
    # Converts bytes to Unicode, allowing garbage (mojibake).
    return bytes.decode('latin1')

def unicode2bytes(unicode):
    # Converts Unicode containing garbage (mojibake) back to bytes.
    return unicode.encode('latin1')
It correctly does the round trip from any sequence of bytes to unicode and
back to bytes, losslessly:
py> import random
py> node = bytes([random.randrange(0, 256) for _ in range(100000)])
py> uni = bytes2unicode(node)
py> b = unicode2bytes(uni)
py> b == node
True
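The reason this works can be checked exhaustively rather than with random data: Latin-1 maps every byte value 0-255 to the code point with the same number. A minimal sketch, assuming Python 3 (the variable names are illustrative only):

```python
# Every possible byte value, once each.
all_bytes = bytes(bytearray(range(256)))

# Decoding as Latin-1 can never fail: byte N becomes code point N.
text = all_bytes.decode("latin1")
code_points = [ord(ch) for ch in text]

# Encoding back reverses the mapping exactly.
round_tripped = text.encode("latin1")

print(code_points == list(range(256)), round_tripped == all_bytes)
```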
But take careful note that you can't start with Unicode and still expect to
round-trip losslessly. Many perfectly readable Unicode strings do *not*
convert to bytes:
py> unicode2bytes(u'ДЙ') # two Cyrillic letters
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in unicode2bytes
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)
That means that if you take a correctly encoded string, it will round-trip,
but it will also display as garbage:
py> s = u'ДЙ'
py> node = s.encode('utf-8')
py> print(node) # Correctly encoded UTF-8
b'\xd0\x94\xd0\x99'
py> node == unicode2bytes(bytes2unicode(node)) # round trips okay
True
py> print(repr(bytes2unicode(node))) # but prints as crap
'Ð\x94Ð\x99'
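The garbage is only a display problem. If the real encoding becomes known later, the Latin-1 string can be turned back into bytes and decoded properly. A small sketch, assuming Python 3 and that the real encoding turns out to be UTF-8:

```python
node = u"ДЙ".encode("utf-8")        # correctly encoded UTF-8 bytes

# Decoded blindly as Latin-1: lossless, but unreadable.
garbled = node.decode("latin1")

# Once the real encoding is known, undo the Latin-1 step and decode again.
repaired = garbled.encode("latin1").decode("utf-8")

print(repr(garbled), repaired)
```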
> Background story. I need to print SCons graph. SCons is a build tool,
> so it has a graph of nodes - what depends on what. I have no idea
> what a node object could be. I know only that it can have human
> readable representation. Sometimes node is a filename in some
> encoding that is not utf-8, and without knowing the encoding,
> converting it to unicode is not possible without losing the information
> about that filename.
py> filename = "My Russian ДЙ name" # Unicode
py> b = filename.encode('koi8-r') # Oops, not UTF-8!
py> b.decode("utf-8") # Fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 11: invalid continuation byte
py> b.decode("utf-8", errors="replace") # lossy, but works
'My Russian �� name'
py> s = b.decode("utf-8", errors="surrogateescape") # magic!
py> s
'My Russian \udce4\udcea name'
It round-trips as well:
py> s.encode("utf-8", errors="surrogateescape") == b
True
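One caveat for the printing use case: a string containing these escaped surrogates cannot be encoded strictly, so writing it to a UTF-8 terminal or file may itself raise. A possible workaround, sketched here for Python 3, is to apply the standard "backslashreplace" handler for display only and keep the surrogateescape string for the round trip:

```python
original = "My Russian ДЙ name".encode("koi8-r")
s = original.decode("utf-8", errors="surrogateescape")

# Strict encoding of lone surrogates fails...
try:
    s.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...so for display, render them as visible \uXXXX escapes instead.
printable = s.encode("utf-8", errors="backslashreplace").decode("utf-8")
print(printable)

# The surrogateescape string itself still round-trips to the original bytes.
restored = s.encode("utf-8", errors="surrogateescape")
```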
Converting this back to Python 2.7 is left as an exercise for the reader.
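As a starting point for that exercise, here is one possible sketch of the backslash-escaping round trip the original post asks for, without relying on surrogateescape. It is only an illustration: the handler name "backslash_bytes" and the escape format are invented here, `b2u()` and `u2b()` merely reuse the names from the original post, and the code has been written for Python 3 and is untested on 2.7.

```python
import codecs
import re

def _backslash_bytes(err):
    # Decode-side error handler: show each undecodable byte as \xNN.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = bytearray(err.object[err.start:err.end])
    return u"".join(u"\\x%02x" % b for b in bad), err.end

# "backslash_bytes" is a made-up handler name for this sketch.
codecs.register_error("backslash_bytes", _backslash_bytes)

# Matches the two escapes b2u() can produce: \\ and \xNN.
_ESCAPE = re.compile(u"\\\\(\\\\|x[0-9a-f]{2})")

def b2u(data):
    # Double literal backslashes first so they cannot be mistaken
    # for escapes later, then decode with the custom handler.
    return data.replace(b"\\", b"\\\\").decode("utf-8", "backslash_bytes")

def u2b(text):
    # Inverse of b2u(); only meaningful for strings that b2u() produced.
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(text):
        out += text[pos:m.start()].encode("utf-8")
        token = m.group(1)
        if token == u"\\":
            out += b"\\"
        else:
            out.append(int(token[1:], 16))
        pos = m.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)

koi8_name = b"My Russian \xe4\xea name"
escaped = b2u(koi8_name)
print(escaped)
```

Valid UTF-8 passes through readable, invalid bytes stay visible as escapes, and `u2b(b2u(data)) == data` holds for any input bytes. As with the latin1 pair, the reverse direction is not general: `u2b()` is not meant for arbitrary unicode strings.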
--
Steven
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-05-27 06:31 -0700 |
| Message-ID | <f456b61c-b546-4490-a7f6-bfbff14cfe85@googlegroups.com> |
| In reply to | #91303 |
In reply to what Steven D'Aprano wrote on Wednesday 27 May 2015, 14:47:59 UTC+2:
This is so brilliant that I do not think
it is worth commenting on.
jmf
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-05-27 07:00 -0700 |
| Message-ID | <053a0efd-0677-4302-8ad5-ed2c859ec0b0@googlegroups.com> |
| In reply to | #91308 |
==========
- Fair play, jmf. Fair play.
- Yes

>>> "\udce4" == "Д"
False
>>> "\udcea" == "Й"
False
>>>
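The comparison above shows that the escaped surrogates preserve the bytes, not the characters. The readable text only comes back if the bytes are eventually decoded with the right codec. A short sketch, assuming Python 3 and that the encoding is later discovered to be KOI8-R, as in the earlier example:

```python
b = "My Russian ДЙ name".encode("koi8-r")
s = b.decode("utf-8", errors="surrogateescape")

# The placeholders are not the Cyrillic letters themselves.
has_cyrillic = "Д" in s

# Getting the text back means recovering the bytes first,
# then decoding them with the encoding that was actually used.
recovered = s.encode("utf-8", errors="surrogateescape").decode("koi8-r")

print(has_cyrillic, recovered)
```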