Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

Date: Fri, 29 May 2015 18:19:14 +1000
From: Chris Angelico <rosuav@gmail.com>
Newsgroups: comp.lang.python
Message-ID: <mailman.162.1432887557.5151.python-list@python.org>


On Fri, May 29, 2015 at 6:05 PM, anatoly techtonik <techtonik@gmail.com> wrote:
>> On Wed, May 27, 2015 at 9:52 PM, anatoly techtonik <techtonik@gmail.com> wrote:
>>> And the short answer is that we need unicode because we are printing this
>>> information to the stdout, and stdout is opened in text mode at least on
>>> Windows, and without explicit conversion, Python will try to decode stuff
>>> as being `ascii` and fail anyway.
>>
>> So you're working with text.
>
> No. It is unknown.
>
> I am printing Nodes of the SCons build graph and I don't know how Nodes are
> represented. In my case it turned out that a Node contained Russian text, which
> led to a crash of SCons. It could contain Russian text in cp1251 or in utf-8 or in
> KOI-8 and I can't guess among all possible encodings there. I just need to
> print that tree without a crash or information loss.

You're saying it's text, but you don't know the encoding. You're
trying to display bytes as if they're text, but fundamentally, you're
trying to work with text.

>> That means you HAVE to decode it somehow;
>> you fundamentally cannot print bytes to the console. Lossless
>> concealment of arbitrary bytes won't help you.
>
> Won't help me with what? I am debugging build scripts to find out the
> *structure* of my dependencies and then all of a sudden Python crashes
> with UnicodeDecodeError, leaving me pronouncing bad Russian curses
> aloud.

Your fundamental problem is not the UnicodeDecodeError, but the
unknown encoding. What you're seeing is that Python refuses to be
sloppy.
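
A minimal Python 3 sketch of why the unknown encoding, rather than the exception, is the real problem. The sample word and the variable names are assumptions made up for illustration, not taken from the SCons output. The same bytes decode without complaint under more than one codec, but only the right codec gives back the intended text:

```python
# Hypothetical sample: a Russian word that was encoded as cp1251.
original = "Привет"
raw = original.encode("cp1251")

# cp1251 and koi8_r both assign a character to every one of these byte
# values, so both decodes succeed. Only the right codec restores the text.
as_cp1251 = raw.decode("cp1251")
as_koi8 = raw.decode("koi8_r")

# UTF-8 is stricter: these bytes are not a valid UTF-8 sequence,
# so Python raises instead of guessing.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(as_cp1251 == original, as_koi8 == original, utf8_failed)
```

A wrong guess among the single-byte codecs produces mojibake silently, which is arguably worse than the loud failure from UTF-8.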

>> If you can't adequately
>> decode everything, either backslash-escape the rest, or use a
>> replacement character; you can't print out those bytes.
>
> Yes. How to backslash the rest in Python 2? In Python 3 there is
> some freaky "surrogateescape" error strategy, but what to do in
> Python 2?

Not sure what's so freaky about it. But hey. If Python 2 can't do what
you want, is it so hard to use Python 3? Unicode support really is
better. Alternatively, just do something like this:

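# Python 2: b is a byte string (str) in an unknown encoding
b = "some arbitrary byte string that you got from somewhere"
try:
    # Use the text as-is if it happens to be valid UTF-8
    text = b.decode("utf-8")
except UnicodeDecodeError:
    # Otherwise fall back to the escaped repr, which is pure ASCII
    text = repr(b).decode("ascii")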

The repr of a byte string in Py2 should be a safe way to display
arbitrary bytes, without data loss. It will expand the string
significantly (four characters for one \xNN escape, plus adding
backslashes to everything else that needs them), but it does guarantee
safety.
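
For comparison, a rough Python 3 sketch of the same fallback, plus the error handlers mentioned earlier in the thread. The helper name to_display_text and the sample bytes are hypothetical; the decode-side "backslashreplace" handler assumes Python 3.5 or later:

```python
import ast


def to_display_text(raw: bytes) -> str:
    """Decode valid UTF-8 as text; otherwise return the escaped repr."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # repr() of a bytes object is pure ASCII, e.g. b'\xcf\xf0'
        return repr(raw)


# Hypothetical undecodable node name, including a literal backslash.
raw = b"\xcf\xf0\xe8 path\\file"
shown = to_display_text(raw)

# The repr is a bytes literal, so the original bytes can be recovered.
recovered = ast.literal_eval(shown)

# "backslashreplace" escapes only the undecodable bytes. It is readable,
# but literal backslashes are left alone, so reversing it is ambiguous.
escaped = raw.decode("utf-8", errors="backslashreplace")

# "surrogateescape" carries the bad bytes through str and back unchanged.
smuggled = raw.decode("utf-8", errors="surrogateescape")
restored = smuggled.encode("utf-8", errors="surrogateescape")

print(shown)
print(escaped)
```

Two caveats on this sketch: a log that mixes plain decoded text with repr fallbacks needs some way to tell the two apart during post-processing (always writing the repr is one option), and a surrogateescape string cannot be encoded to a strict UTF-8 stdout without another error handler.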

> A replacement character is not a solution, because it is data loss,
> and if I want to do post-processing of the graph log, I won't be able to
> recover the missing bits.
>
>> And no, I will not cc you. Subscribe to the list if you're going to
>> ask a question.
>
> Added Mailman to my suxx tracker:
> https://github.com/techtonik/suxx-tracker#mailman

Why? You're trying to fire questions out to a community without being
a part of that community. Why is that the software's problem?

You can either subscribe to the list/ng or follow via some web
interface, but it's unreasonable to ask everyone to cc you. Imagine if
we _did_ all cc you, but we also cc you in on an entire sub-thread
that you're not interested in. Or maybe half of us do and half don't.
What then? You don't get any sort of control over what you get copies
of. Is that really what you want?

ChrisA
