From: random832@fastmail.us
Newsgroups: comp.lang.python
Subject: Re: Fwd: Lossless bulletproof conversion to unicode (backslashing)
Date: Wed, 27 May 2015 08:40:38 -0400

On Wed, May 27, 2015, at 07:15, anatoly techtonik wrote:
> The solution is to have filter preprocess the binary string to escape all
> non-unicode symbols so that the following lossless transformation
> becomes possible:
>
>    binary -> escaped utf-8 string -> unicode -> binary
>
> I want to know if that's real?
> I need to accomplish that with
> Python 2.x, but the use case is probably valid for Python 3 as well.

In Python 3, you could *in principle* use surrogateescape (this would be
more of a binary -> escaped unicode workflow), but see below.

It is worth noting that when you *read* POSIX filenames in unicode form
(e.g. os.listdir with a unicode argument), they are decoded with
surrogateescape, and can be returned to bytes format with
fn.encode(sys.getfilesystemencoding(), errors='surrogateescape').

However, keep in mind that on *Windows*, the native filename format is a
sequence of 16-bit WCHAR values, not a sequence of bytes.

> This stuff is critical to port SCons to Python 3.x and I expect for other
> similar tools that have to deal with unknown ascii-binary strings too.

Even if your filename *is* valid UTF-8 (or whatever other encoding), it
might contain invisible control characters that make it difficult to
read. You'd probably be better off simply working directly with the
binary representation, iterating over it and replacing all
non-*ascii*-printable bytes with an escaped representation. As it
happens, the repr() function should work well for doing exactly this.
(Note: repr on a *unicode* string in Python 3 will pass non-ASCII
characters through unescaped, but ideally you're working with byte
strings.)

There's no real need to go beyond this unless you're working in a
problem domain where filenames are likely to legitimately include
non-ASCII characters (e.g. user documents of non-technical users who use
languages other than English).
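
To make the surrogateescape round trip concrete, here is a minimal
sketch for Python 3. The filename bytes are made up for illustration,
and UTF-8 is hard-coded as an assumption; real code would use
sys.getfilesystemencoding() as described above.

```python
# Hypothetical raw filename: ASCII text plus two bytes that are not valid UTF-8.
raw = b"report-\xff\xfe.txt"

# With surrogateescape, each undecodable byte becomes a lone surrogate
# in the range U+DC80..U+DCFF instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same codec and error handler restores the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")
assert restored == raw
```

Note that the intermediate str contains lone surrogates, so it cannot be
encoded with the default strict error handler; it is only useful as a
carrier that gets you back to the original bytes.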
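
The repr() suggestion can be sketched as follows, assuming Python 3 and
a bytes object. The sample bytes are hypothetical, and using
ast.literal_eval to reverse the escaping is my own addition rather than
something the original question asked for.

```python
import ast

# Hypothetical filename bytes: UTF-8 for "é" followed by a BEL control byte.
raw = b"caf\xc3\xa9\x07.txt"

# repr() keeps printable ASCII as-is and writes every other byte as an escape.
escaped = repr(raw)
print(escaped)  # b'caf\xc3\xa9\x07.txt'

# The result contains only printable ASCII, so it is safe to log or display.
assert all(32 <= ord(ch) < 127 for ch in escaped)

# ast.literal_eval parses the literal back into bytes without running code.
assert ast.literal_eval(escaped) == raw
```

In Python 2 the same idea applies to a plain str, whose repr has no b
prefix.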