Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

From anatoly techtonik <techtonik@gmail.com>
Date Fri, 29 May 2015 12:46:01 +0300
Subject Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)
To Laura Creighton <lac@openend.se>
Cc "python-list@python.org" <python-list@python.org>
Newsgroups comp.lang.python
Message-ID <mailman.166.1432892784.5151.python-list@python.org>


On Fri, May 29, 2015 at 11:41 AM, Laura Creighton <lac@openend.se> wrote:
> In a message of Fri, 29 May 2015 11:05:07 +0300, anatoly techtonik writes:
>
>>Added Mailman to my suxx tracker:
>>https://github.com/techtonik/suxx-tracker#mailman
>
> You are damning the wrong piece of software -- this is not a problem
> with mailman; mailman doesn't care at all what software you use to
> read mail and reply to it with.  The problem is with the various
> readers and repliers that people are using.  In particular, people on
> the other side of the usenet -> python-list gateway may not be seeing
> this as mail at all, or sending their replies as mail.

Sounds legit. But the "ux" in the middle of suxx stands for user
experience, and Mailman still doesn't improve it. If Mailman could
subscribe me automatically to the thread I am starting, that would
resolve all the problems.

> But back to your original problem.
>
> I still don't understand why you need to go from some lossless
> representation of your filename, back to the original.

It just so happens that the only way to get the graph out of SCons
is to print its tree representation. That worked fine until we
switched from StringIO to its unicode equivalent, io.StringIO.
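
To illustrate the kind of failure this switch exposes, here is a minimal
sketch (not the actual SCons code). The node name `raw_name` is a made-up
example of Cyrillic text stored as utf-8 bytes. io.StringIO accepts only
unicode text, so bytes have to be decoded first, and the 'ascii' codec
cannot decode them:

```python
import io

# Hypothetical node name: Cyrillic text stored as UTF-8 bytes.
raw_name = u"\u0444\u0430\u0439\u043b.txt".encode("utf-8")

buf = io.StringIO()

# io.StringIO only accepts unicode text, so raw bytes are rejected.
try:
    buf.write(raw_name)
    accepted_bytes = True
except TypeError:
    accepted_bytes = False

# Decoding with 'ascii' (the Python 2 default encoding) fails on these bytes.
try:
    raw_name.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Explicit decoding as UTF-8 works, but only while the bytes are valid UTF-8.
buf.write(raw_name.decode("utf-8"))
```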

Dumping binary stuff in text form is a very common and reliable
way to back up and process data. From SQL dumps to SVN dumps,
all these formats are convenient to store, transmit and process.

> You start
> with the binary version of the filename  -- a series of bytes which
> turns out to be good Cyrillic text, but could be anything.

Right, good Cyrillic text in utf-8, but Python 2.x uses 'ascii' as its
default encoding. If Python 2.x used 'utf-8' instead, there wouldn't be
an issue. For now. But I realize that this is not enough: I want 100%
protection from unwanted crashes and data loss, so I want to
backslash non-utf-8 bytes when converting the data to unicode.
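
A minimal sketch of what such backslashing could look like, under stated
assumptions. It registers a custom decode error handler with
codecs.register_error; the handler name "backslash_bytes" and the helpers
to_unicode() and from_unicode() are hypothetical names made up for this
example. Literal backslashes are doubled first, so that an escaped byte
cannot be confused with text that already looked like an escape. The code
avoids Python 3-only syntax, but treat its behaviour on older 2.x
interpreters as unverified:

```python
import codecs
import re


def _backslash_bytes(exc):
    # Replace each undecodable byte with a \xNN escape and resume after it.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = bytearray(exc.object[exc.start:exc.end])
    return (u"".join(u"\\x%02x" % b for b in bad), exc.end)


# "backslash_bytes" is a hypothetical handler name chosen for this sketch.
codecs.register_error("backslash_bytes", _backslash_bytes)


def to_unicode(raw):
    # Double literal backslashes so the escapes stay unambiguous.
    return raw.replace(b"\\", b"\\\\").decode("utf-8", "backslash_bytes")


# Matches a backslash followed by another backslash or by xNN.
_ESCAPE = re.compile(u"\\\\(\\\\|x[0-9a-f]{2})")


def from_unicode(text):
    # Inverse of to_unicode(): rebuild the original bytes.
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        out += text[pos:match.start()].encode("utf-8")
        token = match.group(1)
        out += b"\\" if token == u"\\" else bytearray([int(token[1:], 16)])
        pos = match.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)
```

With this, from_unicode(to_unicode(raw)) should give back the original
bytes, at the cost of doubled backslashes in the text form.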

> You store
> that as the first so many bytes of your file. If ever you need to have
> the original representation of your filename, you already have it,
> right there, by reading the first so many bytes of your file.  Why
> care about what the user sees as a filename?

Not sure that I understand. I don't store anything in a file. The build
graph is a representation of the filesystem structure, with entries that
may or may not exist. A Node in the build graph can also be a string that
is never written to disk. When I dump the graph, I have no idea how I will
process it, but when I need to identify some Node, grep for it, or find
a reference to it, I want its representation (which may as well serve
as an ID) to be preserved, to avoid conflicts and wrong interpretation
due to data loss.
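
To make the conflict concrete, here is a small sketch with two made-up
node names that are different byte strings, neither of them valid utf-8.
A lossy conversion maps both to the same text, while an escaping one keeps
them apart. Note the assumption: the built-in 'backslashreplace' handler
works for decoding only in Python 3.5 and later, so this shows the desired
behaviour rather than a Python 2 solution:

```python
# Hypothetical node names: two different byte strings, neither valid UTF-8.
names = [b"caf\xe9.txt", b"caf\xe8.txt"]

# Lossy conversion: every bad byte becomes the same U+FFFD replacement char.
lossy = [n.decode("utf-8", "replace") for n in names]

# Escaping conversion (decoding support requires Python 3.5+).
escaped = [n.decode("utf-8", "backslashreplace") for n in names]

# Two distinct nodes end up with the same ID only in the lossy case.
lossy_collides = len(set(lossy)) < len(names)
escaped_collides = len(set(escaped)) < len(names)
```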

Hopefully my user story is clear now. Can you tell me how I can
do this bulletproof unicode conversion in Python 2? =)
-- 
anatoly t.
