Path: csiph.com!usenet.pasdenom.info!goblin2!goblin.stu.neva.ru!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <f1d3acb6-94e2-48f6-8ccd-042b929d0ef4@googlegroups.com>
References: <f1d3acb6-94e2-48f6-8ccd-042b929d0ef4@googlegroups.com>
Date: Sun, 24 Feb 2013 03:00:10 +1100
Subject: Re: Good cross-version ASCII serialisation protocol for simple types
From: Chris Angelico <rosuav@gmail.com>
To: python-list@python.org
Content-Type: text/plain; charset=ISO-8859-1
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.2353.1361635213.2939.python-list@python.org>
Lines: 40
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:39678

On Sun, Feb 24, 2013 at 2:45 AM, Paul Moore <p.f.moore@gmail.com> wrote:
> At the moment, I'm using
>
> encoded = json.dumps([ord(c) for c in json.dumps(obj)])
> decoded = json.loads(''.join([chr(n) for n in json.loads(encoded)]))
>
> The double-encoding ensures that non-ASCII characters don't make it into the result.
>
> This works fine, but is there something simpler (i.e., less of a hack!) that I could use? (Base64 and the like don't work because they encode bytes->strings, not strings->strings).

Hmm. How likely is it that you'll have non-ASCII characters in the
input? If they're uncommon, you could use UTF-7: it's space-efficient
when the input is mostly ASCII, but inefficient for other characters.

Not sure what the problem is with bytes vs strings; you can always do
an encode("ascii") or decode("ascii") to convert 7-bit strings between
those types.
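
As a rough sketch of that idea applied to Base64 (the helper names here
are made up for illustration, and UTF-8 as the intermediate encoding is
an assumption), wrapping the bytes API so it maps strings to strings
could look like this:

```python
import base64

def b64_encode_text(text):
    # Hypothetical helper: text -> UTF-8 bytes -> Base64 bytes -> ASCII-only string.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def b64_decode_text(encoded):
    # Reverse the steps: ASCII string -> Base64 bytes -> UTF-8 bytes -> text.
    return base64.b64decode(encoded.encode("ascii")).decode("utf-8")

original = u"asdf\u1234zxcv"
encoded = b64_encode_text(original)
decoded = b64_decode_text(encoded)
```

The result is less readable than UTF-7 or JSON escapes, since even the
ASCII parts of the input get scrambled.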

With that covered, I'd just go with a single JSON packaging, and work
with the resulting Unicode string.

Python 2.6:
>>> s=u"asdf\u1234zxcv"
>>> s.encode("utf-7").decode("ascii")
u'asdf+EjQ-zxcv'

Python 3.3:
>>> s=u"asdf\u1234zxcv"
>>> s.encode("utf-7").decode("ascii")
'asdf+EjQ-zxcv'
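
Going back the other way is the same pair of calls in reverse. A
minimal sketch (the function names are just for illustration, not
anything from the stdlib):

```python
def to_utf7_text(text):
    # Encode to UTF-7 bytes, then reinterpret those 7-bit bytes as a string.
    return text.encode("utf-7").decode("ascii")

def from_utf7_text(encoded):
    # Turn the ASCII-only string back into bytes and decode the UTF-7.
    return encoded.encode("ascii").decode("utf-7")

s = u"asdf\u1234zxcv"
wire = to_utf7_text(s)
restored = from_utf7_text(wire)
```

Note that "+" is the shift character in UTF-7, so a literal "+" in the
input gets escaped on the way out and restored on the way back.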

Another option would be to JSON-encode in pure-ASCII mode (which is
the default for json.dumps):

>>> import json
>>> json.dumps([s],ensure_ascii=True)
'["asdf\\u1234zxcv"]'
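
A quick sketch of the full round trip with a single JSON pass, assuming
the object is built only from JSON-friendly types (the sample dict is
invented for the example):

```python
import json

# Hypothetical sample object; any mix of dicts, lists, strings and numbers works.
obj = {u"name": u"asdf\u1234zxcv", u"values": [1, 2, 3]}

# ensure_ascii=True escapes every non-ASCII character as \uXXXX.
encoded = json.dumps(obj, ensure_ascii=True)

# json.loads turns the escapes back into the original characters.
decoded = json.loads(encoded)
```

If that holds up for your data, the ord()/chr() layer isn't needed at
all.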

Would that cover it?

ChrisA