Re: suppressing bad characters in output PCDATA (converting JSON to XML)

From Stefan Behnel <stefan_ml@behnel.de>
Subject Re: suppressing bad characters in output PCDATA (converting JSON to XML)
Date Mon, 28 Nov 2011 19:20:40 +0100
Newsgroups comp.lang.python
Message-ID <mailman.3102.1322504456.27778.python-list@python.org>


Adam Funk, 25.11.2011 14:50:
> I'm converting JSON data to XML using the standard library's json and
> xml.dom.minidom modules.  I get the input this way:
>
> input_source = codecs.open(input_file, 'rb', encoding='UTF-8', errors='replace')

It doesn't make sense to pass a "b" mode to codecs.open(): when an encoding
is given, the underlying file is always opened in binary mode anyway.


> big_json = json.load(input_source)

You shouldn't decode the input before passing it into json.load(), just
open the file in binary mode. Serialised JSON is defined as being UTF-8
encoded by default (RFC 4627 also allows UTF-16 and UTF-32), not as
decoded Unicode.


> input_source.close()

If json.load() raises an exception, the file will never be closed. All in
all, use this instead:

     import json

     with open(input_file, 'rb') as f:  # closed even if json.load() fails
         big_json = json.load(f)


> Then I recurse through the contents of big_json to build an instance
> of xml.dom.minidom.Document (the recursion includes some code to
> rewrite dict keys as valid element names if necessary)

If the name "big_json" is supposed to hint at a large set of data, you may 
want to use something other than minidom. Take a look at the 
xml.etree.cElementTree module instead, which is substantially more memory 
efficient.
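
As a rough illustration of what the recursion could look like with
ElementTree, here is a minimal sketch. It assumes Python 3 (where the
accelerated implementation is used automatically through
xml.etree.ElementTree), that all dict keys are already valid element
names, and that list entries are wrapped in a made-up "item" element. The
function name build_element is hypothetical, not something from the
original code.

```python
import xml.etree.ElementTree as ET


def build_element(tag, value):
    """Recursively convert decoded JSON (dict/list/scalar) into an Element."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            # Assumption: every key is already a valid XML element name.
            elem.append(build_element(key, child))
    elif isinstance(value, list):
        for child in value:
            # "item" is an arbitrary wrapper name chosen for this sketch.
            elem.append(build_element("item", child))
    elif value is not None:
        # Scalars become text content; str() is a simplification here.
        elem.text = str(value)
    return elem


root = build_element("root", {"name": "example", "tags": ["a", "b"]})
# Passing an encoding makes tostring() return bytes, encoded exactly once.
xml_bytes = ET.tostring(root, encoding="UTF-8")
```

The resulting bytes can be written straight to a file opened in binary
mode.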


> and I save the document:
>
> xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
> doc.writexml(xml_file, encoding='UTF-8')
> xml_file.close()

Same mistakes as above. The double encoding in particular is both
unnecessary and likely to fail. It is also most likely the source of your
problems.
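
One way to encode only once with minidom might look like the sketch
below. It assumes Python 3, where toxml(encoding=...) returns bytes; the
small document and the temporary output path are made up for the example.

```python
import os
import tempfile
import xml.dom.minidom

# Tiny stand-in document with one non-ASCII character.
doc = xml.dom.minidom.Document()
root = doc.createElement("root")
root.appendChild(doc.createTextNode("caf\u00e9"))
doc.appendChild(root)

# Hypothetical output path for this sketch.
output_fullpath = os.path.join(tempfile.mkdtemp(), "out.xml")

# Encode exactly once: toxml(encoding=...) produces bytes, and the file is
# opened in binary mode so nothing gets encoded a second time.
with open(output_fullpath, "wb") as xml_file:
    xml_file.write(doc.toxml(encoding="UTF-8"))
```

The with block also takes care of closing the file if writing fails.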


> I thought this would force all the output to be valid, but xmlstarlet
> gives some errors like these on a few documents:
>
> PCDATA invalid Char value 7
> PCDATA invalid Char value 31

This strongly hints at a broken encoding, which can easily be triggered by 
your erroneous encode-and-encode cycles above.
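
Separately from the encoding question, XML 1.0 does not allow most
control characters in character data, and U+0007 and U+001F are among
them, so they would be rejected even in a correctly encoded file. If such
characters turn out to be present in the decoded JSON strings themselves,
a filter along these lines could drop them before the text nodes are
created. This is only a sketch assuming Python 3; the function name
strip_invalid_xml_chars is made up for the example.

```python
import re

# Matches anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text, replacement=""):
    """Return text with characters that XML 1.0 forbids replaced."""
    return _INVALID_XML_CHARS.sub(replacement, text)


cleaned = strip_invalid_xml_chars("bell\x07 and unit separator\x1f removed")
```

Tab, line feed and carriage return are left alone, since XML 1.0 permits
them.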

Also, the kind of problem you present here makes it pretty clear that you 
are using Python 2.x. In Python 3, you'd get the appropriate exceptions 
when trying to write binary data to a Unicode file.
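
A small sketch of that Python 3 behaviour, using io.StringIO as a
stand-in for a file opened in text mode (the variable names are only for
illustration):

```python
import io

text_file = io.StringIO()  # stands in for a file opened in text mode

try:
    # Already-encoded bytes are handed to a text stream.
    text_file.write("caf\u00e9".encode("utf-8"))
except TypeError as exc:
    # Python 3 refuses to mix bytes and text instead of silently
    # re-encoding the data.
    error = exc
else:
    error = None
```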

Stefan
