| From | Stefan Behnel <stefan_ml@behnel.de> |
|---|---|
| Subject | Re: suppressing bad characters in output PCDATA (converting JSON to XML) |
| Date | 2011-11-29 15:33 +0100 |
| References | <91j4q8xgv9.ln2@news.ducksburg.com> <mailman.3102.1322504456.27778.python-list@python.org> <ie1fq8x1r2.ln2@news.ducksburg.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.3126.1322577261.27778.python-list@python.org> |
Adam Funk, 29.11.2011 13:57:
> On 2011-11-28, Stefan Behnel wrote:
>> Adam Funk, 25.11.2011 14:50:
>>> Then I recurse through the contents of big_json to build an instance
>>> of xml.dom.minidom.Document (the recursion includes some code to
>>> rewrite dict keys as valid element names if necessary)
>>
>> If the name "big_json" is supposed to hint at a large set of data, you may
>> want to use something other than minidom. Take a look at the
>> xml.etree.cElementTree module instead, which is substantially more memory
>> efficient.
>
> Well, the input file in this case contains one big JSON list of
> reasonably sized elements, each of which I'm turning into a separate
> XML file. The output files range from 600 to 6000 bytes.

It's also substantially easier to use, but if your XML writing code works
already, why change it.

>>> and I save the document:
>>>
>>> xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
>>> doc.writexml(xml_file, encoding='UTF-8')
>>> xml_file.close()
>>
>> Same mistakes as above. Especially the double encoding is both unnecessary
>> and likely to fail. This is also most likely the source of your problems.
>
> Well actually, I had the problem with the occasional control
> characters in the output *before* I started sticking encoding="UTF-8"
> all over the place (in an unsuccessful attempt to beat them down).

You should read up on Unicode a bit.

>>> I thought this would force all the output to be valid, but xmlstarlet
>>> gives some errors like these on a few documents:
>>>
>>> PCDATA invalid Char value 7
>>> PCDATA invalid Char value 31
>>
>> This strongly hints at a broken encoding, which can easily be triggered by
>> your erroneous encode-and-encode cycles above.
>
> No, I've checked the JSON input and those exact control characters are
> there too.

Ah, right, I didn't look closely enough. Those are forbidden in XML:

http://www.w3.org/TR/REC-xml/#charsets

It's sad that minidom (apparently) lets them pass through without even a
warning.

> I want to suppress them (delete or replace with spaces).

Ok, then you need to process your string content while creating XML from
it. If replacing is enough, take a look at string.maketrans() in the string
module and str.translate(), a method on strings. Or maybe just use a
regular expression that matches any whitespace character and replace it
with a space. Or whatever suits your data best.

>> Also, the kind of problem you present here makes it pretty clear that you
>> are using Python 2.x. In Python 3, you'd get the appropriate exceptions
>> when trying to write binary data to a Unicode file.
>
> Sorry, I forgot to mention the version I'm using, which is "2.7.2+".

Yep, Py2 makes Unicode handling harder than it should be.

Stefan
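For illustration, the two suggestions above (a translation table or a regular expression) might look roughly like the sketch below. It is not code from the thread. It assumes Python 3, where translation tables are plain dicts rather than string.maketrans() results, and the names INVALID_XML_CHARS, CONTROL_TO_SPACE, clean_regex, clean_translate and to_xml_bytes are made up for this example. The regular expression targets the code points that fall outside the XML 1.0 Char production rather than whitespace, since BEL (7) is not a whitespace character.

```python
import re
import xml.etree.ElementTree as ET

# Code points outside the XML 1.0 "Char" production: C0 controls other than
# tab, LF and CR, the surrogate range, and U+FFFE / U+FFFF.
INVALID_XML_CHARS = re.compile(
    r'[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]'
)

# Translation table covering only the C0 controls; each maps to a space.
CONTROL_TO_SPACE = {
    cp: ' ' for cp in range(0x20) if cp not in (0x09, 0x0A, 0x0D)
}


def clean_regex(text, replacement=' '):
    """Replace every XML-invalid character; pass '' to delete instead."""
    return INVALID_XML_CHARS.sub(replacement, text)


def clean_translate(text):
    """Same idea via str.translate(), restricted to C0 control characters."""
    return text.translate(CONTROL_TO_SPACE)


def to_xml_bytes(tag, text):
    """Build one element from cleaned text and encode it exactly once."""
    elem = ET.Element(tag)
    elem.text = clean_regex(text)
    return ET.tostring(elem, encoding='utf-8')


if __name__ == '__main__':
    print(repr(clean_regex('a\x07b\x1fc')))   # 'a b c'
    print(to_xml_bytes('item', 'bell\x07 here'))
```

Cleaning the text while the tree is being built, and leaving the encoding to the serializer alone, keeps the output to a single encoding step, which is the point made earlier about the codecs.open() plus writexml(encoding=...) combination.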
suppressing bad characters in output PCDATA (converting JSON to XML) Adam Funk <a24061@ducksburg.com> - 2011-11-25 13:50 +0000
Re: suppressing bad characters in output PCDATA (converting JSON to XML) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-11-28 12:11 +0000
Re: suppressing bad characters in output PCDATA (converting JSON to XML) Adam Funk <a24061@ducksburg.com> - 2011-11-29 12:50 +0000
Re: suppressing bad characters in output PCDATA (converting JSON to XML) Stefan Behnel <stefan_ml@behnel.de> - 2011-11-28 19:20 +0100
Re: suppressing bad characters in output PCDATA (converting JSON to XML) Adam Funk <a24061@ducksburg.com> - 2011-11-29 12:57 +0000
Re: suppressing bad characters in output PCDATA (converting JSON to XML) Stefan Behnel <stefan_ml@behnel.de> - 2011-11-29 15:33 +0100
Re: suppressing bad characters in output PCDATA (converting JSON to XML) Adam Funk <a24061@ducksburg.com> - 2011-12-02 10:30 +0000