| Path | csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!gegeweb.org!de-l.enfer-du-nord.net!feeder2.enfer-du-nord.net!txtfeed1.tudelft.nl!tudelft.nl!txtfeed2.tudelft.nl!amsnews11.chello.com!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail |
|---|---|
| Return-Path | <python-python-list@m.gmane.org> |
| X-Original-To | python-list@python.org |
| Delivered-To | python-list@mail.python.org |
| X-Injected-Via-Gmane | http://gmane.org/ |
| To | python-list@python.org |
| From | Stefan Behnel <stefan_ml@behnel.de> |
| Subject | Re: suppressing bad characters in output PCDATA (converting JSON to XML) |
| Date | Tue, 29 Nov 2011 15:33:57 +0100 |
| References | <91j4q8xgv9.ln2@news.ducksburg.com> <mailman.3102.1322504456.27778.python-list@python.org> <ie1fq8x1r2.ln2@news.ducksburg.com> |
| Mime-Version | 1.0 |
| Content-Type | text/plain; charset=UTF-8; format=flowed |
| Content-Transfer-Encoding | 7bit |
| X-Gmane-NNTP-Posting-Host | host-188-174-186-186.customer.m-online.net |
| User-Agent | Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23) Gecko/20110921 Lightning/1.0b2 Thunderbird/3.1.15 |
| In-Reply-To | <ie1fq8x1r2.ln2@news.ducksburg.com> |
| X-BeenThere | python-list@python.org |
| X-Mailman-Version | 2.1.12 |
| Precedence | list |
| List-Id | General discussion list for the Python programming language <python-list.python.org> |
| List-Unsubscribe | <http://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe> |
| List-Archive | <http://mail.python.org/pipermail/python-list> |
| List-Post | <mailto:python-list@python.org> |
| List-Help | <mailto:python-list-request@python.org?subject=help> |
| List-Subscribe | <http://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.3126.1322577261.27778.python-list@python.org> (permalink) |
| Lines | 75 |
| NNTP-Posting-Host | 2001:888:2000:d::a6 |
| X-Trace | 1322577261 news.xs4all.nl 6899 [2001:888:2000:d::a6]:58858 |
| X-Complaints-To | abuse@xs4all.nl |
| Xref | x330-a1.tempe.blueboxinc.net comp.lang.python:16394 |
Adam Funk, 29.11.2011 13:57:
> On 2011-11-28, Stefan Behnel wrote:
>> Adam Funk, 25.11.2011 14:50:
>>> Then I recurse through the contents of big_json to build an instance
>>> of xml.dom.minidom.Document (the recursion includes some code to
>>> rewrite dict keys as valid element names if necessary)
>>
>> If the name "big_json" is supposed to hint at a large set of data, you may
>> want to use something other than minidom. Take a look at the
>> xml.etree.cElementTree module instead, which is substantially more memory
>> efficient.
>
> Well, the input file in this case contains one big JSON list of
> reasonably sized elements, each of which I'm turning into a separate
> XML file. The output files range from 600 to 6000 bytes.

It's also substantially easier to use, but if your XML writing code works
already, why change it.

>>> and I save the document:
>>>
>>> xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
>>> doc.writexml(xml_file, encoding='UTF-8')
>>> xml_file.close()
>>
>> Same mistakes as above. Especially the double encoding is both unnecessary
>> and likely to fail. This is also most likely the source of your problems.
>
> Well actually, I had the problem with the occasional control
> characters in the output *before* I started sticking encoding="UTF-8"
> all over the place (in an unsuccessful attempt to beat them down).

You should read up on Unicode a bit.

>>> I thought this would force all the output to be valid, but xmlstarlet
>>> gives some errors like these on a few documents:
>>>
>>> PCDATA invalid Char value 7
>>> PCDATA invalid Char value 31
>>
>> This strongly hints at a broken encoding, which can easily be triggered by
>> your erroneous encode-and-encode cycles above.
>
> No, I've checked the JSON input and those exact control characters are
> there too.

Ah, right, I didn't look closely enough. Those are forbidden in XML:

http://www.w3.org/TR/REC-xml/#charsets

It's sad that minidom (apparently) lets them pass through without even a
warning.

> I want to suppress them (delete or replace with spaces).

Ok, then you need to process your string content while creating XML from
it. If replacing is enough, take a look at string.maketrans() in the string
module and str.translate(), a method on strings. Or maybe just use a
regular expression that matches any whitespace character and replace it
with a space. Or whatever suits your data best.

>> Also, the kind of problem you present here makes it pretty clear that you
>> are using Python 2.x. In Python 3, you'd get the appropriate exceptions
>> when trying to write binary data to a Unicode file.
>
> Sorry, I forgot to mention the version I'm using, which is "2.7.2+".

Yep, Py2 makes Unicode handling harder than it should be.

Stefan
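
For anyone finding this thread later, here is a minimal sketch of the regular-expression route mentioned above. It makes a few assumptions that are not from the thread: it is written for Python 3 rather than the 2.7.2 under discussion, it uses xml.etree.ElementTree instead of minidom, and the names clean_text and make_element are made up for illustration. The pattern targets every character outside the XML 1.0 "Char" production from the W3C link, not whitespace, since values 7 and 31 are control characters.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production is forbidden in a document:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_text(text, replacement=" "):
    """Replace each character that XML 1.0 forbids (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub(replacement, text)


def make_element(tag, text):
    """Build an element whose text has been cleaned (hypothetical helper)."""
    elem = ET.Element(tag)
    elem.text = clean_text(text)
    return elem


# Sample data containing the two characters xmlstarlet complained about.
raw = "bell\x07 and unit separator\x1f go; tab\tand caf\u00e9 stay"
elem = make_element("item", raw)

# Encode exactly once, at serialization time; the result is a bytes object
# that can be written to a file opened in binary mode.
xml_bytes = ET.tostring(elem, encoding="UTF-8")
```

Passing replacement="" deletes the offending characters instead of replacing them with spaces. Cleaning the text at the point where each node is created keeps the fix independent of how the document is later serialized.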
suppressing bad characters in output PCDATA (converting JSON to XML) Adam Funk <a24061@ducksburg.com> - 2011-11-25 13:50 +0000
Re: suppressing bad characters in output PCDATA (converting JSON to XML) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-11-28 12:11 +0000
Re: suppressing bad characters in output PCDATA (converting JSON to XML) Adam Funk <a24061@ducksburg.com> - 2011-11-29 12:50 +0000
Re: suppressing bad characters in output PCDATA (converting JSON to XML) Stefan Behnel <stefan_ml@behnel.de> - 2011-11-28 19:20 +0100
Re: suppressing bad characters in output PCDATA (converting JSON to XML) Adam Funk <a24061@ducksburg.com> - 2011-11-29 12:57 +0000
Re: suppressing bad characters in output PCDATA (converting JSON to XML) Stefan Behnel <stefan_ml@behnel.de> - 2011-11-29 15:33 +0100
Re: suppressing bad characters in output PCDATA (converting JSON to XML) Adam Funk <a24061@ducksburg.com> - 2011-12-02 10:30 +0000