Re: suppressing bad characters in output PCDATA (converting JSON to XML)

To python-list@python.org
From Stefan Behnel <stefan_ml@behnel.de>
Subject Re: suppressing bad characters in output PCDATA (converting JSON to XML)
Date Tue, 29 Nov 2011 15:33:57 +0100
Newsgroups comp.lang.python
Message-ID <mailman.3126.1322577261.27778.python-list@python.org>


Adam Funk, 29.11.2011 13:57:
> On 2011-11-28, Stefan Behnel wrote:
>> Adam Funk, 25.11.2011 14:50:
>>> Then I recurse through the contents of big_json to build an instance
>>> of xml.dom.minidom.Document (the recursion includes some code to
>>> rewrite dict keys as valid element names if necessary)
>>
>> If the name "big_json" is supposed to hint at a large set of data, you may
>> want to use something other than minidom. Take a look at the
>> xml.etree.cElementTree module instead, which is substantially more memory
>> efficient.
>
> Well, the input file in this case contains one big JSON list of
> reasonably sized elements, each of which I'm turning into a separate
> XML file.  The output files range from 600 to 6000 bytes.

It's also substantially easier to use, but if your XML writing code works
already, why change it?
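
For reference, here is a minimal sketch of what the ElementTree route could
look like. It assumes Python 3, where the C accelerator behind cElementTree is
used automatically by xml.etree.ElementTree. The helper names build_element()
and serialize() and the sample record are hypothetical, and dict keys are
assumed to be valid element names already. Note that the encoding is applied
exactly once, by write().

```python
import io
import xml.etree.ElementTree as ET


def build_element(tag, value):
    """Recursively turn a JSON-like value into an Element (hypothetical helper)."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        # Assumes every key is already a valid XML element name.
        for key, child in value.items():
            elem.append(build_element(key, child))
    elif isinstance(value, list):
        for child in value:
            elem.append(build_element("item", child))
    else:
        elem.text = str(value)
    return elem


def serialize(root):
    """Serialize to UTF-8 bytes; ElementTree does the one and only encoding step."""
    buf = io.BytesIO()
    ET.ElementTree(root).write(buf, encoding="UTF-8", xml_declaration=True)
    return buf.getvalue()


record = {"title": "caf\u00e9", "tags": ["a", "b"]}
xml_bytes = serialize(build_element("record", record))
```

The resulting bytes can be written to a file opened in binary mode ('wb'),
with no codecs wrapper around it.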


>>> and I save the document:
>>>
>>> xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
>>> doc.writexml(xml_file, encoding='UTF-8')
>>> xml_file.close()
>>
>> Same mistakes as above. Especially the double encoding is both unnecessary
>> and likely to fail. This is also most likely the source of your problems.
>
> Well actually, I had the problem with the occasional control
> characters in the output *before* I started sticking encoding="UTF-8"
> all over the place (in an unsuccessful attempt to beat them down).

You should read up on Unicode a bit.


>>> I thought this would force all the output to be valid, but xmlstarlet
>>> gives some errors like these on a few documents:
>>>
>>> PCDATA invalid Char value 7
>>> PCDATA invalid Char value 31
>>
>> This strongly hints at a broken encoding, which can easily be triggered by
>> your erroneous encode-and-encode cycles above.
>
> No, I've checked the JSON input and those exact control characters are
> there too.

Ah, right, I didn't look closely enough. Those are forbidden in XML:

http://www.w3.org/TR/REC-xml/#charsets

It's sad that minidom (apparently) lets them pass through without even a 
warning.
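
Since minidom does not seem to check this, a small test against the Char
production from that section of the spec can be written by hand. This is only
a sketch, assuming Python 3 strings; is_xml_char() and find_invalid() are
made-up names, not part of any library.

```python
def is_xml_char(ch):
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid(text):
    """Return (index, code point) pairs for characters XML 1.0 forbids."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if not is_xml_char(ch)]
```

Running find_invalid() over each string value before building the document
shows exactly where values like 7 and 31 come from.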


> I want to suppress them (delete or replace with spaces).

Ok, then you need to process your string content while creating XML from
it. If replacing is enough, take a look at string.maketrans() in the string
module and str.translate(), a method on strings (for unicode strings,
translate() takes a dict mapping code points to replacements instead). Or
maybe just use a regular expression that matches the forbidden control
characters and replaces them with a space. Or whatever suits your data best.
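
Both approaches can be sketched roughly as follows, assuming Python 3 strings
(in 2.x the same idea applies to unicode objects, but the pattern would need
adjusting). The names strip_invalid_xml_chars() and CONTROL_TABLE are
hypothetical. The regular expression covers everything outside the XML 1.0
Char production, while the translate() table only handles the C0 control
characters.

```python
import re

# Anything NOT allowed by the XML 1.0 Char production.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text, replacement=" "):
    """Replace every character XML 1.0 forbids (pass '' to delete instead)."""
    return _INVALID_XML_CHARS.sub(replacement, text)


# translate() variant: map each forbidden C0 control character to a space.
CONTROL_TABLE = {
    cp: " " for cp in range(0x20) if cp not in (0x9, 0xA, 0xD)
}

cleaned = "bell\x07and\x1funit".translate(CONTROL_TABLE)
```

Applying one of these to each text value just before it goes into the tree
keeps the cleanup in a single place.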


>> Also, the kind of problem you present here makes it pretty clear that you
>> are using Python 2.x. In Python 3, you'd get the appropriate exceptions
>> when trying to write binary data to a Unicode file.
>
> Sorry, I forgot to mention the version I'm using, which is "2.7.2+".

Yep, Py2 makes Unicode handling harder than it should be.
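
A small sketch of that difference, assuming Python 3: a text-mode file accepts
str and does the encoding itself, and it rejects already-encoded bytes with a
TypeError instead of silently encoding twice. write_checked() is a made-up
helper, and the in-memory buffer stands in for a real file opened with
encoding='UTF-8'.

```python
import io


def write_checked(data):
    """Write data to a UTF-8 text stream; return the bytes written, or None on TypeError."""
    raw = io.BytesIO()
    text_file = io.TextIOWrapper(raw, encoding="UTF-8")
    try:
        text_file.write(data)
    except TypeError:
        # Python 3 refuses bytes on a text stream.
        return None
    text_file.flush()
    return raw.getvalue()


encoded_once = write_checked("caf\u00e9")
rejected = write_checked("caf\u00e9".encode("UTF-8"))
```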

Stefan
