Groups > comp.lang.python > #19696

Re: xhtml encoding question

From	Stefan Behnel <stefan_ml@behnel.de>
Subject	Re: xhtml encoding question
Date	2012-02-01 09:26 +0100
References	<jg9apg$v0$1@foggy.unx.sas.com>
Newsgroups	comp.lang.python
Message-ID	<mailman.5291.1328084788.27778.python-list@python.org> (permalink)

Show all headers | View raw

Tim Arnold, 31.01.2012 19:09:
> I have to follow a specification for producing xhtml files.
> The original files are in cp1252 encoding and I must reencode them to utf-8.
> Also, I have to replace certain characters with html entities.
> 
> I think I've got this right, but I'd like to hear if there's something I'm
> doing that is dangerous or wrong.
> 
> Please see the appended code, and thanks for any comments or suggestions.
> 
> I have two functions, translate (replaces high characters with entities)
> and reencode (um, reencodes):
> ---------------------------------
> import codecs, StringIO
> from lxml import etree
> high_chars = {
>    0x2014:'&mdash;', # 'EM DASH',
>    0x2013:'&ndash;', # 'EN DASH',
>    0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>    0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>    0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>    0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>    0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>    0x2122:'&trade;', # 'TRADE MARK SIGN',
>    0x00A9:'&copy;',  # 'COPYRIGHT SYMBOL',
>    }
> def translate(string):
>    s = ''
>    for c in string:
>        if ord(c) in high_chars:
>            c = high_chars.get(ord(c))
>        s += c
>    return s

I hope you are aware that this is about the slowest possible algorithm
(well, the slowest one that doesn't do anything unnecessary). Since none of
this is required when parsing or generating XHTML, I assume your spec tells
you that you should do these replacements?


> def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
>    with codecs.open(filename,encoding=in_encoding) as f:
>        s = f.read()
>    sio = StringIO.StringIO(translate(s))
>    parser = etree.HTMLParser(encoding=in_encoding)
>    tree = etree.parse(sio, parser)

Yes, you are doing something dangerous and wrong here. For one, you are
decoding the data twice. Then, didn't you say XHTML? Why do you use the
HTML parser to parse XML?


>    result = etree.tostring(tree.getroot(), method='html',
>                            pretty_print=True,
>                            encoding=out_encoding)
>    with open(filename,'wb') as f:
>        f.write(result)

Use tree.write(f, ...)

Assuming you really meant XHTML and not HTML, I'd just drop your entire
code and do this instead:

  tree = etree.parse(in_path)
  tree.write(out_path, encoding='utf8', pretty_print=True)

Note that I didn't provide an input encoding. XML is safe in that regard.

Stefan

Thread

xhtml encoding question Tim Arnold <Tim.Arnold@sas.com> - 2012-01-31 13:09 -0500
  Re: xhtml encoding question Stefan Behnel <stefan_ml@behnel.de> - 2012-02-01 09:26 +0100
    Re: xhtml encoding question Tim Arnold <Tim.Arnold@sas.com> - 2012-02-01 13:15 -0500
      Re: xhtml encoding question Stefan Behnel <stefan_ml@behnel.de> - 2012-02-02 08:02 +0100
  Re: xhtml encoding question Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2012-02-01 09:39 +0100
    Re: xhtml encoding question Peter Otten <__peter__@web.de> - 2012-02-01 10:32 +0100
      Re: xhtml encoding question Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2012-02-01 17:03 +0100
        Re: xhtml encoding question Peter Otten <__peter__@web.de> - 2012-02-02 12:02 +0100
          Re: xhtml encoding question Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2012-02-02 13:40 +0100

csiph-web