Re: xhtml encoding question

From Stefan Behnel <stefan_ml@behnel.de>
Subject Re: xhtml encoding question
Date 2012-02-01 09:26 +0100
References <jg9apg$v0$1@foggy.unx.sas.com>
Newsgroups comp.lang.python
Message-ID <mailman.5291.1328084788.27778.python-list@python.org> (permalink)


Tim Arnold, 31.01.2012 19:09:
> I have to follow a specification for producing xhtml files.
> The original files are in cp1252 encoding and I must reencode them to utf-8.
> Also, I have to replace certain characters with html entities.
> 
> I think I've got this right, but I'd like to hear if there's something I'm
> doing that is dangerous or wrong.
> 
> Please see the appended code, and thanks for any comments or suggestions.
> 
> I have two functions, translate (replaces high characters with entities)
> and reencode (um, reencodes):
> ---------------------------------
> import codecs, StringIO
> from lxml import etree
> high_chars = {
>    0x2014:'&mdash;', # 'EM DASH',
>    0x2013:'&ndash;', # 'EN DASH',
>    0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>    0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>    0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>    0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>    0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>    0x2122:'&trade;', # 'TRADE MARK SIGN',
>    0x00A9:'&copy;',  # 'COPYRIGHT SYMBOL',
>    }
> def translate(string):
>    s = ''
>    for c in string:
>        if ord(c) in high_chars:
>            c = high_chars.get(ord(c))
>        s += c
>    return s

I hope you are aware that this is about the slowest possible algorithm
(well, the slowest one that doesn't do anything unnecessary): it walks the
text one character at a time in Python code and grows the result by
repeated string concatenation. Since none of this is required when parsing
or generating XHTML, I assume your spec tells you that you should do these
replacements?
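
If the replacements really are required, a faster variant could look like the sketch below. It relies on the built-in str.translate (unicode.translate on Python 2), which accepts a dict keyed by code point and runs the replacement loop in C. The table is copied from the quoted high_chars; the name translate_fast is made up for illustration.

```python
# Same mapping as in the quoted code: code point -> HTML entity.
high_chars = {
    0x2014: '&mdash;',   # EM DASH
    0x2013: '&ndash;',   # EN DASH
    0x0160: '&Scaron;',  # LATIN CAPITAL LETTER S WITH CARON
    0x201d: '&rdquo;',   # RIGHT DOUBLE QUOTATION MARK
    0x201c: '&ldquo;',   # LEFT DOUBLE QUOTATION MARK
    0x2019: '&rsquo;',   # RIGHT SINGLE QUOTATION MARK
    0x2018: '&lsquo;',   # LEFT SINGLE QUOTATION MARK
    0x2122: '&trade;',   # TRADE MARK SIGN
    0x00A9: '&copy;',    # COPYRIGHT SIGN
}


def translate_fast(text):
    """Replace every mapped character with its entity in one pass."""
    # str.translate looks up each character's code point in the dict and
    # substitutes the mapped string; unmapped characters pass through.
    return text.translate(high_chars)


print(translate_fast(u'\u201cquoted\u201d \u2014 \u00a9 2012'))
# prints: &ldquo;quoted&rdquo; &mdash; &copy; 2012
```

The dict lookup per character is still there, but it no longer happens in a Python-level loop, and no intermediate strings are built.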


> def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
>    with codecs.open(filename,encoding=in_encoding) as f:
>        s = f.read()
>    sio = StringIO.StringIO(translate(s))
>    parser = etree.HTMLParser(encoding=in_encoding)
>    tree = etree.parse(sio, parser)

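Yes, you are doing something dangerous and wrong here. For one, you are
decoding the data twice: codecs.open() already decodes the file, and then
the parser is told to decode it again with encoding=in_encoding. Then,
didn't you say XHTML? Why do you use the HTML parser to parse XML?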
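
To illustrate why a second decode is risky, here is a small sketch. It is not a trace of what lxml does with the code above; it assumes one plausible failure mode, namely that the already-decoded text gets serialized as UTF-8 somewhere along the way and those bytes are then read as cp1252 again.

```python
# Bytes as they would sit in a cp1252 file: 0x97 is EM DASH in cp1252.
raw = b'a\x97b'

# First (correct) decode: bytes -> text.
text = raw.decode('cp1252')

# Assumed failure mode: the text is re-serialized as UTF-8 and then
# decoded a second time with the original input encoding.
twice = text.encode('utf-8').decode('cp1252')

print(repr(text))   # one em dash between 'a' and 'b'
print(repr(twice))  # three junk characters where the em dash used to be
```

Decoding once, in exactly one place, avoids this whole class of problem.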


>    result = etree.tostring(tree.getroot(), method='html',
>                            pretty_print=True,
>                            encoding=out_encoding)
>    with open(filename,'wb') as f:
>        f.write(result)

Use tree.write(f, ...)

Assuming you really meant XHTML and not HTML, I'd just drop your entire
code and do this instead:

  tree = etree.parse(in_path)
  tree.write(out_path, encoding='utf-8', pretty_print=True)

Note that I didn't provide an input encoding. XML is safe in that regard.
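
The point about not passing an input encoding can be tried out with a minimal sketch. To keep it free of extra dependencies it uses the standard library's xml.etree.ElementTree instead of lxml, so it is a stand-in for the two lines above, not the same API. The sample document is invented, and it assumes the file names its encoding in the XML declaration.

```python
import xml.etree.ElementTree as ET

# A tiny XML document stored as cp1252 bytes. The encoding is named in
# the XML declaration, so no input encoding is passed to the parser.
source = (b'<?xml version="1.0" encoding="windows-1252"?>'
          b'<p>caf\xe9 \x97 bar</p>')

# The parser reads the declaration and decodes the bytes itself.
root = ET.fromstring(source)

# Serialize as UTF-8 bytes; the em dash becomes a three-byte sequence.
utf8_bytes = ET.tostring(root, encoding='utf-8')

print(repr(root.text))
print(utf8_bytes)
```

A file without an encoding declaration (and without a byte order mark) is read as UTF-8 by an XML parser, so a cp1252 file that lacks the declaration would still need fixing first.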

Stefan
