
Re: xhtml encoding question

From Stefan Behnel <stefan_ml@behnel.de>
Subject Re: xhtml encoding question
Date Wed, 01 Feb 2012 09:26:15 +0100
Newsgroups comp.lang.python



Tim Arnold, 31.01.2012 19:09:
> I have to follow a specification for producing xhtml files.
> The original files are in cp1252 encoding and I must reencode them to utf-8.
> Also, I have to replace certain characters with html entities.
> 
> I think I've got this right, but I'd like to hear if there's something I'm
> doing that is dangerous or wrong.
> 
> Please see the appended code, and thanks for any comments or suggestions.
> 
> I have two functions, translate (replaces high characters with entities)
> and reencode (um, reencodes):
> ---------------------------------
> import codecs, StringIO
> from lxml import etree
> high_chars = {
>    0x2014:'&mdash;', # 'EM DASH',
>    0x2013:'&ndash;', # 'EN DASH',
>    0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>    0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>    0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>    0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>    0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>    0x2122:'&trade;', # 'TRADE MARK SIGN',
>    0x00A9:'&copy;',  # 'COPYRIGHT SYMBOL',
>    }
> def translate(string):
>    s = ''
>    for c in string:
>        if ord(c) in high_chars:
>            c = high_chars.get(ord(c))
>        s += c
>    return s

I hope you are aware that this is about the slowest possible algorithm
(well, the slowest one that doesn't do anything unnecessary). Since none of
this is required when parsing or generating XHTML, I assume your spec tells
you that you should do these replacements?
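If the spec does require the replacements, a faster variant is possible. The following is only a sketch, written for Python 3 (the quoted code is Python 2, where unicode.translate accepts the same kind of table). It assumes the same mapping as the quoted high_chars; the name HIGH_CHARS is introduced here for illustration. str.translate does the per-character lookup internally instead of building the result by repeated concatenation.

```python
# Sketch: entity replacement via str.translate (Python 3).
# The table maps Unicode code points to replacement strings.
HIGH_CHARS = {
    0x2014: "&mdash;",   # EM DASH
    0x2013: "&ndash;",   # EN DASH
    0x0160: "&Scaron;",  # LATIN CAPITAL LETTER S WITH CARON
    0x201D: "&rdquo;",   # RIGHT DOUBLE QUOTATION MARK
    0x201C: "&ldquo;",   # LEFT DOUBLE QUOTATION MARK
    0x2019: "&rsquo;",   # RIGHT SINGLE QUOTATION MARK
    0x2018: "&lsquo;",   # LEFT SINGLE QUOTATION MARK
    0x2122: "&trade;",   # TRADE MARK SIGN
    0x00A9: "&copy;",    # COPYRIGHT SIGN
}


def translate(text):
    """Replace each mapped character with its entity; leave the rest alone."""
    return text.translate(HIGH_CHARS)


print(translate("caf\u00e9 \u2014 \u00a9 2012"))
# prints: café &mdash; &copy; 2012
```

Characters that are not in the table, such as the é above, pass through unchanged.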


> def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
>    with codecs.open(filename,encoding=in_encoding) as f:
>        s = f.read()
>    sio = StringIO.StringIO(translate(s))
>    parser = etree.HTMLParser(encoding=in_encoding)
>    tree = etree.parse(sio, parser)

Yes, you are doing something dangerous and wrong here. For one, you are
decoding the data twice: codecs.open() already decodes the file, and then
the parser is given encoding=in_encoding as well. Then, didn't you say
XHTML? Why do you use the HTML parser to parse XML?
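As a general illustration of why a second decoding step is risky, here is a small Python 3 sketch. It does not claim to reproduce what lxml does with the quoted input; it only shows what can happen when text that has already been decoded is serialised again and then interpreted as cp1252 a second time.

```python
# Sketch: the effect of decoding cp1252 data more than once (Python 3).
original = "\u2014"                  # EM DASH
raw = original.encode("cp1252")      # the single byte 0x97, as stored on disk

once = raw.decode("cp1252")          # correct: decoded exactly once

# Stand-in for a second decode: the text is serialised as UTF-8 and the
# resulting bytes are then (wrongly) interpreted as cp1252 again.
twice = once.encode("utf-8").decode("cp1252")

print(repr(once))    # the original character
print(repr(twice))   # three unrelated characters (mojibake)
```

The single dash turns into three characters, which is the typical symptom of an encoding being applied one time too many.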


>    result = etree.tostring(tree.getroot(), method='html',
>                            pretty_print=True,
>                            encoding=out_encoding)
>    with open(filename,'wb') as f:
>        f.write(result)

Use tree.write(f, ...) instead of serialising to a string and writing that
yourself.

Assuming you really meant XHTML and not HTML, I'd just drop your entire
code and do this instead:

  tree = etree.parse(in_path)
  tree.write(out_path, encoding='utf-8', pretty_print=True)

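Note that I didn't provide an input encoding. XML is safe in that regard:
the parser takes the encoding from the XML declaration or a byte order
mark, and otherwise assumes UTF-8.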
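A self-contained way to try the same idea is sketched below. To avoid needing extra packages it uses the standard library's xml.etree.ElementTree rather than lxml, so it has no pretty_print option. The helper name reencode_xml, the file names, and the sample document are made up for the illustration. The input declares windows-1252 in its XML declaration, and the parser picks that up without being told.

```python
# Sketch (Python 3, standard library only): re-encode an XML file to UTF-8.
import os
import tempfile
import xml.etree.ElementTree as ET


def reencode_xml(in_path, out_path):
    """Parse in_path using its declared encoding and write it out as UTF-8."""
    tree = ET.parse(in_path)  # no input encoding given
    tree.write(out_path, encoding="utf-8", xml_declaration=True)


with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "in.xhtml")
    out_path = os.path.join(tmp, "out.xhtml")

    # Hypothetical sample input: byte 0x97 is an EM DASH in windows-1252.
    with open(in_path, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="windows-1252"?>\n'
                b"<p>a \x97 b</p>")

    reencode_xml(in_path, out_path)

    with open(out_path, "rb") as f:
        result = f.read()

print(result.decode("utf-8"))
```

This only works when the input really is well-formed XML with a correct declaration. A cp1252 file without a declaration would be read as UTF-8 and would normally fail to parse.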

Stefan



