
Re: xhtml encoding question

To	python-list@python.org
From	Peter Otten <__peter__@web.de>
Subject	Re: xhtml encoding question
Date	Wed, 01 Feb 2012 10:32:52 +0100
References	<jg9apg$v0$1@foggy.unx.sas.com> <daanv8-7i.ln1@satorlaser.homedns.org>
Newsgroups	comp.lang.python
Message-ID	<mailman.5292.1328088791.27778.python-list@python.org>


Ulrich Eckhardt wrote:

> Am 31.01.2012 19:09, schrieb Tim Arnold:
>> high_chars = {
>>     0x2014:'&mdash;', # 'EM DASH',
>>     0x2013:'&ndash;', # 'EN DASH',
>>     0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>>     0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>>     0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>>     0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>>     0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>>     0x2122:'&trade;', # 'TRADE MARK SIGN',
>>     0x00A9:'&copy;', # 'COPYRIGHT SIGN',
>> }
> 
> You could use Unicode string literals directly instead of using the
> codepoint, making it a bit more self-documenting and saving you the
> later call to ord():
> 
> high_chars = {
>      u'\u2014': '&mdash;',
>      u'\u2013': '&ndash;',
>      ...
> }
> 
>> for c in string:
>>     if ord(c) in high_chars:
>>         c = high_chars.get(ord(c))
>>     s += c
>> return s
> 
> Instead of checking if there is a replacement and then looking up the
> replacement again, just use the default:
> 
>    for c in string:
>        s += high_chars.get(c, c)
> 
> Alternatively, if you find that clearer, you could also check if the
> return value of get() is None to find out if there is a replacement:
> 
>    for c in string:
>        r = high_chars.get(c)
>        if r is None:
>            s += c
>        else:
>            s += r

It doesn't matter for the OP (see Stefan Behnel's post), but if you want to
replace characters in a unicode string, the best way is probably the
translate() method:

>>> print u"\xa9\u2122"
©™
>>> u"\xa9\u2122".translate({0xa9: u"&copy;", 0x2122: u"&trade;"})
u'&copy;&trade;'
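
Applied to the whole table quoted above, that might look like the sketch below. The names HIGH_CHARS and replace_high_chars are made up for illustration, and the keys stay integer code points because that is what translate() looks up. On Python 2 this only works for unicode objects, not byte strings; the u'' prefixes are harmless on Python 3.3 and later.

```python
# Illustrative table: code point -> replacement text, taken from the
# entities quoted earlier in the thread.
HIGH_CHARS = {
    0x2014: u'&mdash;',   # EM DASH
    0x2013: u'&ndash;',   # EN DASH
    0x0160: u'&Scaron;',  # LATIN CAPITAL LETTER S WITH CARON
    0x201d: u'&rdquo;',   # RIGHT DOUBLE QUOTATION MARK
    0x201c: u'&ldquo;',   # LEFT DOUBLE QUOTATION MARK
    0x2019: u'&rsquo;',   # RIGHT SINGLE QUOTATION MARK
    0x2018: u'&lsquo;',   # LEFT SINGLE QUOTATION MARK
    0x2122: u'&trade;',   # TRADE MARK SIGN
    0x00A9: u'&copy;',    # COPYRIGHT SIGN
}


def replace_high_chars(text):
    """Return text with every code point found in HIGH_CHARS replaced.

    Characters that are not keys in the table pass through unchanged,
    so no explicit loop or get() default is needed.
    """
    return text.translate(HIGH_CHARS)


sample = u'\u201cPython\u201d \u2014 \xa9 2012'
result = replace_high_chars(sample)
```

Here result should be u'&ldquo;Python&rdquo; &mdash; &copy; 2012'; the plain ASCII parts of the string are left alone.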

