From: Peter Otten <__peter__@web.de>
Newsgroups: comp.lang.python
Subject: Re: xhtml encoding question
Date: Wed, 01 Feb 2012 10:32:52 +0100

Ulrich Eckhardt wrote:

> On 31.01.2012 19:09, Tim Arnold wrote:
>> high_chars = {
>>     0x2014:'&mdash;', # 'EM DASH',
>>     0x2013:'&ndash;', # 'EN DASH',
>>     0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>>     0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>>     0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>>     0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>>     0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>>     0x2122:'&trade;', # 'TRADE MARK SIGN',
>>     0x00A9:'&copy;', # 'COPYRIGHT SYMBOL',
>> }
>
> You could use Unicode string literals directly instead of using the
> codepoint, making it a bit more self-documenting and saving you the
> later call to ord():
>
>     high_chars = {
>         u'\u2014': '&mdash;',
>         u'\u2013': '&ndash;',
>         ...
>     }
>
>> for c in string:
>>     if ord(c) in high_chars:
>>         c = high_chars.get(ord(c))
>>     s += c
>> return s
>
> Instead of checking if there is a replacement and then looking up the
> replacement again, just use the default:
>
>     for c in string:
>         s += high_chars.get(c, c)
>
> Alternatively, if you find that clearer, you could also check if the
> return value of get() is None to find out if there is a replacement:
>
>     for c in string:
>         r = high_chars.get(c)
>         if r is None:
>             s += c
>         else:
>             s += r

It doesn't matter for the OP (see Stefan Behnel's post), but if you want to
replace characters in a unicode string, the best way is probably the
translate() method:

>>> print u"\xa9\u2122"
©™
>>> u"\xa9\u2122".translate({0xa9: u"&copy;", 0x2122: u"&trade;"})
u'&copy;&trade;'
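
For what it's worth, here is a rough sketch of how translate() might be applied to the whole table from the quoted post. It is written in Python 3 syntax (where every str is unicode), not the Python 2 used in the session above. The names HIGH_CHARS, xhtml_escape and xhtml_escape_loop are made up for illustration, and the replacement strings are assumed to be the standard XHTML entity names for those code points. The loop variant is only there to show that it gives the same result as the dict.get(c, c) idea from the quoted text.

```python
# Sketch only: Python 3 syntax; all names below are illustrative, not from the thread.
# Keys are code points (integers), which is what str.translate() expects.
HIGH_CHARS = {
    0x2014: "&mdash;",   # EM DASH
    0x2013: "&ndash;",   # EN DASH
    0x0160: "&Scaron;",  # LATIN CAPITAL LETTER S WITH CARON
    0x201D: "&rdquo;",   # RIGHT DOUBLE QUOTATION MARK
    0x201C: "&ldquo;",   # LEFT DOUBLE QUOTATION MARK
    0x2019: "&rsquo;",   # RIGHT SINGLE QUOTATION MARK
    0x2018: "&lsquo;",   # LEFT SINGLE QUOTATION MARK
    0x2122: "&trade;",   # TRADE MARK SIGN
    0x00A9: "&copy;",    # COPYRIGHT SIGN
}


def xhtml_escape(text):
    """Replace every mapped character in one pass; others pass through unchanged."""
    return text.translate(HIGH_CHARS)


def xhtml_escape_loop(text):
    """Per-character equivalent using dict.get() with the character as default."""
    return "".join(HIGH_CHARS.get(ord(c), c) for c in text)


sample = "\u201cPython\u2122\u201d \u2014 \xa9 2012"
print(xhtml_escape(sample))       # &ldquo;Python&trade;&rdquo; &mdash; &copy; 2012
print(xhtml_escape_loop(sample))  # same output as the line above
```

Keeping the keys as integers means the table can be handed to translate() as-is; with one-character string keys (as in the quoted suggestion) it would only work for the get()-based loop.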