Groups > comp.lang.python > #19688 > unrolled thread

xhtml encoding question

Started by	Tim Arnold <Tim.Arnold@sas.com>
First post	2012-01-31 13:09 -0500
Last post	2012-02-02 13:40 +0100
Articles	9 — 4 participants


  xhtml encoding question Tim Arnold <Tim.Arnold@sas.com> - 2012-01-31 13:09 -0500
    Re: xhtml encoding question Stefan Behnel <stefan_ml@behnel.de> - 2012-02-01 09:26 +0100
      Re: xhtml encoding question Tim Arnold <Tim.Arnold@sas.com> - 2012-02-01 13:15 -0500
        Re: xhtml encoding question Stefan Behnel <stefan_ml@behnel.de> - 2012-02-02 08:02 +0100
    Re: xhtml encoding question Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2012-02-01 09:39 +0100
      Re: xhtml encoding question Peter Otten <__peter__@web.de> - 2012-02-01 10:32 +0100
        Re: xhtml encoding question Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2012-02-01 17:03 +0100
          Re: xhtml encoding question Peter Otten <__peter__@web.de> - 2012-02-02 12:02 +0100
            Re: xhtml encoding question Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2012-02-02 13:40 +0100

#19688 — xhtml encoding question

From	Tim Arnold <Tim.Arnold@sas.com>
Date	2012-01-31 13:09 -0500
Subject	xhtml encoding question
Message-ID	<jg9apg$v0$1@foggy.unx.sas.com>

I have to follow a specification for producing xhtml files.
The original files are in cp1252 encoding and I must reencode them to utf-8.
Also, I have to replace certain characters with html entities.

I think I've got this right, but I'd like to hear if there's something 
I'm doing that is dangerous or wrong.

Please see the appended code, and thanks for any comments or suggestions.

I have two functions, translate (replaces high characters with entities) 
and reencode (um, reencodes):
---------------------------------
import codecs, StringIO
from lxml import etree
high_chars = {
    0x2014:'&mdash;', # 'EM DASH',
    0x2013:'&ndash;', # 'EN DASH',
    0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
    0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
    0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
    0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
    0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
    0x2122:'&trade;', # 'TRADE MARK SIGN',
    0x00A9:'&copy;',  # 'COPYRIGHT SYMBOL',
    }
def translate(string):
    s = ''
    for c in string:
        if ord(c) in high_chars:
            c = high_chars.get(ord(c))
        s += c
    return s

def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
    with codecs.open(filename,encoding=in_encoding) as f:
        s = f.read()
    sio = StringIO.StringIO(translate(s))
    parser = etree.HTMLParser(encoding=in_encoding)
    tree = etree.parse(sio, parser)
    result = etree.tostring(tree.getroot(), method='html',
                            pretty_print=True,
                            encoding=out_encoding)
    with open(filename,'wb') as f:
        f.write(result)

if __name__ == '__main__':
    fname = 'mytest.htm'
    reencode(fname)


#19696

From	Stefan Behnel <stefan_ml@behnel.de>
Date	2012-02-01 09:26 +0100
Message-ID	<mailman.5291.1328084788.27778.python-list@python.org>
In reply to	#19688

Tim Arnold, 31.01.2012 19:09:
> I have to follow a specification for producing xhtml files.
> The original files are in cp1252 encoding and I must reencode them to utf-8.
> Also, I have to replace certain characters with html entities.
> 
> I think I've got this right, but I'd like to hear if there's something I'm
> doing that is dangerous or wrong.
> 
> Please see the appended code, and thanks for any comments or suggestions.
> 
> I have two functions, translate (replaces high characters with entities)
> and reencode (um, reencodes):
> ---------------------------------
> import codecs, StringIO
> from lxml import etree
> high_chars = {
>    0x2014:'&mdash;', # 'EM DASH',
>    0x2013:'&ndash;', # 'EN DASH',
>    0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>    0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>    0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>    0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>    0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>    0x2122:'&trade;', # 'TRADE MARK SIGN',
>    0x00A9:'&copy;',  # 'COPYRIGHT SYMBOL',
>    }
> def translate(string):
>    s = ''
>    for c in string:
>        if ord(c) in high_chars:
>            c = high_chars.get(ord(c))
>        s += c
>    return s

I hope you are aware that this is about the slowest possible algorithm
(well, the slowest one that doesn't do anything unnecessary). Since none of
this is required when parsing or generating XHTML, I assume your spec tells
you that you should do these replacements?


> def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
>    with codecs.open(filename,encoding=in_encoding) as f:
>        s = f.read()
>    sio = StringIO.StringIO(translate(s))
>    parser = etree.HTMLParser(encoding=in_encoding)
>    tree = etree.parse(sio, parser)

Yes, you are doing something dangerous and wrong here. For one, you are
decoding the data twice. Then, didn't you say XHTML? Why do you use the
HTML parser to parse XML?
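
To make the "decoding twice" hazard concrete, here is a small sketch in Python 3 (the thread itself uses Python 2). It is not a reproduction of what lxml did with the input above; it only illustrates, as an assumption about the general failure mode, what happens when text that was already decoded gets turned back into bytes and decoded again with the wrong codec.

```python
# Bytes as they would sit in a cp1252 file: "a", EM DASH (0x97), "b".
raw = b"a\x97b"

# Decoding once with the right codec gives the intended text.
text = raw.decode("cp1252")

# If that text is turned back into bytes (here UTF-8) and decoded as
# cp1252 a second time, the single EM DASH becomes three characters.
twice = text.encode("utf-8").decode("cp1252")

print(repr(text))
print(repr(twice))
```

The second result is the familiar "mojibake" pattern, which is why the input should be decoded in exactly one place.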


>    result = etree.tostring(tree.getroot(), method='html',
>                            pretty_print=True,
>                            encoding=out_encoding)
>    with open(filename,'wb') as f:
>        f.write(result)

Use tree.write(f, ...)

Assuming you really meant XHTML and not HTML, I'd just drop your entire
code and do this instead:

  tree = etree.parse(in_path)
  tree.write(out_path, encoding='utf8', pretty_print=True)

Note that I didn't provide an input encoding. XML is safe in that regard.
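
A minimal sketch of why no input encoding is needed for XML, written for Python 3 and using the standard library's xml.etree.ElementTree instead of lxml so that it runs without third-party packages. The sample document and its ISO-8859-1 declaration are assumptions chosen for illustration; the point is only that the parser takes the encoding from the XML declaration.

```python
import xml.etree.ElementTree as ET

# An XML document that declares its own encoding in the XML declaration.
# 0xA9 is the COPYRIGHT SIGN in ISO-8859-1.
source = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>\xa9 2012</p>'

# No encoding argument: the parser reads it from the declaration.
root = ET.fromstring(source)

# Serialise to UTF-8; the character is re-encoded as the bytes C2 A9.
out = ET.tostring(root, encoding="utf-8")

print(repr(root.text))
print(out)
```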

Stefan


#19767

From	Tim Arnold <Tim.Arnold@sas.com>
Date	2012-02-01 13:15 -0500
Message-ID	<jgbvfd$q3d$1@foggy.unx.sas.com>
In reply to	#19696

On 2/1/2012 3:26 AM, Stefan Behnel wrote:
> Tim Arnold, 31.01.2012 19:09:
>> I have to follow a specification for producing xhtml files.
>> The original files are in cp1252 encoding and I must reencode them to utf-8.
>> Also, I have to replace certain characters with html entities.
>> ---------------------------------
>> import codecs, StringIO
>> from lxml import etree
>> high_chars = {
>>     0x2014:'&mdash;', # 'EM DASH',
>>     0x2013:'&ndash;', # 'EN DASH',
>>     0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>>     0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>>     0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>>     0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>>     0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>>     0x2122:'&trade;', # 'TRADE MARK SIGN',
>>     0x00A9:'&copy;',  # 'COPYRIGHT SYMBOL',
>>     }
>> def translate(string):
>>     s = ''
>>     for c in string:
>>         if ord(c) in high_chars:
>>             c = high_chars.get(ord(c))
>>         s += c
>>     return s
>
> I hope you are aware that this is about the slowest possible algorithm
> (well, the slowest one that doesn't do anything unnecessary). Since none of
> this is required when parsing or generating XHTML, I assume your spec tells
> you that you should do these replacements?
>
I wasn't aware of it, but I am now--code's embarrassing now.
The spec I must follow forces me to do the translation.

I am actually working with html not xhtml; which makes a huge 
difference, sorry for that.

Ulrich's line of code for translate is elegant.
for c in string:
     s += high_chars.get(c,c)

>
>> def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
>>     with codecs.open(filename,encoding=in_encoding) as f:
>>         s = f.read()
>>     sio = StringIO.StringIO(translate(s))
>>     parser = etree.HTMLParser(encoding=in_encoding)
>>     tree = etree.parse(sio, parser)
>
> Yes, you are doing something dangerous and wrong here. For one, you are
> decoding the data twice. Then, didn't you say XHTML? Why do you use the
> HTML parser to parse XML?
>
I see that I'm decoding twice now, thanks.

Also, I now see that when lxml writes the result back out the entities I 
got from my translate function are resolved, which defeats the whole 
purpose.
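
The effect described here can be sketched without lxml. The following Python 3 example uses the standard library's html.parser as a stand-in (an assumption; lxml's internals differ) to show that a parser hands back the characters, not the entity references, so any entity inserted before parsing is gone from the parsed text. TextCollector is a hypothetical helper name.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Hypothetical helper: collects the text content the parser reports."""
    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

collector = TextCollector()
collector.feed("<p>a&mdash;b</p>")
collector.close()
parsed_text = "".join(collector.chunks)

# The entity reference has been resolved to the EM DASH character.
print(repr(parsed_text))
```

This is why the entity replacement has to happen after serialisation rather than before parsing.
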
>
>>     result = etree.tostring(tree.getroot(), method='html',
>>                             pretty_print=True,
>>                             encoding=out_encoding)
>>     with open(filename,'wb') as f:
>>         f.write(result)
>
> Use tree.write(f, ...)

From all the info I've received on this thread, plus some 
additional reading, I think I need the following code.

Use the HTMLParser because the source files are actually HTML, and use 
output from etree.tostring() as input to translate() as the very last step.

def reencode(filename, in_encoding='cp1252', out_encoding='utf-8'):
     parser = etree.HTMLParser(encoding=in_encoding)
     tree = etree.parse(filename, parser)
     result = etree.tostring(tree.getroot(), method='html',
                             pretty_print=True,
                             encoding=out_encoding)
     with open(filename, 'wb') as f:
         f.write(translate(result))

not simply tree.write(f...) because I have to do the translation at the 
end, so I get the entities instead of the resolved entities from lxml.

Again, it would be simpler if this was xhtml, but I misspoke 
(mis-wrote?) when I said xhtml; this is for html.

> Assuming you really meant XHTML and not HTML, I'd just drop your entire
> code and do this instead:
>
>    tree = etree.parse(in_path)
>    tree.write(out_path, encoding='utf8', pretty_print=True)
>
> Note that I didn't provide an input encoding. XML is safe in that regard.
>
> Stefan
>

thanks everyone for the help.

--Tim Arnold


#19771

From	Stefan Behnel <stefan_ml@behnel.de>
Date	2012-02-02 08:02 +0100
Message-ID	<mailman.5345.1328166145.27778.python-list@python.org>
In reply to	#19767

Tim Arnold, 01.02.2012 19:15:
> On 2/1/2012 3:26 AM, Stefan Behnel wrote:
>> Tim Arnold, 31.01.2012 19:09:
>>> I have to follow a specification for producing xhtml files.
>>> The original files are in cp1252 encoding and I must reencode them to
>>> utf-8.
>>> Also, I have to replace certain characters with html entities.
>>> ---------------------------------
>>> import codecs, StringIO
>>> from lxml import etree
>>> high_chars = {
>>>     0x2014:'&mdash;', # 'EM DASH',
>>>     0x2013:'&ndash;', # 'EN DASH',
>>>     0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>>>     0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>>>     0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>>>     0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>>>     0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>>>     0x2122:'&trade;', # 'TRADE MARK SIGN',
>>>     0x00A9:'&copy;',  # 'COPYRIGHT SYMBOL',
>>>     }
>>> def translate(string):
>>>     s = ''
>>>     for c in string:
>>>         if ord(c) in high_chars:
>>>             c = high_chars.get(ord(c))
>>>         s += c
>>>     return s
>>
>> I hope you are aware that this is about the slowest possible algorithm
>> (well, the slowest one that doesn't do anything unnecessary). Since none of
>> this is required when parsing or generating XHTML, I assume your spec tells
>> you that you should do these replacements?
>
> I wasn't aware of it, but I am now--code's embarrassing now.
> The spec I must follow forces me to do the translation.
> 
> I am actually working with html not xhtml; which makes a huge difference,

We all learn.

> Ulrich's line of code for translate is elegant.
> for c in string:
>     s += high_chars.get(c,c)

Still not efficient because it builds the string one character at a time
and needs to reallocate (and potentially copy) the string buffer quite
frequently in order to do that. You are lucky with CPython, because it has
an internal optimisation that mitigates this overhead on some platforms.
Other Python implementations don't have that, and even the optimisation in
CPython is platform specific (works well on Linux, for example).

Peter Otten presented a better way of doing it.
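
Apart from the translate() method, a common way to avoid growing a string piece by piece is to collect the pieces and join them once. This is a swapped-in technique, not something proposed in the thread; the sketch below is Python 3, and the shortened table and the name translate_join are assumptions for illustration.

```python
HIGH_CHARS = {
    "\u2014": "&mdash;",
    "\u2013": "&ndash;",
    "\u00a9": "&copy;",
}

def translate_join(text):
    """Look up each character, then build the result in a single join."""
    return "".join(HIGH_CHARS.get(ch, ch) for ch in text)

sample = "\u00a9 2012 \u2014 done"
joined = translate_join(sample)
print(joined)
```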

> From all the info I've received on this thread, plus some additional
> reading, I think I need the following code.
> 
> Use the HTMLParser because the source files are actually HTML, and use
> output from etree.tostring() as input to translate() as the very last step.
> 
> def reencode(filename, in_encoding='cp1252', out_encoding='utf-8'):
>     parser = etree.HTMLParser(encoding=in_encoding)
>     tree = etree.parse(filename, parser)
>     result = etree.tostring(tree.getroot(), method='html',
>                             pretty_print=True,
>                             encoding=out_encoding)
>     with open(filename, 'wb') as f:
>         f.write(translate(result))
> 
> not simply tree.write(f...) because I have to do the translation at the
> end, so I get the entities instead of the resolved entities from lxml.

Yes, that's better.

Still one thing (since you didn't show us your final translate() function):
you do the character escaping on a UTF-8 encoded string and made the
encoding configurable. That means that the characters you are looking for
must also be encoded with the same encoding in order to find matches.
However, if you ever choose a different target encoding that doesn't have
the nice properties of UTF-8's byte sequences, you may end up with
ambiguous byte sequences in the output that your translate() function
accidentally matches on, thus potentially corrupting your data.
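
The ambiguity can be shown with a small Python 3 sketch. UTF-16-LE is picked here only as an example of an encoding that lacks UTF-8's self-synchronising byte sequences; the sample text is an assumption built to trigger the problem.

```python
# EM DASH as bytes in UTF-16-LE.
needle = "\u2014".encode("utf-16-le")      # b'\x14\x20'

# This text contains no EM DASH: U+1400 followed by a plain space.
text = "\u1400 "
encoded = text.encode("utf-16-le")          # b'\x00\x14\x20\x00'

# A byte-level search still "finds" the dash, straddling two characters.
false_match = needle in encoded

# The same search on the decoded text is unambiguous.
real_match = "\u2014" in text

print(false_match, real_match)
```

A byte-level replace on such data would rewrite the middle of two unrelated characters.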

Assuming that you are using Python 2, you may even be accidentally doing
the replacement using Unicode character strings, which then only happens to
work on systems that use UTF-8 as their default encoding. Python 3 has
fixed this trap, but you have to take care to avoid it in Python 2.

I'd prefer serialising the documents into a unicode string
(encoding='unicode'), then post-processing that and finally encoding it to
the target encoding when writing it out. But you'll have to see how that
works out together with your escaping step, and also how it impacts the
HTML <meta> tag that states the document encoding.
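
A sketch of that ordering in Python 3: replace characters while the document is still text, then encode exactly once at the end. The function name escape_and_encode and the shortened table are hypothetical, and the input string stands in for a text serialisation such as the one obtained with encoding='unicode'. The <meta> tag question is not addressed here.

```python
ENTITY_TABLE = {
    0x2014: "&mdash;",
    0x2019: "&rsquo;",
    0x00A9: "&copy;",
}

def escape_and_encode(serialized, out_encoding="utf-8"):
    """Replace characters in text first, then encode once at the end.

    `serialized` stands in for the text serialisation of the tree.
    """
    escaped = serialized.translate(ENTITY_TABLE)
    return escaped.encode(out_encoding)

html_text = "<p>It\u2019s \u00a9 2012 \u2014 caf\u00e9</p>"
data = escape_and_encode(html_text)
print(data)
```

Because the matching happens on characters, the choice of target encoding cannot affect which characters get replaced.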

Stefan


#19697

From	Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com>
Date	2012-02-01 09:39 +0100
Message-ID	<daanv8-7i.ln1@satorlaser.homedns.org>
In reply to	#19688

Am 31.01.2012 19:09, schrieb Tim Arnold:
> high_chars = {
>     0x2014:'&mdash;', # 'EM DASH',
>     0x2013:'&ndash;', # 'EN DASH',
>     0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>     0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>     0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>     0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>     0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>     0x2122:'&trade;', # 'TRADE MARK SIGN',
>     0x00A9:'&copy;', # 'COPYRIGHT SYMBOL',
> }

You could use Unicode string literals directly instead of using the 
codepoint, making it a bit more self-documenting and saving you the 
later call to ord():

high_chars = {
     u'\u2014': '&mdash;',
     u'\u2013': '&ndash;',
     ...
}

> for c in string:
>     if ord(c) in high_chars:
>         c = high_chars.get(ord(c))
>     s += c
> return s

Instead of checking if there is a replacement and then looking up the 
replacement again, just use the default:

   for c in string:
       s += high_chars.get(c, c)

Alternatively, if you find that clearer, you could also check if the 
return value of get() is None to find out if there is a replacement:

   for c in string:
       r = high_chars.get(c)
       if r is None:
           s += c
       else:
           s += r
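
Put together as a complete function, the first variant could look like the sketch below. It is written for Python 3, where the u prefix is not needed, and the table is shortened to three entries for illustration.

```python
# Keys are the characters themselves, so no ord() call is needed.
high_chars = {
    "\u2014": "&mdash;",   # EM DASH
    "\u2013": "&ndash;",   # EN DASH
    "\u2122": "&trade;",   # TRADE MARK SIGN
}

def translate(string):
    s = ""
    for c in string:
        # get() returns the replacement, or c itself when there is none.
        s += high_chars.get(c, c)
    return s

print(translate("Python\u2122 \u2013 fun"))
```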


Uli


#19699

From	Peter Otten <__peter__@web.de>
Date	2012-02-01 10:32 +0100
Message-ID	<mailman.5292.1328088791.27778.python-list@python.org>
In reply to	#19697

Ulrich Eckhardt wrote:

> Am 31.01.2012 19:09, schrieb Tim Arnold:
>> high_chars = {
>>     0x2014:'&mdash;', # 'EM DASH',
>>     0x2013:'&ndash;', # 'EN DASH',
>>     0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>>     0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>>     0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>>     0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>>     0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>>     0x2122:'&trade;', # 'TRADE MARK SIGN',
>>     0x00A9:'&copy;', # 'COPYRIGHT SYMBOL',
>> }
> 
> You could use Unicode string literals directly instead of using the
> codepoint, making it a bit more self-documenting and saving you the
> later call to ord():
> 
> high_chars = {
>      u'\u2014': '&mdash;',
>      u'\u2013': '&ndash;',
>      ...
> }
> 
>> for c in string:
>>     if ord(c) in high_chars:
>>         c = high_chars.get(ord(c))
>>     s += c
>> return s
> 
> Instead of checking if there is a replacement and then looking up the
> replacement again, just use the default:
> 
>    for c in string:
>        s += high_chars.get(c, c)
> 
> Alternatively, if you find that clearer, you could also check if the
> return value of get() is None to find out if there is a replacement:
> 
>    for c in string:
>        r = high_chars.get(c)
>        if r is None:
>            s += c
>        else:
>            s += r

It doesn't matter for the OP (see Stefan Behnel's post), but if you want to 
replace characters in a unicode string the best way is probably the 
translate() method:

>>> print u"\xa9\u2122"
©™
>>> u"\xa9\u2122".translate({0xa9: u"&copy;", 0x2122: u"&trade;"})
u'&copy;&trade;'
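
The same call in Python 3 syntax, as a minimal sketch. A table keyed by code point, like the high_chars dict from the original post (shortened here), can be passed to translate() as it is; the sample string is an assumption.

```python
# Table keys are code points (integers); values are replacement strings.
high_chars = {
    0x2014: "&mdash;",
    0x2013: "&ndash;",
    0x2122: "&trade;",
    0x00A9: "&copy;",
}

result = "\xa9 2012 \u2014 Python\u2122".translate(high_chars)
print(result)
```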


#19716

From	Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com>
Date	2012-02-01 17:03 +0100
Message-ID	<8b4ov8-ad2.ln1@satorlaser.homedns.org>
In reply to	#19699

Am 01.02.2012 10:32, schrieb Peter Otten:
> It doesn't matter for the OP (see Stefan Behnel's post), but if you want to
> replace characters in a unicode string the best way is probably the
> translate() method:
>
>>>> print u"\xa9\u2122"
> ©™
>>>> u"\xa9\u2122".translate({0xa9: u"&copy;", 0x2122: u"&trade;"})
> u'&copy;&trade;'
>

Yes, this is both more expressive and at the same time probably even 
more efficient.


Question though:

 >>> u'abc'.translate({u'a': u'A'})
u'abc'

I would call this a chance to improve Python. According to the 
documentation, using a string is invalid, but it neither raises an 
exception nor does it do the obvious and accept single-character strings 
as keys.
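
For comparison, Python 3 (not the Python 2 under discussion here) still ignores string keys passed directly to translate(), but its str.maketrans() accepts a dict with single-character string keys and converts them to code points. A minimal sketch:

```python
# Single-character string keys are silently ignored by translate() ...
ignored = "abc".translate({"a": "A"})

# ... but str.maketrans() converts such keys to code points first.
table = str.maketrans({"a": "A"})
converted = "abc".translate(table)

print(ignored, converted, table)
```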


Thoughts?


Uli


#19786

From	Peter Otten <__peter__@web.de>
Date	2012-02-02 12:02 +0100
Message-ID	<mailman.5353.1328180546.27778.python-list@python.org>
In reply to	#19716

Ulrich Eckhardt wrote:

> Am 01.02.2012 10:32, schrieb Peter Otten:
>> It doesn't matter for the OP (see Stefan Behnel's post), but if you want
>> to replace characters in a unicode string the best way is probably the
>> translate() method:
>>
>>>>> print u"\xa9\u2122"
>> ©™
>>>>> u"\xa9\u2122".translate({0xa9: u"&copy;", 0x2122: u"&trade;"})
>> u'&copy;&trade;'
>>
> 
> Yes, this is both more expressive and at the same time probably even
> more efficient.
> 
> 
> Question though:
> 
>  >>> u'abc'.translate({u'a': u'A'})
> u'abc'
> 
> I would call this a chance to improve Python. According to the
> documentation, using a string is invalid, but it neither raises an
> exception nor does it do the obvious and accept single-character strings
> as keys.
> 
> 
> Thoughts?

How could this raise an exception? You'd either need a typed dictionary (int 
--> unicode) or translate() would have to verify that all keys are indeed 
integers. The former would go against the grain of Python, the latter would 
make the method less flexible as the set of keys currently need not be 
predefined:

>>> class A(object):
...     def __getitem__(self, key):
...             return unichr(key).upper()
...
>>> u"alpha".translate(A())
u'ALPHA'

Using unicode instead of integer keys would be nice but breaks backwards 
compatibility, using both could double the number of dictionary lookups.
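
The on-demand mapping also works in Python 3, with chr() in place of unichr(). The sketch below adds a second, hypothetical class (VowelsOnly) to show that raising LookupError from __getitem__ leaves a character unchanged, which is how a plain dict behaves for missing keys.

```python
class Upper:
    """Computes replacements on demand instead of storing them."""
    def __getitem__(self, key):
        # translate() passes the code point as an integer.
        return chr(key).upper()

class VowelsOnly:
    """Uppercases vowels; LookupError leaves other characters unchanged."""
    def __getitem__(self, key):
        ch = chr(key)
        if ch in "aeiou":
            return ch.upper()
        raise LookupError(key)

print("alpha".translate(Upper()))
print("alpha".translate(VowelsOnly()))
```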


#19788

From	Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com>
Date	2012-02-02 13:40 +0100
Message-ID	<vqcqv8-m08.ln1@satorlaser.homedns.org>
In reply to	#19786

Am 02.02.2012 12:02, schrieb Peter Otten:
> Ulrich Eckhardt wrote:
>>
>>   >>>  u'abc'.translate({u'a': u'A'})
>> u'abc'
>>
>> I would call this a chance to improve Python. According to the
>> documentation, using a string [as key] is invalid, but it neither raises
>> an exception nor does it do the obvious and accept single-character
>> strings as keys.
>>
>>
>> Thoughts?
>
> How could this raise an exception? You'd either need a typed dictionary (int
> -->  unicode) or translate() would have to verify that all keys are indeed
> integers.

The latter is exactly what I would have done, i.e. scan the dictionary 
for invalid values, in the spirit of not letting errors pass unnoticed.

> The former would go against the grain of Python, the latter would
> make the method less flexible as the set of keys currently need not be
> predefined:
>
>>>> class A(object):
> ...     def __getitem__(self, key):
> ...             return unichr(key).upper()
> ...
>>>> u"alpha".translate(A())
> u'ALPHA'

Working with __getitem__ is a point. I'm not sure if it is reasonable to 
expect this to work though. I'm -0 on that. I could also imagine a 
completely separate path for iterable and non-iterable mappings.

> Using unicode instead of integer keys would be nice but breaks backwards
> compatibility, using both could double the number of dictionary lookups.

Dictionary lookups are constant time and well-optimized, so I'd actually 
go for allowing both and paying that price. I could even imagine 
preprocessing the supplied dictionary while checking for invalid values. 
The result could be a structure that makes use of the fact that Unicode 
codepoints are < 22 bits and that makes the way from the elements of the 
source sequence to the according map entry as short as possible (I'm not 
sure if using codepoints or single-character strings is faster). 
However, those are early optimizations of which I'm not sure if they are 
worth it.
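
The preprocessing idea can be sketched in user code, in Python 3. The helper name build_table is hypothetical and this is not how translate() itself works; it only shows one way to accept both key styles while rejecting anything else up front.

```python
def build_table(mapping):
    """Hypothetical helper: normalise keys to code points, reject others."""
    table = {}
    for key, value in mapping.items():
        if isinstance(key, int):
            table[key] = value
        elif isinstance(key, str) and len(key) == 1:
            table[ord(key)] = value
        else:
            raise TypeError("invalid translate() key: %r" % (key,))
    return table

# Mixed key styles are normalised into one integer-keyed table.
table = build_table({"a": "A", 0x62: "B"})
print("abc".translate(table))

# A multi-character key is reported instead of being silently ignored.
try:
    build_table({"ab": "X"})
except TypeError as exc:
    print("rejected:", exc)
```

Normalising once keeps translate() at a single lookup per character, at the cost of one pass over the mapping, and it only works for mappings that can be iterated.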

Anyway, thanks for your thoughts, they are always appreciated!

Uli

