| Started by | Tim Arnold <Tim.Arnold@sas.com> |
|---|---|
| First post | 2012-01-31 13:09 -0500 |
| Last post | 2012-02-02 13:40 +0100 |
| Articles | 9 — 4 participants |
xhtml encoding question Tim Arnold <Tim.Arnold@sas.com> - 2012-01-31 13:09 -0500
Re: xhtml encoding question Stefan Behnel <stefan_ml@behnel.de> - 2012-02-01 09:26 +0100
Re: xhtml encoding question Tim Arnold <Tim.Arnold@sas.com> - 2012-02-01 13:15 -0500
Re: xhtml encoding question Stefan Behnel <stefan_ml@behnel.de> - 2012-02-02 08:02 +0100
Re: xhtml encoding question Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2012-02-01 09:39 +0100
Re: xhtml encoding question Peter Otten <__peter__@web.de> - 2012-02-01 10:32 +0100
Re: xhtml encoding question Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2012-02-01 17:03 +0100
Re: xhtml encoding question Peter Otten <__peter__@web.de> - 2012-02-02 12:02 +0100
Re: xhtml encoding question Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2012-02-02 13:40 +0100
| From | Tim Arnold <Tim.Arnold@sas.com> |
|---|---|
| Date | 2012-01-31 13:09 -0500 |
| Subject | xhtml encoding question |
| Message-ID | <jg9apg$v0$1@foggy.unx.sas.com> |
I have to follow a specification for producing xhtml files.
The original files are in cp1252 encoding and I must reencode them to utf-8.
Also, I have to replace certain characters with html entities.
I think I've got this right, but I'd like to hear if there's something
I'm doing that is dangerous or wrong.
Please see the appended code, and thanks for any comments or suggestions.
I have two functions, translate (replaces high characters with entities)
and reencode (um, reencodes):
---------------------------------
import codecs, StringIO
from lxml import etree

high_chars = {
    0x2014:'&mdash;', # 'EM DASH',
    0x2013:'&ndash;', # 'EN DASH',
    0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
    0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
    0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
    0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
    0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
    0x2122:'&trade;', # 'TRADE MARK SIGN',
    0x00A9:'&copy;', # 'COPYRIGHT SYMBOL',
}

def translate(string):
    s = ''
    for c in string:
        if ord(c) in high_chars:
            c = high_chars.get(ord(c))
        s += c
    return s

def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
    with codecs.open(filename,encoding=in_encoding) as f:
        s = f.read()
    sio = StringIO.StringIO(translate(s))
    parser = etree.HTMLParser(encoding=in_encoding)
    tree = etree.parse(sio, parser)
    result = etree.tostring(tree.getroot(), method='html',
                            pretty_print=True,
                            encoding=out_encoding)
    with open(filename,'wb') as f:
        f.write(result)

if __name__ == '__main__':
    fname = 'mytest.htm'
    reencode(fname)
| From | Stefan Behnel <stefan_ml@behnel.de> |
|---|---|
| Date | 2012-02-01 09:26 +0100 |
| Message-ID | <mailman.5291.1328084788.27778.python-list@python.org> |
| In reply to | #19688 |
Tim Arnold, 31.01.2012 19:09:
> I have to follow a specification for producing xhtml files.
> The original files are in cp1252 encoding and I must reencode them to utf-8.
> Also, I have to replace certain characters with html entities.
>
> I think I've got this right, but I'd like to hear if there's something I'm
> doing that is dangerous or wrong.
>
> Please see the appended code, and thanks for any comments or suggestions.
>
> I have two functions, translate (replaces high characters with entities)
> and reencode (um, reencodes):
> ---------------------------------
> import codecs, StringIO
> from lxml import etree
> high_chars = {
>     0x2014:'&mdash;', # 'EM DASH',
>     0x2013:'&ndash;', # 'EN DASH',
>     0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>     0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>     0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>     0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>     0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>     0x2122:'&trade;', # 'TRADE MARK SIGN',
>     0x00A9:'&copy;', # 'COPYRIGHT SYMBOL',
> }
> def translate(string):
>     s = ''
>     for c in string:
>         if ord(c) in high_chars:
>             c = high_chars.get(ord(c))
>         s += c
>     return s
I hope you are aware that this is about the slowest possible algorithm
(well, the slowest one that doesn't do anything unnecessary). Since none of
this is required when parsing or generating XHTML, I assume your spec tells
you that you should do these replacements?
> def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
>     with codecs.open(filename,encoding=in_encoding) as f:
>         s = f.read()
>     sio = StringIO.StringIO(translate(s))
>     parser = etree.HTMLParser(encoding=in_encoding)
>     tree = etree.parse(sio, parser)
Yes, you are doing something dangerous and wrong here. For one, you are
decoding the data twice. Then, didn't you say XHTML? Why do you use the
HTML parser to parse XML?
>     result = etree.tostring(tree.getroot(), method='html',
>                             pretty_print=True,
>                             encoding=out_encoding)
>     with open(filename,'wb') as f:
>         f.write(result)
Use tree.write(f, ...)
Assuming you really meant XHTML and not HTML, I'd just drop your entire
code and do this instead:
    tree = etree.parse(in_path)
    tree.write(out_path, encoding='utf8', pretty_print=True)
Note that I didn't provide an input encoding. XML is safe in that regard.
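Why no input encoding is needed can be sketched with the standard library's xml.etree.ElementTree (an assumption made here to keep the example free of dependencies; the thread itself uses lxml). An XML parser takes the encoding from the document's own XML declaration. The sample document below is invented for illustration, and the code is written for Python 3.

```python
import io
import xml.etree.ElementTree as ET

# Invented sample: an XML document stored as cp1252 bytes that names
# its own encoding in the XML declaration.
text = ('<?xml version="1.0" encoding="windows-1252"?>'
        '<p>\u201cquoted\u201d \u2014 done</p>')
raw = text.encode("cp1252")

# No encoding argument: the parser reads it from the declaration.
root = ET.fromstring(raw)

# Serialise the same tree again, this time as UTF-8 bytes.
out = io.BytesIO()
ET.ElementTree(root).write(out, encoding="utf-8", xml_declaration=True)
utf8_bytes = out.getvalue()
```

This only holds for well-formed XML; an HTML file without such a declaration still needs the encoding to be supplied from outside.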
Stefan
| From | Tim Arnold <Tim.Arnold@sas.com> |
|---|---|
| Date | 2012-02-01 13:15 -0500 |
| Message-ID | <jgbvfd$q3d$1@foggy.unx.sas.com> |
| In reply to | #19696 |
On 2/1/2012 3:26 AM, Stefan Behnel wrote:
> Tim Arnold, 31.01.2012 19:09:
>> I have to follow a specification for producing xhtml files.
>> The original files are in cp1252 encoding and I must reencode them to utf-8.
>> Also, I have to replace certain characters with html entities.
>> ---------------------------------
>> import codecs, StringIO
>> from lxml import etree
>> high_chars = {
>>     0x2014:'&mdash;', # 'EM DASH',
>>     0x2013:'&ndash;', # 'EN DASH',
>>     0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>>     0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>>     0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>>     0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>>     0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>>     0x2122:'&trade;', # 'TRADE MARK SIGN',
>>     0x00A9:'&copy;', # 'COPYRIGHT SYMBOL',
>> }
>> def translate(string):
>>     s = ''
>>     for c in string:
>>         if ord(c) in high_chars:
>>             c = high_chars.get(ord(c))
>>         s += c
>>     return s
>
> I hope you are aware that this is about the slowest possible algorithm
> (well, the slowest one that doesn't do anything unnecessary). Since none of
> this is required when parsing or generating XHTML, I assume your spec tells
> you that you should do these replacements?
>
I wasn't aware of it, but I am now--code's embarrassing now.
The spec I must follow forces me to do the translation.
I am actually working with html not xhtml; which makes a huge
difference, sorry for that.
Ulrich's line of code for translate is elegant.
    for c in string:
        s += high_chars.get(c,c)
>
>> def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
>>     with codecs.open(filename,encoding=in_encoding) as f:
>>         s = f.read()
>>     sio = StringIO.StringIO(translate(s))
>>     parser = etree.HTMLParser(encoding=in_encoding)
>>     tree = etree.parse(sio, parser)
>
> Yes, you are doing something dangerous and wrong here. For one, you are
> decoding the data twice. Then, didn't you say XHTML? Why do you use the
> HTML parser to parse XML?
>
I see that I'm decoding twice now, thanks.
Also, I now see that when lxml writes the result back out the entities I
got from my translate function are resolved, which defeats the whole
purpose.
>
>>     result = etree.tostring(tree.getroot(), method='html',
>>                             pretty_print=True,
>>                             encoding=out_encoding)
>>     with open(filename,'wb') as f:
>>         f.write(result)
>
> Use tree.write(f, ...)
From all the info I've received on this thread, plus some
additional reading, I think I need the following code.
Use the HTMLParser because the source files are actually HTML, and use
output from etree.tostring() as input to translate() as the very last step.
def reencode(filename, in_encoding='cp1252', out_encoding='utf-8'):
    parser = etree.HTMLParser(encoding=in_encoding)
    tree = etree.parse(filename, parser)
    result = etree.tostring(tree.getroot(), method='html',
                            pretty_print=True,
                            encoding=out_encoding)
    with open(filename, 'wb') as f:
        f.write(translate(result))
not simply tree.write(f...) because I have to do the translation at the
end, so I get the entities instead of the resolved entities from lxml.
Again, it would be simpler if this was xhtml, but I misspoke
(mis-wrote?) when I said xhtml; this is for html.
> Assuming you really meant XHTML and not HTML, I'd just drop your entire
> code and do this instead:
>
> tree = etree.parse(in_path)
> tree.write(out_path, encoding='utf8', pretty_print=True)
>
> Note that I didn't provide an input encoding. XML is safe in that regard.
>
> Stefan
>
thanks everyone for the help.
--Tim Arnold
| From | Stefan Behnel <stefan_ml@behnel.de> |
|---|---|
| Date | 2012-02-02 08:02 +0100 |
| Message-ID | <mailman.5345.1328166145.27778.python-list@python.org> |
| In reply to | #19767 |
Tim Arnold, 01.02.2012 19:15:
> On 2/1/2012 3:26 AM, Stefan Behnel wrote:
>> Tim Arnold, 31.01.2012 19:09:
>>> I have to follow a specification for producing xhtml files.
>>> The original files are in cp1252 encoding and I must reencode them to
>>> utf-8.
>>> Also, I have to replace certain characters with html entities.
>>> ---------------------------------
>>> import codecs, StringIO
>>> from lxml import etree
>>> high_chars = {
>>>     0x2014:'&mdash;', # 'EM DASH',
>>>     0x2013:'&ndash;', # 'EN DASH',
>>>     0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>>>     0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>>>     0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>>>     0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>>>     0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>>>     0x2122:'&trade;', # 'TRADE MARK SIGN',
>>>     0x00A9:'&copy;', # 'COPYRIGHT SYMBOL',
>>> }
>>> def translate(string):
>>>     s = ''
>>>     for c in string:
>>>         if ord(c) in high_chars:
>>>             c = high_chars.get(ord(c))
>>>         s += c
>>>     return s
>>
>> I hope you are aware that this is about the slowest possible algorithm
>> (well, the slowest one that doesn't do anything unnecessary). Since none of
>> this is required when parsing or generating XHTML, I assume your spec tells
>> you that you should do these replacements?
>
> I wasn't aware of it, but I am now--code's embarrassing now.
> The spec I must follow forces me to do the translation.
>
> I am actually working with html not xhtml; which makes a huge difference,
We all learn.
> Ulrich's line of code for translate is elegant.
>     for c in string:
>         s += high_chars.get(c,c)
Still not efficient because it builds the string one character at a time
and needs to reallocate (and potentially copy) the string buffer quite
frequently in order to do that. You are lucky with CPython, because it has
an internal optimisation that mitigates this overhead on some platforms.
Other Python implementations don't have that, and even the optimisation in
CPython is platform specific (works well on Linux, for example).
Peter Otten presented a better way of doing it.
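To illustrate the point about repeated concatenation, here is a minimal sketch in Python 3 syntax (the thread itself uses Python 2). The names ENTITIES and escape_chars are made up for this example; the idea is only that collecting the pieces and joining them once avoids growing a string one character at a time.

```python
# Hypothetical two-entry table; the high_chars dict in the thread has more.
ENTITIES = {
    "\u2014": "&mdash;",   # EM DASH
    "\u00a9": "&copy;",    # COPYRIGHT SIGN
}

def escape_chars(text):
    """Replace mapped characters, building the result with a single join."""
    return "".join(ENTITIES.get(c, c) for c in text)

print(escape_chars("\u00a9 2012 \u2014 SAS"))
```

Whether this is measurably faster depends on the interpreter and the input size, so it is worth timing on real data before relying on it.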
> From all the info I've received on this thread, plus some additional
> reading, I think I need the following code.
>
> Use the HTMLParser because the source files are actually HTML, and use
> output from etree.tostring() as input to translate() as the very last step.
>
> def reencode(filename, in_encoding='cp1252', out_encoding='utf-8'):
>     parser = etree.HTMLParser(encoding=in_encoding)
>     tree = etree.parse(filename, parser)
>     result = etree.tostring(tree.getroot(), method='html',
>                             pretty_print=True,
>                             encoding=out_encoding)
>     with open(filename, 'wb') as f:
>         f.write(translate(result))
>
> not simply tree.write(f...) because I have to do the translation at the
> end, so I get the entities instead of the resolved entities from lxml.
Yes, that's better.
Still one thing (since you didn't show us your final translate() function):
you do the character escaping on a UTF-8 encoded string and made the
encoding configurable. That means that the characters you are looking for
must also be encoded with the same encoding in order to find matches.
However, if you ever choose a different target encoding that doesn't have
the nice properties of UTF-8's byte sequences, you may end up with
ambiguous byte sequences in the output that your translate() function
accidentally matches on, thus potentially corrupting your data.
Assuming that you are using Python 2, you may even be accidentally doing
the replacement using Unicode character strings, which then only happens to
work on systems that use UTF-8 as their default encoding. Python 3 has
fixed this trap, but you have to take care to avoid it in Python 2.
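The ambiguity described above can be shown with a small sketch in Python 3, where text (str) and encoded data (bytes) are distinct types. UTF-16-LE is chosen here only as an example of an encoding that lacks UTF-8's self-synchronising byte sequences; the sample string is invented.

```python
# The character we intend to replace, and a text that does NOT contain it.
em_dash = "\u2014"
text = "\u1400 "   # CANADIAN SYLLABICS HYPHEN followed by a space

for encoding in ("utf-8", "utf-16-le"):
    needle = em_dash.encode(encoding)
    data = text.encode(encoding)
    # A byte-level search can match across character boundaries.
    print(encoding, needle in data)
```

In UTF-16-LE the second byte of the first character and the first byte of the space happen to form the byte pair of the em dash, so a byte-level replace would corrupt a text that never contained one.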
I'd prefer serialising the documents into a unicode string
(encoding='unicode'), then post-processing that and finally encoding it to
the target encoding when writing it out. But you'll have to see how that
works out together with your escaping step, and also how it impacts the
HTML <meta> tag that states the document encoding.
Stefan
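A minimal sketch of that order of operations, in Python 3 and without lxml: the serialised document is assumed to be available already as a text string (which is what the encoding='unicode' serialisation mentioned above is meant to provide). ENTITY_TABLE and write_escaped are hypothetical names, and the table is only a subset of the list in the thread.

```python
import os
import tempfile

# Hypothetical table: code point -> HTML entity.
ENTITY_TABLE = {
    0x2014: "&mdash;",
    0x201C: "&ldquo;",
    0x201D: "&rdquo;",
}

def write_escaped(text, path, out_encoding="utf-8"):
    """Escape on the text level first, encode only when writing."""
    escaped = text.translate(ENTITY_TABLE)
    with open(path, "w", encoding=out_encoding, newline="") as f:
        f.write(escaped)

# Usage with an invented snippet and a temporary file.
path = os.path.join(tempfile.mkdtemp(), "out.htm")
write_escaped("<p>\u201cHi\u201d \u2014 bye</p>", path)
with open(path, encoding="utf-8") as f:
    result = f.read()
```

Because the replacement happens before encoding, the same function works unchanged for any target encoding; keeping the <meta> tag in step with out_encoding is left out of this sketch.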
| From | Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> |
|---|---|
| Date | 2012-02-01 09:39 +0100 |
| Message-ID | <daanv8-7i.ln1@satorlaser.homedns.org> |
| In reply to | #19688 |
Am 31.01.2012 19:09, schrieb Tim Arnold:
> high_chars = {
>     0x2014:'&mdash;', # 'EM DASH',
>     0x2013:'&ndash;', # 'EN DASH',
>     0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>     0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>     0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>     0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>     0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>     0x2122:'&trade;', # 'TRADE MARK SIGN',
>     0x00A9:'&copy;', # 'COPYRIGHT SYMBOL',
> }
You could use Unicode string literals directly instead of using the
codepoint, making it a bit more self-documenting and saving you the
later call to ord():
    high_chars = {
        u'\u2014': '&mdash;',
        u'\u2013': '&ndash;',
        ...
    }
>     for c in string:
>         if ord(c) in high_chars:
>             c = high_chars.get(ord(c))
>         s += c
>     return s
Instead of checking if there is a replacement and then looking up the
replacement again, just use the default:
    for c in string:
        s += high_chars.get(c, c)
Alternatively, if you find that clearer, you could also check if the
return value of get() is None to find out if there is a replacement:
    for c in string:
        r = high_chars.get(c)
        if r is None:
            s += c
        else:
            s += r
Uli
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-02-01 10:32 +0100 |
| Message-ID | <mailman.5292.1328088791.27778.python-list@python.org> |
| In reply to | #19697 |
Ulrich Eckhardt wrote:
> Am 31.01.2012 19:09, schrieb Tim Arnold:
>> high_chars = {
>>     0x2014:'&mdash;', # 'EM DASH',
>>     0x2013:'&ndash;', # 'EN DASH',
>>     0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>>     0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>>     0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>>     0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>>     0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>>     0x2122:'&trade;', # 'TRADE MARK SIGN',
>>     0x00A9:'&copy;', # 'COPYRIGHT SYMBOL',
>> }
>
> You could use Unicode string literals directly instead of using the
> codepoint, making it a bit more self-documenting and saving you the
> later call to ord():
>
>     high_chars = {
>         u'\u2014': '&mdash;',
>         u'\u2013': '&ndash;',
>         ...
>     }
>
>>     for c in string:
>>         if ord(c) in high_chars:
>>             c = high_chars.get(ord(c))
>>         s += c
>>     return s
>
> Instead of checking if there is a replacement and then looking up the
> replacement again, just use the default:
>
>     for c in string:
>         s += high_chars.get(c, c)
>
> Alternatively, if you find that clearer, you could also check if the
> return value of get() is None to find out if there is a replacement:
>
>     for c in string:
>         r = high_chars.get(c)
>         if r is None:
>             s += c
>         else:
>             s += r
It doesn't matter for the OP (see Stefan Behnel's post), but if you want to
replace characters in a unicode string the best way is probably the
translate() method:
>>> print u"\xa9\u2122"
©™
>>> u"\xa9\u2122".translate({0xa9: u"&copy;", 0x2122: u"&trade;"})
u'&copy;&trade;'
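For readers on Python 3, where every str is a Unicode string, the same idea might look like the sketch below. The two-entry table is just an excerpt chosen for the example.

```python
# Keys are code points (integers), values are replacement strings.
table = {0xA9: "&copy;", 0x2122: "&trade;"}

escaped = "\xa9\u2122 Example Corp".translate(table)
print(escaped)
```

Characters without an entry in the table are passed through unchanged, so no default handling is needed.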
| From | Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> |
|---|---|
| Date | 2012-02-01 17:03 +0100 |
| Message-ID | <8b4ov8-ad2.ln1@satorlaser.homedns.org> |
| In reply to | #19699 |
Am 01.02.2012 10:32, schrieb Peter Otten:
> It doesn't matter for the OP (see Stefan Behnel's post), but if you want to
> replace characters in a unicode string the best way is probably the
> translate() method:
>
> >>> print u"\xa9\u2122"
> ©™
> >>> u"\xa9\u2122".translate({0xa9: u"&copy;", 0x2122: u"&trade;"})
> u'&copy;&trade;'
>
Yes, this is both more expressive and at the same time probably even
more efficient.
Question though:
>>> u'abc'.translate({u'a': u'A'})
u'abc'
I would call this a chance to improve Python. According to the
documentation, using a string is invalid, but it neither raises an
exception nor does it do the obvious and accept single-character strings
as keys.
Thoughts?
Uli
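In Python 3 the behaviour in question is unchanged, but str.maketrans() converts a dictionary with single-character string keys into the integer-keyed form that translate() expects. A short sketch:

```python
# String keys are never looked up, so nothing is replaced.
ignored = "abc".translate({"a": "A"})

# str.maketrans() turns single-character keys into code points.
table = str.maketrans({"a": "A"})
converted = "abc".translate(table)

print(ignored, converted, table)
```

This keeps translate() itself strict about integer keys while still offering the more readable spelling.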
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-02-02 12:02 +0100 |
| Message-ID | <mailman.5353.1328180546.27778.python-list@python.org> |
| In reply to | #19716 |
Ulrich Eckhardt wrote:
> Am 01.02.2012 10:32, schrieb Peter Otten:
>> It doesn't matter for the OP (see Stefan Behnel's post), but if you want
>> to replace characters in a unicode string the best way is probably the
>> translate() method:
>>
>> >>> print u"\xa9\u2122"
>> ©™
>> >>> u"\xa9\u2122".translate({0xa9: u"&copy;", 0x2122: u"&trade;"})
>> u'&copy;&trade;'
>>
>
> Yes, this is both more expressive and at the same time probably even
> more efficient.
>
>
> Question though:
>
> >>> u'abc'.translate({u'a': u'A'})
> u'abc'
>
> I would call this a chance to improve Python. According to the
> documentation, using a string is invalid, but it neither raises an
> exception nor does it do the obvious and accept single-character strings
> as keys.
>
>
> Thoughts?
How could this raise an exception? You'd either need a typed dictionary (int
--> unicode) or translate() would have to verify that all keys are indeed
integers. The former would go against the grain of Python, the latter would
make the method less flexible as the set of keys currently need not be
predefined:
>>> class A(object):
...     def __getitem__(self, key):
...         return unichr(key).upper()
...
>>> u"alpha".translate(A())
u'ALPHA'
Using unicode instead of integer keys would be nice but breaks backwards
compatibility, using both could double the number of dictionary lookups.
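The same flexibility exists in Python 3, where chr() replaces unichr(). The sketch below uses a made-up class UpperVowels whose __getitem__ computes replacements on demand and raises LookupError for characters that should stay unchanged.

```python
class UpperVowels:
    """Mapping-like object: str.translate() only needs __getitem__."""

    def __getitem__(self, codepoint):
        char = chr(codepoint)
        if char in "aeiou":
            return char.upper()
        # LookupError tells translate() to leave the character as it is.
        raise LookupError(codepoint)

result = "alpha".translate(UpperVowels())
print(result)
```

An object like this has no fixed set of keys, which is why an up-front check of all keys is not possible in general.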
| From | Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> |
|---|---|
| Date | 2012-02-02 13:40 +0100 |
| Message-ID | <vqcqv8-m08.ln1@satorlaser.homedns.org> |
| In reply to | #19786 |
Am 02.02.2012 12:02, schrieb Peter Otten:
> Ulrich Eckhardt wrote:
>>
>> >>> u'abc'.translate({u'a': u'A'})
>> u'abc'
>>
>> I would call this a chance to improve Python. According to the
>> documentation, using a string [as key] is invalid, but it neither raises
>> an exception nor does it do the obvious and accept single-character
>> strings as keys.
>>
>>
>> Thoughts?
>
> How could this raise an exception? You'd either need a typed dictionary (int
> --> unicode) or translate() would have to verify that all keys are indeed
> integers.
The latter is exactly what I would have done, i.e. scan the dictionary
for invalid values, in the spirit of not letting errors pass unnoticed.
> The former would go against the grain of Python, the latter would
> make the method less flexible as the set of keys currently need not be
> predefined:
>
> >>> class A(object):
> ...     def __getitem__(self, key):
> ...         return unichr(key).upper()
> ...
> >>> u"alpha".translate(A())
> u'ALPHA'
Working with __getitem__ is a point. I'm not sure if it is reasonable to
expect this to work though. I'm -0 on that. I could also imagine a
completely separate path for iterable and non-iterable mappings.
> Using unicode instead of integer keys would be nice but breaks backwards
> compatibility, using both could double the number of dictionary lookups.
Dictionary lookups are constant time and well-optimized, so I'd actually
go for allowing both and paying that price. I could even imagine
preprocessing the supplied dictionary while checking for invalid values.
The result could be a structure that makes use of the fact that Unicode
codepoints are < 22 bits and that makes the way from the elements of the
source sequence to the according map entry as short as possible (I'm not
sure if using codepoints or single-character strings is faster).
However, those are early optimizations of which I'm not sure if they are
worth it.
Anyway, thanks for your thoughts, they are always appreciated!
Uli
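The up-front check and preprocessing discussed here could also be done outside translate(). The helper below is purely hypothetical (checked_table is not part of Python); it is written in Python 3 and only works for real dictionaries, not for __getitem__-only objects like the one in Peter's example.

```python
def checked_table(mapping):
    """Return a code-point-keyed copy of mapping, rejecting invalid keys."""
    table = {}
    for key, value in mapping.items():
        if isinstance(key, str):
            if len(key) != 1:
                raise ValueError(
                    "string keys must be single characters: %r" % (key,))
            key = ord(key)
        elif not isinstance(key, int):
            raise TypeError(
                "keys must be int or str, not %s" % type(key).__name__)
        table[key] = value
    return table

# Mixed keys are normalised to code points before translating.
table = checked_table({"a": "A", 0x62: "B"})
result = "abc".translate(table)
```

The cost is one pass over the dictionary per call, which is small for tables like the entity list in this thread.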