Re: removing BOM prepended by codecs?

Started by	Peter Otten <__peter__@web.de>
First post	2013-09-24 17:59 +0200
Last post	2013-09-24 17:59 +0200
Articles	1 — 1 participant

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

#54708 — Re: removing BOM prepended by codecs?

From	Peter Otten <__peter__@web.de>
Date	2013-09-24 17:59 +0200
Subject	Re: removing BOM prepended by codecs?
Message-ID	<mailman.299.1380038322.18130.python-list@python.org>

J. Bagg wrote:

> I've checked the original files using od and they don't have BOMs.
> 
> I'll remove them in the servlet. The overhead is probably small enough
> unless somebody is doing a massive search. We have a limit anyway to
> prevent somebody stealing the entire set of data.
> 
> I started writing the Python search because the ancient C search had
> started putting out BOMs. I'm actually mystified because our home Linux
> box does not add BOMs even though it runs 2.7 but my work one does even
> though it has the same version. The only difference is Fedora 18 v
> Fedora 17.
> 
> The BOMs are certainly there:
> 
> <86> <AD><FB>%R 10C0203z-621
> %A François-Xavier Le_Bourdonnec
> 
> 0000000 206     255 373   %   R       1   0   C   0   2   0   3   z   -
> 
> J
> 

Were these files edited with Notepad? According to

http://docs.python.org/2/library/codecs.html#encodings-and-unicode

"""
To increase the reliability with which a UTF-8 encoding can be detected, 
Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") 
for its Notepad program: Before any of the Unicode characters is written to 
the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 
0xef, 0xbb, 0xbf) is written.
"""
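As a quick illustration of what the docs describe (a minimal sketch, not taken from the original files; the sample string is just the record header from the quoted output), the BOM bytes are available as codecs.BOM_UTF8, and the two codecs differ only in how they treat that prefix:

```python
import codecs

# The UTF-8 encoded BOM is exposed as a constant in the codecs module.
bom = codecs.BOM_UTF8  # the bytes 0xef, 0xbb, 0xbf

# "utf-8-sig" prepends the BOM when encoding...
encoded = u"%R 10C0203z-621".encode("utf-8-sig")
has_bom = encoded.startswith(bom)

# ...and skips it, if present, when decoding.
decoded = encoded.decode("utf-8-sig")

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
decoded_plain = encoded.decode("utf-8")
```

So data that went through "utf-8-sig" on the way out will carry the three extra bytes, and decoding it with plain "utf-8" leaves a stray U+FEFF at the start of the text.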

To strip off such a UTF-8 encoded BOM you can open the source file with 
"utf-8-sig" and write the output to a (different!) file with "utf-8":

import codecs
import shutil

with codecs.open(source, "r", encoding="utf-8-sig") as instream:
    with codecs.open(dest, "w", encoding="utf-8") as outstream:
        shutil.copyfileobj(instream, outstream)
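To check that the copy really drops the BOM, something like the following self-contained sketch might help. The helper names copy_without_bom and starts_with_bom, the temporary file names, and the sample text are all made up for the demonstration, and io.open is swapped in for codecs.open only because it behaves the same way on 2.7 and 3.x:

```python
import codecs
import io
import os
import shutil
import tempfile


def copy_without_bom(source, dest):
    """Copy source to dest, dropping a leading UTF-8 BOM if there is one."""
    with io.open(source, "r", encoding="utf-8-sig") as instream:
        with io.open(dest, "w", encoding="utf-8") as outstream:
            shutil.copyfileobj(instream, outstream)


def starts_with_bom(path):
    """Return True if the file's first three bytes are the UTF-8 BOM."""
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


# Build a throwaway file that starts with a BOM, then copy it.
tmpdir = tempfile.mkdtemp()
source = os.path.join(tmpdir, "with_bom.txt")      # hypothetical name
dest = os.path.join(tmpdir, "without_bom.txt")     # hypothetical name

with open(source, "wb") as f:
    f.write(codecs.BOM_UTF8 + u"%A Fran\u00e7ois-Xavier".encode("utf-8"))

copy_without_bom(source, dest)

before = starts_with_bom(source)
after = starts_with_bom(dest)
with open(dest, "rb") as f:
    result = f.read()

shutil.rmtree(tmpdir)
```

Because "utf-8-sig" only skips a BOM when one is actually present, the same copy is harmless for files that never had one.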
