Groups > comp.lang.python > #54685 > unrolled thread

removing BOM prepended by codecs?

Started by	"J. Bagg" <j.bagg@kent.ac.uk>
First post	2013-09-24 10:42 +0100
Last post	2013-09-25 08:20 +1000
Articles	4 — 4 participants


  removing BOM prepended by codecs? "J. Bagg" <j.bagg@kent.ac.uk> - 2013-09-24 10:42 +0100
    Re: removing BOM prepended by codecs? Steven D'Aprano <steve@pearwood.info> - 2013-09-24 10:56 +0000
    Re: removing BOM prepended by codecs? wxjmfauth@gmail.com - 2013-09-24 11:43 -0700
      Re: removing BOM prepended by codecs? Chris Angelico <rosuav@gmail.com> - 2013-09-25 08:20 +1000

#54685 — removing BOM prepended by codecs?

From	"J. Bagg" <j.bagg@kent.ac.uk>
Date	2013-09-24 10:42 +0100
Subject	removing BOM prepended by codecs?
Message-ID	<mailman.290.1380017820.18130.python-list@python.org>

I'm having trouble with the BOM that is now prepended to codecs files. 
The files have to be read by java servlets which expect a clean file 
without any BOM.

Is there a way to stop the BOM being written?

It is seriously messing up my work as the servlets do not expect it to 
be there. I could delete it but that means another delay in retrieving 
the data. My work is a bibliographic system and I'm writing a new search 
engine in Python to replace an ancient one in C.

I'm working on Linux with a locale of en_GB.UTF8

-- 
Dr Janet Bagg
CSAC, Dept of Anthropology,
University of Kent, UK


#54687

From	Steven D'Aprano <steve@pearwood.info>
Date	2013-09-24 10:56 +0000
Message-ID	<52416fd5$0$29991$c3e8da3$5496439d@news.astraweb.com>
In reply to	#54685

On Tue, 24 Sep 2013 10:42:22 +0100, J. Bagg wrote:

> I'm having trouble with the BOM that is now prepended to codecs files.
> The files have to be read by java servlets which expect a clean file
> without any BOM.
> 
> Is there a way to stop the BOM being written?

Of course there is :-) but first we need to know how you are writing it 
in the first place.

If you are dealing with existing files, which already contain a BOM, you 
may need to open the files and re-save them without the BOM.
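A minimal sketch of re-saving an existing file without its BOM, assuming Python 3 and assuming the unwanted mark is the three-byte UTF-8 signature. The helper name strip_utf8_bom is hypothetical, not something from the standard library:

```python
import codecs


def strip_utf8_bom(path):
    """Rewrite the file at `path` without a leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False otherwise.
    (Hypothetical helper; reads the whole file into memory.)
    """
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do, leave the file untouched
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True
```

Working on raw bytes avoids any decoding step, so the rest of the file is written back unchanged.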

If you are dealing with temporary files you're creating programmatically, 
it depends how you're creating them. My guess is that you're doing 
something like this:

f = open("some file", "w", encoding="UTF-16")  # or UTF-32
f.write(data)
f.close()

or similar. Both the UTF-16 and UTF-32 codecs write BOMs. To avoid that, 
you should use UTF-16-BE or UTF-16-LE (Big Endian or Little Endian), as 
appropriate to your platform.
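The difference is easy to see on the encoded bytes. A small sketch, assuming Python 3; the sample string is arbitrary:

```python
import codecs

text = "abc"

# The generic "utf-16" codec prepends a BOM in the platform's native byte order.
with_bom = text.encode("utf-16")

# The endian-specific codecs write no BOM at all.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(le)
print(be)
```

The same codec names can be passed as the encoding argument to open(), so a file written with "utf-16-le" or "utf-16-be" starts directly with the data.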

If you're getting a UTF-8 BOM, that's seriously weird. The standard UTF-8 
codec doesn't write a BOM. (Strictly speaking, it's not a Byte Order 
Mark, but a Signature.) Unless you're using encoding='UTF-8-sig', I can't 
guess how you're getting a UTF-8 BOM.
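For comparison, a short sketch of the two UTF-8 codecs (again assuming Python 3, with an arbitrary sample string):

```python
import codecs

text = "abc"

plain = text.encode("utf-8")       # no signature
signed = text.encode("utf-8-sig")  # signature prepended

print(plain)
print(signed.startswith(codecs.BOM_UTF8))

# Decoding: "utf-8" keeps the mark as a U+FEFF character,
# "utf-8-sig" drops it, and also accepts input that has no mark.
kept = signed.decode("utf-8")
dropped = signed.decode("utf-8-sig")
tolerant = plain.decode("utf-8-sig")
```

So a writer that must not emit the signature should use "utf-8", while a reader that may meet either kind of file can use "utf-8-sig".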

If you're doing something else, well, you'll have to explain what you're 
doing before we can tell you how to stop doing it :-)

> I'm working on Linux with a locale of en_GB.UTF8

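The locale sets the default encoding used by the OS. Python's own default
string encoding (sys.getdefaultencoding()) is ASCII in Python 2 and UTF-8
in Python 3, although Python 3's open() falls back to the locale's
preferred encoding when no encoding argument is given.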
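Both defaults can be inspected directly. A sketch assuming Python 3; the variable names are only illustrative, and the second value depends on the machine's locale settings:

```python
import locale
import sys

# Default used for str/bytes conversions when no encoding is named.
default_str_encoding = sys.getdefaultencoding()

# Encoding the locale prefers; in Python 3, open() uses this when
# called in text mode without an explicit encoding argument.
default_file_encoding = locale.getpreferredencoding(False)

print(default_str_encoding)
print(default_file_encoding)
```

Passing encoding="utf-8" explicitly to open() sidesteps the question of which default applies.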

-- 
Steven


#54714

From	wxjmfauth@gmail.com
Date	2013-09-24 11:43 -0700
Message-ID	<35d582cb-130d-484f-a434-554bb52175ac@googlegroups.com>
In reply to	#54685

On Tuesday, 24 September 2013 11:42:22 UTC+2, J. Bagg wrote:
> I'm having trouble with the BOM that is now prepended to codecs files.
> The files have to be read by java servlets which expect a clean file
> without any BOM.
>
> Is there a way to stop the BOM being written?
>
> It is seriously messing up my work as the servlets do not expect it to
> be there. I could delete it but that means another delay in retrieving
> the data. My work is a bibliographic system and I'm writing a new search
> engine in Python to replace an ancient one in C.
>
> I'm working on Linux with a locale of en_GB.UTF8
>
> --
> Dr Janet Bagg
> CSAC, Dept of Anthropology,
> University of Kent, UK

---------

Some points.

- The encoding of a text file does not matter. What
counts is knowing which encoding was used.

- The *mark* (once the Unicode.org terminology in FAQ) indicating
a Unicode-encoded raw text file is neither a byte order mark
nor a signature; it is an encoded code point, the encoded
U+FEFF, 'ZERO WIDTH NO-BREAK SPACE', code point. (Note, a
non-breaking space at the start of a text is nonsense.)

- When such a mark exists, it is always possible to work
100% safely. No possible error.

- When such a mark does not exist, in many cases
guessing is the only option.

These are facts.
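The point about U+FEFF being an ordinary encoded code point can be checked against the constants in the codecs module. A small sketch, assuming Python 3:

```python
import codecs
import unicodedata

mark = "\ufeff"

# The character's official Unicode name.
mark_name = unicodedata.name(mark)
print(mark_name)

# Encoding that single code point with a BOM-less codec yields
# exactly the byte sequences usually called "the BOM".
as_utf8 = mark.encode("utf-8")
as_utf16_le = mark.encode("utf-16-le")
as_utf16_be = mark.encode("utf-16-be")
as_utf32_le = mark.encode("utf-32-le")

print(as_utf8 == codecs.BOM_UTF8)
print(as_utf16_le == codecs.BOM_UTF16_LE)
print(as_utf16_be == codecs.BOM_UTF16_BE)
print(as_utf32_le == codecs.BOM_UTF32_LE)
```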


Now to the question: should I put such a mark in a file,
especially a UTF-8 one? I would say the following:

It seems to me that one sees more and more marked UTF-8 files.
(Windows is probably a reason.)

More importantly, more and more tools and software handle
this UTF-8 mark, or are being fixed to support it, so it
basically does not hurt too much. E.g. Python, golang 1.1
(not the case in 1.0), LibreOffice, TeXworks (once not the
case), the Unicode TeX engines, ...

If I had to work in "archiving", I would seriously think
twice.

PS Unicode encodes characters on a per *script* ("alphabet")
basis, not per *language*.

jmf


#54717

From	Chris Angelico <rosuav@gmail.com>
Date	2013-09-25 08:20 +1000
Message-ID	<mailman.306.1380061254.18130.python-list@python.org>
In reply to	#54714

On Wed, Sep 25, 2013 at 4:43 AM,  <wxjmfauth@gmail.com> wrote:
> - The *mark* (once the Unicode.org terminology in FAQ) indicating
> a Unicode-encoded raw text file is neither a byte order mark
> nor a signature; it is an encoded code point, the encoded
> U+FEFF, 'ZERO WIDTH NO-BREAK SPACE', code point. (Note, a
> non-breaking space at the start of a text is nonsense.)
>
> - When such a mark exists, it is always possible to work
> 100% safely. No possible error.

I have a file encoded in Latin-1 which begins with LATIN SMALL LETTER
Y WITH DIAERESIS followed by LATIN SMALL LETTER THORN. I also have a
file encoded in EBCDIC (okay, I don't really, but let's pretend) that
begins with the same bytes. But of course, when such a mark exists,
there is no possible error - of that there is no manner of doubt, no
possible, probable shadow of doubt, no possible doubt whatever.

("No possible doubt whatever.")
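The Latin-1 case is easy to reproduce. A sketch assuming Python 3, taking the two leading bytes to be FF FE (the little-endian UTF-16 mark); the variable names are illustrative:

```python
import codecs
import unicodedata

leading = b"\xff\xfe"

# The same two bytes are the UTF-16 little-endian BOM...
is_utf16_le_bom = leading == codecs.BOM_UTF16_LE

# ...and, read as Latin-1, two perfectly ordinary letters.
as_latin1 = leading.decode("latin-1")
names = [unicodedata.name(ch) for ch in as_latin1]

print(is_utf16_le_bom)
print(names)
```

The bytes alone cannot say which reading was intended; that still has to come from knowing the encoding.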

ChrisA
