Path: csiph.com!usenet.pasdenom.info!aioe.org!news.stack.nl!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <CALG+76f8z-06kHimsk3mMBY0bFKP9u4GSsmJxj9PGwQwjW=afg@mail.gmail.com>
References: <52d74063$0$29970$c3e8da3$5496439d@news.astraweb.com> <CALG+76f8z-06kHimsk3mMBY0bFKP9u4GSsmJxj9PGwQwjW=afg@mail.gmail.com>
Date: Fri, 17 Jan 2014 05:06:16 +1100
Subject: Re: Guessing the encoding from a BOM
From: Chris Angelico <rosuav@gmail.com>
Cc: "python-list@python.org" <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.5595.1389895586.18130.python-list@python.org>
Lines: 19
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:64095

On Fri, Jan 17, 2014 at 5:01 AM, Björn Lindqvist <bjourne@gmail.com> wrote:
> 2014/1/16 Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>> def guess_encoding_from_bom(filename, default):
>>     with open(filename, 'rb') as f:
>>         sig = f.read(4)
>>     # Test UTF-32 first: the UTF-16 LE BOM is a prefix of the UTF-32 LE BOM.
>>     if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>>         return 'utf_32'
>>     elif sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>>         return 'utf_16'
>>     else:
>>         return default
>
> You might want to add the utf8 bom too: '\xEF\xBB\xBF'.

I'd actually rather not. It would tempt people to pollute UTF-8 files
with a BOM, which is not necessary unless you are MS Notepad.
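
A quick sketch of why the BOM isn't needed, assuming Python 3 and its
standard codecs. The helper name decode_utf8 and the sample string are
made up for illustration; this is not part of the function quoted above.

```python
import codecs

def decode_utf8(data):
    # Hypothetical helper: 'utf-8-sig' strips a leading BOM if one is
    # present and behaves like plain 'utf-8' when there is none.
    return data.decode("utf-8-sig")

text = "naïve café"
plain = text.encode("utf-8")          # ordinary UTF-8, no signature
with_bom = codecs.BOM_UTF8 + plain    # b'\xef\xbb\xbf' prepended

# Plain UTF-8 decodes correctly without any BOM.
decoded_plain = plain.decode("utf-8")

# Decoding BOM-prefixed bytes as plain 'utf-8' leaves U+FEFF in the text.
decoded_leaky = with_bom.decode("utf-8")

# The tolerant helper gives the same text either way.
decoded_a = decode_utf8(plain)
decoded_b = decode_utf8(with_bom)
```

So a reader that has to cope with Notepad's output can decode with
'utf-8-sig', and nobody needs to write the BOM in the first place.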

ChrisA