Groups > comp.lang.python > #64040 > unrolled thread

Guessing the encoding from a BOM

Started by: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
First post: 2014-01-16 02:13 +0000
Last post: 2014-01-16 12:50 -0600
Articles 10 — 7 participants



Contents

  Guessing the encoding from a BOM Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-16 02:13 +0000
    Re: Guessing the encoding from a BOM Ben Finney <ben+python@benfinney.id.au> - 2014-01-16 14:47 +1100
      Re: Guessing the encoding from a BOM Steven D'Aprano <steve@pearwood.info> - 2014-01-16 06:55 +0000
        Re: Guessing the encoding from a BOM Ethan Furman <ethan@stoneleaf.us> - 2014-01-15 23:29 -0800
    Re: Guessing the encoding from a BOM Chris Angelico <rosuav@gmail.com> - 2014-01-16 16:01 +1100
      Re: Guessing the encoding from a BOM Steven D'Aprano <steve@pearwood.info> - 2014-01-16 06:45 +0000
    Re: Guessing the encoding from a BOM Ethan Furman <ethan@stoneleaf.us> - 2014-01-15 21:40 -0800
    Re: Guessing the encoding from a BOM Björn Lindqvist <bjourne@gmail.com> - 2014-01-16 19:01 +0100
    Re: Guessing the encoding from a BOM Chris Angelico <rosuav@gmail.com> - 2014-01-17 05:06 +1100
    Re: Guessing the encoding from a BOM Tim Chase <python.list@tim.thechases.com> - 2014-01-16 12:50 -0600

#64040 — Guessing the encoding from a BOM

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-01-16 02:13 +0000
Subject: Guessing the encoding from a BOM
Message-ID: <52d74063$0$29970$c3e8da3$5496439d@news.astraweb.com>

I have a function which guesses the likely encoding used by text files by 
reading the BOM (byte order mark) at the beginning of the file. A 
simplified version:


def guess_encoding_from_bom(filename, default):
    with open(filename, 'rb') as f:
        sig = f.read(4)
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    else:
        return default


The idea is that you can call the function with a file name and a default 
encoding to return if one can't be guessed. I want to provide a default 
value for the default argument (a default default), but one which will 
unconditionally fail if you blindly go ahead and use it.

E.g. I want to either provide a default:

enc = guess_encoding_from_bom("filename", 'latin1')
f = open("filename", encoding=enc)


or I want to write:

enc = guess_encoding_from_bom("filename")
if enc == something:
     # Can't guess, fall back on an alternative strategy
     ...
else:
     f = open("filename", encoding=enc)


If I forget to check the returned result, I should get an explicit 
failure as soon as I try to use it, rather than silently returning the 
wrong results.

What should I return as the default default? I have four possibilities:

    (1) 'undefined', which is a standard encoding guaranteed to 
        raise an exception when used;

    (2) 'unknown', which best describes the result, and currently 
        there is no encoding with that name;

    (3) None, which is not the name of an encoding; or

    (4) Don't return anything, but raise an exception. (But 
        which exception?)


Apart from option (4), here are the exceptions you get from blindly using 
options (1) through (3):

py> 'abc'.encode('undefined')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.3/encodings/undefined.py", line 19, in encode
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding

py> 'abc'.encode('unknown')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: unknown

py> 'abc'.encode(None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encode() argument 1 must be str, not None


At the moment, I'm leaning towards option (1). Thoughts?



-- 
Steven



#64046

From: Ben Finney <ben+python@benfinney.id.au>
Date: 2014-01-16 14:47 +1100
Message-ID: <mailman.5566.1389844041.18130.python-list@python.org>
In reply to: #64040

Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:

> enc = guess_encoding_from_bom("filename")
> if enc == something:
>      # Can't guess, fall back on an alternative strategy
>      ...
> else:
>      f = open("filename", encoding=enc)
>
>
> If I forget to check the returned result, I should get an explicit
> failure as soon as I try to use it, rather than silently returning the
> wrong results.

Yes, agreed.

> What should I return as the default default? I have four possibilities:
>
>     (1) 'undefined', which is a standard encoding guaranteed to 
>         raise an exception when used;

+0.5. This describes the outcome of the guess.

>     (2) 'unknown', which best describes the result, and currently 
>         there is no encoding with that name;

+0. This *better* describes the outcome, but I don't think adding a new
name is needed nor very helpful.

>     (3) None, which is not the name of an encoding; or

−1. This is too much like a real result and doesn't adequately indicate
the failure.

>     (4) Don't return anything, but raise an exception. (But 
>         which exception?)

+1. I'd like a custom exception class, sub-classed from ValueError.
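A minimal sketch of what option (4) could look like. The exception name `EncodingGuessError`, the `_MISSING` sentinel and the function name `guess_encoding_from_sig` are made up for illustration and appear nowhere in the thread. The function takes the already-read signature bytes instead of a filename so the example stays self-contained, and it raises only when the caller supplied no default:

```python
class EncodingGuessError(ValueError):
    """No known BOM was found and no default was given (hypothetical name)."""


# Private sentinel, so that an explicit default=None is still a real default.
_MISSING = object()


def guess_encoding_from_sig(sig, default=_MISSING):
    """Guess an encoding from the first bytes of a file.

    `sig` is the result of f.read(4) on a file opened in 'rb' mode.
    """
    # The 4-byte UTF-32 BOMs are tested before the 2-byte UTF-16 ones,
    # because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    if default is _MISSING:
        raise EncodingGuessError('no recognised BOM in %r' % (sig,))
    return default
```

Callers who pass a default never see the exception; callers who omit it must handle the failure explicitly with `except EncodingGuessError` (or `except ValueError`).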

-- 
 \       “I love to go down to the schoolyard and watch all the little |
  `\   children jump up and down and run around yelling and screaming. |
_o__)             They don't know I'm only using blanks.” —Emo Philips |
Ben Finney



#64062

From: Steven D'Aprano <steve@pearwood.info>
Date: 2014-01-16 06:55 +0000
Message-ID: <52d78254$0$6599$c3e8da3$5496439d@news.astraweb.com>
In reply to: #64046

On Thu, 16 Jan 2014 14:47:00 +1100, Ben Finney wrote:

> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> 
>> enc = guess_encoding_from_bom("filename")
>> if enc == something:
>>      # Can't guess, fall back on an alternative strategy
>>      ...
>> else:
>>      f = open("filename", encoding=enc)
>>
>>
>> If I forget to check the returned result, I should get an explicit
>> failure as soon as I try to use it, rather than silently returning the
>> wrong results.
> 
> Yes, agreed.
> 
>> What should I return as the default default? I have four possibilities:
>>
>>     (1) 'undefined', which is a standard encoding guaranteed to
>>         raise an exception when used;
> 
> +0.5. This describes the outcome of the guess.
> 
>>     (2) 'unknown', which best describes the result, and currently
>>         there is no encoding with that name;
> 
> +0. This *better* describes the outcome, but I don't think adding a new
> name is needed nor very helpful.

And there is a chance -- albeit a small chance -- that someday the std 
lib will gain an encoding called "unknown".


>>     (4) Don't return anything, but raise an exception. (But
>>         which exception?)
> 
> +1. I'd like a custom exception class, sub-classed from ValueError.

Why ValueError? It's not really an "invalid value" error; it's more of a "my 
heuristic isn't good enough" failure. (Maybe the file starts with another 
sort of BOM which I don't know about.)

If I go with an exception, I'd choose RuntimeError, or a custom error 
that inherits directly from Exception.



Thanks to everyone for the feedback.



-- 
Steven



#64065

From: Ethan Furman <ethan@stoneleaf.us>
Date: 2014-01-15 23:29 -0800
Message-ID: <mailman.5576.1389858692.18130.python-list@python.org>
In reply to: #64062

On 01/15/2014 10:55 PM, Steven D'Aprano wrote:
> On Thu, 16 Jan 2014 14:47:00 +1100, Ben Finney wrote:
>>
>> +1. I'd like a custom exception class, sub-classed from ValueError.
>
> Why ValueError? It's not really an "invalid value" error; it's more of a "my
> heuristic isn't good enough" failure. (Maybe the file starts with another
> sort of BOM which I don't know about.)
>
> If I go with an exception, I'd choose RuntimeError, or a custom error
> that inherits directly from Exception.

From the docs [1]:
============================

     exception RuntimeError

         Raised when an error is detected that doesn’t fall in any
         of the other categories. The associated value is a string
         indicating what precisely went wrong.

It doesn't sound like RuntimeError is any more informative than Exception
or AssertionError, and to my mind at least it is usually close to
catastrophic in nature [2].

I'd say a ValueError subclass because, while not strictly an error, these
are values you don't know how to deal with. But either that or plain
Exception, just not RuntimeError.

--
~Ethan~


[1] http://docs.python.org/3/library/exceptions.html#RuntimeError
[2] verified by a (very) brief grep of the sources



#64051

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-01-16 16:01 +1100
Message-ID: <mailman.5568.1389848524.18130.python-list@python.org>
In reply to: #64040

On Thu, Jan 16, 2014 at 1:13 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>         return 'utf_16'
>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>         return 'utf_32'

I'd swap the order of these two checks. If the file starts FF FE 00
00, your code will guess that it's UTF-16 and begins with a U+0000.
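The shadowing can be reproduced on a byte string directly. This is only an illustrative sketch: the two function names are made up, and both work on the signature bytes rather than opening a file.

```python
def utf16_checked_first(sig):
    # Order used in the original post: 2-byte BOMs are tested first.
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    return None


def utf32_checked_first(sig):
    # Swapped order: the longer 4-byte BOMs are tested first.
    if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    elif sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    return None


# UTF-32 little-endian BOM followed by the single character 'A'.
data = b'\xFF\xFE\x00\x00' + b'A\x00\x00\x00'
sig = data[:4]

print(utf16_checked_first(sig))     # utf_16 (wrong guess)
print(utf32_checked_first(sig))     # utf_32
print(repr(data.decode('utf_16')))  # '\x00A\x00' (starts with U+0000)
print(repr(data.decode('utf_32')))  # 'A'
```

The swap cannot fix one case: a genuine UTF-16 LE file whose first character is U+0000 has the same four leading bytes, so it would now be reported as UTF-32.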

ChrisA



#64061

From: Steven D'Aprano <steve@pearwood.info>
Date: 2014-01-16 06:45 +0000
Message-ID: <52d78012$0$6599$c3e8da3$5496439d@news.astraweb.com>
In reply to: #64051

On Thu, 16 Jan 2014 16:01:56 +1100, Chris Angelico wrote:

> On Thu, Jan 16, 2014 at 1:13 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>>         return 'utf_16'
>>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>>         return 'utf_32'
> 
> I'd swap the order of these two checks. If the file starts FF FE 00 00,
> your code will guess that it's UTF-16 and begins with a U+0000.

Good catch, thank you.


-- 
Steven



#64056

From: Ethan Furman <ethan@stoneleaf.us>
Date: 2014-01-15 21:40 -0800
Message-ID: <mailman.5571.1389850819.18130.python-list@python.org>
In reply to: #64040

On 01/15/2014 07:47 PM, Ben Finney wrote:
> Steven D'Aprano writes:
>>
>>      (4) Don't return anything, but raise an exception. (But
>>          which exception?)
>
> +1. I'd like a custom exception class, sub-classed from ValueError.

+1

--
~Ethan~



#64093

From: Björn Lindqvist <bjourne@gmail.com>
Date: 2014-01-16 19:01 +0100
Message-ID: <mailman.5594.1389895319.18130.python-list@python.org>
In reply to: #64040

2014/1/16 Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
> def guess_encoding_from_bom(filename, default):
>     with open(filename, 'rb') as f:
>         sig = f.read(4)
>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>         return 'utf_16'
>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>         return 'utf_32'
>     else:
>         return default

You might want to add the UTF-8 BOM too: b'\xEF\xBB\xBF'.
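One way this could be sketched is with the BOM constants from the standard `codecs` module instead of hand-typed byte strings. The function name `sniff_bom` and the table `_BOMS` are hypothetical, and returning 'utf_8_sig' (the standard codec that strips a leading UTF-8 BOM) is a design choice of this sketch, not something settled in the thread:

```python
import codecs

# Longest BOMs first, so the UTF-32 LE BOM is not mistaken for UTF-16 LE.
_BOMS = (
    (codecs.BOM_UTF32_BE, 'utf_32'),
    (codecs.BOM_UTF32_LE, 'utf_32'),
    (codecs.BOM_UTF8, 'utf_8_sig'),
    (codecs.BOM_UTF16_BE, 'utf_16'),
    (codecs.BOM_UTF16_LE, 'utf_16'),
)


def sniff_bom(sig, default):
    """Return a codec name for the BOM at the start of `sig`, else `default`."""
    for bom, name in _BOMS:
        if sig.startswith(bom):
            return name
    return default
```

For example, `sniff_bom(b'\xEF\xBB\xBFabc', 'latin1')` returns 'utf_8_sig', and decoding those bytes with that codec yields 'abc' with the BOM removed.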

>     (4) Don't return anything, but raise an exception. (But
>         which exception?)

I like this option the most because it is the most "fail fast". If you
return 'undefined' the error might happen hours later or not at all in
some cases.
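A small sketch of the delay being described, using a deliberately simplified, hypothetical helper (`guess_or_undefined`, which only knows the UTF-16 BOMs). The call that fails to guess raises nothing; the error appears only when the returned name is finally used:

```python
def guess_or_undefined(sig):
    # Option (1): hand back the 'undefined' codec name when no BOM matches.
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    return 'undefined'


enc = guess_or_undefined(b'plain text')  # no error at this point
settings = {'encoding': enc}             # the value can travel a long way...

try:
    'abc'.encode(settings['encoding'])   # ...before anything actually fails
except UnicodeError as exc:
    failure = exc
else:
    failure = None

print(type(failure).__name__)
```

If the stored name is never used to encode or decode anything, no error is raised at all, which is the "or not at all" case.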


-- 
mvh/best regards Björn Lindqvist



#64095

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-01-17 05:06 +1100
Message-ID: <mailman.5595.1389895586.18130.python-list@python.org>
In reply to: #64040

On Fri, Jan 17, 2014 at 5:01 AM, Björn Lindqvist <bjourne@gmail.com> wrote:
> 2014/1/16 Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>> def guess_encoding_from_bom(filename, default):
>>     with open(filename, 'rb') as f:
>>         sig = f.read(4)
>>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>>         return 'utf_16'
>>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>>         return 'utf_32'
>>     else:
>>         return default
>
> You might want to add the utf8 bom too: '\xEF\xBB\xBF'.

I'd actually rather not. It would tempt people to pollute UTF-8 files
with a BOM, which is not necessary unless you are MS Notepad.

ChrisA



#64099

From: Tim Chase <python.list@tim.thechases.com>
Date: 2014-01-16 12:50 -0600
Message-ID: <mailman.5599.1389898160.18130.python-list@python.org>
In reply to: #64040

On 2014-01-17 05:06, Chris Angelico wrote:
> > You might want to add the utf8 bom too: '\xEF\xBB\xBF'.  
> 
> I'd actually rather not. It would tempt people to pollute UTF-8
> files with a BOM, which is not necessary unless you are MS Notepad.

If the intent is to just sniff and parse the file accordingly, I get
enough of these junk UTF-8 BOMs at $DAY_JOB that I've had to create
utility-openers much like Steven is doing here.  It's particularly
problematic for me in combination with csv.DictReader, where I go
looking for $COLUMN_NAME and get KeyError exceptions because it wants
me to ask for $UTF_BOM+$COLUMN_NAME for the first column.
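The csv.DictReader symptom is easy to reproduce in memory. The CSV content below is invented for the example; the point is only the difference between decoding with 'utf-8', which keeps the BOM as U+FEFF, and 'utf-8-sig', which strips it:

```python
import csv
import io

# Made-up CSV data that starts with a UTF-8 BOM.
raw = b'\xef\xbb\xbfname,age\r\nalice,30\r\n'

# Plain 'utf-8' leaves the BOM glued to the first column name.
with_bom = csv.DictReader(io.StringIO(raw.decode('utf-8'), newline=''))
# 'utf-8-sig' removes a leading BOM if one is present.
without_bom = csv.DictReader(io.StringIO(raw.decode('utf-8-sig'), newline=''))

print(with_bom.fieldnames)     # ['\ufeffname', 'age']
print(without_bom.fieldnames)  # ['name', 'age']

row = next(without_bom)
print(row['name'])             # alice
```

With the first reader, `row['name']` raises KeyError because the key is actually '\ufeffname'. When reading from disk, passing `encoding='utf-8-sig'` to `open()` has the same effect as the second decode.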

-tkc



