| Started by | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| First post | 2014-01-16 02:13 +0000 |
| Last post | 2014-01-16 12:50 -0600 |
| Articles | 10 — 7 participants |
Guessing the encoding from a BOM Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-16 02:13 +0000
Re: Guessing the encoding from a BOM Ben Finney <ben+python@benfinney.id.au> - 2014-01-16 14:47 +1100
Re: Guessing the encoding from a BOM Steven D'Aprano <steve@pearwood.info> - 2014-01-16 06:55 +0000
Re: Guessing the encoding from a BOM Ethan Furman <ethan@stoneleaf.us> - 2014-01-15 23:29 -0800
Re: Guessing the encoding from a BOM Chris Angelico <rosuav@gmail.com> - 2014-01-16 16:01 +1100
Re: Guessing the encoding from a BOM Steven D'Aprano <steve@pearwood.info> - 2014-01-16 06:45 +0000
Re: Guessing the encoding from a BOM Ethan Furman <ethan@stoneleaf.us> - 2014-01-15 21:40 -0800
Re: Guessing the encoding from a BOM Björn Lindqvist <bjourne@gmail.com> - 2014-01-16 19:01 +0100
Re: Guessing the encoding from a BOM Chris Angelico <rosuav@gmail.com> - 2014-01-17 05:06 +1100
Re: Guessing the encoding from a BOM Tim Chase <python.list@tim.thechases.com> - 2014-01-16 12:50 -0600
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-01-16 02:13 +0000 |
| Subject | Guessing the encoding from a BOM |
| Message-ID | <52d74063$0$29970$c3e8da3$5496439d@news.astraweb.com> |
I have a function which guesses the likely encoding used by text files by
reading the BOM (byte order mark) at the beginning of the file. A
simplified version:
def guess_encoding_from_bom(filename, default):
    with open(filename, 'rb') as f:
        sig = f.read(4)
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    else:
        return default
The idea is that you can call the function with a file name and a default
encoding to return if one can't be guessed. I want to provide a default
value for the default argument (a default default), but one which will
unconditionally fail if you blindly go ahead and use it.
E.g. I want to either provide a default:
enc = guess_encoding_from_bom("filename", 'latin1')
f = open("filename", encoding=enc)
or I want to write:
enc = guess_encoding_from_bom("filename")
if enc == something:
    # Can't guess, fall back on an alternative strategy
    ...
else:
    f = open("filename", encoding=enc)
If I forget to check the returned result, I should get an explicit
failure as soon as I try to use it, rather than silently returning the
wrong results.
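To make the "default default" idea concrete, here is a minimal sketch, assuming the standard `undefined` codec is used as the sentinel (only one of the candidates under discussion). The helper names `guess_from_signature` and `decode_blindly` are hypothetical, and the helper takes the already-read bytes rather than a file name so that it runs without touching the file system.

```python
def guess_from_signature(sig, default='undefined'):
    """Guess an encoding from the first bytes of a file (hypothetical helper)."""
    # UTF-32 is tested first because FF FE 00 00 also begins with the
    # UTF-16 little-endian BOM.
    if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    # Assumption: 'undefined' is the "default default". It is a real codec
    # name, but every attempt to encode or decode with it raises UnicodeError.
    return default


def decode_blindly(data):
    """Use the guess without checking it, as a forgetful caller might."""
    return data.decode(guess_from_signature(data[:4]))
```

With this sketch, `decode_blindly(b'plain text')` raises `UnicodeError` instead of returning mis-decoded text, while passing an explicit default such as `'latin1'` restores the permissive behaviour.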
What should I return as the default default? I have four possibilities:
(1) 'undefined', which is a standard encoding guaranteed to
raise an exception when used;
(2) 'unknown', which best describes the result, and currently
there is no encoding with that name;
(3) None, which is not the name of an encoding; or
(4) Don't return anything, but raise an exception. (But
which exception?)
Apart from option (4), here are the exceptions you get from blindly using
options (1) through (3):
py> 'abc'.encode('undefined')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.3/encodings/undefined.py", line 19, in encode
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding
py> 'abc'.encode('unknown')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: unknown
py> 'abc'.encode(None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encode() argument 1 must be str, not None
At the moment, I'm leaning towards option (1). Thoughts?
--
Steven
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2014-01-16 14:47 +1100 |
| Message-ID | <mailman.5566.1389844041.18130.python-list@python.org> |
| In reply to | #64040 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> enc = guess_encoding_from_bom("filename")
> if enc == something:
>     # Can't guess, fall back on an alternative strategy
>     ...
> else:
>     f = open("filename", encoding=enc)
>
>
> If I forget to check the returned result, I should get an explicit
> failure as soon as I try to use it, rather than silently returning the
> wrong results.
Yes, agreed.
> What should I return as the default default? I have four possibilities:
>
> (1) 'undefined', which is a standard encoding guaranteed to
> raise an exception when used;
+0.5. This describes the outcome of the guess.
> (2) 'unknown', which best describes the result, and currently
> there is no encoding with that name;
+0. This *better* describes the outcome, but I don't think adding a new
name is needed nor very helpful.
> (3) None, which is not the name of an encoding; or
−1. This is too much like a real result and doesn't adequately indicate
the failure.
> (4) Don't return anything, but raise an exception. (But
> which exception?)
+1. I'd like a custom exception class, sub-classed from ValueError.
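Option (4) with a `ValueError` subclass could look roughly like the sketch below. The class name `EncodingGuessError` and the wrapper `read_text` are hypothetical, chosen only for illustration; the thread does not settle on a name or on the base class.

```python
import os
import tempfile


class EncodingGuessError(ValueError):
    """Raised when no known BOM is found (hypothetical name)."""


def guess_encoding_from_bom(filename):
    with open(filename, 'rb') as f:
        sig = f.read(4)
    # UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    # No return value to forget to check: the failure is immediate.
    raise EncodingGuessError('no recognised BOM in %r' % filename)


def read_text(filename, fallback='latin1'):
    """Caller-side fallback strategy (assumed here to be a fixed encoding)."""
    try:
        enc = guess_encoding_from_bom(filename)
    except EncodingGuessError:
        enc = fallback
    with open(filename, encoding=enc) as f:
        return f.read()
```

Because the guess fails at the call site, a caller who forgets the `try` block sees the error before any file is opened with a wrong encoding.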
--
\ “I love to go down to the schoolyard and watch all the little |
`\ children jump up and down and run around yelling and screaming. |
_o__) They don't know I'm only using blanks.” —Emo Philips |
Ben Finney
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2014-01-16 06:55 +0000 |
| Message-ID | <52d78254$0$6599$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #64046 |
On Thu, 16 Jan 2014 14:47:00 +1100, Ben Finney wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>
>> enc = guess_encoding_from_bom("filename")
>> if enc == something:
>>     # Can't guess, fall back on an alternative strategy
>>     ...
>> else:
>>     f = open("filename", encoding=enc)
>>
>>
>> If I forget to check the returned result, I should get an explicit
>> failure as soon as I try to use it, rather than silently returning the
>> wrong results.
>
> Yes, agreed.
>
>> What should I return as the default default? I have four possibilities:
>>
>> (1) 'undefined', which is a standard encoding guaranteed to
>> raise an exception when used;
>
> +0.5. This describes the outcome of the guess.
>
>> (2) 'unknown', which best describes the result, and currently
>> there is no encoding with that name;
>
> +0. This *better* describes the outcome, but I don't think adding a new
> name is needed nor very helpful.
And there is a chance -- albeit a small chance -- that someday the std
lib will gain an encoding called "unknown".
>> (4) Don't return anything, but raise an exception. (But
>> which exception?)
>
> +1. I'd like a custom exception class, sub-classed from ValueError.
Why ValueError? It's not really an "invalid value" error, it's more of a "my
heuristic isn't good enough" failure. (Maybe the file starts with another
sort of BOM which I don't know about.)
If I go with an exception, I'd choose RuntimeError, or a custom error
that inherits directly from Exception.
Thanks to everyone for the feedback.
--
Steven
| From | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| Date | 2014-01-15 23:29 -0800 |
| Message-ID | <mailman.5576.1389858692.18130.python-list@python.org> |
| In reply to | #64062 |
On 01/15/2014 10:55 PM, Steven D'Aprano wrote:
> On Thu, 16 Jan 2014 14:47:00 +1100, Ben Finney wrote:
>>
>> +1. I'd like a custom exception class, sub-classed from ValueError.
>
> Why ValueError? It's not really an "invalid value" error, it's more of a "my
> heuristic isn't good enough" failure. (Maybe the file starts with another
> sort of BOM which I don't know about.)
>
> If I go with an exception, I'd choose RuntimeError, or a custom error
> that inherits directly from Exception.
From the docs [1]:
============================
exception RuntimeError
Raised when an error is detected that doesn’t fall in any
of the other categories. The associated value is a string
indicating what precisely went wrong.
It doesn't sound like RuntimeError is any more informative than Exception or AssertionError, and to my mind at least is
usually close to catastrophic in nature [2].
I'd say a ValueError subclass because, while not strictly an error, it is values you don't know how to deal with.
But either that or plain Exception, just not RuntimeError.
--
~Ethan~
[1] http://docs.python.org/3/library/exceptions.html#RuntimeError
[2] verified by a (very) brief grep of the sources
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-16 16:01 +1100 |
| Message-ID | <mailman.5568.1389848524.18130.python-list@python.org> |
| In reply to | #64040 |
On Thu, Jan 16, 2014 at 1:13 PM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>         return 'utf_16'
>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>         return 'utf_32'
I'd swap the order of these two checks. If the file starts FF FE 00 00,
your code will guess that it's UTF-16 and begins with a U+0000.
ChrisA
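The overlap can be reproduced with a short sketch. It uses the BOM constants from the standard `codecs` module; the two function names are hypothetical and exist only to contrast the two orderings.

```python
import codecs


def sniff_utf16_first(sig):
    """Order from the original post: UTF-16 is tested before UTF-32."""
    if sig.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return 'utf_16'
    elif sig.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return 'utf_32'
    return None


def sniff_utf32_first(sig):
    """Swapped order: the longer UTF-32 signatures are tested first."""
    if sig.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return 'utf_32'
    elif sig.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return 'utf_16'
    return None


# A little-endian UTF-32 file: BOM FF FE 00 00 followed by the text.
sample = codecs.BOM_UTF32_LE + 'abc'.encode('utf_32_le')
wrong = sniff_utf16_first(sample[:4])   # misreads the file as UTF-16
right = sniff_utf32_first(sample[:4])   # identifies UTF-32
```

Only the little-endian case collides, because `BOM_UTF32_LE` begins with the two bytes of `BOM_UTF16_LE`; the big-endian UTF-32 BOM starts with 00 00 and matches neither UTF-16 signature.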
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2014-01-16 06:45 +0000 |
| Message-ID | <52d78012$0$6599$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #64051 |
On Thu, 16 Jan 2014 16:01:56 +1100, Chris Angelico wrote:
> On Thu, Jan 16, 2014 at 1:13 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>>         return 'utf_16'
>>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>>         return 'utf_32'
>
> I'd swap the order of these two checks. If the file starts FF FE 00 00,
> your code will guess that it's UTF-16 and begins with a U+0000.
Good catch, thank you.
--
Steven
| From | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| Date | 2014-01-15 21:40 -0800 |
| Message-ID | <mailman.5571.1389850819.18130.python-list@python.org> |
| In reply to | #64040 |
On 01/15/2014 07:47 PM, Ben Finney wrote:
> Steven D'Aprano writes:
>>
>> (4) Don't return anything, but raise an exception. (But
>> which exception?)
>
> +1. I'd like a custom exception class, sub-classed from ValueError.
+1
--
~Ethan~
| From | Björn Lindqvist <bjourne@gmail.com> |
|---|---|
| Date | 2014-01-16 19:01 +0100 |
| Message-ID | <mailman.5594.1389895319.18130.python-list@python.org> |
| In reply to | #64040 |
2014/1/16 Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
> def guess_encoding_from_bom(filename, default):
>     with open(filename, 'rb') as f:
>         sig = f.read(4)
>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>         return 'utf_16'
>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>         return 'utf_32'
>     else:
>         return default
You might want to add the utf8 bom too: '\xEF\xBB\xBF'.
> (4) Don't return anything, but raise an exception. (But
> which exception?)
I like this option the most because it is the most "fail fast". If you
return 'undefined' the error might happen hours later or not at all in
some cases.
--
mvh/best regards Björn Lindqvist
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-17 05:06 +1100 |
| Message-ID | <mailman.5595.1389895586.18130.python-list@python.org> |
| In reply to | #64040 |
On Fri, Jan 17, 2014 at 5:01 AM, Björn Lindqvist <bjourne@gmail.com> wrote:
> 2014/1/16 Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>> def guess_encoding_from_bom(filename, default):
>>     with open(filename, 'rb') as f:
>>         sig = f.read(4)
>>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>>         return 'utf_16'
>>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>>         return 'utf_32'
>>     else:
>>         return default
>
> You might want to add the utf8 bom too: '\xEF\xBB\xBF'.
I'd actually rather not. It would tempt people to pollute UTF-8 files
with a BOM, which is not necessary unless you are MS Notepad.
ChrisA
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2014-01-16 12:50 -0600 |
| Message-ID | <mailman.5599.1389898160.18130.python-list@python.org> |
| In reply to | #64040 |
On 2014-01-17 05:06, Chris Angelico wrote:
> > You might want to add the utf8 bom too: '\xEF\xBB\xBF'.
>
> I'd actually rather not. It would tempt people to pollute UTF-8
> files with a BOM, which is not necessary unless you are MS Notepad.
If the intent is to just sniff and parse the file accordingly, I get
enough of these junk UTF-8 BOMs at $DAY_JOB that I've had to create
utility-openers much like Steven is doing here. It's particularly
problematic for me in combination with csv.DictReader, where I go
looking for $COLUMN_NAME and get KeyError exceptions because it wants me
to ask for $UTF_BOM+$COLUMN_NAME for the first column.
-tkc
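The csv.DictReader symptom is easy to reproduce in memory. The sketch below is not the utility-opener from the post (its code is not shown in the thread); it only assumes that the standard `utf-8-sig` codec, which drops a leading UTF-8 BOM while decoding, is an acceptable way to read such files. The sample data and the helper name `first_row` are made up for illustration.

```python
import codecs
import csv
import io

# Made-up sample: a CSV file saved with a UTF-8 BOM in front of the header.
raw = codecs.BOM_UTF8 + b'name,age\r\nAda,36\r\n'


def first_row(data, encoding):
    """Decode the bytes and return the first data row as a mapping."""
    text = io.StringIO(data.decode(encoding), newline='')
    return next(csv.DictReader(text))


with_bom = first_row(raw, 'utf-8')         # first key is '\ufeffname'
without_bom = first_row(raw, 'utf-8-sig')  # first key is 'name'
```

With plain `utf-8`, looking up `with_bom['name']` raises `KeyError` because the BOM has become part of the first column name; with `utf-8-sig` the same lookup succeeds.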