Groups > comp.lang.python > #64102 > unrolled thread
| Started by | Albert-Jan Roskam <fomcl@yahoo.com> |
|---|---|
| First post | 2014-01-16 11:37 -0800 |
| Last post | 2014-01-17 01:18 +0000 |
| Articles | 2 — 2 participants |
Re: Guessing the encoding from a BOM Albert-Jan Roskam <fomcl@yahoo.com> - 2014-01-16 11:37 -0800
Re: Guessing the encoding from a BOM Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-17 01:18 +0000
| From | Albert-Jan Roskam <fomcl@yahoo.com> |
|---|---|
| Date | 2014-01-16 11:37 -0800 |
| Subject | Re: Guessing the encoding from a BOM |
| Message-ID | <mailman.5601.1389901239.18130.python-list@python.org> |
--------------------------------------------
On Thu, 1/16/14, Chris Angelico <rosuav@gmail.com> wrote:

 Subject: Re: Guessing the encoding from a BOM
 To:
 Cc: "python-list@python.org" <python-list@python.org>
 Date: Thursday, January 16, 2014, 7:06 PM

 On Fri, Jan 17, 2014 at 5:01 AM, Björn Lindqvist <bjourne@gmail.com> wrote:
 > 2014/1/16 Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
 >> def guess_encoding_from_bom(filename, default):
 >>     with open(filename, 'rb') as f:
 >>         sig = f.read(4)
 >>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
 >>         return 'utf_16'
 >>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
 >>         return 'utf_32'
 >>     else:
 >>         return default
 >
 > You might want to add the utf8 bom too: '\xEF\xBB\xBF'.

 I'd actually rather not. It would tempt people to pollute UTF-8 files
 with a BOM, which is not necessary unless you are MS Notepad.

===> Can you elaborate on that? Unless your utf-8 files will only contain ascii characters I do not understand why you would not want a bom utf-8.

Btw, isn't "read_encoding_from_bom" a better function name than "guess_encoding_from_bom"? I thought the point of BOMs was that there would be no more need to guess?

Thanks!

Albert-Jan
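Editorial sketch, not part of the original post: one thing worth noticing in the quoted function is that the UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE), so testing for UTF-16 first would likely report a UTF-32-LE file as 'utf_16'. The variant below checks the longer signatures first. The name `detect_bom`, the choice to take bytes rather than a filename, and the optional UTF-8 entry (Björn's suggestion, which Chris argues against) are all assumptions made for illustration.

```python
import codecs

# Hypothetical table, not from the thread. Order matters: the four-byte
# UTF-32 signatures are tested before the two-byte UTF-16 ones, because
# BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
_SIGNATURES = (
    (codecs.BOM_UTF32_BE, 'utf_32'),
    (codecs.BOM_UTF32_LE, 'utf_32'),
    (codecs.BOM_UTF8, 'utf_8_sig'),   # optional; drop this entry to ignore the UTF-8 signature
    (codecs.BOM_UTF16_BE, 'utf_16'),
    (codecs.BOM_UTF16_LE, 'utf_16'),
)


def detect_bom(data, default):
    """Return a codec name guessed from the leading bytes of data, else default."""
    for signature, encoding in _SIGNATURES:
        if data.startswith(signature):
            return encoding
    return default


def guess_encoding_from_bom(filename, default):
    """Same interface as the quoted function, reading only the first four bytes."""
    with open(filename, 'rb') as f:
        return detect_bom(f.read(4), default)
```

The returned names ('utf_32', 'utf_16', 'utf_8_sig') are standard Python codecs that consume the signature themselves when decoding, so the result can be passed straight to `open(filename, encoding=...)`.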
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-01-17 01:18 +0000 |
| Message-ID | <52d884ee$0$29999$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #64102 |
On Thu, 16 Jan 2014 11:37:29 -0800, Albert-Jan Roskam wrote:

> --------------------------------------------
> On Thu, 1/16/14, Chris Angelico <rosuav@gmail.com> wrote:
>
>  Subject: Re: Guessing the encoding from a BOM
>  To:
>  Cc: "python-list@python.org" <python-list@python.org>
>  Date: Thursday, January 16, 2014, 7:06 PM
>
>  On Fri, Jan 17, 2014 at 5:01 AM, Björn Lindqvist <bjourne@gmail.com> wrote:
>  > 2014/1/16 Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>  >> def guess_encoding_from_bom(filename, default):
>  >>     with open(filename, 'rb') as f:
>  >>         sig = f.read(4)
>  >>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>  >>         return 'utf_16'
>  >>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>  >>         return 'utf_32'
>  >>     else:
>  >>         return default
>  >
>  > You might want to add the utf8 bom too: '\xEF\xBB\xBF'.
>
>  I'd actually rather not. It would tempt people to pollute UTF-8 files
>  with a BOM, which is not necessary unless you are MS Notepad.
>
> ===> Can you elaborate on that? Unless your utf-8 files will only
> contain ascii characters I do not understand why you would not want a
> bom utf-8.

Because the UTF-8 signature -- it's not actually a Byte Order Mark -- is not really necessary. Unlike UTF-16 and UTF-32, there is no platform dependent ambiguity between Big Endian and Little Endian systems, so the UTF-8 stream of bytes is identical no matter what platform you are on.

If the UTF-8 signature was just unnecessary, it wouldn't be too bad, but it's actually harmful. Pure-ASCII text encoded as UTF-8 is still pure ASCII, and so backwards compatible with old software that assumes ASCII. But the same pure-ASCII text encoded as UTF-8 with a signature looks like a binary file.

> Btw, isn't "read_encoding_from_bom" a better function name than
> "guess_encoding_from_bom"? I thought the point of BOMs was that there
> would be no more need to guess?

Of course it's a guess. If you see a file that starts with 0000FEFF, is that a UTF-32 text file, or a binary file that happens to start with two nulls followed by FEFF?

--
Steven
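Editorial sketch, not part of the original post: the points above can be checked with Python's standard codecs. The sample string and the helper `is_ascii` are assumptions made for illustration. The last part shows a related ambiguity rather than the text-versus-binary one described in the post: the same four bytes decode without error under both 'utf_16' and 'utf_32', which is one more reason the result is only a guess.

```python
text = 'hello'  # arbitrary pure-ASCII sample


def is_ascii(data):
    """Hypothetical helper: True if the bytes decode cleanly as ASCII."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError:
        return False
    return True


# Pure-ASCII text encoded as UTF-8 is byte-for-byte ASCII.
plain = text.encode('utf_8')
print(plain == text.encode('ascii'), is_ascii(plain))

# The 'utf_8_sig' codec prepends the signature EF BB BF, so software that
# assumes ASCII can no longer read the result.
signed = text.encode('utf_8_sig')
print(signed[:3], is_ascii(signed))

# UTF-16 depends on byte order; UTF-8 has only one byte sequence.
little = text.encode('utf_16_le')
big = text.encode('utf_16_be')
print(little != big)

# The same bytes are valid under two different BOM interpretations.
ambiguous = b'\xff\xfe\x00\x00'
as_utf16 = ambiguous.decode('utf_16')  # BOM FF FE, then one NUL character
as_utf32 = ambiguous.decode('utf_32')  # BOM FF FE 00 00, then nothing
print(repr(as_utf16), repr(as_utf32))
```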