Re: Guessing the encoding from a BOM

Started by	Tim Chase <tim@thechases.com>
First post	2014-01-16 19:40 -0600
Last post	2014-01-18 03:33 +1100
Articles	6 — 4 participants

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: Guessing the encoding from a BOM Tim Chase <tim@thechases.com> - 2014-01-16 19:40 -0600
    Re: Guessing the encoding from a BOM Rustom Mody <rustompmody@gmail.com> - 2014-01-16 21:08 -0800
      Re: Guessing the encoding from a BOM Pete Forman <petef4+usenet@gmail.com> - 2014-01-17 16:26 +0000
        Re: Guessing the encoding from a BOM Rustom Mody <rustompmody@gmail.com> - 2014-01-17 08:30 -0800
          Re: Guessing the encoding from a BOM Chris Angelico <rosuav@gmail.com> - 2014-01-18 03:50 +1100
        Re: Guessing the encoding from a BOM Chris Angelico <rosuav@gmail.com> - 2014-01-18 03:33 +1100

#64131 — Re: Guessing the encoding from a BOM

From	Tim Chase <tim@thechases.com>
Date	2014-01-16 19:40 -0600
Subject	Re: Guessing the encoding from a BOM
Message-ID	<mailman.5618.1389922759.18130.python-list@python.org>

On 2014-01-17 11:14, Chris Angelico wrote:
> UTF-8 specifies the byte order
> as part of the protocol, so you don't need to mark it.

You don't need to mark it when writing, but some idiots use it
anyway.  If you're sniffing a file for purposes of reading, you need
to look for it and remove it from the actual data that gets returned
from the file--otherwise, your data can see it as corruption.  I end
up with lots of CSV files from customers who have polluted it with
Notepad or had Excel insert some UTF-8 BOM when exporting.  This
means my first column-name gets the BOM prefixed onto it when the
file is passed to csv.DictReader, grr.

-tkc
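
A minimal sketch of the csv.DictReader problem described above, and of one common way around it: decoding with Python's "utf-8-sig" codec, which drops a leading UTF-8 BOM if one is present. The sample bytes and the helper name first_fieldname are made up for illustration; they are not from the original post.

```python
import csv
import io

# Hypothetical sample: bytes as an editor or spreadsheet export might
# save them, i.e. a UTF-8 BOM (EF BB BF) followed by ordinary CSV text.
raw = b"\xef\xbb\xbfname,city\r\nAlice,Paris\r\n"


def first_fieldname(data, encoding):
    """Decode data with the given codec and return the first CSV column name."""
    text = io.StringIO(data.decode(encoding), newline="")
    return csv.DictReader(text).fieldnames[0]


# Plain "utf-8" keeps the BOM as U+FEFF, glued onto the first column name.
plain = first_fieldname(raw, "utf-8")

# "utf-8-sig" strips a leading BOM when there is one.
clean = first_fieldname(raw, "utf-8-sig")

print(repr(plain))
print(repr(clean))
```

With a real file the same idea would be open(path, newline="", encoding="utf-8-sig"). That codec also reads BOM-less UTF-8 unchanged, so it should be safe for files from mixed sources.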

#64139

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-01-16 21:08 -0800
Message-ID	<32c1b684-1ff7-48c0-af7a-cd15235ea531@googlegroups.com>
In reply to	#64131

On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
> On 2014-01-17 11:14, Chris Angelico wrote:
> > UTF-8 specifies the byte order
> > as part of the protocol, so you don't need to mark it.

> You don't need to mark it when writing, but some idiots use it
> anyway.  If you're sniffing a file for purposes of reading, you need
> to look for it and remove it from the actual data that gets returned
> from the file--otherwise, your data can see it as corruption.  I end
> up with lots of CSV files from customers who have polluted it with
> Notepad or had Excel insert some UTF-8 BOM when exporting.  This
> means my first column-name gets the BOM prefixed onto it when the
> file is passed to csv.DictReader, grr.

And it's part of the standard:
Table 2.4 here
http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
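
For the thread's subject, guessing an encoding from a BOM, here is a rough sketch using the BOM constants in Python's codecs module. The function name guess_encoding_from_bom and the "utf-8" fallback are assumptions for illustration, not something the thread specifies. A BOM is only a hint: a file without one could still be in any encoding.

```python
import codecs

# Check the longer signatures first: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(head, default="utf-8"):
    """Return a Python codec name based on the BOM at the start of head.

    The names returned ("utf-8-sig", "utf-16", "utf-32") are codecs that
    consume the BOM while decoding, so it does not end up in the text.
    The default is a hypothetical fallback for data with no BOM.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding = guess_encoding_from_bom(sample)
print(encoding, repr(sample.decode(encoding)))
```

One known ambiguity: UTF-16-LE text that starts with U+0000 right after the BOM has the same first four bytes as a UTF-32-LE BOM, so this sketch would guess "utf-32" for it.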

#64175

From	Pete Forman <petef4+usenet@gmail.com>
Date	2014-01-17 16:26 +0000
Message-ID	<86zjmufubv.fsf@gmail.com>
In reply to	#64139

Rustom Mody <rustompmody@gmail.com> writes:

> On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
>> On 2014-01-17 11:14, Chris Angelico wrote:
>> > UTF-8 specifies the byte order
>> > as part of the protocol, so you don't need to mark it.
>
>> You don't need to mark it when writing, but some idiots use it
>> anyway.  If you're sniffing a file for purposes of reading, you need
>> to look for it and remove it from the actual data that gets returned
>> from the file--otherwise, your data can see it as corruption.  I end
>> up with lots of CSV files from customers who have polluted it with
>> Notepad or had Excel insert some UTF-8 BOM when exporting.  This
>> means my first column-name gets the BOM prefixed onto it when the
>> file is passed to csv.DictReader, grr.
>
> And it's part of the standard:
> Table 2.4 here
> http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

It would have been nice if there was an eighth encoding scheme defined
there UTF-8NB which would be UTF-8 with BOM not allowed.
-- 
Pete Forman

#64176

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-01-17 08:30 -0800
Message-ID	<edc1acb8-be97-43c9-819b-3af18086d5d0@googlegroups.com>
In reply to	#64175

On Friday, January 17, 2014 9:56:28 PM UTC+5:30, Pete Forman wrote:
> Rustom Mody writes:

> > On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
> >> On 2014-01-17 11:14, Chris Angelico wrote:
> >> > UTF-8 specifies the byte order
> >> > as part of the protocol, so you don't need to mark it.
> >> You don't need to mark it when writing, but some idiots use it
> >> anyway.  If you're sniffing a file for purposes of reading, you need
> >> to look for it and remove it from the actual data that gets returned
> >> from the file--otherwise, your data can see it as corruption.  I end
> >> up with lots of CSV files from customers who have polluted it with
> >> Notepad or had Excel insert some UTF-8 BOM when exporting.  This
> >> means my first column-name gets the BOM prefixed onto it when the
> >> file is passed to csv.DictReader, grr.
> > And it's part of the standard:
> > Table 2.4 here
> > http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

> It would have been nice if there was an eighth encoding scheme defined
> there UTF-8NB which would be UTF-8 with BOM not allowed.

If you or I break a standard then, well, we broke a standard.
If Microsoft breaks a standard the standard is obliged to change.

Or as the saying goes, everyone is equal though some are more equal.

#64180

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-18 03:50 +1100
Message-ID	<mailman.5649.1389977412.18130.python-list@python.org>
In reply to	#64176

On Sat, Jan 18, 2014 at 3:30 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> If you or I break a standard then, well, we broke a standard.
> If Microsoft breaks a standard the standard is obliged to change.
>
> Or as the saying goes, everyone is equal though some are more equal.

https://en.wikipedia.org/wiki/800_pound_gorilla

Though Microsoft has been losing weight over the past decade or so,
just as IBM before them had done (there was a time when IBM was *the*
800lb gorilla, pretty much, but definitely not now). In Unix/POSIX
contexts, Linux might be playing that role - I've seen code that
unwittingly assumes Linux more often than, say, assuming FreeBSD - but
I haven't seen a huge amount of "the standard has to change, Linux
does it differently", possibly because the areas of Linux-assumption
are areas that aren't standardized anyway (eg additional socket
options beyond the spec).

The one area where industry leaders still heavily dictate to standards
is the web. Fortunately, it usually still results in usable standards
documents that HTML authors can rely on. Usually. *twiddles fingers*

ChrisA

#64177

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-18 03:33 +1100
Message-ID	<mailman.5648.1389976433.18130.python-list@python.org>
In reply to	#64175

On Sat, Jan 18, 2014 at 3:26 AM, Pete Forman <petef4+usenet@gmail.com> wrote:
> It would have been nice if there was an eighth encoding scheme defined
> there UTF-8NB which would be UTF-8 with BOM not allowed.

Or call that one UTF-8, and the one with the marker can be UTF-8-MS-NOTEPAD.

ChrisA
