Groups > comp.lang.python > #64131 > unrolled thread

Re: Guessing the encoding from a BOM

Started by: Tim Chase <tim@thechases.com>
First post: 2014-01-16 19:40 -0600
Last post: 2014-01-18 03:33 +1100
Articles: 6 (4 participants)

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: Guessing the encoding from a BOM Tim Chase <tim@thechases.com> - 2014-01-16 19:40 -0600
    Re: Guessing the encoding from a BOM Rustom Mody <rustompmody@gmail.com> - 2014-01-16 21:08 -0800
      Re: Guessing the encoding from a BOM Pete Forman <petef4+usenet@gmail.com> - 2014-01-17 16:26 +0000
        Re: Guessing the encoding from a BOM Rustom Mody <rustompmody@gmail.com> - 2014-01-17 08:30 -0800
          Re: Guessing the encoding from a BOM Chris Angelico <rosuav@gmail.com> - 2014-01-18 03:50 +1100
        Re: Guessing the encoding from a BOM Chris Angelico <rosuav@gmail.com> - 2014-01-18 03:33 +1100

#64131 — Re: Guessing the encoding from a BOM

From: Tim Chase <tim@thechases.com>
Date: 2014-01-16 19:40 -0600
Subject: Re: Guessing the encoding from a BOM
Message-ID: <mailman.5618.1389922759.18130.python-list@python.org>
On 2014-01-17 11:14, Chris Angelico wrote:
> UTF-8 specifies the byte order
> as part of the protocol, so you don't need to mark it.

You don't need to mark it when writing, but some idiots use it
anyway.  If you're sniffing a file for purposes of reading, you need
to look for it and remove it from the actual data that gets returned
from the file--otherwise, your data can see it as corruption.  I end
up with lots of CSV files from customers who have polluted it with
Notepad or had Excel insert some UTF-8 BOM when exporting.  This
means my first column-name gets the BOM prefixed onto it when the
file is passed to csv.DictReader, grr.
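
A minimal sketch of one way around this in Python 3, not taken from the thread: the standard `utf-8-sig` codec drops a leading UTF-8 BOM when decoding and leaves BOM-free input alone. The sample bytes and column names below are made up for illustration.

```python
import csv
import io

# Bytes as Notepad or Excel might save them: UTF-8 BOM, then the CSV text.
# (Hypothetical sample data.)
raw = b"\xef\xbb\xbfname,qty\r\nwidget,3\r\n"

# Plain "utf-8" keeps the BOM as U+FEFF glued onto the first column name.
plain = csv.DictReader(io.StringIO(raw.decode("utf-8"), newline=""))
print(plain.fieldnames)

# "utf-8-sig" strips a leading BOM if one is present.
clean = csv.DictReader(io.StringIO(raw.decode("utf-8-sig"), newline=""))
print(clean.fieldnames)
```

For a file on disk, the same idea would be `open(path, encoding="utf-8-sig", newline="")` handed to csv.DictReader, where `path` is a placeholder.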

-tkc



#64139

From: Rustom Mody <rustompmody@gmail.com>
Date: 2014-01-16 21:08 -0800
Message-ID: <32c1b684-1ff7-48c0-af7a-cd15235ea531@googlegroups.com>
In reply to: #64131
On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
> On 2014-01-17 11:14, Chris Angelico wrote:
> > UTF-8 specifies the byte order
> > as part of the protocol, so you don't need to mark it.

> You don't need to mark it when writing, but some idiots use it
> anyway.  If you're sniffing a file for purposes of reading, you need
> to look for it and remove it from the actual data that gets returned
> from the file--otherwise, your data can see it as corruption.  I end
> up with lots of CSV files from customers who have polluted it with
> Notepad or had Excel insert some UTF-8 BOM when exporting.  This
> means my first column-name gets the BOM prefixed onto it when the
> file is passed to csv.DictReader, grr.

And it's part of the standard:
Table 2.4 here
http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
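
For the sniffing side of the thread's subject, a rough sketch (an illustration, not code from any poster) that maps a leading BOM to a Python codec name using the constants in the standard `codecs` module. The function name `guess_encoding_from_bom` and the `default` fallback are assumptions; with no BOM there is nothing to go on, so the fallback is only a guess.

```python
import codecs

# Check longer signatures first: the UTF-32-LE BOM begins with the
# same two bytes as the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def guess_encoding_from_bom(head, default="utf-8"):
    """Return a codec name for the BOM at the start of `head`, else `default`.

    The returned codecs ("utf-8-sig", "utf-16", "utf-32") all consume
    the BOM themselves when decoding.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


sample = "héllo".encode("utf-16")  # Python's "utf-16" encoder writes a BOM
print(sample.decode(guess_encoding_from_bom(sample[:4])))
```

Treat the result as a hint only: a UTF-16-LE file whose first character after the BOM is U+0000 would be mistaken for UTF-32-LE.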


#64175

From: Pete Forman <petef4+usenet@gmail.com>
Date: 2014-01-17 16:26 +0000
Message-ID: <86zjmufubv.fsf@gmail.com>
In reply to: #64139
Rustom Mody <rustompmody@gmail.com> writes:

> On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
>> On 2014-01-17 11:14, Chris Angelico wrote:
>> > UTF-8 specifies the byte order
>> > as part of the protocol, so you don't need to mark it.
>
>> You don't need to mark it when writing, but some idiots use it
>> anyway.  If you're sniffing a file for purposes of reading, you need
>> to look for it and remove it from the actual data that gets returned
>> from the file--otherwise, your data can see it as corruption.  I end
>> up with lots of CSV files from customers who have polluted it with
>> Notepad or had Excel insert some UTF-8 BOM when exporting.  This
>> means my first column-name gets the BOM prefixed onto it when the
>> file is passed to csv.DictReader, grr.
>
> And it's part of the standard:
> Table 2.4 here
> http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

It would have been nice if there was an eighth encoding scheme defined
there UTF-8NB which would be UTF-8 with BOM not allowed.
-- 
Pete Forman


#64176

From: Rustom Mody <rustompmody@gmail.com>
Date: 2014-01-17 08:30 -0800
Message-ID: <edc1acb8-be97-43c9-819b-3af18086d5d0@googlegroups.com>
In reply to: #64175
On Friday, January 17, 2014 9:56:28 PM UTC+5:30, Pete Forman wrote:
> Rustom Mody  writes:

> > On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
> >> On 2014-01-17 11:14, Chris Angelico wrote:
> >> > UTF-8 specifies the byte order
> >> > as part of the protocol, so you don't need to mark it.
> >> You don't need to mark it when writing, but some idiots use it
> >> anyway.  If you're sniffing a file for purposes of reading, you need
> >> to look for it and remove it from the actual data that gets returned
> >> from the file--otherwise, your data can see it as corruption.  I end
> >> up with lots of CSV files from customers who have polluted it with
> >> Notepad or had Excel insert some UTF-8 BOM when exporting.  This
> >> means my first column-name gets the BOM prefixed onto it when the
> >> file is passed to csv.DictReader, grr.
> > And it's part of the standard:
> > Table 2.4 here
> > http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

> It would have been nice if there was an eighth encoding scheme defined
> there UTF-8NB which would be UTF-8 with BOM not allowed.

If you or I break a standard then, well, we broke a standard.
If Microsoft breaks a standard the standard is obliged to change.

Or as the saying goes, everyone is equal though some are more equal.


#64180

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-01-18 03:50 +1100
Message-ID: <mailman.5649.1389977412.18130.python-list@python.org>
In reply to: #64176
On Sat, Jan 18, 2014 at 3:30 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> If you or I break a standard then, well, we broke a standard.
> If Microsoft breaks a standard the standard is obliged to change.
>
> Or as the saying goes, everyone is equal though some are more equal.

https://en.wikipedia.org/wiki/800_pound_gorilla

Though Microsoft has been losing weight over the past decade or so,
just as IBM before them had done (there was a time when IBM was *the*
800lb gorilla, pretty much, but definitely not now). In Unix/POSIX
contexts, Linux might be playing that role - I've seen code that
unwittingly assumes Linux more often than, say, assuming FreeBSD - but
I haven't seen a huge amount of "the standard has to change, Linux
does it differently", possibly because the areas of Linux-assumption
are areas that aren't standardized anyway (eg additional socket
options beyond the spec).

The one area where industry leaders still heavily dictate to standards
is the web. Fortunately, it usually still results in usable standards
documents that HTML authors can rely on. Usually. *twiddles fingers*

ChrisA


#64177

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-01-18 03:33 +1100
Message-ID: <mailman.5648.1389976433.18130.python-list@python.org>
In reply to: #64175
On Sat, Jan 18, 2014 at 3:26 AM, Pete Forman <petef4+usenet@gmail.com> wrote:
> It would have been nice if there was an eighth encoding scheme defined
> there UTF-8NB which would be UTF-8 with BOM not allowed.

Or call that one UTF-8, and the one with the marker can be UTF-8-MS-NOTEPAD.

ChrisA
