Re: Guessing the encoding from a BOM

Path	csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!eternal-september.org!feeder.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From	Pete Forman <petef4+usenet@gmail.com>
Newsgroups	comp.lang.python
Subject	Re: Guessing the encoding from a BOM
Date	Fri, 17 Jan 2014 16:26:28 +0000
Organization	A noiseless patient Spider
Lines	24
Message-ID	<86zjmufubv.fsf@gmail.com>
References	<CAPTjJmqyO0UHrq31510iNeoQ2TcrJnosV0A6oHQOt5i-gz3njA@mail.gmail.com> <1389901049.40172.YahooMailBasic@web163804.mail.gq1.yahoo.com> <CAPTjJmqNhokKF8X3jHNZrW0iEt8foTaMM+26a3+2O9FG4rMPpw@mail.gmail.com> <mailman.5618.1389922759.18130.python-list@python.org> <32c1b684-1ff7-48c0-af7a-cd15235ea531@googlegroups.com>
Mime-Version	1.0
Content-Type	text/plain
Injection-Info	mx05.eternal-september.org; posting-host="cdf6132ee4c43c2d1457a368e89c85c9"; logging-data="14446"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+NK6Z82vawfD8Lul7CDNpR"
User-Agent	Gnus/5.13 (Gnus v5.13) Emacs/24.3 (windows-nt)
Cancel-Lock	sha1:t7AV/cxzJ3T9dLEKzCA5DV782rw= sha1:NpRN/DtOW0oQ58rO4Cz6xUn+yl8=
Xref	csiph.com comp.lang.python:64175


Rustom Mody <rustompmody@gmail.com> writes:

> On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
>> On 2014-01-17 11:14, Chris Angelico wrote:
>> > UTF-8 specifies the byte order
>> > as part of the protocol, so you don't need to mark it.
>
>> You don't need to mark it when writing, but some idiots use it
>> anyway.  If you're sniffing a file for purposes of reading, you need
>> to look for it and remove it from the actual data that gets returned
>> from the file--otherwise, your data can see it as corruption.  I end
>> up with lots of CSV files from customers who have polluted it with
>> Notepad or had Excel insert some UTF-8 BOM when exporting.  This
>> means my first column-name gets the BOM prefixed onto it when the
>> file is passed to csv.DictReader, grr.
>
> And it's part of the standard:
> Table 2.4 here
> http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

It would have been nice if an eighth encoding scheme had been defined
there: UTF-8NB, which would be UTF-8 with the BOM not allowed.
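
In the meantime, the reader has to sniff for the BOM and strip it, as Tim
describes. A minimal sketch of that, assuming Python 3; the names
guess_encoding and read_csv_rows are made up for illustration, and falling
back to plain UTF-8 when no BOM is found is an assumption, not something a
BOM check can tell you:

```python
import codecs
import csv
import io

# Check the UTF-32 BOMs before the UTF-16 ones: the UTF-32-LE BOM
# begins with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw, default="utf-8"):
    """Return a codec name based on a leading BOM, else the assumed default."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            # These codecs consume the BOM while decoding, so it never
            # reaches the decoded text.
            return name
    return default


def read_csv_rows(raw):
    """Decode raw bytes (BOM stripped, if any) and parse them as CSV."""
    text = raw.decode(guess_encoding(raw))
    return list(csv.DictReader(io.StringIO(text, newline="")))
```

With this, a Notepad-style file such as codecs.BOM_UTF8 + b"name,qty\r\n..."
gives a first column called "name" rather than "\ufeffname". If the input is
known to be UTF-8, opening the file with encoding="utf-8-sig" is enough on
its own; that codec accepts input both with and without the BOM.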
-- 
Pete Forman
