From: Pete Forman
Newsgroups: comp.lang.python
Subject: Re: Guessing the encoding from a BOM
Date: Fri, 17 Jan 2014 16:26:28 +0000
Message-ID: <86zjmufubv.fsf@gmail.com>
References: <1389901049.40172.YahooMailBasic@web163804.mail.gq1.yahoo.com> <32c1b684-1ff7-48c0-af7a-cd15235ea531@googlegroups.com>

Rustom Mody writes:

> On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
>> On 2014-01-17 11:14, Chris Angelico wrote:
>> > UTF-8 specifies the byte order
>> > as part of the protocol, so you don't need to mark it.
>
>> You don't need to mark it when writing, but some idiots use it
>> anyway. If you're sniffing a file for purposes of reading, you need
>> to look for it and remove it from the actual data that gets returned
>> from the file--otherwise, your data can see it as corruption. I end
>> up with lots of CSV files from customers who have polluted it with
>> Notepad or had Excel insert some UTF-8 BOM when exporting. This
>> means my first column-name gets the BOM prefixed onto it when the
>> file is passed to csv.DictReader, grr.
>
> And its part of the standard:
> Table 2.4 here
> http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

It would have been nice if an eighth encoding scheme had been defined
there: UTF-8NB, which would be UTF-8 with the BOM not allowed.

-- 
Pete Forman
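
For anyone hitting the csv.DictReader problem Tim describes, here is a minimal sketch of one workaround, assuming Python 3. It decodes with the standard "utf-8-sig" codec, which drops a leading BOM if there is one and otherwise behaves like plain UTF-8. The sample data and the helper name read_rows are made up for illustration; they are not from the thread.

```python
import codecs
import csv
import io

# Hypothetical sample: a CSV export that starts with a UTF-8 BOM.
raw = codecs.BOM_UTF8 + "name,qty\nwidget,3\n".encode("utf-8")


def read_rows(data, encoding):
    """Decode bytes with the given codec and parse them as CSV rows."""
    text = io.StringIO(data.decode(encoding), newline="")
    return list(csv.DictReader(text))


# Plain "utf-8" keeps the BOM, so it ends up glued to the first column name.
polluted = read_rows(raw, "utf-8")

# "utf-8-sig" strips a leading BOM when present...
clean = read_rows(raw, "utf-8-sig")

# ...and leaves BOM-less input untouched.
clean_no_bom = read_rows(raw[len(codecs.BOM_UTF8):], "utf-8-sig")

print(list(polluted[0]))
print(list(clean[0]))
```

When reading from disk rather than from bytes in memory, the same codec name can be passed as the encoding argument to open(), so the sniffing step is not needed for the UTF-8 case.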