Re: Guessing the encoding from a BOM

From Mark Lawrence <breamoreboy@yahoo.co.uk>
Subject Re: Guessing the encoding from a BOM
Date Fri, 17 Jan 2014 09:10:32 +0000
Newsgroups comp.lang.python


On 17/01/2014 01:40, Tim Chase wrote:
> On 2014-01-17 11:14, Chris Angelico wrote:
>> UTF-8 specifies the byte order
>> as part of the protocol, so you don't need to mark it.
>
> You don't need to mark it when writing, but some idiots use it
> anyway.  If you're sniffing a file for purposes of reading, you need
> to look for it and remove it from the actual data that gets returned
> from the file--otherwise, your data can see it as corruption.  I end
> up with lots of CSV files from customers who have polluted it with
> Notepad or had Excel insert some UTF-8 BOM when exporting.  This
> means my first column-name gets the BOM prefixed onto it when the
> file is passed to csv.DictReader, grr.
>
> -tkc
>
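
For anyone hitting the BOM problem described above, here is a minimal sketch of one way around it, assuming Python 3 and that the file really is UTF-8. The function name read_csv_rows and the sample bytes are made up for illustration. The "utf-8-sig" codec drops a leading BOM if there is one and otherwise decodes exactly like "utf-8", so the first column name comes through clean.

```python
import csv
import io


def read_csv_rows(raw_bytes):
    """Decode CSV bytes as UTF-8, dropping a leading BOM if one is present."""
    # "utf-8-sig" strips a leading UTF-8 BOM and otherwise acts like "utf-8".
    text = raw_bytes.decode("utf-8-sig")
    # newline="" leaves line endings untouched so the csv module handles them.
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Hypothetical sample: the kind of BOM-prefixed export described above.
sample = b"\xef\xbb\xbfname,amount\r\nalice,10\r\n"
rows = read_csv_rows(sample)
print(list(rows[0].keys()))
```

When reading straight from disk, passing encoding="utf-8-sig" and newline="" to open() should have the same effect.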

My code that used to handle CSV files from M$ Money had to allow for a 
single NUL byte right at the end of the file.  Thankfully I've now moved 
on to gnucash.
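
A rough sketch of the sort of workaround that takes, assuming the file is read as bytes first. This is not the original code, and the helper name strip_trailing_nul is invented for the example. It removes at most one NUL byte, and only when it is the very last byte.

```python
def strip_trailing_nul(raw_bytes):
    """Return raw_bytes without a single trailing NUL byte, if it has one."""
    # Only the final byte is checked; NULs elsewhere are left alone.
    if raw_bytes.endswith(b"\x00"):
        return raw_bytes[:-1]
    return raw_bytes


# Hypothetical sample data with the stray NUL at the end.
cleaned = strip_trailing_nul(b"date,payee\r\n2014-01-17,shop\r\n\x00")
print(cleaned)
```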

Slight aside, any chance of changing the subject of this thread, or even 
ending the thread completely?  Why?  Every time I see it I picture 
Inspector Clouseau, "A BOM!!!" :)

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence
