Groups > comp.lang.python > #64131 > unrolled thread
| Started by | Tim Chase <tim@thechases.com> |
|---|---|
| First post | 2014-01-16 19:40 -0600 |
| Last post | 2014-01-18 03:33 +1100 |
| Articles | 6 — 4 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Guessing the encoding from a BOM Tim Chase <tim@thechases.com> - 2014-01-16 19:40 -0600
Re: Guessing the encoding from a BOM Rustom Mody <rustompmody@gmail.com> - 2014-01-16 21:08 -0800
Re: Guessing the encoding from a BOM Pete Forman <petef4+usenet@gmail.com> - 2014-01-17 16:26 +0000
Re: Guessing the encoding from a BOM Rustom Mody <rustompmody@gmail.com> - 2014-01-17 08:30 -0800
Re: Guessing the encoding from a BOM Chris Angelico <rosuav@gmail.com> - 2014-01-18 03:50 +1100
Re: Guessing the encoding from a BOM Chris Angelico <rosuav@gmail.com> - 2014-01-18 03:33 +1100
| From | Tim Chase <tim@thechases.com> |
|---|---|
| Date | 2014-01-16 19:40 -0600 |
| Subject | Re: Guessing the encoding from a BOM |
| Message-ID | <mailman.5618.1389922759.18130.python-list@python.org> |
On 2014-01-17 11:14, Chris Angelico wrote:
> UTF-8 specifies the byte order
> as part of the protocol, so you don't need to mark it.

You don't need to mark it when writing, but some idiots use it anyway. If you're sniffing a file for purposes of reading, you need to look for it and remove it from the actual data that gets returned from the file--otherwise, your data can see it as corruption. I end up with lots of CSV files from customers who have polluted it with Notepad or had Excel insert some UTF-8 BOM when exporting. This means my first column-name gets the BOM prefixed onto it when the file is passed to csv.DictReader, grr.

-tkc
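Editor's sketch (not part of the original post): the csv.DictReader symptom described above can be reproduced, and possibly avoided, with Python 3's standard "utf-8-sig" codec, which drops a leading UTF-8 BOM when one is present and otherwise behaves like plain UTF-8. The sample bytes and variable names are hypothetical, and the posters may well have been on Python 2, where the file handling differs.

```python
import csv
import io

# Hypothetical sample: what a BOM-prefixed CSV export might look like on disk.
raw = b"\xef\xbb\xbfname,qty\r\nwidget,3\r\n"

# Decoding as plain UTF-8 keeps the BOM as U+FEFF glued to the first header.
plain = csv.DictReader(io.StringIO(raw.decode("utf-8"), newline=""))
plain_fields = plain.fieldnames

# "utf-8-sig" strips a leading BOM if present and is harmless if it is absent.
clean = csv.DictReader(io.StringIO(raw.decode("utf-8-sig"), newline=""))
clean_fields = clean.fieldnames
clean_rows = list(clean)

print(plain_fields)
print(clean_fields)
print(clean_rows)
```

When reading from a file instead of an in-memory buffer, the same codec name can be passed as `open(path, encoding="utf-8-sig", newline="")`.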
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-01-16 21:08 -0800 |
| Message-ID | <32c1b684-1ff7-48c0-af7a-cd15235ea531@googlegroups.com> |
| In reply to | #64131 |
On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
> On 2014-01-17 11:14, Chris Angelico wrote:
> > UTF-8 specifies the byte order
> > as part of the protocol, so you don't need to mark it.
> You don't need to mark it when writing, but some idiots use it
> anyway. If you're sniffing a file for purposes of reading, you need
> to look for it and remove it from the actual data that gets returned
> from the file--otherwise, your data can see it as corruption. I end
> up with lots of CSV files from customers who have polluted it with
> Notepad or had Excel insert some UTF-8 BOM when exporting. This
> means my first column-name gets the BOM prefixed onto it when the
> file is passed to csv.DictReader, grr.

And it's part of the standard: Table 2.4 here
http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
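Editor's sketch (not part of the original post): the thread's subject, guessing an encoding from a BOM, can be approximated with the BOM constants in Python's standard `codecs` module. The function name `sniff_bom`, its `default` parameter, and the choice of returned codec names are assumptions made for illustration. A BOM match is only a hint: for example, UTF-16-LE data whose first character is U+0000 starts with the same bytes as the UTF-32-LE BOM.

```python
import codecs

# Longer BOMs are checked first: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data, default="utf-8"):
    """Return (codec_name, payload) with any recognised leading BOM removed.

    Hypothetical helper: falls back to `default` when no BOM is found.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return default, data


name, payload = sniff_bom(b"\xef\xbb\xbfname,qty")
print(name, payload.decode(name))
```

Returning the payload with the BOM already removed means the caller can decode with the BOM-less codec name and never see U+FEFF in the data.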
| From | Pete Forman <petef4+usenet@gmail.com> |
|---|---|
| Date | 2014-01-17 16:26 +0000 |
| Message-ID | <86zjmufubv.fsf@gmail.com> |
| In reply to | #64139 |
Rustom Mody <rustompmody@gmail.com> writes:

> On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
>> On 2014-01-17 11:14, Chris Angelico wrote:
>> > UTF-8 specifies the byte order
>> > as part of the protocol, so you don't need to mark it.
>
>> You don't need to mark it when writing, but some idiots use it
>> anyway. If you're sniffing a file for purposes of reading, you need
>> to look for it and remove it from the actual data that gets returned
>> from the file--otherwise, your data can see it as corruption. I end
>> up with lots of CSV files from customers who have polluted it with
>> Notepad or had Excel insert some UTF-8 BOM when exporting. This
>> means my first column-name gets the BOM prefixed onto it when the
>> file is passed to csv.DictReader, grr.
>
> And it's part of the standard:
> Table 2.4 here
> http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

It would have been nice if there was an eighth encoding scheme defined there, UTF-8NB, which would be UTF-8 with BOM not allowed.

--
Pete Forman
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-01-17 08:30 -0800 |
| Message-ID | <edc1acb8-be97-43c9-819b-3af18086d5d0@googlegroups.com> |
| In reply to | #64175 |
On Friday, January 17, 2014 9:56:28 PM UTC+5:30, Pete Forman wrote:
> Rustom Mody writes:
> > On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
> >> On 2014-01-17 11:14, Chris Angelico wrote:
> >> > UTF-8 specifies the byte order
> >> > as part of the protocol, so you don't need to mark it.
> >> You don't need to mark it when writing, but some idiots use it
> >> anyway. If you're sniffing a file for purposes of reading, you need
> >> to look for it and remove it from the actual data that gets returned
> >> from the file--otherwise, your data can see it as corruption. I end
> >> up with lots of CSV files from customers who have polluted it with
> >> Notepad or had Excel insert some UTF-8 BOM when exporting. This
> >> means my first column-name gets the BOM prefixed onto it when the
> >> file is passed to csv.DictReader, grr.
> > And it's part of the standard:
> > Table 2.4 here
> > http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
> It would have been nice if there was an eighth encoding scheme defined
> there UTF-8NB which would be UTF-8 with BOM not allowed.

If you or I break a standard then, well, we broke a standard.
If Microsoft breaks a standard, the standard is obliged to change.

Or as the saying goes, everyone is equal though some are more equal.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-18 03:50 +1100 |
| Message-ID | <mailman.5649.1389977412.18130.python-list@python.org> |
| In reply to | #64176 |
On Sat, Jan 18, 2014 at 3:30 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> If you or I break a standard then, well, we broke a standard.
> If Microsoft breaks a standard the standard is obliged to change.
>
> Or as the saying goes, everyone is equal though some are more equal.

https://en.wikipedia.org/wiki/800_pound_gorilla

Though Microsoft has been losing weight over the past decade or so, just as IBM before them had done (there was a time when IBM was *the* 800lb gorilla, pretty much, but definitely not now). In Unix/POSIX contexts, Linux might be playing that role - I've seen code that unwittingly assumes Linux more often than, say, assuming FreeBSD - but I haven't seen a huge amount of "the standard has to change, Linux does it differently", possibly because the areas of Linux-assumption are areas that aren't standardized anyway (eg additional socket options beyond the spec).

The one area where industry leaders still heavily dictate to standards is the web. Fortunately, it usually still results in usable standards documents that HTML authors can rely on. Usually. *twiddles fingers*

ChrisA
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-18 03:33 +1100 |
| Message-ID | <mailman.5648.1389976433.18130.python-list@python.org> |
| In reply to | #64175 |
On Sat, Jan 18, 2014 at 3:26 AM, Pete Forman <petef4+usenet@gmail.com> wrote:
> It would have been nice if there was an eighth encoding scheme defined
> there UTF-8NB which would be UTF-8 with BOM not allowed.

Or call that one UTF-8, and the one with the marker can be UTF-8-MS-NOTEPAD.

ChrisA