Groups > comp.lang.python > #86311 > unrolled thread
| Started by | pierrick.brihaye@gmail.com |
|---|---|
| First post | 2015-02-24 02:49 -0800 |
| Last post | 2015-02-27 10:23 +1100 |
| Articles | 20 on this page of 158 — 19 participants |
| From | pierrick.brihaye@gmail.com |
|---|---|
| Date | 2015-02-24 02:49 -0800 |
| Subject | Newbie question about text encoding |
| Message-ID | <aae131a7-29a1-4f79-ac16-d1e223616c51@googlegroups.com> |
Hello,
Working with pyshp, this is my code :
    import shapefile

    inFile = shapefile.Reader("blah")

    for sr in inFile.shapeRecords():
        rec = sr.record[2]
        print("Output : ", rec, type(rec))

    Output: hippodrome du resto <class 'str'>
    Output: b'stade de man\xe9 braz' <class 'bytes'>
Why do I get 2 different types ?
How to get a string object when I have accented characters ?
Thank you,
p.b.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-02-24 22:09 +1100 |
| Message-ID | <mailman.19124.1424776180.18130.python-list@python.org> |
| In reply to | #86311 |
On Tue, Feb 24, 2015 at 9:49 PM, <pierrick.brihaye@gmail.com> wrote:
> Working with pyshp, this is my code :
>
> import shapefile
>
> inFile = shapefile.Reader("blah")
>
> for sr in inFile.shapeRecords():
>     rec = sr.record[2]
>     print("Output : ", rec, type(rec))
>
> Output: hippodrome du resto <class 'str'>
> Output: b'stade de man\xe9 braz' <class 'bytes'>
>
> Why do I get 2 different types ?
> How to get a string object when I have accented characters ?
I don't know what pyshp is doing here, so you may want to seek a
pyshp-specific mailing list for help. My guess is that it's
automatically decoding to str if it's ASCII-only, and giving you back
the raw bytes if there are any that it can't handle. The question is:
What encoding _is_ that? Do you know what character you're expecting
to see there? Before you can turn that into a string, you have to
figure out whether it's Latin-1 (ISO-8859-1), or some other ISO-8859-x
standard, or a Windows codepage, or an ancient thing off a Mac, or
whatever else it might be. Once you know that, it's easy: you just
decode() the bytes objects. But you MUST figure out the encoding
first.
ChrisA
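A minimal sketch of why the encoding has to be known first: the same bytes decode to different text depending on the codec. The list of candidate encodings is an assumption chosen for illustration, not something pyshp reports; only the byte string comes from the original post.

```python
# The byte string shown in the original post.
raw = b"stade de man\xe9 braz"

# Hypothetical shortlist of encodings to try. The real answer has to come
# from knowing where the shapefile was produced, not from this list.
CANDIDATES = ["utf-8", "latin-1", "cp1252", "iso8859-15", "mac_roman", "cp437"]


def try_decodings(data, encodings):
    """Return {encoding: decoded text, or None if the bytes are invalid}."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


results = try_decodings(raw, CANDIDATES)
for name, text in results.items():
    print(f"{name:12} {text!r}")
```

Several codecs succeed without raising an error, and they do not all agree on the accented character, so a successful decode() alone is not evidence that the right encoding was chosen.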
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2015-02-24 06:25 -0500 |
| Message-ID | <mailman.19126.1424777143.18130.python-list@python.org> |
| In reply to | #86311 |
On 02/24/2015 05:49 AM, pierrick.brihaye@gmail.com wrote:
> Hello,
>
> Working with pyshp, this is my code :
What version of Python, what version of pyshp, from where, and what OS?
This is the first information to supply in any query that goes
outside of the standard library.
For example, you might be running CPython 3.2.1 on Ubuntu 14.04.1, and
installed pyshp 1.2.1 from https://pypi.python.org/pypi/pyshp
Or some other combination.
>
> import shapefile
>
> inFile = shapefile.Reader("blah")
>
> for sr in inFile.shapeRecords():
>     rec = sr.record[2]
>     print("Output : ", rec, type(rec))
>
> Output: hippodrome du resto <class 'str'>
> Output: b'stade de man\xe9 braz' <class 'bytes'>
>
> Why do I get 2 different types ?
> How to get a string object when I have accented characters ?
>
> Thank you,
>
> p.b.
>
From my (cursory) reading of the pyshp docs on the above page, I cannot
see what the [2] element of the record list should look like. So I'd
have to guess.
The bytes object is presumably an encoded version of the character
string. I don't see anything on that page about unicode, or decode, so
you might have to guess the encoding. Anyway, you can decode the
bytestring into a regular string if you can correctly guess the encoding
method, such as utf-8.
If that were the right decoding, you could just use
    mystring = rec.decode()
But utf-8 does not seem to be the right encoding for that bytestring.
So you'll need a form like:
    mystring = rec.decode(encoding='xxx')
for some value of xxx.
--
DaveA
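A small sketch of the decode step described above, applied to a record list that mixes str and bytes the way the original output does. The helper name as_text and the default of latin-1 are assumptions for illustration; the correct encoding still has to be established for the actual data.

```python
def as_text(value, encoding="latin-1"):
    """Return value as str, decoding it first if it is a bytes object.

    The default encoding is an assumption made for this example only.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# The two values printed in the original post.
records = ["hippodrome du resto", b"stade de man\xe9 braz"]
texts = [as_text(rec) for rec in records]
print(texts)

# Calling decode() with no argument means UTF-8, which rejects these bytes.
try:
    records[1].decode()
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc.reason)
```

Passing str values through unchanged means the same call can be used on every field, whichever type the library happens to return.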
| From | Laura Creighton <lac@openend.se> |
|---|---|
| Date | 2015-02-24 15:55 +0100 |
| Message-ID | <mailman.19132.1424789760.18130.python-list@python.org> |
| In reply to | #86311 |
In a message of Tue, 24 Feb 2015 06:25:24 -0500, Dave Angel writes:
>But utf-8 does not seem to be the right encoding for that bytestring.
>So you'll need a form like:
>    mystring = rec.decode(encoding='xxx')
>
>for some value of xxx.

>DaveA

And the xxx you want is "latin1"

Laura
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-02-25 02:03 +1100 |
| Message-ID | <mailman.19133.1424790205.18130.python-list@python.org> |
| In reply to | #86311 |
On Wed, Feb 25, 2015 at 1:55 AM, Laura Creighton <lac@openend.se> wrote:
> In a message of Tue, 24 Feb 2015 06:25:24 -0500, Dave Angel writes:
>>But utf-8 does not seem to be the right encoding for that bytestring.
>>So you'll need a form like:
>>    mystring = rec.decode(encoding='xxx')
>>
>>for some value of xxx.
>
>>DaveA
>
> And the xxx you want is "latin1"

Can you be sure it's Latin-1? I'm not certain of that. In any case, I
never advocate fixing encoding problems by "just do this and it'll all
go away"; you have to understand your data before you can decode it.

ChrisA
| From | Laura Creighton <lac@openend.se> |
|---|---|
| Date | 2015-02-24 16:06 +0100 |
| Message-ID | <mailman.19134.1424790392.18130.python-list@python.org> |
| In reply to | #86311 |
In a message of Tue, 24 Feb 2015 15:55:41 +0100, Laura Creighton writes:
>In a message of Tue, 24 Feb 2015 06:25:24 -0500, Dave Angel writes:
>>But utf-8 does not seem to be the right encoding for that bytestring.
>>So you'll need a form like:
>>    mystring = rec.decode(encoding='xxx')
>>
>>for some value of xxx.
>
>>DaveA
>
>And the xxx you want is "latin1"
>
>Laura

er, latin1. You don't want an extra set of quotes.
There are many aliases for latin1. i.e.
latin_1, iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1
see: https://docs.python.org/2.4/lib/standard-encodings.html
and you might want to read
https://docs.python.org/2/howto/unicode.html
to understand the problem better.

Laura
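A short sketch showing that the aliases listed above all name the same codec. It uses codecs.lookup from the standard library; the exact canonical name printed is whatever the running Python reports, so the example only relies on all aliases agreeing.

```python
import codecs

# The aliases quoted in the message above.
ALIASES = ["latin_1", "iso-8859-1", "iso8859-1", "8859",
           "cp819", "latin", "latin1", "L1"]

# Map each alias to the canonical codec name Python resolves it to.
canonical = {alias: codecs.lookup(alias).name for alias in ALIASES}
for alias, name in canonical.items():
    print(f"{alias:12} -> {name}")

# Any of the aliases can therefore be passed to decode(), as a string.
sample = b"stade de man\xe9 braz"
decoded = {alias: sample.decode(alias) for alias in ALIASES}
```

In code the alias is still written as a quoted string argument, for example rec.decode(encoding='latin1').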
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-02-24 08:01 -0800 |
| Message-ID | <7f2bdec0-483a-4ffb-ad64-09a687f385ba@googlegroups.com> |
| In reply to | #86322 |
Sorry, you are all wrong.
The coding is UCS1.
Python is so funny.
jmf
| From | Laura Creighton <lac@openend.se> |
|---|---|
| Date | 2015-02-24 16:07 +0100 |
| Message-ID | <mailman.19135.1424790463.18130.python-list@python.org> |
| In reply to | #86311 |
In a message of Wed, 25 Feb 2015 02:03:16 +1100, Chris Angelico writes:
>On Wed, Feb 25, 2015 at 1:55 AM, Laura Creighton <lac@openend.se> wrote:
>> In a message of Tue, 24 Feb 2015 06:25:24 -0500, Dave Angel writes:
>>>But utf-8 does not seem to be the right encoding for that bytestring.
>>>So you'll need a form like:
>>>    mystring = rec.decode(encoding='xxx')
>>>
>>>for some value of xxx.
>>
>>>DaveA
>>
>> And the xxx you want is "latin1"
>
>Can you be sure it's Latin-1? I'm not certain of that. In any case, I
>never advocate fixing encoding problems by "just do this and it'll all
>go away"; you have to understand your data before you can decode it.
>
>ChrisA

I can, I speak French and I recognise the data. It's French place names,
places where sporting events are held. :)

Laura
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-02-25 02:10 +1100 |
| Message-ID | <mailman.19136.1424790645.18130.python-list@python.org> |
| In reply to | #86311 |
On Wed, Feb 25, 2015 at 2:07 AM, Laura Creighton <lac@openend.se> wrote:
>>Can you be sure it's Latin-1? I'm not certain of that. In any case, I
>>never advocate fixing encoding problems by "just do this and it'll all
>>go away"; you have to understand your data before you can decode it.
>>
>>ChrisA
>
> I can, I speak French and I recognise the data. It's French place names,
> places where sporting events are held. :)

Ah, okay. :) But even with that level of confidence, you still have to
pick between Latin-1 and CP-1252, which you can't tell based on this
one snippet. Welcome to untagged encodings.

ChrisA
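A sketch of why this one snippet cannot separate Latin-1 from CP-1252: the two codecs agree on every byte it contains. The helper differing_bytes and the second sample (with byte 0x80) are hypothetical, added only to show a case where the two codecs do disagree.

```python
def differing_bytes(data, enc_a, enc_b):
    """List the distinct byte values in data that two single-byte
    encodings decode to different characters."""
    found = []
    for value in sorted(set(data)):
        one = bytes([value])
        # errors="replace" keeps going when a codec leaves a byte undefined.
        if one.decode(enc_a, errors="replace") != one.decode(enc_b, errors="replace"):
            found.append(one)
    return found


snippet = b"stade de man\xe9 braz"      # from the original post
with_euro = b"prix : 5 \x80"            # hypothetical sample, not from the thread

same = differing_bytes(snippet, "latin-1", "cp1252")
diff = differing_bytes(with_euro, "latin-1", "cp1252")
print(same)   # nothing in the snippet distinguishes the two
print(diff)   # byte 0x80 is a control code in Latin-1, the euro sign in CP-1252
```

The two encodings only diverge in the range 0x80 to 0x9F, so data with no bytes in that range decodes identically under both.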
| From | Laura Creighton <lac@openend.se> |
|---|---|
| Date | 2015-02-24 16:24 +0100 |
| Message-ID | <mailman.19137.1424791449.18130.python-list@python.org> |
| In reply to | #86311 |
In a message of Wed, 25 Feb 2015 02:10:42 +1100, Chris Angelico writes:
>On Wed, Feb 25, 2015 at 2:07 AM, Laura Creighton <lac@openend.se> wrote:
>>>Can you be sure it's Latin-1? I'm not certain of that. In any case, I
>>>never advocate fixing encoding problems by "just do this and it'll all
>>>go away"; you have to understand your data before you can decode it.
>>>
>>>ChrisA
>>
>> I can, I speak French and I recognise the data. It's French place names,
>> places where sporting events are held. :)
>
>Ah, okay. :) But even with that level of confidence, you still have to
>pick between Latin-1 and CP-1252, which you can't tell based on this
>one snippet. Welcome to untagged encodings.
>
>ChrisA

Ah, yes, you are right about that. I see CP-1252 about 2 times every 10
years, and latin1 every minute of my life, so I am biased to assume I
know what I am seeing.

ChrisA, you come from an English speaking country, right?

For those of us who come from countries whose language doesn't fit in
ASCII, the notion of 'understand the data' doesn't work very well. We
already understand the data -- it's a set of words in our native language.
The hard part isn't understanding the data, but rather understanding how
the hell Python could be so stupid as to not understand it. :) The
notion that Python normally only understands the subset of the
characters in your native language that English speakers use in their
language is not the most obvious thing.

And having taught countless European kids how to write their very first
program in Python, I can tell you for certain that the sort of deep
understanding of encoding methods is not what 10 year olds who just
want to print out the names of their friends, and their favourite
music titles, and their favourite musicians want to know. :)

Laura
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-02-25 02:33 +1100 |
| Message-ID | <mailman.19138.1424792018.18130.python-list@python.org> |
| In reply to | #86311 |
On Wed, Feb 25, 2015 at 2:24 AM, Laura Creighton <lac@openend.se> wrote:
> Ah, yes, you are right about that. I see CP-1252 about 2 times every 10
> years, and latin1 every minute of my life, so I am biased to assume I
> know what I am seeing.

Fair enough. CP-1252 is still a possibility, but the difference can be
dealt with later.

> ChrisA, you come from an English speaking country, right?

Yes (Australia, to be specific).

> For those of us who come from countries whose language doesn't fit in
> ASCII, the notion of 'understand the data' doesn't work very well. We
> already understand the data -- it's a set of words in our native language.
> The hard part isn't understanding the data, but rather understanding how
> the hell Python could be so stupid as to not understand it. :) The
> notion that Python normally only understands the subset of the
> characters in your native language that English speakers use in their
> language is not the most obvious thing.

Also a reasonable baseline assumption; but the trouble is that if you
automatically assume that text is encoded in your favourite eight-bit
system, you're taking a huge risk.

Now, you have a huge leg up on me, in that you actually recognize the
*words* in that piece of text. That means you can have MUCH greater
confidence in stating that it's Latin-1 than I can. But that's
precisely what I mean by "understand the data". If you, being a native
French speaker, pick up a file written in (say) Polish, and encoded
Latin-2, you'll recognize by the ASCII characters that it's not French
text, and probably you'd be able to spot that it ought to be Latin-2
rather than Latin-1. That's understanding the data, that's having more
information than just the byte patterns. A computer can't reliably do
that (just look up the "Bush hid the facts" bug if you don't believe
me), but a human often can.

> And having taught countless European kids how to write their very first
> program in Python, I can tell you for certain that the sort of deep
> understanding of encoding methods is not what 10 year olds who just
> want to print out the names of their friends, and their favourite
> music titles, and their favourite musicians want to know. :)

Right, so you should be teaching them to use Python 3, and always
saving everything in UTF-8, and basically ignoring the whole mess of
eight-bit encodings :)

ChrisA
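A small sketch of the ambiguity behind the "Bush hid the facts" example mentioned above: the same 18 bytes are valid ASCII and also valid UTF-16 (little-endian), so byte patterns alone cannot settle which one was meant. This only illustrates the ambiguity; it does not reproduce the detection heuristic that actually misfired.

```python
data = b"Bush hid the facts"

# Read as ASCII: 18 one-byte characters, the intended text.
as_ascii = data.decode("ascii")

# Read as UTF-16-LE: each pair of bytes becomes one code unit,
# giving 9 characters that have nothing to do with the original.
as_utf16 = data.decode("utf-16-le")

print(len(as_ascii), repr(as_ascii))
print(len(as_utf16), [f"U+{ord(ch):04X}" for ch in as_utf16])
```

Both decodes succeed without error, which is why a program has to be told the encoding or guess from statistics, while a human reader can simply see which result is meaningful.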
| From | random832@fastmail.us |
|---|---|
| Date | 2015-02-24 10:38 -0500 |
| Message-ID | <mailman.19139.1424792306.18130.python-list@python.org> |
| In reply to | #86311 |
On Tue, Feb 24, 2015, at 10:10, Chris Angelico wrote:
> Ah, okay. :) But even with that level of confidence, you still have to
> pick between Latin-1 and CP-1252, which you can't tell based on this
> one snippet. Welcome to untagged encodings.

Or Latin-9 (ISO 8859-15)

That was popular on Linux systems for a while before everyone switched
to UTF-8 - it's got the Euro sign, and (relevant to French) the "oe"
ligature, and uppercase Y with diaeresis, at the expense of "generic
currency" and fractions.

Or it could be Latin-3, Latin-5 (8859-9), or Latin-8 (8859-14) - they
are not commonly used for French locales, being primarily intended for
other languages, but they do support all characters (at least all from
Latin-1) used in French names.

I assume there are likewise several Windows codepages it could be.
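A sketch checking the two claims above with Python's built-in codecs: the snippet from the original post decodes identically under all the ISO 8859 variants named, and Latin-9 differs from Latin-1 at the byte positions that hold the currency sign and fractions. The four byte values examined are a selection for illustration, not a full list of the differences.

```python
snippet = b"stade de man\xe9 braz"   # from the original post

# Latin-1, Latin-3, Latin-5, Latin-8 and Latin-9 under their Python codec names.
FAMILY = ["latin-1", "iso8859-3", "iso8859-9", "iso8859-14", "iso8859-15"]
decoded = {name: snippet.decode(name) for name in FAMILY}
print(set(decoded.values()))   # a single value: they all agree on this text

# Four positions where Latin-9 replaced Latin-1 characters.
SWAPPED = bytes([0xA4, 0xBC, 0xBD, 0xBE])
latin1_chars = SWAPPED.decode("latin-1")       # currency sign and fractions
latin9_chars = SWAPPED.decode("iso8859-15")    # euro sign, OE/oe ligatures, Y diaeresis
for byte, old, new in zip(SWAPPED, latin1_chars, latin9_chars):
    print(f"0x{byte:02X}  latin-1: {old}  latin-9: {new}")
```

Unless the data contains one of the few bytes where these codecs differ, nothing in the bytes themselves reveals which of them was used.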
| From | Laura Creighton <lac@openend.se> |
|---|---|
| Date | 2015-02-24 17:20 +0100 |
| Message-ID | <mailman.19141.1424794860.18130.python-list@python.org> |
| In reply to | #86311 |
In a message of Wed, 25 Feb 2015 02:33:30 +1100, Chris Angelico writes:

> Also a reasonable baseline assumption; but the trouble is that if you automatically assume that text is encoded in your favourite eight-bit system, you're taking a huge risk.

But, you know, I wasn't assuming this. I actually read latin1. I could read it in ascii, know that \xe9 means 'é', a letter combination that we have in Swedish, so I am rather used to reading, and then well, I could read all of his strings, know they were in French, and know that latin1 was what he needed things to be decoded to.

> Now, you have a huge leg up on me, in that you actually recognize the *words* in that piece of text. That means you can have MUCH greater confidence in stating that it's Latin-1 than I can. But that's precisely what I mean by "understand the data". If you, being a native French speaker, pick up a file written in (say) Polish, and encoded Latin-2, you'll recognize by the ASCII characters that it's not French text, and probably you'd be able to spot that it ought to be Latin-2 rather than Latin-1. That's understanding the data, that's having more information than just the byte patterns. A computer can't reliably do that (just look up the "Bush hid the facts" bug if you don't believe me), but a human often can.

Absolutely correct. But you must not require that all of the speakers of non-English languages think about their languages as 'special encodings'. Only the monoglot ever think of a foreign language as a code.

That poor guy the original poster just wants to have a nice string of his sporting event place name. We should tell him how to get that, not how to be an expert in all the encodings on the face of this earth. Chances are, the only thing he needs to talk about are French words.

If not, well, he will come back when things stop working, and have lots more data to give him. If, instead, this makes him go away happy, then this was the very best thing to do.

>> And having taught countless European kids how to write their very first program in Python, I can tell you for certain that the sort of deep understanding of encoding methods is not what 10 year olds who just want to print out the names of their friends, and their favourite music titles, and their favourite musicians want to know. :)
>
> Right, so you should be teaching them to use Python 3, and always saving everything in UTF-8, and basically ignoring the whole mess of eight-bit encodings :)

Of course this makes sense. But you seem to be missing the point. People who are asking for help in getting things to work in their native language need a 'do this quick' sort of answer. The deeper problems of supporting all languages and language encodings can very much wait. The OP wants a hunk of bytes that happens to mean something in French, and is not encodable in the limited English language to work like a different hunk of bytes that means something in French but is encodable.

Don't overburden them.

> ChrisA

Laura
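Editorial sketch: the 'do this quick' answer discussed above probably amounts to a single decode call in Python 3. The byte string below is a made-up stand-in for a French place name, not the original poster's data, and the choice of Latin-1 is the assumption under debate in this thread.

```python
# Hypothetical bytes as they might be read from a legacy file.
# 0xE9 is the single byte that Latin-1 (ISO-8859-1) assigns to 'é'.
raw = b"Stade V\xe9lodrome"

# Decode the bytes into a text string, assuming they are Latin-1.
text = raw.decode("latin-1")

# Latin-1 maps every byte value to a character, so this call never raises.
# That is convenient, but it also means a wrong guess goes unnoticed.
print(text)
```

Because Latin-1 decoding cannot fail, a clean result is not proof that the guess was right; it only shows that the output looks plausible to a reader who knows the language.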
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-02-25 03:24 +1100 |
| Message-ID | <mailman.19142.1424795081.18130.python-list@python.org> |
| In reply to | #86311 |
On Wed, Feb 25, 2015 at 3:20 AM, Laura Creighton <lac@openend.se> wrote:

> People who are asking for help in getting things to work in their native language need a 'do this quick' sort of answer. The deeper problems of supporting all languages and language encodings can very much wait.

I'm not so sure about that. When "supporting all languages" is as simple as "use Python 3 and UTF-8 everywhere", the cost is much lower than it might be, and the benefit is potentially huge. A "do this quick" answer might get you by *right now*, but it leaves open the possibility of subtler errors. That's why Python moved to Unicode-by-default, even though eight-bit encodings will tend to produce the right results for simple text.

ChrisA
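Editorial sketch: one way to read "use Python 3 and UTF-8 everywhere" is to pass an explicit `encoding="utf-8"` whenever text crosses a file boundary. The names and the temporary file path below are illustrative only.

```python
import os
import tempfile

# Names drawn from several scripts; no single eight-bit encoding covers all.
names = ["Zoé", "Łukasz", "Ελένη"]

# Write and read back with an explicit encoding instead of the platform default.
path = os.path.join(tempfile.mkdtemp(), "friends.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(names))
with open(path, encoding="utf-8") as f:
    loaded = f.read().split("\n")


def fits_in(text, encoding):
    """Return True if every character of `text` can be encoded in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Latin-1 handles the French name but not the Polish or Greek one.
latin1_ok = [fits_in(name, "latin-1") for name in names]
```

The round trip through UTF-8 preserves all three names, while `latin1_ok` records which of them an eight-bit Western European encoding could have stored at all.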
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2015-02-24 12:13 -0500 |
| Message-ID | <mailman.19143.1424798024.18130.python-list@python.org> |
| In reply to | #86311 |
On 02/24/2015 11:20 AM, Laura Creighton wrote:

> In a message of Wed, 25 Feb 2015 02:33:30 +1100, Chris Angelico writes:
>> Also a reasonable baseline assumption; but the trouble is that if you automatically assume that text is encoded in your favourite eight-bit system, you're taking a huge risk.
>
> But, you know, I wasn't assuming this. I actually read latin1. I could read it in ascii, know that \xe9 means 'é', a letter combination that we have in Swedish, so I am rather used to reading, and then well, I could read all of his strings, know they were in French, and know that latin1 was what he needed things to be decoded to.

With a sample of one string, how did you read "all his strings"? And with one non-ASCII code in that single string, how did you know that 'latin1' was the only encoding that included a reasonable character at that encoding?

>> Now, you have a huge leg up on me, in that you actually recognize the *words* in that piece of text. That means you can have MUCH greater confidence in stating that it's Latin-1 than I can. But that's precisely what I mean by "understand the data". If you, being a native French speaker, pick up a file written in (say) Polish, and encoded Latin-2, you'll recognize by the ASCII characters that it's not French text, and probably you'd be able to spot that it ought to be Latin-2 rather than Latin-1. That's understanding the data, that's having more information than just the byte patterns. A computer can't reliably do that (just look up the "Bush hid the facts" bug if you don't believe me), but a human often can.
>
> Absolutely correct. But you must not require that all of the speakers of non-English languages think about their languages as 'special encodings'. Only the monoglot ever think of a foreign language as a code.

All languages are foreign. All that can be written to a disk file are bytes. Those have to have been encoded to represent some abstraction called a character set, or string. The question is whether the encoding method is specified for the particular file type, or for the particular file.

See http://support.esri.com/cn/knowledgebase/techarticles/detail/21106

According to that page, starting at ArcGIS 10.2.1, the default sets the code page to UTF-8 (UNICODE) in the shapefile (.DBF). But in earlier ones, there's supposed to be a reference to the codepage used. From that, one can presumably derive which decoder to use.

> That poor guy the original poster just wants to have a nice string of his sporting event place name. We should tell him how to get that, not how to be an expert in all the encodings on the face of this earth. Chances are, the only thing he needs to talk about are French words.
>
> If not, well, he will come back when things stop working, and have lots more data to give him. If, instead, this makes him go away happy, then this was the very best thing to do.
>
>>> And having taught countless European kids how to write their very first program in Python, I can tell you for certain that the sort of deep understanding of encoding methods is not what 10 year olds who just want to print out the names of their friends, and their favourite music titles, and their favourite musicians want to know. :)
>>
>> Right, so you should be teaching them to use Python 3, and always saving everything in UTF-8, and basically ignoring the whole mess of eight-bit encodings :)
>
> Of course this makes sense. But you seem to be missing the point. People who are asking for help in getting things to work in their native language need a 'do this quick' sort of answer. The deeper problems of supporting all languages and language encodings can very much wait. The OP wants a hunk of bytes that happens to mean something in French, and is not encodable in the limited English language to work like a different hunk of bytes that means something in French but is encodable.
>
> Don't overburden them.

My guess is that this is only appropriate for users who use only locally created data. Since the OP's data is apparently old (if it were current versions, it'd have been utf-8), who knows how consistent the encoding is.

--
DaveA
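Editorial sketch: if a file does carry a code page reference, as suggested above, deriving the decoder could be a small lookup. The function name `codec_for_label`, the label formats, and the Latin-1 fallback are all assumptions made for illustration; this does not reproduce the ESRI file format or any ArcGIS API.

```python
import codecs


def codec_for_label(label, default="latin-1"):
    """Map a declared code page label to a Python codec name.

    Assumption: the label is either a name Python already recognises
    (such as "UTF-8" or "ISO-8859-2") or a bare Windows code page
    number (such as "1252"). Unknown labels fall back to `default`.
    """
    label = label.strip()
    for candidate in (label, "cp" + label):
        try:
            # codecs.lookup raises LookupError for unknown encodings.
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return default


# The same byte means different things under different declared code pages.
raw = b"\xb3"
as_latin2 = raw.decode(codec_for_label("ISO-8859-2"))  # Polish 'ł'
as_latin1 = raw.decode(codec_for_label("ISO-8859-1"))  # superscript three
```

The fallback is a design choice with a cost: it keeps old files readable, but it silently hides a missing or misspelled declaration, which is exactly the inconsistency worried about above.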
| From | Laura Creighton <lac@openend.se> |
|---|---|
| Date | 2015-02-24 20:45 +0100 |
| Message-ID | <mailman.19146.1424807175.18130.python-list@python.org> |
| In reply to | #86311 |
In a message of Tue, 24 Feb 2015 12:13:24 -0500, Dave Angel writes:

> With a sample of one string, how did you read "all his strings". And with one non-ASCII code in that single string, how did you know that 'latin1' was the only encoding that included a reasonable character at that encoding?

Ah, 2 strings. And I did not promise that latin1 was the only encoding that included a reasonable char at his encoding. I only promised that it was one that did. And, given the nature of the data, I was pretty sure that this was the one he wanted. If it did not work, he would come back and complain.

> See http://support.esri.com/cn/knowledgebase/techarticles/detail/21106
>
> according to that page, starting at ArcGIS 10.2.1, the default sets the code page to UTF-8 (UNICODE) in the shapefile (.DBF)

Who cares. In Europe, among Europeans, we are used to seeing Latin1 or Latin2.

> My guess is that this is only appropriate for users who use only locally created data. Since the OP's data is apparently old (if it were current versions, it'd have been utf-8), who knows how consistent the encoding is.

I do. Very much so. The idea that the whole world loves utf-8 is nonsense. Most of Europe has been using latin1, latin2 etc. before unicode was invented and will, as far as I know, continue to use it. Oldness is an indication that latin1 is more likely to be the encoding than utf-8.

Your guess is that latin1 is only used in local encodings. My data is that, we in Western Europe, have this format pretty much all of the time, for everywhere, unless you are only doing local encodings (in which case you would use utf-8)

Laura
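Editorial sketch: the point that Latin-1 is "one" encoding giving a reasonable character for \xe9, not the only one, can be checked by decoding the same byte with several candidates. The candidate list is an arbitrary illustration, not a claim about which encodings the original data might have used.

```python
def decode_with_each(raw, encodings):
    """Decode `raw` with every encoding; store None where decoding fails."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# Arbitrary candidates: three Western/Central European, one Greek, plus UTF-8.
candidates = ["latin-1", "iso8859_2", "cp1252", "iso8859_7", "utf-8"]
results = decode_with_each(b"\xe9", candidates)

for enc, text in results.items():
    print(enc, repr(text))
```

Several candidates agree on 'é' for this byte, so the byte alone cannot settle the question; the Greek code page yields a different letter, and UTF-8 rejects a lone 0xE9 outright. Knowing the language of the text is what narrows the choice.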
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-02-25 00:21 +0200 |
| Message-ID | <87ioerc5n7.fsf@elektro.pacujo.net> |
| In reply to | #86340 |
Laura Creighton <lac@openend.se>:

> Who cares. In Europe, among Europeans, we are used to seeing Latin1 or Latin2.

No, it's UCS-2 (Windows) or UTF-8 (Linux) -- among us Europeans.

> The idea that the whole world loves utf-8 is nonsense.

Windows people don't care for UTF-8, they don't have to. Linux people use it. Love is not necessary. Me, I use en_US.UTF-8.

> Most of Europe has been using latin1, latin2 etc. before unicode was invented and will, as far as I know, continue to use it. Oldness is an indication that latin1 is more likely to be the encoding than utf-8.

Latin-1 is confined to HTML, if even there.

> My data is that, we in Western Europe, have this format pretty much all of the time, for everywhere, unless you are only doing local encodings (in which case you would use utf-8)

There's a third way, but it's not in Western Europe, as far as I can tell. Japan is another story. I don't know about Russia.

Marko
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-02-25 12:20 +1100 |
| Message-ID | <54ed2349$0$13004$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #86340 |
Laura Creighton wrote:

> The idea that the whole world loves utf-8 is nonsense.

I don't think anyone says the whole world loves UTF-8. I think people say that the whole world *ought to* love UTF-8, and that legacy encodings from the Windows "code-page" days ought to die.

> Most of Europe has been using latin1, latin2 etc. before unicode was invented and will, as far as I know, continue to use it.

And this is why people in Greece cannot transfer text files to people in France without the content changing (ISO-8859-7 vs ISO-8859-1). And why Russians cannot even swap text files with other Russians (a plethora of encodings).

:-(

--
Steven
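Editorial sketch: the Greece-to-France problem described above can be reproduced in a few lines. The sample word is an arbitrary Greek greeting chosen for illustration.

```python
# Text written on a machine that stores files as ISO-8859-7 (Greek).
greek = "Καλημέρα"
data = greek.encode("iso8859_7")

# The same bytes opened on a machine that assumes ISO-8859-1 (Latin-1).
seen_in_france = data.decode("latin-1")

# No error is raised: every byte maps to some Latin-1 character,
# so the reader gets accented Latin letters instead of Greek ones.
print(greek, "->", seen_in_france)

# With UTF-8 agreed on both ends, the text survives the transfer.
via_utf8 = greek.encode("utf-8").decode("utf-8")
```

The failure is silent, which is what makes it hard to notice: the bytes on disk are unchanged, and only the reader's assumption about them differs.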
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-02-25 06:34 -0800 |
| Message-ID | <52672557-6aa2-4cf4-a61c-7139cdca9cb0@googlegroups.com> |
| In reply to | #86368 |
======================== U+0001F601 U+0001F602 ...
| From | Laura Creighton <lac@openend.se> |
|---|---|
| Date | 2015-02-24 20:57 +0100 |
| Message-ID | <mailman.19147.1424807886.18130.python-list@python.org> |
| In reply to | #86311 |
Dave Angel, are you another native English speaker living in a world where ASCII is enough?

Laura