Groups > comp.lang.python > #86311 > unrolled thread

Newbie question about text encoding

Started by	pierrick.brihaye@gmail.com
First post	2015-02-24 02:49 -0800
Last post	2015-02-27 10:23 +1100
Articles	20 on this page of 158 — 19 participants


  Newbie question about text encoding pierrick.brihaye@gmail.com - 2015-02-24 02:49 -0800
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-24 22:09 +1100
    Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 06:25 -0500
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 15:55 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:03 +1100
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:06 +0100
      Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-24 08:01 -0800
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:07 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:10 +1100
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:24 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:33 +1100
    Re: Newbie question about text encoding random832@fastmail.us - 2015-02-24 10:38 -0500
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 17:20 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 03:24 +1100
    Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 12:13 -0500
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:45 +0100
      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-02-25 00:21 +0200
      Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:20 +1100
        Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-25 06:34 -0800
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:57 +0100
      Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:19 +1100
        Re: Newbie question about text encoding Marcos Almeida Azevedo <marcos.al.azevedo@gmail.com> - 2015-02-25 12:54 +0800
    Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 15:41 -0500
      Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 04:40 -0800
        Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 05:15 -0800
        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 00:24 +1100
          Re: Newbie question about text encoding Sam Raker <sam.raker@gmail.com> - 2015-02-26 08:45 -0800
            Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:08 -0800
        Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-02-26 12:02 -0500
          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:59 -0800
            Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-26 12:20 -0800
            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 09:13 +1100
            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 12:05 +1100
              Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-26 20:57 -0500
                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 16:58 +1100
                  Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 02:30 -0500
                    Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 22:54 +1100
                      Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 09:02 -0500
                      Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 01:22 +1100
                        Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:00 +0000
                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 03:12 +1100
                            Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:45 +0000
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 04:45 +1100
                                Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:13 +0000
                              Re: Newbie question about text encoding MRAB <python@mrabarnett.plus.com> - 2015-02-27 19:14 +0000
                                Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:09 +0000
                          Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 15:52 -0500
                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 08:04 +1100
                      Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 10:24 -0500
                      Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:46 +0000
                        Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:47 +0000
              Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-27 01:06 -0800
          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 11:59 -0800
          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 10:03 -0800
            Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-03 10:36 -0800
              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:45 -0800
                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 15:54 +1100
                  Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 21:05 -0800
                    Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 01:06 +1100
                      Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-05 06:59 -0800
                      Re: Newbie question about text encoding random832@fastmail.us - 2015-03-05 14:59 -0500
                        Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 09:33 +1100
                      Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-05 20:53 -0800
                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 16:20 +1100
                          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:02 -0800
                            Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:06 -0800
                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 08:33 -0500
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 00:39 +1100
                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:03 -0500
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 01:11 +1100
                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:27 -0500
                                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 03:26 +1100
                            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 20:54 +1100
                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 02:07 -0800
                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 01:50 +1100
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 02:27 +1100
                              Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 07:37 -0800
                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 08:20 -0800
                                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 03:45 +1100
                                Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:41 -0800
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:58 -0800
                                Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-07 01:11 -0500
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 23:43 -0800
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 00:55 -0800
                                    Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 01:08 -0800
                                  Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:25 -0800
                        Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 22:09 +1100
                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 22:33 +1100
                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 13:53 +0200
                            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 23:02 +1100
                            Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:07 +0000
                            Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 07:28 -0800
                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 02:40 +1100
                              Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 17:48 +0200
                                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:17 +1100
                                  Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:25 +0200
                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:41 +1100
                                      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:54 +0200
                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:58 +1100
                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:00 +1100
                                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:14 +0200
                                            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:26 +1100
                                              Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:50 +0200
                                                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:59 +1100
                                                  Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:02 +0000
                                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:13 +1100
                                                      Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:34 +0000
                                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:44 +1100
                                                        Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 19:00 +0000
                                                          Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 19:16 +0000
                                                        Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 21:01 +0200
                                    Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 16:40 +0000
                                      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:48 +0200
                                        Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 17:02 +0000
                                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:16 +0200
                                            Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 18:18 +0000
                                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:06 -0800
                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:53 +1100
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 11:03 -0800
                                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 12:45 +1100
                                  Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 09:20 +0200
                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 18:37 +1100
                                      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 10:09 +0200
                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 19:23 +1100
                                          Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-08 01:18 -0800
                                        Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:25 +1100
                                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 22:09 +0200
                                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 12:43 +1100
                                              Re: Newbie question about text encoding Ben Finney <ben+python@benfinney.id.au> - 2015-03-09 13:09 +1100
                                                Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-09 08:31 +0200
                                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 13:18 +1100
                                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-09 00:27 -0400
                                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 07:55 +1100
                                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 08:13 +1100
                                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 17:34 +1100
                                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 17:44 +1100
                                                Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 02:08 -0700
                                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 07:26 -0700
                                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-09 05:28 -0700
                                  Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 19:01 +1100
                          Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:13 +0000
                          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 23:23 -0800
                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:30 +1100
                          Re: Newbie question about text encoding Cameron Simpson <cs@zip.com.au> - 2015-03-09 13:09 +1100
                            Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-08 19:42 -0700
                  Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:16 +1100
            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 05:43 +1100
              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 18:53 -0800
            Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-03 18:30 -0500
            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 13:54 +1100
              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 14:02 +1100
              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:05 -0800
                Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:16 -0800
                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:14 +1100
                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-04 02:16 -0800
        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 04:29 +1100
          Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 10:09 +1100
            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 10:23 +1100


#86311 — Newbie question about text encoding

From	pierrick.brihaye@gmail.com
Date	2015-02-24 02:49 -0800
Subject	Newbie question about text encoding
Message-ID	<aae131a7-29a1-4f79-ac16-d1e223616c51@googlegroups.com>

Hello,

Working with pyshp, this is my code :

import shapefile

inFile = shapefile.Reader("blah")

for sr in inFile.shapeRecords():
    rec = sr.record[2]
    print("Output : ", rec, type(rec))

Output:  hippodrome du resto <class 'str'>
Output:  b'stade de man\xe9 braz' <class 'bytes'>

Why do I get two different types?
How do I get a string object when I have accented characters?

Thank you,

p.b.


#86314

From	Chris Angelico <rosuav@gmail.com>
Date	2015-02-24 22:09 +1100
Message-ID	<mailman.19124.1424776180.18130.python-list@python.org>
In reply to	#86311

On Tue, Feb 24, 2015 at 9:49 PM,  <pierrick.brihaye@gmail.com> wrote:
> Working with pyshp, this is my code :
>
> import shapefile
>
> inFile = shapefile.Reader("blah")
>
> for sr in inFile.shapeRecords():
>     rec = sr.record[2]
>     print("Output : ", rec, type(rec))
>
> Output:  hippodrome du resto <class 'str'>
> Output:  b'stade de man\xe9 braz' <class 'bytes'>
>
> Why do I get 2 different types ?
> How to get a string object when I have accented characters ?

I don't know what pyshp is doing here, so you may want to seek a
pyshp-specific mailing list for help. My guess is that it's
automatically decoding to str if it's ASCII-only, and giving you back
the raw bytes if there are any that it can't handle. The question is:
What encoding _is_ that? Do you know what character you're expecting
to see there? Before you can turn that into a string, you have to
figure out whether it's Latin-1 (ISO-8859-1), or some other ISO-8859-x
standard, or a Windows codepage, or an ancient thing off a Mac, or
whatever else it might be. Once you know that, it's easy: you just
decode() the bytes objects. But you MUST figure out the encoding
first.
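A minimal sketch of that last step, assuming the encoding has already been identified. Latin-1 is only a placeholder here, and to_text is a hypothetical helper, not something pyshp provides. It decodes the bytes values and passes the str values through unchanged:

```python
def to_text(value, encoding):
    """Return value as str, decoding it first if it is a bytes object."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# The two values from the original post; "latin-1" is an assumption,
# to be replaced by whatever the data really is.
records = ["hippodrome du resto", b"stade de man\xe9 braz"]
texts = [to_text(r, "latin-1") for r in records]
print(texts)
```

Passing the encoding explicitly keeps the guess visible at the call site instead of hiding it in a default.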

ChrisA


#86316

From	Dave Angel <davea@davea.name>
Date	2015-02-24 06:25 -0500
Message-ID	<mailman.19126.1424777143.18130.python-list@python.org>
In reply to	#86311

On 02/24/2015 05:49 AM, pierrick.brihaye@gmail.com wrote:
> Hello,
>
> Working with pyshp, this is my code :

What version of Python, what version of pyshp (and from where), and what OS?
This is the first information to supply in any query that goes
outside of the standard library.

For example, you might be running CPython 3.2.1 on Ubuntu 14.04.1, and 
installed pyshp 1.2.1 from https://pypi.python.org/pypi/pyshp

Or some other combination.

>
> import shapefile
>
> inFile = shapefile.Reader("blah")
>
> for sr in inFile.shapeRecords():
>      rec = sr.record[2]
>      print("Output : ", rec, type(rec))
>
> Output:  hippodrome du resto <class 'str'>
> Output:  b'stade de man\xe9 braz' <class 'bytes'>
>
> Why do I get 2 different types ?
> How to get a string object when I have accented characters ?
>
> Thank you,
>
> p.b.
>

From my (cursory) reading of the pyshp docs on the above page, I cannot 
see what the [2] element of the record list should look like.  So I'd 
have to guess.

The bytes object is presumably an encoded version of the character 
string.  I don't see anything on that page about unicode, or decode, so 
you might have to guess the encoding.  Anyway, you can decode the 
bytestring into a regular string if you can correctly guess the encoding 
method, such as utf-8.

If that were the right decoding, you could just use
     mystring = rec.decode()

But utf-8 does not seem to be the right encoding for that bytestring. 
So you'll need a form like:
     mystring = rec.decode(encoding='xxx')

for some value of xxx.
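A quick sketch of both forms on the bytestring from the post. The 'latin-1' value is only a stand-in for xxx (an assumption, not something the pyshp docs confirm):

```python
rec = b"stade de man\xe9 braz"

# rec.decode() uses UTF-8 by default in Python 3. Here it fails, because
# 0xE9 starts a multi-byte sequence and the next byte (a space) is not a
# valid continuation byte.
try:
    rec.decode()
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("not valid UTF-8:", exc)

# With an explicit single-byte encoding the decode succeeds.
mystring = rec.decode(encoding="latin-1")
print(mystring)
```

Note that a successful decode does not prove the guess was right: Latin-1 maps every byte to some character, so it never raises.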

-- 
DaveA


#86320

From	Laura Creighton <lac@openend.se>
Date	2015-02-24 15:55 +0100
Message-ID	<mailman.19132.1424789760.18130.python-list@python.org>
In reply to	#86311

In a message of Tue, 24 Feb 2015 06:25:24 -0500, Dave Angel writes:
>But utf-8 does not seem to be the right encoding for that bytestring. 
>So you'll need a form like:
>     mystring = rec.decode(encoding='xxx')
>
>for some value of xxx.

>DaveA

And the xxx you want is "latin1"

Laura


#86321

From	Chris Angelico <rosuav@gmail.com>
Date	2015-02-25 02:03 +1100
Message-ID	<mailman.19133.1424790205.18130.python-list@python.org>
In reply to	#86311

On Wed, Feb 25, 2015 at 1:55 AM, Laura Creighton <lac@openend.se> wrote:
> In a message of Tue, 24 Feb 2015 06:25:24 -0500, Dave Angel writes:
>>But utf-8 does not seem to be the right encoding for that bytestring.
>>So you'll need a form like:
>>     mystring = rec.decode(encoding='xxx')
>>
>>for some value of xxx.
>
>>DaveA
>
> And the xxx you want is "latin1"

Can you be sure it's Latin-1? I'm not certain of that. In any case, I
never advocate fixing encoding problems by "just do this and it'll all
go away"; you have to understand your data before you can decode it.

ChrisA


#86322

From	Laura Creighton <lac@openend.se>
Date	2015-02-24 16:06 +0100
Message-ID	<mailman.19134.1424790392.18130.python-list@python.org>
In reply to	#86311

In a message of Tue, 24 Feb 2015 15:55:41 +0100, Laura Creighton writes:
>In a message of Tue, 24 Feb 2015 06:25:24 -0500, Dave Angel writes:
>>But utf-8 does not seem to be the right encoding for that bytestring. 
>>So you'll need a form like:
>>     mystring = rec.decode(encoding='xxx')
>>
>>for some value of xxx.
>
>>DaveA
>
>And the xxx you want is "latin1"
>
>Laura

er, latin1, as in encoding='latin1'.  You don't want an extra set of quotes.
There are many aliases for latin1, e.g. latin_1, iso-8859-1, iso8859-1,
8859, cp819, latin, latin1, L1
see: https://docs.python.org/2.4/lib/standard-encodings.html

and you might want to read
https://docs.python.org/2/howto/unicode.html

to understand the problem better.
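For what it's worth, here is a small sketch that checks the aliases listed above against the codec registry. It relies only on the standard codecs module; the alias list is copied from this message:

```python
import codecs

aliases = ["latin_1", "iso-8859-1", "iso8859-1", "8859",
           "cp819", "latin", "latin1", "L1"]

# codecs.lookup() normalises the spelling and resolves aliases, so every
# name should end up at the same codec.
canonical = {codecs.lookup(name).name for name in aliases}
print(canonical)

# Any of the aliases can be passed straight to decode().
decoded = b"man\xe9".decode("L1")
```

An unknown name raises LookupError, which makes this a cheap way to check a spelling before using it.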

Laura


#86328

From	wxjmfauth@gmail.com
Date	2015-02-24 08:01 -0800
Message-ID	<7f2bdec0-483a-4ffb-ad64-09a687f385ba@googlegroups.com>
In reply to	#86322

Sorry, you are all wrong.
The coding is UCS1.
Python is so funny.

jmf


#86323

From	Laura Creighton <lac@openend.se>
Date	2015-02-24 16:07 +0100
Message-ID	<mailman.19135.1424790463.18130.python-list@python.org>
In reply to	#86311

In a message of Wed, 25 Feb 2015 02:03:16 +1100, Chris Angelico writes:
>On Wed, Feb 25, 2015 at 1:55 AM, Laura Creighton <lac@openend.se> wrote:
>> In a message of Tue, 24 Feb 2015 06:25:24 -0500, Dave Angel writes:
>>>But utf-8 does not seem to be the right encoding for that bytestring.
>>>So you'll need a form like:
>>>     mystring = rec.decode(encoding='xxx')
>>>
>>>for some value of xxx.
>>
>>>DaveA
>>
>> And the xxx you want is "latin1"
>
>Can you be sure it's Latin-1? I'm not certain of that. In any case, I
>never advocate fixing encoding problems by "just do this and it'll all
>go away"; you have to understand your data before you can decode it.
>
>ChrisA

I can, I speak French and I recognise the data.  It's French place names,
places where sporting events are held. :)

Laura


#86324

From	Chris Angelico <rosuav@gmail.com>
Date	2015-02-25 02:10 +1100
Message-ID	<mailman.19136.1424790645.18130.python-list@python.org>
In reply to	#86311

On Wed, Feb 25, 2015 at 2:07 AM, Laura Creighton <lac@openend.se> wrote:
>>Can you be sure it's Latin-1? I'm not certain of that. In any case, I
>>never advocate fixing encoding problems by "just do this and it'll all
>>go away"; you have to understand your data before you can decode it.
>>
>>ChrisA
>
> I can, I speak French and I recognise the data.  It's French place names,
> places where sporting events are held. :)

Ah, okay. :) But even with that level of confidence, you still have to
pick between Latin-1 and CP-1252, which you can't tell based on this
one snippet. Welcome to untagged encodings.
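To illustrate why this snippet cannot settle it, here is a short sketch comparing Python's latin-1 and cp1252 codecs. The byte 0x80 is picked only as an example of where they disagree:

```python
snippet = b"stade de man\xe9 braz"

# 0xE9 means the same character in both encodings, so the snippet decodes
# identically and gives no evidence either way.
same = snippet.decode("latin-1") == snippet.decode("cp1252")
print("snippet decodes identically:", same)

# The two only disagree on bytes 0x80-0x9F. CP-1252 puts printable
# characters there (or leaves the byte undefined); Latin-1 has C1 controls.
differing = [
    b for b in range(256)
    if bytes([b]).decode("latin-1")
    != bytes([b]).decode("cp1252", errors="replace")
]
print("bytes that differ:", hex(min(differing)), "to", hex(max(differing)))

as_cp1252 = b"\x80".decode("cp1252")    # EURO SIGN, U+20AC
as_latin1 = b"\x80".decode("latin-1")   # control character U+0080
print(ascii(as_cp1252), ascii(as_latin1))
```

So the choice only starts to matter once the data contains something like a euro sign or curly quotes.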

ChrisA


#86325

From	Laura Creighton <lac@openend.se>
Date	2015-02-24 16:24 +0100
Message-ID	<mailman.19137.1424791449.18130.python-list@python.org>
In reply to	#86311

In a message of Wed, 25 Feb 2015 02:10:42 +1100, Chris Angelico writes:
>On Wed, Feb 25, 2015 at 2:07 AM, Laura Creighton <lac@openend.se> wrote:
>>>Can you be sure it's Latin-1? I'm not certain of that. In any case, I
>>>never advocate fixing encoding problems by "just do this and it'll all
>>>go away"; you have to understand your data before you can decode it.
>>>
>>>ChrisA
>>
>> I can, I speak French and I recognise the data.  It's French place names,
>> places where sporting events are held. :)
>
>Ah, okay. :) But even with that level of confidence, you still have to
>pick between Latin-1 and CP-1252, which you can't tell based on this
>one snippet. Welcome to untagged encodings.
>
>ChrisA

Ah, yes, you are right about that.  I see CP-1252 about 2 times every 10
years, and latin1 every minute of my life, so I am biased to assume I
know what I am seeing.

ChrisA, you come from an English speaking country, right?

For those of us who come from countries whose language doesn't fit in
ASCII, the notion of 'understand the data' doesn't work very well.  We
already understand the data -- it's a set of words in our native language.
The hard part isn't understanding the data, but rather understanding how
the hell Python could be so stupid as to not understand it. :)  The
notion that Python normally only understands the subset of the
characters in your native language that English speakers use in their
language is not the most obvious thing.

And having taught countless European kids how to write their very first
program in Python, I can tell you for certain that the sort of deep
understanding of encoding methods is not what 10 year olds who just
want to print out the names of their friends, and their favourite
music titles, and their favourite musicians want to know. :)

Laura


#86326

From	Chris Angelico <rosuav@gmail.com>
Date	2015-02-25 02:33 +1100
Message-ID	<mailman.19138.1424792018.18130.python-list@python.org>
In reply to	#86311

On Wed, Feb 25, 2015 at 2:24 AM, Laura Creighton <lac@openend.se> wrote:
> Ah, yes, you are right about that.  I see CP-1252 about 2 times every 10
> years, and latin1 every minute of my life, so I am biased to assume I
> know what I am seeing.

Fair enough. CP-1252 is still a possibility, but the difference can be
dealt with later.

> ChrisA, you come from an English speaking country, right?

Yes (Australia, to be specific).

> For those of us who come from countries whose language doesn't fit in
> ASCII, the notion of 'understand the data' doesn't work very well.  We
> already understand the data -- it's a set of words in our native language.
> The hard part isn't understanding the data, but rather understanding how
> the hell Python could be so stupid as to not understand it. :)  The
> notion that Python normally only understands the subset of the
> characters in your native language that English speakers use in their
> language is not the most obvious thing.

Also a reasonable baseline assumption; but the trouble is that if you
automatically assume that text is encoded in your favourite eight-bit
system, you're taking a huge risk.

Now, you have a huge leg up on me, in that you actually recognize the
*words* in that piece of text. That means you can have MUCH greater
confidence in stating that it's Latin-1 than I can. But that's
precisely what I mean by "understand the data". If you, being a native
French speaker, pick up a file written in (say) Polish, and encoded
Latin-2, you'll recognize by the ASCII characters that it's not French
text, and probably you'd be able to spot that it ought to be Latin-2
rather than Latin-1. That's understanding the data, that's having more
information than just the byte patterns. A computer can't reliably do
that (just look up the "Bush hid the facts" bug if you don't believe
me), but a human often can.
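
To make the "byte patterns alone are not enough" point concrete, here is a minimal Python 3 sketch. The Polish sample word is an illustration chosen for this example, not data from the original poster; the codec names are the standard ones shipped with Python.

```python
# Illustrative sample (not from the original poster's data): a Polish word
# encoded as Latin-2 (ISO 8859-2).
text = "żółć"
raw = text.encode("iso8859-2")

# Latin-1 maps every byte value to some character, so decoding never
# fails, even when the guess is wrong.
as_latin1 = raw.decode("latin-1")
as_latin2 = raw.decode("iso8859-2")

print(repr(as_latin1))   # wrong characters, but no error raised
print(repr(as_latin2))   # the original word

# Latin-1 and CP-1252 agree on most bytes but differ in 0x80-0x9F.
euro_byte = b"\x80"
print(repr(euro_byte.decode("cp1252")))    # the Euro sign
print(repr(euro_byte.decode("latin-1")))   # a C1 control character
```

Nothing in the bytes themselves says which decoding is right; only knowledge of the language or of the file's origin does.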

> And having taught countless European kids how to write their very first
> program in Python, I can tell you for certain that the sort of deep
> understanding of encoding methods is not what 10 year olds who just
> want to print out the names of their friends, and their favourite
> music titles, and their favourite musicians want to know. :)

Right, so you should be teaching them to use Python 3, and always
saving everything in UTF-8, and basically ignoring the whole mess of
eight-bit encodings :)
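
A minimal sketch of what "Python 3 and UTF-8 everywhere" could look like for a beginner's program. The names and the temporary file are made up for the example; the only point is that the encoding is passed explicitly on both write and read.

```python
import os
import tempfile

names = ["Zoë", "Jérôme", "Åsa"]   # made-up sample names

# Always name the encoding explicitly instead of relying on the
# platform default, which differs between systems.
path = os.path.join(tempfile.mkdtemp(), "friends.txt")
with open(path, "w", encoding="utf-8") as f:
    for name in names:
        f.write(name + "\n")

with open(path, encoding="utf-8") as f:
    read_back = [line.rstrip("\n") for line in f]

print(read_back)
```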

ChrisA


#86327

From	random832@fastmail.us
Date	2015-02-24 10:38 -0500
Message-ID	<mailman.19139.1424792306.18130.python-list@python.org>
In reply to	#86311

On Tue, Feb 24, 2015, at 10:10, Chris Angelico wrote:
> Ah, okay. :) But even with that level of confidence, you still have to
> pick between Latin-1 and CP-1252, which you can't tell based on this
> one snippet. Welcome to untagged encodings.

Or Latin-9 (ISO 8859-15). That was popular on Linux systems for a while
before everyone switched to UTF-8 - it's got the Euro sign, and
(relevant to French) the "oe" ligature, and uppercase Y with diaeresis,
at the expense of "generic currency" and fractions.

Or it could be Latin-3, Latin-5 (8859-9), or Latin-8 (8859-14) - they
are not commonly used for French locales, being primarily intended for
other languages, but they do support all characters (at least all from
Latin-1) used in French names. I assume there are likewise several
Windows codepages it could be.
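
The overlap described above can be checked with Python's bundled codecs. This is only a sketch: it lists the byte values where Latin-1 and Latin-9 disagree, and confirms that the single byte 0xE9 decodes identically under each of the candidates named here (CP-1252 is added as one example of a Windows codepage).

```python
# Every byte value that Latin-1 and Latin-9 decode differently.
diffs = []
for value in range(256):
    byte = bytes([value])
    latin1 = byte.decode("iso8859-1")
    latin9 = byte.decode("iso8859-15")
    if latin1 != latin9:
        diffs.append((hex(value), latin1, latin9))

for entry in diffs:
    print(entry)

# The byte 0xE9 decodes to the same letter under every candidate encoding.
candidates = ["iso8859-1", "iso8859-3", "iso8859-9",
              "iso8859-14", "iso8859-15", "cp1252"]
decoded = {name: b"\xe9".decode(name) for name in candidates}
print(decoded)
```

So a snippet containing only that one accented letter cannot distinguish between any of these encodings.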


#86329

From	Laura Creighton <lac@openend.se>
Date	2015-02-24 17:20 +0100
Message-ID	<mailman.19141.1424794860.18130.python-list@python.org>
In reply to	#86311

In a message of Wed, 25 Feb 2015 02:33:30 +1100, Chris Angelico writes:
>Also a reasonable baseline assumption; but the trouble is that if you
>automatically assume that text is encoded in your favourite eight-bit
>system, you're taking a huge risk.

But, you know, I wasn't assuming this.  I actually read latin1.  I
could read it in its escaped ASCII form and know that \xe9 means 'é', a
letter that we also have in Swedish, so I am rather used to reading it.
And then, well, I could read all of his strings, know they were in
French, and know that latin1 was the encoding he needed to decode them with.

>Now, you have a huge leg up on me, in that you actually recognize the
>*words* in that piece of text. That means you can have MUCH greater
>confidence in stating that it's Latin-1 than I can. But that's
>precisely what I mean by "understand the data". If you, being a native
>French speaker, pick up a file written in (say) Polish, and encoded
>Latin-2, you'll recognize by the ASCII characters that it's not French
>text, and probably you'd be able to spot that it ought to be Latin-2
>rather than Latin-1. That's understanding the data, that's having more
>information than just the byte patterns. A computer can't reliably do
>that (just look up the "Bush hid the facts" bug if you don't believe
>me), but a human often can.

Absolutely correct.  But you must not require that all of the speakers
of non-English languages think about their languages as 'special
encodings'.  Only the monoglot ever think of a foreign language as
a code.

That poor guy the original poster just wants to have a nice string
of his sporting event place name.  We should tell him how to get that,
not how to be an expert in all the encodings on the face of this earth.
Chances are, the only thing he needs to talk about are French words.

If not, well, he will come back when things stop working, and have lots
more data to give him.  If, instead, this makes him go away happy, then
this was the very best thing to do.

>> And having taught countless European kids how to write their very first
>> program in Python, I can tell you for certain that the sort of deep
>> understanding of encoding methods is not what 10 year olds who just
>> want to print out the names of their friends, and their favourite
>> music titles, and their favourite musicians want to know. :)
>
>Right, so you should be teaching them to use Python 3, and always
>saving everything in UTF-8, and basically ignoring the whole mess of
>eight-bit encodings :)

Of course this makes sense.  But you seem to be missing the point.
People who are asking for help in getting things to work in their
native language need a 'do this quick' sort of answer.  The deeper
problems of supporting all languages and language encodings can very
much wait.  The OP wants a hunk of bytes that happens to mean
something in French, and is not encodable in the limited English
language to work like a different hunk of bytes that means something
in French but is encodable.

Don't overburden them.

>ChrisA

Laura


#86330

From	Chris Angelico <rosuav@gmail.com>
Date	2015-02-25 03:24 +1100
Message-ID	<mailman.19142.1424795081.18130.python-list@python.org>
In reply to	#86311

On Wed, Feb 25, 2015 at 3:20 AM, Laura Creighton <lac@openend.se> wrote:
> People who are asking for help in getting things to work in their
> native language need a 'do this quick' sort of answer.  The deeper
> problems of supporting all languages and language encodings can very
> much wait.

I'm not so sure about that. When "supporting all languages" is as
simple as "use Python 3 and UTF-8 everywhere", the cost is much lower
than it might be, and the benefit is potentially huge. A "do this
quick" answer might get you by *right now*, but it leaves open the
possibility of subtler errors. That's why Python moved to
Unicode-by-default, even though eight-bit encodings will tend to
produce the right results for simple text.
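
One of the "subtler errors" can be sketched as follows. The sample text is made up; the example only shows that a wrong eight-bit guess fails silently, and that pure-ASCII text hides the problem.

```python
original = "Café Noël"          # made-up sample text
utf8_bytes = original.encode("utf-8")

# Wrong guess: treating UTF-8 bytes as Latin-1 raises no error, it just
# silently produces different text.
mojibake = utf8_bytes.decode("latin-1")
print(repr(mojibake))

# Pure-ASCII text survives the same wrong guess, which is why the
# mistake can go unnoticed for a long time.
plain = "Cafe Noel"
assert plain.encode("utf-8").decode("latin-1") == plain

# Going the other way usually fails loudly instead.
try:
    original.encode("latin-1").decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True
print(strict_decode_failed)
```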

ChrisA


#86334

From	Dave Angel <davea@davea.name>
Date	2015-02-24 12:13 -0500
Message-ID	<mailman.19143.1424798024.18130.python-list@python.org>
In reply to	#86311

On 02/24/2015 11:20 AM, Laura Creighton wrote:
> In a message of Wed, 25 Feb 2015 02:33:30 +1100, Chris Angelico writes:
>> Also a reasonable baseline assumption; but the trouble is that if you
>> automatically assume that text is encoded in your favourite eight-bit
>> system, you're taking a huge risk.
>
> But, you know, I wasn't assuming this.  I actually read latin1.  I
> could read it in its escaped ASCII form and know that \xe9 means 'é', a
> letter that we also have in Swedish, so I am rather used to reading it.
> And then, well, I could read all of his strings, know they were in
> French, and know that latin1 was the encoding he needed to decode them with.

With a sample of one string, how did you read "all his strings"?  And
with one non-ASCII code in that single string, how did you know that
'latin1' was the only encoding that included a reasonable character at
that byte value?

>
>> Now, you have a huge leg up on me, in that you actually recognize the
>> *words* in that piece of text. That means you can have MUCH greater
>> confidence in stating that it's Latin-1 than I can. But that's
>> precisely what I mean by "understand the data". If you, being a native
>> French speaker, pick up a file written in (say) Polish, and encoded
>> Latin-2, you'll recognize by the ASCII characters that it's not French
>> text, and probably you'd be able to spot that it ought to be Latin-2
>> rather than Latin-1. That's understanding the data, that's having more
>> information than just the byte patterns. A computer can't reliably do
>> that (just look up the "Bush hid the facts" bug if you don't believe
>> me), but a human often can.
>
> Absolutely correct.  But you must not require that all of the speakers
> of non-English languages think about their languages as 'special
> encodings'.  Only the monoglot ever think of a foreign language as
> a code.

All languages are foreign.  All that can be written to a disk file are 
bytes.  Those have to have been encoded to represent some abstraction 
called a character set, or string.  The question is whether the encoding 
method is specified for the particular file type, or for the particular 
file.

See http://support.esri.com/cn/knowledgebase/techarticles/detail/21106

According to that page, starting at ArcGIS 10.2.1, the default sets the
code page to UTF-8 (UNICODE) in the shapefile (.DBF)

But in earlier ones, there's supposed to be a reference to the codepage 
used.  From that, one can presumably derive which decoder to use.
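
A rough sketch of "derive the decoder from the declared codepage", under loud assumptions: `decode_field` is a hypothetical helper, and `declared_encoding` stands for whatever codepage name the file's metadata provides. How to extract that name from a real shapefile or DBF is not shown and is not taken from the ESRI page. The place name is a made-up sample.

```python
import codecs

def decode_field(raw, declared_encoding):
    """Decode raw bytes using an encoding name declared alongside the file.

    Hypothetical helper: 'declared_encoding' is assumed to come from the
    file's own metadata; extracting it is outside this sketch.
    """
    try:
        codecs.lookup(declared_encoding)    # raises LookupError if unknown
    except LookupError:
        raise ValueError("unknown declared encoding: %r" % declared_encoding)
    # Strict decoding: fail loudly rather than return silently wrong text.
    return raw.decode(declared_encoding)

print(decode_field(b"Saint-\xc9tienne", "latin-1"))
print(decode_field("Saint-Étienne".encode("utf-8"), "utf-8"))
```

The design choice is to trust the declaration and fail on a mismatch, rather than guess from the bytes.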


>
> That poor guy the original poster just wants to have a nice string
> of his sporting event place name.  We should tell him how to get that,
> not how to be an expert in all the encodings on the face of this earth.
> Chances are, the only thing he needs to talk about are French words.
>
> If not, well, he will come back when things stop working, and have lots
> more data to give him.  If, instead, this makes him go away happy, then
> this was the very best thing to do.
>
>>> And having taught countless European kids how to write their very first
>>> program in Python, I can tell you for certain that the sort of deep
>>> understanding of encoding methods is not what 10 year olds who just
>>> want to print out the names of their friends, and their favourite
>>> music titles, and their favourite musicians want to know. :)
>>
>> Right, so you should be teaching them to use Python 3, and always
>> saving everything in UTF-8, and basically ignoring the whole mess of
>> eight-bit encodings :)
>
> Of course this makes sense.  But you seem to be missing the point.
> People who are asking for help in getting things to work in their
> native language need a 'do this quick' sort of answer.  The deeper
> problems of supporting all languages and language encodings can very
> much wait.  The OP wants a hunk of bytes that happens to mean
> something in French, and is not encodable in the limited English
> language to work like a different hunk of bytes that means something
> in French but is encodable.
>
> Don't overburden them.

My guess is that this is only appropriate for users who use only locally 
created data.  Since the OP's data is apparently old (if it were current 
versions, it'd have been utf-8), who knows how consistent the encoding is.

-- 
DaveA


#86340

From	Laura Creighton <lac@openend.se>
Date	2015-02-24 20:45 +0100
Message-ID	<mailman.19146.1424807175.18130.python-list@python.org>
In reply to	#86311

In a message of Tue, 24 Feb 2015 12:13:24 -0500, Dave Angel writes:
>With a sample of one string, how did you read "all his strings"?  And
>with one non-ASCII code in that single string, how did you know that
>'latin1' was the only encoding that included a reasonable character at
>that byte value?

Ah, 2 strings.  And I did not promise that latin1 was
the only encoding that included a reasonable char at
that byte value.  I only promised that it was one that did.
And, given the nature of the data, I was pretty sure that
this was the one he wanted.  If it did not work, he
would come back and complain.

>See http://support.esri.com/cn/knowledgebase/techarticles/detail/21106
>
>according to that page, starting at ArcGIS 10.2.1, the default sets the 
>code page to UTF-8 (UNICODE) in the shapefile (.DBF)

Who cares.   In Europe, among Europeans, we are used to seeing
Latin1 or Latin2.

>My guess is that this is only appropriate for users who use only locally 
>created data.  Since the OP's data is apparently old (if it were current 
>versions, it'd have been utf-8), who knows how consistent the encoding is.

I do.  Very much so.  The idea that the whole world loves utf-8 is
nonsense.  Most of Europe has been using latin1, latin2 etc. since before
unicode was invented and will, as far as I know, continue to use them.
Oldness is an indication that latin1 is more likely to be the encoding
than utf-8.

Your guess is that latin1 is only used in local encodings.

My data is that we in Western Europe have this format pretty much all
of the time, everywhere, unless you are only doing local
encodings (in which case you would use utf-8).

Laura


#86348

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-02-25 00:21 +0200
Message-ID	<87ioerc5n7.fsf@elektro.pacujo.net>
In reply to	#86340

Laura Creighton <lac@openend.se>:

> Who cares.   In Europe, among Europeans, we are used to seeing
> Latin1 or Latin2.

No, it's UCS-2 (Windows) or UTF-8 (Linux) -- among us Europeans.

> The idea that the whole world loves utf-8 is nonsense.

Windows people don't care for UTF-8, they don't have to. Linux people
use it. Love is not necessary.

Me, I use en_US.UTF-8.

> Most of Europe has been using latin1, latin2 etc. since before unicode
> was invented and will, as far as I know, continue to use them. Oldness is
> an indication that latin1 is more likely to be the encoding than utf-8.

Latin-1 is confined to HTML, if even there.

> My data is that we in Western Europe have this format pretty much
> all of the time, everywhere, unless you are only doing local
> encodings (in which case you would use utf-8).

There's a third way, but it's not in Western Europe, as far as I can
tell. Japan is another story. I don't know about Russia.


Marko


#86368

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-02-25 12:20 +1100
Message-ID	<54ed2349$0$13004$c3e8da3$5496439d@news.astraweb.com>
In reply to	#86340

Laura Creighton wrote:

> The idea that the whole world loves utf-8 is nonsense.

I don't think anyone says the whole world loves UTF-8. I think people say
that the whole world *ought to* love UTF-8, and that legacy encodings from
the Windows "code-page" days ought to die.

> Most of Europe has been using latin1, latin2 etc. since before
> unicode was invented and will, as far as I know, continue to use them.

And this is why people in Greece cannot transfer text files to people in
France without the content changing (ISO-8859-7 vs ISO-8859-1). And why
Russians cannot even swap text files with other Russians (a plethora of
encodings).

:-(
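
A small sketch of the file-swapping problem described above, using made-up sample strings. KOI8-R and CP-1251 are picked here as two examples of Russian encodings; the post itself does not name any.

```python
greek = "αβγ"                       # made-up sample text
raw = greek.encode("iso8859-7")    # what the sender's file contains

# The receiver assumes ISO-8859-1: no error, different content.
seen_in_france = raw.decode("iso8859-1")
print(repr(seen_in_france))

# Two Russian encodings disagree in the same way.
russian = "Привет"
seen_wrong = russian.encode("koi8-r").decode("cp1251")
print(repr(seen_wrong))

# With UTF-8 on both ends, the text round-trips unchanged.
assert greek.encode("utf-8").decode("utf-8") == greek
assert russian.encode("utf-8").decode("utf-8") == russian
```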

-- 
Steven


#86401

From	wxjmfauth@gmail.com
Date	2015-02-25 06:34 -0800
Message-ID	<52672557-6aa2-4cf4-a61c-7139cdca9cb0@googlegroups.com>
In reply to	#86368

========================

U+0001F601
U+0001F602

...


#86341

From	Laura Creighton <lac@openend.se>
Date	2015-02-24 20:57 +0100
Message-ID	<mailman.19147.1424807886.18130.python-list@python.org>
In reply to	#86311

Dave Angel, are you another native English speaker living in a world
where ASCII is enough?

Laura
