Groups > comp.lang.python > #86311 > unrolled thread

Newbie question about text encoding

Started by	pierrick.brihaye@gmail.com
First post	2015-02-24 02:49 -0800
Last post	2015-02-27 10:23 +1100
Articles	20 on this page of 158 — 19 participants

  Newbie question about text encoding pierrick.brihaye@gmail.com - 2015-02-24 02:49 -0800
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-24 22:09 +1100
    Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 06:25 -0500
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 15:55 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:03 +1100
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:06 +0100
      Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-24 08:01 -0800
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:07 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:10 +1100
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:24 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:33 +1100
    Re: Newbie question about text encoding random832@fastmail.us - 2015-02-24 10:38 -0500
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 17:20 +0100
    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 03:24 +1100
    Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 12:13 -0500
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:45 +0100
      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-02-25 00:21 +0200
      Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:20 +1100
        Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-25 06:34 -0800
    Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:57 +0100
      Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:19 +1100
        Re: Newbie question about text encoding Marcos Almeida Azevedo <marcos.al.azevedo@gmail.com> - 2015-02-25 12:54 +0800
    Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 15:41 -0500
      Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 04:40 -0800
        Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 05:15 -0800
        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 00:24 +1100
          Re: Newbie question about text encoding Sam Raker <sam.raker@gmail.com> - 2015-02-26 08:45 -0800
            Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:08 -0800
        Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-02-26 12:02 -0500
          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:59 -0800
            Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-26 12:20 -0800
            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 09:13 +1100
            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 12:05 +1100
              Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-26 20:57 -0500
                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 16:58 +1100
                  Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 02:30 -0500
                    Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 22:54 +1100
                      Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 09:02 -0500
                      Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 01:22 +1100
                        Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:00 +0000
                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 03:12 +1100
                            Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:45 +0000
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 04:45 +1100
                                Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:13 +0000
                              Re: Newbie question about text encoding MRAB <python@mrabarnett.plus.com> - 2015-02-27 19:14 +0000
                                Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:09 +0000
                          Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 15:52 -0500
                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 08:04 +1100
                      Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 10:24 -0500
                      Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:46 +0000
                        Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:47 +0000
              Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-27 01:06 -0800
          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 11:59 -0800
          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 10:03 -0800
            Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-03 10:36 -0800
              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:45 -0800
                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 15:54 +1100
                  Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 21:05 -0800
                    Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 01:06 +1100
                      Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-05 06:59 -0800
                      Re: Newbie question about text encoding random832@fastmail.us - 2015-03-05 14:59 -0500
                        Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 09:33 +1100
                      Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-05 20:53 -0800
                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 16:20 +1100
                          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:02 -0800
                            Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:06 -0800
                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 08:33 -0500
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 00:39 +1100
                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:03 -0500
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 01:11 +1100
                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:27 -0500
                                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 03:26 +1100
                            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 20:54 +1100
                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 02:07 -0800
                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 01:50 +1100
                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 02:27 +1100
                              Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 07:37 -0800
                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 08:20 -0800
                                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 03:45 +1100
                                Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:41 -0800
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:58 -0800
                                Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-07 01:11 -0500
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 23:43 -0800
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 00:55 -0800
                                    Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 01:08 -0800
                                  Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:25 -0800
                        Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 22:09 +1100
                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 22:33 +1100
                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 13:53 +0200
                            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 23:02 +1100
                            Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:07 +0000
                            Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 07:28 -0800
                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 02:40 +1100
                              Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 17:48 +0200
                                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:17 +1100
                                  Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:25 +0200
                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:41 +1100
                                      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:54 +0200
                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:58 +1100
                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:00 +1100
                                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:14 +0200
                                            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:26 +1100
                                              Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:50 +0200
                                                Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:59 +1100
                                                  Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:02 +0000
                                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:13 +1100
                                                      Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:34 +0000
                                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:44 +1100
                                                        Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 19:00 +0000
                                                          Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 19:16 +0000
                                                        Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 21:01 +0200
                                    Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 16:40 +0000
                                      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:48 +0200
                                        Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 17:02 +0000
                                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:16 +0200
                                            Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 18:18 +0000
                                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:06 -0800
                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:53 +1100
                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 11:03 -0800
                                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 12:45 +1100
                                  Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 09:20 +0200
                                    Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 18:37 +1100
                                      Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 10:09 +0200
                                        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 19:23 +1100
                                          Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-08 01:18 -0800
                                        Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:25 +1100
                                          Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 22:09 +0200
                                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 12:43 +1100
                                              Re: Newbie question about text encoding Ben Finney <ben+python@benfinney.id.au> - 2015-03-09 13:09 +1100
                                                Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-09 08:31 +0200
                                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 13:18 +1100
                                              Re: Newbie question about text encoding random832@fastmail.us - 2015-03-09 00:27 -0400
                                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 07:55 +1100
                                          Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 08:13 +1100
                                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 17:34 +1100
                                              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 17:44 +1100
                                                Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 02:08 -0700
                                                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 07:26 -0700
                                              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-09 05:28 -0700
                                  Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 19:01 +1100
                          Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:13 +0000
                          Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 23:23 -0800
                            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:30 +1100
                          Re: Newbie question about text encoding Cameron Simpson <cs@zip.com.au> - 2015-03-09 13:09 +1100
                            Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-08 19:42 -0700
                  Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:16 +1100
            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 05:43 +1100
              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 18:53 -0800
            Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-03 18:30 -0500
            Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 13:54 +1100
              Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 14:02 +1100
              Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:05 -0800
                Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:16 -0800
                Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:14 +1100
                  Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-04 02:16 -0800
        Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 04:29 +1100
          Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 10:09 +1100
            Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 10:23 +1100

#87112

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-03-07 19:14 +0200
Message-ID	<877fusybkb.fsf@elektro.pacujo.net>
In reply to	#87110

Chris Angelico <rosuav@gmail.com>:

> If you really REALLY can't use the bytes() type to work with something
> that is, yaknow, bytes, then you could use an alternative encoding
> that has a value for every byte. It's still not Unicode text, so it
> doesn't much matter which encoding you use. But it's much better to
> use the bytes type to work with bytes. It is not text, so don't treat
> it as text.

See:

   $ mkdir /tmp/xyz
   $ touch /tmp/xyz/$'\x80'
   $ python3
   Python 3.3.2 (default, Dec  4 2014, 12:49:00) 
   [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import os
   >>> os.listdir('/tmp/xyz')
   ['\udc80']
   >>> open(os.listdir('/tmp/xyz')[0])
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   FileNotFoundError: [Errno 2] No such file or directory: '\udc80'

File names encoded with Latin-X are quite commonplace even in UTF-8
locales.
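
A minimal sketch of where the '\udc80' above comes from, assuming the UTF-8 locale shown in the session. On POSIX systems Python decodes file names with the surrogateescape error handler (PEP 383), which maps each undecodable byte 0xNN to the lone surrogate U+DCNN. The variable names are illustrative only.

```python
# Byte 0x80 is not valid UTF-8 on its own, so a strict decode fails.
raw = b"\x80"
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# surrogateescape maps the bad byte to the lone surrogate U+DC80 instead.
name = raw.decode("utf-8", errors="surrogateescape")

# The mapping is reversible, so the original byte can be recovered.
round_tripped = name.encode("utf-8", errors="surrogateescape")

print(strict_decode_ok, ascii(name), round_tripped)
```

The str form is therefore a carrier for the original bytes rather than meaningful text.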


Marko

#87114

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-08 04:26 +1100
Message-ID	<mailman.154.1425749228.21433.python-list@python.org>
In reply to	#87112

On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> See:
>
>    $ mkdir /tmp/xyz
>    $ touch /tmp/xyz/$'\x80'
>    $ python3
>    Python 3.3.2 (default, Dec  4 2014, 12:49:00)
>    [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
>    Type "help", "copyright", "credits" or "license" for more information.
>    >>> import os
>    >>> os.listdir('/tmp/xyz')
>    ['\udc80']
>    >>> open(os.listdir('/tmp/xyz')[0])
>    Traceback (most recent call last):
>      File "<stdin>", line 1, in <module>
>    FileNotFoundError: [Errno 2] No such file or directory: '\udc80'
>
> File names encoded with Latin-X are quite commonplace even in UTF-8
> locales.

That is not a problem with UTF-8, though. I don't understand how
you're blaming UTF-8 for that. There are two things happening here:

1) The underlying file system is not UTF-8, and you can't depend on
that, ergo the decode to Unicode has to have some special handling of
failing bytes.
2) You forgot to put the path on that, so it failed to find the file.
Here's my version of your demo:

>>> open("/tmp/xyz/"+os.listdir('/tmp/xyz')[0])
<_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'>

Looks fine to me.

Alternatively, if you pass a byte string to os.listdir, you get back a
list of byte string file names:

>>> os.listdir(b"/tmp/xyz")
[b'\x80']
>>> open(b"/tmp/xyz/"+os.listdir(b'/tmp/xyz')[0])
<_io.TextIOWrapper name=b'/tmp/xyz/\x80' mode='r' encoding='UTF-8'>

Either way works. You can use bytes or text, and if you use text,
there is a way to smuggle bytes through it. None of this has anything
to do with UTF-8 as an encoding. (Note that the "encoding='UTF-8'"
note in the response has to do with the presumed encoding of the file
contents, not of the file name. As an empty file, it can be considered
to be a stream of zero Unicode characters, encoded UTF-8, so that's
valid.)
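
The two approaches can be put side by side in a small sketch. It assumes a POSIX file system that accepts arbitrary bytes in names (as on typical Linux setups). The helper name `list_both_ways` and the temporary directory are made up for illustration.

```python
import os
import tempfile


def list_both_ways(raw_name=b"\x80"):
    """Create one oddly named file, then list and open it as str and as bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)

        # Create an empty file whose name is not valid UTF-8.
        with open(os.path.join(tmp_bytes, raw_name), "wb"):
            pass

        as_text = os.listdir(tmp)         # str argument, str results
        as_bytes = os.listdir(tmp_bytes)  # bytes argument, bytes results

        # Either form opens the file once it is joined to its directory.
        with open(os.path.join(tmp, as_text[0]), "rb"):
            pass
        with open(os.path.join(tmp_bytes, as_bytes[0]), "rb"):
            pass

        return as_text, as_bytes


if __name__ == "__main__":
    text_names, byte_names = list_both_ways()
    print(ascii(text_names), byte_names)
```

The only rule is to keep the directory part and the name part in the same type when joining them.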

ChrisA

#87115

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-03-07 19:50 +0200
Message-ID	<87y4n8wvc3.fsf@elektro.pacujo.net>
In reply to	#87114

Chris Angelico <rosuav@gmail.com>:

> On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> File names encoded with Latin-X are quite commonplace even in UTF-8
>> locales.
>
> That is not a problem with UTF-8, though. I don't understand how
> you're blaming UTF-8 for that.

I'm saying it creates practical problems. There's a snake in the
paradise.

> There are two things happening here:
>
> 1) The underlying file system is not UTF-8, and you can't depend on
> that,

Correct. Linux pathnames are octet strings regardless of the locale.

That's why Linux developers should refer to filenames using bytes.
Unfortunately, Python itself violates that principle by having
os.listdir() return str objects (to mention one example).

> 2) You forgot to put the path on that, so it failed to find the file.
> Here's my version of your demo:
>
> >>> open("/tmp/xyz/"+os.listdir('/tmp/xyz')[0])
> <_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'>
>
> Looks fine to me.

I stand corrected.

Then we have:

   >>> os.listdir()[0].encode('utf-8')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
   position 0: surrogates not allowed
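
The error is what a strict encode does with a lone surrogate. A short sketch of the alternatives, using only the standard codec error handlers. Which one fits depends on whether the result is headed back to the file system or to a display or log; the variable names are illustrative.

```python
name = "\udc80"  # the name returned by os.listdir() in the session above

# A strict UTF-8 encode refuses lone surrogates.
try:
    name.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# For handing back to the OS: recover the original byte.
for_os = name.encode("utf-8", errors="surrogateescape")

# For display or logging: produce something printable instead.
for_log = name.encode("utf-8", errors="backslashreplace")
for_display = name.encode("utf-8", errors="replace")

print(strict_encode_ok, for_os, for_log, for_display)
```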

Marko

#87116

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-08 04:59 +1100
Message-ID	<mailman.155.1425751200.21433.python-list@python.org>
In reply to	#87115

On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> There are two things happening here:
>>
>> 1) The underlying file system is not UTF-8, and you can't depend on
>> that,
>
> Correct. Linux pathnames are octet strings regardless of the locale.
>
> That's why Linux developers should refer to filenames using bytes.
> Unfortunately, Python itself violates that principle by having
> os.listdir() return str objects (to mention one example).

Only because you gave it a str with the path name. If you want to
refer to file names using bytes, then be consistent and refer to ALL
file names using bytes. As I demonstrated, that works just fine.

>> 2) You forgot to put the path on that, so it failed to find the file.
>> Here's my version of your demo:
>>
>> >>> open("/tmp/xyz/"+os.listdir('/tmp/xyz')[0])
>> <_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'>
>>
>> Looks fine to me.
>
> I stand corrected.
>
> Then we have:
>
>    >>> os.listdir()[0].encode('utf-8')
>    Traceback (most recent call last):
>      File "<stdin>", line 1, in <module>
>    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
>    position 0: surrogates not allowed

So?

ChrisA

#87117

From	Dan Sommers <dan@tombstonezero.net>
Date	2015-03-07 18:02 +0000
Message-ID	<mdfega$6bj$1@dont-email.me>
In reply to	#87116

On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote:

> On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa <marko@pacujo.net> wrote:

>> Correct. Linux pathnames are octet strings regardless of the locale.
>>
>> That's why Linux developers should refer to filenames using bytes.
>> Unfortunately, Python itself violates that principle by having
>> os.listdir() return str objects (to mention one example).
> 
> Only because you gave it a str with the path name. If you want to
> refer to file names using bytes, then be consistent and refer to ALL
> file names using bytes. As I demonstrated, that works just fine.

Python 3.4.2 (default, Oct  8 2014, 10:45:20) 
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> type(os.listdir(os.curdir)[0])
<class 'str'>

#87118

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-08 05:13 +1100
Message-ID	<mailman.156.1425751992.21433.python-list@python.org>
In reply to	#87117

On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers <dan@tombstonezero.net> wrote:
> On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote:
>
>> On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>
>>> Correct. Linux pathnames are octet strings regardless of the locale.
>>>
>>> That's why Linux developers should refer to filenames using bytes.
>>> Unfortunately, Python itself violates that principle by having
>>> os.listdir() return str objects (to mention one example).
>>
>> Only because you gave it a str with the path name. If you want to
>> refer to file names using bytes, then be consistent and refer to ALL
>> file names using bytes. As I demonstrated, that works just fine.
>
> Python 3.4.2 (default, Oct  8 2014, 10:45:20)
> [GCC 4.9.1] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> >>> type(os.listdir(os.curdir)[0])
> <class 'str'>

Help on module os:

DESCRIPTION
    This exports:
      - os.curdir is a string representing the current directory ('.' or ':')
      - os.pardir is a string representing the parent directory ('..' or '::')

Explicitly documented as strings. If you want to work with strings,
work with strings. If you want to work with bytes, don't use
os.curdir, use bytes instead. Personally, I'm happy using strings, but
if you want to go down the path of using bytes, you simply have to be
consistent, and that probably means being platform-dependent anyway,
so just use b"." for the current directory.
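
A small sketch of that consistency rule: os.listdir() returns names of the same type as its argument, so the bytes spelling of the current directory yields bytes names. os.fsencode() appears here only to show that b"." is the bytes form of os.curdir; the variable names are illustrative.

```python
import os

text_names = os.listdir(os.curdir)  # str in, str out
byte_names = os.listdir(b".")       # bytes in, bytes out

all_text = all(isinstance(n, str) for n in text_names)
all_bytes = all(isinstance(n, bytes) for n in byte_names)

# os.curdir is documented as a string; this is its bytes counterpart.
curdir_as_bytes = os.fsencode(os.curdir)

print(all_text, all_bytes, curdir_as_bytes)
```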

Normally, using Unicode strings for file names will work just fine.
Any name that you craft yourself will be correctly encoded for the
target file system (or UTF-8 if you can't know), and any that you get
back from os.listdir or equivalent will be usable in file name
contexts. What else can you do with a file name that isn't encoded the
way you expect it to be? Unless you have some out-of-band encoding
information, you can't do anything meaningful with the stream of
bytes, other than keeping it exactly as it is.
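
As a hedged illustration of the out-of-band case: if something outside the program says the name was written as Latin-1, the original bytes can be recovered from the smuggled form and decoded again. The name b"caf\xe9" and the Latin-1 assumption are invented for this example.

```python
# A Latin-1 encoded name as it sits on disk (hypothetical).
raw = b"caf\xe9"

# What a str-based listing would hand back under a UTF-8 locale.
as_listed = raw.decode("utf-8", errors="surrogateescape")

# Recover the exact bytes, then apply the encoding learned out of band.
recovered = as_listed.encode("utf-8", errors="surrogateescape")
readable = recovered.decode("latin-1")

print(ascii(as_listed), recovered, ascii(readable))
```

Without that outside knowledge, the safe course remains the one described above: pass the name through unchanged.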

ChrisA

#87121

From	Dan Sommers <dan@tombstonezero.net>
Date	2015-03-07 18:34 +0000
Message-ID	<mdfgbj$qdm$1@dont-email.me>
In reply to	#87118

On Sun, 08 Mar 2015 05:13:09 +1100, Chris Angelico wrote:

> On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers <dan@tombstonezero.net> wrote:
>> On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote:
>>
>>> On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>>
>>>> Correct. Linux pathnames are octet strings regardless of the locale.
>>>>
>>>> That's why Linux developers should refer to filenames using bytes.
>>>> Unfortunately, Python itself violates that principle by having
>>>> os.listdir() return str objects (to mention one example).
>>>
>>> Only because you gave it a str with the path name. If you want to
>>> refer to file names using bytes, then be consistent and refer to ALL
>>> file names using bytes. As I demonstrated, that works just fine.
>>
>> Python 3.4.2 (default, Oct  8 2014, 10:45:20)
>> [GCC 4.9.1] on linux
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import os
>> >>> type(os.listdir(os.curdir)[0])
>> <class 'str'>
> 
> Help on module os:
> 
> DESCRIPTION
>     This exports:
>       - os.curdir is a string representing the current directory ('.' or ':')
>       - os.pardir is a string representing the parent directory ('..' or '::')
> 
> Explicitly documented as strings. If you want to work with strings,
> work with strings. If you want to work with bytes, don't use
> os.curdir, use bytes instead. Personally, I'm happy using strings, but
> if you want to go down the path of using bytes, you simply have to be
> consistent, and that probably means being platform-dependent anyway,
> so just use b"." for the current directory.

I think we're all agreeing:  not all file systems are the same, and
Python doesn't smooth out all of the bumps, even for something that
seems as simple as displaying the names of files in a directory.  And
that's *after* we've agreed that filesystems contain files in
hierarchical directories.

Dan

#87122

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-08 05:44 +1100
Message-ID	<mailman.159.1425753853.21433.python-list@python.org>
In reply to	#87121

On Sun, Mar 8, 2015 at 5:34 AM, Dan Sommers <dan@tombstonezero.net> wrote:
> I think we're all agreeing:  not all file systems are the same, and
> Python doesn't smooth out all of the bumps, even for something that
> seems as simple as displaying the names of files in a directory.  And
> that's *after* we've agreed that filesystems contain files in
> hierarchical directories.

I think you and I are in agreement. No idea about Marko, I'm still not
entirely sure what he's saying.

Python can't smooth out all of the bumps in file systems, any more
than Unicode can smooth out the bumps in natural language, or TCP can
smooth out the bumps in IP. The abstraction layers help, but every now
and then they leak, and you have to cope with the underlying mess.

ChrisA

#87123

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2015-03-07 19:00 +0000
Message-ID	<mailman.160.1425754874.21433.python-list@python.org>
In reply to	#87121

On 07/03/2015 18:34, Dan Sommers wrote:
> On Sun, 08 Mar 2015 05:13:09 +1100, Chris Angelico wrote:
>
>> On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers <dan@tombstonezero.net> wrote:
>>> On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote:
>>>
>>>> On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>>>
>>>>> Correct. Linux pathnames are octet strings regardless of the locale.
>>>>>
>>>>> That's why Linux developers should refer to filenames using bytes.
>>>>> Unfortunately, Python itself violates that principle by having
>>>>> os.listdir() return str objects (to mention one example).
>>>>
>>>> Only because you gave it a str with the path name. If you want to
>>>> refer to file names using bytes, then be consistent and refer to ALL
>>>> file names using bytes. As I demonstrated, that works just fine.
>>>
>>> Python 3.4.2 (default, Oct  8 2014, 10:45:20)
>>> [GCC 4.9.1] on linux
>>> Type "help", "copyright", "credits" or "license" for more information.
>>> >>> import os
>>> >>> type(os.listdir(os.curdir)[0])
>>> <class 'str'>
>>
>> Help on module os:
>>
>> DESCRIPTION
>>      This exports:
>>        - os.curdir is a string representing the current directory ('.' or ':')
>>        - os.pardir is a string representing the parent directory ('..' or '::')
>>
>> Explicitly documented as strings. If you want to work with strings,
>> work with strings. If you want to work with bytes, don't use
>> os.curdir, use bytes instead. Personally, I'm happy using strings, but
>> if you want to go down the path of using bytes, you simply have to be
>> consistent, and that probably means being platform-dependent anyway,
>> so just use b"." for the current directory.
>
> I think we're all agreeing:  not all file systems are the same, and
> Python doesn't smooth out all of the bumps, even for something that
> seems as simple as displaying the names of files in a directory.  And
> that's *after* we've agreed that filesystems contain files in
> hierarchical directories.
>
> Dan
>

Isn't pathlib
https://docs.python.org/3/library/pathlib.html#module-pathlib
effectively a more recent attempt at smoothing or even removing (some
of) the bumps?  Has anybody here any experience with it?  I've never used it.
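
A rough sketch of what that looks like in practice, assuming only the documented pathlib and os.fsencode() calls (the file name example.txt is invented for the example). As far as I can tell pathlib works in str throughout, so the bytes view is still an explicit extra step rather than something it smooths away:

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    (directory / "example.txt").write_text("hello\n")

    entries = sorted(directory.iterdir())
    # Path components are str, much like os.listdir() called with a str path.
    names = [p.name for p in entries]
    # Dropping down to the byte-level name is a separate, explicit conversion.
    raw_names = [os.fsencode(p.name) for p in entries]

print(names, raw_names)
```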

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

#87127

From	Dan Sommers <dan@tombstonezero.net>
Date	2015-03-07 19:16 +0000
Message-ID	<mdfiqq$qdm$2@dont-email.me>
In reply to	#87123

On Sat, 07 Mar 2015 19:00:47 +0000, Mark Lawrence wrote:

> Isn't pathlib
> https://docs.python.org/3/library/pathlib.html#module-pathlib
> effectively a more recent attempt at smoothing or even removing (some
> of) the bumps?  Has anybody here got experience of it as I've never
> used it?

I almost said something about Common Lisp's PATHNAME type, but I didn't.

An extremely quick reading of that page tells me that pathlib
addresses *some* of the issues that PATHNAME addresses, but pathlib
seems more limited in scope (e.g., pathlib doesn't account for
filesystems with versioned files).  I'll certainly have a closer look
later.

#87124

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-03-07 21:01 +0200
Message-ID	<87twxwws1k.fsf@elektro.pacujo.net>
In reply to	#87121

Dan Sommers <dan@tombstonezero.net>:

> I think we're all agreeing: not all file systems are the same, and
> Python doesn't smooth out all of the bumps, even for something that
> seems as simple as displaying the names of files in a directory. And
> that's *after* we've agreed that filesystems contain files in
> hierarchical directories.

A whole new set of problems took root with Unicode. There were gains but
there were losses, too.

Python is not alone in the conceptual difficulties. Guile 2's (readdir)
simply converts bad UTF-8 in a filename into a question mark:

   scheme@(guile-user) [1]> (readdir s)
   $3 = "?"
   scheme@(guile-user) [4]> (equal? $3 "?")
   $4 = #t

So does lxterminal:

   $ ls
   ?

even though it's all bytes on the inside:

   $ [ $(ls) = "?" ]
   $ echo $?
   1

Scripts that make use of standard text utilities must now be very
careful:

   $ ls | egrep "^.$" | wc -l
   0

You are well advised to sprinkle LANG=C in your scripts:

   $ ls | LANG=C egrep "^.$" | wc -l
   1

Nasty locale-related bugs plague installation scripts, whose writers are
not accustomed to running their tests in myriads of locales. The topic
is of course larger than just Unicode.
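
The same "question mark on the outside, bytes on the inside" effect can be reproduced in Python without touching a file system. This is only a sketch: the byte string below stands in for a hypothetical file name, and the variable names are mine.

```python
raw = b"\x80"  # hypothetical file name: a single byte that is not valid UTF-8

# Lossless form: the bad byte becomes a lone surrogate, U+DC80.
name = raw.decode("utf-8", errors="surrogateescape")

# Lossy display form: the surrogate cannot be encoded, so it turns into "?".
shown = name.encode("utf-8", errors="replace").decode("utf-8")

# The original bytes can only be recovered from the lossless form.
recovered = name.encode("utf-8", errors="surrogateescape")

print(repr(name), shown, shown.encode("utf-8") == raw, recovered == raw)
```

Comparing against the displayed form fails for the same reason the shell test above does: what is shown is not what is stored.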


Marko

#87105

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2015-03-07 16:40 +0000
Message-ID	<mailman.149.1425746708.21433.python-list@python.org>
In reply to	#87100

On 07/03/2015 16:25, Marko Rauhamaa wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>> On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>>> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>>>
>>>> Marko Rauhamaa wrote:
>>>>
>>>>> That said, UTF-8 does suffer badly from its not being
>>>>> a bijective mapping.
>>>>
>>>> Can you explain?
>>>
>>> In Python terms, there are bytes objects b that don't satisfy:
>>>
>>>     b.decode('utf-8').encode('utf-8') == b
>>
>> Please provide an example; that sounds like a bug. If there is any
>> invalid UTF-8 stream which decodes without an error, it is actually a
>> security bug, and should be fixed pronto in all affected and supported
>> versions.
>
> Here's an example:
>
>     b = b'\x80'
>
> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
> from str objects to bytes objects.
>

Python 2 might, Python 3 doesn't.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

#87106

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-03-07 18:48 +0200
Message-ID	<87fv9gycrr.fsf@elektro.pacujo.net>
In reply to	#87105

Mark Lawrence <breamoreboy@yahoo.co.uk>:

> On 07/03/2015 16:25, Marko Rauhamaa wrote:
>> Here's an example:
>>
>>     b = b'\x80'
>>
>> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
>> from str objects to bytes objects.
>
> Python 2 might, Python 3 doesn't.

   Python 3.3.2 (default, Dec  4 2014, 12:49:00) 
   [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> b'\x80'.decode('utf-8')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte


Marko

#87111

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2015-03-07 17:02 +0000
Message-ID	<mailman.153.1425747767.21433.python-list@python.org>
In reply to	#87106

On 07/03/2015 16:48, Marko Rauhamaa wrote:
> Mark Lawrence <breamoreboy@yahoo.co.uk>:
>
>> On 07/03/2015 16:25, Marko Rauhamaa wrote:
>>> Here's an example:
>>>
>>>      b = b'\x80'
>>>
>>> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
>>> from str objects to bytes objects.
>>
>> Python 2 might, Python 3 doesn't.
>
>     Python 3.3.2 (default, Dec  4 2014, 12:49:00)
>     [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
>     Type "help", "copyright", "credits" or "license" for more information.
>     >>> b'\x80'.decode('utf-8')
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
>
>
> Marko
>

It would clearly help if you were to type in the correct UK English accent.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

#87113

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-03-07 19:16 +0200
Message-ID	<87385gybgj.fsf@elektro.pacujo.net>
In reply to	#87111

Mark Lawrence <breamoreboy@yahoo.co.uk>:

> It would clearly help if you were to type in the correct UK English
> accent.

Your ad-hominem-to-contribution ratio is alarmingly high.


Marko

#87120

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2015-03-07 18:18 +0000
Message-ID	<mailman.158.1425752339.21433.python-list@python.org>
In reply to	#87113

On 07/03/2015 17:16, Marko Rauhamaa wrote:
> Mark Lawrence <breamoreboy@yahoo.co.uk>:
>
>> It would clearly help if you were to type in the correct UK English
>> accent.
>
> Your ad-hominem-to-contribution ratio is alarmingly high.
>
>
> Marko
>

You've been a PITA ever since you first joined this list, what about it?

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

#87129

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-03-07 21:06 -0800
Message-ID	<ab32c8ab-cf4e-44fa-80e2-6e7766199c6a@googlegroups.com>
In reply to	#87120

On Saturday, March 7, 2015 at 11:49:44 PM UTC+5:30, Mark Lawrence wrote:
> On 07/03/2015 17:16, Marko Rauhamaa wrote:
> > Mark Lawrence:
> >
> >> It would clearly help if you were to type in the correct UK English
> >> accent.
> >
> > Your ad-hominem-to-contribution ratio is alarmingly high.
> >
> >
> > Marko
> >
> 
> You've been a PITA ever since you first joined this list, what about it?
> 
> -- 
> My fellow Pythonistas, ask not what our language can do for you, ask
> what you can do for our language.

Hi Mark
Your UK accent above is funny [At least *I* find it so]
The above however is crossing a line. Please desist.

#87107

From	Chris Angelico <rosuav@gmail.com>
Date	2015-03-08 03:53 +1100
Message-ID	<mailman.150.1425747192.21433.python-list@python.org>
In reply to	#87100

On Sun, Mar 8, 2015 at 3:40 AM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
>> Here's an example:
>>
>>     b = b'\x80'
>>
>> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
>> from str objects to bytes objects.
>>
>
> Python 2 might, Python 3 doesn't.

He was talking about this line of code:

b.decode('utf-8').encode('utf-8') == b

With the above assignment, that does indeed raise an exception
(UnicodeDecodeError), which is correct behaviour.

Challenge: Figure out a byte-string input that will make this function
return True.

def is_utf8_broken(b):
    return b.decode('utf-8').encode('utf-8') != b

Correct responses for this function are either False or raising an exception.
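
For the curious, a brute-force sketch of the challenge over short inputs. The survey helper and its max_len parameter are my own additions; it proves nothing about longer inputs, it only shows that no byte string of length 0 to 2 makes the function return True.

```python
from itertools import product

def is_utf8_broken(b):
    return b.decode('utf-8').encode('utf-8') != b

def survey(max_len=2):
    """Try every byte string of length 0..max_len and tally the outcomes."""
    counts = {"False": 0, "raised": 0, "True": 0}
    for length in range(max_len + 1):
        for combo in product(range(256), repeat=length):
            try:
                outcome = str(is_utf8_broken(bytes(combo)))
            except UnicodeDecodeError:
                outcome = "raised"
            counts[outcome] += 1
    return counts

counts = survey()
print(counts)
```

Every input lands in either the "False" or the "raised" bucket, which is the point: decoding is strict, so whatever decodes at all re-encodes to the same bytes.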

ChrisA

#87125

From	wxjmfauth@gmail.com
Date	2015-03-07 11:03 -0800
Message-ID	<3ffbd7e8-23eb-46fe-90eb-1b037f49100a@googlegroups.com>
In reply to	#87099

Le samedi 7 mars 2015 17:18:43 UTC+1, Chris Angelico a écrit :
> On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> > Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
> >
> >> Marko Rauhamaa wrote:
> >>
> >>> That said, UTF-8 does suffer badly from its not being
> >>> a bijective mapping.
> >>
> >> Can you explain?
> >
> > In Python terms, there are bytes objects b that don't satisfy:
> >
> >    b.decode('utf-8').encode('utf-8') == b
> 
> Please provide an example; that sounds like a bug. If there is any
> invalid UTF-8 stream which decodes without an error, it is actually a
> security bug, and should be fixed pronto in all affected and supported
> versions.
> 
> ChrisA

Poor Chris. No offense.
Python 2 and 3 have never worked properly outside the ASCII
world. Sad reality, but reality. And I am not speaking
about specific tasks related to the OS, like the file
system encoding.
I can assure you, I'm not the only one who knows it.

jmf

#87128

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-03-08 12:45 +1100
Message-ID	<54fba9d4$0$12988$c3e8da3$5496439d@news.astraweb.com>
In reply to	#87092

Marko Rauhamaa wrote:

> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
> 
>> Marko Rauhamaa wrote:
>>
>>> That said, UTF-8 does suffer badly from its not being
>>> a bijective mapping.
>>
>> Can you explain?
> 
> In Python terms, there are bytes objects b that don't satisfy:
> 
>    b.decode('utf-8').encode('utf-8') == b

Are you talking about the fact that not all byte streams are valid UTF-8?
That is, some byte objects b may raise an exception on b.decode('utf-8').

I don't see why that means UTF-8 "suffers badly" from this. Can you give an
example of where you would expect to take an arbitrary byte-stream, decode
it as UTF-8, and expect the results to be meaningful?

For those cases where you do wish to take an arbitrary byte stream and
round-trip it, Python now provides an error handler for that:
'surrogateescape' (PEP 383).

py> import random
py> b = bytes([random.randint(0, 255) for _ in range(10000)])
py> s = b.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 0: invalid start byte
py> s = b.decode('utf-8', errors='surrogateescape')
py> s.encode('utf-8', errors='surrogateescape') == b
True
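
A deterministic variant of the same check, in case the random input above looks like luck. This is a sketch with my own variable names; it feeds every possible byte value through the handler once and also shows the catch, namely that the resulting str is no longer encodable with the default strict handler.

```python
every_byte = bytes(range(256))

# Bytes that are not valid UTF-8 are carried through as lone surrogates
# (U+DC80..U+DCFF) instead of raising UnicodeDecodeError.
text = every_byte.decode("utf-8", errors="surrogateescape")
round_trip = text.encode("utf-8", errors="surrogateescape")

# The price: the str now contains lone surrogates, so strict encoding fails.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(round_trip == every_byte, strict_ok)
```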

-- 
Steven
