| Started by | pierrick.brihaye@gmail.com |
|---|---|
| First post | 2015-02-24 02:49 -0800 |
| Last post | 2015-02-27 10:23 +1100 |
| Articles | 20 on this page of 158 — 19 participants |
Newbie question about text encoding pierrick.brihaye@gmail.com - 2015-02-24 02:49 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-24 22:09 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 06:25 -0500
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 15:55 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:03 +1100
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:06 +0100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-24 08:01 -0800
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:07 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:10 +1100
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:24 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:33 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-02-24 10:38 -0500
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 17:20 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 03:24 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 12:13 -0500
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:45 +0100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-02-25 00:21 +0200
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:20 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-25 06:34 -0800
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:57 +0100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:19 +1100
Re: Newbie question about text encoding Marcos Almeida Azevedo <marcos.al.azevedo@gmail.com> - 2015-02-25 12:54 +0800
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 15:41 -0500
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 04:40 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 05:15 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 00:24 +1100
Re: Newbie question about text encoding Sam Raker <sam.raker@gmail.com> - 2015-02-26 08:45 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:08 -0800
Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-02-26 12:02 -0500
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:59 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-26 12:20 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 09:13 +1100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 12:05 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-26 20:57 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 16:58 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 02:30 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 22:54 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 09:02 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 01:22 +1100
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:00 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 03:12 +1100
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:45 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 04:45 +1100
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:13 +0000
Re: Newbie question about text encoding MRAB <python@mrabarnett.plus.com> - 2015-02-27 19:14 +0000
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:09 +0000
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 15:52 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 08:04 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 10:24 -0500
Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:46 +0000
Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:47 +0000
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-27 01:06 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 11:59 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 10:03 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-03 10:36 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:45 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 15:54 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 21:05 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 01:06 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-05 06:59 -0800
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-05 14:59 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 09:33 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-05 20:53 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 16:20 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:02 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:06 -0800
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 08:33 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 00:39 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:03 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 01:11 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:27 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 03:26 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 20:54 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 02:07 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 01:50 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 02:27 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 07:37 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 08:20 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 03:45 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:41 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:58 -0800
Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-07 01:11 -0500
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 23:43 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 00:55 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 01:08 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:25 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 22:09 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 22:33 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 13:53 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 23:02 +1100
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:07 +0000
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 07:28 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 02:40 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 17:48 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:17 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:25 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:41 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:54 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:58 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:00 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:14 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:26 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:50 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:59 +1100
Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:02 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:13 +1100
Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:34 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:44 +1100
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 19:00 +0000
Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 19:16 +0000
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 21:01 +0200
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 16:40 +0000
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:48 +0200
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 17:02 +0000
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:16 +0200
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 18:18 +0000
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:06 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:53 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 11:03 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 12:45 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 09:20 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 18:37 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 10:09 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 19:23 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-08 01:18 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:25 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 22:09 +0200
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 12:43 +1100
Re: Newbie question about text encoding Ben Finney <ben+python@benfinney.id.au> - 2015-03-09 13:09 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-09 08:31 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 13:18 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-09 00:27 -0400
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 07:55 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 08:13 +1100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 17:34 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 17:44 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 02:08 -0700
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 07:26 -0700
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-09 05:28 -0700
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 19:01 +1100
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:13 +0000
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 23:23 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:30 +1100
Re: Newbie question about text encoding Cameron Simpson <cs@zip.com.au> - 2015-03-09 13:09 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-08 19:42 -0700
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:16 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 05:43 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 18:53 -0800
Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-03 18:30 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 13:54 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 14:02 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:05 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:16 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:14 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-04 02:16 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 04:29 +1100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 10:09 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 10:23 +1100
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-03-07 19:14 +0200 |
| Message-ID | <877fusybkb.fsf@elektro.pacujo.net> |
| In reply to | #87110 |
Chris Angelico <rosuav@gmail.com>:
> If you really REALLY can't use the bytes() type to work with something
> that is, yaknow, bytes, then you could use an alternative encoding
> that has a value for every byte. It's still not Unicode text, so it
> doesn't much matter which encoding you use. But it's much better to
> use the bytes type to work with bytes. It is not text, so don't treat
> it as text.
See:
$ mkdir /tmp/xyz
$ touch /tmp/xyz/$'\x80'
$ python3
Python 3.3.2 (default, Dec 4 2014, 12:49:00)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('/tmp/xyz')
['\udc80']
>>> open(os.listdir('/tmp/xyz')[0])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: '\udc80'
File names encoded with Latin-X are quite commonplace even in UTF-8
locales.
Marko
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-08 04:26 +1100 |
| Message-ID | <mailman.154.1425749228.21433.python-list@python.org> |
| In reply to | #87112 |
On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> See:
>
> $ mkdir /tmp/xyz
> $ touch /tmp/xyz/$'\x80'
> $ python3
> Python 3.3.2 (default, Dec 4 2014, 12:49:00)
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> >>> os.listdir('/tmp/xyz')
> ['\udc80']
> >>> open(os.listdir('/tmp/xyz')[0])
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> FileNotFoundError: [Errno 2] No such file or directory: '\udc80'
>
> File names encoded with Latin-X are quite commonplace even in UTF-8
> locales.
That is not a problem with UTF-8, though. I don't understand how
you're blaming UTF-8 for that. There are two things happening here:
1) The underlying file system is not UTF-8, and you can't depend on
that, ergo the decode to Unicode has to have some special handling of
failing bytes.
2) You forgot to put the path on that, so it failed to find the file.
Here's my version of your demo:
>>> open("/tmp/xyz/"+os.listdir('/tmp/xyz')[0])
<_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'>
Looks fine to me.
Alternatively, if you pass a byte string to os.listdir, you get back a
list of byte string file names:
>>> os.listdir(b"/tmp/xyz")
[b'\x80']
>>> open(b"/tmp/xyz/"+os.listdir(b'/tmp/xyz')[0])
<_io.TextIOWrapper name=b'/tmp/xyz/\x80' mode='r' encoding='UTF-8'>
Either way works. You can use bytes or text, and if you use text,
there is a way to smuggle bytes through it. None of this has anything
to do with UTF-8 as an encoding. (Note that the "encoding='UTF-8'"
note in the response has to do with the presumed encoding of the file
contents, not of the file name. As an empty file, it can be considered
to be a stream of zero Unicode characters, encoded UTF-8, so that's
valid.)
ChrisA
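
The "way to smuggle bytes through" text mentioned above is the `surrogateescape` error handler (PEP 383). A minimal sketch, applying the handler explicitly to a hard-coded byte string so the result does not depend on the locale; the variable names are illustrative only:

```python
# Undecodable bytes survive a trip through str via surrogateescape (PEP 383).
raw = b"/tmp/xyz/\x80"                 # 0x80 on its own is not valid UTF-8

text = raw.decode("utf-8", "surrogateescape")
# The bad byte 0x80 becomes the lone surrogate U+DC80 (0xDC00 + 0x80).
assert text == "/tmp/xyz/\udc80"

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")
assert restored == raw
```

On POSIX systems, os.fsdecode() and os.fsencode() apply the same handler using the file system encoding, which is why the str returned by os.listdir() can still be passed to open().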
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-03-07 19:50 +0200 |
| Message-ID | <87y4n8wvc3.fsf@elektro.pacujo.net> |
| In reply to | #87114 |
Chris Angelico <rosuav@gmail.com>:
> On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> File names encoded with Latin-X are quite commonplace even in UTF-8
>> locales.
>
> That is not a problem with UTF-8, though. I don't understand how
> you're blaming UTF-8 for that.
I'm saying it creates practical problems. There's a snake in the
paradise.
> There are two things happening here:
>
> 1) The underlying file system is not UTF-8, and you can't depend on
> that,
Correct. Linux pathnames are octet strings regardless of the locale.
That's why Linux developers should refer to filenames using bytes.
Unfortunately, Python itself violates that principle by having
os.listdir() return str objects (to mention one example).
> 2) You forgot to put the path on that, so it failed to find the file.
> Here's my version of your demo:
>
>>>> open("/tmp/xyz/"+os.listdir('/tmp/xyz')[0])
> <_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'>
>
> Looks fine to me.
I stand corrected.
Then we have:
>>> os.listdir()[0].encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
position 0: surrogates not allowed
Marko
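
The traceback above comes from the strict UTF-8 codec, which rejects lone surrogates by design. A hedged sketch of two ways to handle such a name, one for getting the on-disk bytes back and one for producing something printable; the name is hard-coded here rather than read from a real directory:

```python
name = "\udc80"   # what os.listdir() returned for the file named b'\x80'

# Strict encoding fails: a lone surrogate is not a valid Unicode scalar value.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# For file system I/O: recover the original byte with surrogateescape.
on_disk = name.encode("utf-8", "surrogateescape")

# For logs or terminals: escape anything that cannot be encoded.
printable = name.encode("ascii", "backslashreplace").decode("ascii")
```

Which one is appropriate depends on where the name is going: `on_disk` is for passing back to the operating system, `printable` is only for display.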
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-08 04:59 +1100 |
| Message-ID | <mailman.155.1425751200.21433.python-list@python.org> |
| In reply to | #87115 |
On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> There are two things happening here:
>>
>> 1) The underlying file system is not UTF-8, and you can't depend on
>> that,
>
> Correct. Linux pathnames are octet strings regardless of the locale.
>
> That's why Linux developers should refer to filenames using bytes.
> Unfortunately, Python itself violates that principle by having
> os.listdir() return str objects (to mention one example).
Only because you gave it a str with the path name. If you want to
refer to file names using bytes, then be consistent and refer to ALL
file names using bytes. As I demonstrated, that works just fine.
>> 2) You forgot to put the path on that, so it failed to find the file.
>> Here's my version of your demo:
>>
>>>>> open("/tmp/xyz/"+os.listdir('/tmp/xyz')[0])
>> <_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'>
>>
>> Looks fine to me.
>
> I stand corrected.
>
> Then we have:
>
> >>> os.listdir()[0].encode('utf-8')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
> position 0: surrogates not allowed
So?
ChrisA
| From | Dan Sommers <dan@tombstonezero.net> |
|---|---|
| Date | 2015-03-07 18:02 +0000 |
| Message-ID | <mdfega$6bj$1@dont-email.me> |
| In reply to | #87116 |
On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote:
> On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Correct. Linux pathnames are octet strings regardless of the locale.
>>
>> That's why Linux developers should refer to filenames using bytes.
>> Unfortunately, Python itself violates that principle by having
>> os.listdir() return str objects (to mention one example).
>
> Only because you gave it a str with the path name. If you want to
> refer to file names using bytes, then be consistent and refer to ALL
> file names using bytes. As I demonstrated, that works just fine.
Python 3.4.2 (default, Oct 8 2014, 10:45:20)
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> type(os.listdir(os.curdir)[0])
<class 'str'>
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-08 05:13 +1100 |
| Message-ID | <mailman.156.1425751992.21433.python-list@python.org> |
| In reply to | #87117 |
On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers <dan@tombstonezero.net> wrote:
> On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote:
>
>> On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>
>>> Correct. Linux pathnames are octet strings regardless of the locale.
>>>
>>> That's why Linux developers should refer to filenames using bytes.
>>> Unfortunately, Python itself violates that principle by having
>>> os.listdir() return str objects (to mention one example).
>>
>> Only because you gave it a str with the path name. If you want to
>> refer to file names using bytes, then be consistent and refer to ALL
>> file names using bytes. As I demonstrated, that works just fine.
>
> Python 3.4.2 (default, Oct 8 2014, 10:45:20)
> [GCC 4.9.1] on linux
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import os
>>>> type(os.listdir(os.curdir)[0])
> <class 'str'>
Help on module os:
DESCRIPTION
This exports:
- os.curdir is a string representing the current directory ('.' or ':')
- os.pardir is a string representing the parent directory ('..' or '::')
Explicitly documented as strings. If you want to work with strings,
work with strings. If you want to work with bytes, don't use
os.curdir, use bytes instead. Personally, I'm happy using strings, but
if you want to go down the path of using bytes, you simply have to be
consistent, and that probably means being platform-dependent anyway,
so just use b"." for the current directory.
Normally, using Unicode strings for file names will work just fine.
Any name that you craft yourself will be correctly encoded for the
target file system (or UTF-8 if you can't know), and any that you get
back from os.listdir or equivalent will be usable in file name
contexts. What else can you do with a file name that isn't encoded the
way you expect it to be? Unless you have some out-of-band encoding
information, you can't do anything meaningful with the stream of
bytes, other than keeping it exactly as it is.
ChrisA
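
The "be consistent" advice above can be sketched as follows. This is only an illustration under simple assumptions: it uses a throwaway temporary directory and a plain ASCII file name (`a.txt`, chosen for the example) so that it behaves the same on any file system:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # One plain ASCII file so the example does not depend on the file system.
    open(os.path.join(tmp, "a.txt"), "w").close()

    str_names = os.listdir(tmp)                 # str in   -> list of str
    bytes_names = os.listdir(os.fsencode(tmp))  # bytes in -> list of bytes

    # Stay in one type when building paths: bytes joined with bytes.
    bytes_path = os.path.join(os.fsencode(tmp), bytes_names[0])
    bytes_path_exists = os.path.exists(bytes_path)
```

The return type of os.listdir() follows the type of its argument, so replacing os.curdir with b"." is enough to stay in bytes throughout.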
| From | Dan Sommers <dan@tombstonezero.net> |
|---|---|
| Date | 2015-03-07 18:34 +0000 |
| Message-ID | <mdfgbj$qdm$1@dont-email.me> |
| In reply to | #87118 |
On Sun, 08 Mar 2015 05:13:09 +1100, Chris Angelico wrote:
> On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers <dan@tombstonezero.net> wrote:
>> On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote:
>>
>>> On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>>
>>>> Correct. Linux pathnames are octet strings regardless of the locale.
>>>>
>>>> That's why Linux developers should refer to filenames using bytes.
>>>> Unfortunately, Python itself violates that principle by having
>>>> os.listdir() return str objects (to mention one example).
>>>
>>> Only because you gave it a str with the path name. If you want to
>>> refer to file names using bytes, then be consistent and refer to ALL
>>> file names using bytes. As I demonstrated, that works just fine.
>>
>> Python 3.4.2 (default, Oct 8 2014, 10:45:20)
>> [GCC 4.9.1] on linux
>> Type "help", "copyright", "credits" or "license" for more information.
>>>>> import os
>>>>> type(os.listdir(os.curdir)[0])
>> <class 'str'>
>
> Help on module os:
>
> DESCRIPTION
> This exports:
> - os.curdir is a string representing the current directory ('.' or ':')
> - os.pardir is a string representing the parent directory ('..' or '::')
>
> Explicitly documented as strings. If you want to work with strings,
> work with strings. If you want to work with bytes, don't use
> os.curdir, use bytes instead. Personally, I'm happy using strings, but
> if you want to go down the path of using bytes, you simply have to be
> consistent, and that probably means being platform-dependent anyway,
> so just use b"." for the current directory.
I think we're all agreeing: not all file systems are the same, and
Python doesn't smooth out all of the bumps, even for something that
seems as simple as displaying the names of files in a directory. And
that's *after* we've agreed that filesystems contain files in
hierarchical directories.
Dan
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-08 05:44 +1100 |
| Message-ID | <mailman.159.1425753853.21433.python-list@python.org> |
| In reply to | #87121 |
On Sun, Mar 8, 2015 at 5:34 AM, Dan Sommers <dan@tombstonezero.net> wrote:
> I think we're all agreeing: not all file systems are the same, and
> Python doesn't smooth out all of the bumps, even for something that
> seems as simple as displaying the names of files in a directory. And
> that's *after* we've agreed that filesystems contain files in
> hierarchical directories.
I think you and I are in agreement. No idea about Marko, I'm still not
entirely sure what he's saying.
Python can't smooth out all of the bumps in file systems, any more
than Unicode can smooth out the bumps in natural language, or TCP can
smooth out the bumps in IP. The abstraction layers help, but every now
and then they leak, and you have to cope with the underlying mess.
ChrisA
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2015-03-07 19:00 +0000 |
| Message-ID | <mailman.160.1425754874.21433.python-list@python.org> |
| In reply to | #87121 |
On 07/03/2015 18:34, Dan Sommers wrote:
> On Sun, 08 Mar 2015 05:13:09 +1100, Chris Angelico wrote:
>
>> On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers <dan@tombstonezero.net> wrote:
>>> On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote:
>>>
>>>> On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>>>
>>>>> Correct. Linux pathnames are octet strings regardless of the locale.
>>>>>
>>>>> That's why Linux developers should refer to filenames using bytes.
>>>>> Unfortunately, Python itself violates that principle by having
>>>>> os.listdir() return str objects (to mention one example).
>>>>
>>>> Only because you gave it a str with the path name. If you want to
>>>> refer to file names using bytes, then be consistent and refer to ALL
>>>> file names using bytes. As I demonstrated, that works just fine.
>>>
>>> Python 3.4.2 (default, Oct 8 2014, 10:45:20)
>>> [GCC 4.9.1] on linux
>>> Type "help", "copyright", "credits" or "license" for more information.
>>>>>> import os
>>>>>> type(os.listdir(os.curdir)[0])
>>> <class 'str'>
>>
>> Help on module os:
>>
>> DESCRIPTION
>> This exports:
>> - os.curdir is a string representing the current directory ('.' or ':')
>> - os.pardir is a string representing the parent directory ('..' or '::')
>>
>> Explicitly documented as strings. If you want to work with strings,
>> work with strings. If you want to work with bytes, don't use
>> os.curdir, use bytes instead. Personally, I'm happy using strings, but
>> if you want to go down the path of using bytes, you simply have to be
>> consistent, and that probably means being platform-dependent anyway,
>> so just use b"." for the current directory.
>
> I think we're all agreeing: not all file systems are the same, and
> Python doesn't smooth out all of the bumps, even for something that
> seems as simple as displaying the names of files in a directory. And
> that's *after* we've agreed that filesystems contain files in
> hierarchical directories.
>
> Dan
>
Isn't pathlib
https://docs.python.org/3/library/pathlib.html#module-pathlib
effectively a more recent attempt at smoothing or even removing (some
of) the bumps? Has anybody here got experience of it as I've never used it?
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.
Mark Lawrence
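
On the pathlib question, a small hedged sketch of how it interacts with non-UTF-8 names on POSIX. It uses PurePosixPath so that no real file is touched, and the byte string is the one from earlier in the thread; nothing here says how pathlib behaves on other platforms:

```python
import os
from pathlib import PurePosixPath

raw = b"/tmp/xyz/\x80"

# pathlib works with str, so decode first (on POSIX this uses surrogateescape).
p = PurePosixPath(os.fsdecode(raw))

# bytes(p) encodes the path with os.fsencode(), so the original bytes come back.
round_trip = bytes(p)
last_component = os.fsencode(p.name)

# Passing bytes directly is rejected with TypeError.
try:
    PurePosixPath(raw)
    accepts_bytes = True
except TypeError:
    accepts_bytes = False
```

So pathlib stays on the str side of the divide and relies on the same escape mechanism as os.listdir() rather than offering a bytes-based path type.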
| From | Dan Sommers <dan@tombstonezero.net> |
|---|---|
| Date | 2015-03-07 19:16 +0000 |
| Message-ID | <mdfiqq$qdm$2@dont-email.me> |
| In reply to | #87123 |
On Sat, 07 Mar 2015 19:00:47 +0000, Mark Lawrence wrote: > Isn't pathlib > https://docs.python.org/3/library/pathlib.html#module-pathlib > effectively a more recent attempt at smoothing or even removing (some > of) the bumps? Has anybody here got experience of it as I've never > used it? I almost said something about Common Lisp's PATHNAME type, but I didn't. An extremely quick reading of that page tells me that os.pathlib addresses *some* of the issues that PATHNAME addresses, but os.pathlib seems more limited in scope (e.g., os.pathlib doesn't account for filesystems with versioned files). I'll certainly have a closer look later.
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-03-07 21:01 +0200 |
| Message-ID | <87twxwws1k.fsf@elektro.pacujo.net> |
| In reply to | #87121 |
Dan Sommers <dan@tombstonezero.net>:
> I think we're all agreeing: not all file systems are the same, and
> Python doesn't smooth out all of the bumps, even for something that
> seems as simple as displaying the names of files in a directory. And
> that's *after* we've agreed that filesystems contain files in
> hierarchical directories.
A whole new set of problems took root with Unicode. There were gains but
there were losses, too.
Python is not alone in the conceptual difficulties. Guile 2's (readdir)
simply converts bad UTF-8 in a filename into a question mark:
    scheme@(guile-user) [1]> (readdir s)
    $3 = "?"
    scheme@(guile-user) [4]> (equal? $3 "?")
    $4 = #t
So does lxterminal:
    $ ls
    ?
even though it's all bytes on the inside:
    $ [ $(ls) = "?" ]
    $ echo $?
    1
Scripts that make use of standard text utilities must now be very
careful:
    $ ls | egrep "^.$" | wc -l
    0
You are well advised to sprinkle LANG=C in your scripts:
    $ ls | LANG=C egrep "^.$" | wc -l
    1
Nasty locale-related bugs plague installation scripts, whose writers are
not accustomed to running their tests in myriads of locales.
The topic is of course larger than just Unicode.
Marko
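The same trade-off can be sketched in Python without touching a real filesystem. The two byte strings below are hypothetical filenames that are not valid UTF-8. Python's `errors='replace'` handler substitutes U+FFFD rather than a literal question mark, so this is an analogy to the Guile behaviour above, not a reproduction of it.

```python
# Hypothetical filenames as raw bytes; neither is valid UTF-8.
bad_names = [b"\x80", b"\x81"]

# Lossy: every undecodable byte becomes U+FFFD, so distinct names collide.
replaced = [n.decode("utf-8", errors="replace") for n in bad_names]

# Lossless: each undecodable byte becomes its own lone surrogate.
escaped = [n.decode("utf-8", errors="surrogateescape") for n in bad_names]

print(replaced[0] == replaced[1])   # True: the names can no longer be told apart
print(escaped[0] == escaped[1])     # False: the names stay distinct

# The escaped form can be turned back into the original bytes.
restored = [s.encode("utf-8", errors="surrogateescape") for s in escaped]
print(restored == bad_names)        # True
```

A display layer that replaces bad bytes is fine for showing a name to a person, but anything that has to open, compare, or delete the file needs the lossless form or the raw bytes.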
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2015-03-07 16:40 +0000 |
| Message-ID | <mailman.149.1425746708.21433.python-list@python.org> |
| In reply to | #87100 |
On 07/03/2015 16:25, Marko Rauhamaa wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>> On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>>> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>>>
>>>> Marko Rauhamaa wrote:
>>>>
>>>>> That said, UTF-8 does suffer badly from its not being
>>>>> a bijective mapping.
>>>>
>>>> Can you explain?
>>>
>>> In Python terms, there are bytes objects b that don't satisfy:
>>>
>>> b.decode('utf-8').encode('utf-8') == b
>>
>> Please provide an example; that sounds like a bug. If there is any
>> invalid UTF-8 stream which decodes without an error, it is actually a
>> security bug, and should be fixed pronto in all affected and supported
>> versions.
>
> Here's an example:
>
> b = b'\x80'
>
> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
> from str objects to bytes objects.
>
Python 2 might, Python 3 doesn't.
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.
Mark Lawrence
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-03-07 18:48 +0200 |
| Message-ID | <87fv9gycrr.fsf@elektro.pacujo.net> |
| In reply to | #87105 |
Mark Lawrence <breamoreboy@yahoo.co.uk>:
> On 07/03/2015 16:25, Marko Rauhamaa wrote:
>> Here's an example:
>>
>> b = b'\x80'
>>
>> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
>> from str objects to bytes objects.
>
> Python 2 might, Python 3 doesn't.
Python 3.3.2 (default, Dec 4 2014, 12:49:00)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> b'\x80'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
invalid start byte
Marko
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2015-03-07 17:02 +0000 |
| Message-ID | <mailman.153.1425747767.21433.python-list@python.org> |
| In reply to | #87106 |
On 07/03/2015 16:48, Marko Rauhamaa wrote:
> Mark Lawrence <breamoreboy@yahoo.co.uk>:
>
>> On 07/03/2015 16:25, Marko Rauhamaa wrote:
>>> Here's an example:
>>>
>>> b = b'\x80'
>>>
>>> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
>>> from str objects to bytes objects.
>>
>> Python 2 might, Python 3 doesn't.
>
> Python 3.3.2 (default, Dec 4 2014, 12:49:00)
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> b'\x80'.decode('utf-8')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
> invalid start byte
>
>
> Marko
>
It would clearly help if you were to type in the correct UK English accent.
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.
Mark Lawrence
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-03-07 19:16 +0200 |
| Message-ID | <87385gybgj.fsf@elektro.pacujo.net> |
| In reply to | #87111 |
Mark Lawrence <breamoreboy@yahoo.co.uk>:
> It would clearly help if you were to type in the correct UK English
> accent.
Your ad-hominem-to-contribution ratio is alarmingly high.
Marko
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2015-03-07 18:18 +0000 |
| Message-ID | <mailman.158.1425752339.21433.python-list@python.org> |
| In reply to | #87113 |
On 07/03/2015 17:16, Marko Rauhamaa wrote:
> Mark Lawrence <breamoreboy@yahoo.co.uk>:
>
>> It would clearly help if you were to type in the correct UK English
>> accent.
>
> Your ad-hominem-to-contribution ratio is alarmingly high.
>
>
> Marko
>
You've been a PITA ever since you first joined this list, what about it?
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.
Mark Lawrence
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2015-03-07 21:06 -0800 |
| Message-ID | <ab32c8ab-cf4e-44fa-80e2-6e7766199c6a@googlegroups.com> |
| In reply to | #87120 |
On Saturday, March 7, 2015 at 11:49:44 PM UTC+5:30, Mark Lawrence wrote:
> On 07/03/2015 17:16, Marko Rauhamaa wrote:
> > Mark Lawrence:
> >
> >> It would clearly help if you were to type in the correct UK English
> >> accent.
> >
> > Your ad-hominem-to-contribution ratio is alarmingly high.
> >
> >
> > Marko
> >
>
> You've been a PITA ever since you first joined this list, what about it?
>
> --
> My fellow Pythonistas, ask not what our language can do for you, ask
> what you can do for our language.
Hi Mark
Your UK accent above is funny [At least *I* find it so]
The above however is crossing a line.
Please desist.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-08 03:53 +1100 |
| Message-ID | <mailman.150.1425747192.21433.python-list@python.org> |
| In reply to | #87100 |
On Sun, Mar 8, 2015 at 3:40 AM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
>> Here's an example:
>>
>> b = b'\x80'
>>
>> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
>> from str objects to bytes objects.
>>
>
> Python 2 might, Python 3 doesn't.
He was talking about this line of code:
b.decode('utf-8').encode('utf-8') == b
With the above assignment, that does indeed throw an error - which is
correct behaviour.
Challenge: Figure out a byte-string input that will make this function
return True.
def is_utf8_broken(b):
    return b.decode('utf-8').encode('utf-8') != b
Correct responses for this function are either False or raising an exception.
ChrisA
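A brute-force sketch of the challenge, assuming Python 3 and limited to inputs of one or two bytes (so it is evidence, not a proof). The helper `classify` and the `counts` dictionary are names made up for this example; `is_utf8_broken` is the function from the post.

```python
from itertools import product


def is_utf8_broken(b):
    return b.decode('utf-8').encode('utf-8') != b


def classify(b):
    """Return 'ok', 'rejected' or 'broken' for one bytes object."""
    try:
        return "broken" if is_utf8_broken(b) else "ok"
    except UnicodeDecodeError:
        # Invalid UTF-8 is refused outright instead of being decoded.
        return "rejected"


counts = {"ok": 0, "rejected": 0, "broken": 0}
for length in (1, 2):
    # Every possible byte string of this length.
    for combo in product(range(256), repeat=length):
        counts[classify(bytes(combo))] += 1

print(counts)
print("round-trip failures:", counts["broken"])   # 0
```

Every input in that range either round-trips exactly or raises, which are the two correct outcomes named above.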
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-03-07 11:03 -0800 |
| Message-ID | <3ffbd7e8-23eb-46fe-90eb-1b037f49100a@googlegroups.com> |
| In reply to | #87099 |
On Saturday, 7 March 2015 at 17:18:43 UTC+1, Chris Angelico wrote:
> On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> > Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
> >
> >> Marko Rauhamaa wrote:
> >>
> >>> That said, UTF-8 does suffer badly from its not being
> >>> a bijective mapping.
> >>
> >> Can you explain?
> >
> > In Python terms, there are bytes objects b that don't satisfy:
> >
> > b.decode('utf-8').encode('utf-8') == b
>
> Please provide an example; that sounds like a bug. If there is any
> invalid UTF-8 stream which decodes without an error, it is actually a
> security bug, and should be fixed pronto in all affected and supported
> versions.
>
> ChrisA
Poor Chris. No offense.
Python 2 and 3 have never worked properly outside the ASCII
world. Sad reality, but reality. And I am not speaking
about specific tasks related to the OS, like the file
system encoding.
I can assure you, I'm not the only one who knows it.
jmf
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-03-08 12:45 +1100 |
| Message-ID | <54fba9d4$0$12988$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #87092 |
Marko Rauhamaa wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>
>> Marko Rauhamaa wrote:
>>
>>> That said, UTF-8 does suffer badly from its not being
>>> a bijective mapping.
>>
>> Can you explain?
>
> In Python terms, there are bytes objects b that don't satisfy:
>
> b.decode('utf-8').encode('utf-8') == b
Are you talking about the fact that not all byte streams are valid UTF-8?
That is, some byte objects b may raise an exception on b.decode('utf-8').
I don't see why that means UTF-8 "suffers badly" from this. Can you give an
example of where you would expect to take an arbitrary byte-stream, decode
it as UTF-8, and expect the results to be meaningful?
For those cases where you do wish to take an arbitrary byte stream and
round-trip it, Python now provides an error handler for that.
py> import random
py> b = bytes([random.randint(0, 255) for _ in range(10000)])
py> s = b.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 0:
invalid start byte
py> s = b.decode('utf-8', errors='surrogateescape')
py> s.encode('utf-8', errors='surrogateescape') == b
True
--
Steven
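One caveat worth seeing in code: a string produced with that error handler contains lone surrogates, so it only encodes again if the same handler is used. A minimal sketch, with a made-up Latin-1 input standing in for the "arbitrary byte stream":

```python
# Hypothetical input: Latin-1 encoded text, which is not valid UTF-8.
raw = b"caf\xe9"

# The undecodable byte 0xE9 is smuggled through as the lone surrogate U+DCE9.
s = raw.decode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate...
try:
    s.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...while encoding with the same handler restores the original bytes.
roundtrip = s.encode("utf-8", errors="surrogateescape")

print(repr(s))            # 'caf\udce9'
print(strict_ok)          # False
print(roundtrip == raw)   # True
```

So the escaped string is a safe carrier for the bytes, but not something to hand to code that expects well-formed text.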