Re: Newbie question about text encoding

Date 2015-03-07 19:03 +0000
From Albert-Jan Roskam <fomcl@yahoo.com>
References <CAPTjJmrOTgY520hU2e90TkAsck5OVjMq4kQ9xMP5dU5DJ1Ejpg@mail.gmail.com>
Subject Re: Newbie question about text encoding
Newsgroups comp.lang.python
Message-ID <mailman.161.1425755314.21433.python-list@python.org> (permalink)




----- Original Message -----

> From: Chris Angelico <rosuav@gmail.com>
> To: 
> Cc: "python-list@python.org" <python-list@python.org>
> Sent: Saturday, March 7, 2015 6:26 PM
> Subject: Re: Newbie question about text encoding
> 
> On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>>  See:
>> 
>>     $ mkdir /tmp/xyz
>>     $ touch /tmp/xyz/$'\x80'
>>     $ python3
>>     Python 3.3.2 (default, Dec  4 2014, 12:49:00)
>>     [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
>>     Type "help", "copyright", "credits" or "license" for more information.
>>     >>> import os
>>     >>> os.listdir('/tmp/xyz')
>>     ['\udc80']
>>     >>> open(os.listdir('/tmp/xyz')[0])
>>     Traceback (most recent call last):
>>       File "<stdin>", line 1, in <module>
>>     FileNotFoundError: [Errno 2] No such file or directory: '\udc80'
>> 
>>  File names encoded with Latin-X are quite commonplace even in UTF-8
>>  locales.
> 
> That is not a problem with UTF-8, though. I don't understand how
> you're blaming UTF-8 for that. There are two things happening here:
> 
> 1) The underlying file system is not UTF-8, and you can't depend on
> that, ergo the decode to Unicode has to have some special handling of
> failing bytes.
> 2) You forgot to put the path on that, so it failed to find the file.
> Here's my version of your demo:
> 
> >>> open("/tmp/xyz/"+os.listdir('/tmp/xyz')[0])
> <_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'>
> 
> Looks fine to me.
> 
> Alternatively, if you pass a byte string to os.listdir, you get back a
> list of byte string file names:
> 
> >>> os.listdir(b"/tmp/xyz")
> [b'\x80']

Nice, I did not know that. glob.glob works the same way: given a str pattern it returns a list of str file names, and given a bytes pattern it returns a list of bytes file names.
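
A minimal sketch of that mirroring, in case it helps anyone following along. The helper name list_both_ways and the file name demo.txt are made up for illustration, and the file name is plain ASCII on purpose so the script runs on any file system. The last few lines assume that the '\udc80' in the listing above comes from the "surrogateescape" error handler, and spell that handler out explicitly rather than relying on the locale's file system encoding.

```python
import glob
import os
import tempfile


def list_both_ways(directory):
    """Return (str_names, bytes_names) for the same directory (hypothetical helper)."""
    str_names = os.listdir(directory)                 # str path in, str names out
    bytes_names = os.listdir(os.fsencode(directory))  # bytes path in, bytes names out
    return str_names, bytes_names


with tempfile.TemporaryDirectory() as tmp:
    # Illustrative file name; ASCII so that any file system accepts it.
    open(os.path.join(tmp, "demo.txt"), "w").close()

    str_names, bytes_names = list_both_ways(tmp)

    # glob.glob mirrors the type of its pattern in the same way.
    str_globbed = glob.glob(os.path.join(tmp, "*.txt"))
    bytes_globbed = glob.glob(os.path.join(os.fsencode(tmp), b"*.txt"))

print(str_names, bytes_names)
print(type(str_globbed[0]).__name__, type(bytes_globbed[0]).__name__)

# An undecodable byte survives a round trip through str when the
# "surrogateescape" error handler is used for both directions.
raw = b"\x80"
as_text = raw.decode("utf-8", "surrogateescape")
round_tripped = as_text.encode("utf-8", "surrogateescape")
print(ascii(as_text), round_tripped)
```

So if the file names might not be valid UTF-8, passing bytes in keeps everything as raw bytes, while passing str in gives names that can still be handed back to open() as long as the directory is joined on.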
