From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: Newbie question about text encoding
Date: Sun, 8 Mar 2015 04:26:58 +1100

On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa wrote:
> See:
>
> $ mkdir /tmp/xyz
> $ touch /tmp/xyz/$'\x80'
> $ python3
> Python 3.3.2 (default, Dec 4 2014, 12:49:00)
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> >>> os.listdir('/tmp/xyz')
> ['\udc80']
> >>> open(os.listdir('/tmp/xyz')[0])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> FileNotFoundError: [Errno 2] No such file or directory: '\udc80'
>
> File names encoded with Latin-X are quite commonplace even in UTF-8
> locales.

That is not a problem with UTF-8, though. I don't understand how
you're blaming UTF-8 for that. There are two things happening here:

1) The underlying file system is not UTF-8 (or at least you can't
depend on it being UTF-8), ergo the decode to Unicode has to have some
special handling of failing bytes.
2) You forgot to put the directory path in front of the file name, so
open() looked in the current directory and failed to find the file.

Here's my version of your demo:

>>> open("/tmp/xyz/"+os.listdir('/tmp/xyz')[0])
<_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'>

Looks fine to me.

Alternatively, if you pass a byte string to os.listdir, you get back a
list of byte string file names:

>>> os.listdir(b"/tmp/xyz")
[b'\x80']
>>> open(b"/tmp/xyz/"+os.listdir(b'/tmp/xyz')[0])
<_io.TextIOWrapper name=b'/tmp/xyz/\x80' mode='r' encoding='UTF-8'>

Either way works. You can use bytes or text, and if you use text,
there is a way to smuggle bytes through it (the surrogateescape error
handler, which is where the '\udc80' comes from). None of this has
anything to do with UTF-8 as an encoding.

(Note that the "encoding='UTF-8'" in the repr has to do with the
presumed encoding of the file contents, not of the file name. As an
empty file, it can be considered to be a stream of zero Unicode
characters, encoded UTF-8, so that's valid.)

ChrisA
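
For anyone who wants to poke at the two points above without creating an oddly named file, here is a minimal sketch. It assumes the "smuggling" is done by the surrogateescape error handler, which is what produces '\udc80' from the byte 0x80 in the demo. The name empty.txt is a hypothetical stand-in chosen so the path-joining part also runs on file systems that reject non-UTF-8 names.

```python
import os
import tempfile

# The raw file name from the demo: a single byte that is not valid UTF-8.
raw_name = b"\x80"

# Decoding with the "surrogateescape" error handler maps each undecodable
# byte 0xNN to the lone surrogate U+DCNN instead of raising an error.
text_name = raw_name.decode("utf-8", "surrogateescape")
print(ascii(text_name))  # '\udc80'

# Encoding with the same handler restores the original byte exactly.
assert text_name.encode("utf-8", "surrogateescape") == raw_name

# Without the handler, the smuggled byte cannot be encoded as real UTF-8.
try:
    text_name.encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 refuses the lone surrogate")

# os.listdir() returns bare names, so join them to the directory before open().
# "empty.txt" is a hypothetical ASCII name used to keep this part portable.
with tempfile.TemporaryDirectory() as directory:
    open(os.path.join(directory, "empty.txt"), "w").close()
    for name in os.listdir(directory):
        with open(os.path.join(directory, name)) as handle:
            print(repr(handle.read()))  # ''
```

The round trip is lossless in both directions, which is why either the str or the bytes form of the name can be handed back to open(), provided the directory is joined on first.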