From: Chris Angelico
Cc: "python-list@python.org"
Newsgroups: comp.lang.python
Subject: Re: Newbie question about text encoding
Date: Sun, 8 Mar 2015 05:13:09 +1100

On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers wrote:
> On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote:
>
>> On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa wrote:
>
>>> Correct. Linux pathnames are octet strings regardless of the locale.
>>>
>>> That's why Linux developers should refer to filenames using bytes.
>>> Unfortunately, Python itself violates that principle by having
>>> os.listdir() return str objects (to mention one example).
>>
>> Only because you gave it a str with the path name. If you want to
>> refer to file names using bytes, then be consistent and refer to ALL
>> file names using bytes. As I demonstrated, that works just fine.
>
> Python 3.4.2 (default, Oct 8 2014, 10:45:20)
> [GCC 4.9.1] on linux
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import os
>>>> type(os.listdir(os.curdir)[0])
> <class 'str'>

Help on module os:

DESCRIPTION
    This exports:
      - os.curdir is a string representing the current directory ('.' or ':')
      - os.pardir is a string representing the parent directory ('..' or '::')

Explicitly documented as strings. If you want to work with strings,
work with strings. If you want to work with bytes, don't use
os.curdir, use bytes instead. Personally, I'm happy using strings, but
if you want to go down the path of using bytes, you simply have to be
consistent, and that probably means being platform-dependent anyway,
so just use b"." for the current directory.

Normally, using Unicode strings for file names will work just fine.
Any name that you craft yourself will be correctly encoded for the
target file system (or UTF-8 if you can't know), and any that you get
back from os.listdir or equivalent will be usable in file name
contexts. What else can you do with a file name that isn't encoded the
way you expect it to be? Unless you have some out-of-band encoding
information, you can't do anything meaningful with the stream of
bytes, other than keeping it exactly as it is.

ChrisA
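
The "str in, str out; bytes in, bytes out" behaviour discussed above can be checked with a short sketch. This is only an illustration, not code from the thread: the helper name listing_types and the file name example.txt are made up for the demonstration, and the bytes path is built with os.fsencode rather than a literal b"." so that it can point at a scratch directory.

```python
import os
import tempfile


def listing_types(directory):
    """Return the element types os.listdir yields for a str and a bytes path.

    `directory` is a str path; the bytes variant is derived with os.fsencode.
    (Hypothetical helper, written only for this illustration.)
    """
    as_str = os.listdir(directory)                 # str argument: names come back as str
    as_bytes = os.listdir(os.fsencode(directory))  # bytes argument: names come back as bytes
    return {type(name) for name in as_str}, {type(name) for name in as_bytes}


with tempfile.TemporaryDirectory() as tmp:
    # Create one file with a made-up name so the listing is not empty.
    open(os.path.join(tmp, "example.txt"), "w").close()

    str_types, bytes_types = listing_types(tmp)

    # Names obtained as str can be converted to bytes and back unchanged.
    round_trip_ok = all(
        os.fsdecode(os.fsencode(name)) == name for name in os.listdir(tmp)
    )

print(str_types, bytes_types, round_trip_ok)
```

On POSIX systems os.fsdecode and os.fsencode use the surrogateescape error handler, which is why a name returned by os.listdir as str stays usable in file name contexts even when it does not decode cleanly in the expected encoding.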