From: Cameron Simpson
Date: Sat, 8 Jun 2013 12:49:31 +1000
To: Νικόλαος Κούρας
Cc: python-list@python.org
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)
Newsgroups: comp.lang.python

On 07Jun2013 04:53, Νίκος Γκρ33κ wrote:
| On Friday, 7 June 2013 11:53:04 AM UTC+3, user Cameron Simpson wrote:
| > | >| errors='replace' mean dont break in case or error?
| >
| > | >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
| > | >for something that would not decode smoothly.
| >
| > | How can it be correct? We have encoded out string in utf-8 and then
| > | we tried to decode it as greek-iso? How can this possibly be
| > | correct?
|
| > If it is a valid iso-8859-7 sequence (which might cover everything,
| > since I expect it is an 8-bit 1:1 mapping from bytes values to a
| > set of codepoints, just like iso-8859-1) then it may decode to the
| > "wrong" characters, but the reverse process (characters encoded as
| > bytes) should produce the original bytes. With a mapping like this,
| > errors='replace' may mean nothing; there will be no errors because
| > the only Unicode characters in play are all from iso-8859-7 to start
| > with. Of course another string may not be safe.
|
| > Visually, the names will be garbage. And if you go:
| >   mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
| > while using the iso-8859-7 locale, the wrong thing will occur
| > (assuming it even works, though I think it should because all these
| > characters are represented in iso-8859-7, yes?)
|
| All the rest you i understood only the above quotes its still unclear to me.
| I cant see to understand it.
|
| Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st 0-127 codepoints similar?

Yes. It is certainly true for utf-8 and latin-iso and ASCII.
I expect it to be so for greek-iso, but have not checked.

They're all essentially the ASCII set plus a range of other character
codepoints for the upper values. The 8-bit sets iso-8859-1 (which I
take you to mean by "latin-iso") and iso-8859-7 (which I take you to
mean by "greek-iso") are single byte mappings with the top half mapped
to characters commonly used in a particular region.

Unicode has a much greater range, but the UTF-8 encoding of Unicode
deliberately has the bottom 0-127 identical to ASCII, and higher
values represented by multibyte sequences commencing with at least
the first byte in the 128-255 range. In this way pure ASCII files
are already in UTF-8 (and, in fact, work just fine for the iso-8859-x
encodings as well).

| For example char 'a' has the value of '65' for all of those character sets?
| Is hat what you mean?

Yes. (Strictly, 'a' is 97 and 'A' is 65, but whichever character you
pick from the 0-127 range, its value is the same in all of those sets.)

| s = 'a' (This is unicode right? Why when we assign a string to
| a variable that string's type is always unicode and does not
| automatically become utf-8 which includes all available world-wide
| characters? Unicode is something different that a character set? )

In Python 3, yes. Strings are unicode. Note that that means they are
sequences of codepoints whose meaning is as for Unicode.

"utf-8" is a byte encoding for Unicode strings. An external storage
format, if you like. The first 0-127 codepoints are 1:1 with byte
values, and the higher code points require multibyte sequences.

| utf8_byte = s.encode('utf-8')

Unicode string => utf-8 byte encoding.

| Now if we are to decode this back to utf8 we will receive the char 'a'.

Yes.

| I beleive same thing will happen with latin, greek, ascii isos. Correct?
|
| utf8_a = utf8_byte.decode('iso-8859-7')
| latin_a = utf8_byte.decode('iso-8859-1')
| ascii_a = utf8_byte.decode('ascii')
| utf8_a = utf8_byte.decode('iso-8859-7')
|
| Is this correct?

Yes, because of the design decision about the 0-127 codepoints.

| All of those decodes will work even if the encoded bytestring was of utf8 type?
|
| The characters that will not decode correctly are those that their codepoints are greater that > 127 ?
| for example if s = 'α' (greek character equivalent to english 'a')
| Is this what you mean?

Yes, exactly so.

| --------------------------------
|
| Now back to my almost ready files.py script please:
|
|
| #========================================================
| # Collect filenames of the path dir as bytes
| greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' )
|
| for filename in greek_filenames:
|     # Compute 'path/to/filename' in bytes
|     greek_path = b'/home/nikos/public_html/data/apps/' + b'filename'

You don't mean b'filename', which is the literal word "filename".
You mean the variable:

    filename

which is already bytes, because os.listdir() returns bytes names
when it is given a bytes path. No encode step is needed there.

More probably, you mean:

    dirpath = b'/home/nikos/public_html/data/apps/'
    greek_filenames = os.listdir(dirpath)
    for greek_filename in greek_filenames:
        try:
            filename = greek_filename.decode('iso-8859-7')

and then:

        greek_path = dirpath + greek_filename
        utf8_filename = filename.encode('utf-8')
        utf8_path = dirpath + utf8_filename

|     try:
|         filepath = greek_path.decode('iso-8859-7')
|         # Rename current filename from greek bytes --> utf-8 bytes
|         os.rename( greek_path, filepath.encode('utf-8') )

I would break this up into smaller pieces:

    filepath = greek_path.decode('iso-8859-7')
    # Rename current filename from greek bytes --> utf-8 bytes
    utf8_path = filepath.encode('utf-8')
    os.rename( greek_path, utf8_path )

That way if an exception is thrown you have a much better idea of
exactly which line had a problem.

|     except UnicodeDecodeError:
|         # Since its not a greek bytestring then its a proper utf8 bytestring
|         filepath = greek_path.decode('utf-8')

And here you have a logic error. The idea is ok, but the encode and
os.rename are not relevant to your UnicodeDecodeError check.
So do this:

    dirpath = b'/home/nikos/public_html/data/apps/'
    greek_filenames = os.listdir(dirpath)
    for greek_filename in greek_filenames:
        try:
            filename = greek_filename.decode('iso-8859-7')
        except UnicodeDecodeError:
            # Since its not a greek bytestring then its a proper utf8 bytestring
            # no need to rename it
            pass
        else:
            # Rename current filename from greek bytes --> utf-8 bytes
            utf8_filename = filename.encode('utf-8')
            greek_path = dirpath + greek_filename
            utf8_path = dirpath + utf8_filename
            os.rename( greek_path, utf8_path )

You should try/except only around exactly the code expected to raise
an exception, not extra stuff.

However, this code won't work. Because iso-8859-7 is an 8-bit character
set, it will almost _never_ fail to decode. Nearly all the byte values
are valid (in Python's codec only 0xAE, 0xD2 and 0xFF are unassigned),
so a UnicodeDecodeError is hardly ever raised and the test tells you
very little.

A better test might be to decode it as utf-8. If that fails, then
_guess_ that it is iso-8859-7 and rename the file, otherwise do not
touch it.

However, the real test is by eye: your program cannot deduce if a
filename is nonsense, but presumably a visual inspection will show
nonsense or sensible names.

So:

  write a standalone python program to fix a filename (provided as
  sys.argv[1]) using the code above

  get a utf-8 Putty terminal

  check the remote locale is utf-8

  do an "ls"

  for each nonsense file, run:

    python3 fix_filename.py nonsense-filename

You should augment your rename with a prior os.path.exists() test to
make sure you do not replace an existing file.

[...snip...]
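To make the steps above concrete, here is a minimal sketch of what
fix_filename.py might look like. It is untested against your real
files; the function names utf8_name_for and fix_filename are made up
for illustration, and the "if it is not utf-8 then it must be
iso-8859-7" rule is only the guess described above, not a reliable
detection.

```python
import os
import sys


def utf8_name_for(name_bytes, legacy_encoding="iso-8859-7"):
    """Return the UTF-8 form of a filename, or None if no rename is needed."""
    try:
        name_bytes.decode("utf-8")
    except UnicodeDecodeError:
        pass  # not valid UTF-8: fall through and guess the legacy encoding
    else:
        return None  # already valid UTF-8 (this includes pure ASCII)
    # This decode can itself raise UnicodeDecodeError for the few byte
    # values that the legacy encoding leaves unassigned.
    return name_bytes.decode(legacy_encoding).encode("utf-8")


def fix_filename(path_bytes):
    """Rename one file to its UTF-8 name; return the new path, or None."""
    dirpath, name = os.path.split(path_bytes)
    new_name = utf8_name_for(name)
    if new_name is None:
        return None
    new_path = os.path.join(dirpath, new_name)
    # Refuse to replace an existing file.
    if os.path.exists(new_path):
        raise FileExistsError(os.fsdecode(new_path))
    os.rename(path_bytes, new_path)
    return new_path


if __name__ == "__main__" and len(sys.argv) > 1:
    # os.fsencode() recovers the raw bytes of the command line argument.
    print(fix_filename(os.fsencode(sys.argv[1])))
```

Keeping the decision (utf8_name_for) apart from the rename means you
can check the guess on a few names before anything on disk changes.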
| nikos@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Error in sys.excepthook:
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] ValueError: underlying buffer has been detached
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Original exception was:
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 71, in
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]     os.rename( greek_path, filepath.encode('utf-8') )
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] FileNotFoundError: [Errno 2] \u0394\u03b5\u03bd \u03c5\u03c0\u03ac\u03c1\u03c7\u03b5\u03b9 \u03c4\u03ad\u03c4\u03bf\u03b9\u03bf \u03b1\u03c1\u03c7\u03b5\u03af\u03bf \u03ae \u03ba\u03b1\u03c4\u03ac\u03bb\u03bf\u03b3\u03bf\u03c2: '/home/nikos/public_html/data/apps/filename'

Well, I would guess 2 things are happening:

  - you construct a literal b'/home/nikos/public_html/data/apps/filename'
    at the top of your script (see my earlier remarks), hence the
    complaint that it does not exist

  - I would guess that the \uxxxx sequences are a Unicode transcription
    of the error message, transcribed as hex because they don't look
    "printable" in the current locale

Cheers,
-- 
Cameron Simpson

Louis Pasteur's theory of germs is ridiculous fiction.
        --Pierre Pachet, Professor of Physiology at Toulouse, 1872
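PS: the guess about the \uxxxx sequences can be checked with a few
lines of Python 3. This is only a sketch: the variable names are
arbitrary, and the escaped text is copied from the FileNotFoundError
line in the quoted log. It should turn out to be the Greek wording of
the usual "No such file or directory" message for errno 2.

```python
# Escaped text from the log, as a raw string so the backslashes stay literal.
escaped = (
    r"\u0394\u03b5\u03bd \u03c5\u03c0\u03ac\u03c1\u03c7\u03b5\u03b9 "
    r"\u03c4\u03ad\u03c4\u03bf\u03b9\u03bf \u03b1\u03c1\u03c7\u03b5\u03af\u03bf "
    r"\u03ae \u03ba\u03b1\u03c4\u03ac\u03bb\u03bf\u03b3\u03bf\u03c2"
)

# The unicode_escape codec turns each \uXXXX sequence back into its character.
message = escaped.encode("ascii").decode("unicode_escape")
print(message)

# Going the other way: encoding to ASCII with backslashreplace reproduces
# the escaped form, which is what happens when the output stream cannot
# represent the Greek characters directly.
roundtrip = message.encode("ascii", "backslashreplace").decode("ascii")
assert roundtrip == escaped
```

On a utf-8 terminal the print() shows readable Greek, so the text
itself is intact and only the log's transcription looks odd.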