Path: csiph.com!usenet.pasdenom.info!news.albasani.net!newsfeed.freenet.ag!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Date: Sat, 8 Jun 2013 12:49:31 +1000
From: Cameron Simpson <cs@zip.com.au>
To: =?utf-8?B?zp3Ouc66z4zOu86xzr/PgiDOms6/z43Pgc6xz4I=?= <nikos.gr33k@gmail.com>
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <7d8da6c9-fb92-4329-b207-4280f29ba664@googlegroups.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
References: <7d8da6c9-fb92-4329-b207-4280f29ba664@googlegroups.com>
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.2870.1370660227.3114.python-list@python.org>
Lines: 217
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:47360

On 07Jun2013 04:53, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
| On Friday, 7 June 2013 11:53:04 AM UTC+3, Cameron Simpson wrote:
| > | >| errors='replace' mean dont break in case or error?
| > 
| > | >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
| > | >for something that would not decode smoothly.
| > 
| > | How can it be correct? We have encoded out string in utf-8 and then
| > | we tried to decode it as greek-iso? How can this possibly be
| > | correct?
| 
| > If it is a valid iso-8859-7 sequence (which might cover everything, 
| > since I expect it is an 8-bit 1:1 mapping from bytes values to a 
| > set of codepoints, just like iso-8859-1) then it may decode to the 
| > "wrong" characters, but the reverse process (characters encoded as
| > bytes) should produce the original bytes.  With a mapping like this, 
| > errors='replace' may mean nothing; there will be no errors because
| > the only Unicode characters in play are all from iso-8859-7 to start
| > with. Of course another string may not be safe. 
| 
| > Visually, the names will be garbage. And if you go:
| >   mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
| > while using the iso-8859-7 locale, the wrong thing will occur
| > (assuming it even works, though I think it should because all these
| > characters are represented in iso-8859-7, yes?)
| 
| I understood all the rest; only the quotes above are still unclear to me.
| I can't seem to understand them.
| 
| Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st 0-127 codepoints similar?

Yes. It is certainly true for utf-8 and latin-iso and ASCII, and
greek-iso follows the same pattern: its bottom half is plain ASCII.

They're all essentially the ASCII set plus a range of other character
codepoints for the upper values.  The 8-bit sets iso-8859-1 (which
I take you to mean by "latin-iso") and iso-8859-7 (which I take you
to mean by "greek-iso") are single-byte mappings with the top half
mapped to characters commonly used in a particular region.

Unicode has a much greater range, but the UTF-8 encoding of Unicode
deliberately has the bottom 0-127 identical to ASCII, and higher
values represented by multibyte sequences in which every byte is in
the 128-255 range. In this way pure ASCII files are already in UTF-8
(and, in fact, work just fine for the iso-8859-x encodings as well).
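
A small sketch of that layout, if it helps. The sample strings are my
own choice, not anything from your script:

```python
# Pure ASCII text encodes to the same bytes in all four encodings.
encodings = ['ascii', 'utf-8', 'iso-8859-1', 'iso-8859-7']
ascii_results = {enc: 'abc'.encode(enc) for enc in encodings}
print(ascii_results)

# Non-ASCII characters become multibyte sequences in UTF-8, and every
# byte of such a sequence lies in the 128-255 range.
omega_bytes = 'Ω'.encode('utf-8')
euro_bytes = '€'.encode('utf-8')
print(len(omega_bytes), [hex(b) for b in omega_bytes])
print(len(euro_bytes), [hex(b) for b in euro_bytes])
```

Because no byte of a multibyte sequence falls below 128, an ASCII
character can never be mistaken for part of one.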

| For example char 'a' has the value of '65' for all of those character sets?
| Is that what you mean?

Yes: the same value in all of them. (Strictly, 'a' is 97; 65 is 'A'.)

| s = 'a'  (This is unicode right?  Why when we assign a string to
| a variable that string's type is always unicode and does not
| automatically become utf-8 which includes all available world-wide
| characters? Unicode is something different than a character set? )

In Python 3, yes. Strings are unicode. Note that that means they are
sequences of codepoints whose meaning is as for Unicode.

"utf-8" is a byte encoding for Unicode strings. An external storage
format, if you like. The first 0-127 codepoints are 1:1 with byte
values, and the higher code points require multibyte sequences.
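
To make the string/bytes distinction concrete, here is a minimal
sketch using the Greek letter from your later example. The variable
names are just for illustration:

```python
s = 'α'                               # a str: one Unicode code point
print(len(s), hex(ord(s)))            # length 1, code point U+03B1

utf8_bytes = s.encode('utf-8')        # storage form: two bytes
greek_bytes = s.encode('iso-8859-7')  # storage form: one byte
print(utf8_bytes, greek_bytes)

# Each byte form decodes back to the same string with its own codec.
from_utf8 = utf8_bytes.decode('utf-8')
from_greek = greek_bytes.decode('iso-8859-7')
print(from_utf8 == s, from_greek == s)
```

The string is the same in both cases; only the external byte
representation differs.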

| utf8_byte = s.encode('utf-8')

Unicode string => utf-8 byte encoding.

| Now if we are to decode this back to utf8 we will receive the char 'a'.

Yes.

| I believe the same thing will happen with the latin, greek and ascii encodings. Correct?
| 
| utf8_a = utf8_byte.decode('iso-8859-7')
| latin_a = utf8_byte.decode('iso-8859-1')
| ascii_a = utf8_byte.decode('ascii')
| utf8_a = utf8_byte.decode('iso-8859-7')
| 
| Is this correct? 

Yes, because of the design decision about the 0-127 codepoints.

| All of those decodes will work even if the encoded bytestring was of utf8 type?
| 
| The characters that will not decode correctly are those whose codepoints are greater than 127?
| for example if s = 'α' (greek character equivalent to english 'a')
| Is this what you mean?

Yes, exactly so.
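
Here is roughly what happens to that 'α', as a sketch you can paste
into an interpreter. It also shows the point from my earlier message
that the mangling can be reversed:

```python
utf8_bytes = 'α'.encode('utf-8')

# Decoding these bytes with the wrong 8-bit codec raises no error here;
# each byte is simply mapped to a character on its own.
as_greek = utf8_bytes.decode('iso-8859-7')
as_latin = utf8_bytes.decode('iso-8859-1')
print(len(as_greek), len(as_latin))   # two characters each, not one

# ASCII has nothing above 127, so it refuses outright.
try:
    utf8_bytes.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)

# Re-encoding with the same wrong codec gives back the original bytes.
round_trip = as_greek.encode('iso-8859-7')
print(round_trip == utf8_bytes)
```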

| --------------------------------
| 
| Now back to my almost ready files.py script please:
| 
| 
| #========================================================
| # Collect filenames of the path dir as bytes
| greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' )
| 
| for filename in greek_filenames:
| 	# Compute 'path/to/filename' in bytes
| 	greek_path = b'/home/nikos/public_html/data/apps/' + b'filename'

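You don't mean b'filename', which is the literal word "filename".
You mean the variable filename, which is already a bytes object
because you passed a bytes path to os.listdir(); it needs no
.encode() call.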

More probably, you mean:

  dirpath = b'/home/nikos/public_html/data/apps/'
  greek_filenames = os.listdir(dirpath)
  for greek_filename in greek_filenames:
    # os.listdir() returns bytes names because dirpath is bytes
    filename = greek_filename.decode('iso-8859-7')

and then, still inside the loop:

    greek_path = dirpath + greek_filename
    utf8_filename = filename.encode('utf-8')
    utf8_path = dirpath + utf8_filename

| 	try:
| 		filepath = greek_path.decode('iso-8859-7')
| 		# Rename current filename from greek bytes --> utf-8 bytes
| 		os.rename( greek_path, filepath.encode('utf-8') )

I would break this up into smaller pieces:

  filepath = greek_path.decode('iso-8859-7')
  # Rename current filename from greek bytes --> utf-8 bytes
  utf8_path = filepath.encode('utf-8')
  os.rename( greek_path, utf8_path )

That way if an exception is thrown you have a much better idea of
exactly which line had a problem.

| 	except UnicodeDecodeError:
| 		# Since its not a greek bytestring then its a proper utf8 bytestring
| 		filepath = greek_path.decode('utf-8')

And here you have a logic error. The idea is ok, but the encode and os.rename are not
relevant to your UnicodeDecodeError check. So do this:

  dirpath = b'/home/nikos/public_html/data/apps/'
  greek_filenames = os.listdir(dirpath)
  for greek_filename in greek_filenames:
    try:
      filename = greek_filename.decode('iso-8859-7')
    except UnicodeDecodeError:
      # Not a greek bytestring, so assume it is already a proper utf8 bytestring
      # no need to rename it
      pass
    else:
      # Rename current filename from greek bytes --> utf-8 bytes
      utf8_filename = filename.encode('utf-8')
      greek_path = dirpath + greek_filename
      utf8_path = dirpath + utf8_filename
      os.rename( greek_path, utf8_path )

You should try/except only around exactly the code expected to raise
an exception, not extra stuff.

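However, this code won't work. Because iso-8859-7 is an 8-bit
character set, it will almost _never_ fail to decode: nearly every
byte value is a valid byte (as far as I know Python's codec leaves
only a few values, such as 0xAE, undefined). So a UnicodeDecodeError
is rarely raised, and utf-8 names would usually be renamed too.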

A better test might be to decode it as utf-8. If that fails, then
_guess_ that it is iso-8859-7 and rename the file, otherwise do not
touch it.
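
A minimal sketch of that reversed test. The function names are
invented here for illustration, and the iso-8859-7 fallback is only
the guess described above, not something the code can verify:

```python
def looks_like_utf8(raw_name):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw_name.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


def utf8_name_for(raw_name):
    """Return the new UTF-8 name, or None when no rename is needed."""
    if looks_like_utf8(raw_name):
        return None
    # Not valid UTF-8: guess iso-8859-7 and re-encode as UTF-8.
    # This may still raise if the guess is wrong.
    return raw_name.decode('iso-8859-7').encode('utf-8')


old_style = 'Ευχή.mp3'.encode('iso-8859-7')   # name in the old encoding
new_style = 'Ευχή.mp3'.encode('utf-8')        # name already converted
print(utf8_name_for(old_style))
print(utf8_name_for(new_style))
```

This works because Greek text in iso-8859-7 is very unlikely to form
valid UTF-8 by accident, whereas the opposite test almost always
passes.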

However, the real test is by eye: your program cannot deduce if a
filename is nonsense, but presumably a visual inspection will show
nonsense or sensible names.

So:

  write a standalone python program to fix a filename
    (provided as sys.argv[1])
    using the code above

  get a utf-8 PuTTY terminal
  check the remote locale is utf-8
  do an "ls"

  for each nonsense file, run:
    python3 fix_filename.py nonsense-filename

You should augment your rename with a prior os.path.exists() test
to make sure you do not replace an existing file.
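
Putting those pieces together, fix_filename.py might look something
like the sketch below. The helper names are my own invention, and it
assumes the name really is iso-8859-7 whenever it is not valid utf-8.
It uses os.fsencode() to get the raw bytes of the command line
argument back, since Python 3 hands you sys.argv as strings:

```python
import os
import sys


def utf8_target(raw_path):
    """Return the UTF-8 path to rename raw_path to, or None to leave it."""
    dirpath, raw_name = os.path.split(raw_path)
    try:
        raw_name.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8, so guess iso-8859-7 and re-encode.
        utf8_name = raw_name.decode('iso-8859-7').encode('utf-8')
        return os.path.join(dirpath, utf8_name)
    return None


def fix_filename(path):
    """Rename one file; return the new path, or None if nothing was done."""
    raw_path = os.fsencode(path)
    new_path = utf8_target(raw_path)
    if new_path is None:
        return None
    if os.path.exists(new_path):
        # Never replace an existing file.
        print('not renaming, target exists:', new_path, file=sys.stderr)
        return None
    os.rename(raw_path, new_path)
    return new_path


if __name__ == '__main__':
    for arg in sys.argv[1:]:
        fix_filename(arg)
```

Names that already decode as utf-8 are left alone, so running it
twice on the same file should be harmless.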

[...snip...]
| nikos@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Error in sys.excepthook:
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] ValueError: underlying buffer has been detached
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Original exception was:
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 71, in <module>
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]     os.rename( greek_path, filepath.encode('utf-8') )
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] FileNotFoundError: [Errno 2] \\u0394\\u03b5\\u03bd \\u03c5\\u03c0\\u03ac\\u03c1\\u03c7\\u03b5\\u03b9 \\u03c4\\u03ad\\u03c4\\u03bf\\u03b9\\u03bf \\u03b1\\u03c1\\u03c7\\u03b5\\u03af\\u03bf \\u03ae \\u03ba\\u03b1\\u03c4\\u03ac\\u03bb\\u03bf\\u03b3\\u03bf\\u03c2: '/home/nikos/public_html/data/apps/filename'
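
If you want to read the \\uXXXX runs in that last line, one way is to
undo the escaping by hand. A rough sketch, assuming they are ordinary
Python-style Unicode escapes whose backslashes were doubled by the
logging:

```python
# The escaped text copied from the log line, kept raw so that the
# backslashes stay literal.
logged = (r'\\u0394\\u03b5\\u03bd \\u03c5\\u03c0\\u03ac\\u03c1\\u03c7\\u03b5\\u03b9 '
          r'\\u03c4\\u03ad\\u03c4\\u03bf\\u03b9\\u03bf \\u03b1\\u03c1\\u03c7\\u03b5\\u03af\\u03bf '
          r'\\u03ae \\u03ba\\u03b1\\u03c4\\u03ac\\u03bb\\u03bf\\u03b3\\u03bf\\u03c2')

# Collapse each doubled backslash to a single one, then let the
# unicode_escape codec turn \uXXXX into the real characters.
single = logged.replace('\\\\', '\\')
message = single.encode('ascii').decode('unicode_escape')

# In a utf-8 terminal, print(message) shows the Greek text.
print(len(message))
```

The result is the Greek system message for "No such file or
directory".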

Well, I would guess 2 things are happening:

  - you construct the literal b'/home/nikos/public_html/data/apps/filename'
    at the top of your script (see my earlier remarks), hence the
    complaint that it does not exist

  - I would guess that the \\uxxxx sequences are a Unicode transcription
    of the error message, written as hex escapes because the characters
    don't look "printable" in the current locale; decoded, they are
    the Greek for "No such file or directory"
Cheers,
-- 
Cameron Simpson <cs@zip.com.au>

Louis Pasteur's theory of germs is ridiculous fiction.
       --Pierre Pachet, Professor of Physiology at Toulouse, 1872