

Groups > comp.lang.python > #47322 > unrolled thread

Re: Changing filenames from Greeklish => Greek (subprocess complain)

Started by: Cameron Simpson <cs@zip.com.au>
First post: 2013-06-07 18:53 +1000
Last post: 2013-06-10 13:28 -0700
Articles 20 on this page of 68 — 14 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-07 18:53 +1000
    Re: Changing filenames from Greeklish => Greek (subprocess complain) alex23 <wuwei23@gmail.com> - 2013-06-07 02:41 -0700
    Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-07 04:53 -0700
      Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-07 15:29 +0100
        Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-07 11:52 -0700
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Zero Piraeus <schesis@gmail.com> - 2013-06-07 15:31 -0400
          Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-07 21:45 +0100
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Zero Piraeus <schesis@gmail.com> - 2013-06-07 19:24 -0400
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-08 12:52 +1000
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-07 23:49 -0700
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-08 16:58 +1000
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-08 07:26 +0000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-08 17:40 +1000
              Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-08 17:32 +0100
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 09:53 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 10:35 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-08 18:48 +0100
      Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-07 15:33 +0000
      Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-08 12:49 +1000
      Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 21:01 +0300
        Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-08 19:01 +0000
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 14:14 -0700
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 08:32 +1000
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 07:46 +0300
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 06:25 +0000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 18:02 +1000
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:03 -0700
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 14:21 -0700
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-09 08:10 +1000
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 01:11 -0700
      Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-09 04:47 +1000
        Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-08 22:09 -0700
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 06:45 +0000
            Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-09 00:00 -0700
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 08:15 +0000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:14 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 03:32 -0700
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 19:16 +1000
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 12:36 +0000
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-09 10:25 -0700
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Lele Gaifax <lele@metapensiero.it> - 2013-06-09 10:55 +0200
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:08 -0700
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Lele Gaifax <lele@metapensiero.it> - 2013-06-09 11:20 +0200
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:38 -0700
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Andreas Perstinger <andipersti@gmail.com> - 2013-06-09 14:24 +0200
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 13:13 +0000
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Benjamin Kaplan <benjamin.kaplan@case.edu> - 2013-06-09 13:05 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:42 -0700
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 03:37 -0700
                      Re: Changing filenames from Greeklish => Greek (subprocess complain) Larry Hudson <orgnut@yahoo.com> - 2013-06-10 00:51 -0700
                        Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 01:11 -0700
                          Re: Changing filenames from Greeklish => Greek (subprocess complain) Larry Hudson <orgnut@yahoo.com> - 2013-06-11 00:20 -0700
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 11:50 +0000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 05:18 -0700
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:00 -0700
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 19:12 +1000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:20 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Benjamin Kaplan <benjamin.kaplan@case.edu> - 2013-06-09 13:01 -0700
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 12:31 +0000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-10 00:10 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Andreas Perstinger <andipersti@gmail.com> - 2013-06-10 10:15 +0200
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 01:54 -0700
                      Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 02:59 -0700
                        Re: Changing filenames from Greeklish => Greek (subprocess complain) Andreas Perstinger <andipersti@gmail.com> - 2013-06-10 12:42 +0200
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-10 11:59 +0000
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 07:27 -0700
                      Re: Changing filenames from Greeklish => Greek (subprocess complain) jmfauth <wxjmfauth@gmail.com> - 2013-06-10 12:48 -0700
                        Re: Changing filenames from Greeklish => Greek (subprocess complain) Ned Batchelder <ned@nedbatchelder.com> - 2013-06-10 13:28 -0700



#47322 — Re: Changing filenames from Greeklish => Greek (subprocess complain)

From: Cameron Simpson <cs@zip.com.au>
Date: 2013-06-07 18:53 +1000
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)
Message-ID: <mailman.2848.1370597084.3114.python-list@python.org>
On 07Jun2013 09:56, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= <nikos.gr33k@gmail.com> wrote:
| On 7/6/2013 4:01 πμ, Cameron Simpson wrote:
| >On 06Jun2013 11:46, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= <nikos.gr33k@gmail.com> wrote:
| >| Τη Πέμπτη, 6 Ιουνίου 2013 3:44:52 μ.μ. UTC+3, ο χρήστης Steven D'Aprano έγραψε:
| >| > py> s = '999-Eυχή-του-Ιησού'
| >| > py> bytes_as_utf8 = s.encode('utf-8')
| >| > py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
| >| > py> print(t)
| >| > 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
| >|
| >| errors='replace' mean dont break in case or error?
| >
| >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
| >for something that would not decode smoothly.
|
| How can it be correct? We have encoded out string in utf-8 and then
| we tried to decode it as greek-iso? How can this possibly be
| correct?

Ok, not correct. But consistent. Safe to call.

If it is a valid iso-8859-7 sequence (which might cover everything,
since I expect it is an 8-bit 1:1 mapping from bytes values to a
set of codepoints, just like iso-8859-1) then it may decode to the
"wrong" characters, but the reverse process (characters encoded as
bytes) should produce the original bytes.  With a mapping like this,
errors='replace' may mean nothing; there will be no errors because
the only Unicode characters in play are all from iso-8859-7 to start
with. Of course another string may not be safe.
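A minimal sketch of that round trip, assuming Python 3 and its bundled iso-8859-7 codec; the helper name `roundtrips` is made up for illustration. It suggests the mapping is not quite total: a few byte values (0xAE among them) appear to be unassigned in that codec, which would account for the replacement character in Steven's output.

```python
def roundtrips(text):
    """Encode text as UTF-8, misread the bytes as iso-8859-7, then encode back.

    Returns True only if the strict decode succeeds and re-encoding the
    "wrong" characters reproduces the original UTF-8 bytes.
    """
    raw = text.encode('utf-8')
    try:
        mangled = raw.decode('iso-8859-7')   # strict: no errors='replace'
    except UnicodeDecodeError:
        return False                         # hit a byte the codec leaves unassigned
    return mangled.encode('iso-8859-7') == raw

print(roundtrips('του'))   # every UTF-8 byte here has an iso-8859-7 meaning
print(roundtrips('ή'))     # UTF-8 bytes CE AE; 0xAE has none
```

When the strict decode fails, errors='replace' substitutes U+FFFD and the original byte is lost, so that name can no longer be recovered by reversing the process.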

| >| You took the unicode 's' string you utf-8 bytestringed it.
| >| Then how its possible to ask for the utf8-bytestring to decode
| >| back to unicode string with the use of a different charset that the
| >| one used for encoding and thsi actually printed the filename in
| >| greek-iso?
| >
| >It is easily possible, as shown above. Does it make sense? Normally
| >not, but Steven is demonstrating how your "mv" exercises have
| >behaved: a rename using utf-8, then a _display_ using iso-8859-7.
|
| Same as above, i don't understand it at all, since different
| charsets(encodings) used in the encode/decode process.

Visually, the names will be garbage. And if you go:

  mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'

while using the iso-8859-7 locale, the wrong thing will occur
(assuming it even works, though I think it should because all these
characters are represented in iso-8859-7, yes?)

Why?

In the iso-8859-7 locale, your (currently named under a utf-8
regime) file looks like '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' (because the
UTF-8 byte sequence maps to those characters in iso-8859-7). When
you issue the above "mv" command, the new name will be the _iso-8859-7_
bytes encoding for '999-Eυχή-του-Ιησού.mp3'.  Which, under a utf-8
regime, will decode to _other_ characters (or fail to decode at all).

If you want to repair filenames, by which I mean, cause them to be correctly
encoded for utf-8, you are best to work in utf-8 (using "mv" or python).

Of course, the badly named files will then look wrong in your listing.

If you _know_ the filenames were written using iso-8859-7 encoding, and that the names are "right" under that encoding, you can write python code to rename them to utf-8.

Totally untested example code:

  import os
  import sys
  from binascii import hexlify

  for bytename in os.listdir( b'.' ):
    unicode_name = bytename.decode('iso-8859-7')
    new_bytename = unicode_name.encode('utf-8')
    print("%s: %s => %s" % (unicode_name, hexlify(bytename), hexlify(new_bytename)), file=sys.stderr)
    os.rename(bytename, new_bytename)

That code should not care what locale you are using because it uses
bytes for the file calls and is explicit about the encoding/decoding
steps.

| >| a) WHAT does it mean when a linux system is set to use utf-8?
| >
| >It means the locale settings _for the current process_ are set for
| >UTF-8. The "locale" command will show you the current state.
|
| That means that, when a linux application needs to saved a filename
| to the linux filesystem, the app checks the filesytem's 'locale', so
| to encode the filename using the utf-8 charset ?

At the command line, many will not. They'll just read and write bytes.

Some will decode/encode. Those that do, should by default use the
current locale.

But broadly, it is GUI apps that care about this because they must
translate byte sequences to glyphs: images of characters. So plenty
of command line tools do not need to care; the terminal application
is the one that presents the names to you; _it_ will decode them
for display. And it is the terminal app that translates your
keystrokes into bytes to feed to the command line.

NOTE: it is NOT the filesystem's locale. It is the current process'
locale, which is deduced from environment variables (which have
defaults if they are not set).

Under Windows I believe filesystems have locales; this can prevent
you storing some files on some filesystems on Windows, because the
filesystem doesn't cope. UNIX just takes bytes.

| And likewise when a linux application wants to decode a filename is
| also checking the filesystem's 'locale' setting so to know what
| charset must use to decode the filename correctly back to the
| original string?

Again, NOT the filesystem's locale. The process' locale. The
filesystem filenames are just bytes.

| So locale is used for filesystem itself and linux apps to know how
| to read(decode) and write(enode) filenames from/into the system's
| hdd?

NOT THE FILESYSTEM LOCALE. There is no filesystem locale.

If you look at:

  http://docs.python.org/3/library/sys.html#sys.getfilesystemencoding

you'll see it does not talk about a property of the filesystem, but
the behaviour that will be used when storing filenames.
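As a rough illustration, assuming Python 3 on a POSIX system, the following prints the encodings the current process has settled on and shows that arbitrary filename bytes survive a trip through str and back. The sample name is invented for the example.

```python
import locale
import os
import sys

# Encoding Python uses to turn str filenames into bytes and back.
# It comes from the process environment, not from anything stored on disk.
print(sys.getfilesystemencoding())

# Encoding implied by the current process locale (LANG / LC_ALL / LC_CTYPE).
print(locale.getpreferredencoding(False))

# Filename bytes written by some other program using iso-8859-7.
raw = 'Ευχή.mp3'.encode('iso-8859-7')

# os.fsdecode() smuggles undecodable bytes through as lone surrogates
# (the "surrogateescape" handler), so os.fsencode() can restore them exactly.
text = os.fsdecode(raw)
print(ascii(text))
print(os.fsencode(text) == raw)
```

The decoded text may look wrong, but the bytes handed back to the operating system are unchanged, which fits the point that the filesystem itself just stores bytes.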

| >| c) WHAT happens when the two of them try to work together?
| >
| >If everything matches, it is all good. If the locales do not match,
| >the mismatch will result in an undesired bytes<->characters
| >encode/decode step somewhere, and something will display incorrectly
| >or be entered as input incorrectly.
| 
| Cant quite grasp the idea:
| 
| local end: Win8,  locale = greek-iso
| remote end: CentOS 6.4,  locale = utf-8

What makes you think the remote end is utf-8?
When you say "locale = utf-8", _exactly_ what does that mean to you?

| FileZilla by default uses "do not know what charset" to upload filenames

Then at a guess it uploaded the filenames as greek-iso byte sequences.
The filenames on disc will be greek-iso byte sequences.

| Putty by default uses greek-iso to display filenames

Then it will look ok, superficially, I would expect.

| WHAT someone can expect to happen when all of the above work together?
| Mess of course, but i want to hear in detail each step of the mess
| as it emerges.

There are several steps, for example:

  FileZilla will pass filenames to the remote end (FTP, SFTP, maybe)
  as bytes.  What those bytes will be will depend on FileZilla.
  The UNIX end probably accepts them as-is and uses them directly.
  So the filenames on disc would probably be greek-iso byte sequences.

  Running a /bin/ls ("ls" without the alias, with no special options)
  should present these byte sequences to the Terminal, which will
  decode them using its locale (greek-iso?)

  Running a "/bin/ls -b" (using the -b option from the ls alias)
  will "print octal escapes for nongraphic characters". So "ls"
  must decide what are nongraphic characters. It does this by
  decoding the filenames using the _remote_ locale (its own locale).
  So it will decode the greek-iso byte sequences as though they
  were utf-8.  Anything in the ASCII range (0-127, which will
  represent the same characters in utf-8, iso-8859-1 or iso-8859-7),
  the boring Roman alphabet range, will be treated the same. But
  outside that range the byte sequence will be taken to mean different
  characters depending on the locale.
  So "ls -b" will decide some of the greek-iso byte sequences do not
  represent printable characters, and will decide to print octal.

  Experiment:

    LC_ALL=C ls -b
    LC_ALL=utf-8 ls -b
    LC_ALL=iso-8859-7 ls -b

  And the Terminal itself is decoding the output for display, and
  encoding your input keystrokes to feed as input to the command
  line.
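The decoding mismatch in these steps can be sketched in Python 3. This is only an approximation of what "ls" or a terminal does, and the function name `view_as` and the sample name are invented for the example.

```python
# Bytes as an uploader might have written them: a Greek name in iso-8859-7.
raw = 'Ευχή.mp3'.encode('iso-8859-7')

def view_as(raw_name, encoding):
    """Decode raw_name as a tool in that locale might, marking bad bytes with U+FFFD."""
    return raw_name.decode(encoding, errors='replace')

for encoding in ('ascii', 'utf-8', 'iso-8859-7'):
    # ascii() keeps the output readable whatever the terminal's own encoding is.
    print(encoding, ascii(view_as(raw, encoding)))
```

Only the iso-8859-7 view recovers the intended name. Under ascii or utf-8 the high bytes are not valid, which is roughly the situation in which "ls -b" falls back to octal escapes.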

You would be best setting your Windows box to UTF-8, matching how
you intend to work on the remote UNIX host. I do not know what
ramifications that may have for your local filesystems or text
files.

Cheers,
-- 
Cameron Simpson <cs@zip.com.au>

Humans are incapable of securely storing high quality cryptographic
keys and they have unacceptable speed and accuracy when performing
cryptographic operations.  (They are also large, expensive to maintain,
difficult to manage and they pollute the environment.) It is astonishing
that these devices continue to be manufactured and deployed. But they
are sufficiently pervasive that we must design our protocols around
their limitations.      - C Kaufman, R Perlman, M Speciner
                          _Network Security: PRIVATE Communication in a
                           PUBLIC World_, Prentice Hall, 1995, pp. 205.



#47323

From: alex23 <wuwei23@gmail.com>
Date: 2013-06-07 02:41 -0700
Message-ID: <af47e6b1-3abd-4c4b-975e-ca704e8c6887@mq5g2000pbb.googlegroups.com>
In reply to: #47322
On Jun 7, 6:53 pm, Cameron Simpson <c...@zip.com.au> wrote:
>   Experiment:
>
>     LC_ALL=C ls -b
>     LC_ALL=utf-8 ls -b
>     LC_ALL=iso-8859-7 ls -b
>
>   And the Terminal itself is decoding the output for display, and
>   encoding your input keystrokes to feed as input to the command
>   line.

This reminded me of something I saw on stackoverflow recently:
http://stackoverflow.com/questions/11735363/python3-unicodeencodeerror-only-when-run-from-crontab

Script would run from shell but not from crontab, as the crontab
environment had different locale settings. Solution was to prepend the
correct LC_CTYPE to the command in the crontab. Would it be similar
for httpd processes?
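A small sketch of how a child process's text encoding follows the environment it is started with, which is the mechanism at work under cron and plausibly under httpd as well. It uses CPython's PYTHONIOENCODING variable rather than LC_CTYPE, purely because locale names differ between systems; that substitution and the helper name `child_stdout_encoding` are choices made for this example, not something from the linked answer.

```python
import os
import subprocess
import sys

def child_stdout_encoding(extra_env):
    """Start a child Python with extra environment variables and
    return the encoding it chose for sys.stdout."""
    env = dict(os.environ, **extra_env)
    result = subprocess.run(
        [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()

# Whatever the inherited environment implies (this varies by system).
print(child_stdout_encoding({}))

# Forced explicitly, much like prefixing a crontab command with a variable.
print(child_stdout_encoding({'PYTHONIOENCODING': 'utf-8'}))
```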



#47326

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-07 04:53 -0700
Message-ID: <7d8da6c9-fb92-4329-b207-4280f29ba664@googlegroups.com>
In reply to: #47322
Τη Παρασκευή, 7 Ιουνίου 2013 11:53:04 π.μ. UTC+3, ο χρήστης Cameron Simpson έγραψε:
> On 07Jun2013 09:56, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= <nikos.gr33k@gmail.com> wrote:
> 
> | On 7/6/2013 4:01 πμ, Cameron Simpson wrote:
> 
> | >On 06Jun2013 11:46, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= <nikos.gr33k@gmail.com> wrote:
> 
> | >| Τη Πέμπτη, 6 Ιουνίου 2013 3:44:52 μ.μ. UTC+3, ο χρήστης Steven D'Aprano έγραψε:
> 
> | >| > py> s = '999-Eυχή-του-Ιησού'
> 
> | >| > py> bytes_as_utf8 = s.encode('utf-8')
> 
> | >| > py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
> 
> | >| > py> print(t)
> 
> | >| > 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
> 
> | >|
> 
> | >| errors='replace' mean dont break in case or error?
> 
> | >
> 
> | >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
> 
> | >for something that would not decode smoothly.
> 
> |
> 
> | How can it be correct? We have encoded out string in utf-8 and then
> 
> | we tried to decode it as greek-iso? How can this possibly be
> 
> | correct?

> If it is a valid iso-8859-7 sequence (which might cover everything, 
> since I expect it is an 8-bit 1:1 mapping from bytes values to a 
> set of codepoints, just like iso-8859-1) then it may decode to the 
> "wrong" characters, but the reverse process (characters encoded as
> bytes) should produce the original bytes.  With a mapping like this, 
> errors='replace' may mean nothing; there will be no errors because
> the only Unicode characters in play are all from iso-8859-7 to start
> with. Of course another string may not be safe. 

> Visually, the names will be garbage. And if you go:
>   mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
> while using the iso-8859-7 locale, the wrong thing will occur
> (assuming it even works, though I think it should because all these
> characters are represented in iso-8859-7, yes?)

I understood all the rest; only the quotes above are still unclear to me.
I can't seem to understand them.

Do you mean that utf-8, latin-iso, greek-iso and ASCII share the same first 128 codepoints (0-127)?

For example, does the char 'a' have the value 97 in all of those character sets?
Is that what you mean?

s = 'a'  (This is unicode, right?  Why is a string assigned to a variable always of type unicode, rather than automatically becoming utf-8, which includes all available world-wide characters? Is Unicode something different from a character set?)

utf8_byte = s.encode('utf-8')

Now if we decode this back as utf8 we will receive the char 'a'.
I believe the same thing will happen with the latin, greek and ascii charsets. Correct?

greek_a = utf8_byte.decode('iso-8859-7')
latin_a = utf8_byte.decode('iso-8859-1')
ascii_a = utf8_byte.decode('ascii')
utf8_a = utf8_byte.decode('utf-8')

Is this correct?
Will all of those decodes work even though the bytestring was encoded as utf8?

Are the characters that will not decode correctly those whose codepoints are greater than 127?

For example, if s = 'α' (the greek character equivalent to english 'a')

Is this what you mean?
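The question about the shared 0-127 range can be checked directly. A small sketch, assuming Python 3; the helper `try_encode` is made up for the example.

```python
encodings = ('utf-8', 'iso-8859-1', 'iso-8859-7', 'ascii')

def try_encode(ch, enc):
    """Return the bytes for ch in enc, or None if enc cannot represent it."""
    try:
        return ch.encode(enc)
    except UnicodeEncodeError:
        return None

# 'a' is codepoint 97 and becomes the same single byte in every encoding.
print(ord('a'))
print({enc: try_encode('a', enc) for enc in encodings})

# Greek 'α' is outside the ASCII range: two bytes in utf-8, one (different)
# byte in iso-8859-7, and not representable in iso-8859-1 or ascii.
print(ord('α'))
print({enc: try_encode('α', enc) for enc in encodings})
```

So bytes produced from pure-ASCII text decode identically under all four, while anything above 127 depends on which encoding wrote it.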
--------------------------------

Now back to my almost ready files.py script please:


#========================================================
# Collect filenames of the path dir as bytes
greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' )

for filename in greek_filenames:
	# Compute 'path/to/filename' in bytes
	greek_path = b'/home/nikos/public_html/data/apps/' + b'filename'
	try:
		filepath = greek_path.decode('iso-8859-7')
		
		# Rename current filename from greek bytes --> utf-8 bytes
		os.rename( greek_path, filepath.encode('utf-8') )
	except UnicodeDecodeError:
		# Since its not a greek bytestring then its a proper utf8 bytestring
		filepath = greek_path.decode('utf-8')


#========================================================
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
	try:
		# Check the presence of a file against the database and insert if it doesn't exist
		cur.execute('''SELECT url FROM files WHERE url = %s''', filename )
		data = cur.fetchone()
		
		if not data:
			# First time for file; primary key is automatic, hit is defaulted 
			cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
	except pymysql.ProgrammingError as e:
		print( repr(e) )


#========================================================
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = set()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
	filepaths.add( filename )

# Delete spurious 
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
	if rec not in filepaths:
		cur.execute('''DELETE FROM files WHERE url = %s''', rec )

=======================

nikos@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Error in sys.excepthook:
[Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] ValueError: underlying buffer has been detached
[Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]
[Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Original exception was:
[Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
[Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 71, in <module>
[Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]     os.rename( greek_path, filepath.encode('utf-8') )
[Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] FileNotFoundError: [Errno 2] \\u0394\\u03b5\\u03bd \\u03c5\\u03c0\\u03ac\\u03c1\\u03c7\\u03b5\\u03b9 \\u03c4\\u03ad\\u03c4\\u03bf\\u03b9\\u03bf \\u03b1\\u03c1\\u03c7\\u03b5\\u03af\\u03bf \\u03ae \\u03ba\\u03b1\\u03c4\\u03ac\\u03bb\\u03bf\\u03b3\\u03bf\\u03c2: '/home/nikos/public_html/data/apps/filename'


?????



#47330

From: MRAB <python@mrabarnett.plus.com>
Date: 2013-06-07 15:29 +0100
Message-ID: <mailman.2852.1370615365.3114.python-list@python.org>
In reply to: #47326
On 07/06/2013 12:53, Νικόλαος Κούρας wrote:
[snip]
>
> #========================================================
> # Collect filenames of the path dir as bytes
> greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' )
>
> for filename in greek_filenames:
> 	# Compute 'path/to/filename' in bytes
> 	greek_path = b'/home/nikos/public_html/data/apps/' + b'filename'
> 	try:

This is a worse way of doing it because the ISO-8859-7 encoding has 1
byte per codepoint, meaning that it's more 'tolerant' (if that's the
word) of errors. A sequence of bytes that is actually UTF-8 can be
decoded as ISO-8859-7, giving gibberish.

UTF-8 is less tolerant, and it's the encoding that ideally you should
be using everywhere, so it's better to assume UTF-8 and, if it fails, 
try ISO-8859-7 and then rename so that any names that were ISO-8859-7
will be converted to UTF-8.

That's the reason I did it that way in the code I posted, but, yet
again, you've changed it without understanding why!
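The order described here (strict UTF-8 first, ISO-8859-7 only as a fallback) might look like the sketch below. It is not the code originally posted; `to_utf8_name` is a name invented for the example, and it assumes Python 3 with filenames handled as bytes.

```python
def to_utf8_name(raw_name):
    """Return raw_name as UTF-8 bytes, guessing its current encoding.

    UTF-8 is tried first because it is strict: bytes that are not UTF-8
    almost always fail to decode. ISO-8859-7 accepts nearly any byte, so
    trying it first would "succeed" on UTF-8 names and mangle them.
    """
    try:
        raw_name.decode('utf-8')
        return raw_name                      # already valid UTF-8; leave it alone
    except UnicodeDecodeError:
        # May still raise UnicodeDecodeError for the few bytes
        # that ISO-8859-7 leaves unassigned.
        return raw_name.decode('iso-8859-7').encode('utf-8')

print(to_utf8_name('Ευχή.mp3'.encode('utf-8')))
print(to_utf8_name('Ευχή.mp3'.encode('iso-8859-7')))
```

A rename loop would then call os.rename() only for names where the returned bytes differ from the original.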

> 		filepath = greek_path.decode('iso-8859-7')
> 		
> 		# Rename current filename from greek bytes --> utf-8 bytes
> 		os.rename( greek_path, filepath.encode('utf-8') )
> 	except UnicodeDecodeError:
> 		# Since its not a greek bytestring then its a proper utf8 bytestring
> 		filepath = greek_path.decode('utf-8')
>
[snip]



#47347

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-07 11:52 -0700
Message-ID: <b88ab423-fe91-4ba9-a25b-f42fbd33ecbc@googlegroups.com>
In reply to: #47330
Τη Παρασκευή, 7 Ιουνίου 2013 5:29:25 μ.μ. UTC+3, ο χρήστης MRAB έγραψε:

> This is a worse way of doing it because the ISO-8859-7 encoding has 1
> byte per codepoint, meaning that it's more 'tolerant' (if that's the 
> word) of errors. A sequence of bytes that is actually UTF-8 can be
> decoded as ISO-8859-7, giving gibberish.

> UTF-8 is less tolerant, and it's the encoding that ideally you should 
> be using everywhere, so it's better to assume UTF-8 and, if it fails,  
> try ISO-8859-7 and then rename so that any names that were ISO-8859-7
> will be converted to UTF-8.

Indeed, I wasn't aware of that. At the time I was under the impression that a string encoded to bytes using some charset could only be decoded back with that charset and no other. Since this is the case, here is my fix:


#========================================================
# Collect filenames of the path dir as bytes
filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

for filename in filename_bytes:
	# Compute 'path/to/filename' into bytes
	filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
	flag = False
	
	try:
		# Assume current file is utf8 encoded
		filepath = filepath_bytes.decode('utf-8')
		flag = 'utf8' 
	except UnicodeDecodeError:
		try:
			# Since current filename is not utf8 encoded then it has to be greek-iso encoded
			filepath = filepath_bytes.decode('iso-8859-7')
			flag = 'greek'
		except UnicodeDecodeError:
			print( '''I give up! File name is unreadable!''' )
	
	if( flag = 'greek' )
		# Rename filename from greek bytes --> utf-8 bytes
		os.rename( filepath_bytes, filepath.encode('utf-8') )


#========================================================
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
	try:
		# Check the presence of a file against the database and insert if it doesn't exist
		cur.execute('''SELECT url FROM files WHERE url = %s''', filename )
		data = cur.fetchone()
		
		if not data:
			# First time for file; primary key is automatic, hit is defaulted 
			cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
	except pymysql.ProgrammingError as e:
		print( repr(e) )


#========================================================
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = set()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
	filepaths.add( filename )

# Delete spurious 
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
	if rec not in filepaths:
		cur.execute('''DELETE FROM files WHERE url = %s''', rec )

=============================
nikos@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 81
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]     if( flag == 'greek' )
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]                         ^
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of script headers: files.py
-------------------------------
I don't know why that if statement errors.



#47349

From: Zero Piraeus <schesis@gmail.com>
Date: 2013-06-07 15:31 -0400
Message-ID: <mailman.2861.1370633508.3114.python-list@python.org>
In reply to: #47347
:

On 7 June 2013 14:52, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
File "/home/nikos/public_html/cgi-bin/files.py", line 81
> [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]     if( flag == 'greek' )
> [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]                         ^
> [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax
> [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of script headers: files.py
> -------------------------------
> i dont know why that if statement errors.

Oh for f... READ SOME DOCUMENTATION, FOR THE LOVE OF BOB!!! READ YOUR
OWN EFFING CODE!

Look at this:

  http://docs.python.org/2/tutorial/controlflow.html

Read it now? Of course not. Go away and read it.

Now have you read it? GO AND READ IT.

What does an if statement end with? Hint: yep, that's it.

 -[]z.



#47351

From: MRAB <python@mrabarnett.plus.com>
Date: 2013-06-07 21:45 +0100
Message-ID: <mailman.2863.1370637932.3114.python-list@python.org>
In reply to: #47347
On 07/06/2013 20:31, Zero Piraeus wrote:
> :
>
> On 7 June 2013 14:52, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
> File "/home/nikos/public_html/cgi-bin/files.py", line 81
>> [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]     if( flag == 'greek' )
>> [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]                         ^
>> [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax
>> [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of script headers: files.py
>> -------------------------------
>> i dont know why that if statement errors.
>
> Oh for f... READ SOME DOCUMENTATION, FOR THE LOVE OF BOB!!! READ YOUR
> OWN EFFING CODE!
>
> Look at this:
>
>    http://docs.python.org/2/tutorial/controlflow.html
>
> Read it now? Of course not. Go away and read it.
>
> Now have you read it? GO AND READ IT.
>
> What does an if statement end with? Hint: yep, that's it.
>
Have you noticed how the line in the traceback doesn't match the line
in the post?



#47356

From: Zero Piraeus <schesis@gmail.com>
Date: 2013-06-07 19:24 -0400
Message-ID: <mailman.2867.1370647512.3114.python-list@python.org>
In reply to: #47347
:

On 7 June 2013 16:45, MRAB <python@mrabarnett.plus.com> wrote:
> On 07/06/2013 20:31, Zero Piraeus wrote:
>> [something exasperated, in capitals]
>
> Have you noticed how the line in the traceback doesn't match the line
> in the post?

Actually, I hadn't. It's not exactly a surprise at this point, though ...

I learnt a new word today, while searching for an apt ending to the
sentence "Reading Nikos' posts is the internet equivalent of ..."

... and that word is Dermatillomania.

 -[]z.



#47359

From: Cameron Simpson <cs@zip.com.au>
Date: 2013-06-08 12:52 +1000
Message-ID: <mailman.2869.1370659956.3114.python-list@python.org>
In reply to: #47347
On 07Jun2013 11:52, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
| nikos@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 81
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]     if( flag == 'greek' )
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]                         ^
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of script headers: files.py
| -------------------------------
| i dont know why that if statement errors.

Python statements that continue (if, while, try etc) end in a colon, so:

  if flag == 'greek':

Cheers,
-- 
Cameron Simpson <cs@zip.com.au>

Hello, my name is Yog-Sothoth, and I'll be your eldritch horror today.
        - Heather Keith <hkeith+@andrew.cmu.edu>



#47370

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-07 23:49 -0700
Message-ID: <f68d66ae-7fcd-4bfd-b5e5-cc03f7a7127c@googlegroups.com>
In reply to: #47359
On Saturday, 8 June 2013 5:52:22 AM UTC+3, user Cameron Simpson wrote:
> On 07Jun2013 11:52, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
> 
> | nikos@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 81
> 
> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]     if( flag == 'greek' )
> 
> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]                         ^
> 
> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax
> 
> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of script headers: files.py
> 
> | -------------------------------
> 
> | i dont know why that if statement errors.
> 
> 
> 
> Python statements that continue (if, while, try etc) end in a colon, so:

Oh, I am very sorry.
Oh my God, I can't believe I missed a colon *again*:

I have corrected this:

#========================================================
# Collect filenames of the path dir as bytes
filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

for filename in filename_bytes:
	# Compute 'path/to/filename' into bytes
	filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
	flag = False
	
	try:
		# Assume current file is utf8 encoded
		filepath = filepath_bytes.decode('utf-8')
		flag = 'utf8' 
	except UnicodeDecodeError:
		try:
			# Since current filename is not utf8 encoded then it has to be greek-iso encoded
			filepath = filepath_bytes.decode('iso-8859-7')
			flag = 'greek'
		except UnicodeDecodeError:
			print( '''I give up! File name is unreadable!''' )
	
	if flag == 'greek':
		# Rename filename from greek bytes --> utf-8 bytes
		os.rename( filepath_bytes, filepath.encode('utf-8') )
==================================

Now everything was supposed to work, but instead I am getting this surrogate error once more.
What is this surrogate thing?

Since I make use of error catching and handling like 'except UnicodeDecodeError:',

then if the utf-8 decode fails for some reason, it should leave that file alone and try the next file?
	try:
		# Assume current file is utf8 encoded
		filepath = filepath_bytes.decode('utf-8')
		flag = 'utf8' 
	except UnicodeDecodeError:

This is what it is supposed to do, correct?

==================================
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 94, in <module>
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]     cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py", line 108, in execute
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]     query = query.encode(charset)
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'utf-8' codec can't encode character '\\udcce' in position 35: surrogates not allowed



#47373

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-06-08 16:58 +1000
Message-ID: <mailman.2879.1370674724.3114.python-list@python.org>
In reply to: #47370
On Sat, Jun 8, 2013 at 4:49 PM, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
> Oh my God i cant beleive i missed a colon *again*:

For most Python programmers, this is a matter of moments to solve. Run
the program, get a SyntaxError, fix it. Non-interesting event. (Maybe
even sooner than that, if the editor highlights it for you.) This is
why you really need to start yourself a testbox. DO NOT PLAY ON YOUR
LIVE SYSTEM. This is sysadminning 101. And Python programming 101: The
error traceback points to the error, or just after it.

Get to know how error messages work. This is not even Python-specific.
*Every* language behaves this way. You look at the highlighted line,
if you can't see an error there you look a little bit higher.
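
To illustrate how the reported location relates to the actual mistake, here is a minimal sketch that compiles a snippet with the colon missing and inspects where Python says the problem is. The snippet text and the "<example>" filename are made up for illustration; they are not from the script under discussion.

```python
# The source below is deliberately broken: the `if` line lacks its colon.
source = "flag = 'greek'\nif flag == 'greek'\n    print('rename')\n"

try:
    compile(source, "<example>", "exec")
except SyntaxError as err:
    # Keep the details in ordinary variables; `err` is unbound after the block.
    reported_line = err.lineno
    reported_text = err.text or ""
    print("SyntaxError reported on line", reported_line)
    print("offending text:", reported_text.rstrip())
```

The reported line is the `if` line itself, so reading that line (or the one just above it) is enough to spot the missing colon.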

You should not need to beg for help for such trivial problems. This is
the mark of a novice. You ought no longer to be a novice, based on how
long you've been doing this stuff. You ought not to behave like one.

ChrisA



#47375

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-06-08 07:26 +0000
Message-ID: <51b2dc9f$0$29966$c3e8da3$5496439d@news.astraweb.com>
In reply to: #47370
On Fri, 07 Jun 2013 23:49:17 -0700, Νικόλαος Κούρας wrote:

[...]
> Oh iam very sorry.
> Oh my God i cant beleive i missed a colon *again*:
> 
> I have corrected this:

[snip code]

Stop posting your code after every trivial edit!!!


-- 
Steven



#47376

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-06-08 17:40 +1000
Message-ID: <mailman.2881.1370677222.3114.python-list@python.org>
In reply to: #47375
On Sat, Jun 8, 2013 at 5:26 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Fri, 07 Jun 2013 23:49:17 -0700, Νικόλαος Κούρας wrote:
>
> [...]
>> Oh iam very sorry.
>> Oh my God i cant beleive i missed a colon *again*:
>>
>> I have corrected this:
>
> [snip code]
>
> Stop posting your code after every trivial edit!!!

I think he uses the python-list archives as ersatz source control.

ChrisA



#47387

From: MRAB <python@mrabarnett.plus.com>
Date: 2013-06-08 17:32 +0100
Message-ID: <mailman.2887.1370709169.3114.python-list@python.org>
In reply to: #47370
On 08/06/2013 07:49, Νικόλαος Κούρας wrote:
> On Saturday, 8 June 2013 5:52:22 AM UTC+3, user Cameron Simpson wrote:
>> On 07Jun2013 11:52, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
>>
>> | nikos@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 81
>>
>> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]     if( flag == 'greek' )
>>
>> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]                         ^
>>
>> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax
>>
>> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of script headers: files.py
>>
>> | -------------------------------
>>
>> | i dont know why that if statement errors.
>>
>>
>>
>> Python statements that continue (if, while, try etc) end in a colon, so:
>
> Oh iam very sorry.
> Oh my God i cant beleive i missed a colon *again*:
>
> I have corrected this:
>
> #========================================================
> # Collect filenames of the path dir as bytes
> filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )
>
> for filename in filename_bytes:
> 	# Compute 'path/to/filename' into bytes
> 	filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
> 	flag = False
> 	
> 	try:
> 		# Assume current file is utf8 encoded
> 		filepath = filepath_bytes.decode('utf-8')
> 		flag = 'utf8'
> 	except UnicodeDecodeError:
> 		try:
> 			# Since current filename is not utf8 encoded then it has to be greek-iso encoded
> 			filepath = filepath_bytes.decode('iso-8859-7')
> 			flag = 'greek'
> 		except UnicodeDecodeError:
> 			print( '''I give up! File name is unreadable!''' )
> 	
> 	if flag == 'greek':
> 		# Rename filename from greek bytes --> utf-8 bytes
> 		os.rename( filepath_bytes, filepath.encode('utf-8') )
> ==================================
>
> Now everythitng were supposed to work but instead iam getting this surrogate error once more.
> What is this surrogate thing?
>
> Since i make use of error cathcing and handling like 'except UnicodeDecodeError:'
>
> then it utf8's decode fails for some reason, it should leave that file alone and try the next file?
> 	try:
> 		# Assume current file is utf8 encoded
> 		filepath = filepath_bytes.decode('utf-8')
> 		flag = 'utf8'
> 	except UnicodeDecodeError:
>
> This is what it supposed to do, correct?
>
> ==================================
> [Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 94, in <module>
> [Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]     cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
> [Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py", line 108, in execute
> [Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]     query = query.encode(charset)
> [Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'utf-8' codec can't encode character '\\udcce' in position 35: surrogates not allowed
>
Look at the traceback.

It says that the exception was raised by:

     query = query.encode(charset)

which was called by:

     cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )

But what is 'filename'? And what has it to do with the first code
snippet? Does the traceback have _anything_ to do with the first code
snippet?
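
On the "surrogate" question itself: the '\udcce' in the traceback is the kind of character Python 3 produces when a byte that is not valid UTF-8 is decoded with the surrogateescape error handler, which os.listdir() applies to names when it is given a str path. A minimal sketch, assuming a name stored as ISO-8859-7 bytes (the sample name is hypothetical, not one of the real files):

```python
# Hypothetical filename bytes: Greek text stored as ISO-8859-7, not UTF-8.
raw_name = 'Ευχή'.encode('iso-8859-7')

# Roughly what os.listdir('some/str/path') does on a UTF-8 system:
# undecodable bytes are carried along as lone surrogates U+DC80..U+DCFF.
name = raw_name.decode('utf-8', errors='surrogateescape')
print(ascii(name))

# A strict UTF-8 encode (as a database driver would do) refuses surrogates.
try:
    name.encode('utf-8')
    strict_encode_failed = False
except UnicodeEncodeError as exc:
    strict_encode_failed = True
    print(exc.reason)

# The original bytes are still recoverable, so nothing has been lost.
recovered = name.encode('utf-8', errors='surrogateescape')
```

This would explain why a try/except around a bytes .decode() call elsewhere in the script never sees the problem: the failing string came from a different os.listdir() call.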



#47389

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-08 09:53 -0700
Message-ID: <d58ac1cc-2e4f-45d2-871f-5e2e4b3fbd30@googlegroups.com>
In reply to: #47387
Sorry for the delay guys, I was busy with other things today and I am still reading your responses; I haven't read them all yet, just Cameron's:

Here is what I have now, following Cameron's advice:


#========================================================
# Collect filenames of the path directory as bytes
path = b'/home/nikos/public_html/data/apps/'
filenames_bytes = os.listdir( path )

for filename_bytes in filenames_bytes:
	try:
		filename = filename_bytes.decode('utf-8)
	except UnicodeDecodeError:
		# Since its not a utf8 bytestring then its for sure a greek bytestring

		# Prepare arguments for rename to happen
		utf8_filename = filename_bytes.encode('utf-8')
		greek_filename = filename_bytes.encode('iso-8859-7')
		
		utf8_path = path + utf8_filename
		greek_path = path + greek_filename
		
		# Rename current filename from greek bytes --> utf8 bytes
		os.rename( greek_path, utf8_path )
==========================================

I know this is wrong though.
Since filename_bytes is the current filename encoded as utf8 or greek-iso,
I cannot just *encode* what is already encoded by doing this:

utf8_filename = filename_bytes.encode('utf-8')
greek_filename = filename_bytes.encode('iso-8859-7')



#47393

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-08 10:35 -0700
Message-ID: <a700ce98-6a7a-4d73-b853-a3db81860345@googlegroups.com>
In reply to: #47389
Okay, after also reading Steven's post, I was relieved from the stuck position I was in, so with an alteration of a few variable names here is the code now:


#========================================================
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )

for filename in files:
	# Compute 'path/to/filename'
	filepath_bytes = path + filename
	for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
		try: 
			filepath = filepath_bytes.decode( encoding )
		except UnicodeDecodeError:
			continue
        
		# Rename to something valid in UTF-8 
		if encoding != 'utf-8': 
			os.rename( filepath_bytes, filepath.encode('utf-8') )

		assert os.path.exists( filepath )
		break 
	else: 
		# This only runs if we never reached the break
		raise ValueError( 'unable to clean filename %r' % filepath_bytes ) 

=================================

I don't know why it is still failing when it tries to decode, since it tries 3 ways of decoding. Here is the exact error.


nikos@superhost.gr [~/www/cgi-bin]# [Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] Error in sys.excepthook:
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] ValueError: underlying buffer has been detached
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] Original exception was:
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 78, in <module>
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]     assert os.path.exists( filepath )
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]     os.stat(path)
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-37: ordinal not in range(128)
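
The last line of that traceback comes from os.stat() having to turn the str path back into bytes with the filesystem encoding, which appears to be ASCII for this CGI process (an inference from the 'ascii' codec in the message, not something stated in the post). One way around it, sketched below, is to keep paths as bytes throughout; the temporary directory and file name are hypothetical stand-ins for the real ones.

```python
import os
import sys
import tempfile

# What Python uses to convert str paths to bytes. Under a C/POSIX locale on
# older Python 3 releases this could be 'ascii', which has no Greek letters.
print(sys.getfilesystemencoding())

with tempfile.TemporaryDirectory() as tmp:
    dir_bytes = os.fsencode(tmp)
    # Build the path as bytes so no implicit str -> bytes encoding is needed.
    path_bytes = os.path.join(dir_bytes, 'Ευχή.mp3'.encode('utf-8'))
    with open(path_bytes, 'wb'):
        pass
    exists_as_bytes = os.path.exists(path_bytes)
    listed = os.listdir(dir_bytes)
```

In the posted loop, that would mean asserting on the bytes path that was just renamed to, rather than on the decoded str.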



#47394

From: MRAB <python@mrabarnett.plus.com>
Date: 2013-06-08 18:48 +0100
Message-ID: <mailman.2890.1370713724.3114.python-list@python.org>
In reply to: #47389
On 08/06/2013 17:53, Νικόλαος Κούρας wrote:
> Sorry for th delay guys, was busy with other thigns today and i am still reading your resposes, still ahvent rewad them all just Cameron's:
>
> Here is what i have now following Cameron's advices:
>
>
> #========================================================
> # Collect filenames of the path directory as bytes
> path = b'/home/nikos/public_html/data/apps/'
> filenames_bytes = os.listdir( path )
>
> for filename_bytes in filenames_bytes:
> 	try:
> 		filename = filename_bytes.decode('utf-8)
> 	except UnicodeDecodeError:
> 		# Since its not a utf8 bytestring then its for sure a greek bytestring
>
> 		# Prepare arguments for rename to happen
> 		utf8_filename = filename_bytes.encode('utf-8')
> 		greek_filename = filename_bytes.encode('iso-8859-7')
> 		
> 		utf8_path = path + utf8_filename
> 		greek_path = path + greek_filename
> 		
> 		# Rename current filename from greek bytes --> utf8 bytes
> 		os.rename( greek_path, utf8_path )
> ==========================================
>
> I know this is wrong though.

Yet you did it anyway!

> Since filename_bytes is the current filename encoded as utf8 or greek-iso
> then i cannot just *encode* what is already encoded by doing this:
>
> utf8_filename = filename_bytes.encode('utf-8')
> greek_filename = filename_bytes.encode('iso-8859-7')
>
Try reading and understanding the code I originally posted.



#47332

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-06-07 15:33 +0000
Message-ID: <51b1fd4b$0$9505$c3e8da3$5496439d@news.astraweb.com>
In reply to: #47326
On Fri, 07 Jun 2013 04:53:42 -0700, Νικόλαος Κούρας wrote:

> Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st
> 0-127 codepoints similar?

You can answer this yourself. Open a terminal window and start a Python 
interactive session. Then try it and see what happens:


s = ''.join(chr(i) for i in range(128))
bytes_as_utf8 = s.encode('utf-8')
bytes_as_latin1 = s.encode('latin-1')
bytes_as_greek_iso = s.encode('ISO-8859-7')
bytes_as_ascii = s.encode('ascii')

bytes_as_utf8 == bytes_as_latin1 == bytes_as_greek_iso == bytes_as_ascii


What result do you get? True or False?

And now you know the answer, without having to ask.


> For example char 'a' has the value of '65' for all of those character
> sets? Is hat what you mean?

You can answer that question yourself.

c = 'a'
for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
    print(c.encode(encoding))


By the way, I believe that Python has made a strategic mistake in the way 
that bytes are printed. I think it leads to more confusion, not less. 
Better would be something like this:

c = 'a'
for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
    print(hex(c.encode(encoding)[0]))


For historical reasons, most (but not all) charsets are supersets of 
ASCII. That is, the first 128 characters in the charset are the same as 
the 128 characters in ASCII.


> s = 'a'  (This is unicode right?  Why when we assign a string to a
> variable that string's type is always unicode 

Strings in Python 3 are Unicode strings. That's just the way Python 
works. Unicode was chosen because Unicode includes over a million 
different characters (well, potentially over a million, most of them are 
currently unused), and is a strict superset of *all* common legacy 
codepages from the old DOS and Windows 95 days.


> and does not automatically
> become utf-8 which includes all available world-wide characters? Unicode
> is something different that a character set? )

Unicode is a character set. It is an enormous set of over one million 
characters (technically "code points", but don't worry about the 
difference right now) which can be collected in strings.

UTF-8 is an encoding that goes from a string using the Unicode character 
set into bytes, and back again. Sometimes, people are lazy and say 
"UTF-8" when they mean "Unicode", or vice versa. 

UTF-16 and UTF-32 are two different encodings for the same purpose, but 
for various technical reasons UTF-8 is better for files.

'λ' is a character which exists in some charsets but not others. It is 
not in the ASCII charset, nor is it in Latin-1, nor Big-5. It is in the 
ISO-8859-7 charset, and of course it is in Unicode.

In ISO-8859-7, the character 'λ' is stored as byte 0xEB (decimal 235), 
just as the character 'a' is stored as byte 0x61 (decimal 97).

In UTF-8, the character λ is stored as two bytes 0xCE 0xBB.

In UTF-16 (big-endian), the character λ is stored as two bytes 0x03 0xBB.

In UTF-32 (big-endian), the character λ is stored as four bytes 0x00 0x00 
0x03 0xBB.

That's four different ways of "spelling" the same character as bytes, 
just as "three", "trois", "drei", "τρία", "três" are all different ways 
of spelling the same number 3.


> utf8_byte = s.encode('utf-8')
> 
> Now if we are to decode this back to utf8 we will receive the char 'a'.
> I beleive same thing will happen with latin, greek, ascii isos. Correct?

Why don't you try it for yourself and see?



> The characters that will not decode correctly are those that their
> codepoints are greater that > 127 ?

Maybe, maybe not. It depends on which codepoint, and which encodings. 
Some encodings use the same bytes for the same characters. Some encodings 
use different bytes. It all depends on the encoding, just like American 
and English both spell 3 "three", while French spells it "trois".
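
As a concrete case of "maybe, maybe not": decoding UTF-8 bytes with the wrong 8-bit codec may raise no error at all and simply yield the wrong characters. A small sketch, using 'α' as an arbitrary sample character:

```python
utf8_bytes = 'α'.encode('utf-8')          # b'\xce\xb1'

# ISO-8859-7 assigns a character to each of these two bytes, so no exception
# is raised; the result is two wrong characters instead of one right one.
mojibake = utf8_bytes.decode('iso-8859-7')
print(repr(mojibake), len(mojibake))

# Encoding the mojibake with the same codec recovers the original bytes.
round_trip = mojibake.encode('iso-8859-7')

# ASCII is stricter: bytes above 127 are rejected outright.
try:
    utf8_bytes.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```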


> for example if s = 'α' (greek character equivalent to english 'a')

In Latin-1, 'α' does not exist:

py> 'α'.encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u03b1' in 
position 0: ordinal not in range(256)


In the Greek charset ISO-8859-7, 'α' is stored as byte 0xE1:

py> 'α'.encode('ISO-8859-7')
b'\xe1'


But in the *Cyrillic* charset, ISO-8859-5, the byte 0xE1 means 
a completely different character, CYRILLIC SMALL LETTER ES:

py> b'\xE1'.decode('ISO-8859-5')
'с'

(don't be fooled that this looks like the English c, it is not the same).


In Unicode, 'α' is always codepoint 0x3B1 (decimal 945):

py> ord('α')
945

but before you can store that on a disk, or as a file name, it needs to 
be converted to bytes, and which bytes you get depends on which encoding 
you use:

py> 'α'.encode('utf-8')
b'\xce\xb1'

py> 'α'.encode('utf-16be')
b'\x03\xb1'

py> 'α'.encode('utf-32be')
b'\x00\x00\x03\xb1'


-- 
Steven



#47360

From: Cameron Simpson <cs@zip.com.au>
Date: 2013-06-08 12:49 +1000
Message-ID: <mailman.2870.1370660227.3114.python-list@python.org>
In reply to: #47326
On 07Jun2013 04:53, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
| On Friday, 7 June 2013 11:53:04 AM UTC+3, user Cameron Simpson wrote:
| > | >| errors='replace' mean dont break in case or error?
| > 
| > | >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
| > | >for something that would not decode smoothly.
| > 
| > | How can it be correct? We have encoded out string in utf-8 and then
| > | we tried to decode it as greek-iso? How can this possibly be
| > | correct?
| 
| > If it is a valid iso-8859-7 sequence (which might cover everything, 
| > since I expect it is an 8-bit 1:1 mapping from bytes values to a 
| > set of codepoints, just like iso-8859-1) then it may decode to the 
| > "wrong" characters, but the reverse process (characters encoded as
| > bytes) should produce the original bytes.  With a mapping like this, 
| > errors='replace' may mean nothing; there will be no errors because
| > the only Unicode characters in play are all from iso-8859-7 to start
| > with. Of course another string may not be safe. 
| 
| > Visually, the names will be garbage. And if you go:
| >   mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
| > while using the iso-8859-7 locale, the wrong thing will occur
| > (assuming it even works, though I think it should because all these
| > characters are represented in iso-8859-7, yes?)
| 
| All the rest you i understood only the above quotes its still unclear to me.
| I cant see to understand it.
| 
| Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st 0-127 codepoints similar?

Yes. It is certainly true for utf-8 and latin-iso and ASCII.
I expect it to be so for greek-iso, but have not checked.

They're all essentially the ASCII set plus a range of other character
codepoints for the upper values.  The 8-bit sets iso-8859-1 (which
I take you to mean by "latin-iso") and iso-8859-7 (which I take you
to mean by "greek-iso") are single byte mapping with the top half
mapped to characters commonly used in a particular region.

Unicode has a much greater range, but the UTF-8 encoding of Unicode
deliberately has the bottom 0-127 identical to ASCII, and higher
values represented by multibyte sequences in which every byte is
in the 128-255 range. In this way pure ASCII files
are already in UTF-8 (and, in fact, work just fine for the iso-8859-x
encodings as well).
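
That design can be checked directly. The sketch below (sample strings chosen arbitrarily) confirms that pure ASCII text encodes to the same bytes under each codec, and that every byte UTF-8 emits for a non-ASCII character is 128 or above:

```python
ascii_text = 'files.py'
encodings = ('ascii', 'utf-8', 'iso-8859-1', 'iso-8859-7')

# Pure ASCII text yields identical bytes regardless of which codec is used.
encoded_forms = {ascii_text.encode(enc) for enc in encodings}
print(encoded_forms)

# For a non-ASCII character, UTF-8 uses a multibyte sequence whose bytes all
# have the high bit set, so they can never be mistaken for ASCII.
greek_bytes = 'λ'.encode('utf-8')
high_bit_set = [byte >= 0x80 for byte in greek_bytes]
print(greek_bytes, high_bit_set)
```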

| For example char 'a' has the value of '65' for all of those character sets?
| Is hat what you mean?

Yes, the value is the same in all of them (though 'a' is 97; 65 is 'A').

| s = 'a'  (This is unicode right?  Why when we assign a string to
| a variable that string's type is always unicode and does not
| automatically become utf-8 which includes all available world-wide
| characters? Unicode is something different that a character set? )

In Python 3, yes. Strings are unicode. Note that that means they are
sequences of codepoints whose meaning is as for Unicode.

"utf-8" is a byte encoding for Unicode strings. An external storage
format, if you like. The first 0-127 codepoints are 1:1 with byte
values, and the higher code points require multibyte sequences.

| utf8_byte = s.encode('utf-8')

Unicode string => utf-8 byte encoding.

| Now if we are to decode this back to utf8 we will receive the char 'a'.

Yes.

| I beleive same thing will happen with latin, greek, ascii isos. Correct?
| 
| utf8_a = utf8_byte.decode('iso-8859-7')
| latin_a = utf8_byte.decode('iso-8859-1')
| ascii_a = utf8_byte.decode('ascii')
| utf8_a = utf8_byte.decode('iso-8859-7')
| 
| Is this correct? 

Yes, because of the design decision about the 0-127 codepoints.

| All of those decodes will work even if the encoded bytestring was of utf8 type?
| 
| The characters that will not decode correctly are those that their codepoints are greater that > 127 ?
| for example if s = 'α' (greek character equivalent to english 'a')
| Is this what you mean?

Yes, exactly so.

| --------------------------------
| 
| Now back to my almost ready files.py script please:
| 
| 
| #========================================================
| # Collect filenames of the path dir as bytes
| greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' )
| 
| for filename in greek_filenames:
| 	# Compute 'path/to/filename' in bytes
| 	greek_path = b'/home/nikos/public_html/data/apps/' + b'filename'

You don't mean b'filename', which is the literal word "filename".
You mean: filename.encode('iso-8859-7')

More probably, you mean:

  dirpath = b'/home/nikos/public_html/data/apps/'
  greek_filenames = os.listdir(dirpath)
  for greek_filename in greek_filenames:
    try:
      filename = greek_filename.decode('iso-8859-7')

and then:

  greek_path = dirpath + greek_filename
  utf8_filename = filename.encode('utf-8')
  utf8_path = dirpath + utf8_filename

| 	try:
| 		filepath = greek_path.decode('iso-8859-7')
| 		# Rename current filename from greek bytes --> utf-8 bytes
| 		os.rename( greek_path, filepath.encode('utf-8') )

I would break this up into smaller pieces:

  filepath = greek_path.decode('iso-8859-7')
  # Rename current filename from greek bytes --> utf-8 bytes
  utf8_path = filepath.encode('utf-8')
  os.rename( greek_path, utf8_path )

That way if an exception is thrown you have a much better idea of
exactly which line had a problem.

| 	except UnicodeDecodeError:
| 		# Since its not a greek bytestring then its a proper utf8 bytestring
| 		filepath = greek_path.decode('utf-8')

And here you have a logic error. The idea is ok, but the encode and os.rename are not
relevant to your UnicodeDecodeError check. So do this:

  dirpath = b'/home/nikos/public_html/data/apps/'
  greek_filenames = os.listdir(dirpath)
  for greek_filename in greek_filenames:
    try:
      filename = greek_filename.decode('iso-8859-7')
    except UnicodeDecodeError:
      # Since its not a greek bytestring then its a proper utf8 bytestring
      # no need to rename it
      pass
    else:
      # Rename current filename from greek bytes --> utf-8 bytes
      utf8_filename = filename.encode('utf-8')
      greek_path = dirpath + greek_filename
      utf8_path = dirpath + utf8_filename
      os.rename( greek_path, utf8_path )

You should try/except only around exactly the code expected to raise
an exception, not extra stuff.

However, this code won't work. Because iso-8859-7 is an 8-bit
character set, it will almost _never_ fail to decode. Nearly all the
byte values are valid. So no UnicodeDecodeError is raised in practice.

A better test might be to decode it as utf-8. If that fails, then
_guess_ that it is iso-8859-7 and rename the file, otherwise do not
touch it.

However, the real test is by eye: your program cannot deduce if a
filename is nonsense, but presumably a visual inspection will show
nonsense or sensible names.

So:

  write a standalone python program to fix a filename
    (provided as sys.argv[1])
    using the code above

  get a utf-8 Putty terminal
  check the remote locale is utf-8
  do an "ls"

  for each nonsense file, run:
    python3 fix_filename.py nonsense-filename

You should augment your rename with a prior os.path.exists() test
to make sure you do not replace an existing file.
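
Putting the pieces together, the approach described above (try UTF-8 first, otherwise guess ISO-8859-7, and refuse to overwrite an existing file) might look like the sketch below. The function names utf8_name_for and fix_filename are hypothetical, and the ISO-8859-7 fallback is only a guess about how a name was originally encoded.

```python
import os


def utf8_name_for(name_bytes, fallback='iso-8859-7'):
    """Return the UTF-8 form of a filename, or None if no rename is needed."""
    try:
        name_bytes.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8: guess the fallback encoding. This can still raise
        # UnicodeDecodeError for bytes the fallback codec does not define.
        return name_bytes.decode(fallback).encode('utf-8')
    return None


def fix_filename(dir_bytes, name_bytes):
    """Rename one file in dir_bytes to UTF-8; return the new path or None."""
    new_name = utf8_name_for(name_bytes)
    if new_name is None:
        return None
    old_path = os.path.join(dir_bytes, name_bytes)
    new_path = os.path.join(dir_bytes, new_name)
    if os.path.exists(new_path):
        raise FileExistsError('refusing to overwrite %r' % new_path)
    os.rename(old_path, new_path)
    return new_path
```

Working on bytes paths throughout avoids any dependence on the locale of the process, and a visual check of the results is still needed because the fallback is a guess.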

[...snip...]
| nikos@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Error in sys.excepthook:
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] ValueError: underlying buffer has been detached
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Original exception was:
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 71, in <module>
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]     os.rename( greek_path, filepath.encode('utf-8') )
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] FileNotFoundError: [Errno 2] \\u0394\\u03b5\\u03bd \\u03c5\\u03c0\\u03ac\\u03c1\\u03c7\\u03b5\\u03b9 \\u03c4\\u03ad\\u03c4\\u03bf\\u03b9\\u03bf \\u03b1\\u03c1\\u03c7\\u03b5\\u03af\\u03bf \\u03ae \\u03ba\\u03b1\\u03c4\\u03ac\\u03bb\\u03bf\\u03b3\\u03bf\\u03c2: '/home/nikos/public_html/data/apps/filename'

Well, I would guess 2 things are happening:

  - you construct a literal b'/home/nikos/public_html/data/apps/filename'
    at the top of your script
    see my earlier remarks
    therefore the complaint that it does not exist

  - I would guess that the \\uxxxx sequences are a Unicode transcription
    of the error message, transcribed as hex because they don't look
    "printable" in the current locale

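That second guess is easy to check. The sketch below takes the first
two words of the escaped text from the log (with the doubled
backslashes reduced to single ones) and turns them back into
characters. Errno 2 is "no such file or directory", so the full message
is presumably the Greek wording of that:

```python
# First two words of the message, copied from the log above.
escaped = r'\u0394\u03b5\u03bd \u03c5\u03c0\u03ac\u03c1\u03c7\u03b5\u03b9'

# Turn the \uxxxx escapes back into real characters.
message = escaped.encode('ascii').decode('unicode_escape')
# message == 'Δεν υπάρχει'

# Going the other way reproduces what the log shows: characters that
# cannot be written in ASCII come out as \uxxxx escapes.
round_trip = message.encode('ascii', 'backslashreplace').decode('ascii')
```
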
Cheers,
-- 
Cameron Simpson <cs@zip.com.au>

Louis Pasteur's theory of germs is ridiculous fiction.
       --Pierre Pachet, Professor of Physiology at Toulouse, 1872



#47396

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-08 21:01 +0300
Message-ID: <mailman.2891.1370714502.3114.python-list@python.org>
In reply to: #47326


On 8/6/2013 5:49 am, Cameron Simpson wrote:
> On 07Jun2013 04:53, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
> | On Friday, 7 June 2013 11:53:04 AM UTC+3, Cameron Simpson wrote:
> | > | >| Does errors='replace' mean don't break in case of error?
> | >
> | > | >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
> | > | >for something that would not decode smoothly.
> | >
> | > | How can it be correct? We have encoded our string in utf-8 and then
> | > | we tried to decode it as greek-iso? How can this possibly be
> | > | correct?
> |
> | > If it is a valid iso-8859-7 sequence (which might cover everything,
> | > since I expect it is an 8-bit 1:1 mapping from bytes values to a
> | > set of codepoints, just like iso-8859-1) then it may decode to the
> | > "wrong" characters, but the reverse process (characters encoded as
> | > bytes) should produce the original bytes.  With a mapping like this,
> | > errors='replace' may mean nothing; there will be no errors because
> | > the only Unicode characters in play are all from iso-8859-7 to start
> | > with. Of course another string may not be safe.
> |
> | > Visually, the names will be garbage. And if you go:
> | >   mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
> | > while using the iso-8859-7 locale, the wrong thing will occur
> | > (assuming it even works, though I think it should because all these
> | > characters are represented in iso-8859-7, yes?)
> |
> | I understood all the rest; only the above quotes are still unclear to me.
> | I can't seem to understand them.
> |
> | Do you mean that utf-8, latin-iso, greek-iso and ASCII share the same first 0-127 codepoints?
>
> Yes. It is certainly true for utf-8 and latin-iso and ASCII.
> I expect it to be so for greek-iso, but have not checked.
>
> They're all essentially the ASCII set plus a range of other character
> codepoints for the upper values.  The 8-bit sets iso-8859-1 (which
> I take you to mean by "latin-iso") and iso-8859-7 (which I take you
> to mean by "greek-iso") are single-byte mappings with the top half
> mapped to characters commonly used in a particular region.
>
> Unicode has a much greater range, but the UTF-8 encoding of Unicode
> deliberately has the bottom 0-127 identical to ASCII, and higher
> values represented by multibyte sequences commencing with at least
> the first byte in the 128-255 range. In this way pure ASCII files
> are already in UTF-8 (and, in fact, work just fine for the iso-8859-x
> encodings as well).
>
Hold on!

In the beginning there was ASCII with 0-127 values, and then there was
Unicode with ASCII's 0-127 + I don't know how many more?

Now ASCII needs 1 byte to store a single character, while Unicode needs
2 bytes to store a character, and that is because it has > 256 characters
to store (more than 2^8)?

Is this correct?

Now UTF-8, latin-iso, greek-iso etc. are WAYS of storing characters
on the hard drive?

Because in some post I have read of the 'UTF-8 encoding of Unicode'.
Can you please explain to me what the difference is between ASCII and
Unicode themselves, and then between them and 'charsets'? I'm still
confused about this.

Is it like what we say in C++:
'int a',    a variable with name 'a' of type integer.
'char a',   a variable with name 'a' of type char.

So, taken from the above example (the closest I could think of), the way I
understand them is:

A 'string' can be of (Unicode's or ASCII's) type, and that type needs a
way (that's a charset) to store this string on the hdd as a sequence of
bytes?






-- 
Webhost <http://superhost.gr> && Weblog <http://psariastonafro.wordpress.com>
