

Groups > comp.lang.python > #47322 > unrolled thread

Re: Changing filenames from Greeklish => Greek (subprocess complain)

Started by: Cameron Simpson <cs@zip.com.au>
First post: 2013-06-07 18:53 +1000
Last post: 2013-06-10 13:28 -0700
Articles: 20 on this page of 68; 14 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-07 18:53 +1000
    Re: Changing filenames from Greeklish => Greek (subprocess complain) alex23 <wuwei23@gmail.com> - 2013-06-07 02:41 -0700
    Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-07 04:53 -0700
      Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-07 15:29 +0100
        Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-07 11:52 -0700
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Zero Piraeus <schesis@gmail.com> - 2013-06-07 15:31 -0400
          Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-07 21:45 +0100
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Zero Piraeus <schesis@gmail.com> - 2013-06-07 19:24 -0400
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-08 12:52 +1000
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-07 23:49 -0700
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-08 16:58 +1000
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-08 07:26 +0000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-08 17:40 +1000
              Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-08 17:32 +0100
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 09:53 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 10:35 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-08 18:48 +0100
      Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-07 15:33 +0000
      Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-08 12:49 +1000
      Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 21:01 +0300
        Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-08 19:01 +0000
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 14:14 -0700
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 08:32 +1000
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 07:46 +0300
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 06:25 +0000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 18:02 +1000
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:03 -0700
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 14:21 -0700
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-09 08:10 +1000
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 01:11 -0700
      Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-09 04:47 +1000
        Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-08 22:09 -0700
          Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 06:45 +0000
            Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-09 00:00 -0700
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 08:15 +0000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:14 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 03:32 -0700
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 19:16 +1000
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 12:36 +0000
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-09 10:25 -0700
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Lele Gaifax <lele@metapensiero.it> - 2013-06-09 10:55 +0200
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:08 -0700
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Lele Gaifax <lele@metapensiero.it> - 2013-06-09 11:20 +0200
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:38 -0700
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Andreas Perstinger <andipersti@gmail.com> - 2013-06-09 14:24 +0200
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 13:13 +0000
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Benjamin Kaplan <benjamin.kaplan@case.edu> - 2013-06-09 13:05 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:42 -0700
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 03:37 -0700
                      Re: Changing filenames from Greeklish => Greek (subprocess complain) Larry Hudson <orgnut@yahoo.com> - 2013-06-10 00:51 -0700
                        Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 01:11 -0700
                          Re: Changing filenames from Greeklish => Greek (subprocess complain) Larry Hudson <orgnut@yahoo.com> - 2013-06-11 00:20 -0700
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 11:50 +0000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 05:18 -0700
            Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:00 -0700
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 19:12 +1000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:20 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Benjamin Kaplan <benjamin.kaplan@case.edu> - 2013-06-09 13:01 -0700
              Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 12:31 +0000
                Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-10 00:10 -0700
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Andreas Perstinger <andipersti@gmail.com> - 2013-06-10 10:15 +0200
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 01:54 -0700
                      Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 02:59 -0700
                        Re: Changing filenames from Greeklish => Greek (subprocess complain) Andreas Perstinger <andipersti@gmail.com> - 2013-06-10 12:42 +0200
                  Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-10 11:59 +0000
                    Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 07:27 -0700
                      Re: Changing filenames from Greeklish => Greek (subprocess complain) jmfauth <wxjmfauth@gmail.com> - 2013-06-10 12:48 -0700
                        Re: Changing filenames from Greeklish => Greek (subprocess complain) Ned Batchelder <ned@nedbatchelder.com> - 2013-06-10 13:28 -0700



#47400

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-06-08 19:01 +0000
Message-ID: <51b37fa4$0$29966$c3e8da3$5496439d@news.astraweb.com>
In reply to: #47396
On Sat, 08 Jun 2013 21:01:23 +0300, Νικόλαος Κούρας wrote:

> In the beginning there was ASCII with 0-127 values 

No, there were encoding systems that existed before ASCII, such as 
EBCDIC. But we can ignore those, and just start with ASCII.


> and then there was
> Unicode with 0-127 of ASCII's + i dont know how much many more?

No, you have missed the utter chaos of dozens and dozens of Windows 
codepages and charsets. We still have to live with the pain of that.

But now we have Unicode, whose code points run from 0 to 0x10FFFF (decimal 
1114111), giving 1,114,112 possible values. You can consider a code point 
to be the same as a character, at least for now.


> Now ASCIII needs 1 byte to store a single character 

ASCII actually needs 7 bits to store a character. Since computers are 
optimized to work with bytes, not bits, normally ASCII characters are 
stored in a single byte, with one bit wasted.


> while Unicode needs 2 bytes to store a character 

No. Since Unicode code points (loosely, "characters"; ignore the 
difference for now) go up to 0x10FFFF, two bytes is not enough. Unicode 
needs 21 bits to store a character. Since that is more than 2 bytes, but 
less than 3, there are a few different ways for Unicode to be stored in 
memory, including:

"Wide" Unicode uses four bytes per character. Why four instead of three? 
Because computers are more efficient when working with chunks of memory 
whose size is a multiple of four bytes.

"Narrow" Unicode uses two bytes per character. Since two bytes is only 
enough for about 65,000 characters, not 1,000,000+, the rest of the 
characters are stored as pairs of two-byte "surrogates".
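
To make those sizes concrete, here is a minimal Python 3 sketch. The three sample characters are arbitrary choices for illustration, and the utf-32-be and utf-16-be codecs are used as stand-ins for the "wide" and "narrow" forms described above.

```python
# Highest Unicode code point and the number of bits needed to hold it.
MAX_CODE_POINT = 0x10FFFF
bits_needed = MAX_CODE_POINT.bit_length()
print("bits needed:", bits_needed)

# An ASCII letter, a Greek letter, and a character above U+FFFF.
samples = ["A", "\u03bb", "\U0001F600"]

# (wide size, narrow size) in bytes for each sample character.
sizes = {
    ch: (len(ch.encode("utf-32-be")), len(ch.encode("utf-16-be")))
    for ch in samples
}

for ch, (wide, narrow) in sizes.items():
    print(f"U+{ord(ch):06X}: wide={wide} bytes, narrow={narrow} bytes")
```

The last sample takes four bytes in the narrow form because it is stored as a pair of two-byte surrogates.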



> and that is because it has > 256 characters
> to store > 2^8bits ?

Correct.



> Now UTF-8, latin-iso, greek-iso e.t.c are WAYS of storing characters
> into the hard drive?

Your computer cannot carve a tiny little "A" into the hard drive when it 
stores that letter in a file. It has to write some bytes. So you need to 
know:

- what byte, or bytes, represents the letter "A"?

- what byte, or bytes, represents the letter "B"?

- what byte, or bytes, represents the letter "λ"?

and so on. This set of rules, "byte XXXX means letter YYYY", is called an 
encoding. If you don't know which encoding was used, you cannot tell what 
the bytes mean.
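
As a small Python 3 sketch of "different rules, different bytes": the same two letters are written out under three encodings, then read back with the wrong rules. The sample text and the choice of encodings are only for illustration.

```python
# The same two characters under three different sets of rules.
text = "A\u03bb"  # Latin capital A followed by Greek small lambda

as_utf8 = text.encode("utf-8")        # 'A' -> 1 byte, lambda -> 2 bytes
as_greek = text.encode("iso-8859-7")  # 1 byte each
as_utf16 = text.encode("utf-16-be")   # 2 bytes each

print(as_utf8.hex(), as_greek.hex(), as_utf16.hex())

# Reading the Greek-ISO bytes back with Latin-1 rules gives a different letter.
wrong = as_greek.decode("latin-1")
print(ascii(wrong))
```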

 
> Because in some post i have read that 'UTF-8 encoding of Unicode'. Can
> you please explain to me whats the difference of ASCII-Unicode
> themselves aand then of them compared to 'Charsets' . I'm still confused
> about this.

A charset is an ordered set of characters. For example, ASCII has 128 
characters (positions 0 to 127), starting with NUL:

NUL ... A B C D E ... Z [ \ ] ^ ... a b c ... z ... 


where NUL is at position 0, 'A' is at position 65, 'B' at position 66, 
and so on.

Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is 
also similar, also 256 positions, but the characters are different. And 
so on, with dozens of charsets.

And then there is Unicode, which includes *every* character in all of 
those dozens of charsets. It has 1114112 positions (most are currently 
unfilled).


An encoding is simply a program that takes a character and returns one 
or more bytes, or vice versa. For instance, the ASCII encoding will take 
character 'A'. That is found at position 65, which is 0x41 in hexadecimal, 
so the ASCII encoding turns character 'A' into byte 0x41, and vice versa.
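
In Python 3 the position lookup and the encoding step can be tried directly. This is only a sketch using built-in functions; the Greek letter at the end is an arbitrary example of a character that ASCII has no position for.

```python
# Position of 'A' in the table, in decimal and hexadecimal.
position = ord("A")
print(position, hex(position))

# Encoding: character -> byte.  Decoding: byte -> character.
encoded = "A".encode("ascii")
decoded = bytes([0x41]).decode("ascii")
print(encoded, decoded)

# A character outside the ASCII table cannot be encoded with these rules.
try:
    "\u03bb".encode("ascii")  # Greek small lambda
    greek_fits = True
except UnicodeEncodeError:
    greek_fits = False
print("Greek lambda fits in ASCII:", greek_fits)
```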


> Is it like we said in C++:
> ' int a',     a variable with name 'a' of type integer. 'char a',   a
> variable with name 'a' of type char
> 
> So taken form above example(the closest i could think of), the way i
> understand them is:
> 
> A 'string' can be of (unicode's or ascii's) type and that type needs a
> way (thats a charset) to store this string into the hdd as a sequense of
> bytes?


Correct.



-- 
Steven



#47406

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-08 14:14 -0700
Message-ID: <e1cfd5ed-798d-44fa-8bf7-17f3549a288e@googlegroups.com>
In reply to: #47400
On Saturday, 8 June 2013 10:01:57 PM UTC+3, Steven D'Aprano wrote:

> ASCII actually needs 7 bits to store a character. Since computers are  
> optimized to work with bytes, not bits, normally ASCII characters are
> stored in a single byte, with one bit wasted.

So ASCII and Unicode are two encoding systems currently in use.
How should I imagine them, visualize them?
Like tables, 'A' = 65, 'B' = 66 and so on?

But if I do, that would be the visualization of a 'charset', not of an encoding system.
What is the difference between an encoding system and a charset?

ebcdic - ascii - unicode = all of them are encoding systems

greek-iso - latin-iso - utf8 - utf16 = all of them are charsets.

What are these differences? I can't imagine them all; I can only imagine charsets, not encoding systems.

Why does Python interpret all given strings as Unicode by default and not ASCII? Because the former supports many positions while ASCII has only 127 positions, and hence can interpret only 127 different characters?


> "Narrow" Unicode uses two bytes per character. Since two bytes is only 
> enough for about 65,000 characters, not 1,000,000+, the rest of the 
> characters are stored as pairs of two-byte "surrogates".

Does 'surrogate' literally mean a replacement?


> Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is 
> also similar, also 256 positions, but the characters are different. And 
> so on, with dozens of charsets. 

Latin has to display English chars (capital, small) + numbers + symbols. That would be 127, so why 256?

Greek = all of the above plus Greek chars, no?

> And then there is Unicode, which includes *every* character is all of 
> those dozens of charsets. It has 1114111 positions (most are currently  
> unfilled).

Shouldn't the number of positions Unicode uses equal the sum of all available characters of all the world's languages, plus numbers and special chars? Why 1,000,000+? Why the need for so many positions? The narrow Unicode format (2 bytes) can cover all of the world's symbols.

> An encoding is simply a program that takes a character and returns a 
> byte, or visa versa. For instance, the ASCII encoding will take character 
> 'A'. That is found at position 65, which is 0x41 in hexadecimal, so the 
> ASCII encoding turns character 'A' into byte 0x41, and visa versa.

Why do you say ASCII turns a character into hex format and not binary format?
Isn't the latter the way bytes are stored on the hdd, like 010101111010101 etc.?
Are they stored as hex instead, or did you just say so to avoid printing 0s and 1s?



#47416

From: Cameron Simpson <cs@zip.com.au>
Date: 2013-06-09 08:32 +1000
Message-ID: <mailman.2903.1370730797.3114.python-list@python.org>
In reply to: #47406
On 08Jun2013 14:14, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
| On Saturday, 8 June 2013 10:01:57 PM UTC+3, Steven D'Aprano wrote:
| > ASCII actually needs 7 bits to store a character. Since computers are  
| > optimized to work with bytes, not bits, normally ASCII characters are
| > stored in a single byte, with one bit wasted.
| 
| So ASCII and Unicode are 2 Encoding Systems currently in use.
| How should i imagine them, visualize them?
| Like tables 'A' = 65, 'B' = 66 and so on?

Yes, that works.

| But if i do then that would be the visualization of a 'charset' not of an encoding system.
| What the diffrence of an encoding system and of a charset?

An encoding system is the method of transcribing these values to bytes and back again.

| ebcdic - ascii - unicode = al of them are encoding systems
| greek-iso - latin-iso - utf8 - utf16 = all of them are charsets.

No.

EBCDIC and ASCII and Unicode and Greek-ISO (iso-8859-7) are all character sets.
(1:1 mappings of characters to numbers/ordinals).

An encoding is a way of writing these values to bytes.
Decoding reads bytes and emits character values.

Because all of the EBCDIC, ASCII and iso-8859-x character sets fit in the range 0-255,
they are usually transcribed (encoded) directly, one byte per ordinal.

Unicode is much larger. It cannot be transcribed (encoded) as one byte per value.
There are several ways of transcribing Unicode. UTF-8 is a popular and usually compact form,
using one byte for values below 128 and multiple bytes for higher values.
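
A quick Python 3 check of that 128 boundary, as a sketch only. The code points listed are arbitrary samples picked to sit on either side of each size step.

```python
# UTF-8 length for code points on either side of each size boundary.
samples = [0x41, 0x7F, 0x80, 0x3BB, 0x20AC, 0x1F600]
utf8_lengths = {cp: len(chr(cp).encode("utf-8")) for cp in samples}

for cp, length in utf8_lengths.items():
    print(f"U+{cp:04X} -> {length} byte(s) in UTF-8")

# An iso-8859-x set writes the ordinal directly as a single byte.
alpha_byte = "\u03b1".encode("iso-8859-7")  # Greek small alpha
print(alpha_byte.hex())
```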

| Why python interprets by default all given strings as unicode and
| not ascii? because the former supports many positions while ascii
| only 127 positions , hence can interpet only 127 different characters?

Yes.

[...]
| > Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is 
| > also similar, also 256 positions, but the characters are different. And 
| > so on, with dozens of charsets. 
| 
| Latin has to display english chars(capital, small) + numbers + symbols. that would be 127 why 256?

ASCII runs up to 127. Essentially English, numerals, control codes and various symbols.

The iso-8859-x sets run to 255, and the upper 128 values map to
characters popular in various regions.

| greek = all of the above plus greek chars, no?

Yes: iso-8859-7 keeps the ASCII values and uses the upper range for the Greek characters.

| > And then there is Unicode, which includes *every* character is all of 
| > those dozens of charsets. It has 1114111 positions (most are currently  
| > unfilled).
| 
| Shouldt the positions that Unicode has to use equal to the summary
| of all available characters of all the languages of the worlds plus
| numbers and special chars? why 1.000.000+ why the need for so many
| positions? Narrow Unicode format (2 byted) can cover all ofmthe
| worlds symbols.

2 bytes is not enough. Chinese alone has more glyphs than that.

| > An encoding is simply a program that takes a character and returns a 
| > byte, or visa versa. For instance, the ASCII encoding will take character 
| > 'A'. That is found at position 65, which is 0x41 in hexadecimal, so the 
| > ASCII encoding turns character 'A' into byte 0x41, and visa versa.
| 
| Why you say ASCII turn a character into HEX format and not as in binary format?

Steven didn't say that. He said "position 65". People often write
bytes in hex (eg 0x41) because a byte always fits in two hex digits
(16 x 16 = 256 values) and because often these values have binary-based
subranges, and hex makes that more obvious.

For example, 'A' is 0x41. 'a' is 0x61. So you can look at the hex
code and almost visually know if you're dealing with upper or lower
case, etc.
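
A short Python 3 sketch of that upper/lower case pattern; the letter 'q' at the end is just an arbitrary example.

```python
upper, lower = ord("A"), ord("a")

# The same values in hex and in binary: only one bit differs.
print(f"{upper:#04x} {upper:08b}")
print(f"{lower:#04x} {lower:08b}")

# XOR isolates the bit that distinguishes the two cases in ASCII.
case_bit = upper ^ lower
print(hex(case_bit))

# Flipping that bit converts an ASCII letter to the other case.
flipped = chr(ord("q") ^ case_bit)
print(flipped)
```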

| Isnt the latter the way bytes are stored into hdd, like 010101111010101 etc?
| Are they stored as hex instead or you just said so to avoid printing 0s and 1s?

They're stored as bits at the gate level. But writing hex codes
_in_ _text_ is more compact, and more readable for humans.

Cheers,
-- 
Cameron Simpson <cs@zip.com.au>

A lot of people don't know the difference between a violin and a viola, so
I'll tell you.  A viola burns longer.   - Victor Borge



#47422

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-09 07:46 +0300
Message-ID: <mailman.2906.1370753210.3114.python-list@python.org>
In reply to: #47406


On 9/6/2013 1:32 πμ, Cameron Simpson wrote:
> On 08Jun2013 14:14, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
> | On Saturday, 8 June 2013 10:01:57 PM UTC+3, Steven D'Aprano wrote:
> | > ASCII actually needs 7 bits to store a character. Since computers are
> | > optimized to work with bytes, not bits, normally ASCII characters are
> | > stored in a single byte, with one bit wasted.
> |
> | So ASCII and Unicode are 2 Encoding Systems currently in use.
> | How should i imagine them, visualize them?
> | Like tables 'A' = 65, 'B' = 66 and so on?
>
> Yes, that works.
>
> | But if i do then that would be the visualization of a 'charset' not of an encoding system.
> | What the diffrence of an encoding system and of a charset?
>
> And encoding system is the method or transcribing these values to bytes and back again.
So we have:

( 'A' mapped to the value of '65' ) => encoding process (e.g. utf-8) => bytes
bytes => decoding process (e.g. utf-8) =>  ( '65' mapped to character 'A' )

Why does every character in a character set need to be associated with 
a numeric value?
I mean, couldn't we just have character sets that wouldn't have numeric 
associations, like:

'A'  => encoding process (e.g. utf-8) => bytes
bytes => decoding process (e.g. utf-8) =>  character 'A'


>
> EBCDIC and ASCII and Unicode and Greek-ISO (iso-8859-7) are all character sets.
> (1:1 mappings of characters to numbers/ordinals).
>
> And encoding is a way of writing these values to bytes.
> Decoding reads bytes and emits character values.
>
> Because all of EBCDIC, ASCII and the iso-8859-x characters sets fit in the range 0-255,
> they are usually transcribed (encoded) directly, one byte per ordinal.
>
> Unicode is much larger. It cannot be transcribed (encoded) as one bytes to one value.
> There are several ways of transcribing Unicode. UTF-8 is a popular and usually compact form,
> using one byte for values below 128 and and multiple bytes for higher values.
An ordinal = ordered numbers like 7, 8, 9, 10 and so on?

Since 1 byte can hold up to 256 values, why doesn't utf-8 use 1 byte for 
values up to 256?

UTF-8 and UTF-16 and UTF-32:
I thought the number beside UTF- declared how many bits the 
character set was using to store a character on the hdd, no?

> "Narrow" Unicode uses two bytes per character. Since two bytes is only
> enough for about 65,000 characters, not 1,000,000+, the rest of the
> characters are stored as pairs of two-byte "surrogates".

Can you please explain the line "the rest of the characters are stored 
as pairs of two-byte "surrogates"" in a way that is easier for me to understand?
I'm still having trouble understanding what a surrogate is.

Again, thank you very much for explaining the encodings to me, they were 
giving me trouble for years in all of my scripts.


And one last thing.
When the locale of a Linux system is set to utf-8, that would mean that 
Linux applications should encode strings to the hdd using the system's 
default encoding, utf-8, and read them back from bytes also using 
utf-8. Is that correct?
-- 
Webhost <http://superhost.gr>&& Weblog <http://psariastonafro.wordpress.com>



#47427

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-06-09 06:25 +0000
Message-ID: <51b41fc6$0$30001$c3e8da3$5496439d@news.astraweb.com>
In reply to: #47422
On Sun, 09 Jun 2013 07:46:40 +0300, Νικόλαος Κούρας wrote:

> Why does every character in a character set needs to be associated with
> a numeric value?

Because computers are digital, not analog, and because bytes are numbers.

Here are a few of the 256 possible bytes, written in binary, decimal and 
hexadecimal:

0b00000000 0 0x00
0b00000001 1 0x01
0b00000010 2 0x02
[...]
0b01111111 127 0x7F
0b10000000 128 0x80
[...]
0b11111110 254 0xFE
0b11111111 255 0xFF


EVERYTHING in a computer is a number, because everything is stored as 
bytes. Text is stored as bytes. Sound files are stored as bytes. Images 
are stored as bytes. Programs are stored as bytes. So everything is being 
stored as numbers. But the *meaning* we give to those numbers depends on 
what we do with them, whether we treat them as characters, bitmapped 
images, floating point values, or something else.
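
As a sketch of "same bytes, different meaning" in Python 3: the four byte values below are an arbitrary example, read once as text, once as an integer and once as a 32-bit float.

```python
import struct

# Four bytes; what they "are" depends on how we choose to read them.
raw = bytes([0x41, 0x42, 0x43, 0x44])

as_text = raw.decode("ascii")           # four ASCII characters
as_int = int.from_bytes(raw, "big")     # one big-endian integer
as_float = struct.unpack(">f", raw)[0]  # one big-endian 32-bit float

print(as_text, as_int, as_float)
```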

Once we decide we want to store the character "A" as bytes, we need to 
decide which number it should be. That is the job of the charset.

ASCII:

65 <--> 'A'
66 <--> 'B'
67 <--> 'C'
etc.


> I mean couldn't we just have characters sets that wouldn't have numeric
> associations like:
> 
> 'A'  => encoding process(i.e. uf-8) => bytes bytes => decoding
> process(i.e. utf-8) =>  character 'A'

No. How would you store it in a computer's memory, or on a hard drive? By 
carving a tiny, microscopic "A" onto the hard drive? How would you read 
it back?

It is theoretically possible to build an analog computer, out of 
clockwork, or water flowing through pipes, or something, but nobody 
really bothers because it is much harder and not very useful.


> An ordinal = ordered numbers like 7,8,910 and so on?

Yes.


> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
> values up to 256?

Because then how do you tell when you need one byte, and when you need 
two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
characters, with ordinal values 0x4C and 0xFA, or one character with 
ordinal value 0x4CFA?

UTF-8 solves this problem by reserving some values to mean "this byte, on 
its own", and others to mean "this byte, plus the next byte, together", 
and so forth, up to four bytes.
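
The reserved bit patterns can be seen by printing UTF-8 bytes in binary. This is a Python 3 sketch; `utf8_bits` and `sequence_length` are hypothetical helper names invented for the illustration, not library functions.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as binary strings."""
    return [f"{b:08b}" for b in ch.encode("utf-8")]


def sequence_length(first_byte):
    """Work out from the first byte alone how many bytes the character uses."""
    if first_byte < 0x80:          # 0xxxxxxx: this byte on its own
        return 1
    if first_byte >> 5 == 0b110:   # 110xxxxx: this byte plus one more
        return 2
    if first_byte >> 4 == 0b1110:  # 1110xxxx: this byte plus two more
        return 3
    if first_byte >> 3 == 0b11110: # 11110xxx: this byte plus three more
        return 4
    raise ValueError("not a valid first byte")


samples = ["A", "\u03bb", "\u20ac", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}", utf8_bits(ch))
```

Every byte after the first starts with the bits 10, so a reader can never mistake the middle of a character for the start of one.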

If you look up UTF-8 on Wikipedia, you will see more about this.

> UTF-8 and UTF-16 and UTF-32
> I though the number beside of UTF- was to declare how many bits the 
> character set was using to store a character into the hdd, no?

Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. 
UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit 
values to make a surrogate pair. UTF-8 uses 8-bit values, but sometimes 
it combines two, three or four of them to represent a single code-point.

> > "Narrow" Unicode uses two bytes per character. Since two bytes is only
> > enough for about 65,000 characters, not 1,000,000+, the rest of the
> > characters are stored as pairs of two-byte "surrogates".
> 
> Can you please explain this line "the rest of thecharacters are stored 
> as pairs of two-byte "surrogates"" more easily for me to understand it?
> I'm still having troubl understanding what a surrogate is.

Look up UTF-16 and "surrogate pair" on Wikipedia.

But basically, there are 65000+ different possible 16-bit values 
available for UTF-16 to use. Some of those values are reserved to mean 
"this value is not a character, it is half of a surrogate pair". Since 
they are *pairs*, they must always come in twos. A surrogate pair makes 
up a valid character. Half of a surrogate pair, on its own, is an error.
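
For the curious, the arithmetic behind a surrogate pair can be sketched in Python 3. `to_surrogates` is a hypothetical helper written for this illustration, and the code point U+1F600 is an arbitrary example above U+FFFF.

```python
def to_surrogates(code_point):
    """Split a code point above 0xFFFF into its two 16-bit surrogate values."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits go into the first half
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go into the second half
    return high, low


code_point = 0x1F600
high, low = to_surrogates(code_point)
print(hex(high), hex(low))

# Python's own UTF-16 encoder produces the same two 16-bit values.
encoded = chr(code_point).encode("utf-16-be")
print(encoded.hex())

# Half of a pair on its own is an error.
try:
    encoded[:2].decode("utf-16-be")
    half_is_valid = True
except UnicodeDecodeError:
    half_is_valid = False
print("half a pair decodes:", half_is_valid)
```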


A lot of this complexity is because of historical reasons. For example, 
when Unicode was first invented, it had room for only about 65 thousand 
characters, and a fixed 16 bits was all you needed. But it was soon 
learned that 65 thousand was not enough (there are more than 65,000 Asian 
characters alone!) and so UTF-16 developed the trick with surrogate pairs 
to cover the extras.


[...]
> When locale to linux system is set to utf-8 that would mean that the 
> linux applications, should try to encode string into hdd by using 
> system's default encoding to utf-8 nad read them back from bytes by
> also using utf-8. Is that correct?

Yes.



-- 
Steven



#47430

From: Cameron Simpson <cs@zip.com.au>
Date: 2013-06-09 18:02 +1000
Message-ID: <mailman.2909.1370765374.3114.python-list@python.org>
In reply to: #47427
On 09Jun2013 06:25, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
| [... heaps of useful explaination ...]
| > When locale to linux system is set to utf-8 that would mean that the 
| > linux applications, should try to encode string into hdd by using 
| > system's default encoding to utf-8 nad read them back from bytes by
| > also using utf-8. Is that correct?
| 
| Yes.

Although I'd point out that only applications that care about text
as _text_ need to consider Unicode and the encoding. A command like
"mv" does not care. You type the command and "mv" receives byte
strings as its arguments. So it is doing straightforward "bytes"
file renames. It does not care or even know about encodings.
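
Python can work at the same byte level as "mv". The sketch below uses a temporary directory and made-up file names (assumptions for the illustration) to show that os.listdir hands back bytes when given a bytes path, and that os.rename accepts bytes names without any decoding.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file with a plain ASCII name.
    with open(os.path.join(tmp, "report.txt"), "w"):
        pass

    as_text = os.listdir(tmp)                # str path in, str names out
    as_bytes = os.listdir(os.fsencode(tmp))  # bytes path in, bytes names out

    # Rename purely in terms of bytes, as "mv" would.
    old = os.path.join(os.fsencode(tmp), b"report.txt")
    new = os.path.join(os.fsencode(tmp), b"renamed.txt")
    os.rename(old, new)
    after = os.listdir(tmp)

print(as_text, as_bytes, after)
```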

In this scenario, really it is the Terminal program (eg Putty) which
cares about text (what you type, and what gets displayed). It is
because of mismatches between your Terminal's locale settings and the
encoding that was chosen for the filenames that you get garbage
listings, one way or another.
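
A small Python 3 sketch of such a mismatch. The Greek word is a made-up filename, and latin-1 stands in for a terminal configured with the wrong encoding; both are assumptions for the illustration.

```python
name = "\u0395\u03c5\u03c7\u03ae"  # a short Greek word, used as a sample filename

# The filename as it was written to disk, in Greek ISO.
greek_bytes = name.encode("iso-8859-7")

# 1. Decoded with the matching rules: the original text comes back.
proper = greek_bytes.decode("iso-8859-7")

# 2. Decoded with the wrong single-byte rules: no error, just wrong letters.
garbage = greek_bytes.decode("latin-1")

# 3. Decoded as UTF-8: these bytes are not valid UTF-8, so decoding fails.
try:
    greek_bytes.decode("utf-8")
    outcome = "decoded"
except UnicodeDecodeError:
    outcome = "error"

print(ascii(proper), ascii(garbage), outcome)
```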

Cheers,
-- 
Cameron Simpson <cs@zip.com.au>

But then, I'm only 50. Things may well get a bit much for me when I
reach the gasping heights of senile decrepitude of which old Andy
Woodward speaks with such feeling.
        - Chris Malcolm, cam@uk.ac.ed.aifh, DoD #205



#47435

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-09 02:03 -0700
Message-ID: <40931e6b-11dd-4f97-bb1f-44c9b002d98f@googlegroups.com>
In reply to: #47430
On Sunday, 9 June 2013 11:02:48 AM UTC+3, Cameron Simpson wrote:

> In this scenario, really it is the Terminal program (eg Putty) which
> cares about text (what you type, and what gets displayed). It is
> because of mismatches between your Terminal local settings and the
> encoding that was chosen for the filenames that you get garbage
> listings, one way or another.

Can you please give an example that shows a string being greek-iso encoded and then being utf8 decoded and presented back as:

1. properly
2. garbage (it means trash, but I don't know what a garbage char is)
3. error



#47408

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-08 14:21 -0700
Message-ID: <ce3bc9ae-f0eb-4dd2-ae73-75533e03921a@googlegroups.com>
In reply to: #47400
Sorry for displaying my code so many times, i know i ahve exhaust you but hti is the last thinkg i am gonna ask from you in this thread. We are very close to have this working.


#========================================================
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )

for filename in files:
	# Compute 'path/to/filename'
	filepath_bytes = path + filename
	for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
		try: 
			filepath = filepath_bytes.decode( encoding )
		except UnicodeDecodeError:
			continue
        
		# Rename to something valid in UTF-8 
		if encoding != 'utf-8': 
			os.rename( filepath_bytes, filepath.encode('utf-8') )

		assert os.path.exists( filepath )
		break 
	else: 
		# This only runs if we never reached the break
		raise ValueError( 'unable to clean filename %r' % filepath_bytes ) 


#========================================================
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
	try:
		# Check the presence of a file against the database and insert if it doesn't exist
		cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
		data = cur.fetchone()
		
		if not data:
			# First time for file; primary key is automatic, hit is defaulted
			print( "iam here", filename + '\n' )
			cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
	except pymysql.ProgrammingError as e:
		print( repr(e) )


#========================================================
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = ()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
	filepaths.add( filename )

# Delete spurious 
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
	if rec not in filepaths:
		cur.execute('''DELETE FROM files WHERE url = %s''', rec )





=================================================
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] Original exception was:
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 78, in <module>
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173]     assert os.path.exists( filepath )
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173]     os.stat(path)
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-37: ordinal not in range(128)
==================

The assert is there to make sure that the path/to/file exists after the rename, but why are we still getting those UnicodeEncodeErrors?



#47412

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-06-09 08:10 +1000
Message-ID: <mailman.2899.1370729423.3114.python-list@python.org>
In reply to: #47408

On Sun, Jun 9, 2013 at 7:21 AM, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
> Sorry for displaying my code so many times. I know I have exhausted you, but this is the last thing I am going to ask of you in this thread. We are very close to having this working.

You need to spend more time reading and less time frantically jumping
around. Go read my post on Unicode; it answers several of the
questions you posted in response to Steven's. And please, don't use
this list as your substitute for source control. Don't keep posting
your code. Most of us are ignoring it already.

ChrisA



#47431

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-09 01:11 -0700
Message-ID: <f438a4fd-d6a9-4e33-8b3d-abb28b306064@googlegroups.com>
In reply to: #47400

I'm sorry, I posted unnecessary code by mistake. Here is the correct code that produced the above error:


#========================================================
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )

for filename in files:
	# Compute 'path/to/filename'
	filepath_bytes = path + filename
	for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
		try: 
			filepath = filepath_bytes.decode( encoding )
		except UnicodeDecodeError:
			continue
        
		# Rename to something valid in UTF-8 
		if encoding != 'utf-8': 
			os.rename( filepath_bytes, filepath.encode('utf-8') )

		assert os.path.exists( filepath )
		break 
	else: 
		# This only runs if we never reached the break
		raise ValueError( 'unable to clean filename %r' % filepath_bytes ) 


#========================================================
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
	try:
		# Check the presence of a file against the database and insert if it doesn't exist
		cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
		data = cur.fetchone()
		
		if not data:
			# First time for file; primary key is automatic, hit is defaulted
			print( "iam here", filename + '\n' )
			cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
	except pymysql.ProgrammingError as e:
		print( repr(e) )


#========================================================
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = set()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
	filepaths.add( filename )

# Delete spurious 
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
	if rec[0] not in filepaths:  # each fetched row is a tuple such as ('name',)
		cur.execute('''DELETE FROM files WHERE url = %s''', rec )



#47403

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-06-09 04:47 +1000
Message-ID: <mailman.2894.1370719010.3114.python-list@python.org>
In reply to: #47326

On Sun, Jun 9, 2013 at 4:01 AM, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
> Hold on!
>
> In the beginning there was ASCII with values 0-127, and then there was
> Unicode with ASCII's 0-127 plus I don't know how many more?
>
> Now ASCII needs 1 byte to store a single character while Unicode needs 2
> bytes to store a character, and that is because it has more than 256
> characters to store (more than 2^8)?
>
> Is this correct?

No. Let me start from the beginning.

Computers don't work with characters, or strings, natively. They work
with numbers. To be specific, they work with bits; and it's only by
convention that we can work with anything larger. For instance,
there's a VERY common convention around the PC world that a set of
bits can be interpreted as a signed integer; if the highest bit is
set, it's negative. There are also standards for floating-point (IEEE
754), and so on.
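
As a small illustration of "it's only by convention", here is one byte with all eight bits set, read three different ways. The variable names are illustrative only.

```python
raw = bytes([0b11111111])                                # one byte, all bits set

as_unsigned = int.from_bytes(raw, "big", signed=False)   # read as unsigned
as_signed = int.from_bytes(raw, "big", signed=True)      # read as two's complement
as_text = raw.decode("latin-1")                          # read as a Latin-1 character

print(as_unsigned, as_signed, as_text)
```

The bits never change; only the convention used to read them does.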

ASCII is a character set. It defines a mapping of numbers to
characters - for instance, @ is 64, SOH is 1, $ is 36, etcetera,
etcetera. There are 128 such mappings. Since they all fit inside a
7-bit number, there's a trivial way to represent ASCII characters in a
PC's 8-bit byte: you just leave the high bit clear and use the other
seven. There have been various schemes for using the eighth bit -
serial ports with parity, WordStar (I think) marking the ends of
words, and most notably, Extended ASCII schemes that give you another
whole set of 128 characters. And that was the beginning of Code Pages,
because nobody could agree on what those extra 128 should be.
Norwegians used Norwegian, the Greeks were taught their Greek,
Arabians created themselves an Arabian codepage with the speed of
summer lightning, and Hebrews allocated from 255 down to 128, which is
absolutely frightening. But I digress.
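
The code page disagreement can be seen directly, as a rough sketch: the same byte above 127 is a different character under each 8-bit codec, and plain ASCII has no meaning for it at all. The three codecs chosen here are just examples.

```python
byte = bytes([0xE1])        # a value in the "extra 128", outside 7-bit ASCII

decoded = {codec: byte.decode(codec)
           for codec in ("latin-1", "iso-8859-7", "iso-8859-5")}
print(decoded)              # Latin 'á', Greek 'α', Cyrillic 'с'

# Strict ASCII refuses the byte outright.
try:
    byte.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
print(ascii_error)
```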

There were a variety of multi-byte schemes devised at various times,
but we'll ignore all of them and jump straight to Unicode. With
Unicode, there's (theoretically) no need to use any other system ever
again, because whatever character you want, it'll exist in Unicode. In
theory, of course; there are debates over that. Now, Unicode currently
has defined an "address space" of roughly 20 bits, and in a throwback
to the first programming I ever did, it's a segmented system: sixteen
or seventeen planes of 65,536 characters each. (Fortunately the planes
are identified by low numbers, not high numbers, and there's no
stupidity of overlapping planes the way the 8086 did with memory!) The
highest planes are special (plane 14 has a few special-purpose
characters, planes 15 and 16 are for private use), and most of the
middle ones have no characters assigned to them, so for the most part,
you'll see characters from the first three planes.
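
A quick sketch of the plane arithmetic, assuming Python 3.3 or later (where sys.maxunicode is always 0x10FFFF). The helper name plane() is made up for this example.

```python
import sys

def plane(ch):
    """Return the Unicode plane number of a single character."""
    return ord(ch) >> 16          # each plane holds 0x10000 (65,536) code points

# Code points run from 0 to sys.maxunicode, which gives seventeen planes.
total_planes = (sys.maxunicode + 1) // 0x10000

print(total_planes)
print(plane("A"), plane("α"), plane("\U0001F600"), plane("\U000F0000"))
```

U+1F600 (an emoji) sits in plane 1, and U+F0000 is the first code point of the private-use plane 15.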

So what do we now have? A mapping of characters to "code points",
which are numbers. (I'm leaving aside the issues of combining
characters and such for the moment.) But computers don't work with
numbers, they work with bits. Somehow we have to store those bits in
memory.

There are a good few ways to do that; one is to note that every
Unicode character can be represented inside 32 bits, so we can use the
standard integer scheme safely. (Since they fit inside 31 bits, we
don't even need to care if it's signed or unsigned.) That's called
UTF-32 or UCS-4, and it's a great way to handle the full Unicode range
in a manner that makes a Texan look agoraphobic. Wide builds of Python
up to 3.2 did this. Or you can try to store them in 16-bit numbers,
but then you have to worry about the ones that don't fit in 16 bits,
because it's really hard to squeeze 20 bits of information into 16
bits of storage. UTF-16 is one way to do this; special numbers mean
"grab another number". It has its issues, but is (in my opinion,
unfortunately) fairly prevalent. Narrow builds of Python up to 3.2 did
this. Finally, you can use a more complicated scheme that uses
anywhere from 1 to 4 bytes for each character, by carefully encoding
information into the top bit - if it's set, you have a multi-byte
character. That's how UTF-8 works, and is probably the most prevalent
disk/network encoding.
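
To make the size trade-off concrete, a small sketch that counts the bytes each of the three encodings needs for four sample characters. The "-le" codec variants are used only so that no byte order mark is added to the counts.

```python
codecs = ("utf-8", "utf-16-le", "utf-32-le")

# ASCII letter, Greek letter, euro sign, and an emoji from outside the first plane.
samples = ("A", "α", "€", "\U0001F600")

sizes = {ch: tuple(len(ch.encode(codec)) for codec in codecs)
         for ch in samples}

for ch, counts in sizes.items():
    print("U+%04X" % ord(ch), dict(zip(codecs, counts)))
```

UTF-32 always spends four bytes, UTF-16 spends two or four, and UTF-8 spends one to four depending on the character.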

All of the UTF-X systems are called "UCS Transformation Formats" (UCS
meaning Universal Character Set, roughly "Unicode"). They are mappings
from Unicode numbers to bytes. Between Unicode and UTF-X, you have a
mapping from character to byte sequence.

> Now UTF-8, latin-iso, greek-iso etc. are WAYS of storing characters on
> the hard drive?

The ISO standard 8859 specifies a number of ASCII-compatible
encodings, referred to as ISO-8859-1 through ISO-8859-16. You've been
working with ISO-8859-1, also called Latin-1, and ISO-8859-7, which
has your Greek characters in it. These are all ways of translating
characters into numbers; and since they all fit within 8 bits, they're
most commonly represented on PCs with single bytes.

> So taken from the above example (the closest I could think of), the way I
> understand them is:
>
> A 'string' can be of (Unicode's or ASCII's) type, and that type needs a way
> (that's a charset) to store this string on the HDD as a sequence of bytes?

A Python 3 'string' is always a series of Unicode characters. How
they're represented in memory doesn't matter, but as of Python 3.3
that's a fairly compact and efficient system that can omit unnecessary
zero bits. To store that string on your hard disk, send it across a
network, or transmit it to another process, you need to encode it as
bytes, somehow. The UCS Transformation Formats are specifically
designed for this, and most of the time, UTF-8 is going to be the best
option. It's compact, it's well known, and usually, it'll do
everything you want. The only thing it won't do is let you quickly
locate the Nth character, which is why it makes a poor in-memory
format.

Fortunately, Python lets us hide away pretty much all those details,
just as it lets us hide away the details of what makes up a list, a
dictionary, or an integer. You can safely assume that the string "foo"
is a string of three characters, which you can work with as
characters. The chr() and ord() functions let you switch between
characters and numbers, and str.encode() and bytes.decode() let you
switch between characters and byte sequences. Once you get your head
around the differences between those three, it all works fairly
neatly.
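
The three-way distinction (characters, numbers, byte sequences) in a minimal sketch, using the Greek letter alpha as the sample character:

```python
ch = "α"

number = ord(ch)                  # character -> code point (a number)
same_ch = chr(number)             # code point -> character

encoded = ch.encode("utf-8")      # character -> byte sequence
decoded = encoded.decode("utf-8") # byte sequence -> character

print(number, same_ch, encoded, decoded)
print(ord("A"), chr(65))          # the ASCII range works the same way
```

Note the direction of each pair: ord() takes a character and returns a number, chr() takes a number and returns a character.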

Chris Angelico



#47423

From: nagia.retsina@gmail.com
Date: 2013-06-08 22:09 -0700
Message-ID: <3fbb5d0e-51fb-4aed-b829-8388304a9885@googlegroups.com>
In reply to: #47403

On Saturday, 8 June 2013 9:47:53 PM UTC+3, Chris Angelico wrote:

> Fortunately, Python lets us hide away pretty much all those details, 
> just as it lets us hide away the details of what makes up a list, a
> dictionary, or an integer. You can safely assume that the string "foo"
> is a string of three characters, which you can work with as
> characters. The chr() and ord() functions let you switch between
> characters and numbers, and str.encode() and bytes.decode() let you
> switch between characters and byte sequences. Once you get your head
> around the differences between those three, it all works fairly
> neatly.

I'm trying too!

So,

chr('A') would give me the mapping of this char, the number 65 while
ord(65) would output the char 'A' likewise.

> and str.encode() and bytes.decode() let you switch between characters
> and byte sequences.

What would happen if we try to re-encode bytes on the disk?
Like trying:

s = "νίκος"
utf8_bytes = s.encode('utf-8')
greek_bytes = utf8_bytes.encode('iso-8859-7')

Can we re-encode twice, or as many times as we want, and then decode back in the reverse order, like this?

utf8_bytes = greek_bytes.decode('iso-8859-7')
s = utf8_bytes.decode('utf-8')

Is something like that totally crazy?

And also, is there a difference between "encoding" and "compressing"?

Isn't the latter using some form of encoding to make a string or bytes take less space on disk?



#47428

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-06-09 06:45 +0000
Message-ID: <51b4249d$0$30001$c3e8da3$5496439d@news.astraweb.com>
In reply to: #47423

On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:

> chr('A') would give me the mapping of this char, the number 65 while
> ord(65) would output the char 'A' likewise.

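Nearly: you have the two functions the wrong way around. ord('A') returns
the number 65, and chr(65) returns the character 'A'. Python uses Unicode,
where code-point 65 ("ordinal value 65") means letter "A".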

There are older encodings. For example, a very old one, used on IBM 
mainframes, is EBCDIC, where ordinal value 65 means the letter "â", and 
the letter "A" has ordinal value 193.
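
A brief sketch of the EBCDIC point, assuming Python's "cp037" codec as a representative EBCDIC code page (there are several variants, and they do not all agree on every position):

```python
unicode_value = ord("A")                 # code point in Unicode (and ASCII)
ascii_byte = "A".encode("ascii")[0]      # byte value when stored as ASCII
ebcdic_byte = "A".encode("cp037")[0]     # byte value when stored as EBCDIC (cp037)

print(unicode_value, ascii_byte, ebcdic_byte)
```

The letter is the same in every case; only the number chosen to represent it on disk differs.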

 
> What would happen if we try to re-encode bytes on the disk? Like
> trying:
> 
> s = "νίκος"
> utf8_bytes = s.encode('utf-8')
> greek_bytes = utf8_bytes.encode('iso-8859-7')
> 
> Can we re-encode twice, or as many times as we want, and then decode
> back in the reverse order, like this?

Of course. Bytes have no memory of where they came from, or what they are 
used for. All you are doing is flipping bits on a memory chip, or on a 
hard drive. So long as *you* remember which encoding is the right one, 
there is no problem. If you forget, and start using the wrong one, you 
will get garbage characters, mojibake, or errors.
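
One practical detail, shown as a sketch: in Python 3 a bytes object has no encode() method, so "re-encoding" bytes really means decoding them to a str with the encoding they were written in and then encoding that str again. The variable names follow the question above.

```python
s = "νίκος"
utf8_bytes = s.encode("utf-8")

# bytes objects cannot be encoded again directly in Python 3.
bytes_have_encode = hasattr(utf8_bytes, "encode")

# Transcode: bytes -> str (using the right encoding) -> bytes (new encoding).
greek_bytes = utf8_bytes.decode("utf-8").encode("iso-8859-7")

# Going back needs the encoding that matches how the bytes are stored now.
round_trip = greek_bytes.decode("iso-8859-7")

print(bytes_have_encode, utf8_bytes, greek_bytes, round_trip)
```

This can be repeated as often as you like, provided every decode uses the encoding the bytes are actually in at that moment.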

[...]
> And also, is there a difference between "encoding" and "compressing"?

Of course. They are totally unrelated.

> Isn't the latter using some form of encoding to make a string or bytes
> take less space on disk?

Correct, except forget about "encoding". It's not relevant (except, 
maybe, in a mathematical sense) and will just confuse you.
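
To keep the two ideas apart, a small sketch using the standard zlib module: encoding turns text into bytes, while compression turns bytes into fewer bytes and knows nothing about characters. The repeated sample text is only there so that the compressor has something to squeeze.

```python
import zlib

text = "νίκος " * 100

encoded = text.encode("utf-8")            # text -> bytes (encoding)
compressed = zlib.compress(encoded)       # bytes -> fewer bytes (compression)

restored = zlib.decompress(compressed)    # undo the compression first...
decoded = restored.decode("utf-8")        # ...then undo the encoding

print(len(text), len(encoded), len(compressed))
```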


-- 
Steven



#47429

From: nagia.retsina@gmail.com
Date: 2013-06-09 00:00 -0700
Message-ID: <19e762c7-a356-4ee1-9f50-82a128b5ac06@googlegroups.com>
In reply to: #47428

Thanks Steven, I'll read them in a bit. In the meantime, can you perhaps tell me what's wrong and why I am still getting decode issues?

[CODE]
# =================================================================================================================
# If user downloaded a file, thank the user !!!
# =================================================================================================================
if filename:
	#update file counter if cookie does not exist
	if not nikos:
		cur.execute('''UPDATE files SET hits = hits + 1, host = %s, lastvisit = %s WHERE url = %s''', (host, lastvisit, filename) )
	
	print('''<h2><font color=blue>Το αρχείο <font color=red> %s <font color=blue>κατεβαίνει!''' % filename )
	print('''<br><img src="/data/images/thanks.gif">''')
	print('''<br><br><br><h3><font color=blue>Και τώρα Tetris μέχρι να ολοκληρωθεί :-)''' )
	print('''<br><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://fpdownload.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,0,0" width="450" height="300""><param name="menu" value="false" /><param name="movie" value="http://www.fugly.com/f/1e6d8cd7b905f4e1bf72" /><param name="quality" value="high" /><embed src="http://www.fugly.com/f/1e6d8cd7b905f4e1bf72" AllowScriptAccess="always" menu="false" quality="high" width="450" height="300" name="FuglyGame" align="middle" type="application/x-shockwave-flash" pluginspage="http://www.macromedia.com/go/getflashplayer" /></object>''')
	
	print( '''<meta http-equiv="REFRESH" content="2;/data/apps/%s">''' % filename )
	sys.exit(0)


# =================================================================================================================
# Display download button for each file and download it on click
# =================================================================================================================
print('''<body background='/data/images/star.jpg'>
		 <center><img src='/data/images/download.gif'><br><br>
		 <table border=5 cellpadding=5 bgcolor=green>
''')


#========================================================
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )

for filename in files:
	# Compute 'path/to/filename'
	filepath_bytes = path + filename
	for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
		try: 
			filepath = filepath_bytes.decode( encoding )
		except UnicodeDecodeError:
			continue
        
		# Rename to something valid in UTF-8 
		if encoding != 'utf-8': 
			os.rename( filepath_bytes, filepath.encode('utf-8') )

		assert os.path.exists( filepath )
		break 
	else: 
		# This only runs if we never reached the break
		raise ValueError( 'unable to clean filename %r' % filepath_bytes ) 


#========================================================
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
	try:
		# Check the presence of a file against the database and insert if it doesn't exist
		cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
		data = cur.fetchone()
		
		if not data:
			# First time for file; primary key is automatic, hit is defaulted
			print( "iam here", filename + '\n' )
			cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
	except pymysql.ProgrammingError as e:
		print( repr(e) )


#========================================================
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = set()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
	filepaths.add( filename )

# Delete spurious 
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
	if rec[0] not in filepaths:  # each fetched row is a tuple such as ('name',)
		cur.execute('''DELETE FROM files WHERE url = %s''', rec )
[/CODE] 

When I try to run it, it still errors out:

[CODE]
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] Original exception was:, referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] Traceback (most recent call last):, referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 83, in <module>, referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173]     assert os.path.exists( filepath ), referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists, referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173]     os.stat(path), referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-37: ordinal not in range(128), refere
[/CODE] 

Why am I still receiving Unicode decode errors?
With your help we have written a procedure just to avoid this kind of decoding issue, renaming all Greek-encoded filenames to UTF-8-encoded ones.

Is it the assert that fails? Do we have a logic error somewhere that I don't see?
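
For comparison, a hedged sketch of the same renaming idea that stays in bytes from start to finish, so that no path is ever converted from str by the interpreter. The function names to_utf8_name() and normalise_directory() are made up for this example, and this is not a confirmed fix for the error above.

```python
import os

def to_utf8_name(name_bytes, encodings=("utf-8", "iso-8859-7", "latin-1")):
    """Guess the encoding of a file name and return it re-encoded as UTF-8."""
    for encoding in encodings:
        try:
            return name_bytes.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    raise ValueError("unable to clean filename %r" % name_bytes)

def normalise_directory(path_bytes):
    """Rename every entry of path_bytes (a bytes path) to a UTF-8 name."""
    for name in os.listdir(path_bytes):           # bytes in, bytes out
        new_name = to_utf8_name(name)
        old_path = os.path.join(path_bytes, name)
        new_path = os.path.join(path_bytes, new_name)
        if new_name != name:
            os.rename(old_path, new_path)
        # Check with the bytes path, not a decoded str.
        assert os.path.exists(new_path)

print(to_utf8_name("νίκος".encode("iso-8859-7")))
```

It would be called as normalise_directory(b'/home/nikos/public_html/data/apps/'). The guess order matters: names that are already valid UTF-8 are left alone, and Latin-1 is a last resort because it accepts any byte.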



#47432

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-06-09 08:15 +0000
Message-ID: <51b4398a$0$30001$c3e8da3$5496439d@news.astraweb.com>
In reply to: #47429

On Sun, 09 Jun 2013 00:00:53 -0700, nagia.retsina wrote:

> path = b'/home/nikos/public_html/data/apps/'
> files = os.listdir( path )
> 
> for filename in files:
> 	# Compute 'path/to/filename'
> 	filepath_bytes = path + filename
> 	for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
> 		try:
> 			filepath = filepath_bytes.decode( encoding )
> 		except UnicodeDecodeError:
> 			continue
>         
> 		# Rename to something valid in UTF-8
> 		if encoding != 'utf-8':
> 			os.rename( filepath_bytes, 
>                                  filepath.encode('utf-8') )
> 		assert os.path.exists( filepath )
> 		break
> 	else:
> 		# This only runs if we never reached the break 
>               raise ValueError(
>                     'unable to clean filename %r' % filepath_bytes )

Editing the traceback to get rid of unnecessary noise from the logging:

Traceback (most recent call last):
  File "/home/nikos/public_html/cgi-bin/files.py", line 83, in <module>
  assert os.path.exists( filepath )
  File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
  os.stat(path)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 
34-37: ordinal not in range(128)


> Why am I still receiving Unicode decode errors? With your help we have
> written a procedure just to avoid this kind of decoding issue, renaming
> all Greek-encoded filenames to UTF-8-encoded ones.

That's a very good question. It works for me when I test it, so I cannot 
explain why it fails for you.

Please try this: log into the Linux server, and then start up a Python 
interactive session by entering:

python3.3

at the $ prompt. Then, at the >>> prompt, enter these lines of code. You 
can copy and paste them:


import os, sys
print(sys.version)
s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
     '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
     '\N{GREEK SMALL LETTER EPSILON}')
print(s)
filename = '/tmp/' + s
open(filename, 'w')
os.path.exists(filename)


Copy and paste the results back here please.



> Is it the assert that fails? Do we have a logic error somewhere that I
> don't see?

Please read the error message. Does it say AssertionError?

If it says AssertionError, then the assert has failed. If it says 
something else, the code failed before the assert can run.
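
A sketch of that distinction, with a made-up helper outcome() that reports which exception (if any) comes out of an assert statement. The second call stands in for an expression that blows up before the assert gets a value to test.

```python
def outcome(expression):
    """Evaluate `assert expression()` and report what happened."""
    try:
        assert expression()
    except AssertionError:
        return "AssertionError"          # expression ran and returned a false value
    except UnicodeEncodeError:
        return "UnicodeEncodeError"      # expression itself raised first
    return "passed"

results = [
    outcome(lambda: True),                    # condition holds
    outcome(lambda: False),                   # condition fails
    outcome(lambda: "αβγ".encode("ascii")),   # never reaches the test
]
print(results)
```

(This assumes Python is not started with -O, which strips assert statements.)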


-- 
Steven



#47438

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-09 02:14 -0700
Message-ID: <0a22570a-6bf6-4115-a7a8-a1684680702e@googlegroups.com>
In reply to: #47432

On Sunday, 9 June 2013 11:15:07 AM UTC+3, Steven D'Aprano wrote:

> Please try this: log into the Linux server, and then start up a Python 

> import os, sys 
> print(sys.version)
> s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}' 
>      '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}' 
>      '\N{GREEK SMALL LETTER EPSILON}')
> print(s)
> filename = '/tmp/' + s
> open(filename, 'w')
> os.path.exists(filename)

> Copy and paste the results back here please.

Of course: here it is:

root@nikos [/home/nikos/www/cgi-bin]# python
Python 3.3.2 (default, Jun  3 2013, 16:18:05)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> print(sys.version)
3.3.2 (default, Jun  3 2013, 16:18:05)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)]
>>> s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
...      '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
...      '\N{GREEK SMALL LETTER EPSILON}')
>>> print(s)
αβγδε
>>> filename = '/tmp/' + s
>>> open(filename, 'w')
<_io.TextIOWrapper name='/tmp/αβγδε' mode='w' encoding='UTF-8'>
>>> os.path.exists(filename)
True
>>>



#47445

From: Νικόλαος Κούρας <nikos.gr33k@gmail.com>
Date: 2013-06-09 03:32 -0700
Message-ID: <3b2647bb-4b5a-4391-9ff4-6b5e755d9770@googlegroups.com>
In reply to: #47438

On Sunday, 9 June 2013 12:14:12 PM UTC+3, Νικόλαος Κούρας wrote:
> On Sunday, 9 June 2013 11:15:07 AM UTC+3, Steven D'Aprano wrote:
>
> > Please try this: log into the Linux server, and then start up a Python
> >
> > import os, sys
> > print(sys.version)
> > s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
> >      '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
> >      '\N{GREEK SMALL LETTER EPSILON}')
> > print(s)
> > filename = '/tmp/' + s
> > open(filename, 'w')
> > os.path.exists(filename)
> >
> > Copy and paste the results back here please.
>
> Of course: here it is:
>
> root@nikos [/home/nikos/www/cgi-bin]# python
> Python 3.3.2 (default, Jun  3 2013, 16:18:05)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os, sys
> >>> print(sys.version)
> 3.3.2 (default, Jun  3 2013, 16:18:05)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)]
> >>> s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
> ...      '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
> ...      '\N{GREEK SMALL LETTER EPSILON}')
> >>> print(s)
> αβγδε
> >>> filename = '/tmp/' + s
> >>> open(filename, 'w')
> <_io.TextIOWrapper name='/tmp/αβγδε' mode='w' encoding='UTF-8'>
> >>> os.path.exists(filename)
> True
> >>>

I don't know much, but it looks correct to me. But then again, why this error?



#47439

From: Cameron Simpson <cs@zip.com.au>
Date: 2013-06-09 19:16 +1000
Message-ID: <mailman.2912.1370769399.3114.python-list@python.org>
In reply to: #47432

On 09Jun2013 08:15, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
| On Sun, 09 Jun 2013 00:00:53 -0700, nagia.retsina wrote:
| > path = b'/home/nikos/public_html/data/apps/'
| > files = os.listdir( path )
| > 
| > for filename in files:
| > 	# Compute 'path/to/filename'
| > 	filepath_bytes = path + filename
| > 	for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
| > 		try:
| > 			filepath = filepath_bytes.decode( encoding )
| > 		except UnicodeDecodeError:
| > 			continue
| >         
| > 		# Rename to something valid in UTF-8
| > 		if encoding != 'utf-8':
| > 			os.rename( filepath_bytes, 
| >                                  filepath.encode('utf-8') )
| > 		assert os.path.exists( filepath )
| > 		break
| > 	else:
| > 		# This only runs if we never reached the break 
| >               raise ValueError(
| >                     'unable to clean filename %r' % filepath_bytes )
| 
| Editing the traceback to get rid of unnecessary noise from the logging:
| 
| Traceback (most recent call last):
|   File "/home/nikos/public_html/cgi-bin/files.py", line 83, in <module>
|   assert os.path.exists( filepath )
|   File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
|   os.stat(path)
| UnicodeEncodeError: 'ascii' codec can't encode characters in position 
| 34-37: ordinal not in range(128)
| 
| > Why am I still receiving Unicode decode errors? With your help we have
| > written a procedure just to avoid this kind of decoding issue, renaming
| > all Greek-encoded filenames to UTF-8-encoded ones.
| 
| That's a very good question. It works for me when I test it, so I cannot 
| explain why it fails for you.

If he's lucky the UnicodeEncodeError occurred while trying to print
an error message, printing a greek Unicode string in the error with
ASCII as the output encoding (default when not a tty IIRC).
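
Whichever step is responsible, the wording of the message is what Python produces whenever a str containing non-ASCII characters gets encoded as ASCII. A sketch with a hypothetical file name made of four Greek letters, placed in the directory from the script:

```python
directory = "/home/nikos/public_html/data/apps/"   # 34 ASCII characters
filepath = directory + "αβγδ.txt"                  # hypothetical file name

try:
    filepath.encode("ascii")
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)

print(len(directory))
print(message)
```

The positions in the message are offsets into the string being encoded, so "34-37" is consistent with the first four characters of a file name in that directory being non-ASCII. That matches the traceback, but it does not by itself show which component chose ASCII.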

Cheers,
-- 
Cameron Simpson <cs@zip.com.au>

I generally avoid temptation unless I can't resist it.  - Mae West



#47458

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-06-09 12:36 +0000
Message-ID: <51b476e2$0$30001$c3e8da3$5496439d@news.astraweb.com>
In reply to: #47439

On Sun, 09 Jun 2013 19:16:06 +1000, Cameron Simpson wrote:


> If he's lucky the UnicodeEncodeError occurred while trying to print an
> error message, 

That's not what happens at the interactive console:

py> assert os.path.exists('Ж1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError


> printing a greek Unicode string in the error with ASCII
> as the output encoding (default when not a tty IIRC).


An interesting thought. How would we test that?
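
One possible way to test it, offered as a sketch rather than a diagnosis: collect the encodings the interpreter has chosen, and compare the report from an interactive shell with the report from the CGI script. The helper name encoding_report() is made up; the calls it makes are standard library functions.

```python
import locale
import sys

def encoding_report():
    """Collect the encodings Python picked for output, file names and the locale."""
    return {
        "stdout": getattr(sys.stdout, "encoding", None),
        "stdout_is_tty": sys.stdout.isatty(),
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
    }

report = encoding_report()
for key, value in sorted(report.items()):
    print(key, "=", value)
```

If the shell reports UTF-8 everywhere while the web server's environment reports ASCII for any of these, that would point at the environment the script runs in rather than at the renaming code.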



-- 
Steven



#47476

From: nagia.retsina@gmail.com
Date: 2013-06-09 10:25 -0700
Message-ID: <2de6f168-5b93-4ee8-b9e3-44bd05158191@googlegroups.com>
In reply to: #47458

On Sunday, 9 June 2013 3:36:51 PM UTC+3, Steven D'Aprano wrote:

> > printing a greek Unicode string in the error with ASCII 
> > as the output encoding (default when not a tty IIRC).

> An interesting thought. How would we test that?

Please elaborate on this for me. I didn't understand what you are trying to say, i.e. your assumption about why I am still getting decode issues.


