Groups > comp.lang.python > #47322 > unrolled thread
| Started by | Cameron Simpson <cs@zip.com.au> |
|---|---|
| First post | 2013-06-07 18:53 +1000 |
| Last post | 2013-06-10 13:28 -0700 |
| Articles | 20 on this page of 68 — 14 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-07 18:53 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) alex23 <wuwei23@gmail.com> - 2013-06-07 02:41 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-07 04:53 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-07 15:29 +0100
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-07 11:52 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Zero Piraeus <schesis@gmail.com> - 2013-06-07 15:31 -0400
Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-07 21:45 +0100
Re: Changing filenames from Greeklish => Greek (subprocess complain) Zero Piraeus <schesis@gmail.com> - 2013-06-07 19:24 -0400
Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-08 12:52 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-07 23:49 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-08 16:58 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-08 07:26 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-08 17:40 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-08 17:32 +0100
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 09:53 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 10:35 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) MRAB <python@mrabarnett.plus.com> - 2013-06-08 18:48 +0100
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-07 15:33 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-08 12:49 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 21:01 +0300
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-08 19:01 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 14:14 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 08:32 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 07:46 +0300
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 06:25 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 18:02 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:03 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-08 14:21 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-09 08:10 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 01:11 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Chris Angelico <rosuav@gmail.com> - 2013-06-09 04:47 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-08 22:09 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 06:45 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-09 00:00 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 08:15 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:14 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 03:32 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 19:16 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 12:36 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-09 10:25 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Lele Gaifax <lele@metapensiero.it> - 2013-06-09 10:55 +0200
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:08 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Lele Gaifax <lele@metapensiero.it> - 2013-06-09 11:20 +0200
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:38 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Andreas Perstinger <andipersti@gmail.com> - 2013-06-09 14:24 +0200
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 13:13 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Benjamin Kaplan <benjamin.kaplan@case.edu> - 2013-06-09 13:05 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:42 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 03:37 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Larry Hudson <orgnut@yahoo.com> - 2013-06-10 00:51 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 01:11 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Larry Hudson <orgnut@yahoo.com> - 2013-06-11 00:20 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 11:50 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 05:18 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:00 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Cameron Simpson <cs@zip.com.au> - 2013-06-09 19:12 +1000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 02:20 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Benjamin Kaplan <benjamin.kaplan@case.edu> - 2013-06-09 13:01 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-09 12:31 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) nagia.retsina@gmail.com - 2013-06-10 00:10 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Andreas Perstinger <andipersti@gmail.com> - 2013-06-10 10:15 +0200
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 01:54 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 02:59 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Andreas Perstinger <andipersti@gmail.com> - 2013-06-10 12:42 +0200
Re: Changing filenames from Greeklish => Greek (subprocess complain) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-10 11:59 +0000
Re: Changing filenames from Greeklish => Greek (subprocess complain) Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-10 07:27 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) jmfauth <wxjmfauth@gmail.com> - 2013-06-10 12:48 -0700
Re: Changing filenames from Greeklish => Greek (subprocess complain) Ned Batchelder <ned@nedbatchelder.com> - 2013-06-10 13:28 -0700
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-08 19:01 +0000 |
| Message-ID | <51b37fa4$0$29966$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #47396 |
On Sat, 08 Jun 2013 21:01:23 +0300, Νικόλαος Κούρας wrote:

> In the beginning there was ASCII with 0-127 values

No, there were encoding systems that existed before ASCII, such as EBCDIC. But we can ignore those, and just start with ASCII.

> and then there was
> Unicode with 0-127 of ASCII's + i dont know how much many more?

No, you have missed the utter chaos of dozens and dozens of Windows codepages and charsets. We still have to live with the pain of that.

But now we have Unicode, whose code points run from 0 to 0x10FFFF (decimal 1114111), which makes 1114112 code points in total. You can consider a code point to be the same as a character, at least for now.

> Now ASCII needs 1 byte to store a single character

ASCII actually needs 7 bits to store a character. Since computers are optimized to work with bytes, not bits, normally ASCII characters are stored in a single byte, with one bit wasted.

> while Unicode needs 2 bytes to store a character

No. Since Unicode "characters" (really code points, but ignore the difference) go up to 0x10FFFF, two bytes is not enough. Unicode needs 21 bits to store a character. Since that is more than 2 bytes, but less than 3, there are a few different ways for Unicode to be stored in memory, including:

"Wide" Unicode uses four bytes per character. Why four instead of three? Because computers are more efficient when working with chunks of memory whose size is a multiple of four bytes.

"Narrow" Unicode uses two bytes per character. Since two bytes is only enough for about 65,000 characters, not 1,000,000+, the rest of the characters are stored as pairs of two-byte "surrogates".

> and that is because it has > 256 characters to store > 2^8bits ?

Correct.

> Now UTF-8, latin-iso, greek-iso e.t.c are WAYS of storing characters
> into the hard drive?

Your computer cannot carve a tiny little "A" into the hard drive when it stores that letter in a file. It has to write some bytes. So you need to know:

- what byte, or bytes, represents the letter "A"?
- what byte, or bytes, represents the letter "B"?
- what byte, or bytes, represents the letter "λ"?

and so on. This set of rules, "byte XXXX means letter YYYY", is called an encoding. If you don't know what encoding was used, you cannot tell what the bytes mean.

> Because in some post i have read that 'UTF-8 encoding of Unicode'. Can
> you please explain to me whats the difference of ASCII-Unicode
> themselves and then of them compared to 'Charsets'. I'm still confused
> about this.

A charset is an ordered set of characters. For example, ASCII has 128 characters, starting with NUL:

NUL ... A B C D E ... Z [ \ ] ^ ... a b c ... z ...

where NUL is at position 0, 'A' is at position 65, 'B' at position 66, and so on.

Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is also similar, also 256 positions, but the characters are different. And so on, with dozens of charsets.

And then there is Unicode, which includes *every* character in all of those dozens of charsets. It has 1114112 positions (most are currently unfilled).

An encoding is simply a program that takes a character and returns a byte (or several bytes), or vice versa. For instance, the ASCII encoding will take character 'A'. That is found at position 65, which is 0x41 in hexadecimal, so the ASCII encoding turns character 'A' into byte 0x41, and vice versa.

> Is it like we said in C++:
> ' int a', a variable with name 'a' of type integer. 'char a', a
> variable with name 'a' of type char
>
> So taken from above example (the closest i could think of), the way i
> understand them is:
>
> A 'string' can be of (unicode's or ascii's) type and that type needs a
> way (thats a charset) to store this string into the hdd as a sequence of
> bytes?

Correct.

--
Steven
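The distinction drawn above between a charset (character to number) and an encoding (number to bytes) can be tried directly in the interpreter. The following is a minimal sketch, assuming Python 3.3 or later; the variable names are illustrative and not from the thread.

```python
import sys

# Charset side: ord() and chr() map between a character and its position
# (code point). These are plain integers, independent of any encoding.
code_point_a = ord("A")            # 65, which is 0x41
code_point_lambda = ord("λ")       # 955, which is 0x3BB
highest_code_point = sys.maxunicode  # 0x10FFFF on Python 3.3+

# Encoding side: the same character becomes different bytes depending on
# which set of rules ("byte XXXX means letter YYYY") is applied.
ascii_a = "A".encode("ascii")              # one byte, 0x41
greek_lambda = "λ".encode("iso-8859-7")    # one byte in the Greek charset
utf8_lambda = "λ".encode("utf-8")          # two bytes in UTF-8

# ASCII has no position for "λ", so encoding it as ASCII must fail.
try:
    "λ".encode("ascii")
    ascii_can_encode_lambda = True
except UnicodeEncodeError:
    ascii_can_encode_lambda = False

print(code_point_a, hex(code_point_lambda), hex(highest_code_point))
print(ascii_a, greek_lambda, utf8_lambda)
print("ASCII can encode λ:", ascii_can_encode_lambda)
```

Note that the byte stored for "λ" differs between ISO-8859-7 and UTF-8 even though the code point (955) is the same in both cases, which is exactly why the reader of a file needs to know which encoding the writer used.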
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-08 14:14 -0700 |
| Message-ID | <e1cfd5ed-798d-44fa-8bf7-17f3549a288e@googlegroups.com> |
| In reply to | #47400 |
On Saturday, 8 June 2013 10:01:57 PM UTC+3, Steven D'Aprano wrote:

> ASCII actually needs 7 bits to store a character. Since computers are
> optimized to work with bytes, not bits, normally ASCII characters are
> stored in a single byte, with one bit wasted.

So ASCII and Unicode are two encoding systems currently in use. How should I imagine them, visualize them? Like tables, 'A' = 65, 'B' = 66 and so on?

But if I do, then that would be the visualization of a 'charset', not of an encoding system. What is the difference between an encoding system and a charset?

ebcdic - ascii - unicode = all of them are encoding systems
greek-iso - latin-iso - utf8 - utf16 = all of them are charsets.

What are these differences? I can't imagine them all; I can only imagine charsets, not encoding systems.

Why does Python interpret all given strings as Unicode by default and not as ASCII? Because the former supports many positions while ASCII has only 127 positions, and hence can interpret only 127 different characters?

> "Narrow" Unicode uses two bytes per character. Since two bytes is only
> enough for about 65,000 characters, not 1,000,000+, the rest of the
> characters are stored as pairs of two-byte "surrogates".

Does "surrogate" literally mean a replacement?

> Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is
> also similar, also 256 positions, but the characters are different. And
> so on, with dozens of charsets.

Latin has to display English characters (capital, small) + numbers + symbols. That would be 127, so why 256?

Greek = all of the above plus Greek characters, no?

> And then there is Unicode, which includes *every* character in all of
> those dozens of charsets. It has 1114111 positions (most are currently
> unfilled).

Shouldn't the number of positions Unicode uses equal the sum of all available characters of all the languages of the world, plus numbers and special characters? Why 1,000,000+? Why the need for so many positions? The narrow Unicode format (2 bytes) can cover all of the world's symbols.

> An encoding is simply a program that takes a character and returns a
> byte, or vice versa. For instance, the ASCII encoding will take character
> 'A'. That is found at position 65, which is 0x41 in hexadecimal, so the
> ASCII encoding turns character 'A' into byte 0x41, and vice versa.

Why do you say ASCII turns a character into hex format and not into binary format? Isn't the latter the way bytes are stored on the hdd, like 010101111010101 etc.? Are they stored as hex instead, or did you just say so to avoid printing 0s and 1s?
| From | Cameron Simpson <cs@zip.com.au> |
|---|---|
| Date | 2013-06-09 08:32 +1000 |
| Message-ID | <mailman.2903.1370730797.3114.python-list@python.org> |
| In reply to | #47406 |
On 08Jun2013 14:14, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
| On Saturday, 8 June 2013 10:01:57 PM UTC+3, Steven D'Aprano wrote:
| > ASCII actually needs 7 bits to store a character. Since computers are
| > optimized to work with bytes, not bits, normally ASCII characters are
| > stored in a single byte, with one bit wasted.
|
| So ASCII and Unicode are 2 Encoding Systems currently in use.
| How should i imagine them, visualize them?
| Like tables 'A' = 65, 'B' = 66 and so on?

Yes, that works.

| But if i do then that would be the visualization of a 'charset' not of an encoding system.
| What is the difference between an encoding system and a charset?

An encoding system is the method of transcribing these values to bytes and back again.

| ebcdic - ascii - unicode = all of them are encoding systems
| greek-iso - latin-iso - utf8 - utf16 = all of them are charsets.

No.

EBCDIC and ASCII and Unicode and Greek-ISO (iso-8859-7) are all character sets (1:1 mappings of characters to numbers/ordinals).

An encoding is a way of writing these values to bytes. Decoding reads bytes and emits character values.

Because all of the EBCDIC, ASCII and iso-8859-x character sets fit in the range 0-255, they are usually transcribed (encoded) directly, one byte per ordinal.

Unicode is much larger. It cannot be transcribed (encoded) as one byte per value. There are several ways of transcribing Unicode. UTF-8 is a popular and usually compact form, using one byte for values below 128 and multiple bytes for higher values.

| Why python interprets by default all given strings as unicode and
| not ascii? because the former supports many positions while ascii
| only 127 positions , hence can interpret only 127 different characters?

Yes.

[...]
| > Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is
| > also similar, also 256 positions, but the characters are different. And
| > so on, with dozens of charsets.
|
| Latin has to display english chars(capital, small) + numbers + symbols. that would be 127 why 256?

ASCII runs up to 127. Essentially English, numerals, control codes and various symbols.

The iso-8859-x sets run to 255, and the upper 128 values map to characters popular in various regions.

| greek = all of the above plus greek chars, no?

So iso-8859-7 includes the Greek characters.

| > And then there is Unicode, which includes *every* character in all of
| > those dozens of charsets. It has 1114111 positions (most are currently
| > unfilled).
|
| Shouldn't the positions that Unicode has to use equal the sum
| of all available characters of all the languages of the world plus
| numbers and special chars? why 1.000.000+ why the need for so many
| positions? Narrow Unicode format (2 bytes) can cover all of the
| world's symbols.

2 bytes is not enough. Chinese alone has more glyphs than that.

| > An encoding is simply a program that takes a character and returns a
| > byte, or vice versa. For instance, the ASCII encoding will take character
| > 'A'. That is found at position 65, which is 0x41 in hexadecimal, so the
| > ASCII encoding turns character 'A' into byte 0x41, and vice versa.
|
| Why you say ASCII turn a character into HEX format and not as in binary format?

Steven didn't say that. He said "position 65". People often write bytes in hex (eg 0x41) because a byte always fits in two hex digits (16 x 16 = 256 values) and because often these values have binary-based subranges, and hex makes that more obvious.

For example, 'A' is 0x41. 'a' is 0x61. So you can look at the hex code and almost visually know if you're dealing with upper or lower case, etc.

| Isnt the latter the way bytes are stored into hdd, like 010101111010101 etc?
| Are they stored as hex instead or you just said so to avoid printing 0s and 1s?

They're stored as bits at the gate level. But writing hex codes _in_ _text_ is more compact, and more readable for humans.

Cheers,
--
Cameron Simpson <cs@zip.com.au>

A lot of people don't know the difference between a violin and a viola, so
I'll tell you. A viola burns longer. - Victor Borge
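A few of the points in this reply (hex making the upper/lower case relationship visible, the upper 128 values meaning different things in different iso-8859-x sets, and UTF-8 using one byte only below 128) can be checked with a short sketch. It assumes Python 3; the sample characters are chosen for illustration only.

```python
# 'A' (0x41) and 'a' (0x61) differ in exactly one bit, which is easy to
# spot in hex or binary and harder to spot in decimal (65 versus 97).
upper, lower = ord("A"), ord("a")
case_bit = upper ^ lower  # XOR leaves only the differing bit: 0x20
print(hex(upper), hex(lower), bin(upper), bin(lower), hex(case_bit))

# The byte is always stored as the same eight bits; hex, binary and
# decimal are just three ways of writing that one value in text.
assert 0x41 == 0b01000001 == 65

# The same byte above 127 stands for different characters depending on
# which 8-bit character set is used to decode it.
raw = bytes([0xE1])
as_greek = raw.decode("iso-8859-7")   # Greek small letter alpha
as_latin1 = raw.decode("latin-1")     # Latin small letter a with acute
print(as_greek, as_latin1)

# UTF-8 spends one byte on values below 128 and more bytes above that.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in ("A", "α", "€")}
print(utf8_lengths)
```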
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-09 07:46 +0300 |
| Message-ID | <mailman.2906.1370753210.3114.python-list@python.org> |
| In reply to | #47406 |
On 9/6/2013 1:32 AM, Cameron Simpson wrote:
> On 08Jun2013 14:14, Νίκος Γκρ33κ <nikos.gr33k@gmail.com> wrote:
> | On Saturday, 8 June 2013 10:01:57 PM UTC+3, Steven D'Aprano wrote:
> | > ASCII actually needs 7 bits to store a character. Since computers are
> | > optimized to work with bytes, not bits, normally ASCII characters are
> | > stored in a single byte, with one bit wasted.
> |
> | So ASCII and Unicode are 2 Encoding Systems currently in use.
> | How should i imagine them, visualize them?
> | Like tables 'A' = 65, 'B' = 66 and so on?
>
> Yes, that works.
>
> | But if i do then that would be the visualization of a 'charset' not of an encoding system.
> | What is the difference between an encoding system and a charset?
>
> An encoding system is the method of transcribing these values to bytes and back again.

So we have:

( 'A' mapped to the value of '65' ) => encoding process (i.e. utf-8) => bytes
bytes => decoding process (i.e. utf-8) => ( '65' mapped to character 'A' )

Why does every character in a character set need to be associated with a numeric value? I mean, couldn't we just have character sets that wouldn't have numeric associations, like:

'A' => encoding process (i.e. utf-8) => bytes
bytes => decoding process (i.e. utf-8) => character 'A'

> EBCDIC and ASCII and Unicode and Greek-ISO (iso-8859-7) are all character sets
> (1:1 mappings of characters to numbers/ordinals).
>
> An encoding is a way of writing these values to bytes.
> Decoding reads bytes and emits character values.
>
> Because all of the EBCDIC, ASCII and iso-8859-x character sets fit in the range 0-255,
> they are usually transcribed (encoded) directly, one byte per ordinal.
>
> Unicode is much larger. It cannot be transcribed (encoded) as one byte per value.
> There are several ways of transcribing Unicode. UTF-8 is a popular and usually compact form,
> using one byte for values below 128 and multiple bytes for higher values.

An ordinal = ordered numbers like 7, 8, 9, 10 and so on?

Since 1 byte can hold up to 256 chars, why doesn't utf-8 use 1 byte for values up to 256?

UTF-8 and UTF-16 and UTF-32: I thought the number beside "UTF-" was there to declare how many bits the character set uses to store a character on the hdd, no?

> "Narrow" Unicode uses two bytes per character. Since two bytes is only
> enough for about 65,000 characters, not 1,000,000+, the rest of the
> characters are stored as pairs of two-byte "surrogates".

Can you please explain the line "the rest of the characters are stored as pairs of two-byte surrogates" in a way that is easier for me to understand? I'm still having trouble understanding what a surrogate is.

Again, thank you very much for explaining the encodings to me; they have been giving me trouble for years in all of my scripts.

And one last thing. When the locale of a linux system is set to utf-8, that would mean that the linux applications should encode strings onto the hdd using the system's default encoding, utf-8, and read them back from bytes also using utf-8. Is that correct?

--
Webhost <http://superhost.gr> && Weblog <http://psariastonafro.wordpress.com>
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-09 06:25 +0000 |
| Message-ID | <51b41fc6$0$30001$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #47422 |
On Sun, 09 Jun 2013 07:46:40 +0300, Νικόλαος Κούρας wrote:

> Why does every character in a character set need to be associated with
> a numeric value?

Because computers are digital, not analog, and because bytes are numbers. Here are a few of the 256 possible bytes, written in binary, decimal and hexadecimal:

    0b00000000    0  0x00
    0b00000001    1  0x01
    0b00000010    2  0x02
    [...]
    0b01111111  127  0x7F
    0b10000000  128  0x80
    [...]
    0b11111110  254  0xFE
    0b11111111  255  0xFF

EVERYTHING in computers is numbers, because everything is stored as bytes. Text is stored as bytes. Sound files are stored as bytes. Images are stored as bytes. Programs are stored as bytes. So everything is being stored as numbers. But the *meaning* we give to those numbers depends on what we do with them, whether we treat them as characters, bitmapped images, floating point values, or something else.

Once we decide we want to store the character "A" as bytes, we need to decide which number it should be. That is the job of the charset.

ASCII:

    65 <--> 'A'
    66 <--> 'B'
    67 <--> 'C'

etc.

> I mean couldn't we just have character sets that wouldn't have numeric
> associations like:
>
> 'A' => encoding process(i.e. utf-8) => bytes
> bytes => decoding process(i.e. utf-8) => character 'A'

No. How would you store it in a computer's memory, or on a hard drive? By carving a tiny, microscopic "A" onto the hard drive? How would you read it back?

It is theoretically possible to build an analog computer, out of clockwork, or water flowing through pipes, or something, but nobody really bothers because it is much harder and not very useful.

> An ordinal = ordered numbers like 7,8,9,10 and so on?

Yes.

> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
> values up to 256?

Because then how do you tell when you need one byte, and when you need two? If you read two bytes, and see 0x4C 0xFA, does that mean two characters, with ordinal values 0x4C and 0xFA, or one character with ordinal value 0x4CFA?

UTF-8 solves this problem by reserving some values to mean "this byte, on its own", and others to mean "this byte, plus the next byte, together", and so forth, up to four bytes. If you look up UTF-8 on Wikipedia, you will see more about this.

> UTF-8 and UTF-16 and UTF-32
> I thought the number beside of UTF- was to declare how many bits the
> character set was using to store a character into the hdd, no?

Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit values to make a surrogate pair. UTF-8 uses 8-bit values, but sometimes it combines two, three or four of them to represent a single code point.

> > "Narrow" Unicode uses two bytes per character. Since two bytes is only
> > enough for about 65,000 characters, not 1,000,000+, the rest of the
> > characters are stored as pairs of two-byte "surrogates".
>
> Can you please explain this line "the rest of the characters are stored
> as pairs of two-byte "surrogates"" more easily for me to understand it?
> I'm still having trouble understanding what a surrogate is.

Look up UTF-16 and "surrogate pair" on Wikipedia. But basically, there are 65000+ different possible 16-bit values available for UTF-16 to use. Some of those values are reserved to mean "this value is not a character, it is half of a surrogate pair". Since they are *pairs*, they must always come in twos. A surrogate pair makes up a valid character. Half of a surrogate pair, on its own, is an error.

A lot of this complexity is because of historical reasons. For example, when Unicode was first invented, it had room for only about 65 thousand characters, and a fixed 16 bits was all you needed. But it was soon learned that 65 thousand was not enough (there are more than 65,000 Asian characters alone!) and so UTF-16 developed the trick with surrogate pairs to cover the extras.

[...]
> When locale to linux system is set to utf-8 that would mean that the
> linux applications, should try to encode string into hdd by using
> system's default encoding to utf-8 and read them back from bytes by
> also using utf-8. Is that correct?

Yes.

--
Steven
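To make the UTF-8 / UTF-16 / UTF-32 comparison and the surrogate-pair mechanism concrete, here is a small sketch assuming Python 3.3 or later. The sample code point U+1F600 is an arbitrary choice of a character above 0xFFFF, and the arithmetic follows the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```python
ch = "\U0001F600"  # any code point above 0xFFFF needs a surrogate pair

utf8 = ch.encode("utf-8")        # four 8-bit units
utf16 = ch.encode("utf-16-be")   # two 16-bit units (a surrogate pair)
utf32 = ch.encode("utf-32-be")   # one 32-bit unit
print(len(utf8), len(utf16), len(utf32))

# Build the surrogate pair by hand and compare with Python's encoder.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
manual_utf16 = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(hex(high), hex(low), manual_utf16 == utf16)

# Half of a surrogate pair on its own is not a valid character, so it
# cannot be encoded to UTF-8.
try:
    "\ud83d".encode("utf-8")
    lone_surrogate_ok = True
except UnicodeEncodeError:
    lone_surrogate_ok = False

# The two bytes 0x4C 0xFA from the example above are not valid UTF-8:
# 0x4C stands on its own ('L'), but 0xFA is not allowed to start a sequence.
try:
    b"\x4c\xfa".decode("utf-8")
    ambiguous_bytes_ok = True
except UnicodeDecodeError:
    ambiguous_bytes_ok = False

print(lone_surrogate_ok, ambiguous_bytes_ok)
```

In this case all three encodings happen to need four bytes; the difference is in how those bytes are grouped into units of 8, 16 or 32 bits.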
| From | Cameron Simpson <cs@zip.com.au> |
|---|---|
| Date | 2013-06-09 18:02 +1000 |
| Message-ID | <mailman.2909.1370765374.3114.python-list@python.org> |
| In reply to | #47427 |
On 09Jun2013 06:25, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
| [... heaps of useful explanation ...]
| > When locale to linux system is set to utf-8 that would mean that the
| > linux applications, should try to encode string into hdd by using
| > system's default encoding to utf-8 and read them back from bytes by
| > also using utf-8. Is that correct?
|
| Yes.
Although I'd point out that only applications that care about text
as _text_ need to consider Unicode and the encoding. A command like
"mv" does not care. You type the command and "mv" receives byte
strings as its arguments. So it is doing straightforward "bytes"
file renames. It does not care or even know about encodings.
In this scenario, really it is the Terminal program (eg Putty) which
cares about text (what you type, and what gets displayed). It is
because of mismatches between your Terminal locale settings and the
encoding that was chosen for the filenames that you get garbage
listings, one way or another.
Cheers,
--
Cameron Simpson <cs@zip.com.au>
But then, I'm only 50. Things may well get a bit much for me when I
reach the gasping heights of senile decrepitude of which old Andy
Woodward speaks with such feeling.
- Chris Malcolm, cam@uk.ac.ed.aifh, DoD #205
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-09 02:03 -0700 |
| Message-ID | <40931e6b-11dd-4f97-bb1f-44c9b002d98f@googlegroups.com> |
| In reply to | #47430 |
On Sunday, 9 June 2013 11:02:48 AM UTC+3, Cameron Simpson wrote:

> In this scenario, really it is the Terminal program (eg Putty) which
> cares about text (what you type, and what gets displayed). It is
> because of mismatches between your Terminal locale settings and the
> encoding that was chosen for the filenames that you get garbage
> listings, one way or another.

Can you please give an example that shows a string being greek-iso encoded and then being utf8 decoded and presented back:

1. properly
2. as garbage (I know it means trash, but I don't know what a garbage character is)
3. as an error
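The three outcomes asked about here can be reproduced in a few lines. This is a minimal sketch, assuming Python 3; the sample word is arbitrary and the variable names are illustrative. The "garbage" case is shown two ways: decoding with the wrong single-byte charset, and decoding as UTF-8 with `errors="replace"`, which substitutes U+FFFD for every undecodable byte.

```python
text = "Ευχή"
greek_bytes = text.encode("iso-8859-7")   # four bytes, all above 127

# 1. Properly: decoding with the encoding that was used for writing.
proper = greek_bytes.decode("iso-8859-7")

# Pure-ASCII text also survives a mismatch, because ISO-8859-7 and UTF-8
# agree on every value below 128.
ascii_roundtrip = "abc".encode("iso-8859-7").decode("utf-8")

# 2. Garbage: a wrong single-byte charset never fails, since every byte
# maps to *some* character, but the characters are the wrong ones.
garbage = greek_bytes.decode("latin-1")

# Garbage again: lenient UTF-8 decoding swaps in replacement characters.
replaced = greek_bytes.decode("utf-8", errors="replace")

# 3. Error: strict UTF-8 decoding rejects these bytes outright.
try:
    greek_bytes.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

print("proper:  ", proper)
print("garbage: ", garbage)
print("replaced:", replaced)
print("error:   ", utf8_error)
```

A terminal whose locale does not match the encoding of the file names is doing the equivalent of case 2: it shows whatever characters its own charset assigns to those bytes.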
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-08 14:21 -0700 |
| Message-ID | <ce3bc9ae-f0eb-4dd2-ae73-75533e03921a@googlegroups.com> |
| In reply to | #47400 |
Sorry for displaying my code so many times. I know I have exhausted you, but this is the last thing I am going to ask of you in this thread. We are very close to having this working.
#========================================================
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )
for filename in files:
    # Compute 'path/to/filename'
    filepath_bytes = path + filename
    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            filepath = filepath_bytes.decode( encoding )
        except UnicodeDecodeError:
            continue
        # Rename to something valid in UTF-8
        if encoding != 'utf-8':
            os.rename( filepath_bytes, filepath.encode('utf-8') )
        assert os.path.exists( filepath )
        break
    else:
        # This only runs if we never reached the break
        raise ValueError( 'unable to clean filename %r' % filepath_bytes )
#========================================================
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()
        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            print( "iam here", filename + '\n' )
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )
#========================================================
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = ()
# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
    filepaths.add( filename )
# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()
# Check database's filenames against path's filenames
for rec in data:
    if rec not in filepaths:
        cur.execute('''DELETE FROM files WHERE url = %s''', rec )
=================================================
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] Original exception was:
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] File "/home/nikos/public_html/cgi-bin/files.py", line 78, in <module>
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] assert os.path.exists( filepath )
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] os.stat(path)
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-37: ordinal not in range(128)
==================
The assert is there to make sure that the path/to/file exists after the rename, but why are we still getting those UnicodeEncodeErrors?
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-06-09 08:10 +1000 |
| Message-ID | <mailman.2899.1370729423.3114.python-list@python.org> |
| In reply to | #47408 |
On Sun, Jun 9, 2013 at 7:21 AM, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
> Sorry for displaying my code so many times. I know I have exhausted you, but this is the last thing I am going to ask of you in this thread. We are very close to having this working.

You need to spend more time reading and less time frantically jumping around. Go read my post on Unicode; it answers several of the questions you posted in response to Steven's.

And please, don't use this list as your substitute for source control. Don't keep posting your code. Most of us are ignoring it already.

ChrisA
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-09 01:11 -0700 |
| Message-ID | <f438a4fd-d6a9-4e33-8b3d-abb28b306064@googlegroups.com> |
| In reply to | #47400 |
I'm sorry, I posted unnecessary code by mistake. Here is the correct one that produced the above error:
#========================================================
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )
for filename in files:
    # Compute 'path/to/filename'
    filepath_bytes = path + filename
    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            filepath = filepath_bytes.decode( encoding )
        except UnicodeDecodeError:
            continue
        # Rename to something valid in UTF-8
        if encoding != 'utf-8':
            os.rename( filepath_bytes, filepath.encode('utf-8') )
        assert os.path.exists( filepath )
        break
    else:
        # This only runs if we never reached the break
        raise ValueError( 'unable to clean filename %r' % filepath_bytes )
#========================================================
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()
        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            print( "iam here", filename + '\n' )
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )
#========================================================
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = set()
# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
    filepaths.add( filename )
# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()
# Check database's filenames against path's filenames
for rec in data:
    if rec not in filepaths:
        cur.execute('''DELETE FROM files WHERE url = %s''', rec )
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-06-09 04:47 +1000 |
| Message-ID | <mailman.2894.1370719010.3114.python-list@python.org> |
| In reply to | #47326 |
On Sun, Jun 9, 2013 at 4:01 AM, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
> Hold on!
>
> In the beginning there was ASCII with 0-127 values and then there was
> Unicode with 0-127 of ASCII's + i dont know how much many more?
>
> Now ASCIII needs 1 byte to store a single character while Unicode needs 2
> bytes to store a character and that is because it has > 256 characters to
> store > 2^8bits ?
>
> Is this correct?

No. Let me start from the beginning.

Computers don't work with characters, or strings, natively. They work with numbers. To be specific, they work with bits; and it's only by convention that we can work with anything larger. For instance, there's a VERY common convention around the PC world that a set of bits can be interpreted as a signed integer; if the highest bit is set, it's negative. There are also standards for floating-point (IEEE 754), and so on.

ASCII is a character set. It defines a mapping of numbers to characters - for instance, @ is 64, SOH is 1, $ is 36, etcetera, etcetera. There are 128 such mappings. Since they all fit inside a 7-bit number, there's a trivial way to represent ASCII characters in a PC's 8-bit byte: you just leave the high bit clear and use the other seven.

There have been various schemes for using the eighth bit - serial ports with parity, WordStar (I think) marking the ends of words, and most notably, Extended ASCII schemes that give you another whole set of 128 characters. And that was the beginning of Code Pages, because nobody could agree on what those extra 128 should be. Norwegians used Norwegian, the Greeks were taught their Greek, Arabians created themselves an Arabian codepage with the speed of summer lightning, and Hebrews allocated from 255 down to 128, which is absolutely frightening. But I digress.

There were a variety of multi-byte schemes devised at various times, but we'll ignore all of them and jump straight to Unicode.
With Unicode, there's (theoretically) no need to use any other system ever again, because whatever character you want, it'll exist in Unicode. In theory, of course; there are debates over that.

Now, Unicode currently has defined an "address space" of roughly 20 bits, and in a throwback to the first programming I ever did, it's a segmented system: sixteen or seventeen planes of 65,536 characters each. (Fortunately the planes are identified by low numbers, not high numbers, and there's no stupidity of overlapping planes the way the 8086 did with memory!) The highest planes are special (plane 14 has a few special-purpose characters, planes 15 and 16 are for private use), and most of the middle ones have no characters assigned to them, so for the most part, you'll see characters from the first three planes.

So what do we now have? A mapping of characters to "code points", which are numbers. (I'm leaving aside the issues of combining characters and such for the moment.) But computers don't work with numbers, they work with bits. Somehow we have to store those bits in memory.

There are a good few ways to do that; one is to note that every Unicode character can be represented inside 32 bits, so we can use the standard integer scheme safely. (Since they fit inside 31 bits, we don't even need to care if it's signed or unsigned.) That's called UTF-32 or UCS-4, and it's a great way to handle the full Unicode range in a manner that makes a Texan look agoraphobic. Wide builds of Python up to 3.2 did this.

Or you can try to store them in 16-bit numbers, but then you have to worry about the ones that don't fit in 16 bits, because it's really hard to squeeze 20 bits of information into 16 bits of storage. UTF-16 is one way to do this; special numbers mean "grab another number". It has its issues, but is (in my opinion, unfortunately) fairly prevalent. Narrow builds of Python up to 3.2 did this.
Finally, you can use a more complicated scheme that uses anywhere from 1 to 4 bytes for each character, by carefully encoding information into the top bit - if it's set, you have a multi-byte character. That's how UTF-8 works, and is probably the most prevalent disk/network encoding.

All of the UTF-X systems are called "UCS Transformation Formats" (UCS meaning Universal Character Set, roughly "Unicode"). They are mappings from Unicode numbers to bytes. Between Unicode and UTF-X, you have a mapping from character to byte sequence.

> Now UTF-8, latin-iso, greek-iso e.t.c are WAYS of storing characters into
> the hard drive?

The ISO standard 8859 specifies a number of ASCII-compatible encodings, referred to as ISO-8859-1 through ISO-8859-16. You've been working with ISO-8859-1, also called Latin-1, and ISO-8859-7, which has your Greek characters in it. These are all ways of translating characters into numbers; and since they all fit within 8 bits, they're most commonly represented on PCs with single bytes.

> So taken form above example(the closest i could think of), the way i
> understand them is:
>
> A 'string' can be of (unicode's or ascii's) type and that type needs a way
> (thats a charset) to store this string into the hdd as a sequense of bytes?

A Python 3 'string' is always a series of Unicode characters. How they're represented in memory doesn't matter, but as of Python 3.3 that's a fairly compact and efficient system that can omit unnecessary zero bits.

To store that string on your hard disk, send it across a network, or transmit it to another process, you need to encode it as bytes, somehow. The UCS Transformation Formats are specifically designed for this, and most of the time, UTF-8 is going to be the best option. It's compact, it's well known, and usually, it'll do everything you want. The only thing it won't do is let you quickly locate the Nth character, which is why it makes a poor in-memory format.
Fortunately, Python lets us hide away pretty much all those details, just as it lets us hide away the details of what makes up a list, a dictionary, or an integer. You can safely assume that the string "foo" is a string of three characters, which you can work with as characters. The chr() and ord() functions let you switch between characters and numbers, and str.encode() and bytes.decode() let you switch between characters and byte sequences. Once you get your head around the differences between those three, it all works fairly neatly.

Chris Angelico
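The three layers described above (character, code point, byte sequence) can be sketched in a few lines. This is an illustration only, assuming Python 3; the helper `byte_lengths` and the choice of sample characters are made up for the example.

```python
import unicodedata


def byte_lengths(ch):
    """Return how many bytes one character needs in each UTF encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}


# Character <-> number: ord() and chr() are inverses of each other.
nu = "\N{GREEK SMALL LETTER NU}"
code_point = ord(nu)         # character -> code point
same_char = chr(code_point)  # code point -> character

# Character <-> bytes: the same character, different byte sequences.
single_byte = nu.encode("iso-8859-7")  # one byte in the Greek code page
two_bytes = nu.encode("utf-8")         # two bytes in UTF-8

# One sample from each UTF-8 length class (1, 2, 3 and 4 bytes).
for name in ("LATIN CAPITAL LETTER A", "GREEK SMALL LETTER NU",
             "EURO SIGN", "MUSICAL SYMBOL G CLEF"):
    ch = unicodedata.lookup(name)
    print("U+%04X" % ord(ch), name, byte_lengths(ch))
```

The last sample lies outside the first plane, which is why UTF-16 needs two 16-bit units (four bytes) for it.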
| From | nagia.retsina@gmail.com |
|---|---|
| Date | 2013-06-08 22:09 -0700 |
| Message-ID | <3fbb5d0e-51fb-4aed-b829-8388304a9885@googlegroups.com> |
| In reply to | #47403 |
On Saturday, 8 June 2013 at 9:47:53 PM UTC+3, Chris Angelico wrote:
> Fortunately, Python lets us hide away pretty much all those details,
> just as it lets us hide away the details of what makes up a list, a
> dictionary, or an integer. You can safely assume that the string "foo"
> is a string of three characters, which you can work with as
> characters. The chr() and ord() functions let you switch between
> characters and numbers, and str.encode() and bytes.decode() let you
> switch between characters and byte sequences. Once you get your head
> around the differences between those three, it all works fairly
> neatly.
I'm trying to!
So,
chr('A') would give me the mapping of this char, the number 65 while
ord(65) would output the char 'A' likewise.
> and str.encode() and bytes.decode() let you switch between characters and byte
> sequences. Once
What would happen if we try to re-encode bytes on the disk?
Like trying:
s = "νίκος"
utf8_bytes = s.encode('utf-8')
greek_bytes = utf8_bytes.encode('iso-8859-7')
Can we re-encode twice, or as many times as we want, and then decode back in the same order, like this?
utf8_bytes = greek_bytes.decode('iso-8859-7')
s = utf8_bytes.decode('utf-8')
Is something like that totally crazy?
And also, is there a difference between "encoding" and "compressing"?
Isn't the latter using some form of encoding to take a string or bytes and make it hold less space on disk?
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-09 06:45 +0000 |
| Message-ID | <51b4249d$0$30001$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #47423 |
On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:
> chr('A') would give me the mapping of this char, the number 65 while
> ord(65) would output the char 'A' likewise.
Almost: the two names are swapped. ord('A') returns 65 and chr(65)
returns 'A'. Python uses Unicode, where code-point 65 ("ordinal value 65")
means letter "A".
There are older encodings. For example, a very old one, used on IBM
mainframes, is EBCDIC, where ordinal value 65 means the letter "â", and
the letter "A" has ordinal value 193.
> What would happen if we try to re-encode bytes on the disk? Like
> trying:
>
> s = "νίκος"
> utf8_bytes = s.encode('utf-8')
> greek_bytes = utf8_bytes.encode('iso-8859-7')
>
> Can we re-encode twice, or as many times as we want, and then decode back
> in the same order, like this?
Of course. Bytes have no memory of where they came from, or what they are
used for. All you are doing is flipping bits on a memory chip, or on a
hard drive. So long as *you* remember which encoding is the right one,
there is no problem. If you forget, and start using the wrong one, you
will get garbage characters, mojibake, or errors.
[...]
> And also, is there a difference between "encoding" and "compressing"?
Of course. They are totally unrelated.
> Isn't the latter using some form of encoding to take a string or bytes
> and make it hold less space on disk?
Correct, except forget about "encoding". It's not relevant (except,
maybe, in a mathematical sense) and will just confuse you.
--
Steven
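A minimal sketch of the round trip being asked about, assuming Python 3, where bytes objects have decode() but no encode(), so changing from one encoding to another means going back to text first. It also shows ord() and chr() in the direction they actually work, and that compression operates on bytes without caring which encoding produced them. zlib is used purely as an illustration.

```python
import zlib

# ord() maps a character to its code point; chr() maps it back.
letter_number = ord("A")
number_letter = chr(65)

s = "νίκος"
utf8_bytes = s.encode("utf-8")

# Changing encoding: bytes -> text -> bytes. There is no direct
# bytes-to-bytes "re-encode" step.
greek_bytes = utf8_bytes.decode("utf-8").encode("iso-8859-7")

# Going back only works if we remember which encoding was used.
restored = greek_bytes.decode("iso-8859-7")

# Compression is a separate layer: bytes in, fewer bytes out, and the
# original bytes come back unchanged after decompression.
payload = utf8_bytes * 100
packed = zlib.compress(payload)
unpacked = zlib.decompress(packed)

print(letter_number, number_letter)
print(len(utf8_bytes), len(greek_bytes))
print(len(payload), len(packed))
```

The same text takes ten bytes as UTF-8 and five as ISO-8859-7, which is a property of the two encodings, not compression.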
| From | nagia.retsina@gmail.com |
|---|---|
| Date | 2013-06-09 00:00 -0700 |
| Message-ID | <19e762c7-a356-4ee1-9f50-82a128b5ac06@googlegroups.com> |
| In reply to | #47428 |
Thanks Steven, I'll read them in a bit. Once I have, can you perhaps tell me what's wrong and why I am still getting decode issues?
[CODE]
# =================================================================================================================
# If user downloaded a file, thank the user !!!
# =================================================================================================================
if filename:
    #update file counter if cookie does not exist
    if not nikos:
        cur.execute('''UPDATE files SET hits = hits + 1, host = %s, lastvisit = %s WHERE url = %s''', (host, lastvisit, filename) )
    print('''<h2><font color=blue>Το αρχείο <font color=red> %s <font color=blue>κατεβαίνει!''' % filename )
    print('''<br><img src="/data/images/thanks.gif">''')
    print('''<br><br><br><h3><font color=blue>Και τώρα Tetris μέχρι να ολοκληρωθεί :-)''' )
    print('''<br><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://fpdownload.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,0,0" width="450" height="300""><param name="menu" value="false" /><param name="movie" value="http://www.fugly.com/f/1e6d8cd7b905f4e1bf72" /><param name="quality" value="high" /><embed src="http://www.fugly.com/f/1e6d8cd7b905f4e1bf72" AllowScriptAccess="always" menu="false" quality="high" width="450" height="300" name="FuglyGame" align="middle" type="application/x-shockwave-flash" pluginspage="http://www.macromedia.com/go/getflashplayer" /></object>''')
    print( '''<meta http-equiv="REFRESH" content="2;/data/apps/%s">''' % filename )
    sys.exit(0)
# =================================================================================================================
# Display download button for each file and download it on click
# =================================================================================================================
print('''<body background='/data/images/star.jpg'>
<center><img src='/data/images/download.gif'><br><br>
<table border=5 cellpadding=5 bgcolor=green>
''')
#========================================================
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )
for filename in files:
    # Compute 'path/to/filename'
    filepath_bytes = path + filename
    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            filepath = filepath_bytes.decode( encoding )
        except UnicodeDecodeError:
            continue
        # Rename to something valid in UTF-8
        if encoding != 'utf-8':
            os.rename( filepath_bytes, filepath.encode('utf-8') )
        assert os.path.exists( filepath )
        break
    else:
        # This only runs if we never reached the break
        raise ValueError( 'unable to clean filename %r' % filepath_bytes )
#========================================================
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()
        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            print( "iam here", filename + '\n' )
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )
#========================================================
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = set()
# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
    filepaths.add( filename )
# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()
# Check database's filenames against path's filenames
for rec in data:
    if rec not in filepaths:
        cur.execute('''DELETE FROM files WHERE url = %s''', rec )
[/CODE]
When I try to run it, it still errors out:
[CODE]
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] Original exception was:, referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] Traceback (most recent call last):, referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] File "/home/nikos/public_html/cgi-bin/files.py", line 83, in <module>, referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] assert os.path.exists( filepath ), referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists, referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] os.stat(path), referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-37: ordinal not in range(128), refere
[/CODE]
Why am I still receiving Unicode decode errors?
With your help we have written a procedure just to avoid this kind of decoding issue and to rename all Greek-encoded filenames to UTF-8-encoded ones.
Is it the assert that fails? Do we have a logic error somewhere that I don't see?
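The traceback above shows the exception being raised inside os.stat(path), with the 'ascii' codec named in the message. One reading, offered as an assumption rather than a confirmed diagnosis, is that the CGI process runs with an ASCII filesystem encoding, so converting the str path to bytes fails before the file is even looked up. A minimal sketch of an existence check that avoids that conversion by keeping the path as bytes; the temporary directory and the file name are hypothetical stand-ins for the real data/apps directory.

```python
import os
import shutil
import sys
import tempfile

# The encoding Python uses to turn str paths into bytes for the OS.
print("filesystem encoding:", sys.getfilesystemencoding())

# Hypothetical stand-in for /home/nikos/public_html/data/apps/
directory = os.fsencode(tempfile.mkdtemp())

try:
    # A file whose name is Greek text stored as UTF-8 bytes.
    name_bytes = "αβγδε.txt".encode("utf-8")
    path_bytes = os.path.join(directory, name_bytes)
    with open(path_bytes, "wb"):
        pass

    # Passing bytes hands the name to the OS untouched, so the check
    # does not depend on the process's filesystem encoding.
    found = os.path.exists(path_bytes)

    # bytes in, bytes out: listdir() returns bytes names for a bytes path.
    listed = os.listdir(directory)
finally:
    # Remove the throwaway directory again.
    shutil.rmtree(directory)

print(found, listed)
```

Applied to the loop in the post, the same idea would read `assert os.path.exists( filepath.encode('utf-8') )`, which tests exactly the name that os.rename just wrote.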
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-09 08:15 +0000 |
| Message-ID | <51b4398a$0$30001$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #47429 |
On Sun, 09 Jun 2013 00:00:53 -0700, nagia.retsina wrote:
> path = b'/home/nikos/public_html/data/apps/'
> files = os.listdir( path )
>
> for filename in files:
>     # Compute 'path/to/filename'
>     filepath_bytes = path + filename
>     for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
>         try:
>             filepath = filepath_bytes.decode( encoding )
>         except UnicodeDecodeError:
>             continue
>
>         # Rename to something valid in UTF-8
>         if encoding != 'utf-8':
>             os.rename( filepath_bytes,
>                        filepath.encode('utf-8') )
>         assert os.path.exists( filepath )
>         break
>     else:
>         # This only runs if we never reached the break
>         raise ValueError(
>             'unable to clean filename %r' % filepath_bytes )
Editing the traceback to get rid of unnecessary noise from the logging:
Traceback (most recent call last):
  File "/home/nikos/public_html/cgi-bin/files.py", line 83, in <module>
    assert os.path.exists( filepath )
  File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
    os.stat(path)
UnicodeEncodeError: 'ascii' codec can't encode characters in position
34-37: ordinal not in range(128)
> Why am I still receiving Unicode decode errors? With your help we have
> written a procedure just to avoid this kind of decoding issue and to
> rename all Greek-encoded filenames to UTF-8-encoded ones.
That's a very good question. It works for me when I test it, so I cannot
explain why it fails for you.
Please try this: log into the Linux server, and then start up a Python
interactive session by entering:
python3.3
at the $ prompt. Then, at the >>> prompt, enter these lines of code. You
can copy and paste them:
import os, sys
print(sys.version)
s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
'\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
'\N{GREEK SMALL LETTER EPSILON}')
print(s)
filename = '/tmp/' + s
open(filename, 'w')
os.path.exists(filename)
Copy and paste the results back here please.
> Is it the assert that fail? Do we have some logic error someplace i dont
> see?
Please read the error message. Does it say AssertionError?
If it says AssertionError, then the assert has failed. If it says
something else, the code failed before the assert can run.
--
Steven
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-09 02:14 -0700 |
| Message-ID | <0a22570a-6bf6-4115-a7a8-a1684680702e@googlegroups.com> |
| In reply to | #47432 |
On Sunday, 9 June 2013 at 11:15:07 AM UTC+3, Steven D'Aprano wrote:
> Please try this: log into the Linux server, and then start up a Python
> import os, sys
> print(sys.version)
> s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
> '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
> '\N{GREEK SMALL LETTER EPSILON}')
> print(s)
> filename = '/tmp/' + s
> open(filename, 'w')
> os.path.exists(filename)
> Copy and paste the results back here please.
Of course: here it is:
root@nikos [/home/nikos/www/cgi-bin]# python
Python 3.3.2 (default, Jun 3 2013, 16:18:05)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> print(sys.version)
3.3.2 (default, Jun 3 2013, 16:18:05)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)]
>>> s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
... '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
... '\N{GREEK SMALL LETTER EPSILON}')
>>> print(s)
αβγδε
>>> filename = '/tmp/' + s
>>> open(filename, 'w')
<_io.TextIOWrapper name='/tmp/αβγδε' mode='w' encoding='UTF-8'>
>>> os.path.exists(filename)
True
>>>
| From | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| Date | 2013-06-09 03:32 -0700 |
| Message-ID | <3b2647bb-4b5a-4391-9ff4-6b5e755d9770@googlegroups.com> |
| In reply to | #47438 |
On Sunday, 9 June 2013 at 12:14:12 PM UTC+3, Νικόλαος Κούρας wrote:
> On Sunday, 9 June 2013 at 11:15:07 AM UTC+3, Steven D'Aprano wrote:
> > Please try this: log into the Linux server, and then start up a Python
> > import os, sys
> > print(sys.version)
> > s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
> > '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
> > '\N{GREEK SMALL LETTER EPSILON}')
> > print(s)
> > filename = '/tmp/' + s
> > open(filename, 'w')
> > os.path.exists(filename)
> > Copy and paste the results back here please.
> Of course: here it is:
> root@nikos [/home/nikos/www/cgi-bin]# python
> Python 3.3.2 (default, Jun 3 2013, 16:18:05)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os, sys
> >>> print(sys.version)
> 3.3.2 (default, Jun 3 2013, 16:18:05)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)]
> >>> s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
> ... '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
> ... '\N{GREEK SMALL LETTER EPSILON}')
> >>> print(s)
> αβγδε
> >>> filename = '/tmp/' + s
> >>> open(filename, 'w')
> <_io.TextIOWrapper name='/tmp/αβγδε' mode='w' encoding='UTF-8'>
> >>> os.path.exists(filename)
> True
> >>>
I don't know much, but it looks correct to me. Then again, why this error?
| From | Cameron Simpson <cs@zip.com.au> |
|---|---|
| Date | 2013-06-09 19:16 +1000 |
| Message-ID | <mailman.2912.1370769399.3114.python-list@python.org> |
| In reply to | #47432 |
On 09Jun2013 08:15, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
| On Sun, 09 Jun 2013 00:00:53 -0700, nagia.retsina wrote:
| > path = b'/home/nikos/public_html/data/apps/'
| > files = os.listdir( path )
| >
| > for filename in files:
| >     # Compute 'path/to/filename'
| >     filepath_bytes = path + filename
| >     for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
| >         try:
| >             filepath = filepath_bytes.decode( encoding )
| >         except UnicodeDecodeError:
| >             continue
| >
| >         # Rename to something valid in UTF-8
| >         if encoding != 'utf-8':
| >             os.rename( filepath_bytes,
| >                        filepath.encode('utf-8') )
| >         assert os.path.exists( filepath )
| >         break
| >     else:
| >         # This only runs if we never reached the break
| >         raise ValueError(
| >             'unable to clean filename %r' % filepath_bytes )
|
| Editing the traceback to get rid of unnecessary noise from the logging:
|
| Traceback (most recent call last):
|   File "/home/nikos/public_html/cgi-bin/files.py", line 83, in <module>
|     assert os.path.exists( filepath )
|   File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
|     os.stat(path)
| UnicodeEncodeError: 'ascii' codec can't encode characters in position
| 34-37: ordinal not in range(128)
|
| > Why am I still receiving Unicode decode errors? With your help we have
| > written a procedure just to avoid this kind of decoding issue and to
| > rename all Greek-encoded filenames to UTF-8-encoded ones.
|
| That's a very good question. It works for me when I test it, so I cannot
| explain why it fails for you.
If he's lucky the UnicodeEncodeError occurred while trying to print
an error message, printing a greek Unicode string in the error with
ASCII as the output encoding (default when not a tty IIRC).
Cheers,
--
Cameron Simpson <cs@zip.com.au>
I generally avoid temptation unless I can't resist it. - Mae West
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-09 12:36 +0000 |
| Message-ID | <51b476e2$0$30001$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #47439 |
On Sun, 09 Jun 2013 19:16:06 +1000, Cameron Simpson wrote:
> If he's lucky the UnicodeEncodeError occurred while trying to print an
> error message,
That's not what happens at the interactive console:
py> assert os.path.exists('Ж1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError
> printing a greek Unicode string in the error with ASCII
> as the output encoding (default when not a tty IIRC).
An interesting thought. How would we test that?
--
Steven
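One way to test that, sketched under the assumption that the same interpreter can be run both from a shell and from the CGI script: print the encodings Python has picked up, then reproduce the message by encoding a path as ASCII by hand. The file name below is hypothetical (four Greek letters followed by ASCII); the directory is the one from the traceback.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python chose for output and for file names."""
    return {
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
    }


# Run this from the shell and from the CGI script, then compare.
print(report_encodings())

# Reproduce the error text by forcing an ASCII conversion of a str path.
prefix = "/home/nikos/public_html/data/apps/"
path = prefix + "αβγδ.exe"  # hypothetical file name

try:
    path.encode("ascii")
    message = None
    span = None
except UnicodeEncodeError as exc:
    message = str(exc)
    span = (exc.start, exc.end - 1)  # first and last offending index

print(len(prefix), span)
print(message)
```

The directory prefix is 34 characters long, so positions 34-37 are the first four characters of the file name. That matches the numbers in the logged traceback, which was raised from os.stat(path) rather than from a print call.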
| From | nagia.retsina@gmail.com |
|---|---|
| Date | 2013-06-09 10:25 -0700 |
| Message-ID | <2de6f168-5b93-4ee8-b9e3-44bd05158191@googlegroups.com> |
| In reply to | #47458 |
On Sunday, 9 June 2013 at 3:36:51 PM UTC+3, Steven D'Aprano wrote:
> > printing a greek Unicode string in the error with ASCII
> > as the output encoding (default when not a tty IIRC).
> An interesting thought. How would we test that?

Please elaborate on this for me. I didn't understand what you are trying to say, that is, your assumption about why I am still getting decode issues.