Groups > comp.lang.python > #72340 > unrolled thread
| Started by | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| First post | 2014-05-31 17:10 +0100 |
| Last post | 2014-06-03 14:22 -0400 |
| Articles | 20 on this page of 92 — 19 participants |
Python 3.2 has some deadly infection Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-05-31 17:10 +0100
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-05-31 22:55 +0300
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-01 02:26 +0000
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-01 12:43 +1000
Re: Python 3.2 has some deadly infection Tim Delaney <timothy.c.delaney@gmail.com> - 2014-06-02 08:54 +1000
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-02 01:14 +0000
Re: Python 3.2 has some deadly infection Tim Delaney <timothy.c.delaney@gmail.com> - 2014-06-02 12:23 +1000
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-01 19:46 -0700
Re: Python 3.2 has some deadly infection Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2014-06-02 07:45 +0000
Re: Python 3.2 has some deadly infection Tim Delaney <timothy.c.delaney@gmail.com> - 2014-06-02 19:02 +1000
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-02 19:14 +1000
Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-02 12:10 +0100
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-03 16:34 +0000
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-04 02:43 +1000
Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-02 17:34 -0400
Re: Python 3.2 has some deadly infection Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-06-03 17:16 +1200
Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-03 02:21 -0400
Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-03 15:18 +0100
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-04 13:08 +0000
Re: Python 3.2 has some deadly infection Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-06-05 14:01 +1200
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 10:16 +0300
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-05 17:30 +1000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 11:05 +0300
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-05 18:36 +1000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 12:53 +0300
Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-05 05:43 -0700
Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 14:50 -0400
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 23:21 +0300
Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 18:09 -0400
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 23:13 +0000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 02:30 +0300
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 09:39 +1000
Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 22:08 -0400
Re: Python 3.2 has some deadly infection Ethan Furman <ethan@stoneleaf.us> - 2014-06-05 20:47 -0700
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve@pearwood.info> - 2014-06-05 08:34 +0000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 12:41 +0300
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 06:37 -0700
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 17:45 +0300
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 15:33 +0000
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 02:12 +1000
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 09:54 -0700
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 03:36 +1000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 19:52 +0300
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 03:28 +1000
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 15:35 -0700
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 08:52 +1000
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 20:11 -0700
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 13:20 +1000
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 20:32 -0700
Re: Python 3.2 has some deadly infection Akira Li <4kir4.1i@gmail.com> - 2014-06-06 12:03 +0400
Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-05 16:37 +0100
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 16:16 +0000
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 01:50 +1000
Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-05 17:17 +0100
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 16:32 +0000
Re: Python 3.2 has some deadly infection Ethan Furman <ethan@stoneleaf.us> - 2014-06-06 07:40 -0700
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 03:14 +1000
Re: Python 3.2 has some deadly infection Ian Kelly <ian.g.kelly@gmail.com> - 2014-06-05 11:16 -0600
Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 14:11 -0400
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 21:30 +0300
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 23:02 +0000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 02:21 +0300
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-06 12:15 +0000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 16:00 +0300
Re: Python 3.2 has some deadly infection rurpy@yahoo.com - 2014-06-07 21:34 -0700
Re: Python 3.2 has some deadly infection Ethan Furman <ethan@stoneleaf.us> - 2014-06-06 06:24 -0700
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 17:10 +0300
Re: Python 3.2 has some deadly infection Michael Torrie <torriem@gmail.com> - 2014-06-06 09:02 -0600
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 18:32 +0300
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 01:50 +1000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 20:02 +0300
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-06 10:13 -0700
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 03:26 +1000
Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-06 11:03 -0700
Re: Python 3.2 has some deadly infection Denis McMahon <denismfmcmahon@gmail.com> - 2014-06-06 21:18 +0000
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 08:18 +1000
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-06 15:57 +0000
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-06 09:21 -0700
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 02:48 +1000
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-06 10:04 -0700
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 03:12 +1000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 20:11 +0300
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 03:16 +1000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 20:18 +0300
Re: Python 3.2 has some deadly infection Ned Batchelder <ned@nedbatchelder.com> - 2014-06-06 13:33 -0400
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 01:25 +1000
Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-06 08:44 -0700
Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-06 08:48 -0700
Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-06 12:56 +0100
Re: Python 3.2 has some deadly infection Akira Li <4kir4.1i@gmail.com> - 2014-06-05 06:49 +0400
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-04 00:25 +1000
Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-03 14:22 -0400
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-06-05 09:54 -0700 |
| Message-ID | <1dc666b6-1696-4662-8832-530a2b4f66a7@googlegroups.com> |
| In reply to | #72723 |
On Thursday, June 5, 2014 9:42:28 PM UTC+5:30, Chris Angelico wrote:
> On Fri, Jun 6, 2014 at 1:33 AM, Steven D'Aprano wrote:
> > In the Unix world, text formats and text
> > processing is much more common in user-space apps than binary processing.
> > Perhaps the definitive explanation and celebration of the Unix way is
> > Eric Raymond's "The Art Of Unix Programming":
> > http://www.catb.org/esr/writings/taoup/html/ch05s01.html
>
> Specifically, this from the opening paragraph:
> """
> Text streams are a valuable universal format because they're easy for
> human beings to read, write, and edit without specialized tools. These
> formats are (or can be designed to be) transparent.
> """

A fact that stops being true when you tie up text with encodings. For two reasons:

1. The function/pair encode/decode mapping between byte-string and text cannot be a bijection because the byte-string set is larger than the text set. This is the error that Armin was hit by

2. Since there is not one but a zillion encodings possible we are not talking of one (possibly universal) data structure but a zillion ones: "Text streams are a universal format" - which encoding-ed form of text??
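A minimal sketch of the two points above, assuming Python 3. The sample bytes and variable names are illustrative and do not come from the thread. It shows a byte string that strict UTF-8 decoding rejects, the `surrogateescape` error handler (PEP 383) that makes the round trip lossless anyway, and the same bytes read as different text under two encodings.

```python
# Latin-1 bytes for "café"; a lone 0xE9 is not valid UTF-8.
raw = b"caf\xe9"

# Point 1: strict decoding is not defined for every byte string.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate code point,
# so decoding followed by encoding returns the original bytes.
text = raw.decode("utf-8", errors="surrogateescape")
round_tripped = text.encode("utf-8", errors="surrogateescape")

# Point 2: the same bytes are different text under different encodings.
as_latin1 = raw.decode("latin-1")
as_cp1251 = raw.decode("cp1251")

print(strict_ok, round_tripped == raw, as_latin1, as_cp1251)
```

Whether the escaped form counts as "text" is exactly what the thread is arguing about; the sketch only shows that the bytes survive.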
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-06-06 03:36 +1000 |
| Message-ID | <mailman.10752.1401989777.18130.python-list@python.org> |
| In reply to | #72729 |
On Fri, Jun 6, 2014 at 2:54 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> On Thursday, June 5, 2014 9:42:28 PM UTC+5:30, Chris Angelico wrote:
>> On Fri, Jun 6, 2014 at 1:33 AM, Steven D'Aprano wrote:
>> > In the Unix world, text formats and text
>> > processing is much more common in user-space apps than binary processing.
>> > Perhaps the definitive explanation and celebration of the Unix way is
>> > Eric Raymond's "The Art Of Unix Programming":
>> > http://www.catb.org/esr/writings/taoup/html/ch05s01.html
>
>> Specifically, this from the opening paragraph:
>> """
>> Text streams are a valuable universal format because they're easy for
>> human beings to read, write, and edit without specialized tools. These
>> formats are (or can be designed to be) transparent.
>> """
>
> A fact that stops being true when you tie up text with encodings.
> For two reasons:
>
> 1. The function/pair encode/decode mapping between byte-string and text
> cannot be a bijection because the byte-string set is larger than the text
> set. This is the error that Armin was hit by
>
> 2. Since there is not one but a zillion encodings possible we are not
> talking of one (possibly universal) data structure but a zillion
> ones: "Text streams are a universal format" - which encoding-ed
> form of text??

As soon as you store or transmit ANY form of information, you need to worry about encodings. Ever heard of this thing called "network byte order"? It's part of taming the wilds of integer encodings.

The theory is that the LC environment variables will carry all that crucial out-of-band information about encodings, and while the practice isn't perfect, it does still mean that there is such a thing as a text stream.

ChrisA
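The "network byte order" remark can be illustrated with the standard `struct` module. This is a small sketch assuming Python 3; the value `0x01020304` is an arbitrary example. An integer has to be encoded before it can be stored or sent, and a reader that assumes the wrong byte order gets a different number.

```python
import struct

value = 0x01020304

big = struct.pack(">I", value)      # big-endian, 4-byte unsigned int
little = struct.pack("<I", value)   # little-endian
network = struct.pack("!I", value)  # "!" means network order, i.e. big-endian

# Decoding big-endian bytes as little-endian silently yields another integer.
misread = struct.unpack("<I", big)[0]

print(big, little, hex(misread))
```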
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2014-06-05 19:52 +0300 |
| Message-ID | <87ha3zti2h.fsf@elektro.pacujo.net> |
| In reply to | #72710 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info>:

> Nevertheless, there are important abstractions that are written on top
> of the bytes layer, and in the Unix and Linux world, the most
> important abstraction is *text*. In the Unix world, text formats and
> text processing is much more common in user-space apps than binary
> processing.

That linux text is not the same thing as Python's text. Conceptually, Python text is a sequence of 32-bit integers. Linux text is a sequence of 8-bit integers.

It is great that lots of computer-to-computer formats are encoded in ASCII (~ UTF-8). However, nowhere in linux is there a real abstraction layer that processes Python-esque text.

Case in point:

    $ env | grep UTF
    LANG=en_US.UTF-8
    $ od -c <<<"Hyvää yötä"  # "Good night" in Finnish
    0000000   H   y   v 303 244 303 244       y 303 266   t 303 244  \n
    0000017

The "od" utility is asked to display its input as characters. The locale info gives a hint that all text data is in UTF-8. Yet what comes out is bytes.

How about:

    $ wc -c <<<"Hyvää yötä"
    15
    $ tr 'ä' 'a' <<<"Hyvää yötä"
    Hyvaaaa ya�taa

Grep is smarter:

    $ grep v...y <<<"Hyvää yötä"
    Hyvää yötä

which is why you should always prefix "grep" with LC_ALL=C in your scripts (makes it far faster, too).

Marko
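The shell session above can be mirrored in Python 3 to show both views of the same data. This is a rough sketch: it assumes the phrase uses precomposed characters (written with escapes here to make that explicit), and it models `tr` as a purely byte-wise translation, which matches the output shown in the post but is an assumption about that particular `tr` implementation.

```python
text = "Hyv\u00e4\u00e4 y\u00f6t\u00e4"   # "Hyvää yötä", precomposed code points
data = text.encode("utf-8")

code_points = len(text)           # what Python's str counts
byte_count = len(data + b"\n")    # what `wc -c` counts, newline included

# The octal bytes that `od -c` printed for one "ä".
octal_a_umlaut = [format(b, "o") for b in "\u00e4".encode("utf-8")]

# Byte-wise translation: both bytes of "ä" (0xC3, 0xA4) become "a", which
# also damages "ö" (0xC3, 0xB6) because it shares the lead byte.
table = bytes.maketrans(b"\xc3\xa4", b"aa")
bytewise = data.translate(table)

# Character-wise replacement touches only the intended character.
charwise = text.replace("\u00e4", "a")

print(code_points, byte_count, octal_a_umlaut, bytewise, charwise)
```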
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-06-06 03:28 +1000 |
| Message-ID | <mailman.10751.1401989331.18130.python-list@python.org> |
| In reply to | #72728 |
On Fri, Jun 6, 2014 at 2:52 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> That linux text is not the same thing as Python's text. Conceptually,
> Python text is a sequence of 32-bit integers. Linux text is a sequence
> of 8-bit integers.

Point of terminology: Linux is the kernel, everything you say below here is talking about particular programs. From what I understand, bash (just another Unix program) treats strings as sequences of codepoints, just as Python does; though its string manipulation is not nearly as rich as Python's, so it's harder to prove. Python is itself a Unix program, so you can do the exact same proofs and demonstrate that Linux is clearly Unicode-aware. It's not Linux you're testing.

ChrisA
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-06-05 15:35 -0700 |
| Message-ID | <4256f797-d70e-4c0d-ba97-00cdffddc082@googlegroups.com> |
| In reply to | #72737 |
On Thursday, June 5, 2014 10:58:43 PM UTC+5:30, Chris Angelico wrote:
> On Fri, Jun 6, 2014 at 2:52 AM, Marko Rauhamaa wrote:
> > That linux text is not the same thing as Python's text. Conceptually,
> > Python text is a sequence of 32-bit integers. Linux text is a sequence
> > of 8-bit integers.
>
> Point of terminology: Linux is the kernel, everything you say below
> here is talking about particular programs.

If it helps try the following substitution:

s/Linux/Pretty much all the distros that use Linux for their OS kernel/

BTW the only (other) guy I know who insistently makes that distinction is Richard Stallman.

Are you an emacs user by any chance <wink>?

> From what I understand,
> bash (just another Unix program) treats strings as sequences of
> codepoints, just as Python does; though its string manipulation is not
> nearly as rich as Python's, so it's harder to prove. Python is itself
> a Unix program, so you can do the exact same proofs and demonstrate
> that Linux is clearly Unicode-aware. It's not Linux you're testing.

In these 'other programs' is it permissible to include the kernel itself?

And then ask how Linux (in your and Stallman's sense) differs from Windows in how the filesystem handles things like filenames?
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-06-06 08:52 +1000 |
| Message-ID | <mailman.10780.1402008750.18130.python-list@python.org> |
| In reply to | #72779 |
On Fri, Jun 6, 2014 at 8:35 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> On Thursday, June 5, 2014 10:58:43 PM UTC+5:30, Chris Angelico wrote:
>> On Fri, Jun 6, 2014 at 2:52 AM, Marko Rauhamaa wrote:
>> > That linux text is not the same thing as Python's text. Conceptually,
>> > Python text is a sequence of 32-bit integers. Linux text is a sequence
>> > of 8-bit integers.
>
>> Point of terminology: Linux is the kernel, everything you say below
>> here is talking about particular programs.
>
> If it helps try the following substitution:
>
> s/Linux/Pretty much all the distros that use Linux for their OS kernel/

You could look at the Debian Project, which is a full environment with everything you're talking about. And everything you say would be equally true of Debian Linux and Debian kfreebsd. :)

> BTW the only (other) guy I know who insistently makes that distinction is
> Richard Stallman.
>
> Are you an emacs user by any chance <wink>?

Nope! Just a terminology nerd. :)

>> From what I understand,
>> bash (just another Unix program) treats strings as sequences of
>> codepoints, just as Python does; though its string manipulation is not
>> nearly as rich as Python's, so it's harder to prove. Python is itself
>> a Unix program, so you can do the exact same proofs and demonstrate
>> that Linux is clearly Unicode-aware. It's not Linux you're testing.
>
> In these 'other programs' is it permissible to include the kernel
> itself?
> And then ask how Linux (in your and Stallman's sense) differs from
> Windows in how the filesystem handles things like filenames?

What are you testing of the kernel? Most of the kernel doesn't actually work with text at all - it works with integers, buffers of memory (which could be seen as streams of bytes, but might be almost anything), process tables, open file handles... but not usually text. To you, "EAGAIN" might be a bit of text, but to the Linux kernel, it's an integer (11 decimal, if I recall correctly). Is that some fancy new form of encoding? :)

ChrisA
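The EAGAIN example can be checked from Python's `errno` module. A brief sketch assuming Python 3: the numeric value is platform-dependent (11 on Linux, as recalled in the post; other systems may differ), so the code prints it rather than hard-coding it.

```python
import errno
import os

code = errno.EAGAIN             # a plain integer as far as the kernel is concerned
name = errno.errorcode[code]    # symbolic name looked up in user space
message = os.strerror(code)     # human-readable text supplied by the C library

print(code, name, message)
```

On systems where EAGAIN and EWOULDBLOCK share one number, the name lookup may return either spelling.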
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-06-05 20:11 -0700 |
| Message-ID | <fba7b81b-41e8-4e6e-bac8-1613d47dac58@googlegroups.com> |
| In reply to | #72780 |
On Friday, June 6, 2014 4:22:22 AM UTC+5:30, Chris Angelico wrote:
> On Fri, Jun 6, 2014 at 8:35 AM, Rustom Mody wrote:
> > And then ask how Linux (in your and Stallman's sense) differs from
> > Windows in how the filesystem handles things like filenames?
>
> What are you testing of the kernel? Most of the kernel doesn't
> actually work with text at all - it works with integers, buffers of
> memory (which could be seen as streams of bytes, but might be almost
> anything), process tables, open file handles... but not usually text.
> To you, "EAGAIN" might be a bit of text, but to the Linux kernel, it's
> an integer (11 decimal, if I recall correctly). Is that some fancy new
> form of encoding? :)

| Thanks to the properties of UTF-8 encoding, the Linux kernel, the
| innermost and lowest-level part of the operating system, can
| handle Unicode filenames without even having the user tell it
| that UTF-8 is to be used. All character strings, including
| filenames, are treated by the kernel in such a way that THEY
| APPEAR TO IT ONLY AS STRINGS OF BYTES. Thus, it doesn't care and
| does not need to know whether a pair of consecutive bytes should
| logically be treated as two characters or a single one. The only
| risk of the kernel being fooled would be, for example, for a
| filename to contain a multibyte Unicode character encoded in such
| a way that one of the bytes used to represent it was a slash or
| some other character that has a special meaning in file
| names. Fortunately, as we noted, UTF-8 never uses ASCII
| characters for encoding multibyte characters, so neither the
| slash nor any other special character can appear as part of one
| and therefore there is no risk associated with using Unicode in
| filenames.
|
| Filesystems found on Microsoft Windows machines (NTFS and FAT)
| are different in that THEY STORE FILENAMES ON DISK IN SOME
| PARTICULAR ENCODING. The kernel must translate this encoding to
| the system encoding, which will be UTF-8 in our case.
|
| If you have Windows partitions on your system, you will have to
| take care that they are mounted with correct options. For FAT and
| ISO9660 (used by CD-ROMs) partitions, option utf8 makes the
| system translate the filesystem's character encoding to
| UTF-8. For NTFS, nls=utf8 is the recommended option (utf8 should
| also work).

[Emphases mine]

From: http://michal.kosmulski.org/computing/articles/linux-unicode.html
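Two claims in the quoted article can be checked from Python 3. This is a sketch under stated assumptions: the helper name and the sample filename are made up for illustration, and the `os.fsdecode`/`os.fsencode` round trip is assumed to run on a POSIX system, where those functions use the `surrogateescape` handler. First, no byte of a multi-byte UTF-8 sequence falls in the ASCII range, so a slash can never hide inside one. Second, a filename that is not valid UTF-8 still survives a trip through `str`.

```python
import os

def utf8_multibyte_is_ascii_free(code_point):
    """True if every UTF-8 byte of this non-ASCII code point is >= 0x80."""
    encoded = chr(code_point).encode("utf-8")
    return all(byte >= 0x80 for byte in encoded)

# Check every non-ASCII code point, skipping surrogates (not encodable).
non_ascii = (cp for cp in range(0x80, 0x110000) if not 0xD800 <= cp <= 0xDFFF)
all_safe = all(utf8_multibyte_is_ascii_free(cp) for cp in non_ascii)

# A hypothetical filename containing a byte that is invalid in UTF-8.
raw_name = b"report-\xff.txt"
as_text = os.fsdecode(raw_name)   # bytes -> str, undecodable bytes escaped
back = os.fsencode(as_text)       # str -> bytes, escapes reversed

print(all_safe, back == raw_name)
```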
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-06-06 13:20 +1000 |
| Message-ID | <mailman.10803.1402024865.18130.python-list@python.org> |
| In reply to | #72812 |
On Fri, Jun 6, 2014 at 1:11 PM, Rustom Mody <rustompmody@gmail.com> wrote:
> All character strings, including
> | filenames, are treated by the kernel in such a way that THEY
> | APPEAR TO IT ONLY AS STRINGS OF BYTES.

Yep, the real issue here is file systems, not the kernel. But yes, this is one of the very few places where the kernel deals with a string - and because of the hairiness of having to handle myriad file systems in a single path (imagine multiple levels of remote mounts - I've had a case where I mount via sshfs a tree that includes a Samba mount point, and you can go a lot deeper than that), the only thing it can do is pass the bytes on unchanged. Which means, in reality, the kernel doesn't actually do *anything* with the string, it just passes it right along to the file system.

ChrisA
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-06-05 20:32 -0700 |
| Message-ID | <14ac0dcc-58da-49a5-8c93-97dbc14d7b31@googlegroups.com> |
| In reply to | #72814 |
On Friday, June 6, 2014 8:50:57 AM UTC+5:30, Chris Angelico wrote:
> kernel doesn't actually do *anything* with the string, it just passes
> it right along to the file system.

Which is what Marko (and others like Armin) are asking of python (treated as a processing 'kernel'):

"I know what I am doing with my bytes -- please channel/funnel them around as requested without being unnecessarily and unrequestedly 'intelligent'"
| From | Akira Li <4kir4.1i@gmail.com> |
|---|---|
| Date | 2014-06-06 12:03 +0400 |
| Message-ID | <mailman.10807.1402041845.18130.python-list@python.org> |
| In reply to | #72728 |
Marko Rauhamaa <marko@pacujo.net> writes:

> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>
>> Nevertheless, there are important abstractions that are written on top
>> of the bytes layer, and in the Unix and Linux world, the most
>> important abstraction is *text*. In the Unix world, text formats and
>> text processing is much more common in user-space apps than binary
>> processing.
>
> That linux text is not the same thing as Python's text. Conceptually,
> Python text is a sequence of 32-bit integers. Linux text is a sequence
> of 8-bit integers.

_Unicode string in Python is a sequence of Unicode codepoints_. It is correct that a 32-bit integer is enough to represent any Unicode codepoint: \u0000...\U0010FFFF

It says *nothing* about how Unicode strings are represented *internally* in Python. It may vary from version to version, build options and even may depend on the content of a string at runtime.

In the past, "narrow builds" might break the abstraction in some cases, which is why Linux distributions used wide python builds.

_Unicode codepoint is not a Python concept_. There is the Unicode standard http://unicode.org Though instead of following the web of self-referential definitions, I find it easier to learn from examples such as http://codepoints.net/U+0041 (A) or http://codepoints.net/U+1F3A7 (🎧)

_There is no such thing as 8-bit text_ http://www.joelonsoftware.com/articles/Unicode.html

If you insert a space after each byte (8-bit) in the input text then you may get garbage i.e., you can't assume that a character is a byte:

    $ echo "Hyvää yötä" | perl -pe's/.\K/ /g'
    H y v a � � � � y � � t � �

In general, you can't assume that a character is a Unicode codepoint:

    $ echo "Hyvää yötä" | perl -C -pe's/.\K/ /g'
    H y v a ̈ ä y ö t ä

The eXtended grapheme clusters (user-perceived characters) may be useful in this case:

    $ echo "Hyvää yötä" | perl -C -pe's/\X\K/ /g'
    H y v ä ä y ö t ä

\X pattern is supported by `regex` module in Python i.e., you can't even iterate over characters (as they are seen by a user) in Python using only stdlib.

\w+ pattern is also broken for Unicode text http://bugs.python.org/issue1693050 (it is fixed in the `regex` module) i.e., you can't select a word in Unicode text using only stdlib.

\X alone is not enough in some cases e.g., "“ch” may be considered a grapheme cluster in Slovak, for processes such as collation" [1] (sorting order). `PyICU` module might be useful here.

Knowing about Unicode normalization forms (NFC, NFKD, etc) http://unicode.org/reports/tr15/, Unicode text segmentation [1], and Unicode collation algorithm http://www.unicode.org/reports/tr10/ concepts is also useful if you want to work with text.

[1]: http://www.unicode.org/reports/tr29/

-- akira
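The gap between code points and user-perceived characters shows up with the standard library alone. The sketch below assumes Python 3. The function `naive_clusters` is a hypothetical helper that only attaches combining marks to the preceding base character; it is a crude approximation of the `\X` behaviour described above and not an implementation of the full segmentation rules in UAX #29.

```python
import unicodedata

composed = "\u00e4"      # "ä" as a single code point
decomposed = "a\u0308"   # "a" followed by COMBINING DIAERESIS

# Both render as the same character but differ as code point sequences.
same_without_normalizing = composed == decomposed
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_after_nfd = unicodedata.normalize("NFD", composed) == decomposed

def naive_clusters(s):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # non-zero combining class: attach to previous
        else:
            clusters.append(ch)
    return clusters

mixed = "Hyv" + decomposed + composed
print(len(mixed), naive_clusters(mixed))
```

For real text processing the third-party `regex` module or PyICU mentioned in the post would be the better tools; this only illustrates why `len()` and iteration over `str` count code points.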
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2014-06-05 16:37 +0100 |
| Message-ID | <mailman.10739.1401982699.18130.python-list@python.org> |
| In reply to | #72708 |
On 05/06/2014 15:45, Marko Rauhamaa wrote:
> Rustom Mody <rustompmody@gmail.com>:
>
>> What Marko is saying is that by imposing the structuring of unicode on
>> the outside (Unix) world of text=byte, significant power is lost.
>
> Mostly I'm saying Python3 will not be able to hide the fact that linux
> data consists of bytes. It shouldn't even try. The linux OS outside the
> Python process talks bytes, not strings.
>
> A different OS might have different assumptions.
>
>
> Marko
>

I think I'm in the unix camp as well. I just think that an extra assumption on input output isn't always helpful. In python 3 byte strings are second class which I think is wrong; apparently pressure from influential users is pushing to make byte strings more first class which is a good thing.

-- Robin Becker
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-06-05 16:16 +0000 |
| Message-ID | <539097e5$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #72712 |
On Thu, 05 Jun 2014 16:37:23 +0100, Robin Becker wrote:

> In python 3 byte strings
> are second class which I think is wrong

It certainly is wrong. bytes are just as much a first-class built-in type as list, int, float, bool, set, tuple and str.

There may be missing functionality (relatively easy to add new functionality), and even poor design choices (like the foolish decision to have bytes display as if they were ASCII-ish strings, a silly mistake that simply reinforces the myth that bytes and ASCII are synonymous). Python 3.4 and 3.5 are in the process of rectifying as many of these mistakes as possible, e.g. adding back % formatting. But a few mistakes in the design of bytes' API no more makes it "second-class" than the lack of dict.contains_value() method makes dict "second-class".

By all means ask for better bytes functionality. But don't libel Python by pretending that bytes is anything less than one of the most important and fundamental types in the language. bytes are so important that there are TWO implementations for them, a mutable and immutable version (bytearray and bytes), while text strings only have an immutable version.

-- Steven D'Aprano http://import-that.dreamwidth.org/
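The points about the two bytes types, the ASCII-looking display, and the return of `%` formatting can be seen in a short session. This sketch assumes Python 3.5 or later, since `%` formatting for bytes arrived in 3.5 via PEP 461; the sample values are arbitrary.

```python
data = bytes([72, 105, 0, 255])

# Printable bytes are displayed as ASCII characters, the rest as \x escapes.
shown = repr(data)

# bytes is immutable: item assignment raises TypeError.
try:
    data[0] = 0x68
    immutable = False
except TypeError:
    immutable = True

# bytearray is the mutable counterpart.
buf = bytearray(data)
buf[0] = 0x68   # replace "H" with "h"

# %-formatting on bytes (Python 3.5+).
header = b"%d bytes" % len(data)

print(shown, immutable, bytes(buf), header)
```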
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-06-06 01:50 +1000 |
| Message-ID | <mailman.10741.1401983426.18130.python-list@python.org> |
| In reply to | #72708 |
On Fri, Jun 6, 2014 at 1:37 AM, Robin Becker <robin@reportlab.com> wrote:
> I think I'm in the unix camp as well. I just think that an extra assumption
> on input output isn't always helpful. In python 3 byte strings are second
> class which I think is wrong; apparently pressure from influential users is
> pushing to make byte strings more first class which is a good thing.

I wouldn't say they're second-class; it's more that the bytes type was considered to be more like a list of ints than like a Unicode string, and now that there are a few years' worth of real-world usage information to learn from, it's known that some more string-like operations will be extremely helpful. So now they're being added, which I agree is a good thing.

Whether b"a"[0] should be b'a' or ord(b'a') is another sticking point. The Py2 str does the first, the Py3 bytes does the second. That one's a bit hard to change, but what I'm not sure of is how significant this is to new-build Py3 code. Obviously it's a barrier to porting, but is it important on its own? However, that's still not really "byte strings are second class".

ChrisA
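The indexing difference is easy to see in Python 3. In the short sketch below, `iter_single_bytes` is a hypothetical helper name for the common porting workaround of slicing instead of indexing; it is not something proposed in the thread.

```python
data = b"abc"

indexed = data[0]    # an int in Python 3 (a 1-character str in Python 2)
sliced = data[0:1]   # slicing keeps the bytes type
as_ints = list(data)

def iter_single_bytes(b):
    """Yield length-1 bytes objects, mimicking Python 2 str iteration."""
    for i in range(len(b)):
        yield b[i:i + 1]

pieces = list(iter_single_bytes(data))

# Going back from an int: wrap it in a list. bytes(97) would instead
# create 97 zero bytes.
rebuilt = bytes([indexed])

print(indexed, sliced, as_ints, pieces, rebuilt)
```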
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2014-06-05 17:17 +0100 |
| Message-ID | <mailman.10744.1401985039.18130.python-list@python.org> |
| In reply to | #72708 |
On 05/06/2014 16:50, Chris Angelico wrote:
..........
>
> I wouldn't say they're second-class; it's more that the bytes type was
> considered to be more like a list of ints than like a Unicode string,
> and now that there are a few years' worth of real-world usage
> information to learn from, it's known that some more string-like
> operations will be extremely helpful. So now they're being added,
> which I agree is a good thing.

In python 2, str and unicode were much more comparable. On balance I think just reversing them, i.e. str --> bytes and unicode --> str, was probably the right thing to do if the default conversions had been turned off. However, making bytes a crippled thing was wrong.

>
> Whether b"a"[0] should be b'a' or ord(b'a') is another sticking point.
> The Py2 str does the first, the Py3 bytes does the second. That one's
> a bit hard to change, but what I'm not sure of is how significant this
> is to new-build Py3 code. Obviously it's a barrier to porting, but is
> it important on its own? However, that's still not really "byte
> strings are second class".

......

I dislike the current model, but that's because I had a lot of stuff to convert and probably made a bunch of blunders. The reportlab code is now a mess of hacks to keep it alive for 2.7 & >=3.3; I'm probably never going to be convinced that unicode types are good. Bytes are the underlying concept and should have remained so for simplicity's sake.

-- Robin Becker
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-06-05 16:32 +0000 |
| Message-ID | <53909b96$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #72725 |
On Thu, 05 Jun 2014 17:17:05 +0100, Robin Becker wrote:

> Bytes are the underlying
> concept and should have remained so for simplicity's sake.

Bytes are the underlying concept for classes too. Do you think that an opaque unstructured blob of bytes is "simpler" to use than a class? How would an unstructured blob of bytes be simpler to use than an array of multi-byte characters?

Earlier:

> I dislike the current model, but that's because I had a lot of stuff to
> convert and probably made a bunch of blunders. The reportlab code is
> now a mess of hacks to keep it alive for 2.7 & >=3.3;

Although I've been critical of many of your statements, I am sympathetic to your pain. There's no doubt that the transition from the old, broken system of bytes masquerading as text can be hard, especially to those who never quite get past the misleading and false paradigm that "bytes are ASCII".

It may have been that there were better ways to have updated to 3.3; perhaps you were merely unfortunate to have updated too early, and had you waited to 3.4 or 3.5 things would have been better. I don't know.

But whatever the situation, and despite our differences of opinion about Unicode, THANK YOU for having updated ReportLabs to 3.3.

--
Steven D'Aprano
http://import-that.dreamwidth.org/
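To make the "bytes are ASCII" point concrete, here is a small sketch (not from the thread; the sample bytes are chosen for illustration) showing that the same two bytes become different text depending on which encoding is assumed:

```python
raw = b"\xc3\xa9"  # two bytes with no inherent textual meaning

# Interpreted as UTF-8, the pair is a single character.
as_utf8 = raw.decode("utf-8")

# Interpreted as Latin-1, each byte is its own character.
as_latin1 = raw.decode("latin-1")

# Interpreted as ASCII, the bytes are not valid text at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(repr(as_utf8), repr(as_latin1), ascii_ok)
```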
| From | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| Date | 2014-06-06 07:40 -0700 |
| Message-ID | <mailman.10817.1402066973.18130.python-list@python.org> |
| In reply to | #72726 |
On 06/05/2014 09:32 AM, Steven D'Aprano wrote:
>
> But whatever the situation, and despite our differences of opinion about
> Unicode, THANK YOU for having updated ReportLabs to 3.3.

+1000

--
~Ethan~
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-06-06 03:14 +1000 |
| Message-ID | <mailman.10746.1401988496.18130.python-list@python.org> |
| In reply to | #72708 |
On Fri, Jun 6, 2014 at 2:17 AM, Robin Becker <robin@reportlab.com> wrote:
> in python 2 str and unicode were much more comparable. On balance I think
> just reversing them ie str --> bytes and unicode --> str was probably the
> right thing to do if the default conversions had been turned off. However
> making bytes a crippled thing was wrong.

It's easy to build up functionality after the event. Maybe reportlab will have lots of hacks to support both 2.7 and 3.3, but in a few years you'll be able to say "supports 2.7 and 3.5" and take advantage of percent formatting and whatever else is added. But this is just the way that languages develop; you use them, you find what isn't easy, and you fix it. The nature of stability is that it takes time before you can depend on freshly-written functionality (contrast the extreme instability of running the version from source control - stuff might be fixed at any time, but you have to do all the work yourself to make sure your dependencies line up), but over time, you can depend on improvements making their way out there.

Can you point to specific areas in which the bytes type is "crippled"? Comparing either to the Py2 str or the Py3 str, or to anything else? The Python core devs are listening, as evidenced by PEP 461.

ChrisA
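For reference, a short sketch of the percent formatting that PEP 461 adds to bytes. It assumes Python 3.5 or later, where the PEP was implemented; on 3.3 and 3.4 the same expressions raise TypeError. The header and payload values are made up for illustration:

```python
length = 42
payload = b"abc"

# %d interpolates an integer as ASCII digits directly into the bytes.
header = b"Content-Length: %d\r\n" % length

# %s accepts a bytes-like object and inserts it unchanged.
framed = b"<%s>" % payload

print(header, framed)
```

The same two lines also run on Python 2.7, which is the porting convenience the post refers to.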
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2014-06-05 11:16 -0600 |
| Message-ID | <mailman.10750.1401988932.18130.python-list@python.org> |
| In reply to | #72708 |
On Thu, Jun 5, 2014 at 10:17 AM, Robin Becker <robin@reportlab.com> wrote:
> in python 2 str and unicode were much more comparable. On balance I think
> just reversing them ie str --> bytes and unicode --> str was probably the
> right thing to do if the default conversions had been turned off. However
> making bytes a crippled thing was wrong.

How should e.g. bytes.upper() be implemented then? The correct behavior is entirely dependent on the encoding. Python 2 just assumes ASCII, which at best will correctly upper-case some subset of the string and leave the rest unchanged, and at worst could corrupt the string entirely.

There are some things that were dropped that should not have been, but my impression is that those are being worked on, for example % formatting in PEP 461.
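A small sketch of the encoding problem with upper-casing bytes, assuming Python 3, where bytes.upper() is documented to convert only ASCII lowercase letters. The sample word is an illustration, not something from the thread:

```python
text = "caf\u00e9"              # "café"
utf8 = text.encode("utf-8")     # the accented letter becomes two bytes

# bytes.upper() touches only the ASCII letters; the two non-ASCII bytes
# pass through unchanged, so the accented letter stays lowercase.
upper_bytes = utf8.upper()

# str.upper() knows it is working on characters and handles all four.
upper_text = text.upper()

print(upper_bytes.decode("utf-8"), upper_text)
```

Getting the second result from the bytes would require knowing that they are UTF-8, which is exactly the information a bare bytes object does not carry.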
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2014-06-05 14:11 -0400 |
| Message-ID | <mailman.10755.1401991925.18130.python-list@python.org> |
| In reply to | #72708 |
On 6/5/2014 10:45 AM, Marko Rauhamaa wrote:

> Mostly I'm saying Python3 will not be able to hide the fact that linux
> data consists of bytes. It shouldn't even try. The linux OS outside the
> Python process talks bytes, not strings.

A text file is a binary file wrapped with a codec to translate to and from a universal text format on input and output. Much of the time, the wrapping is a great user convenience. Since the wrapping is optional, nothing is forced or really hidden.

> A different OS might have different assumptions.

Different OSes *do* have different assumptions. Both MacOSX and current Windows use (UCS-2 or) UTF-16 for text. It seems that unicode strings are better than ascii+??? strings as a universal basis for OS interfacing. For Windows, at least, the interface is much improved in Python 3.

I understand that some, but not all, Latin alphabet *nix programmers wish that Python 3 continued to be strongly in their favor. But they are a small minority of the world's programmers, and Python 3 is aimed at everyone on all systems.

--
Terry Jan Reedy
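The "binary file wrapped with a codec" description can be sketched with the standard io module. This is an illustration under stated assumptions: an in-memory io.BytesIO stands in for a real binary file, and UTF-8 is picked arbitrarily as the codec:

```python
import io

# The underlying binary stream only ever stores bytes.
raw = io.BytesIO()

# The text layer is an optional wrapper that encodes on the way in.
# newline="" turns off newline translation so the bytes are predictable.
writer = io.TextIOWrapper(raw, encoding="utf-8", newline="")
writer.write("na\u00efve\n")
writer.flush()

# Bypassing the wrapper shows what was actually stored.
stored = raw.getvalue()

# Wrapping the same bytes again decodes them on the way out.
reader = io.TextIOWrapper(io.BytesIO(stored), encoding="utf-8", newline="")
decoded = reader.read()

print(stored, repr(decoded))
```

Opening a file with open(path, "rb") gives the unwrapped layer directly, which is the sense in which the wrapping is optional.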
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2014-06-05 21:30 +0300 |
| Message-ID | <87tx7z5hvw.fsf@elektro.pacujo.net> |
| In reply to | #72742 |
Terry Reedy <tjreedy@udel.edu>:

> Different OSes *do* have different assumptions. Both MacOSX and
> current Windows use (UCS-2 or) UTF-16 for text.

Linux can use anything for text; UTF-8 has become a de-facto standard.

How text is represented is very different from whether text is a fundamental data type. A fundamental text file is such that ordinary operating system facilities can't see inside the black box (that is, they are *not* encoded as far as the applications go).

I have no idea how opaque text files are in Windows or OS-X.

> For Windows, at least, the interface is much improved in Python 3.

Yes, I get the feeling that Python is reaching out to Windows and OS-X and trying to make linux look like them.

> I understand that some, but not all, Latin alphabet *nix programmers
> wish that Python 3 continued to be strongly in their favor. But they
> are a small minority of the world's programmers, and Python 3 is aimed
> at everyone on all systems.

Python allows linux programmers to write native linux programs. Maybe it allows Windows programmers to write native Windows programs. I certainly hope so.

I don't want to have to write Windows programs that kinda run on linux. Java suffers from that: no "import os" in Java.

Marko