Groups > comp.lang.python > #72340 > unrolled thread

Python 3.2 has some deadly infection

Started by Mark Lawrence <breamoreboy@yahoo.co.uk>
First post 2014-05-31 17:10 +0100
Last post 2014-06-03 14:22 -0400
Articles 20 on this page of 92 — 19 participants



Contents

  Python 3.2 has some deadly infection Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-05-31 17:10 +0100
    Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-05-31 22:55 +0300
    Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-01 02:26 +0000
      Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-01 12:43 +1000
      Re: Python 3.2 has some deadly infection Tim Delaney <timothy.c.delaney@gmail.com> - 2014-06-02 08:54 +1000
        Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-02 01:14 +0000
          Re: Python 3.2 has some deadly infection Tim Delaney <timothy.c.delaney@gmail.com> - 2014-06-02 12:23 +1000
            Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-01 19:46 -0700
          Re: Python 3.2 has some deadly infection Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2014-06-02 07:45 +0000
          Re: Python 3.2 has some deadly infection Tim Delaney <timothy.c.delaney@gmail.com> - 2014-06-02 19:02 +1000
          Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-02 19:14 +1000
          Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-02 12:10 +0100
            Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-03 16:34 +0000
              Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-04 02:43 +1000
          Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-02 17:34 -0400
            Re: Python 3.2 has some deadly infection Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-06-03 17:16 +1200
              Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-03 02:21 -0400
              Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-03 15:18 +0100
                Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-04 13:08 +0000
                  Re: Python 3.2 has some deadly infection Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-06-05 14:01 +1200
                    Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 10:16 +0300
                      Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-05 17:30 +1000
                        Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 11:05 +0300
                          Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-05 18:36 +1000
                            Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 12:53 +0300
                              Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-05 05:43 -0700
                              Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 14:50 -0400
                                Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 23:21 +0300
                                  Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 18:09 -0400
                                  Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 23:13 +0000
                                    Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 02:30 +0300
                                      Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 09:39 +1000
                                      Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 22:08 -0400
                                      Re: Python 3.2 has some deadly infection Ethan Furman <ethan@stoneleaf.us> - 2014-06-05 20:47 -0700
                    Re: Python 3.2 has some deadly infection Steven D'Aprano <steve@pearwood.info> - 2014-06-05 08:34 +0000
                      Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 12:41 +0300
                        Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 06:37 -0700
                          Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 17:45 +0300
                            Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 15:33 +0000
                              Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 02:12 +1000
                                Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 09:54 -0700
                                  Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 03:36 +1000
                              Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 19:52 +0300
                                Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 03:28 +1000
                                  Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 15:35 -0700
                                    Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 08:52 +1000
                                      Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 20:11 -0700
                                        Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 13:20 +1000
                                          Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 20:32 -0700
                                Re: Python 3.2 has some deadly infection Akira Li <4kir4.1i@gmail.com> - 2014-06-06 12:03 +0400
                            Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-05 16:37 +0100
                              Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 16:16 +0000
                            Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 01:50 +1000
                            Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-05 17:17 +0100
                              Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 16:32 +0000
                                Re: Python 3.2 has some deadly infection Ethan Furman <ethan@stoneleaf.us> - 2014-06-06 07:40 -0700
                            Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 03:14 +1000
                            Re: Python 3.2 has some deadly infection Ian Kelly <ian.g.kelly@gmail.com> - 2014-06-05 11:16 -0600
                            Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 14:11 -0400
                              Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 21:30 +0300
                                Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 23:02 +0000
                                  Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 02:21 +0300
                                    Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-06 12:15 +0000
                                      Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 16:00 +0300
                                  Re: Python 3.2 has some deadly infection rurpy@yahoo.com - 2014-06-07 21:34 -0700
                                Re: Python 3.2 has some deadly infection Ethan Furman <ethan@stoneleaf.us> - 2014-06-06 06:24 -0700
                                  Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 17:10 +0300
                                    Re: Python 3.2 has some deadly infection Michael Torrie <torriem@gmail.com> - 2014-06-06 09:02 -0600
                                      Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 18:32 +0300
                                        Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 01:50 +1000
                                          Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 20:02 +0300
                                            Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-06 10:13 -0700
                                              Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 03:26 +1000
                                          Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-06 11:03 -0700
                                          Re: Python 3.2 has some deadly infection Denis McMahon <denismfmcmahon@gmail.com> - 2014-06-06 21:18 +0000
                                            Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 08:18 +1000
                                        Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-06 15:57 +0000
                                          Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-06 09:21 -0700
                                            Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 02:48 +1000
                                              Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-06 10:04 -0700
                                                Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 03:12 +1000
                                          Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 20:11 +0300
                                            Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 03:16 +1000
                                            Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 20:18 +0300
                                            Re: Python 3.2 has some deadly infection Ned Batchelder <ned@nedbatchelder.com> - 2014-06-06 13:33 -0400
                                Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 01:25 +1000
                                  Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-06 08:44 -0700
                                    Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-06 08:48 -0700
                            Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-06 12:56 +0100
                  Re: Python 3.2 has some deadly infection Akira Li <4kir4.1i@gmail.com> - 2014-06-05 06:49 +0400
              Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-04 00:25 +1000
              Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-03 14:22 -0400



#72729

From Rustom Mody <rustompmody@gmail.com>
Date 2014-06-05 09:54 -0700
Message-ID <1dc666b6-1696-4662-8832-530a2b4f66a7@googlegroups.com>
In reply to #72723
On Thursday, June 5, 2014 9:42:28 PM UTC+5:30, Chris Angelico wrote:
> On Fri, Jun 6, 2014 at 1:33 AM, Steven D'Aprano wrote:
> > In the Unix world, text formats and text
> > processing is much more common in user-space apps than binary processing.
> > Perhaps the definitive explanation and celebration of the Unix way is
> > Eric Raymond's "The Art Of Unix Programming":
> > http://www.catb.org/esr/writings/taoup/html/ch05s01.html

> Specifically, this from the opening paragraph:
> """
> Text streams are a valuable universal format because they're easy for
> human beings to read, write, and edit without specialized tools. These
> formats are (or can be designed to be) transparent.
> """

A fact that stops being true when you tie up text with encodings.
For two reasons:

1. The function/pair encode/decode mapping between byte-string and text 
   cannot be a bijection because the byte-string set is larger than the text
   set.  This is the error that Armin was hit by

2. Since there is not one but a zillion encodings possible we are not
   talking of one (possibly universal) data structure but a zillion
   ones: "Text streams are a universal format" - which encoding-ed
   form of text??
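
Point 1 can be made concrete in Python 3. This is only a minimal sketch, assuming UTF-8 as the encoding and using two made-up byte strings (`raw_a`, `raw_b`) and a hypothetical helper `try_decode`: some byte strings have no text at all under strict decoding, and a lossy error handler maps distinct byte strings onto the same text, so the mapping cannot be inverted.

```python
# Two different byte strings, neither of which is valid UTF-8 (illustrative values).
raw_a = b"caf\xe9"   # "café" as Latin-1 bytes
raw_b = b"caf\xff"   # 0xFF never occurs in well-formed UTF-8


def try_decode(data):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


strict_a = try_decode(raw_a)                       # no text exists for these bytes
lossy_a = raw_a.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
lossy_b = raw_b.decode("utf-8", errors="replace")

print(strict_a, lossy_a == lossy_b)
```

Once the two inputs have collapsed to the same text, nothing can recover which bytes were originally there.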



#72739

From Chris Angelico <rosuav@gmail.com>
Date 2014-06-06 03:36 +1000
Message-ID <mailman.10752.1401989777.18130.python-list@python.org>
In reply to #72729
On Fri, Jun 6, 2014 at 2:54 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> On Thursday, June 5, 2014 9:42:28 PM UTC+5:30, Chris Angelico wrote:
>> On Fri, Jun 6, 2014 at 1:33 AM, Steven D'Aprano wrote:
>> > In the Unix world, text formats and text
>> > processing is much more common in user-space apps than binary processing.
>> > Perhaps the definitive explanation and celebration of the Unix way is
>> > Eric Raymond's "The Art Of Unix Programming":
>> > http://www.catb.org/esr/writings/taoup/html/ch05s01.html
>
>> Specifically, this from the opening paragraph:
>> """
>> Text streams are a valuable universal format because they're easy for
>> human beings to read, write, and edit without specialized tools. These
>> formats are (or can be designed to be) transparent.
>> """
>
> A fact that stops being true when you tie up text with encodings.
> For two reasons:
>
> 1. The function/pair encode/decode mapping between byte-string and text
>    cannot be a bijection because the byte-string set is larger than the text
>    set.  This is the error that Armin was hit by
>
> 2. Since there is not one but a zillion encodings possible we are not
>    talking of one (possibly universal) data structure but a zillion
>    ones: "Text streams are a universal format" - which encoding-ed
>    form of text??

As soon as you store or transmit ANY form of information, you need to
worry about encodings. Ever heard of this thing called "network byte
order"? It's part of taming the wilds of integer encodings. The theory
is that the LC environment variables will carry all that crucial
out-of-band information about encodings, and while the practice isn't
perfect, it does still mean that there is such a thing as a text
stream.
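
The "network byte order" remark can be illustrated with the standard `struct` module. This is just a sketch with an arbitrary example value: the same integer has two different four-byte encodings, and reading bytes under the wrong assumption yields a different number.

```python
import struct

value = 1
big = struct.pack(">I", value)      # big-endian, i.e. network byte order
little = struct.pack("<I", value)   # little-endian

# The same four bytes, interpreted with the wrong byte order.
misread = struct.unpack("<I", big)[0]

print(big, little, misread)
```

As with text, the bytes alone do not say which encoding was used; that knowledge has to travel out of band.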

ChrisA



#72728

From Marko Rauhamaa <marko@pacujo.net>
Date 2014-06-05 19:52 +0300
Message-ID <87ha3zti2h.fsf@elektro.pacujo.net>
In reply to #72728
Steven D'Aprano <steve+comp.lang.python@pearwood.info>:

> Nevertheless, there are important abstractions that are written on top
> of the bytes layer, and in the Unix and Linux world, the most
> important abstraction is *text*. In the Unix world, text formats and
> text processing is much more common in user-space apps than binary
> processing.

That linux text is not the same thing as Python's text. Conceptually,
Python text is a sequence of 32-bit integers. Linux text is a sequence
of 8-bit integers.

It is great that lots of computer-to-computer formats are encoded in
ASCII (~ UTF-8). However, nowhere in linux is there a real abstraction
layer that processes Python-esque text.

Case in point:

   $ env | grep UTF
   LANG=en_US.UTF-8
   $ od -c <<<"Hyvää yötä"     # "Good night" in Finnish
   0000000   H   y   v 303 244 303 244       y 303 266   t 303 244  \n
   0000017

The "od" utility is asked to display its input as characters. The locale
info gives a hint that all text data is in UTF-8. Yet what comes out is
bytes.

How about:

   $ wc -c <<<"Hyvää yötä"
   15
   $ tr 'ä' 'a' <<<"Hyvää yötä"
   Hyvaaaa ya�taa

Grep is smarter:

   $ grep v...y <<<"Hyvää yötä"
   Hyvää yötä

which is why you should always prefix "grep" with LC_ALL=C in your
scripts (makes it far faster, too).
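
For comparison, the same string can be looked at from both sides in Python 3. The sketch below assumes the input is UTF-8 with precomposed characters and includes the trailing newline that the here-string adds; `bytes.translate` is used as a rough stand-in for what the `tr` run did, not as a claim about how `tr` is implemented.

```python
text = "Hyv\u00e4\u00e4 y\u00f6t\u00e4\n"   # "Hyvää yötä" plus the here-string newline
data = text.encode("utf-8")

n_chars = len(text)    # code points
n_bytes = len(data)    # what `wc -c` counts

# Byte-wise translation: each of the two bytes of "ä" (0xC3, 0xA4)
# is replaced by "a", which also damages the "ö" (0xC3, 0xB6).
table = bytes.maketrans("\u00e4".encode("utf-8"), b"aa")
mangled = data.translate(table)

print(n_chars, n_bytes, mangled)
```

The stray 0xB6 byte left in `mangled` is the replacement character visible in the `tr` output above.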


Marko



#72737

From Chris Angelico <rosuav@gmail.com>
Date 2014-06-06 03:28 +1000
Message-ID <mailman.10751.1401989331.18130.python-list@python.org>
In reply to #72728
On Fri, Jun 6, 2014 at 2:52 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> That linux text is not the same thing as Python's text. Conceptually,
> Python text is a sequence of 32-bit integers. Linux text is a sequence
> of 8-bit integers.

Point of terminology: Linux is the kernel, everything you say below
here is talking about particular programs. From what I understand,
bash (just another Unix program) treats strings as sequences of
codepoints, just as Python does; though its string manipulation is not
nearly as rich as Python's, so it's harder to prove. Python is itself
a Unix program, so you can do the exact same proofs and demonstrate
that Linux is clearly Unicode-aware. It's not Linux you're testing.

ChrisA



#72779

From Rustom Mody <rustompmody@gmail.com>
Date 2014-06-05 15:35 -0700
Message-ID <4256f797-d70e-4c0d-ba97-00cdffddc082@googlegroups.com>
In reply to #72737
On Thursday, June 5, 2014 10:58:43 PM UTC+5:30, Chris Angelico wrote:
> On Fri, Jun 6, 2014 at 2:52 AM, Marko Rauhamaa wrote:
> > That linux text is not the same thing as Python's text. Conceptually,
> > Python text is a sequence of 32-bit integers. Linux text is a sequence
> > of 8-bit integers.

> Point of terminology: Linux is the kernel, everything you say below
> here is talking about particular programs.

If it helps try the following substitution:

s/Linux/Pretty much all the distros that use Linux for their OS kernel/

BTW the only (other) guy I know who insistently makes that distinction is
Richard Stallman.

Are you an emacs user by any chance <wink>?


> From what I understand,
> bash (just another Unix program) treats strings as sequences of
> codepoints, just as Python does; though its string manipulation is not
> nearly as rich as Python's, so it's harder to prove. Python is itself
> a Unix program, so you can do the exact same proofs and demonstrate
> that Linux is clearly Unicode-aware. It's not Linux you're testing.

In these 'other programs' is it permissible to include the kernel
itself?
And then ask how Linux (in your and Stallman's sense) differs from
Windows in how the filesystem handles things like filenames?



#72780

From Chris Angelico <rosuav@gmail.com>
Date 2014-06-06 08:52 +1000
Message-ID <mailman.10780.1402008750.18130.python-list@python.org>
In reply to #72779
On Fri, Jun 6, 2014 at 8:35 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> On Thursday, June 5, 2014 10:58:43 PM UTC+5:30, Chris Angelico wrote:
>> On Fri, Jun 6, 2014 at 2:52 AM, Marko Rauhamaa wrote:
>> > That linux text is not the same thing as Python's text. Conceptually,
>> > Python text is a sequence of 32-bit integers. Linux text is a sequence
>> > of 8-bit integers.
>
>> Point of terminology: Linux is the kernel, everything you say below
>> here is talking about particular programs.
>
> If it helps try the following substitution:
>
> s/Linux/Pretty much all the distros that use Linux for their OS kernel/

You could look at the Debian Project, which is a full environment with
everything you're talking about. And everything you say would be
equally true of Debian Linux and Debian kfreebsd. :)

> BTW the only (other) guy I know who insistently makes that distinction is
> Richard Stallman.
>
> Are you an emacs user by any chance <wink>?

Nope! Just a terminology nerd. :)

>> From what I understand,
>> bash (just another Unix program) treats strings as sequences of
>> codepoints, just as Python does; though its string manipulation is not
>> nearly as rich as Python's, so it's harder to prove. Python is itself
>> a Unix program, so you can do the exact same proofs and demonstrate
>> that Linux is clearly Unicode-aware. It's not Linux you're testing.
>
> In these 'other programs' is it permissible to include the kernel
> itself?
> And then ask how Linux (in your and Stallman's sense) differs from
> Windows in how the filesystem handles things like filenames?

What are you testing of the kernel? Most of the kernel doesn't
actually work with text at all - it works with integers, buffers of
memory (which could be seen as streams of bytes, but might be almost
anything), process tables, open file handles... but not usually text.
To you, "EAGAIN" might be a bit of text, but to the Linux kernel, it's
an integer (11 decimal, if I recall correctly). Is that some fancy new
form of encoding? :)
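
The EAGAIN point can be checked from Python's `errno` module. This is a small sketch; the numeric value is platform-dependent (11 is the usual Linux value, other systems differ), so nothing below relies on it.

```python
import errno
import os

code = errno.EAGAIN             # a plain int; the value depends on the platform
name = errno.errorcode[code]    # symbolic name looked up from the number
                                # (may be "EWOULDBLOCK" where the two share a value)
message = os.strerror(code)     # human-readable text, produced in user space

print(code, name, message)
```

The name and the message are conveniences layered on top by the C library and Python; what crosses the system call boundary is the integer.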

ChrisA



#72812

From Rustom Mody <rustompmody@gmail.com>
Date 2014-06-05 20:11 -0700
Message-ID <fba7b81b-41e8-4e6e-bac8-1613d47dac58@googlegroups.com>
In reply to #72780
On Friday, June 6, 2014 4:22:22 AM UTC+5:30, Chris Angelico wrote:
> On Fri, Jun 6, 2014 at 8:35 AM, Rustom Mody  wrote:
> > And then ask how Linux (in your and Stallman's sense) differs from
> > Windows in how the filesystem handles things like filenames?

> What are you testing of the kernel? Most of the kernel doesn't
> actually work with text at all - it works with integers, buffers of
> memory (which could be seen as streams of bytes, but might be almost
> anything), process tables, open file handles... but not usually text.
> To you, "EAGAIN" might be a bit of text, but to the Linux kernel, it's
> an integer (11 decimal, if I recall correctly). Is that some fancy new
> form of encoding? :)


| Thanks to the properties of UTF-8 encoding, the Linux kernel, the
| innermost and lowest-level part of the operating system, can
| handle Unicode filenames without even having the user tell it
| that UTF-8 is to be used. All character strings, including
| filenames, are treated by the kernel in such a way that THEY
| APPEAR TO IT ONLY AS STRINGS OF BYTES. Thus, it doesn't care and
| does not need to know whether a pair of consecutive bytes should
| logically be treated as two characters or a single one. The only
| risk of the kernel being fooled would be, for example, for a
| filename to contain a multibyte Unicode character encoded in such
| a way that one of the bytes used to represent it was a slash or
| some other character that has a special meaning in file
| names. Fortunately, as we noted, UTF-8 never uses ASCII
| characters for encoding multibyte characters, so neither the
| slash nor any other special character can appear as part of one
| and therefore there is no risk associated with using Unicode in
| filenames.
|  
| Filesystems found on Microsoft Windows machines (NTFS and FAT)
| are different in that THEY STORE FILENAMES ON DISK IN SOME
| PARTICULAR ENCODING. The kernel must translate this encoding to
| the system encoding, which will be UTF-8 in our case.
|  
| If you have Windows partitions on your system, you will have to
| take care that they are mounted with correct options. For FAT and
| ISO9660 (used by CD-ROMs) partitions, option utf8 makes the
| system translate the filesystem's character encoding to
| UTF-8. For NTFS, nls=utf8 is the recommended option (utf8 should
| also work).

[Emphases mine]

From: http://michal.kosmulski.org/computing/articles/linux-unicode.html
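
The UTF-8 property the quoted article relies on (no byte of a multibyte sequence falls in the ASCII range) can be checked by brute force. A sketch only, with a hypothetical helper name; it walks every encodable code point, so it takes a moment to run.

```python
def multibyte_bytes_are_non_ascii(code_point):
    """True if the UTF-8 form is one byte, or has no byte below 0x80."""
    encoded = chr(code_point).encode("utf-8")
    return len(encoded) == 1 or all(b >= 0x80 for b in encoded)


# Every Unicode scalar value; surrogates U+D800..U+DFFF cannot be encoded in UTF-8.
scalars = (cp for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF)
holds = all(multibyte_bytes_are_non_ascii(cp) for cp in scalars)

# Consequence: a 0x2F byte in a UTF-8 path is always a real "/".
path = "Hyv\u00e4\u00e4/y\u00f6t\u00e4".encode("utf-8")
slashes = path.count(b"/")

print(holds, slashes)
```

This is why byte-oriented code can split a UTF-8 path on the slash byte without decoding it first.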



#72814

From Chris Angelico <rosuav@gmail.com>
Date 2014-06-06 13:20 +1000
Message-ID <mailman.10803.1402024865.18130.python-list@python.org>
In reply to #72812
On Fri, Jun 6, 2014 at 1:11 PM, Rustom Mody <rustompmody@gmail.com> wrote:
> All character strings, including
> | filenames, are treated by the kernel in such a way that THEY
> | APPEAR TO IT ONLY AS STRINGS OF BYTES.

Yep, the real issue here is file systems, not the kernel. But yes,
this is one of the very few places where the kernel deals with a
string - and because of the hairiness of having to handle myriad file
systems in a single path (imagine multiple levels of remote mounts -
I've had a case where I mount via sshfs a tree that includes a Samba
mount point, and you can go a lot deeper than that), the only thing it
can do is pass the bytes on unchanged. Which means, in reality, the
kernel doesn't actually do *anything* with the string, it just passes
it right along to the file system.

ChrisA



#72816

From Rustom Mody <rustompmody@gmail.com>
Date 2014-06-05 20:32 -0700
Message-ID <14ac0dcc-58da-49a5-8c93-97dbc14d7b31@googlegroups.com>
In reply to #72814
On Friday, June 6, 2014 8:50:57 AM UTC+5:30, Chris Angelico wrote:
> kernel doesn't actually do *anything* with the string, it just passes
> it right along to the file system.

Which is what Marko (and others like Armin) are asking of python
(treated as a processing 'kernel'):

"I know what I am doing with my bytes -- please channel/funnel them
around as requested without being unnecessarily and unrequestedly
'intelligent'"
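
For filenames specifically, Python 3 already has a pass-through mechanism worth keeping in mind here. The sketch below assumes a POSIX system, where `os.fsdecode` and `os.fsencode` use the `surrogateescape` error handler (PEP 383); the filename is a made-up example.

```python
import os

raw_name = b"caf\xe9.txt"            # Latin-1 bytes, not valid UTF-8 (illustrative)

# On POSIX, bytes that cannot be decoded are carried through as lone
# surrogates instead of raising or being replaced.
as_text = os.fsdecode(raw_name)

# Encoding again restores the original bytes exactly.
round_trip = os.fsencode(as_text)

print(ascii(as_text), round_trip == raw_name)
```

The `os` functions also accept `bytes` paths directly, so a program that only wants to funnel names around can stay in bytes end to end.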



#72820

From Akira Li <4kir4.1i@gmail.com>
Date 2014-06-06 12:03 +0400
Message-ID <mailman.10807.1402041845.18130.python-list@python.org>
In reply to #72728
Marko Rauhamaa <marko@pacujo.net> writes:

> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>
>> Nevertheless, there are important abstractions that are written on top
>> of the bytes layer, and in the Unix and Linux world, the most
>> important abstraction is *text*. In the Unix world, text formats and
>> text processing is much more common in user-space apps than binary
>> processing.
>
> That linux text is not the same thing as Python's text. Conceptually,
> Python text is a sequence of 32-bit integers. Linux text is a sequence
> of 8-bit integers.

_A Unicode string in Python is a sequence of Unicode codepoints_. It is
correct that a 32-bit integer is enough to represent any Unicode
codepoint: \u0000...\U0010FFFF

It says *nothing* about how Unicode strings are represented
*internally* in Python. That may vary with the version and the build
options, and may even depend on the content of a string at runtime.

In the past, "narrow builds" could break the abstraction in some cases,
which is why Linux distributions used wide Python builds.
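
A short sketch of the distinction, assuming Python 3.3 or later (where PEP 393 removed the narrow/wide split). The exact `sys.getsizeof` numbers are CPython implementation details and are shown only to make the point that storage depends on content.

```python
import sys

s = "\U0001F3A7"                              # HEADPHONE, outside the BMP

n_code_points = len(s)                        # one code point
n_utf8 = len(s.encode("utf-8"))               # bytes in UTF-8
n_utf16 = len(s.encode("utf-16-le")) // 2     # UTF-16 code units (a surrogate pair)

# Same length in code points, different amounts of memory.
sizes = [sys.getsizeof(ch * 100) for ch in ("a", "\u00e4", "\u20ac", s)]

print(n_code_points, n_utf8, n_utf16, sizes)
```

On an old narrow build, `len(s)` would have reported 2 for this string, which is the abstraction leak mentioned above.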


_A Unicode codepoint is not a Python concept_. There is the Unicode
standard, http://unicode.org, though instead of following its web of
self-referential definitions, I find it easier to learn from
examples such as http://codepoints.net/U+0041 (A) or
http://codepoints.net/U+1F3A7 (🎧)

_There is no such thing as 8-bit text_
http://www.joelonsoftware.com/articles/Unicode.html

If you insert a space after each byte (8-bit) in the input text then you
may get garbage i.e., you can't assume that a character is a byte:

  $ echo "Hyvää yötä" | perl -pe's/.\K/ /g'
  H y v a � � � �   y � � t � �

In general, you can't assume that a character is a Unicode codepoint:

  $ echo "Hyvää yötä" | perl -C -pe's/.\K/ /g'
  H y v a ̈ ä   y ö t ä

The eXtended grapheme clusters (user-perceived characters) may be useful
in this case:

  $ echo "Hyvää yötä" | perl -C -pe's/\X\K/ /g'
  H y v ä ä   y ö t ä

The \X pattern is supported by the third-party `regex` module for Python,
i.e., you can't even iterate over characters (as they are seen by a user)
in Python using only the stdlib. The \w+ pattern is also broken for Unicode
text http://bugs.python.org/issue1693050 (it is fixed in the `regex` module),
i.e., you can't select a word in Unicode text using only the stdlib.
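
Both stdlib limitations can be seen with a string that mixes a decomposed and a precomposed "ä". This is a sketch only: the grouping loop is a rough approximation that glues combining marks onto the preceding character, not an implementation of UAX #29 segmentation, and the `\w+` behaviour is what I understand current CPython `re` to do.

```python
import re
import unicodedata

word = "Hyva\u0308\u00e4"      # first "ä" is a + U+0308, second is precomposed

code_points = list(word)               # iterating a str yields code points
matches = re.findall(r"\w+", word)     # the combining mark is not matched by \w

# Rough approximation of user-perceived characters (NOT full UAX #29).
clusters = []
for ch in word:
    if clusters and unicodedata.combining(ch):
        clusters[-1] += ch             # attach the mark to its base character
    else:
        clusters.append(ch)

print(len(code_points), matches, len(clusters))
```

The third-party `regex` module's \X handles the general case, including clusters that this loop would get wrong.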

\X alone is not enough in some cases e.g., "“ch” may be considered a
grapheme cluster in Slovak, for processes such as collation" [1]
(sorting order). The `PyICU` module might be useful here.

Knowing about the Unicode normalization forms (NFC, NFKD, etc.)
http://unicode.org/reports/tr15/, Unicode
text segmentation [1], and the Unicode collation algorithm
http://www.unicode.org/reports/tr10/ is also
useful if you want to work with text.

[1]: http://www.unicode.org/reports/tr29/
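
Normalization, at least, is covered by the stdlib. A minimal sketch using `unicodedata.normalize` on the two spellings of "ä" seen in the perl output above, plus one compatibility example:

```python
import unicodedata

decomposed = "a\u0308"     # LATIN SMALL LETTER A + COMBINING DIAERESIS
precomposed = "\u00e4"     # LATIN SMALL LETTER A WITH DIAERESIS

same_raw = decomposed == precomposed
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

# Compatibility forms also fold presentation variants such as the "fi" ligature.
folded = unicodedata.normalize("NFKD", "\ufb01")

print(same_raw, same_nfc, same_nfd, folded)
```

Comparing or hashing text without normalizing first treats the two spellings as different strings, even though they render identically.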


--
akira



#72712

From Robin Becker <robin@reportlab.com>
Date 2014-06-05 16:37 +0100
Message-ID <mailman.10739.1401982699.18130.python-list@python.org>
In reply to #72708
On 05/06/2014 15:45, Marko Rauhamaa wrote:
> Rustom Mody <rustompmody@gmail.com>:
>
>> What Marko is saying is that by imposing the structuring of unicode on
>> the outside (Unix) world of text=byte, significant power is lost.
>
> Mostly I'm saying Python3 will not be able to hide the fact that linux
> data consists of bytes. It shouldn't even try. The linux OS outside the
> Python process talks bytes, not strings.
>
> A different OS might have different assumptions.
>
>
> Marko
>
I think I'm in the unix camp as well. I just think that an extra assumption on 
input output isn't always helpful. In python 3 byte strings are second class 
which I think is wrong; apparently pressure from influential users is pushing to 
make byte strings more first class which is a good thing.
-- 
Robin Becker



#72724

From Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date 2014-06-05 16:16 +0000
Message-ID <539097e5$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to #72712
On Thu, 05 Jun 2014 16:37:23 +0100, Robin Becker wrote:

> In python 3 byte strings
> are second class which I think is wrong

It certainly is wrong. bytes are just as much a first-class built-in type 
as list, int, float, bool, set, tuple and str.

There may be missing functionality (relatively easy to add new 
functionality), and even poor design choices (like the foolish decision 
to have bytes display as if they were ASCII-ish strings, a silly mistake 
that simply reinforces the myth that bytes and ASCII are synonymous). 
Python 3.4 and 3.5 are in the process of rectifying as many of these 
mistakes as possible, e.g. adding back % formatting. But a few mistakes 
in the design of bytes' API no more makes it "second-class" than the lack 
of dict.contains_value() method makes dict "second-class".

By all means ask for better bytes functionality. But don't libel Python 
by pretending that bytes is anything less than one of the most important 
and fundamental types in the language. bytes are so important that there 
are TWO implementations for them, a mutable and immutable version 
(bytearray and bytes), while text strings only have an immutable version.
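
A brief sketch of the points above: the ASCII-looking repr, the mutable/immutable pair, and %-formatting on bytes. The last line assumes Python 3.5 or later, where PEP 461 added it.

```python
frozen = bytes([72, 105])        # immutable
shown = repr(frozen)             # displayed as if it were ASCII text

buf = bytearray(frozen)          # mutable counterpart
buf[0] = 0x68                    # in-place change: "H" -> "h"

try:
    frozen[0] = 0x68             # bytes objects reject item assignment
    mutated = True
except TypeError:
    mutated = False

formatted = b"%s=%d" % (b"n", 5)   # %-formatting for bytes (Python 3.5+)

print(shown, buf, mutated, formatted)
```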



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/



#72717

From Chris Angelico <rosuav@gmail.com>
Date 2014-06-06 01:50 +1000
Message-ID <mailman.10741.1401983426.18130.python-list@python.org>
In reply to #72708
On Fri, Jun 6, 2014 at 1:37 AM, Robin Becker <robin@reportlab.com> wrote:
> I think I'm in the unix camp as well. I just think that an extra assumption
> on input output isn't always helpful. In python 3 byte strings are second
> class which I think is wrong; apparently pressure from influential users is
> pushing to make byte strings more first class which is a good thing.

I wouldn't say they're second-class; it's more that the bytes type was
considered to be more like a list of ints than like a Unicode string,
and now that there are a few years' worth of real-world usage
information to learn from, it's known that some more string-like
operations will be extremely helpful. So now they're being added,
which I agree is a good thing.

Whether b"a"[0] should be b'a' or ord(b'a') is another sticking point.
The Py2 str does the first, the Py3 bytes does the second. That one's
a bit hard to change, but what I'm not sure of is how significant this
is to new-build Py3 code. Obviously it's a barrier to porting, but is
it important on its own? However, that's still not really "byte
strings are second class".
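
For anyone who hasn't hit this yet, a quick sketch of the indexing difference under Python 3 (the variable names are just for illustration):

```python
b = b"abc"

# Indexing a Py3 bytes object gives an int (what ord() returned in Py2).
first = b[0]

# Slicing gives a bytes object, so a length-1 slice is the usual way to
# get the Py2-style one-byte string back.
first_as_bytes = b[0:1]

# Going the other way: wrap the int in a list to rebuild a bytes object.
rebuilt = bytes([first])
```

Code that has to run on both 2.7 and 3.x tends to use the slice form, since b[0:1] behaves the same on both.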

ChrisA



#72725

From: Robin Becker <robin@reportlab.com>
Date: 2014-06-05 17:17 +0100
Message-ID: <mailman.10744.1401985039.18130.python-list@python.org>
In reply to: #72708
On 05/06/2014 16:50, Chris Angelico wrote:
..........
>
> I wouldn't say they're second-class; it's more that the bytes type was
> considered to be more like a list of ints than like a Unicode string,
> and now that there are a few years' worth of real-world usage
> information to learn from, it's known that some more string-like
> operations will be extremely helpful. So now they're being added,
> which I agree is a good thing.

in python 2 str and unicode were much more comparable. On balance I think just 
reversing them ie str --> bytes and unicode --> str was probably the right thing 
to do if the default conversions had been turned off. However making bytes a 
crippled thing was wrong.


>
> Whether b"a"[0] should be b'a' or ord(b'a') is another sticking point.
> The Py2 str does the first, the Py3 bytes does the second. That one's
> a bit hard to change, but what I'm not sure of is how significant this
> is to new-build Py3 code. Obviously it's a barrier to porting, but is
> it important on its own? However, that's still not really "byte
> strings are second class".
......
I dislike the current model, but that's because I had a lot of stuff to convert 
and probably made a bunch of blunders. The reportlab code is now a mess of hacks 
to keep it alive for 2.7 & >=3.3; I'm probably never going to be convinced that 
unicode types are good. Bytes are the underlying concept and should have remained 
so for simplicity's sake.
-- 
Robin Becker



#72726

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-06-05 16:32 +0000
Message-ID: <53909b96$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to: #72725
On Thu, 05 Jun 2014 17:17:05 +0100, Robin Becker wrote:

> Bytes are the underlying
> concept and should have remained so for simplicity's sake.

Bytes are the underlying concept for classes too. Do you think that an 
opaque unstructured blob of bytes is "simpler" to use than a class? How 
would an unstructured blob of bytes be simpler to use than an array of 
multi-byte characters?
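
A small sketch of what that difference looks like in practice, assuming Python 3 and UTF-8 as the encoding (the sample word and the names are illustrative only):

```python
text = "na\u00efve"              # five characters; the third is i-diaeresis
encoded = text.encode("utf-8")   # six bytes; that character takes two

char_count = len(text)
byte_count = len(encoded)

# Slicing the str always cuts between characters.
text_prefix = text[:3]

# Slicing the raw bytes at the same index cuts a character in half, so the
# prefix is no longer valid UTF-8 and cannot be decoded.
try:
    encoded[:3].decode("utf-8")
    byte_prefix_decodes = True
except UnicodeDecodeError:
    byte_prefix_decodes = False
```

With the blob-of-bytes view, every caller has to track where the character boundaries are; with str, the type does it.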

Earlier:

> I dislike the current model, but that's because I had a lot of stuff to
> convert and probably made a bunch of blunders. The reportlab code is
> now a mess of hacks to keep it alive for 2.7 & >=3.3;

Although I've been critical of many of your statements, I am sympathetic 
to your pain. There's no doubt that the transition from the old, 
broken system of bytes masquerading as text can be hard, especially to 
those who never quite get past the misleading and false paradigm that 
"bytes are ASCII". It may have been that there were better ways to have 
updated to 3.3; perhaps you were merely unfortunate to have updated too 
early, and had you waited to 3.4 or 3.5 things would have been better. I 
don't know.

But whatever the situation, and despite our differences of opinion about 
Unicode, THANK YOU for having updated ReportLab to 3.3.



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/



#72849

From: Ethan Furman <ethan@stoneleaf.us>
Date: 2014-06-06 07:40 -0700
Message-ID: <mailman.10817.1402066973.18130.python-list@python.org>
In reply to: #72726
On 06/05/2014 09:32 AM, Steven D'Aprano wrote:
>
> But whatever the situation, and despite our differences of opinion about
> Unicode, THANK YOU for having updated ReportLab to 3.3.

+1000

--
~Ethan~



#72732

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-06-06 03:14 +1000
Message-ID: <mailman.10746.1401988496.18130.python-list@python.org>
In reply to: #72708
On Fri, Jun 6, 2014 at 2:17 AM, Robin Becker <robin@reportlab.com> wrote:
> in python 2 str and unicode were much more comparable. On balance I think
> just reversing them ie str --> bytes and unicode --> str was probably the
> right thing to do if the default conversions had been turned off. However
> making bytes a crippled thing was wrong.

It's easy to build up functionality after the event. Maybe reportlab
will have lots of hacks to support both 2.7 and 3.3, but in a few
years you'll be able to say "supports 2.7 and 3.5" and take advantage
of percent formatting and whatever else is added. But this is just the
way that languages develop; you use them, you find what isn't easy,
and you fix it. The nature of stability is that it takes time before
you can depend on freshly-written functionality (contrast the extreme
instability of running the version from source control - stuff might
be fixed at any time, but you have to do all the work yourself to make
sure your dependencies line up), but over time, you can depend on
improvements making their way out there.

Can you point to specific areas in which the bytes type is "crippled"?
Comparing either to the Py2 str or the Py3 str, or to anything else?
The Python core devs are listening, as evidenced by PEP 461.
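
As a rough sketch of what PEP 461 buys you, assuming an interpreter where it has landed (3.5 or later); the header text is a made-up example, and the else branch is the sort of workaround needed on earlier 3.x releases:

```python
import sys

length = 42

if sys.version_info >= (3, 5):
    # PEP 461: percent formatting works directly on bytes.
    header = b"Content-Length: %d\r\n" % length
else:
    # Earlier 3.x: format as text, encode, and concatenate by hand.
    header = b"Content-Length: " + str(length).encode("ascii") + b"\r\n"
```

Both branches produce the same bytes; the first is simply what most people porting wire-protocol code wanted to write in the first place.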

ChrisA



#72736

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2014-06-05 11:16 -0600
Message-ID: <mailman.10750.1401988932.18130.python-list@python.org>
In reply to: #72708
On Thu, Jun 5, 2014 at 10:17 AM, Robin Becker <robin@reportlab.com> wrote:
> in python 2 str and unicode were much more comparable. On balance I think
> just reversing them ie str --> bytes and unicode --> str was probably the
> right thing to do if the default conversions had been turned off. However
> making bytes a crippled thing was wrong.

How should e.g. bytes.upper() be implemented then?  The correct
behavior is entirely dependent on the encoding.  Python 2 just assumes
ASCII, which at best will correctly upper-case some subset of the
string and leave the rest unchanged, and at worst could corrupt the
string entirely.  There are some things that were dropped that should
not have been, but my impression is that those are being worked on,
for example % formatting in PEP 461.
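
To make the encoding problem concrete, here is a small sketch using Python 3, where bytes.upper() still exists but only touches ASCII letters. The sample word and the choice of Latin-1 and UTF-8 are just for illustration:

```python
word = "caf\u00e9"                    # ends in e-acute

raw_latin1 = word.encode("latin-1")   # e-acute is the single byte 0xE9
raw_utf8 = word.encode("utf-8")       # e-acute is the two bytes 0xC3 0xA9

# Upper-casing the bytes changes only the ASCII letters; the accented
# character is left alone in either encoding.
upper_latin1_bytes = raw_latin1.upper()
upper_utf8_bytes = raw_utf8.upper()

# Decoding first lets str.upper() handle the whole string correctly.
upper_text = raw_latin1.decode("latin-1").upper()
```

Getting the accented letter right from the bytes alone would require knowing which encoding they are in, which is exactly the information a bare bytes object doesn't carry.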



#72742

From: Terry Reedy <tjreedy@udel.edu>
Date: 2014-06-05 14:11 -0400
Message-ID: <mailman.10755.1401991925.18130.python-list@python.org>
In reply to: #72708
On 6/5/2014 10:45 AM, Marko Rauhamaa wrote:

> Mostly I'm saying Python3 will not be able to hide the fact that linux
> data consists of bytes. It shouldn't even try. The linux OS outside the
> Python process talks bytes, not strings.

A text file is a binary file wrapped with a codec to translate to and 
from a universal text format on input and output.  Much of the time, the 
wrapping is a great user convenience. Since the wrapping is optional, 
nothing is forced or really hidden.
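
A minimal sketch of that layering, using an in-memory io.BytesIO as a stand-in for a binary file and assuming UTF-8 as the codec (both are choices made for the example):

```python
import io

# The "binary file": nothing but bytes.
raw = io.BytesIO()

# The text layer: the same stream wrapped with a codec.
# newline="\n" keeps the output identical on every platform.
writer = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")
writer.write("na\u00efve\n")
writer.flush()

# What actually got stored underneath is encoded bytes.
stored = raw.getvalue()

# Reading back through a wrapper decodes them into text again.
reader = io.TextIOWrapper(io.BytesIO(stored), encoding="utf-8")
round_trip = reader.read()
```

Opening a real file in "rb" or "wb" mode skips the wrapper entirely, which is the sense in which it is optional.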

> A different OS might have different assumptions.

Different OSes *do* have different assumptions. Both MacOSX and current 
Windows use (UCS-2 or) UTF-16 for text. It seems that unicode strings 
are better than ascii+??? strings as a universal basis for OS 
interfacing.  For Windows, at least, the interface is much improved in 
Python 3.

I understand that some, but not all, Latin alphabet *nix programmers 
wish that Python 3 continued to be strongly in their favor. But they are 
a small minority of the world's programmers, and Python 3 is aimed at 
everyone on all systems.

-- 
Terry Jan Reedy



#72744

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2014-06-05 21:30 +0300
Message-ID: <87tx7z5hvw.fsf@elektro.pacujo.net>
In reply to: #72742
Terry Reedy <tjreedy@udel.edu>:

> Different OSes *do* have different assumptions. Both MacOSX and
> current Windows use (UCS-2 or) UTF-16 for text.

Linux can use anything for text; UTF-8 has become a de-facto standard.

How text is represented is very different from whether text is a
fundamental data type. A fundamental text file is such that ordinary
operating system facilities can't see inside the black box (that is,
they are *not* encoded as far as the applications go).

I have no idea how opaque text files are in Windows or OS-X.

> For Windows, at least, the interface is much improved in Python 3.

Yes, I get the feeling that Python is reaching out to Windows and OS-X
and trying to make linux look like them.

> I understand that some, but not all, Latin alphabet *nix programmers
> wish that Python 3 continued to be strongly in their favor. But they
> are a small minority of the world's programmers, and Python 3 is aimed
> at everyone on all systems.

Python allows linux programmers to write native linux programs. Maybe it
allows Windows programmers to write native Windows programs. I certainly
hope so.

I don't want to have to write Windows programs that kinda run on linux.
Java suffers from that: no "import os" in Java.


Marko


