


Python 3.2 has some deadly infection

Started by: Mark Lawrence <breamoreboy@yahoo.co.uk>
First post: 2014-05-31 17:10 +0100
Last post: 2014-06-03 14:22 -0400
Articles 20 on this page of 92 — 19 participants



Contents

  Python 3.2 has some deadly infection Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-05-31 17:10 +0100
    Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-05-31 22:55 +0300
    Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-01 02:26 +0000
      Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-01 12:43 +1000
      Re: Python 3.2 has some deadly infection Tim Delaney <timothy.c.delaney@gmail.com> - 2014-06-02 08:54 +1000
        Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-02 01:14 +0000
          Re: Python 3.2 has some deadly infection Tim Delaney <timothy.c.delaney@gmail.com> - 2014-06-02 12:23 +1000
            Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-01 19:46 -0700
          Re: Python 3.2 has some deadly infection Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2014-06-02 07:45 +0000
          Re: Python 3.2 has some deadly infection Tim Delaney <timothy.c.delaney@gmail.com> - 2014-06-02 19:02 +1000
          Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-02 19:14 +1000
          Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-02 12:10 +0100
            Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-03 16:34 +0000
              Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-04 02:43 +1000
          Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-02 17:34 -0400
            Re: Python 3.2 has some deadly infection Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-06-03 17:16 +1200
              Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-03 02:21 -0400
              Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-03 15:18 +0100
                Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-04 13:08 +0000
                  Re: Python 3.2 has some deadly infection Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-06-05 14:01 +1200
                    Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 10:16 +0300
                      Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-05 17:30 +1000
                        Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 11:05 +0300
                          Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-05 18:36 +1000
                            Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 12:53 +0300
                              Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-05 05:43 -0700
                              Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 14:50 -0400
                                Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 23:21 +0300
                                  Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 18:09 -0400
                                  Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 23:13 +0000
                                    Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 02:30 +0300
                                      Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 09:39 +1000
                                      Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 22:08 -0400
                                      Re: Python 3.2 has some deadly infection Ethan Furman <ethan@stoneleaf.us> - 2014-06-05 20:47 -0700
                    Re: Python 3.2 has some deadly infection Steven D'Aprano <steve@pearwood.info> - 2014-06-05 08:34 +0000
                      Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 12:41 +0300
                        Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 06:37 -0700
                          Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 17:45 +0300
                            Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 15:33 +0000
                              Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 02:12 +1000
                                Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 09:54 -0700
                                  Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 03:36 +1000
                              Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 19:52 +0300
                                Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 03:28 +1000
                                  Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 15:35 -0700
                                    Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 08:52 +1000
                                      Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 20:11 -0700
                                        Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 13:20 +1000
                                          Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 20:32 -0700
                                Re: Python 3.2 has some deadly infection Akira Li <4kir4.1i@gmail.com> - 2014-06-06 12:03 +0400
                            Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-05 16:37 +0100
                              Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 16:16 +0000
                            Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 01:50 +1000
                            Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-05 17:17 +0100
                              Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 16:32 +0000
                                Re: Python 3.2 has some deadly infection Ethan Furman <ethan@stoneleaf.us> - 2014-06-06 07:40 -0700
                            Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 03:14 +1000
                            Re: Python 3.2 has some deadly infection Ian Kelly <ian.g.kelly@gmail.com> - 2014-06-05 11:16 -0600
                            Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 14:11 -0400
                              Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 21:30 +0300
                                Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 23:02 +0000
                                  Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 02:21 +0300
                                    Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-06 12:15 +0000
                                      Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 16:00 +0300
                                  Re: Python 3.2 has some deadly infection rurpy@yahoo.com - 2014-06-07 21:34 -0700
                                Re: Python 3.2 has some deadly infection Ethan Furman <ethan@stoneleaf.us> - 2014-06-06 06:24 -0700
                                  Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 17:10 +0300
                                    Re: Python 3.2 has some deadly infection Michael Torrie <torriem@gmail.com> - 2014-06-06 09:02 -0600
                                      Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 18:32 +0300
                                        Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 01:50 +1000
                                          Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 20:02 +0300
                                            Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-06 10:13 -0700
                                              Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 03:26 +1000
                                          Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-06 11:03 -0700
                                          Re: Python 3.2 has some deadly infection Denis McMahon <denismfmcmahon@gmail.com> - 2014-06-06 21:18 +0000
                                            Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 08:18 +1000
                                        Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-06 15:57 +0000
                                          Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-06 09:21 -0700
                                            Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 02:48 +1000
                                              Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-06 10:04 -0700
                                                Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 03:12 +1000
                                          Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 20:11 +0300
                                            Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 03:16 +1000
                                            Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 20:18 +0300
                                            Re: Python 3.2 has some deadly infection Ned Batchelder <ned@nedbatchelder.com> - 2014-06-06 13:33 -0400
                                Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 01:25 +1000
                                  Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-06 08:44 -0700
                                    Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-06 08:48 -0700
                            Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-06 12:56 +0100
                  Re: Python 3.2 has some deadly infection Akira Li <4kir4.1i@gmail.com> - 2014-06-05 06:49 +0400
              Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-04 00:25 +1000
              Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-03 14:22 -0400



#72782

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-06-05 23:02 +0000
Message-ID: <5390f715$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to: #72744
On Thu, 05 Jun 2014 21:30:11 +0300, Marko Rauhamaa wrote:

> Terry Reedy <tjreedy@udel.edu>:
> 
>> Different OSes *do* have different assumptions. Both MacOSX and current
>> Windows use (UCS-2 or) UTF-16 for text.
> 
> Linux can use anything for text; UTF-8 has become a de-facto standard.
> 
> How text is represented is very different from whether text is a
> fundamental data type. A fundamental text file is such that ordinary
> operating system facilities can't see inside the black box (that is,
> they are *not* encoded as far as the applications go).

Wait, are they black-boxes to the *operating system* or to 
*applications*? They aren't the same thing.

In any case, I reject your premise. ALL data types are constructed on top 
of bytes, and so long as you allow applications *any way* to coerce data 
types to different data types, you allow them to see "inside the black 
box". I can extract the four bytes from a C long integer, but that 
doesn't mean that C longs aren't fundamental data types in Unix/Linux.


> I have no idea how opaque text files are in Windows or OS-X.

Exactly as opaque as they are in Unix, which is to say not at all. Just 
open the file in binary mode, and voilà you see the underlying bytes.
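As an illustration of the point about binary mode, here is a minimal Python sketch. The file name `demo.txt` and the sample text are assumptions made up for the example. It also packs an integer into four bytes with the standard `struct` module, which is roughly analogous to pulling the bytes out of a 4-byte C long.

```python
import os
import struct
import tempfile

# Write a short text file, then reopen it in binary mode to see the bytes.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write("voil\u00e0")
    with open(path, "rb") as f:
        raw = f.read()

print(raw)  # the accented letter occupies two bytes in UTF-8

# The same idea applied to an integer: pack it as a standard-size
# (4-byte) little-endian signed long and look at the result.
packed = struct.pack("<l", 1)
print(packed)
```

Neither step changes what the data is. The text is still text and the integer is still an integer; the bytes are just another view of them.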

All you're doing is pointing out that, in modern electronic computers, 
the fundamental data structure which underlies all others (the 
indivisible protons and neutrons, so to speak, only there are 256 of them 
rather than 2) is the byte. We know this, and don't dispute it.

(Like protons and neutrons, we can see inside bytes to the quark-like 
bits that make up bytes. Like quarks, bits do not exist in isolation, but 
only inside bytes.)



>> For Windows, at least, the interface is much improved in Python 3.
> 
> Yes, I get the feeling that Python is reaching out to Windows and OS-X
> and trying to make linux look like them.

Unicode support in OS-X (I have been assured) is very good, probably 
better than Linux. Apple has very high standards when it comes to their 
apps, and provides rich Unicode-aware APIs.

But Linux Unicode support is much better than Windows. Unicode support in 
Windows is crippled by continued reliance on legacy code pages, and by 
the assumption deep inside the Windows APIs that Unicode means "16 bit 
characters". See, for example, the amount of space spent on fixing 
Windows Unicode handling here:

http://www.utf8everywhere.org/



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/



#72789

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2014-06-06 02:21 +0300
Message-ID: <87mwdr54dp.fsf@elektro.pacujo.net>
In reply to: #72782
Steven D'Aprano <steve+comp.lang.python@pearwood.info>:

> In any case, I reject your premise. ALL data types are constructed on
> top of bytes,

Only in a very dull sense.

> and so long as you allow applications *any way* to coerce data types
> to different data types, you allow them to see "inside the black box".

I can't see the bytes inside Python objects, including strings, and
that's how it is supposed to be.

Similarly, I can't (easily) see how files are laid out on hard disks.
That's a true abstraction. Nothing in linux presents data, though,
except through bytes.


Marko



#72839

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-06-06 12:15 +0000
Message-ID: <5391b0d1$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to: #72789
On Fri, 06 Jun 2014 02:21:54 +0300, Marko Rauhamaa wrote:

> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
> 
>> In any case, I reject your premise. ALL data types are constructed on
>> top of bytes,
> 
> Only in a very dull sense.

I agree with you that this is a very dull, unimportant sense. And I think 
its dullness applies equally to the situation you somehow think is 
meaningfully exciting: Text is made of bytes! If you squint, you can see 
those bytes! Therefore text is not a first class data type!!!

To which my answer is, yes text is made of bytes, yes, you can expose 
those bytes, and no your conclusion doesn't follow.

 
>> and so long as you allow applications *any way* to coerce data types to
>> different data types, you allow them to see "inside the black box".
> 
> I can't see the bytes inside Python objects, including strings, and
> that's how it is supposed to be.

That's because Python the language doesn't allow you to coerce types to 
other types, except possibly through its interface to the underlying C 
implementation, ctypes. But Python allows you to write extensions in C, 
and that gives you the full power to take any data structure and turn it 
into any other data structure. Even bytes.
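A rough sketch of the ctypes route mentioned above. It relies on CPython implementation details: `id()` returning the object's memory address, and `sys.getsizeof()` reporting the size of the object in memory. It is not expected to be portable to other Python implementations. The sample string is made up for the example.

```python
import ctypes
import sys

text = "black box"

# CPython-specific assumption: id() is the address of the object, and
# sys.getsizeof() is the number of bytes the object occupies there.
raw = ctypes.string_at(id(text), sys.getsizeof(text))

print(len(raw), "bytes of object header and character data")
```

The result is an ordinary `bytes` object holding a copy of the string object's memory, internal header fields included.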


> Similarly, I can't (easily) see how files are laid out on hard disks.
> That's a true abstraction. Nothing in linux presents data, though,
> except through bytes.

Incorrect. Linux presents data as text all the time. Look at the prompt: 
it's treated as text, not numbers. You type commands using a text 
interface. The commands are made of words like ls, dd and ps, not numbers 
like 0x6C73, 0x6464 and 0x7073. Applications like grep are based on line-
based files, and "line" is a text concept, not a byte concept.

Consider:

[steve@ando ~]$ echo -e '\x41\x42\x43'
ABC


The assumption of *text* is so strong in the echo application that by 
default you cannot enter numeric escapes at all. Without the -e switch, 
echo assumes that numeric escapes represent themselves as character 
literals:

[steve@ando ~]$ echo '\x41\x42\x43'
\x41\x42\x43
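For comparison, the same relationships can be checked from Python. This is only a sketch: the sample data in `data` is invented for the example, and ASCII is assumed for the command names.

```python
# Command names as text, and as the hexadecimal numbers quoted above.
hex_names = {name: name.encode("ascii").hex() for name in ("ls", "dd", "ps")}
print(hex_names)

# What `echo -e '\x41\x42\x43'` produces, minus the trailing newline:
# three bytes that read as text once decoded.
decoded = bytes.fromhex("414243").decode("ascii")
print(decoded)

# "Line" is a textual convention layered on bytes: byte 0x0A ends a line.
data = b"first line\nsecond line\n"
lines = data.split(b"\n")[:-1]  # drop the empty piece after the final newline
print(lines)
```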



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/



#72842

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2014-06-06 16:00 +0300
Message-ID: <877g4utcpo.fsf@elektro.pacujo.net>
In reply to: #72839
Steven D'Aprano <steve+comp.lang.python@pearwood.info>:

> Incorrect. Linux presents data as text all the time. Look at the prompt: 
> it's treated as text, not numbers.

Of course there is a textual human interface. However, from the point of
view of virtually every OS component, it's bytes.


> Consider:
>
> [steve@ando ~]$ echo -e '\x41\x42\x43'
> ABC

"echo" doesn't know it's emitting text. It would be perfectly happy to
emit binary gibberish. The output goes to the pty which doesn't care
about the textual interpretation, either. Finally, the terminal
(emulation program) translates the incoming bytes to textual glyphs to
the best of its capabilities.

Anyway, what interests me mostly is that I routinely build programs and
systems that talk to each other over files, pipes, sockets and devices.
I really need to micromanage that data. I'm fine with encoding text if
that's the suitable interpretation. I just think Python is overreaching
by making the text interpretation the default for the standard streams
and files and guessing the correct encoding.

Note that subprocess.Popen() wisely assumes binary pipes. Unfortunately
the subprocess might be a python program that opens the standard streams
in the text mode...
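A small sketch of the binary-pipe behaviour described above, using the standard `subprocess` module. The child program is a made-up one-liner that writes four raw bytes to the binary layer of its standard output (`sys.stdout.buffer`). One of those bytes, 0xFF, is not valid UTF-8. Decoding with Latin-1 at the end is just one possible choice, picked because it accepts every byte value.

```python
import subprocess
import sys

# Hypothetical child program: emits raw bytes, bypassing the text layer.
child_code = "import sys; sys.stdout.buffer.write(bytes([0x41, 0x42, 0x43, 0xFF]))"

result = subprocess.run([sys.executable, "-c", child_code], stdout=subprocess.PIPE)
raw = result.stdout  # bytes, because no text mode was requested

# Interpreting the bytes as text is an explicit, separate step.
as_text = raw.decode("latin-1")
print(raw, as_text)
```

A Python program that wants to micromanage its own standard streams can use `sys.stdin.buffer` and `sys.stdout.buffer` in the same way, without going through the default text wrappers.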


Marko



#72953

From: rurpy@yahoo.com
Date: 2014-06-07 21:34 -0700
Message-ID: <545ec7b2-635e-462e-91a2-520de5f2f782@googlegroups.com>
In reply to: #72782
On 06/05/2014 05:02 PM, Steven D'Aprano wrote:
>[...]
> But Linux Unicode support is much better than Windows. Unicode support in 
> Windows is crippled by continued reliance on legacy code pages, and by 
> the assumption deep inside the Windows APIs that Unicode means "16 bit 
> characters". See, for example, the amount of space spent on fixing 
> Windows Unicode handling here:
> 
> http://www.utf8everywhere.org/

While not disagreeing with the general premise of that page, it 
has some problems that raise doubts in my mind about taking everything 
the author says at face value.

For example

  "Q: Why would the Asians give up on UTF-16 encoding, which saves 
      them 50% the memory per character?"
  [...] in fact UTF-8 is used just as often in those [Asian] countries. 

That is not my experience, at least for Japan.  See my comments in 
  https://mail.python.org/pipermail/python-ideas/2012-June/015429.html
where I show that utf8 files are a tiny minority of the text files 
found by Google.

He then gives a table with the size of utf8 and utf16 encoded contents
(ie stripped of html stuff) of an unnamed Japanese wikipedia page to 
show that even without a lot of (html-mandated) ascii, the space savings 
are not very much compared to the theoretical "50%" savings he stated:

  "             Dense text (Δ UTF-8)
   UTF-8   ...     222 KB (0%)
   UTF-16  ...     176 KB (−21%)"

Note that he calculates the space saving as (utf8-utf16)/utf8.
Yet by that metric the theoretical saving is *NOT* 50%, it is 33%.
For example 1000 Japanese characters will use 2000 bytes in utf16
and 3000 in utf8.

I did the same test using
  http://ja.wikipedia.org/wiki/%E7%B9%94%E7%94%B0%E4%BF%A1%E9%95%B7
I stripped html tags, javascript and redundant ascii whitespace characters.
The stripped utf-8 file was 164946 bytes, the utf-16 encoded version of
same was 117756.  That gives (using the (utf8-utf16)/utf16 metric he used 
to claim 50% idealized savings) 40% which is quite a bit closer to the 
idealized 50% than his 21%.
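Since two different denominators appear in the figures above, it may help to compute both for each case. The sketch below uses only the sizes quoted in this post. The helper name `savings` is made up for the example.

```python
def savings(utf8_size, utf16_size):
    """Return the UTF-16 saving relative to the UTF-8 size and to the UTF-16 size."""
    diff = utf8_size - utf16_size
    return diff / utf8_size, diff / utf16_size

cases = {
    "idealized, 1000 characters (bytes)": (3000, 2000),
    "table on the site (KB)": (222, 176),
    "page measured above (bytes)": (164946, 117756),
}

for label, (u8, u16) in cases.items():
    vs_utf8, vs_utf16 = savings(u8, u16)
    print(f"{label}: {vs_utf8:.1%} of UTF-8 size, {vs_utf16:.1%} of UTF-16 size")
```

Like for like, the measured page comes to roughly 29% against an idealized 33% when divided by the UTF-8 size, or roughly 40% against an idealized 50% when divided by the UTF-16 size. The site's table comes to roughly 21% and 26% respectively.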

I would have more faith in his opinions about things I don't know
about (such as unicode programming on Windows) if his other info
were more trustworthy.  IOW, just because it's on the internet doesn't 
mean it's true.



#72845

From: Ethan Furman <ethan@stoneleaf.us>
Date: 2014-06-06 06:24 -0700
Message-ID: <mailman.10815.1402062359.18130.python-list@python.org>
In reply to: #72744
On 06/05/2014 11:30 AM, Marko Rauhamaa wrote:
 >
> How text is represented is very different from whether text is a
> fundamental data type. A fundamental text file is such that ordinary
> operating system facilities can't see inside the black box (that is,
> they are *not* encoded as far as the applications go).

Of course they are.  It may be an ASCII-encoding of some flavor or 
other, or something really (to me) strange -- but an encoding is most 
assuredly in effect.

ASCII is *not* the state of "this string has no encoding" -- that would 
be Unicode; a Unicode string, as a data type, has no encoding.  To 
transport it, store it, etc., it must (usually?) be encoded into 
something -- utf-8, ASCII, Turkish, or whatever subset is agreed upon 
and will hopefully contain all the Unicode characters needed for the 
string to be properly represented.

The realization that ASCII was, in fact, an encoding was a big paradigm 
shift for me, but a necessary one.
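A short Python 3 sketch of that distinction. The sample string is an assumption chosen for the example, and ISO 8859-9 is used as one possible reading of "turkish". The `str` object has no encoding of its own; an encoding only comes into play when bytes are produced from it.

```python
text = "\u0131\u011f"  # dotless i and g-with-breve, two letters outside ASCII

utf8_bytes = text.encode("utf-8")          # two bytes per letter here
turkish_bytes = text.encode("iso8859-9")   # one byte per letter

# ASCII covers only a subset of Unicode, so encoding can fail outright.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(utf8_bytes, turkish_bytes, ascii_ok)
```

Decoding either byte string with the matching codec gives back the same `str`, which is the sense in which the string itself is independent of any particular encoding.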

--
~Ethan~



#72846

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2014-06-06 17:10 +0300
Message-ID: <87egz25dsd.fsf@elektro.pacujo.net>
In reply to: #72845
Ethan Furman <ethan@stoneleaf.us>:

> On 06/05/2014 11:30 AM, Marko Rauhamaa wrote:
>> A fundamental text file is such that ordinary operating system
>> facilities can't see inside the black box (that is, they are *not*
>> encoded as far as the applications go).
>
> Of course they are.

How would you know?

> It may be an ASCII-encoding of some flavor or other, or something
> really (to me) strange -- but an encoding is most assuredly in effect.

Outside metaphysics, that statement is only meaningful if you have
access to the encoding.

> ASCII is *not* the state of "this string has no encoding" -- that
> would be Unicode; a Unicode string, as a data type, has no encoding.

Huh?


Marko



#72850

From: Michael Torrie <torriem@gmail.com>
Date: 2014-06-06 09:02 -0600
Message-ID: <mailman.10818.1402066977.18130.python-list@python.org>
In reply to: #72846
On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
> Ethan Furman <ethan@stoneleaf.us>:
>> ASCII is *not* the state of "this string has no encoding" -- that
>> would be Unicode; a Unicode string, as a data type, has no encoding.
> 
> Huh?

It's this very fact that trips up JMF in his rants about FSR.  Thank you
to Ethan for putting it so succinctly.

What part of his statement are you saying "Huh?" about?



#72852

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2014-06-06 18:32 +0300
Message-ID: <87a99q5a08.fsf@elektro.pacujo.net>
In reply to: #72850
Michael Torrie <torriem@gmail.com>:

> On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
>> Ethan Furman <ethan@stoneleaf.us>:
>>> ASCII is *not* the state of "this string has no encoding" -- that
>>> would be Unicode; a Unicode string, as a data type, has no encoding.
>> 
>> Huh?
>
> [...]
>
> What part of his statement are you saying "Huh?" about?

Unicode, like ASCII, is a code. Representing text in unicode is
encoding.


Marko



#72857

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-06-07 01:50 +1000
Message-ID: <mailman.10820.1402069852.18130.python-list@python.org>
In reply to: #72857's parent #72852
On Sat, Jun 7, 2014 at 1:32 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Michael Torrie <torriem@gmail.com>:
>
>> On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
>>> Ethan Furman <ethan@stoneleaf.us>:
>>>> ASCII is *not* the state of "this string has no encoding" -- that
>>>> would be Unicode; a Unicode string, as a data type, has no encoding.
>>>
>>> Huh?
>>
>> [...]
>>
>> What part of his statement are you saying "Huh?" about?
>
> Unicode, like ASCII, is a code. Representing text in unicode is
> encoding.

Yes and no. "ASCII" means two things: Firstly, it's a mapping from the
letter A to the number 65, from the exclamation mark to 33, from the
backslash to 92, and so on. And secondly, it's an encoding of those
numbers into the lowest seven bits of a byte, with the high byte left
clear. Between those two, you get a means of representing the letter
'A' as the byte 0x41, and one of them is an encoding.

"Unicode", on the other hand, is only the first part. It maps all the
same characters to the same numbers that ASCII does, and then adds a
few more... a few followed by a few, followed by... okay, quite a lot
more. Unicode specifies that the character OK HAND SIGN, which looks
like 👌 if you have the right font, is number 1F44C in hex (128076
decimal). This is the "Universal Character Set" or UCS.

ASCII could specify a single encoding, because that encoding makes
sense for nearly all purposes. (There are times when you transmit
ASCII text and use the high bit to mean something else, like parity or
"this is the end of a word" or something, but even then, you follow
the same convention of packing a number into the low seven bits of a
byte.) Unicode can't, because there are many different pros and cons
to the different encodings, and so we have UCS Transformation Formats
like UTF-8 and UTF-32. Each one is an encoding that maps a codepoint
to a sequence of bytes.

You can't represent text in "Unicode" in a computer. Somewhere along
the way, you have to figure out how to store those codepoints as
bytes, or something more concrete (you could, for instance, use a
Python list of Python integers; I can't say that it would be in any
way more efficient than alternatives, but it would be plausible); and
that's the encoding.
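The two layers described above can be shown in a few lines of Python 3. This is only an illustrative sketch. The variable names are made up, and the big-endian UTF-16 and UTF-32 variants are chosen so that no byte-order mark is added.

```python
ok_hand = "\U0001F44C"  # OK HAND SIGN

# Layer 1: characters mapped to numbers (code points).
code_points = [ord(c) for c in "A!\\" + ok_hand]
print(code_points)

# Layer 2: the same code point stored as bytes under different encodings.
utf8 = ok_hand.encode("utf-8")
utf16 = ok_hand.encode("utf-16-be")
utf32 = ok_hand.encode("utf-32-be")
print(utf8.hex(), utf16.hex(), utf32.hex())

# ASCII bundles both layers: 'A' is number 65 and is stored as byte 0x41.
ascii_a = "A".encode("ascii")
print(ascii_a)
```

The list of integers in `code_points` is the "Python list of Python integers" representation mentioned above. The three byte strings are different encodings of the single code point U+1F44C.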

ChrisA



#72862

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2014-06-06 20:02 +0300
Message-ID: <8761ke55u0.fsf@elektro.pacujo.net>
In reply to: #72857
Chris Angelico <rosuav@gmail.com>:

> "ASCII" means two things: Firstly, it's a mapping from the letter A to
> the number 65, from the exclamation mark to 33, from the backslash to
> 92, and so on. And secondly, it's an encoding of those numbers into
> the lowest seven bits of a byte, with the high byte left clear.
> Between those two, you get a means of representing the letter 'A' as
> the byte 0x41, and one of them is an encoding.

   The American Standard Code for Information Interchange [...] is a
   character-encoding scheme [...] <URL:
   http://en.wikipedia.org/wiki/ASCII>

> "Unicode", on the other hand, is only the first part. It maps all the
> same characters to the same numbers that ASCII does, and then adds a
> few more... a few followed by a few, followed by... okay, quite a lot
> more. Unicode specifies that the character OK HAND SIGN, which looks
> like 👌 if you have the right font, is number 1F44C in hex (128076
> decimal). This is the "Universal Character Set" or UCS.

   Unicode is a computing industry standard for the consistent encoding,
   representation and handling of text [...] <URL:
   http://en.wikipedia.org/wiki/Unicode>

Each standard assigns numbers to letters and other symbols. In a word,
each is a code. That's what their names say, too.


Marko



#72866

From: Rustom Mody <rustompmody@gmail.com>
Date: 2014-06-06 10:13 -0700
Message-ID: <df391e39-b6dd-46ac-b84b-01fefce9278d@googlegroups.com>
In reply to: #72862
On Friday, June 6, 2014 10:32:47 PM UTC+5:30, Marko Rauhamaa wrote:
> Chris Angelico :

> > "ASCII" means two things: Firstly, it's a mapping from the letter A to
> > the number 65, from the exclamation mark to 33, from the backslash to
> > 92, and so on. And secondly, it's an encoding of those numbers into
> > the lowest seven bits of a byte, with the high byte left clear.
> > Between those two, you get a means of representing the letter 'A' as
> > the byte 0x41, and one of them is an encoding.

>    The American Standard Code for Information Interchange [...] is a
>    character-encoding scheme [...] <URL:

And a similar argument to this is seen on that page's talk page!
http://en.wikipedia.org/wiki/Talk:ASCII#Character_set_vs._Character_encoding.3F



#72870

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-06-07 03:26 +1000
Message-ID: <mailman.10826.1402075604.18130.python-list@python.org>
In reply to: #72866
On Sat, Jun 7, 2014 at 3:13 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> On Friday, June 6, 2014 10:32:47 PM UTC+5:30, Marko Rauhamaa wrote:
>> Chris Angelico :
>
>> > "ASCII" means two things: Firstly, it's a mapping from the letter A to
>> > the number 65, from the exclamation mark to 33, from the backslash to
>> > 92, and so on. And secondly, it's an encoding of those numbers into
>> > the lowest seven bits of a byte, with the high byte left clear.
>> > Between those two, you get a means of representing the letter 'A' as
>> > the byte 0x41, and one of them is an encoding.
>
>>    The American Standard Code for Information Interchange [...] is a
>>    character-encoding scheme [...] <URL:
>
> And a similar argument to this is seen on that page's talk page!
> http://en.wikipedia.org/wiki/Talk:ASCII#Character_set_vs._Character_encoding.3F

Which proves that Wikipedia is exactly as reliable as a mailing list.

ChrisA



#72876

From: wxjmfauth@gmail.com
Date: 2014-06-06 11:03 -0700
Message-ID: <9948659f-9737-4f4f-bc2f-b765c40cb17b@googlegroups.com>
In reply to: #72857
Le vendredi 6 juin 2014 17:50:50 UTC+2, Chris Angelico a écrit :
> 
> byte.) Unicode can't, because there are many different pros and cons
> 
> to the different encodings, and so we have UCS Transformation Formats
> 
> like UTF-8 and UTF-32. Each one is an encoding that maps a codepoint
> 
> to a sequence of bytes.
> 

A big NO.

jmf



#72886

From: Denis McMahon <denismfmcmahon@gmail.com>
Date: 2014-06-06 21:18 +0000
Message-ID: <lmtb74$ofa$2@dont-email.me>
In reply to: #72857
On Sat, 07 Jun 2014 01:50:50 +1000, Chris Angelico wrote:

> Yes and no. "ASCII" means two things:

ASCII means: American Standard Code for Information Interchange aka ASA 
Standard X3.4-1963

> into the lowest seven bits of a byte, with the high byte left clear.

high BIT left clear.

-- 
Denis McMahon, denismfmcmahon@gmail.com



#72887

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-06-07 08:18 +1000
Message-ID: <mailman.10834.1402093110.18130.python-list@python.org>
In reply to: #72886
On Sat, Jun 7, 2014 at 7:18 AM, Denis McMahon <denismfmcmahon@gmail.com> wrote:
>> into the lowest seven bits of a byte, with the high byte left clear.
>
> high BIT left clear.

That thing. Unless you have bytes inside bytes (byteception?), you'll
only have room for one high bit. Some day I'll get my brain and my
fingers to agree on everything we do... but that day is not today.

ChrisA



#72858

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-06-06 15:57 +0000
Message-ID: <5391e4fe$0$29988$c3e8da3$5496439d@news.astraweb.com>
In reply to: #72852
On Fri, 06 Jun 2014 18:32:39 +0300, Marko Rauhamaa wrote:

> Michael Torrie <torriem@gmail.com>:
> 
>> On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
>>> Ethan Furman <ethan@stoneleaf.us>:
>>>> ASCII is *not* the state of "this string has no encoding" -- that
>>>> would be Unicode; a Unicode string, as a data type, has no encoding.
>>> 
>>> Huh?
>>
>> [...]
>>
>> What part of his statement are you saying "Huh?" about?
> 
> Unicode, like ASCII, is a code. Representing text in unicode is
> encoding.

A Unicode string as an abstract data type has no encoding. It is a 
Platonic ideal, a pure form like the real numbers. There are no bytes, no 
bits, just code points. That is what Ethan means. A Unicode string like 
this:

s = u"NOBODY expects the Spanish Inquisition!"

should not be thought of as a bunch of bytes in some encoding, but as an 
array of code points. Eventually the abstraction will leak, all 
abstractions do, but not for a very long time.
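
A minimal sketch of that distinction, assuming Python 3 (the variable names
other than s are illustrative only): the string is indexed by code point, and
bytes only appear once an encoding is chosen.

```python
# Python 3 sketch: a str is a sequence of code points, not bytes.
s = "NOBODY expects the Spanish Inquisition!"

# Indexing yields a character; ord() gives its code point.
first_code_point = ord(s[0])          # U+004E, the letter 'N'

# Bytes only exist after picking an encoding, and they differ per encoding.
utf8_bytes = s.encode("utf-8")
utf32_bytes = s.encode("utf-32-le")   # explicit byte order, so no BOM is added

print(len(s), len(utf8_bytes), len(utf32_bytes))
```

Because this particular string is pure ASCII, the UTF-8 form has one byte per
code point, while UTF-32 always uses four.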


-- 
Steven D'Aprano
http://import-that.dreamwidth.org/



#72859

From: Rustom Mody <rustompmody@gmail.com>
Date: 2014-06-06 09:21 -0700
Message-ID: <ca66f285-15af-4542-96df-87f9794a1cd8@googlegroups.com>
In reply to: #72858
On Friday, June 6, 2014 9:27:51 PM UTC+5:30, Steven D'Aprano wrote:
> On Fri, 06 Jun 2014 18:32:39 +0300, Marko Rauhamaa wrote:

> > Michael Torrie:
> >> On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
> >>> Ethan Furman :
> >>>> ASCII is *not* the state of "this string has no encoding" -- that
> >>>> would be Unicode; a Unicode string, as a data type, has no encoding.
> >>> Huh?
> >> [...]
> >> What part of his statement are you saying "Huh?" about?
> > Unicode, like ASCII, is a code. Representing text in unicode is
> > encoding.

> A Unicode string as an abstract data type has no encoding. It is a 
> Platonic ideal, a pure form like the real numbers. There are no bytes, no 
> bits, just code points. That is what Ethan means. A Unicode string like 
> this:

> s = u"NOBODY expects the Spanish Inquisition!"

> should not be thought of as a bunch of bytes in some encoding, but as an 
> array of code points. Eventually the abstraction will leak, all 
> abstractions do, but not for a very long time.

"Should not be thought of": yes, that's the Python 3 world view.
It is not even the Python 2 world view,
and it is very far from the classic Unix world view.

As Ned Batchelder says in Unipain (http://nedbatchelder.com/text/unipain.html),
programmers should use the 'unicode sandwich' to avoid 'unipain':

Bytes on the outside, Unicode on the inside, encode/decode at the edges.

The discussion here is precisely about these edges

Combine that with Chris':

> Yes and no. "ASCII" means two things: Firstly, it's a mapping from the
> letter A to the number 65, from the exclamation mark to 33, from the
> backslash to 92, and so on. And secondly, it's an encoding of those
> numbers into the lowest seven bits of a byte, with the high byte left
> clear. Between those two, you get a means of representing the letter
> 'A' as the byte 0x41, and one of them is an encoding.

and the situation appears quite the opposite of Ethan's description:

In the 'old world' ASCII was both mapping and encoding and so there was 
never a justification to distinguish encoding from codepoint.

It is unicode that demands these distinctions.

If we could magically go to a world where the number of bits in a byte was 32
all this headache would go away. [Actually just 21 is enough!]
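
The "21 is enough" figure can be checked with a short sketch, assuming
Python 3.3 or later (where sys.maxunicode is always 0x10FFFF); the sample
characters and variable names are arbitrary choices for illustration.

```python
import sys

# The highest Unicode code point is U+10FFFF.
highest = sys.maxunicode
bits_needed = highest.bit_length()

# Compare how many bytes each encoding spends per code point.
samples = ["A", "\u00e9", "\u20ac", "\U0001F40D"]
utf8_widths = [len(ch.encode("utf-8")) for ch in samples]
utf32_widths = [len(ch.encode("utf-32-le")) for ch in samples]

print(hex(highest), bits_needed)
print(utf8_widths, utf32_widths)
```

UTF-32 is the closest real-world analogue of the "32-bit byte": a fixed four
bytes per code point, at the cost of space for mostly-ASCII text.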



#72860

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-06-07 02:48 +1000
Message-ID: <mailman.10821.1402073331.18130.python-list@python.org>
In reply to: #72859
On Sat, Jun 7, 2014 at 2:21 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> Combine that with Chris':
>
>> Yes and no. "ASCII" means two things: Firstly, it's a mapping from the
>> letter A to the number 65, from the exclamation mark to 33, from the
>> backslash to 92, and so on. And secondly, it's an encoding of those
>> numbers into the lowest seven bits of a byte, with the high byte left
>> clear. Between those two, you get a means of representing the letter
>> 'A' as the byte 0x41, and one of them is an encoding.
>
> and the situation appears quite the opposite of Ethan's description:
>
> In the 'old world' ASCII was both mapping and encoding and so there was
> never a justification to distinguish encoding from codepoint.
>
> It is unicode that demands these distinctions.
>
> If we could magically go to a world where the number of bits in a byte was 32
> all this headache would go away. [Actually just 21 is enough!]

An ASCII mentality lets you be sloppy. That doesn't mean the
distinction doesn't exist. When I first started programming in C, int
was *always* 16 bits long and *always* little-endian (because I used
only one compiler). I could pretend that those bits in memory actually
were that integer, that there were no other ways that integer could be
encoded. That doesn't mean that encodings weren't important. And as
soon as I started working on a 32-bit OS/2 system, and my ints became
bigger, I had to concern myself with that. Even more so when I got
into networking, and byte order became important to me. And of course,
these days I work with integers that are encoded in all sorts of
different ways (a Python integer isn't just a puddle of bytes in
memory), and I generally let someone else take care of the details,
but the encodings are still there.
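
A rough illustration of the integer case, assuming Python 3 (the value 1000
and the variable names are arbitrary): the same abstract integer has several
byte-level encodings, and the width and byte order have to be chosen.

```python
import struct

n = 1000  # 0x03E8 as an abstract integer

# Two bytes, in each byte order.
little_16 = n.to_bytes(2, "little")
big_16 = n.to_bytes(2, "big")

# Four bytes in network (big-endian) order, as used on the wire.
network_32 = struct.pack("!I", n)

print(little_16, big_16, network_32)

# Decoding needs the same byte order that was used for encoding.
round_trip = int.from_bytes(little_16, "little")
misread = int.from_bytes(little_16, "big")
print(round_trip, misread)
```

Reading the little-endian bytes as big-endian silently yields a different
number, which is the integer analogue of decoding text with the wrong codec.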

ASCII was once your one companion, it was all that mattered. ASCII was
once a friendly encoding, then your world was shattered. Wishing it
were somehow here again, wishing it were somehow near... sometimes it
seemed, if you just dreamed, somehow it would be here! Wishing you
could use just bytes again, knowing that you never would... dreaming
of it won't help you to do all that you dream you could!

It's time to stop chasing the phantom and start living in the Raoul
world... err, the real world. :)

ChrisA



#72863

From: Rustom Mody <rustompmody@gmail.com>
Date: 2014-06-06 10:04 -0700
Message-ID: <57ed797e-1ed5-4c52-9fbb-b700615852d2@googlegroups.com>
In reply to: #72860
On Friday, June 6, 2014 10:18:41 PM UTC+5:30, Chris Angelico wrote:
> On Sat, Jun 7, 2014 at 2:21 AM, Rustom Mody  wrote:
> > Combine that with Chris':
> >> Yes and no. "ASCII" means two things: Firstly, it's a mapping from the
> >> letter A to the number 65, from the exclamation mark to 33, from the
> >> backslash to 92, and so on. And secondly, it's an encoding of those
> >> numbers into the lowest seven bits of a byte, with the high byte left
> >> clear. Between those two, you get a means of representing the letter
> >> 'A' as the byte 0x41, and one of them is an encoding.
> > and the situation appears quite the opposite of Ethan's description:
> > In the 'old world' ASCII was both mapping and encoding and so there was
> > never a justification to distinguish encoding from codepoint.
> > It is unicode that demands these distinctions.
> > If we could magically go to a world where the number of bits in a byte was 32
> > all this headache would go away. [Actually just 21 is enough!]

> An ASCII mentality lets you be sloppy. That doesn't mean the
> distinction doesn't exist. When I first started programming in C, int
> was *always* 16 bits long and *always* little-endian (because I used
> only one compiler). I could pretend that those bits in memory actually
> were that integer, that there were no other ways that integer could be
> encoded. That doesn't mean that encodings weren't important. And as
> soon as I started working on a 32-bit OS/2 system, and my ints became
> bigger, I had to concern myself with that. Even more so when I got
> into networking, and byte order became important to me. And of course,
> these days I work with integers that are encoded in all sorts of
> different ways (a Python integer isn't just a puddle of bytes in
> memory), and I generally let someone else take care of the details,
> but the encodings are still there.

> ASCII was once your one companion, it was all that mattered. ASCII was
> once a friendly encoding, then your world was shattered. Wishing it
> were somehow here again, wishing it were somehow near... sometimes it
> seemed, if you just dreamed, somehow it would be here! Wishing you
> could use just bytes again, knowing that you never would... dreaming
> of it won't help you to do all that you dream you could!

> It's time to stop chasing the phantom and start living in the Raoul
> world... err, the real world. :)

I thought that "If only bytes were 21+ bits wide" would sound sufficiently 
nonsensical, that I did not need to explicitly qualify it as a utopian dream!
