| Started by | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| First post | 2014-05-31 17:10 +0100 |
| Last post | 2014-06-03 14:22 -0400 |
| Articles | 20 on this page of 92 — 19 participants |
Python 3.2 has some deadly infection Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-05-31 17:10 +0100
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-05-31 22:55 +0300
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-01 02:26 +0000
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-01 12:43 +1000
Re: Python 3.2 has some deadly infection Tim Delaney <timothy.c.delaney@gmail.com> - 2014-06-02 08:54 +1000
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-02 01:14 +0000
Re: Python 3.2 has some deadly infection Tim Delaney <timothy.c.delaney@gmail.com> - 2014-06-02 12:23 +1000
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-01 19:46 -0700
Re: Python 3.2 has some deadly infection Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2014-06-02 07:45 +0000
Re: Python 3.2 has some deadly infection Tim Delaney <timothy.c.delaney@gmail.com> - 2014-06-02 19:02 +1000
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-02 19:14 +1000
Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-02 12:10 +0100
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-03 16:34 +0000
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-04 02:43 +1000
Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-02 17:34 -0400
Re: Python 3.2 has some deadly infection Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-06-03 17:16 +1200
Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-03 02:21 -0400
Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-03 15:18 +0100
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-04 13:08 +0000
Re: Python 3.2 has some deadly infection Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-06-05 14:01 +1200
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 10:16 +0300
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-05 17:30 +1000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 11:05 +0300
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-05 18:36 +1000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 12:53 +0300
Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-05 05:43 -0700
Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 14:50 -0400
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 23:21 +0300
Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 18:09 -0400
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 23:13 +0000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 02:30 +0300
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 09:39 +1000
Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 22:08 -0400
Re: Python 3.2 has some deadly infection Ethan Furman <ethan@stoneleaf.us> - 2014-06-05 20:47 -0700
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve@pearwood.info> - 2014-06-05 08:34 +0000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 12:41 +0300
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 06:37 -0700
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 17:45 +0300
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 15:33 +0000
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 02:12 +1000
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 09:54 -0700
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 03:36 +1000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 19:52 +0300
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 03:28 +1000
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 15:35 -0700
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 08:52 +1000
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 20:11 -0700
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 13:20 +1000
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-05 20:32 -0700
Re: Python 3.2 has some deadly infection Akira Li <4kir4.1i@gmail.com> - 2014-06-06 12:03 +0400
Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-05 16:37 +0100
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 16:16 +0000
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 01:50 +1000
Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-05 17:17 +0100
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 16:32 +0000
Re: Python 3.2 has some deadly infection Ethan Furman <ethan@stoneleaf.us> - 2014-06-06 07:40 -0700
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-06 03:14 +1000
Re: Python 3.2 has some deadly infection Ian Kelly <ian.g.kelly@gmail.com> - 2014-06-05 11:16 -0600
Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-05 14:11 -0400
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-05 21:30 +0300
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-05 23:02 +0000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 02:21 +0300
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-06 12:15 +0000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 16:00 +0300
Re: Python 3.2 has some deadly infection rurpy@yahoo.com - 2014-06-07 21:34 -0700
Re: Python 3.2 has some deadly infection Ethan Furman <ethan@stoneleaf.us> - 2014-06-06 06:24 -0700
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 17:10 +0300
Re: Python 3.2 has some deadly infection Michael Torrie <torriem@gmail.com> - 2014-06-06 09:02 -0600
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 18:32 +0300
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 01:50 +1000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 20:02 +0300
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-06 10:13 -0700
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 03:26 +1000
Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-06 11:03 -0700
Re: Python 3.2 has some deadly infection Denis McMahon <denismfmcmahon@gmail.com> - 2014-06-06 21:18 +0000
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 08:18 +1000
Re: Python 3.2 has some deadly infection Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-06 15:57 +0000
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-06 09:21 -0700
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 02:48 +1000
Re: Python 3.2 has some deadly infection Rustom Mody <rustompmody@gmail.com> - 2014-06-06 10:04 -0700
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 03:12 +1000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 20:11 +0300
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 03:16 +1000
Re: Python 3.2 has some deadly infection Marko Rauhamaa <marko@pacujo.net> - 2014-06-06 20:18 +0300
Re: Python 3.2 has some deadly infection Ned Batchelder <ned@nedbatchelder.com> - 2014-06-06 13:33 -0400
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-07 01:25 +1000
Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-06 08:44 -0700
Re: Python 3.2 has some deadly infection wxjmfauth@gmail.com - 2014-06-06 08:48 -0700
Re: Python 3.2 has some deadly infection Robin Becker <robin@reportlab.com> - 2014-06-06 12:56 +0100
Re: Python 3.2 has some deadly infection Akira Li <4kir4.1i@gmail.com> - 2014-06-05 06:49 +0400
Re: Python 3.2 has some deadly infection Chris Angelico <rosuav@gmail.com> - 2014-06-04 00:25 +1000
Re: Python 3.2 has some deadly infection Terry Reedy <tjreedy@udel.edu> - 2014-06-03 14:22 -0400
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-06-05 23:02 +0000 |
| Message-ID | <5390f715$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #72744 |
On Thu, 05 Jun 2014 21:30:11 +0300, Marko Rauhamaa wrote:

> Terry Reedy <tjreedy@udel.edu>:
>
>> Different OSes *do* have different assumptions. Both MacOSX and current
>> Windows use (UCS-2 or) UTF-16 for text.
>
> Linux can use anything for text; UTF-8 has become a de-facto standard.
>
> How text is represented is very different from whether text is a
> fundamental data type. A fundamental text file is such that ordinary
> operating system facilities can't see inside the black box (that is,
> they are *not* encoded as far as the applications go).

Wait, are they black-boxes to the *operating system* or to
*applications*? They aren't the same thing.

In any case, I reject your premise. ALL data types are constructed on
top of bytes, and so long as you allow applications *any way* to coerce
data types to different data types, you allow them to see "inside the
black box". I can extract the four bytes from a C long integer, but that
doesn't mean that C longs aren't fundamental data types in Unix/Linux.

> I have no idea how opaque text files are in Windows or OS-X.

Exactly as opaque as they are in Unix, which is to say not at all. Just
open the file in binary mode, and voilà you see the underlying bytes.

All you're doing is pointing out that, in modern electronic computers,
the fundamental data structure which underlies all others (the
indivisible protons and neutrons, so to speak, only there are 256 of
them rather than 2) is the byte. We know this, and don't dispute it.

(Like protons and neutrons, we can see inside bytes to the quark-like
bits that make up bytes. Like quarks, bits do not exist in isolation,
but only inside bytes.)

>> For Windows, at least, the interface is much improved in Python 3.
>
> Yes, I get the feeling that Python is reaching out to Windows and OS-X
> and trying to make linux look like them.

Unicode support in OS-X (I have been assured) is very good, probably
better than Linux. Apple has very high standards when it comes to their
apps, and provides rich Unicode-aware APIs.

But Linux Unicode support is much better than Windows. Unicode support
in Windows is crippled by continued reliance on legacy code pages, and
by the assumption deep inside the Windows APIs that Unicode means "16
bit characters". See, for example, the amount of space spent on fixing
Windows Unicode handling here:

http://www.utf8everywhere.org/

--
Steven D'Aprano
http://import-that.dreamwidth.org/
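The two claims above (a text file opened in binary mode shows its bytes, and the bytes of a fixed-width integer can be extracted) can be sketched in Python. This is only an illustration, not something posted in the thread: the file name `sample.txt`, the sample string, and the choice of a 4-byte little-endian layout are assumptions made for the example.

```python
import os
import struct
import tempfile

text = "voilà"  # sample text chosen for this sketch

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # Write the string as text, with an explicit encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Re-open the same file in binary mode: no decoding, just bytes.
    with open(path, "rb") as f:
        raw = f.read()

print(raw)  # b'voil\xc3\xa0'  ("à" occupies two bytes in UTF-8)

# The same "look inside" trick for an integer: pack the value 1 as a
# 4-byte little-endian signed integer (struct's standard-size "l").
int_bytes = struct.pack("<l", 1)
print(int_bytes)  # b'\x01\x00\x00\x00'
```

Neither result says anything about whether `str` or `int` is a fundamental type in the language; it only shows that a byte-level view is available when asked for.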
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2014-06-06 02:21 +0300 |
| Message-ID | <87mwdr54dp.fsf@elektro.pacujo.net> |
| In reply to | #72782 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info>:

> In any case, I reject your premise. ALL data types are constructed on
> top of bytes,

Only in a very dull sense.

> and so long as you allow applications *any way* to coerce data types
> to different data types, you allow them to see "inside the black box".

I can't see the bytes inside Python objects, including strings, and
that's how it is supposed to be.

Similarly, I can't (easily) see how files are laid out on hard disks.
That's a true abstraction. Nothing in linux presents data, though,
except through bytes.


Marko
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-06-06 12:15 +0000 |
| Message-ID | <5391b0d1$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #72789 |
On Fri, 06 Jun 2014 02:21:54 +0300, Marko Rauhamaa wrote:

> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>
>> In any case, I reject your premise. ALL data types are constructed on
>> top of bytes,
>
> Only in a very dull sense.

I agree with you that this is a very dull, unimportant sense. And I
think its dullness applies equally to the situation you somehow think
is meaningfully exciting:

Text is made of bytes! If you squint, you can see those bytes!
Therefore text is not a first class data type!!!

To which my answer is, yes text is made of bytes, yes, you can expose
those bytes, and no your conclusion doesn't follow.

>> and so long as you allow applications *any way* to coerce data types to
>> different data types, you allow them to see "inside the black box".
>
> I can't see the bytes inside Python objects, including strings, and
> that's how it is supposed to be.

That's because Python the language doesn't allow you to coerce types to
other types, except possibly through its interface to the underlying C
implementation, ctypes. But Python allows you to write extensions in C,
and that gives you the full power to take any data structure and turn
it into any other data structure. Even bytes.

> Similarly, I can't (easily) see how files are laid out on hard disks.
> That's a true abstraction. Nothing in linux presents data, though,
> except through bytes.

Incorrect. Linux presents data as text all the time. Look at the prompt:
it's treated as text, not numbers. You type commands using a text
interface. The commands are made of words like ls, dd and ps, not
numbers like 0x6C73, 0x6464 and 0x7073. Applications like grep are based
on line-based files, and "line" is a text concept, not a byte concept.

Consider:

[steve@ando ~]$ echo -e '\x41\x42\x43'
ABC

The assumption of *text* is so strong in the echo application that by
default you cannot enter numeric escapes at all. Without the -e switch,
echo assumes that numeric escapes represent themselves as character
literals:

[steve@ando ~]$ echo '\x41\x42\x43'
\x41\x42\x43

--
Steven D'Aprano
http://import-that.dreamwidth.org/
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2014-06-06 16:00 +0300 |
| Message-ID | <877g4utcpo.fsf@elektro.pacujo.net> |
| In reply to | #72839 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info>:

> Incorrect. Linux presents data as text all the time. Look at the prompt:
> it's treated as text, not numbers.

Of course there is a textual human interface. However, from the point
of view of virtually every OS component, it's bytes.

> Consider:
>
> [steve@ando ~]$ echo -e '\x41\x42\x43'
> ABC

"echo" doesn't know it's emitting text. It would be perfectly happy to
emit binary gibberish. The output goes to the pty which doesn't care
about the textual interpretation, either. Finally, the terminal
(emulation program) translates the incoming bytes to textual glyphs to
the best of its capabilities.

Anyway, what interests me mostly is that I routinely build programs and
systems that talk to each other over files, pipes, sockets and devices.
I really need to micromanage that data. I'm fine with encoding text if
that's the suitable interpretation. I just think Python is overreaching
by making the text interpretation the default for the standard streams
and files and guessing the correct encoding.

Note that subprocess.Popen() wisely assumes binary pipes. Unfortunately
the subprocess might be a python program that opens the standard
streams in the text mode...


Marko
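The point about binary pipes can be illustrated with a small sketch. It assumes Python 3, where `subprocess.Popen` pipes carry `bytes` unless a text mode is requested, and where `sys.stdout.buffer` is the binary layer underneath the text-mode `sys.stdout`. The child program and its byte values are invented for this example.

```python
import subprocess
import sys

# Hypothetical child program: it bypasses the text layer of sys.stdout
# and writes four raw bytes, two of which are not valid UTF-8 text.
child_code = "import sys; sys.stdout.buffer.write(bytes([0, 255, 65, 10]))"

# No text=True / encoding= argument, so the pipe is binary.
proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
)
out, _ = proc.communicate()

print(type(out).__name__)  # bytes
print(out)                 # b'\x00\xffA\n'
```

A Python child that needs byte-exact standard streams can keep using `sys.stdin.buffer` and `sys.stdout.buffer`; the text interpretation is a default, not the only interface.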
| From | rurpy@yahoo.com |
|---|---|
| Date | 2014-06-07 21:34 -0700 |
| Message-ID | <545ec7b2-635e-462e-91a2-520de5f2f782@googlegroups.com> |
| In reply to | #72782 |
On 06/05/2014 05:02 PM, Steven D'Aprano wrote:
>[...]
> But Linux Unicode support is much better than Windows. Unicode support in
> Windows is crippled by continued reliance on legacy code pages, and by
> the assumption deep inside the Windows APIs that Unicode means "16 bit
> characters". See, for example, the amount of space spent on fixing
> Windows Unicode handling here:
>
> http://www.utf8everywhere.org/
While not disagreeing with the general premise of that page, it
has some problems that raise doubts in my mind about taking everything
the author says at face value.
For example
"Q: Why would the Asians give up on UTF-16 encoding, which saves
them 50% the memory per character?"
[...] in fact UTF-8 is used just as often in those [Asian] countries.
That is not my experience, at least for Japan. See my comments in
https://mail.python.org/pipermail/python-ideas/2012-June/015429.html
where I show that utf8 files are a tiny minority of the text files
found by Google.
He then gives a table with the size of utf8 and utf16 encoded contents
(ie stripped of html stuff) of an unnamed Japanese wikipedia page to
show that even without a lot of (html-mandated) ascii, the space savings
are not very much compared to the theoretical "50%" savings he stated:
" Dense text (Δ UTF-8)
UTF-8 ... 222 KB (0%)
UTF-16 ... 176 KB (−21%)"
Note that he calculates the space saving as (utf8-utf16)/utf8.
Yet by that metric the theoretical saving is *NOT* 50%, it is 33%.
For example 1000 Japanese characters will use 2000 bytes in utf16
and 3000 in utf8.
I did the same test using
http://ja.wikipedia.org/wiki/%E7%B9%94%E7%94%B0%E4%BF%A1%E9%95%B7
I stripped html tags, javascript and redundant ascii whitespace characters.
The stripped utf-8 file was 164946 bytes, the utf-16 encoded version of
same was 117756. That gives (using the (utf8-utf16)/utf16 metric he used
to claim 50% idealized savings) 40% which is quite a bit closer to the
idealized 50% than his 21%.
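The arithmetic above can be checked with a short sketch. The byte counts are the ones quoted in this post; the helper name `saving` and the 1000-character sample string are made up for the illustration.

```python
def saving(utf8_size, utf16_size, relative_to):
    """Space saved by UTF-16 over UTF-8, as a fraction of `relative_to`."""
    return (utf8_size - utf16_size) / relative_to

# Idealized case: 1000 Japanese characters, each 3 bytes in UTF-8 and
# 2 bytes in UTF-16 ("utf-16-le" is used so that no BOM is counted).
sample = "あ" * 1000
u8 = len(sample.encode("utf-8"))       # 3000
u16 = len(sample.encode("utf-16-le"))  # 2000
print(round(saving(u8, u16, u8) * 100))   # 33  (relative to UTF-8)
print(round(saving(u8, u16, u16) * 100))  # 50  (relative to UTF-16)

# The page's "dense text" table, in KB, relative to UTF-8.
print(round(saving(222, 176, 222) * 100))  # 21

# The stripped Wikipedia page measured above, in bytes.
print(round(saving(164946, 117756, 117756) * 100))  # 40  (relative to UTF-16)
print(round(saving(164946, 117756, 164946) * 100))  # 29  (relative to UTF-8)
```

The two denominators give quite different percentages for the same pair of files, which is why the 50% and 21% figures are not directly comparable.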
I would have more faith in his opinions about things I don't know
about (such as unicode programming on Windows) if his other info
were more trustworthy. IOW, just because it's on the internet doesn't
mean it's true.
| From | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| Date | 2014-06-06 06:24 -0700 |
| Message-ID | <mailman.10815.1402062359.18130.python-list@python.org> |
| In reply to | #72744 |
On 06/05/2014 11:30 AM, Marko Rauhamaa wrote:
>
> How text is represented is very different from whether text is a
> fundamental data type. A fundamental text file is such that ordinary
> operating system facilities can't see inside the black box (that is,
> they are *not* encoded as far as the applications go).

Of course they are. It may be an ASCII-encoding of some flavor or
other, or something really (to me) strange -- but an encoding is most
assuredly in effect.

ASCII is *not* the state of "this string has no encoding" -- that would
be Unicode; a Unicode string, as a data type, has no encoding. To
transport it, store it, etc., it must (usually?) be encoded into
something -- utf-8, ASCII, turkish, or whatever subset is agreed upon
and will hopefully contain all the Unicode characters needed for the
string to be properly represented.

The realization that ASCII was, in fact, an encoding was a big paradigm
shift for me, but a necessary one.

--
~Ethan~
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2014-06-06 17:10 +0300 |
| Message-ID | <87egz25dsd.fsf@elektro.pacujo.net> |
| In reply to | #72845 |
Ethan Furman <ethan@stoneleaf.us>:

> On 06/05/2014 11:30 AM, Marko Rauhamaa wrote:
>> A fundamental text file is such that ordinary operating system
>> facilities can't see inside the black box (that is, they are *not*
>> encoded as far as the applications go).
>
> Of course they are.

How would you know?

> It may be an ASCII-encoding of some flavor or other, or something
> really (to me) strange -- but an encoding is most assuredly in effect.

Outside metaphysics, that statement is only meaningful if you have
access to the encoding.

> ASCII is *not* the state of "this string has no encoding" -- that
> would be Unicode; a Unicode string, as a data type, has no encoding.

Huh?


Marko
| From | Michael Torrie <torriem@gmail.com> |
|---|---|
| Date | 2014-06-06 09:02 -0600 |
| Message-ID | <mailman.10818.1402066977.18130.python-list@python.org> |
| In reply to | #72846 |
On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
> Ethan Furman <ethan@stoneleaf.us>:
>> ASCII is *not* the state of "this string has no encoding" -- that
>> would be Unicode; a Unicode string, as a data type, has no encoding.
>
> Huh?

It's this very fact that trips up JMF in his rants about FSR. Thank you
to Ethan for putting it so succinctly.

What part of his statement are you saying "Huh?" about?
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2014-06-06 18:32 +0300 |
| Message-ID | <87a99q5a08.fsf@elektro.pacujo.net> |
| In reply to | #72850 |
Michael Torrie <torriem@gmail.com>:

> On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
>> Ethan Furman <ethan@stoneleaf.us>:
>>> ASCII is *not* the state of "this string has no encoding" -- that
>>> would be Unicode; a Unicode string, as a data type, has no encoding.
>>
>> Huh?
>
> [...]
>
> What part of his statement are you saying "Huh?" about?

Unicode, like ASCII, is a code. Representing text in unicode is
encoding.


Marko
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-06-07 01:50 +1000 |
| Message-ID | <mailman.10820.1402069852.18130.python-list@python.org> |
| In reply to | #72852 |
On Sat, Jun 7, 2014 at 1:32 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Michael Torrie <torriem@gmail.com>:
>
>> On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
>>> Ethan Furman <ethan@stoneleaf.us>:
>>>> ASCII is *not* the state of "this string has no encoding" -- that
>>>> would be Unicode; a Unicode string, as a data type, has no encoding.
>>>
>>> Huh?
>>
>> [...]
>>
>> What part of his statement are you saying "Huh?" about?
>
> Unicode, like ASCII, is a code. Representing text in unicode is
> encoding.

Yes and no. "ASCII" means two things: Firstly, it's a mapping from the
letter A to the number 65, from the exclamation mark to 33, from the
backslash to 92, and so on. And secondly, it's an encoding of those
numbers into the lowest seven bits of a byte, with the high byte left
clear. Between those two, you get a means of representing the letter
'A' as the byte 0x41, and one of them is an encoding.

"Unicode", on the other hand, is only the first part. It maps all the
same characters to the same numbers that ASCII does, and then adds a
few more... a few followed by a few, followed by... okay, quite a lot
more. Unicode specifies that the character OK HAND SIGN, which looks
like 👌 if you have the right font, is number 1F44C in hex (128076
decimal). This is the "Universal Character Set" or UCS.

ASCII could specify a single encoding, because that encoding makes
sense for nearly all purposes. (There are times when you transmit ASCII
text and use the high bit to mean something else, like parity or "this
is the end of a word" or something, but even then, you follow the same
convention of packing a number into the low seven bits of a byte.)
Unicode can't, because there are many different pros and cons to the
different encodings, and so we have UCS Transformation Formats like
UTF-8 and UTF-32. Each one is an encoding that maps a codepoint to a
sequence of bytes.

You can't represent text in "Unicode" in a computer. Somewhere along
the way, you have to figure out how to store those codepoints as bytes,
or something more concrete (you could, for instance, use a Python list
of Python integers; I can't say that it would be in any way more
efficient than alternatives, but it would be plausible); and that's the
encoding.

ChrisA
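The split described above, between the character-to-number mapping and the number-to-bytes encodings, can be seen directly from a Python 3 prompt. This is a small sketch, not part of the original post; the little-endian codec variants are chosen here only so that no byte-order mark appears in the output.

```python
# The mapping: characters to numbers (code points).
print(ord("A"), ord("!"), ord("\\"))  # 65 33 92

ok_hand = "\U0001F44C"  # OK HAND SIGN
code_point = ord(ok_hand)
print(hex(code_point), code_point)    # 0x1f44c 128076

# The encodings: one code point, several different byte sequences.
print(ok_hand.encode("utf-8").hex())      # f09f918c
print(ok_hand.encode("utf-16-le").hex())  # 3dd84cdc  (a surrogate pair)
print(ok_hand.encode("utf-32-le").hex())  # 4cf40100

# ASCII fixes both steps at once: 'A' is number 65 and also byte 0x41.
print("A".encode("ascii"))  # b'A'

# The "list of Python integers" representation mentioned above.
as_ints = [ord(ch) for ch in "A" + ok_hand]
print(as_ints)                           # [65, 128076]
print("".join(chr(n) for n in as_ints) == "A" + ok_hand)  # True
```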
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2014-06-06 20:02 +0300 |
| Message-ID | <8761ke55u0.fsf@elektro.pacujo.net> |
| In reply to | #72857 |
Chris Angelico <rosuav@gmail.com>:

> "ASCII" means two things: Firstly, it's a mapping from the letter A to
> the number 65, from the exclamation mark to 33, from the backslash to
> 92, and so on. And secondly, it's an encoding of those numbers into
> the lowest seven bits of a byte, with the high byte left clear.
> Between those two, you get a means of representing the letter 'A' as
> the byte 0x41, and one of them is an encoding.

   The American Standard Code for Information Interchange [...] is a
   character-encoding scheme [...]
   <URL: http://en.wikipedia.org/wiki/ASCII>

> "Unicode", on the other hand, is only the first part. It maps all the
> same characters to the same numbers that ASCII does, and then adds a
> few more... a few followed by a few, followed by... okay, quite a lot
> more. Unicode specifies that the character OK HAND SIGN, which looks
> like 👌 if you have the right font, is number 1F44C in hex (128076
> decimal). This is the "Universal Character Set" or UCS.

   Unicode is a computing industry standard for the consistent
   encoding, representation and handling of text [...]
   <URL: http://en.wikipedia.org/wiki/Unicode>

Each standard assigns numbers to letters and other symbols. In a word,
each is a code. That's what their names say, too.


Marko
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-06-06 10:13 -0700 |
| Message-ID | <df391e39-b6dd-46ac-b84b-01fefce9278d@googlegroups.com> |
| In reply to | #72862 |
On Friday, June 6, 2014 10:32:47 PM UTC+5:30, Marko Rauhamaa wrote:
> Chris Angelico :
> > "ASCII" means two things: Firstly, it's a mapping from the letter A to
> > the number 65, from the exclamation mark to 33, from the backslash to
> > 92, and so on. And secondly, it's an encoding of those numbers into
> > the lowest seven bits of a byte, with the high byte left clear.
> > Between those two, you get a means of representing the letter 'A' as
> > the byte 0x41, and one of them is an encoding.
>    The American Standard Code for Information Interchange [...] is a
>    character-encoding scheme [...] <URL:

And a similar argument to this is seen on that page's talk page!

http://en.wikipedia.org/wiki/Talk:ASCII#Character_set_vs._Character_encoding.3F
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-06-07 03:26 +1000 |
| Message-ID | <mailman.10826.1402075604.18130.python-list@python.org> |
| In reply to | #72866 |
On Sat, Jun 7, 2014 at 3:13 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> On Friday, June 6, 2014 10:32:47 PM UTC+5:30, Marko Rauhamaa wrote:
>> Chris Angelico :
>
>> > "ASCII" means two things: Firstly, it's a mapping from the letter A to
>> > the number 65, from the exclamation mark to 33, from the backslash to
>> > 92, and so on. And secondly, it's an encoding of those numbers into
>> > the lowest seven bits of a byte, with the high byte left clear.
>> > Between those two, you get a means of representing the letter 'A' as
>> > the byte 0x41, and one of them is an encoding.
>
>> The American Standard Code for Information Interchange [...] is a
>> character-encoding scheme [...] <URL:
>
> And a similar argument to this is seen on that page's talk page!
> http://en.wikipedia.org/wiki/Talk:ASCII#Character_set_vs._Character_encoding.3F

Which proves that Wikipedia is exactly as reliable as a mailing list.

ChrisA
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2014-06-06 11:03 -0700 |
| Message-ID | <9948659f-9737-4f4f-bc2f-b765c40cb17b@googlegroups.com> |
| In reply to | #72857 |
On Friday, 6 June 2014 17:50:50 UTC+2, Chris Angelico wrote:
> byte.) Unicode can't, because there are many different pros and cons
> to the different encodings, and so we have UCS Transformation Formats
> like UTF-8 and UTF-32. Each one is an encoding that maps a codepoint
> to a sequence of bytes.

A big NO.

jmf
| From | Denis McMahon <denismfmcmahon@gmail.com> |
|---|---|
| Date | 2014-06-06 21:18 +0000 |
| Message-ID | <lmtb74$ofa$2@dont-email.me> |
| In reply to | #72857 |
On Sat, 07 Jun 2014 01:50:50 +1000, Chris Angelico wrote:

> Yes and no. "ASCII" means two things:

ASCII means: American Standard Code for Information Interchange aka ASA
Standard X3.4-1963

> into the lowest seven bits of a byte, with the high byte left clear.

high BIT left clear.

--
Denis McMahon, denismfmcmahon@gmail.com
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-06-07 08:18 +1000 |
| Message-ID | <mailman.10834.1402093110.18130.python-list@python.org> |
| In reply to | #72886 |
On Sat, Jun 7, 2014 at 7:18 AM, Denis McMahon <denismfmcmahon@gmail.com> wrote:
>> into the lowest seven bits of a byte, with the high byte left clear.
>
> high BIT left clear.

That thing. Unless you have bytes inside bytes (byteception?), you'll
only have room for one high bit.

Some day I'll get my brain and my fingers to agree on everything we
do... but that day is not today.

ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-06-06 15:57 +0000 |
| Message-ID | <5391e4fe$0$29988$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #72852 |
On Fri, 06 Jun 2014 18:32:39 +0300, Marko Rauhamaa wrote:

> Michael Torrie <torriem@gmail.com>:
>
>> On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
>>> Ethan Furman <ethan@stoneleaf.us>:
>>>> ASCII is *not* the state of "this string has no encoding" -- that
>>>> would be Unicode; a Unicode string, as a data type, has no encoding.
>>>
>>> Huh?
>>
>> [...]
>>
>> What part of his statement are you saying "Huh?" about?
>
> Unicode, like ASCII, is a code. Representing text in unicode is
> encoding.

A Unicode string as an abstract data type has no encoding. It is a
Platonic ideal, a pure form like the real numbers. There are no bytes,
no bits, just code points. That is what Ethan means. A Unicode string
like this:

s = u"NOBODY expects the Spanish Inquisition!"

should not be thought of as a bunch of bytes in some encoding, but as
an array of code points. Eventually the abstraction will leak, all
abstractions do, but not for a very long time.

--
Steven D'Aprano
http://import-that.dreamwidth.org/
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-06-06 09:21 -0700 |
| Message-ID | <ca66f285-15af-4542-96df-87f9794a1cd8@googlegroups.com> |
| In reply to | #72858 |
On Friday, June 6, 2014 9:27:51 PM UTC+5:30, Steven D'Aprano wrote:
> On Fri, 06 Jun 2014 18:32:39 +0300, Marko Rauhamaa wrote:
> > Michael Torrie:
> >> On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
> >>> Ethan Furman:
> >>>> ASCII is *not* the state of "this string has no encoding" -- that
> >>>> would be Unicode; a Unicode string, as a data type, has no encoding.
> >>> Huh?
> >> [...]
> >> What part of his statement are you saying "Huh?" about?
> > Unicode, like ASCII, is a code. Representing text in unicode is
> > encoding.
> A Unicode string as an abstract data type has no encoding. It is a
> Platonic ideal, a pure form like the real numbers. There are no bytes, no
> bits, just code points. That is what Ethan means. A Unicode string like
> this:
> s = u"NOBODY expects the Spanish Inquisition!"
> should not be thought of as a bunch of bytes in some encoding, but as an
> array of code points. Eventually the abstraction will leak, all
> abstractions do, but not for a very long time.

"Should not be thought of": yes, that's the Python3 world view.
Not even the Python2 world view.
And very far from the classic Unix world view.

As Ned Batchelder says in Unipain: http://nedbatchelder.com/text/unipain.html :
Programmers should use the 'unicode sandwich' to avoid 'unipain':

Bytes on the outside, Unicode on the inside, encode/decode at the edges.

The discussion here is precisely about these edges.

Combine that with Chris':

> Yes and no. "ASCII" means two things: Firstly, it's a mapping from the
> letter A to the number 65, from the exclamation mark to 33, from the
> backslash to 92, and so on. And secondly, it's an encoding of those
> numbers into the lowest seven bits of a byte, with the high byte left
> clear. Between those two, you get a means of representing the letter
> 'A' as the byte 0x41, and one of them is an encoding.

and the situation appears quite the opposite of Ethan's description:

In the 'old world' ASCII was both mapping and encoding and so there was never a justification to distinguish encoding from codepoint.

It is unicode that demands these distinctions.

If we could magically go to a world where the number of bits in a byte was 32 all this headache would go away. [Actually just 21 is enough!]
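A rough sketch of the 'unicode sandwich' and of the "21 is enough" remark, assuming Python 3. The function `shout` and the constant name `MAX_CODE_POINT` are invented here for illustration; the UTF-8 default is likewise an assumption, since real edges must be told (or must discover) their encoding.

```python
import sys

# Highest code point Unicode defines; it needs exactly 21 bits.
MAX_CODE_POINT = 0x10FFFF
print(MAX_CODE_POINT.bit_length())


def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical example of the sandwich: bytes in, bytes out."""
    text = raw.decode(encoding)      # edge: bytes -> Unicode
    result = text.upper()            # inside: operate on code points only
    return result.encode(encoding)   # edge: Unicode -> bytes


raw_in = "se\u00f1or".encode("utf-8")
raw_out = shout(raw_in)
print(raw_in, raw_out)
```

Calling `.upper()` directly on `raw_in` would leave the two bytes of the "ñ" untouched, because the bytes method only knows about ASCII letters; that is the kind of problem the decode/encode edges are meant to prevent.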
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-06-07 02:48 +1000 |
| Message-ID | <mailman.10821.1402073331.18130.python-list@python.org> |
| In reply to | #72859 |
On Sat, Jun 7, 2014 at 2:21 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> Combine that with Chris':
>
>> Yes and no. "ASCII" means two things: Firstly, it's a mapping from the
>> letter A to the number 65, from the exclamation mark to 33, from the
>> backslash to 92, and so on. And secondly, it's an encoding of those
>> numbers into the lowest seven bits of a byte, with the high byte left
>> clear. Between those two, you get a means of representing the letter
>> 'A' as the byte 0x41, and one of them is an encoding.
>
> and the situation appears quite the opposite of Ethan's description:
>
> In the 'old world' ASCII was both mapping and encoding and so there was
> never a justification to distinguish encoding from codepoint.
>
> It is unicode that demands these distinctions.
>
> If we could magically go to a world where the number of bits in a byte was 32
> all this headache would go away. [Actually just 21 is enough!]

An ASCII mentality lets you be sloppy. That doesn't mean the distinction doesn't exist. When I first started programming in C, int was *always* 16 bits long and *always* little-endian (because I used only one compiler). I could pretend that those bits in memory actually were that integer, that there were no other ways that integer could be encoded. That doesn't mean that encodings weren't important. And as soon as I started working on a 32-bit OS/2 system, and my ints became bigger, I had to concern myself with that. Even more so when I got into networking, and byte order became important to me. And of course, these days I work with integers that are encoded in all sorts of different ways (a Python integer isn't just a puddle of bytes in memory), and I generally let someone else take care of the details, but the encodings are still there.

ASCII was once your one companion, it was all that mattered. ASCII was once a friendly encoding, then your world was shattered. Wishing it were somehow here again, wishing it were somehow near... sometimes it seemed, if you just dreamed, somehow it would be here! Wishing you could use just bytes again, knowing that you never would... dreaming of it won't help you to do all that you dream you could!

It's time to stop chasing the phantom and start living in the Raoul world... err, the real world. :)

ChrisA
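The integer analogy can be made concrete with the standard `struct` module. This is only a sketch, assuming Python 3; the value 1000 and the variable names are arbitrary choices for illustration.

```python
import struct

n = 1000  # one abstract integer, 0x03E8

# Four different byte-level encodings of the same number.
le16 = struct.pack("<H", n)  # 16-bit little-endian
be16 = struct.pack(">H", n)  # 16-bit big-endian
le32 = struct.pack("<I", n)  # 32-bit little-endian
be32 = struct.pack(">I", n)  # 32-bit big-endian (network byte order)

for label, data in [("le16", le16), ("be16", be16),
                    ("le32", le32), ("be32", be32)]:
    print(label, data.hex())

# Decoding with the wrong byte order silently gives a different integer.
wrong = struct.unpack("<H", be16)[0]
print(wrong)
```

As with text, the bytes alone do not say which encoding produced them; the reader has to know the width and byte order from somewhere else.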
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-06-06 10:04 -0700 |
| Message-ID | <57ed797e-1ed5-4c52-9fbb-b700615852d2@googlegroups.com> |
| In reply to | #72860 |
On Friday, June 6, 2014 10:18:41 PM UTC+5:30, Chris Angelico wrote:
> On Sat, Jun 7, 2014 at 2:21 AM, Rustom Mody wrote:
> > Combine that with Chris':
> >> Yes and no. "ASCII" means two things: Firstly, it's a mapping from the
> >> letter A to the number 65, from the exclamation mark to 33, from the
> >> backslash to 92, and so on. And secondly, it's an encoding of those
> >> numbers into the lowest seven bits of a byte, with the high byte left
> >> clear. Between those two, you get a means of representing the letter
> >> 'A' as the byte 0x41, and one of them is an encoding.
> > and the situation appears quite the opposite of Ethan's description:
> > In the 'old world' ASCII was both mapping and encoding and so there was
> > never a justification to distinguish encoding from codepoint.
> > It is unicode that demands these distinctions.
> > If we could magically go to a world where the number of bits in a byte was 32
> > all this headache would go away. [Actually just 21 is enough!]
> An ASCII mentality lets you be sloppy. That doesn't mean the
> distinction doesn't exist. When I first started programming in C, int
> was *always* 16 bits long and *always* little-endian (because I used
> only one compiler). I could pretend that those bits in memory actually
> were that integer, that there were no other ways that integer could be
> encoded. That doesn't mean that encodings weren't important. And as
> soon as I started working on a 32-bit OS/2 system, and my ints became
> bigger, I had to concern myself with that. Even more so when I got
> into networking, and byte order became important to me. And of course,
> these days I work with integers that are encoded in all sorts of
> different ways (a Python integer isn't just a puddle of bytes in
> memory), and I generally let someone else take care of the details,
> but the encodings are still there.
> ASCII was once your one companion, it was all that mattered. ASCII was
> once a friendly encoding, then your world was shattered. Wishing it
> were somehow here again, wishing it were somehow near... sometimes it
> seemed, if you just dreamed, somehow it would be here! Wishing you
> could use just bytes again, knowing that you never would... dreaming
> of it won't help you to do all that you dream you could!
> It's time to stop chasing the phantom and start living in the Raoul
> world... err, the real world. :)

I thought that "If only bytes were 21+ bits wide" would sound sufficiently nonsensical, that I did not need to explicitly qualify it as a utopian dream!