
Managing Google Groups headaches

Started by	rusi <rustompmody@gmail.com>
First post	2013-11-28 05:52 -0800
Last post	2013-12-04 08:31 -0800
Articles	20 on this page of 107 — 28 participants


  Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-11-28 05:52 -0800
    Re: Managing Google Groups headaches Chris Angelico <rosuav@gmail.com> - 2013-11-29 00:58 +1100
      Re: Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-11-28 06:17 -0800
        Re: Managing Google Groups headaches Chris Angelico <rosuav@gmail.com> - 2013-11-29 01:25 +1100
          Re: Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-11-28 07:04 -0800
            Re: Managing Google Groups headaches Chris Angelico <rosuav@gmail.com> - 2013-11-29 02:08 +1100
              Re: Managing Google Groups headaches Alister <alister.ware@ntlworld.com> - 2013-11-28 15:50 +0000
                Re: Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-11-28 08:22 -0800
                  Re: Managing Google Groups headaches Alister <alister.ware@ntlworld.com> - 2013-11-28 16:33 +0000
              Re: Managing Google Groups headaches Alister <alister.ware@ntlworld.com> - 2013-11-28 15:49 +0000
              Re: Managing Google Groups headaches Alister <alister.ware@ntlworld.com> - 2013-11-28 15:49 +0000
              Re: Managing Google Groups headaches Alister <alister.ware@ntlworld.com> - 2013-11-28 15:50 +0000
                Re: Managing Google Groups headaches Roy Smith <roy@panix.com> - 2013-11-28 11:43 -0500
                  Re: Managing Google Groups headaches Chris Angelico <rosuav@gmail.com> - 2013-11-29 04:29 +1100
                  Re: Managing Google Groups headaches Neil Cerutti <neilc@norwich.edu> - 2013-12-02 13:03 +0000
                    Re: Managing Google Groups headaches Roy Smith <roy@panix.com> - 2013-12-02 08:29 -0500
                      Re: Managing Google Groups headaches Neil Cerutti <neilc@norwich.edu> - 2013-12-02 14:04 +0000
                        Re: Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-12-02 09:11 -0800
                          Re: Managing Google Groups headaches Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 17:48 +0000
                          Re: Managing Google Groups headaches Chris Angelico <rosuav@gmail.com> - 2013-12-03 04:54 +1100
                          Re: Managing Google Groups headaches Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 18:07 +0000
                      Re: Managing Google Groups headaches Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-12-02 19:56 -0500
                  Re: Managing Google Groups headaches Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-12-02 19:54 -0500
                  Re: [OT] Managing Google Groups headaches Michael Torrie <torriem@gmail.com> - 2013-12-02 18:17 -0700
                    Re: [OT] Managing Google Groups headaches Roy Smith <roy@panix.com> - 2013-12-02 20:43 -0500
                      Re: [OT] Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-12-02 18:27 -0800
                      Re: [OT] Managing Google Groups headaches Michael Torrie <torriem@gmail.com> - 2013-12-02 20:09 -0700
                        Re: [OT] Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-12-02 19:26 -0800
                    Re: [OT] Managing Google Groups headaches Grant Edwards <invalid@invalid.invalid> - 2013-12-03 04:27 +0000
                      Re: [OT] Managing Google Groups headaches Chris Angelico <rosuav@gmail.com> - 2013-12-03 18:01 +1100
                    Re: [OT] Managing Google Groups headaches alex23 <wuwei23@gmail.com> - 2013-12-03 16:30 +1000
                      Re: [OT] Managing Google Groups headaches Steven D'Aprano <steve@pearwood.info> - 2013-12-03 07:13 +0000
                        Re: [OT] Managing Google Groups headaches alex23 <wuwei23@gmail.com> - 2013-12-04 10:23 +1000
                          Re: [OT] Managing Google Groups headaches Neil Cerutti <neilc@norwich.edu> - 2013-12-04 14:34 +0000
                          Re: [OT] Managing Google Groups headaches Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-04 15:21 +0000
                  Re: [OT] Managing Google Groups headaches Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-03 12:09 +0000
            Re: Managing Google Groups headaches Michael Torrie <torriem@gmail.com> - 2013-11-28 08:40 -0700
            Re: Managing Google Groups headaches Travis Griggs <travisgriggs@gmail.com> - 2013-11-28 08:23 -0800
            Re: Managing Google Groups headaches Ned Batchelder <ned@nedbatchelder.com> - 2013-11-28 12:23 -0500
            Re: Managing Google Groups headaches Michael Torrie <torriem@gmail.com> - 2013-11-28 11:29 -0700
              Re: Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-11-28 10:37 -0800
                Re: Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-11-28 11:00 -0800
                  Re: Managing Google Groups headaches Michael Torrie <torriem@gmail.com> - 2013-11-28 12:55 -0700
                Re: Managing Google Groups headaches Walter Hurry <walterhurry@lavabit.com> - 2013-11-28 19:40 +0000
                Re: Managing Google Groups headaches Michael Torrie <torriem@gmail.com> - 2013-11-28 11:50 -0700
                  Re: Managing Google Groups headaches Arif Khokar <akhokar1234@wvu.edu> - 2013-11-28 19:46 -0500
                    Re: Managing Google Groups headaches Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-29 14:41 +0000
                    Re: Managing Google Groups headaches Grant Edwards <invalid@invalid.invalid> - 2013-11-29 16:17 +0000
                    Re: Managing Google Groups headaches Cameron Simpson <cs@zip.com.au> - 2013-12-04 11:38 +1100
                      Re: Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-12-03 17:39 -0800
                        Re: Managing Google Groups headaches Chris Angelico <rosuav@gmail.com> - 2013-12-04 13:03 +1100
                        Re: Managing Google Groups headaches Cameron Simpson <cs@zip.com.au> - 2013-12-05 09:47 +1100
                          Re: Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-12-05 23:42 -0800
                Re: Managing Google Groups headaches Walter Hurry <walterhurry@lavabit.com> - 2013-11-28 20:39 +0000
            Re: Managing Google Groups headaches Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-11-28 16:41 -0500
              Re: Managing Google Groups headaches pecore@pascolo.net - 2013-11-30 14:25 +0100
                Re: Managing Google Groups headaches Cameron Simpson <cs@zip.com.au> - 2013-12-04 11:40 +1100
                  Re: Managing Google Groups headaches Grant Edwards <invalid@invalid.invalid> - 2013-12-04 15:50 +0000
                    Re: Managing Google Groups headaches Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-04 16:07 +0000
                    Re: Managing Google Groups headaches Ned Batchelder <ned@nedbatchelder.com> - 2013-12-04 11:21 -0500
                    Re: Managing Google Groups headaches Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-04 16:33 +0000
            Re: Managing Google Groups headaches Zero Piraeus <z@etiol.net> - 2013-11-28 13:29 -0300
              Re: Managing Google Groups headaches Grant Edwards <invalid@invalid.invalid> - 2013-11-29 16:15 +0000
            Re: Managing Google Groups headaches Terry Reedy <tjreedy@udel.edu> - 2013-11-28 17:32 -0500
            Re: Managing Google Groups headaches Terry Reedy <tjreedy@udel.edu> - 2013-11-28 17:44 -0500
            Re: Managing Google Groups headaches Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-29 14:39 +0000
    Re: Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-11-28 10:13 -0800
      Re: Managing Google Groups headaches Rich Kulawiec <rsk@gsp.org> - 2013-12-04 09:52 -0500
        Re: Managing Google Groups headaches Roy Smith <roy@panix.com> - 2013-12-04 19:58 -0500
          Re: Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-12-05 23:13 -0800
            Re: Managing Google Groups headaches Roy Smith <roy@panix.com> - 2013-12-06 02:36 -0500
              Re: Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-12-06 05:03 -0800
                Re: Managing Google Groups headaches Chris Angelico <rosuav@gmail.com> - 2013-12-07 00:19 +1100
                  Re: Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-12-06 05:32 -0800
                    Re: Managing Google Groups headaches Chris Angelico <rosuav@gmail.com> - 2013-12-07 00:48 +1100
                      Re: Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-12-06 06:11 -0800
                        Re: Managing Google Groups headaches Chris Angelico <rosuav@gmail.com> - 2013-12-07 01:51 +1100
                ASCII and Unicode [was Re: Managing Google Groups headaches] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-06 19:00 +0000
                  Re: ASCII and Unicode [was Re: Managing Google Groups headaches] Gene Heskett <gheskett@wdtv.com> - 2013-12-06 14:34 -0500
                  Re: ASCII and Unicode [was Re: Managing Google Groups headaches] Roy Smith <roy@panix.com> - 2013-12-06 20:54 +0000
                  Re: ASCII and Unicode [was Re: Managing Google Groups headaches] Chris Angelico <rosuav@gmail.com> - 2013-12-07 10:42 +1100
                  Re: ASCII and Unicode [was Re: Managing Google Groups headaches] rusi <rustompmody@gmail.com> - 2013-12-06 18:33 -0800
                    Re: ASCII and Unicode [was Re: Managing Google Groups headaches] Chris Angelico <rosuav@gmail.com> - 2013-12-07 13:41 +1100
                      Re: ASCII and Unicode [was Re: Managing Google Groups headaches] rusi <rustompmody@gmail.com> - 2013-12-06 19:16 -0800
                        Re: ASCII and Unicode [was Re: Managing Google Groups headaches] Chris Angelico <rosuav@gmail.com> - 2013-12-07 15:08 +1100
                    Re: ASCII and Unicode [was Re: Managing Google Groups headaches] MRAB <python@mrabarnett.plus.com> - 2013-12-07 03:19 +0000
                  Re: ASCII and Unicode giacomo boffi <pecore@pascolo.net> - 2013-12-07 17:05 +0100
                    Re: ASCII and Unicode rusi <rustompmody@gmail.com> - 2013-12-08 08:41 -0800
                    Re: ASCII and Unicode Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-08 17:22 +0000
                      Re: ASCII and Unicode rusi <rustompmody@gmail.com> - 2013-12-08 09:39 -0800
                      Re: ASCII and Unicode giacomo boffi <pecore@pascolo.net> - 2013-12-08 21:11 +0100
                        Re: ASCII and Unicode rusi <rustompmody@gmail.com> - 2013-12-08 19:02 -0800
                Re: Managing Google Groups headaches Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2013-12-07 12:27 +1300
                Re: Managing Google Groups headaches Ned Batchelder <ned@nedbatchelder.com> - 2013-12-06 21:24 -0500
                  Re: Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-12-06 23:43 -0800
                    Re: Managing Google Groups headaches wxjmfauth@gmail.com - 2013-12-07 02:16 -0800
                      Re: Managing Google Groups headaches Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-07 11:25 +0000
                        Re: Managing Google Groups headaches Chris Angelico <rosuav@gmail.com> - 2013-12-07 22:49 +1100
                      Re: Managing Google Groups headaches Roy Smith <roy@panix.com> - 2013-12-07 11:08 -0500
                        Re: Managing Google Groups headaches Rotwang <sg552@hotmail.co.uk> - 2013-12-07 16:15 +0000
                        Re: Managing Google Groups headaches Tim Chase <python.list@tim.thechases.com> - 2013-12-07 10:19 -0600
                      Re: Managing Google Groups headaches rusi <rustompmody@gmail.com> - 2013-12-07 08:27 -0800
                        Re: Managing Google Groups headaches Ned Batchelder <ned@nedbatchelder.com> - 2013-12-07 12:04 -0500
            Re: Managing Google Groups headaches Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-07 03:07 +0000
              Re: Managing Google Groups headaches Roy Smith <roy@panix.com> - 2013-12-06 22:40 -0500
      Re: Managing Google Groups headaches Chris Angelico <rosuav@gmail.com> - 2013-12-05 02:46 +1100
      Re: Managing Google Groups headaches Travis Griggs <travisgriggs@gmail.com> - 2013-12-04 08:31 -0800


#61040

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-12-04 16:33 +0000
Message-ID	<mailman.3579.1386174840.18130.python-list@python.org>
In reply to	#61028

On 04/12/2013 16:21, Ned Batchelder wrote:
> On 12/4/13 11:07 AM, Mark Lawrence wrote:
>> On 04/12/2013 15:50, Grant Edwards wrote:
>>> On 2013-12-04, Cameron Simpson <cs@zip.com.au> wrote:
>>>> On 30Nov2013 14:25, pecore@pascolo.net <pecore@pascolo.net> wrote:
>>>>> Dennis Lee Bieber <wlfraed@ix.netcom.com> writes:
>>>>>> [NNTP] clients provide full-fledged editors
>>>>>     and conversely full-fledged editors provide
>>>>>     NNTP clients
>>>>
>>>>    GNU Emacs is a LISP operating system disguised as a word processor.
>>>>          - Doug Mohney, in comp.arch
>>>
>>> Unix: A set of device drivers used to support the Emacs operating
>>>        system.
>>>
>>>   - Don't remember who, where, or when
>>>
>>
>> It's a funny thing the computing world, with some people deriving
>> operating systems from raincoats, and others editing code with a
>> domestic household cleaner, what next, I ask myself?
>>
>
> Computing with vacuum cleaners is on the decline at least:
> http://www.vax.co.uk/vacuum-cleaners
>
> --Ned.
>

Well it shouldn't be.  It's a well known fact that VMS stands for Very 
Much Safer.  I'd compare it to inferior products, but not even the 
threat of The Comfy Chair will make me type the names.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence


#60737

From	Zero Piraeus <z@etiol.net>
Date	2013-11-28 13:29 -0300
Message-ID	<mailman.3382.1385676148.18130.python-list@python.org>
In reply to	#60692

:

On Thu, Nov 28, 2013 at 08:40:47AM -0700, Michael Torrie wrote:
> My opinion is that the Python list should dump the Usenet tie-in and
> just go straight e-mail.

+1 Hell yes.

-- 
Zero Piraeus: coram publico
http://etiol.net/pubkey.asc


#60768

From	Grant Edwards <invalid@invalid.invalid>
Date	2013-11-29 16:15 +0000
Message-ID	<l7aeic$hp4$1@reader1.panix.com>
In reply to	#60737

On 2013-11-28, Zero Piraeus <z@etiol.net> wrote:
>:
>
> On Thu, Nov 28, 2013 at 08:40:47AM -0700, Michael Torrie wrote:
>> My opinion is that the Python list should dump the Usenet tie-in and
>> just go straight e-mail.
>
> +1 Hell yes.

I'd have to reluctantly agree.  I've been using Usenet for 25 years,
and I still read this as comp.lang.python, but this is practically the
only Usenet group left that I follow.  There are a number of mailing
lists I follow via gmane's NNTP server, and I can certainly do the
same for this one.

I've been filtering out all postings from GG for years, so it doesn't
really matter to me, but apparently there are a lot of people with
defective mail/news clients for whom that's not possible?
[Otherwise, I don't understand what all the complaining is about.]

-- 
Grant


#60739

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-11-28 17:32 -0500
Message-ID	<mailman.3383.1385677947.18130.python-list@python.org>
In reply to	#60692

On 11/28/2013 10:40 AM, Michael Torrie wrote:
> On 11/28/2013 08:08 AM, Chris Angelico wrote:
>> Which is easier, fiddling around with your setup so you can post
>> reasonably on Google Groups, or just getting a better client? With
>> your setup, you have to drop out to another editor and press F9 for it
>> to work. With pretty much any other newsreader on the planet, this
>> works straight off, no setup necessary.
>>
>> I'm still going to advise people to stop using buggy rubbish.
>
> My opinion is that the Python list should dump the Usenet tie-in

I am beginning to think this also.

> and just go straight e-mail.

email + gmane newsgroup mirror

 > Python is the only list I'm on that has a usenet gateway.

Thousands of technical mailing lists have a gmane mirror. There are over 200 just for 
Python.

-- 
Terry Jan Reedy


#60740

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-11-28 17:44 -0500
Message-ID	<mailman.3384.1385678678.18130.python-list@python.org>
In reply to	#60692

On 11/28/2013 1:29 PM, Michael Torrie wrote:

> Seems like 90% of the problems on this list come from the unchecked
> usenet side of things.  Such as trolls or spam.
...
> Despite many calls to banish [such] ...
> with usenet it's just not possible.

The usenet gateway has been changed recently to no longer pass 
everything to python-list (and on to gmane) without question. If you 
want the benefit of such moderation as there is, use either of those two.

-- 
Terry Jan Reedy


#60765

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-11-29 14:39 +0000
Message-ID	<mailman.3398.1385735965.18130.python-list@python.org>
In reply to	#60692

On 28/11/2013 16:29, Zero Piraeus wrote:
> :
>
> On Thu, Nov 28, 2013 at 08:40:47AM -0700, Michael Torrie wrote:
>> My opinion is that the Python list should dump the Usenet tie-in and
>> just go straight e-mail.
>
> +1 Hell yes.
>

I'd happily use semaphore but given time you're bound to find someone 
who could screw that up.  So I'll stick with Thunderbird and gmane, 
reading some 40-ish Python lists and blogs.  Well, I think they're blogs :)

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence


#60720

From	rusi <rustompmody@gmail.com>
Date	2013-11-28 10:13 -0800
Message-ID	<132658ff-d06a-4136-ade6-353189da5769@googlegroups.com>
In reply to	#60686

Here's a 1-click pure python solution.

As I said, I don't know how to manage errors!

1. Put it in a file, say cleangg.py, and make it executable
2. Install it as the 'editor' for the "It's All Text!" Firefox addon
3. Click the edit button and you should get a cleaned-up post

------------------------------
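#!/usr/bin/env python3

import re
import sys

def clean(s):
    # Google Groups interleaves empty quoted lines ("> ") between quoted lines.
    # 1. Mark a pair of empty quoted lines (a real paragraph break) with "¶".
    s1 = re.sub("^> *\n> *$", "¶",   s,  flags=re.M)
    # 2. Drop the remaining single empty quoted lines (the spurious ones).
    s2 = re.sub("^> *\n",     "",    s1, flags=re.M)
    # 3. Turn each marker back into one empty quoted line.
    s3 = re.sub("¶\n",        ">\n", s2, flags=re.M)
    return s3

def main():
    # The addon passes the path of a temporary file holding the post text.
    if len(sys.argv) != 2:
        sys.exit("usage: cleangg.py FILE")
    path = sys.argv[1]
    print("argv[1] %s" % path)
    with open(path) as f:
        s = f.read()
    # Overwrite the same file in place so the addon picks up the cleaned text.
    with open(path, "w") as f:
        f.write(clean(s))

if __name__ == "__main__":
    main()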
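
To see what the three substitutions in the script above do, they can be tried on a short sample without touching any file. This is only a sketch: the sample text is invented, and it assumes Google Groups doubles the quoted lines in the way shown.

```python
import re

def clean(s):
    # Same three substitutions as in the script above.
    s1 = re.sub("^> *\n> *$", "¶", s, flags=re.M)
    s2 = re.sub("^> *\n", "", s1, flags=re.M)
    s3 = re.sub("¶\n", ">\n", s2, flags=re.M)
    return s3

# Hypothetical doubled quote: every quoted line is followed by an empty
# quoted line, so one real paragraph break shows up as three of them.
sample = (
    "> first line\n"
    "> \n"
    "> second line\n"
    "> \n"
    "> \n"
    "> \n"
    "> new paragraph\n"
)

cleaned = clean(sample)
print(cleaned)
```

The spurious empty quoted lines are dropped and the real paragraph break is kept as a single ">" line.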


#61024

From	Rich Kulawiec <rsk@gsp.org>
Date	2013-12-04 09:52 -0500
Message-ID	<mailman.3565.1386170444.18130.python-list@python.org>
In reply to	#60720

(comments from a lurker on python-list)

- Google "groups" is a disaster.  It's extremely poorly-run, and is in
fact a disservice to Usenet -- which is alive and well, tyvm, and still used
by many of the most senior and experienced people on the Internet.  (While
some newsgroups are languishing and some have almost no traffic, others
are thriving.  As it should be.)  I could catalog the litany of egregious
mistakes that Google has made, but what's the point?  They're clearly
uninterested in fixing them.  Their only interest is in slapping the
"Google" label on Usenet -- which is far more important in the evolution
of the Internet than Google will ever be -- so that they can use it
as a marketing vehicle.  Worse, Google has completely failed to control
outbound abuse from Google groups, which is why many consider it a
best practice to simply drop all Usenet traffic originating there.

- That said, there is value in bidirectionally gatewaying mailing lists
with corresponding Usenet newsgroups.  Usenet's propagation properties often
make it the medium of choice for many people, particularly those in areas
with slow, expensive, erratic, etc. connectivity.  Conversely, delivery
of Usenet traffic via email is a better solution for others.  Software
like Mailman facilitates this fairly well, even given the impedance
mismatch between SMTP and NNTP.

- Mailing lists/Usenet newsgroups remain, as they've been for a very
long time, the solutions of choice for online discussions.  Yes, I'm
aware of web forums: I've used hundreds of them.  They suck.  They ALL
suck, they just all suck differently.  I could spend the next several
thousand lines explaining why, but instead I'll just abbreviate: they
don't handle threading, they don't let me use my editor of choice,
they don't let me build my own archive that I can search MY way including
when I'm offline, they are brittle and highly vulnerable to abuse
and security breaches, they encourage worst practices in writing
style (including top-posting and full-quoting), they translate poorly
to other formats, they are difficult to archive, they're even more
difficult to migrate (whereas Unix mbox format files from 30 years ago
are still perfectly usable today), they aren't standardized, they
aren't easily scalable, they're overly complex, they don't support
proper quoting, they don't support proper attribution, they can't
be easily forwarded, they...oh, it just goes on.   My point being that
there's a reason that the IETF and the W3C and NANOG and lots of other
groups that could use anything they want use mailing lists: they work.

- That said, they work *if configured properly*, which unfortunately
these days includes a hefty dose of anti-abuse controls.  This list
(for the most part) isn't particularly targeted, but it is occasionally,
and in the spirit of trying to help out, I can assist with that. (I think
it's fair to say I have a little bit of email expertise.)  If any of
the list's owners are reading this and want help, please let me know.

- They also work well *if used properly*, which means that participants
should use proper email/news etiquette: line wrap, sane quoting style,
reasonable editing of followups, preservation of threads, all that stuff.
The more people do more of that, the smoother things work.  On the other
hand, if nobody does that, the result is impaired communication and
quite often, a chorus of "mailing lists suck" even though the problem
is not the mailing lists: it's the bad habits of the users on them.
(And of course changing mediums won't fix that.)

- To bring this back around to one of the starting points for this
discussion: I think the current setup is functioning well, even given
the sporadic stresses placed on it.  I think it would be best to invest
effort in maintaining/improving it as it stands (which is why I volunteered
to do so, see above) rather than migrating to something else.

---rsk


#61065

From	Roy Smith <roy@panix.com>
Date	2013-12-04 19:58 -0500
Message-ID	<roy-9C2ADB.19585404122013@news.panix.com>
In reply to	#61024

In article <mailman.3565.1386170444.18130.python-list@python.org>,
 Rich Kulawiec <rsk@gsp.org> wrote:

> Yes, I'm
> aware of web forums: I've used hundreds of them.  They suck.  They ALL
> suck, they just all suck differently.  I could spend the next several
> thousand lines explaining why, but instead I'll just abbreviate: they
> don't handle threading, they don't let me use my editor of choice,
> they don't let me build my own archive that I can search MY way including
> when I'm offline, they are brittle and highly vulnerable to abuse
> and security breaches, they encourage worst practices in writing
> style (including top-posting and full-quoting), they translate poorly
> to other formats, they are difficult to archive, they're even more
> difficult to migrate (whereas Unix mbox format files from 30 years ago
> are still perfectly usable today), they aren't standardized, they
> aren't easily scalable, they're overly complex, they don't support
> proper quoting, they don't support proper attribution, they can't
> be easily forwarded, they...oh, it just goes on.  

The real problem with web forums is they conflate transport and 
presentation into a single opaque blob, and are pretty much universally 
designed to be a closed system.  Mail and usenet were both engineered to 
make a sharp division between transport and presentation, which meant it 
was possible to evolve each at their own pace.

Mostly that meant people could go off and develop new client 
applications which interoperated with the existing system.  But, it also 
meant that transport layers could be switched out (as when NNTP 
gradually, but inexorably, replaced UUCP as the primary usenet transport 
layer).
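
The division between transport and presentation described above can be sketched with Python's standard `email` package. This is an illustration only: the message text, the address and the two rendering functions (`as_index_line`, `as_quoted_reply`) are made up for the example. The stored message is parsed once, and two independent "clients" present it in different ways.

```python
from email import message_from_string
from email.message import Message

# A hypothetical message as it would be stored or transported.
RAW = """\
From: alice@example.org
Subject: Transport versus presentation
Message-ID: <1@example.org>

The body is stored once, independent of how any client shows it.
"""

def as_index_line(msg: Message) -> str:
    # One presentation: a single line for a thread index.
    return f"{msg['Subject']} ({msg['From']})"

def as_quoted_reply(msg: Message) -> str:
    # Another presentation: the body quoted for a follow-up.
    body = msg.get_payload()
    quoted = "\n".join("> " + line for line in body.splitlines())
    return f"{msg['From']} wrote:\n{quoted}"

msg = message_from_string(RAW)
print(as_index_line(msg))
print(as_quoted_reply(msg))
```

Neither function needs to know how the message arrived, which is the property a closed web forum lacks.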


#61117

From	rusi <rustompmody@gmail.com>
Date	2013-12-05 23:13 -0800
Message-ID	<51007240-6bc9-4f0b-9937-4883bcc0ceb6@googlegroups.com>
In reply to	#61065

On Thursday, December 5, 2013 6:28:54 AM UTC+5:30, Roy Smith wrote:
>  Rich Kulawiec wrote:

> > Yes, I'm
> > aware of web forums: I've used hundreds of them.  They suck.  They ALL
> > suck, they just all suck differently.  I could spend the next several
> > thousand lines explaining why, but instead I'll just abbreviate: they
> > don't handle threading, they don't let me use my editor of choice,
> > they don't let me build my own archive that I can search MY way including
> > when I'm offline, they are brittle and highly vulnerable to abuse
> > and security breaches, they encourage worst practices in writing
> > style (including top-posting and full-quoting), they translate poorly
> > to other formats, they are difficult to archive, they're even more
> > difficult to migrate (whereas Unix mbox format files from 30 years ago
> > are still perfectly usable today), they aren't standardized, they
> > aren't easily scalable, they're overly complex, they don't support
> > proper quoting, they don't support proper attribution, they can't
> > be easily forwarded, they...oh, it just goes on.  

> The real problem with web forums is they conflate transport and 
> presentation into a single opaque blob, and are pretty much universally 
> designed to be a closed system.  Mail and usenet were both engineered to 
> make a sharp division between transport and presentation, which meant it 
> was possible to evolve each at their own pace.

> Mostly that meant people could go off and develop new client 
> applications which interoperated with the existing system.  But, it also 
> meant that transport layers could be switched out (as when NNTP 
> gradually, but inexorably, replaced UUCP as the primary usenet transport 
> layer).

There is a deep assumption hovering round-about the above -- what I
will call the 'Unix assumption(s)'.  But before that, just a check on
terminology. By 'presentation' you mean what people normally call
'mail-clients': thunderbird, mutt etc. And by 'transport' you mean
sendmail, exim, qmail etc etc -- what normally are called
'mail-servers.'  Right??

Assuming this is the intended meaning of the terminology (yeah it's
clearer terminology than the usual and yeah I'm also a 'Unix-guy'),
here's the 'Unix-assumption':

  - human communication…
(is not very different from)
  - machine communication…
(can be done by)
  - text…
(for which)
  - ASCII is fine…
(which is just)
  - bytes…
(inside/between byte-memory-organized)
  - von Neumann computers

To the extent that these assumptions are invalid, the 'opaque-blob'
may well be preferable.


#61118

From	Roy Smith <roy@panix.com>
Date	2013-12-06 02:36 -0500
Message-ID	<roy-1384C7.02363006122013@news.panix.com>
In reply to	#61117

In article <51007240-6bc9-4f0b-9937-4883bcc0ceb6@googlegroups.com>,
 rusi <rustompmody@gmail.com> wrote:

> On Thursday, December 5, 2013 6:28:54 AM UTC+5:30, Roy Smith wrote:

> > The real problem with web forums is they conflate transport and 
> > presentation into a single opaque blob, and are pretty much universally 
> > designed to be a closed system.  Mail and usenet were both engineered to 
> > make a sharp division between transport and presentation, which meant it 
> > was possible to evolve each at their own pace.
> 
> > Mostly that meant people could go off and develop new client 
> > applications which interoperated with the existing system.  But, it also 
> > meant that transport layers could be switched out (as when NNTP 
> > gradually, but inexorably, replaced UUCP as the primary usenet transport 
> > layer).
> 
> There is a deep assumption hovering round-about the above -- what I
> will call the 'Unix assumption(s)'.

It has nothing to do with Unix.  The separation of transport from 
presentation is just as valid on Windows, Mac, etc.

> But before that, just a check on
> terminology. By 'presentation' you mean what people normally call
> 'mail-clients': thunderbird, mutt etc. And by 'transport' you mean
> sendmail, exim, qmail etc etc -- what normally are called
> 'mail-servers.'  Right??

Yes.

> Assuming this is the intended meaning of the terminology (yeah it's
> clearer terminology than the usual and yeah I'm also a 'Unix-guy'),
> here's the 'Unix-assumption':
> 
>   - human communicationŠ
> (is not very different from)
>   - machine communicationŠ
> (can be done by)
>   - textŠ
> (for which)
>   - ASCII is fineŠ
> (which is just)
>   - bytesŠ
> (inside/between byte-memory-organized)
>   - von Neumann computers
> 
> To the extent that these assumptions are invalid, the 'opaque-blob'
> may well be preferable.

I think you're off on the wrong track here.  This has nothing to do with 
plain text (ascii or otherwise).  It has to do with divorcing how you 
store and transport messages (be they plain text, HTML, or whatever) 
from how a user interacts with them.

Take something like Wikipedia (by which, I really mean, MediaWiki, which 
is the underlying software package).  Most people think of Wikipedia as 
a web site.  But, there's another layer below that which lets you get 
access to the contents of articles, navigate all the rich connections 
like category trees, and all sorts of metadata like edit histories.  
Which means, if I wanted to (and many examples of this exist), I can 
write my own client which presents the same information in different 
ways.


#61141

From	rusi <rustompmody@gmail.com>
Date	2013-12-06 05:03 -0800
Message-ID	<ae4b6a4d-fbd4-4d10-a860-9589e6045d16@googlegroups.com>
In reply to	#61118

On Friday, December 6, 2013 1:06:30 PM UTC+5:30, Roy Smith wrote:
>  Rusi  wrote:

> > On Thursday, December 5, 2013 6:28:54 AM UTC+5:30, Roy Smith wrote:

> > > The real problem with web forums is they conflate transport and 
> > > presentation into a single opaque blob, and are pretty much universally 
> > > designed to be a closed system.  Mail and usenet were both engineered to 
> > > make a sharp division between transport and presentation, which meant it 
> > > was possible to evolve each at their own pace.
> > > Mostly that meant people could go off and develop new client 
> > > applications which interoperated with the existing system.  But, it also 
> > > meant that transport layers could be switched out (as when NNTP 
> > > gradually, but inexorably, replaced UUCP as the primary usenet transport 
> > > layer).
> > There is a deep assumption hovering round-about the above -- what I
> > will call the 'Unix assumption(s)'.

> It has nothing to do with Unix.  The separation of transport from 
> presentation is just as valid on Windows, Mac, etc.

> > But before that, just a check on
> > terminology. By 'presentation' you mean what people normally call
> > 'mail-clients': thunderbird, mutt etc. And by 'transport' you mean
> > sendmail, exim, qmail etc etc -- what normally are called
> > 'mail-servers.'  Right??

> Yes.

> > Assuming this is the intended meaning of the terminology (yeah its
> > clearer terminology than the usual and yeah Im also a 'Unix-guy'),
> > here's the 'Unix-assumption':
> >   - human communication�
> > (is not very different from)
> >   - machine communication�
> > (can be done by)
> >   - text�
> > (for which)
> >   - ASCII is fine�
> > (which is just)
> >   - bytes�
> > (inside/between byte-memory-organized)
> >   - von Neumann computers
> > To the extent that these assumptions are invalid, the 'opaque-blob'
> > may well be preferable.

> I think you're off on the wrong track here.  This has nothing to do with 
> plain text (ascii or otherwise).  It has to do with divorcing how you 
> store and transport messages (be they plain text, HTML, or whatever) 
> from how a user interacts with them.

Evidently (and completely inadvertently) this exchange has just
illustrated one of the inadmissible assumptions:

"unicode as a medium is universal in the same way that ASCII used to be"

I wrote a number of ellipsis characters ie codepoint 2026 as in:

  - human communication…
(is not very different from)
  - machine communication… 

Somewhere between my sending and your quoting those ellipses became
the replacement character FFFD

> >   - human communication�
> > (is not very different from)
> >   - machine communication�

Leaving aside whose fault this is (very likely buggy google groups),
this mojibaking cannot happen if the assumption "All text is ASCII"
were to uniformly hold.

Of course with unicode also this can be made to not happen, but that
is fragile and error-prone.  And that is because ASCII (not extended)
is ONE thing in a way that unicode, a hopelessly motley and
inconsistent variety, is not.

With unicode there are in-memory formats, transportation formats eg
UTF-8, strange beasties like FSR (which then hopelessly and
inveterately tickle our resident trolls!), multi-layer encodings (in
html), BOMs and unnecessary/inconsistent BOMs (in microsoft-notepad).
With ASCII, ASCII is ASCII; ie "ABC" is 65,66,67 whether it's in-core,
in-file, in-pipe or whatever.  Ok there are a few wrinkles to this
eg. the null-terminator in C-strings. I think this is the exception to
the rule that in classic Unix, ASCII is completely inter-operable and
therefore a universal data-structure for inter-process or inter-machine
communication.
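
The contrast can be sketched in a few lines of Python. This is only an illustration using the standard codecs; the variable names are arbitrary.

```python
text = "ABC"

# ASCII: one byte per character, the same numbers everywhere.
ascii_bytes = text.encode("ascii")
print(list(ascii_bytes))          # the three code values 65, 66, 67

# The same three characters under several Unicode encodings.
utf8 = text.encode("utf-8")           # identical to the ASCII bytes
utf8_bom = text.encode("utf-8-sig")   # UTF-8 preceded by a byte order mark
utf16_le = text.encode("utf-16-le")   # two bytes per character, low byte first
utf16_be = text.encode("utf-16-be")   # two bytes per character, high byte first

for name, data in [("utf-8", utf8), ("utf-8-sig", utf8_bom),
                   ("utf-16-le", utf16_le), ("utf-16-be", utf16_be)]:
    print(name, data)

# A non-ASCII character such as the ellipsis has no ASCII form at all.
ellipsis = "\u2026"
try:
    ellipsis.encode("ascii")
except UnicodeEncodeError as exc:
    print("not ASCII:", exc.reason)
print(ellipsis.encode("utf-8"))
```

For pure ASCII text the UTF-8 bytes are the same as the ASCII bytes; the variety only shows up with BOMs, wider encodings, or characters outside the ASCII range.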

It is this universal data structure that makes classic unix pipes and
filters possible and easy (of which your separation of presentation
and transportation is just one case).

Give it up and the composability goes with it.

Go up from the ASCII -> Unicode level to the plain-text -> hypertext
(aka html) level and these composability problems hit with redoubled
force.

> Take something like Wikipedia (by which, I really mean, MediaWiki, which 
> is the underlying software package).  Most people think of Wikipedia as 
> a web site.  But, there's another layer below that which lets you get 
> access to the contents of articles, navigate all the rich connections 
> like category trees, and all sorts of metadata like edit histories.  
> Which means, if I wanted to (and many examples of this exist), I can 
> write my own client which presents the same information in different 
> ways.

Not sure what your point is.
Html is one universal format -- ok for presentation, bad for
data-structuring.
SQL databases (assuming thats the mediawiki backend) is another -- ok for 
data-structuring bad for presentation.

Mediawiki mediates between the two formats.

Beyond that I lost you... what are you trying to say??


#61143

From	Chris Angelico <rosuav@gmail.com>
Date	2013-12-07 00:19 +1100
Message-ID	<mailman.3645.1386335953.18130.python-list@python.org>
In reply to	#61141

On Sat, Dec 7, 2013 at 12:03 AM, rusi <rustompmody@gmail.com> wrote:
> SQL databases (assuming thats the mediawiki backend) is another -- ok for
> data-structuring bad for presentation.

No, SQL databases don't store structured text. MediaWiki just stores a
single blob (not in the database sense of that word) of text.

ChrisA


#61144

From	rusi <rustompmody@gmail.com>
Date	2013-12-06 05:32 -0800
Message-ID	<10982d24-d47c-4a99-a93a-360fbe6b52ed@googlegroups.com>
In reply to	#61143

On Friday, December 6, 2013 6:49:04 PM UTC+5:30, Chris Angelico wrote:
> On Sat, Dec 7, 2013 at 12:03 AM, rusi wrote:
> > SQL databases (assuming thats the mediawiki backend) is another -- ok for
> > data-structuring bad for presentation.

> No, SQL databases don't store structured text. MediaWiki just stores a
> single blob (not in the database sense of that word) of text.

I guess we are using 'structured' in different ways.  All I am saying
is that mediawiki which seems to present as html, actually stores its
stuff as SQL -- nothing more or less structured than the schemas here:
http://www.mediawiki.org/wiki/Manual:MediaWiki_architecture#Database_and_text_storage


#61145

From	Chris Angelico <rosuav@gmail.com>
Date	2013-12-07 00:48 +1100
Message-ID	<mailman.3646.1386337709.18130.python-list@python.org>
In reply to	#61144

On Sat, Dec 7, 2013 at 12:32 AM, rusi <rustompmody@gmail.com> wrote:
> I guess we are using 'structured' in different ways.  All I am saying
> is that mediawiki which seems to present as html, actually stores its
> stuff as SQL -- nothing more or less structured than the schemas here:
> http://www.mediawiki.org/wiki/Manual:MediaWiki_architecture#Database_and_text_storage

Yeah, but the structure is all about the metadata. Ultimately, there's
one single text field containing the entire content as you would see
it in the page editor: wiki markup in straight text. MediaWiki uses an
SQL database to store that lump of text, but ultimately the
relationship is between wikitext and HTML, no SQL involvement.

Wiki markup is reasonable for text structuring. (Not for generic data
structuring, but it's decent for text.) Same with reStructuredText,
used for PEPs. An SQL database is a good way to store mappings of
"this key, this tuple of data" and retrieve them conveniently,
including (and this is the bit that's more complicated in a straight
Python dictionary) using any value out of the tuple as the key, and
(and this is where a dict *really* can't hack it) storing/retrieving
more data than fits in memory. The two are orthogonal. Your point is
better supported by wikitext than by SQL, here, except that there
aren't fifty other systems that parse and display wikitext. In fact,
what you're suggesting is a good argument for deprecating HTML email
in favour of RST email, and using docutils to render the result either
as HTML (for webmail users) or as some other format. And I wouldn't be
against that :) But good luck convincing the world that Microsoft
Outlook is doing the wrong thing.
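
The "this key, this tuple of data" description of an SQL database can be sketched with Python's sqlite3 module. The table layout and the sample rows below are invented for the example and are not MediaWiki's real schema.

```python
import sqlite3

# In-memory database; the "page" table here is a made-up stand-in.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page ("
    " page_id INTEGER PRIMARY KEY,"
    " title TEXT UNIQUE,"
    " author TEXT,"
    " wikitext TEXT)"
)
conn.executemany(
    "INSERT INTO page (page_id, title, author, wikitext) VALUES (?, ?, ?, ?)",
    [
        (1, "Usenet", "roy", "'''Usenet''' is a discussion system."),
        (2, "NNTP", "roy", "'''NNTP''' replaced UUCP as the main transport."),
        (3, "ASCII", "steven", "'''ASCII''' is a seven-bit code."),
    ],
)


def by_title(title):
    """Look a row up by one column of the tuple..."""
    return conn.execute(
        "SELECT page_id, author, wikitext FROM page WHERE title = ?", (title,)
    ).fetchone()


def titles_by_author(author):
    """...or by another column, which a plain dict keyed on title cannot do."""
    rows = conn.execute(
        "SELECT title FROM page WHERE author = ? ORDER BY title", (author,)
    )
    return [row[0] for row in rows]


print(by_title("ASCII"))
print(titles_by_author("roy"))
```

Note that the wikitext column is just an opaque string as far as the database is concerned; the markup inside it is structured by a different mechanism entirely.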

ChrisA


#61151

From	rusi <rustompmody@gmail.com>
Date	2013-12-06 06:11 -0800
Message-ID	<fd3cf10d-f0a1-4693-b534-6d65a262a9ec@googlegroups.com>
In reply to	#61145

On Friday, December 6, 2013 7:18:19 PM UTC+5:30, Chris Angelico wrote:
> On Sat, Dec 7, 2013 at 12:32 AM, rusi  wrote:
> > I guess we are using 'structured' in different ways.  All I am saying
> > is that mediawiki which seems to present as html, actually stores its
> > stuff as SQL -- nothing more or less structured than the schemas here:
> > http://www.mediawiki.org/wiki/Manual:MediaWiki_architecture#Database_and_text_storage

> Yeah, but the structure is all about the metadata.

Ok (I'd drop the 'all')

> Ultimately, there's one single text field containing the entire content

Right

> as you would see it in the page editor: wiki markup in straight text.

Aha! There you are! It's the 'page editor' here, and not the html that
a browser's 'display source' (control-u) would show. And mediawiki
is the software that mediates.

The usual direction (seen by users of wikipedia) is that mediawiki
takes this text, along with the other unrelated stuff (metadata?) seen
around it -- sidebar, tabs, css settings etc -- and munges it all into html.

The other direction (seen by editors of wikipedia) is that you edit a
page and that page and history etc will show the changes,
reflecting the fact that the SQL content has changed.

> MediaWiki uses an SQL database to store that lump of text, but
> ultimately the relationship is between wikitext and HTML, no SQL
> involvement.

Dunno what you mean. Every time someone browses wikipedia, things are
getting pulled out of the SQL and munged into the html (s)he sees.


#61152

From	Chris Angelico <rosuav@gmail.com>
Date	2013-12-07 01:51 +1100
Message-ID	<mailman.3651.1386341477.18130.python-list@python.org>
In reply to	#61151

On Sat, Dec 7, 2013 at 1:11 AM, rusi <rustompmody@gmail.com> wrote:
> Aha! There you are! It's the 'page editor' here, and not the html that
> a browser's 'display source' (control-u) would show. And mediawiki
> is the software that mediates.
>
> The usual direction (seen by users of wikipedia) is that mediawiki
> takes this text, along with the other unrelated stuff (metadata?) seen
> around it -- sidebar, tabs, css settings etc -- and munges it all into html.
>
> The other direction (seen by editors of wikipedia) is that you edit a
> page and that page and history etc will show the changes,
> reflecting the fact that the SQL content has changed.

MediaWiki is fundamentally very similar to a structure that I'm trying
to deploy for a community web site that I host, approximately thus:

* A git repository stores a bunch of RST files
* A script auto-generates index files based on the presence of certain
file names, and renders via rst2html
* The HTML pages are served as static content

MediaWiki is like this:

* Each page has a history, represented by a series of state snapshots
of wikitext
* On display, the wikitext is converted to HTML and served.

The main difference is that MediaWiki is optimized for rapid and
constant editing, where what I'm pushing for is optimized for less
common edits that might span multiple files. (MW has no facility for
atomically changing multiple pages, and atomically reverting those
changes, and so on. Each page stands alone.) They're still broadly
doing the same thing: storing marked-up text and rendering HTML. The
fact that one uses an SQL database and the other uses a git repository
is actually quite insignificant - it's as significant as the choice of
whether to store your data on a hard disk or an SSD. The system is no
different.
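
A toy Python sketch of "store marked-up text, render HTML" makes the point that the storage layer is incidental. The Page class and the render function are hypothetical, the history lives in a plain list, and the renderer understands only the bold and italic quote markup, nothing like real wikitext.

```python
import html
import re


class Page:
    """A page is just a history of wikitext snapshots."""

    def __init__(self):
        self.revisions = []

    def edit(self, wikitext):
        """Every edit appends a complete new snapshot."""
        self.revisions.append(wikitext)

    def revert(self):
        """Reverting appends a copy of the previous snapshot."""
        if len(self.revisions) >= 2:
            self.revisions.append(self.revisions[-2])

    def current(self):
        return self.revisions[-1]


def render(wikitext):
    """Toy renderer: escape HTML, then handle '''bold''' and ''italic''."""
    out = html.escape(wikitext, quote=False)   # keep the apostrophes intact
    out = re.sub(r"'''(.+?)'''", r"<b>\1</b>", out)
    out = re.sub(r"''(.+?)''", r"<i>\1</i>", out)
    return "<p>" + out + "</p>"


page = Page()
page.edit("'''Usenet''' is ''old''")
page.edit("'''Usenet''' is ''old'' & 1 < 2")
print(render(page.current()))
page.revert()
print(render(page.current()))
```

Swapping the list for an SQL table or a git repository would change how snapshots are stored, not what render does with them.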

>> MediaWiki uses an SQL database to store that lump of text, but
>> ultimately the relationship is between wikitext and HTML, no SQL
>> involvement.
>
> Dunno what you mean. Every time someone browses wikipedia, things are
> getting pulled out of the SQL and munged into the html (s)he sees.

Yes, but that's just mechanics. The fact that the PHP scripts to
operate Wikipedia are being pulled off a file system doesn't mean that
MediaWiki is an ext3-to-HTML renderer. It's a wikitext-to-HTML
renderer.

Anyway. As I said, your point is still mostly there, as long as you
use wikitext rather than SQL.

ChrisA


#61176 — ASCII and Unicode [was Re: Managing Google Groups headaches]

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-12-06 19:00 +0000
Subject	ASCII and Unicode [was Re: Managing Google Groups headaches]
Message-ID	<52a21ec1$0$30003$c3e8da3$5496439d@news.astraweb.com>
In reply to	#61141

On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:

> Evidently (and completely inadvertently) this exchange has just
> illustrated one of the inadmissible assumptions:
> 
> "unicode as a medium is universal in the same way that ASCII used to be"

Ironically, your post was not Unicode.

Seriously. I am 100% serious.

Your post was sent using a legacy encoding, Windows-1252, also known as 
CP-1252, which is most certainly *not* Unicode. Whatever software you 
used to send the message correctly flagged it with a charset header:

Content-Type: text/plain; charset=windows-1252

Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle 
encodings correctly (or at all!), it screws up the encoding then sends a 
reply with no charset line at all. This is one bug that cannot be blamed 
on Google Groups -- or on Unicode.
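
What honouring that header looks like can be sketched with Python's email package. The raw message below is a made-up stand-in, not the actual post, and the "careless" decode is only one guess at what a charset-unaware client might do.

```python
import email

# Hypothetical raw message: the charset header from the post, plus one line
# whose final byte 0x85 is the CP1252 ellipsis.
raw_message = (
    b"Content-Type: text/plain; charset=windows-1252\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"  - human communication\x85\r\n"
)

msg = email.message_from_bytes(raw_message)

# A charset-aware client reads the declared charset and uses it to decode.
charset = msg.get_content_charset()
payload = msg.get_payload(decode=True)   # the body as raw bytes
body = payload.decode(charset)
print(charset, repr(body))

# A client that ignores the header and assumes ASCII loses the character.
careless = payload.decode("ascii", errors="replace")
print(repr(careless))
```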

> I wrote a number of ellipsis characters ie codepoint 2026 as in:

Actually you didn't. You wrote a number of ellipsis characters, hex byte 
\x85 (decimal 133), in the CP1252 charset. That happens to be mapped to 
code point U+2026 in Unicode, but the two are as distinct as ASCII and 
EBCDIC.
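
The distinction between the byte and the code point can be checked directly in Python. This is a small sketch using only the built-in codecs; it shows how a replacement character can arise, not what MT-NewsWatcher actually does internally.

```python
raw = b"\x85"  # the single byte that was sent

# Decoded with the declared charset, the byte is the ellipsis, U+2026.
as_cp1252 = raw.decode("cp1252")

# Decoded as Latin-1, the same byte is a C1 control character, U+0085.
as_latin1 = raw.decode("latin-1")

# Decoded as UTF-8 or ASCII it is invalid, so a lenient decoder substitutes
# the replacement character, U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")
as_ascii = raw.decode("ascii", errors="replace")

for label, value in [("cp1252", as_cp1252), ("latin-1", as_latin1),
                     ("utf-8", as_utf8), ("ascii", as_ascii)]:
    print(label, "U+%04X" % ord(value))

# Going the other way, U+2026 encoded as UTF-8 takes three bytes, not one.
print("\u2026".encode("utf-8"))
```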

> Somewhere between my sending and your quoting those ellipses became the
> replacement character FFFD

Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about 
encodings and character sets. It doesn't just assume things are ASCII, 
but makes a half-hearted attempt to be charset-aware, but badly. I can 
only imagine that it was written back in the Dark Ages where there were a 
lot of different charsets in use but no conventions for specifying which 
charset was in use. Or perhaps the author was smoking crack while coding.

> Leaving aside whose fault this is (very likely buggy google groups),
> this mojibaking cannot happen if the assumption "All text is ASCII" were
> to uniformly hold.

This is incorrect. People forget that ASCII has evolved since the first 
version of the standard in 1963. There have actually been five versions 
of the ASCII standard, plus one unpublished version. (And that's not 
including the things which are frequently called ASCII but aren't.)

ASCII-1963 didn't even include lowercase letters. It is also missing some 
graphic characters like braces, and included at least two characters no 
longer used, the up-arrow and left-arrow. The control characters were 
also significantly different from today.

ASCII-1965 was unpublished and unused. I don't know the details of what 
it changed.

ASCII-1967 is a lot closer to the ASCII in use today. It made 
considerable changes to the control characters, moving, adding, removing, 
or renaming at least half a dozen control characters. It officially added 
lowercase letters, braces, and some others. It replaced the up-arrow 
character with the caret and the left-arrow with the underscore. It was 
ambiguous, allowing variations and substitutions, e.g.:

    - character 33 was permitted to be either the exclamation 
      mark ! or the logical OR symbol |

    - consequently character 124 (vertical bar) was always 
      displayed as a broken bar ¦, which explains why even today
      many keyboards show it that way

    - character 35 was permitted to be either the number sign # or 
      the pound sign £

    - character 94 could be either a caret ^ or a logical NOT ¬

Even the humble comma could be pressed into service as a cedilla.

ASCII-1968 didn't change any characters, but allowed the use of LF on its 
own. Previously, you had to use either LF/CR or CR/LF as newline.

ASCII-1977 removed the ambiguities from the 1967 standard.

The most recent version is ASCII-1986 (also known as ANSI X3.4-1986). 
Unfortunately I haven't been able to find out what changes were made -- I 
presume they were minor, and didn't affect the character set.

So as you can see, even with actual ASCII, you can have mojibake. It's 
just not normally called that. But if you are given an arbitrary ASCII 
file of unknown age, containing code 94, how can you be sure it was 
intended as a caret rather than a logical NOT symbol? You can't.

Then there are at least 30 official variations of ASCII, strictly 
speaking part of ISO-646. These 7-bit codes were commonly called "ASCII" 
by their users, despite the differences, e.g. replacing the dollar sign $ 
with the international currency sign ¤, or replacing the left brace 
{ with the letter s with caron š.
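
A small Python sketch of why the same seven-bit codes read differently under different national variants. The variant labels and the decode_7bit helper are invented for this example, and each table lists only the single substitution mentioned above rather than a complete ISO 646 variant.

```python
# Hypothetical variant tables: each maps a code to the character shown
# instead of the US-ASCII one. Only the substitutions named in the text.
VARIANTS = {
    "us": {},
    "pound-variant": {0x23: "\u00a3"},      # number sign -> pound sign
    "currency-variant": {0x24: "\u00a4"},   # dollar sign -> currency sign
    "caron-variant": {0x7B: "\u0161"},      # left brace -> s with caron
}


def decode_7bit(data, variant):
    """Decode seven-bit codes using the chosen variant's substitutions."""
    overrides = VARIANTS[variant]
    chars = []
    for code in data:
        if code > 0x7F:
            raise ValueError("not a seven-bit code: %#x" % code)
        chars.append(overrides.get(code, chr(code)))
    return "".join(chars)


data = b"#5 {x} $"
for name in VARIANTS:
    print(name, decode_7bit(data, name))
```

The bytes never change; only the reader's table does, which is the same mechanism as mojibake between modern encodings.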

One consequence of this is that the MIME type for ASCII text is called 
"US ASCII", despite the redundancy, because many people expect "ASCII" 
alone to mean whatever national variation they are used to.

But it gets worse: there are proprietary variations on ASCII which are 
commonly called "ASCII" but aren't, including dozens of 8-bit so-called 
"extended ASCII" character sets, which is where the problems *really* 
pile up. Invariably back in the 1980s and early 1990s people used to call 
these "ASCII" no matter that they used 8-bits and contained anything up 
to 256 characters.

Just because somebody calls something "ASCII", doesn't make it so; even 
if it is ASCII, doesn't mean you know which version of ASCII; even if you 
know which version, doesn't mean you know how to interpret certain codes. 
It simply is *wrong* to think that "good ol' plain ASCII text" is 
unambiguous and devoid of problems.

> With unicode there are in-memory formats, transportation formats eg
> UTF-8, 

And the same applies to ASCII. 

ASCII is a *seven-bit code*. It will work fine on computers where the 
word-size is seven bits. If the word-size is eight bits, or more, you 
have to pad the ASCII code. How do you do that? Pad the most-significant 
end or the least significant end? That's a choice there. How do you pad 
it, with a zero or a one? That's another choice. If your word-size is 
more than eight bits, you might even pad *both* ends.

In C, a char is defined as the smallest addressable unit of the machine 
that can contain the basic character set, not necessarily eight bits. 
Implementations of C and C++ sometimes reserve 8, 9, 16, 32, or 36 bits 
as a "byte" and/or char. Your in-memory representation of ASCII "a" could 
easily end up as bits 001100001 or 0000000001100001.
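
The padding choices can be made concrete in Python by treating the bits as a string. The pad helper below is hypothetical; it simply shows that width, fill bit and side are all independent decisions.

```python
def pad(bits, width, fill="0", side="msb"):
    """Pad a bit string to `width` bits with `fill`, on the chosen side."""
    if side == "msb":
        return bits.rjust(width, fill)   # padding goes on the left
    if side == "lsb":
        return bits.ljust(width, fill)   # padding goes on the right
    raise ValueError("side must be 'msb' or 'lsb'")


seven = format(ord("a"), "07b")          # the bare seven-bit code for 'a'
print(seven)
print(pad(seven, 9))                     # nine-bit char, zero-padded
print(pad(seven, 16))                    # sixteen-bit char, zero-padded
print(pad(seven, 8, fill="1"))           # eight bits, padded with a one
print(pad(seven, 8, side="lsb"))         # eight bits, padded at the low end
```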

And then there is the question of whether ASCII characters should be Big 
Endian or Little Endian. I'm referring here to bit endianness, rather 
than bytes: should character 'a' be represented as bits 1100001 (most 
significant bit to the left) or 1000011 (least significant bit to the 
left)? This may be relevant with certain networking protocols. Not all 
networking protocols are big-endian, nor are all processors. The Ada 
programming language even supports both bit orders.

When transmitting ASCII characters, the networking protocol could include 
various start and stop bits and parity codes. A single 7-bit ASCII 
character might be anything up to 12 bits in length on the wire. It is 
simply naive to imagine that the transmission of ASCII codes is the same 
as the in-memory or on-disk storage of ASCII.
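
A sketch of asynchronous serial framing in Python shows how seven data bits grow on the wire. The frame helper is hypothetical; it assumes a start bit of 0, data bits sent least significant bit first, an optional parity bit, and a whole number of stop bits of 1.

```python
def frame(ch, parity="even", stop_bits=1):
    """Return the list of bits sent on the wire for one 7-bit character."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a seven-bit character")
    data = [(code >> i) & 1 for i in range(7)]   # least significant bit first
    bits = [0] + data                            # start bit, then data bits
    if parity == "even":
        bits.append(sum(data) % 2)               # total count of ones is even
    elif parity == "odd":
        bits.append(1 - sum(data) % 2)           # total count of ones is odd
    elif parity != "none":
        raise ValueError("parity must be 'even', 'odd' or 'none'")
    bits += [1] * stop_bits                      # stop bit(s)
    return bits


print(frame("a"))                                # 7 data bits, even parity, 1 stop
print(len(frame("a", parity="none")))            # no parity bit
print(len(frame("a", stop_bits=2)))              # two stop bits
```

With one start bit, seven data bits, a parity bit and one stop bit, each character already occupies ten bit times.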

You're lucky to be active in a time when most common processors have 
standardized on a single bit-order, and when most (but not all) network 
protocols have done the same. But that doesn't mean that these issues 
don't exist for ASCII. If you get a message that purports to be ASCII 
text but looks like this:

"\tS\x1b\x1b{\x02u{'\x1b\x13B"

you should suspect strongly that it is "Hello World!" which has been 
accidentally bit-reversed by some rogue piece of hardware.
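
The reversal is easy to reproduce in Python. This sketch reverses each character within a fixed seven-bit width; the function names are made up for the example. With a fixed width, the space comes out as \x02 and the exclamation mark as B; dropping leading zero bits before reversing would instead give \x01 and !.

```python
def reverse_7bit(ch):
    """Reverse the bit order of one character within exactly seven bits."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a seven-bit character")
    bits = format(code, "07b")           # always seven digits, zero-padded
    return chr(int(bits[::-1], 2))


def reverse_text(text):
    """Apply the seven-bit reversal to every character."""
    return "".join(reverse_7bit(ch) for ch in text)


garbled = reverse_text("Hello World!")
print(repr(garbled))

# Reversing twice gives the original text back.
print(reverse_text(garbled))
```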

-- 
Steven


#61180 — Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

From	Gene Heskett <gheskett@wdtv.com>
Date	2013-12-06 14:34 -0500
Subject	Re: ASCII and Unicode [was Re: Managing Google Groups headaches]
Message-ID	<mailman.3663.1386358504.18130.python-list@python.org>
In reply to	#61176

On Friday 06 December 2013 14:30:06 Steven D'Aprano did opine:

> On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:
> > Evidently (and completely inadvertently) this exchange has just
> > illustrated one of the inadmissible assumptions:
> > 
> > "unicode as a medium is universal in the same way that ASCII used to
> > be"
> 
> Ironically, your post was not Unicode.
> 
> Seriously. I am 100% serious.
> 
> Your post was sent using a legacy encoding, Windows-1252, also known as
> CP-1252, which is most certainly *not* Unicode. Whatever software you
> used to send the message correctly flagged it with a charset header:
> 
> Content-Type: text/plain; charset=windows-1252
> 
> Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle
> encodings correctly (or at all!), it screws up the encoding then sends a
> reply with no charset line at all. This is one bug that cannot be blamed
> on Google Groups -- or on Unicode.
> 
> > I wrote a number of ellipsis characters ie codepoint 2026 as in:
> Actually you didn't. You wrote a number of ellipsis characters, hex byte
> \x85 (decimal 133), in the CP1252 charset. That happens to be mapped to
> code point U+2026 in Unicode, but the two are as distinct as ASCII and
> EBCDIC.
> 
> > Somewhere between my sending and your quoting those ellipses became
> > the replacement character FFFD
> 
> Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about
> encodings and character sets. It doesn't just assume things are ASCII,
> but makes a half-hearted attempt to be charset-aware, but badly. I can
> only imagine that it was written back in the Dark Ages where there were
> a lot of different charsets in use but no conventions for specifying
> which charset was in use. Or perhaps the author was smoking crack while
> coding.
> 
> > Leaving aside whose fault this is (very likely buggy google groups),
> > this mojibaking cannot happen if the assumption "All text is ASCII"
> > were to uniformly hold.
> 
> This is incorrect. People forget that ASCII has evolved since the first
> version of the standard in 1963. There have actually been five versions
> of the ASCII standard, plus one unpublished version. (And that's not
> including the things which are frequently called ASCII but aren't.)
> 
> ASCII-1963 didn't even include lowercase letters. It is also missing
> some graphic characters like braces, and included at least two
> characters no longer used, the up-arrow and left-arrow. The control
> characters were also significantly different from today.
> 
> ASCII-1965 was unpublished and unused. I don't know the details of what
> it changed.
> 
> ASCII-1967 is a lot closer to the ASCII in use today. It made
> considerable changes to the control characters, moving, adding,
> removing, or renaming at least half a dozen control characters. It
> officially added lowercase letters, braces, and some others. It
> replaced the up-arrow character with the caret and the left-arrow with
> the underscore. It was ambiguous, allowing variations and
> substitutions, e.g.:
> 
>     - character 33 was permitted to be either the exclamation
>       mark ! or the logical OR symbol |
> 
>     - consequently character 124 (vertical bar) was always
>       displayed as a broken bar ¦, which explains why even today
>       many keyboards show it that way
> 
>     - character 35 was permitted to be either the number sign # or
>       the pound sign £
> 
>     - character 94 could be either a caret ^ or a logical NOT ¬
> 
> Even the humble comma could be pressed into service as a cedilla.
> 
> ASCII-1968 didn't change any characters, but allowed the use of LF on
> its own. Previously, you had to use either LF/CR or CR/LF as newline.
> 
> ASCII-1977 removed the ambiguities from the 1967 standard.
> 
> The most recent version is ASCII-1986 (also known as ANSI X3.4-1986).
> Unfortunately I haven't been able to find out what changes were made --
> I presume they were minor, and didn't affect the character set.
> 
> So as you can see, even with actual ASCII, you can have mojibake. It's
> just not normally called that. But if you are given an arbitrary ASCII
> file of unknown age, containing code 94, how can you be sure it was
> intended as a caret rather than a logical NOT symbol? You can't.
> 
> Then there are at least 30 official variations of ASCII, strictly
> speaking part of ISO-646. These 7-bit codes were commonly called "ASCII"
> by their users, despite the differences, e.g. replacing the dollar sign
> $ with the international currency sign ¤, or replacing the left brace
> { with the letter s with caron š.
> 
> One consequence of this is that the MIME type for ASCII text is called
> "US ASCII", despite the redundancy, because many people expect "ASCII"
> alone to mean whatever national variation they are used to.
> 
> But it gets worse: there are proprietary variations on ASCII which are
> commonly called "ASCII" but aren't, including dozens of 8-bit so-called
> "extended ASCII" character sets, which is where the problems *really*
> pile up. Invariably back in the 1980s and early 1990s people used to
> call these "ASCII" no matter that they used 8-bits and contained
> anything up to 256 characters.
> 
> Just because somebody calls something "ASCII", doesn't make it so; even
> if it is ASCII, doesn't mean you know which version of ASCII; even if
> you know which version, doesn't mean you know how to interpret certain
> codes. It simply is *wrong* to think that "good ol' plain ASCII text"
> is unambiguous and devoid of problems.
> 
> > With unicode there are in-memory formats, transportation formats eg
> > UTF-8,
> 
> And the same applies to ASCII.
> 
> ASCII is a *seven-bit code*. It will work fine on computers where the
> word-size is seven bits. If the word-size is eight bits, or more, you
> have to pad the ASCII code. How do you do that? Pad the most-significant
> end or the least significant end? That's a choice there. How do you pad
> it, with a zero or a one? That's another choice. If your word-size is
> more than eight bits, you might even pad *both* ends.
> 
> In C, a char is defined as the smallest addressable unit of the machine
> that can contain the basic character set, not necessarily eight bits.
> Implementations of C and C++ sometimes reserve 8, 9, 16, 32, or 36 bits
> as a "byte" and/or char. Your in-memory representation of ASCII "a"
> could easily end up as bits 001100001 or 0000000001100001.
> 
> And then there is the question of whether ASCII characters should be Big
> Endian or Little Endian. I'm referring here to bit endianness, rather
> than bytes: should character 'a' be represented as bits 1100001 (most
> significant bit to the left) or 1000011 (least significant bit to the
> left)? This may be relevant with certain networking protocols. Not all
> networking protocols are big-endian, nor are all processors. The Ada
> programming language even supports both bit orders.
> 
> When transmitting ASCII characters, the networking protocol could
> include various start and stop bits and parity codes. A single 7-bit
> ASCII character might be anything up to 12 bits in length on the wire.
> It is simply naive to imagine that the transmission of ASCII codes is
> the same as the in-memory or on-disk storage of ASCII.
> 
> You're lucky to be active in a time when most common processors have
> standardized on a single bit-order, and when most (but not all) network
> protocols have done the same. But that doesn't mean that these issues
> don't exist for ASCII. If you get a message that purports to be ASCII
> text but looks like this:
> 
> "\tS\x1b\x1b{\x02u{'\x1b\x13B"
> 
> you should suspect strongly that it is "Hello World!" which has been
> accidentally bit-reversed by some rogue piece of hardware.

You can lay a lot of the ASCII ambiguity on D.E.C. and their vt series 
terminals; anything newer than a vt100 made liberal use of the msbit in a 
character.  Having written an emulator for the vt-220, I can testify that 
really getting it right was a right pain in the ass.  And then I added 
zmodem triggers and detections.

Cheers, Gene
-- 
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>

Mother Earth is not flat!
A pen in the hand of this president is far more
dangerous than 200 million guns in the hands of
         law-abiding citizens.


#61183 — Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

From	Roy Smith <roy@panix.com>
Date	2013-12-06 20:54 +0000
Subject	Re: ASCII and Unicode [was Re: Managing Google Groups headaches]
Message-ID	<mailman.3665.1386363607.18130.python-list@python.org>
In reply to	#61176

Steven D'Aprano <steve+comp.lang.python <at> pearwood.info> writes:

> Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about 
> encodings and character sets. It doesn't just assume things are ASCII, 
> but makes a half-hearted attempt to be charset-aware, but badly. I can 
> only imagine that it was written back in the Dark Ages

Indeed.  The basic codebase probably goes back 20 years.  I'm posting this
from gmane, just so people don't think I'm a total luddite.

> When transmitting ASCII characters, the networking protocol could include 
> various start and stop bits and parity codes. A single 7-bit ASCII 
> character might be anything up to 12 bits in length on the wire.

Not to mention that some really old hardware used 1.5 stop bits!
