
[beginner] What's wrong?

Started by	Michael Okuntsov <okuntsov.mikhail@yandex.ru>
First post	2016-04-02 03:48 +0600
Last post	2016-04-04 17:19 -0600
Articles	20 on this page of 110 — 29 participants


  [beginner] What's wrong? Michael Okuntsov <okuntsov.mikhail@yandex.ru> - 2016-04-02 03:48 +0600
    Re: [beginner] What's wrong? Michael Okuntsov <okuntsov.mikhail@yandex.ru> - 2016-04-02 04:10 +0600
      Re: [beginner] What's wrong? sohcahtoa82@gmail.com - 2016-04-01 15:44 -0700
        Re: [beginner] What's wrong? Random832 <random832@fastmail.com> - 2016-04-02 00:27 -0400
        Re: [beginner] What's wrong? Michael Selik <michael.selik@gmail.com> - 2016-04-02 05:36 +0000
        Re: [beginner] What's wrong? William Ray Wing <wrw@mac.com> - 2016-04-02 00:54 -0400
        Re: [beginner] What's wrong? Chris Angelico <rosuav@gmail.com> - 2016-04-02 19:15 +1100
        Re: [beginner] What's wrong? Michael Selik <michael.selik@gmail.com> - 2016-04-02 14:48 +0000
        Re: [beginner] What's wrong? Chris Angelico <rosuav@gmail.com> - 2016-04-03 01:55 +1100
          Re: [beginner] What's wrong? Marko Rauhamaa <marko@pacujo.net> - 2016-04-02 18:07 +0300
            Re: [beginner] What's wrong? Chris Angelico <rosuav@gmail.com> - 2016-04-03 02:36 +1100
            Re: [beginner] What's wrong? Steven D'Aprano <steve@pearwood.info> - 2016-04-03 02:06 +1000
              Re: [beginner] What's wrong? Marko Rauhamaa <marko@pacujo.net> - 2016-04-02 19:44 +0300
                Re: [beginner] What's wrong? Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2016-04-02 19:12 +0200
                  Re: [beginner] What's wrong? Rustom Mody <rustompmody@gmail.com> - 2016-04-02 10:28 -0700
                    Re: [beginner] What's wrong? Marko Rauhamaa <marko@pacujo.net> - 2016-04-02 21:43 +0300
                    Re: [beginner] What's wrong? Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2016-04-03 13:47 +0200
                      Re: [beginner] What's wrong? Rustom Mody <rustompmody@gmail.com> - 2016-04-03 07:30 -0700
                        Re: [beginner] What's wrong? Dan Sommers <dan@tombstonezero.net> - 2016-04-03 15:25 +0000
                          Re: [beginner] What's wrong? Rustom Mody <rustompmody@gmail.com> - 2016-04-03 08:39 -0700
                            Re: [beginner] What's wrong? Dan Sommers <dan@tombstonezero.net> - 2016-04-03 16:22 +0000
                              Re: [beginner] What's wrong? Chris Angelico <rosuav@gmail.com> - 2016-04-04 02:44 +1000
                              Re: [beginner] What's wrong? Rustom Mody <rustompmody@gmail.com> - 2016-04-03 10:18 -0700
                                Re: [beginner] What's wrong? Chris Angelico <rosuav@gmail.com> - 2016-04-04 03:35 +1000
                                Re: [beginner] What's wrong? Dan Sommers <dan@tombstonezero.net> - 2016-04-03 18:26 +0000
                          Re: [beginner] What's wrong? Rustom Mody <rustompmody@gmail.com> - 2016-04-03 08:46 -0700
                            Re: [beginner] What's wrong? Larry Martell <larry.martell@gmail.com> - 2016-04-03 11:55 -0400
                            Re: [beginner] What's wrong? Chris Angelico <rosuav@gmail.com> - 2016-04-04 01:53 +1000
                              Re: [beginner] What's wrong? Rustom Mody <rustompmody@gmail.com> - 2016-04-03 09:49 -0700
                                Re: [beginner] What's wrong? Dan Sommers <dan@tombstonezero.net> - 2016-04-03 18:32 +0000
                            Re: [beginner] What's wrong? Dan Sommers <dan@tombstonezero.net> - 2016-04-03 16:07 +0000
                        Re: [beginner] What's wrong? Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2016-04-06 21:56 +0200
                          Unicode normalisation [was Re: [beginner] What's wrong?] Steven D'Aprano <steve@pearwood.info> - 2016-04-07 11:37 +1000
                            Re: Unicode normalisation [was Re: [beginner] What's wrong?] Marko Rauhamaa <marko@pacujo.net> - 2016-04-07 09:36 +0300
                            Re: Unicode normalisation [was Re: [beginner] What's wrong?] Peter Pearson <pkpearson@nowhere.invalid> - 2016-04-07 16:51 +0000
                              Re: Unicode normalisation [was Re: [beginner] What's wrong?] Rustom Mody <rustompmody@gmail.com> - 2016-04-07 21:43 -0700
                                Re: Unicode normalisation [was Re: [beginner] What's wrong?] Rustom Mody <rustompmody@gmail.com> - 2016-04-07 21:47 -0700
                                Re: Unicode normalisation [was Re: [beginner] What's wrong?] Chris Angelico <rosuav@gmail.com> - 2016-04-08 14:54 +1000
                                  Re: Unicode normalisation [was Re: [beginner] What's wrong?] Rustom Mody <rustompmody@gmail.com> - 2016-04-08 10:51 -0700
                              Re: Unicode normalisation [was Re: [beginner] What's wrong?] Steven D'Aprano <steve@pearwood.info> - 2016-04-08 16:00 +1000
                                Re: Unicode normalisation [was Re: [beginner] What's wrong?] Chris Angelico <rosuav@gmail.com> - 2016-04-08 16:13 +1000
                                Re: Unicode normalisation [was Re: [beginner] What's wrong?] Peter Pearson <pkpearson@nowhere.invalid> - 2016-04-08 17:21 +0000
                                  Re: Unicode normalisation [was Re: [beginner] What's wrong?] Marko Rauhamaa <marko@pacujo.net> - 2016-04-08 20:44 +0300
                                    Re: Unicode normalisation [was Re: [beginner] What's wrong?] Chris Angelico <rosuav@gmail.com> - 2016-04-09 03:50 +1000
                                      Re: Unicode normalisation [was Re: [beginner] What's wrong?] Peter Pearson <pkpearson@nowhere.invalid> - 2016-04-08 18:03 +0000
                                        Re: Unicode normalisation [was Re: [beginner] What's wrong?] Rustom Mody <rustompmody@gmail.com> - 2016-04-08 11:17 -0700
                                          Re: Unicode normalisation [was Re: [beginner] What's wrong?] Rustom Mody <rustompmody@gmail.com> - 2016-04-08 11:20 -0700
                                    Re: Unicode normalisation [was Re: [beginner] What's wrong?] Rustom Mody <rustompmody@gmail.com> - 2016-04-08 11:04 -0700
                                      Re: Unicode normalisation [was Re: [beginner] What's wrong?] Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2016-04-08 20:20 -0400
                                        Re: Unicode normalisation [was Re: [beginner] What's wrong?] alister <alister.ware@ntlworld.com> - 2016-04-09 08:30 +0000
                                          Re: Unicode normalisation [was Re: [beginner] What's wrong?] Ben Bacarisse <ben.usenet@bsb.me.uk> - 2016-04-09 14:43 +0100
                                            Re: Unicode normalisation [was Re: [beginner] What's wrong?] Ben Bacarisse <ben.usenet@bsb.me.uk> - 2016-04-09 15:34 +0100
                                              Re: Unicode normalisation [was Re: [beginner] What's wrong?] Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2016-04-09 14:30 -0400
                                            Re: Unicode normalisation [was Re: [beginner] What's wrong?] Rustom Mody <rustompmody@gmail.com> - 2016-04-09 09:08 -0700
                                              Re: Unicode normalisation [was Re: [beginner] What's wrong?] Ben Bacarisse <ben.usenet@bsb.me.uk> - 2016-04-09 19:27 +0100
                                              Re: Unicode normalisation [was Re: [beginner] What's wrong?] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-04-09 20:25 +0100
                                              Re: Unicode normalisation [was Re: [beginner] What's wrong?] Stephen Hansen <me@ixokai.io> - 2016-04-09 12:45 -0700
                                            Re: Unicode normalisation [was Re: [beginner] What's wrong?] Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-04-10 20:35 +1200
                                      QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?]) Ben Finney <ben+python@benfinney.id.au> - 2016-04-09 10:43 +1000
                                        Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?]) Steven D'Aprano <steve@pearwood.info> - 2016-04-09 13:28 +1000
                                          Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?]) Random832 <random832@fastmail.com> - 2016-04-09 11:44 -0400
                                          Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?]) Random832 <random832@fastmail.com> - 2016-04-09 11:53 -0400
                                            Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?]) Steven D'Aprano <steve@pearwood.info> - 2016-04-18 11:39 +1000
                                              Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?]) Random832 <random832@fastmail.com> - 2016-04-17 22:01 -0400
                                                Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?]) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-04-18 17:21 +1000
                                                  Re: QWERTY was not designed to intentionally slow typists down Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-04-18 21:17 +1200
                                              Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?]) Chris Angelico <rosuav@gmail.com> - 2016-04-18 12:09 +1000
                                              Re: QWERTY was not designed to intentionally slow typists down Michael Torrie <torriem@gmail.com> - 2016-04-17 21:50 -0600
                                              Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?]) Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2016-04-18 00:06 -0400
                                          Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?]) Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2016-04-09 14:52 -0400
                                            Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?]) pyotr filipivich <phamp@mindspring.com> - 2016-04-09 20:09 -0700
                                              Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?]) Ian Kelly <ian.g.kelly@gmail.com> - 2016-04-10 07:43 -0600
                                                Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?]) pyotr filipivich <phamp@mindspring.com> - 2016-04-10 19:14 -0700
                                      Re: QWERTY was not designed to intentionally slow typists down Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-04-09 20:13 +0100
                                        Re: QWERTY was not designed to intentionally slow typists down alister <alister.ware@ntlworld.com> - 2016-04-09 20:22 +0000
                                          Re: QWERTY was not designed to intentionally slow typists down Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-04-09 22:23 +0100
                                          Re: QWERTY was not designed to intentionally slow typists down Tim Golden <mail@timgolden.me.uk> - 2016-04-09 22:51 +0100
                                      Re: QWERTY was not designed to intentionally slow typists down Tim Golden <mail@timgolden.me.uk> - 2016-04-09 20:25 +0100
                                      Re: QWERTY was not designed to intentionally slow typists down Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-04-09 20:36 +0100
                                      Re: QWERTY was not designed to intentionally slow typists down Ethan Furman <ethan@stoneleaf.us> - 2016-04-09 14:33 -0700
                                      RE: [E] QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?]) "Coll-Barth, Michael" <Michael.Coll-Barth@VerizonWireless.com> - 2016-04-09 13:31 -0400
                                  Re: Unicode normalisation [was Re: [beginner] What's wrong?] Steven D'Aprano <steve@pearwood.info> - 2016-04-09 04:44 +1000
                                    Re: Unicode normalisation [was Re: [beginner] What's wrong?] Marko Rauhamaa <marko@pacujo.net> - 2016-04-08 21:55 +0300
                                      Re: Unicode normalisation [was Re: [beginner] What's wrong?] Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-04-10 21:25 +1200
                  Re: [beginner] What's wrong? Steven D'Aprano <steve@pearwood.info> - 2016-04-03 09:49 +1000
                    Re: [beginner] What's wrong? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-04-03 01:26 +0100
                    Re: [beginner] What's wrong? Rustom Mody <rustompmody@gmail.com> - 2016-04-03 07:52 -0700
                      Re: [beginner] What's wrong? Michael Okuntsov <okuntsov.mikhail@yandex.ru> - 2016-04-03 22:24 +0600
                        Re: [beginner] What's wrong? Chris Angelico <rosuav@gmail.com> - 2016-04-04 02:28 +1000
                  Re: [beginner] What's wrong? Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-04-03 16:57 +1200
                    Re: [beginner] What's wrong? Steven D'Aprano <steve@pearwood.info> - 2016-04-03 15:34 +1000
                Re: [beginner] What's wrong? Terry Reedy <tjreedy@udel.edu> - 2016-04-02 15:07 -0400
                  Re: [beginner] What's wrong? Marko Rauhamaa <marko@pacujo.net> - 2016-04-02 22:36 +0300
                    Re: [beginner] What's wrong? Michael Selik <michael.selik@gmail.com> - 2016-04-02 21:42 +0000
                      Re: [beginner] What's wrong? Steven D'Aprano <steve@pearwood.info> - 2016-04-03 10:48 +1000
                        Re: [beginner] What's wrong? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-04-03 02:04 +0100
                          Re: [beginner] What's wrong? alister <alister.ware@ntlworld.com> - 2016-04-03 12:37 +0000
            Re: [beginner] What's wrong? Terry Reedy <tjreedy@udel.edu> - 2016-04-02 14:59 -0400
          Re: [beginner] What's wrong? Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-04-03 16:43 +1200
        Re: [beginner] What's wrong? Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2016-04-02 12:31 -0400
        Re: [beginner] What's wrong? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-04-03 00:58 +0100
        Re: [beginner] What's wrong? sohcahtoa82@gmail.com - 2016-04-08 15:59 -0700
          Re: [beginner] What's wrong? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-04-09 00:07 +0100
      Re: [beginner] What's wrong? Michael Torrie <torriem@gmail.com> - 2016-04-02 16:49 -0600
        Re: [beginner] What's wrong? Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2016-04-03 10:12 +0200
      Re: [beginner] What's wrong? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-04-04 15:04 +0100
        Re: [beginner] What's wrong? BartC <bc@freeuk.com> - 2016-04-04 15:51 +0100
      From email addresses sometimes strange on this list - was Re: [beginner] What's wrong? Michael Torrie <torriem@gmail.com> - 2016-04-04 16:55 -0600
      Re: From email addresses sometimes strange on this list - was Re: [beginner] What's wrong? Chris Angelico <rosuav@gmail.com> - 2016-04-05 08:58 +1000
      Re: From email addresses sometimes strange on this list - was Re: [beginner] What's wrong? Michael Torrie <torriem@gmail.com> - 2016-04-04 17:19 -0600


#106375

From	Dan Sommers <dan@tombstonezero.net>
Date	2016-04-03 16:22 +0000
Message-ID	<ndrg0v$9va$3@dont-email.me>
In reply to	#106367

On Sun, 03 Apr 2016 08:39:02 -0700, Rustom Mody wrote:

> On Sunday, April 3, 2016 at 8:58:59 PM UTC+5:30, Dan Sommers wrote:
>> On Sun, 03 Apr 2016 07:30:47 -0700, Rustom Mody wrote:
>> 
>> > So here are some examples to illustrate what I am saying:
>> 
>> [A vs a, A vs A, ﬂag vs flag, etc.]
> <snip>
>> I understand that in some use cases, ﬂag and flag represent the same
>> English word, but please don't extend that to identifiers in my
>> software.

> I wonder once again if you are getting my point opposite to the one I
> am making.  With ASCII there were problems like O vs 0 -- niggling but
> small.
> 
> With Unicode its a gigantic pandora box.  Python by allowing unicode
> identifiers without restraint has made grief for unsuspecting
> programmers.

What about the A vs a case, which comes up even with ASCII-only
characters?  If those are the same, then I, as a reader of Python code,
have to understand all the rules about ß (which I think have changed
over time), and potentially þ and others.

> That is why my original suggestion that there should have been alongside this
> 'brave new world', a pragma wherein a programmer can EXPLICITLY declare
> #language Greek
> Then he is knowingly opting into possible clashes between A and Α
> But not between A and А.

If I declared #language Greek, then I'd expect an identifier like A to
be rejected by the compiler.  That said, I don't know if that sort of
distinction is as clear cut in every language supported by Unicode.

And just to cause trouble (because that's the way I feel today), can I
declare

#γλώσσα Ελληνική

;-)

> [And if you think the above is a philosophical disquisition on
> Aristotle's law of identity: "A is A" you just proved my point that
> unconstrained Unicode identifiers is a mess]

Can we take a "we're all adults here" approach?  For the same reason
that adults don't use identifiers like xl0, x10, xlO, and xl0 anywhere
near each other, shouldn't we also not use A and A anywhere near each
other?  I certainly don't want the language itself to [try to] reject
x10 and xIO because they look too much alike in many fonts.
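For what it's worth, a check like this does not have to live in the compiler. Here is a minimal sketch of a lint-style helper, assuming that the first word of a character's Unicode name is a good enough stand-in for its script (a rough heuristic, not the real Script property). The names `scripts_in` and `is_mixed_script` are made up for illustration.

```python
import unicodedata


def scripts_in(identifier):
    """Return rough script labels for the letters in identifier.

    Assumption: the first word of the Unicode character name ("LATIN",
    "GREEK", "CYRILLIC", ...) approximates the script of the character.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in identifier
        if ch.isalpha()
    }


def is_mixed_script(identifier):
    """True if the letters of identifier come from more than one script."""
    return len(scripts_in(identifier)) > 1


print(scripts_in("A"))              # Latin capital A
print(scripts_in("\u0391"))         # Greek capital Alpha
print(scripts_in("\u0410"))         # Cyrillic capital A
print(is_mixed_script("x\u0410"))   # Latin x followed by Cyrillic A
print(is_mixed_script("xl0"))       # all ASCII, so not mixed
```

Note that a script check says nothing about xl0 vs x10, which are pure ASCII; that part still comes down to people choosing sensible names.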


#106378

From	Chris Angelico <rosuav@gmail.com>
Date	2016-04-04 02:44 +1000
Message-ID	<mailman.400.1459701865.28225.python-list@python.org>
In reply to	#106375

On Mon, Apr 4, 2016 at 2:22 AM, Dan Sommers <dan@tombstonezero.net> wrote:
> What about the A vs a case, which comes up even with ASCII-only
> characters?  If those are the same, then I, as a reader of Python code,
> have to understand all the rules about ß (which I think have changed
> over time), and potentially þ and others.

And Iİıi, and Σσς, and (if you want completeness) ſ too. And various
other case conversion rules. It's not possible to case-fold perfectly
without knowing what language something is.

This, coupled with the extremely useful case distinction between
"Classes" and "instances", means I'm very much glad Python is case
sensitive. "base = Base()" is perfectly legal and meaningful, no
matter what language you translate those words into (well, as long as
it's bicameral - otherwise you need to adorn one of them somehow, but
you'd have to anyway).
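A quick sketch of how those letters behave with Python 3's built-in str methods, which apply the language-independent Unicode mappings (nothing here knows whether the text is German, Turkish or Greek). The variable names are only for illustration.

```python
# Plain str methods, no locale involved.

# German sharp s: one letter becomes two when uppercased or case-folded.
sharp_s_upper = "ß".upper()
sharp_s_folded = "ß".casefold()

# Turkish dotless i: the mapping is language-independent, so the letter
# does not survive a round trip through upper() and lower().
dotless_round_trip = "ı".upper().lower()

# Capital dotted I lowercases to 'i' plus a combining dot above.
dotted_lower_length = len("İ".lower())

# Greek sigma: final and medial forms fold to the same string.
sigmas_match = "ς".casefold() == "σ".casefold()

# Long s folds to an ordinary s.
long_s_folded = "ſ".casefold()

print(sharp_s_upper)         # SS
print(sharp_s_folded)        # ss
print(dotless_round_trip)    # i
print(dotted_lower_length)   # 2
print(sigmas_match)          # True
print(long_s_folded)         # s
```

A Turkish reader would expect "I".lower() to give dotless ı, which is exactly the sort of rule a case-insensitive language would have to pick a side on.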

ChrisA


#106380

From	Rustom Mody <rustompmody@gmail.com>
Date	2016-04-03 10:18 -0700
Message-ID	<257887ea-df88-4229-b045-57d50b7e60b1@googlegroups.com>
In reply to	#106375

On Sunday, April 3, 2016 at 9:56:24 PM UTC+5:30, Dan Sommers wrote:
> On Sun, 03 Apr 2016 08:39:02 -0700, Rustom Mody wrote:
> 
> > On Sunday, April 3, 2016 at 8:58:59 PM UTC+5:30, Dan Sommers wrote:
> >> On Sun, 03 Apr 2016 07:30:47 -0700, Rustom Mody wrote:
> >> 
> >> > So here are some examples to illustrate what I am saying:
> >> 
> >> [A vs a, A vs A, ﬂag vs flag, etc.]
> > <snip>
> >> I understand that in some use cases, ﬂag and flag represent the same
> >> English word, but please don't extend that to identifiers in my
> >> software.
> 
> > I wonder once again if you are getting my point opposite to the one I
> > am making.  With ASCII there were problems like O vs 0 -- niggling but
> > small.
> > 
> > With Unicode its a gigantic pandora box.  Python by allowing unicode
> > identifiers without restraint has made grief for unsuspecting
> > programmers.
> 
> What about the A vs a case, which comes up even with ASCII-only
> characters?  If those are the same, then I, as a reader of Python code,
> have to understand all the rules about ß (which I think have changed
> over time), and potentially þ and others.

I don't get your point.
If you know German, then these rules should be clear enough to you.
If not, you've probably got bigger problems reading that code anyway.

As illustration, here is Marko's code from a few posts back:

for oppilas in luokka:
    if oppilas.hylätty():
        oppilas.ilmoita(oppilas.koetulokset)

Does it make sense to you?

> 
> > That is why my original suggestion that there should have been alongside this
> > 'brave new world', a pragma wherein a programmer can EXPLICITLY declare
> > #language Greek
> > Then he is knowingly opting into possible clashes between A and Α
> > But not between A and А.
> 
> If I declared #language Greek, then I'd expect an identifier like A to
> be rejected by the compiler.  That said, I don't know if that sort of
> distinction is as clear cut in every language supported by Unicode.
> 
> And just to cause trouble (because that's the way I feel today), can I
> declare
> 
> #γλώσσα Ελληνική
> 
> ;-)
> 
> > [And if you think the above is a philosophical disquisition on
> > Aristotle's law of identity: "A is A" you just proved my point that
> > unconstrained Unicode identifiers is a mess]
> 
> Can we take a "we're all adults here" approach?

Who's the 'we' we are talking about?

> For the same reason
> that adults don't use identifiers like xl0, x10, xlO, and xl0 anywhere
> near each other, shouldn't we also not use A and A anywhere near each
> other?  I certainly don't want the language itself to [try to] reject
> x10 and xIO because they look too much alike in many fonts.

When Kernighan and Ritchie wrote C there was no problem with gets.
Then suddenly, decades later the problem exploded.

What happened?

Here's an analysis:
Security means two almost completely unrelated concepts:
- protection against shooting oneself in the foot (remember the 'protected'
   keyword of C++ ?)
- protection against intelligent, capable, motivated criminals
Let's call them security-s (against stupidity) and security-c (against criminals).

Security-c didn't figure because computers were physically secured anyway and
there was not much internet to speak of.
gets was provided exactly on your principle of 'consenting-adults' -- if you
use it you know what you are using.

Then suddenly computers became net-facing and their servers could be written by
'consenting' (to whom?) adults using gets.

Voila -- Security has just become a lucrative profession!

I believe Python's situation of laissez-faire Unicode is similarly trouble-inviting.

While I personally don't know enough about security to be able to demonstrate a
full sequence of events, here's a little fun I had with Chris:

https://mail.python.org/pipermail/python-list/2014-May/672413.html

Do you not think this could be tailored into something more sinister and
dangerous?


#106382

From	Chris Angelico <rosuav@gmail.com>
Date	2016-04-04 03:35 +1000
Message-ID	<mailman.403.1459704960.28225.python-list@python.org>
In reply to	#106380

On Mon, Apr 4, 2016 at 3:18 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> While I personally dont know enough about security to be able to demonstrate a
> full sequence of events, here's a little fun I had with Chris:
>
> https://mail.python.org/pipermail/python-list/2014-May/672413.html
>
> Do you not think this could be tailored into something more sinister and
> dangerous?

I honestly don't know what you're proving there. You didn't import a
file called "1.py"; you just created a file with a non-ASCII name and
used a non-ASCII identifier to import it. In other words, you did
exactly what Unicode should allow: names in any language.

ChrisA


#106388

From	Dan Sommers <dan@tombstonezero.net>
Date	2016-04-03 18:26 +0000
Message-ID	<ndrn97$9va$4@dont-email.me>
In reply to	#106380

On Sun, 03 Apr 2016 10:18:45 -0700, Rustom Mody wrote:

> On Sunday, April 3, 2016 at 9:56:24 PM UTC+5:30, Dan Sommers wrote:
>> On Sun, 03 Apr 2016 08:39:02 -0700, Rustom Mody wrote:
>> 
>> > On Sunday, April 3, 2016 at 8:58:59 PM UTC+5:30, Dan Sommers wrote:
>> >> On Sun, 03 Apr 2016 07:30:47 -0700, Rustom Mody wrote:
>> >> 
>> >> > So here are some examples to illustrate what I am saying:
>> >> 
>> >> [A vs a, A vs A, ﬂag vs flag, etc.]
>> > <snip>
>> >> I understand that in some use cases, ﬂag and flag represent the same
>> >> English word, but please don't extend that to identifiers in my
>> >> software.
>> 
>> > I wonder once again if you are getting my point opposite to the one I
>> > am making.  With ASCII there were problems like O vs 0 -- niggling but
>> > small.
>> > 
>> > With Unicode its a gigantic pandora box.  Python by allowing unicode
>> > identifiers without restraint has made grief for unsuspecting
>> > programmers.
>> 
>> What about the A vs a case, which comes up even with ASCII-only
>> characters?  If those are the same, then I, as a reader of Python code,
>> have to understand all the rules about ß (which I think have changed
>> over time), and potentially þ and others.
> 
> Dont get your point.
> If you know German then these rules should be clear enough to you
> If not youve probably got bigger problems reading that code anyway

My point is that case sensitivity is good.  I was disagreeing with your
point about Scheme getting A vs a "right" and Python and C and Unix
getting it "wrong."

My larger point, and my experience, is that case sensitivity is easier
to handle than case insensitivity.  Most of the time, the same
letter's capital and small renditions look different from each other (A
vs a, Q vs q, and even Þ and þ is no worse than O and o), and there are
no context sensitive conversion rules to worry about.

> As illustration, here is Marko's code few posts back:
> 
> for oppilas in luokka:
>         if oppilas.hylätty():
>             oppilas.ilmoita(oppilas.koetulokset) 
> 
> Does it make sense to you?

It makes enough sense to recognize the idiom:  for each item in a
collection that satisfies a predicate, call a method on the item.

My point here is that while the identifiers themselves can be enormously
helpful to someone seeing a block of code for the first time or
maintaining it five years later, it's just as important to recognize
quickly that one identifier is not the same as another one, or that a
particular identifier only appears once or only in certain syntactical
constructs.

If the above code were written a little differently, we'd be having a
completely different discussion:

    for list in object:
        if list.clear():
            list.pop(list.append)

>> Can we take a "we're all adults here" approach?
> 
> Who's the 'we' we are talking about?

The community, which has accepted Python as a case-sensitive language and
knows better than to use identifiers that look too much alike or are
otherwise deliberately misleading.

>> For the same reason
>> that adults don't use identifiers like xl0, x10, xlO, and xl0 anywhere
>> near each other, shouldn't we also not use A and A anywhere near each
>> other?  I certainly don't want the language itself to [try to] reject
>> x10 and xIO because they look too much alike in many fonts.
> 
> When Kernighan and Ritchie wrote C there was no problem with gets.
> Then suddenly, decades later the problem exploded.

When Kernighan and Ritchie wrote C there *was* a problem with gets.

> What happened?

The problem was no longer isolated to taking down one Unix process or a
single machine, or discovering passwords on that one machine.

> Here's an analysis:
> Security means two almost completely unrelated concepts
> - protection against shooting oneself in the foot (remember the 'protected' 
>    keyword of C++ ?)
> - protection against intelligent, capable, motivated criminals
> Lets call them security-s (against stupidity) and security-c (against criminals)
> 
> Security-c didnt figure because computers were anyway physically secured and 
> there was no much internet to speak of.
> gets was provided exactly on your principle of 'consenting-adults' -- if you
> use it you know what you are using.
> 
> Then suddenly computers became net-facing and their servers could be
> written by 'consenting' (to whom?) adults using gets.
> 
> Voila -- Security has just become a lucrative profession!

I can't prevent insecure web servers, or unknowing users.  Allowing or
disallowing A and A and А to coexist in the source code doesn't matter.

> I believe python's situation of laissez-faire unicode is similarly
> trouble-inviting.

I'm not sure I agree, but I didn't see timing attacks on cryptographic
algorithms or devices reading passwords from air-gapped computers
coming, either.

I do know that complexity is also a source of bugs and security risks.
Allowing or disallowing certain unicode code points in identifiers, and
declaring that identifiers consisting of the same sequence of code
points are the same, is way less complex than getting something else
(even something as "simple" as case-insensitivity) right for all cases.
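As far as I know, Python 3 already behaves roughly this way, with one wrinkle: per PEP 3131, identifiers are NFKC-normalized when the source is parsed. A minimal sketch of what that means for the examples in this thread; the namespace dicts and variable names are only for illustration.

```python
import unicodedata

latin_a, greek_alpha, cyrillic_a = "A", "\u0391", "\u0410"

# Three different code points; NFKC normalization does not merge them.
normalized = {
    unicodedata.normalize("NFKC", ch)
    for ch in (latin_a, greek_alpha, cyrillic_a)
}
print(len(normalized))

# So they are three distinct identifiers as far as Python is concerned.
namespace = {}
exec("A = 1\n\u0391 = 2\n\u0410 = 3", namespace)
names = [key for key in namespace if key != "__builtins__"]
print(len(names))

# The wrinkle: compatibility characters such as the "fl" ligature are
# folded, so an identifier spelled with U+FB02 is stored as plain "flag".
ligature_namespace = {}
exec("\ufb02ag = 1", ligature_namespace)
print("flag" in ligature_namespace)
```

So the ligature case is already handled by the language, while the Latin/Greek/Cyrillic lookalikes stay distinct and are left to the programmer.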

> While I personally dont know enough about security to be able to demonstrate a
> full sequence of events, here's a little fun I had with Chris:
> 
> https://mail.python.org/pipermail/python-list/2014-May/672413.html
> 
> Do you not think this could be tailored into something more sinister and
> dangerous?


#106368

From	Rustom Mody <rustompmody@gmail.com>
Date	2016-04-03 08:46 -0700
Message-ID	<a3abadcc-a9f9-49ea-b395-15e1c32c9618@googlegroups.com>
In reply to	#106366

On Sunday, April 3, 2016 at 8:58:59 PM UTC+5:30, Dan Sommers wrote:
> Yes, it's marginally annoying, and a security hole waiting to happen,
> that A and A often look very much alike.

"A security hole waiting to happen" = "Marginally annoying"

Frankly I find this juxtaposition alarming

Personal note: I was once idiot enough to have root with password root123
and was transferring some files to a friend ... over ssh...
Lost my entire installation in a matter of minutes.


#106369

From	Larry Martell <larry.martell@gmail.com>
Date	2016-04-03 11:55 -0400
Message-ID	<mailman.395.1459698984.28225.python-list@python.org>
In reply to	#106368

On Sun, Apr 3, 2016 at 11:46 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> Personal note: I once was idiot enough to have root with password root123

I changed my password to "incorrect," so whenever I forget it the
computer will say, "Your password is incorrect."


#106370

From	Chris Angelico <rosuav@gmail.com>
Date	2016-04-04 01:53 +1000
Message-ID	<mailman.396.1459699225.28225.python-list@python.org>
In reply to	#106368

On Mon, Apr 4, 2016 at 1:46 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> On Sunday, April 3, 2016 at 8:58:59 PM UTC+5:30, Dan Sommers wrote:
>> Yes, it's marginally annoying, and a security hole waiting to happen,
>> that A and A often look very much alike.
>
>
> "A security hole waiting to happen" = "Marginally annoying"
>
> Frankly I find this juxtaposition alarming
>
> Personal note: I once was idiot enough to have root with password root123
> and transferring some files to a friend ... over ssh...
> Lost my entire installation in a matter of minutes

Exactly why did you have root ssh access with a password?

ChrisA


#106379

From	Rustom Mody <rustompmody@gmail.com>
Date	2016-04-03 09:49 -0700
Message-ID	<ba4e0e23-c12a-4d51-8a86-17b3b7e19eef@googlegroups.com>
In reply to	#106370

On Sunday, April 3, 2016 at 9:30:40 PM UTC+5:30, Chris Angelico wrote:
> Exactly why did you have root ssh access with a password?

Umm... Don't exactly remember.
Probably it was not strictly necessary.
Combination of carelessness, stupidity, hurry....

Brings me to...

On Sunday, April 3, 2016 at 9:41:11 PM UTC+5:30, Dan Sommers wrote:
> On Sun, 03 Apr 2016 08:46:59 -0700, Rustom Mody wrote:
> 
> > On Sunday, April 3, 2016 at 8:58:59 PM UTC+5:30, Dan Sommers wrote:
> >> Yes, it's marginally annoying, and a security hole waiting to happen,
> >> that A and A often look very much alike.
> > 
> > "A security hole waiting to happen" = "Marginally annoying"
> > 
> > Frankly I find this juxtaposition alarming
> 
> Sorry about that.
> 
> I didn't mean to equate the two.  I meant to point out that the fact
> that A and A look alike can be one, or both, of those things.  Perhaps I
> should have used "or" instead of "and."

Chill! No offence.

Just that when you have the above ingredients (carelessness, stupidity, hurry....) multiplied by a GHz clock, it makes for spicy security
incidents(!).

I just meant to say that "Just a lil security incident" is not a helpful attitude to foster


#106390

From	Dan Sommers <dan@tombstonezero.net>
Date	2016-04-03 18:32 +0000
Message-ID	<ndrnjp$9va$5@dont-email.me>
In reply to	#106379

On Sun, 03 Apr 2016 09:49:03 -0700, Rustom Mody wrote:

> On Sunday, April 3, 2016 at 9:41:11 PM UTC+5:30, Dan Sommers wrote:
>> On Sun, 03 Apr 2016 08:46:59 -0700, Rustom Mody wrote:
>> 
>> > On Sunday, April 3, 2016 at 8:58:59 PM UTC+5:30, Dan Sommers wrote:
>> >> Yes, it's marginally annoying, and a security hole waiting to happen,
>> >> that A and A often look very much alike.
>> > 
>> > "A security hole waiting to happen" = "Marginally annoying"
>> > 
>> > Frankly I find this juxtaposition alarming
>> 
>> Sorry about that.
>> 
>> I didn't mean to equate the two.  I meant to point out that the fact
>> that A and A look alike can be one, or both, of those things.  Perhaps I
>> should have used "or" instead of "and."
> 
> Chill! No offence.

I'm chilled.  :-)

No offense taken.  I am arguably overly sensitive to putting forth an
argument that isn't clear and concise (because I've also been known to
derail the proceedings until I can get my head around someone else's
argument).

> Just that when you have the above ingredients (carelessness,
> stupidity, hurry....) multiplied by a GHz clock, it makes for spicy
> security incidents(!).  I just meant to say that "Just a lil security
> incident" is not a helpful attitude to foster

On this we agree.  :-)


#106372

From	Dan Sommers <dan@tombstonezero.net>
Date	2016-04-03 16:07 +0000
Message-ID	<ndrf4e$9va$2@dont-email.me>
In reply to	#106368

On Sun, 03 Apr 2016 08:46:59 -0700, Rustom Mody wrote:

> On Sunday, April 3, 2016 at 8:58:59 PM UTC+5:30, Dan Sommers wrote:
>> Yes, it's marginally annoying, and a security hole waiting to happen,
>> that A and A often look very much alike.
> 
> "A security hole waiting to happen" = "Marginally annoying"
> 
> Frankly I find this juxtaposition alarming

Sorry about that.

I didn't mean to equate the two.  I meant to point out that the fact
that A and A look alike can be one, or both, of those things.  Perhaps I
should have used "or" instead of "and."


#106599

From	Thomas 'PointedEars' Lahn <PointedEars@web.de>
Date	2016-04-06 21:56 +0200
Message-ID	<1584744.4h7ToaqLat@PointedEars.de>
In reply to	#106360

Rustom Mody wrote:

> On Sunday, April 3, 2016 at 5:17:36 PM UTC+5:30, Thomas 'PointedEars' Lahn
> wrote:
>> Rustom Mody wrote:
>> > When python went to full unicode identifers it should have also added
>> > pragmas for which blocks the programmer intended to use -- something
>> > like a charset declaration of html.
>> > 
>> > This way if the programmer says "I want latin and greek"
>> > and then A and Α get mixed up well he asked for it.
>> > If he didn't ask then springing it on him seems unnecessary and
>> > uncalled for
>> 
>> Nonsense.
> 
> Some misunderstanding of what I said it looks
> [Guessing also from Marko's "...silly..."]

First of all, while bandwidth might not be precious anymore to some, free 
time still is.  So please trim your quotations to the relevant minimum, to 
the parts you are actually referring to, and summarize properly if 
necessary.  For if you continue this mindbogglingly stupid full-quoting, 
this is going to be my last reply to you for a long time.  You have been 
warned.

<https://www.netmeister.org/news/learn2quote.html>

> So here are some examples to illustrate what I am saying:
> 
> Example 1 -- Ligatures:
> 
> Python3 gets it right
>>>> ﬂag = 1
>>>> flag
> 1

Fascinating; confirmed with

| $ python3 
| Python 3.4.4 (default, Jan  5 2016, 15:35:18) 
| [GCC 5.3.1 20160101] on linux
| […]

I do not think this is correct, though.  Different Unicode code sequences, 
after normalization, should result in different symbols.

> Whereas haskell gets it wrong:
> Prelude> let ﬂag = 1
> Prelude> flag
> 
> <interactive>:3:1: Not in scope: ‘flag’
> Prelude> ﬂag
> 1
> Prelude>

I think Haskell gets it right here, while Py3k does not.  The “ﬂ” is not to 
be decomposed to “fl”.

> Example 2 Case Sensitivity
> Scheme¹ gets it right
> 
>> (define a 1)
>> A
> 1
>> a
> 1

So Scheme is case-insensitive there.  So is (Visual) Basic.  That does not 
make it (any) better.

> Python gets it wrong
>>>> a=1
>>>> A
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> NameError: name 'A' is not defined

This is not wrong; it is just different.  And given that identifiers 
starting with uppercase ought to be class names in Python (and other OOPLs 
that are case-sensitive there), and that a class name serves in constructor 
calls (in Python, instantiating a class is otherwise indistinguishable from 
a function call), it makes sense that the (maybe local) variable “a” should 
be different from the (probably global) class “A”.

> [Likewise filenames windows gets right; Unix wrong]

Utter nonsense.  Apparently you are blissfully unaware of how much grief it 
has caused WinDOS lusers and users alike over the years that Micro$~1 
decided in their infinite wisdom that letter case was not important.

Example: By contrast to previous versions, FAT32 supports long filenames 
(VFAT).  Go try changing a long filename from uppercase (“Really Long 
Filename.txt”) to partial lowercase (“Really long filename.txt”).  It does 
not work; you get an error because the underlying “short filename” is the 
same, as it has to be case-insensitive for backwards compatibility 
(“REALLY~1.TXT”).  First you have to rename the file so that its name results 
in a different “short filename” (“REALLY~2.TXT”).  Then you have to rename 
it again to get the proper letter case (by which the “short filename” might 
either become “REALLY~1.TXT” again or “REALLY~3.TXT”).

> Unicode Identifiers in the spirit of IDN homograph attack.
> Every language that 'supports' unicode gets it wrong

NAK, see above.

> Python3
>>>> A=1
>>>> Α
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> NameError: name 'Α' is not defined
>>>> A
> 1
> 
> Can you make out why A both is and is not defined?

Fallacy.  “A” is _not_ both defined and not defined.  There is only one “A”.

However, given the proper font, I might see at a glance what is wrong there.  
In fact, in my Konsole[tm] where the default font is “Courier 10 Pitch” I 
clearly see what is wrong there.  “A” (U+0041 LATIN CAPITAL LETTER A) is 
displayed using that serif font where the letter has a serif to the left at 
cap height and serifs left and right on the baseline, while “Α” (U+0391 
GREEK CAPITAL LETTER ALPHA) is displayed using a sans-serif font, where also 
the cap height is considerably higher.

> When the language does not support it eg python2 the behavior is better

NAK.  Being able to use Unicode strings verbatim in a program without having 
to declare them is infinitely useful.  Unicode identifiers appear to be 
merely a (happy?) side effect of that.

> The notion of 'variable' in programming language is inherently based on
> that of 'identifier'.

ACK.

> With ASCII the problems are minor: Case-distinct identifiers are distinct
> -- they dont IDENTIFY.

I do not think this is a problem.

> This contradicts standard English usage and practice 

No, it does not.  English distinguishes between proper *nouns* and proper 
*names* (the latter can be the former).  For example, “Wednesday”, 
regardless where it occurs in a sentence, is an English word, a proper 
*name*; by contrast, “wednesday” is not only neither a proper noun nor a 
proper name; it is not a proper English *word* in the first place.  “i” 
might be the imaginary unit or a marketing abbreviation for “internet” [1]; 
“I” is (AFAIK) *only* the English pronoun for referring to oneself.

[1] <https://en.wikipedia.org/wiki/IMac#History>

-- 
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.


#106608 — Unicode normalisation [was Re: [beginner] What's wrong?]

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-04-07 11:37 +1000
Subject	Unicode normalisation [was Re: [beginner] What's wrong?]
Message-ID	<5705b9ef$0$1611$c3e8da3$5496439d@news.astraweb.com>
In reply to	#106599

On Thu, 7 Apr 2016 05:56 am, Thomas 'PointedEars' Lahn wrote:

> Rustom Mody wrote:

>> So here are some examples to illustrate what I am saying:
>> 
>> Example 1 -- Ligatures:
>> 
>> Python3 gets it right
>>>>> ﬂag = 1
>>>>> flag
>> 1

Python identifiers are intentionally normalised to reduce security issues,
or at least confusion and annoyance, due to visually-identical identifiers
being treated as different.

Unicode has technical standards dealing with identifiers:

http://www.unicode.org/reports/tr31/

and visual spoofing and confusables:

http://www.unicode.org/reports/tr39/

I don't believe that CPython goes to the full extreme of checking for mixed
script confusables, but it does partially mitigate the problem by
normalising identifiers.

Unfortunately PEP 3131 leaves a number of questions open. Presumably they
were answered in the implementation, but they aren't documented in the PEP.

https://www.python.org/dev/peps/pep-3131/
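
The normalisation of identifiers described above can be observed directly. The following is a minimal sketch, assuming CPython 3; the names `namespace`, `source` and `ligature_name` are chosen here purely for illustration. It executes an assignment whose target is spelled with the fl ligature and then checks which spelling ends up as the key in the namespace.

```python
# Sketch: show that the compiler stores the NFKC-normalised identifier.
# "namespace", "source" and "ligature_name" are illustrative names only.
namespace = {}

# The assignment target is spelled with U+FB02 LATIN SMALL LIGATURE FL.
source = "\N{LATIN SMALL LIGATURE FL}ag = 1"
exec(source, namespace)

ligature_name = "\N{LATIN SMALL LIGATURE FL}ag"

print("flag" in namespace)         # True: the plain ASCII spelling is the key
print(ligature_name in namespace)  # False: the ligature spelling is not stored
```

Note that string keys in the dictionary are not normalised, only identifiers in source code are, which is why the second lookup fails.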

> Fascinating; confirmed with
> 
> | $ python3
> | Python 3.4.4 (default, Jan  5 2016, 15:35:18)
> | [GCC 5.3.1 20160101] on linux
> | […]
> 
> I do not think this is correct, though.  Different Unicode code sequences,
> after normalization, should result in different symbols.

I think you are confused about normalisation. By definition, normalising
different Unicode code sequences may result in the same symbols, since that
is what normalisation means.

Consider two distinct strings which nevertheless look identical:

py> a = "\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}"
py> b = "\N{LATIN SMALL LETTER U WITH DIAERESIS}"
py> a == b
False
py> print(a, b)
ü ü

The purpose of normalisation is to turn one into the other:

py> unicodedata.normalize('NFKC', a) == b  # compose 2 code points --> 1
True
py> unicodedata.normalize('NFKD', b) == a  # decompose 1 code point --> 2
True

In the case of the fl ligature, normalisation splits the ligature into
individual 'f' and 'l' code points regardless of whether you compose or
decompose:

py> unicodedata.normalize('NFKC', "ﬂag") == "flag"
True
py> unicodedata.normalize('NFKD', "ﬂag") == "flag"
True

That's using the compatibility composition form. Using the default
composition form leaves the ligature unchanged.

Note that UTS #39 (security mechanisms) suggests that identifiers should be
normalised using NFKC.
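
To make the difference between the canonical and compatibility forms concrete, here is a small sketch that applies all four standard normalisation forms to the same ligature string. It uses only `unicodedata.normalize` from the standard library; `ligature_word` and `results` are illustrative names.

```python
import unicodedata

# "ﬂag" spelled with U+FB02 LATIN SMALL LIGATURE FL.
ligature_word = "\N{LATIN SMALL LIGATURE FL}ag"

# For each form, record whether the result equals the plain ASCII "flag".
results = {
    form: unicodedata.normalize(form, ligature_word) == "flag"
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, matches in results.items():
    print(form, matches)
# NFC and NFD leave the ligature alone (False);
# NFKC and NFKD replace it with 'f' + 'l' (True).
```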

[...]
> I think Haskell gets it right here, while Py3k does not.  The “ﬂ” is not
> to be decomposed to “fl”.

The Unicode consortium seems to disagree with you. Table 1 of UTS #39 (see
link above) includes "Characters that cannot occur in strings normalized to
NFKC" in the Restricted category, that is, characters which should not be
used in identifiers. ﬂ cannot occur in such normalised strings, and so it
is classified as Restricted and should not be used in identifiers.

I'm not entirely sure just how closely Python's identifiers follow the
standard, but I think that the intention is to follow something close to
"UAX31-R4. Equivalent Normalized Identifiers":

http://www.unicode.org/reports/tr31/#R4

[Rustom] 
>> Python gets it wrong
>>>>> a=1
>>>>> A
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> NameError: name 'A' is not defined
> 
> This is not wrong; it is just different.

I agree with Thomas here. Case-insensitivity is a choice, and I don't think
it is a good choice for programming identifiers. Being able to make case
distinctions between (let's say):

SPAM  # a constant, or at least constant-by-convention
Spam  # a class or type
spam  # an instance

is useful.

[Rustom]
>> With ASCII the problems are minor: Case-distinct identifiers are distinct
>> -- they dont IDENTIFY.
> 
> I do not think this is a problem.
> 
>> This contradicts standard English usage and practice
> 
> No, it does not.

I agree with Thomas here too. Although it is rare for case to make a
distinction in English, it does happen. As the old joke goes:

Capitalisation is the difference between helping my Uncle Jack off a horse,
and helping my uncle jack off a horse.

So even in English, capitalisation can make a semantic difference.

-- 
Steven


#106614 — Re: Unicode normalisation [was Re: [beginner] What's wrong?]

From	Marko Rauhamaa <marko@pacujo.net>
Date	2016-04-07 09:36 +0300
Subject	Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Message-ID	<87mvp6hqnd.fsf@elektro.pacujo.net>
In reply to	#106608

Steven D'Aprano <steve@pearwood.info>:

> So even in English, capitalisation can make a semantic difference.

It can even make a pronunciation difference: polish vs Polish.


Marko


#106632 — Re: Unicode normalisation [was Re: [beginner] What's wrong?]

From	Peter Pearson <pkpearson@nowhere.invalid>
Date	2016-04-07 16:51 +0000
Subject	Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Message-ID	<dmnhhaFq3t3U1@mid.individual.net>
In reply to	#106608

On Thu, 07 Apr 2016 11:37:50 +1000, Steven D'Aprano wrote:
> On Thu, 7 Apr 2016 05:56 am, Thomas 'PointedEars' Lahn wrote:
>> Rustom Mody wrote:
>
>>> So here are some examples to illustrate what I am saying:
>>> 
>>> Example 1 -- Ligatures:
>>> 
>>> Python3 gets it right
>>>>>> ﬂag = 1
>>>>>> flag
>>> 1
[snip]
>> 
>> I do not think this is correct, though.  Different Unicode code sequences,
>> after normalization, should result in different symbols.
>
> I think you are confused about normalisation. By definition, normalising
> different Unicode code sequences may result in the same symbols, since that
> is what normalisation means.
>
> Consider two distinct strings which nevertheless look identical:
>
> py> a = "\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}"
> py> b = "\N{LATIN SMALL LETTER U WITH DIAERESIS}"
> py> a == b
> False
> py> print(a, b)
> ü ü
>
>
> The purpose of normalisation is to turn one into the other:
>
> py> unicodedata.normalize('NFKC', a) == b  # compose 2 code points --> 1
> True
> py> unicodedata.normalize('NFKD', b) == a  # decompose 1 code point --> 2
> True

It's all great fun until someone loses an eye.

Seriously, it's cute how neatly normalisation works when you're
watching closely and using it in the circumstances for which it was
intended, but that hardly proves that these practices won't cause much
trouble when they're used more casually and nobody's watching closely.
Considering how much energy good software engineers spend eschewing
unnecessary complexity, do we really want to embrace the prospect of
having different things look identical?  (A relevant reference point:
mixtures of spaces and tabs in Python indentation.)

[snip]
> The Unicode consortium seems to disagree with you.

<cranky_geezer_font>

The Unicode consortium was certifiably insane when it went into the
typesetting business.  The pile-of-poo character was just frosting on
the cake.

</cranky_geezer_font>

(Sorry to leave you with that image.)

-- 
To email me, substitute nowhere->runbox, invalid->com.


#106641 — Re: Unicode normalisation [was Re: [beginner] What's wrong?]

From	Rustom Mody <rustompmody@gmail.com>
Date	2016-04-07 21:43 -0700
Subject	Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Message-ID	<e990973b-8777-4441-9401-b1b162b000fc@googlegroups.com>
In reply to	#106632

On Thursday, April 7, 2016 at 10:22:18 PM UTC+5:30, Peter Pearson wrote:
> On Thu, 07 Apr 2016 11:37:50 +1000, Steven D'Aprano wrote:
> > On Thu, 7 Apr 2016 05:56 am, Thomas 'PointedEars' Lahn wrote:
> >> Rustom Mody wrote:
> >
> >>> So here are some examples to illustrate what I am saying:
> >>> 
> >>> Example 1 -- Ligatures:
> >>> 
> >>> Python3 gets it right
> >>>>>> ﬂag = 1
> >>>>>> flag
> >>> 1
> [snip]
> >> 
> >> I do not think this is correct, though.  Different Unicode code sequences,
> >> after normalization, should result in different symbols.
> >
> > I think you are confused about normalisation. By definition, normalising
> > different Unicode code sequences may result in the same symbols, since that
> > is what normalisation means.
> >
> > Consider two distinct strings which nevertheless look identical:
> >
> > py> a = "\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}"
> > py> b = "\N{LATIN SMALL LETTER U WITH DIAERESIS}"
> > py> a == b
> > False
> > py> print(a, b)
> > ü ü
> >
> >
> > The purpose of normalisation is to turn one into the other:
> >
> > py> unicodedata.normalize('NFKC', a) == b  # compose 2 code points --> 1
> > True
> > py> unicodedata.normalize('NFKD', b) == a  # decompose 1 code point --> 2
> > True
> 
> It's all great fun until someone loses an eye.
> 
> Seriously, it's cute how neatly normalisation works when you're
> watching closely and using it in the circumstances for which it was
> intended, but that hardly proves that these practices won't cause much
> trouble when they're used more casually and nobody's watching closely.
> Considering how much energy good software engineers spend eschewing
> unnecessary complexity, do we really want to embrace the prospect of
> having different things look identical?  (A relevant reference point:
> mixtures of spaces and tabs in Python indentation.)

That kind of sums up my position.
To be a casual user of unicode is one thing
To support it is another -- unicode strings in python3 -- ok so far
To mix up these two is a third without enough thought or consideration --
unicode identifiers is likely a security hole waiting to happen...

No I am not clever/criminal enough to know how to write a text that is visually
close to 
print "Hello World"
but is internally closer to
rm -rf /

For me this:
 >>> Α = 1
>>> A = 2
>>> Α + 1 == A 
True
>>> 


is cure enough that I am not amused

[The only reason I brought up case distinction is that this is in the same 
direction and way worse than that]

If python had been more serious about embracing the brave new world of
unicode it should have looked in this direction:
http://blog.languager.org/2014/04/unicoded-python.html

Also here I suggest a classification of unicode, that, while not
official or even formalizable is (I believe) helpful
http://blog.languager.org/2015/03/whimsical-unicode.html

Specifically as far as I am concerned if python were to throw back say
a ligature in an identifier as a syntax error -- exactly what python2 does --
I think it would be perfectly fine and a more sane choice


#106642 — Re: Unicode normalisation [was Re: [beginner] What's wrong?]

From	Rustom Mody <rustompmody@gmail.com>
Date	2016-04-07 21:47 -0700
Subject	Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Message-ID	<7e55a6df-d272-4217-9c45-1f9dea9b7afd@googlegroups.com>
In reply to	#106641

On Friday, April 8, 2016 at 10:13:16 AM UTC+5:30, Rustom Mody wrote:
> No I am not clever/criminal enough to know how to write a text that is visually
> close to 
> print "Hello World"
> but is internally closer to
> rm -rf /
> 
> For me this:
>  >>> Α = 1
> >>> A = 2
> >>> Α + 1 == A 
> True
> >>> 
> 
> 
> is cure enough that I am not amused

Um... "cute" was the intention
[Or is it cuʇe ?]


#106643 — Re: Unicode normalisation [was Re: [beginner] What's wrong?]

From	Chris Angelico <rosuav@gmail.com>
Date	2016-04-08 14:54 +1000
Subject	Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Message-ID	<mailman.63.1460091243.2253.python-list@python.org>
In reply to	#106641

On Fri, Apr 8, 2016 at 2:43 PM, Rustom Mody <rustompmody@gmail.com> wrote:
> No I am not clever/criminal enough to know how to write a text that is visually
> close to
> print "Hello World"
> but is internally closer to
> rm -rf /
>
> For me this:
>  >>> Α = 1
>>>> A = 2
>>>> Α + 1 == A
> True
>>>>
>
>
> is cure enough that I am not amused

To me, the above is a contrived example. And you can contrive examples
that are just as confusing while still being ASCII-only, like
swimmer/swirnmer in many fonts, or I and l, or any number of other
visually-confusing glyphs. I propose that we ban the letters 'r' and
'l' from identifiers, to ensure that people can't mess with
themselves.

> Specifically as far as I am concerned if python were to throw back say
> a ligature in an identifier as a syntax error -- exactly what python2 does --
> I think it would be perfectly fine and a more sane choice

The ligature is handled straight-forwardly: it gets decomposed into
its component letters. I'm not seeing a problem here.

ChrisA


#106699 — Re: Unicode normalisation [was Re: [beginner] What's wrong?]

From	Rustom Mody <rustompmody@gmail.com>
Date	2016-04-08 10:51 -0700
Subject	Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Message-ID	<df998f95-929f-4d7b-9eed-cde6bde040fa@googlegroups.com>
In reply to	#106643

On Friday, April 8, 2016 at 10:24:17 AM UTC+5:30, Chris Angelico wrote:
> On Fri, Apr 8, 2016 at 2:43 PM, Rustom Mody  wrote:
> > No I am not clever/criminal enough to know how to write a text that is visually
> > close to
> > print "Hello World"
> > but is internally closer to
> > rm -rf /
> >
> > For me this:
> >  >>> Α = 1
> >>>> A = 2
> >>>> Α + 1 == A
> > True
> >>>>
> >
> >
> > is cure enough that I am not amused
> 
> To me, the above is a contrived example. And you can contrive examples
> that are just as confusing while still being ASCII-only, like
> swimmer/swirnmer in many fonts, or I and l, or any number of other
> visually-confusing glyphs. I propose that we ban the letters 'r' and
> 'l' from identifiers, to ensure that people can't mess with
> themselves.

swirnmer and swimmer are distinguished by squinting a bit
А and A only by digging down into the hex.
If you categorize them as similar/same... well I am not arguing...
will come to you when I am short of straw...
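
For reference, "digging down into the hex" need not be done by hand. A minimal sketch, using only `ord` and `unicodedata.name` from the standard library (the helper name `describe` is made up for this example), prints the code point and official name of each look-alike:

```python
import unicodedata

def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Three capital letters that are rendered identically in many fonts.
lookalikes = [
    "A",                                # Latin
    "\N{GREEK CAPITAL LETTER ALPHA}",   # Greek
    "\N{CYRILLIC CAPITAL LETTER A}",    # Cyrillic
]

for ch in lookalikes:
    print(describe(ch))
# U+0041 LATIN CAPITAL LETTER A
# U+0391 GREEK CAPITAL LETTER ALPHA
# U+0410 CYRILLIC CAPITAL LETTER A
```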


> 
> > Specifically as far as I am concerned if python were to throw back say
> > a ligature in an identifier as a syntax error -- exactly what python2 does --
> > I think it would be perfectly fine and a more sane choice
> 
> The ligature is handled straight-forwardly: it gets decomposed into
> its component letters. I'm not seeing a problem here.

Yes... there is no problem... HERE [I did say Python gets this right where
Haskell, for example, gets it wrong]
What's wrong is the whole approach of swallowing gobs of characters that
need not be legal at all and then getting indigestion:

Note the "non-normative" in
https://docs.python.org/3/reference/lexical_analysis.html#identifiers

If a language reference is not normative what is?


#106648 — Re: Unicode normalisation [was Re: [beginner] What's wrong?]

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-04-08 16:00 +1000
Subject	Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Message-ID	<570748ec$0$1620$c3e8da3$5496439d@news.astraweb.com>
In reply to	#106632

On Fri, 8 Apr 2016 02:51 am, Peter Pearson wrote:

> Seriously, it's cute how neatly normalisation works when you're
> watching closely and using it in the circumstances for which it was
> intended, but that hardly proves that these practices won't cause much
> trouble when they're used more casually and nobody's watching closely.
> Considering how much energy good software engineers spend eschewing
> unnecessary complexity, 

Maybe so, but it's not good software engineers we have to worry about, but
the other 99.9% :-)

> do we really want to embrace the prospect of 
> having different things look identical?

You mean like ASCII identifiers? I'm afraid it's about fifty years too late
to ban identifiers using O and 0, or l, I and 1, or rn and m.

Or for that matter:

a = akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqwe9fhlcjbqvcbhsiauy37wkg() + 100
b = 100 + akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqew9fhlcjbqvcbhsiauy37wkg()

How easily can you tell them apart at a glance?

The reality is that we trust our coders not to deliberately mess us about.
As the Obfuscated C and the Underhanded C contest prove, you don't need
Unicode to hide hostile code. In fact, the use of Unicode confusables in an
otherwise all-ASCII file is a dead giveaway that something fishy is going
on.

I think that, beyond normalisation, the compiler need not be too concerned
by confusables. I wouldn't *object* to the compiler raising a warning if it
detected confusable identifiers, or mixed script identifiers, but I think
that's more the job for a linter or human code review.
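
As a rough illustration of what such a linter check might look like, here is a sketch that flags identifiers mixing letters from more than one script. It is a heuristic under stated assumptions, not the UTS #39 algorithm: the standard library exposes no script property, so the first word of each character's Unicode name ("LATIN", "GREEK", "CYRILLIC", ...) stands in for the script. The function names `scripts_in` and `mixed_script_names` are hypothetical.

```python
import io
import tokenize
import unicodedata

def scripts_in(identifier):
    """Approximate the scripts used in an identifier.

    Assumption: the first word of a letter's Unicode name is treated as
    its script. Digits and underscores are ignored.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in identifier
        if ch.isalpha()
    }

def mixed_script_names(source):
    """Return the NAME tokens in source whose letters span several scripts."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and len(scripts_in(tok.string)) > 1:
            found.append(tok.string)
    return found

# The first "spam" hides a Cyrillic "а" between Latin letters.
sample = "sp\N{CYRILLIC SMALL LETTER A}m = 1\nspam = 2\n"
print(mixed_script_names(sample))
```

A check like this would not catch an identifier written entirely in one non-Latin script, such as a lone Greek Alpha; detecting that requires comparing identifiers against each other, along the lines of the confusables data in UTS #39.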

> (A relevant reference point: 
> mixtures of spaces and tabs in Python indentation.)

Most editors have an option to display whitespace, and tabs and spaces look
different. Typically the tab is shown with an arrow, and the space by a
dot. If people *still* confuse them, the issue is easily managed by a
combination of "well don't do that" and TabError.
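
The TabError behaviour is easy to reproduce. In this sketch, assuming Python 3, a block is indented with a tab on one line and eight spaces on the next; the variable names are illustrative only.

```python
# One body line is indented with a tab, the next with eight spaces.
source = "if True:\n\tx = 1\n        y = 2\n"

try:
    compile(source, "<demo>", "exec")
    outcome = "compiled"
except TabError as exc:
    # TabError is a subclass of IndentationError (and of SyntaxError).
    outcome = "TabError"
    print(exc.msg)

print(outcome)
```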

> [snip]
>> The Unicode consortium seems to disagree with you.
> 
> <cranky_geezer_font>
> 
> The Unicode consortium was certifiably insane when it went into the
> typesetting business.

They are not, and never have been, in the typesetting business. Perhaps
characters are not the only things easily confused *wink*

(Although some members of the consortium may be. But the consortium itself
isn't.)

> The pile-of-poo character was just frosting on 
> the cake.

Blame the Japanese mobile phone companies for that. When you pay your
membership fee, you get to object to the addition of characters too.
(Anyone, I think, can propose a new character, but only members get to
choose which proposals are accepted.)

But really, why should we object? Is "pile-of-poo" any more silly than any
of the other dingbats, graphics characters, and other non-alphabetical
characters? Unicode is not just for "letters of the alphabet".

-- 
Steven
