Ah Python, you have spoiled me for all other languages

Started by	Steven D'Aprano <steve@pearwood.info>
First post	2015-05-23 00:58 +1000
Last post	2015-05-22 21:33 -0600
Articles	20 on this page of 77 — 24 participants

  Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-05-23 00:58 +1000
    Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 01:29 +1000
      Re: Ah Python, you have spoiled me for all other languages wxjmfauth@gmail.com - 2015-05-22 10:57 -0700
      Re: Ah Python, you have spoiled me for all other languages Tim Daneliuk <tundra@tundraware.com> - 2015-05-22 16:40 -0500
        Re: Ah Python, you have spoiled me for all other languages Terry Reedy <tjreedy@udel.edu> - 2015-05-22 21:54 -0400
          Re: Ah Python, you have spoiled me for all other languages Tim Daneliuk <tundra@tundraware.com> - 2015-05-23 06:12 -0500
            Re: Ah Python, you have spoiled me for all other languages Terry Reedy <tjreedy@udel.edu> - 2015-05-23 13:26 -0400
        Re: Ah Python, you have spoiled me for all other languages Michael Torrie <torriem@gmail.com> - 2015-05-22 21:31 -0600
          Re: Ah Python, you have spoiled me for all other languages Johannes Bauer <dfnsonfsduifb@gmx.de> - 2015-05-23 08:55 +0200
            Re: Ah Python, you have spoiled me for all other languages Tim Daneliuk <tundra@tundraware.com> - 2015-05-23 06:21 -0500
              Re: Ah Python, you have spoiled me for all other languages Johannes Bauer <dfnsonfsduifb@gmx.de> - 2015-05-23 15:24 +0200
                Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-23 20:05 +0300
                  Re: Ah Python, you have spoiled me for all other languages Johannes Bauer <dfnsonfsduifb@gmx.de> - 2015-05-24 20:29 +0200
            Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-23 15:44 +0300
              Re: Ah Python, you have spoiled me for all other languages Johannes Bauer <dfnsonfsduifb@gmx.de> - 2015-05-23 15:17 +0200
              Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-05-24 00:00 +1000
                Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-23 19:53 +0300
                  Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-24 03:41 +1000
                    Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-23 22:02 +0300
                  Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-05-24 20:26 +1000
                    Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-24 18:26 +0300
                      Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-25 01:35 +1000
                        Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-25 09:57 +0300
                          Re: Ah Python, you have spoiled me for all other languages Laura Creighton <lac@openend.se> - 2015-05-25 11:39 +0200
                          Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-25 21:09 +1000
              Re: Ah Python, you have spoiled me for all other languages Michael Torrie <torriem@gmail.com> - 2015-05-23 21:00 -0600
                Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-24 11:23 +0300
        Re: Ah Python, you have spoiled me for all other languages Ian Kelly <ian.g.kelly@gmail.com> - 2015-05-22 22:10 -0600
        Re: Ah Python, you have spoiled me for all other languages amber <amber.of.luxor@gmail.com> - 2015-05-23 04:11 +0000
          Re: Ah Python, you have spoiled me for all other languages Tim Daneliuk <tundra@tundraware.com> - 2015-05-23 06:11 -0500
        Re: Ah Python, you have spoiled me for all other languages Ben Finney <ben+python@benfinney.id.au> - 2015-05-23 14:20 +1000
        Re: Ah Python, you have spoiled me for all other languages Michael Torrie <torriem@gmail.com> - 2015-05-22 22:30 -0600
          Re: Ah Python, you have spoiled me for all other languages Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-23 11:10 +0000
            Re: Ah Python, you have spoiled me for all other languages Tim Chase <python.list@tim.thechases.com> - 2015-05-23 06:34 -0500
            Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 21:40 +1000
            Re: Ah Python, you have spoiled me for all other languages Michael Torrie <torriem@gmail.com> - 2015-05-23 20:57 -0600
            Re: Ah Python, you have spoiled me for all other languages Ian Kelly <ian.g.kelly@gmail.com> - 2015-05-24 01:22 -0600
        Re: Ah Python, you have spoiled me for all other languages Ian Kelly <ian.g.kelly@gmail.com> - 2015-05-22 22:29 -0600
        Re: Ah Python, you have spoiled me for all other languages Ian Kelly <ian.g.kelly@gmail.com> - 2015-05-22 22:49 -0600
        Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 14:49 +1000
          Re: Ah Python, you have spoiled me for all other languages Tim Daneliuk <tundra@tundraware.com> - 2015-05-23 06:29 -0500
        Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 14:55 +1000
        Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 14:28 +1000
        Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 14:21 +1000
      Re: Ah Python, you have spoiled me for all other languages Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-05-23 14:33 +0200
        Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-05-23 23:01 +1000
          Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 23:12 +1000
            Re: Ah Python, you have spoiled me for all other languages wxjmfauth@gmail.com - 2015-05-23 23:37 -0700
          Re: Ah Python, you have spoiled me for all other languages Ned Batchelder <ned@nedbatchelder.com> - 2015-05-23 06:35 -0700
            Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-05-24 00:09 +1000
            Re: Ah Python, you have spoiled me for all other languages Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-07 10:21 +0200
              Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-06-07 21:42 +1000
                Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-06-07 22:08 +1000
                  Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-06-07 23:24 +1000
                    Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-06-08 00:47 +1000
                Re: Ah Python, you have spoiled me for all other languages random832@fastmail.us - 2015-06-07 10:58 -0400
                  Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-06-08 02:28 +1000
    Re: Ah Python, you have spoiled me for all other languages Tony the Tiger <tony@tiger.invalid> - 2015-05-22 16:31 +0000
      Re: Ah Python, you have spoiled me for all other languages Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-05-22 17:57 +0100
      Re: Ah Python, you have spoiled me for all other languages Tim Daneliuk <tundra@tundraware.com> - 2015-05-22 16:41 -0500
        Re: Ah Python, you have spoiled me for all other languages Tony the Tiger <tony@tiger.invalid> - 2015-05-23 20:25 +0000
    Re: Ah Python, you have spoiled me for all other languages Grant Edwards <invalid@invalid.invalid> - 2015-05-22 17:47 +0000
      Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 04:11 +1000
      Re: Ah Python, you have spoiled me for all other languages mm0fmf <none@mailinator.com> - 2015-05-22 19:19 +0100
      Re: Ah Python, you have spoiled me for all other languages Laura Creighton <lac@openend.se> - 2015-05-22 21:14 +0200
        Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-05-23 11:36 +1000
      Re: Ah Python, you have spoiled me for all other languages MRAB <python@mrabarnett.plus.com> - 2015-05-22 20:34 +0100
      Re: Ah Python, you have spoiled me for all other languages Ian Kelly <ian.g.kelly@gmail.com> - 2015-05-22 13:56 -0600
        Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-22 23:34 +0300
          Re: Ah Python, you have spoiled me for all other languages Tim Chase <python.list@tim.thechases.com> - 2015-05-22 15:55 -0500
          Re: Ah Python, you have spoiled me for all other languages Ethan Furman <ethan@stoneleaf.us> - 2015-05-22 14:15 -0700
          Re: Ah Python, you have spoiled me for all other languages Ian Kelly <ian.g.kelly@gmail.com> - 2015-05-22 15:20 -0600
    Re: Ah Python, you have spoiled me for all other languages Paul Rubin <no.email@nospam.invalid> - 2015-05-22 16:00 -0700
      Re: Ah Python, you have spoiled me for all other languages Michael Torrie <torriem@gmail.com> - 2015-05-22 21:33 -0600

#91097

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2015-05-22 22:29 -0600
Message-ID	<mailman.258.1432355434.17265.python-list@python.org>
In reply to	#91078

On Fri, May 22, 2015 at 10:20 PM, Ben Finney <ben+python@benfinney.id.au> wrote:
> Ian Kelly <ian.g.kelly@gmail.com> writes:
>
>> On Fri, May 22, 2015 at 9:31 PM, Michael Torrie <torriem@gmail.com> wrote:
>> > On 05/22/2015 07:54 PM, Terry Reedy wrote:
>> >> On 5/22/2015 5:40 PM, Tim Daneliuk wrote:
>> >>
>> >>> Lo these many years ago, I argued that Python is a whole lot more than
>> >>> a programming language:
>> >>>
>> >>>     https://www.tundraware.com/TechnicalNotes/Python-Is-Middleware/
>> >>
>> >> Perhaps something at tundraware needs updating.
>> >> '''
>> >> This Connection is Untrusted
>> >>
>> >> You have asked Firefox to connect securely to www.tundraware.com, but we
>> >> can't confirm that your connection is secure.
>> >> […]
>
>> Without some prior reason to trust the certificate, the certificate is
>> meaningless. How is the browser to distinguish between a legitimate
>> self-signed cert and a self-signed cert presented by an attacker
>> conducting a man-in-the-middle attack?
>
> Any unencrypted HTTP (“http://…”) connection has the same problem. Yet
> the same browsers don't present a big scary warning for those?
>
> The flaw in the browser is that it doesn't complain when an unencrypted
> HTTP connection is established, but only complains when an *encrypted*
> connection is made to a site with a self-signed certificate.
>
>> There is still some value in TLS with a self-signed certificate in
>> that at least the connection is encrypted and can't be eavesdropped by
>> an attacker who can only read the channel, but there is no assurance
>> that the party you're communicating with actually owns the public key
>> that you've been presented.
>
> Right. By that logic, let's advocate for browsers to present a big
> intrusive warning for every HTTP connection that has no SSL layer or
> certificate.
>
> I will agree that a self-signed certificate presents the problem of how
> to verify the certificate automatically.
>
> Where I disagree is that this is somehow less secure than a completely
> *unencrypted* HTTP connection. No, the opposite is true.

I don't disagree with you. There *should* be scary warnings for plain
HTTP connections (although there is a counter-argument that many sites
don't need any encryption and HTTPS would just be wasteful in those
cases). The fact that browsers don't yet provide those warnings
doesn't change anything that I wrote above.

#91098

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2015-05-22 22:49 -0600
Message-ID	<mailman.259.1432356589.17265.python-list@python.org>
In reply to	#91078

On Fri, May 22, 2015 at 10:30 PM, Michael Torrie <torriem@gmail.com> wrote:
> On 05/22/2015 10:10 PM, Ian Kelly wrote:
>> Sure it is. Without some prior reason to trust the certificate, the
>> certificate is meaningless. How is the browser to distinguish between
>> a legitimate self-signed cert and a self-signed cert presented by an
>> attacker conducting a man-in-the-middle attack?
>
> How does a CA actually help this problem?  It just puts trust in some
> third party. But as we know, CAs are not all trustworthy and
> they certainly don't guarantee that the site is what it says it is.

Nobody is forcing you to trust them. Go ahead and remove the CA
certificates that you consider untrustworthy if you want. Remove all
of them if you like, although good luck with verifying all those site
certificates yourself.

The CA helps because some assurance is better than none.

>> There is still some value in TLS with a self-signed certificate in
>> that at least the connection is encrypted and can't be eavesdropped
>> by an attacker who can only read the channel, but there is no
>> assurance that the party you're communicating with actually owns the
>> public key that you've been presented.
>
> The same can be said of CA-signed certificates.  The only way to know if
> the site is who they say they are is to know what the cert's fingerprint
> ought to be and see if it still is. I used to use a firefox plugin for
> this purpose, but certs for some major sites like even www.google.com
> change with such frequency that the utility of the plugin went away.

So instead of trusting a CA, you have to trust the maintainers of the
plugin. How is that any different?

#91099

From	Chris Angelico <rosuav@gmail.com>
Date	2015-05-23 14:49 +1000
Message-ID	<mailman.260.1432356599.17265.python-list@python.org>
In reply to	#91078

On Sat, May 23, 2015 at 2:29 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> There *should* be scary warnings for plain
> HTTP connections (although there is a counter-argument that many sites
> don't need any encryption and HTTPS would just be wasteful in those
> cases).

I don't think there should be "scary warnings", for precisely this
reason. When the information you're sharing is completely public,
there's no point taking the overhead of encryption. So there should be
two normal and acceptable ways to access data: either unencrypted, or
encrypted with a verified certificate. Oh look, that's what we have.
There is an assumption that your system certificate store is
trustworthy, but for the typical user, it's probably better than
they'll get any other way, and for an atypical user, it can be pruned
easily.

But I think we're just a smidge off-topic here.

ChrisA

#91117

From	Tim Daneliuk <tundra@tundraware.com>
Date	2015-05-23 06:29 -0500
Message-ID	<556064A5.5030502@tundraware.com>
In reply to	#91099

On 05/22/2015 11:49 PM, Chris Angelico wrote:
> When the information you're sharing is completely public,
> there's no point taking the overhead of encryption.

I disagree.  With two different ways to access data, the metadata about
when you do and do not use an encrypted channel can be useful to
a snoopy third party.  For example, repressive governments might use
the fact of your connecting via https as prima facie evidence you're
doing something subversive.  The argument for https everywhere is that this
sort of distinction becomes impossible to make and one less piece of metadata
is left around to misuse.

-- 
----------------------------------------------------------------------------
Tim Daneliuk     tundra@tundraware.com
PGP Key:         http://www.tundraware.com/PGP/

#91100

From	Chris Angelico <rosuav@gmail.com>
Date	2015-05-23 14:55 +1000
Message-ID	<mailman.261.1432356947.17265.python-list@python.org>
In reply to	#91078

On Sat, May 23, 2015 at 2:49 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
>> The same can be said of CA-signed certificates.  The only way to know if
>> the site is who they say they are is to know what the cert's fingerprint
>> ought to be and see if it still is. I used to use a firefox plugin for
>> this purpose, but certs for some major sites like even www.google.com
>> change with such frequency that the utility of the plugin went away.
>
> So instead of trusting a CA, you have to trust the maintainers of the
> plugin. How is that any different?

It brings it local. If you're able to see the source code for the
plugin, you could check exactly how it does its verification (and by
the sound of it, it'd be pretty simple: just look up the cert, see if
it's different, if so, big noisy warning). Or, of course, you could do
the check yourself: click on the padlock, look at fingerprint, compare
against previously-noted fingerprint. That'd at least prove that your
plugin is checking properly.

But it still doesn't solve the fundamental problem of knowing when you
have the right site to start with.
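
For what it's worth, the "compare against previously-noted fingerprint"
step could be sketched in Python roughly as follows. This is only an
illustration: the function names are made up here, the choice of SHA-256
is an assumption, and the byte strings are stand-ins for real DER-encoded
certificates.

```python
import hashlib
import hmac


def cert_fingerprint(der_bytes: bytes) -> str:
    """Return the SHA-256 fingerprint of a DER-encoded certificate as hex."""
    return hashlib.sha256(der_bytes).hexdigest()


def matches_pinned(der_bytes: bytes, pinned_hex: str) -> bool:
    """Compare a certificate's fingerprint with a previously noted one."""
    # Accept the colon-separated, upper-case form that browsers often display.
    normalised = pinned_hex.replace(":", "").lower()
    return hmac.compare_digest(cert_fingerprint(der_bytes), normalised)


# Stand-in bytes for illustration; a real check would use the DER bytes
# returned by ssl.SSLSocket.getpeercert(binary_form=True).
noted = cert_fingerprint(b"example certificate bytes")
print(matches_pinned(b"example certificate bytes", noted))   # True
print(matches_pinned(b"a different certificate", noted))     # False
```

As noted above, a check like this only tells you the certificate has not
changed since you noted it; it says nothing about whether the first copy
was the right one.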

ChrisA

#91101

From	Chris Angelico <rosuav@gmail.com>
Date	2015-05-23 14:28 +1000
Message-ID	<mailman.262.1432357208.17265.python-list@python.org>
In reply to	#91078

On Sat, May 23, 2015 at 2:20 PM, Ben Finney <ben+python@benfinney.id.au> wrote:
> Where I disagree is that this is somehow less secure than a completely
> *unencrypted* HTTP connection. No, the opposite is true.

No, it isn't less secure. However, people have been trained for years
to look for the padlock (including looking for padlocks before
entering credit card numbers or passwords, despite the fact that HTTPS
on the form isn't actually what's significant), and that's the key
here. Web browsers are intended for *humans* to use. You want a truly
secure connection between your Python client script and your Python
server? Sure, self-signed cert is great. You want something that an
average Joe can understand? Do what 99% of the world does, and get a
CA-signed cert. Unencrypted is normal, encrypted is normal, and the
only thing that's being flagged is "hey, this *looks* secured, but it
might not be the right server". It's still encrypted, but the
unverified origin is a potential problem.

ChrisA

#91102

From	Chris Angelico <rosuav@gmail.com>
Date	2015-05-23 14:21 +1000
Message-ID	<mailman.263.1432360834.17265.python-list@python.org>
In reply to	#91078

On Sat, May 23, 2015 at 2:10 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
>> Sigh. I blame this as much on the browser.  There's no inherent reason
>> why a connection to a site secured with a self-signed certificate is
>> insecure.  In fact it's definitely not.
>
> Sure it is. Without some prior reason to trust the certificate, the
> certificate is meaningless. How is the browser to distinguish between
> a legitimate self-signed cert and a self-signed cert presented by an
> attacker conducting a man-in-the-middle attack?
>
> There is still some value in TLS with a self-signed certificate in
> that at least the connection is encrypted and can't be eavesdropped by
> an attacker who can only read the channel, but there is no assurance
> that the party you're communicating with actually owns the public key
> that you've been presented.

To be fair, certificates never actually tell you that the owner is
legitimate - all they do is move the problem. Self-signed certs move
the problem to "how do you get a guaranteed copy of this exact
server's certificate", which makes it an out-of-band issue (if you
meet someone you know in person and get a copy of the cert on a USB
stick, then manually install it, you can be sure it's safe);
externally-signed certs move the problem to the certificate chain and
its reliability (how well do the CAs check ownership prior to issuing
a certificate?). Both are still problematic, just in different ways.

Self-signed certs are ideal if you're packaging your own client - you
could keep the IP address and certificate in the same VCS repository.
Anyone who can change the cert can also change the IP address, so you
lose no security there. But they're way WAY more hassle for https on
the public internet.
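
A minimal sketch of the "package the cert with your own client" idea,
assuming Python 3.6 or later. The helper name and the cert_path argument
are hypothetical; the point is only that the client's trust store holds
the one certificate you shipped instead of the system CA bundle.

```python
import ssl


def make_pinned_context(cert_path=None):
    """Build a client context that trusts only the certificate in cert_path.

    cert_path is a hypothetical PEM file kept in the same repository
    as the client code.
    """
    # PROTOCOL_TLS_CLIENT turns on certificate and hostname checks by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if cert_path is not None:
        # Nothing else is loaded, so this is the only trusted certificate.
        ctx.load_verify_locations(cafile=cert_path)
    return ctx


ctx = make_pinned_context()  # no file loaded here, so no server would verify yet
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

Because hostname checking stays on, the certificate would need a
subjectAltName entry matching whatever name or IP address the client
connects to.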

ChrisA

#91120

From	Thomas 'PointedEars' Lahn <PointedEars@web.de>
Date	2015-05-23 14:33 +0200
Message-ID	<2212595.DFZ6OqehRn@PointedEars.de>
In reply to	#91045

Chris Angelico wrote:

> […] My hobby-horse, Unicode, is a notable flaw in many languages - if you 
> ask the user for information (in the most obvious way for whatever 
> environment you're in, be that via a web browser request, or a GUI widget, 
> or text entered at the console), can it cope equally with all the world's 
> languages? What if you want to manipulate that text - is it represented as 
> a sequence of codepoints (Python 3), UTF-16 code units (JavaScript),

If only characters were represented as sequences of UTF-16 code units in 
ECMAScript implementations like JavaScript, there would not be a problem 
beyond the BMP; see <http://PointedEars.de/wsvn/JSX/trunk/string/unicode.js> 
and others for details.

-- 
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.

#91122

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-05-23 23:01 +1000
Message-ID	<55607a1b$0$13011$c3e8da3$5496439d@news.astraweb.com>
In reply to	#91120

On Sat, 23 May 2015 10:33 pm, Thomas 'PointedEars' Lahn wrote:

> If only characters were represented as sequences of UTF-16 code units in
> ECMAScript implementations like JavaScript, there would not be a problem
> beyond the BMP;

Are you being sarcastic?

This is Rhino:

js> var c = String.fromCharCode(65535); // in the BMP
js> print(c.charCodeAt(0));
65535

So far so good.

js> var c = String.fromCharCode(65536); // astral character
js> print(c.charCodeAt(0));
0

Can you name any ECMAScript implementation which correctly handles code
points in the supplementary planes?

By the way, for many years Python implemented Unicode as UTF-16 code units,
the so-called "narrow build":

py> c = unichr(65536)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Let's try again:

py> c = u'\U00010000'  # a single code point
py> len(c)
2

I'm not saying that it is impossible to have a correct Unicode implementation
using UTF-16, but I've never seen one.
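
To make the arithmetic concrete, here is a small sketch (the function name
is made up for illustration) of how a code point maps to UTF-16 code
units. It also shows why the Rhino result above is 0: String.fromCharCode
is specified to reduce its argument to 16 bits, and 65536 truncated to 16
bits is 0. The function does not reject lone surrogate code points; it is
only meant to show the mapping.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (as ints) that encode one code point."""
    if code_point < 0x10000:
        # Inside the BMP: one 16-bit code unit, equal to the code point.
        return [code_point]
    # Beyond the BMP: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


print([hex(u) for u in utf16_code_units(0xFFFF)])    # ['0xffff']
print([hex(u) for u in utf16_code_units(0x10000)])   # ['0xd800', '0xdc00']
print(65536 & 0xFFFF)                                # 0, what a 16-bit truncation gives
print(len('\U00010000'))                             # 1 on Python 3.3 and later
```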

-- 
Steven

#91123

From	Chris Angelico <rosuav@gmail.com>
Date	2015-05-23 23:12 +1000
Message-ID	<mailman.274.1432386750.17265.python-list@python.org>
In reply to	#91122

On Sat, May 23, 2015 at 11:01 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> I'm not saying that it is impossible to have a correct Unicode implementation
> using UTF-16, but I've never seen one.

I suspect this is partly because, if you're aiming for correct Unicode
semantics, UTF-8 offers everything that UTF-16 does and more. The only
reason to use UTF-16 is so you can pretend that UCS-2 is good enough.

ChrisA

#91160

From	wxjmfauth@gmail.com
Date	2015-05-23 23:37 -0700
Message-ID	<6cbe6ba2-5ab5-4c7e-8d7b-59ab8066488c@googlegroups.com>
In reply to	#91123

Le samedi 23 mai 2015 15:12:42 UTC+2, Chris Angelico a écrit :
> On Sat, May 23, 2015 at 11:01 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> > I'm not saying that it is impossible to have a correct Unicode implementation
> > using UTF-16, but I've never seen one.
> 
> I suspect this is partly because, if you're aiming for correct Unicode
> semantics, UTF-8 offers everything that UTF-16 does and more. The only
> reason to use UTF-16 is so you can pretend that UCS-2 is good enough.
> 
> ChrisA

Like your "colleague", it's also time to learn
Unicode.

jmf

#91127

From	Ned Batchelder <ned@nedbatchelder.com>
Date	2015-05-23 06:35 -0700
Message-ID	<2c4d029c-8ea5-465b-8adc-6c35185bd150@googlegroups.com>
In reply to	#91122

On Saturday, May 23, 2015 at 9:01:29 AM UTC-4, Steven D'Aprano wrote:
> On Sat, 23 May 2015 10:33 pm, Thomas 'PointedEars' Lahn wrote:
> 
> > If only characters were represented as sequences of UTF-16 code units in
> > ECMAScript implementations like JavaScript, there would not be a problem
> > beyond the BMP;
> 
> Are you being sarcastic?

IIUC, Thomas' point is that *characters* should be sequences of codepoints,
not that *strings* should be.

--Ned.

#91129

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-05-24 00:09 +1000
Message-ID	<55608a27$0$13013$c3e8da3$5496439d@news.astraweb.com>
In reply to	#91127

On Sat, 23 May 2015 11:35 pm, Ned Batchelder wrote:

> On Saturday, May 23, 2015 at 9:01:29 AM UTC-4, Steven D'Aprano wrote:
>> On Sat, 23 May 2015 10:33 pm, Thomas 'PointedEars' Lahn wrote:
>> 
>> > If only characters were represented as sequences of UTF-16 code units in
>> > ECMAScript implementations like JavaScript, there would not be a
>> > problem beyond the BMP;
>> 
>> Are you being sarcastic?
> 
> IIUC, Thomas' point is that *characters* should be sequences of
> codepoints, not that *strings* should be.

Like Python, Javascript/ECMAScript doesn't have a distinct character type,
it has strings which happen to be of length one. So I'm not sure I
understand the point you are trying to make.

There's also a bit of a problem in deciding what counts as a character. Is
IJ a single character, or two? The answer depends on whether you are Dutch
or not. Unicode punts on that decision, and leaves it up to the
application.

Unicode only concerns itself with code points, which are complex enough, and
generally programming languages follow Unicode (usually imperfectly). Each
code point (a.k.a. "character" if we're being sloppy) requires either one
or two 16-bit code units in UTF-16. I'm not sure that "1 or 2" counts as a
sequence.

-- 
Steven

#92211

From	Thomas 'PointedEars' Lahn <PointedEars@web.de>
Date	2015-06-07 10:21 +0200
Message-ID	<2483375.eHyISxeWLQ@PointedEars.de>
In reply to	#91127

Ned Batchelder wrote:

> On Saturday, May 23, 2015 at 9:01:29 AM UTC-4, Steven D'Aprano wrote:
>> On Sat, 23 May 2015 10:33 pm, Thomas 'PointedEars' Lahn wrote:
>> > If only characters were represented as sequences of UTF-16 code units in
>> > ECMAScript implementations like JavaScript, there would not be a
>> > problem beyond the BMP;
>> 
>> Are you being sarcastic?
> 
> IIUC, Thomas' point is that *characters* should be sequences of
> codepoints, not that *strings* should be.

No, my point is that one character should be a sequence of code _units_ (for 
a code point value).  But in ECMAScript implementations (so far), a *code 
point value* equals a character, and that is a problem in ECMAScript because 
there the value range is limited to what can be encoded in 16 bits.  The 
problem starts beyond the BMP where 16 bits are no longer sufficient for a 
code sequence and code point value, and code sequence and code point value 
are no longer equal.

-- 
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.

#92233

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-06-07 21:42 +1000
Message-ID	<55742e0e$0$12980$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92211

On Sun, 7 Jun 2015 06:21 pm, Thomas 'PointedEars' Lahn wrote:

> Ned Batchelder wrote:
> 
>> On Saturday, May 23, 2015 at 9:01:29 AM UTC-4, Steven D'Aprano wrote:
>>> On Sat, 23 May 2015 10:33 pm, Thomas 'PointedEars' Lahn wrote:
>>> > If only characters were represented as sequences of UTF-16 code units in
>>> > ECMAScript implementations like JavaScript, there would not be a
>>> > problem beyond the BMP;
>>> 
>>> Are you being sarcastic?
>> 
>> IIUC, Thomas' point is that *characters* should be sequences of
>> codepoints, not that *strings* should be.
> 
> No, my point is that one character should be a sequence of code _units_
> (for a code point value).  

I don't understand this sentence. "Code point value" doesn't appear to be
meaningful. "Code point" is a value in the Unicode codespace, informally "a
character" (but see below); code points can take on values in the range 0
to 1114111, usually written in hex as U+0000 to U+10FFFF.

"Code value" is an obsolete term for code unit, that is, the smallest chunk
of memory used to represent a code point. For example, UTF-8 uses 8-bit
code units, UTF-32 uses 32-bit code units.
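
As a rough illustration of the difference between code points and code
units, the sketch below counts both for a few one-code-point strings. The
helper name and dictionary keys are invented for this example, and it
assumes Python 3.3 or later, where len() counts code points.

```python
def code_unit_counts(text):
    """Count code points and the code units needed in UTF-8, UTF-16, UTF-32."""
    return {
        "code_points": len(text),
        "utf8": len(text.encode("utf-8")),             # 8-bit code units
        "utf16": len(text.encode("utf-16-le")) // 2,   # 16-bit code units
        "utf32": len(text.encode("utf-32-le")) // 4,   # 32-bit code units
    }


for sample in ("A", "\u00fb", "\U00010000"):
    print(ascii(sample), code_unit_counts(sample))
```

Each sample is a single code point, yet the number of code units varies
with both the encoding and the code point's position in the codespace.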

But "code point value", I'm not sure what you mean by that. Consequently I
have no idea what you think a character should be. Is "Hello World" a
character? How about "Æ" or "û"?

The term "character" is problematic, because what counts as a character
depends on where you are and how the string is normalised. For example:

"ij" could be two characters, the letters i followed by j, or one, the 25th
letter of the Dutch language [and not even the Dutch agree on this];
conversely, "ĳ" could be a single character, or a ligature of two
characters.

"Ḗ" (U+1E16 LATIN CAPITAL LETTER E WITH MACRON AND ACUTE) could be
considered one character, or three 'E\u0304\u0301', depending on whether it
is normalised or not.
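
A short sketch of that last case using the standard unicodedata module;
the variable names are just for illustration.

```python
import unicodedata

composed = "\u1e16"  # LATIN CAPITAL LETTER E WITH MACRON AND ACUTE
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))                        # 1 3
print([f"U+{ord(ch):04X}" for ch in decomposed])             # ['U+0045', 'U+0304', 'U+0301']
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(composed == decomposed)                                # False
```

The two strings render identically but compare unequal until both are
normalised to the same form.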

So I'm afraid I do not understand your sentence.

Code point: http://www.unicode.org/glossary/#code_point

Code unit: http://www.unicode.org/glossary/#code_unit

Code value: http://www.unicode.org/glossary/#code_value

See also http://unicode.org/faq/char_combmark.html

> But in ECMAScript implementations (so far), a *code point value* equals a
> character, and that is a problem in ECMAScript because there the value
> range is limited to what can be encoded in 16 bits.  The problem starts
> beyond the BMP where 16 bits are no longer sufficient for a code sequence
> and code point value, and code sequence and code point value are no
> longer equal.

This is no clearer.

I *think* what you are trying to say is that ECMAScript assumes that one
code point is always represented by a single code unit. So a sequence of
code points ABCD will be correctly interpreted as four "characters" so long
as each of those code points is in the BMP (i.e. between U+0000 and U+FFFF
inclusive), but *not* if they are from one of the supplementary planes.

This is the same problem that older Python "narrow builds" suffered from.
The solution in Python was to use a wide build (each code point is
represented by a single UTF-32 code unit, that is, four bytes) or to
upgrade to Python 3.3, which uses a compressed coding scheme where strings
are represented by either 1 byte per code point, 2 bytes per code point, or
4 bytes per code point, whichever is the minimum needed for that particular
string.
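
One way to observe this on CPython 3.3 or later is to compare the memory
footprint of strings of equal length. This is only a sketch: the exact
numbers from sys.getsizeof are an implementation detail and other
interpreters may behave differently, so only the relative ordering is
of interest.

```python
import sys

n = 1000
latin1 = "a" * n            # every code point fits in 1 byte
bmp = "\u0100" * n          # needs 2 bytes per code point
astral = "\U00010000" * n   # needs 4 bytes per code point

# All three have the same length in code points, but different storage sizes.
for label, s in (("1-byte", latin1), ("2-byte", bmp), ("4-byte", astral)):
    print(label, len(s), sys.getsizeof(s))
```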

My opinion is that a programming language like Python or ECMAScript should
operate on *code points*. If we want to call them "characters" informally,
that should be allowed, but whenever there is ambiguity we should remember
we're dealing with code points. The implementation shouldn't matter:
compliant Python interpreters might choose to use UTF-8 internally, or
UTF-16, or UTF-32, or something else, and still agree on how many
characters a string contains. Normalisation is still an issue, of course,
but any decent Unicode implementation will include a way to normalise or
denormalise strings.

The question of graphemes (what "ordinary people" consider letters and
characters, e.g. "ch" is two letters to an English speaker but one letter
to a Czech speaker) should be left to libraries. It's a much harder problem
to solve in the full general case, requires localisation, and is overkill
for many string-processing tasks.

-- 
Steven


#92237

From	Chris Angelico <rosuav@gmail.com>
Date	2015-06-07 22:08 +1000
Message-ID	<mailman.243.1433678894.13271.python-list@python.org>
In reply to	#92233

On Sun, Jun 7, 2015 at 9:42 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> My opinion is that a programming language like Python or ECMAScript should
> operate on *code points*. If we want to call them "characters" informally,
> that should be allowed, but whenever there is ambiguity we should remember
> we're dealing with code points. The implementation shouldn't matter:
> compliant Python interpreters might choose to use UTF-8 internally, or
> UTF-16, or UTF-32, or something else, and still agree on how many
> characters a string contains. Normalisation is still an issue, of course,
> but any decent Unicode implementation will include a way to normalise or
> denormalise strings.

If by "normalise" you mean the NF[K]C/NF[K]D composition and
decomposition, then yes, any decent Unicode library will provide that.
I'm not sure it's critical to string handling itself, though; and
Python defers the operation to the unicodedata module:

>>> import unicodedata
>>> s1 = "\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}"
>>> s2 = "\N{LATIN SMALL LETTER A WITH ACUTE}"
>>> s1 == s2
False
>>> unicodedata.normalize("NFC", s1) == s2
True

It's a useful operation to be able to do, but I would never expect
that *string comparison* or other operations should automatically
normalize. (Unless you want to say that all strings are guaranteed to
be NFC/NFD normalized, such that s1 and s2 would actually be
identical, which I suppose is plausible. I'm not sure what the
advantage would be, though. And certainly you wouldn't want to
K-normalize strings automatically.)

> The question of graphemes (what "ordinary people" consider letters and
> characters, e.g. "ch" is two letters to an English speaker but one letter
> to a Czech speaker) should be left to libraries. It's a much harder problem
> to solve in the full general case, requires localisation, and is overkill
> for many string-processing tasks.

Yeah. The basic challenge to a beginning programmer, "reverse this
string", becomes rather tricky in the presence of natural language.

>>> s1 += "e"
>>> s1
'áe'
>>> s1[::-1]
'éa'

Oops.

But hey. It's easier to understand what went wrong here than, say, if
you reverse the bytes in a UTF-8 stream. Or the code units in a UTF-16
stream. If you're lucky, those would give you instant errors... if
you're not, well, who knows.
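
For what it's worth, a rough way to avoid the misplaced accent is to keep each combining mark attached to its base code point while reversing. The sketch below is a simplification, not full grapheme cluster segmentation as described in UAX #29, and the function name is hypothetical.

```python
import unicodedata

def reverse_keeping_marks(text):
    """Reverse text, keeping combining marks after their base code point."""
    clusters = []
    for ch in text:
        # A non-zero combining class means the code point is a combining
        # mark, so it stays glued to the cluster in front of it.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))

s1 = "a\N{COMBINING ACUTE ACCENT}e"
print(ascii(s1[::-1]))                    # accent now follows the "e"
print(ascii(reverse_keeping_marks(s1)))   # accent still follows the "a"
```

This will still get things wrong for sequences that are not built from combining marks, such as Hangul jamo or emoji joined with ZERO WIDTH JOINER.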

ChrisA


#92250

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-06-07 23:24 +1000
Message-ID	<557445f6$0$12997$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92237

On Sun, 7 Jun 2015 10:08 pm, Chris Angelico wrote:

> On Sun, Jun 7, 2015 at 9:42 PM, Steven D'Aprano <steve@pearwood.info>
> wrote:
>> My opinion is that a programming language like Python or ECMAScript
>> should operate on *code points*. If we want to call them "characters"
>> informally, that should be allowed, but whenever there is ambiguity we
>> should remember we're dealing with code points. The implementation
>> shouldn't matter: compliant Python interpreters might choose to use UTF-8
>> internally, or UTF-16, or UTF-32, or something else, and still agree on
>> how many characters a string contains. Normalisation is still an issue,
>> of course, but any decent Unicode implementation will include a way to
>> normalise or denormalise strings.
> 
> If by "normalise" you mean the NF[K]C/NF[K]D composition and
> decomposition, then yes, any decent Unicode library will provide that.

Dat's der bunny!

> I'm not sure it's critical to string handling itself, though; and
> Python defers the operation to the unicodedata module:
> 
>>>> s1 = "\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}"
>>>> s2 = "\N{LATIN SMALL LETTER A WITH ACUTE}"
>>>> s1 == s2
> False
>>>> unicodedata.normalize("NFC", s1) == s2
> True
> 
> It's a useful operation to be able to do, but I would never expect
> that *string comparison* or other operations should automatically
> normalize.

I completely agree.

It might be convenient to have a string equality method that did
normalisation, but for most cases it would be unnecessary and slow. I think
that's the sort of thing which should be left to a subclass of str, and it
should normalise on construction.
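
Something like the following is what I have in mind. It is only a sketch: the class name is made up, and the choice of NFC rather than NFD is an arbitrary assumption.

```python
import unicodedata

class NormalizedStr(str):
    """A str subclass that stores its value in NFC form."""

    def __new__(cls, value=""):
        # Normalise once, at construction, so that the inherited
        # comparison and hashing work on the canonical form.
        return super().__new__(cls, unicodedata.normalize("NFC", value))

a = NormalizedStr("a\N{COMBINING ACUTE ACCENT}")
b = NormalizedStr("\N{LATIN SMALL LETTER A WITH ACUTE}")
print(a == b)
print(len(a))
```

Note that inherited methods and operators such as `+` return plain str objects, so a real implementation would need to decide whether to re-wrap their results.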


> (Unless you want to say that all strings are guaranteed to 
> be NFC/NFD normalized, such that s1 and s2 would actually be
> identical, which I suppose is plausible. I'm not sure what the
> advantage would be, though. And certainly you wouldn't want to
> K-normalize strings automatically.)

I believe that filenames on Apple file systems (HFS+ if I remember
correctly) are guaranteed to be both normalised and correctly encoded as
UTF-8. If you could live in a purely Apple world, you'd have far fewer
filename hassles.



-- 
Steven


#92257

From	Chris Angelico <rosuav@gmail.com>
Date	2015-06-08 00:47 +1000
Message-ID	<mailman.250.1433688442.13271.python-list@python.org>
In reply to	#92250

On Sun, Jun 7, 2015 at 11:24 PM, Steven D'Aprano <steve@pearwood.info> wrote:
>> (Unless you want to say that all strings are guaranteed to
>> be NFC/NFD normalized, such that s1 and s2 would actually be
>> identical, which I suppose is plausible. I'm not sure what the
>> advantage would be, though. And certainly you wouldn't want to
>> K-normalize strings automatically.)
>
> I believe that filenames on Apple file systems (HFS+ if I remember
> correctly) are guaranteed to be both normalised and correctly encoded as
> UTF-8. If you could live in a purely Apple world, you'd have far fewer
> filename hassles.

Yep. Actually, there should be nothing stopping the next Linux file
system ("ext5" or whatever) from enforcing the same thing;
byte-oriented filename APIs would still work just fine, but you could
have some confidence that at least local file systems will normally be
decodable as UTF-8. Then the only time you'd have to worry about
encoding problems would be network or removable file systems - no
worrying about "what's the FS encoding", because it'll just be UTF-8.
(Hmm. Point of interest: What happens on a Mac if you network-mount
something that isn't Unicode? If the enforcement of UTF-8 and
normalization is done at the file system level, it's no different from
the current Linux situation, where basically anything goes.)

But that's file names, not strings in a program. I'm not sure that
mandating that strings be normalized is particularly useful, but on
the flip side, I'm not sure of any situation where it'd be majorly
problematic either. There are ambiguities in some encodings, and as
soon as you decode from them and re-encode, you've effectively folded
those ambiguities to some canonical form; if your language
automatically normalized strings, you'd just have the same effect of
folding. And then you could have encode methods that stipulate the
other form of normalization - say you NFD everything internally, you
could then have a method "a\u0301".encode("utf-8", combine=True) which
NFC normalizes prior to encoding (and would thus be C3 A1 instead of
61 CC 81). Are there any languages out there that work this way?
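
In current Python there is no such combine parameter on str.encode, but the idea can be approximated with a helper function. This is a sketch only; the function name and its combine argument are hypothetical.

```python
import unicodedata

def encode_utf8(text, combine=False):
    """Encode to UTF-8, optionally NFC-normalising first."""
    if combine:
        text = unicodedata.normalize("NFC", text)
    return text.encode("utf-8")

decomposed = "a\u0301"
print(encode_utf8(decomposed).hex(" "))                # bytes of the NFD form
print(encode_utf8(decomposed, combine=True).hex(" "))  # bytes of the NFC form
```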

ChrisA


#92258

From	random832@fastmail.us
Date	2015-06-07 10:58 -0400
Message-ID	<mailman.251.1433689134.13271.python-list@python.org>
In reply to	#92233

On Sun, Jun 7, 2015, at 07:42, Steven D'Aprano wrote:
> The question of graphemes (what "ordinary people" consider letters and
> characters, e.g. "ch" is two letters to an English speaker but one letter
> to a Czech speaker) should be left to libraries.

Do Czech speakers expect to be able to select and delete it as a single
unit and never have the cursor in the middle of it? If not, then this is
not really fundamentally the same thing as what we have with combining
characters or certain sequences of Indic letters.

Also, "should be left to libraries" isn't really a coherent statement
when we are talking about the design of the standard *library*.


#92268

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-06-08 02:28 +1000
Message-ID	<55747134$0$13005$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92258

On Mon, 8 Jun 2015 12:58 am, random832@fastmail.us wrote:

> On Sun, Jun 7, 2015, at 07:42, Steven D'Aprano wrote:
>> The question of graphemes (what "ordinary people" consider letters and
>> characters, e.g. "ch" is two letters to an English speaker but one letter
>> to a Czech speaker) should be left to libraries.
> 
> Do Czech speakers expect to be able to select and delete it as a single
> unit and never have the cursor in the middle of it?

You'd have to ask one. I expect the answer is No, because they're used to
using software written by English speakers who think that "ch" is two
letters.

Whether they would *like* to stick the cursor between the c and the h is a
different question to whether they would *expect* it.

There may even be words where "ch" counts as two letters, where the "c" is
at the end of one syllable but the "h" is the beginning of the next.
(That's certainly the case for Dutch "ij".) Natural language is *hard*.

But generally speaking, I expect that when Czech speakers are playing (say)
Scrabble, they would want to have a tile called "CH" which they can play as
a single letter.

> If not, then this is 
> not really fundamentally the same thing as what we have with combining
> characters or certain sequences of Indic letters.

I'll have to take your word for that.

> 
> Also, "should be left to libraries" isn't really a coherent statement
> when we are talking about the design of the standard *library*.

The language offers a certain view of strings, which is reflected in the
methods that strings have, and built-in functions that operate on strings.
Should len('ch') return 1 or 2? If you think that the language should treat
strings as sequences of graphemes, then you will answer "sometimes 1".
Maybe there is a global setting to set the locale

LANG = 'Cz'
len('ch')
=> returns 1

or an optional parameter that you can pass to len:

len('ch', lang='Cz')
=> returns 1

len('ch', lang='En')
=> returns 2

But if you think that the language should treat strings as sequences of code
points, as I do, then there's only one reasonable thing for len('ch') to
return, and that is 2. But *some library* (as opposed to the built-in str
type) can offer a grapheme view of strings:

from language_tools import Graphemes
g = Graphemes.fromstr('ch', lang='Cz', exceptions=['xchx', 'ychy'])
len(g)
=> returns 1
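
There is no language_tools module that I know of, so here is a toy version of the same idea that actually runs. Everything in it is an assumption made for illustration: the function name, the 'Cz' and 'En' language tags, and the tiny digraph table. It ignores the exceptions list and everything else that makes the real problem hard.

```python
# Hypothetical table of multi-code-point letters per language.
DIGRAPHS = {"Cz": ("ch",), "En": ()}

def letters(text, lang):
    """Split text into letters as a speaker of lang might count them."""
    digraphs = DIGRAPHS.get(lang, ())
    result = []
    i = 0
    while i < len(text):
        for d in digraphs:
            # Compare case-insensitively, slice by slice.
            if text[i:i + len(d)].lower() == d:
                result.append(text[i:i + len(d)])
                i += len(d)
                break
        else:
            # No digraph starts here, so take a single code point.
            result.append(text[i])
            i += 1
    return result

print(len("ch"))                 # the built-in view: code points
print(len(letters("ch", "Cz")))  # the library view for Czech
print(len(letters("ch", "En")))  # the library view for English
```

The built-in len() stays simple and predictable, and the locale-dependent counting lives in a separate function that you only pay for when you need it.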

Do you still think this is incoherent?

-- 
Steven
