Groups > comp.lang.python > #91044 > unrolled thread
| Started by | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| First post | 2015-05-23 00:58 +1000 |
| Last post | 2015-05-22 21:33 -0600 |
| Articles | 20 on this page of 77 — 24 participants |
Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-05-23 00:58 +1000
Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 01:29 +1000
Re: Ah Python, you have spoiled me for all other languages wxjmfauth@gmail.com - 2015-05-22 10:57 -0700
Re: Ah Python, you have spoiled me for all other languages Tim Daneliuk <tundra@tundraware.com> - 2015-05-22 16:40 -0500
Re: Ah Python, you have spoiled me for all other languages Tim Daneliuk <tundra@tundraware.com> - 2015-05-22 16:40 -0500
Re: Ah Python, you have spoiled me for all other languages Terry Reedy <tjreedy@udel.edu> - 2015-05-22 21:54 -0400
Re: Ah Python, you have spoiled me for all other languages Tim Daneliuk <tundra@tundraware.com> - 2015-05-23 06:12 -0500
Re: Ah Python, you have spoiled me for all other languages Tim Daneliuk <tundra@tundraware.com> - 2015-05-23 06:12 -0500
Re: Ah Python, you have spoiled me for all other languages Terry Reedy <tjreedy@udel.edu> - 2015-05-23 13:26 -0400
Re: Ah Python, you have spoiled me for all other languages Michael Torrie <torriem@gmail.com> - 2015-05-22 21:31 -0600
Re: Ah Python, you have spoiled me for all other languages Johannes Bauer <dfnsonfsduifb@gmx.de> - 2015-05-23 08:55 +0200
Re: Ah Python, you have spoiled me for all other languages Tim Daneliuk <tundra@tundraware.com> - 2015-05-23 06:21 -0500
Re: Ah Python, you have spoiled me for all other languages Johannes Bauer <dfnsonfsduifb@gmx.de> - 2015-05-23 15:24 +0200
Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-23 20:05 +0300
Re: Ah Python, you have spoiled me for all other languages Johannes Bauer <dfnsonfsduifb@gmx.de> - 2015-05-24 20:29 +0200
Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-23 15:44 +0300
Re: Ah Python, you have spoiled me for all other languages Johannes Bauer <dfnsonfsduifb@gmx.de> - 2015-05-23 15:17 +0200
Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-05-24 00:00 +1000
Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-23 19:53 +0300
Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-24 03:41 +1000
Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-23 22:02 +0300
Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-05-24 20:26 +1000
Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-24 18:26 +0300
Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-25 01:35 +1000
Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-25 09:57 +0300
Re: Ah Python, you have spoiled me for all other languages Laura Creighton <lac@openend.se> - 2015-05-25 11:39 +0200
Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-25 21:09 +1000
Re: Ah Python, you have spoiled me for all other languages Michael Torrie <torriem@gmail.com> - 2015-05-23 21:00 -0600
Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-24 11:23 +0300
Re: Ah Python, you have spoiled me for all other languages Ian Kelly <ian.g.kelly@gmail.com> - 2015-05-22 22:10 -0600
Re: Ah Python, you have spoiled me for all other languages amber <amber.of.luxor@gmail.com> - 2015-05-23 04:11 +0000
Re: Ah Python, you have spoiled me for all other languages Tim Daneliuk <tundra@tundraware.com> - 2015-05-23 06:11 -0500
Re: Ah Python, you have spoiled me for all other languages Tim Daneliuk <tundra@tundraware.com> - 2015-05-23 06:11 -0500
Re: Ah Python, you have spoiled me for all other languages Ben Finney <ben+python@benfinney.id.au> - 2015-05-23 14:20 +1000
Re: Ah Python, you have spoiled me for all other languages Michael Torrie <torriem@gmail.com> - 2015-05-22 22:30 -0600
Re: Ah Python, you have spoiled me for all other languages Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-23 11:10 +0000
Re: Ah Python, you have spoiled me for all other languages Tim Chase <python.list@tim.thechases.com> - 2015-05-23 06:34 -0500
Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 21:40 +1000
Re: Ah Python, you have spoiled me for all other languages Michael Torrie <torriem@gmail.com> - 2015-05-23 20:57 -0600
Re: Ah Python, you have spoiled me for all other languages Ian Kelly <ian.g.kelly@gmail.com> - 2015-05-24 01:22 -0600
Re: Ah Python, you have spoiled me for all other languages Ian Kelly <ian.g.kelly@gmail.com> - 2015-05-22 22:29 -0600
Re: Ah Python, you have spoiled me for all other languages Ian Kelly <ian.g.kelly@gmail.com> - 2015-05-22 22:49 -0600
Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 14:49 +1000
Re: Ah Python, you have spoiled me for all other languages Tim Daneliuk <tundra@tundraware.com> - 2015-05-23 06:29 -0500
Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 14:55 +1000
Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 14:28 +1000
Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 14:21 +1000
Re: Ah Python, you have spoiled me for all other languages Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-05-23 14:33 +0200
Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-05-23 23:01 +1000
Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 23:12 +1000
Re: Ah Python, you have spoiled me for all other languages wxjmfauth@gmail.com - 2015-05-23 23:37 -0700
Re: Ah Python, you have spoiled me for all other languages Ned Batchelder <ned@nedbatchelder.com> - 2015-05-23 06:35 -0700
Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-05-24 00:09 +1000
Re: Ah Python, you have spoiled me for all other languages Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-07 10:21 +0200
Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-06-07 21:42 +1000
Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-06-07 22:08 +1000
Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-06-07 23:24 +1000
Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-06-08 00:47 +1000
Re: Ah Python, you have spoiled me for all other languages random832@fastmail.us - 2015-06-07 10:58 -0400
Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-06-08 02:28 +1000
Re: Ah Python, you have spoiled me for all other languages Tony the Tiger <tony@tiger.invalid> - 2015-05-22 16:31 +0000
Re: Ah Python, you have spoiled me for all other languages Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-05-22 17:57 +0100
Re: Ah Python, you have spoiled me for all other languages Tim Daneliuk <tundra@tundraware.com> - 2015-05-22 16:41 -0500
Re: Ah Python, you have spoiled me for all other languages Tony the Tiger <tony@tiger.invalid> - 2015-05-23 20:25 +0000
Re: Ah Python, you have spoiled me for all other languages Grant Edwards <invalid@invalid.invalid> - 2015-05-22 17:47 +0000
Re: Ah Python, you have spoiled me for all other languages Chris Angelico <rosuav@gmail.com> - 2015-05-23 04:11 +1000
Re: Ah Python, you have spoiled me for all other languages mm0fmf <none@mailinator.com> - 2015-05-22 19:19 +0100
Re: Ah Python, you have spoiled me for all other languages Laura Creighton <lac@openend.se> - 2015-05-22 21:14 +0200
Re: Ah Python, you have spoiled me for all other languages Steven D'Aprano <steve@pearwood.info> - 2015-05-23 11:36 +1000
Re: Ah Python, you have spoiled me for all other languages MRAB <python@mrabarnett.plus.com> - 2015-05-22 20:34 +0100
Re: Ah Python, you have spoiled me for all other languages Ian Kelly <ian.g.kelly@gmail.com> - 2015-05-22 13:56 -0600
Re: Ah Python, you have spoiled me for all other languages Marko Rauhamaa <marko@pacujo.net> - 2015-05-22 23:34 +0300
Re: Ah Python, you have spoiled me for all other languages Tim Chase <python.list@tim.thechases.com> - 2015-05-22 15:55 -0500
Re: Ah Python, you have spoiled me for all other languages Ethan Furman <ethan@stoneleaf.us> - 2015-05-22 14:15 -0700
Re: Ah Python, you have spoiled me for all other languages Ian Kelly <ian.g.kelly@gmail.com> - 2015-05-22 15:20 -0600
Re: Ah Python, you have spoiled me for all other languages Paul Rubin <no.email@nospam.invalid> - 2015-05-22 16:00 -0700
Re: Ah Python, you have spoiled me for all other languages Michael Torrie <torriem@gmail.com> - 2015-05-22 21:33 -0600
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2015-05-22 22:29 -0600 |
| Message-ID | <mailman.258.1432355434.17265.python-list@python.org> |
| In reply to | #91078 |
On Fri, May 22, 2015 at 10:20 PM, Ben Finney <ben+python@benfinney.id.au> wrote:
> Ian Kelly <ian.g.kelly@gmail.com> writes:
>
>> On Fri, May 22, 2015 at 9:31 PM, Michael Torrie <torriem@gmail.com> wrote:
>> > On 05/22/2015 07:54 PM, Terry Reedy wrote:
>> >> On 5/22/2015 5:40 PM, Tim Daneliuk wrote:
>> >>
>> >>> Lo these many years ago, I argued that Python is a whole lot more than
>> >>> a programming language:
>> >>>
>> >>> https://www.tundraware.com/TechnicalNotes/Python-Is-Middleware/
>> >>
>> >> Perhaps something at tundraware needs updating.
>> >> '''
>> >> This Connection is Untrusted
>> >>
>> >> You have asked Firefox to connect securely to www.tundraware.com, but we
>> >> can't confirm that your connection is secure.
>> >> […]
>
>> Without some prior reason to trust the certificate, the certificate is
>> meaningless. How is the browser to distinguish between a legitimate
>> self-signed cert and a self-signed cert presented by an attacker
>> conducting a man-in-the-middle attack?
>
> Any unencrypted HTTP (“http://…”) connection has the same problem. Yet
> the same browsers don't present a big scary warning for those?
>
> The flaw in the browser is that it doesn't complain when an unencrypted
> HTTP connection is established, but only complains when an *encrypted*
> connection is made to a site with a self-signed certificate.
>
>> There is still some value in TLS with a self-signed certificate in
>> that at least the connection is encrypted and can't be eavesdropped by
>> an attacker who can only read the channel, but there is no assurance
>> that the party you're communicating with actually owns the public key
>> that you've been presented.
>
> Right. By that logic, let's advocate for browsers to present a big
> intrusive warning for every HTTP connection that has no SSL layer or
> certificate.
>
> I will agree that a self-signed certificate presents the problem of how
> to verify the certificate automatically.
>
> Where I disagree is that this is somehow less secure than a completely
> *unencrypted* HTTP connection. No, the opposite is true.

I don't disagree with you. There *should* be scary warnings for plain
HTTP connections (although there is a counter-argument that many sites
don't need any encryption and HTTPS would just be wasteful in those
cases). The fact that browsers don't yet provide those warnings
doesn't change anything that I wrote above.
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2015-05-22 22:49 -0600 |
| Message-ID | <mailman.259.1432356589.17265.python-list@python.org> |
| In reply to | #91078 |
On Fri, May 22, 2015 at 10:30 PM, Michael Torrie <torriem@gmail.com> wrote:
> On 05/22/2015 10:10 PM, Ian Kelly wrote:
>> Sure it is. Without some prior reason to trust the certificate, the
>> certificate is meaningless. How is the browser to distinguish between
>> a legitimate self-signed cert and a self-signed cert presented by an
>> attacker conducting a man-in-the-middle attack?
>
> How does a CA actually help this problem? It just puts trust in some
> third party. But as we know, CA authorities are not all trustworthy and
> they certainly don't guarantee that the site is what it says it is.

Nobody is forcing you to trust them. Go ahead and remove the CA
certificates that you consider untrustworthy if you want. Remove all
of them if you like, although good luck with verifying all those site
certificates yourself. The CA helps because some assurance is better
than none.

>> There is still some value in TLS with a self-signed certificate in
>> that at least the connection is encrypted and can't be eavesdropped
>> by an attacker who can only read the channel, but there is no
>> assurance that the party you're communicating with actually owns the
>> public key that you've been presented.
>
> The same can be said of CA-signed certificates. The only way to know if
> the site is who they say they are is to know what the cert's fingerprint
> ought to be and see if it still is. I used to use a firefox plugin for
> this purpose, but certs for some major sites like even www.google.com
> change with such frequency that the utility of the plugin went away.

So instead of trusting a CA, you have to trust the maintainers of the
plugin. How is that any different?
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-05-23 14:49 +1000 |
| Message-ID | <mailman.260.1432356599.17265.python-list@python.org> |
| In reply to | #91078 |
On Sat, May 23, 2015 at 2:29 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> There *should* be scary warnings for plain
> HTTP connections (although there is a counter-argument that many sites
> don't need any encryption and HTTPS would just be wasteful in those
> cases).

I don't think there should be "scary warnings", for precisely this
reason. When the information you're sharing is completely public,
there's no point taking the overhead of encryption. So there should be
two normal and acceptable ways to access data: either unencrypted, or
encrypted with a verified certificate. Oh look, that's what we have.
There is an assumption that your system certificate store is
trustworthy, but for the typical user, it's probably better than
they'll get any other way, and for an atypical user, it can be pruned
easily.

But I think we're just a smidge off-topic here.

ChrisA
| From | Tim Daneliuk <tundra@tundraware.com> |
|---|---|
| Date | 2015-05-23 06:29 -0500 |
| Message-ID | <556064A5.5030502@tundraware.com> |
| In reply to | #91099 |
On 05/22/2015 11:49 PM, Chris Angelico wrote:
> When the information you're sharing is completely public,
> there's no point taking the overhead of encryption.

I disagree. With two different ways to access data, the metadata about
when you do and do not use an encrypted channel can be useful to a
snoopy third party. For example, repressive governments might use the
fact of your connecting via https as prima facie evidence you're
doing something subversive. The argument for https everywhere is that
this sort of distinction becomes impossible to make and one less piece
of metadata is left around to misuse.

--
----------------------------------------------------------------------------
Tim Daneliuk tundra@tundraware.com
PGP Key: http://www.tundraware.com/PGP/
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-05-23 14:55 +1000 |
| Message-ID | <mailman.261.1432356947.17265.python-list@python.org> |
| In reply to | #91078 |
On Sat, May 23, 2015 at 2:49 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
>> The same can be said of CA-signed certificates. The only way to know if
>> the site is who they say they are is to know what the cert's fingerprint
>> ought to be and see if it still is. I used to use a firefox plugin for
>> this purpose, but certs for some major sites like even www.google.com
>> change with such frequency that the utility of the plugin went away.
>
> So instead of trusting a CA, you have to trust the maintainers of the
> plugin. How is that any different?

It brings it local. If you're able to see the source code for the
plugin, you could check exactly how it does its verification (and by
the sound of it, it'd be pretty simple: just look up the cert, see if
it's different, if so, big noisy warning). Or, of course, you could do
the check yourself: click on the padlock, look at fingerprint, compare
against previously-noted fingerprint. That'd at least prove that your
plugin is checking properly.

But it still doesn't solve the fundamental problem of knowing when you
have the right site to start with.

ChrisA
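The manual check described here (note a certificate's fingerprint once, then compare it on later visits) can be sketched in a few lines of Python. This is only an illustration, not how the Firefox plugin mentioned in the thread works: the function names `fingerprint` and `fingerprint_matches` are made up for the example, and in practice the certificate bytes would come from something like `ssl.SSLSocket.getpeercert(binary_form=True)`.

```python
import hashlib
import hmac


def fingerprint(der_cert: bytes) -> str:
    """Return the SHA-256 fingerprint of a DER-encoded certificate as hex."""
    return hashlib.sha256(der_cert).hexdigest()


def fingerprint_matches(der_cert: bytes, expected_hex: str) -> bool:
    """Compare a certificate against a previously noted fingerprint."""
    # Tolerate the "AB:CD:..." notation that browsers tend to display.
    normalised = expected_hex.replace(":", "").strip().lower()
    return hmac.compare_digest(fingerprint(der_cert), normalised)
```

As the post says, a match only shows that the certificate has not changed since it was noted; it says nothing about whether the first copy came from the right site.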
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-05-23 14:28 +1000 |
| Message-ID | <mailman.262.1432357208.17265.python-list@python.org> |
| In reply to | #91078 |
On Sat, May 23, 2015 at 2:20 PM, Ben Finney <ben+python@benfinney.id.au> wrote:
> Where I disagree is that this is somehow less secure than a completely
> *unencrypted* HTTP connection. No, the opposite is true.

No, it isn't less secure. However, people have been trained for years
to look for the padlock (including looking for padlocks before
entering credit card numbers or passwords, despite the fact that HTTPS
on the form isn't actually what's significant), and that's the key
here. Web browsers are intended for *humans* to use. You want a truly
secure connection between your Python client script and your Python
server? Sure, self-signed cert is great. You want something that an
average Joe can understand? Do what 99% of the world does, and get a
CA-signed cert.

Unencrypted is normal, encrypted is normal, and the only thing that's
being flagged is "hey, this *looks* secured, but it might not be the
right server". It's still encrypted, but the unverified origin is a
potential problem.

ChrisA
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-05-23 14:21 +1000 |
| Message-ID | <mailman.263.1432360834.17265.python-list@python.org> |
| In reply to | #91078 |
On Sat, May 23, 2015 at 2:10 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
>> Sigh. I blame this as much on the browser. There's no inherent reason
>> why a connection to a site secured with a self-signed certificate is
>> insecure. In fact it's definitely not.
>
> Sure it is. Without some prior reason to trust the certificate, the
> certificate is meaningless. How is the browser to distinguish between
> a legitimate self-signed cert and a self-signed cert presented by an
> attacker conducting a man-in-the-middle attack?
>
> There is still some value in TLS with a self-signed certificate in
> that at least the connection is encrypted and can't be eavesdropped by
> an attacker who can only read the channel, but there is no assurance
> that the party you're communicating with actually owns the public key
> that you've been presented.

To be fair, certificates never actually tell you that the owner is
legitimate - all they do is move the problem. Self-signed certs move
the problem to "how do you get a guaranteed copy of this exact
server's certificate", which makes it an out-of-band issue (if you
meet someone you know in person and get a copy of the cert on a USB
stick, then manually install it, you can be sure it's safe);
externally-signed certs move the problem to the certificate chain and
its reliability (how well do the CAs check ownership prior to issuing
a certificate?). Both are still problematic, just in different ways.

Self-signed certs are ideal if you're packaging your own client - you
could keep the IP address and certificate in the same VCS repository.
Anyone who can change the cert can also change the IP address, so you
lose no security there. But they're way WAY more hassle for https on
the public internet.

ChrisA
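The "package your own client" case could look roughly like the sketch below, which assumes the server's self-signed certificate is shipped with the client as a PEM file. The helper name `make_pinned_context` and its `cert_path` parameter are invented for the illustration; the `ssl` calls themselves are from the standard library.

```python
import ssl


def make_pinned_context(cert_path=None):
    """Build a client-side TLS context that trusts only one certificate file.

    cert_path is a hypothetical PEM file kept alongside the client code.
    """
    # PROTOCOL_TLS_CLIENT enables certificate and hostname checking and,
    # unlike ssl.create_default_context(), loads no system CA certificates.
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if cert_path is not None:
        # The bundled self-signed certificate becomes the only trust anchor.
        context.load_verify_locations(cafile=cert_path)
    return context
```

A socket wrapped with this context should reject any server that does not present the bundled certificate. Note that hostname checking stays on, so the certificate would need to name the host or IP address the client connects to.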
| From | Thomas 'PointedEars' Lahn <PointedEars@web.de> |
|---|---|
| Date | 2015-05-23 14:33 +0200 |
| Message-ID | <2212595.DFZ6OqehRn@PointedEars.de> |
| In reply to | #91045 |
Chris Angelico wrote:

> […] My hobby-horse, Unicode, is a notable flaw in many languages - if you
> ask the user for information (in the most obvious way for whatever
> environment you're in, be that via a web browser request, or a GUI widget,
> or text entered at the console), can it cope equally with all the world's
> languages? What if you want to manipulate that text - is it represented as
> a sequence of codepoints (Python 3), UTF-16 code units (JavaScript),

If only characters were represented as sequences of UTF-16 code units in
ECMAScript implementations like JavaScript, there would not be a problem
beyond the BMP; see <http://PointedEars.de/wsvn/JSX/trunk/string/unicode.js>
and others for details.

--
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2015-05-23 23:01 +1000 |
| Message-ID | <55607a1b$0$13011$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #91120 |
On Sat, 23 May 2015 10:33 pm, Thomas 'PointedEars' Lahn wrote:

> If only characters were represented as sequences of UTF-16 code units in
> ECMAScript implementations like JavaScript, there would not be a problem
> beyond the BMP;

Are you being sarcastic?

This is Rhino:

js> var c = String.fromCharCode(65535); // in the BMP
js> print(c.charCodeAt(0));
65535

So far so good.

js> var c = String.fromCharCode(65536); // astral character
js> print(c.charCodeAt(0));
0

Can you name any ECMAScript implementation which correctly handles code
points in the supplementary planes?

By the way, for many years Python implemented Unicode as UTF-16 code units,
the so-called "narrow build":

py> c = unichr(65536)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Let's try again:

py> c = u'\U00010000' # a single code point
py> len(c)
2

I'm not saying that it is impossible to have a correct Unicode
implementation using UTF-16, but I've never seen one.

--
Steven
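For contrast, here is the same experiment on Python 3.3 or later, where `str` is indexed by code point. It is a short sketch only; the UTF-16 encoding step is done explicitly to show where the two code units seen by a narrow build come from.

```python
import struct

c = "\U00010000"  # first code point beyond the BMP

# Python 3.3+ counts code points, so this is a single "character".
code_points = len(c)

# Encoding to UTF-16 (little-endian, no BOM) exposes the surrogate pair
# that narrow builds and JavaScript strings count as two separate units.
raw = c.encode("utf-16-le")
code_units = struct.unpack("<%dH" % (len(raw) // 2), raw)

print(code_points)                   # 1
print([hex(u) for u in code_units])  # ['0xd800', '0xdc00']
```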
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-05-23 23:12 +1000 |
| Message-ID | <mailman.274.1432386750.17265.python-list@python.org> |
| In reply to | #91122 |
On Sat, May 23, 2015 at 11:01 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> I'm not saying that it is impossible to have a correct Unicode
> implementation using UTF-16, but I've never seen one.

I suspect this is partly because, if you're aiming for correct Unicode
semantics, UTF-8 offers everything that UTF-16 does and more. The only
reason to use UTF-16 is so you can pretend that UCS-2 is good enough.

ChrisA
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-05-23 23:37 -0700 |
| Message-ID | <6cbe6ba2-5ab5-4c7e-8d7b-59ab8066488c@googlegroups.com> |
| In reply to | #91123 |
On Saturday, 23 May 2015 15:12:42 UTC+2, Chris Angelico wrote:
> On Sat, May 23, 2015 at 11:01 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> > I'm not saying that it is impossible to have a correct Unicode
> > implementation using UTF-16, but I've never seen one.
>
> I suspect this is partly because, if you're aiming for correct Unicode
> semantics, UTF-8 offers everything that UTF-16 does and more. The only
> reason to use UTF-16 is so you can pretend that UCS-2 is good enough.
>
> ChrisA

Like your "colleague", it's also time to learn Unicode.

jmf
| From | Ned Batchelder <ned@nedbatchelder.com> |
|---|---|
| Date | 2015-05-23 06:35 -0700 |
| Message-ID | <2c4d029c-8ea5-465b-8adc-6c35185bd150@googlegroups.com> |
| In reply to | #91122 |
On Saturday, May 23, 2015 at 9:01:29 AM UTC-4, Steven D'Aprano wrote:
> On Sat, 23 May 2015 10:33 pm, Thomas 'PointedEars' Lahn wrote:
>
> > If only characters were represented as sequences of UTF-16 code units in
> > ECMAScript implementations like JavaScript, there would not be a problem
> > beyond the BMP;
>
> Are you being sarcastic?

IIUC, Thomas' point is that *characters* should be sequences of
codepoints, not that *strings* should be.

--Ned.
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2015-05-24 00:09 +1000 |
| Message-ID | <55608a27$0$13013$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #91127 |
On Sat, 23 May 2015 11:35 pm, Ned Batchelder wrote:

> On Saturday, May 23, 2015 at 9:01:29 AM UTC-4, Steven D'Aprano wrote:
>> On Sat, 23 May 2015 10:33 pm, Thomas 'PointedEars' Lahn wrote:
>>
>> > If only characters were represented as sequences of UTF-16 code units in
>> > ECMAScript implementations like JavaScript, there would not be a
>> > problem beyond the BMP;
>>
>> Are you being sarcastic?
>
> IIUC, Thomas' point is that *characters* should be sequences of
> codepoints, not that *strings* should be.

Like Python, Javascript/ECMAScript doesn't have a distinct character type,
it has strings which happen to be of length one. So I'm not sure I
understand the point you are trying to make.

There's also a bit of a problem in deciding what counts as a character. Is
IJ a single character, or two? The answer depends on whether you are Dutch
or not. Unicode punts on that decision, and leaves it up to the
application. Unicode only concerns itself with code points, which are
complex enough, and generally programming languages follow Unicode (usually
imperfectly).

Each code point (a.k.a. "character" if we're being sloppy) requires either
one or two 16-bit code units in UTF-16. I'm not sure that "1 or 2" counts
as a sequence.

--
Steven
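The "one or two 16-bit code units" rule is plain arithmetic and can be written out directly. The sketch below follows the standard UTF-16 surrogate-pair scheme; the function name `utf16_code_units` is invented for the illustration and is not part of any library.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units for one code point as a tuple of ints."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded on their own")
    if code_point < 0x10000:
        # Anything in the BMP fits in a single code unit.
        return (code_point,)
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return (high, low)
```

For example, `utf16_code_units(0x10000)` gives `(0xD800, 0xDC00)`, the same pair a narrow Python build counts when it reports a length of 2.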
| From | Thomas 'PointedEars' Lahn <PointedEars@web.de> |
|---|---|
| Date | 2015-06-07 10:21 +0200 |
| Message-ID | <2483375.eHyISxeWLQ@PointedEars.de> |
| In reply to | #91127 |
Ned Batchelder wrote:

> On Saturday, May 23, 2015 at 9:01:29 AM UTC-4, Steven D'Aprano wrote:
>> On Sat, 23 May 2015 10:33 pm, Thomas 'PointedEars' Lahn wrote:
>> > If only characters were represented as sequences of UTF-16 code units in
>> > ECMAScript implementations like JavaScript, there would not be a
>> > problem beyond the BMP;
>>
>> Are you being sarcastic?
>
> IIUC, Thomas' point is that *characters* should be sequences of
> codepoints, not that *strings* should be.

No, my point is that one character should be a sequence of code _units_
(for a code point value). But in ECMAScript implementations (so far), a
*code point value* equals a character, and that is a problem in ECMAScript
because there the value range is limited to what can be encoded in 16 bits.
The problem starts beyond the BMP where 16 bits are no longer sufficient
for a code sequence and code point value, and code sequence and code point
value are no longer equal.

--
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2015-06-07 21:42 +1000 |
| Message-ID | <55742e0e$0$12980$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #92211 |
On Sun, 7 Jun 2015 06:21 pm, Thomas 'PointedEars' Lahn wrote:

> Ned Batchelder wrote:
>
>> On Saturday, May 23, 2015 at 9:01:29 AM UTC-4, Steven D'Aprano wrote:
>>> On Sat, 23 May 2015 10:33 pm, Thomas 'PointedEars' Lahn wrote:
>>> > If only characters were represented as sequences of UTF-16 code units in
>>> > ECMAScript implementations like JavaScript, there would not be a
>>> > problem beyond the BMP;
>>>
>>> Are you being sarcastic?
>>
>> IIUC, Thomas' point is that *characters* should be sequences of
>> codepoints, not that *strings* should be.
>
> No, my point is that one character should be a sequence of code _units_
> (for a code point value).

I don't understand this sentence.

"Code point value" doesn't appear to be meaningful. "Code point" is a value
in the Unicode codespace, informally "a character" (but see below); code
points can take on values in the range 0 to 1114111, usually written in hex
as U+0000 to U+10FFFF.

"Code value" is an obsolete term for code unit, that is, the smallest chunk
of memory used to represent a code point. For example, UTF-8 uses 8-bit
code units, UTF-32 uses 32-bit code units.

But "code point value", I'm not sure what you mean by that.

Consequently I have no idea what you think a character should be. Is "Hello
World" a character? How about "Æ" or "û"? The term "character" is
problematic, because what counts as a character depends on where you are
and how the string is normalised. For example:

"ij" could be two characters, the letters i followed by j, or one, the 25th
letter of the Dutch language [and not even the Dutch agree on this];

conversely, "ij" could be a single character, or a ligature of two
characters.

"Ḗ" (U+1E16 LATIN CAPITAL LETTER E WITH MACRON AND ACUTE) could be
considered one character, or three 'E\u0304\u0301', depending on whether it
is normalised or not.

So I'm afraid I do not understand your sentence.

Code point: http://www.unicode.org/glossary/#code_point
Code unit: http://www.unicode.org/glossary/#code_unit
Code value: http://www.unicode.org/glossary/#code_value

See also http://unicode.org/faq/char_combmark.html

> But in ECMAScript implementations (so far), a *code
> point value* equals a character, and that is a problem in ECMAScript
> because there the value range is limited to what can be encoded in 16
> bits. The problem starts beyond the BMP where 16 bits are no longer
> sufficient for a code sequence and code point value, and code sequence
> and code point value are no longer equal.

This is no clearer.

I *think* what you are trying to say is that ECMAScript assumes that one
code point is always represented by a single code unit. So a sequence of
code points ABCD will be correctly interpreted as four "characters" so long
as each of those code points is in the BMP (i.e. between U+0000 and U+FFFF
inclusive), but *not* if they are from one of the supplementary planes.

This is the same problem that older Python "narrow builds" suffered from.
The solution in Python was to use a wide build (each code point is
represented by a single UTF-32 code unit, that is, four bytes) or to upgrade
to Python 3.3, which uses a compressed coding scheme where strings are
represented by either 1 byte per code point, 2 bytes per code point, or
4 bytes per code point, whichever is the minimum needed for that particular
string.

My opinion is that a programming language like Python or ECMAScript should
operate on *code points*. If we want to call them "characters" informally,
that should be allowed, but whenever there is ambiguity we should remember
we're dealing with code points. The implementation shouldn't matter:
compliant Python interpreters might choose to use UTF-8 internally, or
UTF-16, or UTF-32, or something else, and still agree on how many
characters a string contains. Normalisation is still an issue, of course,
but any decent Unicode implementation will include a way to normalise or
denormalise strings.

The question of graphemes (what "ordinary people" consider letters and
characters, e.g. "ch" is two letters to an English speaker but one letter
to a Czech speaker) should be left to libraries. It's a much harder problem
to solve in the full general case, requires localisation, and is overkill
for many string-processing tasks.

--
Steven
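The "one character or three" example with U+1E16 can be checked with the standard `unicodedata` module. This is a small sketch on Python 3; the variable names are chosen only for the illustration.

```python
import unicodedata

composed = "\u1e16"           # LATIN CAPITAL LETTER E WITH MACRON AND ACUTE
decomposed = "E\u0304\u0301"  # E + COMBINING MACRON + COMBINING ACUTE ACCENT

# Same text to a reader, different code point sequences to the language.
lengths = (len(composed), len(decomposed))

# Canonical normalisation converts between the two forms.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

print(lengths)            # (1, 3)
print(nfd == decomposed)  # True
print(nfc == composed)    # True
```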
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-06-07 22:08 +1000 |
| Message-ID | <mailman.243.1433678894.13271.python-list@python.org> |
| In reply to | #92233 |
On Sun, Jun 7, 2015 at 9:42 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> My opinion is that a programming language like Python or ECMAScript should
> operate on *code points*. If we want to call them "characters" informally,
> that should be allowed, but whenever there is ambiguity we should remember
> we're dealing with code points. The implementation shouldn't matter:
> compliant Python interpreters might choose to use UTF-8 internally, or
> UTF-16, or UTF-32, or something else, and still agree on how many
> characters a string contains. Normalisation is still an issue, of course,
> but any decent Unicode implementation will include a way to normalise or
> denormalise strings.
If by "normalise" you mean the NF[K]C/NF[K]D composition and
decomposition, then yes, any decent Unicode library will provide that.
I'm not sure it's critical to string handling itself, though; and
Python defers the operation to the unicodedata module:
>>> import unicodedata
>>> s1 = "\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}"
>>> s2 = "\N{LATIN SMALL LETTER A WITH ACUTE}"
>>> s1 == s2
False
>>> unicodedata.normalize("NFC", s1) == s2
True
It's a useful operation to be able to do, but I would never expect
that *string comparison* or other operations should automatically
normalize. (Unless you want to say that all strings are guaranteed to
be NFC/NFD normalized, such that s1 and s2 would actually be
identical, which I suppose is plausible. I'm not sure what the
advantage would be, though. And certainly you wouldn't want to
K-normalize strings automatically.)
> The question of graphemes (what "ordinary people" consider letters and
> characters, e.g. "ch" is two letters to an English speaker but one letter
> to a Czech speaker) should be left to libraries. It's a much harder problem
> to solve in the full general case, requires localisation, and is overkill
> for many string-processing tasks.
Yeah. The basic challenge to a beginning programmer, "reverse this
string", becomes rather tricky in the presence of natural language.
>>> s1 += "e"
>>> s1
'áe'
>>> s1[::-1]
'éa'
Oops.
But hey. It's easier to understand what went wrong here than, say, if
you reverse the bytes in a UTF-8 stream. Or the code units in a UTF-16
stream. If you're lucky, those would give you instant errors... if
you're not, well, who knows.
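For what it's worth, the reversal above can be patched up for simple cases by keeping each combining mark attached to its base character. This is only a rough sketch: the function name is made up for illustration, it relies on unicodedata.combining() (the canonical combining class), and it is not a full grapheme cluster algorithm of the kind a dedicated library would provide.

```python
import unicodedata

def reverse_keeping_marks(s):
    """Reverse s, keeping each combining mark after its base character."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # nonzero combining class: attach to the base
        else:
            clusters.append(ch)     # start a new base-plus-marks cluster
    return "".join(reversed(clusters))

s = "a\N{COMBINING ACUTE ACCENT}e"
print(reverse_keeping_marks(s) == "ea\N{COMBINING ACUTE ACCENT}")   # True
```

Characters that join their neighbours but have combining class 0 are not handled here, which is part of why the general problem belongs in a library.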
ChrisA
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2015-06-07 23:24 +1000 |
| Message-ID | <557445f6$0$12997$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #92237 |
On Sun, 7 Jun 2015 10:08 pm, Chris Angelico wrote:
> On Sun, Jun 7, 2015 at 9:42 PM, Steven D'Aprano <steve@pearwood.info>
> wrote:
>> My opinion is that a programming language like Python or ECMAScript
>> should operate on *code points*. If we want to call them "characters"
>> informally, that should be allowed, but whenever there is ambiguity we
>> should remember we're dealing with code points. The implementation
>> shouldn't matter: compliant Python interpreters might choose to use UTF-8
>> internally, or UTF-16, or UTF-32, or something else, and still agree on
>> how many characters a string contains. Normalisation is still an issue,
>> of course, but any decent Unicode implementation will include a way to
>> normalise or denormalise strings.
>
> If by "normalise" you mean the NF[K]C/NF[K]D composition and
> decomposition, then yes, any decent Unicode library will provide that.
Dat's der bunny!
> I'm not sure it's critical to string handling itself, though; and
> Python defers the operation to the unicodedata module:
>
> >>> s1 = "\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}"
> >>> s2 = "\N{LATIN SMALL LETTER A WITH ACUTE}"
> >>> s1 == s2
> False
> >>> unicodedata.normalize("NFC", s1) == s2
> True
>
> It's a useful operation to be able to do, but I would never expect
> that *string comparison* or other operations should automatically
> normalize.
I completely agree.
It might be convenient to have a string equality method that did
normalisation, but for most cases it would be unnecessary and slow. I think
that's the sort of thing which should be left to a subclass of str, and it
should normalise on construction.
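A minimal sketch of what such a subclass might look like, assuming NFC as the chosen form. The class name NFCStr is hypothetical, not anything in the standard library.

```python
import unicodedata

class NFCStr(str):
    """A str that is NFC-normalised once, at construction time."""

    def __new__(cls, value=""):
        # Normalise before the immutable str object is created.
        return super().__new__(cls, unicodedata.normalize("NFC", str(value)))

s1 = NFCStr("a\N{COMBINING ACUTE ACCENT}")
s2 = NFCStr("\N{LATIN SMALL LETTER A WITH ACUTE}")
print(s1 == s2, len(s1))   # True 1
```

Comparison then costs nothing extra, since it is ordinary str equality. The catch is that methods such as concatenation and slicing return plain str, so results would need to be wrapped again to stay normalised.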
> (Unless you want to say that all strings are guaranteed to
> be NFC/NFD normalized, such that s1 and s2 would actually be
> identical, which I suppose is plausible. I'm not sure what the
> advantage would be, though. And certainly you wouldn't want to
> K-normalize strings automatically.)
I believe that filenames on Apple file systems (HFS+ if I remember
correctly) are guaranteed to be both normalised and correctly encoded as
UTF-8. If you could live in a purely Apple world, you'd have far fewer
filename hassles.
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-06-08 00:47 +1000 |
| Message-ID | <mailman.250.1433688442.13271.python-list@python.org> |
| In reply to | #92250 |
On Sun, Jun 7, 2015 at 11:24 PM, Steven D'Aprano <steve@pearwood.info> wrote:
>> (Unless you want to say that all strings are guaranteed to
>> be NFC/NFD normalized, such that s1 and s2 would actually be
>> identical, which I suppose is plausible. I'm not sure what the
>> advantage would be, though. And certainly you wouldn't want to
>> K-normalize strings automatically.)
>
> I believe that filenames on Apple file systems (HFS+ if I remember
> correctly) are guaranteed to be both normalised and correctly encoded as
> UTF-8. If you could live in a purely Apple world, you'd have far fewer
> filename hassles.
Yep. Actually, there should be nothing stopping the next Linux file
system ("ext5" or whatever) from enforcing the same thing;
byte-oriented filename APIs would still work just fine, but you could
have some confidence that at least local file systems will normally be
decodable as UTF-8. Then the only time you'd have to worry about
encoding problems would be network or removable file systems - no
worrying about "what's the FS encoding", because it'll just be UTF-8.
(Hmm. Point of interest: What happens on a Mac if you network-mount
something that isn't Unicode? If the enforcement of UTF-8 and
normalization is done at the file system level, it's no different from
the current Linux situation, where basically anything goes.)
But that's file names, not strings in a program. I'm not sure that
mandating that strings be normalized is particularly useful, but on
the flip side, I'm not sure of any situation where it'd be majorly
problematic either. There are ambiguities in some encodings, and as
soon as you decode from them and re-encode, you've effectively folded
those ambiguities to some canonical form; if your language
automatically normalized strings, you'd just have the same effect of
folding. And then you could have encode methods that stipulate the
other form of normalization - say you NFD everything internally, you
could then have a method "a\u0301".encode("utf-8", combine=True) which
NFC normalizes prior to encoding (and would thus be C3 A1 instead of
61 CC 81). Are there any languages out there that work this way?
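Python's str.encode() has no combine parameter, but the idea can be sketched as a helper function. The name encode_utf8 and its combine flag are hypothetical, and the sketch assumes the "NFD internally" convention described above.

```python
import unicodedata

def encode_utf8(s, combine=False):
    """Encode s as UTF-8, normalising to NFC if combine is true, else NFD."""
    form = "NFC" if combine else "NFD"
    return unicodedata.normalize(form, s).encode("utf-8")

print(encode_utf8("a\u0301", combine=True).hex())   # c3a1
print(encode_utf8("a\u0301").hex())                 # 61cc81
```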
ChrisA
| From | random832@fastmail.us |
|---|---|
| Date | 2015-06-07 10:58 -0400 |
| Message-ID | <mailman.251.1433689134.13271.python-list@python.org> |
| In reply to | #92233 |
On Sun, Jun 7, 2015, at 07:42, Steven D'Aprano wrote:
> The question of graphemes (what "ordinary people" consider letters and
> characters, e.g. "ch" is two letters to an English speaker but one letter
> to a Czech speaker) should be left to libraries.
Do Czech speakers expect to be able to select and delete it as a single
unit and never have the cursor in the middle of it? If not, then this is
not really fundamentally the same thing as what we have with combining
characters or certain sequences of Indic letters.
Also, "should be left to libraries" isn't really a coherent statement
when we are talking about the design of the standard *library*.
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2015-06-08 02:28 +1000 |
| Message-ID | <55747134$0$13005$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #92258 |
On Mon, 8 Jun 2015 12:58 am, random832@fastmail.us wrote:
> On Sun, Jun 7, 2015, at 07:42, Steven D'Aprano wrote:
>> The question of graphemes (what "ordinary people" consider letters and
>> characters, e.g. "ch" is two letters to an English speaker but one letter
>> to a Czech speaker) should be left to libraries.
>
> Do Czech speakers expect to be able to select and delete it as a single
> unit and never have the cursor in the middle of it?
You'd have to ask one. I expect the answer is No, because they're used to
using software written by English speakers who think that "ch" is two
letters.
Whether they would *like* to stick the cursor between the c and the h is a
different question to whether they would *expect* it.
There may even be words where "ch" counts as two letters, where the "c" is
at the end of one syllable but the "h" is the beginning of the next.
(That's certainly the case for Dutch "ij".) Natural language is *hard*.
But generally speaking, I expect that when Czech speakers are playing (say)
Scrabble, they would want to have a tile called "CH" which they can play as
a single letter.
> If not, then this is
> not really fundamentally the same thing as what we have with combining
> characters or certain sequences of Indic letters.
I'll have to take your word for that.
>
> Also, "should be left to libraries" isn't really a coherent statement
> when we are talking about the design of the standard *library*.
The language offers a certain view of strings, which is reflected in the
methods that strings have, and built-in functions that operate on strings.
Should len('ch') return 1 or 2? If you think that the language should treat
strings as sequences of graphemes, then you will answer "sometimes 1".
Maybe there is a global setting to set the locale
LANG = 'Cz'
len('ch')
=> returns 1
or an optional parameter that you can pass to len:
len('ch', lang='Cz')
=> returns 1
len('ch', lang='En')
=> returns 2
But if you think that the language should treat strings as sequences of code
points, as I do, then there's only one reasonable thing for len('ch') to
return, and that is 2. But *some library* (as opposed to the built-in str
type) can offer a grapheme view of strings:
from language_tools import Graphemes
g = Graphemes.fromstr('ch', lang='Cz', exceptions=['xchx', 'ychy'])
len(g)
=> returns 1
Do you still think this is incoherent?
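To show that the library approach is at least implementable, here is a toy sketch of such a grapheme view. Everything in it is hypothetical: the class, the digraph table, and the 'Cz' and 'En' tags (taken from the examples above, where a real library would more likely use standard locale identifiers). The exceptions parameter is left out for brevity.

```python
class Graphemes:
    """A language-dependent 'letter' view of a str (toy sketch)."""

    # Hypothetical table of multi-code-point letters per language tag.
    DIGRAPHS = {"Cz": ("ch",), "En": ()}

    def __init__(self, letters):
        self._letters = tuple(letters)

    @classmethod
    def fromstr(cls, s, lang):
        digraphs = cls.DIGRAPHS.get(lang, ())
        letters, i = [], 0
        while i < len(s):
            for d in digraphs:
                if s[i:i + len(d)].lower() == d:
                    letters.append(s[i:i + len(d)])   # digraph is one letter
                    i += len(d)
                    break
            else:
                letters.append(s[i])                  # ordinary single letter
                i += 1
        return cls(letters)

    def __len__(self):
        return len(self._letters)

print(len("ch"),
      len(Graphemes.fromstr("ch", lang="Cz")),
      len(Graphemes.fromstr("ch", lang="En")))   # 2 1 2
```

The built-in len('ch') stays at 2 regardless of language, while the separate view answers the "how many letters" question.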
--
Steven