Everything you did not want to know about Unicode in Python 3

Started by	Mark Lawrence <breamoreboy@yahoo.co.uk>
First post	2014-05-12 16:19 +0100
Last post	2014-05-14 09:56 -0600
Articles	20 on this page of 72 — 25 participants

  Everything you did not want to know about Unicode in Python 3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-05-12 16:19 +0100
    Re: Everything you did not want to know about Unicode in Python 3 alister <alister.nospam.ware@ntlworld.com> - 2014-05-12 17:47 +0000
      Re: Everything you did not want to know about Unicode in Python 3 Ian Kelly <ian.g.kelly@gmail.com> - 2014-05-12 12:31 -0600
      Re: Everything you did not want to know about Unicode in Python 3 MRAB <python@mrabarnett.plus.com> - 2014-05-12 20:42 +0100
      Re: Everything you did not want to know about Unicode in Python 3 Ian Kelly <ian.g.kelly@gmail.com> - 2014-05-12 16:16 -0600
      Re: Everything you did not want to know about Unicode in Python 3 Chris Angelico <rosuav@gmail.com> - 2014-05-13 09:42 +1000
      Re: Everything you did not want to know about Unicode in Python 3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-13 01:18 +0000
        Re: Everything you did not want to know about Unicode in Python 3 Chris Angelico <rosuav@gmail.com> - 2014-05-13 11:39 +1000
          Re: Everything you did not want to know about Unicode in Python 3 alex23 <wuwei23@gmail.com> - 2014-05-13 16:25 +1000
            Re: Everything you did not want to know about Unicode in Python 3 Chris Angelico <rosuav@gmail.com> - 2014-05-13 16:32 +1000
        Re: Everything you did not want to know about Unicode in Python 3 Mark H Harris <harrismh777@gmail.com> - 2014-05-12 20:58 -0500
        Re: Everything you did not want to know about Unicode in Python 3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-05-13 03:33 +0100
        Re: Everything you did not want to know about Unicode in Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-05-12 22:10 -0700
          Re: Everything you did not want to know about Unicode in Python 3 Mark H Harris <harrismh777@gmail.com> - 2014-05-13 00:39 -0500
            Re: Everything you did not want to know about Unicode in Python 3 Gene Heskett <gheskett@wdtv.com> - 2014-05-13 01:45 -0400
            Re: Everything you did not want to know about Unicode in Python 3 Ben Finney <ben@benfinney.id.au> - 2014-05-13 16:03 +1000
            Re: Everything you did not want to know about Unicode in Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-05-12 23:09 -0700
            Re: Everything you did not want to know about Unicode in Python 3 Chris Angelico <rosuav@gmail.com> - 2014-05-13 16:18 +1000
              Re: Everything you did not want to know about Unicode in Python 3 Mark H Harris <harrismh777@gmail.com> - 2014-05-13 01:32 -0500
              Re: Everything you did not want to know about Unicode in Python 3 Mark H Harris <harrismh777@gmail.com> - 2014-05-13 01:32 -0500
              Re: Everything you did not want to know about Unicode in Python 3 Roy Smith <roy@panix.com> - 2014-05-13 07:20 -0400
                Re: Everything you did not want to know about Unicode in Python 3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-13 13:39 +0000
                  Re: Everything you did not want to know about Unicode in Python 3 Chris Angelico <rosuav@gmail.com> - 2014-05-13 23:43 +1000
                    Re: Everything you did not want to know about Unicode in Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-05-13 07:30 -0700
                      Re: Everything you did not want to know about Unicode in Python 3 Chris Angelico <rosuav@gmail.com> - 2014-05-14 00:36 +1000
                  Re: Everything you did not want to know about Unicode in Python 3 Grant Edwards <invalid@invalid.invalid> - 2014-05-13 13:51 +0000
                    Re: Everything you did not want to know about Unicode in Python 3 alister <alister.nospam.ware@ntlworld.com> - 2014-05-13 14:42 +0000
                      Re: Everything you did not want to know about Unicode in Python 3 Grant Edwards <invalid@invalid.invalid> - 2014-05-13 15:21 +0000
                      Re: Everything you did not want to know about Unicode in Python 3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-13 23:53 +0000
                        Re: Everything you did not want to know about Unicode in Python 3 Chris Angelico <rosuav@gmail.com> - 2014-05-14 10:08 +1000
                          Re: Everything you did not want to know about Unicode in Python 3 alister <alister.nospam.ware@ntlworld.com> - 2014-05-14 12:42 +0000
                            Re: Everything you did not want to know about Unicode in Python 3 Chris Angelico <rosuav@gmail.com> - 2014-05-14 22:52 +1000
                            Re: Everything you did not want to know about Unicode in Python 3 Grant Edwards <invalid@invalid.invalid> - 2014-05-16 14:46 +0000
                              Re: Everything you did not want to know about Unicode in Python 3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-17 01:07 +0000
                                Re: Everything you did not want to know about Unicode in Python 3 Marko Rauhamaa <marko@pacujo.net> - 2014-05-17 07:19 +0300
                                  Re: Everything you did not want to know about Unicode in Python 3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-05-17 09:35 +0100
                                  Re: Everything you did not want to know about Unicode in Python 3 Robert Kern <robert.kern@gmail.com> - 2014-05-17 10:29 +0100
                                    Re: Everything you did not want to know about Unicode in Python 3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-17 14:15 +0000
                                      Re: Everything you did not want to know about Unicode in Python 3 Robert Kern <robert.kern@gmail.com> - 2014-05-17 22:01 +0100
                                Re: Everything you did not want to know about Unicode in Python 3 Robert Kern <robert.kern@gmail.com> - 2014-05-17 09:57 +0100
                                  Re: Everything you did not want to know about Unicode in Python 3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-17 12:07 +0000
                                    Re: Everything you did not want to know about Unicode in Python 3 Robert Kern <robert.kern@gmail.com> - 2014-05-17 22:07 +0100
                                Re: Everything you did not want to know about Unicode in Python 3 Chris Angelico <rosuav@gmail.com> - 2014-05-17 19:18 +1000
                                Re: Everything you did not want to know about Unicode in Python 3 Ben Finney <ben@benfinney.id.au> - 2014-05-17 21:05 +1000
                        [OT] Copyright statements and why they can be useful (was: Everything you did not want to know about Unicode in Python 3) Ben Finney <ben@benfinney.id.au> - 2014-05-14 11:01 +1000
                        Re: Everything you did not want to know about Unicode in Python 3 Ian Kelly <ian.g.kelly@gmail.com> - 2014-05-14 09:07 -0600
                  Re: Everything you did not want to know about Unicode in Python 3 Dave Angel <davea@davea.name> - 2014-05-13 21:56 -0400
              Re: Everything you did not want to know about Unicode in Python 3 Grant Edwards <invalid@invalid.invalid> - 2014-05-13 13:49 +0000
        Re: Everything you did not want to know about Unicode in Python 3 gregor <gregor@ediwo.com> - 2014-05-13 09:27 +0200
        Re: Everything you did not want to know about Unicode in Python 3 Johannes Bauer <dfnsonfsduifb@gmx.de> - 2014-05-13 10:08 +0200
          Re: Everything you did not want to know about Unicode in Python 3 Marko Rauhamaa <marko@pacujo.net> - 2014-05-13 11:25 +0300
            Re: Everything you did not want to know about Unicode in Python 3 Chris Angelico <rosuav@gmail.com> - 2014-05-13 18:38 +1000
              Re: Everything you did not want to know about Unicode in Python 3 Marko Rauhamaa <marko@pacujo.net> - 2014-05-13 12:06 +0300
                Re: Everything you did not want to know about Unicode in Python 3 Chris Angelico <rosuav@gmail.com> - 2014-05-13 19:29 +1000
                Re: Everything you did not want to know about Unicode in Python 3 Steven D'Aprano <steve@pearwood.info> - 2014-05-13 09:44 +0000
              Re: Everything you did not want to know about Unicode in Python 3 Johannes Bauer <dfnsonfsduifb@gmx.de> - 2014-05-13 11:38 +0200
            Re: Everything you did not want to know about Unicode in Python 3 Johannes Bauer <dfnsonfsduifb@gmx.de> - 2014-05-13 11:46 +0200
              Re: Everything you did not want to know about Unicode in Python 3 Marko Rauhamaa <marko@pacujo.net> - 2014-05-13 12:59 +0300
            Re: Everything you did not want to know about Unicode in Python 3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-05-13 14:30 +0100
            Re: Everything you did not want to know about Unicode in Python 3 Chris Angelico <rosuav@gmail.com> - 2014-05-13 23:37 +1000
            Re: Everything you did not want to know about Unicode in Python 3 Skip Montanaro <skip@pobox.com> - 2014-05-13 09:02 -0500
          Re: Everything you did not want to know about Unicode in Python 3 wxjmfauth@gmail.com - 2014-05-14 00:00 -0700
        Re: Everything you did not want to know about Unicode in Python 3 alister <alister.nospam.ware@ntlworld.com> - 2014-05-13 11:19 +0000
          Re: Everything you did not want to know about Unicode in Python 3 Ian Kelly <ian.g.kelly@gmail.com> - 2014-05-13 10:08 -0600
            Re: Everything you did not want to know about Unicode in Python 3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-14 00:10 +0000
              Re: Everything you did not want to know about Unicode in Python 3 Ethan Furman <ethan@stoneleaf.us> - 2014-05-13 17:53 -0700
              Re: Everything you did not want to know about Unicode in Python 3 Terry Reedy <tjreedy@udel.edu> - 2014-05-14 17:47 -0400
              Re: Everything you did not want to know about Unicode in Python 3 Antoine Pitrou <antoine@python.org> - 2014-05-16 11:50 +0000
                Re: Everything you did not want to know about Unicode in Python 3 wxjmfauth@gmail.com - 2014-05-16 06:20 -0700
            Re: Everything you did not want to know about Unicode in Python 3 alister <alister.nospam.ware@ntlworld.com> - 2014-05-14 12:38 +0000
          Re: Everything you did not want to know about Unicode in Python 3 Robin Becker <robin@reportlab.com> - 2014-05-14 16:30 +0100
          Re: Everything you did not want to know about Unicode in Python 3 Ian Kelly <ian.g.kelly@gmail.com> - 2014-05-14 09:56 -0600

#71679

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-05-17 12:07 +0000
Message-ID	<537750fc$0$29977$c3e8da3$5496439d@news.astraweb.com>
In reply to	#71673

On Sat, 17 May 2014 09:57:06 +0100, Robert Kern wrote:

> On 2014-05-17 02:07, Steven D'Aprano wrote:
>> On Fri, 16 May 2014 14:46:23 +0000, Grant Edwards wrote:
>>
>>> At least in the US, there doesn't seem to be such a thing as "placing
>>> a work into the public domain".  The copyright holder can transfer
>>> ownership to somebody else, but there is no "public domain" to which
>>> ownership can be transferred.
>>
>> That's factually incorrect. In the US, sufficiently old works, or works
>> of a certain age that were not explicitly registered for copyright, are
>> in the public domain. Under a wide range of circumstances, works
>> created by the federal government go immediately into the public
>> domain.
> 
> There is such a thing as the public domain in the US, and there are
> works in it, but there isn't really such a thing as "placing a work"
> there voluntarily, as Grant says. A work either is or isn't in the
> public domain. The author has no choice in the matter.

That's incorrect.

http://cr.yp.to/publicdomain.html

Here's the money quote, from the 9th Circuit Court:

    It is well settled that rights gained under the Copyright Act 
    may be abandoned. But abandonment of a right must be manifested
    by some overt act indicating an intention to abandon that right.


There's also this:

http://creativecommons.org/publicdomain/zero/1.0/

which counts as an overt act.


By the way, there's more info on US copyright terms here:

http://copyright.cornell.edu/resources/publicdomain.cfm

although it doesn't specifically mention voluntary abandonment of 
copyright.



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/

#71707

From	Robert Kern <robert.kern@gmail.com>
Date	2014-05-17 22:07 +0100
Message-ID	<mailman.10100.1400360883.18130.python-list@python.org>
In reply to	#71679

On 2014-05-17 13:07, Steven D'Aprano wrote:
> On Sat, 17 May 2014 09:57:06 +0100, Robert Kern wrote:
>
>> On 2014-05-17 02:07, Steven D'Aprano wrote:
>>> On Fri, 16 May 2014 14:46:23 +0000, Grant Edwards wrote:
>>>
>>>> At least in the US, there doesn't seem to be such a thing as "placing
>>>> a work into the public domain".  The copyright holder can transfer
>>>> ownership to somebody else, but there is no "public domain" to which
>>>> ownership can be transferred.
>>>
>>> That's factually incorrect. In the US, sufficiently old works, or works
>>> of a certain age that were not explicitly registered for copyright, are
>>> in the public domain. Under a wide range of circumstances, works
>>> created by the federal government go immediately into the public
>>> domain.
>>
>> There is such a thing as the public domain in the US, and there are
>> works in it, but there isn't really such a thing as "placing a work"
>> there voluntarily, as Grant says. A work either is or isn't in the
>> public domain. The author has no choice in the matter.
>
> That's incorrect.
>
> http://cr.yp.to/publicdomain.html

Thanks for the link. While it has not really changed my opinion (as discussed at 
length in my other reply), I did not know that the 9th Circuit had formalized 
the "overt act" test in their civil procedure rules, so there is at least one 
jurisdiction in the US that does currently work like this. None of the others 
do, to my knowledge, and this is the product of judicial common law, not 
statutory law, so it's still pretty shaky.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco

#71674

From	Chris Angelico <rosuav@gmail.com>
Date	2014-05-17 19:18 +1000
Message-ID	<mailman.10077.1400318344.18130.python-list@python.org>
In reply to	#71669

On Sat, May 17, 2014 at 6:57 PM, Robert Kern <robert.kern@gmail.com> wrote:
> There is such a thing as the public domain in the US, and there are works in
> it, but there isn't really such a thing as "placing a work" there
> voluntarily, as Grant says. A work either is or isn't in the public domain.
> The author has no choice in the matter.

Then what's copyright status on PEPs?

The nearest thing to "assigning to public domain" that works across
legislatures is probably CC0:

http://creativecommons.org/about/cc0

ChrisA

#71677

From	Ben Finney <ben@benfinney.id.au>
Date	2014-05-17 21:05 +1000
Message-ID	<mailman.10080.1400324748.18130.python-list@python.org>
In reply to	#71669

Chris Angelico <rosuav@gmail.com> writes:

> On Sat, May 17, 2014 at 6:57 PM, Robert Kern <robert.kern@gmail.com> wrote:
> > There is such a thing as the public domain in the US, and there are works in
> > it, but there isn't really such a thing as "placing a work" there
> > voluntarily, as Grant says. A work either is or isn't in the public domain.
> > The author has no choice in the matter.
>
> Then what's copyright status on PEPs?

My guess: They are in the default copyright status, with all rights
reserved (i.e. everything that copyright law restricts, is forbidden to
the recipient).

But, if any of those copyright holders were ever to assert their
copyright had been infringed by some recipient, the “this work is in the
public domain” or equivalent would be taken as a clear indication of the
*intent* of the copyright holder.

Ultimately, what matters is the determination of whatever judge you find
yourself facing. To that end, clarifying in the copyright statement and
license terms exactly what is permitted can be immensely helpful in
foreshortening and, ideally, avoiding a future copyright suit.

Copyright is a ridiculous burden on everyone — to the extent that even
those copyright holders who don't *want* those rights which the law
reserves to the copyright holder, and want to divest themselves of the
role of copyright holder, find it frustratingly difficult to do so
effectively across jurisdictions.

-- 
 \          “Computer perspective on Moore's Law: Human effort becomes |
  `\           twice as expensive roughly every two years.” —anonymous |
_o__)                                                                  |
Ben Finney

#71523 — [OT] Copyright statements and why they can be useful (was: Everything you did not want to know about Unicode in Python 3)

From	Ben Finney <ben@benfinney.id.au>
Date	2014-05-14 11:01 +1000
Subject	[OT] Copyright statements and why they can be useful (was: Everything you did not want to know about Unicode in Python 3)
Message-ID	<mailman.9985.1400029305.18130.python-list@python.org>
In reply to	#71515

Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:

> On Tue, 13 May 2014 14:42:51 +0000, alister wrote:
>
> > You do not need any statements at all; copyright is automatically
> > assigned to anything you create (at least that is the case in UK
> > law), although proving the creation date may be difficult.
>
> (1) In my lifetime, that wasn't always the case. Up until the 1970s or 
> thereabouts, you had to explicitly register anything you wanted 
> copyrighted […]

> (2) You don't have to just prove copyright. You also have to *identify* 
> who the work is copyrighted by, and it needs to be an identifiable legal 
> person (actual person or corporation), not necessarily the author. […]

(3) In all jurisdictions where copyright exists, the copyright holder
nominally has monopoly on the work for only a fixed term, starting from
the date of publication. To know when the copyright will expire, it's
essential to know the date from which copyright starts; this is best
done explicitly in the copyright statement.

I say “nominally”, because another alarming and unilateral trend is to
dramatically extend the nominally fixed term, and to strong-arm national
governments with trade deals to maximise the copyright term around the
world.

The effect, as Lawrence Lessig points out:

    The meaning of this pattern is absolutely clear to those who pay to
    produce it. The meaning is: No one can do to the Disney Corporation
    what Walt Disney did to the Brothers Grimm. That though we had a
    culture where people could take and build upon what went before,
    that's over. There is no such thing as the public domain in the
    minds of those who have produced these 11 extensions these last 40
    years because now culture is owned.

    <URL:http://www.oreillynet.com/pub/a/policy/2002/08/15/lessig.html>

Or, less poetically, since the term of copyright is only nominally
fixed, and in practice just keeps getting extended by newly-lobbied
legislation every twenty years or so, the copyright maximalists have
de facto instituted “perpetual copyright on the installment plan”
<URL:https://en.wikipedia.org/wiki/Perpetual_copyright>.

Nevertheless, copyright on works created this century will in principle
expire at some date in the future; and to know when that date will be,
we need to know when the copyright began. Hence the need for explicit
copyright statements saying the date of publication.

<URL:http://questioncopyright.org/>

-- 
 \            “[T]he great menace to progress is not ignorance but the |
  `\           illusion of knowledge.” —Daniel J. Boorstin, historian, |
_o__)                                                        1914–2004 |
Ben Finney

#71564

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2014-05-14 09:07 -0600
Message-ID	<mailman.10009.1400080047.18130.python-list@python.org>
In reply to	#71515

On May 13, 2014 6:10 PM, "Chris Angelico" <rosuav@gmail.com> wrote:
>
> On Wed, May 14, 2014 at 9:53 AM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
> > With the current system, all of us here are technically violating
> > copyright every time we reply to an email and quote more than a small
> > percentage of it.
>
> Oh wow... so when someone quotes heaps of text without trimming, and
> adding blank lines, we can complain that it's a copyright violation -
> reproducing our work with unauthorized modifications and without
> permission...
>
> I never thought of it like that.

I'd be surprised if this doesn't fall under fair use.

#71553

From	Dave Angel <davea@davea.name>
Date	2014-05-13 21:56 -0400
Message-ID	<mailman.10003.1400072207.18130.python-list@python.org>
In reply to	#71485

On 05/13/2014 09:39 AM, Steven D'Aprano wrote:
> On Tue, 13 May 2014 07:20:34 -0400, Roy Smith wrote:
>
>> ASCII *is* all I need.
>
> You've never needed to copyright something? Copyright © Roy Smith 2014...
> I know some people use (c) instead, but that actually has no legal
> standing. (Not that any reasonable judge would invalidate a copyright
> based on a technicality like that, not these days.)


(c) has no standing whatsoever, as it's properly spelled (copr)


-- 
DaveA

#71490

From	Grant Edwards <invalid@invalid.invalid>
Date	2014-05-13 13:49 +0000
Message-ID	<lkt7tn$ncv$1@reader1.panix.com>
In reply to	#71437

On 2014-05-13, Chris Angelico <rosuav@gmail.com> wrote:
> On Tue, May 13, 2014 at 4:03 PM, Ben Finney <ben@benfinney.id.au> wrote:
>> (It's always a good day to remind people that the rest of the world
>> exists.)
>
> Ironic that this should come up in a discussion on Unicode, given that
> Unicode's fundamental purpose is to welcome that whole rest of the
> world instead of yelling "LALALALALA America is everything" and
> pretending that ASCII, or Latin-1, or something, is all you need.

Well, strictly speaking, ASCII or Latin-1 _is_ all I need.

I will however admit to the existence of other people who might need
something else...

-- 
Grant Edwards               grant.b.edwards        Yow! How many retired
                                  at               bricklayers from FLORIDA
                              gmail.com            are out purchasing PENCIL
                                                   SHARPENERS right NOW??

#71447

From	gregor <gregor@ediwo.com>
Date	2014-05-13 09:27 +0200
Message-ID	<20140513092722.444c5a77@florenz>
In reply to	#71416

On 13 May 2014 01:18:35 GMT,
Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> 
> - have a simple way to write bytes to stdout and stderr.

there is the underlying binary buffer:

https://docs.python.org/3/library/sys.html#sys.stdin
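
As a rough sketch of how that buffer can be used: the helper name write_raw below is made up for illustration, and only sys.stdout, its .buffer attribute, and the io classes come from the standard library. It assumes the stream is a regular text wrapper that actually has a .buffer, which is not guaranteed if sys.stdout has been replaced by something else.

```python
import io
import sys


def write_raw(data, stream=None):
    """Write bytes to the binary buffer underneath a text stream.

    write_raw is a hypothetical helper, not part of the standard library.
    """
    if stream is None:
        stream = sys.stdout
    # Flush pending text first so text and bytes come out in order.
    stream.flush()
    stream.buffer.write(data)
    stream.buffer.flush()


# Demonstration against an in-memory stream instead of the real stdout.
raw = io.BytesIO()
text_stream = io.TextIOWrapper(raw, encoding="utf-8")
text_stream.write("abc")
write_raw(b"\xff\xfe", text_stream)
captured = raw.getvalue()
```

Calling write_raw(b"...") with no stream argument would go to the real standard output; sys.stderr can be passed the same way.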

greg

#71449

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2014-05-13 10:08 +0200
Message-ID	<lksju6$amr$1@news.albasani.net>
In reply to	#71416

On 13.05.2014 03:18, Steven D'Aprano wrote:

> Armin Ronacher is an extremely experienced and knowledgeable Python 
> developer, and a Python core developer. He might be wrong, but he's not 
> *obviously* wrong.

He's correct about file name encodings. Which can be fixed really easily
without messing everything up (sys.argv binary variant, open accepting
binary filenames). But that he suggests that Go would be superior:

> Which uses an even simpler model than Python 2: everything is a byte string. The assumed encoding is UTF-8. End of the story.

Is just a horrible idea. An obviously horrible idea, too.

Having dealt with the UTF-8 problems on Python2 I can safely say that I
never, never ever want to go back to that freaky hell. If I deal with
strings, I want to be able to sanely manipulate them and I want to be
sure that after manipulation they're still valid strings. Manipulating
the bytes representation of unicode data just doesn't work.
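
A small illustration of that last point, assuming the data is UTF-8 encoded (the sample word is arbitrary). Slicing the text gives back valid text, while slicing the encoded bytes at the same index cuts a multi-byte sequence in half:

```python
text = "naïve"
data = text.encode("utf-8")  # "ï" becomes two bytes, so data is longer than text

# Slicing the str always cuts between code points.
text_prefix = text[:3]

# Slicing the bytes can cut through the middle of an encoded character.
data_prefix = data[:3]

try:
    data_prefix.decode("utf-8")
    prefix_is_valid = True
except UnicodeDecodeError:
    # The truncated sequence is no longer valid UTF-8.
    prefix_is_valid = False
```

Here the str slice is still a usable string, whereas the bytes slice can no longer be decoded.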

And I'm very very glad that some people felt the same way and
implemented a sane, consistent way of dealing with Unicode in Python3.
It's one of the reasons why I switched to Py3 very early and I love it.

Cheers,
Johannes

-- 
>> Where EXACTLY did you predict the quake again?
> At least not publicly!
Ah, the latest and, to this day, most brilliant stroke of our great
cosmologist: the secret prediction.
 - Karl Kaos on Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

#71450

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-05-13 11:25 +0300
Message-ID	<87tx8uccgd.fsf@elektro.pacujo.net>
In reply to	#71449

Johannes Bauer <dfnsonfsduifb@gmx.de>:

> Having dealt with the UTF-8 problems on Python2 I can safely say that
> I never, never ever want to go back to that freaky hell. If I deal
> with strings, I want to be able to sanely manipulate them and I want
> to be sure that after manipulation they're still valid strings.
> Manipulating the bytes representation of unicode data just doesn't
> work.

Based on my background (network and system programming), I'm a bit
suspicious of strings, that is, text. For example, is the stuff that
goes to syslog bytes or text? Does an XML file contain bytes or
(encoded) text? The answers are not obvious to me. Modern computing is
full of ASCII-esque binary communication standards and formats.

Python 2's ambiguity allows me not to answer the tough philosophical
questions. I'm not saying it's necessarily a good thing, but it has its
benefits.


Marko

#71451

From	Chris Angelico <rosuav@gmail.com>
Date	2014-05-13 18:38 +1000
Message-ID	<mailman.9951.1399970292.18130.python-list@python.org>
In reply to	#71450

On Tue, May 13, 2014 at 6:25 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Johannes Bauer <dfnsonfsduifb@gmx.de>:
>
>> Having dealt with the UTF-8 problems on Python2 I can safely say that
>> I never, never ever want to go back to that freaky hell. If I deal
>> with strings, I want to be able to sanely manipulate them and I want
>> to be sure that after manipulation they're still valid strings.
>> Manipulating the bytes representation of unicode data just doesn't
>> work.
>
> Based on my background (network and system programming), I'm a bit
> suspicious of strings, that is, text. For example, is the stuff that
> goes to syslog bytes or text? Does an XML file contain bytes or
> (encoded) text? The answers are not obvious to me. Modern computing is
> full of ASCII-esque binary communication standards and formats.

These are problems that Unicode can't solve. In theory, XML should
contain text in a known encoding (defaulting to UTF-8). With syslog,
it's problematic - I don't remember what it's meant to be, but I know
there are issues. Same with other log files.

> Python 2's ambiguity allows me not to answer the tough philosophical
> questions. I'm not saying it's necessarily a good thing, but it has its
> benefits.

It's not a good thing. It means that you have the convenience of
pretending there's no problem, which means you don't notice trouble
until something happens... and then, in all probability, your app is
in production and you have no idea why stuff went wrong.

ChrisA

#71453

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-05-13 12:06 +0300
Message-ID	<87ppjicaj9.fsf@elektro.pacujo.net>
In reply to	#71451

Chris Angelico <rosuav@gmail.com>:

> These are problems that Unicode can't solve.

I actually think the problem has little to do with Unicode. Text is an
abstract data type just like any class. If I have an object (say, a
subprocess or a dictionary) in memory, I don't expect the object to have
any existence independently of the Python virtual machine. I have the
same feeling about Py3 strings: they only exist inside the Python
virtual machine.

An abstract object like a subprocess or dictionary justifies its
existence through its behaviour (its quacking). Now, do strings quack or
are they silent? I guess if you are writing a word processor they might
quack to you. Otherwise, they are just an esoteric storage format.

What I'm saying is that strings definitely have an important application
in the human interface. However, I feel strings might be overused in the
Py3 API. Case in point: are pathnames bytes objects or strings? The
Linux position is that they are bytes objects. Py3 supports both
interpretations seemingly throughout:

   open(b"/bin/ls")    vs    open("/bin/ls")
   os.path.join(b"a", b"b")    vs    os.path.join("a", "b")
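
A minimal sketch of how the two interpretations behave side by side, using only os.path.join from the standard library. The variable names are just for illustration. Each call returns the same type it was given, and the two types cannot be mixed in one call:

```python
import os.path

# Text in, text out.
str_path = os.path.join("a", "b")

# Bytes in, bytes out.
bytes_path = os.path.join(b"a", b"b")

# Mixing the two interpretations in a single call is rejected.
try:
    os.path.join("a", b"b")
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

So both interpretations are supported, but a program has to pick one per call.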


Marko

#71455

From	Chris Angelico <rosuav@gmail.com>
Date	2014-05-13 19:29 +1000
Message-ID	<mailman.9954.1399973382.18130.python-list@python.org>
In reply to	#71453

On Tue, May 13, 2014 at 7:06 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>> These are problems that Unicode can't solve.
>
> I actually think the problem has little to do with Unicode. Text is an
> abstract data type just like any class. If I have an object (say, a
> subprocess or a dictionary) in memory, I don't expect the object to have
> any existence independently of the Python virtual machine. I have the
> same feeling about Py3 strings: they only exist inside the Python
> virtual machine.

That's true; the only difference is that text is extremely prevalent.
You can share a dict with another program, or store it in a file, or
whatever, simply by agreeing on an encoding - for instance, JSON. As
long as you and the other program know that this file is JSON encoded,
you can write it and he can read it, and you'll get the right data at
the far end. It's no different; there are encodings that are easy to
handle and have limitations, and there are encodings that are
elaborate and have lots of features (XML comes to mind, although
technically you can't encode a dict in XML).
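
To make the dict case concrete, here is a rough sketch assuming both sides have agreed on JSON serialized as UTF-8. The function names dict_to_bytes and bytes_to_dict are invented for this example; json.dumps and json.loads are the standard library calls.

```python
import json


def dict_to_bytes(data):
    """Serialize a dict to bytes: dict -> JSON text -> UTF-8 bytes."""
    return json.dumps(data, ensure_ascii=False).encode("utf-8")


def bytes_to_dict(payload):
    """Reverse the process: UTF-8 bytes -> JSON text -> dict."""
    return json.loads(payload.decode("utf-8"))


original = {"name": "Zoë", "count": 3}
payload = dict_to_bytes(original)   # what goes into the file or socket
restored = bytes_to_dict(payload)   # what the other program gets back
```

The round trip only works because both ends know the two layers involved (JSON, then UTF-8), which is the same agreement that text on its own needs.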

> Case in point: are pathnames bytes objects or strings? The
> Linux position is that they are bytes objects. Py3 supports both
> interpretations seemingly throughout:
>
>    open(b"/bin/ls")    vs    open("/bin/ls")
>    os.path.join(b"a", b"b")    vs    os.path.join("a", "b")

That's a problem that comes from the underlying file systems. If every
FS in the world worked with Unicode file names, it would be easy.
(Most would encode them onto the platters in UTF-8 or maybe UTF-16;
some might choose to use a PEP 393 or Pike string structure, with the
size_shift being a file mode just like the 'directory' bit; others
might use a limited encoding for legacy reasons, storing uppercased
CP437 on the disk, and returning an error if the desired name didn't
fit.) But since they don't, we have to cope with that. What happens if
you're running on Linux, and you have a mounted drive from an OS/2
share, and inside that, you access an aliased drive that represents a
Windows share, on which you've mounted a remote-backup share? A single
path name could have components parsed by each of those systems, so
what's its encoding? How do you handle that? There's no solution.
(Well, okay. There is a solution: don't do something so stupidly
convoluted. But there's no law against cackling admins making circular
mounts. In fact, I just mounted my own home directory as a
subdirectory under my home directory, via sshfs. I can now encrypt my
own file reads and writes exactly as many times as I choose to. I also
cackled.)

ChrisA


#71460

From	Steven D'Aprano <steve@pearwood.info>
Date	2014-05-13 09:44 +0000
Message-ID	<5371e97b$0$11109$c3e8da3@news.astraweb.com>
In reply to	#71453

On Tue, 13 May 2014 12:06:50 +0300, Marko Rauhamaa wrote:

> Chris Angelico <rosuav@gmail.com>:
> 
>> These are problems that Unicode can't solve.
> 
> I actually think the problem has little to do with Unicode. Text is an
> abstract data type just like any class. If I have an object (say, a
> subprocess or a dictionary) in memory, I don't expect the object to have
> any existence independently of the Python virtual machine. I have the
> same feeling about Py3 strings: they only exist inside the Python
> virtual machine.

And you would be correct. When you write them to a device (say, push them 
over a network, or write them to a file) they need to be serialized. If 
you're lucky, you have an API that takes a string and serializes it for 
you, and then all you have to deal with is:

- am I happy with the default encoding?

- if not, what encoding do I want?

Otherwise you ought to have an API that requires bytes, not strings, and 
you have to perform your own serialization by encoding it.

But abstractions leak, and this abstraction leaks because *right now* 
there isn't a single serialization for text strings. There are HUNDREDS, 
and sometimes you don't know which one is being used.
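
As a rough illustration of why the abstraction leaks, the sketch below serializes one string three ways and then decodes the bytes with the wrong codec. The sample text is arbitrary.

```python
text = "na\u00efve caf\u00e9"  # 10 characters, two of them non-ASCII

# Three of the many possible serializations of the same text.
utf8 = text.encode("utf-8")
latin1 = text.encode("latin-1")
utf16 = text.encode("utf-16-le")

print(len(text), len(utf8), len(latin1), len(utf16))

# Decoding with the wrong encoding does not fail here; it silently
# produces different text (mojibake).
misread = utf8.decode("latin-1")
print(misread == text)
```

Nothing in the bytes themselves says which codec produced them, so the reader has to know (or guess) the encoding out of band.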

[...]
> What I'm saying is that strings definitely have an important application
> in the human interface. However, I feel strings might be overused in the
> Py3 API. Case in point: are pathnames bytes objects or strings?

Yes. On POSIX systems, file names are sequences of bytes, with a very few 
restrictions. On recent Windows file systems (NTFS I believe?), file 
names are Unicode strings encoded to UTF-16, but with a whole lot of 
other restrictions imposed by the OS.

> The
> linux position is that they are bytes objects. Py3 supports both
> interpretations seemingly throughout:
> 
>    open(b"/bin/ls")    vs    open("/bin/ls")
>    os.path.join(b"a", b"b")    vs    os.path.join("a", "b")

Because it has to, otherwise there will be files that are unreachable on 
one platform or another.
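
A small sketch of how both interpretations coexist. `os.path.join` keeps whichever type it is given, and the `surrogateescape` error handler (PEP 383) lets a byte string that is not valid UTF-8 pass through a `str` and come back unchanged. The file name below is invented; `os.fsencode()` and `os.fsdecode()` apply the same idea using the platform's file system encoding.

```python
import os

# os.path.join returns the same type as its arguments.
as_bytes = os.path.join(b"a", b"b")
as_text = os.path.join("a", "b")
print(type(as_bytes).__name__, type(as_text).__name__)

# A POSIX file name can be arbitrary bytes; this one is Latin-1,
# which is not valid UTF-8.
raw_name = b"caf\xe9.txt"

# surrogateescape maps the undecodable byte to a lone surrogate...
text_name = raw_name.decode("utf-8", "surrogateescape")

# ...and encoding with the same handler restores the original bytes.
round_tripped = text_name.encode("utf-8", "surrogateescape")
print(round_tripped == raw_name)
```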

-- 
Steven


#71458

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2014-05-13 11:38 +0200
Message-ID	<lksp6o$lmf$1@news.albasani.net>
In reply to	#71451

On 13.05.2014 10:38, Chris Angelico wrote:

>> Python 2's ambiguity allows me not to answer the tough philosophical
>> questions. I'm not saying it's necessarily a good thing, but it has its
>> benefits.
> 
> It's not a good thing. It means that you have the convenience of
> pretending there's no problem, which means you don't notice trouble
> until something happens... and then, in all probability, your app is
> in production and you have no idea why stuff went wrong.

Exactly. With Py2 "strings" you never know what encoding they are in, or
whether they have already been converted. And it's quite possible to mix
already converted strings with other, not yet converted strings. What a
mess!

All these issues are avoided by Py3. There is a very clear distinction
between strings and string representation (data bytes), which is
beautiful. Accidental mixing is not possible. And you have some things
*guaranteed* for the string type which aren't guaranteed for the bytes
type (for example when doing string manipulation).
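
To make the two claims concrete, here is a short sketch under Python 3: mixing `str` and `bytes` raises an error, and some operations behave predictably on `str` in a way they do not on `bytes`. The sample word is arbitrary.

```python
text = "caf\u00e9"            # str: a sequence of code points
data = text.encode("utf-8")   # bytes: one particular serialization

# Mixing the two types fails loudly instead of guessing an encoding.
try:
    text + data
except TypeError as exc:
    mixing_error = exc
else:
    mixing_error = None
print(type(mixing_error).__name__)

# Guarantees that hold for str but not for bytes:
char_count = len(text)      # counts characters
byte_count = len(data)      # counts bytes, so it depends on the encoding
upper_text = text.upper()   # case mapping covers non-ASCII letters
upper_data = data.upper()   # only ASCII bytes are changed
print(char_count, byte_count)
```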

Regards,
Johannes

-- 
>> Where EXACTLY had you predicted the quake again?
> At least not publicly!
Ah, the latest and to date most ingenious stroke of our great
cosmologists: the secret prediction.
 - Karl Kaos on Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>


#71461

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2014-05-13 11:46 +0200
Message-ID	<lkspl2$mlc$1@news.albasani.net>
In reply to	#71450

On 13.05.2014 10:25, Marko Rauhamaa wrote:

> Based on my background (network and system programming), I'm a bit
> suspicious of strings, that is, text. For example, is the stuff that
> goes to syslog bytes or text? Does an XML file contain bytes or
> (encoded) text? The answers are not obvious to me. Modern computing is
> full of ASCII-esque binary communication standards and formats.

Traditional Unix programs (syslog for example) are notorious for being
unclear, ambiguous and/or ignorant of character encodings altogether. And
this works, unfortunately, most of the time because many encodings
share a common subset. If they didn't, the problems would be VERY
apparent and people would be forced to handle the issues less sloppily.
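
The "common subset" is ASCII. A quick sketch, with made-up sample strings, of why sloppy handling usually goes unnoticed until non-ASCII data shows up:

```python
encodings = ["ascii", "utf-8", "latin-1", "cp437"]

# Pure ASCII text serializes to identical bytes in all of these encodings.
ascii_text = "kernel: link up"
ascii_forms = {ascii_text.encode(enc) for enc in encodings}
print(len(ascii_forms))

# Once non-ASCII characters appear, every encoding produces different bytes.
non_ascii = "Gr\u00f6\u00dfe"
forms = {enc: non_ascii.encode(enc) for enc in ["utf-8", "latin-1", "cp437"]}
for enc, raw in forms.items():
    print(enc, raw)
```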

Which is the route that Py3 chose. Don't be sloppy, make a clear
distinction between "text" (which is handled naturally as strings) and its
respective encoding.

The only people who are angered by this now are people who always treated
encodings sloppily and it "just worked". Well, there's a good chance it
has worked by pure chance so far. It's a good thing that Python does
this now more strictly as it gives developers *guarantees* about what
they can and cannot do with text datatypes without having to deal with
encoding issues in many places. Just one place: The interface where text
is read or written, just as it should be.
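
One way to picture "just one place" is the sketch below: decode once on the way in, work on `str` in the middle, encode once on the way out. The in-memory streams and the choice of Latin-1 input and UTF-8 output are assumptions for the example, standing in for a real file or socket.

```python
import io

# Hypothetical input that arrives as Latin-1 bytes.
incoming = io.BytesIO("stra\u00dfe\n".encode("latin-1"))
outgoing = io.BytesIO()

# Boundary in: bytes -> str, with the encoding stated once.
text = incoming.read().decode("latin-1")

# Interior: pure text processing, no encodings involved.
result = text.strip().upper()

# Boundary out: str -> bytes, again with the encoding stated once.
outgoing.write(result.encode("utf-8"))

print(result)
print(outgoing.getvalue())
```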

Regards,
Johannes



#71463

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-05-13 12:59 +0300
Message-ID	<87iopac83j.fsf@elektro.pacujo.net>
In reply to	#71461

Johannes Bauer <dfnsonfsduifb@gmx.de>:

> The only people who are angered by this now are people who always
> treated encodings sloppily and it "just worked". Well, there's a good
> chance it has worked by pure chance so far. It's a good thing that
> Python does this now more strictly as it gives developers *guarantees*
> about what they can and cannot do with text datatypes without having
> to deal with encoding issues in many places. Just one place: The
> interface where text is read or written, just as it should be.

I'm not angered by text. I'm just wondering if it has any practical use
that is not misuse...

For example, Py3 should not make any pretense that there is a "default"
encoding for strings. Locales are an abhorrent invention from the early
8-bit days. IOW, you should never input or output text without explicit
serialization.
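
For comparison, this is roughly what explicit serialization looks like with the existing API: passing `encoding=` to `open()` so the locale default is never consulted. The temporary file and sample text are only for the example.

```python
import os
import tempfile

text = "Gr\u00fc\u00dfe"

# Create a scratch file to write to (hypothetical location).
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Name the encoding explicitly instead of relying on the locale.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # What actually landed on disk is bytes.
    with open(path, "rb") as f:
        raw = f.read()

    # Reading back needs the same explicit choice.
    with open(path, "r", encoding="utf-8") as f:
        read_back = f.read()
finally:
    os.remove(path)

print(raw)
print(read_back == text)
```

Without the `encoding` argument, `open()` in text mode falls back to a locale-dependent default, which is the behaviour being objected to here.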

I get the feeling that Py3 would like to present a world where strings
are first-class I/O objects that can exist in files, in filenames,
inside pipes. You say, "text is read or written." I'm saying text is
never read or written. It only exists as an abstraction (not even
unicode) inside the virtual machine.


Marko


#71483

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2014-05-13 14:30 +0100
Message-ID	<mailman.9964.1399987825.18130.python-list@python.org>
In reply to	#71450

On 13/05/2014 09:38, Chris Angelico wrote:
>
> It's not a good thing. It means that you have the convenience of
> pretending there's no problem, which means you don't notice trouble
> until something happens... and then, in all probability, your app is
> in production and you have no idea why stuff went wrong.
>

Unless you're (un)lucky enough to be working on IIRC the 1/3 of major IT 
projects that deliver nothing :)

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence



#71484

From	Chris Angelico <rosuav@gmail.com>
Date	2014-05-13 23:37 +1000
Message-ID	<mailman.9965.1399988276.18130.python-list@python.org>
In reply to	#71450

On Tue, May 13, 2014 at 11:30 PM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
> On 13/05/2014 09:38, Chris Angelico wrote:
>>
>>
>> It's not a good thing. It means that you have the convenience of
>> pretending there's no problem, which means you don't notice trouble
>> until something happens... and then, in all probability, your app is
>> in production and you have no idea why stuff went wrong.
>>
>
> Unless you're (un)lucky enough to be working on IIRC the 1/3 of major IT
> projects that deliver nothing :)

Been there, done that. At least, most likely so... there is a chance,
albeit slim, that the boss/owner will either discover someone who'll
finish the project for him, or find the time to finish it himself. I
gather he's looking at ripping all my code out and replacing it with
PHP of his own design, which should be fun. On the plus side, that
does mean he can get any idiot straight out of a uni course to do the
work; much easier than finding someone who knows Python, Pike, bash,
and C++. The White King told Alice that cynicism is a disease that can
be cured... but it can also be inflicted, and a promising-looking
N-year project that collapses because the boss starts getting stupid
with code formatting rules and then ends up firing his last remaining
competent employee is a pretty effective means of instilling cynicism.

ChrisA
