

Groups > comp.lang.python > #105093 > unrolled thread

How to waste computer memory?

Started by: wxjmfauth@gmail.com
First post: 2016-03-17 07:34 -0700
Last post: 2016-03-18 11:18 -0700
Articles 20 on this page of 72 — 18 participants



Contents

  How to waste computer memory? wxjmfauth@gmail.com - 2016-03-17 07:34 -0700
    Re: How to waste computer memory? Rick Johnson <rantingrickjohnson@gmail.com> - 2016-03-17 12:21 -0700
      Re: How to waste computer memory? cl@isbd.net - 2016-03-17 20:31 +0000
        Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 07:42 +1100
          Re: How to waste computer memory? Grant Edwards <invalid@invalid.invalid> - 2016-03-17 21:08 +0000
            Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 08:13 +1100
              Re: How to waste computer memory? Paul Rubin <no.email@nospam.invalid> - 2016-03-17 14:30 -0700
            Re: How to waste computer memory? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-17 22:32 +0000
            Re: How to waste computer memory? cl@isbd.net - 2016-03-17 22:42 +0000
          Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-17 23:11 +0200
            Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 08:17 +1100
            Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-17 21:26 +0000
              Re: How to waste computer memory? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-17 22:38 +0000
              Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 10:02 +1100
          Re: How to waste computer memory? alister <alister.ware@ntlworld.com> - 2016-03-17 21:37 +0000
            Re: How to waste computer memory? alister <alister.ware@ntlworld.com> - 2016-03-17 21:43 +0000
            Re: How to waste computer memory? Gene Heskett <gheskett@wdtv.com> - 2016-03-17 20:51 -0400
              Re: How to waste computer memory? Rick Johnson <rantingrickjohnson@gmail.com> - 2016-03-17 18:47 -0700
              Re: How to waste computer memory? cl@isbd.net - 2016-03-18 10:44 +0000
                Re: How to waste computer memory? Gene Heskett <gheskett@wdtv.com> - 2016-03-18 10:11 -0400
                Re: How to waste computer memory? Grant Edwards <invalid@invalid.invalid> - 2016-03-19 13:50 +0000
      Re: How to waste computer memory? Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-18 01:00 -0600
        Re: How to waste computer memory? Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-03-18 10:26 +0200
          Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-18 17:26 +0200
            Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 03:58 +1100
            Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-18 23:02 +0200
              Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-18 23:28 +0200
                Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 00:03 +0200
                  Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 09:49 +0200
                    Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 10:22 +0200
                      Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 11:40 +0200
                  Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 19:38 +1100
              Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 00:14 -0700
                Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 02:17 -0700
              Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 19:14 +1100
                Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 11:31 +0200
                  Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 03:40 -0700
                  Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 13:07 +0200
                    Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-19 12:24 +0000
                      Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 14:43 +0200
                      Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 01:18 +1100
                        Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-19 15:14 +0000
                          Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-19 15:20 +0000
                  Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 22:32 +1100
                    Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 14:42 +0200
                      Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 01:39 +1100
                        Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 16:56 +0200
                    Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 07:01 -0700
                  Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 01:56 +1100
                    Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 17:02 +0200
                      Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 02:47 +1100
                        Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 18:12 +0200
                          Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 16:01 +1100
                            Re: How to waste computer memory? Rustom Mody <rustompmody@gmail.com> - 2016-03-19 23:20 -0700
                              Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 22:06 +1100
                                Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-20 22:22 +1100
                                  Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 23:14 +1100
                                    Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-20 23:27 +1100
                              Re: How to waste computer memory? Ben Bacarisse <ben.usenet@bsb.me.uk> - 2016-03-20 14:55 +0000
                                Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-20 17:36 +0200
                                Re: How to waste computer memory? Random832 <random832@fastmail.com> - 2016-03-20 14:17 -0400
                            Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-20 09:30 +0200
        Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-18 03:50 -0700
        Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-18 22:46 +1100
          Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-18 22:58 +1100
            Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-18 12:53 -0700
          Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 23:37 +1100
          Re: How to waste computer memory? Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-18 07:57 -0600
      Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 03:44 +1100
        Re: How to waste computer memory? Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-03-18 20:22 +0200
          Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-18 13:03 -0700
    Re: How to waste computer memory? sohcahtoa82@gmail.com - 2016-03-18 11:18 -0700



#105269

From: Grant Edwards <invalid@invalid.invalid>
Date: 2016-03-19 13:50 +0000
Message-ID: <ncjlf0$j9t$1@reader1.panix.com>
In reply to: #105196
On 2016-03-18, cl@isbd.net <cl@isbd.net> wrote:

> However I doubt it's still being used, a year or two after I wrote it
> we migrated to a Tektronix development system that ran Unix (wow!).

The PDP-11 one that ran TNIX (a thinly disguised port of v7)? Back in
the early 80's we used a couple of those doing microprocessor
development for cellular phones and cellular base station radios.
IIRC, they were 8560s with the 8540 in-circuit-emulator boxes attached
to a couple of the high speed serial ports.  Compared to the groups
that were using Intel MDS development machines running ISIS, we
thought we were pretty cool.  We even had a 300 baud modem on one port
so we could dial-in from home!

-- 
Grant








#105182

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2016-03-18 01:00 -0600
Message-ID: <mailman.302.1458284448.12893.python-list@python.org>
In reply to: #105142
On Thu, Mar 17, 2016 at 1:21 PM, Rick Johnson
<rantingrickjohnson@gmail.com> wrote:
> In the event that i change my mind about Unicode, and/or for
> the sake of others, who may want to know, please provide a
> list of languages that *YOU* think handle Unicode better than
> Python, starting with the best first. Thanks.

jmf has been asked this before, and as I recall he seems to feel that
UTF-8 should be used for all purposes, ignoring the limitations of
that encoding, such as indexing becoming an O(n) operation. He has
pointed at Go as an example of a language wherein Unicode "just
works", although I think that others do not necessarily agree [1].
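
A minimal sketch of why indexing by character into UTF-8 data is O(n): the only way to find the n-th code point is to walk the bytes and count the ones that start a character. The function name `utf8_char_at` is hypothetical and used for illustration only.

```python
def utf8_char_at(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of UTF-8 data by linear scan."""
    if n < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1
    start = None
    for i, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes;
        # anything else starts a new character.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError("character index out of range")
    return data[start:].decode("utf-8")


encoded = "myöhä".encode("utf-8")
third = utf8_char_at(encoded, 2)
```

By contrast, a str in Python 3.3+ stores fixed-width code units, so `"myöhä"[2]` is a direct array lookup.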

He also seems to have a strange notion of the meaning of the word
"buggy". He frequently uses that word to describe the Python 3.3
Unicode implementation, although he can't seem to demonstrate any
actual bugs. Instead, he points at cherry-picked micro-benchmarks that
show Python's old "narrow" Unicode implementation (which did not
properly support SMP characters, unlike the "wide" implementation
which was a much greater memory hog than the version he's now
complaining about) outperforming the PEP-393 implementation while
completely ignoring any real-world benchmarks.

[1] https://coderwall.com/p/k7zvyg/dealing-with-unicode-in-go



#105189

From: Jussi Piitulainen <jussi.piitulainen@helsinki.fi>
Date: 2016-03-18 10:26 +0200
Message-ID: <lf5y49gw5s9.fsf@ling.helsinki.fi>
In reply to: #105182
Ian Kelly writes:

> On Thu, Mar 17, 2016 at 1:21 PM, Rick Johnson
> <rantingrickjohnson@gmail.com> wrote:
>> In the event that i change my mind about Unicode, and/or for
>> the sake of others, who may want to know, please provide a
>> list of languages that *YOU* think handle Unicode better than
>> Python, starting with the best first. Thanks.
>
> jmf has been asked this before, and as I recall he seems to feel that
> UTF-8 should be used for all purposes, ignoring the limitations of
> that encoding such as that indexing becomes a O(n) operation. He has
> pointed at Go as an example of a language wherein Unicode "just
> works", although I think that others do not necessarily agree [1].

...

> [1] https://coderwall.com/p/k7zvyg/dealing-with-unicode-in-go

I think Julia's way of dealing with its strings-as-UTF-8 [2] is more
promising. Indexing is by bytes (1-based in Julia) but the value at a
valid index is the whole UTF-8 character at that point, and an invalid
index raises an exception.

The letters "ö" and "ä" are two bytes each in UTF-8.

julia> s = "myöhä"
"myöhä"

julia> s[3]
'ö'

julia> s[4]
ERROR: UnicodeError: invalid character index
 in next at ./unicode/utf8.jl:65
 in getindex at strings/basic.jl:37

julia> s[5]
'h'

Julia provides access to the next character at an index and the valid
index after that:

julia> next(s, 3)
('ö',5)

The last valid index:

julia> endof(s)
6

Special syntax to index at the end of a string:

julia> s[end - 1:end]
"hä"

That's not quite right. The penultimate character happened to be one
byte, so it worked. At least incorrect indexing results in an exception
rather than an incorrect value. There is a proper method to get a
previous valid index - I should have used that.

Also, the length of a string is the number of characters rather than
bytes, decoupled from the indexing.

julia> length("myöhä")
5
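
For comparison, the same behaviour can be roughly emulated in Python. This is only a sketch of the idea and not how Julia implements it; the class name `Utf8String` and its methods are hypothetical.

```python
class Utf8String:
    """UTF-8 backed string indexed by 1-based byte offset, Julia-style."""

    def __init__(self, text):
        self.data = text.encode("utf-8")

    def _is_start(self, i):
        # True if the byte at 0-based offset i begins a character.
        return (self.data[i] & 0xC0) != 0x80

    def next(self, i):
        """Return (character at byte index i, next valid byte index)."""
        if not 1 <= i <= len(self.data):
            raise IndexError("byte index out of range")
        if not self._is_start(i - 1):
            raise ValueError("invalid character index")
        j = i
        while j < len(self.data) and not self._is_start(j):
            j += 1
        return self.data[i - 1:j].decode("utf-8"), j + 1

    def __getitem__(self, i):
        return self.next(i)[0]

    def __len__(self):
        # Length counts characters, not bytes, so it is O(n).
        return sum(1 for b in self.data if (b & 0xC0) != 0x80)


s = Utf8String("myöhä")
```

Here `s[3]` gives the whole character, `s[4]` raises, and `len(s)` is decoupled from the byte indices, mirroring the session above.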

I work with text all the time, but I don't think I ever _need_ arbitrary
access to an nth character. What I require is access to the start and
end of a string, searching, and splitting. These all seem compatible
with using UTF-8 representations. Same with iterating over the string
(forward or backward).

Just in case: I've been quite happy with Unicode in Python 3. It's just
interesting to see a different way that also seems to work.

[2] http://docs.julialang.org/en/release-0.4/manual/strings/



#105224

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2016-03-18 17:26 +0200
Message-ID: <87twk3oli0.fsf@elektro.pacujo.net>
In reply to: #105189
Michael Torrie <torriem@gmail.com>:

> On 03/18/2016 02:26 AM, Jussi Piitulainen wrote:
>> I think Julia's way of dealing with its strings-as-UTF-8 [2] is more
>> promising. Indexing is by bytes (1-based in Julia) but the value at a
>> valid index is the whole UTF-8 character at that point, and an
>> invalid index raises an exception.
>
> This seems to me to be a leaky abstraction.

It may be that Python's Unicode abstraction is an untenable illusion
because the underlying reality is 8-bit and there's no way to hide it
completely.

There's no problem providing pure Unicode strings. Things get iffy when
Python's OS abstraction pretends sys.stdin is text or filenames are
strings.

> Julia's approach is interesting, but it strikes me as somewhat broken
> as it pretends to do O(1) indexing, but in reality it's still O(n)

If the underlying encoding is 8-bit, converting it to an O(1) structure
would still be O(n).


Marko



#105227

From: Steven D'Aprano <steve@pearwood.info>
Date: 2016-03-19 03:58 +1100
Message-ID: <56ec33a2$0$1593$c3e8da3$5496439d@news.astraweb.com>
In reply to: #105224
On Sat, 19 Mar 2016 02:26 am, Marko Rauhamaa wrote:

> Michael Torrie <torriem@gmail.com>:
> 
>> On 03/18/2016 02:26 AM, Jussi Piitulainen wrote:
>>> I think Julia's way of dealing with its strings-as-UTF-8 [2] is more
>>> promising. Indexing is by bytes (1-based in Julia) but the value at a
>>> valid index is the whole UTF-8 character at that point, and an
>>> invalid index raises an exception.
>>
>> This seems to me to be a leaky abstraction.
> 
> It may be that Python's Unicode abstraction is an untenable illusion
> because the underlying reality is 8-bit and there's no way to hide it
> completely.
>
> There's no problem providing pure Unicode strings. Things get iffy when
> Python's OS abstraction pretends sys.stdin is text or filenames are
> strings.

The abstraction only breaks because of historical reasons.

In Linux and Unix systems, the underlying file system actually allows any
arbitrary byte strings (with a small number of restrictions, such as
disallowing ASCII NUL and / (slash) bytes). But modern applications try to
pretend that the file system is actually UTF-8. That would work fine if
people *only* accessed the file system with such tools that used UTF-8. But
they don't.
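
A small sketch of how Python copes with file names that are not valid UTF-8. It assumes a name written by a Latin-1 tool; the `surrogateescape` error handler (the one `os.fsdecode` and `os.fsencode` use on POSIX systems) smuggles the undecodable bytes through as lone surrogates so the original bytes can be recovered.

```python
raw_name = b"caf\xe9.txt"  # Latin-1 bytes; not valid UTF-8

# Strict decoding refuses the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each bad byte 0xNN to the code point U+DCNN ...
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# ... and encoding with the same handler restores the original bytes.
round_trip = as_text.encode("utf-8", errors="surrogateescape")
```

The resulting str is usable as a file name but is not well-formed Unicode, which is where the abstraction leaks.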

On Windows, the file system is either UTF-16 or UCS-2, I'm not sure which.
But the NTFS file system itself enforces that all file names are valid
UTF-16 (or the other one). Since all valid UTF-16 strings are valid Unicode
(by definition), there's no problem there.

However, the problem on Windows is not the underlying file system, but the
Explorer interface. It still uses old legacy encodings, and localises them
in different countries, so it is inevitable that people end up with
mojibake file names.


>> Julia's approach is interesting, but it strikes me as somewhat broken
>> as it pretends to do O(1) indexing, but in reality it's still O(n)
> 
> If the underlying encoding is 8-bit, converting it to an O(1) structure
> would still be O(n).

Yes, but you only need to do that once, on input, or at most twice, on input
and output, not on every operation.



-- 
Steven



#105240

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2016-03-18 23:02 +0200
Message-ID: <87k2kzo5y5.fsf@elektro.pacujo.net>
In reply to: #105224
Chris Angelico <rosuav@gmail.com>:
> On Sat, Mar 19, 2016 at 2:26 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> It may be that Python's Unicode abstraction is an untenable illusion
>> because the underlying reality is 8-bit and there's no way to hide it
>> completely.
>
> The underlying reality is 1-bit. Or maybe the underlying reality is
> actually electrical signals that don't even have a clear definition of
> "bits" and bounce between two states for a few fractions of a second
> before settling. And maybe someone's implementing Python on the George
> Banks Kite CPU, which consists of two cents' worth of paper and
> string, on which text is actually represented by glyph. They're all
> equally valid notions of "underlying reality".
>
> Text is an abstract concept, just as numbers are.

The question is how tenable the illusion is. If the OS gave the
appropriate guarantees (say, all pathnames are encoded Unicode strings),
the abstraction could be maintained. Unfortunately, the legacy shines
through making you wonder if Python has overreached prematurely with its
Unicode HAL.


Marko



#105241

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2016-03-18 23:28 +0200
Message-ID: <87bn6bo4q7.fsf@elektro.pacujo.net>
In reply to: #105240
Chris Angelico <rosuav@gmail.com>:

> The problem is not Python's Unicode strings, then. The problem is the
> notion that path names are text. If they're text, they should be
> exclusively text (although, for low-level efficiency, they're more
> likely to be defined as "valid UTF-8 sequences" rather than "sequences
> of Unicode codepoints"); since they're not, they are fundamentally
> bytes. But that's not a problem with Python - it's a problem with the
> file system.

The file system does not have a problem. Python has a problem because it
tries to present pathnames as Unicode strings, which isn't always
possible.

The standard input and output are even more problematic because Python
very strongly "wishes" them to be Unicode streams.

If I were to start a new OS today, I would very much like to placate the
likes of Python. Unfortunately, the sins of our forefathers cannot be
wished away.

Anyway, Python is careful not to paint itself in a corner. It gives you
everything you need to break the abstraction and go low-level. It even
offers regular expressions and ASCII syntax for bytes objects! For
example, Guile 2.x is trying to emulate Python's progressive approach,
but doesn't offer such amenities. Thus, Python's

   b'hi!'

is

   #vu8(104 105 33)

or

   (use-modules (rnrs bytevectors))
   (string->utf8 "hi!")

in Guile.
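
To illustrate the amenities mentioned above, a short sketch using only the standard `re` module: bytes literals use ASCII syntax, iterate as integers, and accept bytes regular expressions. The sample data is made up.

```python
import re

data = b"hi!"
code_units = list(data)  # the same three integers as the Guile bytevector

# A bytes pattern (rb"...") matches against a bytes object.
match = re.search(rb"h(.)!", data)
captured = match.group(1)

# Bytes patterns can also target non-ASCII byte ranges directly.
high_bytes = re.findall(rb"[\x80-\xff]+", b"caf\xe9 na\xefve")
```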


Marko



#105243

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2016-03-19 00:03 +0200
Message-ID: <877fgzo34t.fsf@elektro.pacujo.net>
In reply to: #105241
Chris Angelico <rosuav@gmail.com>:

> On Sat, Mar 19, 2016 at 8:28 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> The file system does not have a problem. Python has a problem because it
>> tries to present pathnames as Unicode strings, which isn't always
>> possible.
>
> But what does a file name *mean*?

A Linux/UNIX file name is an extended ASCII string, where the
interpretation of bytes in the range 128..255 is left ambiguous. That's
the legacy of the early 1980's. At that time 8-bit bytes were standard,
and the parity nonsense was virtually gone.

C, Emacs and the OS supported those bytes without a problem but treated
them as some sort of control characters (they were represented with the
octal \nnn notation).

Some systems used the upper byte range for block graphics (CP/M).

Some systems used the upper byte range to represent Hebrew letters
(Atari).
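
The ambiguity can be sketched in Python: one and the same byte above 127 decodes to a different character under each legacy code page. The three codecs below are just examples that ship with Python (Western European, IBM PC and Hebrew); they are not the specific CP/M or Atari character sets mentioned above.

```python
raw = b"\xe4"  # a single "extended ASCII" byte

encodings = ["latin-1", "cp437", "iso8859_8"]
decoded = {name: raw.decode(name) for name in encodings}

for name, ch in decoded.items():
    # Print the code point each code page assigns to the same byte.
    print(f"{name:10} -> U+{ord(ch):04X}")
```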

Then came ISO-8859-x and the locales (yuck!). Sun scrambled to make
SunOS "8-bit clean". ISO-8859-1 was widely taken as the default for the
Civilized World. Pathnames reflected that colonialist mindset.

ISO-8859-1 was the state of the art around 1995 (HTML). UCS-2 was the
avant-garde adopted by Windows and Java. UTF-8 came later, and Linux
luckily avoided the UCS-2 mess.

All that "extended ASCII" legacy is still the reality on Linux and won't
go away in the foreseeable future. I suppose OSX is the only mainstream
operating system that had the full benefit of hindsight. And even they
messed it up with case-insensitive pathnames.

> If I were building an entire OS ecosystem from scratch today, I'd
> probably do a lot of things with a hybrid system of documented meaning
> atop implementation-detail APIs. In this particular case, I would
> define the API in terms of byte sequences, but clearly documenting
> that these byte sequences are to be understood to mean text strings,
> and thus must be valid UTF-8.

UTF-8 shouldn't have anything to do with the abstract pathnames (which
should be normalized Unicode). Also, special-casing '\0' and '/' is
lame. Why can't I have "Results 1/2016" as a filename?


Marko



#105252

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2016-03-19 09:49 +0200
Message-ID: <871t76oqje.fsf@elektro.pacujo.net>
In reply to: #105243
Random832 <random832@fastmail.com>:

> On Fri, Mar 18, 2016, at 20:55, Chris Angelico wrote:
>> On Sat, Mar 19, 2016 at 9:03 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> > Also, special-casing '\0' and '/' is
>> > lame. Why can't I have "Results 1/2016" as a filename?
>> 
>> Would you be allowed to have a directory named "Results 1" as well?
>
> If I were designing a new operating from scratch and didn't have to be
> compatible with anything, I would probably have pathnames be tuples of
> strings (maybe represented at the low level with percent-escaping),
> rather than having a directory separator.

Speaking of the low level, the classic UNIX file system doesn't make use
of pathnames. Rather, the files are nameless. They are identified by the
device (= file system) number plus the inode number.

Some files are directories, dir objects if you will, that map filenames
to inode numbers. The file system enforces the limitation that the
filenames (directory keys) cannot contain '\0' or '/' ASCII characters.
The entries of the directory are called (hard) links.
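
A minimal sketch of that model from Python, assuming a POSIX-like file system that supports hard links. Two directory entries map to the same (device, inode) pair, and the file outlives the removal of either name. The file names are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    first = os.path.join(d, "first")
    second = os.path.join(d, "second")
    with open(first, "w") as f:
        f.write("hello\n")

    os.link(first, second)  # add a second directory entry for the same inode

    st1, st2 = os.stat(first), os.stat(second)
    same_file = (st1.st_dev, st1.st_ino) == (st2.st_dev, st2.st_ino)
    link_count = st1.st_nlink  # number of hard links to the inode

    os.remove(first)  # drops one link; the file itself survives
    with open(second) as f:
        survived = f.read()
```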

A pathname is a clumsy proxy for a file because the file system may be
modified between references through renames or deletions. What you'd
want is a reference object that maintains a reference count on the inode.
Of course, you could create a hard link on the fly, but the operating
system doesn't clear the link automatically when the client process goes
away nor does it prevent other processes from tampering with the link.

You could open the file and use the file descriptor as such a reference
object. However, the process may not have access rights to open the
file. UNIX forces you to open a file with O_RDONLY, O_WRONLY or O_RDWR;
you'd need an O_REF option that doesn't allow you any I/O access to the
file but allows you to refer to a file.


Marko



#105254

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2016-03-19 10:22 +0200
Message-ID: <87pouqnahf.fsf@elektro.pacujo.net>
In reply to: #105252
Chris Angelico <rosuav@gmail.com>:

> On Sat, Mar 19, 2016 at 6:49 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Speaking of the low level, the classic UNIX file system doesn't make
>> use of pathnames. Rather, the files are nameless. They are identified
>> by the device (= file system) number plus the inode number.
>
> Not entirely fair. A file system has directories in it, which have
> names in them referencing other inodes. So while you can get to the
> contents of the file given only its inode, but the path names are very
> much a part of the file system too.

Not all files have pathnames. Those that do have numerous pathnames. You
can't tell by looking at a file what pathnames, if any, it might have.
You need an exhaustive, recursive search of the file system for the
reverse mapping.

If you execute the commands:

   echo hello >hello
   rm hello

You don't know for sure if the file you removed was the file you created
on the previous line.
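
The race can be sketched in Python, assuming POSIX semantics (renaming an open file is allowed). An open file descriptor keeps referring to the original inode, while the pathname can silently come to mean a different file. The helper `still_same_file` is hypothetical.

```python
import os
import tempfile


def still_same_file(fd, path):
    """True if path currently names the file that fd refers to."""
    try:
        return os.path.samestat(os.fstat(fd), os.stat(path))
    except FileNotFoundError:
        return False


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "hello")
    with open(path, "w") as f:
        f.write("hello\n")
        before = still_same_file(f.fileno(), path)

        # Simulate another process swapping the file between two commands.
        os.rename(path, path + ".old")
        with open(path, "w") as other:
            other.write("impostor\n")

        after = still_same_file(f.fileno(), path)
```

After the swap, removing `path` would delete the impostor rather than the file created first.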


Marko



#105260

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2016-03-19 11:40 +0200
Message-ID: <87h9g2n6um.fsf@elektro.pacujo.net>
In reply to: #105254
Chris Angelico <rosuav@gmail.com>:

> On Sat, Mar 19, 2016 at 7:22 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Not all files have pathnames. Those that do have numerous pathnames. You
>> can't tell by looking at a file what pathnames, if any, it might have.
>> You need an exhaustive, recursive search of the file system for the
>> reverse mapping.
>>
>> If you execute the commands:
>>
>>    echo hello >hello
>>    rm hello
>>
>> You don't know for sure if the file you removed was the file you created
>> on the previous line.
>
> Not all objects in Python have names bound to them. Those that do may
> have multiple. You can't tell, by looking at an object, what pathnames
> it has. You need an exhaustive, recursive search of all namespaces for
> the reverse mapping.
>
> If you execute the commands:
>
> hello = re.compile("[Hh][Ee][Ll][Ll][Oo]")
> hello.match(msg)
>
> you don't know for sure if the object you called a method on was the
> one you created on the previous line.
>
> So are object names not part of Python?

What are you talking about, Chris?

Point is, UNIX file systems don't provide for inode handles. You have
pathnames and file descriptors, but you'd need something between the
two. The omission plagues UNIX application programmers and is the cause
of numerous bugs -- even when the programmers are not aware of the
danger.

Objects in Python have trivial references: just assign the object to a
variable.


Marko



#105256

From: Steven D'Aprano <steve@pearwood.info>
Date: 2016-03-19 19:38 +1100
Message-ID: <56ed0fef$0$1596$c3e8da3$5496439d@news.astraweb.com>
In reply to: #105243
On Sat, 19 Mar 2016 01:30 pm, Random832 wrote:

> On Fri, Mar 18, 2016, at 20:55, Chris Angelico wrote:
>> On Sat, Mar 19, 2016 at 9:03 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> > Also, special-casing '\0' and '/' is
>> > lame. Why can't I have "Results 1/2016" as a filename?
>> 
>> Would you be allowed to have a directory named "Results 1" as well?
> 
> If I were designing a new operating from scratch and didn't have to be
> compatible with anything, I would probably have pathnames be tuples of
> strings (maybe represented at the low level with percent-escaping),
> rather than having a directory separator.


ls -l /home/user/documents/stuff/foo


ls -l "home","user","documents","stuff","foo"


I think users of command line tools and shells will hate you.
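
For what the tuple-of-strings idea with percent-escaping might look like in practice, here is a rough sketch using `urllib.parse`. The functions `encode_path` and `decode_path` are hypothetical and describe no existing operating system interface.

```python
from urllib.parse import quote, unquote


def encode_path(components):
    """Join components with '/', escaping any '/' inside a component."""
    return "/".join(quote(c, safe="") for c in components)


def decode_path(encoded):
    """Inverse of encode_path: split on '/' and undo the escaping."""
    return tuple(unquote(c) for c in encoded.split("/"))


path = ("home", "user", "Results 1/2016")
wire_form = encode_path(path)
restored = decode_path(wire_form)
```

With this scheme a name such as "Results 1/2016" and a directory "Results 1" can coexist, at the cost of a less readable low-level form.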





-- 
Steven



#105251

From: wxjmfauth@gmail.com
Date: 2016-03-19 00:14 -0700
Message-ID: <434ba3c9-5391-46bc-86cf-cab5a77e24bd@googlegroups.com>
In reply to: #105240
---------

"Demonstrate", illustrate:
The shorter the strings are, the more memory is wasted.

(Interestingly, this does correspond to the "real world".)



#105257

From: wxjmfauth@gmail.com
Date: 2016-03-19 02:17 -0700
Message-ID: <ef627c9e-328f-4be6-8cef-6384bb5620df@googlegroups.com>
In reply to: #105251
On Saturday, 19 March 2016 08:14:41 UTC+1, wxjm...@gmail.com wrote:
> ---------
> 
> "Demonstrate", illustrate:
> The shorter the strings are, the more memory is wasted.
> 
> (Interestingly, this does correspond to the "real world".)

Addendum
I forgot something.
When attempting to model mathematically the memory
consumption as a function of the length of the strings
*and* the number of strings,
do not forget to take into account the very short
strings (the limit) containing a single "char".
There is some kind of discontinuity, due to the
fact that in the *string type* everything is flexible.

>>> import sys
>>> sys.getsizeof('')
25
>>> sys.getsizeof('a')
26

---
A purist will always initialize a string with
an ascii character.



#105253

From: Steven D'Aprano <steve@pearwood.info>
Date: 2016-03-19 19:14 +1100
Message-ID: <56ed0a71$0$1607$c3e8da3$5496439d@news.astraweb.com>
In reply to: #105240
On Sat, 19 Mar 2016 08:08 am, Chris Angelico wrote:

> On Sat, Mar 19, 2016 at 8:02 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Chris Angelico <rosuav@gmail.com>:
>>> On Sat, Mar 19, 2016 at 2:26 AM, Marko Rauhamaa <marko@pacujo.net>
>>> wrote:
>>>> It may be that Python's Unicode abstraction is an untenable illusion
>>>> because the underlying reality is 8-bit and there's no way to hide it
>>>> completely.
>>>
>>> The underlying reality is 1-bit. Or maybe the underlying reality is
>>> actually electrical signals that don't even have a clear definition of
>>> "bits" and bounce between two states for a few fractions of a second
>>> before settling. And maybe someone's implementing Python on the George
>>> Banks Kite CPU, which consists of two cents' worth of paper and
>>> string, on which text is actually represented by glyph. They're all
>>> equally valid notions of "underlying reality".
>>>
>>> Text is an abstract concept, just as numbers are.
>>
>> The question is how tenable the illusion is. If the OS gave the
>> appropriate guarantees (say, all pathnames are encoded Unicode strings),
>> the abstraction could be maintained. Unfortunately, the legacy shines
>> through making you wonder if Python has overreached prematurely with its
>> Unicode HAL.
> 
> The problem is not Python's Unicode strings, then. The problem is the
> notion that path names are text. If they're text, they should be
> exclusively text (although, for low-level efficiency, they're more
> likely to be defined as "valid UTF-8 sequences" rather than "sequences
> of Unicode codepoints"); since they're not, they are fundamentally
> bytes. But that's not a problem with Python - it's a problem with the
> file system.


One thing that NTFS gets right is that all path names are guaranteed to be
well-formed, valid Unicode. I believe that they are stored in UTF-16, and
unlike the ext file systems used on Linux, they are not arbitrary bytes.

I believe that HFS+ on Apple Macs goes one step further and guarantees that
paths are always fully normalised, so that it's impossible to have (e.g.)
two files ã (U+00E3 LATIN SMALL LETTER A WITH TILDE) and ã (U+0061 LATIN
SMALL LETTER A + U+0303 COMBINING TILDE) in the same directory.
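
The two spellings can be compared in Python with the standard `unicodedata` module. This sketch only shows the Unicode normalization forms themselves; it makes no claim about the exact variant a given file system applies.

```python
import unicodedata

composed = "\u00e3"     # LATIN SMALL LETTER A WITH TILDE
decomposed = "a\u0303"  # LATIN SMALL LETTER A + COMBINING TILDE

# As raw code point sequences the two strings differ ...
raw_equal = composed == decomposed

# ... but they are equal after normalizing both to the same form.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed
```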

Unfortunately, backwards compatibility is holding Linux file systems back...



-- 
Steven



#105259

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2016-03-19 11:31 +0200
Message-ID: <87lh5en79a.fsf@elektro.pacujo.net>
In reply to: #105253
Steven D'Aprano <steve@pearwood.info>:

> One thing that NTFS gets right is that all path names are guaranteed
> to be well-formed, valid Unicode. I believe that they are stored in
> UTF-16, and unlike the ext file systems used on Linux, they are not
> arbitrary bytes.

<URL: https://msdn.microsoft.com/en-us/library/windows/desktop/dd31774
8%28v=vs.85%29.aspx> states that NTFS filenames disallow '\', '/', '.',
'?', '*' as well as '¥'. Apparently the ban on the yen symbol isn't
enforced by the FS.

I haven't found a direct statement whether NTFS internally enforces the
soundness of UTF-16 or if it is simply doing UCS-2.

<URL: https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069%28v=vs.85%29.aspx>:

   Using the surrogate mechanism, UTF-16 can support all 1,114,112
   potential Unicode characters.

But Unicode doesn't contain 1,114,112 characters—the surrogates are
excluded from Unicode, and definitely cannot be encoded using
UTF-anything.
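
The arithmetic behind that objection, as a quick sketch: the 1,114,112 figure counts every code point from U+0000 to U+10FFFF, surrogates included. The constant names below are made up for this example.

```python
CODE_POINTS = 0x10FFFF + 1          # every code point, U+0000 .. U+10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1    # the surrogate range, U+D800 .. U+DFFF
SCALAR_VALUES = CODE_POINTS - SURROGATES  # what a UTF can actually encode

print(CODE_POINTS)    # 1114112
print(SURROGATES)     # 2048
print(SCALAR_VALUES)  # 1112064
```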

Furthermore, the page notes:

   Note Windows 2000 introduces support for basic input, output, and
   simple sorting of supplementary characters. However, not all system
   components are compatible with supplementary characters.

(Somewhat related, Python doesn't enforce the soundness of Unicode
because Python allows surrogate code points in strings.)
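
A short sketch of what that looks like in Python 3: a str may hold a lone surrogate, but the strict UTF-8 codec refuses to encode it unless the `surrogatepass` error handler is requested. The variable names are only illustrative.

```python
lone = "\ud800"   # a lone high surrogate is accepted as a str

print(len(lone))  # 1

# The default (strict) UTF-8 codec rejects it.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False

# surrogatepass is an explicit opt-in that emits the three-byte form anyway.
raw = lone.encode("utf-8", "surrogatepass")
print(raw)        # b'\xed\xa0\x80'
```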

> I believe that HFS+ on Apple Macs goes one step further and guarantees
> that paths are always fully normalised, so that it's impossible to
> have (e.g.) two files ã (U+00E3 LATIN SMALL LETTER A WITH TILDE) and ã
> (U+0061 LATIN SMALL LETTER A + U+0303 COMBINING TILDE) in the same
> directory.
>
> Unfortunately, backwards compatibility is holding Linux file systems
> back...

Linux got lucky by not jumping the gun. We are still waiting for the
dust to settle.

Unicode made several (understandable but grave) mistakes along the way:

   * UCS-2

   * supplementary code points

   * BOM

   * endianness

   * normalization

We still don't know if the final result will be UCS-4 everywhere (with
all 2**32 code points allowed?!) or UTF-8 everywhere.


Marko



#105261

From: wxjmfauth@gmail.com
Date: 2016-03-19 03:40 -0700
Message-ID: <72bdbf36-7b0c-4979-8261-8928dd1d5715@googlegroups.com>
In reply to: #105259

On Saturday, 19 March 2016 at 10:31:56 UTC+1, Marko Rauhamaa wrote:
> We still don't know if the final result will be UCS-4 everywhere (with
> all 2**32 code points allowed?!) or UTF-8 everywhere.
> 

A partial answer.
You are most probably using UTF-32 every day without
knowing it, simply because you are using fonts.

jmf



#105262

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2016-03-19 13:07 +0200
Message-ID: <87vb4ilo8g.fsf@elektro.pacujo.net>
In reply to: #105259

Chris Angelico <rosuav@gmail.com>:

> On Sat, Mar 19, 2016 at 8:31 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Unicode made several (understandable but grave) mistakes along the way:
>>
>>    * normalization
>
> Elaborate please? What's such a big mistake here?

Unicode shouldn't have allowed multiple equivalent variants for a
string.

Now Python falls victim to:

   >>> '\u006e\u0303' == '\u00f1'
   False

<URL: https://en.wikipedia.org/wiki/Unicode_equivalence>:

   For example, the code point U+006E (the Latin lowercase "n") followed
   by U+0303 (the combining tilde "◌̃") is defined by Unicode to be
   canonically equivalent to the single code point U+00F1 (the lowercase
   letter "ñ" of the Spanish alphabet). Therefore, those sequences
   should be displayed in the same manner, should be treated in the same
   way by applications such as alphabetizing names or searching, and may
   be substituted for each other.
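
For comparison, a minimal sketch of how canonical equivalence can be tested in Python with the standard `unicodedata` module. The helper `canonically_equal` is a hypothetical name, not a standard-library function.

```python
import unicodedata

decomposed = "\u006e\u0303"   # n + COMBINING TILDE
precomposed = "\u00f1"        # LATIN SMALL LETTER N WITH TILDE

# Plain == compares code points, so the two spellings differ.
print(decomposed == precomposed)  # False


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(canonically_equal(decomposed, precomposed))  # True
```

Normalising both operands to the same form (NFC here, NFD would work equally well) is what makes the comparison agree with the equivalence described in the quoted passage.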


Marko



#105264

From: BartC <bc@freeuk.com>
Date: 2016-03-19 12:24 +0000
Message-ID: <ncjg8e$7lj$1@dont-email.me>
In reply to: #105262

On 19/03/2016 11:07, Marko Rauhamaa wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>> On Sat, Mar 19, 2016 at 8:31 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>>> Unicode made several (understandable but grave) mistakes along the way:
>>>
>>>     * normalization
>>
>> Elaborate please? What's such a big mistake here?
>
> Unicode shouldn't have allowed multiple equivalent variants for a
> string.
>
> Now Python falls victim to:
>
>     >>> '\u006e\u0303' == '\u00f1'
>     False
>
> <URL: https://en.wikipedia.org/wiki/Unicode_equivalence>:
>
>     For example, the code point U+006E (the Latin lowercase "n") followed
>     by U+0303 (the combining tilde "◌̃") is defined by Unicode to be
>     canonically equivalent to the single code point U+00F1 (the lowercase
>     letter "ñ" of the Spanish alphabet). Therefore, those sequences
>     should be displayed in the same manner, should be treated in the same
>     way by applications such as alphabetizing names or searching, and may
>     be substituted for each other.
>


So a string that looks like:

"ññññññññññññññññññññññññññññññññññññññññññññññññññ"

can have 2**50 different representations? And occupy somewhere between 
50 and 200 bytes? Or is that 400?

OK...
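
A rough sketch of those numbers, assuming only the two spellings discussed in this thread (precomposed U+00F1 versus n followed by U+0303) and three common encodings. The variable names are invented for the example.

```python
import unicodedata

nfc = "\u00f1" * 50                       # 50 precomposed characters
nfd = unicodedata.normalize("NFD", nfc)   # 50 pairs of n + COMBINING TILDE

# Each of the 50 positions can independently use either spelling.
representations = 2 ** 50

# Byte counts per normalisation form and encoding (BOM-free variants).
encodings = ("utf-8", "utf-16-le", "utf-32-le")
sizes = {
    name: {enc: len(text.encode(enc)) for enc in encodings}
    for name, text in (("NFC", nfc), ("NFD", nfd))
}

print(representations)  # 1125899906842624
print(sizes["NFC"])     # {'utf-8': 100, 'utf-16-le': 100, 'utf-32-le': 200}
print(sizes["NFD"])     # {'utf-8': 150, 'utf-16-le': 200, 'utf-32-le': 400}
```

Under these assumptions the range is 100 to 400 bytes. A 50-byte figure is only reachable in a single-byte encoding such as Latin-1, which cannot represent the decomposed spelling at all.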

-- 
Bartc



#105266

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2016-03-19 14:43 +0200
Message-ID: <87k2kyljte.fsf@elektro.pacujo.net>
In reply to: #105264

BartC <bc@freeuk.com>:

> So a string that looks like:
>
> "ññññññññññññññññññññññññññññññññññññññññññññññññññ"
>
> can have 2**50 different representations? And occupy somewhere between
> 50 and 200 bytes? Or is that 400?
>
> OK...

You are on the right track!


Marko


