Groups > comp.lang.python > #105093 > unrolled thread
| Started by | wxjmfauth@gmail.com |
|---|---|
| First post | 2016-03-17 07:34 -0700 |
| Last post | 2016-03-18 11:18 -0700 |
| Articles | 20 on this page of 72 — 18 participants |
How to waste computer memory? wxjmfauth@gmail.com - 2016-03-17 07:34 -0700
Re: How to waste computer memory? Rick Johnson <rantingrickjohnson@gmail.com> - 2016-03-17 12:21 -0700
Re: How to waste computer memory? cl@isbd.net - 2016-03-17 20:31 +0000
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 07:42 +1100
Re: How to waste computer memory? Grant Edwards <invalid@invalid.invalid> - 2016-03-17 21:08 +0000
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 08:13 +1100
Re: How to waste computer memory? Paul Rubin <no.email@nospam.invalid> - 2016-03-17 14:30 -0700
Re: How to waste computer memory? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-17 22:32 +0000
Re: How to waste computer memory? cl@isbd.net - 2016-03-17 22:42 +0000
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-17 23:11 +0200
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 08:17 +1100
Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-17 21:26 +0000
Re: How to waste computer memory? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-17 22:38 +0000
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 10:02 +1100
Re: How to waste computer memory? alister <alister.ware@ntlworld.com> - 2016-03-17 21:37 +0000
Re: How to waste computer memory? alister <alister.ware@ntlworld.com> - 2016-03-17 21:43 +0000
Re: How to waste computer memory? Gene Heskett <gheskett@wdtv.com> - 2016-03-17 20:51 -0400
Re: How to waste computer memory? Rick Johnson <rantingrickjohnson@gmail.com> - 2016-03-17 18:47 -0700
Re: How to waste computer memory? cl@isbd.net - 2016-03-18 10:44 +0000
Re: How to waste computer memory? Gene Heskett <gheskett@wdtv.com> - 2016-03-18 10:11 -0400
Re: How to waste computer memory? Grant Edwards <invalid@invalid.invalid> - 2016-03-19 13:50 +0000
Re: How to waste computer memory? Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-18 01:00 -0600
Re: How to waste computer memory? Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-03-18 10:26 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-18 17:26 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 03:58 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-18 23:02 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-18 23:28 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 00:03 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 09:49 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 10:22 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 11:40 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 19:38 +1100
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 00:14 -0700
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 02:17 -0700
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 19:14 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 11:31 +0200
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 03:40 -0700
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 13:07 +0200
Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-19 12:24 +0000
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 14:43 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 01:18 +1100
Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-19 15:14 +0000
Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-19 15:20 +0000
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 22:32 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 14:42 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 01:39 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 16:56 +0200
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 07:01 -0700
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 01:56 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 17:02 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 02:47 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 18:12 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 16:01 +1100
Re: How to waste computer memory? Rustom Mody <rustompmody@gmail.com> - 2016-03-19 23:20 -0700
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 22:06 +1100
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-20 22:22 +1100
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 23:14 +1100
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-20 23:27 +1100
Re: How to waste computer memory? Ben Bacarisse <ben.usenet@bsb.me.uk> - 2016-03-20 14:55 +0000
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-20 17:36 +0200
Re: How to waste computer memory? Random832 <random832@fastmail.com> - 2016-03-20 14:17 -0400
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-20 09:30 +0200
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-18 03:50 -0700
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-18 22:46 +1100
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-18 22:58 +1100
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-18 12:53 -0700
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 23:37 +1100
Re: How to waste computer memory? Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-18 07:57 -0600
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 03:44 +1100
Re: How to waste computer memory? Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-03-18 20:22 +0200
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-18 13:03 -0700
Re: How to waste computer memory? sohcahtoa82@gmail.com - 2016-03-18 11:18 -0700
| From | Grant Edwards <invalid@invalid.invalid> |
|---|---|
| Date | 2016-03-19 13:50 +0000 |
| Message-ID | <ncjlf0$j9t$1@reader1.panix.com> |
| In reply to | #105196 |
On 2016-03-18, cl@isbd.net <cl@isbd.net> wrote:
> However I doubt it's still being used, a year or two after I wrote it
> we migrated to a Tektronix development system that ran Unix (wow!).
The PDP-11 one that ran TNIX (a thinly disguised port of v7)?
Back in the early 80's we used a couple of those doing microprocessor
development for cellular phones and cellular base station radios.
IIRC, they were 8560s with the 8540 in-circuit-emulator boxes attached
to a couple of the high speed serial ports.
Compared to the groups that were using Intel MDS development machines
running ISIS, we thought we were pretty cool. We even had a 300 baud
modem on one port so we could dial-in from home!
--
Grant
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2016-03-18 01:00 -0600 |
| Message-ID | <mailman.302.1458284448.12893.python-list@python.org> |
| In reply to | #105142 |
On Thu, Mar 17, 2016 at 1:21 PM, Rick Johnson
<rantingrickjohnson@gmail.com> wrote:
> In the event that i change my mind about Unicode, and/or for
> the sake of others, who may want to know, please provide a
> list of languages that *YOU* think handle Unicode better than
> Python, starting with the best first. Thanks.
jmf has been asked this before, and as I recall he seems to feel that
UTF-8 should be used for all purposes, ignoring the limitations of
that encoding such as that indexing becomes a O(n) operation. He has
pointed at Go as an example of a language wherein Unicode "just
works", although I think that others do not necessarily agree [1].
He also seems to have a strange notion of the meaning of the word
"buggy". He frequently uses that word to describe the Python 3.3
Unicode implementation, although he can't seem to demonstrate any
actual bugs. Instead, he points at cherry-picked micro-benchmarks that
show Python's old "narrow" Unicode implementation (which did not
properly support SMP characters, unlike the "wide" implementation
which was a much greater memory hog than the version he's now
complaining about) outperforming the PEP-393 implementation while
completely ignoring any real-world benchmarks.
[1] https://coderwall.com/p/k7zvyg/dealing-with-unicode-in-go
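The O(n) indexing limitation mentioned above can be illustrated with a small Python sketch. It is only an illustration: `utf8_char_at` is a hypothetical helper name, not part of any library, and it assumes its input is already valid UTF-8. To find the n-th code point it has to walk the bytes from the start, because characters have variable width.

```python
def utf8_char_at(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of valid UTF-8 data by scanning."""
    if n < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1      # index of the code point currently being scanned
    start = None    # byte offset where the wanted code point begins
    for i, byte in enumerate(data):
        # Bytes of the form 10xxxxxx continue a character;
        # every other byte starts a new code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


encoded = "myöhä".encode("utf-8")   # 7 bytes for 5 characters
third = utf8_char_at(encoded, 2)    # has to scan past 'm' and 'y' first
```

A fixed-width representation can jump straight to the wanted position instead, which is the trade-off the post refers to.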
| From | Jussi Piitulainen <jussi.piitulainen@helsinki.fi> |
|---|---|
| Date | 2016-03-18 10:26 +0200 |
| Message-ID | <lf5y49gw5s9.fsf@ling.helsinki.fi> |
| In reply to | #105182 |
Ian Kelly writes:
> On Thu, Mar 17, 2016 at 1:21 PM, Rick Johnson
> <rantingrickjohnson@gmail.com> wrote:
>> In the event that i change my mind about Unicode, and/or for
>> the sake of others, who may want to know, please provide a
>> list of languages that *YOU* think handle Unicode better than
>> Python, starting with the best first. Thanks.
>
> jmf has been asked this before, and as I recall he seems to feel that
> UTF-8 should be used for all purposes, ignoring the limitations of
> that encoding such as that indexing becomes a O(n) operation. He has
> pointed at Go as an example of a language wherein Unicode "just
> works", although I think that others do not necessarily agree [1].
...
> [1] https://coderwall.com/p/k7zvyg/dealing-with-unicode-in-go
I think Julia's way of dealing with its strings-as-UTF-8 [2] is more
promising. Indexing is by bytes (1-based in Julia) but the value at a
valid index is the whole UTF-8 character at that point, and an invalid
index raises an exception.
The letters "ö" and "ä" are two bytes each in UTF-8.
julia> s = "myöhä"
"myöhä"
julia> s[3]
'ö'
julia> s[4]
ERROR: UnicodeError: invalid character index
in next at ./unicode/utf8.jl:65
in getindex at strings/basic.jl:37
julia> s[5]
'h'
Julia provides access to the next character at an index and the valid
index after that:
julia> next(s, 3)
('ö',5)
The last valid index:
julia> endof(s)
6
Special syntax to index at the end of a string:
julia> s[end - 1:end]
"hä"
That's not quite right. The penultimate character happened to be one
byte, so it worked. At least incorrect indexing results in an exception
rather than an incorrect value. There is a proper method to get a
previous valid index - I should have used that.
Also, the length of a string is the number of characters rather than
bytes, decoupled from the indexing.
julia> length("myöhä")
5
I work with text all the time, but I don't think I ever _need_ arbitrary
access to an nth character. What I require is access to the start and
end of a string, searching, and splitting. These all seem compatible
with using UTF-8 representations. Same with iterating over the string
(forward or backward).
Just in case: I've been quite happy with Unicode in Python 3. It's just
interesting to see a different way that also seems to work.
[2] http://docs.julialang.org/en/release-0.4/manual/strings/
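For readers without Julia at hand, the byte-indexed scheme described in this post can be approximated in Python. This is a rough sketch under stated assumptions, not Julia's implementation: `ByteIndexedText` is a made-up class, offsets are 0-based here (Julia's are 1-based), and an invalid offset raises `IndexError` rather than Julia's `UnicodeError`.

```python
class ByteIndexedText:
    """Text stored as UTF-8 and indexed by byte offset (hypothetical sketch)."""

    def __init__(self, text: str):
        self._data = text.encode("utf-8")

    def next(self, i: int):
        """Return (character starting at byte offset i, offset of the next one)."""
        data = self._data
        if not 0 <= i < len(data):
            raise IndexError("byte offset out of range")
        if data[i] & 0xC0 == 0x80:
            # A continuation byte: the offset points into the middle of a character.
            raise IndexError("byte offset %d is inside a character" % i)
        j = i + 1
        while j < len(data) and data[j] & 0xC0 == 0x80:
            j += 1
        return data[i:j].decode("utf-8"), j

    def __getitem__(self, i: int) -> str:
        char, _ = self.next(i)
        return char

    def __len__(self) -> int:
        # Length counts characters, not bytes, as in the Julia example.
        return sum(1 for b in self._data if b & 0xC0 != 0x80)


s = ByteIndexedText("myöhä")
pair = s.next(2)   # the character at byte offset 2 and the next valid offset
```

Access at a valid offset touches only the bytes of one character, so it does not depend on the string length; the cost is that valid offsets are not consecutive integers.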
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-18 17:26 +0200 |
| Message-ID | <87twk3oli0.fsf@elektro.pacujo.net> |
| In reply to | #105189 |
Michael Torrie <torriem@gmail.com>:
> On 03/18/2016 02:26 AM, Jussi Piitulainen wrote:
>> I think Julia's way of dealing with its strings-as-UTF-8 [2] is more
>> promising. Indexing is by bytes (1-based in Julia) but the value at a
>> valid index is the whole UTF-8 character at that point, and an
>> invalid index raises an exception.
>
> This seems to me to be a leaky abstraction.
It may be that Python's Unicode abstraction is an untenable illusion
because the underlying reality is 8-bit and there's no way to hide it
completely.
There's no problem providing pure Unicode strings. Things get iffy when
Python's OS abstraction pretends sys.stdin is text or filenames are
strings.
> Julia's approach is interesting, but it strikes me as somewhat broken
> as it pretends to do O(1) indexing, but in reality it's still O(n)
If the underlying encoding is 8-bit, converting it to an O(1) structure
would still be O(n).
Marko
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-03-19 03:58 +1100 |
| Message-ID | <56ec33a2$0$1593$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #105224 |
On Sat, 19 Mar 2016 02:26 am, Marko Rauhamaa wrote:
> Michael Torrie <torriem@gmail.com>:
>
>> On 03/18/2016 02:26 AM, Jussi Piitulainen wrote:
>>> I think Julia's way of dealing with its strings-as-UTF-8 [2] is more
>>> promising. Indexing is by bytes (1-based in Julia) but the value at a
>>> valid index is the whole UTF-8 character at that point, and an
>>> invalid index raises an exception.
>>
>> This seems to me to be a leaky abstraction.
>
> It may be that Python's Unicode abstraction is an untenable illusion
> because the underlying reality is 8-bit and there's no way to hide it
> completely.
>
> There's no problem providing pure Unicode strings. Things get iffy when
> Python's OS abstraction pretends sys.stdin is text or filenames are
> strings.
The abstraction only breaks because of historical reasons. In Linux and
Unix systems, the underlying file system actually allows any arbitrary
byte strings (with a small number of restrictions, such as disallowing
ASCII NUL and / (slash) bytes). But modern applications try to pretend
that the file system is actually UTF-8. That would work fine if people
*only* accessed the file system with such tools that used UTF-8. But
they don't.
On Windows, the file system is either UTF-16 or UCS-2, I'm not sure
which. But the NTFS file system itself enforces that all file names are
valid UTF-16 (or the other one). Since all valid UTF-16 strings are
valid Unicode (by definition), there's no problem there.
However, the problem on Windows is not the underlying file system, but
the Explorer interface. It still uses old legacy encodings, and
localises them in different countries, so it is inevitable that people
end up with mojibake file names.
>> Julia's approach is interesting, but it strikes me as somewhat broken
>> as it pretends to do O(1) indexing, but in reality it's still O(n)
>
> If the underlying encoding is 8-bit, converting it to an O(1) structure
> would still be O(n).
Yes, but you only need to do that once, on input, or at most twice, on
input and output, not on every operation.
--
Steven
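The "arbitrary byte strings" problem, and the convert-once-on-input idea, can be sketched with Python's `surrogateescape` error handler, which is what `os.fsdecode` and `os.fsencode` use on POSIX systems. The file name below is invented for the example, and `decode_filename` is a hypothetical helper; the sketch decodes explicitly as UTF-8 rather than relying on the platform's file system encoding.

```python
def decode_filename(raw: bytes) -> str:
    """Decode file name bytes as UTF-8, keeping undecodable bytes recoverable."""
    # Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
    return raw.decode("utf-8", errors="surrogateescape")


raw = b"caf\xe9.txt"        # Latin-1 encoded name: not valid UTF-8
name = decode_filename(raw)  # decoded once, on input
restored = name.encode("utf-8", errors="surrogateescape")  # encoded once, on output

try:
    name.encode("utf-8")     # strict encoding refuses the smuggled byte
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The round trip is lossless, but the resulting `str` is not well-formed Unicode text, which is one way the legacy "shines through" the abstraction.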
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-18 23:02 +0200 |
| Message-ID | <87k2kzo5y5.fsf@elektro.pacujo.net> |
| In reply to | #105224 |
Chris Angelico <rosuav@gmail.com>:
> On Sat, Mar 19, 2016 at 2:26 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> It may be that Python's Unicode abstraction is an untenable illusion
>> because the underlying reality is 8-bit and there's no way to hide it
>> completely.
>
> The underlying reality is 1-bit. Or maybe the underlying reality is
> actually electrical signals that don't even have a clear definition of
> "bits" and bounce between two states for a few fractions of a second
> before settling. And maybe someone's implementing Python on the George
> Banks Kite CPU, which consists of two cents' worth of paper and
> string, on which text is actually represented by glyph. They're all
> equally valid notions of "underlying reality".
>
> Text is an abstract concept, just as numbers are.
The question is how tenable the illusion is. If the OS gave the
appropriate guarantees (say, all pathnames are encoded Unicode strings),
the abstraction could be maintained. Unfortunately, the legacy shines
through making you wonder if Python has overreached prematurely with its
Unicode HAL.
Marko
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-18 23:28 +0200 |
| Message-ID | <87bn6bo4q7.fsf@elektro.pacujo.net> |
| In reply to | #105240 |
Chris Angelico <rosuav@gmail.com>:
> The problem is not Python's Unicode strings, then. The problem is the
> notion that path names are text. If they're text, they should be
> exclusively text (although, for low-level efficiency, they're more
> likely to be defined as "valid UTF-8 sequences" rather than "sequences
> of Unicode codepoints"); since they're not, they are fundamentally
> bytes. But that's not a problem with Python - it's a problem with the
> file system.
The file system does not have a problem. Python has a problem because it
tries to present pathnames as Unicode strings, which isn't always
possible.
The standard input and output are even more problematic because Python
very strongly "wishes" them to be Unicode streams.
If I were to start a new OS today, I would very much like to placate the
likes of Python. Unfortunately, the sins of our forefathers cannot be
wished away.
Anyway, Python is careful not to paint itself in a corner. It gives you
everything you need to break the abstraction and go low-level. It even
offers regular expressions and ASCII syntax for bytes objects!
For example, Guile 2.x is trying to emulate Python's progressive
approach, but doesn't offer such amenities. Thus, Python's
   b'hi!'
is
   #vu8(104 105 33)
or
   (use-modules (rnrs bytevectors))
   (string->utf8 "hi!")
in Guile.
Marko
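The "amenities" mentioned here, ASCII literal syntax and regular expressions for bytes objects, look like this in practice. It is a short sketch: the path and the pattern are invented for the example, and only the standard `re` module is used.

```python
import re

# ASCII literal syntax for bytes: the same three bytes as Guile's #vu8(104 105 33).
greeting = b"hi!"
same_bytes = bytes([104, 105, 33])

# A bytes pattern works on raw path bytes, including bytes that are not valid UTF-8.
txt_name = re.compile(rb"[^/]+\.txt$")
match = txt_name.search(b"/tmp/caf\xe9.txt")
basename = match.group() if match else None
```

A bytes pattern can only be applied to bytes, and a str pattern only to str, so staying at the byte level is an explicit choice rather than an accident.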
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-19 00:03 +0200 |
| Message-ID | <877fgzo34t.fsf@elektro.pacujo.net> |
| In reply to | #105241 |
Chris Angelico <rosuav@gmail.com>:
> On Sat, Mar 19, 2016 at 8:28 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> The file system does not have a problem. Python has a problem because it
>> tries to present pathnames as Unicode strings, which isn't always
>> possible.
>
> But what does a file name *mean*?
A Linux/UNIX file name is an extended ASCII string, where the
interpretation of bytes in the range 128..255 is left ambiguous.
That's the legacy of the early 1980's. At that time 8-bit bytes were
standard, and the parity nonsense was virtually gone. C, Emacs and the
OS supported those bytes without a problem but treated them as some sort
of control characters (they were represented with the octal \nnn
notation).
Some systems used the upper byte range for block graphics (CP/M). Some
systems used the upper byte range to represent Hebrew letters (Atari).
Then came ISO-8859-x and the locales (yuck!). Sun scrambled to make
SunOS "8-bit clean". ISO-8859-1 was widely taken as the default for the
Civilized World. Pathnames reflected that colonialist mindset.
ISO-8859-1 was the state of the art around 1995 (HTML). UCS-2 was the
avant-garde adopted by Windows and Java. UTF-8 came later, and Linux
luckily avoided the UCS-2 mess.
All that "extended ASCII" legacy is still the reality on Linux and won't
go away in the foreseeable future.
I suppose OSX is the only mainstream operating system that had the full
benefit of hindsight. And even they messed it up with case-insensitive
pathnames.
> If I were building an entire OS ecosystem from scratch today, I'd
> probably do a lot of things with a hybrid system of documented meaning
> atop implementation-detail APIs. In this particular case, I would
> define the API in terms of byte sequences, but clearly documenting
> that these byte sequences are to be understood to mean text strings,
> and thus must be valid UTF-8.
UTF-8 shouldn't have anything to do with the abstract pathnames (which
should be normalized Unicode).
Also, special-casing '\0' and '/' is lame. Why can't I have "Results
1/2016" as a filename?
Marko
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-19 09:49 +0200 |
| Message-ID | <871t76oqje.fsf@elektro.pacujo.net> |
| In reply to | #105243 |
Random832 <random832@fastmail.com>:
> On Fri, Mar 18, 2016, at 20:55, Chris Angelico wrote:
>> On Sat, Mar 19, 2016 at 9:03 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> > Also, special-casing '\0' and '/' is
>> > lame. Why can't I have "Results 1/2016" as a filename?
>>
>> Would you be allowed to have a directory named "Results 1" as well?
>
> If I were designing a new operating from scratch and didn't have to be
> compatible with anything, I would probably have pathnames be tuples of
> strings (maybe represented at the low level with percent-escaping),
> rather than having a directory separator.
Speaking of the low level, the classic UNIX file system doesn't make use
of pathnames. Rather, the files are nameless. They are identified by the
device (= file system) number plus the inode number.
Some files are directories, dir objects if you will, that map filenames
to inode numbers. The file system enforces the limitation that the
filenames (directory keys) cannot contain '\0' or '/' ASCII characters.
The entries of the directory are called (hard) links.
A pathname is a clumsy proxy for a file because the file system may be
modified between references through renames or deletions. What you'd
want is a reference object that maintains a reference count on the
inode. Of course, you could create a hard link on the fly, but the
operating system doesn't clear the link automatically when the client
process goes away nor does it prevent other processes from tampering
with the link.
You could open the file and use the file descriptor as such a reference
object. However, the process may not have access rights to open the
file. UNIX forces you to open a file with O_RDONLY, O_WRONLY or O_RDWR;
you'd need an O_REF option that doesn't allow you any I/O access to the
file but allows you to refer to a file.
Marko
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-19 10:22 +0200 |
| Message-ID | <87pouqnahf.fsf@elektro.pacujo.net> |
| In reply to | #105252 |
Chris Angelico <rosuav@gmail.com>:
> On Sat, Mar 19, 2016 at 6:49 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Speaking of the low level, the classic UNIX file system doesn't make
>> use of pathnames. Rather, the files are nameless. They are identified
>> by the device (= file system) number plus the inode number.
>
> Not entirely fair. A file system has directories in it, which have
> names in them referencing other inodes. So while you can get to the
> contents of the file given only its inode, but the path names are very
> much a part of the file system too.
Not all files have pathnames. Those that do have numerous pathnames. You
can't tell by looking at a file what pathnames, if any, it might have.
You need an exhaustive, recursive search of the file system for the
reverse mapping.
If you execute the commands:
   echo hello >hello
   rm hello
You don't know for sure if the file you removed was the file you created
on the previous line.
Marko
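The gap between a pathname and the file it once named can be reproduced from Python. This is a minimal sketch that assumes a POSIX system (unlinking an open file generally fails on Windows); `name_versus_file` is a hypothetical helper, and the file name "hello" simply mirrors the shell example.

```python
import os
import tempfile


def name_versus_file():
    """Show that an open descriptor follows the inode, not the pathname."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "hello")
        with open(path, "w") as f:
            f.write("hello\n")
        fd = os.open(path, os.O_RDONLY)       # a reference to the inode
        try:
            original = os.fstat(fd)
            os.unlink(path)                   # the name goes, the inode stays
            with open(path, "w") as f:        # someone reuses the name
                f.write("other\n")
            replacement = os.stat(path)
            same_file = os.path.samestat(original, replacement)
            content = os.read(fd, 100)        # still the first file's data
            return same_file, content
        finally:
            os.close(fd)


same_file, content = name_versus_file()
```

Here the descriptor plays the role of the reference object discussed in the thread, with the limitation already noted: obtaining it requires permission to open the file.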
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-19 11:40 +0200 |
| Message-ID | <87h9g2n6um.fsf@elektro.pacujo.net> |
| In reply to | #105254 |
Chris Angelico <rosuav@gmail.com>:
> On Sat, Mar 19, 2016 at 7:22 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Not all files have pathnames. Those that do have numerous pathnames. You
>> can't tell by looking at a file what pathnames, if any, it might have.
>> You need an exhaustive, recursive search of the file system for the
>> reverse mapping.
>>
>> If you execute the commands:
>>
>> echo hello >hello
>> rm hello
>>
>> You don't know for sure if the file you removed was the file you created
>> on the previous line.
>
> Not all objects in Python have names bound to them. Those that do may
> have multiple. You can't tell, by looking at an object, what pathnames
> it has. You need an exhaustive, recursive search of all namespaces for
> the reverse mapping.
>
> If you execute the commands:
>
> hello = re.compile("[Hh][Ee][Ll][Ll][Oo]")
> hello.match(msg)
>
> you don't know for sure if the object you called a method on was the
> one you created on the previous line.
>
> So are object names not part of Python?
What are you talking about, Chris?
Point is, UNIX file systems don't provide for inode handles. You have
pathnames and file descriptors, but you'd need something between the
two. The omission plagues UNIX application programmers and is the cause
of numerous bugs -- even when the programmers are not aware of the
danger.
Objects in Python have trivial references: just assign the object to a
variable.
Marko
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-03-19 19:38 +1100 |
| Message-ID | <56ed0fef$0$1596$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #105243 |
On Sat, 19 Mar 2016 01:30 pm, Random832 wrote:
> On Fri, Mar 18, 2016, at 20:55, Chris Angelico wrote:
>> On Sat, Mar 19, 2016 at 9:03 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> > Also, special-casing '\0' and '/' is
>> > lame. Why can't I have "Results 1/2016" as a filename?
>>
>> Would you be allowed to have a directory named "Results 1" as well?
>
> If I were designing a new operating from scratch and didn't have to be
> compatible with anything, I would probably have pathnames be tuples of
> strings (maybe represented at the low level with percent-escaping),
> rather than having a directory separator.
   ls -l /home/user/documents/stuff/foo
   ls -l "home","user","documents","stuff","foo"
I think users of command line tools and shells will hate you.
--
Steven
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2016-03-19 00:14 -0700 |
| Message-ID | <434ba3c9-5391-46bc-86cf-cab5a77e24bd@googlegroups.com> |
| In reply to | #105240 |
---------
"Demonstrate", illustrate:
The shorter the strings are, the more memory is wasted.
(Interestingly, this does correspond to the "real world".)
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2016-03-19 02:17 -0700 |
| Message-ID | <ef627c9e-328f-4be6-8cef-6384bb5620df@googlegroups.com> |
| In reply to | #105251 |
On Saturday, 19 March 2016 at 08:14:41 UTC+1, wxjm...@gmail.com wrote:
> ---------
>
> "Demonstrate", illustrate:
> The shorter the strings are, the more memory is wasted.
>
> (Interestingly, this does correspond to the "real world".)
Addendum
I forgot something.
When attempting to model mathematically the memory
consumption as a function of the length of the strings
*and* the number of strings, do not forget to take
into account the very short strings (the limit)
containing a single "char".
There is some kind of discontinuity, due to the
fact that in the *string type* everything is flexible.
>>> sys.getsizeof('')
25
>>> sys.getsizeof('a')
26
---
A purist will always initialize a string with
an ascii character.
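The figures quoted above depend on the build; they look like those of a 32-bit CPython 3.3, and a 64-bit build reports larger fixed overheads. A build-independent way to look at the same question is to measure the marginal cost of one more character, as in this sketch. It assumes CPython 3.3 or later (PEP 393 strings); `bytes_per_char` is a hypothetical helper and the sample characters are arbitrary.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Marginal storage cost of one more copy of ch in a long string."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


ascii_cost = bytes_per_char("a")            # code point below U+0080
latin1_cost = bytes_per_char("é")           # code point below U+0100
bmp_cost = bytes_per_char("€")              # code point below U+10000
astral_cost = bytes_per_char("\U0001F600")  # code point outside the BMP

# The fixed per-object overhead is what dominates for very short strings.
empty_overhead = sys.getsizeof("")
```

The per-character width is chosen from the widest code point in the string, so the overhead that matters for short strings is the fixed header, not the character data.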
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-03-19 19:14 +1100 |
| Message-ID | <56ed0a71$0$1607$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #105240 |
On Sat, 19 Mar 2016 08:08 am, Chris Angelico wrote:
> On Sat, Mar 19, 2016 at 8:02 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Chris Angelico <rosuav@gmail.com>:
>>> On Sat, Mar 19, 2016 at 2:26 AM, Marko Rauhamaa <marko@pacujo.net>
>>> wrote:
>>>> It may be that Python's Unicode abstraction is an untenable illusion
>>>> because the underlying reality is 8-bit and there's no way to hide it
>>>> completely.
>>>
>>> The underlying reality is 1-bit. Or maybe the underlying reality is
>>> actually electrical signals that don't even have a clear definition of
>>> "bits" and bounce between two states for a few fractions of a second
>>> before settling. And maybe someone's implementing Python on the George
>>> Banks Kite CPU, which consists of two cents' worth of paper and
>>> string, on which text is actually represented by glyph. They're all
>>> equally valid notions of "underlying reality".
>>>
>>> Text is an abstract concept, just as numbers are.
>>
>> The question is how tenable the illusion is. If the OS gave the
>> appropriate guarantees (say, all pathnames are encoded Unicode strings),
>> the abstraction could be maintained. Unfortunately, the legacy shines
>> through making you wonder if Python has overreached prematurely with its
>> Unicode HAL.
>
> The problem is not Python's Unicode strings, then. The problem is the
> notion that path names are text. If they're text, they should be
> exclusively text (although, for low-level efficiency, they're more
> likely to be defined as "valid UTF-8 sequences" rather than "sequences
> of Unicode codepoints"); since they're not, they are fundamentally
> bytes. But that's not a problem with Python - it's a problem with the
> file system.
One thing that NTFS gets right is that all path names are guaranteed to
be well-formed, valid Unicode. I believe that they are stored in UTF-16,
and unlike the ext file systems used on Linux, they are not arbitrary
bytes.
I believe that HFS+ on Apple Macs goes one step further and guarantees
that paths are always fully normalised, so that it's impossible to have
(e.g.) two files ã (U+00E3 LATIN SMALL LETTER A WITH TILDE) and ã
(U+0061 LATIN SMALL LETTER A + U+0303 COMBINING TILDE) in the same
directory.
Unfortunately, backwards compatibility is holding Linux file systems
back...
--
Steven
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-19 11:31 +0200 |
| Message-ID | <87lh5en79a.fsf@elektro.pacujo.net> |
| In reply to | #105253 |
Steven D'Aprano <steve@pearwood.info>:

> One thing that NTFS gets right is that all path names are guaranteed
> to be well-formed, valid Unicode. I believe that they are stored in
> UTF-16, and unlike the ext file systems used on Linux, they are not
> arbitrary bytes.

<URL: https://msdn.microsoft.com/en-us/library/windows/desktop/dd317748%28v=vs.85%29.aspx>
states that NTFS filenames disallow '\', '/', '.', '?', '*' as well as
'¥'. Apparently the ban on the yen symbol isn't enforced by the FS.

I haven't found a direct statement whether NTFS internally enforces the
soundness of UTF-16 or if it is simply doing UCS-2.

<URL: https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069%28v=vs.85%29.aspx>:

   Using the surrogate mechanism, UTF-16 can support all 1,114,112
   potential Unicode characters.

But Unicode doesn't contain 1,114,112 characters; the surrogates are
excluded from Unicode, and definitely cannot be encoded using
UTF-anything.

Furthermore, the page notes:

   Note: Windows 2000 introduces support for basic input, output, and
   simple sorting of supplementary characters. However, not all system
   components are compatible with supplementary characters.

(Somewhat related, Python doesn't enforce the soundness of Unicode
because Python allows surrogate code points in strings.)

> I believe that HFS+ on Apple Macs goes one step further and guarantees
> that paths are always fully normalised, so that it's impossible to
> have (e.g.) two files ã (U+00E3 LATIN SMALL LETTER A WITH TILDE) and ã
> (U+0061 LATIN SMALL LETTER A + U+0303 COMBINING TILDE) in the same
> directory.
>
> Unfortunately, backwards compatibility is holding Linux file systems
> back...

Linux got lucky by not jumping the gun. We are still waiting for the
dust to settle.

Unicode made several (understandable but grave) mistakes along the way:

 * UCS-2
 * supplementary code points
 * BOM
 * endianness
 * normalization

We still don't know if the final result will be UCS-4 everywhere (with
all 2**32 code points allowed?!) or UTF-8 everywhere.

Marko
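The point about surrogates can be checked from a Python 3 prompt. The following is only a minimal sketch: the names `CODE_POINTS`, `SURROGATES`, `SCALAR_VALUES` and `can_encode` are made up for illustration and do not come from the post. It counts how many of the 1,114,112 code points are surrogates and shows that a `str` may hold a lone surrogate that the standard UTF codecs refuse to encode.

```python
# Code space is U+0000..U+10FFFF; U+D800..U+DFFF are reserved as surrogates.
CODE_POINTS = 0x10FFFF + 1
SURROGATES = 0xDFFF - 0xD800 + 1
SCALAR_VALUES = CODE_POINTS - SURROGATES  # what UTF-8/16/32 can actually encode

# Python accepts a lone surrogate code point inside a str.
lone = "\ud800"


def can_encode(text, encoding):
    """Return True if text encodes under the codec's default (strict) handler."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(CODE_POINTS, SURROGATES, SCALAR_VALUES)
print(can_encode(lone, "utf-8"), can_encode(lone, "utf-16"))

# The 'surrogatepass' error handler is an explicit opt-out of that check.
print(len(lone.encode("utf-8", "surrogatepass")))
```

The strict failure is the codec enforcing well-formedness at the encoding boundary, which is a separate matter from what the `str` type itself permits.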
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2016-03-19 03:40 -0700 |
| Message-ID | <72bdbf36-7b0c-4979-8261-8928dd1d5715@googlegroups.com> |
| In reply to | #105259 |
On Saturday, 19 March 2016 at 10:31:56 UTC+1, Marko Rauhamaa wrote:

> We still don't know if the final result will be UCS-4 everywhere (with
> all 2**32 code points allowed?!) or UTF-8 everywhere.
>

A partial answer. You are most probably using UTF-32 every day without
knowing it, simply because you are using fonts.

jmf
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-19 13:07 +0200 |
| Message-ID | <87vb4ilo8g.fsf@elektro.pacujo.net> |
| In reply to | #105259 |
Chris Angelico <rosuav@gmail.com>:

> On Sat, Mar 19, 2016 at 8:31 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Unicode made several (understandable but grave) mistakes along the way:
>>
>>  * normalization
>
> Elaborate please? What's such a big mistake here?

Unicode shouldn't have allowed multiple equivalent variants for a
string.

Now Python falls victim to:

   >>> '\u006e\u0303' == '\u00f1'
   False

<URL: https://en.wikipedia.org/wiki/Unicode_equivalence>:

   For example, the code point U+006E (the Latin lowercase "n") followed
   by U+0303 (the combining tilde "◌̃") is defined by Unicode to be
   canonically equivalent to the single code point U+00F1 (the lowercase
   letter "ñ" of the Spanish alphabet). Therefore, those sequences
   should be displayed in the same manner, should be treated in the same
   way by applications such as alphabetizing names or searching, and may
   be substituted for each other.

Marko
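For comparison, here is a minimal sketch of how the two spellings can be made to compare equal in Python, using the standard library's `unicodedata.normalize`. The helper name `canonically_equal` is hypothetical, and NFC is just one reasonable choice of form (NFD would work equally well for this comparison).

```python
import unicodedata

decomposed = "\u006e\u0303"   # 'n' followed by COMBINING TILDE
precomposed = "\u00f1"        # LATIN SMALL LETTER N WITH TILDE


def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(decomposed == precomposed)                   # plain code point comparison
print(canonically_equal(decomposed, precomposed))  # comparison up to canonical equivalence
```

Plain `==` compares code point sequences, so any equivalence-aware comparison has to be requested explicitly, which is the behaviour the post is complaining about.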
| From | BartC <bc@freeuk.com> |
|---|---|
| Date | 2016-03-19 12:24 +0000 |
| Message-ID | <ncjg8e$7lj$1@dont-email.me> |
| In reply to | #105262 |
On 19/03/2016 11:07, Marko Rauhamaa wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>> On Sat, Mar 19, 2016 at 8:31 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>>> Unicode made several (understandable but grave) mistakes along the way:
>>>
>>>  * normalization
>>
>> Elaborate please? What's such a big mistake here?
>
> Unicode shouldn't have allowed multiple equivalent variants for a
> string.
>
> Now Python falls victim to:
>
>    >>> '\u006e\u0303' == '\u00f1'
>    False
>
> <URL: https://en.wikipedia.org/wiki/Unicode_equivalence>:
>
>    For example, the code point U+006E (the Latin lowercase "n") followed
>    by U+0303 (the combining tilde "◌̃") is defined by Unicode to be
>    canonically equivalent to the single code point U+00F1 (the lowercase
>    letter "ñ" of the Spanish alphabet). Therefore, those sequences
>    should be displayed in the same manner, should be treated in the same
>    way by applications such as alphabetizing names or searching, and may
>    be substituted for each other.
>

So a string that looks like:

"ññññññññññññññññññññññññññññññññññññññññññññññññññ"

can have 2**50 different representations? And occupy somewhere between
50 and 200 bytes? Or is that 400?

OK...

-- 
Bartc
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-19 14:43 +0200 |
| Message-ID | <87k2kyljte.fsf@elektro.pacujo.net> |
| In reply to | #105264 |
BartC <bc@freeuk.com>:

> So a string that looks like:
>
> "ññññññññññññññññññññññññññññññññññññññññññññññññññ"
>
> can have 2**50 different representations? And occupy somewhere between
> 50 and 200 bytes? Or is that 400?
>
> OK...

You are on the right track!

Marko
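The byte counts in question can be worked out directly. This is a rough sketch, assuming the 50-character string is built from U+00F1 and that each ñ may independently be spelled precomposed or as n plus a combining tilde; the variable names are invented for illustration.

```python
import unicodedata
from itertools import product

composed = "\u00f1" * 50                              # 50 precomposed code points
decomposed = unicodedata.normalize("NFD", composed)   # 100 code points

# Encoded size of each normalization form under each UTF (no BOM).
sizes = {}
for form_name, text in (("NFC", composed), ("NFD", decomposed)):
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        sizes[(form_name, encoding)] = len(text.encode(encoding))

# The 50-byte lower bound is only reached with a legacy 8-bit encoding.
latin1_size = len(composed.encode("latin-1"))

# Each character has two canonically equivalent spellings, so n characters
# give 2**n distinct code point sequences; checked here for n = 3.
variants = {"".join(c) for c in product(("\u00f1", "n\u0303"), repeat=3)}
all_equivalent = all(
    unicodedata.normalize("NFC", v) == "\u00f1" * 3 for v in variants
)

for key, size in sorted(sizes.items()):
    print(key, size)
print(latin1_size, len(variants), all_equivalent)
```

Under these assumptions the string takes 50 bytes in Latin-1, 100 to 150 in UTF-8, 100 to 200 in UTF-16 and 200 to 400 in UTF-32, so both the 200 and the 400 figure turn up depending on the encoding and normalization form.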