Message-ID: <4F4D987B.5060007@SPAM.comp-arch.net>
Date: Tue, 28 Feb 2012 19:16:11 -0800
From: "Andy (Super) Glew"
Reply-To: andy@SPAM.comp-arch.net
Organization: comp-arch.net
Newsgroups: comp.arch
Subject: Re: M68k add to memory is not a mistake any more
References: <4F49DE8B.9060104@SPAM.comp-arch.net> <4F4BBB9B.7050300@SPAM.comp-arch.net>

On 2/27/2012 11:47 AM, nmm1@cam.ac.uk wrote:
> In article <4F4BBB9B.7050300@SPAM.comp-arch.net>,
> Andy (Super) Glew wrote:
>
> The question of whether such a thread is holding a lock is solved
> better under this scheme, because at least you can recover the lock
> when the process is killed.  Systems that have had uncooperative lock
> recovery have never been a great success.

What sort of lock recovery does UNIX / Linux have, generically?
Little, as far as I know; however, folks like Oracle have stuck lock
recovery on top.
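The nearest thing to generic lock recovery I can point to is the POSIX
robust mutex, which Linux implements with kernel help.  A minimal
sketch, eliding the shared-memory setup that a cross-process lock would
actually need:

/* Sketch: lock recovery via a POSIX robust mutex.  If a process dies
 * while holding the lock, the next pthread_mutex_lock() returns
 * EOWNERDEAD; the caller then owns the lock, repairs whatever the
 * dead owner left half-updated, and marks the mutex consistent. */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock;    /* in real use, placed in shared memory */

void init_robust_lock(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

int lock_with_recovery(void)
{
    int rc = pthread_mutex_lock(&lock);
    if (rc == EOWNERDEAD) {
        fprintf(stderr, "previous owner died; recovering lock\n");
        /* ... repair the data the lock protects ... */
        pthread_mutex_consistent(&lock);
        rc = 0;
    }
    return rc;                  /* 0: locked; otherwise a real error */
}

Of course this recovers only the lock, not the invariants of the data
it protected - the repair step is still the application's problem,
which is perhaps the point about uncooperative recovery.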
> The point about lots of small, simple cores (including one per
> relevant device) is that one doesn't even need the hardware
> cooperative multitasking, but that's more of a simplification and
> optimisation than critical.  However, it IS where I came in.

But you do need some way to regain control of cores on which software
has run out of control.  That's either an interrupt, or a way for code
running on some other core to force the runaway into some known state -
possibly killing whatever software is running on it.  In
http://semipublic.comp-arch.net/wiki/Can_asynchronous_interrupts_be_completely_eliminated%3F
I call that "remote control".

There may be some simplification in allowing only destructive
preemption.  On the other hand, when something has run away and has to
be killed, I like to be able to see what it was doing.  I.e. I like to
be able to save its state.  And if you have a way of restoring the
saved state...

>> Also, you keep saying that asynchronous interrupts have proven to be
>> a bad idea.  But I do not recall ever seeing a real example of such
>> proof.
>>
>> Anyway: I would love to see a credible example of why you think
>> interrupts have such major RAS problems.
>>
>> BTW, I agree that many *implementations* of interrupts have RAS
>> problems.  Especially those that try to optimize state saving.  But
>> as far as I can tell the basic idea of interrupts - act AS IF you
>> have stopped at a single place in the code, with precise state - is
>> workable.  Of course, that requires the notion of a precise state.
>
> Right.  Let's go there.
>
> The first theoretical problem is that they introduce arbitrary
> aliasing at EVERY location in the interruptible process, between the
> process and the interrupt handler plus anything it calls (including
> another process in the same parallel application).  Like parallelism?
> Unfortunately, no, because there is no way of providing a proper,
> two-sided synchronisation mechanism, because the interrupted code
> can't execute until resumed.  One can't even use the simple 'set an
> in-use bit; update object; unset in-use bit' method, because that can
> lead to deadlock, for the same reason.  Well, actually, one can, but
> one has to assume that the kernel scheduler will then interrupt the
> second process to allow the first to run.  And don't even think of
> relying on time-aware critical sections and spin-loops!

Ah, now we are getting to the nub.

I agree: I have often found that the best thing to do in an interrupt
handler is to initiate a thread and return - and allow the thread to
compete using "normal" thread scheduling.

As we have discussed elsewhere, I am not averse to thinking about
having interrupts (really, I/O events) automatically spawn threads or
make threads runnable.  Or queue up messages for some thread.  And if
the thread happens to have a core to itself, fine.  Doing so might
remove the possibility of the FLIH being messed up.  But, again - this
is not a logically complete solution, if there is any possibility that
such a thread or core can run out of control.
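To make that concrete at user level - with a POSIX signal handler
standing in for a FLIH - the usual pattern is the self-pipe trick.  A
sketch, assuming the only thing the handler does is an
async-signal-safe write():

/* Sketch: "initiate a thread and return" at the signal level.  The
 * handler only writes a byte to a pipe; an ordinary worker thread,
 * scheduled like any other, does the real work. */
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static int event_pipe[2];       /* [0] = read end, [1] = write end */

static void flih(int signo)     /* "first-level" handler: do almost nothing */
{
    char c = (char)signo;
    (void)write(event_pipe[1], &c, 1);  /* write() is async-signal-safe */
}

static void *worker(void *arg)
{
    char c;
    (void)arg;
    for (;;) {
        ssize_t n = read(event_pipe[0], &c, 1);
        if (n == 1)
            printf("handling event for signal %d\n", c);  /* real work */
        else if (n < 0 && errno != EINTR)
            break;              /* real error; EINTR is just retried */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    struct sigaction sa = { 0 };
    sa.sa_handler = flih;
    sigemptyset(&sa.sa_mask);

    pipe(event_pipe);
    pthread_create(&tid, NULL, worker, NULL);
    sigaction(SIGUSR1, &sa, NULL);
    for (;;)
        pause();                /* wait to be signalled */
}

The point is the one above: the handler itself cannot safely
synchronise with anything, so it should not try - it hands off and
gets out.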
> So one has to absolutely forbid shareable updating (including on
> files), write some FIENDISHLY evil interrupt-safe synchronisation,
> pray that the scheduler does exactly what you want, or introduce an
> interrupt-masking facility - but a mistake with that means that you
> have the worst of both worlds.  If I recall, MVS added such a
> facility, had trouble, added an override, had trouble, added an
> override to the override, ....  That's why we have to use 'kill -9'
> so much, and often even 'sudo kill -9' :-(

Again, I agree: I have seen several block bits for supposedly
non-blockable interrupts.  Any architect who says "this facility will
never be interrupted" is deluding himself about anything beyond the
short term.  Virtual machines may need to be able to interrupt most OS
level "interrupt blocked" regions.  Which is okay ... unless there is
hardware that assumed that could never happen.

> But it also impacts RAS because the semantic analysis ('program
> proving') of unrestrained interrupts is virtually impossible, which
> also means that their semantics are virtually impossible to specify
> in languages and interfaces.  So they are left undefined, so no
> application that relies on them can be portable or reliable :-(  I
> know how it can be done, and Ada MIGHT do it, but nothing else that I
> know of does more than that.

This addresses the question somebody else - Stephen Fuld - asked about
what he acronymized as SSI/HCM (Semi-Synchronous Interrupts / HW
Cooperative Multitasking): why is it easier to be interruptible only
at certain places, not everywhere?  Part of the answer is ease of
analysis.

Much of this boils down to atomicity.  Very few machines have had ways
of letting user software define complex atomic regions.  Atomicity is
always relative: making code atomic relative to interrupts would
involve interrupt blocking, usually an OS level facility; atomicity
relative to other threads and processors is only ever cooperative,
using locks.

The great hope of mechanisms such as Transactional Memory, whether HW
or SW, is that they might give us a way of creating atomicity without
having the cooperation of all parties involved.

>> By the way, I did NOT know that there were places that were unsafe,
>> even with a null signal handler.  What are these places?
>
> Any system where even SIG_IGN is handled by calling the application
> back and having it ignore it can trigger that.  The first problem is
> that the run-time system must get into an environment in which it can
> at least execute code and access its global data, and most operating
> systems are NOT cooperative!  The second is that it must complete and
> return without having any of the handler's state propagated back to
> the interrupted process - and the same remark applies.  And the third
> is that, in POSIX unlike in MVS, system calls are usually terminated
> by any received signal, and must be restarted - and many are not
> idempotent!

Fair enough.  Many, many programs do not handle error returns from
system calls properly, let alone EINTR.  (A sketch of the usual EINTR
retry dance is at the end of this post.)  IMHO most programs would be
better off dying when such errors are thrown.  (Not quite a Freudian
slip there.)

> But it gets worse.  Because the hardware typically only does a small
> part of the job, the FLIH has to complete it, and (as you say) is
> very often optimised to not save and restore everything.  Now what
> gets optimised out?  Typically the fancier registers and state not
> used by the kernel, such as floating-point - the saving of that is
> left until there is a full-blown context switch.
>
> So that will fail only if (a) the interrupted program has important
> data in the unsaved state and (b) the interrupt handler calls a
> kernel thread that changes that state (improperly).  We are now
> talking about a VERY low-frequency, non-repeatable failure that
> occurs in an arbitrary user program.  I am one of 2-3 people I know
> of who has ever tracked such a bug down without using privilege, and
> my success rate is VERY low.  In most cases, even I didn't try.
>
> Even if someone does manage to track such a bug down, and print
> registers in two consecutive statements, how does he report it?  He
> first has to get the bug acknowledged - and remember that it's not
> repeatable - and then the hardware, operating system and language
> run-time system people will all blame each other.  And, in the
> absence of either program proving or a practical way of bug
> reporting, such bugs are unavoidable and almost immortal.

Fair enough.  Although this is consistent with my observation that
premature optimization is the cause of many, if not most, such
problems.
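P.S. For concreteness, the EINTR retry dance mentioned above - a
minimal sketch; note that the retry is only reasonable because read()
that has transferred no data yet is safely restartable:

/* Sketch: restarting a system call that a signal terminated with
 * EINTR.  Portable POSIX code has to do this (or set SA_RESTART) for
 * every "slow" call; many programs do not, which is the bug. */
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

ssize_t read_restarting(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);  /* interrupted, no data: retry */
    if (n < 0)
        abort();    /* any real error: die loudly rather than limp on */
    return n;
}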