Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gah4 <gah4@u.washington.edu>
Newsgroups: comp.compilers
Subject: Re: OoO, VLIW, Are there different programming languages that are compiled to the same intermediate language?
Date: Fri, 3 Feb 2023 19:13:17 -0800 (PST)
Organization: Compilers Central
Sender: johnl@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <23-02-015@comp.compilers>
References: <Adkz+TvWa4zLl8W9Qd6ovtClKZpZrA==> <23-01-078@comp.compilers> <23-02-001@comp.compilers> <23-02-007@comp.compilers> <23-02-011@comp.compilers>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="86800"; mail-complaints-to="abuse@iecc.com"
Keywords: architecture
Posted-Date: 03 Feb 2023 22:51:37 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <23-02-011@comp.compilers>
Xref: csiph.com comp.compilers:3363

On Friday, February 3, 2023 at 10:17:06 AM UTC-8, Anton Ertl wrote:

(snip, I wrote)

> >This would have been especially useful for Itanium, which
> >(mostly) failed due to problems with code generation.

> I dispute the latter claim. My take is that IA-64 failed because the
> original assumption that in-order performance would exceed OoO
> performance was wrong. OoO processors surpassed in-order CPUs; they
> managed to get higher clock rates (my guess is that this is due to
> them having smaller feedback loops) and they benefit from better
> branch prediction, which extends to 512-instruction reorder buffers on
> recent Intel CPUs, far beyond what compilers can achieve on IA-64.
> The death knell for IA-64 competitiveness was the introduction of SIMD
> instruction set extensions which made OoO CPUs surpass IA-64 even in
> those vectorizable codes where IA-64 had been competitive.

I got interested in OoO in the days of the IBM 360/91, which is early
in the line of OoO processors.  Among other things, the 91 has imprecise
interrupts, where the stored address is not necessarily that of the
instruction after the one that caused the interrupt.

But okay, the biggest failure of Itanium is that it was two years or
so behind schedule when it came out.  Part of the problem, as best I
remember, was the need to implement x86 instructions, too.

> >Since the whole idea is that the processor depends on the
> >code generator doing things in the right order. That is, out
> >of order execution, but determined at compile time. Failure
> >to do that meant failure for the whole idea.

> But essentially all sold IA-64 CPUs were variations of the McKinley
> microarchitecture as far as performance characteristics were
> concerned, especially during the time when IA-64 was still perceived
> as relevant. The next microarchitecture Poulson was only released in
> 2012 when IA-64 had already lost.

But is the whole idea of compile-time instruction scheduling the
cause of the failure, or just the way they did it?

The 360/91 had some interesting problems.  One is that it had 16-way
interleaved memory with a cycle time of 13 processor cycles, and the
goal of one instruction per clock cycle.  A bank stays busy for a full
memory cycle, so performance depends on the order of memory addresses,
which is hard to know at compile time.
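
A minimal sketch of that bank arithmetic, in Python.  This is a toy
model, not a description of the real 91 storage system: it assumes
addresses are counted in units of the interleave width, that a bank is
busy for the full 13 cycles after each access, and that at most one
access starts per cycle.  The function name and parameters are made up
for illustration.

```python
def cycles_for_accesses(addresses, banks=16, bank_busy=13):
    """Count cycles needed to start every access, one per cycle at best.

    Toy model (an assumption, not the real 360/91): an access to a bank
    that is still busy stalls until that bank is free again.
    """
    free_at = [0] * banks          # cycle at which each bank is free again
    cycle = 0
    for addr in addresses:
        bank = addr % banks        # low-order interleaving
        cycle = max(cycle, free_at[bank])   # stall while the bank is busy
        free_at[bank] = cycle + bank_busy
        cycle += 1                 # next access starts a cycle later at best
    return cycle


sequential = list(range(64))               # touches banks 0, 1, 2, ...
same_bank = [i * 16 for i in range(64)]    # stride 16: always bank 0

print(cycles_for_accesses(sequential))
print(cycles_for_accesses(same_bank))
```

In this model the sequential stream never stalls, since with 16 banks
and a 13 cycle busy time each bank is free again before it is needed.
The stride-16 stream hits the same bank every time and pays the full
memory cycle on each access.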

The slightly later 360/85, without the fancy OoO processor, but with
cache memory (the first machine with cache!) was about as fast
on real problems.

Otherwise, the 91 has a limited number of reservation stations,
limiting how far OoO it can go.  All OoO processors have a limit
to how far they can go. But the compiler does not have that limit.
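
A rough sketch of the window limit, in Python.  It is a simplified
model with assumed details: the hardware looks only at the oldest
`window` not-yet-issued instructions, issues at most `width` of them
per cycle, and dependences point only to earlier instructions.  Passing
a window as large as the whole program stands in for a scheduler that
can see everything, as a compiler can.  The names `run`, `window`, and
`width`, and the sample program, are all hypothetical.

```python
def run(program, window, width=1):
    """Return the cycle at which the last instruction completes.

    program: list of (latency, deps), where deps are indices of earlier
    instructions.  Only the oldest `window` unissued instructions are
    candidates for issue in any cycle (a toy model of a limited
    reservation-station or reorder window).
    """
    n = len(program)
    issued = [False] * n
    done_at = [None] * n           # completion cycle once issued
    remaining = n
    cycle = 0
    while remaining:
        candidates = [i for i in range(n) if not issued[i]][:window]
        slots = width
        for i in candidates:
            if slots == 0:
                break
            latency, deps = program[i]
            # ready when every producer has issued and already completed
            if all(issued[d] and done_at[d] <= cycle for d in deps):
                issued[i] = True
                done_at[i] = cycle + latency
                slots -= 1
                remaining -= 1
        cycle += 1
    return max(done_at) if n else 0


# Four slow loads (latency 10), each followed by its one-cycle use.
program = []
for k in range(4):
    program.append((10, []))           # load k
    program.append((1, [2 * k]))       # use of load k

print(run(program, window=2))              # small hardware window
print(run(program, window=len(program)))   # whole-program view
```

With the two-entry window only two of the loads overlap at a time,
while the whole-program view starts all four loads back to back before
any use is needed.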

Since transistors are cheap now, and one can throw a large
number of them into reorder buffers and such, one can build really
deep pipelines.

But the reason for bringing this up is that if Intel had defined an
intermediate code, supplied the back end that used it,
and, even more, could update that back end later, that would have
been very convenient for compiler writers.

Even more, design for it could have been done in parallel with the
processor, making both work well together.

Reminds me of the 8087 virtual register stack.  The 8087 has
eight registers, but they were supposed to be a virtual stack.
They would be copied to/from memory on stack overflow
or underflow.  But no-one tried to write the interrupt routine
until after the hardware was made, and it turns out not to
be possible.  I never knew why that wasn't fixed for the 287
or 387, though.
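
A sketch of the virtual stack idea as described above, in Python.  It
shows only the intended behavior (spill to memory on overflow, refill
on underflow), not what the 8087 actually does and not a real
interrupt routine.  The class and its names are hypothetical.

```python
from collections import deque


class VirtualStack:
    """Eight "hardware" slots backed by a spill area in memory."""

    def __init__(self, hw_slots=8):
        self.hw_slots = hw_slots
        self.regs = deque()        # right end is the top of stack
        self.spilled = []          # memory copy of values pushed out

    def push(self, value):
        if len(self.regs) == self.hw_slots:
            # overflow: move the bottom register to memory
            self.spilled.append(self.regs.popleft())
        self.regs.append(value)

    def pop(self):
        if not self.regs:
            if not self.spilled:
                raise IndexError("pop from empty virtual stack")
            # underflow: bring back the most recently spilled value
            self.regs.append(self.spilled.pop())
        return self.regs.pop()


stack = VirtualStack()
for v in range(12):                # four more values than hardware slots
    stack.push(v)
print([stack.pop() for _ in range(12)])
```

The caller sees an ordinary stack of any depth; only the handler knows
that there are just eight slots.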
[Multiflow found that VLIW compile-time instruction scheduling was
swell for code with predictable memory access patterns, much less so
for code with data-dependent access patterns. I doubt that has
changed.  And if the memory access is that predictable, you can
likely use SIMD instructions instead. -John]