Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gah4
Newsgroups: comp.compilers
Subject: Re: OoO, VLIW, Are there different programming languages that are compiled to the same intermediate language?
Date: Fri, 3 Feb 2023 19:13:17 -0800 (PST)
Organization: Compilers Central
Sender: johnl@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <23-02-015@comp.compilers>
References: <23-01-078@comp.compilers> <23-02-001@comp.compilers> <23-02-007@comp.compilers> <23-02-011@comp.compilers>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="86800"; mail-complaints-to="abuse@iecc.com"
Keywords: architecture
Posted-Date: 03 Feb 2023 22:51:37 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <23-02-011@comp.compilers>
Xref: csiph.com comp.compilers:3363

On Friday, February 3, 2023 at 10:17:06 AM UTC-8, Anton Ertl wrote:

(snip, I wrote)

> >This would have been especially useful for Itanium, which
> >(mostly) failed due to problems with code generation.

> I dispute the latter claim. My take is that IA-64 failed because the
> original assumption that in-order performance would exceed OoO
> performance was wrong. OoO processors surpassed in-order CPUs; they
> managed to get higher clock rates (my guess is that this is due to
> them having smaller feedback loops) and they benefit from better
> branch prediction, which extends to 512-instruction reorder buffers on
> recent Intel CPUs, far beyond what compilers can achieve on IA-64.
> The death knell for IA-64 competetiveness was the introduction of SIMD
> instruction set extensions which made OoO CPUs surpass IA-64 even in
> those vectorizable codes where IA-64 had been competetive.

I got interested in OoO in the days of the IBM 360/91, which is early
in the line of OoO processors. Among other things, the 91 has imprecise
interrupts, where the stored address is not that of the instruction
after the one that caused the interrupt.

But okay, the biggest failure of Itanium is that it was two years or
so behind schedule when it came out. And part of it, as well as I
remember, was the need to implement x86 instructions, too.

> >Since the whole idea is that the processor depends on the
> >code generator doing things in the right order. That is, out
> >of order execution, but determined at compile time. Failure
> >to do that meant failure for the whole idea.

> But essentially all sold IA-64 CPUs were variations of the McKinley
> microarchitecture as far as performance characteristics were
> concerned, especially during the time when IA-64 was still perceived
> as relevant. The next microarchitecture Poulson was only released in
> 2012 when IA-64 had already lost.

But is the whole idea of compile-time instruction scheduling the cause
of the failure, or just the way they did it?

The 360/91 had some interesting problems. One is that it had 16-way
interleaved memory with a cycle time of 13 processor cycles, and a goal
of one instruction per clock cycle. That means its performance depends
on memory address ordering, which is hard to know at compile time. The
slightly later 360/85, without the fancy OoO processor but with cache
memory (the first machine with cache!), was about as fast on real
problems.

Otherwise, the 91 has a limited number of reservation stations,
limiting how far OoO it can go. All OoO processors have a limit to how
far they can go. But the compiler does not have that limit.

Now, since transistors are cheap, and one can throw a large number of
them into reorder buffers and such, one can build really deep
pipelines.

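To make the interleaved-memory point above concrete, here is a toy
model in Python. It only uses the two numbers mentioned above (16 banks,
13-cycle bank busy time); everything else is an assumption for
illustration: word addressing, bank = address mod 16, strictly in-order
issue of at most one access per cycle. The function name
cycles_for_accesses is made up, and this is not a model of the real
360/91 storage system.

```python
def cycles_for_accesses(addresses, banks=16, bank_busy=13):
    """Return the cycle count needed to issue all accesses in order.

    Toy assumptions (not the real 360/91): address a maps to bank
    a % banks, and a bank that starts an access cannot start another
    one for bank_busy cycles. At most one access issues per cycle.
    """
    free_at = [0] * banks  # first cycle at which each bank may start again
    cycle = 0              # earliest cycle at which the next access may issue
    for a in addresses:
        bank = a % banks
        start = max(cycle, free_at[bank])  # stall while the bank is busy
        free_at[bank] = start + bank_busy
        cycle = start + 1  # in-order: next access issues no earlier than this
    return cycle


n = 64
stride_1 = cycles_for_accesses(range(n))               # walks all 16 banks
stride_16 = cycles_for_accesses(range(0, 16 * n, 16))  # always the same bank
print(stride_1, stride_16)
```

In this model the stride-1 stream never stalls, because a bank is
revisited only after 16 cycles and is busy for 13. The stride-16 stream
hits the same bank every time and pays the full bank cycle on each
access. A data-dependent address stream would fall somewhere in
between, and which end it is closer to is not something the compiler
can generally know.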
But the reason for bringing this up is that if Intel had had a defined
intermediate code, and supplied the back end that used it, and, even
more, could update that back end later, that would have been very
convenient for compiler writers. Even more, the design for it could
have been done in parallel with the processor, making both work well
together.

Reminds me of the 8087 virtual register stack. The 8087 has eight
registers, but they were supposed to be a virtual stack: they would be
copied to/from memory on stack overflow or underflow. But no one tried
to write the interrupt routine until after the hardware was made, and
it turned out not to be possible. I never knew why that wasn't fixed
for the 287 or 387, though.

[Multiflow found that VLIW compile-time instruction scheduling was
swell for code with predictable memory access patterns, much less so
for code with data-dependent access patterns. I doubt that has
changed. And if the memory access is that predictable, you can likely
use SIMD instructions instead. -John]