
Paper: LLM Translation of Compiler Intermediate Representation

From: John R Levine <johnl@taugh.com>
Newsgroups: comp.compilers
Subject: Paper: LLM Translation of Compiler Intermediate Representation
Date: Tue, 12 May 2026 11:34:57 -0400
Organization: Compilers Central
Sender: johnl%iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <26-05-002@comp.compilers>
Keywords: GCC, LLVM
Posted-Date: 12 May 2026 11:35:54 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com



They use an LLM to translate between the GCC and LLVM intermediate
representations, a famously hard task, and claim success even though one of
their tables says it's at best 84% correct.

Abstract

GCC and LLVM underpin much of modern software infrastructure, relying on
distinct Intermediate Representations (IRs) to drive optimizations and code
generation. However, the semantic and structural differences between these IRs
create significant barriers for cross-toolchain interaction, limiting the reuse
of compiler frontends, backends, and optimization pipelines across programming
languages and compilation ecosystems. Traditional rule-based translators have
attempted to bridge this gap, but their complexity and maintenance cost have
hindered practical adoption. Large Language Models (LLMs) offer a data-driven
alternative, capable of learning complex mappings between heterogeneous
compiler IRs directly from sufficiently representative examples. To explore
this approach, this paper
presents IRIS-14B, a 14-billion-parameter transformer model fine-tuned to
translate GIMPLE (as emitted by GCC) to LLVM IR (as emitted by LLVM). The model
is trained on paired IRs extracted from C sources and evaluated on
GIMPLE-to-LLVM IR translation of IRs derived from real-world C code and
competitive programming problems. To the best of our knowledge, IRIS-14B is
the first model trained explicitly for IR-to-IR translation. It outperforms
widely used models, including the largest state-of-the-art open models
available today, ranging from 13 to 1,000 billion parameters, in accuracy by
up to 44 percentage points. The proposed approach supports the integration of LLMs
as complementary components within hybrid neuro-symbolic compiler architectures,
where models such as IRIS-14B act as interoperability layers enabling
cross-toolchain workflows without modifying existing compiler passes, while
traditional compiler infrastructure continues to perform deterministic
compilation and optimization.
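
To make the translation task concrete, here is roughly what the two sides
look like for a trivial C function. The dumps below are illustrative sketches;
exact output varies with compiler version, flags, and temporary naming.

    /* add.c */
    int add(int a, int b) { return a + b; }

GIMPLE, as a dump like gcc -fdump-tree-gimple would show it (roughly):

    int add (int a, int b)
    {
      int D.1948;

      D.1948 = a + b;
      return D.1948;
    }

The corresponding LLVM IR, roughly what clang -S -emit-llvm -O1 emits
(attributes and metadata omitted):

    define i32 @add(i32 %a, i32 %b) {
    entry:
      %add = add nsw i32 %b, %a
      ret i32 %add
    }

Even this tiny case shows what a model has to learn: different type spellings
(int vs. i32), different temporary-naming conventions, and LLVM's SSA form
with explicit basic blocks.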

https://arxiv.org/abs/2605.08247
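
The "interoperability layer" idea amounts to a pipeline along these lines.
The iris14b-translate step is a hypothetical command name (the paper describes
a model, not a shipped tool); the GCC and LLVM invocations are ordinary ones:

    # Dump GIMPLE from GCC; the dump-file suffix varies by GCC version
    gcc -fdump-tree-gimple -c add.c

    # Hypothetical: translate the GIMPLE dump to LLVM IR with the model
    iris14b-translate add.c.005t.gimple -o add.ll

    # Existing LLVM infrastructure verifies, optimizes, and generates code
    opt -passes='verify,default<O2>' add.ll -o add.opt.bc
    llc add.opt.bc -o add.s

Nothing on the LLVM side changes; the model just has to emit IR that the rest
of the toolchain will accept, which is where the correctness numbers matter.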

Regards,
John Levine, johnl@taugh.com, Taughannock Networks, Trumansburg NY
Please consider the environment before reading this e-mail. https://jl.ly
