
Paper: LLM Translation of Compiler Intermediate Representation

From: John R Levine <johnl@taugh.com>
Newsgroups: comp.compilers
Subject: Paper: LLM Translation of Compiler Intermediate Representation
Date: Tue, 12 May 2026 11:34:57 -0400
Organization: Compilers Central
Sender: johnl%iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <26-05-002@comp.compilers>
Keywords: GCC, LLVM
Posted-Date: 12 May 2026 11:35:54 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com



They use an LLM to translate between the GCC and LLVM intermediate
representations, a famously hard task, and claim success even though one of
their tables says it's at best 84% correct.

Abstract

GCC and LLVM underpin much of modern software infrastructure, relying on
distinct Intermediate Representations (IRs) to drive optimizations and code
generation. However, the semantic and structural differences between these IRs
create significant barriers for cross-toolchain interaction, limiting the reuse
of compiler frontends, backends, and optimization pipelines across programming
languages and compilation ecosystems. Traditional rule-based translators have
attempted to bridge this gap, but their complexity and maintenance cost have
hindered practical adoption. Large Language Models (LLMs) offer a data-driven
alternative, capable of learning complex mappings between heterogeneous
compiler IRs directly from sufficiently representative examples. To explore
this approach, this paper
presents IRIS-14B, a 14-billion-parameter transformer model fine-tuned to
translate GIMPLE (as emitted by GCC) to LLVM IR (as emitted by LLVM). The model
is trained on paired IRs extracted from C sources and evaluated on
GIMPLE-to-LLVM IR translation of IRs derived from real-world C code and
competitive programming problems. To the best of our knowledge, IRIS-14B is
the first model trained explicitly for IR-to-IR translation. It outperforms
widely used models, including the largest state-of-the-art open models
available today, ranging from 13 to 1,000 billion parameters, in accuracy by
up to 44 percentage points. The proposed approach supports the integration of LLMs
as complementary components within hybrid neuro-symbolic compiler architectures,
where models such as IRIS-14B act as interoperability layers enabling
cross-toolchain workflows without modifying existing compiler passes, while
traditional compiler infrastructure continues to perform deterministic
compilation and optimization.
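
To make the translation task concrete, here is roughly what the two sides
look like for a trivial C function. The dumps below are illustrative sketches;
exact output varies with compiler version, flags, and temporary naming.

    /* add.c */
    int add(int a, int b) { return a + b; }

GIMPLE, as a dump like gcc -fdump-tree-gimple would show it (roughly):

    int add (int a, int b)
    {
      int D.1948;

      D.1948 = a + b;
      return D.1948;
    }

The corresponding LLVM IR, roughly what clang -S -emit-llvm -O1 emits
(attributes and metadata omitted):

    define i32 @add(i32 %a, i32 %b) {
    entry:
      %add = add nsw i32 %b, %a
      ret i32 %add
    }

Even this tiny case shows what a model has to learn: different type spellings
(int vs. i32), different temporary-naming conventions, and LLVM's SSA form
with explicit basic blocks.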

https://arxiv.org/abs/2605.08247
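
The "interoperability layer" idea amounts to a pipeline along these lines.
The iris14b-translate step is a hypothetical command name (the paper describes
a model, not a shipped tool); the GCC and LLVM invocations are ordinary ones:

    # Dump GIMPLE from GCC; the dump-file suffix varies by GCC version
    gcc -fdump-tree-gimple -c add.c

    # Hypothetical: translate the GIMPLE dump to LLVM IR with the model
    iris14b-translate add.c.005t.gimple -o add.ll

    # Existing LLVM infrastructure verifies, optimizes, and generates code
    opt -passes='verify,default<O2>' add.ll -o add.opt.bc
    llc add.opt.bc -o add.s

Nothing on the LLVM side changes; the model just has to emit IR that the rest
of the toolchain will accept, which is where the correctness numbers matter.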

Regards,
John Levine, johnl@taugh.com, Taughannock Networks, Trumansburg NY
Please consider the environment before reading this e-mail. https://jl.ly
