Automatic translation of programming languages has garnered renewed interest, driven by recent advances in large language models (LLMs). Encoder-decoder transformer models, in particular, have shown promise in translating between programming languages. However, translating between a language and its high-performance computing (HPC) extension remains underexplored due to inherent challenges such as understanding complex parallel semantics. In this paper, we introduce CodeRosetta, an encoder-decoder transformer model explicitly designed to translate between programming languages and their HPC extensions. CodeRosetta is evaluated on C++ ↔ CUDA and C++ ↔ Fortran translation. It employs a customized learning-based framework with tailored pretraining and training objectives to effectively capture code semantics and parallel structural nuances, enabling bidirectional code translation. Our results show that CodeRosetta outperforms state-of-the-art baselines in C++ to CUDA translation by 2.9 BLEU and 1.72 CodeBLEU points while improving compilation accuracy by 6.05%. Compared to general closed-source LLMs, our proposed bidirectional learning-based method improves C++ to CUDA translation by 22.08 BLEU and 14.39 CodeBLEU with 2.75% higher compilation accuracy. Finally, CodeRosetta exhibits proficiency in Fortran to parallel C++ translation, marking it, to our knowledge, as the first encoder-decoder model for this complex translation task, improving CodeBLEU by at least 4.63 points compared to closed-source LLMs and open code LLMs.
Pre-training is essential for transformer models to understand programming languages. CodeRosetta uses Masked Language Modeling (MLM), which masks entire words in code (e.g., "int" in "int index") and trains the model to predict the missing tokens from the surrounding context, helping it learn both syntactic and semantic patterns. In addition, pre-training on a combined dataset of C++ and the target language (CUDA or Fortran) enables cross-lingual learning, allowing the model to generalize and transfer knowledge across programming languages.
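Below is a minimal Python sketch of the whole-word masking idea described above. The word-level tokenization, mask token name, and masking ratio are illustrative assumptions, not CodeRosetta's exact settings.

import random

MASK_TOKEN = "<mask>"

def whole_word_mask(code_tokens, mask_ratio=0.15, seed=None):
    """Mask entire words in a token sequence for MLM-style pretraining.

    `code_tokens` is a list of word-level tokens, e.g. ["int", "index", "=", "0", ";"].
    Returns the corrupted sequence and the labels (original tokens at masked
    positions, None elsewhere) that the encoder is trained to recover.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in code_tokens:
        if rng.random() < mask_ratio:
            corrupted.append(MASK_TOKEN)
            labels.append(tok)    # model must predict the original word
        else:
            corrupted.append(tok)
            labels.append(None)   # position is not scored
    return corrupted, labels

# Example: the model may see "<mask> index = 0 ;" and must predict "int".
print(whole_word_mask(["int", "index", "=", "0", ";"], mask_ratio=0.3, seed=1))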
Following cross-lingual MLM pre-training, we introduce Abstract Syntax Tree (AST) Entity Recognition (AER) to improve CodeRosetta's understanding of code structure. AER trains the model to recognize and categorize syntactic elements in code, analogous to Named Entity Recognition in natural language processing. By leveraging ASTs generated from source code, AER pre-training enables CodeRosetta to predict the syntactic role of each token, improving its adaptability across programming languages and parallel programming models such as CUDA.
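The sketch below shows how AST-derived entity labels could be set up as an NER-style sequence-labeling target. The parser interface (parse_to_typed_tokens) and the label vocabulary are hypothetical stand-ins, not the paper's actual AER implementation.

from typing import List, Tuple

# Hypothetical helper: an external AST parser (e.g., tree-sitter) that returns
# each leaf token of the source code together with its syntactic node type.
def parse_to_typed_tokens(source: str) -> List[Tuple[str, str]]:
    raise NotImplementedError("supplied by an AST parser in practice")

def make_aer_labels(typed_tokens, label_vocab):
    """Turn (token, ast_node_type) pairs into a sequence-labeling target.

    The encoder is trained, NER-style, to predict the AST category
    (e.g., identifier, type_specifier, call_expression) of each token.
    """
    tokens, labels = [], []
    for tok, node_type in typed_tokens:
        tokens.append(tok)
        labels.append(label_vocab.get(node_type, label_vocab["other"]))
    return tokens, labels

label_vocab = {"identifier": 0, "type_specifier": 1, "call_expression": 2, "other": 3}
typed = [("int", "type_specifier"), ("index", "identifier"), ("=", "other"), ("0", "other")]
print(make_aer_labels(typed, label_vocab))  # (["int", "index", "=", "0"], [1, 0, 3, 3])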
After pre-training, CodeRosetta's decoder is still untrained, so Denoising Auto-Encoding (DAE) is used to prepare it for code translation. DAE corrupts input code with various noise types (e.g., token masking, token shuffling) and trains the model to reconstruct the original, teaching the decoder the syntax of the target language. Additional techniques, weighted token dropping (removing language-specific keywords) and language-specific token insertion (inserting tokens from the other language), help the model distinguish between programming languages, while an adaptive noise ratio gradually increases corruption difficulty during training.
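A rough Python sketch of such a combined noise function is shown below. The keyword sets, noise ratios, and linear adaptive schedule are illustrative guesses rather than the paper's exact configuration.

import random

def add_dae_noise(tokens, step, max_steps,
                  src_keywords=frozenset({"std", "vector", "cout"}),
                  other_lang_tokens=("__global__", "threadIdx", "blockIdx"),
                  base_ratio=0.1, max_ratio=0.35, seed=None):
    """Corrupt a token sequence for denoising auto-encoding (a sketch).

    Combines token masking, weighted dropping of language-specific keywords,
    insertion of tokens from the *other* language, and light local shuffling.
    The overall noise ratio grows with training progress (adaptive noise).
    """
    rng = random.Random(seed)
    ratio = base_ratio + (max_ratio - base_ratio) * (step / max_steps)

    noisy = []
    for tok in tokens:
        r = rng.random()
        if tok in src_keywords and r < ratio:   # weighted keyword dropping
            continue
        if r < ratio / 3:                       # token masking
            noisy.append("<mask>")
            continue
        noisy.append(tok)
        if rng.random() < ratio / 4:            # language-specific token insertion
            noisy.append(rng.choice(other_lang_tokens))

    # Light local shuffling: occasionally swap adjacent tokens.
    for i in range(len(noisy) - 1):
        if rng.random() < ratio / 4:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy

print(add_dae_noise(["std", "::", "vector", "<", "int", ">", "data", ";"],
                    step=500, max_steps=1000, seed=7))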
To enhance CodeRosetta's translation quality and grasp of complex code semantics, back translation is employed. This process involves translating code from a source to a target language (e.g., C++ to CUDA) and then performing reverse translation (CUDA to C++) to reconstruct the original source code. The model refines its accuracy by comparing the reconstructed code with the original, iteratively improving its understanding of language differences and code structures, while alternating between language pairs to prevent bias and ensure balanced learning.
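The following sketch shows one alternating back-translation step. Here, model.translate and model.train_step are hypothetical interfaces standing in for CodeRosetta's encoder-decoder; the real training loop and loss are those of the paper, not this sketch.

def back_translation_step(model, cpp_batch, cuda_batch, loss_fn):
    """One round of back-translation training (a simplified sketch)."""
    # C++ -> CUDA -> C++: the synthetic CUDA becomes the input and the
    # original C++ the target, so reconstruction error supervises training.
    synthetic_cuda = model.translate(cpp_batch, src="cpp", tgt="cuda")
    loss_to_cpp = model.train_step(src_batch=synthetic_cuda, tgt_batch=cpp_batch,
                                   src_lang="cuda", tgt_lang="cpp", loss_fn=loss_fn)

    # Alternate direction (CUDA -> C++ -> CUDA) to keep learning balanced
    # and avoid biasing the model toward one language pair.
    synthetic_cpp = model.translate(cuda_batch, src="cuda", tgt="cpp")
    loss_to_cuda = model.train_step(src_batch=synthetic_cpp, tgt_batch=cuda_batch,
                                    src_lang="cpp", tgt_lang="cuda", loss_fn=loss_fn)
    return loss_to_cpp + loss_to_cuda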
@inproceedings{tehranijamsaz2024coderosetta,
title={CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming},
author={TehraniJamsaz, Ali and Bhattacharjee, Arijit and Chen, Le and Ahmed, Nesreen K and Yazdanbakhsh, Amir and Jannesari, Ali},
booktitle={Proceedings of the 38th International Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=V6hrg4O9gg}
}