\chapter{Implementation}
\label{cha:implementation}
This chapter focuses on the implementation phase of the project, building upon the concepts and designs previously discussed. It begins with an overview of the technologies employed for both the CPU and GPU parts of the application. This is followed by a description of the pre-processing, or frontend, phase. The chapter concludes with a detailed overview of the two core components: the interpreter and the transpiler.
% Go into the details why this implementation is tuned towards performance and should be the optimum at that
\section{Technologies}
This section describes the technologies used for both the CPU and the GPU side of the prototypes. The rationale behind these choices, including the consideration of their performance implications, is presented. In addition, the hardware limitations imposed by the choice of GPU technology are outlined.
\subsection{CPU side}
Both prototypes were implemented using the Julia programming language. It was chosen mainly because the current symbolic regression algorithm is also implemented in Julia. As a high-level programming language with modern features such as a garbage collector, support for meta-programming and dynamic typing, it also offers great convenience to the developer.
More interesting, however, is the high performance that can be achieved with this language despite these modern features, which are often deemed harmful to performance. \textcite{bezanson_julia_2017} have shown how Julia can provide C-like performance while supporting the developer with modern quality-of-life features. The ability of Julia to be used in high performance computing scenarios and to be competitive with C has been demonstrated by \textcite{lin_comparing_2021}. This makes Julia a good and valid choice for scenarios where both developer comfort and C-like performance are needed.
\subsection{GPU side}
In addition to a programming language for the CPU, a method for programming the GPU is also required. For this purpose, the CUDA API was chosen. While CUDA offers robust capabilities, it is important to note that it is exclusively compatible with Nvidia GPUs. An alternative would have been OpenCL, which provides broader compatibility by supporting GPUs from Nvidia, AMD and Intel. However, considering Nvidia's significant market share and the widespread adoption of CUDA in the industry, the decision was made to use CUDA.
A typical CUDA program is primarily written in C++. Nvidia provides the CUDA compiler nvcc\footnote{\url{https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/}} for C and C++, and the official CUDA programming guide \parencite{nvidia_cuda_2025} also uses C++ for its code examples. It is also possible to call C++ code from within Julia. This would allow for writing the kernel and interacting with the GPU in C++, leveraging the knowledge built up over several years.
\subsubsection{CUDA and Julia}
Instead of writing the kernel in C++ and calling it from Julia, a much simpler yet effective alternative can be used. The Julia package CUDA.jl\footnote{\url{https://cuda.juliagpu.org/}} enables a developer to write a kernel in Julia, similar to how a kernel is written in C++ with CUDA. One drawback of CUDA.jl, however, is that it is much newer than CUDA and therefore lacks its years of testing and bug fixing, which might be a concern for some applications. Apart from writing kernels, CUDA.jl also offers a way of interacting with the driver to compile PTX code into machine code. This feature is essential, as without it, it would not have been possible to fully develop the transpiler in Julia.
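To give an impression of what kernel development with CUDA.jl looks like, the following minimal sketch shows an element-wise vector addition. It is not part of the prototypes and only serves to illustrate the programming model: the kernel is an ordinary Julia function, the thread index is computed from \texttt{threadIdx}, \texttt{blockIdx} and \texttt{blockDim}, and the \texttt{@cuda} macro compiles and launches the kernel.
\begin{verbatim}
using CUDA

# Each thread computes one element of c = a + b.
function add_kernel!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.fill(1.0f0, 1024)
b = CUDA.fill(2.0f0, 1024)
c = CUDA.zeros(Float32, 1024)

threads = 256
blocks = cld(length(c), threads)
@cuda threads=threads blocks=blocks add_kernel!(c, a, b)
\end{verbatim}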
Additionally, the JuliaGPU initiative\footnote{\url{https://juliagpu.org/}} offers a collection of further packages that enable GPU development not only for Nvidia, but also for AMD, Intel and Apple hardware. However, CUDA.jl is the most mature of the available implementations, which is a further reason why CUDA was chosen over alternatives such as OpenCL.
Again, the question arises whether the performance of CUDA.jl is sufficient for it to be used as an alternative to C++ and CUDA. Performance studies by \textcite{besard_rapid_2019}, \textcite{lin_comparing_2021} and \textcite{faingnaert_flexible_2022} have demonstrated that CUDA.jl provides sufficient performance. They found that in some cases CUDA.jl was even able to outperform the same algorithm implemented in C and C++. This provides confidence that Julia alongside CUDA.jl is a good choice for leveraging the performance of GPUs to speed up expression evaluation.
\section{Pre-Processing}
Talk about why this step is needed and how it is done. The why: it simplifies the evaluation and transpilation process (this is probably not needed here, as it is already explained in the concept and design chapter). The how: implemented in ExpressionProcessing.jl.
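As a purely illustrative sketch of one possible form such a pre-processing step could take (the actual logic resides in ExpressionProcessing.jl and may differ), the following hypothetical helper flattens a nested Julia \texttt{Expr} into a linear postfix token list, a representation that is easier to evaluate iteratively on the GPU than a nested expression tree.
\begin{verbatim}
# Hypothetical sketch: flatten an expression tree into postfix order.
function to_postfix!(tokens::Vector{Any}, ex)
    if ex isa Expr && ex.head == :call
        for arg in ex.args[2:end]   # operands first ...
            to_postfix!(tokens, arg)
        end
        push!(tokens, ex.args[1])   # ... then the operator
    else
        push!(tokens, ex)           # constants and variables
    end
    return tokens
end

to_postfix(ex) = to_postfix!(Any[], ex)

# Example: x * (y + 2.0)  ->  [:x, :y, 2.0, :+, :*]
to_postfix(:(x * (y + 2.0)))
\end{verbatim}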
\section{Interpreter}
Talk about how the interpreter has been developed.
Points to cover:
\begin{itemize}
    \item UML activity diagram of the overall flow.
    \item Main loop; the kernel is compiled by CUDA.jl into PTX and then executed.
    \item Memory access (currently global memory only).
    \item No dynamic memory allocation as on the CPU; the evaluation stack needs to have a fixed size (see the sketch below).
\end{itemize}
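As a rough, purely illustrative sketch (not the actual implementation), the following kernel evaluates a single postfix expression for every row of a data matrix using a fixed-size evaluation stack. The opcode encoding, the stack depth of eight and the use of StaticArrays for the stack are assumptions made solely for this example.
\begin{verbatim}
using CUDA
using StaticArrays

# Every thread evaluates the same postfix expression for one row of the data.
# The evaluation stack has a fixed size, since dynamic memory allocation is
# not available inside a kernel.
function interpret_kernel!(results, data, opcodes, operands)
    row = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if row <= size(data, 1)
        stack = MVector{8, Float32}(undef)  # assumed maximum stack depth
        top = 0
        for i in eachindex(opcodes)
            op = opcodes[i]
            if op == 1                       # push constant
                top += 1
                stack[top] = operands[i]
            elseif op == 2                   # push variable (operand = column index)
                top += 1
                stack[top] = data[row, Int(operands[i])]
            elseif op == 3                   # add the two topmost values
                stack[top - 1] += stack[top]
                top -= 1
            elseif op == 4                   # multiply the two topmost values
                stack[top - 1] *= stack[top]
                top -= 1
            end
        end
        results[row] = stack[top]
    end
    return nothing
end

# Example: evaluate x1 * (x2 + 2.0) for every row of a 1000x2 data matrix.
data     = CUDA.rand(Float32, 1000, 2)
opcodes  = CuArray(Int32[2, 2, 1, 3, 4])      # x1, x2, 2.0, +, *
operands = CuArray(Float32[1, 2, 2.0, 0, 0])
results  = CUDA.zeros(Float32, 1000)

threads = 256
blocks  = cld(length(results), threads)
@cuda threads=threads blocks=blocks interpret_kernel!(results, data, opcodes, operands)
\end{verbatim}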
\section{Transpiler}
Talk about how the transpiler has been developed (probably the largest section, because it has more interesting parts). Points to cover:
\begin{itemize}
    \item UML activity diagram of the overall flow.
    \item Front-end and back-end.
    \item Caching of back-end results.
    \item PTX code is generated, compiled using CUDA.jl (essentially via the driver) and then executed (see the sketch below).
    \item Memory access (global memory and register management, especially register management).
\end{itemize}
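To illustrate the driver interaction mentioned above, the following sketch, which is not taken from the prototype, shows how PTX code can be loaded and executed through CUDA.jl's low-level module API. The embedded PTX kernel, which merely writes a constant into a buffer, acts as a stand-in for the code the transpiler would actually generate.
\begin{verbatim}
using CUDA

# Stand-in PTX: writes the constant pi into the first element of a buffer.
ptx_code = """
.version 7.0
.target sm_52
.address_size 64

.visible .entry write_pi(.param .u64 out_ptr)
{
    .reg .b64 %rd<3>;
    .reg .f32 %f<2>;

    ld.param.u64       %rd1, [out_ptr];
    cvta.to.global.u64 %rd2, %rd1;
    mov.f32            %f1, 0f40490FDB;
    st.global.f32      [%rd2], %f1;
    ret;
}
"""

md  = CuModule(ptx_code)          # compile/load the PTX via the driver
fun = CuFunction(md, "write_pi")  # look up the kernel by name

out = CUDA.zeros(Float32, 1)
cudacall(fun, Tuple{CuPtr{Cfloat}}, out; threads=1, blocks=1)
Array(out)  # [3.1415927f0]
\end{verbatim}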