\chapter{Implementation}
\label{cha:implementation}
This chapter focuses on the implementation phase of the project, building upon the concepts and designs previously discussed. It begins with an overview of the technologies employed for both the CPU and GPU parts of the application. This is followed by a description of the pre-processing or frontend phase. The chapter concludes with a detailed overview of the core components, the interpreter and the transpiler.
% Go into the details why this implementation is tuned towards performance and should be the optimum at that
\section{Technologies}
This section describes the technologies used for both the CPU side of the prototypes and the GPU side. The rationale behind these choices, including consideration of their performance implications, is presented. In addition, the hardware limitations imposed by the choice of GPU technology are outlined.
\subsection{CPU side}
Both prototypes were implemented in the Julia programming language. It was chosen mainly because the existing symbolic regression algorithm is also implemented in Julia. Being a high-level programming language with modern features such as a garbage collector, support for meta-programming and dynamic typing, it also offers great convenience to the developer.
More interesting, however, is the high performance that can be achieved with this language, despite supporting modern features that are often deemed harmful to performance. \textcite{bezanson_julia_2017} have shown how Julia can provide C-like performance while supporting the developer with modern quality-of-life features. The ability of Julia to be used in high performance computing scenarios and to be competitive with C has been demonstrated by \textcite{lin_comparing_2021}. This shows that Julia is a valid choice for scenarios where both developer comfort and C-like performance are needed.
\subsection{GPU side}
In addition to a programming language for the CPU, a method for programming the GPU is also required. For this purpose, the CUDA API was chosen. While CUDA offers robust capabilities, it is important to note that it is exclusively compatible with Nvidia GPUs. An alternative would have been OpenCL, which provides broader compatibility by supporting GPUs from Nvidia, AMD and Intel. However, considering Nvidia's significant market share and the widespread adoption of CUDA in the industry, the decision was made to use CUDA.
A typical CUDA program is primarily written in C++. Nvidia provides its CUDA compiler nvcc\footnote{\url{https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/}} for C and C++, and its official CUDA programming guide \parencite{nvidia_cuda_2025} also uses C++ for code examples. Since it is possible to call C++ code from within Julia, one option would have been to write the kernel and the GPU interaction in C++, leveraging the knowledge built up over several years.
\subsubsection{CUDA and Julia}
Instead of writing the kernel in C++ and calling it from Julia, a much simpler yet equally effective alternative exists. The Julia package CUDA.jl\footnote{\url{https://cuda.juliagpu.org/}} enables a developer to write a kernel in Julia, similar to how a kernel is written in C++ with CUDA. One drawback of CUDA.jl, however, is that it is much newer than CUDA and therefore lacks its years of testing and bug fixing, which might be a concern for some applications. Apart from writing kernels, CUDA.jl also offers a way to interact with the driver to compile PTX code into machine code. This is a must-have feature, as otherwise it would not have been possible to fully develop the transpiler in Julia.
Additionally, the JuliaGPU initiative\footnote{\url{https://juliagpu.org/}} offers a collection of packages that enable GPU development not just for Nvidia, but also for AMD, Intel and Apple hardware. However, CUDA.jl is the most mature of the available implementations, which is another reason why CUDA was chosen over alternatives such as OpenCL.
Again, the question arises whether the performance of CUDA.jl is sufficient for it to serve as an alternative to C++ and CUDA. Performance studies by \textcite{besard_rapid_2019}, \textcite{lin_comparing_2021} and \textcite{faingnaert_flexible_2022} have demonstrated that CUDA.jl provides sufficient performance. They found that in some cases CUDA.jl even outperformed the same algorithm implemented in C and C++. This provides the confidence that Julia alongside CUDA.jl is a good choice for leveraging the performance of GPUs to speed up expression evaluation.
\section{Pre-Processing}
% Talk about why this needs to be done and how it is done (the why is basically: simplifies evaluation/transpilation process; the how is in ExpressionProcessing.jl (the why is probably not needed because it is explained in concept and design))
The pre-processing or frontend step is very important. As explained in Chapter \ref{cha:conceptdesign}, it is responsible for ensuring that the given expressions are valid and for transforming them into an intermediate representation. This section explains how the intermediate representation is implemented, as well as how it is generated from a mathematical expression.
\subsection{Intermediate Representation}
\label{sec:ir}
% Talk about how it looks and why it was chosen to look like this
The intermediate representation is mainly designed to be lightweight and easily transferable to the GPU. Since the interpreter runs on the GPU, this was a very important consideration. The transpilation process, by contrast, is performed on the CPU and is therefore very flexible in terms of the intermediate representation, so the focus lay mostly on being efficient for the interpreter.
The intermediate representation cannot take an arbitrary form. While it has already been established that expressions are converted to postfix notation, there are different ways of storing the resulting tokens. The first logical decision is to create an array where each entry represents a token. On the CPU, each entry could be a pointer to a token object. Each of these objects could be of a different type, for example one holding a constant value while another holds an operator. Additionally, each object could include its own logic on what to do when it is encountered during evaluation. On the GPU, however, this is not possible, as an array entry must hold a value rather than a pointer to another memory location. Moreover, even if it were possible, it would be a bad idea. As explained in Section \ref{sec:memory_model}, when loading data from memory, larger chunks are retrieved at once. If the data is scattered across the GPU's memory, a lot of unwanted data is transferred. This can be seen in Figure \ref{fig:excessive-memory-transfer}: if the data is stored consecutively, far fewer load operations are required and much less data needs to be transferred overall.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{excessive_memory_transfer.png}
\caption{Loading data from global memory on the GPU always loads 32, 64 or 128 bytes (see Section \ref{sec:memory_model}). If pointers were supported and data would be scattered around global memory, many more data load operations would be required. Additionally, much more unwanted data would be loaded.}
\label{fig:excessive-memory-transfer}
\end{figure}
Because of this, and because the GPU does not allow pointers, another solution is required. Instead of storing pointers to objects of different types in an array, it is possible to store a single object type that carries meta information. The object therefore contains both the type of the stored value and the value itself, as described in Section \ref{sec:pre-processing}. The four types that need to be stored in this object differ significantly in the values they represent.
Variables and parameters are very simple to store. Because they represent indices into the variable matrix or the parameter vector, this integer index can be stored as-is in the value property of the object.
Constants are also simple, as they represent a single 32-bit floating point value. However, because of the variables and parameters, the value property is already defined as an integer and not as a floating point number. Unlike languages such as JavaScript, where every number is a floating point number, Julia distinguishes between the two types, so they cannot be stored in the same property directly. Introducing a second property only for constants is not feasible, as this would add four bytes per object that must be sent to the GPU but would rarely contain a value. To avoid sending these unnecessary bytes, the reinterpret mechanism provided by Julia can be used. It allows the bits of a value of one type to be treated as the bits of another type. The bits representing the floating point value are thus interpreted as an integer and stored in the same property. On the GPU, the same concept is applied to interpret the integer as a floating point value again for further computations. This is also why the original type of the value must be stored alongside the value: only then can the stored bits be interpreted correctly and the expression evaluated correctly.
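This round-trip can be illustrated with a short sketch in the Julia REPL (the variable names are illustrative, not taken from the actual implementation):

\begin{verbatim}
# Store a Float32 constant in an Int32 slot.
value = 2.5f0
stored = reinterpret(Int32, value)        # bit pattern of 2.5f0 as an Int32

# Later (e.g. on the GPU), the original value is recovered the same way.
recovered = reinterpret(Float32, stored)  # == 2.5f0
\end{verbatim}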
Operators are very different from variables, parameters and constants. Because they represent an operation rather than a value, another way of storing them is needed. An operator can be mapped to a number that identifies the operation. For example, if the addition operator is mapped to the integer one, then whenever the evaluator encounters an object of type operator with a value of one, it knows which operation to perform. This can be done for all operators, which means they can be stored in the same object with the same value property; only the type must be set accordingly. Such a mapping of an operator to a number is commonly called an operation code, or opcode, and each operator is represented by one opcode.
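A minimal sketch of such an expression element could look as follows. The type and member names used here are hypothetical and chosen for illustration; the actual implementation in ExpressionProcessing.jl may differ:

\begin{verbatim}
# Hypothetical sketch of an intermediate representation element.
@enum ElementType begin
    VARIABLE   # value is an index into the variable matrix
    PARAMETER  # value is an index into the parameter vector
    CONSTANT   # value is a Float32 reinterpreted as an Int32
    OPERATOR   # value is an opcode, e.g. 1 for addition
end

struct ExpressionElement
    type::ElementType
    value::Int32
end

# Example: the expression x_1 + 2.5 in postfix form.
postfix = [
    ExpressionElement(VARIABLE, Int32(1)),
    ExpressionElement(CONSTANT, reinterpret(Int32, 2.5f0)),
    ExpressionElement(OPERATOR, Int32(1)),  # opcode 1: addition
]
\end{verbatim}

Storing the type tag alongside a single 32-bit value keeps every element the same small, fixed size, which is exactly what allows the array to be copied to the GPU in one consecutive block.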
With this, the intermediate representation is defined. Figure \ref{fig:pre-processing-result-impl} shows how a simple expression would look after the pre-processing step.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{pre-processing_result_impl.png}
\caption{The expression $x_1 + 2.5$ after it has been converted to the intermediate representation. Note that the constant value $2.5$ stores a seemingly random value due to it being reinterpreted as an integer.}
\label{fig:pre-processing-result-impl}
\end{figure}
\subsection{Processing}
Now that the intermediate representation has been defined, the processing step can be implemented. This section describes the structure of the expressions and how they are processed. Furthermore, the process of parsing the expressions to ensure their validity and the conversion into the intermediate representation is explained.
\subsubsection{Expressions}
The pre-processing step makes use of the first modern Julia feature: as already mentioned, Julia offers extensive support for meta-programming, which is important for this step. Julia represents its own code as a data structure, which allows a developer to manipulate code at runtime. The code is stored in a so-called Expr object as an abstract syntax tree (AST), which is the most minimal tree representation of a given expression. As a result, mathematical expressions can also be represented as such an Expr object instead of a simple string. This is a major benefit, as these expressions can then easily be manipulated by the symbolic regression algorithm. It is the main reason why the pre-processing step requires the expressions to be provided as Expr objects instead of strings.
Another major benefit of expressions being stored as an Expr object, and therefore as an AST, is that operator precedence is already encoded. Because it is a tree where the leaves are the constants, variables and parameters (also called terminal symbols) and the inner nodes are the operators, the correct result is obtained by evaluating the tree from bottom to top. As seen in Figure \ref{fig:expr-ast}, the expression $1 + x_1 \, \log(p_1)$, when parsed as an AST, contains the correct operator precedence: first the bottommost subtree $\log(p_1)$ must be evaluated, then the multiplication, and finally the addition.
It needs to be mentioned, however, that Julia stores the children of a node in an array, allowing one node to have as many children as needed. For example, the expression $1+2+\dots+n$ contains only additions, which is an associative operation, meaning the order in which the operations are performed is irrelevant. The AST for this expression contains the operator at the first position of the array, followed by all operands. This ensures that the AST is as minimal as possible.
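This flattened n-ary structure can be inspected directly in the Julia REPL:

\begin{verbatim}
expr = :(1 + 2 + 3 + 4)
expr.head   # :call
expr.args   # [:+, 1, 2, 3, 4] -- one operator node with four children

expr2 = :(1 + x1 * log(p1))
expr2.args  # [:+, 1, :(x1 * log(p1))] -- nested subtree for the product
\end{verbatim}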
\begin{figure}
\centering
\includegraphics[width=.45\textwidth]{expr_ast.png}
\caption{The AST for the expression $1 + x_1 \, \log(p_1)$ as generated by Julia. Some additional details Julia includes in its AST have been omitted as they are not relevant.}
\label{fig:expr-ast}
\end{figure}
\subsubsection{Parsing}
To convert the AST of an expression into the intermediate representation, a recursive top-down traversal of the tree is required. The steps for this are as follows:
\begin{enumerate}
\item Extract the operator for later use.
\item Convert all constants, variables and parameters to the object (expression element) described in Section \ref{sec:ir}.
\item Append the expression elements to the postfix expression.
\item If the operator is a binary operator and there are more than two expression elements, append the operator after the first two elements and then after each subsequent element.
\item If a subtree exists, apply all previous steps and append it to the existing postfix expression.
\item Append the operator.
\item Return the generated postfix expression/intermediate representation.
\end{enumerate}
As explained above, a node of a binary operator can have $n$ children. In these cases, additional handling is required to ensure a correct conversion; this handling is condensed into step 4 above. Essentially, the operator must be added after the first two elements and again after every following element. The expression $1+2+3+4$ is converted to the AST $+\,1\,2\,3\,4$; without step 4, the resulting postfix expression would be $1\,2\,3\,4\,+$. With the operator added after the first two elements and then after each further element, the correct postfix expression $1\,2\,+\,3\,+\,4\,+$ is generated.
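The steps above, including the handling of n-ary operator nodes, can be sketched as a recursive function. This is a simplified illustration with hypothetical names operating directly on Expr objects; the actual implementation in ExpressionProcessing.jl works on the intermediate representation elements and differs in detail:

\begin{verbatim}
# Simplified sketch of the AST-to-postfix conversion.
function to_postfix!(expr::Expr, postfix::Vector{Any})
    op = expr.args[1]                  # step 1: extract the operator
    for (i, arg) in enumerate(expr.args[2:end])
        if arg isa Expr
            to_postfix!(arg, postfix)  # step 5: recurse into subtrees
        else
            push!(postfix, arg)        # steps 2-3: append terminal symbols
        end
        # step 4: for n-ary nodes, append the operator after the first
        # two elements and after every further element
        i >= 2 && push!(postfix, op)
    end
    length(expr.args) == 2 && push!(postfix, op)  # unary call, e.g. log(x)
    return postfix                     # step 7: return the postfix expression
end

to_postfix!(:(1 + 2 + 3 + 4), Any[])  # [1, 2, :+, 3, :+, 4, :+]
\end{verbatim}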
% talk about the process of parsing.
% Include code fragments
% probably point out how meta-programming is used (more detailed than above)
% talk about how invalid expressions are handled
% talk about generation of intermediate representation
% especially talk about cache
\section{Interpreter}
% Talk about how the interpreter has been developed.
% UML flow diagram
% main loop; kernel compiled by CUDA.jl into PTX and then executed
% Memory access (currently global memory only)
% no dynamic memory allocation like on the CPU (stack needs to have a fixed size)
\section{Transpiler}
% Talk about how the transpiler has been developed (probably the largest section, because it has more interesting parts)
% UML flow diagram
% Front-end and back-end
% Caching of back-end results
% PTX code generated and compiled using CUDA.jl (so basically the driver) and then executed
% Memory access (global memory and register management, especially register management)