\chapter{Implementation}
\label{cha:implementation}
This chapter focuses on the implementation phase of the project, building upon the concepts and designs previously discussed. It begins with an overview of the technologies employed for both the CPU and GPU parts of the application. This is followed by a description of the pre-processing or frontend phase. The chapter concludes with a detailed overview of the core components, the interpreter and the transpiler.
% Go into the details why this implementation is tuned towards performance and should be the optimum at that
\section{Technologies}
This section describes the technologies used for both the CPU side of the prototypes and the GPU side. The rationale behind these choices, including consideration of their performance implications, is presented. In addition, the hardware limitations imposed by the choice of GPU technology are outlined.
\subsection{CPU side}
Both prototypes were implemented using the Julia programming language. It was chosen mainly because the current symbolic regression algorithm is also implemented in Julia. Being a high-level programming language with modern features such as a garbage collector, support for meta-programming and dynamic typing, it also offers great convenience to the developer.
More interesting, however, is the high performance that can be achieved with this language. High performance is possible despite the supported modern features, which are often deemed harmful to performance. \textcite{bezanson_julia_2017} have shown how Julia can provide C-like performance while supporting the developer with modern quality-of-life features. The ability of Julia to be used in high performance computing scenarios and to be competitive with C has been demonstrated by \textcite{lin_comparing_2021}. This shows that Julia is a good and valid choice for scenarios where developer comfort and C-like performance are needed.
\subsection{GPU side}
In addition to a programming language for the CPU, a method for programming the GPU is also required. For this purpose, the CUDA API was chosen. While CUDA offers robust capabilities, it is important to note that it is exclusively compatible with Nvidia GPUs. An alternative would have been OpenCL, which provides broader compatibility by supporting GPUs from Nvidia, AMD and Intel. However, considering Nvidia's significant market share and the widespread adoption of CUDA in the industry, the decision was made to use CUDA.
A typical CUDA program is primarily written in C++. Nvidia provides its CUDA compiler nvcc\footnote{\url{https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/}} for C and C++, and the official CUDA programming guide \parencite{nvidia_cuda_2025} also uses C++ for its code examples. It is also possible to call C++ code from within Julia. This would allow the kernel and the interaction with the GPU to be written in C++, leveraging the knowledge built up over several years.
\subsubsection{CUDA and Julia}
Instead of writing the kernel in C++ and calling it from Julia, a much simpler yet effective alternative can be used. The Julia package CUDA.jl\footnote{\url{https://cuda.juliagpu.org/}} enables a developer to write a kernel in Julia in much the same way a kernel is written in C++ with CUDA. One drawback of CUDA.jl, however, is that it is much newer than CUDA and therefore lacks years of testing and bug fixing, which might be a concern for some applications. Apart from writing kernels, CUDA.jl also offers a way of interacting with the driver to compile PTX code into machine code. This is a must-have feature, as without it, it would not have been possible to fully develop the transpiler in Julia.
Additionally, the JuliaGPU initiative\footnote{\url{https://juliagpu.org/}} offers a collection of packages to enable GPU development for AMD, Intel and Apple GPUs, not just for Nvidia. However, CUDA.jl is the most mature of the available implementations, which is another reason why CUDA has been chosen over, for example, OpenCL.
Again, the question arises whether the performance of CUDA.jl is sufficient for it to be used as an alternative to C++ and CUDA. Performance studies by \textcite{besard_rapid_2019}, \textcite{lin_comparing_2021} and \textcite{faingnaert_flexible_2022} have demonstrated that CUDA.jl provides sufficient performance. They found that in some cases CUDA.jl was even able to perform better than the same algorithm implemented in C and C++. This provides the confidence that Julia alongside CUDA.jl is a good choice for leveraging the performance of GPUs to speed up expression evaluation.
\section{Pre-Processing}
% Talk about why this needs to be done and how it is done (the why is basically: simplifies evaluation/transpilation process; the how is in ExpressionProcessing.jl (the why is probably not needed because it is explained in concept and design))
The pre-processing or frontend step is very important. As already explained in Chapter \ref{cha:conceptdesign}, it is responsible for ensuring that the given expressions are valid and that they are transformed into an intermediate representation. This section aims to explain how the intermediate representation is implemented, as well as how it is generated from a mathematical expression.
\subsection{Intermediate Representation}
\label{sec:ir}
% Talk about how it looks and why it was chosen to look like this
The intermediate representation is mainly designed to be lightweight and easily transferable to the GPU. Since the interpreter runs on the GPU, this was a very important consideration. Because the transpilation process is done on the CPU, and is therefore very flexible in terms of the intermediate representation, the focus was mainly on efficiency for the interpreter.
The intermediate representation cannot take an arbitrary form. While it has already been defined that expressions are converted to postfix notation, there are several ways to store the data. The first logical choice is to create an array where each entry represents a token. On the CPU it would be possible to define each entry as a pointer to a token object. Each of these objects could be of a different type, for example one object that holds a constant value while another holds an operator. In addition, each of these objects could contain its own logic about what to do when it is encountered during the evaluation process. However, on the GPU this is not possible, as an array entry must hold a value and not a pointer to another memory location. Furthermore, even if it were possible, it would be a bad idea. As explained in Section \ref{sec:memory_model}, when loading data from global memory, larger chunks are retrieved at once. If the data is scattered across the GPU's global memory, a lot of unwanted data will be transferred. This can be seen in Figure \ref{fig:excessive-memory-transfer}: if the data is stored sequentially, far fewer load operations and far less data in general need to be transferred.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{excessive_memory_transfer.png}
\caption{Loading data from global memory on the GPU always loads 32, 64 or 128 bytes (see Section \ref{sec:memory_model}). If pointers were supported and data were scattered around global memory, many more load operations would be required. Additionally, much more unwanted data would be loaded.}
\label{fig:excessive-memory-transfer}
\end{figure}
Because of this, and because the GPU does not allow pointers, another solution is required. Instead of storing pointers to objects of different types in an array, it is possible to store a single object type that carries meta information. The object contains the type of the stored value and the value itself, as described in Section \ref{sec:pre-processing}. The four types that need to be stored in this object differ significantly in the values they represent.
Variables and parameters are very simple to store. Because they represent indices into the variable matrix or the parameter vector, this (integer) index can be stored as is in the value property of the object. The type can then be used to determine whether the index refers to a variable or a parameter.
Constants are also very simple, as they represent a single 32-bit floating point value. However, because of the variables and parameters, the value property is already defined as an integer and not as a floating point number. Unlike languages like Python, where every number is a floating point number, in Julia integers and floating point numbers are distinct types and therefore cannot be stored in the same property. Creating a second property for constants only is not feasible, as this would introduce four additional bytes per object that need to be sent to the GPU and that most of the time would not contain a defined value.
To avoid sending unnecessary bytes, a mechanism provided by Julia called reinterpret can be used. It allows the bits of a variable of one type to be treated as the bits of another type. The bits used to represent a floating point number are then interpreted as an integer and can be stored in the same property. On the GPU, the same concept is applied in reverse to reinterpret the integer value as a floating point value for further calculations. This is also the reason why the original type of the value needs to be stored alongside the value, so that the stored value can be interpreted correctly and the expressions can be evaluated correctly.
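The following snippet illustrates this mechanism for the 32-bit types used in the intermediate representation:
\begin{GenericCode}
constant = 2.5f0                        # 32-bit floating point constant
stored   = reinterpret(Int32, constant) # same bits, now treated as an integer: 1075838976
restored = reinterpret(Float32, stored) # the reverse cast on the GPU yields 2.5 again
\end{GenericCode}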
Operators are very different from variables, parameters and constants. Because they represent an operation rather than a value, a different way of storing them is required. An operator can be mapped to a number that identifies the operation. For example, if the addition operator is mapped to the integer $1$, then when the evaluator encounters an object of type operator with a value of $1$, it knows which operation to perform. This can be done for all operators, which means it is possible to store them in the same object with the same property, and only the type needs to be specified. The mapping of an operator to a value is usually called an operation code, or opcode, and each operator is represented by one opcode.
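Such a mapping can be expressed as a simple lookup table. The following sketch shows one possibility; the concrete opcode values are illustrative and not necessarily the ones used in the implementation:
\begin{GenericCode}
# Illustrative mapping from Julia operator symbols to opcodes
const opcodes = Dict{Symbol, Int32}(
    :+ => 1, :- => 2, :* => 3, :/ => 4, :^ => 5, :log => 6
)
\end{GenericCode}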
With this, the intermediate representation is defined. Figure \ref{fig:pre-processing-result-impl} shows how a simple expression looks after the pre-processing step. Note that the value $2.5$ has been reinterpreted as an integer, resulting in the seemingly random value.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{pre-processing_result_impl.png}
\caption{The expression $x_1 + 2.5$ after it has been converted to the intermediate representation. Note that the constant value $2.5$ stores a seemingly random value due to it being reinterpreted as an integer.}
\label{fig:pre-processing-result-impl}
\end{figure}
\subsection{Processing}
Now that the intermediate representation has been defined, the processing step can be implemented. This section describes the structure of the expressions and how they are processed. It also explains the process of parsing the expressions to ensure their validity and converting them into the intermediate representation.
\subsubsection{Expressions}
With the pre-processing step, the first modern feature of Julia has been used. As already mentioned, Julia provides extensive support for meta-programming, which is important for this step. Julia represents its own code as a data structure, which allows a developer to manipulate the code at runtime. The code is stored in a so-called Expr object as an Abstract Syntax Tree (AST), which is a minimal tree representation of a given expression. As a result, mathematical expressions can also be represented as such an Expr object instead of a simple string. This is a major benefit, because these expressions can then be easily manipulated by the symbolic regression algorithm. It is also the main reason why the pre-processing step requires the expressions to be provided as Expr objects instead of strings.
Another major benefit of the expressions being stored in an Expr object, and therefore as an AST, is the included operator precedence. Because it is a tree where the leaves are the constants, variables and parameters (also called terminal symbols) and the inner nodes are the operators, the correct result will be calculated when evaluating the tree from bottom to top. As can be seen in Figure \ref{fig:expr-ast}, the expression $1 + x_1 \, \log(p_1)$, when parsed as an AST, contains the correct operator precedence. First the bottommost subtree $\log(p_1)$ must be evaluated, then the multiplication, and only after that the addition.
It should be noted, however, that Julia stores the tree as nested arrays to allow a node to have as many children as necessary. For example, the expression $1+2+\dots+n$ contains only additions, which is an associative operation, meaning that the order in which the additions are evaluated is irrelevant. The AST for this expression contains the operator at the first position in the array and the values at the following positions. This ensures that the AST is as minimal as possible.
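A brief look at such an object illustrates this structure; here x1 and p1 simply stand in for a variable and a parameter:
\begin{GenericCode}
expr = :(1 + x1 * log(p1)) # quoting yields an Expr object
expr.head                  # :call
expr.args                  # Any[:+, 1, :(x1 * log(p1))] -- the multiplication is a nested Expr

nary = :(1 + 2 + 3 + 4)    # repeated additions are flattened into a single n-ary node
nary.args                  # Any[:+, 1, 2, 3, 4]
\end{GenericCode}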
\begin{figure}
\centering
\includegraphics[width=.45\textwidth]{expr_ast.png}
\caption{The AST for the expression $1 + x_1 \, \log(p_1)$ as generated by Julia. Some additional details Julia includes in its AST have been omitted as they are not relevant.}
\label{fig:expr-ast}
\end{figure}
\subsubsection{Parsing}
To convert the AST of an expression into the intermediate representation, a top-down traversal of the tree is required. The steps for this are as follows:
\begin{enumerate}
\item Extract the operator and convert it to its opcode for later use.
\item Convert all constants, variables and parameters and operators to the object (expression element) described in Section \ref{sec:ir}.
\item Append the expression elements to the postfix expression.
\item If the operator is a binary operator and there are more than two expression elements, append the operator after the first two elements and then after each subsequent element.
\item If a subtree exists, apply all previous steps and append it to the existing postfix expression.
\item Append the operator.
\item Return the generated postfix expression/intermediate representation.
\end{enumerate}
The validation of the expression is performed throughout the parsing process. Validating that only correct operators are used is performed in step 1. To be able to convert the operator to its corresponding opcode, it must be validated that an opcode exists for it, and therefore whether it is valid or not. Similarly, converting the tokens into an expression element object ensures that only valid variables and parameters are present in the expression. This is handled in step 2.
As explained above, a node of a binary operator can have $n$ children. In these cases, additional handling is required to ensure correct conversion. This handling is summarised in step 4. Essentially, the operator must be added after the first two elements, and for each subsequent element, the operator must also be added. The expression $1+2+3+4$ is converted to the AST $+\,1\,2\,3\,4$ and without step 4 the postfix expression would be $1\,2\,3\,4\,+$. If the operator is added after the first two elements and then after each subsequent element, the correct postfix expression $1\,2\,+\,3\,+\,4\,+$ will be generated.
Each subtree of the AST is its own separate AST, which can be converted to postfix notation in the same way the whole AST can be converted. This means that the algorithm only needs to be able to handle leaf nodes, and when it encounters a subtree, it recursively calls itself to parse the remaining AST. Step 5 indicates this recursive behaviour.
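These steps can be condensed into a single recursive function. The following sketch operates directly on Julia objects rather than on the expression elements described in Section \ref{sec:ir}, and its name and structure are purely illustrative:
\begin{GenericCode}
# Simplified sketch of the AST-to-postfix conversion (not the actual frontend code)
function to_postfix(ex, postfix = Any[])
    if ex isa Expr && ex.head == :call
        op = ex.args[1]                       # operator symbol, e.g. :+ or :log
        operands = ex.args[2:end]
        to_postfix(operands[1], postfix)      # first (or only) operand
        for operand in operands[2:end]        # n-ary nodes: operator after every further operand
            to_postfix(operand, postfix)
            push!(postfix, op)
        end
        length(operands) == 1 && push!(postfix, op)  # unary operators such as log
    else
        push!(postfix, ex)                    # terminal symbol: constant, variable or parameter
    end
    return postfix
end

to_postfix(:(1 + 2 + 3 + 4))    # Any[1, 2, :+, 3, :+, 4, :+]
to_postfix(:(1 + x1 * log(p1))) # Any[1, :x1, :p1, :log, :*, :+]
\end{GenericCode}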
While the same expression usually occurs only once, sub-expressions can occur multiple times. In the example in Figure \ref{fig:expr-ast}, the whole expression $1 + x_1 \, \log(p_1)$ is unlikely to be generated more than once by the symbolic regression algorithm. However, the sub-expression $\log(p_1)$ is much more likely to be generated multiple times. This means that the generation of the intermediate representation for this subtree only needs to be done once and can be reused later. Therefore, a cache can be used to store the intermediate representation for this sub-expression and access it again later to eliminate the parsing overhead.
Caching can be applied to both individual sub-expressions as well as the entire expression. While it is unlikely for the whole expression to recur frequently, either as a whole or as part of a larger expression, implementing a cache will not degrade performance and will, in fact, enhance it if repetitions do occur. In the context of parameter optimisation, where the evaluators are employed, expressions will recur, making full-expression caching advantageous. The primary drawback of caching is the increased use of RAM. However, given that RAM is plentiful in modern systems, this should not pose a significant issue.
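A possible way to implement such a cache is a dictionary keyed by the expression, as sketched below. It reuses the illustrative conversion function from above, whereas the actual frontend caches its expression-element representation:
\begin{GenericCode}
const postfix_cache = Dict{Expr, Vector{Any}}()

function to_postfix_cached(expr::Expr)
    return get!(postfix_cache, expr) do
        to_postfix(expr)    # only executed on a cache miss
    end
end
\end{GenericCode}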
\section{Interpreter}
The implementation is divided into two main components, the CPU-based control logic and the GPU-based interpreter as outlined in the Concept and Design chapter. This section aims to describe the technical details of these components. First the CPU-based control logic will be discussed. This component handles the communication with the GPU and is the entry point which is called by the symbolic regression algorithm. Following this, the GPU-based interpreter will be explored, highlighting the specifics of developing an interpreter on the GPU.
An overview of how these components interact with each other is outlined in Figure \ref{fig:interpreter-sequence}. The parts of this figure are explained in detail in the following sections.
\begin{figure}
\centering
\includegraphics[width=.95\textwidth]{interpreter_sequence_diagram.png}
\caption{The sequence diagram of the interpreter.}
\label{fig:interpreter-sequence}
\end{figure}
\subsection{CPU Side}
The interpreter receives all the expressions it needs to interpret as input. Additionally, it needs the variable matrix as well as the parameters for each expression. All expressions are passed to the interpreter as an array of Expr objects, as these are required by the pre-processing step, or frontend. The first loop shown in Figure \ref{fig:interpreter-sequence} is responsible for sending the expressions to the frontend to be converted into the intermediate representation. After this step, the expressions are in the correct format to be sent to the GPU and the interpretation process can continue.
\subsubsection{Data Transfer}
Before the GPU can start with the interpretation, the data needs to be sent to the GPU. Because the variables are already in matrix form, transferring them is fairly straightforward. Memory must be allocated in the global memory of the GPU, and the data must then be copied from RAM into the allocated memory. Allocating memory and transferring the data to the GPU is handled implicitly by the CuArray type provided by CUDA.jl.
To optimise the interpreter for parameter optimisation workloads, this step is actually performed before the interpreter is called. Although the diagram includes this transmission for completeness, it is important to note that the variables never change, as they represent the observed inputs of the system that is being modelled by the symbolic regression algorithm. Therefore, re-transmitting the variables for each step of the parameter optimisation process would be inefficient. By transmitting the variables once and reusing them throughout the parameter optimisation, significant time can be saved.
Furthermore, transferring the data to the GPU before the symbolic regression algorithm begins could save even more time. However, this approach would require modifications to the symbolic regression algorithm. Therefore, the decision has been made to forgo this optimisation. Nonetheless, it is still possible to modify the implementation at a later stage with minimal effort, if needed.
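Assuming the variables are already available as a 32-bit floating point matrix, the transfer itself reduces to a single constructor call; the variable names below are illustrative:
\begin{GenericCode}
using CUDA

variables = rand(Float32, 5, 1000) # k variables x N variable sets (example data)
cudaVariables = CuArray(variables) # allocates GPU memory and copies the data
\end{GenericCode}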
Once the variables are transmitted, the parameters must also be transferred to the GPU. Unlike the variables, the parameters are stored as a vector of vectors. In order to transmit the parameters efficiently, they also need to be brought into matrix form. The matrix needs to be of the form $k \times N$, where $k$ is the length of the longest inner vector and $N$ is the length of the outer vector. This ensures that all values can be stored in the matrix. It also means that if the inner vectors are of different lengths, some unnecessary values will be transmitted, but the overall benefit of treating them as a matrix outweighs this drawback. Program \ref{code:julia_vec-to-mat} shows how this conversion can be implemented. Note that an invalid element must be provided. This ensures defined behaviour and helps with finding errors in the code. After the parameters have been brought into matrix form, they can be transferred to the GPU in the same way as the variables.
\begin{program}
\begin{GenericCode}
function convert_to_matrix(vecs::Vector{Vector{T}}, invalidElement::T)::Matrix{T} where T
    maxLength = get_max_inner_length(vecs)

    # Pad the shorter vectors with the invalidElement to make all equal length
    paddedVecs = [vcat(vec, fill(invalidElement, maxLength - length(vec))) for vec in vecs]
    vecMat = hcat(paddedVecs...) # transform vector of vectors into column-major matrix

    return vecMat
end

function get_max_inner_length(vecs::Vector{Vector{T}})::Int where T
    return maximum(length.(vecs))
end
\end{GenericCode}
\caption{A Julia program fragment depicting the conversion from a vector of vectors into a matrix of the form $k \times N$. }
\label{code:julia_vec-to-mat}
\end{program}
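For illustration, applying this function to two parameter vectors of different lengths, with NaN32 as the invalid element, yields the following $2 \times 2$ matrix:
\begin{GenericCode}
parameters = [[1.0f0, 2.0f0], [3.0f0]]
convert_to_matrix(parameters, NaN32)
# 2x2 Matrix{Float32}:
#  1.0    3.0
#  2.0  NaN
\end{GenericCode}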
Similar to the parameters, the expressions are also stored as a vector of vectors. The outer vector contains one entry per expression, while the inner vectors hold the expressions in their intermediate representation. Therefore, this vector of vectors also needs to be brought into matrix form in the same way as the parameters. To simplify development, the special opcode \textit{stop} has been introduced, which is used as the invalidElement in Program \ref{code:julia_vec-to-mat}. As described in Section \ref{sec:interpreter-gpu-side}, this element is used to determine whether the end of an expression has been reached during the interpretation process. This removes the need to additionally transmit the length of each expression in order to determine whether the entire expression has been interpreted, which reduces overhead considerably.
Once the conversion into matrix form has been performed, the expressions are transferred to the GPU. Just like with the variables, the expressions remain the same over the course of the parameter optimisation part. Therefore, they are transferred to the GPU before the interpreter is called, to reduce the amount of unnecessary data transfer.
In addition to the data already described, two more steps are required that are not included in the sequence diagram in Figure \ref{fig:interpreter-sequence}. The first is the allocation of global memory for the result matrix. Without this, the kernel would not know where to store the interpretation results. Therefore, enough global memory needs to be allocated so that the results can be stored and retrieved after all kernel executions have finished.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{memory_layout_data.png}
\caption{The expressions, variables and parameters as they are stored in the GPU's global memory. Note that while on the CPU they are stored as matrices, on the GPU they are simply three arrays of data. The thick lines represent where a new column, and therefore a new set of data, begins.}
\label{fig:memory-layout-data}
\end{figure}
The second step concerns information about the data itself. Only raw data can be sent to the GPU, which means that information about its structure is missing. The matrices are represented as flat arrays, meaning they have lost their column and row information. This information must be sent separately to let the kernel know the dimensions of the expressions, variables and parameters. Otherwise, the kernel does not know, for example, at which memory location the second variable set is stored, as it does not know how large a single set is. Figure \ref{fig:memory-layout-data} shows how the data is stored without any information about the rows or columns of the matrices. The thick lines help to identify where a new column, and therefore a new set of data, begins. However, the GPU has no knowledge of this, and therefore the additional information must be transferred to ensure that the data is accessed correctly.
\subsubsection{Kernel Dispatch}
Once all the data is present on the GPU, the CPU can dispatch the kernel for each expression. This dispatch requires parameters that specify the number of threads and their organisation into thread blocks. In total, one thread is required for each variable set, and therefore the grouping into thread blocks is the primary tunable parameter. Taking into account the constraints explained in Section \ref{sec:occupancy}, this grouping needs to be tuned for optimal performance. The specific values, alongside the methodology for determining them, are explained in Chapter \ref{cha:evaluation}.
In addition, the dispatch parameters also include the pointers to the location of the data allocated and transferred above, as well as the index of the expression to be interpreted. Since all expressions and parameters are sent to the GPU at once, this index ensures that the kernel knows where in memory to find the expression it needs to interpret and which parameter set it needs to use. After the kernel has finished, the result matrix needs to be read from the GPU and passed back to the symbolic regression algorithm.
Crucially, dispatching a kernel is an asynchronous operation, which means that the CPU does not wait for the kernel to finish before continuing. This allows the CPU to dispatch all kernels at once, rather than one at a time. As explained in Section \ref{sec:architecture}, a GPU can have multiple resident grids, meaning that the dispatched kernels can run concurrently, drastically reducing evaluation times. Only once the result matrix is read from the GPU does the CPU have to wait for all kernels to finish execution.
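The dispatch loop can be sketched as follows. The kernel name, its argument list and the block size are placeholders, and the arrays are assumed to be those transferred in the previous steps:
\begin{GenericCode}
using CUDA

threadsPerBlock = 256                           # illustrative value; tuned in the evaluation
blocks = cld(nrOfVariableSets, threadsPerBlock) # one thread per variable set

for exprIndex in eachindex(expressions)
    # kernel launches are asynchronous; the CPU immediately continues with the next launch
    @cuda threads=threadsPerBlock blocks=blocks interpret_kernel(
        cudaExpressions, cudaVariables, cudaParameters, cudaResults, exprIndex)
end

results = Array(cudaResults) # copying the result matrix back synchronises with all kernels
\end{GenericCode}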
\subsection{GPU Side}
\label{sec:interpreter-gpu-side}
% Memory access (currently global memory only)
% no dynamic memory allocation like on CPU (stack needs to have fixed size; also stack is stored in local memory)
With the GPU's global memory now containing all the necessary data and the kernel being dispatched, the interpretation process can begin. Before interpreting an expression, the global thread ID must be calculated. This step is crucial because each variable set is assigned to a unique thread. Therefore, the global thread ID determines which variable set should be used for the current interpretation instance.
Moreover, the global thread ID ensures that excess threads do not perform any work, as these threads would otherwise try to access a variable set that does not exist, which would lead to an illegal memory access. This is necessary because the number of required threads often does not align perfectly with the number of threads per block multiplied by the number of blocks. If, for example, $1031$ threads are required, at least two thread blocks are needed, as one thread block can hold at most $1024$ threads. Because $1031$ is a prime number, it cannot be split evenly across any practical number of thread blocks. If two thread blocks are allocated, each holding $1024$ threads, a total of $2048$ threads is started. Therefore, the excess $2048 - 1031 = 1017$ threads must be prevented from executing. By using the global thread ID and the number of available variable sets, these excess threads can be easily identified and terminated early in the kernel execution.
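In CUDA.jl, this guard can be expressed in a few lines at the top of the kernel; the kernel signature shown here is a simplified placeholder:
\begin{GenericCode}
using CUDA

function interpret_kernel(nrOfVariableSets) # remaining arguments omitted for brevity
    # global thread ID; CUDA.jl indices are 1-based
    varSetIndex = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    # excess threads would access a non-existent variable set, so they terminate early
    if varSetIndex > nrOfVariableSets
        return
    end

    # ... interpretation of the expression for this variable set ...
    return
end
\end{GenericCode}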
Afterwards, the stack for the interpretation can be created. It is possible to dynamically allocate memory on the GPU, which enables a programming model similar to that of the CPU. \textcite{winter_are_2021} have compared many dynamic memory managers and found that their performance impact is rather small. However, when static allocation is easily possible, it still offers better performance. This is the case here, which is why the stack has been given a static size. Because it is known that expressions do not exceed 50 tokens, including the operators, the stack size has been set to 25, which should be more than enough to hold the values and partial results, even in the worst case.
\subsubsection{Main Loop} % MAYBE
Once everything is initialised, the main interpreter loop starts interpreting the expression. Because of the intermediate representation, the loop simply iterates through the expression from left to right. On each iteration the type of the current token is checked to decide which operation to perform.
If the current token type matches the \textit{stop} opcode, the interpreter knows that it is finished. This simplicity is the reason why this opcode was introduced, as explained above.
More interesting is the case where the current token corresponds to an index into either the variable matrix or the parameter matrix. In this case, the token's value is important. To access one of these matrices, the correct starting index of the respective set must first be calculated. As previously explained, information about the dimensions of the data is lost during transfer. At this stage, the kernel only knows the index of the first element of either matrix, which set to use for this evaluation, and the index of the value within the current set. However, the boundaries of these sets are unknown. Therefore, the additionally transferred data about the dimensions is used in this step to calculate the index of the first element in each set. With this calculated index and the index stored in the token, the correct value can be loaded. After the value has been loaded, it is pushed to the top of the stack for later use.
% MAYBE:
% Algorithm that shows how this calculation works
Constants work very similarly in that the token value is read and added to the top of the stack. However, the constants have been reinterpreted from floating-point values to integers for easy transfer to the GPU. This operation must be reversed before adding the value to the stack as otherwise the wrong values would be used for evaluation.
The expression is evaluated whenever the current token is an operator. The token's value, which serves as the opcode, determines the operation that needs to be performed. If the opcode represents a unary operator, only the top value of the stack needs to be popped for the operation. The operation is then executed on this value and the result is pushed back onto the stack. If the opcode represents a binary operator, the top two values of the stack are popped. These are then used for the operation, and the result is subsequently pushed back onto the stack.
Support for ternary operators could also be added with little effort. An example of a ternary operator that would help improve performance is the GPU-supported Fused Multiply-Add (FMA) operator. While this operator does not exist in Julia, the frontend could generate it when it encounters a sub-expression of the form $x * y + z$. Since this operation performs the multiplication and addition in a single clock cycle instead of two, it would be a feasible optimisation. However, detecting such sub-expressions is more complicated, which is why it is not supported in the current implementation.
Once the interpreter loop has finished, the result of the evaluation must be stored in the result matrix. By using the index of the current expression as well as the index of the current variable set (the global thread ID), it is possible to calculate the index at which the result must be stored. The last value on the stack is the result, and it is stored in the result matrix at the calculated location.
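To summarise the main loop, the following listing mirrors the steps described above as an ordinary Julia function operating on a single expression and variable set. The token layout, opcode values and names are illustrative; the actual kernel additionally performs the index calculations into the flat GPU arrays described earlier:
\begin{GenericCode}
# Illustrative token types and opcodes (assumed values)
const STOP      = Int32(0)
const VARIABLE  = Int32(1)
const PARAMETER = Int32(2)
const CONSTANT  = Int32(3)
const OPERATOR  = Int32(4)
const ADD, MUL, LOG = Int32(1), Int32(2), Int32(3)

function interpret(types::Vector{Int32}, values::Vector{Int32},
                   variables::Vector{Float32}, parameters::Vector{Float32})
    stack = Vector{Float32}(undef, 25) # statically sized stack
    top = 0
    for i in eachindex(types)
        t, v = types[i], values[i]
        if t == STOP                   # end of the expression reached
            break
        elseif t == VARIABLE           # v is an index into the variable set
            top += 1; stack[top] = variables[v]
        elseif t == PARAMETER          # v is an index into the parameter set
            top += 1; stack[top] = parameters[v]
        elseif t == CONSTANT           # undo the integer reinterpretation
            top += 1; stack[top] = reinterpret(Float32, v)
        elseif t == OPERATOR
            if v == ADD                # binary: pop two values, push the result
                stack[top - 1] = stack[top - 1] + stack[top]; top -= 1
            elseif v == MUL
                stack[top - 1] = stack[top - 1] * stack[top]; top -= 1
            elseif v == LOG            # unary: replace the top value
                stack[top] = log(stack[top])
            end
        end
    end
    return stack[top]                  # the last value on the stack is the result
end

# x_1 + 2.5 evaluated for the variable set [4.0]; returns 6.5f0
interpret(Int32[VARIABLE, CONSTANT, OPERATOR, STOP],
          Int32[1, reinterpret(Int32, 2.5f0), ADD, 0],
          Float32[4.0], Float32[])
\end{GenericCode}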
\section{Transpiler}
% Talk about how the transpiler has been developed (probably largest section, because it just has more interesting parts); CPU-side part will be much larger than GPU side
Unlike the interpreter, the transpiler primarily operates on the CPU, with only a minor GPU-based component. This is because the transpiler must generate entire PTX kernels from Julia expressions, rather than simply executing a pre-written kernel like the interpreter. Similar to the interpreter, the CPU side of the transpiler manages communication with both the GPU and the symbolic regression algorithm. This section provides a detailed overview of the transpiler's functionality.
An overview of how the transpiler interacts with the frontend and GPU is outlined in Figure \ref{fig:transpiler-sequence}. The parts of this figure are explained in detail in the following sections.
\begin{figure}
\centering
\includegraphics[width=.95\textwidth]{transpiler_sequence_diagram.png}
\caption{The sequence diagram of the transpiler.}
\label{fig:transpiler-sequence}
\end{figure}
\subsection{CPU Side}
% TODO: Finish on Saturday
After the transpiler has received the expressions to be transpiled, they are first sent to the frontend for processing. Once they have been processed, the expressions are sent to the transpiler backend, which is further explained in Section \ref{sec:transpiler-backend}. The backend is responsible for generating the kernels. Its output is the PTX code of one kernel per expression.
\subsubsection{Data Transfer}
% smaller section as it basically states that it works the same as interpreter
% mention that now expressions are not transmitted, as the kernel "is" the expression
\subsubsection{Kernel Dispatch}
% similar to interpreter dispatch with tuning etc.
% mention that CUDA.jl is used to instruct the driver to compile the kernel for the specific hardware
\subsubsection{Transpiler Backend}
\label{sec:transpiler-backend}
% TODO: Start on Saturday and finish on Sunday (prefferably finish on Saturday)
% describe the tanspilation process
\subsection{GPU Side}
% TODO: Finish on Sunday
% Front-End and Back-End
% Caching of back-end results
% PTX code generated and compiled using CUDA.jl (so basically using the driver) and then executed
% Memory access (global memory and register management especially register management)