\chapter{Concept and Design}
\label{cha:conceptdesign}
% introduction to what needs to be done. also clarify terms "Host" and "Device" here
% TODO: maybe describe CPU interpreter too? We will see
To be able to determine whether evaluating mathematical expressions on the GPU is better suited than on the CPU, two prototypes need to be implemented: one that interprets these expressions on the GPU, and one that transpiles the expressions into PTX code that can be executed by the GPU. The goal of this chapter is to describe how these two prototypes can be designed conceptually. First, the requirements for the prototypes, as well as the data they operate on, are explained. This is followed by the design of the interpreter and the transpiler. The CPU interpreter is not described, as it already exists.
\section[Requirements]{Requirements and Data}
The main goal of both prototypes, or evaluators, is to provide a speed-up compared to the CPU interpreter already in use. It is equally important to determine which of the two evaluators provides the greater speed-up, as the faster one is intended to replace the CPU interpreter. Therefore, both evaluators must offer capabilities similar to those of the CPU interpreter and meet the following requirements:
\item All input expressions have the same number of variables ($x_n$), but can have a different number of parameters ($p_n$).
\item The variables are parametrised using a matrix of the form $k \times N$, where $k$ is the number of variables in the expressions and $N$ is the number of different parametrisations for the variables. This matrix is the same for all expressions.
\item The parameters are parametrised using a vector of vectors. Each vector $v_i$ corresponds to an expression $e_i$.
\item The following operations must be supported: $x + y$, $x - y$, $x * y$, $x / y$, $x ^ y$, $|x|$, $\log(x)$, $e^x$, $1 / x$ and $\sqrt{x}$. Note that $x$ and $y$ can either stand for a constant, a variable, a parameter, or another operation.
\item The results of the evaluations are returned in a matrix of the form $k \times N$. In this case, $k$ is equal to the number of variable sets (the $N$ of the variable matrix) and $N$ is equal to the number of input expressions.
\end{itemize}
\end{figure}
With this, the required capabilities are outlined. However, for a better understanding, the input and output data need to be explained further. The first input contains the expressions that need to be evaluated. These can be of any length and can contain constant values, variables and parameters, all of which are linked together with the supported operators. In the simplified example shown in Figure \ref{fig:input_output_explanation}, there are six expressions $e_1$ to $e_6$.

Next is the variable matrix. An entry in this matrix corresponds to one variable in every expression. The row indicates which variable it holds the value for. For example, the values in row three are used to parameterise the variable $x_3$. Each column holds a different set of variables, and each expression must be evaluated using each of these sets. In the provided example, there are three variable sets, each containing the values for the four variables $x_1$ to $x_4$.

After all expressions have been evaluated using all variable sets, the results of these evaluations must be stored in the result matrix. Each entry in this matrix holds the result of the evaluation of one expression parameterised with one variable set. The row indicates the variable set and the column indicates the expression.

The prototypes developed in this thesis are part of a GP algorithm for symbolic regression. This means that the expressions to be evaluated represent parts of the search space of all expressions that can be formed from any combination of allowed operators, input variables, parameters and constants. Consequently, the size of the search space grows exponentially with the length of the expressions. Exploring this search space by simply generating expressions, evaluating them once and then generating the next set of expressions leaves much of it unexplored. To combat this, parameters are introduced, which allow the algorithm to perform some kind of intensification. To enable this, the prototypes must support not only variables, but also parameters.

The parameters themselves are unique to each expression, meaning they have a one-to-one mapping to an expression. Furthermore, as can be seen in Figure \ref{fig:input_output_explanation}, each expression can have a different number of parameters, or even no parameters at all. Without parameters, however, it would not be possible to perform parameter optimisation. This is in contrast to variables, where each expression must have the same number of variables. Because parameters are unique to each expression and can vary in number, they are not structured as a matrix, but as a vector of vectors.
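The data layout described above can be summarised in the following Julia sketch. The function name, the element types and the placeholder body are illustrative assumptions and not the actual interface of the prototypes.
\begin{verbatim}
# Hypothetical evaluator interface; names and the placeholder body are
# illustrative only.
function evaluate_expressions(
    expressions::Vector{Expr},            # the expressions e_1 to e_n
    variables::Matrix{Float32},           # k x N: k variables, N variable sets
    parameters::Vector{Vector{Float32}})  # one parameter vector per expression
    # placeholder: the actual evaluation is performed on the GPU
    return zeros(Float32, size(variables, 2), length(expressions))  # results
end
\end{verbatim}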
An important thing to consider is the volume and volatility of the data itself. The example shown in Figure \ref{fig:input_output_explanation} has been drastically simplified. It is expected that hundreds of expressions are evaluated per GP generation. Each of these expressions may contain between 10 and 50 tokens. A token is equivalent to either a variable, a parameter, a constant value or an operator.

It can be assumed that typically the number of variables per expression is around ten. However, the number of variable sets can increase drastically; $1\,000$ variable sets can be considered the lower limit and $100\,000$ the upper limit. Considering that one variable value takes up 4 bytes of memory and each variable set contains 10 variables, at least $4 * 10 * 1\,000 = 40\,000$ bytes and at most $4 * 10 * 100\,000 = 4\,000\,000$ bytes need to be transferred to the GPU for the variables.

These variables do not change during the runtime of the symbolic regression algorithm. As a result, the data only needs to be sent to the GPU once, which means that the impact of this data transfer is minimal. The data for the parameters, on the other hand, is much more volatile. As explained above, they are used for parameter optimisation and therefore vary from evaluation to evaluation and need to be sent to the GPU very frequently. The amount of data that needs to be sent depends on the number of expressions as well as on the number of parameters per expression. Considering $10\,000$ expressions that need to be evaluated and an average of two parameters per expression, each requiring 4 bytes of memory, a total of $10\,000 * 2 * 4 = 80\,000$ bytes need to be transferred to the GPU on each parameter optimisation step.
\label{fig:kernel_architecture}
\end{figure}
A design decision that has been made for both prototypes is to split the evaluation of each expression into a separate kernel or kernel dispatch, as seen in Figure \ref{fig:kernel_architecture}. As explained in Section \ref{sec:thread_hierarchy}, it is desirable to reduce the occurrence of thread divergence as much as possible. Although the SIMT programming model tries to mitigate the negative effects of thread divergence, it is still advisable to avoid it when possible. For this use-case, thread divergence can easily be avoided by not evaluating all expressions in a single kernel or kernel dispatch. GPUs are able to have multiple resident grids, with modern GPUs being able to accommodate 128 grids concurrently \parencite{nvidia_cuda_2025}. One grid corresponds to one kernel dispatch, which therefore allows up to 128 kernels to be run concurrently. Dispatching a kernel for each expression thus further increases GPU utilisation. In the case of the interpreter, having only one kernel that is dispatched once per expression also simplifies the kernel itself, because it can focus on evaluating one expression and does not require additional code to handle multiple expressions at once. Similarly, the transpiler can also be simplified, as it can generate many smaller kernels rather than one big kernel. Additionally, the smaller kernels do not need any branching, because the generated code only needs to perform the operations as they occur in the expressions themselves. This also reduces the overhead on the GPU. One drawback of generating a kernel for each expression is the generation itself. Especially for smaller variable sets, it is possible that the time it takes to transpile an expression and compile the kernel into machine code is greater than the time it takes to evaluate it. However, for larger variable sets this should not be a concern.
%
% TODO: Probably include a diagram that shows how the evaluators are integrated in the symbolic regression algorithm (assuming its a GP variant), to show the bigger picture
\label{sec:pre-processing}
The first step in both prototypes is the pre-processing step. It is needed because it simplifies working with the expressions in the later steps. One responsibility of the pre-processor is to verify that only allowed operators and symbols are present in the given expressions. This is comparable to the work a scanner like Flex\footnote{\url{https://github.com/westes/flex}} performs. The second responsibility is to convert the expressions into an intermediate representation. In essence, the pre-processing step can be compared to the frontend of a compiler as described in Section \ref{sec:compilers}. If new operators are required, the pre-processor must be extended as well. Otherwise, expressions containing these operators would be treated as invalid and never reach the evaluator.

The conversion into the intermediate representation transforms the expressions from infix notation into postfix notation, which allows the later steps to evaluate the expressions more easily. One of the major benefits of this notation is the implicit operator precedence. It allows the evaluators to process the expressions token by token from left to right, without needing to worry about the correct order of operations. One token represents either an operator, a constant value, a variable or a parameter. Apart from containing the expression in postfix notation, the intermediate representation also contains information about the types of the tokens themselves. This is all that is needed for the interpretation and transpilation steps. A simple expression like $x + 2$ would look as depicted in Figure \ref{fig:pre-processing_results} after the pre-processing step.
\begin{figure}
\centering
The interpreter consists of two parts. The CPU side is the part of the program that interacts with both the GPU and the caller. An overview of the components and the workflow of the interpreter is shown in Figure \ref{fig:component_diagram_interpreter}. Once the interpreter has received the expressions, they are pre-processed. This ensures that the expressions are valid and that they are transformed into the intermediate representation needed to evaluate them. The result of this pre-processing step is then sent to the GPU, which performs the actual interpretation of the expressions. In addition to the expressions, the data for the variables and parameters must also be sent to the GPU.

Once all the necessary data is present on the GPU, the interpreter kernel can be dispatched. As previously mentioned, the kernel is dispatched once per expression to minimise thread divergence. In fact, dispatching the same kernel separately for each expression ensures that no thread divergence occurs at all, as will be explained later.

After the GPU has finished evaluating all expressions with all variable sets, the results are stored in a matrix on the GPU. The CPU then retrieves the results and returns them to the caller in the format specified by the requirements.
\end{algorithmic}
\end{algorithm}
Handling constants, variables and parameters is very simple. Constants only need to be stored on the stack for later use. Variables and parameters also only need to be stored on the stack; however, their value must first be loaded from the variable matrix or the parameter vector according to the token value. Since the variable matrix and the parameter vectors are sent to the GPU in their entirety, the index of the variable set or of the corresponding parameter vector is also needed to load the correct value. For simplicity, this has been omitted from the algorithm.

When an operator token is encountered, the handling becomes more complex. The value of the token indicates the type of operation to be applied. For binary operators, the top two values on the stack need to be used as input to the operator. For unary operators, only the top value of the stack needs to be used as an input. Once the result has been computed, it must be stored at the top of the stack, where it serves either as an input for the next operation or as the result of the expression.

At the end of the algorithm, the stack contains one last entry. This entry is the value computed by the expression with the designated variable set and parameters. In order to send this value back to the CPU, it must be stored in the result matrix. The last statement performs this action. Again, it has been simplified to omit the index calculation of the expression and variable set needed to store the result at the correct location.
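The evaluation loop described above can be illustrated with the following Julia sketch, written as ordinary CPU code for readability. The token layout, the operator names and the index handling are simplified assumptions and do not correspond to the actual kernel implementation.
\begin{verbatim}
# Illustrative sketch of the stack-based evaluation of one expression for
# one variable set; token fields and operator names are assumptions.
function interpret(tokens, variables, parameters, var_set_idx, expr_idx)
    stack = Float32[]
    for t in tokens
        if t.type == :constant
            push!(stack, t.value)                          # constant value
        elseif t.type == :variable
            push!(stack, variables[t.index, var_set_idx])  # load from variable matrix
        elseif t.type == :parameter
            push!(stack, parameters[expr_idx][t.index])    # load from parameter vector
        elseif t.type == :add                              # binary operator: pop two values
            right = pop!(stack); left = pop!(stack)
            push!(stack, left + right)
        elseif t.type == :sqrt                             # unary operator: pop one value
            push!(stack, sqrt(pop!(stack)))
        end  # the remaining operators are handled analogously
    end
    return stack[end]  # the final stack entry is the result for this variable set
end

# Example: evaluating x_1 + 2 with the first (and only) variable set
tokens = [(type=:variable, index=1), (type=:constant, value=2.0f0), (type=:add,)]
interpret(tokens, fill(3.0f0, 1, 1), [Float32[]], 1, 1)  # returns 5.0f0
\end{verbatim}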
Algorithm \ref{alg:eval_interpreter} essentially represents the kernel. This kernel is dispatched for each expression that needs to be evaluated, to prevent thread divergence. Thread divergence can only occur on data-dependent branches. In this case, the while loop and every if and else-if statement contain a data-dependent branch. Depending on the expression passed to the kernel, the while loop may run longer than for another expression. Similarly, not all expressions have the same constants, operators, variables or parameters in the same order, which would cause each thread to take a different path. However, a single expression always has the same constants, operators, variables and parameters in the same locations, meaning that all threads will take the same path. This also means that although the interpreter contains many data-dependent branches, these branches only depend on the expression itself. Because of this, all threads will follow the same path and will therefore never diverge from one another.
\subsection{Transpiler}
\label{fig:component_diagram_transpiler}
\end{figure}
Similar to the interpreter, the transpiler also consists of a part that runs on the CPU and a part that runs on the GPU. Looking at the components and workflow of the transpiler, as shown in Figure \ref{fig:component_diagram_transpiler}, it is almost identical to the interpreter. The key difference between the two is the additional code generation, or transpilation, step. Apart from that, the transpiler requires the same pre-processing step and likewise uses the GPU to evaluate the expressions. However, the kernels generated by the transpiler work very differently to the kernel of the interpreter. The difference between these evaluators will be explained later.

Before the expressions can be transpiled into PTX code, they have to be pre-processed. As already described, this step ensures the validity of the expressions and transforms them into the intermediate representation described above. As with the interpreter, this also simplifies the code generation step. By transforming the expressions into postfix notation, the code generation follows a similar pattern to the interpretation already described.

Algorithm \ref{alg:transpile} shows how the transpiler takes an expression, transpiles it and then returns the finished code. It can be seen that the while loop is largely the same as the while loop of the interpreter. The main difference is in the operator branches, because now code needs to be generated instead of computing the result of the expression. Therefore, the branches call their designated code generation function, such as $\textit{GetAddition}$. This function returns the PTX code responsible for the addition. However, it must return more than just the code that performs the addition: when executed, this addition produces a value which will be needed as an input by other operators. Therefore, not only the code fragment, but also a reference to the result must be returned.

This reference can then be put on the stack, in the same way the interpreter stores the computed value for later use. The code fragment must also be appended to the already generated code so that it can be returned to the caller. As with the interpreter, there is one final value on the stack when the loop has finished. Once the generated code has been executed, this value is the reference to the result of the expression. This result must be stored in the result matrix, so that it can be retrieved by the CPU after all expressions have been executed. Therefore, a final code fragment must be generated to handle the storage of this value in the result matrix. Once this fragment has been added to the generated code, the transpilation process is complete.
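To illustrate how such a code generation function could look, the following Julia sketch generates a PTX addition instruction and returns both the code fragment and a reference to the register holding the result. The function name, the register naming scheme and the exact PTX fragment are assumptions for illustration and not the actual implementation.
\begin{verbatim}
# Hypothetical code generation helper; register naming is illustrative only.
function get_addition(left_ref::String, right_ref::String, counter::Ref{Int})
    result_ref = "%f$(counter[])"    # register that will hold the sum
    counter[] += 1
    code = "add.f32 $(result_ref), $(left_ref), $(right_ref);"
    return code, result_ref          # code fragment plus reference to its result
end

# Example: combining two previously computed values
counter = Ref(3)
code, ref = get_addition("%f1", "%f2", counter)
# code == "add.f32 %f3, %f1, %f2;",  ref == "%f3"
\end{verbatim}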
\begin{algorithm}
\caption{Transpiling an expression in postfix notation}\label{alg:transpile}
\end{algorithmic}
\end{algorithm}
The code generated by the transpiler is the kernel for the transpiled expressions. This means that a new kernel must be generated for each expression that needs to be evaluated. This is in contrast to the interpreter, which has one kernel that is dispatched once for each expression. However, generating one kernel per expression results in a much simpler kernel, which can focus on evaluating the postfix expression from left to right. There is no overhead work such as branching or managing a stack. Instead, this overhead is shifted to the transpilation step on the CPU, as can be seen in Algorithm \ref{alg:transpile}. There is also a noticeable overhead in that a kernel has to be generated for each expression. In cases like parameter optimisation, many of the expressions would be transpiled multiple times, as the transpiler is called multiple times with the same expressions.

Both the transpiler and the interpreter have their respective advantages and disadvantages. While the interpreter puts less load on the CPU, the GPU has to perform more work. Much of this work involves branching or managing a stack, and therefore involves many instructions that are not used to evaluate the expression itself. However, this overhead can be mitigated by the fact that all this work is performed in parallel rather than sequentially.

On the other hand, the transpiler performs more work on the CPU. The kernels are much simpler, and most of their instructions are used to evaluate the expressions themselves. Furthermore, as explained in Section \ref{sec:ptx}, any program running on the GPU must be transpiled into PTX code before the driver can compile it into machine code. Therefore, the kernel written for the interpreter must also be transpiled into PTX and then compiled. However, this needs to be done only once, whereas for the transpiler it needs to be done for each expression. Since the generated code is tailored to evaluating expressions rather than being generic, the kernels are simpler and can be transpiled and compiled faster. The overhead of transpiling and compiling the expressions is further mitigated by re-using the compiled kernels during the parameter optimisation step.
\chapter{Implementation}
\label{cha:implementation}
% Go into the details why this implementation is tuned towards performance and should be the optimum at that
This chapter focuses on the implementation phase of the thesis, building upon the concepts and designs previously discussed. It begins with an overview of the technologies employed for both the CPU and GPU parts of the prototypes. This is followed by a description of the pre-processing or frontend phase. The chapter concludes with a detailed overview of the core components: the interpreter and the transpiler.
\section{Technologies}
This section describes the technologies used for both the CPU side of the prototypes and the GPU side. The rationale behind these choices, including consideration of their performance implications, is presented. In addition, the hardware limitations imposed by the choice of GPU technology are outlined.
\subsection{CPU side}
Both prototypes were implemented using the Julia programming language. It was chosen mainly because the current symbolic regression algorithm is also implemented in Julia. Being a high-level programming language with modern features such as a garbage collector (GC), support for meta-programming and dynamic typing, it also offers great convenience to the developer.

More important, however, is the high performance that can be achieved with this language. It is possible to achieve high performance despite these modern features, which are often deemed harmful to performance. \textcite{bezanson_julia_2017} have shown how Julia can provide C-like performance while supporting the developer with modern quality-of-life features. The ability of Julia to be used in high performance computing scenarios and to be competitive with C has been demonstrated by \textcite{lin_comparing_2021}. This shows that Julia is a valid choice for scenarios where both developer comfort and C-like performance are needed.
\subsection{GPU side}
In addition to a programming language for the CPU, a method for programming the GPU is also required. For this purpose, the CUDA API was chosen. While CUDA offers robust capabilities, it is important to note that it is exclusively compatible with Nvidia GPUs. An alternative would have been OpenCL, which provides broader compatibility by supporting GPUs from Nvidia, AMD and Intel. However, considering Nvidia's significant market share and the widespread adoption of CUDA in the industry, the decision was made to use CUDA.

A typical CUDA program is primarily written in C++. Nvidia provides the CUDA compiler nvcc\footnote{\url{https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/}} for C and C++, and its official CUDA programming guide \parencite{nvidia_cuda_2025} also uses C++ for its code examples. It is also possible to call C++ code from within Julia. This would allow the kernels and the interaction with the GPU to be written in C++, leveraging the knowledge built up in the industry over several years.
\subsubsection{CUDA and Julia}
Instead of writing the kernels in C++ and calling them from Julia, a much simpler yet effective alternative is available. The Julia package CUDA.jl\footnote{\url{https://cuda.juliagpu.org/}} enables a developer to write a kernel in Julia, similar to how a kernel is written in C++ with CUDA. One drawback of CUDA.jl, however, is that it is much newer than CUDA and therefore does not have years of testing and bug fixing in its history, which might be a concern for some applications. Apart from writing kernels, CUDA.jl also offers a way of interacting with the driver to compile PTX code into machine code. This is an essential feature, as otherwise it would not have been possible to fully develop the transpiler in Julia.
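To illustrate how closely CUDA.jl kernels resemble their C++ counterparts, the following minimal example adds two vectors on the GPU. It is a generic illustration of the programming model and not part of the prototypes.
\begin{verbatim}
using CUDA

# Minimal CUDA.jl kernel: element-wise addition of two vectors.
function add_kernel!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.fill(1.0f0, 1024)
b = CUDA.fill(2.0f0, 1024)
c = CUDA.zeros(Float32, 1024)
@cuda threads=256 blocks=cld(length(c), 256) add_kernel!(c, a, b)
\end{verbatim}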
In addition, the JuliaGPU initiative\footnote{\url{https://juliagpu.org/}} offers a collection of packages that enable GPU development not only for Nvidia, but also for AMD, Intel and Apple hardware. However, CUDA.jl is the most mature of the available implementations, which is another reason why CUDA has been chosen instead of, for example, OpenCL.

Again, the question arises as to whether the performance of CUDA.jl is sufficient for it to be used as an alternative to C++ and CUDA. Studies by \textcite{besard_rapid_2019, lin_comparing_2021, faingnaert_flexible_2022} have demonstrated that CUDA.jl provides sufficient performance. They found that, in some cases, CUDA.jl performed better than the same algorithm implemented in C and C++, and that it is on par otherwise. These results provide the confidence that Julia alongside CUDA.jl is a good choice for leveraging the performance of GPUs to speed up expression evaluation.
\section{Pre-Processing}
% Talk about why this needs to be done and how it is done (the why is basically: simplifies evaluation/transpilation process; the how is in ExpressionProcessing.jl (the why is probably not needed because it is explained in concept and design))
The pre-processing or frontend step is very important. As already explained in Chapter \ref{cha:conceptdesign}, it is responsible for ensuring that the given expressions are valid and that they are transformed into an intermediate representation. This section aims to explain how the intermediate representation is implemented, as well as how it is generated from a mathematical expression.
\subsection{Intermediate Representation}
\label{sec:ir}
% Talk about how it looks and why it was chosen to look like this
The intermediate representation is mainly designed to be lightweight and easily transferable to the GPU. Since the interpreter runs on the GPU, this was a very important consideration. Because the transpilation process is done on the CPU, and is therefore very flexible in terms of the intermediate representation, the focus lay mainly on making it efficient for the interpreter.

The intermediate representation cannot take an arbitrary form. While it has already been defined that expressions are converted to postfix notation, there are several ways to store the data. The first logical choice is to create an array where each entry represents a token. On the CPU, it would be possible to define each entry as a pointer to a token object. Each of these objects could be of a different type, for example one object holding a constant value while another holds an operator. In addition, each of these objects could contain its own logic about what to do when it is encountered during the evaluation process. However, on the GPU this is not possible, as an array entry must hold a value and not a pointer to another memory location. Furthermore, even if it were possible, it would not be a feasible solution. As explained in Section \ref{sec:memory_model}, when loading data from global memory, larger chunks are retrieved at once. If the data is scattered across the GPU's global memory, a lot of unwanted data will be transferred. This can be seen in Figure \ref{fig:excessive-memory-transfer}: if the data is stored sequentially, far fewer memory operations and far less data in general need to be transferred.
\begin{figure}
\centering
\label{fig:excessive-memory-transfer}
\end{figure}
Due to this, and the fact that the GPU does not allow pointers, an alternative approach is required. Rather than storing pointers to objects of different types in an array, it is possible to store objects of a single type. As described in Section \ref{sec:pre-processing}, these objects contain the type of the stored value and the value itself. The four kinds of values that need to be stored in such an object differ significantly in what they represent. The following paragraphs explain how each of them can nevertheless be stored in the same object structure.

Variables and parameters are very simple to store. Because they represent indices into the variable matrix or the parameter vector, this integer index can be stored as is in the value property of the object. The type can then be used to determine whether the index refers to a variable or a parameter.

Constants are also very simple, as they represent a single 32-bit floating point value. However, due to the variables and parameters, the value property is already defined as an integer and not as a floating point number. Unlike in dynamically typed languages such as Python, where the distinction between integers and floating point numbers is largely hidden from the user, in Julia these are distinct types and therefore cannot be stored in the same property. Creating a second property for constants only is not feasible, as it would add four bytes to every object that needs to be sent to the GPU, even though this property would not contain a meaningful value most of the time.

To avoid sending unnecessary bytes, a mechanism provided by Julia called \verb|reinterpret| can be used. It allows the bits of a variable of one type to be treated as the bits of a different type. The bits used to represent a floating point number can thus be interpreted as an integer and stored in the same property. On the GPU, the same concept can be applied to reinterpret the integer value as a floating point value for further calculations. This is also the reason why the original type of the value needs to be stored alongside the value, so that the stored value can be interpreted and the expressions evaluated correctly.
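The following snippet shows this mechanism in isolation. The round trip through \verb|reinterpret| recovers the original constant exactly, since only the interpretation of the bits changes, not the bits themselves.
\begin{verbatim}
value = 2.5f0                             # 32-bit floating point constant
stored = reinterpret(Int32, value)        # the same bits, viewed as an integer
recovered = reinterpret(Float32, stored)  # back to 2.5f0 for the evaluation
\end{verbatim}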
Operators are very different from variables, parameters and constants. Because they represent an operation rather than a value, a different way of storing them is required. An operator can be uniquely mapped to a number that identifies the operation. For example, the addition operator can be mapped to the integer $1$. Consequently, when the evaluator encounters an object of type operator with a value of $1$, it can determine the corresponding operation to perform. This can be done for all operators, which means it is possible to store them in the same object structure: the type must be specified as operator, and the value can be stored without needing to reinterpret it. The mapping of an operator to a value is commonly referred to as an operation code, or opcode, and ensures that each operator is uniquely identifiable.
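Putting these considerations together, the intermediate representation could be defined roughly as follows. The names are illustrative and the actual implementation may differ.
\begin{verbatim}
@enum TokenType CONSTANT VARIABLE PARAMETER OPERATOR

# One token of a postfix expression; every token has the same concrete type,
# so an array of tokens can be transferred to the GPU as one contiguous block.
struct Token
    type::TokenType
    value::Int32   # index, opcode, or the reinterpreted bits of a constant
end

constant_token = Token(CONSTANT, reinterpret(Int32, 2.5f0))
variable_token = Token(VARIABLE, Int32(1))   # refers to the variable x_1
\end{verbatim}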
With this, the intermediate representation is defined. Figure \ref{fig:pre-processing-result-impl} shows how a simple expression would look after the pre-processing step. Note that the bit representation of the value $2.5$ has been reinterpreted as an integer, resulting in a seemingly random value.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{pre-processing_result_impl.png}
\caption{The expression $x_1 + 2.5$ after it has been converted to the intermediate representation. Note that the constant value $2.5$ stores a seemingly random value due to the bits being reinterpreted as an integer.}
\label{fig:pre-processing-result-impl}
\end{figure}
Now that the intermediate representation has been defined, the processing step can be implemented. This section describes the structure of the expressions and how they are processed. It also explains the process of parsing the expressions to ensure their validity and converting them into the intermediate representation.
\subsubsection{Expressions}
With the pre-processing step, the first modern feature of Julia has been used. As already mentioned, Julia provides extensive support for meta-programming, which is important for this step. Julia represents its own code as a data structure, which allows a developer to manipulate the code at runtime. The code is stored in the so-called Expr object as an Abstract Syntax Tree (AST), which is the most minimal tree representation of a given expression. As a result, mathematical expressions can also be represented as such an Expr object instead of a simple string. Which is a major benefit, because these expressions can then be easily manipulated by the symbolic regression algorithm. This is the main reason why the pre-processing step requires the expressions to be provided as an Expr object instead of a string.
|
||||
With the pre-processing step, the first modern feature of Julia has been used. As already mentioned, Julia provides extensive support for meta-programming, which is important for this step. Julia represents its own code as a data structure, which allows a developer to manipulate the code at runtime. The code is stored in the so-called \verb|Expr| object as an Abstract Syntax Tree (AST), which is the most minimal tree representation of a given expression. As a result, mathematical expressions can also be represented as such an \verb|Expr| object instead of a simple string. This is a major benefit, because these expressions can then be easily manipulated by the symbolic regression algorithm. Because of this, the pre-processing step requires the expressions to be provided as an \verb|Expr| object instead of a string.
|
||||
|
||||
Another major benefit of the expressions being stored in the Expr object and therefore as an AST, is the included operator precedence. Because it is a tree where the leaves are the constants, variables or parameters (also called terminal symbols) and the nodes are the operators, the correct result will be calculated when evaluating the tree from bottom to top. As can be seen in Figure \ref{fig:expr-ast}, the expression $1 + x_1 \, \log(p_1)$, when parsed as an AST, contains the correct operator precedence. First the bottom most subtree $\log(p_1)$ must be evaluated before the multiplication, and after that, the addition can be evaluated.
|
||||
Another major benefit of the expressions being stored in the \verb|Expr| object and therefore as an AST, is the included operator precedence. Because it is a tree where the leaves are the constants, variables or parameters (also called terminal symbols) and the nodes are the operators, the correct result will be calculated when evaluating the tree from bottom to top. As can be seen in Figure \ref{fig:expr-ast}, the expression $1 + x_1 \, \log(p_1)$, when parsed as an AST, contains the correct operator precedence. First the bottom most subtree $\log(p_1)$ must be evaluated before the multiplication, and after that, the addition can be evaluated.
It should be noted, however, that Julia stores the tree as a list of arrays to allow a node to have as many children as necessary. For example, the expression $1+2+\dots+n$ contains only additions, and since addition is a commutative operation, the order of operations is irrelevant. The AST for this expression contains the operator at the first position in the array and the values at the following positions. This ensures that the AST is as minimal as possible.
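
To make this more concrete, the following minimal sketch shows how such an expression looks when inspected in Julia; the variable and parameter names \verb|x1| and \verb|p1| are only placeholders for whatever symbols the symbolic regression algorithm actually uses.
\begin{JuliaCode}
# Inspecting how Julia stores an n-ary addition as an Expr object.
ex = :(1 + 2 + 3 + 4)
ex.head   # :call
ex.args   # Any[:+, 1, 2, 3, 4] -- the operator first, then all operands

# Nested sub-expressions are themselves Expr objects:
ex2 = :(1 + x1 * log(p1))
ex2.args  # Any[:+, 1, :(x1 * log(p1))]
\end{JuliaCode}
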
@ -92,18 +88,16 @@ To convert the AST of an expression into the intermediate representation, a top-
\item Return the generated postfix expression/intermediate representation.
\end{enumerate}
The validation of the expression is performed throughout the parsing process. Validating that only correct operators are used is performed in step 1. To be able to convert the operator to its corresponding opcode, it must be validated that an opcode exists for it, and therefore whether it is valid or not. Similarly, converting the tokens into an expression element object ensures that only valid variables and parameters are present in the expression. This is handled in step 2.
The validation of the expression is performed throughout the parsing process. Validating that only supported operators are used is performed in step 1. To convert an operator to its corresponding opcode, it must be checked whether an opcode exists for it, which at the same time determines whether the operator is valid. Similarly, converting the tokens into an expression element object ensures that only variables and parameters in the correct format are present in the expression. This is handled in step 2.
As explained above, a node of a binary operator can have $n$ children. In these cases, additional handling is required to ensure correct conversion. This handling is summarised in step 4. Essentially, the operator must be added after the first two elements, and for each subsequent element, the operator must also be added. The expression $1+2+3+4$ is converted to the AST $+\,1\,2\,3\,4$ and without step 4 the postfix expression would be $1\,2\,3\,4\,+$. If the operator is added after the first two elements and then after each subsequent element, the correct postfix expression $1\,2\,+\,3\,+\,4\,+$ will be generated.
As explained above, a node of a binary operator can have $n$ children. In these cases, additional handling is required to ensure correct conversion. This handling is summarised in step 4. Essentially, the operator must be added after the first two elements and again after each subsequent element. The expression $1+2+3+4$ is converted to the AST $+\,1\,2\,3\,4$ and without step 4 the postfix expression would be $1\,2\,3\,4\,+$. If the operator is added after the first two elements and then after each subsequent element, the correct postfix expression $1\,2\,+\,3\,+\,4\,+$ will be generated.
Each subtree of the AST is its own separate AST, which can be converted to postfix notation in the same way the whole AST can be converted. This means that the algorithm only needs to be able to handle leaf nodes, and when it encounters a subtree, it recursively calls itself to parse the remaining AST. Step 5 indicates this recursive behaviour.
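
The following sketch illustrates steps 4 and 5 of this conversion. It is deliberately simplified: instead of emitting opcodes and expression element objects, it emits the plain symbols and numbers of the AST, so the names and the output format do not correspond to the actual frontend implementation.
\begin{JuliaCode}
# Simplified sketch of the AST-to-postfix conversion.
function to_postfix(ast, postfix = Any[])
    if !(ast isa Expr)                  # leaf node: constant, variable or parameter
        push!(postfix, ast)
        return postfix
    end
    op       = ast.args[1]
    children = ast.args[2:end]
    for (i, child) in enumerate(children)
        to_postfix(child, postfix)      # step 5: recursively convert sub-trees
        # step 4: emit the operator after the first two operands and after each
        # further operand; unary operators are emitted after their single operand
        if i >= 2 || length(children) == 1
            push!(postfix, op)
        end
    end
    return postfix
end

to_postfix(:(1 + 2 + 3 + 4))        # Any[1, 2, :+, 3, :+, 4, :+]
to_postfix(:(1 + x1 * log(p1)))     # Any[1, :x1, :p1, :log, :*, :+]
\end{JuliaCode}
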
While the same expression usually occurs only once, sub-expressions can occur multiple times. In the example in Figure \ref{fig:expr-ast}, the whole expression $1 + x_1 \, \log(p_1)$ is unlikely to be generated more than once by the symbolic regression algorithm. However, the sub-expression $\log(p_1)$ is much more likely to be generated multiple times. This means that the generation of the intermediate representation for this subtree only needs to be done once and can be reused later. Therefore, a cache can be used to store the intermediate representation for this sub-expression and access it again later to eliminate the parsing overhead.
Caching can be applied to both individual sub-expressions as well as the entire expression. While it is unlikely for the whole expression to recur frequently, either as a whole or as part of a larger expression, implementing a cache will not degrade performance and will, in fact, enhance it if repetitions do occur. In the context of parameter optimisation, where the evaluators are employed, expressions will recur, making full-expression caching advantageous. The primary drawback of caching is the increased use of RAM. However, given that RAM is plentiful in modern systems, this should not pose a significant issue.
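
A minimal sketch of this caching idea, reusing the \verb|to_postfix| sketch from above, could look as follows. The hash of the (sub-)expression serves as the cache key; the actual implementation may use a different key or cache type.
\begin{JuliaCode}
# Cache mapping the hash of an expression to its intermediate representation.
const ir_cache = Dict{UInt, Vector{Any}}()

function cached_to_postfix(expr)
    get!(ir_cache, hash(expr)) do
        to_postfix(expr)   # only executed on a cache miss
    end
end
\end{JuliaCode}
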
\section{Interpreter}
The implementation is divided into two main components, the CPU-based control logic and the GPU-based interpreter as outlined in the Concept and Design chapter. This section aims to describe the technical details of these components. First the CPU-based control logic will be discussed. This component handles the communication with the GPU and is the entry point which is called by the symbolic regression algorithm. Following this, the GPU-based interpreter will be explored, highlighting the specifics of developing an interpreter on the GPU.
The implementation of the interpreter is divided into two main components, the CPU-based control logic and the GPU-based interpreter as outlined in the Concept and Design chapter. This section aims to describe the technical details of these components. First the CPU-based control logic will be discussed. This component handles the communication with the GPU and is the entry point which is called by the symbolic regression algorithm. Following this, the GPU-based interpreter will be explored, highlighting the specifics of developing an interpreter on the GPU.
An overview of how these components interact with each other is outlined in Figure \ref{fig:interpreter-sequence}. The parts of this figure are explained in detail in the following sections.
@ -115,14 +109,14 @@ An overview of how these components interact with each other is outlined in Figu
\end{figure}
\subsection{CPU Side}
The interpreter is given all the expressions it needs to interpret as an input. Additionally, it needs the variable matrix as well as the parameters for each expression. All expressions are passed to the interpreter as an array of Expr objects, as they are needed for the pre-processing step or the frontend. The first loop as shown in Figure \ref{fig:interpreter-sequence}, is responsible for sending the expressions to the frontend to be converted into the intermediate representation. After this step, the expressions are in the correct format to be sent to the GPU and the interpretation process can continue.
The interpreter is given all the expressions it needs to interpret as an input. Additionally, it needs the variable matrix as well as the parameters for each expression. All expressions are passed to the interpreter as an array of \verb|Expr| objects, as this is the format required by the pre-processing step, or frontend. The first loop, as shown in Figure \ref{fig:interpreter-sequence}, is responsible for sending the expressions to the frontend to be converted into the intermediate representation. After this step, the expressions are in the correct format to be sent to the GPU and the interpretation process can continue.
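
Conceptually, this first loop simply maps the frontend over all expressions, as the following sketch shows; \verb|to_postfix| again stands in for the actual frontend entry point and the input expressions are illustrative.
\begin{JuliaCode}
# Sketch of the first loop: convert every Expr into its intermediate representation.
expressions = [:(x1 + 2.5), :(1 + x1 * log(p1))]
intermediateExpressions = [to_postfix(expr) for expr in expressions]
\end{JuliaCode}
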
\subsubsection{Data Transfer}
Before the GPU can start with the interpretation, the data needs to be present on it. Because the variables are already in matrix form, transferring the data is fairly straightforward. Memory must be allocated in the global memory of the GPU and then be copied from RAM into the allocated memory. Allocating memory and transferring the data to the GPU is handled implicitly by the CuArray type provided by CUDA.jl.
Before the GPU can start with the interpretation, the data needs to be present on it. Because the variables are already in matrix form, transferring the data is fairly straightforward. Memory must be allocated in the global memory of the GPU and then be copied from RAM into the allocated memory. Allocating memory and transferring the data to the GPU is handled implicitly by the \verb|CuArray| type provided by CUDA.jl.
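
For illustration, the following lines show how such an implicit allocation and transfer could look; the matrix dimensions are placeholders.
\begin{JuliaCode}
using CUDA

variables = rand(Float32, 5, 1000)   # k x N variable matrix (illustrative values)
variables_gpu = CuArray(variables)   # allocates GPU memory and copies the data
\end{JuliaCode}
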
To optimise the interpreter for parameter optimisation workloads, this step is performed before it is called. Although the diagram includes this transmission for completeness, it is important to note that the variables never change, as they represent the observed inputs of the system that is being modelled by the symbolic regression algorithm. As a symbolic regression algorithm is usually implemented with GP, there are many generations that need to be evaluated. Therefore, re-transmitting the variables for each generation is inefficient. By transmitting the variables once before the symbolic regression algorithm begins, additional performance gains are very likely. However, this approach would require modifying the symbolic regression algorithm, which is the reason this optimisation has not been applied. Nonetheless, if needed it is still possible to modify the implementation at a later stage with minimal effort.
Once the variables are transmitted, the parameters also must be transferred to the GPU. Unlike the variables, the parameters are stored as a vector of vectors. In order to transmit the parameters efficiently, they also need to be put in a matrix form. The matrix needs to be of the form $k \times N$, where $k$ is equal to the length of the longest inner vector and $N$ is equal to the length of the outer vector. This ensures that all values can be stored in the matrix. It also means that if the inner vectors are of different lengths, some extra unnecessary values will be transmitted, but the overall benefit of treating them as a matrix outweighs this drawback. The Program \ref{code:julia_vec-to-mat} shows how this conversion can be implemented. Note that it is required to provide an invalid element. This ensures defined behaviour and helps with finding errors in the code. After the parameters have been brought into matrix form, they can be transferred to the GPU the same way the variables are transferred.
Once the variables are transmitted, the parameters must also be transferred to the GPU. Unlike the variables, the parameters are stored as a vector of vectors. In order to transmit the parameters efficiently, they also need to be put in a matrix form. The matrix needs to be of the form $k \times N$, where $k$ is equal to the length of the longest inner vector and $N$ is equal to the length of the outer vector. This ensures that all values can be stored in the matrix. It also means that if the inner vectors are of different lengths, some extra unnecessary values will be transmitted, but the overall benefit of treating them as a matrix outweighs this drawback. Program \ref{code:julia_vec-to-mat} shows how this conversion can be implemented. Note that an invalid element must be provided, which ensures defined behaviour for the padded entries and helps with finding errors in the code. After the parameters have been brought into matrix form, they can be transferred to the GPU the same way the variables are transferred.
\begin{program}
\begin{GenericCode}
@ -144,11 +138,11 @@ end
\label{code:julia_vec-to-mat}
\end{program}
Similar to the parameters, the expressions are also stored as a vector of vectors. The outer vector contains each expression, while the inner vectors hold the expressions in their intermediate representation. Therefore, this vector of vectors also needs to be brought into matrix form following the same concept as the parameters. To simplify development, the special opcode \textit{stop} has been introduced, which is used for the invalidElement in Program \ref{code:julia_vec-to-mat}. As seen in Section \ref{sec:interpreter-gpu-side}, this element is used to determine if the end of an expression has been reached during the interpretation process. This removes the need for additional data to be sent which stores the length of each expression to determine if the entire expression has been interpreted or not. Therefore, a lot of overhead can be reduced.
Similar to the parameters, the expressions are also stored as a vector of vectors. The outer vector contains each expression, while the inner vectors hold the expressions in their intermediate representation. Therefore, this vector of vectors also needs to be brought into matrix form following the same concept as the parameters. To simplify development, the special opcode \textit{stop} has been introduced, which is used for the \verb|invalidElement| in Program \ref{code:julia_vec-to-mat}. As seen in Section \ref{sec:interpreter-gpu-side}, this element is used to determine if the end of an expression has been reached during the interpretation process. This removes the need for additional data to be sent which stores the length of each expression to determine if the entire expression has been interpreted or not. Therefore, a lot of overhead can be reduced.
Once the conversion into matrix form has been performed, the expressions are transferred to the GPU. Just like the variables, the expressions remain the same over the course of the parameter optimisation step, which is why they are transferred to the GPU before the interpreter is called, reducing the number of unnecessary data transfers.
Only raw data can be sent to the GPU, which means that meta information about the data layout is missing. The matrices are represented as flat arrays, which means they have lost their column and row information. This information must be sent separately to let the kernel know the dimensions of the expressions, variables and parameters. Otherwise, the kernel does not know at which memory location the second variable set is stored, as it does not know how large a single set is for example. Figure \ref{fig:memory-layout-data} shows how the data is stored without any information about the rows or columns of the matrices. The thick lines help to identify where a new column, and therefore a new set of data begins. However, the GPU has no knowledge of this and therefore the meta information must be transferred separately to ensure that the data is accessed correctly.
Only raw data can be sent to the GPU, which means that meta information about the data layout is missing. The matrices are represented as flat arrays, which means they have lost their column and row information. This information must be sent separately to inform the kernel about the dimensions of the expressions, variables and parameters. Otherwise, the kernel does not know at which memory location the second variable set is stored for example, as it does not know how large a single set is. Figure \ref{fig:memory-layout-data} shows how the data is stored without any information about the rows or columns of the matrices. The thick lines help to identify where a new column, and therefore a new set of data begins. However, the GPU has no knowledge of this and therefore the meta information must be transferred separately to ensure that the data is accessed correctly.
\begin{figure}
\centering
@ -165,24 +159,22 @@ Once all the data is present on the GPU, the CPU can dispatch the kernel for eac
In addition, the dispatch parameters include pointers to the location of the data allocated and transferred above, as well as the index of the expression to be interpreted. Since all expressions and parameters are sent to the GPU at once, this index ensures that the kernel knows where in memory to find the expression it needs to interpret and which parameter set it needs to use. After the kernel has finished, the result matrix needs to be read from the GPU and passed back to the symbolic regression algorithm.
Crucially, dispatching a kernel is an asynchronous operation, which means that the CPU does not wait for the kernel to finish before continuing. This allows the CPU to dispatch all kernels at once, rather than one at a time. As explained in Section \ref{sec:architecture}, a GPU can have multiple resident grids, meaning that the dispatched kernels can run concurrently, drastically reducing evaluation time. Only once the result matrix is read from the GPU does the CPU have to wait for all kernels to finish execution.
Crucially, dispatching a kernel is an asynchronous operation, which means that the CPU does not wait for the kernel to finish before continuing. This allows the CPU to dispatch all kernels at once, rather than one at a time. As explained in Section \ref{sec:architecture}, a GPU can have multiple resident grids, meaning that the dispatched kernels can run concurrently, reducing evaluation time. Only once the result matrix is read from the GPU does the CPU have to wait for all kernels to finish execution.
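
The following sketch outlines this dispatch loop. It assumes the data transferred in the previous step; the kernel name \verb|interpret_kernel|, its argument list and the \verb|_gpu| variables are placeholders and do not correspond one-to-one to the actual implementation.
\begin{JuliaCode}
# assumes `using CUDA` and the *_gpu arrays created during the data transfer
threadsPerBlock      = 256
numberOfVariableSets = size(variables_gpu, 2)
blocks               = cld(numberOfVariableSets, threadsPerBlock)

for exprIndex in eachindex(expressions)
    # @cuda returns immediately; the kernels execute asynchronously on the GPU
    @cuda threads=threadsPerBlock blocks=blocks interpret_kernel(
        expressions_gpu, variables_gpu, parameters_gpu, results_gpu,
        exprIndex, numberOfVariableSets)
end

# Reading the result matrix implicitly waits for all kernels to finish.
results = Array(results_gpu)
\end{JuliaCode}
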
\subsection{GPU Side}
\label{sec:interpreter-gpu-side}
% Memory access (currently global memory only)
% no dynamic memory allocation like on CPU (stack needs to have fixed size; also stack is stored in local memory)
With the GPU's global memory now containing all the necessary data and the kernel being dispatched, the interpretation process can begin. Before interpreting an expression, the global thread ID must be calculated. This step is crucial because each variable set is assigned to a unique thread. Therefore, the global thread ID determines which variable set should be used for the current interpretation instance.
With the GPU's global memory containing all the necessary data and the kernel being dispatched, the interpretation process can begin. Before interpreting an expression, the global thread ID must be calculated. This step is crucial because each variable set is assigned to a unique thread. Therefore, the global thread ID determines which variable set should be used for the current interpretation instance.
Moreover, the global thread ID ensures that excess threads do not perform any work. Otherwise, these threads would try to access a variable set that does not exist, which would lead to an illegal memory access. This is necessary because the number of required threads often does not align perfectly with the number of threads per block multiplied by the number of blocks. If for example $1031$ threads are required, then at least two thread blocks are needed, as one thread block can hold at most $1024$ threads. Because $1031$ is a prime number, it cannot be divided evenly by any practical number of thread blocks. If two thread blocks are allocated, each holding $1024$ threads, a total of $2048$ threads is started. Therefore, the excess $2048 - 1031 = 1017$ threads must be prevented from executing. By using the global thread ID and the number of available variable sets, these excess threads can be easily identified and terminated early in the kernel execution.
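
A sketch of this guard at the beginning of the kernel is shown below; the argument list mirrors the placeholder dispatch sketch above and is not the actual kernel signature.
\begin{JuliaCode}
using CUDA

function interpret_kernel(expressions, variables, parameters, results,
                          exprIndex, numberOfVariableSets)
    # one-based global thread ID, also used as the index of the variable set
    varSetIndex = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if varSetIndex > numberOfVariableSets
        return   # excess thread: there is no variable set for it to work on
    end
    # ... interpretation of the expression for this variable set ...
    return
end
\end{JuliaCode}
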
Afterwards the stack for the interpretation can be created. It is possible to dynamically allocate memory on the GPU, which enables a similar programming model as on the CPU. \textcite{winter_are_2021} have even compared many dynamic memory managers and found, that the performance impact of them is rather small. However, if it is easily possible to use static allocations, it still offers better performance. In the case of this thesis, it is easily possible which is the reason why the stack has been chosen to have a static size. Because it is known that expressions do not exceed 50 tokens, including the operators, the stack size has been set to 25, which should be more than enough to hold the values and partial results, even in the worst case.
Afterwards, the stack for the interpretation can be created. It is possible to dynamically allocate memory on the GPU, which enables a similar programming model as on the CPU. \textcite{winter_are_2021} have compared many dynamic memory managers and found that their performance impact is rather small. However, if it is easily possible to use static allocations, they still offer better performance. In the case of this thesis, static allocation is easily possible, which is why the stack has been chosen to have a static size. Because it is known that expressions do not exceed 50 tokens, including the operators, the stack size has been set to ten, which should be more than enough to hold the values and partial results, even in the worst case. It is very unlikely that ten values must be stored before a binary operator is encountered that reduces the number of values on the stack. Therefore, a stack size of ten should be sufficient; however, it is possible to increase the stack size if needed.
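
One way to realise such a statically sized stack is with the \verb|MVector| type of the StaticArrays.jl package, as sketched below; whether the actual implementation uses this type or another fixed-size construct is an implementation detail.
\begin{JuliaCode}
using StaticArrays

# Fixed-size stack with a manually managed top-of-stack index.
stack    = MVector{10, Float32}(undef)
stackTop = 0

# push two operands
stackTop += 1; stack[stackTop] = 2.5f0
stackTop += 1; stack[stackTop] = 4.0f0

# a binary operator pops both operands and pushes the result
right = stack[stackTop]; stackTop -= 1
left  = stack[stackTop]; stackTop -= 1
stackTop += 1; stack[stackTop] = left + right   # 6.5f0
\end{JuliaCode}
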
\subsubsection{Main Loop}
Once everything is initialised, the main interpreter loop starts interpreting the expression. Because of the intermediate representation, the loop simply iterates through the expression from left to right. On each iteration, the type of the current token is checked to decide which operation to perform.
If the current token type matches the \textit{stop} opcode, the interpreter knows that it is finished. This simplicity is the reason why this opcode was introduced, as explained above.
If the current token type matches the \textit{stop} opcode, the interpreter knows that it is finished. This simplicity is the reason why this opcode was introduced, as mentioned above.
More interestingly is the case, where the current token corresponds to an index to either the variable matrix, or the parameter matrix. In this case, the token's value is important. To access one of these matrices, the correct starting index of the set must first be calculated. As previously explained, information about the dimensions of the data is lost during transfer. At this stage, the kernel only knows the index of the first element of either matrix, which set to use for this evaluation, and the index of the value within the current set. However, the boundaries of these sets are unknown. Therefore, the additionally transferred data about the dimensions is used in this step to calculate the index of the first element in each set. With this calculated index and the index stored in the token, the correct value can be loaded. After the value has been loaded, it is pushed to the top of the stack for later use.
More interesting is the case where the current token corresponds to an index into either the variable matrix or the parameter matrix. In this case, the token's value is important. To access one of these matrices, the correct starting index of the set must first be calculated. As previously explained, information about the dimensions of the data is lost during transfer. At this stage, the kernel only knows the index of the first element of either matrix, which set to use for this evaluation, and the index of the value within the current set. However, the boundaries of these sets are unknown. Therefore, the additionally transferred data about the dimensions is used in this step to calculate the index of the first element in each set. With this calculated index and the index stored in the token, the correct value can be loaded by adding the token value to the index of the first element of the set. After the value has been loaded, it is pushed to the top of the stack for later use.
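
The index calculation itself is a simple offset computation into the flat, column-major data, as the following sketch shows; all names are illustrative placeholders.
\begin{JuliaCode}
# Computes the position of one value inside the flat variable (or parameter) array.
function load_value(data, valuesPerSet, setIndex, tokenValue)
    firstIndexOfSet = (setIndex - 1) * valuesPerSet
    return data[firstIndexOfSet + tokenValue]
end

# Example: 3 variables per set, second set, variable x_2
flatVariables = Float32[1, 2, 3,    # set 1
                        4, 5, 6]    # set 2
load_value(flatVariables, 3, 2, 2)  # returns 5.0f0
\end{JuliaCode}
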
% MAYBE:
% Algorithm that shows how this calculation works
@ -208,12 +200,12 @@ An overview of how the transpiler interacts with the frontend and GPU is outline
\end{figure}
\subsection{CPU Side}
After the transpiler has received the expressions to be transpiled, it first sends them to the frontend for processing. Once an expression has been processed, it is sent to the transpiler backend which is explained in more detail Section \ref{sec:transpiler-backend}. The backend is responsible for generating the kernels. When finished, each expression is transpiled into its own kernels written in PTX code.
After the transpiler has received the expressions to be transpiled, it first sends them to the frontend for processing. Once an expression has been processed, it is sent to the transpiler backend, which is explained in more detail in Section \ref{sec:transpiler-backend}. The backend is responsible for generating the kernels. When finished, each expression is transpiled into its own kernel written in PTX code.
\subsubsection{Data Transfer}
Data is sent to the GPU in the same way it is sent in the interpreter. The variables are sent as they are, while the parameters are again brought into matrix form. Memory must also be allocated for the result matrix. Unlike the interpreter however, only the variables and parameters need to be sent to the GPU. The variables are again sent before the parameter optimisation step to reduce the number of data transfers.
Because each expression has its own kernel, there is no need to transfer the expressions themselves. Moreover, there is also no need to send information about the layout of the variables and parameters to the GPU. The reason for this is explained in the transpiler backend section below.
Because each expression is represented by its own kernel, there is no need to transfer the expressions themselves. Moreover, there is also no need to send information about the layout of the variables and parameters to the GPU. The reason for this is explained in the transpiler backend section below.
\subsubsection{Kernel Dispatch}
Once all the data is present on the GPU, the transpiled kernels can be dispatched. Dispatching the transpiled kernels is more involved than dispatching the interpreter kernel. Program \ref{code:julia_dispatch-comparison} shows the difference between dispatching the interpreter kernel and the transpiled kernels. An important note is that the transpiled kernels must be manually compiled into machine code. To achieve this, CUDA.jl provides functionality to instruct the driver to compile the PTX code. The same process of creating PTX code and compiling it must also be done for the interpreter kernel; however, this is done automatically by CUDA.jl when calling the \verb|@cuda| macro in line 6.
@ -249,7 +241,7 @@ end \end{JuliaCode}
\label{code:julia_dispatch-comparison}
\end{program}
Similar to the interpreter, the frontend and backend are executed before the parameter optimisation step to improve the runtime. Each kernel is compiled into machine code after it has been generated to ensure, as little work as possible needs to be done during the parameter optimisation loop. However, as will be explained in Chapter \ref{cha:evaluation}, storing the compiled kernels is very memory intensive. This means that if many expressions need to be evaluated at once, a lot of memory will be required.
Similar to the interpreter, the frontend and backend are executed before the parameter optimisation step to improve the runtime. Each kernel is compiled into machine code after it has been generated, to ensure that as little work as possible needs to be done during the parameter optimisation loop. However, as will be explained in Chapter \ref{cha:evaluation}, storing the compiled kernels is very memory intensive. This means that if many expressions need to be evaluated at once, a lot of memory is required.
After all kernels have been dispatched, the CPU waits for the kernels to complete their execution. Once the kernels have finished, the result matrix is read from global memory into system memory. The results can then be returned to the symbolic regression algorithm.
@ -269,7 +261,7 @@ PTX assumes a register machine, which means that a developer has to work with a
\subsubsection{Register Management}
Register management is a crucial part of the transpiler as it is important to balance register usage with occupancy and performance. \textcite{aho_compilers_2006, cooper_engineering_2022} describe techniques for efficient register management, especially for machines with few registers and register usage by convention on the CPU. On the GPU however, there are many more registers available, all of which can be used as needed without restrictions.
To allow for maximum occupancy and avoid spilling registers into local memory, the transpiler tries to reuse as many registers as possible. Furthermore, allocating and using a register in PTX is very similar to using variables in code, as they represent virtual registers. Therefore, much of the complexity of managing registers is handled by the PTX compiler of the driver.
To allow for maximum occupancy and avoid spilling registers into local memory, the transpiler tries to reuse as many registers as possible. Furthermore, allocating and using a register in PTX is very similar to using variables in high level code, as they represent virtual registers. Therefore, much of the complexity of managing registers is handled by the PTX compiler of the driver.
Because much of the complexity of managing registers is hidden by the compiler, or does not apply in this scenario, the register management is implemented in a very simple way. If a register is needed at any point in the transpilation process, it can be requested from the register manager. A register must be given a name, and the manager uses this name to determine the type of the register. For example, if the name of the register is \verb|f|, it is assumed to be an FP32 register. Several naming conventions exist to ensure that the register is of the correct data type. The manager then returns the identifying name of the register, which is used to access it. The identifying name is the name given as an input, followed by a zero-based number that is incremented by one for each successive call.
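
The following sketch captures this idea of the register manager: one counter per register name, incremented on every request. The concrete types and names in the transpiler may differ.
\begin{JuliaCode}
struct RegisterManager
    counters::Dict{String, Int}
    RegisterManager() = new(Dict{String, Int}())
end

function request_register!(manager::RegisterManager, name::String)
    number = get(manager.counters, name, 0)   # zero-based numbering
    manager.counters[name] = number + 1
    return "%$(name)$(number)"
end

manager = RegisterManager()
request_register!(manager, "f")   # "%f0" -- FP32 by naming convention
request_register!(manager, "f")   # "%f1"
\end{JuliaCode}
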
@ -316,8 +308,6 @@ It needs to be noted, that the register \verb|%r2| is not needed. Since the tran
\subsubsection{Main Loop}
The main loop of the transpiler, which generates the kernel for evaluating a single expression, is analogous to the interpreter's main loop. Since the transpiler uses the same intermediate representation as the interpreter, both loops behave similarly. The transpiler loop also uses a stack to store the values and intermediate results. However, the transpiler does not require the special opcode \textit{stop} which was necessary in the interpreter to handle expressions padded to fit into a matrix. The transpiler only needs to process a single expression, which is stored in an unpadded vector of known length. This means that all tokens within the vector are valid and therefore do not require this opcode.
% MAYBE : activity diagram for this loop (also add to interpreter main loop section (would maybe fit better in concept and design so basically move the algorithms of C&D here and add activity diagram to C&D ))
When the loop encounters a token that represents an index to either the variable or the parameter matrix, the transpiler needs to generate code to load these values. In the general case, this works in exactly the same way as the interpreter, calculating the index and accessing the matrices at that location.
However, the first time a variable or parameter is accessed, it must be loaded from global memory. Although registers already exist that hold a pointer to the address of the matrices in global memory, the data is still not accessible. To make it accessible, the index to the value must first be calculated in the same way as it is calculated in the interpreter. Afterwards the value must be loaded into a register with the instruction \verb|ld.global.f32 %reg1, [%reg2]|. Using the first register of the instruction, the data can be accessed. For example, if the variable $x_1$ is accessed several times, all subsequent calls only need to reference this register and do not need to load the data from global memory again.
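
A sketch of this load-once, reuse-afterwards behaviour is shown below. The cache maps a value (here identified by a simple string) to the register that already holds it; the register numbering and the emitted instruction follow the conventions described above, while all other names are placeholders.
\begin{JuliaCode}
struct LoadCache
    loaded::Dict{String, String}    # e.g. "x1" => "%f0"
    code::IOBuffer                  # buffer collecting the generated PTX
    counter::Base.RefValue{Int}
    LoadCache() = new(Dict{String, String}(), IOBuffer(), Ref(0))
end

function load_once!(cache::LoadCache, name::String, addressRegister::String)
    get!(cache.loaded, name) do
        register = "%f$(cache.counter[])"
        cache.counter[] += 1
        # the load from global memory is only emitted on the first access
        println(cache.code, "ld.global.f32 $register, [$addressRegister];")
        register
    end
end

cache = LoadCache()
load_once!(cache, "x1", "%rd4")   # emits the load and returns "%f0"
load_once!(cache, "x1", "%rd4")   # returns "%f0" without a second load
\end{JuliaCode}
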
BIN
thesis/main.pdf
Binary file not shown.