concept and design: minor improvements
commit 5a9760d221 (parent c68e0d04a0)
It would also have been possible to perform the pre-processing step on the GPU.
\label{fig:component_diagram_interpreter}
\end{figure}
The interpreter consists of two parts. The CPU side is the part of the program that interacts with both the GPU and the caller. An overview of the components and the workflow of the interpreter can be seen in Figure \ref{fig:component_diagram_interpreter}. Once the interpreter receives the expressions, they are pre-processed. This step ensures that the expressions are valid and transforms them into the intermediate representation needed for evaluating them. The results of this pre-processing are then sent to the GPU, which performs the actual interpretation of the expressions. Alongside the expressions, the data for the variables and parameters also needs to be sent to the GPU. Once all the data resides on the GPU, the interpreter kernel can be dispatched. Note that a separate kernel dispatch is performed for each expression. As already described, this decision was made to reduce thread divergence and therefore increase performance. In fact, dispatching the same kernel multiple times with different expressions means that no thread divergence occurs at all, as explained later. Once the GPU has finished evaluating all expressions with all variable sets, the results are stored in a matrix on the GPU. The CPU then retrieves the results and returns them to the caller in the format specified by the requirements.
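The dispatch structure described above can be illustrated with a minimal, GPU-free Julia sketch. All names and the result-matrix layout here are assumptions for illustration only: one "kernel dispatch" per expression, with results collected in a matrix holding one row per variable set and one column per expression.

```julia
# CPU-only sketch of the interpreter workflow (hypothetical names):
# in the real implementation the inner loop is performed by GPU threads.
expressions   = [vars -> vars[1] + vars[2], vars -> vars[1] * 2.0]
variable_sets = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

results = Matrix{Float64}(undef, length(variable_sets), length(expressions))
for (j, expr) in enumerate(expressions)        # separate dispatch per expression
    for (i, vars) in enumerate(variable_sets)  # on the GPU, each iteration is one thread
        results[i, j] = expr(vars)
    end
end
```

In the actual GPU implementation, the inner loop is replaced by the threads of the dispatched kernel, and the result matrix resides in GPU memory until all kernels have finished.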
% somewhere here explain why thread divergence doesn't occur
Evaluating the expressions is relatively straightforward. Since the expressions are in postfix notation, the interpreter only needs to iterate over all tokens once and perform the appropriate action for each. If the interpreter encounters a binary operator, it simply reads the previous two values and performs the operation specified by the operator. For unary operators, only the previous value must be read. As already mentioned, expressions in postfix notation implicitly contain the operator precedence, therefore no look-ahead or other strategies are needed to ensure correct evaluation. Algorithm \ref{alg:eval_interpreter} shows how the interpreter works. Note that this is a simplified version that only supports additions, multiplications, constant values and variables.
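To complement the pseudocode, such a stack-based postfix evaluator can be sketched in Julia. The token representation is an assumption for illustration; as in the simplified algorithm, only constants, variables, addition and multiplication are supported.

```julia
# Minimal sketch of the stack-based postfix interpreter (token layout assumed):
# each token is a (type, value) pair; variables are referenced by index.
function evaluate_postfix(tokens, variables)
    stack = Float64[]
    for (type, value) in tokens
        if type == :constant
            push!(stack, value)
        elseif type == :variable
            push!(stack, variables[value])       # value is the variable index
        elseif type == :operator
            right = pop!(stack)
            left  = pop!(stack)
            push!(stack, value == :add ? left + right : left * right)
        end
    end
    return pop!(stack)                           # final stack entry is the result
end

# x1 * 2.0 + x2 in postfix: x1 2.0 * x2 +
tokens = [(:variable, 1), (:constant, 2.0), (:operator, :mul),
          (:variable, 2), (:operator, :add)]
evaluate_postfix(tokens, [3.0, 4.0])             # 3.0 * 2.0 + 4.0
```

Because postfix order already encodes operator precedence, a single left-to-right pass with one stack suffices; the value remaining on the stack at the end is the result.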
\begin{algorithm}
\caption{Interpreting an equation in postfix-notation}\label{alg:eval_interpreter}
\State $\textit{token} \gets \text{GetNextToken}(\textit{expr})$
\If{$\textit{token.Type} = \text{Constant}$}
    \State Push($\textit{stack}$, $\textit{token.Value}$)
\ElsIf{$\textit{token.Type} = \text{Variable}$}
    \State Push($\textit{stack}$, GetVariable($\textit{token.Value}$))
\ElsIf{$\textit{token.Type} = \text{Operator}$}
    \If{$\textit{token.Value} = \text{Addition}$}
        \State $\textit{right} \gets \text{Pop}(\textit{stack})$
Before the expressions can be transpiled into PTX code, they need to be pre-processed.
\State $\textit{token} \gets \text{GetNextToken}(\textit{expr})$
\If{$\textit{token.Type} = \text{Constant}$}
    \State Push($\textit{stack}$, $\textit{token.Value}$)
\ElsIf{$\textit{token.Type} = \text{Variable}$}
    \State ($\textit{codeFragment}, \textit{referenceToValue}$) $\gets$ GetVariable($\textit{token.Value}$)
    \State Push($\textit{stack}$, $\textit{referenceToValue}$)
    \State Append($\textit{code}$, $\textit{codeFragment}$)
\ElsIf{$\textit{token.Type} = \text{Operator}$}
    \If{$\textit{token.Value} = \text{Addition}$}
        \State $\textit{right} \gets \text{Pop}(\textit{stack})$
\end{algorithmic}
\end{algorithm}
The code generated by the transpiler is the kernel for the transpiled expressions. This means that a new kernel must be generated for each expression that needs to be evaluated. This is in contrast to the interpreter, which has one kernel that is dispatched once per expression. However, generating one kernel per expression results in a much simpler kernel, which can focus on evaluating the postfix expression from left to right. No overhead work, such as branching or managing a stack, is needed. However, this overhead is now offloaded to the transpilation step on the CPU, as can be seen in Algorithm \ref{alg:transpile}. There is also a noticeable overhead in that a kernel has to be generated for each expression. In cases like parameter optimisation, many of the expressions will be transpiled multiple times, as the transpiler is called repeatedly with the same expressions.
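The single left-to-right pass of the transpiler can be sketched in Julia as follows. Instead of computing values, it pushes register names onto the stack and emits one instruction per token. This is a hedged illustration: the mnemonics are simplified stand-ins rather than actual PTX, and all names are assumptions.

```julia
# Sketch of the transpilation pass: same postfix walk as the interpreter,
# but the stack holds register names and each token emits an instruction.
function transpile(tokens)
    code  = String[]
    stack = String[]
    reg   = 0
    fresh() = (reg += 1; "%r$reg")               # allocate a fresh register name
    for (type, value) in tokens
        if type == :constant
            r = fresh()
            push!(code, "mov $r, $value")
            push!(stack, r)
        elseif type == :variable
            r = fresh()
            push!(code, "load $r, var[$value]")  # value is the variable index
            push!(stack, r)
        elseif type == :operator
            right = pop!(stack)
            left  = pop!(stack)
            r = fresh()
            push!(code, "$(value == :add ? "add" : "mul") $r, $left, $right")
            push!(stack, r)
        end
    end
    return code
end

transpile([(:variable, 1), (:constant, 2.0), (:operator, :mul)])
```

Because each operator consumes two register names and produces a fresh one, the emitted instruction sequence evaluates the postfix expression with no runtime stack and no branching left in the kernel itself.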
Both the transpiler and the interpreter have their respective advantages and disadvantages. While the interpreter puts less load on the CPU, the GPU has to perform more work. Much of this work consists of branching and managing a stack, and therefore involves many instructions that are not used to evaluate the expression itself. However, this can be mitigated by the fact that all of this overhead is incurred in parallel rather than sequentially.