concept and design: added transpiler section
parent 9e1094ac43 · commit 20fcbab4ca
@@ -22,9 +22,9 @@ NOTE: All 64-Bit values will be converted to 32-Bit. Be aware of the lost precision
 "
 function expr_to_postfix(expr::Expr)::PostfixType
     postfix = PostfixType()
-    operator = get_operator(expr.args[1])
+    @inbounds operator = get_operator(expr.args[1])

-    for j in 2:length(expr.args)
+    @inbounds for j in 2:length(expr.args)
        arg = expr.args[j]

        if typeof(arg) === Expr
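As a quick illustration of the @inbounds annotation added above: it removes bounds checks for array accesses in the annotated statement or loop, and is only safe when the indices are valid by construction. A minimal sketch with hypothetical data, not taken from the commit:

    # Hedged sketch: @inbounds skips bounds checks; the range 2:length(values)
    # is in bounds by construction, matching the loop pattern in the diff above.
    function sum_tail(values::Vector{Float32})
        total = 0.0f0
        @inbounds for j in 2:length(values)
            total += values[j]
        end
        return total
    end

    sum_tail(Float32[1, 2, 3])   # 5.0f0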
@@ -198,8 +198,7 @@ function generate_calculation_code(expression::ExpressionProcessing.PostfixType,
     exprId64Reg = Utils.get_next_free_register(regManager, "rd")
     println(codeBuffer, "mov.u64 $exprId64Reg, $expressionIndex;")

-    for i in eachindex(expression)
-        token = expression[i]
+    for token in expression

        if token.Type == FLOAT32
            push!(operands, reinterpret(Float32, token.Value))
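For context on the reinterpret call above: token values are evidently stored bit-for-bit in an integer field, so a Float32 constant is recovered exactly rather than numerically converted. A small illustration (the token layout itself is an assumption):

    bits  = reinterpret(UInt32, 1.5f0)    # 0x3fc00000, the raw IEEE-754 bits
    value = reinterpret(Float32, bits)    # 1.5f0, recovered exactly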
@@ -179,4 +179,8 @@ end

 REDO @inbounds performance tests because I added more @inbounds and removed unneeded code from the interpreter
+Also updated expression processing and the transpiler
+
+After these tests have been redone, use Nsight Compute/Systems as described here:
+#https://cuda.juliagpu.org/stable/development/profiling/#NVIDIA-Nsight-Systems
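A hedged sketch of the profiling flow referenced in the note above, following the linked CUDA.jl documentation: start Julia under Nsight Systems (nsys launch julia) and delimit the region of interest; run_interpreter_kernel is a hypothetical stand-in for the actual dispatch. Depending on the CUDA.jl version, CUDA.@profile may need external=true to defer to the external profiler.

    using CUDA

    CUDA.@profile begin          # region Nsight Systems should capture
        run_interpreter_kernel() # hypothetical: dispatch the interpreter kernel
        CUDA.synchronize()       # ensure the GPU work finishes inside the region
    end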
@@ -74,7 +74,7 @@ Evaluating the expressions is relatively straight forward. Due to the expression
 \caption{Interpreting an equation in postfix-notation}\label{alg:eval_interpreter}
 \begin{algorithmic}[1]
 \Procedure{Evaluate}{\textit{expr}: PostfixExpression}
-    \State $\textit{stack} \gets ()$
+    \State $\textit{stack} \gets []$

     \While{HasTokenLeft(\textit{expr})}
         \State $\textit{token} \gets \text{GetNextToken}(\textit{expr})$
@@ -84,11 +84,11 @@ Evaluating the expressions is relatively straight forward. Due to the expression
             \If{$\textit{token.Value} = \text{Addition}$}
                 \State $\textit{right} \gets \text{Pop}(\textit{stack})$
                 \State $\textit{left} \gets \text{Pop}(\textit{stack})$
-                \State Push(stack, $\textit{left} + \textit{right}$)
+                \State Push($\textit{stack}$, $\textit{left} + \textit{right}$)
             \ElsIf{$\textit{token.Value} = \text{Multiplication}$}
                 \State $\textit{right} \gets \text{Pop}(\textit{stack})$
                 \State $\textit{left} \gets \text{Pop}(\textit{stack})$
-                \State Push(stack, $\textit{left} * \textit{right}$)
+                \State Push($\textit{stack}$, $\textit{left} * \textit{right}$)
             \EndIf
         \EndIf
     \EndWhile
@@ -100,9 +100,7 @@ Evaluating the expressions is relatively straight forward. Due to the expression

 If a new operator is needed, it must simply be added as another else-if block inside the operator branch. New token types, such as variables or parameters, can also be added with a new outer else-if block that checks for them. However, the pre-processing step also needs to be extended with these new operators and token types; otherwise, the expressions would be treated as invalid and never reach the evaluation step. It is also possible to add unary operators such as $\log()$. In that case only one value is read from the stack, the operation is applied, and the result is written back to the stack.

-The Algorithm \ref{alg:eval_interpreter} in this case resembles the kernel. This kernel will be dispatched for every expression that needs to be evaluated, to eliminate thread divergence. Thread divergence can only happen on data dependent branches. In this case, the while loop and every if and else-if statement contains a data dependent branch. Depending on the expression passed to the kernel, the while loop may run longer than for another expression. Similarly, not all expressions have the same constants, operators and variables in the same order and would therefore lead to each thread, taking different paths. However, one expression, always has the same constants, operators and variables in the same locations, meaning all threads will take the same paths. This also means that despite the interpreter containing many data dependent branches, these branches only depend on the expression itself. Because of this, all threads will take the same paths and therefore will never diverge from one another.
-
-% explain why thread convergence does not happen here
+Algorithm \ref{alg:eval_interpreter} in this case resembles the kernel. This kernel is dispatched once for every expression that needs to be evaluated, to eliminate thread divergence. Thread divergence can only happen on data-dependent branches; here, the while loop and every if and else-if statement contain such a branch. Depending on the expression passed to the kernel, the while loop may run longer than for another expression. Similarly, expressions differ in their constants, operators and variables, which would lead to each thread taking a different path. A single expression, however, always has the same constants, operators and variables in the same locations, so all threads take the same path. This means that although the interpreter contains many data-dependent branches, these branches depend only on the expression itself. Because of this, all threads will take the same path and therefore never diverge from one another if they execute the same expression.

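To ground Algorithm \ref{alg:eval_interpreter} and the extension point described above, a hedged CPU-side Julia sketch of the evaluation loop follows, including a hypothetical unary log branch. The Token layout and the enum names are assumptions, not the thesis' actual definitions:

    # Hedged sketch of Algorithm alg:eval_interpreter; all names are assumed.
    @enum TokenType FLOAT32 OPERATOR
    @enum Op ADD MUL LOG

    struct Token
        Type::TokenType
        Value::UInt32                  # constant bits or operator code
    end

    function evaluate(expr::Vector{Token})::Float32
        stack = Float32[]
        for token in expr
            if token.Type == FLOAT32
                push!(stack, reinterpret(Float32, token.Value))
            elseif token.Type == OPERATOR
                op = Op(token.Value)
                if op == ADD
                    right = pop!(stack); left = pop!(stack)
                    push!(stack, left + right)
                elseif op == MUL
                    right = pop!(stack); left = pop!(stack)
                    push!(stack, left * right)
                elseif op == LOG           # unary: pops one value, pushes one
                    push!(stack, log(pop!(stack)))
                end
            end
        end
        return pop!(stack)                 # final stack entry is the result
    end

    # 2.5 + 1.5 in postfix: [2.5, 1.5, ADD]
    tokens = [Token(FLOAT32, reinterpret(UInt32, 2.5f0)),
              Token(FLOAT32, reinterpret(UInt32, 1.5f0)),
              Token(OPERATOR, UInt32(ADD))]
    evaluate(tokens)                       # 4.0f0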
 \subsection{Transpiler}
|
|||
\label{fig:component_diagram_transpiler}
|
||||
\end{figure}
|
||||
|
||||
Similar to the interpreter, the transpiler also consists of a part that is running on the CPU side and one that is running on the GPU side. When looking at the component and workflow of the transpiler as seen in Figure \ref{fig:component_diagram_transpiler}, it is almost identical to the interpreter. However, the key difference between these two, is the additional code generation, or transpilation step.
|
||||
Similar to the interpreter, the transpiler also consists of a part that runs on the CPU and a part that runs on the GPU. When looking at the component and workflow of the transpiler, as shown in Figure \ref{fig:component_diagram_transpiler}, it is almost identical to the interpreter. However, the key difference between the two, is the additional code generation, or transpilation step. Apart from that, the transpiler also needs the same pre-processing step and also the GPU to evaluate the expressions. However, the GPU evaluator generated by the transpiler works differently to the GPU evaluator for the interpreter. The difference between these evaluators will be explained later.
|
||||
|
||||
% explain the differences of interpreter and transpiler
|
||||
% explain how the transpilation process works
|
||||
% also add algorithm to further show the process
|
||||
+Before the expressions can be transpiled into PTX code, they need to be pre-processed. As described above, this step ensures the validity of the expressions and transforms them into the intermediate representation. As with the interpreter, this simplifies the code generation step at the cost of some performance, because the intermediate representation has to be generated first. In this case, however, the benefit of a simple code generation step outweighed the performance cost. Since the expressions are transformed into postfix notation, code generation follows a similar pattern to the interpretation described above. Algorithm \ref{alg:transpile} shows how the transpiler takes an expression, transpiles it and returns the finished code. The while loop is the same as in the interpreter; the main difference lies in the operator branches. Because code now needs to be generated, each branch calls its designated code generation function, such as $\textit{GenerateAddition}$. Such a function cannot return only the code that performs, for example, the addition: when executed, the addition also produces a value that other operators need as input. Therefore, not only the code fragment must be returned, but also the variable in which the result is stored. This variable is pushed onto the stack for later use, and the code fragment is appended to the code generated so far. As with the interpreter, one final value remains on the stack when the loop has finished. Once the generated code is executed, this value is the variable holding the result of the expression. It needs to be stored in the results matrix so that the CPU can retrieve it after all expressions have been executed on the GPU. Therefore, one last code fragment is generated to store this value in the results matrix. This fragment is appended to the generated code, and the transpilation process is complete.
+
+% at the end probably, talk why each expression has its own kernel -> because the GPU now only evaluates the expression and no branches are needed -> results in fewer overhead instructions (no branch instructions) -> however, this overhead is now on the CPU. This is the reason both need to be explored to find out if there are performance differences between them
+\begin{algorithm}
+\caption{Transpiling an equation in postfix-notation}\label{alg:transpile}
+\begin{algorithmic}[1]
+\Procedure{Transpile}{\textit{expr}: PostfixExpression}: String
+    \State $\textit{stack} \gets []$
+    \State $\textit{code} \gets$ ""
+
+    \While{HasTokenLeft(\textit{expr})}
+        \State $\textit{token} \gets \text{GetNextToken}(\textit{expr})$
+        \If{$\textit{token.Type} = \text{Constant}$}
+            \State Push($\textit{stack}$, $\textit{token.Value}$)
+        \ElsIf{$\textit{token.Type} = \text{Operator}$}
+            \If{$\textit{token.Value} = \text{Addition}$}
+                \State $\textit{right} \gets \text{Pop}(\textit{stack})$
+                \State $\textit{left} \gets \text{Pop}(\textit{stack})$
+                \State $(\textit{valueLocation}, \textit{codeFragment}) \gets \text{GenerateAddition}(\textit{left}, \textit{right})$
+                \State Push($\textit{stack}$, $\textit{valueLocation}$)
+                \State Append($\textit{code}$, $\textit{codeFragment}$)
+            \ElsIf{$\textit{token.Value} = \text{Multiplication}$}
+                \State $\textit{right} \gets \text{Pop}(\textit{stack})$
+                \State $\textit{left} \gets \text{Pop}(\textit{stack})$
+                \State $(\textit{valueLocation}, \textit{codeFragment}) \gets \text{GenerateMultiplication}(\textit{left}, \textit{right})$
+                \State Push($\textit{stack}$, $\textit{valueLocation}$)
+                \State Append($\textit{code}$, $\textit{codeFragment}$)
+            \EndIf
+        \EndIf
+    \EndWhile
+
+    \State $\textit{codeFragment} \gets \text{GenerateResultStoring}(\text{Pop}(\textit{stack}))$
+    \State Append($\textit{code}$, $\textit{codeFragment}$)
+
+    \State \Return $\textit{code}$
+\EndProcedure
+\end{algorithmic}
+\end{algorithm}
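To make the contract of the code generation functions concrete, here is a hedged Julia sketch of one operator branch; the register manager API is modelled on the get_next_free_register call visible in the diff further above, and the register class "f" is an assumption:

    # Hedged sketch: returns both the register holding the result
    # (valueLocation) and the PTX fragment computing it (codeFragment),
    # mirroring the tuple used in Algorithm alg:transpile.
    function generate_addition(left::String, right::String, regManager)
        resultReg    = Utils.get_next_free_register(regManager, "f")
        codeFragment = "add.f32 $resultReg, $left, $right;"
        return (resultReg, codeFragment)
    end

Pushing resultReg back onto the stack is what lets later operators consume this intermediate result, exactly as the algorithm describes.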
+
+The code generated by the transpiler is the kernel for the transpiled expression. This means a new kernel must be generated for each expression that needs to be evaluated, in contrast to the interpreter, which has one kernel that is dispatched once per expression. Generating one kernel per expression, however, results in a much simpler kernel that can evaluate the postfix expression from left to right without performing any branching. The branching is still needed, but it is offloaded to the CPU. There is also a noticeable overhead in that a kernel has to be generated for each expression. In scenarios like parameter optimisation, many expressions are transpiled multiple times, because the transpiler is called repeatedly with the same expressions. Strategies such as caching can be used to improve performance in these cases.
+
+Both the transpiler and the interpreter have their respective advantages and disadvantages. While the interpreter puts less load on the CPU, the GPU has to perform more work. Much of this work consists of branching and therefore involves many instructions that do not contribute to evaluating the expression itself. However, all of this work is done in parallel rather than sequentially on the CPU.
+
+The transpiler, on the other hand, performs more work on the CPU. Its kernels are much simpler, and most of their instructions evaluate the expression itself. Because the CPU, unlike the GPU, can manage state, concepts such as caches can be employed to reduce the overhead on the CPU. This means that unnecessary work can be avoided in certain scenarios, such as parameter optimisation.
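As a hedged illustration of the caching idea mentioned above, a minimal memoisation sketch; the cache key and the transpile entry point are hypothetical:

    # Hedged sketch: reuse transpiled PTX for expressions seen before, so
    # repeated calls during parameter optimisation skip re-transpilation.
    const KERNEL_CACHE = Dict{String, String}()   # expression -> PTX code

    function transpile_cached(expr::Expr)::String
        get!(KERNEL_CACHE, string(expr)) do
            transpile(expr_to_postfix(expr))      # assumed entry points
        end
    end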
+
 % \section{Interpreter}
@@ -4,4 +4,6 @@
 Summarise the results

 \section{Future Work}
 talk about what can be improved
+
+Transpiler: transpile expressions directly from the Julia AST -> would save time because no intermediate representation needs to be created (loses a step and gains performance, but also makes the transpiler itself more complex)
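A hedged sketch of that future-work idea: recurse over the Julia Expr tree directly instead of first building the postfix representation. For brevity this emits infix pseudo-code rather than PTX; all names are hypothetical:

    # Hedged sketch: code generation straight from the AST, no postfix step.
    function transpile_ast(node)::String
        node isa Number && return string(Float32(node))   # constant
        node isa Symbol && return string(node)            # variable
        if node isa Expr && node.head == :call
            op   = node.args[1]
            args = map(transpile_ast, node.args[2:end])
            return "(" * join(args, " $op ") * ")"
        end
        error("unsupported node: $node")
    end

    transpile_ast(:(x + 2.0 * y))   # "(x + (2.0 * y))"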
@@ -14,6 +14,21 @@ Talk about why this needs to be done and how it is done (the why is basically: s
 \section{Interpreter}
 Talk about how the interpreter has been developed.
+
+UML activity diagram
+
+main loop; the kernel is compiled to PTX by CUDA.jl and then executed
+
+Memory access (currently global memory only)
+no dynamic memory allocation like on the CPU (the stack needs to have a fixed size; see the sketch below)
+
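A hedged sketch of the fixed-size stack constraint noted above: GPU kernels cannot allocate dynamically, so the interpreter's stack would be a fixed-capacity buffer with a manual top index. Sizes and names are assumptions:

    using CUDA, StaticArrays

    const MAX_STACK_SIZE = 32   # assumed bound on expression depth

    # Hedged sketch: a fixed-capacity stack inside a kernel, hard-coding
    # the postfix evaluation of "2.0 3.0 +" as an example.
    function interpret_kernel(results)
        stack = MVector{MAX_STACK_SIZE, Float32}(undef)
        top = 0
        top += 1; stack[top] = 2.0f0                         # push constant
        top += 1; stack[top] = 3.0f0                         # push constant
        stack[top-1] = stack[top-1] + stack[top]; top -= 1   # add
        results[1] = stack[top]                              # store result
        return nothing
    end

    results = CUDA.zeros(Float32, 1)
    @cuda threads=1 interpret_kernel(results)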
 \section{Transpiler}
-Talk about how the transpiler has been developed
+Talk about how the transpiler has been developed (probably the largest section, because it simply has more interesting parts)
+
+UML activity diagram
+
+Front-end and back-end
+Caching of back-end results
+
+PTX code generated and compiled using CUDA.jl (so basically the driver) and then executed
+
+Memory access (global memory and register management, especially register management)
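For the note above about PTX being compiled and executed through CUDA.jl, a hedged sketch of the driver-API path; the kernel entry name and signature are assumptions:

    using CUDA

    ptx = transpile_cached(:(x + 2.0))   # hypothetical, see the caching sketch above
    md  = CuModule(ptx)                  # JIT-compile the PTX string
    fn  = CuFunction(md, "Evaluator")    # assumed kernel entry name

    results = CUDA.zeros(Float32, 1)
    cudacall(fn, (CuPtr{Float32},), results; threads=1)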
BIN thesis/main.pdf (binary file not shown)