After all kernels have been dispatched, the CPU waits for the kernels to complete their execution. When the kernels have finished, the result matrix is read from global memory into system memory. The results can then be returned to the symbolic regression algorithm.
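
A minimal sketch of this host-side flow with CUDA.jl could look as follows (\verb|results_gpu| is an assumed name for the result matrix in global memory, not the actual implementation):

\begin{JuliaCode}
CUDA.synchronize()           # wait for all dispatched kernels to finish
results = Array(results_gpu) # copy the result matrix into system memory
\end{JuliaCode}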

\subsection{Transpiler Backend}
\label{sec:transpiler-backend}

% describe the transpilation process

The transpiler backend is responsible for creating a kernel from an expression in its intermediate representation. Transpiling an expression is split into the following parts:

\begin{itemize}
\item Register management
\item Generating the header and kernel entry point
\item Ensuring that only the requested number of threads performs work
\item Generating the code for evaluating the expression
\end{itemize}

PTX assumes a register machine, which means that a developer has to work with a limited number of registers. The transpiler must therefore define a strategy for managing these registers. The second and third parts are comparatively simple and can be considered overhead. The last part is the main part of the generated kernel. It contains the code for loading variables and parameters, evaluating the expression and storing the result in the result matrix. All parts are explained in the following sections.

\subsubsection{Register Management}
Register management is a crucial part of the transpiler, as it is important to balance register usage with occupancy and performance. \textcite{aho_compilers_2006, cooper_engineering_2022} describe techniques for efficient register management, especially for machines with few registers and with register usage governed by convention on the CPU. On the GPU, however, many more registers are available, all of which can be used as needed.

To allow for maximum occupancy and to avoid spilling registers into local memory, the transpiler tries to re-use as many registers as possible. Furthermore, in PTX, allocating and using a register is very similar to using variables in code, as they do not represent physical registers. Therefore, much of the complexity of managing registers is handled by the PTX compiler.

Because much of the complexity of managing registers is hidden by the compiler, or does not apply in this scenario, the register manager is implemented in a very simple way. If at any point in the transpilation process a register is required, it can be requested from the register manager. A register must be given a name, and the manager uses this name to determine the data type of the register. For example, if the name of the register is \verb|f|, it is assumed to be an FP32 register. Several naming conventions exist to ensure that a register has the correct data type. The manager then returns the identifying name of the register, which is used for accessing it. The identifying name consists of the name given as input and a zero-based number that is incremented by one with each consecutive call.

PTX requires registers to be defined before they are used. Therefore, after the transpiler has finished generating the code, the registers must be defined at the top of the kernel. As the manager has kept track of the registers used, it can generate the code that allocates and defines them. If the kernel used only five FP32 registers, the manager would generate the code \verb|.reg .f32 %f<5>;|, which allocates and defines the registers \verb|%f0| through \verb|%f4|.
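
The following is a minimal sketch of how such a register manager could be implemented in Julia; the names \verb|RegisterManager|, \verb|request_register!| and \verb|generate_definitions|, as well as the exact naming conventions, are illustrative and not taken from the actual implementation:

\begin{JuliaCode}
struct RegisterManager
    counters::Dict{String,Int} # name => registers requested so far
end
RegisterManager() = RegisterManager(Dict{String,Int}())

# Request a register; the zero-based counter yields names like "%f0", "%f1", ...
function request_register!(manager::RegisterManager, name::String)
    number = get(manager.counters, name, 0)
    manager.counters[name] = number + 1
    return "%$name$number"
end

# Generate the register definitions placed at the top of the kernel
function generate_definitions(manager::RegisterManager)
    types = Dict("f" => ".f32", "r" => ".b32", "rd" => ".b64", "p" => ".pred")
    return join((".reg $(types[name]) %$name<$count>;"
                 for (name, count) in manager.counters), "\n")
end
\end{JuliaCode}

Requesting an \verb|f| register five times returns \verb|%f0| through \verb|%f4|, and \verb|generate_definitions| afterwards produces exactly the \verb|.reg .f32 %f<5>;| line shown above.
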
\subsubsection{Header and Entry Point}

Each PTX program must begin with certain directives in order to compile and use the program correctly. The first directive must be the \verb|.version| directive. It indicates for which PTX version the code has been written, ensuring that it is compiled with the correct tools in the correct version. Following the \verb|.version| directive is the \verb|.target| directive, which specifies the target hardware architecture.
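
Generating this header amounts to little more than string interpolation, as the following sketch shows (the function name and the concrete version numbers are illustrative):

\begin{JuliaCode}
# Emit the directives every generated PTX kernel must start with
function generate_header(ptx_version::String, target_arch::String)
    return ".version $ptx_version\n.target $target_arch\n"
end

generate_header("8.0", "sm_61") # ".version 8.0\n.target sm_61\n"
\end{JuliaCode}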

Once these directives have been added to the generated code, the entry point of the kernel can be generated. It contains the name of the kernel as well as all parameters that are passed to it, such as the pointers to the variable, parameter and result matrices. The kernel name is important, as the CPU requires it to dispatch the kernel.

While generating the entry point, the PTX code for loading the parameters into the kernel is also generated. This removes the need to iterate over the kernel parameters a second time. Loading the parameters into the kernel is required, as it is not possible to address these values directly. \textcite{nvidia_parallel_2025} states that addresses in the parameter state space can only be accessed via the \verb|ld.param| instruction. Furthermore, since all three matrices are stored in global memory, each parameter address needs to be converted from the parameter state space to the global state space using the \verb|cvta.to.global.datatype| instruction.
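
The following sketch shows how the entry point and the parameter-loading code could be generated in one pass. It reuses the register manager sketched above; the function and parameter names are again illustrative:

\begin{JuliaCode}
function generate_entry_point!(manager, kernel_name::String, params::Vector{String})
    signature = join((".param .u64 $p" for p in params), ", ")
    code = ".visible .entry $kernel_name($signature)\n{\n"
    for p in params
        param_reg  = request_register!(manager, "rd")
        global_reg = request_register!(manager, "rd")
        # the parameter state space can only be read via ld.param;
        # cvta.to.global converts the pointer to the global state space
        code *= "ld.param.u64 $param_reg, [$p];\n"
        code *= "cvta.to.global.u64 $global_reg, $param_reg;\n"
    end
    return code
end

generate_entry_point!(manager, "Expression", ["param_1", "param_2", "param_3"])
\end{JuliaCode}
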
\subsubsection{Guard Clause}

As previously explained in Section \ref{sec:interpreter-gpu-side}, the guard clause ensures that excess threads do not take part in the evaluation. The following code shows what this guard clause looks like when the kernel is written with Julia and CUDA.jl:

\begin{JuliaCode}
function my_kernel(nrOfVarSets::Int32)
    threadId = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if threadId > nrOfVarSets
        return
    end
    # remaining kernel
end
\end{JuliaCode}

This can be translated into the following PTX code fragment:

\begin{PTXCode}
mov.u32 %r3, %ntid.x;  // r3 = blockDim().x
mov.u32 %r4, %ctaid.x; // r4 = blockIdx().x (zero-based in PTX)
mov.u32 %r5, %tid.x;   // r5 = threadIdx().x (zero-based in PTX)

mad.lo.s32 %r1, %r3, %r4, %r5; // r1 = r3 * r4 + r5 (zero-based thread ID)
setp.ge.s32 %p1, %r1, %r2;     // p1 = r1 >= r2 (r2 = nrOfVarSets)
@%p1 bra End;

// remaining kernel

End:
ret;
\end{PTXCode}

It should be noted that the register \verb|%r2| is not actually needed. Because the transpiler already knows the number of variable sets, transmitting it to the kernel would be wasteful. Instead, the transpiler inserts this number directly as a constant to save resources.
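
A sketch of how the transpiler could emit this guard clause, with the number of variable sets inlined as a constant, is shown below. It again builds on the register manager sketch; all names are illustrative:

\begin{JuliaCode}
function generate_guard_clause!(code::IOBuffer, manager, nrOfVarSets::Int)
    ntid     = request_register!(manager, "r")
    ctaid    = request_register!(manager, "r")
    tid      = request_register!(manager, "r")
    threadId = request_register!(manager, "r")
    pred     = request_register!(manager, "p")
    println(code, "mov.u32 $ntid, %ntid.x;")
    println(code, "mov.u32 $ctaid, %ctaid.x;")
    println(code, "mov.u32 $tid, %tid.x;")
    println(code, "mad.lo.s32 $threadId, $ctaid, $ntid, $tid;")
    # nrOfVarSets is known at transpile time, so it is inlined
    # instead of being passed as a kernel parameter
    println(code, "setp.ge.s32 $pred, $threadId, $nrOfVarSets;")
    println(code, "@$pred bra End;")
end
\end{JuliaCode}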
\subsubsection{Evaluation}

The equivalent of the interpreter's main loop is the loop that generates the code for evaluating the expression. Because the transpiler uses the same intermediate representation as the interpreter, this loop works very similarly. It also uses a stack for storing the values and the intermediate results. Unlike in the interpreter, the special opcode \textit{stop} is no longer needed. As only a single expression needs to be transpiled, it is stored in an unpadded vector, and therefore all tokens inside the vector are valid. This opcode was only introduced for the interpreter, as the expressions might get padded to fully fit in the matrix.

% MAYBE: activity diagram for this loop (also add it to the interpreter main loop section)

When the loop encounters a token that represents an index into either the variable or the parameter matrix, the transpiler needs to generate code for loading these values. In the general case, this works exactly the same as in the interpreter.

However, the first time a variable or parameter is accessed, it needs to be loaded from global memory. While a register already exists that holds a pointer to the address of the matrices in global memory, the data itself is not yet accessible. To make it accessible, the index of the value must first be calculated in the same way it is calculated in the interpreter. Afterwards, the value must be loaded into a register with the instruction \verb|ld.global.f32 %reg1, [%reg2]|. Using the first register of the instruction, the data can then be accessed. If, for example, the variable $x_1$ is accessed multiple times, all subsequent accesses only need to reference this register and do not load the data from global memory again.
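
The following sketch illustrates this caching behaviour. The helper \verb|emit_index_calculation!| stands in for the index computation shared with the interpreter, and all names are assumptions rather than the actual implementation:

\begin{JuliaCode}
# One cache per kernel: (kind, index) => register already holding the value
function emit_load!(code::IOBuffer, manager, cache::Dict, token)
    get!(cache, (token.kind, token.index)) do
        register = request_register!(manager, "f")
        address = emit_index_calculation!(code, manager, token)
        println(code, "ld.global.f32 $register, [$address];")
        register # later accesses to this operand reuse the register
    end
end
\end{JuliaCode}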

When the loop encounters a token that represents an operator, it is handled in the same way as in the interpreter. However, instead of computing the result, the transpiler generates the instruction for the operator, and the register that will hold the result is pushed onto the stack, where it can be consumed as an operand by subsequent operators.
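
Put together, the code-generation loop could be sketched as follows, with operand tokens handled by \verb|emit_load!| from above and \verb|ptx_instruction| as an assumed lookup from opcode to PTX instruction (only binary operators are shown for brevity):

\begin{JuliaCode}
function generate_evaluation!(code::IOBuffer, manager, tokens)
    stack = String[] # registers holding operands and intermediate results
    cache = Dict{Tuple{Symbol,Int},String}()
    for token in tokens
        if token.is_operand # index into the variable or parameter matrix
            push!(stack, emit_load!(code, manager, cache, token))
        else                # operator: pop the operands, push the result register
            right  = pop!(stack)
            left   = pop!(stack)
            result = request_register!(manager, "f")
            println(code, "$(ptx_instruction(token)) $result, $left, $right;")
            push!(stack, result)
        end
    end
    return pop!(stack) # register holding the final result
end
\end{JuliaCode}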

Once the loop has finished, the stack contains exactly one register, which holds the result of the evaluated expression. Analogous to the interpreter, the transpiler finally generates the code that stores this value at the correct location in the result matrix using the \verb|st.global.f32| instruction.

\subsection{GPU Side}
On the GPU, the transpiled kernels are simply executed. Because the kernels contain almost no branching and little overhead work, the GPU does not need to perform many operations. As can be seen in Program \ref{code:ptx_loop}, the kernel for the expression $x_1 + p_1$ is very straightforward, with only two load operations, the addition and the storing of the result in the result matrix. In fact, the kernel is a one-to-one mapping of the expression, with the only overhead being the guard clause and the loading of the variable and the parameter.

\begin{program}
\begin{GenericCode}
.version 8.0
.target sm_61

.visible .entry Expression(.param .u64 param_1, // variable matrix
		.param .u64 param_2,  // parameter matrix
		.param .u64 param_3)  // result matrix
{
	.reg .pred %p<1>; .reg .b32 %r<4>;
	.reg .f32 %f<3>;  .reg .b64 %rd<8>;

	mov.u32 %r0, %ntid.x;           // guard clause, as shown above
	mov.u32 %r1, %ctaid.x;
	mov.u32 %r2, %tid.x;
	mad.lo.s32 %r3, %r1, %r0, %r2;
	setp.ge.s32 %p0, %r3, 10000;    // number of variable sets inlined
	@%p0 bra End;

	ld.param.u64 %rd0, [param_1];   // variable matrix pointer
	cvta.to.global.u64 %rd1, %rd0;
	ld.param.u64 %rd2, [param_2];   // parameter matrix pointer
	cvta.to.global.u64 %rd3, %rd2;
	ld.param.u64 %rd4, [param_3];   // result matrix pointer
	cvta.to.global.u64 %rd5, %rd4;

	mul.wide.s32 %rd6, %r3, 4;      // byte offset of this variable set
	add.u64 %rd7, %rd1, %rd6;
	ld.global.f32 %f0, [%rd7];      // load x1
	ld.global.f32 %f1, [%rd3];      // load p1
	add.f32 %f2, %f0, %f1;          // x1 + p1
	add.u64 %rd7, %rd5, %rd6;
	st.global.f32 [%rd7], %f2;      // store the result
End:
	ret;
}
\end{GenericCode}
\caption{A PTX kernel for the expression $x_1 + p_1$ (illustrative sketch; the concrete names, offsets and constants depend on the transpiler).}
\label{code:ptx_loop}
\end{program}

The kernel begins with the \verb|.version| and \verb|.target| directives, followed by the entry point, which receives the pointers to the three matrices as parameters. After the guard clause, these pointers are loaded from the parameter state space and converted to the global state space. The variable $x_1$ and the parameter $p_1$ are then loaded from global memory and added together, and the result is written back to the result matrix at the offset of the current variable set.

% Front-End and Back-End
% Caching of back-end results