implementation: continued writing transpiler section

Daniel 2025-05-04 13:54:05 +02:00
parent 18d89e27ca
commit b69a3efe96
5 changed files with 134 additions and 15 deletions


@@ -266,6 +266,7 @@ function load_into_register(register::String, loadLocation::String, valueIndex::
mad.lo.u64 $tempReg, $setIndexReg64, $(setSize*BYTES), $((valueIndex - 1) * BYTES);
add.u64 $tempReg, $loadLocation, $tempReg;
ld.global.f32 $register, [$tempReg];"
# TODO: This is not the most efficient way. If possible, the index of the set should be calculated only once instead of multiple times, as it is here.
end
function type_to_ptx_type(type::DataType)::String


@@ -256,16 +256,111 @@ end \end{JuliaCode}
After all kernels have been dispatched, the CPU waits for the kernels to complete their execution. When the kernels have finished, the result matrix is read from global memory into system memory. The results can then be returned to the symbolic regression algorithm.
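A minimal sketch of this step using CUDA.jl could look like the following; the variable names and the launch configuration are assumptions made for illustration purposes:
\begin{JuliaCode}
using CUDA

# Dispatch all compiled kernels; kernelFunctions is assumed to hold the
# CuFunction objects obtained from the compiled PTX code.
for kernel in kernelFunctions
    CUDA.cudacall(kernel, (CuPtr{Float32}, CuPtr{Float32}, CuPtr{Float32}),
        cudaVars, cudaParams, cudaResults; threads=threadsPerBlock, blocks=nrOfBlocks)
end

CUDA.synchronize()           # wait for all kernels to finish
results = Array(cudaResults) # copy the result matrix into system memory
\end{JuliaCode}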
\subsection{Transpiler Backend}
\label{sec:transpiler-backend}
The transpiler backend is responsible for creating a kernel from an expression in its intermediate representation. Transpiling an expression is split into the following parts:
\begin{itemize}
\item Register management
\item Generating the header and kernel entry point
\item Ensuring that only the requested number of threads performs work
\item Generating the code for evaluating the expression
\end{itemize}
PTX assumes a register machine, which means that a developer has to work with a limited number of registers. This also means that the transpiler must define a strategy for managing these registers. The second and third parts are rather simple and can be considered overhead. Finally, the last part is the main part of the generated kernel. It contains the code to load variables and parameters, evaluate the expression and store the result in the result matrix. All parts will be explained in the following sections.
\subsubsection{Register Management}
Register management is a crucial part of the transpiler, as it is important to balance register usage with occupancy and performance. \textcite{aho_compilers_2006, cooper_engineering_2022} describe techniques for efficient register management, especially for machines with few registers and with conventions governing register usage on the CPU. On the GPU, however, many more registers are available, all of which can be used as needed.
To allow for maximum occupancy and to avoid spilling registers into local memory, the transpiler tries to re-use as many registers as possible. Furthermore, in PTX, allocating and using a register is very similar to using variables in code, as they do not represent physical registers. Therefore, much of the complexity of managing registers is handled by the PTX compiler.
Because much of this complexity is hidden by the compiler, or does not apply in this scenario, the implementation is very simple. If at any point in the transpilation process a register is required, it can be requested from the register manager. A register must be given a name, and the manager uses this name to determine the data type of the register. For example, if the name of the register is \verb|f|, it is assumed to be an FP32 register. Several naming conventions exist to ensure that a register has the correct data type. The manager then returns the identifying name of the register, which is used for accessing it. The identifying name is the name given as input plus a zero-based number that is incremented by one for each consecutive call.
PTX requires registers to be defined before they are used. Therefore, after the transpiler has finished generating the code, the registers must be defined at the top of the kernel. As the manager has kept track of the registers used, it can generate the code to allocate and define them. If the kernel used only five FP32 registers, the manager would generate the code \verb|.reg .f32 %f<5>;|, which allocates and defines the registers \verb|%f0| through \verb|%f4|.
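The following Julia fragment sketches how such a register manager could be implemented; the type and function names are illustrative and not taken from the actual implementation:
\begin{JuliaCode}
# A minimal register-manager sketch (all names are illustrative assumptions).
mutable struct RegisterManager
    usedRegisters::Dict{String,Int} # name prefix -> number of registers requested
end
RegisterManager() = RegisterManager(Dict{String,Int}())

# Requests a register with the given name, e.g. "f" for FP32, and returns its
# identifying name, e.g. "%f0" on the first call and "%f1" on the second.
function request_register!(manager::RegisterManager, name::String)::String
    number = get(manager.usedRegisters, name, 0)
    manager.usedRegisters[name] = number + 1
    return "%$name$number"
end

# Generates the definitions that must appear at the top of the kernel,
# e.g. ".reg .f32 %f<5>;" when five FP32 registers were requested.
function register_definitions(manager::RegisterManager)::String
    ptxTypes = Dict("f" => ".f32", "r" => ".u32", "rd" => ".u64", "p" => ".pred")
    return join((".reg $(ptxTypes[name]) %$name<$count>;" for (name, count) in manager.usedRegisters), "\n")
end
\end{JuliaCode}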
\subsubsection{Header and Entry Point}
Each PTX program must begin with certain directives in order to compile and use the program correctly. The first directive must be the \verb|.version| directive. It indicates which PTX version the code is written for, ensuring that it is compiled with tools of the correct version. The \verb|.version| directive is followed by the \verb|.target| directive, which specifies the target hardware architecture.
Once these directives have been added to the generated code, the entry point of the kernel can be generated. It contains the name of the kernel, as well as all parameters that are passed to it, such as the pointers to the variable, parameter and result matrices. The kernel name is important, as the CPU requires it to dispatch the kernel.
While generating the entry point, the PTX code for loading the parameters into the kernel is also generated. This removes the need to iterate over the kernel parameters a second time. Loading the parameters into the kernel is required, as it is not possible to address these values directly. \textcite{nvidia_parallel_2025} states that addresses in the parameter state space can only be accessed via the \verb|ld.param| instruction. Furthermore, since all three matrices are stored in global memory, the parameter addresses need to be converted from the parameter state space to the global state space using the \verb|cvta.to.global.datatype| instruction.
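As an illustration, the following fragment sketches how the header and entry point could be generated, re-using the register manager sketched above; the function name as well as the \verb|.version| and \verb|.target| values are examples, not the actual implementation:
\begin{JuliaCode}
# Sketch of header and entry-point generation (illustrative names and values).
function generate_header_and_entry(kernelName::String, nrOfMatrices::Int, manager::RegisterManager)::String
    code = ".version 8.5\n.target sm_61\n.address_size 64\n\n"
    params = join((".param .u64 param_$i" for i in 1:nrOfMatrices), ", ")
    code *= ".visible .entry $kernelName($params)\n{\n"

    # Load every pointer and convert it from the parameter state space to the
    # global state space, so the matrices can later be accessed via ld.global.
    for i in 1:nrOfMatrices
        paramReg = request_register!(manager, "rd")
        globalReg = request_register!(manager, "rd")
        code *= "ld.param.u64 $paramReg, [param_$i];\n"
        code *= "cvta.to.global.u64 $globalReg, $paramReg;\n"
    end
    return code
end
\end{JuliaCode}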
\subsubsection{Guard Clause}
As previously explained in Section \ref{sec:interpreter-gpu-side}, the guard clause ensures that all excess threads do not take part in the evaluation. The following code shows how this guard clause looks if the kernel is written with Julia and CUDA.jl:
\begin{JuliaCode}
function my_kernel(nrOfVarSets::Int32)
    threadId = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if threadId > nrOfVarSets
        return
    end
    # remaining kernel
end
\end{JuliaCode}
This can be translated into the following PTX code fragment:
\begin{PTXCode}
mov.u32 %r3, %ntid.x;          // r3 = blockDim().x
mov.u32 %r4, %ctaid.x;         // r4 = blockIdx().x - 1 (zero-based)
mov.u32 %r5, %tid.x;           // r5 = threadIdx().x - 1 (zero-based)
mad.lo.s32 %r1, %r3, %r4, %r5; // r1 = r3 * r4 + r5
setp.ge.s32 %p1, %r1, %r2;     // p1 = r1 >= r2 (r2 = nrOfVarSets)
@%p1 bra End;
// remaining kernel
End:
ret;
\end{PTXCode}
It should be noted that the register \verb|r2| is not actually needed. Because the transpiler already knows the number of variable sets, it would be wasteful to transmit this number to the kernel. Instead, the transpiler inserts it directly as a literal to save resources.
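A minimal sketch of how the comparison could be emitted, with the number of variable sets interpolated directly into the PTX string (function and register names are illustrative):
\begin{JuliaCode}
# Sketch of emitting the guard clause; the number of variable sets is
# inserted as a literal instead of being passed as a kernel parameter.
function generate_guard_clause(threadIdReg::String, nrOfVarSets::Int, manager::RegisterManager)::String
    predicateReg = request_register!(manager, "p")
    return "
    setp.ge.s32 $predicateReg, $threadIdReg, $nrOfVarSets;
    @$predicateReg bra End;"
end
\end{JuliaCode}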
\subsubsection{Evaluation}
The equivalent of the interpreter's main loop is the loop that generates the code for evaluating the expression. Because the transpiler uses the same intermediate representation as the interpreter, this loop works very similarly. It also uses a stack for storing the values and intermediate results. Unlike in the interpreter, the special opcode \textit{stop} is no longer needed. As only a single expression is transpiled at a time, it is stored in an unpadded vector, and therefore all tokens inside the vector are valid. This opcode was only introduced for the interpreter, as the expressions might get padded to fully fit in the matrix.
% MAYBE : activity diagram for this loop (also add to interpreter main loop section)
When the loop encounters a token that represents an index into either the variable or the parameter matrix, the transpiler needs to generate code for loading these values. In the general case, this works exactly the same as in the interpreter.
However, the first time a variable or parameter is accessed, it must be loaded from global memory. While a register already holds a pointer to the address of the matrix in global memory, the data itself is not yet accessible. To make it accessible, the index of the value is first calculated in the same way as in the interpreter. Afterwards, the value is loaded into a register with the instruction \verb|ld.global.f32 %reg1, [%reg2]|. The data can then be accessed through the first register of this instruction. If, for example, the variable $x_1$ is accessed multiple times, all subsequent accesses only need to reference this register and do not load the data from global memory again.
When an operator token is encountered, the transpiler proceeds in the same way as the interpreter. However, instead of evaluating the operation, it generates the corresponding instruction and pushes the register that will hold the result onto the stack. Once the loop has finished, the stack contains exactly one register, which holds the final result of the expression. As a last step, the transpiler generates the code that stores the value of this register at the correct position in the result matrix.
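The following Julia fragment sketches this loop; the token layout, the opcodes and the \verb|emit_*| helpers are assumptions made for this illustration, not the actual implementation:
\begin{JuliaCode}
# Sketch of the code-generation loop (token layout, opcodes and the emit_*
# helpers are illustrative assumptions).
@enum Opcode VARIABLE PARAMETER ADD MUL
struct Token
    opcode::Opcode
    index::Int # column in the variable/parameter matrix; unused for operators
end

# Hypothetical helpers emitting the PTX described above.
emit_load_variable(reg, index) = "// index calculation + ld.global.f32 into $reg\n"
emit_store_result(reg) = "// index calculation + st.global.f32 of $reg\n"

function generate_evaluation(tokens::Vector{Token}, manager::RegisterManager)::String
    code = ""
    operands = String[]                    # stack of registers holding intermediate results
    variableRegisters = Dict{Int,String}() # variable index -> register, for re-use

    for token in tokens
        if token.opcode == VARIABLE
            # load from global memory only on the first access, re-use afterwards
            register = get!(variableRegisters, token.index) do
                reg = request_register!(manager, "f")
                code *= emit_load_variable(reg, token.index)
                reg
            end
            push!(operands, register)
        elseif token.opcode == ADD # remaining operators are handled analogously
            right = pop!(operands)
            left = pop!(operands)
            result = request_register!(manager, "f")
            code *= "add.f32 $result, $left, $right;\n"
            push!(operands, result)
        end
    end

    # the only remaining operand holds the final result
    return code * emit_store_result(pop!(operands))
end
\end{JuliaCode}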
\subsection{GPU Side}
On the GPU, the transpiled kernels are simply executed. Because the kernels contain almost no branching and little other overhead work, the GPU does not need to perform many additional operations. As can be seen in Program \ref{code:ptx_example}, the kernel for the expression $x_1 + p_1$ is very straightforward, consisting of only two load operations, the addition, and the storing of the result in the result matrix. In fact, the kernel is a one-to-one mapping of the expression, with only the overhead of the guard clause and of loading the variable and parameter.
\begin{program}
\begin{PTXCode}
// .version, .target and .address_size directives omitted
.visible .entry Evaluator(
    .param .u64 param_1, // variable matrix
    .param .u64 param_2, // parameter matrix
    .param .u64 param_3) // result matrix
{
    .reg .pred %p<1>;
    .reg .f32 %f<3>;
    .reg .u32 %r<4>;
    .reg .u64 %rd<9>;

    // guard clause; the number of variable sets is inserted as a literal
    mov.u32 %r0, %ntid.x;
    mov.u32 %r1, %ctaid.x;
    mov.u32 %r2, %tid.x;
    mad.lo.s32 %r3, %r0, %r1, %r2;
    setp.ge.s32 %p0, %r3, 10000;
    @%p0 bra End;

    // convert the matrix pointers to the global state space
    ld.param.u64 %rd0, [param_1];
    cvta.to.global.u64 %rd1, %rd0;
    ld.param.u64 %rd2, [param_2];
    cvta.to.global.u64 %rd3, %rd2;
    ld.param.u64 %rd4, [param_3];
    cvta.to.global.u64 %rd5, %rd4;

    cvt.u64.u32 %rd6, %r3;     // index of this thread's variable set
    mul.lo.u64 %rd7, %rd6, 4;  // byte offset (one FP32 value per set)
    add.u64 %rd8, %rd1, %rd7;
    ld.global.f32 %f0, [%rd8]; // load x1
    ld.global.f32 %f1, [%rd3]; // load p1
    add.f32 %f2, %f0, %f1;     // x1 + p1
    add.u64 %rd8, %rd5, %rd7;
    st.global.f32 [%rd8], %f2; // store the result

End:
    ret;
}
\end{PTXCode}
\caption{A sketch of the PTX kernel for the expression $x_1 + p_1$. The kernel name, the register numbering and the number of variable sets are illustrative.}
\label{code:ptx_example}
\end{program}
The kernel begins with the guard clause, in which the number of variable sets appears as a literal inserted by the transpiler. The pointers to the three matrices are then loaded and converted to the global state space. Afterwards, the variable $x_1$ of this thread's variable set and the parameter $p_1$ are loaded into registers, their sum is computed, and the result is stored at this thread's position in the result matrix.
% Front-End and Back-End
% Caching of back-end results


@@ -160,28 +160,30 @@ While in most cases a GPU can be programmed in a higher level language like C++
PTX defines a virtual machine with its own instruction set architecture (ISA) and is designed for data-parallel processing on a GPU. It is an abstraction of the underlying hardware instruction set, allowing PTX code to be portable across Nvidia GPUs. In order for PTX code to be usable on the GPU, the driver is responsible for compiling the code to the hardware instruction set of the GPU it is run on. A developer typically writes a kernel in CUDA using C++, for example, and the Nvidia compiler generates the PTX code for that kernel. This PTX code is then compiled by the driver when it is executed. The concepts for programming the GPU with PTX and CUDA are the same, apart from the terminology, which is slightly different. For consistency, the CUDA terminology will continue to be used.
Syntactically, PTX is similar to assembler-style code. Every PTX program must have a \verb|.version| directive, which indicates the PTX version, immediately followed by the \verb|.target| directive, which indicates the compute capability. If the program needs 64-bit addresses instead of the default 32-bit addresses, the optional \verb|.address_size| directive can be used to indicate this. Using 64-bit addresses enables the developer to access more than 4 GB of memory, but also increases register usage, as a 64-bit address must be stored in two registers.
After these directives, the actual code is written. As each PTX program needs an entry point (the kernel), the \verb|.entry| directive indicates the name of the kernel and the parameters it needs. It is also possible to write helper functions with the \verb|.func| directive. Inside the kernel or a helper function, normal PTX code can be written. Because PTX is very low level, it assumes an underlying register machine, and therefore a developer needs to think about register management. This includes loading data from global or shared memory into registers if needed. Code for manipulating data, like addition and subtraction, generally follows the structure \verb|operation.datatype|, followed by up to four parameters for that operation. For adding two FP32 values together and storing them in the register \%n, the code looks like the following:
\begin{GenericCode}[numbers=none]
add.f32 %n, 0.1, 0.2;
\end{GenericCode}
Loops in the classical sense do not exist in PTX. Instead, a developer needs to define jump targets for the beginning and end of the loop. The program in \ref{code:ptx_loop} shows how a function with a simple loop can be implemented. The loop counts down to zero from the passed parameter $N$, which is loaded into the register \%n in line 6. If the value in the register \%n has reached zero, the loop branches at line 9 to the jump target at line 12 and is finished. All other directives and further information on writing PTX code can be found in the PTX documentation \parencite{nvidia_parallel_2025}.
\begin{program}
\begin{PTXCode}
.func loop(.param .u32 N)
{
    .reg .u32 %n;
    .reg .pred %p;

    ld.param.u32 %n, [N];
Loop:
    setp.eq.u32 %p, %n, 0;
    @%p bra Done;
    sub.u32 %n, %n, 1;
    bra Loop;
Done:
}
\end{PTXCode}
\caption{A PTX program fragment depicting how loops can be implemented.}
\label{code:ptx_loop}
\end{program}


@@ -137,8 +137,7 @@ keepspaces=true,%
% Language Definition and Code Environment for Julia
\lstdefinelanguage{Julia}{
alsoletter={.},
keywords={if, for, continue, break, end, else, true, false, @cuda, return, function},
keywordstyle=\color{blue},
sensitive=true,
morestring=[b]",
@@ -156,6 +155,28 @@ keepspaces=true,%
#1}}%
{}
% Language Definition and Code Environment for PTX
\lstdefinelanguage{PTX}{
alsoletter={.},
morekeywords={mov.u32, mad.lo.s32, setp.ge.s32, bra,
ret, .func, .entry},
keywordstyle=\color{blue},
sensitive=true,
morestring=[b]",
morestring=[d]',
morecomment=[l]{//},
commentstyle=\color{gray},
stringstyle=\color{brown}
}
\lstnewenvironment{PTXCode}[1][]
{\lstset{%
language=PTX,
escapeinside={/+}{+/}, % makes "/+" and "+/" available for Latex escapes (labels etc.)
#1}}%
{}
% Code Environment for Generic Code
\lstnewenvironment{GenericCode}[1][]
