implementation: continued writing transpiler section

This commit is contained in:
2025-05-04 13:54:05 +02:00
parent 18d89e27ca
commit b69a3efe96
5 changed files with 134 additions and 15 deletions


@@ -160,28 +160,30 @@ While in most cases a GPU can be programmed in a higher level language like C++
PTX defines a virtual machine with its own instruction set architecture (ISA) and is designed for data-parallel processing on a GPU. It is an abstraction of the underlying hardware instruction set, allowing PTX code to be portable across Nvidia GPUs. Before PTX code can be executed on a GPU, the driver is responsible for compiling it to the hardware instruction set of the GPU it runs on. A developer typically writes a kernel in CUDA using C++, for example, and the Nvidia compiler generates the PTX code for that kernel. This PTX code is then compiled by the driver once it is executed. The concepts for programming the GPU with PTX and CUDA are the same, apart from slightly different terminology. For consistency, the CUDA terminology will continue to be used.
Syntactically, PTX is similar to assembler-style code. Every PTX module must start with a \verb|.version| directive, which indicates the PTX version, immediately followed by a \verb|.target| directive, which indicates the compute capability. If the program needs 64-bit addresses instead of the default 32-bit addresses, the optional \verb|.address_size| directive can be used to indicate this. Using 64-bit addresses enables the developer to access more than 4 GB of memory, but it also increases register usage, as a 64-bit address must be stored in two registers.

After these directives, the actual code is written. As each PTX program needs an entry point (the kernel), the \verb|.entry| directive declares the name of the kernel and the parameters it requires. It is also possible to write helper functions with the \verb|.func| directive. Inside the kernel or a helper function, normal PTX code can be written. Because PTX is very low level, it assumes an underlying register machine, and a developer therefore needs to think about register management. This includes loading data from global or shared memory into registers when needed. Instructions that manipulate data, such as addition and subtraction, generally follow the structure \verb|operation.datatype|, followed by up to four operands. For adding two FP32 values and storing the result in the register \%n, the code looks like the following:
\begin{GenericCode}[numbers=none]
add.f32 %n, 0.1, 0.2;
\end{GenericCode}
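To illustrate how these pieces fit together, the following sketch shows a complete, minimal PTX module; the version, target and kernel name are chosen purely for illustration. It combines the module-level directives with a kernel declared via \verb|.entry| that loads a pointer parameter, performs the addition from above and writes the result to global memory.
\begin{PTXCode}
// Illustrative sketch only; version, target and all names are assumptions.
.version 7.8
.target sm_61
.address_size 64

.visible .entry write_sum(.param .u64 out_ptr)
{
    .reg .u64 %addr;
    .reg .f32 %n;

    // load the pointer parameter and convert it to a global address
    ld.param.u64 %addr, [out_ptr];
    cvta.to.global.u64 %addr, %addr;

    // perform the addition and store the result in global memory
    add.f32 %n, 0.1, 0.2;
    st.global.f32 [%addr], %n;

    ret;
}
\end{PTXCode}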
Loops in the classical sense do not exist in PTX. Instead, a developer needs to define jump targets for the beginning and end of the loop. Program \ref{code:ptx_loop} shows how a function with a simple loop can be implemented. The loop counts down to zero from the passed parameter $N$, which is loaded into the register \%n in line 6. Once the value in the register \%n reaches zero, the branch in line 9 jumps to the target in line 12 and the loop has finished. All other directives used here, as well as further information on writing PTX code, can be found in the PTX documentation \parencite{nvidia_parallel_2025}.
\begin{program}
\begin{PTXCode}
.func loop(.param .u32 N)
{
    .reg .u32 %n;
    .reg .pred %p;

    ld.param.u32 %n, [N];
Loop:
    setp.eq.u32 %p, %n, 0;
    @%p bra Done;
    sub.u32 %n, %n, 1;
    bra Loop;
Done:
    ret;
}
\end{PTXCode}
\caption{A PTX program fragment depicting how loops can be implemented.}
\label{code:ptx_loop}
\end{program}
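Since the function above is declared with \verb|.func|, it can be invoked from a kernel. The following sketch is not part of Program \ref{code:ptx_loop} and only illustrates one possible calling sequence, assuming the \verb|loop| function is defined earlier in the same module: the caller allocates a \verb|.param| buffer, stores the argument into it and issues a \verb|call| instruction.
\begin{PTXCode}
// Illustrative sketch only; the kernel name and parameter are assumptions.
.visible .entry run_loop(.param .u32 count)
{
    .reg .u32 %c;

    ld.param.u32 %c, [count];
    {
        // pass the argument to the helper function through param space
        .param .u32 arg;
        st.param.u32 [arg], %c;
        call.uni loop, (arg);
    }
    ret;
}
\end{PTXCode}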