implementation: finished re-reading chapter and fixed suboptimal wording

Daniel 2025-05-08 16:32:05 +02:00
parent 6b7205e026
commit ae03ebdf1d
2 changed files with 12 additions and 17 deletions


@ -321,27 +321,22 @@ When the loop encounters a token that represents an index to either the variable
However, the first time a variable or parameter is accessed, it must be loaded from global memory. Although registers already exist that hold the addresses of the matrices in global memory, the data itself is not yet accessible. To make it accessible, the index to the value must first be calculated in the same way as it is calculated in the interpreter. Afterwards, the value can be loaded into a register with the instruction \verb|ld.global.f32 %reg1, [%reg2]|. Through the first register of the instruction, the data can then be accessed. For example, if the variable $x_1$ is accessed several times, all subsequent accesses only need to reference this register and do not need to load the data from global memory again.
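To illustrate, the following is a minimal sketch of such a load sequence; the register names and the assumption that \verb|%rd1| already holds the global address of the variable matrix are hypothetical:
\begin{PTXCode}[numbers=none]
// assumption: %rd1 already holds the global address of the variable matrix
mul.wide.u32  %rd2, %r1, 4;     // byte offset of x_1 within the matrix
add.s64       %rd3, %rd1, %rd2; // address of x_1 in global memory
ld.global.f32 %f1, [%rd3];      // load x_1 into a register once
// all subsequent accesses to x_1 reuse %f1 instead of issuing another load
\end{PTXCode}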
%
% TODO: Deepl pass
%
In the case where the current token represents an operation, the code for this operation needs to be generated. Many operators have an equivalent on the GPU. For example addition has the \verb|add.f32 %reg1, %reg2, %reg3;| instruction. The instructions for division and square root have an equivalent instruction but only support approximate calculation. While the approximation can be controlled with different options, the fastest option \verb|.approx| has been chosen. Although a slightly slower but more accurate option \verb|.full| exists, it is not fully IEEE 754 compliant and has therefore not been used.
In the case where the current token represents an operation, the code for this operation needs to be generated. Many operators have direct equivalents on the GPU. For example, addition has the \verb|add.f32 %reg1, %reg2, %reg3;| instruction. The division and square root operations also have equivalent instructions, but these only support approximate calculations. Although the accuracy can be controlled with different options, the fastest option \verb|.approx| has been selected. While a slightly slower but more accurate option \verb|.full| exists, it is not fully IEEE 754 compliant and has therefore not been used.
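As an illustration, the approximate variants generated for division and square root might look as follows; the register names are hypothetical:
\begin{PTXCode}[numbers=none]
div.approx.f32  %f3, %f1, %f2; // fast approximate division
sqrt.approx.f32 %f4, %f1;      // fast approximate square root
\end{PTXCode}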
However, not all supported operators have a single instruction GPU equivalent. For example the operation $x^y$ has no equivalent and therefore needs to be generated differently. When compiling a kernel with this operation using the Nvidia compiler and the compiler flag \textit{-\,-use\_fast\_math} the following code will be generated:
However, not all supported operators have a single instruction GPU equivalent. For example, the $x^y$ operation does not have an equivalent and must be generated differently. Compiling a kernel containing this operation using the Nvidia compiler and the \textit{-\,-use\_fast\_math} compiler flag will generate the following code:
\begin{PTXCode}[numbers=none]
lg2.approx.f32 %reg1, %reg2;
mul.f32 %reg4, %reg3, %reg1;
ex2.approx.f32 %reg5, %reg4;
\end{PTXCode}
While this compiler flag trades accuracy for performance, the more accurate version of this operation contains about 100 instructions instead of the above three. Because of this, the more performant version has been chosen to be generated by the transpiler. Similarly, the operations $\log(x)$ and $e^x$ also have no equivalent instruction and are therefore generated following the same principle.
While this compiler flag trades accuracy for performance, the more accurate version of this operation contains about 100 instructions instead of the three above. Therefore, the more performant version was chosen to be generated by the transpiler. Similarly, the operations $\log(x)$ and $e^x$ have no equivalent instruction and are therefore generated using the same principle.
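Since PTX only offers the base-two variants \verb|lg2.approx.f32| and \verb|ex2.approx.f32|, these operations can be generated by rescaling with the constants $\ln(2)$ and $\log_2(e)$. The following sketch, with hypothetical register names, illustrates this:
\begin{PTXCode}[numbers=none]
// log(x) = lg2(x) * ln(2)
lg2.approx.f32 %f2, %f1;
mul.f32        %f3, %f2, 0f3F317218; // ln(2) as an FP32 constant
// e^x = ex2(x * log2(e))
mul.f32        %f4, %f1, 0f3FB8AA3B; // log2(e) as an FP32 constant
ex2.approx.f32 %f5, %f4;
\end{PTXCode}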
The final register of the generated code, stores the result of the operation once executed. As with the interpreter, this result is either the final value, or an input to another operation. Therefore, this register must be stored in the stack for later use.
The final register of the generated code stores the result of the operation once it has been executed. As with the interpreter, this result is either the final value or an input to another operation. Therefore, this register must be stored on the stack for later use.
Once the main loop has finished, the last element on the stack, holds the register with the result of the evaluation. The value of this register must be stored in the result matrix. As the result matrix is stored in global memory, the code for storing the data is similar to the code responsible for loading the data from global memory. First the location where to store the result must be calculated. Storing the result at this location, is performed with the instruction \verb|st.global.f32 [%reg1], %reg2;|.
Once the main loop has finished, the last element on the stack holds the register with the result of the evaluation. The value of this register must be stored in the result matrix. As the result matrix is stored in global memory, the code for storing the data is similar to the code responsible for loading the data from global memory. First, the location where the result is to be stored must be calculated. Storing the result at this location is performed with the instruction \verb|st.global.f32 [%reg1], %reg2;|.
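A minimal sketch of such a store sequence is shown below; the register names and the assumption that \verb|%rd1| holds the global address of the result matrix are hypothetical:
\begin{PTXCode}[numbers=none]
// assumption: %rd1 holds the global address of the result matrix
mul.wide.u32  %rd2, %r1, 4;     // byte offset of this thread's result slot
add.s64       %rd3, %rd1, %rd2; // address inside the result matrix
st.global.f32 [%rd3], %f5;      // %f5 holds the result of the evaluation
\end{PTXCode}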
\subsection{GPU Side}
% I am not really happy with this. The length of the paragraph is fine, but the content not so much
% Maybe show a kernel for the expression "x1+p1" or so to show the complexity or something?
On the GPU, the transpiled kernels are simply executed. Because the kernels themselves are very simple, as they contain almost no branching and other overhead work, the GPU does not need to perform a lot of operations. As can be seen in Program \ref{code:ptx_kernel}, the kernel for the expression $x_1 + p_1$ is very straightforward, with only two load operations, the addition and then the storing of the result in the result matrix. In fact, the kernel is a one-to-one mapping of the expression, with the overhead of ensuring only the one thread is executing and loading the variable and parameter.
On the GPU, the transpiled kernels are executed. Given that these kernels are relatively simple, containing minimal branching and overhead, the GPU does not need to perform many operations. As illustrated in Program \ref{code:ptx_kernel}, the kernel for the expression $x_1 + p_1$ is quite straightforward. It involves only two load operations, the addition, and the storing of the result in the result matrix. Essentially, the kernel is a one-to-one mapping of the expression, plus the previously explained overhead of terminating excess threads and loading the variable and parameter.
\begin{program}
\begin{PTXCode}
@ -387,14 +382,14 @@ On the GPU, the transpiled kernels are simply executed. Because the kernels them
%\verb|.version| and \verb|.target|
Note that Program \ref{code:ptx_kernel} has been slightly simplified to not include the allocation of the registers. From line five to line ten, the addresses stored in the parameters, are converted from parameter state space into global state space, so that they reference the correct portion of the GPUs memory. It needs to be noted, that this kernel uses 64-bit addresses, which is the reason why throughout the kernel, there are some 64-bit instructions used. Evaluating the expression itself is however entirely performed using the faster 32-bit instructions.
Note that Program \ref{code:ptx_kernel} has been slightly simplified to omit the mandatory directives and the register allocation. From line five to line ten, the addresses stored in the parameters are converted from parameter state space into global state space so that they reference the correct portion of the GPU's memory. It should be noted that this kernel uses 64-bit addresses, which is why some 64-bit instructions are used throughout the kernel. However, the evaluation of the expression itself is performed entirely using the faster 32-bit instructions.
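A sketch of such a conversion, using a hypothetical parameter name, could look as follows:
\begin{PTXCode}[numbers=none]
// load the raw pointer passed as a kernel parameter ...
ld.param.u64       %rd1, [param_vars];
// ... and convert it from parameter state space to global state space
cvta.to.global.u64 %rd2, %rd1;
\end{PTXCode}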
Lines 12 through 17 are responsible to calculate the global thread-ID and ensure that excessive threads are terminated early. Note that in line 16, if the global thread-ID stored in register \verb|%r3| is greater than one, it must terminate early. This is because in this example, only one variable set needs to be evaluated.
Lines 12 through 17 are responsible for calculating the global thread ID and ensuring that excess threads are terminated early. Note that in line 16, if the global thread ID stored in register \verb|%r3| is greater than one, the thread must terminate early. This is because only one variable set needs to be evaluated in this example.
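Such a guard can be expressed in PTX as sketched below; the register names and the label are hypothetical, and the bound of one corresponds to the single variable set of this example:
\begin{PTXCode}[numbers=none]
mov.u32     %r1, %ntid.x;       // number of threads per block
mov.u32     %r2, %ctaid.x;      // block index
mov.u32     %r4, %tid.x;        // thread index within the block
mad.lo.s32  %r3, %r2, %r1, %r4; // global thread ID
setp.gt.s32 %p1, %r3, 1;        // is the ID greater than one?
@%p1 bra    DONE;               // excess threads jump past the evaluation
// ... evaluation of the expression ...
DONE:
ret;
\end{PTXCode}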
The PTX code from line 22 to line 28 is the actual evaluation of the expression, with line 28 performing the calculation $x_1 + p_1$. All other lines, are responsible for loading the values from global memory. The instructions at the lines 22, 23 and 25, 26 are responsible for calculating the offset in bytes to the memory location where the value is stored with respect to the location of the first element.
The PTX code from line 22 to line 28 is the actual evaluation of the expression, with line 28 performing the calculation $x_1 + p_1$. All other lines are responsible for loading the values from global memory. The instructions in lines 22, 23, 25 and 26 calculate the byte offset of the memory location where the value is stored, relative to the location of the first element.
The constants $4$ and $0$ are introduced due to performance reasons. The number $4$ is the size in bytes of one variable set. As one variable set only stores a single FP32 value in this case, each variable set has a size of four bytes. Similarly, the number $0$ represents the index of the value inside the variable set. More precise, this is the offset in bytes from index to the variable set, which is zero for the first element and four for the second etc. These two constants are calculated during the transpilation process, to minimise the amount of data that needs to be transferred to the GPU.
The constants $4$ and $0$ are introduced for performance reasons. The number $4$ is the size of a variable set in bytes. Since one variable set in this case stores only a single FP32 value, each variable set has a size of four bytes. Similarly, the number $0$ represents the index of the value within the variable set. More precisely, it is the byte offset of the value from the start of its variable set, which is zero for the first element, four for the second, and so on. These two constants are calculated during the transpilation process to minimise the amount of data that needs to be transferred to the GPU.
Storing the result in the result matrix is performed from line 31 to 33. The location at which to store the value is calculated in the lines 31 and 32. Line 31 calculates the index inside the result matrix, according to the current variable set stored in register \verb|%rd3|. The constant $0$ is the product of the index of the expression that is evaluated and the number of variable sets and represents the column of the result matrix. Converting this index into bytes and adding it as the offset to the first element of the result matrix, results in the correct memory location to store the result at.
Storing the result in the result matrix is performed in lines 31 to 33. The location where the value is to be stored is calculated in lines 31 and 32. Line 31 calculates the index inside the result matrix according to the current variable set stored in register \verb|%rd3|. The constant $0$ is the product of the index of the expression being evaluated and the number of variable sets, and represents the column of the result matrix. Converting this index into bytes and adding it as an offset to the first element of the result matrix yields the correct memory location at which to store the result.
This kernel consists mostly of overhead code, as only lines 22 through 33 contribute to calculating the result of the expression with the designated variable and parameter set. However, for larger expressions, the percentage of overhead code drastically shrinks.
This kernel consists mostly of overhead code, as only lines 22 through 33 contribute to calculating the result of the expression with the designated variable and parameter set. However, for larger expressions, the percentage of overhead code shrinks drastically.

Binary file not shown.