concept and design: improved wording and added overview diagram of kernel usage
parent 258d33c338
commit c68e0d04a0
other/kernel_architecture.drawio | 112 lines (new file)
@@ -0,0 +1,112 @@
<mxfile host="app.diagrams.net" agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:137.0) Gecko/20100101 Firefox/137.0" version="26.2.6">
  <diagram name="Page-1" id="ZW0hAwE0V4rwrlzxzp_e">
    <mxGraphModel dx="1426" dy="791" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="1169" pageHeight="827" math="0" shadow="0">
      <root>
        <mxCell id="0" />
        <mxCell id="1" parent="0" />
        <mxCell id="EzEPb8_loPXt5I1V_28y-3" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="EzEPb8_loPXt5I1V_28y-1" target="EzEPb8_loPXt5I1V_28y-2">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-1" value="Interpreter" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="250" y="120" width="120" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-7" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="EzEPb8_loPXt5I1V_28y-2" target="EzEPb8_loPXt5I1V_28y-4">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-8" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="EzEPb8_loPXt5I1V_28y-2" target="EzEPb8_loPXt5I1V_28y-5">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-9" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="EzEPb8_loPXt5I1V_28y-2" target="EzEPb8_loPXt5I1V_28y-6">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-2" value="Kernel" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="250" y="200" width="120" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-14" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="EzEPb8_loPXt5I1V_28y-4" target="EzEPb8_loPXt5I1V_28y-11">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-4" value="Dispatch" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="180" y="280" width="80" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-15" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="EzEPb8_loPXt5I1V_28y-5" target="EzEPb8_loPXt5I1V_28y-12">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-5" value="Dispatch" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="270" y="280" width="80" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-16" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="EzEPb8_loPXt5I1V_28y-6" target="EzEPb8_loPXt5I1V_28y-13">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-6" value="Dispatch" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="360" y="280" width="80" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-11" value="Evaluate" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="180" y="360" width="80" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-12" value="Evaluate" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="270" y="360" width="80" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-13" value="<div>Evaluate</div>" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="360" y="360" width="80" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-36" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="EzEPb8_loPXt5I1V_28y-18" target="EzEPb8_loPXt5I1V_28y-32">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-37" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="EzEPb8_loPXt5I1V_28y-18" target="EzEPb8_loPXt5I1V_28y-33">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-38" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="EzEPb8_loPXt5I1V_28y-18" target="EzEPb8_loPXt5I1V_28y-34">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-18" value="Transpiler" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="550" y="120" width="120" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-23" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" source="EzEPb8_loPXt5I1V_28y-24" target="EzEPb8_loPXt5I1V_28y-29" parent="1">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-24" value="Dispatch" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="480" y="280" width="80" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-25" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" source="EzEPb8_loPXt5I1V_28y-26" target="EzEPb8_loPXt5I1V_28y-30" parent="1">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-26" value="Dispatch" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="570" y="280" width="80" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-27" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" source="EzEPb8_loPXt5I1V_28y-28" target="EzEPb8_loPXt5I1V_28y-31" parent="1">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-28" value="Dispatch" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="660" y="280" width="80" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-29" value="Evaluate" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="480" y="360" width="80" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-30" value="Evaluate" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="570" y="360" width="80" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-31" value="<div>Evaluate</div>" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="660" y="360" width="80" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-39" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="EzEPb8_loPXt5I1V_28y-32" target="EzEPb8_loPXt5I1V_28y-24">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-32" value="Kernel" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="480" y="200" width="80" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-40" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="EzEPb8_loPXt5I1V_28y-33" target="EzEPb8_loPXt5I1V_28y-26">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-33" value="Kernel" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="570" y="200" width="80" height="40" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-41" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0.5;exitY=1;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="EzEPb8_loPXt5I1V_28y-34" target="EzEPb8_loPXt5I1V_28y-28">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="EzEPb8_loPXt5I1V_28y-34" value="Kernel" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
          <mxGeometry x="660" y="200" width="80" height="40" as="geometry" />
        </mxCell>
      </root>
    </mxGraphModel>
  </diagram>
</mxfile>
@@ -6,7 +6,6 @@ To be able to determine whether evaluating mathematical expressions on the GPU i
% TODO: maybe describe CPU interpreter too? We will see

\section[Requirements]{Requirements and Data}
% short section.
The main goal of both prototypes or evaluators is to provide a speed-up compared to the CPU interpreter already in use. However, it is also important to determine which evaluator provides the most speed-up. This also means that if one of the evaluators is faster, it is intended to replace the CPU interpreter. Therefore, they must have similar capabilities and meet the following requirements:

\begin{itemize}
@@ -18,8 +17,6 @@ The main goal of both prototypes or evaluators is to provide a speed-up compared
\item The results of the evaluations are returned in a matrix of the form $k \times N$. In this case, $k$ is equal to the $N$ of the variable matrix and $N$ is equal to the number of input expressions.
\end{itemize}

These requirements mean that one possible expression that must be able to be evaluated is the following: $\log(e^{p_1}) - |x_1| * \sqrt{x_2} / 10 + 2^{x_3}$

\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{input_output_explanation.png}
@@ -28,9 +25,9 @@ These requirements mean that one possible expression that must be able to be ev
\end{figure}

-With this, the capabilities are outlined, however, the input and output data need to further be explained for a better understanding. The first input are the expressions that need to be evaluated. These can have any length and can contain constant values, variables and parameters and all of these are linked together with the supported operations. In the example shown in Figure \ref{fig:input_output_explanation}, there are six expressions $e_1$ through $e_6$. Next is the variable matrix. One entry in this matrix, corresponds to one variable in every expression, with the row indicating which variable it holds the value for. For example the values in row three, are used to parametrise the variable $x_3$. Each column holds a different set of variables. In the provided example, there are three variable sets, each holding the values for four variables $x_1$ through $x_4$. All expressions are evaluated using all variable sets and the results of these evaluations are stored in the results matrix. Each entry in this matrix holds the resulting value of the evaluation of one expression with one variable set. The row indicates the variable set while the column indicates the expression.
+With this, the required capabilities are outlined. However, the input and output data need to be explained further for a better understanding. The first input contains the expressions that need to be evaluated. These can be of any length and can contain constant values, variables and parameters, all of which are linked together by the supported operations. In the example shown in Figure \ref{fig:input_output_explanation}, there are six expressions $e_1$ through $e_6$. Next is the variable matrix. One entry in this matrix corresponds to one variable in every expression. The row indicates which variable it holds the value for. For example, the values in row three are used to parametrise the variable $x_3$. Each column holds a different set of variables. Each expression must be evaluated using every variable set. In the provided example, there are three variable sets, each holding the values for four variables $x_1$ through $x_4$. After all expressions have been evaluated using all variable sets, the results of these evaluations must be stored in the results matrix. Each entry in this matrix holds the resulting value of the evaluation of one expression parametrised with one variable set. The row indicates the variable set while the column indicates the expression.
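
To make this layout concrete, the following minimal Julia sketch shows how the inputs and the results matrix relate to each other. The values and names are purely illustrative and not part of the actual implementation.
\begin{verbatim}
# Three variable sets (columns), each holding values for x1 through x4 (rows):
variables = Float32[1.0 2.0 3.0;   # x1 in all three variable sets
                    4.0 5.0 6.0;   # x2
                    7.0 8.0 9.0;   # x3
                    0.5 1.5 2.5]   # x4

expressions = [:(x[1] + 2), :(sqrt(x[2]) * x[3])]

# One result per (variable set, expression) pair: rows correspond to
# variable sets, columns to expressions, i.e. a 3 x 2 matrix here.
results = Matrix{Float32}(undef, size(variables, 2), length(expressions))
\end{verbatim}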

-This is the minimal functionality needed to evaluate expressions with variables generated by a symbolic regression algorithm. In the case of parameter optimisation, it is useful to have a different type of variable, called parameter. For parameter optimisation it is important that for the given variable sets, the best fitting parameters need to be found. To achieve this, the evaluator is called multiple times with different parameters, but the same variables. The results are then evaluated for their fitness by the caller. In this case, the parameters do not change within one call. Parameters could therefore be treated as constant values of the expressions, as the caller and no separate input for them would be needed. However, providing the possibility to have the parameters as an input, makes the process of parameter optimisation easier. This is the reason the prototype evaluators need to support parameters as inputs. Unlike variables, not all expressions need to have the same number of parameters. Therefore, they are structured as a vector of vectors and not a matrix. The example in Figure \ref{fig:input_output_explanation} shows how the parameters are structured. For example one expression has zero parameters, while another has six parameters $p_1$ through $p_6$. It needs to be mentioned that just like the number of variables, the number of parameters per expression is not limited. It is also possible to completely omit the parameters if they are not needed.
+This is the minimal functionality needed to evaluate expressions with variables generated by a symbolic regression algorithm. In the case of parameter optimisation, it is useful to have a different type of variable, called a parameter. For parameter optimisation, the best-fitting parameters for the given variable sets need to be found. To achieve this, the evaluator is called multiple times with different parameters but the same variables. The results are then evaluated for their fitness by the caller. In this case, the parameters do not change within one call. Parameters could therefore be treated as constant values of the expressions, and no separate input for them would be needed. However, providing the possibility to pass the parameters as an input makes the process of parameter optimisation easier. Unlike variables, not all expressions need to have the same number of parameters. Therefore, they are structured as a vector of vectors and not as a matrix. The example in Figure \ref{fig:input_output_explanation} shows how the parameters are structured. For example, one expression has zero parameters, while another has six parameters $p_1$ through $p_6$. It needs to be mentioned that, just like the number of variables, the number of parameters per expression is not limited. It is also possible to completely omit the parameters if they are not needed. Because these evaluators will primarily be used in parameter optimisation use-cases, allowing parameters as an input is required.
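
A brief, illustrative Julia sketch of this structure (the values are made up):
\begin{verbatim}
# Each expression may take a different number of parameters, so the
# parameters are a vector of vectors rather than a matrix:
parameters = Vector{Vector{Float32}}([
    Float32[],                                # expression 1: no parameters
    Float32[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],    # expression 2: p1 through p6
])
\end{verbatim}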

% \subsection{Non-Goals}
% Probably a good idea. Probably move this to "introduction"
@@ -38,11 +35,14 @@ This is the minimal functionality needed to evaluate expressions with variables

Based on the requirements above, the architecture of both prototypes can be designed. While the requirements only specify the input and output, the components and workflow also need to be specified. This section aims at giving an architectural overview of both prototypes, alongside their design decisions.

-A design decision that has been made for both prototypes is to split the evaluation of each expression into a separate kernel or kernel dispatch. As explained in Section \ref{sec:thread_hierarchy}, it is desirable to reduce the occurrence of thread divergence as much as possible. Although the SIMT programming model tries to mitigate the negative effects of thread divergence, it is still a good idea to avoid it when possible. For this use-case, thread divergence can easily be avoided by not evaluating all expressions in a single kernel or kernel dispatch. GPUs are able to have multiple resident grids, with modern GPUs being able to accommodate 128 grids concurrently \parencite{nvidia_cuda_2025}. One grid corresponds to one kernel dispatch, and therefore allows up-to 128 kernels to be run concurrently. Therefore, dispatching a kernel for each expression, has the possibility to improve the performance. In the case of the interpreter, having only one kernel that can be dispatched for each expression, also simplifies the kernel itself. This is because the kernel can focus on evaluating one expression and does not require additional code to handle multiple expressions at once. Similarly, the transpiler can also be simplified, as it can generate many smaller kernels than one big kernel. Additionally, the smaller kernels do not need any branching, because the generated code only needs to perform the operations as they occur in the expression itself.
+\begin{figure}
+\centering
+\includegraphics[width=.9\textwidth]{kernel_architecture.png}
+\caption{The interpreter has one kernel that is dispatched multiple times, while the transpiler has multiple kernels that are dispatched once each. This helps to eliminate thread divergence.}
+\label{fig:kernel_architecture}
+\end{figure}

-%% Maybe add overview Diagram
-%% Shows -> Caller calls Evaluator -> Evaluator dispatches kernel -> kernel evaluates -> Evaluator returns evaluation result
-%% Probably the same as the interpreter and transpiler diagram. If so, dont add it
+A design decision that has been made for both prototypes is to split the evaluation of each expression into a separate kernel or kernel dispatch, as seen in Figure \ref{fig:kernel_architecture}. As explained in Section \ref{sec:thread_hierarchy}, it is desirable to reduce the occurrence of thread divergence as much as possible. Although the SIMT programming model tries to mitigate the negative effects of thread divergence, it is still a good idea to avoid it when possible. For this use-case, thread divergence can easily be avoided by not evaluating all expressions in a single kernel or kernel dispatch. GPUs are able to have multiple resident grids, with modern GPUs being able to accommodate 128 grids concurrently \parencite{nvidia_cuda_2025}. One grid corresponds to one kernel dispatch, which therefore allows up to 128 kernels to run concurrently. Dispatching a kernel for each expression therefore has the potential to improve performance. In the case of the interpreter, having only one kernel that is dispatched once for each expression also simplifies the kernel itself. This is because the kernel can focus on evaluating one expression and does not require additional code to handle multiple expressions at once. Similarly, the transpiler can also be simplified, as it can generate many smaller kernels instead of one big kernel. Additionally, the smaller kernels do not need any branching, because the generated code only needs to perform the operations as they occur in the expression itself.
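
As an illustration of this design, the following Julia sketch dispatches one kernel per expression using CUDA.jl. The kernel name and argument layout are assumptions made for illustration, not the actual implementation.
\begin{verbatim}
using CUDA

# interpret_kernel! is a hypothetical kernel that evaluates a single
# expression (in intermediate representation) for all variable sets.
function dispatch_all(exprs, cuVars, cuResults)
    for (i, expr) in enumerate(exprs)
        cuExpr = CuArray(expr)  # copy one expression to the GPU
        kernel = @cuda launch=false interpret_kernel!(cuExpr, cuVars,
                                                      cuResults, i)
        config = launch_configuration(kernel.fun)
        threads = min(size(cuVars, 2), config.threads)
        blocks = cld(size(cuVars, 2), threads)
        kernel(cuExpr, cuVars, cuResults, i; threads=threads, blocks=blocks)
    end
end
\end{verbatim}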

\subsection{Pre-Processing}
The first step in both prototypes is the pre-processing step. It is needed, as it simplifies working with the expressions in the later steps. One of the responsibilities of the pre-processor is to verify that only allowed operators and symbols are present in the given expressions. This is comparable to the work a scanner like Flex\footnote{\url{https://github.com/westes/flex}} performs. Additionally, this step converts the expression into an intermediate representation. In essence, the pre-processing step can be compared to the front-end of a compiler as described in Section \ref{sec:compilers}. The conversion into the intermediate representation transforms the expressions from infix-notation into postfix-notation. This allows the later parts to evaluate the expressions more easily. One of the major benefits of this notation is the implicit operator precedence. It allows the evaluators to evaluate the expressions token by token from left to right, without needing to worry about the correct order of operations. One token represents either an operator, a constant value, a variable or a parameter. Apart from containing the expression in postfix-notation, the intermediate representation also contains information about the types of the tokens themselves. This is all that is needed for the interpretation and transpilation steps. A simple expression like $x + 2$ would look as depicted in Figure \ref{fig:pre-processing_results} after the pre-processing step.
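
A minimal sketch of such a token-based intermediate representation in Julia; the concrete encoding is an assumption for illustration:
\begin{verbatim}
# A token pairs a token type with a value: an index for variables and
# parameters, the literal for constants, or an opcode for operators.
@enum TokenType CONSTANT VARIABLE PARAMETER OPERATOR
struct Token
    type::TokenType
    value::Float32
end

# The expression x1 + 2 in postfix-notation: x1 2 +
tokens = [
    Token(VARIABLE, 1.0f0),   # x1 (variable index 1)
    Token(CONSTANT, 2.0f0),   # the literal 2
    Token(OPERATOR, 1.0f0),   # assumed opcode 1 = addition
]
\end{verbatim}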
@@ -111,9 +111,9 @@ The Algorithm \ref{alg:eval_interpreter} in this case resembles the kernel. This
\label{fig:component_diagram_transpiler}
\end{figure}

-Similar to the interpreter, the transpiler also consists of a part that runs on the CPU and a part that runs on the GPU. When looking at the component and workflow of the transpiler, as shown in Figure \ref{fig:component_diagram_transpiler}, it is almost identical to the interpreter. However, the key difference between the two, is the additional code generation, or transpilation step. Apart from that, the transpiler also needs the same pre-processing step and also the GPU to evaluate the expressions. However, the GPU evaluator generated by the transpiler works differently to the GPU evaluator for the interpreter. The difference between these evaluators will be explained later.
+Similar to the interpreter, the transpiler also consists of a part that runs on the CPU and a part that runs on the GPU. When looking at the components and workflow of the transpiler, as shown in Figure \ref{fig:component_diagram_transpiler}, it is almost identical to the interpreter. However, the key difference between the two is the additional code generation, or transpilation, step. Apart from that, the transpiler needs the same pre-processing step and also uses the GPU to evaluate the expressions. However, the GPU evaluator generated by the transpiler works differently to the GPU evaluator of the interpreter. The difference between these evaluators will be explained later.

-Before the expressions can be transpiled into PTX code, they need to be pre-processed. As already described, this step ensures the validity of the expressions and transforms them into the intermediate representation described above. As with the interpreter, this also simplifies the code generation step at the cost of some performance because the intermediate representation has to be generated. However, in this case the benefit of having a simple code generation step was more important than performance. By transforming the expressions into postfix-notation, the code generation follows a similar pattern to the interpretation described above. Algorithm \ref{alg:transpile} shows how the transpiler takes an expression, transpiles it and then returns the finished code. It can be seen that the while loop is the same as the while loop of the interpreter. The main difference is in the operator branches. Because now code needs to be generated, the branches themselves call their designated code generation function, such as $\textit{GenerateAddition}$. However, this function can not only return the code that performs the addition for example. This addition, when executed, also returns a value which will be needed as an input by other operators. Therefore, not only the code fragment must be returned, but also the variable in which the result is stored. This variable can then be put on the stack for later use, and the code fragment must be added to the code already generated so that it can be returned to the caller. As with the interpreter, there is a final value on the stack when the loop has finished. Once the code is executed, this value is the variable containing the result of the expression. This value then needs to be stored in the results matrix, so that it can be retrieved by the CPU after all expressions have been executed on the GPU. Therefore, one last code fragment must be generated to handle the storage of this value in the results matrix. This fragment must then be added to the code already generated, and the transpilation process is complete.
+Before the expressions can be transpiled into PTX code, they need to be pre-processed. As already described, this step ensures the validity of the expressions and transforms them into the intermediate representation described above. As with the interpreter, this also simplifies the code generation step at the cost of some performance, because the validity has to be ensured and the intermediate representation needs to be generated. However, in this case the benefit of having a simple code generation step was more important than performance. By transforming the expressions into postfix-notation, the code generation follows a similar pattern to the interpretation already described. Algorithm \ref{alg:transpile} shows how the transpiler takes an expression, transpiles it and then returns the finished code. It can be seen that the while loop is the same as the while loop of the interpreter. The main difference is in the operator branches. Because code now needs to be generated, the branches themselves call their designated code generation function, such as $\textit{GetAddition}$. However, this function cannot simply return the code that, for example, performs the addition. When executed, this addition also returns a value which will be needed as an input by other operators. Therefore, not only the code fragment must be returned, but also a reference to the result. This reference can then be put on the stack for later use, just as the interpreter stores the values for later use. The code fragment must also be appended to the already generated code so that it can be returned to the caller. As with the interpreter, there is a final value on the stack when the loop has finished. Once the code is executed, this value is the reference to the result of the expression. This value then needs to be stored in the results matrix, so that it can be retrieved by the CPU after all expressions have been executed on the GPU. Therefore, one last code fragment must be generated to handle the storage of this value in the results matrix. This fragment must then be added to the code already generated, and the transpilation process is complete.

\begin{algorithm}
\caption{Transpiling an equation in postfix-notation}\label{alg:transpile}
@@ -130,14 +130,14 @@ Before the expressions can be transpiled into PTX code, they need to be pre-proc
\If{$\textit{token.Value} = \text{Addition}$}
\State $\textit{right} \gets \text{Pop}(\textit{stack})$
\State $\textit{left} \gets \text{Pop}(\textit{stack})$
-\State $(\textit{valueLocation}, \textit{codeFragment}) \gets \text{GenerateAddition}(\textit{left}, \textit{right})$
-\State Push($\textit{stack}$, $\textit{valueLocation}$)
+\State $(\textit{referenceToValue}, \textit{codeFragment}) \gets \text{GetAddition}(\textit{left}, \textit{right})$
+\State Push($\textit{stack}$, $\textit{referenceToValue}$)
\State Append($\textit{code}$, $\textit{codeFragment}$)
\ElsIf{$\textit{token.Value} = \text{Multiplication}$}
\State $\textit{right} \gets \text{Pop}(\textit{stack})$
\State $\textit{left} \gets \text{Pop}(\textit{stack})$
-\State $(\textit{valueLocation}, \textit{codeFragment}) \gets \text{GenerateMultiplication}(\textit{left}, \textit{right})$
-\State Push($\textit{stack}$, $\textit{valueLocation}$)
+\State $(\textit{referenceToValue}, \textit{codeFragment}) \gets \text{GetMultiplication}(\textit{left}, \textit{right})$
+\State Push($\textit{stack}$, $\textit{referenceToValue}$)
\State Append($\textit{code}$, $\textit{codeFragment}$)
\EndIf
\EndIf
@@ -151,38 +151,10 @@ Before the expressions can be transpiled into PTX code, they need to be pre-proc
\end{algorithmic}
\end{algorithm}
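
To give a concrete impression of what a code generation function such as $\textit{GetAddition}$ might look like, the following Julia sketch emits a PTX addition instruction and returns the register holding the result. The counter-based register allocation is an assumption for illustration.
\begin{verbatim}
mutable struct CodeGenState
    regCounter::Int   # simple counter-based register allocation (assumed)
end

function get_addition(state::CodeGenState, left::String, right::String)
    state.regCounter += 1
    resultRef = "%f$(state.regCounter)"            # reference to the result
    codeFragment = "add.f32 $resultRef, $left, $right;\n"
    return (resultRef, codeFragment)
end

# Example: the operands of "x1 2 +" are assumed to sit in %f1 and %f2.
state = CodeGenState(2)
ref, frag = get_addition(state, "%f1", "%f2")
# ref  == "%f3"
# frag == "add.f32 %f3, %f1, %f2;\n"
\end{verbatim}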

-The code generated by the transpiler is the kernel for the transpiled expressions. This means that a new kernel must be generated for each expression that needs to be evaluated. This is in contrast to the interpreter, which has one kernel and dispatches it once for each expression. However, generating one kernel per expression results in a much simpler kernel. This allows the kernel to evaluate the postfix expression from left to right without having to perform any branching. However, while the kernel no longer has to perform branching, it is still needed and therefore only offloaded to the CPU. There is also a noticeable overhead in that a kernel has to be generated for each expression. In cases like parameter optimisation, many of the expressions will be transpiled multiple times, because the transpiler is called multiple times with the same expressions. Strategies such as caching can be used to improve the performance in such cases.
+The code generated by the transpiler is the kernel for the transpiled expressions. This means that a new kernel must be generated for each expression that needs to be evaluated. This is in contrast to the interpreter, which has one kernel and dispatches it once for each expression. However, generating one kernel per expression results in a much simpler kernel. This allows the kernel to focus on evaluating the postfix expression from left to right. No overhead work, like branching or managing a stack, is needed. However, this overhead is now offloaded to the transpilation step on the CPU. There is also a noticeable overhead in that a kernel has to be generated for each expression. In cases like parameter optimisation, many of the expressions will be transpiled multiple times, as the transpiler is called multiple times with the same expressions.

-Both the transpiler and the interpreter have their respective advantages and disadvantages. While the interpreter puts less load on the CPU, the GPU has to perform more work. Much of this work is branching and therefore involves many instructions that are not used to evaluate the expression itself. However, all of this work is done in parallel rather than sequentially on the CPU.
+Both the transpiler and the interpreter have their respective advantages and disadvantages. While the interpreter puts less load on the CPU, the GPU has to perform more work. Much of this work is branching or managing a stack and therefore involves many instructions that are not used to evaluate the expression itself. However, this can be mitigated by the fact that all of this overhead work is performed in parallel and not sequentially.

-On the other hand, the transpiler performs more work on the CPU. The kernels are much simpler, and most of the instructions are used to evaluate the expressions themselves. Because unlike the GPU the CPU can manage state, concepts such as caches can be employed to reduce the overhead on the CPU. This means, that unnecessary work can be avoided in certain scenarios such as parameter optimisation.
+On the other hand, the transpiler performs more work on the CPU. The kernels are much simpler, and most of the instructions are used to evaluate the expressions themselves. Furthermore, as explained in Section \ref{sec:ptx}, any program running on the GPU must be transpiled into PTX code before the driver can compile it into machine code. Therefore, the kernel written for the interpreter must also be transpiled into PTX. This overhead comes in addition to the overhead of the branch instructions. The self-written transpiler removes this intermediate step by transpiling directly to PTX. In addition, it is tailored to evaluating expressions and does not need to generate generic PTX code, which can reduce the transpilation time.

-% \section{Interpreter}
-% % as introduction to this section talk about what "interpreter" means in this context. so "gpu parses expr and calculates"
-% In this section, the GPU-based expression interpreter is described. It includes the design decisions made for the architecture of the interpreter. It also describes what is done on the CPU side or host side and what is performed on the GPU side or device side.

-% \subsection{Architecture}
-% talk about the coarse grained architecture on how the interpreter will work. (.5 to 1 page probably)
-% Include decicions made like "one kernel per expression"

-% \subsection{Host}
-% talk about the steps taken to prepare for GPU interpretation

-% \subsection{Device}
-% talk about how the actual interpreter will be implemented

-% \section{Transpiler}
-% as introduction to this section talk about what "transpiler" means in this context. so "cpu takes expressions and generates ptx for gpu execution"

-% Transpiler used, to reduce overhead of the generic PTX transpiler of CUDA, as we can build a more specialised transpiler and hopefully generate faster code that way. (this sentence was written as a reminder and not to be used as is)

-% \subsection{Architecture}
-% talk about the coarse grained architecture on how the transpiler will work. (.5 to 1 page probably)

-% \subsection{Host}
-% talk about how the transpiler is implemented

-% \subsection{Device}
-% talk about what the GPU does. short section since the gpu does not do much
+Unlike the GPU, the CPU can manage state across multiple calls. Concepts such as caches can therefore be employed by the transpiler to reduce the overhead on the CPU. In cases such as parameter optimisation, where the expressions remain the same over multiple calls, the resulting PTX code can be cached. As a result, the same expression does not need to be transpiled multiple times, drastically reducing the transpilation time and therefore improving the overall performance of the transpiler.
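
A minimal sketch of such a cache in Julia, assuming a hypothetical function transpile(expr) that returns the PTX source of one expression:
\begin{verbatim}
const ptx_cache = Dict{Expr,String}()

function transpile_cached(expr::Expr)
    # Transpile only on a cache miss; repeated calls with the same
    # expression (e.g. during parameter optimisation) reuse the PTX.
    get!(ptx_cache, expr) do
        transpile(expr)   # hypothetical transpilation function
    end
end
\end{verbatim}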
@@ -8,24 +8,24 @@ Explain the hardware used, as well as the actual data (how many expressions, var
talk about what we will see now (results only for interpreter, then transpiler and then compared with each other and a CPU interpreter)

\subsection{Interpreter}
-Results only for Interpreter
+Results only for Interpreter (also contains final kernel configuration and probably a quick overview/recap of the implementation used and described in the Implementation section)
\subsection{Performance tuning}
Document the process of performance tuning

Initial: CPU-side single-threaded; up to 1024 threads per block; bounds-checking enabled (especially in kernel)

-Blocksize reduced to a maximum of 256 -> moderate improvement in medium and large
-Using @inbounds -> noticeable improvement in 2 out of 3
+1.) Blocksize reduced to a maximum of 256 -> moderate improvement in medium and large
+2.) Using @inbounds -> noticeable improvement in 2 out of 3
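
The following CUDA.jl sketch illustrates what these two tuning steps look like in code; the kernel body is a placeholder for illustration, not the actual interpreter kernel.
\begin{verbatim}
using CUDA

function eval_kernel!(results, variables, exprIdx)
    varSetIdx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if varSetIdx <= size(variables, 2)
        # @inbounds disables bounds-checking for this access
        @inbounds results[varSetIdx, exprIdx] =
            variables[1, varSetIdx] + 2.0f0
    end
    return nothing
end

cuVars = CUDA.rand(Float32, 4, 1000)     # 4 variables, 1000 variable sets
cuResults = CUDA.zeros(Float32, 1000, 1)
threads = min(256, size(cuVars, 2))      # block size capped at 256
blocks = cld(size(cuVars, 2), threads)
@cuda threads=threads blocks=blocks eval_kernel!(cuResults, cuVars, 1)
\end{verbatim}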

\subsection{Transpiler}
-Results only for Transpiler
+Results only for Transpiler (also contains final kernel configuration and probably a quick overview/recap of the implementation used and described in the Implementation section)
\subsection{Performance tuning}
Document the process of performance tuning

Initial: CPU-side single-threaded; up to 1024 threads per block; bounds-checking enabled

-Blocksize reduced to a maximum of 256 -> moderate improvement in medium and large
-Using @inbounds -> small improvement only on CPU side code
+1.) Blocksize reduced to a maximum of 256 -> moderate improvement in medium and large
+2.) Using @inbounds -> small improvement only on CPU-side code

\subsection{Comparison}
Comparison of Interpreter and Transpiler, as well as comparing the two with the CPU interpreter
@@ -126,6 +126,7 @@ Additionally, shared memory consumption can also impact the occupancy. If for ex
Balancing these limitations and therefore the occupancy and performance often requires a lot of trial and error with the help of the aforementioned tools. In cases where occupancy is already high and the number of warps ready for execution is also high, other areas for performance improvements need to be explored. Algorithmic optimisation is always a good idea. Some performance improvements can be achieved by altering the computations to use different parts of the GPU. One such optimisation is using FP32 operations wherever possible. Another well-suited optimisation is to rewrite the algorithm to use as many Fused Multiply-Add (FMA) instructions as possible. FMA is a special floating point instruction that multiplies two values and adds a third, all in a single clock cycle \parencite{nvidia_cuda_2025-1}. However, the result might deviate slightly compared to performing these two operations separately, which means that in accuracy-sensitive scenarios this instruction should be avoided. If the compiler detects a floating point operation with the FMA structure, it will automatically be compiled into an FMA instruction. To prevent this, in C++ the developer can call the intrinsic functions \_\_fadd\_rn and \_\_fmul\_rn for addition and multiplication respectively.
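
In Julia, the analogous control over fusion is available through the Base functions fma and muladd, as this brief sketch shows:
\begin{verbatim}
a, b, c = 1.0f0, 3.0f0, 2.0f0
fused    = fma(a, b, c)     # always a fused multiply-add, one rounding step
flexible = muladd(a, b, c)  # the compiler may fuse if it is faster
separate = a * b + c        # two separately rounded operations
\end{verbatim}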

\subsection[PTX]{Parallel Thread Execution}
\label{sec:ptx}
% https://docs.nvidia.com/cuda/parallel-thread-execution/
While in most cases a GPU can be programmed in a higher-level language like C++ or even Julia\footnote{\url{https://juliagpu.org/}}, it is also possible to program GPUs with the low-level language Parallel Thread Execution (PTX) developed by Nvidia. A brief overview of what PTX is and how it can be used to program GPUs is given in this section. Information in this section is taken from the PTX documentation \parencite{nvidia_parallel_2025} if not stated otherwise.
|
BIN
thesis/images/kernel_architecture.png
Normal file
BIN
thesis/images/kernel_architecture.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 54 KiB |
BIN
thesis/main.pdf
BIN
thesis/main.pdf
Binary file not shown.