implementation: started transpiler section

Daniel 2025-05-03 12:25:34 +02:00
parent e8e457eae9
commit 18d89e27ca
4 changed files with 68 additions and 13 deletions


@@ -34,12 +34,10 @@ function interpret(expressions::Vector{Expr}, variables::Matrix{Float32}, parame
@inbounds for i in eachindex(exprs)
# TODO: Currently only the first expression gets evaluated. Either use a view on "cudaExprs" to determine the correct expression or extend cudaStepsize to include this information (this information was removed in a previous commit)
# If a "view" is used, then the ExpressionProcessing must be updated to always include the stop opcode at the end
-    kernel = @cuda launch=false fastmath=true interpret_expression(cudaExprs, cudaVars, cudaParams, cudaResults, cudaStepsize, i)
-    # config = launch_configuration(kernel.fun)
-    threads = min(variableCols, 128)
-    blocks = cld(variableCols, threads)
+    numThreads = min(variableCols, 128)
+    numBlocks = cld(variableCols, numThreads)
-    kernel(cudaExprs, cudaVars, cudaParams, cudaResults, cudaStepsize, i; threads, blocks)
+    @cuda threads=numThreads blocks=numBlocks fastmath=true interpret_expression(cudaExprs, cudaVars, cudaParams, cudaResults, cudaStepsize, i)
end
return cudaResults


@@ -150,7 +150,7 @@ Similar to the parameters, the expressions are also stored as a vector of vector
Once the conversion into matrix form has been performed, the expressions are transferred to the GPU. Just like with the variables, the expressions remain the same over the course of the parameter optimisation part. Therefore, they are transferred to the GPU before the interpreter is called, to reduce the amount of unnecessary data transfer.
-In addition to the already described data that needs to be sent, two more steps are required that have not been included in the Sequence Diagram \ref{fig:interpreter-sequence}. The first one is the allocation of global memory for the result matrix. Without this, the kernel would not know where to store the interpretation results. Therefore, enough global memory needs to be allocated so that the results can be stored and retrieved after all kernel executions have finished.
+In addition to the already described data that needs to be sent, two more steps are required that have not been included in the Sequence Diagram \ref{fig:interpreter-sequence}. The first one is the allocation of global memory for the result matrix. Without this, the kernel would not know where to store the interpretation results, and the CPU would not know from which memory location to read the results. Therefore, enough global memory needs to be allocated beforehand so that the results can be stored and retrieved after all kernel executions have finished.
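To make this step more concrete, the following sketch shows how the allocation and transfer could look using CUDA.jl. The names and sizes are purely illustrative and not taken from the actual implementation.
\begin{JuliaCode}
using CUDA

numExprs    = 4                      # illustrative: number of expressions to interpret
variables   = rand(Float32, 5, 1000) # illustrative: 5 variables, 1000 variable sets
cudaVars    = CuArray(variables)     # variables are transferred once and reused
cudaResults = CUDA.zeros(Float32, numExprs, size(variables, 2)) # pre-allocated result matrix
\end{JuliaCode}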
\begin{figure}
\centering
@@ -192,7 +192,7 @@ Constants work very similarly in that the token value is read and added to the t
Evaluating the expression is happening if the current token is an operator. The token's value, which serves as the opcode, determines the operation that needs to be performed. If the opcode represents a unary operator, only the top value of the stack needs to be popped for the operation. The operation is then executed on this value and the result is pushed back to the stack. On the other hand, if the opcode represents a binary operator, the top two values of the stack are popped. These are then used for the operation, and the result is subsequently pushed back onto the stack.
-Support for ternary operators could also be easily added. An example of a ternary operator that would help improve performance would be the GPU supported Fused Multiply-Add (FMA) operator. While this operator does not exist in Julia, the frontend can generate it when it encounters a sub-expression of the form $x * y + z$. Since this expression performs the multiplication and addition in a single clock cycle instead of two, it would be a feasible optimisation. However, detecting such sub-expressions is more complicated, which is why it is not supported in the current implementation.
+Support for ternary operators could also be easily added. An example of a ternary operator that would help improve performance would be the GPU supported Fused Multiply-Add (FMA) operator. While this operator does not exist in Julia, the frontend can generate it when it encounters a sub-expression of the form $x * y + z$. Since this expression performs the multiplication and addition in a single clock cycle instead of two, it would be a feasible optimisation. However, detecting such sub-expressions is complicated, which is why it is not supported in the current implementation.
Once the interpreter loop has finished, the result of the evaluation must be stored in the result matrix. By using the index of the current expression, as well as the index of the current variable set (the global thread ID) it is possible to calculate the index where the result must be stored. The last value on the stack is the result, which is stored in the result matrix at the calculated location.
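To make the evaluation step more concrete, the following sketch shows how unary and binary opcodes could be evaluated with a stack. It is written as ordinary CPU-side Julia with a dynamic stack for readability; the real kernel uses a fixed-size stack, and the opcode values shown here are purely illustrative.
\begin{JuliaCode}
# Illustrative opcodes; the actual values are assigned by the frontend
const OPC_ADD  = 1 # binary operator
const OPC_MUL  = 2 # binary operator
const OPC_SQRT = 3 # unary operator

function evaluate_opcode!(stack::Vector{Float32}, opcode::Int)
    if opcode == OPC_SQRT       # unary: pop one value, push the result
        push!(stack, sqrt(pop!(stack)))
    elseif opcode == OPC_ADD    # binary: pop two values, push the result
        right = pop!(stack); left = pop!(stack)
        push!(stack, left + right)
    elseif opcode == OPC_MUL
        right = pop!(stack); left = pop!(stack)
        push!(stack, left * right)
    end
end

# After the interpreter loop, the top of the stack holds the result:
# results[expressionIndex, threadIndex] = stack[end]
\end{JuliaCode}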
@@ -211,24 +211,60 @@ An overview of how the transpiler interacts with the frontend and GPU is outline
\subsection{CPU Side}
% TODO: Finish on Saturday
-After the transpiler has received the expressions to be transpiled, first they are sent to the frontend for processing. Once they have been processed, the expressions are sent to the transpiler backend which is further explained in Section \ref{sec:transpiler-backend}. The backend is responsible for generating the kernels. The output of the backend are the kernels written as PTX code for all expressions.
+After the transpiler has received the expressions to be transpiled, it first sends them to the frontend for processing. Once they have been processed, the expressions are sent to the transpiler backend, which is explained in more detail in Section \ref{sec:transpiler-backend}. The backend is responsible for generating the kernels. The output of the backend is the kernels for all expressions, written as PTX code.
\subsubsection{Data Transfer}
% smaller section as it basically states that it works the same as interpreter
% mention that now expressions are not transmitted, as the kernel "is" the expression
Data is sent to the GPU in the same way as it is sent by the interpreter. The variables are sent as they are, while the parameters are again brought into matrix form. Memory must also be allocated for the result matrix. Unlike with the interpreter, however, this is the only data that needs to be sent to the GPU for the transpiler.
Because each expression has its own kernel, there is no need to transfer the expressions themselves. Moreover, there is also no need to send information about the layout of the variables and parameters to the GPU. The reason for this is explained in the transpiler backend section below.
\subsubsection{Kernel Dispatch}
% similar to interpreter dispatch with tuning etc.
% mention that CUDA.jl is used to instruct the driver to compile the kernel for the specific hardware
Once all the data is present on the GPU, the transpiled kernels can be dispatched. Dispatching the transpiled kernels is more involved than dispatching the interpreter kernel. Program \ref{code:julia_dispatch-comparison} shows the difference between dispatching the interpreter kernel and the transpiled kernels. An important note is that the transpiled kernels must be manually compiled into machine code. To achieve this, CUDA.jl provides functionality to instruct the driver to compile the PTX code. The same process of creating PTX code and compiling it must also be performed for the interpreter kernel; however, CUDA.jl does this automatically when the @cuda macro is called in line 6.
-\subsubsection{Transpiler Backend}
\begin{program}
\begin{JuliaCode}
# Dispatching the interpreter kernel
for i in eachindex(exprs)
numThreads = ...
numBlocks = ...
@cuda threads=numThreads blocks=numBlocks fastmath=true interpret(cudaExprs, cudaVars, cudaParams, cudaResults, cudaAdditional)
end
# Dispatching the transpiled kernels
for kernelPTX in kernelsPTX
# Create linker object, add the code and compile it
linker = CuLink()
add_data!(linker, "KernelName", kernelPTX)
image = complete(linker)
# Get callable function from compiled result
mod = CuModule(image)
kernel = CuFunction(mod, "KernelName")
numThreads = ...
numBlocks = ...
# Dispatching the kernel
cudacall(kernel, (CuPtr{Float32},CuPtr{Float32},CuPtr{Float32}), cudaVars, cudaParams, cudaResults; threads=numThreads, blocks=numBlocks)
end
\end{JuliaCode}
\caption{A Julia program fragment showing how the transpiled kernels are dispatched compared to the interpreter kernel}
\label{code:julia_dispatch-comparison}
\end{program}
After all kernels have been dispatched, the CPU waits for the kernels to complete their execution. When the kernels have finished, the result matrix is read from global memory into system memory. The results can then be returned to the symbolic regression algorithm.
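Assuming CUDA.jl and the illustrative names from the earlier sketches, this final step could look as follows:
\begin{JuliaCode}
CUDA.synchronize()            # wait until all dispatched kernels have finished
results = Array(cudaResults)  # copy the result matrix from global into system memory
\end{JuliaCode}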
\subsection{Transpiler Backend}
\label{sec:transpiler-backend}
% TODO: Start on Saturday and finish on Sunday (preferably finish on Saturday)
% describe the transpilation process
\subsection{GPU Side}
% TODO: Finish on Sunday
% I am not really happy with this. The length of the paragraph is fine, but the content not so much
% Maybe show a kernel for the expression "x1+p1" or so to show the complexity or something?
On the GPU, the transpiled kernels are simply executed. Because the kernels themselves are very simple, containing almost no branching or other overhead work, the GPU does not need to perform many operations. As can be seen in Program TODO, the kernel for the expression $x_1 + p_1$ is very straightforward, consisting of only two load operations, the addition, and the storing of the result in the result matrix. In fact, the kernel is a one-to-one mapping of the expression, with the only overhead being the guard that ensures only valid threads execute, and the loading of the variable and parameter.
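Since the generated PTX is not reproduced here, the following Julia kernel is an illustrative equivalent of what the transpiler produces for $x_1 + p_1$; the indexing assumes one result row per expression and is not taken from the actual generated code.
\begin{JuliaCode}
using CUDA

function kernel_x1_plus_p1(vars, params, results)
    varSetIndex = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if varSetIndex <= size(results, 2)  # ensure only valid threads perform work
        # load x1 and p1, add them and store the result
        results[1, varSetIndex] = vars[1, varSetIndex] + params[1]
    end
    return
end
\end{JuliaCode}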
% Front-End and Back-End


@@ -135,6 +135,27 @@ keepspaces=true,%
#1}}%
{}
% Language Definition and Code Environment for Julia
\lstdefinelanguage{Julia}{
alsoletter={.},
keywords={if, for, continue, break, end, else, true, false, @cuda},
keywordstyle=\color{blue},
sensitive=true,
morestring=[b]",
morestring=[d]',
morecomment=[l]{\#},
commentstyle=\color{gray},
stringstyle=\color{brown}
}
\lstnewenvironment{JuliaCode}[1][]
{\lstset{%
language=Julia,
escapeinside={/+}{+/}, % makes "/+" and "+/" available for Latex escapes (labels etc.)
#1}}%
{}
% Code Environment for Generic Code
\lstnewenvironment{GenericCode}[1][]
