implementation: finished interpreter section
parent 101b13e7e7, commit c4187a131e
@@ -33,6 +33,7 @@ function interpret(expressions::Vector{Expr}, variables::Matrix{Float32}, parame
	# Start kernel for each expression to ensure that no warp is working on different expressions
	@inbounds for i in eachindex(exprs)
		# TODO: Currently only the first expression gets evaluated. Either use a view on "cudaExprs" to determine the correct expression or extend cudaStepsize to include this information (this information was removed in a previous commit)
		# If a "view" is used, then the ExpressionProcessing must be updated to always include the stop opcode at the end
		kernel = @cuda launch=false fastmath=true interpret_expression(cudaExprs, cudaVars, cudaParams, cudaResults, cudaStepsize, i)
		# config = launch_configuration(kernel.fun)
		threads = min(variableCols, 128)
@@ -48,7 +49,7 @@ end
const MAX_STACK_SIZE = 25 # The depth of the stack to store the values and intermediate results
function interpret_expression(expressions::CuDeviceArray{ExpressionElement}, variables::CuDeviceArray{Float32}, parameters::CuDeviceArray{Float32}, results::CuDeviceArray{Float32}, stepsize::CuDeviceArray{Int}, exprIndex::Int)
	varSetIndex = (blockIdx().x - 1) * blockDim().x + threadIdx().x # ctaid.x * ntid.x + tid.x (1-based)
	@inbounds variableCols = length(variables) / stepsize[2] # number of variable sets

	if varSetIndex > variableCols
		return
@@ -79,29 +80,29 @@ function interpret_expression(expressions::CuDeviceArray{ExpressionElement}, var
			operationStackTop += 1
			operationStack[operationStackTop] = reinterpret(Float32, expr.Value)
		elseif expr.Type == OPERATOR
			opcode = reinterpret(Operator, expr.Value)
			if opcode == ADD
				operationStackTop -= 1
				operationStack[operationStackTop] = operationStack[operationStackTop] + operationStack[operationStackTop + 1]
			elseif opcode == SUBTRACT
				operationStackTop -= 1
				operationStack[operationStackTop] = operationStack[operationStackTop] - operationStack[operationStackTop + 1]
			elseif opcode == MULTIPLY
				operationStackTop -= 1
				operationStack[operationStackTop] = operationStack[operationStackTop] * operationStack[operationStackTop + 1]
			elseif opcode == DIVIDE
				operationStackTop -= 1
				operationStack[operationStackTop] = operationStack[operationStackTop] / operationStack[operationStackTop + 1]
			elseif opcode == POWER
				operationStackTop -= 1
				operationStack[operationStackTop] = operationStack[operationStackTop] ^ operationStack[operationStackTop + 1]
			elseif opcode == ABS
				operationStack[operationStackTop] = abs(operationStack[operationStackTop])
			elseif opcode == LOG
				operationStack[operationStackTop] = log(operationStack[operationStackTop])
			elseif opcode == EXP
				operationStack[operationStackTop] = exp(operationStack[operationStackTop])
			elseif opcode == SQRT
				operationStack[operationStackTop] = sqrt(operationStack[operationStackTop])
			end
		else
@@ -42,7 +42,7 @@ Usually, the number of variables per expression is around ten. However, the numb
These variables do not change during the runtime of the symbolic regression algorithm. As a result, the data only needs to be sent to the GPU once, which means that the impact of this data transfer is minimal. The data for the parameters, on the other hand, is much more volatile. As explained above, they are used for parameter optimisation and therefore vary from evaluation to evaluation and need to be sent to the GPU very frequently. However, the amount of data that needs to be sent is also much smaller. TODO: ONCE I GET THE DATA SEE HOW MANY BYTES PARAMETERS TAKE ON AVERAGE

\section{Architecture}
\label{sec:architecture}
Based on the requirements and data structure above, the architecture of both prototypes can be designed. While the requirements only specify the input and output, the components and workflow also need to be specified. This section aims to give an architectural overview of both prototypes, alongside their design decisions.

\begin{figure}
@@ -7,6 +7,9 @@ talk again how a typical input is often not complex enough (basically repeat tha
\section{Future Work}
talk about what can be improved

Frontend:
1.) extend the frontend to support ternary operators (basically, if the frontend sees a multiplication and an addition, it should collapse them into an FMA instruction)

Transpiler:
1.) transpile expressions directly from the Julia AST -> would save time because no intermediate representation needs to be created (loses a step and gains performance, but also makes the transpiler itself more complex)
2.) a better register management strategy might be helpful -> look into register pressure etc.
@@ -103,9 +103,9 @@ While the same expression usually occurs only once, sub-expressions can occur mu
Caching can be applied to individual sub-expressions as well as to the entire expression. While it is unlikely for the whole expression to recur frequently, either as a whole or as part of a larger expression, implementing a cache will not degrade performance and will, in fact, enhance it if repetitions do occur. In the context of parameter optimisation, where the evaluators are employed, expressions will recur, making full-expression caching advantageous. The primary drawback of caching is the increased use of RAM. However, given that RAM is plentiful in modern systems, this should not pose a significant issue.

\section{Interpreter}
The implementation is divided into two main components, the CPU-based control logic and the GPU-based interpreter, as outlined in the Concept and Design chapter. This section aims to describe the technical details of these components. First, the CPU-based control logic will be discussed. This component handles the communication with the GPU and is the entry point that is called by the symbolic regression algorithm. Following this, the GPU-based interpreter will be explored, highlighting the specifics of developing an interpreter on the GPU.

An overview of how these components interact with each other is outlined in Figure \ref{fig:interpreter-sequence}. The parts of this figure are explained in detail in the following sections.

\begin{figure}
	\centering
@@ -116,17 +116,39 @@ An overview of how these components interact with each other is outlined in Figu

\subsection{CPU Side}
% main loop; kernel transpiled by CUDA.jl into PTX and then executed
The interpreter is given all the expressions it needs to interpret as input. Additionally, it needs the variable matrix as well as the parameters for each expression. All expressions are passed to the interpreter as an array of Expr objects, as they are needed for the pre-processing step or the frontend. The first loop, as shown in Figure \ref{fig:interpreter-sequence}, is responsible for sending the expressions to the frontend to be converted into the intermediate representation. After this step, the expressions are in the correct format to be sent to the GPU and the interpretation process can continue.
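
A minimal sketch of this first loop, assuming \textit{expr\_to\_postfix} as the name of the frontend conversion function and using the \textit{ExpressionElement} token type of the intermediate representation:
\begin{GenericCode}
# Hedged sketch: expr_to_postfix is an assumed name for the frontend call
exprs = Vector{Vector{ExpressionElement}}(undef, length(expressions))
for i in eachindex(expressions)
	exprs[i] = expr_to_postfix(expressions[i]) # Expr -> intermediate representation
end
\end{GenericCode}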

Before the GPU can start with the interpretation, the data needs to be sent to the GPU. Because the variables are already in matrix form, transferring the data is fairly straightforward. Memory must be allocated in the global memory of the GPU and then be copied from RAM into the allocated memory. Allocating memory and transferring the data to the GPU is handled implicitly by the CuArray type provided by CUDA.jl.
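
For example, constructing a CuArray from the variable matrix performs the allocation and the copy in one step:
\begin{GenericCode}
cudaVars = CuArray(variables) # allocates GPU global memory and copies the matrix
\end{GenericCode}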

As the interpreter needs to be optimised for parameter optimisation workloads, this step is actually performed before the interpreter is called. The variables never change, as they represent the observed inputs of a system that needs to be modelled by the symbolic regression algorithm. Therefore, it would be wasteful to re-transmit the variables for each step of the parameter optimisation part. If they are transmitted once and then reused throughout the parameter optimisation part, a lot of time can be saved. It would even be possible to transfer the data to the GPU before the symbolic regression algorithm starts, saving even more time. However, as this would require a change to the symbolic regression algorithm, the decision has been made to forgo this optimisation. It is still possible to modify the implementation at a later stage with minimal effort, if required.

Once the variables are transmitted, the parameters must also be transferred to the GPU. Unlike the variables, the parameters are stored as a vector of vectors. In order to transmit the parameters efficiently, they also need to be brought into matrix form. The matrix needs to be of the form $k \times N$, where $k$ is equal to the length of the longest inner vector and $N$ is equal to the length of the outer vector. This ensures that all values can be stored in the matrix. It also means that if the inner vectors are of different lengths, some unnecessary values will be transmitted, but the overall benefit of treating them as a matrix outweighs this drawback. Program \ref{code:julia_vec-to-mat} shows how this conversion can be implemented. Note that an invalid element must be provided; this ensures defined behaviour and helps with finding errors in the code. After the parameters have been brought into matrix form, they can be transferred to the GPU the same way the variables are transferred.

\begin{program}
\begin{GenericCode}
function convert_to_matrix(vecs::Vector{Vector{T}}, invalidElement::T)::Matrix{T} where T
	maxLength = get_max_inner_length(vecs)

	# Pad the shorter vectors with the invalidElement to make all equal length
	paddedVecs = [vcat(vec, fill(invalidElement, maxLength - length(vec))) for vec in vecs]
	vecMat = hcat(paddedVecs...) # transform vector of vectors into column-major matrix

	return vecMat
end

function get_max_inner_length(vecs::Vector{Vector{T}})::Int where T
	return maximum(length.(vecs))
end
\end{GenericCode}
\caption{A Julia program fragment depicting the conversion from a vector of vectors into a matrix of the form $k \times N$.}
\label{code:julia_vec-to-mat}
\end{program}
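
As a usage sketch, the parameters could then be converted and transferred as follows; the choice of NaN32 as the invalid element is an assumption for illustration, not taken from the actual implementation:
\begin{GenericCode}
paramsMatrix = convert_to_matrix(parameters, NaN32) # pad with an invalid element
cudaParams = CuArray(paramsMatrix) # transfer to GPU global memory
\end{GenericCode}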

Similar to the parameters, the expressions are also stored as a vector of vectors. The outer vector contains each expression, while the inner vectors hold the expressions in their intermediate representation. Therefore, this vector of vectors also needs to be brought into matrix form the same way the parameters are. To simplify development, the special opcode \textit{stop} has been introduced, which is used as the invalidElement in Program \ref{code:julia_vec-to-mat}. As seen in Section \ref{sec:interpreter-gpu-side}, this element is used to determine whether the end of an expression has been reached during the interpretation process. This removes the need to send additional data storing the length of each expression, which reduces a lot of overhead.

Once the conversion into matrix form has been performed, the expressions are transferred to the GPU. Just like with the variables, the expressions remain the same over the course of the parameter optimisation part. Therefore, they are transferred to the GPU before the interpreter is called, to reduce the amount of unnecessary data transfer.

In addition to the already described data that needs to be sent, two more steps are required that have not been included in the sequence diagram in Figure \ref{fig:interpreter-sequence}. The first one is the allocation of global memory for the result matrix. Without this, the kernel would not know where to store the interpretation results. Therefore, enough global memory needs to be allocated so that the results can be stored and retrieved after all kernel executions have finished.
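
A sketch of this allocation; the layout of the result matrix, one row per variable set and one column per expression, is an assumption for illustration:
\begin{GenericCode}
# allocate uninitialised global memory to hold one result per
# (variable set, expression) pair
cudaResults = CuArray{Float32}(undef, variableCols, length(exprs))
\end{GenericCode}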

\begin{figure}
	\centering
@@ -135,24 +157,51 @@ In addition to the already described data that needs to be sent, two more steps
\label{fig:memory-layout-data}
\end{figure}

Only raw data can be sent to the GPU, which means that information about the data is missing. The matrices are represented as flat arrays, which means they have lost their column and row information. This information must be sent separately to let the kernel know the dimensions of the expressions, variables and parameters. Otherwise, the kernel does not know, for example, at which memory location the second variable set is stored. Figure \ref{fig:memory-layout-data} shows how the data is stored without any information about the rows or columns of the matrices. The thick lines help to identify where a new column, and therefore a new set of data, begins. However, the GPU has no knowledge of this, so the additional information must be transferred to ensure that the data is accessed correctly.
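
One way to transfer this layout information is a small integer array. The kernel shown earlier uses its second entry as the number of values per variable set; the remaining contents sketched here are an assumption:
\begin{GenericCode}
# hypothetical layout: [values per parameter set, values per variable set]
cudaStepsize = CuArray([size(paramsMatrix, 1), size(variables, 1)])
\end{GenericCode}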

Once all the data is present on the GPU, the CPU can dispatch the kernel for each expression. This dispatch requires parameters that specify the number of threads and their organisation into thread blocks. In total, one thread is required for each variable set, so the grouping into thread blocks is the primary tunable variable. Taking into account the constraints explained in Section \ref{sec:occupancy}, this grouping needs to be tuned for optimal performance. The specific values, alongside the methodology for determining them, will be explained in Chapter \ref{cha:evaluation}.

In addition, the dispatch parameters also include the pointers to the location of the data allocated and transferred above, as well as the index of the expression to be interpreted. Since all expressions and parameters are sent to the GPU at once, this index ensures that the kernel knows where in memory to find the expression it needs to interpret and which parameter set it needs to use. After the kernel has finished, the result matrix needs to be read from the GPU and passed back to the symbolic regression algorithm.

Crucially, dispatching a kernel is an asynchronous operation, which means that the CPU does not wait for the kernel to finish before continuing. This allows the CPU to dispatch all kernels at once, rather than one at a time. As explained in Section \ref{sec:architecture}, a GPU can have multiple resident grids, meaning that the dispatched kernels can run concurrently, drastically reducing evaluation times. Only once the result matrix is read from the GPU does the CPU have to wait for all kernels to finish execution.
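
Putting these pieces together, the dispatch loop might look as follows. The launch=false compilation step and the thread count are taken from the implementation shown earlier; the block-count calculation and the final copy are a sketch:
\begin{GenericCode}
@inbounds for i in eachindex(exprs)
	kernel = @cuda launch=false fastmath=true interpret_expression(cudaExprs,
		cudaVars, cudaParams, cudaResults, cudaStepsize, i)
	threads = min(variableCols, 128)
	blocks = cld(variableCols, threads) # enough blocks to cover all variable sets
	kernel(cudaExprs, cudaVars, cudaParams, cudaResults, cudaStepsize, i;
		threads = threads, blocks = blocks)
end
results = Array(cudaResults) # copying back waits for all kernels to finish
\end{GenericCode}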

\subsection{GPU Side}
\label{sec:interpreter-gpu-side}
% Memory access (currently global memory only)
% no dynamic memory allocation like on CPU (stack needs to have fixed size; also stack is stored in local memory)

With the GPU's global memory now containing all the necessary data and the kernel being dispatched, the interpretation process can begin. Before interpreting an expression, the global thread ID must be calculated. This step is crucial because each variable set is assigned to a unique thread. Therefore, the global thread ID determines which variable set should be used for the current interpretation instance.

Moreover, the global thread ID ensures that excess threads do not perform any work, as these threads would otherwise try to access a variable set that does not exist, leading to an illegal memory access. This check is necessary because the number of required threads often does not align perfectly with the number of threads per block multiplied by the number of blocks. If, for example, $1031$ threads are required, then at least two thread blocks are needed, as one thread block can hold at most $1024$ threads. Because $1031$ is a prime number, it cannot be divided evenly across a practical number of thread blocks. If two thread blocks are allocated, each holding $1024$ threads, a total of $2048$ threads is started. Therefore, the excess $2048 - 1031 = 1017$ threads must be prevented from executing. By using the global thread ID and the number of available variable sets, these excess threads can be easily identified and terminated early in the kernel execution.
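
In the kernel, this guard corresponds to the following lines from the interpreter shown earlier:
\begin{GenericCode}
varSetIndex = (blockIdx().x - 1) * blockDim().x + threadIdx().x # 1-based global thread ID
if varSetIndex > variableCols
	return # excess thread: no variable set assigned to it
end
\end{GenericCode}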

Afterwards, the stack for the interpretation can be created. It is possible to dynamically allocate memory on the GPU, which enables a programming model similar to that of the CPU. \textcite{winter_are_2021} have compared many dynamic memory managers and found that their performance impact is rather small. However, static allocation, where easily possible, still offers better performance. In the case of this thesis, static allocation is easily possible, which is why the stack has been chosen to have a static size. Because it is known that expressions do not exceed 50 tokens, including the operators, the stack size has been set to 25, which should be more than enough to hold the values and partial results, even in the worst case.
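
One way to obtain such a statically sized stack inside a CUDA.jl kernel is an MVector from the StaticArrays package; whether the actual implementation uses this construct is an assumption:
\begin{GenericCode}
using StaticArrays # assumed dependency for this sketch

# fixed-size stack; ends up in registers or thread-local memory
operationStack = MVector{MAX_STACK_SIZE, Float32}(undef)
operationStackTop = 0 # index of the current top of the stack
\end{GenericCode}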

\subsubsection{Main Loop} % MAYBE
After everything has been initialised, the main interpreter loop starts interpreting the expression. Because of the intermediate representation, the loop simply traverses the expression from left to right. On each iteration, the type of the current token is checked to decide which operation to perform.

If the current token corresponds to the \textit{stop} opcode, the interpreter knows that it has finished. This simplicity is the reason why, as explained above, this opcode has been introduced.
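
A skeleton of this loop; the exact token layout and the stop check are assumptions consistent with the description above:
\begin{GenericCode}
i = firstTokenIndex # start of this expression in the flattened matrix (assumed)
while true
	token = expressions[i]
	if token.Type == STOP # stop opcode: end of expression reached (assumed check)
		break
	end
	# ... dispatch on the token type: variable/parameter index, constant, operator ...
	i += 1
end
\end{GenericCode}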

% make clearer
More interesting is the case where the current token corresponds to an index into either the variable matrix or the parameter matrix. In that case, the value of the token is important. To access one of these matrices, the correct starting index must first be calculated. As already explained, all information about the layout of the data is lost during transfer. At this stage, the kernel only knows the index of the first element of either matrix, the index of the variable set and parameter set, as well as the index of the value inside the current variable set or parameter set. However, it is not known where the boundaries of the sets are, which is why the additionally transferred layout data is used in this step to calculate the index of the first element per set. With this calculated index and the index stored in the token, the correct value can be loaded. After the value has been loaded, it is pushed to the top of the stack for later use.

% MAYBE:
% Algorithm that shows how this calculation works
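
A sketch of this calculation for the variable matrix; the names are hypothetical, and the one-column-per-variable-set layout follows the description above:
\begin{GenericCode}
@inbounds valuesPerSet = stepsize[2] # from the separately transferred layout data
firstIndex = (varSetIndex - 1) * valuesPerSet # first element of this thread's set
@inbounds value = variables[firstIndex + tokenIndex] # tokenIndex stored in the token
operationStackTop += 1
operationStack[operationStackTop] = value
\end{GenericCode}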

Constants work very similarly: the token value is read and pushed to the top of the stack. However, since constants have been reinterpreted as integers for easy transfer to the GPU, this reinterpretation must be reversed before the value is pushed onto the stack.

The actual evaluation of the expression happens when the current token is an operator. The value of the token is the opcode, which determines the operation that needs to be performed. If the opcode corresponds to a unary operator, only the top value of the stack needs to be popped for the operation. The operation is then performed on this value and the result is pushed back onto the stack. If the opcode corresponds to a binary operator, the top two values need to be popped. These are then used for the operation and the result is again pushed back onto the stack.

With this, it would also be possible to add support for ternary operators. An example would be a fused multiply-add (FMA) operator. While this ternary operator does not appear in Julia expressions, the frontend could generate it when it encounters a sub-expression of the form $x * y + z$. As FMA performs the multiplication and addition as one fused operation instead of two separate ones, it would be a feasible optimisation. However, detecting such sub-expressions is more complicated, which is the reason it is not supported in the current implementation.
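
A sketch of how such a ternary handler could look; the FMA opcode and this helper are hypothetical, while Julia's built-in fma function is real:
\begin{GenericCode}
# Hypothetical handler for a ternary FMA opcode: consumes three stack
# values and pushes fma(x, y, z) = x * y + z as one fused operation.
function apply_fma!(stack, top::Int)::Int
	top -= 2
	@inbounds stack[top] = fma(stack[top], stack[top + 1], stack[top + 2])
	return top
end
\end{GenericCode}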

Once the interpreter loop has finished, the result of the evaluation needs to be stored in the result matrix. By using the index of the current expression, as well as the index of the current variable set, the location where the result must be stored can be calculated. The last value on the stack is the result, which is stored in the result matrix at the calculated location.
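
A sketch of this final store; the column-major indexing with one column per expression is an assumption consistent with the result matrix described above:
\begin{GenericCode}
resultIndex = (exprIndex - 1) * variableCols + varSetIndex # column-major offset
@inbounds results[resultIndex] = operationStack[operationStackTop]
\end{GenericCode}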

\section{Transpiler}
Talk about how the transpiler has been developed (probably the largest section, because it just has more interesting parts); the CPU-side part will be much larger than the GPU side

UML sequence diagram

Front-end and back-end
Caching of back-end results

PTX code generated and compiled using CUDA.jl (so basically using the driver) and then executed

Memory access (global memory and register management, especially register management)

BIN thesis/main.pdf (binary file not shown)