thesis: added abstract and Kurzfassung; re-read conclusion and evaluation to iron out mistakes etc.

2025-06-09 14:11:58 +02:00
parent b494803611
commit 3efd8a6c26
10 changed files with 255417 additions and 78 deletions


@ -35,11 +35,11 @@ The prototypes developed in this thesis, are part of a GP algorithm for symbolic
The parameters are unique to each expression, meaning there is a one-to-one mapping between a set of parameters and an expression. Furthermore, as can be seen in Figure \ref{fig:input_output_explanation}, each expression can have a different number of parameters, or even no parameters at all. However, without parameters, no parameter optimisation can be performed. This is in contrast to variables, where each expression must have the same number of variables. Because parameters are unique to each expression and can vary in number, they are not structured as a matrix, but as a vector of vectors.
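The following minimal sketch illustrates this difference in data layout, written in Julia with purely hypothetical example values: the variables form a single matrix shared by all expressions, while the parameters are stored as a vector of vectors with a potentially different length per expression.
\begin{verbatim}
# Variables: one matrix shared by all expressions
# (rows: variables, columns: variable sets)
variables = Float32[1.0 2.0 3.0;
                    4.0 5.0 6.0]

# Parameters: one vector per expression, each possibly of different length
parameters = Vector{Vector{Float32}}()
push!(parameters, Float32[0.5, 1.5])  # expression 1 has two parameters
push!(parameters, Float32[2.0])       # expression 2 has one parameter
push!(parameters, Float32[])          # expression 3 has no parameters
\end{verbatim}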
An important thing to consider is the volume and volatility of the data itself. The example shown in Figure \ref{fig:input_output_explanation} has been drastically simplified. It is expected that hundreds of expressions need to be evaluated per GP generation. Each of these expressions may contain between ten and 50 tokens. A token is either a variable, a parameter, a constant value or an operator.
It can be assumed that typically the number of variables per expression is around ten. However, the number of variable sets can increase drastically. A total of $1\,000$ variable sets can be considered the lower limit, while $100\,000$ can be considered the upper limit. Considering that one variable takes up 4 bytes of memory and 10 variables are needed per expression, at least $4 * 10 * 1\,000 = 40\,000$ bytes and at most $4 * 10 * 100\,000 = 4\,000\,000$ bytes need to be transferred to the GPU for the variables.
These variables do not change during the runtime of the symbolic regression algorithm. As a result, the data only needs to be sent to the GPU once, meaning the impact of this data transfer is minimal. The data for the parameters, on the other hand, is much more volatile. As explained above, they are used for parameter optimisation and therefore vary from evaluation to evaluation and need to be sent to the GPU very frequently. The amount of data that needs to be sent depends on the number of expressions as well as on the number of parameters per expression. Considering $10\,000$ expressions that need to be evaluated and an average of two parameters per expression, each requiring 4 bytes of memory, a total of $10\,000 * 2 * 4 = 80\,000$ bytes needs to be transferred to the GPU on each parameter optimisation step.
\section{Architecture}
\label{sec:architecture}


@ -1,18 +1,35 @@
\chapter[Conclusion]{Conclusion and Future Work}
\label{cha:conclusion}
A typical system consists of a set of inputs and an observed output. One example is modelling the flow in rough pipes as done by \textcite{nikuradse_laws_1950}, where the length, the diameter and the roughness of the pipes are the inputs and the flow through the pipe is the output. A mathematical model is needed to describe the correlation between these inputs and the output. Finding such a model or formula can be done with a computer using symbolic regression, which is typically implemented with genetic programming. During its runtime, thousands or even hundreds of thousands of formulas or expressions are generated, which need to be evaluated to determine whether they describe the observed system with sufficient accuracy. On a single machine utilising only the CPU, this process can take several hours to days. Therefore, this thesis deals with the question of how the evaluation of the expressions generated at runtime can be sped up to minimise execution times.
Research has been conducted on how to best approach this problem. The GPU has been chosen to improve the performance, as it is a cheap and powerful tool, especially compared to compute clusters. Numerous instances exist where utilising the GPU led to drastic performance improvements in many fields of research.
Two GPU evaluators were implemented to determine whether the GPU is more suitable than the CPU for evaluating expressions generated at runtime. The two implementations are as follows:
\begin{description}
\item[GPU Interpreter] \mbox{} \\
A stack-based interpreter that evaluates the expressions. The frontend converts these expressions into postfix notation to keep the implementation as simple as possible. It consists of a single kernel that is used to evaluate all expressions separately. A minimal sketch of this evaluation scheme is shown after this list.
\item[GPU Transpiler] \mbox{} \\
A transpiler that takes the expressions and transpiles them into PTX code. Each expression is represented by its own unique kernel. These kernels are simpler than the single GPU interpreter kernel, but more effort is needed to generate them.
\end{description}
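To illustrate the evaluation scheme of the GPU interpreter, the following simplified Julia sketch shows a stack-based evaluation of a postfix token sequence for a single variable set. The token representation and the example values are purely illustrative and not the data structures used in the actual implementation.
\begin{verbatim}
# Tokens are (kind, payload) pairs; kind is :variable, :parameter,
# :constant or :operator.
function evaluate_postfix(tokens, vars::Vector{Float32}, params::Vector{Float32})
    stack = Float32[]
    for (kind, payload) in tokens
        if kind == :variable
            push!(stack, vars[payload])        # payload: variable index
        elseif kind == :parameter
            push!(stack, params[payload])      # payload: parameter index
        elseif kind == :constant
            push!(stack, payload)              # payload: constant value
        else                                   # binary operator
            r = pop!(stack); l = pop!(stack)
            push!(stack, payload == :add ? l + r : (payload == :mul ? l * r : l - r))
        end
    end
    return pop!(stack)                         # a valid postfix sequence leaves one result
end

# x1 * p1 + 2.0 in postfix notation: x1 p1 * 2.0 +
tokens = [(:variable, 1), (:parameter, 1), (:operator, :mul),
          (:constant, 2.0f0), (:operator, :add)]
evaluate_postfix(tokens, Float32[3.0, 1.5], Float32[4.0])   # returns 14.0f0
\end{verbatim}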
In total, three benchmarks were conducted to determine if and under which circumstances the GPU is a more suitable choice for evaluating the expressions. The current CPU implementation serves as the baseline against which the GPU evaluators are compared. To answer the research questions, the benchmarks are structured as follows:
\begin{enumerate}
\item Roughly $250\,000$ expressions with $362$ variable sets have been evaluated. The goal of this benchmark was to determine how the evaluators handle large volumes of expressions.
\item Roughly $10\,000$ expressions with $362$ variable sets have been evaluated. This benchmark should demonstrate how a change in the number of expressions impacts the performance of the evaluators relative to one another.
\item Roughly $10\,000$ expressions and roughly $10\,000$ variable sets have been evaluated. By increasing the number of variable sets, a more realistic use-case is modelled with this benchmark. Additionally, using more variable sets should better exploit the strengths of the GPU.
\end{enumerate}
After conducting the first and second benchmarks it was clear that the CPU is the better choice in these scenarios. The first benchmark in particular demonstrated how the high RAM usage of the GPU transpiler led to it not finishing this benchmark. Reducing the number of expressions showed that the GPU transpiler can perform better than the GPU interpreter; however, relative to the CPU implementation, no real change was observed between the first and second benchmark. In the third benchmark, both GPU evaluators managed to outperform the CPU, with the GPU transpiler performing best.
To address the research questions, this thesis demonstrates that evaluating expressions generated at runtime can be more efficient on the GPU under specific conditions. Utilising the GPU becomes feasible when dealing with a high number of variable sets, typically in the thousands and above. For scenarios with fewer variable sets, the CPU remains the better choice. Additionally, in scenarios where RAM is abundant, the GPU transpiler is the optimal choice. If too little RAM is available and the number of variable sets is sufficiently large, the GPU interpreter should be chosen, as it outperforms both the GPU transpiler and the CPU in such cases.
\section{Future Work}
This thesis demonstrated how the GPU can be used to accelerate the evaluation of expressions and therefore the symbolic regression algorithm as a whole. However, the boundaries at which it becomes more feasible to utilise the GPU are very coarse-grained. Therefore, more research into how the number of expressions and variable sets impacts performance is needed. Furthermore, only one dataset with only two variables per variable set was used. Varying the number of variables per set and analysing their impact on performance could also be interesting. The impact of the parameters was omitted from this thesis entirely. Further research on how the number of parameters impacts the performance is of interest. Since parameters need to be transferred to the GPU frequently, having too many parameters could impact the GPU more negatively than the CPU.
The current implementation also has flaws that can be improved in future work. Currently, no shared memory is utilised, meaning the threads always need to retrieve the data from global memory. This is a slow operation, and efficiently utilising shared memory should further improve the performance of both GPU evaluators.
For the transpiler in particular, further improvements are possible. Transpiling the expressions directly from the Julia AST would remove the need to create an intermediate representation first. Since the expressions themselves do not need to be sent to the GPU, the intermediate representation is, in theory, not required, so skipping it would save time at the cost of a more complex transpiler. Additionally, a better register management strategy that takes register pressure into account might further improve the generated kernels.
Additionally, neither of the implementations supports special GPU instructions. Especially the Fused Multiply-Add (FMA) instruction is of interest. Given that multiplying two values and adding a third is a common operation, this special instruction allows these operations to be performed in a single clock cycle. The frontend can be extended to detect such sub-expressions and convert them into a special ternary opcode, enabling the backend to generate more efficient code. Whether the effort of detecting these sub-expressions is outweighed by the performance improvement needs to be determined in future work.
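As a rough illustration of such a frontend extension, the following sketch rewrites multiply-add patterns in a Julia expression tree into a ternary \verb|fma| node. It is only a sketch under the assumption that expressions are available as Julia \verb|Expr| objects; the actual frontend may work differently.
\begin{verbatim}
# Rewrite a * b + c (in either operand order) into fma(a, b, c).
function fuse_multiply_add(ex)
    ex isa Expr || return ex
    args = map(fuse_multiply_add, ex.args)          # rewrite children first
    if ex.head == :call && args[1] == :+ && length(args) == 3
        for (mul, other) in ((args[2], args[3]), (args[3], args[2]))
            if mul isa Expr && mul.head == :call &&
               mul.args[1] == :* && length(mul.args) == 3
                return Expr(:call, :fma, mul.args[2], mul.args[3], other)
            end
        end
    end
    return Expr(ex.head, args...)
end

fuse_multiply_add(:(x1 * p1 + x2))   # returns :(fma(x1, p1, x2))
\end{verbatim}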
Finally, parallelising the existing CPU interpreter itself is likely worthwhile, although this is an improvement to the current implementation rather than a topic for future research.


@ -1,21 +1,21 @@
\chapter{Evaluation}
\label{cha:evaluation}
This thesis aims to determine whether one of the two GPU evaluators is faster than the current CPU evaluator. This chapter describes the performance evaluation process. First, the environment in which the performance benchmarks are conducted is explained. Next, the results for the GPU interpreter and the transpiler are presented individually, alongside the performance tuning process used to achieve these results. Finally, the results of the GPU evaluators are compared to those of the CPU evaluator to answer the research questions of this thesis.
\section{Benchmark Environment}
In this section, the benchmark environment used to evaluate the performance is outlined. To ensure the validity and reliability of the results, it is necessary to specify the details of the environment. This includes a description of the hardware and software configuration as well as the performance evaluation process. With this, the variance between the results is minimised, which allows for better reproducibility and comparability between the implementations.
\subsection{Hardware Configuration}
The hardware configuration is the most important aspect of the benchmark environment. The capabilities of both the CPU and GPU can have a significant impact on the resulting performance. The following sections outline the importance of the individual components as well as the hardware used for the benchmarks and the performance tuning.
\subsubsection{GPU}
The GPU plays a crucial role, as different microarchitectures typically operate differently and therefore require different performance tuning. Although the evaluators can generally operate on any Nvidia GPU with a compute capability of at least 6.1, they are tuned for the Ampere microarchitecture which has a compute capability of 8.6. Despite the evaluators being tuned for this microarchitecture, more recent microarchitectures can be used as well. However, additional tuning is required to ensure that the evaluators can utilise the hardware to its fullest potential.
Tuning must also be done on a per-problem basis. In particular, the number of variable sets impacts how well the hardware is utilised. Therefore, it is crucial to determine which configuration yields the best performance. Section \ref{sec:results} outlines steps to tune the configuration for a specific problem.
\subsubsection{CPU}
Although the GPU plays a crucial role, work is also carried out on the CPU. The interpreter primarily utilises the CPU for the frontend and data transfer, making it more GPU-bound, as most of the work is performed on the GPU. However, the transpiler additionally relies on the CPU to perform the transpilation step. This step involves generating a kernel for each expression and sending these kernels to the driver for compilation, a process also handled by the CPU. By contrast, the interpreter only requires one kernel, which needs to be converted into PTX and compiled by the driver only once. Consequently, the transpiler is significantly more CPU-bound and variations in the CPU used have a much greater impact. Therefore, using a more powerful CPU benefits the transpiler more than the interpreter.
\subsubsection{System Memory}
In addition to the hardware configuration of the GPU and CPU, system memory (RAM) also plays a crucial role. Although RAM does not directly contribute to the overall performance, it can have a noticeable indirect impact due to its role in caching and general data storage. Insufficient RAM forces the operating system to use the page file, which is stored on a considerably slower SSD. This leads to slower data access, thereby reducing the overall performance of the application.
@ -23,7 +23,7 @@ In addition to the hardware configuration of the GPU and CPU, system memory (RAM
As seen in the list below, only 16 GB of RAM were available during the benchmarking process. This amount is insufficient to utilise caching to the extent outlined in Chapter \ref{cha:implementation}. Additional RAM was not available, meaning caching had to be disabled for all benchmarks as further explained in Section \ref{sec:results}.
\subsubsection{Hardware}
With the requirements explained above in mind, the following hardware is used to perform the benchmarks for the CPU-based evaluator, as well as for the GPU-based evaluators:
\begin{itemize}
\item Intel i5 12500
\item Nvidia RTX 3060 Ti
@ -43,17 +43,17 @@ Typically, newer versions of these components include, among other things, perfo
\subsection{Performance Evaluation Process}
Now that the hardware and software configurations have been established, the benchmarking process can be defined. This process is designed to simulate the load and scenario in which these evaluators will be used. The Nikuradse dataset \parencite{nikuradse_laws_1950} has been chosen as the data source. The dataset models the laws of flow in rough pipes and provides $362$ variable sets, each containing two variables. This dataset was first used by \textcite{guimera_bayesian_2020} to benchmark a symbolic regression algorithm.
Since only the evaluators are benchmarked, the expressions to be evaluated must already exist. These expressions are generated for the Nikuradse dataset using the exhaustive symbolic regression algorithm proposed by \textcite{bartlett_exhaustive_2024}. This ensures that the expressions are representative of what needs to be evaluated in a real-world application. In total, three benchmarks will be conducted, each having a different goal, which will be further explained in the following paragraphs.
The first benchmark involves a very large set of roughly $250\,000$ expressions with $362$ variable sets. This means that when using GP all $250\,000$ expressions would be evaluated in a single generation. In a typical generation, significantly fewer expressions would be evaluated. However, this benchmark is designed to show how the evaluators can handle very large volumes of data. Because of memory constraints, it was not possible to conduct an additional benchmark with a higher number of variable sets.
Both the second and third benchmarks are conducted to demonstrate how the evaluators perform in more realistic scenarios. For the second benchmark the number of expressions has been reduced to roughly $10\,000$, and the number of variable sets is again $362$. The number of expressions is much more representative of a typical scenario, while the number of variable sets is still low. To determine whether the GPU evaluators are a feasible alternative in scenarios with a realistic number of expressions but comparably few variable sets, this benchmark is conducted nonetheless.
Finally, a third benchmark will be conducted. Similar to the second benchmark, this benchmark evaluates the same roughly $10\,000$ expressions but now with $30$ times more variable sets, which equates to roughly $10\,000$. This benchmark mimics the scenario where the evaluators will most likely be used. While the others simulate different conditions to determine if and where the GPU evaluators can be used efficiently, this benchmark is more focused on determining if the GPU evaluators are suitable for the specific scenario they are likely going to be used in.
All three benchmarks also simulate a parameter optimisation step, as this is the intended use-case for these evaluators. For parameter optimisation, $100$ steps are used, meaning that all expressions are evaluated $100$ times. During the benchmark, this process is simulated by re-transmitting the parameters instead of generating new ones. Generating new parameters is not part of the evaluators and is therefore not implemented. However, because the parameters are re-transmitted each time, the overhead of sending the data is taken into account. This overhead is part of the evaluators and represents an additional burden that the CPU implementation does not have, making it important to measure.
\subsubsection{Measuring Performance}
The performance measurements are taken using the BenchmarkTools.jl\footnote{\url{https://juliaci.github.io/BenchmarkTools.jl/stable/}} package. It is the standard for benchmarking applications in Julia, which makes it an obvious choice for measuring the performance of the evaluators.
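The following sketch shows how such a measurement could be configured with BenchmarkTools.jl. The function \verb|evaluate_gpu| and its inputs are placeholders for the evaluator under test, and the exact settings used for the thesis benchmarks may differ.
\begin{verbatim}
using BenchmarkTools, Statistics

b = @benchmarkable evaluate_gpu($expressions, $variables, $parameters) samples=50 evals=1 seconds=10_000
trial = run(b)
println("median: ", median(trial).time / 1e9, " s, std: ", std(trial).time / 1e9, " s")
\end{verbatim}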
@ -64,14 +64,11 @@ It offers extensive support for measuring and comparing results of different imp
\label{sec:results}
This section presents the results of the benchmarks described above. First the results for the GPU-based interpreter will be presented alongside the performance tuning process. This is followed by the results of the transpiler as well as the performance tuning process. Finally, both GPU-based evaluators will be compared with each other to determine which of them performs the best. Additionally, these evaluators will be compared against the CPU-based interpreter to answer the research questions of this thesis.
\subsection{Interpreter}
In this section, the results for the GPU-based interpreter are presented in detail. Following the benchmark results, the process of tuning the interpreter is described, as well as how the tuning is adapted for the different benchmarks. This part not only covers the tuning of the GPU, but also the performance improvements made on the CPU side.
\subsubsection{Benchmark 1}
The first benchmark consists of $250\,000$ expressions and $362$ variable sets with $100$ parameter optimisation steps. Because each expression needs to be evaluated with each variable set for each parameter optimisation step, a total of $250\,000 * 362 * 100 \approx 9.05\,\textit{billion}$ evaluations have been performed per sample. In Figure \ref{fig:gpu_i_benchmark_1} the result over all $50$ samples is presented. The median value across all samples is $466.3$ seconds with a standard deviation of $14.2$ seconds.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/gpu-interpreter-final-performance-benchmark1.png}
@ -82,7 +79,7 @@ The first benchmark consists of $250\,000$ expressions and $362$ variable sets w
For the kernel configuration, a block size of $128$ threads has been used. As will be explained below, this has been found to be the configuration that yields the best performance. During the benchmark, the utilisation of both the CPU and GPU was roughly $100\%$.
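A hedged sketch of how the interpreter kernel could be dispatched with this configuration using CUDA.jl is shown below; the kernel function \verb|interpreter_kernel!| and its arguments are placeholders rather than the actual implementation.
\begin{verbatim}
using CUDA

threads = 128                       # block size found to perform best
blocks  = cld(362, threads)         # at least one thread per variable set
@cuda threads=threads blocks=blocks interpreter_kernel!(
    expressions_gpu, variables_gpu, parameters_gpu, results_gpu)
\end{verbatim}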
\subsubsection{Benchmark 2}
With $10\,000$ expressions, $362$ variable sets and $100$ parameter optimisation steps, the total number of evaluations per sample was $362\,\textit{million}$. The median across all samples is $21.3$ seconds with a standard deviation of $0.75$ seconds. Compared to the first benchmark, there were $25$ times fewer evaluations which also resulted in a reduction of the median and standard deviation of roughly $25$ times. This indicates a roughly linear correlation between the number of expressions and the runtime. Since the number of variable sets did not change, the block size for this benchmark remained at $128$ threads. Again the utilisation of the CPU and GPU during the benchmark was roughly $100\%$.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/gpu-interpreter-final-performance-benchmark2.png}
@ -91,7 +88,7 @@ With $10\,000$ expressions, $362$ variable sets and $100$ parameter optimisation
\end{figure}
\subsubsection{Benchmark 3}
The third benchmark used the same $10\,000$ expressions and $100$ parameter optimisation steps. However, there are now $30$ times more variable sets that need to be used for evaluation. This means that the total number of evaluations per sample is now $10.86\,\textit{billion}$. Compared to the first benchmark, an additional $1.8\,\textit{billion}$ evaluations were performed. However, as seen in Figure \ref{fig:gpu_i_benchmark_3}, the execution time was significantly faster. With a median of $30.3$ seconds and a standard deviation of $0.45$ seconds, this benchmark was only marginally slower than the second benchmark. This also indicates that the GPU evaluators are much better suited for scenarios with a high number of variable sets.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/gpu-interpreter-final-performance-benchmark3.png}
@ -99,35 +96,35 @@ The third benchmark used the same $10\,000$ expressions and $100$ parameter opti
\label{fig:gpu_i_benchmark_3}
\end{figure}
Although the number of variable sets has been increased by $30$ times, the block size remained at $128$ threads. Unlike the previous benchmarks, the hardware utilisation was different. Now only the GPU was utilised to 100\%, while the CPU utilisation started at 100\% and slowly dropped to 80\%. The GPU needs to perform $30$ times more evaluations per expression, meaning it takes longer for one kernel dispatch to finish. At the same time, the CPU tries to dispatch the kernels at the same rate as before. Because only a certain number of kernels can be dispatched at once, the CPU needs to wait for the GPU to finish a kernel before another one can be dispatched. Therefore, in this scenario, the evaluator runs into a GPU bottleneck and using a more performant GPU would consequently improve the runtime. In the previous benchmarks, both the CPU and GPU would need to be upgraded to achieve better performance.
\subsection{Performance Tuning Interpreter}
\label{sec:tuning_interpreter}
Optimising and tuning the interpreter is crucial to achieve good performance, especially tuning the kernel, as a wrongly configured kernel can drastically degrade performance. Before any performance tuning and optimisation was performed, the kernel was configured with a block size of $256$ threads, as this is a good initial configuration recommended by \textcite{nvidia_cuda_2025-1}. Additionally, on the CPU, the frontend was executed for each expression before every kernel dispatch, even in parameter optimisation scenarios where the expressions did not change from one dispatch to the next. Moreover, the variables were also transmitted to the GPU before every dispatch. However, executing the frontend as well as dispatching the kernel was multithreaded, utilising all 12 threads of the CPU, and a cache for the frontend was utilised.
With this implementation, the initial performance measurements have been conducted for the first benchmark, which served as the baseline for further performance optimisations. However, as already mentioned, memory limitations were encountered during this benchmark, as too much RAM was being used. Therefore, caching had to be disabled. Because the evaluator is multithreaded, this change resulted in significantly better performance. As the cache introduced critical sections where race conditions could occur, locking mechanisms were required. While locking ensures that no race conditions occur, it also means that parts of an otherwise entirely parallel implementation are now serialised, reducing the effect of parallelisation.
Without a cache and utilising all 12 threads, the frontend achieved very good performance. Processing $250\,000$ expressions takes roughly $88.5$ milliseconds. Using a cache, on the other hand, resulted in the frontend running for $6.9$ \textit{seconds}. This equates to a speed-up of roughly 78 times when using no cache. Additionally, when looking at the benchmark results above, the time it takes to execute the frontend is negligible, meaning further optimising the frontend would not significantly improve the overall runtime.
During the tuning process, $362$ variable sets have been used, which is the number of variable sets used by benchmarks one and two. Before conducting benchmark three, additional performance tuning has been performed to ensure that this benchmark also utilises the hardware as much as possible.
\subsubsection{Optimisation 1}
After caching was disabled, the first performance improvement was to drastically reduce the number of calls to the frontend and the number of data transfers to the GPU. Because the expressions and variables never change during the parameter optimisation process, processing the expressions and transmitting the data to the GPU on each step wastes resources. Therefore, the expressions are sent to the frontend once before the parameter optimisation process. Afterwards, the processed expressions as well as the variables are transferred to the GPU exactly once per execution of the interpreter.
Figure \ref{fig:gpu_i_optimisation_1} shows how this optimisation improved the overall performance as demonstrated with benchmark one. However, it can also be seen that the range the individual samples fall within is much greater now. While in all cases, this optimisation improved the performance, in some cases the difference between the initial and the optimised version is very low with roughly a two-second improvement.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/interpreter-comparison-initial-optim1.png}
\caption{Comparison of the initial implementation with the first optimisation applied on benchmark one. Note that while the results of the optimisation fall within a much wider range, all samples performed better than the initial implementation.}
\label{fig:gpu_i_optimisation_1}
\end{figure}
\subsubsection{Optimisation 2}
The second optimisation was concerned with tuning the kernel configuration. Using NSight Compute\footnote{\url{https://developer.nvidia.com/nsight-compute}} it was possible to profile the kernel with different configurations. During the profiling, many metrics were gathered that allowed for an in-depth analysis of the kernel executions, with the application recommending aspects that had potential for performance improvements.
Since the evaluator is designed to execute many kernel dispatches in parallel, it was important to reduce the kernel runtime. Reducing the runtime per kernel has a knock-on effect, as the following kernel dispatches can begin execution sooner, reducing the overall runtime.
@ -142,11 +139,11 @@ After the evaluator tuning has been concluded, it was found that a block size of
The found block size of $128$ might seem strange. However, it makes sense, as in total at least $362$ threads need to be started to evaluate one expression. If one block contains $128$ threads a total of $362 / 128 \approx 3$ blocks need to be started, totalling $384$ threads. As a result, only $384 - 362 = 22$ threads are excess threads. When choosing a block size of $121$ three blocks could be started, totalling one excess thread. However, there is no performance difference between a block size of $121$ and $128$. Since all threads are executed inside a warp, which consists of exactly $32$ threads, a block size that is not divisible by $32$ has no benefit and only hides the true amount of excess threads started.
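The following short calculation, written as Julia code purely for illustration, reproduces the excess-thread arithmetic for the two block sizes discussed above.
\begin{verbatim}
variable_sets = 362
for block_size in (128, 121)
    blocks = cld(variable_sets, block_size)          # blocks needed
    excess = blocks * block_size - variable_sets     # threads without work
    println("block size $block_size: $blocks blocks, $excess excess threads")
end
\end{verbatim}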
Benchmark three had a total of $10\,860$ variable sets, meaning at least this number of threads must be started. To ensure optimal hardware utilisation, the evaluator had to undergo another tuning process. As seen above, it is beneficial to start as few excess threads as possible. Using NSight Compute, a performance measurement with a block size of $128$ was used as the initial configuration. This already performed well, as again very few excess threads are started. In total $10\,860 / 128 \approx 84.84$ blocks are needed, which must be rounded up to $85$ blocks, with the last block being filled to roughly $84\%$, which equates to $20$ excess threads being started.
This was repeated for two more configurations, once for a block size of $160$ and once for $192$. With a block size of $160$, the total number of blocks was reduced to $68$, which again resulted in $20$ excess threads being started. The hypothesis behind increasing the block size was that using fewer blocks would result in better utilisation and therefore better performance. The same idea was also behind choosing a block size of $192$. However, while this only required $57$ blocks, the number of excess threads increased to $84$.
Using NSight Compute it was found that a block size of $160$ performed best, followed by a block size of $192$, with the worst performing configuration being a block size of $128$. However, this is not representative of how these configurations performed during the benchmarks. As seen in Figure \ref{fig:gpu_i_128-160-192}, using a block size of $128$ led to significantly better performance than the other configurations. While a block size of $160$ led to worse results, it needs to be noted that it also improved the standard deviation by 25\% when compared to the results with a block size of $128$. These results also demonstrate that it is important to not only use NSight Compute but also conduct performance tests with real data to ensure the best possible configuration is chosen.
\begin{figure}
\centering
@ -158,11 +155,11 @@ Using NSight Compute it was found, that a block size of $160$ was the best perfo
\subsubsection{Optimisation 3}
As seen in Figure \ref{fig:gpu_i_optimisation_2}, while the performance overall improved, the standard deviation also significantly increased. With the third optimisation, the goal was to reduce the standard deviation. In order to achieve this, some minor optimisations were applied.
The first optimisation was to reduce the stack size of the interpreter from 25 to 10. As the stack is stored in local memory, it is beneficial to minimise the data transfer and the amount of memory allocated. This change, however, means that the stack might not be sufficient for larger expressions. Because no problems were found with a stack size of 10 during testing, it was assumed to be sufficient. In cases where it is not, the stack size can simply be increased.
During the parameter optimisation step, a lot of memory operations were performed. These are required because for each step new memory on the GPU must be allocated for both the parameters and the meta information. The documentation of CUDA.jl\footnote{\url{https://cuda.juliagpu.org/stable/usage/memory/\#Avoiding-GC-pressure}} mentions that this can lead to higher garbage-collector (GC) pressure, increasing the time spent garbage-collecting. To reduce this, CUDA.jl provides the \verb|CUDA.unsafe_free!(::CuArray)| function. This frees the memory on the GPU without requiring the Julia GC to run, therefore spending fewer resources on garbage collection and more on evaluating the expressions.
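A hedged sketch of this allocation pattern is shown below; the surrounding names are placeholders and the actual implementation differs in its details.
\begin{verbatim}
using CUDA

for step in 1:100
    parameters_gpu = [CuArray(p) for p in parameters]  # fresh parameters every step
    # ... dispatch the kernels that read parameters_gpu ...
    foreach(CUDA.unsafe_free!, parameters_gpu)          # release GPU memory immediately
end
\end{verbatim}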
With these two changes the overall runtime has been improved as can be seen in Figure \ref{fig:gpu_i_optimisation_3}. Moreover, the standard deviation was also reduced which was the main goal of this optimisation.
\begin{figure}
\centering
@ -173,16 +170,15 @@ With these two changes the overall runtime has been improved as can be seen in F
\subsection{Transpiler}
In this section the results for the transpiler are presented in detail. First the results for all three benchmarks are shown. The benchmarks are the same as already explained in the previous sections. After the results, an overview of the steps taken to optimise the transpiler execution times is given.
\subsubsection{Benchmark 1}
\label{sec:gput_bench1}
This benchmark led to very poor results for the transpiler. While the best performing kernel configuration of $128$ threads per block was used, the above-mentioned RAM constraints meant that this benchmark performed poorly. After roughly $20$ hours of execution, only two samples had been taken, at which point it was decided not to finish this benchmark and to treat it as failed.
As described in Chapter \ref{cha:implementation} the expressions are transpiled into PTX code and then immediately compiled into machine code by the GPU driver before the compiled kernels are sent to the parameter optimisation step. This order of operations makes sense as the expressions remain the same during this process and otherwise would result in performing a lot of unnecessary transpilations and compilations.
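The following sketch outlines this order of operations using the CUDA.jl driver API. The function \verb|transpile_to_ptx| and the kernel names are placeholders; the sketch only illustrates that each expression is transpiled and compiled exactly once before the parameter optimisation steps.
\begin{verbatim}
using CUDA

compiled_kernels = map(enumerate(expressions)) do (i, expr)
    ptx = transpile_to_ptx(expr, i)                 # CPU-side transpilation to PTX
    mod = CuModule(ptx)                             # driver compiles PTX to machine code
    CuFunction(mod, "expression_kernel_$i")         # handle reused for every dispatch
end
# compiled_kernels are then reused on each of the 100 parameter optimisation steps
\end{verbatim}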
However, only 16 GB of RAM were available, with about half of that being used by the operating system. This meant that about eight GB of RAM were available to store the $250\,000$ compiled kernels alongside other required data, for example the variable matrix. This was not enough memory and the benchmark failed. To combat this, the step of compiling the kernels was moved into the parameter optimisation process, as this frees the memory taken up by a compiled kernel after it has been executed. As seen above, this consequently hurt the performance dramatically and has shown that for these scenarios much more memory is required for the transpiler to work properly.
\subsubsection{Benchmark 2}
@ -194,13 +190,11 @@ By reducing the number of expressions from $250\,000$ to roughly $10\,000$ the R
\label{fig:gpu_t_benchmark_2}
\end{figure}
During the benchmark it was observed that the CPU maintained a utilisation of 100\%. Crucially, however, the GPU rapidly oscillated between 0\% and 100\% utilisation. This pattern suggests that while the kernels can fully utilise the GPU, they complete the evaluations almost immediately. Consequently, although the evaluation is performed very quickly, the time spent evaluating is smaller than the time spent preparing the expressions for evaluation. To better leverage the GPU, more evaluations should be performed. This would increase the GPU's share of the total execution time and therefore increase the efficiency drastically.
\subsubsection{Benchmark 3}
This benchmark increased the number of variable sets by $30$ times and therefore also increased the total number of evaluations by $30$ times. As observed in the second benchmark, the GPU was underutilised and thus had more resources available for evaluating the expressions. As shown in Figure \ref{fig:gpu_t_benchmark_3}, the available resources were better utilised. Although the number of evaluations increased by a factor of $30$, the median execution time only increased by approximately six seconds, or $1.3$ times, from $19.6$ to $25.4$ seconds. The standard deviation also decreased from $1.16$ seconds to $0.65$ seconds.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/gpu-transpiler-final-performance-benchmark3.png}
@ -208,33 +202,33 @@ This benchmark increased the amount of variable sets by $30$ times and therefore
\label{fig:gpu_t_benchmark_3}
\end{figure}
Given the change in the number of variable sets, additional performance tests with different block sizes were conducted. During this process it was found, that changing the block size from $128$ to $160$ threads resulted in the best performance. This is in contrast to the GPU interpreter where changing the block size to $160$ resulted in degraded performance.
While conducting this benchmark, the CPU utilisation began at 100\% during the frontend step as well as the transpilation and compilation steps. However, similar to the third benchmark of the GPU interpreter, the CPU utilisation dropped to 80\% during the evaluation phase. This is very likely due to the same reason: the kernels are dispatched too quickly in succession, filling up the number of allowed resident grids on the GPU.
However, GPU utilisation also increased drastically. Whereas rapid oscillation was observed during the second benchmark, the utilisation remained much more stable in this benchmark, hovering around 60\% to 70\% most of the time. It should be noted that there were frequent spikes to 100\% and slightly less frequent drops to 20\% utilisation. Overall, the GPU utilisation was much higher compared to the second benchmark, which explains why the execution time only increased slightly despite the drastic increase in the number of evaluations.
\subsection{Performance Tuning Transpiler}
% Initial: no cache; 256 blocksize; exprs pre-processed and transpiled on every call; vars sent on every call; frontend + transpilation + dispatch are multithreaded
This section describes how the transpiler has been tuned to achieve good performance. It presents the steps taken to improve the performance of the CPU-side of the transpiler as well as the steps taken to improve the performance of the kernels.
Before any optimisations were applied, the block size was set to $256$ threads. The frontend as well as the transpilation and compilation were performed during the parameter optimisation step, immediately before the expression needed to be evaluated. Additionally, the variables were sent to the GPU on every parameter optimisation step. Multithreading was used for the frontend, transpilation, compilation and kernel dispatch. Caching was also used for the frontend and for the transpilation process in an effort to reduce the runtime.
As already mentioned in Section \ref{sec:tuning_interpreter}, using a cache in combination with multithreading for the frontend drastically slowed down the execution, which is why it was disabled before conducting any benchmarks.
Caching was also used for the transpilation step to reduce the runtime during parameter optimisation. While this reduced the overhead of transpilation, the overhead of checking the cache for an already transpiled expression remained. Because of the RAM constraints already mentioned, this cache was disabled and a better solution was implemented in the first and second optimisation steps.
Most data of the tuning process was gathered with the number of expressions and variable sets of the first benchmark, as this was the worst performing scenario and therefore best showed where potential for performance improvements lay. Before any optimisations were applied, a single sample of the first benchmark took roughly 15 hours. However, it needs to be noted that only two samples were taken due to the duration of one sample.
\subsubsection{Optimisation 1}
% 1.) Done before parameter optimisation loop: Frontend, transmitting Variables (improved runtime)
Since all caching has been disabled, a better solution for reducing the number of calls to the frontend was needed. For this, the calls to the frontend were moved outside the parameter optimisation step and the results were stored for later use. Furthermore, the variables are now transmitted to the GPU before the parameter optimisation starts, further reducing the number of data transfers to the GPU and the volume of data transferred. These two optimisations reduced the runtime of one sample to roughly 14 hours and are equivalent to the first optimisation step of the GPU interpreter.
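The following minimal sketch illustrates this restructuring. It is purely illustrative: the function and variable names (\verb|convert_to_postfix|, \verb|evaluate_gpu|, \verb|update_parameters!|) do not correspond to the actual implementation and only indicate where each step is performed.
\begin{verbatim}
using CUDA

# Frontend and variable transfer are hoisted out of the
# parameter optimisation loop (all names are illustrative):
d_vars        = CuArray(variables)                  # variables sent to the GPU once
postfix_exprs = [convert_to_postfix(e) for e in expressions]  # frontend, once per expression

for step in 1:max_steps                             # parameter optimisation loop
    # only the parameters change between iterations and need to be transferred
    results = evaluate_gpu(postfix_exprs, d_vars, parameters)
    update_parameters!(parameters, results)
end
\end{verbatim}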
\subsubsection{Optimisation 2}
% 2.) All expressions to execute are transpiled first (before they were transpiled for every execution, even in parameter optimisation scenarios). Compilation is done every time in benchmark 1, because too little RAM was available (compilation takes the most time).
With this optimisation step, the number of calls to the transpiler and compiler was drastically reduced. Both steps are now performed at the same time the frontend is called. The compiled kernels are then stored and only need to be executed during the parameter optimisation step. This meant that a cache was no longer needed: since a new set of expressions is generated for each evaluation, it is extremely unlikely that the same expression needs to be evaluated more than once. Consequently, the benefit of reducing the RAM consumption far outweighs the potential time savings of using a cache. Moreover, removing the cache also removed the overhead of accessing it on every parameter optimisation step, further improving performance.
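A minimal sketch of this step is shown below. The helper \verb|transpile_to_ptx| and the kernel name \verb|"ExpressionProcessing"| are illustrative assumptions; the sketch only shows that each expression is transpiled and compiled once, with the resulting kernel handles kept for the parameter optimisation step.
\begin{verbatim}
using CUDA

# Transpilation and compilation happen together with the frontend,
# before parameter optimisation; only the compiled kernels are kept.
compiled = Vector{CuFunction}(undef, length(postfix_exprs))
for i in eachindex(postfix_exprs)
    ptx         = transpile_to_ptx(postfix_exprs[i])  # PTX source as a String
    mod         = CuModule(ptx)                       # driver JIT-compiles the PTX
    compiled[i] = CuFunction(mod, "ExpressionProcessing")
end
# During parameter optimisation only the stored kernel handles are launched.
\end{verbatim}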
It must also be noted that compiling the PTX kernels and storing the result before the parameter optimisation step led to an out-of-memory error for the first benchmark. In order to obtain any results, this step had to be reverted for this benchmark. If much more RAM were available, the runtime would have been significantly better.
These optimisations led to a runtime of roughly ten hours per sample for the first benchmark, a substantial improvement of roughly four hours per sample. Transpiling $10\,000$ expressions takes on average $0.05$ seconds over ten samples, whereas compiling the resulting $10\,000$ kernels takes on average $3.2$ seconds over ten samples. This suggests that performing the compilation before the parameter optimisation step would yield drastically better results in the first benchmark.
% unsafe_free in benchmark one reduced std. but could also be run to run variance. at least no negative effects
The third optimisation step was more focused on improving the performance of the third benchmark, as it has a higher number of variable sets than the first and second ones. However, as with the interpreter, the function \verb|CUDA.unsafe_free!(::CuArray)| was used to reduce the standard deviation across all benchmarks.
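A minimal sketch of this buffer handling is shown below, assuming the results of all kernel launches are gathered in a single GPU buffer; the buffer name and dimensions are illustrative.
\begin{verbatim}
using CUDA

# Illustrative buffer handling: results are copied back and the GPU buffer
# is released immediately instead of waiting for the garbage collector.
d_results = CuArray{Float32}(undef, number_of_expressions, number_of_variable_sets)
# ... kernels write their results into d_results ...
results = Array(d_results)       # copy results to host memory
CUDA.unsafe_free!(d_results)     # free GPU memory eagerly
\end{verbatim}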
Since the number of variable sets has changed in the third benchmark, it is important to re-do the performance tuning. This was done by measuring the kernel performance using NSight Compute. As with the interpreter, block sizes of $128$ and $160$ threads were compared with each other. A block size of $192$ threads was omitted here since its number of excess threads is very high. In the case of the interpreter, this configuration performed the worst of the three, and it was assumed it would behave similarly in this scenario.
However, since the number of excess threads for $128$ and $160$ threads per block is the same, the latter configuration, which uses fewer blocks, might lead to performance improvements in the case of the transpiler. As seen in Figure \ref{fig:gpu_t_128_160}, this assumption held true and using a block size of $160$ threads resulted in better performance for the third benchmark. This is in contrast to the interpreter, where this configuration performed much more poorly.
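This can be verified with a short calculation, assuming one thread per variable set and the $10\,860$ variable sets of the third benchmark:
\begin{verbatim}
variable_sets = 10_860   # number of variable sets in the third benchmark

for threads_per_block in (128, 160)
    blocks = cld(variable_sets, threads_per_block)        # blocks needed to cover every set
    excess = blocks * threads_per_block - variable_sets   # threads left without a variable set
    println("$threads_per_block threads: $blocks blocks, $excess excess threads")
end
# 128 threads: 85 blocks, 20 excess threads
# 160 threads: 68 blocks, 20 excess threads
\end{verbatim}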
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/transpiler-comparison-128-160.png}
\subsection{Comparison}
% Comparison of Interpreter and Transpiler as well as Comparing the two with CPU interpreter
% more var sets == better performance for GPU; more expressions == more performance for CPU evaluator
With the individual results of the GPU interpreter and transpiler presented, it is possible to compare them with the existing CPU interpreter. This section outlines and compares the performance of all three implementations across all three benchmarks to understand their strengths and weaknesses. Through this analysis, the scenarios are identified in which it is best to leverage the GPU, but also those in which the CPU interpreter is the better choice, ultimately answering the research questions of this thesis.
\subsubsection{Benchmark 1}
The goal of the first benchmark was to determine how well the evaluators handle large amounts of expressions. While this benchmark is not representative of a typical scenario, it demonstrates the impact the number of expressions has on the execution time. As already explained in Section \ref{sec:gput_bench1}, the transpiler failed to finish this benchmark due to RAM limitations. A slightly modified implementation was required to obtain results for at least two samples, each taking roughly ten hours to complete, which is why the transpiler has been omitted from this comparison.
\begin{figure}
\centering
\label{fig:cpu_gpui_gput_benchmark_1}
\end{figure}
Figure \ref{fig:cpu_gpui_gput_benchmark_1} shows the results of the first benchmark for the CPU and GPU interpreter. It can be seen that the GPU interpreter takes roughly four times as long as the CPU interpreter on median. Additionally, the standard deviation is much larger for the GPU interpreter. This shows that the CPU heavily benefits from scenarios where many expressions need to be evaluated with very few variable sets. Therefore, it is not advisable to use the GPU to increase the performance in such scenarios.
\subsubsection{Benchmark 2}
The first benchmark has shown that, with a large number of expressions, the GPU is not a suitable alternative to the CPU. To further support this statement, a second benchmark with far fewer expressions was conducted. Instead of $250\,000$ expressions, only $10\,000$ are evaluated. This reduction also meant that the transpiler could be included in the comparison, as it no longer faces any RAM limitations.
\begin{figure}
\centering
\label{fig:cpu_gpui_gput_benchmark_2}
\end{figure}
Reducing the number of expressions did not benefit the GPU evaluators at all in relation to the CPU interpreter, as can be seen in Figure \ref{fig:cpu_gpui_gput_benchmark_2}. In fact, both GPU evaluators are now roughly five times slower than the CPU interpreter, compared to the factor of roughly four observed previously. Again, the standard deviation is much higher for both GPU evaluators when compared to the CPU interpreter. This means that a lower number of expressions does not necessarily allow the GPU to outperform the CPU, disproving the assumption that only a large number of expressions causes the GPU to perform poorly.
On the other hand, it can also be seen that the GPU transpiler tends to perform better than the GPU interpreter. While in the worst case both implementations are roughly equal, the GPU transpiler performs better on median and also outperforms the GPU interpreter in the best case.
\subsubsection{Benchmark 3}
As found in the previous two benchmarks, varying the number of expressions only has a slight impact on the performance of the GPU relative to the CPU. However, instead of varying the number of expressions, the number of variable sets can also be changed. For this benchmark, instead of $362$ variable sets, a total of $10\,860$ variable sets was used, an increase by a factor of $30$. It needs to be noted that it was only possible to evaluate the performance with roughly $10\,000$ expressions at this number of variable sets. When using the roughly $250\,000$ expressions of the first benchmark together with the increased number of variable sets, none of the implementations managed to complete the benchmark, as too little RAM was available.
\begin{figure}
\centering
\label{fig:cpu_gpui_gput_benchmark_3}
\end{figure}
Increasing the number of variable sets greatly benefited both GPU evaluators, as seen in Figure \ref{fig:cpu_gpui_gput_benchmark_3}. With this change, the CPU interpreter noticeably fell behind the GPU evaluators, taking roughly twice as long on median as the GPU transpiler. The GPU transpiler continued its trend of performing better than the GPU interpreter. Furthermore, the standard deviation of all three evaluators is very similar.
From this benchmark it can be concluded that the GPU heavily benefits from a larger number of variable sets. If the number of variable sets is increased even further, the difference in performance between the GPU and CPU should become even more pronounced.
While the GPU is very limited in the number of kernels that can be resident concurrently, the number of threads and blocks per kernel launch can be virtually unlimited. This means that a higher degree of parallelism is achievable with a higher number of variable sets. Increasing the number of expressions, on the other hand, does not influence the degree of parallelism to this extent. This is the reason no performance benefit was found when only the number of expressions was decreased while the number of variable sets stayed the same.
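This relationship can be sketched as follows. The sketch reuses the illustrative names from the tuning section above, omits the per-expression result handling, and only shows that expressions map to largely sequential kernel launches while variable sets map to parallel threads within a launch.
\begin{verbatim}
using CUDA

threads = 160
blocks  = cld(variable_set_count, threads)   # parallelism grows with the variable sets

for kernel in compiled                       # one launch per expression -> largely sequential
    cudacall(kernel,
             Tuple{CuPtr{Float32}, CuPtr{Float32}, CuPtr{Float32}},
             d_vars, d_params, d_results;
             threads=threads, blocks=blocks)
end
\end{verbatim}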


\chapter{Introduction}
\label{cha:Introduction}
This chapter provides an entry point for this thesis. First, the motivation for exploring this topic is presented. In addition, the research questions of this thesis are outlined. Finally, the structure of this thesis is described, explaining how each part contributes to answering the research questions.
\section{Background and Motivation}
%
% Not totally happy with this yet
%
Optimisation and acceleration of program code is a crucial part of many fields. For example, video games need optimisation to lower the minimum hardware requirements, which allows more people to run the game, increasing sales. Another example where optimisation is important are computer simulations. For those, optimisation is even more crucial, as it allows scientists to run more detailed simulations or to obtain the simulation results faster. Equation learning or symbolic regression is another field that can heavily benefit from optimisation. One part of equation learning is to evaluate the expressions generated by a search algorithm, which can make up a significant portion of the runtime. This thesis is concerned with optimising this evaluation part to increase the overall performance of equation learning algorithms.
The following expression $5 - \text{abs}(x_1) \, \sqrt{p_1} / 10 + 2^{x_2}$, which contains simple mathematical operations as well as variables $x_n$ and parameters $p_n$, is one example that can be generated by an equation learning algorithm. Usually, an equation learning algorithm generates hundreds or even thousands of such expressions per iteration, all of which have to be evaluated. Additionally, multiple different values must be inserted for all variables and parameters, drastically increasing the number of evaluations that need to be performed.
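To illustrate this workload, the following minimal Julia snippet, which is purely illustrative and not part of the implementation, evaluates the example expression for a handful of variable configurations and a single parameter value; every additional variable configuration and every parameter update adds further evaluations.
\begin{verbatim}
# One generated expression, written as a plain Julia function for illustration.
f(x1, x2, p1) = 5 - abs(x1) * sqrt(p1) / 10 + 2^x2

variable_sets = [(1.0, 2.0), (-3.5, 0.5), (7.2, 1.1)]   # values for x1 and x2
p1 = 0.25                                               # one candidate parameter value

results = [f(x1, x2, p1) for (x1, x2) in variable_sets] # one evaluation per configuration
\end{verbatim}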
In his blog, \textcite{sutter_free_2004} described how the free lunch is over in terms of the ever-increasing performance of hardware like the CPU. He states that to gain additional performance, developers need to start developing software for multiple cores rather than hope that the next generation of CPUs will magically make the program run faster. While this approach means more development overhead, a much greater speed-up can be achieved. However, in some cases the speed-up achieved by this is still not large enough, and another approach is needed. One of these approaches is the utilisation of Graphics Processing Units (GPUs) as an easy and affordable option compared to compute clusters. Especially when talking about performance per dollar, GPUs are very inexpensive, as found by \textcite{brodtkorb_graphics_2013}. \textcite{michalakes_gpu_2008} have shown a noticeable speed-up when using GPUs for weather simulation. In addition to computer simulations, GPU acceleration can also be found in other places such as networking \parencite{han_packetshader_2010} or structural analysis of buildings \parencite{georgescu_gpu_2013}. These solutions were all developed using CUDA\footnote{\url{https://developer.nvidia.com/cuda-toolkit}}. However, it is also possible to develop assembly-like code for GPUs using Parallel Thread Execution (PTX)\footnote{\url{https://docs.nvidia.com/cuda/parallel-thread-execution/}} to gain more control.
\section{Research Question}
Given these successful applications of GPU acceleration, the aim of this thesis is to improve the performance of evaluating mathematical equations generated at runtime for symbolic regression using GPUs. Therefore, the following research questions are formulated:
\begin{itemize}
\item How can simple arithmetic expressions that are generated at runtime be efficiently evaluated on GPUs?
\item Under which circumstances is the interpretation of the expressions on the GPU or the translation to the intermediate language Parallel Thread Execution (PTX) more efficient?
\end{itemize}
Answering the first question is necessary to ensure the approach of this thesis is feasible. If it is feasible, it is important to determine whether evaluating the expressions on the GPU actually improves performance over a parallelised CPU evaluator. To answer whether the GPU evaluator is faster than the CPU evaluator, the last research question is important. As there are two major ways of implementing an evaluator on the GPU, both need to be implemented and evaluated in order to state whether evaluating expressions on the GPU is faster and, if so, which type of implementation results in the best performance under which circumstances.
\section{Thesis Structure}

\chapter{Abstract}
The objective of symbolic regression is to identify an expression that accurately models a system based on a set of inputs. For instance, one might determine the flow through pipes using inputs such as roughness, diameter, and length by conducting experiments with varying input configurations, observing the resulting flow, and deriving an expression from these experiments. This methodology, exemplified by \textcite{nikuradse_laws_1950}, can be applied to any system through symbolic regression. To find the best-fitting expression, millions of candidate expressions are generated, each requiring evaluation against every input configuration to assess how well it fits the system. Consequently, millions of evaluations must be performed, a process that is computationally intensive and time-consuming. Thus, optimising the evaluation phase of symbolic regression is crucial for discovering expressions that describe large and complex systems within a feasible timeframe.
% Applications such as weather simulation \parencite{michalakes_gpu_2008}, simulation of static and rotating black holes \parencite{hissbach_overview_2022, verbraeck_interactive_2021}, and structural analysis \parencite{georgescu_gpu_2013} significantly benefit from optimized algorithms that leverage the graphics processing unit (GPU).
This thesis presents the design and implementation of two evaluators that utilise the GPU to evaluate expressions generated at runtime by a symbolic regression algorithm. Performance benchmarks are conducted to compare the efficiency of the GPU evaluators against the current CPU evaluator.
The benchmark results indicate that the GPU can serve as a viable alternative to the CPU in certain scenarios. The determining factor for choosing between GPU and CPU evaluation is the number of input configurations. In a scenario with $10\,000$ expressions and $10\,000$ input configurations, the GPU outperformed the CPU by a significant margin.
This master thesis is associated with the FFG COMET project ProMetHeus (\#904919). The developed software is used and further developed for modelling in the ProMetHeus project.

\chapter{Kurzfassung}
\begin{german}
Das Ziel der symbolischen Regression ist es, einen Ausdruck zu finden, der ein System basierend auf einer Reihe von Variablen modelliert. Beispielsweise kann man den Durchfluss durch Rohre unter Verwendung von Variablen wie Rauheit, Durchmesser und Länge bestimmen, indem Experimente mit verschiedenen Werten für die Variablen durchgeführt werden. Für jedes Experiment wird der Durchfluss gemessen, wodurch man eine allgemeine Formel ableiten kann, welche die Beziehung der Variablen mit dem Durchfluss beschreibt. Diese Methodik, veranschaulicht durch die Arbeit von \textcite{nikuradse_laws_1950}, kann mithilfe von symbolischer Regression auf unterschiedliche Systeme angewendet werden. Um einen Ausdruck zu finden, welcher das System am besten beschreibt, werden Millionen von Kandidatenausdrücken generiert. Diese müssen unter Verwendung der Variablenkonfigurationen aller Experimente ausgewertet werden, um ihre Passgenauigkeit zum System zu beurteilen. Folglich müssen Millionen von Auswertungen durchgeführt werden, ein Prozess, der rechenintensiv und zeitaufwendig ist. Daher ist die Optimierung der Auswertungsphase der symbolischen Regression entscheidend. So wird es ermöglicht, Ausdrücke in einem angemessenen Zeitrahmen zu finden, welche große und komplexe Systeme beschreiben.
Diese Arbeit präsentiert das Design und die Implementierung von zwei Evaluatoren, die die Grafikkarte (GPU) nutzen, um Ausdrücke zu bewerten, die zur Laufzeit der symbolischen Regression generiert werden. Leistungsbenchmarks werden durchgeführt, um die Performanz der GPU-Evaluatoren mit dem aktuellen CPU-Evaluator zu vergleichen.
Die Benchmark-Ergebnisse zeigen, dass die GPU in bestimmten Szenarien als eine tragfähige Alternative zur CPU dienen kann. Der entscheidende Faktor für die Wahl zwischen GPU- und CPU-Auswertung ist die Anzahl der Experimente und folglich die Menge an Variablenkonfigurationen. In einer Konfiguration mit $10\,000$ Ausdrücken und $10\,000$ Variablenkonfigurationen übertraf die GPU die CPU deutlich.
Diese Masterarbeit steht im Zusammenhang mit dem FFG COMET Projekt ProMetHeus (\#904919). Die entwickelte Software wird für die Modellierung im ProMetHeus Projekt verwendet und weiterentwickelt.
\end{german}
