evaluation: started documenting results of evaluations

This commit is contained in:
2025-05-22 13:02:11 +02:00
parent 5bada5ffcb
commit 2cab6e0698
4 changed files with 33 additions and 20 deletions


@@ -7,20 +7,20 @@ This thesis aims to determine whether one of the two GPU evaluators is faster th
In this section, the benchmark environment used to evaluate the performance is outlined. To ensure the validity and reliability of the results, it is necessary to specify the details of the environment. This includes a description of the hardware and software configuration as well as the performance evaluation process. With this, the variance between the results is minimised, which allows for better reproducibility and comparability between the implementations.
\subsection{Hardware Configuration}
The hardware configuration is the most important aspect of the benchmark environment. The capabilities of both the CPU and GPU can have a significant impact on the resulting performance. The following sections outline the importance of the individual components as well as the actual hardware used for the benchmarks.
\subsubsection{GPU}
The GPU plays a crucial role, as different microarchitectures typically require different optimisations. Although the evaluators can generally operate on any Nvidia GPU with a compute capability of at least 6.1, they are tuned for the Ampere microarchitecture, which has a compute capability of 8.6. More recent microarchitectures can be used as well, but additional tuning is required to ensure that the evaluators utilise the hardware to its fullest potential.
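To check which microarchitecture a given card provides, the compute capability can be queried through CUDA.jl. The snippet below is a minimal sketch, assuming CUDA.jl is available and the benchmarking GPU is the active device:
\begin{verbatim}
using CUDA

# Query the active device and its compute capability to check whether
# the Ampere-specific tuning (compute capability 8.6) applies.
dev = CUDA.device()
cap = CUDA.capability(dev)   # returns a VersionNumber, e.g. v"8.6.0"

println("Device: ", CUDA.name(dev), ", compute capability: ", cap)

if cap < v"6.1"
    @warn "Compute capability below 6.1; the evaluators are not supported."
elseif cap != v"8.6"
    @info "Evaluators are tuned for Ampere (8.6); further tuning may help."
end
\end{verbatim}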
Tuning must also be done on a per-problem basis. In particular, the number of variable sets can impact how well the hardware is utilised. Therefore, it is crucial to determine which configuration yields the best performance. Section \ref{sec:results} outlines a strategy for tuning the configuration to a new problem.
\subsubsection{CPU}
Although the GPU plays a crucial role, work is also carried out on the CPU. The interpreter primarily utilises the CPU for data transfer and the pre-processing step, making it more GPU-bound as most of the work is performed on the GPU. However, the transpiler additionally relies on the CPU to perform the transpilation step. This step involves generating a kernel for each expression and sending these kernels to the driver for compilation, a process also handled by the CPU. By contrast, the interpreter only requires a single kernel, which is converted into PTX and compiled by the driver once. Consequently, the transpiler is significantly more CPU-bound and variations in the CPU used have a much greater impact. Therefore, using a more powerful CPU benefits the transpiler more than the interpreter.
\subsubsection{System Memory}
In addition to the hardware configuration of the GPU and CPU, system memory (RAM) also plays a crucial role. Although RAM does not directly contribute to the overall performance, it can have a noticeable indirect impact due to its role in caching and general data storage. Insufficient RAM forces the operating system to use the page file, which is stored on a considerably slower SSD. This leads to slower data access, thereby reducing the overall performance of the application.
As seen in the list below, only 16 GB of RAM were available during the benchmarking process. This amount is insufficient to utilise caching to the extent outlined in Chapter \ref{cha:implementation}. Additional RAM was not available, meaning caching had to be disabled, which will be further explained in Section \ref{sec:results}.
\subsubsection{Hardware}
With the requirements explained above in mind, the following hardware is used to perform the benchmarks for the CPU-based evaluator, which was used as the baseline, as well as for the GPU-based evaluators:
@@ -32,28 +32,28 @@ With the requirements explained above in mind, the following hardware is used to
\subsection{Software Configuration}
Apart from the hardware, the performance of the evaluators can also be significantly affected by the software. The following three software components or libraries primarily influence the performance of the evaluators:
\begin{itemize}
\item GPU Driver
\item Julia
\item CUDA.jl
\end{itemize}
Typically, newer versions of these components include, among other things, performance improvements. It is therefore important to specify the versions used for benchmarking: the GPU driver has version \emph{561.17}, Julia has version \emph{1.11.5}, and CUDA.jl has version \emph{5.8.1}. As with the hardware configuration, this ensures that the results are reproducible and comparable to each other.
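The following sketch shows one way such version information can be recorded from a Julia session (assuming CUDA.jl is installed in the active environment); it is illustrative rather than part of the benchmark code:
\begin{verbatim}
using InteractiveUtils, Pkg
using CUDA

versioninfo()        # Julia version and platform details
CUDA.versioninfo()   # CUDA toolkit and GPU driver versions
Pkg.status("CUDA")   # version of the CUDA.jl package
\end{verbatim}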
\subsection{Performance Evaluation Process}
With the hardware and software configuration established, the process of benchmarking the implementations can be described. This process is designed to simulate the load and scenario in which these evaluators will be used. The Nikuradse dataset \parencite{nikuradse_laws_1950} has been chosen as the data source. The dataset models the laws of flow in rough pipes and provides $362$ variable sets, each containing two variables. It was first used by \textcite{guimera_bayesian_2020} to benchmark a symbolic regression algorithm.
Since only the evaluators are benchmarked, the expressions to be evaluated must already exist. These expressions are generated for the Nikuradse dataset using the exhaustive symbolic regression algorithm proposed by \textcite{bartlett_exhaustive_2024}. This ensures that the expressions are representative of what needs to be evaluated in a real-world application. In total, three benchmarks will be conducted, each having a different goal, which will be further explained in the following paragraphs.
The first benchmark involves a very large set of roughly $250\,000$ expressions. This means that all $250\,000$ expressions are evaluated in a single generation when using GP. In a typical generation, significantly fewer expressions would be evaluated. However, this benchmark is designed to show how the evaluators can handle large volumes of data. Evaluating such a high number of expressions also has some drawbacks, as will be explained in Section \ref{sec:results}.
A second benchmark, with slight modifications to the first, is also conducted. Given that GPUs are very good at executing work in parallel, the number of variable sets is increased in this benchmark. Therefore, the second benchmark consists of the same $250\,000$ expressions, but the number of variable sets has been increased by a factor of ten to a total of $3\,620$. This benchmark aims to demonstrate how the GPU is best used for large datasets, which is also more representative of the scenarios where the evaluators will be employed.
Finally, a third benchmark will be performed to mimic a realistic load. For this benchmark, the number of expressions has been reduced to roughly $10\,000$, and the number of variable sets is again $362$. The purpose of this benchmark is to demonstrate how the evaluators are likely to perform in a typical scenario.
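For reference, the three benchmark configurations described above are summarised in the following table:
\begin{table}
\centering
\begin{tabular}{lrr}
\hline
 & Expressions & Variable sets \\
\hline
Benchmark 1 & $\sim 250\,000$ & $362$ \\
Benchmark 2 & $\sim 250\,000$ & $3\,620$ \\
Benchmark 3 & $\sim 10\,000$ & $362$ \\
\hline
\end{tabular}
\caption{Overview of the three benchmark configurations.}
\label{tab:benchmark_overview}
\end{table}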
All three benchmarks also simulate a parameter optimisation step, as this is the scenario in which these evaluators will be used. For parameter optimisation, $100$ steps are used, meaning that all expressions will be evaluated $100$ times. During the benchmark, this process is simulated by re-transmitting the parameters instead of generating new ones, since generating new parameters is not part of the evaluators and is therefore not implemented. Because the parameters are re-transmitted every time, the overhead of sending the data is taken into account. This overhead is part of the evaluators and an additional burden that the CPU implementation does not have, which makes it important to measure.
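The simulated optimisation loop therefore has roughly the following shape; this is a minimal sketch with hypothetical function names (\texttt{send\_parameters!} and \texttt{evaluate!}), not the actual evaluator interface:
\begin{verbatim}
# Sketch of the simulated parameter optimisation (hypothetical API):
# the same parameters are re-transmitted in every step, so only the
# transfer overhead is measured, not the generation of new parameters.
const STEPS = 100

function simulate_optimisation(evaluator, exprs, varsets, params)
    for _ in 1:STEPS
        send_parameters!(evaluator, params)  # re-transmit unchanged parameters
        evaluate!(evaluator, exprs, varsets) # evaluate all expressions once
    end
end
\end{verbatim}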
\subsubsection{Measuring Performance}
The performance measurements are taken using the BenchmarkTools.jl\footnote{\url{https://juliaci.github.io/BenchmarkTools.jl/stable/}} package. It is the standard for benchmarking applications in Julia, which makes it an obvious choice for measuring the performance of the evaluators.
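A typical measurement then looks as follows; \texttt{evaluate\_gpu} and the interpolated inputs (\texttt{exprs}, \texttt{varsets}, \texttt{params}) are hypothetical placeholders for the evaluator under test and its pre-prepared data, and the sample count is illustrative:
\begin{verbatim}
using BenchmarkTools

# Inputs are interpolated with $ so that preparing them is not part of
# the measured time; samples controls how often the run is repeated.
result = @benchmark evaluate_gpu($exprs, $varsets, $params) samples=50

display(result)   # reports minimum, median and mean times plus memory estimate
\end{verbatim}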
@@ -62,26 +62,39 @@ It offers extensive support for measuring and comparing results of different imp
\section{Results}
\label{sec:results}
This section presents the results of the benchmarks described above. First, the results for the GPU-based interpreter will be presented alongside its performance tuning process. This is followed by the results and the performance tuning process of the transpiler. Finally, both GPU-based evaluators will be compared with each other to determine which of them performs best. Additionally, these evaluators will be compared against the CPU-based interpreter to answer the research questions of this thesis.
BECAUSE OF RAM CONSTRAINTS, CACHING IS NOT USED TO THE FULL EXTENT DESCRIBED IN THE IMPLEMENTATION CHAPTER. I hope I can cache the frontend. If only the finished kernels cannot be cached, move this explanation to the transpiler section below and update the reference in subsubsection "System Memory"
% TODO: Do one run with
% - 250k expressions
% - increase variables to be 10 times as large (nr. of varsets should be 362 * 10)
% - compare CPU with interpreter (probably also transpiler, but only to see if it takes even longer, or roughly the same considering that resources are still available on the GPU)
% - This should demonstrate that bigger varsets lead to better performance (although I kinda doubt for the interpreter considering that the hardware is already fully utilised)
% TODO: Do another run with
% - 10 000 expressions choose the file that is closest to these 10k
% - nr. var sets stays the same
% - compare CPU, interpreter and transpiler
% - do a second run with kernel compilation being performed before parameter optimisation step (as 10 000 expressions shouldn't fill up the memory as much)
% - depending on how much time I have, also do a run with 10 times as much var sets (if this is done, adapt the above subsection "Performance Evaluation Process")
\subsection{Interpreter}
% Results only for Interpreter (also contains final kernel configuration and probably quick overview/recap of the implementation used and described in Implementation section)
In this section, the results for the interpreter are presented...
\subsubsection{Benchmark 1}
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/gpu-interpreter-final-performance-benchmark1.png}
\caption{The results of the GPU-based interpreter for benchmark 1}
\label{fig:gpu_i_benchmark_1}
\end{figure}
\subsubsection{Benchmark 2}
\subsubsection{Benchmark 3}
\subsubsection{Performance Tuning} % either subsubSection or change the title to "Performance Tuning Interpreter"
Document the process of performance tuning
Initial: no cache; 256 blocksize; exprs pre-processed and sent to GPU on every call; vars sent on every call; frontend + dispatch are multithreaded
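To illustrate, the initial block size listed above corresponds to a CUDA.jl launch configuration along the following lines; \texttt{interpreter\_kernel} and its device-side arguments are hypothetical placeholders, not the actual kernel from the implementation:
\begin{verbatim}
using CUDA

# Initial launch configuration: 256 threads per block and enough blocks
# to cover all variable sets (kernel and argument names are placeholders).
threads_per_block = 256
n_varsets = 362
blocks = cld(n_varsets, threads_per_block)

@cuda threads=threads_per_block blocks=blocks interpreter_kernel(
    d_exprs, d_varsets, d_params, d_results)
\end{verbatim}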