benchmarking: added results for benchmark 4; extended thesis to include the fourth benchmark

2025-05-24 10:55:44 +02:00
parent 7c97213d13
commit 2bbdef6837
6 changed files with 7454 additions and 20 deletions


@ -45,15 +45,17 @@ Typically, newer versions of these components include, among other things, perfo
\subsection{Performance Evaluation Process}
With the hardware and software configuration established, the process of benchmarking the implementations can be described. This process is designed to simulate the load and scenario in which these evaluators will be used. The Nikuradse dataset \parencite{nikuradse_laws_1950} has been chosen as the data source. The dataset models the laws of flow in rough pipes and provides $362$ variable sets, each containing two variables. This dataset was first used by \textcite{guimera_bayesian_2020} to benchmark a symbolic regression algorithm.
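To make the data handling concrete, the following is a minimal sketch of how the dataset could be loaded and arranged for the evaluators. The file name, the use of the CSV.jl and DataFrames.jl packages, and the column-per-variable-set layout are assumptions made for this illustration and are not prescribed by the implementation.
\begin{verbatim}
using CSV, DataFrames

# Hypothetical file name for the Nikuradse data; the dataset provides
# 362 variable sets with two variables each.
df = CSV.read("nikuradse.csv", DataFrame)

# One possible layout: one row per variable, one column per variable set,
# i.e. a 2 x 362 matrix that is handed to the evaluators as a whole.
variable_sets = permutedims(Matrix(df[:, 1:2]))
@assert size(variable_sets) == (2, 362)
\end{verbatim}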
Since only the evaluators are benchmarked, the expressions to be evaluated must already exist. These expressions are generated for the Nikuradse dataset using the exhaustive symbolic regression algorithm proposed by \textcite{bartlett_exhaustive_2024}. This ensures that the expressions are representative of what needs to be evaluated in a real-world application. In total, four benchmarks will be conducted, each with a different goal, as explained in the following paragraphs.
The first benchmark involves a very large set of roughly $250\,000$ expressions. This means that all $250\,000$ expressions are evaluated in a single generation when using GP. In a typical generation, significantly fewer expressions would be evaluated. However, this benchmark is designed to show how the evaluators can handle large volumes of data.
A second benchmark, with slight modifications to the first, is also conducted. Given that GPUs are very good at executing work in parallel, the number of variable sets is increased in this benchmark. The second benchmark therefore consists of the same $250\,000$ expressions, but the number of variable sets has been increased by a factor of $30$ to a total of roughly $10\,000$. This benchmark aims to demonstrate how the GPU is best utilised with a larger number of variable sets. A higher number of variable sets is also more representative of the scenarios in which the evaluators will be employed.
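As a small illustration of the enlarged dataset, the variable-set matrix from the sketch above can simply be replicated; replication is only an assumption for this example, as the additional variable sets could also be obtained differently.
\begin{verbatim}
# Replicating the 2 x 362 matrix 30 times along the columns yields
# 362 * 30 = 10,860 variable sets, i.e. roughly 10,000.
large_variable_sets = repeat(variable_sets, outer=(1, 30))
size(large_variable_sets)  # (2, 10860)
\end{verbatim}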
The third benchmark is conducted to demonstrate how the evaluators perform in more realistic scenarios. For this benchmark, the number of expressions has been reduced to roughly $10\,000$, and the number of variable sets is again $362$. The purpose of this benchmark is to show how the evaluators are likely to perform in a typical scenario.
Finally, a fourth benchmark will be conducted. It combines the roughly $10\,000$ expressions of the third benchmark with the roughly $10\,000$ variable sets of the second benchmark. This benchmark mimics the scenario in which the evaluators will most likely be used. While the other benchmarks simulate different conditions to determine if and where the GPU evaluators can be used efficiently, this benchmark focuses on determining whether the GPU evaluators are suitable for the specific scenario they would be used in.
All four benchmarks also simulate a parameter optimisation step, as this is the scenario in which these evaluators will be used. For parameter optimisation, $100$ steps are used, meaning that all expressions will be evaluated $100$ times. During the benchmark, this process is simulated by re-transmitting the parameters instead of generating new ones. Generating new parameters is not part of the evaluators and is therefore not implemented. However, because the parameters are re-transmitted every time, the overhead of sending the data is taken into account. This overhead is part of the evaluators and is an additional burden that the CPU implementation does not have, making it important to measure.
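A minimal sketch of this simulated parameter optimisation loop is shown below. The function name \texttt{evaluate\_gpu} and the argument names are placeholders for illustration and do not correspond to the actual evaluator API.
\begin{verbatim}
# Sketch of the simulated parameter-optimisation loop (assumed names,
# not the actual evaluator API).
const OPTIMISATION_STEPS = 100

function simulate_parameter_optimisation(expressions, variable_sets, parameters)
    results = nothing
    for _ in 1:OPTIMISATION_STEPS
        # The parameters are re-transmitted on every step, so the transfer
        # overhead of the GPU evaluators is part of the measured time.
        results = evaluate_gpu(expressions, variable_sets, parameters)
    end
    return results
end
\end{verbatim}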
\subsubsection{Measuring Performance}
The performance measurements are taken using the BenchmarkTools.jl\footnote{\url{https://juliaci.github.io/BenchmarkTools.jl/stable/}} package. It is the de facto standard for benchmarking Julia applications, which makes it an obvious choice for measuring the performance of the evaluators.
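To illustrate how such a measurement and comparison might be set up, the following sketch uses BenchmarkTools.jl. The benchmarked functions and their arguments are placeholders, and the chosen sample count and time budget are assumptions made for this example.
\begin{verbatim}
using BenchmarkTools

# Benchmark the simulated parameter-optimisation run of an evaluator
# (placeholder function and argument names). Interpolating the arguments
# with $ avoids measuring global-variable access.
trial_gpu = @benchmark simulate_parameter_optimisation($expressions,
    $variable_sets, $parameters) samples=50 seconds=300
trial_cpu = @benchmark simulate_parameter_optimisation_cpu($expressions,
    $variable_sets, $parameters) samples=50 seconds=300

# Compare the two trials: judge classifies the difference of the minimum
# estimates, ratio gives the relative factor between them.
judge(minimum(trial_gpu), minimum(trial_cpu))
ratio(minimum(trial_gpu), minimum(trial_cpu))
\end{verbatim}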
@ -64,24 +66,12 @@ It offers extensive support for measuring and comparing results of different imp
\label{sec:results}
This section presents the results of the benchmarks described above. First, the results for the GPU-based interpreter will be presented alongside its performance tuning process. This is followed by the results and performance tuning process of the transpiler. Finally, both GPU-based evaluators will be compared with each other to determine which of them performs best. Additionally, they will be compared against the CPU-based interpreter to answer the research questions of this thesis.
%BECAUSE OF RAM CONSTRAINTS, CACHING IS NOT USED TO THE FULL EXTEND AS IN CONTRAST TO HOW IT IS EXPLAINED IN THE IMPLEMENTATION CHAPTER. I hope I can cache the frontend. If only the finished kernels can not be cached, move this explanation to the transpiler section below and update the reference in subsubsection "System Memory"
% TODO: Do one run with
% - 250k expressions
% - increase variables to be 10 times as large (nr. of varsets should be 362 * 10)
% - compare CPU with interpreter (probably also transpiler, but only to see if it takes even longer, or roughly the same considering that resources are still available on the GPU)
% - This should demonstrate that bigger varsets lead to better performance (although I kinda doubt for the interpreter considering that the hardware is already fully utilised)
% TODO: Do another run with
% - 10 000 expressions choose the file that is closest to these 10k
% - nr. var sets stays the same
% - compare CPU, interpreter and transpiler
% - do a second run with kernel compilation being performed before parameter optimisation step (as 10 000 expressions shouldn't fill up the memory as much)
% - depending on how much time I have, also do a run with 10 times as much var sets (if this is done, adapt the above subsection "Performance Evaluation Process")
\subsection{Interpreter}
% Results only for Interpreter (also contains final kernel configuration and probably quick overview/recap of the implementation used and described in Implementation section)
In this section, the results for the interpreter are presented in detail. ...
\subsubsection{Benchmark 1}
\begin{figure}
\centering
@ -94,6 +84,8 @@ In this section, the results for the interpreter are presented...
\subsubsection{Benchmark 3}
\subsubsection{Benchmark 4}
\subsubsection{Performance Tuning} % either subsubSection or change the title to "Performance Tuning Interpreter"
Document the process of performance tuning
@ -116,6 +108,8 @@ Results only for Transpiler (also contains final kernel configuration and probab
\subsubsection{Benchmark 3}
\subsubsection{Benchmark 4}
\subsection{Performance Tuning}
Document the process of performance tuning
@ -137,4 +131,6 @@ talk about that compute portion is just too little. Only more complex expression
\subsubsection{Benchmark 2}
The CPU-based interpreter did not finish this benchmark due to RAM constraints.
\subsubsection{Benchmark 3}
\subsubsection{Benchmark 4}
