thesis: implemented most feedback

2025-06-28 17:44:45 +02:00
parent f25919dc06
commit 5e42668e1a
31 changed files with 694 additions and 669 deletions


@ -12,7 +12,7 @@ The hardware configuration is the most important aspect of the benchmark environ
\subsubsection{GPU}
The GPU plays a crucial role, as different microarchitectures typically operate differently and therefore require different performance tuning. Although the evaluators can generally operate on any Nvidia GPU with a compute capability of at least 6.1, they are tuned for the Ampere microarchitecture which has a compute capability of 8.6. Despite the evaluators being tuned for this microarchitecture, more recent microarchitectures can be used as well. However, additional tuning is required to ensure that the evaluators can utilise the hardware to its fullest potential.
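Whether a given GPU satisfies this requirement can be checked programmatically. The following is a minimal sketch using CUDA.jl; the messages are illustrative only.
\begin{verbatim}
using CUDA

# Query the compute capability of the active GPU.
cap = CUDA.capability(CUDA.device())   # e.g. v"8.6" on Ampere

@assert cap >= v"6.1" "compute capability $cap is below the required 6.1"
if cap != v"8.6"
    @info "Evaluators are tuned for Ampere (8.6); re-tuning advised" cap
end
\end{verbatim}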
Tuning must also be done on a per-problem basis. In particular, the number of variable sets impact how well the hardware is utilised. Therefore, it is crucial to determine which configuration yields the best performance. Section \ref{sec:results} outlines steps to tune the configuration for a specific problem.
Tuning must also be done on a per-problem basis. In particular, the number of data points impacts how well the hardware is utilised. Therefore, it is crucial to determine which configuration yields the best performance. Section \ref{sec:results} outlines the steps to tune the configuration for a specific problem.
\subsubsection{CPU}
Although the GPU plays a crucial role, work is also carried out on the CPU. The interpreter primarily utilises the CPU for the frontend and data transfer, making it more GPU-bound as most of the work is performed on the GPU. However, the transpiler additionally relies on the CPU to perform the transpilation step. This step involves generating a kernel for each expression and sending these kernels to the driver for compilation, a process also handled by the CPU. By contrast, the interpreter only requires one kernel, which needs to be converted into PTX and compiled by the driver only once. Consequently, the transpiler is significantly more CPU-bound, and variations in the CPU used have a much greater impact. Therefore, using a more powerful CPU benefits the transpiler more than the interpreter.
@ -43,15 +43,15 @@ Typically, newer versions of these components include, among other things, perfo
\subsection{Performance Evaluation Process}
Now that the hardware and software configurations have been established, the benchmarking process can be defined. This process is designed to simulate the load and scenario in which these evaluators will be used. The Nikuradse dataset \parencite{nikuradse_laws_1950} has been chosen as the data source. The dataset models the laws of flow in rough pipes and provides $362$ variable sets, each set containing two variables. This dataset has first been used by \textcite{guimera_bayesian_2020} to benchmark a symbolic regression algorithm.
Now that the hardware and software configurations have been established, the benchmarking process can be defined. This process is designed to simulate the load and scenario in which these evaluators will be used. The Nikuradse dataset \parencite{nikuradse_laws_1950} has been chosen as the data source. The dataset models the laws of flow in rough pipes and provides $362$ data points, each containing two variables. This dataset was first used by \textcite{guimera_bayesian_2020} to benchmark a symbolic regression algorithm.
Since only the evaluators are benchmarked, the expressions to be evaluated must already exist. These expressions are generated for the Nikuradse dataset using the exhaustive symbolic regression algorithm proposed by \textcite{bartlett_exhaustive_2024}. This ensures that the expressions are representative of what needs to be evaluated in a real-world application. In total, three benchmarks will be conducted, each having a different goal, which will be further explained in the following paragraphs.
The first benchmark involves a very large set of roughly $250\,000$ expressions with $362$ variable sets. This means that when using GP all $250\,000$ expressions would be evaluated in a single generation. In a typical generation, significantly fewer expressions would be evaluated. However, this benchmark is designed to show how the evaluators can handle very large volumes of data. Because of memory constraints, it was not possible to conduct an additional benchmark with a higher number of variable sets.
The first benchmark involves a very large set of roughly $250\,000$ expressions with $362$ data points. This means that when using GP, all $250\,000$ expressions would be evaluated in a single generation. In a typical generation, significantly fewer expressions would be evaluated. However, this benchmark is designed to show how the evaluators handle very large volumes of data. Because of memory constraints, it was not possible to conduct an additional benchmark with a higher number of data points.
Both the second and third benchmarks are conducted to demonstrate how the evaluators will perform in more realistic scenarios. For the second benchmark the number of expressions has been reduced to roughly $10\,000$, and the number of variable sets is again $362$. The number of expressions is much more representative to a typical scenario, while the number of variable sets is still low. To determine if the GPU evaluators are a feasible alternative in scenarios with a realistic number of expressions but comparably few variable sets, this benchmark is conducted nonetheless.
Both the second and third benchmarks are conducted to demonstrate how the evaluators perform in more realistic scenarios. For the second benchmark, the number of expressions has been reduced to roughly $10\,000$, and the number of data points is again $362$. The number of expressions is much more representative of a typical scenario, while the number of data points is still low. To determine whether the GPU evaluators are a feasible alternative in scenarios with a realistic number of expressions but comparably few data points, this benchmark is conducted nonetheless.
Finally, a third benchmark will be conducted. Similar to the second benchmark, this benchmark evaluates the same roughly $10\,000$ expressions but now with $30$ times more variable sets, which equates to roughly $10\,000$. This benchmark mimics the scenario where the evaluators will most likely be used. While the others simulate different conditions to determine if and where the GPU evaluators can be used efficiently, this benchmark is more focused on determining if the GPU evaluators are suitable for the specific scenario they are likely going to be used in.
Finally, a third benchmark will be conducted. Similar to the second benchmark, it evaluates the same roughly $10\,000$ expressions, but now with $30$ times more data points, which equates to $10\,860$. This benchmark mimics the scenario in which the evaluators will most likely be used. While the others simulate different conditions to determine if and where the GPU evaluators can be used efficiently, this benchmark is more focused on determining whether the GPU evaluators are suitable for the specific scenario they are likely going to be used in.
All three benchmarks also simulate a parameter optimisation step, as this is the intended use-case for these evaluators. For parameter optimisation, $100$ steps are used, meaning that all expressions are evaluated $100$ times. During the benchmark, this process is simulated by re-transmitting the parameters instead of generating new ones. Generating new parameters is not part of the evaluators and is therefore not implemented. However, because the parameters are re-transmitted each time, the overhead of sending the data is taken into account. This overhead is part of the evaluators and represents an additional burden that the CPU implementation does not have, making it important to measure.
@ -62,13 +62,13 @@ It offers extensive support for measuring and comparing results of different imp
\section{Results}
\label{sec:results}
This section presents the results of the benchmarks described above. First the results for the GPU-based interpreter will be presented alongside the performance tuning process. This is followed by the results of the transpiler as well as the performance tuning process. Finally, both GPU-based evaluators will be compared with each other to determine which of them performs the best. Additionally, these evaluators will be compared against the CPU-based interpreter to answer the research questions of this thesis.
This section presents the results of the benchmarks described above. First, the results for the GPU-based interpreter and the GPU transpiler are presented in isolation, alongside their respective performance tuning processes. Both GPU-based evaluators are then compared with each other to determine which performs best. Additionally, these evaluators are compared against the CPU-based interpreter to answer the research questions of this thesis.
\subsection{Interpreter}
In this section, the results for the GPU-based interpreter are presented in detail. Following the benchmark results, the process of tuning the interpreter is described, as well as how the tuning was adapted for the different benchmarks. This part covers not only the tuning of the GPU, but also performance improvements made on the CPU side.
\subsubsection{Benchmark 1}
The first benchmark consists of $250\,000$ expressions and $362$ variable sets with $100$ parameter optimisation steps. Because each expression needs to be evaluated with each variable set for each parameter optimisation step, a total of $250\,000 * 362 * 100 \approx 9.05\,\textit{billion}$ evaluations have been performed per sample. In Figure \ref{fig:gpu_i_benchmark_1} the result over all $50$ samples is presented. The median value across all samples is $466.3$ seconds with a standard deviation of $14.2$ seconds.
The first benchmark consists of $250\,000$ expressions and $362$ data points with $100$ parameter optimisation steps. Because each expression needs to be evaluated with each data point for each parameter optimisation step, a total of $250\,000 \cdot 362 \cdot 100 \approx 9.05\,\text{billion}$ evaluations have been performed per sample. In Figure \ref{fig:gpu_i_benchmark_1} the result over all $50$ samples is presented. The median value across all samples is $466.3$ seconds with a standard deviation of $14.2$ seconds.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/gpu-interpreter-final-performance-benchmark1.png}
@ -79,7 +79,7 @@ The first benchmark consists of $250\,000$ expressions and $362$ variable sets w
For the kernel configuration, a block size of $128$ threads has been used. As will be explained below, this configuration has been found to yield the best performance. During the benchmark, the utilisation of both the CPU and GPU was roughly $100\%$.
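For illustration, launching the interpreter kernel with this configuration could look as follows in CUDA.jl. This is a minimal sketch; \verb|interpreter_kernel!| and the argument names are hypothetical stand-ins for the actual implementation.
\begin{verbatim}
using CUDA

n_points = 362                      # one thread per data point
threads  = 128                      # block size found during tuning
blocks   = cld(n_points, threads)   # ceiling division: 3 blocks

# interpreter_kernel! stands in for the real interpreter kernel.
@cuda threads=threads blocks=blocks interpreter_kernel!(
    d_results, d_expressions, d_variables, d_parameters)
\end{verbatim}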
\subsubsection{Benchmark 2}
With $10\,000$ expressions, $362$ variable sets and $100$ parameter optimisation steps, the total number of evaluations per sample was $362\,\textit{million}$. The median across all samples is $21.3$ seconds with a standard deviation of $0.75$ seconds. Compared to the first benchmark, there were $25$ times fewer evaluations which also resulted in a reduction of the median and standard deviation of roughly $25$ times. This indicates a roughly linear correlation between the number of expressions and the runtime. Since the number of variable sets did not change, the block size for this benchmark remained at $128$ threads. Again the utilisation of the CPU and GPU during the benchmark was roughly $100\%$.
With $10\,000$ expressions, $362$ data points and $100$ parameter optimisation steps, the total number of evaluations per sample was $362\,\text{million}$. The median across all samples is $21.3$ seconds with a standard deviation of $0.75$ seconds. Compared to the first benchmark, there were $25$ times fewer evaluations, which also resulted in a roughly $25$-fold reduction of the median and the standard deviation. This indicates a roughly linear correlation between the number of expressions and the runtime. Since the number of data points did not change, the block size for this benchmark remained at $128$ threads. Again, the utilisation of the CPU and GPU during the benchmark was roughly $100\%$.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/gpu-interpreter-final-performance-benchmark2.png}
@ -88,7 +88,7 @@ With $10\,000$ expressions, $362$ variable sets and $100$ parameter optimisation
\end{figure}
\subsubsection{Benchmark 3}
The third benchmark used the same $10\,000$ expressions and $100$ parameter optimisation steps. However, now there are $30$ times more variable sets that need to be used for evaluation. This means, that the total number of evaluations per sample is now $10.86\,\textit{billion}$. Compared to the first benchmark, an additional $1.8\,\textit{billion}$ evaluations were performed. However, as seen in Figure \ref{fig:gpu_i_benchmark_3}, the execution time was significantly faster. With a median of $30.3$ seconds and a standard deviation of $0.45$ seconds, this benchmark was only marginally slower than the second benchmark. This also indicates, that the GPU evaluators are much more suited for scenarios, where there is a high number of variable sets.
The third benchmark used the same $10\,000$ expressions and $100$ parameter optimisation steps. However, now there are $30$ times more data points that need to be used for evaluation. This means that the total number of evaluations per sample is now $10.86\,\text{billion}$. Compared to the first benchmark, an additional $1.8\,\text{billion}$ evaluations were performed. However, as seen in Figure \ref{fig:gpu_i_benchmark_3}, the execution time was significantly faster. With a median of $30.3$ seconds and a standard deviation of $0.45$ seconds, this benchmark was only marginally slower than the second benchmark. This also indicates that the GPU evaluators are much better suited for scenarios where there is a high number of data points.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/gpu-interpreter-final-performance-benchmark3.png}
@ -96,7 +96,7 @@ The third benchmark used the same $10\,000$ expressions and $100$ parameter opti
\label{fig:gpu_i_benchmark_3}
\end{figure}
Although the number of variable sets has been increased by $30$ times, the block size remained at $128$ threads. Unlike the previous benchmarks, the hardware utilisation was different. Now only the GPU was utilised to 100\% while the CPU utilisation started at 100\% and slowly dropped to 80\%. The GPU needs to perform $30$ times more evaluations per expression, meaning it takes longer for one kernel dispatch to be finished. At the same time, the CPU tries to dispatch the kernel at the same rate as before. Because only a certain number of kernels can be dispatched at once, the CPU needs to wait for the GPU to finish a kernel before another one can be dispatched. Therefore, in this scenario, the evaluator runs into a GPU-bottleneck and using a more performant GPU would consequently improve the runtime. In the previous benchmarks, both the CPU and GPU would need to be upgraded, to achieve better performance.
Although the number of data points has been increased by $30$ times, the block size remained at $128$ threads. Unlike the previous benchmarks, the hardware utilisation was different: only the GPU was utilised at 100\%, while the CPU utilisation started at 100\% and slowly dropped to 80\%. The GPU needs to perform $30$ times more evaluations per expression, meaning it takes longer for one kernel dispatch to finish. At the same time, the CPU tries to dispatch the kernels at the same rate as before. Because only a certain number of kernels can be dispatched at once, the CPU needs to wait for the GPU to finish a kernel before another one can be dispatched. Therefore, in this scenario, the evaluator runs into a GPU bottleneck, and using a more performant GPU would consequently improve the runtime. In the previous benchmarks, both the CPU and GPU would need to be upgraded to achieve better performance.
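The dispatch pattern described here can be pictured with CUDA.jl's task-based concurrency, where each Julia task submits work on its own stream. The following is a sketch assuming a collection \verb|kernels| of precompiled kernel objects, one per expression; all names are hypothetical.
\begin{verbatim}
using CUDA

# One launch per expression; launches overlap until the GPU's
# resident-grid limit is reached, after which the CPU must wait.
@sync for kernel in kernels
    Threads.@spawn kernel(d_variables, d_parameters;
                          threads = 128, blocks = cld(n_points, 128))
end
CUDA.synchronize()   # drain any remaining GPU work
\end{verbatim}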
\subsection{Performance Tuning Interpreter}
@ -105,15 +105,15 @@ Optimising and tuning the interpreter is crucial to achieve good performance. Es
With this implementation, the initial performance measurements have been conducted for the first benchmark, which served as the baseline for further performance optimisations. However, as already mentioned, during this benchmark memory limitations were encountered, as too much RAM was being used. Therefore, the caching had to be disabled. Because the evaluator is multithreaded, this change resulted in significantly better performance. As the cache introduced critical sections where race conditions could occur, locking mechanisms were required. While locking ensures that no race conditions occur, it also means that parts of an otherwise entirely parallel implementation are now serialised, reducing the effect of parallelisation.
Without a cache and utilising all 12 threads, the frontend achieved very good performance. Processing $250\,000$ expressions takes roughly $88.5$ milliseconds. On the other hand, using a cache, resulted in the frontend running for $6.9$ \textit{seconds}. This equates to a speed-up of roughly 78 times when using no cache. Additionally, when looking at the benchmark results above, the time it takes to execute the frontend is negligible, meaning further optimising the frontend would not significantly improve the overall runtime.
Without a cache and utilising all 12 threads, the frontend achieved very good performance. Processing $250\,000$ expressions takes roughly $88.5$ milliseconds. Using a cache, on the other hand, resulted in the frontend running for $6.9$ seconds. This equates to a speed-up of roughly 78 times when using no cache. Additionally, when looking at the benchmark results above, the time it takes to execute the frontend is negligible, meaning further optimising the frontend would not significantly improve the overall runtime.
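The serialising effect of the cache can be sketched as follows; \verb|process_expression| is a hypothetical stand-in for the actual frontend transformation.
\begin{verbatim}
# Cached variant: every thread funnels through one lock,
# serialising parts of the otherwise fully parallel loop.
const cache = Dict{Expr,Any}()
const cache_lock = ReentrantLock()

function frontend_cached(expressions)
    out = Vector{Any}(undef, length(expressions))
    Threads.@threads for i in eachindex(expressions)
        out[i] = lock(cache_lock) do
            get!(() -> process_expression(expressions[i]),
                 cache, expressions[i])
        end
    end
    return out
end

# Cache-free variant: no shared mutable state, fully parallel.
function frontend_uncached(expressions)
    out = Vector{Any}(undef, length(expressions))
    Threads.@threads for i in eachindex(expressions)
        out[i] = process_expression(expressions[i])
    end
    return out
end
\end{verbatim}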
During the tuning process $362$ variable sets have been used, which is the number of variable sets used by benchmark one and two. Before conducting benchmark three, additional performance tuning has been performed to ensure that this benchmark also utilises the hardware as much as possible.
During the tuning process, $362$ data points have been used, which is the number of data points used by benchmarks one and two. Before conducting benchmark three, additional performance tuning has been performed to ensure that this benchmark also utilises the hardware as much as possible.
\subsubsection{Optimisation 1}
After caching had been disabled, the first performance improvement was to drastically reduce the number of calls to the frontend and the number of data transfers to the GPU. Because the expressions and variables never change during the parameter optimisation process, processing the expressions and transmitting the data to the GPU on each step wastes resources. Therefore, the expressions are sent to the frontend once before the parameter optimisation process. Afterwards, the processed expressions as well as the variables are transferred to the GPU exactly once for this execution of the interpreter.
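In rough pseudocode the change looks like this; \verb|process_expressions| and \verb|interpret_gpu| are hypothetical stand-ins for the frontend and the evaluator entry point.
\begin{verbatim}
using CUDA

# Before: frontend and transfers repeated on every optimisation step.
for _ in 1:steps
    d_exprs  = CuArray(process_expressions(expressions))
    d_vars   = CuArray(variables)
    d_params = CuArray(parameters)
    interpret_gpu(d_exprs, d_vars, d_params)
end

# After: everything that never changes is hoisted out of the loop;
# only the parameters are re-transmitted each step.
d_exprs = CuArray(process_expressions(expressions))
d_vars  = CuArray(variables)
for _ in 1:steps
    d_params = CuArray(parameters)
    interpret_gpu(d_exprs, d_vars, d_params)
end
\end{verbatim}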
Figure \ref{fig:gpu_i_optimisation_1} shows how this optimisation improved the overall performance as demonstrated with benchmark one. However, it can also be seen that the range the individual samples fall within is much greater now. While in all cases, this optimisation improved the performance, in some cases the difference between the initial and the optimised version is very low with roughly a two-second improvement.
Figure \ref{fig:gpu_i_optimisation_1} shows how this optimisation improved the overall performance, as demonstrated with benchmark one. However, it can also be seen that the range the individual samples fall within is much greater now. While this optimisation improved the performance in all cases, in some cases the difference between the initial and the optimised version is very small, at roughly two seconds. On median, the performance improvement was roughly five percent.
\begin{figure}
\centering
@ -128,7 +128,7 @@ The second optimisation was concerned with tuning the kernel configuration. Usin
Since the evaluator is designed to execute many kernel dispatches in parallel, it was important to reduce the kernel runtime. Reducing the runtime per kernel has a knock-on effect, as the following kernel dispatches can begin execution sooner, reducing the overall runtime.
After the evaluator tuning has been concluded, it was found that a block size of $128$ yielded the best results. With this kernel configuration, another performance measurement has been conducted with the results shown in Figure \ref{fig:gpu_i_optimisation_2} using benchmark one. As can be seen, the overall runtime again was noticeably faster. However, the standard deviation also drastically increased, with the duration from the fastest to the slowest sample differing by roughly 60 seconds.
After the evaluator tuning has been concluded, it was found that a block size of $128$ yielded the best results. With this kernel configuration, another performance measurement has been conducted, with the results shown in Figure \ref{fig:gpu_i_optimisation_2} using benchmark one. As can be seen, the overall runtime again was noticeably faster, albeit an improvement of only roughly six percent. However, the standard deviation also drastically increased, with the duration from the fastest to the slowest sample differing by roughly 60 seconds.
\begin{figure}
\centering
@ -139,7 +139,7 @@ After the evaluator tuning has been concluded, it was found that a block size of
The found block size of $128$ might seem strange. However, it makes sense, as in total at least $362$ threads need to be started to evaluate one expression. If one block contains $128$ threads, a total of $\lceil 362 / 128 \rceil = 3$ blocks need to be started, totalling $384$ threads. As a result, only $384 - 362 = 22$ threads are excess threads. When choosing a block size of $121$, three blocks could also be started, totalling one excess thread. However, there is no performance difference between a block size of $121$ and $128$. Since all threads are executed inside a warp, which consists of exactly $32$ threads, a block size that is not divisible by $32$ has no benefit and only hides the true number of excess threads started.
Benchmark three had a total of $10\,860$ variable sets, meaning at least this number of threads must be started. To ensure optimal hardware utilisation, the evaluator had to undergo another tuning process. As seen above, it is beneficial to start as little excess threads as possible. By utilising NSight Compute, a performance measurement with a block size of $128$ was used as the initial configuration. This already performed well as again very little excess threads are started. In total $10\,860 / 128 \approx 84.84$ blocks are needed, which must be round up to $85$ blocks with the last block being filled by roughly $84\%$ which equates to $20$ excess threads being started.
Benchmark three had a total of $10\,860$ data points, meaning at least this number of threads must be started. To ensure optimal hardware utilisation, the evaluator had to undergo another tuning process. As seen above, it is beneficial to start as few excess threads as possible. Utilising NSight Compute, a performance measurement with a block size of $128$ was used as the initial configuration. This already performed well, as again very few excess threads are started. In total, $\lceil 10\,860 / 128 \rceil = 85$ blocks are needed, with the last block being filled to roughly $84\%$, which equates to $20$ excess threads being started.
This was repeated for two more configurations, once for a block size of $160$ and once for $192$. With a block size of $160$, the total number of blocks was reduced to $68$, which again resulted in $20$ excess threads being started. The hypothesis behind increasing the block size was that using fewer blocks would result in better utilisation and therefore better performance. The same idea was also behind choosing a block size of $192$. However, while this only required $57$ blocks, the number of excess threads increased to $84$.
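These numbers follow directly from ceiling division; a small helper reproduces all of them:
\begin{verbatim}
# Threads started beyond the required n for a given block size
# (cld is Julia's ceiling division).
excess_threads(n, blocksize) = cld(n, blocksize) * blocksize - n

excess_threads(362, 128)      # 3 blocks  -> 22 excess threads
excess_threads(362, 121)      # 3 blocks  ->  1 excess thread
excess_threads(10_860, 128)   # 85 blocks -> 20 excess threads
excess_threads(10_860, 160)   # 68 blocks -> 20 excess threads
excess_threads(10_860, 192)   # 57 blocks -> 84 excess threads
\end{verbatim}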
@ -159,7 +159,7 @@ The first optimisation was to reduce the stack size of the interpreter from 25 t
During the parameter optimisation step, a lot of memory operations were performed. These are required, as for each step new memory on the GPU must be allocated for both the parameters and the meta information. The documentation of CUDA.jl\footnote{\url{https://cuda.juliagpu.org/stable/usage/memory/\#Avoiding-GC-pressure}} mentions that this can lead to higher garbage-collector (GC) pressure, increasing the time spent garbage-collecting. To reduce this, CUDA.jl provides the \verb|CUDA.unsafe_free!(::CuArray)| function. This frees the memory on the GPU without requiring the Julia GC to run, thereby spending fewer resources on garbage collection and more on evaluating the expressions.
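A minimal sketch of its use inside the optimisation loop; the evaluator call and variable names are hypothetical.
\begin{verbatim}
using CUDA

for _ in 1:steps
    d_params = CuArray(parameters)             # fresh GPU allocation
    interpret_gpu(d_exprs, d_vars, d_params)   # hypothetical evaluator
    # Return the buffer to CUDA.jl's memory pool immediately,
    # instead of waiting for the Julia GC to reclaim it.
    CUDA.unsafe_free!(d_params)
end
\end{verbatim}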
With these two changes the overall runtime has been improved as can be seen in Figure \ref{fig:gpu_i_optimisation_3}. Moreover, the standard deviation was also reduced which was the main goal of this optimisation.
With these two changes, the overall runtime has been improved by two percent, as can be seen in Figure \ref{fig:gpu_i_optimisation_3}. Moreover, the standard deviation was also reduced, which was the main goal of this optimisation.
\begin{figure}
\centering
@ -194,7 +194,7 @@ During the benchmark it was observed that the CPU maintained a utilisation of 10
\subsubsection{Benchmark 3}
This benchmark increased the amount of variable sets by $30$ times and therefore also increases the total number of evaluations by $30$ times. As observed in the second benchmark, the GPU was underutilised and thus had more resources available for evaluating the expressions. As shown in Figure \ref{fig:gpu_t_benchmark_3} the available resources were better utilised. Although the number of evaluations increased by a factor of $30$, the median execution time only increased by approximately six seconds, or $1.3$ times, from $19.6$ to $25.4$. The standard deviation also decreased from $1.16$ seconds to $0.65$ seconds.
This benchmark increased the number of data points by $30$ times and therefore also increased the total number of evaluations by $30$ times. As observed in the second benchmark, the GPU was underutilised and thus had more resources available for evaluating the expressions. As shown in Figure \ref{fig:gpu_t_benchmark_3}, the available resources were better utilised. Although the number of evaluations increased by a factor of $30$, the median execution time only increased by approximately six seconds, or $1.3$ times, from $19.6$ to $25.4$ seconds. The standard deviation also decreased from $1.16$ seconds to $0.65$ seconds.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/gpu-transpiler-final-performance-benchmark3.png}
@ -202,7 +202,7 @@ This benchmark increased the amount of variable sets by $30$ times and therefore
\label{fig:gpu_t_benchmark_3}
\end{figure}
Given the change in the number of variable sets, additional performance tests with different block sizes were conducted. During this process it was found, that changing the block size from $128$ to $160$ threads resulted in the best performance. This is in contrast to the GPU interpreter where changing the block size to $160$ resulted in degraded performance.
Given the change in the number of data points, additional performance tests with different block sizes were conducted. During this process it was found that changing the block size from $128$ to $160$ threads resulted in the best performance. This is in contrast to the GPU interpreter, where changing the block size to $160$ resulted in degraded performance.
While conducting this benchmark, the CPU utilisation began at 100\% during the frontend step as well as the transpilation and compilation steps. However, similar to the third benchmark of the GPU interpreter, the CPU utilisation dropped to 80\% during the evaluation phase. This is very likely for the same reason: the kernels are dispatched too quickly in succession, filling up the number of allowed resident grids on the GPU.
@ -218,7 +218,7 @@ As already mentioned in Section \ref{sec:tuning_interpreter}, using a cache in c
Caching has also been used for the transpilation step. The reason for this was to reduce the runtime during the parameter optimisation step. While this reduced the overhead of transpilation, the overhead of checking whether an expression had already been transpiled remained. Because of the already mentioned RAM constraints, this cache has been disabled and a better solution has been implemented in the first and second optimisation steps.
Most data of the tuning process has been gathered with the number of expressions and variable sets of the first benchmark, as this was the worst performing scenario. Therefore, it would show best where potential for performance improvements was. Before any optimisations were applied a single sample of the first benchmark took roughly 15 hours. However, it needs to be noted that only two samples were taken due to the duration of one sample.
Most data of the tuning process has been gathered with the number of expressions and data points of the first benchmark, as this was the worst performing scenario and therefore best showed where potential for performance improvements lay. Before any optimisations were applied, a single sample of the first benchmark took roughly 15 hours. However, it needs to be noted that only two samples were taken due to the duration of one sample.
\subsubsection{Optimisation 1}
% 1.) Done before parameter optimisation loop: Frontend, transmitting Variables (improved runtime)
@ -230,15 +230,15 @@ With this optimisation step the number of calls to the transpiler and compiler h
It also must be noted that compiling the PTX kernels and storing the result before the parameter optimisation step led to an out-of-memory error for the first benchmark. In order to get any results, this step had to be reverted for this benchmark. If much more RAM were available, the runtime would have been significantly better.
These optimisations lead to a runtime of one sample of roughly ten hours for the first benchmark. Therefore, a substantial improvement of roughly four hours per sample was achieved. When $10\,000$ expressions are transpiled it takes on average $0.05$ seconds over ten samples. Comparing this to the time spent compiling the resulting $10\,000$ kernels it takes on average $3.2$ seconds over ten samples. This suggests that performing the compilation before the parameter optimisation step would yield drastically better results in the first benchmark.
Nonetheless, these optimisations led to a runtime of roughly ten hours per sample for the first benchmark, a substantial improvement of roughly four hours, or 40\%, per sample. Transpiling $10\,000$ expressions takes on average $0.05$ seconds over ten samples. Compiling the resulting $10\,000$ kernels, by comparison, takes on average $3.2$ seconds over ten samples. This suggests that performing the compilation before the parameter optimisation step would yield drastically better results in the first benchmark.
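Where RAM permits, such a change could be sketched with the driver API exposed by CUDA.jl. Here \verb|transpile| stands in for the transpiler's PTX generation, and the kernel entry-point name and argument types are assumptions.
\begin{verbatim}
using CUDA

# Transpile and compile every expression exactly once, before the loop.
kernels = map(expressions) do expr
    ptx = transpile(expr)          # hypothetical: PTX source as a String
    mod = CuModule(ptx)            # the driver compiles the PTX here
    CuFunction(mod, "evaluator")   # assumed entry-point name
end

for _ in 1:steps                   # parameter optimisation steps
    for kernel in kernels
        cudacall(kernel, Tuple{CuPtr{Float32},CuPtr{Float32}},
                 d_vars, d_params;
                 threads = 128, blocks = cld(n_points, 128))
    end
end
\end{verbatim}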
\subsubsection{Optimisation 3}
% 3.) benchmark3 std noticeably improved with blocksize 160 (around 70\% better) (also includes call to unsafe_free)
% here I can show chart of comparing the two blocksizes
% unsafe_free in benchmark one reduced std. but could also be run to run variance. at least no negative effects
The third optimisation step was more focused on improving the performance for the third benchmark as it has a higher number of variable sets than the first and second one. However, as with the interpreter, the function \verb|CUDA.unsafe_free!(::CuArray)| has been used to reduce the standard deviation for all benchmarks.
The third optimisation step was more focused on improving the performance for the third benchmark, as it has a higher number of data points than the first and second ones. However, as with the interpreter, the function \verb|CUDA.unsafe_free!(::CuArray)| has been used to reduce the standard deviation for all benchmarks.
Since the number of variable sets has changed in the third benchmark, it is important to re-do the performance tuning. This was done by measuring the kernel performance using NSight Compute. As with the interpreter, block sizes of $128$ and $160$ threads have been compared with each other. A block size of $192$ threads has been omitted here since the number of excess threads is very high. In the case of the interpreter the performance of this configuration was the worst out of the three configurations, and it was assumed it will be similar in this scenario.
Since the number of data points has changed in the third benchmark, it is important to redo the performance tuning. This was done by measuring the kernel performance using NSight Compute. As with the interpreter, block sizes of $128$ and $160$ threads have been compared with each other. A block size of $192$ threads has been omitted here, since the number of excess threads is very high. In the case of the interpreter, this configuration performed the worst out of the three, and it was assumed it would be similar in this scenario.
However, since the number of excess threads for $128$ and $160$ threads per block is the same, the latter using fewer blocks might lead to performance improvements in the case of the transpiler. As seen in Figure \ref{fig:gpu_t_128_160}, this assumption held, and using a block size of $160$ threads resulted in better performance for the third benchmark. This is in contrast to the interpreter, where this configuration performed much more poorly.
\begin{figure}
@ -259,11 +259,11 @@ The goal of the first benchmark was to determine how the evaluators are able to
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/cpu_gpui_gput_bench1.png}
\caption{The results of the comparison of all three implementations for the first benchmark. Note that the transpiler is absent because it did not finish this benchmark.}
\caption{The results of the comparison of the CPU-based and GPU-based interpreters for the first benchmark. Note that the transpiler is absent because it did not finish this benchmark.}
\label{fig:cpu_gpui_gput_benchmark_1}
\end{figure}
Figure \ref{fig:cpu_gpui_gput_benchmark_1} shows the results of the first benchmark for the CPU and GPU interpreter. It can be seen that the GPU interpreter takes roughly four times as long on median than the CPU interpreter. Additionally, the standard deviation is much larger on the GPU interpreter. This shows that the CPU heavily benefits from scenarios where a lot of expressions need to be evaluated with very few variable sets. Therefore, it is not advisable to use the GPU to increase the performance in such scenarios.
Figure \ref{fig:cpu_gpui_gput_benchmark_1} shows the results of the first benchmark for the CPU and GPU interpreters. It can be seen that the GPU interpreter takes roughly four times as long on median as the CPU interpreter. Additionally, the standard deviation is much larger for the GPU interpreter. This shows that the CPU heavily benefits from scenarios where a lot of expressions need to be evaluated with very few data points. Therefore, it is not advisable to use the GPU to increase the performance in such scenarios.
\subsubsection{Benchmark 2}
The first benchmark has shown that with a large number of expressions, the GPU is not a suitable alternative to the CPU. To further support this finding, a second benchmark with much fewer expressions was conducted. Now, instead of $250\,000$ expressions, only $10\,000$ are evaluated. This reduction also meant that the transpiler could be included in the comparison, as it no longer faces any RAM limitations.
@ -280,7 +280,7 @@ Reducing the number of expressions did not benefit the GPU evaluators at all in
On the other hand, it can also be seen that the GPU transpiler tends to perform better than the GPU interpreter. While in the worst case both implementations are roughly equal, the GPU transpiler performs better on median. Additionally, the GPU transpiler can also outperform the GPU interpreter in the best case.
\subsubsection{Benchmark 3}
As found by the previous two benchmarks, varying the number of expressions only has a slight impact on the performance of the GPU in relation to the performance of the CPU. However, instead of varying the number of expressions, the number of variable sets can also be changed. For this benchmark, instead of $362$ variable sets, a total of $10\,860$ variable sets were used, which translates to an increase by $30$ times. It needs to be noted, that it was only possible to evaluate the performance with roughly $10\,000$ expressions with this number of variable sets. When using the same roughly $250\,000$ expressions of the first benchmark and the increase number of variable sets, none of the implementations managed to complete the benchmark, as there was too little RAM available.
As found by the previous two benchmarks, varying the number of expressions only has a slight impact on the performance of the GPU in relation to the performance of the CPU. However, instead of varying the number of expressions, the number of data points can also be changed. For this benchmark, instead of $362$ data points, a total of $10\,860$ data points were used, which translates to an increase by a factor of $30$. It needs to be noted that it was only possible to evaluate the performance with roughly $10\,000$ expressions with this number of data points. When using the same roughly $250\,000$ expressions of the first benchmark and the increased number of data points, none of the implementations managed to complete the benchmark, as there was too little RAM available.
\begin{figure}
\centering
@ -289,9 +289,9 @@ As found by the previous two benchmarks, varying the number of expressions only
\label{fig:cpu_gpui_gput_benchmark_3}
\end{figure}
Increasing the number of variable sets greatly benefited both GPU evaluators as seen in Figure \ref{fig:cpu_gpui_gput_benchmark_3}. With this change, the CPU interpreter noticeably fell behind the GPU evaluators. Compared to the GPU transpiler, the CPU interpreter took roughly twice as long on median. The GPU transpiler continued its trend of performing better than the GPU interpreter. Furthermore, the standard deviation of all three evaluators is also very similar.
Increasing the number of data points greatly benefited both GPU evaluators, as seen in Figure \ref{fig:cpu_gpui_gput_benchmark_3}. With this change, the CPU interpreter noticeably fell behind the GPU evaluators. Compared to the GPU transpiler, the CPU interpreter took roughly twice as long on median. The GPU transpiler continued its trend of performing better than the GPU interpreter. Furthermore, the standard deviations of all three evaluators are very similar.
From this benchmark it can be concluded that the GPU heavily benefits from a larger number of variable sets. If the number of variable sets is increased even further, the difference in performance between the GPU and CPU should be even more pronounced.
From this benchmark it can be concluded that the GPU heavily benefits from a larger number of data points. If the number of data points is increased even further, the difference in performance between the GPU and CPU should be even more pronounced.
While the GPU is very limited in terms of concurrent kernel dispatches that can be evaluated, the number of threads and blocks can virtually be infinitely large. This means that a higher degree of parallelism is achievable with a higher number of variable sets. Increasing the number of expressions on the other hand does not influence the degree of parallelism to this extent. This is the reason no performance benefit was found by only decreasing the number of expressions with the same number of variable sets.
While the GPU is very limited in terms of concurrent kernel dispatches, the number of threads and blocks can be virtually unlimited. This means that a higher degree of parallelism is achievable with a higher number of data points. Increasing the number of expressions, on the other hand, does not influence the degree of parallelism to this extent. This is the reason no performance benefit was found when only decreasing the number of expressions while keeping the number of data points the same.