In this section, the results for the transpiler are presented in detail. First, the results for all three benchmarks are shown. The benchmarks are the same as those explained in the previous sections. After the results, an overview of the steps taken to optimise the transpiler execution times is given.
\subsubsection{Benchmark 1}
\label{sec:gput_bench1}
This benchmark led to very poor results for the transpiler. While the best performing kernel configuration of $128$ threads per block was used, the above-mentioned RAM constraints meant that this benchmark performed poorly. After roughly $20$ hours of execution, only two samples had been taken, at which point it was decided not to finish this benchmark.
The reason this benchmark performed poorly is that too little RAM was available. As described in Chapter \ref{cha:implementation}, the expressions are transpiled into PTX code and then immediately compiled into machine code by the GPU driver before the compiled kernels are sent to the parameter optimisation step. This order of operations makes sense, as the expressions remain the same during this process; otherwise, a lot of unnecessary transpilations and compilations would be performed.
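The described order of operations can be sketched as follows. This is a minimal sketch under assumptions, not the thesis implementation: the `transpile` helper and the kernel name `"evaluate"` are illustrative placeholders; only `CuModule` and `CuFunction` are actual CUDA.jl wrappers around the driver API.

```julia
using CUDA

# Transpile every expression once and let the driver compile the PTX
# immediately; the compiled kernels are then reused unchanged by every
# parameter optimisation step. `transpile` is a hypothetical helper.
function prepare_kernels(expressions)
    kernels = Vector{CuFunction}(undef, length(expressions))
    for (i, expr) in enumerate(expressions)
        ptx = transpile(expr)                  # expression -> PTX source
        mod = CuModule(ptx)                    # driver JIT-compiles the PTX
        kernels[i] = CuFunction(mod, "evaluate")
    end
    return kernels
end
```

The trade-off discussed above is visible here: holding all compiled modules alive at once is what exhausts the RAM in the first benchmark.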
It also must be noted that compiling the PTX kernels and storing the result before the parameter optimisation step led to an out-of-memory error for the first benchmark. In order to get any results, this step had to be reverted for this benchmark. If much more RAM were available, the runtime would also have improved significantly.
These optimisations lead to a runtime of roughly ten hours per sample for the first benchmark, a substantial improvement of roughly four hours per sample. Transpiling $10\,000$ expressions takes on average $0.05$ seconds over ten samples, whereas compiling the resulting $10\,000$ kernels takes on average $3.2$ seconds over ten samples. This suggests that performing the compilation before the parameter optimisation step would yield drastically better results in the first benchmark.
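Per-stage averages like the ones quoted above can be obtained with a measurement loop along these lines; the `transpile` helper and the `expressions` collection are assumptions carried over from the pipeline sketch, while `@elapsed` and `mean` behave as shown.

```julia
using CUDA, Statistics

# Hypothetical timing of the two stages for 10_000 expressions,
# averaged over a number of samples (ten in the thesis).
function time_stages(expressions; samples = 10)
    t_transpile = Float64[]
    t_compile   = Float64[]
    for _ in 1:samples
        push!(t_transpile, @elapsed ptx = map(transpile, expressions))
        push!(t_compile,   @elapsed map(CuModule, ptx))
    end
    return mean(t_transpile), mean(t_compile)
end
```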
\subsubsection{Optimisation 3}
% 3.) benchmark3 std noticeably improved with blocksize 160 (around 70\% better) (also includes call to unsafe_free)
% here I can show chart of comparing the two blocksizes
% unsafe_free in benchmark one reduced std. but could also be run to run variance. at least no negative effects
The third optimisation step focused on improving the performance of the third benchmark, as it has a higher number of variable sets than the first and second ones. However, as with the interpreter, the function \verb|CUDA.unsafe_free!(::CuArray)| has been used to reduce the standard deviation of all benchmarks.
Since the number of variable sets changed in the third benchmark, it is important to re-do the performance tuning. This was done by measuring the kernel performance using NSight Compute. As with the interpreter, block sizes of 128 and 160 threads have been compared with each other. A block size of 192 threads has been omitted here, since the number of excess threads is very high. In the case of the interpreter, this configuration performed the worst of the three, and it was assumed it would be the same here.
However, since the number of excess threads for 128 and 160 threads per block is the same, while the latter uses fewer blocks, the transpiler might behave differently. As seen in Figure \ref{fig:gpu_t_128_160}, this assumption was true, and using a block size of 160 threads resulted in better performance for the third benchmark. This is in contrast to the interpreter, where this configuration performed much worse.
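The tuning result translates into a launch configuration along the following lines. This is a sketch under the assumption that one thread processes one variable set; `kernel`, `argtypes`, `args` and `results_gpu` are placeholders, while `cld`, `cudacall` and `CUDA.unsafe_free!` are the actual Julia/CUDA.jl calls.

```julia
using CUDA

# Launch configuration for the third benchmark (10_860 variable sets).
# 128 and 160 threads per block yield the same number of excess threads,
# but 160 needs fewer blocks:
#   cld(10_860, 128) = 85 blocks -> 85 * 128 = 10_880 threads
#   cld(10_860, 160) = 68 blocks -> 68 * 160 = 10_880 threads
n_varsets = 10_860
threads   = 160
blocks    = cld(n_varsets, threads)  # round up so every set gets a thread

# `kernel` stands for a compiled CuFunction; arguments are placeholders.
cudacall(kernel, argtypes, args...; threads, blocks)

# Free device buffers eagerly instead of waiting for the GC; this is the
# call used to reduce the standard deviation of the benchmarks.
CUDA.unsafe_free!(results_gpu)
```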
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/transpiler-comparison-128-160.png}
\caption{Runtime comparison of the third benchmark with block sizes of 128 and 160 threads.}
\label{fig:gpu_t_128_160}
\end{figure}
\subsection{Comparison}
% Comparison of Interpreter and Transpiler as well as comparing the two with the CPU interpreter
% more var sets == better performance for GPU; more expressions == more performance for CPU evaluator
With the individual results of the GPU interpreter and transpiler presented, it is possible to compare them with the existing CPU interpreter. This section outlines and compares the performance of all three implementations across all three benchmarks to understand their strengths and weaknesses. Through this analysis, the scenarios where it is best to leverage the GPU, as well as those where the CPU interpreter is the better choice, will be identified, ultimately answering the research questions of this thesis.
\subsubsection{Benchmark 1}
The goal of the first benchmark was to determine how the evaluators handle large numbers of expressions. While this benchmark is not representative of a typical scenario, it demonstrates the impact the number of expressions has on the execution time. As already explained in Section \ref{sec:gput_bench1}, the transpiler was not able to finish this benchmark due to RAM limitations. A slightly modified implementation was needed to obtain results for at least two samples, each taking roughly ten hours to complete. Therefore, the transpiler has been omitted from this comparison.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/cpu_gpui_gput_bench1.png}
\caption{The results of the comparison of the CPU and GPU interpreter for the first benchmark.}
\label{fig:cpu_gpui_gput_benchmark_1}
\end{figure}
Figure \ref{fig:cpu_gpui_gput_benchmark_1} shows the results of the first benchmark for the CPU and GPU interpreter. It can be seen that the GPU interpreter takes roughly four times as long as the CPU interpreter on median. Additionally, the standard deviation is much larger for the GPU interpreter. This shows that the CPU heavily benefits from scenarios where a lot of expressions need to be evaluated with very few variable sets. Therefore, it is not advisable to use the GPU to increase performance in these scenarios.
\subsubsection{Benchmark 2}
The first benchmark has shown that, with a large number of expressions, the GPU is not a suitable alternative to the CPU. To further test this statement, a second benchmark with much fewer expressions was conducted. Instead of $250\,000$ expressions, only $10\,000$ are now evaluated. This reduction also means that the transpiler can be included in the comparison, as it no longer faces the RAM limitations.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/cpu_gpui_gput_bench2.png}
\caption{The results of the comparison of all three implementations for the second benchmark.}
\label{fig:cpu_gpui_gput_benchmark_2}
\end{figure}
Reducing the number of expressions did not benefit the GPU evaluators at all compared to the CPU interpreter, as can be seen in Figure \ref{fig:cpu_gpui_gput_benchmark_2}. Furthermore, both GPU evaluators are now roughly five times slower than the CPU interpreter instead of the previous factor of roughly four. Again, the standard deviation is much higher for both GPU evaluators compared to the CPU interpreter. This means that a lower number of expressions does not enable the GPU to outperform the CPU, confirming that the number of expressions alone is not the deciding factor.
On the other hand, it can also be seen that the GPU transpiler tends to perform better than the GPU interpreter. While in the worst case both implementations are roughly equal, the GPU transpiler performs better on median. Additionally, the GPU transpiler also outperforms the GPU interpreter in the best case.
\subsubsection{Benchmark 3}
As found by the previous two benchmarks, varying the number of expressions only has a slight impact on the performance of the GPU relative to the CPU. However, instead of varying the number of expressions, the number of variable sets can also be changed. For this benchmark, instead of $362$ variable sets, a total of $10\,860$ variable sets were used, a $30$-fold increase. It needs to be noted that it was only possible to evaluate the performance with roughly $10\,000$ expressions. When using the roughly $250\,000$ expressions of the first benchmark, none of the implementations managed to complete the benchmark, as there was too little RAM available.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/cpu_gpui_gput_bench3.png}
\caption{The results of the comparison of all three implementations for the third benchmark.}
\label{fig:cpu_gpui_gput_benchmark_3}
\end{figure}
Increasing the number of variable sets greatly benefited both GPU evaluators, as seen in Figure \ref{fig:cpu_gpui_gput_benchmark_3}. With this change, the CPU interpreter noticeably fell behind the GPU evaluators. Compared to the GPU transpiler, the CPU interpreter took roughly twice as long on median. The GPU transpiler continued its trend of performing better than the GPU interpreter. Furthermore, the standard deviation of all three evaluators is very similar.
From this benchmark it can be concluded that the GPU heavily benefits from a larger number of variable sets. If the number of variable sets is increased even further, the difference in performance between the GPU and the CPU should become even more pronounced.
While the GPU is very limited in the number of kernels that can be dispatched concurrently, the number of threads and blocks within a single kernel can be virtually arbitrarily large. Since each variable set is processed by its own thread, a higher number of variable sets results in a higher degree of parallelism. Increasing the number of expressions, on the other hand, does not influence the degree of parallelism to this extent, as each additional expression results in an additional kernel dispatch. This is the reason no performance benefit was found by only decreasing the number of expressions while keeping the number of variable sets constant.
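This asymmetry can be made explicit with some plain arithmetic; the numbers are those of the third benchmark, and the one-thread-per-variable-set mapping is an assumption carried over from the launch configuration.

```julia
# Each expression is one kernel dispatch; the variable sets are the
# parallel threads inside that dispatch.
n_expressions     = 10_000    # number of (largely serialising) dispatches
n_varsets         = 10_860    # parallel work items per dispatch
threads_per_block = 160
blocks            = cld(n_varsets, threads_per_block)   # 68 blocks
per_dispatch      = blocks * threads_per_block          # 10_880 threads
# Growing n_varsets grows the parallel work inside every dispatch;
# growing n_expressions only adds more dispatches, which the GPU cannot
# execute with the same degree of concurrency.
```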
