evaluation: started and almost finished transpiler section

2025-06-06 13:16:40 +02:00
parent 275162d38d
commit 4132a4946f
5 changed files with 82 additions and 30 deletions


@@ -19,7 +19,7 @@ function evaluate(expressions::Vector{ExpressionProcessing.PostfixType}, cudaVar
 # each expression has nr. of variable sets (nr. of columns of the variables) results and there are n expressions
 cudaResults = CuArray{Float32}(undef, variableColumns, length(expressions))
-threads = min(variableColumns, 128)
+threads = min(variableColumns, 160)
 blocks = cld(variableColumns, threads)
 kernelName = "evaluate_gpu"
@@ -44,7 +44,7 @@ function evaluate(kernels::Vector{CuFunction}, cudaVars::CuArray{Float32}, nrOfV
 # each expression has nr. of variable sets (nr. of columns of the variables) results and there are n expressions
 cudaResults = CuArray{Float32}(undef, nrOfVariableSets, length(kernels))
-threads = min(nrOfVariableSets, 256)
+threads = min(nrOfVariableSets, 160)
 blocks = cld(nrOfVariableSets, threads)
 @inbounds Threads.@threads for i in eachindex(kernels)


@@ -68,11 +68,10 @@ This section presents the results of the benchmarks described above. First the r
\subsection{Interpreter}
In this section, the results for the GPU-based interpreter are presented in detail. Following the benchmark results, the process of tuning the interpreter is described, as well as how the tuning was adapted for the different benchmarks. This part covers not only the tuning of the GPU, but also performance improvements made on the CPU side.
\subsubsection{Benchmark 1}
The first benchmark consists of $250\,000$ expressions and $362$ variable sets with $100$ parameter optimisation steps. Because each expression needs to be evaluated with each variable set for each parameter optimisation step, a total of $9.05\,\textit{billion}$ evaluations have been performed per sample. In Figure \ref{fig:gpu_i_benchmark_1} the result over all $50$ samples is presented. The median value across all samples is $466.3$ seconds with a standard deviation of $14.2$ seconds.
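As a quick sanity check, the evaluation count follows directly from the benchmark parameters (a sketch in Julia; the variable names are illustrative):

expressions = 250_000        # number of expressions in benchmark 1
variableSets = 362           # number of variable sets
optimisationSteps = 100      # parameter optimisation steps
totalEvaluations = expressions * variableSets * optimisationSteps  # 9_050_000_000, i.e. 9.05 billion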
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/gpu-interpreter-final-performance-benchmark1.png}
@@ -100,15 +99,16 @@ The third benchmark used the same $10\,000$ expressions and $100$ parameter opti
\label{fig:gpu_i_benchmark_3}
\end{figure}
Although the number of variable sets has been increased by $30$ times, the block size remained at $128$ threads. Unlike the previous benchmarks, the hardware utilisation was different. Now only the GPU was utilised to 100\%, while the CPU utilisation started at 100\% and slowly dropped to 80\%. The GPU needs to perform $30$ times more evaluations, meaning it takes longer for one kernel dispatch to finish. At the same time, the CPU tries to dispatch the kernels at the same rate as before. Because only a certain number of kernels can be dispatched at once, the CPU needs to wait for the GPU to finish a kernel before another one can be dispatched. Therefore, in this scenario, the evaluator runs into a GPU bottleneck, and using a more performant GPU would consequently improve the runtime. In the previous benchmarks, both the CPU and GPU would need to be upgraded to achieve better performance.
\subsection{Performance Tuning Interpreter}
\label{sec:tuning_interpreter}
Optimising and tuning the interpreter is crucial to achieve good performance, especially tuning the kernel, as a wrongly configured kernel can drastically degrade performance. Before any performance tuning and optimisation was performed, the kernel was configured with a block size of $256$ threads, which is a good initial configuration as recommended by \textcite{nvidia_cuda_2025-1}. Additionally, on the CPU, the frontend was executed for each expression before every kernel dispatch, even in parameter optimisation scenarios where the expressions did not change from one dispatch to the other. Moreover, the variables were also transmitted to the GPU before every dispatch. However, executing the frontend as well as dispatching the kernel was multithreaded, utilising all 12 threads of the CPU, and a cache for the frontend was used.
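This initial launch configuration corresponds to the pattern visible in the Julia diff at the top of this commit; a minimal sketch:

nrOfVariableSets = 362
threads = min(nrOfVariableSets, 256)      # initial block size of 256, later tuned down
blocks = cld(nrOfVariableSets, threads)   # ceiling division so every variable set gets a thread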
With this implementation, the initial performance measurements have been conducted for benchmark 1, which served as the baseline for further performance optimisations. However, as already mentioned, during this benchmark memory limitations were encountered, as too much RAM was being used. Therefore, the caching had to be disabled. Because the evaluator is multithreaded, this change resulted in significantly better performance. As the cache introduced critical sections where race conditions could occur, locking mechanisms needed to be used. While locking ensures that no race conditions occur, it also means that parts of an otherwise entirely parallel implementation are now serialised, reducing the effect of parallelisation.
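A minimal sketch of why such a cache serialises the otherwise parallel frontend, assuming a lock-protected dictionary (process_expression is a hypothetical stand-in for the actual frontend call, which is not part of this diff):

const cacheLock = ReentrantLock()
const frontendCache = Dict{Expr, Any}()

function cached_frontend(expr::Expr)
    # Critical section: every thread must queue here, so the otherwise
    # fully parallel frontend is partially serialised by the lock.
    lock(cacheLock) do
        get!(() -> process_expression(expr), frontendCache, expr)
    end
end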
Without a cache and utilising all 12 threads, the frontend achieved very good performance. Processing $250\,000$ expressions takes roughly $88.5$ milliseconds. Using a cache, on the other hand, resulted in the frontend running for $6.9$ \textit{seconds}. This equates to a speed-up of roughly 78 times when using no cache. Additionally, when looking at the benchmark results above, the time it takes to execute the frontend is negligible, meaning that further optimising the frontend would not significantly improve the overall runtime.
During the tuning process, $362$ variable sets have been used, which is the number of variable sets used by benchmarks one and two. Before conducting benchmark three, additional performance tuning has been performed to ensure that this benchmark also utilises the hardware as much as possible.
@@ -129,7 +129,7 @@ Figure \ref{fig:gpu_i_optimisation_1} shows how this optimisation improved the o
The second optimisation was concerned with tuning the kernel configuration. Using NSight Compute\footnote{\url{https://developer.nvidia.com/nsight-compute}} it was possible to profile the kernel with different configurations. During the profiling, many metrics have been gathered that allowed a deep analysis of the kernel executions, with the application highlighting aspects that had a lot of potential for performance improvements.
Since the evaluator is designed to execute many kernel dispatches in parallel, it was important to reduce the kernel runtime. Reducing the runtime per kernel has a knock-on effect, as the following kernel dispatches can begin execution sooner, reducing the overall runtime.
After the evaluator tuning was concluded, it was found that a block size of $128$ yielded the best results. With this kernel configuration, another performance measurement has been conducted, with the results shown in Figure \ref{fig:gpu_i_optimisation_2} using benchmark one. As can be seen, the overall runtime again was noticeably faster. However, the standard deviation also drastically increased, with the duration from the fastest to the slowest sample differing by roughly $60$ seconds.
@@ -146,7 +146,7 @@ Benchmark three had a total of $10\,860$ variable sets, meaning at least this nu
This has been repeated for two more configurations: once for a block size of $160$ and once for $192$. With a block size of $160$, the total number of blocks was reduced to $68$, which again resulted in $20$ excess threads being started. The hypothesis was that using fewer blocks would result in better utilisation and therefore better performance. The same idea was also behind choosing the block size of $192$. While this only requires $57$ blocks, the number of excess threads increased to $84$.
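These block counts and excess-thread numbers can be verified with a few lines of Julia (a sketch):

variableSets = 10_860
for blockSize in (128, 160, 192)
    blocks = cld(variableSets, blockSize)
    excess = blocks * blockSize - variableSets
    println("block size $blockSize: $blocks blocks, $excess excess threads")
end
# block size 128: 85 blocks, 20 excess threads
# block size 160: 68 blocks, 20 excess threads
# block size 192: 57 blocks, 84 excess threads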
Using NSight Compute it was found that a block size of $160$ was the best performing, followed by the block size of $192$, with the worst performing configuration being a block size of $128$. However, this is not representative of how these configurations perform during the benchmarks. As seen in Figure \ref{fig:gpu_i_128-160-192}, using a block size of $128$ led to significantly better performance than the other configurations. While a block size of $160$ led to worse results, it needs to be noted that it also improved the standard deviation by 25\% when compared to the results with a block size of $128$. These results also demonstrate that it is important to not only use NSight Compute but also conduct performance tests with real data to ensure the best possible configuration is chosen.
\begin{figure}
\centering
@@ -156,6 +156,13 @@ Using NSight Compute it was found, that a block size of $160$ was the best perfo
\end{figure}
\subsubsection{Optimisation 3}
As seen in Figure \ref{fig:gpu_i_optimisation_2}, while the performance overall improved, the standard deviation also significantly increased. With the third optimisation the goal was to reduce the standard deviation. In order to achieve this, some minor optimisations were applied.
The first optimisation was to reduce the stack size of the interpreter from $25$ to $10$. As the stack is stored in local memory, it is beneficial to minimise the data transfer. This change, however, means that the stack might not be sufficient for larger expressions. Because no problems were found with a stack size of $10$ during testing, it was assumed to be sufficient for most cases; in the remaining cases, the stack size can simply be increased.
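A sketch of what such a fixed-size stack could look like inside the kernel, assuming a StaticArrays-backed stack as is common in CUDA.jl kernels (the actual kernel code is not part of this diff):

using StaticArrays

# Per-thread evaluation stack, placed in local memory by the compiler.
# Reducing the capacity from 25 to 10 shrinks the local-memory footprint,
# at the cost of failing on expressions that need a deeper stack.
operationStack = MVector{10, Float32}(undef)   # previously MVector{25, Float32}
stackTop = 0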
During the parameter optimisation step, a lot of memory operations were performed. These are required, as for each step new memory on the GPU must be allocated for both the parameters and the meta information. The documentation of CUDA.jl\footnote{\url{https://cuda.juliagpu.org/stable/usage/memory/\#Avoiding-GC-pressure}} mentions that this can lead to higher garbage-collector (GC) pressure, increasing the time spent garbage collecting. To reduce this, CUDA.jl provides the \verb|CUDA.unsafe_free!(::CuArray)| function. This frees the memory on the GPU without requiring the Julia GC to run, therefore spending fewer resources on garbage collection and more on evaluating the expressions.
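A sketch of the resulting allocation pattern in the parameter optimisation loop; CUDA.unsafe_free! is the documented CUDA.jl function, while the loop structure and data are illustrative:

using CUDA

optimisationSteps = 100
parameters = rand(Float32, 5, 10_000)      # illustrative parameter matrix

for step in 1:optimisationSteps
    cudaParams = CuArray(parameters)       # fresh GPU allocation per optimisation step
    # ... evaluate the expressions with the new parameters ...
    # Free the GPU memory eagerly instead of waiting for the Julia GC,
    # reducing GC pressure as described in the CUDA.jl documentation:
    CUDA.unsafe_free!(cudaParams)
end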
With these two changes the overall runtime has been improved, as can be seen in Figure \ref{fig:gpu_i_optimisation_3}. Moreover, the standard deviation was also reduced, which was the goal of this optimisation.
\begin{figure}
\centering
@@ -165,49 +172,75 @@ Using NSight Compute it was found, that a block size of $160$ was the best perfo
\end{figure}
\subsection{Transpiler}
% Results only for Transpiler (also contains final kernel configuration and probably quick overview/recap of the implementation used and described in Implementation section)
In this section, the results for the transpiler are presented in detail. First, the results for all three benchmarks are shown. The benchmarks are the same as those explained in the previous sections. After the results, an overview of the steps taken to optimise the transpiler execution times is given.
\subsubsection{Benchmark 1}
This benchmark led to very poor results for the transpiler. Although the best performing kernel configuration of $128$ threads per block was used, the above-mentioned RAM constraints meant that the benchmark could not be completed. After roughly $20$ hours of execution only two samples had been taken, at which point it was decided not to finish this benchmark.
The reason this benchmark performed poorly was that too little RAM was available. As described in Chapter \ref{cha:implementation}, the expressions are transpiled into PTX code and then immediately compiled into machine code by the GPU driver before the compiled kernels are sent to the parameter optimisation step. This order of operations makes sense, as the expressions remain the same during this process; otherwise a lot of unnecessary transpilations and compilations would be performed.
However, only 16 GB of RAM were available, with about half of that being used by the operating system. This meant that about eight GB of RAM were available to store $250\,000$ compiled kernels next to other required data, for example the variable matrix. This was not enough memory and the benchmark was unable to finish. To combat this, the step of compiling the kernels was moved into the parameter optimisation process, as this frees the memory taken up by a compiled kernel after it has been executed. As seen above, this dramatically hurt performance and has shown that for these scenarios much more memory is required for the transpiler.
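A sketch of this workaround, assuming CUDA.jl's CuModule/CuFunction API is used (as the Julia diff above suggests); the kernel argument list is illustrative:

using CUDA

# PTX source is kept instead of compiled kernels; the driver compiles each
# kernel inside the loop, so its module becomes collectable right after the
# dispatch, trading runtime for a smaller RAM footprint.
for ptx in ptxKernels
    mod = CuModule(ptx)                        # driver compiles the PTX here
    kernel = CuFunction(mod, "evaluate_gpu")   # kernel name as in the diff above
    cudacall(kernel, Tuple{CuPtr{Float32}}, pointer(cudaResults);
             threads=threads, blocks=blocks)
end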
\subsubsection{Benchmark 2}
By reducing the number of expressions from $250\,000$ to roughly $10\,000$, the RAM constraint that hindered the first benchmark is no longer a concern. This can also be seen in Figure \ref{fig:gpu_t_benchmark_2}, where the benchmark could be completed in a much more reasonable time. The median of this benchmark was $19.6$ seconds with a standard deviation of $1.16$ seconds. Again, a block size of $128$ threads has been chosen for this benchmark.
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/gpu-transpiler-final-performance-benchmark2.png}
\caption{The results of the transpiler for benchmark 2.}
\label{fig:gpu_t_benchmark_2}
\end{figure}
During the benchmark it was observed that the CPU maintained a utilisation of 100\%. Crucially, however, the GPU rapidly oscillated between 0\% and 100\% utilisation. This pattern suggests that while the kernels can fully utilise the GPU, they complete the evaluations almost immediately. Consequently, although the evaluation is performed very quickly, the time spent evaluating is smaller than the time spent preparing the expressions for evaluation. To better leverage the GPU, more evaluations should be performed. This would increase the GPU's share of the total execution time and therefore drastically increase efficiency and performance.
\subsubsection{Benchmark 3}
% Even larger var sets would be perfect. 10k is rather small and the GPU still has barely any work to do
% std: 648.8 ms
This benchmark increased the number of variable sets by $30$ times and therefore also increased the total number of evaluations by $30$ times. As already seen in the second benchmark, the GPU was underutilised and therefore had more resources available for evaluating the expressions. As can be seen in Figure \ref{fig:gpu_t_benchmark_3}, the available resources were better utilised. Although the number of evaluations increased by $30$ times, the median execution time only increased by roughly six seconds, or $1.3$ times, from $19.6$ to $25.4$ seconds. The standard deviation also decreased from $1.16$ seconds to $0.65$ seconds.
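The implied throughput gain can be derived from the numbers above (a sketch):

evalsBenchmark2 = 10_000 * 362 * 100       # ≈ 3.62e8 evaluations per sample
evalsBenchmark3 = 10_000 * 10_860 * 100    # ≈ 1.086e10 evaluations per sample
gain = (evalsBenchmark3 / 25.4) / (evalsBenchmark2 / 19.6)   # ≈ 23x more evaluations per second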
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/gpu-transpiler-final-performance-benchmark3.png}
\caption{The results of the transpiler for benchmark 3.}
\label{fig:gpu_t_benchmark_3}
\end{figure}
Since the number of variable sets has changed, some performance tests with different block sizes needed to be performed. During this process it was found that changing the block size from $128$ to $160$ threads resulted in the best performance. This is in contrast to the GPU interpreter, where changing the block size to $160$ resulted in degraded performance.
While conducting this benchmark, the CPU utilisation started at 100\% during the frontend step as well as the transpilation and compilation process. However, similar to the third benchmark of the GPU interpreter, the CPU utilisation dropped to 80\% during the evaluation. The reason is very likely the same: the kernels are dispatched too quickly in succession, filling up the number of allowed resident grids on the GPU.
However, the GPU utilisation also drastically increased. During the second benchmark, rapid oscillation was observed. With this benchmark the utilisation remained much more stable, hovering at around 60\% to 70\% most of the time. It needs to be noted, however, that there were also frequent spikes to 100\% and slightly less frequent drops to 20\%. Overall, the GPU utilisation was much higher compared to the second benchmark, which explains why the execution time only increased slightly while the number of evaluations increased drastically.
\subsection{Performance Tuning Transpiler}
% Initial: no cache; 256 blocksize; exprs pre-processed and transpiled on every call; vars sent on every call; frontend + transpilation + dispatch are multithreaded
This section describes how the transpiler has been tuned to achieve good performance. Steps taken to improve the performance of the CPU side of the transpiler are presented. Additionally, steps taken to improve the performance of the kernels are also shown.
Before any optimisations were applied, the block size was set to $256$ threads. The frontend as well as the transpilation and compilation were performed during the parameter optimisation step, before the expression needed to be evaluated. Additionally, the variables were also sent to the GPU on every parameter optimisation step. Multithreading has been used for the frontend, transpilation, compilation and kernel dispatch. Caching has also been used for the frontend and for the transpilation process in an effort to reduce the runtime.
As already mentioned in Section \ref{sec:tuning_interpreter}, using a cache in combination with multithreading for the frontend drastically slowed down the execution, which is the reason it has been disabled before conducting any benchmarks.
Caching has also been used for the transpilation step. The reason for this was to reduce the runtime during the parameter optimisation step. While this reduced the overhead of transpilation, the overhead of checking whether an expression has already been transpiled still existed. Because of the already mentioned RAM constraints, this cache has been disabled and a better solution has been implemented in the first and second optimisation steps.
Most data of the tuning process has been gathered with the number of expressions and variable sets of the first benchmark, as this was the worst performing scenario. Therefore, it best showed where potential for performance improvements existed. Before any optimisations were applied, a single sample of the first benchmark took roughly 15 hours. However, it needs to be noted that, due to the duration of one sample, the sample size is very low.
\subsubsection{Optimisation 1}
% 1.) Done before parameter optimisation loop: Frontend, transmitting Variables (improved runtime)
Since all caching has been disabled, a better solution for reducing the number of calls to the frontend was needed. For this, the calls to the frontend were moved outside the parameter optimisation step and the results were stored for later use. Furthermore, transmitting the variables to the GPU has also been performed before the parameter optimisation is started, further reducing the number and volume of data transfers to the GPU. These two optimisations were able to reduce the runtime of one sample to roughly 14 hours.
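A sketch of the restructured flow, with the frontend and the variable transfer hoisted out of the optimisation loop (frontend is a hypothetical stand-in; evaluate mirrors the signature in the Julia diff above):

# Performed once, before the parameter optimisation starts:
postfixExprs = frontend(expressions)   # hypothetical frontend call, executed a single time
cudaVars = CuArray(variables)          # variables transferred to the GPU once

for step in 1:optimisationSteps
    # Only the parameters change between steps and need to be re-transmitted:
    evaluate(postfixExprs, cudaVars, parameters[step])
end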
\subsubsection{Optimisation 2}
% 2.) All expressions to execute are transpiled first (before they were transpiled for every execution, even in parameter optimisation scenarios). Compilation is done every time in benchmark 1, because too little RAM was available (compilation takes the most time).
With this optimisation the number of calls to the transpiler and compiler has been drastically reduced. Both steps are now performed at the same time the frontend is called. The compiled kernels are then stored and only need to be executed during the parameter optimisation step. This meant that a cache was not needed any more: each time a new set of expressions needs to be evaluated, it is very unlikely that the same expression occurs more than once. Consequently, the benefit of reducing the RAM consumption far outweighs the potential time savings of using a cache. Moreover, removing the cache also removed the overhead of accessing it on every parameter optimisation step, further improving performance.
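A sketch of transpiling and compiling alongside the frontend and storing the compiled kernels (transpile is a hypothetical stand-in; the Vector{CuFunction} type and the threaded loop mirror the Julia diff above):

using CUDA

kernels = Vector{CuFunction}(undef, length(expressions))
@inbounds Threads.@threads for i in eachindex(expressions)
    ptx = transpile(expressions[i])              # hypothetical transpiler call
    mod = CuModule(ptx)                          # compile immediately; no cache required
    kernels[i] = CuFunction(mod, "evaluate_gpu")
end
# The parameter optimisation loop then only dispatches the stored kernels.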
It also must be noted that compiling the PTX kernels and storing the result before the parameter optimisation step led to an out-of-memory error for the first benchmark. In order to get any results, this step had to be reverted for this benchmark. If much more RAM were available, the runtime would have also significantly improved.
These optimisations led to a runtime of one sample of roughly ten hours for the first benchmark. Therefore, a substantial improvement of roughly four hours per sample was achieved. This suggests that, given sufficient RAM, moving the compilation to the same location as the transpilation would yield even better results.
\subsubsection{Optimisation 3}
% 3.) benchmark3 std noticeably improved with blocksize 160 (around 70\% better) (also includes call to unsafe_free)
% here I can show chart of comparing the two blocksizes
% unsafe_free in benchmark one reduced std. but could also be run to run variance. at least no negative effects
\subsection{Comparison}
Comparison of Interpreter and Transpiler as well as Comparing the two with CPU interpreter
@@ -215,7 +248,26 @@ Comparison of Interpreter and Transpiler as well as Comparing the two with CPU i
more var sets == better performance for GPU; more expressions == more performance for CPU evaluator
\subsubsection{Benchmark 1}
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/cpu_gpui_gput_bench1.png}
\caption{The results of the comparison of all three implementations for the first benchmark. Note that the transpiler is absent because it did not finish this benchmark.}
\label{fig:cpu_gpui_gput_benchmark_1}
\end{figure}
\subsubsection{Benchmark 2}
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/cpu_gpui_gput_bench2.png}
\caption{The results of the comparison of all three implementations for the second benchmark.}
\label{fig:cpu_gpui_gput_benchmark_2}
\end{figure}
\subsubsection{Benchmark 3}
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{results/cpu_gpui_gput_bench3.png}
\caption{The results of the comparison of all three implementations for the third benchmark.}
\label{fig:cpu_gpui_gput_benchmark_3}
\end{figure}

Binary file not shown (image updated; size: 19 KiB before, 50 KiB after).

Binary file not shown (image updated; size: 20 KiB before, 43 KiB after).

Binary file not shown.