thesis: finished implementing feedback

2025-06-29 16:33:48 +02:00
parent 5e42668e1a
commit 408f0ac795
4 changed files with 25 additions and 0 deletions


@@ -60,6 +60,17 @@ The performance measurements are taken using the BenchmarkTools.jl\footnote{\url
It offers extensive support for measuring and comparing the results of different implementations and of different versions of the same implementation. Benchmark groups allow categorising the different implementations, taking performance measurements and comparing them. When taking performance measurements, it also supports setting a timeout and, most importantly, the number of samples to be taken. This is especially important, as it produces stable results by combating run-to-run variance. For this thesis, a sample size of $50$ has been used, meaning that each of the previously mentioned benchmarks is executed $50$ times.
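To illustrate how such a suite can be configured, the following is a minimal sketch of a benchmark group with a fixed sample count and timeout. The \texttt{evaluate\_gpu} function and its arguments are placeholders standing in for the actual evaluators of this thesis, not their real signatures.
\begin{verbatim}
using BenchmarkTools

# Placeholder standing in for the actual GPU-based evaluator.
evaluate_gpu(expressions, data) = sum(length, expressions) + sum(data)

suite = BenchmarkGroup()
suite["gpu_interpreter"] = @benchmarkable evaluate_gpu(exprs, data) setup = (
    exprs = [rand(5) for _ in 1:100]; data = rand(1000))

tune!(suite)
# 50 samples per benchmark to combat run-to-run variance;
# `seconds` acts as the timeout for each benchmark.
results = run(suite; samples = 50, seconds = 300)
\end{verbatim}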
\subsubsection{Theoretical Maximum Performance}
To get an idea of how much performance would in theory be achievable, a rough optimistic estimation can be done. Averaged over all roughly $250\,000$ expressions of the first benchmark, a single expression has five operators. This translates to five floating point operations (FLOPS). Since some operators such as $x^y$ require three instructions, it is assumed that one of the five operators is such an operator. As a result, $x^y$ needs three FLOPS, which means that a single expression on average requires seven FLOPS to be evaluated.
Furthermore, expressions consist of variables and parameters, which need to be loaded from memory. It is assumed that each expression contains one parameter. Since the Nikuradse dataset is used, it is known that each expression contains exactly two variables. Loading a value from memory consists of three instructions, so it is assumed that loading a value requires three FLOPS. This brings the total number of FLOPS per expression to $16$.
The used GPU has a theoretical performance of $16.2$ tera-FLOPS (TFLOPS)\footnote{\url{https://www.techpowerup.com/gpu-specs/geforce-rtx-3060-ti.c3681}}. Since the GPU has $4\,864$ cores, a single core has a theoretical performance of $16.2 / 4\,864 \approx 0.0033$ TFLOPS or $3.3$ GFLOPS. This means that a single core can perform $3.3$ billion 32-bit floating point operations per second. In turn, this means that a single core can evaluate approximately $208$ million expressions per second. As a result, a single core would be able to evaluate all expressions of the first benchmark in less than a second, assuming the data is instantly accessible and no more FLOPS are required to evaluate an expression than already accounted for.
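The estimate can be summarised by the following back-of-the-envelope calculation, using only the numbers stated above; it is a sketch of the arithmetic, not a measurement.
\begin{verbatim}
# FLOPS per expression (counts as assumed above).
flops_operators = 4 * 1 + 1 * 3   # four simple operators, one x^y
flops_loads     = (2 + 1) * 3     # two variables and one parameter
flops_per_expr  = flops_operators + flops_loads   # = 16

gpu_flops  = 16.2e12              # theoretical FP32 performance
core_flops = gpu_flops / 4864     # ~3.3e9 FLOPS per core

exprs_per_core_per_second = core_flops / flops_per_expr  # ~208 million
time_first_benchmark = 250_000 / exprs_per_core_per_second  # ~0.0012 s
\end{verbatim}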
This calculation, however, is a very rough estimate. It takes into account neither the time spent waiting for data to arrive, nor the time it takes to schedule the threads on the actual cores, nor other overhead work or waiting times. Especially the time spent waiting for data to arrive is important, as all data resides in global memory, which is the slowest form of memory on a GPU. While loading memory is a three-instruction operation, it is very likely that the resulting machine code contains more instructions and therefore more FLOPS. Furthermore, both implementations contain many overhead instructions which are not accounted for in the above estimate. The interpreter loop, for example, contains many instructions that do not directly contribute to evaluating the expressions, such as branching and jumping instructions. Additionally, not all FLOPS operate on FP32 values. Some also operate on FP64 values, whose instructions are about 64 times slower than FP32 instructions on this GPU.
As seen in the results below, the benchmarks clearly show that the waiting time cannot be neglected in the performance estimation. Furthermore, the CPU side has been omitted entirely from the estimation. However, a significant part of the runtime is spent on the CPU, especially for the transpiler. Providing an estimation that incorporates both the waiting time and the overhead FLOPS is an involved process which is out of the scope of this thesis. Furthermore, no performance measurements of the runtime of a single kernel have been taken. While this would be interesting to get an idea of how much performance is lost compared to an ideal and optimistic scenario, performing this analysis would have taken too much time.
\section{Results}
\label{sec:results}
This section presents the results of the benchmarks described above. First, the results for the GPU-based interpreter and the GPU transpiler, alongside the performance tuning process, are presented in isolation. Finally, both GPU-based evaluators are compared with each other to determine which of them performs best. Additionally, these evaluators are compared against the CPU-based interpreter to answer the research questions of this thesis.
@@ -295,3 +306,7 @@ From this benchmark it can be concluded that the GPU heavily benefits from a lar
While the GPU is very limited in terms of the number of kernels that can be dispatched concurrently, the number of threads and blocks can be virtually unlimited. This means that a higher degree of parallelism is achievable with a higher number of data points. Increasing the number of expressions, on the other hand, does not influence the degree of parallelism to this extent. This is the reason no performance benefit was found when only decreasing the number of expressions while keeping the same number of data points.
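The following CUDA.jl sketch illustrates this relationship: the grid size, and with it the degree of parallelism, grows directly with the number of data points, while the kernel itself stays fixed. The kernel body is a placeholder and not the actual evaluator of this thesis.
\begin{verbatim}
using CUDA

# Placeholder kernel: one thread per data point.
function eval_kernel!(results, data)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(data)
        @inbounds results[i] = 2.0f0 * data[i]  # stand-in for an expression
    end
    return nothing
end

data    = CUDA.rand(Float32, 1_000_000)
results = CUDA.zeros(Float32, length(data))

threads = 256
blocks  = cld(length(data), threads)  # grid scales with the data points
@cuda threads=threads blocks=blocks eval_kernel!(results, data)
\end{verbatim}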
\subsection{Discussion}
A problem statement similar to that of this thesis has already been explored by \textcite{weinberger_vektoroperationen_2018}. In his thesis, he explored how vector operations can be utilised for evaluating expression trees generated with GP. He used OpenCL to, on the one hand, vectorise a CPU implementation and, on the other hand, utilise the GPU. Utilising the GPU to evaluate expressions generated at runtime, in this case using CUDA, has also been the focus of this thesis. However, the goal of this thesis was to compare two GPU implementations with each other and with a CPU implementation, specifically for the use in symbolic regression utilising parameter optimisation.
In his thesis, Weinberger found that the GPU was able to outperform the CPU in all instances. Especially with larger datasets, the advantage of the GPU was clearly visible. This trend was also confirmed in this thesis, specifically when comparing the second and third benchmarks. However, in this thesis, the CPU implementation was able to clearly outperform the GPU in two out of three benchmarks. This difference might be caused by the sophisticated usage of vectorisation in the CPU implementation which was used for comparison. Overall, this thesis was able to confirm the findings of Weinberger. Additionally, implementations are demonstrated that support the evaluation of expressions generated at runtime on the GPU while allowing the usage of parameter optimisation, which was not possible with Weinberger's implementation.


@@ -32,6 +32,8 @@ A typical GP generation generates multiple expressions at once. If for example a
Each expression is part of a search space of all possible expressions consisting of the defined operators, variables and constants up to a defined maximum length. With the help of GP, this search space is explored; however, the generated expressions might not perfectly fit the data. To further refine the generated expressions, the concept of parameter optimisation can be used as described by \textcite{kommenda_local_2018}. Parameter optimisation is a kind of local search where parameters $p$ are introduced into the generated equations. In Equation \ref{eq:example} the parameter $p_1$ will be modified over some number of iterations. This modification should assist in finding a local or even the global optimum by better fitting the expressions to the data. For example, $50$ local search steps can be used, meaning that each expression needs to be evaluated $50$ times with the same variables but different parameters. As a result, one GP generation requires a total of $300 * 50 = 15\,000$ evaluations of the expressions. However, typically more than one GP generation is needed to find a good solution. While the exact number of generations is problem specific, for this example a total of $100$ generations can be assumed. Each generation again generates $300$ expressions and needs to perform $50$ local search steps. This results in a total of $300 * 50 * 100 = 1\,500\,000$ evaluations which need to be performed during the entire runtime of the GP algorithm. These values have been taken from the GP algorithm for predicting discharge voltage curves of batteries as described by \textcite{kronberger_symbolic_2024}. Their GP algorithm converged after $54$ generations, resulting in $300 * 50 * 54 \approx 800\,000$ evaluations. This calculation omits the number of data points, which is the main contributor to the total runtime, since for each generated expression, every data point needs to be used for parametrising the variables, drastically increasing the number of evaluations. They used a total of $11\,000$ data points, resulting in a total of $800\,000 * 11\,000 = 8.8$ billion evaluations. Their results took over two days to compute on an eight-core desktop CPU. While they did not provide runtime information for all problems they tested, the voltage curve prediction was the slowest. The other problems were in the range of a few seconds up to a day. Especially the problems that took several hours or days to finish show that there is still room for performance improvements. While a better CPU with more cores can be used, it is interesting to determine whether using GPUs can yield noticeably better performance.
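As a compact summary, the evaluation counts from this example can be reproduced as follows; the figures mirror the ones given above, with the rounding noted in the comments.
\begin{verbatim}
population  = 300      # expressions per generation
local_steps = 50       # parameter optimisation steps
generations = 54       # generations until convergence
data_points = 11_000

evals_per_generation = population * local_steps       # 15_000
total_evals = evals_per_generation * generations      # 810_000, ~800_000
evals_with_data = 800_000 * data_points               # 8.8 billion
\end{verbatim}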
In his master's thesis, \textcite{weinberger_vektoroperationen_2018} explored the possibility of utilising vector operations in the field of GP. He mainly focused on vectorising the evaluation on the CPU and on utilising the GPU to evaluate the expression trees generated by a GP algorithm. By utilising OpenCL and an AMD GPU, he achieved a speed-up of two when utilising vectorisation on the CPU and a speed-up of 116 when utilising the GPU. This shows that the GPU also has great potential in the more specific case of symbolic regression with the parameter optimisation described above.
\section[GPGPU]{General Purpose Computation on Graphics Processing Units}
\label{sec:gpgpu}
Graphics processing units (GPUs) are commonly used to increase the performance of many different applications. Originally, they were designed to improve performance and visual quality in games. \textcite{dokken_gpu_2005} first described the usage of GPUs for general purpose programming (GPGPU). They have shown how the graphics pipeline can be used for GPGPU programming. Because this approach requires the programmer to also understand graphics terminology, it was not an ideal solution. Therefore, Nvidia released CUDA\footnote{\url{https://developer.nvidia.com/cuda-toolkit}} in 2007 with the goal of allowing developers to program GPUs independently of the graphics pipeline and its terminology. A study of the programmability of GPUs with CUDA and the resulting performance has been conducted by \textcite{huang_gpu_2008}. They found that GPGPU programming has potential, even for non-embarrassingly parallel problems.

Binary file not shown.


@@ -1299,3 +1299,11 @@
date = {2025},
file = {PCI Express 6.0 Specification | PCI-SIG:C\:\\Users\\danwi\\Zotero\\storage\\MSYN4ZIU\\pci-express-6.html:text/html},
}
@thesis{weinberger_vektoroperationen_2018,
title = {Vektoroperationen in der genetischen Programmierung},
institution = {University of Applied Sciences Upper Austria},
type = {mathesis},
author = {Weinberger, Patrick},
date = {2018},
}