\chapter[Conclusion]{Conclusion and Future Work}
\label{cha:conclusion}
A typical system consists of a set of inputs and an observed output. An example is the flow in rough pipes as modelled by \textcite{nikuradse_laws_1950}, where the length, the diameter and the roughness of the pipes are the inputs and the flow through the pipe is the output. A mathematical model is needed to describe the correlation between these inputs and outputs. Finding such a model or formula can be done with a computer using symbolic regression. Symbolic regression is typically implemented using genetic programming. During a run, thousands or even hundreds of thousands of formulas or expressions are generated, which need to be evaluated to determine whether they describe the observed system with sufficient accuracy. On a single machine utilising only the CPU, this process can take several hours to days. Therefore, this thesis deals with the question of how the evaluation of the expressions generated at runtime can be sped up to minimise execution times.
Research has been conducted on how to best approach this problem statement. The GPU was chosen to improve performance, as it is a cheap and powerful tool, especially compared to compute clusters. Numerous instances exist where utilising the GPU led to drastic performance improvements across many fields of research.
Two GPU evaluators were implemented to determine whether the GPU is more suitable than the CPU for evaluating expressions generated at runtime. The two implementations are as follows:
\begin{description}
\item[GPU Interpreter] \mbox{} \\
A stack-based interpreter that evaluates the expressions. The frontend converts the expressions into postfix notation to keep the implementation as simple as possible. A single kernel is used to evaluate all expressions separately (a minimal sketch follows this list).
\item[GPU Transpiler] \mbox{} \\
A transpiler that translates each expression into PTX code, so that every expression is represented by its own unique kernel. These kernels are simpler than the single GPU interpreter kernel, but more effort is needed to generate them (a simplified transpilation sketch also follows the list).
\end{description}
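The following is a minimal sketch of how such an interpreter kernel could look in Julia with CUDA.jl. The opcode values, the postfix encoding and the array layout are illustrative assumptions and not the exact implementation of this thesis; each thread evaluates the expression for one variable set.
\begin{verbatim}
using CUDA, StaticArrays

const VAR = Int32(1)   # push variable[operand] onto the stack
const MUL = Int32(2)   # pop two values, push their product
const ADD = Int32(3)   # pop two values, push their sum

function interpret!(results, codes, operands, vars, nsets)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= nsets                        # one thread per variable set
        stack = MVector{8,Float32}(undef)
        top = 0
        for k in eachindex(codes)
            op = codes[k]
            if op == VAR
                top += 1
                stack[top] = vars[operands[k], i]
            elseif op == MUL
                stack[top-1] *= stack[top]; top -= 1
            elseif op == ADD
                stack[top-1] += stack[top]; top -= 1
            end
        end
        results[i] = stack[1]            # final value of the expression
    end
    return nothing
end

# (x1 * x1) + x2 in postfix notation, evaluated for 362 variable sets:
codes    = CuArray(Int32[VAR, VAR, MUL, VAR, ADD])
operands = CuArray(Int32[1, 1, 0, 2, 0])
vars     = CuArray(rand(Float32, 2, 362))
results  = CUDA.zeros(Float32, 362)
@cuda threads=256 blocks=cld(362, 256) interpret!(results, codes,
    operands, vars, Int32(362))
\end{verbatim}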
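For the GPU transpiler, the core task is mapping the postfix representation to PTX instructions. A hypothetical, heavily simplified sketch of this mapping in Julia is shown below; the real transpiler must additionally emit the kernel prologue, the loads of the variables into registers and the store of the result.
\begin{verbatim}
# Maps an opcode to the corresponding PTX instruction (assumed subset).
const PTX_OP = Dict(:add => "add.f32", :sub => "sub.f32",
                    :mul => "mul.f32", :div => "div.rn.f32")

function transpile(tokens)
    stack = String[]                  # PTX registers holding operands
    body  = String[]
    nreg  = 0
    for tok in tokens
        if tok isa Symbol             # an operator: pop two, emit, push
            b = pop!(stack); a = pop!(stack)
            nreg += 1; r = "%f$nreg"
            push!(body, "$(PTX_OP[tok]) $r, $a, $b;")
            push!(stack, r)
        else                          # a register preloaded with a variable
            push!(stack, tok)
        end
    end
    return join(body, "\n"), only(stack)
end

# (x1 * x1) + x2, with %v1 and %v2 preloaded with x1 and x2:
body, result = transpile(["%v1", "%v1", :mul, "%v2", :add])
\end{verbatim}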
In total, three benchmarks were conducted to determine if and under which circumstances the GPU is the more suitable choice for evaluating the expressions. The current CPU implementation serves as the baseline against which the GPU evaluators are compared. To answer the research questions, the benchmarks are structured as follows:
\begin{enumerate}
\item Roughly $250\,000$ expressions with $362$ variable sets have been evaluated. The goal of this benchmark was to determine how well the evaluators handle large volumes of expressions.
\item Roughly $10\,000$ expressions with $362$ variable sets have been evaluated. This benchmark demonstrates how a change in the number of expressions impacts the performance of the evaluators, especially relative to each other.
\item Roughly $10\,000$ expressions and roughly $10\,000$ variable sets have been evaluated. By increasing the number of variable sets, this benchmark models a more realistic use-case. Additionally, a larger number of variable sets should better exploit the strengths of the GPU.
\end{enumerate}
After conducting the first and second benchmarks it was clear that the CPU is the better choice in these scenarios. The first benchmark in particular demonstrated how the high RAM usage of the GPU transpiler led to it failing to finish this benchmark. Reducing the number of expressions showed that the GPU transpiler can perform better than the GPU interpreter; relative to the CPU implementation, however, no real change was observed between the first and second benchmark. In the third benchmark, both GPU evaluators managed to outperform the CPU, with the GPU transpiler performing best.
To address the research questions, this thesis demonstrates that evaluating expressions generated at runtime can be more efficient on the GPU under specific conditions. Utilising the GPU becomes feasible when dealing with a high number of variable sets, typically in the thousands and above. For scenarios with fewer variable sets, the CPU remains the better choice. Additionally, in scenarios where RAM is abundant, the GPU transpiler is the optimal choice. If too little RAM is available and the number of variable sets is sufficiently large, the GPU interpreter should be chosen, as it outperforms both the GPU transpiler and the CPU in such cases.
\section{Future Work}
This thesis demonstrated how the GPU can be used to accelerate the evaluation of expressions and therefore the symbolic regression algorithm as a whole. However, the boundaries at which utilising the GPU becomes more feasible are very coarse-grained. Therefore, more research into how the number of expressions and variable sets impacts performance is needed. Furthermore, only one dataset with only two variables per variable set was used. Investigating how varying the number of variables per set impacts performance could also be interesting. The impact of the parameters was omitted from this thesis entirely, so further research on how the number of parameters impacts performance is of interest. Since parameters need to be transferred to the GPU frequently, having too many parameters could impact the GPU more negatively than the CPU.
The current implementation also has flaws that can be improved in future work. Currently, no shared memory is utilised, meaning the threads always need to retrieve data from global memory. This is a slow operation, and efficiently utilising shared memory should further improve the performance of both GPU evaluators; a sketch of one possible approach follows.
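One possible refinement, sketched below under the assumption that all threads of a block evaluate the same expression, is to cache the opcodes in shared memory so that each block reads them from global memory only once. The bound MAX_CODE_LEN is an assumed compile-time constant, not a value from this thesis.
\begin{verbatim}
using CUDA

const MAX_CODE_LEN = 32   # assumed upper bound on expression length

function interpret_shared!(results, codes, operands, vars, nsets)
    s_codes = CuStaticSharedArray(Int32, MAX_CODE_LEN)
    s_opnds = CuStaticSharedArray(Int32, MAX_CODE_LEN)
    k = threadIdx().x                 # block-stride copy into shared memory
    while k <= length(codes)
        s_codes[k] = codes[k]
        s_opnds[k] = operands[k]
        k += blockDim().x
    end
    sync_threads()                    # opcodes now visible to every thread
    # ... stack-based evaluation as in the interpreter sketch, but reading
    # s_codes and s_opnds instead of global memory for every instruction
    return nothing
end
\end{verbatim}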
Additionally, neither of the implementations supports special GPU instructions. Especially the Fused Multiply-Add (FMA) instruction is of interest. Given that multiplying two values and adding a third is a common operation, this instruction performs both operations in a single instruction with a single rounding step. The frontend can be extended to detect sub-expressions of this form and convert them into a special ternary opcode, enabling the backend to generate more efficient code, as sketched below. Whether the effort of detecting these sub-expressions is outweighed by the performance improvement needs to be determined in future work.
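A sketch of such a frontend rewrite on a toy expression tree is given below; the node types are hypothetical stand-ins for whatever intermediate representation the frontend actually uses. In PTX, the resulting ternary opcode would lower to the fma.rn.f32 instruction, which Julia's fma function should also map to inside CUDA.jl kernels.
\begin{verbatim}
# Toy expression tree with a rewrite collapsing a*b + c into one FMA node.
abstract type Node end
struct Var <: Node; idx::Int end
struct Add <: Node; l::Node; r::Node end
struct Mul <: Node; l::Node; r::Node end
struct Fma <: Node; a::Node; b::Node; c::Node end

fuse(n::Var) = n
fuse(n::Mul) = Mul(fuse(n.l), fuse(n.r))
function fuse(n::Add)
    l, r = fuse(n.l), fuse(n.r)
    l isa Mul && return Fma(l.l, l.r, r)   # (a*b) + c  ->  fma(a, b, c)
    r isa Mul && return Fma(r.l, r.r, l)   # c + (a*b)  ->  fma(a, b, c)
    return Add(l, r)
end

# x1*x2 + x3 is rewritten into a single ternary node:
fuse(Add(Mul(Var(1), Var(2)), Var(3)))    # Fma(Var(1), Var(2), Var(3))
\end{verbatim}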