\chapter[Conclusion]{Conclusion and Future Work}
\label{cha:conclusion}

This thesis has investigated how to best approach the evaluation of dynamically generated expressions for symbolic regression. The GPU was chosen as a cheap yet powerful tool to improve performance, especially compared to compute clusters, and numerous instances exist where utilising the GPU has led to drastic performance improvements across many fields of research. Two GPU evaluators were implemented to determine whether the GPU is more suitable than the CPU for evaluating expressions generated at runtime:
\begin{description}
\item[GPU Interpreter] \mbox{} \\
A stack-based interpreter that evaluates the expressions. The frontend converts the expressions into postfix notation to keep the implementation as simple as possible. It consists of a single kernel that evaluates all expressions separately.
\item[GPU Transpiler] \mbox{} \\
A transpiler that translates the expressions into PTX code, so that each expression is represented by its own kernel. These kernels are simpler than the single GPU interpreter kernel, but more effort is needed to generate them.
\end{description}

In total, three benchmarks were conducted to determine if, and under which circumstances, the GPU is the more suitable choice for evaluating the expressions. A CPU-based implementation serves as the baseline against which the GPU evaluators are compared. To answer the research questions, the benchmarks are structured as follows:
\begin{enumerate}
\item Roughly $250\,000$ expressions with $362$ data points have been evaluated. The goal of this benchmark was to determine how the evaluators handle large volumes of expressions.
\item Roughly $10\,000$ expressions with $362$ data points have been evaluated. This benchmark demonstrates how a change in the number of expressions impacts the performance of the evaluators, especially relative to one another.
\item Roughly $10\,000$ expressions with roughly $10\,000$ data points have been evaluated. By increasing the number of data points, this benchmark models a more realistic use case. Additionally, the larger number of data points should better exploit the strengths of the GPU.
\end{enumerate}

After conducting the first and second benchmarks, it was clear that the CPU is the better choice in these scenarios. The CPU was roughly four times faster than the GPU interpreter, and the GPU transpiler did not finish the first benchmark at all. This benchmark in particular demonstrated how the high RAM usage of the GPU transpiler implementation prevented it from finishing. Storing $250\,000$ compiled kernels requires a lot of RAM; however, compiling the PTX kernels just in time before they are executed is not a feasible alternative for reducing RAM usage. Since the PTX kernels need to be compiled into machine code before they can be executed, one alternative would be to use batch processing as a compromise between compiling ahead of time and just in time. Since it is not expected that these evaluators will need to evaluate hundreds of thousands of expressions, the non-trivial process of rewriting the implementation to support batch processing has not been undertaken; the sketch below merely outlines the idea.
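One possible batching scheme loads and launches the compiled kernels in fixed-size chunks and releases each chunk before the next one is compiled. The following sketch illustrates this idea in C using the CUDA driver API; it assumes an already initialised CUDA context, and all names (such as \texttt{ptx\_sources} and the kernel entry point \texttt{evaluate}) as well as the launch configuration are illustrative assumptions, not part of the actual implementation.

\begin{verbatim}
/* Sketch of batch processing with the CUDA driver API.
 * Assumes an initialised context; ptx_sources and the kernel
 * name "evaluate" are illustrative assumptions. */
#include <cuda.h>
#include <stdlib.h>

extern const char *ptx_sources[];   /* one PTX string per expression */

void evaluate_in_batches(size_t n_exprs, size_t batch_size,
                         CUdeviceptr data, CUdeviceptr results)
{
    CUmodule *mods = malloc(batch_size * sizeof(CUmodule));

    for (size_t start = 0; start < n_exprs; start += batch_size) {
        size_t n = (n_exprs - start < batch_size) ? n_exprs - start
                                                  : batch_size;

        /* Compile (JIT) only the kernels of the current batch. */
        for (size_t i = 0; i < n; ++i)
            cuModuleLoadData(&mods[i], ptx_sources[start + i]);

        /* Launch one kernel per expression; the launch
         * configuration is simplified for brevity. */
        for (size_t i = 0; i < n; ++i) {
            CUfunction kernel;
            void *args[] = { &data, &results };
            cuModuleGetFunction(&kernel, mods[i], "evaluate");
            cuLaunchKernel(kernel, 1, 1, 1, 256, 1, 1,
                           0, NULL, args, NULL);
        }
        cuCtxSynchronize();

        /* Free the compiled machine code before the next batch. */
        for (size_t i = 0; i < n; ++i)
            cuModuleUnload(mods[i]);
    }
    free(mods);
}
\end{verbatim}

With such a scheme, the batch size directly bounds how many compiled kernels reside in RAM at any point in time, trading per-batch compilation latency against peak memory usage.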
Reducing the number of expressions in the second benchmark demonstrated that the GPU transpiler can perform better than the GPU interpreter by roughly ten percent. However, in relation to the CPU implementation, no real change was observed between the first and second benchmarks, with the CPU being roughly five times faster. In the third benchmark, both GPU evaluators managed to outperform the CPU, with the GPU transpiler performing best. The GPU interpreter was roughly $1.6$ times faster and the GPU transpiler roughly $2$ times faster than the CPU interpreter. Furthermore, the GPU transpiler outperformed the GPU interpreter by roughly $1.2$ times.

To address the research questions, this thesis demonstrates that evaluating expressions generated at runtime can be more efficient on the GPU under specific conditions. Utilising the GPU becomes feasible when dealing with a high number of data points, typically in the thousands and above. For scenarios with fewer data points, the CPU remains the better choice. Additionally, in scenarios where RAM is abundant, the GPU transpiler implementation discussed in this thesis is the optimal choice. If too little RAM is available and the number of data points is sufficiently large, the GPU interpreter should be chosen, as it outperforms both the GPU transpiler and the CPU in such cases.

\section{Future Work}
This thesis demonstrated how the GPU can be used to accelerate the evaluation of expressions and therefore the symbolic regression algorithm as a whole. However, the boundaries at which it becomes more feasible to utilise the GPU need to be refined further. Therefore, more research into how the number of expressions and data points impacts performance is needed. Furthermore, only one dataset, with only two variables per data point, was used. Varying the number of variables per data point and studying its impact on performance would also be of interest.

The impact of the parameters was omitted from this thesis entirely. Further research on how the number of parameters impacts performance is of interest. Since parameters need to be transferred to the GPU frequently, having too many parameters could affect the GPU more negatively than the CPU. Alternatively, performing the entire parameter optimisation step on the GPU, and not just the evaluation, might also result in better performance, as the number of data transfers would be drastically reduced.

The current implementation also has flaws that can be improved in future work. As seen with the GPU transpiler in the first benchmark, reducing RAM usage is essential for very large problems with hundreds of thousands of expressions, or for very RAM-limited environments. Therefore, future work needs to rewrite the transpiler to support batch processing and conduct benchmarks with this new implementation. This will answer the question of whether batch processing allows the GPU transpiler to outperform the CPU and GPU interpreters in these scenarios. Additionally, it is of interest whether the batch-processing transpiler achieves the same or better performance in the other scenarios explored in this thesis.

Moreover, no shared memory is currently utilised, meaning the threads always need to retrieve the data from global memory. This is a slow operation, and efficiently utilising shared memory should further improve the performance of both GPU evaluators; a sketch of this idea is shown below.
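The following sketch illustrates this idea in CUDA~C, assuming a row-major data layout, a fixed block size of $256$ threads and two variables per data point; the kernel name and the placeholder expression body are illustrative only. Each block first stages its data points in shared memory, so that variables occurring multiple times in an expression are re-read from fast shared memory instead of global memory.

\begin{verbatim}
/* Sketch: staging data points in shared memory so that variables
 * occurring multiple times in an expression are re-read from fast
 * shared memory instead of global memory. Names are illustrative. */
#define BLOCK_SIZE 256
#define N_VARS 2            /* variables per data point (assumption) */

__global__ void evaluate(const float *data, size_t n_points,
                         float *results)
{
    __shared__ float tile[BLOCK_SIZE * N_VARS];
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;

    /* Each thread copies its data point into shared memory once. */
    if (idx < n_points)
        for (int v = 0; v < N_VARS; ++v)
            tile[threadIdx.x * N_VARS + v] = data[idx * N_VARS + v];
    __syncthreads();

    if (idx >= n_points)
        return;

    /* Placeholder for the generated or interpreted expression body;
     * all variable accesses now read from shared memory. */
    float x1 = tile[threadIdx.x * N_VARS + 0];
    float x2 = tile[threadIdx.x * N_VARS + 1];
    results[idx] = x1 * x2 + x1;
}
\end{verbatim}

How much this pays off depends on how often each variable is re-read during evaluation; for expressions that access each variable only once, the gain is expected to be small.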
Lastly, neither of the implementations supports special GPU instructions. The Fused Multiply-Add (FMA) instruction is of particular interest: given that multiplying two values and adding a third is a common operation, this instruction performs both operations with a single hardware instruction. The frontend can be extended to detect sub-expressions of this form and convert them into a special ternary opcode, enabling the backend to generate more efficient code; a sketch of such an opcode is given below. Whether the effort of detecting these sub-expressions is outweighed by the performance improvement needs to be determined in future work.
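As a minimal illustration, the following sketch extends the interpreter's dispatch loop with such a ternary case; the opcode value and all names are assumptions, not those of the actual implementation. CUDA's \texttt{fmaf} intrinsic compiles to a single hardware FMA instruction.

\begin{verbatim}
/* Sketch: a ternary FMA opcode in the stack-based interpreter.
 * The opcode value and all names are illustrative assumptions. */
#define OP_FMA 42

__device__ void dispatch(int opcode, float *stack, int *sp)
{
    switch (opcode) {
    /* ... existing unary and binary opcodes ... */
    case OP_FMA: {                 /* frontend-detected a * b + c */
        float c = stack[(*sp)--];
        float b = stack[(*sp)--];
        float a = stack[(*sp)--];
        /* fmaf() computes a * b + c with a single rounding step
         * and maps to one hardware FMA instruction. */
        stack[++(*sp)] = fmaf(a, b, c);
        break;
    }
    }
}
\end{verbatim}

The transpiler could achieve the same effect by directly emitting the PTX instruction \texttt{fma.rn.f32} for such an opcode.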