\chapter{Evaluation}
\label{cha:evaluation}
The aim of this thesis is to determine whether at least one of the GPU evaluators is faster than the current CPU evaluator. This chapter describes the performance evaluation. First, the environment in which the performance tests are performed is explained. Then the individual results for the GPU interpreter and the transpiler are presented, including the performance tuning steps taken to achieve these results. Finally, the results of the GPU evaluators are compared with the CPU evaluator in order to answer the research questions of this thesis.

\section{Test environment}
Explain the hardware used, as well as the actual data (how many expressions, how many variables, etc.).

Three scenarios are benchmarked: few, normal, and many variable sets. Each expression is evaluated repeatedly to simulate parameter optimisation.

BenchmarkTools.jl is used to collect 1000 samples per scenario.
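A minimal sketch of how one such scenario could be benchmarked with BenchmarkTools.jl; the workload below is a stand-in, not the actual evaluator of this thesis:

\begin{verbatim}
using BenchmarkTools

# Stand-in workload: the names and the dummy expression are
# placeholders, not the GPU/CPU evaluators benchmarked here.
X = rand(Float32, 5, 1_000)            # e.g. 5 variables, 1000 variable sets
evaluate(X) = sum(x -> x^2 + 2x, X)    # dummy expression evaluation

# Collect 1000 samples for this scenario, one evaluation per sample.
b = @benchmark evaluate($X) samples=1000 evals=1
display(b)
\end{verbatim}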
\section{Results}
This section first presents the results for the GPU interpreter, followed by the results for the transpiler. Finally, the two are compared with each other and with the CPU interpreter.

\subsection{Interpreter}
This subsection presents the results for the interpreter only, together with the final kernel configuration and a brief recap of the implementation used, as described in the implementation chapter.

\subsection{Performance tuning}
Document the process of performance tuning.

The initial configuration used single-threaded CPU-side dispatch, up to 1024 threads per block, and bounds checking enabled (especially in the kernel). Starting from this baseline, the following tuning steps were taken:

\begin{enumerate}
	\item The maximum block size was reduced to 256 threads, which yielded a moderate improvement in the medium and large scenarios.
	\item Annotating array accesses with \verb|@inbounds| led to a noticeable improvement in two of the three scenarios.
	\item Tuning the block size with NSight Compute resulted in a slight improvement.
	\item Using \verb|Int32| everywhere to reduce register usage caused a significant performance drop. Likely causes are increased waiting time (latency hiding essentially no longer working) or additional type conversions happening on the GPU; inspect the generated PTX code and use it to argue why this variant is slower.
\end{enumerate}
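A minimal sketch of the kind of kernel-side changes behind these steps, assuming a CUDA.jl setup; the kernel body is a placeholder, not the interpreter kernel itself:

\begin{verbatim}
using CUDA

# Placeholder kernel standing in for the interpreter kernel.
function eval_kernel!(results, values)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(results)
        # Step 2: skip bounds checks inside the kernel.
        @inbounds results[i] = values[i] * values[i] + values[i]
    end
    return nothing
end

values  = CUDA.rand(Float32, 10_000)
results = CUDA.zeros(Float32, 10_000)

threads = 256                          # step 1: cap the block size at 256
blocks  = cld(length(results), threads)
@cuda threads=threads blocks=blocks eval_kernel!(results, values)

# Starting point for step 3, before fine-tuning with NSight Compute:
kernel = @cuda launch=false eval_kernel!(results, values)
config = launch_configuration(kernel.fun)

# For step 4, the generated PTX can be inspected to compare variants:
@device_code_ptx @cuda threads=threads blocks=blocks eval_kernel!(results, values)
\end{verbatim}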
\subsection{Transpiler}
This subsection presents the results for the transpiler only, together with the final kernel configuration and a brief recap of the implementation used, as described in the implementation chapter.

\subsection{Performance tuning}
Document the process of performance tuning.

The initial configuration used single-threaded CPU-side dispatch, up to 1024 threads per block, and bounds checking enabled. Starting from this baseline, the following tuning steps were taken:

\begin{enumerate}
	\item The maximum block size was reduced to 256 threads, which yielded a moderate improvement in the medium and large scenarios.
	\item Annotating array accesses with \verb|@inbounds| brought only a small improvement, and only in the CPU-side code.
	\item Tuning the block size with NSight Compute resulted in a slight improvement.
	\item The fourth step taken for the interpreter only changed code on the interpreter side and therefore does not apply to the transpiler.
\end{enumerate}
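A minimal sketch of the kind of CPU-side change behind step 2; the buffer-packing loop is a placeholder for the transpiler's host code, not the actual implementation:

\begin{verbatim}
# Placeholder for CPU-side host code that prepares data for the kernel.
function pack_variables!(buffer::Vector{Float32},
                         varsets::Vector{Vector{Float32}})
    idx = 1
    # Step 2: @inbounds on the host-side loop, the only place where
    # it made a (small) difference for the transpiler.
    @inbounds for vs in varsets, v in vs
        buffer[idx] = v
        idx += 1
    end
    return buffer
end

varsets = [rand(Float32, 5) for _ in 1:1_000]
buffer  = Vector{Float32}(undef, 5 * 1_000)
pack_variables!(buffer, varsets)
\end{verbatim}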
\subsection{Comparison}
Comparison of the interpreter and the transpiler with each other, as well as of both GPU evaluators with the CPU interpreter.

Talk about the fact that the compute portion of the workload is simply too small: only more complex expressions with a higher variable set count benefit well from the GPU. To support this statement, run one or two additional performance evaluations with ten larger expressions and at least 1000 variable sets, and present the results here.
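A minimal sketch of how such a comparison could be quantified with BenchmarkTools.jl; \verb|gpu_eval| and \verb|cpu_eval| are placeholder workloads, not the evaluators of this thesis:

\begin{verbatim}
using BenchmarkTools

# Placeholder workloads standing in for the GPU and CPU evaluators.
gpu_eval(X) = sum(abs2, X)
cpu_eval(X) = sum(abs2, X)

X = rand(Float32, 5, 1_000)   # at least 1000 variable sets

gpu_bench = @benchmark gpu_eval($X) samples=1000 evals=1
cpu_bench = @benchmark cpu_eval($X) samples=1000 evals=1

# Relative comparison of the median estimates.
println(ratio(median(gpu_bench), median(cpu_bench)))
println(judge(median(gpu_bench), median(cpu_bench)))
\end{verbatim}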