benchmarking: reverted previous; made interpreter use fast math
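Fast math trades IEEE-754 accuracy for cheaper GPU instructions. A minimal CUDA.jl sketch of how this might look for the interpreter kernel (the kernel body is a stand-in, not the actual interpreter):

\begin{verbatim}
using CUDA

# Illustrative stand-in for the interpreter kernel: one thread per variable set.
function interpret_kernel!(results, vars)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(results)
        @inbounds results[i] = sin(vars[i]) * vars[i] + 1.0f0
    end
    return nothing
end

results = CUDA.zeros(Float32, 4096)
vars    = CUDA.rand(Float32, 4096)

# fastmath=true relaxes IEEE semantics (approximate sin, div, ...),
# analogous to nvcc's --use_fast_math.
@cuda threads=256 blocks=cld(length(results), 256) fastmath=true interpret_kernel!(results, vars)
\end{verbatim}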
@@ -2,8 +2,11 @@
\label{cha:conclusion}
Summarise the results
Reiterate that a typical input is often not complex enough to benefit (essentially repeat the statement from the comparison section of the evaluation chapter)
\section{Future Work}
talk about what can be improved
Transpiler: transpile expressions directly from the Julia AST -> would save time because no intermediate representation needs to be created (loses a step and gains performance, but also makes the transpiler itself more complex)
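A toy sketch of what direct AST transpilation could look like: walking a Julia Expr and emitting target source immediately, with no intermediate representation (the emitted syntax is illustrative):

\begin{verbatim}
# Walk a Julia expression tree and emit target source directly.
function emit(ex)::String
    if ex isa Symbol
        return string(ex)                        # variable reference
    elseif ex isa Number
        return string(Float32(ex))               # numeric literal
    elseif ex isa Expr && ex.head == :call
        op, args = ex.args[1], map(emit, ex.args[2:end])
        if op in (:+, :-, :*, :/)
            return "(" * join(args, " $op ") * ")"            # infix operator
        else
            return string(op, "(", join(args, ", "), ")")     # function call
        end
    else
        error("unsupported node: $ex")
    end
end

emit(:(x1 * 2.0 + sin(x2)))  # returns "((x1 * 2.0) + sin(x2))"
\end{verbatim}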
CPU Interpreter: probably more worthwhile to dive into parallelising the CPU interpreter itself (not really future work, as that alone would not warrant a paper)
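A minimal sketch of that parallelisation using Julia's built-in threading, with evaluate standing in for the actual CPU interpreter:

\begin{verbatim}
# Placeholder for the real expression interpreter.
evaluate(vars::Vector{Float32}) = sin(vars[1]) * vars[2] + 1.0f0

function interpret_all(varsets::Vector{Vector{Float32}})
    results = Vector{Float32}(undef, length(varsets))
    # Variable sets are independent, so the loop parallelises trivially.
    Threads.@threads for i in eachindex(varsets)
        results[i] = evaluate(varsets[i])
    end
    return results
end

interpret_all([rand(Float32, 2) for _ in 1:10_000])
\end{verbatim}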
@@ -22,7 +22,7 @@ Initial: CPU-Side single-threaded; up to 1024 threads per block; bounds-checking
1.) Block size reduced to a maximum of 256 -> moderate improvement in the medium and large benchmarks
2.) Using @inbounds -> noticeable improvement in two of the three benchmarks
3.) Tuned the block size with Nsight Compute -> slight improvement
4.) Used Int32 everywhere to reduce register usage -> significant performance drop (probably because latency hiding no longer works, or because more type conversions happen on the GPU; inspect the generated PTX code and use it to argue why this variant is slower; see the sketch after this list)
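A CUDA.jl sketch of the launch configuration from points 1-3 and of dumping the PTX to investigate point 4 (the kernel body is illustrative):

\begin{verbatim}
using CUDA

function eval_kernel!(results, vars)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(results)
        @inbounds results[i] = vars[i] * vars[i] + 1.0f0  # point 2: no bounds checks
    end
    return nothing
end

results = CUDA.zeros(Float32, 100_000)
vars    = CUDA.rand(Float32, 100_000)

# Points 1 and 3: block size capped at 256 threads, grid derived from it.
threads = 256
blocks  = cld(length(results), threads)
@cuda threads=threads blocks=blocks eval_kernel!(results, vars)

# Point 4: inspect the generated PTX for register usage and type conversions.
CUDA.@device_code_ptx @cuda threads=threads blocks=blocks eval_kernel!(results, vars)
\end{verbatim}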
\subsection{Transpiler}
Results only for the transpiler (also contains the final kernel configuration and probably a quick overview/recap of the implementation used and described in the implementation section)
@@ -37,4 +37,6 @@ Initial: CPU-Side single-threaded; up to 1024 threads per block; bounds-checking
4.) Only changed things on the interpreter side
\subsection{Comparison}
Comparison of the interpreter and the transpiler, as well as a comparison of both with the CPU interpreter
Talk about how the compute portion is simply too small. Only more complex expressions with a higher variable-set count benefit noticeably (run one or two performance evaluations with 10 larger expressions and at least 1000 variable sets, and present the results here to support that statement; see the sketch below)
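A sketch of how such an evaluation could be set up; the expression and sizes are placeholders for the planned 10 larger expressions and at-least-1000 variable sets:

\begin{verbatim}
using CUDA, BenchmarkTools

nvarsets = 1_000
vars     = CUDA.rand(Float32, 5, nvarsets)  # 5 variables per set
results  = CUDA.zeros(Float32, nvarsets)

function big_expr_kernel!(results, vars)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= size(vars, 2)
        @inbounds begin
            x1, x2, x3 = vars[1, i], vars[2, i], vars[3, i]
            x4, x5     = vars[4, i], vars[5, i]
            # Deliberately larger expression to raise the compute portion.
            results[i] = sin(x1)*cos(x2) + exp(x3)/(abs(x4) + 1.0f0) - sqrt(abs(x5))
        end
    end
    return nothing
end

@btime CUDA.@sync @cuda threads=256 blocks=cld($nvarsets, 256) big_expr_kernel!($results, $vars)
\end{verbatim}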