benchmarking: used int32 wherever possible; resulted in noticeable performance drop

2025-04-13 11:32:54 +02:00
parent 4c60331288
commit af3b72f196
8 changed files with 29 additions and 23 deletions
--- a/thesis/chapters/evaluation.tex
+++ b/thesis/chapters/evaluation.tex
@@ -21,6 +21,8 @@ Initial: CPU-Side single-threaded; up to 1024 threads per block; bounds-checking

 1.) Blocksize reduced to a maximum of 256 -> moderate improvement in medium and large
 2.) Using @inbounds -> noticeable improvement in 2 out of 3
+3.) Tuned blocksize with Nsight Compute -> slight improvement
+4.) Used int32 everywhere to reduce register usage -> significant performance drop (probably because of much longer wait times or more type conversions on the GPU; would need to inspect the PTX to confirm)

 \subsection{Transpiler}
 Results only for the Transpiler (also contains the final kernel configuration and probably a quick overview/recap of the implementation used and described in the Implementation section)
@@ -31,6 +33,8 @@ Initial: CPU-Side single-threaded; up to 1024 threads per block; bounds-checking

 1.) Blocksize reduced to a maximum of 256 -> moderate improvement in medium and large
 2.) Using @inbounds -> small improvement, only in the CPU-side code
+3.) Tuned blocksize with Nsight Compute -> slight improvement
+4.) Changes were made only on the interpreter side

 \subsection{Comparison}
 Comparison of the Interpreter and the Transpiler, as well as a comparison of the two with the CPU interpreter