benchmarking: updated transpiler to drastically reduce the number of transpilations at the expense of memory usage

2025-05-19 11:39:49 +02:00
parent 33e7edd4c8
commit f33551e25f
4 changed files with 48 additions and 69 deletions


@ -59,13 +59,10 @@ Results only for Interpreter (also contains final kernel configuration and proba
\subsection{Performance Tuning}
Document the process of performance tuning
Initial: CPU-Side single-threaded; up to 1024 threads per block; bounds-checking enabled (especially in kernel)
Initial: no cache; 256 blocksize; exprs pre-processed and sent to GPU on every call; vars sent on every call; frontend + dispatch are multithreaded
1.) Done before parameter optimisation loop: Frontend, transmitting Exprs and Variables (improved runtime)
1.) Blocksize reduced to a maximum of 256 -> moderate improvement in medium and large
2.) Using @inbounds -> noticeable improvement in 2 out of 3
3.) Tuned blocksize with Nsight Compute -> slight improvement
4.) Used int32 everywhere to reduce register usage -> significant performance drop (probably because of much longer waiting times, i.e. latency hiding no longer working, or because more type conversions happen on the GPU; look at the generated PTX code and use it as an argument to explain why it is slower)
5.) Reverted previous; used fastmath instead -> improvement (large var set is now faster than on transpiler)
\subsection{Transpiler}
Results only for Transpiler (also contains final kernel configuration and probably quick overview/recap of the implementation used and described in Implementation section)
@ -75,13 +72,9 @@ Results only for Transpiler (also contains final kernel configuration and probab
\subsection{Performance Tuning}
Document the process of performance tuning
Initial: CPU-Side single-threaded; up to 1024 threads per block; bounds-checking enabled
Initial: no cache; 256 blocksize; exprs pre-processed and transpiled on every call; vars sent on every call; frontend + transpilation + dispatch are multithreaded
1.) Blocksize reduced to a maximum of 256 -> moderate improvement in medium and large
2.) Using @inbounds -> small improvement only on CPU side code
3.) Tuned blocksize with Nsight Compute -> slight improvement
4.) Only changed things on interpreter side
5.) Only changed things on interpreter side
1.) Done before parameter optimisation loop: Frontend, transmitting Exprs and Variables (improved runtime)
\subsection{Comparison}
Comparison of Interpreter and Transpiler, as well as a comparison of the two with the CPU interpreter