concept and design: improved wording and added overview diagram of kernel usage
2025-04-10 10:21:01 +02:00
parent 258d33c338
commit c68e0d04a0
6 changed files with 138 additions and 53 deletions

@ -8,24 +8,24 @@ Explain the hardware used, as well as the actual data (how many expressions, var
Talk about what will be shown next: results for the interpreter alone, then for the transpiler, and then a comparison of the two with each other and with a CPU interpreter
\subsection{Interpreter}
Results for the interpreter only (also contains the final kernel configuration and probably a short overview/recap of the implementation described in the Implementation section)
\subsection{Performance tuning}
Document the process of performance tuning
Initial state: CPU side single-threaded; up to 1024 threads per block; bounds checking enabled (especially in the kernel)
1.) Block size reduced to a maximum of 256 threads $\rightarrow$ moderate improvement in the medium and large cases
2.) Using \texttt{@inbounds} $\rightarrow$ noticeable improvement in 2 of the 3 cases
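The two tuning steps above can be illustrated with a minimal Julia sketch. It is not the kernel used in this thesis: \texttt{launch\_config}, \texttt{kernel\_body!}, \texttt{emulate\_launch!} and \texttt{MAX\_THREADS\_PER\_BLOCK} are hypothetical names, and the grid is emulated sequentially on the CPU instead of being launched on a GPU. It assumes a one-dimensional launch with 1-based block and thread indices (as in CUDA.jl) and shows how capping the block size at 256 determines the number of blocks, and why \texttt{@inbounds} is only safe behind the guard that lets surplus threads in the last block exit.

```julia
# Hypothetical sketch, not the thesis implementation: a capped 1-D launch
# configuration plus a CPU emulation of the grid it produces.

const MAX_THREADS_PER_BLOCK = 256  # cap introduced in tuning step 1.)

# Return (threads, blocks) for `n` work items with a capped block size.
function launch_config(n::Integer; max_threads::Integer = MAX_THREADS_PER_BLOCK)
    n > 0 || throw(ArgumentError("n must be positive"))
    threads = min(n, max_threads)  # never more threads per block than items
    blocks = cld(n, threads)       # ceiling division covers the remainder
    return (threads = threads, blocks = blocks)
end

# Work done by one GPU-style thread; ids are 1-based as in CUDA.jl.
# Assumes length(out) == length(xs) (checked by the caller below).
function kernel_body!(out, xs, block_idx, thread_idx, block_dim)
    i = (block_idx - 1) * block_dim + thread_idx
    i <= length(xs) || return nothing  # surplus threads in the last block exit
    @inbounds out[i] = 2 * xs[i]       # bounds checks elided; the guard keeps i valid
    return nothing
end

# Emulate launching every block and thread of the grid sequentially on the CPU.
function emulate_launch!(out, xs)
    length(out) == length(xs) || throw(DimensionMismatch("out and xs differ in length"))
    cfg = launch_config(length(xs))
    for b in 1:cfg.blocks, t in 1:cfg.threads
        kernel_body!(out, xs, b, t, cfg.threads)
    end
    return out
end
```

For example, 1000 work items yield 4 blocks of 256 threads; the last block then contains 24 surplus threads that take the early exit.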
\subsection{Transpiler}
Results for the transpiler only (also contains the final kernel configuration and probably a short overview/recap of the implementation described in the Implementation section)
\subsection{Performance tuning}
Document the process of performance tuning
Initial state: CPU side single-threaded; up to 1024 threads per block; bounds checking enabled
1.) Block size reduced to a maximum of 256 threads $\rightarrow$ moderate improvement in the medium and large cases
2.) Using \texttt{@inbounds} $\rightarrow$ small improvement, only in the CPU-side code
\subsection{Comparison}
Comparison of the interpreter and the transpiler with each other, and of both with the CPU interpreter