benchmarking: used int32 wherever possible; resulted in noticeable performance drop
Some checks are pending
CI / Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }} (x64, ubuntu-latest, 1.10) (push) Waiting to run
CI / Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }} (x64, ubuntu-latest, 1.6) (push) Waiting to run
CI / Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }} (x64, ubuntu-latest, pre) (push) Waiting to run
Some checks are pending
CI / Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }} (x64, ubuntu-latest, 1.10) (push) Waiting to run
CI / Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }} (x64, ubuntu-latest, 1.6) (push) Waiting to run
CI / Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }} (x64, ubuntu-latest, pre) (push) Waiting to run
This commit is contained in:
@ -21,6 +21,8 @@ Initial: CPU-Side single-threaded; up to 1024 threads per block; bounds-checking
|
||||
|
||||
1.) Blocksize reduced to a maximum of 256 -> moderate improvement in medium and large
|
||||
2.) Using @inbounds -> noticeable improvement in 2 out of 3
|
||||
3.) Tuned blocksize with NSight compute -> slight improvement
|
||||
4.) used int32 everywhere to reduce register usage -> significant performance drop (probably because a lot more waiting time, or more type conversions happening on GPU? would need to look at PTX)
|
||||
|
||||
\subsection{Transpiler}
|
||||
Results only for Transpiler (also contains final kernel configuration and probably quick overview/recap of the implementation used and described in Implementation section
|
||||
@ -31,6 +33,8 @@ Initial: CPU-Side single-threaded; up to 1024 threads per block; bounds-checking
|
||||
|
||||
1.) Blocksize reduced to a maximum of 256 -> moderate improvement in medium and large
|
||||
2.) Using @inbounds -> small improvement only on CPU side code
|
||||
3.) Tuned blocksize with NSight compute -> slight improvement
|
||||
4.) Only changed things on interpreter side
|
||||
|
||||
\subsection{Comparison}
|
||||
Comparison of Interpreter and Transpiler as well as Comparing the two with CPU interpreter
|
Reference in New Issue
Block a user