added information on how to best approach register assignment

2024-10-20 12:24:18 +02:00
parent 1f6b40b750
commit ee3c5001bd
1 changed files with 18 additions and 0 deletions
--- a/package/src/Transpiler.jl
+++ b/package/src/Transpiler.jl
@ -78,6 +78,24 @@ function culoadtest(N::Int32, op = "add.f32")
 		@time CUDA.@sync cudacall(func, Tuple{CuPtr{Cfloat},CuPtr{Cfloat},CuPtr{Cfloat},Cint}, d_a, d_b, d_c, N; threads=threadsPerBlock, blocks=blocksPerGrid)
 end

+# Number of threads per block/SM + max number of registers
+# https://docs.nvidia.com/cuda/cuda-c-programming-guide/#features-and-technical-specifications
+# Need to assume a max of 2048 threads per Streaming Multiprocessor (SM)
+# One SM can have 64*1024 32-bit registers at max
+# One thread can at max use 255 registers
+# Meaning one has access to at most 32 registers in the worst case. Using 64 bit values this number gets halfed (see: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#multiprocessor-level (almost at the end of the linked section))
+
+# I think I will go with max 16 registers for now and leave a better register allocation technique for future work
+# Maybe helpful for future tuning: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#maximum-number-of-registers-per-thread
+
+# https://docs.nvidia.com/cuda/cuda-c-programming-guide/#multiprocessor-level
+# This states, that using fewer registers allows more threads to reside on a single SM which improves performance. 
+# So I could use more registers at the expense for performance. Depending on how this would simplify my algorithm, I might do this and leave more optimisation to future work
+
+# Since the generated expressions should have between 10 and 50 symbols, I think allowing a max. of 128 32-bit registers should make for an easy algorithm. If during testing the result is slow, maybe try reducing the number of registers and perform more intelligent allocation/assignment
+# With 128 Registers, one could have 32 Warps on one SM ((128 * 16 = 2048) * 32 == 64*1024 == max number of registers per SM) This means 512 Threads per SM in the worst case
+
+
 const exitJumpLocationMarker = "\$L__BB0_2"
 function transpile(expression::ExpressionProcessing.PostfixType)
 	ptxBuffer = IOBuffer()