All Instructions: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#instructions

```
.version 7.1
.target sm_52
.address_size 64

	// .globl	VecAdd_kernel

.visible .entry VecAdd_kernel(
	.param .u64 VecAdd_kernel_param_0,
	.param .u64 VecAdd_kernel_param_1,
	.param .u64 VecAdd_kernel_param_2,
	.param .u32 VecAdd_kernel_param_3
)
{
	.reg .pred 	%p<2>;    // predicate registers: p0, p1 (needed for branching)
	.reg .f32 	%f<4>;    // float registers: f0 - f3
	.reg .b32 	%r<6>;    // 32-bit registers: r0 - r5 (raw bits without a type)
	.reg .b64 	%rd<11>;  // 64-bit registers: rd0 - rd10

	ld.param.u64 	%rd1, [VecAdd_kernel_param_0];  // rd1 = address of Data1
	ld.param.u64 	%rd2, [VecAdd_kernel_param_1];  // rd2 = address of Data2
	ld.param.u64 	%rd3, [VecAdd_kernel_param_2];  // rd3 = address of Result
	ld.param.u32 	%r2, [VecAdd_kernel_param_3];   // r2 = N
	mov.u32 	%r3, %ntid.x;                   // r3 = ntid.x (threads per block)
	mov.u32 	%r4, %ctaid.x;                  // r4 = ctaid.x (block index)
	mov.u32 	%r5, %tid.x;                    // r5 = tid.x (thread index within the block)
	mad.lo.s32 	%r1, %r3, %r4, %r5;             // r1 = low 32 bits of (r3 * r4) + r5 = global thread index i
	                                                // (.lo keeps the lower half of the 64-bit product)
	setp.ge.s32 	%p1, %r1, %r2;                  // p1 = r1 >= r2 (setp would assign !p1 to a second register if one was given)
	                                                // guard for the case where we start more threads than needed
	@%p1 bra 	\$L__BB0_2;                     // if (p1) branch to \$L__BB0_2 and return, else fall through and do the work

	cvta.to.global.u64 	%rd4, %rd1;             // rd4 = rd1 converted from a generic address to a global state space address
	mul.wide.s32 	%rd5, %r1, 4;                   // rd5 = r1 * 4 as a 64-bit value (byte offset of element i, 4 bytes per f32)
	add.s64 	%rd6, %rd4, %rd5;               // rd6 = rd4 + rd5 = address of Data1[i]
	cvta.to.global.u64 	%rd7, %rd2;             // same as the cvta above, for Data2
	add.s64 	%rd8, %rd7, %rd5;               // rd8 = rd7 + rd5 = address of Data2[i]
	ld.global.f32 	%f1, [%rd8];                    // f1 = f32 value read from global memory at address rd8 (Data2[i])
	ld.global.f32 	%f2, [%rd6];                    // f2 = f32 value read from global memory at address rd6 (Data1[i])
	" * op * " 	%f3, %f2, %f1;                  // custom binary operator: f3 = f2 op f1
	cvta.to.global.u64 	%rd9, %rd3;             // same as the cvta above, for Result (no load or aggregation happens here)
	add.s64 	%rd10, %rd9, %rd5;              // rd10 = rd9 + rd5 = address of Result[i]
	st.global.f32 	[%rd10], %f3;                   // write f3 to global memory at address rd10 (Result[i] = f3)
	                                                // each thread has its own i, so no thread overwrites another thread's result

\$L__BB0_2:
	ret;
}
```

For every thread whose index i = r3 * r4 + r5 is smaller than N, the above calculates this expression:

f3 = \*(x1 + i * 4) CUSTOM_OPERATOR \*(x2 + i * 4)

Here x1 and x2 are the base addresses of Data1 and Data2, and `*` in front of an address means "the f32 value stored at that address". In other words: Result[i] = Data1[i] CUSTOM_OPERATOR Data2[i].

# Plan

1. Generate PTX that only works with constant values and a single expression
1. Add support for loading variables and parameters (get vars/params as parameters -> Result is still only one number)
1. Add support for loading variables as a matrix (params are still only one value -> Result is now a vector)
1. Add support for loading parameters as a "sparse" matrix (not much should change)
1. Add support for multiple expressions (Result is now a matrix)
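
To sanity-check the reading of the `VecAdd_kernel` PTX above, here is a minimal CPU sketch in Python that replays what each GPU thread does. It is not part of the original notes: the function name `emulate_vec_op` and its parameters are made up for illustration, the two loops stand in for the hardware launching `nctaid_x` blocks of `ntid_x` threads, and the custom operator is assumed to be a plain two-argument function such as addition.

```python
import operator
import struct


def emulate_vec_op(data1, data2, n, ntid_x, nctaid_x, op=operator.add):
    """Hypothetical CPU replay of the PTX kernel; one loop iteration per GPU thread."""
    # Model global memory as raw bytes holding 4-byte floats (f32).
    mem1 = struct.pack(f"<{len(data1)}f", *data1)
    mem2 = struct.pack(f"<{len(data2)}f", *data2)
    result = bytearray(4 * n)

    for ctaid_x in range(nctaid_x):      # block index (%ctaid.x)
        for tid_x in range(ntid_x):      # thread index within the block (%tid.x)
            # mad.lo.s32: ntid.x * ctaid.x + tid.x, kept to 32 bits
            # (sign handling is ignored; fine for small non-negative indices).
            i = (ntid_x * ctaid_x + tid_x) & 0xFFFFFFFF
            # setp.ge.s32 + @%p1 bra: surplus threads skip the work and return.
            if i >= n:
                continue
            # mul.wide.s32: byte offset of element i (4 bytes per f32).
            offset = i * 4
            # ld.global.f32: read the values stored at the computed addresses.
            (f2,) = struct.unpack_from("<f", mem1, offset)
            (f1,) = struct.unpack_from("<f", mem2, offset)
            # Custom operator, then st.global.f32 into Result at the same offset.
            struct.pack_into("<f", result, offset, op(f2, f1))

    return list(struct.unpack(f"<{n}f", result))


# 2 blocks of 4 threads = 8 threads for 5 elements, so 3 threads hit the guard.
print(emulate_vec_op([1.0, 2.0, 3.0, 4.0, 5.0],
                     [10.0, 20.0, 30.0, 40.0, 50.0],
                     n=5, ntid_x=4, nctaid_x=2))
# → [11.0, 22.0, 33.0, 44.0, 55.0]
```

Each thread only touches the bytes at its own offset `i * 4`, which is why no thread overwrites another thread's result.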