evaluation: found that benchmark 2 can't be executed by any implementation due to RAM constraints
@@ -16,7 +16,7 @@ export interpret_gpu,interpret_cpu
export evaluate_gpu

# Some assertions:
# Variables and parameters start their naming with "1" meaning the first variable/parameter has to be "x1/p1" and not "x0/p0"
# Variables and parameters start their indexing with "1" meaning the first variable/parameter has to be "x1/p1" and not "x0/p0"
# Matrix X is column major
# each index i in exprs has to have the matching values in the column i in Matrix X so that X[:,i] contains the values for expr[i]. The same goes for p
# This assertion is made, because in julia, the first index doesn't have to be 1
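A minimal sketch of inputs that satisfy these documented assumptions; the concrete expressions and values below are illustrative only, not taken from the thesis:

exprs = [:(x1 + p1), :(x1 * p1 + p2)]   # naming starts at x1/p1, not x0/p0
X = Float32[1.0 4.0; 2.0 5.0; 3.0 6.0]  # column i holds the values used by exprs[i]; Julia stores this column major
p = [Float32[0.5], Float32[0.5, 1.5]]   # one parameter vector per expression
# Julia arrays are not required to start at index 1 (e.g. with OffsetArrays),
# so an explicit check like this is one way to enforce the assumption:
@assert firstindex(X, 1) == 1 && firstindex(X, 2) == 1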
@@ -109,14 +109,4 @@ function interpret_cpu(exprs::Vector{Expr}, X::Matrix{Float32}, p::Vector{Vector
res
end

# Flow
# input: Vector expr == expressions contains e.g. 4 expressions
# Matrix X == |expr| columns, n rows. n == number of variables x1..xn; n is the same for all expressions --- WRONG
# Matrix X == k columns, n rows. k == number of variables in the expressions (every expression must have the same number of variables); n == number of different values for xk where k is the column
# VectorVector p == vector size |expr| containing vector size m. m == number of parameters per expression. p can be different for each expression
#
# The following can be done on the CPU
# convert expression to postfix notation (mandatory)
# optional: replace every parameter with the correct value (should only improve performance if data transfer is the bottleneck)

end
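A minimal sketch of the mandatory postfix conversion mentioned above, assuming purely binary operator calls; the helper to_postfix is illustrative and not the thesis implementation:

function to_postfix(ex)
    tokens = Any[]
    walk(node) = if node isa Expr && node.head == :call
        foreach(walk, node.args[2:end])  # operands first ...
        push!(tokens, node.args[1])      # ... operator last
    else
        push!(tokens, node)              # variable, parameter or constant
    end
    walk(ex)
    return tokens
end

to_postfix(:(x1 + p1 * x2))  # => [:x1, :p1, :x2, :*, :+]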
@@ -24,7 +24,7 @@ function interpret(cudaExprs, numExprs::Integer, exprsInnerLength::Integer,
cudaResults = CuArray{Float32}(undef, variableColumns, numExprs)

# Start kernel for each expression to ensure that no warp is working on different expressions
numThreads = min(variableColumns, 121)
numThreads = min(variableColumns, 128)
numBlocks = cld(variableColumns, numThreads)

Threads.@threads for i in 1:numExprs # multithreaded to speed up dispatching (seems to have improved performance)
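A self-contained sketch of this launch pattern with CUDA.jl; the kernel body is a placeholder and the argument list is an assumption, not the actual interpreter kernel:

using CUDA

# Placeholder kernel: one thread per variable set, one launch per expression.
function demo_kernel(results, exprIdx)
    col = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if col <= size(results, 1)
        results[col, exprIdx] = Float32(col + exprIdx)  # dummy computation
    end
    return nothing
end

variableColumns = 362   # e.g. the number of variable sets used in benchmark 1
numExprs = 4
cudaResults = CuArray{Float32}(undef, variableColumns, numExprs)

numThreads = min(variableColumns, 128)
numBlocks = cld(variableColumns, numThreads)

# One kernel launch per expression so no warp mixes expressions; the dispatch
# loop itself is multithreaded on the CPU, mirroring the loop above.
Threads.@threads for i in 1:numExprs
    @cuda threads=numThreads blocks=numBlocks demo_kernel(cudaResults, i)
end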
@@ -19,7 +19,7 @@ function evaluate(expressions::Vector{ExpressionProcessing.PostfixType}, cudaVar
# each expression produces one result per variable set (i.e. per column of the variables), and there are n expressions
cudaResults = CuArray{Float32}(undef, variableColumns, length(expressions))

threads = min(variableColumns, 256)
threads = min(variableColumns, 128)
blocks = cld(variableColumns, threads)

kernelName = "evaluate_gpu"
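For the transpiler path, a hedged sketch of how a generated PTX kernel named "evaluate_gpu" could be loaded and launched through CUDA.jl; the PTX file name and the parameter tuple are assumptions, not the thesis code:

using CUDA

ptx = read("evaluate_gpu.ptx", String)     # hypothetical transpiler output
md = CuModule(ptx)                          # JIT-compile the PTX module
kernel = CuFunction(md, "evaluate_gpu")     # look the kernel up by name

variableColumns = 362
cudaResults = CuArray{Float32}(undef, variableColumns, 1)

threads = min(variableColumns, 128)
blocks = cld(variableColumns, threads)

# Launch the pre-compiled kernel; the parameter tuple must match the PTX signature.
cudacall(kernel, Tuple{CuPtr{Cfloat}, Cint},
         pointer(cudaResults), Cint(variableColumns);
         threads=threads, blocks=blocks)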
@@ -49,6 +49,7 @@ Since only the evaluators are benchmarked, the expressions to be evaluated must

The first benchmark involves a very large set of roughly $250\,000$ expressions. This means that all $250\,000$ expressions are evaluated in a single generation when using GP. In a typical generation, significantly fewer expressions would be evaluated. However, this benchmark is designed to show how the evaluators can handle large volumes of data.

TODO:::: Remove this benchmark, as it just uses too much RAM
A second benchmark, with slight modifications to the first, is also conducted. Given that GPUs are very good at executing work in parallel, the number of variable sets is increased in this benchmark. Therefore, the second benchmark consists of the same $250\,000$ expressions, but the number of variable sets has been increased by a factor of 30 to a total of roughly $10\,000$ ($362 \times 30 = 10\,860$). This benchmark aims to demonstrate how the GPU is best utilised with a larger number of variable sets. A higher number of variable sets is also more representative of the scenarios in which the evaluators will be employed.

The third benchmark is conducted to demonstrate how the evaluators will perform in more realistic scenarios. For this benchmark, the number of expressions has been reduced to roughly $10\,000$, and the number of variable sets is again $362$. The purpose of this benchmark is to demonstrate how the evaluators are likely to perform in a typical scenario.
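A hedged sketch of how such a benchmark could be timed with BenchmarkTools.jl, reusing the illustrative exprs, X and p from the layout sketch further up; the call to interpret_gpu and the sampling parameters are assumptions, not necessarily the thesis setup:

using BenchmarkTools, Statistics

# interpret_gpu is the evaluator exported in the diff above; exprs, X and p
# stand in for the benchmark data described in this section.
b = @benchmark interpret_gpu($exprs, $X, $p) samples=50 seconds=300
println("mean ", mean(b.times) / 1e6, " ms, std ", std(b.times) / 1e6, " ms")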
@@ -85,6 +86,7 @@ The first benchmark consisted of $250\,000$ expressions and $362$ variable sets
% talk about kernel configuration (along the lines of: results achieved with block size of X) etc. Also include that CPU and GPU utilisation was 100% the entire time. If this is too short, just add it to the above paragraph and make the 4 benchmark sections relatively short, as the most interesting information is in the performance tuning and comparison sections anyway

\subsubsection{Benchmark 2}
TODO: Remove this benchmark; none of the implementations had enough RAM available

\subsubsection{Benchmark 3}
std of 750.1 ms
@@ -97,6 +99,9 @@ std of 750.1 ms

\subsubsection{Benchmark 4}

block size 128: 84.84 blocks, fast (probably because fewer threads are wasted)
block size 192: 56.56 blocks, very slow
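% hedged note: if benchmark 4 uses the same $362 \times 30 = 10\,860$ variable sets as benchmark 2, the block counts above follow directly from dividing the variable sets by the block size: $10\,860 / 128 = 84.84$ and $10\,860 / 192 = 56.56$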

\subsubsection{Performance Tuning} % either subsubsection or change the title to "Performance Tuning Interpreter"
Document the process of performance tuning (mostly GPU, but also talk about the CPU, especially the rearranging of data transfers and the non-usage of a cache)

@@ -116,11 +121,15 @@ Results only for Transpiler (also contains final kernel configuration and probab
\subsubsection{Benchmark 1}

\subsubsection{Benchmark 2}
TODO: Remove this benchmark

\subsubsection{Benchmark 3}
kernels can now be compiled at the same time as they are generated (should drastically improve performance)

\subsubsection{Benchmark 4}

Even larger variable sets would be ideal; 10k is rather small and the GPU barely has any work to do

\subsection{Performance Tuning}
Document the process of performance tuning

@@ -140,6 +149,7 @@ talk about that compute portion is just too little. Only more complex expression


\subsubsection{Benchmark 2}
TODO: Remove this benchmark
CPU did not finish due to RAM constraints

\subsubsection{Benchmark 3}
BIN thesis/main.pdf (binary file not shown)