Compare commits

...

7 Commits

Author SHA1 Message Date
Daniel
6880c1ceb5 benchmarking: added uni performance test
2025-04-16 20:54:12 +02:00
c62aff806a small updates and notes for further writing
2025-04-15 19:32:39 +02:00
ef721b13e0 evaluation: updated notes for chapter
2025-04-13 14:20:16 +02:00
a5c34a53b7 benchmarking: reverted previous; made interpreter use fast math
2025-04-13 13:26:35 +02:00
6d6874c7ba benchmarking: added results for test 4 on uni-pc
2025-04-13 11:45:52 +02:00
af3b72f196 benchmarking: used int32 wherever possible; resulted in noticeable performance drop
2025-04-13 11:32:54 +02:00
4c60331288 evaluation: added introduction text and made plan for additional text
2025-04-12 16:22:14 +02:00
18 changed files with 45 additions and 14 deletions

View File

@@ -1,6 +1,6 @@
 MIT License
-Copyright (c) 2024 Daniel Wiplinger
+Copyright (c) 2024 Daniel Roth
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

View File

@@ -27,7 +27,7 @@ function interpret_gpu(exprs::Vector{Expr}, X::Matrix{Float32}, p::Vector{Vector
 results = Matrix{Float32}(undef, ncols, length(exprs))
-for i in 1:repetitions # Simulate parameter tuning
+for i in 1:repetitions # Simulate parameter tuning -> local search (X remains the same, p gets changed in small steps and must be performed sequentially)
 results = Interpreter.interpret(exprs, X, p)
 end
@@ -41,7 +41,7 @@ function evaluate_gpu(exprs::Vector{Expr}, X::Matrix{Float32}, p::Vector{Vector{
 results = Matrix{Float32}(undef, ncols, length(exprs))
-for i in 1:repetitions # Simulate parameter tuning
+for i in 1:repetitions # Simulate parameter tuning -> local search (X remains the same, p gets changed in small steps and must be performed sequentially)
 results = Transpiler.evaluate(exprs, X, p)
 end
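For context, a minimal sketch of the local-search pattern these comments describe. This is hypothetical: the benchmark itself keeps p fixed and only repeats the evaluation; the random nudge below is a stand-in for a real tuning step, and Interpreter.interpret is the call from the diff above.

# Hypothetical local-search loop: X stays fixed, p is changed in small steps,
# and each step depends on the previous one, so the loop must run sequentially.
function tune_parameters(exprs::Vector{Expr}, X::Matrix{Float32}, p::Vector{Vector{Float32}}; steps=100)
    results = nothing
    for _ in 1:steps
        results = Interpreter.interpret(exprs, X, p)  # evaluate with the current parameters
        p = [pi .+ 0.01f0 .* randn(Float32, length(pi)) for pi in p]  # small random step (stand-in)
    end
    return results
end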

View File

@@ -31,7 +31,7 @@ function interpret(expressions::Vector{Expr}, variables::Matrix{Float32}, parame
 # Start kernel for each expression to ensure that no warp is working on different expressions
 @inbounds for i in eachindex(exprs)
-kernel = @cuda launch=false interpret_expression(cudaExprs, cudaVars, cudaParams, cudaResults, cudaStepsize, i)
+kernel = @cuda launch=false fastmath=true interpret_expression(cudaExprs, cudaVars, cudaParams, cudaResults, cudaStepsize, i)
 # config = launch_configuration(kernel.fun)
 threads = min(variableCols, 128)
 blocks = cld(variableCols, threads)
@@ -104,7 +104,7 @@ function interpret_expression(expressions::CuDeviceArray{ExpressionElement}, var
 operationStack[operationStackTop] = sqrt(operationStack[operationStackTop])
 end
 else
-operationStack[operationStackTop] = NaN
+operationStack[operationStackTop] = NaN32
 break
 end
 end

View File

@@ -5,6 +5,10 @@ using .Transpiler
 using .Interpreter
 const BENCHMARKS_RESULTS_PATH = "./results-fh"
+# TODO: Expressions can get much, much bigger (into the millions) (will be provided by Mr. Kronberger)
+# TODO: Variable sets: 1000 can be considered the minimum; 100,000 can be considered the maximum (will be provided by Mr. Kronberger)
 exprsCPU = [
 # CPU interpreter requires an anonymous function and array refs
 :(p[1] * x[1] + p[2]), # 5 ops
@@ -24,7 +28,7 @@ exprsGPU = [
 # p is the same for CPU and GPU
 p = [randn(Float32, 10) for _ in 1:length(exprsCPU)] # generate 10 random parameter values for each expr
-expr_reps = 100 # 100 parameter optimisation steps basically
+expr_reps = 100 # 100 parameter optimisation steps (local search; sequential; only p changes, not X)
 @testset "CPU performance" begin
@@ -89,15 +93,15 @@ if compareWithCPU
 suite["CPU"]["large varset"] = @benchmarkable interpret_cpu(exprsCPU, X_large, p; repetitions=expr_reps)
 end
-X_small_GPU = randn(Float32, 5, varsets_small)
+X_small_GPU = randn(Float32, 5, varsets_small) # column-major
 suite["GPUI"]["small varset"] = @benchmarkable interpret_gpu(exprsGPU, X_small_GPU, p; repetitions=expr_reps)
 suite["GPUT"]["small varset"] = @benchmarkable evaluate_gpu(exprsGPU, X_small_GPU, p; repetitions=expr_reps)
-X_medium_GPU = randn(Float32, 5, varsets_medium)
+X_medium_GPU = randn(Float32, 5, varsets_medium) # column-major
 suite["GPUI"]["medium varset"] = @benchmarkable interpret_gpu(exprsGPU, X_medium_GPU, p; repetitions=expr_reps)
 suite["GPUT"]["medium varset"] = @benchmarkable evaluate_gpu(exprsGPU, X_medium_GPU, p; repetitions=expr_reps)
-X_large_GPU = randn(Float32, 5, varsets_large)
+X_large_GPU = randn(Float32, 5, varsets_large) # column-major
 suite["GPUI"]["large varset"] = @benchmarkable interpret_gpu(exprsGPU, X_large_GPU, p; repetitions=expr_reps)
 suite["GPUT"]["large varset"] = @benchmarkable evaluate_gpu(exprsGPU, X_large_GPU, p; repetitions=expr_reps)
@@ -143,9 +147,10 @@ if compareWithCPU
 println(gpuiVsGPUT_median)
 println(gpuiVsGPUT_std)
-BenchmarkTools.save("$BENCHMARKS_RESULTS_PATH/3-tuned-blocksize_I128_T96.json", results)
+BenchmarkTools.save("$BENCHMARKS_RESULTS_PATH/5-interpreter_using_fastmath.json", results)
 else
-resultsOld = BenchmarkTools.load("$BENCHMARKS_RESULTS_PATH/2-using_inbounds.json")[1]
+resultsOld = BenchmarkTools.load("$BENCHMARKS_RESULTS_PATH/3-tuned-blocksize_I128_T96.json")[1]
+# resultsOld = BenchmarkTools.load("$BENCHMARKS_RESULTS_PATH/3-tuned-blocksize_I128_T96.json")[1]
 medianGPUI_old = median(resultsOld["GPUI"])
 stdGPUI_old = std(resultsOld["GPUI"])

View File

@@ -26,5 +26,5 @@ end
 @testset "Transpiler Tuning" begin
-# CUDA.@profile evaluate_gpu(exprsGPU, X, p; repetitions=expr_reps)
+CUDA.@profile evaluate_gpu(exprsGPU, X, p; repetitions=expr_reps)
 end

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@@ -1,6 +1,8 @@
 using ExpressionExecutorCuda
 using Test
+using BenchmarkTools
 const baseFolder = dirname(dirname(pathof(ExpressionExecutorCuda)))
 include(joinpath(baseFolder, "src", "Utils.jl"))
 include(joinpath(baseFolder, "src", "ExpressionProcessing.jl"))

View File

@@ -1,3 +1,5 @@
+RE-READ to ensure the concepts explain why this is done to improve performance and why this should be the "locally best" implementation (most of this should be in the implementation chapter though)
 \chapter{Concept and Design}
 \label{cha:conceptdesign}
 % introduction to what needs to be done. also clarify the terms "Host" and "Device" here

View File

@@ -2,8 +2,11 @@
 \label{cha:conclusion}
 Summarise the results
+talk again about how a typical input is often not complex enough (basically repeat that statement from the comparison section in the evaluation)
 \section{Future Work}
 talk about what can be improved
-Transpiler: transpile expression directly from Julia AST -> would save time because no intermediate representation needs to be created (looses step and gains performance, but also makes transpiler itself more complex)
+Transpiler: transpile the expression directly from the Julia AST -> would save time because no intermediate representation needs to be created (loses a step and gains performance, but also makes the transpiler itself more complex)
+CPU Interpreter: probably more worthwhile to dive into parallelising the CPU interpreter itself (not really future work, as you wouldn't write a paper about that)

View File

@@ -1,9 +1,14 @@
 \chapter{Evaluation}
 \label{cha:evaluation}
+The aim of this thesis is to determine whether at least one of the GPU evaluators is faster than the current CPU evaluator. This chapter describes the performance evaluation. First, the environment in which the performance tests are performed is explained. Then the individual results for the GPU interpreter and the transpiler are presented. In addition, this part also includes the performance tuning steps taken to achieve these results. Finally, the results of the GPU evaluators are compared to the CPU evaluator in order to answer the research questions of this thesis.
 \section{Test environment}
+Explain the hardware used, as well as the actual data (how many expressions, variables etc.)
+three scenarios -> few, normal and many variable sets; expression repetitions to simulate parameter optimisation
+BenchmarkTools.jl -> 1000 samples per scenario (see the sketch below)
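For reference, a minimal sketch of how the 1000-samples-per-scenario setup looks with BenchmarkTools.jl; interpret_gpu, exprsGPU, p and expr_reps are the names used in the benchmark file in this compare, while the variable-set sizes here are illustrative assumptions:

using BenchmarkTools

BenchmarkTools.DEFAULT_PARAMETERS.samples = 1000  # 1000 samples per scenario
suite = BenchmarkGroup()
suite["GPUI"] = BenchmarkGroup()
X_small_GPU = randn(Float32, 5, 1_000)    # few variable sets (assumed size)
X_large_GPU = randn(Float32, 5, 100_000)  # many variable sets (assumed size)
suite["GPUI"]["small varset"] = @benchmarkable interpret_gpu(exprsGPU, X_small_GPU, p; repetitions=expr_reps)
suite["GPUI"]["large varset"] = @benchmarkable interpret_gpu(exprsGPU, X_large_GPU, p; repetitions=expr_reps)
results = run(suite, verbose=true)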
 \section{Results}
+talk about what we will see now (results first for the interpreter, then the transpiler, and then both compared with each other and with a CPU interpreter)
@@ -16,6 +21,9 @@ Initial: CPU-Side single-threaded; up to 1024 threads per block; bounds-checking
 1.) Blocksize reduced to a maximum of 256 -> moderate improvement in medium and large
 2.) Using @inbounds -> noticeable improvement in 2 out of 3
 3.) Tuned blocksize with Nsight Compute -> slight improvement
+4.) used Int32 everywhere to reduce register usage -> significant performance drop (probably because of a lot more waiting time, "latency hiding not working" basically, or more type conversions happening on the GPU? look at the generated PTX code and use that as an argument to describe why it is slower)
+5.) reverted the previous change; used fastmath instead -> improvement (the large var set is now faster than on the transpiler); see the sketch after this list
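A sketch of the change behind step 5, mirroring the kernel launch in the interpreter diff above (all variable names come from that diff); CUDA.@device_code_ptx is one way to dump the generated PTX mentioned in step 4:

using CUDA

# Step 5: fastmath=true lets the compiler emit faster, less precise math intrinsics
kernel = @cuda launch=false fastmath=true interpret_expression(cudaExprs, cudaVars, cudaParams, cudaResults, cudaStepsize, i)
threads = min(variableCols, 128)
blocks = cld(variableCols, threads)
kernel(cudaExprs, cudaVars, cudaParams, cudaResults, cudaStepsize, i; threads=threads, blocks=blocks)

# Step 4 follow-up: inspect the generated PTX to argue why the Int32 variant is slower
CUDA.@device_code_ptx @cuda launch=false interpret_expression(cudaExprs, cudaVars, cudaParams, cudaResults, cudaStepsize, i)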
 \subsection{Transpiler}
 Results only for the transpiler (also contains the final kernel configuration and probably a quick overview/recap of the implementation used and described in the Implementation section)
@@ -26,6 +34,11 @@ Initial: CPU-Side single-threaded; up to 1024 threads per block; bounds-checking
 1.) Blocksize reduced to a maximum of 256 -> moderate improvement in medium and large
 2.) Using @inbounds -> small improvement only in CPU-side code
 3.) Tuned blocksize with Nsight Compute -> slight improvement
+4.) Only changed things on the interpreter side
+5.) Only changed things on the interpreter side
 \subsection{Comparison}
-Comparison of Interpreter and Transpiler as well as Comparing the two with CPU interpreter
+Comparison of the interpreter and transpiler, as well as comparing the two with the CPU interpreter
+talk about how the compute portion is just too small. Only more complex expressions with higher variable-set counts benefit well (make one or two performance evaluations, with 10 larger expressions and at least 1k variable sets, and present that here as support for this statement)

View File

@@ -3,6 +3,8 @@
+somewhere in here explain why one kernel per expression and not one kernel for all expressions (see the sketch below)
+Go into the details of why this implementation is tuned towards performance and should be the optimum at that
 \section{Technologies}
 Short section; CUDA, PTX, Julia, CUDA.jl
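For the one-kernel-per-expression note above, a sketch of the launch pattern, condensed from the interpreter diff in this compare: a separate launch per expression keeps every warp on a single expression, so threads within a warp never diverge across expressions.

# One launch per expression: all threads of a launch evaluate the same
# expression over the variable sets, so warps never mix expressions.
for i in eachindex(exprs)
    kernel = @cuda launch=false interpret_expression(cudaExprs, cudaVars, cudaParams, cudaResults, cudaStepsize, i)
    threads = min(variableCols, 128)  # one thread per variable set, capped at 128
    blocks = cld(variableCols, threads)
    kernel(cudaExprs, cudaVars, cudaParams, cudaResults, cudaStepsize, i; threads=threads, blocks=blocks)
end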

Binary file not shown.