8 Commits

Author SHA1 Message Date
1dc0c1898d benchmarking: fixed bugs; took initial_benchmark
Some checks are pending
CI / Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }} (x64, ubuntu-latest, 1.10) (push) Waiting to run
CI / Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }} (x64, ubuntu-latest, 1.6) (push) Waiting to run
CI / Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }} (x64, ubuntu-latest, pre) (push) Waiting to run
2025-03-30 12:54:50 +02:00
ad175abac0 benchmarking: added gpu evaluators to performance testing. getting execution errors still
2025-03-29 13:35:59 +01:00
690ee33db1 benchmarks: started preparing benchmarks
2025-03-29 12:01:06 +01:00
effd477558 transpiler: generates valid PTX and evaluates expressions correctly
2025-03-28 19:32:48 +01:00
9df78ca72e transpiler: invalid memory access error finally fixed
2025-03-27 22:32:24 +01:00
561b37160b transpiler: trying to fix problem with writing to global memory; not yet fixed
2025-03-27 18:00:31 +01:00
eaee21ca75 transpiler: results are now written in results array; preparation for performance testing
2025-03-27 09:55:29 +01:00
baa37ea183 code: started finalising transpilation process and preparing for performance testing and tuning
Some checks failed
CI / Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }} (x64, ubuntu-latest, 1.10) (push) Has been cancelled
CI / Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }} (x64, ubuntu-latest, 1.6) (push) Has been cancelled
CI / Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }} (x64, ubuntu-latest, pre) (push) Has been cancelled
2025-03-23 13:38:22 +01:00
17 changed files with 523 additions and 235 deletions


@ -1,15 +1,17 @@
name = "ExpressionExecutorCuda"
uuid = "5b8ee377-1e19-4ba5-a85c-78c7d1694bfe"
authors = ["Daniel Wiplinger"]
authors = ["Daniel Roth"]
version = "1.0.0-DEV"
[deps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
[compat]
LinearAlgebra = "1.11.0"
Printf = "1.11.0"
Random = "1.11.0"
julia = "1.6.7"


@ -1,6 +1,8 @@
module ExpressionExecutorCuda
include("Utils.jl")
include("ExpressionProcessing.jl")
include("Interpreter.jl")
include("Transpiler.jl")
module CpuInterpreter
include("Code.jl")
@ -13,18 +15,37 @@ export test
# Some assertions:
# Variables and parameters start their naming with "1" meaning the first variable/parameter has to be "x1/p1" and not "x0/p0"
# Matrix X is column major
# each index i in exprs has to have the matching values in the column i in Matrix X so that X[:,i] contains the values for expr[i]. The same goes for p
# This assertion is made, because in julia, the first index doesn't have to be 1
#
# Evaluate Expressions on the GPU
function interpret_gpu(exprs::Vector{Expr}, X::Matrix{Float32}, p::Vector{Vector{Float32}})::Matrix{Float32}
exprsPostfix = ExpressionProcessing.expr_to_postfix(exprs[1])
function interpret_gpu(exprs::Vector{Expr}, X::Matrix{Float32}, p::Vector{Vector{Float32}}; repetitions=1)::Matrix{Float32}
@assert axes(exprs) == axes(p)
ncols = size(X, 2)
results = Matrix{Float32}(undef, ncols, length(exprs))
for i in 1:repetitions # Simulate parameter tuning
results = Interpreter.interpret(exprs, X, p)
end
return results
end
# Convert Expressions to PTX Code and execute that instead
function evaluate_gpu(exprs::Vector{Expr}, X::Matrix{Float32}, p::Vector{Vector{Float32}})::Matrix{Float32}
# Look into this to maybe speed up PTX generation: https://cuda.juliagpu.org/stable/tutorials/introduction/#Parallelization-on-the-CPU
function evaluate_gpu(exprs::Vector{Expr}, X::Matrix{Float32}, p::Vector{Vector{Float32}}; repetitions=1)::Matrix{Float32}
@assert axes(exprs) == axes(p)
ncols = size(X, 2)
results = Matrix{Float32}(undef, ncols, length(exprs))
for i in 1:repetitions # Simulate parameter tuning
results = Transpiler.evaluate(exprs, X, p)
end
return results
end
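
Review note: the result layout used by both wrappers (one row per variable set, one column per expression) can be illustrated without a GPU. The following is a minimal CPU-only sketch, not the package's implementation. `evaluate_cpu` and the closures in `fs` are hypothetical stand-ins for the GPU evaluators and the `Expr` inputs, and it assumes, as the kernel code does, that every column of `X` is one variable set applied to all expressions.

```julia
# CPU-only sketch; `evaluate_cpu` is a hypothetical stand-in for interpret_gpu/evaluate_gpu.
# Each column of X is one variable set; each expression gets its own parameter vector.
function evaluate_cpu(fs::Vector{<:Function}, X::Matrix{Float32}, p::Vector{Vector{Float32}})::Matrix{Float32}
    @assert axes(fs) == axes(p)  # one parameter vector per expression
    nsets = size(X, 2)
    results = Matrix{Float32}(undef, nsets, length(fs))
    for (j, f) in enumerate(fs), i in 1:nsets
        results[i, j] = f(view(X, :, i), p[j])  # row = variable set, column = expression
    end
    return results
end

# Two hypothetical expressions: x1 * p1 + x2 and x1 - p1 * p2
fs = Function[(x, p) -> x[1] * p[1] + x[2], (x, p) -> x[1] - p[1] * p[2]]
X = Float32[1 2 3; 10 20 30]   # 2 variables, 3 variable sets
p = [Float32[2], Float32[3, 4]]
R = evaluate_cpu(fs, X, p)     # 3x2 matrix
```

The `axes` check mirrors the assertion in the wrappers: it rejects inputs where expressions and parameter vectors are indexed differently, not just inputs of different length.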


@ -71,6 +71,10 @@ function get_operator(op::Symbol)::Operator
return EXP
elseif op == :sqrt
return SQRT
elseif op == :powabs
return POWER # TODO: Fix this
else
throw("Operator unknown")
end
end
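
Review note on the `:powabs` TODO: the hunk maps `:powabs` to `POWER`, and the transpiler lowers `POWER` to an `lg2` / `mul` / `ex2` sequence. A minimal sketch of what a dedicated lowering might compute is shown below. It assumes that `powabs(x, y)` means `abs(x)^y`, which the diff does not confirm, and both function names are hypothetical.

```julia
# Hypothetical helper names; assumes powabs(x, y) is defined as abs(x)^y.
# pow_via_log2 mirrors the lg2/mul/ex2 sequence: x^y == 2^(y * log2(x)), valid for x > 0.
pow_via_log2(x::Float32, y::Float32) = exp2(y * log2(x))

# Taking abs(x) first keeps the logarithm defined for negative bases.
powabs_via_log2(x::Float32, y::Float32) = exp2(y * log2(abs(x)))

a = pow_via_log2(3f0, 2f0)       # positive base: both variants agree
b = powabs_via_log2(3f0, 2f0)
c = powabs_via_log2(-2f0, 2f0)   # negative base is only handled by the abs variant
```

Note that Julia's `log2` throws a `DomainError` for negative real input, so `pow_via_log2` is only called with a positive base here.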


@ -2,6 +2,7 @@ module Interpreter
using CUDA
using StaticArrays
using ..ExpressionProcessing
using ..Utils
export interpret
@ -11,19 +12,25 @@ export interpret
- variables::Matrix{Float32} : The variables to use. Each column is mapped to the variables x1..xn
- parameters::Vector{Vector{Float32}} : The parameters to use. Each Vector contains the values for the parameters p1..pn. The number of parameters can be different for every expression
"
function interpret(expressions::Vector{ExpressionProcessing.PostfixType}, variables::Matrix{Float32}, parameters::Vector{Vector{Float32}})::Matrix{Float32}
function interpret(expressions::Vector{Expr}, variables::Matrix{Float32}, parameters::Vector{Vector{Float32}})::Matrix{Float32}
exprs = Vector{ExpressionProcessing.PostfixType}(undef, length(expressions))
for i in eachindex(expressions)
exprs[i] = ExpressionProcessing.expr_to_postfix(expressions[i])
end
variableCols = size(variables, 2) # number of variable sets to use for each expression
cudaVars = CuArray(variables)
cudaParams = create_cuda_array(parameters, NaN32) # column corresponds to data for one expression
cudaExprs = create_cuda_array(expressions, ExpressionElement(EMPTY, 0)) # column corresponds to data for one expression
cudaParams = Utils.create_cuda_array(parameters, NaN32) # column corresponds to data for one expression
cudaExprs = Utils.create_cuda_array(exprs, ExpressionElement(EMPTY, 0)) # column corresponds to data for one expression
# put into separate cuArray, as this is static and would be inefficient to send separately to every kernel
cudaStepsize = CuArray([get_max_inner_length(expressions), get_max_inner_length(parameters), size(variables, 1)]) # max num of values per expression; max num of parameters per expression; number of variables per expression
cudaStepsize = CuArray([Utils.get_max_inner_length(exprs), Utils.get_max_inner_length(parameters), size(variables, 1)]) # max num of values per expression; max num of parameters per expression; number of variables per expression
# each expression has nr. of variable sets (nr. of columns of the variables) results and there are n expressions
cudaResults = CuArray{Float32}(undef, variableCols, length(expressions))
cudaResults = CuArray{Float32}(undef, variableCols, length(exprs))
# Start kernel for each expression to ensure that no warp is working on different expressions
for i in eachindex(expressions)
for i in eachindex(exprs)
kernel = @cuda launch=false interpret_expression(cudaExprs, cudaVars, cudaParams, cudaResults, cudaStepsize, i)
config = launch_configuration(kernel.fun)
threads = min(variableCols, config.threads)
@ -38,19 +45,23 @@ end
#TODO: Add @inbounds to all indexing after it is verified that all works https://cuda.juliagpu.org/stable/development/kernel/#Bounds-checking
const MAX_STACK_SIZE = 25 # The depth of the stack to store the values and intermediate results
function interpret_expression(expressions::CuDeviceArray{ExpressionElement}, variables::CuDeviceArray{Float32}, parameters::CuDeviceArray{Float32}, results::CuDeviceArray{Float32}, stepsize::CuDeviceArray{Int}, exprIndex::Int)
index = (blockIdx().x - 1) * blockDim().x + threadIdx().x # ctaid.x * ntid.x + tid.x
stride = gridDim().x * blockDim().x # nctaid.x * ntid.x
varSetIndex = (blockIdx().x - 1) * blockDim().x + threadIdx().x # ctaid.x * ntid.x + tid.x (1-based)
# stride = gridDim().x * blockDim().x # nctaid.x * ntid.x
variableCols = length(variables) / stepsize[3]
if varSetIndex > variableCols
return
end
firstExprIndex = ((exprIndex - 1) * stepsize[1]) + 1 # Inclusive
lastExprIndex = firstExprIndex + stepsize[1] - 1 # Inclusive
firstParamIndex = ((exprIndex - 1) * stepsize[2]) # Exclusive
variableCols = length(variables) / stepsize[3]
operationStack = MVector{MAX_STACK_SIZE, Float32}(undef) # Try to get this to function with variable size too, to allow better memory usage
operationStackTop = 0 # stores index of the last defined/valid value
for varSetIndex in index:stride
firstVariableIndex = ((varSetIndex - 1) * stepsize[3]) # Exclusive
# for varSetIndex in index:stride
firstVariableIndex = ((varSetIndex-1) * stepsize[3]) # Exclusive
for i in firstExprIndex:lastExprIndex
if expressions[i].Type == EMPTY
@ -62,7 +73,7 @@ function interpret_expression(expressions::CuDeviceArray{ExpressionElement}, var
if val > 0
operationStack[operationStackTop] = variables[firstVariableIndex + val]
else
val = -val
val = abs(val)
operationStack[operationStackTop] = parameters[firstParamIndex + val]
end
elseif expressions[i].Type == FLOAT32
@ -103,49 +114,9 @@ function interpret_expression(expressions::CuDeviceArray{ExpressionElement}, var
# "+ varSetIndex" -> to get the row inside the column at which to insert the result of the variable set (variable set = row)
resultIndex = convert(Int, (exprIndex - 1) * variableCols + varSetIndex) # Inclusive
results[resultIndex] = operationStack[operationStackTop]
end
# end
return
end
"Retrieves the number of entries for the largest inner vector"
function get_max_inner_length(vec::Vector{Vector{T}})::Int where T
maxLength = 0
@inbounds for i in eachindex(vec)
if length(vec[i]) > maxLength
maxLength = length(vec[i])
end
end
return maxLength
end
"Returns a CuArray filled with the data provided. The inner vectors do not have to have the same length. All missing elements will be the value ```invalidElement```"
function create_cuda_array(data::Vector{Vector{T}}, invalidElement::T)::CuArray{T} where T
dataCols = get_max_inner_length(data)
dataRows = length(data)
dataMat = convert_to_matrix(data, invalidElement)
cudaArr = CuArray{T}(undef, dataCols, dataRows) # length(parameters) == number of expressions
copyto!(cudaArr, dataMat)
return cudaArr
end
"Converts a vector of vectors into a matrix. The inner vectors do not need to have the same length.
All entries that cannot be filled have ```invalidElement``` as their value
"
function convert_to_matrix(vec::Vector{Vector{T}}, invalidElement::T)::Matrix{T} where T
vecCols = get_max_inner_length(vec)
vecRows = length(vec)
vecMat = fill(invalidElement, vecCols, vecRows)
for i in eachindex(vec)
vecMat[:,i] = copyto!(vecMat[:,i], vec[i])
end
return vecMat
end
end
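
Review note: the kernel above evaluates one postfix expression per thread with a fixed-size stack. The same control flow can be sketched on the CPU as follows. This is an illustration under assumptions, not the kernel itself: `Token`, `TokenKind` and the operator codes are hypothetical (the real code uses `ExpressionElement`), and only two binary operators are modelled. The index convention follows the kernel: a positive index selects a variable, a negative index selects a parameter.

```julia
# Hypothetical token types for illustration; the real kernel uses ExpressionElement.
@enum TokenKind INDEX_TOKEN VALUE_TOKEN OP_TOKEN

struct Token
    kind::TokenKind
    value::Float32   # index (>0 variable, <0 parameter), literal value, or operator code
end

const ADD_OP = 1f0
const MUL_OP = 2f0

function eval_postfix(tokens::Vector{Token}, x::Vector{Float32}, p::Vector{Float32})::Float32
    stack = Float32[]
    for t in tokens
        if t.kind == INDEX_TOKEN
            idx = Int(t.value)
            push!(stack, idx > 0 ? x[idx] : p[abs(idx)])
        elseif t.kind == VALUE_TOKEN
            push!(stack, t.value)
        else
            right = pop!(stack)   # operands come off the stack in reverse order
            left = pop!(stack)
            push!(stack, t.value == ADD_OP ? left + right : left * right)
        end
    end
    return only(stack)            # a well-formed expression leaves exactly one value
end

# x1 * p1 + 2 in postfix order: x1 p1 * 2 +
tokens = [Token(INDEX_TOKEN, 1), Token(INDEX_TOKEN, -1), Token(OP_TOKEN, MUL_OP),
          Token(VALUE_TOKEN, 2), Token(OP_TOKEN, ADD_OP)]
result = eval_postfix(tokens, Float32[3], Float32[4])
```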


@ -1,55 +1,95 @@
module Transpiler
using CUDA
using ..ExpressionProcessing
using ..Utils
# Number of threads per block/SM + max number of registers
# https://docs.nvidia.com/cuda/cuda-c-programming-guide/#features-and-technical-specifications
# Need to assume a max of 2048 threads per Streaming Multiprocessor (SM)
# One SM can have 64*1024 32-bit registers at max
# One thread can at max use 255 registers
# This means a thread has access to at most 32 registers in the worst case. With 64-bit values this number is halved (see: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#multiprocessor-level (almost at the end of the linked section))
# Maybe helpful for future performance tuning: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#maximum-number-of-registers-per-thread
# https://docs.nvidia.com/cuda/cuda-c-programming-guide/#multiprocessor-level
# This states, that using fewer registers allows more threads to reside on a single SM which improves performance.
# So I could use more registers at the expense of performance. Depending on how this would simplify my algorithm, I might do this and leave more optimisation to future work
# Since the generated expressions should have between 10 and 50 symbols, I think allowing a max. of 128 32-bit registers should make for an easy algorithm. If during testing the result is slow, maybe try reducing the number of registers and perform more intelligent allocation/assignment
# With 128 registers per thread, one could have 16 warps on one SM ((128 * 32 = 4096 registers per warp) * 16 == 64*1024 == max number of registers per SM). This means 512 threads per SM in the worst case
#
# Make a "function execute(...)" that takes the data and the transpiled code. Pass the data to the kernel and start executing
# Note: Maybe make an additional function that transpiles and executes the code. This would then be the function the user calls
#
const BYTES = sizeof(Float32)
const Operand = Union{Float32, String} # Operand is either fixed value or register
cache = Dict{Expr, CuFunction}() # needed if multiple runs with the same expr but different parameters are performed
function evaluate(expression::ExpressionProcessing.PostfixType, variables::Matrix{Float32}, parameters::Vector{Vector{Float32}})
# TODO: think of how to do this. Probably get all expressions. Transpile them in parallel and then execute the generated code.
cudaVars = CuArray(variables)
function evaluate(expressions::Vector{Expr}, variables::Matrix{Float32}, parameters::Vector{Vector{Float32}})::Matrix{Float32}
varRows = size(variables, 1)
variableCols = size(variables, 2)
kernels = Vector{CuFunction}(undef, length(expressions))
# Test this parallel version again when doing performance tests. With the simple "functionality" tests this took 0.03 seconds while sequential took "0.00009" seconds
# Threads.@threads for i in eachindex(expressions)
# TODO: Use cache
# kernel = transpile(expressions[i], varRows, Utils.get_max_inner_length(parameters))
#kernel = transpile(expression, )
# execute kernel.
# linker = CuLink()
# add_data!(linker, "ExpressionProcessing", kernel)
# image = complete(linker)
# mod = CuModule(image)
# kernels[i] = CuFunction(mod, "ExpressionProcessing")
# end
for i in eachindex(expressions)
if haskey(cache, expressions[i])
kernels[i] = cache[expressions[i]]
continue
end
formattedExpr = ExpressionProcessing.expr_to_postfix(expressions[i])
kernel = transpile(formattedExpr, varRows, Utils.get_max_inner_length(parameters), variableCols, i-1) # i-1 because julia is 1-based but PTX needs 0-based indexing
linker = CuLink()
add_data!(linker, "ExpressionProcessing", kernel)
image = complete(linker)
mod = CuModule(image)
kernels[i] = CuFunction(mod, "ExpressionProcessing")
cache[expressions[i]] = kernels[i]
end
cudaVars = CuArray(variables) # maybe put in shared memory (see PerformanceTests.jl for more info)
cudaParams = Utils.create_cuda_array(parameters, NaN32) # maybe make constant (see PerformanceTests.jl for more info)
# each expression has nr. of variable sets (nr. of columns of the variables) results and there are n expressions
cudaResults = CuArray{Float32}(undef, variableCols, length(expressions))
# execute each kernel (also try doing this with Threads.@threads. Since we can have multiple grids, this might improve performance)
for i in eachindex(kernels)
config = launch_configuration(kernels[i])
threads = min(variableCols, config.threads)
blocks = cld(variableCols, threads)
cudacall(kernels[i], (CuPtr{Float32},CuPtr{Float32},CuPtr{Float32}), cudaVars, cudaParams, cudaResults; threads=threads, blocks=blocks)
end
return cudaResults
end
# To increase performance, it would probably be best for all helper functions to return their IO Buffer and not a string
# seekstart(buf1); write(buf2, buf1)
function transpile(expression::ExpressionProcessing.PostfixType, varSetSize::Integer, paramSetSize::Integer)::String
"
- param ```varSetSize```: The size of a variable set. Equal to number of rows of variable matrix (in a column major matrix)
- param ```paramSetSize```: The size of the longest parameter set. As it has to be stored in a column major matrix, the nr of rows is dependent on the longest parameter set
- param ```expressionIndex```: The 0-based index of the expression
"
function transpile(expression::ExpressionProcessing.PostfixType, varSetSize::Integer, paramSetSize::Integer,
nrOfVariableSets::Integer, expressionIndex::Integer)::String
exitJumpLocationMarker = "\$L__BB0_2"
ptxBuffer = IOBuffer()
regManager = Utils.RegisterManager(Dict(), Dict())
# TODO: Suboptimal solution
signature, paramLoading = get_kernel_signature("ExpressionProcessing", [Int32, Float32, Float32]) # nrOfVarSets, Vars, Params
guardClause, threadIdReg = get_guard_clause(exitJumpLocationMarker, "%parameter0") # parameter0 because first entry holds the number of variable sets and that is always stored in %parameter0
signature, paramLoading = get_kernel_signature("ExpressionProcessing", [Float32, Float32, Float32], regManager) # Vars, Params, Results
guardClause, threadId64Reg = get_guard_clause(exitJumpLocationMarker, nrOfVariableSets, regManager)
println(ptxBuffer, get_cuda_header())
println(ptxBuffer, signature)
println(ptxBuffer, "{")
calc_code = generate_calculation_code(expression, "%parameter1", varSetSize, "%parameter2", paramSetSize, threadIdReg)
println(ptxBuffer, get_register_definitions())
calc_code = generate_calculation_code(expression, "%parameter0", varSetSize, "%parameter1", paramSetSize, "%parameter2",
threadId64Reg, expressionIndex, nrOfVariableSets, regManager)
println(ptxBuffer, Utils.get_register_definitions(regManager))
println(ptxBuffer, paramLoading)
println(ptxBuffer, guardClause)
println(ptxBuffer, calc_code)
@ -59,20 +99,23 @@ function transpile(expression::ExpressionProcessing.PostfixType, varSetSize::Int
println(ptxBuffer, "}")
generatedCode = String(take!(ptxBuffer))
println(generatedCode)
return generatedCode
end
# TODO: Make version, target and address_size configurable; also see what address_size means exactly
function get_cuda_header()::String
return "
.version 7.1
.version 8.5
.target sm_61
.address_size 32
.address_size 64
"
end
function get_kernel_signature(kernelName::String, parameters::Vector{DataType})::Tuple{String, String}
"
param ```parameters```: [1] = variables; [2] = parameters; [3] = results
"
function get_kernel_signature(kernelName::String, parameters::Vector{DataType}, regManager::Utils.RegisterManager)::Tuple{String, String}
signatureBuffer = IOBuffer()
paramLoadingBuffer = IOBuffer()
print(signatureBuffer, ".visible .entry ")
@ -80,11 +123,11 @@ function get_kernel_signature(kernelName::String, parameters::Vector{DataType}):
println(signatureBuffer, "(")
for i in eachindex(parameters)
print(signatureBuffer, " .param .u32", " ", "param_", i)
print(signatureBuffer, " .param .u64", " ", "param_", i)
parametersReg = get_next_free_register("r")
println(paramLoadingBuffer, "ld.param.u32 $parametersReg, [param_$i];")
println(paramLoadingBuffer, "cvta.to.global.u32 $(get_next_free_register("parameter")), $parametersReg;")
parametersLocation = Utils.get_next_free_register(regManager, "rd")
println(paramLoadingBuffer, "ld.param.u64 $parametersLocation, [param_$i];")
println(paramLoadingBuffer, "cvta.to.global.u64 $(Utils.get_next_free_register(regManager, "parameter")), $parametersLocation;")
if i != lastindex(parameters)
println(signatureBuffer, ",")
end
@ -99,36 +142,45 @@ Constructs the PTX code used for handling the case where too many threads are st
- param ```nrOfVarSetsRegister```: The register which holds the total amount of variable sets for the kernel
"
function get_guard_clause(exitJumpLocation::String, nrOfVarSetsRegister::String)::Tuple{String, String}
function get_guard_clause(exitJumpLocation::String, nrOfVarSets::Integer, regManager::Utils.RegisterManager)::Tuple{String, String}
guardBuffer = IOBuffer()
threadIds = get_next_free_register("r")
threadsPerCTA = get_next_free_register("r")
currentThreadId = get_next_free_register("r")
threadIds = Utils.get_next_free_register(regManager, "r")
threadsPerCTA = Utils.get_next_free_register(regManager, "r")
currentThreadId = Utils.get_next_free_register(regManager, "r")
# load data into above defined registers
println(guardBuffer, "mov.u32 $threadIds, %ntid.x;")
println(guardBuffer, "mov.u32 $threadsPerCTA, %ctaid.x;")
println(guardBuffer, "mov.u32 $currentThreadId, %tid.x;")
globalThreadId = get_next_free_register("r") # basically the index of the thread in the variable set
breakCondition = get_next_free_register("p")
nrOfVarSets = get_next_free_register("i")
println(guardBuffer, "ld.global.u32 $nrOfVarSets, [$nrOfVarSetsRegister];")
globalThreadId = Utils.get_next_free_register(regManager, "r") # basically the index of the thread in the variable set
breakCondition = Utils.get_next_free_register(regManager, "p")
println(guardBuffer, "mad.lo.s32 $globalThreadId, $threadIds, $threadsPerCTA, $currentThreadId;")
println(guardBuffer, "setp.ge.s32 $breakCondition, $globalThreadId, $nrOfVarSets;") # guard clause = index > nrOfVariableSets
println(guardBuffer, "setp.gt.s32 $breakCondition, $globalThreadId, $nrOfVarSets;") # guard clause = index > nrOfVariableSets
# branch to end if breakCondition is true
print(guardBuffer, "@$breakCondition bra $exitJumpLocation;")
println(guardBuffer, "@$breakCondition bra $exitJumpLocation;")
return (String(take!(guardBuffer)), globalThreadId)
# Convert threadIdReg to a 64-bit register. It is not 64-bit from the start, as this would take up more registers. Performance tests can determine whether this is faster than making everything 64-bit from the start
threadId64Reg = Utils.get_next_free_register(regManager, "rd")
print(guardBuffer, "cvt.u64.u32 $threadId64Reg, $globalThreadId;")
return (String(take!(guardBuffer)), threadId64Reg)
end
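
Review note: the interaction between the launch configuration (`threads`, `blocks = cld(variableCols, threads)`) and the guard clause can be modelled on the CPU. The sketch below is a hypothetical model, not generated PTX. `active_thread_ids` is an invented name, and ids are 0-based as in PTX (`ctaid.x * ntid.x + tid.x`).

```julia
# Hypothetical CPU model of the launch arithmetic; ids are 0-based as in PTX.
function active_thread_ids(nrOfVarSets::Integer, maxThreads::Integer)
    threads = min(nrOfVarSets, maxThreads)
    blocks = cld(nrOfVarSets, threads)   # round up so that every variable set gets a thread
    ids = [ctaid * threads + tid for ctaid in 0:blocks-1 for tid in 0:threads-1]
    # Threads whose id is >= nrOfVarSets have no variable set to work on and must exit early.
    return filter(id -> id < nrOfVarSets, ids)
end

ids = active_thread_ids(10, 4)   # 3 blocks of 4 threads are launched, 10 of them do work
```

In this model the valid ids are `0` to `nrOfVarSets - 1`, so a strict `>` comparison would let the thread with `id == nrOfVarSets` through. Whether that applies to the generated kernel depends on how the thread id is used afterwards and is worth checking against the `setp` instruction above.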
function generate_calculation_code(expression::ExpressionProcessing.PostfixType, variablesReg::String, variablesSetSize::Integer,
parametersReg::String, parametersSetSize::Integer, threadIdReg::String)::String
"
- param ```parametersSetSize```: Size of the largest parameter set
"
function generate_calculation_code(expression::ExpressionProcessing.PostfixType, variablesLocation::String, variablesSetSize::Integer,
parametersLocation::String, parametersSetSize::Integer, resultsLocation::String,
threadId64Reg::String, expressionIndex::Integer, nrOfVarSets::Integer, regManager::Utils.RegisterManager)::String
codeBuffer = IOBuffer()
operands = Vector{Operand}()
exprId64Reg = Utils.get_next_free_register(regManager, "rd")
println(codeBuffer, "mov.u64 $exprId64Reg, $expressionIndex;")
for i in eachindex(expression)
token = expression[i]
@ -144,47 +196,57 @@ function generate_calculation_code(expression::ExpressionProcessing.PostfixType,
else
left = pop!(operands)
end
operation, resultRegister = get_operation(operator, left, right)
operation, resultRegister = get_operation(operator, regManager, left, right)
println(codeBuffer, operation)
push!(operands, resultRegister)
elseif token.Type == INDEX
if token.Value > 0 # variables
var, first_access = get_register_for_name("x$(token.Value)")
var, first_access = Utils.get_register_for_name(regManager, "x$(token.Value)")
if first_access
println(codeBuffer, load_into_register(var, variablesReg, token.Value, threadIdReg, variablesSetSize))
println(codeBuffer, load_into_register(var, variablesLocation, token.Value, threadId64Reg, variablesSetSize, regManager))
end
push!(operands, var)
else
absVal = abs(token.Value)
param, first_access = get_register_for_name("p$absVal")
param, first_access = Utils.get_register_for_name(regManager, "p$absVal")
if first_access
println(codeBuffer, load_into_register(param, parametersReg, absVal, threadIdReg, parametersSetSize))
println(codeBuffer, load_into_register(param, parametersLocation, absVal, exprId64Reg, parametersSetSize, regManager))
end
push!(operands, param)
end
end
end
tempReg = Utils.get_next_free_register(regManager, "rd")
# reg = pop!(operands)
# tmp = "abs.f32 $(reg), 16.0;"
# push!(operands, reg)
println(codeBuffer, "
add.u64 $tempReg, $((expressionIndex)*nrOfVarSets), $threadId64Reg;
mad.lo.u64 $tempReg, $tempReg, $BYTES, $resultsLocation;
st.global.f32 [$tempReg], $(pop!(operands));
")
return String(take!(codeBuffer))
end
"
Loads a value from a location into the given register. It is assumed that the location refers to a column-major matrix
- param ```register```: The register where the loaded value will be stored
- param ```loadLocation```: The location from where to load the value
- param ```valueIndex```: 0-based index of the value in the variable set/parameter set
- param ```setIndexReg```: 0-based index of the set. Needed to calculate the actual index from the ```valueIndex```. Is equal to the global threadId
- param ```setSize```: The size of one set. Needed to calculate the actual index from the ```valueIndex```
- param ```valueIndex```: 1-based index of the value in the variable set/parameter set
- param ```setIndexReg64```: 0-based index of the set. Needed to calculate the actual index from the ```valueIndex```. Is equal to the global threadId
- param ```setSize```: The size of one set. Needed to calculate the actual index from the ```valueIndex```. Total number of elements in the set (length(set))
"
function load_into_register(register::String, loadLocation::String, valueIndex::Integer, setIndexReg::String, setSize::Integer)::String
# loadLocation + startIndex + valueIndex * bytes (4 in our case)
# startIndex: setIndex * setSize
tempReg = get_next_free_register("i")
# we are using "sizeof(valueIndex)" because it has to use the same amount of bytes as the actual stored values, even though it could use more bytes
function load_into_register(register::String, loadLocation::String, valueIndex::Integer, setIndexReg64::String, setSize::Integer, regManager::Utils.RegisterManager)::String
tempReg = Utils.get_next_free_register(regManager, "rd")
# "mad" calculates the offset and "add" applies the offset. Classical pointer arithmetic for accessing values of an array like in C
return "
mul.lo.u32 $tempReg, $setIndexReg, $setSize;
add.u32 $tempReg, $tempReg, $(valueIndex*sizeof(valueIndex));
add.u32 $tempReg, $loadLocation, $tempReg;
mad.lo.u64 $tempReg, $setIndexReg64, $(setSize*BYTES), $((valueIndex - 1) * BYTES);
add.u64 $tempReg, $loadLocation, $tempReg;
ld.global.f32 $register, [$tempReg];"
end
@ -200,8 +262,8 @@ function type_to_ptx_type(type::DataType)::String
end
end
function get_operation(operator::Operator, left::Operand, right::Union{Operand, Nothing} = nothing)::Tuple{String, String}
resultRegister = get_next_free_register("f")
function get_operation(operator::Operator, regManager::Utils.RegisterManager, left::Operand, right::Union{Operand, Nothing} = nothing)::Tuple{String, String}
resultRegister = Utils.get_next_free_register(regManager, "f")
resultCode = ""
if is_binary_operator(operator) && isnothing(right)
@ -219,6 +281,7 @@ function get_operation(operator::Operator, left::Operand, right::Union{Operand,
elseif operator == POWER
# x^y == 2^(y*log2(x)) as generated by nvcc for "pow(x, y)"
resultCode = "
// x^y:
lg2.approx.f32 $resultRegister, $left;
mul.f32 $resultRegister, $right, $resultRegister;
ex2.approx.f32 $resultRegister, $resultRegister;"
@ -227,11 +290,13 @@ function get_operation(operator::Operator, left::Operand, right::Union{Operand,
elseif operator == LOG
# log(x) == log2(x) * ln(2) as generated by nvcc for "log(x)"
resultCode = "
// log(x):
lg2.approx.f32 $resultRegister, $left;
mul.f32 $resultRegister, $resultRegister, 0.693147182;"
elseif operator == EXP
# e^x == 2^(x/ln(2)) as generated by nvcc for "exp(x)"
resultCode = "
// e^x:
mul.f32 $resultRegister, $left, 1.44269502;
ex2.approx.f32 $resultRegister, $resultRegister;"
elseif operator == SQRT
@ -242,68 +307,5 @@ function get_operation(operator::Operator, left::Operand, right::Union{Operand,
return (resultCode, resultRegister)
end
let registers = Dict() # stores the count of the register already used.
global get_next_free_register
global get_register_definitions
# By convention these names correspond to the following types:
# - p -> pred
# - f -> float32
# - r -> 32 bit
# - var -> float32 (used for variables and params)
function get_next_free_register(name::String)::String
if haskey(registers, name)
registers[name] += 1
else
registers[name] = 1
end
return string("%", name, registers[name] - 1)
end
function get_register_definitions()::String
registersBuffer = IOBuffer()
for definition in registers
regType = ""
if definition.first == "p"
regType = ".pred"
elseif definition.first == "f"
regType = ".f32"
elseif definition.first == "var"
regType = ".f32"
elseif definition.first == "param"
regType = ".f32"
elseif definition.first == "r"
regType = ".b32"
elseif definition.first == "parameter"
regType = ".u32"
elseif definition.first == "i"
regType = ".u32"
else
throw(ArgumentError("Unknown register name used. Name '$(definition.first)' cannot be mapped to a PTX type."))
end
println(registersBuffer, ".reg $regType %$(definition.first)<$(definition.second)>;")
end
return String(take!(registersBuffer))
end
end
let symtable = Dict()
global get_register_for_name
"Returns the register for this variable/parameter and true if it is used for the first time and false otherwise."
function get_register_for_name(varName::String)
if haskey(symtable, varName)
return (symtable[varName], false)
else
reg = get_next_free_register("var")
symtable[varName] = reg
return (reg, true)
end
end
end
end

88
package/src/Utils.jl Normal file

@ -0,0 +1,88 @@
module Utils
using CUDA
"Converts a vector of vectors into a matrix. The inner vectors do not need to have the same length.
All entries that cannot be filled have ```invalidElement``` as their value
"
function convert_to_matrix(vecs::Vector{Vector{T}}, invalidElement::T)::Matrix{T} where T
maxLength = get_max_inner_length(vecs)
# Pad the shorter vectors with the invalidElement
paddedVecs = [vcat(vec, fill(invalidElement, maxLength - length(vec))) for vec in vecs]
vecMat = hcat(paddedVecs...)
return vecMat
end
"Retrieves the number of entries for the largest inner vector"
function get_max_inner_length(vecs::Vector{Vector{T}})::Int where T
return maximum(length.(vecs))
end
"Returns a CuArray filled with the data provided. The inner vectors do not have to have the same length. All missing elements will be the value ```invalidElement```"
function create_cuda_array(data::Vector{Vector{T}}, invalidElement::T)::CuArray{T} where T
dataMat = convert_to_matrix(data, invalidElement)
cudaArr = CuArray(dataMat)
return cudaArr
end
struct RegisterManager
registers::Dict
symtable::Dict
end
function get_next_free_register(manager::RegisterManager, name::String)::String
if haskey(manager.registers, name)
manager.registers[name] += 1
else
manager.registers[name] = 1
end
return string("%", name, manager.registers[name] - 1)
end
function get_register_definitions(manager::RegisterManager)::String
registersBuffer = IOBuffer()
for definition in manager.registers
regType = ""
if definition.first == "p"
regType = ".pred"
elseif definition.first == "f"
regType = ".f32"
elseif definition.first == "var"
regType = ".f32"
elseif definition.first == "param"
regType = ".f32"
elseif definition.first == "r"
regType = ".b32"
elseif definition.first == "rd"
regType = ".b64"
elseif definition.first == "parameter"
regType = ".b64"
elseif definition.first == "i"
regType = ".b64"
else
throw(ArgumentError("Unknown register name used. Name '$(definition.first)' cannot be mapped to a PTX type."))
end
println(registersBuffer, ".reg $regType %$(definition.first)<$(definition.second)>;")
end
return String(take!(registersBuffer))
end
"Returns the register for this variable/parameter and true if it is used for the first time and false otherwise."
function get_register_for_name(manager::RegisterManager, varName::String)
if haskey(manager.symtable, varName)
return (manager.symtable[varName], false)
else
reg = get_next_free_register(manager, "var")
manager.symtable[varName] = reg
return (reg, true)
end
end
end
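The register bookkeeping moved from `let`-block globals into an explicit `RegisterManager` argument. The sketch below is a reduced, self-contained stand-in that shows how the counter-based naming and the `.reg` declarations fit together. `MiniRegisterManager`, `next_register!`, `definitions` and `PTX_TYPES` are hypothetical names, and only a subset of the prefix mapping from the diff is reproduced.

```julia
# Hypothetical, reduced stand-in for Utils.RegisterManager
struct MiniRegisterManager
    registers::Dict{String,Int}   # prefix => number of registers handed out so far
end
MiniRegisterManager() = MiniRegisterManager(Dict{String,Int}())

# Subset of the prefix => PTX type mapping used in the diff
const PTX_TYPES = Dict("p" => ".pred", "f" => ".f32", "var" => ".f32",
                       "r" => ".b32", "rd" => ".b64")

# Hands out the next register for a prefix; numbering starts at 0
function next_register!(m::MiniRegisterManager, prefix::String)::String
    used = get(m.registers, prefix, 0)
    m.registers[prefix] = used + 1
    return string("%", prefix, used)
end

# Emits one parameterised .reg declaration per prefix in use
function definitions(m::MiniRegisterManager)::String
    buffer = IOBuffer()
    for prefix in sort!(collect(keys(m.registers)))   # sorted for deterministic output
        haskey(PTX_TYPES, prefix) ||
            throw(ArgumentError("Unknown register name '$prefix'."))
        println(buffer, ".reg ", PTX_TYPES[prefix], " %", prefix, "<", m.registers[prefix], ">;")
    end
    return String(take!(buffer))
end

manager = MiniRegisterManager()
first_f = next_register!(manager, "f")
second_f = next_register!(manager, "f")
first_p = next_register!(manager, "p")
print(definitions(manager))
```

Because the state lives in the struct rather than in module-level closures, each transpiled expression can presumably start with a fresh manager and its own register numbering.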


@ -1,4 +1,5 @@
using LinearAlgebra
using BenchmarkTools
function test_cpu_interpreter(nrows; parallel = false)
exprs = [
@ -18,16 +19,27 @@ function test_cpu_interpreter(nrows; parallel = false)
reps= 100
if parallel
t_sec = @elapsed fetch.([Threads.@spawn interpret_cpu(exprs, X, p; repetitions=expr_reps) for i in 1:reps])
println("~ $(round(30 * reps * expr_reps * nrows / 1e9 / t_sec, digits=2)) GFLOPS ($(Threads.nthreads()) threads) ($(round(peakflops(1000, eltype=Float32, ntrials=1) / 1e9, digits=2)) GFLOPS (peak, single-core))")
# t_sec = @elapsed fetch.([Threads.@spawn interpret_cpu(exprs, X, p; repetitions=expr_reps) for i in 1:reps])
# Locals must be interpolated with $; t_sec is still needed by the println below
t_sec = @belapsed parallel($exprs, $X, $p, $expr_reps, $reps)
println("~ $(round(30 * reps * expr_reps * nrows / 1e9 / t_sec, digits=2)) GFLOPS ($(Threads.nthreads()) threads) ($(round(LinearAlgebra.peakflops(1000, eltype=Float32, ntrials=1) / 1e9, digits=2)) GFLOPS (peak, single-core))")
else
t_sec = @elapsed for i in 1:reps interpret_cpu(exprs, X, p; repetitions=expr_reps) end
println("~ $(round(30 * reps * expr_reps * nrows / 1e9 / t_sec, digits=2)) GFLOPS (single-core) ($(round(peakflops(1000, eltype=Float32, ntrials=1) / 1e9, digits=2)) GFLOPS (peak, single-core))")
# t_sec = @elapsed for i in 1:reps interpret_cpu(exprs, X, p; repetitions=expr_reps) end
# Locals must be interpolated with $; t_sec is still needed by the println below
t_sec = @belapsed single($exprs, $X, $p, $expr_reps, $reps)
println("~ $(round(30 * reps * expr_reps * nrows / 1e9 / t_sec, digits=2)) GFLOPS (single-core) ($(round(LinearAlgebra.peakflops(1000, eltype=Float32, ntrials=1) / 1e9, digits=2)) GFLOPS (peak, single-core))")
end
true
end
LinearAlgebra.BLAS.set_num_threads(1) # only use a single thread for peakflops
function parallel(exprs, X, p, expr_reps, reps)
fetch.([Threads.@spawn interpret_cpu(exprs, X, p; repetitions=expr_reps) for i in 1:reps])
end
function single(exprs, X, p, expr_reps, reps)
for i in 1:reps interpret_cpu(exprs, X, p; repetitions=expr_reps) end
end
# LinearAlgebra.BLAS.set_num_threads(1) # only use a single thread for peakflops
@test test_cpu_interpreter(1000)
@test test_cpu_interpreter(1000, parallel=true) # start julia -t 6 for six threads
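The GFLOPS figure printed above is plain arithmetic: operations per row, times the number of rows, repetitions and runs, divided by the elapsed time. A minimal sketch of that formula follows. The helper name `gflops` is hypothetical, the factor 30 is taken to be the per-row operation count of the four benchmark expressions (5 + 9 + 3 + 13), and the timing value is illustrative rather than measured.

```julia
# Operations per data row, summed over the four benchmark expressions (5 + 9 + 3 + 13)
const OPS_PER_ROW = 30

# Throughput in GFLOPS for `reps` runs of `expr_reps` repetitions over `nrows` rows
gflops(reps, expr_reps, nrows, t_sec) = OPS_PER_ROW * reps * expr_reps * nrows / 1e9 / t_sec

# Illustrative numbers only: 100 runs, 100 repetitions, 1000 rows, 1.5 s elapsed
println("~ ", round(gflops(100, 100, 1000, 1.5), digits=2), " GFLOPS")
```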


@ -1,6 +1,7 @@
using CUDA
using .ExpressionProcessing
using .Interpreter
using .Utils
expressions = Vector{Expr}(undef, 2)
variables = Matrix{Float32}(undef, 2,2)
@ -20,8 +21,8 @@ parameters[2][1] = 5.0
parameters[2][2] = 0.0
function testHelper(expression::Expr, variables::Matrix{Float32}, parameters::Vector{Vector{Float32}}, expectedResult)
postfix = Vector([expr_to_postfix(expression)])
result = Interpreter.interpret(postfix, variables, parameters)
exprs = Vector([expression])
result = Interpreter.interpret(exprs, variables, parameters)
expectedResult32 = convert(Float32, expectedResult)
@test isequal(result[1,1], expectedResult32)
@ -35,7 +36,7 @@ end
reference[2,2] = 0.0
# reference = Matrix([5.0, NaN],
# [5.0, 0.0])
result = Interpreter.convert_to_matrix(parameters, NaN32)
result = Utils.convert_to_matrix(parameters, NaN32)
@test isequal(result, reference)
end
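The test above relies on ragged parameter vectors being padded and laid out column-wise. The sketch below reproduces that behaviour without CUDA so the expected `reference` matrix can be checked by hand. `pad_to_matrix` is a hypothetical name, and `reduce(hcat, ...)` is used here in place of splatting, which is a stylistic choice and not what the commit does.

```julia
# Pad ragged vectors to equal length, then place each one in its own column
function pad_to_matrix(vecs::Vector{Vector{T}}, invalid::T)::Matrix{T} where T
    maxlen = maximum(length, vecs)
    padded = [vcat(v, fill(invalid, maxlen - length(v))) for v in vecs]
    return reduce(hcat, padded)
end

params = [Float32[5.0], Float32[5.0, 0.0]]
mat = pad_to_matrix(params, NaN32)
println(mat)
```

The comparison has to use `isequal` rather than `==`, because `NaN32 == NaN32` is `false` while `isequal(NaN32, NaN32)` is `true`.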
@ -126,8 +127,8 @@ end
expr1 = :((x1 + 5) * p1 - 3 / abs(x2) + (2^4) - log(8))
expr2 = :(1 + 5 * x1 - 10^2 + (p1 - p2) / 9 + exp(x2))
postfix = Vector([expr_to_postfix(expr1), expr_to_postfix(expr2)])
result = Interpreter.interpret(postfix, var, param)
exprs = Vector([expr1, expr2])
result = Interpreter.interpret(exprs, var, param)
# var set 1
@test isapprox(result[1,1], 37.32, atol=0.01) # expr1


@ -0,0 +1,146 @@
using LinearAlgebra
using BenchmarkTools
using .Transpiler
using .Interpreter
const BENCHMARKS_RESULTS_PATH = "./results"
# University setup at 10.20.1.7 if needed
exprsCPU = [
# CPU interpreter requires an anonymous function and array refs
:(p[1] * x[1] + p[2]), # 5 op
:((((x[1] + x[2]) + x[3]) + x[4]) + x[5]), # 9 op
:(log(abs(x[1]))), # 3 op
:(powabs(p[2] - powabs(p[1] + x[1], 1/x[1]),p[3])) # 13 op
] # 30 op
exprsCPU = map(e -> Expr(:->, :(x,p), e), exprsCPU)
exprsGPU = [
# GPU evaluators take plain symbols (x1, p1) instead of array refs
:(p1 * x1 + p2), # 5 op
:((((x1 + x2) + x3) + x4) + x5), # 9 op
:(log(abs(x1))), # 3 op
:(powabs(p2 - powabs(p1 + x1, 1/x1),p3)) # 13 op
] # 30 op
# p is the same for CPU and GPU
p = [randn(Float32, 10) for _ in 1:length(exprsCPU)] # generate 10 random parameter values for each expr
expr_reps = 100 # 100 parameter optimisation steps basically
@testset "CPU performance" begin
# warmup
# interpret_cpu(exprsCPU, X, p)
# @btime interpret_cpu(exprsCPU, X, p; repetitions=expr_reps) # repetitions simulates parameter optimisation
# @btime test_cpu_interpreter(1000)
# @btime fetch.([Threads.@spawn interpret_cpu(exprsCPU, X, p; repetitions=expr_reps) for i in 1:reps])
# test_cpu_interpreter(1000, parallel=true) # start julia -t 6 for six threads
# @btime test_cpu_interpreter(10000)
# @btime test_cpu_interpreter(10000, parallel=true)
end
@testset "Interpreter Performance" begin
# Put data in shared memory:
# https://cuda.juliagpu.org/v2.6/api/kernel/#Shared-memory
# Make array const:
# https://cuda.juliagpu.org/v2.6/api/kernel/#Device-arrays
# Memory management like in C++ might help with performance improvements
# https://cuda.juliagpu.org/v2.6/lib/driver/#Memory-Management
end
@testset "Transpiler Performance" begin
# Put data in shared memory:
# https://cuda.juliagpu.org/v2.6/api/kernel/#Shared-memory
# Make array const:
# https://cuda.juliagpu.org/v2.6/api/kernel/#Device-arrays
# Memory management like in C++ might help with performance improvements
# https://cuda.juliagpu.org/v2.6/lib/driver/#Memory-Management
end
suite = BenchmarkGroup()
suite["CPU"] = BenchmarkGroup(["CPUInterpreter"])
suite["GPUI"] = BenchmarkGroup(["GPUInterpreter"])
suite["GPUT"] = BenchmarkGroup(["GPUTranspiler"])
varsets_small = 100
varsets_medium = 1000
varsets_large = 10000
X_small = randn(Float32, varsets_small, 5)
suite["CPU"]["small varset"] = @benchmarkable interpret_cpu(exprsCPU, X_small, p; repetitions=expr_reps)
X_medium = randn(Float32, varsets_medium, 5)
suite["CPU"]["medium varset"] = @benchmarkable interpret_cpu(exprsCPU, X_medium, p; repetitions=expr_reps)
X_large = randn(Float32, varsets_large, 5)
suite["CPU"]["large varset"] = @benchmarkable interpret_cpu(exprsCPU, X_large, p; repetitions=expr_reps)
X_small_GPU = randn(Float32, 5, varsets_small)
suite["GPUI"]["small varset"] = @benchmarkable interpret_gpu(exprsGPU, X_small_GPU, p; repetitions=expr_reps)
suite["GPUT"]["small varset"] = @benchmarkable evaluate_gpu(exprsGPU, X_small_GPU, p; repetitions=expr_reps)
X_medium_GPU = randn(Float32, 5, varsets_medium)
suite["GPUI"]["medium varset"] = @benchmarkable interpret_gpu(exprsGPU, X_medium_GPU, p; repetitions=expr_reps)
suite["GPUT"]["medium varset"] = @benchmarkable evaluate_gpu(exprsGPU, X_medium_GPU, p; repetitions=expr_reps)
X_large_GPU = randn(Float32, 5, varsets_large)
suite["GPUI"]["large varset"] = @benchmarkable interpret_gpu(exprsGPU, X_large_GPU, p; repetitions=expr_reps)
suite["GPUT"]["large varset"] = @benchmarkable evaluate_gpu(exprsGPU, X_large_GPU, p; repetitions=expr_reps)
# interpret_gpu(exprsGPU, X_large_GPU, p; repetitions=expr_reps)
# tune!(suite)
# BenchmarkTools.save("params.json", params(suite))
loadparams!(suite, BenchmarkTools.load("params.json")[1], :samples, :evals, :gctrial, :time_tolerance, :evals_set, :gcsample, :seconds, :overhead, :memory_tolerance)
results = run(suite, verbose=true, seconds=180)
# BenchmarkTools.save("$BENCHMARKS_RESULTS_PATH/initial_results.json", results)
# initial_results = BenchmarkTools.load("$BENCHMARKS_RESULTS_PATH/initial_results.json")
medianCPU = median(results["CPU"])
minimumCPU = minimum(results["CPU"])
stdCPU = std(results["CPU"])
medianInterpreter = median(results["GPUI"])
minimumInterpreter = minimum(results["GPUI"])
stdInterpreter = std(results["GPUI"])
medianTranspiler = median(results["GPUT"])
minimumTranspiler = minimum(results["GPUT"])
stdTranspiler = std(results["GPUT"])
cpuVsGPUI_median = judge(medianInterpreter, medianCPU) # is interpreter better than cpu?
cpuVsGPUT_median = judge(medianTranspiler, medianCPU) # is transpiler better than cpu?
gpuiVsGPUT_median = judge(medianTranspiler, medianInterpreter) # is transpiler better than interpreter?
cpuVsGPUI_minimum = judge(minimumInterpreter, minimumCPU) # is interpreter better than cpu?
cpuVsGPUT_minimum = judge(minimumTranspiler, minimumCPU) # is transpiler better than cpu?
gpuiVsGPUT_minimum = judge(minimumTranspiler, minimumInterpreter) # is transpiler better than interpreter?
cpuVsGPUI_std = judge(stdInterpreter, stdCPU) # is interpreter better than cpu?
cpuVsGPUT_std = judge(stdTranspiler, stdCPU) # is transpiler better than cpu?
gpuiVsGPUT_std = judge(stdTranspiler, stdInterpreter) # is transpiler better than interpreter?
println("Is the interpreter better than the CPU implementation:")
println(cpuVsGPUI_median)
println(cpuVsGPUI_minimum)
println(cpuVsGPUI_std)
println("Is the transpiler better than the CPU implementation:")
println(cpuVsGPUT_median)
println(cpuVsGPUT_minimum)
println(cpuVsGPUT_std)
println("Is the transpiler better than the interpreter:")
println(gpuiVsGPUT_median)
println(gpuiVsGPUT_minimum)
println(gpuiVsGPUT_std)


@ -1,4 +1,8 @@
[deps]
BenchmarkPlots = "ab8c0f59-4072-4e0d-8f91-a91e1495eb26"
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
StatsPlots = "f3b207a7-027a-5e70-b257-86293d7955fd"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"


@ -2,42 +2,65 @@ using CUDA
using .ExpressionProcessing
using .Transpiler
expressions = Vector{Expr}(undef, 2)
variables = Matrix{Float32}(undef, 2,2)
parameters = Vector{Vector{Float32}}(undef, 2)
expressions = Vector{Expr}(undef, 3)
variables = Matrix{Float32}(undef, 5, 4)
parameters = Vector{Vector{Float32}}(undef, 3)
# Resulting value should be 1.14... for the first expression
expressions[1] = :(1 + 3 * 5 / 7 - sqrt(4))
expressions[2] = :(5 + x1 + 1 * x2 + p1 + p2)
expressions[2] = :(5 + x1 + 1 * x2 + p1 + p2 + x1^x3)
expressions[3] = :(log(x1) / x2 * sqrt(p1) + x3^x4 - exp(x5))
variables[1,1] = 2.0
variables[2,1] = 3.0
variables[1,2] = 0.0
variables[3,1] = 0.0
variables[4,1] = 1.0
variables[5,1] = 0.0
variables[1,2] = 2.0
variables[2,2] = 5.0
parameters[1] = Vector{Float32}(undef, 1)
variables[3,2] = 3.0
variables[4,2] = 0.0
variables[5,2] = 0.0
variables[1,3] = 6.0
variables[2,3] = 2.0
variables[3,3] = 2.0
variables[4,3] = 4.0
variables[5,3] = 2.0
variables[1,4] = 1.0
variables[2,4] = 2.0
variables[3,4] = 3.0
variables[4,4] = 4.0
variables[5,4] = 5.0
parameters[1] = Vector{Float32}(undef, 0)
parameters[2] = Vector{Float32}(undef, 2)
parameters[1][1] = 5.0
parameters[3] = Vector{Float32}(undef, 1)
parameters[2][1] = 5.0
parameters[2][2] = 0.0
parameters[3][1] = 16.0
@testset "Test transpiler evaluation" begin
results = Transpiler.evaluate(expressions, variables, parameters)
@testset "Test TMP transpiler" begin
postfixExpr = expr_to_postfix(expressions[1])
postfixExprs = Vector([postfixExpr])
push!(postfixExprs, expr_to_postfix(expressions[2]))
push!(postfixExprs, expr_to_postfix(:(5^3 + x1)))
# dump(expressions[3]; maxdepth=10)
# Expr 1:
@test isapprox(results[1,1], 1.14286)
@test isapprox(results[2,1], 1.14286)
@test isapprox(results[3,1], 1.14286)
@test isapprox(results[4,1], 1.14286)
#Expr 2:
@test isapprox(results[1,2], 16.0)
@test isapprox(results[2,2], 25.0)
@test isapprox(results[3,2], 54.0)
@test isapprox(results[4,2], 14.0)
# generatedCode = Transpiler.transpile(postfixExpr)
generatedCode = Transpiler.transpile(postfixExprs[3], 2, 3) # TEMP
# CUDA.@sync interpret(postfixExprs, variables, parameters)
# This is just here for testing. This will be called inside the execute method in the Transpiler module
linker = CuLink()
add_data!(linker, "ExpressionProcessing", generatedCode)
image = complete(linker)
mod = CuModule(image)
func = CuFunction(mod, "ExpressionProcessing")
#Expr3:
@test isapprox(results[1,3], -0.07580)
@test isapprox(results[2,3], 0.55452)
@test isapprox(results[3,3], 12.19446)
@test isapprox(results[4,3], -67.41316)
end
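The expected values in the testset above can be reproduced on the CPU with plain Julia. The sketch below is not part of the commit. It evaluates the three test expressions for each variable set (one set per column of the 5×4 matrix) and stores them in the same `[set, expr]` layout that the tests index into. `cpu_exprs`, `p2`, `p3` and `reference` are hypothetical names.

```julia
# Variable sets are stored as columns (x1..x5 per column), as in the 5×4 test matrix
variables = Float32[2 2 6 1;
                    3 5 2 2;
                    0 3 2 3;
                    1 0 4 4;
                    0 0 2 5]
p2 = Float32[5, 0]   # parameters of expression 2
p3 = Float32[16]     # parameters of expression 3

# Plain-Julia versions of the three test expressions; x is one variable set
cpu_exprs = (
    x -> 1 + 3 * 5 / 7 - sqrt(4),
    x -> 5 + x[1] + 1 * x[2] + p2[1] + p2[2] + x[1]^x[3],
    x -> log(x[1]) / x[2] * sqrt(p3[1]) + x[3]^x[4] - exp(x[5]),
)

# reference[set, expr], the same layout the tests index into
reference = [Float32(f(variables[:, set])) for set in 1:size(variables, 2), f in cpu_exprs]

for expr in 1:length(cpu_exprs)
    println("Expr ", expr, ": ", reference[:, expr])
end
```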
#TODO: test performance of transpiler PTX generation when doing "return String(take!(buffer))" vs "return take!(buffer)"
# TODO: test performance of transpiler PTX generation when doing "return String(take!(buffer))" vs "return take!(buffer)"

File diff suppressed because one or more lines are too long

1
package/test/params.json Normal file
View File

@ -0,0 +1 @@
[{"Julia":"1.11.4","BenchmarkTools":{"major":1,"minor":6,"patch":0,"prerelease":[],"build":[]}},[["BenchmarkGroup",{"data":{"CPU":["BenchmarkGroup",{"data":{"medium varset":["Parameters",{"gctrial":true,"time_tolerance":0.05,"evals_set":false,"samples":1000,"evals":1,"gcsample":false,"seconds":5.0,"overhead":0.0,"memory_tolerance":0.01}],"large varset":["Parameters",{"gctrial":true,"time_tolerance":0.05,"evals_set":false,"samples":1000,"evals":1,"gcsample":false,"seconds":5.0,"overhead":0.0,"memory_tolerance":0.01}],"small varset":["Parameters",{"gctrial":true,"time_tolerance":0.05,"evals_set":false,"samples":1000,"evals":1,"gcsample":false,"seconds":5.0,"overhead":0.0,"memory_tolerance":0.01}]},"tags":["CPUInterpreter"]}],"GPUT":["BenchmarkGroup",{"data":{"medium varset":["Parameters",{"gctrial":true,"time_tolerance":0.05,"evals_set":false,"samples":1000,"evals":1,"gcsample":false,"seconds":5.0,"overhead":0.0,"memory_tolerance":0.01}],"large varset":["Parameters",{"gctrial":true,"time_tolerance":0.05,"evals_set":false,"samples":1000,"evals":1,"gcsample":false,"seconds":5.0,"overhead":0.0,"memory_tolerance":0.01}],"small varset":["Parameters",{"gctrial":true,"time_tolerance":0.05,"evals_set":false,"samples":1000,"evals":1,"gcsample":false,"seconds":5.0,"overhead":0.0,"memory_tolerance":0.01}]},"tags":["GPUTranspiler"]}],"GPUI":["BenchmarkGroup",{"data":{"medium varset":["Parameters",{"gctrial":true,"time_tolerance":0.05,"evals_set":false,"samples":1000,"evals":1,"gcsample":false,"seconds":5.0,"overhead":0.0,"memory_tolerance":0.01}],"large varset":["Parameters",{"gctrial":true,"time_tolerance":0.05,"evals_set":false,"samples":1000,"evals":1,"gcsample":false,"seconds":5.0,"overhead":0.0,"memory_tolerance":0.01}],"small varset":["Parameters",{"gctrial":true,"time_tolerance":0.05,"evals_set":false,"samples":1000,"evals":1,"gcsample":false,"seconds":5.0,"overhead":0.0,"memory_tolerance":0.01}]},"tags":["GPUInterpreter"]}]},"tags":[]}]]]


@ -2,17 +2,22 @@ using ExpressionExecutorCuda
using Test
const baseFolder = dirname(dirname(pathof(ExpressionExecutorCuda)))
include(joinpath(baseFolder, "src", "Utils.jl"))
include(joinpath(baseFolder, "src", "ExpressionProcessing.jl"))
include(joinpath(baseFolder, "src", "Interpreter.jl"))
include(joinpath(baseFolder, "src", "Transpiler.jl"))
@testset "ExpressionExecutorCuda.jl" begin
include("ExpressionProcessingTests.jl")
include("InterpreterTests.jl")
include("TranspilerTests.jl")
@testset "Functionality tests" begin
# include("ExpressionProcessingTests.jl")
# include("InterpreterTests.jl")
# include("TranspilerTests.jl")
end
@testset "CPU Interpreter" begin
include("CpuInterpreterTests.jl")
# @testset "CPU Interpreter" begin
# include("CpuInterpreterTests.jl")
# end
@testset "Performance tests" begin
include("PerformanceTests.jl")
end


@ -9,5 +9,12 @@ Probably reference the performance evaluation papers for Julia and CUDA.jl
\section{Interpreter}
Talk about how the interpreter has been developed.
\subsection{Performance tuning}
Document the process of performance tuning
\section{Transpiler}
Talk about how the transpiler has been developed
Talk about how the transpiler has been developed
\subsection{Performance tuning}
Document the process of performance tuning


@ -41,7 +41,7 @@ In order to answer the research questions, this thesis is divided into the follo
\item[Chapter 4: Implementation] \mbox{} \\
This chapter explains the implementation of the GPU interpreter and transpiler. It covers the implementation details and the technologies used, including the interpretation process and the transpilation of expressions into Parallel Thread Execution (PTX) code.
\item[Chapter 5: Evaluation] \mbox{} \\
The software and hardware requirements and the evaluation environment are introduced in this chapter. All three evaluators are compared against each other, and the form of the expressions used for the comparisons is outlined. Finally, the results of the comparison of the GPU and CPU evaluators are presented to show which of these yields the best performance.
The software and hardware requirements and the evaluation environment are introduced in this chapter. All three evaluators are compared against each other, and the form of the expressions used for the comparisons is outlined. The comparison covers not only the time taken for the pure evaluation but also overhead such as PTX code generation. Finally, the results of the comparison of the GPU and CPU evaluators are presented to show which of these yields the best performance.
\item[Chapter 6: Conclusion] \mbox{} \\
In the final chapter, the entire work is summarised. A brief overview of the implementation and the evaluation results is provided. Additionally, an outlook on possible future research is given.
\end{description}

Binary file not shown.