\chapter{Implementation}
\label{cha:implementation}

% TODO: explain somewhere in this chapter why one kernel is generated per
% expression rather than a single kernel for all expressions.

\section{Technologies}
Short section introducing the technologies used: CUDA, PTX, Julia and CUDA.jl. Probably reference the performance evaluation papers for Julia and CUDA.jl.

\section{Expression Processing}
Explain why this pre-processing step is needed and how it is done. The why: it simplifies the evaluation and transpilation process. The how: implemented in ExpressionProcessing.jl.

\section{Interpreter}
Describe how the interpreter has been developed:
\begin{itemize}
	\item UML activity diagram of the main loop.
	\item The kernel is compiled by CUDA.jl into PTX and then executed.
	\item Memory access (currently global memory only).
	\item No dynamic memory allocation as on the CPU; the evaluation stack must have a fixed size.
\end{itemize}

\section{Transpiler}
Describe how the transpiler has been developed (probably the largest section, because it simply has more interesting parts):
\begin{itemize}
	\item UML activity diagram.
	\item Front end and back end.
	\item Caching of back-end results.
	\item PTX code is generated and compiled using CUDA.jl (essentially via the driver API) and then executed.
	\item Memory access: global memory and register management, especially register management.
\end{itemize}
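To make the expression-processing step concrete, the following is a minimal, hypothetical sketch (not the actual ExpressionProcessing.jl code) of how a Julia \texttt{Expr} could be flattened into a postfix token sequence; the function names are illustrative assumptions:

```julia
# Hypothetical sketch: flatten a Julia Expr into postfix (reverse Polish) order.
# Operands are emitted before their operator, so a simple stack machine can
# later evaluate the tokens without recursion.
function to_postfix!(tokens::Vector{Any}, ex)
    if ex isa Expr && ex.head == :call
        for arg in ex.args[2:end]    # visit the operands first
            to_postfix!(tokens, arg)
        end
        push!(tokens, ex.args[1])    # then emit the operator, e.g. :+
    else
        push!(tokens, ex)            # variable symbol or numeric constant
    end
    return tokens
end

to_postfix(ex) = to_postfix!(Any[], ex)

# to_postfix(:(x + 2 * y)) yields [:x, 2, :y, :*, :+]
```

A postfix (or prefix) token list is what makes both the interpreter and the transpiler straightforward: neither has to walk a tree on the GPU.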
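The fixed-size-stack constraint of the interpreter can be illustrated with a sketch of a CUDA.jl kernel. This is an assumption-laden illustration, not the thesis implementation: the token encoding, the stack bound of 16, and all names are invented for this example, and \texttt{MVector} comes from the StaticArrays package:

```julia
using CUDA, StaticArrays

# Hypothetical token encoding: opcode > 0 selects an operator, opcode == 0
# pushes the accompanying constant, and opcode == -v loads variable v.
const OP_ADD, OP_MUL = 1, 2

function interpret_kernel!(results, opcodes, values, vars)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    idx > size(vars, 1) && return

    # Kernels cannot allocate memory dynamically, so the evaluation stack
    # must have a fixed size known at compile time (16 is an assumed bound).
    stack = MVector{16, Float32}(undef)
    top = 0
    for i in eachindex(opcodes)
        op = opcodes[i]
        if op == OP_ADD
            stack[top - 1] += stack[top]; top -= 1
        elseif op == OP_MUL
            stack[top - 1] *= stack[top]; top -= 1
        elseif op == 0
            top += 1; stack[top] = values[i]      # push constant
        else                                      # op < 0: load variable -op
            top += 1; stack[top] = vars[idx, -op]
        end
    end
    results[idx] = stack[1]
    return
end

# Launched as e.g.: @cuda threads=256 blocks=cld(n, 256) interpret_kernel!(...)
# CUDA.jl compiles this kernel to PTX before execution.
```

One thread evaluates the expression for one data row; all arrays reside in global memory, matching the note that only global memory is currently used.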
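For the transpiler back end, loading hand-generated PTX can be sketched with CUDA.jl's driver-API wrappers. The cache key, kernel name, and launch signature below are assumptions for illustration only:

```julia
using CUDA

# Hypothetical back-end cache: reuse the compiled kernel when the same
# expression (keyed here by its PTX source) is transpiled again.
const KERNEL_CACHE = Dict{String, CuFunction}()

function load_kernel(ptx::String, entry::String)
    get!(KERNEL_CACHE, ptx) do
        mod = CuModule(ptx)        # the driver JIT-compiles the PTX source
        CuFunction(mod, entry)     # look up the kernel's entry point
    end
end

# Assumed usage; the parameter types must match the generated kernel:
# fun = load_kernel(generated_ptx, "expr_kernel")
# cudacall(fun, Tuple{CuPtr{Float32}, Int32}, d_results, Int32(n);
#          threads = 256, blocks = cld(n, 256))
```

Caching the compiled \texttt{CuFunction} avoids repeating the driver's JIT step when the same expression recurs, which is what the "caching of back-end results" note refers to.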