\chapter{Implementation} \label{cha:implementation}

Somewhere in this chapter, explain why one kernel is generated per expression rather than one kernel for all expressions. Go into the details of why this implementation is tuned towards performance and argue why it should be the optimum in that regard.

\section{Technologies}
Short section: CUDA, PTX, Julia and CUDA.jl. Probably reference the performance evaluation papers for Julia and CUDA.jl.

\section{Expression Processing}
Explain why this step is needed and how it is done. The why is essentially that it simplifies the evaluation and transpilation process; the how is implemented in ExpressionProcessing.jl.

\section{Interpreter}
Describe how the interpreter has been developed. Include a UML activity diagram of the main loop. The kernel is compiled by CUDA.jl into PTX and then executed. Cover memory access (currently global memory only) and the fact that, unlike on the CPU, no dynamic memory allocation is possible, so the evaluation stack needs to have a fixed size.

\section{Transpiler}
Describe how the transpiler has been developed (probably the largest section, because it simply has more interesting parts). Include a UML activity diagram of the front end and the back end, and describe the caching of back-end results. The generated PTX code is compiled using CUDA.jl (essentially through the CUDA driver) and then executed; a minimal launch sketch is included below. Cover memory access (global memory) and register management, especially register management.
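The following is a minimal sketch of how a PTX string produced by the transpiler's back end could be loaded and launched through the driver API exposed by CUDA.jl. The kernel name \verb|evaluate_expr|, the parameter layout and the launch configuration are illustrative placeholders, not the project's actual generated code.

\begin{verbatim}
using CUDA

# Stand-in for PTX emitted by the back end; the real instruction body
# would be generated from the processed expression.
ptx = """
.version 6.0
.target sm_61
.address_size 64

.visible .entry evaluate_expr(
    .param .u64 results,
    .param .u64 variables
)
{
    // generated evaluation instructions would go here
    ret;
}
"""

md  = CuModule(ptx)                     # the driver JIT-compiles the PTX
fun = CuFunction(md, "evaluate_expr")   # look up the kernel entry point

n         = 1024
results   = CUDA.zeros(Float32, n)
variables = CUDA.rand(Float32, n)

# Launch via the driver API; CUDA.jl converts the CuArrays to device pointers.
cudacall(fun, Tuple{CuPtr{Float32}, CuPtr{Float32}},
         results, variables;
         threads = 256, blocks = cld(n, 256))

synchronize()
\end{verbatim}

Loading the module once and reusing the resulting \verb|CuFunction| would fit the note on caching back-end results, since the driver-side JIT step then only has to run the first time an expression is encountered.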