\chapter{Implementation} \label{cha:implementation}

Somewhere in this chapter, explain why one kernel is generated per expression rather than one kernel for all expressions. Go into the details of why this implementation is tuned towards performance and argue why it should be the optimum in that regard.

\section{Technologies}
Short section: CUDA, PTX, Julia and CUDA.jl. Probably reference the performance evaluation papers for Julia and CUDA.jl.

\section{Expression Processing}
Explain why this step is needed and how it is done. The why is essentially that it simplifies the evaluation and transpilation process; the how is implemented in ExpressionProcessing.jl.

\section{Interpreter}
Describe how the interpreter has been developed. Include a UML activity diagram of the main loop. The kernel is compiled by CUDA.jl into PTX and then executed. Cover memory access (currently global memory only) and the fact that, unlike on the CPU, no dynamic memory allocation is possible, so the evaluation stack needs to have a fixed size.

\section{Transpiler}
Describe how the transpiler has been developed (probably the largest section, because it simply has more interesting parts). Include a UML activity diagram of the front end and the back end, and describe the caching of back-end results. The generated PTX code is compiled using CUDA.jl (essentially through the CUDA driver) and then executed; a minimal launch sketch is included below. Cover memory access (global memory) and register management, especially register management.
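The following is a minimal sketch of how a PTX string produced by the transpiler's back end could be loaded and launched through the driver API exposed by CUDA.jl. The kernel name \verb|evaluate_expr|, the parameter layout and the launch configuration are illustrative placeholders, not the project's actual generated code.

\begin{verbatim}
using CUDA

# Stand-in for PTX emitted by the back end; the real instruction body
# would be generated from the processed expression.
ptx = """
.version 6.0
.target sm_61
.address_size 64

.visible .entry evaluate_expr(
    .param .u64 results,
    .param .u64 variables
)
{
    // generated evaluation instructions would go here
    ret;
}
"""

md  = CuModule(ptx)                     # the driver JIT-compiles the PTX
fun = CuFunction(md, "evaluate_expr")   # look up the kernel entry point

n         = 1024
results   = CUDA.zeros(Float32, n)
variables = CUDA.rand(Float32, n)

# Launch via the driver API; CUDA.jl converts the CuArrays to device pointers.
cudacall(fun, Tuple{CuPtr{Float32}, CuPtr{Float32}},
         results, variables;
         threads = 256, blocks = cld(n, 256))

synchronize()
\end{verbatim}

Loading the module once and reusing the resulting \verb|CuFunction| would fit the note on caching back-end results, since the driver-side JIT step then only has to run the first time an expression is encountered.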