implementation: started writing impl; finished technology section
parent 90a4194283
commit 210831146a
@ -1,5 +1,3 @@
RE-READ to ensure the concepts explain why this is done to improve performance and why this should be the "locally best" implementation (most of this should be in the implementation chapter though)
\chapter{Concept and Design}
\label{cha:conceptdesign}
% Introduction to what needs to be done. Also clarify the terms "Host" and "Device" here.
@ -29,7 +27,9 @@ The main goal of both prototypes or evaluators is to provide a speed-up compared
With this, the required capabilities are outlined. However, for a better understanding, the input and output data need to be explained further. The first input contains the expressions that need to be evaluated. These can be of any length and can contain constant values, variables and parameters, all of which are combined using the supported operations. In the simplified example shown in Figure \ref{fig:input_output_explanation}, there are six expressions $e_1$ to $e_6$.
Next is the variable matrix. Each entry in this matrix holds the value of one variable. The row indicates which variable the entry parameterises; for example, the values in row three are used to parameterise the variable $x_3$ in every expression. Each column holds a different set of variables, and each expression must be evaluated using each variable set. In the provided example, there are three variable sets, each containing the values for the four variables $x_1$ to $x_4$.
After all expressions have been evaluated using all variable sets, the results of these evaluations must be stored in the result matrix. Each entry in this matrix holds the result of the evaluation of one expression parameterised with one variable set. The row indicates the variable set and the column indicates the expression.
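To make this layout concrete, the following minimal Julia sketch mirrors the example above (all names and values are hypothetical, chosen only for illustration):
\begin{verbatim}
# 4 variables (rows) x 3 variable sets (columns); row i holds the
# values for variable x_i and each column is one variable set.
variables = Float32[1.0  2.0  3.0;
                    0.5  1.5  2.5;
                    2.0  0.1  1.0;
                    3.5  4.0  0.8]

# 3 variable sets (rows) x 6 expressions (columns); results[s, e]
# holds the value of expression e evaluated with variable set s.
results = Matrix{Float32}(undef, 3, 6)
\end{verbatim}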
The prototypes developed in this thesis are part of a GP algorithm for symbolic regression. The expressions that are evaluated represent parts of the search space of all expressions that can be built from any combination of allowed operators, the set of input variables, a set of parameters and constants. As a consequence, the size of the search space grows exponentially. Exploring this search space by simply generating expressions, evaluating them once and then generating the next set of expressions leaves much of it unexplored. To combat this, parameters are introduced, which allow the algorithm to perform a kind of local search. To enable this, the prototypes must support not only variables, but also parameters.
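As a minimal illustration of such an expression (the concrete representation used by the prototypes may differ), an expression combining variables, a parameter and a constant could look as follows in Julia:
\begin{verbatim}
# Hypothetical expression e_i: variables x[1], x[2] come from one
# variable set, p[1] is a parameter tuned by the local search and
# 2.5 is a constant.
expr = :(x[1] * (x[2] + p[1]) - 2.5)
\end{verbatim}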
@ -7,6 +7,9 @@ talk again how a typical input is often not complex enough (basically repeat tha
\section{Future Work}
talk about what can be improved
Transpiler:
1.) Transpile the expression directly from the Julia AST -> would save time because no intermediate representation needs to be created (loses a step and gains performance, but also makes the transpiler itself more complex)
2.) A better register management strategy might be helpful -> look into register pressure etc.
CPU Interpreter: Probably more worthwhile to dive into parallelising the CPU interpreter itself (not really future work, as you wouldn't write a paper about that)
@ -1,17 +1,32 @@
\chapter{Implementation}
\label{cha:implementation}
somewhere in here explain why one kernel per expression and not one kernel for all expressions
This chapter focuses on the implementation phase of the project, building upon the concepts and designs previously discussed. It begins with an overview of the technologies employed for both the CPU and GPU parts of the application. This is followed by a description of the pre-processing or frontend phase. The chapter concludes with a detailed overview of the core components, the interpreter and the transpiler.
% Go into the details of why this implementation is tuned towards performance and why it should be the optimum in that regard.
\section{Technologies}
This section describes the technologies used for both the CPU side of the prototypes and the GPU side. The rationale behind these choices, including consideration of their performance implications, is presented. In addition, the hardware limitations imposed by the choice of GPU technology are outlined.
\subsection{CPU side}
Both prototypes were implemented using the Julia programming language. It was chosen mainly because the current symbolic regression algorithm is also implemented in Julia. Being a high-level programming language with modern features such as a garbage collector, support for meta-programming and dynamic typing, it also offers great convenience to the developer.
More interesting, however, is the high performance that can be achieved with this language. This performance is possible despite the supported modern features, which are often deemed harmful to performance. \textcite{bezanson_julia_2017} have shown how Julia can provide C-like performance while supporting the developer with modern quality-of-life features. The ability of Julia to be used in high performance computing scenarios and to be competitive with C has been demonstrated by \textcite{lin_comparing_2021}. This shows that Julia is a good and valid choice for scenarios where both developer comfort and C-like performance are needed.
\subsection{GPU side}
In addition to a programming language for the CPU, a method for programming the GPU is also required. For this purpose, the CUDA API was chosen. While CUDA offers robust capabilities, it is important to note that it is exclusively compatible with Nvidia GPUs. An alternative would have been OpenCL, which provides broader compatibility by supporting GPUs from Nvidia, AMD and Intel. However, considering Nvidia's significant market share and the widespread adoption of CUDA in the industry, the decision was made to use CUDA.
A typical CUDA program is primarily written in C++. Nvidia provides the CUDA compiler nvcc\footnote{\url{https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/}} for C and C++, and the official CUDA programming guide \parencite{nvidia_cuda_2025} also uses C++ for its code examples. It is also possible to call C++ code from within Julia. This would allow writing the kernel and interacting with the GPU in C++, leveraging the knowledge built up over several years.
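As an illustration, calling into such a C++ code base from Julia could look roughly like the following sketch (the library name and function signature are assumptions; the function would need to be exported with C linkage):
\begin{verbatim}
# Hypothetical: calls double evaluate(double) exported with
# extern "C" from a shared library named libkernels.
result = @ccall "libkernels".evaluate(3.0::Cdouble)::Cdouble
\end{verbatim}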
\subsubsection{CUDA and Julia}
Instead of writing the kernel in C++ and calling it from Julia, a much simpler and similarly effective alternative can be used. The Julia package CUDA.jl\footnote{\url{https://cuda.juliagpu.org/}} enables a developer to write a kernel in Julia, similar to how a kernel is written in C++ with CUDA. One drawback of CUDA.jl, however, is that it is much newer than CUDA and therefore does not have years of testing and bug fixing in its history, which might be a concern for some applications. Apart from writing kernels, CUDA.jl also offers a way of interacting with the driver to compile PTX code into machine code. This is a must-have feature, as without it, it would not have been possible to fully develop the transpiler in Julia.
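To give an impression of how this looks in practice, the following is a minimal sketch of a CUDA.jl kernel and its launch (the kernel body is purely illustrative and not the evaluator developed in this thesis):
\begin{verbatim}
using CUDA

# Illustrative kernel: squares every element of data into out.
function square_kernel!(out, data)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        @inbounds out[i] = data[i] * data[i]
    end
    return nothing
end

data = CUDA.rand(Float32, 1024)
out = CUDA.zeros(Float32, 1024)

# One thread per element, launched in blocks of 256 threads.
@cuda threads=256 blocks=cld(length(out), 256) square_kernel!(out, data)
\end{verbatim}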
Additionally, the JuliaGPU initiative\footnote{\url{https://juliagpu.org/}} offers a collection of packages that enable GPU development for AMD, Intel and Apple hardware, not just for Nvidia. However, CUDA.jl is the most mature of the available implementations, which is another reason why CUDA was chosen instead of, for example, OpenCL.
Again, the question arises whether the performance of CUDA.jl is sufficient for it to be used as an alternative to C++ and CUDA. Performance studies by \textcite{besard_rapid_2019}, \textcite{lin_comparing_2021} and \textcite{faingnaert_flexible_2022} have demonstrated that CUDA.jl provides sufficient performance. They found that in some cases CUDA.jl was even able to perform better than the same algorithm implemented in C and C++. This provides the confidence that Julia alongside CUDA.jl is a good choice for leveraging the performance of GPUs to speed up expression evaluation.
\section{Pre-Processing}
Talk about why this needs to be done and how it is done (the why is basically: it simplifies the evaluation/transpilation process; the how is in ExpressionProcessing.jl; the why is probably not needed because it is explained in concept and design)
\section{Interpreter}
Talk about how the interpreter has been developed.
BIN thesis/main.pdf
Binary file not shown.
@ -1176,7 +1176,7 @@
booktitle = {2005 {IEEE} Congress on Evolutionary Computation},
author = {Gustafson, S. and Burke, E.K. and Krasnogor, N.},
date = {2005-09},
keywords = {Evolutionary computation, Computer science, Concrete, Diversity methods, Genetic programming, Problem-solving},
file = {Full Text PDF:C\:\\Users\\danwi\\Zotero\\storage\\28ZEEUYG\\Gustafson et al. - 2005 - On improving genetic programming for symbolic regression.pdf:application/pdf},
}
@ -1204,7 +1204,7 @@
publisher = {{arXiv}},
author = {Bruneton, J.-P.},
date = {2025-03-24},
keywords = {Computer Science - Symbolic Computation, Computer Science - Neural and Evolutionary Computing, Physics - Data Analysis, Statistics and Probability},
file = {Preprint PDF:C\:\\Users\\danwi\\Zotero\\storage\\9U346ZEV\\Bruneton - 2025 - Enhancing Symbolic Regression with Quality-Diversity and Physics-Inspired Constraints.pdf:application/pdf},
}
@ -1222,3 +1222,37 @@
date = {1999},
langid = {english},
}
@article{bezanson_julia_2017,
title = {Julia: A Fresh Approach to Numerical Computing},
volume = {59},
issn = {0036-1445},
url = {https://epubs.siam.org/doi/10.1137/141000671},
doi = {10.1137/141000671},
shorttitle = {Julia},
pages = {65--98},
number = {1},
journaltitle = {{SIAM} Review},
shortjournal = {{SIAM} Rev.},
author = {Bezanson, Jeff and Edelman, Alan and Karpinski, Stefan and Shah, Viral B.},
date = {2017-01},
file = {Submitted Version:C\:\\Users\\danwi\\Zotero\\storage\\9R4QSU35\\Bezanson et al. - 2017 - Julia A Fresh Approach to Numerical Computing.pdf:application/pdf},
}
@article{faingnaert_flexible_2022,
title = {Flexible Performant {GEMM} Kernels on {GPUs}},
volume = {33},
issn = {1558-2183},
url = {https://ieeexplore.ieee.org/document/9655458},
doi = {10.1109/TPDS.2021.3136457},
abstract = {General Matrix Multiplication or {GEMM} kernels take centre place in high performance computing and machine learning. Recent {NVIDIA} {GPUs} include {GEMM} accelerators, such as {NVIDIA}’s Tensor Cores. Their exploitation is hampered by the two-language problem: it requires either low-level programming which implies low programmer productivity or using libraries that only offer a limited set of components. Because rephrasing algorithms in terms of established components often introduces overhead, the libraries’ lack of flexibility limits the freedom to explore new algorithms. Researchers using {GEMMs} can hence not enjoy programming productivity, high performance, and research flexibility at once. In this paper we solve this problem. We present three sets of abstractions and interfaces to program {GEMMs} within the scientific Julia programming language. The interfaces and abstractions are co-designed for researchers’ needs and Julia’s features to achieve sufficient separation of concerns and flexibility to easily extend basic {GEMMs} in many different ways without paying a performance price. Comparing our {GEMMs} to state-of-the-art libraries {cuBLAS} and {CUTLASS}, we demonstrate that our performance is in the same ballpark of the libraries, and in some cases even exceeds it, without having to write a single line of code in {CUDA} C++ or assembly, and without facing flexibility limitations.},
pages = {2230--2248},
number = {9},
journaltitle = {{IEEE} Transactions on Parallel and Distributed Systems},
author = {Faingnaert, Thomas and Besard, Tim and De Sutter, Bjorn},
urldate = {2025-04-20},
date = {2022-09},
keywords = {Codes, Graphics processing units, graphics processors, high-level programming languages, Instruction sets, Kernel, Libraries, Matrix multiplication, Productivity, Programming},
file = {Full Text PDF:C\:\\Users\\danwi\\Zotero\\storage\\QCJ6LSF3\\Faingnaert et al. - 2022 - Flexible Performant GEMM Kernels on GPUs.pdf:application/pdf},
}