% master-thesis/thesis/chapters/relwork.tex
\chapter{Fundamentals and Related Work}
\label{cha:relwork}
The goal of this chapter is to provide an overview of equation learning, establishing common knowledge of the topic and the problem this thesis aims to solve. The main part of this chapter is split into two parts. The first part explores the fundamentals of general purpose computation on the GPU (GPGPU) as well as research that has been done in this field. The focus lies on how graphics processing units (GPUs) are used to achieve substantial speed-ups and when they can be employed effectively. The second part describes the basics of how interpreters and compilers are built and how they can be adapted to the workflow of programming GPUs.
\section{Equation learning}
% Section describing what equation learning is and why it is relevant for the thesis
Equation learning is a field of research that aims at understanding and discovering equations from a set of data stemming from various fields like mathematics and physics. Data is usually much more abundant, while models often remain elusive. Because of this, generating equations with a computer can more easily lead to discovering equations that describe the observed data. \textcite{brunton_discovering_2016} describe an algorithm that leverages equation learning to discover equations for physical systems. A more literal interpretation of equation learning is demonstrated by \textcite{pfahler_semantic_2020}. They use machine learning to learn the form of equations. Their aim was to simplify the discovery of relevant publications by the equations they use rather than by technical terms, as these may differ between fields of research. However, this kind of equation learning is not relevant for this thesis.
Symbolic regression is a subset of equation learning that specialises in discovering mathematical equations. A lot of research is done in this field. \textcite{keijzer_scaled_2004} and \textcite{korns_accuracy_2011} presented ways of improving the quality of symbolic regression algorithms, making symbolic regression more feasible for problem-solving. Additionally, \textcite{jin_bayesian_2020} proposed an alternative to genetic programming (GP) for use in symbolic regression. Their approach increased the quality of the results noticeably compared to GP alternatives. The first two approaches are more concerned with the quality of the output, while the third is also concerned with interpretability and reducing memory consumption. \textcite{bartlett_exhaustive_2024} also describe an approach that generates simpler and higher-quality equations while being faster than GP algorithms. Heuristics like GP, or neural networks as used by \textcite{werner_informed_2021} in their equation learner, can help with finding good solutions faster, accelerating scientific progress. As these publications show, increasing both the quality of generated equations and the speed at which they are found is a central concern in symbolic regression and equation learning in general. Consequently, research on improving the computational performance of these algorithms is desirable.
The expressions generated by an equation learning algorithm can look like this: $x_1 + 5 - \text{abs}(p_1) * \text{sqrt}(x_2) / 10 + 2^3$. They consist of unary and binary operators as well as constants, variables and parameters; expressions mostly differ in their length and the kinds of terms they contain. Per iteration, many of these expressions are generated, and in addition, matrices of values for the variables and parameters are created. Each row of the variable matrix corresponds to one instantiation of every expression, and this matrix typically contains many rows. This leads to a drastic increase in the number of instantiated expressions that need to be evaluated. Parameters are simpler, as they can be treated as constants within one iteration but can take a different value in another iteration. This means that parameters do not increase the number of expressions that need to be evaluated. However, the increase in evaluations introduced by the variables remains drastic and therefore increases the algorithm runtime significantly.
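To give a rough, illustrative sense of this workload, assume one iteration generates $N$ expressions and the variable matrix contains $M$ rows (both symbols are introduced here only for this estimate). Since every expression must be evaluated once per row, one iteration requires
\[
N_{\text{eval}} = N \cdot M
\]
evaluations. For example, $N = 10\,000$ expressions and $M = 1\,000$ rows already amount to $10^7$ evaluations per iteration, which is why the computational performance of the evaluation step matters so much.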
\section[GPGPU]{General Purpose Computation on Graphics Processing Units}
\label{sec:gpgpu}
Graphics processing units (GPUs) are commonly used to increase the performance of many different applications. Originally, they were designed to improve performance and visual quality in games. \textcite{dokken_gpu_2005} first described the usage of GPUs for general purpose programming. They have shown how the graphics pipeline can be used for GPGPU programming. Because this approach requires the programmer to understand graphics terminology, it was not an ideal solution. Therefore, Nvidia released CUDA\footnote{\url{https://developer.nvidia.com/cuda-toolkit}} in 2007 with the goal of allowing developers to program GPUs independent of the graphics pipeline and its terminology. A study of the programmability of GPUs with CUDA and the resulting performance has been conducted by \textcite{huang_gpu_2008}. They found that GPGPU programming has potential, even for non-embarrassingly parallel problems. Research is also done in making low-level CUDA development simpler. \textcite{han_hicuda_2011} described a directive-based language to make development simpler and less error-prone while retaining the performance of handwritten code. To drastically simplify CUDA development, \textcite{besard_effective_2019} showed that it is possible to develop with CUDA in the high-level programming language Julia\footnote{\url{https://julialang.org/}} while performing similarly to CUDA code written in C. In a subsequent study, \textcite{lin_comparing_2021} found that high performance computing (HPC) on the CPU and GPU in Julia performs similarly to HPC development in C. This means that Julia can be a viable alternative to Fortran, C and C++ in the HPC field, with the additional benefit of developer comfort, since it is a high-level language with modern features such as garbage collection. \textcite{besard_rapid_2019} have also shown how the combination of Julia and CUDA helps in rapidly developing HPC software. While this thesis in general revolves around CUDA, there also exist alternatives by AMD called ROCm\footnote{\url{https://www.amd.com/de/products/software/rocm.html}} and a vendor-independent alternative called OpenCL\footnote{\url{https://www.khronos.org/opencl/}}.
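To illustrate how this high-level approach looks in practice, the following minimal sketch shows a data-parallel kernel written in Julia with the CUDA.jl package. The kernel and variable names are chosen purely for illustration; only the CUDA.jl primitives (\texttt{@cuda}, \texttt{threadIdx}, \texttt{blockIdx}, \texttt{blockDim}) are part of the actual package interface.
\begin{verbatim}
using CUDA   # Julia package providing the CUDA programming interface

# Each thread squares exactly one element of the input vector.
function square_kernel!(out, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x  # global thread index
    if i <= length(x)              # surplus threads simply do nothing
        @inbounds out[i] = x[i] * x[i]
    end
    return nothing
end

n = 100_000
x = CUDA.rand(Float32, n)          # array allocated directly on the GPU
out = CUDA.zeros(Float32, n)
@cuda threads=256 blocks=cld(n, 256) square_kernel!(out, x)  # kernel launch
\end{verbatim}
Note that no graphics terminology appears anywhere in this code, which is precisely the improvement CUDA brought over the graphics-pipeline-based approach described above.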
While in the early days of GPGPU programming a lot of research was done to assess whether this approach is feasible, it now seems obvious to use GPUs to accelerate algorithms. Weather simulations began using GPUs very early for their models. In 2008, \textcite{michalakes_gpu_2008} proposed a method for simulating weather with the WRF model on a GPU. With their approach, they reached a speed-up of 5 to 20 for the most compute-intensive task, with very little GPU optimisation effort. They also found that GPU usage was very low, meaning there are resources and potential for more detailed simulations. Generally, simulations are great candidates for using GPUs, as they can benefit heavily from a high degree of parallelism and data throughput. \textcite{koster_high-performance_2020} have developed a way of using adaptive time steps to improve the performance of time step simulations while retaining their precision and constraint correctness. Black hole simulations are crucial for science and education and for a better understanding of our world. \textcite{verbraeck_interactive_2021} have shown that simulating complex Kerr (rotating) black holes can be done on consumer hardware in a few seconds. Schwarzschild black hole simulations can be performed in real time with GPUs, as described by \textcite{hissbach_overview_2022}, which is especially helpful for educational scenarios. While both approaches do not reach the accuracy of detailed simulations on supercomputers, they show how single GPUs can deliver useful results at a fraction of the cost. Networking can also heavily benefit from GPU acceleration, as shown by \textcite{han_packetshader_2010}, who achieved significantly higher throughput than with a CPU-only implementation. Finite element structural analysis is an essential tool for many branches of engineering and can also heavily benefit from the usage of GPUs, as demonstrated by \textcite{georgescu_gpu_2013}. However, it needs to be noted that GPUs do not always outperform CPUs, as illustrated by \textcite{lee_debunking_2010}, but they can nonetheless lead to performance improvements.
\subsection{Programming GPUs}
% This part now starts taking about architecture and how to program GPUs
The development process for a GPU is vastly different from that for a CPU. A CPU has tens or hundreds of complex cores, with the AMD Epyc 9965\footnote{\url{https://www.amd.com/en/products/processors/server/epyc/9005-series/amd-epyc-9965.html}} having a staggering $192$ cores and twice as many threads. A guide for a simple one-core 8-bit CPU has been published by \textcite{schuurman_step-by-step_2013}. He describes the many different and complex parts of a CPU core. Modern CPUs are even more complex, with dedicated fast integer and floating-point arithmetic units as well as logic units, sophisticated branch prediction and much more. This makes a CPU perfect for handling complex control flow on a single program thread, and modern CPUs can even handle multiple threads simultaneously. However, as seen in section \ref{sec:gpgpu}, this is often not enough. A GPU, on the other hand, contains thousands or even tens of thousands of cores. For example, the GeForce RTX 5090\footnote{\url{https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/}} contains a total of $21760$ CUDA cores. To achieve this enormous core count, a single GPU core has to be much simpler than a single CPU core. As described by \textcite{nvidia_cuda_2024}, a GPU devotes many more transistors to floating-point computations. This results in less efficient integer arithmetic and control flow handling. There is also less cache available per core, and clock speeds are usually much lower than those of a CPU. An overview of the differences between a CPU and a GPU architecture can be seen in figure \ref{fig:cpu_vs_gpu}.
\begin{figure}
\centering
\includegraphics[width=1\textwidth]{nvidia_cpu_vs_gpu.png}
\caption{Overview of the architecture of a CPU (left) and a GPU (right). Note the higher number of simpler cores on the GPU \parencite{nvidia_cuda_2024}.}
\label{fig:cpu_vs_gpu}
\end{figure}
Despite these drawbacks, the sheer number of cores makes a GPU a valid choice when considering improving the performance of an algorithm. Because of the high number of cores, GPUs are best suited for data-parallel scenarios. This is due to the SIMD architecture of these cards. SIMD stands for Single-Instruction Multiple-Data and means that a single stream of instructions is executed on a huge number of data streams. \textcite{franchetti_efficient_2005} and \textcite{tian_compiling_2012} describe ways of using SIMD instructions on the CPU. Their approaches led to noticeable speed-ups of 3.3 and 4.7, respectively, compared to serial computations. Extending this principle to GPUs, which are specifically built for SIMD/data-parallel calculations, shows why they are so powerful despite having less complex and slower cores than a CPU.
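As a minimal CPU-side sketch of this principle in Julia, the following loop is annotated with the \texttt{@simd} macro from Julia's standard library, which permits (but does not guarantee) the compiler to vectorise the loop, applying one instruction to several elements at once; the function itself is illustrative.
\begin{verbatim}
# y = y + a*x, written so the compiler may vectorise the loop:
# one instruction stream, applied to multiple data elements per step.
function axpy!(y, a, x)
    @inbounds @simd for i in eachindex(x, y)
        y[i] += a * x[i]
    end
    return y
end

axpy!(zeros(Float32, 1024), 2.0f0, ones(Float32, 1024))
\end{verbatim}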
While the concepts of GPGPU programming are the same regardless of the GPU used, the terminology used by AMD in its GPUs and guides differs from the CUDA terminology. As previously stated, this thesis uses the terminology and concepts as described by Nvidia in its CUDA programming guide.
The many cores on a GPU, also called threads, are grouped together on several levels. On the hardware side, a streaming multiprocessor (SM) is responsible for scheduling and executing threads and also contains the registers used by these threads. An SM always executes a group of 32 threads simultaneously, and this group is called a warp. The number of threads that can be started is virtually unlimited. However, threads need to be grouped into blocks, with one block usually containing a maximum of 1024 threads. Therefore, if more than 1024 threads are required, more blocks must be created. All threads in one block have access to shared memory, which can be used as an L1 cache or for communication between threads. It is important that blocks can be scheduled independently of one another, with no dependencies between them. This allows the scheduler to schedule blocks and threads as efficiently as possible. All threads within a warp are guaranteed to be part of the same block and are therefore executed simultaneously \parencite{nvidia_cuda_2024}.
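The following sketch, again assuming Julia with CUDA.jl, illustrates this hierarchy: a grid of blocks is launched, each block of 256 threads cooperates through shared memory, and \texttt{sync\_threads} acts as a barrier within one block. All names except the CUDA.jl primitives are illustrative.
\begin{verbatim}
using CUDA

# Block-wise sum: each block reduces its 256 elements in shared memory
# and writes one partial sum per block.
function block_sum_kernel!(partial, x)
    tid = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + tid
    shmem = CuStaticSharedArray(Float32, 256)   # visible to the whole block
    shmem[tid] = i <= length(x) ? x[i] : 0.0f0  # load one element (or neutral 0)
    sync_threads()                              # barrier: all loads must finish
    stride = div(blockDim().x, 2)
    while stride >= 1                           # tree reduction in shared memory
        if tid <= stride
            shmem[tid] += shmem[tid + stride]
        end
        sync_threads()
        stride = div(stride, 2)
    end
    if tid == 1
        partial[blockIdx().x] = shmem[1]        # one result per block
    end
    return nothing
end

n = 4096
x = CUDA.ones(Float32, n)
partial = CUDA.zeros(Float32, cld(n, 256))
@cuda threads=256 blocks=cld(n, 256) block_sum_kernel!(partial, x)
\end{verbatim}
Because the blocks communicate only through their own partial results, they satisfy the independence requirement stated above and can be scheduled in any order.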
While all threads in a warp start at the same point in a program, each has its own instruction address, allowing them to work independently. Because of the SIMD architecture, all threads in a warp must execute the same instruction; if threads start to diverge, the SM must pause the threads with different instructions and execute them later. Figure \ref{fig:thread_divergence} shows how such divergence can impact performance. The figure also shows that although the divergent threads could reconverge after the branch, this does not happen, and T2 is only executed after T1 and T3 have finished. In situations where a lot of data-dependent thread divergence happens, most of the benefits of using a GPU are lost.
\begin{figure}
\centering
\includegraphics[width=.8\textwidth]{thread_divergence.png}
\caption{Thread T2 wants to execute instruction B while T1 and T3 want to execute instruction A. Therefore, T2 is inactive during this cycle and becomes active once T1 and T3 have finished, meaning the divergent threads are serialised.}
\label{fig:thread_divergence}
\end{figure}
Threads not executing the same instruction is contrary to the SIMD principle but can happen in practice due to data-dependent branching. Consequently, this leads to poor resource utilisation, which in turn degrades performance. Threads can also be paused (become inactive) when the number of threads started is not divisible by 32. In such cases, the last warp still contains 32 threads, but only those that have work assigned are executed \parencite{nvidia_cuda_2024}.
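The following kernel, assuming the same Julia and CUDA.jl setting as above, shows both effects at once: the outer bounds check masks out the surplus threads of the last warp, while the inner data-dependent branch causes the two branches to be serialised whenever a warp contains both negative and non-negative elements.
\begin{verbatim}
using CUDA

# The branch taken depends on the data, so threads within one warp can
# diverge; the SM then executes both branches one after the other.
function divergent_kernel!(out, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)            # masks surplus threads of the last warp
        if x[i] < 0.0f0
            out[i] = -x[i]       # taken only by threads with negative input
        else
            out[i] = sqrt(x[i])  # taken by all remaining threads
        end
    end
    return nothing
end

n = 1000                         # not divisible by 32: last warp partially masked
x = CUDA.randn(Float32, n)
out = CUDA.zeros(Float32, n)
@cuda threads=256 blocks=cld(n, 256) divergent_kernel!(out, x)
\end{verbatim}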
Modern GPUs implement what is known as the Single-Instruction Multiple-Thread (SIMT) architecture. In many cases a developer does not need to know the details of SIMT and can develop fast and correct programs with just the SIMD architecture in mind. However, leveraging the power of SIMT can yield substantial performance gains by re-converging threads once data-dependent divergence has occurred. A stack-less re-convergence algorithm was proposed by \textcite{collange_stack-less_2011}, who have shown that such approaches help with hardware occupation, resulting in improved performance as threads are no longer fully serialised. Another approach for increasing occupancy under the SIMT architecture was proposed by \textcite{fung_thread_2011}. They introduced a technique for compacting thread blocks by moving divergent threads to new warps until they reconverge. This approach resulted in a noticeable speed-up of between 17\% and 22\%.
% Maybe this can also be used to better explain SIMT: https://rocm.docs.amd.com/projects/HIP/en/latest/understand/programming_model.html#programming-model-simt
% Talk about memory allocation (with the one paper diving into dynamic allocations)
% Memory transfer (with streams potentially)
\subsection[PTX]{Parallel Thread Execution}
% Describe what PTX is to get a common ground for the implementation chapter. Probably a short section
\section{Compilers}
% Maybe even move this entire section to "Concept and Design"?
% Brief overview about compilers (just setting the stage for the subsections basically). Talk about register management and these things.
\subsection{Interpreters}
% What are interpreters; how they work; should mostly contain/reference gpu interpreters
\subsection{Transpilers}
% Talk about what transpilers are and how to implement them. If possible also gpu specific transpilation.