related work: slight restructuring; continued with section programming gpus
This commit is contained in:
parent
4e48686b62
commit
fddfa23b4f
1
thesis/.vscode/ltex.dictionary.en-GB.txt
vendored
@ -1,2 +1,3 @@
CUDA
GPGPU
SIMT
@ -1,6 +1,6 @@
\chapter{Fundamentals and Related Work}
\label{cha:relwork}
The goal of this chapter is to provide an overview of equation learning to establish common knowledge of the topic and problem this thesis is trying to solve. The main part of this chapter is split into two parts. The first part is exploring research that has been done in the field of general purpose computations on the GPU (GPGPU) as well as the fundamentals of it. Focus lies on exploring how graphics processing units (GPUs) are used to achieve substantial speed-ups and when they can be effectively employed. The second part describes the basics of how interpreters and compilers are built and how they can be adapted to the workflow of programming GPUs.
The goal of this chapter is to provide an overview of equation learning and to establish common knowledge of the topic and the problem this thesis is trying to solve. First, the field of equation learning is explored, which helps to contextualise the topic of this thesis. The main part of this chapter is split into two sub-parts. The first part explores the research that has been done in the field of general-purpose computations on the GPU (GPGPU) as well as its fundamentals. The focus lies on how graphics processing units (GPUs) are used to achieve substantial speed-ups and when and where they can be employed effectively. The second part describes the basics of how interpreters and compilers are built and how they can be adapted to the workflow of programming GPUs. When discussing GPU programming concepts, the terminology used is that of Nvidia and may differ from that used for AMD GPUs.
\section{Equation learning}
% Section describing what equation learning is and why it is relevant for the thesis
@ -18,8 +18,7 @@ Graphics cards (GPUs) are commonly used to increase the performance of many diff
While in the early days of GPGPU programming a lot of research was done to assess whether this approach is feasible, it now seems obvious to use GPUs to accelerate algorithms. Weather simulations began using GPUs very early for their models. In 2008 \textcite{michalakes_gpu_2008} proposed a method for simulating weather with the WRF model on a GPU. With their approach, they reached a speed-up of 5 to 20 for the most compute-intensive task, with very little GPU optimisation effort. They also found that GPU usage was very low, meaning there are resources and potential for more detailed simulations. Generally, simulations are great candidates for using GPUs, as they can benefit heavily from a high degree of parallelism and data throughput. \textcite{koster_high-performance_2020} have developed a way of using adaptive time steps to improve the performance of time step simulations, while retaining their precision and constraint correctness. Black hole simulations are crucial for science and education for a better understanding of our world. \textcite{verbraeck_interactive_2021} have shown that simulating complex Kerr (rotating) black holes can be done on consumer hardware in a few seconds. Schwarzschild black hole simulations can be performed in real time with GPUs as described by \textcite{hissbach_overview_2022}, which is especially helpful for educational scenarios. While both approaches do not reach the accuracy of detailed simulations on supercomputers, they show that a single GPU can yield sufficiently accurate results at a fraction of the cost. Networking can also heavily benefit from GPU acceleration as shown by \textcite{han_packetshader_2010}, where they achieved a significantly higher throughput than with a CPU-only implementation. Finite element structural analysis is an essential tool for many branches of engineering and can also heavily benefit from the usage of GPUs as demonstrated by \textcite{georgescu_gpu_2013}. However, it also needs to be noted that GPUs do not always outperform CPUs, as illustrated by \textcite{lee_debunking_2010}, but they can still lead to performance improvements nonetheless.
\subsection{Programming GPUs}
% This part now starts talking about architecture and how to program GPUs
The development process on a GPU is vastly different from a CPU. A CPU has tens or hundreds of complex cores with the AMD Epyc 9965\footnote{\url{https://www.amd.com/en/products/processors/server/epyc/9005-series/amd-epyc-9965.html}} having a staggering $192$ cores and twice as many threads. A guide for a simple one core 8-bit CPU has been published by \textcite{schuurman_step-by-step_2013}. He describes the many different and complex parts of a CPU core. Modern CPUs are even more complex, with dedicated fast integer, floating-point arithmetic gates as well as logic gates, sophisticated branch prediction and much more. This makes a CPU perfect for handling complex control flows on a single program strand and on modern CPUs even multiple strands simultaneously. However, as seen in section \ref{sec:gpgpu}, this often isn't enough. On the other hand, a GPU contains thousands or even tens of thousands of cores. For example, the GeForce RTX 5090\footnote{\url{https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/}} contains a total of $21760$ CUDA cores. To achieve this enormous core count a single GPU core has to be much simpler than one CPU core. As described by \textcite{nvidia_cuda_2024} a GPU designates much more transistors towards floating-point computations. This results in less efficient integer arithmetic and control flow handling. There is also less Cache available per core and clock speeds are usually also much lower than those on a CPU. An overview of the differences of a CPU and a GPU architecture can be seen in figure \ref{fig:cpu_vs_gpu}.
The development process on a GPU is vastly different from that on a CPU. A CPU has tens or hundreds of complex cores, with the AMD Epyc 9965\footnote{\url{https://www.amd.com/en/products/processors/server/epyc/9005-series/amd-epyc-9965.html}} having a staggering $192$ of those complex cores and twice as many threads. A guide for a simple one-core 8-bit CPU has been published by \textcite{schuurman_step-by-step_2013}. He describes the many different and complex parts of a CPU core. Modern CPUs are even more complex, with dedicated fast integer and floating-point arithmetic gates as well as logic gates, sophisticated branch prediction and much more. This makes a CPU perfect for handling complex control flows on a single program strand and, on modern CPUs, even multiple strands simultaneously. However, as seen in section \ref{sec:gpgpu}, this often isn't enough. On the other hand, a GPU contains thousands or even tens of thousands of cores. For example, the GeForce RTX 5090\footnote{\url{https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/}} contains a total of $21760$ CUDA cores. To achieve this enormous core count, a single GPU core has to be much simpler than a single CPU core. As described by \textcite{nvidia_cuda_2024}, a GPU dedicates many more transistors to floating-point computations. This results in less efficient integer arithmetic and control flow handling. There is also less cache available per core, and clock speeds are usually much lower than those of a CPU. An overview of the differences between a CPU and a GPU architecture can be seen in figure \ref{fig:cpu_vs_gpu}.
\begin{figure}
\centering
@ -30,11 +29,21 @@ The development process on a GPU is vastly different from a CPU. A CPU has tens
Despite these drawbacks, the sheer number of cores makes a GPU a valid choice when it comes to improving the performance of an algorithm. Because of the high number of cores, GPUs are best suited for data-parallel scenarios. This is due to the SIMD architecture of these cards. SIMD stands for Single-Instruction Multiple-Data and means that there is a single stream of instructions that is executed on a huge number of data streams. \textcite{franchetti_efficient_2005} and \textcite{tian_compiling_2012} describe ways of using SIMD instructions on the CPU. Their approaches lead to noticeable speed-ups of 3.3 and 4.7 respectively by using SIMD instructions instead of serial computations. Extending this to GPUs, which are specifically built for SIMD/data-parallel calculations, shows why they are so powerful despite having less complex and slower cores than a CPU.
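To illustrate this data-parallel style of computation, the following minimal sketch contrasts a sequential loop with a CUDA kernel performing the same element-wise operation. The function and variable names are chosen purely for illustration; the kernel syntax and the index calculation are explained in the following sections.

\begin{verbatim}
// Sequential formulation: a single instruction stream works through
// the data one element at a time.
void scaleSequential(const float *in, float *out, int n, float factor) {
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * factor;
}

// Data-parallel formulation: the same instruction stream is executed by
// many GPU threads at once, each applying it to a different element.
__global__ void scaleParallel(const float *in, float *out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // each thread selects its own element
    if (i < n)                                     // guard against surplus threads
        out[i] = in[i] * factor;
}
\end{verbatim}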
While the concepts of GPGPU programming are the same no matter which GPU is used, the terminology used by AMD in their guides differs from the CUDA terminology. As previously stated, this thesis uses the terminology and concepts as described by Nvidia in their CUDA programming guide.
\subsubsection{Thread Hierarchy and Tuning}
The thousands of cores on a GPU, also called threads, are grouped together in several categories. This is the thread hierarchy of GPUs. The developer can influence this grouping to a degree, which allows them to tune their algorithm for optimal performance. In order to develop a well-performing algorithm, it is necessary to know how this grouping works. Tuning the grouping is unique to each algorithm and also depends on the GPU used, which means it is important to test many different configurations to achieve the best possible result. This section aims to explore the thread hierarchy and how it can be tuned to fit an algorithm.
The many cores on a GPU, also called threads, are grouped together in several categories. On the lowest level exists a streaming multiprocessor (SM) which is a hardware unit responsible for scheduling and executing threads and also contains the registers used by the threads. One SM is always executing a group of 32 threads simultaneously and this group is called a warp. The number of threads that can be started is virtually unlimited. However, threads need to be grouped in a block, with one block usually containing a maximum of 1024 threads. Therefore, if more than 1024 threads are needed, more blocks need to be created. All threads in one block have access to some shared memory, which can be used for L1-caching or communication between threads. It is important that the blocks can be scheduled independently from on another with no dependencies between them. This allows the scheduler to schedule blocks and threads as efficiently as possible. All threads within a warp are ensured to be part of the same block and therefore executed simultaneously \parencite{nvidia_cuda_2024}.
At the lowest level of a GPU exists a Streaming Multiprocessor (SM), which is a hardware unit responsible for scheduling and executing threads and which also contains the registers used by these threads. An SM always executes a group of 32 threads simultaneously, and this group is called a warp. The number of threads that can be started is virtually unlimited. However, threads must be grouped into blocks, with one block containing a maximum of 1024 threads but often being configured to contain fewer. Therefore, if more than 1024 threads are required, more blocks must be created. Blocks can also optionally be grouped into thread block clusters, which can be useful in certain scenarios. All thread blocks or thread block clusters are part of a grid, which manifests as a dispatch of the code run on the GPU, also called a kernel \parencite{amd_hip_2025}. All threads in one block have access to some shared memory, which can be used for L1 caching or communication between threads. It is important that the blocks can be scheduled independently, with no dependencies between them. This allows the scheduler to schedule blocks and threads as efficiently as possible. All threads within a warp are guaranteed to be part of the same block, and are therefore executed simultaneously and can access the same memory addresses. Figure \ref{fig:thread_hierarchy} depicts how threads in a block are grouped into warps for execution and how they share memory.
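To make these terms more concrete, the following minimal sketch shows how a single thread can locate itself within this hierarchy. It assumes a one-dimensional launch with 256 threads per block; all names are illustrative only.

\begin{verbatim}
__global__ void hierarchyDemo(float *out) {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x; // unique index within the grid
    int warpId   = threadIdx.x / 32;                      // which warp of the block the thread belongs to
    int laneId   = threadIdx.x % 32;                      // position of the thread inside its warp

    __shared__ float blockBuffer[256]; // shared memory, visible to all threads of the block
    blockBuffer[threadIdx.x] = (float)laneId;
    __syncthreads();                   // wait until every thread of the block has written

    out[globalId] = blockBuffer[threadIdx.x] + (float)warpId;
}

// Host side: a grid of 8 blocks with 256 threads each, i.e. 8 warps per block.
// hierarchyDemo<<<8, 256>>>(d_out);
\end{verbatim}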
While all threads in a warp start at the same point in a program, they have their own instruction address, allowing them to work independently. Because of the SIMD architecture, all threads in a warp must execute the same instructions and if threads start diverging, the SM must pause threads with different instructions and execute them later. Figure \ref{fig:thread_divergence} shows how such divergences can impact performance. The situation described by the figure also shows, that after the divergent thread would reconverge, this does not happen and leads to T2 being executed after T1 and T3 are finished. In situations where a lot of data dependent thread divergence happens, most of the benefits of using a GPU have vanished.
\begin{figure}
\centering
\includegraphics[width=.8\textwidth]{thread_hierarchy.png}
\caption{An overview of the thread hierarchy with blocks being split into multiple warps and their shared memory \parencite{amd_hip_2025}.}
\label{fig:thread_hierarchy}
\end{figure}
A piece of code that is executed on a GPU is written as a kernel, which can be configured. The most important configuration is how threads are grouped into blocks. The GPU allows threads, blocks and block clusters to be arranged in up to three dimensions. This is often useful because of the already mentioned shared memory, which will be explained in more detail in section \ref{sec:memory_model}. Considering the case where an image needs to be blurred, arranging the threads in a 2D grid not only simplifies development, it also helps with optimising memory access. As the threads in a block need to access a lot of the same data, this data can be loaded into the shared memory of the block. This allows the data to be accessed much faster compared to when threads are arranged in only one dimension. With one-dimensional blocks it is possible that threads assigned to nearby pixels are part of different blocks, leading to a lot of duplicate data transfers.
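A minimal sketch of such a two-dimensional configuration for the blurring example is shown below. It assumes image dimensions that are multiples of 16, uses purely illustrative names and omits the actual blur computation.

\begin{verbatim}
__global__ void blurKernel(const float *in, float *out, int width) {
    __shared__ float tile[16][16];                  // pixels shared by all threads of the block
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // pixel coordinates handled by this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * width + x]; // one global memory load per pixel
    __syncthreads();                                    // the whole tile is now in shared memory

    // ... compute the blurred value from neighbouring entries of tile ...
    out[y * width + x] = tile[threadIdx.y][threadIdx.x];
}

// Host side: threads arranged as 16x16 tiles, one tile (block) per image region.
// dim3 threadsPerBlock(16, 16);
// dim3 numBlocks(width / 16, height / 16);
// blurKernel<<<numBlocks, threadsPerBlock>>>(d_in, d_out, width);
\end{verbatim}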
All threads in a warp start at the same point in a program, but they have their own instruction address, allowing them to work independently. Because of the SIMD architecture, all threads in a warp must execute the same instructions and if threads start diverging, the SM must pause threads with different instructions and execute them later. Figure \ref{fig:thread_divergence} shows how such divergences can impact performance. The situation depicted in the figure also shows that, even at the point where the divergent thread could reconverge with the others, it does not, which leads to T2 being executed only after T1 and T3 have finished. In situations where a lot of data-dependent thread divergence happens, most of the benefit of using a GPU is lost.
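A minimal sketch of such a data-dependent branch is shown below; the two branch bodies merely stand in for arbitrary, more expensive computations.

\begin{verbatim}
__global__ void divergentKernel(const float *data, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Data-dependent branch: threads of the same warp may take different paths.
        // The warp then executes both paths one after the other, with the threads
        // not on the current path being masked off (paused).
        if (data[i] > 0.0f)
            out[i] = sqrtf(data[i]);    // executed first, by part of the warp
        else
            out[i] = data[i] * data[i]; // executed afterwards, by the remaining threads
    }
}
\end{verbatim}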
\begin{figure}
\centering
@ -45,11 +54,16 @@ While all threads in a warp start at the same point in a program, they have thei
Threads not executing the same instruction is against the SIMD principle, but can happen in reality due to data-dependent branching. Consequently, this leads to bad resource utilisation, which in turn leads to worse performance. Another reason for threads being paused (inactive threads) is that sometimes the number of threads started is not divisible by 32. In such cases, the last warp still contains 32 threads, but only the threads with work are executed \parencite{nvidia_cuda_2024}.
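As a small worked example of this effect, assume one thread per element is launched for $1000$ elements with a block size of $256$; the numbers are illustrative only.

\begin{verbatim}
// Ceiling division: how many blocks are needed for n elements?
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
// For n = 1000 and threadsPerBlock = 256 this yields 4 blocks, i.e. 1024 threads,
// 24 more than needed. With a bounds check (if (i < n)) in the kernel, these 24
// threads stay inactive, so the last warp (covering indices 992 to 1023) runs
// with only 8 of its 32 threads doing useful work.
\end{verbatim}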
Modern GPUs implement the so called Single-Instruction Multiple-Thread (SIMT) architecture. In many cases a developer does not need to know the details of SIMT and can develop fast and correct programs with just the SIMD architecture in mind. However, leveraging the power of SIMT can yield substantial performance gains by re-converging threads once data dependent divergence occurred. A proposal for a re-convergence algorithm was proposed by \textcite{collange_stack-less_2011} where they have shown that these approaches help with hardware occupation, resulting in improved performance as threads are now no longer fully serialised. Another approach for increasing occupancy using the SIMT architecture is proposed by \textcite{fung_thread_2011}. They introduced a technique for compacting thread blocks by moving divergent threads to new warps until they reconverge. This approach resulted in a noticeable speed-up between 17\% and 22\%.
Modern GPUs implement the so-called Single-Instruction Multiple-Thread (SIMT) architecture. In many cases a developer does not need to know the details of SIMT and can develop fast and correct programs with just the SIMD architecture in mind. However, leveraging the power of SIMT can yield substantial performance gains by re-converging threads once data-dependent divergence has occurred. A re-convergence algorithm was proposed by \textcite{collange_stack-less_2011}, where they have shown that such approaches help with hardware occupancy, resulting in improved performance as threads are no longer fully serialised. Another approach for increasing occupancy using the SIMT architecture is proposed by \textcite{fung_thread_2011}. They introduced a technique for compacting thread blocks by moving divergent threads to new warps until they reconverge. This approach resulted in a noticeable speed-up between 17\% and 22\%. Another example where a SIMT-aware algorithm can perform better was proposed by \textcite{koster_massively_2020}. While they did not implement techniques for thread re-convergence, they implemented a thread compaction algorithm. On data-dependent divergence it is possible for threads to end early, leaving a warp with only partially active threads. This means the deactivated threads are still occupied and cannot be used for other work. Their thread compaction tackles this problem by moving active threads into a new thread block, releasing the inactive threads to perform other work. With this they were able to gain a speed-up of roughly a factor of four compared to previous implementations.
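The situation that such a compaction scheme addresses can be made visible with warp-level intrinsics. The following minimal sketch only illustrates how a warp can end up partially active after some of its threads terminate early; it does not implement the compaction itself, and the names and the termination condition are purely illustrative.

\begin{verbatim}
__global__ void partialWarpDemo(const float *data, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || data[i] < 0.0f)
        return;                       // some threads of the warp terminate early

    unsigned mask   = __activemask(); // bitmask of the threads still active in this warp
    int      active = __popc(mask);   // number of active lanes, at most 32

    out[i] = (float)active;           // purely illustrative output
}
\end{verbatim}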
% !!! Find an image that can depict what SIMT is !!!
% Maybe this can also be used to better explain SIMT: https://rocm.docs.amd.com/projects/HIP/en/latest/understand/programming_model.html#programming-model-simt
Talk about memory allocation (with the one paper diving into dynamic allocations)
\subsubsection{Memory Model}
\label{sec:memory_model}
On a GPU there are two parts that contribute to the performance of an algorithm. The part already looked at is the compute portion of the GPU. It is important because if threads are serialised or run inefficiently, there is nothing that can make the algorithm execute faster. However, algorithms run on a GPU usually require huge amounts of data to be processed, as they are designed for exactly that purpose. The purpose of this section is to explain how the memory model of the GPU works and how it can influence the performance of an algorithm.
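As a point of reference for the following discussion, the typical allocation and transfer pattern of the CUDA runtime API is sketched below. This is only a minimal example and omits error handling, streams and more advanced mechanisms such as pinned or unified memory.

\begin{verbatim}
#include <cuda_runtime.h>
#include <cstdlib>

void transferExample(int n) {
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes); // memory on the host (CPU)
    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);             // global memory on the GPU

    // ... fill h_data with input data ...
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // host -> device transfer
    // ... launch kernels that operate on d_data ...
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // device -> host transfer

    cudaFree(d_data);
    free(h_data);
}
\end{verbatim}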
Talk about memory model and memory allocation (with the one paper diving into dynamic allocations)
Memory transfer (with streams potentially)
\subsection[PTX]{Parallel Thread Execution}
BIN
thesis/images/thread_hierarchy.png
Normal file
Binary file not shown.
3
thesis/images/thread_hierarchy.svg
Normal file
File diff suppressed because one or more lines are too long
BIN
thesis/main.pdf
Binary file not shown.
@ -538,7 +538,7 @@ Publisher: Multidisciplinary Digital Publishing Institute},
urldate = {2025-03-08},
date = {2005-02},
note = {Conference Name: Proceedings of the {IEEE}},
keywords = {Automatic vectorization, Boosting, Computer aided instruction, Computer applications, Concurrent computing, Digital signal processing, digital signal processing ({DSP}), fast Fourier transform ({FFT}), Kernel, Parallel processing, Registers, short vector single instruction, multiple data ({SIMD}), Signal processing algorithms, Spirals, symbolic vectorization},
keywords = {Concurrent computing, Parallel processing, Automatic vectorization, Boosting, Computer aided instruction, Computer applications, Digital signal processing, digital signal processing ({DSP}), fast Fourier transform ({FFT}), Kernel, Registers, short vector single instruction, multiple data ({SIMD}), Signal processing algorithms, Spirals, symbolic vectorization},
file = {Eingereichte Version:C\:\\Users\\danwi\\Zotero\\storage\\J48HM9VD\\Franchetti et al. - 2005 - Efficient Utilization of SIMD Extensions.pdf:application/pdf;IEEE Xplore Abstract Record:C\:\\Users\\danwi\\Zotero\\storage\\W6PT75CV\\1386659.html:text/html},
}
@ -553,7 +553,7 @@ Publisher: Multidisciplinary Digital Publishing Institute},
author = {Tian, Xinmin and Saito, Hideki and Girkar, Milind and Preis, Serguei V. and Kozhukhov, Sergey S. and Cherkasov, Aleksei G. and Nelson, Clark and Panchenko, Nikolay and Geva, Robert},
urldate = {2025-03-08},
date = {2012-05},
keywords = {Cloning, Compiler, {GPU}, Graphics processing unit, Hardware, Multicore, Parallel processing, Programming, {SIMD}, Vectorization, Vectors},
keywords = {{GPU}, Parallel processing, Cloning, Compiler, Graphics processing unit, Hardware, Multicore, Programming, {SIMD}, Vectorization, Vectors},
file = {IEEE Xplore Abstract Record:C\:\\Users\\danwi\\Zotero\\storage\\HBSGBKT2\\6270606.html:text/html},
}
@ -587,7 +587,7 @@ Publisher: Multidisciplinary Digital Publishing Institute},
urldate = {2025-03-08},
date = {2014-10},
note = {{ISSN}: 2159-3450},
keywords = {Computer architecture, Educational institutions, {GPGPU}, Graphics, Graphics processing units, Instruction sets, Mobile communication, Registers, {SIMT} Architecture, Stream Processor},
keywords = {Graphics processing units, Computer architecture, Graphics, Registers, Educational institutions, {GPGPU}, Instruction sets, Mobile communication, {SIMT} Architecture, Stream Processor},
file = {IEEE Xplore Abstract Record:C\:\\Users\\danwi\\Zotero\\storage\\9B85REHH\\7022313.html:text/html},
}
@ -600,7 +600,7 @@ Publisher: Multidisciplinary Digital Publishing Institute},
author = {Collange, Caroline},
urldate = {2025-03-08},
date = {2011-09},
keywords = {Control-flow reconvergence, {GPU}, {SIMD}, {SIMT}},
keywords = {{GPU}, {SIMD}, Control-flow reconvergence, {SIMT}},
file = {HAL PDF Full Text:C\:\\Users\\danwi\\Zotero\\storage\\M2WPWNXF\\Collange - 2011 - Stack-less SIMT reconvergence at low cost.pdf:application/pdf},
}
@ -616,6 +616,15 @@ Publisher: Multidisciplinary Digital Publishing Institute},
urldate = {2025-03-08},
date = {2011-02},
note = {{ISSN}: 2378-203X},
keywords = {Compaction, Graphics processing unit, Hardware, Instruction sets, Kernel, Pipelines, Random access memory},
keywords = {Pipelines, Kernel, Graphics processing unit, Hardware, Instruction sets, Compaction, Random access memory},
file = {Full Text PDF:C\:\\Users\\danwi\\Zotero\\storage\\TRPWUTI6\\Fung und Aamodt - 2011 - Thread block compaction for efficient SIMT control flow.pdf:application/pdf;IEEE Xplore Abstract Record:C\:\\Users\\danwi\\Zotero\\storage\\LYPYEA8U\\5749714.html:text/html},
}
@online{amd_hip_2025,
title = {{HIP} programming model — {HIP} 6.3.42134 Documentation},
url = {https://rocm.docs.amd.com/projects/HIP/en/latest/understand/programming_model.html#programming-model-simt},
author = {{AMD}},
urldate = {2025-03-09},
date = {2025-03},
file = {HIP programming model — HIP 6.3.42134 Documentation:C\:\\Users\\danwi\\Zotero\\storage\\6KRNU6PG\\programming_model.html:text/html},
}