related work: small continuation of explaining SIMT

This commit is contained in:
2025-03-08 14:12:50 +01:00
parent b683f3ae96
commit 4e48686b62
4 changed files with 53 additions and 3 deletions


@@ -34,18 +34,23 @@ While the concepts of GPGPU programming are the same no matter the GPU used, the
The many cores on a GPU, also called threads, are grouped together on several levels. At the lowest level sits the streaming multiprocessor (SM), a hardware unit responsible for scheduling and executing threads, which also contains the registers used by those threads. An SM always executes a group of 32 threads simultaneously, and this group is called a warp. The number of threads that can be started is virtually unlimited. However, threads need to be grouped into blocks, with one block usually containing a maximum of 1024 threads. Therefore, if more than 1024 threads are needed, more blocks must be created. All threads in one block have access to some shared memory, which can be used for L1 caching or for communication between threads. It is important that blocks can be scheduled independently from one another, with no dependencies between them. This allows the scheduler to schedule blocks and threads as efficiently as possible. All threads within a warp are guaranteed to be part of the same block and are therefore executed simultaneously \parencite{nvidia_cuda_2024}.
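To make this thread hierarchy concrete, the following is a minimal CUDA C++ sketch; the kernel name, array size and block size are purely illustrative assumptions and not taken from any implementation discussed in this thesis. It shows how a launch configuration of blocks and threads is derived from the problem size and how each thread computes its global index.
\begin{verbatim}
#include <cuda_runtime.h>

__global__ void addOne(float *data, int n)
{
    // Global index derived from the block and thread coordinates.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {        // guard: the last block may contain surplus threads
        data[i] += 1.0f;
    }
}

int main()
{
    int n = 100000;     // illustrative problem size
    float *data;
    cudaMalloc(&data, n * sizeof(float));

    int threadsPerBlock = 256;  // a multiple of the warp size (32)
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    addOne<<<blocks, threadsPerBlock>>>(data, n);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
\end{verbatim}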
While all threads in a warp start at the same point in a program, they have their own instruction address, allowing them to work independently. Because of the SIMD architecture, all threads in a warp must execute the same instruction, and if threads start diverging, the SM must pause the threads following a different instruction path and execute them later. Figure \ref{fig:thread_divergence} shows how such divergences can impact performance. The figure also shows that even though the threads could reconverge after the divergent section, this does not happen, and T2 is only executed after T1 and T3 have finished. In situations with a lot of data-dependent thread divergence, most of the benefit of using a GPU is lost.
\begin{figure}
\centering
\includegraphics[width=.8\textwidth]{thread_divergence.png}
\caption{Thread T2 wants to execute instruction B while T1 and T3 want to execute instruction A. Therefore, T2 is inactive during this cycle and only becomes active once T1 and T3 have finished. This means that the divergent threads are serialised.}
\label{fig:thread_divergence}
\end{figure}
Threads not executing the same instruction goes against the SIMD principle, but it can happen in reality due to data-dependent branching. This leads to poor resource utilisation, which in turn degrades performance. Threads can also be paused (become inactive) when the number of threads started is not divisible by 32. In such cases, the last warp still contains 32 threads, but only the threads that have work assigned to them are executed \parencite{nvidia_cuda_2024}.
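The kernel below is a small CUDA C++ sketch (with illustrative names, not code from a cited source) of such a data-dependent branch. Threads of the same warp that take different paths must be executed one path after the other, with the remaining threads masked off, while threads whose index lies beyond the problem size stay inactive for the entire kernel.
\begin{verbatim}
__global__ void divergentKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;        // surplus threads of the last warp stay inactive

    // Data-dependent branch: threads of one warp may take different paths,
    // forcing the SM to execute both paths sequentially with parts of the
    // warp masked off.
    if (in[i] > 0.0f) {
        out[i] = sqrtf(in[i]); // path A
    } else {
        out[i] = 0.0f;         // path B
    }
}
\end{verbatim}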
Modern GPUs implement the so-called Single-Instruction Multiple-Thread (SIMT) architecture. In many cases a developer does not need to know the details of SIMT and can develop fast and correct programs with just the SIMD architecture in mind. However, leveraging the power of SIMT can yield substantial performance gains by re-converging threads once data-dependent divergence has occurred. A re-convergence algorithm was proposed by \textcite{collange_stack-less_2011}, who showed that such approaches improve hardware occupancy and thus performance, as threads are no longer fully serialised. Another approach for increasing occupancy under the SIMT architecture was proposed by \textcite{fung_thread_2011}. They introduced a technique for compacting thread blocks by moving divergent threads to new warps until they reconverge. This approach resulted in a noticeable speed-up of between 17\% and 22\%.
% Maybe this can also be used to better explain SIMT: https://rocm.docs.amd.com/projects/HIP/en/latest/understand/programming_model.html#programming-model-simt
Talk about memory allocation (with the one paper diving into dynamic allocations)
Memory transfer (with streams potentially)
\subsection[PTX]{Parallel Thread Execution}
Describe what PTX is to get a common ground for the implementation chapter. Probably a short section