diff --git a/thesis/chapters/relwork.tex b/thesis/chapters/relwork.tex
index 7e289ba..f12de61 100644
--- a/thesis/chapters/relwork.tex
+++ b/thesis/chapters/relwork.tex
@@ -34,18 +34,23 @@ While the concepts of GPGPU programming are the same no matter the GPU used, the
The many threads running on a GPU's cores are grouped together on several levels. At the lowest level sits the streaming multiprocessor (SM), a hardware unit responsible for scheduling and executing threads, which also contains the registers used by these threads. An SM always executes a group of 32 threads simultaneously; such a group is called a warp. The number of threads that can be started is virtually unlimited. However, threads need to be grouped into blocks, with one block usually containing a maximum of 1024 threads. Therefore, if more than 1024 threads are needed, more blocks need to be created. All threads in one block have access to some shared memory, which can be used for L1-caching or for communication between threads. It is important that blocks can be scheduled independently of one another, with no dependencies between them. This allows the scheduler to schedule blocks and threads as efficiently as possible. All threads within a warp are guaranteed to be part of the same block and are therefore executed simultaneously \parencite{nvidia_cuda_2024}.
-While all threads in a warp start at the same point in a program, they have their own instruction address, allowing them to work independently. Because of the SIMD architecture, all threads in a warp must execute the same instructions and if threads start diverging, the SM must pause threads with different instructions and execute them later. Figure \ref{fig:thread_divergence} shows how such divergences can impact performance.
+While all threads in a warp start at the same point in a program, each has its own instruction address, allowing them to work independently. Because of the SIMD architecture, all threads in a warp must execute the same instruction; if threads start diverging, the SM must pause the threads on the other path and execute them later. Figure \ref{fig:thread_divergence} shows how such divergence can impact performance. The figure also shows that the threads do not reconverge at the earliest possible point: T2 is only executed after T1 and T3 have finished. In situations with a lot of data-dependent thread divergence, most of the benefit of using a GPU vanishes.
\begin{figure}
	\centering
	\includegraphics[width=.8\textwidth]{thread_divergence.png}
-	\caption{Thread T2 wants to execute instruction B while T1 and T3 want to execute instruction A. Therefore T2 will be an inactive thread this cycle and active in the next, with T1 and T3 being the opposite. This means that now 2 cycles are needed instead of one to advance all threads, resulting in worse performance.}
+	\caption{Thread T2 wants to execute instruction B, while T1 and T3 want to execute instruction A. Therefore, T2 is inactive during this cycle and becomes active once T1 and T3 have finished, meaning the divergent threads are serialised.}
	\label{fig:thread_divergence}
\end{figure}
Threads not executing the same instruction violates the SIMD principle, but happens in practice due to data-dependent branching. Consequently, it leads to poor resource utilisation, which in turn leads to worse performance.
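+As a minimal sketch of such data-dependent divergence (the kernel and its parameter names are illustrative assumptions, not taken from the cited documentation), consider a CUDA kernel in which the lanes of a warp take different paths depending on their input:
+\begin{verbatim}
+__global__ void divergent(const float *in, float *out, int n)
+{
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= n) return;        // threads beyond n are masked off
+
+    if (in[i] > 0.0f)          // data-dependent branch: the warp first
+        out[i] = in[i] * 2.0f; // executes the lanes taking this path...
+    else
+        out[i] = 0.0f;         // ...then the lanes taking the other path
+}
+\end{verbatim}
+Both paths are issued to the whole warp one after the other, with the lanes on the inactive path masked off, which is exactly the serialisation shown in Figure \ref{fig:thread_divergence}.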
Threads can also be paused (become inactive) when the number of threads started is not divisible by 32. In such cases, the last warp still contains 32 threads, but only the threads that have work assigned are executed \parencite{nvidia_cuda_2024}.
-Modern GPUs implement the so called Single-Instruction Multiple-Thread (SIMT) architecture.
+Modern GPUs implement the so-called Single-Instruction Multiple-Thread (SIMT) architecture. In many cases, a developer does not need to know the details of SIMT and can write fast and correct programs with just the SIMD architecture in mind. However, leveraging SIMT can yield substantial performance gains by re-converging threads once data-dependent divergence has occurred. \textcite{collange_stack-less_2011} proposed a stack-less re-convergence algorithm and showed that such techniques improve the occupancy of the SIMD units, and with it performance, as divergent threads are no longer fully serialised. Another approach for increasing occupancy under the SIMT architecture was proposed by \textcite{fung_thread_2011}: a technique that compacts thread blocks by moving divergent threads into new warps until they reconverge, yielding an average speed-up of 22\% over a per-warp, stack-based reconvergence baseline and 17\% over dynamic warp formation.
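+One software-visible aspect of SIMT is that the threads of a warp, while executing in lockstep, keep their own registers and can exchange register values directly. As a minimal sketch (illustrative only, not taken from the works cited above), the following CUDA kernel sums an array using the warp-level shuffle intrinsic \texttt{\_\_shfl\_down\_sync}:
+\begin{verbatim}
+__global__ void warp_sum(const float *in, float *out, int n)
+{
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    float val = (i < n) ? in[i] : 0.0f; // out-of-range lanes contribute 0
+
+    // Each step halves the number of contributing lanes; all 32 lanes
+    // of the warp execute the same instruction on their own registers.
+    for (int offset = 16; offset > 0; offset /= 2)
+        val += __shfl_down_sync(0xffffffffu, val, offset);
+
+    if (threadIdx.x % 32 == 0)          // lane 0 holds the warp's sum
+        atomicAdd(out, val);
+}
+\end{verbatim}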
+
+% Maybe this can also be used to better explain SIMT: https://rocm.docs.amd.com/projects/HIP/en/latest/understand/programming_model.html#programming-model-simt
+
+Talk about memory allocation (with the one paper diving into dynamic allocations)
+Memory transfer (with streams potentially)
\subsection[PTX]{Parallel Thread Execution}
Describe what PTX is to get a common ground for the implementation chapter. Probably a short section
diff --git a/thesis/images/thread_divergence.png b/thesis/images/thread_divergence.png
index 5440ed5..25a7780 100644
Binary files a/thesis/images/thread_divergence.png and b/thesis/images/thread_divergence.png differ
diff --git a/thesis/main.pdf b/thesis/main.pdf
index 7309ba5..88669ce 100644
Binary files a/thesis/main.pdf and b/thesis/main.pdf differ
diff --git a/thesis/references.bib b/thesis/references.bib
index 1455917..a28e18f 100644
--- a/thesis/references.bib
+++ b/thesis/references.bib
@@ -574,3 +574,48 @@ Publisher: Multidisciplinary Digital Publishing Institute},
	date = {2010-06-19},
	file = {Full Text PDF:C\:\\Users\\danwi\\Zotero\\storage\\D64U9R8Q\\Lee et al. - 2010 - Debunking the 100X GPU vs. CPU myth an evaluation of throughput computing on CPU and GPU.pdf:application/pdf},
}
+
+@inproceedings{kyung_implementation_2014,
+	title = {An implementation of a {SIMT} architecture-based stream processor},
+	url = {https://ieeexplore.ieee.org/abstract/document/7022313},
+	doi = {10.1109/TENCON.2014.7022313},
+	abstract = {In this paper, we designed a {SIMT} architecture-based stream processor for parallel processing in the mobile environment. The designed processor is a superscalar architecture and can issue up to four instructions. Considering the limited resources of the mobile environment, this processor was consisted of 16 stream processors ({SPs}). To verify the operation of the designed processor, a functional level simulation was conducted with the Modelsim {SE} 10.0b simulator. We synthesized on Virtex-7 {FPGA} as the target with the Xilinx {ISE} 14.7 tool and the results analyzed. The performance of the designed processor was 150M Triangles/Sec, 4.8 {GFLOPS} at 100 {MHz}. When the performance was compared with that of conventional processors, the proposed architecture of the processor attested to be effective in processing 3D graphics and parallel general-purpose computing in the mobile environment.},
+	eventtitle = {{TENCON} 2014 - 2014 {IEEE} Region 10 Conference},
+	pages = {1--5},
+	booktitle = {{TENCON} 2014 - 2014 {IEEE} Region 10 Conference},
+	author = {Kyung, Gyutaek and Jung, Changmin and Lee, Kwangyeob},
+	urldate = {2025-03-08},
+	date = {2014-10},
+	note = {{ISSN}: 2159-3450},
+	keywords = {Computer architecture, Educational institutions, {GPGPU}, Graphics, Graphics processing units, Instruction sets, Mobile communication, Registers, {SIMT} Architecture, Stream Processor},
+	file = {IEEE Xplore Abstract Record:C\:\\Users\\danwi\\Zotero\\storage\\9B85REHH\\7022313.html:text/html},
+}
+
+@report{collange_stack-less_2011,
+	title = {Stack-less {SIMT} reconvergence at low cost},
+	url = {https://hal.science/hal-00622654},
+	abstract = {Parallel architectures following the {SIMT} model such as {GPUs} benefit from application regularity by issuing concurrent threads running in lockstep on {SIMD} units. As threads take different paths across the control-flow graph, lockstep execution is partially lost, and must be regained whenever possible in order to maximize the occupancy of {SIMD} units. In this paper, we propose a technique to handle {SIMT} control divergence that operates in constant space and handles indirect jumps and recursion. We describe a possible implementation which leverage the existing memory divergence management unit, ensuring a low hardware cost. In terms of performance, this solution is at least as efficient as existing techniques.},
+	institution = {{ENS} Lyon},
+	type = {Research Report},
+	author = {Collange, Caroline},
+	urldate = {2025-03-08},
+	date = {2011-09},
+	keywords = {Control-flow reconvergence, {GPU}, {SIMD}, {SIMT}},
+	file = {HAL PDF Full Text:C\:\\Users\\danwi\\Zotero\\storage\\M2WPWNXF\\Collange - 2011 - Stack-less SIMT reconvergence at low cost.pdf:application/pdf},
+}
+
+@inproceedings{fung_thread_2011,
+	title = {Thread block compaction for efficient {SIMT} control flow},
+	url = {https://ieeexplore.ieee.org/abstract/document/5749714},
+	doi = {10.1109/HPCA.2011.5749714},
+	abstract = {Manycore accelerators such as graphics processor units ({GPUs}) organize processing units into single-instruction, multiple data “cores” to improve throughput per unit hardware cost. Programming models for these accelerators encourage applications to run kernels with large groups of parallel scalar threads. The hardware groups these threads into warps/wavefronts and executes them in lockstep-dubbed single-instruction, multiple-thread ({SIMT}) by {NVIDIA}. While current {GPUs} employ a per-warp (or per-wavefront) stack to manage divergent control flow, it incurs decreased efficiency for applications with nested, data-dependent control flow. In this paper, we propose and evaluate the benefits of extending the sharing of resources in a block of warps, already used for scratchpad memory, to exploit control flow locality among threads (where such sharing may at first seem detrimental). In our proposal, warps within a thread block share a common block-wide stack for divergence handling. At a divergent branch, threads are compacted into new warps in hardware.
Our simulation results show that this compaction mechanism provides an average speedup of 22\% over a baseline per-warp, stack-based reconvergence mechanism, and 17\% versus dynamic warp formation on a set of {CUDA} applications that suffer significantly from control flow divergence.},
+	eventtitle = {2011 {IEEE} 17th International Symposium on High Performance Computer Architecture},
+	pages = {25--36},
+	booktitle = {2011 {IEEE} 17th International Symposium on High Performance Computer Architecture},
+	author = {Fung, Wilson W. L. and Aamodt, Tor M.},
+	urldate = {2025-03-08},
+	date = {2011-02},
+	note = {{ISSN}: 2378-203X},
+	keywords = {Compaction, Graphics processing unit, Hardware, Instruction sets, Kernel, Pipelines, Random access memory},
+	file = {Full Text PDF:C\:\\Users\\danwi\\Zotero\\storage\\TRPWUTI6\\Fung und Aamodt - 2011 - Thread block compaction for efficient SIMT control flow.pdf:application/pdf;IEEE Xplore Abstract Record:C\:\\Users\\danwi\\Zotero\\storage\\LYPYEA8U\\5749714.html:text/html},
+}