diff --git a/thesis/chapters/relwork.tex b/thesis/chapters/relwork.tex index d0cc79a..7e289ba 100644 --- a/thesis/chapters/relwork.tex +++ b/thesis/chapters/relwork.tex @@ -12,22 +12,49 @@ The expressions generated by an equation learning algorithm can look like this $ \section[GPGPU]{General Purpose Computation on Graphics Processing Units} +\label{sec:gpgpu} Graphics cards (GPUs) are commonly used to increase the performance of many different applications. Originally, they were designed to improve performance and visual quality in games. \textcite{dokken_gpu_2005} first described the usage of GPUs for general purpose programming. They showed how the graphics pipeline can be used for GPGPU programming. However, because this approach requires the programmer to understand graphics terminology, it was not a great solution. Therefore, Nvidia released CUDA\footnote{\url{https://developer.nvidia.com/cuda-toolkit}} in 2007 with the goal of allowing developers to program GPUs independently of the graphics pipeline and its terminology. A study of the programmability of GPUs with CUDA and the resulting performance has been conducted by \textcite{huang_gpu_2008}. They found that GPGPU programming has potential, even for non-embarrassingly parallel problems. Research has also been conducted on making low-level CUDA development simpler. \textcite{han_hicuda_2011} have described a directive-based language that makes development simpler and less error-prone, while retaining the performance of handwritten code. To drastically simplify CUDA development, \textcite{besard_effective_2019} showed that it is possible to develop with CUDA in the high-level programming language Julia\footnote{\url{https://julialang.org/}} while performing similarly to CUDA code written in C. In a subsequent study, \textcite{lin_comparing_2021} found that high performance computing (HPC) on the CPU and GPU in Julia performs similarly to HPC development in C.
This means that Julia can be a viable alternative to Fortran, C and C++ in the HPC field, with the additional benefit of developer comfort, since it is a high-level language with modern features such as garbage collection. \textcite{besard_rapid_2019} have also shown how the combination of Julia and CUDA helps in rapidly developing HPC software. While this thesis generally revolves around CUDA, there also exist an alternative by AMD called ROCm\footnote{\url{https://www.amd.com/de/products/software/rocm.html}} and a vendor-independent alternative called OpenCL\footnote{\url{https://www.khronos.org/opencl/}}. -While in the early days of GPGPU programming a lot of research has been done to assess if this approach is feasible, it now seems obvious to use GPUs to accelerate algorithms. Weather simulations began using GPUs very early for their models. In 2008 \textcite{michalakes_gpu_2008} proposed a method for simulating weather with the WRF model on a GPU. With their approach, they reached a speed-up of the most compute intensive task of 5 to 20, with very little GPU optimisation effort. They also found that the GPU usages was very low, meaning there are resources and potential for more detailed simulations. Generally, simulations are great candidates for using GPUs, as they can benefit heavily from a high degree of parallelism and data throughput. \textcite{koster_high-performance_2020} have developed a way of using adaptive time steps to improve the performance of time step simulations, while retaining their precision and constraint correctness. Black hole simulations are crucial for science and education for a better understanding of our world. \textcite{verbraeck_interactive_2021} have shown that simulating complex Kerr (rotating) black holes can be done on consumer hardware in a few seconds.
Schwarzschild black hole simulations can be performed in real-time with GPUs as described by \textcite{hissbach_overview_2022} which is especially helpful for educational scenarios. While both approaches do not have the same accuracy as detailed simulations on supercomputers, they show how single GPUs can yield similar accuracy at a fraction of the cost. Networking can also heavily benefit from GPU acceleration as shown by \textcite{han_packetshader_2010}, where they achieved a significant increase in throughput than with a CPU only implementation. Finite element structural analysis is an essential tool for many branches of engineering and can also heavily benefit from the usage of GPUs as demonstrated by \textcite{georgescu_gpu_2013}. +While in the early days of GPGPU programming a lot of research was done to assess whether this approach is feasible, it now seems obvious to use GPUs to accelerate algorithms. Weather simulations began using GPUs very early for their models. In 2008, \textcite{michalakes_gpu_2008} proposed a method for simulating weather with the WRF model on a GPU. With their approach, they achieved a speed-up of 5 to 20 for the most compute-intensive task, with very little GPU optimisation effort. They also found that GPU utilisation was very low, meaning there are resources and potential for more detailed simulations. Generally, simulations are great candidates for using GPUs, as they can benefit heavily from a high degree of parallelism and data throughput. \textcite{koster_high-performance_2020} have developed a way of using adaptive time steps to improve the performance of time-step simulations, while retaining their precision and constraint correctness. Black hole simulations are crucial for science and education, enabling a better understanding of our world. \textcite{verbraeck_interactive_2021} have shown that simulating complex Kerr (rotating) black holes can be done on consumer hardware in a few seconds. 
Schwarzschild black hole simulations can be performed in real-time with GPUs, as described by \textcite{hissbach_overview_2022}, which is especially helpful for educational scenarios. While both approaches do not reach the accuracy of detailed simulations on supercomputers, they show how single GPUs can yield comparable results at a fraction of the cost. Networking can also heavily benefit from GPU acceleration, as shown by \textcite{han_packetshader_2010}, who achieved a significant increase in throughput compared to a CPU-only implementation. Finite element structural analysis is an essential tool for many branches of engineering and can also heavily benefit from the usage of GPUs, as demonstrated by \textcite{georgescu_gpu_2013}. However, it must also be noted that GPUs do not always outperform CPUs, as illustrated by \textcite{lee_debunking_2010}; nonetheless, they can still lead to performance improvements. \subsection{Programming GPUs} % This part now starts talking about architecture and how to program GPUs -talk about the fields GPGPU really helped make performance improvements (weather simulations etc). Then describe how it differs from classical programming. talk about architecture (SIMD/SIMT; a lot of "slow" cores). +The development process on a GPU is vastly different from that on a CPU. A CPU has tens or hundreds of complex cores, with the AMD Epyc 9965\footnote{\url{https://www.amd.com/en/products/processors/server/epyc/9005-series/amd-epyc-9965.html}} having a staggering $192$ cores and twice as many threads. A guide to designing a simple single-core 8-bit CPU has been published by \textcite{schuurman_step-by-step_2013}, describing the many different and complex parts of a CPU core. Modern CPUs are even more complex, with dedicated fast integer and floating-point arithmetic units as well as logic units, sophisticated branch prediction and much more.
This makes a CPU perfect for handling complex control flows on a single thread of execution and, on modern CPUs, even multiple threads simultaneously. However, as seen in section \ref{sec:gpgpu}, this is often not enough. On the other hand, a GPU contains thousands or even tens of thousands of cores. For example, the GeForce RTX 5090\footnote{\url{https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/}} contains a total of $21760$ CUDA cores. To achieve this enormous core count, a single GPU core has to be much simpler than a CPU core. As described by \textcite{nvidia_cuda_2024}, a GPU dedicates many more transistors to floating-point computations. This results in less efficient integer arithmetic and control flow handling. There is also less cache available per core, and clock speeds are usually much lower than those on a CPU. An overview of the differences between a CPU and a GPU architecture can be seen in figure \ref{fig:cpu_vs_gpu}. -starting from here I can hopefully incorporate more images to break up these walls of text +\begin{figure} + \centering + \includegraphics[width=1\textwidth]{nvidia_cpu_vs_gpu.png} + \caption{Overview of the architecture of a CPU (left) and a GPU (right). Note the higher number of simpler cores on the GPU \parencite{nvidia_cuda_2024}.} + \label{fig:cpu_vs_gpu} +\end{figure} + +Despite these drawbacks, the sheer number of cores makes a GPU a valid choice when considering how to improve the performance of an algorithm. Because of the high number of cores, GPUs are best suited for data-parallel scenarios. This is due to the SIMD architecture of these cards. SIMD stands for Single-Instruction Multiple-Data and describes an architecture in which a single stream of instructions is executed on many data streams. \textcite{franchetti_efficient_2005} and \textcite{tian_compiling_2012} describe ways of using SIMD instructions on the CPU.
Their approaches lead to noticeable speed-ups of up to 3.3 and 4.7, respectively, by using SIMD instructions instead of serial computations. Extending this to GPUs, which are specifically built for SIMD/data-parallel calculations, shows why they are so powerful despite having less complex and slower cores than a CPU. + +While the concepts of GPGPU programming are the same no matter which GPU is used, the naming in AMD GPUs and their guides differs from the CUDA naming. As previously stated, this thesis will use the terminology and concepts as described by Nvidia in their CUDA programming guide. + +The threads executed on the many cores of a GPU are grouped together in several categories. At the lowest level exists a streaming multiprocessor (SM), which is a hardware unit responsible for scheduling and executing threads and which also contains the registers used by these threads. An SM always executes a group of 32 threads simultaneously, and this group is called a warp. The number of threads that can be started is virtually unlimited. However, threads need to be grouped into blocks, with one block usually containing a maximum of 1024 threads. Therefore, if more than 1024 threads are needed, more blocks need to be created. All threads in one block have access to some shared memory, which can be used for L1-caching or communication between threads. It is important that blocks can be scheduled independently from one another, with no dependencies between them. This allows the scheduler to schedule blocks and threads as efficiently as possible. All threads within a warp are ensured to be part of the same block and are therefore executed simultaneously \parencite{nvidia_cuda_2024}. + +While all threads in a warp start at the same point in a program, they have their own instruction address, allowing them to work independently.
Because of the SIMD architecture, all threads in a warp must execute the same instructions, and if threads start diverging, the SM must pause threads with different instructions and execute them later. Figure \ref{fig:thread_divergence} shows how such divergences can impact performance. + +\begin{figure} + \centering + \includegraphics[width=.8\textwidth]{thread_divergence.png} + \caption{Thread T2 wants to execute instruction B while T1 and T3 want to execute instruction A. Therefore, T2 will be inactive in this cycle and active in the next, with T1 and T3 being the opposite. This means that two cycles instead of one are now needed to advance all threads, resulting in worse performance.} + \label{fig:thread_divergence} +\end{figure} + +Threads not executing the same instruction goes against the SIMD principle but can happen in reality, due to data-dependent branching. Consequently, this leads to bad resource utilisation, which in turn leads to worse performance. Another reason for threads being paused (inactive threads) is that the number of threads started is sometimes not divisible by 32. In such cases, the last warp still contains 32 threads, but only the threads with work are executed \parencite{nvidia_cuda_2024}. + +Modern GPUs implement the so-called Single-Instruction Multiple-Thread (SIMT) architecture. \subsection[PTX]{Parallel Thread Execution} Describe what PTX is to get a common ground for the implementation chapter. Probably a short section \section{Compilers} -brief overview about compilers (just setting the stage for the subsections basically). Talk about register management and these things +Maybe even move this entire section to "Concept and Design"? + +brief overview about compilers (just setting the stage for the subsections basically). Talk about register management and these things.
\subsection{Interpreters} What are interpreters; how they work; should mostly contain/reference gpu interpreters diff --git a/thesis/images/nvidia_cpu_vs_gpu.png b/thesis/images/nvidia_cpu_vs_gpu.png new file mode 100644 index 0000000..2f0d219 Binary files /dev/null and b/thesis/images/nvidia_cpu_vs_gpu.png differ diff --git a/thesis/images/thread_divergence.png b/thesis/images/thread_divergence.png new file mode 100644 index 0000000..5440ed5 Binary files /dev/null and b/thesis/images/thread_divergence.png differ diff --git a/thesis/main.pdf b/thesis/main.pdf index 69e195a..7309ba5 100644 Binary files a/thesis/main.pdf and b/thesis/main.pdf differ diff --git a/thesis/references.bib b/thesis/references.bib index f1703fa..1455917 100644 --- a/thesis/references.bib +++ b/thesis/references.bib @@ -490,7 +490,7 @@ Publisher: Multidisciplinary Digital Publishing Institute}, urldate = {2025-03-02}, date = {2021-02}, note = {Conference Name: {IEEE} Transactions on Visualization and Computer Graphics}, - keywords = {Algorithms, Cameras, Computer Graphics Techniques, Distortion, Engineering, Mathematics, Observers, Physical \& Environmental Sciences, Ray tracing, Real-time systems, Rendering (computer graphics), Visualization}, + keywords = {Rendering (computer graphics), Algorithms, Cameras, Computer Graphics Techniques, Distortion, Engineering, Mathematics, Observers, Physical \& Environmental Sciences, Ray tracing, Real-time systems, Visualization}, file = {PDF:C\:\\Users\\danwi\\Zotero\\storage\\HDASRGYN\\Verbraeck und Eisemann - 2021 - Interactive Black-Hole Visualization.pdf:application/pdf}, } @@ -506,3 +506,71 @@ Publisher: Multidisciplinary Digital Publishing Institute}, langid = {english}, file = {Full Text PDF:C\:\\Users\\danwi\\Zotero\\storage\\TBBLEZ5N\\Hissbach et al. 
- 2022 - An Overview of Techniques for Egocentric Black Hole Visualization and Their Suitability for Planetar.pdf:application/pdf}, } + +@inproceedings{schuurman_step-by-step_2013, + location = {New York, {NY}, {USA}}, + title = {Step-by-step design and simulation of a simple {CPU} architecture}, + isbn = {978-1-4503-1868-6}, + url = {https://dl.acm.org/doi/10.1145/2445196.2445296}, + doi = {10.1145/2445196.2445296}, + series = {{SIGCSE} '13}, + abstract = {This paper describes a sequence of assignments, each building upon the next, leading students to a working simulation of a simple 8-bit {CPU} (Central Processing Unit). The design features a classic Von Neumann architecture comprising a simple data path with a few registers, a simple {ALU} (Arithmetic Logic Unit), and a microprogram to direct all the control signals. The first step involves the design of the {ALU} which is capable of eight basic operations. The second step guides students to construct a datapath complete with several 8-bit registers. The third step involves the design and implementation of a control unit which uses a microprogram to implement machine code instructions. The microprogram implements nine basic machine language instructions which are sufficient for writing many simple programs. The final step involves adding program memory and an input and output device to form a simple working simulation of a computer. At this point, students may hand-assemble code for their {CPU} and simulate its execution. All simulations are performed using a free and open source simulator called Logisim which performs digital logic simulations with the ability to build larger circuits from smaller subcircuits. Students can set an adjustable clock rate and observe the internal {CPU} state and registers as it retrieves instructions and steps through the microcode. 
The basic {CPU} architecture provides many opportunities for more advanced exercises, such as adding an instruction fetch unit, adding pipelining, or adding more machine language instructions. The assignments were introduced in a second year course on computer organization, providing an effective hands-on approach to understanding how a {CPU} actually operates.}, + pages = {335--340}, + booktitle = {Proceeding of the 44th {ACM} technical symposium on Computer science education}, + publisher = {Association for Computing Machinery}, + author = {Schuurman, Derek C.}, + urldate = {2025-03-08}, + date = {2013-03-06}, + file = {Full Text PDF:C\:\\Users\\danwi\\Zotero\\storage\\KM664H87\\Schuurman - 2013 - Step-by-step design and simulation of a simple CPU architecture.pdf:application/pdf}, +} + +@article{franchetti_efficient_2005, + title = {Efficient Utilization of {SIMD} Extensions}, + volume = {93}, + issn = {1558-2256}, + url = {https://ieeexplore.ieee.org/abstract/document/1386659}, + doi = {10.1109/JPROC.2004.840491}, + abstract = {This paper targets automatic performance tuning of numerical kernels in the presence of multilayered memory hierarchies and single-instruction, multiple-data ({SIMD}) parallelism. The studied {SIMD} instruction set extensions include Intel's {SSE} family, {AMD}'s 3DNow!, Motorola's {AltiVec}, and {IBM}'s {BlueGene}/L {SIMD} instructions. {FFTW}, {ATLAS}, and {SPIRAL} demonstrate that near-optimal performance of numerical kernels across a variety of modern computers featuring deep memory hierarchies can be achieved only by means of automatic performance tuning. These software packages generate and optimize {ANSI} C code and feed it into the target machine's general-purpose C compiler to maintain portability. The scalar C code produced by performance tuning systems poses a severe challenge for vectorizing compilers. 
The particular code structure hampers automatic vectorization and, thus, inhibits satisfactory performance on processors featuring short vector extensions. This paper describes special-purpose compiler technology that supports automatic performance tuning on machines with vector instructions. The work described includes: 1) symbolic vectorization of digital signal processing transforms; 2) straight-line code vectorization for numerical kernels; and 3) compiler back ends for straight-line code with vector instructions. Methods from all three areas were combined with {FFTW}, {SPIRAL}, and {ATLAS} to optimize both for memory hierarchy and vector instructions. Experiments show that the presented methods lead to substantial speedups (up to 1.8 for two-way and 3.3 for four-way vector extensions) over the best scalar C codes generated by the original systems as well as roughly matching the performance of hand-tuned vendor libraries.}, + pages = {409--425}, + number = {2}, + journaltitle = {Proceedings of the {IEEE}}, + author = {Franchetti, F. and Kral, S. and Lorenz, J. and Ueberhuber, C.W.}, + urldate = {2025-03-08}, + date = {2005-02}, + note = {Conference Name: Proceedings of the {IEEE}}, + keywords = {Automatic vectorization, Boosting, Computer aided instruction, Computer applications, Concurrent computing, Digital signal processing, digital signal processing ({DSP}), fast Fourier transform ({FFT}), Kernel, Parallel processing, Registers, short vector single instruction, multiple data ({SIMD}), Signal processing algorithms, Spirals, symbolic vectorization}, + file = {Eingereichte Version:C\:\\Users\\danwi\\Zotero\\storage\\J48HM9VD\\Franchetti et al. 
- 2005 - Efficient Utilization of SIMD Extensions.pdf:application/pdf;IEEE Xplore Abstract Record:C\:\\Users\\danwi\\Zotero\\storage\\W6PT75CV\\1386659.html:text/html}, +} + +@inproceedings{tian_compiling_2012, + title = {Compiling C/C++ {SIMD} Extensions for Function and Loop Vectorizaion on Multicore-{SIMD} Processors}, + url = {https://ieeexplore.ieee.org/abstract/document/6270606}, + doi = {10.1109/IPDPSW.2012.292}, + abstract = {{SIMD} vectorization has received significant attention in the past decade as an important method to accelerate scientific applications, media and embedded applications on {SIMD} architectures such as Intel® {SSE}, {AVX}, and {IBM}* {AltiVec}. However, most of the focus has been directed at loops, effectively executing their iterations on multiple {SIMD} lanes concurrently relying upon program hints and compiler analysis. This paper presents a set of new C/C++ high-level vector extensions for {SIMD} programming, and the Intel® C++ product compiler that is extended to translate these vector extensions and produce optimized {SIMD} instruction sequences of vectorized functions and loops. For a function, our main idea is to vectorize the entire function for callers instead of just vectorizing loops (if any) inside the function. It poses the challenge of dealing with complicated control-flow in the function body, and matching caller and callee for {SIMD} vector calls while vectorizing caller functions (or loops) and callee functions. Our compilation methods for automatically compiling vector extensions are described. 
We present performance results of several non-trivial visual computing, computational, and simulation workloads, utilizing {SIMD} units through the vector extensions on Intel® Multicore 128-bit {SIMD} processors, and we show that significant {SIMD} speedups (3.07x to 4.69x) are achieved over the serial execution.}, + eventtitle = {2012 {IEEE} 26th International Parallel and Distributed Processing Symposium Workshops \& {PhD} Forum}, + pages = {2349--2358}, + booktitle = {2012 {IEEE} 26th International Parallel and Distributed Processing Symposium Workshops \& {PhD} Forum}, + author = {Tian, Xinmin and Saito, Hideki and Girkar, Milind and Preis, Serguei V. and Kozhukhov, Sergey S. and Cherkasov, Aleksei G. and Nelson, Clark and Panchenko, Nikolay and Geva, Robert}, + urldate = {2025-03-08}, + date = {2012-05}, + keywords = {Cloning, Compiler, {GPU}, Graphics processing unit, Hardware, Multicore, Parallel processing, Programming, {SIMD}, Vectorization, Vectors}, + file = {IEEE Xplore Abstract Record:C\:\\Users\\danwi\\Zotero\\storage\\HBSGBKT2\\6270606.html:text/html}, +} + +@inproceedings{lee_debunking_2010, + location = {New York, {NY}, {USA}}, + title = {Debunking the 100X {GPU} vs. {CPU} myth: an evaluation of throughput computing on {CPU} and {GPU}}, + isbn = {978-1-4503-0053-7}, + url = {https://dl.acm.org/doi/10.1145/1815961.1816021}, + doi = {10.1145/1815961.1816021}, + series = {{ISCA} '10}, + shorttitle = {Debunking the 100X {GPU} vs. {CPU} myth}, + abstract = {Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core {CPUs} and {GPUs}. 
In the past few years there have been many studies claiming {GPUs} deliver substantial speedups (between 10X and 1000X) over multi-core {CPUs} on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both {CPUs} and {GPUs} the performance gap between an Nvidia {GTX}280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both {CPU} and {GPU}, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.}, + pages = {451--460}, + booktitle = {Proceedings of the 37th annual international symposium on Computer architecture}, + publisher = {Association for Computing Machinery}, + author = {Lee, Victor W. and Kim, Changkyu and Chhugani, Jatin and Deisher, Michael and Kim, Daehyun and Nguyen, Anthony D. and Satish, Nadathur and Smelyanskiy, Mikhail and Chennupaty, Srinivas and Hammarlund, Per and Singhal, Ronak and Dubey, Pradeep}, + urldate = {2025-03-08}, + date = {2010-06-19}, + file = {Full Text PDF:C\:\\Users\\danwi\\Zotero\\storage\\D64U9R8Q\\Lee et al. - 2010 - Debunking the 100X GPU vs. CPU myth an evaluation of throughput computing on CPU and GPU.pdf:application/pdf}, +}