relwork: continuation of compilers section

Daniel 2025-03-20 13:31:45 +01:00
parent d514b07434
commit db3ea32b66
3 changed files with 58 additions and 5 deletions


@ -61,9 +61,10 @@ All threads in a warp start at the same point in a program, they have their own
Threads that do not execute the same instruction violate the SIMD principle, but in reality this can happen due to data-dependent branching. Consequently, this leads to poor resource utilisation, which in turn leads to worse performance. Threads can also be paused (inactive threads) because the number of threads started is often not divisible by 32. In such cases, the last warp still contains 32 threads, but only the threads with work assigned to them are executed.
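To make the effect of such a data-dependent branch concrete, the following minimal sketch uses the CUDA.jl package (an assumption; any CUDA-capable setup would do) to launch a kernel in which threads of the same warp may take different paths and must therefore be serialised by the hardware:
\begin{verbatim}
using CUDA  # assumption: CUDA.jl is installed and an NVIDIA GPU is present

function branching_kernel!(out, data)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(data)
        # Data-dependent branch: threads of one warp that disagree on
        # the condition are executed one path after the other.
        if data[i] > 0.0f0
            out[i] = sqrt(data[i])
        else
            out[i] = 0.0f0
        end
    end
    return nothing
end

n = 1000                               # not divisible by 32: the last
data = CUDA.rand(Float32, n) .- 0.5f0  # warp is only partially active
out = CUDA.zeros(Float32, n)
@cuda threads=256 blocks=cld(n, 256) branching_kernel!(out, data)
\end{verbatim}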
Modern GPUs implement the so-called Single-Instruction Multiple-Thread (SIMT) architecture. In many cases a developer does not need to know the details of SIMT and can develop fast and correct programs with just the SIMD architecture in mind. However, leveraging the power of SIMT can yield substantial performance gains by re-converging threads once data-dependent divergence has occurred. \textcite{collange_stack-less_2011} proposed a re-convergence algorithm and showed that such approaches improve hardware occupancy, resulting in better performance, as threads are no longer fully serialised. Another approach for increasing occupancy under the SIMT architecture was proposed by \textcite{fung_thread_2011}. They introduced a technique for compacting thread blocks by moving divergent threads to new warps until they re-converge. This approach resulted in a noticeable speed-up of between 17\% and 22\%. Another example where a SIMT-aware algorithm can perform better was presented by \textcite{koster_massively_2020}. While they did not implement techniques for thread re-convergence, they implemented a thread compaction algorithm. With data-dependent divergence it is possible for threads to finish early, leaving a warp with only some threads active. These deactivated threads are still occupied and cannot be used for other work. Their thread compaction tackles this problem by moving active threads into a new thread block, releasing the inactive threads to perform other work. With this they achieved a speed-up of roughly four times compared to previous implementations. A survey by \textcite{khairy_survey_2019} explores different architectural approaches to improving GPGPU performance. In particular, they compiled a list of publications discussing algorithms for thread re-convergence, thread compaction and much more. Their main goal was to give a broad overview of the many ways of improving GPGPU performance in order to help other developers.
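The idea behind such a thread compaction step can be illustrated on the CPU. The following sketch (with a hypothetical helper name, not the actual algorithm of \textcite{koster_massively_2020}) packs the indices of still-active threads together so that new warps are fully occupied and the inactive threads are freed for other work:
\begin{verbatim}
const WARP_SIZE = 32

# Minimal sketch of thread compaction: given a mask of threads that
# still have work, pack their indices into dense warps.
function compact_threads(active::Vector{Bool})
    alive = findall(active)  # indices of threads with remaining work
    # Chunk the packed indices into groups of WARP_SIZE; only the
    # last warp can remain partially filled.
    return [alive[i:min(i + WARP_SIZE - 1, end)]
            for i in 1:WARP_SIZE:length(alive)]
end

# Example: 64 threads, roughly half of which diverged and ended early.
active = rand(Bool, 64)
warps = compact_threads(active)  # dense warps instead of half-empty ones
\end{verbatim}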
% Slightly more info on SIMT, with independent thread scheduling (diverging threads etc.): https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#independent-thread-scheduling
% MIMD synchronisation on SIMT architecture: https://ieeexplore.ieee.org/abstract/document/7783714
% could be interesting to also include in this paragraph (I haven't yet fully looked at the paper)
\subsubsection{Memory Model}
\label{sec:memory_model}
@ -166,11 +167,14 @@ Done:
Compilers are a necessary tool for many developers. If a developer wants to run their program, it is very likely they need one. As described by \textcite{aho_compilers_2006} in their Dragon Book, a compiler takes code written by a human in some source language and translates it into a destination language readable by a computer. This section briefly explores what compilers are and the research done in this long-established field of computer science. Furthermore, the topics of transpilers and interpreters are explored, as their use-cases are very similar. % TODO: Maybe not a subsection for transpilers? Would be very short and could fit here very well?
%brief overview about compilers (just setting the stage for the subsections basically). Talk about register management and these things.
\textcite{aho_compilers_2006} and \textcite{cooper_engineering_2022} describe how a compiler can be developed, with the latter focusing on more modern approaches. They describe how a compiler consists of two parts: the analyser, also called the frontend, and the synthesiser, also called the backend. The frontend is responsible for ensuring syntactic and semantic correctness and converts the source code into an intermediate representation for the backend. The backend is then responsible for generating target code from this intermediate representation. The target code can be assembly or anything else that is needed for a specific use-case. The intermediate representation also makes it simple to swap out frontends or backends. The GNU Compiler Collection \parencite{gcc_gcc_2025} takes advantage of different frontends to provide support for many languages, including C, C++, Ada and more. Instead of compiling source code for specific machines directly, many languages compile for virtual machines instead. Notable examples are the Java Virtual Machine (JVM) \parencite{lindholm_java_2025} and the Low Level Virtual Machine (LLVM) \parencite{lattner_llvm_2004}. Such virtual machines provide a bytecode which can be used as a target language for compilers. A huge benefit of such virtual machines is that one program can run on all physical machines the virtual machine exists for, without the developer needing to change that program \parencite{lindholm_java_2025}. Programs written for virtual machines are usually compiled to bytecode. This bytecode can then be interpreted, or compiled to physical machine code and run. According to the JVM specification \parencite{lindholm_java_2025}, Java bytecode is interpreted and additionally compiled with a just-in-time (JIT) compiler to increase the performance of code blocks that are executed often. On the other hand, the Common Language Runtime (CLR)\footnote{\url{https://learn.microsoft.com/en-us/dotnet/standard/clr}}, the virtual machine for languages like C\#, never interprets the generated bytecode. As described by \textcite{microsoft_overview_2023}, the CLR always compiles the bytecode to physical machine code using a JIT compiler.
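As a toy illustration of the frontend/backend split and of interpreting bytecode-like instructions (a sketch, not the design of any compiler or virtual machine cited above), the following Julia snippet lowers a frontend-produced intermediate representation to instructions for a simple stack machine, which are then interpreted:
\begin{verbatim}
# Toy intermediate representation: a tree of constants and operations,
# as a frontend might produce it.
abstract type Node end
struct Num <: Node; value::Float64; end
struct BinOp <: Node; op::Symbol; lhs::Node; rhs::Node; end

# "Backend": lower the intermediate representation to a bytecode-like
# list of instructions for a simple stack machine.
lower(n::Num) = [(:push, n.value)]
lower(n::BinOp) = vcat(lower(n.lhs), lower(n.rhs), [(n.op, nothing)])

# Interpreter for the generated instructions.
function run(code)
    stack = Float64[]
    for (op, arg) in code
        if op == :push
            push!(stack, arg)
        else
            b = pop!(stack); a = pop!(stack)
            push!(stack, op == :add ? a + b : a * b)
        end
    end
    return pop!(stack)
end

ir = BinOp(:add, Num(1.0), BinOp(:mul, Num(2.0), Num(3.0)))  # 1 + 2 * 3
run(lower(ir))  # 7.0
\end{verbatim}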
A grammar describes how a language is structured. It not only describes the structure of natural languages, but can also be used to describe the structure of a programming language. \textcite{chomsky_certain_1959} found that grammars can be grouped into four levels, with regular and context-free grammars being the most relevant for programming languages. A regular grammar consists of rules of the form $A = a\,|\,a\,B$, where $A$ and $B$ are non-terminal symbols and $a$ is a terminal symbol. A non-terminal symbol stands for another rule, while a terminal symbol is a fixed symbol or value that can be found in the input stream, like literals in programming languages. Context-free grammars are more expressive and consist of rules of the form $A = \beta$, where $\beta$ stands for any combination of terminal and non-terminal symbols. Therefore, a rule like $A = a\,|\,a\,B\,a$ is allowed at this grammar level, which shows that context-free grammars permit hierarchical structures. When writing grammars for programming languages, additional properties are important for efficiently validating or parsing input defined by the grammar. These are not discussed here, but are described by \textcite{aho_compilers_2006}. They also describe how generating a parser from a grammar can be automated. This automation can be performed by parser generators like Yacc \parencite{johnson_yacc_1975}. More modern alternatives are Bison\footnote{\url{https://www.gnu.org/software/bison/}} and Antlr\footnote{\url{https://www.antlr.org/}}. Before the parser can validate the input stream, a scanner is needed, as described by \textcite{cooper_engineering_2022}. The scanner reads every character of the input stream, removes white-space and ensures that only valid characters and words are present. Flex\footnote{\url{https://github.com/westes/flex}} is a tool for generating scanners and is often used in combination with Bison.
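To illustrate the scanner and parser stages (a hand-written sketch, not output generated by Flex or Yacc), the following snippet handles the small context-free grammar $E = a \,|\, a\,+\,E$, where the terminal symbol $a$ is a single digit:
\begin{verbatim}
# Scanner: reads every character, removes white-space and ensures only
# valid characters are present (grammar: E = a | a "+" E, a = digit).
function scan(input::String)
    tokens = Char[]
    for c in input
        isspace(c) && continue
        (isdigit(c) || c == '+') || error("invalid character: $c")
        push!(tokens, c)
    end
    return tokens
end

# Recursive-descent parser: one function per non-terminal symbol.
function parse_E(tokens, pos=1)
    isdigit(tokens[pos]) || error("expected digit at position $pos")
    value = tokens[pos] - '0'
    if pos < length(tokens) && tokens[pos + 1] == '+'
        rest, pos = parse_E(tokens, pos + 2)
        return value + rest, pos
    end
    return value, pos
end

parse_E(scan("1 + 2 + 3"))  # evaluates while parsing: (6, 5)
\end{verbatim}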
% Continue with a bit about transpilers and I think that's it for this section
% find reference for JIT compilers
% as a starting point https://dl.acm.org/doi/abs/10.1145/857076.857077
\subsection{Transpilers}

Binary file not shown.


@ -895,6 +895,55 @@ Publisher: Multidisciplinary Digital Publishing Institute},
author = {Lattner, C. and Adve, V.},
urldate = {2025-03-18},
date = {2004-03},
keywords = {Runtime, Application software, Algorithm design and analysis, Arithmetic, High level languages, Information analysis, Performance analysis, Program processors, Software safety, Virtual machining},
}
@article{khairy_survey_2019,
title = {A survey of architectural approaches for improving {GPGPU} performance, programmability and heterogeneity},
volume = {127},
issn = {0743-7315},
url = {https://www.sciencedirect.com/science/article/pii/S0743731518308669},
doi = {10.1016/j.jpdc.2018.11.012},
abstract = {With the skyrocketing advances of process technology, the increased need to process huge amount of data, and the pivotal need for power efficiency, the usage of Graphics Processing Units ({GPUs}) for General Purpose Computing becomes a trend and natural. {GPUs} have high computational power and excellent performance per watt, for data parallel applications, relative to traditional multicore processors. {GPUs} appear as discrete or embedded with Central Processing Units ({CPUs}), leading to a scheme of heterogeneous computing. Heterogeneous computing brings as many challenges as it brings opportunities. To get the most of such systems, we need to guarantee high {GPU} utilization, deal with irregular control flow of some workloads, and struggle with far-friendly-programming models. The aim of this paper is to provide a survey about {GPUs} from two perspectives: (1) architectural advances to improve performance and programmability and (2) advances to enhance {CPU}--{GPU} integration in heterogeneous systems. This will help researchers see the opportunities and challenges of using {GPUs} for general purpose computing, especially in the era of big data and the continuous need of high-performance computing.},
pages = {65--88},
journaltitle = {Journal of Parallel and Distributed Computing},
shortjournal = {Journal of Parallel and Distributed Computing},
author = {Khairy, Mahmoud and Wassal, Amr G. and Zahran, Mohamed},
urldate = {2025-03-20},
date = {2019-05-01},
keywords = {Control divergence, {GPGPU}, Heterogeneous architecture, Memory systems},
}
@online{microsoft_overview_2023,
title = {Overview of .{NET} Framework},
url = {https://learn.microsoft.com/en-us/dotnet/framework/get-started/overview},
author = {{Microsoft}},
urldate = {2025-03-20},
date = {2023-03},
}
@article{chomsky_certain_1959,
title = {On certain formal properties of grammars},
volume = {2},
issn = {0019-9958},
url = {https://www.sciencedirect.com/science/article/pii/S0019995859903626},
doi = {10.1016/S0019-9958(59)90362-6},
abstract = {A grammar can be regarded as a device that enumerates the sentences of a language. We study a sequence of restrictions that limit grammars first to Turing machines, then to two types of system from which a phrase structure description of the generated language can be drawn, and finally to finite state Markov sources (finite automata). These restrictions are shown to be increasingly heavy in the sense that the languages that can be generated by grammars meeting a given restriction constitute a proper subset of those that can be generated by grammars meeting the preceding restriction. Various formulations of phrase structure description are considered, and the source of their excess generative power over finite state sources is investigated in greater detail.},
pages = {137--167},
number = {2},
journaltitle = {Information and Control},
shortjournal = {Information and Control},
author = {Chomsky, Noam},
urldate = {2025-03-20},
date = {1959-06-01},
}
@report{johnson_yacc_1975,
title = {{Yacc}: Yet another compiler-compiler},
institution = {Bell Laboratories, Murray Hill, {NJ}},
author = {Johnson, Stephen C.},
date = {1975},
}