concept and design: started writing this chapter

This commit is contained in:
Daniel 2025-04-03 13:43:23 +02:00
parent 2b9c394f1b
commit d8f5454e9c
11 changed files with 96 additions and 57 deletions

Binary file not shown. (Before: 152 KiB, After: 154 KiB)


@@ -1,6 +1,6 @@
<mxfile host="app.diagrams.net" agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:129.0) Gecko/20100101 Firefox/129.0" version="24.7.6">
<mxfile host="app.diagrams.net" agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:137.0) Gecko/20100101 Firefox/137.0" version="26.1.1">
<diagram name="Page-1" id="gpsZjoig8lt5hVv5Hzwz">
<mxGraphModel dx="989" dy="539" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="1169" pageHeight="827" math="0" shadow="0">
<mxGraphModel dx="830" dy="457" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="1169" pageHeight="827" math="0" shadow="0">
<root>
<mxCell id="0" />
<mxCell id="1" parent="0" />
@@ -40,22 +40,22 @@
<mxPoint x="200" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-18" value="e1" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;direction=south;" vertex="1" parent="9Xn2HrUYLFHSwPnNgvM3-13">
<mxCell id="9og6d5YY-6gPx96OlZrF-18" value="e1" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;direction=south;" parent="9Xn2HrUYLFHSwPnNgvM3-13" vertex="1">
<mxGeometry width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-19" value="e2" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;direction=south;" vertex="1" parent="9Xn2HrUYLFHSwPnNgvM3-13">
<mxCell id="9og6d5YY-6gPx96OlZrF-19" value="e2" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;direction=south;" parent="9Xn2HrUYLFHSwPnNgvM3-13" vertex="1">
<mxGeometry x="40" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-20" value="e3" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;direction=south;" vertex="1" parent="9Xn2HrUYLFHSwPnNgvM3-13">
<mxCell id="9og6d5YY-6gPx96OlZrF-20" value="e3" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;direction=south;" parent="9Xn2HrUYLFHSwPnNgvM3-13" vertex="1">
<mxGeometry x="80" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-21" value="e4" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;direction=south;" vertex="1" parent="9Xn2HrUYLFHSwPnNgvM3-13">
<mxCell id="9og6d5YY-6gPx96OlZrF-21" value="e4" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;direction=south;" parent="9Xn2HrUYLFHSwPnNgvM3-13" vertex="1">
<mxGeometry x="120" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-22" value="e5" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;direction=south;" vertex="1" parent="9Xn2HrUYLFHSwPnNgvM3-13">
<mxCell id="9og6d5YY-6gPx96OlZrF-22" value="e5" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;direction=south;" parent="9Xn2HrUYLFHSwPnNgvM3-13" vertex="1">
<mxGeometry x="160" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-23" value="e6" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;direction=south;" vertex="1" parent="9Xn2HrUYLFHSwPnNgvM3-13">
<mxCell id="9og6d5YY-6gPx96OlZrF-23" value="e6" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;direction=south;" parent="9Xn2HrUYLFHSwPnNgvM3-13" vertex="1">
<mxGeometry x="200" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="9Xn2HrUYLFHSwPnNgvM3-14" value="" style="group" parent="1" vertex="1" connectable="0">
@@ -179,7 +179,7 @@
</mxGeometry>
</mxCell>
<mxCell id="9Xn2HrUYLFHSwPnNgvM3-44" value="" style="rounded=0;whiteSpace=wrap;html=1;rotation=90;" parent="1" vertex="1">
<mxGeometry x="1040" y="440" width="40" height="40" as="geometry" />
<mxGeometry x="960" y="520" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="9Xn2HrUYLFHSwPnNgvM3-51" value="" style="rounded=0;whiteSpace=wrap;html=1;rotation=90;" parent="1" vertex="1">
<mxGeometry x="880" y="480" width="120" height="40" as="geometry" />
@@ -208,12 +208,6 @@
<mxPoint x="1000" y="480" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="9Xn2HrUYLFHSwPnNgvM3-61" value="" style="endArrow=none;html=1;rounded=0;exitX=0.167;exitY=1;exitDx=0;exitDy=0;exitPerimeter=0;entryX=0.167;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;" parent="1" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="1040" y="480" as="sourcePoint" />
<mxPoint x="1080" y="480" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="9Xn2HrUYLFHSwPnNgvM3-62" value="" style="endArrow=none;html=1;rounded=0;exitX=0.167;exitY=1;exitDx=0;exitDy=0;exitPerimeter=0;entryX=0.167;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;" parent="1" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="920" y="480" as="sourcePoint" />
@@ -244,7 +238,7 @@
<mxPoint x="1019.6700000000001" y="440" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="9Xn2HrUYLFHSwPnNgvM3-68" value="" style="endArrow=classic;html=1;rounded=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" parent="1" edge="1">
<mxCell id="9Xn2HrUYLFHSwPnNgvM3-68" value="" style="endArrow=baseDash;html=1;rounded=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;endFill=0;endSize=18;" parent="1" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="1059.8300000000002" y="400" as="sourcePoint" />
<mxPoint x="1059.8300000000002" y="440" as="targetPoint" />
@@ -313,8 +307,8 @@
<mxCell id="9Xn2HrUYLFHSwPnNgvM3-95" value="&lt;div&gt;p5&lt;/div&gt;" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="1000" y="600" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="9Xn2HrUYLFHSwPnNgvM3-96" value="p1" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="1040" y="440" width="40" height="40" as="geometry" />
<mxCell id="9Xn2HrUYLFHSwPnNgvM3-96" value="p3" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="960" y="520" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="9Xn2HrUYLFHSwPnNgvM3-97" value="p1" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="920" y="440" width="40" height="40" as="geometry" />
@@ -424,7 +418,7 @@
<mxCell id="9Xn2HrUYLFHSwPnNgvM3-118" value="x3" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;direction=south;" parent="1" vertex="1">
<mxGeometry x="640" y="560" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-12" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" edge="1" parent="1" source="9Xn2HrUYLFHSwPnNgvM3-119">
<mxCell id="9og6d5YY-6gPx96OlZrF-12" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" parent="1" source="9Xn2HrUYLFHSwPnNgvM3-119" edge="1">
<mxGeometry relative="1" as="geometry">
<mxPoint x="720" y="740" as="targetPoint" />
<Array as="points">
@@ -444,7 +438,7 @@
<mxCell id="9Xn2HrUYLFHSwPnNgvM3-122" value="x3" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;direction=south;" parent="1" vertex="1">
<mxGeometry x="600" y="560" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-14" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" edge="1" parent="1" source="9Xn2HrUYLFHSwPnNgvM3-123">
<mxCell id="9og6d5YY-6gPx96OlZrF-14" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" parent="1" source="9Xn2HrUYLFHSwPnNgvM3-123" edge="1">
<mxGeometry relative="1" as="geometry">
<mxPoint x="720" y="780" as="targetPoint" />
<Array as="points">
@@ -464,7 +458,7 @@
<mxCell id="9Xn2HrUYLFHSwPnNgvM3-126" value="x3" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;direction=south;" parent="1" vertex="1">
<mxGeometry x="560" y="560" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-11" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" edge="1" parent="1" source="9Xn2HrUYLFHSwPnNgvM3-127">
<mxCell id="9og6d5YY-6gPx96OlZrF-11" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" parent="1" source="9Xn2HrUYLFHSwPnNgvM3-127" edge="1">
<mxGeometry relative="1" as="geometry">
<mxPoint x="720" y="700" as="targetPoint" />
<Array as="points">
@@ -535,61 +529,61 @@
</Array>
</mxGeometry>
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-9" value="" style="group" vertex="1" connectable="0" parent="1">
<mxCell id="9og6d5YY-6gPx96OlZrF-9" value="" style="group" parent="1" vertex="1" connectable="0">
<mxGeometry x="721" y="680" width="240" height="120" as="geometry" />
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-1" value="" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="9og6d5YY-6gPx96OlZrF-9">
<mxCell id="9og6d5YY-6gPx96OlZrF-1" value="" style="rounded=0;whiteSpace=wrap;html=1;" parent="9og6d5YY-6gPx96OlZrF-9" vertex="1">
<mxGeometry width="240" height="120" as="geometry" />
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-2" value="" style="endArrow=none;html=1;rounded=0;" edge="1" parent="9og6d5YY-6gPx96OlZrF-9">
<mxCell id="9og6d5YY-6gPx96OlZrF-2" value="" style="endArrow=none;html=1;rounded=0;" parent="9og6d5YY-6gPx96OlZrF-9" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="200" y="120" as="sourcePoint" />
<mxPoint x="200" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-3" value="" style="endArrow=none;html=1;rounded=0;" edge="1" parent="9og6d5YY-6gPx96OlZrF-9">
<mxCell id="9og6d5YY-6gPx96OlZrF-3" value="" style="endArrow=none;html=1;rounded=0;" parent="9og6d5YY-6gPx96OlZrF-9" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint y="40" as="sourcePoint" />
<mxPoint x="240" y="40" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-4" value="" style="endArrow=none;html=1;rounded=0;" edge="1" parent="9og6d5YY-6gPx96OlZrF-9">
<mxCell id="9og6d5YY-6gPx96OlZrF-4" value="" style="endArrow=none;html=1;rounded=0;" parent="9og6d5YY-6gPx96OlZrF-9" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint y="80" as="sourcePoint" />
<mxPoint x="240" y="80" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-5" value="" style="endArrow=none;html=1;rounded=0;" edge="1" parent="9og6d5YY-6gPx96OlZrF-9">
<mxCell id="9og6d5YY-6gPx96OlZrF-5" value="" style="endArrow=none;html=1;rounded=0;" parent="9og6d5YY-6gPx96OlZrF-9" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="40" y="120" as="sourcePoint" />
<mxPoint x="40" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-6" value="" style="endArrow=none;html=1;rounded=0;" edge="1" parent="9og6d5YY-6gPx96OlZrF-9">
<mxCell id="9og6d5YY-6gPx96OlZrF-6" value="" style="endArrow=none;html=1;rounded=0;" parent="9og6d5YY-6gPx96OlZrF-9" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="80" y="120" as="sourcePoint" />
<mxPoint x="80" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-7" value="" style="endArrow=none;html=1;rounded=0;" edge="1" parent="9og6d5YY-6gPx96OlZrF-9">
<mxCell id="9og6d5YY-6gPx96OlZrF-7" value="" style="endArrow=none;html=1;rounded=0;" parent="9og6d5YY-6gPx96OlZrF-9" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="119.65999999999997" y="120" as="sourcePoint" />
<mxPoint x="119.65999999999997" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-8" value="" style="endArrow=none;html=1;rounded=0;" edge="1" parent="9og6d5YY-6gPx96OlZrF-9">
<mxCell id="9og6d5YY-6gPx96OlZrF-8" value="" style="endArrow=none;html=1;rounded=0;" parent="9og6d5YY-6gPx96OlZrF-9" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="160" y="120" as="sourcePoint" />
<mxPoint x="160" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-10" value="&lt;div&gt;Results&lt;/div&gt;&lt;div&gt;Matrix&lt;/div&gt;" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" vertex="1" parent="1">
<mxCell id="9og6d5YY-6gPx96OlZrF-10" value="&lt;div&gt;Results&lt;/div&gt;&lt;div&gt;Matrix&lt;/div&gt;" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="721" y="630" width="70" height="40" as="geometry" />
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-16" value="" style="shape=curlyBracket;whiteSpace=wrap;html=1;rounded=1;labelPosition=left;verticalLabelPosition=middle;align=right;verticalAlign=middle;rotation=-90;" vertex="1" parent="1">
<mxCell id="9og6d5YY-6gPx96OlZrF-16" value="" style="shape=curlyBracket;whiteSpace=wrap;html=1;rounded=1;labelPosition=left;verticalLabelPosition=middle;align=right;verticalAlign=middle;rotation=-90;" parent="1" vertex="1">
<mxGeometry x="832" y="701" width="20" height="240" as="geometry" />
</mxCell>
<mxCell id="9og6d5YY-6gPx96OlZrF-17" value="Expression 1 through Expression n" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" vertex="1" parent="1">
<mxCell id="9og6d5YY-6gPx96OlZrF-17" value="Expression 1 through Expression n" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="727" y="832" width="230" height="30" as="geometry" />
</mxCell>
</root>


@@ -138,9 +138,9 @@ if compareWithCPU
println(gpuiVsGPUT_median)
println(gpuiVsGPUT_std)
BenchmarkTools.save("$BENCHMARKS_RESULTS_PATH/using_inbounds.json", results)
# BenchmarkTools.save("$BENCHMARKS_RESULTS_PATH/using_inbounds.json", results)
else
resultsOld = BenchmarkTools.load("$BENCHMARKS_RESULTS_PATH/256_blocksize.json")[1]
resultsOld = BenchmarkTools.load("$BENCHMARKS_RESULTS_PATH/using_inbounds.json")[1]
medianGPUI_old = median(resultsOld["GPUI"])
stdGPUI_old = std(resultsOld["GPUI"])


@@ -1,11 +1,44 @@
\chapter{Concept and Design}
\label{cha:conceptdesign}
introduction to what needs to be done. also clarify terms "Host" and "Device" here
% introduction to what needs to be done. also clarify terms "Host" and "Device" here
To be able to determine whether evaluating mathematical expressions on the GPU is better suited than on the CPU, a prototype needs to be implemented: more specifically, a prototype that interprets these expressions on the GPU, as well as one that transpiles expressions into code that can be executed by the GPU. The goal of this chapter is to describe how these two prototypes can be implemented conceptually. First, the requirements for the prototypes, as well as the data they operate on, are explained. This is followed by the design of the interpreter and the transpiler. The CPU interpreter will not be described, as it already exists.
% TODO: maybe describe CPU interpreter too? We will see
\section[Requirements]{Requirements and Data}
short section.
Multiple expressions; vars for all expressions; params unique to expression; operators that need to be supported
% short section.
% Multiple expressions; vars for all expressions; params unique to expression; operators that need to be supported
The main goal of both prototypes, or evaluators, is to provide a speed-up compared to the CPU interpreter already in use. However, it is also important to determine which evaluator provides the greater speed-up. If one of the evaluators is faster, it is meant to replace the CPU interpreter. The evaluators must therefore have similar capabilities and meet the following requirements:
\begin{itemize}
\item Multiple expressions as input.
\item All input expressions have the same number of variables ($x_n$), but can have a different number of parameters ($p_n$).
\item The variables are parametrised using a matrix of the form $k \times N$, where $k$ is the number of variables in the expressions and $N$ is the number of different parametrisations for the variables. This matrix is the same for all expressions.
\item The parameters are parametrised using a vector of vectors. Each vector $v_i$ corresponds to an expression $e_i$.
\item The following operations must be supported: $x + y$, $x - y$, $x * y$, $x / y$, $x ^ y$, $|x|$, $\log(x)$, $e^x$ and $\sqrt{x}$. Note that $x$ and $y$ can each stand for a value, a variable, or another operation.
\item The results of the evaluations are returned in a matrix of the form $k \times N$. In this case, $k$ equals the number of variable sets $N$ of the variable matrix, and $N$ equals the number of input expressions.
\end{itemize}
With these requirements, one possible expression that must be able to be evaluated is the following: $\log(e^{p_1}) - |x_1| * \sqrt{x_2} / 10 + 2^{x_3}$
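To make the requirement concrete, the example expression can be evaluated by hand for one variable set and one parameter vector. The values below are chosen purely for illustration, and the sketch is in Python rather than the Julia used by the prototypes:

```python
import math

# Hypothetical inputs for the example expression
# log(e^p1) - |x1| * sqrt(x2) / 10 + 2^x3
x = [2.0, 4.0, 3.0]  # one variable set: x1, x2, x3
p = [1.0]            # parameters of this expression: p1

result = math.log(math.exp(p[0])) - abs(x[0]) * math.sqrt(x[1]) / 10 + 2 ** x[2]
# log(e^1) = 1, |2| * sqrt(4) / 10 = 0.4, 2^3 = 8, so result = 8.6
```

With six such expressions and three variable sets, the evaluators would produce eighteen such values, one per cell of the results matrix.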
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{input_output_explanation.png}
\caption{This diagram shows what the input and output look like and how they interact with each other.}
\label{fig:input_output_explanation}
\end{figure}
With this, the capabilities are outlined. However, the input and output data need to be explained further for a better understanding. The first input is the expressions that need to be evaluated. These can have any length and can contain constant values, variables and parameters, all of which are linked together with the supported operations. In the example shown in Figure \ref{fig:input_output_explanation}, there are six expressions $e_1$ through $e_6$. Next is the variable matrix. One entry in this matrix corresponds to one variable in every expression, with the row indicating which variable it holds the value for. Each column holds a different set of variables. In the provided example, there are three variable sets, each holding the values for four variables $x_1$ through $x_4$. All expressions are evaluated using all variable sets, and the results of these evaluations are stored in the results matrix. Each entry in this matrix holds the resulting value of the evaluation of one expression with one variable set. The row indicates the variable set, while the column indicates the expression.
%%
%% TODO: Explain parameter optimisation a bit better/longer. Right now the understanding of parameters is not great with this.
%%
This is the minimal functionality needed to evaluate expressions with variables generated by a symbolic regression algorithm. In the case of parameter optimisation, it is useful to have a different type of variable, called a parameter. For parameter optimisation, the best-fitting parameters for the given variable sets need to be found. To achieve this, the evaluator is called multiple times with different parameters but the same variables, and the caller evaluates the results for their fitness. The parameters do not change within one call. They could therefore be treated as constant values of the expressions, and no separate input for them would be needed. However, providing the possibility to pass the parameters as an input makes the process of parameter optimisation easier, which is the reason the prototype evaluators need to support parameters as inputs. Not all expressions need to have the same number of parameters; therefore, they are structured as a vector of vectors and not as a matrix. The example in Figure \ref{fig:input_output_explanation} shows how the parameters are structured: one expression has zero parameters, while another has six parameters $p_1$ through $p_6$. It needs to be mentioned that, just like the number of variables, the number of parameters per expression is not limited. It is also possible to omit the parameters completely if they are not needed.
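The calling pattern described above can be sketched as a loop on the caller's side. The evaluator interface, the toy expression and the fitness measure below are illustrative assumptions, not the prototype's actual API, and the sketch is in Python rather than Julia:

```python
def evaluate(expr, variable_sets, params):
    """Stand-in for the GPU evaluators: evaluate one expression
    for every variable set with one fixed parameter vector."""
    return [expr(xs, params) for xs in variable_sets]

expr = lambda xs, ps: ps[0] * xs[0]    # toy expression: p1 * x1
variable_sets = [[1.0], [2.0], [3.0]]  # the same for every call
targets = [2.0, 4.0, 6.0]              # values the caller wants to fit

best_p, best_err = None, float("inf")
for p1 in [0.5, 1.0, 2.0, 4.0]:        # parameters change only between calls
    results = evaluate(expr, variable_sets, [p1])
    err = sum((r - t) ** 2 for r, t in zip(results, targets))
    if err < best_err:
        best_p, best_err = [p1], err
# best_p ends up as [2.0] with zero error
```

The key point the sketch captures is that the parameters are constant within one call and only vary between calls, which is why they are a separate input rather than being baked into the expressions.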
\subsection{Non-Goals}
Probably a good idea. Probably move this to "introduction"
\section{Interpreter}
as introduction to this section talk about what "interpreter" means in this context. so "gpu parses expr and calculates"
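In this context, "interpreter" means that the GPU itself parses a pre-processed form of the expression and calculates the result. A common representation for this purpose is postfix (reverse Polish) notation, which can be evaluated with a small stack. The Python sketch below illustrates the idea on the CPU; the token format is an assumption for illustration, not the prototype's actual data structure:

```python
import math

def interpret(postfix, xs, ps):
    """Stack-based evaluation of one expression for one variable set.
    Tokens are ('const', v), ('var', i), ('param', i) or an operator name."""
    ops = {
        "+": lambda a, b: a + b, "-": lambda a, b: a - b,
        "*": lambda a, b: a * b, "/": lambda a, b: a / b,
        "^": lambda a, b: a ** b,
    }
    stack = []
    for tok in postfix:
        if isinstance(tok, tuple):  # operand: push its value
            kind, v = tok
            stack.append(v if kind == "const" else xs[v] if kind == "var" else ps[v])
        elif tok in ops:            # binary operator: pop two, push result
            b, a = stack.pop(), stack.pop()
            stack.append(ops[tok](a, b))
        else:                       # unary operator: abs, log, exp, sqrt
            stack.append(abs(stack.pop()) if tok == "abs" else getattr(math, tok)(stack.pop()))
    return stack[0]

# x1 + 2 * x2 in postfix: x1 x2 2 * +
value = interpret([("var", 0), ("var", 1), ("const", 2.0), "*", "+"], xs=[1.0, 3.0], ps=[])
```

On the GPU, each thread would run this loop for one variable set, so all threads share the expression but read different columns of the variable matrix.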


@@ -6,9 +6,26 @@ Explain the hardware used, as well as the actual data (how many expressions, var
\section{Results}
talk about what we will see now (results only for interpreter, then transpiler and then compared with each other and a CPU interpreter)
\subsection{Interpreter}
Results only for Interpreter
\subsection{Performance tuning}
Document the process of performance tuning
Initial: CPU-Side single-threaded; up to 1024 threads per block; bounds-checking enabled (especially in kernel)
Blocksize reduced to a maximum of 256 -> moderate improvement in medium and large
Using @inbounds -> noticeable improvement in 2 out of 3
\subsection{Transpiler}
Results only for Transpiler
\subsection{Performance tuning}
Document the process of performance tuning
Initial: CPU-Side single-threaded; up to 1024 threads per block; bounds-checking enabled
Blocksize reduced to a maximum of 256 -> moderate improvement in medium and large
Using @inbounds -> small improvement only on CPU side code
\subsection{Comparison}
Comparison of Interpreter and Transpiler as well as Comparing the two with CPU interpreter


@@ -6,15 +6,12 @@ Short section; CUDA, PTX, Julia, CUDA.jl
Probably reference the performance evaluation papers for Julia and CUDA.jl
\section{Expression Processing}
Talk about why this needs to be done and how it is done (the why is basically: simplifies evaluation/transpilation process; the how is in ExpressionProcessing.jl)
\section{Interpreter}
Talk about how the interpreter has been developed.
\subsection{Performance tuning}
Document the process of performance tuning
\section{Transpiler}
Talk about how the transpiler has been developed
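In contrast to the interpreter, the transpiler turns each expression into executable code once, which is then run for all variable sets. The Python sketch below mimics this compile-once, evaluate-many pattern by generating source text from a postfix token list; it is purely conceptual (the prototype emits GPU code, not Python), and the token format is the same illustrative assumption as before:

```python
def transpile(postfix):
    """Toy transpiler: turn a postfix token list into source code for one expression."""
    stack = []
    for tok in postfix:
        if isinstance(tok, tuple):  # operand: emit an access expression
            kind, v = tok
            stack.append(repr(v) if kind == "const" else f"xs[{v}]" if kind == "var" else f"ps[{v}]")
        else:                       # binary operator: combine the two topmost operands
            b, a = stack.pop(), stack.pop()
            op = "**" if tok == "^" else tok
            stack.append(f"({a} {op} {b})")
    return "lambda xs, ps: " + stack[0]

# x1 + 2 * x2 in postfix: x1 x2 2 * +
source = transpile([("var", 0), ("var", 1), ("const", 2.0), "*", "+"])
fn = eval(source)  # compile once ...
results = [fn(xs, []) for xs in ([1.0, 3.0], [2.0, 0.5])]  # ... evaluate many times
```

The generated source contains no branching on token kinds, which is the conceptual advantage over the interpreter: the per-evaluation dispatch cost is paid once at transpilation time.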
\subsection{Performance tuning}
Document the process of performance tuning


@@ -11,15 +11,13 @@ Optimisation and acceleration of program code is a crucial part in many fields.
The following expression, $5 - \text{abs}(x_1) * \text{sqrt}(x_2) / 10 + 2 \char`^ x_3$, which contains simple mathematical operations as well as variables $x_n$ and parameters $p_n$, is one example that can be generated by the equation learning algorithm. Usually, an equation learning algorithm generates multiple such expressions per iteration. Out of these expressions, all possibly relevant ones have to be evaluated. Additionally, multiple different values need to be inserted for all variables and parameters, drastically increasing the number of evaluations that need to be performed.
In his Blog \textcite{sutter_free_2004} described how the free lunch is over in terms of the ever-increasing performance of hardware like the CPU. He states that to gain additional performance, developers need to start developing software for multiple cores and not just hope that on the next generation of CPUs the program magically runs faster. While this approach means more development overhead, a much greater speed-up can be achieved. However, in some cases the speed-up achieved by this is still not large enough and another approach is needed. One of these approaches is the utilisation of Graphics Processing Units (GPUs) as an easy and affordable option as compared to compute clusters. Especially when talking about performance per dollar, GPUs are very inexpensive as found by \textcite{brodtkorb_graphics_2013}. \textcite{michalakes_gpu_2008} have shown a noticeable speed-up when using GPUs for weather simulation. In addition to computer simulations, GPU acceleration also can be found in other places such as networking \parencite{han_packetshader_2010} or structural analysis of buildings \parencite{georgescu_gpu_2013}.
%The free lunch theorem as described by \textcite{adam_no_2019} states that to gain additional performance, a developer cannot just hope for future hardware to be faster, especially on a single core.
In his blog, \textcite{sutter_free_2004} described how the free lunch is over in terms of the ever-increasing performance of hardware like the CPU. He states that to gain additional performance, developers need to start developing software for multiple cores and not just hope that on the next generation of CPUs the program magically runs faster. While this approach means more development overhead, a much greater speed-up can be achieved. However, in some cases the speed-up achieved by this is still not large enough and another approach is needed. One of these approaches is the utilisation of Graphics Processing Units (GPUs) as an easy and affordable option as compared to compute clusters. Especially when talking about performance per dollar, GPUs are very inexpensive as found by \textcite{brodtkorb_graphics_2013}. \textcite{michalakes_gpu_2008} have shown a noticeable speed-up when using GPUs for weather simulation. In addition to computer simulations, GPU acceleration also can be found in other places such as networking \parencite{han_packetshader_2010} or structural analysis of buildings \parencite{georgescu_gpu_2013}.
% TODO: Incorporate PTX somehow
\section{Research Question}
With these successful implementations of GPU acceleration, this thesis also attempts to improve the performance of evaluating mathematical equations using GPUs. Therefore, the following research questions are formulated:
With these successful implementations of GPU acceleration, this thesis also attempts to improve the performance of evaluating mathematical equations generated at runtime for symbolic regression, using GPUs. Therefore, the following research questions are formulated:
\begin{itemize}
\item How can simple arithmetic expressions that are generated at runtime be efficiently evaluated on GPUs?


@@ -25,7 +25,7 @@ Graphics cards (GPUs) are commonly used to increase the performance of many diff
While in the early days of GPGPU programming a lot of research has been done to assess if this approach is feasible, it now seems obvious to use GPUs to accelerate algorithms. GPUs have been used early to speed up weather simulation models. \textcite{michalakes_gpu_2008} proposed a method for simulating weather with the Weather Research and Forecast (WRF) model on a GPU. With their approach, they reached a speed-up of 5 to 2 for the most compute intensive task, with little GPU optimisation effort. They also found that the GPU usage was low, meaning there are resources and potential for more detailed simulations. Generally, simulations are great candidates for using GPUs, as they can benefit heavily from a high degree of parallelism and data throughput. \textcite{koster_high-performance_2020} have developed a way of using adaptive time steps on the GPU to considerably improve the performance of numerical and discrete simulations. In addition to the performance gains they were able to retain the precision and constraint correctness of the simulation. Black hole simulations are crucial for science and education for a better understanding of our world. \textcite{verbraeck_interactive_2021} have shown that simulating complex Kerr (rotating) black holes can be done on consumer hardware in a few seconds. Schwarzschild black hole simulations can be performed in real-time with GPUs as described by \textcite{hissbach_overview_2022} which is especially helpful for educational scenarios. While both approaches do not have the same accuracy as detailed simulations on supercomputers, they show how a single GPU can yield similar accuracy at a fraction of the cost. Software network routing can also heavily benefit from GPU acceleration as shown by \textcite{han_packetshader_2010}, where they achieved a significantly higher throughput than with a CPU only implementation. 
Finite element structural analysis is an essential tool for many branches of engineering and can also heavily benefit from the usage of GPUs as demonstrated by \textcite{georgescu_gpu_2013}. Generating test data for DeepQ learning can also significantly benefit from using the GPU \parencite{koster_macsq_2022}. However, it also needs to be noted, that GPUs are not always better performing than CPUs as illustrated by \textcite{lee_debunking_2010}, so it is important to consider if it is worth using GPUs for specific tasks.
\subsection{Programming GPUs}
The development process on a GPU is vastly different from a CPU. A CPU has tens or hundreds of complex cores with the AMD Epyc 9965\footnote{\url{https://www.amd.com/en/products/processors/server/epyc/9005-series/amd-epyc-9965.html}} having a staggering $192$ cores and twice as many threads. To demonstrate the complexity of a simple one core 8-bit CPU \textcite{schuurman_step-by-step_2013} has written a development guide. He describes the different parts of one CPU core and how they interact. Modern CPUs are even more complex, with dedicated fast integer and floating-point arithmetic gates as well as logic gates, sophisticated branch prediction and much more. This makes a CPU perfect for handling complex control flows on a single program strand and on modern CPUs even multiple strands simultaneously \parencite{palacios_comparison_2011}. However, as seen in section \ref{sec:gpgpu}, this often isn't enough. On the other hand, a GPU contains thousands or even tens of thousands of cores. For example, the GeForce RTX 5090\footnote{\url{https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/}} contains a total of $21\,760$ CUDA cores. To achieve this enormous core count a single GPU core has to be much simpler than one CPU core. As described by \textcite{nvidia_cuda_2025} a GPU designates much more transistors towards floating-point computations. This results in less efficient integer arithmetic and control flow handling. There is also less Cache available per core and clock speeds are usually also much lower than those on a CPU. An overview of the differences of a CPU and a GPU architecture can be seen in figure \ref{fig:cpu_vs_gpu}.
The development process on a GPU is vastly different from a CPU. A CPU has tens or hundreds of complex cores with the AMD Epyc 9965\footnote{\url{https://www.amd.com/en/products/processors/server/epyc/9005-series/amd-epyc-9965.html}} having a staggering $192$ cores and twice as many threads. To demonstrate the complexity of a simple one core 8-bit CPU \textcite{schuurman_step-by-step_2013} has written a development guide. He describes the different parts of one CPU core and how they interact. Modern CPUs are even more complex, with dedicated fast integer and floating-point arithmetic gates as well as logic gates, sophisticated branch prediction and much more. This makes a CPU perfect for handling complex control flows on a single program strand and on modern CPUs even multiple strands simultaneously \parencite{palacios_comparison_2011}. However, as seen in Section \ref{sec:gpgpu}, this often is not enough. On the other hand, a GPU contains thousands or even tens of thousands of cores. For example, the GeForce RTX 5090\footnote{\url{https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/}} contains a total of $21\,760$ CUDA cores. To achieve this enormous core count a single GPU core has to be much simpler than one CPU core. As described by \textcite{nvidia_cuda_2025} a GPU designates much more transistors towards floating-point computations. This results in less efficient integer arithmetic and control flow handling. There is also less Cache available per core and clock speeds are usually also much lower than those on a CPU. An overview of the differences of a CPU and a GPU architecture can be seen in Figure \ref{fig:cpu_vs_gpu}.
\begin{figure}
\centering
\label{fig:thread_hierarchy}
\end{figure}
A piece of code that is executed on a GPU is written as a kernel which can be configured. The most important configuration is how threads are grouped into blocks. The GPU allows the kernel to arrange threads, blocks and block clusters in up to three dimensions. This is often useful because of the already mentioned shared memory, which will be explained in more detail in Section \ref{sec:memory_model}. Considering the case where an image needs to be blurred, arranging threads in a 2D grid not only simplifies development, it also helps with optimising memory access. As the threads in a block need to access a lot of the same data, this data can be loaded into the shared memory of the block. This allows the data to be accessed much quicker compared to when threads are allocated in only one dimension. With one-dimensional blocks it is possible that threads assigned to nearby pixels are part of a different block, leading to a lot of duplicate data transfer. The size in each dimension of a block can be almost arbitrary within the maximum allowed number of threads. However, blocks that are too large might lead to other problems which are described in more detail in Section \ref{sec:occupancy}.
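To make this more concrete, such a two-dimensional launch configuration can be sketched in CUDA as follows. Note that \verb|blurKernel|, \verb|width| and \verb|height| are hypothetical names chosen for this illustration only:
\begin{GenericCode}[numbers=none]
dim3 threadsPerBlock(16, 16); // 256 threads arranged as a 16x16 tile
dim3 numBlocks((width + 15) / 16, (height + 15) / 16); // cover the image
blurKernel<<<numBlocks, threadsPerBlock>>>(input, output, width, height);
\end{GenericCode}
Each thread can then derive the pixel it is responsible for from its two-dimensional block and thread indices.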
All threads in a warp start at the same point in a program, but with their own instruction address, allowing them to work independently. Because of the SIMD architecture, all threads in a warp must execute the same instructions, and if threads start diverging, the SM must pause threads with different instructions and execute them later. Figure \ref{fig:thread_divergence} shows how such divergences can impact performance. The situation described by the figure also shows that after the divergence the threads could re-converge. On older hardware this does not happen, leading to T2 being executed only after T1 and T3 are finished. In situations where a lot of data-dependent thread divergence happens, most of the benefit of using a GPU is likely lost. Threads not executing the same instruction is, strictly speaking, against the SIMD principle, but can happen in reality due to data-dependent branching. Consequently, this leads to bad resource utilisation, which in turn leads to worse performance. Threads can also be paused (inactive threads) when the number of threads started is not divisible by 32. In such cases, the last warp still contains 32 threads, but only the threads with work are executed.
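A minimal, hypothetical CUDA kernel illustrating such data-dependent divergence could look as follows; threads of one warp whose values of \verb|data[i]| differ take different paths and are serialised by the SM:
\begin{GenericCode}[numbers=none]
__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (data[i] > 0.0f) {
        data[i] *= 2.0f; // executed first by the threads taking this branch
    } else {
        data[i] = 0.0f;  // the remaining threads are paused until here
    }                    // on newer hardware the warp re-converges here
}
\end{GenericCode}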
% - Memory allocation (with the one paper diving into dynamic allocations)
% - Memory transfer (with streams potentially)
On a GPU there are two parts that contribute to the performance of an algorithm. The part already looked at is the compute portion of the GPU. This is necessary because if threads are serialised or run inefficiently, there is nothing that can make the algorithm execute faster. However, algorithms run on a GPU usually require huge amounts of data to be processed, as they are designed for exactly that purpose. The purpose of this section is to explain how the memory model of the GPU works and how it can influence the performance of an algorithm. In Figure \ref{fig:gpu_memory_layout} the memory layout and the kinds of memory available are depicted. The different parts will be explained in this section.
\begin{figure}
\centering
\label{fig:gpu_memory_layout}
\end{figure}
On a GPU there are multiple levels and kinds of memory available. All these levels and kinds have different purposes they are optimised for. This means that it is important to know what they are and how they can best be used for specific tasks. On the lowest level, threads have registers and local memory available. Registers are the fastest way to access memory, but they are also the least abundant, with a maximum of 255 32-bit registers per thread on Nvidia GPUs and 256 on AMD GPUs \parencite{amd_hardware_2025}. However, using all registers of a thread can lead to other problems which are described in more detail in Section \ref{sec:occupancy}. In contrast, thread-local memory is significantly slower than registers. This is because local memory is actually stored in global memory and therefore has the same limitations, which are explained later. This means it is important to avoid local memory as much as possible. Local memory is usually only used when a thread uses too many registers. The compiler will then spill the remaining data into local memory and load it back into registers once needed, drastically slowing down the application.
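The following hypothetical kernel sketches how such spilling can occur. The array is too large to be held in registers, so the compiler places it in local, and therefore global, memory:
\begin{GenericCode}[numbers=none]
__global__ void spillExample(const float *in, float *out) {
    float buffer[512]; // far exceeds the register budget of a thread
    for (int i = 0; i < 512; i++)
        buffer[i] = in[i] * 2.0f; // stored in slow local memory
    float sum = 0.0f;
    for (int i = 0; i < 512; i++)
        sum += buffer[i]; // reloaded from local memory
    *out = sum;
}
\end{GenericCode}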
Shared memory is the next tier of memory on a GPU. Unlike local memory and registers, shared memory is shared between all threads inside a block. The amount of shared memory depends on the GPU architecture, but for Nvidia it hovers at around 100 kilobytes (KB) per block. While this memory is slower than registers, its primary use-case is communicating and sharing data between threads in a block. If all threads in a block access a lot of overlapping data, this data can be loaded from global memory into the faster shared memory once. It can then be accessed multiple times, further increasing performance. Loading data into shared memory and accessing that data has to be done manually. Because shared memory is part of the unified data cache, it can either be used as a cache or for manual use, meaning a developer can allocate more shared memory towards caching if needed. Another feature of shared memory is its division into so-called memory banks. Shared memory is always split into 32 equally sized memory modules, also called memory banks. All available memory addresses lie in one of these banks. This means that if two threads access two memory addresses which lie in different banks, the accesses can be performed simultaneously, increasing the throughput.
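The typical usage pattern can be sketched as follows: each block stages the data it needs in shared memory once, synchronises, and all of its threads then read from the fast shared copy. The kernel name and the tile size of $16 \times 16$ are assumptions for this illustration:
\begin{GenericCode}[numbers=none]
#define TILE 16
__global__ void tiled(const float *in, float *out, int width) {
    __shared__ float tile[TILE][TILE];       // visible to the whole block
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x]; // one global load
    __syncthreads();                         // wait until the tile is filled
    // ... all threads can now reuse tile[][] instead of global memory ...
    out[y * width + x] = tile[threadIdx.y][threadIdx.x];
}
\end{GenericCode}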
In general, it is important to have as many warps as possible ready for execution. While this means that many warps are ready but not executing at any given moment, this is actually desired. A key feature of GPUs is so-called latency hiding: while one warp waits, for example, for data to be retrieved, another warp that is ready for execution can be run. With low occupancy, and therefore few to no warps waiting for execution, latency hiding does not work, as the hardware is now idle. As a result, the runtime increases, which also explains why high occupancy is not guaranteed to result in performance improvements, while low occupancy can and often will increase the runtime.
As seen in Table \ref{tab:compute_capabilities}, there exist different limitations that can impact occupancy. The number of warps per SM is important, as it determines the degree of parallelism achievable per SM. If, due to other limitations, the number of warps per SM is below the maximum, there is idle hardware. One such limitation is the number of registers per block and SM. In the case of compute capability 8.9, one SM can handle $32 * 48 = 1\,536$ threads. This leaves $64\,000 / 1\,536 \approx 41$ registers per thread, which is lower than the theoretical maximum of $255$ registers per thread. Typically, one register is mapped to one variable in the kernel code, meaning a developer can use up to 41 variables in their code. However, if a variable needs 64 bits to store its value, the register usage doubles, as all registers on a GPU are 32-bit. On a GPU with compute capability 10.x a developer can use up to $64\,000 / 2\,048 \approx 31$ registers. Of course a developer can use more registers, but this results in lower occupancy. However, depending on the algorithm, using more registers might be more beneficial to performance than the lower occupancy, in which case occupancy is not as important. If a developer needs more than $255$ registers for their variables, the additional variables will spill into local memory which is, as described in Section \ref{sec:memory_model}, not desirable.
Additionally, shared memory consumption can also impact occupancy. If, for example, a block needs all the available shared memory, which is almost the same as the amount of shared memory per SM, this SM can only serve this one block. On compute capability 10.x, this would mean that occupancy would be at most $50\%$, as a block can have up to $1\,024$ threads while an SM supports up to $2\,048$ threads. Again, in such cases it needs to be determined whether the performance gain of using this much shared memory is worth the lower occupancy.
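CUDA also exposes this trade-off programmatically. As a sketch, the occupancy API can report how many blocks of a given kernel, here the hypothetical \verb|myKernel|, fit on one SM for a chosen block size:
\begin{GenericCode}[numbers=none]
int numBlocks; // how many blocks of myKernel fit on one SM
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocks, myKernel,
    256, // threads per block
    0);  // dynamic shared memory per block in bytes
\end{GenericCode}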
\begin{GenericCode}[numbers=none]
add.f32 \%n, 0.1, 0.2;
\end{GenericCode}
Loops in the classical sense do not exist in PTX. Instead, a developer needs to define jump targets for the beginning and end of the loop. Program \ref{code:ptx_loop} shows how a function with a simple loop can be implemented. The loop counts down to zero from the passed parameter $N$, which is loaded into the register \%n in line 6. If the value in the register \%n has reached zero, the loop branches at line 9 to the jump target at line 12 and the loop is finished. All other used directives and further information on writing PTX code can be found in the PTX documentation \parencite{nvidia_parallel_2025}.
\begin{program}
\begin{GenericCode}
\textcite{aho_compilers_2006} and \textcite{cooper_engineering_2022} describe how a compiler can be developed, with the latter focusing on more modern approaches. They describe how a compiler consists of two parts: the analyser, also called frontend, and the synthesiser, also called backend. The frontend is responsible for ensuring syntactic and semantic correctness and converts the source code into an intermediate representation, an abstract syntax tree (AST), for the backend. Generating code in the target language from the intermediate representation is the job of the backend. This target code can be assembly or anything else that is needed for a specific use-case. This intermediate representation also makes it simple to swap out frontends or backends. The GNU Compiler Collection \parencite{gcc_gcc_2025} takes advantage of different frontends to provide support for many languages, including C, C++, Ada and more. Instead of compiling source code for specific machines directly, many languages compile code for virtual machines instead. Notable examples are the Java Virtual Machine (JVM) \parencite{lindholm_java_2025} and the low level virtual machine (LLVM) \parencite{lattner_llvm_2004}. Such virtual machines provide a bytecode which can be used as a target language for compilers. A huge benefit of such virtual machines is the ability for one program to be run on all physical machines the virtual machine exists for, without the developer needing to change that program \parencite{lindholm_java_2025}. Programs written for virtual machines are compiled into their respective bytecode. This bytecode can then be interpreted or compiled to physical machine code and then be run. According to the JVM specification \parencite{lindholm_java_2025}, the Java bytecode is interpreted and also compiled with a just-in-time (JIT) compiler to increase the performance of code blocks that are often executed.
On the other hand, the common language runtime (CLR)\footnote{\url{https://learn.microsoft.com/en-us/dotnet/standard/clr}}, the virtual machine for languages like C\#, never interprets the generated bytecode. As described by \textcite{microsoft_overview_2023} the CLR always compiles the bytecode to physical machine code using a JIT compiler before it is executed.
A grammar describes how a language is structured. It not only describes the structure of natural language, but it can also be used to describe the structure of a programming language. \textcite{chomsky_certain_1959} found that grammars can be grouped into four levels, with regular and context-free grammars being the most relevant for programming languages. A regular grammar is of the structure $A = a\,|\,a\,B$, which is called a rule. The symbols $A$ and $B$ are non-terminal symbols and $a$ is a terminal symbol. A non-terminal symbol stands for another rule with the same structure and must only occur after a terminal symbol. Terminal symbols are fixed symbols or a value that can be found in the input stream, like literals in programming languages. Context-free grammars are more complex and are of the structure $A = \beta$. In this context, $\beta$ stands for any combination of terminal and non-terminal symbols. Therefore, a rule like $A = a\,|\,a\,B\,a$ is allowed with this grammar level. This shows that with context-free grammars enclosing structures are possible. To write grammars for programming languages, other properties are also important in order to efficiently validate or parse input defined by such a grammar. However, these are not discussed here, but are described by \textcite{aho_compilers_2006}. They also describe that generating a parser out of a grammar can be automated. This automation can be performed by parser generators like Yacc \parencite{johnson_yacc_1975} as described in their book. More modern alternatives are Bison\footnote{\url{https://www.gnu.org/software/bison/}} or Antlr\footnote{\url{https://www.antlr.org/}}. Before the parser can validate the input stream, a scanner is needed, as described by \textcite{cooper_engineering_2022}. The scanner reads every character of the input stream, removes white-space and ensures that only valid characters and words are present.
Flex\footnote{\url{https://github.com/westes/flex}} is a tool for generating a scanner and is often used in combination with Bison. A simplified version of the compiler architecture using Flex and Bison is depicted in Figure \ref{fig:compiler_layout}. It shows how source code is taken and transformed into the intermediate representation by the frontend, and how it is converted into executable machine code by the backend.
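As a small illustration, a context-free grammar for simple arithmetic expressions, written in the rule notation accepted by Yacc and Bison, could be sketched as follows; the token \verb|NUMBER| would be supplied by the scanner:
\begin{GenericCode}[numbers=none]
expr   : expr '+' term        /* recursive, enclosing structure */
       | term ;
term   : term '*' factor
       | factor ;
factor : NUMBER
       | '(' expr ')' ;
\end{GenericCode}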
\begin{figure}
\centering

% Title page entries
%%%-----------------------------------------------------------------------------
\title{Interpreter and Transpiler for Simple Expressions on Nvidia GPUs using Julia}
\author{Daniel Roth}
\programname{Software Engineering}