concept and design: rewrote requirements section

2025-04-17 15:12:58 +02:00
parent c3f19d807a
commit 6ab826cc42
2 changed files with 12 additions and 4 deletions
--- a/thesis/chapters/conceptdesign.tex
+++ b/thesis/chapters/conceptdesign.tex
@ -27,12 +27,20 @@ The main goal of both prototypes or evaluators is to provide a speed-up compared
 \end{figure}


-With this, the required capabilities are outlined. However, the input and output data need to further be explained for a better understanding. The first input contains the expressions that need to be evaluated. These can have any length and can contain constant values, variables and parameters and all of these are linked together with the supported operations. In the example shown in Figure \ref{fig:input_output_explanation}, there are six expressions $e_1$ through $e_6$. Next is the variable matrix. One entry in this matrix, corresponds to one variable in every expression. The row indicates which variable it holds the value for. For example the values in row three, are used to parametrise the variable $x_3$. Each column holds a different set of variables. Each expression must be evaluated using every variable set. In the provided example, there are three variable sets, each holding the values for four variables $x_1$ through $x_4$. After all expressions are evaluated using all variable sets the results of these evaluations must be stored in the results matrix. Each entry in this matrix holds the resulting value of the evaluation of one expression parametrised with one variable set. The row indicates the variable set while the column indicates the expression.
+With this, the required capabilities are outlined. However, for a better understanding, the input and output data need to be explained further. The first input contains the expressions that need to be evaluated. These can be of any length and can contain constant values, variables and parameters, all of which are linked together with the supported operations. In the simplified example shown in Figure \ref{fig:input_output_explanation}, there are six expressions $e_1$ to $e_6$. 

-This is the minimal functionality needed to evaluate expressions with variables generated by a symbolic regression algorithm. In the case of parameter optimisation, it is useful to have a different type of variable, called parameter. For parameter optimisation it is important that for the given variable sets, the best fitting parameters need to be found. To achieve this, the evaluator is called multiple times with different parameters, but the same variables. The results are then evaluated for their fitness by the caller. In this case, the parameters do not change within one call. Parameters could therefore be treated as constant values of the expressions, and no separate input for them would be needed. However, providing the possibility to have the parameters as an input, makes the process of parameter optimisation easier. Unlike variables, not all expressions need to have the same number of parameters. Therefore, they are structured as a vector of vectors and not a matrix. The example in Figure \ref{fig:input_output_explanation} shows how the parameters are structured. For example one expression has zero parameters, while another has six parameters $p_1$ through $p_6$. It needs to be mentioned that just like the number of variables, the number of parameters per expression is not limited. It is also possible to completely omit the parameters if they are not needed. Because these evaluators will primarily be used in parameter optimisation use-cases, allowing parameters as an input is required.
+Next is the variable matrix. An entry in this matrix corresponds to one variable in every expression. The row indicates which variable it holds the value for. For example the values in row three, are used to parameterise the variable $x_3$. Each column holds a different set of variables. Each expression must be evaluated using each set of variable. In the provided example, there are three variable sets, each containing the values for four variables $x_1$ to $x_4$. After all expressions have been evaluated using all variable sets, the results of these evaluations must be stored in the result matrix. Each entry in this matrix holds the result of the evaluation of one expression parameterised with one variable set. The row indicates the variable set and the column indicates the expression.
+
+The prototypes developed in this thesis, are part of a GP algorithm for symbolic regression. This means that the expressions that are evaluated, represent parts of the search space of all expressions being made up of any combination of allowed operators, the set of input variables, a set of parameters and constants. This means that the size of the search space grows exponentially. Exploring this search space by simply generating expressions, evaluating them once and then generating the next set of expressions leaves much of the search space unexplored. To combat this, parameters are introduced. These allow the algorithm to perform some kind of local search. To enable this, the prototypes must support not only variables, but also parameters.
+
+The parameters themselves are unique to each expression, meaning they have a one-to-one mapping to an expression. Furthermore, as can be seen in Figure \ref{fig:input_output_explanation}, each expression can have a different number of parameters, or even no parameters at all. However, with no parameters, it wouldn't be possible to perform parameter optimisation. This is in contrast to variables, where each expression must have the same number of variables. Because parameters are unique to each expression and can vary in size, they are not structured as a matrix, but as a vector of vectors.
+
+An important thing to consider, is the volume and volatility of the data itself. The example used above has been drastically simplified. It is expected, that there are hundreds of expressions evaluate per GP generation. Each of these expressions may contain between ten and 50 tokens. A token is equivalent to either a variable, a parameter, a constant value or an operator.
+
+Usually, the number of variables per expression is around ten. However, the number of variable sets can increase drastically. It can be considered, that $1\,000$ variable sets is the lower limit. On the other hand, $100\,000$ can be considered as the upper limit. Considering that one variable takes up 4 bytes of space and 10 variables are needed per expression, at least $4 * 10 * 1\,000 = 40\,000$ bytes and at most $4 * 10 * 100\,000 = 400\,000$ bytes need to be transferred to the GPU for the variables.
+
+These variables do not change during the runtime of the symbolic regression algorithm, so the data only needs to be sent to the GPU once. This means that the impact of this data transfer is minimal. On the other hand, the data for the parameters is much more volatile. As explained above, they are used for parameter optimisation and therefore vary from evaluation to evaluation and need to be sent to the GPU very frequently.

-% \subsection{Non-Goals}
-% Probably a good idea. Probably move this to "introduction"
 \section{Architecture}

 Based on the requirements above, the architecture of both prototypes can be designed. While the requirements only specify the input and output, the components and workflow also need to be specified. This section aims at giving an architectural overview of both prototypes, alongside their design decisions. 
--- a/thesis/main.pdf
+++ b/thesis/main.pdf