related work: continuation of equation learning section
@@ -1,2 +0,0 @@
https://www.markussteinberger.net/papers/DynMemory.pdf
- Shows the performance impact of dynamically allocating memory with different allocators (including the CUDA-internal one, which I am using). Might be a topic for "Future Work", so that one could later look into another allocator to gain more performance.
@@ -6,8 +6,10 @@ The goal of this chapter is to provide an overview of equation learning to estab
% Section describing what equation learning is and why it is relevant for the thesis
Equation learning is a field of research that aims at understanding and discovering equations from data originating in various fields such as mathematics and physics. Data is usually much more abundant, while models are often elusive. Because of this, generating equations with a computer can more easily lead to discovering equations that describe the observed data. \textcite{brunton_discovering_2016} describe an algorithm that leverages equation learning to discover equations for physical systems. A more literal interpretation of equation learning is demonstrated by \textcite{pfahler_semantic_2020}, who use machine learning to learn the form of equations. Their aim was to simplify the discovery of relevant publications by searching for the equations they use rather than for technical terms, as those may differ between fields of research. However, this kind of equation learning is not relevant for this thesis.
Symbolic regression is a subset of equation learning that specialises in discovering mathematical equations.
% probably transition to symbolic regression and \textcite{werner_informed_2021}, as this seems more fitting and symbolic regression probably has more publications/makes it easier to find publications. This is the section where I will also talk about how expressions look (see introduction) and the process of generating and evaluating expressions, and therefore how this is a potential performance bottleneck
Symbolic regression is a subset of equation learning that specialises in discovering mathematical equations, and a lot of research is done in this field. \textcite{keijzer_scaled_2004} and \textcite{korns_accuracy_2011} presented ways of improving the quality of symbolic regression algorithms, making symbolic regression more feasible for problem-solving. Additionally, \textcite{jin_bayesian_2020} proposed an alternative to genetic programming (GP) for use in symbolic regression. Their approach noticeably increased the quality of the results compared to GP alternatives. The first two approaches are primarily concerned with the quality of the output, while the third is also concerned with interpretability and reduced memory consumption. Heuristics like GP, or the neural networks used by \textcite{werner_informed_2021} in their equation learner, can help to find good solutions faster, accelerating scientific progress. One key part of equation learning in general is the computational evaluation of the generated equations. As this is an expensive operation, improving its performance reduces computation times and, in turn, helps all approaches to find solutions more quickly.
% probably a quick detour to show how generated equations might look and why evaluating them is expensive; see the sketch below
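To make this concrete (a hypothetical example, not taken from any of the cited works): a symbolic regression run might produce a candidate expression such as $5.3 \cdot x_1 + x_2^2 - \sin(x_3)$, typically represented as an expression tree that must be interpreted once per data point, for every candidate in every generation. A minimal sketch of such a tree interpreter in Julia follows; all type and function names are illustrative assumptions, not the implementation used in this thesis:

\begin{verbatim}
# Minimal expression-tree interpreter (illustrative sketch).
abstract type Node end
struct Const  <: Node; value::Float64 end
struct Var    <: Node; index::Int end   # refers to x[index]
struct Unary  <: Node; op::Function; arg::Node end
struct Binary <: Node; op::Function; left::Node; right::Node end

# One full recursive traversal per data point; repeated for every
# candidate expression, this quickly becomes the bottleneck.
evaluate(n::Const,  x) = n.value
evaluate(n::Var,    x) = x[n.index]
evaluate(n::Unary,  x) = n.op(evaluate(n.arg, x))
evaluate(n::Binary, x) = n.op(evaluate(n.left, x), evaluate(n.right, x))

# Tree for 5.3 * x1 + x2^2 - sin(x3)
tree = Binary(-,
    Binary(+, Binary(*, Const(5.3), Var(1)),
              Binary(^, Var(2), Const(2.0))),
    Unary(sin, Var(3)))

evaluate(tree, [1.0, 2.0, 3.0])   # 5.3 + 4.0 - sin(3.0)
\end{verbatim}

The dynamic dispatch and recursion in such an interpreter are cheap for a single call but add up over millions of (expression, data point) pairs, which is why the evaluation step is a natural target when optimising for performance.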
% talk about cases where porting algorithms to GPUs helped increase performance. This will be the transition to the sections below
\section[GPGPU]{General Purpose Computation on Graphics Processing Units}
BIN thesis/main.pdf (binary file not shown)
@@ -219,13 +219,6 @@
    file = {PDF:C\:\\Users\\danwi\\Zotero\\storage\\GKAYMMNN\\Memarzia und Khunjush - 2015 - An In-depth Study on the Performance Impact of CUDA, OpenCL, and PTX Code.pdf:application/pdf},
}

@online{noauthor_-depth_nodate,
    title = {An In-depth Study on the Performance Impact of {CUDA}, {OpenCL}, and {PTX} Code},
    url = {https://www.global-sci.org/intro/article_detail.html?journal=undefined&article_id=22555},
    urldate = {2024-12-01},
    file = {An In-depth Study on the Performance Impact of CUDA, OpenCL, and PTX Code:C\:\\Users\\danwi\\Zotero\\storage\\7CPIZPCF\\article_detail.html:text/html},
}

@article{bastidas_fuertes_transpiler-based_2023,
    title = {Transpiler-Based Architecture Design Model for Back-End Layers in Software Development},
    volume = {13},
@@ -331,3 +324,71 @@ Publisher: Multidisciplinary Digital Publishing Institute},
    note = {Publisher: Proceedings of the National Academy of Sciences},
    file = {Full Text PDF:C\:\\Users\\danwi\\Zotero\\storage\\6R643NFZ\\Brunton et al. - 2016 - Discovering governing equations from data by sparse identification of nonlinear dynamical systems.pdf:application/pdf},
}

@article{dong_evolving_2024,
    title = {Evolving Equation Learner For Symbolic Regression},
    issn = {1941-0026},
    url = {https://ieeexplore.ieee.org/abstract/document/10538006/metrics#metrics},
    doi = {10.1109/TEVC.2024.3404650},
    abstract = {Symbolic regression, a multifaceted optimization challenge involving the refinement of both structural components and coefficients, has gained significant research interest in recent years. The Equation Learner ({EQL}), a neural network designed to optimize both equation structure and coefficients through gradient-based optimization algorithms, has emerged as an important topic of concern within this field. Thus far, several variations of {EQL} have been introduced. Nevertheless, these existing {EQL} methodologies suffer from a fundamental constraint that they necessitate a predefined network structure. This limitation imposes constraints on the complexity of equations and makes them ill-suited for high-dimensional or high-order problem domains. To tackle the aforementioned shortcomings, we present a novel approach known as the evolving Equation Learner ({eEQL}). {eEQL} introduces a unique network structure characterized by automatically defined functions ({ADFs}). This new architectural design allows for dynamic adaptations of the network structure. Moreover, by engaging in self-learning and self-evolution during the search process, {eEQL} facilitates the generation of intricate, high-order, and constructive sub-functions. This enhancement can improve the accuracy and efficiency of the algorithm. To evaluate its performance, the proposed {eEQL} method has been tested across various datasets, including benchmark datasets, physics datasets, and real-world datasets. The results have demonstrated that our approach outperforms several well-known methods.},
    pages = {1--1},
    journaltitle = {{IEEE} Transactions on Evolutionary Computation},
    author = {Dong, Junlan and Zhong, Jinghui and Liu, Wei-Li and Zhang, Jun},
    urldate = {2025-02-26},
    date = {2024},
    note = {Conference Name: {IEEE} Transactions on Evolutionary Computation},
    keywords = {Optimization, Adaptation models, Complexity theory, Equation Learner, Evolutionary computation, Evolving equation learner, Mathematical models, Neural networks, Progressive Evolutionary Structure Search, Training},
    file = {IEEE Xplore Abstract Record:C\:\\Users\\danwi\\Zotero\\storage\\8PQADTZP\\metrics.html:text/html},
}

@incollection{korns_accuracy_2011,
    location = {New York, {NY}},
    title = {Accuracy in Symbolic Regression},
    isbn = {978-1-4614-1770-5},
    url = {https://doi.org/10.1007/978-1-4614-1770-5_8},
    abstract = {This chapter asserts that, in current state-of-the-art symbolic regression engines, accuracy is poor. That is to say that state-of-the-art symbolic regression engines return a champion with good fitness; however, obtaining a champion with the correct formula is not forthcoming even in cases of only one basis function with minimally complex grammar depth. Ideally, users expect that for test problems created with no noise, using only functions in the specified grammar, with only one basis function and some minimal grammar depth, that state-of-the-art symbolic regression systems should return the exact formula (or at least an isomorph) used to create the test data. Unfortunately, this expectation cannot currently be achieved using published state-of-the-art symbolic regression techniques. Several classes of test formulas, which prove intractable, are examined and an understanding of why they are intractable is developed. Techniques in Abstract Expression Grammars are employed to render these problems tractable, including manipulation of the epigenome during the evolutionary process, together with breeding of multiple targeted epigenomes in separate population islands. A selected set of currently intractable problems are shown to be solvable, using these techniques, and a proposal is put forward for a discipline-wide program of improving accuracy in state-of-the-art symbolic regression systems.},
    pages = {129--151},
    booktitle = {Genetic Programming Theory and Practice {IX}},
    publisher = {Springer},
    author = {Korns, Michael F.},
    editor = {Riolo, Rick and Vladislavleva, Ekaterina and Moore, Jason H.},
    urldate = {2025-02-27},
    date = {2011},
    langid = {english},
    doi = {10.1007/978-1-4614-1770-5_8},
}

@article{keijzer_scaled_2004,
    title = {Scaled Symbolic Regression},
    volume = {5},
    issn = {1573-7632},
    url = {https://doi.org/10.1023/B:GENP.0000030195.77571.f9},
    doi = {10.1023/B:GENP.0000030195.77571.f9},
    abstract = {Performing a linear regression on the outputs of arbitrary symbolic expressions has empirically been found to provide great benefits. Here some basic theoretical results of linear regression are reviewed on their applicability for use in symbolic regression. It will be proven that the use of a scaled error measure, in which the error is calculated after scaling, is expected to perform better than its unscaled counterpart on all possible symbolic regression problems. As the method (i) does not introduce additional parameters to a symbolic regression run, (ii) is guaranteed to improve results on most symbolic regression problems (and is not worse on any other problem), and (iii) has a well-defined upper bound on the error, scaled squared error is an ideal candidate to become the standard error measure for practical applications of symbolic regression.},
    pages = {259--269},
    number = {3},
    journaltitle = {Genetic Programming and Evolvable Machines},
    shortjournal = {Genet Program Evolvable Mach},
    author = {Keijzer, Maarten},
    urldate = {2025-02-27},
    date = {2004-09-01},
    langid = {english},
    keywords = {Artificial Intelligence, genetic programming, linear regression, symbolic regression},
    file = {Full Text PDF:C\:\\Users\\danwi\\Zotero\\storage\\ZH9LAN74\\Keijzer - 2004 - Scaled Symbolic Regression.pdf:application/pdf},
}

@misc{jin_bayesian_2020,
    title = {Bayesian Symbolic Regression},
    url = {http://arxiv.org/abs/1910.08892},
    doi = {10.48550/arXiv.1910.08892},
    abstract = {Interpretability is crucial for machine learning in many scenarios such as quantitative finance, banking, healthcare, etc. Symbolic regression ({SR}) is a classic interpretable machine learning method by bridging X and Y using mathematical expressions composed of some basic functions. However, the search space of all possible expressions grows exponentially with the length of the expression, making it infeasible for enumeration. Genetic programming ({GP}) has been traditionally and commonly used in {SR} to search for the optimal solution, but it suffers from several limitations, e.g. the difficulty in incorporating prior knowledge; overly-complicated output expression and reduced interpretability etc. To address these issues, we propose a new method to fit {SR} under a Bayesian framework. Firstly, Bayesian model can naturally incorporate prior knowledge (e.g., preference of basis functions, operators and raw features) to improve the efficiency of fitting {SR}. Secondly, to improve interpretability of expressions in {SR}, we aim to capture concise but informative signals. To this end, we assume the expected signal has an additive structure, i.e., a linear combination of several concise expressions, whose complexity is controlled by a well-designed prior distribution. In our setup, each expression is characterized by a symbolic tree, and the proposed {SR} model could be solved by sampling symbolic trees from the posterior distribution using an efficient Markov chain Monte Carlo ({MCMC}) algorithm. Finally, compared with {GP}, the proposed {BSR}(Bayesian Symbolic Regression) method saves computer memory with no need to keep an updated 'genome pool'. Numerical experiments show that, compared with {GP}, the solutions of {BSR} are closer to the ground truth and the expressions are more concise. Meanwhile we find the solution of {BSR} is robust to hyper-parameter specifications such as the number of trees.},
    number = {{arXiv}:1910.08892},
    publisher = {{arXiv}},
    author = {Jin, Ying and Fu, Weilin and Kang, Jian and Guo, Jiadong and Guo, Jian},
    urldate = {2025-02-27},
    date = {2020-01-16},
    eprinttype = {arxiv},
    eprint = {1910.08892 [stat]},
    keywords = {Statistics - Methodology},
    file = {Preprint PDF:C\:\\Users\\danwi\\Zotero\\storage\\3MP48UI3\\Jin et al. - 2020 - Bayesian Symbolic Regression.pdf:application/pdf;Snapshot:C\:\\Users\\danwi\\Zotero\\storage\\UNNZKPRJ\\1910.html:text/html},
}