Investigating the effects of minimising the training set fill distance in machine learning regression
Paolo Climaco (a), Jochen Garcke (a, b)
(a) Institut für Numerische Simulation, Universität Bonn, Bonn, Germany
(b) Fraunhofer SCAI, Sankt Augustin, Germany
Machine learning (ML) regression methods are powerful tools that leverage large datasets for nonlinear function approximation in high-dimensional domains. Unfortunately, learning from large datasets may be infeasible due to computational limitations and the cost of producing labels for the data points, as in quantum-chemistry applications. An important task in scientific ML is therefore to sample small training sets from large pools of unlabelled data points so as to maximise model performance while keeping the computational effort of the training and labelling processes low. In this talk, we analyse the benefits of a common approach to training set sampling that aims to minimise the fill distance of the selected set, that is, the largest distance from any point in the pool to its nearest selected point, which quantifies how well the training set covers the data. We provide an upper bound on the maximum expected error of the loss function, conditional on the knowledge of the data features, that depends linearly on the training set fill distance. We then show that minimising this bound by minimising the training set fill distance reduces the worst-case approximation error of several ML models, which can be interpreted as an indicator of the robustness of a model's approximation. We perform experiments with various ML techniques, such as kernel methods and neural networks. The application context is quantum chemistry, where several datasets have been generated to develop effective ML techniques that approximate the functions mapping high-dimensional representations of molecules to their physical and chemical properties.
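For reference, the fill distance of a selected training set can be written in standard scattered-data notation (the symbols below are our choice, not taken from the talk): given a pool of feature vectors $X \subset \mathbb{R}^d$ and a selected subset $X_{\mathrm{train}} \subseteq X$,

$$
h_{X_{\mathrm{train}},\,X} \;=\; \max_{x \in X} \; \min_{x' \in X_{\mathrm{train}}} \lVert x - x' \rVert_2 ,
$$

so a small fill distance means every pool point lies close to some training point.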
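A standard way to approximately minimise the fill distance is farthest point sampling, which greedily adds the pool point farthest from the current selection (a 2-approximation to the k-center problem). The sketch below is a minimal NumPy illustration under generic assumptions (Euclidean distance, feature vectors as rows); it is not necessarily the exact procedure used in the talk.

```python
import numpy as np

def farthest_point_sampling(pool, k, seed=0):
    """Greedily select k rows of `pool` (shape (n, d)) to keep the fill
    distance small: each step adds the pool point farthest from the
    points selected so far."""
    rng = np.random.default_rng(seed)
    n = pool.shape[0]
    selected = [int(rng.integers(n))]          # arbitrary starting point
    # distance from every pool point to its nearest selected point
    dists = np.linalg.norm(pool - pool[selected[0]], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dists))            # farthest point joins the set
        selected.append(idx)
        new_d = np.linalg.norm(pool - pool[idx], axis=1)
        dists = np.minimum(dists, new_d)       # update nearest-point distances
    return np.array(selected), float(dists.max())  # indices, resulting fill distance

# usage: select 100 training points from a pool of 10,000 feature vectors
pool = np.random.default_rng(1).normal(size=(10_000, 16))
idx, fill = farthest_point_sampling(pool, k=100)
print(f"fill distance of selected set: {fill:.3f}")
```

Each iteration costs O(n d), so selecting k points takes O(n k d) time, which keeps the sampling step cheap relative to labelling and training in the quantum-chemistry setting described above.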