Adaptive sampling and variable selection strategies for high-dimensional genetic data
Christian Staerk, Institute for Medical Biometry, Informatics and Epidemiology, Medical Faculty, University of Bonn
With the growing availability of high-dimensional genetic data, data-driven variable selection methods play an increasingly important role in genetic epidemiology. Scalable statistical learning methods are needed to effectively explore the high-dimensional space of possible models with thousands of potential genetic variables.
I will present an overview of recent adaptive sampling strategies that can be incorporated into various variable selection approaches, such as ℓ0-type regularization, statistical boosting and Bayesian variable selection. The main underlying idea is to exploit sparsity via adaptive stochastic searches, which focus on those variables that have proven to be “important” in previous iterations of the algorithms. In a first approach, the Adaptive Subspace (AdaSub) algorithm [1] tackles the high-dimensional discrete optimization problem induced by ℓ0-type selection criteria by solving several low-dimensional sub-problems in an adaptive way, where the probability that a variable is included in a new sub-problem is sequentially adjusted based on its selection frequency in previous sub-problems. In a statistical boosting approach, the AdaSubBoost algorithm [2] incorporates an adaptive random preselection of multivariable base-learners in each iteration, focusing on base-learners that were also predictive in previous iterations. Finally, the Metropolized AdaSub (MAdaSub) algorithm [3] is an adaptive Markov chain Monte Carlo (MCMC) approach for Bayesian variable selection, where the individual proposal probabilities of the covariates are sequentially updated so that they converge to the posterior inclusion probabilities. Despite the continuing adaptation of the proposal probabilities, MAdaSub is ergodic, i.e., in the limit it samples from the full posterior model distribution. In gene expression data applications with more than 20,000 genes, MAdaSub can effectively sample from high-dimensional and multimodal posterior distributions.
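To make the adaptive subspace idea described above more concrete, the following minimal Python sketch samples a small set of candidate variables, solves the resulting low-dimensional sub-problem exhaustively, and then updates the sampling probabilities from the selection frequencies. The update rule r_j = (q + K * s_j) / (p + K * c_j), the parameter names q, K and max_size, the use of BIC as the ℓ0-type criterion, and the simulated data are assumptions made for this illustration; it should not be read as the exact specification of AdaSub in [1].

```python
import itertools
import numpy as np


def bic(X, y, subset):
    """BIC of a Gaussian linear model with intercept on the given columns."""
    n = len(y)
    cols = np.array(subset, dtype=int)
    design = np.column_stack([np.ones(n), X[:, cols]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return n * np.log(rss / n) + design.shape[1] * np.log(n)


def best_subset(X, y, candidates):
    """Exhaustively minimise BIC over all subsets of a small candidate set."""
    best, best_score = (), bic(X, y, ())
    for k in range(1, len(candidates) + 1):
        for subset in itertools.combinations(candidates, k):
            score = bic(X, y, subset)
            if score < best_score:
                best, best_score = subset, score
    return best


def adaptive_subspace_search(X, y, n_iter=200, q=3.0, K=50.0, max_size=8, seed=0):
    """Illustrative adaptive search; q, K and max_size are hypothetical tuning knobs."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    considered = np.zeros(p)  # c_j: how often variable j entered a sub-problem
    selected = np.zeros(p)    # s_j: how often variable j was selected within one
    for _ in range(n_iter):
        # Current inclusion probabilities (assumed update rule, see lead-in).
        r = (q + K * selected) / (p + K * considered)
        candidates = np.flatnonzero(rng.random(p) < r)
        # Cap the sub-problem size so that exhaustive search stays cheap.
        if len(candidates) > max_size:
            candidates = rng.choice(candidates, size=max_size, replace=False)
        chosen = best_subset(X, y, candidates.tolist())
        considered[candidates] += 1
        selected[np.array(chosen, dtype=int)] += 1
    return (q + K * selected) / (p + K * considered)


# Simulated sparse example: only the first three of 30 variables are active.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 30))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.standard_normal(200)

probs = adaptive_subspace_search(X, y)
top = np.argsort(probs)[::-1][:3]
print("variables with the highest final inclusion probabilities:", sorted(top.tolist()))
```

In this sketch, variables that are repeatedly selected when considered see their inclusion probability rise towards one, while variables that are considered but not selected are sampled less and less often, so the search concentrates on a small set of promising variables.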
While the adaptive variable selection methods described above, based on ℓ0-type regularization, boosting and Bayesian variable selection, are primarily designed for sparse high-dimensional settings, I will also highlight recent approaches [4, 5] and open challenges for modelling polygenic traits based on large-scale genotype data with many influential genetic variants and large sample sizes.
[1] Staerk, C., Kateri, M., & Ntzoufras, I. (2021). High-dimensional variable selection via low-dimensional adaptive learning. Electronic Journal of Statistics, 15(1), 830–879.
[2] Staerk, C., & Mayr, A. (2021). Randomized boosting with multivariable base-learners for high-dimensional variable selection and prediction. BMC Bioinformatics, 22(441).
[3] Staerk, C., Kateri, M., & Ntzoufras, I. (2022). A Metropolized adaptive subspace algorithm for high-dimensional Bayesian variable selection. Bayesian Analysis, Advance Publication, 1–31. https://doi.org/10.1214/22-BA1351
[4] Maj, C., Staerk, C., Borisov, O., Klinkhammer, H., Yeung, M. W., Krawitz, P., & Mayr, A. (2022). Statistical learning for sparser fine-mapped polygenic models: The prediction of LDL-cholesterol. Genetic Epidemiology, 46, 589–603.
[5] Klinkhammer, H., Staerk, C., Maj, C., Krawitz, P. M., & Mayr, A. (2023). A statistical boosting framework for polygenic risk scores based on large-scale genotype data. Frontiers in Genetics, 13(1076440).