Learning of Genome-Wide Generalized Additive Models

CDS members associated with the project

Chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq) is a widely used approach to study protein-DNA interactions. To analyze ChIP-Seq data, practitioners must combine tools that rest on different statistical assumptions and are dedicated to specific applications, such as calling protein occupancy peaks or testing for differential occupancy.

We have developed a genome-wide Generalized Additive Model (GenoGAM), which brings the well-established and flexible generalized additive models framework to genomic applications (see the figure below). For human data, GenoGAM requires fitting millions of parameters. Calculating confidence bounds requires, at least formally, inverting square matrices with as many rows and columns as there are parameters. Using some algebraic tricks and a data parallelism strategy, we can accomplish this task within a few hours on 40 CPUs.
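The two ideas in the paragraph above can be illustrated with a small sketch. It is not the GenoGAM implementation: it assumes Gaussian noise, a piecewise-linear basis, and a second-difference penalty purely for brevity, and the function names (`make_tiles`, `fit_tile`, `smooth_genome`) and tile sizes are hypothetical. It shows one plausible way to split a long signal into overlapping tiles that can be fitted independently, and how pointwise standard errors can be obtained from linear solves rather than from an explicit matrix inverse.

```python
import numpy as np

def make_tiles(n, tile_size, overlap):
    """Return (start, end) index pairs of overlapping tiles covering 0..n."""
    step = tile_size - overlap
    return [(s, min(s + tile_size, n)) for s in range(0, max(n - overlap, 1), step)]

def hat_basis(x, knots):
    """Piecewise-linear (degree-1 B-spline) basis evaluated at positions x."""
    h = knots[1] - knots[0]
    return np.clip(1.0 - np.abs(x[:, None] - knots[None, :]) / h, 0.0, None)

def fit_tile(x, y, n_knots=20, lam=1.0):
    """Penalized least-squares smooth of one tile with pointwise standard errors."""
    knots = np.linspace(x.min(), x.max(), n_knots)
    X = hat_basis(x, knots)                       # n x p design matrix
    D = np.diff(np.eye(n_knots), n=2, axis=0)     # second-difference penalty
    H = X.T @ X + lam * D.T @ D                   # penalized normal matrix (p x p)
    beta = np.linalg.solve(H, X.T @ y)
    fit = X @ beta
    # Only diag(X H^{-1} X^T) is needed for pointwise errors, so solve a
    # linear system instead of forming H^{-1} or the full n x n matrix.
    A = np.linalg.solve(H, X.T)                   # equals H^{-1} X^T, p x n
    leverage = np.einsum("ij,ji->i", X, A)
    sigma2 = np.sum((y - fit) ** 2) / (len(y) - leverage.sum())
    return fit, np.sqrt(sigma2 * leverage)

def smooth_genome(y, tile_size=300, overlap=100, **kw):
    """Fit every tile independently and stitch results at overlap midpoints."""
    n = len(y)
    fit, se = np.empty(n), np.empty(n)
    tiles = make_tiles(n, tile_size, overlap)
    # Each iteration is independent, so tiles could be sent to separate workers.
    for k, (s, e) in enumerate(tiles):
        f, u = fit_tile(np.arange(s, e, dtype=float), y[s:e], **kw)
        lo = s if k == 0 else s + overlap // 2
        hi = e if k == len(tiles) - 1 else e - (overlap - overlap // 2)
        fit[lo:hi] = f[lo - s:hi - s]
        se[lo:hi] = u[lo - s:hi - s]
    return fit, se
```

Calling `smooth_genome(y)` on a one-dimensional array of per-position values returns the smoothed signal and its pointwise standard errors. Because each tile only ever solves a system whose size is the number of knots in that tile, the cost grows with the number of tiles rather than with the cube of the total parameter count, and discarding the outer half of each overlap limits edge effects at the seams.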

Based on the output of GenoGAM, we construct a peak caller whose performance matches state-of-the-art methods. Moreover, GenoGAM provides significance testing for differential occupancy with a controlled type I error rate.

Reference

Georg Stricker, Alexander Engelhardt, Daniel Schulz, Matthias Schmid, Achim Tresch, Julien Gagneur; GenoGAM: genome-wide generalized additive models for ChIP-Seq analysis, Bioinformatics, Volume 33, Issue 15, 1 August 2017, Pages 2258–2265, https://doi.org/10.1093/bioinformatics/btx150

UNIVERSITÄT ZU KÖLN

Center for Data and Simulation Science
