Learning of Genome-Wide Generalized Additive Models
The typical primary output of a ChIP-Seq experiment is count data that assign to each genomic position (x-axis) the number of DNA-protein binding events at this position (immunoprecipitation = IP signal, third row). These measurements need to be corrected for unspecific binding, as quantified by a control experiment (Input control signal, top row). GenoGAM converts the discrete count data into a continuous, smooth estimate of the two binding signals (solid lines, second and fourth row). Ultimately, GenoGAM estimates the ratio between IP and control signal (bottom row), since this is the normalized DNA affinity measure for the protein of interest. Only this ratio reveals the binding of the protein of interest at the TATA / promoter site of the ADH3 gene (bottom row). GenoGAM also provides confidence bands around the point estimates (ribbons around the solid lines), which are needed for further statistical analysis.
Chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq) is a widely used approach to study protein-DNA interactions. To analyze ChIP-Seq data, practitioners are required to combine tools that rest on different statistical assumptions and are dedicated to specific applications, such as calling protein occupancy peaks or testing for differential occupancies.
We have developed a genome-wide Generalized Additive Model (GenoGAM), which brings the well-established and flexible generalized additive models framework to genomic applications (see the figure). For human data, GenoGAM requires the fitting of millions of parameters. The calculation of confidence bounds requires, at least formally, the inversion of square matrices whose dimension equals the number of parameters. Using some algebraic tricks and a data parallelism strategy, we can accomplish this task within a few hours on 40 CPUs.
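To make the role of the matrix inversion concrete, here is a minimal toy sketch of a penalized additive model for one short region. It is not the GenoGAM implementation: it assumes a Poisson likelihood, degree-1 B-splines (hat functions), a second-order difference penalty with a fixed smoothing parameter `lam`, and simulated counts. The function names `hat_basis` and `fit_log_ratio` are hypothetical. The model is log(mu) = f_control(x) + z * f_ratio(x), where z is 1 for IP counts and 0 for control counts, so f_ratio is the smooth log-ratio between IP and control.

```python
import numpy as np

def hat_basis(x, knots):
    """Degree-1 B-spline (hat function) design matrix on equally spaced knots."""
    h = knots[1] - knots[0]
    return np.clip(1.0 - np.abs(x[:, None] - knots[None, :]) / h, 0.0, None)

def fit_log_ratio(x, control, ip, n_knots=20, lam=5.0, n_iter=100, tol=1e-8):
    """Fit log(mu) = f_control(x) + z * f_ratio(x) by penalized IRLS.

    Returns the estimated log-ratio f_ratio(x) and its pointwise standard error.
    """
    knots = np.linspace(x.min(), x.max(), n_knots)
    B = hat_basis(x, knots)
    zeros = np.zeros_like(B)
    # Control rows (z = 0) stacked on top of IP rows (z = 1).
    X = np.block([[B, zeros], [B, B]])
    y = np.concatenate([control, ip]).astype(float)

    # Second-order difference penalty, one block per smooth function.
    D = np.diff(np.eye(n_knots), n=2, axis=0)
    S1 = D.T @ D
    S = np.block([[S1, np.zeros_like(S1)], [np.zeros_like(S1), S1]])

    # Start from a penalized least-squares fit to log(y + 1).
    beta = np.linalg.solve(X.T @ X + lam * S, X.T @ np.log(y + 1.0))
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z_work = eta + (y - mu) / mu          # working response
        A = X.T @ (mu[:, None] * X) + lam * S  # penalized information matrix
        beta_new = np.linalg.solve(A, X.T @ (mu * z_work))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break

    # Confidence bands need the inverse of the penalized information matrix.
    # Here it is 40 x 40; genome-wide it would have millions of rows and
    # columns, so a dense inverse like this one would not scale.
    mu = np.exp(X @ beta)
    V = np.linalg.inv(X.T @ (mu[:, None] * X) + lam * S)
    k = n_knots
    log_ratio = B @ beta[k:]
    se = np.sqrt(np.einsum("ij,jk,ik->i", B, V[k:, k:], B))
    return log_ratio, se

# Simulated example: flat background with a five-fold enrichment at x = 250.
rng = np.random.default_rng(0)
x = np.arange(500, dtype=float)
background = np.full_like(x, 3.0)
enrichment = 1.0 + 4.0 * np.exp(-0.5 * ((x - 250.0) / 30.0) ** 2)
control = rng.poisson(background)
ip = rng.poisson(background * enrichment)

log_ratio, se = fit_log_ratio(x, control, ip)
lower, upper = log_ratio - 1.96 * se, log_ratio + 1.96 * se
```

In this toy setting the dense inverse is cheap. The point of the sketch is only that the band width comes from the diagonal of `B V Bᵀ`, which is the quantity that becomes expensive at genome scale and motivates splitting the genome into pieces that can be processed in parallel.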
Based on the output of GenoGAM, we construct a peak caller whose performance matches state-of-the-art methods. Moreover, GenoGAM provides significance testing for differential occupancy with a controlled type I error rate.
Reference
- Georg Stricker, Alexander Engelhardt, Daniel Schulz, Matthias Schmid, Achim Tresch, Julien Gagneur; GenoGAM: genome-wide generalized additive models for ChIP-Seq analysis, Bioinformatics, Volume 33, Issue 15, 1 August 2017, Pages 2258–2265, https://doi.org/10.1093/bioinformatics/btx150