 Method
 Open Access
Clipper: p-value-free FDR control on high-throughput data from two conditions
Genome Biology volume 22, Article number: 288 (2021)
Abstract
High-throughput biological data analysis commonly involves identifying features such as genes, genomic regions, and proteins, whose values differ between two conditions, from numerous features measured simultaneously. The most widely used criterion to ensure the analysis reliability is the false discovery rate (FDR), which is primarily controlled based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions. Clipper is a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper outperforms existing methods for a broad range of applications in high-throughput data analysis.
Introduction
High-throughput technologies are widely used to measure system-wide biological features, such as genes, genomic regions, and proteins (“high-throughput” means the number of features is large, at least in thousands). The most common goal of analyzing high-throughput data is to contrast two conditions so as to reliably screen “interesting features,” where “interesting” means “enriched” or “differential.” “Enriched features” are defined to have higher expected measurements (without measurement errors) under the experimental (i.e., treatment) condition than under the background (i.e., negative control) condition. The detection of enriched features is called “enrichment analysis.” For example, typical enrichment analyses include calling protein-binding sites in a genome from chromatin immunoprecipitation sequencing (ChIP-seq) data [1, 2] and identifying peptides from mass spectrometry (MS) data [3]. In contrast, “differential features” are defined to have different expected measurements between two conditions, and their detection is called “differential analysis.” For example, popular differential analyses include the identification of differentially expressed genes (DEGs) from genome-wide gene expression data (e.g., microarray and RNA sequencing (RNA-seq) data [4–10]) and differentially interacting chromatin regions (DIRs) from Hi-C data [11–13] (Fig. 1a). In most scientific research, the interesting features only constitute a small proportion of all features, and the remaining majority is referred to as “uninteresting features.”
The identified features, also called the “discoveries” from enrichment or differential analysis, are subject to further investigation and validation. Hence, to reduce experimental validation that is often laborious or expensive, researchers demand reliable discoveries that contain few false discoveries. Accordingly, the false discovery rate (FDR) [14] has been developed as a statistical criterion for ensuring discoveries’ reliability. Technically, under the frequentist statistical paradigm, the FDR is defined as the expected proportion of uninteresting features among the discoveries. In parallel, under the Bayesian statistical paradigm, other criteria have been developed, including the Bayesian false discovery rate [15], the local false discovery rate (local fdr) [16], and the local false sign rate [17]. Among all these frequentist and Bayesian criteria, the FDR is the dominant criterion for setting thresholds in biological data analysis [1, 10, 18–24] and is thus the focus of this paper.
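The frequentist FDR can be made concrete with a few lines of code. Below is a minimal sketch (not from the paper) that computes the false discovery proportion (FDP) for one experiment; the FDR is the expectation of this proportion over repeated experiments. The feature labels are hypothetical.

```python
import numpy as np

def false_discovery_proportion(discoveries, uninteresting):
    """FDP = (# uninteresting features among discoveries) / max(# discoveries, 1).

    The FDR is the expectation of the FDP over repeated experiments.
    """
    discoveries = np.asarray(discoveries)
    if discoveries.size == 0:
        return 0.0
    n_false = np.isin(discoveries, np.asarray(list(uninteresting))).sum()
    return n_false / discoveries.size

# Hypothetical example: features 0-9, of which features 0-7 are uninteresting.
fdp = false_discovery_proportion(discoveries=[2, 8, 9], uninteresting=range(8))
```

Here one of the three discoveries (feature 2) is uninteresting, so the FDP is 1/3; FDR control aims to keep the expectation of this quantity below a threshold such as 0.05.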
FDR control refers to the goal of finding discoveries such that the FDR is under a pre-specified threshold (e.g., 0.05). Existing computational methods for FDR control primarily rely on p-values, one per feature. Among the p-value-based FDR control methods, the most classic and popular ones are the Benjamini–Hochberg (BH) procedure [14] and Storey’s q-value [25]; later development introduced methods that incorporate feature weights [26] and/or covariates—e.g., independent hypothesis weighting (IHW) [27], adaptive p-value thresholding [28], and Boca and Leek’s FDR regression [29]—to boost the detection power. All these methods set a p-value cutoff based on the pre-specified FDR threshold. However, the calculation of p-values requires either distributional assumptions, which are often questionable, or large numbers of replicates, which are often unachievable in biological studies (see the “Results” section). Due to the difficulty of valid p-value calculation in high-throughput biological data analysis, bioinformatics tools often output ill-posed p-values. This issue is evidenced by serious concerns about the widespread miscalculation and misuse of p-values in the scientific community [30]. As a result, bioinformatics tools using questionable p-values either cannot reliably control the FDR to a target level [23] or lack power to make discoveries [31]; see the “Results” section. Therefore, p-value-free control of the FDR is desirable, as it would make data analysis more transparent and thus improve the reproducibility of scientific research.
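As a reference point for the p-value-based methods discussed here, the BH procedure itself is short: sort the m p-values, find the largest k with p_(k) ≤ kq/m, and reject the k smallest. A sketch in Python (the p-values are hypothetical):

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg procedure: return indices of rejected hypotheses.

    Sort p-values ascendingly, find the largest k with p_(k) <= k*q/m,
    and reject the hypotheses with the k smallest p-values.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0]) + 1   # largest k satisfying the bound
    return np.sort(order[:k])

# Toy example: the three smallest p-values are rejected at q = 0.05.
rejected = benjamini_hochberg([0.001, 0.008, 0.012, 0.041, 0.6, 0.9], q=0.05)
```

Note that the validity of the output hinges entirely on the input p-values being well calibrated, which is exactly the assumption questioned in the text above.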
Although p-value-free FDR control has been implemented in the MACS2 method for ChIP-seq peak calling [1] and the SAM method for microarray DEG identification [32], these two methods are restricted to specific applications and lack a theoretical guarantee for FDR control. (Although later works have studied some theoretical properties of SAM, they are not about the exact control of the FDR [33, 34].) More recently, the Barber–Candès (BC) procedure has been proposed to achieve theoretical FDR control without using p-values [35], and it has been shown to perform comparably to the BH procedure using well-calibrated p-values [36]. The BC procedure is advantageous because it does not require well-calibrated p-values, so it holds tremendous potential in various high-throughput data analyses where p-value calibration is challenging [37]. For example, a recent paper has implemented a generalization of the BC procedure to control the FDR in peptide identification from MS data [38].
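The BC procedure can be sketched as follows. Given per-feature scores that are symmetric about zero for uninteresting features, it scans candidate cutoffs t and picks the smallest one at which a conservative FDR estimate, (1 + #{score ≤ −t}) / max(#{score ≥ t}, 1), drops below the target q. This is a simplified illustration, not the exact implementation in [35]:

```python
import numpy as np

def barber_candes(scores, q=0.05):
    """Barber-Candès-style selection without p-values.

    Assumes uninteresting features have scores symmetric about 0, so the
    count of scores <= -t estimates the number of false discoveries among
    scores >= t. Returns the indices of discovered features.
    """
    w = np.asarray(scores, dtype=float)
    # Candidate cutoffs: the observed absolute scores, from small to large.
    for t in np.sort(np.abs(w[w != 0])):
        fdr_hat = (1 + np.sum(w <= -t)) / max(np.sum(w >= t), 1)
        if fdr_hat <= q:
            return np.nonzero(w >= t)[0]
    return np.array([], dtype=int)   # no cutoff meets the target FDR

# Toy scores: mostly positive (interesting-looking), a few negative.
discoveries = barber_candes([5, 4, 3, 2.5, 2.0, 1.5, -0.5, 0.2, -0.1, 0.4], q=0.3)
```

The "+1" in the numerator is what makes the estimate conservative enough for a finite-sample FDR guarantee.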
Inspired by the BC procedure, we propose Clipper, a general statistical framework that provides reliable FDR control for high-throughput biological data analysis without using p-values or relying on specific data distributions. Clipper is a robust and flexible framework that applies to both enrichment and differential analyses and that works for high-throughput data with various characteristics, including diverse data distributions, replicate numbers (from one to multiple), and the existence of outliers.
Results
Clipper consists of two main steps: construction and thresholding of contrast scores. First, Clipper defines a contrast score for each feature, as a replacement of a p-value, to summarize that feature’s measurements between two conditions and to describe the degree of interestingness of that feature. Second, as its name suggests, Clipper establishes a cutoff on features’ contrast scores and calls as discoveries the features whose contrast scores exceed the cutoff (see the “Methods” section and Additional File 1: Section S2). Clipper is a flexible framework that only requires a minimal input: all features’ measurements under two conditions and a target FDR threshold (e.g., 5%) (Fig. 1b).
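To make the two steps concrete, here is an illustrative 1vs1 enrichment example using one simple choice of contrast score, the difference between the experimental and background measurements, followed by BC-style thresholding. This is a toy sketch on simulated Gaussian data, not Clipper's exact contrast-score construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1vs1 enrichment data: 2000 features, the first 100 enriched.
background = rng.normal(0.0, 1.0, 2000)
experimental = rng.normal(0.0, 1.0, 2000)
experimental[:100] += 6.0        # enriched features get a large signal

# Step 1: a simple contrast score, experimental minus background measurement.
# For uninteresting features it is symmetric about 0, which is what the
# BC-style thresholding below relies on.
contrast = experimental - background

# Step 2: smallest cutoff t with (1 + #{score <= -t}) / max(#{score >= t}, 1) <= q.
def contrast_cutoff(scores, q=0.05):
    for t in np.sort(np.abs(scores[scores != 0])):
        if (1 + np.sum(scores <= -t)) / max(np.sum(scores >= t), 1) <= q:
            return t
    return np.inf

t = contrast_cutoff(contrast, q=0.05)
discoveries = np.nonzero(contrast >= t)[0]
```

On this simulated data, most discoveries fall among the first 100 (truly enriched) features, and the realized FDP stays near the 5% target.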
Clipper only relies on two fundamental statistical assumptions of biological data analysis: (1) measurement errors (i.e., differences between measurements and their expectations, with the expectations including both biological signals and batch effects) are independent across all features and replicates and (2) every uninteresting feature has measurement errors identically distributed across all replicates under both conditions. These two assumptions are used in almost all bioinformatics tools and are commonly referred to as the “measurement model” in statistical genomics [39]. With these two assumptions, Clipper has a theoretical guarantee for FDR control under both enrichment and differential analyses with any number of replicates (see the “Methods” section and Additional File 1: Section S2).
To verify Clipper’s performance, we designed comprehensive simulation studies to benchmark Clipper against existing generic FDR control methods (Additional File 1: Section S1). We also benchmarked Clipper against bioinformatics tools in studies including peak calling from ChIP-seq data, peptide identification from mass spectrometry data, DEG identification from bulk and single-cell RNA-seq data, and DIR identification from Hi-C data. Notably, our benchmarking results for peptide identification are based on our in-house data, the first MS data standard with a realistic dynamic range.
Clipper has verified FDR control and power advantage in simulation
Simulation is essential because we can generate numerous datasets from the same distribution with known truths to calculate the FDR, which is not observable from real data. Our simulation covers both enrichment and differential analyses. In enrichment analysis, we consider four “experimental designs”: 1vs1 design (one replicate per condition), 2vs1 design (two and one replicates under the experimental and background conditions, respectively), 3vs3 design (three replicates per condition), and 10vs10 design (ten replicates per condition). In differential analysis, since Clipper requires that at least one condition has two replicates, we only consider the 2vs1 and 3vs3 designs. For each analysis and design, we simulated data from three “distributional families”—Gaussian, Poisson, and negative binomial—for individual features under two “background scenarios” (i.e., scenarios of the background condition): homogeneous and heterogeneous. Under the homogeneous scenario, all features’ measurements follow the same distribution under the background condition; otherwise, we are under the heterogeneous scenario, which is ubiquitous in applications, e.g., identifying DEGs from RNA-seq data and calling protein-binding sites from ChIP-seq data. By simulation setting, we refer to a combination of an experimental design, a distributional family, and a background scenario. The details of simulation settings are described in Additional File 1: Section S4.
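As an illustration of the two background scenarios, the following sketch generates Poisson data (one of the three distributional families) under a homogeneous background, where all features share one background mean, and under a heterogeneous background, where each feature has its own mean; the specific parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_reps = 1000, 3

# Homogeneous background: every feature shares one background distribution.
homo_bg = rng.poisson(lam=5.0, size=(n_features, n_reps))

# Heterogeneous background: feature-specific background means (ubiquitous in
# practice, e.g., gene-specific expression levels in RNA-seq data).
feature_means = rng.uniform(1.0, 50.0, size=n_features)
hetero_bg = rng.poisson(lam=feature_means[:, None], size=(n_features, n_reps))

# Experimental condition: the first 10% of features are enriched (3x mean).
lam_exp = feature_means.copy()
lam_exp[: n_features // 10] *= 3.0
experimental = rng.poisson(lam=lam_exp[:, None], size=(n_features, n_reps))
```

Pooled p-value approaches implicitly assume the homogeneous scenario, which is why their behavior differs so sharply between the two backgrounds in the comparisons below.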
For both enrichment and differential analyses and each simulation setting, we compared Clipper against generic FDR control methods, including p-value-based methods and local-fdr-based methods. The p-value-based methods include BH-pair, BH-pool, qvalue-pair, and qvalue-pool, where “BH” and “qvalue” stand for the p-value thresholding procedures, and “pair” and “pool” represent the paired and pooled p-value calculation approaches, respectively. The local-fdr-based methods include locfdr-emp and locfdr-swap, where “emp” and “swap” represent the empirical null and swapping null local-fdr calculation approaches, respectively. See the “Methods” section for details.
The comparison results are shown in Fig. 2 and Additional File 1: Figures S1–S14. A good FDR control method should have its actual FDR no larger than the target FDR threshold and achieve high power. The results show that Clipper controls the FDR and is overall more powerful than the other methods, excluding those that fail to control the FDR, under all settings. Clipper is also shown to be more robust to the number of features and the existence of outliers than the other methods. In detail, in both enrichment analyses (1vs1, 2vs1, 3vs3, and 10vs10 designs) and differential analyses (2vs1 and 3vs3 designs), Clipper consistently controls the FDR, and it is more powerful than the generic methods in most cases under the realistic, heterogeneous background, where features do not follow the same distribution under the background condition. Under the idealistic, homogeneous background, Clipper is still powerful and only second to BHpool and qvaluepool, which, however, cannot control the FDR under the heterogeneous background.
Here we summarize the performance of the generic FDR control methods. First, the two p-value-based methods using the pooled approach—BH-pool and qvalue-pool—are the most powerful under the idealistic, homogeneous background, which is their inherent assumption; however, they cannot control the FDR under the heterogeneous background (Fig. 2b). Besides, they cannot control the FDR when the number of features is small (Fig. 2a and Additional File 1: Figure S1). These results show that the validity of BH-pool and qvalue-pool requires a large number of features and the homogeneous background assumption, two requirements that rarely both hold in biological applications.
Second, the four p-value-based methods using the paired approach with misspecified models or misformulated tests (BH-pair-mis, qvalue-pair-mis, BH-pair-2as1, and qvalue-pair-2as1; see the “Methods” section) fail to control the FDR by a large margin in most cases, and in the rare cases where they do control the FDR, they lack power (Fig. 2c, d and Additional File 1: Figures S1–S8). These results confirm that BH-pair and qvalue-pair rely on correct model specification to control the FDR; however, correct model specification is hardly achievable with no more than three replicates per condition.
Third, even when models are correctly specified (an idealistic scenario), the p-value-based methods that use the paired approach—BH-pair-correct and qvalue-pair-correct (see the “Methods” section)—fail to control the FDR in the presence of outliers (Fig. 2e and Additional File 1: Figures S3 and S7) or for the negative binomial distribution with unknown dispersion (Fig. 2f and Additional File 1: Figure S9). It is worth noting that even when they control the FDR, they are less powerful than Clipper in most cases, except for the 3vs3 differential analysis with the Poisson distribution (Fig. 2d and Additional File 1: Figures S4 and S8).
Fourth, the two local-fdr-based methods—locfdr-emp and locfdr-swap—achieve FDR control for all designs and analyses; however, they are less powerful than Clipper in most cases (Additional File 1: Figures S1–S4).
Fifth, when the numbers of replicates are large (10vs10 design), nonparametric tests become applicable. We compared Clipper with three BH-pair methods that use different statistical tests: BH-pair-Wilcoxon (the nonparametric Wilcoxon rank-sum test), BH-pair-permutation (the nonparametric permutation test), and BH-pair-parametric (the parametric test based on the correct model specification, equivalent to BH-pair-correct). Although all three methods control the FDR, they are less powerful than Clipper (Additional File 1: Figure S10).
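For intuition, the two nonparametric tests in the 10vs10 comparison can be run with SciPy on one feature's replicates; the data below are simulated, and the statistic used in the permutation test (difference in means) is one common choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, 10)   # experimental replicates (10vs10 design)
y = rng.normal(0.0, 1.0, 10)   # background replicates

# Nonparametric Wilcoxon rank-sum test, two-sided.
p_wilcoxon = stats.ranksums(x, y).pvalue

# Nonparametric permutation test on the difference in sample means.
res = stats.permutation_test(
    (x, y),
    statistic=lambda a, b: np.mean(a) - np.mean(b),
    alternative="two-sided",
    n_resamples=2000,
    random_state=0,
)
p_perm = res.pvalue
```

Both tests avoid parametric assumptions, but with only a handful of replicates per condition (as in the 2vs1 and 3vs3 designs) their p-values are too coarse to be useful, which is why they enter the comparison only at 10vs10.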
Moreover, the above five phenomena are consistently observed across the three distributions (Gaussian, Poisson, and negative binomial) that we have examined, further confirming the robustness of Clipper.
In addition, for the 3vs3 enrichment analysis, we also varied the proportion of interesting features as 10%, 20%, and 40%. The comparison results in Additional File 1: Figure S3 (columns 1 and 3 for 10%) and Additional File 1: Figure S11 (for 20% and 40%) show that the performance of Clipper is robust to the proportion of interesting features.
The above results are all based on simulations with independent features. To examine the robustness of Clipper, we introduced feature correlations to our simulated data, on which we compared Clipper with the other generic FDR control methods. The comparison results in Additional File 1: Figure S12 show that even when the feature independence assumption is violated, Clipper still demonstrates strong performance in both FDR control and power.
Clipper has broad applications in omics data analyses
We then demonstrate the use of Clipper in four omics data applications: peak calling from ChIP-seq data, peptide identification from MS data, DEG identification from bulk or single-cell RNA-seq data, and DIR identification from Hi-C data. The first two applications are enrichment analyses, and the last two are differential analyses. In each application, we compared Clipper with mainstream bioinformatics methods to demonstrate Clipper’s superiority in FDR control and detection power.
Peak calling from ChIP-seq data (enrichment analysis I)
ChIP-seq is a genome-wide experimental assay for measuring binding intensities of a DNA-associated protein [40], often a transcription factor that activates or represses gene expression [41, 42]. ChIP-seq data are crucial for studying gene expression regulation, and an indispensable analysis is to identify genomic regions with enriched sequence reads in ChIP-seq data. These regions are likely bound by the target protein and thus of biological interest. The identification of these regions is termed “peak calling” in ChIP-seq data analysis.
As the identified peaks are subject to experimental validation that is often expensive [43], it is essential to control the FDR of peak identification to reduce unnecessary costs. The two most highly cited peak-calling methods are MACS2 [1] and HOMER [2], both of which claim to control the FDR for their identified peaks. Specifically, both MACS2 and HOMER assume that the read counts for each putative peak (with one count per sample/replicate) follow the Poisson distribution, and they use modified paired approaches to assign each putative peak a p-value and a corresponding Storey’s q-value. Then, given a target FDR threshold 0 < q < 1, they call the putative peaks with q-values ≤ q as identified peaks. Despite being popular, MACS2 and HOMER have not been verified for their FDR control, to our knowledge.
To verify the FDR control of MACS2 and HOMER (Additional File 1: Section S5.1), we used ENCODE ChIP-seq data of the cell line GM12878 [44] and ChIPulate [45], a ChIP-seq data simulator, to generate semi-synthetic data with spiked-in peaks (Additional File 1: Section S6.1). We examined the actual FDR and power of MACS2 and HOMER in a range of target FDR thresholds: q = 1%, 2%, …, 10%. Figure 3a shows that MACS2 and HOMER cannot control the FDR as stand-alone peak-calling methods. However, with Clipper as an add-on (Additional File 1: Section S7.1), both MACS2 and HOMER can guarantee FDR control. This result demonstrates the flexibility and usability of Clipper for reducing false discoveries in peak-calling analysis.
Technically, the failed FDR control by MACS2 and HOMER is attributable to the likely model misspecification and test misformulation in their use of the paired approach. Both MACS2 and HOMER assume the Poisson distribution for read counts in a putative peak; however, it has been widely acknowledged that read counts are overdispersed and thus better modeled by the negative binomial distribution [46]. Besides, MACS2 uses one-sample tests to compute p-values when two-sample tests should be performed. As a result, the p-values of MACS2 and HOMER are questionable, so using their p-values for FDR control has no guaranteed success. (Note that MACS2 does not use p-values to control the FDR but instead swaps experimental and background samples to calculate the empirical FDR; yet, we emphasize that controlling the empirical FDR does not guarantee the FDR control.) As a remedy, Clipper strengthens both methods to control the FDR while maintaining high power.
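The effect of ignoring overdispersion is easy to demonstrate numerically: for the same observed read count, a Poisson upper-tail p-value is smaller than a negative binomial one with the same mean but larger variance. The count values and dispersion below are hypothetical:

```python
from scipy import stats

# Hypothetical putative peak: background mean of 20 reads, 35 reads observed.
lam, observed = 20.0, 35

# Poisson upper-tail p-value, i.e., P(count >= observed) under mean lam.
p_pois = stats.poisson.sf(observed - 1, mu=lam)

# Negative binomial with the same mean but overdispersion:
# variance = lam + lam^2 / size (the dispersion value "size" is assumed).
size = 5.0
p_nb = stats.nbinom.sf(observed - 1, n=size, p=size / (size + lam))

# p_pois < p_nb: the Poisson model overstates significance for overdispersed
# counts, one reason p-value-based FDR control can fail in peak calling.
```

With these numbers the Poisson tail probability is an order of magnitude smaller than the negative binomial one, illustrating how a misspecified Poisson model floods the analysis with overly small p-values.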
As a side note, it is known that uninteresting regions tend to have larger read counts in the control sample than in the experimental (ChIP) sample; however, this phenomenon does not violate Clipper’s theoretical assumption for FDR control (Additional File 1: Section S7.1).
Peptide identification from MS data (enrichment analysis II)
State-of-the-art proteomics studies use MS experiments and database search algorithms to identify and quantify proteins in biological samples. In a typical proteomics experiment, a protein mixture sample is first digested into peptides and then measured by tandem MS technology as mass spectra, which encode peptide sequence information. “Peptide identification” is the process that decodes mass spectra, converting them into peptide sequences in a protein sequence database via search algorithms. The search process matches each mass spectrum to peptide sequences in the database and outputs the best match, called a “peptide-spectrum match” (PSM). The identified PSMs are used to infer and quantify proteins in a high-throughput manner.
False PSMs could occur when mass spectra are matched to wrong peptide sequences due to issues such as low-quality spectra, data-processing errors, and incomplete protein databases, causing problems in the downstream protein identification and quantification [47]. Therefore, a common goal of database search algorithms is to simultaneously control the FDR and maximize the number of identified PSMs, so as to maximize the number of proteins identified in a proteomics study [3, 48, 49]. A widely used FDR control strategy is the target-decoy search, where mass spectra of interest are matched to peptide sequences in both the original (target) database and a decoy database that contains artificial false protein sequences. The resulting PSMs are called the target PSMs and decoy PSMs, respectively. The decoy PSMs, i.e., matched mass spectrum and decoy peptide pairs, are known to be false and are thus used by database search algorithms to control the FDR. Mainstream database search algorithms output a q-value for each target or decoy PSM. Discoveries are the target PSMs whose q-values are no greater than the target FDR threshold q.
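A minimal sketch of target-decoy FDR estimation follows: at each score cutoff, the number of decoy PSMs passing the cutoff estimates the number of false target PSMs, and the per-cutoff estimates are converted into monotone q-values. Real search engines use refined variants (e.g., a +1 pseudocount in the numerator); this is a simplified illustration with hypothetical scores:

```python
import numpy as np

def target_decoy_qvalues(target_scores, decoy_scores):
    """Sketch of target-decoy FDR estimation.

    At a score cutoff s, the estimated FDR is
        #{decoy PSMs with score >= s} / max(#{target PSMs with score >= s}, 1),
    and each target PSM's q-value is the smallest estimated FDR over all
    cutoffs at or below its own score.
    """
    target = np.asarray(target_scores, dtype=float)
    decoy = np.asarray(decoy_scores, dtype=float)
    order = np.argsort(-target)                 # targets from best to worst
    fdr = np.empty(target.size)
    for rank, s in enumerate(target[order]):
        # rank + 1 targets pass the cutoff s (assumes untied target scores)
        fdr[rank] = np.sum(decoy >= s) / (rank + 1)
    q_sorted = np.minimum.accumulate(fdr[::-1])[::-1]   # enforce monotonicity
    qvals = np.empty_like(q_sorted)
    qvals[order] = q_sorted
    return qvals

# Hypothetical scores: four target PSMs and two decoy PSMs.
qvals = target_decoy_qvalues([10, 8, 6, 4], [7, 3])
```

At a threshold q = 5%, only the two top-scoring target PSMs here would be discoveries, since the lone decoy with score 7 inflates the estimated FDR below that score.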
We generated the first comprehensive benchmark dataset from an archaeal species, Pyrococcus furiosus, and we used it to examine the FDR control and power of a popular database search algorithm, SEQUEST [3] (Additional File 1: Section S5.2). Using this benchmark dataset (Additional File 1: Section S6.2), we demonstrate that, as an add-on, Clipper improves the power of SEQUEST. Specifically, Clipper treats mass spectra as features. For each mass spectrum, Clipper considers the experimental (or background) measurement as the −log10-transformed SEQUEST q-value of the target (or decoy) PSM that includes the mass spectrum. Then Clipper decides which mass spectra and their corresponding target PSMs are discoveries (Additional File 1: Section S7.2). Based on the benchmark dataset, we examined the empirical FDR, i.e., the FDP calculated based on the true positives and negatives, and the power of SEQUEST with or without Clipper as an add-on, for a range of target FDR thresholds: q = 1%, 2%, …, 10%. Figure 3b shows that although SEQUEST and SEQUEST+Clipper both control the FDR, SEQUEST+Clipper consistently improves the power, thus enhancing the peptide identification efficiency of proteomics experiments.
While preparing this manuscript, we found a recent work [38] that used a similar idea to identify PSMs without using p-values. Clipper differs from this work in two aspects: (1) Clipper is directly applicable as an add-on to any existing database search algorithms that output q-values, and (2) Clipper is not restricted to the peptide identification application.
DEG identification from bulk RNA-seq data (differential analysis I)
RNA-seq data measure genome-wide gene expression levels in biological samples. An important use of RNA-seq data is the DEG analysis, which aims to discover genes whose expression levels change between two conditions. The FDR is a widely used criterion in DEG analysis [4–9].
We compared Clipper with two popular DEG identification methods: edgeR [4] and DESeq2 [5] (Additional File 1: Section S5.3). Specifically, when we implemented Clipper, we first performed the trimmed mean of M values (TMM) normalization [50] to correct for batch effects; then, we treated genes as features and their normalized expression levels as measurements under two conditions (Additional File 1: Section S7.3). We also implemented two versions of DESeq2 and edgeR: with or without IHW, a popular procedure for boosting the power of p-value-based FDR control methods by incorporating feature covariates [27]. In our implementation of the two versions of DESeq2 and edgeR, we used their standard pipelines, including normalization, model fitting, and gene filtering (edgeR only). To verify the FDR control, we generated four realistic semi-synthetic datasets from two real RNA-seq datasets—one from classical and non-classical human monocytes [51] and the other from yeasts with or without snf2 knockout [52]—using simulation strategies 1 and 2 (Additional File 1: Section S6.3).
In detail, in simulation strategy 1, we used bulk RNA-seq samples from two conditions to compute a fold change for every gene between the two conditions; then, we defined true DEGs as the genes whose fold changes exceeded a threshold; next, we randomly drew three RNA-seq samples and treated them as replicates from each condition (m = n = 3 as in the “Methods” section); using those subsampled replicates of two conditions, we preserved the true DEGs’ read counts and permuted the read counts of the true non-DEGs, i.e., the genes other than true DEGs, between conditions. In summary, simulation strategy 1 guarantees that the measurements of true non-DEGs are i.i.d., an assumption that Clipper relies on for theoretical FDR control.
In simulation strategy 2, which we borrowed from a benchmark study [53], we first randomly selected at most 30% genes as true DEGs; next, we randomly drew six RNA-seq samples from one condition (classical human monocytes and yeasts without knockout) and split the samples into two “synthetic conditions,” each with three replicates (m = n = 3 as in the “Methods” section); then for each true DEG, we multiplied its read counts under one of the two synthetic conditions (randomly and independently picked for each gene) by a randomly generated fold change (see Additional File 1: Section S6.3); finally, for the true non-DEGs, we preserved their read counts in the six samples. In summary, simulation strategy 2 preserves batch effects, if existent in real data, for the true non-DEGs (the majority of genes). As a result, the semi-synthetic data generated under strategy 2 may violate the Clipper assumption for theoretical FDR control and thus can help evaluate the robustness of Clipper on real data.
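The core of simulation strategy 2 can be sketched as follows: split samples from one condition into two synthetic conditions, then multiply each true DEG's counts in one randomly picked synthetic condition by a random fold change, leaving non-DEG counts untouched. The count distribution and fold-change range below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n_genes = 2000

# Hypothetical count matrix: 2000 genes x 6 samples from one condition
# (negative binomial counts are an assumption for illustration).
counts = rng.negative_binomial(n=10, p=0.3, size=(n_genes, 6)).astype(float)

# Split the six samples into two synthetic conditions of three replicates.
semi1, semi2 = counts[:, :3].copy(), counts[:, 3:].copy()

# Randomly pick ~20% of genes as true DEGs; for each, multiply its counts in
# one randomly chosen synthetic condition by a random fold change (>= 2 here).
is_deg = rng.random(n_genes) < 0.2
fold = rng.uniform(2.0, 4.0, size=n_genes)
pick_cond1 = rng.random(n_genes) < 0.5

mask1 = is_deg & pick_cond1
mask2 = is_deg & ~pick_cond1
semi1[mask1] *= fold[mask1, None]
semi2[mask2] *= fold[mask2, None]
# Non-DEG counts are preserved, so any batch effects in them carry over.
```

Because the non-DEG rows are copied verbatim from real samples, any correlation or batch structure among them survives, which is exactly what makes this strategy a stress test of Clipper's i.i.d. assumption.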
The four semi-synthetic datasets have ground truths (true DEGs and non-DEGs) to evaluate each DEG identification method’s FDR and power for a range of target FDR thresholds: q = 1%, 2%, …, 10%. Our results in Fig. 4a and Additional File 1: Figures S15a–S17a show that Clipper consistently controls the FDR and achieves high power on all four semi-synthetic datasets. In contrast, DESeq2 and edgeR cannot consistently control the FDR except for the yeast semi-synthetic dataset generated under simulation strategy 2. Given the fact that DESeq2 and edgeR do not consistently perform well on the three other semi-synthetic datasets, we hypothesize that their parametric distributional assumptions, if violated on real data, hinder valid FDR control (see our other study [54]), in line with our motivation for developing Clipper. By examining whether true non-DEGs’ p-values calculated by DESeq2 or edgeR follow the theoretical Uniform[0,1] distribution, we find that the answer is no for many non-DEGs, as indicated by the small p-values (one per non-DEG) of uniformity tests (Additional File 1: Figure S18); this issue is more serious for DESeq2, consistent with the worse FDR control of DESeq2 (Fig. 4a and Additional File 1: Figures S15a–S17a). Furthermore, we observe that adding IHW to edgeR and DESeq2 has negligible effects on the four semi-synthetic datasets.
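The uniformity check described above can be illustrated with a Kolmogorov–Smirnov test: null (non-DEG) p-values from a well-calibrated test should be Uniform[0,1], while miscalibrated p-values enriched near zero are flagged. The two p-value samples below are simulated for illustration; the paper's actual uniformity tests may differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Under a correctly specified test, null (non-DEG) p-values are Uniform[0,1].
good_null_p = rng.uniform(0.0, 1.0, 500)

# A miscalibrated test: null p-values enriched near 0 (Beta(0.5, 1) draws).
bad_null_p = rng.beta(0.5, 1.0, 500)

# Kolmogorov-Smirnov tests against the Uniform[0,1] distribution.
p_good = stats.kstest(good_null_p, "uniform").pvalue
p_bad = stats.kstest(bad_null_p, "uniform").pvalue
# p_bad is tiny, flagging the miscalibration that breaks FDR control.
```

A null p-value distribution with excess mass near zero, as in the second sample, is precisely the pattern that turns non-DEGs into false discoveries under BH-style thresholding.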
To further explain why DESeq2 fails to control the FDR, we examined the p-value distributions of 16 non-DEGs that were most frequently identified (from the 100 semi-synthetic datasets generated from the human monocyte dataset using simulation strategy 1) by DESeq2 at the target FDR threshold q = 0.05. Our results in Additional File 1: Figure S19 show that the 16 non-DEGs’ p-values are non-uniformly distributed with a mode close to 0. Such unusual enrichment of overly small p-values makes these non-DEGs mistakenly called discoveries by DESeq2.
In addition, we compared the DEG ranking by Clipper, edgeR, and DESeq2 in two ways. First, for true DEGs, we compared their ranking by each method with their true ranking based on true expression fold changes (from large to small, as in semi-synthetic data generation in Additional File 1: Section S6.3). Specifically, we ranked true DEGs using Clipper’s contrast scores (from large to small), edgeR’s p-values (from small to large), or DESeq2’s p-values (from small to large). Our results in Fig. 4b and Additional File 1: Figures S15b–S17b show that Clipper’s contrast scores exhibit the most consistent ranking with the ranking based on true fold changes. Second, to compare the power of Clipper, edgeR, and DESeq2 based on their DEG rankings instead of nominal p-values, we calculated their power under the actual FDRs, which only depend on gene rankings (for the definition of actual FDR, see Additional File 1: Section S6.3). Figure 4a and Additional File 1: Figures S15a–S17a show that, when Clipper, edgeR, and DESeq2 have the same actual FDR, Clipper consistently outperforms edgeR and DESeq2 in terms of power, i.e., Clipper has the largest number of true DEGs among its top-ranked genes.
We also compared the reproducibility of Clipper, edgeR, and DESeq2 in the presence of sampling randomness. Specifically, we used two semi-synthetic datasets (generated independently from the same procedure in Additional File 1: Section S6.3) as technical replicates and computed Clipper’s contrast scores, edgeR’s p-values, and DESeq2’s p-values on each dataset. For each method, we evaluated its reproducibility between the two semi-synthetic datasets by computing three criteria—the irreproducible discovery rate (IDR) [55], the Pearson correlation, and the Spearman correlation—using its contrast scores or −log10-transformed p-values. Figure 4c and Additional File 1: Figures S15c–S17c show that Clipper’s contrast scores have higher reproducibility by all three criteria compared to edgeR’s and DESeq2’s p-values.
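Two of the three reproducibility criteria, the Pearson and Spearman correlations, are straightforward to compute on per-gene scores from two technical replicates; the scores below are simulated as a shared signal plus independent noise (the IDR requires a dedicated implementation and is omitted here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical per-gene scores from two technical-replicate datasets:
# a shared signal plus independent noise (true correlation ~0.8).
signal = rng.normal(0.0, 2.0, 1000)
scores_rep1 = signal + rng.normal(0.0, 1.0, 1000)
scores_rep2 = signal + rng.normal(0.0, 1.0, 1000)

pearson, _ = stats.pearsonr(scores_rep1, scores_rep2)
spearman, _ = stats.spearmanr(scores_rep1, scores_rep2)
# Higher correlations indicate more reproducible gene rankings across replicates.
```

The Spearman correlation depends only on ranks, so it directly measures whether a method orders genes consistently across replicates, the property that matters for threshold-based discovery.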
Finally, we compared Clipper with DESeq2 and edgeR on the real RNA-seq data of classical and non-classical human monocytes [51]. In this dataset, gene expression changes are expected to be associated with the immune response process. We input three classical and three non-classical samples into Clipper, DESeq2, and edgeR for DEG identification. Figure 5a shows that edgeR identifies the fewest DEGs, while DESeq2 identifies the most DEGs, followed by Clipper. Notably, most DEGs identified by DESeq2 are not identified by Clipper or edgeR. To investigate whether DESeq2 makes too many false discoveries and whether Clipper finds biologically meaningful DEGs missed by DESeq2 or edgeR, we performed functional analysis on the set of DEGs identified by each method. We first performed gene ontology (GO) analysis on the three sets of identified DEGs using the R package clusterProfiler [56]. Figure 5b (“Total”) shows that more GO terms are enriched (with enrichment q-values ≤ 0.01) in the DEGs identified by Clipper than in the DEGs identified by DESeq2 or edgeR. For the GO terms enriched in all three sets of identified DEGs, Fig. 5c shows that they are all related to the immune response and thus biologically meaningful. Notably, these biologically meaningful GO terms have more significant enrichment in Clipper’s identified DEGs than in edgeR’s and DESeq2’s identified DEGs. We further performed GO analysis on the DEGs uniquely identified by one method in pairwise comparisons of Clipper vs. DESeq2 and Clipper vs. edgeR. Figure 5b and Additional File 1: Figure S20 show that multiple immune-related GO terms are enriched in Clipper-specific DEGs, while no GO terms are enriched in edgeR-specific or DESeq2-specific DEGs. In addition, we examined the DEGs that were identified by Clipper only but missed by both edgeR and DESeq2. Figure 5d and Additional File 2 show that these genes include multiple key immune-related genes, such as CD36, DUSP2, and TNFAIP3.
We further performed pathway analysis on these genes and on the DEGs that were identified by DESeq2 only but missed by both edgeR and Clipper, using the R package limma [10]. Additional File 1: Figure S21a shows that the DEGs identified only by Clipper have significant enrichment for immune-related pathways, including phagosome, a key function of monocytes and macrophages. On the contrary, Additional File 1: Figure S21b shows that fewer immune-related pathways are enriched in the DEGs identified only by DESeq2. Altogether, these results confirm the capacity of Clipper in real-data DEG analysis, and they are consistent with our simulation results that edgeR lacks power, while DESeq2 fails to control the FDR.
DEG identification from singlecell RNAseq data (differential analysis II)
Singlecell RNA sequencing (scRNAseq) technologies have revolutionized biomedical sciences by enabling genomewide profiling of gene expression levels at an unprecedented singlecell resolution. DEG analysis is widely applied to scRNAseq data for discovering genes whose expression levels change between two conditions or between two cell types. Compared with bulk RNAseq data, scRNAseq data have many more “replicates” (i.e., cells, whose number is often in hundreds) under each condition or within each cell type.
We compared Clipper (Additional File 1: Section S7.4) with edgeR [4], MAST [57], Monocle3 [58], the twosample t test, and the Wilcoxon ranksum test (Additional File 1: Section S5.4), five methods that are either popular or reported to have comparatively top performance in a previous benchmark study [59]. To verify the FDR control, we used scDesign2, a flexible probabilistic simulator, to generate scRNAseq count data with known true DEGs [60]. scDesign2 offers three key advantages that enable the generation of realistic semisynthetic scRNAseq count data: (1) it captures distinct marginal distributions of different genes, (2) it preserves genegene correlations, and (3) it adapts to various scRNAseq protocols. Using scDesign2, we generated two semisynthetic scRNAseq datasets from two real scRNAseq datasets of peripheral blood mononuclear cells (PBMCs) [61]: one using 10x Genomics [62] and the other using Dropseq [63]. Each semisynthetic dataset contains two cell types, CD4+ T cells and cytotoxic T cells, which we treated as two conditions (Additional File 1: Section S6.4). With the true DEGs known, the semisynthetic datasets allow us to evaluate the FDRs and power of Clipper and the other five methods for a range of target FDR thresholds: q=1%,2%,…,10%. Figure 3c and Additional File 1: Figure S22 show that on both the 10x Genomics and Dropseq semisynthetic datasets, Clipper consistently controls the FDR and remains the most powerful (except for q=1% and 2%) among all the methods that achieve FDR control. These results demonstrate Clipper’s robust performance in scRNAseq DEG analysis.
DIR analysis of HiC data (differential analysis III)
HiC experiments are widely used to investigate spatial organizations of chromosomes and to map chromatin interactions across the genome. A HiC dataset is often processed and summarized into an interaction matrix, whose rows and columns represent manually binned chromosomal regions and whose (i,j)th entry represents the measured contact intensity between the ith and jth binned regions. The DIR analysis aims to identify pairs of genomic regions whose contact intensities differ between conditions. Like DEG analysis, DIR analysis also uses the FDR as a decision criterion [11–13].
We compared Clipper with three popular DIR identification methods: diffHic [13], FIND [12], and multiHiCcompare [11] (Additional File 1: Section S5.5). Specifically, we applied Clipper to DIR identification by treating pairs of genomic regions as features and their contact intensities as measurements. To verify the FDR control of Clipper (Additional File 1: Section S7.5), diffHic, FIND, and multiHiCcompare, we generated realistic semisynthetic data from real interaction matrices of ENCODE cell line GM12878 [44] with true spikedin DIRs to evaluate the FDR and power (Additional File 1: Section S6.5). We examined the actual FDR and power in a range of target FDR thresholds: q=1%,2%,…,10%. Figure 3d shows that Clipper and diffHic are the only two methods that consistently control the FDR, while multiHiCcompare and FIND fail by a large margin. In terms of power, Clipper outperforms diffHic except for q=1% and 2%, even though Clipper has not been optimized for HiC data analysis. This result demonstrates Clipper’s general applicability and strong potential for DIR analysis.
Discussion
In this paper, we proposed a new statistical framework, Clipper, for identifying interesting features with FDR control from highthroughput data. Clipper avoids the use of pvalues and makes FDR control more reliable and flexible. We used comprehensive simulation studies to verify the FDR control by Clipper under various settings. We demonstrated that Clipper outperforms existing generic FDR control methods by having higher power and greater robustness to model misspecification. We further applied Clipper to four popular bioinformatics analyses: peak calling from ChIPseq data, peptide identification from MS data, DEG identification from RNAseq data, and DIR identification from HiC data. Our results indicate that Clipper provides a powerful addon to existing bioinformatics tools to improve the reliability of FDR control and thus the reproducibility of scientific discoveries.
Clipper’s FDR control procedures (BC and GZ procedures in the “Methods” section) are motivated by the Barber-Candès (BC) knockoff paper [35] and the Gimenez-Zou (GZ) multiple knockoff paper [64], but we do not need to construct knockoffs in enrichment analysis when two conditions have the same number of replicates; the reason is that the replicates under the background condition serve as natural negative controls. For differential analysis and enrichment analysis with unequal numbers of replicates, in order to guarantee the theoretical assumptions for FDR control, Clipper uses permutations instead of the complicated knockoff construction because Clipper only examines features marginally and is not concerned with features’ joint distribution.
We validated the FDR control by Clipper using extensive and concrete simulations, including both modelbased and realdatabased data generation with ground truths, which are widely used to validate newly developed computational frameworks [65]. In contrast, in most bioinformatics method papers, the FDR control was merely mentioned but rarely validated. Many of them assumed that using the BH procedure on pvalues would lead to valid FDR control; however, the reality is often otherwise because pvalues would be invalid when model assumptions were violated or the pvalue calculation was problematic. Here we voice the importance of validating FDR control in bioinformatics method development, and we use this work as a demonstration. We believe that Clipper provides a powerful booster to this movement. As a pvaluefree alternative to the classic pvaluebased BH procedure, Clipper relies less on model assumptions and is thus more robust to model misspecifications, making it an appealing choice for FDR control in diverse highthroughput biomedical data analyses.
Clipper is a flexible framework that is easily generalizable to identify a variety of interesting features. The core component of Clipper summarizes each feature’s measurements under each condition into an informative statistic (e.g., the sample mean); then Clipper combines each feature’s informative statistics under two conditions into a contrast score to enable FDR control. The current implementation of Clipper only uses the sample mean as the informative statistic to identify the interesting features that have distinct expected values under two conditions. However, by modifying the informative statistic, we can generalize Clipper to identify the features that are interesting in other aspects, e.g., having different variances between two conditions. Regarding the contrast score, Clipper currently makes careful choices between two contrast scores, minus and maximum, based on the number of replicates and the analysis task (enrichment or differential).
Notably, Clipper achieves FDR control and high power using those two simple contrast scores, which are calculated for individual features without borrowing information from other features. However, Clipper does leverage the power of multiple testing by setting a contrast score threshold based on all features’ contrast scores. This is a likely reason why Clipper achieves good power even with simple contrast scores. An advantage of Clipper is that it allows other definitions of contrast scores, such as the twosample t statistic that considers withincondition variances. Empirical evidence (Additional File 1: Figures S13 and S14) shows that the Clipper variant using the twosample t statistic has lower power than the default Clipper, which uses the minus summary statistic (difference of two conditions’ sample means) as the contrast score in the 3vs3 enrichment analysis or as the degree of interestingness in the 3vs3 differential analysis (see the “Methods” section). Here is our current interpretation of this seemingly counterintuitive result.

First, both the minus statistic and the t statistic satisfy Clipper’s theoretical conditions (Lemmas 1 and 3 in Additional File 1: Section S2), which guarantee the FDR control by the BC and GZ procedures; this is confirmed in Additional File 1: Figures S13 and S14. Hence, from the FDR control perspective, Clipper does not require the adjustment for withincondition variances by using a t statistic.

Second, Clipper is different from the twosample t test or the regressionbased t test, where the t statistic was purposely derived as a pivotal statistic so that its null distribution (the t distribution) does not depend on unknown parameters. Since Clipper does not require a null distribution for each feature, the advantage of the t statistic being pivotal no longer matters.

Third, the minus statistic only requires estimates of the two conditions’ mean parameters, while the t statistic additionally requires estimates of the two conditions’ variances. Hence, when the sample sizes (i.e., the numbers of replicates) are small, the two additional parameters that the t statistic requires estimating might contribute to the observed power loss of the Clipper t statistic variant. Indeed, the power difference between the two statistics diminishes as the sample sizes increase from 3vs3 in Additional File 1: Figures S13–S14 to 10vs10 in Additional File 1: Figure S10 (where we compared the default Clipper with BHpairparametric, which is based on the twosample t test and is highly similar to the Clipper t statistic variant).

Fourth, we observe empirically that a contrast score would have better power if its distribution (based on its values of all features) has a larger range and a heavier right tail (in the positive domain). Compared to the minus statistic, the t statistic has a smaller range and a lighter right tail due to its adjustment for features’ withincondition variances (Additional File 1: Figure S23). This observation is consistent with the power difference of the two statistics.
Beyond our current interpretation, however, we admit that future studies are needed to explore alternative contrast scores and their power with respect to data characteristics and analysis tasks. Furthermore, we may generalize Clipper to be robust against sample batch effects by constructing the contrast score as a regressionbased test statistic that has batch effects removed.
Our current version of Clipper allows the identification of interesting features between two conditions. However, there is a growing need to generalize our framework to identify features across more than two conditions. For example, temporal analysis of scRNAseq data aims to identify genes whose expression levels change along cell pseudotime [31]. To tailor Clipper for such analysis, we could define a new contrast score that differentiates the genes with stationary expression (uninteresting features) from the other genes with varying expression (interesting features). Further studies are needed to explore the possibility of extending Clipper to the regression framework so that Clipper can accommodate data of multiple conditions or even continuous conditions, as well as adjusting for confounding covariates.
We have demonstrated the broad application potential of Clipper in various bioinformatics data analyses. Specifically, when used as an addon to established, popular bioinformatics methods such as MACS2 for peak calling and SEQUEST for peptide identification, Clipper guaranteed the desired FDR control and in some cases boosted the power. However, more careful thought is needed to develop Clipper into standalone bioinformatics methods for specific data analyses, in which data processing and characteristics (e.g., peak lengths, GC contents, proportions of zeros, and batch effects) must be appropriately accounted for before Clipper is used for FDR control [59, 66]. We expect that the Clipper framework will propel future development of bioinformatics methods by providing a flexible pvaluefree approach to control the FDR, thus improving the reliability of scientific discoveries.
After finishing this manuscript, we were informed of the work by He et al. [67], which is highly similar to the part of Clipper for differential analysis, as both works use permutation for generating negative controls and the GZ procedure for thresholding (test statistics in He et al. and contrast scores in Clipper). However, the test statistics used in He et al. are the twosample t statistic and the twosample Wilcoxon statistic, both of which are different from the minus and maximum contrast scores used in Clipper. While we leave the optimization of contrast scores to future work, we observed that the minus contrast score has higher power than the twosample t statistic in our analysis (Additional File 1: Figures S13 and S14), and we hypothesize that the twosample Wilcoxon statistic, though a valid contrast score for differential analysis, requires a large sample size to achieve good power. For this reason, we did not consider it as a contrast score in the current Clipper implementation, whose focus is on smallsamplesize highthroughput biological data.
Conclusion
In highthroughput biological data analysis, which aims to identify interesting features by comparing two conditions, existing bioinformatics tools control the FDR based on pvalues. However, obtaining valid pvalues relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions—two requirements that are often unmet in biological studies. To address this issue, we propose Clipper, a general statistical framework for FDR control without relying on pvalues or specific data distributions. Clipper is applicable to identifying both enriched and differential features from highthroughput biological data of diverse types. In comprehensive simulation and realdata benchmarking, Clipper outperforms existing generic FDR control methods and specific bioinformatics tools designed for various tasks, including peak calling from ChIPseq data, differentially expressed gene identification from bulk or singlecell RNAseq data, differentially interacting chromatin region identification from HiC data, and peptide identification from mass spectrometry data. Our results demonstrate Clipper’s flexibility and reliability for FDR control, as well as its broad applications in highthroughput data analysis.
Methods
Clipper: notations and assumptions
We first introduce the notations and assumptions used in Clipper. While the differential analysis treats the two conditions symmetrically, the enrichment analysis requires one condition to be the experimental condition (i.e., the condition of interest) and the other condition to be the background condition (i.e., the negative control). For simplicity, we use the same set of notations for both analyses. For two random vectors X=(X_{1},…,X_{m})^{⊤} and Y=(Y_{1},…,Y_{n})^{⊤}, we write X⊥Y if X_{i} is independent of Y_{j} for all i=1,…,m and j=1,…,n. To avoid confusion, we use card(A) to denote the cardinality of a set A and |c| to denote the absolute value of a scalar c. We define a∨b:= max(a,b).
Clipper only requires two inputs: the target FDR threshold q∈(0,1) and the input data. Regarding the input data, we use d to denote the number of features with measurements under two conditions, and we use m and n to denote the numbers of replicates under the two conditions. For each feature \(j = 1,\dots, d\), we use \(\boldsymbol {X}_{j} = \left (X_{j1},\dots, X_{{jm}}\right)^{\top } \in \mathbb {R}_{\scriptscriptstyle \ge 0}^{m}\) and \(\boldsymbol {Y}_{j} = \left (Y_{j1},\dots, Y_{{jn}}\right)^{\top } \in \mathbb {R}_{\scriptscriptstyle \ge 0}^{n}\) to denote its measurements under the two conditions, where \(\mathbb {R}_{\scriptscriptstyle \ge 0}\) denotes the set of nonnegative real numbers. We assume that all measurements are nonnegative, as is the case with most highthroughput experiments. (If this assumption does not hold, transformations can be applied to make data satisfy this assumption.)
Clipper has the following assumptions on the joint distribution of \(\boldsymbol {X}_{1},\ldots, \boldsymbol {X}_{d}, \boldsymbol {Y}_{1}, \dots, \boldsymbol {Y}_{d}\). For \(j = 1,\dots, d\), Clipper assumes that X_{j1},…,X_{jm} are identically distributed, so are Y_{j1},…,Y_{jn}. Let \(\mu _{{Xj}}=\mathbb {E}\left [X_{j1}\right ]\) and \(\mu _{{Yj}}=\mathbb {E}\left [Y_{j1}\right ]\) denote the expected measurement of feature j under the two conditions, respectively. Then conditioning on \(\left \{\mu _{{Xj}}\right \}_{j = 1}^{d}\) and \(\left \{\mu _{{Yj}}\right \}_{j = 1}^{d}\),
An enrichment analysis aims to identify interesting features with μ_{Xj}>μ_{Yj} (with X_{j} and Y_{j} defined as the measurements under the experimental and background conditions, respectively), while a differential analysis aims to call interesting features with μ_{Xj}≠μ_{Yj}. We define \(\mathcal {N} := \{j: \mu _{{Xj}} = \mu _{{Yj}}\}\) as the set of uninteresting features and denote \(N:=\text {card}({\mathcal {N}})\). In both analyses, Clipper further assumes that an uninteresting feature j satisfies
Clipper consists of two main steps: construction and thresholding of contrast scores. First, Clipper computes contrast scores, one per feature, as summary statistics that reflect the extent to which features are interesting. Second, Clipper establishes a contrastscore cutoff and calls as discoveries the features whose contrast scores exceed the cutoff.
To construct contrast scores, Clipper uses two summary statistics \(t(\cdot,\cdot): \mathbb {R}_{\scriptscriptstyle \ge 0}^{m}\times \mathbb {R}_{\scriptscriptstyle \ge 0}^{n} \rightarrow \mathbb {R}\) to extract data information regarding whether a feature is interesting or not:
where \(\boldsymbol {x} = (x_{1},\ldots,x_{m})^{\top } \in \mathbb {R}_{\scriptscriptstyle \ge 0}^{m}, \boldsymbol {y} = (y_{1},\ldots,y_{n})^{\top } \in \mathbb {R}_{\scriptscriptstyle \ge 0}^{n}, \bar {x} = \sum _{i=1}^{m} x_{i}/m, \bar {y} = \sum _{i=1}^{n} y_{i}/n\), and \(\text {sign}(\cdot):\mathbb {R} \rightarrow \{-1,0,1\}\) with sign(x)=1 if x>0, sign(x)=−1 if x<0, and sign(x)=0 otherwise.
Notably, other summary statistics can also be used to construct contrast scores. For example, an alternative summary statistic is the t statistic from the twosample t test:
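The summary statistics above can be sketched in a few lines. This is a minimal illustration reconstructed from the text's descriptions (the minus statistic as the difference of sample means, the maximum statistic as the larger sample mean carrying the sign of that difference); the pooled-variance form of the twosample t statistic is shown as one common choice and may differ from the exact variant used elsewhere in the paper:

```python
from math import sqrt
from statistics import mean, variance

def sign(v):
    """sign: R -> {-1, 0, 1}, as defined in the text."""
    return (v > 0) - (v < 0)

def t_minus(x, y):
    """Minus statistic: difference of the two conditions' sample means."""
    return mean(x) - mean(y)

def t_max(x, y):
    """Maximum statistic: sign(x_bar - y_bar) * (x_bar v y_bar), i.e.,
    the larger of the two sample means, signed by their difference."""
    return sign(mean(x) - mean(y)) * max(mean(x), mean(y))

def t_two_sample(x, y):
    """Alternative summary statistic: two-sample t statistic
    (pooled-variance form, shown here for illustration)."""
    m, n = len(x), len(y)
    pooled_var = ((m - 1) * variance(x) + (n - 1) * variance(y)) / (m + n - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var * (1 / m + 1 / n))
```

For example, with x=(3,4,5) and y=(1,2,3), t_minus gives 2 and t_max gives 4, since the larger sample mean is 4 and the sign of the difference is positive.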
Then we introduce how Clipper works in three analysis tasks: the enrichment analysis with equal numbers of replicates under two conditions (m=n), the enrichment analysis with different numbers of replicates under two conditions (m≠n), and the differential analysis (when m+n>2).
Clipper: enrichment analysis with equal numbers of replicates (m=n)
Under the enrichment analysis, we assume that \(\boldsymbol {X}_{j} \in \mathbb {R}_{\scriptscriptstyle \ge 0}^{m}\) and \(\boldsymbol {Y}_{j} \in \mathbb {R}_{\scriptscriptstyle \ge 0}^{n}\) are the measurements of feature j, j=1,…,d, under the experimental and background conditions with m and n replicates, respectively. We start with the simple case when m=n. Clipper defines a contrast score C_{j} of feature j in one of two ways:
or
Figure 6a shows a cartoon illustration of contrast scores when m=n=1. Accordingly, a large positive value of C_{j} bears evidence that μ_{Xj}>μ_{Yj}. Motivated by [35], Clipper uses the following procedure to control the FDR under the target level q∈(0,1).
Definition 1
(Barber-Candès (BC) procedure for thresholding contrast scores [35]) Given contrast scores \(\left\{C_{j}\right\}_{j=1}^{d}\), \(\mathcal{C} = \left\{|C_{j}| : C_{j} \neq 0;\; j = 1, \dots, d\right\}\) is defined as the set of nonzero absolute values of C_{j}’s. The BC procedure finds a contrastscore cutoff T^{BC} based on the target FDR threshold q∈(0,1) as
and outputs {j:C_{j}≥T^{BC}} as discoveries.
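The BC cutoff can be sketched as follows. This is a minimal illustration assuming the standard Barber-Candès form of the estimated FDR, (1 + #{j: C_j ≤ −t}) / max(#{j: C_j ≥ t}, 1), minimized over nonzero absolute contrast scores t; consult the display equation in the paper for the exact definition:

```python
def bc_threshold(contrast_scores, q):
    """Smallest nonzero |C_j| at which the estimated FDR,
    (1 + #{C_j <= -t}) / max(#{C_j >= t}, 1), drops to <= q
    (assumed Barber-Candes form)."""
    candidates = sorted({abs(c) for c in contrast_scores if c != 0})
    for t in candidates:
        neg = sum(1 for c in contrast_scores if c <= -t)
        pos = sum(1 for c in contrast_scores if c >= t)
        if (1 + neg) / max(pos, 1) <= q:
            return t
    return float("inf")  # no cutoff achieves the target FDR: no discoveries

def bc_discoveries(contrast_scores, q):
    """Indices of features whose contrast scores pass the BC cutoff."""
    t = bc_threshold(contrast_scores, q)
    return [j for j, c in enumerate(contrast_scores) if c >= t]
```

For instance, with contrast scores (5, 4, 3, 2, 1, −1) and q=0.4, the cutoff is 1 and the first five features are called as discoveries; if all contrast scores are negative, no cutoff qualifies and nothing is discovered.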
Clipper: enrichment analysis with any numbers of replicates m and n
When m≠n, Clipper constructs contrast scores via permutation of replicates across conditions. The idea is that, after permutation, every feature becomes uninteresting and can serve as its own negative control.
Definition 2
(Permutation) We define σ as permutation, i.e., a bijection from the set {1,⋯,m+n} onto itself, and we rewrite the data X_{1},…,X_{d},Y_{1},…,Y_{d} into a matrix W:
We then apply σ to permute the columns of W and obtain
from which we obtain the permuted measurements \(\left \{\left (\boldsymbol {X}_{j}^{\sigma }, \boldsymbol {Y}_{j}^{\sigma }\right)\right \}_{j=1}^{d}\), where
In the enrichment analysis, if two permutations σ and σ^{′} satisfy that
then we define σ and σ^{′} to be in one equivalence class. That is, permutations in the same equivalence class lead to the same division of m+n replicates (from the two conditions) into two groups with sizes m and n. In total, there are \({m+n \choose m}\) equivalence classes of permutations.
We define σ_{0} as the identity permutation such that σ_{0}(i)=i for all i∈{1,⋯,m+n}. In addition, Clipper randomly samples h equivalence classes \(\sigma _{1},\dots, \sigma _{h}\) with equal probabilities without replacement from the other \(h_{\max } := {m+n \choose m}  1\) equivalence classes (after excluding the equivalence class containing σ_{0}). Note that h_{max} is the maximum value h can take.
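The sampling of permutation equivalence classes can be sketched as follows. Representing each class by the set of replicate indices assigned to the first condition after permutation is an implementation choice for illustration, not necessarily how Clipper stores permutations internally:

```python
import random
from itertools import combinations

def sample_permutation_classes(m, n, h, seed=0):
    """Sample h equivalence classes of permutations without replacement.
    Each class is represented by the m column indices (out of 0..m+n-1)
    assigned to the first condition; the identity class {0,...,m-1} is
    excluded, so h can be at most h_max = C(m+n, m) - 1."""
    identity = tuple(range(m))
    classes = [c for c in combinations(range(m + n), m) if c != identity]
    if not 1 <= h <= len(classes):
        raise ValueError("h must be between 1 and C(m+n, m) - 1")
    rng = random.Random(seed)
    return rng.sample(classes, h)
```

For example, with m=2 and n=1 there are C(3,2)=3 classes, of which 2 remain after excluding the identity, matching h_max in the text.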
Clipper then obtains \(\left\{ \left(\boldsymbol{X}_{j}^{\sigma_{0}},\boldsymbol{Y}_{j}^{\sigma_{0}}\right), \left(\boldsymbol{X}_{j}^{\sigma_{1}},\boldsymbol{Y}_{j}^{\sigma_{1}}\right), \cdots, \left(\boldsymbol{X}_{j}^{\sigma_{h}},\boldsymbol{Y}_{j}^{\sigma_{h}}\right) \right\}_{j=1}^{d}\), where \(\left(\boldsymbol{X}_{j}^{\sigma_{\ell}},\boldsymbol{Y}^{\sigma_{\ell}}_{j}\right)\) are the permuted measurements based on σ_{ℓ}, ℓ=0,1,…,h. Then Clipper computes \(T^{\sigma_{\ell}}_{j} := t^{\text{minus}}\left(\boldsymbol{X}^{\sigma_{\ell}}_{j},\boldsymbol{Y}^{\sigma_{\ell}}_{j}\right)\) to indicate the degree of “interestingness” of feature j reflected by \(\left(\boldsymbol{X}^{\sigma_{\ell}}_{j},\boldsymbol{Y}^{\sigma_{\ell}}_{j}\right)\). Note that Clipper chooses t^{minus} instead of t^{max} because empirical evidence shows that t^{minus} leads to better power. Sorting \(\left\{T^{\sigma_{\ell}}_{j}\right\}_{\ell=0}^{h}\) gives
Then Clipper defines the contrast score of feature j, j=1,…,d, in one of two ways:
or
The intuition behind the contrast scores is that, if C_{j}<0, then \(T^{(0)}_{j} \neq T^{\sigma_{0}}_{j}\), which means that at least one of \(T^{\sigma_{1}}_{j}, \ldots, T^{\sigma_{h}}_{j}\) (calculated after random permutation) is greater than \(T^{\sigma_{0}}_{j}\) calculated from the original data (identity permutation), suggesting that feature j is likely an uninteresting feature in the enrichment analysis. Figure 6b (right) shows a cartoon illustration of contrast scores when m=2 and n=1. Motivated by [64], we propose the following procedure for Clipper to control the FDR under the target level q∈(0,1).
Definition 3
(Gimenez-Zou (GZ) procedure for thresholding contrast scores [64]) Given h∈{1,⋯,h_{max}} and contrast scores \(\left\{C_{j}\right\}_{j=1}^{d}\), \(\mathcal{C} = \left\{|C_{j}| : C_{j} \neq 0;\; j = 1, \dots, d\right\}\) is defined as the set of nonzero absolute values of C_{j}’s. The GZ procedure finds a contrastscore cutoff T^{GZ} based on the target FDR threshold q∈(0,1) as:
and outputs {j:C_{j}≥T^{GZ}} as discoveries.
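A sketch of the GZ cutoff in the same style as the BC sketch above. The FDR estimate used here, (1/h)(1 + #{j: C_j ≤ −t}) / max(#{j: C_j ≥ t}, 1), is our reading of the multiple-knockoff estimate that Definition 3 adapts from [64]; consult the display equation in the paper for the exact form:

```python
def gz_threshold(contrast_scores, q, h):
    """Smallest nonzero |C_j| at which the estimated FDR,
    (1/h) * (1 + #{C_j <= -t}) / max(#{C_j >= t}, 1), drops to <= q
    (assumed Gimenez-Zou form; h is the number of sampled permutation
    equivalence classes)."""
    candidates = sorted({abs(c) for c in contrast_scores if c != 0})
    for t in candidates:
        neg = sum(1 for c in contrast_scores if c <= -t)
        pos = sum(1 for c in contrast_scores if c >= t)
        if (1 / h) * (1 + neg) / max(pos, 1) <= q:
            return t
    return float("inf")  # no cutoff achieves the target FDR
```

Note that with h=1 this estimate coincides with the BC form, consistent with the two procedures' shared motivation.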
Clipper: differential analysis with m+n>2
For differential analysis, Clipper also uses permutation to construct contrast scores. When m≠n, the equivalence classes of permutations are defined the same as for the enrichment analysis with m≠n. When m=n, there is a slight change in the definition of equivalence classes of permutations: if σ and σ^{′} satisfy that
then we say that σ and σ^{′} are in one equivalence class. In total, there are \(h_{\text {total}} := {m+n \choose m}\) (when m≠n) or \({2m \choose m}/2\) (when m=n) equivalence classes of permutations. Hence, to have more than one equivalence class, we cannot perform differential analysis with m=n=1; in other words, the total number of replicates m+n must be at least 3.
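The counts of equivalence classes above can be checked with a few lines, using math.comb for the binomial coefficients:

```python
from math import comb

def h_total(m, n):
    """Number of permutation equivalence classes in differential analysis:
    C(m+n, m) when m != n, and C(2m, m) / 2 when m == n (swapping the two
    conditions wholesale yields the same class)."""
    return comb(m + n, m) if m != n else comb(2 * m, m) // 2

# m = n = 1 gives h_total = 1: only the identity class exists, so no
# permutations are available; hence differential analysis needs m + n >= 3.
```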
Then Clipper randomly samples σ_{1},…,σ_{h} with equal probabilities without replacement from the h_{max}:=h_{total}−1 equivalence classes that exclude the class containing σ_{0}, i.e., the identity permutation. Note that h_{max} is the maximum value h can take. Next, Clipper computes \(T^{\sigma_{\ell}}_{j} := \left| t^{\text{minus}}\left(\boldsymbol{X}^{\sigma_{\ell}}_{j}, \boldsymbol{Y}^{\sigma_{\ell}}_{j}\right) \right|\), where \(\boldsymbol{X}^{\sigma_{\ell}}_{j}\) and \(\boldsymbol{Y}^{\sigma_{\ell}}_{j}\) are the permuted data defined in (9), and it defines C_{j} as the contrast score of feature j, j=1,…,d, in the same ways as in (10) or (11). Figure 6b (left) shows a cartoon illustration of contrast scores when m=2 and n=1.
Same as in the enrichment analysis with m≠n, Clipper also uses the GZ procedure [64] to set a cutoff on contrast scores to control the FDR under the target level q∈(0,1).
Clipper: discussion on contrast scores
Granted, when we use permutations to construct contrast scores in the GZ procedure, we can convert contrast scores into permutationbased pvalues (see Additional File 1: Section S1.1.2). However, when the numbers of replicates are small, the number of possible permutations is small, so permutationbased pvalues would have a low resolution (e.g., when m=2 and n=1, the number of nonidentity permutations is only 2). Hence, applying the BH procedure to the permutationbased pvalues would result in almost no power. Although Yekutieli and Benjamini proposed another thresholding procedure for permutationbased pvalues [68], it still requires the number of permutations to be large to obtain a reliable FDR control. Furthermore, if we apply the SeqStep+ procedure by Barber and Candès [35] to permutationbased pvalues, it would be equivalent to our application of the GZ procedure to contrast scores (Additional File 1: Section S1.1.2).
For both differential and enrichment analyses, the two contrast scores (minus and maximum) can both control the FDR. Based on the power comparison results in Additional File 1: Section S3 and Additional File 1: Figures S24–S28, Clipper has the following default choice of contrast score: for the enrichment analysis with m=n, Clipper uses the BC procedure with the minus contrast score; for the enrichment analysis with m≠n or the differential analysis, Clipper uses the GZ procedure with the maximum contrast score.
Generic FDR control methods
In our simulation analysis, we compared Clipper against generic FDR control methods including pvaluebased methods and localfdrbased methods. Briefly, each pvaluebased method is a combination of a pvalue calculation approach and a pvalue thresholding procedure. We use either the “paired” or “pooled” approach (see next paragraph) to calculate pvalues of features and then threshold the pvalues using the BH procedure (Additional File 1: Definition S1) or Storey’s qvalue procedure (Additional File 1: Definition S2) to make discoveries (Additional File 1: Section S1.1). As a result, we have four pvaluebased methods: BHpair, BHpool, qvaluepair, and qvaluepool (Fig. 1b).
Regarding the existing pvalue calculation approaches in bioinformatics tools, we categorize them as “paired” or “pooled.” The paired approach has been widely used to detect DEGs and proteinbinding sites [1, 2, 4, 5]. It examines one feature at a time and compares the feature’s measurements between two conditions using a statistical test. In contrast, the pooled approach is popular in proteomics for identifying peptide sequences from MS data [69]. For every feature, it defines a test statistic and estimates a null distribution by pooling all features’ observed test statistic values under the background condition. Finally, it calculates a pvalue for every feature based on the feature’s observed test statistic under the experimental condition and the null distribution.
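As an illustration of the pooled approach, the following sketch computes empirical right-tail pvalues against a null distribution pooled from the background condition. The +1 correction (to avoid zero pvalues) and the right-tail orientation are illustrative choices, not necessarily those of any specific tool:

```python
def pooled_pvalues(experimental_stats, background_stats):
    """For each feature, the p-value is the fraction of pooled background
    statistics at least as large as the feature's observed statistic under
    the experimental condition; +1 in numerator and denominator avoids
    zero p-values (illustrative convention)."""
    null_pool = list(background_stats)
    n = len(null_pool)
    pvals = []
    for s in experimental_stats:
        exceed = sum(1 for v in null_pool if v >= s)
        pvals.append((exceed + 1) / (n + 1))
    return pvals
```

For example, against a background pool of (1, 2, 3), an experimental statistic of 10 receives p = 1/4 and a statistic of 0 receives p = 1; the resulting pvalues would then be thresholded by the BH or qvalue procedure.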
In parallel to pvaluebased methods, localfdrbased methods estimate local fdrs of features and then threshold the local fdrs using the locfdr procedure (Additional File 1: Definition S5) to make discoveries. The estimation of local fdrs takes one of two approaches: (1) empirical null, which is estimated parametrically from the test statistic values that are likely drawn from the null distribution, and (2) swapping null, which is constructed by swapping measurements between experimental and background conditions. The resulting two localfdrbased methods are referred to as locfdremp and locfdrswap (Figs. 1b and 2). Additional File 1: Section S1 provides a detailed explanation of these generic methods and how we implemented them in this work.
Specific to the pvaluebased methods, for the paired approach, besides the ideal implementation that uses the correct model to calculate pvalues (BHpaircorrect and qvaluepaircorrect), we also consider common misimplementations. The first misimplementation is the misspecification of the distribution (BHpairmis and qvaluepairmis). An example is the detection of proteinbinding sites from ChIPseq data. A common assumption is that ChIPseq read counts in a genomic region (i.e., a feature) follow the Poisson distribution [1, 2], which implies that the counts’ variance equals their mean. However, if only two replicates are available, it is impossible to check whether this Poisson distribution is reasonably specified. The second misimplementation is the misspecification of a twosample test as a onesample test (BHpair2as1 and qvaluepair2as1), which ignores the sampling randomness of replicates under one condition. This issue is implicit but widespread in bioinformatics methods [1, 70].
To summarize, we compared Clipper against the following implementations of generic FDR control methods:

BHpool or qvaluepool: pvalues calculated by the pooled approach and thresholded by the BH or qvalue procedure.

BHpaircorrect or qvaluepaircorrect: pvalues calculated by the paired approach with the correct model specification and thresholded by the BH or qvalue procedure.

BHpairmis or qvaluepairmis: pvalues calculated by the paired approach with a misspecified model and thresholded by the BH or qvalue procedure.

BHpair2as1 or qvaluepair2as1: pvalues calculated by the paired approach that misformulates a twosample test as a onesample test (2as1) and thresholded by the BH or qvalue procedure.

locfdremp: local fdrs calculated by the empirical null approach and thresholded by the locfdr procedure.

locfdrswap: local fdrs calculated by the swapping approach and thresholded by the locfdr procedure.
Software packages used in this study

p.adjust R function (in R package stats v 4.0.2 with default arguments) [14]: used for BHpool, BHpaircorrect, BHpairmis, and BHpair2as1.

qvalue R package (v 2.20.0 with default arguments) [71]: used for qvaluepool, qvaluepaircorrect, qvaluepairmis, and qvaluepair2as1.

locfdr R package (v 1.18 with default arguments) [72]: used for locfdremp.

MACS2 software package (v 2.2.6 with default settings) [1]: available at https://github.com/macs3-project/MACS/releases/tag/v2.2.6.

ChIPulate software package [45]: available at https://github.com/vishakad/chipulate.

HOMER software package (findPeaks v 3.1.9.2 with default settings) [2]: available at https://www.bcgsc.ca/platform/bioinfo/software/findpeaks/releases/3.1.9.2/findpeaks-3-1-9-2-tar.gz.

SEQUEST in Proteome Discoverer (v 2.3.0.523 with default settings) [3]: commercial software by Thermo Scientific.

edgeR R package (v 3.30.0 with default arguments) [4]: available at https://www.bioconductor.org/packages/release/bioc/html/edgeR.html.

DESeq2 R package (v 1.28.1 with default arguments) [5]: available at https://bioconductor.org/packages/release/bioc/html/DESeq2.html.

limma R package (v 3.44.3 with default arguments) [10]: available at https://www.bioconductor.org/packages/release/bioc/html/limma.html.

MAST R package (v 1.14.0 with default arguments) [57]: available at https://www.bioconductor.org/packages/release/bioc/html/MAST.html.

monocle3 R package (v 0.2.3.0 with default arguments) [58]: available at https://github.com/cole-trapnell-lab/monocle3.

multiHiCcompare R package (v 1.6.0 with default arguments) [11]: available at https://bioconductor.org/packages/release/bioc/html/multiHiCcompare.html.

diffHic R package (v 1.20.0 with default arguments) [13]: available at https://www.bioconductor.org/packages/release/bioc/html/diffHic.html.

FIND R package (v 0.99 with default arguments) [12]: available at https://bitbucket.org/nadhir/find/src/master/.
Availability of data and materials
∙ The Clipper R package is available at https://github.com/JSB-UCLA/Clipper/ [81].
∙ The code and processed data for reproducing the figures are available at https://zenodo.org/record/5115468 (DOI: 10.5281/zenodo.5115468) [82].
∙ A video introduction of Clipper is available at https://youtu.be/GXyHiJMpLo.
∙ Real datasets:
− The H3K4me3 ChIP-seq dataset with one experimental sample (GEO accession number GSM733708) and two control samples (GEO accession number GSM733742) from the cell line GM12878 is available at ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/, with the experimental sample wgEncodeBroadHistoneGm12878H3k4me3StdAlnRep1.bam and the two control samples wgEncodeBroadHistoneGm12878ControlStdAlnRep1.bam and wgEncodeBroadHistoneGm12878ControlStdAlnRep2.bam. The processed dataset is available at https://zenodo.org/record/5115468.
− The Pfu mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE [83] partner repository with the dataset identifier PXD028558 (https://www.ebi.ac.uk/pride/archive/projects/PXD028558). The processed MS benchmark dataset is available as data/pfu/Archaea Sequest DECOYS 100_FDR.xlsx for the decoy PSMs and data/pfu/Archaea Sequest Targets 100_FDR.xlsx for the target PSMs at https://zenodo.org/record/5202768.
− The human monocyte RNA-seq dataset (SRA accession number SRP082682) is available at https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP082682. The dataset includes 17 samples of classical monocytes and 17 samples of non-classical monocytes, and it is converted to a sample-by-gene count matrix by the R package GenomicFeatures (v 1.40.1). The processed count matrix is available at https://zenodo.org/record/5115468.
− The Hi-C dataset from the cell line GM12878 (GEO accession number GSE63525) is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63525. The count matrix is under the filename GSE63525_GM12878_primary_intrachromosomal_contact_matrices.tar.gz, and the matrix corresponding to Chromosome 1 and bin width 1 Mb is used. The processed dataset is available at https://zenodo.org/record/5115468.
− The PBMC single-cell RNA-seq dataset (GEO accession number GSE132044) is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132044. The processed dataset is available at https://zenodo.org/record/5115468.
References
 1
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, et al. Model-based analysis of ChIP-seq (MACS). Genome Biol. 2008; 9(9):1–9.
 2
Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010; 38(4):576–89.
 3
Washburn MP, Wolters D, Yates JR. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol. 2001; 19(3):242–7.
 4
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26(1):139–40.
 5
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15(12):550.
 6
Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2013; 31(1):46–53.
 7
Li J, Witten DM, Johnstone IM, Tibshirani R. Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics. 2012; 13(3):523–38.
 8
Hardcastle TJ, Kelly KA. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinforma. 2010; 11(1):1–14.
 9
Smyth G. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3(1). https://doi.org/10.2202/1544-6115.1027.
 10
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; 43(7):e47.
 11
Stansfield JC, Cresswell KG, Dozmorov MG. multiHiCcompare: joint normalization and comparative analysis of complex Hi-C experiments. Bioinformatics. 2019; 35(17):2916–23.
 12
Djekidel MN, Chen Y, Zhang MQ. FIND: differential chromatin interactions detection using a spatial Poisson process. Genome Res. 2018; 28(3):412–22.
 13
Lun AT, Smyth GK. diffHic: a Bioconductor package to detect differential genomic interactions in Hi-C data. BMC Bioinforma. 2015; 16(1):1–11.
 14
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995; 57(1):289–300.
 15
Efron B, Tibshirani R. Empirical Bayes methods and false discovery rates for microarrays. Genet Epidemiol. 2002; 23(1):70–86.
 16
Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc. 2001; 96(456):1151–60.
 17
Stephens M. False discovery rates: a new deal. Biostatistics. 2017; 18(2):275–94.
 18
Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci. 2003; 100(16):9440–5.
 19
Reiner A, Yekutieli D, Benjamini Y. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics. 2003; 19(3):368–75.
 20
Yang B, Wu YJ, Zhu M, Fan SB, Lin J, Zhang K, Li S, Chi H, Li YX, Chen HF, et al. Identification of cross-linked peptides from complex samples. Nat Methods. 2012; 9(9):904–6.
 21
White JR, Nagarajan N, Pop M. Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Comput Biol. 2009; 5(4):e1000352.
 22
Shabalin AA. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics. 2012; 28(10):1353–8.
 23
Hawinkel S, Mattiello F, Bijnens L, Thas O. A broken promise: microbiome differential abundance methods do not control the false discovery rate. Brief Bioinform. 2019; 20(1):210–21.
 24
Zheng Y, Keleş S. FreeHi-C simulates high-fidelity Hi-C data for benchmarking and data augmentation. Nat Methods. 2020; 17(1):37–40.
 25
Storey JD. A direct approach to false discovery rates. J R Stat Soc Ser B Stat Methodol. 2002; 64(3):479–98.
 26
Benjamini Y, Hochberg Y. Multiple hypotheses testing with weights. Scand J Stat. 1997; 24(3):407–18.
 27
Ignatiadis N, Klaus B, Zaugg JB, Huber W. Datadriven hypothesis weighting increases detection power in genomescale multiple testing. Nat Methods. 2016; 13(7):577–80.
 28
Lei L, Fithian W. AdaPT: an interactive procedure for multiple testing with side information. J R Stat Soc Ser B Stat Method. 2018; 80(4):649–79.
 29
Boca SM, Leek JT. A direct approach to estimating false discovery rates conditional on covariates. PeerJ. 2018; 6:e6035.
 30
Ho J, Tumkaya T, Aryal S, Choi H, ClaridgeChang A. Moving beyond p values: data analysis with estimation graphics. Nat Methods. 2019; 16(7):565–6.
 31
Song D, Li JJ. PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data. Genome Biol. 2021; 22:124.
 32
Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci. 2001; 98(9):5116–21.
 33
Hemerik J, Goeman JJ. False discovery proportion estimation by permutations: confidence for significance analysis of microarrays. J R Stat Soc Ser B Stat Methodol. 2018; 80(1):137–55.
 34
Hemerik J, Solari A, Goeman JJ. Permutationbased simultaneous confidence bounds for the false discovery proportion. Biometrika. 2019; 106(3):635–49.
 35
Barber RF, Candès EJ. Controlling the false discovery rate via knockoffs. Ann Stat. 2015; 43(5):2055–85.
 36
AriasCastro E, Chen S. Distributionfree multiple testing. Electron J Stat. 2017; 11(1):1983–2001.
 37
Benjamini Y. Selective inference: the silent killer of replicability. Harv Data Sci Rev. 2020;2(4). https://doi.org/10.1162/99608f92.fc62b261.
 38
Emery K, Hasam S, Noble WS, Keich U. Multiple competitionbased FDR control and its application to peptide detection In: Schwartz R, editor. Research in Computational Molecular Biology. RECOMB 2020. Lecture Notes in Computer Science, vol 12074. Cham: Springer: 2020. p. 54–71.
 39
Sarkar A, Stephens M. Separating measurement and expression models clarifies confusion in single-cell RNA-sequencing analysis. Nat Genet. 2021; 53(6):770–7.
 40
Park PJ. Chip–seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009; 10(10):669–80.
 41
Mitchell PJ, Tjian R. Transcriptional regulation in mammalian cells by sequence-specific DNA binding proteins. Science. 1989; 245(4916):371–8.
 42
Ptashne M, Gann A. Transcriptional activation by recruitment. Nature. 1997; 386(6625):569–77.
 43
Bailey T, Krajewski P, Ladunga I, Lefebvre C, Li Q, Liu T, Madrigal P, Taslim C, Zhang J. Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Comput Biol. 2013; 9(11):e1003326.
 44
ENCODE Project Consortium, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489(7414):57–74.
 45
Datta V, Hannenhalli S, Siddharthan R. ChIPulate: a comprehensive ChIP-seq simulation pipeline. PLoS Comput Biol. 2019; 15(3):e1006921.
 46
Diaz A, Park K, Lim D, Song JS. Normalization, bias correction, and peak calling for ChIP-seq. Stat Appl Genet Mol Biol. 2012;11(3). https://doi.org/10.1515/1544-6115.1750.
 47
Bogdanow B, Zauber H, Selbach M. Systematic errors in peptide and protein identification and quantification by modified peptides. Mol Cell Proteomics. 2016; 15(8):2791–801.
 48
Perkins D, Pappin D, Creasy D, Cottrell J. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999; 20:3551–67.
 49
Bern M, Kil Y, Becker C. Byonic: advanced peptide and protein identification software. Curr Protoc Bioinforma. 2012; 40(1):13–20.
 50
Robinson M, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010; 11(3):1–9.
 51
Williams C, Baccarella A, Parrish J, Kim C. Empirical assessment of analysis workflows for differential expression analysis of human samples using RNA-seq. BMC Bioinforma. 2017; 18(1):38.
 52
Gierliński M, Cole C, Schofield P, Schurch N, Sherstnev A, Singh V, Wrobel N, Gharbi K, Simpson G, Owen-Hughes T, et al. Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment. Bioinformatics. 2015; 31:3625–30.
 53
Korthauer K, Kimes P, Duvallet C, Reyes A, Subramanian A, Teng M, Shukla C, Alm E, Hicks S. A practical guide to methods controlling false discoveries in computational biology. Genome Biol. 2019; 20(1):1–21.
 54
Li Y, Ge X, Peng F, Li W, Li JJ. A large-sample crisis? Exaggerated false positives by popular differential expression methods. bioRxiv 2021.08.25.457733. https://doi.org/10.1101/2021.08.25.457733.
 55
Li Q, Brown J, Huang H, Bickel P. Measuring reproducibility of high-throughput experiments. Ann Appl Stat. 2011; 5(3):1752–79.
 56
Yu G, Wang LG, Han Y, He QY. clusterProfiler: an R package for comparing biological themes among gene clusters. Omics J Integr Biol. 2012; 16(5):284–7.
 57
Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek A, Slichter C, Miller H, McElrath MJ, Prlic M, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 2015; 16(1):1–13.
 58
Qiu X, Hill A, Packer J, Lin D, Ma YA, Trapnell C. Single-cell mRNA quantification and differential analysis with Census. Nat Methods. 2017; 14(3):309–15.
 59
Soneson C, Robinson M. Bias, robustness and scalability in single-cell differential expression analysis. Nat Methods. 2018; 15(4):255.
 60
Sun T, Song D, Li WV, Li JJ. scDesign2: an interpretable simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. bioRxiv. 2020.
 61
Ding J, Adiconis X, Simmons S, Kowalczyk MS, Hession CC, Marjanovic ND, Hughes T, Wadsworth M, Burks T, Nguyen LT, et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat Biotechnol. 2020; 38(6):737–46.
 62
Zheng G, Terry J, Belgrader P, Ryvkin P, Bent Z, Wilson R, Ziraldo S, Wheeler T, McDermott G, Zhu J, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017; 8(1):1–12.
 63
Macosko E, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas A, Kamitaki N, Martersteck E, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015; 161(5):1202–14.
 64
Gimenez JR, Zou J. Improving the stability of the knockoff procedure: Multiple simultaneous knockoffs and entropy maximization. 22nd Int Conf Artif Intell Stat PMLR. 2019; 89:2184–92.
 65
Wang N, Lefaudeux D, Mazumder A, Li JJ, Hoffmann A. Identifying the combinatorial control of signal-dependent transcription factors. PLOS Comput Biol. 2021; 17(6):e1009095.
 66
Thorsen J, Brejnrod A, Mortensen M, Rasmussen M, Stokholm J, Al-Soud WA, Sørensen S, Bisgaard H, Waage J. Large-scale benchmarking reveals false discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome studies. Microbiome. 2016; 4(1):62.
 67
He K, Li M, Fu Y, Gong F, Sun X. Null-free false discovery rate control using decoy permutations for multiple testing. arXiv preprint arXiv:1804.08222. 2018.
 68
Yekutieli D, Benjamini Y. Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. J Stat Plan Infer. 1999; 82(1–2):171–96.
 69
Nesvizhskii A. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteomics. 2010; 73(11):2092–123.
 70
Li W, Xu H, Xiao T, Cong L, Love MI, Zhang F, Irizarry RA, Liu JS, Brown M, Liu XS. MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biol. 2014; 15(12):554.
 71
Storey J, Bass A, Dabney A, Robinson D. qvalue: Q-value estimation for false discovery rate control. 2020. https://github.com/jdstorey/qvalue. R package version 2.20.0. Accessed 11 July 2021.
 72
Efron B. Largescale simultaneous hypothesis testing: the choice of a null hypothesis. J Am Stat Assoc. 2004; 99(465):96–104.
 73
Barber RF, Candès E, et al. A knockoff filter for highdimensional selective inference. Ann Stat. 2019; 47(5):2504–37.
 74
Spivak M, Weston J, Bottou L, Käll L, Noble WS. Improvements to the Percolator algorithm for peptide identification from shotgun proteomics data sets. J Proteome Res. 2009; 8(7):3737–45.
 75
Stansfield J, Cresswell K, Vladimirov V, Dozmorov M. HiCcompare: an R package for joint normalization and comparison of Hi-C datasets. BMC Bioinforma. 2018; 19(1):279.
 76
Howe KL, Achuthan P, Allen J, Allen J, AlvarezJarreta J, Amode MR, Armean IM, Azov AG, Bennett R, Bhai J, et al. Ensembl 2021. Nucleic Acids Res. 2021; 49(D1):D884–91.
 77
Mellacheruvu D, Wright Z, Couzens A, Lambert JP, St-Denis N, Li T, Miteva Y, Hauri S, Sardiu M, Low TY, et al. The CRAPome: a contaminant repository for affinity purification–mass spectrometry data. Nat Methods. 2013; 10(8):730–6.
 78
Cox J, Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol. 2008; 26(12):1367–72.
 79
Goloborodko AA, Levitsky LI, Ivanov MV, Gorshkov MV. Pyteomics—a Python framework for exploratory data analysis and rapid software prototyping in proteomics. J Am Soc Mass Spectrom. 2013; 24(2):301–4.
 80
Levitsky L, Klein J, Ivanov M, Gorshkov M. Pyteomics 4.0: five years of development of a Python proteomics framework. J Proteome Res. 2018; 18(2):709–14.
 81
Ge X, Chen Y, Li JJ. Clipper: p-value-free FDR control on high-throughput data from two conditions. https://github.com/JSB-UCLA/Clipper. R package version 0.0.0.9. Accessed 11 July 2021.
 82
Ge X, Chen Y, Li JJ. Clipper: p-value-free FDR control on high-throughput data from two conditions. 2021. https://zenodo.org/record/5115468.
 83
Perez-Riverol Y, Csordas A, Bai J, Bernal-Llinares M, Hewapathirana S, Kundu D, Inuganti A, Griss J, Mayer G, Eisenacher M, et al. The PRIDE database and related tools and resources in 2019: improving support for quantification data. Nucleic Acids Res. 2019; 47(D1):D442–50.
Acknowledgements
The authors would like to thank Dr. Yu-Cheng Yang for his suggestions on the figures and R package. The authors would also like to thank Mr. Nikos Ignatiadis, Dr. Lihua Lei, and Dr. Rina Barber for their insightful comments after we presented this work at the International Seminar on Selective Inference (https://www.selectiveinferenceseminar.com/past-talks). The authors also appreciate the comments and feedback from Mr. Tianyi Sun, Ms. Kexin Li, and other members of the Junction of Statistics and Biology at UCLA (http://jsb.ucla.edu).
Peer review information
Anahita Bishop was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Review history
The review history is available as Additional file 3.
Funding
This work was supported by the following grants: NIH-NCI T32LM012424 (to Y.E.C.); NCI K08 CA201591, the Gabrielle's Angel Foundation and Alex's Lemonade Stand Foundation (to L.D.W.); NIH R01HG007538, R01CA193466, and R01CA228140 (to W.L.); National Science Foundation DBI-1846216 and DMS-2113754, NIH/NIGMS R01GM120507 and R35GM140888, Johnson & Johnson WiSTEM2D Award, Sloan Research Fellowship, and UCLA David Geffen School of Medicine W.M. Keck Foundation Junior Faculty Award (to J.J.L.).
Author information
Affiliations
Contributions
J.J.L., X.G., and Y.E.C. developed the methodology. X.G. performed the simulation analysis and the DEG analysis on semi-synthetic single-cell RNA-seq data. X.G. did the peak calling analysis, advised by W.L. Y.E.C. did the DEG analysis on semi-synthetic bulk RNA-seq data, the DIR analysis, and the peptide identification analysis. The real data used in the peptide identification analysis were generated by M.M., K.W., and A.M., supported by L.D.W. X.G. and Y.E.C. performed the DEG analysis on real bulk RNA-seq data, the results of which were interpreted by X.G., D.S., and N.W. X.G. and Y.E.C. developed the R package. Y.E.C. wrote the mathematical proof. X.G., Y.E.C., and J.J.L. wrote the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
L.D.W. holds equity in Magenta Therapeutics.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
Supplementary materials. It includes a review of generic FDR control methods (Section S1 [73]), the detailed Clipper methodology (Section S2), the comparison of Clipper variant algorithms (Section S3), data generation and detailed implementation of the paired approach (a p-value calculation approach) in simulation studies (Section S4), bioinformatic methods with FDR control functionality (Section S5 [74, 75]), benchmark data generation in omics data applications (Section S6 [76–80]), implementation of Clipper in omics data applications (Section S7), proofs (Section S8), and Figures S1–S29.
Additional file 2
Supplementary table. Biological functions of the DEGs (between classical and non-classical human monocytes) that were identified by Clipper only but missed by both edgeR and DESeq2 from the human monocyte RNA-seq dataset (SRA accession number SRP082682).
Additional file 3
Review history
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Ge, X., Chen, Y.E., Song, D. et al. Clipper: p-value-free FDR control on high-throughput data from two conditions. Genome Biol 22, 288 (2021). https://doi.org/10.1186/s13059-021-02506-9
DOI: https://doi.org/10.1186/s13059-021-02506-9