<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Genet.</journal-id>
<journal-title>Frontiers in Genetics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Genet.</abbrev-journal-title>
<issn pub-type="epub">1664-8021</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fgene.2020.00509</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Genetics</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Enhanced Permutation Tests via Multiple Pruning</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Leem</surname> <given-names>Sangseob</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="author-notes" rid="fn002"><sup>&#x2020;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/997331/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Huh</surname> <given-names>Iksoo</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="author-notes" rid="fn002"><sup>&#x2020;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/872018/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Park</surname> <given-names>Taesung</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/872605/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Statistics, Seoul National University</institution>, <addr-line>Seoul</addr-line>, <country>South Korea</country></aff>
<aff id="aff2"><sup>2</sup><institution>College of Nursing and Research Institute of Nursing Science, Seoul National University</institution>, <addr-line>Seoul</addr-line>, <country>South Korea</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Christian Darabos, Dartmouth College, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Tiejun Tong, Hong Kong Baptist University, Hong Kong; Gil Speyer, Arizona State University, United States</p></fn>
<corresp id="c001">&#x002A;Correspondence: Taesung Park, <email>tspark@stats.snu.ac.kr</email></corresp>
<fn fn-type="other" id="fn002"><p><sup>&#x2020;</sup>These authors have contributed equally to this work</p></fn>
<fn fn-type="other" id="fn004"><p>This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>25</day>
<month>06</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="collection">
<year>2020</year>
</pub-date>
<volume>11</volume>
<elocation-id>509</elocation-id>
<history>
<date date-type="received">
<day>24</day>
<month>12</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>27</day>
<month>04</month>
<year>2020</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2020 Leem, Huh and Park.</copyright-statement>
<copyright-year>2020</copyright-year>
<copyright-holder>Leem, Huh and Park</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Big multi-omics data in bioinformatics often consist of a huge number of features and a relatively small number of samples. In addition, features from multi-omics data have their own specific characteristics depending on whether they are from genomics, proteomics, metabolomics, etc. Due to these distinct characteristics, standard statistical analyses based on parametric assumptions may sometimes fail to provide accurate results. To resolve this issue, permutation tests offer a way to analyze multi-omics data exactly because they are distribution-free and flexible to use. In permutation tests, <italic>p</italic>-values are evaluated by estimating the locations of test statistics in an empirical null distribution generated by random shuffling. However, the permutation approach can be infeasible when the number of features increases, because more stringent control of type I error is needed for multiple hypothesis testing, and consequently, much larger numbers of permutations are required to reach significance. To address this problem, we propose a well-organized strategy, &#x201C;ENhanced Permutation tests via multiple Pruning (ENPP).&#x201D; ENPP prunes the features in every permutation round if they are determined to be non-significant. In other words, if the number of permuted datasets whose feature statistic exceeds the statistic from the original dataset reaches a predetermined threshold, the feature is determined to be non-significant. If so, ENPP removes the feature and iterates the process without the feature in the next permutation round. Our simulation study showed that the ENPP method could remove about 50% of the features at the first permutation round, and, by the 100th permutation round, 98% of the features had been removed and only 7.4% of the computation time of the original unpruned permutation approach had been used.
In addition, we applied this approach to a real data set (Korea Association REsource: KARE) of 327,872 SNPs to find association with a non-normally distributed phenotype (fasting plasma glucose), interpreted the results, and discussed the feasibility and advantages of the approach.</p>
</abstract>
<kwd-group>
<kwd>permutation test</kwd>
<kwd>multiple hypothesis testing</kwd>
<kwd>pruning</kwd>
<kwd>big multi-omics data</kwd>
<kwd>GWAS</kwd>
</kwd-group>
<contract-num rid="cn001">2013M3A9C4078158</contract-num>
<contract-sponsor id="cn001">National Research Foundation<named-content content-type="fundref-id">10.13039/100011512</named-content></contract-sponsor>
<counts>
<fig-count count="3"/>
<table-count count="1"/>
<equation-count count="3"/>
<ref-count count="28"/>
<page-count count="8"/>
<word-count count="0"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1">
<title>Introduction</title>
<p>Unlike typical big data, big data in bioinformatics consists of huge numbers of features and relatively small numbers of samples. For example, the data from genome-wide association studies (GWAS) contain at least thousands of samples and several hundred thousand single nucleotide polymorphisms (SNPs) (<xref ref-type="bibr" rid="B17">Manolio, 2010</xref>). In the case of transcriptomic analysis for finding differentially expressed genes, tens of thousands of genes are tested from only hundreds of samples at most (<xref ref-type="bibr" rid="B18">McLachlan et al., 2005</xref>). In epigenomic data, such as DNA methylation, the number of features (e.g., CpG sites) varies from tens of thousands to several million according to profiling techniques and their resolution (<xref ref-type="bibr" rid="B5">Bibikova et al., 2011</xref>; <xref ref-type="bibr" rid="B1">Adusumalli et al., 2014</xref>). Moreover, not only the large number of features but also their varied characteristics must be considered. For example, in genomic data, such as SNPs, a feature is represented as a count of a minor allele at a genomic locus in each individual. In transcriptome data sets, gene expression levels are represented as continuous and positive real values measured from microarray spot intensities. In the case of epigenomics data, the DNA methylation level of a locus can be provided as the ratio of the C read count to the combined C and T read counts. In addition, proteomics and metabolomics data provide marker intensities from mass-spectrometry-based approaches. Therefore, detecting association between phenotypes and biomarkers using standard statistical approaches may sometimes be inaccurate, as many of these approaches are based on parametric assumptions that require specific properties of the features.
Although several remedies have been proposed in terms of parametric approaches (<xref ref-type="bibr" rid="B26">Thygesen and Zwinderman, 2004</xref>; <xref ref-type="bibr" rid="B15">Lin et al., 2008</xref>; <xref ref-type="bibr" rid="B21">Park and Wu, 2016</xref>), they are inherently asymptotic and may still suffer from type 1 error inflation or low power.</p>
<p>As an alternative, the permutation test (<xref ref-type="bibr" rid="B23">Pitman, 1937</xref>; <xref ref-type="bibr" rid="B3">Annis, 2005</xref>) has become a popular approach for analyzing multi-omics data because it can be used regardless of the shape of the distribution of the biomarkers&#x2019; expression and uses a simple algorithm. In the permutation test, a <italic>p</italic>-value is assessed by evaluating the relative rank of the observed test statistic in an empirical null distribution of the test statistic generated by random shuffling. The permutation test has already been used in several omics analyses. For example, in GWAS, the permutation test is used for adjusting for multiple tests (<xref ref-type="bibr" rid="B6">Browning, 2008</xref>), considering biological structures (<xref ref-type="bibr" rid="B20">Pahl and Sch&#x00E4;fer, 2010</xref>), and identifying gene-gene interactions (<xref ref-type="bibr" rid="B24">Ritchie et al., 2001</xref>; <xref ref-type="bibr" rid="B10">Greene et al., 2010</xref>). In next-generation sequencing data analysis, the permutation test has been used to identify rare variants associated with a phenotype (<xref ref-type="bibr" rid="B16">Madsen and Browning, 2009</xref>) and as a significance test of structural models (<xref ref-type="bibr" rid="B14">Lee et al., 2016</xref>; <xref ref-type="bibr" rid="B13">Kim et al., 2018</xref>). In integration analysis of multi-omics data, the permutation test is used for finding edges in the integrated network (<xref ref-type="bibr" rid="B11">Jeong et al., 2015</xref>) and for significance testing of an aggregated unit with a structure (<xref ref-type="bibr" rid="B13">Kim et al., 2018</xref>).
In metagenome studies, the permutation test is used for testing differences between distances of groups (<xref ref-type="bibr" rid="B7">Chen et al., 2012</xref>), finding differentially abundant operational taxonomic units (<xref ref-type="bibr" rid="B2">Anderson, 2005</xref>), and finding differentially abundant genomic features (<xref ref-type="bibr" rid="B22">Paulson et al., 2011</xref>).</p>
<p>However, a major obstacle to the permutation test is its computational cost, because the smallest <italic>p</italic>-value that a permutation test can reach is inversely proportional to the number of permutations. Therefore, a data set with a large number of features requires a large number of permutations to detect significantly associated features, because more features require more stringent type 1 error control under multiple hypothesis testing correction. For example, if a researcher wants to test an association between 5.0 &#x00D7; 10<sup>5</sup> SNPs and a specific phenotype, the <italic>p</italic>-value threshold will be 1.0 &#x00D7; 10<sup>&#x2013;7</sup> [0.05/(5.0 &#x00D7; 10<sup>5</sup>) by Bonferroni correction]. To reach such a stringent <italic>p</italic>-value threshold, the number of permutations must be at least 1.0 &#x00D7; 10<sup>7</sup>&#x2212;1 for each SNP, which makes the total computation time for all features impractical. Considering that only significant features are of general interest to researchers, pruning non-significant features can be a way to resolve the issue.</p>
<p>Therefore, in this study, we propose a well-organized strategy, ENhanced Permutation tests via multiple Pruning (ENPP). The key idea of ENPP is simple. When the number of features is large, the <italic>p</italic>-value threshold is very low due to multiple testing correction. In most cases, if a feature is reported to be significant, its observed test statistic value should be more extreme than those from permuted data sets. Conversely, if the permuted data sets yield a larger statistic than the observed one at least a set number of times, the feature is very unlikely to reach significance, and ENPP prunes it at that permutation round. In other words, ENPP removes non-significant features and continues the permutation procedure with the remaining features, which are still candidates for significance at the predetermined level. This approach can reduce total permutation time to a feasible level compared to ordinary permutation approaches that conduct the same number of permutation tests on all features. Herein, we show that ENPP can remove about 50% of features in the first permutation round and requires, at the 100th permutation round, only 7.4% of the computation time needed for the unpruned permutation approach. This relative proportion of computation time becomes smaller as the number of permutation rounds increases. In addition, we applied our approach to a real data set (Korea Association REsource: KARE) (<xref ref-type="bibr" rid="B8">Cho et al., 2009</xref>) containing 327,872 SNP features and a non-normally distributed phenotype (fasting plasma glucose, FPG) to validate our approach in terms of feasibility and usefulness.</p>
</sec>
<sec id="S2" sec-type="materials|methods">
<title>Materials and Methods</title>
<sec id="S2.SS1">
<title>Data Set</title>
<p>For real data analysis, we chose a Korean GWAS data set collected since 2007 by the Korea Association REsource (KARE) project (<xref ref-type="bibr" rid="B8">Cho et al., 2009</xref>). In this project, all participants were recruited from one of two region-based cohorts (rural Ansung and urban Ansan). The total number of participants was 10,038 (5,018 from Ansung and 5,020 from Ansan), and all were genotyped from peripheral blood genomic DNA using the Affymetrix (Santa Clara, CA, United States) Genome-Wide Human SNP array 5.0, containing 500,568 SNPs. For quality control, we followed the same process used in a previous study (<xref ref-type="bibr" rid="B19">Oh et al., 2016</xref>). As a result, we obtained 8,842 individuals and 327,872 SNPs, and this processed data set was used in our real data analysis. The study was reviewed and approved by the Institutional Review Board of Seoul National University (IRB No. E1908/001-004).</p>
</sec>
<sec id="S2.SS2">
<title>ENPP Approach</title>
<p>Suppose that there are <italic>N</italic> samples, each with a dependent variable <italic>Y</italic>, and <italic>J</italic> features <italic>X</italic><sub>1</sub>,&#x2026;,<italic>X</italic><sub><italic>J</italic></sub> from a multi-omics data set. In general, for a significance test of association between a specific <italic>X</italic><sub><italic>j</italic></sub> and <italic>Y</italic>, the null distribution of the test statistic <italic>S</italic> is built from the test statistics of permuted data sets, denoted <italic>s</italic><sub><italic>r</italic></sub>, where <italic>r</italic> = 1,2,&#x2026;,<italic>R</italic>, with <italic>R</italic> denoting the total number of permutation rounds for the feature. Then, the observed value, <italic>s</italic><sub><italic>obs</italic></sub> (i.e., the value of <italic>S</italic> in the original data), is compared to the null distribution of <italic>S</italic>, and the significance is assessed by the proportion of <italic>s</italic><sub><italic>r</italic></sub> values at least as extreme as <italic>s</italic><sub><italic>obs</italic></sub>. For exact generation of the null distribution, <italic>N</italic>! permutations are required. However, when <italic>N</italic>! is too large, <italic>R</italic> random shuffles (<italic>R</italic>&#x226A;<italic>N</italic>!) are generally used instead, for computational feasibility, as a Monte Carlo estimate. An <italic>s</italic><sub><italic>obs</italic></sub> value larger than most of the simulated <italic>s</italic><sub><italic>r</italic></sub> values supports the alternative hypothesis, and the <italic>p</italic>-value is calculated by the following equation:</p>
<disp-formula id="S2.E1">
<label>(1)</label>
<mml:math id="M1">
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo rspace="5.3pt">+</mml:mo>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mi>I</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>o</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2264;</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>r</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where I(&#x22C5;) is an indicator function, and +1 in the numerator and denominator can be omitted.</p>
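<p>As an illustration only, Equation (1) can be sketched in Python as follows. The function names permutation_p_value and cross_product, the one-sided cross-product statistic, and the toy data are assumptions made for this sketch; they are not taken from the ENPP software.</p>
<preformat><![CDATA[
```python
import random


def permutation_p_value(x, y, statistic, n_perm, rng):
    """Monte Carlo permutation p-value in the form of Eq. (1)."""
    s_obs = statistic(x, y)
    y_perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)                 # random relabelling of the phenotype
        if statistic(x, y_perm) >= s_obs:   # indicator I(s_obs <= s_r)
            exceed += 1
    return (1 + exceed) / (n_perm + 1)


def cross_product(x, y):
    # Hypothetical one-sided statistic: larger values mean stronger positive association.
    return sum(a * b for a, b in zip(x, y))


# Toy data (assumed): a perfectly monotone relation between feature and phenotype.
x = list(range(10))
y = list(range(10))
p = permutation_p_value(x, y, cross_product, n_perm=199, rng=random.Random(0))
print(p)
```
]]></preformat>
<p>With <italic>R</italic> permutations, the smallest value this estimator can return is 1/(<italic>R</italic> + 1), which is why stringent thresholds demand very large <italic>R</italic>.</p>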
<p>When multiple features are tested, the <italic>p</italic>-value threshold should be adjusted for multiple testing. For example, a typical <italic>p</italic>-value threshold is 0.05, and, if there are 1,000 features for association tests, the <italic>p</italic>-value threshold becomes 0.05/1,000 under the Bonferroni correction. In other words, a feature is reported as significant only when its <italic>p</italic>-value is smaller than this adjusted threshold. Therefore, the probability that I(&#x22C5;) = 1 (a permuted statistic at least as extreme as the observed one) is extremely low for such a feature. On the other hand, if I(&#x22C5;) = 1 occurs frequently for a feature, its <italic>p</italic>-value is likely to be far above the adjusted threshold, meaning that it is unlikely to be significant and would therefore be of no interest to researchers. Let <italic>p</italic><sub><italic>raw</italic></sub> be an unadjusted <italic>p</italic>-value threshold (e.g., 0.05) and <italic>p</italic><sub><italic>adj</italic></sub> be the adjusted <italic>p</italic>-value threshold for each feature after the multiple testing correction (e.g., 0.05/<italic>J</italic> by Bonferroni correction). <italic>p</italic><sub><italic>adj</italic></sub> is then the significance level at which we need to detect significant features, and the decision of whether or not to prune a feature in any specific round is based on the following hypotheses:</p>
<disp-formula id="S2.E2">
<label>(2)</label>
<mml:math id="M2">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mtext>H</mml:mtext>
<mml:mo>&#x2062;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mo>:</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo rspace="4.2pt">,</mml:mo>
<mml:mrow>
<mml:mpadded width="+1.7pt">
<mml:mtext>and</mml:mtext>
</mml:mpadded>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>H1</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>:</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mo>&gt;</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where p denotes the true <italic>p</italic>-value from the permutation approach. For this hypothesis test, a significance level must be chosen, and we call this threshold p<sub><italic>prun</italic></sub>. A binomial test can be used for the hypothesis test, and, based on p<sub><italic>adj</italic></sub> and p<sub><italic>prun</italic></sub>, we can determine for each permutation round an integer C<sub><italic>prun</italic></sub>, the smallest count at which the binomial test rejects H0 at level p<sub><italic>prun</italic></sub>. Therefore, C<sub><italic>prun</italic></sub> depends on the permutation round, while p<sub><italic>adj</italic></sub> and p<sub><italic>prun</italic></sub> are fixed values for the whole pruning process. Using this rule, ENPP counts, up to each permutation round, how many permuted test statistics of a feature have been more extreme than its observed test statistic. If this count is equal to or greater than C<sub><italic>prun</italic></sub> in a round, the feature is removed from the subsequent permutation rounds. The following is a detailed explanation of the parameter determination.</p>
<p>Let us assume that p<sub><italic>adj</italic></sub> = 5 &#x00D7; 10<sup>&#x2013;5</sup>, which is the Bonferroni-corrected threshold for 1,000 features, and p<sub><italic>prun</italic></sub> = p<sub><italic>adj</italic></sub>. In addition, if we let <italic>p</italic><sub><italic>k</italic>|<italic>r</italic></sub> denote the probability of observing at least <italic>k</italic> test statistic values more extreme than the observed test statistic by the <italic>r</italic>th permutation round, then <inline-formula><mml:math id="INEQ19"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo lspace="2.5pt" rspace="2.5pt" stretchy="false">|</mml:mo><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mi>r</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mtable rowspacing="0pt"><mml:mtr><mml:mtd columnalign="center"><mml:mi>r</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd columnalign="center"><mml:mi>t</mml:mi></mml:mtd></mml:mtr></mml:mtable><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2062;</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo>&#x2062;</mml:mo><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mo>-</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula>. Therefore, the binomial test is significant, and the feature is pruned, when <italic>p</italic><sub><italic>k</italic>|<italic>r</italic></sub> is equal to or smaller than p<sub><italic>prun</italic></sub>; <italic>C</italic><sub><italic>prun</italic></sub> for round <italic>r</italic> is the smallest <italic>k</italic> for which this holds. As an illustration, consider the first permutation round. Based on a setting of p<sub><italic>adj</italic></sub> = 5 &#x00D7; 10<sup>&#x2013;5</sup>, two probabilities, <italic>p</italic><sub>0|1</sub> and <italic>p</italic><sub>1|1</sub>, are given. Here <italic>p</italic><sub>0|1</sub> is 1 and <italic>p</italic><sub>1|1</sub> is <italic>p</italic><sub><italic>adj</italic></sub>; because we set <italic>p</italic><sub><italic>prun</italic></sub> = <italic>p</italic><sub><italic>adj</italic></sub>, <italic>p</italic><sub>1|1</sub> does not exceed <italic>p</italic><sub><italic>prun</italic></sub>, implying that <italic>C</italic><sub><italic>prun</italic></sub> = 1 in the first round. For the second round, there are three probabilities, <italic>p</italic><sub>0|2</sub>, <italic>p</italic><sub>1|2</sub> and <italic>p</italic><sub>2|2</sub>, that can be easily computed. In this case, <italic>p</italic><sub>1|2</sub> &#x2248; 1 &#x00D7; 10<sup>&#x2212;4</sup> &#x003E; <italic>p</italic><sub><italic>prun</italic></sub> and <italic>p</italic><sub>2|2</sub> = 2.5 &#x00D7; 10<sup>&#x2212;9</sup> &#x003C; <italic>p</italic><sub><italic>prun</italic></sub>. Therefore, <italic>C</italic><sub><italic>prun</italic></sub> will be 2 for the second round.
In this manner, we can obtain <italic>C</italic><sub><italic>prun</italic></sub> for all permutation rounds conducted. We will show the properties of the parameters in the next section.</p>
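<p>The rule described above can be sketched as follows: for a given round <italic>r</italic>, take the smallest <italic>k</italic> whose binomial upper-tail probability under H0 does not exceed <italic>p</italic><sub><italic>prun</italic></sub>. This is a minimal sketch, not the authors&#x2019; implementation; the function names binomial_upper_tail and c_prun and the convention of returning <italic>r</italic> + 1 when no count qualifies are assumptions, and the code presumes 0 &#x003C; <italic>p</italic><sub><italic>adj</italic></sub> &#x003C; 1 and small <italic>k</italic>.</p>
<preformat><![CDATA[
```python
import math


def binomial_upper_tail(k, r, p):
    """P(X >= k) for X ~ Binomial(r, p), summed upward from t = k (assumes 0 < p < 1)."""
    if k <= 0:
        return 1.0
    if k > r:
        return 0.0
    # First term C(r, k) * p^k * (1 - p)^(r - k); log1p keeps (1 - p) accurate for tiny p.
    term = math.comb(r, k) * p ** k * math.exp((r - k) * math.log1p(-p))
    total = term
    for t in range(k, r):
        term *= (r - t) / (t + 1) * p / (1.0 - p)   # ratio of consecutive binomial terms
        total += term
        if term < 1e-18 * total:                    # remaining terms are negligible
            break
    return total


def c_prun(r, p_adj, p_prun):
    """Smallest k with P(X >= k) <= p_prun under H0: p = p_adj; r + 1 means 'never prune'."""
    for k in range(1, r + 1):
        if binomial_upper_tail(k, r, p_adj) <= p_prun:
            return k
    return r + 1


# Settings used in the text: p_adj = p_prun = 5e-5, rounds 1 and 2.
for round_index in (1, 2):
    print(round_index, c_prun(round_index, 5e-5, 5e-5))
```
]]></preformat>
<p>Summing the tail upward from <italic>t</italic> = <italic>k</italic>, rather than subtracting the lower tail from 1, avoids cancellation when <italic>p</italic><sub><italic>adj</italic></sub> is as small as 10<sup>&#x2212;7</sup>.</p>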
</sec>
</sec>
<sec id="S3">
<title>Results</title>
<sec id="S3.SS1">
<title>Simulation Analysis</title>
<p>In this section, we evaluated the advantages of ENPP compared to an unpruned permutation approach; in particular, ENPP needs only a very small count of exceedances to reject and remove non-significant features. As a consequence of this attribute, ENPP can greatly reduce total computation time to a feasible level compared to an unpruned permutation approach. To show these properties, we artificially generated data sets whose features were not associated with the phenotype. When the Bonferroni threshold was applied and <italic>p</italic><sub><italic>raw</italic></sub> = 0.05, the first example had <inline-formula><mml:math id="INEQ29"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mn>1</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.05/1,000 and the second example had <inline-formula><mml:math id="INEQ30"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.05/(5 &#x00D7; 10<sup>5</sup>). In addition, we assumed that <italic>p</italic><sub><italic>prun</italic></sub> = <italic>p</italic><sub><italic>adj</italic></sub> for both examples.</p>
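<p>A pruned permutation run on null data of this kind might look like the sketch below. It is not the simulation code used in this study: the Gaussian features and phenotype, the absolute cross-product statistic, the single shuffle shared by all features in a round, and the names enpp_rounds and simulate_null are all assumptions. The pruning thresholds are the <italic>C<sub>prun</sub></italic> values reported in this study for the second example (1 in the first round, then 2 up to round 4,473).</p>
<preformat><![CDATA[
```python
import random


def abs_cross_product(x_centered, y):
    # Hypothetical statistic: |sum_i x_i * y_i| for a mean-centered feature x.
    return abs(sum(a * b for a, b in zip(x_centered, y)))


def enpp_rounds(features, y, n_rounds, c_prun, rng):
    """Pruned permutation rounds; returns the number of surviving features per round."""
    observed = [abs_cross_product(x, y) for x in features]
    counts = [0] * len(features)          # exceedance count per feature
    active = list(range(len(features)))   # indices of features not yet pruned
    y_perm = list(y)
    remaining = []
    for r in range(1, n_rounds + 1):
        rng.shuffle(y_perm)               # one shuffle shared by all active features
        threshold = c_prun(r)
        survivors = []
        for j in active:
            if abs_cross_product(features[j], y_perm) >= observed[j]:
                counts[j] += 1
            if counts[j] < threshold:     # prune once the count reaches C_prun(r)
                survivors.append(j)
        active = survivors
        remaining.append(len(active))
    return remaining


def simulate_null(n_features, n_samples, n_rounds, seed):
    """Null data: Gaussian features generated independently of a Gaussian phenotype."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    features = []
    for _ in range(n_features):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        mean = sum(x) / n_samples
        features.append([v - mean for v in x])
    # Thresholds quoted in the text for p_adj = 1e-7: 1 at round 1, then 2 (valid to round 4,473).
    return enpp_rounds(features, y, n_rounds, lambda r: 1 if r == 1 else 2, rng)


remaining = simulate_null(n_features=2000, n_samples=50, n_rounds=100, seed=1)
print(remaining[0], remaining[-1])
```
]]></preformat>
<p>Because pruned features are skipped in later rounds, the cost of each round is proportional to the number of survivors rather than to the total number of features.</p>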
</sec>
<sec id="S3.SS2">
<title>Distribution of <italic>C<sub>prun</sub></italic></title>
<p>Firstly, we investigated the distribution of <italic>C<sub>prun</sub></italic> values according to each permutation round for <inline-formula><mml:math id="INEQ34"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mn>1</mml:mn></mml:msubsup></mml:math></inline-formula>, and <inline-formula><mml:math id="INEQ35"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula>, respectively. Using the formula described in the methods, <italic>C<sub>prun</sub></italic> values were calculated for <italic>r</italic> = 1,2,&#x2026;, 10,000, and the resulting values are shown in <xref ref-type="fig" rid="F1">Figure 1A</xref>, which also shows that the values of <italic>C<sub>prun</sub></italic> for <inline-formula><mml:math id="INEQ38"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mn>1</mml:mn></mml:msubsup></mml:math></inline-formula> are at most 6 in the 10,000th round. This implies that the threshold is not hard to satisfy and that we can reduce a large proportion of the number of features at each permutation round. In the case of <inline-formula><mml:math id="INEQ39"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula>, <italic>C<sub>prun</sub></italic> becomes smaller (<xref ref-type="fig" rid="F1">Figure 1B</xref>). 
In detail, <italic>C<sub>prun</sub></italic> is 1 for <italic>r</italic> = 1, 2 for <italic>r</italic> &#x2208; [2, 4,473], and 3 for <italic>r</italic> &#x2208; [4,474, 10,000], implying that smaller <italic>p</italic><sub><italic>adj</italic></sub> values provide smaller <italic>C<sub>prun</sub></italic> values, even though <italic>p</italic><sub><italic>prun</italic></sub> decreases in proportion to <italic>p</italic><sub><italic>adj</italic></sub>.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p><bold>(A)</bold> Distribution of <italic>C</italic><sub><italic>prun</italic></sub> with <italic>p</italic><sub><italic>prun</italic></sub> = 5 &#x00D7; 10<sup>&#x2013;5</sup>. <bold>(B)</bold> Distribution of <italic>C</italic><sub><italic>prun</italic></sub> with <italic>p</italic><sub><italic>prun</italic></sub> = 1 &#x00D7; 10<sup>&#x2013;7</sup>.</p></caption>
<graphic xlink:href="fgene-11-00509-g001.tif"/>
</fig>
</sec>
<sec id="S3.SS3">
<title>Pruning Rates and Computational Efficiency in Each Permutation Round</title>
<p>Based on the <italic>C<sub>prun</sub></italic> values calculated above, we also evaluated the pruned proportion of the total features for each permutation round. Suppose that the <italic>p</italic>-value of a feature has a uniform distribution, meaning that the feature has no association with a phenotype. In this setting, the pruned proportion of features depends only on <italic>C<sub>prun</sub></italic>. For example, at the first round, for <italic>C</italic><sub><italic>prun</italic></sub>(1) = 1, the proportion of pruned features will be <inline-formula><mml:math id="INEQ48"><mml:mrow><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">&#x222B;</mml:mo><mml:mn>0</mml:mn><mml:mn>1</mml:mn></mml:msubsup><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo mathvariant="italic" rspace="0pt">d</mml:mo><mml:mi>p</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mrow></mml:math></inline-formula>. At the second round, for <italic>C</italic><sub><italic>prun</italic></sub>(2) = 2, no pruning will happen, because any feature whose count reaches 2 at the second round already had a count of 1 at the first round and was pruned then. At the third round of permutation, for <italic>C</italic><sub><italic>prun</italic></sub>(3) = 2, a feature is pruned if it had no exceedance in the first round and exceedances in both the second and third rounds, so the expected additional pruning proportion will be:</p>
<disp-formula id="S3.Ex1">
<mml:math id="M3">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mo largeop="true" symmetric="true">&#x222B;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mn>1</mml:mn>
</mml:msubsup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:msup>
<mml:mi>p</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo mathvariant="italic" rspace="0pt">d</mml:mo>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mo largeop="true" symmetric="true">&#x222B;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mn>1</mml:mn>
</mml:msubsup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mi>p</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>-</mml:mo>
<mml:msup>
<mml:mi>p</mml:mi>
<mml:mn>3</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo mathvariant="italic" rspace="0pt">d</mml:mo>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mn>3</mml:mn>
</mml:mfrac>
<mml:mo>-</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mn>4</mml:mn>
</mml:mfrac>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mn>12</mml:mn>
</mml:mfrac>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
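<p>As a sanity check on the derivation above, the following Python sketch recomputes the expected per-round pruned proportions exactly. It assumes, consistent with the worked example, that a feature is pruned at round <italic>k</italic> once its count of exceedances reaches <italic>C<sub>prun</sub></italic>(<italic>k</italic>). The function names are illustrative and are not taken from the authors' R script.</p>
<preformat>
```python
from fractions import Fraction
from math import factorial


def beta_integral(a, b):
    # Integral over [0, 1] of p**a * (1 - p)**b dp = a! * b! / (a + b + 1)!
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def pruned_per_round(c_prun):
    # c_prun[k - 1] is the pruning threshold at round k (assumed rule:
    # prune when the exceedance count reaches the threshold).
    alive = {0: 1}  # exceedance count -> number of surviving outcome paths
    pruned = []
    for k, threshold in enumerate(c_prun, start=1):
        step = {}
        for count, n_paths in alive.items():
            for new_count in (count, count + 1):
                step[new_count] = step.get(new_count, 0) + n_paths
        alive = {}
        fraction = Fraction(0)
        for count, n_paths in step.items():
            if count >= threshold:
                # Each path has probability p**count * (1 - p)**(k - count),
                # averaged over p ~ U(0, 1).
                fraction += n_paths * beta_integral(count, k - count)
            else:
                alive[count] = n_paths
        pruned.append(fraction)
    return pruned


print(pruned_per_round([1, 2, 2]))
```
</preformat>
<p>Under these assumptions, the schedule (1, 2, 2) gives pruned proportions of 1/2, 0, and 1/12 for rounds one to three, matching the integrals above.</p>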
<p>In other words, at the first permutation, <inline-formula><mml:math id="INEQ53"><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:math></inline-formula> of the features are expected to be pruned, and a further <inline-formula><mml:math id="INEQ54"><mml:mfrac><mml:mn>1</mml:mn><mml:mn>12</mml:mn></mml:mfrac></mml:math></inline-formula> of the features are pruned after the third round. In this manner, the expected proportions of features remaining after pruning at permutation rounds 1 to 10,000 were calculated using the <italic>C<sub>prun</sub></italic> values (<xref ref-type="fig" rid="F1">Figure 1</xref>), and the results are shown in <xref ref-type="fig" rid="F2">Figure 2</xref>. Because the cumulative pruning proportion is not easily derived by direct calculation, we estimated it by simulation using variables from a Bernoulli distribution, with the probability of success drawn from a uniform distribution U(0,1). In <xref ref-type="fig" rid="F2">Figure 2A</xref>, only about 2% of features remain after the 100th permutation round in both <italic>p</italic><sub><italic>prun</italic></sub> settings, which greatly reduces the number of tests needed at that round. However, as the <italic>C<sub>prun</sub></italic> values of the two settings diverge, so do the remaining proportions. For example, at the 1000th permutation round, 0.3% of total features remained for <inline-formula><mml:math id="INEQ57"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mn>1</mml:mn></mml:msubsup></mml:math></inline-formula> and 0.2% for <inline-formula><mml:math id="INEQ58"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula>. 
The ratio between the two proportions became larger at the 10,000th permutation round, with 0.057% for the former, <inline-formula><mml:math id="INEQ59"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mn>1</mml:mn></mml:msubsup></mml:math></inline-formula>, and 0.028% for the latter, <inline-formula><mml:math id="INEQ60"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula>. These results reflect the differences in <italic>C<sub>prun</sub></italic> shown in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
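<p>The simulation described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the pruning rule (drop a feature once its exceedance count reaches <italic>C<sub>prun</sub></italic> for the round), the function name, and the default sample size are assumptions, and only the first three <italic>C<sub>prun</sub></italic> values quoted in the text are used. A full run would need the complete schedule summarized in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
<preformat>
```python
import random


def simulate_remaining(c_prun, n_features=200_000, seed=0):
    rng = random.Random(seed)
    # Null features: the exceedance probability of each feature is U(0, 1).
    probs = [rng.random() for _ in range(n_features)]
    counts = [0] * n_features
    alive = list(range(n_features))
    remaining = []
    for threshold in c_prun:
        survivors = []
        for i in alive:
            # Bernoulli(probs[i]) draw: does the permuted statistic
            # exceed the observed one in this round?
            if probs[i] > rng.random():
                counts[i] += 1
            # Assumed pruning rule: drop the feature once its count
            # reaches the threshold for this round.
            if threshold > counts[i]:
                survivors.append(i)
        alive = survivors
        remaining.append(len(alive) / n_features)
    return remaining


remaining = simulate_remaining([1, 2, 2])
print(remaining)
```
</preformat>
<p>With the schedule (1, 2, 2), the estimated remaining proportions should be close to 1/2 after the first two rounds and close to 5/12 after the third, in agreement with the pruned proportions of 1/2 and 1/12 derived in the text.</p>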
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p><bold>(A)</bold> Proportion of tested features at each round after pruning for the two <italic>P</italic><sub><italic>prun</italic></sub> values in <xref ref-type="fig" rid="F1">Figure 1</xref>. <bold>(B)</bold> Inverse of computational efficiency (ICE) for the settings in panel A. <bold>(C)</bold> Type 1 error results. We divided the number of false-positive features by 1,000 to obtain the family-wise type 1 error rate. 95% confidence intervals of the estimated type 1 errors are also provided.</p></caption>
<graphic xlink:href="fgene-11-00509-g002.tif"/>
</fig>
<p>We next assessed computational efficiency by comparing the total permutation time for ENPP with that for the original, unpruned permutation test. The efficiency is defined as the ratio of the number of tests in the original unpruned permutation approach to the cumulative number of tests in the ENPP approach, where the total permutation time up to a given round in ENPP is obtained by accumulating the permutation times of all rounds up to and including it. Larger computational efficiency therefore implies a larger time saving for ENPP. For example, during the first round, there is no reduction of permutation time, but for the second and third permutation rounds, ENPP needs only <inline-formula><mml:math id="INEQ62"><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:math></inline-formula> of the computations of the original unpruned permutation test, and only <inline-formula><mml:math id="INEQ63"><mml:mfrac><mml:mn>5</mml:mn><mml:mn>12</mml:mn></mml:mfrac></mml:math></inline-formula> for the fourth round. 
Therefore, computational efficiency will be <inline-formula><mml:math id="INEQ64"><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mn>1</mml:mn></mml:mfrac><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></inline-formula> for the first permutation round, and <inline-formula><mml:math id="INEQ65"><mml:mrow><mml:mrow><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mn>4</mml:mn><mml:mn>3</mml:mn></mml:mfrac></mml:mrow><mml:mo rspace="4.2pt">,</mml:mo><mml:mrow><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mn>3</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mrow><mml:mo rspace="4.2pt">,</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mo>+</mml:mo><mml:mfrac><mml:mn>5</mml:mn><mml:mn>12</mml:mn></mml:mfrac></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mn>48</mml:mn><mml:mn>29</mml:mn></mml:mfrac></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula> for the second, third, and fourth permutation rounds, respectively. The inverse of the computational efficiency (ICE), that is, the fraction of the unpruned computation that ENPP requires, is summarized for each permutation round in <xref ref-type="fig" rid="F2">Figure 2B</xref>. 
In <xref ref-type="fig" rid="F2">Figure 2B</xref>, ICE decreases more slowly than the remaining proportion shown in <xref ref-type="fig" rid="F2">Figure 2A</xref>, because the permutation times of preceding rounds accumulate in the computational efficiency. Compared to the ordinary unpruned permutation test, only about 7.4% of the computation time is needed at the 100th permutation round in both settings, because in these early rounds they have the same <italic>C<sub>prun</sub></italic> values and hence the same remaining proportions. However, as with the remaining proportion of features, the ratio between the ICE values of the two settings grows as the permutation rounds progress. For example, at the 1000th permutation round, ICE is 1.3% for <italic>p</italic><sub><italic>prun</italic></sub> = 5 &#x00D7; 10<sup>&#x2013;5</sup> and 1.2% for <italic>p</italic><sub><italic>prun</italic></sub> = 1 &#x00D7; 10<sup>&#x2013;7</sup>, whereas at the 10,000th round it is 0.23% for the former and 0.17% for the latter. Thus, the overall computational efficiency improves as the rounds progress because the remaining proportion of features shrinks, and a smaller <italic>p</italic><sub><italic>prun</italic></sub> requires less computation.</p>
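<p>The efficiency arithmetic for the first four rounds can be reproduced with a few lines of Python. The sketch below takes the per-round tested proportions quoted in the text (1, 1/2, 1/2, 5/12) as input; the function name is illustrative.</p>
<preformat>
```python
from fractions import Fraction


def efficiency_by_round(tested):
    # tested[k - 1]: proportion of features still tested at round k.
    cumulative = Fraction(0)
    efficiencies = []
    for k, proportion in enumerate(tested, start=1):
        cumulative += proportion
        # Unpruned cost after k rounds is k; ENPP cost is the running sum.
        efficiencies.append(Fraction(k) / cumulative)
    return efficiencies


tested = [Fraction(1), Fraction(1, 2), Fraction(1, 2), Fraction(5, 12)]
efficiency = efficiency_by_round(tested)
ice = [1 / e for e in efficiency]  # inverse of computational efficiency
print(efficiency)
print(ice)
```
</preformat>
<p>The resulting efficiencies are 1, 4/3, 3/2, and 48/29, so the ICE after four rounds is 29/48, or about 60% of the unpruned computation.</p>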
<p>We also assessed the type 1 error rate of non-associated features under the ENPP approach. For <inline-formula><mml:math id="INEQ67"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mn>1</mml:mn></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="INEQ68"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula>, we generated 10<sup>6</sup> and 5 &#x00D7; 10<sup>8</sup> non-associated features, respectively, from the Bernoulli distribution so that the expected number of false-positive features is 50 in both settings. We first applied the pruning process to the non-associated features and then the full permutation approach to the remaining unpruned features. After the full permutation approach had been applied, we counted how many non-associated features were found significant at the given significance levels. The type 1 error rates are summarized in <xref ref-type="fig" rid="F2">Figure 2C</xref>, showing that the ENPP approach controls the type 1 error well.</p>
</sec>
<sec id="S3.SS4">
<title>Real Data Analysis</title>
<p>We next applied our approach to a real genome-wide data set (Korea Association REsource: KARE), which has 327,872 SNPs from each of 8,842 individuals (<xref ref-type="bibr" rid="B8">Cho et al., 2009</xref>). In order to detect significant SNP features at the Bonferroni significance level in the data set, the ordinary permutation approach (without ENPP) requires at least (1/0.05) &#x00D7; 327,872<sup>2</sup> = 2.15 &#x00D7; 10<sup>12</sup> tests, which is computationally impractical. A pruning approach is therefore necessary when the permutation approach is applied to this data set. For the application of ENPP, we set <italic>p</italic><sub><italic>raw</italic></sub> = 0.05 and <italic>p</italic><sub><italic>prun</italic></sub> = <italic>p</italic><sub><italic>adj</italic></sub> = 0.05/327,872 = 1.52 &#x00D7; 10<sup>&#x2013;7</sup>, and the corresponding <italic>C<sub>prun</sub></italic> is shown in <xref ref-type="fig" rid="F3">Figure 3A</xref>. Here, we set the number of iterations to 100,000 because simulation analysis found that the remaining proportion of features was 3.7 &#x00D7; 10<sup>&#x2013;5</sup> at the 100,000th round, so that the expected count of remaining features was 3.7 &#x00D7; 10<sup>&#x2013;5</sup> &#x00D7; 327,872 = 12.13 if no feature were associated with the phenotype. We selected fasting plasma glucose (FPG) as the phenotype because its distribution is highly skewed (skewness = 5.32) and remains skewed (skewness = 2.71) (<xref ref-type="bibr" rid="B12">Kim, 2013</xref>) even after log-transformation. Consequently, we expected that results might differ between a parametric approach and a permutation approach. For the association analysis, we used age, gender, and living region as covariates, and we assumed that the genotype of each SNP has an additive effect on the phenotype. 
As a test statistic for the permutation test, we used a t-statistic for the genotype effect.</p>
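<p>The headline numbers for the KARE settings can be checked with simple arithmetic. The Python sketch below only restates values given in the text (327,872 SNPs, a raw level of 0.05, and a simulated remaining proportion of 3.7 &#x00D7; 10<sup>&#x2013;5</sup>); the variable names are illustrative.</p>
<preformat>
```python
n_snps = 327_872
alpha = 0.05

# Bonferroni-adjusted level used for both p_adj and p_prun.
p_adj = alpha / n_snps

# Lower bound on the number of tests for an unpruned permutation analysis:
# (1 / alpha) * n_snps permutations, each covering n_snps features.
unpruned_tests = (1 / alpha) * n_snps**2

# Expected number of features left after 100,000 rounds under the global null.
expected_remaining = 3.7e-5 * n_snps

print(f"{p_adj:.3g}, {unpruned_tests:.3g}, {expected_remaining:.2f}")
```
</preformat>
<p>These evaluate to roughly 1.52 &#x00D7; 10<sup>&#x2013;7</sup>, 2.15 &#x00D7; 10<sup>12</sup>, and 12.13, respectively, in agreement with the figures quoted above.</p>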
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p><bold>(A)</bold> Distribution of <italic>C</italic><sub><italic>prun</italic></sub> for the real data set with <italic>P</italic><sub><italic>prun</italic></sub> = 1.52 &#x00D7; 10<sup>&#x2013;7</sup>. <bold>(B)</bold> Proportion of tested features after pruning. <bold>(C)</bold> ICE for <italic>P</italic><sub><italic>prun</italic></sub> = 1.52 &#x00D7; 10<sup>&#x2013;7</sup>. Black lines are expected values from the simulation, and blue lines are observed values from the real data analysis.</p></caption>
<graphic xlink:href="fgene-11-00509-g003.tif"/>
</fig>
<p>Based on the expected remaining proportion of the features, we found ICE to be 2.4 &#x00D7; 10<sup>&#x2013;4</sup> at the 100,000th permutation round (<xref ref-type="fig" rid="F3">Figure 3C</xref>), meaning that ENPP needed only about 24 times as much computation as the parametric linear regression approach. This number of permutation tests can be done in a few days, even in a single thread. After running 100,000 rounds of ENPP on the real data set, we plotted the number of remaining features (<xref ref-type="fig" rid="F3">Figure 3B</xref>) and the ICE (<xref ref-type="fig" rid="F3">Figure 3C</xref>) in each round. Those results showed that 46 SNP features remained and that the ICE was 3.7 &#x00D7; 10<sup>&#x2013;4</sup>; because more features remained than the approximately 12 expected under the null, some SNP features were candidates for significance. For each of the 46 SNP features, we performed a permutation test with 3&#x00D7;10<sup>7</sup>&#x2212;1 permutations to provide a <italic>p</italic>-value precise enough not only for the Bonferroni correction but also for the genome-wide significance level of 5 &#x00D7; 10<sup>&#x2013;8</sup> (<xref ref-type="bibr" rid="B27">Xu et al., 2014</xref>). We found that five SNP features passed the Bonferroni threshold, and two of them also reached genome-wide significance (<xref ref-type="table" rid="T1">Table 1</xref>). By comparison, the parametric approach found four SNPs at the Bonferroni threshold, two of which reached genome-wide significance. However, only three SNPs overlapped between the two approaches at the former threshold, and only one at the latter. To identify substantial differences in <italic>p</italic>-values between the two approaches, we used an exact binomial test (<xref ref-type="bibr" rid="B9">Clopper and Pearson, 1934</xref>) that took the <italic>p</italic>-value from the parametric approach as the null-hypothesis value for the permutation result. 
From the test, we found that only one SNP (rs7197218G on chromosome 16) showed a significant difference between the two results (<xref ref-type="table" rid="T1">Table 1</xref>). For this SNP, the permutation approach gave a more conservative result; the discrepancy may come from type 1 error inflation in the parametric test in the presence of a very low minor allele frequency and large differences in variance between FPG values with and without the minor allele (<xref ref-type="bibr" rid="B28">Zimmerman, 2004</xref>).</p>
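<p>The exact binomial comparison can be illustrated with a short Python sketch. Several details here are assumptions rather than statements from the text: the exceedance count is taken to be the permutation <italic>p</italic>-value multiplied by 3 &#x00D7; 10<sup>7</sup> (for example, 3 for a permutation <italic>p</italic>-value of 1.00 &#x00D7; 10<sup>&#x2013;7</sup>), the number of trials is taken to be 3 &#x00D7; 10<sup>7</sup>, and the two-sided rule sums the probabilities of all outcomes that are no more likely than the observed one, as R's <monospace>binom.test</monospace> does. The function name is hypothetical.</p>
<preformat>
```python
import math


def exact_binomial_two_sided(x, n, p0, rel_tol=1e-7):
    # Two-sided exact binomial p-value for x successes in n trials under
    # success probability p0. Assumes n * p0 is modest, so that P(X = 0)
    # does not underflow and the probability table stays short.
    ratio = p0 / (1.0 - p0)
    pmf = math.exp(n * math.log1p(-p0))  # P(X = 0)
    pmfs = [pmf]
    k = 0
    # Extend the table past both x and the mean until the tail is negligible.
    while n > k and not (k > max(x, n * p0) and 1e-300 > pmf):
        pmf *= (n - k) / (k + 1) * ratio  # P(X = k + 1) from P(X = k)
        k += 1
        pmfs.append(pmf)
    # Sum every outcome that is no more likely than the observed one.
    cutoff = pmfs[x] * (1.0 + rel_tol)
    return min(1.0, sum(q for q in pmfs if cutoff >= q))


n_perm = 30_000_000
print(exact_binomial_two_sided(3, n_perm, 1.54e-7))   # rs6456368C row
print(exact_binomial_two_sided(22, n_perm, 4.81e-8))  # rs7197218G row
```
</preformat>
<p>Under these assumptions, the first call returns approximately 0.64 and the second a value below 2.2 &#x00D7; 10<sup>&#x2013;16</sup>, in line with the last column of <xref ref-type="table" rid="T1">Table 1</xref>.</p>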
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>Six SNPs selected by either the parametric (linear regression) or the non-parametric (ENPP) test at a Bonferroni significance level of <italic>p</italic> = 1.52 &#x00D7; 10<sup>&#x2013;7</sup>.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">CHR</td>
<td valign="top" align="center">SNP id</td>
<td valign="top" align="center">MAF</td>
<td valign="top" align="center"><italic>P</italic>-value from linear regression</td>
<td valign="top" align="center"><italic>P</italic>-value from permutation</td>
<td valign="top" align="center"><italic>P</italic>-value from comparison between the two values</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">6</td>
<td valign="top" align="center">rs9348440T</td>
<td valign="top" align="center">0.478</td>
<td valign="top" align="center">1.63 &#x00D7; 10<sup>&#x2013;7</sup></td>
<td valign="top" align="center">1.33 &#x00D7; 10<sup>&#x2013;7</sup></td>
<td valign="top" align="center">1</td>
</tr>
<tr>
<td valign="top" align="left">6</td>
<td valign="top" align="center">rs6456368C</td>
<td valign="top" align="center">0.480</td>
<td valign="top" align="center">1.54 &#x00D7; 10<sup>&#x2013;7</sup></td>
<td valign="top" align="center">1.00 &#x00D7; 10<sup>&#x2013;7</sup></td>
<td valign="top" align="center">0.640</td>
</tr>
<tr>
<td valign="top" align="left">6</td>
<td valign="top" align="center">rs10946398C</td>
<td valign="top" align="center">0.479</td>
<td valign="top" align="center">8.35 &#x00D7; 10<sup>&#x2013;8</sup></td>
<td valign="top" align="center">6.67 &#x00D7; 10<sup>&#x2013;8</sup></td>
<td valign="top" align="center">1</td>
</tr>
<tr>
<td valign="top" align="left">6</td>
<td valign="top" align="center">rs7754840C</td>
<td valign="top" align="center">0.479</td>
<td valign="top" align="center">4.93 &#x00D7; 10<sup>&#x2013;8</sup></td>
<td valign="top" align="center">3.33 &#x00D7; 10<sup>&#x2013;8</sup></td>
<td valign="top" align="center">1</td>
</tr>
<tr>
<td valign="top" align="left">6</td>
<td valign="top" align="center">rs9460546G</td>
<td valign="top" align="center">0.481</td>
<td valign="top" align="center">5.45 &#x00D7; 10<sup>&#x2013;8</sup></td>
<td valign="top" align="center">3.33 &#x00D7; 10<sup>&#x2013;8</sup></td>
<td valign="top" align="center">1</td>
</tr>
<tr>
<td valign="top" align="left">16</td>
<td valign="top" align="center">rs7197218G</td>
<td valign="top" align="center">0.014</td>
<td valign="top" align="center">4.81 &#x00D7; 10<sup>&#x2013;8</sup></td>
<td valign="top" align="center">7.33 &#x00D7; 10<sup>&#x2013;7</sup></td>
<td valign="top" align="center">&#x003C;2.2 &#x00D7; 10<sup>&#x2013;16</sup></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<attrib><italic>Here, we provide information for SNP features such as chromosome, SNP id, and minor allele frequency (MAF) and the <italic>p</italic>-values from both tests. In the last column of the table, we also include the results of an exact binomial test for permutation results based on the null hypothesis that the <italic>p</italic>-value of the permutation test is the same as the results from the parametric approach.</italic></attrib>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec id="S4">
<title>Discussion</title>
<p>For the analysis of multi-omics data, the permutation test has been widely used because it is non-parametric and flexible. However, its main drawback is that it may require so many tests as to be infeasible, especially for data sets with large numbers of features and a Bonferroni-corrected significance level. To resolve this issue, we proposed a systematic strategy, ENhanced Permutation tests via multiple Pruning (ENPP), which uses the idea of pruning. ENPP examines the features at every permutation round and removes those that have little chance of being significant. Our empirical study showed that the ENPP method could remove about 50% of the features at the first permutation round and required only 7.4% of the computation time needed by an unpruned approach at the 100th permutation round. Moreover, in real data analysis on a data set of 327,872 SNP features, our approach greatly reduced the computational burden to a feasible level, and the analysis results seemed more reliable than those from a parametric approach because they were not affected by a specific assumption about the null distribution. Interestingly, we found that the number of tests conducted in the ENPP process was much smaller than the number in the final evaluation of the 46 SNP features to obtain precise <italic>p</italic>-values. In the pruning process of real GWAS data, about 1.2 &#x00D7; 10<sup>7</sup> permutations were needed, whereas the full permutation analysis required about 1.4 &#x00D7; 10<sup>9</sup> iterations. Since the pruning process and the full permutation process are performed on each feature independently, they can easily be parallelized. We believe that parallelism has a large impact on the full permutation process in particular, because it took much more computing time than the pruning process in our real data analysis. 
Therefore, with the help of parallel computing, our ENPP approach should be able to handle, without excessive computational burden, larger data sets such as human methylation data with 2 &#x00D7; 10<sup>7</sup> CpG site features.</p>
<p>Our ENPP algorithm is also flexible with respect to the pruning process. Researchers can modify <italic>p</italic><sub><italic>adj</italic></sub> and <italic>p</italic><sub><italic>prun</italic></sub> as needed. In this study, we set <italic>p</italic><sub><italic>adj</italic></sub> = <italic>p</italic><sub><italic>prun</italic></sub>, with <italic>p</italic><sub><italic>adj</italic></sub> from a Bonferroni correction, and conducted 100,000 ENPP permutations. These settings can be interpreted in terms of the expected number of significant features and the number of tests per feature: the sum of the actual significance levels, calculated for <italic>C</italic><sub><italic>prun</italic></sub>, from the first round to the 100,000th round is 2.66 &#x00D7; 10<sup>&#x2013;3</sup>, which admits 0.05/(2.66&#x00D7;10<sup>&#x2013;3</sup>) &#x2248; 18 truly significant features at the Bonferroni threshold. In other words, if there are 18 or fewer significant features at p = 1.52&#x00D7;10<sup>&#x2013;7</sup>, we can keep the probability of falsely pruning any significant feature under 0.05. This assumption about the number of significant features is reasonable, considering that only a few features generally satisfy the Bonferroni cutoff and that our parametric and permutation analyses found only four and five SNPs, respectively. In addition, researchers may sometimes be interested not only in features at a specific Bonferroni significance level but also in the <italic>p</italic>-value distribution of all features. 
For this purpose, ENPP can be applied after some number of unpruned permutation rounds, such as 100, so that more precise <italic>p</italic>-values can be obtained, even for non-significant features, and the results can be used in false discovery rate (FDR) approaches (<xref ref-type="bibr" rid="B4">Benjamini and Hochberg, 1995</xref>) or in combining <italic>p</italic>-value approaches for some group-wise testing such as gene- or pathway-wise significance tests (<xref ref-type="bibr" rid="B25">Subramanian et al., 2005</xref>). Our ENPP approach will help many researchers achieve precise <italic>p</italic>-values in a feasible time, even for datasets with a large number of features. A brief R script for performing ENPP is provided for SNPs at <ext-link ext-link-type="uri" xlink:href="http://statgen.snu.ac.kr/software/ENPP">http://statgen.snu.ac.kr/software/ENPP</ext-link>. This will enable more accurate decisions based on the statistical results.</p>
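<p>The bound on falsely pruned features discussed earlier in this section reads as a Bonferroni-type union bound: with a per-feature false-pruning probability of at most 2.66 &#x00D7; 10<sup>&#x2013;3</sup>, the probability of falsely pruning any of <italic>m</italic> truly significant features is at most <italic>m</italic> times that value. The Python sketch below works through this arithmetic under that interpretation, which is an assumption about the reasoning rather than a statement from the text; the variable names are illustrative.</p>
<preformat>
```python
alpha = 0.05
summed_level = 2.66e-3  # sum of per-round significance levels, from the text

# Largest number of truly significant features for which the union bound
# m * summed_level stays at or below alpha.
max_true_features = int(alpha / summed_level)
bound_at_max = max_true_features * summed_level
bound_one_more = (max_true_features + 1) * summed_level

print(max_true_features, round(bound_at_max, 4), round(bound_one_more, 4))
```
</preformat>
<p>With these inputs the largest admissible count is 18: the bound is about 0.048 for 18 features and exceeds 0.05 for 19.</p>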
</sec>
<sec id="S5">
<title>Data Availability Statement</title>
<p>The data will be publicly distributed by the Distribution Desk of the Korea Biobank Network (<ext-link ext-link-type="uri" xlink:href="https://koreabiobank.re.kr/">https://koreabiobank.re.kr/</ext-link>), to whom data requests should be directly made. Any inquiries should be sent to <email>admin@koreabiobank.re.kr</email>.</p>
</sec>
<sec id="S6">
<title>Ethics Statement</title>
<p>The study was reviewed and approved by the Institutional Review Board of Seoul National University (IRB No. E1908/001-004). The patients/participants provided their written informed consent to participate in this study.</p>
</sec>
<sec id="S7">
<title>Author Contributions</title>
<p>SL, IH, and TP developed the algorithm. SL conducted the simulation study and wrote the manuscript. IH conducted real data analysis and wrote the manuscript. TP supervised the whole research project. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="financial-disclosure">
<p><bold>Funding.</bold> This work was supported by the Bio-Synergy Research Project (2013M3A9C4078158) of the Ministry of Science, ICT and Future Planning through the National Research Foundation.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Adusumalli</surname> <given-names>S.</given-names></name> <name><surname>Omar</surname> <given-names>M. F. M.</given-names></name> <name><surname>Soong</surname> <given-names>R.</given-names></name> <name><surname>Benoukraf</surname> <given-names>T.</given-names></name></person-group> (<year>2014</year>). <article-title>Methodological aspects of whole-genome bisulfite sequencing analysis.</article-title> <source><italic>Brief. Bioinform.</italic></source> <volume>16</volume> <fpage>369</fpage>&#x2013;<lpage>379</lpage>. <pub-id pub-id-type="doi">10.1093/bib/bbu016</pub-id> <pub-id pub-id-type="pmid">24867940</pub-id></citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Anderson</surname> <given-names>M.</given-names></name></person-group> (<year>2005</year>). <source><italic>PERMANOVA: A fortran Computer Program For Permutational Multivariate Analysis Of Variance.</italic></source> <publisher-loc>New Zealand</publisher-loc>: <publisher-name>University of Auckland</publisher-name>.</citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Annis</surname> <given-names>D. H.</given-names></name></person-group> (<year>2005</year>). <source><italic>Permutation, Parametric, And Bootstrap Tests Of Hypotheses.</italic></source> <publisher-loc>Milton Park</publisher-loc>: <publisher-name>Taylor &#x0026; Francis</publisher-name>.</citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Benjamini</surname> <given-names>Y.</given-names></name> <name><surname>Hochberg</surname> <given-names>Y.</given-names></name></person-group> (<year>1995</year>). <article-title>Controlling the false discovery rate: a practical and powerful approach to multiple testing.</article-title> <source><italic>J. R. Statist. Soc. Ser. B</italic></source> <volume>57</volume> <fpage>289</fpage>&#x2013;<lpage>300</lpage>. <pub-id pub-id-type="doi">10.1111/j.2517-6161.1995.tb02031.x</pub-id></citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bibikova</surname> <given-names>M.</given-names></name> <name><surname>Barnes</surname> <given-names>B.</given-names></name> <name><surname>Tsan</surname> <given-names>C.</given-names></name> <name><surname>Ho</surname> <given-names>V.</given-names></name> <name><surname>Klotzle</surname> <given-names>B.</given-names></name> <name><surname>Le</surname> <given-names>J. M.</given-names></name><etal/></person-group> (<year>2011</year>). <article-title>High density DNA methylation array with single CpG site resolution.</article-title> <source><italic>Genomics</italic></source> <volume>98</volume> <fpage>288</fpage>&#x2013;<lpage>295</lpage>. <pub-id pub-id-type="doi">10.1016/j.ygeno.2011.07.007</pub-id> <pub-id pub-id-type="pmid">21839163</pub-id></citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Browning</surname> <given-names>B. L.</given-names></name></person-group> (<year>2008</year>). <article-title>PRESTO: rapid calculation of order statistic distributions and multiple-testing adjusted P-values via permutation for one and two-stage genetic association studies.</article-title> <source><italic>BMC Bioinform.</italic></source> <volume>9</volume>:<issue>309</issue>. <pub-id pub-id-type="doi">10.1186/1471-2105-9-309</pub-id> <pub-id pub-id-type="pmid">18620604</pub-id></citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>J.</given-names></name> <name><surname>Bittinger</surname> <given-names>K.</given-names></name> <name><surname>Charlson</surname> <given-names>E. S.</given-names></name> <name><surname>Hoffmann</surname> <given-names>C.</given-names></name> <name><surname>Lewis</surname> <given-names>J.</given-names></name> <name><surname>Wu</surname> <given-names>G. D.</given-names></name></person-group> (<year>2012</year>). <article-title>Associating microbiome composition with environmental covariates using generalized UniFrac distances.</article-title> <source><italic>Bioinformatics</italic></source> <volume>28</volume> <fpage>2106</fpage>&#x2013;<lpage>2113</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/bts342</pub-id> <pub-id pub-id-type="pmid">22711789</pub-id></citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cho</surname> <given-names>Y. S.</given-names></name> <name><surname>Go</surname> <given-names>M. J.</given-names></name> <name><surname>Kim</surname> <given-names>Y. J.</given-names></name> <name><surname>Heo</surname> <given-names>J. Y.</given-names></name> <name><surname>Oh</surname> <given-names>J. H.</given-names></name> <name><surname>Ban</surname> <given-names>H.-J.</given-names></name><etal/></person-group> (<year>2009</year>). <article-title>A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits.</article-title> <source><italic>Nat. Genet.</italic></source> <volume>41</volume>:<issue>527</issue>. <pub-id pub-id-type="doi">10.1038/ng.357</pub-id> <pub-id pub-id-type="pmid">19396169</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Clopper</surname> <given-names>C. J.</given-names></name> <name><surname>Pearson</surname> <given-names>E. S.</given-names></name></person-group> (<year>1934</year>). <article-title>The use of confidence or fiducial limits illustrated in the case of the binomial.</article-title> <source><italic>Biometrika</italic></source> <volume>26</volume> <fpage>404</fpage>&#x2013;<lpage>413</lpage>. <pub-id pub-id-type="doi">10.1093/biomet/26.4.404</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Greene</surname> <given-names>C. S.</given-names></name> <name><surname>Himmelstein</surname> <given-names>D. S.</given-names></name> <name><surname>Nelson</surname> <given-names>H. H.</given-names></name> <name><surname>Kelsey</surname> <given-names>K. T.</given-names></name> <name><surname>Williams</surname> <given-names>S. M.</given-names></name> <name><surname>Andrew</surname> <given-names>A. S.</given-names></name><etal/></person-group> (<year>2010</year>). <article-title>Enabling personal genomics with an explicit test of epistasis.</article-title> <source><italic>Biocomputing</italic></source> <volume>2010</volume> <fpage>327</fpage>&#x2013;<lpage>336</lpage>. <pub-id pub-id-type="doi">10.1142/9789814295291_0035</pub-id> <pub-id pub-id-type="pmid">19908385</pub-id></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jeong</surname> <given-names>H.-H.</given-names></name> <name><surname>Leem</surname> <given-names>S.</given-names></name> <name><surname>Wee</surname> <given-names>K.</given-names></name> <name><surname>Sohn</surname> <given-names>K.-A.</given-names></name></person-group> (<year>2015</year>). <article-title>Integrative network analysis for survival-associated gene-gene interactions across multiple genomic profiles in ovarian cancer.</article-title> <source><italic>J. Ovar. Res.</italic></source> <volume>8</volume>:<issue>42</issue>.</citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>H. Y.</given-names></name></person-group> (<year>2013</year>). <article-title>Statistical notes for clinical researchers: assessing normal distribution (2) using skewness and kurtosis.</article-title> <source><italic>Restor. Dent. Endod.</italic></source> <volume>38</volume> <fpage>52</fpage>&#x2013;<lpage>54</lpage>.</citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>Y.</given-names></name> <name><surname>Lee</surname> <given-names>S.</given-names></name> <name><surname>Choi</surname> <given-names>S.</given-names></name> <name><surname>Jang</surname> <given-names>J.-Y.</given-names></name> <name><surname>Park</surname> <given-names>T.</given-names></name></person-group> (<year>2018</year>). <article-title>Hierarchical structural component modeling of microRNA-mRNA integration analysis.</article-title> <source><italic>BMC Bioinform.</italic></source> <volume>19</volume>:<issue>75</issue>. <pub-id pub-id-type="doi">10.1186/s12859-018-2070-0</pub-id> <pub-id pub-id-type="pmid">29745843</pub-id></citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>S.</given-names></name> <name><surname>Choi</surname> <given-names>S.</given-names></name> <name><surname>Kim</surname> <given-names>Y. J.</given-names></name> <name><surname>Kim</surname> <given-names>B.-J.</given-names></name> <collab>T2D-GENES Consortium</collab> <name><surname>Hwang</surname> <given-names>H.</given-names></name><etal/></person-group> (<year>2016</year>). <article-title>Pathway-based approach using hierarchical components of collapsed rare variants.</article-title> <source><italic>Bioinformatics</italic></source> <volume>32</volume> <fpage>i586</fpage>&#x2013;<lpage>i594</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btw425</pub-id> <pub-id pub-id-type="pmid">27587678</pub-id></citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>S. M.</given-names></name> <name><surname>Du</surname> <given-names>P.</given-names></name> <name><surname>Huber</surname> <given-names>W.</given-names></name> <name><surname>Kibbe</surname> <given-names>W. A.</given-names></name></person-group> (<year>2008</year>). <article-title>Model-based variance-stabilizing transformation for Illumina microarray data.</article-title> <source><italic>Nucleic Acids Res.</italic></source> <volume>36</volume>:<issue>e11</issue>. <pub-id pub-id-type="doi">10.1093/nar/gkm1075</pub-id> <pub-id pub-id-type="pmid">18178591</pub-id></citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Madsen</surname> <given-names>B. E.</given-names></name> <name><surname>Browning</surname> <given-names>S. R.</given-names></name></person-group> (<year>2009</year>). <article-title>A groupwise association test for rare mutations using a weighted sum statistic.</article-title> <source><italic>PLoS Genet.</italic></source> <volume>5</volume>:<issue>e1000384</issue>. <pub-id pub-id-type="doi">10.1371/journal.pgen.1000384</pub-id> <pub-id pub-id-type="pmid">19214210</pub-id></citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Manolio</surname> <given-names>T. A.</given-names></name></person-group> (<year>2010</year>). <article-title>Genome-wide association studies and assessment of the risk of disease.</article-title> <source><italic>New Engl. J. Med.</italic></source> <volume>363</volume> <fpage>166</fpage>&#x2013;<lpage>176</lpage>.</citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>McLachlan</surname> <given-names>G.</given-names></name> <name><surname>Do</surname> <given-names>K.-A.</given-names></name> <name><surname>Ambroise</surname> <given-names>C.</given-names></name></person-group> (<year>2005</year>). <source><italic>Analyzing Microarray Gene Expression Data.</italic></source> <publisher-loc>Hoboken, NJ</publisher-loc>: <publisher-name>John Wiley &#x0026; Sons</publisher-name>.</citation></ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Oh</surname> <given-names>S.</given-names></name> <name><surname>Huh</surname> <given-names>I.</given-names></name> <name><surname>Lee</surname> <given-names>S. Y.</given-names></name> <name><surname>Park</surname> <given-names>T.</given-names></name></person-group> (<year>2016</year>). <article-title>Analysis of multiple related phenotypes in genome-wide association studies.</article-title> <source><italic>J. Bioinform. Comput. Biol.</italic></source> <volume>14</volume>:<issue>1644005</issue>. <pub-id pub-id-type="doi">10.1142/s0219720016440054</pub-id> <pub-id pub-id-type="pmid">27774872</pub-id></citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pahl</surname> <given-names>R.</given-names></name> <name><surname>Sch&#x00E4;fer</surname> <given-names>H.</given-names></name></person-group> (<year>2010</year>). <article-title>PERMORY: an LD-exploiting permutation test algorithm for powerful genome-wide association testing.</article-title> <source><italic>Bioinformatics</italic></source> <volume>26</volume> <fpage>2093</fpage>&#x2013;<lpage>2100</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btq399</pub-id> <pub-id pub-id-type="pmid">20605926</pub-id></citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Park</surname> <given-names>Y.</given-names></name> <name><surname>Wu</surname> <given-names>H.</given-names></name></person-group> (<year>2016</year>). <article-title>Differential methylation analysis for BS-seq data under general experimental design.</article-title> <source><italic>Bioinformatics</italic></source> <volume>32</volume> <fpage>1446</fpage>&#x2013;<lpage>1453</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btw026</pub-id> <pub-id pub-id-type="pmid">26819470</pub-id></citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Paulson</surname> <given-names>J. N.</given-names></name> <name><surname>Pop</surname> <given-names>M.</given-names></name> <name><surname>Bravo</surname> <given-names>H. C.</given-names></name></person-group> (<year>2011</year>). <article-title>Metastats: an improved statistical method for analysis of metagenomic data.</article-title> <source><italic>Genome Biol.</italic></source> <volume>12</volume>:<issue>17</issue>.</citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pitman</surname> <given-names>E. J.</given-names></name></person-group> (<year>1937</year>). <article-title>Significance tests which may be applied to samples from any populations.</article-title> <source><italic>Suppl. J. R. Statist. Soc.</italic></source> <volume>4</volume> <fpage>119</fpage>&#x2013;<lpage>130</lpage>.</citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ritchie</surname> <given-names>M. D.</given-names></name> <name><surname>Hahn</surname> <given-names>L. W.</given-names></name> <name><surname>Roodi</surname> <given-names>N.</given-names></name> <name><surname>Bailey</surname> <given-names>L. R.</given-names></name> <name><surname>Dupont</surname> <given-names>W. D.</given-names></name> <name><surname>Parl</surname> <given-names>F. F.</given-names></name><etal/></person-group> (<year>2001</year>). <article-title>Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer.</article-title> <source><italic>Am. J. Hum. Genet.</italic></source> <volume>69</volume> <fpage>138</fpage>&#x2013;<lpage>147</lpage>. <pub-id pub-id-type="doi">10.1086/321276</pub-id> <pub-id pub-id-type="pmid">11404819</pub-id></citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Subramanian</surname> <given-names>A.</given-names></name> <name><surname>Tamayo</surname> <given-names>P.</given-names></name> <name><surname>Mootha</surname> <given-names>V. K.</given-names></name> <name><surname>Mukherjee</surname> <given-names>S.</given-names></name> <name><surname>Ebert</surname> <given-names>B. L.</given-names></name> <name><surname>Gillette</surname> <given-names>M. A.</given-names></name> <name><surname>Paulovich</surname> <given-names>A.</given-names></name><etal/></person-group> (<year>2005</year>). <article-title>Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.</article-title> <source><italic>Proc. Natl. Acad. Sci. U.S.A.</italic></source> <volume>102</volume> <fpage>15545</fpage>&#x2013;<lpage>15550</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.0506580102</pub-id> <pub-id pub-id-type="pmid">16199517</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Thygesen</surname> <given-names>H. H.</given-names></name> <name><surname>Zwinderman</surname> <given-names>A. H.</given-names></name></person-group> (<year>2004</year>). <article-title>Comparing transformation methods for DNA microarray data.</article-title> <source><italic>BMC Bioinform.</italic></source> <volume>5</volume>:<issue>77</issue>. <pub-id pub-id-type="doi">10.1186/1471-2105-5-77</pub-id> <pub-id pub-id-type="pmid">15202953</pub-id></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>C.</given-names></name> <name><surname>Tachmazidou</surname> <given-names>I.</given-names></name> <name><surname>Walter</surname> <given-names>K.</given-names></name> <name><surname>Ciampi</surname> <given-names>A.</given-names></name> <name><surname>Zeggini</surname> <given-names>E.</given-names></name> <name><surname>Greenwood</surname> <given-names>C. M. T.</given-names></name></person-group> (<year>2014</year>). <article-title>Estimating genome-wide significance for whole-genome sequencing studies.</article-title> <source><italic>Genet. Epidemiol.</italic></source> <volume>38</volume> <fpage>281</fpage>&#x2013;<lpage>290</lpage>. <pub-id pub-id-type="doi">10.1002/gepi.21797</pub-id> <pub-id pub-id-type="pmid">24676807</pub-id></citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zimmerman</surname> <given-names>D. W.</given-names></name></person-group> (<year>2004</year>). <article-title>Inflation of type I error rates by unequal variances associated with parametric, nonparametric, and rank-transformation tests.</article-title> <source><italic>Psicologica</italic></source> <volume>25</volume> <fpage>103</fpage>&#x2013;<lpage>133</lpage>.</citation></ref>
</ref-list>
</back>
</article>