Efficient Mining of Variants From Trios for Ventricular Septal Defect Association Study

Ventricular septal defect (VSD) is a fatal congenital heart disease showing severe consequence in affected infants. Early diagnosis plays an important role, particularly through genetic variants. Existing panel-based approaches of variants mining suffer from shortage of large panels, costly sequencing, and missing rare variants. Although a trio-based method alleviates these limitations to some extent, it is agnostic to novel mutations and computational intensive. Considering these limitations, we are studying a novel variants mining algorithm from trio-based sequencing data and apply it on a VSD trio to identify associated mutations. Our approach starts with irrelevant k-mer filtering from sequences of a trio via a newly conceived coupled Bloom Filter, then corrects sequencing errors by using a statistical approach and extends kept k-mers into long sequences. These extended sequences are used as input for variants needed. Later, the obtained variants are comprehensively analyzed against existing databases to mine VSD-related mutations. Experiments show that our trio-based algorithm narrows down candidate coding genes and lncRNAs by about 10- and 5-folds comparing with single sequence-based approaches, respectively. Meanwhile, our algorithm is 10 times faster and 2 magnitudes memory-frugal compared with existing state-of-the-art approach. By applying our approach to a VSD trio, we fish out an unreported gene—CD80, a combination of two genes—MYBPC3 and TRDN and a lncRNA—NONHSAT096266.2, which are highly likely to be VSD-related.


INTRODUCTION
Ventricular septal defect (VSD) is a major kind of congenital heart disease (CHD), constituting about 20% of all CHD cases (Spicer et al., 2014). By taking conservative treatment, mortality is around 90% to 95%, whereas via surgery, this rate reduces to 19% to 60% (Serpytis et al., 2015). Very often, diagnosis of a VSD patient is at its late stage due to the obvious communication obstacles in infants; this poses a need for early diagnosis, particularly through genetic variants.
Mining genetic variants and associating them with diseases is a hot topic, by which thousands of disease-associated variants have been identified (The International HapMap 3 Consortium, 2010; The 1000 Genomes Project Consortium et al., 2015). Obtaining these findings usually starts with a panel containing hundreds to thousands of patients diagnosed as having the same specific disease; later their genetic materials are extracted and sequenced. This is followed by disease-associated variants mining through a series of analytic procedures. Using this protocol, 89,251 single-nucleotide polymorphism (SNP) trait associations have been successfully pinpointed according to genome-wide association study (GWAS) catalog (MacArthur et al., 2017), including more than 400 CHD-related genes (Jin et al., 2017a). Although an association study is fruitful and promising, many issues weaken its applicability. First, panelbased association studies only identify common variants, and rare variants are overlooked due to low statistical significance. Thus, it requires large number of samples to be collected, i.e., hundreds, even thousands of cases. Second, almost all existing studies mine a one-to-one correspondence between genes and diseases rather than a many-to-many scheme, which is pretty challenging. Unfortunately, majority of diseases are caused by many mutations of genes. For instance, more than 400 genes have been discovered to be associated with CHD (Jin et al., 2017a), and more than 700 genes are involved in adult height (Wood et al., 2014), and even much more (Marouli et al., 2017). Third, it is costly to obtain the whole DNA (deoxyribonucleic acid) sequence of a sample. Although the ever increasing throughput and decreasing cost have made whole-genome sequencing possible for general research, it still costs a few hundreds to a thousand dollars for a single genome. To partially overcome the aforementioned limitations from single sequencing (SS) data, trio-based sequencing emerges.
Typically, a trio usually contains two parents and one child. This trio-based approach is effective for identifying diseaseassociated genes according to the basic rule of inheritance. It is also powerful to pinpoint de novo mutations without a large panel. Various studies have been conducted to identify diseaseassociated genes by using trio-sequencing (TS). For instance, a trio-based exome sequencing is used to identify de novo mutations in early-onset high myopia (Jin et al., 2017b), and ~440 CHD-related genes have been discovered based on 2,645 trios (Jin et al., 2017a). The typical procedure of using trios to identify variants is mapping-calling-filtering, i.e., mapping all sequences of each individual from a trio to a reference genome, calling variants based on mapped sequences, and filtering out variants shared by members of the trio. Intuitively, this protocol is inefficient to identify de novo mutations from child sequences. Obviously, a large portion of sequences have no contribution to variant calling, which have been considered during the whole processes for all samples within the trio. To solve this problem, we propose a novel idea of calling de novo variants from a trio and have applied it to identified VSD-related genetic variants, including coding genes and long non-coding RNAs (lncRNAs).
Our approach starts from a trio with a child diagnosed as having VSD but with healthy parents. Later, unique k-mers (k-length consecutive bases from a genomic sequence) belonging to the child only are fished out through a newly proposed counted k-mer-encoding algorithm. This is followed by sequence error correction and k-mer extension before mapping to a reference genome. Finally, variants are fished out and analyzed against existing databases to mine VSD-related coding genes and lncRNAs.

METHODS
Our approach is composed of two major parts: TS-based variant mining and VSD-related variant filtering.

Variant Mining
Unlike conventional mapping-calling-filtering approach of variant identification, e.g., SAMtools GATK (DePristo et al., 2011), we conceive a novel idea of de novo variants identification algorithm from a trio achieving good computation efficiency. Our approach contains four steps: k-mer filtering, k-mer extension, and variant identification. Details are shown below.

k-mer Filtering
Let a trio be R f , R m , R c , representing the reads of the father (R f ), the reads of the mother (R m ), and the reads of the child (R c , suppose only one child is available); the set of k-mers contained in a sample is K f , K m , and K c for father, mother, and child, respectively. Herein, we mean each k-mer having its count (the times it appears within the sequenced data) available, i.e., a k-mer, say κ, is a touple (s κ , f κ ), where s κ is the k-length string of κ, and f κ is its count. To fish out de novo mutations from K c , we go through all the k-mers of K c and check them with K f and K m . In case the count ratio of a k-mer between both parents and the child is less than a threshold (τ 0 ), the k-mer is kept as a variant-containing candidate.
It seems trivial to filter out large amount of k-mers shared between K f /K m and K c . However, the number of k-mers obtained from a whole human genome sequencing reads is usually too large to fit into a main memory, not to mention putting them together. For instance, the 31-mers having a count larger than one of the HapMap sample NA12878 ((https://www.ncbi.nlm. nih.gov/sra/ERR091571/)) take 90-Gb space on disk. To solve this problem, we have designed a novel coupled Bloom Filterbased algorithm achieving high memory saving ratio and good retrieval efficiency (Jiang et al., 2019). Let f max be the maximum frequency in K, which can be represented by at most h bits (in binary). We take the following steps to represent K: where binary (f κ ) h is the binary representation of f κ via h bits, and binary (f κ ) h [i] returns the value of the ith bit. 4. Repeat Step 3 above until all k-mers are inserted.
Based on the above steps, K f , K m , and K c can be saved into B f , B m , and B c economically; more details are shown in Jiang et al. (2019).
Based on above algorithm, we are able to store K f , K m , and K c within a memory simultaneously and compute the ratio of a k-mer between a parent and the child efficiently. Note that, the time efficiency of k-mer retrieval from a coupled Bloom Filter mainly comes from the hash operation, which is in O(1) time complexity.
Due to sequencing bias, the k-mers are error-prone. To mitigate the impact of errors on variants identification, we perform error correction before further analysis (Zhao et al., 2018). For a k-mer κ in K x , we search its neighbors N κ from B x , where x ∈{f, m, c}. A neighbor of κ is defined as the one having edit distance of 1 from κ. Later, a z-score z κ is calculated from ′ ∪ = N N κ κ κ { } , where z κ = (f κ -μ)/σ, μ is mean frequency of k-mers in ′ N κ , and σ is their standard deviation. We consider κ is error-free when z κ > z 0 and f κ > f 0 . In this study, z 0 = 0.8 and f 0 = 4. More details are presented in Zhao et al. (2018).
Data: (K f , K m , K c ), k-mers of a trio; H, a hash function vector Result:

k-mer Extension
A k-mer is usually not long enough to uniquely map to a specific location of a reference genome. Hence, extending a k-mer into a long sequence is necessary before mapping. To this end, we take a candidate variation-containing k-mer as seed, and elongate the k-mer to both side. Taking right-hand extension, each time one base is attached to the right of the current string s, i.e., sʹ = s ⋅ x, x ∈{A, C, G, T}, and the k length suffix of sʹ, i.e., suffix ′ In case the suffix is absent, the extension will be altered by another base, or terminated if all alternatives have failed. The left-hand side extension is similar to the right-hand extension but with opposite direction. An extension will be terminated in case the length limitation is reached or multiple extensions are available. We set the length limitation to 1,000 in this study. Extension details are shown in Algorithm 2.

Variant Identification
All extended k-mers are mapped to GRCh38/hg38 by BWA (Li and Durbin, 2009), and variants as well as their position are pinpointed by using SAMtools ) and the best practice of GATK (DePristo et al., 2011). Unlike most existing approach that uses read coverage to filter out low confidence variants, we use previously identified variant-containing k-mers (from the first step and with count included) to refine the obtained variants. More precisely, a variant is kept if it satisfies the following criteria: 1) a k-mer (formed by the reference genome and the variant jointly) containing the variant can be found from the set of kept k-mers obtained from the first step; 2) the sum of the count of all k-mer covering the variant is not less than 3. Note that the first criterion is necessary because extended k-mers introduce additional variants that are not unique to the child.

Variant Filtering
We focus on VSD-related variants; thus, those obtained from the previous step undergo filtering to fish out VSD-related variants. Two types of variants are considered, viz., contained in coding and non-coding regions. For affected non-coding genes, we pay special attention on long non-coding RNAs.

Identifying VSD-Related Variants
Only a tiny portion of variants obtained from GATK could be VSD-related. To obtain these variants, we first filter out irrelevant ones by GATK built-in modules with various parameters, including "QD < 2.0, " "QUAL < 30.0, " "SOR > 3.0, " "FS > 60.0, " "MQ < 40.0, " "MQRankSum<-12.5, " and "ReadPosRankSum<-8.0" for SNPs and "QD < 2.0, " "QUAL < 30.0, " "FS > 200.0, " and "ReadPosRankSum<-20.0" for indels (insertions and deletions). This step is followed by using ANNOVAR (Wang et al., 2010) to filter out variants presented in known individuals with minor allele frequency (MAF) of 0.01. Reference databases used in this stage are the phase 3 of 1000 Genomes Project (The 1000 Genomes Project Consortium et al., 2015), ExAC (Lek et al., 2016), ESP (Exome Variant Server, 2019), and gnomAD (Lek et al., 2016). That is, a variant that appears in these databases having MAF no less than 0.01 is excluded. After filtering, we use DAVID (Huang et al., 2009) to analyze functions of remaining variants. These variants are also validated by using Gene Ontology (GO) (The Gene Ontology Consortium, 2017), Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2018), the Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005), and the Human Gene Mutation Database (HGMD) (Stenson et al., 2017). Functions and pathways of coding genes can be easily obtained by using DAVID, whereas they are unable to be obtained directly for noncoding transcripts. Hence, we handle them separately; see below.

Fishing Out Coding Genes
Taking the results generated by ANNOVAR, we select the variants having consequence of "Nonsense_Mutation, " "Frame_Shift_ Ins, " "Frame_Shift_Del, " "Translation_Start_Site, " "Splice_Site, " "In_Frame_Ins, " "In_Frame_Del, " and "Missense_Mutation. " In addition, variants having SIFT score (Sim et al., 2012) larger than 0.05 and PolyPhen-2 index (Adzhubei et al., 2010) smaller than 0.446 are further filtered out. The remaining genes are input into DAVID to analyze gene-disease association, gene-annotation enrichment analysis, pathway mapping, and so on. Those genes related to cardiovascular diseases are fished out. In addition, the neurodegenerative diseases-related genes are also obtained because many studies have shown that these two diseases are closely related (Jin et al., 2017a).
The pinpointed genes are also checked with GO, OMIM, HGMD, and KEGG to verify their functions if available.

Pinpointing Out lncRNAs
A lncRNA does not translate proteins; however, it possesses many roles in gene transcription regulation, post-transcriptional regulation, epigenetic regulation, aging, and so on (Marchese et al., 2017). Hence, mutations occurred in lncRNAs may affect the downstream products. To identify the VSD-related variantcontaining lncRNAs within the child, we extend the VSD-related genes (listed in Jin et al., 2017a) to upstream and downstream by 100, 200, 500, and 1,000 bp. A variant is considered as a candidate if it is within the extension region and overlaps with lncRNAs shown in LNCipedia (Volders et al., 2018) or NONCODE (Zhao et al., 2016). Note that this protocol approaches the VSD-related lncRNAs approximately but not directly. The rationale is that regulatory elements within proximity usually play a role together (Razin et al., 2013;Andrey and Mundlos, 2017).
Functions of identified lncRNAs are fully explored by using LNCipedia and NONCODE.

Data Preparation
A trio containing a 3-year-old boy diagnosed as having typical VSD and a couple of healthy parents is collected. The DNA of each individual is extracted from 5-ml venous blood and is sequenced by an Illumina HiSeq X Ten platform having coverage of 30× and read length of 151 bp. All DNA sequences of the three samples are obtained from one batch. As a result, 356,781,358 paired-end reads are obtained from the child, and 368,280,232 and 330,790,178 paired-end reads are obtained from his father and mother, respectively.
Before the sample collection, a written informed consent is obtained from the parents of the child.

TS-Based Variants
Based on the TS data, we obtained 2,585,348 variants by using GATK (DePristo et al., 2011) with default settings. These variants are further divided into two types, i.e., protein coding and non-coding. For the non-coding variants, we focus on the regions transcribed into lncRNAs. Variants associated with both cardiovascular and neurodegenerative diseases are explored because they usually occur together (Jin et al., 2017a). Details are shown below.

Coding Genes Related to VSD
From the 2,585,348 variants, 193 within exonic regions and 6 from splicing regions pass various filtering criteria that are obtained by using ANNOVAR (Wang et al., 2010). The 193 variants are associated with 61 unique genes, whereas the 6 are involved in 8 genes; see more details in the Supplementary File.
We also performed pathway enrichment test for the identified genes; however, no significant cardiovascular-related pathway can be identified. We found that only five genes overlap with the genes saved in KEGG.
After careful investigation of the 14 cardiovascular diseaserelated genes via literature review, we found 13 of them have literature support that are associated with CHD (VSD is the most common type of CHD), while CD80 has no explicit support. Hence, we take further effort to explore the possible roles of CD80 in VSD development.
CD80 is well known in providing co-stimulatory signal necessary for T-cell activation and survival, which has been found on dendritic cells, activated B cells, and monocytes (Peach et al., 1995). To our knowledge, no study shows direct relation between CD80 and CHD. However, Kallikourdis et al. (2017) have reported that CD80 involved in T-cell costimulation complex contributes to a heart failure, suggesting that mutated CD80 has impact on heart defect. Hence, we carefully explored the role of CD80 with VSD computationally.
To examine whether the mutated CD80 has a connection with VSD as shown in this study, we retrieved all cardiovascularrelated genes from GO, OMIM, and HGMD, and built connections between these genes and CD80 by using the STRING database (Szklarczyk et al., 2017). Results show that 31 genes in GO and 18 genes in OMIM have connections with CD80 (in protein association, including known interactions, predicted interactions, co-expression, etc.). In total, 41 genes have connection with CD80. The details are shown in Table 2. Among these 41 genes, 7 of them are known interactions (experimentally determined or curated from databases, shown in italic in Table  2; see Figure 1B). Among these genes, AKT1, PDPK1, CDC42, AKT3, and PIK3CA have concrete evidences shown in relation FIGURE 1 | Variant-containing coding genes obtained from the trio that are associated with cardiovascular and neurodegenerative diseases. Panel (A) shows the 18 genes attached to the two categories [generated by using the STRING database (Szklarczyk et al., 2017)], panel (B) presents the connections between CD80-and CHD-related genes, and panel (C) illustrates the 3D structure of the mutated CD80 (PDB ID: 1I8L) discovered in this study. The 18 genes are fished out by using DAVID from OMIM and GAD databases, genes identified by OMIM are shaded by a polygon. Note that, all the genes identified by OMIM have also been confirmed by GAD.
Regarding the mutated CD80 in this study, it has a "T" deletion on the reverse strand at chr3:119537362, leading to a frame shift at position 159 of the translated protein (cf. Figure  1C). As a result, the protein is no longer able to insert into the membrane of a cardiac myocyte; see Figure 1C. Therefore, the downstream pathway will be affected.
Regarding the eight variants that reside in the splicing region, we found only one (MROH5) that is related to cardiovascular disease.

LncRNAs Related to VSD
Other than fishing out candidate VSD-related genes from variants directly, we use known VSD-associated genes as a seed, and then pinpoint lncRNAs having variants near these seeds. A set of 457 known VSD-related genes are obtained from Jin et al. (2017a), whereas the whole set of lncRNAs are retrieved from LNCipedia and NONCODE. A variant-containing lncRNA is considered to be VSD-related if it is within a certain distance of a known VSDrelated gene. Other than using a single distance, we use various distances, which are 100, 200, 500, and 1,000 bp.
We identified 6, 7, 27, and 49 lncRNAs from LNCipedia having distance of 100, 200, 500, and 1,000 bp, respectively, whereas these numbers are 6, 8, 32, and 57 when checked against NONCODE. Details are shown in the Supplementary File. To examine whether these lncRNAs have a potential effect to VSD, we carefully studied their expression in different tissues, particularly in the heart. We found that among all the lncRNAs (both from LNCipedia and NONCODE), 29 of them present in the heart, especially NONHSAT096266.2 (NONCODE ID), which is highly and uniquely expressed in the heart, having a FPKM score of 13.97. More interestingly, this lncRNA is very close to NFXL1, which has been identified as a VSD-associated gene (Jin et al., 2017a). See Table 3.

Results Comparison
We use TrioDeNovo (Wei et al., 2015) with default settings (depth of coverage equals 5) to call de novo variants from the trio and compare results with that of our approach. As a result, TrioDeNovo identifies 79,082 variants contained in 357 genes. After filtering out common variants with MAF of 0.01, 51 variants located in 25 genes are obtained. Among these genes, 21 overlap with our findings. Regarding lncRNAs, no variant can be found within 1,000 bp of known VSD-related genes.

SS-Based Variants
Intuitively a TS-based approach is able to significantly narrow down candidate genes; however, it is hard to speculate to what extent the improvement is. Hence, we have conducted experiments on the sequences of the VSD sample (the child) only with the same protocols as the TS-based experiments.

Coding Genes Related to VSD
After annotation and filtering by using ANNOVAR, we obtained 1,552 variants contained in 436 genes. Among these genes, 424 have exonic variations and the other 12 have splicing variations. For the 424 genes, we identified MYBPC3 (Chr11:47342683, C/T, missense mutation, p.G507R) and TRDN (Chr6:123571021, C/A, missense mutation, p.S45I), which are highly related to ventricles, by using DAVID based on OMIM. More details are shown in the Supplementary File. Unfortunately, these two genes cannot be identified based on the trio. Further investigation has shown that the variation in MYBPC3 is inherited from the father, whereas the variation in TRDN is inherited from the mother. Considering the truth that both the father and mother are healthy, but the child has VSD, we speculate that the combination of mutated MYBPC3 and TRDN may have noteworthy contribution to VSD. Regarding the 12 genes having variants in splicing regions, we have not found cardiovascular-or neurodegenerative-related genes. The genes identified by GAD are excluded for the child sample analysis. This is because more than 60 genes can be found, and the significance of relations between these variants and VSD can be hardly determined.

LncRNAs Related to VSD
Similar to TS-based lncRNA identification, we carried out the same experiments on the VSD patient only. Unlike the results obtained from coding genes that are about 10 times larger, candidates are selected from the SS-based data than the TS-based data; we get five times larger number of lncRNA variants between the SS-based data and the TS-based data.
The numbers of lncRNAs having variants close to VSD-related genes are 37, 60, 129, and 197 for distances of 100, 200, 500, and 1,000 bp, respectively. Among all these lncRNAs, only 97 are present in heart cells, of which 23 are highly expressed having FPKM score larger than 1. Details are shown in the Supplementary File. For instance, the lncRNA NONHSAT181468.1 (Chr2:27217145, CT/C), which has the highest FPKM score of the identified lncRNAs, is highly expressed in the heart having FPKM of 31.7. This lncRNA is within the first intron of SLC5A6, which has been confirmed as a VSD-associated gene.

Variants Profile of the Trio
We use the ratio of k-mers between the parents and child to reflect their genetic variations. Figure 2 shows the detailed ratio distribution. It is clear that only a small portion of k-mers have ratio of 0 (see Figure 2A, C). That having been said, among the k-mers of the child, only 0.442% contains de novo mutations compared with his father, and this value is 0.438% compared with his mother. After combining them together, 0.43% are unique k-mers.
The ratio of k-mers may be affected by sequencing errors. To alleviate this impact, we include k-mers having small ratio (less than 0.3) except the ones having a ratio of 0. Generally, the number of k-mers having a ratio of 0 is four to five magnitudes larger than those non-zero ones (see Figure 2C). In case these k-mers contain mutations, they will be fished out during the downstream variant calling.
Unlike the distribution of k-mer count for all k-mers (approximate normal), the k-mers having mutations follow a Poisson distribution (see Figure 2D). The k-mers having counted smaller than 20 forms 97.97% of all k-mers having ratio less than 0.3. The distribution breakdowns of these k-mers are shown in Figure 2B.

Run-Time Analysis
Our experiments are conducted on a computer having 128G RAM and two E5-2683V4 CPUs (32 cores in total), installed with CentOS 7.0. Throughout the entire experiments, we use 24 threads as default if applicable.
Other than existing approaches that filter out irrelevant variants from trios after mapping, e.g., TrioDeNovo (Wei et al., 2015), we conduct filtering before mapping. This small change is not trivial since the input data are usually very large. For instance, the input size of the VSD sample used in this study is 242 Gb in fastq format, and the total size is over 700 Gb for the trio. To solve this problem, we have conceived a novel coupled Bloom Filter-based k-mer encoding algorithm. This algorithm achieves a compression ratio of 12 under default settings. That having been said, a typical set of k-mers obtained from a human genome (usually around 120Gb) can be compressed into 10 Gb. Using this approach, we are able to handle a trio within a main memory.
Experiments show that the total memory used to encode counted k-mers obtained from the trio is 31.7 Gb. Based on the encoded k-mers having count available, we calculate count ratio of all k-mers between the parents and the child. Mathematically, suppose the count of a k-mer κ from the child is f c κ , and the count is f f κ and f m κ for his father and mother, respectively; then, are smaller than the threshold r 0 , then, κ is kept, where r 0 is set as 0.3 in this study. Results show that k-mer counting takes 129 min, k-mer encoding takes 175 min, and k-mer filtering takes 20.3 s. As a result, 3.9% k-mers are left for further analysis.
Because there exist sequencing errors, we perform error correction on the remained k-mers (Zhao et al., 2017). It takes 1.7 s and 0.12-Gb RAM to correct 93.7% errors of the kept k-mers. As a result, 293.2M k-mers are left for variants identification.
Before mapping variant-containing k-mers to a reference genome, we have also conducted k-mer extension to avoid multimapping problem caused by short input sequence, e.g., k-mer. An extension takes a k-mer as seed, and extends the k-mer to both sides based on the reads in which the k-mer is contained. Finally, we mapped extended sequences to the reference genome GRCh38/h38 via BWA (Li and Durbin, 2009), which takes 52 min to finish. This is followed by variants calling through SAMtools GATK (DePristo et al., 2011) jointly. It takes 50 min to finish the above mentioned steps.
Regarding TrioDeNovo, it takes 572 min to get the sorted sam file from a raw fastq file and uses 8,179 min to merge and generate the final vcf file by using GATK and TrioDeNovo. Compared with our approach, TrioDenovo is 10 times (= (8179 + 572*3)/ ((129 + 175)*3+102)) slower than ours. Besides, the maximum RAM required by our approach is two magnitudes smaller.

CONCLUSIONS
As the most common CHD, VSD affects a noteworthy portion of newborns, leading to a high mortality. Unveiling the biological mechanism, particularly the underpinning genetic variants, is essential for both early diagnosis and clinical treatment. Existing approaches of mining genetic variants rely on large panels, which is challenging in cost and sample collection. It is also prone to overlooking rare variants and hard to handle multiple variants. We designed a novel algorithm for identifying variants from a trio and associate them with VSD. Experiments show that triosequencing-based approach is able to narrow down VSD-related candidates by about 10 times in coding genes and 5 times in lncRNAs; meanwhile our approach is 10 times faster than existing state-of-the-art approach. Applying our method to a VSD trio, we fish out 14 coding genes closely correlated to cardiovascular diseases and 5 coding genes associated with neurodegenerative diseases. Among them, CD80 has not been reported yet. More promisingly, results show that the combination of MYBPC3 and TRDN has high possibility to be VSD-related. Analysis on lncRNA shows that six are highly expressed in heart that are within 1,000 bp to VSD-related genes, particularly NONHSAT096266.2, which has a FPKM socre of 13.97 and is uniquely expressed in heart.

AUTHOR CONTRIBUTIONS
LZ conceived the algorithm, designed the experiments, and wrote the manuscript. PJ participated in program coding. PJ, YH, FIGURE 2 | Variants profile of the trio. Panels (A and C) are the distribution of k-mer count ratio between the parents and the child, panel (B) is the ratio distribution of the k-mers having a count less than 20, and panel (D) is the overall distribution of ratio against count. Note that ratio distribution in panel (B) is clustered into bins, viz., r = 0 (denoted as "0"), 0 < r ≤ 0.1 (denoted as "0.1"), 0.1 < r ≤ 0.2 (denoted as "0.2"), and 0.2 < r ≤ 0.3 (denoted as "0.3"), where r is a ratio and the count is in log scale.