Bioinformatic analysis for age prediction using epigenetic clocks: Application to fisheries management and conservation biology

Epigenetic clocks are accurate tools for age prediction and are of great interest for fisheries management and conservation biology. Here, we review the necessary computational steps and tools in order to build an epigenetic clock in any species focusing on fish. Currently, a bisulfite conversion method which allows the distinction of methylated and unmethylated cytosines is the recommended method to be performed at single nucleotide resolution. Typically, reduced representation bisulfite sequencing methods provide enough coverage of CpGs to select from for age prediction while the exact implemented method depends on the specific objectives and cost of the study. Sequenced reads are controlled for their quality, aligned to either a reference or a deduced genome and methylation levels of CpGs are extracted. Methylation values are obtained in biological samples of fish that cover the widest age range possible. Using these datasets, machine learning statistical procedures and, in particular, penalized regressions, are applied in order to identify a set of CpGs the methylation of which in combination is enough to accurately predict age. Training and test datasets are used to build the optimal model or “epigenetic clock”, which can then be used to predict age in independent samples. Once a set of CpGs is robustly identified to predict age in a given species, DNA methylation in only a small number of CpGs is necessary, thus, sequencing efforts including data and money resources can be adjusted to interrogate a small number of CpGs in a high number of samples. Implementation of this molecular resource in routine evaluations of fish population structure is expected to increase in the years to come due to high accuracy, robustness and decreasing costs of sequencing. In the context of overexploited fish stocks, as well as endangered fish species, accurate age prediction with easy-to-use tools is much needed for improved fish populations management and conservation.


Introduction
Epigenetics can be defined as "the study of phenomena and mechanisms that cause chromosome-bound, heritable changes to gene expression that are not dependent on changes to DNA sequence" (Deans and Maggert, 2015). Epigenetics has emerged as a powerful discipline in the study of the integration of genomic and environmental information, both intrinsic and extrinsic factors, to bring about a specific phenotype (Turner, 2009;Vogt, 2017). There are three major epigenetic molecular mechanisms widely accepted as such: 1) DNA methylation, 2) the modifications of histones and histone variants, and 3) the abundance and distribution of regulatory non-coding RNA (for review, see Carlberg and Molnaŕ (2014)). One of the best studied epigenetic mechanisms is DNA methylation. Methylation can occur in two of the four nucleotides of DNA, cytosine and adenine. The former is the process by which a methylgroup (CH3) is transferred from a methyl donor, S-adenosyl-Lmethionine (SAM), to the fifth position of a cytosine, converting it to 5-methylcytosine (5mC) or to the sixth position of an adenine converting it to N6-methyladenine (Ratel et al., 2006;Grosjean, 2013;Pfeifer, 2016). 5mCs are the most abundant modifications, are present in most species and therefore the most studied.
According to the Biomarkers Definitions Working Group, a biomarker is defined as "a characteristic that is objectively measured and evaluated as indicator of normal biological processes, pathogenic processes or pharmacologic responses to a therapeutic intervention" (Biomarkers Definitions Working Group, 2001). Biomarkers have been developed for a variety of purposes, including medicine and environmental assessment (Liu et al., 2019). Epigenetic modifications have been suggested recently as good candidates for biomarkers because they can be stable, frequent, abundant and accessible (Costa-Pinheiro et al., 2015). Details on the development of epigenetic biomarkers in aquatic organisms can be found elsewhere (Anastasiadi and Beemelmanns, 2023). An epigenetic clock is a set of biomarkers used to predict age, or in other words a "highly accurate age estimator based on CpG DNA methylation levels". In the last years they have been developed for about half a dozen fish species and it is expected that in the years to come epigenetic clocks will be of common use for both fisheries management and conservation biology. To the best of our knowledge, epigenetic clocks have been developed for: European sea bass (Anastasiadi and Piferrer, 2020), zebrafish (Mayne et al., 2020), Australian lungfish (Mayne et al., 2021b), Mary river cod (Mayne et al., 2021b), Murray cod (Mayne et al., 2021b), medaka (Bertucci et al., 2021), northern red snapper (Weber et al., 2021) and red grouper (Weber et al., 2021). For details on piscine epigenetic clocks including accuracy, techniques, CpGs covered and biological aspects to consider for new clocks please see Piferrer and Anastasiadi (2023). However, a crucial aspect for epigenetic clock development is how DNA methylation data is actually used to build the age predictor. This is of importance because a proper model building is essential to take out the most of the capabilities that epigenetic clocks may offer. There are several reviews that cover the factors causing, modulating and accelerating epigenetic clocks, mainly focusing on humans and mammals (Field et al., 2018;Guevara and Lawler, 2018;Bell et al., 2019;Simpson and Chandra, 2021). However, to the best of our knowledge, there are no reviews on the necessary computational steps and tools in order to build an epigenetic clock in any species, while these steps will be essentially the same. The issues dealt with below will thus be very helpful not only to fisheries managers and conservation biologists but to scientists that want to develop epigenetic clocks for new species.

Methods to analyze DNA methylation
The methods used to analyze DNA methylation can be categorized at three broad levels [ Table 1 (Anastasiadi, 2016;Barros-Silva et al., 2018;Ortega-Recalde and Hore, 2023]. These three levels are based on how methylated loci are identified (level 1), at what resolution they are identified (level 2) and what portion of the genome is interrogated (level 3). For epigenetic clocks construction, methods that make use of bisulfite (level 1) at single nucleotide resolution (level 2) are used. However, the portion of the genome to be interrogated depends on the resources and available knowledge on the species of target or closely related species. Importantly, advances in sequencing using Oxford Nanopore Technologies MinION render this a powerful alternative to other methods. Thus, direct detection at single nucleotide resolution of 5mCs using portable devices is possible without the need of bisulfite conversion. This technology has been used recently to construct an epigenetic clock in cattle (Hayes et al., 2021).
5mCs must be identified and separated from the unmethylated ones (Cs). The processes of distinction between the two types of cytosines can be further divided into three general sub-levels, detailed below (Table 1 for an overview), that are not mutually exclusive and that in some cases are used in combination (Rauluseviciute et al., 2019): 1) Restriction enzymes. There are restriction enzymes which function differently when they encounter 5mCs and Cs. This property can be used to distinguish between the two types of Cs and ultimately identify their methylation status. Common isoschizomers, like MspI and HpaII, are used. For instance, these enzymes recognize the same sequence pattern (5'-CCGG-3'), however, MspI cuts at those sites where the internal C is methylated in the two complementary DNA strands, while HpaII is functional in those with methylation of the external C in one or both of the complementary DNA strands. The Methylation Sensitive Amplified Polymorphism (MSAP) (Reyna-Lopez et al., 1997;Xu et al., 2000) and the Restriction Landmark Genomic Scanning (RLGS (Hatada et al., 1991); are examples of approaches using methylationsensitive restriction enzyme (Table 1).
2) Antibodies. This approach is based on the use of antibodies that show specificity against 5mC or of recombinant proteins which have been developed to contain a methyl-CpG binding domain (MBD; e.g (Aberg et al., 2012). These processes end up enriching the fraction of chromatin that is methylated. Methylated DNA ImmunoPrecipitation (MeDIP) (Jacinto et al., 2008) and Methyl-CpG-Binding Domain (MBD) (Jacinto et al., 2008;Nair et al., 2011) are examples of affinity-based approaches (Table 1) with MeDIP using a monoclonal antibody specific for 5mCs and MBD-based strategies using methyl-CpG binding domain-based proteins (MBDCap) (Nair et al., 2011).
3) Bisulfite. The treatment of DNA with bisulfite involves a chemical reaction that converts unmethylated Cs into uracils in 3 steps. Methylated 5mCs also react with bisulfite but this reaction is extremely slow and 5mCs are favoured by the equilibrium. Thus, 5mCs essentially escape conversion and remain intact (Clark et al., 1994). This reaction functions, therefore, as a recorder of the original methylation status and downstream steps allow to register and recall it. Several techniques, ranging from locus-specific to whole-genome, take advantage of the bisulfite properties in order to analyze the DNA methylation levels, like the Methylation-specific PCR (MSP) or the Whole Genome Bisulfite Sequencing (WGBS ; Table 1). Bisulfite conversion of DNA is considered the "gold standard" in DNA methylation analysis because it allows the identification of the methylation status of each interrogated cytosine. However, limitations exist for bisulfite conversion methods as well.
Methylation of a cytosine is a binary state (methylated or not methylated) in a given cell at a given time. Bisulfite sequencing reflects the relative proportion of Us/Cs at a given position, when sequencing tissues due to cell heterogeneity, and not the binary state of a specific cytosine unless single cell sequencing is performed. Methods based on bisulfite treatment of DNA are used for epigenetic clock construction.

Level 2. What is the resolution used?
The methylation profiling methods can have variable resolution, where higher resolution means information retrieval at the level of nucleotide and lower resolution means information retrieval at a larger genomic scale. In Table 1, an overview of the different methodologies for the analysis of DNA methylation with their corresponding resolution is provided. The resolution can broadly be grouped into the following three categories: 1) Low resolution. These techniques typically allow to obtain information on the global 5mC content. This is useful in order to conclude whether there are overall differences in the global methylation content or not, e.g., between control and treatment or disease group. Nevertheless, where exactly in the genome these differences occur remains unknown.
2) Medium resolution. Here, apart from global differences, an approximate location of the 5mCs is obtained. This is the case, for instance, of MeDIP-seq, where the methylated fraction of the immunoprecipitated DNA is sequenced and the differences can be located within a region that corresponds to the length of the sequenced fragment.
3) Single nucleotide resolution. In this case, the precise location of both 5mCs and Cs is obtained. This means that the exact position of 5mCs and Cs can be mapped to genomic coordinates that include 3 numbers: chromosome, start position, end position. For example, one obtains the information that in chromosome 1, start position=253, end position=254, there is a 5mC. Single nucleotide resolution is needed to construct an epigenetic clock.

Level 3. Which part of the genome is targeted?
The part of the genome that is investigated following the separation of Cs is also variable. In Table 1 an overview of the different methodologies for the analysis of DNA methylation with their relative CpG/genome coverage is provided. They can be broadly grouped into three categories according to this criterion as well: 1) Locus-specific. The amount of 5mCs and Cs is measured within target regions of interest typically spanning 10 -1 -10 2 CpGs. The target region of interest can be a specific gene, regulatory region of a gene, genomic regions within a gene such as exons, introns or 5'UTRs, or any other genomic region that is a priori interesting and therefore can be a target for the analysis of its DNA methylation.
2) Genome-wide. The amount of 5mCs and Cs is measured within a part of the genome that is considered representative of the overall genome. The part of the genome is in the order of 10 5 -10 6 CpGs and is representative because usually it is enriched for sites that can be methylated. For example, after digestion with enzymes that specifically recognize sites that include CpGs.
3) Whole-genome. The amount of 5mCs and Cs is measured across the whole genome covering more than 10 6 CpGs. The entire genome is interrogated for its methylation levels, there is no reduction for specific regions or representative parts, but rather information on every single basis is obtained.

DNA methylation analysis using bisulfite sequencing
In the last years, high throughput sequencing (HTS) approaches have been used extensively to analyze the DNA methylation patterns in many different situations. The technique that combines the best possible way to distinguish 5mCs, single nucleotide resolution and whole genome coverage is Whole Genome Bisulfite Sequencing (WGBS), which is a HTS-based approach that uses bisulfite conversion to allow the distinction between 5mCs and Cs and interrogates the whole genome at single nucleotide resolution (Bock, 2012). Other HTS-based approaches that use bisulfite conversion, but analyze only a part of the genome are: Reduced Representation Bisulfite Sequencing (RRBS) (Gu et al., 2011;Klughammer et al., 2015), Bisulfite-converted Restriction site Associated DNA sequencing (bis-RAD-seq) (Trucchi et al., 2016), epiRADseq (Schield et al., 2016) and epi-GBS (van Gurp et al., 2016;Gawehns et al., 2022). Furthermore, targeted approaches such as BisPCR 2 (Bernstein et al., 2015), Multiplex Bisulfite Sequencing (MBS) (Anastasiadi et al., 2018) or others (Masser et al., 2013;Korbie et al., 2015;Roeh et al., 2018) are also HTS-based techniques that make use of bisulfite conversion but for a targeted part of the genome. Oxford Nanopore Technology sequencing is expected to vary for the basic bioinformatics steps, however, the statistical analysis including machine learning model building will be essentially the same (section 3). The HTS methods used for epigenetic clocks in fish species until now include RRBS, bis-RAD-seq, MBS and BisPCR 2 (see Table 1, Piferrer and Anastasiadi, 2023).
Different epigenetic clocks can be developed using different CpGs across the genome in different combinations, depending on the original dataset and the machine learning model. Around 20% of the Illumina 450K CpGs (90000 CpGs) can be used for epigenetic clocks (Porter et al., 2021). Taking this information into account, WGBS (>10 6 CpGs) or RRBS (10 3 -10 6 CpGs) may produce a large amount of unnecessary data and workload, while bis-RAD-seq or epiRADseq (10 3 -10 5 CpGs) are expected to produce a good balance of informative CpGs without excess. Targeted approaches (e.g., MBS 10 -1 -10 2 CpGs) will be more relevant when prior information is available.

Bioinformatics workflow for bisulfite sequencing
All HTS-derived data produced from methods that use bisulfite conversion share some common characteristics. A summary workflow allows to distinguish the steps of quality controls, filtering/trimming, alignment/mapping, methylation extraction and analysis ( Figure 1).
Processing of HTS-derived data always initiates with the appropriate quality controls of the raw sequencing data obtained followed by filtering of the data that fall below the specified thresholds. Adapters or indices have usually been added to the DNA fragments during the preparation of the libraries and are used to demultiplex the samples if needed. Their sequences also usually need to be removed from the data (trimming) otherwise they might influence the downstream steps of the workflow. Once adapters have been trimmed and low quality reads filtered out, the reads are aligned against the genome which, importantly, must have been previously bisulfite converted in silico. This is because bisulfite treatment converts the unmethylated cytosines of the genome into uracils, which are in turn converted into thymines (Ts) after amplification by PCR. This process results in many genomic sites in the sequenced reads that fail to map to the genome because the original sites have been lost and thus, they cannot match. Moreover, after PCR amplification, the complementary DNA strand contains adenines (As) instead of guanines (Gs) in the positions where the C was unmethylated and has been converted into T. Thus, the procedure through which the sequenced Workflow for bioinformatics analysis of bisulfite sequencing data for epigenetic clock construction.
reads are aligned to the genome needs to take into account these mismatches and considerations. Different tools have been developed to convert and map bisulfite sequencing data (See section 2.4.). A reference genome is not a pre-requisite for applying bisulfite sequencing since alternative bioinformatics procedures have been developed to assist the analysis, e.g., ad hoc genomes can be deduced from RRBS reads (Klughammer et al., 2015) or for bs-RAD-seq, a standard RAD-seq reduced representation genome can be used (Trucchi et al., 2016).
Once reads have been successfully mapped, the information of the methylation status has to be extracted at each C position of the genome, a process called methylation extraction or methylation calling. Usually, the final methylation of a given C is calculated according to the proportion of 5mCs and Cs found in that position: their sum equals to the coverage of a position and is the denominator in the equation where the numerator is the number of 5mCs. This value may be expressed as (5mCs/5mCs+Cs) and thus the methylation values will range from 0 to 1, or can be expressed as percentage, (5mCs/5mCs+Cs) multiplied by 100, and thus the methylation values will range from 0 to 100 (more details in section 2.5).

Quality controls
Modern sequencing platforms (e.g., Illumina) usually include the corresponding software which automatically performs the demultiplexing steps required prior to sample analysis and thus the corresponding set of files for each sample are obtained. In the case of single-end sequencing one file per sample is produced, while in the case of paired-end sequencing two files are produced per sample, each one of them refers to the forward and reverse read.
The standard format for these files is the FASTQ. FASTQ is a textbased format to store the sequences which includes more information than the older FASTA format which included only the sequence. In FASTQ, each read is unique and contains a sequence identifier and there is further information on the specific quality of the read. The quality of the reads is mainly measured by the Phred score which is a property logarithmically related to the base-calling error probabilities. A Phred score of 10, means that there is a 1 in 10 probability of incorrect base call and a 90% accuracy in base calls. A Phred score of 30 means that there is a 1 in 100 probability of incorrect base call and a 99.9% accuracy in base calls. Typically, Phred scores below 20, which equals to 99% accuracy of base call, are excluded from downstream analysis.
Quality controls are usually performed by a range of open source software packages, the most common of which is FASTQC (Andrews, 2010). In case several samples are to be evaluated at once, the MultiQC (Ewels et al., 2016) can be useful for simultaneous assessment of quality (see indicators below). These tools allow to assess the data quality via a variety of plots and statistics (Figures 2A-D), namely: 1) Sequence counts. The number of sequences counted for each sample.
2) Sequence quality histograms. The mean Phred score across each base position in the read.
3) Per sequence quality scores. The total number of reads plotted against the average Phred quality scores over the full read. 4) Per base sequence content. The percent of bases called for each of the four nucleotides (e.g., 30% A, 40% T, 20% C, 10% G) at each position (e.g, position 1-150 for a 150 bp read sequencing) across all reads. 5) Per sequence GC content. The number of reads plotted against the GC% per read. 6) Per base N content. The percent of bases at each position of the read for which no base could be called and are therefore coded as "N".

7)
Sequence length distribution. The distribution of lengths across all reads.
8) Sequence duplication levels. The percent of reads of a specific sequence that are present repeatedly inside the file and can be an indicator of PCR duplication. 9) Overrepresented sequences. Sequences that appear more times than expected.
10) Adapter content. Cumulative plot of the fraction of reads where the adapters used for library construction are identified.
Sequence quality scores below 30 are nowadays considered unacceptable. However, interpretation of the rest of the metrics will be specific to the technique and sequencing platform used, since high duplication levels are inherent to enrichment (e.g., RRBS) or targeted techniques, but may indicate a problem with WGBS data.
On the other hand, the simultaneous visualization of multiple quality controls (QC) can be obtained by the MultiQC software ( Figure 2E). A drop of quality below the available threshold at the end of the read is expected in general and for long reads (300 bp) in particular.
Example code of running FASTQC in all available fq files: fastqc -nogroup -q -t 2 -o output_fastq_raw *.fq.gz Example code of running MultiQC in the output of FASTQC: multiqc output_fastqc_raw -i Fastqc-Raw

Trimming
Several open source packages are available for trimming ( Table 2). Trimming of low quality reads can significantly improve the quality of the data to process and all downstream workflow, as a minimum in terms of Phred scores Low quality Phred scores (<20) would be associated with too high probabilities of erroneously called bases, one nucleotide, e.g,. A, instead of another, e.g., C. Thus, they cannot be accepted in a HTS experiment. Quality controls are performed again after the trimming procedure as well and are useful to visualize the improvements.

Alignment
Bisulfite conversion depletes the genome of unmethylated cytosines which represents a challenge for the normal alignment procedure of reads to a large reference genome. Softwares developed for standard alignment procedures are not adequate in this case due to the conversion effect (Laird, 2010). This challenge has been circumvented by two different algorithms: 1) Wild card aligners. In this case, the Cs of the genome are replaced by Y which is the wild-card letter that is able to match both Cs and Ts, equivalent to Cs and 5mCs in the original molecule. Otherwise, these aligners modify the alignment score matrix in a manner that allows mismatch between Cs in the original molecule and Ts in the sequence of the read. Examples of wild card aligners include BSMAP and RRBSMAP (Xi and Li, 2009;Xi et al., 2012).
2) Three-letter aligners. In this case, all the Cs are converted into Ts in both the reads to be aligned and in the genomic sequence. The alignment is simplified and carried out using only three-letters of the nucleotide alphabet with C excluded. In this case, any standard aligner can be used at the lower level of the package, such as Bowtie or Bowtie2 (Langmead et al., 2009). Examples of three-letter aligners include Bismark, bwa-meth and BS-Seeker (Chen et al., 2010;Krueger and Andrews, 2011;Krueger et al., 2012, Pedersen et al., 2014. Example code using bwa-meth: Index reference genome python/software/bwa-meth/bwameth.py index/genome/ species-genome.fasta Align reads to reference genome python /software/bwa-meth/bwameth.py-threads 16 -reference/genome/species-genome.fasta /trimmed_data/sample1_trimmed_R1.fastq /trimmed_data/sample1_trimmed_R2.fastq | samtools view -Sb -q 10 ->/alignments/sample1.bam Example code using Bismark: bismark /reference/genome/ -1 sample1_trimmed_R1.fastq -2 sample1_trimmed_R2.fastq -non_directional -un -o alignments bismark /software/bismark/Genome/ -1 sample1_trimmed_R1. fastq -2 sample1_trimmed_R2.fastq -non_directional -un -o alignments Wild card aligners typically result in higher genomic coverage, but also in the introduction of bias towards higher DNA methylation as compared to three-letter aligners. This is relevant mainly in parts of the genome such as repetitive sequences. When selecting an aligner, considerations such as speed, computer memory and program use are more important (Bock, 2012). A recent comprehensive comparison of the most commonly used aligners should be consulted before executing this step (Nunn et al., 2021). In any case, mapping of the reads to the genome needs to be precise because otherwise it would result in biased DNA methylation levels calculated on the basis of methylated and unmethylated reads (Bock et al., 2010).

Methylation extraction
The methylation state of each C is extracted according to the alignments. Cytosines from the aligned sequences are transcribed into a table format where each row corresponds to a cytosine and its genomic position according the chromosome and position, methylation state and strand. Coding of this information within the table depends on the software used. For example, the Bismark (Krueger and Andrews, 2011) primary alignment output codes cytosines depending on the context, as z in CpG context, x in CHG context and h in CHH context. Methylation status is coded as uppercase (Z, X, H) for methylated and lowercase (z, x, h) for unmethylated. This information is transcribed into + for methylated and -for unmethylated cytosines in the methylation extraction file.

Bisulfite conversion rate
Evaluation of bisulfite conversion efficiency is an important step of the whole procedure because if this fails, then conclusions on the methylation status of the cytosines are erroneous. Spike-in sequences of known methylation status may have been introduced during library preparation to assist with bisulfite conversion rate estimation. If not, bisulfite conversion rate can be estimated in silico. Tools like the 'bsrate' script of the MethPipe pipeline (Song et al., 2013) allow for an automatic calculation of the bisulfite conversion ratio. Otherwise, one can make use of the percent of Cs methylated in a CHH context where C is cytosine and H can be any nucleotide except of Gs. In this case, the percent of these Cs is subtracted from 100 and the result is the bisulfite conversion rate. In current DNA methylation analysis procedures, bisulfite conversion rates should be as high as possible.

Statistical analysis 3.1 Objective
The objective of this step is to identify CpGs the methylation levels of which allow to predict the age of an individual. The methylation of these CpGs may be decreasing or increasing with age with different slopes. The methylation of each CpG will be given a specific weight (coefficient) and their combination will be sufficient to predict age. These coefficients were shown to differ in the same CpGs between broad age groups in mammals, including humans (younger, middle-aged and older). Therefore, the extreme age groups should be considered with caution (Field et al., 2018) but nevertheless included for the development of the clock. The statistical analysis includes a typical machine learning model building (Figure 3). Building of machine learning models for age prediction follows the same principles as for any biological feature predicted from epigenetic biomarkers (Anastasiadi and Beemelmanns, 2023). The outcome variable is quantitative (age) and thus we deal with a regression problem aiming to predict the outcome variable on the basis of the independent variable(s) by means of a fitting curve explaining the input. When running a regression trying to predict a quantitative value (i.e. age) with many predictors (CpGs) results tend to overfit, reducing the predictive value. Penalized regression circumvent this problem by shrinking values of the predictors, being the recommended for age prediction based on CpG methylation (See Section 3.3).

Data structure
The dataset consists of: 1) Biological samples that cover a defined age range. The total number of samples should ensure covering the full age range of the species considered, and may vary between species in the extremes of lifespan. In the published literature of fish epigenetic clocks, the total number of samples range between 10 in Northern red snapper and red grouper (Weber et al., 2021) and 141 in Australian lungfish (Mayne et al., 2021b) with mean 46 samples (Piferrer and Anastasiadi, 2023). These numbers maybe suboptimal, since the minimum sample size according to simulations using human and zebrafish (Danio rerio) data is 70 (Mayne et al., 2021a). If feasible, 134 samples should be ideally included according to the same simulations, a recommendation for all new piscine epigenetic clocks (Mayne et al., 2021a). In order to build a prediction model, these will have to be divided into training and test sets. The training set is used for the machine to learn, to fit the parameters of the model. The test set is an independent set of data which the model built predicts and thus serves as an evaluation dataset of the model fit. Usually, the original dataset is split in 70-80% of the observations into training and 20-30% of the observations into the test set, using random procedures.
2) The methylation levels of target CpGs. Depending on the technique used, the number of CpGs will be in the order of hundreds (e.g. MBS), thousands (e.g. bis-RAD-seq), hundreds of thousands (e.g. RRBS) or millions (e.g. WGBS). Since many epigenetic clocks across a genome are possible (Porter et al., 2021) and extremely accurate epigenetic clocks with only 3 carefully selected CpGs have been constructed in mice (Han et al., 2018), the number of CpGs analyzed are not expected to affect the overall accuracy. However, each epigenetic clock or model will be unique as will be the coefficients attributed to each CpG of the clock. This type of data is not independent, since the methylation of one CpG may depend on the methylation of its neighboring CpG and are characterized by strong multicollinearity, where a large number of CpGs may be closely related to each other. Genome-wide patterns of DNA methylation in vertebrates are bimodal with a specific CpGs showing 0 or 100% methylation.

Penalized regressions
In the development of epigenetic clocks we are dealing with a large multivariate dataset, where the number of variables (the different CpGs, at least the ones initially analyzed) is much, much higher than the number of samples (biological samples). Thus, the standard linear model is not suitable to use. A way to circumvent the structure of the dataset is to use penalized regressions. This approach was already implemented by Horvath (2013) when constructing the first epigenetic clock in humans. Penalized regressions allow to construct linear regression models that are penalized when they have too many variables (Kassambara, 2018). The penalization occurs via the addition of a constraint in the equation (Bruce and Bruce, 2017). This increases bias but, importantly, reduces variance. The methodology to achieve this is shrinkage or regularization, which results in the shrinkage of some coefficients values to zero. This allows Workflow for machine learning model building for epigenetic clock. Data are split into training and test, model is tuned and evaluated using the training dataset, the optimal model is selected and evaluated using the test dataset. If model performance is not optimal, the procedure may be repeated using the training dataset. The age in independent data can be predicted by the optimal final model or "epigenetic clock". for exclusion of the variables (i.e., individual CpGs) that contribute less by shrinking their coefficient or in other words, to retain the minimum number of CpGs that are valuable for age prediction.
There are three most commonly used methods of penalized regression and typically they are all tested when constructing an epigenetic age prediction clock for a new species: 1) Ridge regression. The least contributing variables will have their coefficient very close to zero.
2) LASSO regression (Least Absolute Shrinkage and Selection Operator). The variables with the least contribution will be forced to be zero. This will produce models with reduced complexity as compared to ridge regression, where all variables are kept.
3) Elastic net regression. This type of penalized regression stands in between the previous two types, where some coefficients will be shrank, as in ridge regression, and some coefficients will be set to zero, as in LASSO.
There are advantages and disadvantages of each penalized regression type over the other that depend on the specific dataset. LASSO will perform better when there are few predictors with large coefficients and a lot of predictors with small coefficients, while ridge will perform better where there are a lot of predictors with similar coefficients. Ridge regression keeps all variables, therefore, is not recommended when genome-wide techniques have been used. In any case, parameters of the model have to be tuned and the model has to be selected by evaluating its performance, as explained below.

Machine learning model building
Penalized regressions are machine learning models and thus, to build them, a standard machine learning model building procedure should be followed (Figure 3). In aquatic organisms, machine learning methods for developing epigenetic biomarkers have been applied in limited cases, while the procedure has been recently reviewed in details (Anastasiadi and Beemelmanns, 2023).
Below we explain the typical workflow of the procedure that can be implemented in R using the specialized caret (Classification And REgression Training) package (Kuhn, 2008). Nevertheless, other packages or programming language (e.g., Python) can also be used to navigate the same workflow.
1) Data splitting. Data are split into at least 2 datasets that allow to later evaluate model performance. The training dataset contains 70-80% of the samples and is used to for algorithm training and parameter tuning. The test datasets contains the remaining 20-30% of samples and is used once the right model has been trained and selected to test whether the model can be generalized in unseen data. Ideally, training dataset is sufficiently large to be split further into training and validation dataset during model performance assessment. However, this is rarely the case and instead resampling techniques are used. With resampling, iterative splitting into training and validation datasets occurs and prediction errors of all splitting cycles are averaged at the end. K-fold cross-validation (CV) has been extensively used in fish epigenetic clock building. Data splitting can be performed using specific functions that randomly splits the dataset, while keeping track of the randomness by setting the seed to a specific number in R.
R code example: library(tidymodels) library(readr) set.seed(123) splits <-initial_split(meth.age.df, strata = age) age_other <-training(splits) age_test <-testing(splits) Training set proportions by age class age_other %>% count(age) %>% mutate(prop = n/sum(n)) Test set proportions by age class age_test %>% count(age) %>% mutate(prop = n/sum(n)) 2) Data preparation and pre-processing. This step may include a) exclusion of CpGs the methylation of which has zero or near-zero variance across ages in the training dataset; b) dealing with multicollinearity by identifying CpGs with correlated methylation -a common feature in this type of data-; c) data transformation of centering and scaling variables to mean 0 and standard deviation 1; d) imputation of missing values if necessary. Imputations can be performed by the mice (Multivariate Imputation by Chained Equations) package in R (van Buuren and Groothuis-Oudshoorn, 2011).
Correlation of CpG methylation with other biological parameters that we want to account for, such as diet, sex or other environmental factors, can be dealt with by exclusion of the correlated CpGs when lots of CpGs. This type of correlation is likely to be confounding factor in the model if these biological parameters are parallel to age (i.e., we have many samples of younger males and older females).
Ridge regression. This regression may not be relevant in cases of RRBS or WGBS data since it keeps all CpGs available, but may be worth in cases of targeted methods (e.g., MBS).
sum ( set.seed(123) final <-train(age~., data = trainTransformed, method = "glmnet", trControl=finalmodelCtrl, tuneGrid = expand.grid(alpha = bestalpha, lambda = bestlambda)) predicted.final.train <-predict(final, trainTransformed) cor.test(predicted.final.train, trainTransformed$age) Evaluation of models during training as well as at the final model is done by assessing the predictive accuracy via loss functions comparing predicted age vs actual age. The measures to take into account and report include: a) Root Mean Squared Error (RMSE) = average deviation of the predictions from the observations. b) Mean Absolute Error (MAE) = average of the absolute differences between the observed and predicted values.
c) R 2 = the squared correlation between the observed and predicted values. This value shows how well the selected variables (methylation of CpGs) explain the variability of the dependent variable (age).
The errors should be minimized while the R 2 should be maximized.
Epigenetic clocks are considered valid if the correlation (R) is higher than 0.80 in large independent data covering a broad age range (Horvath and Raj, 2018). Piscine epigenetic clocks show a mean correlation of 0.93 (Piferrer and Anastasiadi, 2023), while higher values are also possible. Precision reported as MAE is used with actual time units (days, months or years) and shows a mean of 0.87 years in piscine clocks, or an average of about 3.5% of the total lifespan (Piferrer and Anastasiadi, 2023).

Conclusions
Epigenetic clocks for age prediction are typically constructed using DNA methylation sequencing technologies that involve the use of bisulfite conversion and provide information at single nucleotide resolution. Bioinformatic analysis of the data follows mostly standard procedures of sequencing reads analysis, however, care should be taken to account for C to T conversion during the alignment step. Methylation values are extracted per base and this results in the dataset consisting of individual fish aged samples as rows and methylation values of interrogated CpGs as columns. This multivariate dataset is submitted to machine learning procedures aiming to select features, i.e., CpGs the methylation of which is enough to predict age. The machine learning procedures used are penalized (or regularized) regressions which fit well the structure of the multivariate dataset. At the end of the procedure, the optimal model or "epigenetic clock" is constructed. This constitutes a molecular resource to be implemented by scientists and managers for accurate age prediction of fish. The simultaneous interrogation of the methylation of a few target CpGs forming the epigenetic clock of a large amount of samples in a ready-to-use kit constitutes the ultimate goal for application of this HTS to fisheries and conservation.

Author contributions
DA: Conceived and wrote the manuscript. FP: Conceived and edited the manuscript. All authors contributed to the article and approved the submitted version.