Accounting for Genetic Differences Among Unknown Parents in Bubalus bubalis: A Case Study From the Italian Mediterranean Buffalo

The use of genetic evaluations in the Water Buffalo by means of a Best Linear Unbiased Prediction (BLUP) animal model has been increased over the last two-decades across several countries. However, natural mating is still a common reproductive strategy that can increase the proportion of missing pedigree information. The inclusion of genetic groups in variance component (VC) and breeding value (EBV) estimation is a possible solution. The aim of this study was to evaluate two different genetic grouping strategies and their effects on VC and EBV for composite (n = 5) and linear (n = 10) type traits in the Italian Mediterranean Buffalo (IMB) population. Type traits data from 7,714 buffalo cows plus a pedigree file including 18,831 individuals were provided by the Italian National Association of Buffalo Breeders. VCs and EBVs were estimated for each trait fitting a single-trait animal model and using the official DNA-verified pedigree. Successively, EBVs were re-estimated using modified pedigrees with two different proportion of missing genealogies (30 or 60% of buffalo with records), and two different grouping strategies, year of birth (Y30/Y60) or genetic clustering (GC30, GC60). The different set of VCs, estimated EBVs and their standard errors were compared with the results obtained using the original pedigree. Results were also compared in terms of efficiency of selection. Differences among VCs varied according to the trait and the scenario considered. The largest effect was observed for two traits, udder teat and body depth in the GC60 genetic cluster, whose heritability decreased by −0.07 and increased by +0.04, respectively. Considering buffalo cows with record, the average correlation across traits between official EBVs and EBVs from different scenarios was 0.91, 0.88, 0.84, and 0.79 for Y30, CG30, Y60, and CG60, respectively. In bulls the correlations between EBVs ranged from 0.90 for fore udder attachment and udder depth to 0.96 for stature and body length in the GC30 scenario and from 0.75 for udder depth to 0.90 for stature in the GC60 scenario. When a variable proportion of missing pedigree is present using the appropriate strategy to define genetic groups and including them in VC and EBV is a worth-while and low-demanding solution.


INTRODUCTION
The Water Buffalo (Bubalus bubalis) is a large bovid mainly distributed in the Asian continent where the 97% of its world population is concentrated [Food and Agriculture Organization (FAO), 2020]. The name "water buffalo" is due to its adaptation to flooded or swampy areas, where it partially submerges and walks on the bottom mud without difficulty. The rest of the water buffalo world population (3%) is raised in the Mediterranean area historically characterized by the same optimal rearing conditions. In the European continent only the 0.2% of its world population is found and about 93% of these animals are located in southcentral Italy (Neglia et al., 2020). The total census in Italy has increased considerably over the last decade, making it one of the most important dairy species in the country. In 2019, 34,990 lactating buffaloes have been registered to the official herd book. Moreover, 666,960 controlled lactations and 9,953 type traits evaluations are available and officially recorded [Associazione Nazionale Allevatori Specie Bufalina (ANASB), 2020]. Thanks to the physical-chemical properties of its milk-high concentration in protein and fat (FC ∼ 8%) and favorable coagulation (Costa et al., 2020b)-the main zootechnical interest of the Italian Mediterranean Buffalo (IMB) is the production of the iconic traditional dairy products like the Mozzarella di Bufala Campana (Boselli et al., 2020), which has a great economic impact on the Italian food industry (ISMEA, 2020). Costa et al. (2020a,b) refers to the outstanding increase of IMB population size observed in the last 15 years, as well as the increase in terms of kilos of cheese produced, the larger herd size, the constant expansion in registered herds and the increment in milk price. Therefore, the economic interest in this specie makes it necessary to develop new innovative tools to improve the breeding process.
The implementation of genetic evaluations in the Water Buffalo based on a BLUP animal model has been increasing over the last decade across several countries (Agudelo- Gómez et al., 2015;Safari et al., 2018;Abdel-Shafy et al., 2020). The prediction of breeding values (EBVs) constitutes an integral part of most breeding programs which are based on two fundamental pillars: phenotypic data (e.g., milk production%, fat%, protein, or morphological trait) and genealogical information (i.e., a pedigree). However, if animals with unknown parents are present in the pedigree, bias in the prediction of both variance component (VC) and EBV is expected (Peškovičová et al., 2004;Petrini et al., 2015). BLUP methodology allows for the simultaneous estimation of fixed and random effects but gaps in the relationship matrix may jeopardize its unbiasedness due to the inability of correctly estimating and disentangling genetic and environmental components (Postma, 2006;Gómez et al., 2016;Wolak and Reid, 2017). Indeed, incomplete pedigree information can lead to inaccurate prediction of animal genetic potential, overestimating or underestimating animal breeding value and hampering decisions based on the selection eventually causing economic losses (Raoul et al., 2016;Carneiro et al., 2017;Abdel-Shafy et al., 2020).
One of the reason behind incomplete pedigree information is the use of natural mating, still common in the buffalo herds, which makes parentage assignment more complex.
Indeed, in IMB the use of the artificial insemination (AI) is still moderate (Parlato and Van Vleck, 2012). According to official data [Associazione Nazionale Allevatori Specie Bufalina (ANASB), 2020] and following a worldwide tendency (Singh and Balhara, 2016;Purohit et al., 2019), the proportion of natural mating in IMB decreased from around 76 to 62% from 2010 to 2019 [Associazione Nazionale Allevatori Specie Bufalina (ANASB), 2020]. These values, even if promising, are still lower than what it is observed in other species such as in dairy cattle, where the use of artificial insemination is close to 100% (Rodríguez-Martínez and Peña Vega, 2013;Ugur et al., 2019). Among the reasons why natural mating is still the most common reproduction technology for water buffalo there are physiological and reproductive aspects, herd management, breeding techniques, and organization (Neglia et al., 2020).
Despite being a routine analysis, it is almost impossible for the farmer to bear the total cost of parentage verification and to have his entire herd genotyped. In detail, in 2019 approximately 10,000 individuals have received a type trait evaluation in Italy but only 4,671 were DNA tested [Associazione Nazionale Allevatori Specie Bufalina (ANASB), 2020]. Hence, we are in a situation where phenotypic data are available for many animals, but a large proportion of these animals do not have complete pedigree information. Despite this limitation, the number of paternity tests in IMB in year 2019 showed a two-fold increase compared to year 2018.
Moreover, parentage testing is often reserved only for the best animals causing additional biases in the genetic evaluation being eventually based on a selected and non-random sample of the effective population. Furthermore, the possibility of using a larger number of data, albeit with incomplete pedigree, allows to observe all the variability of the trait of interest and therefore to obtain more accurate estimates.
The problem of incomplete pedigree has existed for many years and continues to be one of the main issues in genetic evaluations. Several researchers have worked on possible statistical approaches in order to correct for the presence of gaps in the pedigree (Peškovičová et al., 2004;Carneiro et al., 2017;Tonussi et al., 2017;Shiotsuki et al., 2018;Nwogwugwu et al., 2020;Macedo et al., 2020). The implementation of new technologies such as high-throughput single-nucleotide polymorphism (SNP) genotyping will certainly solve most of the problems linked to uncertain paternity but this is true only for individuals who are still alive or whose biological samples are available. Moreover, although genomics is the new standard in breeding and genetics, there are still some problems that need to be solved regarding how to cope with missing pedigree information (Tonussi et al., 2017;Misztal et al., 2020).
One suggested solution when dealing with an incomplete pedigree is the use of "Genetic Groups" approach, suggested over 30 years ago by Westell et al. (1988). This approach is based on the concept that subjects born in a certain period or coming from a certain area are the result of specific selective choices and therefore "genetically different" from other subjects born in other periods or from other areas.
The inclusion of genetic groups in VC and EBV is a method that has been adopted and extensively validated, as an example, in beef and dairy cattle (Perez-Enciso and Fernando, 1992;Sullivan, 1995;Theron et al., 2002;Peškovičová et al., 2004;Phocas and Laloë, 2004;Petrini et al., 2015;Wolak and Reid, 2017). The assignment of genetics groups to animals with uncertain genealogy represents a simple and effective solution to increase the accuracy of genetic evaluations (Henderson, 1988;Cardoso and Tempelman, 2003).
However, a crucial aspect is the strategy used to define the genetic groups. Therefore, the aim of this study was to evaluate the use of different genetic grouping strategies and its effects on VC and EBV estimation for 5 composite and 10 linear traits in the IMB population.

Ethics Statement
Animal welfare and use committee approval was not needed for this study as datasets were obtained from pre-existing databases based on routine animal recording procedures.

Data Description
Data for the present study were provided by the Italian National Association of Buffalo Breeders (ANASB) and consisted of linear appraisal records from years 2004 to 2020. The initial data set included 79,342 IMB cows from 464 herds phenotyped for fifteen type traits. The type traits were five composite traits, namely, final score (FS), structure (ST), feet and legs (FL), yield potential (YP) and udder teat (UT), and 10 linear traits, namely, stature (STAT), body depth (BD), body length (BL), foot angle (FA), fore udder attachment (FUA), rear udder width (RUW), udder depth (UD), teat placement (TP), teat length (TL), and body condition score (BCS). The median age at evaluation was 46 months. The scale used for scoring varied according to the set of observed traits. Composite traits were scored on a 65-100 scale, linear traits were scored on a 1-50 scale and BCS was scored on a 4.5-9.5 scale. Overall 17 official classifiers were enrolled in the scoring procedures. Data editing consisted of retaining only cows from herds with at least two contemporaries (i.e., individuals classified by the same classifier in the same round of classification) and whose ascendants were confirmed by a DNA parentage test. Finally, 7,714 buffalo cows belonging to 194 herd with a pedigree containing 18,831 individuals were used in the analysis. Descriptive statistics are in Table 1.

Alteration of Genetic Relationships and Grouping Strategies
The impact of different genetic grouping strategies on VC, EBV, and their accuracies (ACC) was investigated using the original pedigree and a modified pedigree where two different proportion of missing genealogies, namely, 30% (30) and 60%, (60) were randomly introduced. The choice of using these two thresholds was based not only on the need to mimic the real situation observed across ANASB farms but also to investigate the effect of moderate or massive pedigree gaps. After introducing the missing genealogy, the individual was assigned to a specific genetic group. Genetic groups (GG) were created following two clustering methods. The first method (Y) was based on the year of birth and on an average generation interval, which for the IMB was defined (based on an estimation on actual IMB data) as 6 years. Individuals born before 1985 was considered as base animals and assigned to group 1. The remainder of the buffaloes was assigned to six different groups.
The second grouping strategy (GC) was based on the genetic distances estimated from the original pedigree. The procedure consisted of two steps. In the first step the pedigree-based additive relationship matrix was calculated and used as input for a hierarchical cluster analysis using a complete-linkage clustering method (Kaufman and Rousseeuw, 2009). This method works in a bottom-up manner. Each object is initially considered as a single-element cluster (leaf). At each step of the algorithm, the two clusters that are the most similar are combined into a new bigger cluster (nodes).
This procedure is iterated until all points are member of just one single big cluster (root). The result is a tree that can be plotted as a dendrogram. In the second step, the dendrogram is visually evaluated to define a priori the cut-off level that will identify the number of clusters (i.e., genetic groups). Each individual is then assigned to a particular cluster. Following the above mentioned procedure, fourteen different genetic groups were created (Supplementary Figure 1).
In detail at the end of the procedures, four scenarios were created according to the grouping strategy (Y or GC) and the proportion of missing genealogies (30 or 60%).
Successively, VC, EBV, and ACC were estimated for each trait presented in Table 1 fitting a single-trait animal model and using the original pedigree (GOLD) and the four scenarios, namely Y30, Y60, GC30, and GC60. Estimates from GOLD were considered as gold standard. The estimation of VC, EBV, and ACC was repeated 10 times per each scenario (Y30, Y60, GC30, and GC60). The average number of animals and its standard deviation per scenario are shown in Table 2. a Grouping strategy based on the year of birth and on an average generation interval set to 6 years. b Grouping strategy based on the genetic distances estimated from the original pedigree.

Genetic Analysis
The following single-trait animal model with groups was used to estimate VC, their corresponding heritability, and breeding value for each considered trait: where y ijklm is the score of each trait for a given buffalo cow; µ is the overall mean; hyc i is the fixed effect of the ith herdyear of evaluation-classifier (i = 1,. . .957); PA j is the fixed effect of the jth age nested within parity (j = 1,. . .173); DIM k is the fixed effect of the kth days in milk (k = 1,. . .30); NM l is the fixed effect of the lth number of milking (l = 1,. . .3); a m is the random additive genetic effect of the mth buffalo; g n is the fixed group effect based on Y or GG and containing the nth ancestor; t mn is the additive relationship between the nth and mth animals and the summation is over all p ancestors of animal m; and e ijklm is the random residual effect.
In matrix notation, the model can be written as: where matrix X is an incidence matrix relating phenotypic records in vector y to fixed effects in vector b, matrix Z a is an incidence matrix relating phenotypic records in vector y to animal additive genetic effects in vector a, matrix Q a is an incidence matrix relating animals in vector a to unknown parent groups in vector g a . Vectors a and e have means 0 and variances Aσ 2 a and σ 2 e , respectively. The corresponding mixed-model equations were: Solving the equations the breeding value of an animal m will be: The accuracy of EBV was calculated as recommended by Aguilar et al. (2020): where SE is the standard error for the animal solution i in trait j, fx corresponds to individual inbreeding and v a is the additive variance σ 2 a .

Comparison of Analysis
Results from different scenarios were compared based on descriptive statistics (i.e., mean and standard errors) of VC, Pearson's correlations between EBVs grouped by animal status (i.e., bulls with at least 10 daughters, buffalo cows with or without progeny), re-rankings of first 10 bulls, efficiency of selection (SEf) as defined later and genetic trends, estimated by the linear regression of EBVs on year of birth. The SEf was calculated as proposed by Petrini et al. (2015) and Peškovičová et al. (2004), which defined SEf as the ratio between EBVs excluding (x gg_GG ) and including genetic groups (x GG_GG ): The SEf was calculated for the best 10, 30, and 50% animals, respectively.

Softwares
Data preparation and editing, and all statistical analysis were performed using the R programming environment v.3.6.1 (R Core Team, 2019), except VC which were estimated using AIREMLF90 (Misztal et al., 2002) and EBV which were obtained using BLUPF90 (Misztal et al., 2018). The R package optiSel (Wellmann, 2019) was used to calculate the pedigree-based additive relationship matrix and the package stats for the hierarchical cluster analysis (R Core Team, 2019). The analyses were run on the ANASB server 1 using an Intel R Pentium R CPU G3220 @ 3.00GHz, with 2 CPUs and 16 Gb of RAM.

Data Overview
Descriptive statistics for the analyzed traits are shown in Table 1. The deviation from the normal distribution was moderate, with kurtosis values ranging from 0.03 to 2.07. Traits distribution was skewed to the right (Supplementary Figure 2) and the average coefficient of variation was 2.8 and 24.4% for composite and linear traits, respectively.

Variance Components and Heritability
The VC and heritability estimates from the different scenarios are shown in the Tables 3, 4 for composite and linear traits, respectively. The estimated genetic variance was highest for five linear traits (STAT, FUA, RUW, UD, and TL), intermediate for BD, BL, FA, and TP, while the lowest were for composite traits and BCS. On average, the estimates of the additive variances from the GOLD scenario were the highest, observing largest differences  σ 2 a = additive genetic variance; σ 2 e = residual variance; σ 2 p = phenotypic variance; h2 = hereditability; s.e. = standard error.
Frontiers in Genetics | www.frontiersin.org with GC60 for the STAT (−0.60) and FUA (+1.31) traits, respectively. Differences among heritability estimates varied according to the trait and the scenario considered and are presented in Figure 1. The green line identifies the heritability from the GOLD scenario. The largest differences were observed in the scenario GC60 for trait UT (0.22 vs. 0.15) and for trait BD (0.15 vs. 0.19). Moreover, GC60 showed the highest within-trait variability, with maximum differences for UT, BCS, and FS (0.39, 0.21, and 0.18, respectively), and minimum differences of 0.08 for RUW and STAT (result not show). Standard errors of heritabilities for all traits were low, ranging from 0.03 (GOLD) to 0.05 (GC60).

Correlations Between Breeding Values
The correlations between EBVs in the different scenarios are shown in Table 5. Results differed depending on sex: higher estimates were observed in the female population when using a grouping strategy based on the year of birth (Y), while for the bulls higher estimates were observed with the genetic clustering strategy (GC). On average, the correlations were  positive and high. Considering buffalo cows with records, the average correlation across traits between official EBVs and EBVs from different scenarios were 0.91, 0.88, 0.84, and 0.79 for Y30, GC30, Y60, and GC60, respectively. The best results were observed for STAT, UD, and TL (average r = 0.91) while the most affected trait was FA in the scenario GC60 (r = 0.68).
In the case of bulls, the correlation between EBVs in the grouping GC30 ranged from 0.90 for FUA to 0.96 for STAT and BL, while, in the GC60 scenario the values range between 0.75 for UD to 0.90 for STAT ( Table 5). As expected, the highest correlations occurred in scenarios where the proportion of missing pedigree was lower (i.e., Y30 and GC30).

Accuracy of Breeding Values
The accuracy of breeding values across traits and scenarios for bulls with at least 10 daughters and buffalo cows with own record are presented in Table 6. The drop in accuracy for bulls ranged from 0.06 for stature in the scenario GC30 to 0.24 for YP in the scenario Y60. Similar pattern was observed in buffalo cows, with higher accuracies in the Y30 and GC30 scenarios. On average the best results were shown by GC30 (average accuracy = 0.43) and Y30 (average accuracy = 0.42), while the worst results were in the scenario GC60 (average accuracy = 0.34) and Y60 (average accuracy = 0.32) (Figure 2).

Selection Efficiency
The result of the average selection efficiency for the three different selection intensities (top 10, 30, and 50%) for composite and linear trait are summarized in the Observing the average intensity of selection across scenarios, the highest value was in GC30 (81.27%), followed by 78.75, 67.41, and 65.22% in Y30, GC60, and Y60, respectively. The average intensity of selections for the best 10, 30, and 50% were 73.16, 60.40, and 42.31%, respectively.
Within each scenario, selection efficiency in composite traits was more effective than in linear traits. When the best 10% of individuals were selected, four out of five composite traits had a  selection efficiency higher than 60%, while only three out of 10 linear traits exceeded such a threshold ( Table 7). A similar trend was observed selecting 30% (3/5; 4/10 ≥ 50.01%) or 50% (3/5; 4/10 ≥ 32.91%).
In terms of standard deviation, the GC30 scenario showed the lowest standard deviation (average = 4.61), while the values obtained from GC60 and Y60 tend to be higher, with an average SD of 7.94 and 7.82, respectively.

Re-Ranking
The effect of the different genetic grouping strategies on the ranking of the bulls was explored using only three linear traits, with high, medium, and low heritability, namely STAT (h 2 = 0.35), UD (h 2 = 0.23), and FA (h 2 = 0.10). Spearman's rank correlation calculated on 111 bulls in STAT-UD-FA were 0.921-0.884-0.842, 0.913-0.852-0.728, 0.846-0.695-0.659, and 0.811-0.690-0.587 for GC30, Y30, GC60, and Y60, respectively. The consistency of ranking across grouping strategy can also be effectively visualized with a target plot (Biscarini et al., 2016). The rankings of the first 10 bulls across replicates and grouping strategy for STAT, UD, and FA are presented in Figures 3-5, respectively. Each cloud of points represents the ranking of the bull across replicates and within grouping strategy. When the points within the clouds are more dispersed, a larger re-ranking was observed (e.g., BULL9 for STAT trait).

Genetic Trend
The genetic trends for both composite and linear traits are presented in Figures 6, 7. Overall a flat trend was observed until year 2013 for all traits. After this year, positive trends were observed and differences among years were enhanced on including genetic groups. For composite traits, an underestimation of the genetic trend was observed when the GC30 and GC60 grouping strategies were used. Specific behaviors were detected across linear traits. Genetic trends for STAT, FUA, and TL showed the same pattern as the composite traits. BD and BL showed an uneven trend, with a clear positive trend from year 2014. However, when using GC30 and GC60 grouping strategies, EBVs were more regressed than when EBVs were estimated using a grouping strategy based on the year of birth. Similar results were observed for FA and UD where, particularly for recent years, Y30 and Y60 EBVs were higher than GC30 and GC60 EBVs. Finally, BCS showed a flat trend until 2014 followed by a slight decrease, a pattern common to all grouping strategies.
The different grouping strategies have had an impact on the EBVs scale. From year 2000 the average increase in the scenario without genetic groups (GOLD) was +0.032 for composite traits and +0.014 for linear traits (Figure 8). The average increase in composite traits was +0.046, +0.042, +0.026, and +0.020 when the Y30/Y60/GC30/GC60 genetic group was used, respectively. The same order was observed in the linear trait set with an average increase of +0.020, +0.018, +0.009, and +0.006.

DISCUSSION
In this study, the effect of two genetic grouping strategies on the estimation of VC and EBV for type traits in a parentagetested IMB sub-population was evaluated. In the last three years the IMB has experienced an exponential increase in term of registered animals in the Herd Book. As a consequence, IMB is facing a situation where phenotypic data are available for many animals, but some animals lack complete genealogical data. Records from individuals without pedigree information has been excluded from the genetic evaluation or assumed to have an unknown sire. Such practice results in loss of information and potentially could compromise expected genetic gain (Sapp et al., 2007). To mitigate this undesirable effect, several statistical methods have been developed over the years. The use of genetic grouping, parentage probabilities, use of phenotypic information to increase the probability of defining the paternity, iterative empirical Bayesian model (ITER), Bayesian hierarchical model (HIER), and model based on the average relationship matrix (ANRM), have been applied to account for uncertain paternity (Henderson, 1988;Peškovičová et al., 2004;Sapp et al., 2007;Petrini et al., 2015;Carneiro et al., 2017;Shiotsuki et al., 2018;Macedo et al., 2020).
Genetic groups are normally created according to different criteria, for example on the basis of origin, sex, herd, or year of birth of the individual. The creation of the GG is not a simple procedure and can sometimes present some practical problems. Genetic groups modeling must be balanced as groups with few animals might impair the estimation of the GG effect (Rodriguez et al., 1996;Peškovičová et al., 2004;Petrini et al., 2015). At the same time, very large groups are not able to capture the actual differences which exist among individuals. However, (Quaas, 1988) warned about potential bias in defining a determinate FIGURE 3 | Ten best ranked bulls for the Stature trait according to the different genetic group in the IMB. When the points within the clouds are more dispersed, a larger re-ranking was observed.
FIGURE 4 | Ten best ranked bulls for the Udder depth trait according to the different genetic group in the IMB. When the points within the clouds are more dispersed, a larger re-ranking was observed.
FIGURE 5 | Ten best ranked bulls for the Foot angle trait according to the different genetic group in the IMB. When the points within the clouds are more dispersed, a larger re-ranking was observed.
grouping strategy due to the effects of confusion between groups. In our case, the "phantom" parents of an individual are always assigned to the same group, because the grouping is based on animal itself, not on its parents, as shown by other studies (Peškovičová et al., 2004;Shiotsuki et al., 2013;Petrini et al., 2015;Wolak and Reid, 2017).
Results have shown that including GG in the mixed model equation had an effect on the estimates of both VC, which can be observed in Tables 3, 4, and EBV (Table 5). Pieramati and Van Vleck (Pieramati and Van Vleck, 1993) obtained lower estimates of additive genetic variance with models that included genetic group. However, we have found that the estimates of VC and EBV with the Y30 and GC30 genetic groups are quite close to the GOLD estimates. These results support the efficiency of the methodology to estimate the true parameters. According to the magnitude of heritability estimates, the GC60 scenario was the one that showed the largest discrepancy with GOLD, confirmed by the highest SE (0.05). Petrini et al. (2015) suggested that such result may be caused by the structure of the group itself. Indeed, the size of GG should be homogeneous and well balanced. In the present study, when a genetic clustering strategy was used, a greater number of groups with a more heterogeneous size was observed. These results depend on the pedigree structure of the IMB, because its completeness is mainly related to the use of artificial insemination. Bulls used for AI have a more complete pedigree both on paternal and maternal side. The fourteen groups used in the GC strategy ( Table 2) are based on the relationship matrix and hence are strictly related to the completeness of the paternal line. Indeed in the GC scenario we had a particular group -namely group 1 -which basically included all individuals with no pedigree information and whose size was from 10 to 20fold larger than the others. Those evidences matched results from Santana et al. (2013) and Shiotsuki et al. (2013) who stressed the importance of the structure of the groups, especially in terms of their number and size (Petrini et al., 2015).
As expected, EBVs accuracy decreased when an increased proportion of missing pedigree was simulated (Table 6). However, when the proportion of missing pedigree was 30%, the average percentage point drop in accuracy was 10 and 7 for bulls and buffalo cows, respectively. We can therefore hypothesize that the contemporary use of the available pedigree information and of the most appropriate GG strategy will mitigate the loss in accuracy of the EBV due to missing pedigree information. Sullivan (1995) suggested the importance of the inclusion of genetic groups in EBV estimation and that data should not be discarded due to the uncertainty of the paternities. Surely, the problem of uncertain paternities might possibly be mitigated by the use of genomic selection (Abdel-Shafy et al., 2020;Macedo et al., 2020;Misztal et al., 2020), however, the genotyping of all animals in a herd might still be too expensive. In the case of IMB, the use of GG is a practical and no cost solution to integrate all the available information into the genetic evaluations process eventually not compromising the accuracy of the results.
On the other hand, Pearson's correlations between EBVs were generally high in all clustering scenarios. However, Y30 and GC30 scenarios showed the highest correlations. Several studies have shown that correlation coefficients between EBVs lower than 0.70 could suggest changes in the classification of animals (Crews and Franke, 1998;Petrini et al., 2015). Moreover, if we analyze results within traits, we can observe a relationship with heritability value. In our case, the trait that had the lowest correlation coefficient (r = 0.68) was FA, whose h 2 was 0.10. In addition, observing the correlations within sex, the Y30/Y60 genetic group strategy showed the highest coefficients for buffalo cows, while for bulls GC30 was the most appropriate for the data. This result was somewhat expected because the strategy based on the hierarchical clustering is strictly related to the relationship matrix, i.e., on the pedigree information. The number of AI bulls in the IMB population is limited (n < 100) and most of them have common ancestors. This means that grouping based on the relationship matrix will be possibly biased by the sire's pedigree. Actually, all individuals with both parents missing have been assigned to group 1 ( Table 2), possibly regressing their breeding value. On the other hand, the year of birth has a more balanced behavior and it is less linked to the pedigree. Therefore, our results suggest that the EBV and consequently the ranking of the animals, will be closely influenced by the nature of the trait and by the structure and type of grouping adopted (Shiotsuki et al., 2018).
Considering SEf, several studies suggest that it can be used as a measure of the correlation between the ranking of the best animals obtained in the different analyzes and that would in turn provide information on the degree of efficacy of the genetic grouping strategy (Theron et al., 2002;Peškovičová et al., 2004;Petrini et al., 2015). A value above 70% would indicate that the ranking observed in the different scenarios is stable and does not undergo a significant re-ranking. In relation to what we observed in this study, when the selection intensity is 10%, practically all traits exceeded this threshold (14/15 traits in Y30 and 15/15 in GC30). Meanwhile, in the scenario where the proportion of missing of pedigrees was 60% only 5/15 traits showed a value of SEf higher than 70%. These results suggest that bulls that are above the 90th percentile would experience virtually no important changes in their ranking. Another aspect worth noticing is the standard deviation of SEf. If a large standard deviation is observed, the response to selection will be more unstable and less accurate (Peškovičová et al., 2004). In this regard, the genetic group GC30 showed the lowest standard deviation while results obtained from GC60 and Y60 were more unstable. Consequently, when considering a high correlation and SEf, in addition to a low SD, we retain that the ranking of the bulls will be consistent. The inclusion of GG in the genetic evaluation could have unpredictable but substantial effects on the estimated genetic trend (Saavedra, 2019). Furthermore, the exclusion of genetic groups or having paternities with "phantom" parents could lead to biased estimates of selection response (Theron et al., 2002). In our study, these expectations are met, observing how the cumulative genetic trends without genetic groups were slightly lower than those estimated with the Y30/Y60 genetic group. Upward trends may indicate that the grouping type "year of birth" may be comparable to those obtained in GOLD. Other study, obtained some indication that the best strategy was grouping phantom sires according to the year of birth and the phantom dams in a single group due to the slow genetic change in females over the generations (Casellas et al., 2007). Theron et al. (2002) and Shiotsuki et al. (2013) observed higher genetic trends when they included GG in the analyses. Those results did not agree with (Petrini et al., 2015) where the inclusion of GG in genetic analyses showed a lower genetic trend.
The effectiveness of including GG on genetic evaluation depends on the genetic structure of the population, the nature of the observed trait (Petrini et al., 2015) and the criterion adopted to define GG. Several authors recommended that the definition of the GG should be a balance between complexity of the method and the adequate representation of genetic differences (Rodriguez et al., 1996;Peškovičová et al., 2004;Petrini et al., 2015;Carneiro et al., 2017;Shiotsuki et al., 2018). The adoption of an inappropriate method may not only have consequences on genetic progress (at the population level), but also on the choice of the best animals that will be used at the herd level. On the other hand, a change in the pedigree structure tends to have a higher impact on traits with medium-low heritability. In our study, this fact occurred with the FA trait, where GC30/GC60 scenarios had the largest correlation with GOLD. On the other hand, for traits with high heritability, the weight of the phenotypic information is high, therefore, the use of GG would have a lesser effect on the estimates. According to Cardoso and Tempelman (Cardoso and Tempelman, 2003), differences between the models that take into account uncertain paternity do not necessarily increase with increasing heritability, but these differences will be greater for the traits of medium-low heritability. In addition, individuals that have a greater number of ancestors or progeny with an incomplete pedigree will be more affected, in particular young animals with no own phenotypic information.
The lack of pedigree information is a common problem among domestic species, being more pronounced in less represented breeds that are mainly managed by small farmers with scarce economic resources. Resolving the uncertainty of paternity has always been a topic of interest to the scientific community and for decades various methodologies have been developed that allow managing the presence of gaps in a relationship matrix. Nowadays, there are different tools to improve the knowledge of genealogical information, such as DNA-based methods, but these are still expensive for breeders. Likewise, in those species that have recently implemented the genetic evaluation system they may face this problem, as they may be in the situation where they possess historical phenotypic data from which it is almost impossible to obtain biological samples due to the absence of a DNA banks.
The prediction of the genetic value with models that consider the uncertainty in paternity have been shown to have better precision (Cardoso and Tempelman, 2003;Sapp et al., 2007;Shiotsuki et al., 2012;Shiotsuki et al., 2013;Carneiro et al., 2017;Shiotsuki et al., 2018). Its effectiveness depends on the definition of the grouping strategy (Petrini et al., 2015), which requires prior knowledge of: (a) the selection process of the breed, (b) the sources of genetic variation present in the population, (c) the intensity of selection or the generational interval. It is clear that GG should be included in the model to improve the accuracy of the EBV of animals with some degree of unknown paternity (Saavedra, 2019). Therefore, the use of genetic groups can be considered an effective alternative in the absence of relationship data for VC and EBV.

CONCLUSION
Pedigree completeness is a fundamental requirement of any genetic evaluation. In species other than dairy cattle, the presence of individuals with phenotypic records but with an incomplete pedigree is not a trivial matter. Buffalo breeding is an example of such a situation. We do expect a more extended use of DNA testing which will eventually increase the implementation of genomic selection approaches in Buffalo species as well. However, missing information in the pedigree will still be present and even genomic selection will be faced with the same problem. When a variable proportion of missing pedigree information is present in a population under selection, including genetic groups in the mixed model equations for both VC and EBV estimation is a worth-while and low-demanding approach to mitigate the loss in accuracy. Different strategies can be used to create genetic grouping depending on data distribution across years and on population structure. In the IMB population the best results were obtained when grouping was based on the year of birth. These findings confirmed the possibility of developing a genetic evaluation in populations with uncertain paternities without the need to exclude data or to use only a select of the available population.

DATA AVAILABILITY STATEMENT
The data analyzed in this study was obtained from Italian National Association of Buffalo Breeders (ANASB). Requests to access these datasets should be directed to d.rossi@anasb.it

ETHICS STATEMENT
Ethical review and approval was not required for the animal study because Animal welfare and use committee approval was not needed for this study as datasets were obtained from pre-existing databases based on routine animal recording procedures.

AUTHOR CONTRIBUTIONS
SB and MGC conceived and designed the work. DR, RC, GZ, YG, and DA were responsible for updating and editing the data. SB, RDP, and MGC contributed to analyzing the data and interpreting the results. MGC and SB wrote the manuscript with input from all the authors. All authors revised the manuscript, contributed to the article, and approved the submitted version.