Child Face Recognition at Scale: Synthetic Data Generation and Performance Benchmark

We address the need for a large-scale database of children's faces by using generative adversarial networks (GANs) and face age progression (FAP) models to synthesize a realistic dataset referred to as HDA-SynChildFaces. To this end, we propose a processing pipeline that initially utilizes StyleGAN3 to sample adult subjects, which are subsequently progressed to children of varying ages using InterFaceGAN. Intra-subject variations, such as facial expression and pose, are created by further manipulating the subjects in their latent space. Additionally, the presented pipeline allows evenly distributing the races of the subjects, making it possible to generate a dataset that is balanced and fair with respect to race. The created HDA-SynChildFaces consists of 1,652 subjects and a total of 188,832 images, each subject being present at various ages and with many different intra-subject variations. Subsequently, we evaluate the performance of various facial recognition systems on the generated database and compare the results of adults and children at different ages. The study reveals that children consistently perform worse than adults on all tested systems, and that the degradation in performance grows the younger the subjects are. Additionally, our study uncovers biases in the recognition systems, with Asian and Black subjects and females performing worse than White and Latino Hispanic subjects and males.


I. INTRODUCTION
The use of facial recognition systems in various domains such as surveillance, airports, and personal devices has been well established. These systems have proven to be highly effective and accurate in verifying the identity of subjects [1], [2]. However, as facial recognition becomes increasingly integrated into our daily lives, it is crucial to consider the potential for biases and discrimination against certain demographic groups. Previous research has investigated this issue [3], but less attention has been given to the effect of age on the recognition of children's faces. This area is essential, as there are numerous potential applications for face recognition systems for children. For instance, police can use it to find kidnapped or lost children. Another use case is an automated process for analyzing seized child sexual abuse material (CSAM) to recognize victims. In 2019, more than 70 million CSAM videos and images were obtained¹. This is an increasing problem, with 17 million reports of CSAM in 2019 and a dramatic increase to 29.3 million reports in 2021². Due to this immense amount of data, automated systems for identifying children in such material are necessary, which requires effective face recognition systems.

Fig. 1: Examples of face images generated by StyleGAN3 [7] (leftmost) with progressed child faces of varying ages using InterFaceGAN [8].
The emergence of deep learning in recent years has proven extremely useful for face recognition [2]. A caveat of these models is that they need a huge amount of training data to achieve state-of-the-art performance. The amount of data needed has become a growing concern due to increased legal and political scrutiny surrounding the privacy issues associated with large datasets of individuals' faces [4], [5], [6]. The databases currently used for research in this area are often limited in size, constrained, focused on specific ages or races, and frequently retracted due to privacy concerns. This issue is further exacerbated for children due to the heightened focus on protecting their rights.
To address the aforementioned issues, we present a novel pipeline for creating a synthetic face database containing the same subjects at adult age as well as at different child ages, see figure 1. To do so, state-of-the-art generative adversarial networks (GANs) and face age progression (FAP) models are combined, enabling the generation of the first large-scale synthetic child face image database, referred to as HDA-SynChildFaces. Two open-source and one commercial face recognition system are evaluated on this database. It is found that the recognition performance of all tested systems decreases for younger age groups. Evaluations on further demographic subgroups, i.e. gender and race, additionally reveal certain biases in the tested face recognition systems. The generated HDA-SynChildFaces dataset, which can be used to train or evaluate face recognition systems, will be made available to the public domain to facilitate reproducible research [15]. Additionally, the source code for the data generation pipeline will also be made available: https://github.com/dasec/HDA-SynChildFaces-AgeTransformation.
The rest of this work is organised as follows: section II briefly discusses related work on face recognition for children and face age progression. The database generation process is described in detail in section III. Experiments are presented in section IV and a discussion is provided in section V. Finally, conclusions are drawn in section VI.

II. RELATED WORK

A. Child Face Recognition
Multiple efforts have been made to create datasets of children at different ages to evaluate or train facial recognition systems. However, many datasets used in research are not publicly available due to ethical and privacy concerns. These efforts largely fall into two categories: datasets obtained in controlled settings, such as [10], [11], [12], and [13], and web-scraped datasets, such as [9], [14]. An overview of the relevant datasets and their statistics is presented in table I.
In general, the controlled datasets are obtained in environments where the researchers control different factors such as pose, facial expression, illumination, and the age gap between sessions. This makes it easier to isolate age differences as the factor under study. However, one limitation of these datasets is the potential for racial and demographic bias in the sample population. The web-scraped datasets are often less constrained and have more variation in the images, which makes it more difficult to distinguish between the effects of age and other factors on the performance of facial recognition systems. Most of the datasets are used for longitudinal studies of the performance of facial recognition systems. NITL [10] is a longitudinal dataset focusing on children aged 0-4. The data were collected at a free paediatrics clinic in Dayalbagh, India, over four sessions between March 2015 and March 2016. Their experiment compares facial recognition accuracy on images from the same session with images from different sessions. They found that verification accuracy decreased by 50% once children had aged six months, compared to verifying images taken in the same session. The difference was even more significant when only looking at children aged 1-2 in the first session, where the accuracy decreased by 82%.
In the three longitudinal studies [11], [12], [13], the datasets were collected in cooperation with schools. These datasets are not publicly available due to privacy reasons regarding the subjects. In [11], the CLF dataset consists of facial images of children aged 2-18, with constrained images taken of the same subject over time (avg. 4.2 years). The YFA [12] dataset contains images of volunteering children captured over time at a local elementary and middle school. The target of the dataset is to investigate how changes in age influence facial recognition systems, and thus the captured images are limited with regard to changes in pose, illumination and expression. The images are taken over multiple sessions with a maximum total age gap of 3 years. The ICD dataset, used in ChildGAN [13], contains subjects with multiple images taken over time as well as subjects with only a single image. The facial images are divided into five different sets based on the following age groups: 2-5, 6-8, 9-11, 12-14, and 15-19. All of the images were collected in India. The In-the-Wild Child Celebrity (ITWCC) [9] dataset is a recent longitudinal children database scraped from the internet, consisting of different celebrities. As the images are scraped from the internet, they are unconstrained, which makes it difficult to isolate age as a parameter when testing facial recognition systems. They present results showing that facial recognition systems have issues verifying the identities of non-adult aging subjects. Another, very recent, dataset scraped from the internet is the YLFW [14] dataset. Here, the authors scrape identities on the web using specific sets of keywords. A set of images is downloaded for each of these keyword sets and then filtered using hierarchical clustering. They then balance the dataset with regard to four races: Caucasian, Asian, African, and Indian. A manual procedure follows this process to verify the match pairs. In the
performance evaluation of facial recognition systems for children, they find that the systems perform significantly worse for children, as previous studies have also shown. However, they also show that training facial recognition systems on their dataset can reduce this difference.
In [16], subsets of the ITWCC [9] and LFW [17] datasets are used to compare the performance of facial recognition systems on adults and children. The authors compare eight different facial recognition systems and find that all of them were biased, performing significantly worse on children.

B. Face Age Progression
GAN-based architectures have not only proven their worth in generating synthetic images but also in performing face age progression (FAP). Grimmer et al. [18] have recently provided a comprehensive survey on deep face age progression, noting that GANs indeed produce remarkable face ageing results, cf. figure 1. Many of the FAP models covered in this section are based on GANs in some way.
In InterFaceGAN [8], the authors do not directly train a new GAN to do FAP but instead investigate the latent space learned by the original StyleGAN trained on the FFHQ dataset. The researchers train a linear model in which a boundary is learned to, e.g., change the age or gender of a generated image directly in the latent space. In [19], Alaluf et al. retrain InterFaceGAN on the StyleGAN3 latent space, thus taking advantage of the improved architecture for generating faces.
In [20], the authors handle FAP by proposing a new GAN architecture trained with labelled age groups (e.g. 0-2 or 50-69), enabling FAP by giving an input image and specifying the wanted age group. He et al. [21] note that many of the GAN-based FAP approaches end up with an entangled latent space in which they then manipulate the age. In their work, they instead propose a model where they disentangle key characteristics while modifying the age, also in different age groups. In [13], the authors also take a GAN-based approach to FAP learned with different age groups. AgeGAN, an architecture proposed in [22], uses a dual condition GAN architecture, where one generator converts input faces to other ages based on an age group condition, and the dual conditional GAN learns to invert the task.
Authors from Disney Research [25] propose a FAP model that does not use a GAN but instead a U-Net, translating in an image-to-image manner together with a provided age, and note promising results. A caveat of their model is that it can only progress subjects down to the age of 20, due to the training data used.
Many of the models proposed in the scientific literature, e.g. [20], [21], [23], [25], are end-to-end solutions, meaning that they take an image and a wanted age (or age group) as input and then output an image. In [8], and thus also in [19], an image is manipulated directly in the latent space of the StyleGAN variant, which skips the step of inverting or translating an image into latent space, which otherwise may come at a loss.

III. DATABASE GENERATION
The proposed pipeline for creating the desired biometric dataset consists of the following steps, which are described in detail in the subsequent subsections:

Sampling: This step handles the generation of synthetic faces, thus creating the initial database.
Filtering: The filtering step removes poor-quality and unwanted images from the initial database.
Race Balancing: As the generation of the initial faces is random, the distribution of the subjects' races may be skewed. This step evenly distributes the races in the database.
Age Transformation: Progressing an adult into a child is a key concept in this paper. This step progresses each adult into children of different age groups.
Intra-Subject Transformation: To biometrically benchmark a database, reference images need corresponding probe images, which this step is responsible for creating.
Post-Processing: This step performs automatic cleaning. It ensures that the same seeds are present in all age groups and tries to remove poorly transformed images.

A. Sampling and Filtering
In this work, StyleGAN3 is used to sample an initial set of face images. A subset of these initially sampled images is then chosen by filtering: first discarding images based on age and afterwards based on sample quality.
The age filtering step is implemented using the C3AE age estimator [26]: the age of each generated subject is estimated and subjects below a pre-defined age are rejected. The quality filtering step is implemented using SER-FIQ [27], a state-of-the-art quality score algorithm. The extracted quality score lies between 0 and 1, where 1 corresponds to an image of perfect quality. Figure 2 shows examples of accepted and rejected images based on the SER-FIQ score. Figure 3 shows the distribution of the quality scores for 10,000 generated subjects without any prior age filtering. The distribution looks Gaussian-like but with a heavy tail, which may be caused by artefacts or very young looking subjects, which usually get a low quality score.
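The two-stage filtering can be sketched as follows. The callables `estimate_age` and `serfiq_score` are hypothetical stand-ins for the C3AE and SER-FIQ model wrappers (their real APIs differ), and the thresholds are illustrative rather than the paper's exact values:

```python
# Sketch of the two-stage filtering step. estimate_age (C3AE) and
# serfiq_score (SER-FIQ) are hypothetical stand-ins for the real model
# wrappers; min_age and min_quality are illustrative thresholds.
def filter_samples(samples, estimate_age, serfiq_score,
                   min_age=20, min_quality=0.8):
    """Keep samples that pass the age filter first, then the quality filter."""
    kept = []
    for image, latent in samples:
        if estimate_age(image) < min_age:       # reject too-young subjects
            continue
        if serfiq_score(image) < min_quality:   # reject low-quality samples
            continue
        kept.append((image, latent))
    return kept
```

Filtering on age before quality matters in practice here, since very young looking samples tend to receive low quality scores anyway.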

B. Latent Transformation
The method used for estimating hyperplane boundaries for certain attributes follows the original InterFaceGAN paper [8], but uses the latent space of StyleGAN3 instead of StyleGAN. The separation boundary between two categories of an attribute is found by training a linear support vector machine (SVM) to identify a hyperplane that separates the two categories. For example, in the case of the gender attribute, the SVM could be trained to distinguish between Male and Female, see figure 4 (images with a blue border are categorized as women, those with a green border as men; the vector n is the normal vector to the hyperplane). This normal vector n can then be used to modify the latent code of an image by adding it to the latent code. This can be described as:

w_edit = w + α · n,

where w is the latent code of the image, α is a parameter choosing the degree of the edit, and w_edit is the resulting latent code after the manipulation. Figure 5 shows examples of this manipulation, and table II lists the trained boundaries (race categories classified with Deepface [30], illumination with DPR [31], and gender with Anycost GAN [29], cf. [19]). As mentioned in [8], the manipulation of a specific attribute can result in unintended changes to other attributes. This is due to the entanglement in the latent space and the correlation of the attributes in the images used for training the SVM. To minimize these unwanted side effects, a new conditional boundary can be calculated by projecting the boundary of the desired attribute onto the boundary of another attribute. This process can be formalized as:

n_cond = n_1 − (n_1ᵀ n_2) n_2,

where n_1 is the boundary for the desired attribute (e.g. smile), n_2 is the boundary for the unintended attribute (e.g. glasses), and n_cond is the conditional boundary. This new conditional boundary can then be used to edit the desired attribute. Furthermore, the pipeline uses the w latent vector, which is a single dimension of the w+ latent, as it is less entangled than the z latent, as mentioned in the original InterFaceGAN paper [8].
The need for neutralizing images with respect to certain attributes, such as pose, arises during image sampling in order to ensure the quality of the generated images. Here, we follow the process proposed in [32]. If one wanted to neutralize an image with respect to yaw using a trained boundary denoted as n_yaw, this can be described as:

w_neutral = w − (n_yawᵀ w) n_yaw,

where w is the initial latent code of the image and w_neutral is the latent vector of the neutralized image, i.e. w is projected onto the hyperplane. One could then use w_neutral to generate the neutralized image. This concept of neutralization is used several times throughout the pipeline and can be applied with any of the boundaries seen in table II.
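The three latent-space operations above (editing along a boundary, conditioning one boundary on another, and neutralizing an attribute) can be sketched with NumPy. The sketch assumes unit-norm boundary normals and hyperplanes through the origin, which is a simplification of the trained SVM boundaries:

```python
import numpy as np

def edit(w, n, alpha):
    """w_edit = w + alpha * n : move a latent code along a boundary normal."""
    return w + alpha * n

def conditional_boundary(n1, n2):
    """Project n1 off n2 so that editing along the result leaves n2's
    attribute (ideally) unchanged: n_cond = n1 - (n1 . n2) n2, renormalized.
    Assumes n2 has unit norm."""
    n_cond = n1 - np.dot(n1, n2) * n2
    return n_cond / np.linalg.norm(n_cond)

def neutralize(w, n):
    """Move w onto the boundary hyperplane: w_neutral = w - (n . w) n.
    Assumes n has unit norm and a hyperplane through the origin."""
    return w - np.dot(n, w) * n
```

With orthogonal toy boundaries, e.g. `n2 = [0, 1]`, conditioning `[0.6, 0.8]` on `n2` yields `[1, 0]`, and neutralizing `[2, 3]` against `n2` yields `[2, 0]`.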

C. Balancing Races
In contrast to real existing child face databases, we aim at creating a database that is equally distributed with respect to race. To do so, the trained race boundaries seen in table II can be used to change the race of individual subjects. Figure 6 shows examples of applying the individual race boundaries to the same subject.

Fig. 6: An example of moving a subject (leftmost) along each of the 5 race boundaries: Black, Middle Eastern, Latino Hispanic, Asian, and Indian.
Firstly, a database of images and latent vectors is sampled, where the race of each of the subjects is initially classified.Subsequently, a random subject of the most represented race is changed into the least represented race.This step is repeated until the races are uniformly distributed.
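The greedy rebalancing loop described above can be sketched as follows; `transform` stands in for the latent-space race edit (a hypothetical placeholder), and only the bookkeeping is shown:

```python
from collections import Counter

def balance_races(labels, transform):
    """Greedy rebalancing sketch: repeatedly change one subject of the most
    represented race into the least represented race until the counts are
    as uniform as possible. transform(index, target_race) is a placeholder
    for the actual latent-space edit."""
    labels = list(labels)
    while True:
        counts = Counter(labels)
        ordered = counts.most_common()
        most, least = ordered[0][0], ordered[-1][0]
        if counts[most] - counts[least] <= 1:
            break  # as uniform as the total count allows
        i = labels.index(most)   # pick a subject of the majority race
        transform(i, least)      # latent-space race edit (placeholder)
        labels[i] = least
    return labels
```

Starting from e.g. seven White, one Black and one Asian subject, the loop terminates with three subjects of each race.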
An example of the distribution of the races before and after balancing can be seen in figure 7. From the figure it can be seen that, initially, 70% of the sampled subjects are classified as White, while only 0.5% are classified as Black. It should be noted that a caveat of this approach is that it is largely dependent on the race classifier. That is, human inspection of the subjects' races may not always agree with the output of the classifier and algorithm.

D. Age Transformation
The latent transformations previously described are also used to transform the age of a subject. However, one problem with these transformations is that a subject is sometimes transformed poorly because it is moved too far in a direction in the latent space. This can, for instance, happen if the age classifier inaccurately predicts the age of a subject. An example of a subject being moved too far can be seen in figure 8. The first 3 images, surrounded by green boxes, look realistic and show the same person at progressively younger ages. In the last 3 images, surrounded by red boxes, it can be seen that by moving too far along the age direction the subject starts to look less human and unrealistic, i.e. poorly transformed.
A way to automatically detect such undesired effects is to perform a principal component analysis (PCA) and use it for outlier detection. We generated a large number of latent vectors (300,000) to fit the PCA and find the principal components. The idea is that the two most important principal components form a distribution, and that a transformed image is an anomaly if it lies too far away from the center. If an image is categorized as an anomaly, it should be removed, as it is likely to be poorly transformed. A visualization of this concept can be seen in figure 9.
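The PCA-based outlier detection can be sketched with plain NumPy. The 2-component PCA is fitted via SVD, and the distance threshold `k` (in standard deviations per principal axis) is illustrative, not the paper's tuned value:

```python
import numpy as np

def fit_pca_2d(latents):
    """Fit a 2-component PCA with plain NumPy: subtract the mean and take
    the top-2 right singular vectors as principal components."""
    mean = latents.mean(axis=0)
    _, _, vt = np.linalg.svd(latents - mean, full_matrices=False)
    return mean, vt[:2]

def is_anomaly(w, mean, components, projected, k=3.0):
    """Flag a latent as a poor transformation if its 2-D projection lies
    more than k standard deviations from the centre of the fitted cloud
    on either principal axis. The threshold k is illustrative."""
    p = components @ (w - mean)
    centre = projected.mean(axis=0)
    scale = projected.std(axis=0)
    return bool(np.any(np.abs(p - centre) > k * scale))
```

A latent pushed far along a principal direction, as happens with an over-progressed age edit, then falls outside the projected distribution and is flagged for removal.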
This approach is able to automatically detect the majority of poorly transformed subjects. Figure 10 shows more examples of subjects removed from the database because their images were categorized as anomalies.

E. Intra-Subject Transformations
The following subject- and environment-related properties are further modified to simulate intra-class variations: pose, expression, and illumination. These are implemented by manipulating the latent vectors using the linear boundaries trained as seen in table II. Figure 11 depicts all variations across different age groups for an example subject.

Fig. 9: 300k StyleGAN3 latent vectors projected onto their first and second principal components, where each blue dot corresponds to a subject. The red and green dots showcase the effect of transforming a single subject along the age direction, corresponding to the progression seen in figure 8, where the red dots move out of the distribution.
For changing the pose of the subject, two boundaries were trained, one for yaw and one for pitch, using the Hopenet pose estimator [28]. By default, the pipeline generates four variations per axis for each subject. The amount of illumination in an image is likewise controlled by a boundary, trained using the light classifier from the DPR model by Zhou et al. [31]. To change the facial expression of a subject, two boundaries were used, one for making a subject smile and one for making a subject look sad. Additionally, lossy compressed versions of each facial image are generated by saving the image in the JPEG format at different quality settings. It should be noted that the original reference images are saved in the lossless PNG format.
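The JPEG re-encoding step can be sketched with Pillow; the chosen quality levels are illustrative, as the exact settings are not listed here:

```python
from io import BytesIO
from PIL import Image

JPEG_QUALITIES = (90, 50, 10)  # illustrative quality levels

def compressed_variants(image, qualities=JPEG_QUALITIES):
    """Return lossy JPEG re-encodings of a PIL image at several quality
    settings; the reference image itself stays lossless (PNG) elsewhere."""
    variants = []
    for q in qualities:
        buf = BytesIO()
        image.save(buf, format="JPEG", quality=q)  # lossy re-encode
        buf.seek(0)
        variants.append(Image.open(buf))
    return variants
```

Each variant keeps the original resolution but introduces increasingly strong compression artefacts as the quality value decreases.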

F. HDA-SynChildFaces
The HDA-SynChildFaces database consists of 1,652 different subjects, each of which has been processed by the full pipeline explained above. A short overview of the used parameters can be seen in table III.
Here the original 1,652 subjects correspond to being of age 20 and above. Each of these subjects has been progressed to all of the younger age groups.
1) Gender Subset: Of the subjects, 40.3% are female and thus 59.7% are male, which is slightly skewed. This skewness arises during the filtering process, where the quality filter is slightly biased against women.
2) Race Subset: The race of each subject is also saved after the races have been equally distributed. This allows dividing the dataset into race-specific subsets, to see whether face recognition systems are biased against some races, and whether that changes across age groups. The number of images and different subjects for each subset can be seen in table V. Although the races are equally distributed after the race balancing step, they may become slightly unbalanced due to the post-processing step. For instance, as seen in the table, fewer Asian subjects remain at the end of the pipeline than subjects of other races.

IV. EXPERIMENTS

A. Experimental Setup
The HDA-SynChildFaces database is evaluated with multiple facial recognition systems to determine how their performance differs on children. The impact of race and gender is also evaluated, to investigate whether age affects these factors as well. The facial recognition systems under investigation are ArcFace [33]³ and MagFace [34]⁴, and a commercial off-the-shelf (COTS) solution. Before performing facial recognition with the two state-of-the-art open-source systems, face detection and alignment were done with RetinaFace [35], a state-of-the-art face detection system.
To evaluate the different recognition systems, biometric measures and metrics from the ISO/IEC 19795-1 [36] standard will be used. For the open-source systems, the mated (genuine) scores and non-mated (impostor) scores are calculated for each of the datasets using the cosine similarity measure, as seen in equation 4:

cos(A, B) = (Σ_i A_i B_i) / (√(Σ_i A_i²) · √(Σ_i B_i²))    (4)
Here A_i and B_i refer to the components of the feature vectors A and B extracted by a face recognition system. The COTS system uses its own proprietary similarity score. The mated comparisons are done by calculating the similarity score between each image and each of its corresponding variations. The non-mated comparisons are done by calculating the similarity score between an image and a random image from every other individual. The results will be evaluated using the following metrics:

FMR/FNMR: Following ISO/IEC 19795-1 [36], the False Match Rate (FMR) and False Non-Match Rate (FNMR) describe the performance of biometric systems. Specifically, the FMR is the percentage of non-mated comparisons that are incorrectly confirmed as matches at a specific threshold, while the FNMR is the percentage of mated comparisons that are incorrectly rejected as non-mated. In this experiment, the focus will be on evaluating the FNMR values under three distinct conditions, corresponding to FMR values of 0.01, 0.1, and 1 percent.
DET-curves: The Detection Error Trade-off (DET) curve is a plot visualizing the trade-off between the FNMR and the FMR.
EER: The Equal Error Rate (EER) is the rate at which FMR and FNMR are equal.
Distribution Statistics: The following common distribution statistics will be calculated to characterize the distributions of the mated and non-mated comparisons: mean µ and standard deviation σ.
Decidability Index: The decidability index, denoted as d', describes the amount of separation between two distributions. It will be calculated for the distributions of the mated and non-mated comparisons, where a larger value means a better separation between the two. It is calculated using the following formula [37]:

d' = |µ_m − µ_nm| / √((σ_m² + σ_nm²) / 2),

where µ_m and σ_m are the mean and standard deviation of the mated comparisons and µ_nm and σ_nm are the mean and standard deviation of the non-mated comparisons.
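The cosine similarity of equation 4 and the decidability index d' translate directly into NumPy; this sketch is only a restatement of the two formulas above:

```python
import numpy as np

def cosine_similarity(a, b):
    """Equation 4: cos(A, B) = A . B / (||A|| ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def decidability_index(mated, non_mated):
    """d' = |mu_m - mu_nm| / sqrt((sigma_m^2 + sigma_nm^2) / 2) [37]."""
    mu_m, mu_nm = np.mean(mated), np.mean(non_mated)
    s_m, s_nm = np.std(mated), np.std(non_mated)
    return float(abs(mu_m - mu_nm) / np.sqrt((s_m**2 + s_nm**2) / 2))
```

Well-separated mated and non-mated score distributions yield a large d', matching the interpretation used in the results tables.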

B. Results 1) Children vs Adults:
Experimental results across all age groups for the tested face recognition systems are summarised in table VI. The corresponding DET curves are plotted in figure 12. When focusing on one system at a time, it can be seen that the mated statistics in the table are very similar across the age groups: the means and standard deviations are all extremely close. The mated mean and standard deviation are fairly similar for MagFace and ArcFace, while COTS has a considerably larger mean and smaller standard deviation. When looking at the non-mated part of the table, it can be seen that the mean of the distributions grows steadily the younger the age group, which happens for all three face recognition systems. The same is true for the standard deviation, where there is a considerable increase when comparing the age groups 20+ and 16-13. For ArcFace and MagFace, a significant increase between the groups 7-4 and 4-1 can also be noticed, but not for COTS. Another interesting statistic is d', where a higher d' means that the system can better distinguish between the mated and non-mated distributions. An evident tendency across all systems is that the d' scores decrease for the younger age groups.
2) Demographic Differentials: For the analysis of demographics, i.e. gender and race, only results for MagFace are shown, since all three systems show similar patterns, although the actual numbers may differ slightly.
In table VII, the obtained results for the gender subset are summarized. It can be observed that the d' values are larger for males than females for the age groups 20+, 16-13 and 13-10; thus, the systems are better at distinguishing mated and non-mated samples for males. For the younger age groups, the value is slightly larger for females. The non-mated mean values are generally larger for males across all age groups, but for mated comparisons the values are very similar for both genders. The DET curves for the three age groups 20+, 13-10 and 4-1, divided by gender, are depicted in figure 13. For the first five age groups, the EER is lower for males than females, but for the last age group, ages 4-1, the opposite can be observed. This matches the DET curves in figure 13, where the male subset performed better than the female one. Regarding the race subsets, White subjects obtain the largest d' scores in all age groups, except the adult one, where Latino Hispanic has a slightly larger one. Subjects of the Black race always have the smallest d' score, closely followed by Indians. The mean, standard deviation, and median change significantly depending on race and age. It can be observed that Asian subjects have the largest standard deviation for all but the oldest age group. The corresponding DET curves can be seen in figure 14. The performance worsens for all races in the youngest age groups, but with a similar hierarchy from worst to best performing races.

V. DISCUSSION
From the observed results on the full dataset, the mated scores are stable across the different age groups and in general have quite a high mean. The progression strongly impacts the non-mated scores across age groups, which causes a performance decrease in verification metrics as the subjects get younger, with a notable increase in EER. This drop in performance was a common pattern across all three tested face recognition systems. A common threshold in biometric verification is an FMR of 0.1% [38]; looking at MagFace in table VI, this results in a practical FNMR of 0.04% for the adults. By progressing these same adults down to the youngest children of ages 1-4, it rises to 2.49%. This change demonstrates how large a performance decrease can be observed when the same identities are younger. On the youngest children, the COTS system showed the best performance according to EER values, although for adults COTS performed worst. It is unknown what kind of data was used to train this system, but this could indicate that the system has seen images of young children before. As this analysis of performance across ages is based on synthetic data, the question arises whether the same observations would hold if it were tested on real facial images of persons at different ages. As mentioned in section II, several studies on child face recognition have observed a performance drop on real data of children compared to adults. Notably, the recent paper [14] reports a performance decrease for younger children compared to adults. The authors also note that children are harder to discriminate for the different facial recognition systems that they test, and they see a performance increase from fine-tuning a system on their child database. These results are comparable with the results observed in this work, but
this dataset has the benefit of being synthetic. It was also observed that subjects of Black and Asian race in general perform worse than those of White and Latino Hispanic race, and that all races exhibit a performance decrease as they get younger. In [39], Grother et al. performed a vendor test with a specific focus on the performance and bias of commercial face recognition systems concerning demographics. In the report, they note several of the same observations regarding race and age. Similar findings were made in [40]. Overall, the results indicate that facial recognition systems are not robust to younger subjects and that racial and gender bias is a general problem across age groups.
In figure 15, pairs of subjects from ages 7-10 with a high non-mated score can be seen. As shown, these pairs of subjects have the same gender and race. Similarly, pairs of subjects from the youngest age group (ages 1-4) with a high non-mated score can be seen in figure 16. In this particular age group, false matches across gender and race have also been observed.
An example where the same two subjects have a high non-mated score in all of the different age groups can be seen in figure 17. Here the top left image is from the adult age group while the bottom right is from the youngest child group. To investigate this phenomenon further, the scatter matrix in figure 18 was made.
In the figure, each non-mated score of one age group is plotted against the same non-mated comparison in every other age group. That is, if some subject s1 is compared with some other subject s2 in age group 20+ and produces some score, then the same two subjects also have comparison scores in all of the other age groups. All of these scores can then be plotted pairwise to see whether the scores of the same subject pairs are correlated across age groups. For example, plotting ages 20+ on the x-axis against ages 13-16 on the y-axis shows a positive correlation. An even stronger correlation can be observed between the 13-16 and 10-13 age groups, which may be because these groups are much closer in age than 20+ and 13-16. This positive correlation continues down the whole diagonal, which indeed shows a tendency for the non-mated comparison scores of the same subject pairs to be correlated across age groups.
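The cross-age correlation just described can be sketched as follows. This is a minimal illustration assuming each age group's non-mated scores are stored in a dictionary keyed by the subject pair; all names are illustrative, not from the paper's code:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def cross_age_correlation(scores_a, scores_b):
    """Correlate the non-mated scores of the same subject pairs in two
    age groups; scores_a and scores_b map a pair (s1, s2) to a score."""
    common = sorted(set(scores_a) & set(scores_b))
    xs = [scores_a[p] for p in common]
    ys = [scores_b[p] for p in common]
    return pearson(xs, ys)
```

Computing this coefficient for every pair of age groups yields the correlation structure visualized along the diagonal of the scatter matrix.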

VI. CONCLUSIONS
In this work we introduced the HDA-SynChildFaces database, a synthetic database of demographically balanced face images of children across various age groups, including common intra-subject variations.
From the experiments conducted on HDA-SynChildFaces, the following key findings were obtained:
• The mated scores are, on average, not strongly impacted by face age progression in any of the tested face recognition systems.
• The non-mated scores become, on average, significantly higher as the age of the subjects decreases, in all tested systems.
• The EER and the FNMR at relevant FMR operating points increase as the age of the subjects decreases, in all tested systems.
• Subjects classified as female have higher EER and FNMR than males in all age groups, with some exceptions in the youngest group (ages 1-4).
• The race of the subjects has an impact on the systems, and the performance for all races degrades as the age of the subjects decreases. In particular, Black and Asian subjects have high EER and FNMR compared to White and Latino-Hispanic subjects, also as children.

Fig. 2: Example of accepted and rejected images using SER-FIQ [27] quality assessment with a threshold of 0.75.
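The quality-based filtering illustrated here can be sketched as follows; `quality_score` is a hypothetical stand-in for the actual SER-FIQ model, and the function name is illustrative:

```python
def filter_by_quality(images, quality_score, threshold=0.75):
    """Split images into accepted and rejected sets by a face image
    quality score (a stand-in for SER-FIQ), keeping only those with a
    score at or above the threshold."""
    accepted = [img for img in images if quality_score(img) >= threshold]
    rejected = [img for img in images if quality_score(img) < threshold]
    return accepted, rejected
```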

Fig. 4: Hyperplane between two categories in a simplified 2D space. The red line is the hyperplane as found by a linear SVM. The images with a blue border are categorized as women, and those with a green border are categorized as men. The vector n is the normal vector of the hyperplane.

Fig. 5: An example of a subject manipulated with a gender boundary, moving the subject in both a positive and a negative direction.
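The boundary-based manipulation illustrated in figures 4 and 5 can be sketched as follows. As a simplification, the normal vector n is approximated here by the normalized difference of class means in latent space rather than by an actual linear SVM fit; all names are illustrative:

```python
def unit_boundary_normal(latents_pos, latents_neg):
    """Approximate boundary normal n: the unit vector pointing from the
    mean of the negative class to the mean of the positive class
    (a mean-difference proxy for the linear-SVM normal)."""
    dim = len(latents_pos[0])
    mean_pos = [sum(z[d] for z in latents_pos) / len(latents_pos) for d in range(dim)]
    mean_neg = [sum(z[d] for z in latents_neg) / len(latents_neg) for d in range(dim)]
    diff = [p - q for p, q in zip(mean_pos, mean_neg)]
    norm = sum(v * v for v in diff) ** 0.5
    return [v / norm for v in diff]

def edit_latent(z, n, alpha):
    """Move latent code z along the boundary normal n by step alpha;
    a negative alpha moves the subject in the opposite direction."""
    return [zi + alpha * ni for zi, ni in zip(z, n)]
```

Feeding the edited latent code back into the generator yields the manipulated face images; repeating with positive and negative alpha produces the two directions shown in figure 5.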

Fig. 7: Effect of race balancing on a dataset of 3,510 subjects of non-uniformly distributed races. A: Asian, B: Black, I: Indian, LH: Latino-Hispanic, ME: Middle Eastern, and W: White.
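The race balancing illustrated here can be sketched as downsampling every race to the size of the smallest group, yielding a uniform race distribution; the function name and data layout are illustrative:

```python
import random

def balance_races(subjects, seed=0):
    """Downsample each race to the size of the smallest group so the
    resulting subject set is uniformly distributed over races.
    `subjects` maps a race label to a list of subject ids."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    target = min(len(ids) for ids in subjects.values())
    return {race: rng.sample(ids, target) for race, ids in subjects.items()}
```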

Fig. 8: Example of a subject moved too far in the age direction (the green-boxed images still look natural, whereas the red-boxed images look progressively more unnatural).

Fig. 10: Examples of rejected images in the automatic post-processing step of the pipeline.

Fig. 11: Example of a subject in the different age groups. The ages from left to right: 1-4, 7-10, 13-16 and 20+. The variations from top to bottom, with the red-boxed images as reference: left yaw, right yaw, pitch down, pitch up, smile, high illumination, low illumination and compression.

Fig. 17: An example of two subjects who have a high non-mated score throughout all age groups, according to MagFace.

Fig. 18: Pair-wise scatter plots between the different age groups showing the relationship between the non-mated comparison scores with MagFace.

TABLE I: Different child face datasets for facial recognition systems. The bias column indicates the presence of any potential biases within the respective dataset.

Alaluf et al. propose an architecture where age is approached as a regression task rather than as discrete age groups. Their model learns a non-linear path to disentangle age progression from other attributes. Li et al. [24] also focus on continuous ageing and use an age estimator as part of a GAN generator in a novel architecture.

TABLE II: Boundaries used in the implementation of the proposed pipeline.

TABLE III: Age groups, races and intra-subject variations provided as input to the pipeline to produce the dataset. As a part of the pipeline, each synthetic subject has also been classified as either male (M) or female (F). This split is done to test whether the performance of the face recognition systems differs between genders and whether it varies across the different age groups. The number of images in each group can be seen in table IV. It can be seen that 40.3% of the subjects

TABLE IV: Female (F) and male (M) subjects and total images in the database.

TABLE V: The number of subjects of the different races in the database.

TABLE VI: Different biometric performance metrics with ArcFace, MagFace and COTS as face recognition systems on the full dataset. It should be noted that only the non-mated distribution statistics are shown here, as the mated statistics are very similar and would produce an overwhelmingly large table. The table contains several interesting observations, such as the d score being largest for subjects of White race.

TABLE VII: Different biometric performance metrics with MagFace on the gender-divided dataset.

TABLE VIII: Different biometric measures from using MagFace on the race-divided dataset.