Geometrical Consistency Modeling on B-Spline Parameter Domain for 3D Face Reconstruction From Limited Number of Wild Images

A number of methods have been proposed for face reconstruction from single or multiple images. However, reconstruction from a limited number of wild images remains a challenge: such images exhibit complex and varied imaging conditions, diverse face appearance, and few high-quality samples. Moreover, most current mesh-based methods cannot generate a high-quality face model because of the local mapping deviation in geometric optics and the distortion error introduced by discrete differential operators. In this paper, accurate geometrical consistency modeling on the B-spline parameter domain is proposed to reconstruct a high-quality face surface from such varied images. The modeling is fully consistent with the laws of geometric optics, and the B-spline representation reduces distortion during surface deformation. In our method, 0th- and 1st-order consistency of stereo are formulated based on low-rank texture structures and local normals, respectively, to approach pinpoint geometric modeling for face reconstruction. A practical solution combining the two consistency conditions, together with an iterative algorithm, is proposed to optimize a highly detailed B-spline face effectively. Extensive empirical evaluations on synthetic and unconstrained data are conducted, and the experimental results demonstrate the effectiveness of our method in challenging scenarios, e.g., a limited number of images with different head poses, illuminations, and expressions.


INTRODUCTION
3D faces have been extensively applied in areas such as face recognition (Artificial and Aryananda, 2002; Mian et al., 2006) and expression recognition (Zhang et al., 2015). These face analysis technologies are significant for keeping human-robot cooperative tasks safe and intelligent (Maejima et al., 2012). 3D face reconstruction is therefore an important topic, and it is meaningful to reconstruct a specific 3D face from person-of-interest images captured under challenging scenes. Images under such scenes, also referred to as images in the wild, have the following characteristics: (1) significant changes in illumination across time periods; (2) various face poses caused by different camera sensors and viewpoints; (3) different appearances across environments; and (4) occlusions or redundant backgrounds. More seriously, sometimes only a limited number of identity images are available, as in the human-robot interaction, surveillance, and mobile shooting scenarios listed in Figure 1.
As a whole, reconstruction technologies include methods based on a single image, multiple images, and even unconstrained images. Recent research (Kemelmacher and Seitz, 2011; Roth et al., 2015, 2016) shows that good reconstruction depends on two aspects: (1) sufficiently rich local information, e.g., normals, and (2) a good face prior, e.g., a face template. In particular, the latter amounts to finding an embedding representation with good characteristics to register local information finely.
According to the template representation, these methods can be categorized into three classes: (i) methods without a template, e.g., integration (Kemelmacher and Seitz, 2011) and structure from motion (Koo and Lam, 2008); (ii) methods using a single discrete template, e.g., a reference face mesh (Roth et al., 2015); and (iii) methods using a statistical continuous template, e.g., T-splineMMs (Peng et al., 2017), or a statistical discrete template, e.g., 3DMMs (Piotraschke and Blanz, 2016; Roth et al., 2016). Methods with a template usually generate a better global shape than those without one, and a statistical template contributes to better personalization. It is therefore significant to find an excellent template representation for face reconstruction. The mesh model is widely used owing to its rapid computation and popularity in computer vision, but it is not well-compatible with geometric optics at the vertex level, resulting in local mapping deviation of rays, as seen in Figure 1. This prevents local information from being strictly registered physically. Additionally, discretization of the Laplace-Beltrami operator (LBO), i.e., the cotangent scheme (Meyer et al., 2003), may introduce local deformation distortion, which often happens when there are not enough images for high-quality normal estimation. This distortion occurs irregularly at edges and at locations with large curvature change, e.g., the nose and mouth. Lastly, the fixed topology of a mesh also restricts extended refinement. All the above problems limit the reconstruction precision of meshes.
FIGURE 1 | Geometric optics of BP (back projection) imaging on two types of surfaces: the correct rays pass through the blue points on the true shape, while the biased ones pass through the red points on the mesh shape, because the intersection between a ray and the mesh is bound to a vertex. The difference between the red and blue points is referred to as the local mapping deviation.
To solve the existing issues of the mesh template, we adopt the classic B-spline embedding function (Piegl and Tiller, 1997) to register local information and reconstruct the face. Firstly, a B-spline surface is a parametric surface that can approximate the true shape of an object with fewer parameters (control points) than a mesh. It yields correct rays in geometric optics, so that local information, i.e., texture, feature points, and normals, is accurately registered. Secondly, we use the 2nd-order partial derivative operator w.r.t. the parameters as a local deformation constraint to reduce deformation distortion. Lastly, a B-spline surface can also generate a mesh at any precision or be extended for further refinement. These three characteristics give the B-spline face great advantages over mesh-template-based methods. Given a collection of images, we use the B-spline embedding function as the 3D face representation and model 0th- and 1st-order consistency of reconstruction in the parameter domain, which makes the BP imaging rays completely compatible with geometric optics. The 0th-order consistency model guarantees that the images are well registered to the surface even if the face images contain occlusions or expressions, and the 1st-order consistency model guarantees that the surface normals are consistent with the normals estimated from the images. Both qualitative and quantitative experiments are conducted and compared with other methods.
In a nutshell, there are two primary contributions: 1. Pinpoint geometrical consistency is modeled on a B-spline embedding function for face reconstruction from multiple images, completely consistent with the laws of geometric optics. 2. 0th- and 1st-order consistency conditions and a practical solution are proposed to optimize the B-spline face effectively, which is able to handle variations such as different poses, illuminations, and expressions with a limited number of images.
In the following, we first review related work in section 2. Section 3 provides a geometric model of multiple BP imaging in image-based stereo for our problem. We introduce the B-spline embedding and its compact representations in section 4 and present consistency modeling for B-spline face reconstruction in section 5. In addition, a practical solution is proposed in section 6. We conduct experiments in section 7 and conclude in section 9.

3D Face Required Scenes
With the development of robots and AIoT (Qiu et al., 2020), vision will play a very important role in safety (Khraisat et al., 2019; Li et al., 2019) and scene and human understanding (Zhang et al., 2015; Meng et al., 2020). As a base technology, 3D face contributes greatly to these scenes. For example, to build humanoid robots that interact in a human-understanding manner, automatic face and expression recognition is very important (Zhang et al., 2015). Recognition during real-life human-robot interaction remains challenging as a result of subject variations, illumination changes, various poses, background clutter, and occlusions (Mian et al., 2006). However, original humanoid robot APIs cannot always handle such challenges. Optimal, robust, and accurate automatic face analysis is thus meaningful for real-life applications, since the performance of facial action and emotion recognition relies heavily on it. Many parametric approaches in the computer vision field, such as 3DMMs (Blanz and Vetter, 1999; Blanz et al., 2004) and face alignment with a 3D solution (Zhu et al., 2016), have been proposed to estimate head pose, identity, and expression from real-life images to benefit subsequent automatic facial behavior perception and address the above issues. Therefore, 3D face modeling from a humanoid robot's view is of great significance for handling challenging face analysis during interaction.

2D Images Based Face Reconstruction
2D methods generally cover several kinds of fundamental approaches, including Structure from Motion (SFM) (Tomasi and Kanade, 1992), Shape from Shading (SFS) (Zhang et al., 1999), the 3D Morphable Model (3DMM) (Blanz and Vetter, 1999; Blanz et al., 2004), and deep learning (Deng et al., 2019). SFM methods compute the positions of surface points based on the assumption that there exists a coordinate transformation between the image coordinate system and the camera coordinate system. SFS methods compute surface normals under the assumption that the subject surface is Lambertian and lit by a relatively distant illumination. The idea of 3DMM is that human faces lie within a linear subspace, so that any novel face shape can be represented by a linear combination of shape eigenvectors deduced by PCA. SFS and SFM give geometrical and physical descriptions of face shape and imaging, while 3DMM concentrates on a statistical explanation of 3D meshes or skeletons. Deep learning methods infer 3D face shape or texture (Lin et al., 2020) by statistically learning a mapping between face images and their 3D shapes (Zhou et al., 2019). Limited by data size, most of them rely on 3DMM or PCA for synthesizing supplementary ground truths (Richardson et al., 2016) or as a prior (Tran et al., 2017; Gecer et al., 2019; Wu et al., 2019), resulting in an absence of shape detail. We believe that face reconstruction is a geometrical optimization problem rather than a statistical one, as 3DMM is more suitable as an assistant to geometrical methods when building detailed shape, e.g., as in Yang et al. (2014).

Shape From Shading and Structure From Motion
SFS has been widely used for reconstruction, e.g., single-view reconstruction (Kemelmacher-Shlizerman and Basri, 2011), reconstruction from multiple frontal images (Wang et al., 2003), and reconstruction from unconstrained images (Kemelmacher and Seitz, 2011; Roth et al., 2015). As single-view reconstruction is ill-posed (Prados and Faugeras, 2005), a reference is always needed (Kemelmacher-Shlizerman and Basri, 2011). For unconstrained images, photometric stereo is applied to obtain accurate normals locally (Kemelmacher and Seitz, 2011; Roth et al., 2015). SFM uses multiple frames or images to recover the sparse 3D structure of an object's feature points (Tomasi and Kanade, 1992). The spatial-transformation approach (Sun et al., 2013) only estimates the depth of facial points. Bundle adjustment (Agarwal et al., 2011) fits large-scale rigid object reconstruction, but it cannot generate a dense model of the non-rigid face. Incremental SFM (Gonzalez-Mora et al., 2010) was proposed to build a generic 3D face model for the non-rigid face. The work by Roth et al. (2015) optimizes local information with normals from shading, based on a global warping driven by 3D feature points. Shading and motion are thus important and quite distinct sources of geometric information about the face, and they enhance reconstruction when combined. In our method, 0th- and 1st-order consistency of stereo is modeled to integrate the advantages of both shading and motion information.

Facial Surface Modeling
Surface modeling depends on the input data (point cloud, noise, outliers, etc.), the output (point cloud, mesh, skeleton), and the type of shape (man-made or organic). Point clouds, skeletons, and mesh grids are the widely used representations for face reconstruction. Lu et al. (2016) present a stepwise tracking approach to reconstruct 3D B-spline space curves from planar orthogonal views by minimizing a weighted energy function. The spatial transformation method (Sun et al., 2013) estimates the positions of sparse facial feature points. Bundle adjustment builds a dense point cloud for large-scale rigid objects from a great number of images (Agarwal et al., 2011). Heo and Savvides (2009) reconstruct a dense face mesh based on a skeleton and 3DMM. Kemelmacher and Seitz (2011) apply integration of normals to obtain discrete surface points, which may produce implausible depth when the recovered normals are unreliable. Roth et al. (2015) reconstruct a face mesh based on Laplacian mesh editing, which may produce local mesh distortion after several iterations of local optimization. In mesh reconstruction work, surface-smoothness priors are also needed to guarantee the smoothness of a discrete mesh built from a point cloud, e.g., radial basis functions (Carr et al., 2001) and Poisson surface reconstruction (Kazhdan et al., 2006). Because point clouds and 3D meshes are discontinuous geometric shapes, they cannot approximate the true shape of a face at arbitrary precision. There have been works on fitting B-splines to noisy 3D data, e.g., Hoch et al. (1998). The B-spline face model is a continuous free-form surface that can be reconstructed from images directly, instead of from intermediate point data, but it is not a detailed model when only structure optimization is used (Peng et al., 2016).
Because the B-spline surface is a special case of NURBS (Non-Uniform Rational B-Splines) (Piegl and Tiller, 1997), it can also be imported into 3D modeling software such as Rhino3D for convenient further editing, analysis, and transformation by adjusting the B-spline control points. It can likewise be converted into a mesh model at any precision by choosing an appropriate parameter interval, which is meaningful for systems with limited memory.

GEOMETRIC MODELING
Our problem modeling is illustrated in Figure 2. The domain of the input image I_i from a camera is I_i ⊂ R^2, i = 1, 2, ..., n. The camera operator Π_i ∈ C^∞(R^3, R^2) maps a point P ∈ S to p = Π_i(P) ∈ I_i using weak perspective projection, and its inverse Π_i^{-1} determines the ray cluster Rays#i of BP imaging from I_i, i = 1, 2, ..., n. Let s_i, R_i, and t_i denote the scale, rotation, and translation parameters in the projection Π_i. The i-th projection operation is simply

p = s_i R_{i,[1,2]} P + t_i,    (1)

where R_{i,[1,2]} expresses the first two rows of R_i. Let U ⊂ R^2 denote the parameter domain of the human face surface. A certain embedding F ∈ C^1(U, R^3) maps a point u ∈ U to the 3D point P ∈ S, and F^{-1} denotes the inverse operator of F. It is thus clear that different embeddings F determine different face shapes. According to the geometric optics of BP imaging, an image point p ∈ I_i is back projected onto a point u = τ_i(p) ∈ U via the operator

τ_i = F^{-1} ∘ Π_i^{-1}.    (2)

Therefore, an image I_i in the i-th view is mapped to the surface S, and then to texture space by

T_i = I_i ∘ τ_i^{-1},    (3)

where we define τ_i^{-1} = Π_i ∘ F. In fact, τ_i, i = 1, 2, ..., n, generate discrete and inconsistent ray mappings in texture space because of the discrete and differing image domains, as well as noise, as seen in Figure 2.
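For concreteness, the weak perspective projection of Equation (1) can be sketched in a few lines of NumPy; the function name and data layout below are illustrative assumptions, not code from the paper:

```python
import numpy as np

def weak_perspective_project(P, s, R, t):
    """Project 3D points P (N x 3) to 2D with a weak perspective camera.

    p = s * R[:2, :] @ P + t, where R[:2, :] keeps the first two rows
    of the rotation matrix (the paper's R_{i,[1,2]}).
    Illustrative sketch; names and shapes are our own conventions.
    """
    return s * (P @ R[:2, :].T) + t  # result is N x 2

# usage: identity rotation, unit scale, zero translation drops the z-coordinate
P = np.array([[1.0, 2.0, 3.0]])
p = weak_perspective_project(P, s=1.0, R=np.eye(3), t=np.zeros(2))
```

Under weak perspective, depth only enters through the global scale s, which is what makes the per-view parameters (s_i, R_i, t_i) estimable from 2D landmarks alone.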

0th-and 1st-Order Consistency
Generally, the problem is how to determine F from the multiple images {I_i}. If all images are captures of the same S, all {T_i}_{i=1:n} in texture space should be highly consistent in geometry. First, F should satisfy

(T_1)_# = (T_2)_# = ... = (T_n)_#,    (5)

where (·)_# is a composition operator of fitting and sampling used to handle the inconsistency: it first fits a texture function based on the discrete texture and parameters mapped from one image, and then samples texture intensity values at unified parameter points {u_j}_{j=1:N_p}.
Second, F should satisfy

n(u) = (∂F/∂u × ∂F/∂v) / ||∂F/∂u × ∂F/∂v||,    T(u) = ρ(u) l^T n(u),    (6)

which describes the equivalence relation between the normal n and the 1st-order partial derivatives in the first formulation, and the equivalence relation among the albedo ρ, normal n, light direction l, and image intensity T in the second. This follows a linear photometric model, as seen in Figure 3.
We refer to Equations (5) and (6) as the 0th- and 1st-order consistency equations in 3D surface reconstruction, respectively. Generally, researchers solve one of the two consistency problems to reconstruct a 3D surface: classically, multi-view stereo (Seitz et al., 2006) for the 0th-order problem, or photometric stereo (Barsky and Petrou, 2003) for the 1st-order one.

Embedding F
There are several types of representation for the embedding F, such as a discrete mesh or a C^2 parametric surface. In fact, the representation type of F also affects the reconstruction quality. For a mesh, on one hand there exists a mapping deviation of rays from image points to mesh vertices, which leads to inaccurate texture charts {T_i}_{i=1:n} and affects the accuracy of reconstruction. On the other hand, the discrete differential operator, i.e., the LBO (Meyer et al., 2003), introduces potential distortion error when obtuse triangles arise in the mesh from erroneous local normals. Additionally, the precision of the mesh also limits the detail of the reconstruction.
We consider applying a C^2 parametric surface as the face representation. Generally, the B-spline surface is recommended because of its good locality compared with other surface types such as polynomial and Bézier surfaces. With a B-spline surface, there is no mapping deviation in geometric optics, and the potential distortion introduced by a discrete differential operator is avoided. Therefore, accurate and continuous back projection texture charts {T_i}_{i=1:n} can be generated based on Equations (2), (3), and (5). Accurate reconstruction can then be implemented based on Equation (6). What's more, the precision can be enhanced for highly detailed reconstruction by inserting control points.

B-SPLINE FACE EMBEDDING F, AND THE 0TH-, 1ST-, 2ND-ORDER REPRESENTATION
The human face is assumed to be a uniform 4 × 4 B-spline surface S, with B = {b_mn}_{M×N} as its control points. In the parameter domain U, the knot vectors U = {u_m}_{m=1}^{M+4} and V = {v_n}_{n=1}^{N+4} split the uv parameter plane into a uniform grid. Let u denote the parameter point (u, v). The surface function is

F(u) = Σ_{m=1}^{M} Σ_{n=1}^{N} B_m(u) B_n(v) b_mn,

where B_m and B_n are the B-spline basis functions. F is C^2, meaning that it can approximate the true shape at arbitrary uv precision with deterministic k-th-order partial derivatives ∂^k F/∂u^k and ∂^k F/∂v^k, k = 1, 2, and ∂^2 F/∂u∂v.
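For readers who wish to experiment, a small bicubic patch of this form can be evaluated with SciPy's FITPACK bivariate spline routines. The patch size, knot layout, and coefficient choice below are illustrative assumptions, not the paper's 102 × 77 face configuration:

```python
import numpy as np
from scipy.interpolate import bisplev

# Small bicubic (degree-3) B-spline patch: M x N = 6 x 6 control points.
M, N, k = 6, 6, 3
# Clamped uniform knot vectors of length M + k + 1 (the paper's U and V).
tu = np.concatenate([[0.0] * k, np.linspace(0, 1, M - k + 1), [1.0] * k])
tv = np.concatenate([[0.0] * k, np.linspace(0, 1, N - k + 1), [1.0] * k])

# One scalar coordinate of F (e.g., z) driven by an M x N coefficient grid;
# the full embedding F uses three such grids (x, y, z control points).
coeffs = np.fromfunction(lambda m, n: m + n, (M, N)).ravel()
tck = (tu, tv, coeffs, k, k)
z = bisplev(0.5, 0.5, tck)            # surface value at u = (0.5, 0.5)
dz_du = bisplev(0.5, 0.5, tck, dx=1)  # a 1st-order partial derivative
```

The same `tck` tuple gives any mixed partial via `dx`/`dy`, which is exactly the deterministic differentiability the C^2 property promises.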

0th-Order Representation
We give a more compact formulation of the 0th-order representation as follows:

F(u) = T|_u · b,    (7)

where b denotes a 3MN × 1 vector storing the B-spline control points, and T|_u denotes a sparse 3 × 3MN matrix stacking the 0th-order coefficients at the parameter u ∈ U. In fact, we need not consider all 3D points mapped to the 2D images when estimating a camera projection operator. Instead, we only consider f landmark points on the human face, as shown in Figure 4, whose compact formulation is

F(u(l_i)) = T|_{u(l_i)} · b,    (8)

where u(l_i) is the parameter point of the i-th feature point, i = 1, 2, ..., f. The landmarks cover a sparse structure of the face.

1st-Order Representation
The 1st-order partial derivatives of F w.r.t. u and v are

∂F/∂u (u) = T_1|_u · b  and  ∂F/∂v (u) = T_2|_u · b,    (9)

where T_1|_u and T_2|_u denote the matrices stacking the 1st-order coefficients w.r.t. u and v, respectively. Therefore, the surface normal vector at u can be computed by the cross product

n(u) = (T_1|_u · b) × (T_2|_u · b),

which is the key information for detailed reconstruction using the photometric stereo method.
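The normal-from-derivatives computation can be sketched as follows, under a hypothetical layout where each coordinate of F is a separate SciPy spline (our illustration, not the paper's code):

```python
import numpy as np
from scipy.interpolate import bisplev

def surface_normal(u, v, tcks):
    """Unit normal of a B-spline surface at (u, v).

    `tcks` is a list of three (tu, tv, c, kx, ky) tuples, one per
    coordinate of F = (x, y, z) -- an assumed layout for illustration.
    n = (dF/du x dF/dv) / ||dF/du x dF/dv||.
    """
    Fu = np.array([bisplev(u, v, t, dx=1) for t in tcks])
    Fv = np.array([bisplev(u, v, t, dy=1) for t in tcks])
    n = np.cross(Fu, Fv)
    return n / np.linalg.norm(n)

# usage: a flat patch x = u, y = v, z = 0 has normal +/-(0, 0, 1) everywhere.
# A single clamped cubic (Bezier) patch with Greville-spaced coefficients
# reproduces the linear functions u and v exactly.
k = 3
kn = np.array([0, 0, 0, 0, 1, 1, 1, 1], float)
g = np.arange(4) / 3.0
cx = np.repeat(g, 4)   # c[m, n] = m/3  -> x(u, v) = u
cy = np.tile(g, 4)     # c[m, n] = n/3  -> y(u, v) = v
cz = np.zeros(16)
tcks = [(kn, kn, c, k, k) for c in (cx, cy, cz)]
```

Because the derivatives are exact spline derivatives rather than finite differences on a mesh, the normal field inherits the smoothness of F.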

2nd-Order Representation
Similarly, the 2nd-order partial derivatives w.r.t. u and v are

∂^2 F/∂u^2 (u) = T_11|_u · b  and  ∂^2 F/∂v^2 (u) = T_22|_u · b,

where T_11|_u and T_22|_u denote the matrices stacking the 2nd-order coefficients w.r.t. u and v, respectively. The 2nd-order information can be used for smoothness control during optimization. Based on the face surface embedded with the B-spline function, we present the pinpoint 0th- and 1st-order geometric consistency conditions in the following section.

CONSISTENCY MODELING IN B-SPLINE FACE RECONSTRUCTION
The reconstruction problem is to compute F by solving the 0th-order consistency of Equation (5) or the 1st-order consistency of Equation (6). Generally, the two consistency conditions are combined for face reconstruction, considering that only a limited number of consistent points can be estimated in the images and that the estimated normals are unreliable. Furthermore, obtaining an accurate registration of the 0th- and 1st-order information is the most important step toward highly detailed B-spline reconstruction.
Well-registered textures form low-rank structures across the back projection texture charts. In practice, however, this low-rank property can easily be violated by partial occlusions or expressions in the captured images. Since these errors typically affect only a small fraction of all pixels in a chart, they can be modeled as sparse errors whose nonzero entries can have arbitrarily large magnitude.

Modeling Occlusion and Expression Corruptions in 0th-Order Consistence
Let e_i represent the error corresponding to image I_i, such that the corrected back projection texture charts {T_i^e − e_i}_{i=1:n} are well registered to the surface F and free of any corruptions or expressions. Combining this with the 0th-order representation of the B-spline face in Equation (7), formulation (5) can be modified so that D_e − E is of low rank, where D_e = [vec(T_1^e), vec(T_2^e), ..., vec(T_n^e)] and E = [vec(e_1), vec(e_2), ..., vec(e_n)].
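Such a low-rank-plus-sparse decomposition is classically computed by Robust PCA. A simplified ADMM-style solver, given purely for illustration (the paper's exact solver may differ), is:

```python
import numpy as np

def robust_pca(D, lam=None, iters=500, mu=1.0):
    """Split D into a low-rank part A plus a sparse error E.

    Approximately minimizes ||A||_* + lam * ||E||_1 s.t. A + E = D
    by alternating minimization with a dual variable Y (ADMM).
    A simplified sketch, not the paper's implementation.
    """
    m, n = D.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    E = np.zeros_like(D)
    Y = np.zeros_like(D)
    for _ in range(iters):
        # Singular value thresholding updates the low-rank part A
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = U @ np.diag(np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # Elementwise soft shrinkage updates the sparse error E
        G = D - A + Y / mu
        E = np.sign(G) * np.maximum(np.abs(G) - lam / mu, 0.0)
        # Dual ascent on the constraint A + E = D
        Y = Y + mu * (D - A - E)
    return A, E
```

Here the columns of D would be the vectorized texture charts; A plays the role of the well-registered low-rank textures and E collects the occlusion/expression corruptions.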
However, the solution b for the face surface S is not unique if all images are from similar views. And the reconstruction is not highly detailed even if we obtain a unique solution by applying a prior face template. So we also need to model high details via the 1st-order consistency.

Modeling High Details in 1st-Order Consistence
The resolution of the reconstruction is determined by the density of correctly estimated normals. To enhance the resolution of the B-spline surface, we use the operator (·)_# to sample N_p dense parameter points {u_j}_{j=1:N_p} on the domain U for the problem of Equation (6).
The well-registered and dense textures {T_i(u_j)} are then obtained for i = 1, 2, ..., n and j = 1, 2, ..., N_p. According to the Lambertian illumination model in Equation (6), the dense normals n_j as well as the lights l_i can be computed from the shading (intensity) of the charts T_i by the SVD method.
Finally, the highly detailed reconstruction must satisfy

n_j = (∂F/∂u (u_j) × ∂F/∂v (u_j)) / ||∂F/∂u (u_j) × ∂F/∂v (u_j)||,  j = 1, 2, ..., N_p.    (14)

By putting Equation (9) into Equation (14), we get

n_j = ((T_1|_{u_j} · b) × (T_2|_{u_j} · b)) / ||(T_1|_{u_j} · b) × (T_2|_{u_j} · b)||.    (15)

The conditions of both Equations (6) and (15) have to be considered for a good reconstruction, which is very difficult. Therefore, we propose a practical solution that combines both the 0th- and 1st-order consistency.

PRACTICAL SOLUTION COMBINING 0TH-AND 1ST-ORDER CONSISTENCE
The problems of both the 0th-order and the 1st-order consistency are difficult to solve. For the former, Jacobian matrices w.r.t. {τ_i^{-1}}_{i=1:n} have to be computed, which is computationally expensive. And the solution of Equation (15) is not unique either. Therefore, in this section we aim to find a practical solution that handles both consistency conditions. We first define the subproblem for each condition, and then provide an iterative algorithm.

0th-Order Solution
In the 0th-order consistency problem, three kinds of parameters, including the camera parameters, the surface parameters F (or b), and the texture parameters {T_i}_{i=1:n} (or D), need to be computed, but they are difficult to solve simultaneously. Instead, we optimize them alternately.

Estimating i
According to the linear transformation from 3D to 2D in Equation (1), we can estimate the scale s_i, rotation R_i, and translation t_i from the landmarks of each image I_i, i = 1, 2, ..., n, using the SVD method (Kemelmacher and Seitz, 2011). The image landmarks are detected by a state-of-the-art detector (Burgos-Artizzu et al., 2013) with near-human performance, and the 3D landmarks are defined on a B-spline face template with control point parameter b_0, according to Equation (8).
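One common recipe for this estimation step, sketched here under our own assumptions (the cited SVD method may differ in detail), is a least-squares affine fit followed by an SVD projection onto a scaled truncated rotation:

```python
import numpy as np

def weak_perspective_pose(p2d, P3d):
    """Estimate scale s, rotation rows R2 (2 x 3), and translation t
    from f 2D landmarks (f x 2) and 3D template landmarks (f x 3).

    Least-squares affine fit, then SVD projection onto a scaled
    matrix with orthonormal rows -- an illustrative recipe, not
    necessarily the paper's exact procedure.
    """
    t2, t3 = p2d.mean(0), P3d.mean(0)
    X, Y = P3d - t3, p2d - t2
    A = np.linalg.lstsq(X, Y, rcond=None)[0].T      # 2 x 3 affine map
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    R2 = U @ Vt                                     # orthonormal 2 x 3 rows
    scale = s.mean()
    t = t2 - scale * (R2 @ t3)
    return scale, R2, t
```

With noise-free correspondences this recovers s, R_{[1,2]}, and t exactly; with noisy landmarks the SVD step projects the affine estimate back onto the weak perspective camera model.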

Estimating b
Let f denote a 2nf × 1 vector stacking the f landmarks of the n images, P denote a 2nf × 3f projection matrix stacking the n views of parameters s_i R_{i,[1,2]}, and t denote a 2nf × 1 vector stacking the translations. The update of b can be implemented by solving

min_b ||f − P T^{#l} b − t||^2 + ζ (||T^#_11 (b − b_0)||^2 + ||T^#_22 (b − b_0)||^2),    (16)

where the first and second are the 0th- and 2nd-order terms, respectively, and ζ balances them. The operator (·)^{#l} is a sampling operator that selects the B-spline coefficients of the landmarks at the parameters {u(l_i)}_{i=1:f}, and (·)^# selects the B-spline coefficients at {u_j}_{j=1:N_p}. In fact, T^{#l} is a 3f × 3MN matrix that stacks T|_{u(l_i)}, i = 1, 2, ..., f, and T^#_11 (or T^#_22) is a 3N_p × 3MN matrix that stacks T_11|_{u_j} (or T_22|_{u_j}), j = 1, 2, ..., N_p.
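The generic shape of this update is a stacked regularized linear least-squares problem, which can be sketched as follows; `A`, `y`, and `Rreg` are placeholders standing in for the landmark projection system and a 2nd-order coefficient matrix, not the paper's actual matrices:

```python
import numpy as np

def solve_b(A, y, Rreg, b0, zeta):
    """Solve min_b ||A b - y||^2 + zeta * ||Rreg (b - b0)||^2.

    Stacking the data term and the (square-root-weighted) regularizer
    turns the problem into one ordinary least-squares solve.
    Illustrative sketch with placeholder matrix names.
    """
    sz = np.sqrt(zeta)
    M = np.vstack([A, sz * Rreg])
    r = np.concatenate([y, sz * (Rreg @ b0)])
    return np.linalg.lstsq(M, r, rcond=None)[0]
```

The same stacking trick handles both smoothness terms of Equation (16) at once by concatenating further weighted rows.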
The second term also works as a regularization measuring the distance of local information between the faces b and b_0. It helps eliminate the effect of the geometric rotation introduced by the 0th-order warping, and guarantees smooth changes during optimization. In particular, ζ cannot be too small, otherwise fast changes may lead to a local optimum.

1st-Order Solution
Firstly, a texture-chart-based photometric stereo method is used to estimate the local normals. Secondly, a normal-driven optimization strategy is proposed to optimize the B-spline face.

Estimating n j
According to photometric stereo, the shape at each point can be solved from the observed variation in shading across the images. The data of the n texture charts are stacked into a matrix M ∈ R^{n×N_p}, and the initial shape S̃ and lighting L̃ are estimated by factorizing M = L S via SVD (Yuille et al., 1999).
To approach the true normal information, we estimate the shape S and the ambiguity A following the work of Kemelmacher and Seitz (2011). Lastly, the normal at the j-th point is n_j = S_j^T, where S_j is the j-th row of S.
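The rank-3 factorization step can be sketched as follows; this illustrates only the SVD factorization, which recovers lighting and normals up to a 3 × 3 linear ambiguity, and not the paper's subsequent ambiguity resolution:

```python
import numpy as np

def photometric_stereo(M):
    """Rank-3 factorization M ~= L S of the n x N_p intensity matrix.

    Rows of M are the texture charts (one per image); L (n x 3) holds
    scaled lighting directions and S (3 x N_p) holds scaled normals,
    each determined only up to a 3 x 3 linear ambiguity.
    Illustrative sketch of the SVD step.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    sr = np.sqrt(s[:3])
    L = U[:, :3] * sr           # n x 3, absorbs sqrt of singular values
    S = (Vt[:3].T * sr).T       # 3 x N_p, absorbs the other sqrt
    return L, S
```

For noise-free Lambertian data M is exactly rank 3, so `L @ S` reproduces M; with real charts the truncation acts as a denoiser.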

Estimating b
We normalize the n_j and stack them into a 3N_p × 1 vector h. Equation (15) can then be rewritten as

O_1 = ||h − Λ · (T_1^# b) ⊗ (T_2^# b)||^2,

where Λ is a 3N_p × 3N_p diagonal matrix that stores the 3N_p reciprocals of the lengths of the normals {n_j}_{j=1:N_p}; (·)^# is a selection operator that selects the 3N_p rows of 1st-order coefficients at the parameters {u_j}_{j=1:N_p}; and b_0 represents the control points of a B-spline template face. In particular, the symbol ⊗ denotes a composite cross-product operator, which makes w ⊗ v = [w_1 × v_1; w_2 × v_2; ...; w_{N_p} × v_{N_p}], where w and v are 3N_p × 1 vectors containing N_p normals.
However, there exist two issues: (1) the low-dimensional h may not guarantee a unique solution for the high-dimensional b; and (2) the system is not simply linear, and is therefore difficult to solve. Accordingly, a frontal constraint based on the template b_0 is applied to obtain a unique solution, and an approximate linearization strategy is proposed to obtain a linear solution.

Frontal Constraint
The frontal constraint is a distance measurement between the surface S and the template w.r.t. the x- and y-components:

O_2 = ||T^{#xy} (b − b_0)||^2,

where the matrix T^{#xy} stacks the 0th-order coefficients at the parameters {u_j}_{j=1:N_p} corresponding to the x- and y-components. The operator (·)^{#xy} also sets the coefficients corresponding to the z-components to zero.
In particular, the first item O_1 is not in a simple linear form, for which an approximate linearization is proposed.

Approximate Linearization
According to the characteristics of the cross product ⊗, the first item in O_1 can be rewritten in a linear-like formulation:

h − Λ · (T_1^# b) ⊗ (T_2^# b) = h − L|_b · b = h − R|_b · b,

where L|_b = Λ · [T_1^# b]_⊗ · T_2^# and R|_b = −Λ · [T_2^# b]_⊗ · T_1^#. The operation [·]_⊗ turns a 3N_p × 1 vector w into the block-diagonal matrix of skew-symmetric cross-product blocks, such that [w]_⊗ · v = w ⊗ v. If b is a known parameter, e.g., b_0, then for L|_b the minimization of ||h − L|_{b_0} · b|| becomes a linear system. The same holds for R|_b.
In fact, we can use the formulation ||h − L|_{b_0} · b|| to optimize the control points along the v parameter direction while fixing u, and ||h − R|_{b_0} · b|| to optimize along u while fixing v. A practical technique is to optimize the control points along the u and v parameter directions by turns. The two iteration items are written as

min_b ||h − L|_{b_0} · b||^2 + ||t_u − Λ_1|_{b_0} · T_1^# b||^2,    (18.a)
min_b ||h − R|_{b_0} · b||^2 + ||t_v − Λ_2|_{b_0} · T_2^# b||^2,    (18.b)

where the second term of each formulation is a unit tangent vector constraint on the fixed direction, with t_u (resp. t_v) stacking the unit tangent vectors of the template along u (resp. v). Λ_1|_{b_0} (or Λ_2|_{b_0}) is a 3N_p × 3N_p diagonal matrix that stores the 3N_p reciprocals of the lengths of the tangent vectors ∂F/∂u (or ∂F/∂v) at {u_j}_{j=1:N_p}. During this procedure, b_0 is updated step by step. As shown in Figure 5, the two partial derivatives ∂F/∂v and ∂F/∂u at (u, v) are updated until ∂F/∂v × ∂F/∂u converges to n. By integrating with O_2, the final formulation of the optimization combines Equations (18.a) and (18.b) with O_2 and solves them iteratively until convergence.

Algorithm
An iterative algorithm for this practical solution is presented in Algorithm 1. The processes of the 0th-order and 1st-order consistency are conducted separately in the inner loop, and the outer loop guarantees global convergence on the two consistency problems.

Computational Complexity
The computation in Algorithm 1 involves linear least squares for solving Equations (16), (18.a), and (18.b), SVD for estimating the camera parameters, and Robust PCA for Equation (17). In detail, the computational complexity of solving Equation (16) is O(n^2 f^2 MN), and that of Equations (18.a) and (18.b) is O(N_p^2 MN). The computational complexity of Robust PCA comes to O(N_p^2 k), where k is the rank constraint. Assuming N_p > M > N >> f > n, the computational complexity of the other parts is negligible. In addition, the number of iterations must be considered for the total computation of Algorithm 1.

EXPERIMENT
In this section, experiments are presented to verify our automatic free-form surface modeling method. We first describe the pipeline for preparing a collection of face images of a person for B-spline face reconstruction. We then present quantitative and qualitative comparisons with recent baseline methods on standard images projected from ground truth 3D data (Zhang et al., 2017) with various expressions, illuminations, and poses. Finally, we conduct challenging reconstructions and comparisons on real unconstrained data taken from the challenging Labeled Faces in the Wild (LFW) database (Huang et al., 2007).
FIGURE 5 | Iterative adjustment of the two partial derivatives: processes (1) to (2) adjust ∂F/∂u by fixing ∂F/∂v, and processes (3) to (4) adjust ∂F/∂v by fixing ∂F/∂u, until ∂F/∂u × ∂F/∂v is infinitely close to the objective n; process A implements a practical, iterative linear handling of the B-spline surface adjustment in B.

Synthesized Data With Expression
The ground truth data are from the space-time faces (Zhang et al., 2017), which contain 3D face models with different expressions. We use these data because it is convenient to evaluate our method against the ground truth, and different poses and illuminations can also be simulated from the space-time faces, as seen in Figure 6. Images with various poses and illuminations are collected, and feature points are manually labeled. The reconstruction is evaluated by its error relative to the ground truth model.

Real Data in the Wild
The wild data (Huang et al., 2007) are characterized by subject variations, illumination changes, various poses, background clutter, and occlusions. Images of each person are collected and input into a facial point detector (Burgos-Artizzu et al., 2013) with near-human performance to find the 40 facial points shown in Figure 4. The initial B-spline template face is computed from a neutral model of the space-time faces.

Comparison
To verify the accuracy of automatic surface reconstruction, discrete points are sampled from the generated continuous free-form shape and compared with traditional discrete reconstructions, e.g., the work by Kemelmacher and Seitz (2011) and Roth et al. (2015). For a memory-limited capture system, it is not feasible to collect thousands of images as Kemelmacher and Seitz (2011) and Roth et al. (2015) have done, so we limit all reconstructions to fewer than forty images. We also compare qualitatively with an end-to-end deep learning method by Sela et al. (2017). Deep learning methods rely on training with a large amount of unconstrained data, so we simply use the model provided by Sela et al. (2017), which has been trained on unconstrained images, and test it on images in the wild.

Synthesized Standard Images
We conduct five sessions of reconstruction: the first four reconstruct expressions S1, S2, S3, and S4 from their corresponding images, and the fifth session, S5, is based on images with different expressions. Each session contains 40 images with various illuminations and poses. The reconstruction results are compared with the re-implemented methods Kemel_meth by Kemelmacher and Seitz (2011) and Roth_meth by Roth et al. (2015). Kemel_meth generates a frontal face surface by integration over an image domain of size 120 × 110; we clip it according to the peripheral facial points and interpolate to obtain more vertices. Roth_meth generates a face mesh based on a template with 23,725 vertices. In our method, a control point grid of 102 × 77 is optimized for the B-spline face surface.
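To make the control-grid representation concrete, the following is a minimal sketch of how a point on a tensor-product B-spline surface is evaluated from a control grid via the Cox-de Boor recursion. The cubic degree and open-uniform knot vectors are assumptions for illustration (the paper does not specify them here), and the 102 × 77 grid is shrunk to a tiny one for brevity.

```python
import numpy as np

def bspline_basis(i, p, u, knots):
    """Cox-de Boor recursion: i-th degree-p B-spline basis function at u."""
    if p == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + p] != knots[i]:
        left = (u - knots[i]) / (knots[i + p] - knots[i]) \
               * bspline_basis(i, p - 1, u, knots)
    right = 0.0
    if knots[i + p + 1] != knots[i + 1]:
        right = (knots[i + p + 1] - u) / (knots[i + p + 1] - knots[i + 1]) \
                * bspline_basis(i + 1, p - 1, u, knots)
    return left + right

def surface_point(ctrl, u, v, p=3):
    """Evaluate F(u, v) of a tensor-product B-spline surface from an
    m x n x 3 control grid, using open-uniform knot vectors on [0, 1]."""
    m, n, _ = ctrl.shape
    ku = np.concatenate([np.zeros(p), np.linspace(0.0, 1.0, m - p + 1), np.ones(p)])
    kv = np.concatenate([np.zeros(p), np.linspace(0.0, 1.0, n - p + 1), np.ones(p)])
    pt = np.zeros(3)
    for i in range(m):
        bu = bspline_basis(i, p, u, ku)
        if bu == 0.0:
            continue
        for j in range(n):
            bv = bspline_basis(j, p, v, kv)
            if bv != 0.0:
                pt += bu * bv * ctrl[i, j]
    return pt
```

Because the basis functions form a partition of unity, a flat control grid yields a flat surface; optimizing the control points deforms the surface smoothly, which is what keeps the B-spline face free of the discretization distortion discussed later.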

Quantitative Comparison
To compare the approaches numerically, we compute the shortest point-to-point distance from the ground truth to the reconstruction. Point clouds are sampled from the B-spline face and aligned by solving the absolute orientation problem. As in the work of Roth et al. (2015), the mean Euclidean distance (MED) and the root mean square (RMS) of the distances, normalized by the eye-to-eye distance, are reported in Table 1.
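The evaluation pipeline just described can be sketched as follows. As implementation details are not specified in the text, two assumptions are made: the absolute orientation problem is solved with Umeyama's closed-form similarity alignment (one standard solution), and the nearest-point search is brute force rather than a spatial index.

```python
import numpy as np

def align_absolute_orientation(src, dst):
    """Umeyama's closed-form solution to the absolute orientation problem:
    scale s, rotation R, translation t minimizing ||s*R*src_i + t - dst_i||.
    Returns the source points after alignment."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    cs, cd = src - mu_s, dst - mu_d
    cov = cd.T @ cs / len(src)                 # cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                         # keep R a proper rotation
    R = U @ D @ Vt
    var_s = (cs ** 2).sum() / len(src)         # variance of centered source
    s = np.trace(np.diag(S) @ D) / var_s
    return s * src @ R.T + (mu_d - s * mu_s @ R.T)

def med_rms(gt, recon, eye_dist):
    """MED and RMS of shortest ground-truth-to-reconstruction distances,
    as a percentage of the eye-to-eye distance (brute-force nearest point)."""
    d = np.sqrt(((gt[:, None, :] - recon[None, :, :]) ** 2).sum(-1)).min(1)
    d = 100.0 * d / eye_dist
    return d.mean(), np.sqrt((d ** 2).mean())
```

Normalizing by the eye-to-eye distance makes the reported percentages comparable across subjects of different scales, which is why both MED and RMS in Table 1 are given in percent.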
Particularly, for a fair comparison, the evaluation of Roth_meth is based on a surface clipped with the same facial points as the other two methods. In the table, the best results are highlighted in boldface, and an underlined result has no significant difference from the best. To our knowledge, Roth_meth is the state-of-the-art method for face reconstruction from unconstrained images. Its re-implemented version is affected by noisy normal estimation because of the limited number of images, so its results are not as good as in the original paper, but it still performs well on all sessions. Overall, the results of both Roth_meth and our method have lower errors than Kemel_meth. On sessions S1 and S5, Roth_meth obtains the lowest mean errors, 5.21 and 6.96%, respectively; however, we obtain lower RMS values, 4.10 and 4.34%, while our mean errors are quite close to the best, especially on session S5. On sessions S2, S3, and S4, our method obtains the best results: 6.49 ± 4.66, 4.43 ± 2.91, and 6.46 ± 4.06%. In contrast, the errors of Kemel_meth exceed 8%, and its RMS is also very large on every session. These numerical comparisons provide strong evidence that our B-spline method can build promising reconstructions from face images.

FIGURE 6 | Sample data simulated from the space-time faces (Zhang et al., 2017): images and 3D models with various poses and illuminations are available; data of samples S1, S2, S3, and S4 are used for evaluation.

Visual Comparison
The visual results are shown in Figure 7. We show 3D models in mesh format for the three methods on different sessions, and the vertex counts of the models are also presented. The comparisons in the figure also demonstrate that our method has promising performance. An important observation is that Kemel (Kemelmacher and Seitz, 2011) cannot produce credible depth information and global shape, e.g., the global shape of reconstruction S2 and the mouth and nose of S3 are obviously incorrect, whereas our method addresses both the global and local problems by optimizing the 0th- and 1st-order consistency. And while Roth (Roth et al., 2015) generates more detailed information of an individual, it also produces distortion in the detailed shape, e.g., the eye of reconstruction S2 and the nose of reconstructions S3 and S4. In contrast, our method obtains realistic shape both globally and locally.

Characteristic Comparison
We summarize the characteristics of the results generated by the three methods in Table 2, covering global shape, local detail, credible depth, smoothness, distortion, and derivability. Based on the quantitative and qualitative comparisons, we also give a rough rating: one star, two stars, and three stars represent bad, general, and good reconstruction, respectively. Both Roth_meth and our method obtain good scores on global shape, local detail, and credible depth, and both Kemel_meth and our method obtain a good score on smoothness. Because of its bad depth, Kemel_meth also gets bad scores on global shape and distortion, and a general score on local detail. In addition, the B-spline face model has better smoothness than the models by Kemel_meth and Roth_meth, because it is a C²-differentiable parametric surface while the other two are discrete models. Conclusively, 0th- and 1st-order consistency modeling using a B-spline surface is effective for reconstructing a parametric surface of an individual face.

Real Unconstrained Images
Our method is also tested on real unconstrained data. Unconstrained data means that the images are captured under uncertain conditions, and the faces in the images differ in expression, pose, and illumination. It is difficult to build the geometrical consistency for reconstruction from such data.
Unlike the experiments in the work by Kemelmacher and Seitz (2011), which use hundreds of images, we conduct reconstruction with a limited number of images, because a large amount of face images of one person is not always available for small-sample-size tasks such as criminal investigation. In the experiment, 35 images are uniformly collected for each person from the LFW database, covering different poses, illuminations, and expressions; the results are shown in Figure 8. Let A label the results generated by the re-implemented Kemel_meth, B the results generated by the re-implemented Roth_meth, C the deep learning method Sela_meth by Sela et al. (2017), and D our results. In particular, the input for Sela_meth is one image selected from the 35 images. Images in columns 1, 5, and 8 are the corresponding mean textures and two views of the images, respectively. By comparing these results, we observe the following: (1) In the frontal viewpoint, A and D show more vivid details than B, e.g., the eyes and nose of Colin Powell. But from another viewpoint, D shows a more credible shape than A, e.g., the eyes and forehead of Colin Powell, and the forehead and mouth of Donald Rumsfeld. (2) When the normals are incorrectly estimated from a limited number of images, e.g., for Gloria Macapagal Arroyo, A loses the local information completely, whereas B, C, and D still maintain the general geometrical shape of the face. For all methods, reconstructing the nose is a challenge because its geometric curvature varies greatly; when there are not enough images, the noise can be amplified, so B shows poor results at the nose, being limited by the number of input images. (3) The input of C is an approximately frontal face image selected from the set.
As the model of C is trained on a set of 3D face data, it may not handle the uncertain noise and identities of the inputs, so the details in the reconstructions by C do not look realistic, although their global shapes are stable and face-like. (4) By comparison, our method steadily produces better-looking results than the others from different viewpoints in the dataset. Clear and vivid details can be seen at key components such as the eyes, nose, mouth, forehead, and cheeks.

DISCUSSION
All the above experiments show that our method can build pinpoint geometrical consistency from a limited number of real unconstrained images. Our method may not be the best method in the area of 3D reconstruction from multiple images, as the results in the original work of Roth et al. (2015) look better, but it can handle 3D reconstruction from a limited number of images, which matters because a large amount of images, as used by Roth et al. (2015), may not be obtainable in a condition-restricted system. The shortcomings of A mainly result from the inauthentic depth generated by the integration method. The poor results of B are caused by the mesh template's inability to build correct geometric consistency from a limited number of unconstrained images, and by the distortion errors that discrete differential operations on noisy estimated normals introduce. In contrast, we build pinpoint geometric consistency using a B-spline surface, which better smooths the noise in the estimated normals, so D reconstructs the correct face shape with little distortion, showing better results as a whole.
In the comparison, we do not consider other deep-learning-based methods that have appeared in recent years (Dou et al., 2017; Richardson et al., 2017; Lin et al., 2020; Sengupta et al., 2020; Shang et al., 2020), because almost all recent works focus on deep learning for single-image 3D face reconstruction (Dou et al., 2017; Richardson et al., 2017; Lin et al., 2020; Sengupta et al., 2020) and use a 3DMM model as a prior, while the multi-view deep learning method handles only constrained face images (Shang et al., 2020). This means the deep learning methods can use a large amount of training data as well as a good prior, and the inputs differ between these learning-based methods and our method, so for the sake of fairness we compare with the classic optimization-based approaches. Nevertheless, we also select one representative method by Sela et al. (2017) to show a deep learning result as a reference in the comparison. It shows that if the test data do not fit the prior and the distribution of the training data, the method may produce poor results.

CONCLUSIONS
This study set out to present high-detailed face reconstruction from multiple images based on pinpoint 0th- and 1st-order geometric consistency using B-spline embedding. Based on this consistency modeling, which accords with geometric optics, the method works well for wild data with different poses and expressions. The key contribution of this study is that the surface modeling conforms to the correct rays in geometric optics by using B-spline embedding, which makes high-detailed B-spline modeling from a limited number of face images captured under wild conditions a reality. The method could also be applied to expression tracking and to assisting face recognition in a monitoring or robot system.

AUTHOR CONTRIBUTIONS
WP and ZF have contributed equally to the core idea as well as the experiment design and results analysis. YS, KT, and CX have provided assistance in experiments and analysis, under ZF's supervision. Besides, KT and MF provided the research group with financial support and experimental equipment. KT and ZF are the corresponding authors. All authors contributed to the article and approved the submitted version.