Optical See-Through Head-Mounted Displays With Short Focal Distance: Conditions for Mitigating Parallax-Related Registration Error

Optical see-through (OST) augmented reality head-mounted displays are quickly emerging as a key asset in several application fields but their ability to profitably assist high precision activities in the peripersonal space is still sub-optimal due to the calibration procedure required to properly model the user's viewpoint through the see-through display. In this work, we demonstrate the beneficial impact, on the parallax-related AR misregistration, of the use of optical see-through displays whose optical engines collimate the computer-generated image at a depth close to the fixation point of the user in the peripersonal space. To estimate the projection parameters of the OST display for a generic viewpoint position, our strategy relies on a dedicated parameterization of the virtual rendering camera based on a calibration routine that exploits photogrammetry techniques. We model the registration error due to the viewpoint shift and we validate it on an OST display with short focal distance. The results of the tests demonstrate that with our strategy the parallax-related registration error is submillimetric provided that the scene under observation stays within a suitable view volume that falls in a ±10 cm depth range around the focal plane of the display. This finding will pave the way to the development of new multi-focal models of OST HMDs specifically conceived to aid high-precision manual tasks in the peripersonal space.


INTRODUCTION
The overarching goal of any augmented reality (AR) display is to seamlessly enrich the visual perception of the physical world with computer-generated elements that appear to spatially coexist with it. This aspect, which can be referred to as locational realism (Grubert et al., 2018), is the main factor that provides the user with a sense of perceptual consistency. Wearable AR head-mounted displays (HMDs) ideally represent the most ergonomic and reliable solutions to support complex manual tasks since they preserve the user's egocentric viewpoint (Sielhorst et al., 2006; Vávra et al., 2017; Cutolo et al., 2020).
In optical see-through (OST) HMDs, the direct view of the world is mostly preserved and, unlike in video see-through (VST) systems, there is no perspective conversion in viewpoint and field of view (fov). This confers a clear advantage over VST solutions, particularly when used to interact with objects in the peripersonal space, since it allows the user to maintain an unaltered and almost natural visual experience of the surrounding world (Rolland and Fuchs, 2000; Cattari et al., 2019). This is critical, for instance, in highly challenging manual tasks such as image-guided surgery, where reality preservation and fail-safety are essential features (van Krevelen and Poelman, 2010; Qian et al., 2017b).
OST displays use semi-transparent surfaces (i.e., optical combiners) to optically combine the computer-generated content with the real view of the world (Holliman et al., 2011). The virtual content is rendered on a two-dimensional (2D) microdisplay placed outside the user's fov and collimation lenses are placed between the microdisplay and the optical combiner to focus the virtual 2D image so that it appears at a pre-defined and comfortable viewing distance on a virtual image plane (i.e., the display focal plane) (Rolland and Cakmakci, 2005).
Nowadays, OST HMDs are at the cutting edge of AR research, and several consumer-level headsets have been recently developed following the success of the Microsoft HoloLens 1 (e.g., MagicLeap One, HoloLens 2, Meta Two, Avegant, Lumus DK Vision). Despite this surge in consumer access, the successful use of these devices in practical applications is still limited by the complexity and unreliability of the calibration procedures needed to ensure an accurate spatial alignment between the real-world view and the computer-generated elements rendered onto the see-through display (Qian et al., 2017a; Cutolo, 2019). Indeed, the inaccessibility of the user-perceived reality makes OST display calibration particularly challenging (Gilson et al., 2008).
The calibration aims to estimate the intrinsic and extrinsic parameters of the virtual rendering camera (Grubert et al., 2018). These parameters account for the eye position with respect to the OST display and encapsulate the projection properties of the pinhole model of the eye and near-eye display (eye-NED).
State-of-the-art manual (Tuceryan et al., 2002; Navab et al., 2004; Moser and Swan, 2016) or interaction-free (Itoh and Klinker, 2014a,b; Plopski et al., 2015) OST display calibration procedures either partially or fully rely on user interaction and provide sub-optimal results that are not tolerable for those high-precision applications for which the accurate alignment between virtual content and perceived reality is of the utmost importance. Moreover, this process should theoretically be repeated whenever the HMD moves and causes a change in the relative position between the virtual image plane of the OST display and the user's eye (i.e., the center of projection of the virtual rendering camera). This would entail re-estimating the position of the eye's first nodal point (i.e., the center of projection of the user's eye) with respect to the OST display. Unfortunately, manual calibration procedures are tedious and error-prone, whereas interaction-free methods based on eye-tracking devices are only capable of indirectly estimating the center of rotation of the user's eye(s) and not the actual center(s) of projection, which is usually shifted by 7-8 mm (Guestrin and Eizenman, 2006). Furthermore, the pose of the eye-tracker with respect to the display might change during use because the user may unintentionally move the camera or because the camera needs to be re-oriented to adjust for different users and eye positions. As a result, frequent re-calibrations of the camera would be required. For these reasons, none of these approaches is capable of completely removing the virtual-to-real registration error due to the viewpoint shift (i.e., parallax). Overall, if the parallax between the calibrated rendering camera and the actual user's viewpoint remains uncorrected, the virtual-to-real registration error will grow with the depth difference between the virtual image plane of the display and the observed scene (Luo et al., 2005).
Unfortunately, the focal plane of most consumer-level OST NEDs is at infinity or at a distance that is incompatible with their use as an aid to manual activities (as it is far from the peripersonal space).
Indeed, owing to the uncertainty in the calibration of the viewpoint-dependent rendering camera, the Microsoft HoloLens can have a maximum static registration error of < 10 mrad, which results in an error of about 5 mm at a distance of 50 cm from the user. This value of the registration error was experimentally verified by Condino et al. (2018) in their study.
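The arithmetic behind this figure can be checked with a short sketch. The function name below is illustrative only (it is not part of any HoloLens tooling); it converts an angular error into the lateral offset it subtends at a given viewing distance.

```python
import math


def lateral_error_mm(angular_error_mrad: float, distance_mm: float) -> float:
    """Lateral offset subtended by an angular error at a given viewing distance."""
    # tan() is kept for generality; for a few mrad it is nearly the angle itself.
    return distance_mm * math.tan(angular_error_mrad * 1e-3)


if __name__ == "__main__":
    # 10 mrad observed at 50 cm, the case quoted in the text.
    print(f"{lateral_error_mm(10.0, 500.0):.2f} mm")
```

For 10 mrad at 500 mm this evaluates to approximately 5 mm, in agreement with the value quoted above.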
To counter this problem, in this work we present a strategy that significantly mitigates the registration error due to the viewpoint shift around a pre-defined depth in the user's peripersonal space. Our solution follows the intuition by Owen et al. (2004), and demonstrates the beneficial impact, on the virtual-to-real registration, of the adoption of OST displays whose optical engines collimate the computer-generated image at a depth close to the fixation point of the user at close distances. This feature, if coupled with a single camera-based calibration procedure performed for a generic viewpoint, is capable of substantially mitigating the AR registration error due to the viewpoint shift for working areas around the depth of the focal plane of the OST display. We also demonstrate that, with this solution, there is no need for any prior-to-use calibration refinement, either manual or interaction-free, to maintain accurate virtual-to-real registration, provided that the view volume under observation stays within a suitable depth range around the optical depth of the display image. This finding will pave the way to the development of new multi-focal models of OST HMDs specifically conceived as an aid during high-precision manual tasks in the peripersonal space.

Notation
The following notation is used throughout the paper. Lowercase letters represent scalars. Coordinate systems are denoted by uppercase letters (e.g., the rendering camera coordinate system associated with the calibration point, $R_C$). The origin of any reference system is denoted by an uppercase bold letter (e.g., the origin of the rendering camera coordinate system, $\mathbf{C}$). Points/vectors are denoted by lowercase bold letters with a superscript indicating the reference coordinate system (e.g., a 3D point in the world reference system, $\mathbf{p}^{W}$). Matrices are denoted by uppercase typewriter letters, such as the intrinsic matrix of the off-axis rendering camera, ${}^{off-E}\mathtt{K}$. Rigid transformation matrices are denoted by uppercase typewriter letters with subscript and superscript representing the source and destination reference frames respectively (e.g., the rigid transformation between W and

Pinhole Camera Model
The combined eye-display system of an OST display is commonly modeled as an off-axis pinhole camera where the nodal point of the user's eye corresponds to the center of projection $V$ and the see-through virtual screen of the display corresponds to the image plane $S$ (Figure 1). The intrinsic matrix of the off-axis pinhole camera model of the eye-display system for a generic position of the eye is:

$$ {}^{off-E}\mathtt{K} = \begin{bmatrix} f_u & 0 & c_u \\ 0 & f_v & c_v \\ 0 & 0 & 1 \end{bmatrix} \tag{1} $$

where $f_u$ and $f_v$ are the focal lengths of the display in pixels, denoting the distances from the image plane $S$ to the pinhole camera projection center $V$. Notably, the two focal lengths differ when the pixels are not perfectly square (i.e., when the pixel aspect ratio is not 1). The principal point is defined as the intersection between the principal axis of the see-through display and the display image plane. The pixel coordinates of the principal point are $(c_u, c_v)$.
This model represents the perspective projection transformation of the virtual rendering camera that maps a generic point in the 3D rendering camera space, $\mathbf{p}^{R_V}$, to the associated 2D pixel displayed on the image plane of the OST display, $\mathbf{i}^{S}$:

$$ \lambda\, \mathbf{i}^{S} = \begin{bmatrix} {}^{off-E}\mathtt{K} & \mathbf{0}_{3 \times 1} \end{bmatrix} \mathbf{p}^{R_V} \tag{2} $$

where both points are expressed in homogeneous coordinates, and $\lambda$ is a generic scale factor due to the equivalence between points in homogeneous coordinates.
The above formulation assumes a special choice of the world reference system $W$, with $W \equiv R_V$. The general $3 \times 4$ projection transformation $\mathtt{P}$ that maps world points onto the image plane of the display also encapsulates the extrinsic parameters (i.e., the 6DoF rigid transformation from $W$ to $R_V$):

$$ \lambda\, \mathbf{i}^{S} = \mathtt{P}\, \mathbf{p}^{W} = {}^{off-E}\mathtt{K} \begin{bmatrix} {}^{R_V}_{W}\mathtt{R} & {}^{R_V}_{W}\mathbf{t} \end{bmatrix} \mathbf{p}^{W} \tag{3} $$

Since $\lambda$ is arbitrary, Equation (3) is an up-to-scale relation and the independent parameters to be computed are 11. Therefore, any calibration of an OST display aims to calculate the 11 independent projection parameters of the virtual rendering camera ($\mathtt{P}$) that generates the correct mapping of each 3D vertex of the virtual object onto the image plane of the OST display. This is commonly done by solving for all of the matrix components at once, or by systematically determining the parameters in Equation (3).
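The projection relation described in this section can be illustrated with a minimal sketch. It assumes a zero-skew intrinsic matrix built from the four parameters named in the text; the function names and all numeric values are made up for illustration and do not come from the display used in the paper.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def intrinsic_matrix(f_u: float, f_v: float, c_u: float, c_v: float) -> Matrix:
    """Zero-skew pinhole intrinsics: focal lengths and principal point in pixels."""
    return [[f_u, 0.0, c_u],
            [0.0, f_v, c_v],
            [0.0, 0.0, 1.0]]


def project(K: Matrix, R: Matrix, t: Sequence[float],
            p_world: Sequence[float]) -> Tuple[float, float]:
    """Map a 3D world point to pixel coordinates via lambda * i = K [R | t] p."""
    # Extrinsic step: express the point in the rendering camera frame.
    p_cam = [sum(R[r][c] * p_world[c] for c in range(3)) + t[r] for r in range(3)]
    # Intrinsic step: homogeneous image point, still scaled by lambda.
    h = [sum(K[r][c] * p_cam[c] for c in range(3)) for r in range(3)]
    # Dividing by the third component removes the arbitrary scale factor.
    return h[0] / h[2], h[1] / h[2]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

if __name__ == "__main__":
    # Hypothetical intrinsics; the world frame coincides with the camera frame.
    K = intrinsic_matrix(f_u=1000.0, f_v=1000.0, c_u=640.0, c_v=512.0)
    print(project(K, IDENTITY, [0.0, 0.0, 0.0], [50.0, -20.0, 500.0]))
```

Scaling the 3D point by any non-zero factor (with zero translation) yields the same pixel, which is the up-to-scale property that leaves 11 independent parameters in the $3 \times 4$ projection matrix.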

Camera-Based OST Calibration Method With Homography Correction
In a previous work (Cutolo et al., 2019), we presented an offline camera-based calibration method for OST displays; we refer to that paper for more details on the method.
Here we report the key equations underpinning the calibration procedure as they are the starting points for our analysis of the registration error. The method exploits standard camera calibration and photogrammetry techniques to produce the off-axis camera model of the eye-display system ${}^{off-E}\mathtt{K}$ for a generic viewpoint position $C$. Hereafter, this position will be referred to as the calibration position. Figure 2 shows a schematic diagram illustrating the spatial relationships between all the reference systems involved in the calibration procedure as well as in the overall perspective projection. The same relationships are also listed in Table 1.
As described in more detail in Cutolo et al. (2019), the perspective projection equation of the off-axis rendering camera associated with the calibration point is obtained using the induced-by-a-plane homography relation between the points on the viewpoint camera image plane and those on the OST display image plane (Equation 4); the same relation in matrix form is given by Equation (5). Therefore, the matrix ${}^{off-E}\mathtt{K}$ can be computed by applying a planar homography correction $\mathtt{H}_C$ to the intrinsic matrix of the ideal on-axis model of the rendering camera that models the eye-display system, ${}^{on-E}\mathtt{K}$. The latter is determined by using the manufacturer's specifications of the OST display, and it is ideally located at the center of the eye-box of the display V, where the eye-box consists of the range of allowed eye positions, at a pre-established eye-to-combiner distance (i.e., the eye-relief), from which the whole image produced by the display is visible. This homography correction encapsulates the shift and scaling effect due to a particular viewpoint position, and it also accounts for the deviations of the real optical features of the see-through display from those provided by the manufacturer's specifications.
The method relies on a camera used as a replacement of the user's eye and placed within the eye-box of the OST display (i.e., at the calibration position C). This camera is referred to as the viewpoint camera.
To compute the homography correction $\mathtt{H}_C$ and the rotation between display and camera image plane ${}^{R_D}_{C}\mathtt{R}$, a virtual checkerboard pattern is displayed on the image plane of the OST display and observed by the viewpoint camera. Hence, by solving a standard PnP problem, we can calculate the relative pose between the viewpoint camera reference system $C$ and the ideal on-axis rendering camera, $[\,{}^{R_D}_{C}\mathtt{R} \mid {}^{R_D}_{C}\mathbf{t}\,]$. Notably, all the rendering cameras have the same orientation (i.e., the orientation of the display) and therefore ${}^{R_D}_{C}\mathtt{R} \equiv {}^{R_C}_{C}\mathtt{R}$; that is, the rotation between any viewpoint-dependent rendering camera reference system (${}^{R_C}_{C}\mathtt{R}\ \forall\, C$) and the reference system of the physical viewpoint camera ($C$) is the same. On account of this, Equations (5) and (3) are equivalent provided that the user's eye position (viewpoint) matches the calibration position ($V \equiv C$). Further, the rotational contribution of $\mathtt{T}_{ext}$ (i.e., ${}^{R_D}_{W}\mathtt{R}$) in Equation (5) is the same regardless of the orientation of the viewpoint camera used for the calibration, as it represents the relative orientation between the world reference system and the display. The translation vector ${}^{R_D}_{C}\mathbf{t}$ is used to compute $\mathtt{H}_C$ and provides a measure of the shifting and scaling contribution due to the viewpoint shift from the ideal on-axis location of the rendering camera to the real viewpoint location (i.e., the calibration position where the viewpoint camera is located). The homography correction accounting for the viewpoint position and the real optical features of the display is given by Equation (6), where $d_{C \to \pi}$ is the distance between the calibration point $C$ and the virtual image plane of the OST display $\pi$ (i.e., the display focal plane), and $d_{R \to \pi}$ is the distance between the ideal center of projection of the on-axis rendering camera $R$ and $\pi$.
The planar homography $\mathtt{H}_C$ encapsulates the shift and scaling effect due to the measured translation vector ${}^{R_D}_{C}\mathbf{t}$ that is induced by the particular viewpoint position with respect to the ideal on-axis position of the rendering camera.

Viewpoint Shift Contribution to the Eye-Display Pinhole Model
In a real application, the actual viewpoint position is different from the calibration position ($V \neq C$). Therefore, the intrinsic matrix should be further refined by applying an additional homography correction encapsulating the shift and scaling effect associated with the relative translation from the calibration position to the current viewpoint position (Itoh and Klinker, 2014b; Cutolo et al., 2019), as expressed by Equation (7), where ${}^{V}_{C}\mathbf{t}$ is the translation vector from the calibration point to the current viewpoint position. As anticipated, all off-axis rendering camera systems associated with different viewpoint positions share the same orientation (i.e., the orientation of the image plane of the display). Thus, no rotation correction is required between them. By plugging Equation (7) into Equation (5), we obtain Equation (8). Therefore, the viewpoint shift ${}^{V}_{C}\mathbf{t}$ generates two parallax contributions to the projection relation of the eye-display model: an intrinsic contribution, represented by the homography correction $\mathtt{H}_{C \to V}$, and an extrinsic contribution. Accurate real-to-virtual registration is maintained only if both contributions are accurately estimated by tracking the viewpoint (e.g., with an eye-tracking mechanism).

FIGURE 2 | Geometrical representation of the spatial relationships between the four reference systems involved in the camera-based calibration procedure: the ideal on-axis virtual rendering camera $R_D$, the real off-axis rendering camera associated with a generic calibration position $R_C$, the physical viewpoint camera used as a replacement of the user's eye $C$, and the off-axis rendering camera associated with the real user's eye position. Rendering cameras are black-colored, whereas the physical viewpoint camera is orange-colored.

TABLE 1 | Position and orientation of all the reference systems involved in the OST perspective projection relation.

| Reference system | Position | Orientation |
|---|---|---|
| On-axis rendering camera $R_D$ | Center of the display eye-box $D$ | Orientation of the display image plane |
| Off-axis rendering camera (at calibration position) $R_C$ | Calibration position $C$ | Orientation of the display image plane |
| Physical viewpoint camera $C$ | Calibration position $C$ | Orientation of the viewpoint camera image plane |
| Off-axis rendering camera (at viewpoint position) $R_V$ | User's viewpoint position $V$ | Orientation of the display image plane |

CONDITIONS FOR MITIGATING PARALLAX-RELATED REGISTRATION ERROR
Overall, the viewpoint shift, if not compensated for through an intrinsic and extrinsic correction of the perspective projection relation, generates a registration error. Nonetheless, there are points in space for which this error is theoretically null regardless of the amount of viewpoint shift. These are the points belonging to the image plane of the see-through display, for which $p^{W}_z = d_{C \to \pi}$. Without loss of generality, let us assume that $W \equiv C$, so that Equation (8) becomes Equation (9). A parallax between calibration position and viewpoint position ${}^{V}_{C}\mathbf{t} = [x' \; y' \; z']^{T}$ generates the following contributions: and Therefore, Equation (9) becomes: Geometrically, it is easy to demonstrate the following relation: (Figure 3). Therefore, we obtain: Then, by normalizing by $d_{V \to \pi}$, we obtain the equivalence of the image point location observed from the two different viewpoints $C$ and $V$ (Equation 14). Thus, Equation (14) implies that the parallax-related registration error (i.e., due to the viewpoint shift) is null for those world points located exactly at the image plane of the OST display (Figure 4). On the other hand, for the points in space with $p^{R_C}_z \neq d_{C \to \pi}$, the viewpoint shift generates a registration error of: From simple algebraic manipulations, the registration error due to the viewpoint shift is given, in vector form and Euclidean coordinates, by Equation (17). In Figure 5, we provide a geometrical representation of the registration error due to a viewpoint shift along the

EXPERIMENTAL SETTING AND CALIBRATION PROCEDURE
Figure 6 shows the experimental setup. In our tests, we used a commercial OST HMD (the ARS30 by Trivisio, Luxembourg), duly customized. The ARS30 visor is provided with a pair of 1280 × 1024 OLED microdisplays and a pair of optical engines that collimate the computer-generated image to a depth compatible with manual tasks in the peripersonal space. Each microdisplay has a 30° diagonal angle of view, resulting in an average angular resolution of ≈ 1.11 arcmin/pixel, and an eye-box dimension of about 8 × 10 mm. In our experiments, we used only the right display of the HMD.
With an approach similar to the one proposed by Cutolo et al. (2017, 2020), we housed the HMD in a 3D printed plastic shell whose function is to incorporate a pair of liquid crystal optical shutters that allowed us to occlude the see-through view upon request and remove the real-world background. As viewpoint camera, we used a SONY FCB-MA130, which has a 1/2.45″ CMOS sensor, a 1280 × 720 resolution, a 59° diagonal FOV, and an angular resolution of ≈ 2.67 arcmin/pixel. The HMD was also integrated with a USB camera placed above the display (Leopard Imaging LI-OV580) for the inside-out tracking mechanism. This camera supports M12 lenses: in our tests, we used a 2.8 mm lens (f-number f/2.0) that, associated with a camera resolution of 1280 × 720, results in a 109° diagonal angle of view. The focal distance of the display $d_{V \to \pi}$ was empirically measured using the same camera equipped with a lens having a 17.5 mm focal length, an f-number of f/5.6, and a circle-of-confusion size of 0.025 mm. This particular lens was associated with a narrower field-of-view compared to the 2.8 mm lens and to a wider depth-of-field. Therefore, by measuring the depth-of-field of the camera when the display was in focus, we were able to estimate the value of $d_{C \to \pi} \approx 33.5$ cm.
The calibration procedure was performed as follows. First, the viewpoint camera and the tracking camera were both calibrated with a conventional calibration technique (Zhang, 2000) that requires storing multiple camera views of a planar pattern (i.e., an OpenCV checkerboard). The linear parameters (i.e., the intrinsic camera matrix) and the non-linearities due to the camera lens distortion were computed using non-linear least-squares minimization (i.e., the Levenberg-Marquardt algorithm). This procedure was performed using the MATLAB camera calibration toolbox (R2019b, MathWorks, Inc., Natick, MA, USA).
Next, the viewpoint camera was placed at the calibration point $C$, empirically and approximately set in the center of the eye-box of the display and at the eye-relief distance (i.e., ≈ 30 mm from the optical combiner). Following Owen et al. (2004), this was done by moving the viewpoint camera left and right so as to determine the width of the viewable area of the display and then averaging the extents to determine the center. The same process was performed to establish the vertical position.
A standard 7x4 OpenCV checkerboard with a square size of 20 mm was used as the target object to be tracked; hereafter this board will be referred to as the validation checkerboard.
The viewpoint-dependent elements of Equation (8) were determined as follows. The composite extrinsic transformation matrix $\mathtt{T}_{ext}$ was estimated by means of the tracking camera, whose coordinate system is $L$.
$\mathtt{T}_{ext}$ can be broken down into two main components:
• $[\,{}^{L}_{W}\mathtt{R} \mid {}^{L}_{W}\mathbf{t}\,]$, which represents the pose of the world reference system with respect to the tracking camera reference system.
• The rigid transformation between the tracking camera and the off-axis rendering camera located at the calibration point.
The orientation of the rendering camera with respect to the viewpoint camera, ${}^{R_D}_{C}\mathtt{R}$, and the translation vector ${}^{R_D}_{C}\mathbf{t}$ that allows us to compute the homography correction accounting for the viewpoint position, $\mathtt{H}_C$, were estimated by rendering a virtual structured marker of known size on the see-through display and by localizing its inner corners through $C$. This calibration routine was developed in the MATLAB environment using the Computer Vision Toolbox.
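The decomposition of the extrinsic transformation into two chained rigid transformations can be sketched as follows. This is only an illustration of how two poses compose; the helper names and the numeric poses are hypothetical and are not the values measured in the experiments.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def rigid_transform(R: Matrix, t: Sequence[float]) -> Matrix:
    """Build a 4x4 homogeneous matrix from a 3x3 rotation and a translation."""
    return [list(R[r]) + [t[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def compose(second: Matrix, first: Matrix) -> Matrix:
    """Return second @ first, i.e. apply `first`, then `second`."""
    return [[sum(second[r][k] * first[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def apply(T: Matrix, p: Sequence[float]) -> List[float]:
    """Apply a 4x4 rigid transformation to a 3D point."""
    ph = list(p) + [1.0]
    return [sum(T[r][c] * ph[c] for c in range(4)) for r in range(3)]


# Hypothetical poses (metres): world -> tracking camera, then
# tracking camera -> off-axis rendering camera at the calibration point.
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T_world_to_tracking = rigid_transform(ROT_Z_90, [0.0, 0.0, 0.4])
T_tracking_to_rendering = rigid_transform(IDENTITY, [0.0, -0.05, -0.03])

# Composite extrinsic transformation: world -> rendering camera.
T_ext = compose(T_tracking_to_rendering, T_world_to_tracking)

if __name__ == "__main__":
    print(apply(T_ext, [1.0, 0.0, 0.0]))
```

Keeping the two factors separate reflects the procedure in the text: the first factor changes at run time with the tracked target, while the second is fixed once the calibration has been performed.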

Test Design
A dedicated AR application implemented in MATLAB was used to measure the virtual-to-real overlay accuracy, namely the registration error (Figure 8A). In the routine, we generated a virtual scene that consisted of a set of virtual spots observed by a virtual viewpoint (i.e., the rendering camera) whose intrinsic and extrinsic projection parameters were initialized according to the calibration procedure previously described. In this way, for each viewpoint position, we were able to measure the virtual-to-real registration accuracy in terms of the Euclidean distance between real landmarks (the corners of the validation checkerboard) and virtual features (the virtual dots).
The validation checkerboard was placed at 16 different distances from the viewpoint camera, ranging from 18 to 65 cm ($18 \leq d_{C \to \pi} \leq 65$ cm). Both the OST HMD and the validation checkerboard were locked by means of two rigid and adjustable holders. The viewpoint camera was attached to a 3D printed mounting template. The mounting template was equipped with fixing holes for placing the camera in eight different pre-set viewpoint positions radially arranged within the eye-box of the see-through display (Figure 7). Each viewpoint position was therefore at a distance of 4 mm from the calibration position ($\|(x', y')\| = 4$ mm and $z' \approx 0$ mm). The template and the camera were both anchored to the translation bar of the HMD holder. For each position of the viewpoint camera, and for each checkerboard position, a viewpoint camera image of the validation board was captured with the display and the optical shutter turned off (Figure 8B). Without moving the board or the camera, the set of virtual spots rendered by the OST display was then captured by the viewpoint camera with both the display and the optical shutter turned on in order to remove the real-world background (Figure 8C). The two images were processed separately by a user-guided semi-automatic corner detection algorithm. The pixel locations of the 18 inner corners of the checkerboard were used for the evaluation. The registration error was computed as the Euclidean distance between the virtual and real features (Figure 9).

FIGURE 6 | Experimental setting for the calibration procedure and the experimental session. 1→The validation checkerboard holder. 2→The validation checkerboard. 3→The optical see-through head-mounted display (OST HMD) holder. 4→The OST HMD. 5→The tracking camera. 6→The optical shutter. 7→The optical combiner. 8→The viewpoint camera. 9→The 3D printed mounting template.
Table 2 shows the registration errors obtained for the viewpoint camera placed in the calibration position; the errors are grouped into the 16 clusters associated with the 16 positions where the validation checkerboard was placed (i.e., for $18 \leq d_{C \to \pi} \leq 65$ cm). In particular, the table reports the mean and standard deviation of, respectively: the image registration error (pixel), the associated angular registration error (arcmin), and the absolute registration error (mm) measured by backprojecting the image registration error at the current chessboard distance ($d_{C \to \pi}$).
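The three error metrics described above are related by simple conversions, sketched below. The corner coordinates are made up, the function names are illustrative, and the use of the sample standard deviation is an assumption (the text does not state which estimator was used); the only values taken from the paper are the 2.67 arcmin/pixel resolution of the viewpoint camera and the 5.87 px mean reported in the Results.

```python
import math
import statistics
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def pixel_errors(real_px: Sequence[Point], virtual_px: Sequence[Point]) -> List[float]:
    """Euclidean distance, in pixels, between each real corner and its virtual spot."""
    return [math.hypot(r[0] - v[0], r[1] - v[1]) for r, v in zip(real_px, virtual_px)]


def pixels_to_arcmin(error_px: float, arcmin_per_px: float = 2.67) -> float:
    """Angular error, using the viewpoint camera resolution quoted in the text."""
    return error_px * arcmin_per_px


def arcmin_to_mm(error_arcmin: float, distance_mm: float) -> float:
    """Back-project an angular error onto a plane at the given distance."""
    return distance_mm * math.tan(math.radians(error_arcmin / 60.0))


if __name__ == "__main__":
    # Made-up corner locations, for illustration only.
    real = [(100.0, 100.0), (200.0, 100.0), (300.0, 100.0)]
    virtual = [(103.0, 104.0), (200.0, 105.0), (306.0, 108.0)]
    errors = pixel_errors(real, virtual)
    print(errors, statistics.mean(errors), statistics.stdev(errors))
    # Conversion chain applied to the mean image error reported in the Results.
    print(pixels_to_arcmin(5.87), arcmin_to_mm(pixels_to_arcmin(5.87), 335.0))
```

With the quoted resolution, 5.87 px corresponds to about 15.7 arcmin, in line with the reported mean angular error; the value in millimetres depends on the checkerboard distance used for the back-projection.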

Results
Overall, the mean image registration error, angular registration error, and absolute registration error for the calibration position ($E_c$) were 5.87 px, 15.7 arcmin, and 1.57 mm, respectively. We can consider this contribution to the registration error as the intrinsic registration error of the system after the calibration, devoid of any parallax contribution due to the viewpoint shift. When analyzing the registration errors obtained for the other eight viewpoint positions, we reasonably considered the average registration error computed for the calibration position as the minimum error achievable. Results of the tests

DISCUSSION
As shown in Figure 10, for checkerboard distances $p^{R_C}_z > d_{C \to \pi}$ and $p^{R_C}_z < d_{C \to \pi}$, taking into account all 8 viewpoint camera positions associated with a parallax of magnitude $O'$, the estimated absolute registration error increases. This increment is properly modeled by Equation (17), which accounts for the contribution of the viewpoint shift to the overall registration error.
The results of the experimental tests confirm rather accurately the trend of the theoretical model represented by Equation (17) when the contribution of the calibration error is also included ($E + E_c$). It should also be noted that, as predicted by our model, the minimum of the curve representing the experimental tests is extremely close to the value of the display focal distance (i.e., ≈ 33 cm).
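The V-shaped trend with a minimum at the focal distance can be reproduced with a simple similar-triangles sketch. This is an illustrative geometric model under stated assumptions (purely lateral viewpoint shift, $z' \approx 0$, error measured on the plane of the observed target); it is not claimed to be identical to Equation (17), and the function names are hypothetical.

```python
import math
from typing import Tuple


def parallax_error_mm(shift_xy_mm: Tuple[float, float], depth_mm: float,
                      focal_distance_mm: float) -> float:
    """Closed-form lateral error on a plane at depth_mm for a lateral viewpoint shift."""
    shift = math.hypot(shift_xy_mm[0], shift_xy_mm[1])
    return shift * abs(depth_mm - focal_distance_mm) / focal_distance_mm


def parallax_error_by_ray_mm(shift_mm: float, point_x_mm: float, depth_mm: float,
                             focal_distance_mm: float) -> float:
    """Same quantity obtained by tracing rays in a 2D (x, z) slice."""
    # Pixel drawn on the focal plane so that it overlays the point seen from x = 0.
    pixel_x = point_x_mm * focal_distance_mm / depth_mm
    # Ray from the shifted eye through that pixel, extended to the target depth.
    hit_x = shift_mm + (pixel_x - shift_mm) * depth_mm / focal_distance_mm
    return abs(hit_x - point_x_mm)


if __name__ == "__main__":
    focal = 335.0  # display focal distance from the text (about 33.5 cm)
    for depth in (180.0, 250.0, 335.0, 450.0, 650.0):
        print(depth, round(parallax_error_mm((4.0, 0.0), depth, focal), 2))
```

In this simplified model the error vanishes at the focal plane and grows linearly with the depth offset on either side, which matches the qualitative behaviour described above; the calibration error $E_c$ would add to it as a separate term.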
The results reported in Tables 2 and 3 show that, for a reasonably wide range of distances from the focal plane of the OST display (i.e., $21 < p^{R_C}_z < 41$ cm), the mean absolute registration error due to a viewpoint shift of magnitude 4 mm is comparable to that obtained for the calibration viewpoint: to a first approximation, given a viewpoint shift of 4 mm with respect to the calibration point, the parallax contribution to the registration error is ≤ 0.6 mm for target objects placed at $21 < p^{R_C}_z < 41$ cm (1.3 mm of mean absolute registration error for the calibration position vs. 1.9 mm for the shifted viewpoint positions). This absolute registration error is low enough to be considered sufficiently reliable to guide high-precision manual procedures.
We hypothesize that the non-negligible magnitude of $E_c$ is due to inaccuracies in the calibration. By way of example, in our calibration procedure we did not consider the non-linear distortions due to the optics of the display, whereas a certain amount of image distortion (such as radial distortion) is certainly present. As suggested in Lee and Hua (2015), a camera-based calibration method that tackles this problem and also estimates the non-linearities in the projection model of the OST display due to the optical distortions would likely provide better results in terms of registration accuracy.
The global registration error trend is slightly different from the theoretical parallax error contribution. This discrepancy could be due to an imperfect estimation of the rotation matrix ${}^{R_D}_{C}\mathtt{R}$ performed during the calibration procedure. As illustrated in section 4, this rotational contribution is used to estimate the orientation of the tracking camera with respect to the rendering camera of the display, ${}^{R_D}_{L}\mathtt{R}$. In order to improve the accuracy of this part of the calibration procedure, we could compute the final calibration result for this rotation by averaging over a set of repeated measurements obtained by placing the viewpoint camera in different calibration positions. To this aim, a more elaborate stereo calibration procedure, based on a global calibration technique capable of countering some of the causes of calibration errors, could improve the overall calibration accuracy (Chen et al., 2019). In the future, we plan to carry out a more detailed error analysis of the possible sources of inaccuracy in the stereo calibration.
In order to provide a preliminary evaluation of the proposed solution with human subjects, we performed a preliminary user study: six lab members, including the authors, were asked to monocularly look at the right display of the HMD to judge the virtual-to-real registration accuracy obtained while observing a checkerboard placed at approximately 33 cm. All of the subjects judged the alignment to be accurate at first glance. Nevertheless, more objective tests are still needed in order to robustly assess our solution also in terms of vergence-accommodation conflict and focus rivalry, as these specific perceptual aspects were not considered in the paper.

FIGURE 10 | Absolute registration error for the 16 clusters associated with the 16 positions where the validation checkerboard was placed (i.e., for $18 \leq d_{C \to \pi} \leq 65$ cm). Each asterisk is the average of the absolute registration error computed over the eight viewpoint positions. The lines represent the theoretical registration error due to the viewpoint shift as calculated through Equation (17). As predicted by our model, the minimum of the curve representing the experimental tests is extremely close to the value of the display focal distance (i.e., ≈ 33 cm).
Better results could also be attained during these user tests if a quick prior-to-use SPAAM-like manual calibration refinement were performed to roughly estimate the viewpoint position and therefore reduce the magnitude of $O'$. This approach would be similar to the official Microsoft HoloLens 1 calibration procedure that is used to approximately estimate the user's interpupillary distance (Grubert et al., 2018). Finally, the possibility to estimate the registration error due to the viewpoint shift offered by our model would allow the user to be notified, in a real application, whenever the AR view cannot ensure that the AR registration falls within a certain margin of accuracy.
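A notification rule of this kind could be sketched as follows. The sketch assumes a simplified similar-triangles error model (lateral shift times the relative depth offset from the focal plane), which may differ from Equation (17); the function names, the 4 mm shift, and the 1 mm tolerance are hypothetical choices made only for illustration.

```python
import math
from typing import Tuple


def reliable_depth_range_mm(shift_mm: float, tolerance_mm: float,
                            focal_distance_mm: float) -> Tuple[float, float]:
    """Depth interval in which the modeled parallax error stays within tolerance."""
    if shift_mm <= 0.0:
        return 0.0, math.inf  # no viewpoint shift: no parallax error in this model
    # error = shift * |z - d| / d <= tolerance  <=>  |z - d| <= d * tolerance / shift
    ratio = tolerance_mm / shift_mm
    near = max(0.0, focal_distance_mm * (1.0 - ratio))
    far = focal_distance_mm * (1.0 + ratio)
    return near, far


def should_warn_user(depth_mm: float, shift_mm: float, tolerance_mm: float,
                     focal_distance_mm: float) -> bool:
    """True when the target depth falls outside the reliable interval."""
    near, far = reliable_depth_range_mm(shift_mm, tolerance_mm, focal_distance_mm)
    return not (near <= depth_mm <= far)


if __name__ == "__main__":
    # Assumed worst-case shift of 4 mm, assumed tolerance of 1 mm, focal plane at 335 mm.
    print(reliable_depth_range_mm(4.0, 1.0, 335.0))
    print(should_warn_user(650.0, 4.0, 1.0, 335.0))
```

In practice the tolerance would be set by the task, and the shift bound would come from the eye-box size or from a coarse estimate of the viewpoint position.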

CONCLUSION AND FUTURE WORK
In the literature, the idea of adjusting the optical depth of the virtual image to match the depth of the fixation point of the user was primarily explored with the aim of reducing the vergence-accommodation conflict inherent to near-eye displays used in the peripersonal space (Dunn et al., 2018). In this paper, we demonstrated the beneficial impact, on the virtual-to-real registration, of the use of AR OST displays with optical engines that collimate the computer-generated image at a depth that matches the fixation point of the user in the peripersonal space. This strategy is based on the use of displays with short focal distances and it features a dedicated parameterization of the virtual rendering camera based on an automatic calibration routine to estimate the projection parameters of the OST display for a generic viewpoint position.
In this work, we first built a theoretical model of the registration error and then experimentally verified its ability to predict the registration error due to a specific viewpoint shift for targets away from the focal plane of the display. We also demonstrated that, with this solution, there is reasonably no need for any prior-to-use calibration refinement, either manual or interaction-free, to ensure accurate virtual-to-real registration, provided that the view volume under observation stays within a suitable depth range around the display focal plane. This finding will pave the way to the development of new multi-focal models of OST HMDs specifically conceived as an aid during high-precision manual tasks in the peripersonal space.
To counter some of the limitations of our validation tests, future work will involve improving the calibration procedure in order to estimate also the radial and tangential distortions caused by the collimation optics of the display, and to estimate more accurately the orientation of the tracking sensor/camera with respect to the display. In addition, we plan to conduct a rigorous validation of the proposed method by separating the contribution to the registration error due to the viewpoint shift from the contribution due to the tracking inaccuracies. Finally, future work will also include experimental tests involving actual users, and based on the manipulation of three-dimensional objects in the peripersonal space. With this user study, we will be able also to evaluate the efficacy of the proposed strategy in mitigating the vergence-accommodation conflict and the focus rivalry problem typical of OST HMDs.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Materials; further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
FC, VF, and NC conceived of the presented idea. FC developed the theory and performed the computations. NC, UF, and FC implemented the software. NC and UF carried out the experiments. FC, NC, and VF verified the analytical methods. FC wrote the manuscript. NC took care of the writing review and editing. FC and VF were in charge of overall direction and planning. VF supervised the project. All authors discussed the results, and have read and agreed to the published version of the manuscript.