ORIGINAL RESEARCH article

Front. Comput. Sci., 16 April 2026

Sec. Human-Media Interaction

Volume 8 - 2026 | https://doi.org/10.3389/fcomp.2026.1821454

From pictorial space to tactile form: a comparative evaluation of AI-based 2.5D reconstruction from modern artwork paintings

  • Department of Industrial Engineering, University of Florence, Florence, Italy

Abstract

Introduction:

The translation of paintings into tactile 2.5D models (i.e., bas-reliefs) represents a significant advancement in improving accessibility for blind and visually impaired individuals. However, reconstructing spatial structure from a single painted image without explicit perspective is inherently ill-posed, particularly in modern and contemporary artworks where perspective, illumination, and geometry deviate from physical realism.

Methods:

This study presents a comparative evaluation of three AI-based reconstruction paradigms: Monocular Depth Estimation, Large Language Models, and Large Reconstruction Models. These approaches are applied to a selected corpus of photographic, realist, and abstract artworks from the CSAC collection (Parma, Italy). An assessment framework is introduced, combining expert-based qualitative evaluation by art historians, formal geometric verification (including integrability and topological consistency), and manufacturability analysis conducted by additive manufacturing specialists.

Results:

The results indicate that Large Language Model-based methods generate semantically rich and perceptually plausible bas-reliefs but lack geometric integrability and topological robustness. Monocular Depth Models perform well in capturing depth hierarchies but tend to oversmooth fine details. Large Reconstruction Models demonstrate strong structural coherence and fabrication readiness, though they often struggle with stylistic reinterpretation.

Discussion:

These findings highlight the trade-offs among current AI-based reconstruction approaches for tactile bas-relief generation. While each paradigm excels in specific aspects, none achieves a complete balance between perceptual fidelity, geometric soundness, and manufacturability. Future work should focus on hybrid strategies that integrate semantic understanding with geometric consistency to better support accessible cultural heritage applications.

1 Introduction

Access to art and cultural heritage is recognized as a fundamental human right by the Universal Declaration of Human Rights. Indeed, Article 27 states that “everyone has the right freely to participate in the cultural life of the community, to enjoy the arts and to share in scientific advancement and its benefits.” Therefore, every effort to promote inclusivity and fair cultural participation is encouraged (Pless and Maak, 2004). Nevertheless, the accessibility of paintings remains an almost insurmountable challenge for blind people (BP) and visually impaired people (Buonamici et al., 2015). While sculptures and other three-dimensional artifacts can be explored through touch (or, at least, it is straightforward to create 3D replicas of such artworks to be enjoyed by BP), paintings require a transformation process to convey their spatial dimensions through tactile means (Uboldi et al., 2025).

A well-established body of research and practice exists on two-dimensional tactile graphics, commonly adopted in tiflo-didactics and museum accessibility. These methods typically consist of redrawn versions of pictorial works using raised lines, embossed contours, or differentiated textures to encode shapes, regions, and compositional elements (Cavazos Quero et al., 2021). Their advantage lies in their strong adherence to the bidimensional nature of the original artwork, allowing for a faithful translation of contours and spatial relationships without introducing volumetric interpretation. For this reason, two-dimensional tactile representations are widely used for educational purposes and for supporting structured reading strategies in blind and visually impaired users (Phutane et al., 2022). However, while effective in conveying layout and segmentation, these techniques are inherently limited in expressing depth, volumetric hierarchy, and spatial articulation, which motivates the exploration of alternative approaches such as 2.5D bas-relief reconstruction. As a consequence, translating the painted scene, or subjects, into 3D or 2.5D models can be an effective way to convey significant information from a given artwork. From this perspective, these models are not a replacement for traditional tactile graphics, but rather a complementary strategy aimed at enriching tactile perception through the introduction of interpretable depth cues, albeit with an unavoidable degree of reconstruction and abstraction. After this “translation” from 2D to 3D, the retrieved models can be manufactured (e.g., by using Additive Manufacturing) so as to provide BP with a touchable object.

Traditionally, the conversion of paintings into tactile representations relies on manual or semi-automatic techniques, guided by subjective artistic interpretation and limited geometric assumptions (Carfagni et al., 2012). These approaches are time-consuming, labor-intensive, and difficult to reproduce across the extensive collections managed by cultural institutions. As a consequence, tactile accessibility of paintings cannot be considered the norm within museum and gallery settings, despite the increasing awareness of curators demonstrated, for example, by studies carried out by the British Museum, the Louvre, and the Uffizi Gallery (Furferi et al., 2024). From a scientific point of view, three-dimensional reconstruction from a single 2D image is an inherently ill-posed mathematical problem for artworks of a non-geometric nature. While in the field of descriptive geometry the so-called inverse problem of perspective (or perspective restitution) shows that, under specific conditions, it is possible to recover the three-dimensional configuration that generated the image, in artworks with intentionally ambiguous (or absent) perspective the lack of strict geometric constraints and the prevalence of artistic conventions make the reconstruction problem underdetermined. In such cases, painted scenes lack sufficient geometric constraints to define a unique solution, exactly as in the case of single-view photographs (Hoiem et al., 2005). Moreover, paintings often depict a scene where subjects (or objects) are represented with shapes, shades, illumination, and textures that are unrealistic. Finally, while in the case of photographs it is to some extent possible to obtain the ground truth of the imaged object/scene, this is impossible for paintings, especially when the painted scene is imagined by the artist and has no counterpart in the real world.
In the context of pictorial art, this concept is amplified by artistic conventions that deviate from physical reality, rendering many algorithms typically adopted for standard computer vision ineffective.

When dealing with Renaissance or hyperrealistic artworks, the aforementioned limitations can be overcome, and a range of image processing methods can be adopted to reconstruct the painted scene and/or subjects.

Accordingly, image processing-based methods for extracting non-strictly three-dimensional models of a painted scene are available in the literature (Hu Q. et al., 2025). These methods are based on several widely known approaches, such as Shape From Shading (SFS) (Schmitz et al., 2023) combined with interactive methods encompassing Computer Aided Design (CAD) intervention, as demonstrated in Furferi et al. (2014). In that work, a systematic methodology for the semi-automatic generation of 2.5D tactile models intended for exploration by visually impaired people is proposed. The scene is initially reconstructed as a flat-layered “virtual” bas-relief. The spatial relationships and relative depth between regions (objects/subjects) are established using the painting’s perspective information (e.g., vanishing points).

To retrieve the 3D shape of individual subjects, the SFS method is applied. This involves estimating the lighting conditions and computing the surface normals to retrieve volumetric information. A crucial aspect is the semi-automatic nature of the process, where the user can interact with the model, adjusting lighting parameters and defining boundaries to ensure accurate volume reconstruction. However, many paintings are characterized by inconsistent vanishing points (or a lack of identifiable vanishing points) or unrealistically flattened depth. Furthermore, illumination within a painting is governed by stylistic choices (such as chiaroscuro) rather than a single physical light source, and pictorial surfaces are rarely Lambertian (uniform diffuse reflectors) due to the use of varnishes, glazes, and impasto techniques. For these reasons, SFS-based approaches tend to be unreliable and to produce topological artifacts or incorrect depth estimates.
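One of the geometric checks invoked later in the evaluation framework, integrability, can be sketched in a few lines of numpy. Assuming the depth gradients p = ∂z/∂x and q = ∂z/∂y have been estimated from the recovered normals, an integrable surface must satisfy ∂p/∂y = ∂q/∂x; the residual computed below is a minimal illustration of this idea, not the exact verification procedure used in this study:

```python
import numpy as np

def integrability_residual(p, q, dx=1.0, dy=1.0):
    """Curl residual |dp/dy - dq/dx| of a gradient field, with
    p = dz/dx and q = dz/dy sampled on a regular grid.
    For an integrable (i.e., physically consistent) surface the
    residual should vanish up to discretization error."""
    dp_dy = np.gradient(p, dy, axis=0)
    dq_dx = np.gradient(q, dx, axis=1)
    return np.abs(dp_dy - dq_dx)
```

For example, the gradient field of z = x² + y² (p = 2x, q = 2y) yields a zero residual, while a rotational field such as p = y, q = −x (which no surface can generate) produces a uniformly non-zero residual.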

These issues become far more challenging when dealing with modern and contemporary art, where the painted scene is often intentionally non-realistic and characterized by multiple simultaneous viewpoints, or by perspectives that are mathematically incorrect or distorted. Furthermore, modern movements like Abstract Expressionism or purely Conceptual Art do not depict recognizable objects with definable contours or volume. Translating this kind of depth into a geometric model is fundamentally impossible without significant subjective interpretation. The artist may not have painted a clear, single light source, or may use lighting and shadow for purely emotional or compositional effect, violating real-world physics. As mentioned before, another relevant issue related to the reconstruction of 3D models from painted scenes is the absence of ground truth: there is no objective, measurable 3D geometry of a painting against which reconstruction accuracy can be quantitatively validated. Research is therefore required to focus on assessing the capacity of models to extrapolate a plausible spatial structure consistent with the artist’s visual intention.

It is not surprising that in recent years AI-based methods started to be adopted by researchers to solve this issue. In particular, Monocular Depth Estimation (MDE) methods proved to be effective in providing 3D information from 2D scenes. In fact, MDE models infer dense depth maps from single RGB images, offering a potentially robust mechanism for reconstructing the perceived spatial structure of artworks (Zhang J. et al., 2025). However, the direct application of these state-of-the-art models to the domain of 3D reconstruction from paintings is still a challenge in the field. First, MDE models are trained primarily on natural image datasets [e.g., NYU-Depth V2 and KITTI (Zhang X. et al., 2025)] and therefore tend to produce depth maps that are perceptually inconsistent or semantically inaccurate when applied to artistic imagery. Bridging this domain gap requires specialized adaptation strategies, including fine-tuning on art-specific datasets, style-aware feature normalization, or the integration of perceptual priors inspired by art theory and human visual interpretation (de Mota Gomes et al., 2025).

Secondly, from an artistic perspective, the main issue related to the 3D reconstruction of artworks derives primarily from the need to interpret and materialize elements that were originally created using artistic impression and convention rather than strict, objective representation (Hu X. et al., 2025). In this context, the translation leads to a change in the style and medium of the original artwork, creating a new artwork characterized by its own artistic significance. Elements like the glow of a painted subject, the sheen of oil, or the delicate edges of watercolor, for example, do not have a direct 3D equivalent. Moreover, a painting only shows one view. Any part of an object or figure obscured by another element, or even the back of a figure/object itself, is invented in the 3D reconstruction, thus involving interpretation rather than replication.

Based on the aforementioned considerations, the main research questions for translating paintings into 3D models to be touched by blind and visually impaired people are:

  • 1) To what extent can one-shot monocular depth estimation (MDE), large language models, and large reconstruction models effectively overcome the inherent depth ambiguity and artistic subjectivity encountered during the 3D reconstruction of modern and contemporary paintings?

  • 2) What criteria must be established for the selection and 3D reconstruction of modern and contemporary paintings?

This work aims to provide a preliminary response to these two fundamental questions by (1) exploring the potential of current AI-based methods to retrieve 2.5D and 3D models from painted subjects and (2) proposing a set of criteria to be adopted for comparing the results obtained by reconstruction strategies. In particular, the main contributions of this paper are as follows:

  • 1) It provides a systematic comparison between fundamentally different AI paradigms for 2.5D reconstruction from paintings, namely Monocular Depth Estimation, Large Language Models, and Large Reconstruction Models, highlighting their strengths and limitations in the context of artwork reconstruction.

  • 2) It introduces an evaluation framework integrating a set of criteria based on the combination of assessment by art historians and formal geometric validation (integrability and topological consistency); furthermore, manufacturability analysis for tactile reproduction is also addressed.

  • 3) It hints at the possible gap between perceptual plausibility and geometric validity in AI-based reconstruction of artworks, showing that visually appealing results may not correspond to physically consistent surfaces.

  • 4) It suggests an application-oriented analysis in the context of cultural heritage accessibility, discussing the implications of different reconstruction strategies for the generation of tactile bas-reliefs.

The comparison is carried out within the CHANGES Project funded by the Italian Ministry of University under the PNRR (Piano Nazionale di Ripresa e Resilienza) program and refers to a set of artworks hosted by the Centro Studi e Archivio della Comunicazione dell’Università di Parma (CSAC), an archive, museum, and research center of the University of Parma founded in 1968 by Arturo Carlo Quintavalle.

A second contribution of the present work is the definition of an evaluation framework combining qualitative and quantitative criteria to assess the performance of AI-based reconstruction methods. Since the ground truth 3D geometry for painted artworks is not available, the method integrates expert-based qualitative evaluation with formal geometric verification.

2 Materials and methods

2.1 Statement of the problem

Let I be the digital image of a painted scene, and let S* denote the alleged 3D shape that a subject, object, or scene would have if painted by an artist. Since this work explores the reconstruction of 3D models from painted scenes, it is assumed that the image is acquired by means of a high-resolution (e.g., 20 Mpixel) machine vision system using a single digital camera and a proper illumination system, and that possible distortion is corrected by using known distortion correction models (Wang, 2015). Bas-relief retrieval of the painted scene consists of finding a predictor f that can infer a shape Ŝ = f(I) as close as possible to the “expected shape” S*. From a mathematical point of view, this problem is a process where an objective function (named loss function, see Equation 1) has to be minimized (Imran et al., 2019):

f* = arg min_f L(f(I), S*)    (1)

As already mentioned, the true 3D shape (i.e., the ground truth S*) is not available; therefore, assessing the performance of AI-based methods using standard accuracy metrics or other quantitative performance criteria is not a practicable way (see Figure 1).

Figure 1

“Internal” coherence of the minimization routine is assessed by using the widely known Loss Factor (Imran et al., 2019). As already mentioned, from a quantitative perspective, one of the main challenges in reconstructing 3D models from 2D images of artworks is the absence of ground truth data, since no actual 3D model exists for a painted scene. As a result, the loss functions used to train CNN architectures must rely on a “simulated” ground truth, typically derived from reference models in the training dataset that resemble the subjects depicted in the paintings. The most commonly used loss function for voxel-based 3D reconstruction is the binary cross-entropy in Equation 2:

L_BCE = −(1/N) Σ_i [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]    (2)

where y_i is the ground truth occupancy of voxel i (0 or 1), ŷ_i is the predicted voxel occupancy, and N is the total number of voxels.

Another commonly adopted loss function is the Dice Loss (see Equation 3), which measures the degree of overlap between the predicted and ground-truth voxel occupancy:

L_Dice = 1 − (2 Σ_i y_i ŷ_i) / (Σ_i y_i + Σ_i ŷ_i)    (3)

This metric is effective in situations characterized by class imbalance. It is suitable, for instance, when the majority of voxels are empty and only a small portion corresponds to occupied space.
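For illustration, the two losses discussed above can be implemented in a few lines of numpy (a minimal sketch of the standard definitions, not the training code of any model discussed here):

```python
import numpy as np

def voxel_bce_loss(gt, pred, eps=1e-7):
    """Binary cross-entropy over voxel occupancies (Equation 2).
    gt: ground-truth occupancy in {0, 1}; pred: predicted occupancy in [0, 1]."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(gt * np.log(pred) + (1 - gt) * np.log(1 - pred)))

def dice_loss(gt, pred, eps=1e-7):
    """Dice loss (Equation 3): 1 - 2|X∩Y| / (|X| + |Y|).
    Robust to class imbalance, e.g., mostly empty voxel grids."""
    intersection = np.sum(gt * pred)
    return float(1.0 - (2.0 * intersection + eps) / (np.sum(gt) + np.sum(pred) + eps))
```

A perfect prediction drives both losses toward zero, while a completely inverted occupancy grid pushes the Dice loss toward one, which is the behavior the text relies on when discussing class imbalance.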

Once the predictor f is retrieved, the result of the AI-based approach consists of a depth map, i.e., a grayscale representation of the distance between the camera and the surfaces in a scene, encoding three-dimensional structure in a two-dimensional image format. Stored as a grayscale image where intensity values correspond to relative or absolute distance, depth maps are a bidimensional representation of 3D data, since they provide geometric information that enables the recovery of surface orientation, spatial relationships, and scale (see Figure 2).
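When a depth map encodes metric distance and the camera intrinsics are known, the 3D structure it represents can be recovered by back-projecting each pixel through a pinhole camera model. The sketch below is purely illustrative; the intrinsics fx, fy, cx, cy are assumed values, not parameters of the acquisition setup described in the paper:

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map into a grid of 3D points using a
    pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy,
    Z = depth(v, u). Returns an (H, W, 3) array of points."""
    v, u = np.mgrid[0:depth.shape[0], 0:depth.shape[1]]
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    return np.stack([X, Y, depth], axis=-1)
```

For relative (non-metric) depth maps, as produced by most of the models compared here, the recovered points are defined only up to an unknown scale and shift, which is precisely why quantitative assessment is problematic without ground truth.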

Figure 2

As a consequence, for the intent of 2.5D reconstruction, the assessment of the performance translates into the evaluation of how closely the depth map (and, more precisely, the meshed surface that can be retrieved from such a map) visually resembles the painted subject/object/scene. Subsequent CAD-based routines commonly adopted to refine the model are therefore not considered in this work.

2.2 Zero-shot methods for 2.5D retrieval

There are many different network topologies that can be used to model the predictor f. Most of them share the same basic structure. In general, this architecture has an image encoder that turns the input image into a latent vector x through a series of convolutional blocks and pooling layers, followed by fully connected layers. Then, either further fully connected layers or a deconvolutional (transpose convolution) network can decode the encoded representation into the desired 3D model. A previous study (Furferi, 2025) provided a comparative analysis of recent zero-shot methods available in the literature. Among the models investigated in that work, the following are considered: Depth Anything (v1 and v2) (Yang et al., 2024), Marigold (Viola et al., 2025), Metric3D (v2) (Yin et al., 2023), ZoeDepth (Bhat et al., 2023), and UniDepth (v1, v2, and v2_old) (Piccinelli et al., 2024).
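The generic encoder-decoder structure described above can be sketched, purely for illustration, as a minimal PyTorch module. This is a toy architecture showing the structural idea (convolutional encoder to a latent representation, transpose-convolution decoder back to a one-channel depth map); it is not any of the cited models:

```python
import torch
import torch.nn as nn

class DepthPredictor(nn.Module):
    """Toy encoder-decoder for image-to-depth prediction.
    Input: (B, 3, H, W) RGB tensor; output: (B, 1, H, W) depth tensor."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # -> H/2
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # -> H/4
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # -> H/2
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),              # -> H
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```

Real zero-shot MDE architectures such as those compared here replace this toy encoder with large pre-trained vision transformers and add multi-scale fusion, but the input/output contract (single RGB image in, dense depth map out) is the same.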

These models were directly applied to the digital images of the chosen paintings using publicly accessible implementations and pre-trained weights, without any adjustments for artistic imagery. The goal was not to improve the performance of the models, but to verify whether AI models could infer a plausible spatial structure when given inputs lacking physical depth. It is important to note that the models were trained on several well-known datasets from the literature, such as BlendedMVS, Hypersim, IRS, and TartanAir (Kim et al., 2022); however, none of these datasets is related to paintings. Results provided in the cited literature demonstrate that Depth Anything v2 is among the most suitable zero-shot architectures for providing 3D reconstruction of paintings, at least considering the case studies selected and an assessment based on subjective evaluation of the retrieved 3D models.
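Since most of these zero-shot models output relative (affine-invariant) rather than metric depth, depth maps from different architectures are only comparable after a per-image alignment. A least-squares scale-and-shift fit is the usual choice in MDE evaluation; the sketch below illustrates the idea and is not part of the cited pipelines:

```python
import numpy as np

def align_scale_shift(pred, ref, mask=None):
    """Align a relative depth map to a reference by solving
    min over (s, t) of || s * pred + t - ref ||^2 on valid pixels,
    then returning the aligned map s * pred + t."""
    p = pred.ravel() if mask is None else pred[mask]
    r = ref.ravel() if mask is None else ref[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)
    return s * pred + t
```

After alignment, residual differences between two maps reflect genuine structural disagreement rather than the arbitrary scale and offset of each model's output.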

In particular, Metric3D v2, UniDepth v1 and v2, and Depth Pro struggled to clearly separate the figure from the background, producing a relatively flat depth map, or one with little if any detail. By contrast, Depth Anything v1 and v2 and GeoWizard v2 managed to capture more detail of the figure. Finally, Depth Anything v2 was also able to capture details of the face, arms, and hands of the subjects painted in the reconstructed artworks (see Figure 3).

Figure 3

2.3 LLMs for 2.5D retrieval

Besides zero-shot methods, Large Language Models (LLMs) can also be used to convey 3D information from 2D images. In particular, LLMs add semantic reasoning, contextual inference, and prior knowledge about object structure, materials, and spatial relationships. When used with appropriate prompts, they can infer plausible depth, hidden geometry, and object continuity by exploiting priors learned from large training corpora. Unlike MDE and LRMs, the 2.5D retrieval carried out in this paper through LLMs is not the result of an explicit geometric reconstruction framework nor of an optimization process based on a defined loss function. The reconstruction is, instead, generated through a prompt-driven process, where the input image is provided to a pre-trained multimodal model together with a textual instruction requesting the generation of a plausible bas-relief representation and the corresponding depth map. As a consequence, the model does not estimate depth by using physical or photometric constraints but infers a volumetric interpretation of the scene by means of semantic knowledge, learned priors, and contextual reasoning embedded in its training data. Generated outputs are therefore non-deterministic (and strongly dependent on the prompt formulation). For this reason, the output is expected to be semantically plausible rather than geometrically grounded. The idea, therefore, is to test whether this visually satisfactory bas-relief reconstruction is also accurate in terms of depth estimation, even if a comparison with vision-based approaches based on geometric consistency is necessarily only qualitative. For the sake of reproducibility, the exact prompt adopted in the experimental campaign is reported in Section 4.1.

It should be noticed that LLMs’ capacity is expected to be strongest in scenarios where reconstruction is underdetermined—such as monocular input or incomplete datasets—while precise engineering or surveying applications still depend on physically grounded vision algorithms. As multimodal foundation models evolve, tighter integration between geometric learning and language-based reasoning may further reduce the gap between perceptual plausibility and metric accuracy in 3D reconstruction workflows. Nevertheless, LLMs are known to be unreliable for computing metric-accurate geometry, and a more seamless connection between geometric learning and language-based reasoning appears to be the only viable route to plausible 3D reconstructions.

In the present work, Gemini 3 and ChatGPT 5.2 were adopted for testing the 2.5D retrieval from paintings, again in terms of depth map reconstruction. The experimentation was carried out by using as input the original image of the artwork and prompting a request to directly retrieve a depth map representing the position of the subjects in the pictorial space.

2.4 LRMs for 2.5D retrieval

Large Reconstruction Models (LRMs) represent a recent paradigm in 3D computer vision, moving away from traditional methods that require multiple images for 3D reconstruction towards fast 3D reconstruction from a single 2D image. These models are trained to directly output a 3D model rather than a depth map. The pioneering work on this topic is Hong et al. (2023), where a large transformer-based encoder-decoder architecture is adopted for learning 3D representations of objects from a single image. The method takes an image as input and regresses a Neural Radiance Field (NeRF) in the form of a triplane representation.

Specifically, LRM utilizes the pre-trained vision transformer DINO (Caron et al., 2021) as the image encoder to generate the image features and learns an image-to-triplane transformer decoder to project the 2D image features onto the 3D triplane via cross-attention and to model the relations among the spatially structured triplane tokens via self-attention. The output tokens from the decoder are reshaped and upsampled to the final triplane feature maps. Afterwards, images are rendered at an arbitrary view by decoding the triplane feature of each point with an additional shared multi-layer perceptron (MLP) (Taud and Mas, 2018) to obtain its color and density and performing volume rendering. A further enhancement of the aforementioned method is TripoSR (Tochilkin et al., 2024).

An image encoder, an image-to-triplane decoder, and a triplane-based neural radiance field (NeRF) comprise the core of TripoSR. The model “guesses” the camera parameters (both intrinsic and extrinsic) during training and inference rather than conditioning the image-to-triplane projection on known camera parameters. This improves the model’s resilience to real-world input images during the inference phase. A plausible three-dimensional structure is retrieved by using learned geometric priors and generative modeling techniques trained on large amounts of 3D data. Instead of using classical photogrammetric pipelines to explicitly reconstruct geometry, the model relies on deep neural networks that encode statistical patterns in shapes, object types, and spatial arrangements. In this work, Tripo3D was tested with the intent of directly providing 3D models of the painted scene instead of retrieving the depth map; in fact, such models are specifically trained using large 3D datasets.

3 Selection of the case studies

During the CHANGES project, a systematic study of the cultural heritage preserved at the Centro Studi e Archivio della Comunicazione (CSAC) of the University of Parma was carried out. The initial step was to define a selection of works on which the different 2.5D and 3D strategies were applied and subsequently tested. The selection was based on the following agreed criteria:

  • The historical and artistic significance of each work and its emblematic nature in relation to CSAC’s collecting policies;

  • The representativeness of the richness and variety of CSAC’s collections (around 12 million pieces across five sections – art, photography, media, fashion, design);

  • Their location within CSAC’s exhibition spaces or archive;

  • Their state of conservation and the feasibility of moving them;

  • The practical applicability of further technological developments aimed at enhanced enjoyment of the works.

The result of this selection process led to the identification of several case studies. In this work, four of them are proposed to test the performance of AI-based methods under inspection:

  • 1) Pescatore di Camogli [1950] by Bruno Stefani, photograph;

  • 2) La giovinezza (Uomo a cavallo) [1936] by Mario Sironi, Realism painting;

  • 3) Senza Titolo [1938] by Mario Sironi, Realism painting;

  • 4) Grande grigio bruno [1954] by Carla Accardi, Abstract painting;

This selection is intended to test reconstruction methods starting from a photograph in order to validate the methodology on a more structurally coherent and realistic image. The first work measures 39.2 × 29.5 cm and depicts, as its title plainly suggests, an elderly fisherman from a Ligurian coastal village intent on mending a fishing net while smoking his pipe. The representational approach is anti-monumental and non-rhetorical, at the same time distancing itself from a purely formalist composition concerned only with framing, light, and shadow. Instead, Stefani shows in this work a growing closeness to the Neorealist experience of those years, as well as a deep interest in the human subject. Here, as in other works of the same period, the subject is caught almost by surprise, if not entirely unaware of the shot, and depicted in the midst of a daily activity.

Then, the experimentation is extended to two works by Mario Sironi, representing Futurism and early 20th-century Realism, where spatial construction and subjects are more interpretative. In particular, La giovinezza (cartoon for the mosaic of Italia corporativa, Milan, Palazzo dei Giornali, 1936) is an oil on canvas-mounted paper (316 × 317 cm). It is one of the large-scale preparatory cartoons for the mosaic L’Italia corporativa, an exemplary work of Sironi’s mature mural style and of Fascist monumental art at its height. Senza Titolo (1938), again by Mario Sironi, is a painting (305 × 307 cm) depicting a male figure represented in a stylized and monumental way. The man is nude or nearly nude, with a small cloth draped around his hips. He has a solid, compact build, with simplified musculature rendered through brown and ochre tones. His face is shown in profile, with sharp, almost sculptural features that recall archaic classical art.

Finally, the methods are tested against an abstract artwork: Grande grigio bruno (1954) by Carla Accardi, in which oil alternates with enamel on canvas. The choice of the aforementioned artworks intentionally covers a spectrum from the most straightforward to the most complex case, allowing the evaluation of the robustness, adaptability, and limitations of the reconstruction techniques at different levels of artistic abstraction.

It is important to remark that, for contemporary artworks, the term “expected 3D shape” does not refer to an objective or measurable ground truth, but rather to a perceptual and interpretative reference inferred by the observer based on visual cues and artistic conventions. In the case of abstract artworks, such a reference cannot be defined in geometric terms, and the evaluation must instead rely on criteria such as topological consistency and suitability for tactile exploration.

4 Bas-relief (2.5D) reconstruction results

4.1 Bruno Stefani photograph case study

Figure 4A shows the rectified image of the photograph by Bruno Stefani. The image was used as input for the Depth Anything v2 architecture. Processing was conducted on a workstation equipped with an NVIDIA RTX 4080 SUPER GPU. As already mentioned, the MDE architecture is pre-trained on large datasets of real images. The retrieved depth map and the corresponding mesh surface are shown in Figures 4B,C. Figure 4D depicts the retrieved mesh with texture. This representation can be useful, for instance, for use on institutions’ websites. In this context, it is not intended for direct tactile exploration, but rather to support accessibility through multimodal digital tools, such as integration with audio descriptions, interactive interfaces, or as a preparatory resource for visually impaired users prior to engaging with physical tactile models. It may also support remote dissemination and educational activities.

Figure 4

From a qualitative analysis of the results, the retrieved depth map appears noticeably flat, i.e., characterized by limited depth variation, especially across semantically important regions. In particular, fine-grained geometric details of the fisherman, such as facial features, are significantly smoothed. This effect can be attributed to the intrinsic limitations of monocular depth estimation models. Since depth is inferred from a single 2D image, the reconstruction relies heavily on learned statistical priors. The model is globally coherent but compresses depth gradients. Interestingly, the fisherman’s net is well defined and spatially detached from the background. For the purpose of 2.5D reconstruction for BP, a dedicated post-processing phase in a CAD environment is therefore necessary to recover the geometric features that are lost during automatic depth estimation. Manual or semi-automatic modeling interventions are in fact required to restore fine anatomical details and enhance local volumetric variations that are not adequately captured, thus enabling morphological accuracy.
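As a hypothetical illustration of the depth-gradient compression discussed above (not the CAD workflow actually used), local depth variations can be re-amplified with a simple unsharp-masking pass on the depth map before meshing:

```python
import numpy as np

def box_blur(d, k=5):
    """Simple k x k box filter with edge padding (no SciPy dependency)."""
    pad = k // 2
    dp = np.pad(d, pad, mode="edge")
    out = np.zeros_like(d, dtype=np.float64)
    for r in range(k):
        for c in range(k):
            out += dp[r:r + d.shape[0], c:c + d.shape[1]]
    return out / (k * k)

def enhance_depth_detail(depth, amount=1.5, k=5):
    """Unsharp masking on a depth map: subtract a blurred copy to isolate
    local detail, then add it back amplified, partially restoring the
    fine depth variations that MDE tends to flatten."""
    detail = depth - box_blur(depth, k)
    return depth + amount * detail
```

Such automatic sharpening can only exaggerate structure that survives in the depth map; genuinely lost features (e.g., the fisherman's facial geometry) still require the manual CAD interventions described in the text.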

Figures 5A–D show the depth maps and the 2.5D models obtained using the two LLMs mentioned in this work (ChatGPT 5.2 and Gemini 3).

Figure 5

For both images the following prompt was used:

“Provide a 2.5D model (i.e. bas-relief) whose topology and geometry is consistent with the input image by interpreting the represented scene; also provide the depth map derived from such a model”

This exact prompt was consistently used across all experiments without modification. The two models were accessed on February 13, 2026.

In detail, the prompt-based reconstruction is structured in the following phases. First, the input image of the artwork is provided to the selected LLM together with the predefined textual prompt. The resulting depth map provided as an output, expressed as a grayscale image encoding relative depth information, is then exported without further modification. Subsequently, the depth map is imported into a reverse engineering software package (in this work, Geomagic Design X), where it is converted into a digital height field. In particular, grayscale values are mapped into geometric displacements along the normal direction, enabling the generation of a bas-relief surface.
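The grayscale-to-height-field conversion performed in the reverse engineering package can be sketched as follows. This is a minimal illustrative stand-in for the Geomagic Design X step described above, not its actual implementation; it assumes grayscale values normalized to [0, 1], and the `max_height` and `pixel_size` parameters are hypothetical.

```python
import numpy as np

def heightfield_to_obj(depth, max_height=5.0, pixel_size=1.0):
    """Convert a normalized grayscale depth map into a bas-relief
    triangle mesh in Wavefront OBJ format.

    Grayscale values are mapped to displacements along z (the
    normal of the image plane), mirroring the height-field
    conversion described in the text.
    """
    h, w = depth.shape
    lines = []
    for i in range(h):
        for j in range(w):
            z = depth[i, j] * max_height
            lines.append(f"v {j * pixel_size} {i * pixel_size} {z}")
    # Two triangles per grid cell; OBJ vertex indices are 1-based.
    for i in range(h - 1):
        for j in range(w - 1):
            a = i * w + j + 1
            b, c, d = a + 1, a + w, a + w + 1
            lines.append(f"f {a} {b} {d}")
            lines.append(f"f {a} {d} {c}")
    return "\n".join(lines)

obj_text = heightfield_to_obj(np.linspace(0, 1, 9).reshape(3, 3))
```

A regular grid of this kind yields a manifold height-field surface by construction, which is one reason the depth-map route is attractive for fabrication.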

It is important to highlight that the prompt's formulation plays a critical role in the results obtained. In particular, requesting a bas-relief reconstruction prior to depth map extraction is intentionally designed to guide the model to reason in volumetric terms rather than performing a direct grayscale-to-depth conversion. This choice was made because, in preliminary tests, all architectures defaulted to such a conversion, which is not optimal for 2.5D reconstruction unless the image intensity is a reliable proxy for surface orientation (and indirectly depth), e.g., under Lambertian reflectance, uniform illumination, and absence of perspective distortion. This indirectly confirms that reconstruction is highly sensitive to prompt design, which introduces an additional layer of variability and limits reproducibility.

The idea, instead, is to leverage the potential of LLMs in directly guessing the volumetric structure of the 2D image using semantic models and then reprojecting the model onto a depth map. This “reversed” process is, in the authors’ opinion, more consistent when using these architectures than directly deriving the depth map. This is mainly because the network interprets the command “generate a depth map from the figure” as a simple conversion of grayscale values into heights.

Although both the generated images simulating bas-reliefs and the proposed depth maps produce perceptually plausible results, the resulting 2.5D reconstructions are not geometrically consistent representations of the subject, as visible in Figures 6A,B, which show a close-up of the subject’s face. At a global scale the reconstructed meshes appear plausible: the overall volumetric structure is coherent, the main anatomical features are recognizable, and the rendered object provides a strong sense of relief. When examined at higher spatial resolution, however, the reconstruction reveals significant topological and geometric inconsistencies. In practice, the predicted depth map tends to encode local intensity gradients rather than true surface curvature, and shading variations rather than metric displacement. This leads to high-frequency artifacts in regions with strong texture or wrinkles (e.g., facial features), depth discontinuities, local self-intersections (or inverted curvature) after meshing, and a surface that does not correspond to a physically realizable manifold. Therefore, such depth maps are suitable for visualization but insufficient for physically meaningful 2.5D modeling without additional geometric information.
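Non-manifold configurations of the kind just described can be screened for with an elementary mesh check. The sketch below is a simplified illustration, not the verification tool used in the study: it flags edges shared by more than two triangles, which on a surface mesh signals a topological defect.

```python
from collections import Counter

def non_manifold_edges(faces):
    """Return edges shared by more than two triangles.

    faces: list of (i, j, k) vertex-index triples. On a manifold
    surface every edge borders at most two faces; a higher count
    indicates a topological defect such as a fin or fold.
    """
    edges = Counter()
    for i, j, k in faces:
        for e in ((i, j), (j, k), (k, i)):
            edges[tuple(sorted(e))] += 1   # undirected edge key
    return [e for e, n in edges.items() if n > 2]

# A fan of three triangles around edge (0, 1) is non-manifold;
# two triangles sharing that edge are fine.
bad = [(0, 1, 2), (0, 1, 3), (0, 1, 4)]
good = [(0, 1, 2), (0, 1, 3)]
```

Checks of this type are routinely run by mesh-repair tools before additive manufacturing.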

Figure 6

Finally, Figure 6C shows the 3D model proposed by Tripo3D algorithm. As already mentioned, Tripo3D does not attempt to preserve the exact pixel level depth of the input photograph; instead, it infers the most statistically plausible 3D shape consistent with the observed semantic content. In this example, a pronounced reinterpretation of the subject’s geometry is provided. Facial features, clothing folds, and fine structures are not reconstructed as literal depth displacements from shading but are regularized toward a manifold consistent with learned human shape distributions. This produces a globally smooth mesh that is topologically valid, but not necessarily metrically faithful to the specific photograph.

4.2 Sironi paintings case study

Moving from a photograph to a painting, 2.5D retrieval becomes more challenging for AI-based approaches. Figure 7 shows the rectified image of the La giovinezza (Uomo a cavallo) painting together with the depth map retrieved by using Depth Anything v2 and the corresponding mesh.

Figure 7

From a qualitative point of view, the reconstruction shows that the adopted MDE model is able to infer a depth hierarchy with partially preserved detachment of objects from the background. However, it still fails in reconstructing fine geometric features and even the head of the subject, which lies very close to the background. The resulting 2.5D surface is still plausible at a global level but lacks spatial coherence when inspected locally. Nonetheless, this model can be considered a strong basis for subsequent CAD-based reconstruction, since an expert user is able to refine details of the model very quickly, as demonstrated by the illustrative results of Figure 8.

Figure 8

Similar results can be obtained when reconstructing the second chosen artwork, i.e., Senza Titolo, by Sironi. Figure 9 shows the results from the application of MDE (original image vs. depth map and 2.5D model with and without texture).

Figure 9

Also in this case, MDE tends to flatten the fine details of the subjects, even though it is very reliable in providing their correct spatial positions in the scene.

Results obtained for the two paintings by means of the two selected LLMs, using again the prompt and the workflow mentioned above, are shown in Figure 10. From a visual point of view, the results seem somewhat astonishing: the bas-relief-like representations visually resemble the subjects of the original artworks, providing a correct positioning in the scene and even several fine details. Nevertheless, the depth maps generated by ChatGPT 5.2 and Gemini 3 do not exhibit the typical properties of physically meaningful depth maps, as they often emphasize local contrast variations rather than encoding a consistent depth ordering of the scene. By way of example, the head of the subject in the painting La giovinezza is still missing. Furthermore, the retrieved depth maps are not consistent in terms of topology and geometric description of the subjects’ features.

Figure 10

Interestingly, even Tripo3D is not capable of retrieving the head of the human subject in the scene of La giovinezza. Moreover, as shown in Figure 11, the human subject positioned on the left of the Senza Titolo painting is the only one reconstructed by Tripo3D, and it is completely reinterpreted by the algorithm.

Figure 11

4.3 Carla Accardi painting case study

When the previously explored algorithms are applied to abstract artworks, the problem changes. First, any spatial positioning of the represented elements may be considered geometrically admissible, since abstract compositions do not demand a depth hierarchy. In fact, there is no objective reference for relative distances, anatomical correctness, or perspectival coherence. Under these conditions, the primary evaluation criterion shifts from geometric fidelity to topological consistency. What becomes crucial is not whether the reconstructed depth corresponds to a “true” spatial arrangement, but whether the generated surface is coherent, free of discontinuities, self-intersections, or non-manifold artifacts, and suitable for physical realization. Figure 12A depicts the rectified artwork by Carla Accardi, while Figure 12B shows the 2.5D model obtained by using the selected MDE architecture. In this specific case the algorithm assigns an arbitrary height to differently colored pixels. Moreover, since the background has a light color, greater height is assigned to darker shades. This choice seems consistent with an attempt to reinterpret the artwork in terms of 2.5D geometry (see Figure 13).

Figure 12

Figure 13

For this specific case study, using again the prompt and the procedure described for the first case study, Gemini 3 outperforms ChatGPT 5.2 both in terms of perceived 2.5D image and depth map, showing a more topologically consistent result. Finally, Tripo3D results (see Figure 14) are interesting. The reconstruction algorithm detaches objects from the background and provides a 3D model representing single features as a whole.

Figure 14

5 Qualitative criteria

From a qualitative point of view, a set of evaluation criteria is defined to assess the performance of AI-based depth map retrieval. These criteria help the user assess how close the reconstruction is to the “expected” 3D shape that the artist aimed to render in the painted artwork.

On the quantitative side, a set of criteria is introduced to formally assess spatial coherence and topological validity of the reconstructed surfaces. In particular, specific indicators are defined to determine the topology of the generated meshes, including integrability constraints, absence of non-manifold configurations, and detection of local curvature inversions or self-intersections. By explicitly separating perceptual plausibility from geometric validity, the assessment enables a structured comparison across heterogeneous reconstruction paradigms, while also ensuring that the resulting models are suitable for downstream fabrication processes such as additive manufacturing.

In detail, criteria are defined as follows:

Criterion 1—fine details: models trained primarily on photographs may tend to over smooth details when applied to painted imagery.

Criterion 2—contour definition: well-defined edges are essential for a clear perception of the resulting 3D model.

Criterion 3—spatial consistency: the estimated depth should not introduce unintended “jumps” or discontinuities, except where explicitly required by the artist’s intent. The discrepancy between the reconstructed shape and the expected one can be formalized as follows: let $S_p$ denote perceptual similarity and $S_g$ geometric fidelity. Depth estimators optimize Equation 4:

$\hat{Z} = \arg\max_{Z} S_p(Z, Z^{*}) \qquad (4)$

where $Z^{*}$ should be the true 3D shape of the reconstructed scene but, as already stated, is in this case the “expected shape.” Since $S_p$ is defined differently from $S_g$, the former measuring perceptual similarity and the latter requiring a topologically consistent surface, a satisfying result can be obtained only when $S_p(Z, Z^{*}) \approx S_g(Z, Z^{*})$.

Criterion 4—artistic perspective management: artists deliberately use perspective in different ways—sometimes realistically (e.g., Renaissance art) and sometimes in a distorted or unconventional manner (e.g., Cubism or Surrealism). The model should interpret these choices appropriately.

Criterion 5—absence of topological artifacts: the depth maps should avoid introducing holes, floating regions, or inverted surfaces. A physically valid height field $z(x, y)$ must satisfy integrability constraints. Therefore, if surface normals are estimated as $\mathbf{n} = (n_x, n_y, n_z)$, the corresponding gradient field $(p, q) = (-n_x/n_z,\ -n_y/n_z)$ must satisfy Equation 5:

$\dfrac{\partial p}{\partial y} = \dfrac{\partial q}{\partial x} \qquad (5)$

Consistently, the curl of the gradient field is required to vanish according to Equation 6:

$\nabla \times (p, q) = \dfrac{\partial q}{\partial x} - \dfrac{\partial p}{\partial y} = 0 \qquad (6)$

When this condition is violated, the gradient field is not globally integrable, and small inconsistencies can accumulate into geometric artifacts. In such cases, the reconstructed mesh may contain local curvature inversions or micro-oscillations.

Criterion 6—suitability for tactile 3D printing: the depth map should enable the creation of a physically reproducible model that is meaningful and effective for tactile exploration.
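The integrability constraint underlying Criterion 5 can be verified numerically on a discrete gradient field. The sketch below is an illustrative check of the curl condition stated above, not the authors' exact procedure; the example gradient fields are synthetic.

```python
import numpy as np

def curl_residual(p, q):
    """Discrete curl of the gradient field (p, q) = (dz/dx, dz/dy).

    For an integrable height field the residual dq/dx - dp/dy
    vanishes everywhere; nonzero values flag regions where the
    predicted gradients cannot arise from any actual surface.
    """
    dq_dx = np.gradient(q, axis=1)   # axis 1 = x direction
    dp_dy = np.gradient(p, axis=0)   # axis 0 = y direction
    return dq_dx - dp_dy

# Gradients of a true surface z = x*y integrate exactly...
y, x = np.mgrid[0:32, 0:32].astype(float)
p_ok, q_ok = y, x                    # dz/dx = y, dz/dy = x
# ...whereas a shading-like field generally does not.
p_bad, q_bad = np.sin(y / 3.0), np.zeros_like(x)
```

A per-pixel residual map of this kind also localizes where artifacts such as curvature inversions are likely to appear after meshing.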

6 Performance assessment

The performance of the four reconstruction pipelines under investigation (Depth Anything v2, ChatGPT 5.2, Gemini 3 and Tripo3D) was assessed according to the six qualitative criteria defined in Section 5. It is important to note that conventional image-based similarity metrics such as Peak Signal-to-Noise Ratio (PSNR) or the Structural Similarity Index Measure (SSIM) (Si et al., 2021) are not applicable in this context: as already mentioned, no ground-truth depth or reference 3D geometry exists for painted scenes. Moreover, it should be remembered that the reconstruction problem is in this case ill-posed and admits multiple plausible solutions, none of which can be considered objectively correct. For this reason, the evaluation framework adopted in this work combines perceptual assessment with quantitative geometric indicators, namely the Gradient Smoothness Index (GSI) and the Integrability Error (IE), which allow assessing spatial coherence and topological consistency independently from ground-truth availability. These indices are formally defined and discussed in the following section.

In detail, the evaluation strategy combined expert-based qualitative judgment with formalized geometric verification where applicable. Criteria 1 (fine details), 2 (contour definition), and 4 (artistic perspective management) were assessed in blind conditions by a panel of four experts in art history. The art historians involved in the study have documented expertise in modern and contemporary art analysis, while the additive manufacturing specialists have experience in mesh processing and tactile model fabrication. The evaluators were presented with anonymized reconstructions and asked to rate each output independently, without knowledge of the algorithm used.

All evaluations were conducted independently, without interaction among panel members, in order to avoid mutual influence. Criteria 3 (spatial consistency) and 5 (absence of topological artifacts) were instead partially formalized through geometric analysis of the reconstructed height fields, in accordance with Equations 4–7. Finally, Criterion 6 (suitability for tactile 3D printing) was evaluated by two specialists in additive manufacturing, who assessed manufacturability, structural coherence, and tactile readability of the resulting meshes. This hybrid evaluation framework allows for a structured comparison that integrates perceptual plausibility, geometric validity, and fabrication feasibility within a unified methodological approach. For Criteria 1, 2 and 4, evaluations were conducted by adopting a 5-point Likert scale (Gürel and Çetin, 2019) according to Table 1.

Table 1

Score  Interpretation
1      Very poor – criterion not satisfied
2      Poor – significant inconsistencies
3      Moderate – partially satisfactory
4      Good – minor issues only
5      Excellent – fully satisfactory

Likert scale adopted for the experimentation.

It is important to remark that given the limited size of the expert panel, formal inter-rater reliability metrics (e.g., Cohen’s kappa or intra-class correlation coefficient) were not computed. Therefore, Likert-scale results should be interpreted as indicative rather than statistically generalizable trends.

The final score for each criterion was computed according to Equation 7:

$\bar{S}_j = \dfrac{1}{N} \displaystyle\sum_{i=1}^{N} s_{ij} \qquad (7)$

where $N$ is the number of evaluators and $s_{ij}$ is the score assigned by evaluator $i$ to criterion $j$.
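The per-criterion mean over evaluators can be computed as follows; the rating matrix below is hypothetical and used only to illustrate the aggregation (rows are evaluators, columns are criteria).

```python
import numpy as np

# Hypothetical Likert ratings: 4 evaluators x 3 criteria.
ratings = np.array([
    [4, 5, 3],
    [3, 4, 4],
    [4, 4, 3],
    [5, 3, 4],
])

# Mean over evaluators (axis 0) gives one final score per criterion.
final_scores = ratings.mean(axis=0)
```

This is the same averaging applied in the paper to Criteria 1, 2 and 4 before tabulating the results.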

Spatial consistency (Criterion 3) was assessed by measuring the so-called Gradient Smoothness Index (GSI), which measures the magnitude of gradient discontinuities (see Equation 8):

$\mathrm{GSI} = \dfrac{1}{|\Omega|} \displaystyle\int_{\Omega} \left\lVert \nabla^{2} z(x, y) \right\rVert \, dx \, dy \qquad (8)$

where $\Omega$ is the image domain.
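A numerical reading of the GSI can be sketched as follows; this assumes the index is computed as the mean magnitude of second-order depth differences over the image domain, which is one plausible discretization and not necessarily the authors' exact formulation.

```python
import numpy as np

def gradient_smoothness_index(z):
    """Mean magnitude of second-order depth differences.

    A planar height field yields a value near zero, while
    gradient discontinuities (depth 'jumps') raise the index.
    """
    gy, gx = np.gradient(z)
    gxy, gxx = np.gradient(gx)   # derivatives of dz/dx
    gyy, gyx = np.gradient(gy)   # derivatives of dz/dy
    hess = np.sqrt(gxx**2 + gxy**2 + gyx**2 + gyy**2)
    return float(hess.mean())

y, x = np.mgrid[0:32, 0:32].astype(float)
smooth = x + y                    # planar height field
jumpy = (x > 15).astype(float)    # step discontinuity
```

Under this reading, lower GSI values correspond to smoother, more spatially consistent reconstructions.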

Criterion 5 (absence of topological artifacts) was assessed by means of the Integrability Error (IE), defined in Equation 9:

$\mathrm{IE} = \dfrac{1}{|\Omega|} \displaystyle\int_{\Omega} \left| \dfrac{\partial q}{\partial x} - \dfrac{\partial p}{\partial y} \right| \, dx \, dy \qquad (9)$

The lower the IE, the fewer the topological artifacts. Finally, Criterion 6 (suitability for tactile 3D printing) was assessed again using a Likert scale, by evaluating relief depth readability and structural continuity. Table 2 shows the overall results obtained for the four case studies.

Table 2

Algorithm          C1 fine details  C2 contour definition  C3 spatial consistency  C4 perspective mgmt.  C5 topological integrity  C6 print suitability
Depth Anything v2  3.4              4.2                    3.8                     3.6                   4.1                       4.3
ChatGPT 5.2        4.1              3.7                    2.6                     3.8                   2.4                       2.9
Gemini 3           4.3              3.9                    2.9                     4.0                   3.1                       3.4
Tripo3D            3.2              3.5                    4.4                     3.3                   4.6                       4.5

Performance assessment according to qualitative criteria.

From the results in Table 2, it emerges that Depth Anything v2 provides a well-balanced performance profile. The expert panel assigned high scores for contour definition (C2), confirming the model’s capacity to separate foreground elements from the background and to preserve spatial hierarchies. Spatial consistency (C3) and topological integrity (C5) also received good evaluations, reflecting low integrability error and limited occurrence of non-manifold artifacts. Consistent with the visual inspection of the case studies, the lower fine-detail score (C1) reflects the smoothing behavior of MDE, which tends to compress local depth gradients and attenuate anatomical features. From a fabrication perspective (C6), the resulting meshes were considered structurally reliable and suitable for tactile reproduction with minimal corrective intervention. ChatGPT 5.2 achieved high scores in fine detail (C1), indicating that the expert panel perceived a strong volumetric articulation of the subjects. Nevertheless, spatial consistency (C3) and topological integrity (C5) obtained lower values, as formal analysis revealed gradient non-integrability, local curvature inversions, and high-frequency oscillations. These inconsistencies also negatively affected manufacturability (C6), requiring substantial post-processing to ensure printability and structural coherence.

Gemini 3 exhibits a performance profile similar to that of ChatGPT 5.2 but with slightly improved geometric behavior, probably due to the better results obtained for Accardi’s artwork.

Tripo3D achieved the highest scores in spatial consistency (C3) and topological integrity (C5). The generated meshes were globally coherent, manifold, and free from self-intersections, confirming the effectiveness of learned geometric priors in enforcing object-level structural validity. Additive manufacturing experts also rated the output highly (C6), citing structural continuity and stable relief gradients. However, fine details (C1) and artistic perspective management (C4) were slightly penalized, as the model tends to regularize shapes toward statistically plausible object templates, diverging from the specific stylistic intent of the original artwork.

7 Conclusion

The present paper proposed a methodological assessment of MDE, LLMs and LRMs for the 2.5D retrieval of subjects painted in artworks, with the final intent of rapidly expanding the range of tactile paintings available to blind people. Two main research questions were established: “to what extent can One-Shot Monocular Depth Estimation and AI-based approaches overcome the inherent ambiguity and artistic subjectivity of painted scenes”, and “which criteria are required for selecting and reconstructing modern and contemporary artworks”.

The results of the assessment, even if limited to a small set of case studies, provide a structured response to the aforementioned questions.

Results indicate that no single model fully resolves the indeterminacy of reconstructing three-dimensional structure from inherently non-metric representations. MDE-based approaches demonstrate robust spatial hierarchy inference and manufacturable outputs, yet they attenuate fine artistic details. LLM-based methods exhibit strong semantic interpretation capabilities and perceptual plausibility, but their reconstructions frequently lack geometric integrability and topological stability. LRM-based reconstruction enforces structural coherence and fabrication readiness, but at the cost of stylistic reinterpretation. Therefore, AI-based models can approximate a plausible spatial structure, but they do not eliminate the intrinsic subjectivity of the translation from pictorial space to geometric form. In these terms, post-processing performed by CAD experts is still required prior to additive manufacturing of the resulting 2.5D model.

Regarding the evaluation criteria, the integration of perceptual expert evaluation (art historians), formal geometric validation (integrability and topological constraints), and manufacturability assessment (performed by additive manufacturing specialists) appears to be a viable process. In particular, the distinction between perceptual plausibility and geometric validity emerges as a central issue.

From an application perspective, the results suggest that MDE-based approaches are preferable when manufacturability and structural reliability are required, while LLM-based methods may support semantic interpretation but require substantial geometric post-processing. LRM-based approaches provide structurally coherent models but may deviate from artistic intent. The selection of the reconstruction pipeline should therefore be driven by the intended use of the tactile model. No single method can be considered universally optimal, and user-guided post-processing is required to achieve both perceptual and geometric fidelity.

This study presents several limitations. The case studies, while intentionally selected to span photography, realism, and abstraction, represent a very limited subset of artistic styles and cannot exhaustively capture the variability of modern and contemporary art practices. Second, it should be noted that ChatGPT 5.2 and Gemini 3 are inherently sensitive to prompt formulation, including wording, level of specificity, and implicit assumptions about geometry and topology. Although a standardized prompt was adopted to safeguard consistency, minor variations in phrasing may lead to significantly different volumetric interpretations. In other words, prompt dependency introduces a layer of subjectivity that is external to the artwork itself and not directly comparable to deterministic MDE or LRM methods. Accordingly, the reproducibility and stability of LLM-based 2.5D retrieval remain conditioned by linguistic mediation, representing a methodological constraint of the present study. The expert panel, although composed of qualified art historians and additive manufacturing specialists, remains limited in size, and wider validation could further strengthen statistical robustness. Finally, the proposed analysis focuses on zero-shot or pre-trained implementations of the selected models without task-specific fine-tuning on art-domain datasets, which may influence performance outcomes.

Future work will be addressed to the development of user-guided methods integrating MDE with CAD modelers for fine-tuning the 2.5D models. Moreover, future studies will systematically analyze prompt sensitivity in LLMs-based approaches, quantifying the variability induced by linguistic formulations and developing standardized prompting protocols to enhance reproducibility. Finally, extended validation involving visually impaired users will be conducted to directly assess the tactile effectiveness and cognitive interpretability of the generated models, ensuring that technical improvements translate into meaningful accessibility outcomes.

Statements

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

RF: Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Validation, Writing – original draft, Writing – review & editing. LG: Formal analysis, Investigation, Validation, Writing – review & editing. YV: Formal analysis, Investigation, Visualization, Writing – review & editing. MS: Formal analysis, Software, Writing – review & editing. FB: Formal analysis, Investigation, Writing – review & editing.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that Generative AI was used in the creation of this manuscript. We used Generative AI to refine the English language of some parts in the text.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Bhat, S. F., Birkl, R., Wofk, D., Wonka, P., and Müller, M. (2023). ZoeDepth: zero-shot transfer by combining relative and metric depth. arXiv. doi: 10.48550/arXiv.2302.12288

2. Buonamici, F., Furferi, R., Governi, L., and Volpe, Y. (2015). “Making blind people autonomous in the exploration of tactile models: a feasibility study,” in Universal Access in Human-Computer Interaction. Access to Interaction, Lecture Notes in Computer Science, vol. 9176, eds. Antona, M., and Stephanidis, C. (Cham: Springer), 85–96.

3. Carfagni, M., Furferi, R., Governi, L., Volpe, Y., and Tennirelli, G. (2012). “Tactile representation of paintings: an early assessment of possible computer-based strategies,” in Progress in Cultural Heritage Preservation, Lecture Notes in Computer Science, vol. 7616, eds. Ioannides, M., et al. (Berlin, Heidelberg: Springer), 287–296.

4. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., et al. (2021). “Emerging properties in self-supervised vision transformers,” in IEEE/CVF International Conference on Computer Vision (ICCV), 9650–9660.

5. Cavazos Quero, L., Iranzo Bartolomé, J., and Cho, J. (2021). Accessible visual artworks for blind and visually impaired people: comparing a multimodal approach with tactile graphics. Electronics 10:297. doi: 10.3390/electronics10030297

6. de Mota Gomes, M., Cheniaux, E., and Egidio Nardi, A. (2025). The illusory visual spectrum: perception, neuroscience, and art. Psychiatr. Danub. 37, 430–439. doi: 10.24869/psyd.2025.430

7. Furferi, R. (2025). Deep learning approaches for 3D model generation from 2D artworks to aid blind people with tactile exploration. Heritage 8:12. doi: 10.3390/heritage8010012

8. Furferi, R., Di Angelo, L., Bertini, M., Mazzanti, P., De Vecchis, K., and Biffi, M. (2024). Enhancing traditional museum fruition: current state and emerging tendencies. Herit. Sci. 12:20. doi: 10.1186/s40494-024-01139-y

9. Furferi, R., Governi, L., Volpe, Y., Puggelli, L., Vanni, N., and Carfagni, M. (2014). From 2D to 2.5D, i.e., from painting to tactile model. Graph. Models 76, 706–723. doi: 10.1016/j.gmod.2014.10.001

10. Gürel, D., and Çetin, T. (2019). Intangible cultural heritage attitude scale: validity and reliability study. Bartın Üniv. Eğit. Fak. Derg. 8, 82–102. doi: 10.14686/buefad.465604

11. Hoiem, D., Efros, A. A., and Hebert, M. (2005). “Geometric context from a single image,” in Tenth IEEE International Conference on Computer Vision (ICCV), vol. 1, 654–661.

12. Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., et al. (2023). LRM: large reconstruction model for single image to 3D. arXiv. doi: 10.48550/arXiv.2311.04400

13. Hu, Q., Wang, J., Peng, X., Li, T., and Cao, R. (2025). Three-dimensional reconstruction image generation of traditional Chinese painting elements. Eng. Appl. Artif. Intell. 158:111417. doi: 10.1016/j.engappai.2025.111417

14. Hu, X., Xing, Y., Cai, X., Zhao, Y., Cook, M., Borgo, R., et al. (2025). “Designing interactions with generative AI for art and creativity: a systematic review and taxonomy,” in Proceedings of the 2025 ACM Designing Interactive Systems Conference, 1126–1155.

15. Imran, S., Long, Y., Liu, X., and Morris, D. (2019). “Depth coefficients for depth completion,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12438–12447.

16. Kim, S. Y., Zhang, J., Niklaus, S., Fan, Y., Chen, S., Lin, Z., et al. (2022). “Layered depth refinement with mask guidance,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3855–3865.

17. Phutane, M., Wright, J., Castro, B. V., Shi, L., Stern, S. R., Lawson, H. M., et al. (2022). Tactile materials in practice: understanding the experiences of teachers of the visually impaired. ACM Trans. Access. Comput. 15, 1–34. doi: 10.1145/3508364

18. Piccinelli, L., Yang, Y.-H., Sakaridis, C., Segu, M., Li, S., Van Gool, L., et al. (2024). “UniDepth: universal monocular metric depth estimation,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10106–10116.

19. Pless, N., and Maak, T. (2004). Building an inclusive diversity culture: principles, processes and practice. J. Bus. Ethics 54, 129–147. doi: 10.1007/s10551-004-9465-8

20. Schmitz, C., Rösch, C., Zingsheim, D., and Klein, R. (2023). Interactive pose and shape editing with simple sketches from different viewing angles. Comput. Graph. 114, 347–356. doi: 10.1016/j.cag.2023.06.024

21. Si, J., Yang, H., Huang, B., Pan, Z., and Su, H. (2021). A full-reference stereoscopic image quality assessment index based on stable aggregation of monocular and binocular visual features. IET Image Process. 15, 1629–1643. doi: 10.1049/ipr2.12132

22. Taud, H., and Mas, J. (2018). “Multilayer perceptron (MLP),” in Geomatic Approaches for Modeling Land Change Scenarios, Lecture Notes in Geoinformation and Cartography, eds. Camacho Olmedo, M., Paegelow, M., Mas, J. F., and Escobar, F. (Cham: Springer).

23. Tochilkin, D., Pankratz, D., Liu, Z., Huang, Z., Letts, A., Li, Y., et al. (2024). TripoSR: fast 3D object reconstruction from a single image. arXiv.

24. Uboldi, S., Bortolotti, A., Candeloro, G., Marasco, A., Sardella, F., Tartari, M., et al. (2025). From touch to mental imagery: the embodied aesthetic experience of late-blind people engaged in the tactile exploration of Enrico Castellani’s pseudo-braille surface. Cult. Med. Psychiatry 49, 725–764. doi: 10.1007/s11013-025-09904-9

25. Viola, M., Qu, K., Metzger, N., Ke, B., Becker, A., Schindler, K., et al. (2025). “Marigold-DC: zero-shot monocular depth completion with guided diffusion,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 5359–5370.

26. Wang, Z. (2015). Removal of noise and radial lens distortion during calibration of computer vision systems. Opt. Express 23, 11341–11356. doi: 10.1364/OE.23.011341

27. Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., et al. (2024). Depth Anything V2. Adv. Neural Inf. Process. Syst. 37, 21875–21911. doi: 10.52202/079017-0688

28. Yin, W., Zhang, C., Chen, H., Cai, Z., Yu, G., Wang, K., et al. (2023). “Metric3D: towards zero-shot metric 3D prediction from a single image,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 9043–9053.

29. Zhang, X., Song, Y., and Lee, D. W. (2025). “A review of vision-based depth estimation: current methods and future directions,” in 2025 IEEE 22nd International Conference on Mobile Ad-Hoc and Smart Systems (Chicago, IL), 574–579.

30. Zhang, J., Wu, Y., and Jiang, H. (2025). Survey on monocular metric depth estimation. Computers 14:502. doi: 10.3390/computers14110502

Summary

Keywords

2.5D reconstruction, accessibility, cultural heritage, monocular depth estimation (MDE), tactile representation of artworks

Citation

Furferi R, Governi L, Volpe Y, Servi M and Buonamici F (2026) From pictorial space to tactile form: a comparative evaluation of AI-based 2.5D reconstruction from modern artwork paintings. Front. Comput. Sci. 8:1821454. doi: 10.3389/fcomp.2026.1821454

Received

02 March 2026

Revised

04 April 2026

Accepted

06 April 2026

Published

16 April 2026

Volume

8 - 2026

Edited by

Jingru Zhang, University of Science Malaysia (USM), Malaysia

Reviewed by

Yogi Udjaja, National Taiwan Normal University, Taiwan

Sofia Menconero, Consiglio Nazionale delle Ricerche Istituto di Scienze del Patrimonio Culturale Roma, Italy

Copyright

*Correspondence: Rocco Furferi,

