About this Research Topic
The modeling and replication of visual attention mechanisms have been studied extensively for more than 80 years by neuroscientists and, more recently, by computer vision researchers, giving rise to several subproblems in the field. Among them, saliency estimation and human-eye fixation prediction have proven important for improving many vision-based inference tasks, such as image segmentation and annotation, image and video captioning, and autonomous driving. With the surge of attentive and Transformer-based models, the modeling of attention has grown significantly and is now a pillar of cutting-edge research in computer vision, multimedia, and natural language processing. In this context, current research efforts also focus on new architectures that are candidates to replace the convolutional operator, as shown by recent works that perform image classification using attention-based architectures or that combine vision with other modalities, such as language, audio, and speech, by leveraging fully attentive solutions.
Given the fundamental role of attention in the field of computer vision, the goal of this Research Topic is to contribute to the growth and development of attention-based solutions, focusing on both traditional approaches and fully attentive models. Moreover, the study of human attention has inspired models that leverage human gaze data to supervise machine attention. This Research Topic aims to present innovative research on the study of human attention and on the use of attention mechanisms to develop deep learning architectures and enhance model explainability.
Research papers employing traditional attentive operations or novel Transformer-based architectures are encouraged, as are works that apply attentive models to integrate vision and other modalities (e.g., language, audio, speech). We also welcome submissions on novel algorithms, datasets, literature reviews, and other innovations related to the scope of this Research Topic.
The topics of interest include but are not limited to:
- Saliency prediction and salient object detection
- Applications of human attention in Vision
- Visualization of attentive maps for Explainability of Deep Networks
- Use of Explainable-AI techniques to improve any aspect of the network (generalization, robustness, and fairness)
- Applications of attentive operators in the design of Deep Networks
- Transformer-based or attention-based models for Computer Vision tasks (e.g., classification, detection, segmentation)
- Transformer-based or attention-based models to combine Vision with other modalities (e.g., language, audio, speech)
- Transformer-based or attention-based models for Vision-and-Language tasks (e.g., image and video captioning, visual question answering, cross-modal retrieval, textual grounding / referring expression localization, vision-and-language navigation)
- Computational issues in attentive models
- Applications of attentive models (e.g., robotics and embodied AI, medical imaging, document analysis, cultural heritage)
Keywords: Attention, Attentive Models, Transformer, Saliency, Explainability, Multimodal Networks, Vision-and-Language
Important Note: All contributions to this Research Topic must be within the scope of the section and journal to which they are submitted, as defined in their mission statements. Frontiers reserves the right to guide an out-of-scope manuscript to a more suitable section or journal at any stage of peer review.