Research Topic

Attentive Models in Vision

About this Research Topic

The modeling and replication of visual attention mechanisms have been studied for more than 80 years by neuroscientists and, more recently, by computer vision researchers, giving rise to various subproblems in the field. Among them, saliency estimation and human-eye fixation prediction have proven valuable for improving many vision-based inference tasks: image segmentation and annotation, image and video captioning, and autonomous driving are some examples. Today, with the surge of attentive and Transformer-based models, the modeling of attention has grown significantly and is a pillar of cutting-edge research in computer vision, multimedia, and natural language processing. In this context, current research efforts also focus on new architectures that are candidate replacements for the convolutional operator, as evidenced by recent works that perform image classification with attention-based architectures or that combine vision with other modalities, such as language, audio, and speech, by leveraging fully-attentive solutions.
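To make the mechanism at the heart of these models concrete, the following is a minimal NumPy sketch of scaled dot-product attention, the core operation of Transformer-based architectures; the function name and toy dimensions are illustrative, not taken from any specific work discussed here.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of value vectors

# toy example: 3 queries/keys/values of dimension 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

Each output row is a convex combination of the value vectors, with weights determined by query-key similarity; this content-dependent weighting is what distinguishes attentive operators from the fixed receptive field of a convolution.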

Given the fundamental role of attention in computer vision, the goal of this Research Topic is to contribute to the growth and development of attention-based solutions, focusing on both traditional approaches and fully-attentive models. Moreover, the study of human attention has inspired models that leverage human gaze data to supervise machine attention. This Research Topic aims to present innovative research on the study of human attention and on the use of attention mechanisms in developing deep learning architectures and enhancing model explainability.

Research papers employing traditional attentive operations or novel Transformer-based architectures are encouraged, as are works that apply attentive models to integrate vision with other modalities (e.g., language, audio, and speech). We also welcome submissions on novel algorithms, datasets, literature reviews, and other innovations related to the scope of this Research Topic.

The topics of interest include but are not limited to:

- Saliency prediction and salient object detection
- Applications of human attention in Vision
- Visualization of attentive maps for Explainability of Deep Networks
- Use of Explainable-AI techniques to improve network properties such as generalization, robustness, and fairness
- Applications of attentive operators in the design of Deep Networks
- Transformer-based or attention-based models for Computer Vision tasks (e.g., classification, detection, segmentation)
- Transformer-based or attention-based models to combine Vision with other modalities (e.g., language, audio, speech)
- Transformer-based or attention-based models for Vision-and-Language tasks (e.g., image and video captioning, visual question answering, cross-modal retrieval, textual grounding / referring expression localization, vision-and-language navigation)
- Computational issues in attentive models
- Applications of attentive models (e.g., robotics and embodied AI, medical imaging, document analysis, cultural heritage)


Keywords: Attention, Attentive Models, Transformer, Saliency, Explainability, Multimodal Networks, Vision-and-Language


Important Note: All contributions to this Research Topic must be within the scope of the section and journal to which they are submitted, as defined in their mission statements. Frontiers reserves the right to guide an out-of-scope manuscript to a more suitable section or journal at any stage of peer review.


About Frontiers Research Topics

With their unique mix of contributions, from Original Research to Review Articles, Research Topics unite the most influential researchers, the latest key findings, and historical advances in a hot research area.


Submission Deadlines

Abstract: 30 September 2021
Manuscript: 31 January 2022

