AUTHOR=Lan Yubin, Guo Yaqi, Chen Qizhen, Lin Shaoming, Chen Yuntong, Deng Xiaoling
TITLE=Visual question answering model for fruit tree disease decision-making based on multimodal deep learning
JOURNAL=Frontiers in Plant Science
VOLUME=Volume 13 - 2022
YEAR=2023
URL=https://www.frontiersin.org/journals/plant-science/articles/10.3389/fpls.2022.1064399
DOI=10.3389/fpls.2022.1064399
ISSN=1664-462X
ABSTRACT=Visual question answering (VQA) about diseases is an important embodiment of intelligent management in smart agriculture. Current deep learning research on fruit tree diseases mainly uses single-source data, such as visible images or spectral data, yielding classification and identification results that cannot be directly applied to practical agricultural decision-making. In this study, a VQA model for fruit tree diseases based on multimodal feature fusion was designed. By fusing images with Q&A knowledge of disease management, the model produces decision-making answers by querying questions against fruit tree disease images to locate relevant disease image regions. The main contributions of this study are as follows: (1) a multimodal bilinear factorized pooling model using Tucker decomposition was proposed to fuse image features with question features; (2) a deep modular co-attention architecture was explored to simultaneously learn image and question attention, obtaining richer visual features and stronger interactivity. Experiments show that the proposed unified model, which combines the bilinear model and co-attentive learning in a new network architecture, achieved a decision-making accuracy of 86.36% under limited data (8450 images and 4560k Q&A pairs), outperforming existing multimodal methods. The proposed multimodal fusion model achieved friendly interaction and fine-grained identification and decision-making performance, and can be widely extended in smart agriculture.
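The abstract's first contribution is bilinear pooling of image and question features through a Tucker decomposition. As a rough illustration only (the paper's actual dimensions, ranks, and training details are not given here), the sketch below fuses a question vector and an image vector through a low-rank Tucker core: both vectors are projected to small rank spaces, combined bilinearly via the core tensor, and projected to an output space. All sizes and weight initializations are hypothetical; a real implementation would learn these parameters.

```python
import random

random.seed(0)

# Hypothetical dimensions (not from the paper): question/image feature
# sizes, Tucker ranks, and fused output size.
D_Q, D_V, D_OUT = 8, 10, 6
R_Q, R_V, R_O = 4, 4, 3

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

W_q = rand_matrix(D_Q, R_Q)   # projects question features to rank space
W_v = rand_matrix(D_V, R_V)   # projects image features to rank space
W_o = rand_matrix(R_O, D_OUT) # projects fused features to output space
# Tucker core tensor of shape (R_Q, R_V, R_O)
core = [[[random.gauss(0.0, 1.0) for _ in range(R_O)]
         for _ in range(R_V)] for _ in range(R_Q)]

def matvec(M, x):
    # computes x @ M: x has len(M) entries, result has len(M[0]) entries
    return [sum(x[i] * M[i][j] for i in range(len(x)))
            for j in range(len(M[0]))]

def tucker_fusion(q, v):
    """Bilinear fusion of question vector q and image vector v through a
    low-rank Tucker core (sketch of Tucker-style factorized pooling)."""
    q_r = matvec(W_q, q)  # (R_Q,)
    v_r = matvec(W_v, v)  # (R_V,)
    # bilinear interaction of the two projected vectors via the core tensor
    z = [sum(q_r[i] * v_r[j] * core[i][j][k]
             for i in range(R_Q) for j in range(R_V))
         for k in range(R_O)]
    return matvec(W_o, z)  # (D_OUT,)

q = [random.gauss(0.0, 1.0) for _ in range(D_Q)]
v = [random.gauss(0.0, 1.0) for _ in range(D_V)]
fused = tucker_fusion(q, v)
print(len(fused))  # 6
```

The rank sizes (R_Q, R_V, R_O) control the trade-off between expressiveness of the bilinear interaction and parameter count, which is the usual motivation for Tucker-factorized pooling over a full bilinear model.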