Challenges and Prospects in Vision and Language Research

Language-grounded image understanding tasks have often been proposed as a method for evaluating progress in artificial intelligence. Ideally, these tasks should test a plethora of capabilities that integrate computer vision, reasoning, and natural language understanding. However, the datasets and evaluation procedures used in these tasks are replete with flaws that allow vision and language (V&L) algorithms to achieve good performance without a robust understanding of vision and language. We argue for this position based on several recent studies in the V&L literature and our own observations of dataset bias, robustness, and spurious correlations. Finally, we propose that several of these challenges can be mitigated by the creation of carefully designed benchmarks.


INTRODUCTION
Advancements in deep learning and the availability of large-scale datasets have resulted in great progress in computer vision and natural language processing (NLP). Deep convolutional neural networks (CNNs) have enabled unprecedented improvements in classical computer vision tasks, e.g., image classification and object detection. Progress in many NLP tasks has been similarly swift. Building upon these advances, there is a push to attack new problems that enable concept comprehension and reasoning capabilities to be studied at the intersection of vision and language (V&L) understanding. There are numerous applications for V&L systems, including enabling the visually impaired to interact with visual content using language, human-computer interaction, and visual search. Human-robot collaboration would be greatly enhanced by giving robots an understanding of human language so that they can better understand the visual world.
However, the primary objective of many scientists working on V&L problems is to have them serve as stepping stones toward a visual Turing test (Geman et al., 2015), a benchmark for progress in artificial intelligence (AI). Grounding visual processing using language can provide a test-bed for goal-directed visual understanding, with language queries determining the task to be performed. V&L tasks can demand many disparate computer vision and NLP skills to be used simultaneously. The same system may be required to simultaneously engage in reasoning, object recognition, attribute detection, and more. Most V&L benchmarks capture only a fraction of the requirements of a rigorous visual Turing test; however, we argue that across V&L tasks a rigorous evaluation should assess numerous scene understanding capabilities individually and give confidence that an algorithm is right for the right reasons. If it is possible to do well on a benchmark by only answering common easy questions, not looking at the image, or by merely guessing using spurious correlations, then it will not satisfy these requisites for a good test.
Many V&L tasks have been proposed, including image and video captioning (Mao et al., 2015; Yu et al., 2016), visual question answering (VQA) (Antol et al., 2015; Kafle and Kanan, 2017b; Zhang et al., 2016; Agrawal et al., 2017, 2018; Kafle and Kanan, 2017a), referring expression recognition (RER) (Kazemzadeh et al., 2014), image retrieval (Mezaris et al., 2003; Johnson et al., 2015), activity recognition (Yatskar et al., 2016; Zhao et al., 2017a), and language-guided image generation (Reed et al., 2016; Zhang et al., 2017). A wide variety of algorithms have been proposed for each of these tasks, producing increasingly better results across datasets. However, several studies have called into question the true capability of these systems and the efficacy of current assessment methods (Kafle and Kanan, 2017a; Cirik et al., 2018; Madhyastha et al., 2018). Systems are heavily influenced by dataset bias and lack robustness to uncommon visual configurations (Agrawal et al., 2017; Kafle and Kanan, 2017a; Madhyastha et al., 2018), but these weaknesses are often not measured, which calls into question the value of these benchmarks. These issues also impact system assessment and deployment. Systems can amplify spurious correlations between gender and potentially unrelated variables in V&L problems (Hendricks et al., 2018; Zhao et al., 2017a), resulting in the possibility of severe negative real-world impact.
In this article, we outline the current state of V&L research. We identify the challenges of developing good algorithms, datasets, and evaluation metrics. We discuss issues unique to individual tasks as well as identify common shortcomings shared across V&L benchmarks. We provide our perspective on potential future directions for V&L research, especially on the requisites for benchmarks to better serve as visual Turing tests.

A BRIEF SURVEY OF V&L RESEARCH
Multiple V&L tasks have been proposed for developing and evaluating AI systems. We briefly describe the most prominent V&L tasks and discuss baseline and state-of-the-art algorithms. Some of these tasks are shown in Fig. 1.

Tasks in V&L research
Bidirectional sentence-to-image and image-to-sentence retrieval problems are among the earliest V&L tasks (Mezaris et al., 2003). Early works dealt with simpler keyword-based image retrieval (Mezaris et al., 2003), with later approaches using deep learning and graph-based representations (Johnson et al., 2015). Visual semantic role labeling requires recognizing activities and semantic context in images (Yatskar et al., 2016; Zhao et al., 2017a). Image captioning, the task of generating descriptions for visual content, involves both visual and language understanding. It requires describing the gist of the interesting content in a scene (Lin et al., 2014; Donahue et al., 2015), while also capturing specific image regions (Johnson et al., 2016). Video captioning adds the complexity of understanding temporal relations (Yu et al., 2016). Unfortunately, it is difficult to evaluate the quality and relevance of generated captions without involving humans (Elliott and Keller, 2014). Automatic evaluation metrics (Papineni et al., 2002; Lin, 2004) are incapable of assigning due merit to the large range of valid and relevant descriptions for visual content and are poorly correlated with human judgment, often ranking machine-generated captions as being better than human captions (Kilickaya et al., 2017; Bernardi et al., 2016).
VQA involves answering questions about visual content. Compared to captioning, it is better suited for automatic evaluation, as the output can be directly compared against ground truth answers as long as the answers are one or perhaps two words long (Antol et al., 2015; Kumar et al., 2016; Goyal et al., 2017). VQA was proposed as a form of visual Turing test, since answering arbitrary questions could demand many different skills to facilitate scene understanding. While many believed VQA would be extremely challenging, results on the first natural image datasets quickly rivaled humans, which was in large part due to question-answer distribution bias being ignored in evaluation (Agrawal et al., 2016; Zhang et al., 2016; Agrawal et al., 2017, 2018; Kafle and Kanan, 2017a). Results were good for common questions, but systems were fragile and were incapable of handling rare questions or novel scenarios. Later datasets attempted to better assess generalization. The Task Directed Image Understanding Challenge (TDIUC) tests generalization to multiple question-types (Kafle and Kanan, 2017a), Compositional VQA (C-VQA) evaluates the ability to handle novel concept compositions (Agrawal et al., 2017), and VQA under Changing Priors (VQA-CP) tests generalization to different answer distributions (Agrawal et al., 2018). It is harder to excel on these datasets by just exploiting biases. However, the vast majority of the questions in these datasets do not require complex compositional reasoning. The CLEVR dataset attempts to address this by generating synthetic questions demanding complex chains of reasoning about synthetic scenes consisting of simple geometric shapes (Johnson et al., 2017a). Similar to CLEVR, the GQA dataset measures compositional reasoning, but in natural images, by asking long and complex questions about visual scenes involving real-world complexities (Hudson and Manning, 2019). Video Question Answering has the additional requirement of understanding temporal dynamics (Zhu et al., 2017; Zhao et al., 2017b). We refer readers to survey articles for extensive reviews on VQA (Kafle and Kanan, 2017b) and image captioning (Bernardi et al., 2016).
With VQA, models do not have to provide visual evidence for their outputs. In contrast, RER requires models to provide evidence by either selecting among a list of possible image regions or generating bounding boxes that correspond to input phrases (Kazemzadeh et al., 2014; Rohrbach et al., 2016; Plummer et al., 2017). Since the output of an RER query is always a single box, it is often quite easy to guess the correct box. To counter this, Acharya et al. (2019) proposed visual query detection (VQD), a form of goal-directed object detection, where the query can have 0-15 valid boxes, making the task more difficult and more applicable to real-world applications. FOIL takes a different approach and requires a system to differentiate invalid image descriptions from valid ones (Shekhar et al., 2017). Natural Language Visual Reasoning (NLVR) requires verifying if image descriptions are true (Suhr et al., 2017, 2018).
Unlike the aforementioned tasks, EmbodiedQA requires the agent to explore its environment to answer questions (Das et al., 2018). The agent must actively perceive and reason about its visual environment to determine its next actions. In visual dialog, an algorithm must hold a conversation about an image (Das et al., 2017a,b). In contrast to VQA, visual dialog requires understanding the conversation history, which may contain visual co-references that a system must resolve correctly. The idea of conversational visual reasoning has also been explored in Co-Draw (Kim et al., 2017), a task where a teller describes visual scenes and a drawer draws them without looking at the original scenes.
Of course, it is impossible to create an agent that knows everything about the visual world. Agents are bound to encounter novel situations, and handling these situations requires them to be aware of their own limitations. Visual curiosity addresses this by creating agents that pose questions to knowledgeable entities, e.g., humans or databases, and then incorporate the new information for future use (Zhang et al., 2018; Yang et al., 2018; Misra et al., 2018).

V&L algorithms
There are similarities between many V&L algorithms. Almost all algorithms use pre-trained CNNs for natural scenes and train shallow CNNs for synthetic scenes (Santoro et al., 2017). For language representation, almost all models use recurrent neural networks, with LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014) being the most popular choices. Some algorithms have also made use of linguistic parsers to discover sub-tasks in natural language queries (Andreas et al., 2016; Hu et al., 2017). Recently, the community has been adopting graph-based representations for image retrieval (Johnson et al., 2015), image generation (Johnson et al., 2018), VQA (Yi et al., 2018), and semantic knowledge incorporation (Yi et al., 2018), due to their intuitiveness and suitability for symbolic reasoning (Johnson et al., 2015).
Most V&L algorithms fuse visual and language representations. Fusion mechanisms range from simple techniques, such as concatenation and Hadamard products (Kafle and Kanan, 2016; Antol et al., 2015), to more intricate methods, e.g., bilinear fusion (Fukui et al., 2016), which are argued to better capture interactions between visual and linguistic representations. Attention mechanisms that enable extraction of query-relevant information have also been heavily explored (Yang et al., 2016; Anderson et al., 2018; Kim et al., 2018; Yu et al., 2018). Attention mechanisms learn to assign higher importance to relevant information using both top-down and bottom-up pathways (Anderson et al., 2018).
Another common requirement for many V&L tasks is a multi-step reasoning mechanism. For this, the community has proposed modular networks that use pre-defined components to perform pre-specified reasoning functions, e.g., filtering and describing visual regions (Andreas et al., 2016; Hu et al., 2017; Yu et al., 2018), providing a transparent reasoning process. Compositional reasoning can also be achieved by capturing pairwise interactions between V&L representations (Santoro et al., 2017) and by recurrently extracting and consolidating information from the input (Hudson and Manning, 2018). These approaches directly learn reasoning from data by utilizing structural biases provided by the model definition.
While these algorithms show impressive new capabilities, their development and evaluation have been split into two distinct camps: the first camp focuses on monolithic architectures that often excel at natural image V&L tasks (Yang et al., 2016; Kim et al., 2016), whereas the second camp focuses on compositional architectures that excel at synthetically generated scenes testing for compositional reasoning (Santoro et al., 2017; Hudson and Manning, 2018). Algorithms developed for one camp are often not evaluated on the datasets from the other camp, which makes it difficult to gauge the true capabilities of V&L algorithms. Shrestha et al. (2019) showed that most of the algorithms developed for natural image VQA do not perform well on synthetic compositional datasets and vice-versa. The authors further propose a simple architecture that compares favorably against state-of-the-art algorithms from both camps, indicating that the specialized mechanisms used in more intricate methods, such as attention, modular reasoning, and fusion, may have been over-engineered to perform well on selected datasets.
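To make the difference between these fusion operators concrete, the following minimal NumPy sketch contrasts concatenation, the Hadamard product, and a full bilinear form. The feature vectors, their dimensions, and the weight tensor W are hypothetical placeholders rather than values from any cited model, and the full bilinear form shown here is not the compact approximation used in practical bilinear fusion methods.

```python
import numpy as np

rng = np.random.default_rng(0)

def concat_fusion(v, q):
    """Concatenate image and question features (output size: len(v) + len(q))."""
    return np.concatenate([v, q])

def hadamard_fusion(v, q):
    """Element-wise product; both inputs must have the same dimensionality."""
    return v * q

def bilinear_fusion(v, q, W):
    """Full bilinear form: out[k] = sum_ij v[i] * W[k, i, j] * q[j]."""
    return np.einsum("i,kij,j->k", v, W, q)

# Hypothetical features and weights, used only for illustration.
v = rng.standard_normal(4)           # image feature
q = rng.standard_normal(4)           # question feature
W = rng.standard_normal((3, 4, 4))   # bilinear weights (learned in a real model)

fused_concat = concat_fusion(v, q)        # shape (8,)
fused_hadamard = hadamard_fusion(v, q)    # shape (4,)
fused_bilinear = bilinear_fusion(v, q, W) # shape (3,)
```

Concatenation leaves all cross-modal interaction to later layers, the Hadamard product models only same-index interactions, and the bilinear form models every pair of image and question dimensions at the cost of a weight tensor that grows with the product of the two feature sizes.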

SHORTCOMINGS OF V&L RESEARCH
Progress in V&L has been swift. Benchmarks for several V&L tasks show that algorithms are equalling or even surpassing human performance (Johnson et al., 2017b; Bernardi et al., 2016). In this section, we will outline several shortcomings and challenges faced by the V&L tasks that show that existing results can be misleading. We will also discuss the efficacy of existing remedial measures in tackling these shortcomings.

Dataset bias
Unwanted or unchecked biases in natural datasets are arguably the most prevalent issues in V&L tasks. Since the data used for training and testing a model are often collected homogeneously (Antol et al., 2015; Goyal et al., 2017; Lin et al., 2014), they share common patterns and regularities. Hence, it is possible for an algorithm to get good results by memorizing those patterns, undermining our efforts to evaluate the understanding of vision and language. The biases in datasets can stem from several sources, can be hard to track, and can result in severely misleading model evaluation. Two of the most common forms of bias stem from bias in crowd-sourced annotators and naturally occurring regularities. Finally, 'photographer's bias' is also prevalent in V&L benchmarks, because images found on the web share similarities in posture and composition due to humans having preferences for specific views (Azulay and Weiss, 2018). Since the same biases and patterns are also mirrored in the test dataset, algorithms can simply memorize these superficial patterns (e.g., if the question has the pattern 'Is there an OBJECT in the picture?', then answer 'yes') instead of learning to actually solve the intended task (answer 'yes' only if the OBJECT is actually present). If this bias is not compensated for during evaluation, benchmarks may only test a very narrow subset of capabilities. This can enable algorithms to perform well for the wrong reasons, and algorithms can end up catastrophically failing in uncommon scenarios (Alcorn et al., 2018; Agrawal et al., 2018).
Several studies demonstrate the issue of bias in V&L tasks. For example, blind VQA models that 'guess' the answers without looking at images show relatively high accuracy (Kafle and Kanan, 2016). In captioning, simple nearest neighbor-based approaches yield surprisingly good results (Devlin et al., 2015). Dataset bias occurs in other V&L tasks as well (Zhao et al., 2017a; Shekhar et al., 2017; Zellers et al., 2018; Cirik et al., 2018). Recent studies (Zhao et al., 2017a) have shown that algorithms not only mirror the dataset bias in their predictions, but in fact amplify the effects of bias (see Fig. 2).
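A question-only ('blind') baseline of the kind discussed above can be sketched in a few lines. The sketch below is an illustrative assumption rather than the method of any cited study: it keys on the first two words of the question as a crude question type and always returns the most frequent training answer for that type. The toy question-answer pairs are invented for the example.

```python
from collections import Counter, defaultdict

def question_type(question, n_words=2):
    """Crude question type: the first n_words of the lower-cased question."""
    return " ".join(question.lower().split()[:n_words])

def fit_blind_baseline(train_pairs):
    """Map each question type to its most frequent training answer."""
    by_type = defaultdict(Counter)
    overall = Counter()
    for question, answer in train_pairs:
        by_type[question_type(question)][answer] += 1
        overall[answer] += 1
    table = {t: c.most_common(1)[0][0] for t, c in by_type.items()}
    fallback = overall.most_common(1)[0][0]  # used for unseen question types
    return table, fallback

def predict_blind(question, table, fallback):
    """Answer without ever looking at an image."""
    return table.get(question_type(question), fallback)

# Invented toy training set.
train = [
    ("Is there a dog in the picture?", "yes"),
    ("Is there a cat in the picture?", "yes"),
    ("Is there a person in the picture?", "yes"),
    ("Is there a car in the picture?", "no"),
    ("What color is the bus?", "red"),
    ("What color is the sky?", "blue"),
    ("What color is the car?", "red"),
]
table, fallback = fit_blind_baseline(train)
```

If a benchmark's test set shares the training answer distribution, a baseline like this can score well despite having no visual input, which is exactly the failure of evaluation described in this section.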
Numerous studies have sought to quantify and mitigate the effects of answer distribution bias on an algorithm's performance. As a straightforward solution, Zhang et al. (2016) and Kafle and Kanan (2017a) proposed balanced training sets with a uniform distribution over possible answers. This is somewhat effective for simple binary questions and synthetically generated visual scenes, but it does not address the imbalance in the kinds of questions present in the datasets. Re-balancing all kinds of query types is infeasible for large-scale natural image datasets. Furthermore, it may be counterproductive to forgo information contained in natural distributions in the visual and linguistic content, and focus should instead be on rigorous evaluation that compensates for bias or demonstrates bias robustness (Agrawal et al., 2018). We discuss this further in the next section.
Figure 2. [...] (Agrawal et al., 2018), and test-set predictions of the state-of-the-art VQA model, BAN (Kim et al., 2018). In VQA-CP, the distribution of the test set is intentionally made different from that of the training set to assess whether algorithms can perform under changing priors. The algorithm not only fails to perform well under changing priors, but it also has a bias-amplification effect, i.e., the predictions show even stronger bias towards answers that are more common in the training set than the actual level of bias. Similar observations have been made for semantic role labeling (Zhao et al., 2017a).

Evaluation metrics
Proper evaluation of V&L algorithms is difficult. Much of the challenge arises from the complexity of natural language. Language can be used to express similar semantic content in different ways, which makes automatic evaluation of models that emit words and sentences particularly challenging. For example, the captions 'A man is walking next to a tree' and 'A guy is taking a stroll by the tree' are nearly identical in meaning, but it can be hard for automatic systems to infer that fact. Several evaluation metrics have been proposed for captioning, including simple n-gram matching systems (e.g., BLEU (Papineni et al., 2002), CIDEr (Vedantam et al., 2015) and ROUGE (Lin, 2004)) and human consensus-based measures (Vedantam et al., 2015). Most of these metrics have limitations (Kilickaya et al., 2017; Bernardi et al., 2016), with n-gram based metrics suffering immensely for sentences that are phrased differently but have identical meaning or use synonyms (Kilickaya et al., 2017). Alarmingly, evaluation metrics often rank machine-generated captions as being better than human captions but fail when human subjectivity is taken into account (Bernardi et al., 2016; Kilickaya et al., 2017). Even humans find it hard to agree on what a 'good' caption entails (Vedantam et al., 2015). Automatic evaluation of captioning is further complicated because it is not clear what is expected from the captioning system. A given image can have many valid captions, ranging from descriptions of specific objects in an image to an overall description of the entire image. However, due to natural regularities and photographer bias, generic captions can apply to a large number of images, thereby gaining high evaluation scores without demonstrating visual understanding (Devlin et al., 2015).
Evaluation issues are lessened in VQA and RER, where the output is better defined; however, they are not completely resolved. If performance for VQA is measured using exact answer matches, then even small variations will be harshly punished, e.g., if a model predicts 'bird' instead of 'eagle', then the algorithm is punished as harshly as if it were to predict 'table.' Several solutions have been proposed, but they have their own limitations, e.g., Wu-Palmer Similarity (WUPS), a word similarity metric, cannot be used with sentences and phrases. Alternatively, consensus-based metrics have been explored (Malinowski et al., 2015; Antol et al., 2015), where multiple annotations are collected for each input, with the intention of capturing common variations of the ground truth answer. However, this paradigm can make many questions unanswerable due to low human consensus (Kafle and Kanan, 2017a, 2016). Multiple-choice evaluation has been proposed by several benchmarks (Antol et al., 2015; Goyal et al., 2017). While this simplifies evaluation, it takes away a lot of the open-world difficulty from the task and can lead to inflated performance via smart guessing (Jabri et al., 2016).
Dataset biases introduce further complications for evaluation metrics. Inadequate metrics can compound the issues of bias when the statistical distributions of the training and test sets are not taken into account, artificially inflating performance. Metrics normalized to account for the distribution of training data (Kafle and Kanan, 2017a) and diagnostic datasets that artificially perturb the distribution of train and test data (Agrawal et al., 2018) have been proposed to remedy this. Furthermore, open-ended V&L tasks can potentially test a variety of skills, ranging from relatively easy sub-tasks (detection of large, well-defined objects) to fairly difficult sub-tasks (fine-grained attribute detection, spatial and compositional reasoning, counting, etc.). However, these sub-tasks are not evenly distributed. Placing all skill types on the same footing can inflate system scores and hide how fragile these systems are. Dividing the dataset into underlying tasks can help (Kafle and Kanan, 2017a), but the best way to make such a division is not clearly defined.
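The paraphrase problem can be seen directly by computing clipped n-gram precision, the building block of BLEU, for the two example captions. This is a minimal sketch and not a full BLEU implementation: it omits the brevity penalty and the geometric mean over n-gram orders, and it uses naive whitespace tokenization.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference,
    with each n-gram's count clipped to its count in the reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = "A man is walking next to a tree"
paraphrase = "A guy is taking a stroll by the tree"

unigram_p = clipped_precision(paraphrase, reference, 1)  # 4 of 9 unigrams match
bigram_p = clipped_precision(paraphrase, reference, 2)   # no bigram matches: 0.0
```

Although the two captions mean nearly the same thing, fewer than half of the unigrams and none of the bigrams overlap, so any metric built on these counts scores the paraphrase poorly.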
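The contrast between exact matching and consensus scoring can be illustrated with a small sketch. It assumes the commonly used consensus rule in which a prediction earns full credit once at least three annotators gave the same answer; the threshold parameter min_agree and the ten annotations below are illustrative assumptions.

```python
def exact_match(prediction, ground_truth):
    """1.0 only if the prediction equals the single ground-truth answer."""
    return float(prediction == ground_truth)

def consensus_score(prediction, human_answers, min_agree=3):
    """Partial credit proportional to the number of agreeing annotators,
    capped at 1.0 once min_agree annotators gave the predicted answer."""
    votes = sum(1 for answer in human_answers if answer == prediction)
    return min(votes / min_agree, 1.0)

# Invented annotations for one question about a bird image.
human_answers = ["eagle"] * 7 + ["bird"] * 2 + ["hawk"]

# Exact matching treats a near miss and a nonsensical answer identically.
near_miss = exact_match("bird", "eagle")    # 0.0
nonsense = exact_match("table", "eagle")    # 0.0

# Consensus scoring gives graded credit to answers some annotators used.
score_bird = consensus_score("bird", human_answers)    # 2/3
score_table = consensus_score("table", human_answers)  # 0.0
```

The same rule also shows the weakness noted above: if the ten annotators all give different answers, no prediction can score more than 1/3, so the question is effectively unanswerable under this metric.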
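The effect of an uneven skill distribution on a single accuracy number can be shown with a toy example. The sketch below compares overall accuracy with the arithmetic and harmonic means of per-type accuracies; it is one simple way to put question types on an equal footing and is not claimed to be the exact normalization used by any cited benchmark. The question types and counts are invented.

```python
from collections import defaultdict
from statistics import harmonic_mean, mean

def per_type_accuracy(records):
    """records: iterable of (question_type, is_correct) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, is_correct in records:
        total[qtype] += 1
        correct[qtype] += int(is_correct)
    return {qtype: correct[qtype] / total[qtype] for qtype in total}

# Invented results: 90 easy questions all correct, 10 counting questions
# of which only 2 are correct.
records = (
    [("object presence", True)] * 90
    + [("counting", True)] * 2
    + [("counting", False)] * 8
)

overall = sum(is_correct for _, is_correct in records) / len(records)  # 0.92
per_type = per_type_accuracy(records)               # presence 1.0, counting 0.2
arithmetic_mpt = mean(list(per_type.values()))      # 0.6
harmonic_mpt = harmonic_mean(list(per_type.values()))  # about 0.33
```

The overall score of 0.92 hides the near-failure on counting. The harmonic mean is the harsher summary because it is dominated by the weakest question type.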

Are V&L systems 'horses?'
Sturm defines a 'horse' as 'a system that appears as if it is solving a particular problem when it actually is not' (Sturm, 2016). Of course, the 'horse' here refers to the infamous horse named Clever Hans, which was thought to be capable of arithmetic and abstract thought but was in reality exploiting the micro-signals provided by its handler and audience. Apart from bias and evaluation, there are other issues in V&L datasets that are harder to pinpoint. We review several of these issues and highlight existing studies that scrutinize the true capabilities of existing V&L systems to assess whether they are 'horses.'

Superficial correlations and true vs. apparent difficulty
Due to superficial correlations, the difficulty of V&L datasets may be much lower than the true difficulty of comprehensively solving the task (see Fig. 3). We outline some of the key studies and their findings that suggest V&L algorithms are relying on superficial correlations that enable them to achieve high performance in common situations but make them vulnerable when tested under different, but not especially unusual, conditions.
Figure 3. The apparent versus true complexity of V&L tasks. In RER (left), omitting a large amount of text has no effect on the output of the system (Yu et al., 2018). Similarly, a seemingly detailed caption (right) can apply to a large number of images from the dataset, making it easy to 'guess' based on shallow correlations. It appears as though the captioning system can identify objects ('bus', 'building', 'people'), spatial relationships ('next to', 'on'), and activities ('walking'). However, it is entirely possible for the captioning system to have 'guessed' the caption by detecting one of the objects in the caption, e.g., a 'bus', or even a common latent object such as 'traffic light'.
VQA: Image-blind algorithms that only see questions often perform surprisingly well (Kafle and Kanan, 2016; Yang et al., 2016), sometimes even surpassing algorithms that have access to both the image and the question (Kafle and Kanan, 2016). Algorithms also often provide inconsistent answers due to irrelevant changes in phrasing (Kafle and Kanan, 2017b; Ray et al., 2018), signifying a lack of question comprehension. When a VQA dataset is divided into different question-types, algorithms performed well only on easier tasks that CNNs alone excel at, e.g., detecting whether an object is present, but they performed poorly for complex questions that require bi-modal reasoning (Kafle and Kanan, 2017a). This discrepancy in accuracy is not clearly conveyed when simpler accuracy metrics are used. In a multi-faceted study, Agrawal et al. (2016) showed several quirks of VQA, including how VQA algorithms converge to an answer without even processing one half of the question and show an inclination to fixate on the same answer when the same question is repeated for a different image. Similarly, Goyal et al. (2017) showed that VQA algorithm performance deteriorates when tested on pairs of images that have opposite answers. As shown in Fig. 2, VQA systems can actually amplify bias.
Image captioning: In image captioning, simply predicting the caption of the training image with the most similar visual features yields relatively high scores using automatic evaluation metrics (Devlin et al., 2015). Captioning algorithms exploit multi-modal distributional similarity (Madhyastha et al., 2018), and generate captions similar to images in the training set, rather than learning concrete representations of objects and their properties.
Embodied QA and visual dialog: EmbodiedQA ostensibly requires navigation, visual information collection, and reasoning, but Anand et al. (2018) showed that vision-blind algorithms perform competitively. Similarly, visual dialog should require understanding both visual content and dialog history (Massiceti et al., 2018), but an extremely simple method produces near state-of-the-art performance for visual dialog, despite ignoring both visual and dialog information (Massiceti et al., 2018).
Scene graph parsing: Predicting scene graphs requires understanding object properties and their relationships to each other. However, Zellers et al. (2018) showed that objects alone are highly indicative of their relationship labels. They further demonstrated that for a given object pair, simply guessing the most common relation for those objects in the training set yields improved results compared to state-of-the-art methods.
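The frequency baseline described above needs only the object labels. The following sketch is an illustrative reconstruction of the idea, not the cited authors' implementation; the triples are invented, and the function and variable names are hypothetical.

```python
from collections import Counter, defaultdict

def fit_frequency_baseline(train_triples):
    """Map each (subject, object) label pair to its most frequent relation
    among (subject, relation, object) training triples."""
    counts = defaultdict(Counter)
    for subject, relation, obj in train_triples:
        counts[(subject, obj)][relation] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}

def predict_relation(baseline, subject, obj, default=None):
    """Predict a relation from the labels alone, without any image features."""
    return baseline.get((subject, obj), default)

# Invented training triples.
train_triples = [
    ("man", "riding", "horse"),
    ("man", "riding", "horse"),
    ("man", "next to", "horse"),
    ("cup", "on", "table"),
    ("cup", "on", "table"),
    ("cup", "near", "table"),
]
baseline = fit_frequency_baseline(train_triples)
```

A benchmark on which this lookup table is competitive is measuring label co-occurrence statistics more than it is measuring visual relationship understanding.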

RER: In a multi-faceted study of RER, Cirik et al. (2018) demonstrated multiple alarming issues. The first set of experiments involved tampering with the input referring expression to examine if algorithms properly used the text information. Tampering should reduce performance if algorithms make proper use of text to predict the correct answers. However, their results were relatively unaffected when the words were shuffled and nouns/adjectives were removed from the referring expressions. This signifies that it is possible for algorithms to get high scores without explicitly learning to model the objects, attributes, and their relationships. The second set of experiments demonstrated that it is possible to predict correct candidate boxes for over 86% of referring expressions without ever feeding the referring expression to the system. This demonstrates that algorithms can exploit regularities and biases in these datasets to achieve good performance, making these datasets a poor test of the RER task.
Some recent works have attempted to create more challenging datasets that probe the ability to properly ground vision and language beyond shallow correlations. In FOIL (Shekhar et al., 2017), a single noun from a caption is replaced with another, making the caption invalid. Here, the algorithm must determine if the caption has been FOILed and then detect the FOIL word and replace it with a correct word. Similarly, in NLVR (Suhr et al., 2017), an algorithm is tasked with finding whether a description applies to a pair of images. Both tasks are extremely difficult for modern V&L algorithms, with the best performing system on NLVR limited to around 55% (random guess is 50%), well short of the human performance of over 95%. These benchmarks may provide a challenging test bed that can spur the development of next-generation V&L algorithms. However, they remain limited in scope, with FOIL being restricted to noun replacement for a small number of categories (less than 100 categories from the COCO dataset). Hence, it does not test understanding of attributes or relationships between objects. Similarly, NLVR is difficult, but it lacks additional annotations to aid in the measurement of why a model fails, or eventually, why it succeeds.
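The text-tampering experiments described for RER can be approximated with a small perturbation harness. Everything in this sketch is a simplifying assumption: the model is a hypothetical callable that maps an expression to a box index (a real RER model would also take the image), the word list to drop is supplied by hand rather than by a part-of-speech tagger, and the examples are invented.

```python
import random

def shuffle_words(expression, seed=0):
    """Return the expression with its words in a seeded random order."""
    words = expression.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def drop_words(expression, words_to_drop):
    """Remove every occurrence of the given words (case-insensitive)."""
    drop = {w.lower() for w in words_to_drop}
    return " ".join(w for w in expression.split() if w.lower() not in drop)

def accuracy(model, examples, transform=lambda expression: expression):
    """Fraction of examples where the model picks the annotated box index."""
    hits = sum(model(transform(expr)) == box for expr, box in examples)
    return hits / len(examples)

# Hypothetical model that ignores the text and always picks the first box.
text_blind_model = lambda expression: 0

# Invented (expression, correct box index) pairs.
examples = [
    ("the man in the red shirt", 0),
    ("the dog on the left", 0),
    ("the cup next to the laptop", 1),
]

clean_acc = accuracy(text_blind_model, examples)
shuffled_acc = accuracy(text_blind_model, examples, transform=shuffle_words)
```

A model whose accuracy is unchanged under shuffling or word removal, as in this degenerate case, cannot be relying on the structure of the referring expression.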

Lack of interpretability and confidence
Human beings can provide explanations, point to evidence, and convey confidence in their predictions. They also have an ability to say 'I do not know' when the information provided is insufficient. However, almost none of the existing V&L algorithms are equipped with these abilities, making the models highly uninterpretable and unreliable.
In VQA, algorithms provide high-confidence answers even when the question is nonsensical for a given image, e.g., 'What color is the horse?' for an image that does not contain a horse can yield 'brown' with a very high confidence. Very limited work has been done in V&L to assess a system's ability to deal with lack of information. While Kafle and Kanan (2017a) proposed a class of questions called 'absurd' questions to test a system's ability to determine if a question was unanswerable, they were limited in scope to simple detection questions. More complex forms of absurdity are yet to be tested.
Because VQA and captioning do not explicitly require or test for proper grounding or pointing to evidence, the predictions made by these algorithms remain uninterpretable. A commonly practiced remedy is to include visualization of attention maps for attention-based methods, or to use post-prediction visualization methods such as Grad-CAM (Selvaraju et al., 2017). However, these visualizations shed little light on whether the models have 'attended' to the right image regions. First, most V&L datasets do not contain attention maps that can be compared to the predicted attention maps; therefore, it is difficult to gauge the prediction quality. Second, even if such data were available, it is not clear what image regions the model should be looking at. Even for well-defined tasks such as VQA, answers to questions like 'Is it sunny?' can be inferred using multiple image regions. Indeed, inclusion of attention maps does not make a model more predictable for human observers (Chandrasekaran et al., 2018), and attention-based models and humans do not look at the same image regions (Das et al., 2016). This suggests attention maps are an unreliable means of conveying interpretable predictions.
Several works propose the use of textual explanations to improve interpretability (Hendricks et al., 2016; Li et al., 2018). Li et al. (2018) collected text explanations in conjunction with standard VQA pairs, and a model must predict both the correct answer and the explanation. However, learning to predict explanations can suffer from many of the same problems faced by image captioning: evaluation is difficult and there can be multiple valid explanations. Currently, there is no reliable evidence that such explanations actually make the model more interpretable, but there is some evidence to the contrary (Chandrasekaran et al., 2018).
Modular and compositional approaches attempt to reveal greater insight by incorporating interpretability directly into the design of the network (Hu et al., 2017; Johnson et al., 2018, 2017b). However, these algorithms are primarily tested on simpler, synthetically constructed datasets that lack the diversity of natural images and language. The exceptions that are tested on natural images rely on hand-crafted semantic parsers to pre-process the questions (Hu et al., 2017), which often over-simplify the complexity of the questions (Kafle and Kanan, 2017b).

Lack of compositional concept learning
It is hard to verify that a model has understood concepts. One method to do this is to use it in a novel setting or in a previously unseen combination. For example, most humans would not have a problem recognizing a purple-colored dog, even if they have never seen one before, given that they are familiar with the concepts of purple and dog. Measuring such compositional reasoning could be crucial in determining whether a V&L system is a 'horse.' This idea has received little attention, with few works devoted to it (Johnson et al., 2017a; Agrawal et al., 2017). Ideally, an algorithm should not show any decline in performance for novel concept combinations. However, even for CLEVR, which is composed of basic geometric shapes and colors, most algorithms show a large drop in performance for novel shape-color combinations (Johnson et al., 2017a). For natural images, the drop in performance is even higher (Agrawal et al., 2017).
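The idea of testing on unseen concept combinations can be illustrated with a minimal sketch of a compositional train/test split. This is not the procedure used by any of the cited datasets; the sample format (dictionaries with hypothetical "attribute" and "object" keys) and the function names are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (attribute, object), e.g. ("purple", "dog")


def compositional_split(
    samples: List[Dict[str, str]], held_out: Iterable[Pair]
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Put every sample whose (attribute, object) pair is held out into test.

    All remaining samples go to train, so held-out combinations are never
    seen during training.
    """
    held = set(held_out)
    train = [s for s in samples if (s["attribute"], s["object"]) not in held]
    test = [s for s in samples if (s["attribute"], s["object"]) in held]
    return train, test


def concepts_covered(train: List[Dict[str, str]], held_out: Iterable[Pair]) -> bool:
    """Check that each held-out attribute and object still occurs in train.

    Without this, a test failure could reflect an unknown concept rather
    than a failure to compose known ones.
    """
    attrs = {s["attribute"] for s in train}
    objs = {s["object"] for s in train}
    return all(a in attrs and o in objs for a, o in held_out)
```

With such a split, a drop in test accuracy relative to a random split of the same data can be attributed to the novel combinations rather than to unfamiliar individual concepts.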

ADDRESSING SHORTCOMINGS
In this survey, we have compiled a wide range of shortcomings and challenges faced by modern V&L research based on the datasets and evaluation of tasks. One of the major issues stems from the difficulty in evaluating if an algorithm is actually solving the task, which is confounded by hidden perverse incentives in modern datasets that cause algorithms to exploit unwanted correlations. Lamentably, most proposed tasks do not have built-in safeguards against this or even an ability to measure it. Many post-hoc studies have shed light on this problem. However, they are often limited in scope, require collecting additional data (Shekhar et al., 2017), or require modifying 'standard' datasets (Agrawal et al., 2017, 2018; Kafle and Kanan, 2016). We outline prospects for future research in V&L, with an emphasis on discussing the characteristics of future V&L tasks and evaluation suites that are better aligned with the goals of a visual Turing test. Table 1 presents a short summary of challenges and potential solutions in V&L research.

Table 1. A summary of challenges and potential solutions for V&L problems.

Shortcoming/Challenge: Evaluation metrics are a poor measure of the competence of algorithms due to dataset bias.
Potential Solutions:
• Use metrics that account for dataset biases.
• Carefully measure and report performance on individual abilities.

Shortcoming/Challenge: It is hard to tell if algorithms are 'right for the right reasons.' They can perform well on benchmarks without actually solving the problem.
Potential Solutions:
• Test the algorithms by withholding varying degrees of task-critical information from them to measure if they understand concepts.
• Measure task understanding by asking the model to do the same task in dissimilar contexts and with alternative phrasing.
• Develop defense mechanisms against 'accidentally' reaching the correct solutions.

Shortcoming/Challenge: Trained systems are fragile and easily break when humans use them.
Potential Solutions:
• Incorporate prediction confidence into evaluation.
• Allow systems to output 'I don't know.'

Shortcoming/Challenge: V&L systems are one-trick ponies, rarely able to generalize to more than one task.
Potential Solutions:
• Create a V&L decathlon that tests numerous V&L tasks. Assess positive transfer among tasks.

New natural image tasks that measure core abilities
Existing V&L evaluation schemes for natural datasets ignore bias, making it possible for algorithms to excel on standard benchmarks without demonstrating proper understanding. We argue that a carefully designed suite of tasks could be used to overcome this obstacle. We propose some possible approaches to improve evaluation by tightly controlling the evaluation of core abilities and ensuring that evaluation compensates for bias.
Programmatically created datasets, e.g., CLEVR for VQA, can enable fine-grained evaluation of specific components by using simple synthetically created scenes. We could create a similar dataset for natural images by composing scenes of natural objects (see Fig. 4). This could be used to test higher levels of visual knowledge, which is not possible in synthetic environments. This approach could be used to examine reasoning and bias-resistance by placing objects in unknown combinations and then asking questions with long reasoning chains, novel concept compositions, and distinct train/test distributions.
Current benchmarks cannot reliably ascertain whether an algorithm has learned to represent objects and their attributes properly, and it is often easy to produce a correct response by 'guessing' prominent objects in the scene (Cirik et al., 2018). To examine whether an algorithm demonstrates concept understanding, we envision a dataset containing simple queries, where given a set of objects and/or attributes as queries, the algorithm needs to highlight all objects that satisfy all of the conditions in the set, e.g., for query={red}, the algorithm must detect all red objects, and for {red, car}, it must detect all red cars. However, all queries would have distractors in the scene, e.g., {red, car} is only used when the scene also contains 1) cars that are non-red, 2) objects other than cars, or 3) other non-red objects. By abandoning the complexity of natural language, this dataset allows for the creation of queries that are hard to 'guess' without learning proper object and attribute representations. Since the chance of a random guess being successful is inversely proportional to the number of distractors, the scoring can also be made proportional to additional information over a random guess.
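One possible reading of scoring "proportional to additional information over a random guess" is sketched below. It is an illustrative simplification, not a specification of the envisioned dataset: the model is assumed to pick a single object rather than highlight all matches, scenes are assumed to be lists of label sets, and all function names are hypothetical.

```python
import math
from typing import FrozenSet, List

Labels = FrozenSet[str]  # labels attached to one object, e.g. {"red", "car"}


def matches(obj: Labels, query: Labels) -> bool:
    """An object satisfies a query if it carries every label in the query."""
    return query <= obj


def has_distractors(scene: List[Labels], query: Labels) -> bool:
    """True if the scene has at least one match and at least one non-match."""
    hits = sum(matches(obj, query) for obj in scene)
    return 0 < hits < len(scene)


def information_score(scene: List[Labels], query: Labels, predicted_index: int) -> float:
    """Bits gained over picking one object uniformly at random.

    A wrong pick scores 0. A correct pick scores -log2(p), where p is the
    chance that a uniform random pick lands on a matching object, so more
    distractors make a correct answer worth more.
    """
    hits = [i for i, obj in enumerate(scene) if matches(obj, query)]
    if predicted_index not in hits:
        return 0.0
    p_chance = len(hits) / len(scene)
    return -math.log2(p_chance)
```

Under this assumption, a correct pick among four objects with one match earns 2 bits, while the same pick in a scene with no distractors earns nothing, which is the behavior the text asks for.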
We hope that carefully designed test suites that measure core abilities of V&L systems in a controlled manner will be developed. Such suites would serve as a necessary adjunct to more open-ended benchmarks and would help dispel the 'horse' in V&L.

Better evaluation of V&L systems
V&L needs better evaluation metrics for standard benchmarks. Here, we will outline some of the key points future evaluation metrics should account for:
• Evaluation should test individual skills to account for dataset biases (Kafle and Kanan, 2017a) and measure performance relative to 'shallow' guessing (Agrawal et al., 2018; Kafle and Kanan, 2017b; Cirik et al., 2018).
• Evaluation should include built-in tests for 'bad' or 'absurd' queries (Kafle and Kanan, 2017a; Cirik et al., 2018).
• Test sets should contain a large number of compositionally novel instances that can be inferred from training but not directly matched to a training instance (Devlin et al., 2015; Johnson et al., 2017a).
• Evaluation should keep the 'triviality' of the task in mind when assigning a score to a task, e.g., if there is only a single cat then 'Is there a black cat sitting between the sofa and the table?' reduces to 'Is there a cat?' for that image (Agrawal et al., 2016; Cirik et al., 2018).
• Robustness to semantically identical queries must be assessed.
• Evaluation should be done on questions with unambiguous answers; if humans strongly disagree, it is likely not a good question for a visual Turing test.
We believe future evaluation should probe algorithms from multiple angles such that a single score is derived from a suite of sub-scores that test different capabilities. The score could be divided into underlying core abilities (e.g., counting, object detection, fine-grained recognition, etc.), and also higher-level functions (e.g., consistency, predictability, compositionality, resistance to bias, etc.).
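A minimal sketch of how a single score might be derived from per-skill sub-scores is given below. The specific choices here (rescaling each skill's accuracy against a 'shallow' guessing baseline and taking an unweighted mean) are assumptions for illustration, not a metric proposed in the text, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_skill_accuracy(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy separately for each skill.

    records: (skill_name, is_correct) pairs, one per test instance.
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for skill, is_correct in records:
        totals[skill] += 1
        hits[skill] += int(is_correct)
    return {skill: hits[skill] / totals[skill] for skill in totals}


def above_baseline(acc: Dict[str, float], baseline: Dict[str, float]) -> Dict[str, float]:
    """Rescale accuracy so the guessing baseline maps to 0 and perfect to 1.

    Scores below the baseline are clipped to 0. Each baseline is assumed
    to be strictly less than 1.
    """
    return {
        skill: max(0.0, (acc[skill] - baseline[skill]) / (1.0 - baseline[skill]))
        for skill in acc
    }


def suite_score(sub_scores: Dict[str, float]) -> float:
    """Unweighted mean over skills, so frequent skills do not dominate."""
    return sum(sub_scores.values()) / len(sub_scores)
```

Averaging over skills rather than over instances keeps an abundant, easy question type from masking poor performance on a rare one; the sub-scores themselves would still be reported alongside the aggregate.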

V&L decathlon
Most V&L tasks seek to measure language-grounded visual understanding. Therefore, it is not unreasonable to expect an algorithm designed for one benchmark to readily transfer to other V&L tasks with only minor modifications. However, most algorithms are tested on a single task (Kafle and Kanan, 2016; Yu et al., 2018; Yang et al., 2016), with very few exceptions (Anderson et al., 2018; Kim et al., 2018; Shrestha et al., 2019). Even within the same task, algorithms are almost never evaluated on multiple datasets to assess different skills, which makes it difficult to study the true capabilities of the algorithms.
To measure holistic progress in V&L research, we believe it is imperative to create a large-scale V&L decathlon benchmark. Work in a similar spirit has recently been proposed as DecaNLP (McCann et al., 2018), where many constituent NLP tasks are represented in a single benchmark. In DecaNLP, all constituent tasks are represented as question-answering for an easier input-output mapping. To be effective, a V&L decathlon benchmark should not only contain different sub-tasks and diagnostic information but also entirely different input-output paradigms. We envision models developed for a V&L decathlon to have a central V&L core and multiple input-output nodes that the model selects based on the input. Both training and test splits of the decathlon should consist of many different input-output mappings representing distinct V&L tasks. For example, the same image could have a VQA question 'What color is the cat?', a pointing question 'What is the color of "that" object?', where "that" is a bounding box pointing to an object, and an RER 'Show me the red cat.' Integration of different tasks encourages development of more capable V&L models. Finally, the test set should contain unanswerable queries (Kafle and Kanan, 2017a; Cirik et al., 2018), compositionally novel instances (Johnson et al., 2017b; Agrawal et al., 2017), pairs of instances with subtle differences (Goyal et al., 2017), equivalent queries with the same ground truth but different phrasings, and many other quirks that allow us to peer deeper into the reliability and true capacity of the models. These instances can then be used to produce a suite of metrics as discussed earlier.
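As a rough illustration of how instances from different tasks and groups of equivalent phrasings might be represented and scored, consider the sketch below. The `Instance` fields, the task labels, and the all-or-nothing consistency measure are assumptions made here; the text does not prescribe a data format or a metric.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Instance:
    image_id: str
    task: str    # hypothetical task label, e.g. "vqa", "pointing", "rer"
    query: str
    group: str   # shared by queries that are rephrasings of each other
    answer: str  # ground truth, serialized as a string for simplicity


def accuracy(instances: List[Instance], predict: Callable[[Instance], str]) -> float:
    """Plain per-instance accuracy."""
    return sum(predict(inst) == inst.answer for inst in instances) / len(instances)


def group_consistency(instances: List[Instance], predict: Callable[[Instance], str]) -> float:
    """Fraction of groups in which every phrasing is answered correctly.

    A model that answers one phrasing but fails an equivalent one gets no
    credit for that group.
    """
    all_correct: Dict[str, bool] = {}
    for inst in instances:
        ok = predict(inst) == inst.answer
        all_correct[inst.group] = all_correct.get(inst.group, True) and ok
    return sum(all_correct.values()) / len(all_correct)
```

Reporting the group-level number next to plain accuracy exposes models that succeed on one phrasing of a query but fail on an equivalent one.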

CONCLUSION
While V&L work originally seemed incredibly difficult, progress on benchmarks rapidly made it appear that systems would soon rival humans. A wide range of studies have shown that much of this progress may be misleading due to flaws in standard evaluation methods. While this should serve as a cautionary story for future research in other areas, V&L research has a bright future. While the vast majority of research is on creating new algorithms, we argue that constructing good evaluation techniques is just as critical, if not more so, for progress to continue. V&L has the potential to be a visual Turing test for assessing progress in AI, but this potential can only be achieved through the monumental task of developing strong benchmarks that evaluate many capabilities individually and thoroughly on rich real-world imagery to evaluate system competence. These systems can enable much richer interactions with computers and robots, but this demands that the systems are trustworthy and robust across scenarios.

Figure 1. Common tasks in vision and language research.

Figure 2. Distribution of answers for questions starting with 'How many' in the train and test splits of the VQA-CP dataset (Agrawal et al., 2018), and test-set predictions of the state-of-the-art VQA model, BAN (Kim et al., 2018). In VQA-CP, the test distribution is intentionally made different from the train distribution to assess whether algorithms can perform under changing priors. The model not only fails to perform well under changing priors but also exhibits a bias-amplification effect, i.e., its predictions show an even stronger bias towards answers that are more common in the training set than the actual level of bias. Similar observations have been made for semantic role labeling (Zhao et al., 2017a).

Figure 4. Posters dataset can help test bias. In this example, both contextual and gender bias are tested by placing out-of-context poster-cut-outs. Snowboarding is generally correlated with gender 'male' and context 'snow' (Hendricks et al., 2018).