VQA - Notes on the Key Contributions of Top-Conference Visual Question Answering Papers from the Past Five Years

2014 A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input

Malinowski M, Fritz M. A multi-world approach to question answering about real-world scenes based on uncertain input[C]//Advances in neural information processing systems. 2014: 1682-1690.

2015 Are You Talking to a Machine Dataset and Methods for Multilingual Image Question

Gao H, Mao J, Zhou J, et al. Are you talking to a machine? dataset and methods for multilingual image question[C]//Advances in neural information processing systems. 2015: 2296-2304.

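1. An LSTM extracts the question representation;
2. A CNN extracts the visual representation of the image;
3. An LSTM stores the linguistic context of the answer;
4. A fusion component combines the first three and generates the answer.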

Malinowski M, Rohrbach M, Fritz M. Ask your neurons: A neural-based approach to answering questions about images[C]//Proceedings of the IEEE international conference on computer vision. 2015: 1-9.

2015 Exploring Models and Data for Image Question Answering

Ren M, Kiros R, Zemel R. Exploring models and data for image question answering[C]//Advances in neural information processing systems. 2015: 2953-2961.

2015 VisKE Visual Knowledge Extraction and Question Answering by Visual Verification of Relation Phrases

Sadeghi F, Kumar Divvala S K, Farhadi A. Viske: Visual knowledge extraction and question answering by visual verification of relation phrases[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 1456-1464.

2015 Visual Madlibs Fill in the Blank Description Generation and Question Answering

Yu L, Park E, Berg A C, et al. Visual madlibs: Fill in the blank description generation and question answering[C]//Proceedings of the ieee international conference on computer vision. 2015: 2461-2469.

Antol S, Agrawal A, Lu J, et al. Vqa: Visual question answering[C]//Proceedings of the IEEE international conference on computer vision. 2015: 2425-2433.

Kafle K, Kanan C. Answer-type prediction for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4976-4984.

2016 Ask Me Anything Free-Form Visual Question Answering Based on Knowledge from External Sources

Wu Q, Wang P, Shen C, et al. Ask me anything: Free-form visual question answering based on knowledge from external sources[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4622-4630.

2016 Hierarchical Question-Image Co-Attention for Visual Question Answering

Lu J, Yang J, Batra D, et al. Hierarchical question-image co-attention for visual question answering[C]//Advances In Neural Information Processing Systems. 2016: 289-297.

2016 Image Question Answering Using Convolutional Neural Network With Dynamic Parameter Prediction

Noh H, Hongsuck Seo P, Han B. Image question answering using convolutional neural network with dynamic parameter prediction[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 30-38.

2016 MovieQA Understanding Stories in Movies Through Question-Answering

Tapaswi M, Zhu Y, Stiefelhagen R, et al. Movieqa: Understanding stories in movies through question-answering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 4631-4640.

The MovieQA dataset is multiple-choice QA: the model picks the answer from 5 candidate options.

2016 Stacked Attention Networks for Image Question Answering

Yang Z, He X, Gao J, et al. Stacked attention networks for image question answering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 21-29.

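SANs encode the question as a semantic representation (a vector) and use it as a query to search the image for regions related to the answer. The paper argues that VQA requires multi-step reasoning, so it designs a multi-layer SAN that queries the image several times and infers the answer progressively.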

14×14 is the number of regions in the image and 512 is the dimension of the feature vector for each region.
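The multi-step querying described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the parameter names (`W_v`, `W_u`, `b`, `w_p`), the hidden size `k`, the random weights, and the choice to give the question vector the same dimension as the region features are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_hop(v, u, W_v, W_u, b, w_p):
    """One attention layer: v is (m, d) region features, u is the (d,) query."""
    h = np.tanh(v @ W_v + (u @ W_u + b))  # (m, k); the query is broadcast to every region
    p = softmax(h @ w_p)                   # (m,) attention distribution over regions
    v_att = p @ v                          # (d,) attention-weighted sum of region features
    return v_att + u, p                    # refined query for the next layer

def stacked_attention(v, u, layers):
    """Query the image once per layer, refining the query each time."""
    maps = []
    for params in layers:
        u, p = attention_hop(v, u, *params)
        maps.append(p)
    return u, maps

rng = np.random.default_rng(0)
m, d, k = 14 * 14, 512, 64                 # 196 regions, 512-d features, hypothetical hidden size
v = 0.1 * rng.standard_normal((m, d))
u0 = 0.1 * rng.standard_normal(d)
layers = [
    (0.1 * rng.standard_normal((d, k)), 0.1 * rng.standard_normal((d, k)),
     np.zeros(k), rng.standard_normal(k))
    for _ in range(2)                      # two attention layers
]
u_final, attention_maps = stacked_attention(v, u0, layers)
```

The refined query `u_final` would then feed an answer classifier; each entry of `attention_maps` is a distribution over the 196 regions.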

2016 Visual Question Answering with Question Representation Update (QRU)

Li R, Jia J. Visual question answering with question representation update (qru)[C]//Advances in Neural Information Processing Systems. 2016: 4655-4663.

2016 Visual7W Grounded Question Answering in Images

Zhu Y, Groth O, Bernstein M, et al. Visual7w: Grounded question answering in images[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 4995-5004.

The Visual7W dataset contains not only images, questions, and answers, but also annotations of the image regions in which the answers are grounded.

2016 Where to Look Focus Regions for Visual Question Answering

Shih K J, Singh S, Hoiem D. Where to look: Focus regions for visual question answering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 4613-4621.

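• Image region feature vector + text feature vector // here "+" denotes concatenation

• Per-region attention weights, used to take a weighted average of the N vectors in the green box on the right of the figure.

• The dot product followed by softmax is one pass of attention: it attends to image regions according to the text features. The region vectors and the text vector are projected into a common vector space.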
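The three bullets above can be sketched in NumPy as follows. This is an illustrative sketch only: the projection matrices `A` and `B`, all dimensions, and the random inputs are assumptions, not values from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def focus_regions(regions, text, A, B):
    """regions: (N, d_img) region features; text: (d_txt,) question/answer features."""
    n_regions = regions.shape[0]
    r = regions @ A                        # (N, c) regions projected into the common space
    t = text @ B                           # (c,)  text projected into the common space
    weights = softmax(r @ t)               # dot product + softmax: one weight per region
    # "+" in the notes is concatenation: pair every region vector with the text vector
    joint = np.concatenate([regions, np.tile(text, (n_regions, 1))], axis=1)
    return weights @ joint, weights        # weighted average of the N concatenated vectors

rng = np.random.default_rng(0)
n_regions, d_img, d_txt, d_common = 10, 8, 6, 4   # hypothetical sizes
regions = rng.standard_normal((n_regions, d_img))
text = rng.standard_normal(d_txt)
A = rng.standard_normal((d_img, d_common))
B = rng.standard_normal((d_txt, d_common))
attended, weights = focus_regions(regions, text, A, B)
```

Because the weights sum to one, the text half of `attended` is just the text vector itself; only the region half is actually reweighted by the attention.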

2016 Yin and Yang Balancing and Answering Binary Visual Questions

Zhang P, Goyal Y, Summers-Stay D, et al. Yin and yang: Balancing and answering binary visual questions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 5014-5022.

2017 A Dataset and Exploration of Models for Understanding Video Data Through Fill-In-The-Blank Question-Answering

Maharaj T, Ballas N, Rohrbach A, et al. A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6884-6893.

MovieFIB uses a Fill-In-the-Blank QA format.

2017 An Analysis of Visual Question Answering Algorithms

Kafle K, Kanan C. An analysis of visual question answering algorithms[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 1965-1973.

2017 An Empirical Evaluation of Visual Question Answering for Novel Objects

Ramakrishnan S K, Pal A, Sharma G, et al. An empirical evaluation of visual question answering for novel objects[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4392-4401.

2017 Are You Smarter Than a Sixth Grader Textbook Question Answering for Multimodal Machine Comprehension

Kembhavi A, Seo M, Schwenk D, et al. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4999-5007.

2017 Creativity Generating Diverse Questions Using Variational Autoencoders

Jain U, Zhang Z, Schwing A G. Creativity: Generating diverse questions using variational autoencoders[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6485-6494.

2017 End-To-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering

Yu Y, Ko H, Choi J, et al. End-to-end concept word detection for video captioning, retrieval, and question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 3165-3173.

2017 Explicit Knowledge-based Reasoning for Visual Question Answering

Wang P, Wu Q, Shen C, et al. Explicit knowledge-based reasoning for visual question answering[C]//Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 2017: 1290-1296.

2017 Graph-Structured Representations for Visual Question Answering

Teney D, Liu L, van den Hengel A. Graph-structured representations for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1-9.

2017 High-Order Attention Models for Visual Question Answering

Schwartz I, Schwing A, Hazan T. High-order attention models for visual question answering[C]//Advances in Neural Information Processing Systems. 2017: 3664-3674.

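1. Generally applicable: the attention mechanism can be applied to a wide range of tasks.
2. High-order correlation: it can learn high-order correlations between different data modalities. A k-th order correlation models the correlation among k modalities.

1. Data embedding;
2. Attention mechanism;
3. Decision making.

1. Unary potentials: $$\theta_V$$, $$\theta_Q$$, $$\theta_A$$ represent the importance of each element in the visual input, the question, and the answer.
2. Pairwise potentials: $$\theta_{V,Q}$$, $$\theta_{V,A}$$, $$\theta_{Q,A}$$ represent the correlation between two modalities.
3. Ternary potentials: $$\theta_{V,Q,A}$$ captures the dependencies among the three modalities.

1. MCB pooling (Multimodal Compact Bilinear Pooling): in the decision-making stage, the paper uses this bilinear pooling to pool the two modalities in the pairwise setting.
2. MCT pooling (Multimodal Compact Trilinear Pooling): in the decision-making stage, the paper uses this trilinear pooling to pool the data of all three modalities.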

2017 Knowledge Acquisition for Visual Question Answering via Iterative Querying

Zhu Y, Lim J J, Fei-Fei L. Knowledge acquisition for visual question answering via iterative querying[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1154-1163.

2017 Learning to Disambiguate by Asking Discriminative Questions

Li Y, Huang C, Tang X, et al. Learning to disambiguate by asking discriminative questions[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 3419-3428.

2017 Learning to Reason End-to-End Module Networks for Visual Question Answering

Hu R, Andreas J, Rohrbach M, et al. Learning to reason: End-to-end module networks for visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 804-813.

2017 Making the V in VQA Matter Elevating the Role of Image Understanding in Visual Question Answering

Goyal Y, Khot T, Summers-Stay D, et al. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6904-6913.

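VQA has been skewed by natural regularities and language factors in the data: trained VQA models answer according to statistical regularities in the data (regularities of the world and of language) and ignore the visual content.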

2017 MarioQA Answering Questions by Watching Gameplay Videos

Mun J, Hongsuck Seo P, Jung I, et al. Marioqa: Answering questions by watching gameplay videos[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 2867-2875.

2017 Multi-level Attention Networks for Visual Question Answering

Yu D, Fu J, Mei T, et al. Multi-level attention networks for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4709-4717.

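• $$v_{img}$$: the image representation obtained by encoding the mid-level CNN output as spatial embeddings (one representation per region) and applying visual attention;
• $$v_{c}$$: the image representation obtained by generating semantic concepts from the high-level CNN semantics and selecting among them with semantic attention.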

2017 Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering

Yu Z, Yu J, Fan J, et al. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering[C]//Proceedings of the IEEE international conference on computer vision. 2017: 1821-1830.

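Multi-modal Compact Bilinear pooling (MCB) takes the outer product of two feature vectors, and the resulting quadratic expansion produces a very high-dimensional feature vector. MLB mitigates this dimensionality problem with low-rank projection matrices.

Multi-modal Low-rank Bilinear Pooling (MLB): $z = MLB(x,y)=(U^Tx)\circ(V^Ty)$

• x is transformed into an o-dimensional vector by the matrix U;
• y is transformed into an o-dimensional vector by the matrix V;
• the two o-dimensional vectors are then multiplied element-wise, giving the o-dimensional vector z.

• $$x \in \mathbb{R}^m$$, $$y \in \mathbb{R}^n$$, $$W_i \in \mathbb{R}^{m \times n}$$

Multi-modal Factorized Bilinear pooling (MFB): $z_i = x^TU_iV_i^Ty = \sum^k_{d=1}x^Tu_dv_d^Ty = \mathbb{1}^T(U_i^Tx \circ V_i^Ty)$, or equivalently: $z = \mathrm{SumPooling}(\tilde{U}^Tx \circ \tilde{V}^Ty, k)$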
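The MLB and MFB formulas above translate almost directly into NumPy. The sketch below assumes that $\tilde{U}$ and $\tilde{V}$ stack the per-output factors $U_i$ and $V_i$ column-wise (k consecutive columns per output), and it uses small random matrices purely for illustration.

```python
import numpy as np

def mlb(x, y, U, V):
    """MLB: project x and y to o dimensions, then multiply element-wise."""
    return (U.T @ x) * (V.T @ y)

def mfb(x, y, U_tilde, V_tilde, k):
    """MFB: element-wise product in a k*o-dim space, then sum-pool windows of size k."""
    joint = (U_tilde.T @ x) * (V_tilde.T @ y)   # (k * o,)
    return joint.reshape(-1, k).sum(axis=1)      # (o,) non-overlapping sum pooling

rng = np.random.default_rng(0)
m, n, o, k = 6, 5, 4, 3                          # hypothetical sizes
x, y = rng.standard_normal(m), rng.standard_normal(n)
U_tilde = rng.standard_normal((m, k * o))
V_tilde = rng.standard_normal((n, k * o))

z = mfb(x, y, U_tilde, V_tilde, k)

# Cross-check against z_i = x^T U_i V_i^T y, with U_i the i-th block of k columns
z_explicit = np.array([
    x @ U_tilde[:, i * k:(i + 1) * k] @ V_tilde[:, i * k:(i + 1) * k].T @ y
    for i in range(o)
])
```

With `k = 1` the sum pooling is a no-op, so `mfb` reduces to `mlb`; in that sense MLB is the rank-1 special case of MFB.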

2017 Multimodal Learning and Reasoning for Visual Question Answering

Ilievski I, Feng J. Multimodal learning and reasoning for visual question answering[C]//Advances in Neural Information Processing Systems. 2017: 551-562.

Namely, the VQA problem can be solved by modeling the likelihood probability distribution $$p_{vqa}$$ which for each answer $$a$$ in the answer set $$\Omega$$ outputs the probability of being the correct answer, given a question $$Q$$ about an image $$I$$:

$\hat{a} = \mathop{\arg\max}_{a\in\Omega} p_{vqa}(a|Q,I;\theta)$

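• Rounded rectangles denote attention modules;
• Square-cornered rectangles denote classification modules;
• Trapezoids denote encoder units;
• The symbol $$\otimes$$ denotes the bilinear interaction model;
• The large trapezoid denotes a multi-layer perceptron, i.e. the final answer classification network.

ReasonNet processes the image and the question with several modules (attention modules and classification modules). The module outputs are encoded and passed through bilinear interaction, and the resulting vectors are concatenated into one long vector that the final answer classifier uses for classification.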

2017 MUTAN Multimodal Tucker Fusion for Visual Question Answering

Ben-Younes H, Cadene R, Cord M, et al. Mutan: Multimodal tucker fusion for visual question answering[C]//Proceedings of the IEEE international conference on computer vision. 2017: 2612-2620.

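Tucker decomposition: the method is based on a Tucker decomposition of the correlation tensor, which can represent full bilinear interactions while keeping the model size tractable.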
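The idea can be illustrated with a small NumPy sketch. It is not the MUTAN implementation: the factor names (`W_q`, `W_v`, `W_o`, `core`), the dimensions, and the omission of non-linearities and further structure on the core tensor are assumptions made to keep the example short. It shows that fusing through the Tucker factors gives the same output as the full bilinear tensor, with far fewer parameters.

```python
import numpy as np

def tucker_fusion(q, v, W_q, W_v, W_o, core):
    """Bilinear fusion of q and v through Tucker factors instead of a full 3-way tensor."""
    q_t = q @ W_q                                    # (t_q,) projected question
    v_t = v @ W_v                                    # (t_v,) projected image feature
    z = np.einsum('i,j,ijk->k', q_t, v_t, core)      # (t_o,) interaction via the core tensor
    return z @ W_o                                   # (n_answers,) output scores

rng = np.random.default_rng(0)
d_q, d_v, n_answers = 7, 9, 6                        # hypothetical input/output sizes
t_q, t_v, t_o = 3, 4, 5                              # hypothetical core sizes
q, v = rng.standard_normal(d_q), rng.standard_normal(d_v)
W_q = rng.standard_normal((d_q, t_q))
W_v = rng.standard_normal((d_v, t_v))
W_o = rng.standard_normal((t_o, n_answers))
core = rng.standard_normal((t_q, t_v, t_o))

y = tucker_fusion(q, v, W_q, W_v, W_o, core)

# The full tensor that the factors implicitly represent, and the same fusion done with it
full = np.einsum('ai,bj,ijk,kc->abc', W_q, W_v, core, W_o)   # (d_q, d_v, n_answers)
y_full = np.einsum('a,b,abc->c', q, v, full)

n_full = full.size
n_factored = W_q.size + W_v.size + core.size + W_o.size
```

Even at these toy sizes the factored form needs fewer parameters than the full tensor, and the gap widens quickly as the input and answer dimensions grow.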

2017 Structured Attentions for Visual Question Answering

Zhu C, Zhao Y, Huang S, et al. Structured attentions for visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 1291-1300.

2017 TGIF-QA Toward Spatio-Temporal Reasoning in Visual Question Answering

Jang Y, Song Y, Yu Y, et al. Tgif-qa: Toward spatio-temporal reasoning in visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2758-2766.

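1. Proposes three new tasks for video VQA that require spatio-temporal reasoning from the video to the answer:
   1. Repetition count: answer how many times an action occurs;
   2. Repeating action: answer which action is repeated in the video;
   3. State transition: answer questions about transitions of state, e.g. facial expressions, actions, places, and object properties.
2. Introduces a large-scale video VQA dataset, TGIF-QA.
3. Proposes a dual-LSTM based approach with both spatial and temporal attention.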

2017 The VQA-Machine Learning How to Use Existing Vision Algorithms to Answer New Questions

Wang P, Wu Q, Shen C, et al. The vqa-machine: Learning how to use existing vision algorithms to answer new questions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1173-1182.

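• Model inputs: the question, visual facts, and the image;

• The question is represented by a Hierarchical Question Encoding with three levels: word, phrase, and sentence;
• Visual facts are represented as triples (subject, relation, object);
• The image is divided into regions, over which region attention is computed.
• The three inputs are weighted by co-attention at each of the three question-encoding levels ($$Q^w$$, $$Q^P$$, $$Q^q$$).

• An MLP classifier predicts the answer from the co-attention-weighted features.

• The model ranks the input visual facts, and the ranking is used to generate the reasons for its answer.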

2017 Video Question Answering via Hierarchical Spatio-Temporal Attention Networks

Zhao Z, Yang Q, Cai D, et al. Video question answering via hierarchical spatio-temporal attention networks[C]//Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 2017: 3518-3524.

2017 VQS Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

Gan C, Li Y, Li H, et al. Vqs: Linking segmentations to questions and answers for supervised attention in vqa and question-focused semantic segmentation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 1811-1820.

2017 What's in a Question Using Visual Questions as a Form of Supervision

Ganju S, Russakovsky O, Gupta A. What's in a question: Using visual questions as a form of supervision[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 241-250.

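iBOWIMG-2x model. The input has three parts: 1. visual image features; 2. the text embedding of the target question; 3. the text embedding of the other questions, concatenated together. // The other questions about the image are added as a weak supervision signal.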

2018 Chain of Reasoning for Visual Question Answering

Wu C, Liu J, Wang X, et al. Chain of reasoning for visual question answering[C]//Advances in Neural Information Processing Systems. 2018: 275-285.

The relational reasoning from $$O^{(t)}$$ to $$R^{(t)}$$ can be expressed as: $G_l = \sigma(relu(QW_{l_1})W_{l_2}), \\ G_r = \sigma(relu(QW_{r_1})W_{r_2}), \\ R_{ij}^{(t)} = (O_i^{(t)} \odot G_l) \oplus (O_j^{(1)} \odot G_r),$

• The $$m$$ objects in $$O^{(t)}$$ interact, under the guidance of the question $$Q$$, with the $$m$$ objects in the initial object set $$O^{(1)}$$.

The object refining from $$R^{(t)}$$ to $$O^{(t+1)}$$ can be expressed as: $O_j^{(t+1)} = \sum^m_{i=1}\alpha_i^{(t)}R_{ij}^{(t)}$
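One step of the chain (relational reasoning followed by object refining) can be sketched in NumPy as below. Several things here are assumptions rather than details from the notes: $\oplus$ is treated as element-wise addition, the attention weights `alpha` are stand-in values from a softmax over random scores, and all weight shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def reasoning_step(O_t, O_1, Q, alpha, W_l1, W_l2, W_r1, W_r2):
    """O_t, O_1: (m, d) object sets; Q: (d_q,) question; alpha: (m,) attention weights."""
    G_l = sigmoid(relu(Q @ W_l1) @ W_l2)        # (d,) question-conditioned gate, left side
    G_r = sigmoid(relu(Q @ W_r1) @ W_r2)        # (d,) question-conditioned gate, right side
    left = O_t * G_l                             # gated current objects
    right = O_1 * G_r                            # gated initial objects
    # R[i, j] = left[i] (+) right[j]; (+) assumed to be element-wise addition
    R = left[:, None, :] + right[None, :, :]     # (m, m, d)
    O_next = np.einsum('i,ijd->jd', alpha, R)    # O_next[j] = sum_i alpha[i] * R[i, j]
    return R, O_next

rng = np.random.default_rng(0)
m, d, d_q, h = 5, 8, 6, 7                        # hypothetical sizes
O_1 = rng.standard_normal((m, d))
O_t = rng.standard_normal((m, d))
Q = rng.standard_normal(d_q)
scores = rng.standard_normal(m)
alpha = np.exp(scores) / np.exp(scores).sum()    # stand-in attention weights
W_l1, W_r1 = rng.standard_normal((d_q, h)), rng.standard_normal((d_q, h))
W_l2, W_r2 = rng.standard_normal((h, d)), rng.standard_normal((h, d))

R, O_next = reasoning_step(O_t, O_1, Q, alpha, W_l1, W_l2, W_r1, W_r2)
```

Repeating this step with `O_next` as the new `O_t` gives the chain: every compound object is rebuilt from relations between the current objects and the initial ones.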

Chao W L, Hu H, Sha F. Cross-dataset adaptation for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5716-5725.

2018 Customized Image Narrative Generation via Interactive Visual Question Generation and Answering

Shin A, Ushiku Y, Harada T. Customized Image Narrative Generation via Interactive Visual Question Generation and Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 8925-8933.

2018 Differential Attention for Visual Question Answering

Patro B, Namboodiri V P. Differential attention for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7680-7688.

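1. Obtain a reference attention embedding from the input image and question;
2. Using this reference attention embedding, retrieve samples from the database, taking near samples as supporting exemplars and far samples as opposing exemplars;
3. The supporting and opposing exemplars are used to compute a differential attention vector;
4. A differential attention network (DAN) improves the attention, while a differential context network yields differential context features; both raise the correlation between the model's attention and human attention.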

2018 Don't Just Assume; Look and Answer Overcoming Priors for Visual Question Answering

Agrawal A, Batra D, Parikh D, et al. Don't just assume; look and answer: Overcoming priors for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 4971-4980.

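1. If the question is not a "yes/no" question, ACP is active and CE is not. The concept cluster that ACP predicts from the question is fed into AP, which outputs a classification over 998 predefined answer words.
2. If the question is a "yes/no" question, CE is active and ACP is not. The outputs of VCC and CE are fed into VV, which outputs a Yes/No classification.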

2018 DVQA Understanding Data Visualizations via Question Answering

Kafle K, Price B, Cohen S, et al. DVQA: Understanding data visualizations via question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5648-5656.

1. Classification sub-network: generates generic answers;
2. OCR sub-network: generates chart-specific answers.

Das A, Datta S, Gkioxari G, et al. Embodied question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018: 2054-2063.

The embodiment hypothesis is the idea that intelligence emerges in the interaction of an agent with an environment and as a result of sensorimotor activity.

-Smith and Gasser, “The development of embodied cognition: six lessons from babies.,” Artificial life, vol. 11, no. 1-2, 2005.

2018 Focal Visual-Text Attention for Visual Question Answering

Liang J, Jiang L, Cao L, et al. Focal visual-text attention for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6135-6143.

2018 FVQA Fact-Based Visual Question Answering

Wang P, Wu Q, Shen C, et al. Fvqa: Fact-based visual question answering[J]. IEEE transactions on pattern analysis and machine intelligence, 2018, 40(10): 2413-2427.

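• DBpedia: structured knowledge extracted from Wikipedia through crowdsourcing. The paper uses the category relations of visual concepts (concepts are linked to their categories and super-categories based on the SKOS Vocabulary).
• ConceptNet: generated automatically from sentences in the Open Mind Common Sense (OMCS) project. It consists of several kinds of commonsense relations, such as UsedFor, CreatedBy, and IsA. The paper uses 11 common relations from it.
• WebChild: extracted automatically from the Web; an overlooked repository of commonsense facts that covers comparative relations such as Faster, Bigger, and Heavier.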

2018 Improved Fusion of Visual and Language Representations by Dense Symmetric Co-attention for Visual Question Answering

Nguyen D K, Okatani T. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6087-6096.

2018 IQA Visual Question Answering in Interactive Environments

Gordon D, Kembhavi A, Rastegari M, et al. Iqa: Visual question answering in interactive environments[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 4089-4098.

2018 iVQA Inverse Visual Question Answering

Liu F, Xiang T, Hospedales T M, et al. iVQA: Inverse visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 8611-8619.

Hu H, Chao W L, Sha F. Learning answer embeddings for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5428-5436.

1. The embedding of the image and the question;

2. The embedding of the answer.

Misra I, Girshick R, Fergus R, et al. Learning by asking questions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 11-20.

2018 Learning Conditioned Graph Structures for Interpretable Visual Question Answering

Norcliffe-Brown W, Vafeias S, Parisot S. Learning conditioned graph structures for interpretable visual question answering[C]//Advances in Neural Information Processing Systems. 2018: 8334-8343.

1. The question encoding;

2. A set of object bounding boxes and their object image feature vectors;

$e_n = F([v_n||q]), n = 1,2,...,N$

$N(i) = topm(a_i)$
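The two formulas above can be sketched in NumPy as follows. The notes do not say how the adjacency rows $a_i$ are obtained from the joint embeddings $e_n$, so this sketch assumes $A = EE^T$ (row $a_i$ holds the similarity of node $i$ to every node) and models $F$ as a single ReLU layer with a hypothetical weight matrix `W`.

```python
import numpy as np

def learn_graph(V, q, W, m):
    """V: (N, d_v) object features; q: (d_q,) question encoding; m: neighbourhood size."""
    n_objects = V.shape[0]
    # e_n = F([v_n || q]): concatenate the question onto every object feature
    joint = np.concatenate([V, np.tile(q, (n_objects, 1))], axis=1)
    E = np.maximum(joint @ W, 0.0)               # F as one ReLU layer (assumption)
    A = E @ E.T                                  # assumed adjacency: similarity of embeddings
    neighbours = np.argsort(-A, axis=1)[:, :m]   # N(i) = topm(a_i): m strongest edges per node
    return A, neighbours

rng = np.random.default_rng(0)
n_objects, d_v, d_q, d_e, m = 8, 6, 4, 5, 3      # hypothetical sizes
V = rng.standard_normal((n_objects, d_v))
q = rng.standard_normal(d_q)
W = rng.standard_normal((d_v + d_q, d_e))
A, neighbours = learn_graph(V, q, W, m)
```

Because the question is part of every joint embedding, the selected neighbourhoods change with the question, which is what makes the graph conditioned. Under the $EE^T$ assumption a node can pick itself as a neighbour, since the diagonal holds its squared norm.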

2018 Learning to Specialize with Knowledge Distillation for Visual Question Answering

Mun J, Lee K, Shin J, et al. Learning to specialize with knowledge distillation for visual question answering[C]//Advances in Neural Information Processing Systems. 2018: 8081-8091.

Visual Question Answering (VQA) is a notoriously challenging problem because it involves various heterogeneous tasks defined by questions within a unified framework. Learning specialized models for individual types of tasks is intuitively attracting but surprisingly difficult; it is not straightforward to outperform naïve independent ensemble approach. We present a principled algorithm to learn specialized models with knowledge distillation under a multiple choice learning (MCL) framework, where training examples are assigned dynamically to a subset of models for updating network parameters. The assigned and non-assigned models are learned to predict ground-truth answers and imitate their own base models before specialization, respectively. Our approach alleviates the limitation of data deficiency in existing MCL frameworks, and allows each model to learn its own specialized expertise without forgetting general knowledge. The proposed framework is model-agnostic and applicable to any tasks other than VQA, e.g., image classification with a large number of labels but few per-class examples, which is known to be difficult under existing MCL schemes. Our experimental results indeed demonstrate that our method outperforms other baselines for VQA and image classification.

2018 Learning Visual Knowledge Memory Networks for Visual Question Answering

Su Z, Zhu C, Dong Y, et al. Learning visual knowledge memory networks for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7736-7745.

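1. Apparent objective: can be answered directly from the image;
2. Indiscernible objective: visual recognition is unclear, so commonsense is needed as a constraint;
3. Invisible objective: cannot be answered from the visual content; requires induction or reasoning over external knowledge.

1. A mechanism for integrating visual content with knowledge facts. VKMN implements it by jointly embedding knowledge triples (subject, relation, target) and deep visual features into visual knowledge features.

2. A mechanism for handling multiple knowledge facts expanded from question-answer pairs. VKMN stores the joint embeddings in a memory network with a key-value structure so that multiple facts can be handled.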

2018 Motion-Appearance Co-Memory Networks for Video Question Answering

Gao J, Ge R, Chen K, et al. Motion-appearance co-memory networks for video question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6576-6585.

2018 Neural-Symbolic VQA Disentangling Reasoning from Vision and Language Understanding

Yi K, Wu J, Gan C, et al. Neural-symbolic vqa: Disentangling reasoning from vision and language understanding[C]//Advances in Neural Information Processing Systems. 2018: 1031-1042.

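• Image --neural network--> scene representation;
• Question --generation--> program trace;
• A symbolic program executor, working together with the neural parsers, executes the program to reason out the answer.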

2018 Out of the Box Reasoning with Graph Convolution Nets for Factual Visual Question Answering

Narasimhan M, Lazebnik S, Schwing A. Out of the box: Reasoning with graph convolution nets for factual visual question answering[C]//Advances in Neural Information Processing Systems. 2018: 2654-2665.

Ramakrishnan S, Agrawal A, Lee S. Overcoming language priors in visual question answering with adversarial regularization[C]//Advances in Neural Information Processing Systems. 2018: 1541-1551.

2018 Textbook Question Answering Under Instructor Guidance With Memory Networks

Li J, Su H, Zhu J, et al. Textbook question answering under instructor guidance with memory networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 3655-3663.

2018 Tips and Tricks for Visual Question Answering Learnings from the 2017 Challenge

Teney D, Anderson P, He X, et al. Tips and tricks for visual question answering: Learnings from the 2017 challenge[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 4223-4232.

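1. Use sigmoid outputs: allows multiple correct answers per question;
2. Use soft scores as ground-truth targets: casts the task as a regression of candidate-answer scores rather than a conventional classification;
3. Use gated tanh activations as the activation function of the non-linear layers;
4. Use image features from bottom-up attention: provides region-specific features instead of a grid over the conventional CNN feature maps;
5. Use pretrained representations of candidate answers to initialize the weights of the output layer;
6. Use large mini-batches and smart shuffling of the training data during stochastic gradient descent (SGD).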
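Items 1 to 3 can be illustrated with a short NumPy sketch. The gated tanh layer is written in its usual form (a tanh branch multiplied by a sigmoid gate), and the soft-score loss is a binary cross-entropy averaged over candidate answers. The weight names, sizes, and example scores are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_tanh(x, W, b, W_g, b_g):
    """Gated tanh layer: a tanh branch modulated element-wise by a sigmoid gate."""
    return np.tanh(x @ W + b) * sigmoid(x @ W_g + b_g)

def soft_score_loss(logits, soft_targets, eps=1e-12):
    """Binary cross-entropy against soft scores in [0, 1], one sigmoid output per answer."""
    p = sigmoid(logits)
    return -np.mean(soft_targets * np.log(p + eps)
                    + (1.0 - soft_targets) * np.log(1.0 - p + eps))

rng = np.random.default_rng(0)
d_in, d_out = 6, 4                                   # hypothetical layer sizes
x = rng.standard_normal(d_in)
W, W_g = rng.standard_normal((d_in, d_out)), rng.standard_normal((d_in, d_out))
b, b_g = np.zeros(d_out), np.zeros(d_out)
hidden = gated_tanh(x, W, b, W_g, b_g)

# Three candidate answers with soft scores: fully correct, partly correct, wrong
soft_targets = np.array([1.0, 0.3, 0.0])
agreeing_logits = np.array([4.0, -0.85, -4.0])       # roughly matches the soft scores
opposing_logits = -agreeing_logits
loss_good = soft_score_loss(agreeing_logits, soft_targets)
loss_bad = soft_score_loss(opposing_logits, soft_targets)
```

Since every answer has its own sigmoid, several answers can score high at once, which a softmax over answers would not allow.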

2018 Two can play this Game Visual Dialog with Discriminative Question Generation

Jain U, Lazebnik S, Schwing A G. Two can play this game: visual dialog with discriminative question generation and answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5754-5763.

2018 Visual Question Answering with Memory-Augmented Networks

Ma C, Shen C, Dick A, et al. Visual question answering with memory-augmented networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6975-6984.

Li Y, Duan N, Zhou B, et al. Visual question generation as dual task of visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6116-6124.

2018 Visual Question Reasoning on General Dependency Tree

Cao Q, Liang X, Li B, et al. Visual question reasoning on general dependency tree[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7249-7257.

2. A residual composition module: composes the evidence that has already been mined.

2018 VizWiz Grand Challenge Answering Visual Questions from Blind People

Gurari D, Li Q, Stangl A J, et al. Vizwiz grand challenge: Answering visual questions from blind people[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 3608-3617.

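VizWiz is a question-answering dataset built for blind people: the images and questions were taken and recorded by blind users with mobile phones, and each question comes with 10 crowdsourced answers.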

Shrestha R, Kafle K, Kanan C. Answer them all! toward universal visual question answering models[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 10472-10481.

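Current VQA research falls into two camps:

1. Work focused on VQA datasets that require understanding of real-world images;

2. Work focused on synthetic datasets that test reasoning ability.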

2019 Cycle-Consistency for Robust Visual Question Answering

Shah M, Chen X, Rohrbach M, et al. Cycle-consistency for robust visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 6649-6658.

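1. A new evaluation protocol;
2. A corresponding new dataset, VQA-Rephrasings.

The paper shows that current VQA models struggle to stay stable under linguistic variations of the question.

The VQA-Rephrasings dataset is derived from VQA v2.0. It covers 40k images, and each of the corresponding 40k questions was manually rewritten into 3 differently phrased questions.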

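1. The regenerated questions and answers should be consistent with the ground truth
1. Architecture details of the visual question generation module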

2019 Deep Modular Co-Attention Networks for Visual Question Answering

Yu Z, Yu J, Cui Y, et al. Deep Modular Co-Attention Networks for Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 6281-6290.

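1. Self-attention (SA) unit: models dense intra-modal interactions (word-to-word, region-to-region);

2. Guided-attention (GA) unit: models inter-modal interactions (word-to-region).

The Modular Co-Attention (MCA) layer is built by combining SA and GA units. MCA layers can be cascaded in depth, and several cascaded MCA layers form the deep MCAN model proposed in the paper. Experiments on the VQA-v2 dataset show that self-attention and guided attention are synergistic in co-attention learning.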
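A heavily simplified sketch of how SA and GA units compose into cascaded MCA layers is given below. It assumes single-head scaled dot-product attention with residual connections, and leaves out the learned projections, feed-forward sublayers, and layer normalization that a full implementation would have. The function names and sizes are hypothetical.

```python
import numpy as np

def softmax_rows(s):
    """Row-wise numerically stable softmax."""
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention (single head, no learned projections)."""
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    return softmax_rows(scores) @ values

def self_attention(X):
    """SA unit: intra-modal interaction, X attends to itself."""
    return X + attention(X, X, X)

def guided_attention(X, Y):
    """GA unit: inter-modal interaction, X attends to Y."""
    return X + attention(X, Y, Y)

def mca_layer(words, regions):
    """One MCA layer: SA on words, then SA and word-guided GA on regions."""
    words = self_attention(words)
    regions = self_attention(regions)
    regions = guided_attention(regions, words)
    return words, regions

rng = np.random.default_rng(0)
n_words, n_regions, d = 7, 12, 16                    # hypothetical sizes
words = rng.standard_normal((n_words, d))
regions = rng.standard_normal((n_regions, d))

out_words, out_regions = words, regions
for _ in range(3):                                   # cascade three MCA layers
    out_words, out_regions = mca_layer(out_words, out_regions)
```

Every unit keeps the shape of its first argument, which is what lets MCA layers be stacked to arbitrary depth.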

2019 Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering

Gao P, Jiang Z, You H, et al. Dynamic Fusion With Intra-and Inter-Modality Attention Flow for Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 6639-6648.

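• Stacking several DFAF blocks helps the model gradually attend to the important image regions, question words, and the latent alignments between them.
• Dynamic Intra-Modality Attention Flow module;
• A conditional gating vector, average-pooled from the question features, controls the information that flows between region features, so the attention mechanism focuses on question-relevant information flow.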

2019 Explicit Bias Discovery in Visual Question Answering Models

Manjunatha V, Saini N, Davis L S. Explicit Bias Discovery in Visual Question Answering Models[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 9562-9571.

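• A word in the paper's figure is misspelled; it should be "antecedent";
• VQA examples from the VQA dataset; the table shows how statistical biases affect the answers.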

2019 Generating Question Relevant Captions to Aid Visual Question Answering

Jialin Wu, Zeyuan Hu, Raymond J. Mooney. Generating Question Relevant Captions to Aid Visual Question Answering[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 3585–3594.

2019 Improving Visual Question Answering by Referring to Generated Paragraph Captions

Hyounghun Kim, Mohit Bansal. Improving Visual Question Answering by Referring to Generated Paragraph Captions[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 3606-3612.

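• The paragraph captioning module builds on the 2018 work of Melas-Kyriazi et al. and uses CIDEr as the reward for reinforcement learning.

• Early Fusion: this stage fuses the visual features and the paragraph caption with the object properties features.

• Late Fusion: this stage combines the logits output by each module into a single vector.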

2019 Information Maximizing Visual Question Generation

Krishna R, Bernstein M, Fei-Fei L. Information Maximizing Visual Question Generation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 2008-2018.

• The training procedure of the model
• The testing/inference procedure of the model

2019 Multi-grained Attention with Object-level Grounding for Visual Question Answering

Pingping Huang, Jianhui Huang, Yuqing Guo, Min Qiao, Yong Zhu. Multi-grained Attention with Object-level Grounding for Visual Question Answering[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 3595–3600.

• Word-Label Attention
• Word-Object Attention
• Sentence-Object Attention
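• The WL, WO, and SO attention weights are added together to give multi-grained attention weights over the object features. These weights are used to take a weighted average of the object features, which yields the attended object feature vector used as the visual representation. It is finally combined with the sentence embedding into a fused feature for VQA answer classification.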
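The fusion step in the last bullet can be sketched as below. The sketch assumes the three attention vectors are already computed, normalizes their sum so the result is a weighted average, and takes "combined" to mean concatenation; that last choice and the toy numbers are assumptions.

```python
import numpy as np

def multi_grained_fusion(object_feats, sentence_emb, wl, wo, so):
    """object_feats: (N, d) object features; wl, wo, so: (N,) attention weights."""
    weights = wl + wo + so                      # add the three granularities
    weights = weights / weights.sum()           # normalise so the result is a weighted average
    attended = weights @ object_feats           # (d,) attended object feature
    # Combine with the sentence embedding (concatenation is an assumption)
    return np.concatenate([attended, sentence_emb]), weights

object_feats = np.array([[1.0, 0.0],
                         [0.0, 1.0]])           # two toy objects
sentence_emb = np.array([1.0, 2.0, 3.0])
wl = np.array([0.5, 0.5])                       # Word-Label attention
wo = np.array([1.0, 0.0])                       # Word-Object attention
so = np.array([0.5, 0.5])                       # Sentence-Object attention

fused, weights = multi_grained_fusion(object_feats, sentence_emb, wl, wo, so)
```

In this toy case the first object collects twice the total weight of the second, so the attended feature leans towards it before the sentence embedding is appended.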

2019 MUREL Multimodal Relational Reasoning for Visual Question Answering

Cadene R, Ben-Younes H, Cord M, et al. Murel: Multimodal relational reasoning for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 1989-1998.

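1. MuRel cell: an atomic reasoning primitive that represents the interactions between the question and image regions with a rich vector representation, and models the relations between regions through pairwise combinations.
2. MuRel network: progressively refines the visual and question interactions, and defines visualization schemes better than attention maps alone.
• MuRel cell
• Inside the MuRel cell: bilinear fusion represents rich, fine-grained interactions between the question vector $$q$$ and each region vector $$s_i$$. From the multimodal vectors $$m_i$$, the Pairwise Relational Modeling block produces a context-aware embedding $$x_i$$ for each region. Summing $$x_i$$ and $$s_i$$ gives $$\hat{s}_i$$; here $$s_i$$ acts as an identity-mapping shortcut, so the computation forms a residual function.

The inputs of the MuRel cell are the question vector $$q$$ and $$N$$ visual features $$s_i \in \mathbb{R}^{d_v}$$ (together with the corresponding bounding box coordinates $$b_i$$).

1. An efficient bilinear fusion module fuses the question feature vector with each region feature vector, yielding local multimodal embeddings for the $$N$$ regions;
2. A Pairwise Relational Modeling component updates each fused region feature vector $$m_i$$ into $$x_i$$ according to its spatial and visual context.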

2019 OK-VQA A Visual Question Answering Benchmark Requiring External Knowledge

Marino K, Rastegari M, Farhadi A, et al. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 3195-3204.

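The questions in the OK-VQA dataset are annotated with the category of external knowledge they require. The required external knowledge is divided into 10 categories, for example: vehicles and transportation; brands, companies, and products; and so on.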

2019 Psycholinguistics Meets Continual Learning Measuring Catastrophic Forgetting in Visual Question Answering

Claudio Greco, Barbara Plank, Raquel Fernández, Raffaella Bernardi. Psycholinguistics Meets Continual Learning Measuring Catastrophic Forgetting in Visual Question Answering[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 3601-3605.

2019 Textbook Question Answering with Multi-modal Context Graph Understanding and Self-supervised Open-set Comprehension

Daesik Kim, Seonhoon Kim, Nojun Kwak. Textbook Question Answering with Multi-modal Context Graph Understanding and Self-supervised Open-set Comprehension[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 3568-3584.

Noh H, Kim T, Mun J, et al. Transfer Learning via Unsupervised Task Discovery for Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 8385-8394.