Building on general image captioning, a more challenging computer vision task has recently emerged. It is called visual question answering: a question referencing some of the input image's contents is part of the input, and the model tries to answer it as accurately as possible. Most publications have agreed on one approach to this problem. The question and the image are each embedded in a vector representation and then combined in some way, and the answer is selected as the most likely one out of 3000 to 5000 possible answers. The problem is thus modeled as a classification problem, i.e., every possible answer is assigned a probability and the most probable one is selected. In our research, we work on a model that does not rely on a predefined answer set, so its answers can vary more than a fixed set of, say, 3000 candidates allows. We employ an LSTM to generate answers dynamically. The generated answers show greater variability than the ground truth, and the model also produces new, previously unseen answers that correctly answer the question.
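The difference between the two formulations can be illustrated with a minimal sketch. It is not the model from the paper: the scores are hand-written toy values, `toy_step` is a hypothetical stand-in for a recurrent decoder such as an LSTM, and the names `classify_answer`, `generate_answer`, `ANSWER_SET` and `VOCAB` are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_answer(scores, answer_set):
    """Classification view: pick the most probable answer from a fixed set."""
    probs = softmax(scores)
    best = max(range(len(answer_set)), key=lambda i: probs[i])
    return answer_set[best], probs[best]


def generate_answer(step_fn, vocab, eos="<eos>", max_len=10):
    """Generative view: emit one word at a time until the end-of-sentence token.

    step_fn(prefix) stands in for a recurrent decoder (assumption); it returns
    one raw score per vocabulary word, given the words emitted so far.
    """
    words = []
    for _ in range(max_len):
        probs = softmax(step_fn(words))
        word = vocab[max(range(len(vocab)), key=lambda i: probs[i])]
        if word == eos:
            break
        words.append(word)
    return " ".join(words)


# Toy data (hypothetical): a tiny fixed answer set and a tiny word vocabulary.
ANSWER_SET = ["yes", "no", "red", "two"]
VOCAB = ["<eos>", "red", "and", "white"]


def toy_step(prefix):
    """Hypothetical decoder, scripted to favour 'red and white' then <eos>."""
    script = ["red", "and", "white", "<eos>"]
    target = script[min(len(prefix), len(script) - 1)]
    return [5.0 if w == target else 0.0 for w in VOCAB]


if __name__ == "__main__":
    print(classify_answer([0.1, 0.3, 2.0, 0.5], ANSWER_SET))
    print(generate_answer(toy_step, VOCAB))
```

In this toy setting, the classifier can only ever return one of the four entries in `ANSWER_SET`, whereas the word-by-word decoder can compose an answer that is not in that set. An end-of-sentence token that scores highest too early cuts the generated answer short.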


Images with their associated questions and the answers generated by our model. All answers shown are new ones not contained in the training set. Figures (a)-(d) show correct answers that the official evaluation script does not detect as correct. The second row shows wrong answers. In particular, (e) and (f) show sentences where the end-of-sentence token was generated too early (a dataset bias towards short answers); (g) and (h) show other wrong answers.




Philipp Harzig, Christian Eggert, Rainer Lienhart. Visual Question Answering With a Hybrid Convolution Recurrent Model. ACM International Conference on Multimedia Retrieval 2018 (ACM ICMR 2018), Yokohama, June 2018. [PDF]



For more information, please contact Philipp Harzig.