Deep Image Captioning

Generating captions that describe the content of an image is an emerging task in computer vision. Recently, Recurrent Neural Networks (RNNs) in the form of Long Short-Term Memory (LSTM) networks have shown great success in generating captions that match an image's content. Compared to traditional tasks like image classification or object detection, this task is more challenging: a model needs not only to identify a main class but also to recognize relationships between objects and describe them in a natural language like English. For example, an encoder/decoder network presented by Vinyals et al. [1] won the Microsoft COCO 2015 captioning challenge.


[1] Vinyals, Oriol, et al. "Show and tell: A neural image caption generator." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

Medical Image Captioning

Image captioning has also started to become popular for automatically generating doctor’s reports for chest x-ray images. Annotating chest x-rays is a tedious and time-consuming job that requires a lot of domain knowledge. In recent years, more and more approaches have been introduced that try to automatically generate paragraphs of text that read like a doctor’s report. However, data is scarce for two reasons. First, annotations cannot be gathered as easily as for tasks like generic image captioning or image classification, because domain experts are needed to create a textual impression of a patient’s chest x-ray. Second, real medical data has to conform to privacy laws and must therefore be anonymized. The only publicly available dataset that combines chest x-ray images with doctor’s reports contains only 7470 samples, of which only half have a unique doctor’s report (most reports come with two chest x-ray images showing different views).


Two examples from the Indiana University Chest X-Ray collection. The upper row shows a normal case without findings, while the bottom row shows a case with findings. Sentences are highlighted according to our human abnormality annotation: normal sentences are highlighted in blue and abnormal sentences in green.


In our research, we focus on correctly identifying abnormalities, because sentences describing abnormalities make up only a small fraction of the report sentences. We want to improve captioning quality based on the correct identification of abnormalities rather than on a machine translation metric like BLEU.
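To illustrate why a word-overlap metric can be misleading here, the following is a minimal sketch and not our actual evaluation code. It compares clipped unigram precision (the 1-gram building block of BLEU) with precision and recall on abnormal sentences. The function names, the example sentences, and the per-sentence abnormality labels are hypothetical, and the sketch assumes that generated and reference sentences are aligned one to one.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision, the 1-gram component of BLEU."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return overlap / len(cand_tokens)


def abnormality_precision_recall(predicted, reference):
    """Precision and recall on abnormal sentences.

    Both arguments are lists of booleans (True = abnormal sentence).
    Assumption: sentences are aligned one to one.
    """
    tp = sum(1 for p, r in zip(predicted, reference) if p and r)
    fp = sum(1 for p, r in zip(predicted, reference) if p and not r)
    fn = sum(1 for p, r in zip(predicted, reference) if not p and r)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical report pair: the generated report misses the only abnormality.
reference_report = "the heart is enlarged . no pleural effusion ."
generated_report = "the heart is normal . no pleural effusion ."

# Hypothetical per-sentence abnormality labels (e.g., from human annotation).
reference_labels = [True, False]
generated_labels = [False, False]

overlap_score = unigram_precision(generated_report, reference_report)
precision, recall = abnormality_precision_recall(generated_labels, reference_labels)
print(f"unigram precision: {overlap_score:.2f}")
print(f"abnormality precision: {precision:.2f}, recall: {recall:.2f}")
```

In this toy example, 8 of the 9 generated tokens also occur in the reference, so the overlap score is high, yet the recall on abnormal sentences is zero because the single abnormal finding was described as normal.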

For more information, please contact Philipp Harzig.