Deep Image Captioning

Generating captions that describe the content of an image is an emerging task in computer vision. Lately, Recurrent Neural Networks (RNNs) in the form of Long Short-Term Memory (LSTM) networks have shown great success in generating captions that match an image's content. Compared to traditional tasks like image classification or object detection, this task is more challenging: a model not only needs to identify a main class but also needs to recognize relationships between objects and describe them in a natural language like English. Recently, an encoder/decoder network presented by Vinyals et al. [1] won the Microsoft COCO 2015 captioning challenge.


[1] Vinyals, Oriol, et al. "Show and tell: A neural image caption generator." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

Image Captioning of Branded Products

In a collaboration with the GfK Verein, we introduced a pipeline capable of automatically generating captions for images from social media. In particular, we look at images containing an object that is related to a brand by carrying the logo of this brand.



Test images from our dataset. Our model generates “a female hand holds a can of cocacola above a tiled floor.”, “a hand is holding a kinderriegel bar.”, “a hand is holding a can of heinz.”, and “a young woman is holding a nutella jar in front of her face.” for the top left, top right, bottom left, and bottom right image, respectively.

In this project, we focus on correctly identifying the brand contained in the image. State-of-the-art models such as the one by Vinyals et al. [1] tend to produce rather generic descriptions. In contrast, we want our model to mention the name of the brand contained in the image within the sentence. Simultaneously, we predict attributes that describe the involvement of the human with the brand, whether the branded product appears in a positive or negative context, and whether the interaction is functional or emotional.
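To make the two goals above concrete, the following is a minimal Python sketch of what one prediction could look like and of how one might check that a caption names the brand. It is not the published model or its evaluation code. The class BrandedCaption, its field names, and the helpers normalize and mentions_brand are hypothetical names introduced here for illustration, and the attribute values in the example are made-up placeholders rather than model output.

```python
import re
from dataclasses import dataclass


@dataclass
class BrandedCaption:
    """Hypothetical container for one prediction (field names are assumptions)."""
    caption: str       # generated sentence
    involvement: str   # involvement of the human with the brand (label set assumed)
    context: str       # "positive" or "negative"
    interaction: str   # "functional" or "emotional"


def normalize(text: str) -> str:
    """Lowercase and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def mentions_brand(caption: str, brand: str) -> bool:
    """Return True if a contiguous run of caption words spells the brand name.

    Both sides are normalized first, so "Coca-Cola" matches the single
    token "cocacola" as well as the two tokens "coca cola".
    """
    target = normalize(brand)
    if not target:
        return False
    tokens = [t for t in (normalize(w) for w in caption.split()) if t]
    for start in range(len(tokens)):
        joined = ""
        for token in tokens[start:]:
            joined += token
            if joined == target:
                return True
            if len(joined) >= len(target):
                break  # this run can no longer match
    return False


# Caption taken from the figure above; attribute values are placeholders.
example = BrandedCaption(
    caption="a hand is holding a can of heinz.",
    involvement="placeholder",
    context="positive",
    interaction="functional",
)
print(mentions_brand(example.caption, "Heinz"))
```

The normalization step is a deliberately simple design choice: it strips non-ASCII characters, so brand names with accents or other scripts would need a more careful comparison.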




  • Philipp Harzig, Stephan Brehm, Rainer Lienhart, Carolin Kaiser, René Schallner, Multimodal Image Captioning for Marketing Analysis,
    IEEE MIPR 2018, Miami, FL, USA, April 2018, [PDF]

For more information, please contact Philipp Harzig.