Deep Image Captioning

Generating captions that describe the content of an image is an emerging task in computer vision. Recently, Recurrent Neural Networks (RNNs) in the form of Long Short-Term Memory (LSTM) networks have shown great success in generating captions that match an image's content. Compared with traditional tasks like image classification or object detection, this task is more challenging: a model not only needs to identify a main object class, but also has to recognize relationships between objects and describe them in a natural language such as English.
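The standard encoder-decoder recipe behind such models can be sketched in a few lines: an image feature vector (from a CNN) initializes an LSTM, which then emits one word at a time, feeding its own prediction back in. The following is a minimal toy sketch with randomly initialized weights; all dimensions, the `END` token index, and the weight names are illustrative assumptions, not the actual trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; real models use CNN features of ~2048 dims
# and vocabularies of 10k+ words).
feat_dim, hidden, vocab = 8, 16, 6
END = 0  # assumed index of the end-of-sentence token

# Randomly initialized weights stand in for a trained model.
W_img = rng.normal(0, 0.1, (hidden, feat_dim))  # image feature -> initial hidden state
W_x = rng.normal(0, 0.1, (4 * hidden, vocab))   # one-hot word -> gate pre-activations
W_h = rng.normal(0, 0.1, (4 * hidden, hidden))  # hidden state -> gate pre-activations
W_out = rng.normal(0, 0.1, (vocab, hidden))     # hidden state -> word logits

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    """One LSTM step: input, forget, output gates plus candidate cell state."""
    gates = W_x @ x + W_h @ h
    i, f, o, g = np.split(gates, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def generate(img_feat, max_len=10):
    """Greedy decoding: feed the argmax word back in at each step."""
    h = np.tanh(W_img @ img_feat)  # condition the LSTM on the image
    c = np.zeros(hidden)
    word, caption = END, []
    for _ in range(max_len):
        x = np.eye(vocab)[word]    # one-hot encoding of the previous word
        h, c = lstm_step(x, h, c)
        word = int(np.argmax(W_out @ h))
        if word == END:
            break
        caption.append(word)
    return caption

caption = generate(rng.normal(size=feat_dim))
print(caption)  # a sequence of word indices
```

In a trained system, the word indices would be mapped back to vocabulary entries, and beam search is typically used instead of greedy argmax decoding.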

Image Captioning of Branded Products

In a collaboration with the GfK Verein, we introduced a pipeline capable of automatically generating captions for images from social media. In particular, we look at images that contain an object related to a brand, i.e., the object depicts the brand's logo.


In this project, we focus on correctly identifying the brand contained in the image. State-of-the-art models like Vinyals et al. [1] tend to produce rather generic descriptions; in contrast, we want our model to explicitly mention the name of the brand within the generated sentence. Simultaneously, we predict attributes that describe the person's involvement with the brand, whether the branded product appears in a positive or negative context, and whether the interaction is functional or emotional.
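The attribute predictions can be thought of as independent classification heads on top of a shared image representation. The sketch below illustrates that multi-head idea; the head names, label sets, and linear classifiers are illustrative assumptions and do not reproduce the architecture of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
feat_dim = 8  # toy dimension for the shared image feature

# Hypothetical attribute heads; the names follow the attributes described
# above, not the exact model of the paper.
heads = {
    "involvement": ["low", "high"],
    "context": ["negative", "positive"],
    "interaction": ["functional", "emotional"],
}
weights = {name: rng.normal(0, 0.1, (len(labels), feat_dim))
           for name, labels in heads.items()}

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_attributes(img_feat):
    """Each head is an independent linear classifier on the shared feature."""
    out = {}
    for name, labels in heads.items():
        probs = softmax(weights[name] @ img_feat)
        out[name] = labels[int(np.argmax(probs))]
    return out

result = predict_attributes(rng.normal(size=feat_dim))
print(result)  # one predicted label per attribute head
```

Training such heads jointly with the caption decoder lets all tasks share the same image representation, which is the usual motivation for multi-task setups of this kind.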



  • Philipp Harzig, Stephan Brehm, Rainer Lienhart, Carolin Kaiser, René Schallner. "Multimodal Image Captioning for Marketing Analysis." IEEE MIPR 2018, Miami, FL, USA, April 2018. [PDF]

Test images from our dataset. Our model generates “a female hand holds a can of cocacola above a tiled floor.”, “a hand is holding a kinderriegel bar.”, “a hand is holding a can of heinz.”, and “a young woman is holding a nutella jar in front of her face.” for the top left, top right, bottom left, and bottom right image, respectively.

Further References:


[1] Vinyals, Oriol, et al. "Show and tell: A neural image caption generator." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

For more information, please contact Philipp Harzig.