Thesis Topics

Image Captioning

Image Captioning in computer vision. Image taken from [1].


Generating captions that describe the content of an image is an emerging task in computer vision. Recently, Recurrent Neural Networks (RNNs) in the form of Long Short-Term Memory (LSTM) networks have shown great success in generating captions that match an image's content. Compared to traditional tasks like image classification or object detection, this task is more challenging: a model not only needs to identify a main class but also needs to recognize relationships between objects and describe them in a natural language like English. An encoder/decoder network presented by Vinyals et al. [1] won the Microsoft COCO 2015 captioning challenge.


Available Topics

Generating Image Captions for Unknown Objects

Examples of generated captions (LRCN) and improved sentences (LSTM-C) by using a copying mechanism that replaces unknown words by detected objects within the caption. Figure taken from [2].


As seen in [1], deep neural networks are capable of generating simple captions for given images when trained on a large dataset on the order of a million labeled image-caption pairs. However, like every data-driven machine learning approach, these models can’t create captions for objects they have never seen before, i.e. objects for which the dataset contains no images or captions. Training such a model takes days to weeks on modern GPUs, but a requirement may be that the model can describe new object categories afterwards with little to no extra effort.


Yao et al. [2] use a hybrid model that combines an object detection model with a captioning model. Figure 1 shows a picture and its ground-truth caption. A standard captioning model (LRCN) may produce an inaccurate caption (e.g. “a red fire hydrant is parked in the middle of a city”). Combined with an object detection pipeline, which detects a bus in this case, the hybrid model can generate a much more accurate caption: “a bus driving down a street next to a building”.
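The idea of combining a caption decoder with an object detector can be illustrated with a small sketch. It assumes, hypothetically, that the decoder yields one score per vocabulary word and that the detector yields one score per vocabulary word it recognized in the image; both distributions are blended with a fixed weight `copy_weight`. This is a simplified mixture for illustration only, not the exact LSTM-C formulation from [2], and all names and numbers are made up.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mix_generate_and_copy(gen_scores, copy_scores, copy_weight=0.5):
    """Blend a generation distribution with a copy distribution.

    gen_scores:  decoder scores over the whole vocabulary (hypothetical).
    copy_scores: detector scores over the same vocabulary; words that were
                 not detected in the image get -inf and thus probability 0.
    copy_weight: assumed fixed mixing weight in [0, 1].
    """
    p_gen = softmax(gen_scores)
    p_copy = softmax(copy_scores)
    return (1.0 - copy_weight) * p_gen + copy_weight * p_copy

vocab = ["a", "hydrant", "bus", "street"]
gen_scores = np.array([0.1, 2.0, -1.0, 0.5])               # decoder prefers "hydrant"
copy_scores = np.array([-np.inf, -np.inf, 3.0, -np.inf])   # detector saw a bus

p = mix_generate_and_copy(gen_scores, copy_scores, copy_weight=0.7)
next_word = vocab[int(np.argmax(p))]
```

With these toy numbers the detector evidence outweighs the decoder's preference, so the blended distribution selects "bus" instead of "hydrant". In a trained model the mixing weight would typically be learned rather than fixed.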


Your tasks will be:

  • Familiarize yourself with the TensorFlow framework and the Show and Tell model.
  • Integrate the LSTM copying mechanism introduced in [2] into a given TensorFlow model (Show and Tell).
  • Train and evaluate your implementation on the MSCOCO dataset while holding out 8 object categories from the dataset.
  • Compare and analyze the performance of your implementation against the results reported in [2].

Python and NumPy knowledge is advantageous, as TensorFlow models are implemented in Python.
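The held-out protocol from the task list can be sketched as a simple filter over image-caption pairs. The object word and the toy data below are placeholders (the actual eight held-out categories should be taken from [2]), and matching on whitespace-separated words is a simplifying assumption that ignores plurals and synonyms.

```python
def split_held_out(pairs, held_out_words):
    """Split (image_id, caption) pairs into a training set without the
    held-out objects and an evaluation set that mentions at least one."""
    held_out = {w.lower() for w in held_out_words}
    train, evaluation = [], []
    for image_id, caption in pairs:
        words = set(caption.lower().split())
        if words & held_out:
            evaluation.append((image_id, caption))
        else:
            train.append((image_id, caption))
    return train, evaluation

pairs = [
    (1, "a bus driving down a street"),
    (2, "a dog lying on a sofa"),
    (3, "two people waiting for the bus"),
]
train, evaluation = split_held_out(pairs, held_out_words=["bus"])
```

Here only the caption about the dog remains available for training, while both captions that mention a bus are reserved for evaluation.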

If you are interested and want more information, please contact Philipp Harzig.

Visual Question Answering

Examples of question/image input pairs and generated answers. Picture taken from [3].


Building on top of general image captioning, another, more challenging task in computer vision has come up recently. It is called visual question answering: in addition to the image, a question referencing some of the image’s contents is part of the input. The model then tries to answer the question as accurately as possible. See figure 2 for image-question-answer examples. Most publications have agreed on one approach to tackle this problem. The question and the image are each embedded in a vector representation and then combined in some way, and the answer is selected as the most likely one out of 3000 to 5000 candidate answers. The problem is therefore modeled as a classification problem, i.e. every candidate answer is assigned a probability and the most probable one is returned.
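The classification view described above can be sketched in a few lines of NumPy. The embeddings are random stand-ins for the outputs of an image network and a question LSTM, the element-wise product is just one common way to combine them, and the classifier weights are untrained; all names and sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def answer_question(image_vec, question_vec, W, b, answers):
    """Fuse both embeddings and classify over a fixed answer set."""
    fused = image_vec * question_vec      # element-wise fusion (one common choice)
    probs = softmax(W @ fused + b)        # one probability per candidate answer
    return answers[int(np.argmax(probs))], probs

dim = 8
answers = ["yes", "no", "2", "red"]       # real systems use 3000 to 5000 answers
image_vec = rng.normal(size=dim)          # stand-in for a CNN image embedding
question_vec = rng.normal(size=dim)       # stand-in for an LSTM question embedding
W = rng.normal(size=(len(answers), dim))  # untrained classifier weights
b = np.zeros(len(answers))

best, probs = answer_question(image_vec, question_vec, W, b, answers)
```

Because the weights are random, the selected answer is meaningless here; the sketch only shows the data flow from two embeddings to a probability distribution over candidate answers.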


Your tasks will be:

  • Familiarize yourself with the TensorFlow framework and a classification pipeline already implemented in TensorFlow (e.g. the Slim framework).
  • Implement a model that embeds an image with a deep convolutional neural network (DCNN) such as ResNet or Inception and a question with an LSTM network. The model should combine both representations and use a classifier that selects the most probable answer for this combination.
  • Train and evaluate your implementation on the VQA-v2 dataset.
  • Compare and analyze the performance against state-of-the-art approaches.

Python and NumPy knowledge is advantageous, as TensorFlow models are implemented in Python. If you are interested and want more information, please contact Philipp Harzig.


Literature for Image Captioning

[1] Vinyals, Oriol, et al. "Show and tell: A neural image caption generator." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

[2] Yao, Ting, et al. "Incorporating copying mechanism in image captioning for learning novel objects." 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.

[3] Teney, Damien, et al. "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge." arXiv preprint arXiv:1708.02711 (2017).

Visual Odometry


Visual odometry (VO) is a technique for estimating the location and orientation (pose) of a robot using information from a camera. This topic has attracted great interest in the field of computer vision as a supplement or replacement to GPS and Inertial Navigation Systems, especially in indoor environments where GPS may not be usable. State-of-the-art approaches are based on a well-defined pipeline consisting of several consecutive steps for the final pose estimation. Each of these modules has to be designed carefully and fine-tuned to ensure optimum performance, and the camera usually needs to be calibrated before use. Deep Learning (DL) is currently dominating many computer vision tasks with promising results. In recent years, deep learning-based VO has drawn considerable attention due to its learning capability and its robustness to camera parameters and different environments.


Available Topics

Visual Odometry through Self-supervised Deep Learning

Conventional vs. deep learning-based method. Image taken from [4], IEEE International Conference on Robotics and Automation (ICRA).

The task of this work will be to implement a neural network that takes two images from different timesteps and uses them to calculate the translation and rotation of the camera. The main objective is to train the network in a self-supervised learning procedure, which means that we don’t use the ground-truth pose information but known 3D geometric properties to define the loss function for training. Such an approach is interesting because it does not require a labeled dataset and can continually improve itself. State-of-the-art CNNs that predict optical flow can be used as base models in this work. Evaluations should cover the potential of self-supervised learning and compare the results to state-of-the-art approaches.
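One geometric property that is commonly used to build such a loss is view synthesis: a pixel of the target image is back-projected with its depth, moved by the relative camera pose, and projected into the source image. The sketch below shows only this reprojection step. It assumes a pinhole camera with known intrinsics `K` and a given per-pixel depth (in practice the depth would have to be predicted as well); the function name and all values are hypothetical.

```python
import numpy as np

def reproject(pixels, depth, K, T):
    """Map pixels from the target frame into the source frame.

    pixels: (N, 2) array of (u, v) coordinates in the target image.
    depth:  (N,) depth of each pixel in the target camera frame.
    K:      (3, 3) pinhole intrinsics (assumed known).
    T:      (4, 4) rigid transform from target to source camera frame.
    """
    ones = np.ones((pixels.shape[0], 1))
    homog = np.hstack([pixels, ones])        # homogeneous pixel coordinates
    rays = homog @ np.linalg.inv(K).T        # back-project to viewing rays
    points = rays * depth[:, None]           # 3-D points in the target frame
    points_h = np.hstack([points, ones])     # homogeneous 3-D points
    moved = (points_h @ T.T)[:, :3]          # 3-D points in the source frame
    proj = moved @ K.T                       # project with the intrinsics
    return proj[:, :2] / proj[:, 2:3]        # perspective division

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pixels = np.array([[320.0, 240.0], [100.0, 50.0]])
depth = np.array([2.0, 5.0])

T = np.eye(4)
T[0, 3] = 0.1   # points shift 0.1 units along x in the source frame
warped = reproject(pixels, depth, K, T)
```

A photometric loss would then compare the target image at `pixels` with the source image sampled at `warped`; a pose estimate that is wrong produces a larger difference, which provides a training signal without ground-truth poses.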



Your tasks will be:

  • Familiarize yourself with TensorFlow/Keras.
  • Research the literature on the classic visual odometry pipeline and existing deep learning approaches.
  • Implement a model that predicts the 6-DoF pose given two images from different timesteps. Design and implement a self-supervised training procedure that allows the model to continually improve.
  • Train and evaluate your implementation on a real-world scenario or on existing datasets like the KITTI VO benchmark [3].
  • Compare and analyse the performance and computation time against state-of-the-art VO approaches.
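To make the notion of a 6-DoF pose concrete, the following sketch converts a six-element vector into a 4x4 rigid transform and chains frame-to-frame estimates into a trajectory, which is what an evaluation against a benchmark trajectory would consume. The axis-angle parameterization of the rotation and the composition order are assumptions; Euler angles or quaternions are equally common, and the convention has to match the dataset that is used.

```python
import numpy as np

def pose_to_matrix(pose):
    """Convert a 6-DoF vector (tx, ty, tz, rx, ry, rz) into a 4x4 transform.

    The rotation part is interpreted as an axis-angle vector (an assumption)
    and converted with the Rodrigues formula.
    """
    t = np.asarray(pose[:3], dtype=float)
    r = np.asarray(pose[3:], dtype=float)
    angle = np.linalg.norm(r)
    if angle < 1e-12:
        R = np.eye(3)
    else:
        k = r / angle
        Kx = np.array([[0.0, -k[2], k[1]],
                       [k[2], 0.0, -k[0]],
                       [-k[1], k[0], 0.0]])
        R = np.eye(3) + np.sin(angle) * Kx + (1.0 - np.cos(angle)) * (Kx @ Kx)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def accumulate(relative_poses):
    """Chain frame-to-frame poses into absolute camera poses."""
    current = np.eye(4)
    trajectory = [current]
    for pose in relative_poses:
        current = current @ pose_to_matrix(pose)
        trajectory.append(current)
    return trajectory

# Two identical steps: move 1 unit along z and turn 90 degrees around y.
step = [0.0, 0.0, 1.0, 0.0, np.pi / 2, 0.0]
trajectory = accumulate([step, step])
final_position = trajectory[-1][:3, 3]
```

After the first step the camera has turned, so the second forward motion points along the x axis of the starting frame and the final position is (1, 0, 1).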



Requirements:

  • Experience in Python (strongly recommended).
  • Experience in deep learning frameworks like TensorFlow or Torch/PyTorch (recommended).
  • Motivation to work on deep learning.


Possible Extensions:

  • Train the self-supervised solution with a stereo camera to track head movements of people wearing head-mounted displays in virtual reality scenarios. Such an application requires, for example, very fast pose calculations and has to cope with motion blur during fast head movements.
  • In addition to visual information, one can also integrate measurements from an Inertial Measurement Unit (IMU) into the odometry estimation, resulting in a Visual-Inertial Odometry algorithm. Further work could cover the integration and fusion of both data using an additional neural network and evaluate it against existing approaches based on a Kalman Filter for sensor fusion.
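As a point of reference for the Kalman filter baselines mentioned in the last extension, the sketch below shows one predict/update cycle of a scalar Kalman filter that fuses an IMU-based motion prediction with a visual position measurement. It is a one-dimensional toy model with made-up noise values, not a full visual-inertial odometry filter, and all names are hypothetical.

```python
def kalman_step(x, P, u, z, Q=0.01, R=0.1):
    """One predict/update cycle of a scalar Kalman filter.

    x, P: previous position estimate and its variance.
    u:    displacement predicted from IMU integration (assumed given).
    z:    position measured by visual odometry.
    Q, R: process and measurement noise variances (made-up values).
    """
    # Predict: apply the IMU-based motion and inflate the uncertainty.
    x_pred = x + u
    P_pred = P + Q
    # Update: blend prediction and measurement using the Kalman gain.
    gain = P_pred / (P_pred + R)
    x_new = x_pred + gain * (z - x_pred)
    P_new = (1.0 - gain) * P_pred
    return x_new, P_new

x, P = 0.0, 1.0
for u, z in [(1.0, 1.1), (1.0, 1.9), (1.0, 3.2)]:
    x, P = kalman_step(x, P, u, z)
```

The variance `P` shrinks with every measurement, and the estimate ends up between the IMU prediction and the visual measurement, weighted by their assumed noise levels.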

If you are interested and want more information, please contact Sebastian Scherer. Students who write a bachelor thesis may get a simplified task.



Literature for Visual Odometry

[1] D. Scaramuzza and F. Fraundorfer, “Visual odometry: Tutorial,” IEEE Robotics & Automation Magazine, vol. 18, no. 4, pp. 80–92, 2011.

[2] F. Fraundorfer and D. Scaramuzza, “Visual odometry: Part II: Matching, robustness, optimization, and applications,” IEEE Robotics & Automation Magazine, vol. 19, no. 2, pp. 78–90, 2012.

[3] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[4] S. Wang, R. Clark, H. Wen, and N. Trigoni, “DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks,” IEEE International Conference on Robotics and Automation (ICRA), 2017.

[5] R. Li, S. Wang, Z. Long, and D. Gu, “UnDeepVO: Monocular Visual Odometry through Unsupervised Deep Learning,” IEEE International Conference on Robotics and Automation (ICRA), 2018.