Topics for Theses

Available Topics

Image Captioning in computer vision. Image taken from [1].

Introduction

Generating captions that describe the content of an image is an emerging task in computer vision. Recently, Recurrent Neural Networks (RNNs) in the form of Long Short-Term Memory (LSTM) networks have shown great success in generating captions that match an image's content. Compared to traditional tasks such as image classification or object detection, this task is more challenging: a model needs not only to identify a main class but also to recognize relationships between objects and describe them in a natural language such as English. An encoder/decoder network presented by Vinyals et al. [1] won the Microsoft COCO 2015 captioning challenge.

 

Available Topics
 

Generating Image Captions for Unknown Objects

Examples of generated captions (LRCN) and improved sentences (LSTM-C) by using a copying mechanism that replaces unknown words by detected objects within the caption. Figure taken from [2].

 

As shown in [1], deep neural networks are capable of generating simple captions for images when trained on a large dataset on the order of a million labeled image-caption pairs. However, like every data-driven machine learning approach, these models cannot create captions for objects they have never seen, i.e. when the dataset contains no images or captions of a certain type. Training such a model takes days to weeks on modern GPUs, yet a requirement may be that the model can describe new object categories afterwards with little to no extra effort.

 

Yao et al. [2] use a hybrid model that combines an object detection model with a captioning model. Figure 1 shows a picture and its ground-truth caption. A standard captioning model (LRCN) may produce an inaccurate caption (e.g. “a red fire hydrant is parked in the middle of a city”). Combined with an object detection pipeline, which detects a bus in this case, the hybrid model can generate a much more accurate caption: “a bus driving down a street next to a building”.
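The idea of combining a caption generator with an object detector can be illustrated with a minimal sketch. It is not the LSTM-C formulation from [2]. It only assumes that, at one decoding step, the model has a probability distribution over its vocabulary, a set of detector confidences for object names, and a gate value (here the hypothetical `copy_gate`) in [0, 1] that decides how much weight copying receives. All names and numbers are illustrative.

```python
def mix_generate_and_copy(gen_probs, det_scores, copy_gate):
    """Blend a vocabulary distribution with normalized detector scores.

    gen_probs:  dict word -> probability from the caption decoder (sums to 1)
    det_scores: dict object name -> non-negative detector confidence
    copy_gate:  float in [0, 1]; weight given to copying detected objects
    """
    if not 0.0 <= copy_gate <= 1.0:
        raise ValueError("copy_gate must lie in [0, 1]")
    total = sum(det_scores.values())
    # Without detections there is nothing to copy, so fall back to generation.
    if total == 0:
        return dict(gen_probs)
    copy_probs = {w: s / total for w, s in det_scores.items()}
    words = set(gen_probs) | set(copy_probs)
    return {
        w: (1.0 - copy_gate) * gen_probs.get(w, 0.0)
        + copy_gate * copy_probs.get(w, 0.0)
        for w in words
    }


# Illustrative numbers: the decoder never saw "bus" during training.
gen = {"hydrant": 0.6, "car": 0.3, "street": 0.1}
det = {"bus": 0.9, "car": 0.1}
mixed = mix_generate_and_copy(gen, det, copy_gate=0.7)
best_word = max(mixed, key=mixed.get)
```

With these toy values the word "bus" receives the highest probability although the decoder alone assigns it none. In a real model the gate would be learned rather than fixed.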

 

Your tasks will be:

  • Familiarize yourself with the TensorFlow framework and the Show and Tell model
  • Integrate the LSTM copying mechanism introduced in [2] into a given TensorFlow model (Show and Tell)
  • Train and evaluate your implementation on the MSCOCO dataset by holding out 8 object classes from the dataset
  • Compare and analyze the performance of your implementation against the results reported in [2].
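The hold-out step in the task list could be prepared along the following lines. This is a rough sketch: the record layout (`image_id`, `objects`, `caption`) and the object names are hypothetical and reflect neither the actual MSCOCO annotation format nor the eight objects chosen in [2].

```python
def split_by_held_out_objects(records, held_out):
    """Separate records into training data and novel-object evaluation data.

    A record goes to the evaluation set if any of its annotated objects is in
    `held_out`; otherwise it stays in the training set.
    """
    held_out = set(held_out)
    train, novel = [], []
    for rec in records:
        if held_out & set(rec["objects"]):
            novel.append(rec)
        else:
            train.append(rec)
    return train, novel


# Hypothetical toy records, not real MSCOCO annotations.
records = [
    {"image_id": 1, "objects": ["dog", "ball"], "caption": "a dog chasing a ball"},
    {"image_id": 2, "objects": ["bus", "building"], "caption": "a bus next to a building"},
    {"image_id": 3, "objects": ["cat"], "caption": "a cat on a sofa"},
]
train, novel = split_by_held_out_objects(records, held_out=["bus", "pizza"])
```

The captioning model would then be trained on `train` only, while `novel` is reserved for measuring how well unseen objects are described.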

Python and NumPy knowledge is advantageous, as TensorFlow models are implemented in Python.

If you are interested and want more information, please contact Philipp Harzig.
 

Visual Question Answering

Examples of question/image input pairs and generated answers. Picture taken from [3].

 

Building on general image captioning, a more challenging computer vision task has emerged recently: visual question answering (VQA). Here, a question referencing some of the input image’s contents is part of the input, and the model tries to answer it as accurately as possible. See figure 2 for image-question-answer examples. Most publications have converged on one approach to this problem: the question and the image are each embedded into a vector representation and then combined in some way, and the answer is selected from a fixed set of 3000 to 5000 candidate answers. The problem is thus modeled as a classification problem, i.e. every candidate answer is assigned a probability and the most likely one is selected.
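The classification view described above can be sketched in a few lines. The following assumes, purely for illustration, that the image and question embeddings have the same length, are fused by an element-wise product, and are scored by a single linear layer. The fusion choice, the weights, and the tiny answer list are made up and not taken from [3].

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer_probabilities(image_vec, question_vec, weights, answers):
    """Fuse two embeddings and return a probability for every candidate answer.

    weights holds one row of coefficients per answer (a linear classifier).
    """
    if len(image_vec) != len(question_vec):
        raise ValueError("embeddings must have the same length")
    # Element-wise product as a simple (assumed) fusion of both modalities.
    fused = [i * q for i, q in zip(image_vec, question_vec)]
    logits = [sum(w * f for w, f in zip(row, fused)) for row in weights]
    return dict(zip(answers, softmax(logits)))


# Toy values standing in for DCNN and LSTM outputs.
answers = ["yes", "no", "two"]
image_vec = [1.0, 0.5, 2.0]
question_vec = [0.5, 2.0, 1.0]
weights = [
    [1.0, 0.0, 0.0],  # row for "yes"
    [0.0, 1.0, 0.0],  # row for "no"
    [0.0, 0.0, 1.0],  # row for "two"
]
probs = answer_probabilities(image_vec, question_vec, weights, answers)
predicted = max(probs, key=probs.get)
```

In an actual model the embeddings would come from the trained networks and the answer list would contain the several thousand most frequent answers of the dataset.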

 

Your tasks will be:

  • Familiarize yourself with the TensorFlow framework and a classification pipeline already implemented in TensorFlow (e.g. the Slim framework)
  • Implement a model that embeds an image with a deep convolutional neural network (DCNN) such as ResNet or Inception and a question with an LSTM network. The model should combine both representations and use a classifier that selects the most probable answer for this combination.
  • Train and evaluate your implementation on the VQA-v2 dataset.
  • Compare and analyze the performance against state-of-the-art approaches.

Python and NumPy knowledge is advantageous, as TensorFlow models are implemented in Python. If you are interested and want more information, please contact Philipp Harzig.

 

Literature for Image Captioning

[1] Vinyals, Oriol, et al. "Show and tell: A neural image caption generator." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
 

[2] Yao, Ting, et al. "Incorporating copying mechanism in image captioning for learning novel objects." 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
 

[3] Teney, Damien, et al. "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge." arXiv preprint arXiv:1708.02711 (2017).

Generating ground truth data can be very costly and time-consuming. For instance, labeling a single image for semantic segmentation can take up to 1 hour. Synthetic data, by contrast, can be generated automatically with far less effort. However, neural networks trained on synthetic data generalize poorly to real data. In this thesis we assume that we have a labeled synthetic dataset and an unlabeled real dataset consisting only of images. The task is to develop methods that allow a model trained on synthetic data to generalize to real data as well. Possible approaches include Generative Adversarial Networks (GANs) or a self-supervised training pipeline.

If you are interested and want more information, please contact Sebastian Scherer.

In this work we assume the following scenario: we have a large number of images, but only a small subset has annotated ground truth labels. Supervised approaches can only use this small subset. The question we ask is whether we can make use of all images. We will pre-train our models on a self-supervised task using a large amount of unlabeled data before adapting them to the target task. Possible target tasks include semantic segmentation, human pose estimation, or 3D object detection.
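To make the notion of a self-supervised task concrete, the sketch below builds training pairs for rotation prediction, one commonly used pretext task. The topic does not prescribe a particular pretext task, so this choice and the function names are assumptions made only for illustration.

```python
def rotate90(image):
    """Rotate a 2D list (rows of pixels) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotation_pretext_samples(image):
    """Return (rotated_image, label) pairs for 0, 90, 180 and 270 degrees.

    The label is the number of quarter turns, so no human annotation is needed.
    """
    samples = []
    current = [list(row) for row in image]
    for quarter_turns in range(4):
        samples.append((current, quarter_turns))
        current = rotate90(current)
    return samples


# A tiny 2x2 "image" standing in for an unlabeled photo.
image = [[1, 2],
         [3, 4]]
samples = rotation_pretext_samples(image)
```

A network pre-trained to predict these labels on all unlabeled images could afterwards be fine-tuned on the small annotated subset.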

If you are interested and want more information, please contact Sebastian Scherer.

Supervised training of deep neural networks requires large labeled datasets. However, the label generation process can be noisy and error-prone, so that some samples are labeled incorrectly. In addition, some self-supervised methods generate pseudo labels on non-annotated data, which are then used for training. Training on noisy labels can lead to poor performance. In this work, we will investigate the effect of wrong annotations during training and design approaches that overcome this issue. Possible target tasks include image classification, semantic segmentation, human pose estimation, or 3D object detection.
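One way to study the effect of wrong annotations is to corrupt a clean label set in a controlled way. The sketch below injects symmetric label noise for a classification task. The noise model, the function name, and the noise rate are assumptions chosen for illustration and are not prescribed by the topic.

```python
import random


def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Replace each label with a different, uniformly chosen class
    with probability `noise_rate`. Returns a new list."""
    if num_classes < 2:
        raise ValueError("need at least two classes to flip a label")
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            # Draw from the other classes so that a flip always changes the label.
            wrong = rng.randrange(num_classes - 1)
            noisy.append(wrong if wrong < y else wrong + 1)
        else:
            noisy.append(y)
    return noisy


# 1000 clean labels spread evenly over 5 classes.
clean = [i % 5 for i in range(1000)]
noisy = inject_symmetric_noise(clean, num_classes=5, noise_rate=0.2)
observed_rate = sum(c != n for c, n in zip(clean, noisy)) / len(clean)
```

Training the same model on `clean` and on `noisy` at several noise rates gives a first picture of how sensitive it is to annotation errors.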

If you are interested and want more information, please contact Sebastian Scherer.

Bottom-up human pose estimation pipelines have the advantage that their runtime is nearly constant regardless of the number of people in the image. On the other hand, grouping the detected keypoints into person instances is complicated. The HigherHRNet model achieves great accuracy in keypoint detection, but its grouping post-processing step takes a lot of time compared to the model runtime. Furthermore, its grouping mechanism, Associative Embedding, has difficulties with occluded persons. Another promising way to perform the grouping is based on Part Affinity Fields (PAFs). In this thesis, the PAF grouping technique should be incorporated into the HigherHRNet model and compared to the current Associative Embedding technique with regard to runtime and grouping performance.
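For orientation, the core of PAF-based grouping is a score that measures how well the predicted vector field aligns with the line between two candidate keypoints. The sketch below approximates this score by sampling points along the segment. The field layout (`paf[y][x] == (vx, vy)`), the nearest-pixel sampling, and the number of samples are simplifying assumptions and do not reproduce the HigherHRNet code or the original PAF implementation.

```python
import math


def paf_limb_score(paf, start, end, num_samples=10):
    """Average alignment between a 2D vector field and the segment start->end.

    paf:   nested list with paf[y][x] == (vx, vy)
    start: (x, y) of the first keypoint candidate
    end:   (x, y) of the second keypoint candidate
    """
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length  # unit vector along the candidate limb
    total = 0.0
    for k in range(num_samples):
        t = k / (num_samples - 1) if num_samples > 1 else 0.0
        # Nearest-pixel lookup at the sampled position on the segment.
        x = int(round(start[0] + t * dx))
        y = int(round(start[1] + t * dy))
        vx, vy = paf[y][x]
        total += vx * ux + vy * uy  # dot product with the limb direction
    return total / num_samples


# Toy 5x5 field pointing to the right everywhere.
field = [[(1.0, 0.0) for _ in range(5)] for _ in range(5)]
horizontal = paf_limb_score(field, start=(0, 2), end=(4, 2))
vertical = paf_limb_score(field, start=(2, 0), end=(2, 4))
```

A candidate pair aligned with the field scores high, a perpendicular pair scores near zero. Grouping would then connect the keypoint pairs with the best scores, which is the step to be compared against Associative Embedding.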

If you are interested and want more information, please contact Katja Ludwig.
