Topics for Theses

Available Topics

Last edited: 27.08.2024

Keywords: Scene Graph, Graph Database, LLM, Synthetic Data

 

Scene Graphs represent a given image or video as a graph structure, which can then be queried to support further downstream tasks. One way to query a scene graph is to load it into a graph database (e.g. Neo4j) and query it with a graph query language (e.g. Cypher). Users could write such queries by hand, but in this thesis, the query should be derived from a text prompt using an LLM.
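As a sketch of what such a query looks like, the snippet below models a toy scene graph as subject-relation-object triples and answers one question in plain Python, with the equivalent Cypher shown as a comment. All object and relation names here are made-up examples, not from a specific dataset.

```python
# Toy scene graph as subject-relation-object triples (illustrative names).
triples = [
    ("cup", "on", "table"),
    ("book", "on", "table"),
    ("table", "next_to", "sofa"),
    ("person", "sitting_on", "sofa"),
]

# Equivalent Cypher query against a Neo4j graph (illustrative):
#   MATCH (x)-[:ON]->(t {name: "table"}) RETURN x.name

def objects_with_relation(triples, relation, target):
    """Return all subjects connected to `target` via `relation`."""
    return [s for s, r, o in triples if r == relation and o == target]

print(objects_with_relation(triples, "on", "table"))  # -> ['cup', 'book']
```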

However, most scene graph datasets were not built with complex queries in mind: their relations were selected by how easy they were to obtain, not by how useful they are for complex graph queries. Therefore, you will create a set of dummy graphs (without any connection to existing images/videos) and explore relations and settings that allow for complex reasoning using graph queries. Next, you will use large language models (LLMs) to convert user text prompts into graph queries and evaluate the results.
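One common way to derive a query from a text prompt is to hand the LLM a description of the graph schema together with the user's question. The sketch below only builds such a prompt string; the schema, relation names, and prompt wording are assumptions, and the actual LLM call (OpenAI API, local model, ...) is deliberately left out.

```python
# Hypothetical schema description for the dummy graphs (illustrative only).
SCHEMA = "Nodes: (:Object {name}). Relations: [:ON], [:NEXT_TO], [:HOLDING]."

def build_prompt(user_question):
    """Compose the text prompt that would be sent to the LLM."""
    return (
        "You translate questions about a scene graph into Cypher.\n"
        f"Graph schema: {SCHEMA}\n"
        f"Question: {user_question}\n"
        "Answer with a single Cypher query only."
    )

prompt = build_prompt("Which objects are on the table?")
# The Cypher returned by the LLM would then be run against Neo4j and the
# result compared to the expected answer for evaluation.
```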

Finally, with the insights gained, a set of useful relations and attributes for complex scene graph queries can be derived. These relations can then be generated and evaluated with our synthetic dataset generator.

 

Project outline:

  1. Get to know Neo4j and Cypher
  2. Create example graphs that can be queried
  3. Derive graph queries from text prompts using an LLM
  4. Define a set of relations that future scene graph datasets must contain
  5. Generate and evaluate a synthetic dataset with our dataset generator

If you are interested in this topic or require more information, please contact Julian Lorenz.


Keywords: Scene Graph, 3D Vision, Generative AI, Diffusion Models, Auto Encoder, Synthetic Data

 

3D scene synthesis methods populate 3D scenes with believable layouts. For example, indoor 3D scene synthesis methods populate 3D rooms with 3D furniture objects to create believable indoor scenes. Recent methods use diffusion models to generate 3D scenes (DiffuScene, PhyScene). However, these methods are limited to a fixed set of 3D assets and have to be retrained for new objects. Therefore, they are also limited to specific datasets (3D-FUTURE).

In this thesis, you will generalise these models by removing the constraint of a fixed set of assets. DiffuScene and PhyScene use an autoencoder to encode object shapes into a vector representation. Instead, you will encode bounds and other abstract features such as orientation and usage attributes into a vector representation, enabling more flexibility. With this design, more datasets can be used, such as [Hypersim](https://github.com/apple/ml-hypersim). Explore various datasets, include them in your model training, and evaluate the results.
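A minimal sketch of such an abstract object encoding, assuming a hypothetical feature set of box size, yaw orientation, and multi-hot usage attributes (the exact feature set would be refined during the thesis):

```python
import numpy as np

# Hypothetical vocabulary of usage attributes (an assumption, not from a dataset).
USAGE_ATTRS = ["sittable", "storage", "surface"]

def encode_object(size_xyz, yaw, attrs):
    """Encode one object as [w, h, d, sin(yaw), cos(yaw), multi-hot attrs]."""
    onehot = np.array([1.0 if a in attrs else 0.0 for a in USAGE_ATTRS])
    return np.concatenate([np.asarray(size_xyz, dtype=float),
                           [np.sin(yaw), np.cos(yaw)],
                           onehot])

vec = encode_object([0.6, 0.45, 0.6], yaw=np.pi / 2, attrs={"sittable"})
print(vec.shape)  # (8,)
```

Encoding the angle as sin/cos avoids the discontinuity at 0/2π that a raw angle value would introduce into the diffusion model's targets.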

 

Project outline:

  1. Get to know PhyScene and run training and generation
  2. Replace the asset autoencoder
  3. Evaluate performance of the new model
  4. Explore additional datasets and include them in the training
  5. Evaluate performance of the new model with the additional data

If you are interested in this topic or require more information, please contact Julian Lorenz.


Keywords: Scene Graph, Active Learning

 

Scene graphs describe relations in an image using a graph structure, which can then be used for downstream applications. Existing scene graph datasets like PSG or Visual Genome are heavily imbalanced and contain very few images for certain relation classes. Therefore, neural networks for scene graphs struggle with these relations.

To improve the performance on rare relations or to support entirely new relations, new datasets have to be built. However, simply annotating new images is time-consuming and ineffective. A better approach is to use active learning to select the most valuable images for training. Your task will be to build a new scene graph dataset by employing various active learning techniques, and to evaluate and compare the techniques used.

You will start from a code base that already contains various active learning approaches. You will have to familiarise yourself with that code base and adapt it to your needs. Much of the coding will be in JavaScript with some SQL, so you should be familiar with these languages.
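One classic active learning technique is uncertainty sampling. The sketch below (in Python for brevity; the actual code base is JavaScript) ranks images by the entropy of their predicted relation distributions and picks the most uncertain ones for annotation. All numbers are made up.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Image id -> predicted relation distribution (illustrative values).
predictions = {
    "img_1": [0.90, 0.05, 0.05],
    "img_2": [0.34, 0.33, 0.33],
    "img_3": [0.60, 0.30, 0.10],
}

def select_for_annotation(predictions, k):
    """Return the k image ids the model is most uncertain about."""
    return sorted(predictions, key=lambda i: entropy(predictions[i]),
                  reverse=True)[:k]

print(select_for_annotation(predictions, k=2))  # -> ['img_2', 'img_3']
```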

 

Project outline:

  1. Implement various active learning approaches. A reference implementation is provided.
  2. Devise a way to compare the different approaches
  3. Build a scene graph dataset using active learning

If you are interested in this topic or require more information, please contact Julian Lorenz.

Human Pose Estimation (HPE) is the task of detecting human keypoints in images or videos. 2D Human Pose Estimation localizes these keypoints in pixel coordinates of the image or video frame. 3D Human Pose Estimation estimates a three-dimensional pose of the humans in the image or video. Most commonly, this task is accomplished by uplifting estimated 2D poses to the third dimension, e.g., by leveraging the temporal context in videos.

Transformer architectures are currently the most common in these tasks. They have the benefit of a global view instead of the local view of convolution operations. Thesis topics in this field could include analyzing 3D HPE architectures, improving or adapting them, e.g., for different domains or target applications, or analyzing different input or training modes like semi-supervised learning.

Semi-Supervised Learning is an active research field in computer vision with the goal of training neural networks with only a small labeled dataset and a lot of unlabeled data. For human pose estimation, this means that a large dataset with images of people is available, but only a small subset has annotated keypoints. Semi-supervised human pose estimation uses different techniques to train jointly on labeled and unlabeled images in order to improve the detection performance of the network. Popular methods are pseudo labels - the use of network predictions as annotations - and teacher-student approaches, where one network is enhanced by being trained by a second network.
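The teacher-student idea can be sketched as a mean-teacher update, in which the teacher's weights are an exponential moving average (EMA) of the student's weights. Plain lists stand in for network parameters here, and the momentum of 0.99 is a typical but arbitrary choice.

```python
def ema_update(teacher, student, momentum=0.99):
    """One EMA step: teacher <- momentum * teacher + (1 - momentum) * student."""
    return [momentum * t + (1 - momentum) * s for t, s in zip(teacher, student)]

teacher = [0.0, 0.0]   # stand-in for teacher network weights
student = [1.0, 2.0]   # stand-in for student network weights
teacher = ema_update(teacher, student)
print(teacher)  # approximately [0.01, 0.02]
```

The slowly-moving teacher then produces the pseudo labels that supervise the student on unlabeled images.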

 

If you are interested and want more information, please contact Katja Ludwig.

The computer vision task of Human Pose Estimation estimates keypoints of humans in either 2D or 3D. These keypoints can be connected to create a skeleton model of the human. This skeleton model is sufficient for some tasks, but does not reflect the body shape of the person. Human Mesh Estimation overcomes this issue: it estimates not only keypoints, but a whole mesh representing the pose and body shape of humans. This task is more challenging than pure 3D Human Pose Estimation, as many more parameters need to be estimated. To keep the number of parameters relatively small, body models like SMPL and its successors are common in this field. Thesis topics could include the analysis of Human Mesh architectures, slight adaptations to the models or training routines, analyses or conversion of body models, etc.
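To illustrate why body models keep the parameter count small: in SMPL, a mesh of 6890 vertices is controlled by only 10 shape parameters and 24 x 3 pose parameters. The toy linear map below demonstrates these dimensions only; the random basis is purely illustrative, while the real SMPL model applies learned blend shapes and linear blend skinning.

```python
import numpy as np

NUM_VERTICES = 6890   # SMPL mesh size
SHAPE_DIM = 10        # shape parameters (betas)
POSE_DIM = 24 * 3     # pose parameters (one 3D rotation per joint)

rng = np.random.default_rng(0)
# Random stand-in for the learned shape blend-shape basis.
shape_basis = rng.normal(size=(NUM_VERTICES * 3, SHAPE_DIM))

def toy_mesh(betas):
    """Map shape parameters to vertex offsets (illustrative linear map)."""
    return (shape_basis @ betas).reshape(NUM_VERTICES, 3)

vertices = toy_mesh(np.zeros(SHAPE_DIM))
print(vertices.shape)  # (6890, 3)
```

So a network only needs to regress 10 + 72 numbers (plus camera/translation) instead of 6890 x 3 vertex coordinates.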

 

If you are interested and want more information, please contact Katja Ludwig.

 

Access to masks for objects in images is of great importance to many computer vision tasks. Manually annotating such object masks (for example with polygon drawings), however, takes an extensive amount of time. In addition, the annotation of finely jagged edges and delicate structures poses a considerable problem. Interactive segmentation systems try to drastically ease this task by using forms of user guidance that can be annotated cheaply in order to predict an object mask. Usually this guidance takes the form of right/left mouse clicks that annotate single background/foreground pixels.

Semantic segmentation is the task of classifying every single pixel into one of several predefined classes. Interactive semantic segmentation therefore constitutes a combination of the two tasks: the segmentation happens on the basis of user guidance, with the goal of circumventing a costly annotation process. Instead of annotating single objects, the goal is to divide the entire input image into several class regions.
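Click guidance is commonly encoded as extra input channels for the network, e.g. one Gaussian "blob" map per click type. A minimal sketch, with arbitrary image size and Gaussian width:

```python
import numpy as np

def click_map(height, width, clicks, sigma=5.0):
    """Render clicks (row, col) as a Gaussian guidance map with peak 1.0."""
    ys, xs = np.mgrid[0:height, 0:width]
    out = np.zeros((height, width))
    for cy, cx in clicks:
        blob = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        out = np.maximum(out, blob)
    return out

fg = click_map(64, 64, clicks=[(20, 30)])           # foreground clicks
bg = click_map(64, 64, clicks=[(5, 5), (60, 60)])   # background clicks
# guidance = np.stack([fg, bg])  -> concatenated to the image channels
print(fg[20, 30])  # 1.0 at the clicked pixel
```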

 

Literature:

[1] : https://ceur-ws.org/Vol-2766/paper1.pdf

[2] : https://arxiv.org/abs/2003.14200

 

In case of interest, contact Robin Schön (robin.schoen@uni-a.de).

 

Accurately tracking the ball in 3D space is crucial for sports analysis. Existing technologies, like goal-line technology in soccer, rely on expensive setups with multiple cameras. Our research explores using computer vision and machine learning to estimate the ball's 3D position in cost-effective, single-camera videos.

 

We focus on two promising approaches:

  • Direct 3D Prediction in Single Images: Neural networks can be trained to directly estimate the ball's 3D location from a single image, considering its size and surrounding scene. While effective, this approach can be imprecise due to inherent limitations of only considering single images.
  • Physics-Guided Tracking: We can also track the ball's movement across a video sequence, ensuring predictions align with the laws of physics. This method offers greater accuracy.
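The physics-guided idea can be sketched in one dimension: fitting the ballistic model z(t) = z0 + v0*t - 0.5*g*t^2 to noisy height observations by least squares smooths out per-frame errors. All values below are synthetic.

```python
import numpy as np

g = 9.81  # gravitational acceleration in m/s^2
t = np.linspace(0.0, 1.0, 30)                      # frame timestamps
z_true = 1.0 + 8.0 * t - 0.5 * g * t ** 2          # true ballistic trajectory
z_obs = z_true + np.random.default_rng(0).normal(scale=0.05, size=t.shape)

# Solve for [z0, v0]: move the known gravity term to the observation side,
# leaving a linear least-squares problem A @ [z0, v0] = b.
A = np.stack([np.ones_like(t), t], axis=1)
b = z_obs + 0.5 * g * t ** 2
(z0, v0), *_ = np.linalg.lstsq(A, b, rcond=None)
print(round(z0, 2), round(v0, 2))  # close to the true 1.0 and 8.0
```

The same principle extends to 3D: the fitted trajectory constrains the per-frame predictions to physically plausible motion.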

Multiple projects for your thesis are possible. Here are a few examples:

  • Developing novel methods for direct 3D prediction: This involves exploring new approaches to estimate the ball's location directly from single images, potentially improving accuracy.
  • Fusing direct predictions with physics-guided tracking: This research direction investigates combining the strengths of both techniques for even more robust 3D ball tracking.
  • Adapting tracking algorithms to real-world data: This project area focuses on making tracking algorithms more resilient to the inevitable noise and imperfections present in real-world video data.

Background and Skills:

A strong foundation in camera matrices and coordinate transformations, as covered in the "Grundlagen der Signalverarbeitung und des Maschinellen Lernens" lecture, will be highly beneficial for these projects. While a background in physics is not required, you should not be afraid of the relevant equations.

 

For more information, contact Daniel Kienzle

 
