Thesis Topics

Current Topics

Neural networks now perform so well in image processing that it is nearly impossible to achieve comparable performance with other (classical) methods. Despite their strong performance, however, neural networks have major drawbacks: they are computationally expensive, memory-intensive, and require very large amounts of data. Classical methods, by contrast, are tailored to the problem at hand and usually get by without large amounts of data.

The goal of this thesis is to combine classical methods with neural networks in order to reduce the data hunger, memory footprint, and compute requirements of these networks. To this end, the thesis considers the task of semantic segmentation, and the outputs of a deep segmentation network are to be post-processed with the technique of relaxation labelling. Over the course of the thesis, different ways of combining the two techniques are to be tried out, and the results must be compared rigorously. In addition, interested students can extend the topic in various directions.
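To make the idea more concrete, the following is a minimal sketch of a classical relaxation labelling update applied to the per-pixel class probabilities of a segmentation network. The function name, the uniform 4-neighbourhood, and the compatibility matrix `compat` are assumptions chosen for illustration; the thesis is expected to explore other combinations.

```python
import numpy as np

def relaxation_labelling(probs, compat, iterations=5):
    """Iteratively refine per-pixel class probabilities.

    probs:  (H, W, C) array, each pixel's entries sum to 1.
    compat: (C, C) array with values in [-1, 1]; compat[a, b] says how well
            label a at a pixel agrees with label b at a neighbouring pixel
            (hypothetical values, to be chosen or learned).
    """
    p = probs.astype(np.float64)
    for _ in range(iterations):
        # Mean label distribution of the 4-neighbourhood (borders replicated).
        padded = np.pad(p, ((1, 1), (1, 1), (0, 0)), mode="edge")
        neigh = (padded[:-2, 1:-1] + padded[2:, 1:-1]
                 + padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
        # Support for each label: q[a] = sum_b compat[a, b] * neigh[b], in [-1, 1].
        q = neigh @ compat.T
        # Reinforce supported labels, then renormalise per pixel.
        p = p * (1.0 + q)
        p = p / np.maximum(p.sum(axis=-1, keepdims=True), 1e-12)
    return p
```

With a compatibility matrix that rewards equal neighbouring labels, isolated mispredictions tend to be pulled towards the label of their surroundings.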

In terms of time frame, the topic is particularly well suited for the Projektmodul or a Master's thesis. For motivated students, however, it can also be carried out as part of the Forschungsmodul or a Bachelor's thesis.

If you are interested, please contact Daniel Kienzle.

For a long time, convolutional networks were used exclusively in image processing, but recently they are increasingly being replaced by the Transformer architecture or combined with it. A major advantage of the Transformer architecture is that it can be used very flexibly. This thesis aims to exploit that flexibility.

When a neural network is trained on a dataset, it can subsequently make predictions only for the classes contained in that dataset. To add further classes, the network has to be laboriously retrained on all of the data. If the old data are no longer available, one additionally runs into the problem of "catastrophic forgetting": the network no longer delivers results as good on the old data. This thesis aims to exploit the flexibility of the Transformer architecture to counteract these problems in the task of semantic segmentation. This prevents catastrophic forgetting on the one hand and, on the other, makes it possible to learn new classes without retraining the entire network.

This thesis is particularly well suited for the Projektmodul or a Master's thesis. The topic is very close to current research, which means that the expected results are uncertain, but a substantial gain in insight is possible.

If you are interested, please contact Daniel Kienzle.

Human Pose Estimation is the task of detecting human keypoints in images or videos. 2D Human Pose Estimation means localizing these keypoints in 2D coordinates in the image or video frame. Convolutional neural networks are the most common choice for such tasks. Recently, the Transformer architecture has made its way from natural language processing to vision tasks. It has the benefit of a global view instead of the local view of convolution operations. As it was not originally designed for vision tasks, some adaptations have to be made to make the architecture feasible for them. Many variants have been proposed recently, but most of them have not been evaluated for Human Pose Estimation. Theses on this topic should analyze the performance of different Transformer variants for Human Pose Estimation. Variants could include different basic architectures, target heads, architecture nuances/hyperparameters, etc.


If you are interested and want more information, please contact Katja Ludwig.

Semi-Supervised Learning is an active research field in computer vision with the goal of training neural networks with only a small labeled dataset and a lot of unlabeled data. For human pose estimation, this means that a large dataset of images of people is available, but only a small subset has annotated keypoints. Semi-supervised human pose estimation uses different techniques to train jointly on labeled and unlabeled images in order to improve the detection performance of the network. Popular methods are pseudo labels (the use of network predictions as annotations) and teacher-student approaches, where one network is improved by being trained by a second network.
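As a rough illustration of pseudo labels, the sketch below turns predicted keypoint heatmaps for an unlabeled image into keypoint annotations and keeps only confident predictions. The function name and the confidence threshold of 0.5 are assumptions for illustration, not a recommended setting.

```python
import numpy as np

def pseudo_labels_from_heatmaps(heatmaps, threshold=0.5):
    """Derive pseudo keypoint labels from network predictions.

    heatmaps: (K, H, W) array, one predicted heatmap per keypoint.
    Returns (coords, accepted): coords is (K, 2) with (x, y) per keypoint,
    accepted is a (K,) boolean mask of keypoints confident enough to be
    used as annotations (threshold is a hypothetical choice).
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)            # position of each heatmap maximum
    confidence = flat.max(axis=1)        # peak value used as confidence
    ys, xs = np.unravel_index(idx, (h, w))
    coords = np.stack([xs, ys], axis=1)
    return coords, confidence >= threshold
```

Only the accepted keypoints would then contribute to the loss on unlabeled images.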


If you are interested and want more information, please contact Katja Ludwig.

Note: The following topics are currently only available for practical courses (internships, "Projektmodul", ...) and possibly Bachelor's theses.


Driven by the massive progress in 2D Human Pose Estimation and related detection-based tasks over the last years, active research is steadily advancing to the next logical step: the reconstruction of the human pose in 3D space. And while existing multi-view or RGB-D motion capture systems are perfectly capable of this task, current research focuses on the difficult and highly under-constrained case of single-view RGB images and videos. Reliably estimating the 3D pose of a human from a single consumer-grade camera opens up a vast area of practical applications.


As in many computer vision topics, all current state-of-the-art methods in 3D human pose estimation (HPE) revolve around some form of convolutional neural network (CNN), Transformer network, or a combination of both. The main differences come from

  • the specific task definition
  • the pose representation, especially within the CNN/Transformer
  • the type of supervision (configuration and quantity of data and labels)
  • the runtime vs. fidelity trade-off

There is ongoing research in all of these areas, with year-to-year gains in precision, reliability and efficiency.


Thesis Topics

The research at our chair covers all the aforementioned concepts. We always have specific research questions that are suitable for a Bachelor's or Master's thesis as well as practical courses and internships. Given the speed of new developments and advancements in this field, the detailed topic for a thesis will be defined on demand, based on the current research at our chair and the prerequisites and interests of the student. Below are some topics that are suitable for a potential thesis or internship. If you are interested in the overall research field or one of the following topics, please contact

Moritz Einfalt


Pose Representations for Multi-Person 3D HPE

Coming from the current state in 2D human pose estimation, the quasi-standard method to represent the detection targets for pose-defining human keypoints in CNN/Transformer models is spatial 2D heatmaps. Retaining the spatial dimensions from input (image) to output (heatmaps, one per keypoint) is currently the best-performing mode. The naive transfer of this concept to the 3D task (i.e. the detection of keypoints in 3D space) is to use volumetric 3D heatmaps. However, the additional dimension in the network output makes this representation very costly, especially when a high spatial resolution in the predicted heatmaps is required. And while the approach can be feasible for single-person 3D HPE on tight image crops, it completely breaks down in the multi-person case on large images.
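The cost of the additional dimension can be estimated with a short calculation. The numbers below (17 keypoints, a 64 x 64 heatmap, 64 depth bins) are hypothetical and only serve to show how the output size scales.

```python
def heatmap_values(num_keypoints, height, width, depth=None):
    """Number of output values for 2D heatmaps (depth=None) or volumetric 3D heatmaps."""
    values = num_keypoints * height * width
    return values if depth is None else values * depth

# Hypothetical configuration, chosen only for illustration.
k, h, w, d = 17, 64, 64, 64
values_2d = heatmap_values(k, h, w)
values_3d = heatmap_values(k, h, w, d)
print(values_2d, values_3d, values_3d // values_2d)
```

The volumetric output grows linearly with the number of depth bins, so a higher depth resolution multiplies the memory and compute cost of the output layer accordingly.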


Current solutions try to factorize the 3D volume into smaller, more efficient 1D and 2D components [1]. This divides the learning task into a detection part (2D heatmaps) and a regression part (e.g. numerical regression of the depth component), see figure 1. Other approaches use a learned compact representation of 3D heatmaps from an integrated autoencoder [2], see figure 2. Both variants have disadvantages and can lead to ambiguities in the encoding of 3D keypoints of tightly grouped people in the image. Potential topics for theses under this research question include the comparison of different existing representation methods and the development of new representations under the constraints of efficiency or precision.
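One simple way to picture such a factorized representation is sketched below: the 2D location is detected as the maximum of a heatmap, and the depth is read from a regressed depth map at that location. This is a generic illustration under assumed array shapes and names, not the exact formulation used in [1].

```python
import numpy as np

def decode_factorized(heatmaps, depth_maps):
    """Decode 3D keypoints from 2D heatmaps plus regressed depth maps.

    heatmaps:   (K, H, W) detection output, one 2D heatmap per keypoint.
    depth_maps: (K, H, W) regression output, depth value per pixel and keypoint
                (hypothetical layout, assumed for this sketch).
    Returns a (K, 3) array with (x, y, z) per keypoint.
    """
    k, h, w = heatmaps.shape
    idx = heatmaps.reshape(k, -1).argmax(axis=1)   # detection part
    ys, xs = np.unravel_index(idx, (h, w))
    zs = depth_maps[np.arange(k), ys, xs]          # regression part
    return np.stack([xs.astype(np.float64), ys.astype(np.float64), zs], axis=1)
```

If two people overlap at the same pixel, a single depth value per location cannot describe both, which is one form of the ambiguity mentioned above.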



Figure 1: Mixed pose representation: Spatial detection task with 2D heatmaps + sparse regression of 3D keypoint locations. Image taken from [1].


Figure 2: Reconstructed volumetric heatmap (summed over the z-axis for visualization) with the autoencoder from [2]. Image taken from the JTA dataset [3].


Real-Time 3D HPE on Edge Devices

The current state of the art in monocular 3D HPE is already at a level of precision and reliability where it can be integrated into actual applications. This can range from analytical applications, where the motion of humans in 3D space is inferred and evaluated, to interactive scenarios, where the human body is used as an input mode to control other agents (robots, virtual characters, ...). However, most of the current best-performing monocular 3D HPE methods rely on very deep CNNs, large spatial input and output sizes and sometimes even the combination of multiple CNN/Transformer models for two-step person detection and pose estimation. Aside from the need for entire GPU servers for training, these models still require a dedicated high-end consumer or even professional GPU during inference (i.e. application) to reach real-time capabilities. This constraint massively hinders the development of new applications: it restricts the usage to stationary scenarios, where the recording device is connected to a powerful GPU machine. The true application potential lies in mobile applications, where the 3D HPE is performed directly on the recording device (read: smartphone).


One highly relevant research question is therefore the transfer of the current state of the art in 3D HPE to less powerful edge devices like smartphones. Existing approaches focus on single-shot architectures [4] (see figure 3), low-resolution image crops or CNN model compression [5] (see figure 4). Potential topics for theses under this research question include benchmarking and adapting existing methods and developing new strategies in teacher-student model compression.
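The basic idea of teacher-student model compression can be sketched as a combined loss: the small student model is trained both on the ground truth and on the output of a large teacher model. The mean-squared-error terms and the weighting factor `alpha` are assumptions for illustration and not necessarily the loss used in [5].

```python
import numpy as np

def distillation_loss(student, teacher, target, alpha=0.5):
    """Combine supervised and mimicking losses for a student model.

    student, teacher, target: arrays of identical shape (e.g. heatmaps).
    alpha: hypothetical weight between ground-truth and teacher supervision.
    """
    supervised = np.mean((student - target) ** 2)   # fit the annotations
    mimic = np.mean((student - teacher) ** 2)       # imitate the teacher
    return alpha * supervised + (1.0 - alpha) * mimic
```

Setting `alpha=1.0` recovers plain supervised training, while smaller values put more weight on the teacher's predictions.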



>> Mobile Application for Real-Time 3D HPE <<

We are currently looking for students who are interested in working on a standalone iOS mobile demo application with a complete 3D HPE pipeline. This project covers model compression of 2D and 3D HPE models, platform conversion as well as the development of the final application. The project is best suited for Bachelor's or Master's students who want to complete their internship ("Betriebspraktikum") or practical course ("Forschungs-/Projektmodul") at our lab and have some prior experience with iOS development. Ideally, two students will tackle the project as a team. Please contact Moritz Einfalt for further details and prerequisites.



Figure 3: Single-shot multi-person 3D HPE with Pandanet. Image taken from [4].


Figure 4: Real-time 3D HPE directly on a smartphone via CNN model compression. Image taken from [5].



[1] Mehta, Dushyant, et al. "XNect: Real-time multi-person 3D motion capture with a single RGB camera." ACM Transactions on Graphics (TOG) 39.4 (2020): 82-1.


[2] Fabbri, Matteo, et al. "Compressed volumetric heatmaps for multi-person 3d pose estimation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.


[3] Fabbri, Matteo, et al. "Learning to detect and track visible and occluded body joints in a virtual world." Proceedings of the European Conference on Computer Vision (ECCV). 2018.


[4] Benzine, Abdallah, et al. "Pandanet: Anchor-based single-shot multi-person 3d pose estimation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.


[5] Hwang, Dong-Hyun, et al. "Lightweight 3D human pose estimation network training using teacher-student learning." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2020.




Semantic segmentation entails the task of assigning every pixel in an image to one of multiple classes. The output can thus be interpreted as a mask, dividing the image into multiple zones.

Classically, this task is performed on RGB images. However, the color channels alone might in some cases be insufficient for a neural network to overcome potential ambiguities. Fortunately, in some instances we have additional information about the observed scene at hand, which we can use to enhance network performance. Such information commonly consists of depth maps, which assign each pixel a (relative) depth or disparity value. However, other information such as thermal maps (displaying the temperature) and polarization maps may also be of use.
The thesis on this topic will concern itself with methods to beneficially integrate these additional types of information.
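The simplest conceivable integration is early fusion, where the additional map is appended to the RGB image as a fourth input channel. The sketch below shows only this baseline; the function name and the min-max normalization are assumptions for illustration, and more elaborate fusion schemes are the actual subject of the thesis.

```python
import numpy as np

def early_fusion(rgb, extra):
    """Stack an additional per-pixel map onto an RGB image.

    rgb:   (H, W, 3) image with values in [0, 255].
    extra: (H, W) map, e.g. depth or temperature (any value range).
    Returns an (H, W, 4) float array with all channels scaled to [0, 1].
    """
    rgb = rgb.astype(np.float64) / 255.0
    lo, hi = extra.min(), extra.max()
    scale = hi - lo if hi > lo else 1.0          # avoid division by zero
    extra = (extra.astype(np.float64) - lo) / scale
    return np.concatenate([rgb, extra[..., None]], axis=-1)
```

The first layer of the segmentation network then has to accept four input channels instead of three.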

In case of interest, or for additional information, contact: Robin Schön




[1] : Fuqin Deng et al.,  "FEANet: Feature-Enhanced Attention Network for RGB-Thermal Real-time Semantic Segmentation",

[2] : Qishen Ha et al.,  "MFNet: Towards Real-Time Semantic Segmentation for Autonomous Vehicles with Multi-Spectral Scenes", 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)