Paper accepted for the IEEE International Conference on Image Processing 2022

The paper titled "Synchronized Audio-Visual Frames With Fractional Positional Encoding for Transformers in Video-to-Text Translation" by Philipp Harzig, Moritz Einfalt and Rainer Lienhart has been accepted for the IEEE International Conference on Image Processing 2022. The paper presents a novel way to synchronize audio and video features for the automated generation of textual video descriptions.


Video-to-text (VTT) is the task of automatically generating descriptions for short audio-visual video clips. It can, for example, help visually impaired people understand the scenes shown in a YouTube video. Transformer architectures have shown great performance in both machine translation and image captioning. In this work, we transfer promising approaches from image captioning and video processing to VTT and develop a straightforward Transformer architecture. We then extend this Transformer with a novel way of synchronizing audio and video features, which we call Fractional Positional Encoding (FPE). We run multiple experiments on the VATEX dataset, improve the CIDEr and BLEU-4 scores by 21.72 and 8.38 points compared to a vanilla Transformer network, and achieve state-of-the-art results on the MSR-VTT and MSVD datasets. In addition, our novel FPE increases the CIDEr score by a relative 8.6%.
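
The abstract does not spell out how FPE is computed, so the following is only an illustrative sketch of one possible reading, not the authors' implementation. It assumes the standard sinusoidal positional encoding from the original Transformer, assumes that audio and video features cover the same clip duration, and assumes that audio frames are placed on the video position axis by simple linear scaling, so that an audio frame can receive a non-integer (fractional) position. The function names `sinusoidal_encoding` and `fractional_positions` are hypothetical.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Sinusoidal positional encoding evaluated at a real-valued position.

    Assumption: the standard sin/cos formulation is used; it is defined
    for any real position, not only for integers.
    """
    enc = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        enc.append(math.sin(position * freq))
        enc.append(math.cos(position * freq))
    return enc[:d_model]


def fractional_positions(num_video, num_audio):
    """Map audio frame indices onto the video position axis.

    Assumption: both streams span the same clip, so audio frame j
    lies at position j * num_video / num_audio.
    """
    scale = num_video / num_audio
    return [j * scale for j in range(num_audio)]


# Hypothetical clip: 4 video frames and 8 audio frames.
video_pos = list(range(4))
audio_pos = fractional_positions(4, 8)

d_model = 8
video_pe = [sinusoidal_encoding(p, d_model) for p in video_pos]
audio_pe = [sinusoidal_encoding(p, d_model) for p in audio_pos]
```

Under these assumptions, an audio frame and a video frame that occur at the same moment receive the same encoding, while audio frames that fall between two video frames receive an encoding for an in-between position.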


Philipp Harzig, Moritz Einfalt and Rainer Lienhart. 2022. Synchronized audio-visual frames with fractional positional encoding for transformers in video-to-text translation. DOI: 10.1109/ICIP46576.2022.9897804