Paper accepted at the IEEE International Conference on Image Processing 2022

The paper "Synchronized Audio-Visual Frames With Fractional Positional Encoding for Transformers in Video-to-Text Translation" by Philipp Harzig, Moritz Einfalt, and Rainer Lienhart has been accepted at the IEEE International Conference on Image Processing (ICIP) 2022. In this paper, the authors investigate how audio and visual information from videos can be better combined for the automated generation of textual video descriptions.


Video-to-text (VTT) is the task of automatically generating descriptions for short audio-visual video clips. It can, for example, help visually impaired people understand the scenes shown in a YouTube video. Transformer architectures have shown great performance in both machine translation and image captioning. In this work, we transfer promising approaches from image captioning and video processing to VTT and develop a straightforward Transformer architecture. We then extend this Transformer with a novel way of synchronizing audio and video features, which we call Fractional Positional Encoding (FPE). We run multiple experiments on the VATEX dataset and improve the CIDEr and BLEU-4 scores by 21.72 and 8.38 points, respectively, compared to a vanilla Transformer network, and we achieve state-of-the-art results on the MSR-VTT and MSVD datasets. In addition, our novel FPE helps increase the CIDEr score by a relative 8.6%.
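The abstract does not spell out how FPE is computed, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes that video frames sit at integer positions, that audio frames covering the same clip are mapped onto that timeline at fractional positions, and that the standard sinusoidal positional encoding is then evaluated at those positions. The function names `sinusoidal_encoding` and `fractional_positions` and the frame counts are hypothetical.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Standard sinusoidal positional encoding at one position.

    The position may be fractional; sine and cosine are defined for any real value.
    """
    enc = []
    for i in range(0, d_model, 2):
        # i is the even dimension index, so the exponent is i / d_model.
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))
        if i + 1 < d_model:
            enc.append(math.cos(angle))
    return enc


def fractional_positions(n_video, n_audio):
    """Map audio frame indices onto the video frame timeline.

    Assumption (not from the source): both streams span the same clip,
    so audio frame j lands at position j * n_video / n_audio.
    """
    scale = n_video / n_audio
    return [j * scale for j in range(n_audio)]


# Hypothetical example: 4 video frames and 8 audio frames for the same clip.
video_pos = list(range(4))
audio_pos = fractional_positions(4, 8)

video_enc = [sinusoidal_encoding(p, 8) for p in video_pos]
audio_enc = [sinusoidal_encoding(p, 8) for p in audio_pos]
```

Under these assumptions, an audio frame and a video frame that occur at the same moment receive the same encoding (here, audio frame 2 and video frame 1), which is one way the two modalities could be kept in sync inside a single Transformer input sequence.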


Philipp Harzig, Moritz Einfalt and Rainer Lienhart. in press. Synchronized audio-visual frames with fractional positional encoding for transformers in video-to-text translation.