Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation
Since Vaswani et al. introduced the Transformer architecture, sequence transduction and machine translation have improved substantially. It is therefore natural to adapt the architecture to image captioning and video-to-text (VTT) translation.
In this project, we address the video-to-text (VTT) task and extend a standard Transformer to handle video inputs. We also investigate several improvements by adopting techniques from the domain of image captioning.
We present a simple way to align video and audio features independently of their respective sampling rates: we extend the positional encoding to support fractional positions.
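The idea of fractional positions can be illustrated with a minimal sketch. It assumes the sinusoidal positional encoding of Vaswani et al. and assumes that audio positions are expressed on the time axis of the video frames; the function names, the sampling rates (25 and 10 features per second), and the model dimension are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def positional_encoding(position, d_model=8):
    """Sinusoidal encoding evaluated at a real-valued (possibly fractional) position.

    Follows the formula of Vaswani et al.; nothing in it requires an integer position.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    encoding = []
    for two_i in range(0, d_model, 2):
        angle = position / (10000 ** (two_i / d_model))
        encoding.append(math.sin(angle))  # even dimension
        encoding.append(math.cos(angle))  # odd dimension
    return encoding


def fractional_positions(num_features, rate, reference_rate):
    """Map feature indices sampled at `rate` onto the axis of `reference_rate`."""
    return [k * reference_rate / rate for k in range(num_features)]


# Hypothetical sampling rates (features per second), for illustration only.
video_rate = 25.0
audio_rate = 10.0

# Video features keep integer positions; audio features land in between.
video_pos = fractional_positions(5, video_rate, video_rate)
audio_pos = fractional_positions(3, audio_rate, video_rate)

video_enc = [positional_encoding(p) for p in video_pos]
audio_enc = [positional_encoding(p) for p in audio_pos]
```

Under these assumptions, the second audio feature receives position 2.5, halfway between the third and fourth video features, so features that occur at the same time receive the same encoding regardless of the modality's sampling rate.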
- Harzig, Philipp, Moritz Einfalt, and Rainer Lienhart. "Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation." arXiv preprint arXiv:2112.14088 (2021).
- Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
For more information, please contact Philipp Harzig.