Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation

Since the introduction of the Transformer architecture by Vaswani et al., massive improvements have been made in sequence transduction and machine translation. It is therefore natural to adapt this architecture to image captioning and video-to-text translation (VTT).
In this project, we address the VTT task and extend a standard Transformer to cope with video inputs. Furthermore, we investigate several improvements by adopting various techniques from the domain of image captioning.


Generated descriptions for an example video from the validation split. On the left, four frames of the video with their frame numbers; on the right, the caption generated by each model. CC BY-NC-ND


We present a simple way to align video and audio features independently of their respective sampling rates: we extend the positional encoding to support fractional positions, yielding a fractional positional encoding (FPE).

The default positional encoding for audio and video frames (top) in comparison with the FPE (bottom) for an example video. The video has 32 I3D frames and 11 audio frames; the lengths (d) of the audio and video sequences differ. CC BY-NC-ND
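
The idea can be illustrated with a minimal sketch. The snippet below (not the paper's implementation; the function name sinusoidal_pe and the linear rescaling of the audio positions are assumptions made for illustration) evaluates the standard sinusoidal positional encoding at fractional positions, mapping the 11 audio frames of the example above onto the timeline of the 32 I3D video frames.

```python
import numpy as np

def sinusoidal_pe(positions, d_model):
    # Standard sinusoidal positional encoding from "Attention Is All You Need",
    # evaluated at arbitrary (possibly fractional) positions.
    positions = np.asarray(positions, dtype=np.float64)[:, None]      # (n, 1)
    dims = np.arange(d_model)[None, :]                                 # (1, d)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / float(d_model))
    angles = positions * angle_rates                                   # (n, d)
    pe = np.empty_like(angles)
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions: cosine
    return pe

# Example video from the figure: 32 I3D video frames and 11 audio frames.
n_video, n_audio, d_model = 32, 11, 512

# Video frames keep their integer positions 0 .. 31.
video_positions = np.arange(n_video, dtype=np.float64)

# Audio positions are rescaled onto the video timeline (assumed linear here),
# which yields fractional positions so that temporally co-occurring audio and
# video frames receive nearly identical positional encodings.
audio_positions = np.arange(n_audio) * (n_video - 1) / (n_audio - 1)

video_pe = sinusoidal_pe(video_positions, d_model)   # shape (32, 512)
audio_pe = sinusoidal_pe(audio_positions, d_model)   # shape (11, 512)
```

With this rescaling, the first and last audio frames in the sketch share the encodings of the first and last video frames, and the remaining audio frames fall on fractional positions in between, which is what aligns the two modalities despite their different sampling rates.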

References

  • Harzig, Philipp, Moritz Einfalt, and Rainer Lienhart. "Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation." arXiv preprint arXiv:2112.14088 (2021).
  • Vaswani, Ashish, et al. "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (2017).


For more information, please contact Philipp Harzig.
