Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos

Creative Commons License

Waibel A., Behr M., Yaman D., Eyiokur F. I., Nguyen T., Mullov C., et al.

2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, ICASSPW 2023, Rhodes Island, Greece, 4-10 June 2023

  • Publication Type: Conference Paper / Full Text
  • DOI Number: 10.1109/icasspw59220.2023.10193719
  • City: Rhodes Island
  • Country: Greece
  • Keywords: end-to-end video translation, lip generation, speech translation, text-to-speech, voice conversion
  • Istanbul Technical University Affiliated: Yes


In this paper, we propose a neural end-to-end system for voice-preserving, lip-synchronous video translation. The system combines multiple component models to produce a video of the original speaker speaking in the target language. The output is lip-synchronous with the target speech, yet maintains the emphases in speech, voice characteristics, and face video of the original speaker. The result is a video of a speaker speaking in another language without the speaker actually knowing that language. For the evaluation, we present a user study of the complete system as well as separate evaluations of the individual components. Since no dataset is available for evaluating the whole system, we collect our own test set. The results indicate that our system can generate convincing videos of the original speaker speaking the target language while preserving the original speaker's characteristics.