New Research Aligns Text to Speech Effortlessly | Google

Overcome the sequence length mismatch without explicitly annotating it.

Mandar Karhade, MD, PhD
Published in Towards AI
6 min read · Aug 20, 2023

--

TLDR

Training a text-speech (multimodal) model comes with its own problems. Because audio is sampled at a high rate, the audio sequence is much longer than the corresponding text sequence. To train on text and audio simultaneously, we need to overcome this length disparity, and do it lazily, that is, without having to generate explicitly annotated training data. This paper solves that problem.
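To get a feel for how large the mismatch can be, here is a minimal back-of-the-envelope sketch. Every number in it (the 16 kHz sample rate, the 10 ms frame hop, the 5-second utterance, the 15-token transcript) is an assumption picked for illustration, not a value taken from the paper.

```python
# Illustrative numbers only; none of these values come from the paper.
SAMPLE_RATE_HZ = 16_000   # assumed audio sample rate
HOP_MS = 10               # assumed hop between consecutive feature frames
DURATION_S = 5            # assumed utterance length in seconds
NUM_TEXT_TOKENS = 15      # assumed token count for the matching transcript


def sequence_lengths(sample_rate_hz, hop_ms, duration_s):
    """Return (raw_samples, feature_frames) for one utterance."""
    raw_samples = sample_rate_hz * duration_s
    feature_frames = duration_s * 1000 // hop_ms
    return raw_samples, feature_frames


raw_samples, feature_frames = sequence_lengths(SAMPLE_RATE_HZ, HOP_MS, DURATION_S)

print(f"raw audio samples: {raw_samples}")
print(f"feature frames:    {feature_frames}")
print(f"text tokens:       {NUM_TEXT_TOKENS}")
print(f"frames per token:  {feature_frames / NUM_TEXT_TOKENS:.1f}")
```

Under these assumed settings, a single short sentence maps to tens of audio frames per text token, which is the gap the model has to bridge before the two sequences can be trained together.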
