New Research Aligns Text to Speech Effortlessly | Google
Overcome the sequence-length mismatch between text and audio without explicitly specifying the alignment.
6 min read · Aug 20, 2023
TLDR
Training a text-speech (multimodal) model comes with its own problems. Because audio is sampled at a high rate, the audio sequence is much longer than the corresponding text sequence. To train on text and audio together, we need to bridge this length disparity, ideally in a lazy way that does not require explicitly annotated alignment data. This paper addresses that problem.
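To get a feel for the size of the mismatch, here is a rough back-of-the-envelope sketch. The numbers are illustrative assumptions, not values from the paper: a 16 kHz sample rate, a 10 ms hop between feature frames, a 3-second clip, and character-level text. The helper `audio_frames` is a hypothetical name used only for this example.

```python
def audio_frames(duration_s: float, sample_rate_hz: int = 16_000, hop_ms: float = 10.0) -> int:
    """Count feature frames for a clip, assuming a fixed hop between frames."""
    samples = int(duration_s * sample_rate_hz)         # raw audio samples in the clip
    hop_samples = int(sample_rate_hz * hop_ms / 1000)  # samples skipped per frame
    return samples // hop_samples


# Assumed example: a short sentence spoken in roughly 3 seconds.
text = "the quick brown fox jumps over the lazy dog"
duration_s = 3.0

num_chars = len(text)                   # text sequence length (characters)
num_samples = int(duration_s * 16_000)  # raw waveform length
num_frames = audio_frames(duration_s)   # length after framing

print(f"text tokens : {num_chars}")
print(f"raw samples : {num_samples}")
print(f"audio frames: {num_frames}")
```

Even after framing, the audio side in this toy setup is several times longer than the text side, and the raw waveform is longer by orders of magnitude. That gap is what an alignment mechanism has to absorb.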