New Research Aligns Text to Speech Effortlessly | Google

Overcome sequence length mismatch without explicitly specifying it.

Published in

Towards AI

6 min read · Aug 20, 2023

TLDR

Training a joint text-speech (multimodal) model comes with its own problems. Because audio is sampled at a high rate, the audio sequence is much longer than the corresponding text sequence. To train on text and audio simultaneously, we need to overcome this length disparity, ideally lazily, without having to generate explicitly annotated training data. This paper addresses that problem.
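To make the length disparity concrete, here is a rough back-of-the-envelope sketch. The 16 kHz sample rate, the 10 ms frame hop, and the 15-token transcript are illustrative assumptions, not figures from the paper, and the helper names are hypothetical.

```python
def audio_frames(duration_s: float, sample_rate: int = 16_000, hop_ms: float = 10.0) -> int:
    """Count feature frames for a clip, assuming one frame every `hop_ms` milliseconds."""
    total_samples = int(duration_s * sample_rate)
    hop_samples = int(sample_rate * hop_ms / 1000)  # samples between consecutive frames
    return total_samples // hop_samples


def length_ratio(duration_s: float, n_text_tokens: int) -> float:
    """How many audio frames correspond, on average, to one text token."""
    return audio_frames(duration_s) / n_text_tokens


if __name__ == "__main__":
    # Assumed example: a 5-second utterance whose transcript has 15 tokens.
    frames = audio_frames(5.0)
    ratio = length_ratio(5.0, 15)
    print(f"{frames} audio frames vs 15 text tokens (about {ratio:.1f} frames per token)")
```

Even after downsampling raw samples to frames, the audio side under these assumptions is tens of times longer than the text side, which is the mismatch the model has to bridge.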

Written by Mandar Karhade, MD, PhD.