This seems like a massive improvement for openly available local ASR. Even the 300M model outperforms whisper-large-v3 according to the paper's benchmarks.
Not sure; I recorded 3 seconds of voice (a single sentence) and the HF demo misrecognized about half of the words.
Does anyone else feel like they buried the lede?
> Omnilingual ASR was designed as a community-driven framework. People around the world can extend Omnilingual ASR to new languages by using just a few of their own samples.
The world just got smaller
Only a few GB of weights will recognize speech in 1600+ languages.
Freely downloadable and usable by anyone for almost anything.
We truly live in the future.
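For anyone who wants to try it locally, here is a minimal sketch of what inference might look like. The import path, pipeline class, model name, and language-code format are all assumptions on my part; check the facebookresearch/omnilingual-asr README for the real API.

    # Hypothetical sketch -- module path, class name, model id and call signature are assumptions.
    from omnilingual_asr.inference import ASRInferencePipeline  # assumed import path

    # Load one of the released checkpoints (the release spans ~300M up to 7B-parameter variants).
    pipeline = ASRInferencePipeline(model="omniASR_LLM_300M")  # assumed model identifier

    # Language tags assumed to follow an <iso639-3>_<script> convention, e.g. Belarusian in Cyrillic.
    transcripts = pipeline.transcribe(["sample.wav"], lang=["bel_Cyrl"])
    print(transcripts[0])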
Seeing the absurd number of languages made me think of the Norm Macdonald joke:
Music is the universal language, but one day soon it will be replaced by Chinese.
How hard is it to make TTS out of this? A few independent journalists from Belarus asked for TTS in their language, but I am no expert, was thinking about re-using Mozilla's work. What's the easiest way to get working TTS for a language?
EDIT: My bad, please disregard; as akreal pointed out, the MMS TTS models don't use the SSL models.
Original post:
You can use the OmniASR SSL models instead of their older MMS models to create TTS models: https://github.com/ylacombe/finetune-hf-vits
As far as I understand, the MMS TTS models are trained from scratch (section 7.1 of [1]); they do not employ any SSL models. So the OmniASR SSL models are not useful here.
What might be interesting is the newly released OmniASR data, because the MMS data, which was used for the MMS TTS, was never released.
Also, OmniASR can be used to transcribe untranscribed speech and train a TTS on the result (rough sketch below).
[1] MMS paper: https://arxiv.org/pdf/2305.13516
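To make that last point concrete, here's a rough sketch of the pseudo-labeling loop: transcribe untranscribed clips with OmniASR and keep the (audio, text) pairs as TTS training data. The pipeline class and its call signature are assumptions; only the overall flow is the point.

    # Sketch: bootstrap a TTS corpus from untranscribed speech via ASR pseudo-labels.
    # `ASRInferencePipeline` and its arguments are assumptions, not the confirmed API.
    import csv
    from pathlib import Path

    from omnilingual_asr.inference import ASRInferencePipeline  # assumed import path

    pipeline = ASRInferencePipeline(model="omniASR_LLM_300M")  # assumed model identifier

    rows = []
    for clip in sorted(Path("untranscribed_clips").glob("*.wav")):
        text = pipeline.transcribe([str(clip)], lang=["bel_Cyrl"])[0]  # assumed call signature
        if text.strip():  # a real setup would also filter by ASR confidence and clip length
            rows.append((clip.name, text))

    # Write filename|transcript metadata, the layout many VITS-style training scripts expect.
    with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="|").writerows(rows)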
You’re completely right, I misremembered. I edited my post.
Meta cheated with the MMS models: they didn't use a phonemizer step, which means the voices either just won't work or will sound very strange. ASR data is usually not quite right for TTS anyway. Not really answering your question, but many of these languages are already covered in MMS. Try them: https://huggingface.co/spaces/willwade/sherpa-onnx-tts
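If you do go the MMS-via-sherpa-onnx route, local synthesis looks roughly like the sketch below; the model paths are placeholders for one of the exported MMS VITS voices, and the exact config fields are best checked against the sherpa-onnx docs.

    import sherpa_onnx
    import soundfile as sf

    # Placeholder paths: point these at an exported MMS VITS voice (model.onnx + tokens.txt).
    config = sherpa_onnx.OfflineTtsConfig(
        model=sherpa_onnx.OfflineTtsModelConfig(
            vits=sherpa_onnx.OfflineTtsVitsModelConfig(
                model="vits-mms-bel/model.onnx",
                tokens="vits-mms-bel/tokens.txt",
            ),
            num_threads=2,
        ),
    )
    tts = sherpa_onnx.OfflineTts(config)

    audio = tts.generate("Прывітанне, свет!")  # "Hello, world!" in Belarusian
    sf.write("hello.wav", audio.samples, audio.sample_rate)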
TFA says it's extremely easy to add new languages with just a few examples. I didn't see specifics on how "few" that really is, though.
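For what it's worth, the way TFA describes it, you provide a handful of paired (audio, transcript) samples from the new language and the model conditions on them in-context, so the flow might look something like this. Every name and signature here is hypothetical, purely to illustrate the idea; the real interface is in the repo.

    # Conceptual sketch of few-shot language extension via in-context examples.
    from omnilingual_asr.inference import ASRInferencePipeline  # assumed import path

    pipeline = ASRInferencePipeline(model="omniASR_LLM_7B")  # assumed zero-shot-capable checkpoint

    # A few paired recordings + transcripts in the otherwise unseen language.
    context_examples = [
        ("example1.wav", "transcript of example one"),
        ("example2.wav", "transcript of example two"),
        ("example3.wav", "transcript of example three"),
    ]

    # The decoder conditions on the in-context pairs, then transcribes the new clip.
    print(pipeline.transcribe_with_context("new_clip.wav", examples=context_examples))  # assumed method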
This is ASR not TTS though.
The global language explorer is fascinating - great work, guys.
https://aidemos.atmeta.com/omnilingualasr/language-globe
We are getting closer to the Babel fish... at least for the Earth!
Any insights on latency?
HF Demo: https://huggingface.co/spaces/facebook/omniasr-transcription...
GitHub: https://github.com/facebookresearch/omnilingual-asr
Thanks! I've added those links to the toptext as well.