Failed attempt at free multilingual TTS by abusing IPA
My recent job has been developing a very fast TTS system for our internal use. I can't share too many details about it because of an NDA. But I do want to share some findings from messing with our system.
- My boss is OK with me sharing these findings. We are looking to open source it soon anyway. It's not a breakthrough or anything. Just well engineered and fast.
- To make our model perform better, our system first converts the input text into the International Phonetic Alphabet (IPA). That is then fed into the speech synthesis model.
With some basic phonetics training, I thought to myself: IPA is supposed to be universal. Can I, in principle, train a model on English, then force it to speak Esperanto by feeding it the corresponding IPA? More far-fetched, can I make it speak Japanese or Chinese?
Kinda. But not really.
The International Phonetic Alphabet is a writing system that allows us to write down the sounds of human languages. It's not a language itself. It records how the spoken language sounds rather than how it is spelled. For example, the sentence "I would just like to interject for a moment" is first converted into
aɪ wʊd dʒˈʌst lˈaɪk tʊ ˌɪntɚdʒˈɛkt fɚɹə mˈoʊmənt
The IPA letters look weird. But if you squint hard enough, you can almost see how it spells out the pronunciation.
Let's try an actual example. The sentence
It is possible for a rocket to reach Jupiter, but it would require significant resources and technology.
is pronounced as
ɪɾ ɪz pˈɑːsᵻbəl fɚɹɚ ɹˈɑːkɪt tə ɹˈiːtʃ dʒˈuːpɪɾɚ, bˌʌt ɪt wʊd ɹᵻkwˈaɪɚ sɪɡnˈɪfɪkənt ɹᵻsˈoːɹsᵻz ænd tɛknˈɑːlədʒi.
And our TTS system generates the following audio. IMPORTANT: This is a private model I trained for my own use. My company does not own the rights to this audio, nor does it use the model in production:
We can, for example, set the English to IPA conversion to use a British accent instead of a US accent. This is the same sentence, but with a British-ish accent:
That kinda works, doesn't it? But something feels slightly off. Anyway, now the fun part. The same sentence, but in Esperanto:
Eblas ke raketo atingu Jupitero'n, sed ĝi postulus signifajn rimedojn kaj teknologion.
Phonemized into
ˈeblas ke rakˈeto atˈinɡu jˌupitˈeron, sed dʒi postˈulus siɡnˈifaɪn rimˈedoɪn kaɪ tˌeknoloɡˈion
It does pronounce the sentence. But a lot of the sounds are missing. And the accent is off. Esperanto always puts the stress on the second-to-last vowel. For example, the first word "Eblas" should have the E stressed, but it wasn't. The entire "raketo" seems to be pronounced "faketo". The "g" in teknologion is pronounced as "d" even though the IPA does show 'g'. And so on. These issues convince me there are 2 problems:
- The speech synthesis model is fed data so far out of distribution that it can't handle it
- IPA does not contain enough information to handle the accent
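The second-to-last-vowel rule is simple enough to sketch in a few lines. This is only an illustration working on Esperanto spelling, not how our phonemizer does it: the function name `mark_stress` is made up, and it ignores elided forms such as "Jupitero'n" and secondary stress.

```python
# Minimal sketch: put the primary stress mark before the second-to-last
# vowel of an Esperanto word. Works on spelling, where "j" and "ŭ" are
# consonants, so only a/e/i/o/u count as vowels.
VOWELS = set("aeiou")
STRESS = "\u02c8"  # IPA primary stress mark


def mark_stress(word: str) -> str:
    positions = [i for i, ch in enumerate(word.lower()) if ch in VOWELS]
    if len(positions) < 2:
        # Words with a single vowel (ke, sed, kaj) are left unmarked.
        return word
    i = positions[-2]
    return word[:i] + STRESS + word[i:]


if __name__ == "__main__":
    for w in ["eblas", "raketo", "atingu", "postulus", "teknologion", "kaj"]:
        print(w, "->", mark_stress(w))
```

The placement of the mark right before the vowel copies the convention in the transcription above (for example rakˈeto), rather than marking the start of the syllable.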
I knew this would be a long shot. But I was hoping that the IPA would be universal enough that I could just feed it into the model and make it speak a language it wasn't trained on.
It failed even harder when I forced it to speak Japanese. It tried to say "私は日本語を話せません" ("I can't speak Japanese"). But some phonemes are missing. So it turned out really weird. It also pronounced ん as "ennn" instead of "nnnn". Sad but I tried.
Maybe the input format needs to be improved. Instead of treating each IPA character as a separate token, each character could be represented as 5 vectors denoting features of different mouth movements. Maybe. I'm too lazy to try.
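A rough way to see how far out of distribution the Esperanto input is: collect the IPA symbols in the Esperanto transcription and subtract the ones that show up in the English transcriptions. The sketch below only uses the three transcriptions from this post, which is a tiny sample and not the model's real training vocabulary, and it compares single characters, so diphthongs and long vowels are not treated as units.

```python
# Minimal sketch: which IPA symbols in the Esperanto sentence never occur
# in the English sample transcriptions from this post?
ENGLISH_IPA = (
    "aɪ wʊd dʒˈʌst lˈaɪk tʊ ˌɪntɚdʒˈɛkt fɚɹə mˈoʊmənt "
    "ɪɾ ɪz pˈɑːsᵻbəl fɚɹɚ ɹˈɑːkɪt tə ɹˈiːtʃ dʒˈuːpɪɾɚ, bˌʌt ɪt wʊd "
    "ɹᵻkwˈaɪɚ sɪɡnˈɪfɪkənt ɹᵻsˈoːɹsᵻz ænd tɛknˈɑːlədʒi."
)
ESPERANTO_IPA = (
    "ˈeblas ke rakˈeto atˈinɡu jˌupitˈeron, sed dʒi postˈulus "
    "siɡnˈifaɪn rimˈedoɪn kaɪ tˌeknoloɡˈion"
)

# Spaces, punctuation and the two stress marks are not phonemes.
IGNORE = set(" ,.\u02c8\u02cc")


def symbols(ipa: str) -> set[str]:
    return {ch for ch in ipa if ch not in IGNORE}


unseen = symbols(ESPERANTO_IPA) - symbols(ENGLISH_IPA)

if __name__ == "__main__":
    print(sorted(unseen))
```

On these samples the unseen set should include the plain r, e and j, since the English transcriptions use ɹ and ɛ instead. That at least fits with "raketo" coming out wrong, though it proves nothing about the real training data.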
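To make the feature idea a bit more concrete, here is a minimal sketch. Everything in it is hypothetical: the five features (vowel, voiced, nasal, place, manner), their numeric codes and the tiny phoneme table are my own picks for illustration, and none of this has been tried on the actual model.

```python
# Hypothetical articulatory feature table. The five columns and their
# integer codes are illustrative assumptions, not a standard encoding.
FEATURES = ("vowel", "voiced", "nasal", "place", "manner")
PLACE = {"labial": 0, "alveolar": 1, "velar": 2}
MANNER = {"stop": 0, "fricative": 1}


def consonant(voiced: bool, nasal: bool, place: str, manner: str) -> tuple:
    # The first slot is 0 because every entry below is a consonant.
    return (0, int(voiced), int(nasal), PLACE[place], MANNER[manner])


PHONEMES = {
    "p": consonant(False, False, "labial", "stop"),
    "b": consonant(True, False, "labial", "stop"),
    "m": consonant(True, True, "labial", "stop"),
    "t": consonant(False, False, "alveolar", "stop"),
    "d": consonant(True, False, "alveolar", "stop"),
    "n": consonant(True, True, "alveolar", "stop"),
    "s": consonant(False, False, "alveolar", "fricative"),
    "z": consonant(True, False, "alveolar", "fricative"),
    "k": consonant(False, False, "velar", "stop"),
    "\u0261": consonant(True, False, "velar", "stop"),  # IPA g (U+0261)
}


def differing_features(a: str, b: str) -> list[str]:
    """Names of the features on which two phonemes disagree."""
    return [
        name
        for name, x, y in zip(FEATURES, PHONEMES[a], PHONEMES[b])
        if x != y
    ]


if __name__ == "__main__":
    print(differing_features("\u0261", "d"))
    print(differing_features("p", "z"))
```

With a table like this, ɡ and d differ only in place, so a model that sees features instead of opaque tokens would at least know the two are neighbours. Whether that helps with unseen languages is an open question.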
If you happen to be interested in our fast TTS system (it can synthesize at more than 60x realtime, with high concurrency), please contact me or get in contact with getlumina.com. We don't advertise it on the website yet, nor is it related to this post. We do work on it as an internal project. Again, we are looking to open source it soon.
Finally, this post is not an advertisement. I just want to write down the results of my experiments.
Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes I do AI research. Chronic VRChat addict
- marty1885 \at protonmail.com
- Matrix: @clehaxze:matrix.clehaxze.tw
- Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df