- Abstract
- Zero-Shot Style Transfer
- Multi-Level Style Control
- Cross-Lingual Style Transfer
- Speech-to-Singing Style Transfer
- Ablation Study
- Clustering Style Encoder
- Style Control Extended Experiments
Abstract
Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles (including singing method, emotion, rhythm, technique, and pronunciation) from audio and text prompts. However, the multifaceted nature of singing styles poses a significant challenge for effective modeling, transfer, and control. Furthermore, current SVS models often fail to generate singing voices rich in stylistic nuances for unseen singers. To address these challenges, we introduce TCSinger, the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control. Specifically, TCSinger comprises three primary modules: 1) the clustering style encoder employs a clustering vector quantization model to stably condense style information into a compact latent space; 2) the Style and Duration Language Model (S&D-LM) concurrently predicts style information and phoneme duration, which benefits both; 3) the style adaptive decoder uses a novel mel-style adaptive normalization method to generate singing voices with enhanced details. Experimental results show that TCSinger outperforms all baseline models in synthesis quality, singer similarity, and style controllability across various tasks, including zero-shot style transfer, multi-level style control, cross-lingual style transfer, and speech-to-singing style transfer.
Note: We conduct all tasks in the zero-shot scenario, with training and testing on cross-lingual speech and singing data.
Zero-Shot Style Transfer
To assess the performance of TCSinger and baseline models in the zero-shot style transfer task, we randomly select samples with unseen singers from the test set as targets and different utterances from the same singers to form prompts.
1.Target Word: 又 站 在 你 家 的 门 口 我 们 重 复 沉 默
Prompt: 终 于 你 开 口 向 我 诉 说 她 有 多 温 柔
Successfully transferring the timbre, resonance in pop singing method, mixed voice technique, pronunciation, rhythm, and pitch transition style.
Prompt | Ground Truth |
---|---|
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
2.Target Word: 我 听 见 雨 滴 落 在 青 青 草 地 SP 我 听 见 远 方 下 课 钟 声 响 起
Prompt: 可 是 我 没 有 听 见 你 的 声 音 SP 认 真 呼 唤 我 姓 名
Successfully transferring the timbre, pronunciation, pitch transition style, and rhythm.
Prompt | Ground Truth |
---|---|
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
3.Target Word: 也 不 是 真 的 AP 不 会 想 你 AP 全 都 不 是 真 的 AP 是 骗 自 己
Prompt: 让 我 这 样 吧 SP 并 不 是 真 的 AP 路 过 而 已
Successfully transferring the timbre, pronunciation, pitch transition style, and rhythm.
Prompt | Ground Truth |
---|---|
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
4.Target Word: 谁 娶 了 多 愁 善 感 的 你
Prompt: 谁 把 你 的 长 发 盘 起
Successfully transferring the timbre, pronunciation, pitch transition style, and rhythm.
Prompt | Ground Truth |
---|---|
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
5.Target Word: settled down that you AP found a girl and you AP
Prompt: I heard AP that you are AP
Successfully transferring the timbre, pronunciation, pitch transition style, rhythm, and glissando technique.
Prompt | Ground Truth |
---|---|
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
6.Target Word: you belong with me belong with me
Prompt: standing by and waiting at your back door SP all this time how could you not know baby
Successfully transferring the timbre, pronunciation, pitch transition style, rhythm, and vibrato technique.
Prompt | Ground Truth |
---|---|
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
Multi-Level Style Control
To enable style control in the baselines, we add global and phoneme-level text embeddings to each baseline model and then compare them with TCSinger using multi-level text prompts. We conduct both parallel and non-parallel experiments, depending on how the target styles are chosen. For global styles, we specify a singing method (bel canto or pop) and an emotion (happy or sad) for each test target. For phoneme-level styles, we select zero, one, or several techniques (mixed voice, falsetto, breathy, vibrato, glissando, and pharyngeal) for each phoneme of the target content.
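The phoneme-level prompts in the samples on this page are written as lists of tokens such as `'AH1(1,5,6)'`: a phoneme followed by one or more technique ids (0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando). The following is a minimal sketch, assuming only that token format, of how such a sequence could be parsed and validated. The helpers `parse_token` and `parse_sequence` are hypothetical names introduced for illustration and are not part of TCSinger.

```python
import re

# Technique ids as listed in the legend used on this page.
TECHNIQUES = {
    0: "no technique",
    1: "mix",
    2: "falsetto",
    3: "breathy",
    4: "pharyngeal",
    5: "vibrato",
    6: "glissando",
}

# Assumed token shape: a phoneme, then comma-separated ids in parentheses,
# for example 'AH1(1,5,6)' or 'AP(0)'.
TOKEN_RE = re.compile(r"^([A-Za-z0-9]+)\((\d+(?:,\d+)*)\)$")


def parse_token(token):
    """Hypothetical helper: split one token into (phoneme, [technique names])."""
    match = TOKEN_RE.match(token.strip())
    if match is None:
        raise ValueError(f"malformed token: {token!r}")
    phoneme = match.group(1)
    names = []
    for raw_id in match.group(2).split(","):
        tech_id = int(raw_id)
        if tech_id not in TECHNIQUES:
            raise ValueError(f"unknown technique id {tech_id} in {token!r}")
        names.append(TECHNIQUES[tech_id])
    return phoneme, names


def parse_sequence(tokens):
    """Hypothetical helper: parse a whole phoneme-level prompt."""
    return [parse_token(token) for token in tokens]


if __name__ == "__main__":
    example = ["W(1,6)", "EY1(2,6)", "AP(0)"]
    for phoneme, names in parse_sequence(example):
        print(phoneme, names)
```

A check like this flags tokens with a missing parenthesis or an id outside 0 to 6 before a prompt is passed to any model.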
Parallel Style Control
In the parallel experiments, we randomly select unseen audio from the test set, using the GT global style and phoneme-level techniques as the target.
1.Target Word: 你 是 魔 鬼 中 的 天 使 所 以 送 我 心 碎 的 方 式 AP 是 让 我 笑 到 最 后 AP
Global Text Prompt (Singing Method and Emotion): bel canto, sad
Phoneme-Level Text Prompt (Technique Sequence):
['AP(0)', 'n(1)', 'i(1)', 'sh(1)', 'i(1)', 'm(1)', 'o(1)', 'g(1)', 'uei(1)', 'zh(1)', 'ong(1)', 'd(1)', 'e(1)', 't(1)', 'ian(1)', 'sh(1)', 'i(1)', 's(1)', 'uo(1)', 'i(1)', 's(1)', 'ong(1)', 'uo(1)', 'x(1)', 'in(1)', 's(1)', 'uei(1)', 'd(1)', 'e(1)', 'f(1)', 'ang(1)', 'sh(1)', 'i(1)', 'AP(0)', 'sh(1)', 'i(1)', 'r(1)', 'ang(1)', 'uo(1)', 'x(1)', 'iao(1)', 'd(1)', 'ao(1)', 'z(1)', 'uei(1)', 'h(1)', 'ou(1)', 'AP(0)']
(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)
Successfully controlling the global singing method and emotion, and the phoneme-level techniques of glissando and mixed voice.
Ground Truth |
---|
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
2.Target Word: 宁 愿 选 择 留 恋 不 放 手 AP 等 到 风 景 都 看 透 AP 也 许 你 会 陪 我 看 细 水 AP 长 流
Global Text Prompt (Singing Method and Emotion): pop, sad
Phoneme-Level Text Prompt (Technique Sequence):
['n(1)', 'ing(1)', 'van(1)', 'x(1)', 'van(1)', 'z(1)', 'e(1)', 'l(2)', 'iou(2)', 'l(2)', 'ian(2)', 'b(2)', 'u(2)', 'f(2)', 'ang(2)', 'sh(2)', 'ou(2)', 'AP(0)', 'd(2)', 'eng(2)', 'd(2)', 'ao(2)', 'f(2)', 'eng(2)', 'j(2)', 'ing(2)', 'd(2)', 'ou(2)', 'k(2)', 'an(2)', 't(2)', 'ou(2)', 'AP(0)', 'ie(2)', 'x(2)', 'v(2)', 'n(2)', 'i(2)', 'h(2)', 'uei(2)', 'p(2)', 'ei(2)', 'uo(2)', 'k(2)', 'an(2)', 'x(1)', 'i(1)', 'sh(1)', 'uei(1)', 'AP(0)', 'ch(1)', 'ang(1)', 'l(1)', 'iou(1)']
(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)
Successfully controlling the global singing method and emotion, and the phoneme-level techniques of falsetto and mixed voice.
Ground Truth |
---|
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
Non-Parallel Style Control
In the non-parallel experiments, global styles and six techniques are randomly yet appropriately assigned.
1.Target Word: remember us this way way AP
Global Text Prompt (Singing Method and Emotion): pop, sad
Phoneme-Level Text Prompt (Technique Sequence):
['R(1)', 'IH0(1)', 'M(1)', 'EH1(1)', 'M(1)', 'B(1)', 'ER0(1)', 'AH1(1,5,6)', 'S(1,5,6)', 'DH(1)', 'IH1(1)', 'S(1)', 'W(1,6)', 'EY1(2,6)', 'W(2)', 'EY1(2)', 'AP(0)']
(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)
Successfully controlling the global singing method and emotion, and the phoneme-level techniques of vibrato, glissando, falsetto, and mixed voice.
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
2.Target Word: SP I remember tears streaming down your face when I said I’ll never let you go AP
Global Text Prompt (Singing Method and Emotion): pop, sad
Phoneme-Level Text Prompt (Technique Sequence):
['SP(0)', 'AY1(2)', 'R(2)', 'IH0(2)', 'M(2)', 'EH1(2)', 'M(2)', 'B(2)', 'ER0(2)', 'T(2)', 'IH1(2)', 'R(2)', 'Z(2)', 'S(2)', 'T(2)', 'R(2)', 'IY1(2)', 'M(2)', 'IH0(2)', 'NG(2)', 'D(2)', 'AW1(2)', 'N(2)', 'Y(2)', 'UH1(2)', 'R(2)', 'F(2)', 'EY1(2)', 'S(2)', 'HH(1)', 'W(1)', 'EH1(1)', 'N(1)', 'AY1(1)', 'S(1)', 'EH1(1)', 'D(1)', 'AY1(1)', 'L(1)', 'N(1)', 'EH1(1)', 'V(1)', 'ER0(1)', 'L(1)', 'EH1(1)', 'T(1)', 'Y(1)', 'UW1(1)', 'G(1)', 'OW1(1)', 'AP(0)']
(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)
Successfully controlling the global singing method and emotion, and the phoneme-level techniques of vibrato, falsetto, and mixed voice.
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
Cross-Lingual Style Transfer
To test the zero-shot cross-lingual style transfer performance of various models, we run inference on unseen test data in which the prompt and the target have lyrics in different languages (for example, English and Chinese).
1.Target Word: I love you baby SP trust in me when I say
Prompt: 让 我 掉 下 眼 泪 的 不 止 昨 夜 的 酒
Language: Chinese->English
Successfully transferring the timbre, the articulation method, pronunciation, pitch transition style, and rhythm.
Prompt |
---|
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
2.Target Word: 情 丝 百 转 丝 丝 缠 乱 犹 不 知
Prompt: They’ve all been said before you know so why don’t we AP just play pretend AP
Language: English->Chinese
Successfully transferring the timbre, the articulation method, pronunciation, pitch transition style, and rhythm.
Prompt |
---|
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
Speech-to-Singing Style Transfer
We conduct experiments on both parallel and cross-lingual speech-to-singing style transfer.
Parallel Speech-to-Singing Style Transfer
In parallel experiments, we randomly select samples with unseen singers from the test set as targets and different speech from the same singers to form prompts.
1.Target Word: You make me happy AP when skies are gray
Prompt: I believe that the heart does go on
Successfully transferring the timbre, pronunciation, pitch transition style, and rhythm.
Prompt | Ground Truth |
---|---|
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
2.Target Word: It can happen to anyone of us anyone you think of
Prompt: And you wonder SP I wonder how I wonder why
Successfully transferring the timbre, the articulation method, pronunciation, pitch transition style, and rhythm.
Prompt | Ground Truth |
---|---|
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
Cross-Lingual Speech-to-Singing Style Transfer
In cross-lingual experiments, we select a speech prompt whose language differs from that of the target lyrics (for example, Chinese and English).
1.Target Word: 当 花 瓣 离 开 花 朵 AP 暗 香
Prompt: it can happen to anyone of us
Successfully transferring the timbre, pronunciation, pitch transition style, and rhythm.
Prompt |
---|
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
2.Target Word: the song we sang together AP oh yeah AP
Prompt: 我 全 部 的 心 跳 AP 随 你 跳
Successfully transferring the timbre, pronunciation, pitch transition style, and rhythm.
Prompt |
---|
YourTTS | Mega-TTS | RMSSinger | StyleSinger | TCSinger |
---|---|---|---|---|
Ablation Study
We conduct ablation studies to show the efficacy of the individual designs in TCSinger. w/o SAD uses only a diffusion decoder instead of the style adaptive decoder, w/o DM uses the simple duration predictor of FastSpeech 2 instead of the duration model in S&D-LM, and w/o CVQ uses a VQ model instead of the CVQ model in the clustering style encoder.
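For readers unfamiliar with the VQ model that the CVQ ablation compares against, the following is a minimal sketch of the generic vector quantization lookup step: each input vector is replaced by its nearest codebook entry. This is textbook VQ only, with a toy codebook and toy frames chosen for illustration. It does not show the clustering variant (CVQ), codebook training, or any TCSinger code, and the names `squared_distance` and `quantize` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, codebook[index]


if __name__ == "__main__":
    # Toy 2-D codebook and toy "style frames"; real models learn the codebook.
    codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
    frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4]]
    for frame in frames:
        index, code = quantize(frame, codebook)
        print(frame, "->", index, code)
```

The discrete indices produced by such a lookup are what make the latent space compact: a sequence of frames is represented by a sequence of codebook ids.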
1.Target Word: 我 的 背 脊 如 荒 丘 而 你 却 微 笑 摆 首 AP 把 它 当 成 整 个 宇 宙 你 与 太 阳 挥 手 也 同 海 鸥 问 候
Prompt: 直 到 那 一 天 SP 你 的 衣 衫 破 旧 而 歌 声 却 温 柔 陪 我 漫 无 目 的 的 四 处 漂 流
Successfully synthesizing the timbre, articulation method, pronunciation, pitch transition style, and rhythm.
Prompt |
---|
Ground Truth | TCSinger | w/o SAD | w/o DM | w/o CVQ |
---|---|---|---|---|
2.Target Word: 把 一 个 人 的 温 暖 转 移 到 另 一 个 的 胸 膛 AP 让 上 次 犯 的 错 反 省 出 梦 想 AP 每 个 人 都 是 这 样
Prompt: 才 能 知 道 伤 感 是 爱 的 遗 产 AP 流 浪 过 几 张 双 人 床 换 过 几 次 信 仰 才 让 戒 指 义 无 反 顾 的 交 换 AP
Successfully synthesizing the timbre, articulation method, pronunciation, pitch transition style, and rhythm.
Prompt |
---|
Ground Truth | TCSinger | w/o SAD | w/o DM | w/o CVQ |
---|---|---|---|---|
Clustering Style Encoder
In these tests, we use the timbre of singer A and the style information of singer B. The synthesized results match the timbre of singer A and differ from that of singer B, which indicates that our clustering style encoder successfully decouples timbre and style in the mel spectrogram.
1.Target Word: 我 们 这 些 努 力 不 简 单 快 乐 炼 成 泪 水 是 一 种 勇 敢
Successfully synthesizing the timbre of singer A, the pronunciation, pitch transition style, and rhythm of singer B.
Singer A | Singer B | Result |
---|---|---|
2.Target Word: 这 瞬 眼 的 光 景 最 亲 密 的 距 离 沿 着 你 皮 肤 纹 理
Successfully synthesizing the timbre of singer A, the pronunciation, pitch transition style, and rhythm of singer B.
Singer A | Singer B | Result |
---|---|---|
Style Control Extended Experiments
In these tests, we compare different text prompts to show TCSinger’s controllability.
Emotion Difference
Target Word: 东 汉 末 年 分 三 国 SP 烽 火 连 天 不 休
With the same target content, TCSinger renders different emotions according to the text prompt.
Happy | Sad |
---|---|
Technique Difference
Target Word: 为 了 爱 孤 军 奋 斗 AP 早 就 吃 够 了 爱 情 的 苦
With the same target content, TCSinger applies different techniques to each phoneme according to the text prompt.
1.Phoneme-Level Text Prompt (Technique Sequence):
['uei(1)','l(1)','e(1)','ai(1)','g(1)','u(1)','j(1)','vn(1)','f(1)','en(1)','d(1)','ou(1)','AP(0)','z(1)','ao(1)','j(1)','iou(1)','ch(1)','i(1)','g(1)','ou(1)','l(1)','e(1)','ai(1)','q(1)','ing(1)','d(1)','e(1)','k(1)','u(1)']
(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)
TCSinger |
---|
2.Phoneme-Level Text Prompt (Technique Sequence):
['uei(3)','l(3)','e(3)','ai(3)','g(3)','u(3)','j(3)','vn(3)','f(3)','en(3)','d(3)','ou(3)','AP(0)','z(3)','ao(3)','j(3)','iou(3)','ch(3)','i(3)','g(3)','ou(3)','l(3)','e(3)','ai(3)','q(3)','ing(3)','d(3)','e(3)','k(3)','u(3)']
(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)
TCSinger |
---|
3.Phoneme-Level Text Prompt (Technique Sequence):
['uei(0)','l(0)','e(0)','ai(0)','g(1)','u(1)','j(1)','vn(1)','f(1)','en(1)','d(1)','ou(1)','AP(0)','z(0)','ao(0)','j(0)','iou(0)','ch(0)','i(0)','g(0)','ou(0)','l(2)','e(2)','ai(2)','q(2)','ing(2)','d(2)','e(2)','k(2)','u(2)']
(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)
TCSinger |
---|
4.Phoneme-Level Text Prompt (Technique Sequence):
['uei(0)','l(0)','e(0)','ai(0)','g(3)','u(3)','j(3)','vn(3)','f(3)','en(3)','d(3)','ou(3)','AP(0)','z(0)','ao(0)','j(0)','iou(0)','ch(0)','i(0)','g(0)','ou(0)','l(0)','e(0)','ai(3)','q(3)','ing(3)','d(3)','e(3)','k(3)','u(3)']
(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)
TCSinger |
---|