Abstract
Zero-Shot Style Transfer
Multi-Level Style Control
- Parallel Style Control
- Non-Parallel Style Control
Cross-Lingual Style Transfer
Speech-to-Singing Style Transfer
- Parallel Speech-to-Singing Style Transfer
- Cross-Lingual Speech-to-Singing Style Transfer
Ablation Study
Clustering Style Encoder
Style Control Extended Experiments
- Emotion Difference
- Technique Difference

Abstract

Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles (including singing method, emotion, rhythm, technique, and pronunciation) from audio and text prompts. However, the multifaceted nature of singing styles poses a significant challenge for effective modeling, transfer, and control. Furthermore, current SVS models often fail to generate singing voices rich in stylistic nuances for unseen singers. To address these challenges, we introduce TCSinger, the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control. Specifically, TCSinger comprises three primary modules: 1) the clustering style encoder employs a clustering vector quantization model to stably condense style information into a compact latent space; 2) the Style and Duration Language Model (S&D-LM) concurrently predicts style information and phoneme duration, which benefits both; 3) the style adaptive decoder uses a novel mel-style adaptive normalization method to generate singing voices with enhanced details. Experimental results show that TCSinger outperforms all baseline models in synthesis quality, singer similarity, and style controllability across various tasks, including zero-shot style transfer, multi-level style control, cross-lingual style transfer, and speech-to-singing style transfer.

Note: We conduct all tasks in the zero-shot scenario, with training and testing on cross-lingual speech and singing data.

Zero-Shot Style Transfer

To assess the performance of TCSinger and baseline models in the zero-shot style transfer task, we randomly select samples with unseen singers from the test set as targets and different utterances from the same singers to form prompts.
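The selection rule described above (a target from an unseen singer, plus a different utterance from the same singer as the prompt) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the `(singer_id, utterance_id)` layout, the function name, and the sample identifiers are all assumptions made for the example.

```python
import random
from collections import defaultdict


def pick_target_and_prompt(utterances, rng):
    """Pick a target and a prompt: same singer, different utterances.

    `utterances` is a list of (singer_id, utterance_id) pairs for unseen
    test singers. Both field names are hypothetical.
    """
    by_singer = defaultdict(list)
    for singer_id, utt_id in utterances:
        by_singer[singer_id].append(utt_id)
    # Only singers with at least two utterances can supply both roles.
    eligible = sorted(s for s, utts in by_singer.items() if len(utts) >= 2)
    singer = rng.choice(eligible)
    # sample() draws without replacement, so target and prompt differ.
    target, prompt = rng.sample(by_singer[singer], 2)
    return singer, target, prompt


# Hypothetical test-set index: singer_c has one utterance and is skipped.
test_set = [("singer_a", "a1"), ("singer_a", "a2"), ("singer_b", "b1"),
            ("singer_b", "b2"), ("singer_b", "b3"), ("singer_c", "c1")]
singer, target, prompt = pick_target_and_prompt(test_set, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the selection reproducible across runs.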

1.Target Word: 又站在你家的门口我们重复沉默

Prompt: 终于你开口向我诉说她有多温柔

Successfully transferring the timbre, resonance in pop singing method, mixed voice technique, pronunciation, rhythm, and pitch transition style.

Prompt	Ground Truth

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger

2.Target Word: 我听见雨滴落在青青草地 SP 我听见远方下课钟声响起

Prompt: 可是我没有听见你的声音 SP 认真呼唤我姓名

Successfully transferring the timbre, pronunciation, pitch transition style, and rhythm.

Prompt	Ground Truth

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger

3.Target Word: 也不是真的 AP 不会想你 AP 全都不是真的 AP 是骗自己

Prompt: 让我这样吧 SP 并不是真的 AP 路过而已

Successfully transferring the timbre, pronunciation, pitch transition style, and rhythm.

Prompt	Ground Truth

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger

4.Target Word: 谁娶了多愁善感的你

Prompt: 谁把你的长发盘起

Successfully transferring the timbre, pronunciation, pitch transition style, and rhythm.

Prompt	Ground Truth

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger

5.Target Word: settled down that you AP found a girl and you AP

Prompt: I heard AP that you are AP

Successfully transferring the timbre, pronunciation, pitch transition style, rhythm, and glissando technique.

Prompt	Ground Truth

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger

6.Target Word: you belong with me belong with me

Prompt: standing by and waiting at your back door SP all this time how could you not know baby

Successfully transferring the timbre, pronunciation, pitch transition style, rhythm, and vibrato technique.

Prompt	Ground Truth

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger

Multi-Level Style Control

We add global and phoneme-level text embeddings to each baseline model to enable style control. We then compare these baselines with TCSinger using multi-level text prompts. We conduct both parallel and non-parallel experiments according to the target styles. For global styles, we specify singing methods (bel canto and pop) and emotions (happy and sad) for each test target. For phoneme-level styles, we select none, one, or more specific techniques (mixed voice, falsetto, breathy, vibrato, glissando, and pharyngeal) for each phoneme of the target content.
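As a rough illustration of what a multi-level text prompt carries, the sketch below bundles one global singing method, one global emotion, and a per-phoneme set of techniques, and checks each label against the sets listed above. The class name, field names, and validation logic are assumptions for this example and do not describe TCSinger's actual input interface.

```python
from dataclasses import dataclass, field

# Label sets taken from the description above.
SINGING_METHODS = {"bel canto", "pop"}
EMOTIONS = {"happy", "sad"}
TECHNIQUES = {"mixed voice", "falsetto", "breathy",
              "vibrato", "glissando", "pharyngeal"}


@dataclass
class MultiLevelPrompt:
    """Hypothetical container for one multi-level text prompt."""
    singing_method: str
    emotion: str
    # One (phoneme, techniques) pair per phoneme; an empty set means "none".
    phoneme_techniques: list = field(default_factory=list)

    def validate(self):
        """Raise ValueError if any label falls outside the listed sets."""
        if self.singing_method not in SINGING_METHODS:
            raise ValueError(f"unknown singing method: {self.singing_method!r}")
        if self.emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion: {self.emotion!r}")
        for phoneme, techniques in self.phoneme_techniques:
            unknown = set(techniques) - TECHNIQUES
            if unknown:
                raise ValueError(f"{phoneme}: unknown techniques {sorted(unknown)}")
        return True


prompt = MultiLevelPrompt(
    singing_method="pop",
    emotion="sad",
    phoneme_techniques=[("W", {"mixed voice", "glissando"}),
                        ("EY1", {"falsetto", "glissando"}),
                        ("AP", set())],
)
prompt.validate()
```

Storing the techniques as a set per phoneme mirrors the "none, one, or more" wording: an empty set, a singleton, or several techniques at once.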

Parallel Style Control

In the parallel experiments, we randomly select unseen audio from the test set, using the GT global style and phoneme-level techniques as the target.

1.Target Word: 你是魔鬼中的天使所以送我心碎的方式 AP 是让我笑到最后 AP

Global Text Prompt (Singing Method and Emotion): bel canto, sad

Phoneme-Level Text Prompt (Technique Sequence):

['AP(0)', 'n(1)', 'i(1)', 'sh(1)', 'i(1)', 'm(1)', 'o(1)', 'g(1)', 'uei(1)', 'zh(1)', 'ong(1)', 'd(1)', 'e(1)', 't(1)', 'ian(1)', 'sh(1)', 'i(1)', 's(1)', 'uo(1)', 'i(1)', 's(1)', 'ong(1)', 'uo(1)', 'x(1)', 'in(1)', 's(1)', 'uei(1)', 'd(1)', 'e(1)', 'f(1)', 'ang(1)', 'sh(1)', 'i(1)', 'AP(0)', 'sh(1)', 'i(1)', 'r(1)', 'ang(1)', 'uo(1)', 'x(1)', 'iao(1)', 'd(1)', 'ao(1)', 'z(1)', 'uei(1)', 'h(1)', 'ou(1)', 'AP(0)']

(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)

Successfully controlling the global singing method and emotion, as well as the phoneme-level techniques of glissando and mixed voice.

Ground Truth

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger

2.Target Word: 宁愿选择留恋不放手 AP 等到风景都看透 AP 也许你会陪我看细水 AP 长流

Global Text Prompt (Singing Method and Emotion): pop, sad

Phoneme-Level Text Prompt (Technique Sequence):

['n(1)', 'ing(1)', 'van(1)', 'x(1)', 'van(1)', 'z(1)', 'e(1)', 'l(2)', 'iou(2)', 'l(2)', 'ian(2)', 'b(2)', 'u(2)', 'f(2)', 'ang(2)', 'sh(2)', 'ou(2)', 'AP(0)', 'd(2)', 'eng(2)', 'd(2)', 'ao(2)', 'f(2)', 'eng(2)', 'j(2)', 'ing(2)', 'd(2)', 'ou(2)', 'k(2)', 'an(2)', 't(2)', 'ou(2)', 'AP(0)', 'ie(2)', 'x(2)', 'v(2)', 'n(2)', 'i(2)', 'h(2)', 'uei(2)', 'p(2)', 'ei(2)', 'uo(2)', 'k(2)', 'an(2)', 'x(1)', 'i(1)', 'sh(1)', 'uei(1)', 'AP(0)', 'ch(1)', 'ang(1)', 'l(1)', 'iou(1)']

(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)

Successfully controlling the global singing method and emotion, as well as the phoneme-level techniques of falsetto and mixed voice.

Ground Truth

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger
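The technique sequences on this page write each phoneme as `phoneme(ids)`, where `ids` is one or more comma-separated technique numbers from the legend. A minimal parsing sketch follows. The regular expression and function name are assumptions for illustration; only the legend mapping comes from the page.

```python
import re

# Legend as given on this page.
TECHNIQUE_NAMES = {0: "no technique", 1: "mix", 2: "falsetto", 3: "breathy",
                   4: "pharyngeal", 5: "vibrato", 6: "glissando"}

# A phoneme label, then one or more comma-separated integers in parentheses.
TOKEN_RE = re.compile(r"^([A-Za-z0-9]+)\(([0-9]+(?:,[0-9]+)*)\)$")


def parse_token(token):
    """Split a token such as 'AH1(1,5,6)' into (phoneme, technique names)."""
    match = TOKEN_RE.match(token)
    if match is None:
        raise ValueError(f"malformed token: {token!r}")
    phoneme = match.group(1)
    ids = [int(part) for part in match.group(2).split(",")]
    for technique_id in ids:
        if technique_id not in TECHNIQUE_NAMES:
            raise ValueError(f"unknown technique id {technique_id} in {token!r}")
    return phoneme, [TECHNIQUE_NAMES[i] for i in ids]


print(parse_token("sh(2)"))       # ('sh', ['falsetto'])
print(parse_token("AH1(1,5,6)"))  # ('AH1', ['mix', 'vibrato', 'glissando'])
```

A strict pattern like this also flags typing slips in hand-written sequences, such as a missing closing parenthesis, instead of silently accepting them.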

Non-Parallel Style Control

In the non-parallel experiments, global styles and six techniques are randomly yet appropriately assigned.

1.Target Word: remember us this way way AP

Global Text Prompt (Singing Method and Emotion): pop, sad

Phoneme-Level Text Prompt (Technique Sequence):

['R(1)', 'IH0(1)', 'M(1)', 'EH1(1)', 'M(1)', 'B(1)', 'ER0(1)', 'AH1(1,5,6)', 'S(1,5,6)', 'DH(1)', 'IH1(1)', 'S(1)', 'W(1,6)', 'EY1(2,6)', 'W(2)', 'EY1(2)', 'AP(0)']

(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)

Successfully controlling the global singing method and emotion, as well as the phoneme-level techniques of vibrato, glissando, falsetto, and mixed voice.

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger

2.Target Word: SP I remember tears streaming down your face when I said I’ll never let you go AP

Global Text Prompt (Singing Method and Emotion): pop, sad

Phoneme-Level Text Prompt (Technique Sequence):

['SP(0)', 'AY1(2)', 'R(2)', 'IH0(2)', 'M(2)', 'EH1(2)', 'M(2)', 'B(2)', 'ER0(2)', 'T(2)', 'IH1(2)', 'R(2)', 'Z(2)', 'S(2)', 'T(2)', 'R(2)', 'IY1(2)', 'M(2)', 'IH0(2)', 'NG(2)', 'D(2)', 'AW1(2)', 'N(2)', 'Y(2)', 'UH1(2)', 'R(2)', 'F(2)', 'EY1(2)', 'S(2)', 'HH(1)', 'W(1)', 'EH1(1)', 'N(1)', 'AY1(1)', 'S(1)', 'EH1(1)', 'D(1)', 'AY1(1)', 'L(1)', 'N(1)', 'EH1(1)', 'V(1)', 'ER0(1)', 'L(1)', 'EH1(1)', 'T(1)', 'Y(1)', 'UW1(1)', 'G(1)', 'OW1(1)', 'AP(0)']

(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)

Successfully controlling the global singing method and emotion, as well as the phoneme-level techniques of vibrato, falsetto, and mixed voice.

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger

Cross-Lingual Style Transfer

To test the zero-shot cross-lingual style transfer performance of various models, we run inference on unseen test data in which the prompt and the target have lyrics in different languages (such as English and Chinese).

1.Target Word: I love you baby SP trust in me when I say

Prompt: 让我掉下眼泪的不止昨夜的酒

Language: Chinese->English

Successfully transferring the timbre, the articulation method, pronunciation, pitch transition style, and rhythm.

Prompt

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger

2.Target Word: 情丝百转丝丝缠乱犹不知

Prompt: They’ve all been said before you know so why don’t we AP just play pretend AP

Language: English->Chinese

Successfully transferring the timbre, the articulation method, pronunciation, pitch transition style, and rhythm.

Prompt

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger

Speech-to-Singing Style Transfer

We conduct experiments on both parallel and cross-lingual speech-to-singing style transfer.

Parallel Speech-to-Singing Style Transfer

In parallel experiments, we randomly select samples with unseen singers from the test set as targets and different speech from the same singers to form prompts.

1.Target Word: You make me happy AP when skies are gray

Prompt: I believe that the heart does go on

Successfully transferring the timbre, pronunciation, pitch transition style, and rhythm.

Prompt	Ground Truth

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger

2.Target Word: It can happen to anyone of us anyone you think of

Prompt: And you wonder SP I wonder how I wonder why

Successfully transferring the timbre, the articulation method, pronunciation, pitch transition style, and rhythm.

Prompt	Ground Truth

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger

Cross-Lingual Speech-to-Singing Style Transfer

In cross-lingual experiments, we select a speech prompt whose language differs from that of the target lyrics (such as Chinese and English).

1.Target Word: 当花瓣离开花朵 AP 暗香

Prompt: it can happen to anyone of us

Successfully transferring the timbre, pronunciation, pitch transition style, and rhythm.

Prompt

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger

2.Target Word: the song we sang together AP oh yeah AP

Prompt: 我全部的心跳 AP 随你跳

Successfully transferring the timbre, pronunciation, pitch transition style, and rhythm.

Prompt

YourTTS	Mega-TTS	RMSSinger	StyleSinger	TCSinger

Ablation Study

We undertake ablation studies to showcase the efficacy of the various designs incorporated in TCSinger. In the samples below, w/o SAD uses only a diffusion decoder in place of the style adaptive decoder, w/o DM uses the simple duration predictor of FastSpeech 2 in place of the duration model in S&D-LM, and w/o CVQ uses a VQ model in place of the CVQ model in the clustering style encoder.

1.Target Word: 我的背脊如荒丘而你却微笑摆首 AP 把它当成整个宇宙你与太阳挥手也同海鸥问候

Prompt: 直到那一天 SP 你的衣衫破旧而歌声却温柔陪我漫无目的的四处漂流

Successfully synthesizing the timbre, articulation method, pronunciation, pitch transition style, and rhythm.

Prompt

Ground Truth	TCSinger	w/o SAD	w/o DM	w/o CVQ

2.Target Word: 把一个人的温暖转移到另一个的胸膛 AP 让上次犯的错反省出梦想 AP 每个人都是这样

Prompt: 才能知道伤感是爱的遗产 AP 流浪过几张双人床换过几次信仰才让戒指义无反顾的交换 AP

Successfully synthesizing the timbre, articulation method, pronunciation, pitch transition style, and rhythm.

Prompt

Ground Truth	TCSinger	w/o SAD	w/o DM	w/o CVQ

Clustering Style Encoder

In these tests, we use the timbre of singer A and the style information of singer B to synthesize results that match the timbre of singer A while differing from that of singer B. This outcome shows that our clustering style encoder successfully decouples timbre and style in the mel spectrogram.

1.Target Word: 我们这些努力不简单快乐炼成泪水是一种勇敢

Successfully synthesizing the timbre of singer A, the pronunciation, pitch transition style, and rhythm of singer B.

Singer A	Singer B	Result

2.Target Word: 这瞬眼的光景最亲密的距离沿着你皮肤纹理

Successfully synthesizing the timbre of singer A, the pronunciation, pitch transition style, and rhythm of singer B.

Singer A	Singer B	Result

Style Control Extended Experiments

In these tests, we compare different text prompts to show TCSinger’s controllability.

Emotion Difference

Target Word: 东汉末年分三国 SP 烽火连天不休

Obviously, TCSinger can control different emotions.

Happy	Sad

Technique Difference

Target Word: 为了爱孤军奋斗 AP 早就吃够了爱情的苦

Obviously, TCSinger can control different techniques for each phoneme.

1.Phoneme-Level Text Prompt (Technique Sequence):

['uei(1)','l(1)','e(1)','ai(1)','g(1)','u(1)','j(1)','vn(1)','f(1)','en(1)','d(1)','ou(1)','AP(0)','z(1)','ao(1)','j(1)','iou(1)','ch(1)','i(1)','g(1)','ou(1)','l(1)','e(1)','ai(1)','q(1)','ing(1)','d(1)','e(1)','k(1)','u(1)']

(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)

TCSinger

2.Phoneme-Level Text Prompt (Technique Sequence):

['uei(3)','l(3)','e(3)','ai(3)','g(3)','u(3)','j(3)','vn(3)','f(3)','en(3)','d(3)','ou(3)','AP(0)','z(3)','ao(3)','j(3)','iou(3)','ch(3)','i(3)','g(3)','ou(3)','l(3)','e(3)','ai(3)','q(3)','ing(3)','d(3)','e(3)','k(3)','u(3)']

(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)

TCSinger

3.Phoneme-Level Text Prompt (Technique Sequence):

['uei(0)','l(0)','e(0)','ai(0)','g(1)','u(1)','j(1)','vn(1)','f(1)','en(1)','d(1)','ou(1)','AP(0)','z(0)','ao(0)','j(0)','iou(0)','ch(0)','i(0)','g(0)','ou(0)','l(2)','e(2)','ai(2)','q(2)','ing(2)','d(2)','e(2)','k(2)','u(2)']

(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)

TCSinger

4.Phoneme-Level Text Prompt (Technique Sequence):

['uei(0)','l(0)','e(0)','ai(0)','g(3)','u(3)','j(3)','vn(3)','f(3)','en(3)','d(3)','ou(3)','AP(0)','z(0)','ao(0)','j(0)','iou(0)','ch(0)','i(0)','g(0)','ou(0)','l(0)','e(0)','ai(3)','q(3)','ing(3)','d(3)','e(3)','k(3)','u(3)']

(0: no technique, 1: mix, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando)

TCSinger
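The four technique sequences above differ only in which spans of phonemes carry which technique. As a reading aid, the sketch below groups consecutive phonemes that share the same annotation, which makes those spans easy to compare. The function name and the shortened sample sequence are assumptions for illustration; the token format follows the sequences on this page.

```python
from itertools import groupby


def technique_spans(tokens):
    """Group consecutive tokens that share the same technique annotation."""
    parsed = []
    for token in tokens:
        # 'uei(0)' -> phoneme 'uei', ids '0'; 'AH1(1,5,6)' -> ids '1,5,6'.
        phoneme, _, ids = token.rstrip(")").partition("(")
        parsed.append((phoneme, ids))
    spans = []
    for ids, group in groupby(parsed, key=lambda item: item[1]):
        spans.append((ids, [phoneme for phoneme, _ in group]))
    return spans


# Shortened, hypothetical sequence in the same format as the prompts above.
sequence = ['uei(0)', 'l(0)', 'g(1)', 'u(1)', 'AP(0)', 'l(2)', 'e(2)']
for ids, phonemes in technique_spans(sequence):
    print(ids, phonemes)
```

Because `groupby` only merges adjacent items, a technique that recurs later in the sequence appears as a separate span, preserving the order in which the techniques are sung.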