Training-Free · Intra-Utterance Control · Zero-Shot TTS

TED-TTS: Training-Free Intra-Utterance Emotion and Duration Control for Text-to-Speech Synthesis

Abstract

While controllable Text-to-Speech (TTS) has made notable progress, most existing methods remain limited to inter-utterance-level control, and fine-grained intra-utterance expression remains challenging because these methods rely on non-public datasets or complex multi-stage training. In this paper, we propose TED-TTS, a training-free framework that equips pretrained zero-shot TTS models with intra-utterance emotion and duration control. Specifically, we introduce a segment-aware emotion conditioning strategy that combines causal masking with monotonic stream alignment filtering to isolate emotion conditioning per segment and schedule mask transitions, enabling smooth intra-utterance emotion shifts while preserving global semantic coherence. Building on this, we further introduce a segment-aware duration steering strategy that combines local duration embedding steering with global EOS logit modulation, allowing local duration adjustment while ensuring globally consistent termination. To eliminate segment-level manual prompt engineering, we construct a 30,000-sample text dataset annotated with multiple emotions and durations, enabling LLM-based automatic prompt construction. Extensive experiments demonstrate that our training-free method not only achieves state-of-the-art intra-utterance consistency in multi-emotion and duration control, but also maintains the speech quality of the underlying TTS model.
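As a rough illustration of the segment-aware emotion conditioning idea, the sketch below builds a soft attention bias so that each decoding frame attends only to the emotion condition tokens of its own segment, with a short scheduled ramp at segment boundaries. This is a minimal, hypothetical PyTorch sketch: the function name, the segment-id tensors, and the ramp length are illustrative assumptions, not the actual TED-TTS implementation, which additionally involves causal masking and monotonic stream alignment filtering.

```python
import torch


def build_segment_emotion_bias(
    frame_segments: torch.Tensor,  # (T,) segment id of each decoding frame (assumed input)
    cond_segments: torch.Tensor,   # (C,) segment id of each emotion-condition token (assumed input)
    ramp: int = 8,                 # frames over which the mask transition is scheduled
) -> torch.Tensor:
    """Additive attention bias: each frame attends to its own segment's emotion
    condition tokens, blending away from the previous segment over `ramp` frames."""
    T, C = frame_segments.numel(), cond_segments.numel()
    weight = torch.zeros(T, C)
    for t in range(T):
        seg = int(frame_segments[t])
        same = (cond_segments == seg).float()
        prev = (cond_segments == seg - 1).float() if seg > 0 else same
        # number of frames since this segment started (0 right at the boundary)
        boundary = (frame_segments[:t] != seg).nonzero()
        steps_in = t - (int(boundary[-1]) + 1) if boundary.numel() > 0 else t
        alpha = min(1.0, (steps_in + 1) / ramp)  # ramps from 0 toward 1 inside the segment
        weight[t] = alpha * same + (1.0 - alpha) * prev
    # log-domain bias; fully masked entries become -inf
    return torch.where(weight > 0, weight.log(), torch.full_like(weight, float("-inf")))


# toy example: three segments of four frames each, one emotion-condition token per segment
frames = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
conds = torch.tensor([0, 1, 2])
bias = build_segment_emotion_bias(frames, conds, ramp=3)
print(bias.shape)  # torch.Size([12, 3])
```

The sketch only captures the mask-scheduling intuition: attention to the previous segment's emotion condition decays over a few frames, which is one way smooth intra-utterance emotion shifts could be obtained without retraining.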

Framework Overview

Task Definition (Left) and Technical Architecture Overview (Right)
Task Definition
Overview of our training-free framework for intra-utterance emotion and duration control, where the green, red, and blue regions denote three segments with different emotion and duration settings within the same utterance.
Detailed illustration of Monotonic Stream Alignment
Overview of our training-free framework for fine-grained intra-utterance emotion and duration control, illustrating the transition from the second (red) segment to the third (blue) segment via the segment-aware duration steering (left) and segment-aware emotion conditioning (right) strategies.

Audio Examples

1. Intra-Utterance Emotion Control

(a) Speech-Referenced Emotion Prompt

💡 The above audio samples present comparative results for intra-utterance multi-emotion control using speech-referenced emotion prompts. Since the comparison methods lack intra-utterance controllability, their segments are synthesized independently and concatenated for evaluation. Our method produces smooth, coherent emotion transitions within a single utterance and consistent speaker similarity across segments; subtle breath sounds can even be perceived at segment boundaries. Although the baseline methods show strong emotion preservation at the segment level, this advantage largely stems from their independent segment synthesis setting. In contrast, our method performs multi-segment emotion control within a single generation pass, which makes emotion category preservation more challenging but better reflects realistic controllable speech synthesis scenarios.

(b) Text-Referenced Emotion Prompt

💡 The above audio samples present comparative results for intra-utterance multi-emotion control using text-referenced emotion prompts. Compared with speech-referenced emotion prompting, extracting emotion cues from natural language descriptions is considerably more challenging. Despite this increased difficulty, our method still demonstrates smooth intra-utterance emotion transitions, consistent speaker timbre across segments, and strong multi-emotion controllability within a single utterance.

2. Intra-Utterance Duration Control

💡 The above audio samples present comparative results for intra-utterance duration control using speech-referenced emotion prompts. In the duration control experiments, the emotion category is fixed to neutral, and segment-level synthesis is evaluated under four duration scaling factors (×0.75, ×0.875, ×1.125, and ×1.25). The results show that, even after extending controllability to duration, our method preserves smooth intra-utterance transitions and coherent speaker timbre across segments. Moreover, our framework can flexibly adjust the duration of a target segment while keeping the other segments unchanged, demonstrating strong training-free controllability within an autoregressive TTS framework.
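For intuition, the sketch below shows one plausible way to realize the two duration-control ingredients named in the abstract: local duration-embedding steering for the active segment and global EOS-logit modulation during autoregressive decoding. All names, shapes, and the steering formula are illustrative assumptions rather than the exact TED-TTS implementation.

```python
import torch


def modulate_eos_logit(
    logits: torch.Tensor,   # (V,) token logits at the current autoregressive decoding step
    eos_id: int,            # index of the end-of-sequence token (assumed known)
    frames_generated: int,
    target_frames: int,
    strength: float = 4.0,  # illustrative modulation strength
) -> torch.Tensor:
    """Global EOS-logit modulation (sketch): penalize EOS before the target length
    is reached and reward it afterwards, scaled by the relative length deviation."""
    deviation = (frames_generated - target_frames) / max(target_frames, 1)
    out = logits.clone()
    out[eos_id] = out[eos_id] + strength * deviation
    return out


def steer_duration_embedding(
    segment_embed: torch.Tensor,  # (D,) duration/rate embedding of the active segment (assumed)
    neutral_embed: torch.Tensor,  # (D,) embedding of the unmodified speaking rate (assumed)
    scale: float,                 # e.g. 0.75, 0.875, 1.125, or 1.25
) -> torch.Tensor:
    """Local duration-embedding steering (sketch): push the active segment's embedding
    along the rate direction while the other segments are left untouched."""
    return neutral_embed + scale * (segment_embed - neutral_embed)


# toy usage: lengthen the active segment by 1.25x and bias termination toward the new target
neutral = torch.zeros(16)
segment = torch.randn(16)
steered = steer_duration_embedding(segment, neutral, scale=1.25)
logits = torch.randn(100)
logits = modulate_eos_logit(logits, eos_id=0, frames_generated=180, target_frames=200)
```

The local steering term only changes the conditioning of the target segment, while the EOS modulation acts globally on the termination decision, mirroring the division of labor described above.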