We propose DVPD, an extremely lightweight (1.9M parameters) predictive diffusion model that uniquely exploits the dual nature of spectrograms as both visual textures and acoustic representations. DVPD achieves SOTA performance across universal and task-oriented SE benchmarks, while the training-free TLB strategy further elevates restoration quality with zero additional overhead.
Diffusion models have recently set new benchmarks in Speech Enhancement (SE). However, most existing score-based models treat speech spectrograms merely as generic 2D images, applying uniform processing that ignores the intrinsic structural sparsity of audio, which results in inefficient spectral representation and prohibitive computational complexity. To bridge this gap, we propose DVPD, an extremely lightweight Dual-View Predictive Diffusion model, which uniquely exploits the dual nature of spectrograms as both visual textures and physical frequency-domain representations across both training and inference stages. Specifically, during training, we optimize spectral utilization via the Frequency-Adaptive Non-uniform Compression (FANC) encoder, which preserves critical low-frequency harmonics while pruning high-frequency redundancies. Simultaneously, we introduce a Lightweight Image-based Spectro-Awareness (LISA) module to capture features from a visual perspective with minimal overhead. During inference, we propose a Training-free Lossless Boost (TLB) strategy that leverages the same dual-view priors to refine generation quality without any additional fine-tuning. Extensive experiments across various benchmarks demonstrate that DVPD achieves state-of-the-art performance while requiring only 35% of the parameters and 40% of the inference MACs compared to the SOTA lightweight model, PGUSE. These results highlight DVPD's superior ability to balance high-fidelity speech quality with extreme architectural efficiency.
Frequency-Adaptive Non-uniform Compression (FANC): Exploits the non-uniform information density of spectrograms by preserving low-frequency harmonics (0–2 kHz) while pruning high-frequency redundancies (>4 kHz), in alignment with human auditory resolution.
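To make the idea of non-uniform compression concrete, here is a minimal sketch, not the FANC encoder itself. It assumes a fixed average-pooling scheme in place of whatever learned compression DVPD actually uses. The function names, the pooling factors (2 and 4), and the treatment of the 2–4 kHz band are all hypothetical; only the 2 kHz and 4 kHz boundaries come from the description above.

```python
def band_pool(values, factor):
    """Average-pool a list in non-overlapping groups of `factor` bins."""
    pooled = []
    for start in range(0, len(values), factor):
        group = values[start:start + factor]
        pooled.append(sum(group) / len(group))
    return pooled


def nonuniform_compress(frame, sample_rate, low_hz=2000.0, high_hz=4000.0,
                        mid_factor=2, high_factor=4):
    """Compress one magnitude frame along frequency, band by band.

    frame: magnitudes for bins spanning 0 Hz to sample_rate / 2.
    Bins up to low_hz are kept at full resolution. Bins between low_hz
    and high_hz are pooled by mid_factor, and bins above high_hz by
    high_factor. Both factors are illustrative assumptions.
    """
    hz_per_bin = (sample_rate / 2) / (len(frame) - 1)
    low, mid, high = [], [], []
    for k, value in enumerate(frame):
        hz = k * hz_per_bin
        if hz <= low_hz:
            low.append(value)
        elif hz <= high_hz:
            mid.append(value)
        else:
            high.append(value)
    return low + band_pool(mid, mid_factor) + band_pool(high, high_factor)


# Toy frame: 17 bins at 16 kHz sampling, i.e. 500 Hz per bin.
frame = [float(k) for k in range(17)]
compressed = nonuniform_compress(frame, sample_rate=16000)
print(len(frame), "->", len(compressed))
```

In this toy setting the five bins at or below 2 kHz pass through unchanged, while the eight bins above 4 kHz collapse into two, so most of the size reduction comes from the high band.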
Lightweight Image-based Spectro-Awareness (LISA): Captures anisotropic features, such as horizontal harmonics and vertical transients, using multi-scale heterogeneous dilated kernels with minimal overhead.
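The following sketch illustrates why direction-specific dilated kernels suit spectrograms. It is not the LISA module: the kernels here are fixed averaging filters rather than learned weights, and the kernel shapes, dilation rates, and function names are assumptions chosen for illustration. A 1x3 kernel dilated along time responds to steady harmonics (horizontal lines), while a 3x1 kernel dilated along frequency responds to broadband transients (vertical lines).

```python
def dilated_conv2d(spec, kernel, dilation):
    """Zero-padded, same-size 2D cross-correlation with per-axis dilation.

    spec: list of rows indexed as spec[frequency][time].
    kernel: list of rows with odd height and odd width.
    dilation: (d_freq, d_time) spacing between kernel taps.
    """
    n_f, n_t = len(spec), len(spec[0])
    k_f, k_t = len(kernel), len(kernel[0])
    d_f, d_t = dilation
    out = [[0.0] * n_t for _ in range(n_f)]
    for f in range(n_f):
        for t in range(n_t):
            acc = 0.0
            for i in range(k_f):
                ff = f + (i - k_f // 2) * d_f
                if not 0 <= ff < n_f:
                    continue  # tap falls in the zero padding
                for j in range(k_t):
                    tt = t + (j - k_t // 2) * d_t
                    if 0 <= tt < n_t:
                        acc += kernel[i][j] * spec[ff][tt]
            out[f][t] = acc
    return out


# Hypothetical fixed kernels; a trained module would learn these weights.
HORIZONTAL = [[1 / 3, 1 / 3, 1 / 3]]    # 1x3, spans the time axis
VERTICAL = [[1 / 3], [1 / 3], [1 / 3]]  # 3x1, spans the frequency axis


def dual_direction_features(spec, dilations=(1, 2)):
    """Return one feature map per dilation for each kernel orientation."""
    horizontal = [dilated_conv2d(spec, HORIZONTAL, (1, d)) for d in dilations]
    vertical = [dilated_conv2d(spec, VERTICAL, (d, 1)) for d in dilations]
    return horizontal, vertical


# Toy spectrogram: 5 frequency bins x 7 frames, one steady tone at bin 2.
spec = [[1.0 if f == 2 else 0.0 for _ in range(7)] for f in range(5)]
horizontal, vertical = dual_direction_features(spec)
print(horizontal[1][2][3], vertical[1][2][3])
```

On the steady tone, all three taps of the time-dilated kernel land on energy, whereas only the centre tap of the frequency-dilated kernel does, so the horizontal branch responds roughly three times more strongly at that point.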
Training-free Lossless Boost (TLB): A dual-view inference technique that recalibrates the feature scales (b, s) based on the spectrogram's anisotropic properties to refine generation quality without additional training.
Below is a direct comparison of DVPD outputs against the noisy inputs and clean references for both Universal Speech Enhancement (USE) and single-modality denoising. We specifically highlight examples where the enhanced PESQ is below 2, since these challenging cases show the most significant qualitative improvements enabled by our TLB strategy.
The following figure demonstrates the denoising performance from dual perspectives. The red boxes highlight the most significant improvements in structural integrity and noise suppression achieved after applying the TLB strategy.
Acknowledgements: Template inspired by Colorful Image Colorization. Audio samples are from the WSJ0-UNI and DEMAND datasets.