Dual-View Predictive Diffusion: Lightweight Speech Enhancement via Spectrogram-Image Synergy

Anonymous Authors

━ Submitted to ICML 2026 ━

We propose DVPD, an extremely lightweight (1.9M parameters) predictive diffusion model that uniquely exploits the dual nature of spectrograms as both visual textures and acoustic representations. DVPD achieves SOTA performance across universal and task-oriented SE benchmarks, while the training-free TLB strategy further elevates restoration quality with zero additional overhead.

Abstract

Diffusion models have recently set new benchmarks in Speech Enhancement (SE). However, most existing score-based models treat speech spectrograms merely as generic 2D images, applying uniform processing that ignores the intrinsic structural sparsity of audio, which results in inefficient spectral representation and prohibitive computational complexity. To bridge this gap, we propose DVPD, an extremely lightweight Dual-View Predictive Diffusion model, which uniquely exploits the dual nature of spectrograms as both visual textures and physical frequency-domain representations across both training and inference stages. Specifically, during training, we optimize spectral utilization via the Frequency-Adaptive Non-uniform Compression (FANC) encoder, which preserves critical low-frequency harmonics while pruning high-frequency redundancies. Simultaneously, we introduce a Lightweight Image-based Spectro-Awareness (LISA) module to capture features from a visual perspective with minimal overhead. During inference, we propose a Training-free Lossless Boost (TLB) strategy that leverages the same dual-view priors to refine generation quality without any additional fine-tuning. Extensive experiments across various benchmarks demonstrate that DVPD achieves state-of-the-art performance while requiring only 35% of the parameters and 40% of the inference MACs compared to the SOTA lightweight model, PGUSE. These results highlight DVPD's superior ability to balance high-fidelity speech quality with extreme architectural efficiency.

Overall Architecture

Architectural overview of the proposed DVPD (A), including (B) the FANC Encoder, (C) the Enhancement Network, (D) the Interaction Module, and the FANC Decoder.

Methodology Overview

FANC Encoder

Frequency-Adaptive Non-uniform Compression: Exploits the non-uniform information density of speech spectrograms by preserving low-frequency harmonics (0–2 kHz) while pruning high-frequency redundancies (>4 kHz), in line with human auditory resolution.
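The idea of non-uniform frequency compression can be illustrated with a small NumPy sketch. It is not the FANC encoder itself, which is a learned module; it only shows fixed average pooling at band-dependent rates. The function name `nonuniform_compress`, the 16 kHz sample rate, the 256-bin layout, the pooling factors, and the treatment of the 2–4 kHz band (pooled by 2) are all assumptions made for illustration.

```python
import numpy as np

def nonuniform_compress(spec, sample_rate=16000,
                        edges_hz=(2000, 4000), factors=(1, 2, 4)):
    """Average-pool a (n_bins, n_frames) spectrogram along frequency.

    Hypothetical illustration: bins below 2 kHz are kept as-is (factor 1),
    2-4 kHz is pooled by 2, and everything above 4 kHz is pooled by 4.
    The band edges and factors are assumed, not taken from the paper.
    """
    n_bins = spec.shape[0]
    nyquist = sample_rate / 2
    # Map band edges in Hz to bin indices on a linear frequency axis.
    cuts = [0] + [int(f / nyquist * n_bins) for f in edges_hz] + [n_bins]
    bands = []
    for lo, hi, k in zip(cuts[:-1], cuts[1:], factors):
        band = spec[lo:hi]
        # Pad by edge replication so the band height is a multiple of k.
        pad = (-band.shape[0]) % k
        if pad:
            band = np.pad(band, ((0, pad), (0, 0)), mode="edge")
        # Group k consecutive bins and average them.
        bands.append(band.reshape(-1, k, band.shape[1]).mean(axis=1))
    return np.concatenate(bands, axis=0)

# Usage: 256 frequency bins, 100 frames of synthetic magnitudes.
spec = np.abs(np.random.default_rng(0).standard_normal((256, 100)))
compressed = nonuniform_compress(spec)
print(compressed.shape)  # (128, 100)
```

With these assumed settings the low band keeps all 64 of its bins, while the 192 bins above 2 kHz shrink to 64, so half of the frequency axis is removed without touching the harmonic region.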

LISA Module

Lightweight Image-based Spectro-Awareness: Captures anisotropic features, such as horizontal harmonics and vertical transients, using multi-scale heterogeneous dilated kernels with minimal overhead.
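To make the notion of anisotropic, dilated kernels concrete, the sketch below applies a 1×3 kernel along time and a 3×1 kernel along frequency, each at two dilation rates. It is only an illustration under stated assumptions: the kernels are fixed averaging filters standing in for learned weights, and the helper names `dilated_conv2d` and `anisotropic_features`, the kernel sizes, and the dilation rates are hypothetical rather than taken from LISA.

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation=(1, 1)):
    """Same-size 2D cross-correlation with zero padding (odd kernel sizes)."""
    kh, kw = kernel.shape
    dh, dw = dilation
    ph, pw = dh * (kh // 2), dw * (kw // 2)
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    h, w = x.shape
    out = np.zeros((h, w), dtype=float)
    # Accumulate one shifted copy of the input per kernel tap.
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * xp[i * dh:i * dh + h, j * dw:j * dw + w]
    return out

def anisotropic_features(spec):
    """Stack four branches over a (freq, time) spectrogram.

    Branches 0-1 look along time (horizontal structure, e.g. harmonics);
    branches 2-3 look along frequency (vertical structure, e.g. transients).
    """
    horiz = np.full((1, 3), 1.0 / 3.0)  # assumed stand-in for learned weights
    vert = np.full((3, 1), 1.0 / 3.0)
    return np.stack([
        dilated_conv2d(spec, horiz, (1, 1)),
        dilated_conv2d(spec, horiz, (1, 2)),
        dilated_conv2d(spec, vert, (1, 1)),
        dilated_conv2d(spec, vert, (2, 1)),
    ])

# Usage: a single horizontal line, mimicking one harmonic.
spec = np.zeros((8, 12))
spec[4, :] = 1.0
feats = anisotropic_features(spec)
print(feats.shape)  # (4, 8, 12)
```

On this toy input the time-oriented branches keep the harmonic at full strength in the interior of the line, while the frequency-oriented branches dilute it to one third, which is the kind of direction-dependent response the module description refers to.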

TLB Strategy

Training-free Lossless Boost: A dual-view inference technique that recalibrates feature scales (b, s) based on the spectrogram's anisotropic properties to refine generation quality without additional training.
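The description above does not say where the scales (b, s) are applied or how the spectrogram's anisotropy determines them. Purely as an illustration of what inference-time feature rescaling can look like, the sketch below assumes that b multiplies the main-path (backbone) features and s multiplies the skip-connection features before the two are merged. That reading, the function name `recalibrate`, and the example values are assumptions, not the TLB procedure.

```python
import numpy as np

def recalibrate(backbone, skip, b=1.0, s=1.0):
    """Rescale two feature maps and concatenate them on the channel axis.

    Hypothetical reading of (b, s): b scales backbone features and s scales
    skip features. No parameters are learned; b = s = 1 reproduces the
    unmodified merge.
    """
    return np.concatenate([b * backbone, s * skip], axis=0)

# Usage: feature maps shaped (channels, freq, time).
rng = np.random.default_rng(0)
backbone = rng.standard_normal((4, 16, 20))
skip = rng.standard_normal((4, 16, 20))

baseline = recalibrate(backbone, skip)               # b = s = 1
boosted = recalibrate(backbone, skip, b=1.2, s=0.9)  # example values only
print(boosted.shape)  # (8, 16, 20)
```

Because the rescaling is a pair of scalar multiplications applied only at inference, it adds no trainable parameters, and the model weights are left untouched.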

Detailed Components

Figure 3: Detailed architecture of the LISA module.

Figure 4: Detailed architecture of the TLB strategy.

Audio Samples Comparison

Below, we compare DVPD against the noisy inputs and clean references on both Universal Speech Enhancement (USE) and single-modality denoising. We highlight examples where the PESQ of the enhanced output is below 2, since these challenging cases show the most significant qualitative improvements from our TLB strategy.

[Audio table. Columns: Noisy Input, DVPD (ours), DVPD + TLB (ours), Clean Reference. Rows: Sample 1 (USE), Sample 2 (USE), Sample 3 (Denoise), Sample 4 (Denoise).]

Visual Comparison (Spectrogram)

The following figure demonstrates the denoising performance from dual perspectives. The red boxes highlight the most significant improvements in structural integrity and noise suppression achieved after applying the TLB strategy.

Acknowledgements

Template inspired by Colorful Image Colorization. Audio samples are from WSJ0-UNI and DEMAND datasets.