
Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

Hau-Shiang Shiu1  Chin-Yang Lin1  Zhixiang Wang2
Chi-Wei Hsiao3  Po-Fan Yu1  Yu-Chih Chen1  Yu-Lun Liu1
1National Yang Ming Chiao Tung University  2Shanda AI Research Tokyo  3MediaTek Inc.
Abstract

Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX 4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP [99], it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130×. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: https://jamichss.github.io/stream-diffvsr-project-page/

Figure 1:Comparison of visual quality and inference speed across various categories of VSR methods. Stream-DiffVSR achieves superior perceptual quality (lower LPIPS) and maintains comparable runtime to CNN- and Transformer-based online models, while also demonstrating significantly reduced inference latency compared to existing offline approaches. Best and second-best results are marked in red and green.

1 Introduction

Video super-resolution (VSR) aims to reconstruct high-resolution (HR) videos from low-resolution (LR) inputs and is vital in applications such as surveillance, live broadcasting, video conferencing, autonomous driving, and drone imaging. It is increasingly important in low-latency rendering workflows, including neural rendering and resolution upscaling in game engines and AR/VR systems, where latency-aware processing is crucial for visual continuity.

Specifically, latency-sensitive processing involves two key aspects: per-frame inference time (throughput) and end-to-end system latency (delay between receiving an input frame and producing its output). Existing VSR methods often struggle with this trade-off. While CNN- and Transformer-based models offer a balance between efficiency and quality, they fall short in perceptual detail. Diffusion-based models excel in perceptual quality due to strong generative priors, but suffer from high computational cost and reliance on future frames, making them impractical for time-sensitive video applications.

In this paper, we propose Stream-DiffVSR, a diffusion-based method specifically tailored to online video super-resolution, effectively bridging the gap between high-quality but slow diffusion methods and fast but lower quality CNN- or Transformer-based methods. Unlike previous diffusion-based VSR approaches (e.g., StableVSR [58] and MGLD-VSR [89]) that typically require 50 or more denoising steps and bidirectional temporal information, our method leverages diffusion model distillation to significantly accelerate inference by reducing denoising steps to just four. Additionally, we introduce an Auto-regressive Temporal Guidance mechanism and an Auto-regressive Temporal-aware Decoder to effectively exploit temporal information from previous frames, significantly enhancing temporal consistency and perceptual fidelity.

Fig. 1 illustrates the core advantage of our approach by comparing visual quality and runtime across various categories of video super-resolution methods. Our Stream-DiffVSR achieves superior perceptual quality (measured by LPIPS [96]) and temporal consistency, outperforming existing unidirectional CNN- and Transformer-based methods (e.g., MIA-VSR [105], RealViformer [98], TMP [99]). Notably, Stream-DiffVSR offers significantly faster per-frame inference than prior diffusion-based approaches (e.g., StableVSR [58], MGLD-VSR [89]), attributed to our use of a distilled 4-step denoising process and a lightweight temporal-aware decoder.

In addition, existing diffusion-based methods, such as StableVSR [58], typically rely on bidirectional or future-frame information, resulting in prohibitively high processing latency that is unsuitable for online scenarios. Specifically, for a 100-frame video, StableVSR (46.2 s/frame) would incur an initial latency exceeding 4600 seconds on an RTX 4090 GPU, as it requires processing the entire sequence before generating even the first output frame. In contrast, our Stream-DiffVSR operates in a strictly causal, auto-regressive manner, conditioning only on the immediately preceding frame. Consequently, the initial frame latency of Stream-DiffVSR corresponds to a single frame’s inference time (0.328 s/frame), reducing the latency by more than three orders of magnitude compared to StableVSR. This significant latency reduction demonstrates that Stream-DiffVSR effectively unlocks the potential of diffusion models for practical, low-latency online video super-resolution.
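For concreteness, the arithmetic behind these latency figures, using the per-frame runtimes quoted above, is:

100 \text{ frames} \times 46.2\,\text{s/frame} = 4620\,\text{s}, \qquad 4620\,\text{s} \,/\, 0.328\,\text{s} \approx 1.4 \times 10^{4},

i.e., the first output frame arrives more than three orders of magnitude sooner in the streaming setting.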

Table 1: Comparison of diffusion-based VSR methods. We report online capability, inference steps, runtime (FPS on 720p, RTX 4090), maximum end-to-end latency (sec), and whether each method uses distillation, temporal modeling, or offline future frames. OOM denotes out-of-memory, and - indicates missing public inference results. Notably, Stream-DiffVSR is the only diffusion-based method that runs in a strictly online, past-only setting with the lowest latency.

To summarize, the main contributions of this paper are:

  • We introduce the first diffusion-based framework explicitly designed for online, low-latency video super-resolution, achieving efficient inference through distillation from 50 denoising steps down to 4 steps.

  • We propose a novel Auto-regressive Temporal Guidance mechanism and a Temporal-aware Decoder to effectively leverage temporal information only from past frames, significantly enhancing perceptual quality and temporal consistency.

  • Extensive experiments demonstrate that our approach outperforms existing methods across key perceptual and temporal consistency metrics while achieving practical inference speeds, thereby making diffusion-based VSR applicable for real-world online scenarios.

To contextualize our contributions, Table 1 compares recent diffusion-based VSR methods in terms of online inference capability, runtime efficiency, and temporal modeling. Our method uniquely achieves online low-latency inference while preserving high visual quality and temporal stability. This substantial latency reduction of over three orders of magnitude compared to prior diffusion-based VSR models demonstrates that Stream-DiffVSR is uniquely suited for low-latency online applications such as video conferencing and AR/VR.

2 Related Work

Video Super-resolution.

VSR methods reconstruct high-resolution videos from low-resolution inputs through CNN-based approaches [87, 70, 79, 4, 5, 68], deformable convolutions [70, 12, 107], online processing [99], recurrent architectures [61, 15, 25, 91, 33], flow-guided methods [92, 19, 43], and Transformer-based models [73, 37, 36, 63, 105]. Despite advances, low-latency online processing remains challenging.

Real-world Video Super-resolution.

Real-world VSR addresses unknown degradations [88, 6] through pre-cleaning modules [6, 18, 80, 44], online approaches [98], kernel estimation [54, 28], synthetic degradations [27, 65, 97, 7], new benchmarks [102, 11], real-time systems [3], advanced GANs [9, 72], and Transformer restorers [93, 35, 2]. Warp error-aware consistency [31] emphasizes temporal error regularisation.

Diffusion-based Image and Video Restoration.

Diffusion models provide powerful generative priors [55, 14, 8] for single-image SR [60, 32, 24], inpainting [47, 81, 40, 71], and quality enhancement [23, 16, 77]. Video diffusion methods include StableVSR [58], MGLD-VSR [89], DC-VSR [20], DOVE [10], UltraVSR [42], Upscale-A-Video [104], DiffVSR [34], DiffIR2VR-Zero [90], VideoGigaGAN [86], VEnhancer [21], temporal coherence [76], AVID [100], and SeedVR2 [78]. Auto-regressive approaches [67, 84, 39, 101] show promise. Acceleration techniques include consistency models [48, 17], advanced solvers [46, 45, 103], flow-based methods [41, 29], distillation [62, 50, 106, 85, 108], and efficient architectures [1]. Theoretical advances [74, 75] and recent image/offline distillation methods [66, 94, 82, 83] exist, but our Stream-DiffVSR uniquely applies distillation in strict online settings with causal temporal modeling for real-time VSR.

3 Method

We propose Stream-DiffVSR, a streamable auto-regressive diffusion framework for efficient video super-resolution (VSR). Its core innovation lies in an auto-regressive formulation that improves both temporal consistency and inference speed. The framework comprises: (1) a distilled few-step U-Net for accelerated diffusion inference, (2) Auto-regressive Temporal Guidance that conditions latent denoising on previously warped high-quality frames, and (3) an Auto-regressive Temporal-aware Decoder that explicitly incorporates temporal cues. Together, these components enable Stream-DiffVSR to produce stable and perceptually coherent videos.

3.1 Diffusion Models Preliminaries

Diffusion Models [22] transform complex data distributions into simpler Gaussian distributions via a forward diffusion process and reconstruct the original data using a learned reverse denoising process. The forward process gradually adds Gaussian noise to the initial data $x_0$, forming a Markov chain: $q(x_t \mid x_{t-1}) = \mathcal{N}\bigl(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I\bigr)$ for $t = 1, \dots, T$, where $\beta_t$ denotes a predefined noise schedule. At timestep $t$, the noised data $x_t$ can be sampled directly from the clean data $x_0$ as $x_t = \sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $\alpha_t = \prod_{i=1}^{t}(1-\beta_i)$. The reverse process progressively removes noise from $x_T$, reconstructing the original data $x_0$ through a learned denoising operation modeled as a Markov chain, i.e., $p_\theta(x_0, \dots, x_{T-1} \mid x_T) = \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$. Each individual step is parameterized by a neural-network-based denoising function $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\bigl(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(t) I\bigr)$. Typically, the network predicts the noise component $\epsilon_\theta(x_t, t)$, from which the denoising mean is estimated as $\mu_\theta(x_t, t) = \tfrac{1}{\sqrt{\alpha_t}}\bigl(x_t - \tfrac{1-\alpha_t}{\sqrt{1-\alpha_t}}\,\epsilon_\theta(x_t, t)\bigr)$. Latent Diffusion Models (LDMs) [57] further reduce computational complexity by projecting data into a lower-dimensional latent space using Variational Autoencoders (VAEs), significantly accelerating inference without sacrificing generative quality.
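As a concrete illustration of the closed-form forward sampling above, a minimal PyTorch sketch is shown below; the linear beta schedule and variable names are illustrative assumptions, not the schedule used by any specific pretrained model.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # example noise schedule beta_t (assumption)
alphas = torch.cumprod(1.0 - betas, dim=0)       # alpha_t = prod_{i<=t} (1 - beta_i)

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample the noised data x_t directly from the clean data x_0 at timestep t."""
    eps = torch.randn_like(x0)                   # epsilon ~ N(0, I)
    a = alphas[t]
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps
```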

3.2 U-Net Rollout Distillation

We distill a pre-trained Stable Diffusion (SD) ×4 Upscaler [57, 56], originally designed for 50-step inference, into a 4-step variant that balances speed and perceptual quality. To mitigate the training–inference gap of timestep-sampling distillation, we adopt rollout distillation, where the U-Net performs the full 4-step denoising each iteration to obtain a clean latent. Detailed algorithms and implementation are provided in the supplementary material due to page limits.

Unlike conventional distillation that supervises random intermediate timesteps, our method applies loss only on the final denoised latent, ensuring the training trajectory mirrors inference and improving stability and alignment.

Our distillation requires no architectural changes. We train the U-Net by optimizing latent reconstruction with a loss that balances spatial accuracy, perceptual fidelity, and realism:

\mathcal{L}_{\text{distill}} = \|\mathbf{z}_{\text{den}} - \mathbf{z}_{\text{gt}}\|_{2}^{2} + \lambda_{\text{LPIPS}} \cdot \text{LPIPS}\bigl(D(\mathbf{z}_{\text{den}}), \mathbf{x}_{\text{gt}}\bigr) + \lambda_{\text{GAN}} \cdot \mathcal{L}_{\text{GAN}}\bigl(D(\mathbf{z}_{\text{den}})\bigr),   (1)

where $\mathbf{z}_{\text{den}}$ and $\mathbf{z}_{\text{gt}}$ are the denoised and ground-truth latent representations. The decoder $D(\cdot)$ maps latent features back to RGB space for the perceptual (LPIPS) and adversarial (GAN) loss terms, encouraging visually realistic outputs.
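Below is a hedged sketch of one rollout-distillation iteration under Eq. (1), assuming the 4-step deterministic DDIM schedule {999, 749, 499, 249} reported in the supplementary and a diffusers-style scheduler.step(...).prev_sample API; unet, decoder, scheduler, lpips_fn, and gan_loss are placeholder callables, and the default weights mirror the values listed in the supplementary.

```python
import torch

ROLLOUT_STEPS = [999, 749, 499, 249]             # full 4-step trajectory, noisiest first

def rollout_distill_step(unet, decoder, scheduler, z_T, cond, z_gt, x_gt,
                         lpips_fn, gan_loss, lam_lpips=0.5, lam_gan=0.025):
    # cond: low-resolution conditioning input to the upscaler U-Net (interface assumed)
    z = z_T
    for t in ROLLOUT_STEPS:                      # run the same 4 steps used at inference
        eps = unet(z, t, cond)                   # predict noise at timestep t
        z = scheduler.step(eps, t, z).prev_sample  # deterministic DDIM update (assumed API)
    z_den = z                                    # supervise only the final denoised latent
    x_den = decoder(z_den)                       # decode to RGB for perceptual/adversarial terms
    loss = torch.mean((z_den - z_gt) ** 2)                    # latent MSE
    loss = loss + lam_lpips * lpips_fn(x_den, x_gt).mean()    # LPIPS term
    loss = loss + lam_gan * gan_loss(x_den)                   # adversarial term
    return loss
```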

Figure 2: Overview of Auto-regressive Temporal-aware Decoder. Given the denoised latent and warped previous frame, our decoder enhances temporal consistency using temporal processor modules. This module aligns and fuses these features via interpolation, convolution, and weighted fusion, effectively stabilizing detail reconstruction when decoding into the final RGB frame.

3.3 Auto-regressive Temporal Guidance

Leveraging temporal information is essential for capturing dynamics and ensuring frame continuity in video super-resolution. However, extensive temporal reasoning often incurs significant computational overhead, increasing per-frame inference time and system latency. Thus, efficient online VSR requires carefully balancing temporal utilization and computational cost to support low-latency processing.

To this end, we propose Auto-regressive Temporal Guidance (ARTG), which enforces temporal coherence during latent denoising. At each timestep $t$, the U-Net takes both the current noised latent $z_t$ and the warped RGB frame from the previous output, $\hat{x}_{t-1}^{\text{warp}} = \text{Warp}(x_{t-1}^{\text{SR}}, f_{t\leftarrow t-1})$, where $f_{t\leftarrow t-1}$ is the optical flow from frame $t-1$ to $t$. The denoising prediction is then formulated as:

\hat{\epsilon}_{\theta} = \text{UNet}(z_t, t, \hat{x}_{t-1}^{\text{warp}}),   (2)

where the warped image $\hat{x}_{t-1}^{\text{warp}}$ serves as a temporal conditioning input to guide the denoising process.
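The warping operator in Eq. (2) can be realized with a standard backward warp; the sketch below shows one such grid_sample-based implementation under the assumption of an (N, 2, H, W) flow convention, with flow estimation and the U-Net left as placeholders.

```python
import torch
import torch.nn.functional as F

def flow_warp(img: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `img` (N, C, H, W) with optical flow `flow` (N, 2, H, W)."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img)       # base sampling grid (2, H, W)
    coords = grid.unsqueeze(0) + flow                          # displaced sampling coordinates
    # normalize coordinates to [-1, 1] for grid_sample
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_n = torch.stack((coords_x, coords_y), dim=-1)         # (N, H, W, 2)
    return F.grid_sample(img, grid_n, align_corners=True)

def artg_predict(unet, z_t, t, x_prev_sr, flow_prev_to_cur):
    x_warp = flow_warp(x_prev_sr, flow_prev_to_cur)            # \hat{x}_{t-1}^{warp}
    return unet(z_t, t, x_warp)                                # Eq. (2)
```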

We train the ARTG module independently using consecutive pairs of low-quality and high-quality frames. The denoising U-Net and decoder are kept fixed during this stage, and the training objective focuses on reconstructing the target latent representation while preserving perceptual quality and visual realism. The total loss function is defined as:

\mathcal{L}_{\text{ARTG}} = \|\mathbf{z}_{\text{den}} - \mathbf{z}_{\text{gt}}\|_{2}^{2} + \lambda_{\text{LPIPS}} \cdot \text{LPIPS}\bigl(D(\mathbf{z}_{\text{den}}), \mathbf{x}_{\text{gt}}\bigr) + \lambda_{\text{GAN}} \cdot \mathcal{L}_{\text{GAN}}\bigl(D(\mathbf{z}_{\text{den}})\bigr),   (3)

where $\mathbf{z}_{\text{den}}$ denotes the denoised latent obtained from DDIM updates with the predicted noise $\hat{\epsilon}_{\theta}$, and $\mathbf{z}_{\text{gt}}$ is the ground-truth latent. The decoder $D(\cdot)$ maps latents to RGB, producing $D(\mathbf{z}_{\text{den}})$ for comparison with the ground-truth image $\mathbf{x}_{\text{gt}}$. The latent $\ell_2$ loss enforces alignment, the perceptual loss preserves visual fidelity, and the adversarial loss promotes realism. This design leverages only past frames to propagate temporal context, improving consistency without additional latency.
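For completeness, the deterministic (eta = 0) DDIM update that produces $\mathbf{z}_{\text{den}}$ from the predicted noise can be sketched as below, reusing the cumulative $\alpha_t$ of Sec. 3.1; treating a negative $t_{\text{prev}}$ as the final step is an assumption of this sketch.

```python
import torch

def ddim_step(z_t, eps_pred, t: int, t_prev: int, alphas):
    """One deterministic DDIM update from timestep t to t_prev (alphas: cumulative products)."""
    a_t = alphas[t]
    a_prev = alphas[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    z0_pred = (z_t - (1.0 - a_t).sqrt() * eps_pred) / a_t.sqrt()     # predicted clean latent
    return a_prev.sqrt() * z0_pred + (1.0 - a_prev).sqrt() * eps_pred
```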

Figure 3: Training pipeline of Stream-DiffVSR. The training process consists of three sequential stages: (1) Distilling the denoising U-Net to reduce diffusion steps while maintaining perceptual quality with training objective (1); (2) Training the Temporal Processor Module (TPM) within the decoder to enhance temporal consistency at the RGB level with training objective (6); (3) Training the Auto-Regressive Temporal Guidance (ARTG) module to leverage previously restored high-quality frames for improved temporal coherence with training objective (3). Each module is trained separately before integrating them into the final framework.
Figure 4: Overview of our pipeline. Given a low-quality (LQ) input frame, we first initialize its latent representation and employ an auto-regressive diffusion model composed of a distilled denoising U-Net, Auto-regressive Temporal Guidance, and an Auto-regressive Temporal-aware Decoder. The temporal guidance utilizes flow-warped high-quality (HQ) results from the previous frame to condition the current frame’s latent denoising and decoding, significantly improving perceptual quality and temporal consistency in an efficient, online manner.

3.4 Auto-regressive Temporal-aware Decoder

Although the Auto-regressive Temporal Guidance (ARTG) improves temporal consistency in the latent space, the features produced by the Stable Diffusion ×4 Upscaler remain at one-quarter of the target resolution. This mismatch may introduce decoding artifacts or misalignment in dynamic scenes.

To address this issue, we propose an Auto-regressive Temporal-aware Decoder that incorporates temporal context into decoding to enhance spatial fidelity and temporal consistency. At timestep $t$, the decoder takes the denoised latent $\mathbf{z}_t^{\text{den}}$ and the aligned feature $\hat{\mathbf{f}}_{t-1}$ derived from the previous super-resolved frame. Specifically, we compute:

\hat{\mathbf{x}}_{t-1}^{\text{warp}} = \text{Warp}(\mathbf{x}_{t-1}^{\text{SR}}, f_{t\leftarrow t-1}), \qquad \hat{\mathbf{f}}_{t-1} = \text{Enc}(\hat{\mathbf{x}}_{t-1}^{\text{warp}}),   (4)

where $\mathbf{x}_{t-1}^{\text{SR}}$ is the previously generated RGB output, $f_{t\leftarrow t-1}$ is the optical flow from frame $t-1$ to $t$, and $\text{Enc}(\cdot)$ is a frozen encoder that projects the warped image into the latent feature space.

The decoder then synthesizes the current frame using:

\mathbf{x}_t^{\text{SR}} = \text{Decoder}(\mathbf{z}_t^{\text{den}}, \hat{\mathbf{f}}_{t-1}).   (5)

We adopt a multi-scale fusion strategy inside the decoder to combine current spatial information and prior temporal features across multiple resolution levels, as illustrated in Fig. 2. This design helps reinforce temporal coherence while recovering fine spatial details.
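As a rough illustration, one such fusion unit (the Temporal Processor Module described next) could align warped previous-frame features to the current decoder scale via interpolation and convolution, then blend them with a learned per-pixel weight, following the description in Fig. 2; the channel sizes and gating design below are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalFusionBlock(nn.Module):
    """Hypothetical TPM-style unit: align warped previous-frame features and fuse them."""

    def __init__(self, cur_ch: int, prev_ch: int):
        super().__init__()
        self.align = nn.Conv2d(prev_ch, cur_ch, kernel_size=3, padding=1)
        self.gate = nn.Sequential(
            nn.Conv2d(2 * cur_ch, cur_ch, kernel_size=3, padding=1),
            nn.Sigmoid(),                                   # per-pixel fusion weight in [0, 1]
        )

    def forward(self, feat_cur: torch.Tensor, feat_prev: torch.Tensor) -> torch.Tensor:
        # interpolate warped previous-frame features to the current decoder scale
        feat_prev = F.interpolate(feat_prev, size=feat_cur.shape[-2:],
                                  mode="bilinear", align_corners=False)
        feat_prev = self.align(feat_prev)                   # channel alignment via convolution
        w = self.gate(torch.cat([feat_cur, feat_prev], dim=1))
        return w * feat_cur + (1.0 - w) * feat_prev         # weighted fusion
```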

Temporal Processor Module (TPM).

We integrate TPM after each spatial convolutional layer in the decoder to explicitly inject temporal coherence, enhancing stability and continuity of reconstructed frames. These modules utilize latent features from the current frame and warped features from the previous frame, optimizing temporal consistency independently from spatial reconstruction. Our training objective for the TPM is defined as:

\mathcal{L}_{\text{TPM}} = \mathcal{L}_{\text{rec}}(\mathbf{x}_{t}^{\text{rec}}, \mathbf{x}_{t}^{\text{GT}}) + \lambda_{\text{flow}} \bigl\| \text{OF}(\mathbf{x}_{t}^{\text{rec}}, \mathbf{x}_{t-1}^{\text{rec}}) - \text{OF}(\mathbf{x}_{t}^{\text{GT}}, \mathbf{x}_{t-1}^{\text{GT}}) \bigr\|_{2}^{2} + \lambda_{\text{GAN}}\,\mathcal{L}_{\text{GAN}}(\mathbf{x}_{t}^{\text{rec}}) + \lambda_{\text{LPIPS}}\,\text{LPIPS}(\mathbf{x}_{t}^{\text{rec}}, \mathbf{x}_{t}^{\text{GT}}),   (6)

where $\mathbf{x}_{t}^{\text{rec}} \in \mathbb{R}^{3\times H\times W}$ is the predicted frame at time $t$, and $\mathbf{x}_{t}^{\text{GT}}$ is the ground-truth frame. The reconstruction loss $\mathcal{L}_{\text{rec}} = \text{SmoothL1}(\mathbf{x}_{t}^{\text{rec}}, \mathbf{x}_{t}^{\text{GT}})$ enforces spatial fidelity, the adversarial loss $\mathcal{L}_{\text{GAN}}$ improves realism, and the optical-flow term $\text{OF}(\cdot,\cdot)$ reduces temporal discrepancies, yielding consistent and perceptually faithful outputs.
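A schematic implementation of Eq. (6) is given below; of_fn (e.g., a RAFT wrapper), gan_loss, and lpips_fn are placeholder callables, and the default weights are taken from the supplementary as an assumption.

```python
import torch
import torch.nn.functional as F

def tpm_loss(x_rec, x_rec_prev, x_gt, x_gt_prev, of_fn, gan_loss, lpips_fn,
             lam_flow=0.1, lam_gan=0.025, lam_lpips=0.3):
    loss = F.smooth_l1_loss(x_rec, x_gt)                              # reconstruction term
    flow_rec = of_fn(x_rec, x_rec_prev)                               # OF(x_t^rec, x_{t-1}^rec)
    flow_gt = of_fn(x_gt, x_gt_prev)                                  # OF(x_t^GT, x_{t-1}^GT)
    loss = loss + lam_flow * torch.mean((flow_rec - flow_gt) ** 2)    # temporal (flow) term
    loss = loss + lam_gan * gan_loss(x_rec)                           # adversarial term
    loss = loss + lam_lpips * lpips_fn(x_rec, x_gt).mean()            # perceptual term
    return loss
```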

Table 2: Quantitative comparison against bidirectional/offline methods on the REDS4 dataset. We compare CNN-, Transformer-, and diffusion-based methods on REDS4. Stream-DiffVSR achieves superior perceptual and temporal quality with high stability across sequences. ↑ indicates higher is better; ↓ indicates lower is better. Dir. denotes temporal direction: B for bidirectional/offline, U for unidirectional/online. Runtime is measured per 720p frame on an RTX 4090. Latency-max denotes the maximum end-to-end latency measured over 100-frame video sequences, providing a fair comparison with offline methods whose initial delay scales with sequence length. tLP and tOF are scaled by 100× and 10×. Best and second-best results are marked in red and blue.
Table 3: Quantitative comparison against unidirectional/online methods on the REDS4 dataset.

3.5 Training and Inference Stages

Our training pipeline consists of three independent stages (Fig. 3). The inference process is illustrated in Fig. 4, and the Auto-Regressive Diffusion-based VSR algorithm is detailed in the appendix due to page constraints.

Distilling the Denoising U-Net.

We first distill the denoising U-Net using pairs of low-quality (LQ) and high-quality (HQ) frames to optimize per-frame super-resolution and latent-space consistency.

Training the Temporal Processor Module (TPM).

In parallel, we train the Temporal Processor Module (TPM) in the decoder using ground-truth frames, keeping all other weights fixed. This enhances the decoder’s capability to incorporate temporal information into the final RGB reconstruction.

Training Auto-regressive Temporal Guidance.

After training and freezing the U-Net and decoder, we train the ARTG, which leverages flow-aligned previous outputs to enhance temporal coherence without degrading spatial quality. This staged training strategy progressively refines spatial fidelity, latent consistency, and temporal smoothness in a decoupled manner.

Inference.

Given a sequence of low-quality (LQ) frames, our method auto-regressively generates high-quality (HQ) outputs. For each frame $t$, denoising is conditioned on the previous output $HQ_{t-1}$, warped via optical flow to capture temporal motion. To balance quality and efficiency, we employ a 4-step DDIM scheme using a distilled U-Net. By combining motion alignment with reduced denoising steps, our inference pipeline achieves efficient and stable temporal consistency.
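The loop below sketches this streaming inference procedure under stated assumptions: a diffusers-style DDIM scheduler API, a placeholder flow_net estimating flow between consecutive LQ frames (resolution handling omitted), and the flow_warp helper from the ARTG sketch in Sec. 3.3; all other modules are placeholders as well.

```python
import torch

DDIM_STEPS = [999, 749, 499, 249]

@torch.no_grad()
def stream_vsr(lq_frames, encode_lq, unet, scheduler, decoder, frozen_enc, flow_net):
    outputs, prev_hq, prev_lq = [], None, None
    for x_lq in lq_frames:                       # strictly causal: past frames only
        z = encode_lq(x_lq)                      # initialize the latent from the LQ frame
        if prev_hq is None:
            cond, f_prev = None, None            # first frame: no temporal context available
        else:
            flow = flow_net(prev_lq, x_lq)       # flow from frame t-1 to t (placeholder)
            cond = flow_warp(prev_hq, flow)      # warped previous HQ output (ARTG input)
            f_prev = frozen_enc(cond)            # temporal feature for the decoder
        for t in DDIM_STEPS:                     # distilled 4-step DDIM denoising
            eps = unet(z, t, cond)
            z = scheduler.step(eps, t, z).prev_sample
        x_hq = decoder(z, f_prev)                # temporal-aware decoding (Eq. 5)
        outputs.append(x_hq)
        prev_hq, prev_lq = x_hq, x_lq
    return outputs
```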

4 Experiment

Due to space limitations, we provide the experimental setup in the appendix.

Table 4: Quantitative comparison against bidirectional/offline methods on the Vimeo-90K-T dataset. Stream-DiffVSR surpasses other bidirectional methods in perceptual quality, temporal consistency, and runtime. Runtime is the average per-frame inference time (seconds) on 448×256 videos using an RTX 4090. Best and second-best results are shown in red and blue.
Table 5: Quantitative comparison against unidirectional/online methods on the Vimeo-90K-T dataset.
Table 6:Memory and inference speed comparison on NVIDIA A6000. OOM = out of memory. Our method achieves the lowest memory footprint, fastest runtime, and lowest latency.

We quantitatively compare Stream-DiffVSR with state-of-the-art VSR methods on REDS4, Vimeo-90K-T, VideoLQ, and Vid4, covering diverse scene content and motion characteristics. Tabs. 2 and 4 report results across CNN-, Transformer-, and diffusion-based approaches under both bidirectional (offline) and unidirectional (online) settings.

On REDS4, Stream-DiffVSR achieves superior perceptual quality (LPIPS=0.099) over CNN (BasicVSR++, RealBasicVSR), Transformer (RVRT), and diffusion-based methods (StableVSR, MGLD-VSR), while also delivering competitive temporal consistency (tLP=4.198, tOF=3.638). Notably, it attains these gains with substantially lower runtime (0.328s/frame vs. 43–46s/frame for diffusion models).

On Vimeo-90K-T, Stream-DiffVSR likewise attains leading perceptual performance (LPIPS=0.056, DISTS=0.105) and improved temporal consistency (tLP=4.307, tOF=2.689) with a competitive runtime of 0.041s/frame, highlighting its suitability for online deployment.

In addition to speed, Stream-DiffVSR achieves a markedly lower memory footprint. As shown in Tab. 6, prior diffusion-based VSR methods such as DOVE, SeedVR2, and Upscale-A-Video either require over 42 GB of GPU memory or fail with out-of-memory errors on an NVIDIA A6000. In contrast, Stream-DiffVSR operates within 20.8 GB while running more than 2.5× faster, underscoring its efficiency and deployability.

Results on VideoLQ and Vid4 further confirm strong perceptual and temporal performance, demonstrating robust generalization across all evaluation datasets.

Figure 5:Qualitative comparison on REDS4 and Vimeo-90K-T datasets. Our method demonstrates superior visual quality with sharper details compared to unidirectional methods (TMP [99], RealViformer [98]) and competitive performance against bidirectional methods (StableVSR [58], MGLD-VSR [89], RVRT [37], BasicVSR++[5], RealBasicVSR[6]). Improvements include reduced artifacts and enhanced temporal stability (see zoomed patches).

4.1 Qualitative Comparisons

We provide qualitative comparisons in Fig. 5, where Stream-DiffVSR generates sharper details and fewer artifacts than prior methods. Additional visualizations of temporal consistency and flow coherence are included in the supplemental material. A qualitative comparison with Upscale-A-Video (UAV) [104] is included in the appendix.

Table 7:Ablation study of temporal modules in Stream-DiffVSR.
Table 8:Ablation study on training strategy.
Table 9:Ablation study on denoising step count within Stream-DiffVSR. We evaluate 50, 10, 1, and 4 steps. Our 4-step design achieves a favorable balance between perceptual quality and runtime.
Table 10:Ablation study on Rollout Training. Comparison of random timestep distillation vs. rollout training across fidelity and perceptual metrics.
Figure 6:Ablation study on the Temporal Processor Module (TPM). Integrating TPM improves motion stability and reduces temporal artifacts by leveraging warped previous-frame features, enhancing temporal consistency in video super-resolution.
Figure 7:Ablation study on inference steps. The 4-step model yields the best quality–efficiency trade-off, validating our distillation strategy.
Figure 8:Ablation study on Auto-regressive Temporal Guidance (ARTG). ARTG enhances temporal consistency and perceptual quality by leveraging warped previous frames, reducing flickering, and improving structural coherence.

4.2 Ablation Study

We ablate the key components of Stream-DiffVSR, including denoising-step reduction, ARTG, TPM, timestep selection, and training-stage combinations, on REDS4 to ensure consistent evaluation of perceptual quality and temporal stability.

We perform ablation studies on training strategies in Tab. 10 and Tab. 8. For stage-wise training, partial or joint training yields inferior results, while our separate stage-wise scheme achieves the best trade-off across fidelity, perceptual, and temporal metrics. For distillation, rollout training outperforms random timestep selection in both quality and efficiency, reducing training cost from 60.5 to 21 GPU hours on 4×A6000 GPUs.

We assess the runtime–quality trade-off by varying the number of DDIM inference steps while keeping the model weights fixed. As shown in Tab. 9 and Fig. 7, fewer steps increase efficiency but reduce perceptual quality, whereas more steps improve fidelity at higher latency. A 4-step setting provides the best balance.

Tab. 7, Fig. 6, and Fig. 8 show the effectiveness of ARTG and TPM. The per-frame baseline uses only the distilled U-Net with both ARTG and TPM disabled. In the ablation labels, w/o indicates that a module is fully removed, while TPM (unwarp) feeds TPM the previous HR frame without flow-based warping, removing motion alignment. ARTG improves perceptual quality (LPIPS 0.117→0.099) and temporal consistency (tLP100 6.132→4.265). TPM further enhances temporal coherence through temporal-feature warping and fusion, yielding additional gains in tLP100. These results highlight the complementary roles of latent-space guidance and decoder-side temporal modeling.

5 Conclusion

We propose Stream-DiffVSR, an efficient online video super-resolution framework using diffusion models. By integrating a distilled U-Net, Auto-Regressive Temporal Guidance, and Temporal-aware Decoder, Stream-DiffVSR achieves superior perceptual quality, temporal consistency, and practical inference speed for low-latency applications.

Limitations.

Stream-DiffVSR remains heavier than CNN and Transformer models, and its use of optical flow can introduce artifacts under fast motion. Its auto-regressive design also yields weaker results on initial frames, indicating a need for better initialization. Improving robustness to real-world degradations remains important.

Acknowledgements.

This research was funded by the National Science and Technology Council, Taiwan, under Grants NSTC 112-2222-E-A49-004-MY2 and 113-2628-E-A49-023-. The authors are grateful to Google, NVIDIA, and MediaTek Inc. for their generous donations. Yu-Lun Liu acknowledges the Yushan Young Fellow Program by the MOE in Taiwan.

References


Supplementary Material

Overview

This supplementary material provides additional details and results to support the main paper. We first describe the complete experimental setup, including training procedures, datasets, evaluation metrics, and baseline configurations. We then present extended implementation details and a three-stage breakdown of our training pipeline, covering U-Net distillation, temporal-aware decoder training, and the Auto-regressive Temporal Guidance module. Next, we report additional quantitative comparisons on multiple benchmarks under both bidirectional and unidirectional settings, followed by extensive qualitative visualizations illustrating perceptual quality and temporal consistency. We also include representative failure cases to highlight current limitations.

Appendix A Experimental Setup

A.1 Training and Evaluation Setup

Stream-DiffVSR is trained in three sequential stages to ensure stable optimization and modular control over temporal components. All evaluation experiments are conducted on an NVIDIA RTX 4090 GPU with TensorRT acceleration. Details of the stage-wise training procedure and configurations are provided in Appendix C.

A.2 Datasets

We evaluate our method using widely recognized benchmarks: REDS [53] and Vimeo-90K [87]. REDS consists of 300 video sequences (1280×720 resolution, 100 frames each); sequences 000, 011, 015, and 020 (REDS4) are used for testing. Vimeo-90K contains 91,701 clips (448×256 resolution), with 64,612 for training and 7,824 (Vimeo-90K-T) for evaluation, offering diverse real-world content for training.

For testing under real-world degradation, we also evaluate on two additional benchmarks: VideoLQ [95], a no-reference video quality dataset curated from real Internet content, and Vid4 [38], a classical benchmark with 4 videos commonly used for VSR evaluation. The evaluation results are provided in Appendix D.

A.3 Evaluation metrics

We assess the effectiveness of our approach using a comprehensive set of perceptual and temporal metrics. Reference-based Perceptual Quality: LPIPS [96] and DISTS [13]. No-reference Perceptual Quality: MUSIQ [30], NIQE [59], NRQM [49], and BRISQUE [51]. Temporal Consistency: Temporal Learned Perceptual similarity (tLP) and Temporal Optical Flow difference (tOF). Inference Speed: per-frame runtime and end-to-end latency, measured on an NVIDIA RTX 4090 GPU to evaluate low-latency applicability. Note that while we report PSNR and SSIM results (REDS4: 27.256 / 0.768) for completeness, we do not rely on these distortion-based metrics in our main analysis, as they often fail to reflect perceptual quality and temporal coherence, especially in generative VSR settings; this has also been observed in prior work [96]. Our qualitative results demonstrate superior perceptual and temporal quality, as we prioritize low-latency stability and consistency over optimizing for any single metric.
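For reference, the temporal metrics can be computed as below, following the common TecoGAN-style definitions of tLP and tOF; the paper's exact implementation may differ, and lpips_fn and flow_fn are placeholder callables.

```python
import torch

def tlp_tof(sr_frames, gt_frames, lpips_fn, flow_fn):
    """Average tLP and tOF over a sequence of (SR, GT) frame pairs."""
    tlp, tof = [], []
    for t in range(1, len(sr_frames)):
        lp_sr = lpips_fn(sr_frames[t - 1], sr_frames[t])      # frame-to-frame LPIPS (SR)
        lp_gt = lpips_fn(gt_frames[t - 1], gt_frames[t])      # frame-to-frame LPIPS (GT)
        tlp.append((lp_sr - lp_gt).abs().mean())
        of_sr = flow_fn(sr_frames[t - 1], sr_frames[t])       # estimated flow (SR)
        of_gt = flow_fn(gt_frames[t - 1], gt_frames[t])       # estimated flow (GT)
        tof.append((of_sr - of_gt).abs().mean())
    return torch.stack(tlp).mean(), torch.stack(tof).mean()
```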

A.4 Baseline methods

We evaluate our method against leading CNN-based, Transformer-based, and diffusion-based models. Specifically, we include bidirectional (offline) methods such as BasicVSR++ [5], RealBasicVSR [6], RVRT [37], StableVSR [58], and MGLD-VSR [89], and unidirectional (online) methods including MIA-VSR [105], TMP [99], RealViformer [98], and StableVSR [58], comprehensively comparing runtime, perceptual quality, and temporal consistency. Note that StableVSR [58] is originally a bidirectional model; we implement a unidirectional variant that uses only forward optical flow for a fair comparison under the online setting.

Appendix B Additional Implementation Details

B.1 Implementation Details

Our U-Net backbone is initialized from the U-Net checkpoint released by StableVSR [58], which was trained for image-based super-resolution starting from the Stable Diffusion (SD) ×4 Upscaler [57, 56]. We then perform 4-step distillation to adapt this U-Net for efficient video SR. ARTG, in contrast, is built upon our distilled U-Net encoder and computes temporal residuals from previous high-resolution outputs using convolutional and transformer blocks. These residuals are injected into the decoder during upsampling, enhancing temporal consistency without modifying the encoder or increasing diffusion steps. Our decoder is initialized from AutoEncoderTiny and extended with a Temporal Processor Module (TPM) to incorporate multi-scale temporal fusion during final reconstruction.

Appendix C Additional Training Details

C.1 Stage 1: U-Net Distillation

We initialize the denoising U-Net from the 50-step diffusion model released by StableVSR [58], which was trained on the REDS [52] dataset. To accelerate inference, we distill the 50-step U-Net into a 4-step variant using a deterministic DDIM [64] scheduler. During training, our rollout distillation always starts from the noisiest latent at timestep 999 and executes the full sequence of four denoising steps {999, 749, 499, 249}. Supervision is applied only to the final denoised latent at $t=0$, ensuring that training strictly mirrors the inference trajectory and reducing the gap between training and inference. We use a batch size of 16, a constant learning rate of 5e-5, and the AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay 0.01). Training is conducted for 600K iterations with a patch size of 512×512. The distillation loss consists of an MSE loss in latent space, an LPIPS [96] loss, and an adversarial loss using a PatchGAN discriminator [26] at the pixel level, with weights of 1.0, 0.5, and 0.025, respectively. The adversarial loss is enabled after 20k iterations to stabilize training.
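The configuration sketch below simply collects the Stage-1 hyper-parameters listed above; the dictionary keys and helper function are illustrative, not the authors' actual training script.

```python
import torch

STAGE1_CFG = {
    "timesteps": [999, 749, 499, 249],    # 4-step deterministic DDIM rollout
    "batch_size": 16,
    "patch_size": 512,
    "iterations": 600_000,
    "loss_weights": {"latent_mse": 1.0, "lpips": 0.5, "gan": 0.025},
    "gan_warmup_iters": 20_000,           # adversarial loss enabled after 20k iterations
}

def build_optimizer(unet: torch.nn.Module) -> torch.optim.Optimizer:
    # AdamW with a constant learning rate of 5e-5, as described above
    return torch.optim.AdamW(unet.parameters(), lr=5e-5,
                             betas=(0.9, 0.999), weight_decay=0.01)
```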

C.2 Stage 2: Temporal-aware Decoder Training

The decoder receives both the encoded ground-truth latent features and temporally aligned context features (via flow-warped previous frames). The encoder used to extract temporal features is frozen. We use a batch size of 16, a constant learning rate of 5e-5, and the AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay 0.01). Training is conducted for 600K iterations with a patch size of 512×512. The loss consists of a smooth L1 reconstruction loss, an LPIPS [96] loss, a flow loss using RAFT [69], and an adversarial loss using a PatchGAN discriminator [26] at the pixel level, with weights of 1.0, 0.3, 0.1, and 0.025, respectively. The flow and adversarial losses are enabled after 20k iterations to stabilize training.

C.3 Stage 3: Auto-regressive Temporal Guidance

We train the ARTG module while freezing both the U-Net and decoder. Optical flow is computed between adjacent frames using RAFT [69], and the warped previous super-resolved frame is injected into the denoising U-Net and decoder. The loss formulation is identical to Stage 1, and training is conducted for 60K iterations. This guides ARTG to enhance temporal coherence while maintaining alignment with the original perceptual objectives.

Algorithm 1 Training procedure for U-Net rollout distillation.
Algorithm 2 Auto-Regressive Diffusion VSR.

Appendix D Additional Quantitative Comparison

Table 11: Quantitative comparison against bidirectional/offline methods on the REDS4 dataset. We compare CNN-, Transformer-, and diffusion-based approaches. Stream-DiffVSR shows superior perceptual quality, temporal consistency, and stability. All values are reported as mean ± std over 4 videos. ↑ / ↓ denote higher/lower is better. Dir.: B = bidirectional/offline, U = unidirectional/online. Runtime is measured per 720p frame on an RTX 4090. Latency-first and Latency-avg measure first-frame and average latency; tLP and tOF are scaled by 100× and 10×. Best and second-best values are marked in red and blue. For space reasons, the main paper reports the mean-only version; the full mean±std statistics are shown here.
Table 12: Quantitative comparison against unidirectional/online methods on the REDS4 dataset.
Table 13: Quantitative comparison on the Vimeo-90K-T dataset (bidirectional/offline). Our Stream-DiffVSR achieves superior perceptual quality, temporal consistency, and substantially lower runtime. Results are reported as mean ± std across the dataset, with runtime measured on 448×256 videos using an RTX 4090 GPU. Best and second-best results are shown in red and blue. For space reasons, the main paper presents the mean-only version; the full mean±std statistics are provided here.
Table 14: Quantitative comparison on the Vimeo-90K-T dataset(unidirectional/online).

We provide extended quantitative results across multiple datasets and settings. Specifically, we report both bidirectional and unidirectional performance with mean and standard deviation on REDS4 (Tabs. 11 and 12) and Vimeo-90K-T (Tabs. 13 and 14), while additional results are provided on VideoLQ (Tab. 15) and Vid4 (Tabs. 16 and 17). These supplementary results further validate the robustness of our approach under diverse benchmarks and temporal settings.

Table 15: Quantitative comparison on the VideoLQ dataset. Left: against bidirectional/offline methods; right: against unidirectional/online methods.
Table 16:Quantitative comparison against bidirectional/offline methods on the Vid4 dataset.
Table 17:Quantitative comparison against unidirectional/online methods on the Vid4 dataset.

Appendix E Additional Visual Results

Figure 9:Qualitative comparison with Upscale-A-Video (UAV). Due to GPU memory limitations (OOM on an RTX 4090), we use UAV results extracted from its official project video for qualitative comparison. Despite this constraint, our Stream-DiffVSR exhibits superior visual fidelity and temporal consistency across frames.
Figure 10:Additional visual results.
Figure 11:Additional visual results.
Figure 12:Additional visual results.
Figure 13:Temporal consistency comparison. Qualitative comparison of temporal consistency across consecutive frames. Our proposed Stream-DiffVSR effectively mitigates flickering artifacts and maintains stable texture reconstruction, demonstrating superior temporal coherence compared to existing VSR methods.
Figure 14:Optical flow visualization comparison. Visualization of optical flow consistency across different VSR methods. Our proposed Stream-DiffVSR produces smoother and more temporally coherent flow fields, indicating improved motion consistency and reduced temporal artifacts compared to competing approaches.

Figs. 10, 11 and 12 present qualitative results on challenging real-world sequences. Compared with CNN-based (TMP, BasicVSR++) and Transformer-based (RealViformer) approaches, as well as the diffusion-based MGLD-VSR, our method produces sharper structures and more faithful textures while effectively reducing temporal flickering. These visual comparisons further demonstrate the effectiveness of our design in maintaining perceptual quality and temporal consistency across diverse scenes.

Temporal consistency comparison. As shown in the consecutive-frame comparisons in Fig. 13, Stream-DiffVSR alleviates flickering artifacts and preserves stable textures over time, yielding noticeably stronger temporal coherence than prior VSR methods.

Optical flow visualization comparison. The optical-flow consistency visualizations in Fig. 14 further highlight our advantages: Stream-DiffVSR generates smoother and more temporally coherent flow fields, reflecting improved motion stability and reduced temporal artifacts.

We also provide qualitative comparisons with Upscale-A-Video [104] in Fig. 9. Owing to GPU memory constraints, the official model cannot be executed locally, so we rely on frames extracted from its project video. Despite this limitation, Stream-DiffVSR demonstrates superior fine-detail reconstruction and notably improved temporal stability in UAV scenarios.

Appendix F Failure cases

Figure 15: Limitation on the first frame without temporal context. Our method may underperform on the first frame of a video sequence due to the absence of prior temporal information. This limitation is inherent to online VSR settings, where no past frames are available for guidance.

Fig. 15 illustrates a limitation of our approach on the first frame of a video sequence. Since no past frames are available for temporal guidance, the model may produce blurrier details or less stable structures compared to subsequent frames. This issue is inherent to all online VSR settings, where temporal information cannot be exploited at the sequence start. As shown in later frames, once temporal context becomes available, our method quickly stabilizes and reconstructs high-fidelity details.