Fix CUDA synchronization bottleneck in LCMScheduler (#9485) by Liauuu · Pull Request #13969 · huggingface/diffusers

Liauuu · 2026-06-16T03:48:53Z

Description

This PR resolves the intermittent high latency and CUDA stream synchronization issue (cudaMemcpyAsync bottleneck) in LCMScheduler.

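Previously, indexing the CPU tensor self.alphas_cumprod with the GPU tensor timestep triggered an internal aten::_local_scalar_dense call. That call copies the index value to the host and blocks the CPU until the GPU work already queued has finished, which added significant latency (100 ms or more per step during video generation loops).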
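The pattern described above can be reproduced outside the scheduler. The following is a minimal sketch, not the scheduler code: the table values and the timestep 999 are illustrative assumptions, and it falls back to the CPU when no GPU is present (in which case both lookups are equally cheap).

```python
import torch

# Use the GPU when available; on a CPU-only machine the code still runs,
# but no cross-device synchronization occurs.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative stand-in for the scheduler's CPU-resident alphas_cumprod table.
alphas_cumprod = torch.linspace(0.999, 0.001, 1000)

# A 0-dim timestep as it would be read from a timesteps tensor on the device.
timestep = torch.tensor(999, device=device)

# Problematic pattern: PyTorch must read the index value on the host,
# which goes through item() and waits for pending GPU work.
slow = alphas_cumprod[timestep]

# Avoided pattern: the index already lives on the CPU, so the lookup never
# touches the GPU. (The PR keeps a CPU copy from the start instead of calling
# .cpu() per step, since .cpu() on a GPU tensor is itself a blocking copy.)
cpu_timestep = timestep.cpu()
fast = alphas_cumprod[cpu_timestep]
```

Both lookups return the same value; only the device the index lives on differs.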

Changes

Updated set_timesteps to keep a CPU-side copy of the timesteps (self.cpu_timesteps).
Modified step and get_scalings_for_boundary_condition_discrete to index with self.cpu_timesteps[self.step_index], so that indexing stays a CPU tensor[CPU index] operation, while the original GPU timesteps are still used for the model forward passes.
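The two changes above can be sketched with a toy class. This is a hypothetical illustration, not the diffusers implementation: ToyScheduler, alpha_prod_for_current_step, the beta schedule, and the timestep spacing are all assumptions; only the attribute names cpu_timesteps, timesteps, step_index, and alphas_cumprod come from the PR description.

```python
import torch


class ToyScheduler:
    """Hypothetical stand-in for illustration; not the diffusers LCMScheduler."""

    def __init__(self, num_train_timesteps=1000):
        # Assumed linear beta schedule; the table stays on the CPU.
        betas = torch.linspace(1e-4, 0.02, num_train_timesteps)
        self.alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
        self.timesteps = None
        self.cpu_timesteps = None
        self.step_index = 0

    def set_timesteps(self, num_inference_steps, device="cpu"):
        # Assumed evenly spaced, descending timesteps.
        stride = self.alphas_cumprod.shape[0] // num_inference_steps
        cpu_timesteps = torch.arange(num_inference_steps - 1, -1, -1) * stride + stride - 1
        self.cpu_timesteps = cpu_timesteps          # CPU copy, used only for indexing
        self.timesteps = cpu_timesteps.to(device)   # device copy, passed to the model
        self.step_index = 0

    def alpha_prod_for_current_step(self):
        # CPU tensor indexed by a CPU index: no device-to-host copy, no sync.
        t = self.cpu_timesteps[self.step_index]
        alpha_prod_t = self.alphas_cumprod[t]
        self.step_index += 1
        return alpha_prod_t


device = "cuda" if torch.cuda.is_available() else "cpu"
scheduler = ToyScheduler()
scheduler.set_timesteps(4, device=device)
first = scheduler.alpha_prod_for_current_step()
```

Keeping two copies costs a few bytes of memory per inference run and removes the per-step host wait, since the device copy is only ever consumed by GPU kernels.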

All 34 tests in tests/schedulers/test_scheduler_lcm.py passed successfully.

Commit 94d26d0: Fix CUDA synchronization bottleneck in LCMScheduler by maintaining CPU timesteps

github-actions bot added the size/S (PR with diff < 50 LOC), fixes-issue, and schedulers labels, then removed the fixes-issue label on Jun 16, 2026.

Liauuu wants to merge 1 commit into huggingface:main from Liauuu:pr6
