Fix CUDA synchronization bottleneck in LCMScheduler (#9485) #13969

Open: Liauuu wants to merge 1 commit into huggingface:main from Liauuu:pr6
Conversation

Liauuu commented Jun 16, 2026

Description

Fixes #9485

This PR resolves the intermittent latency spikes in LCMScheduler caused by CUDA stream synchronization (visible as a cudaMemcpyAsync bottleneck).

Previously, indexing the CPU tensor self.alphas_cumprod with the GPU tensor timestep triggered an internal aten::_local_scalar_dense call. That call copies the index value from the GPU to the host and blocks until the copy completes, which added major overhead (100 ms or more during video generation loops).

Changes

  • Updated set_timesteps to keep a CPU-side copy of the timesteps (self.cpu_timesteps).
  • Modified step and get_scalings_for_boundary_condition_discrete to index with self.cpu_timesteps[self.step_index], so CPU tensors are indexed with CPU indices, while the original GPU timesteps are still used for the model forward passes.
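
The pattern described in the two changes above can be sketched roughly as follows. This is a minimal illustration, not the diffusers implementation: TinyScheduler, alpha_prod_t, and the linear beta schedule are hypothetical stand-ins, and only cpu_timesteps, timesteps, step_index, and alphas_cumprod are names taken from this description.

```python
import torch


class TinyScheduler:
    """Hypothetical stand-in for a scheduler; not the real LCMScheduler."""

    def __init__(self, num_train_timesteps: int = 1000):
        # Assumed linear beta schedule, used here only to get plausible values.
        betas = torch.linspace(1e-4, 0.02, num_train_timesteps)
        # alphas_cumprod lives on the CPU, as in the PR description.
        self.alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
        self.timesteps = None
        self.cpu_timesteps = None
        self.step_index = 0

    def set_timesteps(self, num_inference_steps: int, device: str = "cpu"):
        stride = len(self.alphas_cumprod) // num_inference_steps
        cpu_timesteps = torch.arange(num_inference_steps - 1, -1, -1) * stride
        # CPU copy: used only for indexing CPU-side tables.
        self.cpu_timesteps = cpu_timesteps
        # Device copy: handed to the model forward pass.
        self.timesteps = cpu_timesteps.to(device)
        self.step_index = 0

    def alpha_prod_t(self) -> torch.Tensor:
        # CPU tensor indexed by a CPU index: no device-to-host copy is needed.
        t = self.cpu_timesteps[self.step_index]
        return self.alphas_cumprod[t]


device = "cuda" if torch.cuda.is_available() else "cpu"
scheduler = TinyScheduler()
scheduler.set_timesteps(4, device=device)
alpha = scheduler.alpha_prod_t()
```

Indexing with scheduler.timesteps[scheduler.step_index] instead would, on a CUDA device, require reading the index value back to the host before the lookup, which is the synchronization this PR avoids.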

All 34 tests in tests/schedulers/test_scheduler_lcm.py passed successfully.

github-actions bot added the size/S (PR with diff < 50 LOC), fixes-issue, and schedulers labels and removed the fixes-issue label on Jun 16, 2026
Labels

schedulers, size/S (PR with diff < 50 LOC)

Development

Successfully merging this pull request may close these issues.

Can we allow making everything on gpu/cuda for scheduler?