[WIP] Update diffusers-cli for agentic use
#13966
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,68 @@ | ||
| --- | ||
| name: diffusers-cli | ||
| description: > | ||
| Use when the user wants to run a diffusers pipeline from a terminal (one-off | ||
| generation, batch jobs, smoke-testing a new model), submit jobs to HF Jobs | ||
| hardware via `--remote`, introspect a pipeline's input schema before | ||
| calling it, or attach a LoRA at inference time. Prefer this over writing | ||
| ad-hoc Python scripts for generation tasks. | ||
| --- | ||
|
|
||
| ## Overview | ||
|
|
||
| `diffusers-cli` is the shipped CLI in `src/diffusers/commands/`. Subcommands relevant to agentic use: | ||
|
|
||
| | Command | Purpose | | ||
| | --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||
| | `generate` | Run any `DiffusionPipeline` or `ModularPipeline`. Forwards `--pipeline-kwargs` verbatim, saves output by sniffing its runtime type, optionally runs on HF Jobs via `--remote`. | | ||
| | `describe` | Print the input schema for a pipeline repo (kwarg names, types, defaults, descriptions). **No weights downloaded** — only the small index file. | | ||
| | `custom_blocks` | Package a local `ModularPipelineBlocks` subclass for the Hub. | | ||
| | `env` | Print versions of diffusers + torch + transformers + accelerate + safetensors + CUDA + GPU info. Use when investigating environment issues, dtype/precision support, or building bug reports. | | ||
|
|
||
| ## When to read which file | ||
|
|
||
| Most agentic work goes through `generate`. Read the matching reference file before constructing a command: | ||
|
|
||
| - **[`generate.md`](generate.md)** — full reference for `diffusers-cli generate`. Covers `--pipeline-kwargs` | ||
| semantics and the shell-quoting gotcha, LoRA via `--lora`, optimization flags (`--dtype`, `--cpu-offload`, | ||
| `--attention-backend`, `--vae-tiling/slicing`), output handling and `--push-to` bucket uploads, the full | ||
| `--remote` HF Jobs flow (image, container command, log streaming, timing payload, artifact download), and | ||
| context parallel (`--context-parallel`) for both local-torchrun and `--remote` paths. | ||
|
|
||
| The other commands are small enough that `diffusers-cli <command> --help` is the canonical reference: | ||
|
|
||
| ```bash | ||
| diffusers-cli describe --help | ||
| diffusers-cli custom_blocks --help | ||
| diffusers-cli env --help | ||
| ``` | ||
|
|
||
| ## When NOT to use this skill | ||
|
|
||
| - Multi-stage workflows where you need intermediate tensor manipulation between pipelines → write Python. | ||
| - Training or fine-tuning → CLI only covers inference. | ||
| - Anything requiring custom `device_map`, `quantization_config`, or other low-level loader knobs not exposed by | ||
|
Member
Feels like quantization could be exposed to the CLI. Right now, one can only do that when using a prequantized checkpoint?
Collaborator
Author
Quantization has a fairly large API surface that might be better suited to writing a dedicated quantization script? E.g. BnB quant config options have no overlap with TorchAO, which in turn have no overlap with ModelOpt, etc. TorchAO also supports using an AOBaseConfig input, which in turn has its own input args. We could explore trying to provide the option via a more restricted API though.
Member
No, your reasoning makes sense. It's just that a user could expect it because quantization is sometimes the only way to do it locally. We can table it for now. |
||
| the CLI flags → write Python. | ||
|
|
||
| ## Verifying the CLI is installed | ||
|
|
||
| The console entry point is registered in `pyproject.toml` (`diffusers-cli = | ||
| "diffusers.commands.diffusers_cli:main"`). If `diffusers-cli` is not on PATH after `pip install -e .`, reinstall | ||
| with `pip install -e . --force-reinstall --no-deps` and check `which diffusers-cli`. If the installed binary is | ||
| missing recent features (e.g. you see `unrecognized arguments: --lora`), reinstall. | ||
|
|
||
| ## Output formats | ||
|
|
||
| `--format {auto, human, agent, json}` (top-level flag, must appear before the subcommand): | ||
|
|
||
| - **`human`** — plain-text indented output for terminals (default when not running under an agent harness). No ANSI color. | ||
| - **`agent`** — TSV tables and `key=value` lines. Auto-selected when an agent env var is present | ||
| (`CLAUDECODE`, `CLAUDE_CODE`, `CODEX_SANDBOX`, `CURSOR_AI`, `AIDER_AI_CONTEXT`, `GH_COPILOT_AGENT`, | ||
| `AI_AGENT`). Token-cheap for LLM agents to read. | ||
| - **`json`** — compact JSON. Use for programmatic parsing (scripts, services) where type fidelity and nested | ||
| structures matter. | ||
|
|
||
| `stdout` carries data; `stderr` carries hints/warnings/progress — parseable output is never polluted. | ||
|
|
||
| Rule of thumb: `--format json` for scripts that will `json.loads()` the output, otherwise leave it on | ||
| auto-detect (`agent` for LLMs, `human` for terminals). | ||
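The auto-detect rule above can be sketched roughly as follows. This is not the CLI's actual code: `resolve_format` is a hypothetical helper, and treating a variable as "present" only when it is set to a non-empty value is an assumption. The variable names are the ones listed above.

```python
from __future__ import annotations

import os
from typing import Mapping

# Env var names copied from the list above; the detection logic itself is an assumption.
AGENT_ENV_VARS = (
    "CLAUDECODE",
    "CLAUDE_CODE",
    "CODEX_SANDBOX",
    "CURSOR_AI",
    "AIDER_AI_CONTEXT",
    "GH_COPILOT_AGENT",
    "AI_AGENT",
)


def resolve_format(requested: str = "auto", environ: Mapping[str, str] | None = None) -> str:
    """Hypothetical helper: map a --format value to a concrete output mode."""
    if requested != "auto":
        # An explicit --format always wins over auto-detection.
        return requested
    env = os.environ if environ is None else environ
    # Assumption: "present" means set to a non-empty value.
    if any(env.get(name) for name in AGENT_ENV_VARS):
        return "agent"
    return "human"
```

Passing `environ` explicitly keeps the helper easy to exercise without touching the real process environment.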
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,175 @@ | ||
| # `diffusers-cli generate` — reference | ||
|
|
||
| Full surface for `diffusers-cli generate`. Use this file as the source of truth when constructing a `generate` | ||
| invocation. The top-level [`SKILL.md`](SKILL.md) covers when to use the CLI; this file covers how. | ||
|
|
||
| ## The describe → generate flow | ||
|
|
||
| For any model you haven't called before, run `describe` first to learn its input contract, then `generate` with | ||
| the right `--pipeline-kwargs`: | ||
|
|
||
| ```bash | ||
| # 1. Discover what kwargs the pipeline takes (no weight download) | ||
| diffusers-cli --format json describe --model black-forest-labs/FLUX.2-klein-9B | ||
|
|
||
| # 2. Run it | ||
| diffusers-cli generate \ | ||
| --model black-forest-labs/FLUX.2-klein-9B \ | ||
| --pipeline-kwargs '{"prompt": "Make the cats fur grey", "image": "https://blobcdn.same.energy/a/d0/58/d058b51c2329b0ea4057e9f12cd9a1da36347e34"}' \ | ||
| --dtype bf16 | ||
| ``` | ||
|
|
||
| `diffusers-cli --format json describe` emits a `{task, model, pipeline_class, inputs[]}` payload where each input is | ||
| `{name, type_hint, default, required, description}`. | ||
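As a rough illustration of consuming that payload from a script, the sketch below parses a sample in the documented shape and lists the required inputs. The sample values (`task`, `model`, `pipeline_class`, and both inputs) are made up for the example; only the key names come from the description above.

```python
import json

# Hypothetical sample in the documented shape; every value here is invented.
SAMPLE_PAYLOAD = """
{"task": "text-to-image", "model": "org/model", "pipeline_class": "ExamplePipeline",
 "inputs": [
  {"name": "prompt", "type_hint": "str", "default": null, "required": true,
   "description": "Text prompt."},
  {"name": "num_inference_steps", "type_hint": "int", "default": 28, "required": false,
   "description": "Number of denoising steps."}
 ]}
"""


def required_inputs(payload_text: str) -> list:
    """Return the names of inputs that must appear in --pipeline-kwargs."""
    payload = json.loads(payload_text)
    return [item["name"] for item in payload["inputs"] if item["required"]]


def optional_defaults(payload_text: str) -> dict:
    """Return {name: default} for inputs that may be omitted."""
    payload = json.loads(payload_text)
    return {item["name"]: item["default"] for item in payload["inputs"] if not item["required"]}
```

An agent could call `required_inputs` on the real `describe` output to check a kwargs dict before invoking `generate`.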
|
|
||
| ## Standard vs modular detection | ||
|
|
||
| `generate` auto-detects which kind of pipeline it's calling: | ||
|
|
||
| 1. If `model_index.json` exists on the repo → `DiffusionPipeline.from_pretrained` path. | ||
| 2. Otherwise → `ModularPipeline.from_pretrained` path. | ||
|
|
||
| You don't need to tell it which. Modular repos must pass `--trust-remote-code` if they ship custom block code. | ||
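For a local checkout, the rule above reduces to a single file-existence check. The sketch below covers only the local-directory case and uses a hypothetical `pipeline_kind` helper; Hub repos would need a download step that is not shown here.

```python
import tempfile
from pathlib import Path


def pipeline_kind(model_dir: str) -> str:
    """Local-only sketch: 'standard' if model_index.json exists, else 'modular'."""
    return "standard" if (Path(model_dir) / "model_index.json").exists() else "modular"


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    kind_before = pipeline_kind(tmp)  # no index file yet
    (Path(tmp) / "model_index.json").write_text("{}")
    kind_after = pipeline_kind(tmp)  # index file now present
```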
|
|
||
| ## `--pipeline-kwargs` semantics | ||
|
|
||
| A JSON object passed straight through to `pipeline(**kwargs)`. String values at known image-input keys (`image`, | ||
| `mask_image`, `control_image`, `ip_adapter_image`, `image_2`) are auto-loaded as PIL images, so you can pass URLs | ||
| or local paths directly: | ||
|
|
||
| ```bash | ||
| diffusers-cli generate \ | ||
| --model black-forest-labs/FLUX.2-klein-9B \ | ||
| --pipeline-kwargs '{"image": "https://example.com/cat.png", "prompt": "make the fur grey", "strength": 0.6}' | ||
| ``` | ||
|
|
||
| **Shell-quoting gotcha**: keep the JSON on one line. `\` line continuations work between arguments, not inside the | ||
| single-quoted JSON, where a literal newline lands as a raw control char inside a string and breaks `json.loads`. | ||
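When the command is assembled from a script rather than typed by hand, one way to sidestep the quoting problem is to let `json.dumps` and `shlex` do the escaping. The sketch below is illustrative only; it uses the model id and flags shown above and also reproduces the failure mode with a raw newline inside a JSON string.

```python
import json
import shlex

kwargs = {"prompt": "make the fur grey", "strength": 0.6}

# json.dumps emits a single line by default, so no raw newline ends up inside a string.
payload = json.dumps(kwargs)
argv = [
    "diffusers-cli", "generate",
    "--model", "black-forest-labs/FLUX.2-klein-9B",
    "--pipeline-kwargs", payload,
]
# shlex.join quotes each argument so the result can be pasted into a POSIX shell.
command = shlex.join(argv)

# The failure mode: a raw newline inside a JSON string value is rejected by the parser.
broken = '{"prompt": "make the\nfur grey"}'
try:
    json.loads(broken)
    newline_rejected = False
except json.JSONDecodeError:
    newline_rejected = True
```

Passing `argv` directly to `subprocess.run` avoids the shell entirely, in which case no quoting is needed at all.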
|
|
||
| ## LoRA adapters (`--lora`) | ||
|
|
||
| Attach a LoRA after the pipeline loads via a JSON spec: | ||
|
|
||
| ```bash | ||
| diffusers-cli generate \ | ||
| --model black-forest-labs/FLUX.2-klein-9B \ | ||
| --pipeline-kwargs '{"prompt": "a tiny grey cat"}' \ | ||
| --lora '{"lora_id": "alvdansen/littletinies", "lora_scale": 0.8}' | ||
| ``` | ||
|
|
||
| Calls `pipeline.load_lora_weights(<lora_id>, adapter_name="default")` and, if `lora_scale` is present, | ||
| `pipeline.set_adapters(["default"], adapter_weights=[<scale>])`. Errors clearly if the pipeline doesn't support | ||
| LoRA or `lora_id` is missing. | ||
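The sequence of calls described above can be sketched as below. `attach_lora` and `RecordingPipeline` are hypothetical names for illustration; the use of `SystemExit` and the exact error messages are assumptions, since the text only says the CLI "errors clearly". The stand-in class records calls so the sketch runs without loading a real pipeline.

```python
import json


def attach_lora(pipeline, lora_spec: str) -> None:
    """Hypothetical sketch of the --lora handling described above."""
    spec = json.loads(lora_spec)
    lora_id = spec.get("lora_id")
    if not lora_id:
        # Assumption: the error type and wording are illustrative.
        raise SystemExit("--lora requires a 'lora_id' key")
    if not hasattr(pipeline, "load_lora_weights"):
        raise SystemExit(f"{type(pipeline).__name__} does not support LoRA")
    pipeline.load_lora_weights(lora_id, adapter_name="default")
    if "lora_scale" in spec:
        pipeline.set_adapters(["default"], adapter_weights=[spec["lora_scale"]])


class RecordingPipeline:
    """Stand-in for a real pipeline; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def load_lora_weights(self, lora_id, adapter_name=None):
        self.calls.append(("load_lora_weights", lora_id, adapter_name))

    def set_adapters(self, names, adapter_weights=None):
        self.calls.append(("set_adapters", names, adapter_weights))
```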
|
|
||
| ## Optimization flags | ||
|
|
||
| - `--dtype {auto, bf16, fp16, fp32, …}` — pipeline weight dtype. `bf16` is the right default for modern DiTs on | ||
| A100/H100. | ||
| - `--cpu-offload {model, group}` — `model` uses `enable_model_cpu_offload`, `group` uses | ||
| `enable_group_offload(offload_type="leaf_level", use_stream=True)`. Use `group` to fit a 9B+ model on a single A100. | ||
| - `--attention-backend {default, flash_hub, flash_varlen_hub, flash_4_hub, sage_hub}` — hub-hosted kernels, | ||
| auto-downloaded on first use. Failures (kernel not available, CUDA arch mismatch, network) raise a clear | ||
| `SystemExit` listing the alternatives instead of silently reverting to the default. | ||
| - `--vae-tiling` / `--vae-slicing` — lower peak VAE decode VRAM. | ||
| - `--context-parallel` — Ulysses-style context parallelism on a DiT. See [Context parallel](#context-parallel) below. | ||
|
|
||
| `disable_mmap=True` is always passed to `from_pretrained` — sequential reads are faster than mmap page-faults on | ||
| most filesystems. | ||
|
|
||
| ## Output handling | ||
|
|
||
| `generate` sniffs the pipeline return type and saves accordingly: | ||
|
|
||
| - `PIL.Image` / list of them → `outputs/generate-<i>.png` | ||
| - Frame sequence (≥2 PILs or ndarrays) → `outputs/generate-0.mp4` (uses `--fps`, default 8) | ||
| - Numpy audio array → `outputs/generate-0.wav` (uses `--sampling-rate`) | ||
| - Anything else → JSON dump | ||
|
|
||
| Override the destination with `--output <path>` (file or directory). | ||
|
|
||
| Use `--push-to <user>/<bucket>` to upload outputs to an HF bucket after saving. The bucket is created if it | ||
| doesn't exist; objects land under `<run_id>/<filename>`. | ||
|
|
||
| ## Remote execution (`--remote`) | ||
|
|
||
| Add `--remote` to submit the same call as a Hugging Face Job: | ||
|
|
||
| ```bash | ||
| diffusers-cli generate \ | ||
| --model black-forest-labs/FLUX.2-klein-9B \ | ||
| --pipeline-kwargs '{"prompt": "Make the cats fur grey", "image": "https://blobcdn.same.energy/a/d0/58/d058b51c2329b0ea4057e9f12cd9a1da36347e34"}' \ | ||
| --remote --flavor a100-large \ | ||
| --dtype bf16 \ | ||
| --cpu-offload group | ||
| ``` | ||
|
|
||
| What happens: | ||
|
|
||
| 1. Your HF token is picked up (from `--token` or your login). | ||
| 2. A bucket (`<user>/jobs-artifacts` by default) is created if it doesn't exist. | ||
| 3. The job runs in a PyTorch container that already has torch + CUDA preinstalled. Only the small Python | ||
| deps (`diffusers`, `accelerate`, `transformers`, `safetensors`) are installed at container start — about | ||
| 50 MB instead of 3 GB. | ||
| 4. Container logs stream to your terminal. When the job finishes, the CLI downloads every file the job | ||
| uploaded to the bucket under its `run_id` prefix into `./outputs/`. | ||
| 5. A timing breakdown (`queued_seconds`, `run_seconds`, `total_seconds`) is printed and included in the JSON | ||
| payload. | ||
|
|
||
| Flags: | ||
|
|
||
| - `--flavor <name>` — HF Jobs hardware (e.g. `a10g-small`, `a100-large`, `4xa100-large`). | ||
| - `--timeout <duration>` — max wallclock (e.g. `30m`, `2h`). Defaults to `10m`. | ||
| - `--dependencies <pkg>` — extra pip deps (repeatable). | ||
| - `--namespace <name>` — run under a different account. | ||
| - `--no-wait` — submit, return job id, don't stream logs. | ||
| - `--push-to <bucket>` — override the artifact bucket id. | ||
|
|
||
| ## Context parallel | ||
|
|
||
| `--context-parallel` enables Ulysses CP on a DiT-based pipeline. **Locally** the user must launch via torchrun: | ||
|
|
||
| ```bash | ||
| torchrun --nproc-per-node=2 -m diffusers.commands.diffusers_cli generate \ | ||
| --model black-forest-labs/FLUX.2-klein-9B \ | ||
| --pipeline-kwargs '{"prompt": "Make the cats fur grey"}' \ | ||
| --dtype bf16 \ | ||
| --context-parallel | ||
| ``` | ||
|
|
||
| **Remotely** the CLI handles the torchrun wrapping — just pass `--context-parallel` to a `--remote` invocation on | ||
| a multi-GPU flavor: | ||
|
|
||
| ```bash | ||
| diffusers-cli generate \ | ||
| --model black-forest-labs/FLUX.2-klein-9B \ | ||
| --pipeline-kwargs '{"prompt": "Make the cats fur grey", "image": "https://blobcdn.same.energy/a/d0/58/d058b51c2329b0ea4057e9f12cd9a1da36347e34"}' \ | ||
| --remote --flavor 4xa100-large \ | ||
| --dtype bf16 \ | ||
| --context-parallel | ||
| ``` | ||
|
|
||
| Inside the container, CP swaps the entrypoint to `torchrun --nproc-per-node=gpu -m | ||
| diffusers.commands.diffusers_cli`, initializes a hybrid process group (`cpu:gloo,cuda:nccl` — NCCL for the | ||
| attention all-to-all, Gloo for `ulysses_anything`'s per-rank size coordination), pins each rank to | ||
| `cuda:{LOCAL_RANK}`, and gates output saving/printing to rank 0 only. | ||
|
|
||
| **Memory note**: CP shards the sequence, **not the weights**. Every rank still holds the full transformer. Wins | ||
| are wall-clock attention speedup and headroom for very long sequences, not "fit a model that doesn't fit." For | ||
| weight sharding you'd want TP or FSDP — not exposed in the CLI yet. | ||
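A back-of-the-envelope illustration of the memory note, using made-up numbers: the per-rank weight footprint stays constant as ranks are added, while the number of sequence tokens each rank holds outside attention shrinks. `per_rank_load` is a hypothetical helper, and the 18 GB and 4096-token figures are placeholders, not measurements.

```python
from __future__ import annotations


def per_rank_load(weight_gb: float, seq_len: int, world_size: int) -> tuple[float, int]:
    """Return (weights held per rank in GB, sequence tokens per rank) under context parallelism."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    # Weights are replicated: every rank holds the full transformer.
    # The sequence is sharded: ceil-divide so an uneven split is still covered.
    tokens_per_rank = -(-seq_len // world_size)
    return weight_gb, tokens_per_rank


# Placeholder numbers for illustration only.
single_gpu = per_rank_load(18.0, 4096, 1)
four_gpus = per_rank_load(18.0, 4096, 4)
```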
|
|
||
| CP is DiT-only. UNet pipelines raise a clear error directing you to a DiT pipeline (FLUX, SD3, HunyuanDiT, | ||
| AuraFlow, …). | ||
|
|
||
| ## Output mode (`--format`) | ||
|
|
||
| The CLI auto-detects when running under an AI coding agent (Claude Code, Cursor, Aider, GH Copilot Agent — via | ||
| `CLAUDECODE`, `CLAUDE_CODE`, `CURSOR_AI`, `AIDER_AI_CONTEXT`, `GH_COPILOT_AGENT`) and switches output to **agent | ||
| mode** automatically — TSV tables, `key=value` results, compact JSON dicts, no progress bars. | ||
|
|
||
| Override explicitly with `--format {auto, human, agent, json}` placed **before** the subcommand: | ||
|
|
||
| ```bash | ||
| diffusers-cli --format json generate --model <id> --pipeline-kwargs '...' | ||
| ``` | ||
|
|
||
| The legacy `--json` flag on `generate` still works as a shortcut for `--format json`. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,43 @@ | ||
| # Copyright 2026 The HuggingFace Team. All rights reserved. | ||
|
sayakpaul marked this conversation as resolved.
|
||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| """Shared helpers used by multiple ``diffusers-cli`` subcommands. | ||
|
|
||
| Anything imported by more than one command file lives here so command modules stay standalone — no cross-command | ||
| imports between e.g. ``describe`` and ``generate``. | ||
| """ | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| from argparse import Namespace | ||
| from pathlib import Path | ||
|
|
||
|
|
||
| def try_fetch_config(args: Namespace, filename: str) -> str | None: | ||
| """Resolve ``filename`` for ``args.model`` (local path or Hub repo). Return None if absent. | ||
|
|
||
| Used by ``generate`` (to detect modular vs standard pipelines) and ``describe`` (to read the pipeline class for | ||
| schema introspection) — no weights are downloaded, only the small index file. | ||
| """ | ||
| local = Path(args.model) | ||
| if local.exists(): | ||
| candidate = local / filename | ||
| return str(candidate) if candidate.exists() else None | ||
|
|
||
| from huggingface_hub import hf_hub_download | ||
| from huggingface_hub.utils import EntryNotFoundError, HfHubHTTPError, RepositoryNotFoundError | ||
|
|
||
| try: | ||
| return hf_hub_download(args.model, filename, revision=args.revision, token=args.token) | ||
| except (EntryNotFoundError, HfHubHTTPError, RepositoryNotFoundError): | ||
| return None | ||