68 changes: 68 additions & 0 deletions .ai/skills/diffusers-cli/SKILL.md
@@ -0,0 +1,68 @@
---
name: diffusers-cli
description: >
Use when the user wants to run a diffusers pipeline from a terminal (one-off
generation, batch jobs, smoke-testing a new model), submit jobs to HF Jobs
hardware via `--remote`, introspect a pipeline's input schema before
calling it, or attach a LoRA at inference time. Prefer this over writing
ad-hoc Python scripts for generation tasks.
---

## Overview

`diffusers-cli` is the shipped CLI in `src/diffusers/commands/`. Subcommands relevant to agentic use:

| Command | Purpose |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `generate` | Run any `DiffusionPipeline` or `ModularPipeline`. Forwards `--pipeline-kwargs` verbatim, saves output by sniffing its runtime type, optionally runs on HF Jobs via `--remote`. |
| `describe` | Print the input schema for a pipeline repo (kwarg names, types, defaults, descriptions). **No weights downloaded** — only the small index file. |
| `custom_blocks` | Package a local `ModularPipelineBlocks` subclass for the Hub. |
| `env` | Print versions of diffusers + torch + transformers + accelerate + safetensors + CUDA + GPU info. Use when investigating environment issues, dtype/precision support, or building bug reports. |

## When to read which file

Most agentic work goes through `generate`. Read the matching reference file before constructing a command:

- **[`generate.md`](generate.md)** — full reference for `diffusers-cli generate`. Covers `--pipeline-kwargs`
semantics and the shell-quoting gotcha, LoRA via `--lora`, optimization flags (`--dtype`, `--cpu-offload`,
`--attention-backend`, `--vae-tiling/slicing`), output handling and `--push-to` bucket uploads, the full
`--remote` HF Jobs flow (image, container command, log streaming, timing payload, artifact download), and
context parallel (`--context-parallel`) for both local-torchrun and `--remote` paths.

The other commands are small enough that `diffusers-cli <command> --help` is the canonical reference:

```bash
diffusers-cli describe --help
diffusers-cli custom_blocks --help
diffusers-cli env --help
```

## When NOT to use this skill

- Multi-stage workflows where you need intermediate tensor manipulation between pipelines → write Python.
- Training or fine-tuning → CLI only covers inference.
- Anything requiring custom `device_map`, `quantization_config`, or other low-level loader knobs not exposed by
the CLI flags → write Python.

Comment thread

Member:

Feels like quantization could be exposed to the CLI. Right now, one can only do that when using a prequantized checkpoint?

Collaborator (Author):

Quantization has a fairly large API surface that might be better suited to writing a dedicated quantization script? E.g. BnB quant config options have no overlap with TorchAO, which in turn has no overlap with ModelOpt, and so on. TorchAO also supports using an `AOBaseConfig` input, which in turn has its own input args.

We could explore trying to provide the option via a more restricted API though.

Member:

No, your reasoning makes sense. It's just that a user could expect it because quantization is sometimes the only way to do it locally. We can table it for now.

## Verifying the CLI is installed

The console entry point is registered in `pyproject.toml` (`diffusers-cli =
"diffusers.commands.diffusers_cli:main"`). If `diffusers-cli` is not on PATH after `pip install -e .`, reinstall
with `pip install -e . --force-reinstall --no-deps` and check `which diffusers-cli`. If the installed binary is
missing recent features (e.g. you see `unrecognized arguments: --lora`), reinstall.
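
The PATH check above can also be done from Python. This is a minimal sketch that uses the standard-library `shutil.which` as a stand-in for `which`; the helper name `find_cli` is made up for this example and is not part of diffusers.

```python
from __future__ import annotations

import shutil


def find_cli(name: str = "diffusers-cli") -> str | None:
    """Return the absolute path of `name` if it is on PATH, else None."""
    return shutil.which(name)


path = find_cli()
if path is None:
    # Same remedy as described in the text above.
    print("diffusers-cli not on PATH; try: pip install -e . --force-reinstall --no-deps")
else:
    print(f"diffusers-cli found at {path}")
```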

## Output formats

`--format {auto, human, agent, json}` (top-level flag, must appear before the subcommand):

- **`human`** — plain-text indented output for terminals (default when not running under an agent harness). No ANSI color.
- **`agent`** — TSV tables and `key=value` lines. Auto-selected when an agent env var is present
(`CLAUDECODE`, `CLAUDE_CODE`, `CODEX_SANDBOX`, `CURSOR_AI`, `AIDER_AI_CONTEXT`, `GH_COPILOT_AGENT`,
`AI_AGENT`). Token-cheap for LLM agents to read.
- **`json`** — compact JSON. Use for programmatic parsing (scripts, services) where type fidelity and nested
structures matter.

`stdout` carries data; `stderr` carries hints/warnings/progress — parseable output is never polluted.

Rule of thumb: `--format json` for scripts that will `json.loads()` the output, otherwise leave it on
auto-detect (`agent` for LLMs, `human` for terminals).
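
As a rough illustration of the auto-detect rule, the sketch below resolves `auto` to `agent` or `human` from the environment variables listed above. The function name `resolve_format` and the "any non-empty value counts as present" rule are assumptions for this example; the real CLI may decide differently.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Agent markers, copied from the list in the text above.
AGENT_ENV_VARS = (
    "CLAUDECODE",
    "CLAUDE_CODE",
    "CODEX_SANDBOX",
    "CURSOR_AI",
    "AIDER_AI_CONTEXT",
    "GH_COPILOT_AGENT",
    "AI_AGENT",
)


def resolve_format(requested: str = "auto", env: Mapping[str, str] | None = None) -> str:
    """Map a --format value to the mode that would actually be used."""
    env = os.environ if env is None else env
    if requested != "auto":
        return requested  # an explicit choice always wins
    # Assumption: a variable counts as "present" when it is set and non-empty.
    is_agent = any(env.get(name) for name in AGENT_ENV_VARS)
    return "agent" if is_agent else "human"


print(resolve_format("auto", {"CLAUDECODE": "1"}))
print(resolve_format("auto", {}))
```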
175 changes: 175 additions & 0 deletions .ai/skills/diffusers-cli/generate.md
@@ -0,0 +1,175 @@
# `diffusers-cli generate` — reference

Full surface for `diffusers-cli generate`. Use this file as the source of truth when constructing a `generate`
invocation. The top-level [`SKILL.md`](SKILL.md) covers when to use the CLI; this file covers how.

## The describe → generate flow

For any model you haven't called before, run `describe` first to learn its input contract, then `generate` with
the right `--pipeline-kwargs`:

```bash
# 1. Discover what kwargs the pipeline takes (no weight download)
diffusers-cli --format json describe --model black-forest-labs/FLUX.2-klein-9B

# 2. Run it
diffusers-cli generate \
--model black-forest-labs/FLUX.2-klein-9B \
--pipeline-kwargs '{"prompt": "Make the cats fur grey", "image": "https://blobcdn.same.energy/a/d0/58/d058b51c2329b0ea4057e9f12cd9a1da36347e34"}' \
--dtype bf16
```

`diffusers-cli --format json describe` emits a `{task, model, pipeline_class, inputs[]}` payload where each input is
`{name, type_hint, default, required, description}`.
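
A script consuming that payload might look like the sketch below. The sample payload is hypothetical: only its shape follows the description above, and every value in it (task name, model id, pipeline class, inputs) is invented. The helper names are also made up for this example.

```python
from __future__ import annotations

import json

# Hypothetical payload: the keys follow the documented shape, the values are invented.
SAMPLE = """
{"task": "text-to-image", "model": "some-org/some-model", "pipeline_class": "SomePipeline",
 "inputs": [
   {"name": "prompt", "type_hint": "str", "default": null, "required": true,
    "description": "Text prompt."},
   {"name": "num_inference_steps", "type_hint": "int", "default": 50, "required": false,
    "description": "Number of denoising steps."}
 ]}
"""


def describe_command(model: str) -> list[str]:
    """Build the argv for describe; --format goes before the subcommand."""
    return ["diffusers-cli", "--format", "json", "describe", "--model", model]


def required_inputs(payload: dict) -> list[str]:
    """Names of inputs that must be supplied in --pipeline-kwargs."""
    return [item["name"] for item in payload["inputs"] if item["required"]]


def optional_defaults(payload: dict) -> dict:
    """Optional inputs mapped to their default values."""
    return {item["name"]: item["default"] for item in payload["inputs"] if not item["required"]}


payload = json.loads(SAMPLE)
print(required_inputs(payload))
print(optional_defaults(payload))
```

In a real script, the argv from `describe_command` would go to `subprocess.run(..., capture_output=True, text=True)` and the resulting `stdout` to `json.loads`.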

## Standard vs modular detection

`generate` auto-detects which kind of pipeline it's calling:

1. If `model_index.json` exists on the repo → `DiffusionPipeline.from_pretrained` path.
2. Otherwise → `ModularPipeline.from_pretrained` path.

You don't need to tell it which. Modular repos must pass `--trust-remote-code` if they ship custom block code.
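
The detection rule can be sketched for a local checkout as follows. This is an illustration only: `detect_pipeline_kind` is a hypothetical name, and it covers only the local-directory case, not the Hub lookup.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def detect_pipeline_kind(model_dir: str | Path) -> str:
    """Return "standard" if model_index.json is present, else "modular"."""
    has_index = (Path(model_dir) / "model_index.json").exists()
    return "standard" if has_index else "modular"


with tempfile.TemporaryDirectory() as tmp:
    print(detect_pipeline_kind(tmp))  # no index file yet -> modular
    (Path(tmp) / "model_index.json").write_text("{}")
    print(detect_pipeline_kind(tmp))  # index file present -> standard
```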

## `--pipeline-kwargs` semantics

A JSON object passed straight through to `pipeline(**kwargs)`. String values at known image-input keys (`image`,
`mask_image`, `control_image`, `ip_adapter_image`, `image_2`) are auto-loaded as PIL images, so you can pass URLs
or local paths directly:

```bash
diffusers-cli generate \
--model black-forest-labs/FLUX.2-klein-9B \
--pipeline-kwargs '{"image": "https://example.com/cat.png", "prompt": "make the fur grey", "strength": 0.6}'
```

**Shell-quoting gotcha**: keep the JSON on one line, and use `\` line continuations only between arguments, never
inside the quotes. A literal newline inside a JSON string value in the single-quoted argument reaches the parser as
a raw control character and breaks `json.loads`.
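
One way to sidestep the quoting problem is to build the argument in Python. The sketch below uses only the standard library (`json.dumps` emits a single line by default, `shlex.quote` makes the result safe to paste into a POSIX shell) and then shows the failure mode with a raw newline inside a string value. The kwargs are the illustrative ones from the example above.

```python
import json
import shlex

kwargs = {"prompt": "make the fur grey", "image": "https://example.com/cat.png", "strength": 0.6}

# Single-line JSON, quoted for a POSIX shell.
arg = shlex.quote(json.dumps(kwargs))
print(f"--pipeline-kwargs {arg}")

# The "\n" below is a real newline character inside the JSON string value,
# which is what a line break inside the quoted shell argument produces.
broken = '{"prompt": "make the fur\n grey"}'
try:
    json.loads(broken)
except json.JSONDecodeError as exc:
    print("rejected by json.loads:", exc.msg)
```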

## LoRA adapters (`--lora`)

Attach a LoRA after the pipeline loads via a JSON spec:

```bash
diffusers-cli generate \
--model black-forest-labs/FLUX.2-klein-9B \
--pipeline-kwargs '{"prompt": "a tiny grey cat"}' \
--lora '{"lora_id": "alvdansen/littletinies", "lora_scale": 0.8}'
```

The CLI calls `pipeline.load_lora_weights(<lora_id>, adapter_name="default")` and, if `lora_scale` is present,
`pipeline.set_adapters(["default"], adapter_weights=[<scale>])`. It reports a clear error if the pipeline doesn't
support LoRA or `lora_id` is missing.
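
The call sequence described above can be sketched as below. This is not the CLI's actual code: `apply_lora` and `RecordingPipeline` are hypothetical names, the stand-in pipeline only records calls instead of loading weights, and the use of `ValueError` for the error cases is an assumption.

```python
from __future__ import annotations

import json


def apply_lora(pipeline, lora_json: str) -> None:
    """Apply a --lora JSON spec using the two calls named in the text."""
    spec = json.loads(lora_json)
    lora_id = spec.get("lora_id")
    if not lora_id:
        # Assumption: the exception type is illustrative only.
        raise ValueError("--lora spec needs a 'lora_id' key")
    if not hasattr(pipeline, "load_lora_weights"):
        raise ValueError(f"{type(pipeline).__name__} does not support LoRA")
    pipeline.load_lora_weights(lora_id, adapter_name="default")
    if "lora_scale" in spec:
        pipeline.set_adapters(["default"], adapter_weights=[spec["lora_scale"]])


class RecordingPipeline:
    """Stand-in pipeline that records calls instead of touching weights."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def load_lora_weights(self, lora_id, adapter_name=None):
        self.calls.append(("load_lora_weights", lora_id, adapter_name))

    def set_adapters(self, names, adapter_weights=None):
        self.calls.append(("set_adapters", names, adapter_weights))


pipe = RecordingPipeline()
apply_lora(pipe, '{"lora_id": "alvdansen/littletinies", "lora_scale": 0.8}')
print(pipe.calls)
```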

## Optimization flags

- `--dtype {auto, bf16, fp16, fp32, …}` — pipeline weight dtype. `bf16` is the right default for modern DiTs on
A100/H100.
- `--cpu-offload {model, group}` — `model` uses `enable_model_cpu_offload`, `group` uses
`enable_group_offload(offload_type="leaf_level", use_stream=True)`. Use `group` to fit a 9B+ model on a single A100.
- `--attention-backend {default, flash_hub, flash_varlen_hub, flash_4_hub, sage_hub}` — hub-hosted kernels,
auto-downloaded on first use. Failures (kernel not available, CUDA arch mismatch, network) raise a clear
`SystemExit` listing the alternatives instead of silently reverting to the default.
- `--vae-tiling` / `--vae-slicing` — lower peak VAE decode VRAM.
- `--context-parallel` — Ulysses-style context parallelism on a DiT. See [Context parallel](#context-parallel) below.

`disable_mmap=True` is always passed to `from_pretrained` — sequential reads are faster than mmap page-faults on
most filesystems.

## Output handling

`generate` sniffs the pipeline return type and saves accordingly:

- `PIL.Image` / list of them → `outputs/generate-<i>.png`
- Frame sequence (≥2 PILs or ndarrays) → `outputs/generate-0.mp4` (uses `--fps`, default 8)
- Numpy audio array → `outputs/generate-0.wav` (uses `--sampling-rate`)
- Anything else → JSON dump

Override the destination with `--output <path>` (file or directory).

Use `--push-to <user>/<bucket>` to upload outputs to an HF bucket after saving. The bucket is created if it
doesn't exist; objects land under `<run_id>/<filename>`.
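
To make the naming scheme concrete, here is a small sketch of where image outputs and their bucket copies would land. Both helper names are hypothetical, and zero-based numbering for `<i>` is an assumption inferred from the `generate-0.*` names above.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def local_image_paths(count: int, out_dir: str = "outputs") -> list[str]:
    """Default local filenames for a list of `count` images (assumed zero-based)."""
    return [f"{out_dir}/generate-{i}.png" for i in range(count)]


def bucket_key(run_id: str, local_path: str) -> str:
    """Object key under the bucket: <run_id>/<filename>."""
    return f"{run_id}/{PurePosixPath(local_path).name}"


paths = local_image_paths(2)
print(paths)                                   # ['outputs/generate-0.png', 'outputs/generate-1.png']
print([bucket_key("run-123", p) for p in paths])  # ['run-123/generate-0.png', 'run-123/generate-1.png']
```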

## Remote execution (`--remote`)

Add `--remote` to submit the same call as a Hugging Face Job:

```bash
diffusers-cli generate \
--model black-forest-labs/FLUX.2-klein-9B \
--pipeline-kwargs '{"prompt": "Make the cats fur grey", "image": "https://blobcdn.same.energy/a/d0/58/d058b51c2329b0ea4057e9f12cd9a1da36347e34"}' \
--remote --flavor a100-large \
--dtype bf16 \
--cpu-offload group
```

What happens:

1. Your HF token is picked up (from `--token` or your login).
2. A bucket (`<user>/jobs-artifacts` by default) is created if it doesn't exist.
3. The job runs in a PyTorch container that already has torch + CUDA preinstalled. Only the small Python
deps (`diffusers`, `accelerate`, `transformers`, `safetensors`) are installed at container start, so the
install is about 50 MB instead of 3 GB.
4. Container logs stream to your terminal. When the job finishes, the CLI downloads every file the job
uploaded to the bucket under its `run_id` prefix into `./outputs/`.
5. A timing breakdown (`queued_seconds`, `run_seconds`, `total_seconds`) is printed and included in the JSON
payload.
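
The timing fields in step 5 can be read as simple differences between three timestamps. The sketch below assumes that reading: the `submitted`, `started` and `finished` parameters are hypothetical, and treating `total_seconds` as queue time plus run time is an assumption rather than something the CLI documents.

```python
from __future__ import annotations

from datetime import datetime, timezone


def timing_breakdown(submitted: datetime, started: datetime, finished: datetime) -> dict[str, float]:
    """Derive the three timing fields from job lifecycle timestamps."""
    queued = (started - submitted).total_seconds()
    run = (finished - started).total_seconds()
    # Assumption: total is simply queue time plus run time.
    return {"queued_seconds": queued, "run_seconds": run, "total_seconds": queued + run}


t0 = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)   # job submitted
t1 = datetime(2026, 1, 1, 12, 0, 45, tzinfo=timezone.utc)  # container started
t2 = datetime(2026, 1, 1, 12, 3, 15, tzinfo=timezone.utc)  # job finished
print(timing_breakdown(t0, t1, t2))
```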

Flags:

- `--flavor <name>` — HF Jobs hardware (e.g. `a10g-small`, `a100-large`, `4xa100-large`).
- `--timeout <duration>` — max wallclock (e.g. `30m`, `2h`). Defaults to `10m`.
- `--dependencies <pkg>` — extra pip deps (repeatable).
- `--namespace <name>` — run under a different account.
- `--no-wait` — submit, return job id, don't stream logs.
- `--push-to <bucket>` — override the artifact bucket id.

## Context parallel

`--context-parallel` enables Ulysses CP on a DiT-based pipeline. **Locally** the user must launch via torchrun:

```bash
torchrun --nproc-per-node=2 -m diffusers.commands.diffusers_cli generate \
--model black-forest-labs/FLUX.2-klein-9B \
--pipeline-kwargs '{"prompt": "Make the cats fur grey"}' \
--dtype bf16 \
--context-parallel
```

**Remotely** the CLI handles the torchrun wrapping — just pass `--context-parallel` to a `--remote` invocation on
a multi-GPU flavor:

```bash
diffusers-cli generate \
--model black-forest-labs/FLUX.2-klein-9B \
--pipeline-kwargs '{"prompt": "Make the cats fur grey", "image": "https://blobcdn.same.energy/a/d0/58/d058b51c2329b0ea4057e9f12cd9a1da36347e34"}' \
--remote --flavor 4xa100-large \
--dtype bf16 \
--context-parallel
```

Inside the container, CP swaps the entrypoint to `torchrun --nproc-per-node=gpu -m
diffusers.commands.diffusers_cli`, initializes a hybrid process group (`cpu:gloo,cuda:nccl` — NCCL for the
attention all-to-all, Gloo for `ulysses_anything`'s per-rank size coordination), pins each rank to
`cuda:{LOCAL_RANK}`, and gates output saving/printing to rank 0 only.
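
The per-rank device pinning and rank-0 gating can be sketched without torch, using the `RANK` and `LOCAL_RANK` variables that torchrun sets for each worker. The helper names are hypothetical, and the real CLI would additionally create the process group (for example via `torch.distributed.init_process_group` with the hybrid backend string quoted above), which this sketch leaves out.

```python
from __future__ import annotations

import os
from collections.abc import Callable, Mapping


def rank_info(env: Mapping[str, str] | None = None) -> dict:
    """Read torchrun's per-worker variables and derive the device for this rank."""
    env = os.environ if env is None else env
    rank = int(env.get("RANK", "0"))
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return {"rank": rank, "device": f"cuda:{local_rank}", "is_main": rank == 0}


def save_if_main(info: dict, save: Callable[[], None]) -> bool:
    """Run `save` only on rank 0 so outputs are written once."""
    if info["is_main"]:
        save()
        return True
    return False


saved: list[str] = []
for rank in range(2):  # simulate two workers on one node
    info = rank_info({"RANK": str(rank), "LOCAL_RANK": str(rank)})
    save_if_main(info, lambda: saved.append(info["device"]))
print(saved)  # ['cuda:0']
```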

**Memory note**: CP shards the sequence, **not the weights**. Every rank still holds the full transformer. Wins
are wall-clock attention speedup and headroom for very long sequences, not "fit a model that doesn't fit." For
weight sharding you'd want TP or FSDP — not exposed in the CLI yet.
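
A back-of-the-envelope version of the memory note: weights are replicated on every rank, while the sequence is split. The numbers are illustrative only (a 9B-parameter transformer in bf16 at 2 bytes per parameter, and a hypothetical sequence of 8192 tokens on 4 GPUs), and activations and other components are ignored.

```python
def per_rank_footprint(num_params: float, bytes_per_param: int, seq_len: int, world_size: int) -> dict:
    """Weights are replicated per rank; only the sequence is sharded."""
    return {
        "weight_gb_per_rank": num_params * bytes_per_param / 1e9,  # independent of world_size
        "tokens_per_rank": seq_len // world_size,                  # shrinks with world_size
    }


single = per_rank_footprint(9e9, 2, 8192, 1)
sharded = per_rank_footprint(9e9, 2, 8192, 4)
print(single)   # {'weight_gb_per_rank': 18.0, 'tokens_per_rank': 8192}
print(sharded)  # {'weight_gb_per_rank': 18.0, 'tokens_per_rank': 2048}
```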

CP is DiT-only. UNet pipelines raise a clear error directing you to a DiT pipeline (FLUX, SD3, HunyuanDiT,
AuraFlow, …).

## Output mode (`--format`)

The CLI auto-detects when running under an AI coding agent (Claude Code, Cursor, Aider, GH Copilot Agent — via
`CLAUDECODE`, `CLAUDE_CODE`, `CURSOR_AI`, `AIDER_AI_CONTEXT`, `GH_COPILOT_AGENT`) and switches output to **agent
mode** automatically — TSV tables, `key=value` results, compact JSON dicts, no progress bars.

Override explicitly with `--format {auto, human, agent, json}` placed **before** the subcommand:

```bash
diffusers-cli --format json generate --model <id> --pipeline-kwargs '...'
```

The legacy `--json` flag on `generate` still works as a shortcut for `--format json`.
43 changes: 43 additions & 0 deletions src/diffusers/commands/_common.py
@@ -0,0 +1,43 @@
# Copyright 2026 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Shared helpers used by multiple ``diffusers-cli`` subcommands.

Anything imported by more than one command file lives here so command modules stay standalone — no cross-command
imports between e.g. ``describe`` and ``generate``.
"""

from __future__ import annotations

from argparse import Namespace
from pathlib import Path


def try_fetch_config(args: Namespace, filename: str) -> str | None:
    """Resolve ``filename`` for ``args.model`` (local path or Hub repo). Return None if absent.

    Used by ``generate`` (to detect modular vs standard pipelines) and ``describe`` (to read the pipeline class for
    schema introspection). No weights are downloaded, only the small index file.
    """
    local = Path(args.model)
    if local.exists():
        candidate = local / filename
        return str(candidate) if candidate.exists() else None

    from huggingface_hub import hf_hub_download
    from huggingface_hub.utils import EntryNotFoundError, HfHubHTTPError, RepositoryNotFoundError

    try:
        return hf_hub_download(args.model, filename, revision=args.revision, token=args.token)
    except (EntryNotFoundError, HfHubHTTPError, RepositoryNotFoundError):
        return None