DO-NOT-MERGE: RFC: NVMe-oF orchestrator coexistence — nvme-discoverd, ownership registry, and exclusion list#3442
Conversation
RFCs submitted for review only. Not intended to be merged. Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
6de7bfa to
d29d244
Compare
Registry: - Show only the guarded udev rule in §4; remove the naive unguarded version that preceded it — a skimming reader would copy the wrong one - Aperiodic audit is now bidirectional: re-assert ownership for live controllers with missing entries, not only remove stale ones - Document the residual TOCTOU race (dangerous direction) explicitly Exclusion list: - Fix Python match semantics: None/NULL caller params skip the comparison (same rule as C); they do not block a match - Clarify host-iface= scope: only matches interface-pinned connections; manual controller= entries without host-iface= are not matched - Add address normalization note (inet_pton/inet_ntop, not strcmp) - connect-all --nbft is exempt from the exclusion list Orchestrator coexistence: - Scope "no D-Bus signal" to between orchestrators - "No special-case logic required" scoped to disconnect logic in nvme-stas - connect-all --nbft exemption noted in Tier 2 summary - Add versioning note: these behaviors co-ship in nvme-cli 3.0 / nvme-stas 3.0; earlier pairings do not provide the guarantees nvme-discoverd: - Add RuntimeDirectoryPreserve=yes; without it systemd removes /run/nvme/discoverd on every stop/crash, destroying devid files needed by active ExecStop= lines and state files for crash recovery - Add registry ownership check before adopting pre-existing connections: skip any controller owned by neither discoverd nor nbft - Correct varlink -> D-Bus throughout: StartTransientUnit, StopUnit, RestartUnit, and JobRemoved are all org.freedesktop.systemd1.Manager - Scope "never issues disconnects" to steady-state operation - Clarify referral DC fizzle-out is intentional; add open questions: stale-cache aging (#2), shutdown vs. mounts (#3), NBFT root (#4) - Minor: "three lines" -> "four lines" for --devid-file specifier; GPL-2.0-only (not -or-later); inline unit comments moved to their own lines (systemd unit syntax has no end-of-line comments) Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
Add §3.9 explaining the varlink/D-Bus situation for the systemd interface. Not depending on D-Bus was a design wish — systemd is migrating to varlink and new code should follow. However, varlink is not yet sufficient: io.systemd.Unit.StartTransient was only added in systemd v260 (March 2026), and StopUnit, RestartUnit, and ResetFailedUnit have no varlink equivalent at all. Since discoverd already depends on libsystemd for sd_event, using sd-bus (systemd's own D-Bus implementation, part of libsystemd) adds no new library dependency. The D-Bus usage is scoped to the systemd interface only; discoverd's own client-facing socket (§3.7) remains varlink-only. Add open question #5 to track the future migration once the varlink unit lifecycle API is complete. Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
Content fixes, verified against the kernel, systemd, libnvme, and nvme-stas sources where applicable: - io.systemd.Unit.StartTransient is in systemd v261 (still unreleased), not v260 as previously stated; v261 still lacks StopUnit, RestartUnit, and ResetFailedUnit, strengthening the D-Bus rationale (§3.9) - a unique-NQN DC gets the kernel's 5 s NVME_DEFAULT_KATO, not a dead session; --keep-alive-tmo=30 normalizes the DC kato regardless of NQN - FC WWN traddr comparison is case-insensitive string equality; nothing strips colon separators - TID hash field order now matches nvme-stas staslib/trid.py exactly - atomic write protocol documented as mkstemp() with random suffix, matching the implementation - disconnect-all confirmation: prompt on TTY for both --force and --owner; non-interactive invocations proceed without prompting Design additions from review: - transient units carry Before=nvme-discoverd.service so discoverd stops before shutdown-time disconnects, preventing reconnect attempts (and FC kickstart re-issue) against a shutting-down system - NBFT entries can be Discovery Controllers, not just IOCs; the --owner nbft substitution applies to both unit types - host-traddr= exclusion entries provide interface exclusion for RDMA and FC, where host-iface does not apply - registry §4.4 covers both initramfs boot paths: NBFT and the FC kickstart (unowned until discoverd adopts them at startup) - the recycled-devid stale-entry edge case requires the udev rule race and a libnvme bypass to coincide; normally the rule has already cleaned up Readability: split large sections into topical subsections (registry §1/§4, exclusion §3/§7, discoverd §3.2, coexistence §5) and break up long paragraphs throughout. No content changes from restructuring. Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-fable-5 [Claude Code]
Add §11 "Future Release: DC Retention Policy" documenting how discovery-derived configuration is retained after a discovery source becomes unavailable. Introduce `discovery-retention-time` (replacing the mDNS-only `zeroconf-stale-timeout`) as a unified parameter covering all dynamic discovery sources: mDNS, referral DCs, and FC kickstart DCs. Statically configured and NBFT-derived DCs are retried indefinitely — they represent explicit intent and are not subject to the retention timer. Introduce `fc-kickstart-interval-minutes` for periodic FC fabric probing, motivated by the equipment replacement scenario where a new DC may have a different address/NQN and can only be discovered via an active kickstart. Update §4 and §7.1 to note that FC kickstart has no TID cache in the initial release, but the retention policy will require one. Update §5 retry policy to reflect the static vs. dynamic DC distinction. Align §11.1 timer-start semantics with §11.2 (timer starts on source disappearance regardless of connection state). Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
Periodic FC kickstart is opt-in, not on by default. nvme-discoverd ships on all Linux systems — enabling periodic fabric probing by default on laptops and desktops with no FC infrastructure would be wasteful and surprising. Use 0 to disable (consistent with systemd's convention for interval parameters, e.g. WatchdogSec=0). 0 was previously invalid; infinity was the disable value. Any value >= 1 is a valid interval in minutes. Fix the consistency issue introduced by the default change: soften "not acceptable in a production environment" to "may not be acceptable in an FC production environment", and reframe the orthogonality paragraph to read naturally with the default-is-0 framing. Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
The entry-ID hash was unnecessary complexity: exclusion list management is human-only, so a sequential number scoped to a single interactive `nvme exclusion remove` invocation suffices, and `libnvmf_exclusion_remove()` now matches by exact entry-string content instead. The Python bindings no longer expose exclusion_add/remove/create/delete at all — providing a programmatic management API would contradict the human-administered design; only the read-only exclusion_match/lists/entries() remain. Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
The RFC referenced the legacy NVMe-oF autoconnect components only in passing (§1, §7.1), with no single place describing what nvme-cli installs today, when each piece fires, and how nvme-discoverd subsumes it. Add §12 with a full inventory and a clear split between components that establish connections (replaced) and those doing orthogonal tuning, key provisioning, or interface naming (kept). Document the NBFT late-connect path in particular: nvmf-connect-nbft.service has no [Install] section and is driven only by a NetworkManager dispatcher script, so it never fires under systemd-networkd or other managers. discoverd's retry loop covers the same case manager-agnostically, making the late-connect a latency optimization rather than a correctness requirement. Renumber Open Questions to §13 and Glossary to §14. Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Expand the RFC with reviewer-driven clarifications: - §6.1: explain why discoverd uses its own discoverd.conf rather than reusing config.json (json-c dependency; JSON has no comments) or discovery.conf (DC-only; cannot express IOCs or global toggles). - §12.1/§12.2: nvmf-connect.target was only a collective handle for the daemon-less design; the discoverd daemon is that coordinator, so no target equivalent is needed. - §12.2: scope the autoconnect-rule replacement to the running system. Initramfs (Phase 1) connect is dracut's job (74nvmf, formerly 95nvmf), which has never used 70-nvmf-autoconnect.rules. Note that StartTransientUnit works over systemd's private socket without the D-Bus daemon, so discoverd is kept out of Phase 1 by design, not impossibility. Recommend removing the dead 70-nvmf-autoconnect.conf snippet from nvme-cli. - §12.5: emphasize systemd-networkd as the common NetworkManager alternative that the legacy NBFT dispatcher path does not cover. Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Reconcile the exclusion RFC with what was actually built (review item L13): - §6.1: libnvmf_exclusion_match() takes a struct libnvmf_tid * rather than seven positional string arguments — fewer call-site mistakes. Document the new libnvmf_exclusion_entry_valid() helper used to pre-validate hand-edited files without filesystem side effects. - §6.3: mark the proposed key_value_list_parse()/_free() utilities as intentionally not provided; libnvmf_tid_parse() already covers parsing a semicolon-separated key=value string, and the matcher parses entries internally. - §5: show the actual --name/--entry option form instead of positional <name>/<entry> arguments. Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-opus-4-8 [Claude Code]
tbzatek
left a comment
There was a problem hiding this comment.
Well written design documents, I appreciate the level of detail!
Question about the legacy nvme connect-all --nbft: still being used by dracut, so I guess it stays, yet discoverd reimplements some logic. Would be nice to unify the codebase to avoid duplication. My concern is interpretation of various NBFT flags and processes... although certain best practices got added in the Boot Spec, there are still gaps. There have always been disparity between various UEFI implementations and the OS, even though we generally tend to avoid adding quirks in the nvme-cli NBFT code for firmware bugs.
| | `nvmf-connect-nbft.service` | systemd oneshot | On demand, started by the NM dispatcher on NBFT-interface up | **Replaced** — discoverd adopts/reconnects NBFT controllers from its NBFT cache (§7.1); see §12.5 | | ||
| | `80-nvmf-connect-nbft.sh` | NM dispatcher | NetworkManager interface-up for `nbft*`/HFI connections | **Replaced** — discoverd's retry loop, manager-agnostic (§12.5) | | ||
| | `70-nvmf-autoconnect.conf` | dracut conf | Build-time `install_items+=` snippet that copies `70-nvmf-autoconnect.rules` into the initramfs | **Remove outright** (pre-existing cruft). dracut has never used `70-nvmf-autoconnect.rules` and ships its own initramfs mechanism, so this snippet only copies an inert rule into the initrd. Safe to delete from nvme-cli independently of discoverd; early-boot connect is dracut's job (§7.1, §12.2) | | ||
| | `65-persistent-net-nbft.rules` | udev rule | Naming of `nbft*` interfaces | **Kept** — interface naming, not connect logic | |
There was a problem hiding this comment.
Not sure whether this ever served its purpose... @mwilck ?
There was a problem hiding this comment.
I suspect you're right that it never really served a purpose, and the reason fits the rest of the picture. This rule does one thing — pin nbft* interface names so udev doesn't rename them — and the only thing that ever consumed that name was the NM dispatcher (80-nvmf-connect-nbft.sh), which matches nbft* to trigger the late-NBFT connect — i.e. the very path you just described as never really working. So with no working consumer, the naming rule has nothing to serve. I currently list it as "Kept" (orthogonal naming), but I'm inclined to move it to "Remove" alongside the NM-dispatcher path: if the late-NBFT machinery goes, this goes with it. I'll defer the definitive history to @mwilck — he'd know if anything else ever relied on the nbft* name — and tie the final disposition to the broader question of whether discoverd carries NBFT handling at all.
Updates forllowing following peer reviews Signed-off-by: Martin Belanger <martin.belanger@dell.com>
|
|
||
| `nvme connect` is used instead of `nvme discover --persistent` because `nvme discover` always issues a Get Log Page immediately — nvme-discoverd fetches the DLP separately via ioctl after the DC device appears, so it would be redundant. | ||
|
|
||
| `--keep-alive-tmo=30` sets the keep-alive timeout explicitly: `nvme connect`'s automatic 30 s discovery KATO only fires for the well-known discovery NQN (`nqn.2014-08.org.nvmexpress.discovery`). NVMe Base Spec 2.x allows DCs to use a unique NQN; for those, neither nvme-cli nor the kernel recognizes the connection as a discovery session, and the kernel falls back to its 5 s I/O-controller default (`NVME_DEFAULT_KATO`). Setting the option explicitly gives every DC session the DC-appropriate 30 s value regardless of which NQN it uses. |
There was a problem hiding this comment.
Is there anything special about KATO and DC? Just wondering why this is highlighted here.
There was a problem hiding this comment.
discoverd uses nvme connect to establish every session — Discovery Controller (DC) or I/O Controller (IOC) — rather than nvme discover --persistent (the command usually used for DCs). Because nvme connect is the command normally used for IOCs, its keep-alive defaults to the 5 s I/O value (NVME_DEFAULT_KATO); but a DC session is persistent and otherwise idle — it exists only to keep the session up and receive AENs — so I want the 30 s discovery keep-alive there instead.
That is why --keep-alive-tmo=30 is set explicitly, and it's a footgun worth highlighting: nvme connect only applies the 30 s automatically when it recognizes the well-known discovery NQN (nqn.2014-08.org.nvmexpress.discovery). NVMe Base Spec 2.x lets a DC use a unique NQN (CDCs do), and for those neither nvme-cli nor the kernel classifies the session as discovery, so the keep-alive silently stays at the 5 s I/O default. Setting it explicitly gives every DC session 30 s regardless of which NQN it uses.
There was a problem hiding this comment.
Ohh, I do remember now. Should we add additional knobs to the cli, --io-keep-alive-tmo, --dc-keep-alive-tmo?
There was a problem hiding this comment.
Worth separating the two commands here, because the answer differs.
For nvme connect — which is the only one discoverd uses — I don't think the knobs help. nvme connect brings up exactly one controller, and the user already knows what they're connecting to: that's why we have nvme connect (typically used for I/O Controllers - IOC) and nvme discover (used for Discovery Controlers - DC) as separate commands. So the existing --keep-alive-tmo / -k already expresses everything — one controller, one value the admin chooses. A per-class split adds nothing here: with a single controller you'd just set -k to the value you want, and --dc-keep-alive-tmo on a connection you already know is an IOC (or vice-versa) is just a redundant spelling of -k. discoverd sets -k explicitly on every connection, so it needs nothing new.
nvme connect-all is the one command with the double duty — it loops over a discovery log and brings up both DCs (referral entries) and IOCs in a single shot, so one -k genuinely can't express a different keep-alive for the two classes. That's where --dc-keep-alive-tmo / --io-keep-alive-tmo do make sense, letting it set them independently (with the 30 s / 5 s defaults). So if we want the knobs, connect-all is their natural home — but that's a standalone connect-all enhancement, independent of this RFC, and irrelevant to discoverd, which never calls connect-all.
It's worth being clear about where the original footgun came from, because it's a library classification gap, not a user one. For a long time libnvme used the well-known discovery NQN (nqn.2014-08.org.nvmexpress.discovery) to tell a DC from an IOC, so nvme connect only applies the 30 s discovery default when it sees that NQN. TP8010 (CDCs) then allowed a DC to use a unique NQN, and libnvme had to catch up — a DC with a unique NQN still silently falls back to the 5 s I/O default, because the library can't recognize it as a DC from the NQN alone. The user knew it was a DC all along; the library didn't. discoverd sidesteps the whole thing by always setting -k explicitly.
|
|
||
| DC units whose TID is present in the NBFT cache carry `--owner nbft` instead of `--owner discoverd` — the NBFT records the Discovery Controller the firmware used to find the boot targets. The substitution rule is shared with IOC units and described in §3.4. | ||
|
|
||
| `nvme connect` establishes the connection and exits; the kernel maintains the session and delivers future AENs on it. nvme-discoverd monitors for the resulting device-add event; after a ~1 s sysfs soak (sysfs attributes are not fully populated at the moment the `add` event fires), it checks the `cntrltype` attribute to confirm the device is a Discovery Controller. Once the device name is known (e.g., `/dev/nvme0`), libnvme is called directly (`libnvme_scan_ctrl` + `libnvmf_get_discovery_log`) to fetch the DLP via Get Log Page — a short ioctl on the already-connected device, not a `/dev/nvme-fabrics` write. |
There was a problem hiding this comment.
Which sysfs entries are missing? This sounds like a bug in the kernel. When the udev is fired, all relevant entries need to be present IMO. Just recently, @maurizio-lombardi fixed the ordering of sysfs entries adding and the udev event fired.
There was a problem hiding this comment.
This came out of testing on a real system about five years ago. sysfs exposes two kinds of controller attribute: (A) the local parameters used to make the connection (transport, address, host NQN, …), which appear immediately; and (B) attributes retrieved from the remote controller, such as cntrltype, which can lag. What I observed was that the kernel emitted the add nvmeX event as soon as the connection was established, but the type-(B) remote attributes were not necessarily in sysfs yet — adding a ~1 s delay after receiving the udev event before reading cntrltype reliably worked around it.
That was a long time ago, so it may well be fixed by now. If @maurizio-lombardi's recent add-event ordering fix guarantees the remote attributes are populated before the add event fires, the soak is unnecessary and I'll drop it. I'll re-test on a current kernel — connect a DC and read cntrltype at the add event — and remove the soak if it reads correctly, noting the minimum kernel version in the RFC.
There was a problem hiding this comment.
I suppose you are talking about this: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=37953cec775ed34e59cf9a7d7bb9b0610daa3f3e
This patch doesn't have any relevant effect outside of those sysfs attributes that check for the NVME_CTRL_STARTED_ONCE flag; so I am pretty sure it won't fix the problem you are seeing
There was a problem hiding this comment.
Indeed was just the NVME_EVENT=connected event. But this means we should at least review the relevant kernel code and figure out if this is not something we need to fix. I am allergic to random sleep sprinkled around.
|
|
||
| Each DLPE in the DLP is processed based on its `subtype`: | ||
| - `NVME_NQN_NVME` — I/O Controller: create a transient IOC unit. | ||
| - `NVME_NQN_DISC` — Referral to another DC: create a transient DC unit for the referred DC. When that DC's device appears, its DLP is fetched and processed identically. Referral chains of arbitrary depth are followed naturally through the event loop — no explicit recursion is needed. Note: `nvme connect-all` traverses referrals the same way; `nvme discover` does not (it only fetches one level). If a referral DC becomes permanently unreachable, its per-DC DLP cache entry is eventually evicted; IOCs reachable exclusively through that referral chain fizzle out naturally as their connections drop and are not reconnected. This is intentional — the desired set is derived from current cache state. See §11 (DC Retention Policy) for the planned design. |
There was a problem hiding this comment.
I think we should add an option to nvme discover to support recursive lookups as well. Or is this stupid?
There was a problem hiding this comment.
Not stupid — I think it's genuinely useful. nvme discover today fetches one level and prints it; a recursive variant would follow referral DCs and print the whole reachable tree without connecting anything — the read-only mirror of what connect-all already does, and a handy "show me everything this fabric exposes" diagnostic.
I'd keep it as its own nvme-cli enhancement rather than fold it into this RFC, since it is independent of the daemon. Two things it would need: referral-loop detection (a misconfigured fabric can have cyclic referrals) and a way to present the tree (indent per level, or flatten with a "via" column). I'm happy to file a separate issue/RFC for it; it doesn't block discoverd either way.
There was a problem hiding this comment.
Yes, let's split this task out. I would like to avoid adding discoverd to the pre OS environment if possible. Thus nvme-cli should have at least the same feature set as one shot operation. (still not read through the whole RFC, I suppose there is something on this topic).
| CollectMode=inactive-or-failed | ||
| ``` | ||
|
|
||
| **NBFT-sourced units use `--owner nbft`.** The NBFT can describe Discovery Controllers as well as I/O subsystems — the firmware records the DC it used to discover the boot targets — so this rule applies to both unit types. When discoverd creates a transient unit, DC or IOC, for a controller present in its NBFT cache, it substitutes `--owner nbft` for `--owner discoverd` in the `ExecStart=` line — the only difference between an NBFT unit and a regular unit. Because `RestartUnit` replays the original `ExecStart=` verbatim, all subsequent reconnects also carry `--owner nbft`, preserving the `owner=nbft` lifetime invariant without any reconnect-time special-casing. At unit-creation time, discoverd selects the template variant based on a cache lookup: TID present in the NBFT cache → `--owner nbft`; otherwise → `--owner discoverd`. |
There was a problem hiding this comment.
Do I understand this correctly, you want to run discoverd in pre-OS env? I assume just a single pass, right? Shoudn't this part then single nvme-cli command?
This PR is for review only and will not be merged. It contains four RFC documents proposing a coordinated set of features for NVMe-oF orchestrator coexistence in nvme-cli 3.0. Please use inline comments to provide feedback on specific sections.
Background
As NVMe-oF deployments grow, hosts increasingly run multiple tools that manage NVMe-oF connections: nvme-stas, dracut/initramfs scripts, and soon nvme-discoverd. Without coordination, these orchestrators can conflict — one may disconnect a controller that another is actively managing, or connect to a controller that another has deliberately excluded.
This RFC set proposes two libnvme building blocks to solve the coexistence problem (ownership registry and exclusion list), and a new daemon orchestrator (nvme-discoverd) that puts them to use.
The four RFCs
rfc-nvme-orchestrator-coexistence.md — Start here
The top-level document. Frames the two distinct conflict scenarios (accidental disconnect, accidental connect), introduces the two prevention mechanisms (ownership registry and exclusion list), defines a three-tier orchestrator hierarchy (raw commands → manual orchestrators → daemon orchestrators), and shows how nvme-discoverd and nvme-stas naturally partition work without requiring IPC coupling between them.
rfc-nvme-registry.md — Ownership registry
A lightweight, cooperative registry under
/run/nvme/registry/that lets orchestrators declare ownership of connected controllers.nvme disconnect-allconsults the registry before acting so it never disconnects a controller managed by a running daemon. The registry is a libnvme building block: a C API (libnvmf_registry_*), a Python binding, and anvme registryCLI command family. An implementation PR already exists: #3425.rfc-nvme-exclusion.md — Exclusion list
A human-administered exclusion list at
/etc/nvme/exclusions/that prevents orchestrators from auto-connecting to controllers the administrator wants excluded. Its design center is auto-discovered controllers (mDNS, CDC DLP) where there is no configuration entry to remove. Managed vianvme exclusionCRUD commands. Enforcement is cooperative — each orchestrator reads the list and skips matching controllers; libnvme does not enforce it.rfc-nvme-discoverd.md — nvme-discoverd
A new daemon proposed for nvme-cli 3.0. It manages NVMe-oF connections for statically configured controllers, NBFT boot controllers, FC-discovered controllers, and (in a future release) mDNS-discovered controllers. Key design choices:
libnvmf_connect()directly; blocking/dev/nvme-fabricswrites happen in child processes managed by systemd. Achieves Daniel's no-threading goal.--owner discoverdon each generated unit; NBFT reconnects use--owner nbftto preserve the lifetime invariant set at boot.What feedback is sought
nvme exclusioncommand design?owner=nbftlifetime invariant, NBFT exclusion bypass, unowned-connections-are-fair-game semantics — do these cover the cases you care about?Related
registry-v2)