MLX: on-device token sampling with Gumbel-max by kiymetakdemir · Pull Request #20454 · pytorch/executorch

kiymetakdemir · 2026-06-23T19:20:29Z

Summary

Adds token sampling that runs inside the exported .pte for the MLX backend: a model wrapped in SamplingHead returns a sampled token id instead of [B, S, vocab] logits, avoiding the per-step logits copy to host and the host-side softmax+multinomial.

Sampling uses Gumbel-max: argmax(logits / temperature + g), where g = -log(-log(u)) and u is uniform on (0, 1). The only new schema primitive is a random source, RandomBitsNode; the rest reuses existing nodes. Greedy = temperature → 0. temperature is a runtime input; seed is optional.
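
As a rough illustration of the formula above (not the PR's lowering), the following pure-Python sketch adds Gumbel noise to temperature-scaled logits and takes the argmax. The names gumbel_max_sample and softmax are hypothetical helpers introduced here, and the sketch assumes temperature > 0.

```python
import math
import random
import sys


def gumbel_max_sample(logits, temperature, rng):
    """Return an index distributed as softmax(logits / temperature).

    Illustrative only: assumes temperature > 0 and a random.Random-like rng.
    """
    upper = 1.0 - 2.0 ** -53  # largest double below 1.0
    best_idx, best_score = 0, -math.inf
    for idx, logit in enumerate(logits):
        # Clamp u into (0, 1) so both logarithms stay finite.
        u = min(max(rng.random(), sys.float_info.min), upper)
        g = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        score = logit / temperature + g
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx


def softmax(logits, temperature):
    """Reference distribution the sampler is expected to follow."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [1.0, 2.0, 0.5]
n = 20000
counts = [0] * len(logits)
for _ in range(n):
    counts[gumbel_max_sample(logits, 1.0, rng)] += 1
freqs = [c / n for c in counts]  # should be close to softmax(logits, 1.0)
```

With enough draws the empirical frequencies approach softmax(logits / temperature), which is the property that lets the argmax replace a host-side softmax+multinomial.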

Changes

schema.fbs: new RandomBitsNode (append-only union member, optional seed).
custom_kernel_ops/sample.py: mlx::sample op + register_fake + CPU reference.
ops.py: _sample_handler lowering the Gumbel-max graph.
runtime/MLXInterpreter.h: exec_random_bits + dispatch.
llm/sampling.py: SamplingHead wrapper.
generate.py: None-guard optional compound fields so the optional seed (de)serializes.

Notes

Uniform and Gumbel values are computed in fp32: bf16 rounds the just-below-1.0 clamp up to 1.0, so -log(u) = 0 and log(0) = -inf, which makes g infinite and poisons the argmax.
Tests: custom_kernel_ops/test/test_sample.py, eager parity/distribution/determinism, export+partition lowering, and on-device e2e (incl. a bf16 large-vocab regression).
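
The bf16 note can be checked with a small stdlib-only sketch. It simulates bfloat16 rounding with struct rather than using a real bf16 tensor, and to_bf16 and gumbel are hypothetical helpers written for this illustration, not code from the PR.

```python
import math
import struct


def to_bf16(x):
    """Round a value to bfloat16 (round-to-nearest-even), returned as a float.

    Simplified: ignores NaN/inf handling, which this example does not need.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def gumbel(u):
    """-log(-log(u)) for u in (0, 1], with IEEE-style +inf instead of an exception."""
    inner = -math.log(u)   # exactly 0.0 when u == 1.0
    if inner == 0.0:
        return math.inf    # log(0) = -inf, so g = +inf
    return -math.log(inner)


fp32_clamp = 1.0 - 2.0 ** -24     # largest float32 below 1.0
bf16_clamp = to_bf16(fp32_clamp)  # rounds up to exactly 1.0

g_fp32 = gumbel(fp32_clamp)       # finite, about 16.6
g_bf16 = gumbel(bf16_clamp)       # +inf: this index would always win the argmax
```

The largest bfloat16 value below 1.0 is 1 - 2^-8, so a clamp expressed at fp32 precision cannot survive a cast to bf16.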

Fixes #20353

pytorch-bot · 2026-06-23T19:20:34Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20454

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

❌ You can merge normally! (1 Unrelated Failure), 3 Unclassified Failures

As of commit 20af908 with merge base 6f6225c:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

MLX / test-mlx / test-mlx (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: Command bash /Users/runner/work/_temp/exec_script failed with exit code 2
MLX / test-mlx-qwen35-moe / test-mlx-qwen35-moe (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Library not loaded: @rpath/liblz4.1.dylib
pull / test-arm-backend-public-api-backward-compatibility / linux-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: Command docker exec -t 1e3c75d547bc376f3a46904e6b5e0c43b95a73bd7bd9d5ad5a4e230beac1c734 /exec failed with exit code 127

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-23T19:21:22Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

metascroy · 2026-06-23T19:42:53Z

+from torch import Tensor
+
+
+@torch.library.custom_op("mlx::sample", mutates_args=())


This should be in backends/mlx/custom_ops.py

(It's not a custom kernel op b/c it doesn't have metal shader code)

Moved it to custom_ops.py and relocated the test.

metascroy · 2026-06-23T19:50:14Z

@claude review this code

claude · 2026-06-23T19:50:59Z

Claude finished @metascroy's task in 3m 8s

Code review: MLX on-device token sampling

Read through the full diff (sample.py, _sample_handler, exec_random_bits, SamplingHead, schema + generator changes, and the tests). Overall this is clean, well-tested work — the Gumbel-max lowering is faithful to MLX's uniform → gumbel → argmax layering, the schema change is correctly append-only, and the bf16 fp32-clamp fix is a nice catch with a dedicated regression test. Comments below.

Addressing the two inline review comments

1. MLXInterpreter.h:1703 — uint32 vs n.width (your question)

Yes, they must stay in sync, and right now they're only implicitly consistent:

check_allocation_bounded(shape, uint32, "random_bits");   // hard-coded 4-byte dtype
...
st.set_tensor(n.out, random::bits(shape, n.width, key, s)); // n.width drives real element size

The bound check assumes 4 bytes/element while the actual allocation is n.width bytes/element. They agree only because _sample_handler always emits width=4 and the schema default is 4. If anyone ever sets a different width, the bound check silently uses the wrong size and random::bits produces a non-uint32 dtype that the rest of the graph (the AsTypeNode → /uint32_max chain) assumes is uint32.

Since nothing emits a non-4 width, the simplest hardening is to assert the invariant rather than carry a flexible-but-unsupported field:

if (n.width != 4) {
  throw std::runtime_error("random_bits: only width=4 (uint32) is supported");
}

That makes the uint32 in the bound check provably correct. (Alternatively, map n.width → dtype and pass that to check_allocation_bounded, but that's more machinery than the single supported path warrants.)
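
To make the size mismatch concrete, here is a small arithmetic sketch. It assumes, as the comment above states, that width is bytes per element; the function names and the element count are hypothetical.

```python
def bytes_assumed_by_check(num_elements):
    """Size the bound check sees when the dtype is hard-coded to uint32."""
    return num_elements * 4


def bytes_actually_allocated(num_elements, width):
    """Size implied by the node's width field, in bytes per element."""
    return num_elements * width


num_elements = 1024
# Positive values mean the check over-estimates; zero means it is exact.
mismatch = {
    width: bytes_assumed_by_check(num_elements)
    - bytes_actually_allocated(num_elements, width)
    for width in (1, 2, 4)
}
```

Only width 4 makes the two sizes agree, which is why asserting the invariant is enough to keep the hard-coded dtype honest.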

2. sample.py location — move to custom_ops.py

Agreed. mlx::sample has no Metal shader; it's a graph-lowered op like mlx::rope / mlx::gather_mm / mlx::gather_qmm, which all live in backends/mlx/custom_ops.py. custom_kernel_ops/ is for ops with actual kernel source (tq_norm, tq4_compress, gated_delta_rule, …). Moving it requires updating the two import sites:

backends/mlx/ops.py:22 (import ...custom_kernel_ops.sample)
backends/mlx/custom_kernel_ops/test/test_sample.py:27 (and ideally relocate the test to backends/mlx/test/).

Additional observations

Reference op vs. on-device graph are two different implementations. The CPU reference in sample.py uses torch.rand directly (no nextafter clamp), while the lowered graph uses bits/uint32_max → min(·, nextafter(1,0)) → gumbel. The eager tests (test_matches_independent_gumbel_reference) validate the reference against an independent torch impl, and the e2e tests validate the graph distributionally — but the two paths are never checked to agree with each other, and seeded determinism differs (torch RNG host-side vs. MLX RNG on-device). That's a reasonable design given the host-copy avoidance goal, but worth a one-line comment in sample.py noting the reference is for export/shape only and is not bit-identical to the delegated path, so a future reader doesn't assume seeded parity across host/device.
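
The bits/uint32_max → min(·, nextafter(1,0)) → gumbel chain described above can be sketched in plain Python. This is an approximation in double precision, not the fp32 graph itself, and gumbel_from_bits is a hypothetical helper. It shows why the device path needs the clamp while a reference based on torch.rand, which samples from [0, 1), never sees u = 1.

```python
import math

UINT32_MAX = 4294967295.0
UPPER = 1.0 - 2.0 ** -24  # value of nextafter(1, 0) in float32


def gumbel_from_bits(bits, clamp=True):
    """Map a 32-bit random integer to Gumbel noise via u = bits / uint32_max."""
    u = bits / UINT32_MAX
    if clamp:
        u = min(u, UPPER)
    if u == 0.0:
        return -math.inf  # -log(0) = +inf, log(+inf) = +inf, then negated
    inner = -math.log(u)
    if inner == 0.0:
        return math.inf   # u == 1.0: log(0) = -inf, then negated
    return -math.log(inner)


unclamped_max = gumbel_from_bits(0xFFFFFFFF, clamp=False)  # +inf
clamped_max = gumbel_from_bits(0xFFFFFFFF)                 # finite
midpoint = gumbel_from_bits(2 ** 31)                       # about 0.3665
```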

temperature = 0 → division by zero. Greedy is documented as "temperature → 0", and tests use 1e-4. Since temperature is a runtime input, an actual 0.0 produces inf/nan in logits / temperature. Not a blocker (callers are expected to pass a small epsilon), but a brief note in SamplingHead's docstring that temperature must be > 0 would prevent a confusing on-device result.

ArgmaxNode(axis=-1) — confirm exec_argmax/MLX argmax accepts a negative axis at runtime. Logits are [B, vocab] so -1 is unambiguous, but if MLX expects a non-negative axis it'd be safer to emit the resolved positive axis (the handler has n.args[0].meta available). The e2e test would catch this if it's a problem; flagging in case the e2e runner wasn't exercised on this path.
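
Resolving the axis at lowering time is a one-line computation. A minimal sketch, with resolve_axis as a hypothetical helper (the handler's real API is not shown here):

```python
def resolve_axis(axis, ndim):
    """Convert a possibly negative axis into its non-negative equivalent."""
    if not -ndim <= axis < ndim:
        raise ValueError(f"axis {axis} is out of range for {ndim} dimensions")
    return axis % ndim


vocab_axis = resolve_axis(-1, 2)  # logits are [B, vocab], so this is 1
```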

Minor: emit_lifted_constant(P, 4294967295.0, ...) and the nextafter constant are recomputed each call — fine, but if make_or_get_constant-style dedup applies these will already be shared; no action needed.

None of these are blockers. The two inline items (width invariant + file relocation) are the ones worth doing before merge.
· branch mlx-ondevice-sampling

metascroy · 2026-06-23T19:43:33Z

+        return self.head(logits, temperature=temperature)
+
+
+class TestSampleOp(unittest.TestCase):


These test cases should be inside backends/mlx/test/test_ops.py

Moved TestSampleOp into test_ops.py.

metascroy · 2026-06-23T19:46:00Z

+            pte,
+        )
+        self.assertEqual(count_mlx_delegate_segments(pte), 1)
+        counts = get_mlx_node_counts(pte)


See test_ops.py. There are utilities for testing node counts

metascroy · 2026-06-23T19:47:11Z

+        # optional_str carries its own None handling; other compound offset
+        # fields (int_or_vid, etc.) must be guarded when optional so a None
+        # value is serialized as an absent field rather than crashing.
+        if fld.required or kind == "optional_str":


Why are these changes needed? You don't have a string arg on the new node?

metascroy · 2026-06-23T19:53:52Z

+    Gumbel-max sampling from softmax(logits / temperature).
+    logits:      [B, vocab]
+    temperature: scalar float tensor    (runtime input)
+    seed:        scalar int tensor or None


Does it not export if seed is an int?

Yes, a plain Python int doesn't export.

metascroy · 2026-06-23T19:56:30Z

+        AsTypeNode(
+            x=P.slot_to_tid(g_f32),
+            out=P.slot_to_tid(g),
+            scalar_type=torch_dtype_to_scalar_type(dt),


Should we have this at all? Why not compute divide/argmax/etc in same fp32 dtype? The final output type is integer

Thanks, fixed this.

metascroy · 2026-06-23T22:39:14Z

+    """
+    Gumbel-max sampling from softmax(logits / temperature).
+    logits:      [B, vocab]
+    temperature: scalar float tensor    (runtime input)


Can we add top-p as well?

metascroy · 2026-06-23T22:39:58Z

+    logits, temperature = args[0], args[1]
+    seed = args[2] if len(args) > 2 and args[2] is not None else None
+
+    dt = n.args[0].meta["val"].dtype


Can we use emit_if_else to specialize on temperature 0 as argmax?

Added this.
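
The specialization requested here amounts to a branch around the sampling path. The sketch below is a host-side Python analogy, not the emit_if_else lowering; argmax and sample_token are hypothetical names.

```python
import math
import random


def argmax(values):
    """Index of the first maximum."""
    return max(range(len(values)), key=lambda i: values[i])


def sample_token(logits, temperature, rng):
    """Greedy when temperature is exactly 0, Gumbel-max otherwise."""
    if temperature == 0.0:
        return argmax(logits)  # greedy branch: no division, no RNG
    noisy = []
    for logit in logits:
        # Clamp u into (0, 1) so both logarithms stay finite.
        u = min(max(rng.random(), 1e-300), 1.0 - 2.0 ** -53)
        noisy.append(logit / temperature - math.log(-math.log(u)))
    return argmax(noisy)


rng = random.Random(1)
state_before = rng.getstate()
greedy_token = sample_token([0.1, 3.0, -2.0], 0.0, rng)
state_after = rng.getstate()  # unchanged: the greedy branch drew no random numbers
```

Branching also sidesteps the division by zero that a literal temperature of 0 would otherwise cause.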

metascroy · 2026-06-25T00:52:32Z


        echo "::group::Run Python unit tests"
+        # test_ops.py is listed for its unittest classes; run_all_tests.py only
+        # runs its OpTestCase classes.


Wouldn't this run it twice?

What tests are non-OpTestCase classes in test_ops.py?

metascroy · 2026-06-25T01:03:42Z

+    return torch.ops.mlx.sample(logits, t, p, s)
+
+
+class TestSampleOp(unittest.TestCase):


Can these move into test_sample?

test_ops.py should only have OpTestCase because of how it's driven in CI

metascroy · 2026-06-25T01:08:49Z

+class TestSampleOp(unittest.TestCase):
+    """Eager reference behavior of mlx::sample (no export / no runtime)."""
+
+    def test_greedy_parity_small_temperature(self):


This case and the next greedy one can be handled in an OpTestCase using the regular machinery (b/c it's bit exact)

metascroy · 2026-06-25T01:14:41Z

+        other = run(logits, torch.tensor(1.0), torch.tensor(124, dtype=torch.int64))
+        self.assertFalse(torch.equal(first, other))
+
+    def test_export_strict_with_graph_inputs(self):


Isn't OpTestCase in test_ops already testing export?

metascroy · 2026-06-25T01:22:42Z

        "from __future__ import annotations",
        "",
-        "from typing import List, Tuple, Dict",
+        "from typing import Dict, List, Optional, Tuple",


Do we need any generate.py changes if we make seed Vid (optional)?

You already emit seed as a Vid (not an int literal).

metascroy · 2026-06-25T01:23:29Z

@claude review this code

claude · 2026-06-25T01:24:15Z

Claude finished @metascroy's task in 32m 1s

Code review: MLX on-device token sampling (round 2)

Re-reviewed the latest revision (top-p nucleus, temperature==0 greedy via IfNode, fp32 sampling chain, test reorg, optional-seed serialization). The lowering is correct and the bf16/fp32 reasoning is sound. Below I focus on the still-open threads from your latest review pass plus a couple of new observations.

Open review threads

1. test_ops.py should contain only OpTestCase (CI driver constraint) — not yet resolved

TestSampleOp(unittest.TestCase) is still in test_ops.py:7726. Earlier the class was moved into test_ops.py, but your later comment ("test_ops.py should only have OpTestCase because of how it's driven in CI") asks for the opposite: the plain unittest.TestCase eager tests should live in test_sample.py, leaving test_ops.py with only the OpTestCase subclasses (SampleSeededTest, SampleUnseededTest, SampleTopPTest). Right now test_ops.py mixes both.

2. CI mlx.yml:78 — test_ops.py in the pytest list runs twice / becomes unnecessary

This is the consequence of (1). run_all_tests.py already executes the OpTestCase classes in test_ops.py; the only reason to also list test_ops.py under pytest is to pick up the unittest.TestCase class. Once TestSampleOp moves to test_sample.py, test_ops.py should be dropped from the pytest invocation (it would otherwise re-import/re-collect with nothing pytest-runnable, and the comment about "its unittest classes" no longer applies). Note: I can't edit .github/workflows/ myself (GitHub App permissions), so this one is yours regardless.

3. The two greedy cases are bit-exact → make them OpTestCase, dropping skip_comparison

test_greedy_parity_small_temperature (T=1e-4) and test_greedy_temperature_zero (T=0) are both deterministic on device:

T=0 takes the IfNode greedy branch → argmax(logits), no RNG.
T=1e-4 makes logits/T dominate the O(1) Gumbel noise, so argmax collapses to argmax(logits) regardless of which RNG produced the noise.

So both are host/device bit-identical and belong as real OpTestCase runs (no skip_comparison), which exercises export + lowering + on-device parity in one shot via the standard harness — exactly what you flagged. The sampling cases that genuinely aren't bit-identical (SampleSeededTest etc.) correctly keep skip_comparison=True.
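
The T=1e-4 argument can be sanity-checked on the host. In this sketch (hypothetical helper noisy_argmax, arbitrary example logits) the gap between the top two logits is 0.5, so after dividing by T it becomes 5000, far larger than any Gumbel value the clamped uniform can produce. The caveat is that this relies on the gap being large relative to T; near-tied logits could still be reordered by the noise.

```python
import math
import random


def noisy_argmax(logits, temperature, rng):
    """argmax(logits / temperature + g) with freshly drawn Gumbel noise g."""
    scores = []
    for logit in logits:
        # Clamp u into (0, 1) so both logarithms stay finite.
        u = min(max(rng.random(), 1e-300), 1.0 - 2.0 ** -53)
        scores.append(logit / temperature - math.log(-math.log(u)))
    return max(range(len(scores)), key=lambda i: scores[i])


logits = [0.3, 1.7, 1.2, -0.4]  # top-two gap of 0.5
greedy = max(range(len(logits)), key=lambda i: logits[i])

# Different seeds give different noise, yet every draw picks the same token.
picks = {noisy_argmax(logits, 1e-4, random.Random(seed)) for seed in range(200)}
```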

4. test_sample.py:112 export tests vs. OpTestCase export coverage

TestSampleExport (export-only, runs .module() on host) overlaps with what the OpTestCase lowering tests already cover (export → partition → node counts). The genuinely unique host-only assertions are runtime-input semantics: same exported program at two temperatures (test_runtime_temperature_single_export) and seed-independent-of-host-RNG (test_seeded_export_reproducible_no_host_rng). test_export_strict_with_graph_inputs (just counts 3 user inputs) is fully subsumed by the OpTestCase export and can go. I'd keep the two semantic tests, drop the pure-export one.

5. generate.py:575 — do we need these changes if seed is an optional Vid?

Short answer: yes, the generate.py changes are still needed, and switching seed from IntOrVid to Vid would not remove them — it would move the problem to a path that isn't handled.

Why: before this PR no schema had an optional compound/struct field. seed is always emitted via P.to_int_or_vid(seed_val) where seed_val is a Slot, so it's IntOrVid.from_vid(...) — never a literal — and may be absent. The codegen needed three gaps closed for an absent compound field:

Python _build_int_or_vid returning None for None input, and the Add site guarding if seed_off is not None (schema.fbs seed is optional → no AddSeed call when absent).
C++ loader presence-guarding convert_* (it dereferences a null FlatBuffer pointer otherwise).
_fbs_type_to_cpp mapping optional IntOrVid → std::optional<IntOrVid>.

If you instead declared seed: Vid (optional), the Python emitter would route through _PY_STRUCT_CREATOR → CreateVid(builder, op.seed.idx) inlined unconditionally, which throws on None and has no None-guard today — so you'd have to add optional-Vid support there instead. Net: IntOrVid reuses the prebuilt-offset path that you've now made None-safe, with less new machinery than optional-Vid would require. So the current choice is the lower-churn one; the generate.py diff is justified. (If you do prefer Vid for clarity since it's always a vid, that's fine too — just be aware it needs its own None-guard in the struct-creator path.)
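
The None-guard pattern described above can be illustrated with a toy builder. Everything in this sketch is hypothetical (ToyTableBuilder, build_optional, serialize_node); it mimics the shape of the generated code, not the actual generate.py output or the FlatBuffers API.

```python
class ToyTableBuilder:
    """Hypothetical stand-in for a FlatBuffers-style table builder."""

    def __init__(self):
        self.fields = {}

    def add(self, name, offset):
        if offset is None:
            raise TypeError(f"cannot add field {name!r} with a None offset")
        self.fields[name] = offset


def build_optional(value):
    """Return a prebuilt offset, or None when the optional value is absent."""
    return None if value is None else ("offset", value)


def serialize_node(seed=None):
    """Serialize a node whose seed field is optional."""
    builder = ToyTableBuilder()
    seed_off = build_optional(seed)
    if seed_off is not None:  # guard: skip the add call when the field is absent
        builder.add("seed", seed_off)
    return builder.fields


without_seed = serialize_node()
with_seed = serialize_node(seed=7)
```

An absent seed leaves the field out of the table entirely instead of crashing in the add call, which is the behavior the codegen changes are meant to provide.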

New observations (non-blocking)

MLXInterpreter.h:1708 bound check. Generalizing to widths {1,2,4} is fine, but check_allocation_bounded(shape, uint32, ...) now intentionally over-estimates for width 1/2. The comment says so, which is good — just confirming the guard is conservative-safe (it bounds above the real allocation), not unsafe.
Negative temperature diverges between reference and device. The device emits GreaterNode(temperature, 0) → any temperature < 0 takes the greedy branch, while the CPU reference only treats == 0 as greedy and would compute inverted-sampling for a negative value. Callers shouldn't pass negatives, but the two paths disagree there. A one-line note in SamplingHead's docstring ("temperature must be ≥ 0") or clamping in the reference would keep them consistent.
sample.py/custom_ops.py reference honesty. The docstring already states the reference isn't bit-identical to the delegate — 👍, that addresses my prior comment.
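
The negative-temperature divergence noted in the list above comes down to two different branch predicates. A minimal sketch, with both function names hypothetical and the predicates taken from the description in this review rather than from the code:

```python
def device_takes_sampling_branch(temperature):
    """Predicate described for the device graph: temperature > 0."""
    return temperature > 0


def reference_takes_sampling_branch(temperature):
    """Predicate described for the CPU reference: anything except exactly 0."""
    return temperature != 0


# Temperatures on which the two paths choose different branches.
disagreements = [
    t
    for t in (-1.0, 0.0, 1e-4, 1.0)
    if device_takes_sampling_branch(t) != reference_takes_sampling_branch(t)
]
```

The two paths agree on every non-negative temperature, so documenting or clamping the negative case is enough to make them consistent.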

Summary

Logic and lowering look correct and well-tested. The remaining work is test organization, all from your own latest pass: move TestSampleOp out of test_ops.py (1), drop test_ops.py from the pytest list afterward (2), promote the two bit-exact greedy cases to OpTestCase (3), and trim the redundant export-only test (4). Item (5) is answered above — keep as-is.
· branch mlx-ondevice-sampling

MLX: on-device token sampling (mlx::sample, Gumbel-max)

220b53f

meta-cla Bot added the CLA Signed label Jun 23, 2026

metascroy reviewed Jun 23, 2026

Comment thread backends/mlx/runtime/MLXInterpreter.h

metascroy reviewed Jun 23, 2026

MLX: reject non-uint32 width; move sample to custom_ops; docs

bddb819

metascroy reviewed Jun 23, 2026

MLX: top-p sampling; temperature=0 greedy via IfNode; fp32 sampling chain; reorg sample tests

20af908

metascroy reviewed Jun 25, 2026
