Apple Silicon support: arm64 build fixes + MPS device for torch backend by oceanapplications · Pull Request #609 · PufferAI/PufferLib

oceanapplications · 2026-07-04T04:06:08Z

Apple Silicon support: arm64 build fixes + MPS device for the torch backend

Makes ./build.sh <env> --cpu build on Apple Silicon and the PyTorch (--slowly) backend train on the M-series GPU via MPS. Related: #590, #507, #422, #532.

Changes

build.sh

Gate -mavx2 -mfma on x86_64 (arm64 clang rejects them; same idea as #590, "Conditionally apply AVX2/FMA flags by architecture (fix Apple Silicon build)").
macOS OpenMP: compile with -Xpreprocessor -fopenmp + Homebrew's omp.h, but link against torch's bundled libomp.dylib. This is the subtle one: linking Homebrew's libomp builds fine, but any process that also imports torch then has two OpenMP runtimes loaded — it either aborts at startup (OMP: Error #15) or segfaults inside cpu_vec_step's parallel regions. Falls back to Homebrew's libomp when torch isn't importable at build time.
Clear error message pointing at brew install libomp when omp.h is missing.
Linux flag behavior is unchanged.
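The two build.sh decisions above can be sketched as follows. build.sh itself is a shell script; this is only a Python rendering of the decision logic. The function names, the parameters, and the assumption that torch's copy sits at `<torch dir>/lib/libomp.dylib` are illustrative and are not taken from the PR diff.

```python
import os
import platform


def simd_flags(machine):
    """Return the AVX2/FMA flags only on x86_64; arm64 clang rejects them."""
    return ["-mavx2", "-mfma"] if machine == "x86_64" else []


def pick_libomp(torch_lib_dir, homebrew_lib_dir):
    """Prefer torch's bundled libomp.dylib so only one OpenMP runtime is loaded.

    torch_lib_dir is None when torch could not be imported at build time;
    in that case fall back to Homebrew's copy (hypothetical directory argument).
    """
    if torch_lib_dir is not None:
        bundled = os.path.join(torch_lib_dir, "libomp.dylib")
        if os.path.exists(bundled):
            return bundled
    return os.path.join(homebrew_lib_dir, "libomp.dylib")


# Flags for the machine this script runs on.
flags_for_this_machine = simd_flags(platform.machine())
```

Linking against the runtime that torch already loads is what avoids the duplicate-runtime abort described above; the fallback only matters for builds where torch is absent.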

pufferlib/torch_pufferl.py

Device selection: cuda → mps (when available) → cpu.
Move actions to host memory before cpu_step (the vecenv memcpys from the raw pointer; an MPS data_ptr() segfaults).
compute_puff_advantage: round-trip through CPU for non-CUDA accelerators, since _C.puff_advantage_cpu reads raw host pointers. (A native Metal kernel like the one in #422, "feature: add mps kernel for compute_puff_advantage", would avoid the copy; this keeps the diff minimal.)
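A minimal sketch of the device preference order and the host copy described above. The helper names (`pick_device`, `detect_device`, `to_host`) are hypothetical, and the sketch ignores the extra check on whether _C was built with CUDA; it is not the code in torch_pufferl.py.

```python
def pick_device(cuda_available, mps_available):
    """Preference order from the PR description: cuda, then mps, then cpu."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Query torch for what is available on this machine."""
    import torch  # imported lazily so pick_device stays usable without torch

    mps = getattr(torch.backends, "mps", None)
    mps_available = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_available)


def to_host(actions):
    """Copy an accelerator tensor to contiguous host memory.

    Code that memcpys from a raw pointer needs a host address; for a tensor
    that already lives on the CPU, .cpu() returns it without copying.
    """
    return actions.cpu().contiguous()
```

Calling `to_host(actions)` right before the C step means the pointer handed to the vecenv always refers to host memory, whichever device the policy runs on.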

pufferlib/pufferl.py

Auto-fall back to the torch backend when _C was built with --cpu (no create_pufferl), so puffer train <env> works on macOS without knowing about --slowly.
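The fallback can be pictured as a simple attribute check. This is a sketch under assumptions: `choose_backend` and its arguments are hypothetical, and a `SimpleNamespace` stands in for the compiled _C module.

```python
from types import SimpleNamespace


def choose_backend(native_module, slowly=False):
    """Use the torch backend when asked to, or when the native trainer is missing.

    A --cpu build of _C has no create_pufferl, so its absence is the signal.
    """
    if slowly or not hasattr(native_module, "create_pufferl"):
        return "torch"
    return "native"


# Stand-ins for a full build and a --cpu build of the extension module.
full_build = SimpleNamespace(create_pufferl=lambda *args: None)
cpu_build = SimpleNamespace()

backend_full = choose_backend(full_build)
backend_cpu = choose_backend(cpu_build)
```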

Results (M-series Mac, macOS 25.5 / Apple clang 21, torch 2.12.1)

Benchmark on breakout (default config, 4096 agents, 32.5K-param policy), steady-state over 3 epochs:

device	SPS
cpu	190K
mps	605K (3.2×)

Losses match CPU training qualitatively; checkpoints save/load fine.

Also ran a full build + one-epoch MPS training smoke test of every env in ocean/ on arm64: 37 of the 38 currently-buildable envs train on MPS (the 38th, squared_continuous, has no config file). Includes chess, craftax, drive, nmmo3, moba, and terraform. The 20 envs that don't build fail for reasons unrelated to this PR: 15 are broken on all platforms on current master (references to a missing ocean/env_binding.h, or binding.c missing its OBS_TENSOR_T define), scape fails on a missing rng member in src/vecenv.h, and the remaining four are craftax_classic (x86-only AVX-512), matsci (needs LAMMPS), and nethack/impulse_wars (external/unvendored deps).

The second commit fixes a torch-backend crash that predates this PR and affects all platforms: shipped configs contain sweep-produced float values (num_layers = 2.11327 in cartpole.ini), which the native backend truncates to C ints but the torch backend passed straight to nn.Linear. 17 of the 38 buildable envs crashed on this before the fix.
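To illustrate the mismatch, here is a minimal sketch of truncating a sweep-produced float the way a C int assignment would. The `as_int` helper and the config dictionary are hypothetical; only the num_layers value comes from the description above.

```python
def as_int(value):
    """Truncate toward zero, like assigning a float to a C int."""
    return int(float(value))


# Values as they might be read from an .ini file (strings).
config = {"num_layers": "2.11327", "hidden_size": "128"}

num_layers = as_int(config["num_layers"])    # 2
hidden_size = as_int(config["hidden_size"])  # 128

# Passing the raw float where an integer is required fails in Python;
# range() is used here as a stand-in for any integer-only argument.
try:
    range(float(config["num_layers"]))
except TypeError as err:
    message = str(err)
```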

Not addressed

The native CUDA training backend (out of scope per the discussion in #532, "Metal 4 backend for Apple Silicon").
craftax_classic: hand-written AVX-512 obs path (44 _mm512_* intrinsics, no scalar fallback) — x86-only regardless of this PR.
Envs currently broken on master for all platforms (missing ocean/env_binding.h, missing OBS_TENSOR_T defines).

- build.sh: gate -mavx2/-mfma on x86_64 (arm64 clang rejects them)
- build.sh: macOS OpenMP via -Xpreprocessor -fopenmp with Homebrew's omp.h, linked against torch's bundled libomp.dylib. Linking a second OpenMP runtime (e.g. Homebrew's) into a process that imports torch aborts at startup or segfaults in the vecenv's parallel regions.
- torch_pufferl: select mps when available and _C has no CUDA; move actions to host memory before cpu_step; round-trip advantage computation through CPU for non-CUDA accelerators
- pufferl: fall back to the torch backend automatically when _C was built with --cpu, so 'puffer train env' works without --slowly

Verified on M-series (breakout): 605K SPS on MPS vs 190K on CPU.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019AsyRcLQqeJondzSTM6xsn

…ckend

Shipped configs contain sweep-produced floats (e.g. num_layers = 2.11327 in cartpole.ini). The native backend truncates them on assignment to C ints; the torch backend passed them straight to nn.Linear and crashed with 'float object cannot be interpreted as an integer'.

Fixes the torch (--slowly) backend for 17 of the 38 currently-buildable ocean envs, on all platforms.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019AsyRcLQqeJondzSTM6xsn

oceanapplications · 2026-07-04T04:06:37Z

ocean env matrix on Apple Silicon (arm64, --cpu build, MPS training)

37/38 buildable envs train one full epoch on MPS. 20 build failures: 15 are broken on all platforms on current master (missing ocean/env_binding.h or OBS_TENSOR_T defines), scape fails on a missing rng member, 1 is x86-only SIMD (craftax_classic), and matsci/nethack/impulse_wars need external deps or unvendored headers.

env	build	MPS train	note
asteroids	FAIL	-	upstream: references missing ocean/env_binding.h (broken on all platforms)
battle	FAIL	-	upstream: references missing ocean/env_binding.h (broken on all platforms)
benchmark	FAIL	-	upstream: binding.c missing OBS_TENSOR_T define (broken on all platforms)
blastar	FAIL	-	upstream: binding.c missing OBS_TENSOR_T define (broken on all platforms)
boids	FAIL	-	upstream: references missing ocean/env_binding.h (broken on all platforms)
boxoban	OK	OK	OK steps=2097152 pg=-0.0587
breakout	OK	OK	OK steps=262144 pg=0.0379
cartpole	OK	OK	OK steps=131072 pg=0.0069
chain_mdp	FAIL	-	upstream: references missing ocean/env_binding.h (broken on all platforms)
checkers	FAIL	-	upstream: references missing ocean/env_binding.h (broken on all platforms)
chess	OK	OK	OK steps=524288 pg=-0.0118
connect4	OK	OK	OK steps=131072 pg=-0.0195
convert	SKIP	-	template/scaffolding, not a runnable env
convert_circle	SKIP	-	template/scaffolding, not a runnable env
craftax	OK	OK	OK steps=2097152 pg=0.0215
craftax_classic	FAIL	-	x86-only SIMD: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/21
dino	OK	OK	OK steps=262144 pg=0.0089
docking	OK	OK	OK steps=262144 pg=-0.0050
double_pendulum	OK	OK	OK steps=262144 pg=0.1214
drive	OK	OK	OK steps=1048576 pg=0.0236
drmario	OK	OK	OK steps=262144 pg=-0.0411
drone	OK	OK	OK steps=131072 pg=0.0004
enduro	OK	OK	OK steps=16384 pg=0.0000
freeway	OK	OK	OK steps=1048576 pg=0.0569
g2048	OK	OK	OK steps=524288 pg=0.0127
go	OK	OK	OK steps=32768 pg=-0.0005
hex	OK	OK	OK steps=65536 pg=-0.1148
impulse_wars	FAIL	-	ocean/impulse_wars/binding.c:1:10: fatal error: 'Python.h' file not found
laser_puzzle	OK	OK	OK steps=49152 pg=0.0015
lightsout	OK	OK	OK steps=262144 pg=-0.0082
matsci	FAIL	-	ocean/matsci/matsci.h:5:10: fatal error: 'lammps/library.h' file not found
maze	OK	OK	OK steps=131072 pg=0.1720
memory	FAIL	-	upstream: references missing ocean/env_binding.h (broken on all platforms)
minimal	OK	OK	OK steps=524288 pg=0.0675
moba	OK	OK	OK steps=131072 pg=-0.0294
nethack	FAIL	-	Building libnethack.so ...
nmmo3	OK	OK	OK steps=524288 pg=-0.0142
onestateworld	FAIL	-	upstream: references missing ocean/env_binding.h (broken on all platforms)
onlyfish	FAIL	-	upstream: references missing ocean/env_binding.h (broken on all platforms)
overcooked	OK	OK	OK steps=524288 pg=0.0123
pacman	OK	OK	OK steps=524288 pg=0.0122
pong	OK	OK	OK steps=32768 pg=-0.0601
robocode	OK	OK	OK steps=524288 pg=0.0236
rware	OK	OK	OK steps=262144 pg=0.0419
scape	FAIL	-	src/vecenv.h:364:24: error: no member named 'rng' in 'Scape'
shared_pool	FAIL	-	upstream: references missing ocean/env_binding.h (broken on all platforms)
slimevolley	OK	OK	OK steps=262144 pg=0.0084
snake	FAIL	-	upstream: binding.c missing OBS_TENSOR_T define (broken on all platforms)
squared	OK	OK	OK steps=262144 pg=-0.0104
squared_continuous	OK	SKIP	no config/squared_continuous.ini
tactical	FAIL	-	upstream: references missing ocean/env_binding.h (broken on all platforms)
target	OK	OK	OK steps=262144 pg=0.0537
template	SKIP	-	template/scaffolding, not a runnable env
terraform	OK	OK	OK steps=524288 pg=-0.0039
tetris	OK	OK	OK steps=524288 pg=0.0076
tmaze	FAIL	-	upstream: references missing ocean/env_binding.h (broken on all platforms)
tower_climb	OK	OK	OK steps=1048576 pg=-0.0376
trash_pickup	OK	OK	OK steps=65536 pg=0.1007
tripletriad	OK	OK	OK steps=65536 pg=-0.0078
whackamole	OK	OK	OK steps=262144 pg=0.0089
whisker_racer	FAIL	-	upstream: binding.c missing OBS_TENSOR_T define (broken on all platforms)

The MPS multinomial kernel can intermittently return indices outside [0, num_categories) (pytorch#136623, still unfixed upstream; the fix PR pytorch#170195 was closed unmerged).

In long MPS training runs this surfaced as an intermittent 'AcceleratorError: index N is out of bounds' raised at the next sync point (the .cpu() transfer in compute_puff_advantage), with N ~ 2x total_agents, because the bad index from the prioritized-replay multinomial feeds lazily-queued gathers (obs[idx]) and scatters (ratio[idx], val[idx]) that only validate at materialization.

Clamp multinomial output on MPS at both call sites: minibatch segment sampling and action sampling (where an out-of-range action would be memcpy'd into the C envs and corrupt memory silently instead of raising). Cost is one elementwise op; out-of-range draws are ~1e-5 rare.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019AsyRcLQqeJondzSTM6xsn
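A minimal sketch of the clamp described in this commit message, assuming a 1-D or 2-D probability tensor. The helper names are hypothetical and the real call sites in torch_pufferl.py may be structured differently.

```python
import torch


def clamp_indices(idx, num_categories):
    """Force sampled indices into the valid range [0, num_categories - 1]."""
    return idx.clamp(0, num_categories - 1)


def sample_indices(probs, num_samples):
    """Sample category indices, clamping only when the tensor lives on MPS."""
    idx = torch.multinomial(probs, num_samples, replacement=True)
    if idx.device.type == "mps":
        # One extra elementwise op; guards against out-of-range draws.
        idx = clamp_indices(idx, probs.shape[-1])
    return idx


# CPU demonstration of the clamp on hand-made out-of-range values.
clamped = clamp_indices(torch.tensor([-1, 0, 5, 2]), 3).tolist()
```

Clamping maps a bad draw onto the nearest valid category, which slightly biases that rare sample but keeps every later gather, scatter, and memcpy in bounds.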

Include paths (pybind11, numpy, sysconfig) and the torch libomp lookup used bare 'python', which fails with ModuleNotFoundError when the target venv is not the active interpreter.

Resolve once at the top: PYTHON=$path ./build.sh <env> --cpu now works from any shell.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019AsyRcLQqeJondzSTM6xsn
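For context, the include-path lookup that the resolved interpreter is asked to perform could look like the sketch below. The `include_dirs` helper is hypothetical; build.sh may query each path with separate one-liners instead.

```python
import importlib
import sysconfig


def include_dirs(optional_packages=("numpy", "pybind11")):
    """Collect header directories visible to the interpreter running this code."""
    dirs = [sysconfig.get_paths()["include"]]  # Python.h lives here
    for name in optional_packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # package not installed in this interpreter's environment
        dirs.append(module.get_include())
    return dirs
```

Because the result depends on which interpreter executes the lookup, running it with the venv's interpreter rather than whatever `python` resolves to on PATH is what makes the include paths match the target environment.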

oceanapplications · 2026-07-05T16:18:51Z

I used this branch to train this 6-link inverted pendulum on a MacBook, so it has been tested in real-world use over hundreds of billions of steps.

demo_250th.trimmed.mp4

oceanapplications and others added 2 commits July 4, 2026 11:42

oceanapplications and others added 2 commits July 4, 2026 15:06

oceanapplications force-pushed the macos-apple-silicon-mps branch from f506345 to 62c1220 Compare July 4, 2026 07:09

oceanapplications wants to merge 4 commits into PufferAI:4.0 from oceanapplications:macos-apple-silicon-mps
