Apple Silicon support: arm64 build fixes + MPS device for torch backend #609

Open
oceanapplications wants to merge 4 commits into PufferAI:4.0 from oceanapplications:macos-apple-silicon-mps

@oceanapplications


Apple Silicon support: arm64 build fixes + MPS device for the torch backend

Makes ./build.sh <env> --cpu build on Apple Silicon and the PyTorch (--slowly) backend train on the M-series GPU via MPS. Related: #590, #507, #422, #532.

Changes

build.sh

  • Gate -mavx2 -mfma on x86_64 (arm64 clang rejects them; same idea as #590, "Conditionally apply AVX2/FMA flags by architecture (fix Apple Silicon build)").
  • macOS OpenMP: compile with -Xpreprocessor -fopenmp + Homebrew's omp.h, but link against torch's bundled libomp.dylib. This is the subtle one: linking Homebrew's libomp builds fine, but any process that also imports torch then has two OpenMP runtimes loaded — it either aborts at startup (OMP: Error #15) or segfaults inside cpu_vec_step's parallel regions. Falls back to Homebrew's libomp when torch isn't importable at build time.
  • Clear error message pointing at brew install libomp when omp.h is missing.
  • Linux flag behavior is unchanged.
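
For illustration, the two build.sh decisions above (arch-gated SIMD flags and preferring torch's OpenMP runtime) can be sketched in Python. This is not the script itself: `simd_flags` and `pick_libomp` are hypothetical names, and the default path assumes Homebrew's keg-only libomp under `/opt/homebrew/opt/libomp` (the usual Apple Silicon prefix) and a `libomp.dylib` inside torch's `lib` directory.

```python
import os
import platform


def simd_flags(machine: str) -> list:
    """Return the x86-only SIMD flags; arm64 clang rejects -mavx2/-mfma."""
    return ["-mavx2", "-mfma"] if machine == "x86_64" else []


def pick_libomp(torch_lib_dir, homebrew_lib_dir="/opt/homebrew/opt/libomp/lib"):
    """Prefer torch's bundled OpenMP runtime so only one runtime gets loaded.

    torch_lib_dir is None when torch is not importable at build time.
    Returns the chosen libomp.dylib path, or None if neither candidate exists.
    """
    candidates = []
    if torch_lib_dir is not None:
        candidates.append(os.path.join(torch_lib_dir, "libomp.dylib"))
    # Fallback (assumed location): Homebrew's copy.
    candidates.append(os.path.join(homebrew_lib_dir, "libomp.dylib"))
    for path in candidates:
        if os.path.exists(path):
            return path
    return None


# Flags for the machine running this sketch.
print(simd_flags(platform.machine()))
```

Returning `None` when neither library exists leaves room for the caller to print the "brew install libomp" hint instead of failing at link time.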

pufferlib/torch_pufferl.py

  • Device selection: cuda → mps (when available) → cpu.
  • Move actions to host memory before cpu_step (the vecenv memcpys from the raw pointer; an MPS data_ptr() segfaults).
  • compute_puff_advantage: round-trip through CPU for non-CUDA accelerators, since _C.puff_advantage_cpu reads raw host pointers. (A native Metal kernel like #422, "feature: add mps kernel for compute_puff_advantage", would avoid the copy; this keeps the diff minimal.)
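
A minimal sketch of the device preference and the host round-trip described above. `pick_device` and `call_on_host` are hypothetical helpers, not the functions in torch_pufferl.py, and the sketch assumes the host routine returns its result rather than writing in place. The tensor arguments are duck-typed, so any object with `device`, `detach()`, `cpu()`, `contiguous()` and `to()` works.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Preference order from the description: cuda, then mps, then cpu."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def call_on_host(host_fn, *tensors):
    """Run a host-only function on tensors that may live on an accelerator.

    Each tensor is copied to contiguous CPU memory first, so host_fn can
    safely read raw pointers; the result is moved back to the device of
    the first argument.
    """
    device = tensors[0].device
    host_args = [t.detach().cpu().contiguous() for t in tensors]
    result = host_fn(*host_args)
    return result.to(device)
```

With PyTorch, the two flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`. For tensors already on the CPU the `.cpu()` and `.to(device)` calls are no-ops, so the wrapper only costs a copy on accelerators.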

pufferlib/pufferl.py

  • Auto-fall back to the torch backend when _C was built with --cpu (no create_pufferl), so puffer train <env> works on macOS without knowing about --slowly.
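
The fallback can be pictured as a capability check on the compiled extension. The sketch below is an assumption about the shape of that check, not the code in pufferl.py: `choose_backend` is a hypothetical name, and `SimpleNamespace` objects stand in for `_C`.

```python
from types import SimpleNamespace


def choose_backend(native_module) -> str:
    """Use the native trainer only if the extension exposes create_pufferl."""
    return "native" if hasattr(native_module, "create_pufferl") else "torch"


# Stand-ins for the compiled extension (hypothetical, for illustration only).
cpu_only_build = SimpleNamespace()
full_build = SimpleNamespace(create_pufferl=lambda *args, **kwargs: None)

print(choose_backend(cpu_only_build))  # torch
print(choose_backend(full_build))      # native
```

Probing for the symbol rather than the platform keeps the check correct for any build made with --cpu, not only macOS ones.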

Results (M-series Mac, macOS 25.5 / Apple clang 21, torch 2.12.1)

Benchmark on breakout (default config, 4096 agents, 32.5K-param policy), steady-state over 3 epochs:

| device | SPS |
|--------|-----|
| cpu | 190K |
| mps | 605K (3.2×) |

Losses match CPU training qualitatively; checkpoints save/load fine.

Also ran a full build + one-epoch MPS training smoke test of every env in ocean/ on arm64: 37 of the 38 currently-buildable envs train on MPS (the 38th, squared_continuous, has no config file). Includes chess, craftax, drive, nmmo3, moba, and terraform. The 20 envs that don't build fail for reasons unrelated to this PR: 15 are broken on all platforms on current master (references to a missing ocean/env_binding.h, or binding.c missing its OBS_TENSOR_T define), scape fails on a missing rng member in src/vecenv.h, plus craftax_classic (x86-only AVX-512), matsci (needs LAMMPS), and nethack/impulse_wars (external/unvendored deps).

The second commit fixes a torch-backend crash that predates this PR and affects all platforms: shipped configs contain sweep-produced float values (num_layers = 2.11327 in cartpole.ini), which the native backend truncates to C ints but the torch backend passed straight to nn.Linear. 17 of the 38 buildable envs crashed on this before the fix.
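
A minimal sketch of that kind of coercion, assuming the config has already been parsed into a dict. `coerce_int_fields` and `INT_KEYS` are hypothetical names, and apart from `num_layers = 2.11327` the values are made up for illustration.

```python
def coerce_int_fields(config: dict, int_keys: set) -> dict:
    """Return a copy of config with the named float fields truncated to int.

    int() truncates toward zero, matching a C float-to-int conversion.
    """
    return {
        key: int(value) if key in int_keys and isinstance(value, float) else value
        for key, value in config.items()
    }


# Hypothetical set of integer-valued fields; the real set depends on the policy.
INT_KEYS = {"num_layers", "hidden_size"}

raw = {"num_layers": 2.11327, "hidden_size": 128.0, "learning_rate": 0.015}
clean = coerce_int_fields(raw, INT_KEYS)
print(clean)  # {'num_layers': 2, 'hidden_size': 128, 'learning_rate': 0.015}
```

Truncating rather than rounding keeps the torch backend consistent with what the native backend does with the same config file.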

Not addressed

  • The native CUDA training backend (out of scope per the discussion in #532, "Metal 4 backend for Apple Silicon").
  • craftax_classic: hand-written AVX-512 obs path (44 _mm512_* intrinsics, no scalar fallback) — x86-only regardless of this PR.
  • Envs currently broken on master for all platforms (missing ocean/env_binding.h, missing OBS_TENSOR_T defines).

oceanapplications and others added 2 commits July 4, 2026 11:42
- build.sh: gate -mavx2/-mfma on x86_64 (arm64 clang rejects them)
- build.sh: macOS OpenMP via -Xpreprocessor -fopenmp with Homebrew's omp.h,
  linked against torch's bundled libomp.dylib. Linking a second OpenMP
  runtime (e.g. Homebrew's) into a process that imports torch aborts at
  startup or segfaults in the vecenv's parallel regions.
- torch_pufferl: select mps when available and _C has no CUDA; move
  actions to host memory before cpu_step; round-trip advantage
  computation through CPU for non-CUDA accelerators
- pufferl: fall back to the torch backend automatically when _C was
  built with --cpu, so 'puffer train env' works without --slowly

Verified on M-series (breakout): 605K SPS on MPS vs 190K on CPU.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019AsyRcLQqeJondzSTM6xsn
…ckend

Shipped configs contain sweep-produced floats (e.g. num_layers = 2.11327
in cartpole.ini). The native backend truncates them on assignment to C
ints; the torch backend passed them straight to nn.Linear and crashed
with 'float object cannot be interpreted as an integer'. Fixes the torch
(--slowly) backend for 17 of the 38 currently-buildable ocean envs, on
all platforms.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019AsyRcLQqeJondzSTM6xsn
@oceanapplications


ocean env matrix on Apple Silicon (arm64, --cpu build, MPS training)

37/38 buildable envs train one full epoch on MPS. 20 build failures: 15 are broken on all platforms on current master (missing ocean/env_binding.h or OBS_TENSOR_T defines), scape fails on a missing rng member in src/vecenv.h, 1 is x86-only SIMD (craftax_classic), and matsci/nethack/impulse_wars need external deps or unvendored headers.

| env | build | MPS train | note |
|-----|-------|-----------|------|
| asteroids | FAIL | - | upstream: references missing ocean/env_binding.h (broken on all platforms) |
| battle | FAIL | - | upstream: references missing ocean/env_binding.h (broken on all platforms) |
| benchmark | FAIL | - | upstream: binding.c missing OBS_TENSOR_T define (broken on all platforms) |
| blastar | FAIL | - | upstream: binding.c missing OBS_TENSOR_T define (broken on all platforms) |
| boids | FAIL | - | upstream: references missing ocean/env_binding.h (broken on all platforms) |
| boxoban | OK | OK | OK steps=2097152 pg=-0.0587 |
| breakout | OK | OK | OK steps=262144 pg=0.0379 |
| cartpole | OK | OK | OK steps=131072 pg=0.0069 |
| chain_mdp | FAIL | - | upstream: references missing ocean/env_binding.h (broken on all platforms) |
| checkers | FAIL | - | upstream: references missing ocean/env_binding.h (broken on all platforms) |
| chess | OK | OK | OK steps=524288 pg=-0.0118 |
| connect4 | OK | OK | OK steps=131072 pg=-0.0195 |
| convert | SKIP | - | template/scaffolding, not a runnable env |
| convert_circle | SKIP | - | template/scaffolding, not a runnable env |
| craftax | OK | OK | OK steps=2097152 pg=0.0215 |
| craftax_classic | FAIL | - | x86-only SIMD: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/21 |
| dino | OK | OK | OK steps=262144 pg=0.0089 |
| docking | OK | OK | OK steps=262144 pg=-0.0050 |
| double_pendulum | OK | OK | OK steps=262144 pg=0.1214 |
| drive | OK | OK | OK steps=1048576 pg=0.0236 |
| drmario | OK | OK | OK steps=262144 pg=-0.0411 |
| drone | OK | OK | OK steps=131072 pg=0.0004 |
| enduro | OK | OK | OK steps=16384 pg=0.0000 |
| freeway | OK | OK | OK steps=1048576 pg=0.0569 |
| g2048 | OK | OK | OK steps=524288 pg=0.0127 |
| go | OK | OK | OK steps=32768 pg=-0.0005 |
| hex | OK | OK | OK steps=65536 pg=-0.1148 |
| impulse_wars | FAIL | - | ocean/impulse_wars/binding.c:1:10: fatal error: 'Python.h' file not found |
| laser_puzzle | OK | OK | OK steps=49152 pg=0.0015 |
| lightsout | OK | OK | OK steps=262144 pg=-0.0082 |
| matsci | FAIL | - | ocean/matsci/matsci.h:5:10: fatal error: 'lammps/library.h' file not found |
| maze | OK | OK | OK steps=131072 pg=0.1720 |
| memory | FAIL | - | upstream: references missing ocean/env_binding.h (broken on all platforms) |
| minimal | OK | OK | OK steps=524288 pg=0.0675 |
| moba | OK | OK | OK steps=131072 pg=-0.0294 |
| nethack | FAIL | - | Building libnethack.so ... |
| nmmo3 | OK | OK | OK steps=524288 pg=-0.0142 |
| onestateworld | FAIL | - | upstream: references missing ocean/env_binding.h (broken on all platforms) |
| onlyfish | FAIL | - | upstream: references missing ocean/env_binding.h (broken on all platforms) |
| overcooked | OK | OK | OK steps=524288 pg=0.0123 |
| pacman | OK | OK | OK steps=524288 pg=0.0122 |
| pong | OK | OK | OK steps=32768 pg=-0.0601 |
| robocode | OK | OK | OK steps=524288 pg=0.0236 |
| rware | OK | OK | OK steps=262144 pg=0.0419 |
| scape | FAIL | - | src/vecenv.h:364:24: error: no member named 'rng' in 'Scape' |
| shared_pool | FAIL | - | upstream: references missing ocean/env_binding.h (broken on all platforms) |
| slimevolley | OK | OK | OK steps=262144 pg=0.0084 |
| snake | FAIL | - | upstream: binding.c missing OBS_TENSOR_T define (broken on all platforms) |
| squared | OK | OK | OK steps=262144 pg=-0.0104 |
| squared_continuous | OK | SKIP | no config/squared_continuous.ini |
| tactical | FAIL | - | upstream: references missing ocean/env_binding.h (broken on all platforms) |
| target | OK | OK | OK steps=262144 pg=0.0537 |
| template | SKIP | - | template/scaffolding, not a runnable env |
| terraform | OK | OK | OK steps=524288 pg=-0.0039 |
| tetris | OK | OK | OK steps=524288 pg=0.0076 |
| tmaze | FAIL | - | upstream: references missing ocean/env_binding.h (broken on all platforms) |
| tower_climb | OK | OK | OK steps=1048576 pg=-0.0376 |
| trash_pickup | OK | OK | OK steps=65536 pg=0.1007 |
| tripletriad | OK | OK | OK steps=65536 pg=-0.0078 |
| whackamole | OK | OK | OK steps=262144 pg=0.0089 |
| whisker_racer | FAIL | - | upstream: binding.c missing OBS_TENSOR_T define (broken on all platforms) |

oceanapplications and others added 2 commits July 4, 2026 15:06
The MPS multinomial kernel can intermittently return indices outside
[0, num_categories) (pytorch#136623, still unfixed upstream; the fix PR
pytorch#170195 was closed unmerged). In long MPS training runs this
surfaced as an intermittent 'AcceleratorError: index N is out of bounds'
raised at the next sync point — the .cpu() transfer in
compute_puff_advantage — with N ~ 2x total_agents, because the bad index
from the prioritized-replay multinomial feeds lazily-queued gathers
(obs[idx]) and scatters (ratio[idx], val[idx]) that only validate at
materialization.

Clamp multinomial output on MPS at both call sites: minibatch segment
sampling and action sampling (where an out-of-range action would be
memcpy'd into the C envs and corrupt memory silently instead of raising).
Cost is one elementwise op; out-of-range draws are ~1e-5 rare.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019AsyRcLQqeJondzSTM6xsn
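
The clamp described in the commit message above could look roughly like this. `clamp_on_mps` is a hypothetical helper rather than the code in the commit; it is duck-typed and relies only on the tensor's `device.type` attribute and `clamp(min=..., max=...)` method.

```python
def clamp_on_mps(indices, num_categories: int):
    """Clamp sampled indices into [0, num_categories) on MPS tensors only.

    Tensors on other devices are returned untouched, so the extra
    elementwise op is paid only where out-of-range draws were observed.
    """
    if indices.device.type == "mps":
        return indices.clamp(min=0, max=num_categories - 1)
    return indices
```

In a training loop it would wrap each draw, for example `idx = clamp_on_mps(torch.multinomial(probs, n), probs.shape[-1])`. Clamping maps a bad draw onto a valid category instead of resampling, which slightly biases the rare affected sample but avoids a device sync.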
Include paths (pybind11, numpy, sysconfig) and the torch libomp lookup
used bare 'python', which fails with ModuleNotFoundError when the target
venv is not the active interpreter. Resolve once at the top:
PYTHON=$path ./build.sh <env> --cpu now works from any shell.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019AsyRcLQqeJondzSTM6xsn
oceanapplications force-pushed the macos-apple-silicon-mps branch from f506345 to 62c1220 on July 4, 2026 07:09
@oceanapplications


I used this branch to train this 6-link inverted pendulum on a MacBook, so it has been real-world tested over hundreds of billions of steps.

demo_250th.trimmed.mp4
