zarrs bindings [do not merge]#4064
Conversation
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Create zarrs-bindings/ with Cargo.toml, pyproject.toml (maturin build backend), and src/lib.rs (exceptions + version function). Wire into the root pyproject.toml via a new `zarrs` dependency group and [tool.uv.sources]. Add zarrs-bindings/target/ to .gitignore. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add StoreShim, a synchronous adapter over zarr's async Store ABC, and resolve_store, which maps a Store to either a native config dict (for LocalStore) or a StoreShim for Rust to call back into. Also convert the store fixture in tests/zarrs/conftest.py to an async generator with teardown so stores are properly closed after each test. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adds `create_array` and `read_metadata` pyfunctions to the zarrs-bindings crate, and exposes them as `create_new_array`, `create_overwrite_array`, and `read_metadata` in the `zarr.zarrs` subpackage. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add `delete_node` and `list_children` to the zarrs-bindings Rust crate and the `zarr.zarrs` Python subpackage. `delete_node` erases the node prefix via `erase_prefix`, raising `NodeNotFoundError` when the node is absent. `list_children` opens the target as a `Group` and returns direct children as `(path, metadata_document)` pairs. Both are covered by 4 new tests (× 2 stores = 8 parametrized cases); total suite: 34 passed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Add --ignore=src/zarr/zarrs to pytest addopts so --doctest-modules doesn't attempt to import the bindings module when zarrs-bindings is not installed, preventing a collection ERROR in jobs that don't use the zarrs dependency group. - Add dtolnay/rust-toolchain step (SHA-pinned, stable) to zarrs.yml CI so the build is not reliant on whatever Rust version the runner image ships; ensures rust-version = "1.91" in the crate is satisfied. - Fix spec: abi3-py311 -> abi3-py312 to match zarr's requires-python >=3.12. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Move the pytest.importorskip("_zarrs_bindings") guard from conftest.py
(module-level) to the top of each test module. When the bindings are
absent, xdist workers would previously raise Skipped while importing the
conftest, causing "Different tests were collected between gw0 and gwX"
failures. Per-module guards are the standard xdist-safe pattern.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add `retrieve_array_subset` Rust binding and `decode_region` Python API function. Selection normaliser maps integers/slices/Ellipsis to a step-1 bounding box fetched in one zarrs call; strides, reversals, and integer-axis removal are applied as numpy views on the result. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Add docstring note to decode_region warning that zarrs fetches the step-1 bounding box, so strided selections read O(span) bytes. - Remove dead isinstance(sel, EllipsisType) raise in _normalize_selection (Ellipsis is expanded to slice(None) before the per-dimension loop); replace with an assert to preserve mypy type narrowing. - Guard non-integral float shape elements in _array_shape so shape=[1.5] raises TypeError instead of silently truncating to 1. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add a process-wide LRU cache (capacity 128) in chunk.rs keyed on (filesystem root, node path, metadata JSON) that memoises the result of Array::new_with_metadata for the native FilesystemStore path. Generic Python-backed stores (MemoryStore, ZipStore, custom) are not cached. The cache key encodes root + path + metadata so an Array is never reused for a different store or different codec chain; chunk data continues to flow through the store on every call, so the cache cannot return stale data. Two test-hook pyfunctions (array_cache_len / clear_array_cache) are exposed on _zarrs_bindings; five correctness tests in tests/zarrs/test_cache.py cover population, non-caching of MemoryStore, distinct-metadata entries, root-keying, and write visibility. All 117 tests pass; cargo clippy -D warnings clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace all four `array_cache().lock().unwrap()` call sites with a `lock_cache()` helper that uses `.unwrap_or_else(|e| e.into_inner())`. If a thread panics while holding the mutex the lock is now recovered (worst case: a stale cache entry) rather than poisoning every subsequent lock call and wedging all array I/O permanently. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
read_chunk stays parameter-free (chunk-grid addressed, whole chunk); all selection-based reads route through read_region/read_subset (array-coordinate, spans chunks). Drops the deferred chunk-subset selection parameter. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…d method Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- wrap the lazy `import zarr.zarrs` in `get_backend` with a try/except that raises a helpful ImportError when the zarrs-bindings extension is not installed - add a thread-safety comment above the `_BACKENDS` dict explaining that CPython's import lock + GIL make it safe without additional locking - document the `runtime_checkable` limitation in the `CrudBackend` docstring: isinstance only checks method names, not signatures or async-ness; mypy is the authoritative conformance check Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Implements `ReferenceBackend` in `src/zarr/crud/_reference.py`, registers it as the default `"reference"` backend at `zarr.crud` import time. Drives chunk encode/decode via `create_codec_pipeline` + the abstract `CodecPipeline.encode`/`decode` methods (not `encode_batch`/`decode_batch`, which live only on `BatchedCodecPipeline` and are not in the abstract Protocol), and multi-chunk subset reads via `AsyncArray.getitem`. Fixes two previously-failing registry tests; adds four new backend tests covering round-trip chunk I/O, multi-chunk subset reads, duplicate-create error, and missing-metadata error. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
`_array_spec` was hardcoding `ArrayConfig.from_dict({})`, which always
defaults to C order, causing silent data corruption for v2 arrays created
with `order="F"`. `read_chunk` returned transposed bytes and `write_chunk`
stored bytes that zarr-python read back incorrectly.
Fix: detect `ArrayV2Metadata` and pass its `.order` attribute into the
`ArrayConfig`, so the codec pipeline uses the correct memory layout.
v3 arrays are unaffected (order there is a transpose codec, not memory
order).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adds the public `zarr.crud` facade dispatching to a CrudBackend, plus the differential test suite exercising it against the reference backend. The zarrs param in the backend fixture is skipped until Task 4 registers the zarrs backend. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replaces the old flat-function API in `zarr.zarrs._api` with `ZarrsBackend` (a `CrudBackend` implementation) in `zarr.zarrs._backend`. Importing `zarr.zarrs` now registers the backend under the key `"zarrs"` via `zarr.crud.register_backend`, enabling `backend="zarrs"` in all crud facade functions and activating the previously-skipped zarrs params in `tests/crud/test_crud.py`. Old tests that exercised the removed flat API are deleted; cache tests are migrated to use the `zarr.crud` facade. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…drops - Add ZGROUP_JSON + a _node_exists() helper probing all three meta-key variants (zarr.json / .zarray / .zgroup) so v2 groups are visible to every CRUD operation; rewrite _create/read_metadata/delete_node/list_children to use it; remove dead _meta_key(). - Add _is_all_fill_value() and drop all-fill chunks in write_chunk to match zarrs's sparse-storage convention. - Add differential tests for v2 groups, v2 array read_metadata, and all-fill chunk writes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ment it Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Matches the existing packages/zarr-metadata subpackage layout. Updates the uv source path, sdist exclude, gitignore, and design doc accordingly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
@clbarnes related to your interests |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4064 +/- ##
==========================================
- Coverage 93.52% 91.09% -2.43%
==========================================
Files 90 99 +9
Lines 11926 12337 +411
==========================================
+ Hits 11154 11239 +85
- Misses 772 1098 +326
🚀 New features to boost your workflow:
|
| async def list_children(store, path) -> list[tuple[str, dict]] # (path, metadata) | ||
|
|
||
| # chunk-level I/O | ||
| async def decode_chunk(metadata, store, path, chunk_coords, *, selection=None) -> np.ndarray |
There was a problem hiding this comment.
Returning an ndarray will force a copy out of rust memory. Longer term, I'd perhaps suggest instead making a wrapper class on the Rust side that will expose an ndarray. At least whenever the data type is supported by the buffer protocol that should remove a memory copy. See for example async_tiff.Array
There was a problem hiding this comment.
we have very adjacent prior art in https://github.com/zarrs/zarrs-python, I should see what they do!
There was a problem hiding this comment.
You can instantiate numpy arrays on the rust side, either from a vec directly or with an ndarray: https://docs.rs/numpy/latest/numpy/convert/trait.IntoPyArray.html#tymethod.into_pyarray
I imagine it would get more complicated when custom buffer prototypes are passed, because they need to be instantiated on the python side.
There was a problem hiding this comment.
I am kind of hoping to put the custom buffer prototype thing in the rear view window. It has not worked well.
There was a problem hiding this comment.
I am open to all ideas here, but "every routine that allocates memory takes a python callable" is probably not going to work for cross-language interop. Another complication is that "contiguous stream of bytes" is a lower-level abstraction than "N-dimensional array".
My rough feeling is that we should partition the chunk encoding / decoding routines by the memory access pattern needed for the array side of the array -> bytes transformation. For dense arrays, rust can hand back a contiguous buffer, or write into a pre-allocated buffer. For sparse arrays, we will need to settle on an memory layout, and we probably can't support writing into a pre-allocated sparse array.
| 3. **Anything else** → generic `PyStore`: a Rust struct implementing | ||
| `ReadableStorageTraits` / `WritableStorageTraits` / | ||
| `ListableStorageTraits` over a Python callback object. |
There was a problem hiding this comment.
Nit: as you might expect from our zulip discussion I'm hesitant to lump all of these together. I'm interested in spending a little time with this thinking about how to represent these traits through a Python API
| The callback path: the async API function wraps the user's `Store` in a small | ||
| sync Python shim whose methods submit coroutines to zarr-python's existing | ||
| sync event-loop thread (`zarr.core.sync`, | ||
| `asyncio.run_coroutine_threadsafe(...)` + blocking result). Rust calls the | ||
| shim while holding no locks of its own. This makes any conformant `Store` | ||
| (Memory, Zip, Logging, Wrapper, user-defined) work without Rust knowing its | ||
| type. Deadlock safety relies on the existing invariant that code running on | ||
| the zarr sync loop never blocks on these Rust entry points. |
There was a problem hiding this comment.
You know I don't love this 😅. I'd like to specialize Sync and Async stores on the rust side.
There was a problem hiding this comment.
haha I'm not a fan either! ideal scenario is that a store connection has a declarative form that rust can use to reconstruct it (like a URL). Things tied to python memory will remain hard.
There was a problem hiding this comment.
Fwiw that's how obstore works when passed into a pyo3 binding. It recreates the store on the rust side so no requests need to go through Python
| µs of actual chunk I/O on a warm filesystem. To amortize it across the common | ||
| "open one array, then do many chunk operations" pattern, the chunk/region | ||
| routines memoize the constructed `Array` in a process-global LRU cache | ||
| (capacity 128) keyed on `(filesystem root, node path, metadata JSON)`. |
There was a problem hiding this comment.
I think it's pretty messy to do this in a global LRU cache. There's no reason to have a public LRU cache when you could just cache the metadata in... a class instance 😄
There was a problem hiding this comment.
in... a class instance 😄
storing the metadata in a class instance isn't the same as globally caching it. The point here is to avoid performing the same computation that is keyed by exactly the metadata document. Even if we were using OOP, I think it would make sense for the procedure that takes metadata and emits chunk encoding / decoding machinery to use a cache.
There was a problem hiding this comment.
The point here is to avoid performing the same computation that is keyed by exactly the metadata document
You're keying on filesystem - node path - metadata JSON... how would that key ever be shared across more than one Array? Seems like an Array class is a logical place to put that metadata
There was a problem hiding this comment.
it's possible that we aren't disagreeing! I want a functional API because I think that's actually a very natural way to interact with zarr data -- for a given operation, you specify the metadata document you want to use, the location, and any other parameters.
It also definitely makes sense to provide users with objects that persistently bind the metadata and the storage backend. That's how zarr-python works today. But if you start with the object oriented API, it can be hard to add the functional API later. I think the functional API is actually really important for doing a lot of cool things with Zarr! so i want to see if we can bake it in foundationally, and add the object-oriented API on top.
| zarr-python already contains everything a pure-Python backend needs: | ||
| `BatchedCodecPipeline` (`src/zarr/core/codec_pipeline.py`), `BasicIndexer` | ||
| (`src/zarr/core/indexing.py`), `save_metadata` (`src/zarr/core/metadata/io.py`), | ||
| metadata parsing (`ArrayV3Metadata.from_dict` / `ArrayV2Metadata.from_dict`), | ||
| and chunk-key encoding (`src/zarr/core/chunk_key_encodings.py`). |
There was a problem hiding this comment.
I need to learn more about codec pipelines to understand how they integrate with zarrs
| but carries only defaults (fields become meaningful in Phase 3). | ||
|
|
||
| ```python | ||
| # node lifecycle |
There was a problem hiding this comment.
I don't like the functional API because it means that you're reparsing metadata that makes sense to just store in a class instance
vibe-coded zarrs bindings. I am not done with this, but I figured it's in a good state for signposting / discussion.
strategy
I wanted to keep the contact surface between zarr-python and zarrs minimal and low-state, so I defined a functional crud API that expresses the core of zarr IO. (that API is not wired up to the top-level
zarr.Arrayclass!). The idea is that we can express the stuff users want to do to their data -- create new arrays / groups, write chunks , read chunks, asf(metadata, storage, *parameters).The crud API supports multiple registered backends, e.g. a default python backend (based on repurposing our existing routines) and the rust backend, when zarrs is available.
The statelessness of the functional API is also a downside if you call
read_chunk(metadata, store, ...)repeatedly, because the rust code will re-construct the same chunk decoding machinery each time. I address this with an LRU cache on the rust side. I think "metadata + store + options" is a good cache key but we need to discuss this design further.An alternative strategy would be to write a zarrs-based expression for the many methods defined on the
ArrayandAsyncArrayclasses, while ensuring that we avoid crossing the FFI boundary excessively. I avoided this because I imagined it would require covering a huge code surface area and raise tough questions about whether python or rust was owning the life cycle of the object. If people really want a full rust-backedArrayclass, we can explore that direction.caveats:
performance
the zarrs backend is faster! here's a benchmark script you can run yourself. It requires the rust toolchain for building the bindings.
I'm seeing ~15x throughput improvement, looks good.
impact
these changes require internal changes in the
zarrpackage, as well as a new subpackage for the rust bindings. it adds the rust toolchain to the developer dependencies of the project. it exposes us to changes in thezarrspackage, which is outside thezarr-developersorg. We definitely need a design plan to limit complexity if we want to pursue this further.