
feat(xenium): import onboard secondary analysis (clustering/PCA/UMAP/diffexp) into the table#405

Open
Tomatokeftes wants to merge 3 commits into scverse:main from Tomatokeftes:feat/xenium-cells-analysis

Conversation

@Tomatokeftes


Motivation

xenium() currently imports only the raw outputs (boundaries, transcripts, images, cell-feature matrix). The Xenium Onboard Analysis secondary analysis under analysis/ — graph-based + k-means clustering, PCA, UMAP, and differential expression — is dropped, so the 10x-computed clusters/embeddings have to be recomputed downstream even though they ship with every standard run. I couldn't find an existing issue/PR covering this (the nearby #385 is about Xenium Explorer selection GeoJSON, a different artifact).

What this does

Adds a cells_analysis: bool = True option to xenium() that, when the analysis/ folder is present, enriches the cell table:

| Source | Target |
| --- | --- |
| `analysis/clustering/<name>/clusters.csv` | one categorical column per clustering in `table.obs` (named `<name>`, e.g. `gene_expression_graphclust`) |
| `analysis/pca/<name>/projection.csv` | `table.obsm["X_pca"]` |
| `analysis/umap/<name>/projection.csv` | `table.obsm["X_umap"]` |
| `analysis/diffexp/<name>/differential_expression.csv` | `table.uns["diffexp"][<name>]` |
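
To make the mapping above concrete, here is a minimal sketch of how the `analysis/` folder could be scanned. It is not the PR's implementation: `ANALYSIS_LAYOUT` and `discover_analysis_files` are hypothetical names, and the folder names in the demo are only illustrative.

```python
from pathlib import Path
import tempfile

# Hypothetical mapping from analysis sub-folder to (file name, target slot).
# It mirrors the Source/Target table above, not the PR's actual constants.
ANALYSIS_LAYOUT = {
    "clustering": ("clusters.csv", "obs"),
    "pca": ("projection.csv", "obsm['X_pca']"),
    "umap": ("projection.csv", "obsm['X_umap']"),
    "diffexp": ("differential_expression.csv", "uns['diffexp']"),
}


def discover_analysis_files(xenium_dir):
    """Return a list of (kind, name, path, target) tuples found under analysis/."""
    analysis_dir = Path(xenium_dir) / "analysis"
    found = []
    if not analysis_dir.is_dir():
        return found  # missing folder: nothing to import (no-op)
    for kind, (file_name, target) in ANALYSIS_LAYOUT.items():
        kind_dir = analysis_dir / kind
        if not kind_dir.is_dir():
            continue
        # One sub-folder per result, e.g. one per clustering.
        for result_dir in sorted(p for p in kind_dir.iterdir() if p.is_dir()):
            csv_path = result_dir / file_name
            if csv_path.is_file():
                found.append((kind, result_dir.name, csv_path, target))
    return found


# Build a tiny fake output folder to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in [
        "analysis/clustering/gene_expression_graphclust/clusters.csv",
        "analysis/umap/gene_expression_2_components/projection.csv",
    ]:
        path = root / rel
        path.parent.mkdir(parents=True)
        path.write_text("Barcode\n")
    results = [(kind, name, target) for kind, name, _, target in discover_analysis_files(root)]
    empty = discover_analysis_files(root / "does_not_exist")
```

Returning an empty list for a missing folder keeps the caller free of special cases, which matches the no-op behaviour described for re-segmented or matrix-only exports.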

Design notes:

  • Everything is joined to the cells by cell_id (the CSV Barcode column == table.obs_names), so it stays aligned with the shapes/table index regardless of row order.
  • Cells absent from a given result (e.g. filtered out by QC before clustering) get a missing value (NaN category / NaN obsm row) rather than being dropped, so n_obs is unchanged.
  • Cluster ids are stored as string categories ("1", "2", …), idiomatic for scanpy/squidpy plotting.
  • A missing analysis/ folder (re-segmented data, matrix-only exports) is a no-op.
  • Opt-out via cells_analysis=False; requires cells_table=True.
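
The join-by-barcode and missing-value behaviour in the notes above can be sketched with plain pandas. This is an illustration under assumptions, not the code in the PR: the `Cluster` column name, the toy barcodes, and the helper `cluster_column` are all hypothetical, and a bare DataFrame stands in for `table.obs`.

```python
import pandas as pd

# Toy stand-in for table.obs, indexed by cell_id (assumed equal to obs_names).
obs = pd.DataFrame(index=pd.Index(["aaaa-1", "bbbb-1", "cccc-1", "dddd-1"], name="cell_id"))

# Toy clusters.csv content: rows are scrambled and "cccc-1" is absent
# (for example because it was filtered out by QC before clustering).
clusters = pd.DataFrame({"Barcode": ["dddd-1", "aaaa-1", "bbbb-1"], "Cluster": [2, 1, 1]})


def cluster_column(obs_index, clusters_df):
    """Align cluster ids to obs_index by barcode; absent cells become NaN."""
    labels = clusters_df.set_index("Barcode")["Cluster"].astype(str)
    aligned = labels.reindex(obs_index)  # join by label, not by row position
    return pd.Categorical(aligned)  # string categories, NaN for missing cells


obs["gene_expression_graphclust"] = cluster_column(obs.index, clusters)
```

Because `reindex` works on labels, the row order of the CSV does not matter, and the number of rows in `obs` never changes.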

Tests

Self-contained unit tests for the parser (_add_cells_analysis): join-by-barcode with scrambled/partial rows, missing-cell handling, obsm alignment with NaN rows, and the no-op-when-absent case. No network/example data required. Verified end-to-end on a real Xenium 2.x run (24,005 cells): all 10 clusterings land in obs, X_pca/X_umap in obsm, with the QC-filtered cells correctly left as NaN.
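
As a rough idea of what such a self-contained test can look like, the sketch below checks obsm alignment against an in-memory CSV. It does not call the PR's `_add_cells_analysis`; `aligned_projection` is a hypothetical helper, and the `UMAP-1`/`UMAP-2` column names are assumptions.

```python
import io

import numpy as np
import pandas as pd


def aligned_projection(csv_file, cell_ids):
    """Read a projection CSV and return an (n_cells, n_dims) array ordered like cell_ids."""
    proj = pd.read_csv(csv_file, index_col="Barcode")
    # reindex inserts all-NaN rows for cells that are missing from the CSV
    return proj.reindex(cell_ids).to_numpy(dtype=float)


def test_projection_alignment():
    cell_ids = ["aaaa-1", "bbbb-1", "cccc-1"]
    # Scrambled rows; "bbbb-1" is missing from the projection.
    csv_text = "Barcode,UMAP-1,UMAP-2\ncccc-1,3.0,30.0\naaaa-1,1.0,10.0\n"
    x_umap = aligned_projection(io.StringIO(csv_text), cell_ids)
    assert x_umap.shape == (3, 2)  # n_obs unchanged
    np.testing.assert_array_equal(x_umap[0], [1.0, 10.0])
    assert np.isnan(x_umap[1]).all()  # missing cell becomes a NaN row
    np.testing.assert_array_equal(x_umap[2], [3.0, 30.0])
    return x_umap


x_umap = test_projection_alignment()
```

Feeding the CSV from a string keeps the test independent of network access and example datasets.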

Notes

  • New constants added to XeniumKeys; ruff lint + format clean.
  • Happy to adjust the default (True vs False), the obs column naming, or whether diffexp belongs in uns — flagging those as the main review-judgment calls.

Add a `cells_analysis` option (default True) to `xenium()` that reads the
Xenium output's `analysis/` folder into the cell table when present:

- `analysis/clustering/<name>/clusters.csv` -> one categorical column per
  clustering in `table.obs` (e.g. `gene_expression_graphclust`,
  `gene_expression_kmeans_10_clusters`), joined to the cells by `cell_id`
  (the CSV `Barcode`). Cells absent from a clustering (filtered by QC) get a
  missing value rather than being dropped.
- `analysis/pca/<name>/projection.csv`  -> `table.obsm["X_pca"]`.
- `analysis/umap/<name>/projection.csv` -> `table.obsm["X_umap"]`.
- `analysis/diffexp/<name>/differential_expression.csv` ->
  `table.uns["diffexp"][<name>]`.

Until now `xenium()` imported only the raw outputs (boundaries, transcripts,
images, cell-feature matrix); the onboard secondary analysis was dropped, so
the 10x-computed clusters/embeddings had to be recomputed downstream. Joining
by `cell_id` keeps everything aligned to the shapes/table index. A missing
`analysis/` folder is a no-op (e.g. re-segmented data, matrix-only exports).

Adds self-contained unit tests for the parser (join-by-barcode, missing-cell
handling, obsm alignment, no-op when the folder is absent).
…ndex

The analysis CSVs key on the cell_id barcode; join clustering + projections on
the cell_id obs column instead of obs_names so the import stays correct even
when the table index is positional rather than the barcode.
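
A small pandas sketch of why the key matters, assuming a toy `obs` with a positional index and a hypothetical `Cluster` column in the CSV (this is an illustration, not the code in the commit):

```python
import pandas as pd

# Toy obs whose index is positional ("0", "1", ...); barcodes live in a cell_id column.
obs = pd.DataFrame({"cell_id": ["aaaa-1", "bbbb-1", "cccc-1"]}, index=["0", "1", "2"])
clusters = pd.DataFrame({"Barcode": ["cccc-1", "aaaa-1"], "Cluster": [2, 1]})
labels = clusters.set_index("Barcode")["Cluster"].astype(str)

# Keyed on obs_names: no label matches, so every cell is wrongly left missing.
by_index = labels.reindex(obs.index)

# Keyed on the cell_id column: each barcode is looked up, and the obs index is kept.
by_cell_id = obs["cell_id"].map(labels)
```

Here `by_index` is entirely NaN, while `by_cell_id` carries the cluster of each matched barcode and leaves only the genuinely absent cell as missing.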
The CLI-completeness test (test_cli_exposes_all_reader_params) requires every
xenium() parameter to have a matching click option in xenium_wrapper. Add the
--cells-analysis option + param + pass-through for the new cells_analysis kwarg.
@codecov-commenter


Codecov Report

❌ Patch coverage is 90.32258% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 38.02%. Comparing base (a63ca08) to head (687a283).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/spatialdata_io/readers/xenium.py | 88.23% | 6 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (a63ca08) and HEAD (687a283). Click for more details.

HEAD has 2 uploads less than BASE
| Flag | BASE (a63ca08) | HEAD (687a283) |
| --- | --- | --- |
| | 3 | 1 |
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #405       +/-   ##
===========================================
- Coverage   63.38%   38.02%   -25.36%     
===========================================
  Files          26       26               
  Lines        3217     3279       +62     
===========================================
- Hits         2039     1247      -792     
- Misses       1178     2032      +854     
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/spatialdata_io/\_\_main\_\_.py | 81.90% <100.00%> (-2.65%) ⬇️ |
| src/spatialdata_io/\_constants/\_constants.py | 100.00% <100.00%> (ø) |
| src/spatialdata_io/readers/xenium.py | 28.98% <88.23%> (-45.84%) ⬇️ |

... and 6 files with indirect coverage changes


@Tomatokeftes

Author

CI triage note: the failures on Python 3.12/3.13 look pre-existing and unrelated to this PR.

  • The test job passes fully on 3.11.
  • On 3.12/3.13 the dominant failure is ValueError: Key 'Abc' is not unique, or another case-variant of it exists (~44 occurrences), originating in shared test fixtures and hitting test_generic, test_macsima, test_seqfish, test_visium_hd, test_dataframe, and the Xenium example-data tests collaterally — i.e. readers this PR doesn't touch. It looks like dependency drift (zarr / case-sensitivity validation) since the last green main run.
  • This PR's diff is confined to xenium.py, __main__.py, and test_xenium.py. The new cells_analysis unit tests pass, and the only failure actually caused by this change (test_cli_exposes_all_reader_params — a missing CLI option for the new cells_analysis param) is now fixed by exposing --cells-analysis.

Happy to rebase once the fixture/dependency issue is addressed on main.
