feat(xenium): import onboard secondary analysis (clustering/PCA/UMAP/diffexp) into the table #405
Open
Tomatokeftes wants to merge 3 commits into
Add a `cells_analysis` option (default True) to `xenium()` that reads the Xenium output's `analysis/` folder into the cell table when present:

- `analysis/clustering/<name>/clusters.csv` -> one categorical column per clustering in `table.obs` (e.g. `gene_expression_graphclust`, `gene_expression_kmeans_10_clusters`), joined to the cells by `cell_id` (the CSV `Barcode`). Cells absent from a clustering (filtered by QC) get a missing value rather than being dropped.
- `analysis/pca/<name>/projection.csv` -> `table.obsm["X_pca"]`.
- `analysis/umap/<name>/projection.csv` -> `table.obsm["X_umap"]`.
- `analysis/diffexp/<name>/differential_expression.csv` -> `table.uns["diffexp"][<name>]`.

Until now `xenium()` imported only the raw outputs (boundaries, transcripts, images, cell-feature matrix); the onboard secondary analysis was dropped, so the 10x-computed clusters/embeddings had to be recomputed downstream. Joining by `cell_id` keeps everything aligned to the shapes/table index. A missing `analysis/` folder is a no-op (e.g. re-segmented data, matrix-only exports).

Adds self-contained unit tests for the parser (join-by-barcode, missing-cell handling, obsm alignment, no-op when the folder is absent).
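The join described in this commit message can be illustrated with a minimal pandas sketch. It is not the PR's implementation: `add_clustering`, the toy barcodes, and the `Cluster` column name are assumptions made for illustration; only the `Barcode` column and the `cell_id` key come from the text above.

```python
import pandas as pd

# Hypothetical stand-in for table.obs: four cells.
obs = pd.DataFrame({"cell_id": ["aaaa-1", "bbbb-1", "cccc-1", "dddd-1"]})

# Hypothetical clusters.csv content: scrambled row order, and one cell
# ("bbbb-1") missing because it was filtered by QC.
# The "Cluster" column name is an assumption.
clusters = pd.DataFrame(
    {"Barcode": ["cccc-1", "aaaa-1", "dddd-1"], "Cluster": [2, 1, 1]}
)


def add_clustering(obs, clusters, name):
    """Return a copy of obs with one categorical column for this clustering."""
    # Look-up table keyed by barcode, with labels stored as strings.
    labels = clusters.set_index("Barcode")["Cluster"].astype(str)
    out = obs.copy()
    # map() leaves NaN for cells that have no row in the CSV,
    # so no cell is dropped and the row count stays the same.
    out[name] = pd.Categorical(out["cell_id"].map(labels))
    return out


result = add_clustering(obs, clusters, "gene_expression_graphclust")
print(result)
```

Mapping onto the existing rows, rather than merging, keeps the number of observations fixed by construction.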
…ndex The analysis CSVs key on the cell_id barcode; join clustering + projections on the cell_id obs column instead of obs_names so the import stays correct even when the table index is positional rather than the barcode.
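A small sketch of why joining on the `cell_id` column is robust to a positional index, here for a projection that ends up in `obsm`. The helper `align_projection`, the toy values, and the `UMAP-1`/`UMAP-2` column names are assumptions for illustration, not the PR's code.

```python
import numpy as np
import pandas as pd

# Hypothetical table.obs whose index is positional (0, 1, 2), not the barcode.
obs = pd.DataFrame({"cell_id": ["aaaa-1", "bbbb-1", "cccc-1"]})

# Hypothetical projection.csv content: scrambled order, "bbbb-1" missing.
# The "UMAP-1"/"UMAP-2" column names are assumptions.
projection = pd.DataFrame(
    {"Barcode": ["cccc-1", "aaaa-1"], "UMAP-1": [3.0, 1.0], "UMAP-2": [30.0, 10.0]}
)


def align_projection(obs, projection):
    """Return an (n_obs, n_dims) array ordered like obs, NaN where missing."""
    coords = projection.set_index("Barcode")
    # Reindex by the cell_id values, not by obs.index, so a positional
    # index on obs cannot misalign the rows.
    return coords.reindex(obs["cell_id"]).to_numpy(dtype=float)


X_umap = align_projection(obs, projection)
print(X_umap)
```

Had the lookup used `obs.index` instead, none of the barcodes would match the integers 0..2 and every row would come back as NaN.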
The CLI-completeness test (test_cli_exposes_all_reader_params) requires every xenium() parameter to have a matching click option in xenium_wrapper. Add the --cells-analysis option + param + pass-through for the new cells_analysis kwarg.
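For context, a completeness check of this kind might look roughly like the sketch below. `read_xenium`, `xenium_cli`, and `missing_cli_options` are hypothetical stand-ins, not the project's `xenium()`, `xenium_wrapper`, or test; only the `--cells-analysis` option name comes from the commit message.

```python
import inspect

import click
from click.testing import CliRunner


def read_xenium(path, cells_table=True, cells_analysis=True):
    """Hypothetical reader stand-in; returns its arguments for inspection."""
    return {"path": path, "cells_table": cells_table, "cells_analysis": cells_analysis}


@click.command()
@click.option("--path", required=True)
@click.option("--cells-table", type=bool, default=True)
@click.option("--cells-analysis", type=bool, default=True)
def xenium_cli(path, cells_table, cells_analysis):
    # Pass every option straight through to the reader.
    click.echo(read_xenium(path, cells_table=cells_table, cells_analysis=cells_analysis))


def missing_cli_options(reader, command):
    """Reader parameters that have no click option of the same name."""
    reader_params = set(inspect.signature(reader).parameters)
    cli_params = {param.name for param in command.params}
    return reader_params - cli_params


# An empty set means every reader parameter is exposed on the CLI.
print(missing_cli_options(read_xenium, xenium_cli))

outcome = CliRunner().invoke(xenium_cli, ["--path", "run", "--cells-analysis", "false"])
print(outcome.output)
```

Comparing the function signature against the command's parameters means a new reader keyword fails the check until the matching option is added.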
Codecov Report

❌ Patch coverage is

Additional details and impacted files

Coverage Diff

| | main | #405 | +/- |
|---|---|---|---|
| Coverage | 63.38% | 38.02% | -25.36% |
| Files | 26 | 26 | |
| Lines | 3217 | 3279 | +62 |
| Hits | 2039 | 1247 | -792 |
| Misses | 1178 | 2032 | +854 |
CI triage note: the failures on Python 3.12/3.13 look pre-existing and unrelated to this PR.
Happy to rebase once the fixture/dependency issue is addressed on |
Motivation

`xenium()` currently imports only the raw outputs (boundaries, transcripts, images, cell-feature matrix). The Xenium Onboard Analysis secondary analysis under `analysis/` (graph-based + k-means clustering, PCA, UMAP, and differential expression) is dropped, so the 10x-computed clusters/embeddings have to be recomputed downstream even though they ship with every standard run. I couldn't find an existing issue/PR covering this (the nearby #385 is about Xenium Explorer selection GeoJSON, a different artifact).

What this does

Adds a `cells_analysis: bool = True` option to `xenium()` that, when the `analysis/` folder is present, enriches the cell table:

- `analysis/clustering/<name>/clusters.csv` -> `table.obs` (named `<name>`, e.g. `gene_expression_graphclust`)
- `analysis/pca/<name>/projection.csv` -> `table.obsm["X_pca"]`
- `analysis/umap/<name>/projection.csv` -> `table.obsm["X_umap"]`
- `analysis/diffexp/<name>/differential_expression.csv` -> `table.uns["diffexp"][<name>]`

Design notes:

- Joined by `cell_id` (the CSV `Barcode` column == `table.obs_names`), so it stays aligned with the shapes/table index regardless of row order.
- Cells absent from a clustering or projection are left as NaN; `n_obs` is unchanged.
- Cluster labels are stored as strings (`"1"`, `"2"`, …), idiomatic for scanpy/squidpy plotting.
- A missing `analysis/` folder (re-segmented data, matrix-only exports) is a no-op.
- Opt out with `cells_analysis=False`; requires `cells_table=True`.

Tests

Self-contained unit tests for the parser (`_add_cells_analysis`): join-by-barcode with scrambled/partial rows, missing-cell handling, `obsm` alignment with NaN rows, and the no-op-when-absent case. No network/example data required. Verified end-to-end on a real Xenium 2.x run (24,005 cells): all 10 clusterings land in `obs`, `X_pca`/`X_umap` in `obsm`, with the QC-filtered cells correctly left as NaN.

Notes

- […] `XeniumKeys`; `ruff` lint + format clean.
- […] (`True` vs `False`), the `obs` column naming, or whether `diffexp` belongs in `uns`; flagging those as the main review-judgment calls.
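As an illustration of the no-op-when-absent behaviour and the per-folder column naming described above, here is a minimal self-contained sketch in the spirit of the unit tests. `read_clusterings` is a hypothetical helper, not the PR's `_add_cells_analysis`, and the `Cluster` column name in the toy CSV is an assumption.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_clusterings(run_dir):
    """Return {clustering name: DataFrame}; empty if analysis/clustering is absent."""
    root = Path(run_dir) / "analysis" / "clustering"
    if not root.is_dir():
        # No onboard analysis shipped with this output: nothing to import.
        return {}
    # Each sub-folder name becomes the name of one clustering.
    return {p.parent.name: pd.read_csv(p) for p in sorted(root.glob("*/clusters.csv"))}


with tempfile.TemporaryDirectory() as tmp:
    # Case 1: no analysis/ folder at all.
    empty = read_clusterings(tmp)

    # Case 2: one clustering folder with a tiny, made-up clusters.csv.
    folder = Path(tmp) / "analysis" / "clustering" / "gene_expression_graphclust"
    folder.mkdir(parents=True)
    (folder / "clusters.csv").write_text("Barcode,Cluster\naaaa-1,1\nbbbb-1,2\n")
    found = read_clusterings(tmp)

print(empty)
print(list(found))
```

Building the folder in a temporary directory keeps the check independent of network access or example datasets.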