[Feature]: Refactor Conformational State Analysis into a Chunked Map-Reduce Workflow #361

@harryswift01

Description

Problem / Motivation

The frame-parallel implementation now provides a deterministic frame/chunk map-reduce execution path for covariance and neighbour-count calculations. This improves the frame-local execution architecture and provides the infrastructure needed to move more trajectory-dependent work into the parallel path.

However, conformational/dihedral analysis is still executed as part of the static LevelDAG stage through ComputeConformationalStatesNode and ConformationStateBuilder. Although this is currently treated as a static setup step, it still performs expensive trajectory-dependent work over the selected frames.

The current conformational workflow already has a two-pass structure:

  1. Identify dihedral histogram peaks from selected-frame angle data.
  2. Re-run/inspect dihedral angle data to assign conformational state labels using those peaks.

Because this work scans trajectory frames, it may remain a serial bottleneck now that covariance and neighbour-count calculations run frame-parallel. It should therefore be refactored into a chunked map-reduce workflow.

The goal is to reduce wall-clock runtime and improve Dask/HPC scaling by moving heavy frame-dependent conformational work into the same general execution model used by the new frame/chunk infrastructure.

Proposed Solution

Refactor conformational state construction into a deterministic chunked map-reduce pipeline while preserving the existing output contract:

shared_data["conformational_states"] = {
    "ua": states_ua,
    "res": states_res,
}

shared_data["flexible_dihedrals"] = {
    "ua": flexible_ua,
    "res": flexible_res,
}

The proposed structure is:

Pass 1:
    For each frame chunk:
        compute raw dihedral angle observations or partial histograms

Reduce 1:
    merge partial angle/histogram data
    identify global dihedral peaks/states

Pass 2:
    For each frame chunk:
        assign conformational states using the global peak/state definitions

Reduce 2:
    merge state labels and flexible-dihedral counts
    produce states_ua, states_res, flexible_ua, flexible_res
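
The four stages above can be sketched in plain Python. This is only an illustration of the control flow for a single dihedral, not the project's implementation: the function names, the 10-degree bin width, the local-maximum peak rule, and the nearest-peak assignment are all assumptions made for the example, and the real peak finder and state labelling in ConformationStateBuilder may differ.

```python
from typing import List, Sequence, Tuple

N_BINS = 36  # assumption: 10-degree bins over [0, 360)


def chunk_frames(frames: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Split frame positions into contiguous chunks, preserving order."""
    return [list(frames[i:i + chunk_size]) for i in range(0, len(frames), chunk_size)]


def map_histogram(angles: Sequence[float]) -> List[int]:
    """Pass 1 (map): bin one chunk's angles (degrees) into a fixed-size histogram."""
    counts = [0] * N_BINS
    width = 360.0 / N_BINS
    for angle in angles:
        # The trailing modulo guards against a float edge case landing on bin N_BINS.
        counts[int((angle % 360.0) // width) % N_BINS] += 1
    return counts


def reduce_histograms(partials: Sequence[Sequence[int]]) -> List[int]:
    """Reduce 1a: sum partial histograms bin by bin."""
    total = [0] * N_BINS
    for partial in partials:
        for i, count in enumerate(partial):
            total[i] += count
    return total


def find_peaks(hist: Sequence[int]) -> List[float]:
    """Reduce 1b: bin centres of circular local maxima (simplified stand-in)."""
    width = 360.0 / N_BINS
    peaks = []
    for i, count in enumerate(hist):
        left, right = hist[i - 1], hist[(i + 1) % N_BINS]
        if count > 0 and count > left and count >= right:
            peaks.append((i + 0.5) * width)
    return peaks


def map_assign(angles: Sequence[float], peaks: Sequence[float]) -> List[int]:
    """Pass 2 (map): label each angle with the index of its nearest global peak."""
    def circular_distance(angle: float, peak: float) -> float:
        d = abs(angle % 360.0 - peak)
        return min(d, 360.0 - d)

    return [
        min(range(len(peaks)), key=lambda k: circular_distance(angle, peaks[k]))
        for angle in angles
    ]


def run_pipeline(angles_by_frame: Sequence[float], chunk_size: int) -> Tuple[List[float], List[int]]:
    """Run both passes; Reduce 2 concatenates labels in chunk order."""
    chunks = chunk_frames(range(len(angles_by_frame)), chunk_size)
    partials = [map_histogram([angles_by_frame[f] for f in c]) for c in chunks]
    peaks = find_peaks(reduce_histograms(partials))
    labels: List[int] = []
    for c in chunks:
        labels.extend(map_assign([angles_by_frame[f] for f in c], peaks))
    return peaks, labels
```

Because the peaks are computed from the merged histogram before any labels are assigned, the result does not depend on the chunk size. That property is what a regression test against the existing serial output could check.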

A possible implementation approach:

  • Keep ComputeConformationalStatesNode as the static DAG entry point.
  • Replace the direct call to ConformationStateBuilder.build_conformational_states(...) with a new conformational map-reduce pipeline.
  • Add a dedicated module such as CodeEntropy/levels/execution/conformations.py.
  • Reuse the new execution infrastructure where appropriate:

    • chunk_frame_indices
    • ExecutionPolicy
    • serial fallback
    • Dask chunk submission when a client is available
    • deterministic chunk-order reduction
    • compact worker partials
  • Split the existing conformational logic into clearer phases:

    • static dihedral/topology discovery
    • chunk-local angle collection
    • peak/histogram reduction
    • chunk-local state assignment
    • final state/flexible-dihedral reduction
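
The serial fallback, optional client submission, and chunk-order reduction listed above could be combined roughly as follows. This is a hedged sketch: split_into_chunks, chunk_partial, run_chunks, and reduce_in_chunk_order are hypothetical names, not the existing chunk_frame_indices or ExecutionPolicy API. The client is assumed to be any object with a submit(func, *args) method returning futures that have .result(), which holds for both concurrent.futures executors and dask.distributed.Client.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Optional, Sequence, Tuple


def split_into_chunks(frames: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Hypothetical helper: contiguous chunks in ascending frame order."""
    return [list(frames[i:i + chunk_size]) for i in range(0, len(frames), chunk_size)]


def chunk_partial(chunk: Sequence[int]) -> Tuple[int, float]:
    """Toy worker: return a compact (count, sum) partial, not per-frame data."""
    values = [0.1 * frame for frame in chunk]
    return len(values), sum(values)


def run_chunks(
    worker: Callable[[Sequence[int]], Any],
    chunks: Sequence[Sequence[int]],
    client: Optional[Any] = None,
) -> List[Any]:
    """Run the worker per chunk; results always come back in chunk order."""
    if client is None:
        # Serial fallback: no scheduler available.
        return [worker(chunk) for chunk in chunks]
    futures = [client.submit(worker, chunk) for chunk in chunks]
    # Collect by submission order, not completion order, so the reduction
    # sees partials in the same sequence as the serial path.
    return [future.result() for future in futures]


def reduce_in_chunk_order(partials: Sequence[Tuple[int, float]]) -> Tuple[int, float]:
    """Accumulate partials left to right so float sums are reproducible."""
    n, total = 0, 0.0
    for count, subtotal in partials:
        n += count
        total += subtotal
    return n, total


def demo() -> bool:
    """Compare the serial path with a thread pool standing in for a client."""
    chunks = split_into_chunks(list(range(10)), 3)
    serial = reduce_in_chunk_order(run_chunks(chunk_partial, chunks))
    with ThreadPoolExecutor(max_workers=2) as pool:
        parallel = reduce_in_chunk_order(run_chunks(chunk_partial, chunks, client=pool))
    return serial == parallel
```

Collecting results by submission order is what keeps floating-point reductions bit-identical between the serial and parallel paths, since float addition is not associative.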

The implementation should carefully preserve the frame-index contract:

MDAnalysis Dihedral.run uses active analysis-universe frame indices.
Dihedral results are indexed locally from zero.
Absolute/source frame indices must not be used directly to index Dihedral results.
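
To make the contract concrete, here is a minimal sketch of translating absolute frame indices into local result rows before indexing. It uses plain lists rather than MDAnalysis objects; local_rows and angles_for_chunk are hypothetical helpers, and run_frames is assumed to be the ordered list of frames that the dihedral run actually visited.

```python
from typing import List, Sequence


def local_rows(run_frames: Sequence[int], chunk_frames: Sequence[int]) -> List[int]:
    """Map absolute frame indices to zero-based rows of the dihedral results.

    run_frames[i] is the absolute index of the frame stored in results row i.
    A KeyError here means the chunk asked for a frame that was never analysed.
    """
    row_of = {frame: row for row, frame in enumerate(run_frames)}
    return [row_of[frame] for frame in chunk_frames]


def angles_for_chunk(
    results: Sequence[Sequence[float]],
    run_frames: Sequence[int],
    chunk_frames: Sequence[int],
) -> List[Sequence[float]]:
    """Select result rows for a chunk by local row, never by absolute index."""
    return [results[row] for row in local_rows(run_frames, chunk_frames)]
```

For example, if the run visited frames 10, 20, 30, and 40, the chunk [30, 40] maps to rows 2 and 3. Indexing the results with 30 or 40 directly would either raise an IndexError or silently read the wrong frame.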

Alternatives Considered

  • Keep conformational analysis in the static stage.

    This is simpler and preserves current behaviour, but it may leave expensive trajectory-dependent work outside the frame-parallel execution path and limit whole-workflow scaling.

  • Only optimise individual functions inside ConformationStateBuilder.

    This may improve local runtime, but it does not address the larger architectural issue that conformational analysis scans selected frames serially.

  • Add conformational analysis directly into the existing covariance/neighbour frame worker.

    This is not ideal because conformational state assignment likely requires a two-pass workflow: first to identify global peaks/states, then to assign states using those global definitions. It should therefore be implemented as a dedicated conformational map-reduce pipeline rather than forced into the single-pass covariance/neighbour worker.

  • Store all raw dihedral angles from workers before reduction.

    This may be simple initially, but could increase memory usage for large systems. Where possible, compact partial histograms or reduced angle summaries should be preferred.
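
As an illustration of what a compact partial could look like, the sketch below counts observations per (dihedral, bin) pair with collections.Counter. The 30-degree bin width, the (dihedral_id, angle) observation format, and both function names are assumptions for the example. The partial's size is bounded by the number of dihedrals times the number of bins, regardless of how many frames a chunk contains.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

BIN_WIDTH = 30.0  # assumption: 12 bins over [0, 360)
N_BINS = int(360.0 / BIN_WIDTH)


def partial_counts(observations: Iterable[Tuple[int, float]]) -> Counter:
    """Count (dihedral_id, bin) occurrences for one chunk of observations."""
    counts: Counter = Counter()
    for dihedral_id, angle in observations:
        bin_index = int((angle % 360.0) // BIN_WIDTH) % N_BINS
        counts[(dihedral_id, bin_index)] += 1
    return counts


def merge_counts(partials: Sequence[Counter]) -> Counter:
    """Sum chunk partials; integer counts merge exactly in any order."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return total
```

Because the counts are integers, the merged result matches a single pass over all observations exactly. The trade-off is that any peak-finding step then has to work from binned counts rather than raw angles.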

Expected Impact

  • Moves remaining trajectory-dependent conformational work toward the chunked frame execution model.
  • Improves potential Dask/HPC scaling beyond covariance and neighbour-count parallelism.
  • Reduces the amount of expensive work left in the serial static stage.
  • Provides a clearer architecture for conformational state construction.
  • Preserves existing downstream output contracts for configurational entropy calculations.
  • May reduce memory pressure if compact partial histograms/counts are used instead of large all-frame angle collections.
  • Strengthens the performance results available for benchmarks and for discussion in a future paper.

Additional Context

The previous frame execution refactor introduced deterministic frame/chunk map-reduce infrastructure for covariance and neighbour-count calculations. That work should be treated as the foundation for this follow-up.

The conformational code already has the conceptual shape needed for map-reduce:

_identify_peaks()
    collect selected-frame dihedral angles
    build histograms
    identify peaks

_assign_states()
    re-run or inspect selected-frame dihedral angles
    assign states using peaks
    collect state labels and flexible-dihedral counts

This issue should focus on turning that existing two-pass serial structure into an explicit chunked map-reduce implementation, without changing the scientific output contract.
