# Shared Infrastructure Roadmap **Version:** 1.0.0-draft **Status:** Draft **Last updated:** 2026-03-10 --- ## 1. Introduction ### 1.1 Purpose This document outlines a future refactoring to extract shared infrastructure from individual external tool implementations into a common `core` module. It captures the rationale, identifies candidates for consolidation, and proposes a migration strategy. ### 1.2 Status This is a **roadmap**, not a specification. The approach described here is intentionally deferred: - **Current state:** Each tool (BayesEoR, pyuvsim) implements its own utilities - **Future state:** Common patterns extracted into `valska.external_tools.core` - **Trigger:** Refactoring should occur after two or more tools are implemented and patterns are validated in practice ### 1.3 Rationale for deferral Extracting shared infrastructure *before* implementing multiple tools risks: 1. **Premature abstraction** — Guessing at the right interfaces without concrete evidence 2. **Over-engineering** — Building flexibility that is never used 3. **Churn** — Refactoring the shared layer as each new tool reveals edge cases By implementing BayesEoR and pyuvsim as standalone modules first, we gain: 1. **Concrete duplication** — Clear evidence of what is genuinely shared 2. **Validated patterns** — Interfaces proven to work for different tool characteristics 3. **Lower risk** — Refactoring existing working code is safer than designing upfront --- ## 2. Current state ### 2.1 BayesEoR module structure ``` src/valska/external_tools/bayeseor/ ├── __init__.py ├── cli_prepare.py ├── cli_submit.py ├── cli_sweep.py ├── runner.py ├── setup.py ├── slurm.py ├── submit.py ├── sweep.py ├── templates.py └── templates/ ``` ### 2.2 pyuvsim module structure (planned) ``` src/valska/external_tools/pyuvsim/ ├── __init__.py ├── cli_prepare.py ├── cli_submit.py ├── runner.py ├── setup.py ├── slurm.py ├── submit.py ├── templates.py └── templates/ ``` ### 2.3 Observed duplication Based on the BayesEoR implementation and the pyuvsim patterns described in the Tool Implementer's Guide, the following areas exhibit significant overlap: | Concern | BayesEoR location | pyuvsim location | Duplication level | |---------|-------------------|------------------|-------------------| | Run directory creation | `setup.py` | `setup.py` | High | | Manifest writing | `setup.py` | `setup.py` | High | | Jobs.json writing | `submit.py` | `submit.py` | High | | SLURM submission | `submit.py` | `submit.py` | High | | Runner definitions | `runner.py` | `runner.py` | Medium | | Template utilities | `templates.py` | `templates.py` | High | | Configuration loading | `cli_prepare.py` | `cli_prepare.py` | Medium | | SLURM directive handling | `slurm.py` | `slurm.py` | Medium | | UTC timestamp helpers | Multiple files | Multiple files | High | | Dry-run semantics | CLI files | CLI files | Medium | --- ## 3. Proposed shared infrastructure ### 3.1 Target structure ``` src/valska/external_tools/ ├── core/ │ ├── __init__.py │ ├── run_directory.py # Run directory creation and validation │ ├── manifest.py # Manifest reading/writing │ ├── jobs.py # Jobs.json reading/writing │ ├── slurm.py # SLURM submission utilities │ ├── runner.py # Base runner classes │ ├── templates.py # Template discovery utilities │ ├── config.py # Configuration loading and merging │ └── utils.py # Common utilities (timestamps, JSON, etc.) ├── bayeseor/ │ └── ... # Tool-specific implementation └── pyuvsim/ └── ... # Tool-specific implementation ``` ### 3.2 Module responsibilities #### 3.2.1 `core/run_directory.py` **Responsibility:** Run directory creation, validation, and path construction. **Candidate functions:** ```python def build_run_dir( results_root: Path, tool: str, taxonomy: dict[str, str], run_id: str, unique: bool = False, ) -> Path: """ Construct a run directory path. Parameters ---------- results_root Base results directory. tool Tool identifier (e.g., 'bayeseor', 'pyuvsim'). taxonomy Tool-specific hierarchy components as key-value pairs. e.g., {'beam_model': 'achromatic_Gaussian', 'sky_model': 'GLEAM'} run_id User-provided run identifier. unique If True, append UTC timestamp for uniqueness. Returns ------- Path to run directory. """ pass def validate_run_dir(run_dir: Path) -> None: """ Validate that a run directory exists and contains required files. Raises ------ RunDirectoryError If validation fails. """ pass def ensure_run_dir(run_dir: Path, exist_ok: bool = False) -> None: """ Create run directory, optionally failing if it exists. """ pass ``` **Current duplication:** - `bayeseor/setup.py`: Manual path construction - `pyuvsim/setup.py`: `build_run_dir()` function --- ### 3.2.2 `core/manifest.py` **Responsibility:** Manifest definition, reading, and writing. **Candidate functions:** ```python from dataclasses import dataclass from datetime import datetime from pathlib import Path from typing import Any @dataclass class ManifestBase: """Required fields for all manifests.""" tool: str created_utc: str valska_version: str run_id: str run_dir: str results_root: str def write_manifest( run_dir: Path, *, tool: str, run_id: str, results_root: Path, extra_fields: dict[str, Any] | None = None, ) -> Path: """ Write manifest.json with required fields plus tool-specific extras. Automatically includes: - tool - created_utc - valska_version - run_id - run_dir - results_root """ pass def load_manifest(run_dir: Path) -> dict[str, Any]: """ Load and validate manifest.json from a run directory. Raises ------ ManifestError If manifest is missing or invalid. """ pass def validate_manifest(manifest: dict[str, Any]) -> None: """ Validate that manifest contains all required fields and raise clear errors. """ pass ``` **Current duplication:** - `bayeseor/setup.py`: `write_manifest()` with BayesEoR-specific fields - `bayeseor/submit.py`: `load_manifest()` - `pyuvsim/setup.py`: `write_manifest()` with pyuvsim-specific fields - `pyuvsim/submit.py`: `load_manifest()` --- #### 3.2.3 `core/jobs.py` **Responsibility:** Jobs.json schema definition, reading, writing, and archival. **Candidate functions:** ```python from pathlib import Path from typing import Any def write_jobs_json( run_dir: Path, *, stage: str, jobs: dict[str, dict[str, Any]], commands: list[str], dry_run: bool = False, sbatch: str = "sbatch", extra_fields: dict[str, Any] | None = None, ) -> Path: """ Write jobs.json with submission details. Automatically includes: - run_dir - manifest path - submitted_utc """ pass def load_jobs_json(run_dir: Path) -> dict[str, Any] | None: """ Load jobs.json if it exists, otherwise return None. """ pass def archive_jobs_json(run_dir: Path) -> Path | None: """ Archive existing jobs.json with timestamp suffix. Returns path to archived file, or None if no jobs.json existed. """ pass def has_submitted_stage(jobs: dict[str, Any] | None, stage: str) -> bool: """ Check if a stage has been submitted. """ pass ``` **Current duplication:** - `bayeseor/submit.py`: Jobs.json writing and archival - `bayeseor/cli_submit.py`: `_archive_jobs_json()`, `_load_jobs_json()` - `pyuvsim/submit.py`: Jobs.json writing --- #### 3.2.4 `core/slurm.py` **Responsibility:** SLURM job submission and output parsing. **Candidate functions:** ```python import re import subprocess from pathlib import Path from typing import Any JOBID_REGEX = re.compile(r"Submitted\s+batch\s+job\s+(\d+)", re.IGNORECASE) class SlurmSubmissionError(RuntimeError): """Raised when sbatch submission fails.""" pass def submit_script( script_path: Path, *, sbatch: str = "sbatch", dependency: str | None = None, cwd: Path | None = None, dry_run: bool = False, ) -> str | None: """ Submit a SLURM script and return the job ID. Parameters ---------- script_path Path to submit script. sbatch Path to sbatch executable. dependency Dependency specification (e.g., 'afterok:12345'). cwd Working directory for submission. dry_run If True, print command without executing. Returns ------- Job ID as string, or None if dry_run. Raises ------ SlurmSubmissionError If submission fails. """ pass def parse_job_id(sbatch_output: str) -> str: """ Extract job ID from sbatch output. Raises ------ SlurmSubmissionError If job ID cannot be parsed. """ pass def build_dependency_string(job_ids: list[str], mode: str = "afterok") -> str: """ Build a SLURM dependency string. Parameters ---------- job_ids List of job IDs to depend on. mode Dependency mode ('afterok', 'afterany', etc.). Returns ------- Dependency string (e.g., 'afterok:123:456'). """ pass ``` **Current duplication:** - `bayeseor/submit.py`: `_JOBID_RE`, sbatch invocation - `pyuvsim/submit.py`: `_JOBID_RE`, sbatch invocation --- #### 3.2.5 `core/runner.py` **Responsibility:** Base runner class definitions. **Candidate classes:** ```python from dataclasses import dataclass from pathlib import Path @dataclass(frozen=True) class CondaRunner: """Execute a tool within a conda environment.""" conda_sh: str """Command to source conda (e.g., 'source /path/to/conda.sh').""" conda_env: str """Name of the conda environment.""" def activation_commands(self) -> list[str]: """Return shell commands to activate the environment.""" return [self.conda_sh, f"conda activate {self.conda_env}"] @dataclass(frozen=True) class ContainerRunner: """Execute a tool within an Apptainer/Singularity container.""" container_image: Path """Path to the container image (.sif file).""" container_bind: str | None = None """Bind mount specification.""" def exec_prefix(self) -> str: """Return the apptainer exec prefix.""" bind = f"--bind {self.container_bind} " if self.container_bind else "" return f"apptainer exec {bind}{self.container_image}" ``` **Current duplication:** - `bayeseor/runner.py`: `CondaRunner`, `ContainerRunner`, `BayesEoRInstall` - `pyuvsim/runner.py`: `CondaRunner`, `ContainerRunner` **Note:** Tool-specific install classes (e.g., `BayesEoRInstall`) remain in tool modules. --- #### 3.2.6 `core/templates.py` **Responsibility:** Template discovery utilities. **Candidate functions:** ```python from pathlib import Path def get_templates_dir(tool_module_path: Path) -> Path: """ Return the templates directory for a tool module. Parameters ---------- tool_module_path Path to a file in the tool module (typically __file__). Returns ------- Path to templates/ directory. """ return tool_module_path.parent / "templates" def list_templates(templates_dir: Path, suffix: str = ".yaml") -> list[str]: """ List available template names in a directory. """ if not templates_dir.exists(): return [] return sorted(p.name for p in templates_dir.glob(f"*{suffix}")) def get_template_path( templates_dir: Path, name: str, suffix: str = ".yaml", ) -> Path: """ Get full path to a template by name. Raises ------ FileNotFoundError If template does not exist. """ if not name.endswith(suffix): name = f"{name}{suffix}" path = templates_dir / name if not path.exists(): available = list_templates(templates_dir, suffix) raise FileNotFoundError( f"Template not found: {name}\n" f"Available: {', '.join(available) or '(none)'}" ) return path ``` **Current duplication:** - `bayeseor/templates.py`: `_templates_dir()`, `list_templates()`, `get_template_path()` - `pyuvsim/templates.py`: Identical pattern --- #### 3.2.7 `core/config.py` **Responsibility:** Configuration loading and merging. **Candidate functions:** ```python from pathlib import Path from typing import Any import yaml def load_runtime_paths() -> dict[str, Any]: """ Load runtime_paths.yaml from the config directory. Returns empty dict if file does not exist. """ pass def get_tool_config(runtime: dict[str, Any], tool: str) -> dict[str, Any]: """ Extract tool-specific configuration section. """ return runtime.get(tool, {}) def get_nested(d: dict[str, Any], *keys: str, default: Any = None) -> Any: """ Safely navigate nested dictionary keys. """ cur = d for k in keys: if not isinstance(cur, dict): return default cur = cur.get(k, default) if cur is default: return default return cur def merge_slurm_config( defaults: dict[str, Any], tool_config: dict[str, Any], cli_overrides: dict[str, Any], ) -> dict[str, Any]: """ Merge SLURM configuration with precedence: CLI > tool config > defaults. None values in higher-precedence layers suppress the key. """ result = dict(defaults) result.update({k: v for k, v in tool_config.items() if v is not None}) result.update({k: v for k, v in cli_overrides.items() if v is not None}) # Remove any keys explicitly set to None return {k: v for k, v in result.items() if v is not None} ``` **Current duplication:** - `bayeseor/cli_prepare.py`: `_get_nested()`, `_slurm_defaults()` - `pyuvsim/cli_prepare.py`: `load_runtime_config()`, `build_slurm_config()` --- #### 3.2.8 `core/utils.py` **Responsibility:** Common utility functions. **Candidate functions:** ```python import json from datetime import datetime, timezone from pathlib import Path from typing import Any def utc_now_iso() -> str: """Return current UTC time in ISO 8601 format.""" return datetime.now(timezone.utc).isoformat() def utc_now_compact() -> str: """Return current UTC time in compact format (for filenames).""" return datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ") def write_json_atomic(path: Path, data: dict[str, Any], indent: int = 2) -> None: """ Write JSON atomically to avoid partial writes on failure. """ tmp_path = path.with_suffix(".json.tmp") with tmp_path.open("w") as f: json.dump(data, f, indent=indent) tmp_path.rename(path) def load_json(path: Path) -> dict[str, Any]: """Load JSON file.""" with path.open() as f: return json.load(f) ``` **Current duplication:** - `bayeseor/setup.py`: `_utc_stamp()` - `bayeseor/cli_submit.py`: `_utc_now_compact()` - `bayeseor/submit.py`: `_utc_now_iso()` - `pyuvsim/submit.py`: `_utc_now_iso()` --- ## 4. Migration strategy ### 4.1 Prerequisites Before beginning migration: 1. **pyuvsim implemented** — At least two tools exist to validate patterns 2. **Tests passing** — Both tool modules have comprehensive test coverage 3. **Patterns validated** — Identified shared code confirmed to be genuinely common ### 4.2 Phase 1: Extract utilities (low risk) **Scope:** `core/utils.py` **Steps:** 1. Create `core/` directory with `__init__.py` 2. Implement `core/utils.py` with timestamp and JSON helpers 3. Update BayesEoR to import from `core.utils` 4. Update pyuvsim to import from `core.utils` 5. Remove duplicated helpers from tool modules 6. Run tests, verify no regressions **Estimated effort:** 1–2 hours --- ### 4.3 Phase 2: Extract templates (low risk) **Scope:** `core/templates.py` **Steps:** 1. Implement `core/templates.py` with generic template utilities 2. Update tool-specific `templates.py` to use core utilities 3. Tool modules retain thin wrappers that specify their templates directory 4. Run tests **Estimated effort:** 1–2 hours --- ### 4.4 Phase 3: Extract runner base classes (medium risk) **Scope:** `core/runner.py` **Steps:** 1. Implement `core/runner.py` with `CondaRunner` and `ContainerRunner` 2. Tool-specific install classes (e.g., `BayesEoRInstall`) remain in tool modules 3. Update tool modules to import base runners from core 4. Run tests **Estimated effort:** 2–3 hours --- ### 4.5 Phase 4: Extract SLURM submission (medium risk) **Scope:** `core/slurm.py` **Steps:** 1. Implement `core/slurm.py` with submission utilities 2. Update tool `submit.py` modules to use core submission 3. Tool modules retain script *generation* (tool-specific) 4. Run tests **Estimated effort:** 3–4 hours --- ### 4.6 Phase 5: Extract manifest/jobs handling (medium risk) **Scope:** `core/manifest.py`, `core/jobs.py` **Steps:** 1. Implement `core/manifest.py` with generic manifest utilities 2. Implement `core/jobs.py` with jobs.json utilities 3. Update tool modules to use core for reading/writing 4. Tool modules provide tool-specific extra fields 5. Run tests **Estimated effort:** 4–6 hours --- ### 4.7 Phase 6: Extract configuration loading (medium risk) **Scope:** `core/config.py` **Steps:** 1. Implement `core/config.py` with configuration utilities 2. Update tool CLI modules to use core configuration loading 3. Tool modules retain tool-specific argument parsing 4. Run tests **Estimated effort:** 3–4 hours --- ### 4.8 Phase 7: Extract run directory handling (low–medium risk) **Scope:** `core/run_directory.py` **Steps:** 1. Implement `core/run_directory.py` with generic path construction 2. Tool modules provide taxonomy definition, core handles construction 3. Update tool modules 4. Run tests **Estimated effort:** 2–3 hours --- ## 5. Post-migration structure ### 5.1 Core module ``` src/valska/external_tools/core/ ├── __init__.py ├── config.py ├── jobs.py ├── manifest.py ├── run_directory.py ├── runner.py ├── slurm.py ├── templates.py └── utils.py ``` ### 5.2 Tool module (simplified) ``` src/valska/external_tools/pyuvsim/ ├── __init__.py ├── cli_prepare.py # Argument parsing, tool-specific logic ├── cli_submit.py # Thin wrapper around core submission ├── runner.py # Tool-specific install class (if needed) ├── slurm.py # SLURM script generation (tool-specific) ├── templates.py # Thin wrapper around core templates └── templates/ ``` ### 5.3 Dependency direction ``` ┌─────────────────────┐ │ core module │ ← No dependencies on tool modules └─────────────────────┘ ▲ │ imports │ ┌─────────────────────┐ │ bayeseor module │ └─────────────────────┘ ┌─────────────────────┐ │ pyuvsim module │ └─────────────────────┘ ``` Tool modules depend on core; core never imports from tools. --- ## 6. Success criteria ### 6.1 Quantitative | Metric | Target | |--------|--------| | Lines of code reduction per tool | 30–50% | | Test duplication reduction | 40–60% | | Time to implement new tool | 50% less than current | ### 6.2 Qualitative - New tool implementations focus on tool-specific concerns - Common bugs fixed once in core, all tools benefit - Consistent behaviour across tools for shared operations - Clear separation between "what all tools do" and "what this tool does" --- ## 7. Risks and mitigations | Risk | Impact | Likelihood | Mitigation | |------|--------|------------|------------| | Core abstraction doesn't fit new tool | High | Medium | Keep core interfaces minimal; allow bypass | | Breaking changes during migration | Medium | Medium | Migrate one module at a time; comprehensive tests | | Over-abstraction | Medium | Low | Extract only proven patterns; resist "future-proofing" | | Increased coupling | Medium | Low | Core has no knowledge of specific tools | --- ## 8. Decision log | Date | Decision | Rationale | |------|----------|-----------| | 2026-01-21 | Defer shared infrastructure extraction | Avoid premature abstraction; validate patterns with pyuvsim first | | 2026-01-21 | Document intended structure as roadmap | Capture intent while allowing standalone tool development | --- ## 9. Open questions ### 9.1 To be resolved before migration 1. **Exception hierarchy** — Should core define a base exception class that tool-specific exceptions inherit from? 2. **Logging** — Should core provide a logging configuration, or leave this to tools? 3. **CLI framework** — Is `argparse` sufficient, or should core provide argument parsing helpers? 4. **Schema validation** — Should core provide formal validators (e.g., Pydantic) or lightweight validate-on-read helpers that produce user-friendly errors and include `valska_version` in messages? ### 9.2 To be resolved during migration 1. **Backwards compatibility** — How long to maintain deprecated imports in tool modules? 2. **Documentation** — Should core have its own API documentation, or integrate with tool docs? --- ## 10. References - [External Tool Integration Specification](external_tool_integration_spec.md) - [Tool Implementer's Guide](tool_implementers_guide.md) - BayesEoR reference implementation: `src/valska/external_tools/bayeseor/` --- ## 11. Related future direction: public Python API and documentation strategy ### 11.1 Purpose This roadmap primarily covers shared infrastructure for external tool integration. A closely related, but distinct, future task is to define a small, intentional **public Python API** for `valska` and align the documentation around that supported surface. This section records the intended direction so that later refactoring can be planned deliberately rather than emerging accidentally from documentation or import convenience. ### 11.2 Earlier approach and current interim state At present, the repository still exposes only a minimal top-level Python API, while the documentation includes a broader set of module pages. The **earlier recursive documentation approach** relied more heavily on broad discovery from the package root. In practice, that created a few avoidable problems: 1. **Fragile documentation builds** API generation became sensitive to how Sphinx discovered and imported submodules, rather than depending only on the pages we intended to build. 2. **Accidental coupling to package layout** Internal module organisation and generated intermediate package pages began to affect the success of the docs build, even when the runtime behaviour of the package itself was unchanged. 3. **Ambiguity about what is public** Recursive discovery made it easier for internal modules to look supported merely because they appeared in the generated reference output. 4. **Higher maintenance cost when imports evolved** Changes in import structure, optional dependencies, or notebook-related modules could cause documentation failures that were disproportionate to the actual code change. The **new interim explicit-module approach** improves this situation substantially. In particular, it: 1. reduces dependence on recursive module discovery; 2. removes the need for fragile intermediate package-summary pages to remain consistent; 3. makes the documentation build depend on an intentional list of module pages; and 4. provides a practical, low-risk fix that keeps continuous integration stable without changing runtime behaviour. This interim approach is acceptable for now, because it: 1. avoids making premature compatibility promises through `valska.__init__`; 2. keeps the branch-protection-critical documentation build stable; and 3. allows BayesEoR and related tooling to continue evolving without implying that every internal module is public and supported. ### 11.3 Recommended target state The recommended long-term model is the one commonly used by public-facing scientific Python packages: 1. **Curate a supported public API** Re-export only the functions, classes, and helpers that are intended for external users. 2. **Document the public API first** Treat the top-level package, and a small number of genuinely public subpackages, as the main reference surface. 3. **Keep internal modules internal** Workflow orchestration helpers, implementation details, and CLI plumbing should not become de facto public merely because they appear in recursive documentation output. 4. **Use narrative documentation for workflows** Operational and scientific workflows should continue to be explained through guides, examples, and CLI documentation rather than relying on low-level API pages alone. This long-term direction is still preferable to the present explicit-module listing, because it offers a better overall balance for a public-facing repository: 1. **Clearer support boundary** Users can distinguish more easily between supported public imports and internal implementation detail. 2. **Lower ongoing documentation churn** A curated public API reduces the need to update long lists of individual modules whenever internal structure changes. 3. **Greater refactoring freedom** Internal modules can evolve with less risk of accidental public commitments. 4. **Better alignment with user needs** Most users need stable entry points and workflow documentation rather than a near-complete map of internal source files. ### 11.4 Why this is deferred This change should be deferred until after the current modularisation work is merged and the public surface can be reviewed intentionally. Deferral is recommended because: 1. **The package surface is still evolving** BayesEoR orchestration code is currently being restructured, so it would be unhelpful to freeze import paths prematurely. 2. **Not every documented module should be public** Some modules are implementation detail, developer utility, or CLI support code rather than stable library API. 3. **Top-level re-exports create maintenance obligations** Once symbols are presented as public, later refactoring becomes more constrained. 4. **Documentation stability is currently the priority** The present explicit module listing is a pragmatic fix that keeps the docs build reliable without changing runtime behaviour. ### 11.5 Interim approach Until a public API review is completed, the preferred documentation strategy is: - keep the current explicit API reference list in `docs/source/api.rst`; - avoid broad recursive discovery from the package root; - continue documenting workflows and CLI usage separately from the Python API; - treat any future top-level re-export as an explicit design decision. ### 11.6 Proposed review questions for the follow-up task When this work is taken up, the review should answer the following: 1. Which functions and classes are intended for external Python users? 2. Which import paths should be considered stable across releases? 3. Which modules are internal implementation detail and should remain outside the top-level namespace? 4. Which CLI-facing modules should be documented as commands rather than as Python API surface? 5. Which imports are lightweight and safe enough to expose at package-import time? ### 11.7 Suggested implementation sequence 1. Review the existing Python surface and identify genuinely public objects. 2. Define a small curated export set in `valska.__init__`. 3. Add or refine top-level package documentation around that curated surface. 4. Retain detailed module pages only where they are useful for advanced users or developers. 5. Update contributor guidance once the public/private boundary has been agreed. ### 11.8 Decision note For the avoidance of doubt, the current documentation fix does **not** commit the project to the present module layout as a permanent public API. It is an interim documentation-stability measure, not a statement that every documented module path is part of the long-term supported interface.