BayesEoR validation workflow (ValSKA)

This document describes the BayesEoR validation workflow implemented in ValSKA-HERA-beam-FWHM, with an emphasis on reproducibility, resumability, and high-performance computing (HPC) best practices.

It is intended to complement the project README by providing additional context, rationale, and operational detail.


Design goals

Naming conventions used in this page:

  • run_id: user-chosen identifier for one prepared run or sweep campaign

  • sweep_id: commonly used synonym for a sweep-level run_id

  • run_dir: a single prepared point directory

  • sweep_dir: the _sweeps/<run_id> root containing sweep_manifest.json

The BayesEoR workflow in ValSKA is designed to satisfy the following principles:

  • Reproducibility Every run is fully specified by explicit configuration files and provenance metadata.

  • Inspectability Users should be able to examine exactly what will be run before submitting jobs to a scheduler.

  • Resumability Long-running Bayesian inference (e.g. MultiNest) must be easy to resume after walltime or preemption.

  • HPC appropriateness The workflow should integrate cleanly with batch schedulers (SLURM), avoid hidden state, and respect site policies.

To achieve this, ValSKA enforces a strict separation between preparation and submission.


Prepare vs submit: separation of concerns

Prepare phase

The prepare phase is performed using:

valska-bayeseor-prepare

This phase:

  • renders BayesEoR configuration YAML files from templates

  • writes SLURM submit scripts for each execution stage

  • creates a structured run directory under the ValSKA results root

  • writes a manifest.json capturing full provenance

Critically, the prepare phase:

  • does not execute BayesEoR

  • does not submit jobs

  • does not modify any global state

The output of this phase is a self-contained run directory that can be:

  • inspected

  • versioned

  • archived

  • copied to another system

  • reused for resubmission


Submit phase

The submit phase is performed using:

valska-bayeseor-submit <run_dir>

This phase:

  • reads an existing prepared run directory

  • submits SLURM jobs using the generated scripts

  • enforces job dependencies explicitly

  • records submitted job IDs in jobs.json

Submission is intentionally lightweight: it orchestrates sbatch calls but does not attempt to manage or monitor running jobs.


Run directory structure

A typical prepared run directory has the form:

<results_root>/bayeseor/<beam_model>/<sky_model>//<run_label>/<run_id>/

or, if --unique is used:

<results_root>/bayeseor/<beam_model>/<sky_model>//<run_label>/<run_id>//

For sweep campaigns, point run directories are created under:

<results_root>/bayeseor/<beam_model>/<sky_model>/_sweeps/<run_id>//<run_label>/

Inside this directory you will typically find:

  • config_signal_fit.yaml

  • config_no_signal.yaml

  • submit_cpu_precompute.sh

  • submit_signal_fit_gpu_run.sh

  • submit_no_signal_gpu_run.sh

  • manifest.json

  • (after submission) jobs.json

This directory is the unit of reproducibility in ValSKA.


Execution stages and dependencies

BayesEoR runs are structured as two stages:

1. CPU precompute stage

  • Computes instrument transfer matrices and related quantities

  • Shared across signal and no-signal hypotheses

  • Typically CPU-bound

Submitted via:

submit_cpu_precompute.sh

2. GPU inference stages

  • One job per hypothesis (e.g. signal_fit, no_signal)

  • Runs BayesEoR with --gpu --run

  • Typically long-running and GPU-bound

Submitted via:

submit_<hyp>_gpu_run.sh

GPU jobs depend explicitly on the successful completion of the CPU stage using SLURM afterok dependencies.


Provenance and state tracking

manifest.json

Written at prepare time.

Contains:

  • template name or path

  • BayesEoR repository path

  • runtime configuration

  • SLURM parameters

  • applied overrides (e.g. FWHM perturbations)

  • paths to all generated artefacts

This file represents what was intended to be run.

It is treated as immutable.


jobs.json

Written at submit time.

Contains:

  • SLURM job IDs

  • submission timestamps

  • dependency structure

  • exact sbatch commands used

This file represents what was actually submitted.

If resubmission is required, previous versions of this file may be archived (e.g. jobs_YYYYMMDDTHHMMSSZ.json) to preserve history.


Resubmission and walltime handling

BayesEoR uses MultiNest, which can resume from existing output directories.

If a job hits walltime:

  • output files remain in the run directory

  • no configuration regeneration is required

To requeue cleanly:

valska-bayeseor-submit <run_dir> --stage gpu --resubmit

This will:

  • archive the existing jobs.json

  • submit new GPU jobs

  • reuse existing BayesEoR outputs

This pattern avoids accidental double submission while making recovery trivial.


Manual submission remains supported

At all times, users may bypass ValSKA-managed submission and run:

sbatch submit_cpu_precompute.sh
sbatch submit_signal_fit_gpu_run.sh
sbatch submit_no_signal_gpu_run.sh

ValSKA does not hide or replace native scheduler behaviour.


When to use –unique vs stable run directories

  • Use stable run directories (default) when:

    • you want to resume runs

    • you expect walltime interruptions

    • you are iterating on the same configuration

  • Use –unique when:

    • performing parameter sweeps

    • launching many independent realisations

    • archival separation is preferred over resumability

Both modes are fully supported by the workflow.


Summary

The ValSKA BayesEoR workflow emphasises:

  • explicit artefact generation

  • transparent submission

  • reproducible directory structures

  • safe and convenient resubmission

This structure is intentional and designed to scale to both validation studies and production inference runs on shared HPC systems.