BayesEoR validation workflow (ValSKA)
This document describes the BayesEoR validation workflow implemented in
ValSKA-HERA-beam-FWHM, with an emphasis on reproducibility, resumability, and
high-performance computing (HPC) best practices.
It is intended to complement the project README by providing additional context, rationale, and operational detail.
Design goals
Naming conventions used in this page:
run_id: user-chosen identifier for one prepared run or sweep campaign
sweep_id: commonly used synonym for a sweep-level run_id
run_dir: a single prepared point directory
sweep_dir: the _sweeps/<run_id> root containing sweep_manifest.json
The BayesEoR workflow in ValSKA is designed to satisfy the following principles:
Reproducibility: Every run is fully specified by explicit configuration files and provenance metadata.
Inspectability: Users should be able to examine exactly what will be run before submitting jobs to a scheduler.
Resumability: Long-running Bayesian inference (e.g. MultiNest) must be easy to resume after walltime or preemption.
HPC appropriateness: The workflow should integrate cleanly with batch schedulers (SLURM), avoid hidden state, and respect site policies.
To achieve this, ValSKA enforces a strict separation between preparation and submission.
Prepare vs submit: separation of concerns
Prepare phase
The prepare phase is performed using:
valska-bayeseor-prepare
This phase:
renders BayesEoR configuration YAML files from templates
writes SLURM submit scripts for each execution stage
creates a structured run directory under the ValSKA results root
writes a manifest.json capturing full provenance
Critically, the prepare phase:
does not execute BayesEoR
does not submit jobs
does not modify any global state
The output of this phase is a self-contained run directory that can be:
inspected
versioned
archived
copied to another system
reused for resubmission
Submit phase
The submit phase is performed using:
valska-bayeseor-submit <run_dir>
This phase:
reads an existing prepared run directory
submits SLURM jobs using the generated scripts
enforces job dependencies explicitly
records submitted job IDs in jobs.json
Submission is intentionally lightweight: it orchestrates sbatch calls but
does not attempt to manage or monitor running jobs.
Run directory structure
A typical prepared run directory has the form:
<results_root>/bayeseor/<beam_model>/<sky_model>/
or, if --unique is used:
<results_root>/bayeseor/<beam_model>/<sky_model>/
For sweep campaigns, point run directories are created under:
<results_root>/bayeseor/<beam_model>/<sky_model>/_sweeps/<run_id>/
Inside this directory you will typically find:
config_signal_fit.yaml
config_no_signal.yaml
submit_cpu_precompute.sh
submit_signal_fit_gpu_run.sh
submit_no_signal_gpu_run.sh
manifest.json
jobs.json (after submission)
This directory is the unit of reproducibility in ValSKA.
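As a quick inspection aid before submitting, the listing above can be turned into a small check. This is a minimal sketch, not part of ValSKA: the helper names `missing_prepared_files` and `has_been_submitted` are hypothetical, and the file names assume the default `signal_fit` and `no_signal` hypotheses.

```python
from pathlib import Path

# File names taken from the listing above; jobs.json is excluded here
# because it only appears after submission.
PREPARED_FILES = (
    "config_signal_fit.yaml",
    "config_no_signal.yaml",
    "submit_cpu_precompute.sh",
    "submit_signal_fit_gpu_run.sh",
    "submit_no_signal_gpu_run.sh",
    "manifest.json",
)


def missing_prepared_files(run_dir):
    """Return the expected prepare-time files that are absent from run_dir."""
    run_dir = Path(run_dir)
    return [name for name in PREPARED_FILES if not (run_dir / name).is_file()]


def has_been_submitted(run_dir):
    """Return True if a jobs.json record is present in run_dir."""
    return (Path(run_dir) / "jobs.json").is_file()
```

An empty list from `missing_prepared_files(run_dir)` suggests the prepare phase completed; it does not validate the contents of any file.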
Execution stages and dependencies
BayesEoR runs are structured as two stages:
1. CPU precompute stage
Computes instrument transfer matrices and related quantities
Shared across signal and no-signal hypotheses
Typically CPU-bound
Submitted via:
submit_cpu_precompute.sh
2. GPU inference stages
One job per hypothesis (e.g. signal_fit, no_signal)
Runs BayesEoR with --gpu --run
Typically long-running and GPU-bound
Submitted via:
submit_<hyp>_gpu_run.sh
GPU jobs depend explicitly on the successful completion of the CPU stage
using SLURM afterok dependencies.
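The dependency chain described above can be sketched as follows. This is an illustration under stated assumptions rather than ValSKA's actual submission code: `run_sbatch` and `submit_chain` are hypothetical names, the script names follow the `submit_<hyp>_gpu_run.sh` pattern, and the code assumes it is run from inside the prepared run directory. The `--parsable` and `--dependency=afterok:<jobid>` options are standard `sbatch` flags.

```python
import subprocess


def run_sbatch(args):
    """Run sbatch and return the job ID printed by --parsable."""
    out = subprocess.run(
        ["sbatch", "--parsable", *args],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
    return out.strip().split(";")[0]


def submit_chain(hypotheses=("signal_fit", "no_signal"), submit=run_sbatch):
    """Submit the CPU stage, then one GPU job per hypothesis gated on afterok."""
    cpu_id = submit(["submit_cpu_precompute.sh"])
    gpu_ids = {
        hyp: submit([f"--dependency=afterok:{cpu_id}", f"submit_{hyp}_gpu_run.sh"])
        for hyp in hypotheses
    }
    return {"cpu_precompute": cpu_id, "gpu": gpu_ids}
```

The `submit` argument is injectable so the chain can be exercised with a stand-in function on a machine without SLURM. With `afterok`, the GPU jobs start only if the CPU job exits with code zero.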
Provenance and state tracking
manifest.json
Written at prepare time.
Contains:
template name or path
BayesEoR repository path
runtime configuration
SLURM parameters
applied overrides (e.g. FWHM perturbations)
paths to all generated artefacts
This file represents what was intended to be run.
It is treated as immutable.
jobs.json
Written at submit time.
Contains:
SLURM job IDs
submission timestamps
dependency structure
exact sbatch commands used
This file represents what was actually submitted.
If resubmission is required, previous versions of this file may be archived
(e.g. jobs_YYYYMMDDTHHMMSSZ.json) to preserve history.
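The archiving convention above can be illustrated with a short sketch. The function name `archive_jobs_file` is hypothetical and the behaviour shown (rename in place, refuse to overwrite an existing archive) is an assumption, not a description of ValSKA's own implementation. Only the `jobs_YYYYMMDDTHHMMSSZ.json` naming pattern comes from the text.

```python
from datetime import datetime, timezone
from pathlib import Path


def archive_jobs_file(run_dir, now=None):
    """Rename jobs.json to jobs_<UTC timestamp>.json.

    Returns the archive path, or None if there was no jobs.json to archive.
    `now` should be a timezone-aware datetime; it defaults to the current time.
    """
    jobs = Path(run_dir) / "jobs.json"
    if not jobs.is_file():
        return None
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = jobs.with_name(f"jobs_{stamp}.json")
    if target.exists():
        # Never overwrite an earlier submission record.
        raise FileExistsError(target)
    jobs.rename(target)
    return target
```

Using a UTC timestamp keeps archive names sortable and unambiguous across sites in different time zones.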
Resubmission and walltime handling
BayesEoR uses MultiNest, which can resume from existing output directories.
If a job hits walltime:
output files remain in the run directory
no configuration regeneration is required
To requeue cleanly:
valska-bayeseor-submit <run_dir> --stage gpu --resubmit
This will:
archive the existing jobs.json
submit new GPU jobs
reuse existing BayesEoR outputs
This pattern avoids accidental double submission while making recovery trivial.
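One way a guard against double submission could work is to treat an existing jobs.json as evidence of a prior submission and require an explicit opt-in to proceed. The sketch below is an assumption about the pattern, not ValSKA's actual logic; `AlreadySubmittedError` and `check_can_submit` are hypothetical names.

```python
from pathlib import Path


class AlreadySubmittedError(RuntimeError):
    """Raised when jobs.json exists and resubmission was not requested."""


def check_can_submit(run_dir, resubmit=False):
    """Refuse to submit over an existing jobs.json unless resubmit is set.

    Returns True when an existing jobs.json is present and should be
    archived before new jobs are submitted, False for a first submission.
    """
    jobs = Path(run_dir) / "jobs.json"
    exists = jobs.is_file()
    if exists and not resubmit:
        raise AlreadySubmittedError(
            f"{jobs} already exists; request resubmission explicitly to requeue"
        )
    return exists
```

Failing loudly by default means a repeated submit command cannot silently queue a second set of jobs against the same outputs.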
Manual submission remains supported
At all times, users may bypass ValSKA-managed submission and run:
sbatch submit_cpu_precompute.sh
sbatch submit_signal_fit_gpu_run.sh
sbatch submit_no_signal_gpu_run.sh
ValSKA does not hide or replace native scheduler behaviour.
When to use --unique vs stable run directories
Use stable run directories (default) when:
you want to resume runs
you expect walltime interruptions
you are iterating on the same configuration
Use --unique when:
performing parameter sweeps
launching many independent realisations
archival separation is preferred over resumability
Both modes are fully supported by the workflow.
Summary
The ValSKA BayesEoR workflow emphasises:
explicit artefact generation
transparent submission
reproducible directory structures
safe and convenient resubmission
This structure is intentional and designed to scale to both validation studies and production inference runs on shared HPC systems.