Tutorial

This tutorial walks through the current CLI workflow from data loading to post-training analysis and simulation benchmarking.

1. Confirm The Data Layout

BSVAE trains on feature profiles, not sample profiles. The main matrix must be features x samples.

rows are feature IDs
columns are sample IDs
CSV and TSV files use the first column as the row index

Supported formats:

.csv / .csv.gz
.tsv / .tsv.gz
.h5 / .hdf5
.h5ad (requires the optional anndata package)
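
The layout rules above can be checked before training. The following is a minimal sketch using only the Python standard library. It is not part of BSVAE: the helper names (open_text, matrix_shape, likely_samples_by_features) are hypothetical, and the "more columns than rows" test is only a heuristic that assumes a typical dataset has more features than samples.

```python
import csv
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", newline="")
    return path.open("r", newline="")


def matrix_shape(path):
    """Return (n_rows, n_cols) of a CSV/TSV matrix.

    The header row and the first column (the row index) are not counted.
    """
    path = Path(path)
    # .tsv and .tsv.gz are tab-delimited; everything else is read as CSV.
    delimiter = "\t" if ".tsv" in path.suffixes else ","
    with open_text(path) as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        n_cols = len(header) - 1
        n_rows = sum(1 for row in reader if row)
    return n_rows, n_cols


def likely_samples_by_features(shape):
    """Heuristic only: flag matrices with fewer rows than columns.

    Assumes the dataset has more features than samples, which is common
    but not guaranteed.
    """
    n_rows, n_cols = shape
    return n_rows < n_cols
```

For example, matrix_shape("data/expression.csv") returns the number of feature rows and sample columns. If likely_samples_by_features flags the result, inspect the row and column IDs by hand before deciding to transpose.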

2. Run A Minimal Training Job

Use a short run first to confirm the install and data format.

bsvae-train tutorial_min \
  --dataset data/expression.csv \
  --epochs 5 \
  --batch-size 64 \
  --n-modules 8 \
  --latent-dim 12

Sanity checks:

results/tutorial_min/model.pt exists
results/tutorial_min/specs.json exists
results/tutorial_min/train_losses.csv exists
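
The sanity checks above can be scripted. This is a small sketch, not a BSVAE utility: missing_outputs is a hypothetical helper, and treating an empty file as missing is an extra assumption added here.

```python
from pathlib import Path

# File names taken from the sanity-check list in this tutorial.
EXPECTED_FILES = ("model.pt", "specs.json", "train_losses.csv")


def missing_outputs(run_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent or empty in run_dir."""
    run_dir = Path(run_dir)
    missing = []
    for name in expected:
        target = run_dir / name
        # A zero-byte file usually means the run was interrupted.
        if not target.is_file() or target.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling missing_outputs("results/tutorial_min") returns an empty list when all three files are present and non-empty.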

3. Select The Number Of Modules

--n-modules sets the number of Gaussian-mixture components (K) in the prior. For real analyses, the recommended way to choose K is bsvae-sweep-k.

bsvae-sweep-k sweep_prod \
  --dataset data/expression.csv \
  --k-grid 8,12,16,24,32 \
  --sweep-epochs 60 \
  --stability-reps 5 \
  --val-frac 0.1

Key outputs:

results/sweep_prod/sweep_k/sweep_results.csv
results/sweep_prod/sweep_k/sweep_summary.json
per-K replicate directories under results/sweep_prod/sweep_k/k<K>/rep_<rep>/
final retrained model under results/sweep_prod/final_k<K>/

Selection behavior:

with --stability-reps 1, the best K is chosen by validation loss
with --stability-reps > 1, the best K is chosen by mean pairwise ARI on held-out features
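
To make the stability criterion concrete, here is a standard-library sketch of mean pairwise ARI (adjusted Rand index) across replicate labelings of the same features. It illustrates the metric only. How bsvae-sweep-k builds the held-out set and computes the score internally is not shown in this tutorial, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of items that share a cluster in both, in A only, and in B only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings use a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def mean_pairwise_ari(replicate_labels):
    """Average ARI over all unordered pairs of replicate labelings."""
    pairs = list(combinations(replicate_labels, 2))
    if not pairs:
        raise ValueError("need at least two replicates")
    return sum(adjusted_rand_index(a, b) for a, b in pairs) / len(pairs)
```

ARI ignores label names, so two replicates that find the same modules under different numbering still score 1.0. A value near 0 means agreement no better than chance.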

4. Train A Final Model Directly

If you already know K, train directly with bsvae-train.

bsvae-train study1 \
  --dataset data/expression.csv \
  --epochs 120 \
  --n-modules 24 \
  --latent-dim 32

Useful flags to review for production runs:

--normalize-input
--warmup-epochs
--transition-epochs
--free-bits
--sep-strength
--bal-strength
--checkpoint-every

5. Extract Feature Networks

bsvae-networks extract-networks builds sparse feature-feature graphs from trained latents.

bsvae-networks extract-networks \
  --model-path results/sweep_prod/final_k16 \
  --dataset data/expression.csv \
  --output-dir results/sweep_prod/final_k16/networks \
  --methods mu_cosine gamma_knn \
  --top-k 50

Methods:

mu_cosine: top-k cosine neighbors in latent mean space
gamma_knn: FAISS-based kNN graph in soft-assignment space
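
As an illustration of what "top-k cosine neighbors" means, the sketch below computes them for a tiny matrix of latent means in pure Python. It is a conceptual example with hypothetical function names, not the BSVAE implementation. The all-pairs loop is quadratic in the number of features, so it is only suitable for toy inputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_cosine_neighbors(mu, k):
    """Map each row index to its k most similar other rows.

    Returns {row_index: [(neighbor_index, similarity), ...]} with neighbors
    sorted from most to least similar. A row is never its own neighbor.
    """
    neighbors = {}
    for i, row in enumerate(mu):
        scored = [
            (j, cosine_similarity(row, other))
            for j, other in enumerate(mu)
            if j != i
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        neighbors[i] = scored[:k]
    return neighbors
```

Keeping only the top k entries per row is what makes the resulting graph sparse: each feature contributes at most k edges regardless of how many features there are.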

Outputs are sparse adjacency files such as:

mu_cosine_adjacency.npz
gamma_knn_adjacency.npz

6. Extract Modules

bsvae-networks extract-modules saves soft and hard GMM assignments. To also compute eigengenes, pass both --expr and --soft-eigengenes.

bsvae-networks extract-modules \
  --model-path results/sweep_prod/final_k16 \
  --dataset data/expression.csv \
  --output-dir results/sweep_prod/final_k16/modules \
  --expr data/expression.csv \
  --soft-eigengenes

Primary outputs:

gamma.npz
hard_assignments.npz
soft_eigengenes.csv when requested
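
The relationship between soft and hard assignments can be sketched as follows. A common convention is to take, for each feature, the component with the largest soft-assignment weight. Whether BSVAE derives hard_assignments.npz exactly this way is an assumption here, and the function names are hypothetical.

```python
def hard_assignments(gamma):
    """Return, for each row of gamma, the index of its largest weight.

    gamma is a list of rows, one per feature, each holding one
    soft-assignment weight per module. Ties go to the lowest index.
    """
    labels = []
    for row in gamma:
        if not row:
            raise ValueError("each row needs at least one component weight")
        labels.append(max(range(len(row)), key=row.__getitem__))
    return labels


def module_sizes(labels, n_modules):
    """Count how many features fall in each module."""
    sizes = [0] * n_modules
    for label in labels:
        sizes[label] += 1
    return sizes
```

Module sizes are a quick diagnostic: if most features land in one or two modules, the chosen K or the training settings may need another look.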

Optional comparison outputs:

leiden_modules.csv with --use-leiden
gamma_gene.npz and hard_assignments_gene.npz with --aggregate-to-gene --tx2gene

7. Export Latents

export-latents writes a compressed NumPy archive.

bsvae-networks export-latents \
  --model-path results/sweep_prod/final_k16 \
  --dataset data/expression.csv \
  --output results/sweep_prod/final_k16/latents

Saved arrays:

mu
logvar
gamma
feature_ids
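
A quick way to inspect the archive is to load it with NumPy and print the array shapes. This sketch assumes the archive is a standard .npz file readable by numpy.load and that every array has one row per feature. The exact file name written by export-latents is not stated in this tutorial, so the path is left as a parameter. The helper names are hypothetical.

```python
import numpy as np


def summarize_latents(path):
    """Return {array_name: shape} for every array in an .npz archive."""
    # allow_pickle stays at its default (False); if feature_ids was saved
    # as an object array, loading it would need allow_pickle=True.
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}


def rows_are_aligned(summary):
    """Check that all arrays share the same first dimension (one row per feature)."""
    first_dims = {shape[0] for shape in summary.values() if len(shape) > 0}
    return len(first_dims) <= 1
```

If rows_are_aligned returns False, the arrays do not describe the same set of features and should not be indexed together.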

8. Analyze The Latent Space

bsvae-networks latent-analysis \
  --model-path results/sweep_prod/final_k16 \
  --dataset data/expression.csv \
  --output-dir results/sweep_prod/final_k16/latent_analysis \
  --kmeans-k 16 \
  --umap

Typical outputs:

latent_mu.csv
latent_logvar.csv
latent_clusters.csv when clustering is requested
latent_embeddings.csv when --umap or --tsne is used
latent_covariate_correlations.csv when --covariates is provided

9. Generate Synthetic Data And Benchmark Recovery

Generate one dataset:

bsvae-simulate generate \
  --output data/sim_expr.csv \
  --save-ground-truth data/sim_truth.csv

Benchmark a trained model against the ground truth:

bsvae-simulate benchmark \
  --dataset data/sim_expr.csv \
  --ground-truth data/sim_truth.csv \
  --model-path results/sweep_prod/final_k16 \
  --output results/sweep_prod/final_k16/sim_metrics.json

Generate a publication-style scenario grid:

bsvae-simulate init-config --output sim.yaml

bsvae-simulate generate-grid \
  --config sim.yaml \
  --outdir results/sim_pub_v1 \
  --reps 30 \
  --base-seed 13

bsvae-simulate validate-grid --grid-dir results/sim_pub_v1

Each replicate directory includes method-ready files such as:

expr/features_x_samples.tsv.gz
expr/samples_x_features.tsv.gz
truth/modules_hard.csv
method_inputs.json
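
The two expression files are named as if they hold the same matrix in both orientations. Assuming that is the case, and that each file has a header row plus a leading index column, the sketch below checks that one is the transpose of the other. The helper names are hypothetical, and values are compared as strings, so differently formatted numbers would be reported as a mismatch.

```python
import csv
import gzip


def read_tsv_gz(path):
    """Read a gzip-compressed TSV with a header row and an index column.

    Returns (column_ids, row_ids, values); values are kept as strings.
    """
    with gzip.open(path, "rt", newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    column_ids = rows[0][1:]
    row_ids = [row[0] for row in rows[1:]]
    values = [row[1:] for row in rows[1:]]
    return column_ids, row_ids, values


def is_transpose(path_a, path_b):
    """Return True if the matrix in path_b is the transpose of the one in path_a."""
    cols_a, rows_a, vals_a = read_tsv_gz(path_a)
    cols_b, rows_b, vals_b = read_tsv_gz(path_b)
    # Row IDs of one file must be the column IDs of the other, in order.
    if cols_a != rows_b or rows_a != cols_b:
        return False
    transposed = [[row[j] for row in vals_a] for j in range(len(cols_a))]
    return transposed == vals_b
```

Inside a replicate directory, the call would look like is_transpose("expr/features_x_samples.tsv.gz", "expr/samples_x_features.tsv.gz").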

10. Common Problems

Data orientation is wrong: transpose sample-by-feature matrices before training.
CUDA memory is tight: reduce --batch-size, or run without the GPU using --no-cuda.
gamma_knn fails: verify that faiss-cpu is installed. It is a required dependency but may be missing in some custom environments.
Hierarchical options (--aggregate-to-gene with --tx2gene) fail: make sure the IDs in the --tx2gene mapping match the matrix row IDs.
No eigengene file appears: --soft-eigengenes only writes output when --expr is supplied.

11. Legacy Configuration Note

The active CLI does not use hyperparam.ini. The copies at src/bsvae/hyperparam.ini and docs/hyperparam.ini are kept only as a legacy reference.

For reproducible runs, prefer shell scripts or workflow files with explicit CLI flags.
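
One way to keep flags explicit is to store them in a single place and generate the command line from that. The sketch below does this in Python for the direct-training example from section 4. build_command and TRAIN_OPTIONS are hypothetical names, the option values are copied from that example, and only "--flag value" options are handled (boolean switches would need extra logic).

```python
import shlex


def build_command(program, run_name, options):
    """Build an argv list: program, run name, then --flag value pairs."""
    argv = [program, run_name]
    for flag, value in options.items():
        argv.extend([f"--{flag}", str(value)])
    return argv


# Values copied from the direct-training example in this tutorial.
TRAIN_OPTIONS = {
    "dataset": "data/expression.csv",
    "epochs": 120,
    "n-modules": 24,
    "latent-dim": 32,
}

if __name__ == "__main__":
    # Print a shell-safe command line that can be logged or pasted into a script.
    print(shlex.join(build_command("bsvae-train", "study1", TRAIN_OPTIONS)))
```

The argv list can also be passed directly to subprocess.run, which avoids shell quoting issues. Committing the options dictionary alongside the results records exactly how each run was launched.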