Tutorial

This tutorial walks through the current CLI workflow from data loading to post-training analysis and simulation benchmarking.

1. Confirm The Data Layout

BSVAE trains on feature profiles, not sample profiles. The main matrix must be features x samples.

  • rows are feature IDs

  • columns are sample IDs

  • CSV and TSV files use the first column as the row index

Supported formats:

  • .csv / .csv.gz

  • .tsv / .tsv.gz

  • .h5 / .hdf5

  • .h5ad with optional anndata

2. Run A Minimal Training Job

Use a short run first to confirm the install and data format.

bsvae-train tutorial_min \
  --dataset data/expression.csv \
  --epochs 5 \
  --batch-size 64 \
  --n-modules 8 \
  --latent-dim 12

Sanity checks:

  • results/tutorial_min/model.pt exists

  • results/tutorial_min/specs.json exists

  • results/tutorial_min/train_losses.csv exists

3. Select The Number Of Modules

--n-modules controls the number of Gaussian-mixture components in the prior. For real analyses, the recommended path is bsvae-sweep-k.

bsvae-sweep-k sweep_prod \
  --dataset data/expression.csv \
  --k-grid 8,12,16,24,32 \
  --sweep-epochs 60 \
  --stability-reps 5 \
  --val-frac 0.1

Key outputs:

  • results/sweep_prod/sweep_k/sweep_results.csv

  • results/sweep_prod/sweep_k/sweep_summary.json

  • per-K replicate directories under results/sweep_prod/sweep_k/k<K>/rep_<rep>/

  • final retrained model under results/sweep_prod/final_k<K>/

Selection behavior:

  • with --stability-reps 1, the best K is chosen by validation loss

  • with --stability-reps > 1, the best K is chosen by mean pairwise ARI on held-out features

4. Train A Final Model Directly

If you already know K, train directly with bsvae-train.

bsvae-train study1 \
  --dataset data/expression.csv \
  --epochs 120 \
  --n-modules 24 \
  --latent-dim 32

Useful flags to review for production runs:

  • --normalize-input

  • --warmup-epochs

  • --transition-epochs

  • --free-bits

  • --sep-strength

  • --bal-strength

  • --checkpoint-every

5. Extract Feature Networks

bsvae-networks extract-networks builds sparse feature-feature graphs from trained latents.

bsvae-networks extract-networks \
  --model-path results/sweep_prod/final_k16 \
  --dataset data/expression.csv \
  --output-dir results/sweep_prod/final_k16/networks \
  --methods mu_cosine gamma_knn \
  --top-k 50

Methods:

  • mu_cosine: top-k cosine neighbors in latent mean space

  • gamma_knn: FAISS-based kNN graph in soft-assignment space

Outputs are sparse adjacency files such as:

  • mu_cosine_adjacency.npz

  • gamma_knn_adjacency.npz

6. Extract Modules

extract-modules saves soft and hard GMM assignments. Add --expr and --soft-eigengenes to compute eigengenes.

bsvae-networks extract-modules \
  --model-path results/sweep_prod/final_k16 \
  --dataset data/expression.csv \
  --output-dir results/sweep_prod/final_k16/modules \
  --expr data/expression.csv \
  --soft-eigengenes

Primary outputs:

  • gamma.npz

  • hard_assignments.npz

  • soft_eigengenes.csv when requested

Optional comparison outputs:

  • leiden_modules.csv with --use-leiden

  • gamma_gene.npz and hard_assignments_gene.npz with --aggregate-to-gene --tx2gene

7. Export Latents

export-latents writes a compressed NumPy archive.

bsvae-networks export-latents \
  --model-path results/sweep_prod/final_k16 \
  --dataset data/expression.csv \
  --output results/sweep_prod/final_k16/latents

Saved arrays:

  • mu

  • logvar

  • gamma

  • feature_ids

8. Analyze The Latent Space

bsvae-networks latent-analysis \
  --model-path results/sweep_prod/final_k16 \
  --dataset data/expression.csv \
  --output-dir results/sweep_prod/final_k16/latent_analysis \
  --kmeans-k 16 \
  --umap

Typical outputs:

  • latent_mu.csv

  • latent_logvar.csv

  • latent_clusters.csv when clustering is requested

  • latent_embeddings.csv when --umap or --tsne is used

  • latent_covariate_correlations.csv when --covariates is provided

9. Generate Synthetic Data And Benchmark Recovery

Generate one dataset:

bsvae-simulate generate \
  --output data/sim_expr.csv \
  --save-ground-truth data/sim_truth.csv

Benchmark a trained model against the ground truth:

bsvae-simulate benchmark \
  --dataset data/sim_expr.csv \
  --ground-truth data/sim_truth.csv \
  --model-path results/sweep_prod/final_k16 \
  --output results/sweep_prod/final_k16/sim_metrics.json

Generate a publication-style scenario grid:

bsvae-simulate init-config --output sim.yaml

bsvae-simulate generate-grid \
  --config sim.yaml \
  --outdir results/sim_pub_v1 \
  --reps 30 \
  --base-seed 13

bsvae-simulate validate-grid --grid-dir results/sim_pub_v1

Each replicate directory includes method-ready files such as:

  • expr/features_x_samples.tsv.gz

  • expr/samples_x_features.tsv.gz

  • truth/modules_hard.csv

  • method_inputs.json

10. Common Problems

  • Data orientation is wrong: transpose sample-by-feature matrices before training.

  • CUDA memory is tight: reduce --batch-size or use --no-cuda.

  • gamma_knn fails: verify faiss-cpu is installed (required dependency, but may be missing in some custom envs).

  • Hierarchical options fail: make sure --tx2gene matches the matrix row IDs.

  • No eigengene file appears: --soft-eigengenes only writes output when --expr is supplied.

11. Legacy Configuration Note

The active CLI does not use hyperparam.ini. The files in src/bsvae/hyperparam.ini and docs/hyperparam.ini are retained for compatibility context only.

For reproducible runs, prefer shell scripts or workflow files with explicit CLI flags.