Tutorial
This tutorial walks through the current CLI workflow from data loading to post-training analysis and simulation benchmarking.
1. Confirm The Data Layout
BSVAE trains on feature profiles, not sample profiles. The main matrix must be features x samples.
rows are feature IDs
columns are sample IDs
CSV and TSV files use the first column as the row index
Supported formats:
.csv/.csv.gz.tsv/.tsv.gz.h5/.hdf5.h5adwith optionalanndata
2. Run A Minimal Training Job
Use a short run first to confirm the install and data format.
bsvae-train tutorial_min \
--dataset data/expression.csv \
--epochs 5 \
--batch-size 64 \
--n-modules 8 \
--latent-dim 12
Sanity checks:
results/tutorial_min/model.ptexistsresults/tutorial_min/specs.jsonexistsresults/tutorial_min/train_losses.csvexists
3. Select The Number Of Modules
--n-modules controls the number of Gaussian-mixture components in the prior. For real analyses, the recommended path is bsvae-sweep-k.
bsvae-sweep-k sweep_prod \
--dataset data/expression.csv \
--k-grid 8,12,16,24,32 \
--sweep-epochs 60 \
--stability-reps 5 \
--val-frac 0.1
Key outputs:
results/sweep_prod/sweep_k/sweep_results.csvresults/sweep_prod/sweep_k/sweep_summary.jsonper-K replicate directories under
results/sweep_prod/sweep_k/k<K>/rep_<rep>/final retrained model under
results/sweep_prod/final_k<K>/
Selection behavior:
with
--stability-reps 1, the bestKis chosen by validation losswith
--stability-reps > 1, the bestKis chosen by mean pairwise ARI on held-out features
4. Train A Final Model Directly
If you already know K, train directly with bsvae-train.
bsvae-train study1 \
--dataset data/expression.csv \
--epochs 120 \
--n-modules 24 \
--latent-dim 32
Useful flags to review for production runs:
--normalize-input--warmup-epochs--transition-epochs--free-bits--sep-strength--bal-strength--checkpoint-every
5. Extract Feature Networks
bsvae-networks extract-networks builds sparse feature-feature graphs from trained latents.
bsvae-networks extract-networks \
--model-path results/sweep_prod/final_k16 \
--dataset data/expression.csv \
--output-dir results/sweep_prod/final_k16/networks \
--methods mu_cosine gamma_knn \
--top-k 50
Methods:
mu_cosine: top-k cosine neighbors in latent mean spacegamma_knn: FAISS-based kNN graph in soft-assignment space
Outputs are sparse adjacency files such as:
mu_cosine_adjacency.npzgamma_knn_adjacency.npz
6. Extract Modules
extract-modules saves soft and hard GMM assignments. Add --expr and --soft-eigengenes to compute eigengenes.
bsvae-networks extract-modules \
--model-path results/sweep_prod/final_k16 \
--dataset data/expression.csv \
--output-dir results/sweep_prod/final_k16/modules \
--expr data/expression.csv \
--soft-eigengenes
Primary outputs:
gamma.npzhard_assignments.npzsoft_eigengenes.csvwhen requested
Optional comparison outputs:
leiden_modules.csvwith--use-leidengamma_gene.npzandhard_assignments_gene.npzwith--aggregate-to-gene --tx2gene
7. Export Latents
export-latents writes a compressed NumPy archive.
bsvae-networks export-latents \
--model-path results/sweep_prod/final_k16 \
--dataset data/expression.csv \
--output results/sweep_prod/final_k16/latents
Saved arrays:
mulogvargammafeature_ids
8. Analyze The Latent Space
bsvae-networks latent-analysis \
--model-path results/sweep_prod/final_k16 \
--dataset data/expression.csv \
--output-dir results/sweep_prod/final_k16/latent_analysis \
--kmeans-k 16 \
--umap
Typical outputs:
latent_mu.csvlatent_logvar.csvlatent_clusters.csvwhen clustering is requestedlatent_embeddings.csvwhen--umapor--tsneis usedlatent_covariate_correlations.csvwhen--covariatesis provided
9. Generate Synthetic Data And Benchmark Recovery
Generate one dataset:
bsvae-simulate generate \
--output data/sim_expr.csv \
--save-ground-truth data/sim_truth.csv
Benchmark a trained model against the ground truth:
bsvae-simulate benchmark \
--dataset data/sim_expr.csv \
--ground-truth data/sim_truth.csv \
--model-path results/sweep_prod/final_k16 \
--output results/sweep_prod/final_k16/sim_metrics.json
Generate a publication-style scenario grid:
bsvae-simulate init-config --output sim.yaml
bsvae-simulate generate-grid \
--config sim.yaml \
--outdir results/sim_pub_v1 \
--reps 30 \
--base-seed 13
bsvae-simulate validate-grid --grid-dir results/sim_pub_v1
Each replicate directory includes method-ready files such as:
expr/features_x_samples.tsv.gzexpr/samples_x_features.tsv.gztruth/modules_hard.csvmethod_inputs.json
10. Common Problems
Data orientation is wrong: transpose sample-by-feature matrices before training.
CUDA memory is tight: reduce
--batch-sizeor use--no-cuda.gamma_knnfails: verifyfaiss-cpuis installed (required dependency, but may be missing in some custom envs).Hierarchical options fail: make sure
--tx2genematches the matrix row IDs.No eigengene file appears:
--soft-eigengenesonly writes output when--expris supplied.
11. Legacy Configuration Note
The active CLI does not use hyperparam.ini. The files in src/bsvae/hyperparam.ini and docs/hyperparam.ini are retained for compatibility context only.
For reproducible runs, prefer shell scripts or workflow files with explicit CLI flags.