Releases · jaydu1/causarray

01 Jun 13:18

jaydu1

v0.0.6

da5832d

v0.0.6 Latest

[0.0.6] - 2026-05-31

Added

Batch-wise fitting API (causarray/gcate.py, causarray/DR_learner.py, causarray/utils.py)
- fit_gcate_batch(Y, X, A, r, batch_size=10, max_cells=2000, n_ctrl=2000, ...):
  Fits GCATE independently on batches of perturbations. A shared control
  subsample of n_ctrl cells is reused across batches; dispersion is
  pre-estimated once on the control pool. Supports skip_batches to resume
  interrupted runs; reports per-batch wall time and ETA when verbose=True.
- gcate_lfc_batch(Y, X, A, r, batch_size=10, max_cells=2000, n_ctrl=2000, cache_path=None, ...):
  End-to-end batch pipeline — runs GCATE and LFC per batch, freeing large
  intermediate arrays after each batch. cache_path enables HDF5 disk
  caching via pandas.HDFStore so interrupted runs resume from the last
  completed batch. Returns a concatenated DataFrame with a 'batch' column.
- LFC_batch(...): deprecated alias for gcate_lfc_batch; emits
  DeprecationWarning and will be removed in a future release.
- n_batches parameter for fit_gcate_batch and gcate_lfc_batch:
  specifies the total number of batches instead of the number of
  perturbations per batch; overrides batch_size when set.
- estimate_r(max_cells=N, random_state=0): new parameters. When max_cells
  is set, estimate_r automatically subsamples to at most N cells before
  running JIC selection, prioritising control cells.
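
The interaction between batch_size, n_batches, and skip_batches described above can be sketched in plain Python. This is a minimal illustration only: plan_batches is a hypothetical helper, not a causarray function, and the even-split rule used when n_batches is set is an assumption (the release notes do not say how batch sizes are derived).

```python
import math


def plan_batches(perturbations, batch_size=10, n_batches=None, skip_batches=()):
    """Hypothetical helper: split perturbation labels into (index, batch) pairs."""
    perts = list(perturbations)
    if n_batches is not None:
        # n_batches overrides batch_size: spread perturbations as evenly as possible.
        if n_batches < 1:
            raise ValueError("n_batches must be >= 1")
        n_batches = min(n_batches, max(1, len(perts)))
        base, extra = divmod(len(perts), n_batches)
        sizes = [base + (1 if i < extra else 0) for i in range(n_batches)]
    else:
        # Fixed number of perturbations per batch; the last batch may be smaller.
        if batch_size < 1:
            raise ValueError("batch_size must be >= 1")
        sizes = [min(batch_size, len(perts) - s) for s in range(0, len(perts), batch_size)]

    skip = set(skip_batches)
    plan, start = [], 0
    for idx, size in enumerate(sizes):
        batch = perts[start:start + size]
        start += size
        if idx not in skip:  # batches finished in an earlier run are not redone
            plan.append((idx, batch))
    return plan


labels = [f"P{i}" for i in range(25)]
by_size = plan_batches(labels, batch_size=10)
by_count = plan_batches(labels, n_batches=4)
resumed = plan_batches(labels, batch_size=10, skip_batches=[0, 1])
```

Keeping the original batch index in each pair means a resumed run can label its results with the same batch numbers as the interrupted one.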
Fast GLM backend via crispyx (nb_glm_fast.py, gcate_glm.py)
- fit_glm_fast(): Batch NB-GLM fitting using crispyx's NBGLMBatchFitter,
  replacing per-gene statsmodels IRLS with vectorized batch IRLS.
- estimate_disp_fast(): Vectorized method-of-moments dispersion estimation.
- fit_glm_ondisk(): On-disk streaming GLM fitting for large h5ad files.
- Per-perturbation fitting (_fit_glm_fast_per_perturbation): for
  multi-treatment data, fits binary (ctrl vs. treatment_k) models
  independently, then assembles the full coefficient matrix.
- fit_glm_auto(): Routes to fit_glm_fast() when crispyx is available and
  the effective design dimension is small; falls back to statsmodels otherwise.
- estimate_disp_auto(): Routes to estimate_disp_fast() for large gene
  counts; falls back to statsmodels otherwise.
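
For readers unfamiliar with method-of-moments dispersion, here is a textbook sketch in NumPy. It assumes the NB2 parameterisation Var(Y) = mu + alpha * mu^2 and ignores size factors and covariates. mom_dispersion is a hypothetical name, and the estimator inside estimate_disp_fast() may differ.

```python
import numpy as np


def mom_dispersion(Y, min_disp=1e-8):
    """Per-gene method-of-moments dispersion for a cells x genes count matrix.

    Assumes Var(Y) = mu + alpha * mu**2, so alpha = (var - mu) / mu**2.
    Illustrative only; not the causarray implementation.
    """
    Y = np.asarray(Y, dtype=float)
    mean = Y.mean(axis=0)
    var = Y.var(axis=0, ddof=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        alpha = (var - mean) / mean**2
    # All-zero genes give 0/0; under-dispersed genes give negative values.
    alpha = np.where(np.isfinite(alpha), alpha, min_disp)
    return np.maximum(alpha, min_disp)


# Simulate three genes with a known dispersion of 0.5 and recover it.
rng = np.random.default_rng(0)
true_alpha = 0.5
mu = np.array([5.0, 20.0, 50.0])
size = 1.0 / true_alpha
counts = rng.negative_binomial(n=size, p=size / (size + mu), size=(20000, 3))
estimated = mom_dispersion(counts)
```

All genes are handled in one pass over column means and variances, which is what makes this kind of estimator cheap compared with per-gene iterative fitting.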

Fixed

- Numba TBB fork warning: Set NUMBA_THREADING_LAYER_PRIORITY to prefer
  OpenMP over TBB in __init__.py, eliminating fork warnings when Joblib forks
  after Numba parallel execution. Added llvm-openmp to conda dependencies.
- Fast-path threshold (gcate_glm.py): Raised the effective design-dimension
  ceiling so larger r values and wide batch designs correctly use the crispyx path.
- Backend toggle (gcate_glm.py): Re-added _USE_FAST_BACKEND module flag
  and _backend_override() context manager for reliable statsmodels fallback.
- Weighted dispersion (nb_glm_fast.py): Dispersion averaging is now
  cell-count-weighted; low-coverage perturbations contribute proportionally less.
- Control-cell residuals (nb_glm_fast.py): Fixed last-perturbation overwrite
  bug; control-cell deviance residuals and fitted values are now initialised from
  the global covariate model.
- Module-qualified imports (gcate_opt.py, gcate.py, DR_estimation.py):
  Backend toggles now propagate correctly at call time.
- estimate_r bare name (gcate.py): Fixed NameError caused by a bare
  fit_glm_auto reference.
- crispyx availability check (gcate_glm.py): Users without crispyx now get
  a transparent fallback to statsmodels instead of a traceback.
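
The backend-toggle and module-qualified-import fixes above rest on a general Python behaviour: a name copied out of a module at import time does not see later changes to the module attribute, while a module-qualified lookup is resolved at call time. The sketch below shows this with an in-memory stand-in module. All names here (fake_backend, USE_FAST, backend_override) are hypothetical and only loosely mirror _USE_FAST_BACKEND and _backend_override().

```python
import types
from contextlib import contextmanager

# Stand-in for a backend module, built in memory for illustration.
backend = types.ModuleType("fake_backend")
backend.USE_FAST = True


@contextmanager
def backend_override(use_fast):
    """Temporarily set the module flag, restoring the old value on exit."""
    previous = backend.USE_FAST
    backend.USE_FAST = use_fast
    try:
        yield
    finally:
        backend.USE_FAST = previous


# Import-time binding: behaves like `from fake_backend import USE_FAST`.
USE_FAST_COPY = backend.USE_FAST


def pick_backend_stale():
    # Reads the copied value, so it never notices an override.
    return "fast" if USE_FAST_COPY else "statsmodels"


def pick_backend_live():
    # Module-qualified lookup happens at call time, so the override is seen.
    return "fast" if backend.USE_FAST else "statsmodels"


with backend_override(False):
    inside = (pick_backend_stale(), pick_backend_live())
after = pick_backend_live()
```

Inside the with block only the module-qualified version switches to the statsmodels path; once the block exits, the flag is restored.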

Changed

⚠️ alter_min() early-stopping defaults (gcate_opt.py):
- Default kwargs_es['max_iters'] reduced from 500 → 50.
- Default tolerance reduced from 1e-3 → 0.0; new scale-invariant
  rel_tol=2e-4 introduced. To reproduce pre-v0.0.6 behavior, pass
  kwargs_es_1=dict(max_iters=500) and kwargs_es_2=dict(max_iters=500).
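
To see why a relative tolerance is scale-invariant while an absolute one is not, consider the following sketch of a stopping rule. It is an assumption-laden illustration: should_stop is a hypothetical function, the monitored quantity is taken to be a scalar loss, and the exact criterion used in gcate_opt.py is not described in these notes.

```python
def should_stop(prev_loss, loss, iteration, max_iters=50, tol=0.0, rel_tol=2e-4):
    """Hypothetical early-stopping rule combining absolute and relative tolerances."""
    if iteration >= max_iters:
        return True
    change = abs(prev_loss - loss)
    # Absolute test depends on the units of the loss; relative test does not.
    return change <= tol or change <= rel_tol * abs(prev_loss)


# Same optimisation step expressed at three different loss scales.
steps = {scale: (100.0 * scale, 99.99 * scale) for scale in (1e-3, 1.0, 1e3)}

# Relative tolerance only (the new-style defaults in this sketch).
relative = {s: should_stop(a, b, iteration=1) for s, (a, b) in steps.items()}

# Absolute tolerance only (old-style tol=1e-3, relative test disabled).
absolute = {s: should_stop(a, b, iteration=1, tol=1e-3, rel_tol=0.0)
            for s, (a, b) in steps.items()}
```

The relative rule gives the same decision at every scale, whereas the absolute rule stops only when the loss happens to be expressed in small units.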
⚠️ BREAKING — LFC() variance and default usevar (DR_learner.py):
- Default usevar changed from 'pooled' to 'unequal' (Welch). Revert
  with LFC(..., usevar='pooled') if reproducing pre-v0.0.6 results.
- 'unequal' formula corrected: variance is now s₀²/n₀ + s₁²/n₁
  (standard Welch); the prior version used (s₀²/n₀ + s₁²/n₁)/2
  ("half-Welch"), which under-estimated the standard error by a factor of √2.
- p-values now use the t-distribution with Welch-Satterthwaite degrees of
  freedom per gene; the prior version used a Normal approximation.
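
The variance correction and the Welch-Satterthwaite degrees of freedom can be checked with a few lines of standard-library Python. The function names and the example numbers are made up for illustration; this is the standard two-sample Welch formula, not code taken from DR_learner.py.

```python
import math


def welch_se_df(s0_sq, n0, s1_sq, n1):
    """Standard Welch standard error and Welch-Satterthwaite degrees of freedom."""
    v0, v1 = s0_sq / n0, s1_sq / n1
    var = v0 + v1
    df = var**2 / (v0**2 / (n0 - 1) + v1**2 / (n1 - 1))
    return math.sqrt(var), df


def half_welch_se(s0_sq, n0, s1_sq, n1):
    """The pre-v0.0.6 'half-Welch' standard error described in the notes."""
    return math.sqrt((s0_sq / n0 + s1_sq / n1) / 2)


# Illustrative inputs: control group (variance 4, 100 cells), treated (variance 9, 50 cells).
se_new, df = welch_se_df(4.0, 100, 9.0, 50)
se_old = half_welch_se(4.0, 100, 9.0, 50)
ratio = se_new / se_old  # how much the old formula under-estimated the SE
```

A two-sided p-value for a test statistic t would then come from the t-distribution with df degrees of freedom, for example via scipy.stats.t.sf(abs(t), df) * 2, which is more conservative than a Normal approximation when df is small.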
alter_min() initialisation, _check_input(), estimate_r(), and
cross_fitting() now use the auto-dispatch GLM/dispersion paths.
LFC() now accepts backend: str = "auto" ("fast" forces crispyx,
"original" forces statsmodels).
comp_size_factor() vectorized with np.nanmean/np.nanmedian.

Performance

Benchmarked on Perturb-seq data (n = 2,926 cells, p = 3,221 genes, 29 perturbations):

Component	Original	Fast	Speedup
GCATE	331.6 s	298.5 s	1.1×
LFC	87.8 s	65.7 s	1.3×
Total	419.3 s	364.2 s	1.2×
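
The speedup column is the ratio of original to fast wall time, rounded to one decimal. A quick arithmetic check of the table values:

```python
# Wall times in seconds, copied from the table: (original, fast).
timings = {
    "GCATE": (331.6, 298.5),
    "LFC": (87.8, 65.7),
    "Total": (419.3, 364.2),
}

speedups = {name: round(orig / fast, 1) for name, (orig, fast) in timings.items()}

# The component times add up to the reported totals up to rounding of the inputs.
orig_gap = abs(timings["GCATE"][0] + timings["LFC"][0] - timings["Total"][0])
fast_gap = abs(timings["GCATE"][1] + timings["LFC"][1] - timings["Total"][1])
```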

On synthetic data (n = 500, p = 200): 61.5× GLM fit speedup, 7.1× imputation speedup.
Latent factor recovery: mean canonical correlation 0.998. LFC correlation: 0.856.

Additional LFC throughput improvements on the Replogle tutorial dataset
(79,865 cells × 8,563 genes, 200 perturbations, 14 batches):

Change	Wall-time change (speedup)	Accuracy impact
Stage 1 max_iter 50 → 5 (NB) / 10 (Poisson)	−10 min	identical (r=1.000)
Stage 1 ≤3,000-cell mixed subsample	−55 min	tau r=0.992, Jaccard=0.80
Stage 2 joint fit	−5 min	tau r=0.9994, Jaccard=0.975
Combined	−70.6 min / 1.48×	tau r=0.9994, Jaccard=0.975