Releases: jaydu1/causarray
v0.0.6
[0.0.6] - 2026-05-31
Added
- Batch-wise fitting API (`causarray/gcate.py`, `causarray/DR_learner.py`, `causarray/utils.py`)
  - `fit_gcate_batch(Y, X, A, r, batch_size=10, max_cells=2000, n_ctrl=2000, ...)`:
    Fits GCATE independently on batches of perturbations. A shared control
    subsample of `n_ctrl` cells is reused across batches; dispersion is
    pre-estimated once on the control pool. Supports `skip_batches` to resume
    interrupted runs; reports per-batch wall time and ETA when `verbose=True`.
  - `gcate_lfc_batch(Y, X, A, r, batch_size=10, max_cells=2000, n_ctrl=2000, cache_path=None, ...)`:
    End-to-end batch pipeline: runs GCATE and LFC per batch, freeing large
    intermediate arrays after each batch. `cache_path` enables HDF5 disk
    caching via `pandas.HDFStore` so interrupted runs resume from the last
    completed batch. Returns a concatenated DataFrame with a `'batch'` column.
  - `LFC_batch(...)`: deprecated alias for `gcate_lfc_batch`; emits
    `DeprecationWarning` and will be removed in a future release.
  - `n_batches` parameter for `fit_gcate_batch` and `gcate_lfc_batch`:
    specifies the total number of batches instead of the per-batch count; overrides
    `batch_size` when set.
  - `estimate_r(max_cells=N, random_state=0)`: new parameter that
    automatically subsamples to at most `N` cells before running JIC
    selection, prioritising control cells.
- Fast GLM backend via crispyx (`nb_glm_fast.py`, `gcate_glm.py`)
  - `fit_glm_fast()`: Batch NB-GLM fitting using crispyx's `NBGLMBatchFitter`,
    replacing per-gene statsmodels IRLS with vectorized batch IRLS.
  - `estimate_disp_fast()`: Vectorized method-of-moments dispersion estimation.
  - `fit_glm_ondisk()`: On-disk streaming GLM fitting for large h5ad files.
  - Per-perturbation fitting (`_fit_glm_fast_per_perturbation`): for
    multi-treatment data, fits binary (ctrl vs. treatment_k) models
    independently, then assembles the full coefficient matrix.
  - `fit_glm_auto()`: Routes to `fit_glm_fast()` when crispyx is available and
    the effective design dimension is small; falls back to statsmodels otherwise.
  - `estimate_disp_auto()`: Routes to `estimate_disp_fast()` for large gene
    counts; falls back to statsmodels otherwise.
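
As a rough illustration of what a vectorized method-of-moments dispersion estimate can look like, the sketch below assumes the usual negative-binomial parameterisation Var(y) = μ + α·μ² and solves for α per gene in one NumPy pass. The function name `mom_dispersion` is hypothetical, size factors and covariates are ignored, and this is not the code behind `estimate_disp_fast()`.

```python
import numpy as np

def mom_dispersion(Y, eps=1e-8):
    """Per-gene NB dispersion by method of moments (illustrative only).

    Assumes Var(y) = mu + alpha * mu**2, so alpha = (var - mu) / mu**2.
    Y is a (cells x genes) count matrix; size factors are ignored here.
    """
    Y = np.asarray(Y, dtype=float)
    mu = Y.mean(axis=0)                      # per-gene mean
    var = Y.var(axis=0, ddof=1)              # per-gene sample variance
    alpha = (var - mu) / np.maximum(mu, eps) ** 2
    return np.clip(alpha, 0.0, None)         # under-dispersed genes map to 0

# Simulate NB counts with a known dispersion to sanity-check the estimator.
rng = np.random.default_rng(0)
true_alpha = 0.5
mu = np.array([5.0, 20.0, 50.0])
n = 1.0 / true_alpha                         # NumPy's NB "n" parameter
Y = rng.negative_binomial(n, n / (n + mu), size=(20000, 3))
alpha_hat = mom_dispersion(Y)
print(alpha_hat)
```

Because every step is an array reduction over the cell axis, the cost grows with the size of the matrix rather than with the number of per-gene model fits.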
Fixed
- Numba TBB fork warning: Set `NUMBA_THREADING_LAYER_PRIORITY` to prefer
  OpenMP over TBB in `__init__.py`, eliminating fork warnings when Joblib forks
  after Numba parallel execution. Added `llvm-openmp` to conda dependencies.
- Fast-path threshold (`gcate_glm.py`): Raised the effective design-dimension
  ceiling so larger `r` values and wide batch designs correctly use the crispyx path.
- Backend toggle (`gcate_glm.py`): Re-added `_USE_FAST_BACKEND` module flag
  and `_backend_override()` context manager for reliable statsmodels fallback.
- Weighted dispersion (`nb_glm_fast.py`): Dispersion averaging is now
  cell-count-weighted; low-coverage perturbations contribute proportionally less.
- Control-cell residuals (`nb_glm_fast.py`): Fixed last-perturbation overwrite
  bug; control-cell deviance residuals and fitted values are now initialised from
  the global covariate model.
- Module-qualified imports (`gcate_opt.py`, `gcate.py`, `DR_estimation.py`):
  Backend toggles now propagate correctly at call time.
- `estimate_r` bare name (`gcate.py`): Fixed `NameError` caused by a bare
  `fit_glm_auto` reference.
- crispyx availability check (`gcate_glm.py`): Users without crispyx now get
  a transparent fallback to statsmodels instead of a traceback.
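
To make the weighted-dispersion fix concrete, here is a minimal sketch of cell-count-weighted averaging using `np.average`. The helper name `weighted_mean_dispersion` and the toy numbers are assumptions made for illustration; the actual logic in `nb_glm_fast.py` may differ.

```python
import numpy as np

def weighted_mean_dispersion(disp, n_cells):
    """Average per-perturbation dispersions, weighting by cell count.

    disp:    (perturbations x genes) dispersion estimates
    n_cells: (perturbations,) number of cells behind each estimate
    """
    disp = np.asarray(disp, dtype=float)
    weights = np.asarray(n_cells, dtype=float)
    return np.average(disp, axis=0, weights=weights)

# Two well-covered perturbations and one noisy, low-coverage one.
disp = np.array([[0.2, 1.0],
                 [0.4, 1.0],
                 [2.0, 4.0]])
n_cells = np.array([450, 450, 100])

weighted = weighted_mean_dispersion(disp, n_cells)
unweighted = disp.mean(axis=0)
print(weighted, unweighted)
```

In this toy case the low-coverage row pulls the unweighted mean much further than the weighted one, which is the effect the fix is described as reducing.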
Changed
- ⚠️ `alter_min()` early-stopping defaults (`gcate_opt.py`):
  - Default `kwargs_es['max_iters']` reduced from 500 → 50.
  - Default `tolerance` reduced from `1e-3` → `0.0`; new scale-invariant
    `rel_tol=2e-4` introduced. To reproduce pre-v0.0.6 behavior, pass
    `kwargs_es_1=dict(max_iters=500)` and `kwargs_es_2=dict(max_iters=500)`.
- ⚠️ BREAKING: `LFC()` variance and default `usevar` (`DR_learner.py`):
  - Default `usevar` changed from `'pooled'` to `'unequal'` (Welch). Revert
    with `LFC(..., usevar='pooled')` if reproducing pre-v0.0.6 results.
  - `'unequal'` formula corrected: variance is now `s₀²/n₀ + s₁²/n₁`
    (standard Welch); the prior version used `(s₀²/n₀ + s₁²/n₁)/2`
    ("half-Welch"), under-estimating the standard error by a factor of √2.
  - p-values now use the t-distribution with Welch-Satterthwaite degrees of
    freedom per gene; the prior version used a Normal approximation.
- `alter_min()` initialisation, `_check_input()`, `estimate_r()`, and
  `cross_fitting()` now use the auto-dispatch GLM/dispersion paths.
- `LFC()` now accepts `backend: str = "auto"` (`"fast"` forces crispyx,
  `"original"` forces statsmodels).
- `comp_size_factor()` vectorized with `np.nanmean`/`np.nanmedian`.
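
The Welch variance and Welch-Satterthwaite degrees of freedom described in the `LFC()` entry can be sketched for a single gene as follows. This is a generic textbook version on two plain samples, with a hypothetical helper name `welch_test`; it is not how `LFC()` itself builds its estimates, and the "half-Welch" line is included only to show the √2 gap the entry mentions.

```python
import numpy as np
from scipy import stats

def welch_test(x0, x1):
    """Welch's unequal-variance t-test for one gene (illustrative only).

    Returns (standard error, degrees of freedom, two-sided p-value).
    """
    x0 = np.asarray(x0, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    n0, n1 = x0.size, x1.size
    v0 = x0.var(ddof=1) / n0                 # s0^2 / n0
    v1 = x1.var(ddof=1) / n1                 # s1^2 / n1
    se = np.sqrt(v0 + v1)                    # standard Welch standard error
    # Welch-Satterthwaite degrees of freedom
    df = (v0 + v1) ** 2 / (v0 ** 2 / (n0 - 1) + v1 ** 2 / (n1 - 1))
    t = (x1.mean() - x0.mean()) / se
    p = 2.0 * stats.t.sf(abs(t), df)         # t-distribution, not Normal
    return se, df, p

rng = np.random.default_rng(0)
ctrl = rng.normal(0.0, 1.0, size=200)        # many control cells
trt = rng.normal(0.5, 3.0, size=30)          # few, noisier treated cells
se, df, p = welch_test(ctrl, trt)

# "Half-Welch" variance from the entry above, for comparison only.
se_half = np.sqrt((ctrl.var(ddof=1) / 200 + trt.var(ddof=1) / 30) / 2)
print(se / se_half)                          # ratio of the two standard errors
```

With unbalanced groups like these, the degrees of freedom land close to the smaller group's size, so the t-distribution tails are noticeably heavier than a Normal approximation would suggest.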
Performance
Benchmarked on Perturb-seq data (n = 2,926 cells, p = 3,221 genes, 29 perturbations):
| Component | Original | Fast | Speedup |
|---|---|---|---|
| GCATE | 331.6 s | 298.5 s | 1.1× |
| LFC | 87.8 s | 65.7 s | 1.3× |
| Total | 419.3 s | 364.2 s | 1.2× |
On synthetic data (n = 500, p = 200): 61.5× GLM fit speedup, 7.1× imputation speedup.
Latent factor recovery: mean canonical correlation 0.998. LFC correlation: 0.856.
Additional LFC throughput improvements on the Replogle tutorial dataset
(79,865 cells × 8,563 genes, 200 perturbations, 14 batches):
| Change | Speedup contribution | Accuracy impact |
|---|---|---|
| Stage 1 max_iter 50 → 5 (NB) / 10 (Poisson) | −10 min | identical (r=1.000) |
| Stage 1 ≤3,000-cell mixed subsample | −55 min | tau r=0.992, Jaccard=0.80 |
| Stage 2 joint fit | −5 min | tau r=0.9994, Jaccard=0.975 |
| Combined | −70.6 min / 1.48× | tau r=0.9994, Jaccard=0.975 |
Full-run: 217.5 min → 146.9 min (1.48× faster); sig pairs −0.2%, perts with ≥1 hit −0.6%.
v0.0.5
- Add more tuning options for random-forest classifiers
- Improve the uninformative tests thresholding
Full Changelog: v0.0.4...v0.0.5