Prakhar Kaushik, Shravan Chaudhari, Ankit Vaidya, Rama Chellappa, Alan Yuille
Independently trained models keep returning to the same low-dimensional region in weight space.
This is the official repository for the Universal Weight Subspace Hypothesis paper.
Universal Weight Subspace Hypothesis: Backpropagated neural networks - despite being trained on diverse, potentially disjoint datasets with varying hyperparameters, initializations, and regularization techniques - systematically converge to learn architecture-specific, layer-wise low-rank joint subspaces, which we term the Universal Subspace.
For the full paper, use arXiv or the local paper PDF.
| Experiments | Location |
|---|---|
| ViT analysis | VIT |
| GPT-2 analysis | GPT2 |
| LLaMA analysis | LLaMA |
| ResNet-50 experiments | CNN |
| additional ResNet-50 (+alpha) analysis | additional_R50 |
| additional ViT, GPT-2, and LLaMA analysis | Main_weight_analysis |
| Mistral LoRA / generation experiments | NLG |
| GLUE / RoBERTa adapter experiments | NLU |
| ViT adapter image classification | ImageClassification |
| SDXL style LoRAs | Diffusion |
| image generation (StyleGAN2) | StyleGANV2 |
| 3D reconstruction (Geometric Distribution) | GeomDist |
Low-rank adapter (LoRA) results are also a major part of the paper and are included directly in this repo.
| Experiment/Task | What it covers | Location |
|---|---|---|
| NLG | Mistral LoRA reconstruction and SubspaceAdapter evaluation across generation tasks, with ROUGE-L scoring | NLG |
| NLU | RoBERTa on GLUE with LoRA, HOSVD, and SubspaceAdapter initialization | NLU |
| ImageClassification | ViT image classification with LoRA, VeRA, and SubspaceAdapter | ImageClassification |
| Diffusion | SDXL style LoRAs, reconstructed adapters, and SubspaceAdapter outputs for image generation | Diffusion |
| diffusers | Local diffusion adapter plumbing for PEFT and SubspaceAdapter state dicts and loaders | diffusers |
For a cleaner dedicated PEFT codebase and further implementation details, refer to EigenLoRA.
| Experiment | What it covers | What to look at |
|---|---|---|
| CNN | scratch ResNet-50 training and subspace reconstruction | the main ResNet-50 workflow used in the paper |
| additional_R50 | 9 ResNet-50 models with spatial analysis and alpha filtering | mean top-1 reaches 0.8085 at 95% explained variance, with coefficient calibration included |
| ViT | 464 models, 18 retained layers in the extended analysis | q_all = 0.0518, q_arch = 0.3545, and 76.99% savings at 80% explained variance |
| GPT-2 | 177 models, 23 retained layers in the extended analysis | q_all = 0.0725, q_arch = 0.3059, and 64.73% savings at 80% explained variance |
| LLaMA | 50 models, 116 retained layers in the extended analysis | q_all = 0.0940, q_arch = 0.3290, showing the same structure at larger scale |
| Flan-T5 GLUE | 196 task-adapted checkpoints in the recent analysis release | q_all = 0.0224, q_arch = 0.4850, and 85.13% savings at 90% explained variance |
| StyleGAN2 | 20 public generators, 9 retained convolution layers | the released UniSub reconstructions stay visually close to the source generators |
| GeomDist | retained spectral layers for neural geometry | the reconstructed loong shape stays faithful while still giving a useful compression gain |
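The explained-variance targets quoted in the table (80%, 90%, 95%) can be read as a rank-selection rule. The sketch below is illustrative only: it assumes a PCA-style decomposition in which the singular values of a stacked (and, by assumption, centered) matrix of per-model layer weights are already available, and the function name `components_for_explained_variance` is hypothetical rather than part of this repo.

```python
def components_for_explained_variance(singular_values, target):
    """Return the smallest k whose top-k squared singular values reach
    `target` (a fraction in [0, 1]) of the total variance.

    Assumption: variance is measured by squared singular values, as in PCA.
    """
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total <= 0.0:
        return 0  # degenerate spectrum: nothing to retain
    running = 0.0
    for k, energy in enumerate(energies, start=1):
        running += energy
        if running / total >= target:
            return k
    return len(energies)
```

A lower target keeps fewer basis vectors per layer, which is the trade-off behind the savings figures reported per model family.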
The alpha metric is the layer-selection score we use to decide which parts of an architecture are good candidates for a shared UniSub basis.
For each corresponding layer l across a model family, we estimate one power-law exponent alpha_(m,l) per model m using a WeightWatcher-style fit to the tail of that layer's eigenvalue spectrum. The key idea is simple: if a layer keeps landing in the same well-behaved spectral regime across independently trained models, it is a much better candidate for a shared basis than a layer whose spectrum is erratic from model to model.
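As a rough illustration of what a tail power-law fit involves, the sketch below uses the standard continuous maximum-likelihood estimator for a power-law exponent above a cutoff. It is not the WeightWatcher implementation: the cutoff `xmin` is passed in by hand here (an assumption, since a full fit would also select it), and the function name is hypothetical.

```python
import math


def powerlaw_alpha_mle(eigenvalues, xmin):
    """Continuous power-law MLE on the tail x >= xmin:
    alpha = 1 + n / sum(log(x / xmin)).

    Assumption: xmin is supplied by the caller instead of being fitted.
    Returns NaN when the tail is too small to give a finite estimate.
    """
    tail = [x for x in eigenvalues if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if len(tail) < 2 or log_sum <= 0.0:
        return float("nan")
    return 1.0 + len(tail) / log_sum
```

Non-finite results from such a fit are exactly the values that would be dropped before computing the per-layer statistics.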
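Let

$$\{\alpha_{m,l}\}_{m=1}^{N_l}$$

be the finite alpha values for layer l, where N_l is the number of valid models for that layer. We first compute the basic summary statistics of these values (including their dispersion sigma_l across models) and the fraction of models whose alpha lies in the preferred range [alpha_min, alpha_max].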
In the current pipeline we use alpha_min = 2, alpha_max = 6, which is the band we treat as the most reliable spectral regime.
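A minimal sketch of these per-layer statistics follows. The band limits come from the text; everything else is an assumption made for illustration: the function name, the use of the population standard deviation as the dispersion measure, and treating the band as inclusive at both ends.

```python
import math
import statistics

ALPHA_MIN, ALPHA_MAX = 2.0, 6.0  # preferred band stated in the text


def layer_alpha_summary(alphas):
    """Summarize one layer's alpha values across models.

    Non-finite values (NaN, inf) are dropped first, so `n` plays the role of N_l.
    """
    finite = [a for a in alphas if math.isfinite(a)]
    n = len(finite)
    if n == 0:
        return {"n": 0, "mean": float("nan"), "sigma": float("nan"), "in_range_frac": 0.0}
    in_range = sum(1 for a in finite if ALPHA_MIN <= a <= ALPHA_MAX)
    return {
        "n": n,
        "mean": statistics.fmean(finite),
        "sigma": statistics.pstdev(finite),  # assumption: population std
        "in_range_frac": in_range / n,       # assumption: inclusive bounds
    }
```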
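For each individual alpha value, the code defines an in-range quality term phi(alpha), with eta = 0.5. A layer receives a high score only if it has enough evidence, enough in-range alphas, and low dispersion across models:

$$q_l = \left(1 - e^{-N_l/\kappa}\right)\cdot\frac{1}{N_l}\sum_{m=1}^{N_l}\phi(\alpha_{m,l})\cdot e^{-\lambda\,\sigma_l}$$

with kappa = 20, lambda = 1, and q_l clipped to [0, 1].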
Each factor has a clear role:
- `1 - exp(-N_l / kappa)` is an evidence term. It prevents a layer from looking artificially strong when only a handful of checkpoints are available.
- `mean(phi(alpha))` rewards layers whose alphas stay inside the preferred band and are closer to the lower end of that band.
- `exp(-lambda * sigma_l)` penalizes layers whose alpha values vary too much across models.
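The three factors can be combined as in the sketch below. The constants kappa = 20 and lambda = 1 and the clipping to [0, 1] come from the text. The quality term is a different matter: `example_phi` is a purely hypothetical stand-in that is zero outside the band and decays linearly toward the upper end, and it should not be read as the definition used in the code. Using the population standard deviation for sigma_l is likewise an assumption.

```python
import math
import statistics


def example_phi(alpha, alpha_min=2.0, alpha_max=6.0, eta=0.5):
    """HYPOTHETICAL in-range quality term (not the repo's definition):
    0 outside the band, 1 at the lower end, 1 - eta at the upper end."""
    if not (alpha_min <= alpha <= alpha_max):
        return 0.0
    position = (alpha - alpha_min) / (alpha_max - alpha_min)
    return 1.0 - eta * position


def layer_score(alphas, phi=example_phi, kappa=20.0, lam=1.0):
    """Evidence * mean quality * stability, clipped to [0, 1]."""
    finite = [a for a in alphas if math.isfinite(a)]
    if not finite:
        return 0.0
    evidence = 1.0 - math.exp(-len(finite) / kappa)
    quality = statistics.fmean([phi(a) for a in finite])
    stability = math.exp(-lam * statistics.pstdev(finite))  # assumption: population std
    return min(1.0, max(0.0, evidence * quality * stability))
```

Passing `phi` as a parameter keeps the scoring logic separate from the exact shape of the quality term, so the real definition can be swapped in without touching the rest.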
Before applying q_l, we also use a coarse statistical filter:
A layer is retained only if it passes that coarse filter and also satisfies
The family-level summaries reported in this repo are geometric means of the per-layer scores: q_all is taken over all analyzed layers, and q_arch over the retained layer set R, with eps = 1e-12 used only for numerical stability.
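The two family-level summaries can be sketched as a log-domain geometric mean. Where exactly eps enters is an assumption here (it is added to each score before taking the log), and the function names and the dict-based layout of the per-layer scores are hypothetical.

```python
import math

EPS = 1e-12  # stability constant stated in the text


def geometric_mean_score(scores, eps=EPS):
    """Geometric mean computed as exp(mean(log(q + eps))).

    Assumption: eps is added inside the log so that q = 0 stays finite.
    """
    scores = list(scores)
    if not scores:
        return float("nan")
    return math.exp(sum(math.log(q + eps) for q in scores) / len(scores))


def family_summaries(layer_scores, retained):
    """layer_scores: dict of layer name -> q_l; retained: names in the set R."""
    q_all = geometric_mean_score(layer_scores.values())
    q_arch = geometric_mean_score(layer_scores[name] for name in retained)
    return q_all, q_arch
```

Because a geometric mean is pulled down sharply by near-zero scores, q_all is typically much smaller than q_arch, which matches the pattern in the reported numbers.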
This is why the alpha metric is useful for UniSub. It is a weights-only criterion that looks for layers that are common, stable, and spectrally consistent across many independently trained models. Those are exactly the layers where a shared low-dimensional basis is most likely to be meaningful. We also run Universal Subspace analysis on non-ideal layers for analysis purposes, but this metric indicates how useful the extracted Universal Subspace is for an architecture, or even a single layer, since extracting a Universal Subspace from noisy or poorly trained models leads to a bad approximation.
Figure panels: 3D Geometry Generation | Generative modeling
Figure panels: additional_R50 scree | ViT scree
The loong point cloud can be explored directly in the browser as a real interactive comparison, not a static screenshot.
Main_weight_analysis packages the extra analysis we ran on top of the main experiment directories.
It currently includes:
- additional_R50
- ViT analysis
- GPT-2 analysis
- LLaMA analysis
- Flan-T5 GLUE analysis
- analysis notes
- cross-model summary CSV
Important code note:
- For `vit`, `gpt2`, and `llama`, this directory combines released analysis artifacts with copies of the base download/PCA/plot scripts from this repo.
- `flan_t5_glue` is currently an artifact-backed release page rather than a duplicated script package.
- It is not a verbatim mirror of the separate transformer analysis workspace.
- Coefficient calibration is currently included for `additional_R50`.
- EigenLoRA: PEFT training built around UniSub-style shared eigenspaces
- SHARE: UniSub-inspired work on continual learning
- Model merging experiments on top of the retained UniSub bases
- A Python library that extracts a universal subspace from a directory of checkpoints together with their alpha values
- Layerwise interactive controls for alpha thresholds, retained layers, and explained-variance targets
- Hosted checkpoint bundles for the larger model collections
- More releases once the reconstructions are ready to show cleanly
@misc{kaushik2025universalweightsubspacehypothesis,
title={The Universal Weight Subspace Hypothesis},
author={Prakhar Kaushik and Shravan Chaudhari and Ankit Vaidya and Rama Chellappa and Alan Yuille},
year={2025},
eprint={2512.05117},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2512.05117},
}




