The Universal Weight Subspace Hypothesis

Prakhar Kaushik, Shravan Chaudhari, Ankit Vaidya, Rama Chellappa, Alan Yuille

Independently trained models keep returning to the same low-dimensional region in weight space.

This is the official repository for the Universal Weight Subspace Hypothesis paper.

Universal Weight Subspace Hypothesis: Backpropagated neural networks - despite being trained on diverse, potentially disjoint datasets with varying hyperparameters, initializations, and regularization techniques - systematically converge to learn architecture-specific, layer-wise low-rank joint subspaces, which we term the Universal Subspace.

For the full paper, use arXiv or the local paper PDF.

Application Examples

Repo Structure

Experiments	Location
ViT analysis	VIT
GPT-2 analysis	GPT2
LLaMA analysis	LLaMA
ResNet-50 experiments	CNN
additional ResNet-50 (+alpha) analysis	additional_R50
additional ViT, GPT-2, and LLaMA analysis	Main_weight_analysis
Mistral LoRA / generation experiments	NLG
GLUE / RoBERTa adapter experiments	NLU
ViT adapter image classification	ImageClassification
SDXL style LoRAs	Diffusion
image generation (StyleGAN2)	StyleGANV2
3D reconstruction (Geometric Distribution)	GeomDist

Adapter And LoRA Experiments

Low-rank adapter results are also a major part of the paper, and they are included directly in this repo.

Experiment/Task	What it covers	Location
`NLG`	Mistral LoRA reconstruction and SubspaceAdapter evaluation across generation tasks, with ROUGE-L scoring	NLG
`NLU`	RoBERTa on GLUE with LoRA, HOSVD, and SubspaceAdapter initialization	NLU
`ImageClassification`	ViT image classification with LoRA, VeRA, and SubspaceAdapter	ImageClassification
`Diffusion`	SDXL style LoRAs, reconstructed adapters, and SubspaceAdapter outputs for image generation	Diffusion
`diffusers`	Local diffusion adapter plumbing for PEFT and SubspaceAdapter state dicts and loaders	diffusers

For a cleaner dedicated PEFT codebase and further implementation details, refer to EigenLoRA.

Representative Results

Experiment	What it covers	What to look at
`CNN`	scratch ResNet-50 training and subspace reconstruction	the main ResNet-50 workflow used in the paper
`additional_R50`	9 ResNet-50 models with spatial analysis and alpha filtering	mean top-1 reaches `0.8085` at `95%` explained variance, with coefficient calibration included
`ViT`	`464` models, `18` retained layers in the extended analysis	`q_all = 0.0518`, `q_arch = 0.3545`, and `76.99%` savings at `80%` explained variance
`GPT-2`	`177` models, `23` retained layers in the extended analysis	`q_all = 0.0725`, `q_arch = 0.3059`, and `64.73%` savings at `80%` explained variance
`LLaMA`	`50` models, `116` retained layers in the extended analysis	`q_all = 0.0940`, `q_arch = 0.3290`, showing the same structure at larger scale
`Flan-T5 GLUE`	`196` task-adapted checkpoints in the recent analysis release	`q_all = 0.0224`, `q_arch = 0.4850`, and `85.13%` savings at `90%` explained variance
`StyleGAN2`	`20` public generators, `9` retained convolution layers	the released UniSub reconstructions stay visually close to the source generators
`GeomDist`	retained spectral layers for neural geometry	the reconstructed `loong` shape stays faithful while still giving a useful compression gain

Alpha Metric

The alpha metric is the layer-selection score we use to decide which parts of an architecture are good candidates for a shared UniSub basis.

For each corresponding layer l across a model family, we estimate one power-law exponent alpha_(m,l) per model m using a WeightWatcher-style fit to the tail of that layer's eigenvalue spectrum. The key idea is simple: if a layer keeps landing in the same well-behaved spectral regime across independently trained models, it is a much better candidate for a shared basis than a layer whose spectrum is erratic from model to model.
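As a rough illustration of the per-layer fit, the sketch below estimates a power-law exponent from the tail of a layer's eigenvalue spectrum using the standard continuous maximum-likelihood estimator. It is not the repo's code: the function names `layer_eigenvalues` and `powerlaw_alpha_mle` are hypothetical, the spectrum is assumed to be that of the unnormalized matrix W^T W, and the tail cutoff `xmin` is passed in by hand, whereas a WeightWatcher-style fit selects the cutoff automatically.

```python
import numpy as np


def layer_eigenvalues(weight):
    """Eigenvalues of W^T W for one layer (assumption: unnormalized).

    Convolution kernels are flattened to 2D (out_channels x rest) first.
    """
    w = np.asarray(weight, dtype=float)
    w2d = w.reshape(w.shape[0], -1)
    singular_values = np.linalg.svd(w2d, compute_uv=False)
    return singular_values ** 2


def powerlaw_alpha_mle(eigenvalues, xmin):
    """Continuous power-law MLE: alpha = 1 + n / sum(log(x_i / xmin)).

    Only eigenvalues >= xmin (the tail) enter the estimate.
    Returns NaN when the tail is too small to fit.
    """
    x = np.asarray(eigenvalues, dtype=float)
    if xmin <= 0:
        return float("nan")
    tail = x[x >= xmin]
    if tail.size < 2:
        return float("nan")
    log_sum = np.log(tail / xmin).sum()
    if log_sum <= 0:
        return float("nan")
    return float(1.0 + tail.size / log_sum)
```

A quick sanity check is to draw synthetic samples from a known power law and confirm that the estimator recovers the exponent. Non-finite results (for example from degenerate layers) are the values that the definition of A_l excludes.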

Let

$$ A_l = \{\alpha_{m,l}\}_{m=1}^{N_l} $$

be the finite alpha values for layer l, where N_l is the number of valid models for that layer. We first compute the basic summary statistics

$$ \mu_l = \frac{1}{N_l}\sum_{m=1}^{N_l}\alpha_{m,l}, \qquad \sigma_l = \sqrt{\frac{1}{N_l}\sum_{m=1}^{N_l}(\alpha_{m,l}-\mu_l)^2}, $$

and the fraction of models whose alpha lies in the preferred range

$$ f_l = \frac{1}{N_l}\sum_{m=1}^{N_l}\mathbf{1}[\,\alpha_{\min} \le \alpha_{m,l} \le \alpha_{\max}\,]. $$

In the current pipeline we use alpha_min = 2, alpha_max = 6, which is the band we treat as the most reliable spectral regime.
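The three summary statistics are straightforward to compute with NumPy. The following is a minimal sketch, assuming the per-model alphas for one layer are already available as a list. The helper name `layer_alpha_stats` is hypothetical and not taken from the repo.

```python
import numpy as np

ALPHA_MIN, ALPHA_MAX = 2.0, 6.0  # preferred band stated above


def layer_alpha_stats(alphas):
    """Return N_l, mu_l, sigma_l and f_l for one layer.

    Non-finite alphas (NaN, inf) are dropped first, matching the
    definition of A_l as the set of finite values.
    """
    a = np.asarray(alphas, dtype=float)
    a = a[np.isfinite(a)]
    n = int(a.size)
    if n == 0:
        return {"n": 0, "mu": float("nan"), "sigma": float("nan"), "f": 0.0}
    mu = float(a.mean())
    # np.std defaults to ddof=0, i.e. the 1/N_l normalization in sigma_l.
    sigma = float(a.std())
    in_band = (a >= ALPHA_MIN) & (a <= ALPHA_MAX)
    f = float(in_band.mean())
    return {"n": n, "mu": mu, "sigma": sigma, "f": f}
```

For example, `layer_alpha_stats([2.0, 4.0, 6.0, 8.0, float("nan")])` keeps four finite values, of which three lie in the band, so `f` is `0.75`.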

For each individual alpha value, the code defines an in-range quality term

$$ \phi(\alpha)= \begin{cases} \exp\!\left(-\eta(\alpha-\alpha_{\min})\right), & \alpha_{\min} \le \alpha \le \alpha_{\max}, \\ 0, & \text{otherwise}, \end{cases} $$

with eta = 0.5. A layer receives a high score only if it has enough evidence, enough in-range alphas, and low dispersion across models:

$$ q_l = \left(1 - e^{-N_l/\kappa}\right) \left(\frac{1}{N_l}\sum_{\alpha \in A_l}\phi(\alpha)\right) e^{-\lambda \sigma_l}, $$

with kappa = 20, lambda = 1, and q_l clipped to [0, 1].

Each factor has a clear role:

1 - exp(-N_l / kappa) is an evidence term. It prevents a layer from looking artificially strong when only a handful of checkpoints are available.
mean(phi(alpha)) rewards layers whose alphas stay inside the preferred band and are closer to the lower end of that band.
exp(-lambda * sigma_l) penalizes layers whose alpha values vary too much across models.
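The three factors can be combined in a few lines. This is a sketch that follows the formulas above with the stated constants (eta = 0.5, kappa = 20, lambda = 1). The function names `phi` and `layer_quality` are illustrative and may not match the repo's code.

```python
import numpy as np


def phi(alpha, alpha_min=2.0, alpha_max=6.0, eta=0.5):
    """In-range quality term: exp(-eta * (alpha - alpha_min)) inside the band, else 0."""
    alpha = np.asarray(alpha, dtype=float)
    in_band = (alpha >= alpha_min) & (alpha <= alpha_max)
    return np.where(in_band, np.exp(-eta * (alpha - alpha_min)), 0.0)


def layer_quality(alphas, kappa=20.0, lam=1.0):
    """q_l = evidence * mean(phi) * dispersion penalty, clipped to [0, 1]."""
    a = np.asarray(alphas, dtype=float)
    a = a[np.isfinite(a)]  # A_l contains only finite alphas
    if a.size == 0:
        return 0.0  # assumption: a layer with no valid models scores zero
    evidence = 1.0 - np.exp(-a.size / kappa)
    band = phi(a).mean()
    dispersion = np.exp(-lam * a.std())  # population std, as in sigma_l
    return float(np.clip(evidence * band * dispersion, 0.0, 1.0))


# 20 models, every alpha exactly at the lower edge of the band:
# mean(phi) = 1 and sigma = 0, so q is the evidence term 1 - e^{-1}.
q = layer_quality([2.0] * 20)  # ≈ 0.632
```

The example shows how the evidence term caps the score: even a perfectly consistent layer cannot exceed about 0.632 with only 20 checkpoints.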

Before applying q_l, we also use a coarse statistical filter:

$$ \mu_l \in [2, 6], \qquad \sigma_l \le 1, \qquad f_l \ge 0.5. $$

A layer is retained only if it passes that coarse filter and also satisfies

$$ q_l \ge 0.2. $$
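The two-stage retention rule reduces to a single boolean check. This is a sketch under the thresholds stated above. The function `is_retained` and its parameter names are hypothetical.

```python
def is_retained(mu, sigma, f, q,
                mu_range=(2.0, 6.0), sigma_max=1.0, f_min=0.5, q_min=0.2):
    """Return True if a layer passes the coarse filter and the q_l threshold."""
    passes_coarse = (
        mu_range[0] <= mu <= mu_range[1]  # mean alpha inside [2, 6]
        and sigma <= sigma_max            # low dispersion across models
        and f >= f_min                    # at least half the models in band
    )
    return bool(passes_coarse and q >= q_min)
```

For instance, a layer with `mu=3.1`, `sigma=0.4`, `f=0.9`, and `q=0.35` is retained, while the same layer with `sigma=1.2` is dropped by the coarse filter regardless of its score.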

The family-level summaries reported in this repo are geometric means of the per-layer scores:

$$ q_{\mathrm{all}} = \exp\!\left(\frac{1}{L}\sum_{l=1}^{L}\log(\max(q_l,\varepsilon))\right), $$

over all analyzed layers, and

$$ q_{\mathrm{arch}} = \exp\!\left(\frac{1}{|\mathcal{R}|}\sum_{l \in \mathcal{R}}\log(\max(q_l,\varepsilon))\right), $$

over the retained layer set R, with eps = 1e-12 used only for numerical stability.
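Both family-level summaries are the same geometric mean applied to different layer sets. The sketch below is a minimal version. The helper name `geometric_mean_score` is an assumption, and returning NaN for an empty layer set is a choice made here, not something the text specifies.

```python
import numpy as np


def geometric_mean_score(q_values, eps=1e-12):
    """Geometric mean of per-layer scores, flooring each score at eps."""
    q = np.asarray(q_values, dtype=float)
    if q.size == 0:
        return float("nan")  # assumption: undefined for an empty layer set
    return float(np.exp(np.mean(np.log(np.maximum(q, eps)))))


# q_all would use every analyzed layer, q_arch only the retained ones.
example = geometric_mean_score([0.25, 1.0])  # ≈ 0.5
```

Because the mean is geometric, layers with scores near zero pull the result down sharply: `geometric_mean_score([0.0, 1.0])` is about `1e-6`. That is consistent with `q_all` being much smaller than `q_arch` in the reported results, since `q_all` includes the layers that were not retained.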

This is why the alpha metric is useful for UniSub. It is a weights-only criterion that looks for layers that are common, stable, and spectrally consistent across many independently trained models. Those are exactly the layers where a shared low-dimensional basis is most likely to be meaningful. We also run Universal Subspace analysis on non-ideal layers for completeness, but this metric indicates how useful the Universal Subspace extracted for an architecture, or even a single layer, is likely to be, since extracting one from noisy or poorly trained models leads to a bad approximation.

Visual Examples

3D Geometry Generation	Generative modeling

additional_R50 scree	ViT scree

3D Viewer

The loong point cloud can be explored directly in the browser. This is a real interactive comparison, not a static screenshot.

Extended Weight Analysis

Main_weight_analysis packages the extra analysis we ran on top of the main experiment directories.

It currently includes:

Important code note:

For vit, gpt2, and llama, this directory combines released analysis artifacts with copies of the base download/PCA/plot scripts from this repo.
flan_t5_glue is currently an artifact-backed release page rather than a duplicated script package.
It is not a verbatim mirror of the separate transformer analysis workspace.
Coefficient calibration is currently included for additional_R50.

Related Projects

EigenLoRA: PEFT training built around UniSub-style shared eigenspaces
SHARE: UniSub-inspired work on continual learning

TODO

Model merging experiments on top of the retained UniSub bases
A Python library that extracts a universal subspace from a directory of checkpoints together with their alpha values
Layerwise interactive controls for alpha thresholds, retained layers, and explained-variance targets
Hosted checkpoint bundles for the larger model collections
More releases once the reconstructions are ready to show cleanly

Citation

@misc{kaushik2025universalweightsubspacehypothesis,
      title={The Universal Weight Subspace Hypothesis}, 
      author={Prakhar Kaushik and Shravan Chaudhari and Ankit Vaidya and Rama Chellappa and Alan Yuille},
      year={2025},
      eprint={2512.05117},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.05117}, 
}
