MetaAdamW: an optimizer that learns per-layer hyperparameters via self-attention
Standard adaptive optimizers such as Adam and AdamW are usually run with a single base learning rate and a single weight decay shared by all parameter groups, so a ResNet's early layers receive the same settings as its final classifier.
MetaAdamW uses a Transformer encoder to observe gradient statistics (norms, momentum, correlations) from each layer and to emit group-specific learning rates and weight decay factors during training.
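
To make the mechanism concrete, here is a minimal, framework-free sketch of one way per-layer statistics could be turned into per-group scales. It is not the MetaAdamW implementation: it uses a single untrained attention head rather than a full Transformer encoder, and the feature choice (log gradient norm, log momentum norm, gradient/momentum cosine), the sigmoid output range, and every function and parameter name are assumptions made for illustration.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    # matrix is a list of rows; returns matrix @ vec
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def layer_features(grad, momentum):
    # Assumed statistics: log grad norm, log momentum norm, grad/momentum cosine.
    g_norm = math.sqrt(sum(g * g for g in grad))
    m_norm = math.sqrt(sum(m * m for m in momentum))
    dot = sum(g * m for g, m in zip(grad, momentum))
    cos = dot / (g_norm * m_norm + 1e-12)
    return [math.log(g_norm + 1e-12), math.log(m_norm + 1e-12), cos]

def attend(features, wq, wk, wv):
    # Single-head scaled dot-product self-attention across layers.
    queries = [matvec(wq, f) for f in features]
    keys = [matvec(wk, f) for f in features]
    values = [matvec(wv, f) for f in features]
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(d)])
    return attended

def hyperparameter_scales(features, params, max_scale=2.0):
    # Map each layer's attended vector to (lr_scale, wd_scale) in (0, max_scale).
    scales = []
    for h in attend(features, params["wq"], params["wk"], params["wv"]):
        lr_logit, wd_logit = matvec(params["w_out"], h)
        scales.append((max_scale * sigmoid(lr_logit),
                       max_scale * sigmoid(wd_logit)))
    return scales

# Toy example: three "layers" with hand-written gradients and momenta.
rng = random.Random(0)
d_model, n_features = 4, 3
params = {
    "wq": random_matrix(d_model, n_features, rng),
    "wk": random_matrix(d_model, n_features, rng),
    "wv": random_matrix(d_model, n_features, rng),
    "w_out": random_matrix(2, d_model, rng),  # untrained, random weights
}
grads = [[0.5, -0.2], [0.01, 0.03, -0.02], [1.5, 2.0]]
momenta = [[0.4, -0.1], [0.02, 0.01, -0.01], [1.0, 2.5]]
features = [layer_features(g, m) for g, m in zip(grads, momenta)]
scales = hyperparameter_scales(features, params)

base_lr, base_wd = 1e-3, 1e-2
group_settings = [(base_lr * s_lr, base_wd * s_wd) for s_lr, s_wd in scales]
```

Emitting bounded multipliers of a base learning rate and base weight decay, rather than raw values, is one design choice that keeps an untrained or poorly trained attention module from producing extreme step sizes. Whether MetaAdamW parameterizes its outputs this way is not stated in the text above.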
The attention module is trained with a meta-learning loss that combines gradient alignment, immediate loss decrease, and the generalization gap, so the per-group hyperparameters do not have to be found by grid search.
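
The text names the three terms of the meta-learning loss but not how they are defined or weighted. The sketch below shows one plausible combination under explicit assumptions: gradient alignment is taken as the cosine similarity between consecutive gradients, loss decrease as the drop in training loss over one step, and generalization gap as validation loss minus training loss. The weights `w_align`, `w_decrease`, and `w_gap`, the signs, and the function names are all hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def meta_loss(prev_grad, curr_grad, loss_before, loss_after,
              train_loss, val_loss,
              w_align=1.0, w_decrease=1.0, w_gap=1.0):
    # Assumed definitions; the source does not specify them.
    alignment = cosine(prev_grad, curr_grad)   # higher is better
    decrease = loss_before - loss_after        # higher is better
    gap = max(val_loss - train_loss, 0.0)      # lower is better
    # Lower meta-loss means the emitted hyperparameters produced a better step.
    return -w_align * alignment - w_decrease * decrease + w_gap * gap

# A step with consistent gradients, falling loss, and a small gap...
good = meta_loss([1.0, 0.0], [0.9, 0.1], 2.0, 1.8, train_loss=1.8, val_loss=1.9)
# ...versus a step with a reversed gradient, rising loss, and a large gap.
bad = meta_loss([1.0, 0.0], [-1.0, 0.0], 2.0, 2.3, train_loss=2.3, val_loss=3.0)
```

Under these assumptions the first step scores lower than the second, which is the signal a meta-learner would use to prefer the hyperparameters that produced it.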