March 23, 2026

dynActivation: A Trainable Activation Family for Adaptive Nonlinearity

Authors

Alois Bachmann

Abstract

This paper proposes $\mathrm{dynActivation}$, a per-layer trainable activation defined as $f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$, where $\alpha_i$ and $\beta_i$ are lightweight learned scalars that interpolate between the base nonlinearity and a linear path, and $\mathrm{BaseAct}(x)$ can be any ReLU-like function. The static and dynamic ReLU-like variants are then compared across multiple vision tasks, language modeling tasks, and ablation studies. The results suggest that dynActivation variants tend to linearize deep layers while maintaining high performance, which can improve training efficiency by up to $+54\%$ over ReLU.

On CIFAR-10, dynActivation(Mish) improves over static Mish by up to $+14.02\%$ on AttentionCNN, with an average improvement of $+6.00\%$ and a $24\%$ convergence-AUC reduction relative to Mish (2120 vs. 2785). In a 1-to-75-layer MNIST depth-scaling study, dynActivation never drops below $95\%$ test accuracy ($95.3$--$99.3\%$), while ReLU collapses below $80\%$ at 25 layers. Under FGSM at $\varepsilon{=}0.08$, dynActivation(Mish) incurs a $55.39\%$ accuracy drop versus $62.79\%$ for ReLU (a $7.40$-percentage-point advantage). Transferred to language modeling, a newly proposed dynActGLU variant achieves a $10.3\%$ relative perplexity reduction over SwiGLU at 5620 steps (4.047 vs. 4.514), though the gap vanishes at 34300 steps.
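The definition in the abstract can be illustrated with a minimal scalar sketch in plain Python. It covers only the forward formula $f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$; the function names `relu`, `mish`, and `dyn_activation` are hypothetical, and the abstract does not say how $\alpha_i$ and $\beta_i$ are initialized or trained, so the values passed in below are illustrative assumptions rather than the author's settings.

```python
import math


def relu(x: float) -> float:
    """Standard ReLU, used here as one possible base activation."""
    return max(x, 0.0)


def mish(x: float) -> float:
    """Mish: x * tanh(softplus(x)), with a numerically stable softplus."""
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return x * math.tanh(softplus)


def dyn_activation(x: float, alpha: float, beta: float, base_act=mish) -> float:
    """Forward formula from the abstract: BaseAct(x) * (alpha - beta) + beta * x.

    alpha and beta stand in for the per-layer learned scalars; in a real
    model they would be trainable parameters (an assumption, not shown here).
    """
    return base_act(x) * (alpha - beta) + beta * x


if __name__ == "__main__":
    # alpha = 1, beta = 0 recovers the plain base activation.
    print(dyn_activation(2.0, 1.0, 0.0, relu))
    # alpha = beta removes the nonlinear term, leaving the linear path beta * x.
    print(dyn_activation(-3.0, 0.5, 0.5, mish))
```

Two limiting cases follow directly from the formula: with $\alpha_i = 1$ and $\beta_i = 0$ the layer reduces to the static base activation, and with $\alpha_i = \beta_i$ the nonlinear term cancels and the layer becomes the linear map $\beta_i x$, which is consistent with the abstract's observation that deep layers tend to linearize.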

Metadata

arXiv ID: 2603.22154

Provider: ARXIV

Primary Category: cs.LG

Published: 2026-03-23

Fetched: 2026-03-24 06:02

Secondary Category: cs.CV

Updated: 2026-03-23T16:18:28Z

Comment: 22 pages, 15 figures

Abstract page: https://arxiv.org/abs/2603.22154v1

PDF: https://arxiv.org/pdf/2603.22154v1