3:40 PM - 4:00 PM
[1S4-GS-2-01] TELU: A Faster Alternative to GELU and Swish
Keywords: activation function
In recent deep learning models, smooth activation functions such as GELU and Swish are widely used in place of ReLU. These functions are known to offer advantages over ReLU, such as greater robustness to noise, but they are slow to compute because they involve transcendental functions such as the Gaussian error function and the sigmoid. In this study, we propose a faster smooth activation function, the T Error Linear Unit (TELU), which can be computed using only algebraic functions; it is faster to evaluate than GELU and similar functions while preserving smoothness. Experimental results show that TELU can replace GELU in GPT-2 pre-training and that it is faster than GELU while maintaining comparable performance.
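The abstract does not give the TELU formula, so the following is only a minimal illustrative sketch of the general idea it describes: a smooth, GELU-like activation built from algebraic operations alone (no erf or exp). The function name `algebraic_gelu_like` and the gate 0.5 * (1 + x / sqrt(1 + x^2)) are hypothetical stand-ins chosen for illustration, not the authors' definition of TELU.

```python
import torch

def algebraic_gelu_like(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical sketch (not the authors' TELU): gate the input with an
    # algebraic sigmoid, 0.5 * (1 + x / sqrt(1 + x^2)), which is smooth and
    # avoids transcendental functions such as erf or exp.
    return x * 0.5 * (1.0 + x * torch.rsqrt(1.0 + x * x))

if __name__ == "__main__":
    x = torch.linspace(-4.0, 4.0, steps=9)
    print(algebraic_gelu_like(x))          # algebraic-only smooth activation
    print(torch.nn.functional.gelu(x))     # exact GELU, for rough comparison
```

Because the gate uses only multiplications, additions, and a reciprocal square root, it avoids the transcendental-function evaluations that make GELU and Swish comparatively slow, which is the property the abstract attributes to TELU.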