Paper
13 April 2023
Research on large-scale training optimization for transformer of domestic accelerator
Proceedings Volume 12605, 2022 2nd Conference on High Performance Computing and Communication Engineering (HPCCE 2022); 126050J (2023) https://doi.org/10.1117/12.2673331
Event: Second Conference on High Performance Computing and Communication Engineering, 2022, Harbin, China
Abstract
Large-scale network models based on the transformer architecture are highly versatile across many fields. Because such models are computation-intensive and large in scale, large-scale training on domestic heterogeneous accelerators is constrained by computing and communication efficiency, resulting in poor training performance. To address this problem, the hot functions and performance bottlenecks in the training process are analyzed, and corresponding performance optimization methods are proposed based on the hardware characteristics of the domestic heterogeneous accelerator. To address the poor performance of low-precision training, a low-precision packaging optimization is applied to the underlying matrix multiplication core operator. To address the significant kernel launch latency caused by fine-grained core operators, the LightSeq framework is ported to the domestic heterogeneous platform for the first time, and its core fine-grained operators are specially optimized for the hardware architecture according to the characteristics of the network structure, accelerating the training process. For large-scale training, to address the low bandwidth of cross-node communication, distributed communication is optimized at two levels, data transmission and hardware topology, improving communication efficiency by reducing communication frequency and increasing communication bandwidth. Experimental results on the WMT '14 English-German translation dataset show that, after optimization, single-node performance improves by a factor of two without loss of training accuracy. The computing scale is then gradually expanded to 128 nodes (512 accelerator cards) for large-scale distributed training and verification; while maintaining the performance improvement, scaling efficiency exceeds 90% on 256 accelerator cards.
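The abstract names two generic levers, low-precision computation for the matrix multiplication operator and reduced communication frequency for cross-node training, without giving implementation details. The sketch below is not the authors' implementation; it is a minimal illustration of those two ideas using standard PyTorch with NCCL on GPUs, whereas the paper targets a domestic heterogeneous accelerator with its own runtime and ported LightSeq kernels. The names `model`, `loader`, and `accum_steps`, the assumption that the model's forward pass returns the loss, and the choice of backend are all illustrative assumptions, not details from the paper.

import contextlib
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def train(model, loader, optimizer, accum_steps=4, device="cuda"):
    """Sketch: low-precision autocast plus gradient accumulation with deferred all-reduce."""
    dist.init_process_group(backend="nccl")      # backend choice is an assumption
    ddp_model = DDP(model.to(device))
    scaler = torch.cuda.amp.GradScaler()         # keeps FP16 gradients numerically stable

    for step, (src, tgt) in enumerate(loader):
        src, tgt = src.to(device), tgt.to(device)
        sync_now = (step + 1) % accum_steps == 0

        # Skip the gradient all-reduce on all but the last micro-batch, so
        # cross-node communication fires once every `accum_steps` iterations.
        sync_ctx = contextlib.nullcontext() if sync_now else ddp_model.no_sync()
        with sync_ctx:
            with torch.cuda.amp.autocast():      # low-precision (FP16) forward pass
                loss = ddp_model(src, tgt) / accum_steps   # model assumed to return the loss
            scaler.scale(loss).backward()        # gradients accumulate locally

        if sync_now:
            scaler.step(optimizer)               # apply the all-reduced, accumulated gradients
            scaler.update()
            optimizer.zero_grad(set_to_none=True)

Accumulating gradients locally and synchronizing only every few micro-batches trades a small amount of staleness-free extra memory for a proportional reduction in all-reduce traffic, which is the same direction as the paper's "reduce the frequency of communication" optimization; the paper's bandwidth-side and topology-aware optimizations are hardware-specific and are not represented here.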
© (2023) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Yan Zhu, Chen Hu, Shan-shan Ai, Jing-de Bu, Meng-zhi Han, and Lin Han "Research on large-scale training optimization for transformer of domestic accelerator", Proc. SPIE 12605, 2022 2nd Conference on High Performance Computing and Communication Engineering (HPCCE 2022), 126050J (13 April 2023); https://doi.org/10.1117/12.2673331
KEYWORDS: Education and training, Transformers, Mathematical optimization, Matrix multiplication, Computer hardware, Data communications, Matrices
