Paper
13 April 2023
Research on large-scale training optimization for transformer of domestic accelerator
Proceedings Volume 12605, 2022 2nd Conference on High Performance Computing and Communication Engineering (HPCCE 2022); 126050J (2023) https://doi.org/10.1117/12.2673331
Event: Second Conference on High Performance Computing and Communication Engineering, 2022, Harbin, China
Abstract
Large-scale network models based on the transformer architecture are highly versatile across many fields. Because such models are computation-intensive and large in scale, large-scale training on domestic heterogeneous accelerators is constrained by computing and communication efficiency, resulting in poor training performance. To address this problem, the hot functions and performance bottlenecks in the training process are analyzed, and corresponding performance optimization methods are proposed based on the hardware characteristics of the domestic heterogeneous accelerator. To address the poor performance of low-precision training, a low-precision packaging optimization is applied to the underlying matrix multiplication core operator. To address the significant kernel launch latency caused by fine-grained core operators, the LightSeq framework is ported to the domestic heterogeneous platform for the first time, and its core fine-grained operators are specially optimized for the hardware architecture according to the characteristics of the network structure, accelerating the training process. For large-scale training, to address the low bandwidth of cross-node communication, distributed communication is optimized at two levels, data transmission and hardware topology, improving communication efficiency by reducing communication frequency and increasing communication bandwidth. Experimental results on the WMT '14 English-German translation dataset show that, after optimization, single-node performance improves by a factor of two without loss of training accuracy. The computing scale is then gradually expanded to 128 nodes (512 accelerator cards) for large-scale distributed training and verification; while maintaining the performance improvement, scaling efficiency exceeds 90% on 256 accelerator cards.
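The abstract names two generic levers, low-precision computation for the matrix multiplication operator and reduced communication frequency for cross-node training, without giving implementation details. The sketch below is not the authors' implementation; it is a minimal illustration of those two ideas using standard PyTorch with NCCL on GPUs, whereas the paper targets a domestic heterogeneous accelerator with its own runtime and ported LightSeq kernels. The names `model`, `loader`, and `accum_steps`, the assumption that the model's forward pass returns the loss, and the choice of backend are all illustrative assumptions, not details from the paper.

import contextlib
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def train(model, loader, optimizer, accum_steps=4, device="cuda"):
    """Sketch: low-precision autocast plus gradient accumulation with deferred all-reduce."""
    dist.init_process_group(backend="nccl")      # backend choice is an assumption
    ddp_model = DDP(model.to(device))
    scaler = torch.cuda.amp.GradScaler()         # keeps FP16 gradients numerically stable

    for step, (src, tgt) in enumerate(loader):
        src, tgt = src.to(device), tgt.to(device)
        sync_now = (step + 1) % accum_steps == 0

        # Skip the gradient all-reduce on all but the last micro-batch, so
        # cross-node communication fires once every `accum_steps` iterations.
        sync_ctx = contextlib.nullcontext() if sync_now else ddp_model.no_sync()
        with sync_ctx:
            with torch.cuda.amp.autocast():      # low-precision (FP16) forward pass
                loss = ddp_model(src, tgt) / accum_steps   # model assumed to return the loss
            scaler.scale(loss).backward()        # gradients accumulate locally

        if sync_now:
            scaler.step(optimizer)               # apply the all-reduced, accumulated gradients
            scaler.update()
            optimizer.zero_grad(set_to_none=True)

Accumulating gradients locally and synchronizing only every few micro-batches trades a small amount of staleness-free extra memory for a proportional reduction in all-reduce traffic, which is the same direction as the paper's "reduce the frequency of communication" optimization; the paper's bandwidth-side and topology-aware optimizations are hardware-specific and are not represented here.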
© (2023) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Yan Zhu, Chen Hu, Shan-shan Ai, Jing-de Bu, Meng-zhi Han, and Lin Han "Research on large-scale training optimization for transformer of domestic accelerator", Proc. SPIE 12605, 2022 2nd Conference on High Performance Computing and Communication Engineering (HPCCE 2022), 126050J (13 April 2023); https://doi.org/10.1117/12.2673331
KEYWORDS: Education and training, Transformers, Mathematical optimization, Matrix multiplication, Computer hardware, Data communications, Matrices
