An Adaptive Error Detection GPGPU Architecture Based on Warp-Level Branch Divergence

李圣龙; 卫宇坤; 刘波; 王明羽

doi:10.3969/j.issn.1674-1579.2025.03.009

Abstract

Spacecraft intelligence has become an important trend in the development of space technology, and intelligent computing power is the infrastructure that carries space intelligence algorithms and data. The general purpose graphics processing unit (GPGPU) is the most widely used computing chip for building spaceborne intelligent computing platforms. However, for GPGPU chips to operate stably in the space radiation environment, reliability hardening measures must be added. By studying GPGPU architecture and evaluating the feasibility of reliability hardening designs, this paper explores the feasibility of developing reliable GPGPU chips in support of spacecraft autonomy, intelligence, and reliability. It proposes an adaptive error detection architecture for GPGPUs based on warp-level branch divergence. The architecture exploits the underutilized parallelism in GPGPUs to provide full error detection with a low performance penalty, low detection latency, and ultra-low overhead. It captures the branch divergence level of each warp in real time and switches adaptively between two error detection architectures designed for different dimensions. Within a warp, a spatial redundancy error detection architecture with an ultra-low performance penalty is realized; among different warps, temporal redundancy is converted into fine-grained spatial redundancy through a special reconfiguration mechanism to achieve full error detection. Experimental evaluation shows that the proposed architecture incurs a performance penalty of 3.67% and, compared with previous studies, achieves average performance improvements of over 90% and 16.5%, respectively. In addition, error coverage improves by 16% on average, reaching 100% error detection, and error detection latency is reduced by an average factor of 28.27 at the cost of 4.3% additional overhead. The proposed adaptive redundant error detection architecture significantly improves the accuracy and reliability of GPGPU error detection and has great potential as an important solution for enhancing GPGPU reliability.
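
The adaptive switching described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the switching policy, so everything below is an assumption made for illustration: a 32-lane warp, a policy that selects intra-warp checking whenever there are at least as many idle lanes as active ones, and the hypothetical names `active_lanes`, `choose_mode`, and `pair_lanes`. It is not the authors' hardware mechanism.

```python
WARP_SIZE = 32  # assumed warp width; the abstract does not state one


def active_lanes(mask):
    """Count the lanes enabled in a warp's active mask."""
    return bin(mask & ((1 << WARP_SIZE) - 1)).count("1")


def choose_mode(mask):
    """Pick a checking mode from the divergence level (hypothetical policy)."""
    active = active_lanes(mask)
    idle = WARP_SIZE - active
    # Assumption: when at least half of the lanes are idle, every active
    # thread can be duplicated onto an idle lane of the same warp.
    if active > 0 and idle >= active:
        return "intra_warp"
    # Otherwise the redundant copy has to be scheduled across warps.
    return "inter_warp"


def pair_lanes(mask):
    """Map each active lane to an idle lane that would re-execute it."""
    active = [i for i in range(WARP_SIZE) if (mask >> i) & 1]
    idle = [i for i in range(WARP_SIZE) if not (mask >> i) & 1]
    # zip stops at the shorter list, so this is a full pairing only
    # when choose_mode(mask) returned "intra_warp".
    return dict(zip(active, idle))
```

Under these assumptions, a fully converged warp (mask `0xFFFFFFFF`) has no idle lanes and falls back to inter-warp checking, while a warp with half of its lanes masked off by divergence (mask `0x0000FFFF`) can be checked within the warp itself.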
