Abstract:
Spacecraft intelligence has become an important trend in the development of space technology, and computing power is the infrastructure that supports space intelligence algorithms and data. The General-Purpose Graphics Processing Unit (GPGPU) has become the most popular acceleration component for spaceborne intelligent computing platforms. However, for GPGPU chips to operate stably in the space radiation environment, reliability hardening must be built into the design. To develop radiation-resistant GPGPU designs and support the autonomy, intelligence, and reliability of spacecraft, we propose an adaptive error detection architecture for GPGPUs based on Warp-level branch divergence. By fully exploiting the underutilized parallelism in GPGPUs, it provides full error detection, a low performance penalty, low error detection latency, and ultra-low overhead. The architecture switches adaptively between two error detection schemes, one intra-Warp and one inter-Warp, by capturing the level of Warp branch divergence in real time. Within a Warp, a spatial redundancy error detection scheme with an ultra-low performance penalty is used. Among different Warps, temporal redundancy is converted into fine-grained spatial redundancy through a special reconfiguration mechanism to achieve full error detection. Experimental evaluations show that the proposed architecture achieves average performance improvements of over 90% and 16.5%, respectively, over previous approaches, with a performance penalty of 3.67%. Error coverage improves by an average of 16%, reaching 100% error detection; error detection latency improves by an average factor of 28.27; and the overhead is 4.3%. The proposed adaptive redundant error detection architecture significantly improves the accuracy and reliability of GPGPU error detection and has great potential as a solution for enhancing GPGPU reliability.
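The adaptive switching idea can be illustrated with a minimal software sketch. It is not the hardware mechanism described above. It assumes a 32-lane Warp and a hypothetical policy in which intra-Warp spatial redundancy is chosen only when divergence leaves at least one idle lane per active lane. The names `select_mode` and `pair_lanes` and the threshold itself are illustrative assumptions, not taken from the proposed architecture.

```python
WARP_SIZE = 32  # assumption: 32 lanes per Warp, as on NVIDIA-style GPGPUs


def active_lanes(mask: int) -> int:
    """Count the lanes enabled in a Warp's active mask."""
    return bin(mask & ((1 << WARP_SIZE) - 1)).count("1")


def select_mode(mask: int) -> str:
    """Hypothetical policy: choose a redundancy mode from the divergence level."""
    active = active_lanes(mask)
    idle = WARP_SIZE - active
    # Assumption: intra-Warp duplication needs one idle lane per active lane.
    if active > 0 and idle >= active:
        return "intra_warp"
    # Otherwise fall back to redundancy across Warps.
    return "inter_warp"


def pair_lanes(mask: int) -> dict:
    """Map each active lane to an idle lane that would run its redundant copy."""
    active = [i for i in range(WARP_SIZE) if (mask >> i) & 1]
    idle = [i for i in range(WARP_SIZE) if not (mask >> i) & 1]
    if not active or len(idle) < len(active):
        return {}  # not enough idle lanes for intra-Warp redundancy
    return dict(zip(active, idle))


if __name__ == "__main__":
    for mask in (0x0000FFFF, 0xFFFFFFFF, 0b0101):
        print(f"{mask:#010x}", select_mode(mask), pair_lanes(mask))
```

In this sketch a fully converged Warp (all 32 lanes active) has no idle lanes, so it is routed to the inter-Warp mode, while a heavily diverged Warp reuses its idle lanes for the redundant copies.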