Radiation-Tolerant Deep Learning Processor Unit (DPU)-Based Platform Using Xilinx 20-nm Kintex UltraScale FPGA
2022; Institute of Electrical and Electronics Engineers; Volume: 70; Issue: 4; Language: English
10.1109/tns.2022.3216360
ISSN 1558-1578
Authors: Pierre Maillard, Yanran P. Chen, Jason Vidmar, Nicholas J. Fraser, Giulio Gambardella, Minal Sawant, Martin L. Voogel
Topic(s): Radiation Detection and Scintillator Technologies
Abstract: This article presents a platform and design approach for enabling radiation-tolerant deep learning acceleration on static random access memory (SRAM)-based 20-nm Kintex UltraScale field-programmable gate arrays (FPGAs), for terrestrial and high-radiation environments. The presented platform is suitable for deep neural network (DNN) implementations with an emphasis on image classification and includes solutions to mitigate both radiation-induced single-event functional interrupts (SEFIs) and network datapath corruptions. The radiation-tolerant deep learning platform combines Xilinx's deep learning processing unit (DPU) IP, triple modular redundancy (TMR) MicroBlaze soft processor IP, and soft error mitigation (SEM)-IP to mitigate SEFIs. Furthermore, a technique known as fault aware training (FAT) was applied to effectively mitigate single-event effects in the datapath. Test results from a high-energy proton beam ($>$60 MeV) experiment using the ResNet-18 convolutional neural network (CNN) for image classification are presented. The single-event upset (SEU) rate, system-level SEFI rate, and neural network classification/datapath performance are compared between the radiation-tolerant platform and a standard, nonmitigated approach. Results show that datapath classification errors dominate the system response (90%) versus SEFIs (10%). When compared to standard nonmitigated training techniques, the radiation-tolerant platform using FAT methods shows dramatic improvements in overall system response: the overall single-event cross section was reduced by half and a 40% reduction in misclassification errors was observed. Also, datapath events with classification accuracy degradation larger than 5% were completely mitigated. The SEFI rate was reduced by $100\times$ with implemented solutions and can be further reduced by optimizing the physical separation between TMR modules.
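The abstract names triple modular redundancy (TMR) as one of the SEFI mitigations. As a rough illustration of the voting principle only, and not the authors' hardware implementation, the Python sketch below emulates a single-event upset as one flipped bit in one of three redundant copies and recovers the original value with a bitwise 2-of-3 majority vote. The function names `flip_random_bit` and `tmr_vote`, the 8-bit word width, and the sample value are all assumptions made for this example.

```python
import random


def flip_random_bit(word, width=8, rng=None):
    """Emulate a single-event upset: flip one randomly chosen bit of a width-bit word."""
    rng = rng or random.Random()
    return word ^ (1 << rng.randrange(width))


def tmr_vote(a, b, c):
    """Bitwise 2-of-3 majority vote over three redundant copies."""
    return (a & b) | (a & c) | (b & c)


rng = random.Random(0)           # fixed seed so the sketch is repeatable
golden = 0b10110010              # hypothetical 8-bit register value
copies = [golden, golden, golden]
copies[1] = flip_random_bit(copies[1], rng=rng)  # upset exactly one replica

voted = tmr_vote(*copies)
print(voted == golden)           # the vote masks a single corrupted copy
```

A vote of this kind masks an upset confined to one replica; if the same bit is corrupted in two replicas, the voter returns the wrong value. That limitation is consistent with the abstract's remark that the SEFI rate can be reduced further by physically separating the TMR modules.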