Open-access, peer-reviewed article

Exploiting Thread-Level Parallelism in Lockstep Execution by Partially Duplicating a Single Pipeline

2008; Electronics and Telecommunications Research Institute; Volume: 30; Issue: 4; Language: English

DOI

10.4218/etrij.08.0107.0343

ISSN

2233-7326

Authors

Jae-Geun Oh, Seok Joong Hwang, Huong Giang Nguyen, Areum Kim, Seon Wook Kim, Chulwoo Kim, Jong-Kook Kim

Topic(s)

Embedded Systems Design Techniques

Abstract

ETRI Journal, Volume 30, Issue 4, August 2008, pp. 576-586. Regular Paper, Free Access.

First published: 01 August 2008. https://doi.org/10.4218/etrij.08.0107.0343. Citations: 2.

Jaegeun Oh (phone: +82 2 3290 3892, email: [email protected]), Seok Joong Hwang (email: [email protected]), Huong Giang Nguyen (email: [email protected]), Areum Kim (email: [email protected]), Seon Wook Kim (phone: +82 2 3290 3251, email: [email protected]), Chulwoo Kim (email: [email protected]), and Jong-Kook Kim (email: [email protected]) are with the School of Electrical Engineering, Korea University, Seoul, Rep. of Korea.

In most parallel loops of embedded applications, every iteration executes the exact same sequence of instructions while manipulating different data. This fact motivates a new compiler-hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write-back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called the multithreaded lockstep execution processor (MLEP), is a compromise between the single-instruction multiple-data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and it can save more power and chip area than the SMT/CMP approach without significant performance degradation. For architecture verification, we extend a commercial 32-bit embedded core, the AE32000C, and synthesize it on a Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2-way MLEP and 33.7% faster with a 4-way MLEP on EEMBC benchmarks automatically parallelized by the Intel compiler.
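The lockstep idea summarized in the abstract can be illustrated with a toy software model. This is only an illustrative sketch: the lane count, the tiny three-operation ISA, and all names (`Lane`, `run_lockstep`) are invented for the example and do not come from the MLEP design. It shows the core property the abstract describes: each instruction is fetched and decoded once, then broadcast to several lanes, each of which executes it against its own private registers and data.

```python
# Toy model of lockstep execution: one shared fetch/decode stage drives
# several execution lanes holding private register files and data.
# All names and the 3-op ISA are invented for illustration.

PROGRAM = [
    ("load", 0),       # r0 <- lane's own input element
    ("addi", 0, 10),   # r0 <- r0 + 10
    ("store", 0),      # lane output <- r0
]

class Lane:
    """One execution/memory/write-back lane with a private register file."""
    def __init__(self, value):
        self.regs = [0] * 4   # private registers
        self.inp = value      # this lane's data element
        self.out = None

    def execute(self, op, *args):
        if op == "load":
            self.regs[args[0]] = self.inp
        elif op == "addi":
            self.regs[args[0]] += args[1]
        elif op == "store":
            self.out = self.regs[args[0]]

def run_lockstep(program, data):
    lanes = [Lane(v) for v in data]   # e.g. four lanes, like a 4-way MLEP
    for instr in program:             # single shared fetch ...
        op, *args = instr             # ... and single shared decode
        for lane in lanes:            # broadcast to every lane in lockstep
            lane.execute(op, *args)
    return [lane.out for lane in lanes]

print(run_lockstep(PROGRAM, [1, 2, 3, 4]))  # -> [11, 12, 13, 14]
```

Because every lane consumes the same decoded instruction on the same step, the fetch and decode hardware need not be replicated per thread; only the per-lane state (registers, data) is duplicated, which is the power and area saving over SMT/CMP that the abstract claims.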
