
Foreword to the special issue of the workshop on high performance computing systems (XVIII Simpósio em Sistemas Computacionais de Alto Desempenho, WSCAD 2017)
2019; Wiley; Volume 31; Issue 18; Language: English
10.1002/cpe.5319
ISSN 1532-0634
Authors: César A. F. De Rose, Márcio Castro
This special issue of Concurrency and Computation: Practice and Experience gathers extended versions of six selected research articles that were previously presented at the Brazilian Workshop on High Performance Computing Systems ("XVIII Simpósio em Sistemas Computacionais de Alto Desempenho", WSCAD 2017), held in conjunction with the 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2017) in Campinas, SP, Brazil, from the 17th to the 20th of October 2017. Since 2000, this workshop has presented important and interesting research in the fields of Computer Architecture, High Performance Computing, and Distributed Systems.

The scope of this special issue is broad and representative of the multidisciplinary nature of the High Performance Computing and Computer Architecture research domains. The accepted research articles are organized under three key themes: Parallel Algorithms and Optimizations, Scheduling and Placement, and Parallel Architecture Design. In the following sections, we provide a brief description of each of the research articles accepted in this special issue.

To achieve the best possible performance, algorithms must be carefully parallelized and optimized for multi-core or many-core processors. Current multi-core and many-core architectures may feature different technologies, such as distributed memory banks, vector instructions, and specialized cores. Oftentimes, different classes of multi-core and many-core processors are combined to construct a heterogeneous multiprocessing system. Today's technologies enable heterogeneous multiprocessing systems on a chip containing multi-cores, many-cores (e.g., GPUs), and FPGAs. The following research articles present parallel solutions and optimizations for different classes of algorithms on multi-core and many-core processors.
The paper "Optimized implementation of QC-MDPC code-based cryptography" presents a new, enhanced version of the QcBits key encapsulation mechanism (KEM), a constant-time implementation of the Niederreiter cryptosystem using QC-MDPC codes.1 The parallel solution uses vector instructions (AVX-512) and applies several other techniques to achieve a competitive performance level. The enhanced version decrypts messages 1.9x faster than BIKE, previously the state-of-the-art implementation for QC-MDPC codes.

The paper "On the Parallelization of Hirschberg's Algorithm for Multi-core and Many-core Systems" focuses on improving the execution efficiency of Hirschberg's algorithm, which finds the longest common subsequence (LCS) between two strings, on multi-core and many-core systems.2 The proposed solution exploits vector instructions and different parallelization strategies to achieve the best possible performance. Results showed that the parallel solution achieves speedups of up to 15.5x on an 18-core Xeon processor and of up to 105x on a 68-core Intel Xeon Phi many-core processor.

Finally, the paper "A Hybrid CPU-GPU-MIC Algorithm for Minimal Hitting Set Enumeration" proposes a hybrid exact algorithm for the Minimal Hitting Set (MHS) enumeration problem on highly heterogeneous platforms.3 The experiments were carried out on heterogeneous platforms composed of Intel Xeon E5-2620v2 CPUs, an Intel Xeon Phi 3120A, and a GTX TITAN X GPU. The results showed that the proposed algorithm distributes parallel tasks among the processing units according to their computational efficiency in processing the task batches, achieving speedups of up to 25.3x in comparison with using two Intel Xeon E5-2620v2 CPUs.
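As background for the second paper above: Hirschberg's algorithm combines the classic LCS dynamic program with a divide-and-conquer scheme that keeps only one DP row at a time, reducing space from quadratic to linear. The following is a minimal sequential sketch of that scheme; the paper's contribution lies in vectorizing and parallelizing it, not in this baseline.

```python
# Minimal sequential sketch of Hirschberg's linear-space LCS algorithm.

def lcs_row(a, b):
    """Last row of the LCS-length DP table for a vs. b, in O(len(b)) space."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(curr[-1], prev[j]))
        prev = curr
    return prev

def hirschberg(a, b):
    """Longest common subsequence of a and b, computed in linear space."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    # Score all splits of b against the two halves of a (second half reversed).
    left = lcs_row(a[:mid], b)
    right = lcs_row(a[mid:][::-1], b[::-1])
    # Split b where the combined score is maximal, then recurse on both halves.
    split = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    return hirschberg(a[:mid], b[:split]) + hirschberg(a[mid:], b[split:])
```

The divide-and-conquer structure is also what exposes independent subproblems for the parallel strategies the paper explores.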
To deliver high performance to large-scale engineering and scientific applications, the particular intricacies of the application and of the underlying platform should be considered, so that tailored techniques can be employed to map one onto the other. In this context, evenly distributing the workload of an application among its threads, processes, or virtual machines is an NP-hard minimization problem known as scheduling, and allocating these work abstractions to the underlying infrastructure is called placement. These problems are significant to both academia and industry, and they are a hot research topic in High Performance Computing (HPC). The following research articles present contributions to these problems at different abstraction levels and in different domains.

The paper "A Comprehensive Performance Evaluation of the BinLPT Workload-Aware Loop Scheduler" focuses on improving a workload-aware scheduling strategy called BinLPT, previously proposed by the same authors.4 Two new contributions to the state of the art are presented. First, a multiloop support feature was introduced to BinLPT, which enables the reuse of workload estimations across loops. Based on this feature, BinLPT was integrated into a real-world elastodynamics application and evaluated on a supercomputer. Second, BinLPT was evaluated using simulations as well as synthetic and application kernels. This analysis was carried out on a large-scale NUMA machine under a variety of workloads. The results revealed that BinLPT balances the load of irregular OpenMP parallel loops among the application threads, delivering up to 37% and 9% better performance than well-known loop scheduling strategies for the application kernels and the elastodynamics simulation, respectively.
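The general idea behind workload-aware loop scheduling can be illustrated with a generic LPT-style sketch: iterations with estimated costs are packed into chunks, and each chunk is assigned, heaviest first, to the currently least-loaded thread. This is only an illustration of the idea that BinLPT refines, not the BinLPT heuristic itself, and all names and numbers below are hypothetical.

```python
import heapq

def schedule(costs, n_threads, chunk_size):
    """Assign chunks of iterations to threads using estimated per-iteration costs."""
    # Pack consecutive iterations into chunks, keeping per-chunk load estimates.
    chunks = [(sum(costs[i:i + chunk_size]), i)
              for i in range(0, len(costs), chunk_size)]
    # Greedy LPT rule: heaviest chunk goes to the least-loaded thread.
    heap = [(0.0, t) for t in range(n_threads)]
    heapq.heapify(heap)
    assignment = {t: [] for t in range(n_threads)}
    for load, start in sorted(chunks, reverse=True):
        t_load, t = heapq.heappop(heap)
        assignment[t].append(start)          # thread t runs iterations [start, start+chunk_size)
        heapq.heappush(heap, (t_load + load, t))
    return assignment

# Irregular workload: iteration i is estimated to cost i units.
plan = schedule([float(i) for i in range(16)], n_threads=4, chunk_size=2)
```

For this triangular workload the greedy rule happens to balance the four threads perfectly (30 cost units each), whereas a static even split of iteration ranges would not.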
Finally, the paper "Optimizing the Performance of Multi-tier Applications Using Interference and Affinity-aware Placement Algorithms" proposes a combined approach that considers both resource interference and network affinity to decide the best placement of multi-tier applications in consolidated environments.5 In previous work, the same authors identified that such a combined approach could yield better solutions to this problem and proposed a set of placement policies that explore this tradeoff. The authors now propose a new family of placement algorithms based on these policies and evaluate them for different workload scenarios using a visual simulation tool called CIAPA. CIAPA introduces a performance degradation model, a cost function, and heuristics to find a placement with the minimum cost for a specific workload of multi-tier applications. The solution generated by CIAPA was compared to other placement strategies from related work and delivered placement decisions with better cost and, consequently, improved performance: an average reduction in response time of 10% was observed when compared to interference-only strategies, and of up to 18% when compared to affinity-only strategies.

Dataflow-based FPGA accelerators have become a promising alternative for delivering energy-efficient platforms in the HPC domain. However, FPGA programming is still a challenge. Although reconfigurable FPGA technologies have been around since the 1980s, their use as a general-purpose processing platform is recent. Historically, both FPGA and ASIC developers have employed Hardware Description Languages (HDLs) to implement their designs, which usually lies outside the main expertise of software developers. The lack of simple and common programming models prevents software developers from easily designing accelerators and delays a broader adoption of this technology.
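The dataflow style discussed above can be mimicked in software: each operator consumes an input stream and emits an output stream, and operators are chained into a pipeline through which data items flow. The generator-based sketch below is a conceptual analogue only, not the ADD API; all operator names are illustrative.

```python
# Conceptual software analogue of a streaming dataflow pipeline:
# each operator is a generator that consumes one stream and yields another.

def source(values):
    """Entry operator: feeds raw values into the pipeline."""
    yield from values

def scale(stream, factor):
    """Stateless operator: multiplies each item by a constant."""
    for x in stream:
        yield x * factor

def accumulate(stream):
    """Stateful operator: emits a running sum of the items seen so far."""
    total = 0
    for x in stream:
        total += x
        yield total

# Chain operators exactly as hardware dataflow blocks would be wired together.
pipeline = accumulate(scale(source(range(5)), 3))
result = list(pipeline)  # [0, 3, 9, 18, 30]
```

In an FPGA dataflow accelerator, each such operator would be a hardware block and the streams would be on-chip FIFO channels, which is the execution model that frameworks in this space target.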
In this context, the paper "ADD: Accelerator Design and Deploy - A Tool for FPGA High Performance Dataflow Computing" presents a high-level framework to specify, simulate, and implement dataflow accelerators for streaming applications.6 The Accelerator Design and Deploy (ADD) framework includes an open dataflow operator library, and templates are provided to easily design new operators. The framework also provides high-level and accurate circuit-level simulation with short execution times. Moreover, ADD provides software and hardware APIs to simplify the integration process, extending the benefits of portability from low-cost FPGA boards to high-performance datacenter FPGA platforms. The framework supports coupling with high-level programming languages, and it has been validated on two FPGA platforms: the Intel high-performance CPU-FPGA heterogeneous computing platform and an educational FPGA kit. The authors show that the proposed approach delivers competitive performance, in both time and energy, when compared to multi-core and GPU accelerators. Concerning energy, it is 18.8x and 193.2x more efficient than the evaluated GPU and multi-core platforms, respectively.

The research articles presented in this special issue provide insights into fields related to High Performance Computing, including Parallel Algorithms and Optimizations, Scheduling and Placement, and Parallel Architecture Design. We believe that the main contributions presented in these research articles are timely and important. We hope that readers can benefit from their insights and contribute to these rapidly growing areas.

Dr. César A. F. De Rose has a B.Sc. degree in Computer Science from the Pontifical Catholic University of Rio Grande do Sul (PUCRS, Porto Alegre, Brazil, 1990), an M.Sc.
in Computer Science from the Federal University of Rio Grande do Sul (PGCC/UFRGS, Porto Alegre, Brazil, 1993), and a doctoral degree from the Karlsruhe Institute of Technology (KIT, Karlsruhe, Germany, 1998). In 1998, he joined the Faculty of Informatics at PUCRS as an associate professor and member of the Resource Management and Virtualization Group (full professor since 2012). His research interests include resource management, dynamic provisioning and allocation, monitoring techniques (resource and application), application modeling, scheduling and optimization in parallel and distributed environments (cluster, grid, cloud), and virtualization. In 2009, he founded the PUCRS High Performance Computing Laboratory (LAD-PUCRS), where he is now a senior researcher.

Dr. Márcio Castro received a B.Sc. in Computer Science with honors (summa cum laude) from the Pontifical Catholic University of Rio Grande do Sul (PUCRS, Brazil) in 2006 and an M.Sc. degree in Computer Science from the same university in 2009. He received a Ph.D. in Computer Science in 2012 from the University of Grenoble Alpes, France. He then worked as a postdoctoral fellow at the Federal University of Rio Grande do Sul (UFRGS), Brazil. Since 2014, he has been an associate professor at the Federal University of Santa Catarina (UFSC), Brazil. His main research area is High Performance Computing, with a focus on parallel programming models, load balancing, high-performance parallel applications, and parallel and distributed computing on multi-core and many-core architectures.

We would like to thank all the authors who provided valuable contributions to this special issue. We are also grateful to the reviewers for their feedback to the authors. Indeed, their advice was essential to further improve the quality of the papers.
Finally, we would like to express our sincere gratitude to Professor Geoffrey Fox, the Editor-in-Chief, for providing us with this unique opportunity to present the selected papers from WSCAD 2017 in Concurrency and Computation: Practice and Experience.