Article Open access Peer-reviewed

New advances in High Performance Computing and simulation: parallel and distributed systems, algorithms, and applications

2016; Wiley; Volume: 28; Issue: 7; Language: English

10.1002/cpe.3774

ISSN

1532-0634

Authors

Waleed W. Smari, Mohamed Bakhouya, Sandro Fiore, Giovanni Aloisio

Topic(s)

Parallel Computing and Optimization Techniques

Abstract

Recent developments in research and technological studies have shown that High Performance Computing (HPC) will indeed lead to advances in several areas of science, engineering, and technology, permitting the successful completion of more computationally intensive and data-intensive problems such as those in healthcare, biomedical and biosciences, climate and environmental change, multimedia processing, design and manufacturing of advanced materials, geology, astronomy, chemistry, physics, and even financial systems. However, further research is required to develop computing infrastructures, models to support newly evolving architectures, programming paradigms, tools to simulate and evaluate new approaches and solutions, and programming languages that are appropriate for the new and emerging domains and applications. The development of the HPC infrastructure has been accelerated by advances in silicon technology, which have permitted the design of complex systems able to incorporate many hardware and software blocks and cores. More precisely, recent rapid advances in technology and design tools have enabled engineers to design systems with hundreds of cores, called multiprocessor systems-on-chip. These systems are composed of several processing elements, that is, dedicated hardware and software components connected by an on-chip interconnect. According to Moore's law, the number of on-chip cores will double every 18 months; therefore, thousands of on-chip cores will be integrated in the next 20 years to meet the power and performance requirements of applications. Moreover, current trends on the road to exascale are moving toward the integration of more and more cores into a single chip 1, 2. For example, accelerators and heterogeneous processing offer opportunities to greatly increase computational performance and to match increasing application requirements 3.
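The 18-month doubling assumption above implies a rapid growth in core counts; a few lines of arithmetic make the projection concrete. The starting core count below is our illustrative assumption, not a figure from the text.

```python
# Hypothetical projection under the doubling assumption stated above:
# core counts double every 18 months; how many cores after `years` years?
def projected_cores(initial_cores: int, years: float, doubling_months: float = 18.0) -> int:
    """Return the projected on-chip core count after `years` of steady doubling."""
    doublings = (years * 12.0) / doubling_months
    return int(initial_cores * 2 ** doublings)

# Starting from a hypothetical 100-core chip, 20 years of 18-month doublings
# lands above a million cores -- illustrating why exascale designs assume
# massive on-chip parallelism.
print(projected_cores(100, 20))
```

Under these assumptions a single doubling period (18 months) exactly doubles the count, and two decades multiply it by roughly ten thousand.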
Engineering these computing systems is one of the most dynamic fields in modern science and technology. That said, there will continue to be a growing demand for more powerful HPC in the upcoming years, not just to tackle mounting basic computing needs but also to lay the foundations for an HPC market that is potentially becoming larger than the desktop/laptop computer market. Furthermore, HPC is turning out to be a major source of hope for future applications that require greater amounts of computing resources in modern science domains such as bioengineering, nanotechnology, and energy, where HPC capabilities are mandatory in order to run simulations and perform visualization tasks. At the time of writing this editorial, petaflop computing is well established 4. Several architectures are making major breakthroughs: commodity, accelerators on commodity, and special-purpose cores. All of the TOP500 systems are based on multicore technologies 5. HPC usage is growing considerably, especially in industry, and significant efforts toward exascale are underway. At the same time, several challenges have recently been identified on the path to large-scale computing systems that meet current and projected application requirements. Most of them are related to system architectures, algorithms, big data processing, and programming models 6. In particular, energy cost, resilience, Central Processing Unit (CPU) access latency, and memory transfers are key challenges to address in the era of exascale. Addressing them will require completely new approaches and technologies, and a shift from the current static approaches to application development and execution toward adaptive ones.
Consequently, further research is required to develop advanced exascale computing infrastructures, models, and paradigms that support newly emerging architectures; programming models; tools to simulate and evaluate more elaborate solutions and applications; and programming languages appropriate for these new and emerging domains and challenges. This special issue is intended to provide an overview of key topics and the state of the art in subjects relevant to High Performance Computing and simulation. The general objectives are to address, explore, and exchange information on the challenges and current state of the art in high-performance and large-scale computing systems, their use in modeling and simulation, their design, performance, and use, and their impact in various science and engineering domains and applications. This special issue contains research papers addressing the state of the art in high-performance and large-scale computing systems. A set of carefully selected works was invited based on the original presentations at the 2013 IEEE International Conference on High Performance Computing and Simulation (HPCS 2013), which was held in Helsinki, Finland, July 01–05, 2013 7. The extended works were thoroughly reviewed by an international technical reviewing committee, and only thirteen papers covering a wide range of relevant challenges in HPC were selected for this special issue. The manuscripts tackle research on different topics including HPC, distributed and Peer-to-Peer (P2P) systems, data mining, Graphics Processing Unit (GPU) and multicore systems, as well as real-world simulations related to Computational Fluid Dynamics (CFD), neuroinformatics, bioinformatics, and weather forecasting performed on large computational infrastructures. The accepted papers can be organized under the following key subjects and subsections and are briefly described in the remainder of this section.
Accelerating compute-intensive applications is another active research direction in the HPC domain. Accelerators are special-purpose processors designed mainly to speed up compute-intensive sections of applications and achieve better performance than CPUs for certain workloads 8. Two main types of accelerators are field-programmable gate arrays (FPGAs) and GPUs. FPGAs are highly customizable and designed to be configured, while GPUs provide massive parallel execution resources and high memory bandwidth. A GPU is designed to rapidly manipulate and alter memory in order to accelerate image-processing applications. Generally, GPUs are easier to program and require fewer hardware resources, while FPGAs offer the best prospects for performance, flexibility, and low overhead. Hardware acceleration using GPUs or FPGAs can potentially improve run times or enable higher-accuracy simulations. Werner et al. 9, in their article Accelerated Join Evaluation in Semantic Web Databases by Using FPGA, provide different FPGA implementations of the join operation in the context of Semantic Web databases. The authors develop a flexible FPGA-based hardware accelerator to improve the performance of query evaluation in a Semantic Web database. They propose an architecture based on partial reconfiguration to integrate the FPGA, both logically and physically, into optimized Semantic Web database engines 10 in order to accelerate database operations. The hardware architecture thus implements join algorithms for query execution on the FPGA. Experimental results are compared with a C software implementation on a general-purpose CPU and show the efficiency of the hardware-based solution. Another application that could benefit from accelerators such as FPGAs and GPUs is the Weather Research and Forecasting (WRF) model 11, a model designed to serve both atmospheric research and operational forecasting needs.
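The join that Werner et al. accelerate in hardware merges the variable bindings produced by two triple patterns on a shared variable. A hash join over simplified bindings conveys the idea in software; the data and names below are our illustration, not the paper's FPGA design or engine.

```python
# Illustrative software analogue of the join evaluated on the FPGA above:
# merge the bindings of two triple patterns on a shared variable (hash join).
def hash_join(left, right):
    """Join two lists of (key, value) bindings on the key."""
    table = {}
    for key, value in left:          # build phase: hash the left bindings
        table.setdefault(key, []).append(value)
    # probe phase: stream the right bindings through the hash table
    return [(key, lv, rv) for key, rv in right for lv in table.get(key, [])]

# Hypothetical bindings for ?person from two patterns:
# (?person worksAt ?org) and (?person knows ?friend)
works_at = [("alice", "uni"), ("bob", "lab")]
knows = [("alice", "bob"), ("carol", "dave")]
print(hash_join(works_at, knows))  # [('alice', 'uni', 'bob')]
```

The FPGA versions in the paper implement such build/probe pipelines directly in hardware, avoiding the instruction-fetch overhead of the CPU loop shown here.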
WRF is a next-generation mesoscale numerical weather prediction system that allows researchers to generate atmospheric simulations based on real data. However, the WRF model requires significant execution time and storage space, and porting it to HPC platforms enables simulations to run faster. GPUs are designed for computationally intensive applications, running a large number of threads on many processing elements. In their article An Analysis of the Feasibility and Benefits of GPU/Multicore Acceleration of the Weather Research and Forecasting Model, Vanderbauwhede and Takemi 12 show that porting the numerical weather prediction model to the GPU outperforms current multicore CPU implementations. First, a simple study is conducted to evaluate the possible gains of porting a kernel to the GPU. Then, one kernel is selected through profiling and ported to the GPU using OpenCL. Its performance is studied both in isolation and with the kernel integrated into WRF. Porting the code to the GPU greatly improves the parallelization, which translates into better scalability for the OpenMP version of the code. HPC infrastructures can also be used for the development, acceleration, and application of bioinformatics workloads. Jaziri et al. 13, in their article High Performance Computing of Oligopeptides Complete Backtranslation applied to DNA Microarray Probe Design, tackle the issue of large-scale backtranslation of oligopeptides, the step of generating all possible nucleic acid sequences from a protein sequence, which is needed for the discovery of new organisms. Because backtranslation is a time-consuming task that can generate very large quantities of data, the authors propose an efficient distributed algorithm to compute a complete backtranslation of several hundred oligopeptides for functional deoxyribonucleic acid (DNA) microarrays. The proposed algorithm was implemented, and simulations were conducted on both simulated and real biological datasets.
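Backtranslation as just described is a Cartesian product over codon choices, which is why its output grows so quickly. A minimal sketch, using a small excerpt of the standard genetic code (not the authors' distributed implementation):

```python
from itertools import product

# Minimal backtranslation sketch: enumerate every DNA sequence that can encode
# a peptide. CODONS is a small excerpt of the standard genetic code; the
# authors' algorithm distributes this combinatorial enumeration.
CODONS = {
    "M": ["ATG"],
    "W": ["TGG"],
    "F": ["TTT", "TTC"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
}

def backtranslate(peptide: str):
    """Yield all DNA sequences encoding `peptide` (output grows multiplicatively)."""
    for combo in product(*(CODONS[aa] for aa in peptide)):
        yield "".join(combo)

seqs = list(backtranslate("MFL"))
print(len(seqs))  # 1 * 2 * 6 = 12 candidate sequences
```

Even this three-residue peptide yields 12 sequences; a realistic oligopeptide multiplies up to six choices per residue, motivating the distributed approach.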
The results reported by Jaziri et al. show a significant computing speedup on different architectures (symmetric multiprocessors, clusters, and grids). Many scientific applications are computationally intensive, and scientists have traditionally attempted to parallelize their algorithms across HPC infrastructures. However, this task requires significant work and effort to learn parallel programming. Parallel libraries and high-level, easy-to-use languages are therefore needed to hide the parallelization complexity of programs. Another important issue that can influence the performance of HPC systems is vectorisation. Many scientific codes have vectorisation potential that cannot be exploited due to an algorithm-driven choice of data layouts. In their article Data Layout Inference for Code Vectorisation, Sinkarovs and Scholz 14 propose an interesting approach for automatically generating efficient vectorised code by focusing on the evaluation of a family of data layout transformations. The authors demonstrate the effectiveness of their approach by applying it to an N-body simulation. Coullon et al. 15, in their paper Implicit parallelism on 2D meshes using SkelGIS, tackle the issue of overcoming restrictions in the parallelization of scientific simulations caused by the complexity of functional concepts and specific features. Parallelizing scientific simulations requires a great deal of effort and field-specific knowledge to produce efficient parallel programs. For this reason, the authors introduce SkelGIS as a solution for abstracted and implicit parallelism. They apply SkelGIS to solve heat equations and shallow-water equations and compare both its performance and its programming effort with Message Passing Interface (MPI) counterparts. The DARPA High Productivity Computing Systems program (2002–2011) defined and released benchmark suites for measuring performance, portability, programmability, robustness, and productivity in the HPC domain 3, 16.
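The data-layout transformations Sinkarovs and Scholz reason about typically amount to converting an array-of-structures (AoS) into a structure-of-arrays (SoA), so that each field becomes contiguous and unit-stride, which SIMD hardware vectorises well. A toy sketch of the transformation (function and data are our illustration, not the paper's compiler):

```python
# Array-of-structures (AoS) vs structure-of-arrays (SoA) for an N-body-style
# update. SoA keeps each coordinate contiguous, enabling unit-stride SIMD loops.
def aos_to_soa(bodies):
    """Convert [(x, y, z), ...] into ([x...], [y...], [z...])."""
    xs, ys, zs = zip(*bodies) if bodies else ((), (), ())
    return list(xs), list(ys), list(zs)

bodies = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
xs, ys, zs = aos_to_soa(bodies)
# A unit-stride update over one field -- the pattern a vectoriser targets:
xs = [x + 0.5 for x in xs]
print(xs)  # [1.5, 4.5]
```

The paper's contribution is inferring which such layout a program should use automatically, rather than requiring the programmer to restructure the code by hand.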
The DARPA suite comprises several performance tests required to examine the performance of, and classify, HPC architectures, languages, and libraries 6: (i) High-Performance Linpack, for evaluating the floating point rate of execution when solving a linear system of equations; (ii) Double-precision GEneral Matrix Multiply (DGEMM), for measuring the floating point rate of execution of double-precision real matrix–matrix multiplication; (iii) STREAM, for evaluating sustainable memory bandwidth; (iv) PTRANS, for testing the total communication capacity of the network; (v) RandomAccess, for measuring the rate of integer random updates of memory; (vi) Fast Fourier Transform, for evaluating the floating point rate of execution of double-precision complex one-dimensional Discrete Fourier Transforms; and (vii) b_eff, for measuring the latency and bandwidth of a number of simultaneous communication patterns. Heinecke et al. 17, in their article Data Mining on Vast Datasets as a Cluster System Benchmark, review the current state of benchmarking and procurement of supercomputers and describe the trend away from the Linpack benchmark toward miniapp benchmarks that better reflect performance in real usage. They also describe the difficulty of optimizing benchmarks for new architectures. They discuss a data mining application that is compared on different (accelerated) cluster architectures and the optimizations needed to run efficiently on different platforms. In their work, the authors demonstrate such an optimization for a data mining algorithm that solves regression and classification problems on vast datasets. In other words, they propose a data mining application as a cluster system benchmark, using overlap of computation and communication to hide latency and reduce overhead.
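As a concrete illustration of one of the tests listed above, the STREAM triad kernel computes a[i] = b[i] + s·c[i] and counts the bytes moved. The real benchmark is compiled C/Fortran with carefully controlled arrays; this pure-Python toy only shows the access pattern and a naive bandwidth estimate.

```python
import time

# Toy rendition of the STREAM "triad" kernel used to characterise sustainable
# memory bandwidth. Array sizes and the bandwidth figure are illustrative only;
# real STREAM uses compiled, vectorised loops over arrays larger than cache.
def stream_triad(n: int, scalar: float = 3.0):
    b = [1.0] * n
    c = [2.0] * n
    start = time.perf_counter()
    a = [b[i] + scalar * c[i] for i in range(n)]
    elapsed = time.perf_counter() - start
    # Triad touches three 8-byte doubles per iteration (read b, read c, write a).
    bandwidth_bytes_per_s = 3 * 8 * n / elapsed if elapsed > 0 else float("inf")
    return a, bandwidth_bytes_per_s

a, bw = stream_triad(10_000)
print(a[0])  # 1.0 + 3.0 * 2.0 = 7.0
```

The byte-counting convention (three arrays per iteration) mirrors how STREAM reports its sustainable bandwidth figure.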
Heinecke et al.'s experiments were conducted with different datasets on the SuperMUC machine at the Leibniz-Rechenzentrum, the local CoolMAX AMD GPU cluster in Munich, the Phi-accelerated Beacon at the University of Tennessee, and the Todi Cray XK7 at the Swiss National Supercomputing Centre. The performance results show that in strong scaling settings, GPUs and coprocessors suffer from a lack of parallelism and do not perform as well at large scale; in weak scaling settings, however, they consistently outperform their CPU counterparts. Several challenges, as stated previously, have been identified in creating large-scale computing systems that meet current application requirements. These computing systems may rely on distributed computing mechanisms, often implemented as clusters and clouds, to provide continuous access to a variety of resources, for example, processing cores, large data stores, and information repositories. For example, a computational grid is a distributed computing infrastructure that can provide globally available network resources. These environments have the potential and ability to integrate large-scale computing resources on demand. Users' ability to compute will no longer be limited to the resources they currently have at hand or to those statically located on a set of hosts known a priori. However, the main challenge in large-scale computing is how to program and control these distributed systems (clouds or exascale machines) with up to a billion nodes. Evidently, algorithms and simulators have to be developed to explore these new infrastructures, and distributed peer-to-peer control mechanisms could be devised and used to simulate new algorithms and computing architectures. In their article Flexible Replica Placement for Optimized P2P Backup on Heterogeneous, Unreliable Machines, Skowron and Rzadca 18 tackle the issue of data replication over distributed P2P systems built on unreliable machines.
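The replica placement problem just introduced can be conveyed with a toy model: given machines with independent availability probabilities, add replicas until the chance that at least one copy is online reaches a target. The greedy rule and all numbers below are our simplification; the paper's flexible placement strategy is considerably more sophisticated.

```python
# Toy replica placement on unreliable machines: greedily pick the most
# available machines until the probability that at least one replica is
# online reaches `target`. Availabilities are assumed independent.
def place_replicas(availabilities, target: float):
    """Return (chosen machines, achieved availability)."""
    chosen, p_all_down = [], 1.0
    for machine, avail in sorted(availabilities.items(), key=lambda kv: -kv[1]):
        chosen.append(machine)
        p_all_down *= (1.0 - avail)        # all chosen replicas offline at once
        if 1.0 - p_all_down >= target:
            break
    return chosen, 1.0 - p_all_down

machines = {"m1": 0.9, "m2": 0.8, "m3": 0.5}
replicas, availability = place_replicas(machines, target=0.95)
print(replicas)  # ['m1', 'm2'] -> combined availability 1 - 0.1*0.2 = 0.98
```

Heterogeneity shows up here as differing availabilities; the paper additionally optimizes against storage capacity and other per-machine constraints.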
Skowron and Rzadca introduce a P2P backup system architecture using an optimal replication strategy for storing data over distributed P2P systems. Furthermore, most research to date has concentrated on static approaches tailored to parallelizing existing applications on different HPC systems. However, the rapid growth in the size and complexity of contemporary distributed parallel applications, usually assembled out of a set of interacting software components executed over distributed and heterogeneous platforms, makes such approaches unsuitable for these dynamic environments. Therefore, dynamic approaches are required that allow a system to autonomously adapt its structure and its behavior during the course of its operation. In other words, these approaches allow the system to automatically modify its configuration according to the settings of its computing environment and the properties of its workload. They are motivated mainly by the following issues. First, the high number of nodes makes the system vulnerable to failures, so its ability to react autonomously to faults is essential; for example, the system's nodes should react to the changing environment by taking over pending tasks from faulty nodes. Second, static and centralized configurations of these systems are difficult or even impossible to use for dynamic and large-scale applications; for instance, in a large system with thousands of nodes, several applications could compete for resources, and managing these resources at runtime in a decentralized manner is a challenging task. Mencagli 19, in his article Adaptive Model Predictive Control of Autonomic Distributed Parallel Computations with Variable Horizons and Switching Costs, addresses the dynamic reconfiguration of parallel computations by proposing an automatic method for reconfiguring distributed parallel computations based on an autonomic computing approach.
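In a much-simplified, purely reactive form (not Mencagli's predictive controller), autonomic adjustment of the degree of parallelism with a switching cost might look like the sketch below. The linear-scaling assumption, thresholds, and names are ours.

```python
# Deliberately simplified stand-in for the model-predictive control discussed
# below: move the worker count toward a target throughput, but only when the
# expected gain outweighs a reconfiguration (switching) cost.
def next_parallelism(current: int, observed_tput: float, target_tput: float,
                     switch_cost: float = 0.1, max_workers: int = 64) -> int:
    """Return the new worker count; stay put if the change is not worth the cost."""
    if observed_tput <= 0:
        return current
    desired = current * target_tput / observed_tput   # assume linear scaling
    if abs(desired - current) / current <= switch_cost:
        return current   # gain too small to justify a reconfiguration
    return max(1, min(max_workers, round(desired)))

print(next_parallelism(8, observed_tput=400.0, target_tput=800.0))  # 16
print(next_parallelism(8, observed_tput=780.0, target_tput=800.0))  # 8 (within cost band)
```

Where this sketch reacts to the last measurement only, a model predictive controller optimizes the reconfiguration sequence over a look-ahead horizon, which is exactly the dimension (horizon length, switching cost models) the paper studies.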
The approach monitors the behavior of parallel modules and adjusts the degree of parallelism in order to achieve a global optimization while balancing the number of reconfigurations against performance and efficiency. It is based on model predictive control, and the paper investigates in particular the influence of different horizon lengths and of different models for switching costs. The proposed method was evaluated using a video-streaming application with a synthesized workload. A model predictive control-based policy is evaluated with fixed and variable horizons, respectively, and the reported results show the solution's effectiveness in improving the target properties of the adaptation process. Such autonomic control models, which automate reconfiguration management based on current system load and application status, could be employed in cloud computing platforms, where decision-making strategies are required for resource management. Several compute-intensive and emerging applications have also been the subject of extensive research. These applications range from life sciences (e.g., medical imaging and gene sequencing), financial trading, and oil and gas exploration to bioscience, combustion (e.g., complex fluid simulation), astrophysics (e.g., formation of stars, evolution of galaxies), and the environment (e.g., modeling the world's climate). The special issue includes a few such application areas with interesting representative works. Future human brain neuroimaging requires the integration of HPC to achieve high temporal and spatial resolutions. Salman et al.
20, in their article Concurrency in Electrical Neuroinformatics: Parallel Computation for Studying the Volume Conduction of Brain Electrical Fields in Human Head Tissues, highlight the necessity of integrating HPC tools and techniques in order to obtain a systematic methodology for analyzing the main factors and studying the interdependent parameters that affect the accuracy of solutions, especially for large, multi-dimensional images. Their paper discusses challenges in human brain neuroimaging, particularly how to achieve high temporal and spatial resolution. They provide two accurate, efficient, and reliable finite difference method-based forward solvers, parallelized using OpenMP in shared memory and CUDA on GPUs, to show that advances in neuroimaging science and engineering will depend significantly on HPC integration. In their article A Novel Technique for Detecting Suspicious Lesions in Breast Ultrasound Images, Karimi and Krzyzak 21 address the important practical problem of automatically classifying breast lesion ultrasound images. The automatic classification of suspicious masses in ultrasound images is of fundamental importance in oncology. The main advantage of ultrasound is that it is a noninvasive diagnostic tool; its main disadvantage is the heavy presence of acoustic noise. Any progress in automatic breast cancer classification using ultrasound may have a significant impact on the early detection and treatment of breast cancer. The authors tackle this issue by introducing a novel automatic classification technique for suspicious breast lesions in ultrasound images. The proposed system is a pipeline consisting of several functional components. The first component uses fuzzy logic together with texture and morphology for denoising and segmentation of suspicious lesions. The second component deals with feature extraction and selection.
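Finite-difference forward solvers of the kind Salman et al. parallelize are built from stencil sweeps; a minimal one-dimensional Jacobi relaxation of Laplace's equation shows the core update. Grid size, boundary values, and iteration count below are illustrative; the paper solves three-dimensional head-tissue models and parallelizes such sweeps with OpenMP and CUDA.

```python
# Minimal 1D finite-difference (Jacobi) sweep for Laplace's equation: each
# interior point relaxes toward the average of its neighbours. This is the
# stencil pattern the OpenMP/CUDA forward solvers above parallelize in 3D.
def jacobi_laplace_1d(left: float, right: float, n_interior: int, iters: int):
    """Relax interior potentials toward the linear steady state."""
    u = [left] + [0.0] * n_interior + [right]
    for _ in range(iters):
        u = [u[0]] + [(u[i - 1] + u[i + 1]) / 2.0 for i in range(1, len(u) - 1)] + [u[-1]]
    return u

u = jacobi_laplace_1d(left=1.0, right=0.0, n_interior=3, iters=200)
print([round(v, 3) for v in u])  # converges to [1.0, 0.75, 0.5, 0.25, 0.0]
```

Each sweep touches every interior point independently of the others, which is why the update parallelizes so naturally across threads or GPU blocks.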
Karimi and Krzyzak considered geometrical, texture, and morphological features. After applying sequential forward and backward searches, they selected the best features and passed them to the third component, which implements a support vector machine classifier categorizing suspicious lesions into benign and malignant classes. The system was validated by a computer experiment on 80 real images; according to the authors, its performance reached a 98% success rate. It was then compared with two other methods, which the proposed system significantly outperformed. Pattern recognition is another research field, one that focuses on the recognition of patterns and regularities in data. Most approaches used in pattern recognition employ classification methods. Support vector machines (SVMs) are among the most widely used classification techniques in the pattern recognition community. An SVM is a supervised learning model with associated learning algorithms used for the classification and regression analysis needed for recognizing patterns and analyzing data. In other words, an SVM is mainly a classifier that performs classification tasks by constructing hyperplanes in a multidimensional space. Chen et al. 22, in their article Sparse Support Vector Machine for Pattern Recognition, propose to improve SVM classification by using sparse SVM classification and examine the sparse SVM's performance on pattern recognition tasks. The authors implement a sparse SVM, based on the LIBSVM source code, with the RBF kernel and verify its effectiveness on eight datasets from LIBSVM: SVMGuide4, vowel, SVMGuide3, DNA, Satimage, SVMGuide1, stage, and yeast. The experimental results show that the proposed SVM is feasible in practical pattern recognition applications. Codreanu et al.
23, in their paper Evaluating Automatically Parallelized Versions of the Support Vector Machine, deal with parallelizing the SVM supervised learning algorithm on multicore computers. They propose a new gradient-ascent-based SVM algorithm combined with a particle swarm optimization algorithm for automatic parameter tuning. The authors investigated two parallelization approaches on the GPU, using the GPSME toolkit and OpenACC. The reported results demonstrate an important speed-up for the proposed approach when compared with the CPU and OpenACC versions. Lattice Boltzmann methods are a class of compute-intensive applications for complex fluid simulation that have attracted interest from researchers in computational physics. They are typical examples of the large class of algorithms used to simulate different types of flow (e.g., water, oil, and gas) that have demanding resolution and memory requirements. Therefore, optimizing them on recent platforms and for different application cases has been researched intensively over the last 10 years. In their article Chip-level and Multi-node Analysis of Energy-optimized Lattice-Boltzmann CFD Simulation, Wittmann et al. 24 analyze the behavior of D3Q19 lattice-Boltzmann solvers on modern HPC systems. They first present chip-level models for both performance and energy consumption. The authors also analyze the performance of MPI-parallel runs, showing that their chip-level models are effective tools for identifying optimal operating points for large-scale simulations. They highlight the importance of single-core performance and of the choice of the number of cores used per chip to guide energy optimization. The articles presented in this special issue provide recent advances in fields related to High Performance Computing and simulation.
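The idea behind the chip-level energy models above can be conveyed with a generic sketch: if performance saturates with core count (memory-bound behaviour) while power grows linearly with active cores, energy-to-solution has an interior minimum. All constants below are illustrative assumptions, not the paper's measured model.

```python
# Generic energy-to-solution sketch in the spirit of the chip-level models
# above: runtime saturates with core count while power grows linearly with
# active cores, so some intermediate core count minimizes energy.
def runtime(cores: int, work: float = 1000.0, per_core_perf: float = 10.0,
            saturation_perf: float = 60.0) -> float:
    perf = min(cores * per_core_perf, saturation_perf)  # saturating performance
    return work / perf

def energy(cores: int, base_power: float = 20.0, per_core_power: float = 5.0) -> float:
    return (base_power + per_core_power * cores) * runtime(cores)

best = min(range(1, 13), key=energy)
print(best)  # 6: exactly enough cores to saturate; extra cores only add power
```

This is precisely the "optimal operating point" intuition: once memory bandwidth is saturated, additional cores per chip cost power without reducing runtime.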
The manuscripts in this issue undertake research on different topics including HPC, distributed/P2P systems, data mining, and GPU/multicore systems, as well as real-world simulations related to CFD, neuroinformatics, bioinformatics, and weather forecasting performed on large computational infrastructures. We hope that readers will benefit from the perspectives presented in this special issue and will contribute to these strategically important, exciting, and fast-growing research areas. The guest editors wish to express their sincere gratitude to all of the authors who submitted their papers to this special issue. We are also grateful to the Reviewing Committee for their hard work and the feedback they provided to the authors. As guest editors, we also wish to express our gratitude to the Editor-in-Chief, Geoffrey C. Fox, for the opportunity to edit this special issue, for his assistance during its preparation, and for giving the authors the opportunity to present their work in the international journal Concurrency and Computation: Practice and Experience. Lastly, we wish to thank the Journal's staff for their assistance and suggestions.
We acknowledge the following Reviewing Committee members: Chaker El Amrani (Morocco), Andres Avila (Chile), Carlos Berderian (Argentina), Massimo Cafaro (Italy), Ron Chiang (USA), Antonio Cofino (Spain), Alessandro D'Anca (Italy), Minh Ngoc Dinh (Australia), Laurent d'Orazio (France), Anders Eklund (Sweden), Francoise Baude (France), Jaafar Gaber (France), Frederic Gava (France), Ivan Gonzalez (Spain), William Gropp (USA), Bilel Hadri (USA), Miaoqing Huang (USA), Atman Jbari (Morocco), Chao Jin (Australia), William Johnston (USA), Abdullah Kayi (USA), Harald Koestler (Germany), Harald Kosch (Germany), Dieter Kranzlmueller (Germany), Erwin Laure (Sweden), Sergio Lopez (Spain), Nouredine Melab (France), Mariofanna Milanova (USA), Maria Mirto (Italy), Vikram Narayana (USA), Christian Obrecht (France), Amanda Peters Randles (USA), Volkmar Schau (Germany), Olivier Serres (USA), Suboh Suboh (USA), Xiaoping Sun (China), Osamu Tatebe (Japan), Christian Trefftz (USA), Ventzeslav Valev (Bulgaria), Timothy J. Williams (USA), Ramin Yahyapour (Germany), Chao-Tung Yang (Taiwan), Mostapha Zbakh (Morocco), Yong Zhao (USA).
