Muthu Manikandan Baskaran, J. Ramanujam, P. Sadayappan
Graphics Processing Units (GPUs) offer tremendous computational power. CUDA (Compute Unified Device Architecture) provides a multi-threaded ... parallel view make manual development of high-performance CUDA code rather complicated. Hence the automatic transformation of sequential input programs into efficient parallel CUDA programs is of considerable interest. This paper describes an automatic code transformation system that generates parallel CUDA code from input sequential C code, for regular ( ... optimization practically effective, we develop a C-to-CUDA transformation system that generates two-level parallel CUDA ...
Topic(s): Real-Time Systems Scheduling
2010 - Springer Science+Business Media | Lecture notes in computer science
M J Harvey, Gianni De Fabritiis
... The majority of this work has used the CUDA programming model supported exclusively by GPUs manufactured by ... Swan" for facilitating the conversion of an existing CUDA code to use the OpenCL model, as a means to aid programmers experienced with CUDA in evaluating OpenCL and alternative hardware. While the performance of equivalent OpenCL and CUDA code on fixed hardware should be comparable, we find that a real-world CUDA application ported to OpenCL exhibits an overall 50% ... portable GPU applications but that the more mature CUDA tools continue to provide best performance. Program title: ...
Topic(s): Software Testing and Debugging Techniques
2011 - Elsevier BV | Computer Physics Communications
Yukihiro Komura, Yutaka Okabe
We present new versions of sample CUDA programs for the GPU computing of the Swendsen–Wang multi-cluster spin flip algorithm. In this update, we add the method of ... 26316 Distribution format: tar.gz Programming language: C, CUDA. Computer: System with an NVIDIA CUDA enabled GPU. Operating system: No limits (tested on ... multi-cluster spin flip Monte Carlo method. The CUDA implementation for the cluster-labeling is based on ... for high-precision Monte Carlo simulations. In the CUDA, the cuRAND library [2], which focuses on the ...
Topic(s): Random Matrices and Applications
2015 - Elsevier BV | Computer Physics Communications
Emanuele Manca, Andrea Manconi, Alessandro Orro, Giuliano Armano, Luciano Milanesi
... the GPU‐quicksort, a compute‐unified device architecture (CUDA) iterative implementation, and the CUDA dynamic parallel (CDP) quicksort, a recursive implementation provided by NVIDIA Corporation. We propose CUDA‐quicksort an iterative GPU‐based implementation of the sorting algorithm. CUDA‐quicksort has been designed starting from GPU‐quicksort. ... performed on six sorting benchmark distributions show that CUDA‐quicksort is up to four times faster than ... An in‐depth analysis of the performance between CUDA‐quicksort and GPU‐quicksort shows that the main ...
Topic(s): Advanced Data Storage Technologies
2015 - Wiley | Concurrency and Computation Practice and Experience

Vladimir Lončar, Luis E. Young-S., Srdjan Škrbić, Paulsamy Muruganandam, Sadhan K. Adhikari, Antun Balaž
... new versions of the previously published C and CUDA programs for solving the dipolar Gross–Pitaevskii equation ... on distributed-memory systems. Finally, previous three-dimensional CUDA-parallelized programs are further parallelized using MPI, similarly ... comparison with the previous sequential C and parallel CUDA programs. The improvements to the sequential version yield ... on a computer cluster with 32 nodes used. CUDA/MPI version shows a speedup of 9–10 ... with 32 nodes. Program Title: DBEC-GP-OMP-CUDA-MPI: (1) DBEC-GP-OMP package: (i) imag1dX- ...
Topic(s): Cold Atom Physics and Bose-Einstein Condensates
2016 - Elsevier BV | Computer Physics Communications
Mubeen Ghafoor, Shahzaib Iqbal, Syed Ali Tariq, Imtiaz Ahmad Taj, Noman M. Jafri
... NVIDIA [23-25] introduced 'compute unified device architecture' (CUDA) in 2006. GPUs have been used efficiently in ... 2. Section 3 discusses the GPU and NVIDIA CUDA architecture. Section 4 discusses the proposed implementation of ... overview of the GPU architecture and introduces NVIDIA CUDA programming architecture. 3 GPU and NVIDIA CUDA architecture To transform or map CPU algorithm to ... power of GPU can be optimally utilised. NVIDIA CUDA is the hardware/software architecture where hardware architecture ...
Topic(s): Forensic Fingerprint Detection Methods
2017 - Institution of Engineering and Technology | IET Image Processing
Stefan K. Muller, Jan Hoffmann
... high throughput in vector-parallel applications. NVIDIA's CUDA toolkit seeks to make GPGPU programming accessible by ... small extension of C/C++. However, due to CUDA's complex execution model, the performance characteristics of CUDA kernels are difficult to predict, especially for novice ... paper introduces a novel quantitative program logic for CUDA kernels, which allows programmers to reason about both functional correctness and resource usage of CUDA kernels, paying particular attention to a set of ...
Topic(s): Embedded Systems Design Techniques
2021 - Association for Computing Machinery | Proceedings of the ACM on Programming Languages
Masashi Fukuzawa, Jeffrey G. Williams
ABSTRACT The cudA gene encodes a nuclear protein that is essential for normal multicellular development. At the slug stage cudA is expressed in the prespore cells and in ... show that cap site distal promoter sequences direct cudA expression in prespore cells, while proximal sequences direct ... acting part of the prespore domain of the cudA promoter. However, Dd-STATa cannot be utilised for ... shows that Dd-STATa is not necessary for cudA transcription in prespore cells. In contrast, the part of the cudA promoter that directs prestalk-specific expression contains a ...
Topic(s): Biocrusts and Microbial Ecology
2000 - The Company of Biologists | Development
Peitao Song, Zhijian Zhang, Qian Zhang, Liang Liang, Qiang Zhao
... cluster. In this paper, a heterogeneous MPI + OpenMP/CUDA parallel algorithm for solving the 2D neutron transport ... exploited through OpenMP (in CPU calculated domain) and CUDA (in GPU calculated domain) based on the ray ... Moreover, the strong scaling performance of the MPI + CUDA parallelization is studied through a performance analysis model ... GPUs, and the MPI communication in the MPI + CUDA parallel algorithm. And the corresponding conclusion is still tenable for the MPI + OpenMP/CUDA parallelization. The C5G7 2D benchmark and an extended ...
Topic(s): Advanced Neural Network Applications
2019 - Elsevier BV | Annals of Nuclear Energy
Yoko Yamada, Hong Yu Wang, Masashi Fukuzawa, Geoffrey J. Barton, Jeffrey G. Williams
CudA, a nuclear protein required for Dictyostelium prespore-specific gene expression, binds in vivo to the promoter ... 14 nucleotide region of the cotC promoter binds CudA in vitro and ECudA, an Entamoeba CudA homologue, also binds to this site. The CudA and ECudA DNA-binding sites contain a dyad and, consistent with a symmetrical binding site, CudA forms a homodimer in the yeast two-hybrid system. Mutation of CudA binding sites within the cotC promoter reduces expression from cotC in prespore cells. The CudA and ECudA proteins share a 120 amino acid ...
Topic(s): interferon and immune responses
2008 - The Company of Biologists | Development
Matthew J. Thurley, V. Danell
... for faster morphological image processing, and the NVIDIA CUDA architecture offers a relatively inexpensive and powerful framework ... generic morphological erosion and dilation operation in the CUDA NPP library is relatively naive, and performance scales ... morphological image processing community. Open-source extensions to CUDA (hereafter referred to as LTU-CUDA) have been produced for erosion and dilation using ... by forgoing the use of shared memory in CUDA multiprocessors. The vHGW algorithm for erosion and dilation ...
Topic(s): Advanced Neural Network Applications
2012 - Institute of Electrical and Electronics Engineers | IEEE Journal of Selected Topics in Signal Processing
... parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and ... our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of ... for different code sections. We implemented our proposed CUDA-NP framework using a directive-based compiler approach. ... like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU ... optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by ...
Topic(s): Interconnection Networks and Systems
2014 - Association for Computing Machinery | ACM SIGPLAN Notices
Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong, Wen‐mei Hwu
... this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow ... the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high- ... that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive ... best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and ...
Topic(s): Interconnection Networks and Systems
2013 - Association for Computing Machinery | ACM Transactions on Embedded Computing Systems
Yi Yang, Chao Li, Huiyang Zhou
... parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel ... our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of ... for different code sections. We implement our proposed CUDA-NP framework using a directive-based compiler approach. ... like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. ... been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by up ...
Topic(s): Interconnection Networks and Systems
2015 - Springer Science+Business Media | Journal of Computer Science and Technology
Michał Januszewski, Marcin Kostur
... with popular NVIDIA Graphics Processing Units using the CUDA programming environment. We address general aspects of numerical ... etc.: 5905 Distribution format: tar.gz Programming language: CUDA C Computer: any system with a CUDA-compatible GPU Operating system: Linux RAM: 64 MB ... 3 External routines: The program requires the NVIDIA CUDA Toolkit Version 2.0 or newer and the ... and perform the calculations on GPUs using the CUDA programming environment. The GPU's ability to execute ... question is performed on a GPU using the CUDA environment. Running time: < 1 minute
Topic(s): stochastic dynamics and bifurcation
2009 - Elsevier BV | Computer Physics Communications
Étiennette Combe, T. Achi, R. Pion, MC Valluy, ML Houlier, M. SALLAS, A. SELLE
... meet growth requirements. The CUDa of nitrogen is, respectively, 72 - 75 - ... case of the faba bean, lentil, and chickpea groups, but the CUDa of certain essential amino acids is markedly more ... 71 - 75 for valine, whereas the CUDa of arginine is always higher, 87 - ... to suit growth requirements. Nitrogen apparent digestibility coefficient (CUDa) was 72% in the faba bean, 75% in ... the chickpea groups respectively, but the CUDa of some essential amino acids was much lower: ... cystine, 73 - 71 - 75% for valine, while arginine CUDa values (87 - 87 - 82) were higher than all ...
Topic(s): Proteins in Food Systems
1991 - Elsevier BV | annales de biologie animale biochimie biophysique
... by using an extension to C language, in CUDA which is a parallel programming environment supported on ... Hwu is principal investigator for the first NVIDIA CUDA Center of Excellence at the University of Illinois ... It also covers data parallelism, the basics of CUDA memory/threading models, the CUDA extensions to the C language, and the basic ... 7) enhances student programming skills by explaining the CUDA memory model and its types, strategies for reducing global memory traffic, the CUDA threading model and granularity which include thread scheduling ...
Topic(s): Cloud Computing and Resource Management
2010 - | Scalable Computing Practice and Experience
Wenqian Jiang, Menghao Zhang, Yichen Wang
... from vegetations. Nevertheless, the Compute Unified Device Architecture (CUDA) gives developers access to the virtual instruction set ... memory of the parallel computational elements in the CUDA compatible Graphics Processing Unit (GPU), which encourages us to develop a CUDA-based simulator for the solution. This paper analyzes the radiative transfer method and the CUDA architecture, and then presents a CUDA parallel algorithm for calculating the EM scattering from a two-layer vegetation canopy. In the CUDA-based simulation, with a GTS250 GPU as, which ...
Topic(s): Cryospheric studies and observations
2010 - Taylor & Francis | Journal of Electromagnetic Waves and Applications
Haixiang Shi, Bertil Schmidt, Weiguo Liu, Wolfgang Müller‐Wittig
... we have used the Compute Unified Device Architecture (CUDA) programming model to design and implement a new parallel algorithm. Our implementation, called CUDA-MI, can achieve speedups of up to 82 ... datasets. We have used the results obtained by CUDA-MI to infer gene regulatory networks (GRNs) from ... existing methods including ARACNE and TINGe show that CUDA-MI produces GRNs of higher quality in less time. CUDA-MI is publicly available open-source software, written in CUDA and C++ programming languages. It obtains significant speedup ...
Topic(s): DNA and Biological Computing
2011 - BioMed Central | BMC Research Notes
... well. In this paper, we will show how CUDA can fully utilize the tremendous power of these GPUs. CUDA is NVIDIA's parallel computing architecture. It enables ... power of the GPU. This paper talks about CUDA and its architecture. It takes us through a comparison of CUDA C/C++ with other parallel programming languages like ... paper also lists out the common myths about CUDA and how the future seems to be promising for CUDA.
Topic(s): Advanced Image and Video Retrieval Techniques
2012 - | Advanced Computing An International Journal
Yukihiro Komura, Yutaka Okabe
We present sample CUDA programs for the GPU computing of the Swendsen–Wang multi-cluster spin flip algorithm. We deal with the classical ... 14688 Distribution format: tar.gz Programming language: C, CUDA. Computer: System with an NVIDIA CUDA enabled GPU. Operating system: System with an NVIDIA CUDA enabled GPU. Classification: 23. External routines: NVIDIA CUDA Toolkit 3.0 or newer Nature of problem: ... multi-cluster spin flip Monte Carlo method. The CUDA implementation for the cluster-labeling is based on ...
Topic(s): Random Matrices and Applications
2013 - Elsevier BV | Computer Physics Communications
Panagiotis D. Michailidis, Konstantinos G. Margaritis
... Processing Units (GPUs) using Compute Unified Device Architecture (CUDA) programming model. In this work we discuss a naive and two optimised CUDA algorithms for the two kernel estimation methods: univariate ... also present exploratory experimental results of the proposed CUDA algorithms according to the several values of parameters ... results show significant performance gains of all proposed CUDA algorithms over serial CPU version and small performance speed-ups of the two optimised CUDA algorithms over naive GPU algorithms. Finally, based on ...
Topic(s): Advanced Data Compression Techniques
2013 - | Applied Mathematical Sciences
Vincent Roberge, Mohammed Tarbouchi
... optimization (PSO) on graphical processing units (GPU) using CUDA. By fully utilizing the processing power of graphic processors, our implementation (CUDA-PSO) provides a speedup of 167× compared to ... CPU, it may be unfair to compare our CUDA implementation to a sequential one. For this reason, ... MPI-PSO) and compared its performance against our CUDA-PSO. The execution time of our CUDA-PSO remains 15.8× faster than our MPI- ... statistical significance that the results obtained using our CUDA-PSO are of equal quality as the results ...
Topic(s): Islanding Detection in Power Systems
2013 - Imperial College Press | International Journal of Computational Intelligence and Applications
... multi-level parallelism on GPU clusters with MPI-CUDA and hybrid MPI-OpenMP-CUDA parallel implementations, in which all computations are done on the GPU using CUDA. We explore efficiency and scalability of incompressible flow ... merging fine-grain parallelism on the GPU using CUDA with coarse-grain parallelism that use either MPI ... large data sets, and a dual-level MPI-CUDA implementation with maximum overlapping of computation and communication ... also find that our tri-level MPI-OpenMP-CUDA parallel implementation does not offer a significant advantage ...
Topic(s): Plant Virus Research Studies
1956 - Elsevier BV | Experimental Parasitology
Yongchao Liu, Bertil Schmidt, Weiguo Liu, Douglas L. Maskell
... to employ emerging many-core architectures such as CUDA-enabled GPUs. In this paper, we present a ... of the MEME motif discovery algorithm using the CUDA programming model. To achieve high efficiency, we introduce ... ZOOPS) motif search model. The runtime speedups of CUDA–MEME on a single GPU are also comparable ... workstation cluster. In addition to the fast speed, CUDA–MEME has the capability of finding motif instances ...
Topic(s): Fractal and DNA sequence analysis
2009 - Elsevier BV | Pattern Recognition Letters
Yonghong Yan, Max Grossman, Vivek Sarkar
... GPGPUs) to obtain order-of-magnitude performance improvements. CUDA has emerged as a popular programming model for ... and C#, it is natural to explore how CUDA-like capabilities can be made accessible to those ... can be used by Java programmers to invoke CUDA kernels. Using this interface, programmers can write Java codes that directly call CUDA kernels, and delegate the responsibility of generating the Java-CUDA bridge codes and host-device data transfer calls ...
Topic(s): Advanced Data Storage Technologies
2009 - Springer Science+Business Media | Lecture notes in computer science
Tianyi David Han, Tarek S. Abdelrahman
... GPU programmability. Although the Compute Unified Device Architecture (CUDA) is a simple C-like interface for programming NVIDIA GPUs, porting applications to CUDA remains a challenge to average programmers. In particular, CUDA places on the programmer the burden of packaging ... hiCUDA, a high-level directive-based language for CUDA programming. It allows programmers to perform these tedious ... compiler that translates a hiCUDA program to a CUDA program. Our compiler is able to support real- ... and use dynamically allocated arrays. Experiments using nine CUDA benchmarks show that the simplicity hiCUDA provides comes ...
Topic(s): Real-Time Systems Scheduling
2010 - Institute of Electrical and Electronics Engineers | IEEE Transactions on Parallel and Distributed Systems
Wladimir J. van der Laan, Andrei C. Jalba, Jos B. T. M. Roerdink
... regarded as massively parallel coprocessors through NVidia's CUDA compute paradigm. The three main hardware architectures for ... based) are shown to be unsuitable for a CUDA implementation. Our CUDA-specific design can be regarded as a hybrid ... to an optimized CPU implementation and earlier non-CUDA-based GPU DWT methods, both for 2D images ... performance analysis shows that the results of our CUDA-specific design are in close agreement with our ...
Topic(s): Digital Filter Design and Implementation
2010 - Institute of Electrical and Electronics Engineers | IEEE Transactions on Parallel and Distributed Systems
Tomasz Dziubak, Jacek Matulewski
... FFT algorithm. The solution is based on NVIDIA CUDA technology. The speed-up factor in the test ... format: tar.gz Programming language: C++, C for CUDA Computer: Graphics card with CUDA technology recommended Operating system: No limits (tested on ... of processors used – one CPU core and all CUDA cores of the selected processor of graphics card ... equation. Solution method: FFT and Chebyshev polynomial algorithm, CUDA technology. Running time: Every test example included in ...
Topic(s): Spectroscopy and Quantum Chemical Studies
2011 - Elsevier BV | Computer Physics Communications
Daniel Kuchelmeister, Thomas Müller, Marco Ament, Günter Wunner, Daniel Weiskopf
... GPU using NVidia’s Compute Unified Device Architecture (CUDA), which leads to performance improvement of an order ... 1334251 Distribution format: tar.gz Programming language: C++, CUDA. Computer: Linux platforms with a NVidia CUDA enabled GPU (Compute Capability 1.3 or higher), C++ compiler, NVCC (The CUDA Compiler Driver). Operating system: Linux. RAM: 2 GB ... External routines: OpenGL Utility Toolkit development files, NVidia CUDA Toolkit 3.2, Lua5.2 Nature of problem: ... of light rays, GPU-based parallel programming using CUDA, 3D-Rendering via OpenGL. Running time: Problem dependent, ...
Topic(s): Pulsars and Gravitational Waves Research
2012 - Elsevier BV | Computer Physics Communications