The need to rank and order data is pervasive, and many algorithms are fundamentally dependent upon sorting and partitioning operations. Prior to this work, GPU stream processors have been perceived as challenging targets for problems with dynamic and global data-dependences such as sorting. This paper presents: (1) a family of very efficient parallel algorithms for radix sorting; and (2) our allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism. We demonstrate multiple factors of speedup (up to 3.8x) compared to state-of-the-art GPU sorting. We also reverse the performance differentials observed between GPU and multi/many-core CPU architectures by recent comparisons in the literature, including those with 32-core CPU-based accelerators. Our average sorting rates exceed 1B 32-bit keys/sec on a single GPU microprocessor. Our sorting passes are constructed from a very efficient parallel prefix scan "runtime" that incorporates three design features: (1) kernel fusion for locally generating and consuming prefix scan data; (2) multi-scan for performing multiple related, concurrent prefix scans (one for each partitioning bin); and (3) flexible algorithm serialization for avoiding unnecessary synchronization and communication within algorithmic phases, allowing us to construct a single implementation that scales well across all generations and configurations of programmable NVIDIA GPUs.
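The paper's kernels fuse the histogram, prefix-scan, and scatter stages on the GPU; as a minimal CPU reference for what a single radix pass computes (the digit width and key size here are chosen arbitrarily, not taken from the paper), a NumPy sketch:

```python
import numpy as np

def radix_pass(keys: np.ndarray, shift: int, bits: int = 4) -> np.ndarray:
    """One stable radix-sort pass: scatter keys by the `bits`-wide digit
    starting at bit `shift`. The per-bin histogram + exclusive prefix scan
    below is the work the paper's fused multi-scan kernels perform on-GPU."""
    radix = 1 << bits
    digits = ((keys >> shift) & (radix - 1)).astype(np.int64)
    counts = np.bincount(digits, minlength=radix)            # histogram per bin
    offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))  # exclusive scan
    out = np.empty_like(keys)
    for i, d in enumerate(digits):                           # stable scatter
        out[offsets[d]] = keys[i]
        offsets[d] += 1
    return out

def radix_sort(keys: np.ndarray, key_bits: int = 32, bits: int = 4) -> np.ndarray:
    for shift in range(0, key_bits, bits):
        keys = radix_pass(keys, shift, bits)
    return keys

keys = np.random.randint(0, 2**32, size=1 << 12, dtype=np.uint32)
assert np.array_equal(radix_sort(keys), np.sort(keys))
```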
This study proposes a data fusion and deep learning (DL) framework that learns high-level traffic features from network-level images to predict large-scale, multi-route speed and volume of connected vehicles (CVs). We present a scalable and parallel method for processing statewide CV trajectory data that yields real-time insights at the micro scale in time and space (two-dimensional (2D) arrays) on graphics processing units (GPUs), using the NVIDIA RAPIDS framework and a Dask parallel cluster, which provided a 50× speed-up in the extract, transform and load (ETL) stage. A UNet model is then applied to extract features and predict multi-route speed and volume channels over a multi-step prediction horizon. The accuracy and robustness of the proposed model are evaluated across different road types, times of day and image snippets, and the model is compared to two benchmarks: Convolutional Long Short-Term Memory (ConvLSTM) and a historical average (HA). The results show that the proposed model outperforms the benchmarks with an average improvement of 15% over ConvLSTM and 65% over the HA. Comparing the image snippets from each prediction model to the actual image shows that UNet reproduced image textures more faithfully than the benchmark models. UNet's dominance in image prediction was also evident in multi-step forecasting, where the growth in error over longer prediction horizons was relatively small.
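The 2D-array representation mentioned above can be sketched with a small hypothetical example; here it is assumed to be road segments by time bins with mean speed as the pixel value, and the column names and bin sizes are illustrative, not the paper's schema:

```python
import numpy as np
import pandas as pd

# Hypothetical CV trajectory records: (route segment, timestamp, speed).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "segment": rng.integers(0, 64, 10_000),    # 64 segments along a route
    "t_sec":   rng.uniform(0, 3600, 10_000),   # one hour of pings
    "speed":   rng.uniform(20, 70, 10_000),    # mph
})
df["t_bin"] = (df["t_sec"] // 60).astype(int)  # 1-minute time bins

# Mean speed per (segment, minute) becomes one channel of the image
# that a UNet-style model can consume; volume would be a second channel.
img = (df.pivot_table(index="segment", columns="t_bin", values="speed")
         .reindex(index=range(64), columns=range(60))
         .to_numpy())
print(img.shape)                               # (64, 60)
```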
This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code for staggered fermions, purposely designed to be portable across different computer architectures, including GPUs and commodity CPUs. Portability is achieved using the OpenACC parallel programming model, which is used to develop a code that can be compiled for several processor architectures. The paper focuses on parallelization across multiple computing nodes, using OpenACC to manage parallelism within a node and OpenMPI to manage parallelism among nodes. We first discuss the available strategies for maximizing performance, then describe selected relevant details of the code, and finally measure the performance and scaling we are able to achieve. The work focuses mainly on GPUs, which offer significantly higher performance for this application, but also compares with results measured on other processors.
Improving image quality and rendering speed has always been a challenge for programmers involved in large-scale volume rendering, especially in the field of medical image processing. This paper performs volume rendering on the graphics processing unit (GPU), whose massively parallel capability has the potential to revolutionize this field. The final results allow doctors to diagnose and analyze 2D computed tomography (CT) scan data using three-dimensional visualization techniques. The system has been used on multiple types of datasets, with medical volume data ranging from 10 MB to 350 MB. Further, the use of the compute unified device architecture (CUDA) framework, a technology with a low learning curve, greatly reduces the cost involved in CT scan analysis and hence brings it to the common masses. Volume rendering has been performed on an Nvidia Tesla C1060 card (240 CUDA cores executing data in parallel), and its performance has been benchmarked.
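The abstract does not spell out the rendering kernel; a common formulation, and presumably the core of a GPU ray caster for CT data, is front-to-back alpha compositing along each ray, with one GPU thread per ray. A single-ray sketch with an illustrative early-termination threshold:

```python
import numpy as np

def composite_ray(samples: np.ndarray):
    """Front-to-back compositing along one ray. Each row of `samples`
    holds (r, g, b, alpha) for one step, already mapped from CT density
    by a transfer function; a GPU ray caster runs one thread per ray."""
    color, alpha = np.zeros(3), 0.0
    for r, g, b, a in samples:
        color += (1.0 - alpha) * a * np.array([r, g, b])
        alpha += (1.0 - alpha) * a
        if alpha > 0.99:                 # early ray termination
            break
    return color, alpha

samples = np.random.default_rng(1).uniform(0.0, 0.3, size=(128, 4))
print(composite_ray(samples))
```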
We present an overview of the Graphics Processing Unit (GPU)-based spatial processing system created for the Canadian Hydrogen Intensity Mapping Experiment (CHIME). The design employs AMD S9300x2 GPUs and readily available commercial hardware in its processing nodes to provide a cost- and power-efficient processing substrate. These nodes are supported by a liquid-cooling system which allows continuous operation with modest power consumption and in all but the most adverse conditions. Capable of continuously correlating 2048 receiver-polarizations across 400 MHz of bandwidth, the CHIME X-engine constitutes the most powerful radio correlator currently in existence. It receives 6.6 Tb/s of channelized data from CHIME’s FPGA-based F-engine, and the primary correlation task requires 8.39×10¹⁴ complex multiply-and-accumulate operations per second. The same system also provides formed-beam data products to commensal FRB and Pulsar experiments; it constitutes a general spatial-processing system of unprecedented scale and capability, with correspondingly great challenges in computation, data transport, heat dissipation, and interference shielding.
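The quoted operation count follows directly from the array size and bandwidth, assuming one complex sample per second per hertz per input and counting auto- as well as cross-correlations:

```python
N = 2048                  # receiver-polarizations
bandwidth = 400e6         # Hz: complex samples per second per input
pairs = N * (N + 1) // 2  # correlation products, incl. autocorrelations
print(f"{pairs * bandwidth:.3g}")   # 8.39e+14 CMACs per second
```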
Graphics Processing Units (GPUs) are application-specific accelerators that offer a high performance-to-cost ratio and are widely available and used, which makes them a ubiquitous accelerator. The computing paradigm built on them is the general-purpose computing on the GPU (GPGPU) model. Owing to its graphics lineage, the GPU is better suited to data-parallel, data-regular algorithms; its hardware architecture is less suitable for data-parallel but data-irregular algorithms such as graph connected components and list ranking.
In this paper, we present results that show how to use GPUs efficiently for graph algorithms, which are known to have irregular data-access patterns. We consider two fundamental graph problems: finding the connected components and finding a spanning tree. These two problems find applications in several other graph-theoretic problems. We arrive at efficient GPU implementations for both, focusing on minimising irregularity at both the algorithmic and implementation levels. Our implementation achieves a speedup of 11–16 times over the corresponding best sequential implementation.
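The abstract does not detail the algorithms, but GPU connected-components codes in this line of work are commonly built from two data-parallel primitives, hooking and pointer jumping; a minimal CPU sketch of that scheme (not necessarily the authors' exact variant):

```python
import numpy as np

def connected_components(n: int, edges: np.ndarray) -> np.ndarray:
    """Label the components of an undirected graph given as an (m, 2)
    edge array. Each round hooks trees together along edges whose
    endpoints have different roots, then flattens trees by pointer
    jumping; both steps are data-parallel on a GPU."""
    parent = np.arange(n)
    while True:
        u, v = parent[edges[:, 0]], parent[edges[:, 1]]
        if not np.any(u != v):
            return parent
        lo, hi = np.minimum(u, v), np.maximum(u, v)
        parent[hi] = lo                     # hook larger root onto smaller
        while np.any(parent[parent] != parent):
            parent = parent[parent]         # pointer jumping

edges = np.array([[0, 1], [1, 2], [3, 4]])
print(connected_components(6, edges))       # -> [0 0 0 3 3 5]
```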
The computational power requirements of real-world optimization problems are beginning to exceed the general performance of the Central Processing Unit (CPU). The modeling of such problems is in constant evolution and requires ever more computational power, and solving them is expensive in computation time: even metaheuristics, well known for their efficiency, are becoming unsuitable for the increasing amount of data. Recently, thanks to the advent of languages such as CUDA, the development of parallel metaheuristics on the Graphics Processing Unit (GPU) to solve combinatorial problems such as the Quadratic Assignment Problem (QAP) has received growing interest. The QAP is one of the most studied NP-hard problems and is known for its high computational cost. In this paper, we survey several of the most important metaheuristic approaches to the QAP, focusing on parallel metaheuristics using the GPU.
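For reference, the QAP seeks an assignment (permutation) of n facilities to n locations minimizing the flow-weighted distances; evaluating this objective dominates a metaheuristic's run time and is what GPU ports parallelize across candidate solutions. A minimal cost evaluation:

```python
import numpy as np

def qap_cost(flow: np.ndarray, dist: np.ndarray, perm: np.ndarray) -> float:
    """Cost of assigning facility i to location perm[i]:
    sum over i, j of flow[i, j] * dist[perm[i], perm[j]]."""
    return float(np.sum(flow * dist[np.ix_(perm, perm)]))

rng = np.random.default_rng(42)
n = 8
flow = rng.integers(0, 10, (n, n))
dist = rng.integers(0, 10, (n, n))
perm = rng.permutation(n)
print(qap_cost(flow, dist, perm))
```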
We propose a pre-stack reverse time migration (RTM) seismic imaging method using the pseudospectral time-domain (PSTD) algorithm. The traditional pseudospectral method uses the fast Fourier transform (FFT) algorithm to calculate spatial derivatives, but is limited by the wraparound effect due to the periodicity assumed by the FFT. The PSTD algorithm combines the pseudospectral method with a perfectly matched layer (PML) for acoustic waves. The PML is a highly effective absorbing boundary condition that eliminates the wraparound effect, enabling wide application of the pseudospectral method to complex models. RTM based on the PSTD algorithm has advantages in computational efficiency over traditional methods such as second-order and high-order finite-difference time-domain (FDTD) methods. In this work, we implement the PSTD algorithm for acoustic-wave-equation-based RTM. Applying the PSTD-RTM method to various seismic models and comparing it with RTM based on the eighth-order FDTD method, we find that the PSTD-RTM method performs better and saves more than 50% of memory. The method is well suited to parallel computation and has been accelerated on general-purpose graphics processing units.
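The pseudospectral building block is the spectral derivative, du/dx = F⁻¹(ik F(u)); a minimal NumPy check against an analytic derivative, which is exact for periodic band-limited inputs (the FFT's periodicity is what produces the wraparound that the PML absorbs):

```python
import numpy as np

def spectral_derivative(u: np.ndarray, dx: float) -> np.ndarray:
    """d/dx via FFT: differentiation is multiplication by ik in
    wavenumber space."""
    k = 2.0 * np.pi * np.fft.fftfreq(u.size, d=dx)
    return np.real(np.fft.ifft(1j * k * np.fft.fft(u)))

n, L = 256, 2.0 * np.pi
x = np.arange(n) * (L / n)
u = np.sin(3 * x)
assert np.allclose(spectral_derivative(u, L / n), 3 * np.cos(3 * x))
```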
We have investigated the folding of two helix-bundle proteins, the 36-residue Villin headpiece and the 56-residue E-domain of Staphylococcal protein A, by combining molecular dynamics (MD) simulations with the coarse-grained united-residue (UNRES) force field and an all-atom force field. Starting from extended structures, each protein was folded to a stable structure within a short time frame using the UNRES model; however, the secondary structures of the helices were not well formed. Further refinement using MD simulations with the all-atom force field was able to fold the proteins into native-like states, with a smallest main-chain root-mean-square deviation of around 3 Å. A detailed analysis of the folding trajectories is presented, and the performance of GPU-based MD simulations is also discussed.
The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core Graphics Processing Units (GPUs), exploiting aggressive data parallelism and delivering higher performance for streaming computations. In this scenario, code portability (and performance portability) becomes necessary for easy maintainability of applications; this is very relevant in scientific computing, where code changes are frequent and keeping different code versions aligned is tedious and error-prone. In this work, we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations, and showing that a good level of performance portability can be reached.
We demonstrate that parallel deterministic sample sort for many-core GPUs (GPU BUCKET SORT) is not only considerably faster than the best comparison-based sorting algorithm for GPUs (THRUST MERGE [Satish et al., Proc. IPDPS 2009]) but also as fast as randomized sample sort for GPUs (GPU SAMPLE SORT [Leischner et al., Proc. IPDPS 2010]). However, deterministic sample sort has the advantage that bucket sizes are guaranteed, so its running time does not exhibit the input-dependent fluctuations that can occur for randomized sample sort.
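The determinism comes from splitter selection: sort a regular (stride-based, not random) sample and pick equidistant splitters from it, which bounds every bucket's size. A sketch of one partitioning round, with the oversampling factor as an assumption:

```python
import numpy as np

def deterministic_buckets(a: np.ndarray, k: int, oversample: int = 8):
    """Split `a` into k buckets using splitters drawn deterministically
    from a sorted regular sample, so bucket sizes are bounded rather
    than subject to random fluctuation."""
    sample = np.sort(a[:: max(1, len(a) // (k * oversample))])
    splitters = sample[len(sample) // k :: len(sample) // k][: k - 1]
    ids = np.searchsorted(splitters, a, side="right")
    return [np.sort(a[ids == b]) for b in range(k)]

a = np.random.default_rng(3).integers(0, 10**6, 1 << 16)
buckets = deterministic_buckets(a, k=16)
assert np.array_equal(np.concatenate(buckets), np.sort(a))
print([len(b) for b in buckets])   # sizes stay near len(a) / k
```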
The implementation of stencil computations on modern, massively parallel systems with GPUs and other accelerators currently relies on manually tuned coding using low-level approaches like OpenCL and CUDA. This makes developing stencil applications a complex, time-consuming, and error-prone task. We describe how stencil computations can be programmed in our SkelCL approach, which combines high-level programming abstractions with competitive performance on multi-GPU systems. SkelCL extends the OpenCL standard with three high-level features: 1) pre-implemented parallel patterns (a.k.a. skeletons); 2) container data types for vectors and matrices; 3) an automatic data (re)distribution mechanism. We introduce two new SkelCL skeletons that specifically target stencil computations, MapOverlap and Stencil; we describe their use for particular application examples, discuss their efficient parallel implementation, and report experimental results on systems with multiple GPUs. Our evaluation of three real-world applications shows that stencil code written with SkelCL is considerably shorter and offers performance competitive with hand-tuned OpenCL code.
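SkelCL's actual C++/OpenCL API is not reproduced here, but the semantics of a MapOverlap-style skeleton (apply a user function at every element, with read access to a fixed-radius neighborhood) can be sketched compactly; the boundary-handling mode and the stencil function are illustrative:

```python
import numpy as np

def map_overlap(grid: np.ndarray, radius: int, f) -> np.ndarray:
    """Apply f(window) at every cell of a 2D grid, where the window is
    the (2r+1) x (2r+1) neighborhood around the cell. Edges use clamped
    (nearest-value) padding, one common boundary-handling choice."""
    p = np.pad(grid, radius, mode="edge")
    out = np.empty_like(grid)
    h, w = grid.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = f(p[i : i + 2 * radius + 1, j : j + 2 * radius + 1])
    return out

# A Jacobi-style smoothing step expressed as a 5-point stencil function.
heat = np.zeros((64, 64)); heat[32, 32] = 100.0
step = map_overlap(heat, 1, lambda w: (w[0, 1] + w[2, 1] + w[1, 0] + w[1, 2]) / 4)
print(step[31:34, 31:34])
```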
Web services are used globally by clients to accomplish required functionality over the web. As a result of their popularity and flexibility of usage, thousands of functionally similar web services are available over the network; hence it becomes necessary to select the optimal web service to satisfy the client's need. Various methodologies, such as machine learning, genetic algorithms, bio-inspired techniques, multi-criteria decision making (MCDM) methods and many others, aid in selecting the best web service from thousands of alternatives. The reference ideal method (RIM) is a state-of-the-art MCDM technique for selecting the optimal web service based on user inputs; in spite of its popularity, it is found to have multiple pitfalls that make the selection process less effective. This paper proposes a novel MCDM methodology named improved RIM (I-RIM) to overcome the existing pitfalls in RIM, together with a novel framework that combines the power of the graphics processing unit (GPU) with I-RIM to enhance the efficiency of the selection process. When parallelized using the GPU, the proposed I-RIM outperforms the other parallelized MCDM techniques taken for study, and the results also show that I-RIM is more consistent and stable in the ranking process. The proposed framework incorporating I-RIM outperforms RIM in terms of execution time, mean reciprocal rank and Spearman's correlation coefficient, which makes the framework more stable and reliable and thus suitable for real-time web service selection.
An efficient parallel elastoplastic reanalysis method is proposed, whose backbone is combined approximations (CA) reanalysis. GPU parallel computation is used to accelerate assembly of the stiffness matrix; the assembly process is divided into an offline part for the strain matrices and an online part for the element stiffness matrices, which makes the structure of the program more rational and efficient. Pseudo-elastic analysis is introduced and extended to the load-increment method to make the CA method more practical. Numerical examples show that the proposed method significantly improves the efficiency of elastoplastic analysis while ensuring the accuracy of the results.
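In combined approximations, a factorization of the initial stiffness matrix K0 is reused after a modification dK: the basis r1 = K0^-1 f, r(i+1) = -K0^-1 dK r(i) spans a small subspace in which the modified system is solved. A dense NumPy sketch; the basis size, and the explicit inverse standing in for a reused factorization, are simplifications:

```python
import numpy as np

def ca_reanalysis(K0, dK, f, n_basis: int = 4):
    """Combined approximations: reuse (a stand-in for) the factorization
    of K0 to approximate the solution of (K0 + dK) u = f in the small
    subspace spanned by binomial-series basis vectors."""
    K0_inv = np.linalg.inv(K0)       # stands in for a reused factorization
    r = K0_inv @ f
    basis = [r]
    for _ in range(n_basis - 1):
        r = -K0_inv @ (dK @ r)       # r_{i+1} = -K0^{-1} dK r_i
        basis.append(r)
    R = np.column_stack(basis)
    y = np.linalg.solve(R.T @ (K0 + dK) @ R, R.T @ f)   # reduced system
    return R @ y

rng = np.random.default_rng(7)
n = 50
A = rng.standard_normal((n, n))
K0 = A @ A.T + n * np.eye(n)         # SPD "initial" stiffness
B = rng.standard_normal((n, n))
dK = 0.05 * (B + B.T)                # small symmetric modification
f = rng.standard_normal(n)
u = ca_reanalysis(K0, dK, f)
print(np.linalg.norm((K0 + dK) @ u - f) / np.linalg.norm(f))  # small residual
```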
Three-dimensional image reconstruction with the Feldkamp, Davis, and Kress (FDK) algorithm is the most time-consuming part of Micro-CT. Parallel algorithms based on computer clusters can accelerate image reconstruction; however, the hardware is very expensive. In this paper, using current graphics processing units (GPUs), we present a method based on the compute unified device architecture (CUDA) for speeding up the Micro-CT image reconstruction process. The most time-consuming filtering and back-projection parts of the FDK algorithm are parallelized for the CUDA architecture. The CUDA-based reconstruction speed and image quality are compared with CPU results for projection data from the Micro-CT system. The results show that CUDA-based 3D image reconstruction is ten times faster than reconstruction on the CPU. In conclusion, the CUDA-based FDK algorithm allows Micro-CT to reconstruct the 3D image right after the end of data acquisition.
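The filtering stage is a 1D ramp (|ω|) filter applied independently to each detector row before back-projection, which is why it parallelizes so well; a minimal frequency-domain sketch (practical FDK uses a band-limited ramp such as Ram-Lak plus cone-beam weighting, omitted here):

```python
import numpy as np

def ramp_filter_rows(projection: np.ndarray) -> np.ndarray:
    """Filter every detector row of one projection with |omega|.
    On the GPU, each row (or each frequency sample) maps to a thread."""
    n = projection.shape[-1]
    omega = np.abs(np.fft.fftfreq(n))          # ideal ramp, unit-normalized
    return np.real(np.fft.ifft(np.fft.fft(projection, axis=-1) * omega, axis=-1))

proj = np.random.default_rng(5).random((384, 512))   # rows x detector bins
filtered = ramp_filter_rows(proj)
print(filtered.shape)
```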
In this paper, we present a method for fluid simulation based on smoothed particle hydrodynamics (SPH) with fast collision detection against boundaries on the GPU. The major goal of our algorithm is fast SPH simulation and rendering on the GPU, and it has three main features. First, to make the SPH method GPU-friendly, we introduce spatial hashing for neighbor search: after sorting the particles by their grid index, neighbor search can be done quickly on the GPU. Second, we propose a fast particle-boundary collision detection method: by precomputing the distance field of the scene boundaries, the computing cost of collision detection is reduced to O(n), which is much faster than the traditional approach. Third, we propose a pipeline with fine-detail surface reconstruction and progressive photon mapping, both running on the GPU. We test our algorithm on scenes of various configurations and particle counts and obtain good results: our experiments show that we can simulate scenes of 100K, and up to 1000K, particles, the latter at a rate of approximately two frames per second.
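The neighbor-search scheme described above (hash particles to grid cells of size h, sort by cell index, then test the surrounding cells) can be sketched as follows; the GPU version replaces the argsort with a radix sort and runs one thread per particle:

```python
import numpy as np

def build_grid(pos: np.ndarray, h: float):
    """Sort particles by flattened grid-cell index (cell size = h)."""
    cells = np.floor(pos / h).astype(np.int64)
    cells -= cells.min(axis=0)
    dims = tuple(cells.max(axis=0) + 1)
    flat = np.ravel_multi_index(cells.T, dims)
    order = np.argsort(flat)          # the GPU version uses a radix sort here
    return pos[order], flat[order], dims

def neighbors(i: int, pos: np.ndarray, flat: np.ndarray, dims, h: float):
    """Candidate neighbors of particle i: its cell plus the 26 adjacent
    cells, followed by an exact distance test against h."""
    ci = np.array(np.unravel_index(flat[i], dims))
    cand = []
    for off in np.ndindex(3, 3, 3):
        c = ci + np.array(off) - 1
        if np.all(c >= 0) and np.all(c < dims):
            key = np.ravel_multi_index(c, dims)
            lo = np.searchsorted(flat, key)
            hi = np.searchsorted(flat, key, side="right")
            cand.extend(range(lo, hi))
    cand = np.array(cand)
    dist = np.linalg.norm(pos[cand] - pos[i], axis=1)
    return cand[(dist < h) & (cand != i)]

pos = np.random.default_rng(9).random((2000, 3))
p, flat, dims = build_grid(pos, h=0.1)
print(len(neighbors(0, p, flat, dims, h=0.1)))
```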
Smoothed Particle Hydrodynamics (SPH) is fast emerging as a practically useful computational simulation tool for a wide variety of engineering problems. SPH is also gaining popularity as the backbone for fast and realistic animations in graphics and video games. The Lagrangian and mesh-free nature of the method facilitates fast and accurate simulation of material deformation, interface capture, etc. Particle-based methods typically require particle search-and-locate algorithms to be implemented efficiently, since the continual creation of neighbor-particle lists is a computationally expensive step. Hence, it is advantageous to implement SPH on modern multi-core platforms with the help of High-Performance Computing (HPC) tools. In this work, the computational performance of an SPH algorithm is assessed on a multi-core Central Processing Unit (CPU) as well as on massively parallel General-Purpose Graphics Processing Units (GP-GPUs). Parallelizing SPH faces several challenges, such as scalability of the neighbor-search process, force calculations, minimizing thread divergence, achieving coalesced memory access patterns, balancing workload, and ensuring optimal use of computational resources. While addressing some of these challenges, we evaluate detailed performance metrics such as speedup, global load efficiency, global store efficiency, warp execution efficiency, and occupancy. The OpenMP and Compute Unified Device Architecture (CUDA) parallel programming models are used for parallel computing on an Intel Xeon E5-2630 multi-core CPU and on NVIDIA Quadro M4000 and NVIDIA Tesla P100 massively parallel GPU architectures. Standard benchmark problems from the Computational Fluid Dynamics (CFD) literature are chosen for validation. The key concern addressed is how to identify a suitable architecture for mesh-less methods, which essentially require a heavy workload of neighbor search and evaluation of local force fields from neighbor interactions.
Advances in astronomy are intimately linked to advances in digital signal processing (DSP). This special issue is focused upon advances in DSP within radio astronomy. The trend within that community is to use off-the-shelf digital hardware where possible and leverage advances in high performance computing. In particular, graphics processing units (GPUs) and field programmable gate arrays (FPGAs) are being used in place of application-specific circuits (ASICs); high-speed Ethernet and Infiniband are being used for interconnect in place of custom backplanes. Further, to lower hurdles in digital engineering, communities have designed and released general-purpose FPGA-based DSP systems, such as the CASPER ROACH board, ASTRON Uniboard, and CSIRO Redback board. In this introductory paper, we give a brief historical overview, a summary of recent trends, and provide an outlook on future directions.
Recent progress in graphics processing unit (GPU) computing, with applications in science and technology, has demonstrated tremendous impact over the last decade. However, financial applications of GPU computing are less discussed, which may be an obstacle to the development of financial technology, an emerging and disruptive field focused on improving the efficiency of our current financial system. This chapter aims to raise attention to GPU computing in finance by first empirically investigating the performance of three basic computational methods: solving a linear system, the fast Fourier transform, and Monte Carlo simulation. Then, a fast calibration of the wing model to implied volatilities is explored with a set of high-frequency traded futures and options data. A reduction of at least 60% in execution time for this calibration is obtained under the MATLAB computational environment. This finding enables the detection of instantaneous market changes, so that real-time surveillance of financial markets can be established for either trading or risk-management purposes.
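Of the three methods investigated, Monte Carlo simulation is the most naturally data-parallel, since every path is independent; a vectorized sketch pricing a European call under Black-Scholes dynamics (the parameters are illustrative, and the chapter's calibration target, the wing model, is not reproduced here):

```python
import numpy as np

def mc_european_call(s0, k, r, sigma, t, n_paths=1_000_000, seed=0):
    """Price a European call by simulating terminal prices under
    geometric Brownian motion; each path is an independent draw,
    so paths map directly onto GPU threads."""
    z = np.random.default_rng(seed).standard_normal(n_paths)
    st = s0 * np.exp((r - 0.5 * sigma**2) * t + sigma * np.sqrt(t) * z)
    return np.exp(-r * t) * np.mean(np.maximum(st - k, 0.0))

print(mc_european_call(s0=100, k=105, r=0.02, sigma=0.2, t=1.0))
```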
This paper presents a passive vision system used on a small five-legged walking robot. The bio-inspired omnidirectional vision system matches the omnidirectional motion capabilities of the robot, which can change its walking direction without needing to rotate. To make the vision system suitable for a walking machine lacking a powerful on-board computer, the developed sensor is based on an embedded single-board computer with GPU acceleration. This parallel architecture enables the robot to quickly detect potential obstacles in omnidirectional images and avoid them, without involving the main controller responsible for gait generation.