The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPUs, which support a wide class of applications but deliver moderate computing performance, to many-core Graphics Processing Units (GPUs), which exploit aggressive data parallelism and deliver higher performance for streaming computations. In this scenario, code portability (and performance portability) becomes necessary for easy maintainability of applications; this is especially relevant in scientific computing, where code changes are frequent and keeping different code versions aligned is tedious and error-prone. In this work, we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures and comparing with GPU-specific implementations, and show that a good level of performance portability can be reached.
This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code for staggered fermions, purposely designed to be portable across different computer architectures, including GPUs and commodity CPUs. Portability is achieved using the OpenACC parallel programming model, which is used to develop a code that can be compiled for several processor architectures. The paper focuses on parallelization across multiple computing nodes, using OpenACC to manage parallelism within a node and OpenMPI to manage parallelism among nodes. We first discuss the available strategies to maximize performance, then describe selected relevant details of the code, and finally measure the level of performance and scaling that we are able to achieve. The work focuses mainly on GPUs, which offer a significantly high level of performance for this application, but also compares with results measured on other processors.
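As a minimal illustration of the directive-based style both abstracts describe (a sketch of our own, not code from the application; the function and array names are hypothetical), a site-wise field update can be offloaded with OpenACC as follows:

```c
/* Minimal OpenACC sketch (not taken from the LQCD code described
 * above): offload a site-wise field update to whatever device the
 * source is compiled for (GPU or multicore CPU).  The compiler, not
 * the programmer, decides how the loop maps onto the hardware. */
void axpy_field(double a, const double *restrict x, double *restrict y, int n)
{
    /* Keep both arrays resident on the device for the whole loop.    */
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
}
```

The same source can be built for GPUs or multicore CPUs by changing only compiler flags (for example -acc=gpu or -acc=multicore with the NVIDIA HPC compilers), which is the single-source portability property both abstracts emphasize.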
Graphics Processing Units (GPUs) are application-specific accelerators that provide a high performance-to-cost ratio and are widely available and used, which makes them a ubiquitous accelerator. The computing paradigm based on them is the general-purpose computing on GPUs (GPGPU) model. Owing to its graphics lineage, the GPU is better suited to data-parallel, data-regular algorithms. Its hardware architecture is less suitable for data-parallel but data-irregular algorithms such as finding graph connected components and list ranking.
In this paper, we present results that show how to use GPUs efficiently for graph algorithms, which are known to have irregular data access patterns. We consider two fundamental graph problems: finding the connected components and finding a spanning tree. These two problems arise as building blocks in several other graph-theoretic problems. We arrive at efficient GPU implementations for both problems; the algorithms focus on minimising irregularity at both the algorithmic and the implementation level. Our implementation achieves a speedup of 11-16 times over the corresponding best sequential implementation.
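The abstracts above do not spell out the algorithms, but the irregularity they refer to is visible even in a minimal sequential baseline for connected components (a generic textbook-style sketch, not the GPU algorithm of the paper):

```c
/* Naive label propagation over an edge list (a generic sequential
 * baseline, not the paper's GPU algorithm).  Each pass pulls every
 * edge's endpoints down to the smaller of their two labels; the
 * scattered reads and writes through src[]/dst[] are exactly the
 * kind of irregular accesses the abstract refers to. */
#include <stdbool.h>

void connected_components(const int *src, const int *dst, int num_edges,
                          int *label, int num_vertices)
{
    for (int v = 0; v < num_vertices; ++v)
        label[v] = v;                    /* each vertex starts alone   */

    bool changed = true;
    while (changed) {                    /* iterate to a fixed point   */
        changed = false;
        for (int e = 0; e < num_edges; ++e) {
            int a = label[src[e]], b = label[dst[e]];
            if (a < b)      { label[dst[e]] = a; changed = true; }
            else if (b < a) { label[src[e]] = b; changed = true; }
        }
    }                                    /* vertices in one component  */
}                                        /* end with the same label    */
```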
The need to rank and order data is pervasive, and many algorithms are fundamentally dependent upon sorting and partitioning operations. Prior to this work, GPU stream processors have been perceived as challenging targets for problems with dynamic and global data-dependences such as sorting. This paper presents: (1) a family of very efficient parallel algorithms for radix sorting; and (2) our allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism. We demonstrate multiple factors of speedup (up to 3.8x) compared to state-of-the-art GPU sorting. We also reverse the performance differentials observed between GPU and multi/many-core CPU architectures by recent comparisons in the literature, including those with 32-core CPU-based accelerators. Our average sorting rates exceed 1B 32-bit keys/sec on a single GPU microprocessor. Our sorting passes are constructed from a very efficient parallel prefix scan "runtime" that incorporates three design features: (1) kernel fusion for locally generating and consuming prefix scan data; (2) multi-scan for performing multiple related, concurrent prefix scans (one for each partitioning bin); and (3) flexible algorithm serialization for avoiding unnecessary synchronization and communication within algorithmic phases, allowing us to construct a single implementation that scales well across all generations and configurations of programmable NVIDIA GPUs.
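The tight coupling between radix sorting and prefix scan that the abstract builds on can be seen in a single least-significant-digit pass, written here sequentially as a textbook sketch (not the GPU implementation described above): the digit histogram is scanned to obtain each key's scatter offset.

```c
/* One LSD radix-sort pass over 4-bit digits (sequential textbook
 * sketch, not the GPU code of the abstract).  The exclusive prefix
 * scan of the digit histogram turns counts into scatter offsets. */
#include <stdint.h>
#include <stddef.h>

void radix_pass(const uint32_t *in, uint32_t *out, size_t n, int shift)
{
    size_t hist[16] = {0};

    for (size_t i = 0; i < n; ++i)              /* 1) digit histogram  */
        hist[(in[i] >> shift) & 0xF]++;

    size_t offset = 0;                          /* 2) exclusive scan   */
    for (int d = 0; d < 16; ++d) {
        size_t count = hist[d];
        hist[d] = offset;
        offset += count;
    }

    for (size_t i = 0; i < n; ++i)              /* 3) stable scatter   */
        out[hist[(in[i] >> shift) & 0xF]++] = in[i];
}
```

Eight such passes (shift = 0, 4, ..., 28), ping-ponging between two buffers, sort 32-bit keys; the histogram, scan, and scatter steps correspond roughly to the work that the abstract's fused multi-scan kernels perform in parallel, one scan per partitioning bin.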
Widespread heterogeneous parallelism is unavoidable given the emergence of General-Purpose computing on graphics processing units (GPGPU). The characteristics of a Graphics Processing Unit (GPU)—including significant memory transfer latency and complex performance characteristics—demand new approaches to ensuring that all available computational resources are efficiently utilised. This paper considers the simple case of a divisible workload based on widely-used numerical linear algebra routines and the challenges that prevent efficient use of all resources available to a naive SPMD application using the GPU as an accelerator. We suggest a possible queue monitoring strategy that facilitates resource usage with a view to balancing the CPU/GPU utilisation for applications that fit the pipeline parallel architectural pattern on heterogeneous multicore/multi-node CPU and GPU systems. We propose a stochastic allocation technique that may serve as a foundation for heuristic approaches to balancing CPU/GPU workloads.
We demonstrate that parallel deterministic sample sort for many-core GPUs (GPU BUCKET SORT) is not only considerably faster than the best comparison-based sorting algorithm for GPUs (THRUST MERGE [Satish et al., Proc. IPDPS 2009]) but also as fast as randomized sample sort for GPUs (GPU SAMPLE SORT [Leischner et al., Proc. IPDPS 2010]). However, deterministic sample sort has the advantage that bucket sizes are guaranteed, and therefore its running time does not have the input-data-dependent fluctuations that can occur for randomized sample sort.
We present a number of optimization techniques to compute prefix sums on linked lists and implement them on the multithreaded GPUs Tesla C1060, Tesla C2050, and GTX480 using CUDA. Prefix computations on linked structures generally involve highly irregular fine-grain memory accesses that are typical of many computations on linked lists, trees, and graphs. While the current generation of GPUs provides substantial computational power and extremely high memory bandwidth, they may appear at first to be primarily geared toward streamed, highly data-parallel computations. In this paper, we introduce an optimized multithreaded GPU algorithm for prefix computations through a randomization process that reduces the problem to a large number of fine-grain computations. We map these fine-grain computations onto multithreaded GPUs in such a way that the processing cost per element is shown to be close to the best possible. Our experimental results show scalability for list sizes ranging from 1M nodes to 256M nodes, and significantly improve on recently published parallel implementations of list ranking, including implementations on the Cell processor, the MTA-8, and the NVIDIA GT200 and Fermi series. They also compare favorably to the performance of the best known CUDA algorithm for the scan operation on the Tesla C1060 and GTX480.
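As background for prefix computations on linked lists (this is the classic pointer-jumping scheme of Wyllie, not the randomized algorithm of the abstract), all ranks can be computed in a logarithmic number of synchronous rounds; a sequential rendering is sketched below.

```c
/* Classic pointer-jumping (Wyllie) list ranking, written sequentially
 * for clarity; it is not the randomized algorithm of the abstract.
 * next[i] == i marks the end of the list.  After ceil(log2 n) rounds
 * rank[i] holds the number of hops from node i to the end. */
#include <stdlib.h>
#include <string.h>

void list_rank(int *next, int *rank, int n)
{
    for (int i = 0; i < n; ++i)
        rank[i] = (next[i] == i) ? 0 : 1;

    int *next2 = malloc(n * sizeof *next2);
    int *rank2 = malloc(n * sizeof *rank2);

    for (int hop = 1; hop < n; hop *= 2) {
        for (int i = 0; i < n; ++i) {           /* independent updates */
            rank2[i] = rank[i] + rank[next[i]]; /* -- this inner loop  */
            next2[i] = next[next[i]];           /* is what a GPU would */
        }                                       /* run in parallel     */
        memcpy(rank, rank2, n * sizeof *rank);
        memcpy(next, next2, n * sizeof *next);
    }
    free(next2);
    free(rank2);
}
```

The doubly indirect loads rank[next[i]] and next[next[i]] are the irregular fine-grain accesses the abstract mentions; randomization is one way to restructure them into work that maps better onto GPU memory systems.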
The implementation of stencil computations on modern, massively parallel systems with GPUs and other accelerators currently relies on manually-tuned coding using low-level approaches like OpenCL and CUDA. This makes development of stencil applications a complex, time-consuming, and error-prone task. We describe how stencil computations can be programmed in our SkelCL approach that combines high-level programming abstractions with competitive performance on multi-GPU systems. SkelCL extends the OpenCL standard by three high-level features: 1) pre-implemented parallel patterns (a.k.a. skeletons); 2) container data types for vectors and matrices; 3) automatic data (re)distribution mechanism. We introduce two new SkelCL skeletons which specifically target stencil computations – MapOverlap and Stencil – and we describe their use for particular application examples, discuss their efficient parallel implementation, and report experimental results on systems with multiple GPUs. Our evaluation of three real-world applications shows that stencil code written with SkelCL is considerably shorter and offers competitive performance to hand-tuned OpenCL code.
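To make the stencil pattern concrete (a plain C sketch of the access pattern that such skeletons abstract, not SkelCL code or its API):

```c
/* A 3-point 1-D stencil in plain C, illustrating the neighbourhood
 * access pattern that skeletons such as MapOverlap/Stencil abstract
 * away.  This is not SkelCL code.  Out-of-range neighbours are
 * clamped to the boundary, one typical out-of-bounds policy. */
void stencil_1d(const float *in, float *out, int n)
{
    for (int i = 0; i < n; ++i) {
        int left  = (i == 0)     ? 0     : i - 1;   /* clamped halo */
        int right = (i == n - 1) ? n - 1 : i + 1;
        out[i] = 0.25f * in[left] + 0.5f * in[i] + 0.25f * in[right];
    }
}
```

A skeleton library takes only the per-element update (the body of the loop) from the user and supplies the parallelization, halo handling, and multi-GPU data distribution itself.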
Sorting and scanning are two fundamental primitives for constructing highly parallel algorithms. A number of libraries now provide implementations of these primitives for GPUs, but there is relatively little information about the performance of these implementations.
We benchmark seven libraries for 32-bit integer scan and sort, and sorting 32-bit values by 32-bit integer keys. We show that there is a large variation in performance between the libraries, and that no one library has both optimal performance and portability.
In this paper, we illustrate the possibility of developing strategies to carry out matrix computations on heterogeneous platforms which achieve native GPU performance on very large data sizes up to the capacity of the CPU memory. More specifically, we present a dense matrix multiplication strategy on a heterogeneous platform, specifically tailored for the case when the input is too large to fit on the device memory, which achieves near peak GPU performance. Our strategy involves the development of CUDA stream based software pipelines that effectively overlap PCIe data transfers with kernel executions. As a result, we are able to achieve over 1 and 2 TFLOPS performance on a single node using 1 and 2 GPUs respectively.
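A skeleton of such a stream-based pipeline (a simplified sketch under assumed tile sizes, not the authors' implementation) is shown below: tiles of A are streamed to the device on two alternating CUDA streams so that cudaMemcpyAsync transfers overlap with cuBLAS GEMM calls, while B stays resident on the device. Error checking is omitted.

```c
/* Simplified stream pipeline overlapping PCIe transfers with GEMMs
 * (an illustrative sketch, not the authors' implementation).
 * A (m x k), B (k x n), C (m x n) are row-major host arrays; A and C
 * should be pinned (cudaMallocHost) for the copies to truly overlap. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void tiled_gemm(const double *A, const double *B, double *C,
                int m, int n, int k, int tile_rows)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaStream_t stream[2];
    double *dA[2], *dC[2], *dB;
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc((void **)&dA[s], (size_t)tile_rows * k * sizeof(double));
        cudaMalloc((void **)&dC[s], (size_t)tile_rows * n * sizeof(double));
    }
    cudaMalloc((void **)&dB, (size_t)k * n * sizeof(double));
    cudaMemcpy(dB, B, (size_t)k * n * sizeof(double), cudaMemcpyHostToDevice);

    const double alpha = 1.0, beta = 0.0;
    for (int row = 0, s = 0; row < m; row += tile_rows, s ^= 1) {
        int rows = (row + tile_rows <= m) ? tile_rows : m - row;

        /* Stage the next tile of A and compute its slice of C on      */
        /* stream s; work queued on the other stream runs concurrently.*/
        cudaMemcpyAsync(dA[s], A + (size_t)row * k,
                        (size_t)rows * k * sizeof(double),
                        cudaMemcpyHostToDevice, stream[s]);
        cublasSetStream(handle, stream[s]);
        /* Row-major tiles are handled as C_tile^T = B^T * A_tile^T.   */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, rows, k,
                    &alpha, dB, n, dA[s], k, &beta, dC[s], n);
        cudaMemcpyAsync(C + (size_t)row * n, dC[s],
                        (size_t)rows * n * sizeof(double),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    for (int s = 0; s < 2; ++s) {
        cudaStreamSynchronize(stream[s]);
        cudaStreamDestroy(stream[s]);
        cudaFree(dA[s]);
        cudaFree(dC[s]);
    }
    cudaFree(dB);
    cublasDestroy(handle);
}
```

Choosing tile_rows large enough that each GEMM takes longer than its transfers is what hides the PCIe traffic; the pipeline described in the abstract presumably refines this basic double-buffering scheme.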
The computational power requirements of real-world optimization problems are beginning to exceed the general performance of the Central Processing Unit (CPU). The modeling of such problems is in constant evolution and requires ever more computational power. Solving them is expensive in computation time, and even metaheuristics, well known for their efficiency, are beginning to be unsuitable for the increasing amount of data. Recently, thanks to the advent of languages such as CUDA, the development of parallel metaheuristics on Graphics Processing Unit (GPU) platforms to solve combinatorial problems such as the Quadratic Assignment Problem (QAP) has received growing interest. The QAP is one of the most studied NP-hard problems and is known for its high computational cost. In this paper, we survey several of the most important metaheuristic approaches for the QAP, focusing on parallel metaheuristics that use the GPU.
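For reference, the standard Koopmans–Beckmann formulation assigns n facilities to n locations so as to minimize the flow-weighted distance:

\[
\min_{\pi \in S_n} \sum_{i=1}^{n} \sum_{j=1}^{n} f_{ij}\, d_{\pi(i)\pi(j)},
\]

where \(f_{ij}\) is the flow between facilities \(i\) and \(j\), \(d_{kl}\) is the distance between locations \(k\) and \(l\), and \(S_n\) is the set of permutations of \(\{1,\dots,n\}\). The factorial size of the search space is what drives the interest in parallel metaheuristics.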
Verification of digital circuits by cycle-based simulation can be performed in parallel. The parallel implementation requires two phases: the compilation phase, which sets up the data needed for the execution of the simulation, and the simulation phase, which consists in executing the parallel simulation of the considered circuit for a certain number of cycles. During the early phases of design, the compilation phase has to be repeated each time a bug is found. Thus, if the time of the compilation phase is too high, the advantages stemming from the parallel approach may be lost. In this work we propose an effective version of the compilation phase and compute the corresponding execution time. We also analyze the percentage of execution time required by the different steps of the compilation phase for a set of literature benchmarks. Further, we implemented the simulation phase exploiting the GPU architecture and computed the execution times for a set of benchmarks, obtaining values comparable with those in the literature. Finally, we implemented a sequential version of the cycle-based simulation whose execution time is optimized, and used the sequential values to compute the speedup of the parallel version for the considered set of benchmarks.
In deep learning, one of the popular subclasses of the multilayer perceptron is the Convolutional Neural Network (CNN), optimized for image processing tasks such as semantic segmentation. With a large number of floating-point operations and relatively little data transfer in the training phase, CNNs are well suited to parallel architectures. The high accuracy of CNNs comes at the cost of considerable compute and memory demands. In this article, we present a comparative analysis of a pre-trained CNN framework for recognizing handwritten digits on different processing platforms. The performance of the neural network as a function of the parallel processing platform offers parametric insights into the factors that influence inference on state-of-the-art Graphics Processing Units (GPUs) and systolic arrays such as Eyeriss and the Tensor Processing Unit (TPU). Through inference time analysis, we observe that the systolic arrays can outperform a Turing-architecture GPU by up to 58.7 times. We show that while the efficiency with which the available resources are utilized is higher in existing GPUs (up to 32%) than in TPUs, the efficiency of application-specific systolic arrays can be on par with that of GPUs. We present results obtained from three different customized systolic-array-based platforms, which designers can use to decide on a hardware optimization goal.
SpGEMM (general sparse matrix-matrix multiplication) is a key kernel in algebraic multigrid methods, graph algorithms, and the solution of linear equations. Due to the non-uniformity of some sparse matrices, existing parallel SpGEMM algorithms suffer from load imbalance, which leads to a decrease in computational efficiency. This paper proposes a new algorithm, SPMSD (SpGEMM Based on Minimum Standard Deviation), developed around a hash table and a partition strategy. First, the intermediate results of the product are divided into multiple blocks by a new partition strategy that ensures the minimum standard deviation among blocks. Second, the input matrix is transformed according to the result of the partition strategy. Finally, SPMSD performs the parallel computation of SpGEMM, exploiting the fast insertion and fast access of the hash table, and the calculation controls the insertion and merging of intermediate results according to the offsets, avoiding the drawbacks of atomic operations. Experiments indicate that SPMSD executes up to 7.4x faster than the existing cuSPARSE library. Compared with the out-of-core method, SPMSD improves computational performance by 1.2x and decreases memory utilization by 0.19x.
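The hash-based accumulation that SPMSD builds on can be sketched for a single output row of C = A·B in CSR form (a generic sequential sketch with open addressing, not the SPMSD code itself):

```c
/* Hash-based accumulation of one SpGEMM output row (C = A * B) in
 * CSR; a generic sequential sketch, not SPMSD itself.  TABLE_SIZE is
 * a power of two larger than the number of distinct output columns
 * in the row; linear probing resolves collisions. */
#include <stddef.h>

#define TABLE_SIZE 256
#define EMPTY (-1)

/* Accumulates row i of C into (key, val); returns its nonzero count. */
size_t spgemm_row(int i,
                  const int *Ap, const int *Aj, const double *Ax,
                  const int *Bp, const int *Bj, const double *Bx,
                  int *key, double *val)         /* length TABLE_SIZE */
{
    for (int t = 0; t < TABLE_SIZE; ++t) key[t] = EMPTY;
    size_t nnz = 0;

    for (int pa = Ap[i]; pa < Ap[i + 1]; ++pa) {
        int k = Aj[pa];
        double a = Ax[pa];
        for (int pb = Bp[k]; pb < Bp[k + 1]; ++pb) {
            int j = Bj[pb];
            int h = j & (TABLE_SIZE - 1);            /* j mod size     */
            while (key[h] != EMPTY && key[h] != j)   /* linear probing */
                h = (h + 1) & (TABLE_SIZE - 1);
            if (key[h] == EMPTY) { key[h] = j; val[h] = 0.0; ++nnz; }
            val[h] += a * Bx[pb];
        }
    }
    return nnz;   /* occupied slots are then compacted into CSR of C  */
}
```

On a GPU, many threads insert into such tables concurrently, which is where atomic operations and load balance across rows become the issues the abstract addresses with its partition strategy.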
The novel paradigm of Sleptsov Net Computing (SNC) mends imperfections of modern HPC architectures with a computing memory implementation and provides a vivid graphical language, fine granulation of concurrent processes, and wide application of formal methods for reliable software design. An IDE and a VM for SNC have been developed and described in earlier papers. In the present paper, we considerably reduce the GPU memory and thread consumption of the previous prototype implementation by introducing a matrix with condensed columns (MCC), and we enhance performance by using the first-fireable-transition choice on a transition sequence reordered according to the lattice of priorities. MCC takes into account how Sleptsov net (SN) arcs are processed, enhancing performance with rather little overhead compared to traditional sparse-matrix formats. We represent the matrix of SN arcs by a pair of considerably smaller matrices that have the same number of columns and a number of rows equal to the maximal number of nonzero elements over the source matrix columns; the first matrix contains the indexes within the source matrix, and the second contains the values. To run an SN, we use the corresponding rectangular matrix of GPU threads, with very few indirect memory accesses via MCC. As a result, we increase VM performance by 2–4 times and reduce GPU memory and thread consumption by about a hundred times on programs from the SN software collection.
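Based on the description above, the condensed-column layout can be sketched as follows; the struct and field names are illustrative assumptions, not the actual data structures of the VM. Each column of the source matrix is compressed to its nonzero entries and padded to the maximal column degree.

```c
/* Sketch of a matrix with condensed columns (MCC) as described in
 * the abstract; names and types here are illustrative assumptions.
 * Both condensed matrices are max_nnz x cols, stored column by
 * column; unused slots carry the row index -1 as padding. */
#include <stddef.h>

typedef struct {
    int  rows, cols;     /* dimensions of the source matrix          */
    int  max_nnz;        /* maximal number of nonzeros over columns  */
    int *row_index;      /* max_nnz * cols: source row of each entry */
    int *value;          /* max_nnz * cols: the entry's value        */
} mcc_t;

/* Read element (r, c) of the source matrix through the MCC layout.  */
int mcc_get(const mcc_t *m, int r, int c)
{
    for (int s = 0; s < m->max_nnz; ++s) {
        size_t pos = (size_t)c * m->max_nnz + s;
        if (m->row_index[pos] == r)  return m->value[pos];
        if (m->row_index[pos] == -1) break;   /* padding: column done */
    }
    return 0;                                 /* implicit zero        */
}
```

Because every column occupies the same max_nnz slots, a rectangular grid of GPU threads (one per slot) can sweep the structure with regular indexing and only the few indirect accesses mentioned in the abstract.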
The temperature-induced phase transition is investigated in the one-component scalar field ϕ4 model on the lattice. Using a GPU cluster, a large amount of Monte Carlo simulation data is collected for a wide interval of coupling values. This makes it possible to determine the lower bound on the coupling constant λ0 at which the transition happens and to investigate its type. We find that for values of λ close to this bound a weak first-order phase transition takes place, which converts into a second-order one as λ increases. A comparison with the results obtained in analytic and numerical calculations by other authors is given.
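For orientation, one common form of the lattice action for the one-component model is (normalization conventions for the couplings vary between papers, possibly including the one summarized above):

\[
S[\phi] = \sum_x \Big[ \tfrac{1}{2} \sum_{\mu} \big(\phi_{x+\hat\mu} - \phi_x\big)^2
          + \tfrac{1}{2} m_0^2\, \phi_x^2 + \lambda_0\, \phi_x^4 \Big],
\]

with the sum running over lattice sites x and directions μ; the Monte Carlo scans described above map out where, as a function of the quartic coupling λ0, the transition changes from weakly first order to second order.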
The accelerating progress and availability of low-cost computers, high-speed networks, and software for high-performance distributed computing allow us to reconsider computationally expensive techniques in image processing and pattern recognition. We propose a two-level hierarchical k-nearest-neighbor classifier in which the first level uses graphics processing units (GPUs) and the second level uses a high-performance cluster (HPC). The system is evaluated on the problem of character recognition with nine databases (Arabic digits, Indian digits (Bangla, Devnagari, and Oriya), Bangla characters, Indonesian characters, Arabic characters, and Farsi characters and digits). Contrary to many approaches that tune the model for different scripts, the proposed image classification method is unchanged throughout the evaluation on the nine databases. We show that a hierarchical combination of decisions based on two distances, using GPUs and an HPC, provides state-of-the-art performance on several scripts and better accuracy than more complex systems.
Guide wire tracking in fluoroscopic images plays a significant role in assisting physicians during radiology-aided interventions. Many groups have tried to detect the guide wire in fluoroscopic images based on image properties; the main challenge is that manual intervention is required during detection. Other groups introduce localizers to track guide wires during intervention, which requires additional hardware and may interfere with traditional clinical routines. Machine learning methods have also been exploited; although such methods may provide accurate tracking, they often require large amounts of data and training time.
In this paper, we propose a fast, automatic, GPU-based approach to track guide wires in fluoroscopic sequences. We propose a real-time tracking method based on multi-scale filtering and gradient vector fields for guide wire tracking in fluoroscopic images. To improve calculation efficiency and meet real-time application requirements, we propose a GPU-based acceleration scheme, together with a Bayesian filter-like motion tracking method that limits the guide wire search to a smaller range.
We test our proposed method on two test data sets of fluoroscopic sequences of 102 and 72 frames. We achieve an average guide wire detection rate of 96.7%, a false detection rate of 0.0011%, and an error distance of 0.83 pixels for the first sequence, and 98.8%, 0.000069%, and 0.85 pixels, respectively, for the second sequence. With the proposed acceleration method, we finish the calculation for the first sequence in nine seconds; efficiency is thus enhanced by 100 times compared with the unaccelerated algorithm.
To solve the key problems of debugging secondary surveillance radar signal processing on special-purpose devices such as DSPs and FPGAs, a parallel processing scheme for the secondary surveillance radar response signal is proposed based on a CPU–GPU architecture. It can effectively reduce the difficulty of code development and improve the portability of the program. Each processing module of the response signal is parallelized and optimized, exploiting the shared-memory characteristics of the CPU–GPU architecture, to improve the processing speed of the algorithm. The proposed scheme is tested and analyzed on different graphics cards using the secondary surveillance radar Mode-5 response signal. The experimental results show that the signal processing algorithm takes 8390.52960 µs on an NVIDIA Tesla K40c graphics card, saving 51.98% of the time required on an NVIDIA Quadro K4200 graphics card and making real-time processing of the secondary surveillance radar signal possible.
Modern GPUs can execute multiple kernels concurrently to keep the hardware resources busy and to boost overall performance. This approach is called simultaneous multiple kernel execution (MKE). MKE is a promising approach for improving GPU hardware utilization, but although modern GPUs allow it, the effects of different MKE scenarios have not been adequately studied. Since cache memories have significant effects on overall GPU performance, the effects of MKE on cache performance should be investigated properly. The present study proposes a framework, called RDMKE (short for Reuse Distance-based profiling in MKEs), that provides a method for analyzing GPU cache memory performance in MKE scenarios. The raw memory access information of a kernel is first extracted, and RDMKE then enforces an ordering on the memory accesses so that it represents a given MKE scenario. Afterward, RDMKE employs reuse distance analysis (RDA) to generate cache-related performance metrics, including hit ratios, transaction counts, cache sets, and Miss Status Holding Register reservation fails. In addition, RDMKE provides the user with the RD profiles as a useful locality metric. Simulation results for single-kernel executions show a fair correlation between the results generated by RDMKE and GPU performance counters. Further, simulation results for 28 two-kernel executions indicate that RDMKE can properly capture the nonlinear cache behaviors in MKE scenarios.
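The metric underlying RDA is simple to state: the reuse distance of an access is the number of distinct addresses touched since the previous access to the same address (infinite, or undefined, for a first access). A naive sketch over a flat address trace is given below; it illustrates the metric only, not the RDMKE tool, which works on full GPU traces and MKE interleavings.

```c
/* Naive reuse-distance computation over an address trace; an
 * illustration of the metric, not the RDMKE tool.  dist[i] is the
 * number of distinct addresses seen since the previous access to
 * trace[i], or -1 if trace[i] has not been accessed before. */
#include <stdbool.h>

void reuse_distance(const unsigned long *trace, int n, int *dist)
{
    for (int i = 0; i < n; ++i) {
        int prev = -1;
        for (int j = i - 1; j >= 0; --j)           /* previous access  */
            if (trace[j] == trace[i]) { prev = j; break; }
        if (prev < 0) { dist[i] = -1; continue; }  /* cold access      */

        int distinct = 0;                          /* count distinct   */
        for (int j = prev + 1; j < i; ++j) {       /* addresses since  */
            bool seen = false;                     /* the last access  */
            for (int k = prev + 1; k < j; ++k)
                if (trace[k] == trace[j]) { seen = true; break; }
            if (!seen) ++distinct;
        }
        dist[i] = distinct;
    }
}
```

Under a fully associative LRU cache with capacity C lines, an access hits exactly when its reuse distance is smaller than C, which is how RD profiles translate into the hit-ratio-style metrics mentioned in the abstract; production tools use tree-based structures rather than this brute-force scan.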