The complex nature of the ocean environment requires advanced computational acoustic models to gain insight into the dominant physical factors controlling underwater sound propagation and scattering in the ocean. In this study, we explore the “ab-initio” approach of solving wave equations with numerical algorithms that can be implemented on distributed-memory parallel computers. The goal is to improve calculation speed so that long-range, global-scale hydroacoustic wave propagation can be studied more efficiently and effectively. Two major algorithms are investigated: the finite difference time domain (FDTD) method and the Parabolic Equation (PE) method. Two PE-based numerical models are considered. One is the Split-Step-Fourier Parabolic Equation (SSFPE) model, which applies split-step Fourier schemes in three-dimensional (3D) environments. The second is a multi-frequency implementation of the Range-dependent Acoustic Model (RAM), which computes two-dimensional (2D) sound pressure fields.
One of the primary challenges in global-scale hydroacoustic propagation modeling with the “ab-initio” approach is its demand for significant computational resources, in both memory and computational speed. To address this, we employed parallel computing techniques using directive-based programming languages, such as OpenACC and XcalableACC, to leverage the power of multiple Graphics Processing Unit (GPU) systems and Central Processing Unit (CPU) cluster computers. Our performance evaluation revealed substantial speedups. For the FDTD method, we achieved approximately 3.5-fold and 4-fold speedups with four GPUs and four CPU cluster nodes, respectively. With the 3D SSFPE model, we obtained a 3.7-fold speedup using four CPU cluster nodes. A significant result was observed with the 2D broadband multi-frequency RAM model, where we achieved a 110-fold speedup compared with the original single-core implementation.
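As a concrete, deliberately tiny illustration of the FDTD kernel whose parallelization the abstract evaluates, a second-order acoustic pressure update on a uniform 2D grid can be sketched as below. The grid size, sound speed, and source placement here are arbitrary assumptions, not the study's configuration; a stencil of this kind is what a domain decomposition would split across GPUs or cluster nodes, with halo exchange at subdomain boundaries.

```python
import numpy as np

# Minimal 2D acoustic FDTD sketch (illustrative only; parameters are assumptions).
nx, ny, nt = 64, 64, 50
c, dx, dt = 1500.0, 10.0, 0.002           # sound speed (m/s), grid step (m), time step (s)
assert c * dt / dx < 1 / np.sqrt(2)       # 2D CFL stability condition

p = np.zeros((nx, ny))                    # pressure at time t
p_prev = np.zeros((nx, ny))               # pressure at time t - dt
p[nx // 2, ny // 2] = 1.0                 # impulsive point source

coef = (c * dt / dx) ** 2
for _ in range(nt):
    # 5-point Laplacian; np.roll gives periodic boundaries, fine for a sketch
    lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0)
           + np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4 * p)
    # leapfrog update of the second-order wave equation
    p, p_prev = 2 * p - p_prev + coef * lap, p
```

In a multi-device run, each device would own a slab of the grid and exchange one row of ghost cells per neighbor before each Laplacian evaluation.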
These results demonstrate the promising potential of massively parallel computers for ocean acoustic propagation modeling and highlight the significant performance benefits of utilizing parallel computing techniques. Our findings emphasize the importance of efficient computational strategies.
This paper presents an outline of the Japanese national research project "High speed computing system for scientific and technological uses", not only in terms of its technological aspects but also its social significance and role.
We propose two scheduling algorithms for precedence-related tasks in a parallel computing system with a finite number of available processors. The first algorithm, called Scheduling with Successors' Level Tracking (SSLT), is designed for TOS (Task-Only Scheduling) on multiprocessor systems. The second algorithm, named Scheduling with Communication Successors' Priority Tracking (SCST), is designed for TCS (Task-with-Communication Scheduling) on distributed systems. Performance analysis and simulation studies of SSLT and SCST are given. The results show that SSLT outperforms the classical LEVEL and LABEL algorithms in terms of application program execution time. The schedule returned by SSLT or SCST is very close to the optimal schedule determined by exhaustive search, and both algorithms require far less computational effort than exhaustive search to obtain a task schedule.
Since the general problem of optimal multiprocessor scheduling is NP-hard, much research effort has focused on developing heuristic methods. Simple, low-cost heuristic algorithms that produce fairly good schedules are of interest for practical reasons. In this work, we give an extensive analysis of the existing Heavy Node First (HNF) algorithm and propose an improved version called the Multi-level Heavy Node First (MHNF) algorithm. We also initiate a new approach to the worst-case evaluation of various heuristics, with which we compare HNF and MHNF. We find that the improved MHNF retains the low cost of HNF while producing better average-case schedules.
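The flavor of such low-cost list-scheduling heuristics can be sketched as follows. This is an illustrative "heaviest ready task first" reconstruction, not the published HNF/MHNF pseudocode; the tie-breaking and level-handling rules are assumptions.

```python
from heapq import heappush, heappop

def hnf_schedule(tasks, deps, weights, nproc):
    """Heavy-node-first sketch: among ready tasks, schedule the heaviest first
    on the earliest-free processor. Illustrative only; the published HNF/MHNF
    algorithms differ in level handling and tie-breaking."""
    indeg = {t: 0 for t in tasks}
    succ = {t: [] for t in tasks}
    for a, b in deps:                       # a must finish before b starts
        succ[a].append(b)
        indeg[b] += 1
    ready = sorted((t for t in tasks if indeg[t] == 0),
                   key=lambda t: -weights[t])
    proc_free = [(0.0, p) for p in range(nproc)]   # (time processor is free, id)
    finish = {}
    while ready:
        t = ready.pop(0)                    # heaviest ready task
        free_at, p = heappop(proc_free)     # earliest-free processor
        # cannot start before all predecessors have finished
        start = max(free_at,
                    max((finish[a] for a, b in deps if b == t), default=0.0))
        finish[t] = start + weights[t]
        heappush(proc_free, (finish[t], p))
        newly_ready = []
        for s in succ[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                newly_ready.append(s)
        ready = sorted(ready + newly_ready, key=lambda t: -weights[t])
    return max(finish.values())             # makespan
```

For a diamond DAG a→c, b→c, c→d with weights 2, 3, 1, 2 on two processors, this yields a makespan of 6: b runs on one processor while a runs on the other, then c and d follow.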
This paper describes how to carry out the matrix-vector multiplications Ax and Aᵀx on parallel computers, where A is a sparse matrix arising from the discretization of partial differential equations. Two partitionings of the sparse matrix suitable for parallel computers are discussed. They are derived by interpreting the sparse matrix as a graph. One technique partitions the graph of the matrix by finding edge separators; the other partitions the graph by finding vertex separators. We claim that in either case computing Aᵀx is no more complex than computing Ax. Results from the implementation of the matrix-vector multiplications on the Intel iPSC/860 are presented which substantiate the claim. Results from an efficient implementation on the Cray Y-MP/1 are also presented for comparison.
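The claim that Aᵀx is no more complex than Ax can be illustrated with a compressed sparse row (CSR) layout: both products traverse the same row structure exactly once, one gathering and one scattering. A minimal sequential sketch (the paper's partitioned parallel version distributes these loops across processors; this is not the authors' code):

```python
import numpy as np

def csr_matvec(indptr, indices, data, x, n_rows):
    """y = A x for a CSR matrix: gather along each row."""
    y = np.zeros(n_rows)
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

def csr_matvec_T(indptr, indices, data, x, n_cols):
    """y = A^T x using the SAME CSR structure: one scatter pass over the rows,
    so the arithmetic work is identical to computing A x."""
    y = np.zeros(n_cols)
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            y[indices[k]] += data[k] * x[i]
    return y
```

For the 2×3 matrix A = [[1, 0, 2], [0, 3, 0]] stored as data = [1, 2, 3], indices = [0, 2, 1], indptr = [0, 2, 3], both routines visit exactly the three stored nonzeros.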
Assume that a black/white n×n image is stored one pixel per element of a vector of length n². We consider determining some characteristics of such images using the scan model of parallel computation. The scan model was introduced by G.E. Blelloch and is a single instruction multiple data (SIMD) vector model of computation. The primitive operations of the model work on vectors (one-dimensional arrays) of values, with three types of primitive operations: elementwise arithmetic and logical operations, permutation operations, and scan operations, a type of prefix computation (a scan operation takes a binary associative operator ⊗ and a vector [a1,…, an] and returns the vector [a1, a1⊗a2,…, a1⊗a2⊗…⊗an]). We show that many important characteristics of binary images can be determined in constant time on the scan model of parallel computation. These include convex hull construction, diameter, width, smallest enclosing box, perimeter, area, detecting digital convexity, parallel and point visibility (determining for each pixel of the image the portion that is visible, i.e., not obstructed by any black pixel, in a given direction from infinity or from a given point, respectively), smallest, largest and Hausdorff distances between two images, linear separability of two images, and the recognition of digital lines, rectangles and arcs.
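The scan primitive defined above can be written down directly. A minimal sequential reference implementation (the scan model evaluates this as a single parallel primitive):

```python
def scan(op, a):
    """Inclusive scan with a binary associative operator op:
    returns [a1, a1 op a2, ..., a1 op a2 op ... op an]."""
    out, acc = [], None
    for x in a:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out
```

As a taste of how the image characteristics are obtained: a +-scan over the flattened 0/1 pixel vector yields, in its last element, the area (number of black pixels), and a max-scan along a row propagates "a black pixel has been seen" left to right, which is the core of the directional visibility computations.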
The multiplication of a vector by a matrix is the kernel operation in many algorithms used in scientific computation. A fast and efficient parallel algorithm for this calculation is therefore desirable. This paper describes a parallel matrix-vector multiplication algorithm which is particularly well suited to dense matrices or matrices with an irregular sparsity pattern. Such matrices can arise from discretizing partial differential equations on irregular grids or from problems exhibiting nearly random connectivity between data structures. The communication cost of the algorithm is independent of the matrix sparsity pattern and is shown to scale as O(n/√p) for an n×n matrix on p processors. The algorithm’s performance is demonstrated by using it within the well-known NAS conjugate gradient benchmark. This resulted in the fastest run times achieved to date on both the 1024-node nCUBE 2 and the 128-node Intel iPSC/860. Additional improvements to the algorithm which are possible when integrating it with the conjugate gradient algorithm are also discussed.
In this paper a parallel algorithm for solving systems of linear equations on the k-ary n-cube is presented and evaluated for the first time. The proposed algorithm has O(N³/kⁿ) computation complexity and uses O(Nn) communication time to factorize a matrix of order N on the k-ary n-cube. This is better than the best known results for the hypercube, O(N log kⁿ), and for the mesh, each with approximately kⁿ nodes. The proposed parallel algorithm takes advantage of the extra connectivity in the k-ary n-cube to reduce the communication time involved in tasks such as pivoting, row/column interchanges, and pivot row and multiplier column broadcasts.
Given the large communication overheads characteristic of modern parallel machines, optimizations that improve locality by executing tasks close to data that they will access may improve the performance of parallel computations. This paper describes our experience automatically applying locality optimizations in the context of Jade, a portable, implicitly parallel programming language designed for exploiting task-level concurrency. Jade programmers start with a program written in a standard serial, imperative language, then use Jade constructs to declare how parts of the program access data. The Jade implementation uses this data access information to automatically extract the concurrency and apply locality optimizations. We present performance results for several Jade applications running on the Stanford DASH machine. We use these results to characterize the overall performance impact of the locality optimizations. In our application set the locality optimization level has little effect on the performance of two of the applications and a large effect on the performance of the rest of the applications. We also found that, if the locality optimization level had a significant effect on the performance, the maximum performance was obtained when the programmer explicitly placed tasks on processors rather than relying on the scheduling algorithm inside the Jade implementation.
This work presents approximation algorithms for scheduling the tasks of a parallel application that are subject to precedence constraints. The considered tasks are malleable, which means that they may be executed on a varying number of processors in parallel. The considered objective criterion is the makespan, i.e., the largest task completion time. We demonstrate a close relationship between this scheduling problem and one of its subproblems, the allotment problem. By exploiting this relationship, we design a polynomial time approximation algorithm with performance guarantee arbitrarily close to (3+√5)/2 ≈ 2.62 for the special case of series-parallel precedence constraints and for the special case of precedence constraints of bounded width. These special cases cover the important situation of tree-structured precedence constraints. For arbitrary precedence constraints, we give a polynomial time approximation algorithm with performance guarantee 3+√5 ≈ 5.24.
The Parallel Random Access Machine is a very strong model of parallel computing that has resisted cost-efficient implementation attempts for decades. Recently, the development of VLSI technology has provided means for indirect on-chip implementation, but there are different variants of the PRAM model that provide different performance, area and power figures, and it is not known how their implementations compare to each other. In this paper we measure the performance and estimate the cost of practical implementations of four PRAM models (EREW, Limited Arbitrary CRCW, Full Arbitrary CRCW, and Full Arbitrary Multioperation CRCW) on our Eclipse chip multiprocessor framework. Interestingly, the most powerful model shows the lowest simulation cost and the highest performance/area and performance/power figures.
A many-to-many k-disjoint path cover (k-DPC) of a graph G is a set of k vertex-disjoint paths joining k distinct pairs of source and sink in which each vertex of G is contained exactly once in a path. The balanced hypercube BHn, a variant of the hypercube, was introduced as a desirable interconnection network topology. Let S={s1,s2,…,s2n−2} and T={t1,t2,…,t2n−2} be any two sets of vertices in different partite sets of BHn (n≥2). Cheng et al. in [Appl. Math. Comput. 242 (2014) 127–142] proved that there exists a paired many-to-many 2-disjoint path cover of BHn when |S|=|T|=2. In this paper, we prove that there exists an unpaired many-to-many (2n−2)-disjoint path cover of BHn (n≥2) from S to T, which improves some known results. The upper bound 2n−2 is best possible in terms of the number of disjoint paths in an unpaired many-to-many k-DPC of BHn.
A multiloop network is a Cayley digraph of the additive group ℤm of integers modulo m generated by a set A of k ≥ 2 "steps". Because of their application to computer network design, multiloop networks have been studied extensively. Given an integer d ≥ 2 and a real number r > 1, let m(d,3) be the maximum size of a triple loop network with transmission delay (diameter) at most d, and let m*(r,3) be the maximum size of a triple loop network with average transmission delay at most r. In this paper, new lower bounds on m(d,3) and m*(r,3) are proved, which improve earlier lower bounds by Jia and by Aguiló, Fiol and Garcia.
Due to the inefficiency of encrypting multiple binary images serially, a parallel binary image encryption framework based on spiking neural P (SNP) systems, a typical variant of spiking neural networks, is proposed in this paper. More specifically, the two basic units in the proposed image cryptosystem, the permutation unit and the diffusion unit, are designed through SNP systems with multiple channels and polarizations (SNP-MCP systems) and SNP systems with astrocyte-like control (SNP-ALC systems), respectively. Different from the serial computing of a traditional image permutation/diffusion unit, the SNP-MCP-based permutation unit and the SNP-ALC-based diffusion unit can realize parallel computing through the parallel use of rules inside the neurons. Theoretical analysis confirms the high efficiency of the proposed binary image cryptosystem. Security analysis experiments demonstrate the security of the proposed cryptosystem.
In Ref. 1, a new model for describing financial market dynamics was proposed. Traders move on a two-dimensional lattice and interact through mechanisms of mutual influence. In the present paper, we present results from large-scale simulations of the same model, enhanced by the introduction of rational traders modeled as moving-average followers. The dynamics now accounts for the log-normal distribution of volatility, which is consistent with observations of real financial indexes (Ref. 7), at least for the central part of the distribution.
A novel approach to parallelize the well-known Hoshen–Kopelman algorithm has been chosen, suitable for simulating huge lattices in high dimensions on massively-parallel computers with distributed memory and message passing. This method consists of domain decomposition of the simulated lattice into strips perpendicular to the hyperplane of investigation that is used in the Hoshen–Kopelman algorithm. Systems of world record sizes, up to L = 4 000 256 in two dimensions, L = 20 224 in three, and L = 1036 in four, gave precise estimates for the Fisher exponent τ, the corrections to scaling Δ1, and for the critical number density nc.
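The serial labeling kernel that the strip-wise domain decomposition parallelizes can be sketched with union-find label merging. This is an illustrative reconstruction of the classic Hoshen–Kopelman sweep, not the authors' code; the relabeling across strip boundaries that the parallel version requires is omitted.

```python
def hoshen_kopelman(grid):
    """Count clusters of occupied sites (4-connectivity) in a 2D 0/1 grid
    with a single raster sweep plus union-find label merging."""
    rows, cols = len(grid), len(grid[0])
    parent = [0]                               # union-find forest over labels
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x
    labels = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if not grid[i][j]:
                continue
            up = labels[i - 1][j] if i > 0 else 0
            left = labels[i][j - 1] if j > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))     # brand-new cluster label
                labels[i][j] = len(parent) - 1
            elif up and left:
                ru, rl = find(up), find(left)
                parent[max(ru, rl)] = min(ru, rl)   # merge the two clusters
                labels[i][j] = min(ru, rl)
            else:
                labels[i][j] = find(up or left)
    roots = {find(labels[i][j])
             for i in range(rows) for j in range(cols) if grid[i][j]}
    return len(roots)
```

In the parallel scheme described above, each node would run this sweep on its own strip and then reconcile labels along the strip boundaries, which is what keeps the memory per node proportional to the strip size rather than the full lattice.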
This paper discusses performance studies of a coarse-grained parallel neural network training code for control of nonlinear dynamical systems, implemented in the shared memory and message passing parallel programming environments OpenMP and MPI, respectively. In addition, these codes are compared to an implementation utilizing SHMEM, the native SGI/Cray data-passing environment for parallel programming. The multiprocessor platform used in the study is an SGI/Cray Origin 2000 with up to 32 processors, which supports all these programming models efficiently. The dynamical system used in this study is a nonlinear 0D model of a thermonuclear fusion reactor with the EDA-ITER design parameters. The results show that OpenMP outperforms the other two environments when a large number of processors is involved, while yielding similar or slightly poorer behavior for a small number of processors. As expected, the native SGI/Cray environment outperforms MPI for the entire range of processors used. Reasons for the observed performance are given. The parallel efficiency of the code is always greater than 60%, regardless of the parallel environment, for the range of processors used in this study.
The Piecewise Parabolic Method (PPM) hydrodynamics code and other codes based on the PPM technique have been used extensively for the simulation of astrophysical phenomena. A new version of the PPM hydrodynamics code under development at the Laboratory for Computational Science and Engineering (LCSE) at the University of Minnesota is described. This new code incorporates a version of dynamic local adaptive mesh refinement (AMR) targeted specifically at improving the treatment of shocks and contact discontinuities. This AMR technique is not intended to increase grid resolution in entire regions of the problem for which standard techniques of nonuniform grids and simple grid motion are adequate. Because the AMR is targeted at surfaces within a flow that can develop complex shapes, a cell-by-cell approach to grid refinement is adopted in order to minimize the number of refined grid cells, with the hope of controlling the computational cost. As a result of this approach, the number of refined cells needed across a given shock or captured contact discontinuity is modest (< 10). Because the PPM method is very complex, its wide difference stencil ordinarily demands that a relatively large number of extra grid cells be added to each end of a grid strip, even when adapting to a thin feature of the flow. To avoid the work of handling these extra cells, which would serve only to produce valid data in the thin strip of actual interest, we have used intermediate results of the coarse grid calculation to eliminate the need for all but two of these extra cells at each end of the refined grid strip. The benefits of this approach will be described through sample results in 2D, and the method for handling the boundaries of the refined grid strips efficiently will be discussed.
The data structures in this method are being carefully designed to permit efficient implementation of the algorithm on clusters of shared memory machines, with automatic dynamic balancing of computational loads over the cluster members.
Based on the double distribution function Boltzmann-BGK equations, a cell-centered finite volume lattice Boltzmann method on unstructured grids for high-speed viscid compressible flows is presented. In the equations, the particle distribution function is introduced on the basis of the D2Q17 circular function, and its corresponding total energy distribution function is adopted. In the proposed method, the advective term is evaluated by Roe’s flux-difference splitting scheme, and a limiter is used to prevent the generation of oscillations. The distribution functions on the interface are calculated by piecewise linear reconstruction, in which the gradient is computed by the least-squares approach. In order to do large-scale simulations, a parallel algorithm is illustrated. The present method is validated by a flow around the NACA0012 airfoil and a flow past a circular cylinder at high Mach numbers. The results agree well with the published results, which demonstrate that the present method is an efficient numerical method for high-speed viscid compressible flows. The parallel performance results show that the proposed parallel algorithm achieves 90% parallel efficiency on 4800 cores for a problem with 1.4×10⁸ unstructured triangle cells, which shows the potential to perform fast and high-fidelity simulations of large-scale high-speed viscid compressible flows in complicated computational domains.
With the rapid development of network and multimedia technologies, the widespread use of images in social networks, military applications, and other fields has led to substantial demand for image encryption. To protect images from potential threats, most protocols perform multiple rounds of confusion and diffusion until a satisfactory security level is attained. However, this strategy is time-consuming and unable to meet the requirements of real-time video encryption. Consequently, many works realize video encryption by simplifying the encryption process, e.g., employing a single round of confusion and diffusion to encrypt each frame. To raise security to the level of image encryption while ensuring real-time performance, this paper proposes a real-time video encryption protocol based on a multi-round confusion-diffusion architecture. It splits each frame into subframes and uses a set of threads to perform confusion and diffusion concurrently. The statistical and security analyses demonstrate that the proposed protocol exhibits excellent statistical properties and is resistant to various attacks. The encryption speed evaluation shows that our method significantly enhances computational efficiency, achieving latency-free encryption of 512×512 24 FPS videos on an Intel Xeon Gold 6226R@2.9GHz CPU. Even with four rounds of confusion and six rounds of diffusion applied to each frame, the average encryption time remains below 30 ms.
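The thread-per-subframe structure described above can be sketched as follows. The confusion and diffusion maps used here (a keyed pixel permutation and a chained XOR) are deliberately simplified stand-ins, not the protocol's actual cipher components, and the key handling is a placeholder.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def encrypt_subframe(block, seed):
    """One round of toy confusion (keyed permutation) and toy diffusion
    (keystream XOR plus a chained XOR) on a flattened subframe.
    Illustrative only; the real protocol's maps differ."""
    rng = np.random.default_rng(seed)
    flat = block.ravel().copy()
    perm = rng.permutation(flat.size)            # confusion: shuffle positions
    flat = flat[perm]
    keystream = rng.integers(0, 256, flat.size, dtype=np.uint8)
    out = np.bitwise_xor(flat, keystream)        # keyed masking
    prev = np.uint8(0)
    for i in range(out.size):                    # diffusion: chain each byte
        out[i] ^= prev                           # to its predecessor
        prev = out[i]
    return out.reshape(block.shape)

def encrypt_frame(frame, n_strips=4, key=1234):
    """Split a frame into horizontal subframes and encrypt them concurrently,
    mirroring the thread-per-subframe idea in the abstract."""
    strips = np.array_split(frame, n_strips, axis=0)
    with ThreadPoolExecutor(max_workers=n_strips) as pool:
        enc = list(pool.map(lambda s: encrypt_subframe(s[1], key + s[0]),
                            enumerate(strips)))
    return np.vstack(enc)
```

Because each subframe is processed independently, adding confusion/diffusion rounds increases per-thread work without serializing the frame, which is the mechanism that lets the multi-round design stay within a real-time budget.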