This paper evaluates the performance of hypercube machines on a computational fluid dynamics problem. Our evaluation focuses on a prototype of a class of widely used fluid dynamics codes, FLO52, written by Antony Jameson, which solves the two-dimensional steady Euler equations describing flow around an airfoil. In this paper, we give a description of FLO52, its hypercube mapping, and the code modifications made to increase machine utilization. Results from two hypercube computers (a 16-node iPSC/2 and a 512-node NCUBE/ten) are presented and compared. In addition, we develop a mathematical model of the execution time as a function of several machine and algorithm parameters. This model accurately predicts the actual run times obtained. Predictions about future hypercubes are made using this timing model.
Given two sorted arrays, A (of size n) and B (of size m) where n ≤ m, it is required to determine for every element of A its position within B. A parallel algorithm is presented to solve this problem on the exclusive-read exclusive-write parallel random access machine (EREW PRAM). The algorithm uses p processors and runs in time. When and n ≤ m^α, with 0 < α < 1, the cost of the algorithm, i.e. its running time multiplied by the number of processors it uses, matches the running time of the fastest known sequential algorithm for this problem.
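As a point of reference, the problem this abstract addresses (ranking every element of sorted A within sorted B) can be sketched sequentially in a few lines using binary search; the function name `rank_in_b` is ours for illustration, and this sequential sketch is not the paper's parallel EREW PRAM algorithm.

```python
from bisect import bisect_left

def rank_in_b(a, b):
    """For each element of sorted array a, return its insertion
    position (rank) within sorted array b, found by binary search.
    Sequential cost: O(n log m) for n = len(a), m = len(b)."""
    return [bisect_left(b, x) for x in a]
```

For example, `rank_in_b([2, 5, 9], [1, 3, 4, 6, 8, 10])` returns `[1, 3, 5]`. The parallel algorithm distributes this ranking work over p processors while keeping the total cost optimal.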
Parallel and vector processing algorithms for preconditioned iterative methods for solving sparse linear systems have been studied for the efficient use of multivector computers. Several 3D orderings, satisfying the properties of compatibility and partial compatibility [6], are introduced and implemented using the Nested Dissection technique [11]. These preconditioners, as well as the Tridiagonal Factorization preconditioner [5], are realized on the Cray-2 using the microtasking facility. Results of numerical experiments are described to characterize these algorithms on the Cray-2.
Much work has been done on the problem of synthesizing a processor array from a system of recurrence equations. Some researchers limit communication to nearest neighbors in the array; others use broadcast. In many cases, neither approach results in an optimal execution time.
In this paper a technique called bounded broadcast is explored whereby an element of a processor array can broadcast to a bounded number of other processors. This technique is applied to the problems of transitive closure and all-pairs shortest distance, resulting in time complexities that are smaller than those reported previously. In general, the technique can be used to design bounded broadcast systolic arrays for algorithms whose implementation can benefit from broadcasting.
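For background, the transitive-closure problem that the bounded-broadcast systolic array accelerates can be sketched sequentially with Warshall's algorithm; this baseline is standard material, not the paper's array design.

```python
def transitive_closure(adj):
    """Warshall's algorithm: adj is an n x n boolean adjacency
    matrix; returns the reachability matrix in O(n^3) sequential
    time. The k-loop is the stage that systolic-array designs
    pipeline across processors."""
    n = len(adj)
    reach = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach
```

All-pairs shortest distance has the same loop structure with `min`/`+` replacing `or`/`and`, which is why both problems benefit from the same broadcasting technique.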
In this paper, we present three parallel algorithms for matrix multiplication. The first, which employs pipelining techniques on a mesh grid, uses only one copy of the data matrices. The second uses multiple copies of the data matrices, also on a mesh grid. Although the second algorithm's data communication operations are reduced, its local data memory requirement for each processing element increases. The third, which uses a cubic grid, shows the trade-offs between reducing the computation time and reducing the communication overhead. Performance models and the feasibility of these three algorithms are studied. We analyze the interplay among the number of processing elements, the communication overhead, and the local memory requirements of each processing element. We also present experimental results for these three algorithms on a 32-node nCUBE-2 computer.
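The memory-versus-communication trade-off described above can be illustrated with a sequential sketch of blocked matrix multiplication, where the block size `nb` stands in for per-processor local memory and the counted block fetches stand in for communication operations. The names and the fetch-counting convention are our illustrative assumptions, not the paper's algorithms.

```python
def block_matmul(A, B, nb):
    """Blocked multiply of n x n matrices A and B. Larger blocks
    (nb) mean fewer block fetches ("communication") but more local
    memory per block; this is the trade-off the mesh and cubic-grid
    algorithms navigate."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    fetches = 0
    for i0 in range(0, n, nb):
        for j0 in range(0, n, nb):
            for k0 in range(0, n, nb):
                fetches += 2  # one block of A, one block of B
                for i in range(i0, min(i0 + nb, n)):
                    for k in range(k0, min(k0 + nb, n)):
                        aik = A[i][k]
                        for j in range(j0, min(j0 + nb, n)):
                            C[i][j] += aik * B[k][j]
    return C, fetches
```

With `nb = n` the whole matrices are held locally and only two fetches occur; with `nb = 1` memory per step is minimal but fetches grow as n^3, mirroring the single-copy versus multiple-copy designs.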
We present results on embedding a multigrid solver for Poisson's equation into the parallel 3D Monte Carlo device simulator, PMC-3D. First, we implemented the sequential multigrid solver and embedded it into the Monte Carlo code, which previously used a sequential successive overrelaxation (SOR) solver. Depending on the convergence threshold, we obtained significant speedups ranging from 5 to 15 on a single HP 712/80 workstation. We also implemented the parallel multigrid solver by extending the partitioning algorithm and the interprocessor communication routines of the SOR solver to service multiple grids. The Monte Carlo code with the parallel multigrid Poisson solver is 3 to 9 times faster than the Monte Carlo code with the parallel SOR code, based on timing results on a 32-node nCUBE multiprocessor.
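For context, here is a minimal sketch of the kind of SOR iteration the multigrid solver replaces, for the 2D Poisson equation with zero boundary values. The function name, relaxation factor, and stopping rule are illustrative assumptions, not PMC-3D's actual solver (which is 3D and parallel).

```python
def sor_poisson(f, omega=1.5, tol=1e-6, max_iter=10000):
    """Successive over-relaxation for -Laplace(u) = f on an
    n x n interior grid of the unit square, u = 0 on the boundary.
    Returns the solution grid and the iteration count."""
    n = len(f)
    h2 = 1.0 / (n + 1) ** 2  # grid spacing squared
    u = [[0.0] * n for _ in range(n)]
    for it in range(max_iter):
        diff = 0.0
        for i in range(n):
            for j in range(n):
                up = u[i - 1][j] if i > 0 else 0.0
                dn = u[i + 1][j] if i < n - 1 else 0.0
                lf = u[i][j - 1] if j > 0 else 0.0
                rt = u[i][j + 1] if j < n - 1 else 0.0
                gs = 0.25 * (up + dn + lf + rt + h2 * f[i][j])
                diff = max(diff, abs(gs - u[i][j]))
                u[i][j] += omega * (gs - u[i][j])
        if diff < tol:
            return u, it + 1
    return u, max_iter
```

SOR's iteration count grows with grid resolution, which is precisely the cost multigrid avoids by correcting long-wavelength error on coarser grids.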
Gang scheduling has been widely used as a practical solution to the dynamic parallel job scheduling problem. To overcome some of the limitations of traditional Gang scheduling algorithms, Concurrent Gang is proposed as a class of scheduling policies which allows the flexible and simultaneous scheduling of multiple parallel jobs. It hence improves the space-sharing characteristics of Gang scheduling while preserving all other advantages. To provide a sound analysis of Concurrent Gang performance, a novel methodology based on the traditional concept of competitive ratio is also introduced. Dubbed dynamic competitive ratio, the new method is used to compare the dynamic bin packing algorithms used in this paper. These packing algorithms apply to the Concurrent Gang scheduling of a workload generated by a statistical model. Moreover, dynamic competitive ratio is the figure of merit used to evaluate and compare packing strategies for job scheduling under multiple constraints. It will be shown that for the unidimensional case there is a small difference between the performance of best fit and first fit; first fit can hence be used without significant system degradation. For the multidimensional case, when memory is also considered, we conclude that the packing algorithm must try to balance the resource utilization in all dimensions simultaneously, instead of giving priority to only one dimension of the problem.
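The first-fit versus best-fit comparison in the unidimensional case can be sketched with a generic online bin-packing routine; the naming below is ours, and this is not the paper's Concurrent Gang packer.

```python
def pack(items, capacity, strategy):
    """Online bin packing. strategy 'first' places each item in
    the first open bin with room; 'best' places it in the bin with
    the least remaining room that still fits. A new bin is opened
    when no existing bin fits."""
    bins = []  # remaining capacity of each open bin
    for size in items:
        candidates = [i for i, rem in enumerate(bins) if rem >= size]
        if not candidates:
            bins.append(capacity - size)
        elif strategy == 'first':
            bins[candidates[0]] -= size
        else:  # best fit
            i = min(candidates, key=lambda i: bins[i])
            bins[i] -= size
    return len(bins)
```

On many workloads the two strategies open the same number of bins, which is consistent with the abstract's finding that first fit suffices in one dimension.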
We investigate the computational power of parallel models with directed reconfigurable buses and with shared memory. Based on feasibility considerations present in the literature, we split these models into "heavyweight" (directed reconfigurable buses, the Combining PRAM, and the BSR) and "lightweight" (all the other PRAMs) and then find that the heavyweight class is strictly more powerful than the lightweight class, as expected. On the other hand, we contradict the long-held belief that the heavyweight models form a hierarchy, showing that they are all identical to one another in computational power. We start the process by showing that the Collision write conflict resolution rule is universal on models with reconfigurable buses (in the sense that complex conflict resolution rules such as Priority and Combining can be simulated with constant overhead by Collision).
Recently, the Local Iterative Monte Carlo technique was introduced for the efficient simulation of effects connected to sparsely populated regions in semiconductor devices, such as hot-electron effects in silicon MOSFETs. This paper focuses on computational aspects of this new Monte Carlo technique, namely the reduction of the computation time by parallel computation and by the reuse of drift information. The Local Iterative Monte Carlo technique combines short Monte Carlo particle-flight simulations with an iteration process to form a complete device simulation. The separation between the short Monte Carlo simulations and the iteration process makes a simple parallelization strategy possible. The necessary data transfer is small and can be performed via the Network File System. An almost linear speedup could be achieved. Besides parallelization, the computational expense can be significantly reduced when the results of the short Monte Carlo simulations are stored in a drift table and used several times. A comparison between a bulk, a one-dimensional, and a two-dimensional Local Iterative Monte Carlo simulation reveals that reusing the drift information becomes increasingly efficient as the dimension of the simulation increases.
A parallel version of the invaded cluster algorithm is described. Results from large-scale (up to 4096^2 and 512^3) simulations of the Ising model are reported. No evidence of critical slowing down is found for the three-dimensional Ising model. The magnetic exponent is estimated to be 2.482 ± 0.001 (β/ν = 0.518 ± 0.001) for the three-dimensional Ising model.
We assume the multitape real-time Turing machine as a formal model for parallel real-time computation. Then, we show that, for any positive integer k, there is at least one language Lk which is accepted by a k-tape real-time Turing machine, but cannot be accepted by a (k - 1)-tape real-time Turing machine. It follows therefore that the languages accepted by real-time Turing machines form an infinite hierarchy with respect to the number of tapes used. Although this result was previously obtained elsewhere, our proof is considerably shorter, and explicitly builds the languages Lk. The ability of the real-time Turing machine to model practical real-time and/or parallel computations is open to debate. Nevertheless, our result shows how a complexity theory based on a formal model can draw interesting results that are of a more general nature than those derived from examples. Thus, we hope to offer a motivation for looking into realistic parallel real-time models of computation.
Advances in DRAM density have led to several proposals to perform computation in memory [1], [2], [3]. Active Pages is a page-based model of intelligent memory that can exploit large amounts of parallel computation in data-intensive applications. With a simple VLIW processor embedded near each page on DRAM, Active Page memory systems achieve up to 1000X speedups over conventional memory systems [4].
Active Pages are specifically designed to support virtualized hardware resources. In this study, we examine operating system techniques that allow Active Page memories to share, or multiplex, embedded VLIW processors across multiple physical Active Pages. We explore the trade-off between individual page-processor performance and page-level multiplexing. We find that hardware costs of computational logic can be reduced from 31% of DRAM chip area to 12%, through multiplexing, without significant loss in performance. Furthermore, manufacturing defects that disable up to 50% of the page processors can be tolerated through efficient resource allocation and associative multiplexing.
A new computational paradigm is described which offers the possibility of superlinear (and sometimes unbounded) speedup, when parallel computation is used. The computations involved are subject only to given mathematical constraints and hence do not depend on external circumstances to achieve superlinear performance. The focus here is on geometric transformations. Given a geometric object A with some property, it is required to transform A into another object B which enjoys the same property. If the transformation requires several steps, each resulting in an intermediate object, then each of these intermediate objects must also obey the same property. We show that in transforming one triangulation of a polygon into another, a parallel algorithm achieves a superlinear speedup. In the case where a convex decomposition of a set of points is to be transformed, the improvement in performance is unbounded, meaning that a parallel algorithm succeeds in solving the problem as posed, while all sequential algorithms fail.
It is shown that the concept of a Universal Computer cannot be realized. Specifically, instances of a computable function are exhibited that cannot be computed on any machine that is capable of only a finite and fixed number of operations per step. This remains true even if the machine is endowed with an infinite memory and the ability to communicate with the outside world while it is attempting to compute the function. It also remains true if, in addition, the machine is given an indefinite amount of time to compute the function. This result applies not only to idealized models of computation, such as the Turing Machine and the like, but also to all known general-purpose computers, including existing conventional computers (both sequential and parallel), as well as contemplated unconventional ones such as biological and quantum computers. Even accelerating machines (that is, machines that increase their speed at every step) cannot be universal.
We present optimal parallel solutions to reporting paths between pairs of nodes in an n-node tree. Our algorithms are deterministic and designed to run on an exclusive read exclusive write parallel random-access machine (EREW PRAM). In particular, we provide a simple optimal parallel algorithm for preprocessing the input tree such that the path queries can be answered efficiently. Our algorithm for preprocessing runs in O(log n) time using O(n/log n) processors. Using the preprocessing, we can report paths between k node pairs in O(log n + log k) time using O(k + (n + S)/log n) processors on an EREW PRAM, where S is the size of the output. In particular, we can report the path between a single pair of distinct nodes in O(log n) time using O(L/log n) processors, where L denotes the length of the path.
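As a sequential point of comparison for the preprocessing/query split described above, the sketch below preprocesses node depths and answers a path query by climbing to the lowest common ancestor. The names `preprocess` and `path` are ours; the paper's EREW PRAM algorithm is parallel and differently structured.

```python
def preprocess(parent, root):
    """Compute the depth of every node from parent pointers: the
    sequential analogue of the preprocessing step."""
    depth = {root: 0}
    def d(v):
        if v not in depth:
            depth[v] = d(parent[v]) + 1
        return depth[v]
    for v in parent:
        d(v)
    return depth

def path(u, v, parent, depth):
    """Report the path from u to v by climbing both endpoints to
    their lowest common ancestor; cost O(L) for a path of length L."""
    left, right = [], []
    while depth[u] > depth[v]:
        left.append(u); u = parent[u]
    while depth[v] > depth[u]:
        right.append(v); v = parent[v]
    while u != v:
        left.append(u); right.append(v)
        u, v = parent[u], parent[v]
    return left + [u] + right[::-1]
```

The parallel algorithm achieves the same O(L) total work per query but spreads it over O(L/log n) processors in O(log n) time.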
The link metric, defined on a constrained region R of the plane, sets the distance between a pair of points in R to equal the minimum number of line segments, or links, needed to construct a path in R between the point pair. The minimum rectilinear link path problem considered here is to compute a rectilinear path consisting of the minimum number of links between two points in R, when R is inside an n-sided rectilinear simple polygon. In this paper we present optimal sequential and parallel algorithms to compute a minimum rectilinear link path in a trapezoided region R. Our parallel algorithm requires O(log n) time using a total of O(n) operations. The complexity of our algorithm matches that of the algorithm of McDonald and Peters [19]. By exploiting the dual structure of the trapezoidation of R, we obtain a conceptually simple and easy-to-implement algorithm. As applications of our techniques we provide optimal solutions to the minimum nested polygon problem and the minimum polygon separation problem. The minimum nested polygon problem asks for a rectilinear polygon, with the minimum number of sides, that is nested between two given rectilinear polygons, one of which is contained in the other. The minimum polygon separation problem asks for a minimum number of orthogonal lines and line segments that separate two given non-intersecting simple rectilinear polygons. All parallel algorithms are deterministic, designed to run on the exclusive read exclusive write parallel random access machine (EREW PRAM), and are optimal.
Traditionally, interest in parallel computation centered around the speedup provided by parallel algorithms over their sequential counterparts. In this paper, we ask a different type of question: Can parallel computers, due to their speed, do more than simply speed up the solution to a problem? We show that for real-time optimization problems, a parallel computer can obtain a solution that is better than that obtained by a sequential one. Specifically, a sequential and a parallel algorithm are exhibited for the problem of computing the best-possible approximation to the minimum-weight spanning tree of a connected, undirected and weighted graph whose vertices and edges are not all available at the outset, but instead arrive in real time. While the parallel algorithm succeeds in computing the exact minimum-weight spanning tree, the sequential algorithm can only manage to obtain an approximate solution. In the worst case, the ratio of the weight of the solution obtained sequentially to that of the solution computed in parallel can be arbitrarily large.
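For reference, the offline baseline both the sequential and parallel real-time algorithms are measured against, the minimum-weight spanning tree of a fully known graph, can be sketched with Kruskal's algorithm. This is standard material, not the paper's real-time algorithms, and the function name is ours.

```python
def mst_weight(n, edges):
    """Kruskal's algorithm with union-find (path halving).
    edges is a list of (weight, u, v) tuples over n vertices;
    returns the total weight of a minimum spanning tree."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    total = 0
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
    return total
```

In the real-time setting studied above, edges arrive too quickly for a sequential machine to re-examine earlier choices, which is how the gap to the exact parallel solution arises.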
A parallel Navier-Stokes solver based on a dynamic overset unstructured grids method is presented to simulate the unsteady turbulent flow field around a helicopter in forward flight. The grid method combines the advantages of unstructured grids and Chimera grids and is suitable for dealing with multiple bodies in relative motion. The unsteady Navier-Stokes equations are solved on overset unstructured grids by an explicit dual time-stepping, finite volume method. A preconditioning method applied to the inner iteration of the dual time-stepping is used to speed up the convergence of the numerical simulation. The Spalart-Allmaras one-equation turbulence model is used to evaluate the turbulent viscosity. Parallel computation is based on the dynamic domain decomposition method in the overset unstructured grids system at each physical time step. A generic Robin helicopter with a four-blade rotor in forward flight is considered to validate the method presented in this paper. Numerical simulation results show that the parallel dynamic overset unstructured grids method is very efficient for the simulation of helicopter flow fields and that the results are reliable.
The paper presents a complete approach to the multithreaded execution of a control program prepared according to the IEC 61131-3 standard. The program is mapped to a dedicated multiple-core CPU unit. The CPU consists of multiple independent bit and word CPUs. The computation synchronization mechanism is based on memory cells with semaphored access, which enable hardware-level synchronization. The paper presents in detail the architecture, the results of implementation, and the achieved performance. The custom-developed compiler translates the standard programming languages into a multithreaded executable form. It utilizes an original intermediate data flow graph to optimize and recognize program parallelisms. The program is automatically partitioned and mapped to the available computing resources. The paper concludes with a performance comparison of program executions using the standard single-threaded approach and the proposed approach.
We introduce a parallel approach to geometric modeling of complex objects and scenes, combining a dataflow streaming of BSP trees with a partition of the object space into independent portions, to be evaluated in parallel with minimal interprocess communication. Binary Space Partition (BSP) is a space index used in graphics for hidden-surface removal and animation. We use BSP trees with fuzzy leaves as a progressive representation of solid meshes. Our approach is implemented as a dataflow with processes that progress concurrently, where each refinement of the input to a process is mapped instantly to a refinement of the output, so that the result is also a stream of progressive refinements. This framework allows for the progressive generation of complex geometric parts and large-scale assemblies. We have adapted several graphics techniques, including BSP, boundary polygons, CSG, splines, and subdivision methods, to fit into our dataflow graph, where four types of processes produce, transform, combine, or consume mesh cells. This approach is scalable over different kinds of HPC hardware and different numbers of computing nodes, by way of the decomposition of the object space and the distribution of computational processes. Compiling a generative geometric expression into a dataflow graph is well suited to SMP machines, whereas a space decomposition into independent portions fits well with computing clusters and grids.