
  • Article (No Access)

    PROGRAMMING MODELS, COMPILERS, AND ALGORITHMS FOR IRREGULAR DATA-PARALLEL COMPUTATIONS

    Advances in parallel computing have made it clear that the ability to express computations in a machine-independent manner and the ability to handle dynamic and irregular computations are two necessary features of future programming systems. In this paper, we describe the nested data-parallel model of programming, which has both these capabilities. We present an intermediate-level language called VCODE embodying the properties of the model. We discuss several representative irregular problems and show both their expression in the model and their performance on shared-memory multiprocessors. We also briefly discuss the compiler techniques necessary for achieving good performance on MIMD machines with this programming model.
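
    The flattening idea at the heart of this model can be illustrated with segmented operations: a nested (ragged) collection is stored as one flat value array plus a segment descriptor, so irregular reductions become regular operations over flat data. The sketch below is plain sequential C++, not VCODE, and its names and representation are only illustrative.

    ```cpp
    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Segmented sum: the core primitive behind flattened nested data
    // parallelism. A ragged collection of sub-vectors is stored as one
    // flat value array plus segment lengths, so irregular work maps
    // onto regular, vectorizable operations.
    std::vector<double> segmented_sum(const std::vector<double>& values,
                                      const std::vector<std::size_t>& seg_lens) {
        std::vector<double> sums;
        sums.reserve(seg_lens.size());
        std::size_t offset = 0;
        for (std::size_t len : seg_lens) {      // one reduction per segment
            double s = 0.0;
            for (std::size_t i = 0; i < len; ++i) s += values[offset + i];
            sums.push_back(s);
            offset += len;
        }
        return sums;
    }

    int main() {
        // The nested vector [[1,2],[3,4,5],[6]] flattened:
        std::vector<double> values = {1, 2, 3, 4, 5, 6};
        std::vector<std::size_t> lens = {2, 3, 1};
        for (double s : segmented_sum(values, lens)) std::cout << s << ' ';
        std::cout << '\n';                      // prints: 3 12 6
    }
    ```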

  • Article (No Access)

    SPARSE COMPUTATION WITH PEI

    The PEI formalism is designed to support reasoning about and developing parallel programs in the context of data parallelism. In this paper, we focus on the use of PEI to transform a program involving dense matrices into a new program involving sparse matrices, using the matrix-vector product as an example.
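
    For context, the usual target of such a dense-to-sparse transformation is a compressed representation such as compressed sparse row (CSR). The following minimal C++ sketch (not PEI notation; the names are our own) shows the sparse matrix-vector product that a transformed program would compute.

    ```cpp
    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Sparse matrix-vector product y = A*x with A in compressed sparse
    // row (CSR) form: only nonzeros are stored, and the half-open range
    // row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i.
    std::vector<double> csr_matvec(const std::vector<std::size_t>& row_ptr,
                                   const std::vector<std::size_t>& col_idx,
                                   const std::vector<double>& val,
                                   const std::vector<double>& x) {
        std::vector<double> y(row_ptr.size() - 1, 0.0);
        for (std::size_t i = 0; i + 1 < row_ptr.size(); ++i)
            for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                y[i] += val[k] * x[col_idx[k]];
        return y;
    }

    int main() {
        // A = [[2,0,1],[0,3,0]] in CSR form, x = (1,1,1).
        std::vector<std::size_t> row_ptr = {0, 2, 3};
        std::vector<std::size_t> col_idx = {0, 2, 1};
        std::vector<double> val = {2, 1, 3};
        std::vector<double> x = {1, 1, 1};
        for (double v : csr_matvec(row_ptr, col_idx, val, x))
            std::cout << v << ' ';              // prints: 3 3
    }
    ```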

  • Article (No Access)

    AN ANALYTICAL METHOD FOR PARALLELIZATION OF RECURSIVE FUNCTIONS

    Programming with parallel skeletons is an attractive framework because it encourages programmers to develop efficient and portable parallel programs. However, extracting parallelism from sequential specifications and constructing efficient parallel programs using the skeletons are still difficult tasks.

    In this paper, we propose an analytical approach to transforming recursive functions on general recursive data structures into compositions of parallel skeletons. Using static slicing, we have defined a classification of subexpressions based on their data-parallelism. Then, skeleton-based parallel programs are generated from the classification. To extend the scope of parallelization, we have adopted more general parallel skeletons which do not require the associativity of argument functions. In this way, our analytical method can parallelize recursive functions with complex data flows.
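
    As a concrete illustration of the target form, the classic list instance of this idea rewrites a recursive sum-of-squares into a composition of map and reduce skeletons. The C++ sketch below shows only this simplest case; the paper's method handles general recursive data structures and skeletons without associativity requirements.

    ```cpp
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Two classic data-parallel skeletons. A transformation in the
    // spirit of the paper emits compositions like
    // reduce(plus, map(square, xs)) instead of the original recursive
    // definition; both skeletons are written sequentially here, but
    // each has a well-known parallel schedule.
    template <typename T, typename F>
    std::vector<T> map_skel(F f, const std::vector<T>& xs) {
        std::vector<T> ys;
        ys.reserve(xs.size());
        for (const T& x : xs) ys.push_back(f(x));
        return ys;
    }

    template <typename T, typename Op>
    T reduce_skel(Op op, T unit, const std::vector<T>& xs) {
        return std::accumulate(xs.begin(), xs.end(), unit, op);
    }

    int main() {
        // Recursive form: sumsq []     = 0
        //                 sumsq (x:xs) = x*x + sumsq xs
        // Skeleton form:  reduce (+) 0 . map (^2)
        std::vector<int> xs = {1, 2, 3, 4};
        int result = reduce_skel([](int a, int b) { return a + b; }, 0,
                                 map_skel([](int x) { return x * x; }, xs));
        std::cout << result << '\n';            // prints: 30
    }
    ```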

  • Article (No Access)

    TUNING TASK GRANULARITY AND DATA LOCALITY OF DATA PARALLEL GPH PROGRAMS

    The performance of data parallel programs often hinges on two key coordination aspects: the computational costs of the parallel tasks relative to their management overhead — task granularity; and the communication costs induced by the distance between tasks and their data — data locality. In data parallel programs both granularity and locality can be improved by clustering, i.e. arranging for parallel tasks to operate on related sub-collections of data.

    The GPH parallel functional language automatically manages most coordination aspects, but also allows some high-level control of coordination using evaluation strategies. We study the coordination behavior of two typical data parallel programs, and find that while they can be improved by introducing clustering evaluation strategies, further performance improvements can be achieved by restructuring the program.

    We introduce a new generic Cluster class that allows clustering to be introduced systematically and then improved by program transformation. In contrast to many other parallel program transformation approaches, we transform realistic programs and report performance results on a 32-processor Beowulf cluster. The Cluster class is highly generic and extensible, amenable to reasoning, and avoids conflating the computation and coordination aspects of the program.
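
    GPH and its evaluation strategies are Haskell-based; purely to illustrate the clustering idea, the following hypothetical C++ sketch spawns one task per contiguous chunk of data, trading task-management overhead (granularity) against load balance while keeping each task's data together (locality).

    ```cpp
    #include <algorithm>
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Clustering for granularity: instead of one fine-grained task per
    // element, spawn one task per contiguous chunk of the collection.
    double clustered_sum(const std::vector<double>& xs, std::size_t chunk) {
        std::vector<std::future<double>> tasks;
        for (std::size_t lo = 0; lo < xs.size(); lo += chunk) {
            std::size_t hi = std::min(lo + chunk, xs.size());
            tasks.push_back(std::async(std::launch::async, [&xs, lo, hi] {
                return std::accumulate(xs.begin() + lo, xs.begin() + hi, 0.0);
            }));
        }
        double total = 0.0;
        for (auto& t : tasks) total += t.get(); // combine per-cluster results
        return total;
    }

    int main() {
        std::vector<double> xs(1000, 1.0);
        std::cout << clustered_sum(xs, 250) << '\n'; // 4 tasks; prints: 1000
    }
    ```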

  • Article (No Access)

    THE INTEGRATION OF TASK AND DATA PARALLEL SKELETONS

    We describe a skeletal parallel programming library which integrates task and data parallel constructs within an API for C++. The traditional skeletal requirements of higher-orderness and polymorphism are met by exploiting operator overloading and templates, while the underlying parallelism is provided by MPI. We present a case study describing two algorithms for the travelling salesman problem.
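
    The following toy C++ sketch (not the library's actual API) illustrates how operator overloading and templates can stand in for higher-order functions when composing skeletons; in a real library of this kind, each stage would be distributed over MPI processes rather than run in-process.

    ```cpp
    #include <iostream>
    #include <string>

    // A toy pipeline skeleton: templates supply polymorphism and
    // operator overloading supplies the higher-order composition that
    // a functional skeleton language would express directly.
    template <typename F, typename G>
    struct Pipe {
        F f; G g;
        template <typename T>
        auto operator()(T x) const { return g(f(x)); }
    };

    template <typename F, typename G>
    Pipe<F, G> operator|(F f, G g) { return {f, g}; }  // stage composition

    int main() {
        auto stage1 = [](int x) { return x * x; };
        auto stage2 = [](int x) { return std::to_string(x) + "!"; };
        auto pipeline = stage1 | stage2;        // task-parallel shape
        std::cout << pipeline(7) << '\n';       // prints: 49!
    }
    ```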

  • Article (No Access)

    DatTeL: A DATA-PARALLEL C++ TEMPLATE LIBRARY

    The concept of C++ templates, realized in the Standard Template Library (STL), allows the development of generic programs suitable for various concrete data structures. Our aim in this paper is to enable the efficient execution of STL programs on parallel machines. We present DatTeL, a data-parallel library which allows simple, efficient programming for various parallel architectures while staying within the paradigm of classical C++ template programming. We describe the principles of our approach and explain our design decisions and their implementation in the library. The presentation is illustrated with a case study: the parallelization of a generic algorithm for carry-lookahead addition. We compare DatTeL to related work and report experimental results for the current version of our library.
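
    The case study rests on a standard observation: carry propagation in binary addition is an inclusive scan under an associative combiner, so it parallelizes like any other scan. The sketch below expresses this in plain C++17 rather than DatTeL; all names are our own.

    ```cpp
    #include <cstdint>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Carry-lookahead addition as a scan. Each bit position is
    // summarized as Kill, Propagate, or Generate; the carry out of
    // position i is the inclusive scan of these statuses under the
    // associative combiner below.
    enum Status : std::uint8_t { K, P, G };    // kill, propagate, generate

    Status combine(Status left, Status right) {  // associative
        return right == P ? left : right;
    }

    std::vector<int> add(const std::vector<int>& a, const std::vector<int>& b) {
        std::size_t n = a.size();              // bit 0 is least significant
        std::vector<Status> st(n);
        for (std::size_t i = 0; i < n; ++i)
            st[i] = (a[i] & b[i]) ? G : (a[i] ^ b[i]) ? P : K;

        // combine is associative, so this scan also accepts a parallel
        // execution policy; shown sequentially here.
        std::inclusive_scan(st.begin(), st.end(), st.begin(), combine);

        std::vector<int> sum(n + 1);
        int carry = 0;
        for (std::size_t i = 0; i < n; ++i) {
            sum[i] = a[i] ^ b[i] ^ carry;
            carry = (st[i] == G);              // carry out of position i
        }
        sum[n] = carry;
        return sum;
    }

    int main() {
        // 7 + 1 = 8, bits listed LSB first: 1110 + 1000 -> 00010.
        for (int bit : add({1, 1, 1, 0}, {1, 0, 0, 0}))
            std::cout << bit;                  // prints: 00010
        std::cout << '\n';
    }
    ```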

  • Article (No Access)

    EVALUATING COMPUTATIONAL COSTS WHILE HANDLING DATA AND CONTROL PARALLELISM

    The aim of this work is to introduce a computational cost system associated with a semantic framework for handling data and control parallelism orthogonally. In this framework, a parallel application is described by a semantic expression that combines data-access and control-parallelism abstractions in an orthogonal manner. The evaluation of such an expression is driven by a set of rewriting rules, each of which is annotated with a computational cost. We show how to evaluate the final cost of the application, and how this information, together with the capabilities of the semantic framework, can be exploited to improve overall performance.
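
    As a hypothetical illustration of rule-level cost evaluation, one might charge sequential composition the sum of its operands' costs and parallel composition their maximum; the tiny C++ sketch below evaluates such a composed cost. The framework's actual rules and cost model are richer than this.

    ```cpp
    #include <algorithm>
    #include <iostream>

    // Hypothetical cost rules for composed parallel expressions:
    // sequencing adds costs, parallel composition takes their maximum.
    // Real frameworks also charge for data access and communication;
    // this only illustrates rule-by-rule cost evaluation.
    double seq_cost(double a, double b) { return a + b; }
    double par_cost(double a, double b) { return std::max(a, b); }

    int main() {
        // cost( seq(par(t1, t2), t3) ) with t1 = 4, t2 = 6, t3 = 3
        double c = seq_cost(par_cost(4.0, 6.0), 3.0);
        std::cout << c << '\n';                 // prints: 9
    }
    ```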

  • Article (No Access)

    SIMULTANEOUS PARALLEL REDUCTION ON SIMD MACHINES

    Proper distribution of operations among parallel processors in a large scientific computation executed on a distributed-memory machine can significantly reduce the total computation time. In this paper, we propose an operation called simultaneous parallel reduction (SPR) that is amenable to such optimization. SPR performs reduction operations in parallel, each operation reducing a one-dimensional consecutive section of a distributed array. Each element of the distributed array serves as an operand to many reductions executed concurrently over overlapping sections of the array. SPR is distinct from the more commonly considered parallel reduction, which evaluates a single reduction concurrently. In this paper we consider SPR on Single Instruction Multiple Data (SIMD) machines with different interconnection networks. We focus on SPR over sections whose size is not a power of 2, with the result shifted relative to the arguments. Several algorithms achieving some of the lower bounds on SPR complexity are presented under various assumptions about the properties of the binary operator of the reduction and the communication costs of the target architectures.
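
    One standard special case helps fix intuition: when the reduction operator has an inverse (such as addition), all overlapping window reductions can be answered from a single prefix-scan array. The C++ sketch below shows only this invertible, shared-memory case; the paper's algorithms cover SIMD machines and operators without an inverse.

    ```cpp
    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Simultaneous reductions over overlapping windows: with an
    // invertible operator (here +/-), every window sum of width w is
    // read off one prefix-sum array, so all reductions share O(n) work
    // instead of costing O(n*w).
    std::vector<long> window_sums(const std::vector<long>& a, std::size_t w) {
        std::vector<long> prefix(a.size() + 1, 0);
        for (std::size_t i = 0; i < a.size(); ++i)
            prefix[i + 1] = prefix[i] + a[i];  // parallelizable as a scan
        std::vector<long> out;
        for (std::size_t i = 0; i + w <= a.size(); ++i)
            out.push_back(prefix[i + w] - prefix[i]);
        return out;
    }

    int main() {
        // Windows of width 3 (not a power of 2) over overlapping sections.
        for (long s : window_sums({1, 2, 3, 4, 5}, 3))
            std::cout << s << ' ';             // prints: 6 9 12
    }
    ```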

  • Article (No Access)

    SIMULTANEOUS ALLOCATION AND SCHEDULING USING CONVEX PROGRAMMING TECHNIQUES

    Simultaneous exploitation of task and data parallelism provides significant benefits for many applications. The basic approach to exploiting task and data parallelism is to use a task-graph representation of programs (a Macro Dataflow Graph) to decide on the degree of data parallelism to be used for each task (allocation) and an execution order for the tasks (scheduling). Previously, we presented a two-step approach that treats allocation and scheduling as independent of each other. In this paper, we present a new simultaneous approach which uses constraints to model the scheduler during allocation. The simultaneous approach provides significant benefits over our earlier approach on the benchmark task graphs we have considered.
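
    To make the allocation decision concrete, a toy model (entirely hypothetical numbers, with exhaustive search standing in for the convex program) might give each task an Amdahl-style runtime T(p) = serial + work/p and pick the processor split minimizing the makespan of two side-by-side tasks:

    ```cpp
    #include <algorithm>
    #include <iostream>

    // Toy allocation decision for two independent data-parallel tasks
    // on P processors. The model and numbers are made up; they only
    // illustrate why allocation must anticipate the schedule.
    double task_time(double serial, double work, int p) {
        return serial + work / p;
    }

    int main() {
        const int P = 8;
        double best = 1e30;
        int best_p1 = 0;
        for (int p1 = 1; p1 < P; ++p1) {       // side-by-side execution:
            int p2 = P - p1;                   // makespan = slower task
            double t = std::max(task_time(1.0, 40.0, p1),
                                task_time(2.0, 10.0, p2));
            if (t < best) { best = t; best_p1 = p1; }
        }
        std::cout << "task 1 gets " << best_p1 << " of " << P
                  << " processors, makespan " << best << '\n';
    }
    ```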

  • Article (Open Access)

    Impact of Design Decisions on Performance of Embarrassingly Parallel .NET Database Application

    The implementation of parallel applications is always a challenge, as it involves many distinct design decisions. This paper examines parallel processing with .NET applications and popular Database Management Systems (DBMSes). Four design dilemmas are addressed: how efficient the auto-parallelism implemented in the .NET TPL library is, how popular DBMSes differ in serving parallel requests, what the optimal size of data chunks is in the data-parallelism scheme, and how TPL auto-parallelism behaves in public clouds. These are analyzed in the context of a typical, practical business case originating from IT solutions dedicated to energy-market participants. The paper presents the results of experiments conducted in controlled on-premises and cloud environments. The experiments allowed us to compare the performance of TPL auto-parallelism with a wide range of manually set numbers of worker threads. They also helped to evaluate four DBMSes, Oracle, MySQL, PostgreSQL, and MSSQL, in the scenario of serving parallel queries. Finally, they showed the impact of data chunk sizes on the overall performance.