Non-uniform memory access (NUMA) architectures are modern shared-memory, multi-core machines that offer varying access latencies between different memory banks. Cores are organised into regions (NUMA nodes), and cores within the same region share the same local memory; this organisation poses challenges to efficient shared-memory access and thus limits the scalability of parallel applications. This paper studies the effect of state-of-the-art physical shared-memory NUMA architectures on the performance scalability of parallel applications, using a range of programs and language technologies. In particular, parallel programs with different communication libraries and patterns are evaluated in two sets of experiments. The first experiment examines the performance of the mainstream, widely used parallel technologies MPI and OpenMP, which utilise message passing and shared-memory communication respectively. In addition, the performance implications of message passing versus shared-memory access on NUMA are compared using a concordance application. The second experiment assesses the performance of two parallel Haskell implementations as examples of a high-level language with automatic memory management. The results revealed that OpenMP scaled well for up to six threads, i.e. threads allocated within a single NUMA node; beyond that point, however, performance decreased dramatically as the number of threads grew, confirming the cost of inefficient remote memory access. MPI exhibited similar behaviour, reaching its optimum speedup at six cores, but, unlike OpenMP, its performance did not drop sharply beyond that point, illustrating the benefits of message passing over shared-memory access. The standard shared-memory parallel Haskell implementation scaled only to between 10 and 25 cores out of 48 across three parallel programs, with high memory management overheads. In contrast, our parallel Haskell implementation, GUMSMP, which combines distributed- and shared-heap abstractions, scaled consistently, achieving a speedup of up to 24 on 48 cores and an overall performance improvement of up to 57% compared with the shared-memory implementation.
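To make the Haskell side of the comparison concrete, the following is a minimal sketch (our illustration, not code from the paper) of the style of shared-memory parallel Haskell program evaluated in such experiments: `parMap` with `rdeepseq` from the standard Strategies library sparks one evaluation per list element, and the same source can be compiled for GHC's threaded runtime (and, in principle, run unchanged on a GUM-style runtime such as GUMSMP). The `fib` workload is a placeholder chosen only to give each spark CPU-bound work.

```haskell
-- Minimal sketch of a shared-memory parallel Haskell program (illustrative only).
-- Compile with: ghc -threaded -O2; run with +RTS -Nk to select k capabilities.
import Control.Parallel.Strategies (parMap, rdeepseq)

-- A deliberately naive, CPU-bound function so that each spark has real work.
fib :: Int -> Integer
fib n
  | n < 2     = fromIntegral n
  | otherwise = fib (n - 1) + fib (n - 2)

main :: IO ()
main = do
  -- One spark per list element; the runtime schedules them across
  -- the available capabilities (and, on NUMA machines, across nodes).
  let results = parMap rdeepseq fib [28 .. 40]
  print (sum results)
```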