In symbolic computation, polynomial multiplication is a fundamental operation akin to matrix multiplication in numerical computation. We present efficient implementation strategies for FFT-based dense polynomial multiplication targeting multi-cores. We show that balanced input data can maximize parallel speedup and minimize cache complexity for bivariate multiplication. However, unbalanced input data, which are common in symbolic computation, are challenging. We provide efficient techniques, which we call contraction and extension, to reduce multivariate (and univariate) multiplication to balanced bivariate multiplication. Our implementation in Cilk++ demonstrates good speedup on multi-cores.
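The FFT-based multiplication these strategies build on can be illustrated with a minimal sequential, univariate sketch; the `fft` and `poly_mul` names are ours, and this is not the paper's Cilk++ implementation:

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def poly_mul(p, q):
    """Multiply integer polynomials given as coefficient lists
    (lowest degree first) by pointwise product in the FFT domain."""
    n = 1
    while n < len(p) + len(q) - 1:
        n *= 2
    fa = fft([complex(x) for x in p] + [0j] * (n - len(p)))
    fb = fft([complex(x) for x in q] + [0j] * (n - len(q)))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    # divide by n to complete the inverse transform; round to integers
    return [round((x / n).real) for x in prod[:len(p) + len(q) - 1]]
```

For example, `poly_mul([1, 2], [3, 4])` computes (1 + 2x)(3 + 4x) = 3 + 10x + 8x².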
In this work we present an initial performance evaluation of Intel's latest, second-generation quad-core processor, Nehalem, and provide a comparison to the first-generation AMD and Intel quad-core processors, Barcelona and Tigerton. Nehalem is the first Intel processor to implement a NUMA architecture incorporating QuickPath Interconnect for interconnecting processors within a node, and the first to incorporate an integrated memory controller. We evaluate the suitability of these processors in quad-socket compute nodes as building blocks for large-scale scientific computing clusters. Our analysis of intra-processor and intra-node scalability of microbenchmarks, and a range of large-scale scientific applications, indicates that quad-core processors can deliver an improvement in performance of up to 4x over a single core, depending on the workload being processed. However, scalability can be lower when considering a full node. We show that Nehalem outperforms Barcelona on memory-intensive codes by a factor of two when comparing a Nehalem node with 8 cores to a Barcelona node containing 16 cores. Further optimizations are possible with Nehalem, including the use of Simultaneous Multithreading, which improves the performance of some applications by up to 50%.
Processor and system architectures that feature multiple memory controllers and/or ccNUMA characteristics are prone to show bottlenecks and erratic performance numbers on scientific codes. Although cache thrashing, aliasing conflicts, and ccNUMA locality and contention problems are well known for many types of systems, they take on peculiar forms on the new Sun UltraSPARC T2 and T2+ processors, which we use here as prototypical multi-core designs. We analyze performance patterns in low-level and application benchmarks and put some emphasis on a comparison of performance features between T2 and its successor. Furthermore we show ways to circumvent bottlenecks by careful data layout, placement and padding.
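One common padding trick of the kind alluded to here is choosing an array leading dimension that avoids large power-of-two strides, since such strides can map successive rows onto the same cache sets or memory banks. A minimal sketch, with a hypothetical conflict stride of 512 elements:

```python
def padded_leading_dim(n, pad=8):
    """Return a leading dimension >= n that is not a multiple of 512
    elements (a hypothetical conflict stride), so successive rows of a
    2D array do not map to the same cache sets or memory banks."""
    ld = n
    while ld % 512 == 0:
        ld += pad
    return ld

def alloc_2d(rows, cols):
    """Allocate a rows x cols matrix with a padded leading dimension."""
    ld = padded_leading_dim(cols)
    return [[0.0] * ld for _ in range(rows)], ld
```

For a 1024-column matrix this yields a leading dimension of 1032, trading a few wasted elements per row for conflict-free strided access.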
On-line Analytical Processing (OLAP) has become one of the most powerful and prominent technologies for knowledge discovery in VLDB (Very Large Database) environments. Central to the OLAP paradigm is the data cube, a multi-dimensional hierarchy of aggregate values that provides a rich analytical model for decision support. Various sequential algorithms for the efficient generation of the data cube have appeared in the literature. However, given the size of contemporary data warehousing repositories, multi-processor solutions are crucial for the massive computational demands of current and future OLAP systems.
In this paper we discuss the development of MCMD-CUBE, a new parallel data cube construction method for multi-core processors with multiple disks. We present experimental results for a Sandy Bridge multi-core processor with four parallel disks. Our experiments indicate that MCMD-CUBE achieves very close to linear speedup. A critical part of our MCMD-CUBE method is parallel sorting. We developed a new parallel sorting method, termed MCMD-SORT, for multi-core processors with multiple disks, which outperforms previous methods.
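As a generic illustration of the chunk-sort-and-merge pattern underlying multi-core sorting (not the MCMD-SORT algorithm itself, whose details are not given in this abstract), one might sketch:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_sort(data, workers=4):
    """Sort by splitting the input into chunks, sorting the chunks
    concurrently, then k-way merging the sorted runs, as in an
    external multi-way merge sort."""
    if not data:
        return []
    chunk = -(-len(data) // workers)  # ceiling division
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(sorted, parts))  # sort runs in parallel
    return list(heapq.merge(*runs))           # merge sorted runs
```

With multiple disks, each run would be read from and written back to its own disk so that the sort and merge phases overlap I/O across devices.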
Non-uniform memory access (NUMA) architectures are modern shared-memory, multi-core machines with varying access latencies between different memory banks. The organisation of cores into regions, where cores in the same region share the same memory, poses challenges to efficient shared-memory access, negatively affecting the scalability of parallel applications. This paper studies the effect of state-of-the-art physical shared-memory NUMA architectures on the performance scalability of parallel applications using a range of programs and various language technologies. In particular, different parallel programs are used with different communication libraries and patterns in two sets of experiments. The first experiment examines the performance of the mainstream, widely used parallel technologies MPI and OpenMP, which use message-passing and shared-memory communication patterns respectively. In addition, the performance implications of message passing versus shared-memory access on NUMA are compared using a concordance application. The second experiment assesses the performance of two parallel Haskell implementations as examples of a high-level language with automatic memory management. The results revealed that OpenMP scaled well up to six threads, the point at which all threads are allocated within a single NUMA node; as the number of threads increased beyond this, performance decreased dramatically, confirming the effect of inefficient memory access. MPI demonstrated similar behaviour, with its optimum speedup at six cores. However, unlike OpenMP, performance did not decrease sharply beyond that point, illustrating the benefits of message passing over shared-memory access. The standard shared-memory parallel Haskell implementation scaled only to between 10 and 25 of the 48 cores across three parallel programs, with high memory management overheads.
By contrast, our parallel Haskell implementation, GUMSMP, which combines distributed and shared-heap abstractions, scaled consistently, achieving a speedup of up to 24 on 48 cores and an overall performance improvement of up to 57% compared with the shared-memory implementation.
Increasing network traffic demands higher processing performance from network devices. Single-threaded routing table lookup cannot exploit the parallel processing capability of general-purpose multi-core processors, and current application-layer parallelisations of routing table lookup incur high overhead from data copying, network-stack traversal, and shared-memory access conflicts. The Network Processor Development Kit (NPDK) is a development environment for the network processing platform. In this paper, we propose a routing table lookup mechanism named Multicopy Parallel Processing (MCPP), provided by NPDK, for multi-core parallel processing. Based on the multi-core parallel processing architecture of NPDK, the overhead of packet copying is eliminated, and the multi-copy technique efficiently avoids routing table access conflicts. Moreover, to reduce the number of memory accesses, we implement a lookup algorithm named Configurable Two-Level Operation (CTLO). Lastly, we compare the forwarding performance of the MCPP mechanism and the traditional mechanism on an NPE platform based on the general-purpose multi-core processor FT1000A. The evaluation results show that packet forwarding based on the MCPP mechanism outperforms that based on the traditional mechanism.
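The CTLO algorithm itself is not detailed in this abstract, but a generic two-level longest-prefix-match table, in the spirit of designs such as DIR-24-8 (split here as 16 + 16 bits), can be sketched as follows; all names are ours:

```python
def ip(s):
    """Parse a dotted-quad IPv4 address into a 32-bit integer."""
    a, b, c, d = map(int, s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

class TwoLevelTable:
    """Two-level longest-prefix-match table sketch. Prefixes of length
    <= 16 are expanded into a directly indexed first-level array on the
    top 16 address bits (one memory access); longer prefixes live in
    small per-slot second-level lists (a second access). For simplicity,
    routes must be added in order of increasing prefix length."""

    def __init__(self):
        self.l1 = [None] * (1 << 16)
        self.l2 = {}

    def add(self, prefix, plen, hop):
        top = prefix >> 16
        if plen <= 16:
            # expand the prefix over every first-level slot it covers
            for i in range(top, top + (1 << (16 - plen))):
                self.l1[i] = hop
        else:
            self.l2.setdefault(top, []).append((plen, prefix, hop))

    def lookup(self, addr):
        top = addr >> 16
        # check longer prefixes first so the longest match wins
        for plen, prefix, hop in sorted(self.l2.get(top, []), reverse=True):
            if addr >> (32 - plen) == prefix >> (32 - plen):
                return hop
        return self.l1[top]
```

A per-core copy of such a table, as in the multi-copy approach described above, removes lock contention on lookups at the cost of duplicating the table in memory.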
Big integer multiplication plays a vital role in cryptography, bioinformatics, genetic engineering, and related fields, and accelerating it has become one of the main objectives of algorithm research in these areas. In this paper, we store big integers in static arrays and design parallel big integer multiplication algorithms, based on the Block Wiedemann algorithm, to improve computational efficiency. Using 32 threads on a Graphics Processing Unit to multiply a 1024-bit integer by a 32-bit integer, this implementation achieves a speedup of 7.44x over a big integer multiplication implementation on a single CPU.
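The core limb-by-word step of such a multiplication (a big integer stored as a fixed array of 32-bit limbs, multiplied by a 32-bit word) can be sketched sequentially; this is a generic illustration of the array layout, not the paper's GPU kernel:

```python
BASE = 1 << 32  # each array element is one 32-bit limb

def mul_by_word(limbs, w):
    """Multiply a big integer, stored little-endian as 32-bit limbs,
    by a 32-bit word w. Each limb product is split into a low word
    kept in the output array and a carry propagated to the next limb."""
    out = []
    carry = 0
    for limb in limbs:
        t = limb * w + carry
        out.append(t & (BASE - 1))
        carry = t >> 32
    while carry:
        out.append(carry & (BASE - 1))
        carry >>= 32
    return out

def to_limbs(n):
    """Split a Python int into little-endian 32-bit limbs."""
    limbs = []
    while n:
        limbs.append(n & (BASE - 1))
        n >>= 32
    return limbs or [0]

def to_int(limbs):
    """Reassemble little-endian 32-bit limbs into a Python int."""
    return sum(l << (32 * i) for i, l in enumerate(limbs))
```

On a GPU, each of the 32 limbs of a 1024-bit operand could be handled by its own thread, with the carry propagation becoming the synchronization-sensitive part of the kernel.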
With the rapid development of processors, multi-core processors have come to dominate the market due to their superior performance and power characteristics. Threads run concurrently on a multi-core processor and share the processor's second-level (L2) cache. Cache allocation and sharing are critical to the effective utilization of multi-core processors. Unfair CPU cache allocation can easily delay critical tasks and cause serious problems, such as thread starvation, priority inversion, and inadequate CPU accounting. To solve these problems, this paper proposes a compensation policy scheduling algorithm that decreases system variability and improves system performance effectively. The experimental results show that system variability under the compensation policy scheduling algorithm is lower than under conventional scheduling algorithms. Consequently, the compensation policy scheduling algorithm effectively improves the stability of the system and its applications on multi-core processors.
Computing approximations in the Dominance-based Rough Sets Approach (DRSA) is a necessary step for multi-criteria decision analysis and related work. This paper presents a parallel approach for computing DRSA approximations; its feasibility is validated by a numerical example.
To improve multi-beam bathymetry data processing speed, we propose a parallel filtering algorithm based on multi-thread technology. The algorithm consists of two parts. The first is a parallel data re-ordering step, in which the survey area is divided into a regular grid and the discrete bathymetry data are assigned to grid cells in parallel. The second is a parallel filtering step, which divides the grid into blocks and executes the filtering process in each block in parallel. In our experiment, the speedup of the proposed algorithm reaches about 3.67 on an 8-core computer. The results show that the method improves computing efficiency significantly compared to the traditional algorithm.
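The two-part scheme described, binning soundings into a grid and then filtering blocks of cells in parallel, might be sketched as follows. The median filter and thread pool are our assumptions; the abstract does not specify the actual filter used:

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median

def grid_filter(points, cell, block, workers=4):
    """Bin (x, y, depth) soundings into a regular grid of cell size
    `cell`, group cells into block x block tiles, and filter each tile
    in parallel, here reducing every cell to its median depth."""
    # part 1: re-order discrete soundings into grid cells
    grid = {}
    for x, y, d in points:
        grid.setdefault((int(x // cell), int(y // cell)), []).append(d)

    # group cells into independent blocks for parallel filtering
    blocks = {}
    for (i, j), depths in grid.items():
        blocks.setdefault((i // block, j // block), []).append(((i, j), depths))

    # part 2: filter each block concurrently
    def filter_block(cells):
        return {c: median(d) for c, d in cells}

    result = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(filter_block, blocks.values()):
            result.update(part)
    return result
```

Because each block touches a disjoint set of cells, the filtering step needs no locking, which is what makes the block decomposition scale.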