

  • article

    Seismic Design and Performance Evaluation of Self-Centering Hybrid Rocking Braced Steel Frames

    An innovative self-centering hybrid rocking braced steel frame (SCHRBF) with separate braced span and rocking span is proposed for improving seismic resilience. The braced span utilizes buckling-restrained braces (BRBs) to provide energy-dissipating capacity, and the rocking span consists of stiff rocking cores and self-centering braces (SCBs) to achieve a uniform inter-story drift distribution and low post-earthquake residual displacement under a strong earthquake. This study first describes the basic composition and nonlinear mechanical behavior of this novel system. Then, a force-based seismic design procedure for the SCHRBF system is proposed, including the determination and allocation of base shear, design of BRBs in the braced span, design of SCBs in the rocking span, and design of rocking core members. The influence of key design parameters on the seismic responses of the system is then explored through parametric analysis, and recommended values of the design parameters are provided according to the analysis results. Although the properly designed structure has significant partial re-centering behavior, its peak inter-story drifts, residual inter-story drifts, and deformation patterns can be effectively controlled under strong earthquakes. Finally, the superiority of the SCHRBF in controlling seismic displacement responses is verified by comparison with other structural systems.

  • article

    A NEW FAULT-TOLERANT ROUTING ALGORITHM FOR k-ARY n-CUBE NETWORKS

    This paper describes a new fault-tolerant routing algorithm for k-ary n-cubes using the concept of "probability vectors". To compute these vectors, a node first determines its faulty set, which represents the set of all its neighbouring nodes that are faulty or unreachable due to faulty links. Each node then calculates a probability vector, where the lth element represents the probability that a destination node at distance l cannot be reached through a minimal path due to a faulty node or link. The probability vectors are used by all the nodes to achieve efficient fault-tolerant routing in the network. An extensive performance analysis conducted in this study reveals that the proposed algorithm exhibits good fault-tolerance properties in terms of the achieved percentage of reachability and routing distances.
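
    The first step described above, a node determining its faulty set, can be sketched as follows. This is only an illustration under stated assumptions: nodes are represented as coordinate tuples, faulty links as frozensets of their two endpoints, and the function names are hypothetical. The abstract does not give the formula for the probability vectors, so that step is deliberately not reproduced here.

```python
def neighbours(node, k):
    """Return the neighbours of `node` in a k-ary n-cube.

    `node` is a tuple of n coordinates, each in range(k). Two nodes are
    neighbours when they differ by +1 or -1 (mod k) in exactly one dimension.
    """
    result = set()
    for dim in range(len(node)):
        for step in (-1, 1):
            coords = list(node)
            coords[dim] = (coords[dim] + step) % k
            neighbour = tuple(coords)
            if neighbour != node:  # k == 1 would wrap onto the node itself
                result.add(neighbour)
    return result


def faulty_set(node, k, faulty_nodes, faulty_links):
    """Neighbours of `node` that are faulty or sit behind a faulty link.

    `faulty_nodes` is a set of coordinate tuples; `faulty_links` is a set of
    frozensets {u, v} naming the two endpoints of each broken link
    (a representation chosen for this sketch).
    """
    return {
        nb
        for nb in neighbours(node, k)
        if nb in faulty_nodes or frozenset((node, nb)) in faulty_links
    }
```

    For example, in a 4-ary 2-cube, node (0, 0) has neighbours (1, 0), (3, 0), (0, 1) and (0, 3); marking node (1, 0) as faulty and the link to (0, 3) as broken puts both of them in the faulty set.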

  • article

    MODELING METHODOLOGY FOR PERFORMANCE EVALUATION OF PARALLEL ARCHITECTURES: A CASE STUDY, LCAP

    This paper presents a modeling methodology devoted to the performance evaluation of parallel architectures. The methodology is based on the decomposition of the modeling process into seven stages. In each stage, specific techniques appropriate to parallel architectures are applied, such as aggregation methods, a thorough distinction between the architecture model and the application program model, and the constitution of program classes using data analysis techniques. The methodology is then illustrated through a case study, the loosely Coupled Array of Processors (lCAP) system, designed at IBM Kingston. The lCAP model makes it possible to predict the performance of the lCAP system in terms of response time, resource utilization, and waiting times, and to investigate many alternatives with regard to the system configuration (e.g., number of system components, component interconnection scheme, component characteristics) or to the parallel program structure (e.g., parallel task granularity, load imbalance).

  • article

    EFFICIENT IMAGE PROCESSING APPLICATIONS ON THE MasPar MASSIVELY PARALLEL COMPUTERS

    Image processing applications are suitable candidates for parallelism and have at least in part motivated the design and development of some of the pioneering massively parallel processing systems, including the CLIP family, the DAP, the MPP and the GAPP. In this paper, we describe the implementation of various image processing algorithms on the MasPar massively parallel computer system. The suitability of the MasPar for running image processing algorithms is demonstrated by parallelizing the algorithms using known successful techniques, by developing new techniques suitable for the MasPar architecture, or both. We quantitatively evaluate the performance of the MasPar in solving these problems. Then, we compare its performance to various related massively parallel architectures. It is shown that the MasPar system compares favorably to these architectures, and is able to execute many fundamental image processing applications in a time amenable to real-time processing. Thus, the MasPar seems to be a promising architecture for massively parallel real-time image processing applications.

  • article

    GROUPING MEMORY CONSISTENCY MODEL FOR PARALLEL-MULTITHREADED SHARED-MEMORY MULTIPROCESSOR SYSTEMS

    In this paper, we propose a hardware-centric memory consistency model particularly for shared-memory multiprocessors with parallel-multithreaded processing elements. According to the behavior of critical sections and the features of parallel-multithreaded processors, we extend the release consistency model to a more relaxed memory model. A release reference at the end of a critical section can be executed locally regardless of whether all of its previous ordinary references have been performed. The requirement is that another thread on the same processor is waiting for the lock to be freed. Two new instructions and two additional macros are needed to properly label a program for our proposed model. Moreover, we use a table per processing element to determine whether any threads are waiting for a specific lock. We have used five benchmark programs in the SPLASH suite to evaluate the performance gain of the new model. According to the simulation results, our proposed model outperforms the release consistency model by up to 25%.

  • article

    DYNAMIC LOAD ASSIGNMENT OF REAL-TIME TASKS IN DISTRIBUTED MEMORY MULTIPROCESSORS

    In this paper, we consider a scalable distributed-memory architecture for which we propose a problem representation that assigns real-time tasks to the processing units of the architecture to maximize the deadline compliance rate. Based on the selected problem representation, we derive an algorithm that dynamically schedules real-time tasks on the processors of the distributed architecture. The algorithm uses a formula to generate an adequate scheduling time so that deadline loss due to scheduling overhead is minimized while the deadline compliance rate is maximized. The technique we propose proved to be correct in the sense that the delivered solutions are not obsolete, i.e., the tasks assigned to working processors are guaranteed to meet their deadlines once executed. The correctness criterion is obtained based on our technique to control the scheduling time. To evaluate the performance of the algorithms that we propose, we provide a number of experiments through a simulation study. We also propose an implementation of our algorithms in the context of scheduling real-time transactions on an Intel-Paragon distributed-memory multiprocessor. The results of the conducted experiments show interesting performance trade-offs among the candidate algorithms.

  • article

    CHARACTERISTICS OF DETERMINISTIC OPTIMAL ROUTING FOR TWO HETEROGENEOUS PARALLEL SERVERS

    This paper presents characteristics of optimal routing that assigns each arriving packet to one of two heterogeneous parallel servers, each with its own queue. The characteristics are derived from numerical solutions to an optimization problem, which is to find the optimal routing that minimizes the average packet delay under the condition that all of the packets' arrival times as well as all of the packets' sizes are completely known in advance. There are four characteristics: (1) Under light or moderate traffic, the average packet delay of optimal routing is almost the same as that of the join the shortest delay (JSD) policy. (2) Under heavier traffic, optimal routing more often uses the fix queue based on size (FS) policy. (3) Under heavy traffic, optimal routing assigns small packets to the slower server. (4) As the ratio of the slower server's service rate to the faster server's service rate decreases, optimal routing more often uses the FS policy under light or moderate traffic. These characteristics are verified by the fact that a mimic optimal routing designed based on the four characteristics attains almost the same performance as optimal routing.
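
    The JSD baseline mentioned in characteristic (1) can be illustrated with a small simulation. The sketch below assumes that JSD sends each packet to the server on which that packet would finish soonest, that service time is packet size divided by service rate, and that each queue is first-come, first-served; these modelling details and the function name are assumptions, not taken from the paper. It does not compute the optimal routing.

```python
def jsd_route(arrivals, sizes, rates):
    """Route packets greedily with a join-the-shortest-delay rule.

    arrivals: non-decreasing arrival times, one per packet
    sizes:    packet sizes, one per packet
    rates:    service rate of each server (size units per time unit)

    Returns (assignment, average_delay), where assignment[i] is the index of
    the server chosen for packet i and delay is finish time minus arrival.
    """
    free_at = [0.0] * len(rates)  # time at which each server's queue drains
    assignment = []
    total_delay = 0.0
    for t, size in zip(arrivals, sizes):
        # Finish time of this packet on each server if it joined that queue.
        finish = [max(free_at[i], t) + size / rates[i] for i in range(len(rates))]
        # Pick the server with the earliest finish; ties go to the lowest index.
        server = min(range(len(rates)), key=finish.__getitem__)
        free_at[server] = finish[server]
        assignment.append(server)
        total_delay += finish[server] - t
    if not assignment:
        return [], 0.0
    return assignment, total_delay / len(assignment)
```

    With three packets of size 2 arriving at time 0 and service rates 2.0 and 1.0, this rule sends the first two packets to the faster server and the third to the slower one, for an average delay of 5/3 time units.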

  • article

    ON THE BEHAVIOR OF PARALLEL GENETIC ALGORITHMS FOR OPTIMAL PLACEMENT OF ANTENNAE IN TELECOMMUNICATIONS

    In this article, evolutionary algorithms (EAs) are applied to solve the radio network design problem (RND). The task is to find the best set of transmitter locations in order to cover a given geographical region at an optimal cost. Usually, parallel EAs are needed to cope with the high computational requirements of such a problem. Here, we develop and evaluate a set of sequential and parallel genetic algorithms (GAs) to solve the RND problem efficiently. The results show that our distributed steady state GA is an efficient and accurate tool for solving RND that even outperforms existing parallel solutions. The sequential algorithm performs very efficiently from a numerical point of view, although the distributed version is much faster.

  • article

    WORST-CASE PERFORMANCE EVALUATION ON MULTIPROCESSOR TASK SCHEDULING WITH RESOURCE AUGMENTATION

    We study the worst-case performance of approximation algorithms for the problem of multiprocessor task scheduling on m identical processors with resource augmentation, whose objective is to minimize the makespan. In this case, the approximation algorithms are given k (k ≥ 0) more processors than the optimal off-line algorithm. For on-line algorithms, the Greedy algorithm and shelf algorithms are studied. For off-line algorithms, we consider the LPT (longest processing time) algorithm. In particular, we prove that the schedule produced by the LPT algorithm is no longer than the optimal off-line schedule if and only if k ≥ m - 2.
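
    The LPT rule itself is easy to state in code. The sketch below covers only the classic setting in which every task occupies exactly one processor, which is an assumption; the paper's task model may be more general. Resource augmentation is expressed by calling the hypothetical function `lpt_makespan` with m + k processors.

```python
import heapq


def lpt_makespan(times, processors):
    """Makespan of the longest-processing-time (LPT) list schedule.

    Tasks are sorted by decreasing processing time and each one is placed on
    the processor that currently has the smallest load.
    """
    loads = [0] * processors  # min-heap of current processor loads
    heapq.heapify(loads)
    for t in sorted(times, reverse=True):
        least = heapq.heappop(loads)       # least-loaded processor
        heapq.heappush(loads, least + t)   # assign the task to it
    return max(loads)
```

    For the task set 5, 5, 4, 4, 3, 3, 3 on m = 3 processors, LPT yields a makespan of 11, while an optimal schedule reaches 9 (5+4, 5+4, 3+3+3). Giving LPT k = m - 2 = 1 extra processor brings its makespan down to 8. This single instance is consistent with the stated condition but of course does not demonstrate it.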

  • article

    Complexity of Approximating Functions on Real-Life Computers

    We address the problem of estimating the computation time necessary to approximate a function on a real computer. Our approach makes it possible to estimate the minimal time needed to compute a function up to a specified level of error. This can be explained by the following example. Consider the space of functions defined on [0,1] whose absolute value and first derivative are bounded by C. In a certain sense, for almost every such function, approximating it on its domain using an Intel x86 computer, with an error not greater than ε, takes at least k(C, ε) seconds. Here we show how to find k(C, ε).

  • article

    A DYNAMIC SCHEDULING COMMUNICATION PROTOCOL AND ITS ANALYSIS FOR HYPERCUBE NETWORKS

    We propose a new protocol for one-to-one communication in multiprocessor networks, which we call the Dynamic Scheduling Communication (or DSC) protocol. In the DSC protocol, the capacity of a link is partitioned into two channels: a data channel, used to transmit packets, and a control channel used to make reservations. We initially describe the DSC protocol and the data structures needed to implement it for a general network topology. We then analyze the steady-state throughput of the DSC protocol for random node-to-node communication in a hypercube topology. The analytical results obtained are in very close agreement with corresponding simulation results. For the hypercube topology, and under the same set of assumptions on the node architecture and the routing algorithm used, the DSC protocol is found to achieve higher throughput than packet switching, provided that the size of the network is sufficiently large. We also investigate the relationship between the achievable throughput and the fraction of network capacity dedicated to the control channel, and present a method to select this fraction so as to optimize throughput.

  • article

    COUPLED DIPOLE SIMULATIONS OF ELASTIC LIGHT SCATTERING ON PARALLEL SYSTEMS

    The Coupled Dipole method is used to simulate Elastic Light Scattering from arbitrarily shaped particles. To simulate relatively large particles, such as human white blood cells, approximately 10⁵ to 10⁶ dipoles are required. In order to carry out such simulations, very powerful computers are necessary. We have designed a parallel version of the Coupled Dipole method, and have implemented it on a distributed memory parallel computer, a Parsytec PowerXplorer, containing 32 PowerPC-601 processors. The efficiency of the parallel implementation is investigated for simulations of model particles. Scattering by a sphere, modelled with 33552 dipoles, is simulated and compared with analytical Mie theory. Finally, the suitability of the Coupled Dipole method to simulate Elastic Light Scattering from larger particles, such as white blood cells, is discussed.

  • article

    THE IMPACT OF DYNAMIC CHANNELS ON FUNCTIONAL TOPOLOGY SKELETONS

    Parallel functional programs with implicit communication often generate purely hierarchical communication topologies during execution: communication only happens between parent and child processes. Hence, messages between siblings must be passed via the parent causing inefficiencies that can be avoided by direct communication between arbitrary processes. The Eden parallel functional language provides dynamic channels to implement arbitrary communication topologies. This paper analyses the impact of dynamic channels on Eden's topology skeletons, i.e. skeletons which define process topologies such as rings, toroids, or hypercubes. We compare topology skeletons with and without dynamic channels with respect to runtime and communication. Our case studies confirm that dynamic channels decrease the number of messages by up to 50% and substantially reduce runtime. Detailed analyses of EdenTV (Eden trace viewer) execution profiles reveal a bottleneck in the root process when only hierarchical channel connections are used and a better distribution of communications with dynamic channels.

  • article

    Performance Evaluation of Multi-Core Intel Xeon Processors on Basic Linear Algebra Subprograms

    Multi-core technology is a natural next step in delivering the benefits of Moore's law to computing platforms. On multi-core processors, the performance of many applications can be improved by processing threads of code in parallel using multi-threading techniques. This paper evaluates the performance of multi-core Intel Xeon processors on the widely used basic linear algebra subprograms (BLAS). On two dual-core Intel Xeon processors with Hyper-Threading technology, our results show that a performance of around 20 GFLOPS is achieved on Level-3 (matrix-matrix operations) BLAS using multi-threading, SIMD, matrix blocking, and loop unrolling techniques. However, for small problem sizes of Level-2 (matrix-vector operations) and Level-1 (vector operations) BLAS, multi-threading slows down execution because of thread creation overheads. Thus, using the Intel SIMD instruction set is the way to improve the performance of single-threaded Level-2 (6 GFLOPS) and Level-1 BLAS (3 GFLOPS). When the problem size becomes large (cannot fit in the L2 cache), the performance of the four Xeon cores is less than 2 and 1 GFLOPS on Level-2 and Level-1 BLAS, respectively, even though eight threads are executed in parallel on eight logical processors.
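
    One common way to reason about the gap between the three BLAS levels is arithmetic intensity: floating-point operations per data element moved to or from memory. The sketch below uses idealized textbook counts for square problems of size n, assuming each operand is moved exactly once; the function name and the counting model are assumptions for illustration and are not the paper's measurements.

```python
def blas_intensity(n):
    """Idealized flops per element moved for one routine of each BLAS level.

    Counts assume square operands of dimension n and that every operand is
    read (and the result written) exactly once.
    """
    # Level 1, axpy: y = a*x + y -> 2n flops; read x and y, write y.
    axpy = (2 * n) / (3 * n)
    # Level 2, gemv: y = A*x + y -> 2n^2 flops; read A, x, y, write y.
    gemv = (2 * n * n) / (n * n + 3 * n)
    # Level 3, gemm: C = A*B + C -> 2n^3 flops; read A, B, C, write C.
    gemm = (2 * n ** 3) / (4 * n * n)
    return {"level1_axpy": axpy, "level2_gemv": gemv, "level3_gemm": gemm}
```

    Under this model the Level-1 and Level-2 intensities stay below 2 flops per element no matter how large n is, whereas the Level-3 intensity grows as n/2. That is one plausible reading of why large Level-1 and Level-2 problems that no longer fit in cache are limited by memory bandwidth rather than by the number of threads.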

  • article

    Exploiting ILP, TLP, and DLP to Improve Multi-Core Performance of One-Sided Jacobi SVD

    This paper shows how the performance of singular value decomposition (SVD) is enhanced through the exploitation of ILP, TLP, and DLP on Intel multi-core processors using superscalar execution, multi-threading computation, and streaming SIMD extensions, respectively. To facilitate the exploitation of TLP on multiple execution cores, the well-known cyclic one-sided Jacobi algorithm is restructured to work in parallel. On two dual-core Intel Xeon processors with hyper-threading technology running at 3.0 GHz, our results show that the multi-threaded implementation of one-sided Jacobi SVD is about four times faster than the single-threaded superscalar implementation. Furthermore, the multi-threaded SIMD implementation speeds up the execution of single-threaded one-sided Jacobi by a factor of 10, which is close to the ideal speedup. On a reasonably large matrix that fits in the L2 cache, our results show that a performance of 11 GFLOPS (double-precision) is achieved on the target system through the exploitation of ILP, TLP, and DLP as well as the memory hierarchy.
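
    For orientation, a minimal sequential version of the cyclic one-sided Jacobi method is sketched below. It is the textbook single-threaded formulation, not the restructured parallel algorithm of the paper, and it returns only the singular values. The function name, tolerance, and sweep limit are assumptions chosen for the sketch.

```python
import math


def one_sided_jacobi_singular_values(a, tol=1e-12, max_sweeps=60):
    """Singular values of matrix `a` (list of rows) via cyclic one-sided Jacobi.

    Column pairs are rotated until all columns are mutually orthogonal; the
    singular values are then the Euclidean norms of the columns.
    """
    m, n = len(a), len(a[0])
    cols = [[float(a[i][j]) for i in range(m)] for j in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(x * x for x in cols[p])
                beta = sum(y * y for y in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))
                # Skip pairs that are already orthogonal to working precision.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                # Rotation angle that zeroes the inner product of the pair.
                zeta = (beta - alpha) / (2.0 * gamma)
                sign = 1.0 if zeta >= 0 else -1.0
                t = sign / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):
                    x, y = cols[p][i], cols[q][i]
                    cols[p][i] = c * x - s * y
                    cols[q][i] = s * x + c * y
        if not rotated:  # a full sweep without rotations means convergence
            break
    norms = (math.sqrt(sum(x * x for x in col)) for col in cols)
    return sorted(norms, reverse=True)
```

    Rotations on disjoint column pairs are independent of one another, which is the property that makes this method a natural candidate for the thread-level parallelism discussed in the abstract.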

  • article

    SYSTEMC IMPLEMENTATION AND PERFORMANCE EVALUATION OF A DECOUPLED GENERAL-PURPOSE MATRIX PROCESSOR

    Technological advances in IC manufacturing provide us with the capability to integrate more and more functionality into a single chip. Today's modern processors have nearly one billion transistors on a single chip. With the increasing complexity of today's systems, designs have to be modeled at a high level of abstraction before being partitioned into hardware and software components for final implementation. This paper explains in detail the implementation and performance evaluation of a matrix processor called Mat-Core with SystemC (a system-level modeling language). Mat-Core is a research processor that aims to exploit the increasing number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit. To hide memory latency, the extended matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. Like vector architectures, the data computation unit is organized in parallel lanes. However, on parallel lanes, Mat-Core can execute matrix-scalar, matrix-vector, and matrix-matrix instructions in addition to vector-scalar and vector-vector instructions. For controlling the execution of vector/matrix instructions on the matrix core, this paper extends the well-known scoreboard technique. Furthermore, the performance of Mat-Core is evaluated on vector and matrix kernels. Our results show that the performance of a four-lane Mat-Core with matrix registers of size 4 × 4 (16 elements each), a queue size of 10, a start-up time of 6 clock cycles, and a memory latency of 10 clock cycles is about 0.94, 1.3, 2.3, 1.6, 2.3, and 5.5 FLOPs per clock cycle, achieved on scalar-vector multiplication, SAXPY, Givens, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication, respectively.

  • article

    ON THE PERFORMANCE AND TECHNOLOGICAL IMPACT OF ADDING MEMORY CONTROLLERS IN MULTI-CORE PROCESSORS

    The increasing core count on current and future processors is posing critical challenges to the memory subsystem, which must efficiently handle concurrent memory requests. The current trend is to increase the number of memory channels available to the processor's memory controller. In this paper we investigate the advantages and disadvantages of this approach from both a technological and an application performance viewpoint. In particular, we explore the trade-off between employing multiple memory channels per memory controller and the use of multiple memory controllers with fewer memory channels. Experiments are conducted on two current state-of-the-art multi-core processors, a 6-core AMD Istanbul and a 4-core Intel Nehalem-EP, using the STREAM benchmark and a wide range of production applications. An analytical model of the STREAM performance is used to illustrate the diminishing return obtained when increasing the number of memory channels per memory controller, an effect that is also seen in the application performance. In addition, we show that this performance degradation can be efficiently addressed by increasing the ratio of memory controllers to channels while keeping the number of memory channels constant. Significant performance improvements, up to 28%, can be achieved in this scheme when using two memory controllers, each with one channel, compared with one controller with two memory channels.

  • article

    HIGH LATENCY AND CONTENTION ON SHARED L2-CACHE FOR MANY-CORE ARCHITECTURES

    Several studies point out the benefits of a shared L2 cache, but some other properties of shared caches must be considered to lead to a thorough understanding of all chip multiprocessor (CMP) bottlenecks. Our paper evaluates and explains shared cache bottlenecks, which are very important considering the rise of many-core processors. The results of our simulations with 32 cores show low performance when L2 cache memory is shared between 2 or 4 cores. In these two cases, the increase of L2 cache latency and contention are the main causes responsible for the increase of execution time.

  • article

    Extending PowerPack for Profiling and Analysis of High-Performance Accelerator-Based Systems

    Accelerators offer a substantial increase in efficiency for high-performance systems, offering speedups for computational applications that leverage hardware support for highly parallel codes. However, the power use of some accelerators exceeds 200 watts at idle, which means that use at exascale comes with a significant increase in power at a time when we face a power ceiling of about 20 megawatts. Despite the growing domination of accelerator-based systems in the Top500 and Green500 lists of fastest and most efficient supercomputers, there are few detailed studies comparing the power and energy use of common accelerators. In this work, we conduct detailed experimental studies of the power usage and distribution of Xeon-Phi-based systems in comparison to the NVIDIA Tesla and an Intel Sandy Bridge multicore host processor. In contrast to previous work, we focus on separating individual component power and correlating power use to code behavior. Our results help explain the causes of power-performance scalability for a set of HPC applications.

  • article

    Evaluating Multiple Streams on Heterogeneous Platforms

    Using multiple streams can improve the overall system performance by mitigating the data transfer overhead on heterogeneous systems. Prior work focuses largely on GPUs, but little is known about the performance impact on (Intel Xeon) Phi. In this work, we apply multiple streams to six real-world applications on Phi. We then systematically evaluate the performance benefits of using multiple streams. The evaluation work is performed at two levels: the microbenchmarking level and the real-world application level. Our experimental results at the microbenchmark level show that data transfers and kernel execution can be overlapped on Phi, while data transfers in both directions are performed in a serial manner. At the real-world application level, we show that both overlappable and non-overlappable applications can benefit from using multiple streams (with a performance improvement of up to 24%). We also quantify how task granularity and resource granularity impact the overall performance. Finally, we present a set of heuristics to reduce the search space when determining a proper task granularity and resource granularity. To conclude, our evaluation work provides many insights for runtime and architecture designers when using multiple streams on Phi.
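
    The benefit of overlapping data transfers with kernel execution can be illustrated with a toy timing model. The sketch below assumes n equal tasks, each needing one host-to-device transfer followed by one kernel, a single transfer link that handles one transfer at a time, and a single compute engine; it ignores device-to-host transfers and any stream management overhead. The function names and the model are assumptions, not the paper's methodology.

```python
def single_stream_time(n_tasks, transfer, kernel):
    """Total time when every transfer and kernel runs strictly in sequence."""
    return n_tasks * (transfer + kernel)


def multi_stream_time(n_tasks, transfer, kernel):
    """Total time when a task's transfer may overlap earlier tasks' kernels.

    Transfers are serialized on one link and kernels on one compute engine;
    a kernel starts once its own data has arrived and the engine is free.
    """
    transfer_free = 0.0  # time at which the link finishes its current transfer
    compute_free = 0.0   # time at which the compute engine becomes idle
    for _ in range(n_tasks):
        transfer_free += transfer
        compute_free = max(transfer_free, compute_free) + kernel
    return compute_free
```

    With four tasks, a transfer time of 1 and a kernel time of 2, the single-stream total is 12 time units and the overlapped total is 9. With a single task the two totals coincide, which hints at why task granularity matters when tuning the number of streams.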