A variety of algorithms have been proposed for sparse Cholesky factorization, including left-looking, right-looking, and supernodal algorithms. This article investigates shared-memory implementations of several variants of these algorithms in a task-oriented execution model with dynamic scheduling. In particular, we consider the degree of parallelism, the scalability, and the scheduling overhead of the different algorithms. Our emphasis is on parallel implementations for relatively large numbers of processors. As execution platform, we use the SB-PRAM, a shared-memory machine with up to 2048 processors. This article can be considered a case study that asks what performance we can hope to achieve for a typical irregular application on an ideal machine, one on which the locality of memory accesses can be ignored but on which the overhead of managing data structures still takes effect. The investigation shows that certain algorithms are the best choice for a small number of processors, while other algorithms are better for many processors.
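To make the task decomposition concrete, the sketch below shows a dense, tiled right-looking Cholesky factorization in which each tile operation (factor, triangular solve, trailing update) is the kind of unit a dynamic task scheduler would dispatch. It is an illustrative analogue only, not the sparse left-looking, right-looking, or supernodal kernels studied in the article, and all names are our own.

```python
import numpy as np

def tiled_cholesky(A, bs):
    """Right-looking tiled Cholesky factorization A = L L^T (L lower triangular).

    Within one step k, all panel solves are mutually independent tasks, and
    each trailing-matrix update depends only on two already-solved tiles.
    """
    n = A.shape[0]
    assert n % bs == 0, "matrix size must be a multiple of the tile size"
    nb = n // bs
    L = np.array(A, dtype=float)                        # work on a copy
    tile = lambda i, j: L[i * bs:(i + 1) * bs, j * bs:(j + 1) * bs]

    for k in range(nb):
        # Factor the diagonal tile (POTRF task).
        tile(k, k)[:] = np.linalg.cholesky(tile(k, k))
        # Panel solves (TRSM tasks): independent of each other.
        for i in range(k + 1, nb):
            tile(i, k)[:] = np.linalg.solve(tile(k, k), tile(i, k).T).T
        # Trailing updates (SYRK/GEMM tasks): tile (i, j) needs tiles (i, k) and (j, k).
        for i in range(k + 1, nb):
            for j in range(k + 1, i + 1):
                tile(i, j)[:] -= tile(i, k) @ tile(j, k).T
    return np.tril(L)

# Quick check on a random symmetric positive definite matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8 * np.eye(8)
L = tiled_cholesky(A, bs=2)
assert np.allclose(L @ L.T, A)
```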
This paper explores the reasons why the quantum paradigm does not extend easily to all classical computational algorithms. We also explain the failure of programmability and scalability in quantum speed-up. Due to the presence of quantum entropy, quantum algorithms cannot obviate the curse of dimensionality encountered in solving many complex numerical and optimization problems. Finally, the stringent condition that quantum computers be interaction-free leaves them with little versatility and practical utility.
In this article, we introduce a new logical clock, the barrier-lock clock, whose conception is based on the lazy release consistency (LRC) memory model supported by several distributed shared memory (DSM) systems. Since in LRC the propagation of shared memory updates performed by the processes of a parallel application is induced by lock and barrier operations, our logical clock has been modeled on those operations. Each barrier-lock timestamp encodes the synchronization operation with which it is associated. Its size does not depend on the number of processes in the system, as with traditional logical vector clocks, but is proportional to the number of locks. The barrier-lock time characterizes the causality of shared memory updates performed by processes of a parallel application running on an LRC-based DSM system. A formal proof and experimental tests confirm this property.
We discuss a parallel implementation of an agent-based simulation. Our approach makes it possible to adapt a sequential simulator for large-scale simulation on a cluster of workstations. We target discrete-time simulation models that capture the behavior of Web users and Web sites. Web users are connected with each other in a graph resembling a social network. Web sites are also connected in a similar graph. Users are stateful entities. At each time step, they exhibit certain behaviour such as visiting bookmarked sites, exchanging information about Web sites in the "word-of-mouth" style, and updating bookmarks. We study the real-world phenomenon of emergent aggregate behavior of the Internet population. The system distributes data among workstations, which allows large-scale simulations that are infeasible on a stand-alone computer. The model properties cause traffic between workstations proportional to partition sizes. Network latency is hidden by concurrent simulation of multiple users. The system is implemented in Mozart, which provides multithreading, dataflow variables, component-based software development, and network transparency. Currently we can simulate up to 10⁶ Web users on 10⁴ Web sites using a cluster of 16 computers, which takes a few seconds per simulation step; for a problem of this size, parallel simulation offers speedups between 11 and 14.
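As an illustration of the simulated behaviour, here is a minimal sequential sketch of one time step. The class, function, and parameter names (User, step, SPREAD_PROB) are ours, not the simulator's; the real system distributes the users across a Mozart-based cluster rather than iterating over them in one process.

```python
import random

SPREAD_PROB = 0.3          # chance of passing a site on by "word of mouth" (illustrative)

class User:
    """Stateful Web user: a node of the social graph with a set of bookmarks."""
    def __init__(self, uid, friends, bookmarks):
        self.uid = uid
        self.friends = list(friends)      # neighbour ids in the social graph
        self.bookmarks = set(bookmarks)
        self.inbox = []                   # recommendations received this step

def step(users, site_visits):
    # Phase 1: every user visits one bookmarked site and may recommend it.
    for u in users.values():
        if not u.bookmarks:
            continue
        site = random.choice(sorted(u.bookmarks))
        site_visits[site] = site_visits.get(site, 0) + 1
        for f in u.friends:
            if random.random() < SPREAD_PROB:
                users[f].inbox.append(site)
    # Phase 2: users update their bookmarks from the received recommendations.
    for u in users.values():
        u.bookmarks.update(u.inbox)
        u.inbox.clear()

# Two users recommending sites to each other over ten steps.
users = {0: User(0, [1], {"site-a"}), 1: User(1, [0], {"site-b"})}
visits = {}
for _ in range(10):
    step(users, visits)
print(visits)
```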
This paper analyses the micro-threaded model of concurrency, making comparisons with both data-level and instruction-level concurrency. The model is fine grain and provides synchronisation in a distributed register file, making it a promising candidate for scalable chip multiprocessors. The micro-threaded model was first proposed in 1996 as a means to tolerate high latencies in data-parallel, distributed-memory multiprocessors. This paper explores the model's ability to provide the simultaneous issue of instructions required for chip multiprocessors, and discusses scalability issues concerning the support structures that implement the model and the communication required to support it. The model supports deterministic distribution of code fragments and dynamic scheduling of instructions from within those fragments. The hardware also recognises different classes of variables from the register specifiers, which allows it to manage locality and optimise communication so that it is both efficient and scalable.
Cray XT and IBM Blue Gene systems represent current alternative approaches to constructing leadership computing systems; both rely on applications being able to exploit very large configurations of processor cores, and the associated analysis tools must scale commensurately to isolate and quantify performance issues that manifest at the largest scales. In studying the scalability of the Scalasca performance analysis toolset to several hundred thousand MPI processes on XT5 and BG/P systems, we investigated a progressive execution performance deterioration of the well-known ASCI Sweep3D compact application. Scalasca runtime summarization analysis quantified MPI communication time that correlated with computational imbalance, and automated trace analysis confirmed growing amounts of MPI waiting time. Further instrumentation, measurement, and analysis pinpointed a conditional section of highly imbalanced computation which amplified waiting times inherent in the associated wavefront communication and seriously degraded overall execution efficiency at very large scales. By employing effective data collation, management, and graphical presentation in a portable and straightforward-to-use toolset, Scalasca was thereby able to demonstrate performance measurements and analyses with 294,912 processes.
Petascale parallel computers with more than a million processing cores are expected to be available in a couple of years. Although MPI is the dominant programming interface today for large-scale systems that, at the highest end, already have close to 300,000 processors, a challenging question for both researchers and users is whether MPI will scale to processor and core counts in the millions. In this paper, we examine the issue of scalability of MPI to very large systems. We first examine the MPI specification itself and discuss areas with scalability concerns and how they can be overcome. We then investigate issues that an MPI implementation must address in order to be scalable. To illustrate the issues, we ran a number of simple experiments to measure MPI memory consumption at scale, up to 131,072 processes or 80% of the IBM Blue Gene/P system at Argonne National Laboratory. Based on the results, we identified nonscalable aspects of the MPI implementation and found ways to tune it to reduce its memory footprint. We also briefly discuss issues in application scalability to large process counts and features of MPI that enable the use of other techniques to alleviate scalability limitations in applications.
Bag-of-Tasks (BoT) applications are parallel applications composed of independent (i.e., embarrassingly parallel) tasks that do not communicate with each other, may depend upon one or more input files, and can be executed in any order. Each file may be input for more than one task. A common framework for executing BoT applications is the master-slave topology, in which the user machine is used to control the execution of tasks. In this scenario, a large number of concurrent tasks competing for resources (e.g., CPU and communication links) severely limits scalability. In this paper we study the scalability of BoT applications running on multi-node systems (e.g., clusters and grids) organized as master-slave platforms, considering two communication paradigms: multiplexed connections and efficient broadcast. We prove that the lowest possible bound on the isoefficiency function for master-slave platforms is achieved by those platforms that have an O(1) efficient broadcast primitive available. We also analyze the impact of output file contention on scalability under different assumptions. Our study employs a set of simulation experiments that confirm and extend the theoretical results (e.g., by simulating TCP links).
In a context where networks grow larger and larger, their nodes become more likely to fail. Indeed, they may be subject to crashes, attacks, memory corruptions… To encompass all possible types of failure, we consider the most general model of failure: the Byzantine model, where any failing node may exhibit arbitrary (and potentially malicious) behavior.
We consider an asynchronous grid-shaped network where each node has a probability λ to be Byzantine. Our metric is the communication probability, that is, the probability that any two nodes communicate reliably. A number of Byzantine-resilient broadcast protocols exist, but they all share the same weakness: when the size of the grid increases, the communication probability approaches zero.
In this paper, we present the first protocol that overcomes this difficulty, and ensures a communication probability of 1 − 4λ on a grid that may be as large as we want (for a sufficiently small λ, typically λ < 10⁻⁵). The originality of the approach lies in the fractal definition of the protocol, which, we believe, could be used to solve several similar problems related to scalability. We also extend this scheme to a 3-dimensional grid and obtain a 1 − 2λ communication probability for λ < 10⁻³.
The novel paradigm of Sleptsov Net Computing (SNC) mends imperfections of modern HPC architectures with a computing-memory implementation and provides a vivid graphical language, fine granulation of concurrent processes, and wide application of formal methods for reliable software design. An IDE and a VM for SNC have been developed and described in earlier papers. In the present paper, we considerably reduce the GPU memory and thread consumption of the previous prototype implementation by introducing a matrix with condensed columns (MCC), and we enhance performance by choosing the first fireable transition from the transition sequence reordered according to the lattice of priorities. MCC takes into account how Sleptsov net (SN) arcs are processed, enhancing performance with rather little overhead compared to traditional sparse matrix formats. We represent a matrix of SN arcs by a pair of considerably smaller matrices having the same number of columns, with the number of rows equal to the maximal number of nonzero elements over the source matrix columns. The first matrix contains the indexes of the nonzero elements in the source matrix; the second contains their values. To run an SN, we use the corresponding rectangular matrix of GPU threads, with very few indirect memory accesses via MCC. As a result, we increase VM performance by a factor of 2–4 and reduce GPU memory and thread consumption by a factor of one hundred on programs from the SN software collection.
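A minimal sketch of the MCC layout, assuming conversion from a dense arc matrix and a padding value of -1 for unused slots; the function names and padding convention are ours, not taken from the paper.

```python
import numpy as np

def to_mcc(dense):
    """Store a sparse arc matrix as two small matrices with the same number of
    columns; the row count equals the maximum number of nonzeros in any column.
    One matrix holds the row indexes of the nonzeros, the other their values."""
    rows, cols = dense.shape
    nnz_per_col = [np.flatnonzero(dense[:, j]) for j in range(cols)]
    height = max(len(idx) for idx in nnz_per_col)
    index = np.full((height, cols), -1, dtype=np.int64)   # -1 marks padding
    value = np.zeros((height, cols), dtype=dense.dtype)
    for j, idx in enumerate(nnz_per_col):
        index[: len(idx), j] = idx
        value[: len(idx), j] = dense[idx, j]
    return index, value

def column_dot(index, value, j, vec):
    """Dot product of column j with a dense vector, touching only nonzeros."""
    mask = index[:, j] >= 0
    return value[mask, j] @ vec[index[mask, j]]

A = np.array([[0, 2, 0],
              [3, 0, 0],
              [0, 4, 5]])
idx, val = to_mcc(A)                                       # both are 2 x 3 instead of 3 x 3
print(column_dot(idx, val, 1, np.array([1.0, 1.0, 1.0])))  # -> 6.0
```

This is essentially a column-wise ELLPACK-style layout; as the abstract notes, the VM then launches a rectangular grid of GPU threads matching the MCC shape, so each thread performs very few indirect memory accesses.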
Proper distribution of operations among parallel processors in a large scientific computation executed on a distributed-memory machine can significantly reduce the total computation time. In this paper, we propose an operation called simultaneous parallel reduction (SPR) that is amenable to such optimization. SPR performs reduction operations in parallel, each operation reducing a one-dimensional consecutive section of a distributed array. Each element of the distributed array is used as an operand in many reductions executed concurrently over overlapping sections of the array. SPR is distinct from the more commonly considered parallel reduction, which concurrently evaluates a single reduction. In this paper we consider SPR on Single Instruction Multiple Data (SIMD) machines with different interconnection networks. We focus on SPR over sections whose size is not a power of 2, with the result shifted relative to the arguments. Several algorithms achieving some of the lower bounds on SPR complexity are presented under various assumptions about the properties of the binary operator of the reduction and about the communication cost of the target architectures.
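The reference sketch below fixes one possible indexing convention for SPR and shows how an invertible operator such as addition admits a prefix-sum shortcut, one instance of the operator properties the paper exploits. The names (spr_reference, spr_prefix_sums) and the exact indexing convention are assumptions made for illustration; the paper's algorithms target SIMD machines, not a sequential loop.

```python
from functools import reduce
import numpy as np

def spr_reference(a, width, shift, op):
    """out[i] = op-reduction of a[i+shift : i+shift+width] (sequential definition)."""
    n = len(a)
    return [reduce(op, a[i + shift: i + shift + width])
            for i in range(n - shift - width + 1)]

def spr_prefix_sums(a, width, shift):
    """Same result for '+' only: an operator with an inverse admits an O(n)
    method via prefix sums, regardless of whether width is a power of 2."""
    p = np.concatenate(([0], np.cumsum(a)))
    i = np.arange(len(a) - shift - width + 1)
    return p[i + shift + width] - p[i + shift]

a = np.array([3, 1, 4, 1, 5, 9, 2, 6])
print(spr_reference(list(a), width=3, shift=1, op=lambda x, y: x + y))  # [6, 10, 15, 16, 17]
print(spr_prefix_sums(a, width=3, shift=1).tolist())                    # [6, 10, 15, 16, 17]
```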
The BaBar experiment is currently operating near the rate limit of its ability to log event data to disk and tape using the existing hardware and software systems. Consequently, we have chosen to design and implement a new system for logging event data. The new system is designed to be scalable, so that the data rate can be increased by adding systems at one of three levels. It also has the property that data can be logged at almost unlimited burst rates without introducing dead time. The key to these features lies in the use of many nodes within the level-three trigger system of BaBar. This allows events first to be logged to local disks within the trigger system and later merged onto any of multiple merge servers in non-real time.
We present a family of networks whose local interconnection topologies are generated by the root vectors of a semi-simple complex Lie algebra. The Cartan classification theorem for these algebras ensures that this family of interconnection topologies is exhaustive. The global arrangement of the network is defined in terms of integer or half-integer weight lattices. The mesh or torus topologies that network millions of processing cores, such as those in the IBM BlueGene series, are the simplest members of this category. The symmetries of an algebra's root system, manifested by its Weyl group, lend great convenience to the design and analysis of hardware architectures, algorithms, and programs.
This paper presents a scalable architecture suitable for the implementation of high-speed fuzzy inference systems on reconfigurable hardware. The main features of the proposed architecture, based on the Takagi–Sugeno inference model, are scalability, high performance, and flexibility. A scalable fuzzy inference system (FIS) must remain efficient and practical when applied to complex situations, such as multidimensional problems with a large number of membership functions and a large rule base. Several current application areas of fuzzy computation require such enhanced capabilities to deal with real-time problems (e.g., robotics, automotive control, etc.). Scalability and high performance of the proposed solution have been achieved by exploiting the inherent parallelism of the inference model, while flexibility has been obtained by applying hardware/software codesign techniques to reconfigurable hardware. Latest-generation reconfigurable technologies, particularly field-programmable gate arrays (FPGAs), make it possible to implement the whole embedded FIS (e.g., processor core, memory blocks, peripherals, and specific hardware for fuzzy inference) on a single chip, with the consequent savings in size, cost, and power consumption. As a prototyping example, we implemented a complex fuzzy controller for a vehicle semi-active suspension system, composed of four three-input FISs, on a single FPGA from Xilinx's Virtex-5 device family.
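For readers unfamiliar with the inference model, the sketch below shows a zero-order Takagi–Sugeno computation (triangular memberships, product t-norm, weighted average of constant consequents). It illustrates the arithmetic that such an architecture parallelizes, not the FPGA architecture itself, and all rule and membership parameters are made-up illustrative numbers.

```python
def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def ts_infer(inputs, rules):
    """rules: list of (antecedents, consequent), where antecedents is one
    (a, b, c) triangle per input and consequent is a constant (zero-order TS)."""
    num = den = 0.0
    for antecedents, consequent in rules:
        w = 1.0
        for x, (a, b, c) in zip(inputs, antecedents):
            w *= tri(x, a, b, c)                 # product t-norm for firing strength
        num += w * consequent
        den += w
    return num / den if den > 0 else 0.0

# Two-input, two-rule example.
rules = [
    ([(0, 0, 5), (0, 0, 5)], 1.0),       # IF x1 is LOW AND x2 is LOW THEN 1.0
    ([(0, 5, 10), (0, 5, 10)], 4.0),     # IF x1 is MED AND x2 is MED THEN 4.0
]
print(ts_infer([2.0, 3.0], rules))       # -> 2.5
```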
Text classification is an important way to handle and organize textual data. Among existing methods of text classification, semi-supervised clustering is a mainstream technique. In the era of ‘Big data’, current semi-supervised clustering approaches to text classification generally do not apply to massive text data because of excessive costs in scalability and computing performance. Aiming at this problem, this study proposes a scalable text classification algorithm for large-scale text collections, namely D-TESC, obtained by modifying a state-of-the-art semi-supervised clustering approach for text classification that operates in a centralized fashion (TESC). D-TESC processes textual data in a distributed manner to achieve high scalability. The experimental results indicate that (1) the D-TESC algorithm has classification quality comparable to TESC, and (2) it outperforms TESC in scalability by a factor of 7.2 on average when using eight CPU threads.
We study scalable parallel computational geometry algorithms for the coarse grained multicomputer model: p processors solving a problem on n data items, where each processor has O(n/p) ≫ O(1) local memory and all processors are connected via some arbitrary interconnection network (e.g., mesh, hypercube, fat tree). We present O(Tsequential/p + Ts(n, p))-time scalable parallel algorithms for several computational geometry problems, where Ts(n, p) refers to the time of a global sort operation.
Our results are independent of the multicomputer’s interconnection network. Their time complexities become optimal when Tsequential/p dominates Ts(n, p) or when Ts(n, p) is optimal. This is the case for several standard architectures, including meshes and hypercubes, and a wide range of ratios n/p that include many of the currently available machine configurations.
Our methods also have some important practical advantages: for interprocessor communication, they use only a small, fixed number of calls to a single global routing operation, global sort, and all other programming is in the sequential domain. Furthermore, our algorithms use only a small number of very large messages, which greatly reduces the overhead of the communication protocol between processors. (Note, however, that our time complexities account for the lengths of messages.) Experiments show that our methods are easy to implement and give good timing results.
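The programming pattern can be sketched as follows, with the p processors simulated by plain Python lists and global sort standing in for the machine's routing operation. The toy problem (counting distinct values) is ours and serves only to show the structure of one global sort plus purely local work; it is not one of the geometry algorithms from the paper.

```python
def global_sort(partitions, key=None):
    """Sort the union of all local data and hand each processor back an
    O(n/p) contiguous slice of the sorted order (stand-in for the routing op)."""
    p = len(partitions)
    data = sorted((x for part in partitions for x in part), key=key)
    chunk = (len(data) + p - 1) // p
    return [data[i * chunk:(i + 1) * chunk] for i in range(p)]

def cgm_count_distinct(partitions):
    """Toy CGM-style algorithm: one global sort, then purely local scanning.
    In a real CGM only the single boundary element would be exchanged between
    neighbouring processors; here prev_last carries it sequentially."""
    parts = global_sort(partitions)
    total = 0
    prev_last = None
    for part in parts:                      # each loop body is local to one processor
        for i, x in enumerate(part):
            if (i == 0 and x != prev_last) or (i > 0 and x != part[i - 1]):
                total += 1
        if part:
            prev_last = part[-1]
    return total

print(cgm_count_distinct([[3, 1, 2], [2, 5], [5, 5, 4]]))   # -> 5
```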
We present output-sensitive scalable parallel algorithms for bichromatic line segment intersection problems in the coarse grained multicomputer model. Under the assumption that n ≥ p², where n is the number of line segments and p the number of processors, we obtain an intersection counting algorithm whose running time is the sum of a local computation term and a communication term expressed using Ts(m, p), the time needed to sort m items on a p-processor machine. The first term captures the time spent in sequential computation performed locally by each processor; the second captures the interprocessor communication time. Additional sequential computation time is spent on reporting the k intersections. Since the sequential time complexity is O(n log n) for counting plus an additional O(k) for reporting, the algorithm attains a corresponding speedup in its sequential part. The speedup in the communication part obviously depends on the underlying architecture; on a hypercube, for example, it varies with the ratio of n to p. As the reporting does not involve more interprocessor communication than the counting, the algorithm achieves a full speedup of p for k ≥ O(max(n log n log p, n log³ p)), even on a hypercube.
This paper discusses the development of natural language interfaces to interactive computer systems using the NALIGE user interface management system. The task of engineering such interfaces is reduced to producing a set of well-formed specifications which describe lexical, syntactic, semantic, and pragmatic aspects of the selected application domain. These specifications are converted by NALIGE into an autonomous natural language interface that exhibits the prescribed linguistic and functional behavior. The development of several applications is presented to demonstrate how NALIGE and the associated development methodology may facilitate the design and implementation of practical natural language interfaces. These include a natural language interface to Unix and its subsequent porting to MS-DOS, VAX/VMS, and VM/CMS; a natural language interface for Internet navigation and resource location; a natural language interface for text pattern matching; a natural language interface for text editing; and a natural language interface for electronic mail management. Additionally, design issues and considerations are identified and addressed, such as reuse and portability, content coupling, morphological processing, scalability, and habitability.
The majority of e-commerce sites implement Recommender Systems (RS) to help users navigate the large search space and assist their decision making by suggesting products the user may like. Collaborative Filtering (CF) is the most successful and widely used algorithm in the domain of RS. However, due to the exponential growth of the web in terms of both content and number of users, CF-based RS face serious scalability issues. To alleviate this problem, we propose a clustering-based CF approach using two hierarchical space-partitioning data structures, the K-d tree and the Quadtree. We cluster (partition) the system's user space on the basis of user location and then use the resulting clusters to predict the ratings of a target user. Since the CF-based recommendation algorithm is applied separately to the clusters and not to the entire rating data, the runtime of the algorithm is reduced substantially. We further measure spatial autocorrelation indices in the clusters to justify our clustering method. However, our objective is not only to reduce the runtime but also to maintain an acceptable recommendation quality. The proposed method addresses this requirement, ensuring scalability by processing very large datasets with the same computing resources. Moreover, our clustering scheme is oblivious to the underlying CF algorithm. Results from extensive experiments show that our hierarchical-clustering-based recommendation approach reduces the runtime of standard CF algorithms by about 88%, 82%, 79%, and 85% for the MovieLens-100K, MovieLens-1M, Book-Crossing, and TripAdvisor data respectively, while maintaining good recommendation quality.
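A minimal sketch of the space-partitioning step, assuming a K-d-style recursive median split of users by 2-D location into buckets on which a CF algorithm is then run independently. Function names, parameters, and the bucket-size threshold are illustrative and not taken from the paper, which also considers Quadtrees; run_collaborative_filtering is a hypothetical placeholder for whichever CF algorithm is plugged in.

```python
import numpy as np

def kd_partition(points, ids, depth=0, max_bucket=100):
    """Return a list of buckets (arrays of user ids), each of size <= max_bucket,
    obtained by recursively splitting at the median along alternating axes."""
    if len(ids) <= max_bucket:
        return [ids]
    axis = depth % points.shape[1]                  # alternate split dimension
    order = np.argsort(points[ids, axis])
    mid = len(ids) // 2
    left, right = ids[order[:mid]], ids[order[mid:]]
    return (kd_partition(points, left, depth + 1, max_bucket)
            + kd_partition(points, right, depth + 1, max_bucket))

# Usage: partition 10,000 users by (latitude, longitude), then run CF per bucket.
rng = np.random.default_rng(1)
locations = rng.uniform(-90, 90, size=(10_000, 2))
buckets = kd_partition(locations, np.arange(len(locations)), max_bucket=500)
# for bucket in buckets:
#     run_collaborative_filtering(ratings[bucket])  # hypothetical per-cluster CF call
print(len(buckets), max(len(b) for b in buckets))
```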
Recommender systems have emerged as a class of essential tools in the success of modern e-commerce applications. These applications typically handle large datasets and often face challenges like data sparsity and scalability. Clustering techniques help reduce the computational time needed for recommendation as well as handle the sparsity problem more efficiently. Traditional clustering-based recommender systems create partitions (clusters) of the user-item rating matrix and execute the recommendation algorithm in each cluster separately in order to decrease the overall runtime of the system. Each user or item generally belongs to at most one cluster. However, some users (boundary users) in a particular cluster may exhibit higher similarity with the preferences of users residing in nearby clusters than with those in their own cluster. Therefore, we propose a clustering-based scalable recommendation algorithm that has a provision for switching a user from its original cluster to another cluster in order to provide more accurate recommendations. For a user belonging to multiple clusters, we aggregate recommendations from those clusters to which the user belongs in order to produce the final set of recommendations for that user. In this work, we propose two types of clustering, one on the basis of rating and the other on the basis of frequency, and then compare their performance. Finally, we explore the applicability of cluster-ensemble techniques in the proposed method. Our aim is to develop a recommendation framework that can scale well to large datasets without much loss in recommendation quality. The outcomes of our experiments clearly demonstrate the scalability as well as the efficacy of our method. It reduces the runtime of the baseline CF algorithm by a minimum of 58% and a maximum of 90% for the MovieLens-10M dataset, and by a minimum of 42% and a maximum of 84% for the MovieLens-20M dataset. Recommendation accuracy in terms of the F1, MAP, and NDCG metrics is also better than that of existing clustering-based recommender systems.