Please login to be able to save your searches and receive alerts for new content matching your search criteria.
In this paper, we propose a new fault tolerant multistage interconnection network (MIN) and a new adaptive self-routing scheme for the network. It can provide more multiple paths than the related previous networks between an input/output pair of a network by adding extra links between switching elements in the same stage and modifying the self-routing scheme of the banyan network. Our routing scheme is as simple as that of the banyan network, which is based on the topological relationships among the switching elements (SE’s) that render a packet to the same destination with the regular self-routing, which are discovered in this paper. We present an algebraic proof to show the correctness of this scheme, and an analytic reliability analysis to provide quantitative comparisons with other networks, which shows that the new network is more cost effective than the banyan network and other augmented MIN’s in terms of the reliability.
Many applications which require high speed evaluation of polynomials and exponentials of polynomials can now be implemented in the hardware very efficiently because of the advances in VLSI technology. Several fast algorithms have been proposed in the recent past for the efficient evaluation of polynomials and exponentials of polynomials for equispaced arguments on uniprocessor systems. In this paper, we consider the problem of organizing this evaluation on VLSI chips in the form of systolic arrays. We present linear fault tolerant systolic arrays which can evaluate the polynomials and exponentials of polynomials of any degree for a large number of equispaced points. These organizations have the main advantage that the interconnections between the processing elements are very regular and simple, and hence are very appropriate for VLSI implementation.
This paper proposes a procedure to design fault tolerant algorithms for the R-Mesh and some of its restrictive variations. This procedure first identifies a healthy sub-mesh from a faulty model using the bypass and removal fault model. Then it uses scalable algorithms to simulate the larger faulty model on the resulting healthy sub-mesh. The algorithms for the bypass model tolerates n faults in an n×n R-Mesh (LR-Mesh) and runs in O(T log n) (O(T)) time, where T is the execution time on the original mesh without faults. For the removal model, we design fault tolerant algorithms for some interesting variations of the R-Mesh, specifically, the NXR-Mesh and the NXLR-Mesh. We propose the first scaling simulations for these models and present a simulation of the R-Mesh on the NXR-Mesh. The results of this paper enable us to consider certain reconfigurable models in a more practical environment than previously allowed.
Traditional message passing based checkpointing and rollback recovery algorithms perform well for tightly coupled systems. In wide area distributed systems these algorithms may suffer from large overhead due to message passing delay and network traffic. Mobile agents offer an attractive option for designing checkpointing schemes for wide area distributed systems. Network topology is assumed to be arbitrary. Processes are mobile agent enabled. When a process wants to take a checkpoint, it just creates one mobile agent. Concurrent initiations by multiple processes are allowed. Synchronization and creation of a consistent global state (CGS) for checkpointing is managed by the mobile agent(s). In the worst case, for k concurrent initiations among n processes, checkpointing algorithm requires a total of O(kn) hops by all the mobile agents. A mobile agent carries O(n/k) (on the average) size data.
Checkpoint is a designated place in a program at which normal process is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. A checkpoint algorithm for mobile distributed systems needs to handle many new issues like: mobility, low bandwidth of wireless channels, lack of stable storage on mobile nodes, disconnections, limited battery power and high failure rate of mobile nodes. These issues make traditional checkpointing techniques unsuitable for such environments. Minimum-process coordinated checkpointing is an attractive approach to introduce fault tolerance in mobile distributed systems transparently. This approach is domino-free, requires at most two checkpoints of a process on stable storage, and forces only a minimum number of processes to checkpoint. But, it requires extra synchronization messages, blocking of the underlying computation or taking some useless checkpoints. In this paper, we design a minimum-process checkpointing algorithm for mobile distributed systems, where no useless checkpoint is taken. We reduce the blocking of processes by allowing the processes to do their normal computations, send messages and receive selective messages during their blocking period.
In this paper, we propose an approximation algorithm for the Fault-Tolerant Metric Facility Location problem which can be implemented in a distributed and asynchronous manner within O(n) rounds of communication, where n is the number of vertices in the network. Given a constant size set which represents distinct levels of fault-tolerant requirements of all cities, as well as the two-part (facility and connection) cost of the optimal solution, i.e. F* + C*, the cost of our solution is no more than
for the general case, and less than F* + 2C* for the special case where all cities have a uniform connectivity requirement. Extensive numerical experiments showed that the quality of our solutions is comparable (within 4% error) to the optimal solution in practice.
The growing size of the multiprocessor system increases its vulnerability to component failures. It is crucial to locate and replace the faulty processors to maintain a system's high reliability. The fault diagnosis is the process of identifying faulty processors in a system through testing. This paper shows that the largest connected component of the survival graph contains almost all of the remaining vertices in the dual-cube DCn when the number of faulty vertices is up to twice or three times of the traditional connectivity. Based on this fault resiliency, this paper determines that the conditional diagnosability of DCn (n ≥ 3) under the comparison model is 3n − 2, which is about three times of the traditional diagnosability.
Fault tolerance is especially important for interconnection networks, vastly influencing the performance of the parallel processing systems underlying the corresponding networks. This paper studies the fault tolerance of radix-kn-dimensional hypermesh optical interconnection networks, determines the connectivity of partial hypermesh, and derives the conditional connectivity of hypermesh provided that each adjacent set cannot be faulty simultaneously. Under this condition, the hypermesh networks can tolerate up to 2n(k-1)-k-1 fault processors without being disrupted, implying that when the number of dimension n (respectively, radix-k) is a fixed value in the hypermesh network, the larger the value of radix-k (respectively, dimension n) is, the higher the reliability and availability of the network becomes.
The arrangement graph An,k is one of the attractive underlying topologies for distributed systems. Let fm(n, k) be the minimum number of faulty links that make every sub-arrangement graph An-m,k-m faulty in An,k under link failure model. In this paper, we proved that ,
, and
for 2 ≤ m ≤ k − 2.
The augmented cube AQn is a variation of the hypercube Qn. This paper considers the fault-tolerant Panconnectivity of AQn. Assume that F⊆V(AQn)∪E(AQn) and n≥4. We prove that for any two fault-free vertices u and v with distance d in AQn, there exists a fault-free path Puv of each length from max{d+2,4} to 2n−fv−1 in AQn−F if |F|≤2n−4, where fv is the number of faulty vertices in AQn. Moreover, the bound is sharp.
As one of the most fundamental networks for parallel and distributed computation, cycle is suitable for developing simple algorithms with low communication cost. A graph G is called k-fault-tolerant edge-pancyclic if after deleting any faulty set F of k vertices and/or edges from G, every correct edge in the resulting graph lies in a cycle of every length from g to |V(G−F)|, inclusively, where g is the girth of G, the length of a shortest cycle in G. The n-dimensional crossed cube CQn is an important variant of the hypercube Qn, which possesses some properties superior to the hypercube. This paper investigates the fault-tolerant edge-pancyclicity of CQn, and shows that if CQn(n≥5) contains at most n−2 faulty vertices and/or edges then, for any fault-free edge uv and every length ℓ from 6 to |V(CQn−F)| except ℓ=7, there is a fault-free cycle of length ℓ containing the edge uv. The result is optimal in some senses.
The (n,k)-star graph Sn,k, which is introduced to address scaling issues of the star graph, is recognized as an attractive interconnection network topology for building multiprocessor systems because of its desirable properties. Let Fm(Sn,k) be the minimum number of faulty vertices that make every subgraph isomorphic to Sn−m,k−m faulty in Sn,k under vertex-failure model, where 0≤m≤k−1. In this paper, we prove that F0(Sn,k)=1, F1(Sn,k)=n, Fk−1(Sn,k)=n!/(n−k+1)!, and n!/(n−m)!≤Fm(Sn,k)≤(k−2m−1)n!/(n−m)!−2(k−3m−1)n!/(n−m+1)! for 1≤k≤n−2 and 2≤m≤k−2.
We give a new routing algorithm for double loop networks with n nodes which requires O(log n) time for preprocessing and constant processing time at each node on the route. A simple modification of the algorithm works for the case of a single fault (either node or link). The routing is always through a shortest path and the only information needed by a node to process is the address of the destination.
The capability of fault tolerance is one of the advantages of multiprocessor systems. In this paper, we prove that the fault tolerance of an n-star graph is 2n-5 with restriction to the forbidden faulty set. And we propose an algorithm for examining the connectivity of an n-star graph when there exist at most 2n - 4 faults. The algorithm requires O(n2log n) time. Besides, we improve the fault-tolerant routing algorithm proposed by Bagherzadeh et al. by calculating the cycle structure of a permutation and the avoidance of routing message to a node without any nonfaulty neighbor. This calculation needs only constant time. And then, we propose an efficient fault-tolerant broadcasting algorithm. When there is no fault, our broadcasting algorithm remains optimal. The penalty is O(n) if there exists only one fault, and the penalty is O(n2) if there exist at most n - 2 faults.
A new MIN called the Symmetric Banyan Network (SBM) is presented in this paper. In the SBN, 4 × 4 switching elements are used and they are connected symmetrically between the upper and lower parts of the network. There are 2N paths for every source-destination pair. The SBN is basically single-fault tolerant, but can tolerate up to three faults, with more elegant routing, except in the first and last stages which are still single-fault tolerant. And full accessibility is preserved even in some instances when half of the network is in fault. The routing of the SBN is self-adaptive in the presence of a fault. The throughput analysis of the SBN is done using computer simulations and shows that the SBN performs better than the Itoh's network, the ASEN (Augmented Shuffle Exchange Network) and the crossbar network. We analyze the cost/performance of the SBN against the MINs with multiple banyan networks such as MBSF (Multi Banyan Switching Fabric) and the PBSF (Piled Banyan Switching Fabric) and the analysis shows that the SBN is also attractive in terms of the cost.
We present a family of interconnection networks named the Cube-Of-Rings (COR) networks along with their basic graph-theoretic properties. Aspects of group graph theory are used to show the COR networks are symmetric and optimally fault tolerant. We present a closed-form expression of the diameter and optimal one-to-one routing algorithm for any member of the COR family. We also discuss the suitability of the COR networks as the interconnection network of scalable parallel computers.
The hypercube as a parallel interconnection network has been of academic and engineering concern for tens of year due to its many merits. However, its increasing node degree is an obvious weakness. Some networks such as the cube-connected cycles (CCC) and the de Bruijn network have been proposed to overcome the increasing degree of the hypercube. In this paper, we present a new cost-effective network which outperforms the cube network. It can overcome the increasing degree of cube networks while keeping the advantages of the cube network such as logarithmic diameter, easy routing, optimal fault tolerance, and suitability for the ASCEND/DESCEND class of parallel problems. Furthermore, the proposed network achieves the logarithmic diameter with a very small constant node degree, 3 or 4.
This paper gives a system designer's point of view regarding the emergence of the many new devices that are being considered as possible replacement of the Metal Oxyde Semiconductor Field effect transistor. The first part is a tentative to define criteria for the systemability of nanodevices. The second part is a short overview of the evolution of information processing systems. The third part is a more specific study of the demultiplexing and fault tolerance problems.
We study the effects of random faults on the behavior of one-dimensional, non-uniform cellular automata (CA), where the local update rule need not be identical for all grid sites. The CA systems examined were obtained via an approach known as cellular programming, which involves the evolution of non-uniform CAs to perform non-trivial computational tasks. Using the "system replicas" methodology, involving a comparison between a perfect, non-perturbed version of the CA and a faulty one, we find that our evolved systems exhibit graceful degradation in performance, able to tolerate a certain level of faults. We then "zoom" into the fault-tolerant zone, where "good" computational behavior is exhibited, introducing measures to fine-tune our understanding of the faulty CAs' operation. We study the error level as a function of time and space, as well as the recuperation time needed to recover from faults. Our investigation reveals an intricate interplay between temporal and spatial factors, with the presence of different rules in the grid giving rise to complex dynamics. Studies along this line may have applications to future computing systems that will contain thousands or even millions of computing elements, rendering crucial the issue of resilience.
In this paper, we propose the design and development of a fault tolerant and recovery scheme for the Message Passing Interface (MPI). The proposed scheme consists of a detection mechanism for detecting process failures, and a recovery mechanism. Two different cases are considered, both assuming the existence of a monitoring process, the Observer which triggers the recovery procedure in case of failure. In the first case, each process keeps a buffer with its own message traffic to be used in case of failure, while the implementor uses periodical tests for notification of failure by the Observer. The recovery function simulates all the communication of the processes with the dead one by re-sending to the replacement process all the messages destined for the dead one. In the second case, the Observer receives and stores all message traffic, and sends to the replacement all the buffered messages destined for the dead process. Solutions are provided to the dead communicator problem caused by the death of a process. A description of the prototype developed is provided along with the results of the experiments performed for efficiency and performance.