An important kernel of scientific software is the multiplication of a sparse matrix by a vector. The efficiency of the algorithm on a vector computer depends on the storage scheme. With storage by rows, performance is generally limited by the short vector length. A storage scheme by so-called generalized columns has therefore been designed, which provides long vectors and consequently good performance. However, it is not suitable for the symmetric case. A new type of storage, by sparse diagonals, has thus been defined. It still exhibits long vectors, with performance as good as before, and it is also well suited to symmetric matrices. Results on a CRAY 2, with various sparse matrices, compare the three algorithms and show the efficiency of the storage by sparse diagonals.
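For concreteness, a minimal sketch of the two contrasting storage ideas follows, assuming NumPy arrays and illustrative function names; this is not the paper's CRAY 2 code. Row storage drives the inner loop over the few nonzeros of one row, while diagonal storage turns each stored diagonal into one long vector operation.

```python
import numpy as np

def spmv_rows(values, col_idx, row_ptr, x):
    """Storage by rows (CSR-like): the inner loop length is the number of
    nonzeros in a row, which is the short-vector limitation noted above."""
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

def spmv_diagonals(diags, offsets, x):
    """Storage by diagonals: each stored diagonal yields one long vector
    operation of length close to n, independent of the row structure."""
    n = len(x)
    y = np.zeros(n)
    for d, off in zip(diags, offsets):
        if off >= 0:
            y[:n - off] += d[:n - off] * x[off:]
        else:
            y[-off:] += d[:n + off] * x[:n + off]
    return y
```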
The PEI formalism has been designed for reasoning about and developing parallel programs in the context of data parallelism. In this paper, we focus on the use of PEI to transform a program involving dense matrices into a new program involving sparse matrices, using the example of the matrix-vector product.
We examine multiprocessor runtime support for fine-grained, irregular directed acyclic graphs (DAGs) such as those that arise from sparse-matrix triangular solves. We conduct our experiments on the CM-5, whose lower latencies and active-message support allow us to achieve unprecedented speedups for a general multiprocessor. Whereas previous implementations have maximum speedups of less than 4 on even simple banded matrices, we are able to obtain scalable performance on extremely small and irregular problems. On a matrix with only 5300 rows, we achieve scalable performance with a speedup of 34 on 128 processors, resulting in absolute performance of over 33 million double-precision floating-point operations per second.
We achieve these speedups with non-matrix-specific methods which are applicable to any DAG. We compare a range of run-time preprocessed and dynamic approaches on matrices from the Harwell-Boeing benchmark set. Although precomputed data distributions and execution schedules produce the best performance, we find that it is challenging to keep their cost low enough to make them worthwhile on small, fine-grained problems.
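As an illustration of the kind of precomputed execution schedule referred to above, a level (wavefront) schedule for a sparse lower-triangular solve can be computed from the CSR arrays as sketched below; the function name and interface are assumptions, not the paper's CM-5 runtime. Unknowns on the same level are independent and could be solved in parallel.

```python
import numpy as np

def triangular_solve_levels(indptr, indices, n):
    """Level schedule for a sparse lower-triangular solve in CSR form:
    unknown i depends on every j < i with L[i, j] != 0, and its level is
    one more than the maximum level of its dependencies."""
    level = np.zeros(n, dtype=int)
    for i in range(n):
        deps = indices[indptr[i]:indptr[i + 1]]
        deps = deps[deps < i]                      # strictly lower part only
        level[i] = level[deps].max() + 1 if deps.size else 0
    return level
```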
Additionally, we find that a policy of frequent network polling can reduce communication overhead by a factor of three compared with the standard CM-5 policies. We present a detailed study of runtime overheads and demonstrate that send and receive processor overheads still dominate these applications on the CM-5. We conclude that these applications would benefit greatly from architectural support for low-overhead communication.
The benefits of hardware support for shared memory versus those for message passing are difficult to evaluate without an in-depth study of real applications on a common platform. We evaluate the communication mechanisms of the MIT Alewife machine, a multiprocessor which provides integrated cache-coherent shared memory, message passing, and DMA. We perform this evaluation with "best-effort" implementations which solve several sparse, irregular benchmark problems with a preconditioned conjugate gradient sparse matrix solver (ICCG).
We find that machines with fast global memory operations do not need message passing or bulk transfer to support our irregular problems. This is primarily for three reasons. First, a 5-to-1 ratio between global and local cache misses makes memory copies in bulk communication expensive relative to communication via shared memory. Second, although message passing has synchronization semantics superior to shared memory for data-driven computation, efficient shared memory can overcome this handicap by using global read-modify-writes to change from the traditional owner-computes model to a producer-computes model. Third, bulk transfers can result in high processor idle times in irregular applications.
Nowadays the performance gap between processors and main memory makes efficient use of the memory hierarchy necessary for good program performance. Several techniques have been proposed for this purpose. Nevertheless, most of them consider only regular access patterns, while many scientific and numerical applications give rise to irregular patterns. A typical case is that of indirect accesses due to the use of compressed storage formats for sparse matrices. This paper describes an analytic approach to model both regular and irregular access patterns. The application modeled is an optimized sparse matrix-dense matrix product algorithm with several levels of blocking. Our model can be directly applied to any memory hierarchy consisting of K-way associative caches. Results are shown for several current microprocessor architectures.
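As a rough illustration of the kind of kernel being modeled, a simplified sparse (CSR) by dense product with a single level of blocking over the dense columns is sketched below; the function name and the single blocking level are assumptions, whereas the modeled kernel uses several levels of blocking.

```python
import numpy as np

def csr_times_dense_blocked(values, col_idx, row_ptr, B, jb=64):
    """Sparse (CSR) x dense product: a strip of jb columns of B stays
    cache-resident while the sparse rows are streamed, so the indirect
    accesses through col_idx are reused across the whole column strip."""
    n = len(row_ptr) - 1
    C = np.zeros((n, B.shape[1]))
    for j0 in range(0, B.shape[1], jb):
        j1 = min(j0 + jb, B.shape[1])
        for i in range(n):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                C[i, j0:j1] += values[k] * B[col_idx[k], j0:j1]
    return C
```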
Compared with the traditional finite-difference time-domain (FDTD) method, the symplectic finite-difference time-domain (SFDTD) method has the characteristics of high precision and low dispersion. However, because higher-order differences are necessary for the calculation, a large sparse matrix is generated, so the computational time is relatively long and the memory requirement is high. To address this problem, the incomplete Cholesky conjugate gradient (ICCG) method for solving large sparse systems is incorporated into the SFDTD difference equations. The ICCG method accelerates the iterations of the numerical calculation and reduces the memory requirement, with fast and stable convergence. The new ICCG–SFDTD method, which combines the advantages of the ICCG and SFDTD methods, is proposed. In this paper, the ICCG–SFDTD method is used to study the characteristic parameters of plasma photonic crystals (PPCs) under different conditions, such as the reflected electric field and the transmission coefficient, to verify the feasibility and accuracy of the method. The results confirm that the ICCG–SFDTD method is accurate and offers these advantages.
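The shape of the ICCG iteration can be indicated with a generic preconditioned conjugate gradient skeleton; this is only a sketch with an abstract preconditioner callback (e.g. an incomplete Cholesky solve), not the ICCG–SFDTD update equations themselves.

```python
import numpy as np

def pcg(A, b, M_solve, tol=1e-8, maxiter=1000):
    """Generic preconditioned conjugate gradient; M_solve(r) applies the
    preconditioner (such as an incomplete Cholesky factorization) to r."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_solve(r)
    p = z.copy()
    rz = r @ z
    bnorm = np.linalg.norm(b)
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * bnorm:
            break
        z = M_solve(r)
        rz_next = r @ z
        p = z + (rz_next / rz) * p
        rz = rz_next
    return x
```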
An important stage in speech enhancement is estimating the noise signal, which is a difficult task in non-stationary and low signal-to-noise conditions. This paper presents an iterative speech enhancement approach that requires no prior knowledge of the noise and is based on low-rank sparse matrix decomposition using a Gammatone filterbank and a convex distortion measure. To estimate noise and speech, the noisy speech is decomposed into a low-rank noise part and a sparse speech part by enforcing sparsity regularization. The approach requires neither the exact distribution of the noise signal nor a separate noise estimator. The experimental results demonstrate that our approach outperforms competing methods and yields better overall speech quality and intelligibility. Moreover, the composite objective measure confirms better performance in terms of residual noise and speech distortion in adverse noisy conditions. Time-varying spectral analysis validates a significant reduction of the background noise.
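A heuristic sketch of the low-rank plus sparse splitting idea is shown below, alternating singular-value thresholding for the low-rank (noise-like) part and entrywise soft thresholding for the sparse (speech-like) part; the parameters and update rule are generic assumptions, not this paper's convex-distortion formulation.

```python
import numpy as np

def lowrank_sparse_split(X, tau, lam, iters=50):
    """Alternate between a singular-value-thresholded low-rank estimate L
    and an entrywise soft-thresholded sparse estimate S of X ~ L + S."""
    S = np.zeros_like(X)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X - S, full_matrices=False)
        L = (U * np.maximum(s - tau, 0.0)) @ Vt        # low-rank part
        D = X - L
        S = np.sign(D) * np.maximum(np.abs(D) - lam, 0.0)  # sparse part
    return L, S
```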
Approximate Nearest Neighbor (ANN) search is a challenging problem given the explosive growth of high-dimensional, large-scale data in recent years. Promising techniques for ANN search include hashing methods, which generate compact binary codes by designing effective hash functions. However, the lack of an optimal regularization is a key limitation of most existing hash functions. To this end, a new method called Adaptive Hashing with Sparse Modification (AHSM) is proposed. In AHSM, codes consist of vertices on the hypercube and the projection matrix is divided into two separate matrices: the data are first rotated by an orthogonal matrix and then modified by a sparse matrix. The sparse matrix is learned as a regularization term of the hash function, which is used to avoid overfitting and reduce quantization distortion. Overall, AHSM has two advantages: it improves accuracy, and it does so without any increase in time cost. Furthermore, we extend AHSM to a supervised version, called Supervised Adaptive Hashing with Sparse Modification (SAHSM), by applying Canonical Correlation Analysis (CCA) to the original data. Experiments show that AHSM consistently surpasses several state-of-the-art hashing methods on four data sets. We also compare three unsupervised hashing methods with their corresponding supervised versions (including SAHSM) on three labeled data sets; similarly, SAHSM outperforms the other methods on most hash-bit settings.
This paper presents the SECu-SVM algorithm for solving classification problems. It significantly accelerates standard SVM implementations by transferring the most time-consuming computations from the CPU to Graphics Processing Units (GPUs). In addition, the highly efficient Sliced EllR-T sparse matrix format is used for storing the dataset in GPU memory; it has a very low memory footprint and is well adapted to parallel processing. The experiments performed demonstrate an acceleration of 4–100 times over LibSVM. Moreover, in the majority of cases SECu-SVM is less time-consuming than the best sparse GPU implementations and can handle significantly larger classification datasets.
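For illustration, an SpMV kernel over an ELLPACK-R-style layout is sketched below; it is a simplified relative of the Sliced EllR-T format named above (the slicing is omitted), shown only to indicate why such padded, row-length-aware layouts map well onto parallel hardware. The function name and interface are assumptions.

```python
import numpy as np

def spmv_ellpack_r(values, col_idx, row_len, x):
    """values and col_idx are n x K arrays padded to the longest row;
    row_len holds the true number of nonzeros per row, so each row's
    work is a regular, independent dot product."""
    n = values.shape[0]
    y = np.zeros(n)
    for i in range(n):
        k = row_len[i]
        y[i] = np.dot(values[i, :k], x[col_idx[i, :k]])
    return y
```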
The eigenvalue/eigenvector and linear solve problems arising in computational quantum dynamics applications (e.g. rovibrational spectroscopy, reaction cross-sections, etc.) often involve large sparse matrices that exhibit a certain block structure. In such cases, specialized iterative methods that employ optimal separable basis (OSB) preconditioners (derived from a block Jacobi diagonalization procedure) have been found to be very efficient at reducing the required CPU effort on serial computing platforms. Recently [1,2], a parallel implementation was introduced, based on a nonstandard domain decomposition scheme. Near-perfect parallel scalability was observed for the OSB preconditioner construction routines up to hundreds of nodes; however, the fundamental matrix–vector product operation itself was found not to scale well in general. In addition, the number of nodes was selectively chosen so as to ensure perfect load balancing. In this paper, two essential improvements are discussed: (1) a new algorithm for the matrix–vector product operation with greatly improved parallel scalability and (2) a generalization to arbitrary numbers of nodes and basis sizes. These improvements render the resulting parallel quantum dynamics codes suitable for robust application to a wide range of real molecular problems, running on massively parallel computing architectures.
Sparse fuzzy ordering and partial ordering relations have recently become of great importance in the field of knowledge systems. As the relations used in such a framework are extremely large, efficient algorithms are needed for handling them. More specifically, when a part of such a relation is updated, the property of transitivity needs to be re-established in a timely manner, as the knowledge system often remains unusable until this process is completed. In this paper we propose two algorithms for the incremental update of fuzzy transitive relations. The first focuses on the incremental update of a part of an already transitive relation, while the second tackles the complete transitive closure of an arbitrary relation. For the average sparse relation, both proposed algorithms have considerably smaller computational complexity than the classical approach, which we also demonstrate experimentally by applying them to real-life relations of this type.
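As the non-incremental baseline that such algorithms improve upon, the classical max-min transitive closure can be sketched as follows; a dense representation and illustrative function name are assumed here for clarity.

```python
import numpy as np

def fuzzy_transitive_closure(R):
    """Max-min transitive closure of a fuzzy relation R (an n x n array of
    membership degrees): repeatedly take the elementwise maximum of R and
    its max-min self-composition until a fixed point is reached."""
    R = np.asarray(R, dtype=float)
    while True:
        # max-min composition: (R o R)[i, j] = max_k min(R[i, k], R[k, j])
        comp = np.max(np.minimum(R[:, :, None], R[None, :, :]), axis=1)
        closed = np.maximum(R, comp)
        if np.array_equal(closed, R):
            return closed
        R = closed
```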
In this paper, a novel method for solving linear equations, the row elimination back-substitution method (REBSM), is proposed. The elimination and back-substitution procedures are carried out at the level of individual rows. The advantage of the proposed method is that it is much faster and requires less storage than the Gaussian elimination algorithm and is therefore capable of solving larger systems of equations. The method is particularly efficient for solving banded sparse systems with symmetric or nonsymmetric coefficient matrices, and can be easily applied within popular numerical methods such as the finite element method and the boundary element method. Detailed Fortran codes and examples are given to demonstrate the robustness and efficiency of the proposed method.
A new direct reanalysis algorithm for local high-rank structural modifications, based on triangular factorization, has recently been developed. In this work, an improvement is proposed for further reduction of the workload so that the algorithm becomes more efficient and also suitable for low-rank modifications. Compared to the original algorithm, firstly, an alternative formula for updating the triangular factors is derived to avoid repetitive calculations for certain low-rank cases. Secondly, to maximize the efficiency, a combined algorithm is proposed that estimates the workloads of the original and alternative algorithms for each row individually before numerical calculations and selects the one with smaller workload. Numerical experiments show that compared with full factorization, the combined algorithm is significantly more efficient and expected to save up to 75% of execution time in our numerical examples. The new method can be easily implemented and applied to engineering problems, especially to local and step-by-step modification of structures.
Patterned random matrices, such as the reverse circulant, the symmetric circulant, the Toeplitz and the Hankel matrices, and their almost sure limiting spectral distributions (LSDs) have attracted much attention. Under the assumption that the entries are taken from an i.i.d. sequence with finite variance, the LSDs are tied together by a common thread: the 2kth moment of the limit equals a weighted sum over different types of pair-partitions of the set {1, 2, …, 2k}, and the limits are universal. Some results are also known for the sparse case. In this paper, we generalize these results by significantly relaxing the i.i.d. assumption. For our models, the limits are defined via a larger class of partitions and are no longer universal. Several existing and new results for patterned matrices, their band and sparse versions, as well as for matrices with continuous and discrete variance profiles, follow as special cases.
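Schematically, the moment formula referred to above has the following shape; the notation (the pair-partition set $P_2(2k)$ and the weights $p(\pi)$) is generic rather than this paper's.

\[
\beta_{2k}
  \;=\; \lim_{n\to\infty}\frac{1}{n}\,
        \mathbb{E}\,\operatorname{Tr}\!\Big(\frac{A_n}{\sqrt{n}}\Big)^{2k}
  \;=\; \sum_{\pi\in P_2(2k)} p(\pi),
\]

where $P_2(2k)$ denotes the pair-partitions of $\{1,\dots,2k\}$ and $p(\pi)\in[0,1]$ is a weight that depends on the matrix pattern (circulant, Toeplitz, Hankel, etc.).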
We give an upper bound on the total variation distance between the linear eigenvalue statistic, properly scaled and centered, of a random matrix with a variance profile and the standard Gaussian random variable. The second-order Poincaré inequality-type result introduced in [S. Chatterjee, Fluctuations of eigenvalues and second order Poincaré inequalities, Probab. Theory Relat. Fields 143(1) (2009) 1–40] is used to establish the bound. Using this bound, we prove a central limit theorem for linear eigenvalue statistics of random matrices with different kinds of variance profiles. We re-establish some existing results on fluctuations of linear eigenvalue statistics of well-known random matrix ensembles by choosing appropriate variance profiles.
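The type of estimate used is, up to absolute constants, of the following generic shape (stated here only to indicate the flavor of the bound, not this paper's exact statement): for $W=f(X)$ with $X$ a standard Gaussian vector and $Z\sim N(0,1)$,

\[
d_{\mathrm{TV}}\!\left(\frac{W-\mathbb{E}W}{\sqrt{\operatorname{Var}W}},\,Z\right)
 \;\le\; \frac{C}{\operatorname{Var}W}
 \left(\mathbb{E}\,\lVert\nabla f(X)\rVert^{4}\right)^{1/4}
 \left(\mathbb{E}\,\lVert\nabla^{2} f(X)\rVert_{\mathrm{op}}^{4}\right)^{1/4}.
\]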
We consider a two-dimensional integro-differential equation for currents in thin superconducting films. The integral operator of this equation is a hypersingular operator with a kernel decaying as 1/R^3. For the numerical solution, a Galerkin finite element method (FEM) on a triangular mesh with linear elements is used. This results in a dense FEM matrix of large dimension. Since the kernel decays quickly, the off-diagonal elements of the FEM matrix are small. We investigate a simple sparsification approach based on dropping small entries of the FEM matrix. We conclude that it reduces memory requirements to some extent. Nevertheless, for problems with a large number of mesh points, more sophisticated techniques, such as hierarchical matrix algorithms, should be considered.
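A crude sketch of the entry-dropping idea follows; the specific thresholding criterion shown (relative to the largest diagonal entry) is an assumption for illustration, not this paper's rule.

```python
import numpy as np
from scipy import sparse

def sparsify(A_dense, eps):
    """Zero out off-diagonal entries of a dense FEM matrix whose magnitude
    is below eps times the largest diagonal entry, then store the result
    in compressed sparse row format."""
    tol = eps * np.max(np.abs(np.diag(A_dense)))
    A = A_dense.copy()
    mask = np.abs(A) < tol
    np.fill_diagonal(mask, False)      # always keep the diagonal
    A[mask] = 0.0
    return sparse.csr_matrix(A)
```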
Numerical computation with large-scale matrices is a central topic in parallel numerical computing, and traditional methods for large-scale matrix operations are generally built on the same idea: dividing the matrix into sub-blocks. Addition and subtraction of large-scale sparse matrices have the following characteristics: 1) the number of zero elements is much larger than the number of non-zero elements; 2) zero elements do not affect the final result of the addition or subtraction; 3) the non-zero elements are irregularly distributed. Traditional block-division methods cannot exploit these characteristics to improve parallel processing efficiency. In this paper, a new representation is proposed that expresses each non-zero element of a sparse matrix as a three-component vector and all non-zero elements of a large-scale sparse matrix as a three-column table. On this basis, rules for merging two such tables are constructed, so that large-scale sparse matrix addition and subtraction are realized by merging tables. When many addition and subtraction operations on large-scale sparse matrices are involved, the merits of this approach to parallel processing become apparent.
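A minimal sketch of the table-merge idea, with each matrix stored as row-major-sorted (row, column, value) triples, might look as follows; this is a sequential sketch only, and the merge itself is what would be distributed across processors.

```python
def sparse_add(ta, tb):
    """Add two sparse matrices stored as (row, col, value) triples sorted
    in row-major order, touching only the non-zero entries in one merge
    pass; entries that cancel to zero are not emitted."""
    out, i, j = [], 0, 0
    while i < len(ta) and j < len(tb):
        (ra, ca, va), (rb, cb, vb) = ta[i], tb[j]
        if (ra, ca) < (rb, cb):
            out.append(ta[i]); i += 1
        elif (rb, cb) < (ra, ca):
            out.append(tb[j]); j += 1
        else:
            s = va + vb
            if s != 0.0:
                out.append((ra, ca, s))
            i += 1; j += 1
    out.extend(ta[i:])
    out.extend(tb[j:])
    return out
```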
Finite element equations form large, sparse linear systems, and solving them is a key step in the numerical simulation of sheet metal forming. The equation-solving method and the storage scheme for the stiffness matrix significantly affect the computational efficiency of FEM analysis. In order to improve the solving efficiency of the self-developed code Quick-Form, the direct solver is replaced by an iterative solver. Based on the definitions of generalized adjacent node and generalized nodal adjacency relation, the assembly principle of the global stiffness matrix and the distribution rules of its non-zero entries are introduced. An improved compact storage scheme, a one-dimensional array with variable bandwidth, is proposed for the iterative solver; it saves a large amount of physical memory and avoids the need to optimize the nodal numbering. A numerical example demonstrates the higher efficiency and lower memory requirements of the proposed algorithm.
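A sketch of a variable-bandwidth (skyline-like) packing of the lower triangle is given below; it illustrates the storage idea only, with assumed names, and is not the Quick-Form implementation.

```python
import numpy as np

def to_variable_bandwidth(A):
    """Pack the lower triangle of a symmetric stiffness matrix into one
    1-D array: row i stores the entries from its first non-zero column up
    to the diagonal, so each row keeps only its own bandwidth. ptr[i] gives
    the start of row i in the packed array."""
    n = A.shape[0]
    values, ptr = [], [0]
    for i in range(n):
        nz = np.nonzero(A[i, :i + 1])[0]
        first = int(nz[0]) if nz.size else i
        values.extend(A[i, first:i + 1])
        ptr.append(len(values))
    return np.array(values), np.array(ptr)
```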