Please login to be able to save your searches and receive alerts for new content matching your search criteria.
A systolic array provides an alternative computing paradigm to the von Neumann architecture. Though its hardware implementation has failed as a paradigm to design integrated circuits in the past, we are now discovering that the systolic array as a software virtualization layer can lead to an extremely scalable execution paradigm. To demonstrate this scalability, in this paper, we design and implement a 3D virtual systolic array to compute a tile QR decomposition of a tall-and-skinny dense matrix. Our implementation is based on a state-of-the-art algorithm that factorizes a panel based on a tree-reduction. Freed from the constraint of a planar layout, we present a three-dimensional virtual systolic array architecture for this algorithm. Using a runtime developed as a part of the Parallel Ultra Light Systolic Array Runtime (PULSAR) project, we demonstrate on a Cray-XT5 machine how our virtual systolic array can be mapped to a large-scale machine and obtain excellent parallel performance. This is an important contribution since such a QR decomposition is used, for example, to compute a least squares solution of an overdetermined system, which arises in many scientific and engineering problems.
Low-rank matrix completion, which aims to recover a matrix with many missing values, has attracted much attention in many fields of computer science. A low-rank matrix fitting (LMaFit) method has been proposed for fast matrix completion recently. However, this method cannot converge accurately on matrices of real-world images. For improving the accuracy of LMaFit method, an improved low-rank matrix fitting (ILMF) method based on the weighted L1,p norm minimization is proposed in this paper, where the L1,p norm is the summation of the p-power (0<p<2) of L1 norms of rows in a matrix. In the proposed method, i.e. the ILMF method, the incomplete matrix that may be corrupted by noises is decomposed into the summation of a low-rank matrix and a noise matrix at first. Then, a weighted L1,p norm minimization problem is solved by using an alternating direction method for improving the accuracy of matrix completion. Experimental results on real-world images show that the ILMF method has much better performances in terms of both the convergence accuracy and convergence speed than the compared methods.
For the sake of reducing the hardware cost of multiple RF chains and the complexity for spatial multiplexing systems, in this paper a novel low-complexity transmit antenna selection criterion is proposed for V-BLAST system with MMSE OSIC detection. Based on MMSE extension of V-BLAST with sorted QR decomposition, an intuitional performance analysis of each sub-stream for MMSE V-BLAST transmission is done, and a new solution to the transmit antenna selection problem is suggested. Unlike most of the existing works, the proposed algorithm synthetically considers the impacts of detection order, interference cancellation, and noise amplification. Theoretical analysis and simulation results show that the proposed algorithm achieves well both in outage capacity and in symbol error rate performance with low computational complexity.
In this paper, we propose an efficient symbol detection algorithm for multiple-input multiple-output spatial multiplexing (MIMO-SM) systems and present its design and implementation results. By enhancing the error performance of the first detected symbol that causes error propagation, the proposed algorithm achieves a considerable performance gain compared with the conventional sorted QR decomposition (SQRD) based detection and the ordered successive detection (OSD) algorithms. The bit error rate (BER) performance of the proposed detection algorithm is evaluated by a simulation. In the case of a 16QAM MIMO-SM system with 4 transmit and 4 receive (4 × 4) antennas, at BER = 10-3 the proposed algorithm results in a gain improvement of about 2.5–13.5 dB over the previous algorithms. The proposed detection algorithm was designed in a hardware description language (HDL) and synthesized to gate-level circuits using 0.18 μm 1.8 V CMOS standard cell library. The results show that the proposed algorithm can be implemented without increasing the hardware costs significantly.
This paper presents a modified implementation of QR decomposition for multiple input multiple output-orthogonal frequency division multiplexing (MIMO-OFDM) detection based on the Givens rotation method. The QR decomposition hardware is constructed using the coordinate rotation digital computer (CORDIC) algorithm operating with fewer gate counts and lower power consumption than do triangular systolic array (TSA) structures. Accurate signal transmission is essential to wireless communication systems. Thus, a more effective data detection algorithm and precise channel estimation method play vital roles in MIMO systems. Implementing data detection with QR decomposition helps reduce the complexity of MIMO-OFDM detection. Implementation results reveal that the proposed recursive QR decomposition (RQRD) architecture has lower clock latency than do TSA structures, and has a smaller hardware area than do Gram–Schmidt structures.
QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in several signal processing applications. Its implementation has a considerable and an important impact on the system performance. As processor architectures continue to gain ground in the high-performance computing world, QRD algorithms have to be redesigned in order to take advantage of the architectural features on these new processors. However, in some processor architectures like very large instruction word (VLIW), compiler efficiency is not enough to make an effective use of available computational resources. This paper presents an efficient and optimized approach to implement Givens QRD in a low-power platform based on VLIW architecture. To overcome the compiler efficiency limits to parallelize the most of Givens arithmetic operations, we propose a low-level instruction scheme that could maximize the parallelism rate and minimize clock cycles. The key contributions of this work are as follows: (i) New parallel and fast version design of Givens algorithm based on the VLIW features (i.e., instruction-level parallelism (ILP) and data-level parallelism (DLP)) including the cache memory properties. (ii) Efficient data management approach to avoid cache misses and memory bank conflicts. Two DSP platforms C6678 and AK2H12 were used as targets for implementation. The introduced parallel QR implementation method achieves, in average, more than 12× and 6× speedups over the standard algorithm version and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed scheme implementation is at least 3.65 and 2.5 times faster than the recent CPU and DSP implementations, respectively.
Space-Time Adaptive Processing (STAP) can harness the efficacy of interference and clutter significantly. Calculations of the STAP weights involve solving linear equations which require very intensive computations. In this paper, the QR decomposition (QRD) using the modified gram-schmidt (MGS) algorithm is parameterized with vector size to create a trade-off between the hardware resources utilization and computation time. To achieve an efficient floating point structure, the proposed architecture of QRD-MGS algorithm is simulated and implemented in two modes: single-vector and multi-vector. Results show that the multi-vector method can lead to a high-performance design with higher operating frequency, lower power consumption, and less resource utilization than the single-vector method. For example, Modelism simulations show that the decomposition of a 12×51 matrix with vector size of 17 takes 7.86μs with the maximum clock frequency of 282MHz, for implementation on the Arria10 FPGA. In real STAP applications, the matrix sizes are too large to be fit on FPGAs and the update rate of the weights are high. Therefore, this method can fit any matrix in the contemporary FPGAs with an acceptable update rate.
Modified Gram–Schmidt (MGS) algorithm is one of the most-known forms of QR decomposition (QRD) algorithms. It has been used in many signal and image processing applications to solve least square problem and linear equations or to invert matrices. However, QRD is well-thought-out as a computationally expensive technique, and its sequential implementation fails to meet the requirements of many real-time applications. In this paper, we suggest a new parallel version of MGS algorithm that uses VLIW (Very Long Instruction Word) resources in an efficient way to get more performance. The presented parallel MGS is based on compact VLIW kernels that have been designed for each algorithm step taking into account architectural and algorithmic constraints. Based on instruction scheduling and software pipelining techniques, the proposed kernels exploit efficiently data, instruction and loop levels parallelism. Additionally, cache memory properties were used efficiently to enhance parallel memory access and to avoid cache misses. The robustness, accuracy and rapidity of the introduced parallel MGS implementation on VLIW enhance significantly the performance of systems under severe rea-time and low power constraints. Experimental results show great improvements over the optimized vendor QRD implementation and the state of art.
The subset selection problem of linear algebra is applied to identify independent patterns of COVID-19 evolution within Brazil. The data consist of a set of mortality curves in states of Brazil. A subset of the most independent curves is selected by using a functional version of the QR matrix decomposition technique with column pivoting. The selected subset is used next as a basis to represent the remaining curves filtering out any data redundancy. For each independent curve, an associated epidemiological region of influence is defined. The results show two main independent curves with a similar two-peak pattern and a 50-day shift between the patterns. Two main epidemiological regions are next identified: one encompassing most of the country from the center and northeast states to the south, an another one containing the Amazonian region at the northwest.