Porting key scientific algorithms to HPC architectures requires a thorough understanding of the subtle balance between performance gains and introduced overhead. Here we continue the development of our recently proposed technique that uses plain execution times to predict the extent of parallel overhead. The focus here is on an analytic solution that takes into account as many data points as there are unknowns, i.e., model parameters. A test set of nine applications frequently used in scientific computing is well described by the suggested model, including atypical cases that were not originally considered during its development. However, the choice of which particular set of explicit data points will lead to an optimal prediction cannot be made a priori.
Integer sorting on multicores and GPUs can be realized by a variety of approaches that include variants of distribution-based methods such as radix-sort, comparison-oriented algorithms such as deterministic regular sampling and random sampling parallel sorting, and network-based algorithms such as Batcher’s bitonic sorting algorithm.
In this work we present an experimental study of integer sorting on multicore processors. We have implemented serial and parallel radix-sort for various radices, deterministic regular oversampling and random oversampling parallel sorting, including new variants of our own, as well as some previously little-explored or unexplored variants of bitonic-sort and odd-even transposition sort. The study uses multithreading and multiprocessing parallel programming libraries, with the same C language code working under Open MPI, MulticoreBSP, and BSPlib. We first provide some general high-level observations on the performance of these implementations. If anything can be concluded, it is that accurate prediction of performance by taking into consideration architecture-dependent features, such as the structure and characteristics of multiple memory hierarchies, is difficult and more often than not untenable. To some degree this is affected by the overhead imposed by the high-level library used in the programming effort.
Another objective is to model the performance of these algorithms and their implementations under the MBSP (Multi-memory BSP) model. Despite the limitations mentioned above, we can still draw some reliable conclusions and reason about the performance of these implementations using the MBSP model, thus making MBSP useful and usable.
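As a point of reference for the distribution-based methods named above, a minimal serial LSD radix-sort over a configurable radix can be sketched as follows (a simplified illustration only, not the study's parallel C implementations):

```python
def radix_sort(keys, radix_bits=8):
    """Serial LSD radix sort for non-negative integers.

    radix_bits controls the radix (2**radix_bits buckets per pass),
    mirroring the 'various radices' explored in the study.
    """
    if not keys:
        return []
    mask = (1 << radix_bits) - 1
    shift = 0
    max_key = max(keys)
    keys = list(keys)
    while (max_key >> shift) > 0:
        buckets = [[] for _ in range(1 << radix_bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)  # stable distribution pass
        keys = [k for bucket in buckets for k in bucket]  # stable collection pass
        shift += radix_bits
    return keys
```

Because each pass is stable, the final collection is fully sorted; the choice of `radix_bits` trades the number of passes against bucket count, which is one of the tuning dimensions the study varies.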
The goal of a task graph partitioning and mapping strategy is to reduce the communication overhead in a parallel application. Much of the past work in this area has been in the context of a static network topology. Here we show that the flexibility provided by a reconfigurable network can help lower the overhead and provide additional performance gains. However, since a reconfigurable network can be set to many different topologies, a new formulation of the mapping problem is required. Our research is based on the Interconnection Cached Network (ICN), a prototype of which is currently under development. The ICN is a reconfigurable network suited to exploiting switching locality in applications. "Switching locality" refers to the phenomenon whereby each task in a parallel application communicates (switches) mostly with a small set of other tasks. As evidenced by the sparse nature of most task graphs, this phenomenon is common to many parallel applications. We describe the ICN architecture, the problem of mapping task graphs onto the ICN, and the performance advantages of complementing clever partitioning strategies with topology reconfiguration.
Mesh-connected parallel architectures have become increasingly popular in the design of multiprocessor systems in recent years. Many partitionable two-dimensional (2D) mesh systems have been built or are currently being developed. To make the best use of these systems, an effective mesh partitioning/submesh allocation strategy is desirable. In this paper, we report on a new best-fit processor allocation strategy for 2D mesh systems. An efficient implementation of this strategy is presented that keeps the search overhead low. Extensive simulations have been performed to compare the performance of this strategy with existing ones. The results show that it outperforms existing strategies in terms of mean response time under all load conditions and different job characteristics.
The data envelopment analysis (DEA) method is a mathematical programming approach to evaluating the relative performance of portfolios. Since the risk inputs of existing DEA performance evaluation indices cannot reflect the pervasive fat tails and asymmetry in the return distributions of mutual funds, we introduce the risk measures CVaR and VaR as inputs to the relevant DEA indices, in order to measure the relative performance of portfolios more objectively. To fairly evaluate the performance variation of the same fund over different time periods, we treat each period as a separate decision-making unit (DMU). Unlike existing DEA applications, which mainly investigate American mutual fund performance at the whole-market or industry level, we analyze in detail the effect of different input/output indicator combinations on the performance of individual funds. Our empirical results show that VaR and CVaR, especially in combination with traditional risk measures, are very helpful for comprehensively describing return distribution properties such as skewness and leptokurtosis, and can thus better evaluate the overall performance of mutual funds.
We propose a performance evaluation method for a multi-task system subject to a software reliability growth process. The software fault-detection phenomenon in the dynamic environment is described by a Markovian software reliability model with imperfect debugging. We assume that the cumulative number of tasks arriving at the system follows a homogeneous Poisson process. We can then formulate, via an infinite-server queueing model, the distribution of the number of tasks whose processing can be completed within a prespecified time limit. From the model, several quantities for software performance measurement that account for the real-time property can be derived as functions of time and the number of debuggings. Finally, we present several numerical examples of these quantities to analyze the relationship between software reliability characteristics and system performance.
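A simple illustration of the queueing argument: with homogeneous Poisson arrivals at rate `lam` and a probability `p` that an individual task finishes within the time limit, Poisson thinning gives the number of on-time completions in an interval of length `t` as Poisson-distributed with mean `lam * t * p`. Here `lam`, `t`, and `p` are hypothetical stand-ins for the model's reliability-dependent quantities:

```python
import math

def pmf_on_time_completions(k, lam, t, p):
    """P{exactly k tasks complete within the limit during [0, t]} when
    arrivals are Poisson(lam) and each task independently finishes on
    time with probability p: a thinned Poisson with mean lam * t * p."""
    mean = lam * t * p
    return math.exp(-mean) * mean ** k / math.factorial(k)
```

In the paper's setting, `p` itself would depend on the number of debuggings through the Markovian reliability model, which is what links reliability growth to the performance measures.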
The Indian banking industry is going through turbulent times. With the lowering of entry barriers and the blurring of product lines between banks and non-banks since the financial-sector reforms, banks increasingly operate under competitive pressure. Hence, it is imperative that banks maintain a loyal customer base. To achieve this and improve their market positions, many retail banks are directing their strategies towards increasing customer satisfaction and loyalty through improved service quality. Moreover, with the advent of international banking and innovations in the marketplace, customers have increasing difficulty in selecting one institution over another.
Hence, to gain and sustain competitive advantage in the fast-changing retail banking industry in India, it is crucial for banks to understand in depth what customers perceive to be the key dimensions of service quality and to evaluate banks on these dimensions. If the service quality dimensions can be identified, service managers should be able to improve the delivery of customer-perceived quality during the service process and have greater control over the overall outcome.
The study suggests that customers distinguish four dimensions of service quality in the Indian retail banking industry, namely customer-orientedness, competence, tangibles, and convenience. A methodological innovation of this study is the use of TOPSIS in the field of customer-perceived service quality: TOPSIS is used to evaluate and rank the relative performance of the banks across the service quality dimensions. Identifying the underlying dimensions of the service quality construct and evaluating the performance of the banks across these factors is the first step in the definition, and hence provision, of quality service in the Indian retail banking industry.
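The TOPSIS ranking step used here can be sketched compactly. The sketch below is the generic method with illustrative equal weights, treating all service-quality dimensions as benefit criteria (to be maximized):

```python
import math

def topsis_scores(matrix, weights):
    """Score alternatives (rows) over benefit criteria (columns) by
    relative closeness to the ideal solution; higher is better."""
    n_crit = len(matrix[0])
    # Vector-normalize each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    ideal = [max(col) for col in zip(*v)]   # best value per criterion
    anti = [min(col) for col in zip(*v)]    # worst value per criterion
    scores = []
    for row in v:
        d_pos = math.dist(row, ideal)       # distance to ideal solution
        d_neg = math.dist(row, anti)        # distance to anti-ideal solution
        scores.append(d_neg / (d_pos + d_neg))
    return scores
```

Ranking the banks then amounts to sorting them by score; cost criteria, if present, would be handled by inverting the ideal/anti-ideal selection for those columns.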
In real-life applications, there generally exist Decision Makers (DMs) who have preferences over outputs and inputs. Choosing appropriate weights for different criteria often arises as a problem for DMs. The Best-Worst Method (BWM) in Multiple Criteria Decision-Making (MCDM) relies on very few pairwise comparisons and only requires DMs to identify the most desirable and least desirable criteria. Unlike MCDM, Data Envelopment Analysis (DEA) does not generally assume a priority of one output (input) over other outputs (inputs). The link between DEA and MCDM can be made by considering Decision-Making Units (DMUs) as alternatives, outputs as criteria to be maximized, and inputs as criteria to be minimized. In this study, we propose a linear programming model that embeds DEA and BWM appropriately. We first propose a modified BWM linear programming model that satisfies all conditions a DM may assume. We then illustrate how a conventional DEA model can be extended to include the BWM conditions. With our approach, the optimal weights of the MCDM criteria are obtained, and at the same time, the relative efficiency scores of the DMUs corresponding to those criteria are calculated. We provide the foundation for measuring efficiency scores when the most desirable and least desirable inputs and outputs are known. To illustrate the proposed approach, a numerical example (including 17 DMUs with seven inputs and outputs) is also discussed.
Based on an exergo-economic analysis of heat transfer and flow processes in low-temperature heat exchangers, a new exergo-economic criterion, defined as the net profit per unit heat flux, is put forward for cryogenic exergy-recovery low-temperature heat exchangers. The application of the criterion is illustrated by evaluating the performance of down-flow, counter-flow, and cross-flow low-temperature heat exchangers.
In this paper, the influence of the stamping effect is investigated in the performance analysis of a side structure. The analysis covers performance measures such as crashworthiness and NVH. Stamping analyses are carried out for a center pillar, and numerical simulations are then performed to identify the effect of stamping on crashworthiness and natural frequency. The results show that an analysis incorporating the forming history differs from one that neglects the stamping effect, demonstrating that auto-body design should take the stamping history into account for an accurate assessment of the various performance measures.
Link prediction attracts a large number of researchers due to its extensive applications in social and economic fields, and many algorithms have been proposed in recent years. These algorithms show good performance on their own, specially selected networks; on other networks, however, they do not necessarily generalize well. Moreover, apart from AUC and precision, there are no established measures for evaluating the performance of a new algorithm. This raises two questions: can these measures really reflect the performance of an algorithm, and which attributes of a network most strongly influence the prediction quality? In this paper, we analyze 21 real networks by multivariate statistical analysis. On the one hand, we find that the heterogeneity of a network plays a significant role in the result of link prediction; on the other hand, the selection of networks is essential when verifying the performance of a new algorithm. In addition, a nonlinear regression model is produced by analyzing the relationship between network properties and similarity methods, and 16 similarity methods are compared by means of AUC. The results show that designing an evaluation mechanism for classification is of great significance for assessing the performance of a new algorithm.
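The AUC measure discussed above is conventionally computed for link prediction by comparing the scores assigned to probe (missing) links against those assigned to nonexistent links; a minimal sketch of this comparison-based estimator:

```python
def link_prediction_auc(missing_scores, nonexistent_scores):
    """AUC for link prediction: the probability that a randomly chosen
    missing link scores higher than a randomly chosen nonexistent link,
    with ties counted as 0.5 (the standard comparison-based estimator).
    An uninformative predictor scores about 0.5; a perfect one scores 1.
    """
    wins = 0.0
    for m in missing_scores:
        for n in nonexistent_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(missing_scores) * len(nonexistent_scores))
```

In practice the double loop is usually replaced by random sampling of comparison pairs for large networks, which yields the same estimator up to sampling error.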
Psychophysical research on the human visual system has shown that points of high curvature on the contour of an object play an important role in the recognition process. Inspired by these studies, we propose: (i) a novel algorithm to select points of high curvature on the contour of an object, which can be used to construct a recognizable polygonal approximation; (ii) a test that evaluates the effect of deleting contour segments containing such points on the performance of contour-based object recognition algorithms. We use complete contour representations of objects as a reference (training) set, and incomplete contour representations of the same objects as a test set. The performance of an algorithm is reported as the recognition rate as a function of the percentage of contour retained. We consider two types of contour incompleteness, obtained by deleting contour segments of high or low curvature. We illustrate the test procedure using two shape recognition algorithms that deploy a shape context and a distance multiset as local shape descriptors. Both algorithms qualitatively mimic human visual perception in that the deletion of segments of high curvature degrades performance more strongly than the deletion of other parts of the contour. This effect is more pronounced for the shape context method.
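As a simplified illustration of point (i), discrete curvature at a contour point can be estimated from the turn angle between the adjacent contour segments, retaining points whose turn exceeds a threshold. This is a generic sketch, not the paper's actual selection algorithm, and the threshold value is arbitrary:

```python
import math

def high_curvature_points(contour, angle_threshold=0.5):
    """Return the points of a closed polygonal contour whose absolute
    turn angle (radians, in [0, pi]) exceeds angle_threshold."""
    selected = []
    n = len(contour)
    for i in range(n):
        (x0, y0), (x1, y1), (x2, y2) = contour[i - 1], contour[i], contour[(i + 1) % n]
        a_in = math.atan2(y1 - y0, x1 - x0)    # incoming segment direction
        a_out = math.atan2(y2 - y1, x2 - x1)   # outgoing segment direction
        # Wrap the direction change into [0, pi] to get the turn magnitude.
        turn = abs(math.atan2(math.sin(a_out - a_in), math.cos(a_out - a_in)))
        if turn > angle_threshold:
            selected.append(contour[i])
    return selected
```

Points that survive this filter are natural vertices for a polygonal approximation: collinear runs (zero turn) are dropped, while corners are kept.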
For decades, published works on Automatic Signature Verification (ASV) that use threshold-based decisions have depended on a single feature set for verification. Some researchers selected this feature set based on experience; others selected it using feature-selection algorithms that pick the best-performing set. In practical systems, the signature data can be noisy, and recognition of the check writer in multi-signatory accounts is required. Given the errors caused by such requirements and data quality, improving performance becomes a necessity. In this paper, a new technique for ASV decision making that uses Multi-Sets of Features (MSF) is introduced. The new technique and its motivation are explained, and a precise evaluation of its efficiency is made. The experimental results show that the new technique yields an important improvement in forgery detection and in overall performance. The technique, developed within an integrated plan for building a commercial offline ASV system for the actual US banking environment, was tested during the prototyping period with about 1000 signature samples, and has been in use for years as a component of a cooperative decision-making ASV system that tests over a million check signatures every day without any false acceptance (False Acceptance = 0).
Writer recognition identifies a person on the basis of handwriting, and great progress has been achieved in the past decades. In this paper, we concentrate on off-line, text-independent writer recognition, summarizing state-of-the-art methods from the perspectives of feature extraction and classification. We also present some public datasets and compare the performance of existing prominent methods. The comparison demonstrates that the performance of methods based on frequency-domain features degrades severely as the number of writers grows, and that spatial-distribution features are superior to both frequency-domain features and shape features in capturing individual traits.
Establishing a performance evaluation model is a hotspot of current research. In this paper, the performance bottleneck is analyzed quantitatively, providing programmers with guidance for optimizing it. The paper takes matrix computation as an example, distinguishing dense from sparse matrices. For dense matrices, the performance is first analyzed quantitatively and an evaluation model is developed that covers the instruction pipeline, shared memory, and global memory. For sparse matrices, the paper considers the four storage formats CSR, ELL, COO, and HYB; from observations obtained by running large datasets, it finds the relationship between running time, dataset form, and storage model, and establishes their relational model functions. Practical tests and comparison show that the error between the execution time predicted by the model function and the actual running time stays within a stable, finite deviation threshold, demonstrating that the model has practical utility.
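For reference, the CSR format named above stores a sparse matrix as three arrays: the nonzero values, their column indices, and per-row offsets into those arrays. A minimal conversion and sparse matrix-vector product (SpMV, the kernel whose running time such models typically predict) can be sketched as:

```python
def dense_to_csr(dense):
    """Convert a dense row-major matrix into CSR arrays
    (values, column indices, row pointers)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_idx.append(j)
        row_ptr.append(len(values))   # row r occupies values[row_ptr[r]:row_ptr[r+1]]
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a CSR matrix A."""
    y = []
    for r in range(len(row_ptr) - 1):
        y.append(sum(values[k] * x[col_idx[k]]
                     for k in range(row_ptr[r], row_ptr[r + 1])))
    return y
```

ELL and COO differ only in layout (fixed-width padded rows versus explicit (row, column, value) triples), which is why running time depends jointly on the dataset's nonzero structure and the chosen storage model.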
Network-on-Chip (NoC) is a strong candidate for the scalable interconnect design of Multi-Processor System-on-Chip (MPSoC) platforms. Software tasks of an MPSoC require a certain protocol to communicate with each other. In a NoC, such a communication protocol should be handled at the Network Interface and/or Processing Element level, and different protocols are expected to exhibit different trade-offs. Accordingly, we employ two types of basic protocol and investigate their performance impact. The contribution of this work is a quantitative evaluation of the effectiveness of using separate communication protocols depending on the task structure.
Three-dimensional Network-on-Chip (3D NoC) architectures have gained a lot of popularity for solving the on-chip communication delays of next-generation System-on-Chip (SoC) systems. However, the vertical interconnects of a 3D NoC are expensive and complex to manufacture, and a 3D router consumes more power and occupies more chip area than a 2D router. Hence, more efficient architectures should be designed. In this paper, we propose area-efficient and low-power 3D heterogeneous NoC architectures, which combine the power and performance benefits of 2D routers and 3D NoC-bus hybrid router architectures. Experimental results show a negligible penalty (less than 5%) in the average packet latency of the proposed heterogeneous 3D NoC architectures compared to typical homogeneous 3D NoCs, while the heterogeneity provides power and area savings of up to 61% and 19.7%, respectively.
This paper proposes a simultaneous multithreaded matrix processor (SMMP) to improve the performance of data-parallel applications by exploiting instruction-level parallelism (ILP), data-level parallelism (DLP), and thread-level parallelism (TLP). In SMMP, the well-known five-stage pipeline (the baseline scalar processor) is extended to execute multi-scalar/vector/matrix instructions on unified parallel execution datapaths. SMMP can issue four scalar instructions from two threads each cycle, or four vector/matrix operations from one thread, where the vector/matrix instructions of the threads are executed in round-robin fashion. Moreover, this paper presents an implementation of SMMP in VHDL targeting the FPGA Virtex-6, and evaluates its performance on kernels from the basic linear algebra subprograms (BLAS). Our results show that the hardware complexity of SMMP is 5.68 times that of the baseline scalar processor. However, speedups of 4.9, 6.09, 6.98, 8.2, 8.25, 8.72, 9.36, 11.84, and 21.57 are achieved on the BLAS kernels of applying a Givens rotation, scalar times a vector plus another, vector addition, vector scaling, setting up a Givens rotation, dot product, matrix–vector multiplication, Euclidean length, and matrix–matrix multiplication, respectively. The average speedup over the baseline is 9.55, and the average speedup over complexity is 1.68. Compared with the Xilinx MicroBlaze, the complexity of SMMP is 6.36 times higher; however, its speedup ranges from 6.87 to 12.07 on vector/matrix kernels, with an average of 9.46.
Nowadays, it is desirable to generate traffic that reflects realistic network environments for the performance evaluation of network equipment. Existing traffic generation solutions mainly include special-purpose test equipment, software traffic generators, and field-programmable gate array (FPGA)-based traffic generators. However, special-purpose test equipment is generally expensive, software traffic generators cannot achieve high data rates, and FPGA-based traffic generators mostly lack flexibility. This paper presents a novel traffic generation solution, based on an aggregated process-based model, that overcomes the weaknesses of the above methods. The traffic generator can generate real-time Poisson, two-state Markov-modulated Poisson process (MMPP-2), and self-similar traffic in hardware. The main structure of the traffic generator is presented, and the statistical properties of the generated traffic are evaluated. Experimental results indicate that the proposed solution achieves higher fidelity of the generated traffic than existing FPGA-based traffic generators, while the required data rates can reach Gbps line rate.
Mean Value Analysis (MVA) has long been a standard approach for the performance analysis of computer systems. While the exact load-dependent MVA algorithm is an efficient technique for computer system performance modeling, it fails to address multi-core computer systems with Dynamic Frequency Scaling (DFS). In addition, the load-dependent MVA algorithm suffers from numerical difficulties under heavy load. The goal of our paper is to find an efficient and robust method that is easy to use in practice and accurate for performance prediction on multi-core platforms. The proposed method, called Approximate Performance Evaluation for Multi-core computers (APEM), uses a flow-equivalent performance model designed specifically for multi-core computer systems and identifies the effect of DFS on CPU demand. We adopt an approximation technique to estimate resource demands in order to parametrize the MVA algorithms. To validate our method, we investigate three case studies with extended TPC-W benchmark kits, showing that it achieves better accuracy than other commonly used MVA algorithms. We compare the three different performance models, and we also extend our approach to multi-class models.