We present several applications of non-linear data modeling using principal manifolds and principal graphs constructed with the metaphor of elasticity (the elastic principal graph approach). These approaches are generalizations of Kohonen's self-organizing maps, a class of artificial neural networks. Using several examples, we show the advantages of non-linear objects for data approximation compared with linear ones. We propose four numerical criteria for comparing linear and non-linear mappings of datasets into lower-dimensional spaces. The examples are taken from comparative political science, from the analysis of high-throughput data in molecular biology, and from the analysis of dynamical systems.
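A minimal sketch of the kind of linear-versus-nonlinear comparison described here, not the authors' elastic-graph code: one common numerical criterion, neighbourhood trustworthiness, is computed for a linear (PCA) and a nonlinear (Isomap, used here only as a stand-in) 2-D mapping of the same data.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, trustworthiness

# Synthetic curved data set; any high-dimensional data could be used instead.
X, _ = make_s_curve(n_samples=1000, noise=0.05, random_state=0)

Z_lin = PCA(n_components=2).fit_transform(X)        # linear mapping
Z_nonlin = Isomap(n_components=2).fit_transform(X)  # nonlinear mapping

for name, Z in [("PCA", Z_lin), ("Isomap", Z_nonlin)]:
    # Trustworthiness measures how well local neighbourhoods are preserved.
    print(f"{name:7s} trustworthiness = {trustworthiness(X, Z, n_neighbors=10):.3f}")
```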
Using local data information, the recently proposed Local Fisher Discriminant Analysis (LFDA) algorithm [18] provides a new way of handling multimodal structure within classes, where the conventional Fisher Discriminant Analysis (FDA) algorithm fails. Like its global counterpart, the FDA algorithm, LFDA suffers when applied to higher-dimensional data sets. In this paper, we propose a new formulation from which a robust algorithm can be formed. In most cases, the new algorithm offers more robust results on higher-dimensional data sets than LFDA. Through extensive simulation studies, we demonstrate the practical usefulness and robustness of our new algorithm in data visualization.
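A hedged sketch of the multimodal issue FDA versus LFDA is about, not the paper's new algorithm: class 0 has two well-separated modes, which FDA tends to collapse while LFDA keeps them apart. It assumes the `metric-learn` package's `LFDA` fit/transform interface; the cluster layout and parameters are illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from metric_learn import LFDA  # assumed third-party LFDA implementation

rng = np.random.default_rng(0)
# Class 0 is bimodal (two separated clusters); class 1 sits between the modes.
X0 = np.vstack([rng.normal([-5, 0], 0.5, (100, 2)),
                rng.normal([5, 0], 0.5, (100, 2))])
X1 = rng.normal([0, 3], 0.5, (200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

z_fda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
z_lfda = LFDA(n_components=1, k=7).fit(X, y).transform(X)
# Inspecting class overlap in each 1-D embedding shows LFDA preserving
# the two modes of class 0 instead of merging them onto class 1.
```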
We study the problem of visualizing clusters in an educational data set using a convex-hull shape-preservation algorithm. This problem considers multidimensional data with pre-established classes, with the requirement that elements of different classes be presented in distinct regions. Such problems are common in economic and social data, where visualization is important for understanding a phenomenon before further analysis. In this paper, we propose an algorithm that uses a nonlinear transformation to preserve certain distance properties of the data and display it in a format convenient for interpretation. The proposed visualization algorithm is a partition-conforming projection, as defined by Kleinberg [An impossibility theorem for clustering, Adv. Neural Inform. Processing Syst. 15: Proc. 2002 Conf., 2003, The MIT Press, p. 463.], and completely separates the convex hulls of the data classes by applying locally linear operations. We applied this algorithm to visualize data from the Exame Nacional do Ensino Médio (ENEM), an important exam taken by over four million students in the Brazilian educational system. Results show that the proposed algorithm successfully separates otherwise unintelligible data and presents it in a form more accessible to further visual analysis.
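A minimal sketch of the separation property the abstract refers to, not the proposed projection itself: given a 2-D embedding and class labels, check whether any point falls inside another class's convex hull (the hull test is done via the class's Delaunay triangulation). The function name and test data are illustrative.

```python
import numpy as np
from scipy.spatial import Delaunay

def hulls_are_separated(Z, labels):
    """Z: (n, 2) projected points; labels: class label per point."""
    for c in np.unique(labels):
        tri = Delaunay(Z[labels == c])       # triangulation covering class c's hull
        others = Z[labels != c]
        if np.any(tri.find_simplex(others) >= 0):  # some point lies inside the hull
            return False
    return True

rng = np.random.default_rng(1)
Z = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([6, 6], 1, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(hulls_are_separated(Z, labels))
```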
Boundary extraction is a fundamental post-clustering problem. It facilitates the interpretability and usability of clustering results, and it also provides visualization and dataset reduction. However, it has not attracted much attention compared to the clustering problem itself. In this work, we address the boundary extraction of clusters in 2- and 3-dimensional spatial datasets. We propose two algorithms based on Delaunay Triangulation (DT). Numerical experiments show that the proposed algorithms generate cluster boundaries effectively and also yield significant dataset reduction.
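A hedged sketch of DT-based boundary extraction for a 2-D cluster (an alpha-shape-style filter, not necessarily either of the paper's two algorithms): triangles with overly long edges are discarded, and edges used by exactly one remaining triangle trace the boundary. The function name and the `max_edge` threshold are illustrative.

```python
import numpy as np
from collections import Counter
from scipy.spatial import Delaunay

def cluster_boundary(points, max_edge):
    tri = Delaunay(points)
    edge_count = Counter()
    for simplex in tri.simplices:
        pts = points[simplex]
        # Keep the triangle only if all three edges are shorter than max_edge.
        if max(np.linalg.norm(pts[i] - pts[(i + 1) % 3]) for i in range(3)) <= max_edge:
            for i in range(3):
                edge = tuple(sorted((simplex[i], simplex[(i + 1) % 3])))
                edge_count[edge] += 1
    # Boundary edges belong to exactly one kept triangle.
    return [e for e, c in edge_count.items() if c == 1]

pts = np.random.default_rng(0).normal(size=(300, 2))
print(len(cluster_boundary(pts, max_edge=0.6)), "boundary edges")
```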
In recent years, with the rapid development of information technology, the volume of image data has grown exponentially. However, these datasets typically contain a large amount of redundant information. To extract effective features and reduce redundancy in images, a representation learning method based on the Vision Transformer (ViT) has been proposed; to the best of our knowledge, this is the first application of the Transformer to zero-shot learning (ZSL). The method adopts a symmetric encoder–decoder structure, where the encoder incorporates the Multi-Head Self-Attention (MSA) mechanism of the ViT to reduce the dimensionality of image features, eliminate redundant information, and decrease the computational burden. In this way it effectively extracts features, while the decoder reconstructs the image data. We evaluated the representation learning capability of the proposed method on various tasks, including data visualization, image reconstruction, face recognition, and ZSL. Comparisons with state-of-the-art representation learning methods validate the effectiveness of this method in the field of representation learning.
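A schematic sketch of a symmetric attention-based autoencoder of the kind described, not the paper's model: layer sizes, patch configuration, and hyper-parameters are illustrative assumptions. The encoder stacks multi-head self-attention blocks and compresses each patch embedding to a bottleneck; the decoder mirrors it to reconstruct the patches.

```python
import torch
import torch.nn as nn

class AttentionAutoencoder(nn.Module):
    def __init__(self, patch_dim=64, n_patches=49, d_model=128, bottleneck=32,
                 nhead=4, depth=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)               # patch embedding
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))
        enc = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=256,
                                         batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, depth)          # MSA blocks
        self.to_code = nn.Linear(d_model, bottleneck)              # dimension reduction
        self.from_code = nn.Linear(bottleneck, d_model)
        dec = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=256,
                                         batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, depth)           # symmetric decoder
        self.to_patch = nn.Linear(d_model, patch_dim)               # reconstruction head

    def forward(self, patches):            # patches: (batch, n_patches, patch_dim)
        z = self.to_code(self.encoder(self.embed(patches) + self.pos))
        return self.to_patch(self.decoder(self.from_code(z)))

x = torch.randn(8, 49, 64)                 # e.g. 8 images split into 7x7 patches
model = AttentionAutoencoder()
loss = nn.functional.mse_loss(model(x), x) # reconstruction objective
```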
Advances in technology have produced ever more intricate industrial systems, such as nuclear power plants, chemical centers, and petroleum platforms. Such complex plants exhibit multiple interactions among smaller units and human operators, giving rise to potentially disastrous failures that can propagate across subsystem boundaries. This paper analyzes industrial accident data series from the perspective of statistical physics and dynamical systems. Global data are collected from the Emergency Events Database (EM-DAT) covering the period from 1903 to 2012. The statistical distributions of the number of fatalities caused by industrial accidents reveal Power Law (PL) behavior. We analyze the evolution of the PL parameters over time and observe a remarkable increase in the PL exponent in recent years. PL behavior allows prediction by extrapolation over a wide range of scales. In a complementary line of thought, we compare the data using appropriate indices and apply different visualization techniques to correlate industrial accident events and extract relationships among them. This study contributes to a better understanding of the complexity of modern industrial accidents and their governing principles.
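A minimal sketch of the power-law tail fit this kind of analysis rests on: the maximum-likelihood (Hill) estimate of the exponent for values above a chosen threshold. The fatality counts below are synthetic, not EM-DAT data, and the threshold is illustrative.

```python
import numpy as np

def powerlaw_exponent(x, x_min):
    """MLE of alpha for a tail with density ~ x**(-alpha), x >= x_min."""
    tail = x[x >= x_min]
    return 1.0 + len(tail) / np.sum(np.log(tail / x_min))

rng = np.random.default_rng(0)
fatalities = (rng.pareto(a=1.5, size=5000) + 1.0) * 10.0  # synthetic counts
print(powerlaw_exponent(fatalities, x_min=10.0))           # close to 2.5 here
```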
Comparing frequency distributions of experimental data is a routine engineering task in the semiconductor industry. The existing statistical approaches to the problem suffer from several limitations, which can be partially overcome via the time-consuming visual examination of frequency histograms by an experienced process engineer. This paper presents a novel, fuzzy-based method for automating the cognitive process of comparing frequency histograms. We use the evolving approach of type-2 fuzzy logic to utilize the domain knowledge of human experts. The proposed method is evaluated on the actual results of an engineering experiment, where it is shown to represent the experts' perception of the visualized data more accurately than a wide range of statistical tests. We also outline the potential directions for integrating the perception-based approach with other methods of data visualization and data mining.
Understanding trends helps identify future behaviours in a field and the roles of people, places, and institutions in setting those trends. Although traditional clustering strategies can group articles into topics, these techniques do not focus on topics over limited timescales; additionally, even when articles are grouped, the generated results are extensive and difficult to navigate. To address these concerns, we create an interactive dashboard that helps an expert in the field better understand and quantify trends in their area of research. Trend detection is performed using the time-biased document clustering method. The developed and freely available web application enables users to detect well-defined trending topics by experimenting with various levels of temporal bias, from detecting short-timescale trends to allowing those trends to spread over longer times. Experts can readily drill down into the identified topics to understand their meaning through keywords, example articles, and time ranges. Overall, the interactive dashboard allows experts in the field to sift through the vast literature to identify the concepts, people, places, and institutions most critical to the field.
Adherence to the planned sequence is critical in Just-In-Time Lean Manufacturing (JITLM) environments such as that of the high-volume automotive industry. The production efficiencies that JITLM provides are maximized when the delivery of components and assemblies to production is matched by their assembly into a finished product in the planned order. Deviation from the planned, balanced production sequence reduces manufacturing efficiency and may impact product quality where the product mix places a sustained or high demand on certain production resources.
This paper addresses how to process and display manufacturing sequence data for high-volume variant manufacture in such a way that an accurate impression of the production status can be readily communicated to manufacturing management, supervision, and operations personnel.
The authors review current practice in the field of sequencing for high-volume manufacture and the available information presentation technologies. The realization and application of a novel prototype visualization, developed within a leading international automotive company, is described. The system incorporates both graphical and numeric representations of the manufacturing data. The representations clearly illustrate historical performance, indicate trends in performance and, in the case of sequence adherence, provide a prediction of process performance. These outputs are designed to be used locally in the plant and in the context of the global enterprise supply chain.
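A hedged toy sketch, not the company's prototype: one simple way to quantify sequence adherence is the mean absolute deviation between each unit's planned and actual build position, which can then be charted over shifts or days. The function name and example identifiers are illustrative.

```python
import numpy as np

def sequence_adherence(planned_ids, actual_ids):
    """Both arguments list the same unit identifiers, in planned vs actual build order."""
    planned_pos = {uid: i for i, uid in enumerate(planned_ids)}
    deviations = [abs(planned_pos[uid] - i) for i, uid in enumerate(actual_ids)]
    return float(np.mean(deviations))   # 0 means perfect adherence to plan

planned = ["A", "B", "C", "D", "E", "F"]
actual = ["A", "C", "B", "D", "F", "E"]            # two local swaps
print(sequence_adherence(planned, actual))          # ~0.67 positions off on average
```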
Most dimensionality reduction methods depend significantly on the distance measure used to compute distances between examples. Therefore, a good distance metric is essential to many dimensionality reduction algorithms. In this paper, we present a new dimensionality reduction method for data visualization, called Distance-ratio Preserving Embedding (DrPE), which preserves the ratios between pairwise distances. This is achieved by minimizing the mismatch between the distance ratios derived from the input and output spaces. The proposed method can preserve the relational structure among points of the input space. Extensive visualization experiments, in comparison with existing dimensionality reduction algorithms, demonstrate the effectiveness of the proposed method.
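A hedged, simplified sketch of the distance-ratio idea, not the paper's exact DrPE objective: preserving every ratio of pairwise distances is equivalent to matching log-distances up to a single additive constant, and that surrogate is optimized directly here. The function name, optimizer choice, and data are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

def ratio_preserving_embedding(X, n_components=2, seed=0):
    n = X.shape[0]
    log_din = np.log(pdist(X) + 1e-12)          # log pairwise distances in input space

    def loss(z_flat):
        Z = z_flat.reshape(n, n_components)
        diff = np.log(pdist(Z) + 1e-12) - log_din
        diff -= diff.mean()                      # optimal shift: only ratios matter
        return np.sum(diff ** 2)

    z0 = np.random.default_rng(seed).normal(size=n * n_components)
    return minimize(loss, z0, method="L-BFGS-B").x.reshape(n, n_components)

X = np.random.default_rng(1).normal(size=(40, 5))
Z = ratio_preserving_embedding(X)                # 2-D layout preserving distance ratios
```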
In most real-world gene expression data sets, there are often multiple ordinal sample classes, categorized into normal or diseased types. Traditional feature or attribute selection methods treat the multiple classes equally without paying attention to up/down regulation across the normal and diseased types, while dedicated gene selection methods consider differential expression across the normal and diseased types but ignore the existence of multiple classes. In this paper, to improve biomarker discovery, we propose to make the best use of both aspects: differential expression (which can be viewed as domain knowledge of gene expression data) and multiple classes (which can be viewed as a data set characteristic). We take these two aspects into account simultaneously by employing a rank-1 generalized matrix approximation (GMA). Our results show that GMA can not only improve the accuracy of classifying the samples, but also provide a visualization method for effectively analyzing the gene expression data over both genes and samples. Based on the mechanism of matrix approximation, we further propose an algorithm, CBiomarker, to discover compact biomarkers by reducing redundancy.
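A minimal sketch of a rank-1 matrix approximation via the leading SVD pair, standing in for the paper's GMA (which is not reproduced here): the leading left and right singular vectors give one score per gene and per sample, usable for ranking candidate genes and plotting samples. The expression matrix is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.normal(size=(500, 40))              # 500 genes x 40 samples (synthetic)

U, s, Vt = np.linalg.svd(expr, full_matrices=False)
gene_scores = U[:, 0] * s[0]                    # one score per gene
sample_scores = Vt[0]                           # one score per sample
rank1_approx = s[0] * np.outer(U[:, 0], Vt[0])  # best rank-1 approximation of expr

top_genes = np.argsort(-np.abs(gene_scores))[:20]   # candidate genes by |score|
```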
Most methods for the structural comparison of proteins use molecular coordinates in three-dimensional physical space. Recently, a group presented an elegant novel approach based on characterizing protein shape in terms of backbone torsion angles. They demonstrated considerable success in direct comparisons with other techniques, and their method lends itself to rapid screening of structural information from rapidly growing databases. We think that the torsion angle approach can be further strengthened by refining the distance notion that forms the basis of the computational scheme. In particular, we suggest computing the distance along the path that minimizes the transition cost between aligned pairs of angles, which likely provides a more meaningful representation of distances between points in Ramachandran space.
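A minimal sketch of the refined distance notion in its simplest form: measure the separation between aligned (phi, psi) pairs along the shortest angular path, i.e. with wrap-around at ±180°, rather than as a naive difference. The function name, aggregation, and example angles are illustrative.

```python
import numpy as np

def ramachandran_distance(angles_a, angles_b):
    """angles_*: (n, 2) arrays of (phi, psi) in degrees for aligned residues."""
    diff = np.abs(np.asarray(angles_a) - np.asarray(angles_b))
    diff = np.minimum(diff, 360.0 - diff)       # shortest path around the torus
    return float(np.sqrt((diff ** 2).sum(axis=1)).sum())

a = np.array([[-60.0, -45.0], [170.0, 160.0]])
b = np.array([[-70.0, -30.0], [-175.0, -170.0]])  # close to a once wrap-around is used
print(ramachandran_distance(a, b))
```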
Enzymes catalyze diverse biochemical reactions and are the building blocks of cellular and metabolic pathways. Data and metadata on enzymes are distributed across databases and archived in various formats. The enzyme databases provide utilities for efficient searches and for downloading enzyme records in batch mode, but do not support organism-specific extraction of subsets of data. Users are required to write scripts for parsing entries for customized data extraction prior to downstream analysis. Integrated Customized Extraction of Enzyme Data (iCEED) has been developed to provide organism-specific customized data extraction utilities for seven commonly used enzyme databases and brings these resources under an integrated portal. iCEED provides dropdown menus and search boxes with a typeahead utility for submitting queries, as well as an enzyme class-based browsing utility. A utility to facilitate the mapping and visualization of functionally important features on the three-dimensional (3D) structures of enzymes is integrated. The customized data extraction utilities provided in iCEED are expected to be useful for biochemists, biotechnologists, computational biologists, and life science researchers to build curated datasets of their choice through an easy-to-navigate web-based interface. The integrated feature visualization system is useful for a fine-grained understanding of the enzyme structure–function relationship. Desired subsets of data, extracted and curated using iCEED, can subsequently be used for downstream processing, analyses, and knowledge discovery. iCEED can also be used for training and teaching purposes.
Several time-critical problems relying on large amounts of data, e.g., business trends, disaster response, and disease outbreaks, require cost-effective, timely, and accurate data summarization and visualization in order to reach efficient and effective decisions. The self-organizing map (SOM) is a very effective data clustering and visualization tool, as it provides an intuitive display of data in a lower-dimensional space. However, with O(N²) complexity, the SOM becomes inappropriate for large datasets. In this paper, we propose a force-directed visualization method that emulates the SOM's capability to display data clusters with O(N) complexity. The main idea is to perform force-directed fine-tuning of a 2D representation of the data. To demonstrate the efficiency and potential of the proposed method as a fast visualization tool, it is used to produce a 2D projection of the MNIST handwritten digits dataset.
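A hedged toy sketch of force-directed fine-tuning of a 2-D layout, not the paper's exact scheme: each point is pulled toward its k nearest neighbours in the original space and pushed away from a small random sample of other points, so one sweep costs roughly O(N). Function name, forces, and parameters are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def force_directed_refine(X, Z, k=10, iters=50, lr=0.05, neg=5, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    knn = NearestNeighbors(n_neighbors=k + 1).fit(X) \
              .kneighbors(X, return_distance=False)[:, 1:]   # drop self-neighbour
    for _ in range(iters):
        for i in range(n):
            attract = (Z[knn[i]] - Z[i]).mean(axis=0)          # pull toward neighbours
            j = rng.integers(0, n, size=neg)                   # sampled repulsors
            d = Z[i] - Z[j]
            repel = (d / (np.linalg.norm(d, axis=1, keepdims=True) ** 2 + 1e-6)).mean(axis=0)
            Z[i] += lr * (attract + repel)
    return Z

X = np.random.default_rng(1).normal(size=(500, 20))
Z0 = np.random.default_rng(2).normal(size=(500, 2))   # initial 2-D layout
Z = force_directed_refine(X, Z0.copy())
```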
Sentiment analysis over social media platforms has been an active area of study for more than a decade. This is due to the constant rise in the number of Internet users on these platforms, as well as the increasing interest of companies in monitoring customer opinion of commercial products. Most of these platforms provide free online services such as the creation of interactive web communities, multimedia content uploading, etc. This new way of communication has affected human societies by shaping the way opinions can be expressed, sparking the era of the digital revolution. One of the most prominent social networking platforms for opinion mining is Twitter, as it is a rich source for extracting news and a platform that politicians use frequently. In addition, the character limit per posted tweet (a maximum of 280 characters) makes it easier for automated tools to extract the underlying sentiment. In this review paper, we present a variety of lexicon-based tools as well as machine learning algorithms used for sentiment extraction. Furthermore, we present additional implementations used for political sentiment analysis over Twitter, as well as remaining open topics. We hope the review will help readers understand this scientifically rich area, identify the best options for their work, and address open topics.
We have developed a software framework for scientific visualization in immersive, room-sized virtual reality (VR) systems, or Cave Automatic Virtual Environments (CAVEs). This program, called Multiverse, allows users to select and invoke visualization programs without leaving the CAVE's VR space. Multiverse is a kind of immersive "desktop environment" for users, with a three-dimensional graphical user interface. For application developers, Multiverse is a software framework with useful class libraries and practical visualization programs as samples.
Breast cancer is one of the leading causes of untimely death among women in countries across the world. This can be attributed to many factors, including late detection, which often increases its severity. Thus, detecting the disease early would help mitigate its mortality rate and other associated risks. This study developed a hybrid machine learning model for the timely prediction of breast cancer to help combat the disease. A dataset from Kaggle was used to predict breast tumor growth and size using random tree classification, logistic regression, XGBoost, and a multilayer perceptron. The implementation of these machine learning algorithms and the visualization of the results were done in Python. The results achieved a high accuracy (99.65%) on the training and testing datasets, far better than traditional means. The predictive model has good potential to enhance the early detection and diagnosis of breast cancer and to improve treatment outcomes. It could also assist patients in addressing their condition or lifestyle in a timely manner to support their recovery or survival.
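A minimal sketch of one of the listed models (logistic regression) on a breast-cancer classification task; scikit-learn's bundled Wisconsin dataset is used here as a stand-in for the Kaggle dataset, and the paper's accuracy figure is not reproduced.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

# Scale features, then fit a logistic-regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, model.predict(X_te)))
```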
This paper develops a web database application for space-time 4D visual delivery (4DVD) of big climate data. The delivery system shows climate data in a 4D space-time box and allows users to visualize the data. Users can zoom in or out to help identify desired information for particular locations. Data can then be downloaded for the spatial maps and historical climate time series of a given location once the maps and time series have been identified as useful. These functions enable a user to quickly reach the core features of interest without downloading the entire dataset in advance, which saves both time and storage space. The 4DVD system has many graphical display options, such as displaying data on a round globe or on a 2D map with detailed background topographic images. It can animate maps and show time series. The combination of these features makes the system a convenient and attractive multimedia tool for classrooms, museums, and households, in addition to climate research scientists, industrial users, and policy makers. To demonstrate the 4DVD's usage, several application examples are included in this paper, such as La Niña's influence on land temperature.
This research develops a toolkit for snow-cover area calculation and display (SACD) based on the Interactive Multisensor Snow and Ice Mapping System (IMS). The paper uses the Tibetan Plateau region as an example to describe the toolkit's method, results, and usage. The National Snow and Ice Data Center (NSIDC) provides IMS to the public as a well-used system for monitoring snow and ice cover. The newly developed toolkit is based on a simple shoelace formula for the area of a grid box on a sphere and can be conveniently used to calculate the total snow-covered area from the IMS data. The toolkit has been made available as open-source Python software on GitHub. It generates the time series of the daily snow-covered area for any region over the Northern Hemisphere from 4 February 1997 onward, and it also creates maps showing snow and ice coverage against an elevation background. The Tibetan Plateau (TP) region (25°–45°N) × (65°–105°E) is used as an example to demonstrate our work on SACD. The IMS products at 24, 4, and 1 km resolutions include each grid cell's latitude and longitude coordinates, which are used to calculate the grid box's area using the shoelace formula. The total TP area calculated as the sum of the areas of all grid boxes approximates the true spherical TP surface area bounded by (25°–45°N) × (65°–105°E), with a difference of 0.046% for the 24 km grid and 0.033% for the 4 km grid. The differences in the snow-cover area reported by the 24 km and 4 km grids vary between −2.34% and 6.24%. The temporal variations of the daily TP snow cover are displayed as time series from 4 February 1997 to the present at 4 km and 24 km resolutions.
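A hedged sketch of a shoelace-style grid-box area on a sphere (a common construction; the toolkit's exact code is not reproduced here): because the spherical area element is R² d(sin lat) d(lon), applying the planar shoelace formula in (lon, sin lat) coordinates and scaling by R² gives the box area. The function name, radius constant, and example box are illustrative.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def spherical_shoelace_area(lon_deg, lat_deg, radius=EARTH_RADIUS_KM):
    """Area (km^2) of a grid box given its vertex longitudes/latitudes in degrees."""
    x = np.radians(np.asarray(lon_deg))          # longitude in radians
    y = np.sin(np.radians(np.asarray(lat_deg)))  # sin(latitude)
    x_next, y_next = np.roll(x, -1), np.roll(y, -1)
    return 0.5 * abs(np.sum(x * y_next - x_next * y)) * radius ** 2

# A roughly 24 km-class grid box near 35 N (vertices in counter-clockwise order).
lons = [90.0, 90.25, 90.25, 90.0]
lats = [35.0, 35.0, 35.2, 35.2]
print(spherical_shoelace_area(lons, lats), "km^2")
```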
Sequence alignment is a fundamental and important tool for sequence data analysis in molecular biology. Many applications in molecular biology require the detection of a similarity pattern shared by a number of DNA and protein sequences. Visual front-ends are useful for intuitive viewing of alignments and help in analyzing the structure, function, and evolution of DNA and proteins. In this paper, we designed and implemented an interactive system for data visualization of DNA and protein sequences, which can be used for determining a sequence alignment, similarity search over sequence data, and function inference. Experimental results show that a user can easily operate the system after one hour of practice; it provides clean output, easy identification of similarity, and visualization of alignment data.
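A minimal sketch of one classic similarity visualization (a dot plot), not the paper's interactive system: a dot marks every position pair where two DNA sequences share the same base, so matching regions appear as diagonals. The function name and example sequences are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

def dot_plot_matrix(seq_a, seq_b):
    """Boolean (len(a), len(b)) matrix of positions with identical characters."""
    a = np.frombuffer(seq_a.encode(), dtype=np.uint8)
    b = np.frombuffer(seq_b.encode(), dtype=np.uint8)
    return a[:, None] == b[None, :]

m = dot_plot_matrix("ACGTACGTGGCTA", "ACGTACGAGGCTA")
plt.imshow(m, cmap="Greys", origin="lower")
plt.xlabel("sequence B position")
plt.ylabel("sequence A position")
plt.title("Dot-plot similarity")
plt.show()
```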