Information security is a time-critical problem that touches on fundamental issues of science, technology, and development. The analysis and fusion of information security big data have traditionally been performed manually, which increases time and labor costs and makes it difficult to process security incidents in real time. To address these problems, this paper constructs an algorithm model that handles the multi-source nature of information security big data and rapidly analyzes and fuses similar data. To meet practical requirements, corresponding algorithmic procedures are given for data preprocessing, data correlation, and related issues, including noise filtering, correlation analysis, and logic functions for the statistical problems caused by differing data volumes, thereby addressing the fundamental problems of the data analysis layer. For storage, the degree of fusion of the security data is first assessed; the data are then organized in database form, tagged with time-series labels, and stored, so that security events carry natural physical labels that facilitate database operations such as output and retrieval. An algorithm cannot be justified in theory alone; it must also be tested in practice, which is the only effective way to verify its performance. This paper therefore proposes seven simulation perspectives based on common problems in information security big data and takes several data evaluation criteria as simulation objects, laying a theoretical foundation and providing experimental data for the long-term development of the algorithm. The data analysis layer algorithm constructed here faces not cold data but social information with practical utility: strong analysis capability provides decision makers with reliable, objective underlying data and substantive support for decision-making.
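To make the storage step concrete, here is a minimal sketch of time-series labeling before database storage, assuming events arrive as simple records; the SQLite schema, field names, and label format are illustrative assumptions, not the paper's design.

```python
import sqlite3
from datetime import datetime, timezone

# Minimal sketch: tag each fused security event with a time-series label
# before storing it, so events carry a natural temporal key that supports
# output and retrieval. Schema and field names are illustrative only.
conn = sqlite3.connect("security_events.db")
conn.execute("CREATE TABLE IF NOT EXISTS events "
             "(ts_label TEXT, source TEXT, event TEXT)")

def store_event(source: str, event: str) -> None:
    # An ISO-8601 UTC timestamp acts as the time-series label.
    ts_label = datetime.now(timezone.utc).isoformat()
    conn.execute("INSERT INTO events VALUES (?, ?, ?)",
                 (ts_label, source, event))
    conn.commit()

store_event("ids-sensor-3", "port scan detected")
# Retrieval by time range is then a plain ordered query:
rows = conn.execute("SELECT * FROM events WHERE ts_label >= ? ORDER BY ts_label",
                    ("2024-01-01",)).fetchall()
print(rows)
```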
Class imbalance refers to classification problems in which many more instances are available for some classes than for others. Such imbalanced datasets require special attention because traditional classifiers generally favor the majority class, which has a large number of instances. Ensembles of classifiers have been reported to yield promising results. However, the majority of ensemble methods applied to imbalanced learning are static ones, and they only deal with binary imbalanced problems. Hence, this paper presents an empirical analysis of Dynamic Selection techniques and data preprocessing methods for dealing with multi-class imbalanced problems. We considered five variations of preprocessing methods and 14 Dynamic Selection schemes. Our experiments, conducted on 26 multi-class imbalanced problems, show that the dynamic ensemble improves the AUC and the G-mean as compared to the static ensemble. Moreover, data preprocessing plays an important role in such cases.
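As a concrete illustration of this kind of pipeline, the sketch below pairs multi-class SMOTE preprocessing (imbalanced-learn) with one dynamic selection scheme, KNORA-E (DESlib); the libraries, the bagging pool, and all parameters are illustrative assumptions rather than the paper's experimental setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from deslib.des import KNORAE

# Synthetic 3-class imbalanced problem (70/20/10 class priors).
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Preprocessing: SMOTE oversamples every minority class.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

# Pool of classifiers; KNORA-E then selects competent members per query point.
pool = BaggingClassifier(n_estimators=50, random_state=0).fit(X_res, y_res)
des = KNORAE(pool_classifiers=pool).fit(X_res, y_res)  # a separate DSEL split is typical

print("balanced accuracy:", balanced_accuracy_score(y_te, des.predict(X_te)))
```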
Traffic prediction is essential for transportation planning, resource allocation, congestion management, and enhancing travel experiences. This study optimizes data preprocessing techniques to improve machine learning-based traffic prediction models. Data preprocessing is critical in preparing data for machine learning models, and the proposed approach optimizes preprocessing techniques, focusing on flow-based analysis and optimization, to enhance traffic prediction models. The approach explores fixed and variable orders of data preprocessing using a genetic algorithm across five diverse datasets. Evaluation metrics such as root mean squared error (RMSE), mean absolute error (MAE), and R-squared assess model performance. The results indicate that the genetic algorithm's variable order achieves the best performance for the ArcGIS Hub, Frementon Bridge Cycle, and PeMS08 datasets; fixed order 1 preprocessing performs best for the Traffic Prediction dataset; and fixed order 2 preprocessing yields the best performance for the XI AN Traffic dataset. These findings highlight the importance of selecting the appropriate preprocessing flow order for each dataset, improving traffic prediction accuracy and reliability. The proposed approach advances traffic prediction methodologies, enabling more precise and reliable traffic forecasts for transportation planning and management applications.
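The variable-order search can be sketched as a small permutation-search genetic algorithm over a pool of preprocessing steps, scored by RMSE. The step pool, the swap-mutation GA, and the scikit-learn model below are assumptions for illustration, not the paper's configuration.

```python
import random
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

# Candidate preprocessing steps whose ORDER the GA searches over (assumed pool).
STEPS = {"impute": SimpleImputer(), "scale": StandardScaler(),
         "minmax": MinMaxScaler(), "power": PowerTransformer()}

X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def fitness(order):
    # Build a pipeline applying the steps in this order; score by RMSE.
    pipe = make_pipeline(*[STEPS[s] for s in order],
                         RandomForestRegressor(n_estimators=50, random_state=0))
    pipe.fit(X_tr, y_tr)
    return mean_squared_error(y_te, pipe.predict(X_te)) ** 0.5

def mutate(order):
    # Swap mutation keeps the chromosome a valid permutation of the steps.
    a, b = random.sample(range(len(order)), 2)
    child = list(order)
    child[a], child[b] = child[b], child[a]
    return child

random.seed(0)
pop = [random.sample(list(STEPS), len(STEPS)) for _ in range(10)]
for _ in range(15):  # generations: keep the best half, mutate to refill
    pop.sort(key=fitness)
    pop = pop[:5] + [mutate(random.choice(pop[:5])) for _ in range(5)]

best = min(pop, key=fitness)
print("best order:", best, "RMSE:", round(fitness(best), 3))
```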
Feature Selection (FS) is an important preprocessing step in data analytics. It is used to select a subset of the original feature set such that the selected subset does not significantly degrade classification performance. Its objective is to remove irrelevant and redundant features from the original dataset. FS can be done either in offline mode or in online mode. The basic assumption in the former is that the entire dataset is available to the FS algorithm, which takes multiple epochs to select an optimal feature subset that gives good accuracy. In contrast, FS algorithms in online mode take input data one instance at a time and accumulate knowledge by learning from each instance; each instance of the original dataset serves as both a training and a testing sample. Offline FS algorithms require long running times when the data to be processed is large, as with Big Data, whereas online FS algorithms take only one epoch to learn the entire dataset and can produce results swiftly, which is highly desirable for Big Data. This paper deals with the online FS problem and provides a novel feature selection algorithm that uses the Sparse Gradient method to build a sparse classifier. In the proposed method, an online classifier is built and maintained throughout the learning process, and feature weights that fall within a particular boundary are reduced in a step-by-step decrement process. This creates sparsity in the classifier, which is then used to select an optimal feature subset from the incoming data. Because the method reduces the classifier weights step by step, only those important features whose values stay above the boundary survive the repeated decrement process, and the resulting optimal feature subset is formed from these non-zero-weighted features. Most significantly, the method can be used with any learning algorithm. To show its applicability with different learning algorithms, various online feature selection models have been built using Learning Vector Quantization, Radial Basis Function Networks, and Adaptive Resonance Theory MAP, all employing the proposed Sparse Gradient method. The encouraging results show the effectiveness of the proposed method with different learning algorithms on medium- and large-sized benchmark datasets.
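A minimal sketch of the sparse-gradient idea follows, assuming a hinge-style online linear update; the decrement schedule and boundary value are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from sklearn.datasets import make_classification

# Online linear classifier whose weights are decremented toward zero after
# each update; weights below a boundary are clamped to zero, and the
# surviving non-zero weights define the selected feature subset.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=8,
                           random_state=0)
y = 2 * y - 1                      # labels in {-1, +1}
w = np.zeros(X.shape[1])
lr, decay, boundary = 0.1, 0.001, 0.01

for xi, yi in zip(X, y):           # one epoch: each instance seen once
    if yi * np.dot(w, xi) <= 1:    # hinge-loss style online update
        w += lr * yi * xi
    w = np.sign(w) * np.maximum(np.abs(w) - decay, 0.0)  # step-wise decrement
    w[np.abs(w) < boundary] = 0.0  # weights below the boundary do not survive

selected = np.flatnonzero(w)
print(f"{selected.size} features selected:", selected)
```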
Currently, Text-to-Speech (TTS), or speech synthesis, the ability of a complex system to generate a human-like voice from written text, is becoming increasingly popular in speech processing across various complex systems. TTS is the artificial generation of human speech: a classical TTS system translates language text into a waveform. Several English TTS systems produce human-like, mature, and natural synthesized speech. Other languages, such as Arabic, have only recently received attention. Existing Arabic speech synthesis solutions are slow and of low quality, and the naturalness of their synthesized speech is lower than that of English synthesizers; they also lack crucial primary speech factors, including rhythm, intonation, and stress. Several studies have been proposed to resolve these problems, typically using concatenative techniques such as parametric or unit-selection methods. This paper proposes an Applied Linguistics with Artificial Intelligence-Enabled Arabic Text-to-Speech Synthesizer (ALAI-ATTS) model. The ALAI-ATTS technique includes three essential components: data preprocessing through phonetization and diacritization, Extreme Learning Machine (ELM)-based speech synthesis, and Grey Wolf Optimization (GWO)-based parameter tuning. Initially, the data preprocessing step includes diacritization, where diacritics are restored to unvoweled text to ensure correct pronunciation, followed by phonetization, which translates the text into its phonetic representation. Then, the ELM-based speech synthesis model uses the processed dataset for speech generation. ELMs, well known for their excellent generalization performance and fast learning speed, are especially suitable for real-time TTS applications, balancing high-quality speech output and computational efficiency. Lastly, the GWO methodology is employed to tune the parameters of the ELM. The simulation outcomes validate that the ALAI-ATTS technique considerably enhances the intelligibility and naturalness of Arabic synthesized speech compared to existing approaches. Experimentally, the ALAI-ATTS technique achieved lower error values of 3.48 and 0.15 under WER, and 1.37 and 0.25 under DER.
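The core of an ELM (a random fixed hidden layer with output weights solved in closed form by least squares) can be sketched in a few lines. The toy features standing in for phonetic input, the tanh activation, and the hidden-layer size below are assumptions, and the GWO tuning loop is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

class ELM:
    """Random fixed hidden layer; output weights solved by least squares."""
    def __init__(self, n_hidden=64, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, Y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        # Closed-form solve: this single lstsq is the entire training step,
        # which is what makes ELMs fast enough for real-time use.
        self.beta, *_ = np.linalg.lstsq(self._hidden(X), Y, rcond=None)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

X = rng.normal(size=(500, 20))                            # stand-in phonetic features
Y = np.sin(X[:, :1]) + 0.1 * rng.normal(size=(500, 1))    # stand-in acoustic target
model = ELM().fit(X, Y)
print("train MSE:", float(((model.predict(X) - Y) ** 2).mean()))
```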
Ocean turbines are a promising new source of clean energy, but their remote and inhospitable environment (the open ocean) poses reliability challenges. Machine condition monitoring/prognostic health monitoring (MCM/PHM) systems assure the reliability of these turbines by detecting and predicting machine state. These MCM/PHM systems use sensor data (such as vibration information) to determine whether or not the machine is operating properly. However, not all sensor data corresponds to the machine state: some portions of the sensor signal are influenced by environmental conditions that do not directly relate to machine health. Therefore, models must be built that can detect system state regardless of these environmental operating conditions. The proposed baseline-differencing approach permits this by creating a baseline for each operating condition (such that the baseline represents what the normal, healthy machine state looks like in that condition) and using the difference between the observed data and this baseline to train and evaluate models. We present two case studies, both conducted on data from a dynamometer representing an ocean turbine, to demonstrate the improved predictive capabilities of models that incorporate baseline-differencing compared to models that use the non-baselined data. The results show that significantly more high-quality models can be built with baseline-differencing.
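A minimal sketch of baseline-differencing, assuming a per-condition mean over known-healthy data as the baseline; the column names, data, and mean-baseline choice are illustrative assumptions.

```python
import pandas as pd

# Compute a per-condition baseline from known-healthy data, then subtract it
# from observations so the model sees deviations from "normal for this
# operating condition" rather than raw sensor values.
healthy = pd.DataFrame({
    "condition": ["low_rpm", "low_rpm", "high_rpm", "high_rpm"],
    "vibration": [0.10, 0.12, 0.45, 0.47],
})
observed = pd.DataFrame({
    "condition": ["low_rpm", "high_rpm", "high_rpm"],
    "vibration": [0.11, 0.46, 0.80],   # last row: abnormal for its condition
})

baseline = healthy.groupby("condition")["vibration"].mean()
observed["vib_diff"] = observed["vibration"] - observed["condition"].map(baseline)
print(observed)  # vib_diff is the feature used to train/evaluate models
```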
Ordinary feature selection methods select only the explicitly relevant attributes by filtering out the irrelevant ones, trading selection accuracy for shorter execution time and lower complexity. In doing so, the hidden supportive information carried by irrelevant attributes may be lost, so good attribute combinations may be missed. We believe that attributes that are useless for the classification task by themselves may nevertheless provide potentially useful supportive information to other attributes and thus benefit the classification task. Such a strategy minimizes information loss and can therefore maximize classification accuracy, especially for datasets containing hidden interactions among attributes. This paper proposes a feature selection methodology from a new angle that selects not only the relevant features but also targets the potentially useful, falsely irrelevant attributes by measuring their supportive importance to other attributes. The empirical results validate this hypothesis by demonstrating that the proposed approach outperforms most state-of-the-art filter-based feature selection methods.
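The XOR toy case below illustrates why such supportive attributes matter: each attribute alone carries no information about the class, yet jointly they determine it. The interaction-gain measure used here is one common formalization of this effect and is not claimed to be the paper's metric.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 10000)
b = rng.integers(0, 2, 10000)
y = a ^ b                      # class depends on the *interaction* only

def mi(x, y):
    # Plug-in mutual information estimate for small discrete variables.
    counts = np.histogram2d(x, y, bins=(np.unique(x).size, np.unique(y).size))[0]
    joint = counts / counts.sum()
    px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

pair = a * 2 + b               # joint variable (A, B)
print("I(A;Y) =", round(mi(a, y), 3))   # ~0: A looks irrelevant alone
print("I(B;Y) =", round(mi(b, y), 3))   # ~0: B looks irrelevant alone
print("interaction gain =", round(mi(pair, y) - mi(a, y) - mi(b, y), 3))  # ~1 bit
```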
Clustering analysis has been widely applied to single-cell RNA-sequencing (scRNA-seq) data to discover cell types and cell states. Algorithms developed in recent years have greatly advanced the understanding of cellular heterogeneity and the underlying mechanisms of biological processes. However, these algorithms often use different techniques, were evaluated on different datasets, and were compared with only some of their counterparts, usually using different performance metrics. Consequently, an accurate and complete picture of their merits and demerits is lacking, which makes it difficult for users to select proper algorithms for analyzing their data. To fill this gap, we first review the major existing scRNA-seq data clustering methods, and then conduct a comprehensive performance comparison among them from multiple perspectives. We consider 13 state-of-the-art scRNA-seq data clustering algorithms, and collect 12 publicly available real scRNA-seq datasets from existing works to evaluate and compare these algorithms. Our comparative study shows that the existing methods are very diverse in performance. Even the top-performing algorithms do not perform well on all datasets, especially those with complex structures. This suggests that further research is required to explore more stable, accurate, and efficient clustering algorithms for scRNA-seq data.
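A sketch of such a comparison protocol follows, using synthetic data as a stand-in for a cells-by-genes expression matrix and scikit-learn clusterers scored against known labels by ARI and NMI; real scRNA-seq pipelines would add normalization and gene selection steps omitted here.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Synthetic stand-in for an expression matrix with 5 "cell types".
X, labels = make_blobs(n_samples=600, n_features=50, centers=5, random_state=0)
X_pca = PCA(n_components=10, random_state=0).fit_transform(X)

algorithms = {
    "kmeans": KMeans(n_clusters=5, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=5),
    "spectral": SpectralClustering(n_clusters=5, random_state=0),
}
for name, algo in algorithms.items():
    pred = algo.fit_predict(X_pca)
    print(f"{name}: ARI={adjusted_rand_score(labels, pred):.3f} "
          f"NMI={normalized_mutual_info_score(labels, pred):.3f}")
```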
In this paper, for the first time, a novel discretization scheme is proposed that aims at scalability while also addressing at least three other major challenges. It is based on a Left-to-Right (LR) scanning process, which partitions the input stream into intervals. This task can be implemented by an algorithm or by using a generator that automatically builds the discretization program. We focus especially on unsupervised discretization and design a method called Unsupervised Left-to-Right Discretization (ULR-Discr). Extensive experiments were conducted using various cut-point functions on small, large, and medical public datasets. First, ULR-Discr variants under different statistics are compared with one another to observe the impact of the cut-point functions on accuracy and runtime. Then the proposed method is compared to traditional and recent discretization techniques for classification. The result is that classification accuracy is highly improved when our method is used for discretization.
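The LR scanning idea can be sketched as a single pass over the sorted values that cuts wherever a cut-point statistic fires; the gap-based statistic below is an illustrative choice, not one of the paper's cut-point functions.

```python
import numpy as np

def lr_discretize(values, k=2.0):
    """Single left-to-right pass over sorted values; cut where the statistic fires."""
    v = np.sort(np.asarray(values, dtype=float))
    gaps = np.diff(v)
    threshold = k * gaps.mean()          # illustrative cut-point statistic
    cuts = [v[0]]
    for left, right, gap in zip(v[:-1], v[1:], gaps):
        if gap > threshold:              # close the current interval here
            cuts.append((left + right) / 2)
    cuts.append(v[-1])
    return cuts                          # interval boundaries, left to right

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(10, 1, 200)])
print(lr_discretize(data))               # boundaries include a cut near 5
```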
Ground-based 3D laser scanning technology has advantages such as allowing normal traffic flow during surveying, capturing a large amount of data, and high efficiency, so it is well suited to surveying and mapping for existing road reconstruction and expansion. This paper takes a specific road reconstruction and expansion project as an example to study a workflow suited to 3D laser scanning technology, covering survey program design, equipment selection, geodetic control chain, and target layout. Related software is used to complete point cloud filtering, splicing, coordinate conversion, and simplification, and the CASS software is then used to generate a DTM model for the road reconstruction and expansion design. The results show that this technology achieves high data accuracy: according to the tests, the difference between the data obtained by this technology and by traditional measurement methods is under 4 mm. It can fully meet the design requirements of road reconstruction and expansion and has very good application prospects.
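The filtering and simplification steps can be sketched with Open3D as a stand-in for the "related software" mentioned above; the synthetic cloud and all parameters are illustrative, and the CASS DTM generation happens downstream and is not shown.

```python
import numpy as np
import open3d as o3d

# Synthetic point cloud standing in for a road scan.
pts = np.random.default_rng(0).uniform(0, 10, size=(5000, 3))
pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(pts))

# Filtering: drop statistical outliers (scan noise, passing vehicles).
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

# Simplification: voxel downsampling thins dense scan data before DTM work.
pcd = pcd.voxel_down_sample(voxel_size=0.5)

o3d.io.write_point_cloud("road_scan_clean.pcd", pcd)
print(pcd)  # prints remaining point count
```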
Aiming at the problems of low precision and the lack of distinctive features when aligning complex objects in reverse engineering, a faster and more accurate alignment method is proposed. The method is based on multi-sensor measurement technology, with data preprocessing carried out using least squares and matrix transformation theory. With the aid of the center coordinates of standard alignment balls, point cloud data from different coordinate systems are positioned quickly and accurately, achieving the goal of data splicing. A worked example shows that this method solves the problem of aligning complex surface data from multi-sensor measurements and ensures the accuracy of the reverse modeling.
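The least-squares rigid alignment from corresponding ball centers can be sketched with the standard SVD (Kabsch) solution, which matches the "least squares and matrix transformation" approach described above in spirit; the center coordinates below are synthetic, whereas in practice they come from sphere fitting on each sensor's point cloud, and this is an illustration of the general technique rather than the paper's exact procedure.

```python
import numpy as np

def rigid_transform(src, dst):
    """Find R, t minimizing sum ||R @ src_i + t - dst_i||^2 (Kabsch/SVD)."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)         # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                    # guard against reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

# Synthetic standard-ball centers in sensor A's frame.
src = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
theta = np.deg2rad(30)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
dst = src @ R_true.T + np.array([2.0, -1.0, 0.5])  # same centers, sensor B's frame

R, t = rigid_transform(src, dst)
print("max residual:", np.abs(src @ R.T + t - dst).max())  # ~0: clouds spliced
```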