The main goal of the new field of data mining is the analysis of large and complex datasets. Some very important datasets may be derived from business and industrial activities. This kind of data is known as “enterprise data”. The common characteristic of such datasets is that the analyst wishes to analyze them for the purpose of designing a more cost-effective strategy for optimizing some type of performance measure, such as reducing production time, improving quality, eliminating waste, or maximizing profit. Data in this category may describe different scheduling scenarios in a manufacturing environment, quality control of some process, fault diagnosis in the operation of a machine or process, risk analysis when issuing credit to applicants, management of supply chains in a manufacturing system, or business-related decision making.
Sample Chapter(s)
Foreword (37 KB)
Chapter 1: Enterprise Data Mining: A Review and Research Directions (655 KB)
https://doi.org/10.1142/9789812779861_fmatter
https://doi.org/10.1142/9789812779861_0001
Manufacturing enterprise systems and service enterprise systems carry out the bulk of economic activities in any country and in the increasingly connected world. Enterprise data are necessary to ensure that each manufacturing or service enterprise system is run efficiently and effectively. As digitized data become easier to capture and fairly inexpensive to store, they gradually overwhelm our ability to analyze them and turn them into useful information for decision making. The rise of data mining and knowledge discovery as an interdisciplinary field for uncovering hidden and useful knowledge from large volumes of data stored in a database or data warehouse is very promising in many areas, including enterprise systems. Over the last decade, numerous studies have investigated how enterprise data could be mined to generate useful models and knowledge for running a business more efficiently and effectively. This chapter provides a comprehensive overview of previous studies on enterprise data mining. To give some idea of where the research is heading, some ongoing research programs and future research directions are highlighted at the end of the chapter.
https://doi.org/10.1142/9789812779861_0002
Credit rating is a powerful tool that can help banks improve loan quality and decrease credit risk. This chapter examines major classification techniques, including traditional statistical models (LDA, QDA and logistic regression), k-nearest neighbors, Bayesian networks (Naïve Bayes and TAN), decision trees (C4.5), associative classification (CBA), a neural network, and support vector machines (SVM), and applies them to controlling credit risk. The experiments were conducted on 244 rated companies, mainly from the Industrial and Commercial Bank of China. The receiver operating characteristic curve and the DeLong-Pearson method were adopted to verify and compare their performance. The results reveal that traditional statistical models produced the poorest outcomes, that neither C4.5 nor SVM performed satisfactorily, and that CBA appeared to be the best choice for credit rating in terms of predictability and interpretability.
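A minimal sketch, not the chapter's code, of comparing classifiers for a binary credit-rating task by the area under the ROC curve; the DeLong-Pearson significance test used in the chapter is not reproduced here, and the synthetic data merely stands in for the 244 rated companies, which are not public.

```python
# Compare several classifiers on an imbalanced, synthetic "credit rating" task by ROC AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=244, n_features=20, weights=[0.8, 0.2],
                           random_state=0)                      # stand-in ratings data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree (CART as a stand-in for C4.5)": DecisionTreeClassifier(max_depth=5),
    "SVM (RBF kernel)": SVC(probability=True),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]     # predicted probability of the minority class
    print(f"{name}: AUC = {roc_auc_score(y_te, scores):.3f}")
```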
https://doi.org/10.1142/9789812779861_0003
Enterprise data present several difficulties when are used in data mining projects. Apart from being heterogeneous, noisy and disparate, they may also be characterized by major imbalances between the different classes. Predictive classification using imbalanced data necessitates methodologies that are adequate for such data, and particularly for the training of algorithms and evaluation of the resulting classifiers. This chapter suggests to experiment with several class distributions in the training sets and a variety of performance measures, especially those that are known to better expose the strengths and weaknesses of classification models. By combining classifiers into schemes that are suitable for the specific business domain, may improve predictions. However, the final evaluation of the classifiers must always be based on the impact of the results to the enterprise, which can take the form of a cost model that reflects requirements of existing knowledge. Taking a telecommunications company as an example, we provide a framework for handling enterprise data during the initial phases of the project, as well as for generating and evaluating predictive classifiers. We also provide the design of a decision support system, which embodies the above process with the daily routine of such company.
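A minimal sketch of the idea described above: train the same classifier on several class distributions and judge the results with imbalance-aware measures plus a simple business cost model. The cost figures and the synthetic "churn-like" data are illustrative assumptions, not the chapter's telecommunications data.

```python
# Try several minority-class shares in the training set and score each with
# recall, precision, and an assumed misclassification-cost model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, precision_score, confusion_matrix

COST_FN, COST_FP = 100.0, 5.0          # assumed cost of a missed positive vs. a false alarm

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

for minority_share in (0.05, 0.20, 0.50):          # class distributions to try in training
    pos = np.where(y_tr == 1)[0]
    neg = np.where(y_tr == 0)[0]
    n_neg = int(len(pos) * (1 - minority_share) / minority_share)
    idx = np.concatenate([pos, np.random.RandomState(1).choice(neg, min(n_neg, len(neg)), replace=False)])
    clf = RandomForestClassifier(random_state=1).fit(X_tr[idx], y_tr[idx])
    pred = clf.predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    cost = COST_FN * fn + COST_FP * fp
    print(f"minority share {minority_share:.2f}: "
          f"recall={recall_score(y_te, pred):.2f} "
          f"precision={precision_score(y_te, pred, zero_division=0):.2f} "
          f"cost={cost:.0f}")
```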
https://doi.org/10.1142/9789812779861_0004
Time series forecasting is one of the important problems in time series analysis. Many different approaches have been developed in this field. Unlike statistical methods, soft computing methods are more tolerant of imprecision, uncertainty, partial truth, and approximation in time series. This chapter addresses two major aspects of time series forecasting: (1) how to identify time series variables, including exogenous ones, that are relevant to forecasting future values, and (2) how to build a better forecasting model to improve forecasting accuracy. Two different models are developed in this research. First, we propose a soft computing based hybrid method to improve the accuracy of a neural network model. Then a sub-clustered rule-based forecasting method, called WEFuNN, is developed to group similar time series data together in order to reduce computational time and increase forecasting accuracy.
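A minimal sketch, not the chapter's hybrid or WEFuNN method: build lagged and exogenous input variables for a plain neural-network forecaster, illustrating the kind of variable identification and model building discussed above. The synthetic series and lag choices are assumptions.

```python
# Forecast a synthetic series from its own lags plus one exogenous variable with an MLP.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n = 300
exog = rng.normal(size=n)                                   # an assumed exogenous driver
y = np.zeros(n)
for t in range(2, n):                                       # synthetic AR(2) series with exogenous effect
    y[t] = 0.6 * y[t-1] - 0.3 * y[t-2] + 0.5 * exog[t] + rng.normal(scale=0.1)

LAGS = 2
X = np.column_stack([y[LAGS-1:-1], y[LAGS-2:-2], exog[LAGS:]])   # [y_{t-1}, y_{t-2}, exog_t]
target = y[LAGS:]

split = int(0.8 * len(target))
model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X[:split], target[:split])
print("test RMSE:", np.sqrt(np.mean((model.predict(X[split:]) - target[split:]) ** 2)))
```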
https://doi.org/10.1142/9789812779861_0005
The current intense global competition and diverse customer requirements have been forcing manufacturing companies to produce a high variety of customized products quickly and at low cost. The linchpin for companies to achieve efficiency, and thus survive, lies in the ability to keep high-variety production as stable as possible. Such stable production can only be achieved by adopting similar production processes to produce the diverse products. Process platforms have been recognized as a promising means for companies to configure optimal, yet similar, production processes to fulfill the needs of different products. This chapter applies data mining to form process platforms from the large volumes of production data already held in companies' production systems. To meet the challenges encountered in the formation process, more specific data mining techniques, including text mining, tree matching, fuzzy clustering, and tree unification, are incorporated in the proposed methodology. A case study of high-variety production of vibration motors for mobile phones is also reported. The results illustrate the feasibility and potential of applying data mining to process platform formation.
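A minimal from-scratch sketch of fuzzy c-means, one of the techniques named above for grouping similar production processes; the random feature matrix stands in for real routing data, which is an assumption, and the real methodology combines this with text mining and tree matching.

```python
# Fuzzy c-means clustering: each sample gets a degree of membership in every cluster.
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
    """Return cluster centers and the fuzzy membership matrix U (n_samples x c)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))              # random initial memberships
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]        # membership-weighted cluster centers
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (dist ** (2 / (m - 1)))                   # inverse-distance memberships
        U /= U.sum(axis=1, keepdims=True)                   # normalize so each row sums to 1
    return centers, U

X = np.random.default_rng(1).normal(size=(60, 5))           # stand-in process feature vectors
centers, U = fuzzy_c_means(X)
print("hard assignment of first 10 processes:", U[:10].argmax(axis=1))
```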
https://doi.org/10.1142/9789812779861_0006
This chapter presents a data mining based approach for developing production control strategies for a dynamic and complex manufacturing system. To control such complex systems, it is a challenge to determine appropriate dispatching strategies under various system conditions. Dispatching strategies are classified into two categories: vehicle-initiated dispatching policies and machine-initiated dispatching policies. It has been shown that no single strategy consistently dominates the rest. Both policies are important for improving system performance, especially for the real-time control of the system. Focusing on combining them under various situations for semiconductor manufacturing systems, the goal of this chapter is to develop a scheduler that selects dispatching rules so as to achieve the performance desired by the user for each production interval. The proposed methodology uses simulation and competitive neural network approaches. The test results indicate that applying our methodology to obtain a dispatching strategy is effective given the complexity of semiconductor wafer fabrication systems.
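A minimal sketch of a winner-take-all competitive network, the kind of model the chapter combines with simulation to map shop-floor states to dispatching rules; the state features, rule labels, and learning rate here are illustrative assumptions.

```python
# Competitive (winner-take-all) learning: only the closest unit moves toward each input.
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(500, 4))          # e.g. queue lengths, utilization, WIP, due-date slack
n_units, lr = 3, 0.05                       # one competitive unit per candidate dispatching rule
W = rng.normal(size=(n_units, 4))

for x in states:
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    W[winner] += lr * (x - W[winner])       # move only the winning unit toward the input

# At run time, the rule associated with the closest unit would be applied.
rules = ["shortest processing time", "earliest due date", "nearest vehicle first"]
new_state = rng.normal(size=4)
print("suggested rule:", rules[np.argmin(np.linalg.norm(W - new_state, axis=1))])
```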
https://doi.org/10.1142/9789812779861_0007
Wine quality is determined by a series of complex chemical processes. Factors affecting grape and wine performance range from climate conditions during the growing period to harvesting decisions controlled by humans. In this chapter, we apply single-objective and multi-objective classification algorithms for prediction of grape and wine quality in a multi-year agricultural database maintained by Yarden - Golan Heights Winery in Katzrin, Israel. The goal of the study is to discover relationships between 138 agricultural and meteorological attributes collected or derived during a single season and 27 dependent parameters measuring grapevine and wine quality. We have induced ordered (oblivious) decision-tree models from the target dataset using information-theoretic classification algorithms. The induced models, called single-objective and multi-objective information networks, have been combined into multi-level information graphs, each level standing for a different stage of the wine production process. The results clearly demonstrate the hitherto unexploited potential of the KDD technology for knowledge discovery in agricultural data.
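A minimal sketch, on synthetic data, of a multi-output decision tree as a rough stand-in for the multi-objective information networks described above; the real study used 138 seasonal attributes and 27 quality parameters, neither of which is reproduced here.

```python
# Fit one decision tree that predicts several quality targets at once.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                      # stand-in agricultural/meteorological attributes
Y = np.column_stack([                               # two stand-in quality grades (0 = low, 2 = high)
    np.digitize(X[:, 0] + X[:, 1], [-1, 1]),
    np.digitize(X[:, 2] - X[:, 3], [-1, 1]),
])
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, Y_tr)
print("per-target accuracy:", (tree.predict(X_te) == Y_te).mean(axis=0))
```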
https://doi.org/10.1142/9789812779861_0008
As global competition continues to intensify in high-tech industries such as the semiconductor industry, wafer fabs have been placing more importance on increasing die yield and reducing costs. Because of automated manufacturing and information integration technologies, large amounts of raw data have been accumulating from various sources. Mining potentially useful information from such large databases has become very important for high-tech industry to enhance operational excellence and thus maintain competitive advantages. However, little research has been done on manufacturing data in high-tech industry. Owing to the complex fabrication processes, the data integration and system design involved, and the need for cooperation among domain experts, IT specialists, and statisticians, the development and deployment of data mining applications is difficult. This chapter describes the characteristics of various empirical data mining studies in semiconductor manufacturing, particularly defect diagnosis and yield enhancement. We analyze engineering data and manufacturing data in different cases and discuss specific needs for data preparation in light of the different characteristics of these data. The study concludes with several critical success factors for the development of data mining applications in high-tech industry.
https://doi.org/10.1142/9789812779861_0009
This chapter aims to present our data mining vision of Statistical Process Control (SPC) analysis, specifically the design of multivariate control charts for individual observations in the case of independent data and continuous variables. We first argue why the classic multivariate SPC tool, the Hotelling T² chart, might not be appropriate for large data sets, and then provide an up-to-date critical review of the methods suitable for dealing with data mining issues in control chart design. To address new SPC issues such as the presence of multiple outliers and incorrect model assumptions in the context of large data sets, we suggest exploiting some multivariate nonparametric statistical methods. In a model-free environment, we present the way we handle large data sets: a multivariate control scheme based on the data depth approach. We first present the general framework, and then our specific idea on how to design a proper control chart. An example, a simulation study, and some remarks on the choice of the depth function from a data mining perspective follow. A brief discussion of some open issues in data mining SPC closes the chapter.
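A minimal sketch of the general idea: compute a depth value for each new observation relative to an in-control reference sample and signal when depth is low. Mahalanobis depth is used here purely for illustration; the chapter discusses the choice of depth function, and nonparametric depths such as simplicial depth behave differently.

```python
# Depth-based control scheme: low depth relative to the reference sample raises a signal.
import numpy as np

rng = np.random.default_rng(0)
reference = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=1000)  # in-control data
mu = reference.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(reference, rowvar=False))

def mahalanobis_depth(x):
    d2 = (x - mu) @ Sigma_inv @ (x - mu)        # squared Mahalanobis distance (Hotelling-type statistic)
    return 1.0 / (1.0 + d2)

depths_ref = np.array([mahalanobis_depth(x) for x in reference])
control_limit = np.quantile(depths_ref, 0.01)   # empirical lower limit at a ~1% false-alarm rate

new_obs = np.array([2.5, -2.0])                 # a possibly out-of-control observation
print("signal" if mahalanobis_depth(new_obs) < control_limit else "in control")
```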
https://doi.org/10.1142/9789812779861_0010
Multi-dimensional functional data, such as time series data and images from manufacturing processes, have been used for fault detection and quality improvement in many engineering applications such as automobile manufacturing, semiconductor manufacturing, and nano-machining systems. Extracting interesting and useful features from multi-dimensional functional data for manufacturing fault diagnosis is more difficult than extracting the corresponding patterns from traditional numeric and categorical data due to the complexity of functional data types, high correlation, and nonstationary nature of the data. This chapter discusses accomplishments and research issues of multi-dimensional functional data mining in the following areas: dimensionality reduction for functional data, multi-scale fault diagnosis, misalignment prediction of rotating machinery, and agricultural product inspection based on hyperspectral image analysis.
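A minimal sketch of one of the steps named above, dimensionality reduction of functional (time-series) data, using plain PCA on synthetic sensor profiles; the chapter's methods, such as wavelet-based multi-scale reduction, are more specialized and the signals here are assumptions.

```python
# Reduce 256-point sensor profiles to a handful of features usable for fault diagnosis.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256)
signals = np.array([np.sin(2 * np.pi * (3 + rng.normal(0, 0.1)) * t) + rng.normal(0, 0.05, t.size)
                    for _ in range(100)])             # 100 stand-in sensor profiles of length 256

pca = PCA(n_components=5).fit(signals)
features = pca.transform(signals)                     # low-dimensional features per profile
print("explained variance:", pca.explained_variance_ratio_.round(3))
```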
https://doi.org/10.1142/9789812779861_0011
In recent years, extracting useful information from enterprise data and subsequently making sense of the extracted knowledge are IT (information technology) activities of utmost importance to many organizations. Frequently, the extracted knowledge is represented in the form of rules. This chapter describes a hybrid approach that integrates rough sets, tabu search, and genetic algorithms (GAs) for extracting rules from enterprise data for maintenance. The intensification and diversification strategies of tabu search are embedded in a GA search engine, in a bid to facilitate rule extraction. A case study on the maintenance of bridge cranes in an organization was used to illustrate the effectiveness of the proposed hybrid approach. The extracted rules appear to be reasonable. The details of the hybrid approach, the results of a comparative study between a traditional GA search engine and a tabu-enhanced GA search engine, and the details of the case study are presented.
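A tiny illustrative sketch, not the chapter's hybrid: a genetic search over binary attribute masks for a single classification rule, with a set of recently visited masks used to force diversification, a crude stand-in for the tabu-enhanced GA. The maintenance data, rule form, and parameters are all assumptions.

```python
# Evolve a binary mask selecting the attributes of an IF-all-selected-attributes-are-1 rule.
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 8))                       # stand-in binary condition attributes
y = ((X[:, 0] == 1) & (X[:, 3] == 1)).astype(int)           # hidden "failure" rule to rediscover

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    fires = (X[:, mask == 1] == 1).all(axis=1)              # rule fires when all selected attributes are 1
    return (fires == y).mean()

pop = rng.integers(0, 2, size=(20, 8))
tabu = set()
for _ in range(50):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]                 # keep the fittest half
    children = []
    for _ in range(20):
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        child = np.where(rng.random(8) < 0.5, a, b)         # uniform crossover
        child = child ^ (rng.random(8) < 0.1)               # bit-flip mutation
        if tuple(child) in tabu:                            # tabu memory forces diversification
            child = child ^ (rng.random(8) < 0.3)           # perturb a recently visited solution further
        tabu.add(tuple(child))
        children.append(child)
    pop = np.array(children)

best = max(pop, key=fitness)
print("selected attributes:", np.flatnonzero(best), "rule accuracy:", fitness(best))
```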
https://doi.org/10.1142/9789812779861_0012
Workflow management systems are widely used by business enterprises as tools for administering, automating and scheduling business process activities with the available resources. Workflow models are the fundamental components of workflow management systems and are used for defining, scheduling, and ordering workflow tasks. Since the control flow specifications of workflows are designed manually, they entail assumptions and errors, leading to inaccurate workflow models. Moreover, companies increasingly follow flexible workflow models in order to adapt to changes in business logic, making it more challenging to understand or forecast process behavior. In this chapter we describe recently proposed techniques for optimizing business processes by analyzing the execution details of previously executed processes, stored as a workflow log. The applications of workflow mining that we describe include the (re)discovery of process models, the optimization of process models, and the development of mechanisms to predict the future behavior of a currently running invocation of a process.
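A minimal sketch, with a made-up event log, of one basic workflow-mining step: deriving directly-follows counts between tasks from recorded executions, the raw material from which process models are (re)discovered. Established discovery algorithms build on such relations; the log contents here are assumptions.

```python
# Count how often each task is directly followed by another across all logged traces.
from collections import Counter

log = [                                               # each trace: ordered tasks of one process instance
    ["register", "check stock", "ship", "bill"],
    ["register", "check stock", "bill", "ship"],
    ["register", "reject"],
]

follows = Counter()
for trace in log:
    for a, b in zip(trace, trace[1:]):
        follows[(a, b)] += 1

for (a, b), count in follows.most_common():
    print(f"{a} -> {b}: {count}")
```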
https://doi.org/10.1142/9789812779861_0013
In the rapidly expanding fields of cellular and molecular biology, fluorescence illumination and observation is becoming one of the techniques of choice for studying the localization and dynamics of proteins, organelles, and other cellular compartments, as well as for tracing intracellular protein trafficking. The automatic analysis of these images and signals in medicine, biotechnology, and chemistry is a challenging and demanding field. Signal-producing procedures based on microscopes, spectrometers and other sensors have found their way into wide areas of medicine, biotechnology, economics and environmental analysis. With this arises the problem of the automatic mass analysis of signal information. Signal-interpreting systems that automatically generate the desired target statements from the signals are therefore a compelling necessity. Continuing mass analyses on the basis of classical procedures would require investments of infeasible proportions. New procedures and system architectures are therefore required. Based on our flexible image analysis and interpretation system Cell_Interpret, we present new intelligent and automatic image analysis and interpretation procedures and demonstrate them in an application to HEp-2 cell pattern analysis.
https://doi.org/10.1142/9789812779861_0014
Support Vector Machine (SVM) methods have become a popular tool for predictive data mining problems and novelty detection. They show good generalization performance on many real-life datasets and are motivated theoretically through convex programming formulations. There are relatively few free parameters to adjust using cross validation, and the architecture of the SVM learning machine does not need to be found by experimentation, as in the case of Artificial Neural Networks (ANNs). We discuss the fundamentals of SVMs with emphasis on multiclass classification problems and applications in science, business and engineering.
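A minimal sketch of multiclass SVM classification using a one-vs-rest scheme, as a concrete instance of the methods surveyed above; the dataset and parameter settings are illustrative, not taken from the chapter.

```python
# Multiclass classification with one binary SVM per class (one-vs-rest).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))   # one binary SVM per class
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```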
https://doi.org/10.1142/9789812779861_0015
We review the ideas, algorithms, and numerical performance of manifold-based machine learning and dimension reduction methods. The representative methods include locally linear embedding (LLE), ISOMAP, Laplacian eigenmaps, Hessian eigenmaps, local tangent space alignment (LTSA), and charting. We describe the insights from these developments, as well as new opportunities for both researchers and practitioners. Potential applications in image and sensor data are illustrated. This chapter is based on an invited survey presentation that was delivered by Huo at the 2004 INFORMS Annual Meeting, which was held in Denver, CO, U.S.A.
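A minimal sketch applying two of the methods listed above, LLE and ISOMAP, to scikit-learn's Swiss-roll toy data; the neighbor and dimension settings are illustrative choices, not recommendations from the chapter.

```python
# Embed 3-D Swiss-roll points into 2-D with locally linear embedding and ISOMAP.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding, Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
iso = Isomap(n_neighbors=12, n_components=2)

print("LLE embedding shape:", lle.fit_transform(X).shape)
print("ISOMAP embedding shape:", iso.fit_transform(X).shape)
```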
https://doi.org/10.1142/9789812779861_0016
Most enterprise datasets are large, but some are very small for predictive purposes because of the expensive experiments, reduced budgets, or tight schedules required to generate them. The bootstrap approach is a method used frequently for small datasets in data mining. Numerous theoretical studies of the bootstrap have been carried out over the past two decades, but few have applied it to solving real-world manufacturing problems. Bootstrap methods provide an attractive option when model selection becomes complex due to small sample sizes and unknown distributions. In principle, bootstrap methods are more widely applicable than the jackknife method, and also more dependable. In this chapter we focus on selecting the best model based on prediction errors computed using the revised bootstrap method known as the 0.632 bootstrap. The models developed and selected are then clustered, and the best cluster of models is bagged to provide the minimum prediction errors. Numerical examples based on a small enterprise dataset illustrate how to use this procedure for selecting, validating, clustering, and bagging predictive regression models when sample sizes are small compared to the number of parameters in the model.
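A minimal sketch of the 0.632 bootstrap prediction-error estimate for a single regression model on a small synthetic dataset; the chapter goes further (model selection, clustering, and bagging of models), which is not reproduced here, and the data are assumptions.

```python
# 0.632 bootstrap: blend the optimistic resubstitution error with the out-of-bag error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 30                                                   # deliberately small sample
X = rng.normal(size=(n, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)

model = LinearRegression()
err_train = mean_squared_error(y, model.fit(X, y).predict(X))   # resubstitution (optimistic) error

oob_errors = []
for _ in range(200):                                     # bootstrap resamples
    idx = rng.integers(0, n, n)
    oob = np.setdiff1d(np.arange(n), idx)                # cases left out of this resample
    if oob.size == 0:
        continue
    model.fit(X[idx], y[idx])
    oob_errors.append(mean_squared_error(y[oob], model.predict(X[oob])))

err_632 = 0.368 * err_train + 0.632 * np.mean(oob_errors)   # the 0.632 estimator
print("0.632 bootstrap MSE estimate:", round(err_632, 3))
```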
https://doi.org/10.1142/9789812779861_bmatter