Open Access

A heuristic method for simulating open-data of arbitrary complexity that can be used to compare and evaluate machine learning methods

This work is supported by National Institutes of Health grants LM012601, AI116794, and DK112217.

Institute for Biomedical Informatics, University of Pennsylvania, D202 Richards Building, 3700 Hamilton Walk, Philadelphia, PA 19104, USA

Maksim Shestov

Institute for Biomedical Informatics, University of Pennsylvania, D202 Richards Building, 3700 Hamilton Walk, Philadelphia, PA 19104, USA

Peter Schmitt

Institute for Biomedical Informatics, University of Pennsylvania, D202 Richards Building, 3700 Hamilton Walk, Philadelphia, PA 19104, USA

Randal S. Olson

Institute for Biomedical Informatics, University of Pennsylvania, D202 Richards Building, 3700 Hamilton Walk, Philadelphia, PA 19104, USA

https://doi.org/10.1142/9789813235533_0024

Abstract:

A central challenge in developing and evaluating artificial intelligence and machine learning methods for regression and classification is access to data that illuminates the strengths and weaknesses of different methods. Open data plays an important role in this process by giving computational researchers easy access to real data for this purpose. Genomics has in some cases taken a leading role in the open data effort, starting with DNA microarrays. While real data from experimental and observational studies is necessary for developing computational methods, it is not sufficient, because the ground truth in real data cannot be known. Real data must therefore be accompanied by simulated data in which the balance between signal and noise is known and can be directly evaluated. Unfortunately, there is a lack of methods and software for simulating data with the kind of complexity found in real biological and biomedical systems. We present here the Heuristic Identification of Biological Architectures for simulating Complex Hierarchical Interactions (HIBACHI) method and prototype software for simulating complex biological and biomedical data. Further, we introduce new methods for developing simulation models that generate data specifically suited to discriminating between different machine learning methods.

Keywords:

Biocomputing 2018

Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License.
