A LINGUISTIC INTEGRATION OF A BIOLOGICAL DATABASE
One of the major theoretical concerns associated with the Human Genome Project is the methodology for deciphering “raw” DNA sequences. This work addresses a subsequent problem: how the huge amounts of already deciphered information that will emerge in the near future can be integrated to enhance our biological understanding. The formal foundations for a linguistic theory of the regulation of gene expression are discussed. The linguistic analysis presented here is restricted to sequences with known biological function for two reasons: i) there is no way to obtain, from DNA sequences alone, a regulatory representation of transcription units, and ii) the elements of substitution (methodologically equivalent to phonemes) are complete sequences of the binding sites of proteins.
We have recently collected and analyzed the regulatory regions of a large number of E. coli promoters. The set of sigma 70 promoters studied may well represent the largest homogeneous body of knowledge of gene regulation at present. This collection serves as the data set for the construction of a grammar of the sigma 70 system of transcription and regulation. The grammatical model generates all the arrays of the collection, as well as novel combinations predicted to be consistent with the principles of the data set. The grammar is testable, and it can be expanded if the analysis of emerging data requires it. The elaboration of a linguistic methodology capable of integrating prokaryotic data constitutes a preliminary step towards the analysis and integration of the more complex eukaryotic systems of regulation.
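
As a rough illustration of what it means for a grammar to generate every array in a collection and also predict novel combinations, the following minimal Python sketch enumerates the arrays derivable from a toy set of rewrite rules. Everything in it is an assumption made for illustration: the rule names (Array, Upstream, Downstream), the site categories (ActivatorSite, Promoter, RepressorSite), and the three "observed" arrays are hypothetical and are not taken from the actual grammar or data set.

```python
from itertools import product

# Hypothetical toy rewrite rules (an assumption for illustration only;
# this is NOT the grammar of the sigma 70 collection).
# Symbols that do not appear as keys are treated as terminals,
# i.e. whole protein binding sites used as units of substitution.
GRAMMAR = {
    "Array": [["Upstream", "Promoter", "Downstream"]],
    "Upstream": [[], ["ActivatorSite"], ["ActivatorSite", "ActivatorSite"]],
    "Downstream": [[], ["RepressorSite"]],
}


def expand(symbol):
    """Return every terminal sequence derivable from `symbol`.

    The toy grammar is non-recursive, so the enumeration terminates.
    """
    if symbol not in GRAMMAR:
        return [(symbol,)]
    results = []
    for production in GRAMMAR[symbol]:
        # Expand each symbol of the production, then combine the alternatives.
        parts = [expand(s) for s in production]
        for combo in product(*parts):
            results.append(tuple(t for part in combo for t in part))
    return results


# All arrays the toy grammar can generate.
GENERATED = set(expand("Array"))

# Hypothetical "collection" of observed arrays (made up for this sketch).
OBSERVED = {
    ("Promoter",),
    ("ActivatorSite", "Promoter"),
    ("Promoter", "RepressorSite"),
}

# Arrays the grammar predicts but the collection does not contain.
NOVEL = GENERATED - OBSERVED

if __name__ == "__main__":
    print("covers collection:", OBSERVED <= GENERATED)
    for array in sorted(NOVEL):
        print("predicted:", " ".join(array))
```

In this sketch, testing the grammar amounts to checking that every observed array is in the generated set, and expanding it amounts to adding productions when a newly observed array falls outside that set.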