DNA AS A LANGUAGE
The classical triplet code is not the only code carried by nucleotide sequences. They also contain, for example, the gene-splicing code, transcription codes, and many others. By analyzing the large volume of available nucleotide sequences, i.e., by performing various computer experiments on them, one can decipher these codes and extract valuable biological information. At the DNA level there are at least two more codes: the DNA shape code and the chromatin code. The overall DNA shape is sequence-dependent and can be described by a set of angles characteristic of the various dinucleotide elements, the codons of the DNA shape code. The chromatin code tells histone octamers where along the DNA to form nucleosomes. This code is expressed as a positional periodicity of, primarily, AA and TT dinucleotides. A new RNA code has also been described: the translation framing code. The reading frame seems to be maintained by a synchronizing pattern, GCUGCUGCU…, hidden in mRNA. Most enigmatic of all is, perhaps, the gene-splicing code. An interesting recent development indicates that the gene-splicing pattern in the sequences and the nucleosomal pattern have some features in common. This reflects the superposition of patterns that is characteristic of the sequence language in general, which carries many codes simultaneously in one and the same text. This superposition increases the complexity of the sequences. Analysis of protein-coding sequence complexity in eukaryotes and prokaryotes revealed that the eukaryotic sequences are simpler. This is interpreted as the result of a spatial separation of the triplet code (carried by exons) and the chromatin code (carried by introns). Perhaps the need to separate otherwise conflicting codes is one of the reasons why intervening sequences were introduced at all. The nucleotide sequences are written continuously, with no separators between words.
One way to detect “words” in such a continuous text is to evaluate their degree of internal correlation by calculating contrast values for the words. This technique allows one to derive vocabularies that are species- and function-specific. Nucleotide sequences thus carry numerous superimposed messages. We understand only a few of these messages, while many more await deciphering.
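
As a rough illustration of the contrast idea, the sketch below scores each word of length k by comparing its observed count with the count expected from its two overlapping subwords of length k-1. The expectation formula, the normalization by the square root of the expected count, and the names `count_words` and `contrast_values` are assumptions made for this example. The text does not give the exact definition of contrast, so this is one plausible reading and not necessarily the author's procedure.

```python
from collections import Counter
from math import sqrt


def count_words(seq, k):
    """Count all overlapping substrings of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def contrast_values(seq, k):
    """Return {word: (observed, expected, contrast)} for words of length k.

    Assumed definitions (not taken from the source text):
      expected = N(prefix) * N(suffix) / N(core)
      contrast = (observed - expected) / sqrt(expected)
    where prefix and suffix are the two (k-1)-letter subwords of the word
    and core is the (k-2)-letter overlap between them.
    """
    if k < 3:
        raise ValueError("k must be at least 3 so that the core is non-empty")
    full = count_words(seq, k)
    sub = count_words(seq, k - 1)
    core = count_words(seq, k - 2)
    result = {}
    for word, observed in full.items():
        # Every observed word contains its subwords, so all counts are >= 1.
        expected = sub[word[:-1]] * sub[word[1:]] / core[word[1:-1]]
        result[word] = (observed, expected, (observed - expected) / sqrt(expected))
    return result


if __name__ == "__main__":
    # Toy sequence for demonstration only; real analyses need long sequences.
    scores = contrast_values("ACGACGACGTTTACGAAATTT", 3)
    ranked = sorted(scores.items(), key=lambda item: item[1][2], reverse=True)
    for word, (obs, exp, contrast) in ranked[:5]:
        print(f"{word}\tobserved={obs}\texpected={exp:.2f}\tcontrast={contrast:.2f}")
```

Words with a high positive contrast occur more often than their subwords alone would predict, which makes them candidates for a vocabulary. A contrast near zero means the word is explained by its parts.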