By Kevin E. Noonan --
While Craig Venter is trying to synthesize a "minimal genetic complement" bacterium (see "Patent Life (Really)"), a consortium of 35 research groups from 80 research centers is attacking the problem from the other end of the phylogenetic tree: what is needed (minimally) to encode a human being? Known as the ENCODE (ENCyclopedia Of DNA Elements) Project group, the consortium operates under the auspices of, and with financial support from, the National Human Genome Research Institute (NHGRI). The group has formally announced the publication of a synthesizing article in Nature and the concomitant publication of 28 separate supporting papers in the journal Genome Research. Its latest results further distinguish the structure and complexity of the mammalian genome from those of the more efficiently designed bacterial genome.
In its report ("Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project," Nature 447:799-816), the ENCODE group presents results showing that a majority of the DNA sequences studied are transcribed into RNA, including both gene sequences and sequences understood to be non-coding - "junk" DNA - and that these primary transcripts are overlapping, i.e., they start and stop at a more diverse array of sites than previously appreciated. Overlapping genes were detected in 224 loci, and 180 of these contained at least one exon from an upstream gene. As the authors explain, "[i]nstead of the traditional view that many genes have one or more alternative transcripts that code for alternative proteins, our data suggest that a given gene may both encode multiple protein products and produce other transcripts that include sequences from both strands and from neighbouring loci (often without encoding a different protein)." An illustrative example is a fusion transcript consisting of "at least" three coding exons from the ATP5O gene and two coding exons from the DONSON gene, expressed in small intestine.
An important caveat to these results is that they are limited to a review of just 1% of the human genome (44 genomic region targets, 30 million basepairs). Within the studied sequences, the authors did not find the kind of transcriptional distinctions between coding and non-coding DNA that the traditional view would predict. The authors found more than ten times as many transcriptional "start" sites associated with regulatory sequences as there are genes known to reside in the target loci. Moreover, about 50% of the time there did not appear to be any difference in how evolutionarily conserved "gene" sequences were compared with non-coding DNA. This suggests that evolution is operating not at the level of "conserved" genes but on interspersed elements, whose regulation-in-context (i.e., within the surrounding non-coding DNA) is also subject to evolutionary pressures. Alternatively, this type of element-selection evolution could result in a plurality of alternative elements making up any particular portion of an encoded protein, and thus represent a "warehouse for natural selection."
These results have interesting consequences for protecting expressed sequence tag (EST) sequences, since, if confirmed, they strike at the underlying rationale for the utility of such sequences. Under the traditional paradigm, in which discrete "gene" sequences are the templates for transcription, the existence of an EST in a tissue, and particularly the differential expression of a particular EST in a tissue, was assumed to be significant and to reflect a cell-, tissue-, or organ-specific gene expression event. If, on the other hand, there is a more general level of transcription, the utility that has been assumed for ESTs is at best highly questionable.
These results also reinforce the message from the sequencing of the human genome by the Human Genome Project (HGP) at the turn of the century: we are at the beginning, not the end, of the road towards understanding how the decoded sequence information is organized and used by the cell. This report, like others arising directly from the HGP, indicates that mammalian genomes are much more complex than bacterial genomes and depend more upon assortment, shuffling, and RNA tailoring (splicing, etc.) for mediating gene expression. Indeed, the concepts of gene transcription elucidated over the past 40 years in lower organisms are likely to be seriously inadequate for understanding mammalian cell biology. As a consequence, a paradigm shift is underway to accommodate the realities of mammalian cell biology revealed by reports such as the ENCODE report, and to adapt our thinking about mammalian gene expression and genome structure to conform to our DNA and not the other way around.
ENCODE data can be accessed here.