By Kevin E. Noonan --
While Craig Venter is trying to synthesize a "minimal genetic complement" bacterium (see "Patent Life (Really)"),
a consortium of 35 research groups from 80 research centers is
attacking the problem from the other end of the phylogenetic tree:
what is needed (minimally) to encode a human being? Known as the ENCODE (ENCyclopedia Of DNA Elements) Project group, the consortium operates under the auspices, and with the financial support, of the National Human Genome Research Institute (NHGRI). The latest results,
formally announced with the publication of a synthesizing article in Nature and the concomitant publication of 83 separate supporting papers in the journal Genome Research,
further distinguish the structure and complexity of
the mammalian genome from those of the more efficiently designed
bacterial genome.
In its report ("Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project," Nature
447:799-816), the ENCODE group presents results showing that a majority
of the DNA sequences studied are transcribed into RNA, including both
gene sequences and sequences understood to be non-coding ("junk" DNA),
and that these primary transcripts are overlapping, i.e., they
start and stop at a more diverse array of sites than previously
appreciated. Overlapping genes were detected in 224 loci, and 180 of
these contained at least one exon from an upstream gene. As the
authors explain, "[i]nstead of the traditional view that many genes
have one or more alternative transcripts that code for alternative
proteins, our data suggest that a given gene may both encode multiple
protein products and produce other transcripts that include sequences
from both strands and from neighbouring loci (often without encoding a
different protein)." An illustrative example is a fusion transcript
consisting of "at least" three coding exons from the ATP5O gene and two coding exons from the DONSON gene, expressed in small intestine.
An important caveat to these results is that they are limited to a review of only 1% of the human genome (44 genomic region targets, 30 million basepairs). Within the studied sequences, the authors did not find the kind of transcriptional distinctions between coding and non-coding DNA that the traditional view would lead one to expect. The authors found more than ten times as many transcriptional "start" sites associated with regulatory sequences as there are genes known to reside in the target loci. Moreover, about 50% of the time, "gene" sequences did not appear to be any more evolutionarily conserved than non-coding DNA, suggesting that evolution is operating not at the level of "conserved" genes but on interspersed elements, whose regulation-in-context (i.e., within the surrounding non-coding DNA) was also the subject of evolutionary pressures. Alternatively, this type of element-selection evolution could result in a plurality of alternative elements making up any particular portion of an encoded protein, and thus represent a "warehouse for natural selection."
These results have interesting consequences for protecting expressed sequence tag (EST) sequences, since, if confirmed, they strike at the underlying rationale for the utility of such sequences. Under the traditional paradigm, in which discrete "gene" sequences are the templates for transcription, the existence of an EST in a tissue, and particularly the differential expression of an EST in a tissue, was assumed to be significant and to reflect a cell-, tissue-, or organ-specific gene expression event. If, on the other hand, there is a more general level of transcription, the utility that ESTs have been assumed to possess is at best highly questionable.
These results also reinforce the message from sequencing the human genome by the Human Genome Project
(HGP) at the turn of the century that we are at the beginning, not the
end, of the road towards understanding how the decoded sequence
information is organized and used by the cell. This report, like
others arising directly from the HGP, indicates that mammalian genomes
are much more complex and depend more upon assortment, shuffling, and
RNA tailoring (splicing, etc.) for mediating gene expression than do
bacterial genomes. Indeed, the concepts of gene transcription elucidated
over the past 40 years in lower organisms are likely to be seriously
inadequate for understanding mammalian cell biology. As a consequence,
a paradigm shift is underway to accommodate the realities of
mammalian cell biology revealed by reports such as the ENCODE report, and to adapt
our thinking about mammalian gene expression and genome structure to
conform to our DNA and not the other way around.
ENCODE data can be accessed here.