Genome Sequencing
Created | Updated Mar 11, 2004
The genome of a cell or organism is the complete set of genetic material present in the DNA of its chromosomes.
This genetic information is carried by DNA. Each molecule of DNA consists of two strands coiled round each other to form a double helix, a structure like a spiral ladder.
Each rung of the ladder consists of chemical groups called bases, of which there are four types: A, C, G and T. They combine in specific pairs (A with T, C with G), so that the sequence on one strand of the double helix is complementary to that on the other. It is the specific sequence of bases that constitutes the genetic information.
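The pairing rule can be illustrated with a few lines of Python. This is only a minimal sketch: the names PAIRS and complement are invented for the illustration, and it simply applies the A-with-T, C-with-G rule base by base.

```python
# Pairing rules from the text: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the partner base for every base in the strand, in the same order."""
    return "".join(PAIRS[base] for base in strand)

print(complement("ATCG"))  # prints TAGC
```

Because each base has exactly one partner, knowing one strand is enough to work out the other.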
If a genome were a book written in DNA, the bases would be its letters. Codons (triplets of bases that each code for one amino acid) would be its words, the genes would be its paragraphs, and the chromosomes would be its chapters.
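To make the 'words' of the analogy concrete, here is a small Python sketch that splits a run of letters into three-letter codons. The function name codons and the example sequence are made up for the illustration, and the sketch assumes reading starts at the first letter.

```python
def codons(sequence):
    """Split a DNA sequence into consecutive three-base codons.

    Any incomplete triplet left over at the end is dropped.
    """
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]

print(codons("ATGGCCTAA"))  # prints ['ATG', 'GCC', 'TAA']
```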
Genome sequencing projects aim to work out the exact order of the base pairs (letters) in the DNA in all of the chromosomes (chapters) in an organism's genome (book). Once scientists have deciphered the letters of the book they can make sense of the whole genome, and obtain a solid understanding of the fundamental make-up of the organism that has been sequenced. Genome sequencing of an organism is the process of producing a detailed analysis of the DNA in its chromosomes.
This Entry describes:
- what has happened so far in this relatively new field
- how genomes are actually sequenced
- why genomic sequence information is useful.
In addition, some examples of notable genome sequencing projects are given.
The Picture So Far
In July 1995 the bacterium Haemophilus influenzae became the first free-living organism to have its entire genome sequenced. The following year, baker's yeast, Saccharomyces cerevisiae, became the first eukaryotic organism to have its genome sequenced. Since then, much progress has been made. According to the National Center for Biotechnology Information, 126 free-living organisms have had their whole genomes sequenced to date. Of these 126, 112 are bacteria and the other 14 are eukaryotes. Below is a sample table comparing the genome sizes and gene content of a selection of organisms:
Organism | Genome Size in Base Pairs | Number of Genes (approx.) |
 | 4 × 10^6 | 4,300 |
Anthrax (Bacillus anthracis) | 5 × 10^6 | 4,700 |
Yeast (Saccharomyces cerevisiae) | 2 × 10^7 | 6,000 |
Caenorhabditis elegans (a nematode species) | 8 × 10^7 | 19,000 |
Drosophila melanogaster (a species of fruit fly) | 2 × 10^8 | 14,000 |
Mouse (Mus musculus) | 3 × 10^9 | 30,000 |
Human (Homo sapiens) | 3 × 10^9 | 30,000 |
Rice (Oryza sativa) | 4 × 10^11 | 46,000 |
As you can see from the above table, humans, the most complex organisms in the list, have fewer genes and a smaller genome than the most humble of plants, rice. There seems to be a discrepancy between the complexity of an organism and the size of its genome or the number of genes present. This has been noted in the past and has been dubbed the C-value paradox. Modern DNA sequencing technology has reinforced this paradox, which remains unresolved today.
DNA Sequencing Technology
Modern DNA sequencing is a highly automated process. The DNA of the genome is chopped up into random fragments and engineered into large plasmids (independently replicating circular rings of DNA in bacteria) called Bacterial Artificial Chromosomes or BACs for short. The bacteria, usually E. coli, whilst reproducing, help make thousands of copies of the BACs - a process called cloning. In genomic sequencing, it's these BACs that are directly sequenced.
DNA is a polymer of nucleotides, of which there are four types, each containing one of the four bases A, C, G and T. If DNA is a ladder then the base pairs are the rungs, with the two bases of each pair held together by hydrogen bonds. During normal DNA synthesis the ladder of the double helix is unzipped into two strands by breaking the hydrogen bonds between the base pairs, a job done in the cell by an enzyme called helicase. To make an exact copy of the DNA, an enzyme called DNA polymerase then adds new nucleotides to the exposed bases, making a complete ladder once more.
During DNA sequencing, special nucleotides are added to the mix. When one of them is incorporated into the newly-synthesised strand it terminates the reaction, halting the enzyme in its tracks. These 'terminator nucleotides' are also tagged with a coloured fluorescent marker. So no more synthesis of that strand of DNA takes place, and its final nucleotide is tagged. This happens at the same time in many different DNA molecules, each incorporating a terminator nucleotide at a random position, creating thousands of strands that terminate at different bases.
The result of this activity actually gives us the sequence of the base pairs in any given DNA sample, including any BAC, like the ones mentioned in the first paragraph of this section.
For example, for this very tiny DNA strand:
CGCTCTAGCTCGACACATG
We would know its sequence by randomly creating the following DNA strands. The last letter of each strand is its fluorescent-tagged terminator nucleotide:
C
CG
CGC
CGCT
CGCTC
CGCTCT
CGCTCTA
CGCTCTAG
CGCTCTAGC
CGCTCTAGCT
CGCTCTAGCTC
CGCTCTAGCTCG
CGCTCTAGCTCGA
CGCTCTAGCTCGAC
CGCTCTAGCTCGACA
CGCTCTAGCTCGACAC
CGCTCTAGCTCGACACA
CGCTCTAGCTCGACACAT
CGCTCTAGCTCGACACATG
An automated DNA sequencer draws the DNA through a capillary and past a laser beam, which excites the fluorescent tag. Each of the four terminator nucleotides has its own colour, and a camera detects which colour is fluorescing, thereby detecting which base is present. Longer DNA strands move through the capillary more slowly, so the shortest is read first (in the above case, one base long), then the next fragment (two bases long), then the next (three), and so on in the above order. A picture builds up, nucleotide by nucleotide, of the exact order of the nucleotides, and therefore of the base pairs, in the DNA being sequenced.
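The read-out step can be mimicked in a short Python sketch. It is a toy model rather than a description of any real instrument: the dye colours in DYE are invented, the function names are hypothetical, and it assumes exactly one fragment exists for every possible stopping point.

```python
import random

# Hypothetical dye colours, one per base; real instruments use their own dye sets.
DYE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR = {colour: base for base, colour in DYE.items()}

def terminated_fragments(template):
    """Return (length, dye colour) for one fragment per stopping point.

    The fragments are shuffled to mimic the jumble in the reaction tube.
    """
    fragments = [(n, DYE[template[n - 1]]) for n in range(1, len(template) + 1)]
    random.shuffle(fragments)
    return fragments

def read_sequence(fragments):
    """Shortest fragments pass the detector first; each colour seen becomes a base."""
    return "".join(BASE_FOR[colour] for _, colour in sorted(fragments))

print(read_sequence(terminated_fragments("CGCTCTAGCTCGACACATG")))
# prints CGCTCTAGCTCGACACATG
```

Sorting by length stands in for the capillary, and looking up the colour stands in for the camera. Only the final base of each fragment is ever observed, yet the full sequence is recovered.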
Why Sequence Genomes?
There are numerous benefits to possessing a full genome sequence of an organism. Here are a few:
Basic Research
To scientists, possessing a huge database of DNA sequence for the organism they are interested in is a huge boon to its basic genetic analysis. A genome acts as a jumping-off point for all future genetics studies in that organism. For example, the Human Genome Project is the basis for a vast array of pharmaceutical research. Pharmaceutical companies and academic institutions alike have recognised the potential applications of finding new genes through mining the human genome. Newly-discovered genes suggest new molecular targets that could lead to the development of new drugs.
Pharmacogenomics
In the future, individuals will be able to have pharmaceuticals tailored to their own genome sequences. Rapid advances in sequencing technology should make it possible to have your very own genome sequenced in a very short time. Humans differ by only 0.2% of their DNA; this accounts for all the variation across humanity, regardless of race or ethnicity. We are more similar to one another than we once thought. These differences are also responsible for a large number of diseases. The human genome sequence will allow science to identify precisely which genes go wrong when we develop disease. Once we know this vital information, we open up a tremendous number of potential therapies, including gene therapy, where the defective genomic DNA is replaced with a functional copy.
Comparative Genomics
Biology has often used comparative studies to increase our understanding of organisms and their relations to one another, and molecular biology is no exception. Comparative genomics analyses and compares the genomes of different species. Its purpose is to grasp how species have evolved and to gain an understanding of the function of genes and non-coding sequences in genomes. It looks at similarity in gene sequence, in gene order and location, in the number of exons within a gene, and in the amount of non-coding DNA within a genome, between organisms as diverse as bacteria and mammals.
Phylogenomics
Gene function is usually determined by comparative genomics. When an unknown gene sequence is compared to that of a known gene, the more similar they are, the more likely they are to have the same function and the more likely they are to come from closely related organisms. This is a pretty good assumption for most cases, but it can yield inaccurate and misleading results because it doesn't take into account how the gene got to be the way it is, i.e., how it evolved.
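The crudest measure of how similar two sequences are is the percentage of positions at which they carry the same base. The Python sketch below assumes the two sequences have already been lined up and are the same length; real comparison tools first work out the best alignment, which this example does not attempt. The function name percent_identity is made up for the illustration.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two pre-aligned sequences match."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

print(percent_identity("ACGTACGT", "ACGTACGA"))  # prints 87.5
```

A high score hints at a shared function, but as noted above it says nothing about how either gene evolved.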
During evolution gene functions change as genes adapt to the environment. Because of this, it is possible to reconstruct the evolutionary history of genes to help predict the function of newly-discovered uncharacterised genes. The sheer amount of DNA sequence being dumped into the public domain means that genomic sequence is becoming a valuable resource for people wanting to predict gene function, which is especially relevant as more genomes are sequenced.
Projects
A genome is typically very, very big indeed (especially non-bacterial genomes). For example, the human genome consists of three billion letters, which would fill two hundred 500-page phone books. If you felt like reciting the human genome at one base per second, it would take a century to do. So it is no surprise to learn that sequencing an entire genome is a feat of Herculean proportions, and that's why such projects are either huge international collaborations or highly expensive corporate ventures. Either way, sequencing a genome takes a while.
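The figures above are easy to check with a little arithmetic, sketched here in Python. The variable names are invented for the illustration, and a year is taken as 365.25 days.

```python
GENOME_LETTERS = 3_000_000_000          # three billion bases
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# Reciting one base per second, non-stop.
years_to_recite = GENOME_LETTERS / SECONDS_PER_YEAR

# Two hundred phone books of 500 pages each.
letters_per_page = GENOME_LETTERS / (200 * 500)

print(round(years_to_recite))   # prints 95
print(int(letters_per_page))    # prints 30000
```

So 'a century' is a fair rounding of roughly 95 years, and each phone-book page would need to hold about 30,000 letters.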
The Human Genome Project
In June 2000 it was announced that a rough draft of the human genome had been obtained by both an international consortium of publicly-funded scientists collectively known as the Human Genome Project (HGP) and Celera Genomics, a privately-held company. In April 2003 it was announced that the accurate, finished map of the genome was complete, two years ahead of schedule.
The HGP has rightly been claimed as one of mankind's greatest achievements. It is routinely compared in significance with the splitting of the atom, the invention of the wheel, and the 1969 Moon landings, though the HGP's scientific importance and value to humanity far surpass these events. The HGP is the first step in understanding humans at the molecular level. Its impact is already being felt in the world of pharmaceuticals, where thousands of new drug targets suddenly became available, and in diagnostics, where it will become possible to identify genetic disorders such as cancer and Huntington's disease rapidly, before it's too late or at a time when therapeutics can have maximum efficacy.
'Just one part of this work - the sequencing of chromosome 20 - has already accelerated the search for genes involved in diabetes, leukaemia and childhood eczema. We shouldn't expect immediate major breakthroughs but there is no doubt we have embarked on one of the most exciting chapters of the book of life.'
- Professor Allan Bradley, Director of The Wellcome Trust Sanger Institute, Cambridge, UK
The genome is a treasure trove of information that will take science decades to fully exploit. As Prof Bradley says, we haven't even begun to see any major breakthroughs because of the human genome yet, but almost all drug research involves genomic sequence from the HGP at some point during its development.
The HGP has given us insights into human evolution. Comparisons of the human genome to the mouse genome will tell us how we differ evolutionarily. If we compare ourselves to the single-celled eukaryote yeast, then we might be able to tell which genes are necessary for basic and vital cellular functions. Very different animals, such as humans and fish, both have livers, circulatory systems and other shared structures. Genome sequencing can answer the question: if we share functions with things as distantly related as fish, are they using the same basic construction kit as us? The answer is 'yes', as the recent sequencing of the fugu fish by the Joint Genome Institute has shown. We have genes in common with all organisms, from invertebrates, to plants, to microbes.
One of the surprises of the project was that it takes only 30,000 genes to make a human, about the same number as for mice and other mammals. Previous estimates were much higher, ranging from 60,000 to 100,000 genes. The strange thing is that there don't seem to be enough genes to do the job. Genes in a genome must interact with each other in unforeseen ways to create the massively complex mammalian body and physiology. And perhaps more significantly, it is the differences between those interactions that separate humans from other mammals, not just a sheer difference in the number of genes or the amount of DNA. The Human Genome Project will revolutionise our understanding of human evolution, and of what it takes to build a person.
The Rice Genome
Rice is mankind's most important food crop; it is the staple food for half of the world's population. The genome sequence information should speed up the development of higher-yielding, tougher plants to help feed Asia's burgeoning population. Syngenta, a Swiss-based agribusiness, sequenced the Nipponbare rice subspecies, whilst the Beijing Genomics Institute sequenced the Indica subspecies. Both were formally published in the journal Science in April 2002. The two groups have since collaborated and produced a version of the rice genome based upon both copies.
Rice was chosen as a candidate for sequencing both for its importance as a food crop and for its evolutionary relationship to other cereals such as maize, wheat, and barley. The rice genome is very large, larger than the human genome, but it is comparatively simple when compared to those of the other cereals. It therefore acts as a model genome for all cereals, which between them cover all of the world's staple foods. Understanding the genome of rice is very important because rice shows synteny with other cereals, so we can learn from rice why the order of genes seems to be so important in a genome.
Malaria Genomes
In sub-Saharan Africa malaria kills at least one person every 30 seconds. It is the biggest killer in Africa, and every year over one million people die from it worldwide. The mosquito Anopheles gambiae is the main transmitter of the malarial parasite Plasmodium falciparum. This particular mosquito is the major distributor of malaria; it even prefers to prey upon humans, whom it can bite hundreds of times a day.
In October 2002 an international consortium simultaneously announced the sequencing of Anopheles and its Plasmodium parasite. Science now has the genomes of the malarial parasite, its vector, and its host. Researchers have already identified nearly 200 genes in the Plasmodium genome that produce proteins that help it evade the body's defence mechanisms. They have also identified 276 genes in the mosquito genome that are critical to the sensory systems that help it to identify its human prey. Never before has medicine had every single genetic target at its disposal for the development of new drugs and vaccines that can attack the parasite directly, stop the mosquito vector transmitting the parasite, or make people immune to the effects of malaria.
Conclusion
Genome sequencing projects aim to determine the exact position of all the base pairs in a species' DNA. Genomics, and specifically genome sequencing, is an unusually holistic approach to biology, especially as science and molecular biology tend to be reductionist in nature. Many organisms have been sequenced to date, and the number keeps on increasing and will do for the foreseeable future. What these projects have shown us is that the complexity of a species is not related to the complexity or size of its genome. Genome sequencing technology has been extensively automated and is a procedure that can generate a lot of DNA sequence with relatively little effort.
There are many benefits to genome sequencing. It is a boost to basic research and development, it will have massive implications for future medicine, and it will increase our understanding of the evolutionary relationships not only between us and the rest of the animal kingdom, but between all species of all kingdoms. Genome sequencing projects, such as the glamorous Human Genome Project, are enormous international collaborative efforts that aim to generate all of the sequence that is so useful to scientists globally. These projects range from rice to malaria, from fish to bacteria.
Genome sequencing is just the beginning step of a new kind of exploration. It is the start of medicine and biology based on a thorough understanding of the underlying processes of life itself. Genome sequencing has spawned many other disciplines, such as proteomics, where scientists try to figure out all of the proteins that the genome codes for; and regulatory genomics, where researchers aim to find out how genes are switched on and off and how they interact with one another. Genome sequencing gives scientists a depth of information never before dreamed of, and has opened up thousands of potential research avenues that would not be possible without the genomic DNA sequence.