Biological Research

Print version ISSN 0716-9760

Biol. Res. vol.35 no.3-4 Santiago  2002

http://dx.doi.org/10.4067/S0716-97602002000300013 

Biol Res 35: 385-399, 2002

 

Plant genomics: an overview

HUGO CAMPOS-DE QUIROZ

Trait Genetics. Semillas Pioneer Chile Ltda. Coyancura 2241, Piso 3, Providencia, Santiago, CHILE.
email: hugo.campos@pioneer.com

ABSTRACT

Recent technological advancements have substantially expanded our ability to analyze and understand plant genomes and to reduce the gap existing between genotype and phenotype. The fast evolving field of genomics allows scientists to analyze thousands of genes in parallel, to understand the genetic architecture of plant genomes and also to isolate the genes responsible for mutations. Furthermore, whole genomes can now be sequenced. This review addresses these issues and also discusses ways to extract biological meaning from DNA data. Although genomic issues are addressed from a plant perspective, this review provides insights into the genomic analyses of other organisms.

Key terms: genomics, Arabidopsis thaliana, Oryza sativa, plant breeding, gene discovery

INTRODUCTION

Until very recently, the molecular analysis of plants often focused on the single gene level. Recent technological advances have changed this paradigm, enabling the analysis of organisms in terms of genome organization, expression and interaction. The study of the way genes and genetic information are organized within the genome, the methods of collecting and analyzing this information, and how this organization determines their biological functionality is referred to as genomics. Genomic approaches are permeating every aspect of plant biology, and since they rely on DNA-coded information, they expand molecular analyses from a single to a multispecies level. Plant genomics is reversing the previous paradigm of identifying genes behind biological functions and instead focuses on finding biological functions behind genes. It also reduces the gap between phenotype and genotype and helps to comprehend not only the isolated effect of a gene, but also the way its genetic context and the genetic networks it interacts with can modulate its activity. This review is organized into two main sections. The first deals with the current understanding of plant genomes, their genetic structure at the inter- and intra-species level and how whole genomes are sequenced. The second addresses some approaches used to achieve the final aim of genomics: finding the biological and functional significance of DNA sequence.

1 THE GENETIC STRUCTURE OF PLANT GENOMES

Plant genomes are best described in terms of genome size, gene content, extent of repetitive sequences and polyploidy/duplication events. Although plants also possess mitochondrial and chloroplast genomes, their nuclear genome is the largest and most complex. There is extensive variation in nuclear genome size (Table I) without obvious functional significance of such variation (Rafalski, 2002).

Plant genomes contain various repetitive sequences and retrovirus-like retrotransposons containing long terminal repeats and other retroelements, such as long interspersed nuclear elements and short interspersed nuclear elements (Kumar and Bennetzen, 1999). Retroelement insertions contribute to the large difference in size between colinear genome segments in different plant species and to the 50% or more difference in total genome size among species with relatively large genomes, such as corn. They contribute a smaller percentage of genome size in plants with smaller genomes such as Arabidopsis (The Arabidopsis Genome Initiative, 2000). If other repetitive sequences are accounted for, the corn genome consists of over 70% repetitive sequences and 5% protein-encoding regions (Meyers et al., 2001).

It is widely accepted that 70-80% of flowering plants are the product of at least one polyploidization event (Barnes, 2002). Many economically important plant species, such as corn, wheat, potato, and oat, are either ancient or more recent polyploids, comprising more than one, and in cases such as wheat, three different homologous genomes within a single species. Duplicated segments also account for a significant fraction of the rice genome. About 60% of the Arabidopsis genome is present in 24 duplicated segments, each more than 100 kilobases (kb) in size (Bevan et al., 2001). Ancestral polyploidy contributes to the creation of genetic variation through gene duplication and gene silencing. Genome duplication and subsequent divergence is an important generator of protein diversity in plants.

1.1 Model plant species

Model organisms (Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae) provide genetic and molecular insights into the biology of more complex species. Since the genomes of most plant species are either too large or too complex to be fully analyzed, the plant scientific community has adopted model organisms. They share features such as being diploid and appropriate for genetic analysis, being amenable to genetic transformation, having a (relatively) small genome and a short growth cycle, having commonly available tools and resources, and being the focus of research by a large scientific community. Although the advent of tissue culture techniques fostered the use of tobacco and petunia, the species now used as model organisms for mono- and dicotyledonous plants are rice (Oryza sativa) and Arabidopsis (Arabidopsis thaliana) respectively.

Arabidopsis, a small Cruciferae plant without agricultural use, sets seed in only 6 weeks from planting, has a small genome of 120 Megabases (Mb) and only five chromosomes. Extensive tools are available for its genomic analysis, including the whole genome sequence, Expressed Sequence Tag (EST) collections, characterized mutants and large populations mutagenized with insertion elements (transposons or the T-DNA of Agrobacterium). Arabidopsis can be genetically transformed on a large scale with Agrobacterium tumefaciens and biolistics. Other tools available for this model plant are saturated genetic and physical maps.

Unlike Arabidopsis, rice is one of the world's most important cereals. More than 500 million tons of rice is produced each year, and it is the staple food for more than half of the world's population. There are two main rice subspecies. Japonica is mostly grown in Japan, while indica is grown in China and other Asia-Pacific regions. Rice also has very saturated genetic maps, physical maps, whole genome sequences, as well as EST collections pooled from different tissues and developmental stages. It has 12 chromosomes, a genome size of 420 Mb, and like Arabidopsis, it can be transformed through biolistics and A. tumefaciens. Efficient transposon-tagging systems for gene knockouts and gene detection have not yet become available for saturation mutagenesis in rice, although some recent successes have been reported.

1.2 Maps

1.2.1 Genetic maps

The development of molecular markers has allowed the construction of complete genetic maps for most economically important plant species. Molecular markers detect genetic variation directly at the DNA level. A myriad of molecular marker systems are available, yet their description lies beyond the scope of this paper. A genetic map represents the ordering of molecular markers along chromosomes as well as the genetic distances, generally expressed as centiMorgans (cM), existing between adjacent molecular markers. Genetic maps in plants have been created from many experimental populations, but the most frequently used are F2, backcrosses and recombinant inbred lines. Although they take longer to develop, recombinant inbred lines offer a higher genetic resolution and practical advantages. Once a mapping population has been created, it takes only a few months to produce a genetic map with a 10 cM resolution (Figure 2a). Genetic maps contribute to the understanding of how plant genomes are organized and, once available, they facilitate the development of practical applications in plant breeding, such as the identification of Quantitative Trait Loci and Marker Assisted Selection. Most economically important plant traits, such as yield, plant height and quality components, exhibit a continuous distribution rather than discrete classes and are regarded as quantitative traits. These traits are controlled by several loci, each of small effect, and different combinations of alleles at these loci can give different phenotypes.

Quantitative Trait Loci analysis refers to the identification of genomic regions associated with the phenotypic expression of a given trait. Once the location of such genomic regions is known, they can be assembled into designer genotypes, i.e. individuals carrying chromosomal fragments associated with the expression of a given phenotype. The most important feature of Marker Assisted Selection is that once a molecular marker genetically linked to a phenotypically interesting allele has been detected, indirect selection for that allele can be based upon the detection of the molecular marker, since little if any genetic recombination will occur between them. Therefore, the presence of the molecular marker will almost always be associated with the presence of the allele of interest.

Genetic maps are also an important resource for plant gene isolation, as once the genetic position of any mutation is established, it is possible to attempt its isolation through positional cloning (Campos-de Quiroz et al., 2000). Furthermore, genetic maps help establish the extent of genome colinearity and duplication between different species.

1.2.2 Physical maps

Although genetic maps provide much-needed landmarks along chromosomes, these landmarks are still too far apart to provide an entry point into genes, since even in model plants the kilobases per centiMorgan (kb/cM) ratio is large, from 120 to 250 kb/cM in Arabidopsis and between 500 and 1,500 kb/cM in corn. Therefore, a 1 cM interval may harbor ~30 to 100 or even more genes. Physical maps bridge such gaps, representing the entire DNA fragment spanning the genetic location of adjacent molecular markers.
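The arithmetic behind this estimate can be sketched in a few lines of Python. The kb/cM ratios are the Arabidopsis values quoted above; the average gene spacing (`ASSUMED_KB_PER_GENE`) and the function name are illustrative assumptions, not figures or methods taken from this review.

```python
def genes_in_interval(interval_cm, kb_per_cm, kb_per_gene):
    """Rough number of genes expected within a genetic interval.

    interval_cm  -- genetic length of the interval in centiMorgans
    kb_per_cm    -- physical-to-genetic ratio for the region (kb/cM)
    kb_per_gene  -- assumed average spacing between genes (kb)
    """
    physical_kb = interval_cm * kb_per_cm
    return physical_kb / kb_per_gene


# Hypothetical average gene spacing, chosen only for illustration.
ASSUMED_KB_PER_GENE = 4.0

# kb/cM range quoted in the text for Arabidopsis.
arabidopsis_low = genes_in_interval(1.0, 120, ASSUMED_KB_PER_GENE)
arabidopsis_high = genes_in_interval(1.0, 250, ASSUMED_KB_PER_GENE)

print(f"1 cM in Arabidopsis: roughly {arabidopsis_low:.0f} to "
      f"{arabidopsis_high:.0f} genes under the assumed spacing")
```

Under this assumed spacing the lower bound matches the ~30 genes mentioned above; regions with higher kb/cM ratios or denser gene content push the figure upward.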

Physical maps can be defined as a set of large insert clones with minimum overlap encompassing a given chromosome. First generation physical maps in plants were based on YACs (Yeast Artificial Chromosomes). Chimaerism and stability issues, however, dictated the development of low copy, E. coli-maintained vectors such as Bacterial Artificial Chromosomes (BACs) and P1-derived artificial chromosomes. Although BAC vectors are relatively small (the BAC vector pBeloBAC11 is 7.4 kb in size, for instance), they carry inserts between 80 and 200 kb on average and possess traditional plasmid selection features such as an antibiotic resistance gene and a polycloning site within a reporter gene allowing insertional inactivation. BAC clones are easier to manipulate than yeast-based clones. Once a BAC library is prepared, clones are assembled into contigs using fluorescent DNA fingerprint technologies and matching probabilities. Physical and genetic maps can be aligned, providing continuity from phenotype to genotype. Furthermore, physical maps provide the platform that clone-by-clone sequencing approaches rely upon. Figure 2b shows the relationship between genetic and physical maps and their alignment. Physical maps provide the bridge needed between the resolution achieved by genetic maps and that needed to isolate genes through positional cloning.

1.3 Genome colinearity/Genome evolution

A remarkable feature of plant genomics is its ability to bring together more than one species for analysis. The comparative genome mapping of related plant species has shown that the organization of genes is highly conserved during the evolution of members of taxonomic families. This has led to the identification of genome colinearity between the well-sequenced model species and their relatives (e.g. Arabidopsis for dicots and rice for monocots). Colinearity overrides the differences in chromosome number and genome size and can be defined as conservation of gene order within a chromosomal segment between different species. A related concept is synteny, which refers to the presence of two or more loci on the same chromosome regardless of whether they are genetically linked or not.

Colinear relationships have been observed among cereal species (corn, wheat, rice, barley), legumes (beans, peas and soybeans), pines and Cruciferae species (canola, broccoli, cabbage, Arabidopsis thaliana). Recently, the first studies at the gene level have demonstrated that microcolinearity of genes is less conserved; small-scale rearrangements and deletions complicate microcolinearity between closely-related species. For instance, a 78-kb genomic sequence of sorghum around the locus adh1 and its homologous genomic fragment from maize showed considerable microcolinearity, sharing nine genes in perfect order and transcriptional direction, yet five additional, unshared genes reside in this genomic region (Tikhonov et al., 1999).

Comparing sequences of soybean and Arabidopsis demonstrated partial homology between two soybean chromosomes and a 25 cM section of chromosome 2 from Arabidopsis (Lee et al., 2001). Although such relationships need to be assessed on a case-by-case basis, they reflect the value Arabidopsis and other model species offer to economically important species.

Colinearity has also been established between rice and most cereal species, allowing the use of rice for genetic analysis and gene discovery in genetically more complex species, such as wheat and barley (Shimamoto and Kyozuka, 2002). A comparison of rice and barley DNA sequences from syntenic regions between barley chromosome 5H and rice chromosome 3 revealed the presence of four conserved regions, containing four predicted genes. General gene structure was largely conserved between rice and barley (Dubcovsky et al., 2001). A similar comparison between corn and rice, based on 340 kb around loci adh1 and adh2, showed five colinear genes between the two species, as well as a possible translocation on adh1. Rice genes similar to known disease resistance genes showed no cross-hybridization with corn genomic DNA, suggesting sequence divergence or their absence in maize (Tarchini et al., 2000). There are even reports of colinearity across the mono-dicotyledonous division involving Arabidopsis and cereals, which diverged as far back as 200 million years ago (Mayer et al., 2001). Exploiting colinearity helps to establish cross-species genetic links and also aids in the extrapolation of information from species with simpler genomes (e.g. rice) to genetically complex species (corn, wheat). Furthermore, it reflects the power of genomics to integrate genetic information across species.

1.4 Whole genome sequencing

Genetic and physical maps at the inter- or intra-species level represent a key layer of genomic information. However, sequence data represents the ultimate level of genetic information. Three major breakthroughs have allowed the sequencing of complete genomes: 1) The development of fluorescence-based DNA sequencing methods that provide at least 500 bases per read; 2) The automation of several processes such as picking and arraying bacterial subclones, purification of DNA from individual subclones and sample loading among others; and 3) The development of software and hardware able to handle massive amounts (gigabytes) of data points.

There are two main approaches to large scale sequencing (Figure 1). In clone-by-clone strategies (Figure 1a), large insert libraries, such as those based on BAC clones, are used as sequencing templates, and inserts are arranged into contigs using diverse fingerprinting methods to establish minimal tiling paths. Sequence Tagged Connectors extracted from large insert clones as well as FISH (Fluorescence in situ Hybridization) and optical mapping are used to extend contigs and close gaps (Marra et al., 1997). BAC clones from sequence-ready contigs are then fragmented into plasmid or M13 vector-based shotgun libraries with insert sizes of ~1 to 3 kb. Using more than one vector system reduces cloning bias issues. Sequencing efforts are tailored to the degree of coverage required. For instance, for a 5-fold coverage, and assuming 500 base pairs (bp) per sequencer read, 800 clones are sequenced to cover an 80 kb BAC clone. Finished sequences are those obtained at a ~8-10 fold coverage and provide >99.99% accuracy, whereas working draft sequences are attained at a ~3-5 fold coverage. It is important to note, however, that even working draft sequences provide an enormous amount of information, and even shotgun approaches rely to some extent on clone-by-clone information.
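The coverage figure in the example above follows from a one-line calculation: the number of reads equals the fold coverage times the target length divided by the read length. A minimal Python sketch, with illustrative function names:

```python
import math


def reads_for_coverage(target_bp, read_bp, fold):
    """Number of reads needed to reach a given average fold coverage."""
    return math.ceil(fold * target_bp / read_bp)


def fold_coverage(target_bp, read_bp, n_reads):
    """Average fold coverage obtained from n_reads reads."""
    return n_reads * read_bp / target_bp


# Worked example from the text: 80 kb BAC, 500 bp per read, 5-fold coverage.
draft_reads = reads_for_coverage(80_000, 500, 5)

# The same clone taken to the upper end of "finished" coverage (10-fold).
finished_reads = reads_for_coverage(80_000, 500, 10)

print(draft_reads, finished_reads)
```

The first value reproduces the 800 clones quoted in the text; doubling the coverage target doubles the sequencing effort for the same clone.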

Figure 1. Whole genome sequencing a-. Clone-by-clone approach; b-. Shot-gun approach

Figure 2. Maps used in plant genetics. a: Genetic and physical maps of a hypothetical chromosome. Horizontal lines on the genetic map represent loci targeted by a molecular marker; vertical lines represent overlapping BAC clones. b: Alignment of genetic and physical maps using BAC ends sequence (dashed lines), ESTs (dotted line) and molecular markers (*).

After sequencing is concluded, DNA data is used to reassemble BAC clones. Base calling programs assigning quality scores to each read base such as Phred (Ewing et al., 1998), sequence assembly programs such as Phrap (Gordon et al., 1998), and graphical viewing tools are used to achieve such assembly. The finishing of the sequence then ensues, which can be done in part manually or with finishing software such as Autofinish (Gordon et al., 2001).
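Phred-style quality scores relate to base-calling error through the standard definition Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The sketch below shows the conversion in both directions; the function names are illustrative and are not part of the Phred program itself.

```python
import math


def phred_to_error_prob(q):
    """Estimated probability that a base with quality q is miscalled."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Quality score corresponding to a miscall probability p."""
    return -10 * math.log10(p)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read, given per-base qualities."""
    return sum(phred_to_error_prob(q) for q in qualities)


# A quality of 40 corresponds to one expected error in 10,000 bases,
# i.e. the 99.99% accuracy level associated with finished sequence.
print(phred_to_error_prob(40))
print(expected_errors([20, 20, 30]))
```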

Annotation, or the process of identifying start and stop codons and the position of introns that permits the prediction of biological function from DNA sequence, proceeds through three main steps. The first is to use gene finders like Xgrail (Uberbacher and Mural, 1991) or others based on generalized hidden Markov models, such as GeneMark.hmm (Lukashin and Borodovsky, 1998) and GenScan (Burge and Karlin, 1997), specifically developed to recognize Arabidopsis genes. In the second step, sequences are aligned to protein and EST databases; and finally, putative functions are assigned to each gene sequence. Successful annotation processes often combine different software and manual inspection.

In shotgun approaches (Figure 1b), which have been successfully used to sequence many microorganisms and D. melanogaster, small insert libraries are prepared, and randomly selected inserts are sequenced until a ~5-fold or higher coverage is reached. Sequences are then assembled, gaps identified and closed, and finally annotation conducted. Shotgun sequencing does not rely upon the availability of minimal tiling paths and therefore reduces the cost and effort required to obtain whole genome sequences. Nevertheless, it requires an enormous amount of computational power to assemble a large number of random sequences into a small number of contigs. Furthermore, the ultimate quality of large genomes that have been shotgun-sequenced may not be as high as that achievable using the clone-by-clone approach. Because of a high content of long and highly conserved repetitive sequences, including retrotransposons, shotgun sequencing of plant genomes may pose special challenges.
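Why a ~5-fold or higher coverage is needed can be illustrated with the idealized Lander-Waterman (Poisson) model, which is not discussed in this review and is introduced here only as an assumption-laden sketch. It treats reads as uniformly random, ignores repeats and ignores the minimum overlap needed to join reads, so it is optimistic for repeat-rich plant genomes.

```python
import math


def uncovered_fraction(fold):
    """Expected fraction of the target not covered by any read (Poisson model)."""
    return math.exp(-fold)


def expected_gaps(n_reads, fold):
    """Approximate expected number of gaps between contigs (Poisson model)."""
    return n_reads * math.exp(-fold)


# Illustrative case: a 120 Mb genome, 500 bp reads, 5-fold coverage.
genome_bp = 120_000_000
read_bp = 500
fold = 5
n_reads = fold * genome_bp // read_bp

print(f"reads needed:       {n_reads}")
print(f"uncovered fraction: {uncovered_fraction(fold):.4f}")
print(f"expected gaps:      {expected_gaps(n_reads, fold):.0f}")
```

Even under these ideal assumptions, thousands of gaps remain at 5-fold coverage, which is why gap identification and closure form a separate step after assembly.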

The Arabidopsis genome was the first plant genome to be fully sequenced. The ecotype chosen was Columbia. In 1996, sequencing groups in the US, Japan and Europe established the Arabidopsis Genome Initiative (AGI) and set common techniques and resources, accuracy standards, levels of analysis, and a common public release policy for sequence information. Since shotgun sequencing was not available at its inception, the sequencing of Arabidopsis followed a more conventional approach (The Arabidopsis Genome Initiative, 2000).

There are two remarkable lessons to be learned from the Arabidopsis sequencing effort. First, there is no alternative to establishing partnerships between even competing groups when tackling large genomic projects because of the complexity, expense, and infrastructure required. For example, the effectiveness of the AGI consortium resulted in the completion and release of the Arabidopsis sequence in 2000, fully four years ahead of schedule. Secondly, approximately one third of the genes putatively identified in Arabidopsis encode products lacking significant similarity to proteins of known function in other organisms. Moreover, only 9% of its genes have been characterized experimentally. Such figures reflect the power of genomic approaches and the wealth of information they provide us with. The gene complement of Arabidopsis is shown in Table II.

In rice, the IRGSP (International Rice Genome Sequencing Project) started in 1997, and includes members from developing countries in addition to European and USA partners. It is based on the Nipponbare cultivar, and its approach is similar to that used in Arabidopsis. Thus, the first task was to establish a sequence-ready BAC contig of the rice genome, followed by software assembly of DNA sequences, computational and manual annotation and final release of the finished sequence. The expected deadline for release of the full sequence data is 2003.

Syngenta and a Chinese group recently made available the sequences of japonica and indica rice, respectively (Goff et al., 2002; Yu et al., 2002), and both were based on shotgun approaches. Gradients in Guanine/Cytosine content and codon usage for rice genes created unexpected problems in the gene annotation process, and the gene finder software FgeneSH was the most effective in rice (Yu et al., 2002). The number of genes predicted in rice ranges from 32,000 to 55,000, depending on the criteria used to recognize a gene (Goff et al., 2002; Yu et al., 2002). Regardless of the actual figure, it is interesting to note that such figures are similar to or larger than the human gene complement (32,000-39,100 genes; Green and Chakravarti, 2001). The suggestion of Yu et al. (2002) that protein diversity in plants is generated primarily through gene duplication would account for the comparatively large number of genes predicted in rice.

Nevertheless, the actual number of genes existing in Arabidopsis, rice, or any other sequenced species remains to be established through functional genomic experiments that establish the biological meaning of DNA sequences, since gene prediction through homology comparisons and software tools is a statistical "best informed guess" rather than a biologically based process. The availability of extensive EST collections, which now exist in several plant species, including corn (Rafalski et al., 1998) and soybean (Cahoon et al., 1999), reduces the dependence of the annotation process on computational gene predictions.

2 EXTRACTING BIOLOGICAL SIGNIFICANCE FROM DNA SEQUENCES

Issues discussed so far relate to how information can be extracted from DNA at different complexity levels (genetic or physical maps, colinearity, whole genome sequencing). This information does not represent the end of genomics however, but rather the starting point to assigning biological meaning to putative genes with no known phenotype. The field of genomics that addresses the function of genes discovered through sequencing efforts is referred to as functional genomics. Since genes encoding traits are expected to be functional across species, most of the information thus gathered will be useful to address plant improvement issues or biological processes.

2.1 ESTs (Expressed Sequence Tags)

Large scale sequencing facilities allow the development of ESTs, also known as single pass sequences of random cDNAs. ESTs are derived from the 3' end of transcripts and are expected to contain sequences including 3' untranslated regions (UTR) and to extend toward the 5' end, thus reaching exons. Their average length in plant species is ~350-650 bp.

Figure 3 presents the typical flowchart for an EST project. Although ESTs generated from 5' ends enhance the probability of spanning open reading frames, the availability of 3'UTR data is important for segregating members of gene families and selecting clones for transcriptional profiling studies. They allow the cataloguing of many genes without the effort involved in complete genome sequencing initiatives. Since many genes have tissue-specific temporal expression patterns, in order to collect cDNAs from most expressed genes it is necessary to prepare cDNA libraries from several different tissues and also from tissues challenged with diverse biotic and abiotic factors. The value of using normalized libraries as an EST discovery platform cannot be overstressed.

Searches for homology with known genes from other species allow the assignment of putative biological functions to ESTs. The use of ESTs also permits the identification of genes encoding functionally unknown proteins. A comparison of EST databases from different species and tissues reveals the diversity in coding sequences between plants and provides a global perspective on the similarities in genes for specific tissues or conditions.

Although ESTs are useful tools in gene discovery, those belonging to multigene families are not easily distinguishable, and rare transcripts have a small probability of being represented in EST collections. There is an extensive collection of public ESTs from a number of organisms in the database dbEST, a division of GenBank. Table III presents an updated (September 2002) record of plant-derived ESTs available in dbEST.

ESTs are also important in establishing expression/transcription maps, since their placement on a genetic map provides the precise location of genes. This helps us understand how genes belonging to a pathway or a given gene family are distributed across the genome. Furthermore, they provide anchors by which physical and genetic maps can be aligned (Figure 2b).

Research focusing on improving the nutritional quality of plant products reflects the value of EST programs in plant gene discovery. An EST from Arabidopsis allowed for the isolation of the gene p-hydroxyphenylpyruvate dioxygenase in Synechocystis PCC6803. This enzyme enables the first committed step in the synthesis of plastoquinones and tocopherols in plants and subsequently other steps of vitamin E synthesis. This led to the development of transgenic Arabidopsis plants with increased levels of vitamin E in oil (Shintani and DellaPenna, 1998).

For species other than rice or Arabidopsis, large-scale EST sequencing provides an excellent and cost-effective way of obtaining valuable information on transcribed genes (Mayer and Mewes, 2001). ESTs are a highly effective gene discovery tool, and approximately 60% of putatively identified Arabidopsis genes have been tagged with ESTs (Table II). Furthermore, they have provided the platform necessary for transcriptional profiling experiments. EST sequencing programs, therefore, provide a powerful lead into genomic approaches in plants.

Figure 3. Schematic chart of an EST discovery program

2.2 Reverse genetics

Traditional genetic analysis aims to identify the DNA sequences associated with a given phenotype. Reverse genetics determines the function of a gene for which the sequence is known, by generating and analyzing the phenotype of the corresponding knockout mutant (Maes et al., 1999). Unlike yeast, in which gene disruption is available through homologous recombination, transposon and T-DNA tagging are the best methods available for developing mutagenized plant populations suitable for reverse genetics studies (Pereira, 2000). There are several mutagenized populations in Arabidopsis suited for reverse genetics studies. A European consortium is developing heterologous systems for rice based on the Ac element from corn (Greco, 2001). There are also proprietary populations such as Pioneer Hi-Bred International's Trait Utility System for Corn (TUSC), mutagenized with the high copy Mu element (Multani et al., 1998). Using high copy elements makes it possible to use smaller populations to ensure that tagged mutants will be found for most genes.

There are two main possibilities for identifying tagged genes at insertion sites. For unknown genes, sequences flanking the insertion can be obtained through inverse Polymerase Chain Reaction (PCR) (Ochman et al., 1988) or Thermal Asymmetric Interlaced PCR (Liu and Whittier, 1995), whereas for insertions in genes of known sequence, it is possible to amplify and clone the sequence of interest through PCR using gene-specific and insertion-specific primers. Since in the latter case it is common to analyze thousands of plants, PCR-based screening is arranged into three-dimensional pools that allow the unequivocal identification of tagged individuals. Large databases of characterized insertion sites are becoming available that will further ease the use of insertion elements to isolate useful genes (Tissier et al., 1999).
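The logic of three-dimensional pooling can be sketched as follows. Plants are assigned coordinates in a hypothetical 10 x 10 x 10 grid, DNA is pooled along each plane of the grid, and a single tagged plant is recovered from the one positive pool per axis. The grid size, pool names and function names are assumptions made for illustration; actual screening layouts vary between populations.

```python
from itertools import product

AXES = "xyz"


def build_pools(dims):
    """Map each pool, e.g. ('x', 3), to the set of plant coordinates it contains."""
    pools = {}
    for coord in product(*(range(d) for d in dims)):
        for axis, index in zip(AXES, coord):
            pools.setdefault((axis, index), set()).add(coord)
    return pools


def decode(positive_pools):
    """Candidate plant coordinates consistent with the set of positive pools."""
    hits = {axis: sorted(i for a, i in positive_pools if a == axis) for axis in AXES}
    return list(product(*(hits[axis] for axis in AXES)))


# 1,000 plants screened with only 30 pooled PCR reactions.
pools = build_pools((10, 10, 10))

# Simulate one plant carrying the insertion of interest.
tagged_plant = (3, 7, 2)
positive = {name for name, members in pools.items() if tagged_plant in members}

print(len(pools), decode(positive))
```

With a single tagged plant the three positive pools intersect at exactly one coordinate. If several plants carry insertions in the same gene, the intersection yields several candidates, which then need to be confirmed individually.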

Although several genes have been isolated through reverse genetic approaches, two main factors have limited their wider application. First, many genes are functionally redundant, as even species with simple genomes such as Arabidopsis carry extensive duplications, and second, mutations in many genes may be highly pleiotropic, which can mask the role of a gene in a specific pathway (Springer, 2000). Nevertheless, reverse genetics is considered to be a major component of the functional genomics toolbox, and it plays an important role in assigning biological functions to genes discovered through large-scale sequencing programs. Transposon tagging provides an excellent alternative to isolate tagged genes that exhibit relatively simple inheritance.

Gene traps refer to another application of transposons that responds to regulatory sequences at the site of insertion. Depending on the sequences engineered, they can be classified as promoter traps, enhancer traps, or gene traps. Since they rely on reporter gene expression, mutant phenotypes are not required, and they have been valuable in isolating tissue and cell specific sequences (Springer, 2000).

2.3 Transcriptional profiling

While molecular biology generally analyzes one or a few genes simultaneously, recent developments allow the parallel analysis of thousands of genes. This area of genomics involves the study of gene expression patterns across a wide array of cellular responses, phenotypes and conditions. The expression profile of a developmental stage or induced condition can identify genes and coordinately regulated pathways and their functions. This produces a more thorough understanding of the underlying biology (Quackenbush, 2001).

There are several systems available to analyze the parallel expression of many genes such as macroarrays (Desprez et al., 1998), microarrays (Schena et al., 1995) and Serial Analysis of Gene Expression (SAGE) (Velculescu et al., 1995), which consists of identifying short sequence tags from individual transcripts, their concatenation, sequencing and subsequent digital quantitation. SAGE provides expression levels for many transcripts across different stages of development.
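The "digital quantitation" step of SAGE amounts to counting how often each tag occurs. The sketch below is a simplified illustration that assumes the commonly used layout in which ditags (two tags joined tail to tail) are separated by the CATG anchoring site in a sequenced concatemer, with 10-bp tags. The tag length, function names and example sequences are assumptions, and steps such as removal of duplicate ditags and sequencing-error handling are omitted.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def tags_from_concatemer(seq, anchor="CATG", tag_len=10):
    """Extract tags from a concatemer of ditags separated by the anchor site."""
    tags = []
    for ditag in seq.split(anchor):
        if len(ditag) < 2 * tag_len:
            continue  # skip empty or truncated fragments
        tags.append(ditag[:tag_len])             # tag read on the forward strand
        tags.append(revcomp(ditag[-tag_len:]))   # tag read on the reverse strand
    return tags


# Hypothetical tags standing in for three different transcripts.
tag_a, tag_b, tag_c = "ACGTTGCAAT", "TTGGCCAAGG", "GGATCCTTAA"
ditags = [tag_a + revcomp(tag_b), tag_a + revcomp(tag_c), tag_a + revcomp(tag_a)]
concatemer = "CATG" + "CATG".join(ditags) + "CATG"

counts = Counter(tags_from_concatemer(concatemer))
for tag, n in counts.most_common():
    print(tag, n)
```

In this toy example the transcript represented by `tag_a` is counted four times and the other two once each, which is the kind of relative abundance estimate SAGE provides.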

There are open and closed transcriptional profiling systems. Open technologies survey a large number of transcripts and analyze their levels between different samples, but the identity of the genes involved is not known a priori. One example of such a system is the GeneCalling technology (Bruce et al., 2000). Another open system is provided by Massively Parallel Signature Sequencing (MPSS), where microbeads are used to construct libraries of DNA templates and create hundreds of thousands of gene signatures (Brenner et al., 2000).

Closed systems, on the other hand, analyze genes that have been previously characterized. They include most of the diverse microarray systems available, which are based on the specific hybridization of labeled samples to spatially separated, immobilized nucleic acids, thus enabling the parallel quantification of many specific mRNAs. It is important to select the system at the outset of any transcriptional profiling study and stay with it.

The focus here is on microarrays. In microarray experiments, DNA samples corresponding to thousands of genes of interest are immobilized in a regular array on a solid surface such as a glass slide. The immobilized sequences are usually referred to as probes. RNA samples (or their cDNA derivatives) from the biological samples under study are hybridized to the array and are referred to as the targets. Labeling with fluorescent dyes that have different excitation and emission characteristics allows the simultaneous hybridization of two contrasting targets on a single array (Aharoni and Vorst, 2001). Microarrays can be based on oligonucleotides or cDNA molecules, and their basic features are presented in Table IV.

Microarray applications are broadly classified as expression-specific and genome-wide expression studies. In specific expression studies, they are used as a functional genomics tool to address the biological significance of genes discovered through large scale sequencing, as well as a means to understanding the genetic networks explaining biological processes or biochemical pathways. The value of using microarrays to identify novel response genes has been demonstrated by studying the gene expression patterns during corn embryo development (Lee et al., 2002), the response to drought and cold stresses (Seki et al., 2001), herbivory (Arimura et al., 2000), and nitrate treatments (Wang et al., 2000).

When addressing a specific pathway or biological process, it is useful to include genes beyond those of apparent interest, since over-specific microarrays would not be able to address genetic interactions with other biological processes. This principle revealed previously unexpected relationships between low soil phosphate levels and cold acclimation in Arabidopsis (Hurry et al., 2000). Genes obtained from the transcriptional analysis of plant responses to stress are of particular relevance for transgenic approaches, as thoroughly reviewed by Dunwell et al. (2001).

Genome-wide arrays are mostly designed for model organisms such as Arabidopsis or rice, as there are many genes available to select from, either as clones or as annotated genomic sequences. They are also available for species such as corn that have extensive EST collections. This enabling technology is an immediate and direct result of large-scale sequencing projects. It is expected that microarrays covering most Arabidopsis genes will become available in 2003. Genome-wide expression profiles are the ultimate tool to integrate all genes existing in an organism into a series of experiments. They also help to elucidate the coordinate expression of different genetic networks and document how changes in one would impact others. It is expected that such genome-wide approaches will be particularly useful in identifying new regulatory sequences and master switches that affect distinct but apparently unrelated genetic networks.

Transcriptional profiling technologies play a central role in predicting gene function since sequence comparison alone is insufficient to infer function. They also help to detect phenomena such as gene displacement (non-homologous genes coding for proteins that serve the same function) and gene recruitment (genes with identical sequences coding for completely different functions) (Noordewier and Warren, 2001).

Unlike animals, plants cannot move and have developed exquisite mechanisms to cope with changing environmental conditions and biotic challenges, since these directly or indirectly affect most biological processes occurring in plants. Therefore, a significant proportion of the information gathered by specific and genome wide transcription profiling processes should have practical applications and facilitate the development of plants more resilient to biotic and abiotic stimuli.

This overview cannot be complete without addressing some potential pitfalls that researchers interested in microarrays need to be aware of. Microarray technology is still a young and very complex approach with many variables involved. Sequences for a given experiment must be carefully selected in terms of coverage, redundancy, quality of annotation and cost. With commercial microarrays, the researcher has little if any control over the sequence content of the array and can only rely on the annotation of the manufacturer. Cross hybridization between family members and alternative splice forms can be misleading, although using 3' UTR sequences and exon-intron junctions may alleviate this. It is also necessary to carefully standardize experimental conditions (Aharoni and Vorst, 2001).

Once data have been gathered, they need to be normalized by comparison with internal standards ("housekeeping genes") or by spiking with foreign mRNA. Data for each gene are generally reported as an expression ratio or its logarithm (Hess et al., 2001). The statistical analysis of microarray data is perhaps the most complex step, is often neglected, and is still being developed. When expression data from different experiments are compared, the aim is to identify genes with similar expression patterns. There are unsupervised approaches, which require no prior knowledge of how the genes assessed are organized, for instance clustering algorithms (such as k-means clustering) and principal component analysis. These methods can group genes on the basis of their separation in expression space. There are also neural network-based methods such as Self Organizing Maps (SOMs). Supervised methods, on the other hand, rely on previous information about the genes studied. The Support Vector Machine (SVM), for instance, uses a training set in which genes known to be related by, for example, function are provided as positive examples and genes known not to be members of that class are provided as negative examples. Thus, the SVM learns to distinguish between class members and non-members on the basis of expression data (Quackenbush, 2001).
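The normalization and ratio steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a recommended pipeline: it assumes two channels of strictly positive, background-corrected intensities per gene, and it uses the median reference/sample ratio of a user-supplied set of housekeeping genes as a single global scaling factor. The function name, gene names and intensity values are hypothetical.

```python
import math
from statistics import median


def normalize_and_log_ratio(sample, reference, housekeeping):
    """Return log2(sample / reference) per gene after housekeeping scaling.

    The sample channel is multiplied by the median reference/sample ratio
    observed for the housekeeping genes, so that those internal standards
    are balanced between the two channels before ratios are computed.
    All intensities are assumed to be strictly positive.
    """
    factor = median(reference[g] / sample[g] for g in housekeeping)
    return {
        gene: math.log2(sample[gene] * factor / reference[gene])
        for gene in sample
    }


# Hypothetical intensities; the sample channel is twice as bright overall.
sample = {"hk1": 200.0, "hk2": 400.0, "geneX": 1600.0, "geneY": 100.0}
reference = {"hk1": 100.0, "hk2": 200.0, "geneX": 200.0, "geneY": 200.0}

ratios = normalize_and_log_ratio(sample, reference, ["hk1", "hk2"])
for gene, value in ratios.items():
    print(gene, value)
```

Reporting the logarithm rather than the raw ratio makes induction and repression symmetric: in this toy example a fourfold increase and a fourfold decrease appear as +2 and -2, while the housekeeping genes sit at 0.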

The extensive amount of information generated from microarray experiments is best managed with Laboratory Information Management Systems handling sample submission, sample processing, sample tracking, data retrieving, sorting, visualization and statistical analysis. Finkelstein et al. (2002) have assembled a useful set of critical steps to follow when conducting microarray experiments: selection of probes; physical design of the array; data acquisition, extraction and normalization; design and control of the experiment through standardized annotation; and methods employed in data analysis.

The massive amount of information generated with microarray experiments is very appealing to many investigators, but should not deter one from rigorous experimental design and hypothesis development. Observations of expression data may help generate hypotheses, but additional experimentation, for example genetic mapping or transgenesis, may be necessary to validate these hypotheses.

CONCLUDING REMARKS

The current understanding of plant biology is limited by our understanding of gene function in the context of whole organism biology. Even in Arabidopsis, only a fraction of its genes have been characterized from a molecular standpoint. This paradigm is being challenged by a myriad of genomic approaches available to researchers that allow the identification of putative genes and the validation of their biological functionality through functional approaches. Furthermore, used in combination with genetics, genomics adds another level of understanding to plant biology through the integrated analysis of different species.

The large number of genes handled simultaneously by genomics sets a new paradigm in plant biology, since it allows the genetic integration of diverse processes, tissues and organisms. It is expected that a significant proportion of such information will be transferred to plant improvement programs and will thus contribute to meeting the increasing food requirements of the world.

Plant genomics will revolutionize the study of the molecular basis of plant biology. The traditional hypothesis-driven approach will gradually be transformed into one of unbiased data collection at the tissue or organism level, followed by bioinformatic analyses.

Finally, genomics is the ultimate interdisciplinary approach, as it covers the entire spectrum from DNA sequencing to field-based research. Only through the integrated endeavor of genetics, biology, bioinformatics, molecular biology, engineering, microbiology and related fields will the extensive benefits of genomics to mankind become reality.

ACKNOWLEDGMENTS

I would like to thank Drs. Gregory Edmeades (Pioneer Hi-Bred International), Antoni Rafalski (Dupont Crop Genetics-Genomics) and two anonymous reviewers for their comments on this manuscript.

REFERENCES

AHARONI A, VORST O (2001) DNA microarrays for functional plant genomics. Plant Mol Biol 48: 99-118

ARIMURA G, TASHIRO K, KUHARA S, NISHIOKA T, OZAWA R, TAKABAYASHI J (2000) Gene responses in bean leaves induced by herbivory and by herbivore-induced volatiles. Biochem Biophys Res Commun 277: 305-310

BARNES S (2002) Comparing Arabidopsis to other flowering plants. Curr Op Plant Biol 5:128-133

BEVAN M, MAYER K, WHITE O, EISEN JA, PREUSS D, BUREAU T, SALZBERG S, MEWES H-W (2001) Sequence and analysis of the Arabidopsis genome. Curr Op Plant Biol 4:105-110

BRENNER S, JOHNSON M, BRIDGHAM J, GOLDA G, LLOYD DH, JOHNSON D, LUO SJ, MCCURDY S, FOY M, EWAN M, ROTH R, GEORGE D, ELETR S, ALBRECHT G, VERMAAS E, WILLIAMS SR, MOON K, BURCHAM T, PALLAS M, DUBRIDGE RB, KIRCHNER J, FEARON K, MAO J, CORCORAN K (2000) Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nature Biotech 18: 630-634

BRUCE W, FOLKERTS O, GARNAAT C, CRASTA O, ROTH B, BOWEN B (2000) Expression profiling of the maize flavonoid pathway genes controlled by estradiol-inducible transcription factors CRC and P. Plant Cell 12: 65-79

BURGE C, KARLIN S (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268:78-94

CAHOON E, CARLSON T, RIPP K, SCHWEIGER B, COOK G, HALL S, KINNEY AJ (1999) Biosynthetic origin of conjugated double bonds: Production of fatty acid components of high-value drying oils in transgenic soybean embryos. Proc Natl Acad Sci USA 96:12935-12940

CAMPOS DE QUIROZ H, MAGRATH R, MCCALLUM D, KROYMANN J, SCHNABELRAUCH D, MITCHELL-OLDS T, MITHEN R (2000) Keto acid elongation and glucosinolate biosynthesis in Arabidopsis thaliana. Theor Appl Genet 101: 429-437

DESPREZ T, AMSELEM J, CABOCHE M, HOFTE H (1998) Differential gene expression in Arabidopsis monitored using cDNA arrays. Plant J 14: 643-652

DUBCOVSKY J, RAMAKRISHNA W, SANMIGUEL P, BUSSO C, YAN L, SHILOFF B, BENNETZEN JL (2001) Comparative sequence analysis of colinear barley and rice Bacterial Artificial Chromosomes. Plant Physiol 125:1342-1353

DUNWELL JM, MOYA-LEON MA, HERRERA R (2001) Transcriptome analysis and crop improvement: (A review). Biol Res 34(3-4):153-164

EWING B, HILLIER L, WENDL MC, GREEN P (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8: 175-185

FINKELSTEIN D, EWING R, GOLLUB J, STERKY F, CHERRY M, SOMERVILLE S (2002) Microarray data quality analysis: lessons from the AFGC project. Plant Mol Biol 48: 119-131

GOFF S, RICKE D, LAN TH, PRESTING G, WANG R, DUNN M, GLAZEBROOK J, SESSIONS A, OELLER P, VARMA H, HADLEY D, HUTCHISON D, MARTIN C, KATAGIRI F, LANGE B, MOUGHAMER T, XIA Y, BUDWORTH P, ZHONG J, MIGUEL T, PASZKOWSKI U, ZHANG S, COLBERT M, SUN M, CHEN L, COOPER B, PARK S, WOOD T, MAO L, QUAIL P, WING R, DEAN R, YU Y, ZHARKIKH A, SHEN R, SAHASRABUDHE S, THOMAS A, CANNINGS R, GUTIN A, PRUSS D, REID J, TAVTIGIAN S, MITCHELL J, ELDREDGE G, SCHOLL T, MILLER RM, BHATNAGAR S, ADEY N, RUBANO T, TUSNEEM N, ROBINSON R, FELDHAUS J, MACALMA T, OLIPHANT A, BRIGGS S (2002) A draft sequence of the rice genome (Oryza sativa L ssp japonica). Science 296: 92-100

GORDON D, ABAJIAN C, GREEN P (1998) Consed: a graphical tool for sequence finishing. Genome Res 8:195-202

GORDON D, DESMARAIS C, GREEN P (2001) Automated finishing with Autofinish. Genome Res 11:614-625

GRECO R, OUWERKERK PBF, SALLAUD C, KOHLI A, COLOMBO L, PUIGDOMENECH P, GUIDERDONI E, CHRISTOU P, HOGE JHC, PEREIRA A (2001) Transposon insertional mutagenesis in rice. Plant Physiol 125: 1175-1177

GREEN E, CHAKRAVARTI A (2001) The Human Genome Sequence Expedition: Views from the "Base Camp." Genome Res 11:645-651

HESS KR, ZHANG W, BAGGERLY KA, STIVERS DN, COOMBES KR (2001) Microarrays: handling the deluge of data, extracting reliable information. T Biotech 19: 463-467

HURRY V, STRAND A, FURBANK R, STITT M (2000) The role of inorganic phosphate in the development of freezing tolerance and the acclimatization of photosynthesis to low temperature is revealed by the pho mutants of Arabidopsis thaliana. Plant J 24: 383-396

KUMAR A, BENNETZEN JL (1999) Plant retrotransposons. Ann Rev Genet 33:479-532

LEE J-M, WILLIAMS ME, TINGEY SV, RAFALSKI A (2002) DNA array profiling of gene expression changes during maize embryo development. Funct Integr Genomics 2:13-27

LEE JM, GRANT D, VALLEJOS CE, SHOEMAKER RC (2001) Genome organization in dicots II Arabidopsis as a 'bridging species' to resolve genome evolution events among legumes. Theor Appl Genet 103:765-773

LIU Y, WHITTIER R (1995) Thermal asymmetric interlaced PCR: automatable amplification and sequencing of insert end fragments from P1 and YAC clones for chromosome walking. Genomics 25: 674-681

LUKASHIN AV, BORODOVSKY M (1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res 26:1107-1115

MAES T, DE KEUKELEIRE P, GERATS T (1999) Plant tagnology. T Plant Sci 4:90-96

MARRA M, KUCABA T, DIETRICH N, GREEN E, BROWNSTEIN, WILSON RK, MCDONALD K, HILLIER L, MCPHERSON J, WATERSTON RH (1997) High Throughput Fingerprint Analysis of Large-Insert Clones. Genome Res 7: 1072-1084

MAYER K, MEWES H (2001) How can we deliver the large plant genomes? Strategies and perspectives. Curr Op Plant Biol 5:173-177

MAYER K, MURPHY G, TARCHINI R, WAMBUTT R, VOLCKAERT G, POHL T, DÜSTERHÖFT A, STIEKEMA W, ENTIAN K-D, TERRYN N, LEMCKE K, HAASE D, HALL C, VAN DODEWEERD A-M, TINGEY S, MEWES H-W, BEVAN MW, BANCROFT I (2001) Conservation of microstructure between a Sequenced Region of the Genome of Rice and Multiple Segments of the Genome of Arabidopsis thaliana. Genome Res 11:1167-1174

MEYERS B, TINGEY SV, MORGANTE M (2001) Abundance, distribution, and transcriptional activity of repetitive elements in the maize genome. Genome Res 11:1660-1676

MULTANI D, MEELEY RB, PATERSON AH, GRAY J, BRIGGS SP, JOHAL GS (1998) Plant-pathogen microevolution: Molecular basis for the origin of a fungal disease in maize. Proc Natl Acad Sci USA 95:1686-1691

NOORDEWIER MO, WARREN PV (2001) Gene expression microarrays, the integration of biological knowledge. T Biotech 19: 412-415

OCHMAN H, GERBER A, HART D (1988) A genetic application of an inverse polymerase chain reaction. Genetics 120:621-623

PEREIRA A (2000) A transgenic perspective on plant functional genomics. Transgenic Res 9: 245-260

QUACKENBUSH J (2001) Computational analysis of microarray data. Nature Rev Genetics 2:418-427

RAFALSKI AJ (2002) Plant genomics: present state and a perspective on future developments. Briefings in Functional Genomics and Proteomics 1:1-15

RAFALSKI AJ, HANAFEY M, MIAO GH, CHING A, DOLAN M, TINGEY S (1998) New experimental and computational approaches to the analysis of gene expression. Acta Bioch Polonica 45: 929-934

SCHENA M, SHALON D, DAVIS RW, BROWN P (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270: 467-470

SEKI M, NARUSAKA M, ABE H, KASUGA M, YAMAGUCHI-SHINOZAKI K, CARNINCI P, HAYASHIZAKI Y, SHINOZAKI K (2001) Monitoring the expression pattern of 1300 Arabidopsis genes under drought and cold stresses by using a full-length cDNA microarray. Plant Cell 13: 61-72

SHIMAMOTO K, KYOZUKA J (2002) Rice as a model for comparative genomics of plants. Annu Rev Plant Biol 53:399-419

SHINTANI D, DELLAPENNA D (1998) Elevating the vitamin E content of plants through metabolic engineering. Science 282:2098-2100

SPRINGER PS (2000) Gene Traps: Tools for Plant Development and Genomics. Plant Cell 12:1007-1020

TARCHINI R, BIDDLE P, WINELAND R, TINGEY S, RAFALSKI A (2000) The complete sequence of 340 kb of DNA around the rice Adh1-Adh2 region reveals interrupted colinearity with maize chromosome 4. Plant Cell 12: 381-391

THE ARABIDOPSIS GENOME INITIATIVE (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796-815

TIKHONOV A, SANMIGUEL P, NAKAJIMANINA Y, GORENSTEIN M, BENNETZEN JL, AVRAMOVA Z (1999) Colinearity and its exceptions in orthologous adh regions of maize and sorghum. Proc Natl Acad Sci USA 96:7409-7414

TISSIER AF, MARILLONNET S, KLIMYUK V, PATEL K, TORRES MA, MURPHY G, JONES JD (1999) Multiple independent defective suppressor-mutator transposon insertions in Arabidopsis: A tool for functional genomics. Plant Cell 11:1841-1852

UBERBACHER EC, MURAL RJ (1991) Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc Natl Acad Sci USA 88:11261-11265

VELCULESCU V, ZHANG L, VOGELSTEIN B, KINZLER K (1995) Serial analysis of gene expression. Science 270: 484-487

WANG R, GUEGLER K, LABRIE SAMUEL T, CRAWFORD NM (2000) Genomic analysis of a nutrient response in Arabidopsis reveals diverse expression patterns and novel metabolic and potential regulatory genes induced by nitrate. Plant Cell 12: 1491-1509

YU J, HU S, WANG J, WONG LI, LIU B, DENG Y, DAI L, ZHOU Y, ZHANG X, CAO M, LIU J, SUN J, TANG J, CHEN Y, HUANG X, LIN W, YE C, TONG W, CONG L, GENG J, HAN Y, LI L, LI W, HU G, HUANG X, LI X, LI J, LIU Z, LI L, LIU J, QI Q, LIU J, LI L, LI T, WANG X, LU H, WU T, ZHU M, NI P, HAN H, WEI D, REN X, FENG X, CUI P, LI X, WANG H, XU X, ZHAI W, XU Z, ZHANG J, HE S, ZHANG J, XU J, ZHANG K, ZHENG X, DONG J, ZENG W, TAO L, YE J, TAN J, REN X, CHEN X, HE J, LIU D, TIAN W, TIAN C, XIA H, BAO Q, LI G, GAO H, CAO T, WANG J, ZHAO W, LI P, CHEN W, WANG X, ZHANG Y, HU J, WANG J, LIU S, YANG J, ZHANG G, XIONG Y, LI Z, MAO L, ZHOU C, ZHU Z, CHEN R, HAO B, ZHENG W, CHEN S, GUO W, LI G, LIU S, TAO M, WANG J, ZHU L, YUAN L, YANG H (2002) A draft sequence of the rice genome (Oryza sativa L ssp indica). Science 296:79-91

Received: July 26, 2002. In revised form: September 23, 2002. Accepted: October 1, 2002

Creative Commons License All the contents of this journal, except where otherwise noted, is licensed under a Creative Commons Attribution License