ITAG2 Tomato Genome Release

Contents:

  1. Introduction
  2. Files in this release
  3. Links and other resources
  4. Release statistics


== 1. Introduction ==

The International Tomato Annotation Group (ITAG) is pleased to announce
ITAG2, a release of the official Tomato genome annotation covering
approximately 84% of the genome, with 35,802 gene models. This release
file set was generated on January 17, 2011.

In this release, 27,322 (76.3%) of the gene models are supported by
homology to existing EST/cDNA sequences or to protein sequences, with
14,812 (41.4%) supported by both. All of the gene models are annotated
with best-guess text descriptions of their function, and 20,124 (56.2%)
have associated Gene Ontology terms describing their function.

See section 4 for more statistics describing this release.

Please send comments or questions about these annotations to:
itag@sgn.cornell.edu


== 2. Files in this release ==

ITAG2_README.txt
    Overview of the release, with some statistics.

ITAG2_cdna.fasta
    FASTA-format sequence file of cDNA sequences.

ITAG2_cdna_alignments.gff3
    GFF version 3 file containing alignments of existing EST and cDNA
    sequences to the genome.

ITAG2_cds.fasta
    FASTA-format sequence file of CDS sequences.

ITAG2_de_novo_gene_finders.gff3
    GFF version 3 file containing predictions from several de novo gene
    finders. These are intermediate data, used by EuGene to decide the
    final consensus gene models.

ITAG2_gene_models.gff3
    GFF version 3 file containing the gene models in this release.

ITAG2_genomic.fasta
    FASTA-format sequence file of genomic contig sequences.

ITAG2_genomic_all.gff3
    GFF version 3 file containing all genomic annotations in this
    release.

ITAG2_genomic_reagents.gff3
    GFF version 3 file containing alignments to other genomic sequences
    from tomato: genomic clones, other genome builds, etc.

ITAG2_other_genomes.gff3
    GFF version 3 file containing alignments to genomic sequences from
    other organisms.

ITAG2_protein_functional.gff3
    GFF version 3 file containing functional annotations to protein
    sequences.

ITAG2_proteins.fasta
    FASTA-format sequence file of protein sequences.

ITAG2_sgn_data.gff3
    GFF version 3 file containing alignments to sequences related to
    data on SGN. Currently contains alignments to SGN unigenes, SGN
    marker sequences, and SGN locus sequences.


== 3. Links and other resources ==

Sequences and annotations can also be viewed and searched on SGN:
http://solgenomics.net/gbrowse/

The fully annotated chromosome sequences in GFF version 3 format, along
with FASTA files of cDNA, CDS, genomic and protein sequences, and lists
of genes are available from the SGN ftp site at:
ftp://ftp.solgenomics.net/tomato_genome/annotation/ITAG2_release/

For those who are not familiar with the GFF3 file format, the format
specification can be found here:
http://www.sequenceontology.org/gff3.shtml

A graphical display of the Tomato sequence and annotation can be viewed
using SGN's genome browser. Browse the chromosomes, search for names or
short sequences and view search hits on the whole genome, in a close-up
view or on a nucleotide level:
http://solgenomics.net/gbrowse/

SGN's BLAST services have also been updated with this dataset,
available at:
http://solgenomics.net/tools/blast/

ITAG is committed to the continual improvement of the Tomato genome
annotation and actively encourages the community to contact us with new
data, corrections and suggestions.

Announcements of new releases, updates of data, tools, and other
developments from ITAG can be found on SGN:
http://solgenomics.net/


== 4. Release statistics ==

4.1 Proportion of Genome Annotated

Estimated genome size:           930 Mbp
Size of annotated assembly:      782 Mbp
Est. proportion of genome:       84%

4.2 Structural Annotation

Gene model count:     35,802
Exon count:          160,011
Intron count:        124,209

Gene model length (bp)
---------------------------------------------------
  Min            109
  Max         53,831
  Range       53,722
  Mean       2,898.9
  StdDev     3,475.6
  Median       1,842

  Frequency Distribution:
         Bin   Frequency
       6,824      32,540
      13,540       2,591
      20,255         465
      26,970         131
      33,685          42
      40,400          26
      47,116           6
      53,831           1

Intergenic distance (bp)
---------------------------------------------------
  Min             43
  Max      1,950,945
  Range    1,950,902
  Mean      40,743.2
  StdDev    96,998.3
  Median      11,467

  Frequency Distribution:
         Bin   Frequency
     243,906      34,488
     487,768         914
     731,631         249
     975,494          78
   1,219,357          33
   1,463,220          11
   1,707,082           1
   1,950,945           2

Exons per gene model
---------------------------------------------------
  Min              1
  Max             71
  Range           70
  Mean           4.5
  StdDev         4.5
  Median           3

  Frequency Distribution:
         Bin   Frequency
          10      31,689
          18       3,464
          27         539
          36          78
          45          23
          54           6
          62           2
          71           1

Exon length (bp)
---------------------------------------------------
  Min              3
  Max          6,837
  Range        6,834
  Mean         231.5
  StdDev       306.2
  Median         135

  Frequency Distribution:
         Bin   Frequency
         857     153,547
       1,712       5,042
       2,566       1,128
       3,420         243
       4,274          34
       5,128           8
       5,983           5
       6,837           4

Intron length (bp)
---------------------------------------------------
  Min             42
  Max         22,729
  Range       22,687
  Mean         539.3
  StdDev       820.0
  Median         215

  Frequency Distribution:
         Bin   Frequency
       2,878     121,455
       5,714       2,260
       8,550         420
      11,386          67
      14,221           3
      19,893           1
      22,729           3

4.3 Functional Annotation

Gene models with GO terms:                     20,124 ( 56.2%)
Unique GO terms associated:                     2,030
Genes with splice variants:                         0
Gene models with functional description text:  35,802 (100.0%)

Gene Ontology terms associated, per gene model
---------------------------------------------------
  Min              0
  Max             17
  Range           17
  Mean           0.9
  StdDev         0.9
  Median           1

  Frequency Distribution:
         Bin   Frequency
           2      33,416
           4       2,384
          15           1
          17           1

4.4 Gene model supporting evidence

ESTs/cDNAs aligned to the genome:                215,875
Gene models with cDNA OR protein support:         27,322 ( 76.3%)
Gene models with cDNA homology support:           17,410 ( 48.6%)
Gene models without cDNA homology support:        18,392 ( 51.4%)
Gene models with protein homology support:        24,724 ( 69.1%)
Gene models without protein homology support:     11,078 ( 30.9%)
Gene models with both cDNA and protein support:   14,812 ( 41.4%)
Gene models with only cDNA homology support:       2,598 (  7.3%)
Gene models with only protein homology support:    9,912 ( 27.7%)
Gene models with no homology support:              8,480 ( 23.7%)
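
The supporting-evidence counts in section 4.4 are related by simple set
arithmetic: models with cDNA OR protein support equal the cDNA-supported
plus the protein-supported models, minus those counted in both. The
Python sketch below is not part of the ITAG pipeline; it only re-derives
the "OR", "only" and "no support" rows from the figures printed above.
The constant and function names are illustrative assumptions.

```python
# Counts copied from sections 4.2 and 4.4 of this README.
TOTAL_GENE_MODELS = 35_802
WITH_CDNA = 17_410
WITH_PROTEIN = 24_724
WITH_BOTH = 14_812
WITH_EITHER = 27_322
ONLY_CDNA = 2_598
ONLY_PROTEIN = 9_912
NO_SUPPORT = 8_480


def percent(count, total=TOTAL_GENE_MODELS):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100.0 * count / total, 1)


def check_support_counts():
    """Re-derive the dependent rows of section 4.4 and compare them.

    Raises AssertionError if any printed figure disagrees with the
    value implied by the others; returns the derived "OR" count.
    """
    # Inclusion-exclusion: |cDNA or protein| = |cDNA| + |protein| - |both|
    either = WITH_CDNA + WITH_PROTEIN - WITH_BOTH
    assert either == WITH_EITHER
    # "Only" rows are each evidence class minus the overlap.
    assert WITH_CDNA - WITH_BOTH == ONLY_CDNA
    assert WITH_PROTEIN - WITH_BOTH == ONLY_PROTEIN
    # Models with no homology support are the remainder.
    assert TOTAL_GENE_MODELS - either == NO_SUPPORT
    return either


if __name__ == "__main__":
    either = check_support_counts()
    print(f"cDNA or protein support: {either:,} ({percent(either)}%)")
```

The same pattern can be reused to sanity-check counts recomputed from
the GFF3 files against the figures in this README.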