Transcriptome sequences of the wild tomato species Solanum peruvianum

Soon-Ju Park, Ke Jiang, Michael C. Schatz, and Zachary B. Lippman
Cold Spring Harbor Laboratory

The research groups of Michael C. Schatz and Zachary B. Lippman at Cold Spring Harbor Laboratory in New York have generated transcriptome sequences of the green-fruited wild tomato species Solanum peruvianum. These transcriptome data provide a new resource for biological discovery on tomato development and evolution. The sequences were generated from paired-end 50 bp Illumina reads, which gave an estimated 5- to 1000-fold transcript coverage, depending on expression level. The transcriptome was established in two ways: by de novo assembly with the open-source mRNA assembly algorithm Inchworm (inchworm.sourceforge.net), and by reference-guided cDNA reconstruction using S. lycopersicum cv. Heinz as the reference.

Both approaches yielded high-quality transcriptome reconstructions. Selected statistics from the de novo assembly and the reference-based cDNA reconstruction are listed below. Only those contigs from the de novo assembly that match known annotated genes are provided.

de novo assembly:
	number of contigs: 177863
	transcriptome size: 61852880 bp
	N50: 633 bp
	minimum length: 100 bp
	maximum length: 15880 bp
	mean length: 347 bp
	median length: 156 bp
	
reference-based reconstruction:
	number of reconstructed cDNAs: 17430
	transcriptome size: 23325040 bp
	minimum length: 54 bp
	maximum length: 15320 bp
	mean length: 1338 bp
	median length: 1139 bp
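
The summary statistics listed above can be recomputed from a list of contig or cDNA lengths. The following is a minimal Python sketch, not the script used to produce the numbers above. The function names n50 and summarize and the toy length list are hypothetical, and the sketch assumes the common N50 definition: the length L such that sequences of length L or longer contain at least half of all bases.

```python
from statistics import mean, median


def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp).

    Assumed definition: walk the lengths from longest to shortest and
    return the first length at which the running total reaches at least
    half of the summed length of all sequences.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must contain at least one sequence")


def summarize(lengths):
    """Compute the same kinds of statistics reported in the lists above."""
    return {
        "number of contigs": len(lengths),
        "transcriptome size": sum(lengths),
        "N50": n50(lengths),
        "minimum length": min(lengths),
        "maximum length": max(lengths),
        "mean length": mean(lengths),
        "median length": median(lengths),
    }


if __name__ == "__main__":
    # Toy lengths for illustration only; these are NOT S. peruvianum data.
    toy_lengths = [100, 120, 150, 156, 200, 347, 633, 1500]
    for name, value in summarize(toy_lengths).items():
        print(f"{name}: {value}")
```

In practice the lengths would be read from the released FASTA files. Because the N50 weights long sequences more heavily than the median does, the two can differ substantially, as they do for the de novo assembly.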
	
Initial usage of the data has indicated excellent coverage and depth: the number of genes captured in S. peruvianum is very close to the number of genes captured in a similar experiment on domesticated tomato. Preliminary analysis indicates that more than 75% of the S. peruvianum raw sequence reads align to S. lycopersicum cv. Heinz, suggesting high conservation in coding regions between the two tomato species. An initial estimate of divergence in coding regions between the two species found 834 genes with identical coding regions (zero divergence), a mean divergence of 0.00704, and a median divergence of 0.005522 (under the Jukes-Cantor model, which corrects for multiple substitutions at the same site).
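
For readers unfamiliar with the Jukes-Cantor correction, the standard formula converts the observed proportion p of differing sites into a distance d = -(3/4) ln(1 - (4/3) p). The sketch below illustrates that formula only. How the coding regions were aligned and compared is not described here, so the helper p_distance, its handling of gaps and ambiguous bases, and the toy sequences are all assumptions made for illustration.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Assumption: the sequences are already aligned and of equal length;
    columns containing anything other than A, C, G or T (gaps, N) are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


def jukes_cantor_distance(p):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - (4/3) * p)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    # Toy aligned sequences for illustration only; not real tomato genes.
    a = "ATGGCTAGCTAGGATCCGTA"
    b = "ATGGCTAGTTAGGATCCGTA"
    p = p_distance(a, b)
    print(f"p = {p:.4f}, JC distance = {jukes_cantor_distance(p):.4f}")
```

At the low divergence levels reported above, the corrected distance is only slightly larger than the raw proportion of differing sites, since few sites are expected to have been hit more than once.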

We are pleased to provide these pre-publication transcriptome sequences of S. peruvianum to the Solanaceae and broader plant biology communities. Both data sets, the de novo assembled transcriptome and the reference-guided reconstruction of cDNAs, are freely available to the community without any restrictions on their use.

For further information, please contact Z.B. Lippman (lippman@cshl.edu).