*CONTENTS*

This directory contains complete sets of sequence and quality
information for the genomic clone end sequences in our database.

Here you will find per-library data sets, as well as a combined set
containing all libraries in our database. Each set is available in
two versions, denoted 'raw' and 'screened_and_trimmed'.

The 'screened_and_trimmed' sets have been basecalled (with phred),
then screened for low-quality regions and vector sequence, both of
which are trimmed away. These datasets will be appropriate for most
users.

The 'raw' sets have only been basecalled with phred; the sequences
have not been altered in any other way. These datasets will usually
be appropriate only for users wishing to run the sequences through
their own screening pipelines.

*FILE NAMING SCHEME*

The naming scheme for files in this directory is as follows:

  bacends_[raw | screened_and_trimmed]_*.seq.gz
     - FASTA-formatted sequence files
  bacends_[raw | screened_and_trimmed]_*.qual.gz
     - FASTA-formatted quality files
  blastdb_bacends__[raw | screened_and_trimmed]_*.tar.gz
     - gzipped and tarred BLAST nucleotide databases that you can
       use to run BLAST on your local machine

Library names are as follows:

  LE_HBa   - Tomato HindIII BAC library
  SL_MboI  - Tomato MboI BAC library
  SL_EcoRI - Tomato EcoRI BAC library
  SL_FOS   - Tomato Fosmid library
  SL_MT    - Tomato Micro-Tom HindIII BAC library
  LpenBAC  - S. pennellii HindIII BAC library
  LpenCOS  - S. pennellii HindIII unpackaged cosmid BAC library

*READ NAMING SCHEME*

The BAC end read identifiers are constructed as __
The BAC name subpart is constructed as .
Here is a list of the forward and reverse sequencing primers for each
library:

  library    fwd    rev
  LE_HBa     T7     SP6
  SL_MboI    T7     SP6
  SL_EcoRI   T7     SP6
  SL_FOS     pIBF   pIBR
  SL_MT      T7     SP6

*PROCESSING DETAILS*

The current version of the clone end processing pipeline runs as
follows:

  - Check the validity of the seq and qual data, if provided (basic
    correspondence between seq and qual, etc.)
  - Find vector sequence in the ends using the NCBI 'vecscreen' tool
    with NCBI's UniVec database
  - If quality data is provided, annotate high- and low-quality
    regions in each sequence:
      * Gaussian-filter the quality data to smooth it out slightly
        and remove any anomalous spikes and valleys
      * find the longest subsequence with bases all above the
        quality cutoff (currently 20)
  - Run BLASTs to find sequences that appear to be from sample
    contaminants. The current version of this pipeline compares the
    end sequences against the tomato chloroplast genome, the
    A. thaliana mitochondrial genome, and a lambda phage genome.
  - Load the raw sequences and quals into our relational database,
    along with annotations for which regions are high-quality, which
    are probably contaminants, etc.
  - Dump the published sets on the ftp site out of the relational
    database. For the screened_and_trimmed sets, the vector and
    low-quality regions are trimmed out and contaminated sequences
    are discarded.
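
As a rough illustration of the quality-annotation step described above,
the following Python sketch smooths a read's phred scores with a small
Gaussian kernel and then finds the longest run of bases at or above the
cutoff. It is not the pipeline's actual code. The function names, the
kernel width (sigma), the truncation radius, and the choice to treat the
cutoff as inclusive are all assumptions made for illustration; only the
cutoff value of 20 comes from the description above.

```python
import math


def gaussian_smooth(quals, sigma=1.0, radius=2):
    """Smooth phred scores with a truncated, renormalized Gaussian kernel.

    sigma and radius are illustrative assumptions, not the pipeline's values.
    """
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    n = len(quals)
    smoothed = []
    for i in range(n):
        total = 0.0
        norm = 0.0
        for k, w in zip(offsets, weights):
            j = i + k
            if 0 <= j < n:  # ignore kernel positions that fall off the read
                total += w * quals[j]
                norm += w
        smoothed.append(total / norm)
    return smoothed


def longest_high_quality_run(quals, cutoff=20):
    """Return (start, end), half-open, of the longest run with q >= cutoff.

    Treating the cutoff as inclusive is an assumption. Returns (0, 0) if no
    base passes. Ties are resolved in favor of the earliest run.
    """
    best = (0, 0)
    start = None
    # A sentinel below any cutoff closes a run that reaches the read's end.
    for i, q in enumerate(list(quals) + [float("-inf")]):
        if q >= cutoff:
            if start is None:
                start = i
        else:
            if start is not None and i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def trim_read(seq, quals, cutoff=20, sigma=1.0, radius=2):
    """Trim a read and its quals to the longest smoothed high-quality region."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    smoothed = gaussian_smooth(quals, sigma=sigma, radius=radius)
    start, end = longest_high_quality_run(smoothed, cutoff=cutoff)
    return seq[start:end], quals[start:end]
```

Calling trim_read(seq, quals) with a sequence string and a list of
integer phred scores of the same length returns the trimmed sequence
and the matching slice of the original (unsmoothed) quality values.
Users running the 'raw' sets through their own screening may want to
adjust the smoothing parameters to suit their data.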