*CONTENTS*

This directory contains complete sets of sequence and quality
information for the genomic clone end sequences in our database.  Here
you will find per-library data sets, as well as a combined set
containing all libraries in our database.  Additionally, each set is
available in two versions, denoted 'raw' and 'screened_and_trimmed'.

The 'screened_and_trimmed' sets have been basecalled (with phred),
then screened for low-quality regions and vector sequence, both of
which are trimmed away.  These datasets will be appropriate for most
users.

The 'raw' sets have only been basecalled with phred; the sequences
have not been altered in any other way.  These datasets will usually
only be appropriate for users wishing to run the sequences through
their own screening pipelines.


*FILE NAMING SCHEME*

The naming scheme for files in this directory is as follows:

bacends<library name>_[raw | screened_and_trimmed]_<version>*.seq.gz 
   - FASTA-formatted sequence files
bacends<library name>_[raw | screened_and_trimmed]_<version>*.qual.gz 
   - FASTA-formatted quality files
blastdb_bacends_<library name>_[raw | screened_and_trimmed]_<version>*.tar.gz
   - gzipped and tarred BLAST nucleotide databases that you can use to
     run BLAST on your local machine
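
As an illustration, a matching .seq.gz and .qual.gz pair can be read
together with a short Python script.  This is only a sketch: it
assumes that both files list the same reads in the same order, and
that each quality record holds whitespace-separated integer scores
(the usual phred layout).  The function names are our own and are not
part of any distributed tool.

```python
import gzip


def read_fasta_like(path):
    """Yield (identifier, body_lines) for each record in a gzipped FASTA-style file."""
    ident, lines = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if ident is not None:
                    yield ident, lines
                # The identifier is the first whitespace-delimited token after '>'.
                tokens = line[1:].split()
                ident, lines = (tokens[0] if tokens else ""), []
            else:
                lines.append(line)
    if ident is not None:
        yield ident, lines


def read_seq_qual(seq_path, qual_path):
    """Yield (identifier, sequence, qualities), checking that the two files agree.

    Assumption: records appear in the same order in both files.  Note that
    zip() stops at the end of the shorter file without raising an error.
    """
    pairs = zip(read_fasta_like(seq_path), read_fasta_like(qual_path))
    for (seq_id, seq_lines), (qual_id, qual_lines) in pairs:
        seq = "".join(seq_lines)
        qual = [int(q) for q in " ".join(qual_lines).split()]
        if seq_id != qual_id or len(seq) != len(qual):
            raise ValueError(f"seq/qual mismatch at {seq_id!r} / {qual_id!r}")
        yield seq_id, seq, qual
```

Call read_seq_qual() with the paths of one .seq.gz file and its
corresponding .qual.gz file, and iterate over the result.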

Library names are as follows:

LE_HBa    -   Tomato HindIII BAC library
SL_MboI   -   Tomato MboI BAC library
SL_EcoRI  -   Tomato EcoRI BAC library
SL_FOS    -   Tomato Fosmid library
SL_MT     -   Tomato Micro-Tom HindIII BAC library
LpenBAC   -   S. pennellii HindIII BAC library
LpenCOS   -   S. pennellii HindIII unpackaged cosmid BAC library

*READ NAMING SCHEME*

The BAC end read identifiers are constructed as <bac name>_<read primer>_<read serial number>.

The BAC name subpart is constructed as <library shortname><plate number><well row><well column>.

Here is a list of the forward and reverse sequencing primers for each library:
library   fwd    rev
LE_HBa    T7     SP6
SL_MboI   T7     SP6
SL_EcoRI  T7     SP6
SL_FOS    pIBF   pIBR
SL_MT     T7     SP6
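
The read naming scheme above can be taken apart with a short Python
sketch.  The identifier in the usage note is hypothetical and was made
up to follow the stated pattern.  The sketch assumes that the library
shortname does not end in a digit, that the well row is a single
letter, and that the primer and serial number contain no underscores.
The fwd/rev labels come from the primer table above.

```python
import re

# Primer names from the table above, mapped to read direction.
PRIMER_DIRECTION = {"T7": "fwd", "SP6": "rev", "pIBF": "fwd", "pIBR": "rev"}

# <library shortname><plate number><well row><well column>
# Assumption: the shortname does not end in a digit and the row is one letter.
BAC_NAME = re.compile(
    r"^(?P<library>.+?)(?P<plate>\d+)(?P<row>[A-Za-z])(?P<col>\d+)$"
)


def parse_read_id(read_id):
    """Split <bac name>_<read primer>_<read serial number> into its parts."""
    # Library shortnames may contain underscores, so split from the right.
    bac_name, primer, serial = read_id.rsplit("_", 2)
    match = BAC_NAME.match(bac_name)
    if match is None:
        raise ValueError(f"unrecognized BAC name: {bac_name!r}")
    parts = match.groupdict()
    parts.update(
        bac_name=bac_name,
        primer=primer,
        direction=PRIMER_DIRECTION.get(primer),  # None for unknown primers
        serial=serial,
    )
    return parts
```

For example, parse_read_id("LE_HBa0001A01_T7_12345") would report
library LE_HBa, plate 0001, well row A, well column 01, and a forward
read from the T7 primer.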

*PROCESSING DETAILS*

The current version of the clone end processing pipeline runs as
follows:

- Check validity of seq and qual data if provided (basic
  correspondence between seq and qual, etc)

- Find vector sequence in the ends using the NCBI 'vecscreen' tool
  with NCBI's UniVec database

- If quality data is provided, annotate high- and low-quality regions
  in each sequence:

   * Gaussian-filter the quality data to smooth it out slightly and
     remove any anomalous spikes and valleys

   * find the longest subsequence with bases all above the quality
     cutoff (currently 20)

- Run BLASTs to find sequences that appear to be from sample
  contaminants.  The current version of this pipeline compares the end
  sequences against the tomato chloroplast genome, the A. thaliana
  mitochondrial genome, and a lambda phage genome.

- The raw sequences and quals are loaded into our relational database,
  along with annotations for which regions are high-quality, which are
  probably contaminants, etc.

- The published sets on the FTP site are dumped out of the relational
  database, with the vector and low-quality regions trimmed out and
  contaminated sequences thrown out for the screened_and_trimmed sets.
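
For users who want to apply similar quality trimming to the 'raw'
sets, the quality annotation step above can be approximated in Python
as follows.  This is a sketch and not the pipeline's own code: the
kernel width (sigma), the kernel radius, and the handling of sequence
edges are assumptions, since only the cutoff of 20 is stated above.
"Above the cutoff" is read here as strictly greater than 20.

```python
import math


def gaussian_smooth(quals, sigma=1.0, radius=3):
    """Smooth quality values with a truncated Gaussian kernel.

    sigma and radius are assumed values.  Near the ends of the read, the
    kernel is renormalized over the positions that actually exist.
    """
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    n = len(quals)
    smoothed = []
    for i in range(n):
        num = den = 0.0
        for k, w in zip(offsets, weights):
            j = i + k
            if 0 <= j < n:
                num += w * quals[j]
                den += w
        smoothed.append(num / den)
    return smoothed


def longest_high_quality_run(values, cutoff=20):
    """Return (start, end), half-open, of the longest run with values > cutoff.

    Returns (0, 0) if no value exceeds the cutoff.  Ties go to the first run.
    """
    best = (0, 0)
    start = None
    for i, value in enumerate(values):
        if value > cutoff:
            if start is None:
                start = i
            if i + 1 - start > best[1] - best[0]:
                best = (start, i + 1)
        else:
            start = None
    return best


def trim(seq, quals, cutoff=20, sigma=1.0):
    """Trim a read to its longest high-quality region after smoothing."""
    start, end = longest_high_quality_run(gaussian_smooth(quals, sigma), cutoff)
    return seq[start:end], quals[start:end]
```

Smoothing before thresholding keeps a single low-scoring base inside
an otherwise good region from splitting that region in two.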