*CONTENTS*

The original source for this potato BAC end data is the FASTA-format file:
ftp://ftp.plantbiology.msu.edu/pub/data/sol/BAC_ends/s_tuberosum_BAC_ends.zip

This directory contains complete sets of sequence and quality
information (if available) for the potato genomic clone end sequences
in our database.  Here you will find per-clone-library potato data
sets, as well as a combined set containing all potato libraries in our
database (if multiple libraries are available).  Additionally, each
set is available in two versions, denoted 'raw' and
'screened_and_trimmed'.  Note that this directory contains only potato
data.  Please see ftp://ftp.sgn.cornell.edu/tomato_genome/bac_ends for
tomato BAC end data.

The 'screened_and_trimmed' sets have been screened for low-quality
regions and vector sequence, both of which have been trimmed away.
These datasets will be appropriate for most users.

The 'raw' sets have not been altered from their original sources,
except that they were basecalled with phred if they were originally
made available to us as chromatograms.  These datasets will usually be
appropriate only for users wishing to run the sequences through their
own screening pipelines.


*FILE NAMING SCHEME*

The naming scheme for files in this directory is as follows:

bacends<library name>_[raw | screened_and_trimmed]_<version>*.seq.gz 
   - FASTA-formatted sequence files
bacends<library name>_[raw | screened_and_trimmed]_<version>*.qual.gz 
   - FASTA-formatted quality files
blastdb_bacends_<library name>_[raw | screened_and_trimmed]_<version>*.tar.gz
   - gzipped and tarred BLAST nucleotide databases that you can use to
     run BLAST on your local machine

Library names are as follows:

RHPOTKEY - Potato BAC library
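
As an illustration of how the .seq.gz and .qual.gz files can be read,
here is a minimal Python sketch.  It assumes the usual conventions for
these formats (a '>' header line per record, bases in the sequence
file, whitespace-separated integer quality values in the quality
file).  The function names are hypothetical and are not part of any
tool distributed with this data.

```python
import gzip


def parse_fasta(lines):
    """Yield (identifier, list_of_body_lines) for each FASTA-style record."""
    ident, body = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if ident is not None:
                yield ident, body
            # The identifier is the first whitespace-delimited token after '>'.
            parts = line[1:].split()
            ident, body = (parts[0] if parts else ""), []
        elif ident is not None:
            body.append(line)
    if ident is not None:
        yield ident, body


def read_seqs(lines):
    """Return {identifier: sequence string} from FASTA sequence lines."""
    return {ident: "".join(body) for ident, body in parse_fasta(lines)}


def read_quals(lines):
    """Return {identifier: [int, ...]} from FASTA-style quality lines."""
    return {
        ident: [int(q) for chunk in body for q in chunk.split()]
        for ident, body in parse_fasta(lines)
    }


def open_gz_text(path):
    """Open a gzipped file (e.g. a .seq.gz or .qual.gz file) as text."""
    return gzip.open(path, "rt")
```

To use it, pass the handle returned by open_gz_text() for a downloaded
file to read_seqs() or read_quals(), depending on the file type.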

*READ NAMING SCHEME*

The BAC end read identifiers are constructed as
<bac name>_<read primer>_<read serial number>

The BAC name subpart is constructed as
<library shortname><plate number><well row><well column>.

Here is a list of the forward and reverse sequencing primers for each
library:

library     fwd     rev
RHPOTKEY    T7      SP6
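
The identifier layout above can be taken apart with a regular
expression.  The Python sketch below is only an illustration: this
file does not state the exact width or character set of each field, so
the pattern ASSUMES an alphabetic library shortname, a numeric plate,
a single-letter well row, and a numeric well column.  The example
identifier in the usage note is made up, and the function and class
names are hypothetical.

```python
import re
from typing import NamedTuple

# Primer directions for RHPOTKEY, taken from the table above.
PRIMER_DIRECTION = {"T7": "fwd", "SP6": "rev"}

# ASSUMED field shapes: letters, digits, one letter, digits.
READ_ID = re.compile(
    r"^(?P<lib>[A-Za-z]+)(?P<plate>\d+)(?P<row>[A-Za-z])(?P<col>\d+)"
    r"_(?P<primer>[A-Za-z0-9]+)_(?P<serial>\d+)$"
)


class ReadId(NamedTuple):
    library: str    # library shortname
    plate: int      # plate number
    row: str        # well row
    column: int     # well column
    primer: str     # read primer name
    direction: str  # 'fwd', 'rev', or 'unknown'
    serial: int     # read serial number


def parse_read_id(identifier):
    """Split <bac name>_<read primer>_<read serial number> into fields."""
    m = READ_ID.match(identifier)
    if m is None:
        raise ValueError("unrecognized read identifier: %r" % identifier)
    primer = m.group("primer")
    return ReadId(
        library=m.group("lib"),
        plate=int(m.group("plate")),
        row=m.group("row"),
        column=int(m.group("col")),
        primer=primer,
        direction=PRIMER_DIRECTION.get(primer, "unknown"),
        serial=int(m.group("serial")),
    )
```

For example, parse_read_id("XX0001A01_T7_12345") on this invented
identifier gives library "XX", plate 1, row "A", column 1, and
direction "fwd".  Adjust the pattern if the real identifiers differ.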


*PROCESSING DETAILS*

The current version of the clone end processing pipeline runs as follows:

- Check the validity of the seq and qual data, if provided (basic
  correspondence between seq and qual, etc.)

- Find vector sequence in the ends using the NCBI 'vecscreen' tool
  with NCBI's UniVec database

- If quality data is provided, annotate high- and low-quality regions
  in each sequence:

   * Gaussian-filter the quality data to smooth it out slightly and
     remove any anomalous spikes and valleys

   * find the longest subsequence with bases all above the quality
     cutoff (currently 20)

- Run BLASTs to find sequences that appear to be from sample
  contaminants.  The current version of this pipeline compares the end
  sequences against the tomato chloroplast genome, the A. thaliana
  mitochondrial genome, and a lambda phage genome.

- The raw sequences and quals are loaded into our relational database,
  along with annotations for which regions are high-quality, which are
  probably contaminants, etc.

- The published sets on the ftp site are dumped out of the relational
  database, with the vector and low-quality regions trimmed out and
  contaminated sequences thrown out for the screened_and_trimmed sets.
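
The quality-annotation step above (smooth the quality values, then
find the longest stretch above the cutoff) can be sketched in Python
as follows.  This is not the pipeline's actual code.  The kernel width
(sigma), the window radius, the edge handling, and the treatment of a
value exactly equal to the cutoff as passing are all ASSUMPTIONS made
for illustration.  Only the cutoff of 20 comes from the description
above.

```python
import math


def gaussian_smooth(quals, sigma=1.0, radius=3):
    """Smooth quality values with a truncated Gaussian kernel.

    sigma and radius are assumed values.  Near the ends of the read the
    kernel is renormalized over the positions that actually exist.
    """
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    n = len(quals)
    smoothed = []
    for i in range(n):
        num = den = 0.0
        for k, w in zip(offsets, weights):
            j = i + k
            if 0 <= j < n:
                num += w * quals[j]
                den += w
        smoothed.append(num / den)
    return smoothed


def longest_high_quality_run(values, cutoff=20):
    """Return (start, end) of the longest run with every value >= cutoff.

    The interval is half-open, so values[start:end] is the run.
    Returns (0, 0) if no value reaches the cutoff.
    """
    best_start = best_len = 0
    run_start = None
    for i, v in enumerate(values):
        if v >= cutoff:
            if run_start is None:
                run_start = i
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start = None
    return best_start, best_start + best_len


def high_quality_region(quals, cutoff=20):
    """Smooth the quality values, then locate the longest passing run."""
    return longest_high_quality_run(gaussian_smooth(quals), cutoff)
```

The returned interval can be used to slice the matching sequence
string, since the seq and qual data for a read have the same length.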