*CONTENTS*

The original source for this potato BAC end data is the FASTA-format file:

  ftp://ftp.plantbiology.msu.edu/pub/data/sol/BAC_ends/s_tuberosum_BAC_ends.zip

This directory contains complete sets of sequence and quality
information (if available) for the potato genomic clone end sequences
in our database. Here you will find per-clone-library potato data
sets, as well as a combined set containing all potato libraries in our
database (if multiple libraries are available). Additionally, each
set is available in two versions, denoted 'raw' and
'screened_and_trimmed'.

Note that this directory contains only potato data. Please see
ftp://ftp.sgn.cornell.edu/tomato_genome/bac_ends for tomato BAC end
data.

The 'screened_and_trimmed' sets have been screened for low-quality
regions and vector sequence, both of which are trimmed away. These
datasets will be appropriate for most users.

The 'raw' sets have not been altered from their original sources,
except that they were basecalled with phred if they were originally
made available to us as chromatograms. These datasets will usually
only be appropriate for users wishing to run the sequences through
their own screening pipelines.

*FILE NAMING SCHEME*

The naming scheme for files in this directory is as follows:

  bacends_[raw | screened_and_trimmed]_*.seq.gz
      - FASTA-formatted sequence files

  bacends_[raw | screened_and_trimmed]_*.qual.gz
      - FASTA-formatted quality files

  blastdb_bacends__[raw | screened_and_trimmed]_*.tar.gz
      - gzipped and tarred BLAST nucleotide databases that you can
        use to run BLAST on your local machine

Library names are as follows:

  RHPOTKEY - Potato BAC library

*READ NAMING SCHEME*

The BAC end read identifiers are constructed as __

The BAC name subpart is constructed as .

Here is a list of the forward and reverse sequencing primers for each
library:

  library    fwd   rev
  RHPOTKEY   T7    SP6

*PROCESSING DETAILS*

The current version of the clone end processing pipeline proceeds as
follows:

- Check the validity of the seq and qual data if provided (basic
  correspondence between seq and qual, etc.)

- Find vector sequence in the ends using the NCBI 'vecscreen' tool
  with NCBI's UniVec database

- If quality data is provided, annotate high- and low-quality regions
  in each sequence:
    * Gaussian-filter the quality data to smooth it out slightly and
      remove any anomalous spikes and valleys
    * find the longest subsequence with bases all above the quality
      cutoff (currently 20)

- Run BLAST searches to find sequences that appear to be from sample
  contaminants. The current version of this pipeline compares the
  end sequences against the tomato chloroplast genome, the
  A. thaliana mitochondrial genome, and a lambda phage genome.

- The raw sequences and quals are loaded into our relational
  database, along with annotations for which regions are
  high-quality, which are probably contaminants, etc.

- The published sets on the ftp site are dumped out of the relational
  database. For the screened_and_trimmed sets, the vector and
  low-quality regions are trimmed out and contaminated sequences are
  discarded.
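
For users running the 'raw' sets through their own screening, the
quality-annotation step described above could be approximated roughly
as follows. This is only an illustrative sketch, not the pipeline's
actual code: the function names, the kernel width (sigma and radius),
the edge handling, and the choice to treat a smoothed score equal to
the cutoff as passing are all assumptions. Only the cutoff of 20 and
the two-stage approach (Gaussian smoothing, then longest run above
the cutoff) come from the description above.

```python
import math


def gaussian_smooth(quals, sigma=1.0, radius=2):
    """Smooth quality scores with a truncated Gaussian kernel.

    sigma and radius are assumed values. At the read edges the kernel
    is renormalized over the positions that actually exist.
    """
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    n = len(quals)
    smoothed = []
    for i in range(n):
        total = 0.0
        weight_sum = 0.0
        for k in range(-radius, radius + 1):
            j = i + k
            if 0 <= j < n:
                w = weights[k + radius]
                total += w * quals[j]
                weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed


def longest_high_quality_run(quals, cutoff=20):
    """Return (start, end), half-open, of the longest run of scores
    at or above the cutoff. Returns (0, 0) if no score passes.
    On ties, the earliest run is kept.
    """
    best = (0, 0)
    start = None
    for i, q in enumerate(quals):
        if q >= cutoff:  # assumption: a score equal to the cutoff passes
            if start is None:
                start = i
            if (i + 1 - start) > (best[1] - best[0]):
                best = (start, i + 1)
        else:
            start = None
    return best


def trim_read(seq, quals, cutoff=20, sigma=1.0, radius=2):
    """Trim a read to its longest high-quality region.

    The region is chosen on the smoothed scores, but the returned
    quality values are the original, unsmoothed ones.
    """
    if len(seq) != len(quals):
        raise ValueError("seq and qual lengths differ")
    smoothed = gaussian_smooth(quals, sigma=sigma, radius=radius)
    start, end = longest_high_quality_run(smoothed, cutoff=cutoff)
    return seq[start:end], quals[start:end]
```

With these assumed kernel settings, a single low score surrounded by
high scores is smoothed above the cutoff and does not split the read,
which is the effect the smoothing step is described as having. Vector
and contaminant screening are separate steps and are not covered by
this sketch.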