SCRIPT FLOW:

CGN:
00_get_brc_seq.pl -o $order_number[ $order_number]
01_create_traces.pl -p cgn
02_basecall.pl -p cgn
03_new_seq_upload.pl -p cgn
04_vector_screen.pl -p cgn
05_vector_and_qual_trimming.pl -p cgn
06_polyA_trimming.pl -p cgn
07_trimmed_seq_upload.pl -p cgn
08_contamination_screen.pl -p cgn
09_trimmed_seq_quality_check.pl -p cgn
10_lib_assembly.pl -p cgn -l $lib_name
11_lib_assembly_upload.pl -p cgn -l $lib_name

Note:  Repeat steps 10 and 11 for each library that received new sequences, and once more for the 'all' library


FGN:
00_get_fgp_seq.pl
01_create_traces.pl -p fgn
02_basecall.pl -p fgn
03_new_seq_upload.pl -p fgn
04_vector_screen.pl -p fgn
05_vector_and_qual_trimming.pl -p fgn
06_polyA_trimming.pl -p fgn
07_trimmed_seq_upload.pl -p fgn
08_contamination_screen.pl -p fgn
09_trimmed_seq_quality_check.pl -p fgn
10_lib_assembly.pl -p fgn -l $lib_name
11_lib_assembly_upload.pl -p fgn -l $lib_name

Note:  Repeat steps 10 and 11 for each library that received new sequences, and once more for the 'all' library
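
The two flows above differ only in the retrieval step (00) and the project flag. As a rough illustration, the sketch below builds the ordered list of command lines for one run. It is not part of the pipeline: the function name build_pipeline and the library name used in the usage note are hypothetical, and the sketch only assembles argument lists without executing anything.

```python
from typing import List, Optional

# Steps 01-09 take only the project flag; names copied from the SCRIPT FLOW listing.
COMMON_STEPS = [
    "01_create_traces.pl",
    "02_basecall.pl",
    "03_new_seq_upload.pl",
    "04_vector_screen.pl",
    "05_vector_and_qual_trimming.pl",
    "06_polyA_trimming.pl",
    "07_trimmed_seq_upload.pl",
    "08_contamination_screen.pl",
    "09_trimmed_seq_quality_check.pl",
]


def build_pipeline(project: str, libraries: List[str],
                   order_numbers: Optional[List[str]] = None) -> List[List[str]]:
    """Return the command lines for one full run, in execution order."""
    if project == "cgn":
        if not order_numbers:
            raise ValueError("cgn runs need at least one BRC order number")
        commands = [["00_get_brc_seq.pl", "-o", *order_numbers]]
    elif project == "fgn":
        commands = [["00_get_fgp_seq.pl"]]
    else:
        raise ValueError("no retrieval step is listed for project %r" % project)

    commands += [[step, "-p", project] for step in COMMON_STEPS]

    # Steps 10 and 11 run once per library with new sequences, then for 'all'.
    for lib in [name for name in libraries if name != "all"] + ["all"]:
        commands.append(["10_lib_assembly.pl", "-p", project, "-l", lib])
        commands.append(["11_lib_assembly_upload.pl", "-p", project, "-l", lib])
    return commands
```

For example, build_pipeline("cgn", ["lib_a"], ["10032471"]) yields 14 command lines: one retrieval step, nine common steps, and two assembly steps each for 'lib_a' and 'all'.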


DETAILED DESCRIPTION OF SCRIPTS:


00_get_brc_seq.pl -o $order_number[ $order_number] [-u $username] [-p $password] [-d $destination_directory]


Description:

Program designed to log in to the BRC (Cornell's BioResource Center) and retrieve sequences
Requires libwww-perl-5.53 (the exact version probably doesn't matter)
Does not use SSL
Takes a list of order numbers (with the -o flag) and retrieves the zipped traces

Command Line Parameters:

$order_number: the BRC assigned order number for the plate 
eg.: 10032471

$username: login user name for BRC 
default: cl295@cornell.edu

$password: BRC login password for the user
default: coffee

$destination_directory: where to place the downloaded files
default: /data/shared/pgn_data_processing/incoming_files/cgn/new_files/


Other Builtin Script Parameters:

$lastname: used to construct url for download
value: lin

$loginURL: address to login for the download
value: http://www.brc.cornell.edu/user/login.php

$zipURL: URL constructed for download
value: dynamically generated after login. Base URL is http://www.brc.cornell.edu/user/res_zipdl.php3/


Notes:
* At least one order number is required; all other command line inputs are optional


===============================================================================

00_get_fgp_seq.pl [-i $new_dir] [-d $done_dir]

Description:

Program designed to log in and retrieve sequences from the PSU (Penn State University) LIMS system
Logs into the PSU machine via ssh using key-based authentication to retrieve the list of files, and uses scp with the same authentication mechanism to do the actual copying
Note: This script should check that the library for the incoming data exists;
without that check, the first run of the raw sequence upload script (03_new_seq_upload.pl) may fail.


Command Line Parameters: 

$new_dir: location for incoming data files
default: /data/shared/pgn_data_processing/incoming_files/fgn/new_files/

$done_dir: location of previously retrieved data.  Used to build a list of files we already have
default: /data/shared/pgn_data_processing/incoming_files/fgn/done_files/


Other Builtin Script Parameters:

$login_user: the username to log in to the PSU machine
value: transfer

$login_identification: the location of the identification file for SSH2 key based authentication
value: /data/shared/pgn_data_processing/scripts/processing_components/data_files/fgn_psu_ssh/identification

$psu_machine_ip: address of the machine at PSU that holds the data
value: 146.186.29.44
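
The role of $done_dir described above (building a list of files we already have) can be sketched as follows. This is an illustration only: the function name files_to_fetch is hypothetical, the remote listing is assumed to have been obtained already (for example over ssh), and files are assumed to be matched by base name.

```python
import os
from typing import Iterable, List


def files_to_fetch(remote_files: Iterable[str], done_dir: str) -> List[str]:
    """Return the remote files whose base name is not yet present in done_dir."""
    already_have = set(os.listdir(done_dir)) if os.path.isdir(done_dir) else set()
    return sorted(name for name in remote_files
                  if os.path.basename(name) not in already_have)
```

Only the names returned here would then need to be copied with scp into $new_dir.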


===============================================================================

01_create_traces.pl -p $project [-i $in_dir] [-d $out_dir] [-m $moveto_dir]

Description:

Program designed to take a set of zipped tracefile folders, separate out the chromas (chromatogram files), gzip them individually, and put them in their destination location for future processing


Command Line Parameters: 

$project: specifies the project that applies to this run
default: none (mandatory parameter)
valid values: cgn, fgn, pgn

$in_dir: location for the tracefile zipped folders
default: /data/shared/pgn_data_processing/incoming_files/$project/new_files

$out_dir: destination for the processed traces
default: /data/shared/pgn_data_processing/trace_files/$project

$moveto_dir: destination for the original zipped folders after processing
default: /data/shared/pgn_data_processing/incoming_files/$project/done_files
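
A minimal sketch of the unpack, gzip, and move sequence described above, for a single archive. It is not the script's actual code: the function name unpack_archive is hypothetical, and flattening any folder structure inside the zip is an assumption.

```python
import gzip
import os
import shutil
import zipfile
from typing import List


def unpack_archive(zip_path: str, out_dir: str, moveto_dir: str) -> List[str]:
    """Gzip every file in one zip archive into out_dir, then move the zip away."""
    os.makedirs(out_dir, exist_ok=True)
    os.makedirs(moveto_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue
            # Assumption: folder structure inside the archive is flattened.
            target = os.path.join(out_dir, os.path.basename(member.filename) + ".gz")
            with archive.open(member) as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    # The original archive ends up in the "done" location after processing.
    shutil.move(zip_path, os.path.join(moveto_dir, os.path.basename(zip_path)))
    return written
```

Each chroma ends up as its own .gz file, which is the layout 02_basecall.pl expects.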


===============================================================================

02_basecall.pl  -p $project [-i $in_dir] [-o $seq_out_file] [-q $qual_out_file] [-m $move_dir] [-t $tmp_dir] [-e $phred_param_file] [-d $dye_chem]

Description:

Program designed to take a directory of chromas, recursively traverse it and process all chromas, generate sequence and quality files, and move the processed chromas into a storage directory


Command Line Parameters: 

$project: specifies the project that applies to this run
default: none (mandatory parameter)
valid values: cgn, fgn, pgn

$in_dir: location of tracefiles (these can be in subdirectories of this dir)
default: /data/shared/pgn_data_processing/trace_files/$project

$seq_out_file: pathname for the sequence file output
default: /tmp/basecall_seqs_fasta

$qual_out_file: pathname for the quality values output
default: /tmp/basecall_seqs_fasta.qual

$move_dir: directory to which the trace files are moved after basecalling
default: /data/shared/pgn_data_processing/processed_traces/$project

$tmp_dir: directory where temporary files for this run are stored
default: /tmp

$phred_param_file: environment variable that specifies the location of the "phredpar.dat" file
default: /usr/local/src/phred/phredpar.dat

$dye_chem: information on the dye used, usually the last letter of an ABI tracefile, eg. 'i' or 'e'
default: none, if value is not provided a default is given on a per project basis


Other Builtin Script Parameters:

$trace_list_file: temporary file for storing traces processed
value: $tmp_dir/tracelist.txt


Notes:

* Script assumes that all the chromas are individually gzipped (rather than all together in a zip file).  Script 01_create_traces.pl is supposed to make sure that's the case.
* Script has only been tested with ABI chromas.  Behaviour with other types of trace files is undefined.
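
The recursive traversal and the trace list file mentioned above could look roughly like the sketch below. The function name build_trace_list is hypothetical, and selecting files by a .gz suffix is an assumption based on the note that all chromas are individually gzipped.

```python
import os


def build_trace_list(in_dir: str, trace_list_file: str) -> int:
    """Write every *.gz file found under in_dir to trace_list_file; return the count."""
    traces = []
    for root, _dirs, files in os.walk(in_dir):
        traces.extend(os.path.join(root, name) for name in files if name.endswith(".gz"))
    traces.sort()
    with open(trace_list_file, "w") as handle:
        handle.writelines(path + "\n" for path in traces)
    return len(traces)
```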


===============================================================================

03_new_seq_upload.pl -p $project [-i $seq_in_file] [-q $qual_in_file] [-t $seq_type] [-f $seq_format] [-s $sequencing_info] [-r $read_direction] [-o $other_id_type] [-s $source_info_dir]


Description:

Program designed to take in a fasta file and corresponding quality values and load them into the appropriate database (specified via the "$project" parameter).  


Command Line Parameters: 

$project: specifies the project that applies to this run
default: none (mandatory parameter)
valid values: cgn, fgn, pgn

$seq_in_file: fasta file containing sequences to be uploaded
default: /tmp/basecall_seqs_fasta

$qual_in_file: quality values file for the sequences to be uploaded
default: /tmp/basecall_seqs_fasta.qual

$seq_type: type of sequence being uploaded (nucleotide, protein, etc)
default: nucleotide

$seq_format: formatting of the input sequence
default: FASTA 
note: Only FASTA files are currently supported

$sequencing_info: db index of the sequencing facility information
default: 1 
note: this value is set to 1 by default, then reset in the specific project branches based on branch specific information

$read_direction: sequencing direction (5' or 3')
default: 5

$other_id_type: the kind of external identifier being uploaded along with the sequence
default: clone name

$source_info_dir: external flatfiles that hold information about sequencing location for different libraries
default: /data/shared/pgn_data_processing/incoming_files/$project/file_source/


Other Builtin Script Parameters:

$use_seqinfo_on_disk: flag to read sequencing info from disk or assign sequencing facility info in script
value: 0


Notes:

* There are a lot of assumptions regarding the format of the defline, so an arbitrary fasta file won't work.
* A lot of project specific handling is done along the project branches
* 3' sequence processing hasn't been tested, use at your own risk
* There is no robust rollback if one of the database loading steps fails.
  If you need to roll back a loading batch, use the plate removal script:
  /data/shared/pgn_data_processing/scripts/data_manipulation/pull_plate.pl (as of 2003-10-14)
* There are several major possible failure points:
  - if there is a mismatch between the number of sequences and the number of quality value entries, the script will abort with an error
  - if it is set to read sequencing information from disk, and can't open the file to read it, it will abort with an error
  - if the tracefile already exists in the database, it will skip that sequence; note that this can be a problem if a previous load attempt aborted after loading the tracefile info, so make sure you run the plate removal script before attempting to reload
  - if it can't figure out the library name from the defline, it will print a message and skip that sequence
  - if this is fgn data and it can't find the source info it will skip those sequences
  - if it can't prepare or execute an SQL statement, it will die with an error
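
The first failure point in the list above (a mismatch between sequence and quality entries) can be checked before any loading is attempted. The sketch below is an assumption-laden illustration, not the script's code: the function names are hypothetical, and it only counts deflines in the two files.

```python
from typing import List


def fasta_ids(path: str) -> List[str]:
    """Return the identifier (first word of each defline) for every record in a file."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                ids.append(fields[0] if fields else "")
    return ids


def check_seq_qual_match(seq_file: str, qual_file: str) -> int:
    """Return the number of records, or raise if the two files disagree."""
    seq_ids = fasta_ids(seq_file)
    qual_ids = fasta_ids(qual_file)
    if len(seq_ids) != len(qual_ids):
        raise ValueError("%d sequences but %d quality entries"
                         % (len(seq_ids), len(qual_ids)))
    return len(seq_ids)
```

Running such a check on /tmp/basecall_seqs_fasta and /tmp/basecall_seqs_fasta.qual before the upload avoids starting a load that would abort partway.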


===============================================================================

04_vector_screen.pl -p $project [-i $seq_in_file] [-a $qual_in_file] [-o $seq_out_file] [-q $qual_out_file] [-f $vector_file] [-c $cross_match] [-s $screening_output] [-v $verbose]


Description:

This script simply runs crossmatch to identify vector sequences and masks them in a temporary file
It does not load anything into the database


Command Line Parameters: 

$project: specifies the project that applies to this run
default: none (mandatory parameter)
valid values: cgn, fgn, pgn

$seq_in_file: sequence fasta file created from database as input to crossmatch
default: /tmp/seqs_fasta

$qual_in_file:  quality values fasta file created from database as input to crossmatch
default: /tmp/seqs_fasta.qual

$seq_out_file: sequence fasta file created by crossmatch containing the masked vector regions
default: /tmp/seqs_fasta.screen

$qual_out_file:  quality values file for the screened sequences, should be a copy of the $qual_in_file
default: /tmp/seqs_fasta.screen.qual

$vector_file: vector sequence file for crossmatch
default: ./data_files/vector.seq

$cross_match: location of the crossmatch binary
default: /usr/local/bin/cross_match

$screening_output: file to which the standard output of crossmatch gets redirected
default: /tmp/screen.out

$verbose: flag for verbosity of output
default: undefined (variable is unset)


Other Builtin Script Parameters:

$minmatch: minimum word length value for crossmatch; see crossmatch docs for further info
value: 12

$minscore: minimum score value for crossmatch; see crossmatch docs for further info
value: 20


Notes:

* The script prints a warning if quality values are missing for a sequence, but still creates an empty quality entry for that sequence


===============================================================================

05_vector_and_qual_trimming.pl -p $project [-i $seq_in_file] [-a $qual_in_file] [-o $seq_out_file] [-q $qual_out_file] [-t $treshold_qual] [-v $verbose]


Description:

This script does all the heavy lifting when it comes to sequence trimming.  It will parse out the sequences into vector-flanked chunks, look for adaptor sequence, quality-trim the chunks using our custom algorithm, and select the best chunk as _the_ trimmed sequence for that read.  It also loads into the database all the trimming info such as start and stop of vector region, start and stop of adaptor region, start and stop of quality trimming region.
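
To make the chunking idea concrete, here is a simplified sketch. It is NOT the custom trimming algorithm used by the script: the quality trim below is a plain stand-in that shrinks each chunk from both ends until the edge bases reach the threshold, "best" is assumed to mean "longest after trimming", and adaptor detection is left out entirely. The function names are hypothetical. Vector bases are assumed to be masked as 'X' in the screened input.

```python
import re
from typing import List, Optional, Tuple


def quality_trim(quals: List[int], start: int, end: int, threshold: int) -> Tuple[int, int]:
    """Shrink [start, end) from both ends until the edge bases reach threshold."""
    while start < end and quals[start] < threshold:
        start += 1
    while end > start and quals[end - 1] < threshold:
        end -= 1
    return start, end


def best_chunk(masked_seq: str, quals: List[int],
               threshold: int = 15) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest quality-trimmed chunk, or None if none survive."""
    best = None
    # Chunks are the runs of unmasked bases between the 'X' (vector) regions.
    for match in re.finditer(r"[^Xx]+", masked_seq):
        start, end = quality_trim(quals, match.start(), match.end(), threshold)
        if end > start and (best is None or end - start > best[1] - best[0]):
            best = (start, end)
    return best
```

The returned coordinates are relative to the raw sequence, which is the form in which trimming locations need to be recorded.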


Command Line Parameters: 

$project: specifies the project that applies to this run
default: none (mandatory parameter)
valid values: cgn, fgn, pgn

$seq_in_file: vector masked sequence fasta file to process
default: /tmp/seqs_fasta.screen

$qual_in_file:  quality values fasta file for the vector masked sequences
default: /tmp/seqs_fasta.screen.qual

$seq_out_file: output file for trimmed sequences
default: /tmp/seqs_fasta.trim

$qual_out_file:  output file for trimmed quality values
default: /tmp/seqs_fasta.trim.qual

$treshold_qual: the quality value that will be set as the floor for the quality trimming algorithm
default: 15

$verbose: flag for verbosity of output
default: undefined (variable is unset)


Notes:
* A lot of voodoo is going on in this script, please make sure you understand how the trimming algorithm works before monkeying with it
* There are several possible failure points:
  - if there is a mismatch between the number of sequences and the number of quality value entries, the script will abort with an error
  - if it can't open files for input or output, the script will die with an error
  - if it can't find a quality value entry for a specific sequence, it will die with an error
  - if it can't prepare or execute any of the SQL, it will die with an error

  
===============================================================================

06_polyA_trimming.pl -p $project  [-i $seq_in_file] [-a $qual_in_file] [-o $seq_out_file] [-q $qual_out_file] [-n $nr_repeats] [-k $polyA_to_keep]


Description:

This script tries to correct problems with sequences having polyA tails pulling in garbage after the polyA.  This problem is very noticeable in the coffee seed libraries from CGN.  It also standardizes the length of polyA retained in the trimmed sequence to whatever value $polyA_to_keep is set to (usually 20)


Command Line Parameters: 

$project: specifies the project that applies to this run
default: none (mandatory parameter)
valid values: cgn, fgn, pgn

$seq_in_file: trimmed sequence fasta file to process
default: /tmp/seqs_fasta.trim

$qual_in_file: quality values fasta file for trimmed sequences
default: /tmp/seqs_fasta.trim.qual

$seq_out_file: output file for polyA trimmed sequences
default: /tmp/seqs_fasta.polytrim

$qual_out_file: output file for polyA trimmed quality values
default: /tmp/seqs_fasta.polytrim.qual

$nr_repeats: the number of 'A' nucleotides in a row that is considered a polyA tail
default: 12

$polyA_to_keep:  the number of polyA nucleotides to keep in the trimmed sequence
default: 20


Notes:

* This script does not handle polyT from 3' sequences
* The polyA is identified on the trimmed sequence, but its start and end location is calculated on the raw sequence, hence the need to pull the raw sequence from the database
* The default value of 12 for $nr_repeats was established empirically by analysing the number of stretches of 'A' vs the length of the stretch, in 3 nucleotide increments.  Stretches of 12 nucleotides in length were both the most abundant (by a wide margin) and the vast majority of them were at the 3' end.
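
The effect of $nr_repeats and $polyA_to_keep can be illustrated with the sketch below. It is a simplified reading of the description, not the script's code: the function name trim_polya is hypothetical, the first qualifying run of 'A' is assumed to be the polyA tail, and the mapping back to raw sequence coordinates is not shown.

```python
import re


def trim_polya(seq: str, nr_repeats: int = 12, polya_to_keep: int = 20) -> str:
    """Cut everything after the polyA tail and cap the tail at polya_to_keep bases."""
    match = re.search("A{%d,}" % nr_repeats, seq.upper())
    if match is None:
        return seq  # no polyA tail found, leave the sequence alone
    keep = min(match.end() - match.start(), polya_to_keep)
    return seq[:match.start() + keep]
```

With the defaults, a sequence ending in 30 'A' followed by garbage is cut back to 20 'A', while a 14 'A' tail keeps all 14. The quality values would be truncated to the same length.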
 
 
===============================================================================

07_trimmed_seq_upload.pl -p $project  [-i $seq_in_file] [-q $qual_in_file] 


Description:

This script simply loads the trimmed sequences into the database.  All the trimming info has already been loaded by the trimming scripts.  This data is in a sense superfluous, since the trimmed sequence can be calculated from the raw sequence and trimming locations, but since all future operations (both web and pipeline) deal with trimmed sequences exclusively, having the already trimmed sequence stored improves performance.


Command Line Parameters: 

$project: specifies the project that applies to this run
default: none (mandatory parameter)
valid values: cgn, fgn, pgn

$seq_in_file: polyA trimmed sequence fasta file to process
default: /tmp/seqs_fasta.polytrim

$qual_in_file: quality values fasta file for polyA trimmed sequences
default: /tmp/seqs_fasta.polytrim.qual


Notes:

* This script will overwrite existing trimmed sequence entries for the same raw sequence

 
===============================================================================

08_contamination_screen.pl -p $project  [-i $seq_in_file] [-b $blast_result] [-c $contamination_target]


Description:

This script takes the newly loaded trimmed sequences and blasts them against the cloning organism genome (typically E. coli), looking for high stringency matches to detect contamination


Command Line Parameters: 

$project: specifies the project that applies to this run
default: none (mandatory parameter)
valid values: cgn, fgn, pgn

$seq_in_file: input file to blast containing sequences to be evaluated
default: /tmp/seq_to_eval.fasta

$blast_result: file where the blast run results will be stored
default: /tmp/seq_vs_ecoli

$contamination_target: target db for blast
default: /data/shared/pgn_data_processing/E.coli_K12/Ecoli_genome


Other Builtin Script Parameters:

$blast_script: helper script used to run blast and parse blast results
value: ./helper_scripts/batch_blast_generic.pl

$min_evalue: blast will only report matches "better" (smaller evalue) than this evalue
value: 1e-100


Notes:

* This requires the helper script batch_blast_generic.pl to run properly
* The target sequence needs to be formatted for blast
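
As an illustration of the $min_evalue cutoff, the sketch below flags query sequences that have a sufficiently strong hit. The output format of batch_blast_generic.pl is not described here, so the sketch ASSUMES standard 12-column tab-separated BLAST output with the e-value in the 11th column; the function name contaminated_ids is hypothetical.

```python
from typing import Iterable, Set


def contaminated_ids(blast_lines: Iterable[str], min_evalue: float = 1e-100) -> Set[str]:
    """Return query ids with at least one hit whose e-value is at or below the cutoff."""
    flagged = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query_id, evalue_text = fields[0], fields[10]
        # Some BLAST versions print e-values such as "e-120" without a leading digit.
        if evalue_text.startswith("e"):
            evalue_text = "1" + evalue_text
        if float(evalue_text) <= min_evalue:
            flagged.add(query_id)
    return flagged
```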

 
===============================================================================

09_trimmed_seq_quality_check.pl -p $project [-t $treshold] [-l $min_ln] [-a $max_ambiguity_pct] [-n $ambiguity_treshold] [-s $max_single_letter_pct] [-d $max_two_letters_pct] [-v $verbose]


Description:

This script does all the quality evaluation. It looks for a minimum overall quality, then a minimum sequence length, then a maximum ambiguity allowed (max percentage of 'N' nucleotides), then it checks sequence complexity (to detect certain sequencing artefacts).  Checks happen in the order described, and only sequences passing one step get evaluated in the next one.


Command Line Parameters: 

$project: specifies the project that applies to this run
default: none (mandatory parameter)
valid values: cgn, fgn, pgn

$treshold: minimum average quality value for a sequence to pass
default: 15

$min_ln: minimum sequence length (in basepairs) required to pass
default: 150

$ambiguity_treshold: the quality value below which a nucleotide is considered equivalent to 'N' for ambiguity percentage calculation
default: 5

$max_ambiguity_pct: the maximum percentage of nucleotides allowed to be 'N' in the trimmed sequence
default: 4

$max_single_letter_pct: the maximum percentage of nucleotides that are allowed to be the same letter (A,T,G,or C) for the sequence to pass
default: 60

$max_two_letters_pct: the maximum percentage of nucleotides that are allowed to be accounted for by just 2 letters (any 2 of the single letter percentages adding up to this value) for the sequence to pass
default: 80

$verbose: flag for verbosity of output
default: undefined (variable is unset)


Notes:

* All of these checks exclude any polyA tail that might exist before performing the calculations
* The script "knows" the proper quality codes for the various failure types.  If those codes change in the database, change them in this script as well or sequences will get assigned erroneous codes
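
The order of the checks can be sketched as below, using the documented defaults. This is an illustration under assumptions, not the script's code: the function name quality_check and the returned labels are hypothetical (they are not the database quality codes), sequence and quality lists are assumed to be the same length, and the polyA exclusion mentioned above is not implemented.

```python
from collections import Counter
from typing import List, Optional


def quality_check(seq: str, quals: List[int], threshold: float = 15, min_ln: int = 150,
                  ambiguity_threshold: int = 5, max_ambiguity_pct: float = 4,
                  max_single_letter_pct: float = 60,
                  max_two_letters_pct: float = 80) -> Optional[str]:
    """Return None if the sequence passes, else a label for the first failed check."""
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        return "too_short"
    # 1. minimum average quality
    if sum(quals) / length < threshold:
        return "low_quality"
    # 2. minimum length
    if length < min_ln:
        return "too_short"
    # 3. ambiguity: 'N' calls plus bases below the ambiguity quality threshold
    ambiguous = sum(1 for base, q in zip(seq, quals)
                    if base == "N" or q < ambiguity_threshold)
    if 100.0 * ambiguous / length > max_ambiguity_pct:
        return "too_ambiguous"
    # 4. complexity: share of the most common letter, then of the two most common
    letters = Counter(seq)
    counts = sorted((letters[base] for base in "ACGT"), reverse=True)
    if 100.0 * counts[0] / length > max_single_letter_pct:
        return "low_complexity_single"
    if 100.0 * (counts[0] + counts[1]) / length > max_two_letters_pct:
        return "low_complexity_pair"
    return None
```

Because each check returns immediately, a sequence is only ever labelled with the first check it fails, matching the order described in the Description.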


===============================================================================

10_lib_assembly.pl -p $project -l $lib [-o $seq_out_file] [-q $qual_out_file] [-r $phrap_results_file] [-a $align_file]

Description:

This script retrieves all the quality-passed, trimmed sequences for a given library, and assembles them into contigs.


Command Line Parameters: 

$project: specifies the project that applies to this run
default: none (mandatory parameter)
valid values: cgn, fgn, pgn

$lib: library name, specifies the library to be assembled
default: none (mandatory parameter)
valid values: depends on project

$seq_out_file: fasta file where assembly results are stored
default: /tmp/$lib.fasta

$qual_out_file: quality values for the assembly results
default: /tmp/$lib.fasta.qual

$phrap_results_file: file where the results of the phrap run will be stored
default: /tmp/${lib}_phrap.out

$align_file:  file where the contig members alignment information is stored for parsing
default: /tmp/${lib}.alignment

Notes:

* Nomenclature standardization is done for the deflines according to the phrap documentation.  This depends on the proper information being in the database, but it isn't essential for the assembly process
* The format of the alignment information file is documented in the file itself; if you change that format, please make sure to change the self-documenting part of the code as well
* The 'all' library will assemble all the sequences into one build, regardless of the library they belong to

===============================================================================

11_lib_assembly_upload.pl -p $project -l $lib [-i $contig_file] [-q $contig_qual_file] [-a $alignment_file]


Description:

This script reads the contigs generated by 10_lib_assembly.pl and their alignment information and loads the data into the database.  It first removes any existing contig info for that library before loading the new data, since these are temporary evaluation assemblies


Command Line Parameters: 

$project: specifies the project that applies to this run
default: none (mandatory parameter)
valid values: cgn, fgn, pgn

$lib: library name, specifies the library whose assembly is to be uploaded
default: none (mandatory parameter)
valid values: depends on project

$contig_file: fasta file containing the contigs to be uploaded
default: /tmp/${lib}.fasta.contigs

$contig_qual_file: quality values for the contigs to be uploaded
default: /tmp/${lib}.fasta.contigs.qual

$alignment_file: the file with the alignment info generated by the assembly script
default: /tmp/${lib}.alignment


Notes: 

* If the format of the alignment info changes in the 10_lib_assembly.pl script, please change the parsing code here accordingly or it will load in garbage