SCRIPT FLOW:

CGN:
    00_get_brc_seq.pl -o $order_number[ $order_number]
    01_create_traces.pl -p cgn
    02_basecall.pl -p cgn
    03_new_seq_upload.pl -p cgn
    04_vector_screen.pl -p cgn
    05_vector_and_qual_trimming.pl -p cgn
    06_polyA_trimming.pl -p cgn
    07_trimmed_seq_upload.pl -p cgn
    08_contamination_screen.pl -p cgn
    09_trimmed_seq_quality_check.pl -p cgn
    10_lib_assembly.pl -p cgn -l $lib_name
    11_lib_assembly_upload.pl -p cgn -l $lib_name

    Note: Repeat 10 and 11 for each library that we got new sequences for,
          and the 'all' library

FGN:
    00_get_fgp_seq.pl
    01_create_traces.pl -p fgn
    02_basecall.pl -p fgn
    03_new_seq_upload.pl -p fgn
    04_vector_screen.pl -p fgn
    05_vector_and_qual_trimming.pl -p fgn
    06_polyA_trimming.pl -p fgn
    07_trimmed_seq_upload.pl -p fgn
    08_contamination_screen.pl -p fgn
    09_trimmed_seq_quality_check.pl -p fgn
    10_lib_assembly.pl -p fgn -l $lib_name
    11_lib_assembly_upload.pl -p fgn -l $lib_name

    Note: Repeat 10 and 11 for each library that we got new sequences for,
          and the 'all' library

DETAILED DESCRIPTION OF SCRIPTS:

00_get_brc_seq.pl -o $order_number[ $order_number] [-u $username] [-p $password] [-d $destination_directory]

    Description:
        Program designed to log in and retrieve sequences from BRCO
        (Cornell's BioResource Center)
        REQUIRES libwww-perl-5.53 (version probably doesn't matter)
        Doesn't use ssl
        Takes in a list of order numbers (with the -o flag) and retrieves
        the zipped traces
    Command Line Parameters:
        $order_number: the BRC assigned order number for the plate
            eg.: 10032471
        $username: login user name for BRC
            default: cl295@cornell.edu
        $password: BRC login password for the user
            default: coffee
        $destination_directory: where to place the downloaded files
            default: /data/shared/pgn_data_processing/incoming_files/cgn/new_files/
    Other Builtin Script Parameters:
        $lastname: used to construct url for download
            value: lin
        $loginURL: address to login for the download
            value: http://www.brc.cornell.edu/user/login.php
        $zipURL: URL constructed for download
            value: dynamically
                generated after login. Base URL is
                http://www.brc.cornell.edu/user/res_zipdl.php3/
    Notes:
        * At least one order number is required; all other command line
          inputs are optional
===============================================================================
00_get_fgp_seq.pl [-i $new_dir] [-d $done_dir]

    Description:
        Program designed to log in and retrieve sequences from the PSU
        (Penn State University) LIMS system
        Logs into the PSU machine via ssh using key-based authentication to
        retrieve the list of files, and uses scp with the same authentication
        mechanism to do the actual copying
        Note: This should check for the existence of the library of the given
        data, otherwise it may fail on the first run when the load raw
        sequences script runs.
    Command Line Parameters:
        $new_dir: location for incoming data files
            default: /data/shared/pgn_data_processing/incoming_files/fgn/new_files/
        $done_dir: location of previously retrieved data. Used to build a
            list of files we already have
            default: /data/shared/pgn_data_processing/incoming_files/fgn/done_files/
    Other Builtin Script Parameters:
        $login_user: the username to log in to the PSU machine
            value: transfer
        $login_identification: the location of the identification file for
            SSH2 key based authentication
            value: /data/shared/pgn_data_processing/scripts/processing_components/data_files/fgn_psu_ssh/identification
        $psu_machine_ip: address of the machine at PSU that holds the data
            value: 146.186.29.44
===============================================================================
01_create_traces.pl -p $project [-i $in_dir] [-d $out_dir] [-m $moveto_dir]

    Description:
        Program designed to take a set of zipped tracefile folders, separate
        out the chromas, gzip them, and put them in their destination
        location for future processing
    Command Line Parameters:
        $project: specifies the project that applies to this run
            default: none (mandatory parameter)
            valid values: cgn, fgn, pgn
        $in_dir: location for the tracefile zipped folders
            default:
                /data/shared/pgn_data_processing/incoming_files/$project/new_files
        $out_dir: destination for the processed traces
            default: /data/shared/pgn_data_processing/trace_files/$project
        $moveto_dir: destination for the original zipped folders after
            processing
            default: /data/shared/pgn_data_processing/incoming_files/$project/done_files
===============================================================================
02_basecall.pl -p $project [-i $in_dir] [-o $seq_out_file] [-q $qual_out_file] [-m $move_dir] [-t $tmp_dir] [-e $phred_param_file] [-d $dye_chem]

    Description:
        Program designed to take a directory of chromas, recursively traverse
        it and process all chromas, generate sequence and quality files, and
        move the processed chromas into a storage directory
    Command Line Parameters:
        $project: specifies the project that applies to this run
            default: none (mandatory parameter)
            valid values: cgn, fgn, pgn
        $in_dir: location of tracefiles (these can be in subdirectories of
            this dir)
            default: /data/shared/pgn_data_processing/trace_files/$project
        $seq_out_file: pathname for the sequence file output
            default: /tmp/basecall_seqs_fasta
        $qual_out_file: pathname for the quality values output
            default: /tmp/basecall_seqs_fasta.qual
        $move_dir: directory to which the trace files are moved after
            basecalling
            default: /data/shared/pgn_data_processing/processed_traces/$project
        $tmp_dir: directory where temporary files for this run are stored
            default: /tmp
        $phred_param_file: environment variable that specifies the location
            of the "phredpar.dat" file
            default: /usr/local/src/phred/phredpar.dat
        $dye_chem: information on the dye used, usually the last letter of an
            ABI tracefile, eg.
            'i' or 'e'
            default: none; if no value is provided, a default is assigned on
                a per-project basis
    Other Builtin Script Parameters:
        $trace_list_file: temporary file for storing traces processed
            value: $tmp_dir/tracelist.txt
    Notes:
        * Script assumes that all the chromas are individually gzipped
          (rather than all together in a zip file). Script
          01_create_traces.pl is supposed to make sure that's the case.
        * Script has only been tested with ABI chromas. Behaviour with other
          types of trace files is undefined.
===============================================================================
03_new_seq_upload.pl -p $project [-i $seq_in_file] [-q $qual_in_file] [-t $seq_type] [-f $seq_format] [-s $sequencing_info] [-r $read_direction] [-o $other_id_type] [-s $source_info_dir]

    Description:
        Program designed to take in a fasta file and corresponding quality
        values and load them into the appropriate database (specified via
        the "$project" parameter).
    Command Line Parameters:
        $project: specifies the project that applies to this run
            default: none (mandatory parameter)
            valid values: cgn, fgn, pgn
        $seq_in_file: fasta file containing sequences to be uploaded
            default: /tmp/basecall_seqs_fasta
        $qual_in_file: quality values file for the sequences to be uploaded
            default: /tmp/basecall_seqs_fasta.qual
        $seq_type: type of sequence being uploaded (nucleotide, protein, etc.)
            default: nucleotide
        $seq_format: formatting of the input sequence
            default: FASTA
            note: Only FASTA files are currently supported
        $sequencing_info: db index of the sequencing facility information
            default: 1
            note: this value is set to 1 by default, then reset in the
                specific project branches based on branch specific
                information
        $read_direction: sequencing direction (5' or 3')
            default: 5
        $other_id_type: the kind of external identifier being uploaded along
            with the sequence
            default: clone name
        $source_info_dir: external flatfiles that hold information about
            sequencing location for different libraries
            default:
                /data/shared/pgn_data_processing/incoming_files/$project/file_source/
    Other Builtin Script Parameters:
        $use_seqinfo_on_disk: flag to read sequencing info from disk or
            assign sequencing facility info in script
            value: 0
    Notes:
        * There are a lot of assumptions regarding the format of the defline,
          so not just any fasta file will work.
        * A lot of project specific handling is done along the project
          branches
        * 3' sequence processing hasn't been tested, use at your own risk
        * There is no robust rollback if one of the database loading steps
          fails
          If you need to roll back a loading batch, use the plate removal
          script
          Plate removal script is
          /data/shared/pgn_data_processing/scripts/data_manipulation/pull_plate.pl
          (as of 2003-10-14)
        * There are several major possible failure points:
          - if there is a mismatch between the number of sequences and the
            number of quality value entries, script will abort with an error
          - if it is set to read sequencing information from disk, and can't
            open the file to read it, it will abort with an error
          - if the tracefile already exists in the database, it will skip
            that sequence; note that this can be a problem if a previous
            load attempt aborted after loading the tracefile info, so make
            sure you run the plate removal script before attempting to
            reload
          - if it can't figure out the library name from the defline, it
            will print a message and skip that sequence
          - if this is fgn data and it can't find the source info, it will
            skip those sequences
          - if it can't prepare or execute an SQL statement, it will die
            with an error
===============================================================================
04_vector_screen.pl -p $project [-i $seq_in_file] [-a $qual_in_file] [-o $seq_out_file] [-q $qual_out_file] [-f $vector_file] [-c $cross_match] [-s $screening_output] [-v $verbose]

    Description:
        This script simply runs crossmatch to identify vector sequences and
        masks them in a temporary file
        It does not load anything into the database
    Command Line Parameters:
        $project:
            specifies the project that applies to this run
            default: none (mandatory parameter)
            valid values: cgn, fgn, pgn
        $seq_in_file: sequence fasta file created from database as input to
            crossmatch
            default: /tmp/seqs_fasta
        $qual_in_file: quality values fasta file created from database as
            input to crossmatch
            default: /tmp/seqs_fasta.qual
        $seq_out_file: sequence fasta file created by crossmatch containing
            the masked vector regions
            default: /tmp/seqs_fasta.screen
        $qual_out_file: quality values file for the screened sequences,
            should be a copy of the $qual_in_file
            default: /tmp/seqs_fasta.screen.qual
        $vector_file: vector sequence file for crossmatch
            default: ./data_files/vector.seq
        $cross_match: location of the cross_match binary
            default: /usr/local/bin/cross_match
        $screening_output: file to which the standard output of crossmatch
            gets redirected
            default: /tmp/screen.out
        $verbose: flag for verbosity of output
            default: undefined (variable is unset)
    Other Builtin Script Parameters:
        $minmatch: minimum word length value for crossmatch; see crossmatch
            docs for further info
            value: 12
        $minscore: minimum score value for crossmatch; see crossmatch docs
            for further info
            value: 20
    Notes:
        * Script will print a warning if quality values are missing for a
          sequence, but will still create an empty entry for that sequence
===============================================================================
05_vector_and_qual_trimming.pl -p $project [-i $seq_in_file] [-a $qual_in_file] [-o $seq_out_file] [-q $qual_out_file] [-t $treshold_qual] [-v $verbose]

    Description:
        This script does all the heavy lifting when it comes to sequence
        trimming. It will parse out the sequences into vector-flanked chunks,
        look for adaptor sequence, quality-trim the chunks using our custom
        algorithm, and select the best chunk as _the_ trimmed sequence for
        this sequence.
        It also loads into the database all the trimming info, such as start
        and stop of the vector region, start and stop of the adaptor region,
        and start and stop of the quality trimming region.
    Command Line Parameters:
        $project: specifies the project that applies to this run
            default: none (mandatory parameter)
            valid values: cgn, fgn, pgn
        $seq_in_file: vector masked sequence fasta file to process
            default: /tmp/seqs_fasta.screen
        $qual_in_file: quality values fasta file for the vector masked
            sequences
            default: /tmp/seqs_fasta.screen.qual
        $seq_out_file: output file for trimmed sequences
            default: /tmp/seqs_fasta.trim
        $qual_out_file: output file for trimmed quality values
            default: /tmp/seqs_fasta.trim.qual
        $treshold_qual: the quality value that will be set as the floor for
            the quality trimming algorithm
            default: 15
        $verbose: flag for verbosity of output
            default: undefined (variable is unset)
    Notes:
        * A lot of voodoo is going on in this script; please make sure you
          understand how the trimming algorithm works before monkeying with
          it
        * There are several possible failure points:
          - if there is a mismatch between the number of sequences and the
            number of quality value entries, script will abort with an error
          - if it can't open files for input or output, script will die
            with an error
          - if it can't find a quality value entry for a specific sequence,
            it will die with an error
          - if it can't prepare or execute any of the SQL, it will die with
            an error
===============================================================================
06_polyA_trimming.pl -p $project [-i $seq_in_file] [-a $qual_in_file] [-o $seq_out_file] [-q $qual_out_file] [-n $nr_repeats] [-k $polyA_to_keep]

    Description:
        This script tries to correct problems with sequences having polyA
        tails pulling in garbage after the polyA. This problem is very
        noticeable in the coffee seed libraries from CGN.
        It also standardizes the length of polyA retained in the trimmed
        sequence to whatever value $polyA_to_keep is set to (usually 20)
    Command Line Parameters:
        $project: specifies the project that applies to this run
            default: none (mandatory parameter)
            valid values: cgn, fgn, pgn
        $seq_in_file: trimmed sequence fasta file to process
            default: /tmp/seqs_fasta.trim
        $qual_in_file: quality values fasta file for trimmed sequences
            default: /tmp/seqs_fasta.trim.qual
        $seq_out_file: output file for polyA trimmed sequences
            default: /tmp/seqs_fasta.polytrim
        $qual_out_file: output file for polyA trimmed quality values
            default: /tmp/seqs_fasta.polytrim.qual
        $nr_repeats: the number of 'A' nucleotides in a row that is
            considered a polyA tail
            default: 12
        $polyA_to_keep: the number of polyA nucleotides to keep in the
            trimmed sequence
            default: 20
    Notes:
        * This script does not handle polyT from 3' sequences
        * The polyA is identified on the trimmed sequence, but its start and
          end location is calculated on the raw sequence, hence the need to
          pull the raw sequence from the database
        * The default value of 12 for $nr_repeats was established empirically
          by analyzing the number of stretches of 'A' vs the length of the
          stretch, in 3 nucleotide increments. Stretches 12 nucleotides in
          length were the most abundant (by a wide margin), and the vast
          majority of them were at the 3' end.
===============================================================================
07_trimmed_seq_upload.pl -p $project [-i $seq_in_file] [-q $qual_in_file]

    Description:
        This script simply loads the trimmed sequences into the database. All
        the trimming info has already been loaded by the trimming scripts.
        This data is in a sense superfluous, since the trimmed sequence can
        be calculated from the raw sequence and trimming locations, but since
        all future operations (both web and pipeline) deal with trimmed
        sequences exclusively, having the already trimmed sequence stored
        improves performance.
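The remark that the trimmed sequence can be recomputed from the raw sequence and the trimming locations can be illustrated with a small shell sketch. This is not code from the pipeline: the function name trim_seq is hypothetical, and the sketch assumes the stored start and stop positions are 1-based and inclusive, which this document does not state.

```shell
# trim_seq RAW START STOP
# Prints the part of RAW between START and STOP.
# ASSUMPTION: START and STOP are 1-based, inclusive coordinates.
trim_seq() {
    printf '%s\n' "$1" | cut -c "$2-$3"
}
```

For example, `trim_seq ACGTACGTAC 3 6` prints `GTAC`. Storing the trimmed sequence trades a little disk space for not repeating this lookup on every read.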
    Command Line Parameters:
        $project: specifies the project that applies to this run
            default: none (mandatory parameter)
            valid values: cgn, fgn, pgn
        $seq_in_file: polyA trimmed sequence fasta file to process
            default: /tmp/seqs_fasta.polytrim
        $qual_in_file: quality values fasta file for polyA trimmed sequences
            default: /tmp/seqs_fasta.polytrim.qual
    Notes:
        * This script will overwrite existing trimmed sequence entries for
          the same raw sequence
===============================================================================
08_contamination_screen.pl -p $project [-i $seq_in_file] [-b $blast_result] [-c $contamination_target]

    Description:
        This script takes the newly loaded trimmed sequences and blasts them
        against the cloning organism genome (typically E. coli), looking for
        high stringency matches to detect contamination
    Command Line Parameters:
        $project: specifies the project that applies to this run
            default: none (mandatory parameter)
            valid values: cgn, fgn, pgn
        $seq_in_file: input file to blast containing sequences to be
            evaluated
            default: /tmp/seq_to_eval.fasta
        $blast_result: file where the blast run results will be stored
            default: /tmp/seq_vs_ecoli
        $contamination_target: target db for blast
            default: /data/shared/pgn_data_processing/E.coli_K12/Ecoli_genome
    Other Builtin Script Parameters:
        $blast_script: helper script used to run blast and parse blast
            results
            value: ./helper_scripts/batch_blast_generic.pl
        $min_evalue: blast will only report matches "better" (smaller
            evalue) than this evalue
            value: 1e-100
    Notes:
        * This requires the helper script batch_blast_generic.pl to run
          properly
        * The target sequence needs to be formatted for blast
===============================================================================
09_trimmed_seq_quality_check.pl -p $project [-t $treshold] [-l $min_ln] [-a $max_ambiguity_pct] [-n $ambiguity_treshold] [-s $max_single_letter_pct] [-d $max_two_letters_pct] [-v $verbose]

    Description:
        This script does all the quality evaluation.
        It looks for a minimum overall quality, then a minimum sequence
        length, then a maximum ambiguity allowed (max percentage of 'N'
        nucleotides), then it checks sequence complexity (to detect certain
        sequencing artefacts). Checks happen in the order described, and only
        sequences passing one step get evaluated in the next one.
    Command Line Parameters:
        $project: specifies the project that applies to this run
            default: none (mandatory parameter)
            valid values: cgn, fgn, pgn
        $treshold: minimum average quality value for a sequence to pass
            default: 15
        $min_ln: minimum sequence length (in basepairs) required to pass
            default: 150
        $ambiguity_treshold: the quality value below which a nucleotide is
            considered equivalent to 'N' for ambiguity percentage calculation
            default: 5
        $max_ambiguity_pct: the maximum percentage of nucleotides allowed to
            be 'N' in the trimmed sequence
            default: 4
        $max_single_letter_pct: the maximum percentage of nucleotides that
            are allowed to be the same letter (A, T, G, or C) for the
            sequence to pass
            default: 60
        $max_two_letters_pct: the maximum percentage of nucleotides that are
            allowed to be accounted for by just 2 letters (any 2 of the
            single letter percentages adding up to this value) for the
            sequence to pass
            default: 80
        $verbose: flag for verbosity of output
            default: undefined (variable is unset)
    Notes:
        * All of these checks exclude any polyA tail that might exist before
          performing the calculations
        * The script "knows" the proper quality codes for the various failure
          types. If those codes change in the database, change them in this
          script as well or sequences will get assigned erroneous codes
===============================================================================
10_lib_assembly.pl -p $project -l $lib [-o $seq_out_file] [-q $qual_out_file] [-r $phrap_results_file] [-a $align_file]

    Description:
        This script retrieves all the quality-passed, trimmed sequences for a
        given library, and assembles them into contigs.
    Command Line Parameters:
        $project: specifies the project that applies to this run
            default: none (mandatory parameter)
            valid values: cgn, fgn, pgn
        $lib: library name, specifies the library to be assembled
            default: none (mandatory parameter)
            valid values: depends on project
        $seq_out_file: fasta file where assembly results are stored
            default: /tmp/$lib.fasta
        $qual_out_file: quality values for the assembly results
            default: /tmp/$lib.fasta.qual
        $phrap_results_file: file where the results of the phrap run will be
            stored
            default: /tmp/${lib}_phrap.out
        $align_file: file where the contig members alignment information is
            stored for parsing
            default: /tmp/${lib}.alignment
    Notes:
        * Nomenclature standardization is done for the deflines according to
          the phrap documentation. This depends on the proper information
          being in the database, but it isn't essential for the assembly
          process
        * The format of the alignment information file is documented in the
          file itself; if you change that format, please make sure to change
          the self-documenting part of the code as well
        * The 'all' library will assemble all the sequences into one build,
          regardless of the library they belong to
===============================================================================
11_lib_assembly_upload.pl -p $project -l $lib [-i $contig_file] [-q $contig_qual_file] [-a $alignment_file]

    Description:
        This script reads the contigs generated by 10_lib_assembly.pl and
        their alignment information and loads the data into the database.
        It first removes any existing contig info for that library before
        loading the new data, since these are temporary evaluation assemblies
    Command Line Parameters:
        $project: specifies the project that applies to this run
            default: none (mandatory parameter)
            valid values: cgn, fgn, pgn
        $lib: library name, specifies the library to be assembled
            default: none (mandatory parameter)
            valid values: depends on project
        $contig_file: fasta file containing the contigs to be uploaded
            default: /tmp/${lib}.fasta.contigs
        $contig_qual_file: quality values for the contigs to be uploaded
            default: /tmp/${lib}.fasta.contigs.qual
        $alignment_file: the file with the alignment info generated by the
            assembly script
            default: /tmp/${lib}.alignment
    Notes:
        * If the format of the alignment info changes in the
          10_lib_assembly.pl script, please change the parsing code here
          accordingly or it will load in garbage
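===============================================================================
EXAMPLE DRIVER (SKETCH):

The script flow at the top of this document, including the note about repeating steps 10 and 11 per library, could be wrapped in a small shell driver along the following lines. This is a sketch and not part of the pipeline: the function names run_step, run_common_steps and run_assemblies and the DRY_RUN variable are hypothetical, and it assumes the numbered scripts can be found on the PATH. Step 00 is left out because it differs by project (00_get_brc_seq.pl needs order numbers, 00_get_fgp_seq.pl does not).

```shell
#!/bin/sh
# Sketch of a driver for steps 01-11. ASSUMPTIONS (not from the pipeline):
# the numbered scripts are on the PATH, and DRY_RUN=1 means "print the
# command instead of running it".

# run_step COMMAND [ARGS...]: run (or just print) one pipeline step.
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@" || { echo "step failed: $*" >&2; return 1; }
    fi
}

# run_common_steps PROJECT: steps 01 through 09, stopping at first failure.
run_common_steps() {
    project=$1
    for script in 01_create_traces.pl 02_basecall.pl 03_new_seq_upload.pl \
                  04_vector_screen.pl 05_vector_and_qual_trimming.pl \
                  06_polyA_trimming.pl 07_trimmed_seq_upload.pl \
                  08_contamination_screen.pl 09_trimmed_seq_quality_check.pl
    do
        run_step "$script" -p "$project" || return 1
    done
}

# run_assemblies PROJECT LIB [LIB...]: steps 10 and 11 for each library
# given, then once more for the 'all' library.
run_assemblies() {
    project=$1
    shift
    for lib in "$@" all; do
        run_step 10_lib_assembly.pl -p "$project" -l "$lib" || return 1
        run_step 11_lib_assembly_upload.pl -p "$project" -l "$lib" || return 1
    done
}
```

With DRY_RUN=1 set, calling `run_common_steps cgn` followed by `run_assemblies cgn $lib_name` only prints the commands in order, which is a cheap way to check the sequence before touching the database. Stopping at the first failing step matters here because, as noted for 03_new_seq_upload.pl, there is no robust rollback for a partially loaded batch.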