Tomato Genome Sequencing Consortium Workshop, 18th September 2004, Wageningen, Netherlands

Lukas Mueller – US Project
Funded by NSF as a 2-year project starting Sept 2004?
Aims:
• 400,000 BAC end sequences, plus 2,400 reads from a sheared library to get an unbiased view of the genome
• 20 full BACs
• FISH analysis of selected BACs
• Bioinformatics hub and data archive for the project
• Summary data for the whole project
• Links to other groups
Sequencing outsourced to companies – SeqWright, Houston (ABI 3730s). The 20 full BACs assembled with phred/phrap.
Bioinformatics involves:
• clustering of BAC ends and sheared sequences using the SGN pipeline
• analysis of gene coding regions, repeats and BAC annotation using the SGN pipeline
The 20 BACs are chosen to cover the euchromatin–heterochromatin boundaries. NB chromosome 5 will be sequenced by India and chromosome 8 by Japan.

Ralph van Berloo / Roeland van Ham (Netherlands)
In the context of the Centre for BioSystems Genomics (CBSG): Wageningen University and Plant Research International.
Tomato goal = 200 BACs. Current status: 6 BACs completed, 11 underway; also physical mapping. Will sequence to 8x coverage (phase 2), finishing on euchromatin to a sequencing error rate of 1 in 10,000. Have experienced some problems with the early mapping, which took longer than expected.
Potato project: 300 BACs; 11 completed, 5 ongoing.
Goals of CBSG:
• systems biology approach
• scan whole genome for trait markers
Contribution to SOL: sequencing, automated genome assembly, annotation and comparative genomics. Already running: developing the automated assembly pipeline TOPAS for tomato and potato, plus an auto-finishing tool. Also a generic annotation pipeline: annotation (BLAST etc.) of contigs, annotation database. Heading towards comparative genomics of tomato/potato.
Timelines: depends on reality – the physical mapping is a big challenge. Positioning of the BACs is now better, and this will affect the original strategy. First idea was to sequence the tip of chromosome 6, where the resistance genes are thought to be; now looking at the whole euchromatin. Using AFLP/FISH to confirm the BACs. The AFLP map is not as precise as expected – especially around the centromere. Using the same BAC library as everyone else. Sequence is not released yet – internal to CBSG.
Lukas – sequence release should be a topic for later discussion. Expect to use the Bermuda guidelines, but could perhaps extend this to, say, a month. Fast release is good if everyone does it, but this may be an issue for contributing companies. Final intent is fully public release.

Heiko Schoof – MIPS Plant Group
Arabidopsis genome project, bioinformatics centre for ESSA, currently maize and Medicago genomes. Annotation, genome databases, data integration, MGSP – whole-genome analysis and comparative analysis, etc. UrMeLDB – European Medicago and legume database. Database federation is not a technology problem but a human one (i.e. you have to want to do it!).
MIPS/SOL: an important goal is to minimise duplication of effort. Offer infrastructure, resources, genome databases, compute, web services. Use the INRA Toulouse / VIB Ghent Eugene automated analysis pipeline. Other focus – integration, PlaNet/BioMoby technology. The largest problem in data integration is getting people to use the same naming schemes. They use BioMoby to move data to/from Toulouse for gene prediction with Eugene – send data (or data diffs) and receive back predictions in TIGR XML format. Currently comparing gene predictions from Eugene and FGenesH (softberry.com) for Medicago. In Medicago all annotation is BAC-based, i.e. do the best annotation you can on a BAC and then that's it.
Lukas – an alternative model would be to run a first-pass annotation BAC by BAC, then a full-pass annotation 'finish' on the whole chromosome.
Heiko – actually in Medicago annotation is not fixed; every release results in a rerun. So in Medicago, is fragmentation a problem? Presumably this leads to everyone trying to rerun the same analyses. Ideally you would have a federated project to stop this. Many people can provide interfaces, but want the standards to be there so there is a standard code of behaviour. Interfaces should be standard, with reference data and individuals' data in a standard non-proprietary format.

Stephane Rombauts – Ghent, Belgium
• Structural gene prediction
• Genome structural analysis
• Provide the EU-SOL community with gene models as soon as ready
• Provide annotation tools to users
Structural annotation: build training sets (pipeline to do this); build a tomato/Solanaceae-specific Eugene platform; identify and annotate transposable elements, as those interfere with gene prediction; predict non-coding RNAs – tRNAs, rRNAs, microRNAs. Analysis of whole-genome structure using i-ADHoRe, which provides an overview of block duplications across the available plant genomes. Use GBrowse and share through LDAS and BioMoby.
They recently finished the Poplar annotation – 3 labs collaborated. All did whole-genome annotation, then put all 3 together and selected the best gene models using the comparisons. All gene models were shared. Made large use of conference calls, no face-to-face meetings. Their evaluations suggest Eugene is as good as FGenesH for Poplar. Problems suggest that some transposable elements were not masked and came through into gene predictions as short genes. N.B. Eugene doesn't predict genes across gaps, whereas FGenesH can. Eugene is very fast: 10 MB in ~15 mins on a single processor (~2 GB RAM) once the BLAST processing is done.

David de Koeyer – Canadian Potato Project
Project in its second year at present. Research areas include EST sequence generation (18 libraries, 100,000 seqs.), microarray generation, potato mutant population generation, activation-tagged lines (10,000 lines). Are a federal govt.
agency and hence have to use a secure intranet across the whole of Canada. Produce a set of tools for users. Use Perl, Apache, HTML, PHP, JavaScript, MySQL etc. Use Canadian bioinformatics resources incl. an enhanced BLAST cluster – pipeline using Paracel. Annotate with Pfam, Prosite and Tmap, as well as Sigcleave and the other usual suspects. Particular bioinformatics interests include expression analysis.

Luisa Chiusano – Italy
Funded by the Italian Ministry of Agriculture; title is Agro-nanotech. Given €560,000. Start April 2004, for 2 years. Based at the University of Naples. Chromosome 12: 12 seed BACs. Projected to use 113 BACs to cover 11 MB? of euchromatin. Status: seed BACs are being selected from a list of 250 from Cornell. Analyses expected include:
• structural/functional annotation
• computational prediction
• conflict resolution/manual curation
• comparative genomics and phylogenetics
• ID and characterisation of genes
• identification of markers
• SNPs – multiple resistant varieties vs. wild-type
• microarray design

Farid Regad – France
Re-sequencing of TOM1 and other unigene clones (INRA) to improve microarray data. Also INRA-funded sequencing of chromosome 7 (€500,000 in 2005). Using the 'Genopole' bioinformatics platform in Toulouse for assembly and annotation (as used by MIPS). Timetable: at least 150 BACs at HTG phase 2 expected, but timing unknown at this stage. NB the number of BACs expected for coverage is calculated with a 30% overlap at present – worst-case scenario. Sequencing is under discussion – may go to Genoscope or may be outsourced to a commercial company.

MOBY and Lessons Learnt from the Medicago Project – Heiko Schoof, MIPS
Open source project started 2001, initiated by Mark Wilkinson in Canada. Federated and interconnected databases; web service toolkit (API, registry, ontologies); uses SOAP, WSDL, XML. Uses the concept of biological data service providers and data hosts. MOBY-S is MOBY services, whereas S-MOBY is semantic MOBY – not covered here.
The whole system is heavily ontology-based:
• ontology-based data representation (syntax)
• ontology-based web service description system (semantics)
• ontology-aware web services registry (discovery)
i.e. ontologies have to be agreed upon by everyone before you can use anything, e.g. does the 'gene' object contain/describe UTRs or not?
The layered system uses:
• MOBY Central registry
• MOBY server (any service provider)
• MOBY objects (XML)
• MOBY client
N.B. MOBY Central is in Canada (mirrored at MIPS?).
MOBY clients available include: GBrowse (Canada), GBrowse (PlaNet), Taverna, PlaNet client, PlaNet partner client. But not yet very user-friendly.
What do you get at present?
• toolkit to set up services incl. Perl and Java clients (IBM are working on VBS and C++ clients, also a user interface called Bioferret)
• central registry
• automatic integration of services into the network
What you don't get yet:
• no clever or mature clients (but Taverna is getting there)
• no integration tools for consistency checks
• you have to integrate your own data and map it to the standard agreed objects – nothing does this for you – but this enables a consistent view once you integrate
The Medicago project and BioPlaNet use BioMoby services. The rice genome project are looking at it, as are the GO consortium. N.B. Martin Senger is the EBI BioMoby contact, Mike Niemi at IBM. Michael Gribskov is also listed as interested, as is Luke McCarthy, and FlyBase (Chris Mungall, Michael Ashburner).
Provenance: all objects have a field to cover where the data come from. The data provider can add metadata to cover the history of the data, but the format of this is not standardised and not obligatory. Wherever possible, standard formats e.g. MAGE-ML are used for markup. The Nottingham PlaNet group are currently working on decomposing MAGE-ML into component objects – more flexibility. Lincoln Stein is looking/has recently looked at the possibility of combining DAS and BioMoby – but probably not likely to happen.
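The registration/discovery idea behind the layered system above can be sketched generically. This is a toy illustration in the spirit of MOBY Central, not the real BioMoby API; the service name, object types and endpoint URL are all invented for illustration.

```python
# Toy sketch of ontology-typed service registration and discovery,
# in the spirit of MOBY Central as described above. NOT the real
# BioMoby API; all names here are invented for illustration.

registry = {}   # service name -> (input type, output type, endpoint)

# A toy object-type ontology: child type -> parent type.
ontology = {"AnnotatedSequence": "Sequence", "Sequence": "Object"}

def is_a(child, parent):
    """True if `child` equals `parent` or inherits from it."""
    while child is not None:
        if child == parent:
            return True
        child = ontology.get(child)
    return False

def register(name, input_type, output_type, endpoint):
    """A provider declares its service against agreed object types."""
    registry[name] = (input_type, output_type, endpoint)

def discover(have_type):
    """Find services whose declared input type accepts what we hold."""
    return [name for name, (inp, _out, _ep) in registry.items()
            if is_a(have_type, inp)]

register("predictGenes", "Sequence", "AnnotatedSequence",
         "http://example.org/eugene")          # hypothetical endpoint
print(discover("AnnotatedSequence"))           # ['predictGenes']
```

The point the notes make holds even in this sketch: nothing works until every party agrees on the same object-type ontology, because discovery matches purely on those shared type names.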
There are bottlenecks here until good clients can use BioMoby instead of DAS. Also, it is relatively quick to build a BioMoby client that hard-codes specific interactions. MIPS have started to use GBrowse as the base of a plugin architecture to display new types of objects, but this is time-consuming to code.
Lessons learned from Medicago:
Use federated databases rather than a central service, i.e. with analysis, data storage and maintenance at partners' own sites, but central integration and display, i.e. web services. Data integration is a human problem, not a technology problem, at present (we can do it, but do we want to?):
• reluctance to make data public
• degree of access – e.g. will put on a web page but won't allow free download
• need good practice in acknowledging integrated work
• different concepts/vocabularies
PlaNet overview: data synchronised and mirrored using BioMoby or another exchange format. N.B. Nottingham are using Ensembl to display Arabidopsis data. Some performance issues, particularly when trying to directly integrate multiple databases in a single query. Need local caches to store the multiple data sets being fed into one view – can use BioMoby to populate the caches. Have to convince people there is added value in sharing. This is good for users, but also means more access for providers too, i.e. a higher-profile, more-used project.
Medicago – see www.urmeldb.net:
• neutral web views not branded by the hoster – not strictly part of their own services, so it doesn't look the same
• careful acknowledgements and credits
• standard formats and operating procedures
• clear definition of tasks and dataflow – to limit duplication of effort and re-runs – everyone gets their own niche
• primary data sources for some types of analyses are globally set
• sufficiently simple to be successful – don't use complex technology for the sake of it
Lukas – hard to ask complex queries of distributed resources through BioMoby – easier to use canned SQL queries.
This means it is still good to have a good central body of info in one place. For this reason PlaNet are looking at MART for central, optimised, fast queries – i.e. BioMoby is not perfect for everything. If you use common agreed coordinates, you can stack and view all alternative feature views at the same time. This is viewed as useful for comparing and evaluating annotations/analyses.
MaizeGDB approach: uses community annotation. An individual can register as an 'expert' curator and then submit their 'best' gene model. Sounds a bit like the VEGA approach. Lukas asks for a strategy to do something like this for tomato towards the end of the project – with or without a jamboree.

Discussion
Lukas – looking at the Solanaceae draft – funding agencies like such documents, and this draft currently contains some fundamental points for discussion.
Chromosome assemblies – defines a data format for assemblies, including information on junctions. Heiko – Arabidopsis have decided that the TIGR format for keeping overlap info is not sufficient. Need to archive the evidence of assembly, i.e. not just sequence but sequence overlap and other info, e.g. whether PCR walking was used etc. An alternative format to store this info is ongoing at MIPS.
Original output from analyses such as BLAST – keep or not keep? Both Ghent and MIPS keep this for the last run, i.e. the total native outfile, not just parsed output. This is not necessarily backed up, however. Ghent use this to check problems in parsing, and also go back to it if individual queries are made on particular exons. Other groups queried whether it was strictly necessary to keep the native outfiles.
Sequence quality: make 'reasonable' efforts to close gaps, where 'reasonable' needs to be defined and framed for all groups in terms of cost. Overlaps – shoot for 5–10 kb; shorter is allowed if additional info is available, e.g. PCR, but standard markup to convey the evidence is needed.
Standardised BAC names: Lukas wants this, but Heiko sees problems ahead.
Have to ensure that everyone uses the same IDs for the same feature – this is not easy. Also, need individual ID parts for the places doing the work, so you can see who did what.
Ghent did this sort of thing – 4-digit chromosome ID, contig number, 4 digits for the gene incremented by one, and the name of the method, e.g. GRAIL, Eugene. Lukas – maybe use a prefix to show who ran the annotation. Medicago uses institution, thing, and evidence for the prediction of the gene model. They don't cover who did what in the ID itself, but add it in the GAME XML. General consensus – probably not sufficient to use Arabidopsis-like ID/contig names. Important to track who did what, evidence used, etc.
Important comments from Arabidopsis: BACs are immutable objects, and the sequence has to correspond to the BAC sequence as closely as possible, i.e. sequence should not be added or removed to make a gene annotation possible. Heiko – is this an oversimplification? Have to document what you did for the annotation, i.e. if you had to collapse overlaps.
Lukas – need chromosome-based names, e.g. SL02g0001, where:
SL = Solanum lycopersicum
02 = number of chromosome
g = gene (or other, e.g. t = transposable element, r = repeat, ? = RNA gene)
0001 = number – but how many zeros to add to make sure you have enough number space for the whole project?
Ghent – how to code RNA genes in a name? N suggested.
Heiko – the gene classification w.r.t. Medicago needs updating. Types of evidence for gene predictions are encoded in the name, ranging from ab initio through to full-length agreement of the transcript with a known gene, e.g.:
F = full gene model supported
E = EST partial match to full-length sequence
H = some homology to a known EST/gene
I = everything else
Note there is a separate level of encoded evidence for gene functional prediction.
Ghent – need some way to capture expert manual curation where a gene is hand-corrected, so that the next round of annotation (e.g. training sets) captures these.
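The proposed chromosome-based naming scheme can be sketched as code. The field widths, the regex, and the restriction to g/t/r feature letters are illustrative assumptions (the RNA-gene letter was still under discussion), not an agreed standard.

```python
import re

# Illustrative sketch of the proposed naming scheme (SL02g0001):
# species prefix, 2-digit chromosome, feature-type letter, serial
# number. Field widths are assumptions; the RNA-gene letter ('?'
# vs. 'N') was still being debated, so only g/t/r are shown.
ID_PATTERN = re.compile(
    r"^(?P<species>[A-Z]{2})"      # e.g. SL = Solanum lycopersicum
    r"(?P<chromosome>\d{2})"       # chromosome number, zero-padded
    r"(?P<feature>[gtr])"          # g = gene, t = transposable element, r = repeat
    r"(?P<serial>\d{4,})$"         # serial; the width sets the number space
)

def make_id(species, chromosome, feature, serial, width=4):
    """Build an identifier like SL02g0001."""
    return f"{species}{chromosome:02d}{feature}{serial:0{width}d}"

def parse_id(identifier):
    """Split an identifier back into its fields, or raise ValueError."""
    m = ID_PATTERN.match(identifier)
    if m is None:
        raise ValueError(f"not a valid feature ID: {identifier!r}")
    d = m.groupdict()
    d["chromosome"] = int(d["chromosome"])
    d["serial"] = int(d["serial"])
    return d

print(make_id("SL", 2, "g", 1))      # SL02g0001
print(parse_id("SL02g0001"))
```

The `width` parameter makes the open question above explicit: widening the zero-padding later would change every existing ID, so the number space has to be chosen once, up front, for the whole project.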
Italy – would it be useful to have a jamboree for the manual expert curation? E.g. bringing species-specific annotators and gene-family-specific curators together. Ghent – if you put the data up well in advance on the web and then invite experts to annotate via the web, you would get the same effect and better value for money. Jamborees are limited by space/time/money constraints. Again, the idea came up to set up an 'experts' list – even before worrying about a jamboree. This would help to contact people and get them to look at the data. A disadvantage of web-only expert curation is that people don't want to cut and paste thousands of times; there is also no direct contact to actually get people to do things.
Standardised file formats are needed. Heiko – GAME XML is not always good enough at present, but we have to start somewhere for annotations.
Question – who will use the TIGR assembler or other non-Phrap assemblers? At least one group (Wageningen) won't use Phrap, so how to correctly describe the assembly so that everyone has a compatible method (e.g. Phrap .ace files)? Some groups have a specific need to use a particular assembler, e.g. Wageningen have an existing assembly pipeline used for multiple purposes and so don't want to change just for this project.
BAC registration: want to implement this at Cornell, e.g. status of BACs, who is doing what. So how to update this? No interface is currently defined. Medicago currently use Excel spreadsheets via email. Heiko – the most important part is to define a central person who will update the data, and how often it will get updated. This is very important for error spotting, e.g. mapping problems. Could be done via a web platform or email; software to be decided upon.

Data archiving, data formats, data access – chair Heiko
Heiko proposes a code of good practice:
1. archive primary data
2. document what is used, how it is used, evidence given etc.
3.
get people working on common data together to make sure that they have common data exchange format(s).
Suggest Lukas keeps a master list of who is working on what, i.e. working groups or task forces for each type of data. SGN wants to have a central FTP archive of everything. What is the best way of doing this? This will depend on the local data provider, i.e. to be negotiated, e.g. SGN pulls, or mirrors, upload, download, SMTP etc. Could be done as a BioMoby query for tracefile upload, updates query etc. Credit/traceability – keep the original source of information even when it goes through multiple hands, not just the last person to add info.

Genome annotation – chair Stephane Rombauts
There should be only one place to get the latest version, i.e. the latest BAC and transcripts, on a per-BAC basis. But who will maintain it? One centre could maintain all by FTP. A timestamp is important to track annotations. Need to keep previous releases too. Must have some kind of history tracking. Heiko – history tracking is already done for Medicago. Need one central maintainer; could then use an FTP mirror etc. to update everyone else. Could this be at Cornell? Also try to synchronise annotation groups so they all work on the same sequence version. Not so much of a problem when groups all share serial processes in the same pipeline, i.e. this forces annotation of the same core of sequence. CBSG state that they use a non-standard annotation pipeline but want to look at making it more 'standard'; they have to run a pipeline anyway, as they are annotating potato. For several groups, assembly and annotation pipelines are continuous and cannot be easily split. Heiko – funding agencies may or may not be convinced by distributed annotation/compute efforts, and we have to keep them happy and show good value for money. How to assess quality? Difficult to do when everyone uses something different. Italy – interested in using GRID technology to share compute.
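The per-BAC release tracking discussed above (timestamps, kept previous releases, credit/traceability) can be sketched minimally. The record fields, the BAC name and the annotator labels are illustrative assumptions, not an agreed format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Minimal sketch of per-BAC release history as discussed: each new
# annotation release is timestamped and earlier releases are kept,
# so a BAC's annotation history can be reconstructed and the
# original source of each release remains on record. The fields
# are illustrative assumptions, not an agreed exchange format.

@dataclass
class Release:
    version: int
    sequence_md5: str          # which sequence version was annotated
    annotator: str             # who produced this release (credit)
    timestamp: str             # UTC, ISO 8601

@dataclass
class BacRecord:
    bac_name: str
    releases: list = field(default_factory=list)

    def add_release(self, sequence_md5, annotator):
        """Append a new timestamped release; never overwrite old ones."""
        version = len(self.releases) + 1
        self.releases.append(Release(
            version=version,
            sequence_md5=sequence_md5,
            annotator=annotator,
            timestamp=datetime.now(timezone.utc).isoformat(),
        ))
        return version

    def latest(self):
        """The single authoritative 'latest version' per BAC."""
        return self.releases[-1] if self.releases else None

bac = BacRecord("LE_HBa0123A01")   # hypothetical BAC name
bac.add_release("d41d8c...", "MIPS")
bac.add_release("d41d8c...", "Ghent")
print(bac.latest().version)        # 2
```

Keeping the full `releases` list rather than only the head is the point: it gives the history tracking and the one-place-for-the-latest-version behaviour at the same time, whoever ends up hosting the central copy.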
Heiko – but if we share the same pipeline, then that also means sharing the code, and not everyone wants to/can do this. But if someone is running something that is not generally available or shared, then can it be used as a 'service'? E.g. Eugene, as it is not open source. Lukas – using several different methods/platforms is good – you get to see different ideas. Some don't want to use any non-open-source code at all – say it makes checking results too difficult. The general thought is that good tools (e.g. Eugene) should be used, but the settings and configuration used should be made available for verification, consistency checking etc.
Heiko – if we use Taverna or equivalent, then the pluggable pipeline can be re-used by all. His paradigm appears to be that everyone does their own annotation but uses distributed community services where appropriate. Italy raise the point that they want to run their own analyses, as they want to build up expertise locally to be able to do this – they argue that you don't learn the same if you don't do it yourself. But problem – the whole annotation has to run through the same pipeline at the end of the project. Need a united appearance that doesn't confuse users. How to provide a standard for annotation if everyone uses different pipelines? Could perhaps be helped by centrally providing a set of training data for everyone's programs.
Practical question – how many groups have to explicitly (according to their grant) annotate themselves?
UK – no, will be collaborative.
France – not explicitly, but will have to use some in-house tools.
Ghent – only doing structural annotation.

Publication
Every project owns its own chromosome. If you can manage to get a paper solely on your chromosome, great, but this is unlikely to be possible for all/any of them. Whole-genome publication authored by the SOL consortium (names as appendix). Then do what you want with the whole-genome data.
Bishop – the original idea was to have an 'international chromosome' where everyone sequenced and annotated 5 BACs each. This was to be published as a joint paper and would set the standards for annotating everything else. Asks for definite consistency across the published chromosomes – both on a whole-chromosome and a BAC-by-BAC basis, i.e. show uniformity to the outside world.