The Report for the 23rd International Collaborators Meeting

The International Nucleotide Sequence Database Collaboration (INSDC), consisting of DDBJ, EBI and NCBI, holds an international collaborators meeting every year.

In 2010, the meeting was held at EBI in the UK on 19-21 May to discuss practical matters concerning the maintenance and updating of the nucleotide sequence data archives: DDBJ, EMBL-Bank, GenBank, Sequence Read Archive and Trace Archive. Because of travel disruptions caused by the Icelandic volcano, the meeting was shorter than planned. Despite these difficulties, we believe that we made significant progress. The outcomes of the meeting are summarized below.

Items Discussed and To Be Studied

Current movements related to sequence data submission and standardized description formats for sampling information

Sampling information for genome scale data
At the request of the Genomic Standards Consortium (GSC), INSDC has been discussing since 2005 how to include sampling information for genome-scale data in sequence data records, in compliance with Minimum Information about a (Meta)Genome Sequence (MIGS/MIMS) and Minimum Information about an Environmental Sequence (MIENS).
Since 2009, DDBJ, EMBL-Bank and GenBank have been using structured COMMENT/CC lines to describe this kind of extended metadata. However, keeping extensible metadata in a separate reference database has some advantages: it can be maintained and updated independently, and it reduces redundancy of content. INSDC should therefore consider providing extensible metadata through a reference database.
See also Genomic Standards Consortium on Wikipedia

Minimal submission requirements for INSDC
INSDC will register its minimal submission requirements with Minimum Information for Biological and Biomedical Investigations (MIBBI). MIBBI is a project to synthesize reporting guidelines from various communities into a suite of orthogonal standards.

Prokaryotic Annotation Workshop
Researchers who participated in the Prokaryotic Annotation Workshop, hosted by NCBI, asked INSDC to modify some of the description rules for features and qualifiers. The requests came mainly from the J. Craig Venter Institute (JCVI).
INSDC mainly discussed how to cite references for annotated features, and a guideline on protein nomenclature for the values of /product qualifiers in CDS features.

BioProject database

Since 2005, INSDC has discussed the assignment of project IDs as flags to identify many kinds of large-scale sequencing projects, and the scheme has been modified considerably along the way.

In 2010, the schema for project IDs will be substantially modified to extend its scope to many kinds of biological data other than nucleotide sequences, such as array and mass spectrometry data. The database has been renamed the BioProject database, and NCBI will make it available in the near future.

Strain level taxonomy ID assignment for microorganism genome submission

All organism names that appear in INSDC sequence data are registered in the taxonomy database.
Since 2009, the taxonomy database has been considering whether to stop assigning strain-level taxonomy IDs for microorganism genomes. However, because many institutes have already cited those strain-level IDs, we will continue to add strain-level taxids for prokaryotes for at least one more year.

European Nucleotide Archive (ENA) launched at EBI

In May 2010, the European Nucleotide Archive (ENA) was launched at EBI in the UK. It consolidates three major sequence resources in Europe (EMBL Nucleotide Sequence Database (EMBL-Bank), Trace Archive and Sequence Read Archive) to become Europe's primary access point to globally comprehensive nucleotide sequence information.

Names for activities of INSDC

Since 2009, new collaborators have joined INSDC, so some INSDC documents about policies and activities should be updated.

  • INSDC is the official name for the entirety of the collaboration, including traditional sequence databases (DDBJ, EMBL-Bank and GenBank), the next-gen sequence databases and Trace Archive.
  • We agreed to call the international activities to archive raw sequencing data from next-generation sequencers "Sequence Read Archive". Following this agreement, DRA was renamed from "DDBJ Read Archive" to "DDBJ Sequence Read Archive".
  • The INSDC operations to archive sequence chromatograms (traces) are named "Trace Archive".

Sequence Read Archive (SRA)

A paper for SRA introduction
We will prepare a joint SRA paper with details about the data model.

Dealing with data from new sequencing platforms
The SRA schema will be updated to support the following new sequencing platforms:

  • Complete Genomics
  • Helicos
  • Pacific BioSciences
  • Ion Torrent

Changes in future revisions of the Feature Table Document: Features and Qualifiers

Unless otherwise specified, the following changes will take effect in October 2010 with the revision of the Feature Table Definition.

The conflict feature will no longer be used.

For data submitted to DDBJ, the conflict feature can no longer be used.

Three qualifiers, /codon, /label and /partial, will no longer be used.

For data submitted to DDBJ, those qualifiers can no longer be used.

The /gene_synonym qualifier will be legal when either /gene or /locus_tag is used in a feature.
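
The rules stated so far in this section (retiring the conflict feature, retiring the /codon, /label and /partial qualifiers, and requiring /gene or /locus_tag alongside /gene_synonym) could be checked mechanically. The following is only a minimal sketch under stated assumptions, not an INSDC or DDBJ tool: the `Feature` class and the `check_feature` function are hypothetical names introduced here for illustration, and qualifier names are stored without the leading slash.

```python
from dataclasses import dataclass, field

# Items named in this report as no longer used (hypothetical constant names).
RETIRED_FEATURES = {"conflict"}
RETIRED_QUALIFIERS = {"codon", "label", "partial"}


@dataclass
class Feature:
    """Illustrative stand-in for one feature table entry."""
    key: str
    qualifiers: dict = field(default_factory=dict)


def check_feature(feature):
    """Return a list of messages for rules that the feature breaks."""
    problems = []
    present = set(feature.qualifiers)

    if feature.key in RETIRED_FEATURES:
        problems.append(f"feature '{feature.key}' is no longer used")

    for name in sorted(RETIRED_QUALIFIERS & present):
        problems.append(f"qualifier '/{name}' is no longer used")

    # /gene_synonym is legal only together with /gene or /locus_tag.
    if "gene_synonym" in present and not ({"gene", "locus_tag"} & present):
        problems.append("'/gene_synonym' requires '/gene' or '/locus_tag'")

    return problems
```

An empty list means that none of these three rules is violated; the sketch says nothing about any other Feature Table rule.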

The description rule for transposable elements will be changed.

Since 2006, transposable elements have been described with the repeat_region feature and the /mobile_element qualifier. A mobile_element feature and a /mobile_element_type qualifier will be added and used to describe transposable elements instead.
This modification will take effect in December 2010.
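
As a rough illustration of what this change could mean for an existing annotation, the sketch below rewrites a repeat_region feature carrying /mobile_element into a mobile_element feature carrying /mobile_element_type. It rests on assumptions that this report does not confirm: that the qualifier value carries over unchanged, and that any other qualifiers on the feature remain legal. The function name `migrate_transposable_element` is hypothetical.

```python
def migrate_transposable_element(key, qualifiers):
    """Return (key, qualifiers) rewritten to the new description rule.

    Assumption (not stated in the report): the old /mobile_element value
    is reused as-is for /mobile_element_type, and other qualifiers are kept.
    """
    if key != "repeat_region" or "mobile_element" not in qualifiers:
        # Not a transposable element in the old style; return a copy untouched.
        return key, dict(qualifiers)

    updated = {
        name: value
        for name, value in qualifiers.items()
        if name != "mobile_element"
    }
    updated["mobile_element_type"] = qualifiers["mobile_element"]
    return "mobile_element", updated
```

Features that do not match the old style are returned as they are, so the function can be applied to every feature in a record.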

A new qualifier, /whole_replicon, will be legal for the source feature.

To flag entries that aim to sequence a whole replicon, we will use the /whole_replicon qualifier.
The schedule for this addition has not yet been decided.

Improvement of format for /artificial_location qualifier.

The /artificial_location qualifier was introduced in 2009 as a valueless qualifier. To classify the reasons for its use, the qualifier will take one of two controlled values: "heterogenous population sequenced" or "low-quality sequence region".
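
A controlled vocabulary of this kind is easy to check. The sketch below is an illustration only: the constant and function names are hypothetical, and the two strings are copied exactly as they are spelled in this report.

```python
# The two controlled values, spelled as in this report.
ARTIFICIAL_LOCATION_VALUES = frozenset({
    "heterogenous population sequenced",
    "low-quality sequence region",
})


def is_valid_artificial_location(value):
    """True if value is one of the two controlled values.

    A missing value (None) is rejected, because under the revised rule
    the qualifier is no longer valueless.
    """
    return value in ARTIFICIAL_LOCATION_VALUES
```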

Improvement of formats for /experiment and /inference qualifiers.

In response to requests from the Prokaryotic Annotation Workshop, the formats of the /experiment and /inference qualifiers will be improved, mainly so that the evidence supporting a feature can be cited. The changes are:

  • header classification with "COORDINATES", "DESCRIPTION" and "EXISTS"
  • description of PubMed ID (PMID) and Digital Object Identifier (DOI) to indicate cited publications

Examples

            /experiment="COORDINATES: N-terminus verified by Edman degradation
            [PMID: 8096212]"
            /inference="DESCRIPTION: similar to AA sequence: INSDC: AAF23014.2"
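
To show how the header and the PMID citation in such values might be picked apart, here is a minimal parsing sketch. It assumes only the syntax visible in the two examples above (a header word followed by a colon, and a PMID in square brackets); DOI handling is left out because this report does not show how a DOI is written. The function name `parse_evidence` and the returned dictionary layout are hypothetical.

```python
import re

# Header classifications listed in this report.
_HEADER = re.compile(
    r"^(?P<category>COORDINATES|DESCRIPTION|EXISTS):\s*(?P<text>.*)$"
)
# PMID citation as written in the example: [PMID: 8096212]
_PMID = re.compile(r"\[PMID:\s*(\d+)\]")


def parse_evidence(value):
    """Split an /experiment or /inference value into header, text and PMIDs."""
    # Flat-file values may wrap across lines; collapse all whitespace runs.
    value = " ".join(value.split())

    match = _HEADER.match(value)
    if match:
        category, text = match.group("category"), match.group("text")
    else:
        category, text = None, value

    pmids = _PMID.findall(text)
    text = _PMID.sub("", text).strip()
    return {"category": category, "text": text, "pmids": pmids}
```

Applied to the first example above, the sketch yields the category "COORDINATES", the text "N-terminus verified by Edman degradation" and a single PMID, "8096212". A value without one of the three headers is returned with the category set to None.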

How to improve pseudogene annotation?

As mentioned above, the Prokaryotic Annotation Workshop requested improvements to the description rules for features and qualifiers. One of their requests was to improve pseudogene annotation. We also discussed this issue in order to resolve a problem concerning /pseudo qualifier usage from ICM2009. However, we could not reach agreement at the meeting, mainly because of the difficulty of maintaining consistency with existing records.
This issue will be reconsidered.