24th INSDC meeting report: May 23-27 2011, Osaka, Japan

The International Nucleotide Sequence Database Collaboration (INSDC), consisting of DDBJ, EBI and NCBI, holds an international meeting every year.
In 2011, the meeting was held in Osaka, Japan, on 23-27 May, to discuss practical matters of maintaining and updating the nucleotide sequence data archives: DDBJ, EMBL-Bank, GenBank, the Sequence Read Archive (SRA) and the Trace Archive.
Although the aftermath of the Great East Japan Earthquake was still being felt, DDBJ was able to host ICM2011 thanks to the understanding and cooperation of NCBI and EBI.
The outcomes of the meeting are summarized below.

Items Discussed and To Be Studied

NCBI continues SRA and Trace Archive repositories after October 1, 2011: Recently, NCBI announced that, due to budget constraints, it would be discontinuing its SRA and Trace Archive repositories for high-throughput sequence data. However, NIH has since committed interim funding for SRA in its current form until October 1, 2011. In addition, NCBI has been working with staff from other NIH Institutes and NIH grantees to develop an approach to continue archiving a widely used subset of next-generation sequencing data after October 1, 2011.
NCBI will also continue to provide access to existing SRA and Trace Archive data for the foreseeable future, and is continuing to discuss with NIH Institutes approaches for handling other next-generation sequencing data associated with specific large-scale studies.

BioProject database: Since 2005, INSDC has discussed assigning project IDs as a flag to identify not only genomic and metagenomic sequencing projects but also, with considerable modifications, many other kinds of biological projects.
In 2011, the BioProject schema was introduced. See also DDBJ BioProject Database.
A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the diverse data types generated for that project.
The format of BioProject accession numbers is PRJ[D|E|N][A-Z]+integer, where D = DDBJ, E = EBI and N = NCBI; for example, PRJNA38683.
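
The accession format above can be checked mechanically. The following is a minimal sketch, not an official INSDC validator. It assumes that "+integer" means "followed by an integer", that exactly one uppercase letter follows the archive code, and that the integer is written in decimal digits. The function name `parse_bioproject_accession` is hypothetical.

```python
import re

# Assumption: one archive code (D, E or N), exactly one further uppercase
# letter, then one or more decimal digits.
_BIOPROJECT_RE = re.compile(r"PRJ([DEN])([A-Z])(\d+)")

# Archive codes as given in the report.
_ARCHIVES = {"D": "DDBJ", "E": "EBI", "N": "NCBI"}


def parse_bioproject_accession(accession):
    """Return (archive, letter, number), or None if the text does not match."""
    match = _BIOPROJECT_RE.fullmatch(accession)
    if match is None:
        return None
    code, letter, digits = match.groups()
    return _ARCHIVES[code], letter, int(digits)


print(parse_bioproject_accession("PRJNA38683"))  # ('NCBI', 'A', 38683)
```

Returning `None` for a non-matching string, rather than raising, keeps the sketch easy to use as a filter over mixed identifier lists.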

INSDC agreed on the definitions of its common entry statuses shown in the following table.

* Specific deadlines are available from each INSDC partner.
Status name	Causes	Implications
Public	Data are submitted with no request for confidential hold prior to publication or have reached an owner-agreed public release date.	Data are fully available
Confidential	Data owner requires and indicates to INSDC staff that confidentiality is required until a release date or publication in the literature, whichever comes earlier.	Data are not available publicly through any means. A data release date is recorded for the data, which are subsequently and automatically released as Public on reaching this date or being cited in a publication prior to this date. In the event that a release date must be extended, data owners are required to contact the INSDC partner responsible for the submission with sufficient notice*.
Suppressed	(1) Data are found by the owner to be incorrectly annotated or contaminated with no opportunity on the part of the owner to be updated. (2) Data owners realise after sequences have been released that they failed to request a confidential status, either at the time of submission, or within the period between completion of submission processing and the date on which the submission is normally made available to the public (this time period can vary among the INSDC members).	Data are removed where possible from INSDC partner direct search tools (such as text and sequence similarity search) but remain available by accession number.
Replaced	Data owners generate new data under new accession identifiers that directly replace existing data; this is expected to be rare since replacement data normally use the existing accession identifiers for the records that they replace.	Data are removed where possible from INSDC partner direct search tools (such as text and sequence similarity search) but remain available by accession number. Where possible, look-up by original accession identifiers leads to a re-direct to new records available under the new accession identifiers.
Killed	(1) The submitter has requested a Confidential status or an extension to an existing release date, but the INSDC partner, or their submissions brokering collaborator, has failed to apply the appropriate release date correctly. (2) Data are found to have been submitted to the databases without the permission of the rightful owner; this is expected to be extremely rare and requires formal institutional contact with the submitting institution.	Data are not directly available publicly from INSDC partners through any means. However, because the data will have been distributed previously as Public, the INSDC partners cannot exercise any control on the resultant use of the data by third parties.
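
The "Implications" column of the table can be summarized as a small lookup. The following is an illustrative sketch only: the names `EntryStatus`, `Visibility` and `is_retrievable` are hypothetical, and the two boolean fields simplify the table's wording (for example, "removed where possible" from search tools is modeled as simply not searchable).

```python
from enum import Enum
from typing import NamedTuple


class Visibility(NamedTuple):
    in_search_tools: bool  # found via text or sequence similarity search
    by_accession: bool     # retrievable by accession number


class EntryStatus(Enum):
    PUBLIC = "Public"
    CONFIDENTIAL = "Confidential"
    SUPPRESSED = "Suppressed"
    REPLACED = "Replaced"
    KILLED = "Killed"


# Simplified reading of the "Implications" column of the status table.
VISIBILITY = {
    EntryStatus.PUBLIC: Visibility(in_search_tools=True, by_accession=True),
    EntryStatus.CONFIDENTIAL: Visibility(in_search_tools=False, by_accession=False),
    EntryStatus.SUPPRESSED: Visibility(in_search_tools=False, by_accession=True),
    EntryStatus.REPLACED: Visibility(in_search_tools=False, by_accession=True),
    EntryStatus.KILLED: Visibility(in_search_tools=False, by_accession=False),
}


def is_retrievable(status):
    """True if a record with this status can still be fetched by accession."""
    return VISIBILITY[status].by_accession


for status in EntryStatus:
    print(status.value, VISIBILITY[status])
```

The sketch does not model the re-direct from old to new accession identifiers for Replaced records, nor the automatic release of Confidential records on their release date.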

For submissions to the CON division, AGP format version 2.0 will apply from December 2011.

Changes in SRA XML schema

SRA XML schema version 1.3 has been applied to SRA data since June 2011.
SRA XML schema version 2.0 will be discussed.

Forthcoming changes in The DDBJ/EMBL/GenBank Feature Table: Definition

Two new feature keys, centromere and telomere, will be legal from October 2011.
A new feature key, assembly_gap, will be legal from December 2011.
This feature key is closely tied to the change in AGP format. With the upgrade of the AGP format, sequencing gaps in CON records will be described with assembly_gap features rather than gap features.
The value format of the /anticodon qualifier will be modified in April 2012.
Improvement of pseudogene annotation
As mentioned in the ICM2010 report, the Prokaryotic Annotation Workshop requested that INSDC improve its pseudogene annotation. In addition, to resolve a problem with /pseudo qualifier usage raised at ICM2009, we discussed and decided on the following modification:
A new qualifier key, /pseudogene, will be legal from April 2012, and the old qualifier key, /pseudo, will no longer be accepted in future submissions. The /pseudogene qualifier will be legal only for pseudogenes.
Implementation of /whole_replicon was cancelled.
At ICM2010, we planned to flag entries intended to represent a whole replicon with the /whole_replicon qualifier. However, on reconsideration, we concluded that BioProject records would be more helpful for storing representatives of whole genome data. INSDC will include a new item in BioProject records to flag representative data of a ‘genomic molecule’ instead of adding the /whole_replicon qualifier.

Subsequent to this discussion, INSDC partners agreed to the following definition of the INSDC meaning of ‘genomic molecule’:
```
The submitter of a genomic assembly defines his/her INSDC sequence record as a 'genomic molecule', 
meaning a chromosome, plasmid or linkage group, when it is the submitter's intention to use 
that sequence record permanently as that biological molecule and the sequence is the current 
reasonable model of the biological molecule. Whether the record shows a complete representation 
of the molecule or not is not necessarily a factor under consideration for this submitter-declared 
'genomic molecule'.
```
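
The effective dates announced in the Feature Table section above can be collected into a small lookup for submission tooling. The following is a hedged sketch: the names `LEGAL_FROM` and `is_legal` are hypothetical, and only the year and month are recorded because the report does not give exact days.

```python
# (year, month) from which each feature key or qualifier becomes legal,
# as stated in this report. Days are not given, so none are assumed.
LEGAL_FROM = {
    "centromere": (2011, 10),
    "telomere": (2011, 10),
    "assembly_gap": (2011, 12),
    "/pseudogene": (2012, 4),
}


def is_legal(name, year, month):
    """True if `name` is legal in the given year and month.

    Raises KeyError for names not covered by this report.
    """
    start = LEGAL_FROM[name]
    # Tuples compare element by element: year first, then month.
    return (year, month) >= start


print(is_legal("assembly_gap", 2011, 11))  # False
print(is_legal("assembly_gap", 2011, 12))  # True
```

The table covers only the additions listed in this report; it says nothing about keys that were already legal before 2011.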
