25th INSDC meeting report: June 11-13 2012, Bethesda, USA
2012
25th INSDC meeting report: June 11-13 2012, Bethesda, USA
International Nucleotide Sequence Database Collaboration
(INSDC), consisted of DDBJ,
EBI and NCBI,
hold the international meeting every year.
In 2012, the meeting was held at NCBI in USA, 11-13 June, to discuss
practical matters to maintain and update nucleotide sequence data
archives; DDBJ,
EMBL-Bank,
GenBank, Sequence
Read Archive (SRA) and Trace Archive
The outcomes of the meeting are summarized below.
The Items; Discussed and To Be Studied
- BioSample database
- The BioSample database contains descriptions of biological source materials used in experimental assays. The purpose of the BioSample database is to provide unified storage and access to information about biological samples. These samples may have investigation information stored in other databases (e.g. nucleotide sequence, expression).
- Both EBI and NCBI have already been independently collecting BioSample at NCBI and BioSamples at EBI. Including DDBJ, we discussed how to collect and share BioSample data collaboratively as an INSDC activity.
- The format of BioSample accession number is SAM[D|E|N]+ 8 digits; e.g. D=DDBJ; E=EBI; N=NCBI;
for example: SAMD01234567 - Strain level taxonomy ID assignment for microorganism genome submission
- All organism names that are represented in the sequence data of INSDC are registered to the taxonomy database.
- Since 2009, taxonomy database has considered to terminate assignment of strain level taxonomy ID for microorganism genomes.
- However, taxonomy database agreed not to stop assigning strain-level taxonomy IDs for prokaryotic strains with sequenced genomes until at least 2013. The change to the current practice will not be made until BioSamples has reached maturity and sample content is exchanged, and so the change may not take place until later in 2013 or beyond.
- Genome scale submissions
- Both submitters and users require INSDC to collect genome data varied in samples and study goals.
- Especially in bulk (re)sequenced cases, INSDC should ask data providers to submit at least one set of assembled/annotated reference genome data, to accept only raw reads to SRA for other genomes with BAM, VCF, GFF in analysis object; i.e. without draft assembles of WGS and scaffold CON data.
- While, in cases of genomes sequenced/assembled to finished level, i.e. possibly treated as a reference genome, INSDC should not label ‘complete genome’ in KEYWORDS for genome data without feature annotation. We should support submitters to annotate their sequences by providing tools and help documents for the minimal standards/requirements
- NCBI introduced one of its activities, Assembly (Genome Collection) database. The Genome Collection database has information about the structure of assembled genomes as represented in an AGP file or as a collection of completely sequenced chromosomes. INSDC agreed to collaborate with this activity.
Changes related to INSDC submission
- Termination of accepting new submission of MGA data
- Since 2004, we have accepted the submission of MGA data.
However, along with the popularization of new sequencing platforms, this data model seemed to be out of date.
So, we decided to terminate accepting new submission of MGA data.
For new submissions, please conact to DDBJ Sequence Read Archive (DRA) and Genomic Expression Archive (GEA).
Changes in SRA XML schema
SRA XML schema version 1.4 has already been applied for SRA data since July 2012.
SRA XML schema version 1.5 will be applied in the near future.
The modification points would be elimination and consolidation of redundant description items.
SRA SRA XML schema version 2.0 continues to be discussed for refactoring SRA metadata with BioProject and BioSample
Forthcoming changes in The DDBJ/EMBL/GenBank Feature Table: Definition
The following items will be applied from October 2012 with the revision of Feature Table Definition, if not otherwise specified.
-
Value format of /anticodon qualifier will be modified to include its sequence.
-
A new value, “pcr”, will be legal for /linkage_evidence
-
The /frequency qualifier will be no more legal for source feature.
-
The controlled vocabularies for qualifier values will be reposited on INSDC web site to revise them more frequent than every 6 months, revision of Feature Table Definition.
-
Since controlled values of /mobile_element_type qualifier are not systematic and could not well cover some instances, INSDC will discuss to improve the controlled values. It will be revised in 2013.
-
To represent nonfunctional genes, /pseudo qualifier will be legal again.
In ICM2011, INSDC adopted the new qualifier /pseudogene and deprecated the older /pseudo. The use of the /pseudogene qualifier indicates that the feature is a pseudogene of the type listed in the value.However, researchers still need to represent nonfunctional genes that are not formally described as pseudogenes.
So, INSDC concluded to continue using /pseudo qualifier in parallel to /pseudogene qualifier would allow this.