31st INSDC meeting report: May 15-17 2018, Bethesda, USA

2018

31st INSDC meeting report: May 15-17 2018, Bethesda, USA

International Nucleotide Sequence Database Collaboration (INSDC), consisted of DDBJ Center, EBI and NCBI, hold the international meeting every year.
In 2018, the meeting was held at NCBI, 15-17 May, to discuss practical matters to maintain and update nucleotide sequence data archives; DDBJ, ENA, GenBank and Sequence Read Archive (SRA).
The outcomes of the meeting are summarized below.

The Items; Discussed and To Be Studied

Format expansions for accession numbers and for /protein_id.

By the end of 2018, INSDC will expand the accession number formats as follows;

For bulk sequence data, WGS, TSA and TLS, currently using a “4+2+6/7/8” format ( four-letter prefix and two digits for set version followed by six digits), the new format will be a six-letter prefix and a two-digit set version number followed by 7, 8, or 9 digits (“6+2+7/8/9” format); for example, AAAAAA020000001.
For conventional sequence data, currently using a “2+6” format (two-letter prefix followed by six digits) and a “1+5” format, the new format will be two-letter prefix followed by eight digits (“2+8” format).
For /protein_id, currently using a “3+5” format (three-letter prefix followed by five digits), the new format will be three-letter prefix followed by seven digits (“3+7” format).

International Protein Nomenclature Guidelines are introduced.

INSDC agreed to recommend the international protein nomenclature guidelines to submitters of DDBJ, ENA and GenBank. The guidelines were revised and reorganized from previous guidelines of UniProt and NCBI by NCBI, EBI and the Protein Information Resource and the Swiss Institute for Bioinformatics.
The guidelines are exclusively focused on nomenclature. The future changes of guidelines can be tracked in GitHub.See also NCBI Insights and Inside UniProt.

Increasing metagenomic data

Assembling/analyzing methods for metagenomic approaches are progressed and data submissions of them are increased. We discussed to accommodate with metagenomic data. Both BioSample and flatfile will be expanded to deal with minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) as requested by the Genomic Standards Consortium.
See Nat. Biotechnol. 35:725-731 (2017) about MIMAG and MISAG in detail.

Forthcoming changes in The DDBJ/ENA/GenBank Feature Table: Definition

The following items will be applied after October 2018 with the next revision of Feature Table Definition, if not otherwise specified.

A new qualifier, /metagenome_source, will be available for source feature. In cases of metagenomic data, the /metagenome_source is mandatory when /organism does not show “xxx metagenome” or the like. The /metagenome_source must contain the word “metagenome” and must exist in the taxonomy database.
/metagenome_source requires /environmental_sample qualifiers.
Definition of /EC_number qualifier qualifier will be slightly changed.

Data Access Policy

For COP14 and the Nagoya Protocol: DDBJ reported the results of the Conference of the Parties Convention on Biological Diversity 14 (COP14) for their opinion on whether “digital sequence information” should be included in the Nagoya Protocol. The Science Council of Japan and the International Chamber of Commerce do not favor this action.
We, INSDC, have an existing policy of free and unrestricted access to all of the data records.
For GDPR: In May 2018, the European General Data Protection Regulations (GDPR) have come into force. ENA presented the work and the changes that arise from these with respect to ENA operations.
NGS Quality Scores: Most of SRA storage is taken up by quality scores, however, the sequencing technologies are going to natural progression and precisions of base calling do not impact to assemblies or analyses. Besides, to save storage of users during analyses and to transfer SRA data from submitters and to users more easier, we consider that it is preferable to remove quality scores from SRA data. We would like to discuss about how to accept future NGS data with major research communities including vendors, tool maintainers and users.