22nd INSDC meeting report: May 12-15 2009, Bethesda, USA

2009

22nd INSDC meeting report: May 12-15 2009, Bethesda, USA

International Nucleotide Sequence Database Collaboration (INSDC), the three data banks; DDBJ, EMBL-Bank/EBI, GenBank/NCBI hold the international meeting every year.
In 2009, the meeting was held at NCBI in USA, 12-13 May.

DDBJ, EMBL-Bank, GenBank reported each bank activities in the last year, discussed practical matters to maintain and update INSDC.

Also, since this year (2009), INSDC has added a coraborative meeting to deal with mass sequence data produced by the next generation sequencers (Short Read Archive) and traces produced by traditional sequencers (Trace Archive).
The first meeting for this collaboration was held at NCBI in USA, 14-15 May 2009.

Short Read Archive

Trace Archive

DTA; DDBJ Trace Archive
Ensembl Trace Server (EBI)
Trace Archive (NCBI)

The outcomes of the two meetings are summarized below.

The Items; Discussed and To Be Studied

Sequence data from the next generation sequencers: As mentioned above, the databases collecting outputs from the next generation sequencers have joined INSDC since 2009.NSDC will request major scientific journals that DRA/ERA/SRA accession numbers for corresponding sequence data should be included in paper submissions.
DDBJ/EMBL-Bank/GenBank reject submissions of EST sequence data produced by 454 sequencers (GS-20, GS-FLX, etc.).In principle, only DRA/ERA/SRA should accept those kinds of EST data.
the database for project ID: Since 2005, INSDC has started to discuss project ID assignment as a flag to specify genome projects, and how to indicate project ID on flat files.; In 2008, INSDC decided to use project ID, not only for genome/metegenome projects, but also many kinds of large scale sequencing projects including transcriptomes.
DDBJ and GenBank indicate project ID at DBLINK line on flat files. EMBL-Bank indicate project IDs in PR line on flat files. For the genome/metagenome projects, we have almostly completed to assign project IDs.; DDBJ/EMBL-Bank/GenBank require to describe project IDs for TSA submissions and thier primary entries
Termination of strain level taxonomy ID assignment for microorganism genome submission: All organism names that are represented in the sequence data of DDBJ/EMBL-Bank/GenBank are registered to the taxonomy database.Taxonomy database assigned strain level taxonomy IDs for whole genome scale submissions of microorganisms, to flag those genome projects.
Since INSDC provided project IDs as a solution to index genome projects, we discussed to terminate assignment of strain level taxonomy ID for microorganism genomes. However, since many institutes have already cited those strain level IDs, we should carefully considrer that the policy change would cause confusion.
Frame mismatched candidates of protein coding regions of high-throughput sequence data: Increasing submissions of large scale draft sequence data, submitters often want to annotate frame mismatched candidates of protein coding regions with CDS features avoiding translation errors by operatively joined location.
To distinguish these kinds of CDS features, we will prepare a new qualifier, /artificial_location qualifier as a flag. In this regard, however DDBJ/EMBL-Bank/GenBank will accept only submissions from whole genome scaleprojects including large scale transcriptomes.
Structured COMMENT/CC line to capture metadata: Recently, GenBank started to use structured COMMENT approach to capture metadata related to a biological sample that has been sequenced.
The concept behind structured COMMENT is to provide submitters with a mechanism that allows them to supply a set of tag/value data elements that currently are not supported by the Feature Table.
DDBJ/EMBL-Bank/GenBank will discuss the format of structured COMMENT/CC line to use it in a formalized way.

Changes to the Feature Table Document: Features and Qualifiers

The following items will be applied from October 2009 with the revision of Feature Table Definition, if not otherwise specified.

Should /pseudo qualifier be renamed?

“The word “pseudo” is likely to be associated with “pseudogene” but it is used for both putative pseudogenes and non-functional forms, so, the /pseudo qualifier should be separated and/or renamed to reflect their actual usages.
This issue will be reconsidered.
The value, “annotated by transcript or proteomic data”, will be legal for /exception qualifier
A new qualifier, /haplogroup, will be legal for source feature.
For the /strain qualifier, it is no more legal to describe multiple equivalent names.

Previously (before May 2009), DDBJ accepted the sequence data with description of multiple-names in a /strain qualifier;
```
       /strain="ATCC #### (= JCM ### = NBRC ###)"
```
To describe equivalent strain names, appropriate usage of /note qualifier is recommended.
```
      /note="strain coidentity: JCM ### = NBRC ###"
      /strain="ATCC ####"
```
A new qualifier, /artificial_location, will be legal for CDS and mRNA features, as mentioned above.

The modification will be applied in December 2009.
Improvement of the format of /inference qualifier

In order to describe inferential supports more effectively, format /inference qualifier will be improved.
The discussion has been continued since 2008.