Last updated: 2017.10.10.

Categories for Sequence Data

Sequence data submitted to DDBJ are classified and stored in the following categories.

Real Sequence Data
Raw Output Data from Sequencers
Annotated/Assembled Data
Project/Sample
Controlled Access

 

Real Sequence Data

Raw Output Data from Sequencers

DRA: DDBJ Sequence Read Archive
Archival database for output data generated by next-generation sequencing machines, including the Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System and others.
DTA: DDBJ Trace Archive
Archival database of DNA sequence chromatograms (traces), base calls, and quality estimates for single-pass reads from various large-scale sequencing projects.

 

Annotated/Assembled Data

DDBJ (traditional DDBJ)

DDBJ in the narrow sense. It is the counterpart of GenBank and ENA (EMBL-Bank): it accepts sequences with feature annotation and provides them in flat file format.
The data in traditional DDBJ are classified as follows:

Division, conventional sequence data

Data type, bulk sequence data

Sequenced by whom

If you are not sure which database you should submit your data to, see the following sites:

When the Mass Submission System is used, submitted nucleotide sequences are classified into one of the categories according to the descriptions of DATATYPE, DIVISION, and KEYWORD.

 

Project/Sample

BioProject
Database to organize research projects and the corresponding data.
A BioProject submission is required before submitting sequence data for TSA, TLS, WGS, or complete-genome scale data, except for viruses, plasmids and organelles.
BioSample
Database to capture and store descriptive information about the biological source materials, or samples, used to generate experimental data.
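
The BioProject requirement above can be expressed as a small check. This is only a sketch, not DDBJ's actual logic: the function name, its parameters, and the `source_kind` labels are hypothetical. It also assumes that the exception for viruses, plasmids and organelles applies to complete-genome scale submissions only, which the sentence above does not state unambiguously.

```python
from typing import Optional

# Bulk data types that require a prior BioProject submission (from the text above).
BULK_TYPES_REQUIRING_BIOPROJECT = {"TSA", "TLS", "WGS"}

# Hypothetical labels for the sources exempted at complete-genome scale.
EXEMPT_COMPLETE_GENOME_SOURCES = {"virus", "plasmid", "organelle"}


def bioproject_required(
    data_type: Optional[str] = None,
    complete_genome_scale: bool = False,
    source_kind: Optional[str] = None,
) -> bool:
    """Return True if a BioProject submission is needed first.

    Assumption: the virus/plasmid/organelle exception is applied only to
    complete-genome scale submissions, not to TSA, TLS or WGS.
    """
    if data_type is not None and data_type.upper() in BULK_TYPES_REQUIRING_BIOPROJECT:
        return True
    if complete_genome_scale:
        return source_kind not in EXEMPT_COMPLETE_GENOME_SOURCES
    return False
```

For example, `bioproject_required(data_type="WGS")` is true, while a complete plasmid sequence submitted as general data would not need one under the assumption above.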

 

Controlled Access

JGA: Japanese Genotype-phenotype Archive
Database for permanent archiving and sharing of all types of individual-level genetic and de-identified phenotypic data resulting from biomedical research projects.

 

 

Categories of Annotated/Assembled Data

General data: classified by source species

Data that do not fall into any of the categories described in the following sections are called general data and belong here.
In principle, general data must have at least one source feature and at least one other Biological feature.
Submitted sequences are automatically classified into one of the following divisions on the basis of the taxonomy of the source organisms.

  • HUM ; Human
  • PRI ; Primates (other than human)
  • ROD ; Rodents
  • MAM ; Mammals (other than primates or rodents)
  • VRT ; Vertebrates (other than mammals)
  • INV ; Invertebrates
  • PLN ; Plants or fungi
  • BCT ; Bacteria
  • VRL ; Viruses
  • PHG ; Phages
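
The division codes and the feature requirement for general data can be sketched as a lookup table and a check. This is a minimal illustration, not DDBJ's implementation: the `Feature` class and both function names are hypothetical, and the classification by taxonomy itself is not modelled.

```python
from dataclasses import dataclass
from typing import Sequence

# Division codes for general data, as listed above.
GENERAL_DIVISIONS = {
    "HUM": "Human",
    "PRI": "Primates (other than human)",
    "ROD": "Rodents",
    "MAM": "Mammals (other than primates or rodents)",
    "VRT": "Vertebrates (other than mammals)",
    "INV": "Invertebrates",
    "PLN": "Plants or fungi",
    "BCT": "Bacteria",
    "VRL": "Viruses",
    "PHG": "Phages",
}


@dataclass
class Feature:
    """Hypothetical stand-in for one annotated feature of an entry."""
    key: str  # feature key, e.g. "source" or "CDS"


def describe_division(code: str) -> str:
    """Return the description of a division code (case-insensitive)."""
    return GENERAL_DIVISIONS[code.strip().upper()]


def meets_general_data_requirement(features: Sequence[Feature]) -> bool:
    """True if there is at least one source feature and at least one other feature."""
    n_source = sum(1 for f in features if f.key == "source")
    n_other = len(features) - n_source
    return n_source >= 1 and n_other >= 1
```

An entry carrying only a source feature would fail the check, while one with a source feature plus, say, a CDS feature would pass.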

 

ENV/SYN: impossible to identify source species, Environmental Samples and Synthetic Constructs

Environmental samples and artificially constructed sequences are classified into the ENV (environmental samples) and SYN divisions, respectively.
In principle, ENV and SYN data must have at least one source feature and at least one other Biological feature.

  • ENV; sequences obtained via environmental sampling methods, direct PCR, DGGE, etc.
    For ENV submissions, it is necessary to describe an environmental_sample qualifier on the source feature.
  • SYN; synthetic constructs, sequences constructed by artificial manipulations
    SYN entries often have multiple source features, so take care when describing them.
    See also Example of Submission; E05) synthetic construct.
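
The source feature rules for ENV and SYN can be sketched as a simple check. The function name and the dictionary representation of a source feature are hypothetical, and the sketch assumes that every source feature of an ENV entry must carry the environmental_sample qualifier. The organism values in the usage note are illustrative only.

```python
from typing import Dict, List


def check_env_syn_sources(division: str, sources: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems found in the source features of an ENV or SYN entry.

    Each source feature is modelled as a dict mapping qualifier names to values.
    """
    if division not in ("ENV", "SYN"):
        raise ValueError(f"not an ENV/SYN division: {division!r}")

    problems: List[str] = []
    if not sources:
        problems.append("at least one source feature is required")

    if division == "ENV":
        # Assumption: the qualifier is expected on every source feature.
        for i, qualifiers in enumerate(sources, start=1):
            if "environmental_sample" not in qualifiers:
                problems.append(
                    f"source feature {i} lacks the environmental_sample qualifier"
                )
    # SYN entries may have several source features; nothing further is checked here.
    return problems
```

Calling `check_env_syn_sources("ENV", [{"organism": "uncultured bacterium"}])` reports the missing qualifier, whereas a SYN entry with two source features passes without complaint.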

 

EST/GSS/HTC/HTG/STS: Divisions for Feasibility of Sequencing

Sequences derived from high throughput projects, such as large scale analyses like EST datasets or ongoing whole genome scale sequencing, are classified into the following divisions.
Basically, only one source feature should be described for an entry in these divisions.
However, entries in the HTC or HTG division can have some Biological features, as general data do, if necessary.

EST
expressed sequence tags: short single-pass reads of cDNA sequences
GSS
genome survey sequences: short single-pass reads of genome sequences
STS
sequence tagged sites: tagged sequences for genome sequencing
Use of the primer_bind feature and the PCR_conditions qualifier is recommended.
HTC
high throughput cDNA sequences from cDNA sequencing projects, not EST
This division includes unfinished high throughput cDNA sequences.
HTG
high throughput genomic sequences, mainly from genome sequencing projects
Unfinished HTG entries are classified into the following phases:

  • phase0 ; Survey sequence generated for the purpose of library quality assessment and detection of overlaps with other clones before construction of piece contig(s)
  • phase1 ; Unfinished sequence having contigs that have NOT been ordered and oriented
  • phase2 ; Unfinished sequence having contigs that have been ordered and oriented
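
The three phases of unfinished HTG entries can be summarized as a small decision function. This is only a sketch of the definitions above: the `HtgPhase` enum, the function name, and the two boolean inputs are hypothetical simplifications, and finished HTG entries are not modelled.

```python
from enum import Enum


class HtgPhase(Enum):
    PHASE0 = "phase0"  # survey sequence, before construction of piece contigs
    PHASE1 = "phase1"  # contigs NOT ordered and oriented
    PHASE2 = "phase2"  # contigs ordered and oriented


def unfinished_htg_phase(is_survey: bool, contigs_ordered_and_oriented: bool) -> HtgPhase:
    """Pick the phase of an unfinished HTG entry from two simplified properties."""
    if is_survey:
        # Survey sequences are phase0 regardless of contig status.
        return HtgPhase.PHASE0
    if contigs_ordered_and_oriented:
        return HtgPhase.PHASE2
    return HtgPhase.PHASE1
```

For instance, an unfinished entry whose contigs exist but have not yet been ordered and oriented maps to `HtgPhase.PHASE1`.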

 

CON: Contig / Constructed, Tiling of Entries

Many genome projects that submit a large number of HTG and/or WGS entries can also provide the information needed to assemble a series of their entries and reconstruct the genome structure. An accession number is assigned to such a contig tiling path, a so-called "CON entry", which is classified into the CON division.

See also steps of genome sequencing, categories of sequence data and their correspondences.

We can NOT accept a CON entry submitted on its own.
You must first submit all the piece entries that make up the contig; a CON entry is then constructed from them.
An AGP file is required to submit CON entries.

 

WGS: Fragment Sequences during WGS Assembling Process

A large set of contigs from an ongoing genome project can be submitted as Whole Genome Shotgun (WGS) data, one type of bulk sequence data.
Please note that the accession number format of WGS data differs from that of other data.
See also steps of genome sequencing, categories of sequence data and their correspondences.

 

TSA: Transcriptome Shotgun Assembly

Since 2008, we have accepted Transcriptome Shotgun Assembly (TSA), a type of bulk sequence data for assembled RNA transcript sequences.
Basically, only one source feature should be described for a TSA entry.
TSA entries can have some Biological features, as general data do, if necessary.
Please note that the accession number format of TSA data may differ from that of other data.
See also steps of transcriptome project, categories of sequence data and their correspondences.

 

TLS: Targeted Locus Study

Since 2016, we have accepted Targeted Locus Study (TLS), a type of bulk sequence data covering 16S rRNA or other targeted loci, mainly for clustering into operational taxonomic units.
TLS entries can have some Biological features, as general data do.
Please note that the accession number format of TLS data differs from that of other data.

 

TPA: Third Party Annotation and/or Assembly

TPA (Third Party Data) is a nucleotide sequence data collection in which each entry is obtained by assembling primary entries made public by DDBJ/EMBL-Bank/GenBank, Trace Archive, and/or Sequence Read Archive, with additional feature annotation(s) determined by experimental or inferential methods by the TPA submitter. Such assemblies cover two cases: one or more primary entries are used, and newly determined sequence is contained. TPA sequence data should be submitted to DDBJ/EMBL-Bank/GenBank as part of the process of publishing biological research on primary nucleotide sequences.
See also TPA Submission Guidelines.
