
DDBJ Annotated/Assembled Sequences


Data Categories

Division

General data: classified by source species

Data that do not fall into any of the other categories described on this page are called general data and belong here.
In principle, a general data entry must have at least one source feature and at least one other biological feature.
Submitted sequences are automatically classified into one of the following divisions on the basis of the taxonomy of the source organisms.

Division Description
HUM Human
PRI Primates (other than human)
ROD Rodents
MAM Mammals (other than primates or rodents)
VRT Vertebrates (other than mammals)
INV Invertebrates
PLN Plants or fungi
BCT Bacteria
VRL Viruses
PHG Phages
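
As a rough illustration of taxonomy-based classification, the sketch below maps a lineage (a list of taxon names) to a division code. The division table is copied from this page; the `LINEAGE_RULES` list and the `guess_division` function are hypothetical and are not DDBJ's actual classification logic. PHG is left out of the rules because this page does not say how phages are distinguished from other viruses.

```python
# Division codes and descriptions, copied from the table above.
DIVISIONS = {
    "HUM": "Human",
    "PRI": "Primates (other than human)",
    "ROD": "Rodents",
    "MAM": "Mammals (other than primates or rodents)",
    "VRT": "Vertebrates (other than mammals)",
    "INV": "Invertebrates",
    "PLN": "Plants or fungi",
    "BCT": "Bacteria",
    "VRL": "Viruses",
    "PHG": "Phages",
}

# Hypothetical rules (an assumption for illustration only):
# (taxon name expected somewhere in the lineage, division code).
# The first matching rule wins, so more specific taxa come first.
LINEAGE_RULES = [
    ("Homo sapiens", "HUM"),
    ("Primates", "PRI"),
    ("Rodentia", "ROD"),
    ("Mammalia", "MAM"),
    ("Vertebrata", "VRT"),
    ("Metazoa", "INV"),
    ("Viridiplantae", "PLN"),
    ("Fungi", "PLN"),
    ("Bacteria", "BCT"),
    ("Viruses", "VRL"),
]


def guess_division(lineage):
    """Return a division code for a list of taxon names, or None if no rule matches."""
    taxa = set(lineage)
    for taxon, division in LINEAGE_RULES:
        if taxon in taxa:
            return division
    return None
```

For example, a lineage containing both "Mammalia" and "Rodentia" resolves to ROD, because the rodent rule is checked before the general mammal rule.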

ENV/SYN: source species cannot be identified, Environmental Samples and Synthetic Constructs

Environmental samples and artificially constructed sequences are classified into the ENV and SYN divisions, respectively.
In principle, an ENV or SYN entry must have at least one source feature and at least one other biological feature.

Division Description
ENV Sequences obtained via environmental sampling methods, direct PCR, DGGE, etc.
For ENV submissions, it is necessary to describe an environmental_sample qualifier on the source feature.
SYN Synthetic constructs; sequences constructed by artificial manipulations
For SYN submissions, an entry often has multiple source features, so take care when describing them.
See also Description Examples of Sequence Data: E05) synthetic construct.
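
The ENV requirements above can be sketched as a simple pre-submission check. The feature layout used here (a list of dicts with `key` and `qualifiers`) and the function name `check_env_entry` are assumptions made for illustration; they do not correspond to a DDBJ file format or tool.

```python
def check_env_entry(features):
    """Return a list of problems found in an ENV entry.

    `features` is a hypothetical representation: a list of dicts such as
    {"key": "source", "qualifiers": {"environmental_sample": True}}.
    """
    problems = []
    sources = [f for f in features if f["key"] == "source"]
    others = [f for f in features if f["key"] != "source"]

    # At least one source feature and one other biological feature are expected.
    if not sources:
        problems.append("no source feature")
    if not others:
        problems.append("no other biological feature")

    # ENV entries need an environmental_sample qualifier on the source feature.
    if sources and not any(
        "environmental_sample" in f.get("qualifiers", {}) for f in sources
    ):
        problems.append("source feature lacks environmental_sample qualifier")

    return problems
```

An empty list means that none of the three checks failed; it does not mean the entry would pass DDBJ validation.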

EST/GSS/HTC/HTG: Divisions for Feasibility of Sequencing

Sequences derived from high throughput projects, such as large scale EST analyses or ongoing whole genome sequencing, are classified into the following divisions.
In general, only one source feature should be described for an entry in these divisions.
However, entries in the HTC or HTG division can have some biological features, like general data, if necessary.

Division Description
EST Expressed sequence tags; short, single-pass reads of cDNA sequences.
GSS Genome survey sequences; short, single-pass reads of genome sequences.
HTC High throughput cDNA sequences from cDNA sequencing projects, not EST.
This division is to include unfinished high throughput cDNA sequences.
HTG High throughput genomic sequences mainly from genome sequencing projects.
Unfinished HTG entries are classified into different levels, as follows:
  • phase0: Survey sequence generated to assess library quality and to detect overlaps with other clones before construction of contig(s)
  • phase1: Unfinished sequence having contigs that have NOT been ordered and oriented
  • phase2: Unfinished sequence having contigs that have been ordered and oriented
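
The three unfinished HTG levels listed above can be read as a two-question decision. The following is a minimal sketch; the function name `htg_phase` and its two boolean parameters are hypothetical simplifications, not fields defined by DDBJ.

```python
def htg_phase(has_contigs, ordered_and_oriented):
    """Pick the unfinished HTG level from two simplified yes/no questions.

    has_contigs: contigs have already been constructed for the clone.
    ordered_and_oriented: those contigs have been ordered and oriented.
    """
    if not has_contigs:
        # Survey sequence generated before contig construction.
        return "phase0"
    if ordered_and_oriented:
        return "phase2"
    return "phase1"
```

Finished sequences fall outside this sketch, since the list covers unfinished entries only.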

Data type, bulk sequence data

WGS: Fragment Sequences during WGS Assembling Process

A large set of contigs from an ongoing genome project can be submitted as a type of bulk sequence data, Whole Genome Shotgun (WGS).
Please note that WGS data use an accession number format different from that of other data.
See also Steps of genome sequencing, categories of sequence data and their correspondences.

TSA: Transcriptome Shotgun Assembly

Since 2008, we have accepted Transcriptome Shotgun Assembly (TSA), a type of bulk sequence data for assembled RNA transcript sequences.
In general, only one source feature should be described for a TSA entry.
TSA entries can have some biological features, like general data, if necessary.
Please note that TSA data may use an accession number format different from that of other data.
See also Steps of transcriptome project, categories of sequence data and their correspondences.

TLS: Targeted Locus Study

Since 2016, we have accepted Targeted Locus Study (TLS), a type of bulk sequence data covering 16S rRNA or other targeted loci, mainly for clustering into operational taxonomic units.
TLS entries can have some biological features, like general data.
Please note that TLS data use an accession number format different from that of other data.

Distinguishing nucleotide sequences that were not determined by the submitters

TPA: Third Party Data and primary sequence data

TPA (Third Party Data) is a nucleotide sequence data collection in which each entry is obtained by assembling primary entries released by DDBJ/ENA/GenBank and/or the Sequence Read Archive, with additional feature annotation(s) determined by the TPA submitter using experimental or inferential methods.
Such an assembly uses one or more primary entries and may also contain newly determined sequence.
TPA sequence data should be submitted to DDBJ/ENA/GenBank as a part of the process to publish biological research for primary nucleotide sequences.
See also TPA Submission Guidelines.

Data types in MSS submission

Type Description
WGS: Whole Genome Shotgun The sequences are WGS (draft genome) excluding MAG or SAG.
GNM: Finished Level Genome Sequence, non-WGS The sequences are Finished Level Genomic Sequences (not WGS) excluding MAG or SAG.
MAG: Metagenome-Assembled Genome The sequences are MAG.
SAG: Single Amplified Genome The sequences are SAG.
TLS: Targeted Locus Study The sequences are TLS.
HTG: High Throughput Genomic Sequences The sequences are HTG.
TSA: Transcriptome Shotgun Assembly The sequences are TSA.
HTC: High Throughput cDNA Sequences The sequences are HTC.
EST: Expressed Sequence Tags The sequences are EST.
MISC: Sequences that are not included in the above types The sequences do not match any of the above types.
ASK: Ask DDBJ curator to judge a correct datatype Consult DDBJ curators about the data type.
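
When preparing submission metadata in a script, the type codes in the table above can be kept in a small lookup. This is only a sketch: the codes and names are copied from the table, while `normalize_data_type` is a hypothetical helper and not part of any DDBJ tool.

```python
# Data type codes and names, copied from the table above.
MSS_DATA_TYPES = {
    "WGS": "Whole Genome Shotgun",
    "GNM": "Finished Level Genome Sequence, non-WGS",
    "MAG": "Metagenome-Assembled Genome",
    "SAG": "Single Amplified Genome",
    "TLS": "Targeted Locus Study",
    "HTG": "High Throughput Genomic Sequences",
    "TSA": "Transcriptome Shotgun Assembly",
    "HTC": "High Throughput cDNA Sequences",
    "EST": "Expressed Sequence Tags",
    "MISC": "Sequences that are not included in above types",
    "ASK": "Ask DDBJ curator to judge a correct datatype",
}


def normalize_data_type(code):
    """Return the upper-case type code, or raise ValueError if it is not in the table."""
    key = code.strip().upper()
    if key not in MSS_DATA_TYPES:
        raise ValueError(
            f"unknown data type {code!r}; consider ASK if the type is unclear"
        )
    return key
```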

Decision of the data type and the registration site for submitting the nucleotide sequences

  • Steps of genome sequencing, categories of sequence data and their correspondences
  • Steps of transcriptome project, categories of sequence data and their correspondences

Related pages

  • Data Submission from Genome Project
  • Submission of environmental sequences
  • Data Submission from Transcriptome Project
  • Third Party Data (TPA)