Categories for Sequence Data
Acceptable data for DDBJ
For primary entry submissions, DDBJ in principle accepts any nucleotide sequence that the submitter has determined experimentally. It cannot accept sequences that are computationally predicted or cited from other sources.
Even if your sequence is identical to previously reported sequence(s), you can submit it as a “new” entry as long as it was determined independently.
DDBJ also accepts, as TPA (Third Party Data), an entry that is obtained by assembling primary entries released by DDBJ/ENA/GenBank of INSDC and/or to which the submitter has added annotation(s) by experimental or inferential methods.
However, some types of sequence data are not acceptable for DDBJ.
If you plan to release raw output data for studies related to SNPs, WGS, transcriptomes and so on, we recommend contacting the DDBJ Trace Archive or the DDBJ Sequence Read Archive instead of DDBJ/ENA/GenBank.
See Overview of International Nucleotide Sequence Databases Policies
Submission of data including identical or partially duplicated sequences
In principle, DDBJ accepts all sequence data that are independently determined, even if the sequences are identical to each other. For variation studies, DDBJ also accepts submissions of representative data.
If you determine many sequences derived from the same individual, we strongly recommend updating the previously submitted sequence data rather than submitting new sequence data many times.
However, multiple rounds of submission for a single resource may be required for various reasons, such as rights to the sequence data or phases of sequencing, so DDBJ does not restrict them.
Sequencing Data
Annotated/assembled sequences
- DDBJ
- DDBJ in the narrow sense: the counterpart of GenBank and ENA (EMBL-Bank), which accepts sequences with feature annotation and provides them as flat files.
- For how data in the traditional DDBJ are classified, see Categories of Annotated/Assembled Data.
If you are not sure which database you should submit your data to, see the following pages:
- Steps of genome sequencing, categories of sequence data and their correspondences
- Steps of transcriptome project, categories of sequence data and their correspondences
- Division
- Categories of Annotated/Assembled Data
When you submit through the Mass Submission System (MSS), the submitted nucleotide sequences are classified into one of the categories according to the DATATYPE, DIVISION and KEYWORD descriptions.
Sequencing and alignment data from next-generation sequencing platforms
- DRA: DDBJ Sequence Read Archive
- Archival database for output data generated by next-generation sequencing machines including Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System and others.
- DTA: DDBJ Trace Archive
- Archival database of DNA sequence chromatograms (traces), base calls, and quality estimates for single-pass reads from various large-scale sequencing projects.
Functional genomics data
- Genomic Expression Archive (GEA)
- A public database of functional genomics data such as gene expression, epigenetics and genotyping SNP array. Both microarray- and sequence-based data are accepted.
Research project
- BioProject
- Database to organize research projects and the corresponding data.
A BioProject submission is required before submitting sequence data for TSA, TLS, WGS or complete-genome scale entries, except for viruses, plasmids and organelles.
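The BioProject requirement above can be sketched as a small check. This is only an illustration, not DDBJ's actual validation logic: the function name, its parameters and the `source_kind` labels are hypothetical, and the sketch assumes that the exemption for viruses, plasmids and organelles applies to complete-genome scale submissions only.

```python
# Data types that always need a BioProject according to the text above.
BULK_TYPES = {"TSA", "TLS", "WGS"}

# Hypothetical labels for the exempted kinds of source.
EXEMPT_SOURCES = {"virus", "plasmid", "organelle"}


def needs_bioproject(data_type, complete_genome=False, source_kind=None):
    """Return True if a BioProject should be registered first.

    Assumption: the virus/plasmid/organelle exemption is read as applying
    to complete-genome scale submissions, not to TSA, TLS or WGS.
    """
    if data_type in BULK_TYPES:
        return True
    if complete_genome and source_kind not in EXEMPT_SOURCES:
        return True
    return False


print(needs_bioproject("WGS"))                                   # bulk data
print(needs_bioproject("general", True, source_kind="plasmid"))  # exempted
```

If the exemption is meant to cover TSA, TLS and WGS as well, the first branch would need the same `source_kind` test.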
Biological sample
- BioSample
- Database to capture and store descriptive information about the biological source materials, or samples, used to generate experimental data.
Human data requiring controlled-access
- JGA: Japanese Genotype-phenotype Archive
- Database for permanent archiving and sharing of all types of individual-level genetic and de-identified phenotypic data resulting from biomedical research projects.
Annotated/Assembled Data Categories
Division conventional sequence data
General data: classified by source species
Data that do not fall into any of the categories described in the other sections are called general data and belong here.
In principle, general data must have at least one source feature and at least one other biological feature.
Submitted sequences are automatically classified into one of the following divisions on the basis of the taxonomy of the source organisms.
Division | Description |
---|---|
HUM | Human |
PRI | Primates (other than human) |
ROD | Rodents |
MAM | Mammals (other than primates or rodents) |
VRT | Vertebrates (other than mammals) |
INV | Invertebrates |
PLN | Plants or fungi |
BCT | Bacteria |
VRL | Viruses |
PHG | Phages |
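As a rough illustration of how the “other than ...” wording in the table implies an ordering, the sketch below checks a lineage from the most specific group to the least specific. The division codes and descriptions are copied from the table; the taxon names in `RULES`, the `is_phage` flag and the function itself are assumptions made for this example and do not describe DDBJ's actual classification procedure.

```python
# Division codes and descriptions, copied from the table above.
DIVISIONS = {
    "HUM": "Human",
    "PRI": "Primates (other than human)",
    "ROD": "Rodents",
    "MAM": "Mammals (other than primates or rodents)",
    "VRT": "Vertebrates (other than mammals)",
    "INV": "Invertebrates",
    "PLN": "Plants or fungi",
    "BCT": "Bacteria",
    "VRL": "Viruses",
    "PHG": "Phages",
}

# Ordered rules: the first match wins, which encodes "other than ...".
# The taxon names are illustrative assumptions, not DDBJ's rule set.
RULES = [
    ("Homo sapiens", "HUM"),
    ("Primates", "PRI"),
    ("Rodentia", "ROD"),
    ("Mammalia", "MAM"),
    ("Vertebrata", "VRT"),
    ("Metazoa", "INV"),
    ("Viridiplantae", "PLN"),
    ("Fungi", "PLN"),
    ("Bacteria", "BCT"),
    ("Viruses", "VRL"),
]


def guess_division(lineage, is_phage=False):
    """Guess a division code from a list of taxon names, or return None."""
    if is_phage:  # hypothetical flag; phages cannot be told from the names here
        return "PHG"
    names = set(lineage)
    for taxon, division in RULES:
        if taxon in names:
            return division
    return None


mouse = ["Eukaryota", "Metazoa", "Vertebrata", "Mammalia", "Rodentia", "Mus musculus"]
code = guess_division(mouse)
print(code, DIVISIONS[code])
```

Because a mouse lineage contains both `Mammalia` and `Rodentia`, the order of `RULES` is what keeps it out of MAM.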
ENV/SYN: impossible to identify source species, Environmental Samples and Synthetic Constructs
Environmental samples and artificially constructed sequences are classified into the ENV and SYN divisions, respectively.
In principle, ENV and SYN data must have at least one source feature and at least one other biological feature.
Division | Description |
---|---|
ENV | Sequences obtained via environmental sampling methods, direct PCR, DGGE, etc. For ENV submissions, the source feature must carry an environmental_sample qualifier. |
SYN | Synthetic constructs, sequences constructed by artificial manipulation. A SYN entry often has multiple source features, so take care when describing them. See also Description Examples of Sequence Data: E05) synthetic construct. |
EST/GSS/HTC/HTG/STS: Divisions for Feasibility of Sequencing
Sequences derived from high-throughput projects, such as large-scale analyses like EST datasets or ongoing whole-genome scale sequencing, are classified into the following divisions.
In principle, only one source feature should be described for an entry in these divisions.
However, entries in the HTC or HTG division can have other biological features, as general data do, if necessary.
Division | Description |
---|---|
EST | Expressed sequence tags: short, single-pass reads of cDNA sequences. |
GSS | Genome survey sequences: short, single-pass reads of genome sequences. |
HTC | High throughput cDNA sequences from cDNA sequencing projects, not EST. This division includes unfinished high throughput cDNA sequences. |
HTG | High throughput genomic sequences, mainly from genome sequencing projects. Unfinished HTG entries are classified into different levels. |
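The feature rules stated for general, ENV/SYN and high-throughput entries can be sketched as a simple check. This is an illustration under assumptions rather than DDBJ's validator: the function name and messages are hypothetical, features are represented only by their keys, and treating extra features on EST, GSS or STS entries as unexpected is one reading of the text, since only HTC and HTG are explicitly allowed them.

```python
GENERAL_LIKE = {"HUM", "PRI", "ROD", "MAM", "VRT", "INV",
                "PLN", "BCT", "VRL", "PHG", "ENV", "SYN"}
SINGLE_PASS = {"EST", "GSS", "STS"}
HIGH_THROUGHPUT = {"HTC", "HTG"}


def feature_problems(division, feature_keys):
    """Return a list of rule violations for one entry (empty if none)."""
    sources = sum(1 for key in feature_keys if key == "source")
    others = len(feature_keys) - sources
    problems = []
    if sources == 0:
        problems.append("no source feature")
    if division in GENERAL_LIKE:
        # At least one source feature and at least one other feature.
        if others == 0:
            problems.append("needs at least one other biological feature")
    elif division in SINGLE_PASS or division in HIGH_THROUGHPUT:
        # In principle only one source feature.
        if sources > 1:
            problems.append("more than one source feature")
        # Assumption: only HTC/HTG entries may carry other features.
        if division in SINGLE_PASS and others > 0:
            problems.append("other features not expected")
    else:
        raise ValueError(f"unknown division: {division}")
    return problems


print(feature_problems("PLN", ["source"]))
print(feature_problems("HTG", ["source", "CDS"]))
```

The rules are worded “in principle” in the text, so the returned messages are better read as warnings than as hard errors.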
Data type, bulk sequence data
WGS: Fragment Sequences during WGS Assembling Process
The large set of contigs from an ongoing genome project can be submitted as Whole Genome Shotgun (WGS), one type of bulk sequence data.
Please note that the accession number format of WGS data differs from that of other data.
See also Steps of genome sequencing, categories of sequence data and their correspondences.
TSA: Transcriptome Shotgun Assembly
Since 2008, we have accepted Transcriptome Shotgun Assembly (TSA), a type of bulk sequence data for assembled RNA transcript sequences.
In principle, only one source feature should be described for a TSA entry.
TSA entries can have other biological features, as general data do, if necessary.
Please note that the accession number format of TSA data may differ from that of other data.
See also Steps of transcriptome project, categories of sequence data and their correspondences.
TLS: Targeted Locus Study
Since 2016, we have accepted Targeted Locus Study (TLS), a type of bulk sequence data that includes 16S rRNA or other targeted loci, mainly intended to be clustered into operational taxonomic units.
TLS entries can have other biological features, as general data do.
Please note that the accession number format of TLS data differs from that of other data.
Sequenced by whom
TPA: Third Party Data and primary sequence data
TPA (Third Party Data) is a nucleotide sequence data collection in which each entry is obtained by assembling primary entries released by DDBJ/ENA/GenBank and/or the Sequence Read Archive, with additional feature annotation(s) determined by the TPA submitter using experimental or inferential methods.
These assemblies cover two cases: one in which only one or more primary entries are used, and one in which newly determined sequence is also included.
TPA sequence data should be submitted to DDBJ/ENA/GenBank as part of the process of publishing biological research on primary nucleotide sequences.
See also TPA Submission Guidelines.