DDBJ flat file format
DDBJ/EMBL-Bank/GenBank, the International Nucleotide Sequence Database Collaboration (INSDC) collects the nucleotide sequences experimentally determined, and constructs the database in accordance with the rule agreed with the three databanks.
The database is a collection of “entry” which is the unit of the data. The entry submitted to DDBJ is processed and publicized according to the DDBJ format for distribution (flat file). The flat file includes the sequence and the information of submitters, references, source organisms, and “feature” information, etc. The “feature” is defined by The DDBJ/ENA/GenBank Feature Table Definition to describe the biological nature such as gene function and other property of the nucleotide sequence.
The virtual sample of DDBJ flat file
LOCUS AB000000 450 bp mRNA linear HUM 01-JUN-2009 DEFINITION Homo sapiens GAPD mRNA for glyceraldehyde-3-phosphate dehydrogenase, partial cds. ACCESSION AB000000 VERSION AB000000.1 KEYWORDS . SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo. REFERENCE 1 (bases 1 to 450) AUTHORS Mishima,H. and Shizuoka,T. TITLE Direct Submission JOURNAL Submitted (30-NOV-2008) to the DDBJ/EMBL/GenBank databases. Contact:Hanako Mishima National Institute of Genetics, DNA Data Bank of Japan; Yata 1111, Mishima, Shizuoka 411-8540, Japan REFERENCE 2 AUTHORS Mishima,H., Shizuoka,T. and Fuji,I. TITLE Glyceraldehyde-3-phosphate dehydrogenase expressed in human liver JOURNAL Unpublished (2009) COMMENT Human cDNA sequencing project. FEATURES Location/Qualifiers source 1..450 /chromosome="12" /clone="GT200015" /clone_lib="lambda gt11 human liver cDNA (GeneTech. No.20)" /db_xref="taxon:9606" /map="12p13" /mol_type="mRNA" /organism="Homo sapiens" /tissue_type="liver" CDS 86..>450 /codon_start=1 /gene="GAPD" /product="glyceraldehyde-3-phosphate dehydrogenase" /protein_id="BAA12345.1" /transl_table=1 /translation="MAKIKIGINGFGRIGRLVARVALQSDDVELVAVNDPFITTDYMT YMFKYDTVHGQWKHHEVKVKDSKTLLFGEKEVTVFGCRNPKEIPWGETSAEFVVEYTG VFTDKDKAVAQLKGGAKKV" BASE COUNT 102 a 119 c 131 g 98 t ORIGIN 1 cccacgcgtc cggtcgcatc gcacttgtag ctctcgaccc ccgcatctca tccctcctct 61 cgcttagttc agatcgaaat cgcaaatggc gaagattaag atcgggatca atgggttcgg 121 gaggatcggg aggctcgtgg ccagggtggc cctgcagagc gacgacgtcg agctcgtcgc 181 cgtcaacgac cccttcatca ccaccgacta catgacatac atgttcaagt atgacactgt 241 gcacggccag tggaagcatc atgaggttaa ggtgaaggac tccaagaccc ttctcttcgg 301 tgagaaggag gtcaccgtgt tcggctgcag gaaccctaag gagatcccat ggggtgagac 361 tagcgctgag tttgttgtgg agtacactgg tgttttcact gacaaggaca aggccgttgc 421 tcaacttaag ggtggtgcta agaaggtctg //
Flat file displays the information provided by submitters with DDBJ
Even when the sequences are similar, the contents on the flat files may vary according to the submitter’s research aim etc.
Please take that point into consideration when you refer search results.
locus name, sequence length, molecule type, molecular form, division, the date of last release
Locus name is a unique ID of the entry in the database. In DDBJ, since July 1996, the locus name has been assigned the same asaccession number.
Notice: No information is available on the Master record of MGA data.
According to the value of /mol_type qqualifier for source feature, it is described as DNA, RNA, mRNA, rRNA, tRNA, or cRNA.
This column indicates whether molecular form of nucleotide sequence is “linear” or “circular”. If the entry is the full length of circular form, “circular” is appeared.
DDBJ classifies entries into 21 divisions as below;
a: taxonomic divisions
|PRI||primates (other than human)|
|MAM||mammals (other than primates and rodents)|
|VRT||vertebrates (other than mammals)|
|INV||invertebrates (animals other than vertebrates)|
|PLN||plants, fungi, plastids (eukaryotes other than animals)|
|BCT||bacteria (including both Eubacteria and Archaea)|
b: other divisions
|PAT||sequence data related to patent application
The data those which Japan Patent Office (JPO), United States Patent and Trademark Office (USPTO), European Patent Office (EPO), and Korean Intellectual Property Office (KIPO) collected, processed and released.
|ENV||sequences obtained via environmental sampling methods|
|SYN||synthetic constructs; artificially constructed sequences|
|EST||expressed sequence tags; short single pass cDNA sequences|
|TSA||transcriptome shotgun assemblies; assembled mRNA sequences|
|GSS||genome survey sequences; short single pass genomic sequences|
|HTC||high throughput cDNA sequences;
The sequence submitted from cDNA sequencing projects except for EST. This division is to include unfinished high throughput cDNA sequences, each of which has 5’UTR and 3’UTR at both ends and part of a coding region. The sequence may also include introns. When the sequence becomes finished later, it moves to the corresponding taxonomic division.
|HTG||high throughput genomic sequences;
The sequence submitted mainly from genome sequencing projects which regarded a clone as a sequencing unit.
|STS||sequence tagged sites
The tag site for genome sequencing. The information of chromosome, map, PCR_condition is necessary for this division.
|UNA||the data not annotated
The UNA division is not used recently.
|CON||Contig / Constructed
To conjugate a series of entries, such as those submitted from a genome project, each of the three data banks constructs an entry and assign an accession number to a large scale sequence dataset. Such entries are classified into the CON division. The entry in the CON division has the information of joined accession numbers instead of the sequence data. The corresponding entries of the CON entry have been submitted to other divisions.
The current publicized date is described. If the entry is updated and reopened to public site, this date will be changed.
The definition briefly describes the information of gene(s). “DEFINITION” is constructed by each of the three data banks in accordance with standard rules in principle.However, in the case of EST or GSS submission using Mass Submission System, DDBJ will sometimes ask submitters to construct “DEFINITION”.
- Complete sequence of maize catalase coding gene
DEFINITION Zea mays Cat3 gene for catalase, complete cds.
Format: [organism name] [gene name] gene for [product name], complete cds.
- organism name: The scientific name is indicated as the organism name, in principle.
- gene name: the symbol of the gene
- product name: the general name of product
- complete cds: this coding sequence is complete
- Partial sequence of human glyceraldehyde-3-phosphate dehydrogenase coding cDNA
DEFINITION Homo sapiens mRNA for glyceraldehyde-3-phosphate dehydrogenase, partial cds.
- Format: [organism name] mRNA for [product name], partial cds.
- partial cds: this protein coding sequence is partial
- The gene name is omitted, because the submitter did not describe.
- Partial sequence of Bacillus 16S rRNA
DEFINITION Bacillus sp. AZ25 gene for 16S rRNA, partial sequence.
Format: [organism name] [strain name] gene for [product name], partial sequence.
- In cases of unidentified species, comparison of intraspecies, and so on, describe name of strain, isolate or some, as identifier.
- partial sequence: this sequence is part of 16S rRNA.
- Multiple CDS of rat mitochondrial DNA
DEFINITION Rattus norvegicus mitochondrial genes for cytochrome c oxidase subunit II, ATPase subunit 6, cytochrome c oxidase subunit III, partial and complete cds.
- Format: [organism name] [gene name 1], [gene name 2], …. genes for [product name 1], [product name 2], ….. , partial and complete cds.
- The gene names and/or product names are subsequently described from 5’to 3’ end.
- “partial, complete and partial cds” is abbreviated to “partial and complete cds”.
- If some genes have only gene names or product names, only gene name or product name is described principally.
- If the “DEFINITION” is too long, some information, such as map position, is described instead of the gene or product names.
- Sometimes gene cluster or operon name is described, if it is considered reasonable.
- EST data of human liver 3’ end
DEFINITION Homo sapiens cDNA, clone:ABC123, 3' end, expressed in liver.
- Format: [organism name] cDNA, clone:[clone name], [other information].
- The clone name is mandatory.
- GSS data of mouse chromosome 1q
DEFINITION Mus musculus DNA, clone:1H11A14, 1q region.
Format: [organism name] DNA, clone:[clone name], [other information].
- The clone name is mandatory.
- TPA (Third Party Data) of human GAPD
DEFINITION TPA_exp: Homo sapiens GAPD mRNA forglyceraldehyde-3-phosphate dehydrogenase, complete cds.
- Format: [TPA header]: [organism name] [gene name] mRNA for [product name], complete cds.
- In the case of TPA (Third Party data), either of “TPA_exp” (for TPA:experimental) or “TPA_inf” (for TPA:inferential) is described at the beginning of DEFINITION.
This line shows accession number of the entry data.
- Conventional sequence data
- A unique accession number is issued to the data submitter by each of the three data banks. The accession number is composed of 1 alphabet character and 5 digits (ex. A12345) or 2 alphabet characters and 6 digits (ex. AB123456). The former style was used in 1980s, but later the latter style was introduced because of data explosion.
The alphabet part is called “prefix”. Please refer the prefix list.
If multiple entries are united to an entry, or if an entry is extensively modified after the submission, the responsible data banks may assign a new accession number to it. In these cases, the new accession number is called the primary accession number, and the old accession number(s) is/are called the secondary accession number(s). In the flat file, the primary accession number is indicated first, then the secondary accession number(s) follows. You can find the same updated entry with both the primary and the secondary accession numbers.
ACCESSION AB999999 AB888888 AB777777
||primary accession number|
||secondary accession number|
- Bulk sequence data; WGS, TSA, TLS
- The accession number assigned to each entry of WGS, TSA and TLS data consists of 4 alphabet characters and 8 (sometimes 9 or 10, if necessary) digits.
The alphabet part is called prefix.
See also For Large Scale Data (four prefix).
ZZZZ (4 letters) Prefix to distinguish each project, project_id 01 (2 digits) Version number of the data set, set_version 000001 (6 digits) ID of each individual sequence (It might be 7 or 8 digits depended on the number of entries.)
The set_version goes up for every update of the dataset. Example:ZZZZ02000001
ACCESSION ZZZZ01000001 ZZZZ01000000
||primary accession number|
- For MGA data
- This (ACEESSION) line shows a number assigned by INSDC to a resource.
- The number is composed of 5 alphabet characters and 7 digits (ex. ABCDE0000001).An accession number assigned to an entry of a resource units is displayed in the MGA lines.
AB (first two characters) identi identifier to each project. CDE (third to fifth characters) identifier to each of resources on each project. 0000001 (7 digit numeric numbers) number for each sequence entry in a resource.
*1 The information about each project id is avilable at the project_index page.
*2 “resource” here means a unit of identical origin, such as tissue, cells, from which sequence are obtained.
||number to a resource unit|
This line consists of an accession number and a version number, like “AB123456.1”, in which the digit(s) after the period is a version number.
The data open to public for the first time is version number as “1”. The reason for adding VERSION is that since a released sequence sometimes revised by the submitter, the accession number alone cannot specify the sequence in question causing the user a trouble. The number is increased by one every time when a revised sequence is made public. And accession number will NOT be changed generally.
- For MGA data
- This line consists of a number assigned to a resources unit in which the digit(s) after the period is a version number.
Since the sequence of an MGA entry is not allowed to update, the version number has to be “1”.
||number to a resource unit|
The DBLINK line is used to link other databases for BioProject, BioSample accession numbers, Sequence Read Archive Run accession numbers and so on.
DDBJ has replaced the PROJECT line by DBLINK line format since 2009 to expand for other data resources than projects.
DBLINK BioProject:PRJDA12345 BioSample:SAMD01234567 Sequence Read Archive:DRR012345, DRR012346
||The name of linked database: BioProject Database|
||Linked ID in the database; BioProject accession number|
||The name of linked database: BioSample Database|
||Linked ID in the database; BioSample accession number|
||The name of linked database: Sequence Read Archive (SRA)|
||Linked ID in the database; SRA Run accession numbers|
For now, KEYWORDS lines are used to indicate the detail category of the data (EST, TSA, HTC, HTG, GSS, WGS, TPA etc) information about experimental method, “finishing level” of genome sequencing and else, if necessary. See also INSDC agreed methodological keywords.
This line shows the scientific name (and common name, if defined) on organism from which the sequence is obtained and an organelle type if the sequence is derived from an organelle other than the nucleus.
SOURCE Homo sapiens (human)
||The scientific name from which the sequence is obtained.|
The organism name and its phylogenic lineage from which the sequence is obtained are described.
The scientific name is indicated as the organism name in 1st line. If the sequence is obtained from an unidentified organism or artificially synthesized, the name registered on the Unified Taxonomy Database is described instead of scientific name.
The phylogenic lineage information based on the Unified Taxonomy Database is started from 2nd line.
ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
||The scientific name from which the sequence is obtained|
||The phylogenic lineage information of Homo sapiens|
The information of submitter(s) is described as REFERENCE 1 (except old entries and some CON entries).
In the case of Nucleotide Sequence Submission System, REFERENCE 1 is processed with the information entered on “Contact person” and “Submitter” pages. In the case of Mass Submission System, REFERENCE 1 is processed with the information entered in annotation file.
REFERENCE 1 (bases 1 to 450)
Notice: The portion, “(bases 1 to 450)”, is not available on the Master record of MGA data.
Submitter(s) of the entry is/are indicated in principle. Submitter is responsible for the data and can update it.
AUTHORS Mishima,H. and Shizuoka,T.
||The submitters of this entry|
“Direct Submission” is indicated to follow the standard form.
TITLE Direct Submission
At first, “Accept Date” of the entry is indicated. “Accept Date” is defined as the date when DDBJ have received the acceptable data to assign accession number in principle. Even if the entry is updated, “Accept Date” is NOT changed. Then, the information about the address and the affiliation of “Contact Person” is indicated.
JOURNAL Submitted (30-NOV-2008) to the DDBJ/EMBL/GenBank databases. Contact:Hanako Mishima National Institute of Genetics, DNA Data Bank of Japan; Yata 1111, Mishima, Shizuoka 411-8540, Japan
||Accept date of this entry is 30-NOV-2008|
||The information about the address and the affiliation of Hanako Mishima.|
E-mail address, phone & fax nos.
- To follow the Japanese law of protecting personal information, DDBJ will delete both phone and fax numbers, and E-mail address from the flat files of the entries submitted to DDBJ.However, if you wish to disclose any of the three items, please contact us with contact form,?specifying the item(s) to be disclosed.
- When you wish to contact to the submitter(s) of an entry of your interest,please contact us with the inquiry form with reasons briefly;i.e. asking to transfer cloned sequences, etc, then we will forward your messeage to the submitter(s).
Phone and fax numbers and E-mail address are deleted.
JOURNAL Submitted (30-NOV-2000) to the DDBJ/EMBL/GenBank databases. Contact:Hanako Mishima National Institute of Genetics, DNA Data Bank of Japan; Yata 1111, Mishima, Shizuoka 411-8540, Japan
When the submitters wish to keep their contact information disclosed, it will be described as,
JOURNAL Submitted (30-NOV-2000) to the DDBJ/EMBL/GenBank databases. Contact:Hanako Mishima National Institute of Genetics, DNA Data Bank of Japan; Yata 1111, Mishima, Shizuoka 411-8540, Japan E-mail :email@example.com Phone :81-55-981-6853 Fax :81-55-981-6849
The information of references related to the submitted sequence is indicated on REFERENCE line (other than (REFERENCE 1). Since REFERENCE 2 indicates the publication status of the sequence, the reference which does not describe about the submitting sequence is indicated as REFERENCE 3 or after, not as REFERENCE 2.
When DDBJ notices a paper publication with an accession number, DDBJ will update the entry with the accession number, if necessary. During the process of the update, the prepublication paper(s) described in the line(s), REFERENCE 2 and/or later, will be revised without any notice to submitters, if applicable; i.e. When the submitted data, submitters’ affiliation, author names, title, and journal name of the prepublication paper, are enough reasonable to be revised.
- In the cases of the manuscript in preparation, submitted for publication, in press, or published
REFERENCE 2 AUTHORS Mishima,H., Shizuoka,T. and Fuji,I. TITLE Glyceraldehyde-3-phosphate dehydrogenase expressed in human liver JOURNAL Unpublished (2009)
||The (presumptive) author(s) of the reference is/are described.|
||The (presumptive) title of the reference is described.|
||In the cases of the paper published or In Press, the journal name is described. In the case of unpublished manuscript, “Unpublished” is described to follow the standard form.|
- In the case of no schedule for publication except the international nucleotide database.
REFERENCE 2 AUTHORS Mishima,H., Shizuoka,T. and Fuji,I. TITLE Glyceraldehyde-3-phosphate dehydrogenase expressed in human liver JOURNAL Published Only in Database(2009)
||The author(s) of the submission entered by submitter(s) is/are described.|
||The title of the submission entered by submitter(s) is described.|
||“Published Only in Database” is indicated.
The parenthetic number is the year when the entry has been firstly publicized.
The information about an entry that can not be described using FEATURES or the other fields. For instance, if submitter has the other affiliation to REFERENCE 1, it can be described on COMMENT line.
COMMENT Human cDNA sequencing project.
- Structured COMMENT
- Structured COMMENT is a format to describe and to share some datasets undefined in feature/qualifier.
SUsing structured COMMENTs, datasets can be shared via flatfiles of INSDC in the community of submitters and users.
To describe structured COMMENT, the dataset is required to be describe in structured sets of [names of items] and [values of items] on COMMENT line.
There are some predetermined formats of structured COMMENTs that are required to submit some kinds of sequence data derived from genome projects (includingWGS, transcriptome projects (including TSA) and so on.
COMMENT ##Genome-Assembly-Data-START## Finishing Goal :: Finished Current Finishing Status :: High Quality Draft Assembly Method :: Newbler v. 2.3 Genome Coverage :: 30x Sequencing Technology :: 454 GS Junior; Illumina GA II ##Genome-Assembly-Data-END##
The above example is an additional information, “Genome-Assembly-Data”, that is required for genome projects.
The contents between ##Genome-Assembly-Data-START## and ##Genome-Assembly-Data-END## are delimited item names and their values by “ :: “.
||The first line of the structured COMMENT defined as “Genome-Assembly-Data”.|
||The last line of the structured COMMENT defined as “Genome-Assembly-Data”.|
||The final goal of the genome project is “Finished” level.|
||The current status of the genome project is “High Quality Draft” level.|
||The software to assemble reads of sequences is Newbler and its version is 2.3.|
||The sequencing depth of the genome sequences is approximately 30 fold.|
||454 GS Junior; Illumina GA II – the platforms (sequencers) to determine the genome sequences are “454 GS Junior” and “Illumina GA II”.|
- For MGA data
- For MGA Submission, the process for obtaining the submitted sequence data e.g.; (methods for preparing sequences from tissues or cells and processing the sequences for submission) is described.
COMMENT The CAGE (cap analysis gene expression) is based on preparation and sequencing of concatamers of DNA tags deriving from the initial 20/21 nucleotides from 5' end mRNAs. Full-length cDNAs were at first selected with the Cap-Trapper method. Then, a specific linker (Linker1, some linker contain 5 bp sequences that have 15 variations for each rna sample) containing the ClassIIs restriction enzyme site MmeI was then ligated to the single-strand cDNA and then the second strand of cDNA synthesized. The resulting double-stranded cDNA was cleaved by the restriction enzyme MmeI and a second linker (Linker2) was ligated to the 2 bp overhang at the MmeI cleaved site, to produce a 5' 20/21 tag having two linkers at both sides. The ligation products were separated from unmodified DNA with magnetic beads. The 5' end cDNA tags were released from the beads, and the DNA fragments were amplified in a PCR step by using the two linker-specific primers (Primer1 (uni-PCR), Primer2 (MmeI-PCR)). The desired 32-37 bp tags were purified and ligated to form concatamers, and then the concatamer were fractionated and ligated to the plasmid ZErO-2. The ligations were finally electroporated into DH10b cells (Invitrogen) and obtained plasmids were sequenced with forward primers. CAGE libraries were sequenced with forward primers essentially as described with minor modifications to use zeocin for selection of recombinants. We used in-house developed algorithms for the extraction of tags and for masking the vectors. CAGE tags were extracted with the following parameters: vector masking, minimum 12 bp recognition allowed; linker (13 bp) masking: maximum mismatch, 2 bp allowed; XmaJI site maximum mismatch, 2 bp allowed; tag length, 17-24 bp. Linker1: "Upper oligonucleotide GN6": biotin-agagagagacctcgagtaactataacggtcctaaggtagcgacctagg (5 bp) tccgacGNNNNN and "Upper oligonucleotide N6":
Biological features of a submitted sequence data are described with “Feature” key (the biological nature of the annotated feature), “Location” (the region of the sequence which corresponds to Feature), and “Qualifier” (supplementary information about Feature). In principle, EST or GSS entries are not described with any features except the “source” key.
FEATURES are indicated on the basis of the information provided by submitter and modified by databanks to describe the appropriate annotation. The rules of feature description agreed with three databanks are explained at The DDBJ/ENA/GenBank Feature Table Definition in detail.
Feature keys are briefly classified into 3 groups;
- group 1: biological source of the sequence (source)
The feature, “source” (group 1) is mandatory for all entries in the international nucleotide database.
The qualifiers “/organism” and “/mol_type” are mandatory for source feature.
- group 2: biological function features of the region
Feature keys in group 2 fall into families which are in some sense similar in function and which are annotated in a similar manner.A functional family may have a “generic” or miscellaneous key, which can be recognized by the ‘misc_’ prefix, that can used for instances not covered by the other defined keys of that group.
e.g. CDS, rRNA, etc.
- group 3: difference and/or change of the sequence data
e.g. variation, conflict, etc.
One of the most frequently used feature key is “CDS” to describe coding sequence for protein. See also CDS feature page.
FEATURES Location/Qualifiers source 1..450 /chromosome="12" /clone="GT200015" /clone_lib="lambda gt11 human liver cDNA (GeneTech. No.20)" /db_xref="taxon:9606" /map="12p13" /mol_type="mRNA" /organism="Homo sapiens" /tissue_type="liver" CDS 86..>450 /codon_start=1 /gene="GAPD" /product="glyceraldehyde-3-phosphate dehydrogenase" /protein_id="BAA12345.1" /transl_table=1 /translation="MAKIKIGINGFGRIGRLVARVALQSDDVELVAVNDPFITTDYMT YMFKYDTVHGQWKHHEVKVKDSKTLLFGEKEVTVFGCRNPKEIPWGETSAEFVVEYTG VFTDKDKAVAQLKGGAKKV"
Identifies the biological source of the specified span of the sequence.
||The region from 1st to 450th base of the sequence is derived from the source described with following qualifiers.|
||The sequence is obtained from chromosome 12.|
||The clone name which the sequence is obtained.|
||The clone library name which the sequence is obtained.|
||The sequence is located on 12p13.|
||The sequence is derived from a organism correspond to taxonomy database ID: 9606 (human).|
||The sequence is derived from mRNA.|
||The sequence is obtained from human.|
||The sequence is obtained from liver.|
Coding sequence; sequence of nucleotides that corresponds with the sequence of amino acids in a protein (location includes stop codon).
||The region from 86th to 450th base of the sequence is coding a protein described with following qualifiers.”>” means that 3’end is not completed for the region of CDS. The rule to describe “Location” is explained at Description of Location in detail.|
||The frame reading amino acid translation of the first codon is the 1st base of this region (86th base of the entry).|
||gene symbol, see gene qualifier|
||product name, see product qualifier|
||This is the ID assigned to amino acid sequence by the international nucleotide database.
It is indicated as 3 alphabet characters and 5 digits.
The number next to “.” indicates he version number of protein ID. If the amino acid sequence is updated, the version number goes up (the protein_id is NOT changed).
||The nucleotide sequence of CDS region is translated into amino acid sequence according to genetic code table 1.|
||The nucleotide sequence of CDS region is conceptually translated into one-letter abbreviated amino acid sequence (Amino Acid Codes), except setting the qualifierexception.
In the case of setting the qualifier pseudogene or pseudo, /translation is NOT indicated.
”//” is the terminal symbol of the entry.