• Entries from ENA and GenBank during a specific period are not being reflected in getentry

DDBJ Annotated/Assembled Sequences

  • Home
  • Submission
    • Before Submission
    • Web submission
    • Mass Submission
    • Data Update
  • Search
    • getentry
    • ARSA
  • Flat file
    • Feature Table
    • Feature key
    • Qualifier key
    • Nucleotide Sequences
    • Organism qualifier
    • Identifiers
    • Description of Location
    • Protein Coding Sequence
    • The Genetic Codes
    • Codes Used in Sequence Description
    • Description Examples of Sequence Data
  • Data categories
    • Data Submission from Genome Project
    • Pseudohaplotype
    • WGS
    • Finished level genomic sequences
    • Metagenome Assembly
    • Single amplified genome
    • HTG
    • Environmental sample
    • ENV
    • TLS
    • Data Submission from Transcriptome Project
    • TSA
    • EST
    • HTC
    • Third Party Data (TPA)
  • FAQ
  • Other
    • Patent
    • MGA
  • Home
  • ddbj
  • DDBJ flat file format

DDBJ flat file format

DDBJ (DNA Data Bank of Japan) shares annotated/assembled nucleotide sequence data as a member of INSDC (International Nucleotide Sequence Database Collaboration).
For the sharing purpose, DDBJ collects the nucleotide sequences experimentally determined, and constructs the database in accordance with the rule agreed with INSDC.

The database also includes the data from Japan Patent Office (JPO), European Patent Office (EPO), United States Patent and Trademark Office (USPTO), and Korean Intellectual Property Office (KIPO).

The database is a collection of “entry” which is the unit of the data.
The entry submitted to DDBJ is processed and publicized according to the DDBJ format for distribution (flat file).
The flat file includes the sequence and the information of submitters, references, source organisms, and “feature” information, etc.
The “feature” is defined by The DDBJ/ENA/GenBank Feature Table Definition to describe the biological nature such as gene function and other property of the nucleotide sequence.

The virtual sample of DDBJ flat file

LOCUS       AB000000              450 bp    mRNA    linear   HUM 01-JUN-2009
DEFINITION  Homo sapiens GAPD mRNA for glyceraldehyde-3-phosphate
            dehydrogenase, partial cds.
ACCESSION   AB000000
VERSION     AB000000.1
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 450)
  AUTHORS   Mishima,H. and Shizuoka,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (30-NOV-2008) to the DDBJ/EMBL/GenBank databases.
            Contact:Hanako Mishima
            National Institute of Genetics, DNA Data Bank of Japan; Yata 1111,
            Mishima, Shizuoka 411-8540, Japan
REFERENCE   2
  AUTHORS   Mishima,H., Shizuoka,T. and Fuji,I.
  TITLE     Glyceraldehyde-3-phosphate dehydrogenase expressed in human liver
  JOURNAL   Unpublished (2009)
COMMENT     Human cDNA sequencing project.
FEATURES             Location/Qualifiers
     source          1..450
                     /chromosome="12"
                     /clone="GT200015"
                     /collection_date="2007"
                     /db_xref="taxon:9606"
                     /geo_loc_name="Japan"
                     /map="12p13"
                     /mol_type="mRNA"
                     /organism="Homo sapiens"
                     /tissue_type="liver"
     CDS             86..>450
                     /codon_start=1
                     /gene="GAPD"
                     /product="glyceraldehyde-3-phosphate dehydrogenase"
                     /protein_id="BAA12345.1"
                     /transl_table=1
                     /translation="MAKIKIGINGFGRIGRLVARVALQSDDVELVAVNDPFITTDYMT
                     YMFKYDTVHGQWKHHEVKVKDSKTLLFGEKEVTVFGCRNPKEIPWGETSAEFVVEYTG
                     VFTDKDKAVAQLKGGAKKV"
BASE COUNT          102 a          119 c          131 g           98 t
ORIGIN
        1 cccacgcgtc cggtcgcatc gcacttgtag ctctcgaccc ccgcatctca tccctcctct
       61 cgcttagttc agatcgaaat cgcaaatggc gaagattaag atcgggatca atgggttcgg
      121 gaggatcggg aggctcgtgg ccagggtggc cctgcagagc gacgacgtcg agctcgtcgc
      181 cgtcaacgac cccttcatca ccaccgacta catgacatac atgttcaagt atgacactgt
      241 gcacggccag tggaagcatc atgaggttaa ggtgaaggac tccaagaccc ttctcttcgg
      301 tgagaaggag gtcaccgtgt tcggctgcag gaaccctaag gagatcccat ggggtgagac
      361 tagcgctgag tttgttgtgg agtacactgg tgttttcact gacaaggaca aggccgttgc
      421 tcaacttaag ggtggtgcta agaaggtctg
//

Flat file displays the information provided by submitters with DDBJ format.
Even when the sequences are similar, the contents on the flat files may vary according to the submitter’s research aim etc.
Please take that point into consideration when you refer search results.

FIELD COMMENTS

LOCUS

locus name, sequence length, molecule type, molecular form, division, the date of last release

Locus Name

Locus name is a unique ID of the entry in the database. In DDBJ, since July 1996, the locus name has been assigned the same asaccession number.

Length of Sequence

Notice: No information is available on the Master record of MGA data.

Molecule Type

According to the value of /mol_type qqualifier for source feature, it is described as DNA, RNA, mRNA, rRNA, tRNA, or cRNA.

Molecular Form

This column indicates whether molecular form of nucleotide sequence is “linear” or “circular”. If the entry is the full length of circular form, “circular” is appeared.

Division

DDBJ classifies entries into 21 divisions as below;

a: taxonomic divisions

HUM human
PRI primates (other than human)
ROD rodents
MAM mammals (other than primates and rodents)
VRT vertebrates (other than mammals)
INV invertebrates (animals other than vertebrates)
PLN plants, fungi, plastids (eukaryotes other than animals)
BCT bacteria (including both Eubacteria and Archaea)
VRL viruses
PHG bacteriophages

b: other divisions

PAT sequence data related to patent application
The data those which Japan Patent Office (JPO), United States Patent and Trademark Office (USPTO), European Patent Office (EPO), and Korean Intellectual Property Office (KIPO) collected, processed and released.
ENV sequences obtained via environmental sampling methods
SYN synthetic constructs; artificially constructed sequences
EST expressed sequence tags; short single pass cDNA sequences
TSA transcriptome shotgun assemblies; assembled mRNA sequences
GSS genome survey sequences; short single pass genomic sequences
HTC high throughput cDNA sequences;
The sequence submitted from cDNA sequencing projects except for EST. This division is to include unfinished high throughput cDNA sequences, each of which has 5’UTR and 3’UTR at both ends and part of a coding region. The sequence may also include introns. When the sequence becomes finished later, it moves to the corresponding taxonomic division.
HTG high throughput genomic sequences;
The sequence submitted mainly from genome sequencing projects which regarded a clone as a sequencing unit.
STS DDBJ currently terminated accepting new submissions.
sequence tagged sites
The tag site for genome sequencing. The information of chromosome, map, PCR_condition is necessary for this division.
UNA DDBJ currently terminated accepting new submissions.
the data not annotated
CON DDBJ currently terminated accepting new submissions.
Contig / Constructed
To conjugate a series of entries, such as those submitted from a genome project, each of the three data banks constructs an entry and assign an accession number to a large scale sequence dataset. Such entries are classified into the CON division. The entry in the CON division has the information of joined accession numbers instead of the sequence data. The corresponding entries of the CON entry have been submitted to other divisions.

The date of last release

The current publicized date is described. If the entry is updated and reopened to public site, this date will be changed.

DEFINITION

The definition briefly describes the information of gene(s). “DEFINITION” is constructed by each of the three data banks in accordance with standard rules in principle.
However, in the case of EST or GSS submission using Mass Submission System, DDBJ will sometimes ask submitters to construct “DEFINITION”.

[Sample]

Complete sequence of maize catalase coding gene
  DEFINITION  Zea mays Cat3 gene for catalase, complete cds.

Format: [organism name] [gene name] gene for [product name], complete cds.

  • organism name: The scientific name is indicated as the organism name, in principle.
  • gene name: the symbol of the gene
  • product name: the general name of product
  • complete cds: this coding sequence is complete
Partial sequence of human glyceraldehyde-3-phosphate dehydrogenase coding cDNA
  DEFINITION  Homo sapiens mRNA for glyceraldehyde-3-phosphate 
              dehydrogenase, partial cds.
Format: [organism name] mRNA for [product name], partial cds.
  • partial cds: this protein coding sequence is partial
  • The gene name is omitted, because the submitter did not describe.
Partial sequence of Bacillus 16S rRNA
  DEFINITION  Bacillus sp. AZ25 gene for 16S rRNA, partial 
              sequence.

Format: [organism name] [strain name] gene for [product name], partial sequence.

  • In cases of unidentified species, comparison of intraspecies, and so on, describe name of strain, isolate or some, as identifier.
  • partial sequence: this sequence is part of 16S rRNA.
Multiple CDS of rat mitochondrial DNA
  DEFINITION  Rattus norvegicus mitochondrial genes for cytochrome 
              c oxidase subunit II, ATPase subunit 6, cytochrome c 
              oxidase subunit III, partial and complete cds.
Format: [organism name] [gene name 1], [gene name 2], …. genes for [product name 1], [product name 2], ….. , partial and complete cds.
  • The gene names and/or product names are subsequently described from 5’to 3’ end.
  • “partial, complete and partial cds” is abbreviated to “partial and complete cds”.
  • If some genes have only gene names or product names, only gene name or product name is described principally.
  • If the “DEFINITION” is too long, some information, such as map position, is described instead of the gene or product names.
  • Sometimes gene cluster or operon name is described, if it is considered reasonable.
EST data of human liver 3’ end
  DEFINITION  Homo sapiens cDNA, clone:ABC123, 3' end, expressed 
              in liver.
Format: [organism name] cDNA, clone:[clone name], [other information].
  • The clone name is mandatory.
GSS data of mouse chromosome 1q
  DEFINITION  Mus musculus DNA, clone:1H11A14, 1q region.

Format: [organism name] DNA, clone:[clone name], [other information].

  • The clone name is mandatory.
TPA (Third Party Data) of human GAPD
  DEFINITION  TPA_exp: Homo sapiens GAPD mRNA forglyceraldehyde-3-phosphate 
              dehydrogenase, complete cds.
Format: [TPA header]: [organism name] [gene name] mRNA for [product name], complete cds.
  • In the case of TPA (Third Party data), either of “TPA_exp” (for TPA:experimental) or “TPA_inf” (for TPA:inferential) is described at the beginning of DEFINITION.

ACCESSION

This line shows accession number of the entry data.

Conventional sequence data
A unique accession number is issued to the data submitter by each of the three data banks. The accession number is composed of 1 alphabet character and 5 digits (ex. A12345) or 2 alphabet characters and 6 digits (ex. AB123456). The former style was used in 1980s, but later the latter style was introduced because of data explosion.
The alphabet part is called “prefix”. Please refer the prefix list.

If multiple entries are united to an entry, or if an entry is extensively modified after the submission, the responsible data banks may assign a new accession number to it. In these cases, the new accession number is called the primary accession number, and the old accession number(s) is/are called the secondary accession number(s). In the flat file, the primary accession number is indicated first, then the secondary accession number(s) follows. You can find the same updated entry with both the primary and the secondary accession numbers.

  ACCESSION   AB999999 AB888888 AB777777
AB999999 primary accession number
AB888888 AB777777 secondary accession number
Bulk sequence data; WGS, TSA, TLS
The accession number assigned to each entry of WGS, TSA and TLS data consists of 4 alphabet characters and 8 (sometimes 9 or 10, if necessary) digits.
The alphabet part is called prefix.
See also For Large Scale Data (four prefix).
Example:ZZZZ01000001
ZZZZ (4 letters) Prefix to distinguish each project, project_id
01 (2 digits) Version number of the data set, set_version
000001 (6 digits) ID of each individual sequence (It might be 7 or 8 digits depended on the number of entries.)

The set_version goes up for every update of the dataset. Example:ZZZZ02000001

  ACCESSION   ZZZZ01000001 ZZZZ01000000
ZZZZ01000001 primary accession number
ZZZZ01000000 set ID
For MGA data
This (ACEESSION) line shows a number assigned by INSDC to a resource.
The number is composed of 5 alphabet characters and 7 digits (ex. ABCDE0000001).An accession number assigned to an entry of a resource units is displayed in the MGA lines.
Example:ABCDE0000001
AB (first two characters) identi identifier to each project.
CDE (third to fifth characters) identifier to each of resources on each project.
0000001 (7 digit numeric numbers) number for each sequence entry in a resource.

*1 The information about each project id is avilable at the project_index page.
*2 “resource” here means a unit of identical origin, such as tissue, cells, from which sequence are obtained.

  ACCESSION   ZZZZZ0000000
ZZZZZ0000000 number to a resource unit

VERSION

This line consists of an accession number and a version number, like “AB123456.1”, in which the digit(s) after the period is a version number.

The data open to public for the first time is version number as “1”. The reason for adding VERSION is that since a released sequence sometimes revised by the submitter, the accession number alone cannot specify the sequence in question causing the user a trouble. The number is increased by one every time when a revised sequence is made public. And accession number will NOT be changed generally.

  VERSION      AB000000.1
AB000000 accession number
1 version number
For MGA data
This line consists of a number assigned to a resources unit in which the digit(s) after the period is a version number.
Since the sequence of an MGA entry is not allowed to update, the version number has to be “1”.
  VERSION    ZZZZZ0000000.1
ZZZZZ000000 number to a resource unit
1 version number

DBLINK

The DBLINK line is used to link other databases for BioProject, BioSample accession numbers, Sequence Read Archive Run accession numbers and so on.

DDBJ has replaced the PROJECT line by DBLINK line format since 2009 to expand for other data resources than projects.

DBLINK      BioProject:PRJDA12345
            BioSample:SAMD01234567
            Sequence Read Archive:DRR012345, DRR012346     
BioProject The name of linked database: BioProject Database
PRJDA12345 Linked ID in the database; BioProject accession number
BioSample The name of linked database: BioSample Database
SAMD01234567 Linked ID in the database; BioSample accession number
Sequence Read Archive The name of linked database: Sequence Read Archive (SRA)
DRR012345, DRR012346 Linked ID in the database; SRA Run accession numbers

KEYWORDS

The KEYWORDS lines were used for indexing (gene) and (product) names in the past.

For now, KEYWORDS lines are used to indicate the detail category of the data (EST, TSA, HTC, HTG, GSS, WGS, TPA etc) information about experimental method, “finishing level” of genome sequencing and else, if necessary. See also INSDC agreed methodological keywords.

SOURCE

This line shows the scientific name (and common name, if defined) on organism from which the sequence is obtained and an organelle type if the sequence is derived from an organelle other than the nucleus.

SOURCE      Homo sapiens (human)
Homo sapiens (human) The scientific name from which the sequence is obtained.

ORGANISM

The organism name and its phylogenic lineage from which the sequence is obtained are described.

The scientific name is indicated as the organism name in 1st line. If the sequence is obtained from an unidentified organism or artificially synthesized, the name registered on the Unified Taxonomy Database is described instead of scientific name.

The phylogenic lineage information based on the Unified Taxonomy Database is started from 2nd line.

  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
Homo sapiens The scientific name from which the sequence is obtained
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. The phylogenic lineage information of Homo sapiens

REFERENCE 1

The information of submitter(s) is described as REFERENCE 1 (except old entries and some CON entries).

In the case of Nucleotide Sequence Submission System, REFERENCE 1 is processed with the information entered on “Contact person” and “Submitter” pages. In the case of Mass Submission System, REFERENCE 1 is processed with the information entered in annotation file.

REFERENCE   1   (bases 1 to 450)

Notice: The portion, “(bases 1 to 450)”, is not available on the Master record of MGA data.

AUTHORS

Submitter(s) of the entry is/are indicated in principle. Submitter is responsible for the data and can update it.

  AUTHORS   Mishima,H. and Shizuoka,T.
Mishima,H. and Shizuoka,T The submitters of this entry

TITLE

“Direct Submission” is indicated to follow the standard form.

  TITLE     Direct Submission

JOURNAL

At first, “Accept Date” of the entry is indicated. “Accept Date” is defined as the date when DDBJ have received the acceptable data to assign accession number in principle. Even if the entry is updated, “Accept Date” is NOT changed. Then, the information about the address and the affiliation of “Contact Person” is indicated.

  JOURNAL   Submitted (30-NOV-2008) to the DDBJ/EMBL/GenBank databases.
            Contact:Hanako Mishima
            National Institute of Genetics, DNA Data Bank of Japan; Yata 1111,
            Mishima, Shizuoka 411-8540, Japan
Submitted (30-NOV-2008) to the DDBJ/EMBL/GenBank databases. Accept date of this entry is 30-NOV-2008
Contact:Hanako Mishima
National Institute of Genetics, DNA Data Bank of Japan; Yata 1111,
Mishima, Shizuoka 411-8540, Japan
The information about the address and the affiliation of Hanako Mishima.

E-mail address, phone & fax nos.

  • To follow the Japanese law of protecting personal information, DDBJ will delete both phone and fax numbers, and E-mail address from the flat files of the entries submitted to DDBJ.However, if you wish to disclose any of the three items, please contact us with contact form,?specifying the item(s) to be disclosed.
  • If you wish to contact the submitter(s) of an entry of your interest, please contact us via the contact form by selecting ‘Inquiry to the sequence submitters’ and briefly stating your reason (e.g., requesting the transfer of cloned sequences, etc.). We will then forward your message to the submitter(s).

Phone and fax numbers and E-mail address are deleted.

  JOURNAL   Submitted (30-NOV-2000) to the DDBJ/EMBL/GenBank databases.
            Contact:Hanako Mishima
            National Institute of Genetics, DNA Data Bank of Japan; Yata 1111,
            Mishima, Shizuoka 411-8540, Japan

When the submitters wish to keep their contact information disclosed, it will be described as,

  JOURNAL   Submitted (30-NOV-2000) to the DDBJ/EMBL/GenBank databases.
            Contact:Hanako Mishima
            National Institute of Genetics, DNA Data Bank of Japan; Yata 1111,
            Mishima, Shizuoka 411-8540, Japan
            E-mail :mishima@supernig.nig.ac.jp
            Phone  :81-55-981-6853
            Fax    :81-55-981-6849

REFERENCE 2

The information of references related to the submitted sequence is indicated on REFERENCE line (other than (REFERENCE 1). Since REFERENCE 2 indicates the publication status of the sequence, the reference which does not describe about the submitting sequence is indicated as REFERENCE 3 or after, not as REFERENCE 2.

When DDBJ notices a paper publication with an accession number, DDBJ will update the entry with the accession number, if necessary. During the process of the update, the prepublication paper(s) described in the line(s), REFERENCE 2 and/or later, will be revised without any notice to submitters, if applicable; i.e. When the submitted data, submitters’ affiliation, author names, title, and journal name of the prepublication paper, are enough reasonable to be revised.

In the cases of the manuscript in preparation, submitted for publication, in press, or published
  REFERENCE   2
    AUTHORS   Mishima,H., Shizuoka,T. and Fuji,I.
    TITLE     Glyceraldehyde-3-phosphate dehydrogenase expressed in human liver
    JOURNAL   Unpublished (2009)
AUTHORS The (presumptive) author(s) of the reference is/are described.
TITLE The (presumptive) title of the reference is described.
JOURNAL In the cases of the paper published or In Press, the journal name is described. In the case of unpublished manuscript, “Unpublished” is described to follow the standard form.
In the case of no schedule for publication except the international nucleotide database.
  REFERENCE   2
    AUTHORS   Mishima,H., Shizuoka,T. and Fuji,I.
    TITLE     Glyceraldehyde-3-phosphate dehydrogenase expressed in human liver
    JOURNAL   Published Only in Database(2009)
AUTHORS The author(s) of the submission entered by submitter(s) is/are described.
TITLE The title of the submission entered by submitter(s) is described.
JOURNAL “Published Only in Database” is indicated.
The parenthetic number is the year when the entry has been firstly publicized.

COMMENT

The information about an entry that can not be described using FEATURES or the other fields. For instance, if submitter has the other affiliation to REFERENCE 1, it can be described on COMMENT line.

  COMMENT     Human cDNA sequencing project.
Structured COMMENT
Structured COMMENT is a format to describe and to share some datasets undefined in feature/qualifier.
SUsing structured COMMENTs, datasets can be shared via flatfiles of INSDC in the community of submitters and users.
To describe structured COMMENT, the dataset is required to be describe in structured sets of [names of items] and [values of items] on COMMENT line.
There are some predetermined formats of structured COMMENTs that are required to submit some kinds of sequence data derived from genome projects (includingWGS, transcriptome projects (including TSA) and so on.
  COMMENT     ##Genome-Assembly-Data-START##
              Finishing Goal           :: Finished
              Current Finishing Status :: High Quality Draft
              Assembly Method          :: Newbler v. 2.3
              Genome Coverage          :: 30x
              Sequencing Technology    :: 454 GS Junior; Illumina GA II
              ##Genome-Assembly-Data-END##

:
The above example is an additional information, “Genome-Assembly-Data”, that is required for genome projects.
The contents between ##Genome-Assembly-Data-START## and ##Genome-Assembly-Data-END## are delimited item names and their values by “ :: “.

##Genome-Assembly-Data-START## The first line of the structured COMMENT defined as “Genome-Assembly-Data”.
##Genome-Assembly-Data-END## The last line of the structured COMMENT defined as “Genome-Assembly-Data”.
Finishing Goal :: Finished The final goal of the genome project is “Finished” level.
Current Finishing Status :: High Quality Draft The current status of the genome project is “High Quality Draft” level.
Assembly Method :: Newbler v. 2.3 The software to assemble reads of sequences is Newbler and its version is 2.3.
Genome Coverage :: 30x The sequencing depth of the genome sequences is approximately 30 fold.
Sequencing Technology :: 454 GS Junior; Illumina GA II 454 GS Junior; Illumina GA II – the platforms (sequencers) to determine the genome sequences are “454 GS Junior” and “Illumina GA II”.
For MGA data
For MGA Submission, the process for obtaining the submitted sequence data e.g.; (methods for preparing sequences from tissues or cells and processing the sequences for submission) is described.
  COMMENT     The CAGE (cap analysis gene expression) is based on preparation
              and sequencing of concatamers of DNA tags deriving from the
              initial 20/21 nucleotides from 5' end mRNAs.
              Full-length cDNAs were at first selected with the Cap-Trapper
              method. Then, a specific linker (Linker1, some linker contain 5 bp
              sequences that have 15 variations for each rna sample) containing
              the ClassIIs restriction enzyme site MmeI was then ligated to the
              single-strand cDNA and then the second strand of cDNA synthesized.
              The resulting double-stranded cDNA was cleaved by the restriction
              enzyme MmeI and a second linker (Linker2) was ligated to the 2 bp
              overhang at the MmeI cleaved site, to produce a 5' 20/21 tag
              having two linkers at both sides. The ligation products were
              separated from unmodified DNA with magnetic beads. The 5' end cDNA
              tags were released from the beads, and the DNA fragments were
              amplified in a PCR step by using the two linker-specific primers
              (Primer1 (uni-PCR), Primer2 (MmeI-PCR)). The desired 32-37 bp tags
              were purified and ligated to form concatamers, and then the
              concatamer were fractionated and ligated to the plasmid ZErO-2.
              The ligations were finally electroporated into DH10b cells
              (Invitrogen) and obtained plasmids were sequenced with forward
              primers.
              CAGE libraries were sequenced with forward primers essentially as
              described with minor modifications to use zeocin for selection of
              recombinants. We used in-house developed algorithms for the
              extraction of tags and for masking the vectors. CAGE tags were
              extracted with the following parameters: vector masking, minimum
              12 bp recognition allowed; linker (13 bp) masking: maximum
              mismatch, 2 bp allowed; XmaJI site maximum mismatch, 2 bp allowed;
              tag length, 17-24 bp.
              Linker1: "Upper oligonucleotide GN6":
              biotin-agagagagacctcgagtaactataacggtcctaaggtagcgacctagg (5 bp)
              tccgacGNNNNN and "Upper oligonucleotide N6":

FEATURES

Biological features of a submitted sequence data are described with “Feature” key (the biological nature of the annotated feature), “Location” (the region of the sequence which corresponds to Feature), and “Qualifier” (supplementary information about Feature). In principle, EST or GSS entries are not described with any features except the “source” key.

FEATURES are indicated on the basis of the information provided by submitter and modified by databanks to describe the appropriate annotation. The rules of feature description agreed with three databanks are explained at The DDBJ/ENA/GenBank Feature Table Definition in detail.

Feature keys are briefly classified into 3 groups;

  • group 1: biological source of the sequence (source)
    The feature, “source” (group 1) is mandatory for all entries in the international nucleotide database.
    The qualifiers “/organism” and “/mol_type” are mandatory for source feature.
  • group 2: biological function features of the region
    Feature keys in group 2 fall into families which are in some sense similar in function and which are annotated in a similar manner.A functional family may have a “generic” or miscellaneous key, which can be recognized by the ‘misc_’ prefix, that can used for instances not covered by the other defined keys of that group.
    e.g. CDS, rRNA, etc.
  • group 3: difference and/or change of the sequence data
    e.g. variation, conflict, etc.

One of the most frequently used feature key is “CDS” to describe coding sequence for protein. See also CDS feature page.

FEATURES             Location/Qualifiers
     source          1..450
                     /chromosome="12"
                     /clone="GT200015"
                     /collection_date="2007"
                     /db_xref="taxon:9606"
                     /geo_loc_name="Japan"
                     /map="12p13"
                     /mol_type="mRNA"
                     /organism="Homo sapiens"
                     /tissue_type="liver"
     CDS             86..>450
                     /codon_start=1
                     /gene="GAPD"
                     /product="glyceraldehyde-3-phosphate dehydrogenase"
                     /protein_id="BAA12345.1"
                     /transl_table=1
                     /translation="MAKIKIGINGFGRIGRLVARVALQSDDVELVAVNDPFITTDYMT
                     YMFKYDTVHGQWKHHEVKVKDSKTLLFGEKEVTVFGCRNPKEIPWGETSAEFVVEYTG
                     VFTDKDKAVAQLKGGAKKV"

source

Identifies the biological source of the specified span of the sequence.

source 1..450 The region from 1st to 450th base of the sequence is derived from the source described with following qualifiers.
/chromosome="12" The sequence is obtained from chromosome 12.
/clone="GT200015" The clone name which the sequence is obtained.
/collection_date="2007" The collection date of the sample.
/geo_loc_name="Japan" The collection site of the sample.
/map="12p13" The sequence is located on 12p13.
/db_xref="taxon:9606" The sequence is derived from a organism correspond to taxonomy database ID: 9606 (human).
/mol_type="mRNA" The sequence is derived from mRNA.
/organism="Homo sapiens" The sequence is obtained from human.
/tissue_type="liver" The sequence is obtained from liver.

CDS

Coding sequence; sequence of nucleotides that corresponds with the sequence of amino acids in a protein (location includes stop codon).

CDS 86..>450 The region from 86th to 450th base of the sequence is coding a protein described with following qualifiers.”>” means that 3’end is not completed for the region of CDS. The rule to describe “Location” is explained at Description of Location in detail.
/codon_start=1 The frame reading amino acid translation of the first codon is the 1st base of this region (86th base of the entry).
/gene="GAPD" gene symbol, see gene qualifier
/product="glyceraldehyde-3-phosphate dehydrogenase" product name, see product qualifier 
/protein_id="BAA12345.1" This is the ID assigned to amino acid sequence by the international nucleotide database.
It is indicated as 3 alphabet characters and 5 digits.
The number next to “.” indicates he version number of protein ID. If the amino acid sequence is updated, the version number goes up (the protein_id is NOT changed).
/transl_table=1 The nucleotide sequence of CDS region is translated into amino acid sequence according to genetic code table 1.
/translation="MAKIKIGINGF(syncopation)AVAQLKGGAKKV" The nucleotide sequence of CDS region is conceptually translated into one-letter abbreviated amino acid sequence (Amino Acid Codes), except setting the qualifierexception.
In the case of setting the qualifier pseudogene or pseudo, /translation is NOT indicated.

//

”//” is the terminal symbol of the entry.

Related pages

  • Description of Location
  • Protein Coding Sequence; CDS feature
  • Description Examples of Sequence Data
  • Codes Used in Sequence Description
  • Organism qualifier
  • Feature Key
  • Qualifier key
  • The Genetic Codes