One of the most frequently used feature keys is "CDS", which describes a protein-coding sequence. The location of a CDS feature generally indicates the base range(s) from the first base of the initiation codon to the last base of the termination codon. Based on the information provided by the submitter, a CDS feature also shows the amino acid translation, derived using the genetic code table of the source organism (specified by the transl_table qualifier) together with the reading frame given by codon_start and any exceptions given by transl_except. When the pseudo or pseudogene qualifier is set, the translation is NOT shown.
Other qualifiers describe the product name and/or function of the corresponding protein, again based on the information provided by the submitter. If the information has been confirmed experimentally, the experiment qualifier can be used. If it was predicted, for example from sequence homology, the inference qualifier can be used.
Since the international nucleotide sequence databases do not define criteria for similarity or homology, whether a gene is a homolog of other gene(s) is judged entirely by the submitter. In principle, information about protein motifs and higher-order structure is NOT described in the flat file.
DDBJ has no authority over gene nomenclature, and has no official collaboration with any gene nomenclature committee. Unless there is a particular problem, descriptions related to gene nomenclature are recorded as provided by the submitter.
DDBJ recommends describing the 'symbolic ID of locus' in the gene qualifier and the name of the protein product in the product qualifier.
DDBJ also recommends using an easily understood description for product, because the value of the product qualifier is frequently used as summary information in the result logs of similarity searches and other retrieval systems.
DDBJ's policies for the description of gene and product are as follows, though they are not binding on submitters:
For the convenience of users who refer to the database contents, DDBJ recommends describing the names of gene and product as follows:
Although we recognize that in many cases the gene nomenclature of some model organisms does not fit the above rule, we recommend it because we wish to make the contents of DDBJ/EMBL/GenBank as useful as possible.
Please do not hesitate to contact us if you would like to update the protein information in your entry after it has been submitted to DDBJ. See also the page, Data Updates/Correction: after getting your accession number, when you like to update your data.
When you find termination codon(s) within the range of a presumed CDS feature, first confirm that the following items are specified appropriately.
If all of the above items are correct and a termination codon is still found within the range of the CDS feature, it should, in principle, be handled in one of the following ways.
In principle, use this solution when the sequencing accuracy is low.
Describe with misc_feature, not CDS.
This is because it is not certain whether the corresponding protein exists.
Describe the referred information in the inference qualifier.
Describe a short explanation in the note qualifier, such as "putative frameshift mutation", "Ig rearrangement", "TCR beta rearrangement", or similar.
If you have not yet confirmed any supporting evidence that identifies a pseudogene (e.g. the relationship to orthologues and paralogues in other species, or the absence of any corresponding transcript), you should not call it a pseudogene.
Describe the original CDS location with the pseudogene qualifier.
When you use the pseudogene qualifier, no translation is described for the CDS feature, because the corresponding protein would not exist in vivo.
See also Example of Submission B06.
Describe the CDS location corresponding to the truncated protein.
That is, the CDS location should be shortened.
Adjust the CDS location with the "join" operator at the point of ribosomal slippage.
In the case of a +1 frameshift at the 90th base
CDS join(21..90,90..449)
In the case of a -1 frameshift at the 91st base
CDS join(21..90,92..451)
Then, add the ribosomal_slippage qualifier as a flag to indicate that the adjustment is legitimate.
After this adjustment of the location, the amino acid sequence in the translation qualifier is the conceptually translated one.
See also Example of Submission B10.
When submitting via the Nucleotide Sequence Submission System, please use the "Submission Information" box to describe the ribosomal slippage in detail.
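To make the effect of the adjusted locations concrete, the following is a minimal Python sketch of how a joined location selects bases. The helper names parse_join and extract are hypothetical and are not part of any DDBJ tool; the parser only understands plain forward-strand spans and ignores operators such as complement. The ten-letter string is a placeholder used so that repeated and skipped positions are easy to see.

```python
import re

def parse_join(location):
    """Return 1-based inclusive (start, end) pairs from 'a..b' or 'join(a..b,c..d)'.

    Assumption: forward strand only; '<' and '>' partial markers are ignored.
    """
    spans = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    return [(int(start), int(end)) for start, end in spans]

def extract(seq, location):
    """Concatenate the bases selected by each span, in the order written."""
    return "".join(seq[start - 1:end] for start, end in parse_join(location))

# The two locations from the text above.
print(parse_join("join(21..90,90..449)"))  # [(21, 90), (90, 449)]: base 90 is read twice
print(parse_join("join(21..90,92..451)"))  # [(21, 90), (92, 451)]: base 91 is skipped

# Placeholder sequence: letters stand in for bases 1..10.
toy = "abcdefghij"
print(extract(toy, "join(1..4,4..6)"))  # abcddef (position 4 repeated)
print(extract(toy, "join(1..4,6..8)"))  # abcdfgh (position 5 skipped)
```

The concatenated string is what a conceptual translation would then read codon by codon.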
Basically, this is used for genome annotation.
Describe the original CDS location with the exception qualifier.
When the exception qualifier is described, the amino acid sequence for the translation qualifier can be provided by the submitter. You can therefore use an amino acid sequence confirmed by other means, for example from cDNA, for the CDS feature on the genomic sequence.
Describe the referred information in the inference qualifier.
See also Example of Submission B09.
When submitting via the Nucleotide Sequence Submission System, please use the "Submission Information" box to describe the translational exceptions in detail.
Describe the original CDS location with the transl_except qualifier.
For example;
/transl_except=(pos:213..215,aa:Sec)
# To use "U" (one letter abbreviation for selenocysteine) for amino acid translation
/transl_except=(pos:213..215,aa:Pyl)
# To use "O" (one letter abbreviation for pyrrolysine) for amino acid translation
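As an illustration of what such a qualifier does to the translation, here is a small Python sketch. It assumes the standard genetic code (transl_table=1), a forward-strand CDS given as a simple span, and handles only the Sec and Pyl values shown above. The function name and the 12-base sequence are hypothetical and chosen for brevity; they are not taken from a real entry.

```python
import re

# Standard genetic code (transl_table=1), codons ordered t, c, a, g.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# One-letter codes for the two residues used in the examples above.
ONE_LETTER = {"Sec": "U", "Pyl": "O"}
TRANSL_EXCEPT = re.compile(r"\(pos:(\d+)\.\.(\d+),aa:(\w+)\)")

def translate_with_exceptions(seq, cds_start, cds_end, exceptions):
    """Translate a simple-span CDS, overriding codons named in transl_except values."""
    overrides = {}
    for text in exceptions:
        match = TRANSL_EXCEPT.search(text)
        if match is None:
            raise ValueError("unrecognized transl_except value: " + text)
        overrides[int(match.group(1))] = ONE_LETTER[match.group(3)]
    protein = []
    # pos is the 1-based position of the first base of each codon.
    for pos in range(cds_start, cds_end - 1, 3):
        codon = seq[pos - 1:pos + 2].lower()
        residue = overrides.get(pos, CODON_TABLE[codon])
        if residue == "*":
            break  # the termination codon is not part of the translation
        protein.append(residue)
    return "".join(protein)

seq = "atgtgaaaataa"  # hypothetical CDS 1..12 with an in-frame tga at 4..6
print(translate_with_exceptions(seq, 1, 12, []))                                   # M
print(translate_with_exceptions(seq, 1, 12, ["/transl_except=(pos:4..6,aa:Sec)"]))  # MUK
```

Without the qualifier the in-frame tga ends the translation after the first residue; with it, the codon is read as selenocysteine and translation continues to the real termination codon.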
Describe with misc_feature, not CDS.
To avoid the frameshift point, adjust the CDS location artificially with the "join" operator so that conceptual translation yields the amino acid sequence.
Add the artificial_location qualifier as a flag to indicate the artificial adjustment.
The artificial_location qualifier may not be used for submissions via the Nucleotide Sequence Submission System.
In general, the rules for describing the location of a CDS are the same as for all other features.
See Description of Location for details.
In general, on a prokaryotic genome or a mature mRNA, the location of a CDS feature should be described as a simple span, so there is no need to use the "join" operator.
When a joined location is used for a CDS feature, it should indicate how the exons in the genomic sequence are joined to each other during mRNA maturation. Basically, we cannot accept a CDS feature with a joined location for any purpose other than indicating the splicing pattern of the CDS.
However, there are three major exceptions, as follows:
For example, on the page Explanation of DDBJ flat file format, the amino acid sequence given as the value of the translation qualifier is derived from the nucleotide sequence using the following items:
That is:
CDS 86..>450
/codon_start=1
/transl_table=1
According to the above items, the region from the 86th to the 450th base of the nucleotide sequence is translated into the amino acid sequence shown below in one-letter abbreviations:
86 atg gcg aag att aag atc ggg atc aat ggg ttc ggg agg atc ggg
aa: M A K I K I G I N G F G R I G
131 agg ctc gtg gcc agg gtg gcc ctg cag agc gac gac gtc gag ctc
aa: R L V A R V A L Q S D D V E L
176 gtc gcc gtc aac gac ccc ttc atc acc acc gac tac atg aca tac
aa: V A V N D P F I T T D Y M T Y
221 atg ttc aag tat gac act gtg cac ggc cag tgg aag cat cat gag
aa: M F K Y D T V H G Q W K H H E
266 gtt aag gtg aag gac tcc aag acc ctt ctc ttc ggt gag aag gag
aa: V K V K D S K T L L F G E K E
311 gtc acc gtg ttc ggc tgc agg aac cct aag gag atc cca tgg ggt
aa: V T V F G C R N P K E I P W G
356 gag act agc gct gag ttt gtt gtg gag tac act ggt gtt ttc act
aa: E T S A E F V V E Y T G V F T
401 gac aag gac aag gcc gtt gct caa ctt aag ggt ggt gct aag aag
aa: D K D K A V A Q L K G G A K K
446 gtc tg
aa: V ?
The last two bases, "tg", cannot be assigned to any single one of C (cysteine), W (tryptophan), or * (termination codon), so no amino acid is described for them.
Finally, the translated amino acid sequence is described as the value of the translation qualifier, as below.
/translation="MAKIKIGINGFGRIGRLVARVALQSDDVELVAVNDPFITTDYMT
YMFKYDTVHGQWKHHEVKVKDSKTLLFGEKEVTVFGCRNPKEIPWGETSAEFVVEYTG
VFTDKDKAVAQLKGGAKKV"
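The handling of the trailing incomplete codon can be sketched in Python as follows. This is only an illustration under the assumption of the standard genetic code (transl_table=1); the function name is hypothetical, and the input is just the last 50 bases (positions 401 to 450) of the example above rather than the full sequence.

```python
# Standard genetic code (transl_table=1), codons ordered t, c, a, g.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_partial(region):
    """Translate complete codons only; return (protein, leftover_bases)."""
    usable = len(region) - len(region) % 3  # length of the whole-codon part
    protein = []
    for i in range(0, usable, 3):
        residue = CODON_TABLE[region[i:i + 3].lower()]
        if residue == "*":
            break  # a termination codon ends the translation
        protein.append(residue)
    return "".join(protein), region[usable:]

# Bases 401..450 of the example: 16 complete codons followed by "tg".
tail = "gacaaggacaaggccgttgctcaacttaagggtggtgctaagaaggtctg"
protein, leftover = translate_partial(tail)
print(protein)   # DKDKAVAQLKGGAKKV
print(leftover)  # tg
```

The output matches the end of the translation value shown above, and the two leftover bases are simply not translated.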
The codon_start qualifier indicates the offset at which the first complete codon of a CDS feature can be found, relative to the first base of that feature.
When the location of a CDS feature starts at the initiation codon, the value of codon_start is always 1.
If the location of a CDS feature does not start at the initiation codon, codon_start must be set to 1, 2, or 3, as appropriate.
Even when the nucleotide sequence is the same, the translated amino acid sequence differs depending on the value of codon_start, as follows.
the number of base position: tens place           11111111112222222
the number of base position: ones place  12345678901234567890123456
nucleotide sequence                      ttcggctgcagaagataaataaataa
translated amino acid sequence, case 1   F  G  C  R  R  *
translated amino acid sequence, case 2    S  A  A  E  D  K  *
translated amino acid sequence, case 3     R  L  Q  K  I  N  K  *
case 1
CDS <1..18
/codon_start=1
/transl_table=1
/translation="FGCRR"
case 2
CDS <1..22
/codon_start=2
/transl_table=1
/translation="SAAEDK"
case 3
CDS <1..26
/codon_start=3
/transl_table=1
/translation="RLQKINK"
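The three cases can be reproduced with a short Python sketch. It assumes the standard genetic code (transl_table=1) and a forward-strand simple span; the function name translate_cds is hypothetical and is not a DDBJ tool.

```python
# Standard genetic code (transl_table=1), codons ordered t, c, a, g.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_cds(seq, start, end, codon_start=1):
    """Translate bases start..end (1-based, inclusive) beginning at the codon_start offset."""
    if codon_start not in (1, 2, 3):
        raise ValueError("codon_start must be 1, 2, or 3")
    # Skip the 0, 1, or 2 bases that precede the first complete codon.
    region = seq[start - 1:end][codon_start - 1:].lower()
    protein = []
    for i in range(0, len(region) - 2, 3):
        residue = CODON_TABLE[region[i:i + 3]]
        if residue == "*":
            break  # the termination codon is not included in the translation
        protein.append(residue)
    return "".join(protein)

seq = "ttcggctgcagaagataaataaataa"
print(translate_cds(seq, 1, 18, codon_start=1))  # FGCRR   (case 1)
print(translate_cds(seq, 1, 22, codon_start=2))  # SAAEDK  (case 2)
print(translate_cds(seq, 1, 26, codon_start=3))  # RLQKINK (case 3)
```

Each call uses the location and codon_start of the corresponding case, and the result equals the translation value listed for it.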