One of the most frequently used feature keys is "CDS", which describes a protein-coding sequence. The location of a CDS feature indicates the base range(s) from the first base of the initiation codon to the last base of the termination codon. Based on information provided by the submitter, a CDS feature shows the amino acid translation derived with the codon table of the source organism (indicated by the transl_table qualifier), together with the reading frame (codon_start) and any translation exceptions (transl_except). When the pseudo or pseudogene qualifier is set, the translation is NOT shown.
Other qualifiers describe the product name and/or function of the corresponding protein, again based on information provided by the submitter. If the information has been confirmed experimentally, the experiment qualifier can be used. If it was predicted, for example from sequence homology, the inference qualifier can be used.
Since the international nucleotide sequence databases do not define criteria for similarity or homology, whether a gene is a homolog of another gene is judged entirely by the submitter. In principle, information about protein motifs and higher-order structure is NOT described in the flat file.
Gene nomenclature at DDBJ
DDBJ has no authority over gene nomenclature and has no official collaboration with any gene nomenclature committee. Unless there is a particular problem, descriptions related to gene nomenclature are recorded as provided by the submitter.
DDBJ also recommends using a comprehensible description for the product, because the value of the product qualifier is frequently used as summary information in the results of similarity searches and other retrieval systems.
For the convenience of users who refer to the database contents, DDBJ recommends describing gene and product names as follows:
How to describe a CDS feature when a termination codon is found within its range
If all of the above items are correct and a termination codon is still found within the range of the CDS feature, it should in principle be handled in one of the following ways.
Locations with "join" operators basically indicate splicing results
In general, the rules for describing the location of a CDS are the same as for all other features. See Description of Location for details.
In general, on a prokaryotic genome or a mature mRNA, the location of a CDS feature should be described as a simple span, so there is no need to use the "join" operator.
When a joined location is used for a CDS feature, it should indicate how the exons specified on the genomic sequence are connected to each other during mRNA maturation. Basically, we cannot accept a CDS feature with a joined location for any purpose other than indicating the splicing pattern of the CDS, except in the following cases:
- On a circular genome sequence, to indicate that the feature spans the junction between the end and the start of the sequence.
- For viruses and some others, to indicate that ribosomal slippage occurs during translation.
- CDS locations deliberately adjusted to avoid frameshift errors in draft sequences from genome or transcriptome projects, flagged with the artificial_location qualifier.
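As an illustration of how a joined location connects exons, the following minimal Python sketch extracts the spliced coding sequence from a genomic sequence. The function names, the `genomic` sequence and its location are hypothetical and made up for this example. The parser handles only a plain span or a join of plain spans and ignores partial markers; it does not cover the full location syntax.

```python
import re


def parse_location(location):
    """Return a list of (start, end) spans, 1-based and inclusive.

    Assumption: only "a..b" or "join(a..b,c..d,...)" is supported.
    The partial markers "<" and ">" are accepted but ignored.
    """
    inner = location.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[len("join("):-1]
    spans = []
    for part in inner.split(","):
        match = re.fullmatch(r"\s*<?(\d+)\.\.>?(\d+)\s*", part)
        if match is None:
            raise ValueError(f"unsupported location: {part!r}")
        spans.append((int(match.group(1)), int(match.group(2))))
    return spans


def spliced_sequence(sequence, location):
    """Concatenate the spans of a location in the order they are listed."""
    return "".join(sequence[start - 1:end]
                   for start, end in parse_location(location))


# Hypothetical genomic sequence: two exons (upper case) separated by an intron.
genomic = "ccATGGCGgtaagtttcagAAGTAAtt"
print(spliced_sequence(genomic, "join(3..8,20..25)"))
```

Here the two spans are joined in the listed order, which corresponds to the exon order in the mature mRNA.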
Translated amino acid sequence described at translation qualifier
For example, on the page Explanation of DDBJ flat file format, the amino acid sequence given as the value of the translation qualifier is derived from the nucleotide sequence by using the following items:
CDS             86..>450
                /codon_start=1
                /transl_table=1
|CDS 86..>450|The region from the 86th to the 450th base of the sequence codes for a protein described by the following qualifiers. ">" means that the 3' end of the CDS region is incomplete. See Description of Location for details.|
|/codon_start=1|The reading frame for amino acid translation starts at the 1st base of the above CDS location (the 86th base of the entry).|
|/transl_table=1|The nucleotide sequence of the CDS region is translated into an amino acid sequence according to table No. 1 of The Genetic Codes.|
 86 atg gcg aag att aag atc ggg atc aat ggg ttc ggg agg atc ggg
aa:  M   A   K   I   K   I   G   I   N   G   F   G   R   I   G
131 agg ctc gtg gcc agg gtg gcc ctg cag agc gac gac gtc gag ctc
aa:  R   L   V   A   R   V   A   L   Q   S   D   D   V   E   L
176 gtc gcc gtc aac gac ccc ttc atc acc acc gac tac atg aca tac
aa:  V   A   V   N   D   P   F   I   T   T   D   Y   M   T   Y
221 atg ttc aag tat gac act gtg cac ggc cag tgg aag cat cat gag
aa:  M   F   K   Y   D   T   V   H   G   Q   W   K   H   H   E
266 gtt aag gtg aag gac tcc aag acc ctt ctc ttc ggt gag aag gag
aa:  V   K   V   K   D   S   K   T   L   L   F   G   E   K   E
311 gtc acc gtg ttc ggc tgc agg aac cct aag gag atc cca tgg ggt
aa:  V   T   V   F   G   C   R   N   P   K   E   I   P   W   G
356 gag act agc gct gag ttt gtt gtg gag tac act ggt gtt ttc act
aa:  E   T   S   A   E   F   V   V   E   Y   T   G   V   F   T
401 gac aag gac aag gcc gtt gct caa ctt aag ggt ggt gct aag aag
aa:  D   K   D   K   A   V   A   Q   L   K   G   G   A   K   K
446 gtc tg
aa:  V   ?
/translation="MAKIKIGINGFGRIGRLVARVALQSDDVELVAVNDPFITTDYMT
YMFKYDTVHGQWKHHEVKVKDSKTLLFGEKEVTVFGCRNPKEIPWGETSAEFVVEYTG
VFTDKDKAVAQLKGGAKKV"
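The translation above can be reproduced with a short Python sketch. It assumes the standard genetic code (transl_table=1) written in the usual compact "TCAG" order. The function name `translate_cds` is made up for this illustration and is not part of any DDBJ tool. The sketch ignores transl_except, stops at the first stop codon and drops a trailing incomplete codon, as in the partial codon "tg" at bases 449..450.

```python
from itertools import product

# Standard genetic code (transl_table=1), codons enumerated in "tcag" order.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_cds(cds, codon_start=1):
    """Translate a CDS, reading from the base given by codon_start (1, 2 or 3).

    Translation stops at the first stop codon. A trailing incomplete codon
    is ignored, and a codon with ambiguous bases is reported as "X".
    """
    cds = cds.lower()
    protein = []
    for i in range(codon_start - 1, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Bases 86..450 of the example entry, copied from the codon listing above.
cds = "".join("""
atg gcg aag att aag atc ggg atc aat ggg ttc ggg agg atc ggg
agg ctc gtg gcc agg gtg gcc ctg cag agc gac gac gtc gag ctc
gtc gcc gtc aac gac ccc ttc atc acc acc gac tac atg aca tac
atg ttc aag tat gac act gtg cac ggc cag tgg aag cat cat gag
gtt aag gtg aag gac tcc aag acc ctt ctc ttc ggt gag aag gag
gtc acc gtg ttc ggc tgc agg aac cct aag gag atc cca tgg ggt
gag act agc gct gag ttt gtt gtg gag tac act ggt gtt ttc act
gac aag gac aag gcc gtt gct caa ctt aag ggt ggt gct aag aag
gtc tg
""".split())

print(translate_cds(cds, codon_start=1))
```

The printed string should match the value of the translation qualifier once the line breaks of the flat file are removed.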
Offset of the frame at translation initiation by codon_start
The codon_start qualifier indicates the offset at which the first complete codon of a CDS feature can be found, relative to the first base of that feature.
When the location of a CDS feature starts at the initiation codon, the value of codon_start is always 1.
If the location of a CDS feature does not start at the initiation codon, codon_start must be specified as 1, 2, or 3, as appropriate. Even for the same nucleotide sequence, the translated amino acid sequence differs depending on the value of codon_start, as follows.
base position (tens place)           11111111112222222
base position (ones place)  12345678901234567890123456
nucleotide sequence         ttcggctgcagaagataaataaataa
translation, case 1         F  G  C  R  R  *
translation, case 2          S  A  A  E  D  K  *
translation, case 3           R  L  Q  K  I  N  K  *
CDS             <1..18
                /codon_start=1
                /transl_table=1
                /translation="FGCRR"
CDS             <1..22
                /codon_start=2
                /transl_table=1
                /translation="SAAEDK"
CDS             <1..26
                /codon_start=3
                /transl_table=1
                /translation="RLQKINK"
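The three cases can be checked with a small Python sketch that reads the same nucleotide sequence in each frame and reports where the first in-frame termination codon ends. It assumes the standard genetic code (transl_table=1). The helper name `read_frame` is hypothetical and used only for this illustration.

```python
from itertools import product

# Standard genetic code (transl_table=1), codons enumerated in "tcag" order.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def read_frame(sequence, codon_start):
    """Return (translation, end) for the frame selected by codon_start.

    end is the 1-based position of the last base of the first in-frame
    stop codon, or None if the frame contains no stop codon.
    """
    protein = []
    for i in range(codon_start - 1, len(sequence) - 2, 3):
        aa = CODON_TABLE[sequence[i:i + 3]]
        if aa == "*":
            return "".join(protein), i + 3
        protein.append(aa)
    return "".join(protein), None


sequence = "ttcggctgcagaagataaataaataa"
for codon_start in (1, 2, 3):
    translation, end = read_frame(sequence, codon_start)
    print(f'CDS <1..{end} /codon_start={codon_start} '
          f'/translation="{translation}"')
```

Each printed line corresponds to one of the three CDS features above: the end of the location moves with the frame because the first in-frame termination codon is found at a different position.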