DDBJ Annotated/Assembled Sequences

  • Home
  • Submission
    • Before Submission
    • Web submission
    • Mass Submission
    • Data Update
  • Search
    • getentry
    • ARSA
  • Flat file
    • Feature key
    • Qualifier key
    • Nucleotide Sequences
    • Organism qualifier
    • Identifiers
    • Description of Location
    • Protein Coding Sequence
    • The Genetic Codes
    • Codes Used in Sequence Description
    • Example of Submission
  • Data categories
    • Data Submission from Genome Project
    • Pseudohaplotype
    • WGS
    • Finished level genomic sequences
    • Metagenome Assembly
    • Single amplified genome
    • CON
    • GSS
    • HTG
    • Submission of environmental sequences
    • ENV
    • TLS
    • Data Submission from Transcriptome Project
    • TSA
    • EST
    • HTC
    • Third Party Data (TPA)
  • FAQ
  • Other
    • Patent
    • MGA
  • Home
  • ddbj
  • Guideline to use /locus_tag qualifier

Guideline to use /locus_tag qualifier

Proper Use of /locus_tag in Genome Submissions

At the International Nucleotide Sequence Database Collaborators meeting, it was agreed that we would require genome projects to be registered with the database. Each genome project would be assigned an ID in order to allow us to associate multiple sequences of a single genome project with each other. This Genome Project ID will appear in a new line type below ACCESSION and VERSION in the flat file. Registration of Genome Projects can be done at DDBJ, EBI or NCBI.

For genome sequence submissions, DDBJ provides Mass Submission System (MSS). See also Data Submission from Genome Project. You can specify /locus_tag prefix for your genome data through BioSample submission.

Locus_tags are identifiers that are systematically applied to every gene in a genome. These tags have become surrogate gene names by the biological community. If two submitters of two different genomes use the same systematic names to describe two very different genes in two very different genomes, it can be very confusing. In order to prevent this from happening INSD has created a registry of locus_tag prefixes. Submitters of eukaryotic and prokaryotic genomes should register their prefix prior to submitting their genome. All components of a project (such as multiple chromosomes or plasmids, etc) should use the same locus_tag prefix.

The locus_tag prefix can contain only alpha-numeric characters and it must be between 3 and 12 characters long inclusive. It should start with a letter, but numerals can be in the 2nd position or later in the string. (ex. A1C). There should be no symbols, such as -* in the prefix. The locus_tag prefix is to be separated from the tag value by an underscore ‘’, eg A1C_00001.

Locus_tags should be assigned to all protein coding and non-coding genes such as structural RNAs. /locus_tag should appear on gene, mRNA, CDS, 5’UTR, 3’UTR, intron, exon, tRNA, rRNA, ncRNA, misc_RNA, etc within a genome project submission. repeat_regions do not have locus_tag qualifiers. The same locus_tag should be used for all components of a single gene. For example, all of the exons, CDS, mRNA and gene features for a particular gene would have the same locus_tag. There should only be one locus_tag associated with one /gene, i.e. if a /locus_tag is associated with a /gene symbol in any feature, that gene symbols (and only that /gene symbol) must also be present on every other feature that contains that locus_tag.

Locus_tags are systematically added to genes within a genome. They are generally in sequential order on the genome. If a genome center were to update a genome and provide additional annotation, the new genes could either [1] be assigned the next sequential available locus_tag or [2] the submitter can leave gaps when initially assigning locus_tags and fill in new annotation with tag values that are between the gaps.

Use: Incremental locus_tags

       Original          Revised
        submission        submission
        ABC_0022
                          ABC_4568 (new gene)
        ABC_0023          ABC_0023

OR: Gaps in original locus_tags

       Original          Revised
        submission        submission
        ABC_0020          ABC_0020
                          ABC_0021 (new gene)
        ABC_0030          ABC_0030

BUT NOT:Decimal integers

       Original          Revised
        submission        submission
        ABC_0020          ABC_0020
                          ABC_0020.1 (new gene)
        ABC_0030          ABC_0030

It is preferable to use the same numbering convention for all locus_tags within a project no matter whether the gene is a protein coding gene or structural RNA or from one chromosome or another.

However, submitters wishing to encode information about chromosome number, or RNA type in the locus_tag value, may add this information to the /locus_tag after the prefix and underscore:

        ABC_I00001 for gene 1, chromosome I
        ABC_II00001 for gene 1, chromosome II
        ABC_r1112 for ribosomal RNA genes
        ABC_t1113 for tRNA genes

A submitter can register for a locus_tag prefix and BioProject/BioSample at NCBI , EBI or DDBJ.

For genome sequence submissions, DDBJ provides Mass Submission System (MSS). See also Data Submission from Genome Project. You can specify /locus_tag prefix for your genome data through BioSample submission.

You can find the same guideline at NCBI.