Sequence File

The sequence file is a text file in a FASTA-like format that contains all of the nucleotide sequences.
In the sequence file, each entry consists of one header line starting with ">", followed by the sequence on the second and subsequent lines. You must insert the end flag (//) at the end of each sequence.

Example: Sequence File

>CLN01  <-- Entry name for the first one
ggacaggctgccgcaggagccaggccgggagcaggaagaggcttcgggggagccggagaa
ctgggccagatgcgcttcgtgggcgaagcctgaggaaaaagagagtgaggcaggagaatc
gcttgaaccccggaggcggaaccgcactccagcctgggcgacagagtgagactta
//      <-- End flag
>CLN02  <-- Entry name for the second one
ctcacacagatgcgcgcacaccagtggttgtaacagaagcctgaggtgcgctcgtggtca
gaagagggcatgcgcttcagtcgtgggcgaagcctgaggaaaaaatagtcattcatataa
atttgaacacacctgctgtggctgtaactctgagatgtgctaaataaaccctctt
//      <-- End flag

Format and Syntax

The format of the sequence file must be validated with UME or Parser.

  • The first line of each sequence starts with [>], followed by the Entry name.
  • Entry names must be unique in the sequence file.
    It is common to use a clone name or an isolate name as the unique Entry name.
  • An Entry name must be shorter than 32 characters and must not contain [space], " [double-quote], ? [question mark] or \ [back-slash].
  • The names and the order of the entries must match between the sequence file and the annotation file.
    Accession numbers will be assigned in the order of the entries.
  • The sequence file must not contain any spaces or blank lines.
  • In addition to a, t, g and c, you can use the other characters listed in Nucleotide base codes for your nucleotide sequences, if necessary.
  • In principle, please remove any 'n' base codes located at the 5' or 3' end of sequences. For EST submissions in particular, please do not send the raw output of a sequencer; screen your sequences to remove unreliable output, which is often located at the 5' end.
  • Remove sequences derived from vectors, linkers or adaptors.
    If you would like to submit an artificially constructed sequence itself, such as an expression vector, you do not have to remove them.
  • Please be sure to enter the end flag [//] at the end of each sequence.
  • For CON entries, an AGP file can be used as a substitute for the sequence file.
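
The rules above can be checked mechanically before running UME or Parser. The following is only a minimal sketch, not the official validator: the function name check_sequence_file is hypothetical, and the set of accepted base codes is an assumption (the IUPAC nucleotide codes), since the text only refers to "Nucleotide base codes".

```python
FORBIDDEN_IN_NAME = set(' "?\\')      # space, double quote, question mark, backslash
BASE_CODES = set("acgtmrwsykvhdbn")   # assumption: IUPAC nucleotide codes


def check_sequence_file(text):
    """Return a list of (line_number, message) for rule violations in text."""
    problems = []
    seen_names = set()
    in_entry = False  # True between a ">" header and its "//" end flag
    lines = text.splitlines()
    for number, line in enumerate(lines, start=1):
        if line == "" or " " in line:
            problems.append((number, "space or blank line"))
            continue
        if line.startswith(">"):
            if in_entry:
                problems.append((number, "missing end flag before this entry"))
            name = line[1:]
            if not 0 < len(name) < 32:
                problems.append((number, "entry name must be 1 to 31 characters"))
            if FORBIDDEN_IN_NAME & set(name):
                problems.append((number, "forbidden character in entry name"))
            if name in seen_names:
                problems.append((number, "duplicate entry name"))
            seen_names.add(name)
            in_entry = True
        elif line == "//":
            if not in_entry:
                problems.append((number, "end flag without an entry"))
            in_entry = False
        elif not in_entry:
            problems.append((number, "sequence line outside an entry"))
        elif set(line.lower()) - BASE_CODES:
            problems.append((number, "unexpected base code"))
    if in_entry:
        problems.append((len(lines), "missing end flag at end of file"))
    return problems
```

An empty list means that none of the checked rules were violated; it does not replace validation with UME or Parser.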

Annotation File

The annotation file is a tab-delimited text file consisting of five columns (Entry, Feature, Location, Qualifier, and Value) that holds your data other than sequences, such as submitters, references and biological features.
You can make the file with scripts, spreadsheets (such as MS Excel), text editors and so on.

Example: Annotation file
Entry   Feature    Location                           Qualifier    Value
COMMON  SUBMITTER                                     ab_name      Robertson,G.R.
                                                      ab_name      Mishima,H.
                                                      contact      Hanako Mishima
                                                      email        mishima@ddbj.nig.ac.jp
                                                      phone        81-55-981-6853
                                                      fax          81-55-981-6853
                                                      phext        3207
                                                      institute    National Institute of Genetics
                                                      department   DNA Data Bank of Japan
                                                      country      Japan
                                                      state        Shizuoka
                                                      city         Mishima
                                                      street       Yata 1111
                                                      zip          411-8540
        REFERENCE                                     title        Mouse Genome Sequencing
                                                      ab_name      Robertson,G.R.
                                                      ab_name      Mishima,H.
                                                      year         2012
                                                      status       Unpublished
        COMMENT                                       line         Please visit our website
                                                      line         URL: http://www.ddbj.nig.ac.jp/
CLN01   source     1..12297                           organism     Mus musculus
                                                      mol_type     genomic DNA
                                                      clone        PC0110
                                                      chromosome   8
        CDS        join(<1..456,609..879,1070..1213)  product      protein kinase
                                                      codon_start  2
CLN02   source     1..12393                           organism     Mus musculus
                                                      mol_type     genomic DNA
                                                      clone        PC0210
                                                      chromosome   8
        CDS        9365..9640                         product      hypothetical protein

Format and Syntax

The format of the annotation file must be validated with UME or Parser.

Entry
Enter the Entry name in the Entry column. Each Entry name has to match the corresponding name in the sequence file, as described in How to Make Sequence File.
Leave the Entry column empty until the first line of the next entry.
Feature
There are two types of Features: Biological features and DDBJ original features. Features are described in detail below.
Leave the Feature column empty until the first line of the next feature.
Location
A Location can be entered in the column next to a Feature column filled with either a Biological feature or a PRIMARY_CONTIG feature.
Qualifier
In principle, a Qualifier is entered on every line. Whether a Qualifier is mandatory, optional, or not allowed depends on the Feature. Details are explained below.
Value
The format of a Value depends on its Qualifier. Details are explained below.
Other
In the annotation file, a blank line is interpreted as the end of the file. Therefore, when you enter multiple entries, please be sure not to leave a blank line before the end of the file.
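
Because the Entry and Feature columns are left empty on continuation lines, a reader of the file has to carry the last seen values forward. The sketch below illustrates one way to do this; parse_annotation is a hypothetical helper, and it assumes that the file itself has no header row and that a Location always appears on the same line as its Feature.

```python
def parse_annotation(text):
    """Return (entry, feature, location, qualifier, value) tuples.

    Empty Entry and Feature cells inherit the most recent non-empty value.
    Reading stops at the first blank line, which marks the end of the file.
    """
    rows = []
    entry = feature = location = ""
    for line in text.splitlines():
        if line == "":
            break  # a blank line is treated as the end of the annotation file
        columns = line.split("\t")
        columns += [""] * (5 - len(columns))  # pad short lines to five columns
        e, f, loc, qualifier, value = columns[:5]
        if e:
            entry = e
        if f:
            feature, location = f, loc  # a new feature brings its own location
        rows.append((entry, feature, location, qualifier, value))
    return rows
```

Each returned tuple is self-contained, so rows can be grouped by entry or feature without looking at neighbouring lines.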

References for Describing Biological Features

Name                            Revision date  Remarks
Feature Table Definition        2016/11/17     version 10.6
Feature/Qualifier usage matrix  2016/11/09
Example of Submission           2014/11/27     Examples of features in DDBJ flat file

Samples and Relationships with DDBJ flat files

Data contents    PDF format (with remarks)    Tab-delimited text format    Relationships with flatfile
Protein coding sequence (CDS) CDS CDS general_ann2ff
Ribosomal RNA 16S_rRNA 16S_rRNA
ITS (Internal Transcribed Spacer) ITS ITS
Microsatellite marker Microsatellite_marker Microsatellite_marker
Mitochondrial sequence mtDNA mtDNA
ENV(Environmental Samples) ENV ENV
EST (Expressed Sequence Tags) EST EST EST_ann2ff
GSS (Genome Survey Sequences) GSS GSS
HTG (High Throughput Genomic Sequences) HTG HTG HTG_ann2ff
TSA (Transcriptome Shotgun Assembly); assembled from EST TSA TSA TSA_ann2ff
TSA; assembled from short reads TSA_SRA_assemble TSA_SRA_assemble TSA_SRA_ann2ff
WGS (Whole Genome Shotgun) WGS WGS WGS_ann2ff
WGS; piece of scaffold CON WGS_piece_CON WGS_piece_CON
CON entries for WGS scaffold WGS_scaffold WGS_scaffold CON_ann2ff
AGP file for CON entries AGP AGP
TPA (Third Party Annotation) TPA TPA TPA_ann2ff

COMMON

COMMON entry for the common information to all entries

  • In the annotation file, the entry name COMMON can be used in the Entry column to hold information common to all entries.
  • The information described in the COMMON entry is applied to all entries.
  • Usually, COMMON is used for SUBMITTER/REFERENCE/DATE/COMMENT, but it can also be used for a Biological feature when its Feature, Location, Qualifiers and Values are all common to all entries.

Use of COMMON entry

Meta-base position 'E' for the location description
Example: rRNA feature in COMMON entry
Entry Feature Location Qualifier Value
COMMON rRNA <1..>E product 16S rRNA

Many submissions, such as phylogenetic studies with rRNA sequences, share the same Feature, Qualifiers and Values across all entries and differ only in Location because their sequence lengths differ.

In such cases, you can describe the common Feature in the COMMON entry by using the meta-base position 'E' in its Location in place of the number of the sequence end point.

Meta-description '@@[entry]@@' is available for clone, submitter_seqid, note, ff_definition
Example: source feature in COMMON entry
Entry   Feature  Location  Qualifier        Value
COMMON  source   1..E      organism         Homo sapiens
                           mol_type         genomic DNA
                           submitter_seqid  contig: @@[entry]@@
                           ff_definition    @@[organism]@@ DNA, @@[submitter_seqid]@@

Some submissions, such as EST, GSS, TSA, TLS, WGS, and WGS scaffold (CON division), share the same Feature, Qualifiers and Values across all entries and differ only in their Locations and their clone or contig names.

In such cases, you can describe the source Feature in the COMMON entry, provided that you use the clone or contig names as entry names.

  • You can use the meta-base position 'E' in its Location in place of the number of the sequence end point.
  • For the Values of clone, submitter_seqid, note and ff_definition, the meta-description @@[entry]@@ (the word entry enclosed by "@@[" and "]@@") can be used to quote entry names. It is replaced by the entry name taken from the sequence file.
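
As an illustration of how the two placeholders might be resolved for a single entry, here is a minimal sketch. The function expand_common and its parameters are hypothetical; the actual replacement is performed by DDBJ when the annotation file is processed.

```python
import re

# 'E' counts as the meta-base position only when it is not part of a longer word.
END_POSITION = re.compile(r"(?<![A-Za-z])E(?![A-Za-z])")


def expand_common(location, value, entry_name, sequence_length):
    """Resolve 'E' in a Location and @@[entry]@@ in a Value for one entry."""
    location = END_POSITION.sub(str(sequence_length), location)
    value = value.replace("@@[entry]@@", entry_name)
    return location, value
```

For an entry named contig001 with 980 bases, the Location 1..E becomes 1..980 and the Value "contig: @@[entry]@@" becomes "contig: contig001".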

SUBMITTER

Example: SUBMITTER in annotation file
Entry   Feature    Location  Qualifier   Value
COMMON  SUBMITTER            ab_name     Robertson,G.R.
                             ab_name     Mishima,H.
                             consrtm     Mouse Genome Consortium
                             contact     Hanako Mishima
                             email       mishima@ddbj.nig.ac.jp
                             url         http://www.ddbj.nig.ac.jp
                             phone       81-55-981-6853
                             fax         81-55-981-6853
                             phext       3207
                             institute   National Institute of Genetics
                             department  DNA Data Bank of Japan
                             country     Japan
                             state       Shizuoka
                             city        Mishima
                             street      Yata 1111
                             zip         411-8540
List of Qualifiers for SUBMITTER
Qualifier Legal characters for each Value (Remarks) Number of letters
ab_name (abbreviation of author name) alphabets, .[period], ,[comma], -[hyphen], ' [apostrophe] 64
contact (contact person) alphabets, .[period], ,[comma], -[hyphen], ' [apostrophe], [space] (in order of first, middle, and last names, delimited with [space]) first(64), middle(128), last(64)
consrtm (consortium) alphabets, digits, [space], -[hyphen], ' [apostrophe], .[period], _[underscore], ,[comma], ( ) # & @ / ; : + * 255
email alphabets, digits, @, .[period], -[hyphen], _[underscore] 64
url All printable characters but [space] 255
phone, fax, phext digits, -[hyphen] (DO NOT enter + before country code) 16
institute, department All printable characters but [back-slash], ` [back-quote] 255
country, state alphabets, digits, [space], -[hyphen], '[apostrophe], .[period], _[underscore], ,[comma], ( ) # & @ / ; : + * 32
city alphabets, digits, [space], -[hyphen], '[apostrophe], .[period], _[underscore], ,[comma], ( ) # & @ / ; : + * 64
street alphabets, digits, [space], -[hyphen], '[apostrophe], .[period], _[underscore], ,[comma], ( ) # & @ / ; : + * 255
zip alphabets, digits, -[hyphen] 16

Requirements for Describing SUBMITTER

  • Basically, one SUBMITTER must be entered for each entry. However, COMMON can be used to describe a SUBMITTER that is common to all entries.
    When SUBMITTER is written in COMMON, SUBMITTER cannot be used for the other entries in the same annotation file.
  • Submitters are the persons who are responsible for the contents of the submitted data and have the right to update the data.
  • Qualifier: ab_name in SUBMITTER can be repeated for multiple submitters; the submitters are shown in the released file in the order given in the annotation file.
  • A contact person, whom DDBJ will contact about the data, must be specified with Qualifier: contact.
  • The Value of Qualifier: ab_name should be the abbreviated author name, in the same format as a REFERENCE author.
    Value format:
    last name[comma]initial of first name[period]initial of middle name[period]
    Example:
    Miyashita,Y.
    Robertson,G.R.

    Some names (e.g. a name with a hyphen) may trigger a format warning message, but they can still be entered.

  • Qualifiers in SUBMITTER other than ab_name cannot be repeated; they describe the contact person only. If you would like to submit the information of multiple institutes, please contact us before your submission.
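
The ab_name rules (legal characters, a limit of 64 characters, and the "last name, initials" shape) can be approximated with two regular expressions. This is a sketch under stated assumptions: check_ab_name is a hypothetical helper, and the split into "error" and "warning" only mirrors the remark that unusual names may produce a warning yet remain acceptable; it is not the behaviour of UME or Parser.

```python
import re

# Legal characters for ab_name, at most 64 characters in total.
AB_NAME_CHARS = re.compile(r"[A-Za-z.,'\-]{1,64}")
# Expected shape: last name, comma, then one or more "X." initials.
AB_NAME_SHAPE = re.compile(r"[A-Za-z'\-]+,(?:[A-Za-z]\.)+")


def check_ab_name(value):
    """Return "ok", "warning" (unusual shape) or "error" (illegal characters)."""
    if not AB_NAME_CHARS.fullmatch(value):
        return "error"
    if not AB_NAME_SHAPE.fullmatch(value):
        return "warning"
    return "ok"
```

A hyphenated initial such as "Kim,J.-H." uses only legal characters, so it is reported as a warning rather than an error.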

REFERENCE

Example: REFERENCE in annotation file
Entry   Feature    Location  Qualifier   Value
        REFERENCE            title       Sequence and analysis of mouse ch.8
                             ab_name     Robertson,G.R.
                             ab_name     Mishima,H.
                             status      Published
                             year        2003
                             journal     Nature
                             volume      8
                             start_page  15
                             end_page    20
List of Qualifiers for REFERENCE
Qualifier Legal characters for each Value (Remarks) Number of letters
title All printable characters but [back-slash], ` [back-quote] 255
ab_name (abbreviation of author name) alphabets, .[period], ,[comma], -[hyphen], ' [apostrophe] 64
consrtm (consortium) alphabets, digits, [space], -[hyphen], ' [apostrophe], .[period], _[underscore], ,[comma], ( ) # & @ / ; : + * 255
status One of the following: Unpublished, In press, Published -
year digits(4 figures of A.D.) 4
journal All printable characters but [back-slash], ` [back-quote] (PubMed type abbreviation) 128
volume, start_page, end_page alphabets, digits, -[hyphen] 8

Requirements for Describing REFERENCE

  • At least one REFERENCE must be specified for each entry. However, COMMON can be used to describe a REFERENCE that is common to all entries.
  • The Value of Qualifier: ab_name should be the abbreviated author name in the following format.
    Value format:
    last name[comma]initial of first name[period]initial of middle name[period]
    Example:
    Miyashita,Y.
    Robertson,G.R.

    A warning message about a name format error (e.g. for a name with a hyphen) can be ignored.

  • If the Value of status is "In press", Qualifier: journal is also mandatory.
  • If the Value of status is "Published", Qualifiers: journal, volume, start_page and end_page are also mandatory.
  • Please enter "Unpublished" as the status if you have not prepared any publication.
  • Please enter the ISO abbreviation of the journal name, if one exists.
  • If you need to enter two or more REFERENCE features, put the REFERENCE directly related to your sequences first, followed by the other(s) that would be helpful for understanding the data.
  • When you use REFERENCE features in both the COMMON entry and other entries, the REFERENCE feature(s) specified for each entry will be loaded into DDBJ after those given in the COMMON entry.
  • When you cite two or more REFERENCE features for an entry, they will be shown in the DDBJ flat file in the same order as in the annotation file.
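
The dependency between status and the other Qualifiers can be written as a small lookup table. The sketch below is illustrative only; missing_reference_qualifiers is a hypothetical name, and it checks only the status-dependent rules listed above, using the status spellings from the qualifier list.

```python
# Qualifiers that become mandatory for each status value.
REQUIRED_BY_STATUS = {
    "Unpublished": set(),
    "In press": {"journal"},
    "Published": {"journal", "volume", "start_page", "end_page"},
}


def missing_reference_qualifiers(qualifiers):
    """Return the sorted names of mandatory Qualifiers absent from one REFERENCE.

    qualifiers maps Qualifier names to Values for a single REFERENCE feature.
    """
    status = qualifiers.get("status")
    if status not in REQUIRED_BY_STATUS:
        raise ValueError(f"unexpected status: {status!r}")
    return sorted(name for name in REQUIRED_BY_STATUS[status]
                  if not qualifiers.get(name))
```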

DATE

Example: DATE/hold_date in annotation file
Entry Feature Location Qualifier Value
COMMON DATE hold_date 20181125

Requirements for Describing DATE

  • DATE and hold_date must be described in the COMMON entry.
  • If you want to keep your data confidential until a specific date, please set the date with 8 digits (e.g. 20181125).
  • Delimiters (e.g. - (hyphen), / (slash)) are not allowed in the Value of hold_date.
  • Do not enter any DATE if your data should be opened to the public immediately.
  • DATE should be included in the COMMON entry. If the date is not common to all entries, please prepare a separate file for each date.
  • If you set a hold_date, your data will be released according to the Principle of "Hold-Until-Published" data release.
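
A hold_date value can be checked with the standard library before submission. This is a minimal sketch; parse_hold_date is a hypothetical helper, and the additional check that the 8 digits form a real calendar date is an assumption that goes slightly beyond the stated rules.

```python
from datetime import datetime


def parse_hold_date(value):
    """Parse an 8-digit hold_date (YYYYMMDD) and return a datetime.date.

    Raises ValueError for delimiters, wrong length or impossible dates.
    """
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        raise ValueError(f"hold_date must be exactly 8 digits: {value!r}")
    return datetime.strptime(value, "%Y%m%d").date()
```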

COMMENT/ST_COMMENT

Example: COMMENT and ST_COMMENT in annotation file
Entry Feature Location Qualifier Value
COMMENT line This clone was obtained at our laboratory.
COMMENT line Please visit our web site.
line URL:http://www.ddbj.nig.ac.jp
ST_COMMENT tagset_id Genome-Assembly-Data
Finishing Goal High-Quality Draft
Current Finishing Status High-Quality Draft
Assembly Method GS De Novo Assembler v. 2.0
Assembly Name Mmus_1.0
Genome Coverage 50x
Sequencing Technology 454 GS FLX; ABI 3730

There are two kinds of COMMENTs, "general COMMENT" and "structured COMMENT".

Requirements for Describing COMMENT (General COMMENT)

  • Please use a general COMMENT if you want to describe additional information about your data.
  • The text is wrapped automatically at 60 characters, including spaces. If you want to start a new line at another position, please add another Qualifier: line.
  • All printable characters except [back-slash] are legal for the Value of Qualifier: line.
  • The COMMON entry can be used to describe a COMMENT that is common to all entries.
  • When you enter multiple COMMENT features, please put each COMMENT in the Feature column separately.
  • When an entry has both its own COMMENT features and COMMENT features given in the COMMON entry, the DDBJ flat file shows the COMMENT from the COMMON entry first, followed by those specific to the entry. Multiple COMMENTs are shown in the DDBJ flat file in the same order as in the annotation file.
  • When you use COMMENT features in both the COMMON entry and other entries, the COMMENT feature(s) specified for each entry will be loaded into DDBJ after those given in the COMMON entry.
  • When you describe two or more COMMENT features for an entry, they will be shown in the DDBJ flat file in the same order as in the annotation file.
  • For EST submissions, a particular COMMENT description is required. For details, see For EST Submissions.

Requirements for Describing ST_COMMENT (structured COMMENT)

  • ST_COMMENT is a feature to describe structured COMMENT.
  • Although an ST_COMMENT can be defined by a user community, an ST_COMMENT in a predetermined format is required when submitting sequence data derived from a genome project (including WGS) or a transcriptome project (including TSA).
  • An ST_COMMENT is composed of a dataset name (tagset_id), item names (user-defined Qualifiers) and item values (Values).
  • On the first line of a structured COMMENT feature, enter tagset_id as the Qualifier and the dataset name as its Value.

    For a genome project, enter "Genome-Assembly-Data" as the value of the tagset_id qualifier.
    For a transcriptome project, enter "Assembly-Data" as the value of the tagset_id qualifier.

  • Enter the name of each item as the Qualifier and its value as the Value.
    For Genome-Assembly-Data, use the following Qualifiers.

    List of Qualifiers for Genome-Assembly-Data
    Qualifier designation and content Remarks
    Finishing Goal Finishing goal of the genome project. Use controlled vocabulary. Please use one of the following terms:
    "Standard Draft",
    "High-Quality Draft",
    "Improved High-Quality Draft",
    "Noncontiguous Finished",
    "Finished"
    Current Finishing Status Current Finishing Status of the genome project. Use controlled vocabulary.
    Assembly Method Name and version of the program used for assembling sequences. Mandatory.
    Assembly Name Name that the submitter has given to that assembly of the genome. Mandatory for Eukaryote. We recommend the following format:
    [abbreviated name of species or common name of organism] + [version]
    (e.g. Btau_4.0)
    Genome Coverage Approximate sequencing depth. Mandatory.
    Sequencing Technology Platform(s) used to generate the sequence. Mandatory.

    For Assembly-Data, use the following Qualifiers.

    List of Qualifiers for Assembly-Data
    Qualifier designation and content
    Assembly Method Name and version of the program used for assembling sequences. Mandatory.
    Assembly Name Name and version for assembled sequences
    Coverage Approximate sequencing depth.
    Sequencing Technology Platform(s) used to generate the sequence. Mandatory.
  • If you have any questions about describing ST_COMMENT, please contact us by email prior to your submission.
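
The mandatory items and the controlled vocabulary for Finishing Goal can be checked with a short function. The following is a sketch only: check_st_comment and its eukaryote flag are hypothetical, tagset_ids other than the two described here are passed through unchecked, and Current Finishing Status is deliberately not checked because its vocabulary is not listed separately in this document.

```python
FINISHING_TERMS = {
    "Standard Draft", "High-Quality Draft", "Improved High-Quality Draft",
    "Noncontiguous Finished", "Finished",
}
MANDATORY = {
    "Genome-Assembly-Data": ["Assembly Method", "Genome Coverage", "Sequencing Technology"],
    "Assembly-Data": ["Assembly Method", "Sequencing Technology"],
}


def check_st_comment(tagset_id, items, eukaryote=False):
    """Return a list of problems for one ST_COMMENT; items maps Qualifier to Value."""
    problems = [f"missing: {name}"
                for name in MANDATORY.get(tagset_id, [])
                if not items.get(name)]
    if tagset_id == "Genome-Assembly-Data":
        if eukaryote and not items.get("Assembly Name"):
            problems.append("missing: Assembly Name")
        goal = items.get("Finishing Goal")
        if goal is not None and goal not in FINISHING_TERMS:
            problems.append(f"Finishing Goal not in controlled vocabulary: {goal}")
    return problems
```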

Biological Features

Example: source and CDS features in annotation file
Entry   Feature  Location                                  Qualifier    Value
        source   1..12297                                  organism     Mus musculus
                                                           mol_type     genomic DNA
                                                           chromosome   8
                                                           clone        PC0110
        CDS      join(<1..456,609..879,1070..1213)         product      protein kinase
                                                           codon_start  2
        rRNA     1279..3000                                product      18S rRNA
        CDS      complement(join(3213..4981,9901..11677))  gene         tbpA
                                                           product      TATA-box binding protein

For detailed definitions and descriptions of Biological features, please read Feature Table Definition.

Requirements for Describing Feature/Location/Qualifier

  • In Feature Table Definition, each Qualifier is preceded by a / [slash]; however, do not use slashes for Qualifiers in the annotation file.
  • Qualifiers marked with * (organism, mol_type) are mandatory. The source Feature and at least one other Feature are mandatory for each entry. Please be sure to enter them correctly.
  • The rules for describing Locations are given in Description of Location.
  • The Qualifiers that are legal for each Feature are listed in Feature/Qualifier Usage Matrix. Some Features have mandatory Qualifier(s). Please be sure to specify Features and Qualifiers exactly as named in the table; the names are strictly defined, including case (upper or lower) and the use of "_" [underscore].
  • See also Sample annotation file and Example of Submission.
  • When you describe CDS features, Protein Coding Sequence; CDS feature would be helpful.
  • Files containing CDS feature(s) should be checked with UME or transChecker.

Requirements for Describing Value

DIVISION

The DIVISION feature in the annotation file indicates that the entries belong to exactly one of CON / ENV / EST / GSS / HTC / HTG / STS / SYN / TSA.

Example: DIVISION in annotation file
Entry Feature Location Qualifier Value
COMMON DIVISION division EST

Requirements for Describing DIVISION

  • Please enter the division name (3 capital letters) as the Value of Qualifier: division.
  • In principle, please describe the DIVISION feature in the COMMON entry.

DATATYPE

The DATATYPE feature indicates that the entries correspond to one of WGS, TLS, TPA, or TPA-WGS.

Example: DATATYPE in annotation file
Entry Feature Location Qualifier Value
COMMON DATATYPE type WGS

Requirements for Describing DATATYPE

  • Please enter the type name (WGS, TLS, TPA, or TPA-WGS) as the Value of Qualifier: type.
  • Please describe the DATATYPE feature in the COMMON entry.

KEYWORD

On top of the categories given in the DIVISION and DATATYPE sections, KEYWORDs with controlled vocabulary describe more detailed and specific information, such as experimental methods.
Please see INSDC agreed methodological keywords, which define the controlled keyword terms.

Example: KEYWORD in annotation file
Entry Feature Location Qualifier Value
KEYWORD keyword ENV
Specified values for KEYWORD/keyword
Categories  Values for keyword                   Remarks
WGS         WGS                                  See also For WGS and scaffold CON.
ENV         ENV
EST         EST, some other terms                Please refer to For EST Submissions.
HTC         HTC, some other terms                Please contact us before your submission.
HTG         HTG, some other terms                Depending on the phase. Please contact us before your submission.
GSS         GSS
STS         STS
TPA         TPA, Third Party Data,               Either of the two is mandatory.
            TPA:inferential or TPA:experimental
TSA         TSA, Transcriptome Shotgun Assembly
TLS         TLS, Targeted Locus Study
Others                                           Please contact us before your submission.

Requirements for Describing KEYWORD

  • Please enter one of the specified values for Qualifier: keyword.
  • Please contact us before your submission to confirm the detailed description of KEYWORD.
For WGS and scaffold CON
  • For WGS and scaffold CON, please select a keyword from the following list.
    • STANDARD_DRAFT
    • HIGH_QUALITY_DRAFT
    • IMPROVED_HIGH_QUALITY_DRAFT
    • NON_CONTIGUOUS_FINISHED
    Example: WGS draft genome
    Entry Feature Location Qualifier Value
    KEYWORD keyword WGS
    keyword STANDARD_DRAFT
For EST Submissions
  • For EST submissions, at least two keywords are required:
    EST and one of the following three terms.

    • For 5' EST submissions --- 5'-end sequence (5'-EST)
    • For 3' EST submissions --- 3'-end sequence (3'-EST)
    • Other than the above two cases --- unspecified EST
  • Example: 5' EST
    Entry Feature Location Qualifier Value
    KEYWORD keyword EST
    keyword 5'-end sequence (5'-EST)
  • For 3' ESTs, to indicate whether your sequences correspond to the anti-sense or the sense strand, please enter one of the following two COMMENTs.
    Example: For anti-sense strand
    Entry Feature Location Qualifier Value
    COMMENT line 3'-EST sequences are presented as anti-sense strand.
    Example: For sense strand
    Entry Feature Location Qualifier Value
    COMMENT line 3'-EST sequences are presented as sense strand.
For HTG submissions
  • For HTG submissions, we recommend using keywords to indicate the sequencing status of the HTG data.
  • Example I: containing unordered pieces
    Entry Feature Location Qualifier Value
    KEYWORD keyword HTG
    keyword HTGS_PHASE1
    keyword HTGS_DRAFT
    Example II: containing only ordered pieces
    Entry Feature Location Qualifier Value
    KEYWORD keyword HTG
    keyword HTGS_PHASE2
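
The keyword rule for EST submissions (EST plus one of three end-type terms) lends itself to a simple set check. This sketch is illustrative; est_keywords_ok is a hypothetical helper, and requiring exactly one end-type term is an interpretation of "one of following three terms".

```python
EST_END_TERMS = {
    "5'-end sequence (5'-EST)",
    "3'-end sequence (3'-EST)",
    "unspecified EST",
}


def est_keywords_ok(keywords):
    """Return True if keywords contain EST and exactly one end-type term."""
    present = EST_END_TERMS.intersection(keywords)
    return "EST" in keywords and len(present) == 1
```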

DBLINK

The DBLINK line is used to link to records in other databases, such as BioProject, BioSample and Sequence Read Archive (DRA/ERA/SRA).

Example: DBLINK in annotation file
Entry Feature Location Qualifier Value
DBLINK project PRJDB12345
biosample SAMD90000000
sequence read archive DRR999000
sequence read archive DRR999001

Requirements for Describing DBLINK

locus_tag

For whole-genome-scale submissions with many annotated features, we recommend using the qualifier locus_tag for Biological features that indicate protein products (CDSs) and transcripts (rRNA, tRNA and so on).
The locus_tag prefix and BioProject ID should be registered with the DDBJ BioProject Database in advance.

source: ff_definition

ff_definition is a Qualifier that is not defined in The DDBJ/EMBL/GenBank Feature Table: Definition.
At most one ff_definition can be described per entry, if necessary.

Example: ff_definition in annotation file
Entry Feature Location Qualifier Value
source 1..516 organism Mus musculus
mol_type mRNA
ff_definition @@[organism]@@ mRNA, clone: @@[clone]@@
clone PC0110
Value formats of ff_definition
Categories Format for the value of ff_definition
WGS @@[organism]@@ @@[strain]@@ DNA, @@[submitter_seqid]@@, [other information]
BAC/YAC genomic clones in unfinished phase (HTG) @@[organism]@@ DNA, chromosome @@[map]@@, [BAC/YAC] clone: @@[clone]@@, *** SEQUENCING IN PROGRESS ***
BAC/YAC genomic clones in finished phase @@[organism]@@ DNA, chromosome @@[map]@@, [BAC/YAC] clone: @@[clone]@@
EST @@[organism]@@ mRNA, clone: @@[clone]@@, [other information]
@@[organism]@@ cDNA, clone: @@[clone]@@, [other information]
GSS @@[organism]@@ DNA, clone: @@[clone]@@, [other information]
STS @@[organism]@@ DNA, @@[map]@@, [marker name], sequence tagged site
Others Please contact us before your submission, if necessary.

Requirements for Describing source: ff_definition

  • The Qualifier: ff_definition can be described on the source feature, one of the Biological features.
  • You can describe only one ff_definition per entry.
  • The value of ff_definition is used for the DEFINITION line of the DDBJ flat file. Please refer to Sample annotation file and The relationships between annotation file and DDBJ flat file.
  • For the Value of ff_definition, a meta-description (e.g. @@[organism]@@ and @@[clone]@@) can be used to quote the values of other qualifiers. A meta-description, i.e. a Qualifier name enclosed by "@@[" and "]@@", is replaced by the value of the target Qualifier ("organism" and "clone" in the above sample) when ff_definition is reflected in the DEFINITION line of the DDBJ flat file.
  • In principle, DEFINITION is described according to the above table; if you would like to enter other values for the ff_definition qualifier, please contact us before your submission.
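
To preview what a DEFINITION line might look like, the meta-descriptions can be substituted locally. The sketch below is only an approximation of the replacement DDBJ performs; render_definition is a hypothetical helper, and raising KeyError for a Qualifier that is not present is a design choice made here, not documented behaviour.

```python
import re

# A Qualifier name enclosed by "@@[" and "]@@".
META_DESCRIPTION = re.compile(r"@@\[([^\]]+)\]@@")


def render_definition(template, qualifiers):
    """Replace each @@[name]@@ in template with qualifiers[name]."""
    def substitute(match):
        name = match.group(1)
        if name not in qualifiers:
            raise KeyError(name)  # the referenced Qualifier is not described
        return qualifiers[name]
    return META_DESCRIPTION.sub(substitute, template)
```

With organism "Mus musculus" and clone "PC0110", the template "@@[organism]@@ mRNA, clone: @@[clone]@@" renders as "Mus musculus mRNA, clone: PC0110".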

assembly_gap: Sequencing Gap Region

In whole-genome-scale sequencing such as HTG, or in large-scale assemblies of EST sequences such as the TSA division, entries may contain sequencing gaps that result from the assembly process or from regions that are difficult to read. You can indicate such gaps by writing "n" in the sequence. In the annotation file, you have to indicate the regions of the sequencing gaps with assembly_gap features.

Example: assembly_gap in annotation file
Entry Feature Location Qualifier Value
assembly_gap 101..200 estimated_length unknown
gap_type within scaffold
linkage_evidence paired-ends

Requirements for Describing assembly_gap: Sequencing Gap Region

  • Although the assembly_gap feature is one of the Biological features, its format is slightly different from the others.
  • You can NOT use join, order or complement in the Location of assembly_gap features.
Length of the gap is unknown

The location span of the assembly_gap feature for a gap of unknown length has to be specified by the submitter; the specified gap length has to be reasonable (1000 or less) and is represented by "n"s in the sequence.
The Value of Qualifier: estimated_length on the assembly_gap feature must be unknown.

For transcriptome records (TSA division), the Value of estimated_length on assembly_gap features must be an integer, not "unknown".

Length of the gap is estimated

The location span of the assembly_gap feature for a gap of known length should correspond to the number of "n"s in the sequence. The Value of Qualifier: estimated_length on the assembly_gap feature must be known.
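
Since every assembly_gap Location has to coincide with a run of "n" in the sequence, candidate Locations can be listed directly from the sequence. This is a sketch under an assumption: find_gap_spans and its min_run threshold (here 10) are hypothetical, chosen so that isolated ambiguous bases are not reported as gaps; which runs are real sequencing gaps remains the submitter's decision.

```python
import re


def find_gap_spans(sequence, min_run=10):
    """Return (location, length) for each run of 'n' at least min_run long.

    Locations are 1-based and inclusive, in the form "begin..end".
    """
    spans = []
    for match in re.finditer(r"n+", sequence.lower()):
        length = match.end() - match.start()
        if length >= min_run:
            spans.append((f"{match.start() + 1}..{match.end()}", length))
    return spans
```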

TOPOLOGY

Please enter circular as the Qualifier of the TOPOLOGY feature when the topology of the whole nucleotide molecule is circular, that is, when the first and last positions are joined in the real molecule.
e.g. Complete genome sequence of a circular virus

Example: TOPOLOGY in annotation file
Entry Feature Location Qualifier Value
TOPOLOGY circular

Requirements for Describing TOPOLOGY

TPA/TSA: PRIMARY_CONTIG, Citation of Primary Entries

PRIMARY_CONTIG is the Feature, and entry and primary_bases are the Qualifiers, provided to describe the alignments of primary entries for TPA/TSA submissions.

Example: PRIMARY_CONTIG in annotation file
Entry Feature Location Qualifier Value
PRIMARY_CONTIG 1..438 entry ZZ000010.1
primary_bases 1..438
PRIMARY_CONTIG 377..696 entry ZZ000011.1
primary_bases 1..320
complement
PRIMARY_CONTIG 590..1191 entry ZZ000022.0
primary_bases 1..601
Qualifiers available for PRIMARY_CONTIG
Qualifier Remarks for the value description
entry Accession number of the cited primary entry (with version number)
primary_bases The base span cited from the primary sequence. Example) 1..500
complement Indicates that the complementary strand of the primary sequence is cited

Requirements for Describing TPA/TSA: PRIMARY_CONTIG, Citation of Primary Entries

  • Please specify TPA as the value of DATATYPE/type, or TSA as the value of DIVISION/division, in the annotation file.
  • In PRIMARY_CONTIG, you must refer to the accession number(s) (with version) in the primary database and enter the base spans of the primary sequences that contribute to the TPA/TSA sequence.
  • You cannot use join, order or complement in the Location column. Please describe each PRIMARY_CONTIG and its location separately, even within the same entry.
  • If the primary entry has been submitted to DDBJ/EMBL-Bank/GenBank, a version number is required for the accession number. If the primary entry is not public, please use 0 [zero] for the version, e.g. ZZ000022.0
  • If the primary sequence corresponds to the reverse strand of the TPA/TSA sequence, please add the complement qualifier.
  • For details, refer to Sample annotation file and The relationships between annotation file and DDBJ flat file.

AGP File

An AGP file is required to submit CON entries.
An AGP file is a tab-delimited text file consisting of nine columns that describe, among other things, the order and orientation of the piece entries used to construct a CON entry.
You can make the file with scripts, spreadsheets (such as MS Excel), text editors and so on.

A sequence file is not required when the sequence can be constructed from the AGP file.

The AGP file format was initially developed by UCSC, EBI and NCBI.

Example: AGP file
#1 2 3 4 5 6 7 8 9
scaffold1 1 1345 1 W BZZZ01123456.1 1 1345 +
scaffold1 1346 2845 2 N 1500 scaffold yes align_genus
scaffold1 2846 4301 3 W BZZZ01123457.1 1 1456 +
scaffold1 4302 4401 4 U 100 scaffold yes align_genus
scaffold1 4402 5631 5 W BZZZ01123458.1 1 1230 -
scaffold2 1 650 1 W BZZZ01123486.1 1 650 +
scaffold2 651 750 2 N 100 scaffold yes align_genus
scaffold2 751 1980 3 W BZZZ01123488.1 1 1230 -

Format and Syntax

The format of the AGP file must be validated with UME.

  • An AGP file consists of nine columns.
  • Columns must be tab delimited.
  • An AGP file must not contain any spaces or blank lines.
  • The use of comment lines, starting with a # symbol, at the head of the file is encouraged.
Description of each column (column 1 - column 5)
column content description
1 object CON entry name, the identifier for the object being assembled, e.g. a chromosome, scaffold or contig. The CON entry name has to match the corresponding name in the annotation file, as described in Annotation File.
2 object_beg The starting coordinates of the component/gap on the object.
3 object_end The ending coordinates of the component/gap on the object.
4 part_number The line count for the components/gaps that make up the object.
5 component_type The sequencing status of the component. These typically correspond to keywords in the International Sequence Database (GenBank/EMBL/DDBJ) submission. Current acceptable values are:
A Active Finishing
D Draft HTG (often phase1 and phase2 are called Draft, whether or not they have the draft keyword)
F Finished HTG (phase3)
G Whole Genome Finishing
O Other sequence (typically means no HTG keyword)
P Pre Draft
W WGS contig
N gap with specified size
U gap of unknown size, defaulting to 100 bases

* component: a sequence used to construct a larger sequence (i.e. piece entry)

The content of columns 6 to 9 depends on whether the value in column 5 indicates a component or a gap.

Description of each column (columns 6 to 9):
If column 5 contains A, D, F, G, O, P or W (a component, not the gap values N or U)
column content description
6 component_id The accession number with version, or the local identifier, of the component.
7 component_beg The beginning of the part of the component that contributes to the object.
8 component_end The end of the part of the component that contributes to the object.
9 orientation The orientation of the component relative to the object.
    Acceptable values are:
    +   plus
    -   minus
    ?   unknown
    0   zero; unknown (deprecated)
    na  irrelevant
    By default, components with "?", "0" or "na" are treated as if they had + orientation.

* component: a sequence used to construct a larger sequence (i.e. piece entry)
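
A component line can be checked against the accepted values listed above. The following Python sketch uses hypothetical function names. It also checks that the object span equals the component span; that rule is taken from the general AGP specification rather than from this page, so treat it as an assumption here.

```python
COMPONENT_TYPES = {"A", "D", "F", "G", "O", "P", "W"}
ORIENTATIONS = {"+", "-", "?", "0", "na"}


def check_component_line(cols):
    """Check one component line given as a list of nine column strings."""
    errors = []
    if cols[4] not in COMPONENT_TYPES:
        errors.append("column 5 is not a component type")
    if cols[8] not in ORIENTATIONS:
        errors.append("column 9 is not an accepted orientation")
    # Span rule assumed from the AGP specification (1-based, inclusive).
    object_span = int(cols[2]) - int(cols[1]) + 1
    component_span = int(cols[7]) - int(cols[6]) + 1
    if object_span != component_span:
        errors.append(
            f"object span {object_span} != component span {component_span}"
        )
    return errors


def effective_orientation(value):
    """Treat "?", "0" and "na" as + orientation, as described above."""
    return "-" if value == "-" else "+"
```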

Description of each column (columns 6 to 9):
If column 5 contains N or U (a gap)
column content description
6 gap_length [component_type: N] The length of the gap (bp)
    [component_type: U] 100
7 gap_type The gap type. Accepted values:
    scaffold         a gap between two sequence contigs in a scaffold (superscaffold or ultra-scaffold).
    contig           an unspanned gap between two sequence contigs.
    centromere       a gap inserted for the centromere.
    short_arm        a gap inserted at the start of an acrocentric chromosome.
    heterochromatin  a gap inserted for an especially large region of heterochromatic sequence (may also include the centromere).
    telomere         a gap inserted for the telomere.
    repeat           an unresolvable repeat.
8 linkage Whether there is linkage between the adjacent lines (values: "yes" or "no").
9 linkage evidence The type of evidence used to assert linkage (as indicated in column 8). Accepted values:
    na             used when no linkage is being asserted (column 8 is 'no').
    paired-ends    paired sequences from the two ends of a DNA fragment.
    align_genus    alignment to a reference genome within the same genus.
    align_xgenus   alignment to a reference genome within another genus.
    align_trnscpt  alignment to a transcript from the same species.
    within_clone   sequence on both sides of the gap is derived from the same clone, but the gap is not spanned by paired-ends. The adjacent sequence contigs have unknown order and orientation.
    clone_contig   linkage is provided by a clone contig in the tiling path (TPF). For example, a gap where there is a known clone, but there is not yet sequence for that clone.
    map            linkage asserted using a non-sequence based map such as RH, linkage, fingerprint or optical.
    strobe         strobe sequencing (PacBio).
    unspecified    used when converting old AGPs that lack a field for linkage evidence into the new format.
    If there are multiple lines of evidence to support linkage, all can be listed using a ';' delimiter
    (e.g. "paired-ends;align_xgenus").
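
The linkage evidence column can be checked by splitting it on the ';' delimiter. The following Python sketch uses a hypothetical function name. The rule that 'na' must not be combined with a linkage of "yes" is an inference from the description of 'na', not an explicit statement on this page.

```python
LINKAGE_EVIDENCE = {
    "na", "paired-ends", "align_genus", "align_xgenus", "align_trnscpt",
    "within_clone", "clone_contig", "map", "strobe", "unspecified",
}


def check_linkage_evidence(linkage, evidence):
    """Check column 9 (evidence) against column 8 (linkage: "yes" or "no")."""
    errors = []
    terms = evidence.split(";")
    for term in terms:
        if term not in LINKAGE_EVIDENCE:
            errors.append(f"unknown linkage evidence: {term!r}")
    if linkage == "no" and terms != ["na"]:
        errors.append("linkage 'no' requires the evidence 'na'")
    if linkage == "yes" and "na" in terms:
        # Inferred rule: 'na' means that no linkage is asserted.
        errors.append("evidence 'na' cannot be used with linkage 'yes'")
    return errors
```

Because the terms are compared exactly, a stray space such as "align_xgenus " is reported as an unknown value.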
  • The length of an 'unknown' gap should be 100 bp. Specify "U" as the value of component_type and "100" as the value of gap_length.
  • Continuity is indicated by the combination of the gap_type and linkage values. Refer to the following table.
    gap_type linkage Interpretation and description
    Within-scaffold gaps: sequences on either side of the gap are in a single scaffold.
    scaffold yes Do not break scaffold.
        There is evidence linking sequence contigs on both sides of the gap.
    repeat yes Do not break scaffold.
        If an unresolvable repeat unit is spanned by linkage evidence, the linkage will be 'yes'.
    Scaffold-breaking gaps: sequences on either side of the gap are in separate scaffolds.
    contig no Break scaffold.
        A contig gap indicates there is no evidence to link the adjacent sequence contigs.
    repeat no Break scaffold.
        If an unresolvable repeat unit is not spanned by linkage evidence, the linkage will be 'no'.
    centromere, short_arm, heterochromatin, telomere no Break scaffold.
        Gaps with these biological types are used for laying out scaffolds along a chromosome.
    Invalid gap/linkage combinations:
    contig yes Invalid.
        If there is evidence of linkage between the adjacent sequence contigs, the gap type should be scaffold.
    scaffold no Invalid.
        If there is no evidence of linkage between the adjacent sequence contigs, the gap type should be contig.
    centromere, short_arm, heterochromatin, telomere yes Invalid.
        It is invalid to use these biological types within a scaffold.
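
The table above can be expressed as a small lookup. The following Python sketch uses hypothetical function names and return strings; it only restates the combinations listed in the table and the 100 bp rule for 'unknown' gaps.

```python
BIOLOGICAL_GAP_TYPES = {"centromere", "short_arm", "heterochromatin", "telomere"}


def interpret_gap(gap_type, linkage):
    """Return the interpretation of a gap_type/linkage combination."""
    if linkage not in ("yes", "no"):
        raise ValueError(f"linkage must be 'yes' or 'no': {linkage!r}")
    if gap_type in ("scaffold", "repeat") and linkage == "yes":
        return "do not break scaffold"
    if gap_type in ("contig", "repeat") and linkage == "no":
        return "break scaffold"
    if gap_type in BIOLOGICAL_GAP_TYPES and linkage == "no":
        return "break scaffold"
    if gap_type in {"scaffold", "contig"} | BIOLOGICAL_GAP_TYPES:
        # Remaining known combinations: contig/yes, scaffold/no, biological/yes.
        return "invalid"
    raise ValueError(f"unknown gap_type: {gap_type!r}")


def unknown_gap_length_ok(component_type, gap_length):
    """An 'unknown' gap (component_type U) must have gap_length 100."""
    return component_type != "U" or gap_length == 100
```

For example, interpret_gap("scaffold", "yes") returns "do not break scaffold", while interpret_gap("contig", "yes") returns "invalid".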