DDBJ Annotated/Assembled Sequences
Submission File Format
Sequence File
The sequence file is a text file in a FASTA-like format that contains all nucleotide sequences. In the sequence file, each entry consists of one header line starting with “>”, followed by the sequence on the second and subsequent lines. You must insert the end flag (//) at the end of each sequence.
Example: Sequence File
>CLN01 <-- Entry name for the first one
ggacaggctgccgcaggagccaggccgggagcaggaagaggcttcgggggagccggagaa
ctgggccagatgcgcttcgtgggcgaagcctgaggaaaaagagagtgaggcaggagaatc
gcttgaaccccggaggcggaaccgcactccagcctgggcgacagagtgagactta
// <-- End flag
>CLN02 <-- Entry name for the second one
ctcacacagatgcgcgcacaccagtggttgtaacagaagcctgaggtgcgctcgtggtca
gaagagggcatgcgcttcagtcgtgggcgaagcctgaggaaaaaatagtcattcatataa
atttgaacacacctgctgtggctgtaactctgagatgtgctaaataaaccctctt
// <-- End flag
Format and Syntax
The format of the sequence file must be validated with UME or Parser.
- First line starts with [>], followed by the Entry name at the head of each sequence.
- Entry names must be unique in the sequence file. It is common to use a clone name or isolate name as the unique Entry name.
- An Entry name must be shorter than 32 characters and must not contain a space, " (double quote), = (equal sign), | (pipe), > (greater-than sign), [ ] (square brackets) or \ (back-slash).
- The names and the order of the Entries must match between the sequence file and the annotation file. Accession numbers will be assigned in the order of the entries.
- Sequence file is required to contain NO space or blank line.
- You can use not only a, t, g and c but also characters in Nucleotide base codes for your nucleotide sequences, if necessary.
- In principle, please remove the base code ‘n’ located at the 5’ or 3’ end of sequences. For EST submissions in particular, please do not send the raw output of a sequencer. You should screen your sequences to remove unreliable output, which is often located at the 5’ end.
- Remove sequences derived from vectors, linkers or adaptors. If you would like to submit an artificially constructed sequence itself, such as an expression vector, you do not have to remove it.
- Please be sure to input the end flag [//] at the end of each sequence.
- In case of CON entry, AGP file can be used as a substitute for sequence file.
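The rules above can be pre-checked locally before running UME or Parser. The following is a minimal sketch, not a replacement for the official validators; the function name `check_sequence_file` is hypothetical, and it assumes the header line holds only `>` plus the Entry name (the arrows in the example above are explanatory, not part of a real file).

```python
# Characters that the rules above forbid in an Entry name.
FORBIDDEN = set(' "=|>[]\\')

def check_sequence_file(text):
    """Return a list of problems found in a FASTA-like sequence file."""
    problems = []
    names = []
    in_entry = False  # True between a ">" header and its "//" end flag
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip() == "":
            problems.append(f"line {number}: blank line")
        elif line.startswith(">"):
            if in_entry:
                problems.append(f"line {number}: previous entry has no end flag")
            name = line[1:]
            if not 0 < len(name) < 32:
                problems.append(f"line {number}: Entry name length is out of range")
            if any(ch in FORBIDDEN for ch in name):
                problems.append(f"line {number}: illegal character in Entry name")
            if name in names:
                problems.append(f"line {number}: duplicate Entry name {name}")
            names.append(name)
            in_entry = True
        elif line == "//":
            if not in_entry:
                problems.append(f"line {number}: end flag without an entry")
            in_entry = False
        else:
            if not in_entry:
                problems.append(f"line {number}: sequence outside an entry")
            if " " in line:
                problems.append(f"line {number}: space in sequence")
    if in_entry:
        problems.append("last entry has no end flag")
    return problems
```

An empty list means none of the checks sketched here failed; the official tools apply further checks, such as the allowed base codes.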
Annotation File
The annotation file is a tab-delimited text file consisting of five columns (Entry, Feature, Location, Qualifier, and Value). It contains your data other than sequences, such as submitters, references and biological features.
You can create the file with scripts, spreadsheet software (such as MS Excel), text editors and so on.
Example: Annotation file (Required)
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| COMMON | SUBMITTER | | ab_name | Robertson,G.R. |
| | | | ab_name | Mishima,H. |
| | | | contact | Hanako Mishima |
| | | | email | mishima@ddbj.nig.ac.jp |
| | | | phone | 81-55-981-6853 |
| | | | fax | 81-55-981-6853 |
| | | | phext | 3207 |
| | | | institute | National Institute of Genetics |
| | | | department | DNA Data Bank of Japan |
| | | | country | Japan |
| | | | state | Shizuoka |
| | | | city | Mishima |
| | | | street | Yata 1111 |
| | | | zip | 411-8540 |
| | REFERENCE | | title | Mouse Genome Sequencing |
| | | | ab_name | Robertson,G.R. |
| | | | ab_name | Mishima,H. |
| | | | year | 2012 |
| | | | status | Unpublished |
| | COMMENT | | line | Please visit our website |
| | | | line | URL: http://www.ddbj.nig.ac.jp/ |
| CLN01 | source | 1..12297 | organism | Mus musculus |
| | | | mol_type | genomic DNA |
| | | | clone | PC0110 |
| | | | chromosome | 8 |
| | CDS | join(<1..456,609..879,1070..1213) | product | protein kinase |
| | | | codon_start | 2 |
| CLN02 | source | 1..12393 | organism | Mus musculus |
| | | | mol_type | genomic DNA |
| | | | clone | PC0210 |
| | | | chromosome | 8 |
| | CDS | 9365..9640 | product | hypothetical protein |
Format and Syntax
The format of the annotation file must be validated with UME or Parser.
- Entry
- Please enter the Entry name into Entry column. Entry name has to correspond to each name in the sequence file as described at How to Make Sequence File.
- Do not enter anything in the Entry column until the first line for the next entry.
- Feature
- There are two types of Features: Biological features and DDBJ original features. Detailed descriptions of the Features are given below.
- Do not enter anything in Feature columns until the first line for the next feature.
- Location
- Location can be described in the column adjacent to a Feature column that is filled with either a Biological feature or the PRIMARY_CONTIG feature.
- Qualifier
- Qualifier is described in every line, in principle. It depends on the Feature whether each Qualifier is mandatory, available, or not to use for the Feature. Details are explained below.
- Value
- The format of Value is different depending on Qualifiers. Details will be explained below.
- Other
- In the annotation file, a blank line is treated as the end of the file. Therefore, when you enter multiple entries, be sure not to leave any blank line before the end of the file.
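The column rules above (blank Entry and Feature cells continue the previous row, and a blank line ends the file) can be illustrated with a small reader. This is a sketch under those stated rules only; `parse_annotation` is a hypothetical helper, not part of UME or Parser.

```python
def parse_annotation(text):
    """Return rows as (entry, feature, location, qualifier, value) tuples,
    filling blank Entry/Feature/Location cells from the preceding rows."""
    rows = []
    entry = feature = location = ""
    for line in text.split("\n"):
        if line == "":
            break  # a blank line is treated as the end of the file
        cols = (line.split("\t") + [""] * 5)[:5]  # pad short rows to 5 columns
        if cols[0]:
            # A new Entry starts; forget the previous feature context.
            entry, feature, location = cols[0], "", ""
        if cols[1]:
            # A new Feature starts; its Location is on the same line.
            feature, location = cols[1], cols[2]
        rows.append((entry, feature, location, cols[3], cols[4]))
    return rows
```

Filling in the carried-over cells makes each row self-describing, which is convenient for later per-feature checks.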
References for Describing Biological Features
Name | Refix Date | Remarks |
---|---|---|
Feature Table Definition | 2024/10/01 | version 11.3 |
Feature/Qualifier usage matrix | 2016/11/09 | |
Example of Submission | 2014/11/27 | Examples of features in DDBJ flat file |
COMMON
COMMON entry for the common information to all entries
- In annotation file, entry name COMMON can be described in Entry column for the common information to all entries.
- The information described in COMMON entry will be reflected in all entries.
- Usually, COMMON is used for SUBMITTER/REFERENCE/DATE/COMMENT, but it can also be used for Biological feature when all the information of Feature, Location, Qualifiers and Values are common to all entries.
Use of COMMON entry
- Meta-base position ‘E’ for the location description
- Example: rRNA feature in COMMON entry
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| COMMON | rRNA | <1..>E | product | 16S rRNA |

In many submissions, such as phylogenetic studies with rRNA sequences, the Feature information (Qualifiers and Values) is common to all entries, and only the Locations differ because the sequence lengths differ.
In such cases, you can describe the common Feature in the COMMON entry by using the meta-base position ‘E’ in its Location instead of the number of the sequence end point.
- Meta-description ‘@@[entry]@@’ is available for clone, note, ff_definition
- Example: source feature in COMMON entry
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| COMMON | source | 1..E | organism | Homo sapiens |
| | | | mol_type | genomic DNA |
| | | | submitter_seqid | @@[entry]@@ |
| | | | ff_definition | @@[organism]@@ DNA, @@[submitter_seqid]@@ |

In some submissions, such as EST, GSS, TSA, TLS, WGS, and WGS scaffold (CON division), the Feature information (Qualifiers and Values) is common to all entries except for the Locations and the clone or contig names.
In such cases, you can describe the Feature: source in the COMMON entry, but only if you use the clone or contig names as entry names.
- You can use meta-base position ‘E’ in its Location instead of the number of the sequence end points.
- For the Value of clone, submitter_seqid, note, ff_definition, a meta description @@[entry]@@, entry enclosed by “@@[” and “]@@”, is available to quote entry names. It will be replaced by the entry names which are quoted from a sequence file.
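To make the two substitutions concrete, the sketch below resolves ‘E’ and @@[entry]@@ for a single entry. It is only an illustration of the behaviour described above; the function name `expand_common` and its parameters are assumptions, and the actual replacement is performed by DDBJ.

```python
import re

def expand_common(location, value, entry_name, seq_length):
    """Resolve the meta-base position E and the @@[entry]@@ meta-description
    of a COMMON row for one concrete entry."""
    # Replace E only where it stands as an end position: after ".." and an
    # optional ">" (e.g. "1..E" or "<1..>E").
    resolved_location = re.sub(
        r"(\.\.>?)E\b", lambda m: m.group(1) + str(seq_length), location
    )
    resolved_value = value.replace("@@[entry]@@", entry_name)
    return resolved_location, resolved_value
```

For example, a COMMON row with Location `<1..>E` applied to a 1500-base entry would be read as `<1..>1500`.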
SUBMITTER
Example: SUBMITTER in annotation file (Required)
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| COMMON | SUBMITTER | | ab_name | Robertson,G.R. |
| | | | ab_name | Mishima,H. |
| | | | consrtm | Mouse Genome Consortium |
| | | | contact | Hanako Mishima |
| | | | email | mishima@ddbj.nig.ac.jp |
| | | | url | http://www.ddbj.nig.ac.jp |
| | | | phone | 81-55-981-6853 |
| | | | fax | 81-55-981-6853 |
| | | | phext | 3207 |
| | | | institute | National Institute of Genetics |
| | | | department | DNA Data Bank of Japan |
| | | | country | Japan |
| | | | state | Shizuoka |
| | | | city | Mishima |
| | | | street | Yata 1111 |
| | | | zip | 411-8540 |
List of Qualifiers for SUBMITTER
Qualifier | Legal characters for each Value (Remarks) | Number of letters |
---|---|---|
ab_name (abbreviation of author name) | alphabets, .[period], ,[comma], -[hyphen], ‘ [single quote as apostrophe] | 64 |
contact (contact person) | alphabets, .[period], ,[comma], -[hyphen], ‘ [single quote as apostrophe], [space] (In order of first, middle, and last names delimited with [space]) | first(64), middle(128), last(64) |
consrtm (consortium) | alphabets, digits, [space], -[hyphen], ‘ [single quote as apostrophe], .[period], _[underscore], ,[comma], ( ) # & @ / ; : + * | 255 |
email | alphabets, digits, @, .[period], -[hyphen], _[underscore] | 64 |
url | All printable characters but [space] | 255 |
phone, fax, phext | digits, -[hyphen] (DO NOT enter + before country code) | 16 |
institute, department | All printable characters but [back-slash], ` [back-quote] | 255 |
country, state | alphabets, digits, [space], -[hyphen], ‘[single quote as apostrophe], .[period], _[underscore], ,[comma], ( ) # & @ / ; : + * | 32 |
city | alphabets, digits, [space], -[hyphen], ‘[single quote as apostrophe], .[period], _[underscore], ,[comma], ( ) # & @ / ; : + * | 64 |
street | alphabets, digits, [space], -[hyphen], ‘[single quote as apostrophe], .[period], _[underscore], ,[comma], ( ) # & @ / ; : + * | 255 |
zip | alphabets, digits, -[hyphen] | 16 |
Requirements for Describing SUBMITTER
- Basically, one SUBMITTER must be entered for each entry. However, COMMON can be used to describe a SUBMITTER that is common to all entries.
When SUBMITTER is described in COMMON, SUBMITTER cannot be used for the other entries in the same annotation file.
- Submitters are the persons who are responsible for the contents of the submitted data and have the right to update the data.
- Qualifier: ab_name in SUBMITTER can be used repeatedly for multiple submitters. The submitters are shown in the released file in the order given in the annotation file.
- A contact person whom DDBJ can contact about the data must be specified with Qualifier: contact.
- The abbreviation of the author name, following the format of REFERENCE author, should be described in the Value of Qualifier: ab_name.
- Value format:
- last name[comma]initial of first name[period]initial of middle name[period]
- Example:
- Miyashita,Y.
- Robertson,G.R.
Although some names (e.g. a name with a hyphen) may trigger a format warning, they can still be entered.
- The Qualifiers other than ab_name cannot be used repeatedly in SUBMITTER. They can be used only for the contact person. If you would like to submit the information of multiple institutes, please contact us before your submission.
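The ab_name Value format above can be expressed as a regular expression. This is a deliberately strict sketch: the pattern and the name `is_ab_name` are assumptions for illustration, and, as noted above, names that deviate from the pattern may only produce a warning rather than a rejection.

```python
import re

# last name, comma, then one or more initials each followed by a period
AB_NAME = re.compile(r"^[A-Za-z'\-]+,(?:[A-Z]\.)+$")

def is_ab_name(value):
    """Return True if value looks like 'Robertson,G.R.' and fits in 64 characters."""
    return len(value) <= 64 and AB_NAME.match(value) is not None
```

A check like this is useful mainly for catching a missing period or a full name typed into ab_name by mistake.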
REFERENCE
Example: REFERENCE in annotation file (Required)
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| | REFERENCE | | title | Sequence and analysis of mouse ch.8 |
| | | | ab_name | Robertson,G.R. |
| | | | ab_name | Mishima,H. |
| | | | status | Published |
| | | | year | 2003 |
| | | | journal | Nature |
| | | | volume | 8 |
| | | | start_page | 15 |
| | | | end_page | 20 |
List of Qualifiers for REFERENCE
Qualifier | Legal characters for each Value (Remarks) | Number of letters |
---|---|---|
title | All printable characters but [back-slash], ` [back-quote] | 255 |
ab_name (abbreviation of author name) | alphabets, .[period], ,[comma], -[hyphen], ' [single quote as apostrophe] | 64 |
consrtm (consortium) | alphabets, digits, [space], -[hyphen], ' [single quote as apostrophe], .[period], _[underscore], ,[comma], ( ) # & @ / ; : + * | 255 |
status | One of the following: Unpublished, In press, Published | - |
year | digits(4 figures of A.D.) | 4 |
journal | All printable characters but [back-slash], ` [back-quote] (PubMed type abbreviation) | 128 |
volume, start_page, end_page | alphabets, digits, -[hyphen] | 8 |
Requirements for Describing REFERENCE
- At least one REFERENCE must be specified for each entry. However, COMMON can be used to describe a REFERENCE that is common to all entries.
- The abbreviation of the author name, following the format of REFERENCE author, should be described in the Value of Qualifier: ab_name.
- Value format:
- last name[comma]initial of first name[period]initial of middle name[period]
- Example:
- Miyashita,Y.
- Robertson,G.R.
Please disregard any warning message about a name format error (e.g. a name with a hyphen).
- If the Value of status is “In press”, Qualifier: journal is also mandatory.
- If the Value of status is “Published”, Qualifiers: journal, volume, start_page and end_page are also mandatory.
- Please enter “Unpublished” as the status if you have not prepared any publication.
- If the journal has an ISO abbreviation, please enter it as the Value of journal.
- If you need to enter two or more REFERENCE features, please enter the REFERENCE directly related to your sequences first, and then add the other(s) that would be helpful for understanding the data.
- When you use REFERENCE features for both the COMMON entry and other entries, the REFERENCE feature(s) specified for each entry will be loaded into DDBJ after the one(s) given by the COMMON entry.
- When you cite two or more REFERENCE features for an entry, they will be shown in the DDBJ flat file in the same order as in the annotation file.
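The dependency between status and the other Qualifiers can be written as a small lookup table. The sketch below assumes a REFERENCE has already been collected into a dictionary of Qualifier to Value; the names `REQUIRED_BY_STATUS` and `missing_reference_qualifiers` are hypothetical.

```python
# Qualifiers that become mandatory for each legal status Value.
REQUIRED_BY_STATUS = {
    "Unpublished": set(),
    "In press": {"journal"},
    "Published": {"journal", "volume", "start_page", "end_page"},
}

def missing_reference_qualifiers(reference):
    """Return the sorted names of mandatory Qualifiers absent from a REFERENCE."""
    status = reference.get("status")
    if status not in REQUIRED_BY_STATUS:
        raise ValueError(f"illegal status: {status!r}")
    return sorted(q for q in REQUIRED_BY_STATUS[status] if not reference.get(q))
```

Running it on the Published example above returns an empty list, since journal, volume, start_page and end_page are all present.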
DATE
Example: DATE/hold_date in annotation file
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| COMMON | DATE | | hold_date | 20231125 |
Requirements for Describing DATE
- DATE and hold_date must be described in the COMMON entry. If the date is not common to all entries, please prepare a separate file for each date.
- If you want to keep your data confidential until a specific date, please set the date with 8 digits (e.g. 20231125).
- Delimiters (e.g. - (hyphen), / (slash)) are not allowed in the Value of hold_date.
- Do not enter any DATE if your data should be open to the public immediately.
- If you set a hold_date, your data will be released according to the Principle of “Hold-Until-Published” data release.
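A hold_date Value can be checked locally before submission. The following sketch enforces the 8-digit rule and additionally verifies that the digits form a real calendar date; `parse_hold_date` is a hypothetical helper name.

```python
import re
from datetime import date, datetime

def parse_hold_date(value):
    """Parse an 8-digit hold_date such as '20231125' into a date object.

    Raises ValueError for delimiters, wrong length, or impossible dates.
    """
    if re.fullmatch(r"[0-9]{8}", value) is None:
        raise ValueError(f"hold_date must be exactly 8 digits: {value!r}")
    return datetime.strptime(value, "%Y%m%d").date()
```

Values such as `2023-11-25` or `20231345` raise `ValueError` instead of being silently accepted.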
COMMENT/ST_COMMENT
Example: COMMENT and ST_COMMENT in annotation file
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| | COMMENT | | line | This clone was obtained at our laboratory. |
| | COMMENT | | line | Please visit our web site. |
| | | | line | URL:http://www.ddbj.nig.ac.jp |
| | ST_COMMENT | | tagset_id | Genome-Assembly-Data |
| | | | Assembly Method | GS De Novo Assembler v. 2.0 |
| | | | Assembly Name | Mmus_1.0 |
| | | | Genome Coverage | 50x |
| | | | Sequencing Technology | 454 GS FLX; ABI 3730 |
※ There are two kinds of COMMENTs, “general COMMENT” and “structured COMMENT”.
Requirements for Describing COMMENT (General COMMENT)
- Please use general COMMENT if you want to describe additional information for your data.
- The text is wrapped automatically at 60 characters, including spaces. If you want to break the line at another position, add another Qualifier: line.
- All printable characters except [back-slash] are legal for the Value of Qualifier: line.
- The COMMON entry can be used to describe a COMMENT that is common to all entries.
- When you enter multiple COMMENT features, please give each COMMENT its own Feature column entry.
- When an entry has both COMMENT features specific to it and COMMENT features given in the COMMON entry, the DDBJ flat file shows the COMMENT(s) from the COMMON entry first, followed by the one(s) specific to the entry.
- When you describe two or more COMMENT features for an entry, they will be shown in the DDBJ flat file in the same order as in the annotation file.
- For EST submissions, a particular COMMENT description is required; see For EST Submissions.
Requirements for Describing ST_COMMENT (Structured Comment)
- ST_COMMENT is a feature to describe the structured comment in the flat file.
- Although ST_COMMENT can be defined by the user community, an ST_COMMENT in the predetermined format is required to submit sequence data derived from a genome project (including WGS) or a transcriptome project (including TSA).
- ST_COMMENT is composed of a dataset name (tagset_id), item names (user-defined Qualifiers) and item values (Values).
- In the first line of the ST_COMMENT feature, describe tagset_id as the Qualifier and the dataset name as its Value.
In the case of a genome project, describe “Genome-Assembly-Data” as the Value of the tagset_id qualifier.
In the case of a transcriptome project, describe “Assembly-Data” as the Value of the tagset_id qualifier.
- Describe the name of an item as the Qualifier and its value as the Value. For Genome-Assembly-Data and for Assembly-Data, use the Qualifiers listed in the respective tables below.

List of Qualifiers for Genome-Assembly-Data (Required)
| Qualifier | Description | Remarks |
|---|---|---|
| Assembly Method | Name of program and the version used assembling sequences. | Mandatory. The program version must be presented just after “ v. ” (e.g. Velvet v. 2.0) |
| Assembly Name | Name that the submitter has given to that assembly of the genome. | Mandatory for Eukaryote. We recommend the format: [abbreviated name of species or common name of organism] + [version] (e.g. Btau_4.0) |
| Genome Coverage | Approximate sequencing depth. | Mandatory. (e.g. 125x) Use “Unknown” when the coverage is not known. |
| Sequencing Technology | Platform(s) used to generate the sequence. | Mandatory. Use a semicolon followed by a space to separate multiple platforms (e.g. 454 GS FLX; ABI 3730) |

List of Qualifiers for Assembly-Data (Required)
| Qualifier | Description | Remarks |
|---|---|---|
| Assembly Method | Name of program and the version used assembling sequences. | Mandatory. The program version must be presented just after “ v. ” (e.g. Velvet v. 2.0) |
| Assembly Name | Name and version for assembled sequences | Recommended format: [abbreviated name of species or common name of organism] + [version] (e.g. Btau_4.0) |
| Coverage | Approximate sequencing depth. | (e.g. 125x) Use “Unknown” when the coverage is not known. |
| Sequencing Technology | Platform(s) used to generate the sequence. | Mandatory. Use a semicolon followed by a space to separate multiple platforms (e.g. 454 GS FLX; ABI 3730) |

- If you have any questions about describing ST_COMMENT, please contact us by email prior to your submission.
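The mandatory items of the two tagsets can be checked mechanically. The sketch below covers only the items marked Mandatory without a condition (Assembly Name, which is mandatory only for Eukaryote, is left out) and the “ v. ” rule for Assembly Method; `MANDATORY` and `check_st_comment` are hypothetical names.

```python
# Unconditionally mandatory Qualifiers per tagset_id.
MANDATORY = {
    "Genome-Assembly-Data": ["Assembly Method", "Genome Coverage", "Sequencing Technology"],
    "Assembly-Data": ["Assembly Method", "Sequencing Technology"],
}

def check_st_comment(tagset_id, items):
    """Return a list of problems for one ST_COMMENT given as {Qualifier: Value}."""
    problems = [
        f"missing {name}" for name in MANDATORY[tagset_id] if not items.get(name)
    ]
    method = items.get("Assembly Method", "")
    if method and " v. " not in method:
        problems.append("Assembly Method lacks ' v. ' before the version")
    return problems
```

The Genome-Assembly-Data example shown earlier in this section passes these checks with an empty result.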
Biological Features
Example: source and CDS features in annotation file (Required)
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| | source | 1..12297 | organism | Mus musculus |
| | | | mol_type | genomic DNA |
| | | | chromosome | 8 |
| | | | clone | PC0110 |
| | CDS | join(<1..456,609..879,1070..1213) | product | protein kinase |
| | | | codon_start | 2 |
| | rRNA | 1279..3000 | product | 18S rRNA |
| | CDS | complement(join(3213..4981,9901..11677)) | gene | tbpA |
| | | | product | TATA-box binding protein |

※ For detailed definitions and descriptions of Biological features, please read Feature Table Definition.
Requirements for Describing Feature/Location/Qualifier
- In Feature Table Definition, each Qualifier has a / [slash] at its head; however, do not use slashes for Qualifiers in the annotation file.
- Qualifiers marked with * (organism, mol_type) are mandatory. The source feature and at least one other feature are mandatory for each entry. Please be sure to enter them correctly.
- You can find the rules for describing Location in Description of Location.
- You can see which Qualifiers are legal for each Feature in Feature/Qualifier Usage Matrix. Some Features have mandatory Qualifier(s). Please be sure to specify Features and Qualifiers exactly as named in the table. The names are strictly defined: they are case-sensitive, use “_” [underscore], and so on.
- See also Sample annotation file and Example of Submission.
- When you describe CDS features, Protein Coding Sequence; CDS feature would be helpful.
- Files containing CDS feature(s) should be checked with UME or transChecker.
Requirements for Describing Value
- The legal character type for Values depends on the Qualifiers as shown in the table, Feature/Qualifier Usage Matrix and Feature Table Definition.
- Please be sure to input (or not to input) Values in accordance with value types in tables.
DIVISION
The DIVISION feature in the annotation file indicates that the entries belong to exactly one of CON / ENV / EST / GSS / HTC / HTG / STS / SYN / TSA.
Example: DIVISION in annotation file
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| COMMON | DIVISION | | division | EST |
Requirements for Describing DIVISION
- Please enter the division name, 3 capital letters in the Value for Qualifier: division.
- In principle, please describe the DIVISION feature in the COMMON entry.
DATATYPE
The DATATYPE feature indicates that the entries correspond to one of WGS, TLS, TPA, or TPA-WGS.
Example: DATATYPE in annotation file
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| COMMON | DATATYPE | | type | WGS |
Requirements for Describing DATATYPE
- Please enter the name of type, WGS, TLS, TPA, or TPA-WGS in the Value for Qualifier: type.
- Please describe the DATATYPE feature in the COMMON entry.
KEYWORD
Building on the categories given in the DIVISION and DATATYPE sections, KEYWORDs with a controlled vocabulary describe more detailed and specific information, such as experimental methods.
Please see INSDC agreed methodological keywords, which qualify controlled keyword terms.
Example: KEYWORD in annotation file
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| | KEYWORD | | keyword | ENV |

Specified values for KEYWORD/keyword (Required)
| Categories | Values for keyword | Remarks |
|---|---|---|
| WGS | WGS | See also For WGS and scaffold CON. |
| ENV | ENV | |
| EST | EST | |
| | some other terms | Please refer to For EST Submissions. |
| HTC | HTC, some other terms | Please contact us before your submission. |
| HTG | HTG, some other terms | Depending on the phase. Please contact us before your submission. |
| GSS | GSS | |
| STS | STS | |
| TPA | TPA, Third Party Data | |
| | TPA:inferential or TPA:experimental | Either of the two is mandatory. |
| TSA | TSA, Transcriptome Shotgun Assembly | |
| TLS | TLS, Targeted Locus Study | |
| Others | | Please contact us before your submission. |
Requirements for Describing KEYWORD
- Please describe the specified values for Qualifier: keyword.
- Please contact us before your submission to make sure the detail descriptions of KEYWORD.
For WGS and scaffold CON
- For WGS and scaffold CON, please select a keyword from the following list.
- STANDARD_DRAFT
- HIGH_QUALITY_DRAFT
- IMPROVED_HIGH_QUALITY_DRAFT
- NON_CONTIGUOUS_FINISHED
Example: WGS draft genome (Required)
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| | KEYWORD | | keyword | WGS |
| | | | keyword | STANDARD_DRAFT |
For EST Submissions
- For EST submissions, at least two keywords are required: EST and one of the following three terms.
- For 5’ EST submissions: 5’-end sequence (5’-EST)
- For 3’ EST submissions: 3’-end sequence (3’-EST)
- Other than the above two cases: unspecified EST
Example: 5’ EST (Required)
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| | KEYWORD | | keyword | EST |
| | | | keyword | 5’-end sequence (5’-EST) |

- In the case of 3’ EST, to indicate whether your sequences correspond to the anti-sense or the sense strand, please describe one of the following two COMMENTs.
Example: For 3’ EST, anti-sense strand (Required)
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| | COMMENT | | line | 3’-EST sequences are presented as anti-sense strand. |

Example: For 3’ EST, sense strand (Required)
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| | COMMENT | | line | 3’-EST sequences are presented as sense strand. |
For HTG submissions
- For HTG submissions, we recommend using keywords to indicate the sequencing status of HTG data.
Example I: containing unordered pieces (Required)
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| | KEYWORD | | keyword | HTG |
| | | | keyword | HTGS_PHASE1 |
| | | | keyword | HTGS_DRAFT |

Example II: containing only ordered pieces (Required)
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| | KEYWORD | | keyword | HTG |
| | | | keyword | HTGS_PHASE2 |
DBLINK
The DBLINK line is used to link to other databases, such as BioProject ID, BioSample ID and Sequence Read Archive (DRA/ERA/SRA).
Example: DBLINK in annotation file (Required)
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| | DBLINK | | project | PRJDB12345 |
| | | | biosample | SAMD90000000 |
| | | | sequence read archive | DRR999000 |
| | | | sequence read archive | DRR999001 |
Requirements for Describing DBLINK
- If you have registered your project in the DDBJ BioProject Database, please enter the project ID in the Value for Qualifier: project. Likewise, enter the sample ID of DDBJ BioSample in the Value for Qualifier: biosample.
- An assembly built from raw reads in the Sequence Read Archive is required to have the run accession number(s) in the Value for Qualifier: sequence read archive.
- See also DDBJ BioProject Database, DDBJ BioSample Database and DDBJ Sequence Read Archive.
locus_tag
For whole-genome-scale submissions with many annotated features, we recommend using the qualifier locus_tag for the Biological Features that indicate protein products (CDSs) and transcripts (rRNA, tRNA and so on).
The locus_tag prefix and BioSample ID should be registered at DDBJ BioSample Database in advance.
source: ff_definition
ff_definition is a Qualifier that is not defined in The DDBJ/EMBL/GenBank Feature Table: Definition. One ff_definition can be described in an entry, if necessary.
Example: ff_definition in annotation file
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| | source | 1..516 | organism | Mus musculus |
| | | | mol_type | mRNA |
| | | | ff_definition | @@[organism]@@ mRNA, clone: @@[clone]@@ |
| | | | clone | PC0110 |
Value formats of ff_definition
| Categories | Format for the value of ff_definition |
|---|---|
| WGS | @@[organism]@@ @@[strain]@@ DNA, @@[submitter_seqid]@@, [other information] |
| BAC/YAC genomic clones in unfinished phase (HTG) | @@[organism]@@ DNA, chromosome @@[map]@@, [BAC/YAC] clone: @@[clone]@@, *** SEQUENCING IN PROGRESS *** |
| BAC/YAC genomic clones in finished phase | @@[organism]@@ DNA, chromosome @@[map]@@, [BAC/YAC] clone: @@[clone]@@ |
| EST | @@[organism]@@ mRNA, clone: @@[clone]@@, [other information] |
| | @@[organism]@@ cDNA, clone: @@[clone]@@, [other information] |
| GSS | @@[organism]@@ DNA, clone: @@[clone]@@, [other information] |
| STS | @@[organism]@@ DNA, @@[map]@@, [marker name], sequence tagged site |
| Others | Please contact us before your submission, if necessary. |
Requirements for Describing source: ff_definition
- The Qualifier: ff_definition can be described on source, one of Biological features.
- You can describe only one ff_definition for one entry.
- The value of ff_definition will be used for the DEFINITION line in the DDBJ flat file format. Please refer to Sample annotation file and The relationships between annotation file and DDBJ flat file.
- For the Value of ff_definition, a meta description (e.g. @@[organism]@@ and @@[clone]@@) is available to quote the values of other qualifiers. The meta description, a Qualifier name enclosed by “@@[” and “]@@”, will be replaced by the value of the target Qualifier (“organism” and “clone” in the above sample) when ff_definition is reflected in the DEFINITION line of the DDBJ flat file.
- In principle, DEFINITION is described according to the above table. If you would like to enter a different value for the ff_definition qualifier, please contact us before your submission.
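The replacement of meta descriptions can be pictured as a simple template substitution. The sketch below is only an illustration of the behaviour described above, with the hypothetical helper `expand_ff_definition`; leaving unknown Qualifier names untouched is an assumption of this sketch, not documented DDBJ behaviour.

```python
import re

# Matches a meta description such as @@[organism]@@ and captures the name.
META = re.compile(r"@@\[([^\]]+)\]@@")

def expand_ff_definition(template, qualifiers):
    """Replace each @@[name]@@ in template with qualifiers[name]."""
    def substitute(match):
        name = match.group(1)
        # Keep the meta description as-is when the Qualifier is not given.
        return qualifiers.get(name, match.group(0))
    return META.sub(substitute, template)
```

With the example above, the template `@@[organism]@@ mRNA, clone: @@[clone]@@` yields `Mus musculus mRNA, clone: PC0110`.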
assembly_gap: Sequencing Gap Region
In whole-genome-scale sequencing such as HTG, or in large-scale assemblies of EST sequences such as the TSA division, entries may contain sequencing gaps that result from the assembly process or from regions that are difficult to read. You can indicate them by describing “n” in the sequence. In the annotation file, you have to indicate the regions of sequencing gaps with assembly_gap features.
Example: assembly_gap in annotation file (Required)
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| | assembly_gap | 101..200 | estimated_length | unknown |
| | | | gap_type | within scaffold |
| | | | linkage_evidence | paired-ends |
Requirements for Describing assembly_gap: Sequencing Gap Region
- Though the assembly_gap feature is one of Biological features, the format is slightly different from others.
- You can NOT use join, order, complement for the Location of assembly_gap features.
Length of the gap is unknown
The location span of the assembly_gap feature for an unknown gap has to be specified by the submitter. The specified gap length has to be reasonable (1000 or less) and will be indicated as “n”s in the sequence.
It is required to indicate unknown for the Value of Qualifier: estimated_length on the assembly_gap feature.
In the case of a transcriptome record (TSA division), the Value of estimated_length of assembly_gap features must be an integer, not “unknown”.
Length of the gap is estimated
The location span of the assembly_gap feature for a “known” gap should be indicated by the number of “n”s in the sequence. It is required to indicate known for the Value of Qualifier: estimated_length on the assembly_gap feature.
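The relationship between an assembly_gap Location and the “n”s in the sequence can be checked with a few lines of code. This is a sketch under the rules stated above (simple start..end span, span made only of “n”, unknown gaps of 1000 bases or less); `check_assembly_gap` is a hypothetical helper.

```python
import re

def check_assembly_gap(sequence, location, estimated_length):
    """Return a list of problems for one assembly_gap feature."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match is None:
        # join, order and complement are not allowed for assembly_gap.
        return ["location must be a simple start..end span"]
    start, end = int(match.group(1)), int(match.group(2))
    problems = []
    span = sequence[start - 1:end]  # Locations are 1-based and inclusive
    if end > len(sequence) or set(span.lower()) != {"n"}:
        problems.append("location does not cover a run of n in the sequence")
    if estimated_length == "unknown" and end - start + 1 > 1000:
        problems.append("unknown gap is longer than 1000 bases")
    return problems
```

The check is one-directional: it confirms that the annotated span is a gap, not that every run of “n” has been annotated.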
TOPOLOGY
Please enter circular for the Qualifier of the TOPOLOGY feature when the topology of the whole nucleotide molecule is circular, that is, when the first and the last positions are joined in the real molecule.
e.g. Complete genome sequence of a circular virus
Example: TOPOLOGY in annotation file
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| | TOPOLOGY | | circular | |
Requirements for Describing TOPOLOGY
- In DDBJ flat file, topology is indicated in the LOCUS line. See also Sample annotation file.
TPA/TSA: PRIMARY_CONTIG, Citation of Primary Entries
PRIMARY_CONTIG, entry, and primary_bases are the Feature and Qualifiers prepared to describe the alignments of primary entries for TPA/TSA submission.
Example: PRIMARY_CONTIG in annotation file
| Entry | Feature | Location | Qualifier | Value |
|---|---|---|---|---|
| | PRIMARY_CONTIG | 1..438 | entry | ZZ000010.1 |
| | | | primary_bases | 1..438 |
| | PRIMARY_CONTIG | 377..696 | entry | ZZ000011.1 |
| | | | primary_bases | 1..320 |
| | | | complement | |
| | PRIMARY_CONTIG | 590..1191 | entry | ZZ000022.0 |
| | | | primary_bases | 1..601 |
Qualifiers available for PRIMARY_CONTIG
Qualifier | Remarks for the value description |
---|---|
entry | Accession number of the cited primary entry (with version number) |
primary_bases | The base span cited from the primary sequence. Example: 1..500 |
complement | To indicate citing the complementary strand of primary sequence |
Requirements for Describing TPA/TSA: PRIMARY_CONTIG, Citation of Primary Entries
- Please specify TPA as the Value of DATATYPE/type, or TSA as the Value of DIVISION/division, in the annotation file.
- In PRIMARY_CONTIG, it is necessary to refer to the accession number(s) (with version) in the primary database and to enter the base spans of the primary sequences that contribute to the TPA/TSA sequence.
- You cannot use join, order or complement in the Location column. Please describe each PRIMARY_CONTIG with its own location, even within the same entry.
- If the primary entry has been submitted to DDBJ/EMBL-Bank/GenBank, a version number is required for the accession number. If the primary entry is not public, please use 0 [zero] for the version, e.g. ZZ000022.0
- If the primary sequence corresponds to the reverse strand of the TPA/TSA sequence, please add the complement qualifier.
- For details, refer to Sample annotation file and The relationships between annotation file and DDBJ flat file.
Sample annotation
| General data | Protein coding sequence (CDS) | CDS |
| | Ribosomal RNA | 16S_rRNA |
| | ITS (Internal Transcribed Spacer) | ITS |
| | Microsatellite marker | Microsatellite marker |
| | Mitochondrial sequence | mtDNA |
| | ENV (Environmental Samples) | ENV |
| Genome data | complete genome sequence (Bacteria) | complete_genome_BCT |
| | Finished level genome sequence with biological feature (Eukaryote) | Finished_genome_eukaryote |
| | WGS (Whole Genome Shotgun) without annotation | WGS |
| | WGS (Whole Genome Shotgun) with annotation | WGS_annotation |
| | WGS; piece of scaffold CON | WGS_piece_CON |
| | CON entries for WGS scaffold | WGS_scaffold |
| | MAGs (Metagenome-Assembled Genomes) for Complete genome | MAGs_CompleteGenome |
| | MAGs (Metagenome-Assembled Genomes) for Draft genome | MAGs_WGS |
| | AGP file for CON entries | AGP |
| | GSS (Genome Survey Sequences) | GSS |
| | HTG (High Throughput Genomic Sequences) | HTG |
| Large transcripts data | TSA (Transcriptome Shotgun Assembly); assembled from EST | TSA |
| | TSA; assembled from short reads without annotation | TSA_SRA_assemble_NoANN |
| | TSA; assembled from short reads with annotation | TSA_SRA_assemble_Ann |
| | EST (Expressed Sequence Tags) | EST |
| TLS (Targeted Locus Study) | TLS (Targeted Locus Study) | TLS |
| TPA (Third Party Data) | TPA (Third Party Data) | TPA |
| | TPA assembly (Third Party Data) | TPA-assembly_WGS |
| | TPA assembly (Third Party Data) | TPA-assembly |
| Annotation: Flat file | Protein coding sequence (CDS) | ann2-ff |
AGP File
An AGP file is required to submit CON entries. It is a tab-delimited text file of nine columns that describes the order, orientation and other properties of the piece entries used to construct a CON entry. You can create the file with scripts, spreadsheets (such as MS Excel), text editors and so on.
A sequence file is not required when the sequence can be constructed from the AGP file.
The AGP file format was initially developed by UCSC, EBI and NCBI.
Example: AGP file
#1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|
scaffold1 | 1 | 1345 | 1 | W | BZZZ01123456.1 | 1 | 1345 | + |
scaffold1 | 1346 | 2845 | 2 | N | 1500 | scaffold | yes | align_genus |
scaffold1 | 2846 | 4301 | 3 | W | BZZZ01123457.1 | 1 | 1456 | + |
scaffold1 | 4302 | 4401 | 4 | U | 100 | scaffold | yes | align_genus |
scaffold1 | 4402 | 5631 | 5 | W | BZZZ01123458.1 | 1 | 1230 | - |
scaffold2 | 1 | 650 | 1 | W | BZZZ01123486.1 | 1 | 650 | + |
scaffold2 | 651 | 750 | 2 | N | 100 | scaffold | yes | align_genus |
scaffold2 | 751 | 2980 | 3 | W | BZZZ01123488.1 | 1 | 2230 | - |
Format and Syntax
The format of the AGP file must be validated with UME.
- The AGP file consists of nine columns.
- Columns must be tab delimited.
- The AGP file must contain NO spaces or blank lines.
- Comment lines, starting with a # symbol, are encouraged at the head of the file.
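The layout rules above can be sketched as a small Python check. This is only an illustration and not a replacement for UME: the function name `check_agp_layout` is hypothetical, and treating any space character in a data line as an error is this example's reading of the "NO space" rule.

```python
def check_agp_layout(text):
    """Return (line_number, message) pairs for layout problems in AGP text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip() == "":
            problems.append((number, "blank line"))
            continue
        if line.startswith("#"):
            continue  # comment line; not checked further in this sketch
        if " " in line:
            problems.append((number, "space character"))
        if len(line.split("\t")) != 9:
            problems.append((number, "expected nine tab-delimited columns"))
    return problems


sample = (
    "# example AGP\n"
    "scaffold1\t1\t1345\t1\tW\tBZZZ01123456.1\t1\t1345\t+\n"
    "scaffold1\t1346\t2845\t2\tN\t1500\tscaffold\tyes\talign_genus\n"
)
print(check_agp_layout(sample))  # no problems are reported for these lines
```

The sketch only looks at the shape of each line. The meaning of the individual columns is described in the tables that follow.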
Description of each column (columns 1 to 5)
column | content | description | |
---|---|---|---|
1 | object | CON entry name, the identifier for the object being assembled, i.e. a chromosome, scaffold or contig. The CON entry name must match the corresponding entry name in the annotation file, as described in Annotation File. | |
2 | object_beg | The starting coordinates of the component/gap on the object. | |
3 | object_end | The ending coordinates of the component/gap on the object. | |
4 | part_number | The line count for the components/gaps that make up the object. | |
5 | component_type | The sequencing status of the component. These typically correspond to keywords in the International Sequence Database (GenBank/EMBL/DDBJ) submission. Current acceptable values are: | |
A | Active Finishing | ||
D | Draft HTG (often phase1 and phase2 are called Draft, whether or not they have the draft keyword) | ||
F | Finished HTG (phase3) | ||
G | Whole Genome Finishing | ||
O | Other sequence (typically means no HTG keyword) | ||
P | Pre Draft | ||
W | WGS contig | ||
N | gap with specified size | ||
U | gap of unknown size, defaulting to 100 bases |
The meaning of columns 6 to 9 depends on whether column 5 indicates a component or a gap.
Description of each column (columns 6 to 9): if column 5 contains A, D, F, G, O, P or W (that is, not N or U)
column | content | Description | |
---|---|---|---|
6 | component_id | The accession number with version or local identifier for the component | |
7 | component_beg | The beginning of the part of the component that contributes to the object | |
8 | component_end | The end of the part of the component that contributes to the object | |
9 | orientation | The orientation of the component relative to the object. Acceptable values are: | |
+ | plus | ||
- | minus | ||
? | unknown | ||
0 | zero; unknown (deprecated) | ||
na | irrelevant | ||
By default, components with "?", "0" or "na" are treated as if they had + orientation. |
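A minimal Python sketch of how a component line could be checked is shown below. The helper names are hypothetical. The rule that the object span (columns 2 and 3) and the component span (columns 7 and 8) must have the same length comes from the general AGP specification rather than from the tables on this page, so treat it as an assumption of this example.

```python
COMPONENT_TYPES = {"A", "D", "F", "G", "O", "P", "W"}
ORIENTATIONS = {"+", "-", "?", "0", "na"}


def effective_orientation(value):
    """Map column 9 to the strand actually used: only "-" means minus."""
    if value not in ORIENTATIONS:
        raise ValueError(f"unknown orientation: {value!r}")
    return "-" if value == "-" else "+"


def check_component_line(fields):
    """Return problems for one component line, given its nine column values."""
    if fields[4] not in COMPONENT_TYPES:
        return ["column 5 is not a component type"]
    problems = []
    object_length = int(fields[2]) - int(fields[1]) + 1
    component_length = int(fields[7]) - int(fields[6]) + 1
    if object_length != component_length:
        problems.append(
            f"object span is {object_length} bp but component span is {component_length} bp"
        )
    if fields[8] not in ORIENTATIONS:
        problems.append("column 9 is not an accepted orientation")
    return problems


row = "scaffold1\t4402\t5631\t5\tW\tBZZZ01123458.1\t1\t1230\t-".split("\t")
print(check_component_line(row))    # both spans are 1230 bp, so no problems
print(effective_orientation("na"))  # treated as "+"
```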
Description of each column (columns 6 to 9): if column 5 contains N or U
column | content | description | |
---|---|---|---|
6 | gap_length | [component_type: N] The length of the gap (bp); [component_type: U] 100 | |
7 | gap_type | This column specifies the gap type. Accepted values: | |
scaffold | a gap between two sequence contigs in a scaffold (superscaffold or ultra-scaffold). | ||
contig | an unspanned gap between two sequence contigs. | ||
centromere | a gap inserted for the centromere. | ||
short_arm | a gap inserted at the start of an acrocentric chromosome. | ||
heterochromatin | a gap inserted for an especially large region of heterochromatic sequence (may also include the centromere) | ||
telomere | a gap inserted for the telomere. | ||
repeat | an unresolvable repeat. | ||
8 | linkage | The linkage between the adjacent lines (Values: "yes" or "no") | |
9 | linkage evidence | This specifies the type of evidence used to assert linkage (as indicated in column 8). Accepted values: | |
na | used when no linkage is being asserted (column 8 is 'no') | ||
paired-ends | paired sequences from the two ends of a DNA fragment. | ||
align_genus | alignment to a reference genome within the same genus. | ||
align_xgenus | alignment to a reference genome within another genus. | ||
align_trnscpt | alignment to a transcript from the same species. | ||
within_clone | sequence on both sides of the gap is derived from the same clone, but the gap is not spanned by paired-ends. The adjacent sequence contigs have unknown order and orientation | ||
clone_contig | linkage is provided by a clone contig in the tiling path (TPF). For example, a gap where there is a known clone, but there is not yet sequence for that clone. | ||
map | linkage asserted using a non-sequence based map such as RH, linkage, fingerprint or optical. | ||
strobe | strobe sequencing (PacBio). | ||
unspecified | used when converting old AGPs that lack a field for linkage evidence into the new format. | ||
If there are multiple lines of evidence to support linkage, all can be listed using a ';' delimiter (e.g. "paired-ends;align_xgenus"). |
- The length of an 'unknown' gap should be 100 bp. Specify "U" as the value of component_type and "100" as the value of gap_length.
- Continuity is indicated by the combination of the gap_type and linkage values. Please refer to the following table.
Interpretation of gap_type and linkage combinations
gap_type | linkage | Interpretation and description |
---|---|---|
Within-scaffold gaps: sequences on either side of the gap are in a single scaffold. | ||
scaffold | yes | Do not break scaffold. There is evidence linking sequence contigs on both sides of the gap. |
repeat | yes | Do not break scaffold. If an unresolvable repeat unit is spanned by linkage evidence, the linkage will be 'yes'. |
Scaffold-breaking gaps: sequences on either side of the gap are in separate scaffolds. | ||
contig | no | Break scaffold. A contig gap indicates there is no evidence to link the adjacent sequence contigs. |
repeat | no | Break scaffold. If an unresolvable repeat unit is not spanned by linkage evidence, the linkage will be 'no'. |
centromere, short_arm, heterochromatin, telomere | no | Break scaffold. Gaps with these biological types are used for laying out scaffolds along a chromosome. |
Invalid gap/linkage combinations | ||
contig | yes | Invalid. If there is evidence of linkage between the adjacent sequence contigs, the gap type should be scaffold. |
scaffold | no | Invalid. If there is no evidence of linkage between the adjacent sequence contigs, the gap type should be contig. |
centromere, short_arm, heterochromatin, telomere | yes | Invalid. It is invalid to use these biological types within a scaffold. |
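The combinations in the table above can be encoded as a small lookup. The following Python sketch is one possible way to do it. The function names and the return labels "keep", "break" and "invalid" are choices made for this example and are not DDBJ or UME terminology.

```python
BIOLOGICAL_GAP_TYPES = {"centromere", "short_arm", "heterochromatin", "telomere"}


def interpret_gap(gap_type, linkage):
    """Return "keep", "break" or "invalid" for a gap_type/linkage pair."""
    if linkage not in ("yes", "no"):
        raise ValueError(f"linkage must be 'yes' or 'no', got {linkage!r}")
    linked = linkage == "yes"
    if gap_type == "scaffold":
        return "keep" if linked else "invalid"
    if gap_type == "repeat":
        return "keep" if linked else "break"
    if gap_type == "contig" or gap_type in BIOLOGICAL_GAP_TYPES:
        return "invalid" if linked else "break"
    raise ValueError(f"unknown gap_type: {gap_type!r}")


def unknown_gap_ok(component_type, gap_length):
    """A "U" gap must be declared with a gap_length of exactly 100."""
    return component_type != "U" or int(gap_length) == 100


print(interpret_gap("scaffold", "yes"))  # keep: do not break the scaffold
print(interpret_gap("telomere", "yes"))  # invalid: biological gap inside a scaffold
print(unknown_gap_ok("U", "250"))        # False
```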