Submission File Format

Sequence File

The sequence file is a text file in a FASTA-like format that contains all of your nucleotide sequences. Each entry consists of one header line starting with ">", followed by the sequence on the second and subsequent lines. You must insert the end flag (//) at the end of each sequence.

Example: Sequence File

>CLN01  <-- Entry name for the first one
ggacaggctgccgcaggagccaggccgggagcaggaagaggcttcgggggagccggagaa
ctgggccagatgcgcttcgtgggcgaagcctgaggaaaaagagagtgaggcaggagaatc
gcttgaaccccggaggcggaaccgcactccagcctgggcgacagagtgagactta
//      <-- End flag
>CLN02  <-- Entry name for the second one
ctcacacagatgcgcgcacaccagtggttgtaacagaagcctgaggtgcgctcgtggtca
gaagagggcatgcgcttcagtcgtgggcgaagcctgaggaaaaaatagtcattcatataa
atttgaacacacctgctgtggctgtaactctgagatgtgctaaataaaccctctt
//      <-- End flag

Format and Syntax

You must validate the format of the sequence file with UME or Parser.

The first line of each sequence starts with [>], followed by the Entry name.
Entry names must be unique within the sequence file. It is common to use a clone name or isolate name as the unique Entry name.
An Entry name must be fewer than 32 characters long and must not contain a space, " double-quote, = equal, | pipe, > greater-than, [ ] brackets or \ back-slash.
The names and order of the entries must match between the sequence file and the annotation file. Accession numbers are assigned in the order of the entries.
The sequence file must contain NO spaces or blank lines.
In addition to a, t, g and c, you can use the other characters listed in Nucleotide base codes for your nucleotide sequences, if necessary.
In principle, please remove any run of the base code 'n' at the 5' or 3' end of a sequence. For EST submissions in particular, please do not send the raw output of a sequencer; screen your sequences to remove unreliable output, which often occurs at the 5' end.
Remove sequences derived from a vector, linker or adaptor. If you are submitting an artificially constructed sequence itself, such as an expression vector, you do not have to remove them.
Please be sure to put the end flag [//] at the end of each sequence.
For a CON entry, an AGP file can be used as a substitute for the sequence file.
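
The rules above can be sketched as a small pre-check to run before UME or Parser. This is only an illustration under stated assumptions: the function name check_sequence_file and its messages are hypothetical, the double-quote is taken to be the ASCII character, and the official validators remain the authority.

```python
import re

# Characters the rules above forbid in an Entry name: space, ", =, |, >, [, ], \
FORBIDDEN = re.compile(r'[\s"=|>\[\]\\]')

def check_sequence_file(text):
    """Return a list of problems found in a sequence file given as one string."""
    problems = []
    seen = set()
    state = "header"  # expect a ">" line, then sequence lines, then "//"
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            problems.append(f"line {number}: blank line")
            continue
        if re.search(r"\s", line):
            problems.append(f"line {number}: contains a space")
        if line.startswith(">"):
            if state == "sequence":
                problems.append(f"line {number}: previous entry has no end flag")
            name = line[1:]
            if not name or len(name) >= 32 or FORBIDDEN.search(name):
                problems.append(f"line {number}: invalid entry name {name!r}")
            if name in seen:
                problems.append(f"line {number}: duplicate entry name {name!r}")
            seen.add(name)
            state = "sequence"
        elif line == "//":
            if state != "sequence":
                problems.append(f"line {number}: end flag without an entry")
            state = "header"
        elif state != "sequence":
            problems.append(f"line {number}: sequence data outside an entry")
    if state == "sequence":
        problems.append("last entry has no end flag")
    return problems
```

An empty list means that none of these particular checks failed; it does not mean the file will pass the official validation.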

Annotation File

The annotation file is a tab-delimited text file with five columns (Entry, Feature, Location, Qualifier, and Value). It contains your data other than sequences, such as submitters, references and biological features.
You can create the file with scripts, spreadsheet software (such as MS Excel), text editors and so on.

Example: Annotation file (Required)

Entry	Feature	Location	Qualifier	Value
COMMON	SUBMITTER		ab_name	Robertson,G.R.
			ab_name	Mishima,H.
			contact	Hanako Mishima
			email	mishima@ddbj.nig.ac.jp
			institute	National Institute of Genetics
			department	DNA Data Bank of Japan
			country	Japan
			state	Shizuoka
			city	Mishima
			street	Yata 1111
			zip	411-8540
	REFERENCE		title	Mouse Genome Sequencing
			ab_name	Robertson,G.R.
			ab_name	Mishima,H.
			year	2012
			status	Unpublished
	COMMENT		line	Please visit our website
			line	URL: http://www.ddbj.nig.ac.jp/
CLN01	source	1..12297	organism	Mus musculus
			mol_type	genomic DNA
			clone	PC0110
			chromosome	8
	CDS	join(<1..456,609..879,1070..1213)	product	protein kinase
			codon_start	2
CLN02	source	1..12393	organism	Mus musculus
			mol_type	genomic DNA
			clone	PC0210
			chromosome	8
	CDS	9365..9640	product	hypothetical protein

Format and Syntax

You must validate the format of the annotation file with UME or Parser.

Entry: Enter the Entry name in the Entry column. Each Entry name has to correspond to a name in the sequence file, as described in How to Make Sequence File. Leave the Entry column empty until the first line of the next entry.
Feature: There are two types of Features, Biological features and DDBJ original features. Features are described in detail below. Leave the Feature column empty until the first line of the next feature.
Location: A Location can be given in the column next to the Feature column when the Feature is either a Biological feature or the PRIMARY_CONTIG feature.
Qualifier: In principle, a Qualifier is given on every line. Whether each Qualifier is mandatory, optional, or not allowed depends on the Feature. Details are explained below.
Value: The format of a Value depends on its Qualifier. Details are explained below.
Other: A blank line is treated as the end of the annotation file. Therefore, when you enter multiple entries, be sure not to insert a blank line before the end of the file.
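
Because the Entry and Feature columns are left empty on continuation lines, a reader has to carry the last seen values forward. The following is a minimal sketch of that idea, assuming the input has no header row; the function name read_annotation is hypothetical and this is not how UME or Parser are implemented.

```python
def read_annotation(text):
    """Yield (entry, feature, location, qualifier, value) with blanks filled in."""
    entry = feature = location = ""
    for line in text.splitlines():
        if not line.strip():
            break  # a blank line is treated as the end of the file
        cells = line.split("\t")
        cells += [""] * (5 - len(cells))  # pad short lines to five columns
        if cells[0]:
            entry = cells[0]
        if cells[1]:
            # A new feature starts here; its Location may be empty.
            feature, location = cells[1], cells[2]
        yield entry, feature, location, cells[3], cells[4]
```

For example, a line holding only "ab_name" and "Mishima,H." under a COMMON SUBMITTER block is reported with entry COMMON and feature SUBMITTER.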

References for Describing Biological Features

Name	Remarks
Feature Table Definition	version 11.3
Feature/Qualifier usage matrix
Description Examples of Sequence Data	Examples of features in DDBJ flat file

COMMON

COMMON entry for information common to all entries

In the annotation file, the entry name COMMON can be used in the Entry column for information that is common to all entries.
The information described in the COMMON entry is applied to all entries.
Usually, COMMON is used for SUBMITTER/REFERENCE/DATE/COMMENT, but it can also be used for a Biological feature when the Feature, Location, Qualifiers and Values are all common to all entries.

Use of COMMON entry

Meta-base position ‘E’ for the location description

Example: rRNA feature in COMMON entry

Entry	Feature	Location	Qualifier	Value
COMMON	rRNA	<1..>E	product	16S rRNA

Many submissions, such as phylogenetic studies with rRNA sequences, have Feature information whose Qualifiers and Values are common to all entries and differ only in Location because the sequence lengths differ.

In such cases, you can describe the common Feature in the COMMON entry by using the meta-base position 'E' in its Location in place of the end position of each sequence.

Meta-description '@@[entry]@@' is available for clone, submitter_seqid, note, ff_definition

Example: source feature in COMMON entry

Entry	Feature	Location	Qualifier	Value
COMMON	source	1..E	organism	Homo sapiens
			mol_type	genomic DNA
			submitter_seqid	@@[entry]@@
			ff_definition	@@[organism]@@ DNA, @@[submitter_seqid]@@

Some submissions, such as EST, GSS, TSA, TLS, WGS, and WGS scaffold (CON division), have Feature information whose Qualifiers and Values are common to all entries except for the Locations and the clone or contig names.

In such cases, you can describe the source Feature in the COMMON entry, but only if you use the clone or contig names as entry names.

You can use the meta-base position 'E' in its Location in place of the end position of each sequence.
For the Values of clone, submitter_seqid, note, and ff_definition, the meta-description @@[entry]@@ (the word entry enclosed by "@@[" and "]@@") can be used to quote entry names. It is replaced by the entry name taken from the sequence file.
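
To make the two substitutions concrete, here is a rough sketch of how one COMMON feature could be expanded into a copy per entry. The function name expand_common_feature and the data layout are assumptions for illustration; the actual processing at DDBJ may differ.

```python
import re

def expand_common_feature(location, qualifiers, sequences):
    """Expand one COMMON feature into a per-entry copy.

    location   -- Location string that may contain the meta-base position E
    qualifiers -- list of (qualifier, value) pairs from the COMMON entry
    sequences  -- dict mapping entry name to sequence length
    """
    expanded = {}
    for entry, length in sequences.items():
        # E stands for the last base of this entry's sequence.
        loc = re.sub(r"\bE\b", str(length), location)
        # @@[entry]@@ is replaced by the entry name from the sequence file.
        quals = [(q, v.replace("@@[entry]@@", entry)) for q, v in qualifiers]
        expanded[entry] = (loc, quals)
    return expanded
```

With a Location of 1..E and an entry of 516 bases, the expanded Location is 1..516.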

SUBMITTER

Example: SUBMITTER in annotation file (Required)

Entry	Feature	Qualifier	Value
COMMON	SUBMITTER	ab_name	Robertson,G.R.
		ab_name	Mishima,H.
		consrtm	Mouse Genome Consortium
		contact	Hanako Mishima
		email	mishima@ddbj.nig.ac.jp
		url	http://www.ddbj.nig.ac.jp
		institute	National Institute of Genetics
		department	DNA Data Bank of Japan
		country	Japan
		state	Shizuoka
		city	Mishima
		street	Yata 1111
		zip	411-8540

List of Qualifiers for SUBMITTER

Qualifier	Legal characters for each Value (Remarks)	Number of letters
ab_name (abbreviation of author name)	alphabets, .[period], ,[comma], -[hyphen], ' [single quote as apostrophe]	64
contact (contact person)	alphabets, .[period], ,[comma], -[hyphen], ' [single quote as apostrophe], [space] (in order of first, middle, and last names, delimited with [space])	first(64), middle(128), last(64)
consrtm (consortium)	alphabets, digits, [space], -[hyphen], ' [single quote as apostrophe], .[period], _[underscore], ,[comma], ( ) # & @ / ; : + *	255
email	alphabets, digits, @, .[period], -[hyphen], _[underscore]	64
url	All printable characters but [space]	255
institute, department	All printable characters but [back-slash], ` [back-quote]	255
country, state	alphabets, digits, [space], -[hyphen], ' [single quote as apostrophe], .[period], _[underscore], ,[comma], ( ) # & @ / ; : + *	32
city	alphabets, digits, [space], -[hyphen], ' [single quote as apostrophe], .[period], _[underscore], ,[comma], ( ) # & @ / ; : + *	64
street	alphabets, digits, [space], -[hyphen], ' [single quote as apostrophe], .[period], _[underscore], ,[comma], ( ) # & @ / ; : + *	255
zip	alphabets, digits, -[hyphen]	16

Requirements for Describing SUBMITTER

In principle, one SUBMITTER must be entered for each entry. However, COMMON can be used to describe a SUBMITTER that is common to all entries.
When SUBMITTER is described under COMMON, SUBMITTER cannot be used for any other entry in the same annotation file.
Submitters are the persons who are responsible for the contents of the submitted data and have the right to update the data.
The Qualifier ab_name in SUBMITTER can be repeated for multiple submitters; the submitters are shown in the released file in the same order as in the annotation file.
You must specify a contact person, whom DDBJ will contact about the data, by using the Qualifier contact.
The Value of the Qualifier ab_name must be the abbreviated author name, in the same format as a REFERENCE author.

Value format:

last name[comma]initial of first name[period]initial of middle name[period]

Example:

Miyashita,Y.

Robertson,G.R.

Some names (e.g. a name with a hyphen) may produce a warning message about a format error, but they can still be entered.
In SUBMITTER, Qualifiers other than ab_name cannot be repeated; they apply only to the contact person. If you would like to submit information for multiple institutes, please contact us before your submission.
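
The ab_name format described above (last name, comma, initials each followed by a period, at most 64 characters) can be approximated with a regular expression. This is a simplified sketch: the pattern and the function name is_valid_ab_name are assumptions, and names with hyphenated initials are deliberately not matched, which mirrors the warning mentioned above rather than a hard rejection.

```python
import re

# Last name, a comma, then one or more initials, each followed by a period.
AB_NAME = re.compile(r"[A-Za-z'-]+,(?:[A-Za-z]\.)+")

def is_valid_ab_name(value):
    """True if value looks like 'Robertson,G.R.' and is at most 64 characters."""
    return len(value) <= 64 and AB_NAME.fullmatch(value) is not None
```

A value that fails this check is not necessarily rejected by DDBJ; it is only a hint that the name deserves a second look.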

REFERENCE

Example: REFERENCE in annotation file (Required)

Feature	Qualifier	Value
REFERENCE	title	Sequence and analysis of mouse ch.8
	ab_name	Robertson,G.R.
	ab_name	Mishima,H.
	status	Published
	year	2003
	journal	Nature
	volume	8
	start_page	15
	end_page	20

List of Qualifiers for REFERENCE

Qualifier	Legal characters for each Value (Remarks)	Number of letters
title	All printable characters but [back-slash], ` [back-quote]	255
ab_name (abbreviation of author name)	alphabets, .[period], ,[comma], -[hyphen], ' [single quote as apostrophe]	64
consrtm (consortium)	alphabets, digits, [space], -[hyphen], ' [single quote as apostrophe], .[period], _[underscore], ,[comma], ( ) # & @ / ; : + *	255
status	One of the following: Unpublished, In press, Published	-
year	digits (4-digit year, A.D.)	4
journal	All printable characters but [back-slash], ` [back-quote] (PubMed type abbreviation)	128
volume, start_page, end_page	alphabets, digits, -[hyphen]	8

Requirements for Describing REFERENCE

At least one REFERENCE must be specified for each entry. However, COMMON can be used to describe a REFERENCE that is common to all entries.
The Value of the Qualifier ab_name must be the abbreviated author name, in the REFERENCE author format.

Value format:

last name[comma]initial of first name[period]initial of middle name[period]

Example:

Miyashita,Y.

Robertson,G.R.

You may ignore a warning message about a name format error (e.g. for a name with a hyphen).
If the Value of status is "In press", the Qualifier journal is also mandatory.
If the Value of status is "Published", the Qualifiers journal, volume, start_page and end_page are also mandatory.
Please enter "Unpublished" for status if you have not prepared any publication.
If the Value of status is "Unpublished", the Qualifier year is not required.
Please enter the PubMed-type or ISO abbreviation of the journal name, if one exists.
If you need to enter two or more REFERENCE features, enter first the REFERENCE directly related to your sequences, and then the other(s) that would be helpful for understanding the data.
When you use REFERENCE features for both the COMMON entry and other entries, the REFERENCE feature(s) specified for each entry are loaded into DDBJ after those given by the COMMON entry.
When you cite two or more REFERENCE features for an entry, they are shown in the DDBJ flat file in the same order as in the annotation file.
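
The status-dependent rules above amount to a small lookup table. The sketch below covers only the qualifiers that the text states become mandatory for each status; the names REQUIRED_BY_STATUS and missing_reference_qualifiers are hypothetical, and the status values are spelled as in the qualifier table.

```python
# Qualifiers that become mandatory for each legal Value of status.
REQUIRED_BY_STATUS = {
    "Unpublished": [],
    "In press": ["journal"],
    "Published": ["journal", "volume", "start_page", "end_page"],
}

def missing_reference_qualifiers(reference):
    """Return qualifiers that the given status makes mandatory but are absent.

    reference -- dict mapping qualifier name to value (ab_name is ignored here)
    """
    status = reference.get("status")
    if status not in REQUIRED_BY_STATUS:
        return ["status"]  # status is missing or not one of the legal values
    return [q for q in REQUIRED_BY_STATUS[status] if not reference.get(q)]
```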

DATE

Example: DATE/hold_date in annotation file

Entry	Feature	Location	Qualifier	Value
COMMON	DATE		hold_date	20231125

Requirements for Describing DATE

DATE and hold_date must be described in the COMMON entry. If the date is not common to all entries, please prepare a separate file for each date.
If you want to keep your data confidential until a specific date, set the date as 8 digits (e.g. 20231125).
Delimiters such as - (hyphen) or / (slash) are not allowed in the Value of hold_date.
Do not enter any DATE if your data should be released to the public immediately.
If you set a hold_date, your data will be released according to the principle of "Hold-Until-Published" data release.
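
A hold_date can be checked locally before submission. This sketch assumes the 8 digits are in year, month, day order, as the example 20231125 suggests; the function name parse_hold_date is hypothetical.

```python
from datetime import datetime

def parse_hold_date(value):
    """Return a date for an 8-digit hold_date such as 20231125, else raise ValueError."""
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        raise ValueError(f"hold_date must be exactly 8 digits: {value!r}")
    # strptime also rejects impossible dates such as month 13.
    return datetime.strptime(value, "%Y%m%d").date()
```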

COMMENT/ST_COMMENT

Example: COMMENT and ST_COMMENT in annotation file

Feature	Qualifier	Value
COMMENT	line	This clone was obtained at our laboratory.
COMMENT	line	Please visit our web site.
	line	URL:http://www.ddbj.nig.ac.jp
ST_COMMENT	tagset_id	Genome-Assembly-Data
	Assembly Method	GS De Novo Assembler v. 2.0
	Assembly Name	Mmus_1.0
	Genome Coverage	50x
	Sequencing Technology	454 GS FLX; ABI 3730

※ There are two kinds of COMMENTs, “general COMMENT” and “structured COMMENT”.

Requirements for Describing COMMENT (General COMMENT)

Please use a general COMMENT if you want to describe additional information about your data.
The text is wrapped automatically every 60 characters, including spaces. If you want to start a new line at another position, add another Qualifier: line.
All printable characters except [back-slash] are legal for the Value of Qualifier: line.
The COMMON entry can be used for describing a COMMENT that is common to all entries.
When you enter multiple COMMENT features, enter each COMMENT separately in the Feature column.
When you use COMMENT features for both the COMMON entry and other entries, the COMMENT feature(s) specified for each entry are loaded into DDBJ after those given by the COMMON entry.
When you describe two or more COMMENT features for an entry, they are shown in the DDBJ flat file in the same order as in the annotation file.
For EST submissions, a particular COMMENT description is required. See For EST Submissions for details.

Requirements for Describing ST_COMMENT (Structured Comment)

ST_COMMENT is a feature for describing a structured comment in the flat file.
Although an ST_COMMENT can be defined by a user community, an ST_COMMENT in a predetermined format is required when submitting sequence data derived from a genome project (including WGS) or a transcriptome project (including TSA).
An ST_COMMENT is composed of a dataset name (tagset_id), item names (user-defined Qualifiers) and item values (Values).
On the first line of the ST_COMMENT feature, enter tagset_id as the Qualifier and the dataset name as its Value.

For a genome project, enter "Genome-Assembly-Data" as the Value of the tagset_id qualifier.
For a transcriptome project, enter "Assembly-Data" as the Value of the tagset_id qualifier.
On the following lines, enter each item name as the Qualifier and the item value as its Value. For Genome-Assembly-Data and for Assembly-Data, use the Qualifiers listed in the respective tables below.

List of Qualifiers for Genome-Assembly-Data (Required)

Qualifier	Description	Remarks
Assembly Method	Name and version of the program used for assembling the sequences. Mandatory.	The program version must be given just after " v. " (e.g. Velvet v. 2.0)
Assembly Name	Name that the submitter has given to this assembly of the genome. Mandatory for eukaryotes.	Recommended format: [abbreviated name of species or common name of organism] + [version] (e.g. Btau_4.0)
Genome Coverage	Approximate sequencing depth. Mandatory. (e.g. 125x)	Use "Unknown" when the coverage is not known.
Sequencing Technology	Platform(s) used to generate the sequence. Mandatory.	Use a semicolon followed by a space to separate multiple platforms (e.g. 454 GS FLX; ABI 3730)

List of Qualifiers for Assembly-Data (Required)

Qualifier	Description	Remarks
Assembly Method	Name and version of the program used for assembling the sequences. Mandatory.	The program version must be given just after " v. " (e.g. Velvet v. 2.0)
Assembly Name	Name and version for assembled sequences	Recommended format: [abbreviated name of species or common name of organism] + [version] (e.g. Btau_4.0)
Coverage	Approximate sequencing depth. (e.g. 125x)	Use "Unknown" when the coverage is not known.
Sequencing Technology	Platform(s) used to generate the sequence. Mandatory.	Use a semicolon followed by a space to separate multiple platforms (e.g. 454 GS FLX; ABI 3730)

If you have any questions about describing ST_COMMENT, please contact us by email prior to your submission.
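
As an illustration of the Genome-Assembly-Data table, the following sketch checks the items marked Mandatory and the " v. " convention for Assembly Method. The function name, the eukaryote flag and the message texts are assumptions made for this example only.

```python
MANDATORY = ["Assembly Method", "Genome Coverage", "Sequencing Technology"]

def check_genome_assembly_data(items, eukaryote=False):
    """Return problems in a Genome-Assembly-Data structured comment.

    items -- dict mapping item name (user-defined Qualifier) to its Value
    """
    problems = []
    # Assembly Name is mandatory only for eukaryotes.
    required = MANDATORY + (["Assembly Name"] if eukaryote else [])
    for name in required:
        if not items.get(name):
            problems.append(f"missing {name}")
    method = items.get("Assembly Method", "")
    if method and " v. " not in method:
        problems.append("Assembly Method must give the version after ' v. '")
    return problems
```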

Biological Features

Example: source and CDS features in annotation file (Required)

Feature	Location	Qualifier	Value
source	1..12297	organism	Mus musculus
		mol_type	genomic DNA
		chromosome	8
		clone	PC0110
CDS	join(<1..456,609..879,1070..1213)	product	protein kinase
		codon_start	2
rRNA	1279..3000	product	18S rRNA
CDS	complement(join(3213..4981,9901..11677))	gene	tbpA
		product	TATA-box binding protein

※ For detailed definitions and descriptions of Biological features, please read Feature Table Definition.

Requirements for Describing Feature/Location/Qualifier

In Feature Table Definition, each Qualifier is written with a leading / [slash]; do not use the slash for Qualifiers in the annotation file.
The Qualifiers marked with * (organism, mol_type) are mandatory. For each entry, the source Feature and at least one other Feature are mandatory. Please be sure to enter them correctly.
The rules for describing a Location are given in Description of Location.
The Qualifiers that are legal for each Feature are listed in Feature/Qualifier Usage Matrix. Some Features have mandatory Qualifier(s). Please be sure to spell Features and Qualifiers exactly as they appear in the table; the names are strictly defined, including case (upper or lower) and the use of "_" [underscore].
See also Sample annotation file and Description Examples of Sequence Data.
When you describe CDS features, the page Protein Coding Sequence; CDS feature may be helpful.
Files containing CDS feature(s) should be checked with UME or transChecker.

Requirements for Describing Value

The legal character types for a Value depend on the Qualifier, as shown in Feature/Qualifier Usage Matrix and Feature Table Definition.
Please be sure to enter (or omit) Values in accordance with the value types in those tables.

DIVISION

The DIVISION feature in the annotation file indicates that the entries belong to exactly one of CON / ENV / EST / GSS / HTC / HTG / STS / SYN / TSA.

Example: DIVISION in annotation file

Entry	Feature	Location	Qualifier	Value
COMMON	DIVISION		division	EST

Requirements for Describing DIVISION

Enter the division name, three capital letters, as the Value of Qualifier: division.
In principle, describe the DIVISION feature in the COMMON entry.

DATATYPE

The DATATYPE feature indicates that the entries belong to one of WGS, TLS, TPA, or TPA-WGS.

Example: DATATYPE in annotation file

Entry	Feature	Location	Qualifier	Value
COMMON	DATATYPE		type	WGS

Requirements for Describing DATATYPE

Enter the type name (WGS, TLS, TPA, or TPA-WGS) as the Value of Qualifier: type.
Describe the DATATYPE feature in the COMMON entry.

KEYWORD

Building on the categories indicated in the DIVISION and DATATYPE sections, KEYWORDs with a controlled vocabulary describe more detailed and specific information, such as experimental methods.
Please see INSDC agreed methodological keywords, which define the controlled keyword terms.

Example: KEYWORD in annotation file

Entry	Feature	Location	Qualifier	Value
	KEYWORD		keyword	ENV

Specified values for KEYWORD/keyword (Required)

Categories	the values for keyword	Remarks
WGS	WGS	see also For WGS and scaffold CON.
ENV	ENV
EST	EST
EST	some other terms	Please refer to For EST Submissions.
HTC	HTC, some other terms	Please contact us before your submission.
HTG	HTG, some other terms	Depending on the phase. Please contact us before your submission.
GSS	GSS
STS	STS
TPA	TPA, Third Party Data
TPA	TPA:inferential or TPA:experimental	Either of two is mandatory.
TSA	TSA, Transcriptome Shotgun Assembly
TLS	TLS, Targeted Locus Study
Others		Please contact us before your submission.

Requirements for Describing KEYWORD

Enter one of the specified values for Qualifier: keyword.
Please contact us before your submission to confirm the details of your KEYWORD description.

For WGS and scaffold CON

For WGS and scaffold CON, please select a keyword from the following list.

STANDARD_DRAFT
HIGH_QUALITY_DRAFT
IMPROVED_HIGH_QUALITY_DRAFT
NON_CONTIGUOUS_FINISHED

Example: WGS draft genome (Required)

Entry	Feature	Location	Qualifier	Value
	KEYWORD		keyword	WGS
			keyword	STANDARD_DRAFT

For EST Submissions

For EST submissions, at least two keywords are required: EST and one of the following three terms.

For 5' EST submissions: 5'-end sequence (5'-EST)
For 3' EST submissions: 3'-end sequence (3'-EST)
For all other cases: unspecified EST

Example: 5' EST (Required)

Entry	Feature	Location	Qualifier	Value
	KEYWORD		keyword	EST
			keyword	5'-end sequence (5'-EST)

For 3' EST submissions, to show whether your sequences correspond to the anti-sense or the sense strand, please enter one of the following two COMMENTs.

Example: For 3' EST, anti-sense strand (Required)

Entry	Feature	Location	Qualifier	Value
	COMMENT		line	3'-EST sequences are presented as anti-sense strand.

Example: For 3' EST, sense strand (Required)

Entry	Feature	Location	Qualifier	Value
	COMMENT		line	3'-EST sequences are presented as sense strand.
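
The EST keyword requirement can be expressed as a short check. This sketch assumes the keyword values are written with the plain ASCII apostrophe and reads "one of the following three terms" as exactly one; both points are interpretations, and the names EST_KINDS and est_keywords_ok are hypothetical.

```python
# The three terms of which one must accompany the EST keyword.
EST_KINDS = {
    "5'-end sequence (5'-EST)",
    "3'-end sequence (3'-EST)",
    "unspecified EST",
}

def est_keywords_ok(keywords):
    """True if the list holds EST plus exactly one of the three EST terms."""
    kinds = [k for k in keywords if k in EST_KINDS]
    return "EST" in keywords and len(kinds) == 1
```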

For HTG submissions

For HTG submissions, we recommend using keywords to indicate the sequencing status of the HTG data.

Example I: containing unordered pieces (Required)

Feature	Qualifier	Value
KEYWORD	keyword	HTG
	keyword	HTGS_PHASE1
	keyword	HTGS_DRAFT

Example II: containing only ordered pieces (Required)

Entry	Feature	Location	Qualifier	Value
	KEYWORD		keyword	HTG
			keyword	HTGS_PHASE2

DBLINK

The DBLINK feature is used to link to records in other databases, such as BioProject, BioSample and Sequence Read Archive (DRA/ERA/SRA).

Example: DBLINK in annotation file (Required)

Feature	Qualifier	Value
DBLINK	project	PRJDB12345
	biosample	SAMD90000000
	sequence read archive	DRR999000
	sequence read archive	DRR999001

Requirements for Describing DBLINK

If you have registered your project in the DDBJ BioProject Database, enter the project ID as the Value of Qualifier: project. Likewise, enter the sample ID from DDBJ BioSample as the Value of Qualifier: biosample.
An assembly built from raw reads in the Sequence Read Archive must have the run accession number(s) as the Value of Qualifier: sequence read archive.
See also DDBJ BioProject Database, DDBJ BioSample Database and DDBJ Sequence Read Archive.

locus_tag

For genome-scale submissions with many annotated features, we recommend using the Qualifier locus_tag for Biological features that indicate protein products (CDS) and transcripts (rRNA, tRNA and so on).
The locus_tag prefix and the BioSample ID should be registered in the DDBJ BioSample Database in advance.

source: ff_definition

ff_definition is a Qualifier that is not defined in The DDBJ/EMBL/GenBank Feature Table: Definition. If necessary, one ff_definition can be described per entry.

Example: ff_definition in annotation file

Feature	Location	Qualifier	Value
source	1..516	organism	Mus musculus
		mol_type	mRNA
		ff_definition	@@[organism]@@ mRNA, clone: @@[clone]@@
		clone	PC0110

Value formats of ff_definition

Categories	Format for the value of ff_definition
WGS	@@[organism]@@ @@[strain]@@ DNA, @@[submitter_seqid]@@, [other information]
BAC/YAC genomic clones in unfinished phase (HTG)	@@[organism]@@ DNA, chromosome @@[map]@@, [BAC/YAC] clone: @@[clone]@@, * SEQUENCING IN PROGRESS *
BAC/YAC genomic clones in finished phase	@@[organism]@@ DNA, chromosome @@[map]@@, [BAC/YAC] clone: @@[clone]@@
EST	@@[organism]@@ mRNA, clone: @@[clone]@@, [other information]
EST	@@[organism]@@ cDNA, clone: @@[clone]@@, [other information]
GSS	@@[organism]@@ DNA, clone: @@[clone]@@, [other information]
STS	@@[organism]@@ DNA, @@[map]@@, [marker name], sequence tagged site
Others	Please contact us before your submission, if necessary.

Requirements for Describing source: ff_definition

The Qualifier ff_definition can be described on source, which is one of the Biological features.
You can describe only one ff_definition per entry.
The Value of ff_definition is used for the DEFINITION line of the DDBJ flat file. Please refer to Sample annotation file and The relationships between annotation file and DDBJ flat file.
In the Value of ff_definition, a meta-description (e.g. @@[organism]@@ and @@[clone]@@) can be used to quote the values of other Qualifiers. The meta-description, a Qualifier name enclosed by "@@[" and "]@@", is replaced by the value of the target Qualifier ("organism" and "clone" in the sample above) when ff_definition is reflected in the DEFINITION line of the DDBJ flat file.
In principle, describe ff_definition according to the table above. If you would like to use a different format for the Value of ff_definition, please contact us before your submission.
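
The placeholder replacement described above can be previewed with a few lines of code. This is a sketch of the substitution idea only, not DDBJ's implementation; the names PLACEHOLDER and render_definition are hypothetical, and unknown placeholders are left as they are so that they stand out.

```python
import re

# Matches @@[name]@@ and captures the Qualifier name.
PLACEHOLDER = re.compile(r"@@\[(\w+)\]@@")

def render_definition(template, qualifiers):
    """Replace each @@[name]@@ in template with the value of Qualifier name."""
    def substitute(match):
        name = match.group(1)
        # Leave unknown placeholders untouched so they are easy to spot.
        return qualifiers.get(name, match.group(0))
    return PLACEHOLDER.sub(substitute, template)
```

With organism "Mus musculus" and clone "PC0110", the template "@@[organism]@@ mRNA, clone: @@[clone]@@" becomes "Mus musculus mRNA, clone: PC0110".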

assembly_gap: Sequencing Gap Region

In genome-scale sequencing such as HTG, or in large-scale assemblies of EST sequences such as the TSA division, entries may contain sequencing gaps that result from the assembly process or from regions that are difficult to read. Indicate such gaps with "n" in the sequence. In the annotation file, you must mark the gap regions with assembly_gap features.

Example: assembly_gap in annotation file (Required)

Feature	Location	Qualifier	Value
assembly_gap	101..200	estimated_length	unknown
		gap_type	within scaffold
		linkage_evidence	paired-ends

Requirements for Describing assembly_gap: Sequencing Gap Region

Although the assembly_gap feature is one of the Biological features, its format is slightly different from the others.
You can NOT use join, order, or complement in the Location of an assembly_gap feature.

Length of the gap is unknown

For a gap of unknown length, the submitter has to specify the location span of the assembly_gap feature; the span has to be of a reasonable length (1000 or less) and is represented by "n"s in the sequence.
The Value of Qualifier: estimated_length on the assembly_gap feature must be unknown.

For a transcriptome record (TSA division), the Value of estimated_length on an assembly_gap feature must be an integer, not "unknown".

Length of the gap is estimated

For a gap of "known" length, the location span of the assembly_gap feature must match the run of "n"s in the sequence. The Value of Qualifier: estimated_length on the assembly_gap feature must be known.
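
The relation between an assembly_gap Location and the "n"s in the sequence can be checked as follows. This is a minimal sketch under the rules stated above; the function name check_assembly_gap and its messages are assumptions, and positions are 1-based and inclusive, as in a Location.

```python
def check_assembly_gap(sequence, start, end, estimated_length):
    """Return problems for an assembly_gap at 1-based positions start..end."""
    problems = []
    span = sequence[start - 1:end]
    # The span must lie inside the sequence and consist of n only.
    if len(span) != end - start + 1 or set(span.lower()) != {"n"}:
        problems.append("location must cover only n bases")
    if estimated_length == "unknown" and end - start + 1 > 1000:
        problems.append("an unknown-length gap should span 1000 bases or fewer")
    return problems
```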

TOPOLOGY

Enter circular as the Qualifier of the TOPOLOGY feature when the whole nucleotide molecule is circular, that is, when the first and last positions are joined in the real molecule.
e.g. the complete genome sequence of a circular virus

Example: TOPOLOGY in annotation file

Entry	Feature	Location	Qualifier	Value
	TOPOLOGY		circular

Requirements for Describing TOPOLOGY

In DDBJ flat file, topology is indicated in the LOCUS line. See also Sample annotation file.

TPA/TSA: PRIMARY_CONTIG, Citation of Primary Entries

PRIMARY_CONTIG, entry, and primary_bases are the Feature and Qualifiers provided for describing the alignments of primary entries in TPA/TSA submissions.

Example: PRIMARY_CONTIG in annotation file

Feature	Location	Qualifier	Value
PRIMARY_CONTIG	1..438	entry	ZZ000010.1
		primary_bases	1..438
PRIMARY_CONTIG	377..696	entry	ZZ000011.1
		primary_bases	1..320
		complement
PRIMARY_CONTIG	590..1191	entry	ZZ000022.0
		primary_bases	1..601

Qualifiers available for PRIMARY_CONTIG

Qualifier	Remarks for the value description
entry	Accession number of the cited primary entry (with version number)
primary_bases	The base span cited from the primary sequence (e.g. 1..500)
complement	Indicates that the complementary strand of the primary sequence is cited

Requirements for Describing TPA/TSA: PRIMARY_CONTIG, Citation of Primary Entries

Please specify TPA as the Value of DATATYPE/type, or TSA as the Value of DIVISION/division, in the annotation file.
In PRIMARY_CONTIG, you must refer to the accession number(s) (with version) in the primary database and enter the base spans of the primary sequences that contribute to the TPA/TSA sequence.
You can NOT use join, order, or complement in the Location column. Describe a separate PRIMARY_CONTIG feature with its own Location for each primary sequence, even within the same entry.
If the primary entry has been submitted to DDBJ/EMBL-Bank/GenBank, the accession number requires a version number. If the primary entry is not public, use 0 [zero] for the version (e.g. ZZ000022.0).
If the primary sequence corresponds to the reverse strand of the TPA/TSA sequence, add the complement qualifier.
For details, refer to Sample annotation file and The relationships between annotation file and DDBJ flat file.
- TPA (Third Party Annotation)： Sample
- TSA (Transcriptome Shotgun Assembly)： Sample
- TSA; assembled from short reads： Sample

Sample annotation

General data	Protein coding sequence (CDS)	CDS
	Ribosomal RNA	16S_rRNA
	ITS (Internal Transcribed Spacer)	ITS
	Microsatellite marker	Microsatellite marker
	Mitochondrial sequence	mtDNA
	ENV (Environmental Samples)	ENV
Genome data	complete genome sequence (Bacteria)	complete_genome_BCT
	Finished level genome sequence with biological feature (Eukaryote)	Finished_genome_eukaryote
	WGS (Whole Genome Shotgun) without annotation	WGS
	WGS (Whole Genome Shotgun) with annotation	WGS_annotation
	WGS; piece of scaffold CON	WGS_piece_CON
	CON entries for WGS scaffold	WGS_scaffold
	MAGs (Metagenome-Assembled Genomes, MAGs) for Complete genome	MAGs_CompleteGenome
	MAGs (Metagenome-Assembled Genomes, MAGs) for Draft genome	MAGs_WGS
	AGP file for CON entries	AGP
	GSS (Genome Survey Sequences)	GSS
	HTG (High Throughput Genomic Sequences)	HTG
Large transcripts data	TSA (Transcriptome Shotgun Assembly); assembled from EST	TSA
	TSA; assembled from short reads without annotation	TSA_SRA_assemble_NoANN
	TSA; assembled from short reads with annotation	TSA_SRA_assemble_Ann
	EST (Expressed Sequence Tags)	EST
TLS (Targeted Locus Study)	TLS (Targeted Locus Study)	TLS
TPA (Third Party Data)	TPA (Third Party Data)	TPA
	TPA assembly (Third Party Data)	TPA-assembly_WGS
	TPA assembly (Third Party Data)	TPA-assembly
Annotation: Flat file	Protein coding sequence (CDS)	ann2-ff

AGP File

An AGP file is required to submit CON entries. An AGP file is a tab-delimited text file with nine columns that describes the order, orientation and so on of the piece entries used to construct each CON entry. You can create the file with scripts, spreadsheet software (such as MS Excel), text editors and so on.

A sequence file is not required when the sequence can be constructed from the AGP file.

The AGP file format was initially developed by UCSC, EBI and NCBI.

Example: AGP file

#1	2	3	4	5	6	7	8	9
scaffold1	1	1345	1	W	BZZZ01123456.1	1	1345	+
scaffold1	1346	2845	2	N	1500	scaffold	yes	align_genus
scaffold1	2846	4301	3	W	BZZZ01123457.1	1	1456	+
scaffold1	4302	4401	4	U	100	scaffold	yes	align_genus
scaffold1	4402	5631	5	W	BZZZ01123458.1	1	1230	-
scaffold2	1	650	1	W	BZZZ01123486.1	1	650	+
scaffold2	651	750	2	N	100	scaffold	yes	align_genus
scaffold2	751	2980	3	W	BZZZ01123488.1	1	2230	-

Format and Syntax

The format of the AGP file must be validated with UME.

An AGP file consists of nine columns.
Columns must be tab-delimited.
An AGP file must not contain spaces or blank lines.
The use of comment lines, starting with a # symbol, at the head of the file is encouraged.
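
The rules above can be sketched as a small Python check. This is only an illustration and not a substitute for UME. The function name parse_agp is hypothetical, and rejecting comment lines that appear after the first data line is an assumption of this sketch (the rules above only encourage comments at the head of the file).

```python
def parse_agp(text):
    """Split AGP text into rows of nine columns; raise ValueError on a format error."""
    rows = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip() == "":
            raise ValueError(f"line {number}: blank line")
        if line.startswith("#"):
            if rows:
                # Assumption of this sketch: comments are accepted only at the head.
                raise ValueError(f"line {number}: comment after data lines")
            continue
        if " " in line:
            raise ValueError(f"line {number}: contains a space")
        columns = line.split("\t")
        if len(columns) != 9:
            raise ValueError(
                f"line {number}: expected 9 columns, found {len(columns)}"
            )
        rows.append(columns)
    return rows
```

Each returned row is a list of nine strings in column order, so rows[0][4] is the component_type of the first data line.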

Description of each column (column 1 - column 5)

* component: a sequence used to construct a larger sequence (i.e. piece entry)
column	content	description
1	object	CON entry name, the identifier for the object being assembled, e.g. a chromosome, scaffold or contig. The CON entry name must correspond to the entry name in the annotation file, as described in Annotation File.
2	object_beg	The starting coordinates of the component/gap on the object.
3	object_end	The ending coordinates of the component/gap on the object.
4	part_number	The line count for the components/gaps that make up the object.
5	component_type	The sequencing status of the component. These typically correspond to keywords in the International Sequence Database (GenBank/EMBL/DDBJ) submission. Current acceptable values are:
		A	Active Finishing
		D	Draft HTG (often phase1 and phase2 are called Draft, whether or not they have the draft keyword)
		F	Finished HTG (phase3)
		G	Whole Genome Finishing
		O	Other sequence (typically means no HTG keyword)
		P	Pre Draft
		W	WGS contig
		N	gap with specified size
		U	gap of unknown size, defaulting to 100 bases
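
As an illustration of columns 1 to 5, the following Python sketch classifies one row as a component or a gap from column 5 and computes the length it covers on the object. The names classify_line, COMPONENT_TYPES and GAP_TYPES are hypothetical. Treating object_beg and object_end as 1-based inclusive coordinates is an assumption read from the example AGP file, where a 1345 bp component spans 1 to 1345.

```python
COMPONENT_TYPES = {"A", "D", "F", "G", "O", "P", "W"}
GAP_TYPES = {"N", "U"}


def classify_line(columns):
    """Return ("component" or "gap", span length) for one nine-column AGP row."""
    object_beg, object_end = int(columns[1]), int(columns[2])
    if object_beg < 1 or object_end < object_beg:
        raise ValueError("object_beg/object_end are not a valid 1-based range")
    # Assumption: coordinates are 1-based and inclusive at both ends.
    span = object_end - object_beg + 1
    component_type = columns[4]
    if component_type in COMPONENT_TYPES:
        return "component", span
    if component_type in GAP_TYPES:
        return "gap", span
    raise ValueError(f"unknown component_type: {component_type!r}")
```

The returned kind tells a caller which of the two layouts of columns 6 to 9 applies to the row.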

The meaning of columns 6 to 9 depends on whether the value in column 5 indicates a component or a gap.

Description of each column (column 6 - column 9): If column 5 contains A, D, F, G, O, P or W (that is, not N or U)

* component: a sequence used to construct a larger sequence (i.e. piece entry)
column	content	Description
6	component_id	The accession number with version or local identifier for the component
7	component_beg	The beginning of the part of the component that contributes to the object
8	component_end	The end of the part of the component that contributes to the object
9	orientation	The orientation of the component relative to the object. Acceptable values are:
		+	plus
		-	minus
		?	unknown
		0	zero; unknown (deprecated)
		na	irrelevant
		By default, components with "?", "0" or "na" are treated as if they had + orientation.
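
Columns 7 to 9 can be read as instructions for cutting a piece out of a component. The Python sketch below extracts that piece and reverse-complements it for minus orientation. The name component_slice is hypothetical, coordinates are assumed to be 1-based and inclusive, and only a, c, g and t are complemented, which is a simplification: ambiguity codes are passed through unchanged.

```python
# Simplification: only a, c, g, t (either case) are complemented.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")
ORIENTATIONS = {"+", "-", "?", "0", "na"}


def component_slice(sequence, component_beg, component_end, orientation):
    """Return the part of a component that contributes to the object."""
    if orientation not in ORIENTATIONS:
        raise ValueError(f"unknown orientation: {orientation!r}")
    if component_beg < 1 or component_end < component_beg:
        raise ValueError("component_beg/component_end are not a valid range")
    if component_end > len(sequence):
        raise ValueError("component_end is beyond the end of the component")
    # Assumption: 1-based inclusive coordinates, converted to a Python slice.
    part = sequence[component_beg - 1:component_end]
    if orientation == "-":
        return part.translate(COMPLEMENT)[::-1]
    # "?", "0" and "na" are treated as "+" by default.
    return part
```

For example, component_slice("aaccggtt", 2, 5, "+") gives "accg", and the same call with "-" gives "cggt".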

Description of each column (column 6 - column 9): If column 5 contains N or U

column	content	description
6	gap_length	[component_type: N] The length of the gap in bp. [component_type: U] 100.
7	gap_type	This column specifies the gap type. Accepted values:
		scaffold	a gap between two sequence contigs in a scaffold (superscaffold or ultra-scaffold).
		contig	an unspanned gap between two sequence contigs.
		centromere	a gap inserted for the centromere.
		short_arm	a gap inserted at the start of an acrocentric chromosome.
		heterochromatin	a gap inserted for an especially large region of heterochromatic sequence (may also include the centromere)
		telomere	a gap inserted for the telomere.
		repeat	an unresolvable repeat.
8	linkage	The linkage between the adjacent lines (Values: "yes" or "no")
9	linkage evidence	This specifies the type of evidence used to assert linkage (as indicated in column 8). Accepted values:
		na	used when no linkage is being asserted (column 8 is 'no')
		paired-ends	paired sequences from the two ends of a DNA fragment.
		align_genus	alignment to a reference genome within the same genus.
		align_xgenus	alignment to a reference genome within another genus.
		align_trnscpt	alignment to a transcript from the same species.
		within_clone	sequence on both sides of the gap is derived from the same clone, but the gap is not spanned by paired-ends. The adjacent sequence contigs have unknown order and orientation.
		clone_contig	linkage is provided by a clone contig in the tiling path (TPF). For example, a gap where there is a known clone, but there is not yet sequence for that clone.
		map	linkage asserted using a non-sequence based map such as RH, linkage, fingerprint or optical.
		strobe	strobe sequencing (PacBio).
		unspecified	used when converting old AGPs that lack a field for linkage evidence into the new format.
		If multiple lines of evidence support linkage, all can be listed using a ';' delimiter (e.g. "paired-ends;align_xgenus").
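
The accepted values for a gap line can be checked with a short Python sketch such as the one below. The name check_gap_columns is hypothetical, and all arguments are the raw column strings. Two rules are this sketch's reading of the table rather than stated requirements: "na" is the only evidence accepted when linkage is "no", and "na" is rejected when linkage is "yes".

```python
GAP_TYPE_VALUES = {
    "scaffold", "contig", "centromere", "short_arm",
    "heterochromatin", "telomere", "repeat",
}
EVIDENCE_VALUES = {
    "na", "paired-ends", "align_genus", "align_xgenus", "align_trnscpt",
    "within_clone", "clone_contig", "map", "strobe", "unspecified",
}


def check_gap_columns(component_type, gap_length, gap_type, linkage, evidence):
    """Return a list of problems found in columns 5 to 9 of a gap line."""
    problems = []
    if component_type not in {"N", "U"}:
        problems.append("component_type must be N or U on a gap line")
    if not gap_length.isdigit() or int(gap_length) < 1:
        problems.append("gap_length must be a positive integer")
    elif component_type == "U" and int(gap_length) != 100:
        problems.append("gap_length must be 100 when component_type is U")
    if gap_type not in GAP_TYPE_VALUES:
        problems.append(f"unknown gap_type: {gap_type!r}")
    if linkage not in {"yes", "no"}:
        problems.append("linkage must be yes or no")
    items = evidence.split(";")  # several kinds of evidence are joined by ';'
    for item in items:
        if item not in EVIDENCE_VALUES:
            problems.append(f"unknown linkage evidence: {item!r}")
    # Assumption of this sketch: "na" pairs with linkage "no" and nothing else does.
    if linkage == "no" and items != ["na"]:
        problems.append("linkage evidence must be na when linkage is no")
    if linkage == "yes" and "na" in items:
        problems.append("linkage evidence must not be na when linkage is yes")
    return problems
```

An empty list means that no problem was found among the rules encoded here.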

A gap of unknown size must be given a length of 100 bp: specify "U" for component_type and "100" for gap_length.
Continuity is indicated by the combination of the gap_type and linkage values. Please refer to the following table.

Interpretation of gap_type and linkage combinations

gap_type	linkage	Interpretation and description
Within-scaffold gaps: sequences on either side of the gap are in a single scaffold.
scaffold	yes	Do not break scaffold. There is evidence linking the sequence contigs on both sides of the gap.
repeat	yes	Do not break scaffold. If an unresolvable repeat unit is spanned by linkage evidence, the linkage will be 'yes'.
Scaffold-breaking gaps: sequences on either side of the gap are in separate scaffolds.
contig	no	Break scaffold. A contig gap indicates there is no evidence to link the adjacent sequence contigs.
repeat	no	Break scaffold. If an unresolvable repeat unit is not spanned by linkage evidence, the linkage will be 'no'.
centromere, short_arm, heterochromatin, telomere	no	Break scaffold. Gaps with these biological types are used for laying out scaffolds along a chromosome.
Invalid gap/linkage combinations
contig	yes	Invalid. If there is evidence of linkage between the adjacent sequence contigs, the gap type should be scaffold.
scaffold	no	Invalid. If there is no evidence of linkage between the adjacent sequence contigs, the gap type should be contig.
centromere, short_arm, heterochromatin, telomere	yes	Invalid. It is invalid to use these biological types within a scaffold.
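
The table above can be expressed as a small lookup. The Python sketch below returns the interpretation for a given gap_type and linkage pair. The name interpret_gap and the returned strings are hypothetical labels chosen for this illustration.

```python
BIOLOGICAL_GAP_TYPES = {"centromere", "short_arm", "heterochromatin", "telomere"}


def interpret_gap(gap_type, linkage):
    """Return "do not break scaffold", "break scaffold" or "invalid"."""
    if linkage not in {"yes", "no"}:
        raise ValueError(f"linkage must be yes or no, got {linkage!r}")
    if gap_type == "scaffold":
        # A scaffold gap asserts linkage, so linkage "no" is invalid.
        return "do not break scaffold" if linkage == "yes" else "invalid"
    if gap_type == "repeat":
        # A repeat gap is valid with either linkage value.
        return "do not break scaffold" if linkage == "yes" else "break scaffold"
    if gap_type == "contig" or gap_type in BIOLOGICAL_GAP_TYPES:
        # These gaps always separate scaffolds, so linkage "yes" is invalid.
        return "break scaffold" if linkage == "no" else "invalid"
    raise ValueError(f"unknown gap_type: {gap_type!r}")
```

A validator could call this for every gap line and report the rows for which the result is "invalid".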