Last updated: 2015.1.6.

How to Make AGP File

An AGP file is required to submit CON entries.
The AGP file format was originally developed by UCSC, EBI and NCBI.

CON sequence file

A sequence file is not required when the sequence can be constructed from the AGP file.

CON annotation file

Please enter CON as the value of DIVISION/division in the annotation file.

For details, see Sample Annotation File and The relationships between annotation file and DDBJ flat file.

CON AGP file

The AGP file specifies the order and orientation of the piece entries used to construct a CON entry.

#1 2 3 4 5 6 7 8 9
scaffold1 1 1345 1 W BZZZ01123456.1 1 1345 +
scaffold1 1346 2845 2 N 1500 scaffold yes align_genus
scaffold1 2846 4301 3 W BZZZ01123457.1 1 1456 +
scaffold1 4302 4401 4 U 100 scaffold yes align_genus
scaffold1 4402 5631 5 W BZZZ01123458.1 1 1230 -
scaffold2 1 650 1 W BZZZ01123486.1 1 650 +
scaffold2 651 750 2 N 100 scaffold yes align_genus
scaffold2 751 2980 3 W BZZZ01123488.1 1 2230 -
  • An AGP file consists of nine columns.
  • Columns 1 to 5 each have a single definition. Columns 6 to 9 each have two definitions, and the value in column 5 determines which one applies.
  • Columns must be tab-delimited.
  • The AGP file must not contain spaces or blank lines.
  • The use of comment lines, starting with a # symbol, at the head of the file is encouraged.
  • The AGP file can be checked with "UME" (Utilities for MSS Error check).
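The formatting rules above can be sketched as a small check. This is only an illustration and not a substitute for UME. The function name `check_agp_format` is hypothetical, and the sketch assumes that the "no space" rule applies to data lines and that comment lines can simply be skipped.

```python
def check_agp_format(text):
    """Return a list of (line_number, message) for basic AGP layout problems.

    Assumptions of this sketch: lines starting with '#' are comments and are
    skipped; the 'no space' rule is applied to data lines only.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip() == "":
            problems.append((number, "blank line"))
            continue
        if line.startswith("#"):
            continue  # comment line
        if " " in line:
            problems.append((number, "space character found"))
        columns = line.split("\t")  # columns must be tab-delimited
        if len(columns) != 9:
            problems.append((number, f"expected 9 columns, found {len(columns)}"))
    return problems
```

For example, calling `check_agp_format(open("sample.agp").read())` on a well-formed file (the file name is a placeholder) should return an empty list.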
[Description of each column]
column  content         description
1       object          CON entry name: the identifier of the object being assembled, e.g. a chromosome, scaffold or contig.
2       object_beg      The starting coordinate of the component/gap on the object.
3       object_end      The ending coordinate of the component/gap on the object.
4       part_number     The line count for the components/gaps that make up the object.
5       component_type  The sequencing status of the component. These typically correspond to keywords in the International Sequence Database (GenBank/EMBL/DDBJ) submission. Currently accepted values are:
                        A  Active Finishing
                        D  Draft HTG (often phase1 and phase2 are called Draft, whether or not they have the draft keyword)
                        F  Finished HTG (phase3)
                        G  Whole Genome Finishing
                        O  Other sequence (typically means no HTG keyword)
                        P  Pre Draft
                        W  WGS contig
                        N  gap with specified size
                        U  gap of unknown size, defaulting to 100 bases

* component: a sequence used to construct a larger sequence (i.e. piece entry)
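Columns 1 to 5 can be checked with a short sketch like the one below. It assumes, based on the sample file and the general AGP convention rather than on an explicit rule in this page, that each object starts at coordinate 1, that consecutive lines of one object are contiguous, and that part_number counts up from 1. The function name `check_object_layout` is hypothetical.

```python
COMPONENT_TYPES = set("ADFGOPW")  # sequence components
GAP_TYPES = set("NU")             # gap lines


def check_object_layout(rows):
    """Check columns 1-5 of AGP rows (each row is a list of 9 strings).

    Returns a list of (row_number, message). Assumes rows of the same object
    appear together, in order, as in the sample file.
    """
    problems = []
    previous = {}  # object name -> (object_end, part_number) of its last row
    for index, row in enumerate(rows, start=1):
        name, component_type = row[0], row[4]
        object_beg, object_end, part_number = int(row[1]), int(row[2]), int(row[3])
        if component_type not in COMPONENT_TYPES | GAP_TYPES:
            problems.append((index, f"unknown component_type {component_type!r}"))
        if object_end < object_beg:
            problems.append((index, "object_end is smaller than object_beg"))
        last_end, last_part = previous.get(name, (0, 0))
        if object_beg != last_end + 1:
            problems.append((index, "object_beg does not follow the previous object_end"))
        if part_number != last_part + 1:
            problems.append((index, "part_number is not consecutive"))
        previous[name] = (object_end, part_number)
    return problems
```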

  • Please enter the CON entry name in the object column. Each CON entry name must match the corresponding entry name in the annotation file, as described in How to Make Annotation File.
  • The meaning of columns 6 to 9 depends on whether the value in column 5 denotes a gap.
If column 5 contains A, D, F, G, O, P or W (that is, any value other than N or U):
column  content        description
6       component_id   The accession number with version, or the local identifier, of the component.
7       component_beg  The beginning of the part of the component that contributes to the object.
8       component_end  The end of the part of the component that contributes to the object.
9       orientation    The orientation of the component relative to the object. Accepted values are:
                       +   plus
                       -   minus
                       ?   unknown
                       0   zero; unknown (deprecated)
                       na  irrelevant
                       By default, components with "?", "0" or "na" are treated as if they had + orientation.

* component: a sequence used to construct a larger sequence (i.e. piece entry)
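As a minimal sketch of how columns 7 to 9 are applied, the function below extracts the contributing part of a component and reverse-complements it for "-" orientation. It assumes that coordinates are 1-based and inclusive, as the sample file suggests (1 to 1345 spans 1345 bases). The names `component_slice` and `spans_match` are hypothetical.

```python
# Complement table for plain nucleotide letters; other characters (e.g. N) are kept as-is.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def component_slice(sequence, component_beg, component_end, orientation):
    """Return the part of a component sequence that contributes to the object."""
    piece = sequence[component_beg - 1:component_end]  # 1-based, inclusive coordinates
    if orientation == "-":
        return piece.translate(_COMPLEMENT)[::-1]  # reverse complement
    if orientation in ("+", "?", "0", "na"):
        return piece  # "?", "0" and "na" are treated as "+" by default
    raise ValueError(f"unknown orientation: {orientation!r}")


def spans_match(object_beg, object_end, component_beg, component_end):
    """True if the component part has the same length as its span on the object."""
    return object_end - object_beg == component_end - component_beg
```

`spans_match` is a consistency check that is not stated explicitly above: the span on the object (columns 2 and 3) is expected to have the same length as the span on the component (columns 7 and 8).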

If column 5 contains N or U:
column  content           description
6       gap_length        [component_type: N] The length of the gap (bp).
                          [component_type: U] 100
7       gap_type          The type of gap. Accepted values are:
                          scaffold         a gap between two sequence contigs in a scaffold (superscaffold or ultra-scaffold).
                          contig           an unspanned gap between two sequence contigs.
                          centromere       a gap inserted for the centromere.
                          short_arm        a gap inserted at the start of an acrocentric chromosome.
                          heterochromatin  a gap inserted for an especially large region of heterochromatic sequence (may also include the centromere).
                          telomere         a gap inserted for the telomere.
                          repeat           an unresolvable repeat.
8       linkage           Whether there is linkage between the adjacent lines (values: "yes" or "no").
9       linkage evidence  The type of evidence used to assert linkage (as indicated in column 8). Accepted values are:
                          na             used when no linkage is being asserted (column 8 is "no").
                          paired-ends    paired sequences from the two ends of a DNA fragment.
                          align_genus    alignment to a reference genome within the same genus.
                          align_xgenus   alignment to a reference genome within another genus.
                          align_trnscpt  alignment to a transcript from the same species.
                          within_clone   sequence on both sides of the gap is derived from the same clone, but the gap is not spanned by paired-ends. The adjacent sequence contigs have unknown order and orientation.
                          clone_contig   linkage is provided by a clone contig in the tiling path (TPF). For example, a gap where there is a known clone, but there is not yet sequence for that clone.
                          map            linkage asserted using a non-sequence based map such as RH, linkage, fingerprint or optical.
                          strobe         strobe sequencing (PacBio).
                          unspecified    used when converting old AGPs that lack a field for linkage evidence into the new format.
                          If there are multiple lines of evidence to support linkage, all can be listed using a ";" delimiter (e.g. "paired-ends;align_xgenus").
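The rules for gap lines can be sketched as follows. The function name `check_gap_columns` is hypothetical. The check that linkage "no" requires the evidence "na" is this sketch's reading of the description of "na", not a rule quoted from the page.

```python
GAP_TYPE_VALUES = {
    "scaffold", "contig", "centromere", "short_arm",
    "heterochromatin", "telomere", "repeat",
}
LINKAGE_EVIDENCE_VALUES = {
    "na", "paired-ends", "align_genus", "align_xgenus", "align_trnscpt",
    "within_clone", "clone_contig", "map", "strobe", "unspecified",
}


def check_gap_columns(component_type, gap_length, gap_type, linkage, evidence):
    """Check columns 5-9 of a gap line (all arguments are strings)."""
    problems = []
    if component_type == "U" and gap_length != "100":
        problems.append("gap_length must be 100 when component_type is U")
    if component_type == "N" and not (gap_length.isdigit() and int(gap_length) > 0):
        problems.append("gap_length must be a positive integer")
    if gap_type not in GAP_TYPE_VALUES:
        problems.append(f"unknown gap_type {gap_type!r}")
    if linkage not in ("yes", "no"):
        problems.append(f"linkage must be 'yes' or 'no', found {linkage!r}")
    terms = evidence.split(";")  # multiple lines of evidence are ';'-delimited
    unknown = [term for term in terms if term not in LINKAGE_EVIDENCE_VALUES]
    if unknown:
        problems.append(f"unknown linkage evidence {unknown!r}")
    if linkage == "no" and terms != ["na"]:
        problems.append("linkage evidence should be 'na' when linkage is 'no'")
    return problems
```

Because the terms are compared exactly, a stray space inside the evidence string (for example after the last term) is reported as an unknown value.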
  • The length of a gap of unknown size should be 100 bp. For such a gap, specify "U" as the component_type and "100" as the gap_length.
  • Continuity is indicated by the combination of the gap_type and linkage values. Please refer to the following table.
gap_type         linkage  Interpretation and description

Within-scaffold gaps: sequences on either side of the gap are in a single scaffold.
scaffold         yes      Do not break scaffold.
                          There is evidence linking the sequence contigs on both sides of the gap.
repeat           yes      Do not break scaffold.
                          If an unresolvable repeat unit is spanned by linkage evidence, the linkage will be 'yes'.

Scaffold-breaking gaps: sequences on either side of the gap are in separate scaffolds.
contig           no       Break scaffold.
                          A contig gap indicates there is no evidence to link the adjacent sequence contigs.
repeat           no       Break scaffold.
                          If an unresolvable repeat unit is not spanned by linkage evidence, the linkage will be 'no'.
centromere       no       Break scaffold.
short_arm                 Gaps with these biological types are used for laying out scaffolds along a chromosome.
heterochromatin
telomere

Invalid gap/linkage combinations
contig           yes      Invalid.
                          If there is evidence of linkage between the adjacent sequence contigs, the gap type should be scaffold.
scaffold         no       Invalid.
                          If there is no evidence of linkage between the adjacent sequence contigs, the gap type should be contig.
centromere       yes      Invalid.
short_arm                 It is invalid to use these biological types within a scaffold.
heterochromatin
telomere
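The table above can be expressed as a simple lookup. This is a sketch only: the function name `interpret_gap` and the return labels "keep", "break" and "invalid" are hypothetical and chosen for illustration.

```python
BIOLOGICAL_GAP_TYPES = ("centromere", "short_arm", "heterochromatin", "telomere")

# (gap_type, linkage) -> interpretation, following the table above.
_GAP_TABLE = {
    ("scaffold", "yes"): "keep",   # do not break scaffold
    ("repeat", "yes"): "keep",     # repeat spanned by linkage evidence
    ("contig", "no"): "break",     # no evidence linking the adjacent contigs
    ("repeat", "no"): "break",     # repeat not spanned by linkage evidence
}
for _gap_type in BIOLOGICAL_GAP_TYPES:
    _GAP_TABLE[(_gap_type, "no")] = "break"


def interpret_gap(gap_type, linkage):
    """Return 'keep', 'break' or 'invalid' for a gap_type/linkage pair.

    Any combination not listed as valid in the table, including unknown
    values, is reported as 'invalid'.
    """
    return _GAP_TABLE.get((gap_type, linkage), "invalid")
```

For instance, `interpret_gap("contig", "yes")` reports an invalid combination, because a gap with linkage evidence should be given the gap type scaffold.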