DDBJ Annotated/Assembled Sequences
WGS
In the whole genome shotgun (WGS) approach, the whole genome is broken into millions of fragments, which are sequenced and reassembled to produce a series of sequence ‘scaffolds’. This approach has been used to sequence the genomes of various organisms.
The large set of contigs from an ongoing genome project can be submitted to DDBJ/ENA/GenBank as WGS data.
See also INSDC standards for genome assembly submission
See the list of publicized WGS data.
You can submit WGS data to DDBJ via the Mass Submission System (MSS).
Acceptable WGS data
- WGS entries are contigs (assembled from overlapping reads) and/or scaffolds (assembled contigs separated by gaps).
- WGS entries can contain runs of consecutive "n"s to represent sequencing gaps.
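As a rough illustration of the second point, the sketch below locates runs of consecutive "n" in a sequence string. The function name `find_n_runs` and the `min_length` parameter are hypothetical, chosen for this example only; coordinates are reported as 1-based inclusive positions on the assumption that they should line up with flat-file feature locations.

```python
import re


def find_n_runs(sequence: str, min_length: int = 1):
    """Return (start, end, length) for each run of 'n' in the sequence.

    Positions are 1-based and inclusive (an assumption made so that they
    can be compared with flat-file feature locations).
    """
    runs = []
    for match in re.finditer(r"[nN]+", sequence):
        length = match.end() - match.start()
        if length >= min_length:
            # match.start() is 0-based; match.end() is exclusive
            runs.append((match.start() + 1, match.end(), length))
    return runs


print(find_n_runs("acgtnnnnacgtnn"))  # [(5, 8, 4), (13, 14, 2)]
```

Raising `min_length` lets you skip isolated ambiguous bases and report only longer runs that are likely to be real gaps.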
Unacceptable WGS data
- Assembled genome sequences from multiple organisms that are not metagenomes.
- The following, when submitted without an accompanying chromosome assembly (contigs and scaffolds):
  - Organelle genome contigs alone.
  - Plasmid contigs alone.
Submission of WGS entry
Submitters apply through the MSS form site.
- Before submitting assembly sequence data, you must register the project and samples in the BioProject and BioSample databases.
- If you wish to annotate all protein-coding genes and non-protein-coding RNA genes on the assembly sequences, please register a locus_tag prefix when submitting each BioSample.
- Sample annotation: (WGS sample annotation).
Sample flat file
Aspects of WGS
- In principle, each WGS sequence submitted to DDBJ is assigned an accession number consisting of either 6 letters and 9 digits (since January 2024) or 4 letters and 8 digits.
- “WGS” and one of the controlled terms (STANDARD_DRAFT, HIGH_QUALITY_DRAFT, IMPROVED_HIGH_QUALITY_DRAFT, ANNOTATION_GRADE, NON_CONTIGUOUS_FINISHED) indicating the degree of completion of the genome sequence appear on the KEYWORDS line. The definition of each keyword can be found in the INSDC agreed methodological keywords.
- A summary of the assembly is displayed in the COMMENT.
Tag name | Value (information) |
Assembly Method | Name of the assembly algorithm(s), with the version number(s) used. |
Assembly Name | A brief name suitable for display that does not include the organism name. This is mandatory for eukaryotes. |
Genome Coverage | The estimated base coverage across the genome. |
Sequencing Technology | Sequencing platform(s) used. |
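The two accession number shapes listed under "Aspects of WGS" can be checked with a short pattern. This is a minimal sketch, not an official validator: the function name `looks_like_wgs_accession` is hypothetical, and the pattern assumes uppercase ASCII letters. It tests only the shape of the string, not whether the accession has actually been issued.

```python
import re

# Shapes stated in the text: 6 letters + 9 digits, or 4 letters + 8 digits.
# Assumption: letters are uppercase ASCII.
WGS_ACCESSION = re.compile(r"[A-Z]{6}[0-9]{9}|[A-Z]{4}[0-9]{8}")


def looks_like_wgs_accession(text: str) -> bool:
    """Return True if text has one of the two WGS accession shapes."""
    return WGS_ACCESSION.fullmatch(text) is not None


print(looks_like_wgs_accession("ZZZZZZ010000001"))  # True
print(looks_like_wgs_accession("ZZZZ01000001"))     # True
print(looks_like_wgs_accession("ZZ123456"))         # False
```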
LOCUS ZZZZZZ010000001 123456 bp DNA linear ROD 07-AUG-2024
DEFINITION Mus musculus C57BL6 DNA, EN0001.
ACCESSION ZZZZZZ010000001 ZZZZZZ010000000
VERSION ZZZZZZ010000001.1
DBLINK BioProject:PRJDB99999
Sequence Read Archive:DRR999998, DRR999999
BioSample:SAMD99999999
KEYWORDS WGS; STANDARD_DRAFT.
SOURCE Mus musculus
ORGANISM Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Myomorpha;
Muroidea; Muridae; Murinae; Mus; Mus.
REFERENCE 1 (bases 1 to 123456)
AUTHORS Mishima,H. and Shizuoka,T.
TITLE Direct Submission
JOURNAL Submitted (01-MAY-2024) to the DDBJ/EMBL/GenBank databases.
Contact:Hanako Mishima
National Institute of Genetics, DNA Data Bank of Japan; Yata 1111,
Mishima, Shizuoka 411-8540, Japan
REFERENCE 2
AUTHORS Mishima,H., Shizuoka,T. and Fuji,I.
TITLE Mouse whole genome shotgun sequence
JOURNAL Unpublished (2024)
COMMENT Whole genome shotgun sequencing project.
##Genome-Assembly-Data-START##
Assembly Method :: HGAP v. 1.0; Celera Assembler v. 7.0;
Quiver v. 1.4.0; Sequencher v. 5.1
Assembly Name :: MusC56 v1
Genome Coverage :: 238x
Sequencing Technology :: PacBio RS, Illumina GAIIx
##Genome-Assembly-Data-END##
FEATURES Location/Qualifiers
source 1..123456
/collection_date="missing: lab stock"
/db_xref="taxon:10090"
/geo_loc_name="Japan"
/mol_type="genomic DNA"
/organism="Mus musculus"
/strain="C57BL6"
/submitter_seqid="EN0001"
CDS complement(join(147..1241,1364..1816))
/codon_start=1
/locus_tag="DDBJGEN_0001G0001"
/product="hypothetical protein"
/protein_id="xxxxxxxxxx.1"
/transl_table=1
/translation="MTEHIFEKISLNLSNIINKCVYKQTTLNDAQNE
IKETMNVIINQYNHYITKDVMDEILILTSKLLYSQNIESLIIYLNKL
(snipped)
GFFRMYQIWNVS"
assembly_gap 2982..3269
/estimated_length=288
/gap_type="within scaffold"
/linkage_evidence="paired_ends"
tRNA 3569..3643
/locus_tag="DDBJGEN_t0001G0001"
/product="tRNA-Ser"
-- The rest is snipped --
//
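The assembly summary in the COMMENT of the sample flat file above follows a simple "Tag :: Value" layout between the START and END markers, so it can be read into a dictionary. The sketch below is illustrative only: `parse_assembly_data` is a hypothetical helper, and it assumes that a wrapped value continues on a following line that contains no "::" separator, as the Assembly Method value does in the sample.

```python
def parse_assembly_data(comment_lines):
    """Collect 'Tag :: Value' pairs found between the
    Genome-Assembly-Data START and END markers."""
    fields = {}
    inside = False
    last_key = None
    for raw in comment_lines:
        line = raw.strip()
        if line.endswith("Genome-Assembly-Data-START##"):
            inside = True
            continue
        if line.endswith("Genome-Assembly-Data-END##"):
            break
        if not inside or not line:
            continue
        if "::" in line:
            key, value = line.split("::", 1)
            last_key = key.strip()
            fields[last_key] = value.strip()
        elif last_key is not None:
            # Assumed convention: a line without '::' continues the previous value
            fields[last_key] += " " + line
    return fields


sample = [
    "##Genome-Assembly-Data-START##",
    "Assembly Method :: HGAP v. 1.0; Celera Assembler v. 7.0;",
    "Quiver v. 1.4.0; Sequencher v. 5.1",
    "Assembly Name :: MusC56 v1",
    "Genome Coverage :: 238x",
    "Sequencing Technology :: PacBio RS, Illumina GAIIx",
    "##Genome-Assembly-Data-END##",
]
summary = parse_assembly_data(sample)
print(summary["Genome Coverage"])  # 238x
```

The marker check uses `endswith` so that it tolerates leading whitespace or a field label before the marker, and also the single leading "#" that appears in some copies of the sample.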