WGS

In the whole genome shotgun (WGS) approach, the entire genome is broken at once into millions of fragments, which are sequenced and reassembled to produce a series of sequence ‘scaffolds’. This approach has been used to sequence the genomes of various organisms.

The large set of contigs from an ongoing genome project can be submitted to DDBJ/ENA/GenBank as WGS data.
See also the INSDC standards for genome assembly submission.

See the list of publicly released WGS data.

You can submit WGS data to DDBJ via the Mass Submission System (MSS).

Acceptable WGS data

In principle, DDBJ/ENA/GenBank accept appropriately assembled sequences (i.e. assemblies of overlapping reads) and cannot accept redundant, unassembled reads (i.e. raw read sequences). If you wish to release raw read sequences, please contact the DDBJ Sequence Read Archive (DRA).

WGS entries are contigs (overlapping reads) and/or scaffolds (assembled contigs separated by gaps).
WGS entries can contain runs of consecutive "n"s to represent sequencing gaps.
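
As a rough illustration of how gaps written as runs of "n" can be located in a sequence, here is a minimal Python sketch. The function name `find_gap_runs` and the `min_length` parameter are hypothetical and chosen only for this example; no minimum gap length is stated on this page.

```python
import re

def find_gap_runs(sequence, min_length=1):
    """Return (start, end, length) for each run of 'n' in the sequence.

    Coordinates are 1-based and inclusive. min_length is a hypothetical
    filter added for illustration, not a DDBJ rule.
    """
    runs = []
    for match in re.finditer(r"[nN]+", sequence):
        length = match.end() - match.start()
        if length >= min_length:
            # match.start() is 0-based; match.end() is exclusive, so it
            # already equals the 1-based inclusive end position.
            runs.append((match.start() + 1, match.end(), length))
    return runs

example = "acgtnnnnnacgtacgnnacg"
print(find_gap_runs(example))  # [(5, 9, 5), (17, 18, 2)]
```

The 1-based inclusive coordinates make the result easy to compare with feature locations in a flat file.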

Unacceptable WGS data

Assembled genome sequences from multiple organisms, unless they are metagenomes.
The following, when submitted without a chromosome assembly (contigs and scaffolds):
- Organelle genome contigs alone.
- Plasmid contigs alone.

Submission of WGS entry

Submitters visit the MSS form site and submit an application.

Before submitting assembly sequence data, you must register with the BioProject and BioSample databases.
If you wish to annotate all protein-coding genes and non-protein-coding RNA genes on the assembly sequences, please register a locus_tag prefix when submitting each BioSample.
Sample annotation: (WGS sample annotation).

Sample flat file

Aspects of WGS

As a rule, each WGS sequence submitted to DDBJ is assigned an accession number consisting of either 6 letters followed by 9 digits (since January 2024) or 4 letters followed by 8 digits.
“WGS” and one of the controlled terms (STANDARD_DRAFT, HIGH_QUALITY_DRAFT, IMPROVED_HIGH_QUALITY_DRAFT, ANNOTATION_GRADE, NON_CONTIGUOUS_FINISHED), which indicates the degree of completion of the genome sequence, appear on the KEYWORDS line. The definition of each keyword can be found on the following website (INSDC agreed methodological keywords).
A summary of the assembly is displayed in the COMMENT.

Tag name	Value (information)
Assembly Method	Name(s) of the assembly algorithm(s), with the version number(s) used.
Assembly Name	A brief name suitable for display that does not include the organism name. This is mandatory for eukaryotes.
Genome Coverage	The estimated base coverage across the genome.
Sequencing Technology	Sequencing platform(s) used.
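
The accession format and the KEYWORDS convention described in this section can be checked with a short script. The following is a minimal Python sketch and not an official DDBJ validator: it covers only the two accession formats named above, it assumes uppercase letters (as in the sample flat file), and the function names are hypothetical.

```python
import re

# Only the two formats stated in the text: 6 letters + 9 digits,
# or 4 letters + 8 digits. Uppercase letters are an assumption.
WGS_ACCESSION = re.compile(r"(?:[A-Z]{6}[0-9]{9}|[A-Z]{4}[0-9]{8})")

# Controlled terms listed above for the degree of completion.
COMPLETION_TERMS = {
    "STANDARD_DRAFT",
    "HIGH_QUALITY_DRAFT",
    "IMPROVED_HIGH_QUALITY_DRAFT",
    "ANNOTATION_GRADE",
    "NON_CONTIGUOUS_FINISHED",
}

def is_wgs_accession(text):
    """True if text matches one of the two accession formats above."""
    return WGS_ACCESSION.fullmatch(text) is not None

def parse_keywords(value):
    """Split a KEYWORDS value such as 'WGS; STANDARD_DRAFT.' into terms."""
    stripped = value.strip().rstrip(".")
    return [term.strip() for term in stripped.split(";") if term.strip()]

def has_wgs_keywords(value):
    """True if 'WGS' and at least one completion term are present."""
    terms = parse_keywords(value)
    return "WGS" in terms and any(term in COMPLETION_TERMS for term in terms)

print(is_wgs_accession("ZZZZZZ010000001"))       # True
print(has_wgs_keywords("WGS; STANDARD_DRAFT."))  # True
```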

LOCUS       ZZZZZZ010000001              123456 bp    DNA    linear   ROD 07-AUG-2024
DEFINITION  Mus musculus C57BL6 DNA, EN0001. 
ACCESSION   ZZZZZZ010000001 ZZZZZZ010000000
VERSION     ZZZZZZ010000001.1
DBLINK      BioProject:PRJDB99999
            Sequence Read Archive:DRR999998, DRR999999
            BioSample:SAMD99999999
KEYWORDS    WGS; STANDARD_DRAFT.
SOURCE      Mus musculus
  ORGANISM  Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; 
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Myomorpha; 
            Muroidea; Muridae; Murinae; Mus; Mus.
REFERENCE   1  (bases 1 to 123456)
  AUTHORS   Mishima,H. and Shizuoka,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-MAY-2024)
            Contact:Hanako Mishima
            National Institute of Genetics, DNA Data Bank of Japan; Yata 1111,
            Mishima, Shizuoka 411-8540, Japan
REFERENCE   2
  AUTHORS   Mishima,H., Shizuoka,T. and Fuji,I.
  TITLE     Mouse whole genome shotgun sequence
  JOURNAL   Unpublished (2024)
COMMENT     Whole genome shotgun sequencing project.
            ##Genome-Assembly-Data-START##
            Assembly Method       :: HGAP v. 1.0; Celera Assembler v. 7.0; 
                                     Quiver v. 1.4.0; Sequencher v. 5.1
            Assembly Name         :: MusC56 v1
            Genome Coverage       :: 238x
            Sequencing Technology :: PacBio RS, Illumina GAIIx
            ##Genome-Assembly-Data-END##
FEATURES             Location/Qualifiers
     source          1..123456
                     /collection_date="missing: lab stock"
                     /db_xref="taxon:10090"
                     /geo_loc_name="Japan"
                     /mol_type="genomic DNA"
                     /organism="Mus musculus"
                     /strain="C57BL6"
                     /submitter_seqid="EN0001"
     CDS             complement(join(147..1241,1364..1816))
                     /codon_start=1
                     /locus_tag="DDBJGEN_0001G0001"
                     /product="hypothetical protein"
                     /protein_id="xxxxxxxxxx.1"
                     /transl_table=1
                     /translation="MTEHIFEKISLNLSNIINKCVYKQTTLNDAQNE
                     IKETMNVIINQYNHYITKDVMDEILILTSKLLYSQNIESLIIYLNKL
                     (snipped)
                     GFFRMYQIWNVS"
     assembly_gap    2982..3269
                     /estimated_length=288
                     /gap_type="within scaffold"
                     /linkage_evidence="paired_ends"
     tRNA            3569..3643
                     /locus_tag="DDBJGEN_t0001G0001"
                     /product="tRNA-Ser"

-- The rest is snipped --
//
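
The assembly summary in the COMMENT block of the sample above is written as "Tag :: Value" pairs between the Genome-Assembly-Data START and END marker lines, so it can be read back into a dictionary. The following is a minimal Python sketch under that assumption; the function name `parse_assembly_comment` is hypothetical, and wrapped values are simply joined with a single space.

```python
def parse_assembly_comment(flatfile_text):
    """Collect 'Tag :: Value' pairs between the Genome-Assembly-Data markers."""
    tags = {}
    inside = False
    last_tag = None
    for line in flatfile_text.splitlines():
        body = line.strip()
        if body.startswith("COMMENT"):
            body = body[len("COMMENT"):].strip()
        # Compare markers without their surrounding '#' characters.
        marker = body.strip("#")
        if marker == "Genome-Assembly-Data-START":
            inside = True
        elif marker == "Genome-Assembly-Data-END":
            break
        elif inside:
            if "::" in body:
                tag, value = body.split("::", 1)
                last_tag = tag.strip()
                tags[last_tag] = value.strip()
            elif last_tag is not None and body:
                # Continuation line of a wrapped value.
                tags[last_tag] += " " + body
    return tags

sample = """\
COMMENT     Whole genome shotgun sequencing project.
            ##Genome-Assembly-Data-START##
            Assembly Method       :: HGAP v. 1.0; Celera Assembler v. 7.0;
                                     Quiver v. 1.4.0; Sequencher v. 5.1
            Assembly Name         :: MusC56 v1
            Genome Coverage       :: 238x
            Sequencing Technology :: PacBio RS, Illumina GAIIx
            ##Genome-Assembly-Data-END##
"""

for tag, value in parse_assembly_comment(sample).items():
    print(f"{tag}: {value}")
```

Lines outside the two markers, such as the first COMMENT line, are ignored.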