DDBJ Annotated/Assembled Sequences
MSS - Mass Submission System
Submission of research data from human subjects
For all data from human subjects research submitted to DDBJ, it is the submitter’s responsibility to ensure that the dignity and rights of participants (human subjects) are protected in accordance with all applicable laws, regulations, and policies of the submitter’s institution. In principle, remove any direct personal identifiers of human subjects from your submissions. Before submission, read “Submission of research data from human subjects”.
Overview
The Mass Submission System (MSS) is a service that accepts relatively large-scale nucleotide sequence data (not reads) submitted as text files. Nucleotide sequence data that fall under any of the following cases should be submitted via MSS, because they cannot be accepted via the NSSS: DDBJ Nucleotide Sequence Submission System. Note that the criteria are not limited to the number or length of your sequences.
a) Any of the following categories or amounts of sequence data
- EST, TSA, HTC, GSS, HTG, WGS, TLS, TPA
- For details, see Categories for Sequence Data.
- Submissions with long sequences, greater than 500 kb in length
- Complex submissions containing many features for one sequence, more than 30 features
- Submissions consisting of a large number of sequences, greater than 100 in total
b) Sequence data of whole-length scale replicons, regardless of whether they are at finished or draft level
- (Nuclear) genome
- Organelle genome
- Chromosome
- Virus/Phage genome/segments
- Plasmid
c) Sequence data for which BioProject or BioSample is to be described in DBLINK
Cases in which you need to use DBLINK to link to BioProject or BioSample include, but are not limited to, the following.
- Sequence data from metagenome analyses, environmental profiling, and so on
- Sequence data of targeted genes to be linked to each other
- When you are planning to submit, or have already submitted, whole genome scale data obtained from the same samples
- When you are required to submit a prokaryotic 16S rRNA gene for a phylogenetic report
- Advanced paper submission of any other targeted gene(s)/cluster region(s)
- Basically, if none of the above applies to your data, DDBJ recommends using the NSSS: DDBJ Nucleotide Sequence Submission System.
- If you are to submit reads from sequencers, see DRA, DDBJ Sequence Read Archive.
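The criteria above can be sketched as a small pre-check. This is only an illustration: the `PlannedData` class and the `reasons_for_mss` function are hypothetical names, not part of any DDBJ tool, and "500 kb" is assumed to mean 500,000 bases.

```python
from dataclasses import dataclass, field
from typing import List

# Categories named in case a) of the overview.
MSS_CATEGORIES = {"EST", "TSA", "HTC", "GSS", "HTG", "WGS", "TLS", "TPA"}
MAX_NSSS_LENGTH = 500_000   # assumption: "500 kb" read as 500,000 bases
MAX_NSSS_FEATURES = 30      # features on one sequence
MAX_NSSS_SEQUENCES = 100    # sequences in one submission


@dataclass
class PlannedData:
    """Hypothetical description of a planned submission (not a DDBJ format)."""
    category: str = ""                  # e.g. "WGS"; empty if none applies
    sequence_lengths: List[int] = field(default_factory=list)
    max_features_per_sequence: int = 0
    is_whole_replicon: bool = False     # case b): genome, plasmid, and so on
    needs_dblink: bool = False          # case c): BioProject/BioSample link


def reasons_for_mss(data: PlannedData) -> List[str]:
    """Return the overview criteria that the planned data meets."""
    reasons = []
    if data.category.upper() in MSS_CATEGORIES:
        reasons.append(f"category {data.category.upper()}")
    if any(n > MAX_NSSS_LENGTH for n in data.sequence_lengths):
        reasons.append("sequence longer than 500 kb")
    if data.max_features_per_sequence > MAX_NSSS_FEATURES:
        reasons.append("more than 30 features on one sequence")
    if len(data.sequence_lengths) > MAX_NSSS_SEQUENCES:
        reasons.append("more than 100 sequences")
    if data.is_whole_replicon:
        reasons.append("whole-length replicon")
    if data.needs_dblink:
        reasons.append("BioProject/BioSample in DBLINK")
    return reasons
```

An empty result suggests that the NSSS may be sufficient; any non-empty result points to MSS.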
Registration process in MSS
Preparation of the submission files
Required files for the registration
- Sequence file
- A text file that contains all nucleotide sequences in a FASTA-like format. See Submission file format: Sequence file.
- Annotation file
- A tab-delimited text file that contains metadata (submitters, reference) and annotation (Feature/Qualifier). See Submission file format: Annotation file.
- For prokaryote genomes, you can create the files by using DFAST (DDBJ Fast Annotation and Submission Tool).
- AGP file (only in case of CON entries)
- [Caution] DDBJ has currently stopped accepting new submissions.
- A tab-delimited text file used to construct a CON sequence; it contains the order, orientation, and type of each piece entry. If the nucleotide sequence can be assembled from an AGP file, you do not need to send a sequence file. See Submission file format: AGP file.
Getting BioProject & BioSample ID
- Depending on the type of data, you must obtain a BioProject ID and a BioSample ID (and also reserve a locus_tag prefix) in order to prepare the submission files.
- In principle, a reserved locus_tag prefix cannot be changed, so please be careful when registering the prefix.
- See the table to find out which data types need BioProject and BioSample.
Sample & Documentation
- Sample annotations
- User guide of annotation file
- For prokaryote genomes, we strongly recommend using DFAST (DDBJ Fast Annotation and Submission Tool).
- See DFAST: creating the submission files to obtain the submission files.
- For whole genome-scale sequences, describing biological features other than source and assembly_gap is optional. However, for a novel species that has not been reported so far, feature annotation must be described for at least one genome as a representative.
- When you submit a genome with annotation, you must reserve a locus_tag prefix when registering the BioSample.
- For TSA data, describing biological features other than source and assembly_gap is optional (basically unnecessary).
- For EST data, you cannot describe any biological features other than source.
- See the table to find out which data types need annotation.
Tool for checking the submission files
Before submitting to DDBJ, the files must be checked with the software tools provided by DDBJ.
- UME (Utilities for MSS file Error check): checks the syntax, format, and amino acid translation of CDS features in the sequence file and annotation file. It includes both Parser and transChecker. OS: Windows, Linux/macOS. Details: UME User’s Manual
- Parser: checks the syntax and format of the sequence file and annotation file. OS: Linux. Details: Parser User’s Manual
- transChecker: validates the amino acid translation of CDS features (protein-coding sequences) in the annotation file and sequence file. OS: Linux. Details: transChecker User’s Manual
Download: Validation tools for MSS data files
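To illustrate what checking the amino acid translation of a CDS involves, here is a minimal sketch. It is not a substitute for UME or transChecker: it assumes the standard genetic code, a complete CDS starting in frame at position 1, and the hypothetical function names `translate` and `cds_problems`.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def cds_problems(cds: str) -> list:
    """Return a list of simple problems found in a complete CDS."""
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    protein = translate(cds)
    if not protein.endswith("*"):
        problems.append("no terminal stop codon")
    if "*" in protein[:-1]:
        problems.append("internal stop codon")
    return problems
```

For example, `cds_problems("ATGGCCTAA")` returns an empty list, while a sequence with a stop codon in the middle is reported.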
- The validation tools cannot create submission files. Please prepare your submission files with a text editor, spreadsheet software, or another suitable application on your PC.
- Syntax errors caused by undefined characters, stray control codes, and so on are a major obstacle to processing submitted data and may significantly delay the issuing of accession numbers.
- When you describe CDS (protein-coding sequence) features in the annotation of your sequence, you must check their amino acid translation with UME or transChecker before submitting to DDBJ.
- Before installing the validation tools, see End-user license agreement.
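Stray control codes are easy to introduce when files pass through spreadsheet software. The following sketch reports them before the official tools are run. It assumes that only printable ASCII, tab, and line feed are acceptable, which is a conservative guess rather than the DDBJ definition; `find_suspicious_characters` is a hypothetical helper name.

```python
def find_suspicious_characters(text: str) -> list:
    """Return (line, column, repr(char)) for each control or non-ASCII character.

    Assumption: printable ASCII, tab, and line feed are allowed. Everything
    else, including carriage returns, is reported.
    """
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\t":
                continue
            code = ord(ch)
            if code < 32 or code > 126:
                hits.append((line_no, col, repr(ch)))
    return hits
```

Reading the file with `open(path, encoding="utf-8", newline="")` keeps carriage returns visible to the check instead of silently normalizing them.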
Creating account
- If you have not obtained a DDBJ account, create your account (see HELP).
- If you would like to use SCP/SFTP to transfer the files to DDBJ, you also need to register a public key with your DDBJ account. See “Data upload” for a detailed description of how to transfer the files.
Applying for the registration
Please apply for your submission through the “Application form for MSS”. You can include multiple submissions in the same application only if the submission data meet the requirements shown below. If any one of the files does not meet the requirements, you need to separate the submissions.
If you have already prepared the submission files, you can upload them in the application form. If you have not yet created the submission files, or could not upload them during the MSS application, please upload them later. The reply email describes how to upload the files.
A Mass-ID is issued just after you complete the application. The Mass-ID (e.g. [DDBJ:NSUB000001]) is included in the subject line of the email. Please keep the subject line unchanged when you reply to the email sent from DDBJ.
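If you track several applications, the Mass-ID tag can be picked out of a subject line with a regular expression. The pattern below is inferred from the single example [DDBJ:NSUB000001] and assumes the form "NSUB" followed by digits; `extract_mass_id` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumption: the tag always looks like "[DDBJ:NSUB<digits>]".
MASS_ID_PATTERN = re.compile(r"\[DDBJ:NSUB\d+\]")


def extract_mass_id(subject: str) -> Optional[str]:
    """Return the bracketed Mass-ID tag found in a subject line, or None."""
    match = MASS_ID_PATTERN.search(subject)
    return match.group(0) if match else None
```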
- Requirements for including submissions in the same application
- All submission files have the same contact person
- All submission files have the same data type
- All submission files have the same hold date
- Example cases that are allowed for the same application
- Draft genomes consist of twenty Bacterial strains ➞ WGS: Whole Genome Shotgun
- Finished level genome sequences from three isolates of eukaryote genomes ➞ GNM: Finished Level Genome sequence, non-WGS
- Assembled transcribed sequences among the same species. It comprises multiple sets from different organisms. ➞ TSA: Transcriptome Shotgun Assembly
Example cases where you must apply for the registration two or more times
- a. Draft genome of chromosome, and complete plasmid sequences from a Bacterial strain
- You must separate into two submissions.
- Draft genome ➞ WGS: Whole Genome Shotgun
- Complete plasmid sequence(s) ➞ MISC: Sequences that are not included in above types
- b. Draft genome sequences of chromosomes from a eukaryotic isolate, and complete organelle genome
- You must separate into two submissions.
- Draft genome ➞ WGS: Whole Genome Shotgun
- Complete organelle sequence(s) ➞ MISC: Sequences that are not included in above types
- c. Draft genome sequences of chromosomes from a eukaryotic isolate, and assembled transcript sequences in large-scale
- You must separate into two submissions.
- Draft genome ➞ WGS: Whole Genome Shotgun
- Assembled transcripts ➞ TSA: Transcriptome Shotgun Assembly
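The three requirements amount to grouping planned submissions by contact person, data type, and hold date. The sketch below shows that grouping; `PlannedSubmission` and `group_into_applications` are hypothetical names, and a hold date of `None` is assumed to mean immediate release.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Tuple


class PlannedSubmission(NamedTuple):
    """Hypothetical record for one planned submission."""
    name: str
    contact_person: str
    data_type: str              # e.g. "WGS", "GNM", "TSA", "MISC"
    hold_date: Optional[str]    # assumption: None means immediate release


def group_into_applications(
    submissions: List[PlannedSubmission],
) -> Dict[Tuple[str, str, Optional[str]], List[str]]:
    """Group submissions that may share one MSS application."""
    groups = defaultdict(list)
    for sub in submissions:
        key = (sub.contact_person, sub.data_type, sub.hold_date)
        groups[key].append(sub.name)
    return dict(groups)
```

Applied to example a, a draft genome (WGS) and its complete plasmids (MISC) fall into two groups even when the contact person and hold date are identical, so two applications are needed.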
How to upload the submission files
Submitters can transfer the submission files from the MSS form by any one of the methods indicated below.
- Uploading from browser
- Specifying the DFAST job ID
- Loading the files which have been transferred to SFTP server
- Select this method when the total size of the submission files exceeds 10 Gbytes uncompressed. You need a public/private key pair to use SFTP. First, register a public key with your account, and then upload the files according to “Data upload”.
- Read the description below
<NOTE> Do not send submission files as email attachment unless there is some particular reason.
File format for uploading to SFTP server
- The destination directory is /mass
- The mass directory is the location from which files are imported when the MSS Application Form is used. Therefore, only the submission files should be placed here.
- The MSS form reads files recursively from the subdirectories under mass/.
- There are some rules for submission file names. For compressed files, the files inside the compressed archive are subject to the same rules.
- The file extension of the annotation file should be one of .ann, .annt, .tsv, or .ann.txt
- The file extension of the nucleotide sequence file should be one of .fasta, .seq.fa, .fa, .fna, or .seq
- An annotation file and a nucleotide sequence file must form a pair. The system treats two files as a pair when their filenames, without the extension, are identical.
- This does not apply when re-submitting submission files at the request of a DDBJ curator.
- Use alphanumeric characters and some symbols (excluding space, backtick, angle brackets “<>”, and parentheses “()”) in file names. Do not include multibyte characters such as Japanese characters.
- The MSS Application Form can import files from compressed archives. The following compression types are available.
- gzip, bzip2, xz, lzip, lzma, lzop, zstd, compress
- e.g. 20230322-1.tar.gz, 20230322-2.tar.bz2, 20230322-3.tar.xz, 20230322-4.zip, 20230322-5.tar.lzma, 20230322-6.tar.lzo, 20230322-7.tar.zst, 20230322-8.tar.Z
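The file name rules above can be checked locally before uploading. The sketch below covers extensions, forbidden characters, and annotation/sequence pairing. The name `check_names` is hypothetical, the input is assumed to be bare file names without directories, and the source does not list exactly which symbols are allowed, so only the explicitly excluded ones are rejected.

```python
# Longest extensions first, so that ".seq.fa" is matched before ".fa".
ANNOTATION_EXTS = (".ann.txt", ".annt", ".ann", ".tsv")
SEQUENCE_EXTS = (".seq.fa", ".fasta", ".fna", ".seq", ".fa")
FORBIDDEN = set(" `<>()")   # space, backtick, angle brackets, parentheses


def split_name(filename: str):
    """Return (stem, kind); kind is 'annotation', 'sequence', or None."""
    for exts, kind in ((ANNOTATION_EXTS, "annotation"), (SEQUENCE_EXTS, "sequence")):
        for ext in exts:
            if filename.endswith(ext):
                return filename[: -len(ext)], kind
    return filename, None


def check_names(filenames) -> list:
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    kinds_by_stem = {}
    for name in filenames:
        if not name.isascii():
            problems.append(f"{name}: contains non-ASCII (multibyte) characters")
        bad = sorted(FORBIDDEN & set(name))
        if bad:
            problems.append(f"{name}: contains forbidden characters {bad}")
        stem, kind = split_name(name)
        if kind is None:
            problems.append(f"{name}: unrecognized file extension")
        else:
            kinds_by_stem.setdefault(stem, set()).add(kind)
    for stem, kinds in sorted(kinds_by_stem.items()):
        for required in ("annotation", "sequence"):
            if required not in kinds:
                problems.append(f"{stem}: missing {required} file")
    return problems
```

For instance, `check_names(["sample1.ann", "sample1.fasta"])` returns an empty list, while `["a.ann", "b.fa"]` yields one missing-file message for each stem.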
Review of the submission files
Check the submission files with the file checking tools, and then upload them to DDBJ.
DDBJ reviews the submission files and then informs the submitter of any correction requests and/or inquiries. If there is no problem with the submitted files, DDBJ will register the submitted data in the database and send the accession number(s) to the contact person and/or submitters by email.
Optional: Before preparing the entire sequence and annotation files, you can send a part of your data as a test submission, and then ask DDBJ whether the submission files have been created correctly.
Publication of the data
If you specified immediate release in the submission process, the submitted data are made public as soon as possible. If you specified a hold date, the data will be released according to the “Hold-Until-Published” data release principle. The registered data are distributed in a flat file format, converted according to DDBJ-defined rules depending on the contents of the sequence and annotation. Please refer to the relationships between annotation files and flat files.
Requirement of BioProject, BioSample ID
- Genome
Your submission | BioProject | BioSample | Annotation with biological feature | locus_tag | Need DRA | You should select |
---|---|---|---|---|---|---|
Draft genome w/ annotation | M | M | M | M | OPT | WGS |
Draft genome w/o annotation | M | M | NR | NR | OPT | WGS |
Finished level genome sequence, non-WGS | M | M | M | M | OPT | GNM |
Metagenome-Assembled Genome w/ annotation | M | M | M | M | M | MAG |
Metagenome-Assembled Genome w/o annotation | M | M | NR | NR | M | MAG |
Single Amplified Genome w/ annotation | M | M | M | M | OPT | SAG |
Single Amplified Genome w/o annotation | M | M | NR | NR | OPT | SAG |
High Throughput Genomic Sequences | M | M | OPT | NR | OPT | HTG |
Transcriptome Shotgun Assembly | M | M | OPT | NR | M | TSA |
High Throughput cDNA Sequences | M | M | OPT | NR | OPT | HTC |
Expressed Sequence Tags | M | M | NR | NR | OPT | EST |
Virus/Phage genome | NR | NR | OPT | NR | OPT | MISC |
Plasmid genome only | NR | NR | OPT | NR | OPT | MISC |
Organelle genome only | NR | NR | OPT | NR | OPT | MISC |
Finished Level Genome + Plasmid | M | M | M | M | OPT | GNM, MISC for each submission |
Finished Level Genome + Organelle | M | M | M | M | OPT | GNM, MISC for each submission |
M, Mandatory; NR, Not required; OPT, Optional
- Transcriptome
Your submission | BioProject | BioSample | Annotation with biological feature | locus_tag | Need DRA | You should select |
---|---|---|---|---|---|---|
Transcriptome Shotgun Assembly | M | M | OPT | NR | M | TSA |
High Throughput cDNA Sequences | M | M | OPT | NR | OPT | HTC |
Expressed Sequence Tags | M | M | NR | NR | OPT | EST |
M, Mandatory; NR, Not required; OPT, Optional
- Targeted Locus Study
Your submission | BioProject | BioSample | Annotation with biological feature | locus_tag | Need DRA | You should select |
---|---|---|---|---|---|---|
Targeted Locus Study | M | M | M | NR | OPT | TLS |
M, Mandatory; NR, Not required; OPT, Optional
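The requirement tables can also be kept as a lookup in a submission script. The sketch below copies a few rows from the tables above; the `Requirement` type and the `mandatory_items` function are hypothetical names, and the rows not shown would be added in the same way.

```python
from typing import Dict, List, NamedTuple


class Requirement(NamedTuple):
    bioproject: str
    biosample: str
    annotation: str
    locus_tag: str
    dra: str          # the "Need DRA" column
    data_type: str    # the "You should select" column


# Rows copied from the tables (M = Mandatory, NR = Not required, OPT = Optional).
REQUIREMENTS: Dict[str, Requirement] = {
    "Draft genome w/ annotation": Requirement("M", "M", "M", "M", "OPT", "WGS"),
    "Draft genome w/o annotation": Requirement("M", "M", "NR", "NR", "OPT", "WGS"),
    "Metagenome-Assembled Genome w/ annotation": Requirement("M", "M", "M", "M", "M", "MAG"),
    "Transcriptome Shotgun Assembly": Requirement("M", "M", "OPT", "NR", "M", "TSA"),
    "Expressed Sequence Tags": Requirement("M", "M", "NR", "NR", "OPT", "EST"),
    "Virus/Phage genome": Requirement("NR", "NR", "OPT", "NR", "OPT", "MISC"),
    "Targeted Locus Study": Requirement("M", "M", "M", "NR", "OPT", "TLS"),
}


def mandatory_items(submission: str) -> List[str]:
    """Return the names of the columns marked M for the given submission."""
    row = REQUIREMENTS[submission]
    # The last field (data_type) is a label, not a requirement level.
    return [name for name in Requirement._fields[:-1] if getattr(row, name) == "M"]
```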
DFAST for the submission of prokaryote genomes
DFAST(DDBJ Fast Annotation and Submission Tool)
DFAST is a rapid annotation pipeline service for prokaryote genomes, which also generates the annotation files that can be directly submitted to DDBJ. We strongly recommend that the submitters use DFAST for the registration of the prokaryote genomes to Annotated/Assembled Sequences database.
Registration procedure for the prokaryote genome
- You need a DDBJ account obtained through DFAST in order to register a prokaryote genome and its annotation with DDBJ. BioProject, BioSample, and (when biological features are described) the locus_tag prefix must be registered in advance.
- If you log in to DFAST with a DDBJ account, you can manage the jobs analyzed in DFAST. If you do not have a login account yet, see “DDBJ Account” to create a new one.
How to submit the data obtained in DFAST
- Log in to DFAST with your account. First, upload the FASTA file on the “job submission page” and start the job to analyze the genome. At this stage, you obtain a job ID. When the job has finished, click the “DDBJ submission” tab on the page. The annotation and sequence files needed for MSS submission are created after you enter the necessary data (e.g. BioProject ID, BioSample ID, locus_tag prefix, and other metadata) into the form in the metadata section (*1). Finally, click “Format Check” to run a syntax check of the files.
- Submitting by DFAST job ID
- Copy the target job ID (format: ########-####-####-####-############)
- Submitting the files downloaded from DFAST
- On the job management page, check the box for the job number that you would like to submit to DDBJ.
- Select “MSS” as the file format type, and click “DOWNLOAD” to download the submission files. Please check the meta information carefully. If you encounter a warning, check again and correct the metadata that you entered (*2). If you would like to edit the annotation and metadata in a text file, download the files and open them with a text editor.
- Apply for the submission through the “Application form for MSS”. Following the process shown in “The Flow of MSS”, send the submission files to DDBJ.
*1 You can use DFAST and obtain the genome annotation results without logging in. After you log in to DFAST, you can import the job into your account with the “Job History” function on the menu bar, provided you remember the job ID.
*2 The function for checking the metadata in DFAST is simple. You may be asked to correct the files by DDBJ curators after you submit the data.
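When pasting a DFAST job ID into the MSS form, a quick shape check can catch truncated copies. The source gives only the layout ########-####-####-####-############ and does not say which characters are used. The sketch below assumes hexadecimal digits, as in a UUID; `looks_like_dfast_job_id` is a hypothetical helper name.

```python
import re

# Assumption: each "#" stands for one hexadecimal digit (UUID-like layout 8-4-4-4-12).
JOB_ID_PATTERN = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def looks_like_dfast_job_id(text: str) -> bool:
    """Return True if the text, ignoring surrounding whitespace, has the 8-4-4-4-12 layout."""
    return JOB_ID_PATTERN.fullmatch(text.strip()) is not None
```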