Sequence Read Archive
DDBJ Sequence Read Archive Handbook
DDBJ Sequence Read Archive (DRA) is an archive database for output data generated by next-generation sequencing machines including Roche 454 GS System®, Illumina Genome Analyzer®, Applied Biosystems SOLiD® System, and others. DRA is a member of the International Nucleotide Sequence Database Collaboration (INSDC) and archives the data in close collaboration with NCBI Sequence Read Archive (SRA) and EBI Sequence Read Archive (ERA).
The three INSDC partners regularly exchange all data other than Analysis.
DRA accepts sequencing data from capillary sequencing platforms in fastq format. To submit sequencing chromatograms in addition to bases and qualities, please submit data to the DDBJ Trace Archive.
Metadata
Metadata objects
The metadata describe how the associated data were obtained. The metadata are composed of 6 objects: Submission, BioProject, BioSample, Experiment, Run and Analysis. Each of these objects is defined by its XML schema, and the objects are related to each other. Multiple Experiments can “point” to a single Sample, but not vice-versa.
Accession numbers with distinct prefixes are assigned to each object. The metadata and accession number system are common to DRA, ERA and SRA. The Experiment, Run and Analysis are the SRA objects, and the BioProject and BioSample are external database objects.
For details, please see the DRA XML schema.
- Submission
- A container object only for grouping objects to be submitted.
- BioProject
- An overall description of a single research initiative; a project will typically relate to multiple samples and datasets.
- BioSample
- Description of biological source material; each physically unique specimen should be registered as a single BioSample with a unique set of attributes.
- Experiment
- A description of sample-specific sequencing library and sequencing methods. An Experiment references 1 BioProject and 1 BioSample. Multiple Experiments can “point” to a single Sample, but not vice-versa.
- Run
- Runs describe the files that belong to the previously created Experiments. They specify the data files for a specific sample to be processed by DRA. Note that all data files listed in a Run will be merged into a single SRA archive file, so files from different samples or replicates should not be grouped in the same Run. Paired-end data files, conversely, MUST be listed in a single run in order for the two files to be correctly processed as paired-end.
- Analysis
- Packages data associated with sequence read objects that are intended for downstream usage or that otherwise need an archival home. Submit alignment data in bam files to Run. Please contact the DRA team to request mirroring of analysis data. Analysis files are provided on the DDBJ ftp site and are not indexed by DRASearch.
Organization of metadata objects
The following are examples of metadata. Submitters can organize metadata objects flexibly.
- Most simple case
- Comparative genome sequencing of three strains (paired-end)
- Technical and biological replicates (paired-end)
- Related sequencing data are reported in two publications
Most simple case
Comparative genome sequencing of three strains (paired-end)
Include paired-end read files in a Run.
Technical and biological replicates (paired-end)
Related FAQ: How many samples do I need for my DRA submission?
Related sequencing data are reported in two publications.
Items in each metadata object.
Required*
Conditionally required*
Submission
Center Name
Enter submitter’s organization.
- Center Name*
- A submitter’s center name. Center Name List. A center name abbreviation is required to submit data to DRA.
In the metadata creation tool, the center name is automatically filled with the account information.
The Center Name is an abbreviation operationally used by SRA and is not for indicating ownership of submission. Submitters listed in Submitter hold ownership of submission.
- Lab Name*
- Laboratory name within the submitting institution. The Lab Name is pre-filled with the “Lab/Group”, “Department (2)”, “Department (1)” and “Organization” values of the D-way account. The text can be edited.
Hold Until
Specify how to release the data.
- Hold Until*
- Direct the DRA to release the record on or after the specified date. The submitter can set the hold date for a maximum of 2 years and can change the date before the record is released.
- Immediate Release*
- Direct the DRA to release the record immediately after submission is processed.
Submitter
The DRA contacts the listed address(es) by e-mail regarding the submission. Include contact information of the PI and of the non-PI member(s) who actually submit the data. The contact information is not made public. If you want to display the contact information, enter it in the BioProject.
- Name*
- Name of submitter.
- E-mail*
- E-mail of submitter.
BioProject
- BioProject ID*
- Select a project registered to BioProject or submit a new project. For submission to BioProject, please refer to the BioProject Handbook.
BioSample
- BioSample ID*
- Select samples registered to BioSample or create and submit new samples. For submission to BioSample, please refer to BioSample Handbook.
Experiment
- Alias
- Name of the experiment designated by the archive. This alias is used to reference metadata objects without accession numbers.
- BioSample Used*
- Select the BioSample this experiment uses.
- Title*
- Short text that can be used to call out experiment records in searches or in displays. A title like “[Sequencing Instrument Model] [paired end] sequencing of [BioSample ID]” (for example, “Illumina HiSeq 2000 paired end sequencing of SAMD00025741”) is automatically constructed. To enter user-defined titles, download Experiment metadata into a tab-delimited text file, edit title values and upload it.
- Library Name
- The submitter’s name for this library.
- Library Source*
- The Library Source specifies the type of source material that is being sequenced.
Library Source | Description |
---|---|
GENOMIC | Genomic DNA (includes PCR products from genomic DNA). |
TRANSCRIPTOMIC | Transcription products or non genomic DNA (EST, cDNA, RT-PCR, screened libraries). |
METAGENOMIC | Mixed material from metagenome. |
METATRANSCRIPTOMIC | Transcription products from community targets. |
SYNTHETIC | Synthetic DNA. |
VIRAL RNA | Viral RNA. |
OTHER | Other, unspecified, or unknown library source material. |
- Library Selection*
- Whether any method was used to select and/or enrich the material being sequenced.
Library Selection | Description |
---|---|
RANDOM | Random shearing only. |
PCR | Source material was selected by designed primers. |
RANDOM PCR | Source material was selected by randomly generated primers. |
RT-PCR | Source material was selected by reverse transcription PCR. |
HMPR | Hypo-methylated partial restriction digest. |
MF | Methyl Filtrated. |
repeat fractionation | Selection for less repetitive (and more gene rich) sequence through Cot filtration (CF) or other fractionation techniques based on DNA kinetics. |
size fractionation | Physical selection of size appropriate targets. |
MSLL | Methylation Spanning Linking Library. |
cDNA | Complementary DNA. |
cDNA_randomPriming | |
cDNA_oligo_dT | |
PolyA | PolyA selection or enrichment for messenger RNA (mRNA); should replace cDNA enumeration. |
Oligo-dT | Enrichment of messenger RNA (mRNA) by hybridization to Oligo-dT. |
Inverse rRNA | Depletion of ribosomal RNA by oligo hybridization. |
ChIP | Chromatin immunoprecipitation. |
MNase | Micrococcal Nuclease (MNase) digestion. |
DNAse | Deoxyribonuclease (DNase) digestion. |
Hybrid Selection | Selection by hybridization in array or solution. |
Reduced Representation | Reproducible genomic subsets, often generated by restriction fragment size selection, containing a manageable number of loci to facilitate re-sampling. |
Restriction Digest | DNA fractionation using restriction enzymes. |
5-methylcytidine antibody | Selection of methylated DNA fragments using an antibody raised against 5-methylcytosine or 5-methylcytidine (m5C). |
MBD2 protein methyl-CpG binding domain | Enrichment by methyl-CpG binding domain. |
CAGE | Cap-analysis gene expression. |
RACE | Rapid Amplification of cDNA Ends. |
MDA | Multiple displacement amplification. |
padlock probes capture method | Padlock Probes capture strategy to be used in conjunction with Bisulfite-Seq. |
other | Other library enrichment, screening, or selection process. |
unspecified | Library enrichment, screening, or selection is not specified. |
- Library Strategy*
- Sequencing technique intended for this library.
Library Strategy | Description |
---|---|
WGS | Whole genome shotgun. |
WGA | Whole genome amplification. |
WXS | Random sequencing of exonic regions selected from the genome. |
RNA-Seq | Random sequencing of whole transcriptome. |
miRNA-Seq | Micro RNA and other small non-coding RNA sequencing. |
ncRNA-Seq | Capture of other non-coding RNA types, including post-translation modification types such as snRNA (small nuclear RNA) or snoRNA (small nucleolar RNA), or expression regulation types such as siRNA (small interfering RNA) or piRNA/piwi/RNA (piwi-interacting RNA). |
ssRNA-seq | Strand-specific RNA sequencing. |
WCS | Whole chromosome (or other replicon) shotgun. |
CLONE | Genomic clone based (hierarchical) sequencing. |
POOLCLONE | Shotgun of pooled clones (usually BACs and Fosmids). |
AMPLICON | Sequencing of overlapping or distinct PCR or RT-PCR products. |
CLONEEND | Clone end (5’, 3’, or both) sequencing. |
FINISHING | Sequencing intended to finish (close) gaps in existing coverage. |
RAD-Seq | Restriction Site Associated DNA Sequence. |
ChIP-Seq | Direct sequencing of chromatin immunoprecipitates. |
MNase-Seq | Direct sequencing following MNase digestion. |
DNase-Hypersensitivity | Sequencing of hypersensitive sites, or segments of open chromatin that are more readily cleaved by DNaseI. |
Bisulfite-Seq | Sequencing following treatment of DNA with bisulfite to convert cytosine residues to uracil depending on methylation status. |
EST | Single pass sequencing of cDNA templates. |
FL-cDNA | Full-length sequencing of cDNA templates. |
CTS | Concatenated Tag Sequencing. |
MRE-Seq | Methylation-Sensitive Restriction Enzyme Sequencing strategy. |
MeDIP-Seq | Methylated DNA Immunoprecipitation Sequencing strategy. |
MBD-Seq | Direct sequencing of methylated fractions sequencing strategy. |
Tn-Seq | Gene fitness determination through transposon seeding. |
FAIRE-seq | Formaldehyde Assisted Isolation of Regulatory Elements. |
SELEX | Systematic Evolution of Ligands by EXponential enrichment. |
RIP-Seq | Direct sequencing of RNA immunoprecipitates (includes CLIP-Seq, HITS-CLIP and PAR-CLIP). |
ChIA-PET | Direct sequencing of proximity-ligated chromatin immunoprecipitates. |
Hi-C | Chromosome Conformation Capture technique where a biotin-labeled nucleotide is incorporated at the ligation junction, enabling selective purification of chimeric DNA ligation junctions followed by deep sequencing. |
ATAC-seq | Assay for Transposase-Accessible Chromatin (ATAC) strategy is used to study genome-wide chromatin accessibility. An alternative method to DNase-seq that uses an engineered Tn5 transposase to cleave DNA and to integrate primer DNA sequences into the cleaved genomic DNA. |
Targeted-Capture | |
Tethered Chromatin Conformation Capture | |
Synthetic-Long-Read | Binning and barcoding of large DNA fragments to facilitate assembly of the fragment. |
Other | Library strategy not listed. |
- Library Construction Protocol
- Free form text describing the protocol by which the sequencing library was constructed. Please include protocols of DNA fragmentation, ligation and enrichment. If a library preparation kit is used, include the name and version (if any) of the kit (for example, Illumina Nextera DNA Library Preparation Kit).
Reference: Alnasir J, Shanahan HP. Investigation into the annotation of protocol sequencing steps in the sequence read archive. Gigascience. 2015 May 9;4:23. doi: 10.1186/s13742-015-0064-7. eCollection 2015. PMID: 25960871 (Open Access)
- Instrument*
- Select a sequencing instrument model.
Instrument Model |
---|
454 GS |
454 GS 20 |
454 GS FLX |
454 GS FLX+ |
454 GS FLX Titanium |
454 GS Junior |
Illumina Genome Analyzer |
Illumina Genome Analyzer II |
Illumina Genome Analyzer IIx |
Illumina HiSeq 1000 |
Illumina HiSeq 1500 |
Illumina HiSeq 2000 |
Illumina HiSeq 2500 |
Illumina HiSeq 3000 |
Illumina HiSeq 4000 |
Illumina NovaSeq 6000 |
Illumina MiSeq |
Illumina MiniSeq |
Illumina iSeq 100 |
Illumina HiScanSQ |
HiSeq X Five |
HiSeq X Ten |
NextSeq 500 |
NextSeq 550 |
Helicos HeliScope |
AB SOLiD System |
AB SOLiD System 2.0 |
AB SOLiD System 3.0 |
AB SOLiD 3 Plus System |
AB SOLiD 4 System |
AB SOLiD 4hq System |
AB SOLiD PI System |
AB 5500 Genetic Analyzer |
AB 5500xl Genetic Analyzer |
AB 5500xl-W Genetic Analysis System |
Complete Genomics |
MinION |
GridION |
PromethION |
PacBio RS |
PacBio RS II |
Sequel |
Ion Torrent PGM |
Ion Torrent Proton |
Ion Torrent S5 |
Ion Torrent S5 XL |
AB 310 Genetic Analyzer |
AB 3130 Genetic Analyzer |
AB 3130xL Genetic Analyzer |
AB 3500 Genetic Analyzer |
AB 3500xL Genetic Analyzer |
AB 3730 Genetic Analyzer |
AB 3730xL Genetic Analyzer |
- Spot Type*
- Select a layout of reads in sequencing data files.
Spot Type | Description |
---|---|
single | Single read |
paired (FF) | Paired reads with same direction. |
paired (FR) | Paired reads with opposite direction. |
- Nominal Length*
- Size of the insert for Paired reads.
- Nominal Sdev
- Standard deviation of insert size.
- Spot Length*
- The read length in submitted sequencing files. For mate pairs, this number includes mate pairs, but does not include gap lengths.
- When the spot length is constant, enter a constant value.
- For 454 platforms producing reads with variable length, enter a constant flow count.
- For fastq files with variable length, enter an average length.
Run
- Alias
- Name of the run designated by the archive. This alias is used to reference metadata objects without accession numbers.
- Title*
- Short text that can be used to call out run records in searches or in displays. A title like “[Sequencing Instrument Model] [paired end] sequencing of [BioSample ID]” (for example, “Illumina HiSeq 2000 paired end sequencing of SAMD00025741”) is automatically constructed. To enter user-defined titles, download Run metadata into a tab-delimited text file, edit title values and upload it.
- Experiment Referenced*
- Select the experiment this run belongs to.
Data files for Run
Select data files for a Run.
- Run/Analysis
- Specify whether a data file belongs to the Run or Analysis. In the web submission form, this field is not editable and is automatically filled according to the selected Run or Analysis. When uploading metadata in a tsv file, this field needs to be specified manually.
- File Name*
- The name of a sequence data file. Uploaded filenames are automatically filled in.
- Run/Analysis contains files*
- Select a Run to which the data file belongs.
- File Type*
- The sequence data file format. For fastq files with variable read length, select ‘generic_fastq’. For fastq files with constant read length, select ‘fastq’.
File Type | Description |
---|---|
generic_fastq | fastq files with variable read length |
fastq | fastq files with constant read length |
sff | 454 Standard Flowgram Format file |
hdf5 | PacBio hdf5 Format file |
bam | Binary SAM format for use by loaders that combine alignment and sequencing data |
tab | A tab-delimited table that maps “SN in SQ line of BAM header” to “reference fasta file” |
reference_fasta | Reference sequence file in single fasta format used to construct SRA archive file format. Filename must end with “.fa” |
- MD5 Checksum*
- MD5 checksum of a sequence data file. How to obtain the MD5 checksum values.
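As an illustration of the MD5 Checksum item above, the following is a minimal Python sketch that computes the MD5 value of a data file using only the standard library. The function name `md5_of_file` and the chunk size are choices made for this example and are not part of the DRA tools.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of the file at `path`.

    `md5_of_file` is a hypothetical helper name used for illustration only.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large sequencing files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned 32-character hexadecimal string is the value to enter in the MD5 Checksum field for that file.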
Analysis
- Alias
- Name of the analysis designated by the archive. This alias is used to reference metadata objects without accession numbers.
- Title*
- Title of the analysis object.
- Description*
- Describes the contents of the analysis.
- Analysis Type*
- Select an Analysis type. Submit alignment data to Run in bam format.
Analysis Type | Description |
---|---|
De Novo Assembly | A placement of sequences including trace, SRA, GI records into a multiple alignment from which a consensus is computed. |
Sequence Annotation | Per sequence annotation of named attributes and values. Example: Processed sequencing data for submission to dbEST without assembly. Reads have already been submitted to one of the sequence read archives in raw form. The fasta data submitted under this analysis object result from the following treatments, which may serve to filter reads from the raw dataset: sequencing adapter removal, low quality trimming, poly-A tail removal, strand orientation, contaminant removal. |
Abundance Measurement | Identify the tools and processing steps used to produce the abundance measurements (coverage tracks). |
Data files for Analysis
Select data files for an Analysis.
- Run/Analysis
- Specify whether a data file belongs to the Run or Analysis. In the web submission form, this field is not editable and is automatically filled according to the selected Run or Analysis. When uploading metadata in a tsv file, this field needs to be specified manually.
- File Name*
- The name of an analysis file.
- Run/Analysis contains files*
- Select an Analysis to which the data file belongs.
- File Type*
- The analysis data file format.
File Type | Description |
---|---|
bam | Binary form of the Sequence alignment/map format for read placements, from the SAM tools project. See http://sourceforge.net/projects/samtools/. |
tab | A tab delimited text file that can be viewed as a spreadsheet. The first line should contain column headers. |
ace | Multiple alignment file output from the phred assembler and similar programs. See http://www.phrap.org/consed/distributions/README.16.0.txt for a description of the ACE file format. |
fasta | Sequence data format indicating sequence base calls. The format is simple: a header line initiated with the > character, data lines following with base calls. |
wig | The wiggle (WIG) format allows display of continuous-valued data in track format. This display type is useful for GC percent, probability scores, and transcriptome data. See http://genome.ucsc.edu/goldenPath/help/wiggle.html for a description of the Wiggle Track format. |
bed | BED format provides a flexible way to define the data lines that are displayed in an annotation track. See http://genome.ucsc.edu/FAQ/FAQformat#format1 for a description of the BED format. |
vcf | Variant Call Format. See http://www.1000genomes.org/wiki/analysis/variant%20call%20format/vcf-variant-call-format-version-41 for a description of the VCF format. |
maf | Mutation Annotation Format |
gff | General Feature Format |
csv | |
tsv | |
- MD5 Checksum*
- MD5 checksum of an analysis data file. How to obtain the MD5 checksum values.
Run data files
- The DRA does NOT accept fasta only datasets. The minimum submission level for SRA is base/color calls with quality scores.
- Make sure the file names consist only of alphanumeric characters [A-Z,a-z,0-9], underscores [_], hyphens [-] and dots [.], with no whitespace, brackets, or other punctuation or symbols.
- Barcoded data files should be demultiplexed prior to submission and a unique BioSample should be created for each barcoded sample; in other words, each BioSample must be linked to one or more unique data files.
- For fastq files, submit paired reads in separate files. For bam and sff files, paired reads need to be contained in a single file.
- Upload data files directly under a submission directory. Submitted archive files should NOT contain any directory structure.
- Binary data formats, including BAM, SFF and HDF5 should be submitted without applying any additional compression.
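The file name rule in the list above can be checked before upload. The following is a small Python sketch; the function name `invalid_file_names` and the sample names are hypothetical, and the pattern only encodes the character set listed above (it does not check file extensions or contents).

```python
import re

# Letters, digits, underscores, hyphens and dots only, as listed above.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def invalid_file_names(names):
    """Return the names that contain characters outside the allowed set.

    Names containing a path separator are also reported, because "/" is
    not among the allowed characters.
    """
    return [name for name in names if not ALLOWED_NAME.fullmatch(name)]
```

For example, `invalid_file_names(["sample_1.fastq.gz", "my reads.fastq"])` reports only the second name, because it contains a space.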
Formats of sequencing data files
The DRA metadata submission tool cannot describe technical reads (adapter, primer and barcode sequences). To submit raw data containing technical reads, or to use metadata elements that are defined in the DRA XML schema but are not available in the submission tool, submitters need to create metadata in XML files (XML examples).
Generic formats
Format | Platform | Recommended |
---|---|---|
BAM | all platforms | Yes |
fastq | all platforms | Yes |
Platform specific formats
Format | Platform | Recommended |
---|---|---|
SFF | 454 and Ion Torrent | Yes |
PacBio HDF | PacBio | Yes |
SOLiD csfasta/qual | SOLiD | No (please convert to fastq/bam) |
Illumina qseq and scarf | Illumina | No (please convert to fastq/bam) |
BAM file
Binary Alignment/Map files (BAM) represent one of the preferred DRA submission formats. BAM is a compressed version of the Sequence Alignment/Map (SAM) format (see SAMv1.pdf). BAM files can be decompressed to a human-readable text format (SAM) using SAM/BAM-specific utilities (e.g. samtools) and can contain unaligned sequences as well. DRA recommends submitting BAM files that include unaligned reads as primary data to a Run.
SAM is a tab-delimited format including both the raw read data and information about the alignment of that read to a known reference sequence(s). There are two main sections in a SAM file, the header and the alignment (sequence read) sections, each of which is described below. Note that this documentation focuses on a description of the SAM format with respect to submission of BAM files to the DRA (i.e. DRA does not accept SAM files for submission). A more comprehensive discussion of the format specifications can be found at the samtools website.
SAM Header Example:
@HD VN:1.4 SO:coordinate
@SQ SN:CHROMOSOME_I LN:15072423 UR:ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/invertebrates/Caenorhabditis_elegans/WBcel215/Primary_Assembly/assembled_chromosomes/FASTA/chrI.fa.gz AS:ce10 SP:Caenorhabditis elegans
@SQ SN:CHROMOSOME_II LN:15279345 UR:ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/invertebrates/Caenorhabditis_elegans/WBcel215/Primary_Assembly/assembled_chromosomes/FASTA/chrII.fa.gz AS:ce10 SP:Caenorhabditis elegans
@RG ID:1 PL:ILLUMINA LB:C_ele_05 DS:WGS of C elegans PG:BamIndexDecoder
@PG ID:bwa PN:bwa VN:0.5.10-tpx
SAM Alignment Example:
3658435 145 CHROMOSOME_I 1 0 100M CHROMOSOME_II 2716898 0 GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT @CCC?:CCCCC@CCCEC>AFDFDBEGHEAHCIGIHHGIGEGJGGIIIHFHIHGF@HGGIGJJJJJIJJJJJJJJJJJJJJJJJJJJJHHHHHFFFFFCCC RG:Z:1 NH:i:1 NM:i:0
5482659 65 CHROMOSOME_I 1 0 100M CHROMOSOME_II 11954696 0 GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT CCCFFFFFHHGHGJJGIJHIJIJJJJJIJJJJJIJJGIJJJJJIIJIIJFJJJJJFIJJJJIIIIGIIJHHHHDEEFFFEEEEEDDDDCDCCCAAA?CC: RG:Z:1 NH:i:1 NM:i:0
BAM file processing
The header and alignment sections should be internally consistent, as in the example above: each aligned read has an RNAME (reference sequence name, 3rd field) that matches an SN tag value from the header (e.g., CHROMOSOME_I), and, if provided, the alignment read group optional field (RG:Z:) is consistent with the read group ID in the header (1). It is also important to ensure that the FLAG fields (2nd field in each line) are correctly set for the data. The SRA pipeline will attempt to resolve incorrect FLAG values, but sufficiently incorrect values can lead to processing errors. The SRA does not archive optional and non-standard tags/field values contained in the alignment section. However, the entire header section of the bam file is retained. Additionally, although the SAM format allows an equal sign (=) in the sequence field to represent a match to the reference sequence, or only an asterisk (*) in both the sequence and quality fields, the DRA processing software does not recognize either of these notations.
Please note that unexpected notations used to indicate paired reads can lead to failure to recognize the pairs and to an improper SRA archive (i.e. paired reads are treated like fragments). For example, using :0 and :1 at the end of the read names is atypical and is currently not recognized as an indication of read 1 and 2 in a pair. It would be better to exclude these notations and provide the two reads with the same names. Expected notations for particular platforms will work. For example, Illumina reads with /1 or /2 appended are an expected notation. Further, neglecting to set the proper bits for paired reads in the SAM/BAM flags (e.g. multi-segment template 1-bit, first segment 64-bit, and last segment 128-bit) or splitting paired reads into separate bam files can result in an improper SRA archive or failure to generate the SRA archive.
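The paired-read bits mentioned above (1, 64 and 128) can be inspected with plain bit arithmetic. The following Python sketch decodes them from a FLAG value; the function names are hypothetical, and the consistency rule is a simplification that assumes ordinary two-read pairs (it does not cover templates with more than two segments).

```python
PAIRED = 0x1          # template has multiple segments (decimal 1)
FIRST_SEGMENT = 0x40  # first segment in the template (decimal 64)
LAST_SEGMENT = 0x80   # last segment in the template (decimal 128)


def describe_pair_flags(flag):
    """Return which of the three pairing bits are set in a SAM FLAG value."""
    return {
        "paired": bool(flag & PAIRED),
        "first": bool(flag & FIRST_SEGMENT),
        "last": bool(flag & LAST_SEGMENT),
    }


def pair_flags_consistent(flag):
    """Simplified check, assuming two-read pairs only.

    A paired read should be marked as exactly one of first or last;
    an unpaired read should have neither bit set.
    """
    info = describe_pair_flags(flag)
    if info["paired"]:
        return info["first"] != info["last"]
    return not info["first"] and not info["last"]
```

Applied to the two FLAG values in the SAM alignment example, 145 decodes as a paired read that is the last segment, and 65 as a paired read that is the first segment.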
When submitting alignment data, you need to submit three items: the BAM file, the references (INSDC/RefSeq accession numbers OR a reference multi-fasta file) and the SN-reference mapping table. Submit one bam file per Run.
When submitting a bam file to Analysis instead of Run, the mapping table is unnecessary. However, please consider submitting a bam file that includes unaligned reads as primary data to Run.
- BAM file submission
- The alignment data can be submitted in the BAM format. The bam files should be readable by SAMtools and Picard. BAM files are nearly optimal in terms of compression and should be submitted without additional compression.
- Specify reference by INSDC/RefSeq accession number
- If the references are found in the list, they can be specified by their accession.version number (for example, NC_000001.11). The version number is necessary. Accession numbers for references can be searched in NCBI Assembly.
- Specify reference by supplying multi-fasta
- If the references are not found in the list, submit a reference file in multi-fasta format. Select “reference_fasta” in the Run file type. The reference name in the bam header and the reference sequence are linked, via the mapping table, by the name in the bam header and the fasta defline. If the sequence length differs between @SQ-LN and the multi-fasta, a warning is raised.
- Specify reference by both INSDC/RefSeq accession number and multi-fasta
- If only some of the references are found in the list, these references can be specified by their accession.version number (for example, NC_000001.11). The rest of the references need to be supplied by uploading a multi-fasta file. In the SN-reference mapping table, list the accession.version numbers and the sequence names of the multi-fasta deflines.
- SN-reference mapping table
- A tab-delimited text file describing the mapping between “SN in SQ line in BAM header” and “accession OR sequence name in fasta file”. Select “tab” in the Run file type.
BAM header
@HD VN:1.0 GO:none SO:coordinate
@SQ SN:chr1 LN:249698942
@SQ SN:chr2 LN:242508799
@SQ SN:chr3 LN:198450956
...
SN-fasta mapping table. In the example, the reference named ref1 in multi-fasta file corresponds to the SN:chr1.
chr1 ref1
chr2 ref2
chr3 ref3
...
Reference multi-fasta.
>ref1
CGGTGGGGGTGGTGTTAGTACCCCATCTTGTAGGTCTGAAACACAAAGTGTGGGGTGTCT
...
>ref2
TCCACCAACGTTAGAAGGCCTTGGCCCCCAGAGAGCCAATTTCACAATCCAGAAGTCCCC
...
>ref3
GTGTGTGACCAGGGAGGTCCCCGGCCCAGCTCCCATCCCAGAACCCAGCTCACCTACCTT
...
SN-fasta mapping table. In the example, the reference “NC_000001.11” corresponds to the SN:chr1.
chr1 NC_000001.11
chr2 NC_000002.12
chr3 NC_000003.12
...
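The relationship between the BAM header, the mapping table and the multi-fasta shown in the examples above can be checked locally before submission. The following Python sketch works on plain text (for instance a header exported as SAM text) and assumes tab-separated fields; all function names are hypothetical, and a mapped name that is absent from the fasta is simply assumed to be an accession.version number. It is not the check that DRA itself performs.

```python
def parse_sq_lines(header_text):
    """Return {SN: LN} from the @SQ lines of a SAM header given as text."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def parse_mapping_table(table_text):
    """Return {SN: reference name or accession} from the tab-delimited table."""
    mapping = {}
    for line in table_text.splitlines():
        if line.strip():
            sn, ref = line.split("\t")[:2]
            mapping[sn] = ref.strip()
    return mapping


def fasta_lengths(fasta_text):
    """Return {sequence name: length} for each record of a multi-fasta."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line.strip())
    return lengths


def check_references(header_text, table_text, fasta_text):
    """List SN values without a mapping entry and @SQ-LN/fasta length mismatches."""
    sq = parse_sq_lines(header_text)
    mapping = parse_mapping_table(table_text)
    fasta = fasta_lengths(fasta_text)
    problems = []
    for sn, ln in sq.items():
        ref = mapping.get(sn)
        if ref is None:
            problems.append(f"{sn}: no entry in mapping table")
        elif ref in fasta and fasta[ref] != ln:
            # Names missing from the fasta are assumed to be accessions.
            problems.append(f"{sn}: LN={ln} but {ref} has {fasta[ref]} bases")
    return problems
```

An empty list means every SN in the header has a mapping entry and every reference supplied as fasta has the length declared in its @SQ line.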
fastq
Run filetype needs to be specified depending on whether read length is constant or not.
- For fastq files with constant read length, select ‘fastq’ in the file type of Run. Paired reads should appear in the same order in the paired files.
- For fastq files with variable read length, select ‘generic_fastq’ in the file type of Run.
Format of fastq (for details, please see the NCBI website):
- Quality values must be in Phred scale. By default, 33 (!) is used for the Phred quality offset. In the case of 64 (@), update the ascii_offset of the Run XML to ‘ascii_offset="@"’.
- In the DRA metadata submission web interface, technical reads (adapters, linkers, barcodes) cannot be described. When submitting fastq including technical reads, please describe technical reads in the Experiment XML according to Formats of sequencing data files (XML examples). The Experiment XML submission is not necessary for fastq without technical reads.
- Paired reads must be split and submitted as two fastq files. The read names must have a suffix identifying the first and second read from the pair, for example ‘/1’ and ‘/2’.
- The first line for each read must start with ‘@’.
- The base calls and quality scores must be separated by a line starting with ‘+’.
- The Fastq files must be compressed using gzip or bzip2.
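The choice between ‘fastq’ and ‘generic_fastq’, together with the basic record rules listed above, can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes simple four-line records (no wrapped sequence lines); it reads from an already opened text handle and does not validate quality encoding.

```python
def read_lengths(handle):
    """Yield the read length of each four-line fastq record in a text handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        bases = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        quals = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError("record does not start with '@'")
        if not plus.startswith("+"):
            raise ValueError("separator line does not start with '+'")
        if len(bases) != len(quals):
            raise ValueError("bases and qualities differ in length")
        yield len(bases)


def suggest_file_type(handle):
    """Return 'fastq' for constant read length, else 'generic_fastq'."""
    lengths = set(read_lengths(handle))
    return "fastq" if len(lengths) <= 1 else "generic_fastq"
```

For a gzip-compressed file, a text handle can be obtained with `gzip.open(path, "rt")` from the Python standard library and passed to `suggest_file_type`.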
454
The DRA accepts sequencing run data from the 454 platform in the sff and fastq/bam formats. These files should reflect the sequencing run setup. If an sff file contains data derived from more than one sample, please split the resulting fastq file into files that each contain data from only one sample.
The read names found in the .sff file are meaningful and reflect the addressing scheme for the picotitre plate as well as a globally unique run id. Please do not rewrite these names in the sff file, as such addressing information will be lost. The sff file format is nearly optimal in terms of footprint, so there is little to be gained by further compression. Therefore, please provide .sff files uncompressed.
Illumina Genome Analyzer
Illumina pipeline v1.4 and later
DRA does not accept qseq files. Please convert qseq to fastq/bam.
SOLiD
SOLiD Native Format
DRA does not accept SOLiD native files. Please convert the native files to fastq/bam.
Ion Torrent
Submit Ion Torrent data in the sff or fastq/bam format.
Helicos Heliscope
Submit Helicos data in the sms (helicos_native) or fastq/bam format created with the fixed quality value “14”.
Complete Genomics
Submit Complete Genomics data in the fastq/bam format.
Pacific Biosciences
HDF5
Pacific Biosciences uses HDF5, a container file with a directory-like structure, to store raw data. The DRA accepts both bas.h5 and bax.h5 file submissions. Note that a submission of data from the RS II instrument requires that one Run consist of one *.bas.h5 file and three *.bax.h5 files. Do not rename files.
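The RS II rule above (one *.bas.h5 file and three *.bax.h5 files per Run) can be checked with a few lines of Python. The function name and the file names in the usage note are hypothetical, and the check looks only at file name suffixes, not at file contents.

```python
def rs2_run_files_ok(file_names):
    """Return True if the list holds exactly one .bas.h5 and three .bax.h5 files."""
    bas = [name for name in file_names if name.endswith(".bas.h5")]
    bax = [name for name in file_names if name.endswith(".bax.h5")]
    return len(bas) == 1 and len(bax) == 3
```

For instance, a Run listing `cell1.bas.h5`, `cell1.1.bax.h5`, `cell1.2.bax.h5` and `cell1.3.bax.h5` passes, while a Run missing one of the bax.h5 files does not.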
bam
We support the submission of the following types of PacBio bam files. Include 1 bam file per Run. For an unaligned bam file, reference and mapping table are not necessary.
- subread BAM files (*.subreads.bam)
- CCS read BAM files (*.ccs.bam)
fastq
The DRA also accepts Pacific Biosciences data in the fastq format. Because the read length varies, select the “generic_fastq” for the Run filetype.
Oxford Nanopore
Submit Oxford Nanopore data in the fastq/bam format.
Capillary sequencing platform
Submit capillary sequencing data in the fastq/bam format.
Analysis data files
PacBio Base Modification Files
PacBio sequence data also permit the analysis of methylated bases within the sequence, which can be extremely helpful to the scientific community. For example, the precise positions of those modified bases can be used to determine the specificity of the DNA methyltransferases that produced them. The PacBio analysis suite contains an analysis workflow (RS_Modification_and_Motif_Analysis) to extract these sequences and produce several files:
- motif_summary.csv
- modifications.csv
- modifications.gff
- motifs.gff
It would be beneficial to the scientific community if you were able to perform this analysis and submit at least the motif_summary.csv file for prokaryotes as a DRA Analysis object. Please submit these files as data files of an Analysis of the Sequence Annotation type, in addition to the sequencing reads in Run. For assistance, contact us.
NCBI guidelines of PacBio Base Modification Files
BioNano Whole Genome Map Files
BioNano mapping technology produces whole genome maps. These maps can be used in a variety of genomic analyses, including de novo assembly, structural variant detection and assembly curation. For example, BioNano physical maps can be integrated with de novo genome assemblies produced from next-generation technologies to produce high quality hybrid assemblies with increased continuity and completeness, especially in regions of genomic complexity. Files produced as part of the BioNano mapping and/or hybrid assembly process include:
- CMAP
- The BioNano Genomics Irys .cmap file is a raw data view of a molecule set or assembly reporting a label site position within a genome map identified during a run. The Irys .cmap file reports the start and end coordinates and the locations of the labels on a map using a tab-delimited text based file.
- COORD
- The purpose of the .coord file is to relate the coordinates of scaffolds in a hybrid assembly to the corresponding AGP submission. The .coord file maps positions from the hybrid cmap, which may not begin or end with sequence gaps. The scaffolds are trimmed up to the leftmost label of leftmost sequence and the rightmost label of the rightmost sequence.
- XMAP
- The BioNano Genomics Irys .xmap file is a cross-comparison between two maps. The Irys .xmap file reports the comparison derived from the alignment between an anchor .cmap file and a query .cmap file. The data line displays the map start and end coordinates and the locations of the labels on the map using a tab-delimited text based file.
- SMAP
- The BioNano Genomics Irys .smap file is a description of structural variations (SV) detected between two genome maps. The Irys .smap file reports the structural variants discovered during an alignment between an anchor .cmap file and a query .cmap file. The data line displays the start and end coordinates and the locations of the SV on the map using a tab-delimited, text-based file.
- BNX
- The BioNano Genomics Irys .bnx file is a raw data view of molecule and label information and quality scores per channel identified during a run.
For the latest file specifications, please see the BioNano GitHub site.
If you are using BioNano data as part of your assembly generation pipeline, it would be extremely useful to the scientific community if you could submit a package comprised minimally of the molecule .bnx file and the resulting de novo assembly file EXP_REFINEFINAL1.cmap and COORD files as a DRA Analysis. We will add an analysis type and filetypes for the BioNano Genome Map files. In the meantime, please submit the BioNano files as the analysis type “De Novo Assembly” and the filetype “tsv” (Example, DRZ011299, DRZ011300).
Submission to the DRA
- Submission of research data from human subjects
- When submitting data from human subjects (human data) to the databases of the DDBJ center, it is the submitter’s responsibility to ensure that the dignity and rights of the human subjects are protected in accordance with all applicable laws, ordinances, guidelines and policies of the submitter’s institution. In principle, remove any direct personal identifiers of human subjects from the data to be submitted. Before submitting human data, read the “Submission of research data from human subjects”.
- Submission of Patent Related Sequences
- Please read “Submission of Patent Related Sequences” and “Patent Priority and Other Priority” before submitting patent related sequences.
Metadata and sequence data are required for submission to the DRA.
Please submit the assembled sequence data to the DDBJ. The DDBJ Mass Submission System (MSS) accepts the genomic or abundant sequence data generated by massively parallel sequencing platforms.
Data submission to DRA
1. Obtain a submission account
- Create a D-way submission account
- To enable DRA submission, register a public key and a center name to the account
2. Create a DRA submission and upload data files
- Create a new DRA submission (Add DRA submission functionality to your account)
All sequencing data in a single submission will be released at the same time.
- Upload data files by scp before submitting BioProject, BioSample, Experiment and Run
3. Submit project and sample information
BioProject
- A description of the research effort
- “Why” you sequenced your samples
BioSample 
- A description of biologically or physically unique specimens
- “What” you sequenced
Metadata can be submitted as a tab-delimited text file.
4. Submit Experiment and Run
DRA Experiment 
- A description of a sample-specific sequencing library
- “How” you performed the sequencing
- Multiple Experiments “point” to a single Sample, but not vice-versa.
DRA Run 
- Validate data files after submitting Experiment and Run
- All files linked to a Run are “merged” into a single SRA file format
5. Validate sequencing data files
- Start converting the sequencing data files into an SRA file for archiving.
- Submissions that pass the validation step will be reviewed and accessioned.
How to submit data to the DRA
Submission to BioProject/BioSample/DRA (6 min 50 sec, created: 2015)
Submission Account
At the DNA Data Bank of Japan (DDBJ) center, BioProject, BioSample, and DRA submissions are managed in the user’s account.
Following the Submission Account Handbook, obtain a submission account and enable DRA submission in the account.
Organize data
Examples of metadata object organization are available here. A single submission can register only one BioProject, but multiple BioSample, Experiment and Run objects. To organize your data into a submission, first consider the number of BioSamples.
In this chapter, the submission steps are explained using an example submission, “paired-end genome sequencing of three bacterial strains”.
Create a new submission
Log in to D-way (https://ddbj.nig.ac.jp/D-way/); the top page is displayed. Move to the DRA submission site from the “DRA” menu at the top.
Create a new submission by clicking [New submission]. At this point, a corresponding subdirectory is created under the submitter’s home directory on the DRA file server (ftp-private.ddbj.nig.ac.jp). Upload sequence data files to this subdirectory.
The submission statuses are listed below. The DRA team reviews submissions whose status is “submission_validated” or “data_error”.
List of submission status
Status | Explanation |
---|---|
New | Metadata has not been submitted. |
metadata_submitted | Metadata has been submitted. |
data_validating | Validating data files. |
data_error | Error occurred in data validation process. |
submission_validated | Metadata and data have been validated. |
completed | Accession numbers have been issued. |
confidential | Archive files have been created and the submission is kept private. |
Public | Released to public. |
Upload sequence data
At least one file must be present in the submission directory before metadata can be created. To create metadata before the sequence data are ready, upload some files first.
Transfer sequence data by using terminal (Linux/Mac OS X)
Transfer the files by executing,
$ scp -i private-key-for-auth <Your Files> <D-way Login ID>@ftp-private.ddbj.nig.ac.jp:~/<DRA Submission ID>
- -i: specify the private key for authentication, which is paired with a public key registered to your D-way account.
- <Your Files> Files to be transferred. Ex: file1 file2 (file1 and file2), file* (all files whose filenames start with “file”)
- <D-way Login ID> D-way Login ID (ex. test07)
- <DRA Submission ID> DRA Submission ID (ex. test07-0018)
- command example: scp -i private-key-for-auth strainA_1.fastq test07@ftp-private.ddbj.nig.ac.jp:~/test07-0018
Enter the passphrase set for the keys.
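When many files have to be transferred, the scp command above can be assembled by a script. The following is a minimal sketch in Python, not an official DDBJ tool: the function name `build_scp_command` is hypothetical, and the key file, login ID and submission ID are the example values used above.

```python
import shlex

# DRA file server named in this handbook.
DRA_HOST = "ftp-private.ddbj.nig.ac.jp"


def build_scp_command(private_key, files, login_id, submission_id):
    """Return the scp argument list for uploading files to a submission directory.

    Hypothetical helper: it only mirrors the command line shown in this handbook.
    """
    if not files:
        raise ValueError("at least one file is required")
    destination = f"{login_id}@{DRA_HOST}:~/{submission_id}"
    return ["scp", "-i", private_key, *files, destination]


if __name__ == "__main__":
    cmd = build_scp_command(
        "private-key-for-auth",
        ["strainA_1.fastq", "strainA_2.fastq"],
        "test07",
        "test07-0018",
    )
    # Print the command for review. To execute it, pass the list to
    # subprocess.run(cmd, check=True); scp then asks for the key passphrase.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems with filenames that contain spaces.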
You can handle the transferred files directly by logging in to the server. Log in to the server via SSH by executing,
Enter the passphrase set for the keys.
After logging in successfully, the following prompt is displayed.
The login environment is private for the submitter. Users other than the submitter cannot access the data. Executable commands are restricted to the following ones. Users can delete unnecessary files.
Transfer sequence data by using WinSCP (Windows)
Submission to DRA ~upload data files (Windows)~
Install and run the “WinSCP” (http://winscp.net/eng/download.php).
Set items as below and click the [Advanced…] button.
Be sure to select the “binary mode” for file transfer. Do NOT select the “text mode”.
- File protocol: SFTP
- Host name: ftp-private.ddbj.nig.ac.jp
- Port number: 22
- User name: (D-way Login ID)
- Password: (Leave empty)
Please select the private key, which you created beforehand, from “Private key file” in “Authentication”.
Last, click the [Login] button in the lower center.
At the first login, a warning message is displayed; please select “Yes” (this message will not be displayed again). Next, enter the passphrase set for the keys.
After logging in successfully, a folder on your PC is displayed on the left and your private directory on the server on the right. Select files in the left window and drag and drop them into the right window to transfer them to the server.
You can delete the transferred files by selecting the files and clicking the [Delete] button.
Transfer sequence data by using Cyberduck (Mac OS X)
Submission to DRA ~upload data files (Mac) ~
Download and install the Cyberduck (http://cyberduck.ch).
Run the Cyberduck and click the [Open Connection] button in the Cyberduck menu.
Select “SFTP (SSH File Transfer Protocol)” .
Set as follows and check “Use Public Key Authentication” in the More Options.
- Server: ftp-private.ddbj.nig.ac.jp
- Port: 22
- Username: (D-way Login ID)
- Password: (Leave empty)
- Add to Keychain: (Check)
By default, the private key is created in “User’s home folder > .ssh folder (invisible in Finder) > id_rsa”.
At the first login, a warning message is displayed; please select “Always” (this message will not be displayed again).
After logging in successfully, your private directory on the server is displayed in the window. Select the files on your computer and drag and drop them into the window to transfer them to the server.
Users can log in to the ftp-private.ddbj.nig.ac.jp server via SSH by using a private key. Executable commands are restricted to the following ones. Users can delete unnecessary files.
ls cd cp mv rm more mkdir tar gzip gunzip bzip2 bunzip2 zip unzip
When sending submission files too large for e-mail attachment, submitters can upload the files for the DDBJ Mass Submission System (MSS) by using the DRA file server. After contacting the MSS team, upload the files to the /submission/[submitter ID]/mass directory.
Create metadata by using the tool
Move to the submission detail page by clicking the submission ID.
Click the [Enter / Update metadata] button to run the DRA metadata creation tool.
When no file has been uploaded to the submission directory, the following message is displayed. To submit metadata, please upload data files.
To submit metadata first, upload some files (for example, an empty text file).
When there are many Experiment and Run objects, create metadata XMLs by using the excel for the DRA metadata and the XML generator. The metadata can be registered by uploading the Submission/Experiment/Run XMLs in D-way. Please see the GitHub page for details.
The metadata are composed of the Submission, BioProject, BioSample, Experiment, Run, Analysis (optional) objects. In the metadata creation tool, enter content from left to right tabs.
Required items are marked with *.
The entered content is checked when submitters click the [Save] button or before moving to the other tab. When error messages are displayed, please revise the content.
Submission
Set the hold date within two years. In the Submitter, include the principal investigator(s) and the submitter(s) who actually submit the data. The DRA does not disclose the submitter information to the public.
All data in a submission are released at the same time. To release data at different times, please divide the data into separate submissions.
Study
Submit a new project by clicking [New submission], or select a project registered in the account.
Only one project can be submitted. To reference a project registered under another account, please contact the DRA team.
To submit a BioProject, enter content from the left to the right tabs. The second panel is for BioProject submission. Submitter information is copied from that of the DRA submission.
For BioProject metadata, please see the BioProject Handbook.
To submit data corresponding to personal identification code to DRA/GEA/DDBJ, your data submission application needs to be approved by NBDC before the submission. If your application was approved, write the NBDC application ID (for example, J-DS000001-001) in “Private comments to DDBJ staff” of BioProject.
To submit genome assemblies to DDBJ, a unique Locus tag prefix is necessary.
Locus tag prefix generation box will appear when [Project data type=”Genome Sequencing” or “Metagenome”] AND [Capture=”Whole”] AND [Objective=”Sequence” or “Annotation” or “Assembly”]. Registration of a unique locus tag prefix is required for studies that result in genome assemblies.
The locus_tag prefix can contain only alphanumeric characters and must be 3 to 12 characters long. It should start with a letter, but numerals can be in the 2nd position or later in the string (e.g., A1C). There should be no symbols, such as -, _ or *, in the prefix. The locus_tag prefix is separated from the tag value by an underscore ‘_’, e.g., A1C_00001.
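The prefix rules above can be checked before submission with a short script. This is a sketch under stated assumptions: the helper names are hypothetical, the "should start with a letter" rule is treated as mandatory, and the five-digit tag width simply follows the A1C_00001 example. Whether a prefix is still available at NCBI cannot be checked locally.

```python
import re

# 3 to 12 ASCII alphanumeric characters, the first of which is a letter.
LOCUS_TAG_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9]{2,11}")


def is_valid_locus_tag_prefix(prefix):
    """Return True if the prefix satisfies the format rules described above."""
    return LOCUS_TAG_PREFIX.fullmatch(prefix) is not None


def make_locus_tag(prefix, number, width=5):
    """Join prefix and tag value with an underscore, e.g. A1C_00001.

    The zero-padded width of 5 is an assumption taken from the example.
    """
    if not is_valid_locus_tag_prefix(prefix):
        raise ValueError(f"invalid locus_tag prefix: {prefix!r}")
    return f"{prefix}_{number:0{width}d}"


if __name__ == "__main__":
    for candidate in ["A1C", "1AC", "AB", "A-C"]:
        print(candidate, is_valid_locus_tag_prefix(candidate))
    print(make_locus_tag("A1C", 1))
```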
Please leave the prefix box empty, when a prefix is not necessary for WGS only submission.
Prefixes are managed by NCBI. When a project is submitted, our system tries to reserve the prefix at NCBI. If the prefix has already been reserved, an error message is displayed. Please enter a different prefix and submit again.
When multiple prefixes are necessary, please contact us.
Check the content in “OVERVIEW” and submit a project by clicking [Submit BioProject].
After a project is submitted, it is selected in Study.
Sample
Submit new samples by clicking [New submission], or select samples submitted in the account.
The upper limit is about 2,000 samples per submission.
To select a range of samples, check one checkbox and then click another while holding down the “Shift” key. Filter samples by entering text in the upper box, and click [Select filtered BioSamples] to select all filtered samples.
To reference samples registered under another account, please contact us.
To submit a BioSample, enter content from the left to the right tabs. The second panel is for BioSample submission. Submitter information is copied from that of the DRA submission.
Biological and technical replicates are represented by separate BioSamples. For the number of samples needed for a sequence submission, please see the “FAQ: How many samples do I need for my DRA submission?”
For BioSample metadata, please see the BioSample Handbook.
Select a sample type in the “SAMPLE TYPE”. For genome samples, minimum sample attributes are defined by MIxS.
For the Sample type, please see the BioSample Handbook.
Download a template text file according to the selected sample type to enter sample attributes.
A main sample submission step is to describe samples by required, optional and user-defined attributes.
BioSample attribute list. User-defined attributes can be added at rightmost column. Please see the “Human Sample” page about human samples.
BioSample submission file examples
The text file is tab-delimited and can be opened and edited in a spreadsheet editor (e.g. Excel®). Attribute names are in the header line. Attributes marked with “*” are required.
From the second line onward, enter one sample per line. For a project that does not yet have a PRJD accession number, enter the PSUB submission ID in bioproject_id. For attributes without measured values, enter “missing” or “not applicable”.
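Before uploading, the file can be screened for required attributes that were left empty. The sketch below assumes only what is described above (a tab-delimited file whose header marks required attributes with “*”); the function name and the sample attribute names in the usage example are illustrative, and the check does not replace the DDBJ validator.

```python
import csv
import io


def find_empty_required(tsv_text):
    """Return (line_number, attribute) pairs for empty required cells.

    Required attributes are those whose header name starts with "*".
    Line numbers count the header as line 1.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    problems = []
    for line_number, row in enumerate(reader, start=2):
        # Pad short rows so that missing trailing cells count as empty.
        row = row + [""] * (len(header) - len(row))
        for name, value in zip(header, row):
            if name.startswith("*") and not value.strip():
                problems.append((line_number, name))
    return problems


if __name__ == "__main__":
    # Illustrative content only; use the template for your sample type.
    example = (
        "*sample_name\t*organism\tstrain\n"
        "strainA\tEscherichia coli\tA\n"
        "strainB\t\t\n"
    )
    for line_number, attribute in find_empty_required(example):
        print(f"line {line_number}: {attribute} is empty; "
              "enter a value, 'missing' or 'not applicable'")
```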
Upload the BioSample submission file by selecting the file and clicking the Continue button. The validator checks the uploaded file and reports error and warning messages. The BioSample cannot be submitted until all errors are resolved.
For the validation rules and messages, please see Validation rules page.
Check content in the last “OVERVIEW” and submit samples. In the “ATTRIBUTES” area, the submitted sample attribute file can be downloaded.
After submitting BioSamples, submitted BioSamples are selected in the “Sample” tab.
Experiment
As many Experiments and Runs as the selected BioSamples are created automatically when the Experiment tab is first displayed. Each Experiment references one BioSample, and each Run references one Experiment.
BioProject | - BioSample (1) | - Experiment (1) | - Run (1) |
 | - BioSample (2) | - Experiment (2) | - Run (2) |
 | - BioSample (3) | - Experiment (3) | - Run (3) |
In this example, 3 Experiments are created and each Experiment references a unique BioSample.
Add an Experiment by clicking the [Add new Experiment(s)] and delete an Experiment by clicking the [Delete]. Experiment referenced by Run cannot be deleted.
Experiments can be submitted in a tab-delimited text file. First save and fix Aliases (e.g., test07-0040_Experiment_0001 - 0003) by clicking the [Save]. Alias is used as a name until accession numbers are issued.
Download content into a tab-delimited text file by clicking the [Download TSV file].
Metadata can be edited in spreadsheet software (e.g. Excel®).
If “Title” values are empty, titles are automatically constructed as “[Sequencing Instrument Model] [paired end] sequencing of [BioSample ID]” (e.g., “Illumina HiSeq 2000 paired end sequencing of SAMD00025741”). Submitters can provide user-defined text in the “Title”.
- Reference samples in “BioSample Used” by “SSUB BioSample Submission ID : Sample name” (example, SSUB003746 : Genome bacteria strain A). Spaces around “:” are ignored.
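When the Experiment TSV is prepared by a script, the title rule and the “BioSample Used” format described above can be expressed as two small helpers. This is only a sketch: both function names are hypothetical, and the layout text (“paired end”) is passed in as-is because only the paired-end wording is shown in this handbook.

```python
def default_experiment_title(instrument_model, layout, biosample_id):
    """Build the title that is constructed when the "Title" value is empty."""
    return f"{instrument_model} {layout} sequencing of {biosample_id}"


def parse_biosample_used(value):
    """Split "SSUB ID : Sample name" into (submission_id, sample_name).

    Spaces around the first ":" are ignored, as described above.
    """
    submission_id, separator, sample_name = value.partition(":")
    if not separator:
        raise ValueError(f"expected 'ID : sample name', got {value!r}")
    return submission_id.strip(), sample_name.strip()


if __name__ == "__main__":
    print(default_experiment_title("Illumina HiSeq 2000", "paired end", "SAMD00025741"))
    print(parse_biosample_used("SSUB003746 : Genome bacteria strain A"))
```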
Save the edited content in a tab-delimited text file, then select and upload it by clicking [Upload TSV file].
Upload a tab-delimited text file, NOT a file in a spreadsheet-specific format.
Run
As many Experiments and Runs as the selected BioSamples are created automatically. Each Run references a unique Experiment.
In this example, three Runs are created and each Run references a unique Experiment.
Add a Run by clicking [Add another Run(s)] and delete a Run by clicking [Delete]. A Run linked to files cannot be deleted.
After fixing aliases by clicking the [Save], run content can be downloaded into a tab-delimited text file. To distinguish the data files for Run, enter “Run” in the leftmost “Run/Analysis” column.
Click the [Select data files for Run] and link uploaded files to Run.
All files uploaded to the submission directory are shown. Associate a file to a Run by selecting a Run alias in “Run/Analysis contains files”.
Enter File type and MD5 Checksum for files. File attributes can be entered by uploading a tab-delimited text file.
Note that all data files listed in a Run will be merged into a single SRA archive file, so files from different samples or replicates should not be grouped in the same Run. Paired-end data files, conversely, MUST be listed in a single run in order for the two files to be correctly processed as paired-end.
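The grouping rule above (both mates of a pair in one Run, different samples in different Runs) can be applied to a file list automatically when filenames follow a regular pattern. The sketch below assumes a hypothetical naming convention, `<name>_1.fastq` and `<name>_2.fastq` with an optional `.gz` or `.bz2` suffix, as in the strainA_1.fastq example; adapt the pattern to your own filenames.

```python
import re
from collections import defaultdict

# Assumed naming convention: <stem>_1.fastq / <stem>_2.fastq, optionally compressed.
PAIR_SUFFIX = re.compile(r"^(?P<stem>.+)_(?P<mate>[12])\.fastq(?:\.gz|\.bz2)?$")


def group_files_by_run(filenames):
    """Group data files so that both mates of a pair fall into the same Run.

    Files that do not match the paired pattern are keyed by the part of
    the filename before the first dot and form a Run of their own.
    """
    runs = defaultdict(list)
    for name in sorted(filenames):
        match = PAIR_SUFFIX.match(name)
        stem = match.group("stem") if match else name.split(".")[0]
        runs[stem].append(name)
    return dict(runs)


if __name__ == "__main__":
    uploaded = [
        "strainA_2.fastq", "strainA_1.fastq",
        "strainB_1.fastq", "strainB_2.fastq",
        "strainC_1.fastq", "strainC_2.fastq",
    ]
    for run, files in group_files_by_run(uploaded).items():
        print(run, files)
```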
For fastq with variable read length, select “generic_fastq” for filetype.
When an Analysis (optional) is unnecessary, submit metadata by clicking the [Submit/Update DRA metadata].
After submitting DRA metadata, start validation of data files. Click the link “Validate uploaded data files to finish this submission”.
Analysis (optional)
Create as many Analyses as required and enter the content of each. An unnecessary Analysis can be deleted by clicking [Delete].
Click the [Select data files for Analysis] and link files to Analysis.
Enter file attributes and associate them with Analysis. When submitting the file attributes by uploading the tab-delimited text file, to distinguish the data files for Analysis, enter “Analysis” in the leftmost “Run/Analysis” column.
Submit DRA metadata by clicking the [Submit/Update DRA metadata] and proceed to data validation process. Only md5 of analysis files are checked during validation.
Create metadata in XML files
The DRA metadata submission tool cannot describe technical reads (adapter, primer and barcode sequences). To submit raw data that contain technical reads, or to use metadata elements that are defined in the DRA XML schema but not available in the submission tool, submitters need to create or edit metadata in XML files.
- Create a new DRA submission.
- Prepare the Submission, Experiment, Run and Analysis (optional) XML files.
- Un-accessioned BioProject and BioSample can be referenced in Experiment XML as follows.
- Validate XML files against xsd by following Unix commands. You cannot upload XML with any errors.
- Upload validated XML files. Select the Submission, Experiment, Run and Analysis (optional) XML files and upload them at once.
Uploaded XML files are validated against the SRA schema, and the relationships between XML objects are checked. If errors are displayed, modify and re-upload the XML files.
Edit metadata in XML files
The DRA metadata submission tool cannot describe technical reads (adapter, primer and barcode sequences). To submit raw data that contain technical reads, or to use metadata elements that are defined in the DRA XML schema but not available in the submission tool, submitters need to create or edit metadata in XML files.
- Create and submit metadata by using the web-based tool.
- Download the Submission, Experiment, Run and Analysis (optional) XML files of the submission with status “metadata_submitted”.
- Edit the downloaded XML files. For how to describe technical reads, please see the example page. For available metadata elements, please see the explanation in DRA XML schema.
- Un-accessioned BioProject and BioSample can be referenced in Experiment XML as follows.
- Validate XML files against xsd by following Unix commands. You cannot upload XML with any errors.
- Upload modified XML files. Select the Submission, Experiment, Run and Analysis (optional) XML files and upload them at once.
Uploaded XML files are validated against the SRA schema, and the relationships between XML objects are checked. If errors are displayed, modify and re-upload the XML files.
Validation of data files
Submitted data files are converted to the SRA files for archiving. During this conversion process, MD5 value, file format and integrity between files and metadata are validated.
In the “Data Files”, filenames in the Run and Analysis, MD5 values in the Run and Analysis and those of uploaded files, are displayed.
Click the [Validate data files] and validate uploaded data files.
The files are validated in the following order.
FAQ: How to deal with validation errors?
MD5 Check
The MD5 values in the metadata are checked against those of the uploaded files. A mismatch causes an error. When MD5 errors occur, revise the metadata and re-upload the files.
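The same comparison can be made locally before starting validation. The following sketch uses Python's standard `hashlib` module; the function names are hypothetical, and treating the comparison as case-insensitive is an assumption made for convenience, not a statement about how the DRA validator compares values.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 value of a file as lowercase hexadecimal.

    The file is read in chunks so that large sequence files do not
    have to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, declared_md5):
    """Compare a file's MD5 value with the value entered in the metadata.

    Case and surrounding whitespace are ignored (an assumption of this sketch).
    """
    return md5_of_file(path) == declared_md5.strip().lower()
```

Calling `md5_matches("strainA_1.fastq", value_from_metadata)` for each file listed in a Run gives an early warning of a damaged transfer or a mistyped checksum.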
Data Check
Submitted data files are converted to the SRA files for archiving. During this conversion process, MD5 value, file format and integrity between files and metadata are validated. When errors occur, revise metadata and re-upload files. Validation of large files takes time.
If no errors occur, the submission status becomes “submission_validated” and the validated files are moved to a separate directory.
The DRA staff review submissions with the status “submission_validated”. Please do not modify the submission until the DRA staff contact you.
Revise a submission with “data_error”
Any error in the validation process sets the submission status to “data_error”. Stop the validation by clicking the [Stop validation] button, then revise the metadata and/or re-upload the data files. After revision, click the [Validate data files] button to start validation again.
FAQ: How to deal with validation errors?
The submission status reverts to “metadata_submitted”. Revise and re-submit metadata or re-upload data files.
Accession numbers
When both the metadata and sequence data are validated (Status “submission_validated”), accession numbers with the prefix DR (Submission (DRA), Experiment (DRX), Run (DRR), Analysis (DRZ)) are assigned (“acc_issued”, “complete” or “private”). Accession numbers are displayed in the “Component”.
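The prefix scheme above makes it easy to tell from an accession number which kind of object it identifies. The sketch below is illustrative only: the function name is hypothetical, and the pattern accepts any number of digits after the prefix because this handbook does not state a fixed length.

```python
import re

# Prefix-to-object mapping as listed above.
ACCESSION_PREFIXES = {
    "DRA": "Submission",
    "DRX": "Experiment",
    "DRR": "Run",
    "DRZ": "Analysis",
}

# A DR prefix followed by digits; the digit count is not enforced here.
ACCESSION = re.compile(r"(DR[AXRZ])(\d+)")


def classify_accession(accession):
    """Return the object type for a DRA accession number, or None if it is not one."""
    match = ACCESSION.fullmatch(accession.strip())
    if match is None:
        return None
    return ACCESSION_PREFIXES[match.group(1)]


if __name__ == "__main__":
    for accession in ["DRA000001", "DRR000001", "DRZ011299", "SRR000001"]:
        print(accession, classify_accession(accession))
```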
Limited-time access to archived fastq/SRA files
To allow submitters to download and check the archived fastq/SRA files, the files are copied to the following directories on the ftp-private.ddbj.nig.ac.jp server. To save disk space, the copied files are automatically deleted after one month.
If available disk space decreases unexpectedly, copied fastq/SRA files may be deleted before one month has passed, or the copy service may be suspended. We will announce this on the website as far in advance as possible; however, the announcement may come immediately before the deletion or service suspension.
- (submitter’s home)/report/dra/(DRA submission accession)/fastq/
- (submitter’s home)/report/dra/(DRA submission accession)/sra/
Example
- submitter/report/dra/DRA000001/fastq/DRR000001.fastq.bz2
- submitter/report/dra/DRA000001/fastq/DRR000002.fastq.bz2
- submitter/report/dra/DRA000001/fastq/DRR000002_1.fastq.bz2
- submitter/report/dra/DRA000001/fastq/DRR000002_2.fastq.bz2
- submitter/report/dra/DRA000001/sra/DRR000001.sra
- submitter/report/dra/DRA000001/sra/DRR000002.sra
Data release
After the registered data is loaded into the database, the Status becomes “complete (private)” and the submission is kept private until one of the following conditions is met.
All data in a submission are released at the same time. To release data at different times, please divide the data into separate submissions.
- A. The submitter requests release of the data.
- B. The submitter has published the accession number(s) and this has been confirmed.
We do not release the data when the accession number(s) have been published in error by someone other than the submitter.
“Publish” means to disclose accession number(s) to the public through a paper, thesis, academic meeting, the internet, a press report, etc.
- C. The specified hold date has arrived.
- D. DDBJ/EMBL-Bank/GenBank records (e.g., TSA, WGS, CON, etc.) citing DRA Run (DRR) accession number(s) have been made public.
Data are released without permission from submitters in cases B, C and D. In case D, the entire DRA submission containing the cited DRR Run(s) is made public.
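The release conditions can be summarized as a simple decision function. This is a reading aid, not a description of the DRA release system: the function and parameter names are hypothetical, and reporting only the first matching condition in the order A to D is a simplification made for this sketch.

```python
from datetime import date


def release_condition(requested_by_submitter, publication_confirmed,
                      hold_date, cited_by_public_record, today):
    """Return the first matching release condition ("A" to "D"), or None.

    hold_date may be None when no hold date applies (an assumption of this sketch).
    """
    if requested_by_submitter:
        return "A"
    if publication_confirmed:
        return "B"
    if hold_date is not None and today >= hold_date:
        return "C"
    if cited_by_public_record:
        return "D"
    return None


def released_without_permission(condition):
    """Cases B, C and D lead to release without the submitter's permission."""
    return condition in {"B", "C", "D"}


if __name__ == "__main__":
    condition = release_condition(
        requested_by_submitter=False,
        publication_confirmed=False,
        hold_date=date(2020, 4, 1),
        cited_by_public_record=False,
        today=date(2020, 4, 1),
    )
    print(condition, released_without_permission(condition))
```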
FAQ: How are linked BioProject/BioSample/sequence data released?
When the data are released, within a few days the released data become searchable at DRASearch and are mirrored to the NCBI SRA.
The list of available fastq files at the DRA file server: fastqlist
Update submission
Update in each database
Database | Update |
---|---|
Annotated sequence database | Request updates from web form |
Sequence Read Archive (DRA) | Log in to D-way and update metadata. To add or delete sequencing data, request updates from web form |
BioProject/BioSample | Request updates from web form |
Change hold date
You can set the hold date for a maximum of 4 years and change it later. To change the hold date, click the [Change] button in the Hold Date and move to the hold date change page.
To release the submission immediately, click [Release Now]. The submission is released overnight; the data files will be made available by ftp and the metadata will be indexed by the DRA search system within a few days.
Update metadata
Update metadata by clicking the [Enter / Update metadata] button. Some fields are blocked from editing. After editing your metadata, be sure to click the [Submit/Update DRA metadata] button so that the updates are reflected on the DRA server.
Add data files
Data files cannot be directly added to the archived Run. In another DRA submission, create new Experiment-Run objects referencing existing BioProject and BioSample records to add data files.
As with Run, data files cannot be added directly to an archived Analysis. To replace an archived Analysis, please contact the DRA team.
Log in to D-way and create a new submission by clicking [New submission]. Select the BioProject and BioSample IDs to which data are to be added. Next, add the DRA Experiment and Run objects.
- To add a new sample, share a BioProject ID and create a BioSample - Experiment - Run in a new DRA submission.
- To add data files to existing sample, share BioProject and BioSample IDs and create an Experiment - Run in a new DRA submission.
Submit metadata and validate the appended data files. Accession numbers will be issued to the appended Experiment/Run objects.
The BioProject ID remains the same, but a different DRA submission number is assigned.
To add data files to the existing DRA submission, please contact us.
Withdraw archived objects
To withdraw archived Experiment, Run and Analysis objects, please contact us.
Supplement: MD5
MD5 (Message Digest Algorithm 5) is a hash function that calculates a hash value (the MD5 number, a string of 32 hexadecimal characters) for a given file. Because a damaged file yields an MD5 number different from that of the original, comparing the numbers before and after the file transfer shows whether the transferred file is intact.
Obtain MD5 number (Linux)
Obtain the MD5 numbers of the files by executing,
$ md5sum file1 file2
9f6e6800cfae7749eb6c486619254b9c  file1
b636e0063e29709b6082f324c76d0911  file2
Obtain MD5 number (Mac OS X)
Obtain the MD5 numbers of the files by executing,
$ md5 file1 file2
MD5 (file1) = 9f6e6800cfae7749eb6c486619254b9c
MD5 (file2) = b636e0063e29709b6082f324c76d0911
Obtain MD5 number (Windows)
Install and run the Fsum Frontend (sourceforge.net/projects/fsumfe/).
First, check “md5”.
After clicking the [+] button, open the sequence data files that you need. You can select multiple files at the same time.
Click the [Calculate hashes] button. The MD5 numbers of the files are displayed.
By clicking the [Export] button, you can obtain the list of the MD5 numbers as an HTML, CSV, or XML file.