DDBJ will continue Sequence Raw Data Archiving

2011/2/22

    DDBJ has been archiving raw data from Sanger sequencers and so-called next-generation sequencers as part of the International Nucleotide Sequence Database Collaboration (INSDC) of EBI, NCBI and DDBJ, receiving data submissions from sequencing centers mainly in Japan and also in several other countries.

    Data submitted to DDBJ are processed into an INSDC-approved format and exchanged frequently with NCBI and EBI, so that the same data set is available from every bank at the time of publication and data access is equally convenient from anywhere in the world.

    NCBI, which has served as the hub for raw data archiving, recently announced that it will discontinue its Sequence Read Archive and Trace Archive repositories. DDBJ's archiving will therefore be affected in some way in the near future.

    However, at this moment DDBJ does not plan to discontinue either service, in order to meet the demand of the domestic community as well as the global one.

    DDBJ has just started to formulate a plan, in collaboration with the other INSDC members, to minimize the effect on the present activity of INSDC and on the entire community.

Present status of the DDBJ raw sequence data archive

  • DDBJ raw data archiving accepts submissions of raw data from autosequencers so that the data can be shared by the whole community.
  • It is part of the International Nucleotide Sequence Database Collaboration (INSDC), a collaboration among NCBI, EBI and DDBJ.
  • Two types of archive are provided to accommodate the different principles of autosequencers, namely the DDBJ Sequence Read Archive and the DDBJ Trace Archive.

DDBJ Sequence Read Archive (DRA)[2]

  • Raw data from so-called next-generation sequencers are the subject of archiving.
    • A "read file" is data representing the chronological color change of a spot [3], which corresponds to the extension reaction of one independent molecular species over tens to hundreds of bases.
    • An autosequencer can generate millions or more read files in a single run.
    • The raw data from a single run usually amount to between 100 megabytes and 20-30 gigabytes, the latter being about 1/3 of the disk space of an iPad.

Upon submission to DDBJ

  • Metadata describing the submitters, materials and methods for the reaction, as well as the read data, will become accessible from any of the INSDC collaborators upon publication.
    • DDBJ will first issue a unique accession number for the submission.
    • DDBJ will prepare the metadata in INSDC format and send it to NCBI, from which it is mirrored to all INSDC members.
    • DDBJ will create a submitter's account, in which the submitter places the necessary data.
    • DDBJ will convert vendor-specific raw data files into an INSDC-approved format (SRA format[4]) via NCBI's service.
    • This step requires mass data transfer to NCBI using NCBI's Aspera server.
    • Usually a single submission is on the order of 10 runs, and its transfer takes from a few seconds to a few hours.
    • Open access data from DDBJ, like those from EBI, are accumulated at NCBI; by frequently copying the entire open access archive from NCBI, the same open access dataset is made available from any of the three INSDC sites.
    • The size of the data stored in only one bank depends on the amount of submitted but unpublished data, as well as on the personal genome data shared under controlled access by researcher groups.
    • This fraction is usually much larger than the open access data, and NCBI holds by far a larger amount than DDBJ.
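
The transfer times quoted above can be checked with a rough calculation. The sketch below takes the run sizes (100 megabytes to 30 gigabytes) and the run count (about 10 per submission) from this announcement. The two link speeds are purely illustrative assumptions, not measured throughputs of the DDBJ to NCBI connection, and decimal units (1 GB = 10^9 bytes) are also assumed.

```python
# Back-of-envelope transfer time for one submission.
# Run sizes and run count come from the announcement text;
# the link speeds below are hypothetical, chosen only for illustration.

MB = 10**6  # decimal units (assumption)
GB = 10**9

RUNS_PER_SUBMISSION = 10      # "on the order of 10 runs"
SMALL_RUN_BYTES = 100 * MB    # lower end of a single run
LARGE_RUN_BYTES = 30 * GB     # upper end of a single run

FAST_LINK_BPS = 1e9           # hypothetical 1 Gbit/s sustained
SLOW_LINK_BPS = 100e6         # hypothetical 100 Mbit/s sustained


def transfer_seconds(total_bytes: int, link_bits_per_second: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and congestion."""
    return total_bytes * 8 / link_bits_per_second


# Smallest submission over the fast link, largest over the slow link.
best_case = transfer_seconds(RUNS_PER_SUBMISSION * SMALL_RUN_BYTES, FAST_LINK_BPS)
worst_case = transfer_seconds(RUNS_PER_SUBMISSION * LARGE_RUN_BYTES, SLOW_LINK_BPS)

print(f"best case:  {best_case:.0f} s")
print(f"worst case: {worst_case / 3600:.1f} h")
```

Under these assumed speeds the estimate spans a few seconds to several hours, which is in line with the range stated in the list.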

The size of the DRA

  • As of last week, users can search, browse and download 95,388 runs of open access data from DDBJ.
    • This amounts to 71 terabytes in the SRA-lite format, which lacks only the intensity data found in the SRA format.
    • 71 terabytes is the disk space of roughly one thousand iPads.
    • Submitted but embargoed data in DDBJ amount to 721 runs, which occupy a few terabytes.
    • DDBJ allocates 273 terabytes for this service, which is planned to be increased in two steps up to 20 petabytes in the coming 24 months.
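
The figures in this list can be put in proportion with a short calculation. The sketch below uses only numbers quoted above, plus two assumptions that are not stated in the announcement: decimal units (1 TB = 10^12 bytes) and an iPad capacity of 64 GB.

```python
# Proportions derived from the figures quoted in the list above.
# Assumptions: decimal units, and a 64 GB iPad (capacity not stated in the text).

GB = 10**9
TB = 10**12
PB = 10**15

open_access_runs = 95_388        # runs available for download
open_access_bytes = 71 * TB      # size in SRA-lite format
allocated_bytes = 273 * TB       # disk space currently allocated
planned_bytes = 20 * PB          # planned capacity within 24 months

IPAD_BYTES = 64 * GB             # assumed iPad capacity

# Average size of one open access run, in gigabytes.
avg_run_gb = open_access_bytes / open_access_runs / GB

# Number of assumed-size iPads needed to hold the open access data.
ipads = open_access_bytes / IPAD_BYTES

# Share of the current allocation taken by open access data.
used_fraction = open_access_bytes / allocated_bytes

# Factor by which the allocation is planned to grow.
growth = planned_bytes / allocated_bytes

print(f"average run size:     {avg_run_gb:.2f} GB")
print(f"iPad equivalents:     {ipads:.0f}")
print(f"share of allocation:  {used_fraction:.0%}")
print(f"planned growth:       {growth:.0f}x")
```

With these inputs the average open access run is under one gigabyte, the open access data fill about a quarter of the current allocation, and the planned expansion is roughly a seventy-fold increase.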

Trace Archive at DDBJ (DTA)[5]

  • The Trace Archive service accepts submissions of raw data generated by autosequencers using the Sanger reaction, to be shared by the entire community.
    • A trace file records the chronological change of color intensities along a single capillary gel or a lane in a gel plate [6].
    • A typical autosequencer generates dozens to a few hundred such files in a single run.
    • A trace is a few hundred bases long and amounts to a few hundred kilobytes.

Upon submission to DDBJ

  • Metadata describing the experimental background become searchable, and trace data become browsable and downloadable, from all INSDC collaborators.
    • DDBJ will help generate the metadata file in an INSDC-approved format.
    • DDBJ will create a submitter's account, in which the submitter places the necessary data.
    • DDBJ will transfer the data to NCBI, and NCBI will issue the unique identifier for the service.
    • DDBJ will store and serve only the trace data submitted to DDBJ.