- Basic rules
- IDF: Investigation Description Format
- SDRF: Sample and Data Relationship Format
- Instruction for sequencing data
DOR uses the MAGE-TAB version 1.1 format. For full specification, please refer to MAGE-TAB Specification Version 1.1.
The MAGE-TAB format uses a number of different files to capture information about a functional genomics experiment:
The IDF, SDRF, ADF and data matrix files should be in plain, tab-delimited text format.
The IDF file is used to give an overview of the experiment, including the experimental variables (factors) used, protocols, quality control strategy, publication information and contact details. Also included in the IDF file is a list of sources from which controlled vocabulary terms may have been used elsewhere in the MAGE-TAB document. These term sources may be fully-fledged ontologies (e.g. the MGED ontology), databases providing queryable accession numbers (e.g. ArrayExpress/DOR), or simply a file defining terms for local users.
The SDRF file describes the relationship between every step in the chain of biological materials used in the experiment through to the hybridization, and the acquisition and normalization of data. Experimental factors, protocols, protocol parameters and term sources defined in the IDF are referenced by the SDRF.
The ADF file provides the array-level annotation for the experiment, relating the row-level identifiers in the data files to biological sequence annotation. Array designs are usually deposited in DOR as submissions separate from the experimental data, and in the case of commercial arrays may not need to be submitted to DOR at all. For details, see the ADF page.
The IDF file should contain a pointer to the SDRF via the SDRF File tag. Data files and ADF files are referenced from the SDRF table directly. All data files should be in a single directory or archive with no sub-directory structure.
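The file layout rules above can be checked before submission. The following is a minimal sketch, assuming the IDF has already been read into a dict that maps row names to lists of values and the SDRF into a header list plus row lists. The function name, the file names and the choice of columns to check are illustrative and not part of the MAGE-TAB specification.

```python
import tempfile
from pathlib import Path

# SDRF columns whose cells name data files (assumption: ADF files are not checked here).
FILE_COLUMNS = {
    "Array Data File",
    "Derived Array Data File",
    "Array Data Matrix File",
    "Derived Array Data Matrix File",
}


def check_submission_files(directory, idf_fields, sdrf_header, sdrf_rows):
    """Return a list of problems with the files a submission refers to."""
    directory = Path(directory)
    problems = []
    referenced = set(idf_fields.get("SDRF File", []))
    if not referenced:
        problems.append("IDF has no SDRF File value")
    for row in sdrf_rows:
        for column, value in zip(sdrf_header, row):
            if column.strip() in FILE_COLUMNS and value.strip():
                referenced.add(value.strip())
    for name in sorted(referenced):
        if "/" in name or "\\" in name:
            problems.append(f"{name}: sub-directories are not allowed")
        elif not (directory / name).is_file():
            problems.append(f"{name}: not found in {directory}")
    return problems


# Hypothetical submission: run2.fastq is referenced but was never copied in.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("experiment.sdrf.txt", "run1.fastq"):
        (Path(tmp) / name).write_text("placeholder\n")
    print(check_submission_files(
        tmp,
        {"SDRF File": ["experiment.sdrf.txt"]},
        ["Assay Name", "Array Data File"],
        [["run 1", "run1.fastq"], ["run 2", "run2.fastq"]],
    ))
```

The check only looks at file names, so it can run before any data file is opened.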
An experimental data submission will usually consist of an IDF file, an SDRF file, and a series of data files. Typically there will be one raw data file per hybridization in an array-based experiment. In a sequencing-based experiment, typically there will be one raw data file per sample. Each hybridization may also have a normalized data file, or the final transformed data may be combined into a data matrix file.
Blank lines containing zero or more spaces or tabs are permitted in any of these files. Lines starting with the # symbol are interpreted as comments and are ignored.
In MAGE-TAB, columns whose names end with "Name" (e.g., Sample Name) contain object identifiers (names). Object identifiers defined in the IDF (e.g., Protocol Name) should be referenced only in columns whose names end with "REF" (e.g., Protocol REF).
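The Name/REF rule lends itself to a simple consistency check. The sketch below assumes the IDF Protocol Name values and the SDRF table are already available as Python lists; the function name and the protocol identifiers are hypothetical.

```python
def find_undefined_refs(defined_names, sdrf_header, sdrf_rows, ref_column="Protocol REF"):
    """Return REF values used in the SDRF that the IDF never defines."""
    defined = set(defined_names)
    # A REF column may appear several times in one SDRF header.
    ref_indexes = [i for i, col in enumerate(sdrf_header) if col.strip() == ref_column]
    undefined = set()
    for row in sdrf_rows:
        for i in ref_indexes:
            value = row[i].strip() if i < len(row) else ""
            if value and value not in defined:
                undefined.add(value)
    return sorted(undefined)


idf_protocol_names = ["P-EXTRACTION", "P-SEQUENCING"]  # hypothetical Protocol Name values
header = ["Source Name", "Protocol REF", "Extract Name", "Protocol REF", "Assay Name"]
rows = [
    ["liver 1", "P-EXTRACTION", "extract 1", "P-SEQUENCING", "run 1"],
    ["liver 2", "P-EXTRACTION", "extract 2", "P-SEQENCING", "run 2"],  # typo on purpose
]
print(find_undefined_refs(idf_protocol_names, header, rows))
```

Only the misspelled reference in the second row is reported, since every other REF value matches a name defined in the IDF.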
All of the MAGE-TAB components (IDF, ADF, SDRF and data matrices) allow for referencing ontology terms or database accessions from external sources. In each case the source of the term(s) is indicated by a separate Term Source REF field in the IDF.
IDF, SDRF and ADF documents contain data divided into columns and rows. Columns are separated by tab characters, while lines are separated by newlines and/or carriage returns. Fields within columns may be escaped by surrounding them with double quotes, indicating that any tab or newline characters contained therein are not to be interpreted as a field delimiter. Quote characters within fields must be escaped with a backslash. Note that column headers are also permitted to be enclosed in double quotes, but no characters other than spaces are permitted between the multiple keywords that comprise a column header.
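As an illustration of these rules, the following sketch reads MAGE-TAB text with Python's standard csv module, configured for tab delimiters, double-quoted fields and backslash-escaped quotes. It is a simplified reading of the rules above, not a reference parser: comment and blank lines are dropped after parsing, so a comment line that contains an unbalanced double quote, or a backslash outside a quoted field, is not handled as the specification may intend. The function name is hypothetical.

```python
import csv
import io


def read_magetab_rows(text):
    """Parse MAGE-TAB text into a list of rows (each row is a list of fields)."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter="\t",
        quotechar='"',
        escapechar="\\",  # quotes inside a field are escaped with a backslash
        doublequote=False,
    )
    rows = []
    for row in reader:
        if all(not field.strip() for field in row):
            continue  # blank line: nothing but spaces and/or tabs
        if row[0].startswith("#"):
            continue  # comment line
        rows.append(row)
    return rows


sample = (
    'Experiment Description\t"Time course, see \\"Methods\\""\n'
    "\n"
    "# a comment line\n"
    "Person Last Name\tSmith\tJones\n"
    " \t \n"
)
for row in read_magetab_rows(sample):
    print(row)
```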
For more detail on the MAGE-TAB document format, please see the MAGE-TAB Specification Version 1.1.
The IDF component of a MAGE-TAB document provides top-level information concerning an investigation in a single tab-delimited format. The IDF consists of a set of unique fields attached to their corresponding values in a simple tab-delimited text format. For example, "Experiment Description" should be followed by a free-text description of the experiment. Most of the fields in the IDF document can handle multiple values.
|Experimental Design||Ontology term||Ontology term||...|
|Experimental Design Term Source REF||Term Source Name||Term Source Name||...|
|Experimental Design Term Accession Number||Term Accession Number||Term Accession Number||...|
|Experimental Factor Name||Text||Text||...|
|Experimental Factor Type||Ontology term||Ontology term||...|
|Experimental Factor Term Source REF||Term Source Name||Term Source Name||...|
|Experimental Factor Term Accession Number||Term Accession Number||Term Accession Number||...|
|Person Last Name||Text||Text||...|
|Person First Name||Text||Text||...|
|Person Mid Initials||Text||Text||...|
|Person Roles||Ontology term (semicolon-delimited list)||Ontology term (semicolon-delimited list)||...|
|Person Roles Term Source REF||Term Source Name||Term Source Name||...|
|Person Roles Term Accession Number||Term Accession Number||Term Accession Number||...|
|Quality Control Type||Ontology term||Ontology term||...|
|Quality Control Term Source REF||Term Source Name||Term Source Name||...|
|Quality Control Term Accession Number||Term Accession Number||Term Accession Number||...|
|Replicate Type||Ontology term||Ontology term||...|
|Replicate Term Source REF||Term Source Name||Term Source Name||...|
|Replicate Term Accession Number||Term Accession Number||Term Accession Number||...|
|Normalization Type||Ontology term||Ontology term||...|
|Normalization Term Source REF||Term Source Name||Term Source Name||...|
|Normalization Term Accession Number||Term Accession Number||Term Accession Number||...|
|Date of Experiment||Date (YYYY-MM-DD)|
|Public Release Date||Date (YYYY-MM-DD)|
|PubMed ID||PubMed ID||PubMed ID||...|
|Publication Author List||Text||Text||...|
|Publication Status||Ontology term||Ontology term||...|
|Publication Status Term Source REF||Term Source Name||Term Source Name||...|
|Publication Status Term Accession Number||Term Accession Number||Term Accession Number||...|
|Protocol Type||Ontology term||Ontology term||...|
|Protocol Term Source REF||Term Source Name||Term Source Name||...|
|Protocol Term Accession Number||Term Accession Number||Term Accession Number||...|
|Protocol Parameters||Text (semicolon-delimited list)||Text (semicolon-delimited list)||...|
|Term Source Name||Text tag as used in SDRF||Text tag as used in SDRF||...|
|Term Source File||URI||URI||...|
|Term Source Version||Text||Text||...|
The second column indicates the type of entry expected for each row. Rows listed with a single value (e.g., Date of Experiment, Public Release Date) do not allow multiple values. Rows listed with repeated values followed by "..." may consist of multiple values in columns listed horizontally, one for each element described. For example, one should use as many Person Last Name columns as there are contacts for the investigation. In cases where multiple terms need to be entered into a single column, they should be separated by semicolons (e.g., Protocol Parameters, Person Roles). All such semicolon-separated roles must be from one ontology.
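The row-oriented layout described above maps naturally onto a dictionary. The sketch below assumes the IDF has already been split into rows of fields; the function names and the role terms are hypothetical.

```python
def idf_to_dict(rows):
    """Map each IDF row name to its list of values (one per column)."""
    fields = {}
    for row in rows:
        name = row[0].strip()
        values = [value.strip() for value in row[1:]]
        while values and values[-1] == "":
            values.pop()  # ignore empty trailing cells, keep inner ones for alignment
        fields[name] = values
    return fields


def split_list(value):
    """Split a semicolon-delimited cell such as Person Roles."""
    return [item.strip() for item in value.split(";") if item.strip()]


rows = [
    ["Person Last Name", "Smith", "Jones"],
    ["Person Roles", "submitter;investigator", "investigator"],
    ["Public Release Date", "2012-04-01", ""],
]
idf = idf_to_dict(rows)
roles_per_person = [split_list(cell) for cell in idf["Person Roles"]]
print(dict(zip(idf["Person Last Name"], roles_per_person)))
```

Values in the same column position belong to the same element, which is why the first Person Roles cell is paired with the first Person Last Name.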
Note that fields which contain ontology individual terms should indicate the origin of those terms using the relevant Term Source REF field. Dates should be supplied in the ISO format YYYY-MM-DD. See an example of IDF.
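A date cell can be checked against the YYYY-MM-DD format with a few lines of code. This is a minimal sketch; the function name is hypothetical.

```python
import re
from datetime import datetime


def is_iso_date(value):
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    # The pattern enforces zero-padded month and day, which strptime alone does not.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False  # e.g. 2012-02-30
    return True


for candidate in ("2012-04-01", "2012-4-1", "01/04/2012", "2012-02-30"):
    print(candidate, is_iso_date(candidate))
```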
Comment fields can be added freely. The name of the comment is given in square brackets in the row name, and its value is entered in the body of the IDF. Types are not currently supported. An example use case in the IDF is Comment[Goal], which describes the goal of the study. DOR will include such fields for its own local implementation.
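The bracketed comment name can be pulled out of a row name with a regular expression. The sketch below is illustrative only; the pattern tolerates optional spaces around the brackets, which is an assumption, and the function name and sample rows are hypothetical.

```python
import re

COMMENT_ROW = re.compile(r"^Comment\s*\[\s*(.+?)\s*\]$")


def comment_name(row_name):
    """Return the name inside Comment[...], or None for any other row name."""
    match = COMMENT_ROW.match(row_name.strip())
    return match.group(1) if match else None


rows = [
    ["Comment[Goal]", "Compare liver expression between two strains"],
    ["Person Last Name", "Smith"],
]
comments = {comment_name(row[0]): row[1:] for row in rows if comment_name(row[0])}
print(comments)
```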
To specify bibliographic references accompanying the experiment, it is sufficient to enter just the PubMed ID for each citation into the IDF. Where a given article is not yet published, the available information should be given using the IDF fields shown.
The most important concept behind the SDRF is the investigation design graph, where nodes correspond to biomaterials (e.g., samples, RNA extracts, labeled cDNA, etc.) or data objects (e.g., raw or normalized data files), and edges show the relationships between these objects. Attributes can be attached to nodes and to edges. The attributes are the descriptions of the biomaterial or data properties, e.g., sample descriptions attached to sample nodes, protocols attached to edges, raw data files attached to hybridizations. The attributes can be pointers to some longer descriptions or external objects, e.g., protocols described in the IDF file.
The SDRF file consists of a table in which each hybridization channel is represented by a row, and columns represent the steps of the experiment. The ordering of these columns is important, and should read left-to-right in chronological order. The overall organization of this table is shown below.
Each block in the diagram above starts with a "Name" or "File" column (e.g. "Extract Name", "Array Data File"), followed by a set of attribute columns. Each block is separated from its predecessor by "Protocol REF" columns containing references to the "Protocol Name" values defined in the IDF.
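To make the link between the SDRF table and the investigation design graph concrete, the sketch below derives graph edges from consecutive "Name"/"File" columns in each row. It is a simplification under stated assumptions: Array Design File is treated as an attribute rather than a node, protocols on the edges are ignored, and the function names, column selection and sample values are illustrative.

```python
def is_node_column(column):
    """Node columns end in 'Name' or 'File'; Protocol REF and attributes do not."""
    column = column.strip()
    if column == "Array Design File":
        return False  # attribute of a hybridization/assay, not a node
    return column.endswith(" Name") or column.endswith(" File")


def design_graph_edges(header, rows):
    """Return the set of (from_node, to_node) edges; a node is a (column, value) pair."""
    node_indexes = [i for i, column in enumerate(header) if is_node_column(column)]
    edges = set()
    for row in rows:
        nodes = [
            (header[i].strip(), row[i].strip())
            for i in node_indexes
            if i < len(row) and row[i].strip()
        ]
        edges.update(zip(nodes, nodes[1:]))  # link each node to the next one in the row
    return edges


header = ["Source Name", "Characteristics[organism]", "Protocol REF", "Extract Name",
          "Protocol REF", "Assay Name", "Array Data File"]
rows = [
    ["liver", "Mus musculus", "P-EXT", "extract 1", "P-SEQ", "run 1", "run1.fastq"],
    ["liver", "Mus musculus", "P-EXT", "extract 2", "P-SEQ", "run 2", "run2.fastq"],
]
for edge in sorted(design_graph_edges(header, rows)):
    print(edge)
```

Both rows start from the same source, so the graph contains a single "liver" node with two outgoing edges, one per extract.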
A further set of columns is used to specify the values for the variables ("experimental factors") within the experiment. These Factor Value columns reference the Experimental Factor Names defined in the IDF, and should be placed after the hybridization section (i.e., to the right of it, in or after the scanning, normalization and data section in the image above). The contents of these columns will usually duplicate those in a material Characteristics or a protocol Parameter Value column.
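The link between Factor Value columns and the IDF can be verified mechanically. The sketch below assumes the column header takes the form Factor Value[name], where name is an Experimental Factor Name from the IDF; the function name and the factor names are hypothetical.

```python
import re

FACTOR_VALUE = re.compile(r"^Factor\s*Value\s*\[\s*(.+?)\s*\]")


def check_factor_columns(idf_factor_names, sdrf_header):
    """Compare factor names declared in the IDF with those used in the SDRF header."""
    declared = set(idf_factor_names)
    used = set()
    for column in sdrf_header:
        match = FACTOR_VALUE.match(column.strip())
        if match:
            used.add(match.group(1))
    return {
        "undeclared": sorted(used - declared),  # in the SDRF, missing from the IDF
        "unused": sorted(declared - used),      # in the IDF, no SDRF column
    }


idf_factor_names = ["genotype", "time"]
sdrf_header = ["Source Name", "Assay Name", "Factor Value[genotype]", "Factor Value[dose]"]
print(check_factor_columns(idf_factor_names, sdrf_header))
```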
Samples represent steps in the chain of treatments applied to the original Source.
Extracts refer to the extracted nucleic acid used in the experiment. If you need to represent separate nucleic acid extraction and chromatin immunoprecipitation steps in your SDRF, we recommend that you use two Extract steps.
- Labeled Extract
The Labeled Extracts in an experiment are those materials which have been conjugated to a label of some kind, prior to hybridization on an array. A Label column must be included with the Labeled Extract Name column to indicate which label was used.
The assay/hybridization is a key section in the SDRF, since it connects the "materials" area of the SDRF to the "data" area. Describe assays using arrays (Hybridization) or assays not using arrays (Assay). Note that the values in Assay Name/Hybridization Name columns may be used in Data Matrix files to link columns of data to individual assays/hybridizations.
If desired, the act of scanning the hybridized array may be represented as a distinct node in the experimental graph, and encoded in the SDRF using Scan Name columns. These columns are optional, but can be useful in cases where e.g. multiple scans have been made of a single hybridized array, but where the data files do not explicitly reflect this. Note that the values in Scan Name columns may be used in Data Matrix files to link columns of data to individual scanning events.
- Array Data File
Similarly to the use of Scan Name columns above, it is possible to represent the act of normalizing your data independently from the listing of data files themselves. This is done using the optional Normalization Name column.
- Derived Array Data File
The processed data files which have been derived from the raw data should be listed in a Derived Array Data File column. Note that this generally only applies to processed data arranged into one file per assay/hybridization (or scan, or normalization). If your files contain processed data columns for more than one assay/hybridization, you should reformat these into the MAGE-TAB Data Matrix format and include them instead in a Derived Array Data Matrix File column.
The "Name" and "File" node columns are linked by "Protocol REF" columns which represent the graph edges (Protocol REF is the only type of edge possible). Furthermore, each node and edge column may be associated with one or more attribute columns containing annotation, e.g., "Source Name" may be associated with "Provider"; "Parameter Value" with "Unit". In each case the attribute column follows immediately after the respective node or edge column. Similarly, where ontology terms are used, a "Term Source REF" column should follow immediately to the right of the column containing the actual ontology terms (see example).
The list in the table below summarizes which label tags can follow each node identifier in the table, and which modifier tags may be used:
|Source Name||Characteristics, Provider, Material Type, Description, Comment|
|Sample Name||Characteristics, Material Type, Description, Comment|
|Extract Name||Characteristics, Material Type, Description, Comment|
|Labeled Extract Name||Characteristics, Material Type, Description, Label, Comment|
|Hybridization Name||Array Data File, Derived Array Data File, Array Data Matrix File, Derived Array Data Matrix File, Array Design File / REF, Technology Type, Comment|
|Assay Name||Technology Type, Array Data File, Derived Array Data File, Array Data Matrix File, Derived Array Data Matrix File, Array Design File / REF, Comment|
|Scan Name||Array Data File, Derived Array Data File, Array Data Matrix File, Derived Array Data Matrix File, Comment|
|Normalization Name||Derived Array Data File, Derived Array Data Matrix File, Comment|
|Array Data File||Comment|
|Derived Array Data File||Comment|
|Array Data Matrix File||Comment|
|Derived Array Data Matrix File||Comment|
|Array Design File / REF||Term Source REF, Comment|
|Protocol REF||Term Source REF, Parameter, Performer, Date, Comment|
|Characteristics||Unit, Term Source REF|
|Material Type||Term Source REF|
|Technology Type||Term Source REF|
|Label||Term Source REF|
|Factor Value()||Unit, Term Source REF|
|Unit, Comment, Term Source REF||Unit|
|Term Source REF||Description|
|Term Source REF||Term Accession Number|
|Term Accession Number||Comment|
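Part of the table above can be turned into a header check. The sketch below covers only the four material nodes (Source, Sample, Extract, Labeled Extract) and skips attributes of attributes such as Unit or Term Source REF. Columns that follow any other node or a Protocol REF are not checked. The function names are hypothetical, and the bracketed header form Characteristics[organism] is an assumption.

```python
import re

ALLOWED = {
    "Source Name": {"Characteristics", "Provider", "Material Type", "Description", "Comment"},
    "Sample Name": {"Characteristics", "Material Type", "Description", "Comment"},
    "Extract Name": {"Characteristics", "Material Type", "Description", "Comment"},
    "Labeled Extract Name": {"Characteristics", "Material Type", "Description", "Label",
                             "Comment"},
}
KNOWN_ATTRIBUTES = set().union(*ALLOWED.values())
NESTED = {"Unit", "Term Source REF", "Term Accession Number"}  # not checked in this sketch


def base_tag(column):
    """'Characteristics[organism]' -> 'Characteristics'."""
    return re.sub(r"\s*[\[(].*$", "", column.strip())


def misplaced_attributes(header):
    """Return (attribute column, node) pairs where the attribute is not allowed."""
    problems = []
    current = None  # the covered node whose attribute block we are in, if any
    for column in header:
        tag = base_tag(column)
        if tag in ALLOWED:
            current = tag
        elif tag in NESTED:
            continue
        elif tag in KNOWN_ATTRIBUTES:
            if current is not None and tag not in ALLOWED[current]:
                problems.append((column, current))
        else:
            current = None  # Protocol REF or a node outside this sketch
    return problems


header = ["Source Name", "Provider", "Characteristics[organism]", "Term Source REF",
          "Protocol REF", "Sample Name", "Provider", "Extract Name", "Label"]
print(misplaced_attributes(header))
```

In this header, Provider is accepted after Source Name but reported after Sample Name, and Label is reported because it follows Extract Name rather than Labeled Extract Name.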
Element column headers in the SDRF, except for Protocol REF, must occur in the following order and with the following cardinalities. The attributes of an element or of another attribute must follow the attributed element or attribute without any intervening element or attribute. When an element or attribute has more than one attribute, there is no ordering defined for that set, except:
- Factor Value: must occur after all element nodes and the attributes of those element nodes.
- Comment: must immediately follow either the element or attribute node for which it is a Comment, or another such Comment. This permits an unambiguous association of a Comment with the element or attribute on which it comments.
- Term Source REF: must immediately follow the ontology term for which it provides the source reference. This permits an unambiguous association of the Term Source REF to the ontology term.
|Element Nodes and Factor Values||Cardinality||Notes|
|Labeled Extract Name||0..1|
|Hybridization Name||0..1||Either Hybridization Name or Assay Name can be present, but not both.|
|Assay Name||0..1||Either Assay Name or Hybridization Name can be present, but not both.|
|Array Data File||0..*|
|Array Data Matrix File||0..*|
|Derived Array Data File||0..*|
|Derived Array Data Matrix File||0..*|
|Attributes - all are optional||Cardinality||Notes|
|Array Design File||0..1|
|Array Design REF||0..1|
|Technology Type||0..1||Is an attribute for an Assay Name, but can also be an attribute for a Hybridization Name.|
|Term Source REF||0..1|
|Term Accession Number||0..1|
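The element cardinalities listed above can be checked against an SDRF header. The sketch below covers only the 0..1 elements named in the table and the rule that Hybridization Name and Assay Name are mutually exclusive; the function name and sample header are hypothetical.

```python
from collections import Counter

AT_MOST_ONCE = ["Labeled Extract Name", "Hybridization Name", "Assay Name"]


def check_cardinality(header):
    """Return a list of cardinality problems found in an SDRF header."""
    counts = Counter(column.strip() for column in header)
    problems = [
        f"{name} appears {counts[name]} times (allowed: 0..1)"
        for name in AT_MOST_ONCE
        if counts[name] > 1
    ]
    if counts["Hybridization Name"] and counts["Assay Name"]:
        problems.append("Hybridization Name and Assay Name cannot both be present")
    return problems


header = ["Source Name", "Extract Name", "Hybridization Name", "Assay Name",
          "Array Data File", "Array Data File"]
print(check_cardinality(header))
```

Repeated Array Data File columns are not reported, since their cardinality is 0..*.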
- A sequencing protocol (Protocol Type="sequencing") must be provided. This protocol must have a Protocol Hardware value saying which sequencing instrument was used.
List of sequencing instruments:
454 GS, 454 GS 20, 454 GS FLX, 454 GS FLX Titanium, 454 GS Junior, Illumina Genome Analyzer, Illumina Genome Analyzer II, Illumina Genome Analyzer IIx, Illumina HiSeq 2000, Illumina HiSeq 1000, Illumina MiSeq, AB SOLiD System, AB SOLiD System 2.0, AB SOLiD System 3.0, AB SOLiD 4 System, AB SOLiD 4hq System, AB SOLiD PI System, AB SOLiD 5500, AB SOLiD 5500xl, Helicos HeliScope, Complete Genomics, PacBio RS, Ion Torrent PGM, unspecified.
- Include Assay Name and Technology Type columns.
- In the Protocol REF column before the Assay Name column, reference the sequencing protocol in the IDF.
- The sequencing protocol should have a Performer - this is used as the run center name.
- Raw files must go in the Array Data File column. In the following Comment[FILE_TYPE] column, select a file format from sff, Illumina_native_qseq, Illumina_native_fastq, SOLiD_native_csfasta, SOLiD_native_qual, Helicos_native. For the necessary raw data files, please see this page.
- These 4 extra Comment columns should be added after Extract Name to provide information about how the library was prepared in SDRF.
- Comment[LIBRARY_SOURCE] - one of GENOMIC, TRANSCRIPTOMIC, METAGENOMIC, METATRANSCRIPTOMIC, NON GENOMIC, SYNTHETIC, VIRAL RNA, OTHER.
- Comment[LIBRARY_STRATEGY] - one of WGS, WXS, RNA-Seq, WCS, CLONE, POOLCLONE, AMPLICON, CLONEEND, FINISHING, ChIP-Seq, MNase-Seq, DNase-Hypersensitivity, Bisulfite-Seq, EST, FL-cDNA, CTS, MRE-Seq, MeDIP-Seq, MBD-Seq, OTHER.
- Comment[LIBRARY_SELECTION] - one of RANDOM, PCR, RANDOM PCR, RT-PCR, HMPR, MF, CF-S, CF-M, CF-H, CF-T, MSLL, cDNA, ChIP, MNase, DNAse, Hybrid Selection, Reduced Representation, Restriction Digest, 5-methylcytidine antibody, MBD2 protein methyl-CpG binding domain, CAGE, RACE, size fractionation, other, unspecified.
- Comment[LIBRARY_LAYOUT] - either SINGLE or PAIRED. When PAIRED, create the following columns and enter values.
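The library Comment columns take values from fixed lists, so they can be validated before submission. The sketch below copies the permitted values from the list above and assumes an exact, case-sensitive match, which may be stricter than the actual submission system. The function name and sample rows are hypothetical.

```python
LIBRARY_VALUES = {
    "Comment[LIBRARY_LAYOUT]": {"SINGLE", "PAIRED"},
    "Comment[LIBRARY_SOURCE]": {
        "GENOMIC", "TRANSCRIPTOMIC", "METAGENOMIC", "METATRANSCRIPTOMIC",
        "NON GENOMIC", "SYNTHETIC", "VIRAL RNA", "OTHER",
    },
    "Comment[LIBRARY_STRATEGY]": {
        "WGS", "WXS", "RNA-Seq", "WCS", "CLONE", "POOLCLONE", "AMPLICON", "CLONEEND",
        "FINISHING", "ChIP-Seq", "MNase-Seq", "DNase-Hypersensitivity", "Bisulfite-Seq",
        "EST", "FL-cDNA", "CTS", "MRE-Seq", "MeDIP-Seq", "MBD-Seq", "OTHER",
    },
    "Comment[LIBRARY_SELECTION]": {
        "RANDOM", "PCR", "RANDOM PCR", "RT-PCR", "HMPR", "MF", "CF-S", "CF-M", "CF-H",
        "CF-T", "MSLL", "cDNA", "ChIP", "MNase", "DNAse", "Hybrid Selection",
        "Reduced Representation", "Restriction Digest", "5-methylcytidine antibody",
        "MBD2 protein methyl-CpG binding domain", "CAGE", "RACE", "size fractionation",
        "other", "unspecified",
    },
}


def check_library_comments(header, rows):
    """Return problems with the four library Comment columns of an SDRF."""
    columns = [column.strip() for column in header]
    problems = [f"missing column {name}" for name in LIBRARY_VALUES if name not in columns]
    for row_number, row in enumerate(rows, start=1):
        for column, value in zip(columns, row):
            allowed = LIBRARY_VALUES.get(column)
            if allowed is not None and value.strip() not in allowed:
                problems.append(f"row {row_number}: {column} has unexpected value {value!r}")
    return problems


header = ["Extract Name", "Comment[LIBRARY_LAYOUT]", "Comment[LIBRARY_SOURCE]",
          "Comment[LIBRARY_STRATEGY]", "Comment[LIBRARY_SELECTION]"]
rows = [
    ["extract 1", "SINGLE", "TRANSCRIPTOMIC", "RNA-Seq", "cDNA"],
    ["extract 2", "paired", "TRANSCRIPTOMIC", "RNA-Seq", "cDNA"],  # wrong case on purpose
]
print(check_library_comments(header, rows))
```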
Include the following attributes as Comment columns after Assay Name in the SDRF:
- For LS454:
- For Illumina:
- Comment[SEQUENCE_LENGTH] (integer - The fixed number of bases expected in each raw sequence, including both mate pairs and any technical reads.).
- For ABI SOLID:
- Comment[SEQUENCE_LENGTH] (integer - The fixed number of bases expected in each raw sequence, including both mate pairs and any technical reads.).
- For Helicos: